TUNDRA // NEXUS
LOC: SRV1304246 | Mission Control
Google's New Gemini Pro Model Has Record Benchmark Scores — Again
🟡 SKIM | ⏱ 2 min | 📡 7/10 | 🎯 AI researchers, developers choosing models, anyone tracking the frontier
TL;DR
Google's Gemini 3.1 Pro, now in preview, tops both Humanity's Last Exam and the APEX-Agents leaderboard, a benchmark that measures performance on real professional tasks rather than purely academic ones. Its 77.1% on ARC-AGI-2 is notable because it more than doubles Gemini 3's score. The article itself is thin on analysis; it's essentially a press-release summary. The real story is the release cadence: Google, OpenAI, and Anthropic are all dropping major models simultaneously.
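For scale, here's a minimal sanity check on what "more than doubles" implies. The article gives only the 77.1% figure; the predecessor's score isn't stated, so only the upper bound below actually follows from the piece.

```python
# Sanity check on the "more than doubles" claim.
# Only hard figure from the article: Gemini 3.1 Pro's 77.1% on ARC-AGI-2.
# Gemini 3's score is NOT given; this derives the ceiling the claim implies.
new_score = 77.1  # Gemini 3.1 Pro, ARC-AGI-2 (per the article)

implied_prior_max = new_score / 2  # "more than doubles" => prior < this
print(f"Implied Gemini 3 ARC-AGI-2 score: below {implied_prior_max:.2f}%")
# -> Implied Gemini 3 ARC-AGI-2 score: below 38.55%
```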
Signal
- APEX-Agents leaderboard is a more meaningful benchmark than most — it's designed by Mercor to test real knowledge-work tasks, not just academic pattern matching
- 77.1% on ARC-AGI-2 is significant because ARC-AGI-2 was specifically designed to be hard for LLMs; more than doubling predecessor performance on a capability-ceiling test matters
- Three major labs shipping frontier models simultaneously in Feb 2026 = the competitive pressure is structural, not episodic
What They're NOT Telling You
TechCrunch's coverage is thin: it summarizes the announcement without adding analysis. The headline's "Record benchmark scores — again" signals benchmark fatigue; Google and others routinely cherry-pick the benchmarks where they lead. APEX-Agents is more credible than most, but it's still run by a startup with its own incentives, and no independent replication is cited.
Trust Check
Factuality ✅ | Author Authority ⚠️ (surface-level coverage) | Actionability ⚠️ (preview, not GA yet)
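If you want to kick the tires despite the preview caveat, a minimal sketch using the google-genai Python SDK follows. Assumptions flagged: the article names no model ID, so `gemini-3.1-pro-preview` is a guess, and the prompt is arbitrary; list the exposed models first and swap in a real ID.

```python
# Minimal sketch for trying a preview Gemini model via the google-genai SDK.
# ASSUMPTION: the model ID below is hypothetical; the article doesn't name one.
# Reads the API key from the GOOGLE_API_KEY environment variable.
from google import genai

client = genai.Client()

# Preview model IDs churn; confirm what's actually exposed before calling one.
for model in client.models.list():
    if "preview" in model.name:
        print(model.name)

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # hypothetical ID; use one from the list above
    contents="Summarize the ARC-AGI-2 benchmark in two sentences.",
)
print(response.text)
```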