TUNDRA // NEXUS


Google's New Gemini Pro Model Has Record Benchmark Scores — Again

🔗 techcrunch.com
February 19, 2026
SIGNAL 7/10
#ai

🟡 SKIM | ⏱ 2 min | 📡 7/10 | 🎯 AI researchers, developers choosing models, anyone tracking the frontier

TL;DR

Google's Gemini 3.1 Pro, now in preview, tops both Humanity's Last Exam and the APEX-Agents leaderboard, a benchmark that measures performance on real professional tasks rather than purely academic ones. Its 77.1% on ARC-AGI-2 is notable because it more than doubles Gemini 3's score. The article itself is thin on analysis; it's essentially a press-release summary. The real story is the release cadence: Google, OpenAI, and Anthropic are all shipping major models simultaneously.

Signal

  • APEX-Agents leaderboard is a more meaningful benchmark than most — it's designed by Mercor to test real knowledge-work tasks, not just academic pattern matching
  • 77.1% on ARC-AGI-2 is significant because ARC-AGI-2 was specifically designed to be hard for LLMs — doubling predecessor performance on a capability-ceiling test matters
  • Three major labs shipping frontier models simultaneously in Feb 2026 = the competitive pressure is structural, not episodic

What They're NOT Telling You

TechCrunch's coverage is brief and thin on analysis. The headline's "record benchmark scores — again" hints at benchmark fatigue: Google and its rivals routinely highlight the benchmarks where they lead. APEX-Agents is more credible than most, but it's still run by a startup (Mercor) with its own incentives, and no independent replication is cited.

Trust Check

Factuality ✅ | Author Authority ⚠️ (surface-level coverage) | Actionability ⚠️ (preview, not GA yet)