TUNDRA // NEXUS
LOC: SRV1304246 | Mission Control
Exponential Progress: Claude Opus 4.6 Has 50% Time Horizon Of 14.5 Hours On METR Time Horizons Benchmark
🟢 READ | ⏱ 6 min | 📡 9/10 | 🎯 AI researchers, engineering leaders, anyone tracking capability trajectories
TL;DR
METR's time-horizon benchmark measures how long a task can be (in skilled-human-professional time, working from scratch) before an AI's success rate drops to 50%. Claude Opus 4.6's 50% time horizon is 14.5 hours: it succeeds on half of tasks that would take a human expert about 14.5 hours. The confidence interval is wide (6–98 hours) because the benchmark is nearly saturated, so the model may be underselling its actual capability. More important than the number: the exponential doubling rate (one doubling every 123 days) has held since 2023 with R² = 0.93.
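As a rough illustration of what a "50% time horizon" operationally means (a minimal sketch, not METR's actual methodology or data): fit a logistic curve of success probability against log task length, then solve for the length where the curve crosses 50%. All task lengths and pass/fail outcomes below are hypothetical.

```python
import numpy as np

def fit_horizon(lengths_min, successes):
    """Fit p(success) = sigmoid(a + b*log2(length)) by gradient descent
    on log-loss, then return the length (minutes) where p = 0.5."""
    x = np.log2(np.asarray(lengths_min, dtype=float))
    y = np.asarray(successes, dtype=float)
    a, b = 0.0, 0.0
    for _ in range(20000):
        p = 1.0 / (1.0 + np.exp(-(a + b * x)))
        a -= 0.05 * np.mean(p - y)        # d(logloss)/da
        b -= 0.05 * np.mean((p - y) * x)  # d(logloss)/db
    # p = 0.5 where a + b*log2(length) = 0  ->  length = 2**(-a/b)
    return 2.0 ** (-a / b)

# Hypothetical run results: mostly passes on short tasks, fails on long ones.
lengths = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960, 1920]
success = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0]
print(f"estimated 50% horizon: {fit_horizon(lengths, success):.0f} min")
```

The key point: the horizon is a property of the fitted curve, not of any single task, which is why a near-saturated task suite (too few long tasks the model fails) produces the wide confidence interval noted above.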
Signal
- In mid-2024, frontier models had time horizons in single-digit minutes. Early 2025: 15–30 minutes. Late 2025: several hours. Early 2026: 14.5 hours. That's roughly two orders of magnitude in ~18 months
- Benchmark saturation is itself a signal: METR is actively developing new evaluation methods because the current task suite can no longer distinguish between frontier models
- Tasks in the benchmark include implementing complex network protocols from scratch, iterative ML debugging, and cybersecurity tasks — not trivia, genuine work-product generation
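The progression in the bullets above can be sketched as a simple exponential extrapolation. The 123-day doubling period and 14.5-hour anchor come from the article; the anchor date is an assumption (the article only says "early 2026"), and this is an illustration of the reported trend, not a METR forecast.

```python
from datetime import date, timedelta

DOUBLING_DAYS = 123        # reported doubling period of the 50% time horizon
H0_HOURS = 14.5            # reported horizon for Claude Opus 4.6
T0 = date(2026, 2, 1)      # assumed anchor date for that measurement

def horizon_hours(on: date) -> float:
    """Horizon implied by the exponential fit on a given date."""
    return H0_HOURS * 2 ** ((on - T0).days / DOUBLING_DAYS)

for months in (0, 6, 12, 24):
    d = T0 + timedelta(days=30 * months)
    print(f"{d}: ~{horizon_hours(d):.0f} h")
```

Whether the extrapolation holds is exactly what the saturation problem puts in doubt: once the task suite tops out, the fitted horizon stops tracking true capability.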
What They're NOT Telling You
The 14.5-hour figure applies to well-specified, self-contained tasks with none of the organizational context, relationships, or ambiguous goals that define most real jobs. METR explicitly caveats this. The article (and the benchmark) also measure rate of capability growth, not economic impact, which involves many more variables. The "singularity" framing in the intro is Musk-quoting clickbait; the METR data itself is measured and credible.
Trust Check
Factuality ✅ (METR data directly cited, R² and CI disclosed) | Author Authority ⚠️ (OfficeChai, secondary source) | Actionability ✅ (recalibrate automation assumptions now)