TUNDRA // NEXUS
LOC: SRV1304246 | Mission Control
Exponential Progress: Claude Opus 4.6 Has 50% Time Horizon Of 14.5 Hours On METR Time Horizons Benchmark
🟢 READ | ⏱ 6 min | 📡 9/10 | 🎯 AI researchers, engineering leaders, anyone tracking capability trajectories
TL;DR
METR's time-horizon benchmark measures how long a task can be (in skilled-human-professional time, working from scratch) before an AI's success rate drops to 50%. Claude Opus 4.6's 50% time horizon is 14.5 hours: it succeeds on half of tasks that would take a human expert about 14.5 hours. The confidence interval is wide (6–98 hours) because the benchmark is nearly saturated, so the model may be underselling its actual capability. More important than the number: the exponential doubling rate (one doubling every 123 days) has held since 2023 with R² = 0.93.
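As a rough illustration of what a "50% time horizon" operationally means (a minimal sketch, not METR's actual methodology or data): fit a logistic curve of success probability against log task length, then solve for the length where the curve crosses 50%. All task lengths and pass/fail outcomes below are hypothetical.

```python
import numpy as np

def fit_horizon(lengths_min, successes):
    """Fit p(success) = sigmoid(a + b*log2(length)) by gradient descent
    on log-loss, then return the length (minutes) where p = 0.5."""
    x = np.log2(np.asarray(lengths_min, dtype=float))
    y = np.asarray(successes, dtype=float)
    a, b = 0.0, 0.0
    for _ in range(20000):
        p = 1.0 / (1.0 + np.exp(-(a + b * x)))
        a -= 0.05 * np.mean(p - y)        # d(logloss)/da
        b -= 0.05 * np.mean((p - y) * x)  # d(logloss)/db
    # p = 0.5 where a + b*log2(length) = 0  ->  length = 2**(-a/b)
    return 2.0 ** (-a / b)

# Hypothetical run results: mostly passes on short tasks, fails on long ones.
lengths = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960, 1920]
success = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0]
print(f"estimated 50% horizon: {fit_horizon(lengths, success):.0f} min")
```

The key point: the horizon is a property of the fitted curve, not of any single task, which is why a near-saturated task suite (too few long tasks the model fails) produces the wide confidence interval noted above.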
Signal
- In mid-2024, frontier models had time horizons in single-digit minutes. Early 2025: 15–30 minutes. Late 2025: several hours. Early 2026: 14.5 hours. That's roughly two orders of magnitude in ~18 months
- Benchmark saturation is itself a signal: METR is actively developing new evaluation methods because the current task suite can no longer distinguish between frontier models
- Tasks in the benchmark include implementing complex network protocols from scratch, iterative ML debugging, and cybersecurity tasks — not trivia, genuine work-product generation
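The progression in the bullets above can be sketched as a simple exponential extrapolation. The 123-day doubling period and 14.5-hour anchor come from the article; the anchor date is an assumption (the article only says "early 2026"), and this is an illustration of the reported trend, not a METR forecast.

```python
from datetime import date, timedelta

DOUBLING_DAYS = 123        # reported doubling period of the 50% time horizon
H0_HOURS = 14.5            # reported horizon for Claude Opus 4.6
T0 = date(2026, 2, 1)      # assumed anchor date for that measurement

def horizon_hours(on: date) -> float:
    """Horizon implied by the exponential fit on a given date."""
    return H0_HOURS * 2 ** ((on - T0).days / DOUBLING_DAYS)

for months in (0, 6, 12, 24):
    d = T0 + timedelta(days=30 * months)
    print(f"{d}: ~{horizon_hours(d):.0f} h")
```

Whether the extrapolation holds is exactly what the saturation problem puts in doubt: once the task suite tops out, the fitted horizon stops tracking true capability.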
What They're NOT Telling You
The 14.5-hour figure applies to well-specified, self-contained tasks with none of the organizational context, relationships, or ambiguous goals that define most real jobs. METR explicitly caveats this. The article (and the benchmark) also measure rate of capability growth, not economic impact, which involves many more variables. The "singularity" framing in the intro is Musk-quoting clickbait; the METR data itself is measured and credible.
Trust Check
Factuality ✅ (METR data directly cited, R² and CI disclosed) | Author Authority ⚠️ (OfficeChai, secondary source) | Actionability ✅ (recalibrate automation assumptions now)