AIME 2025 Benchmark — How Today's Top AI Models Stack Up

May 12, 2025 · 4 min read · The Regularizer Team
AI · Benchmarks · AIME · Mathematics · LLMs

(Scroll past the bar chart above for the full story. It visualises the headline numbers you'll see in the table below.)


Quick-Glance Leaderboard

| Rank | Model | Pass@1 Accuracy (AIME 2025) | Notes |
|------|-------|-----------------------------|-------|
| 1 | Grok 3 (Think) | 93.3 % | Company-reported result with heavy "think time" (cons@64); independent replications still pending. (xAI) |
| 2 | o3 Mini | 86.5 % | Best independently benchmarked score so far. (Vals AI) |
| 3 | DeepSeek R1 | 74.0 % | Open-source reasoning giant (671 B params). (Vals AI) |
| 4 | o1 | 71.5 % | 2024's breakout model still strong. (Vals AI) |
| 5 | Claude 3.7 Sonnet (Thinking) | 52.7 % | Anthropic's "chain-of-thought" variant. (Vals AI) |
| 6 | Gemini 2.0 Flash | 29.8 % | Google's speed-optimised model. (Vals AI) |

Tool-assisted champion: With Python-tool access, o4-mini hits 99.5 % on AIME 2025, but that's a different category (open-book vs closed-book). (OpenAI)


Why the AIME Matters for AI

  • Fresh & Difficult – The 2025 questions were released only in February 2025, so they're effectively unseen by models trained earlier.
  • Binary Scoring – Each answer is a single integer (0–999); no partial credit, no wiggle room (a scoring sketch follows below).
  • Human Reference Point – The historical human median is just 4–6 correct out of 15. (Vals AI)

That makes AIME a stress-test of genuine multi-step reasoning, not just pattern matching.
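
For readers who want that scoring rule made concrete, here is a minimal Python sketch of pass@1 grading under the assumptions above (exact integer match in 0–999, no partial credit). The answer key and predictions are invented placeholders, not actual AIME data.

```python
# Minimal pass@1 grading sketch for an AIME-style exam: every answer is an
# integer in 0-999, and a response either matches the key exactly or scores 0.
# The answer key and predictions below are placeholders, not real AIME items.

def score_pass_at_1(predictions: dict[int, int], answer_key: dict[int, int]) -> float:
    """Fraction of problems whose single submitted answer matches the key."""
    correct = sum(
        1
        for pid, ans in answer_key.items()
        if 0 <= ans <= 999 and predictions.get(pid) == ans
    )
    return correct / len(answer_key)

# Hypothetical 15-question run with two misses: 13/15 ≈ 86.7 %.
answer_key = {i: (i * 37) % 1000 for i in range(1, 16)}   # placeholder answer key
predictions = dict(answer_key)
predictions[7], predictions[11] = 0, 1                    # two wrong answers
print(f"pass@1 = {score_pass_at_1(predictions, answer_key):.1%}")
```

The Pass@1 column in the leaderboard is this exact-match rate, taken over each model's first answer per problem.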


Reading the Chart

The bar chart you saw at the top tells the same story in a single glance:

  • A clear gap between the top three "reasoning" models and the rest.
  • Grok 3's claimed score breaks the 90 % ceiling, but independent verification is still underway.
  • Performance drops off steeply after the o-series and DeepSeek, underscoring how tough the exam remains.

What the Numbers Tell Us

  1. Rapid Gains, But Fresh Problems Still Sting. Models that breeze through the 2024 AIME stumble on the newer questions: benchmarks note a 15–20 percentage-point drop from 2024 to 2025, hinting at training-data overlap with the older set. (Vals AI)

  2. Open Source Is in the Race. DeepSeek R1's 74 % shows community models can nip at proprietary heels, which is good news for transparency and reproducibility.

  3. Tool Use Changes the Game. Give o4-mini a Python sandbox and it all but solves the test. That blurs the line between "raw reasoning" and "calculator-augmented" ability; expect future benchmarks to separate those modes more clearly (see the sketch after this list). (OpenAI)

  4. Still Room Above the Bar. Even the best closed-book score (86.5 %) leaves 2–3 problems unsolved. Different models miss different questions, suggesting no single system has a universal math strategy yet.
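
To make point 3 above concrete, here is a hedged sketch of what a Python sandbox buys a model: an AIME-format counting question (invented for this post, not an actual 2025 problem) falls to a one-line exhaustive search, with the inclusion-exclusion argument a human contestant would write used only as a cross-check.

```python
# Invented AIME-style question (integer answer in 0-999), not a real 2025 item:
# "How many positive integers less than 1000 are divisible by neither 7 nor 11?"

# With a sandbox, a model can skip the casework and simply count.
brute_force = sum(1 for n in range(1, 1000) if n % 7 != 0 and n % 11 != 0)

# Closed-book, it must produce the inclusion-exclusion argument instead.
by_hand = 999 - 999 // 7 - 999 // 11 + 999 // (7 * 11)

assert brute_force == by_hand   # the search and the counting argument agree
print(brute_force)
```

Once that kind of check is essentially free, tool-assisted results really do belong in a separate column from closed-book reasoning scores.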


Looking Ahead

  • Harder Benchmarks Incoming – Researchers are already eyeing Olympiad proof problems and cross-domain "AI decathlons" to keep stretching the frontier.
  • Safety & Alignment – Stronger reasoning magnifies the need for verifiable chains-of-thought and robust guard-rails.
  • Democratised Capability – With open models closing in on top scores, advanced math reasoning could soon be a commodity capability.

Bottom Line

In early 2023, frontier models could barely scratch AIME. By early 2025, they're routinely outperforming top human contestants—and with tool use, they're nearly perfect. The AIME benchmark shows just how fast AI reasoning is accelerating, even as it reminds us there's still a little head-room left before "100 % solved" becomes the new normal.

Stay tuned: at this pace, the 2026 leaderboard may need new axes to fit the columns.