Humanity's Last Exam (HLE) was designed to be the benchmark that AI couldn't solve. Unlike MMLU or GSM8K, which models now routinely saturate with 90%+ scores, HLE was built by subject matter experts to stump even the best PhDs. Yet, as of May 2026, we are seeing a dramatic shift in the leaderboard that challenges our assumptions about AI reasoning. Check out the latest Humanity’s Last Exam 2026 results below.
I have been tracking the Humanity’s Last Exam 2026 leaderboard since its release, and the progress is staggering. When I last analyzed the limits of HLE, models were still struggling to gain a foothold. Today, we are nearing 50%. But there's a catch: the benchmark itself is under fire for accuracy.
The State of Humanity’s Last Exam 2026
The current leaderboard for Humanity’s Last Exam 2026 has a new king. Gemini 3.1 Pro Preview currently holds the top spot with a score of 44.7%. It is followed closely by GPT-5.5 (xhigh) at 44.3% and Claude Mythos Preview at 43.8%.
This cluster of scores around the 44% mark suggests we've hit a new plateau. These models aren't just guessing. They are solving graduate-level mathematics and physics problems. However, the race for the highest score in Humanity’s Last Exam 2026 is increasingly being overshadowed by a fundamental question: who is grading the graders?
Need help with AI strategy?
If you're trying to navigate which frontier model is right for your business use case, let's talk.
Why Humanity’s Last Exam 2026 Is Different From Previous Benchmarks
Standard benchmarks like MMLU have leaked into the training sets of modern LLMs. Humanity’s Last Exam 2026 tries to solve this with three distinct features:
- PhD-Level Expertise: 41% of the exam focuses on advanced mathematics. Physics, biology, and chemistry make up the bulk of the rest.
- Crowdsourced Secrecy: The questions were crowdsourced from experts worldwide and pre-screened against frontier models to ensure they weren't already answerable from training data.
- Low Baseline: Human experts average around 60-70%, while early AI models scored near random-guessing levels (2-5%).
The fact that we've jumped from 5% to 44.7% in a year is evidence of the incredible pace of the "Agentic Era." Models like DeepSeek V4 Pro now use internal reasoning loops, which is exactly what Humanity’s Last Exam 2026 rewards.
The 30% Error Controversy in Humanity’s Last Exam 2026
The biggest story in May 2026 isn't the scores themselves. It is a report from FutureHouse. Their researchers investigated the chemistry and biology sections of HLE. They found that roughly 30% of the "correct" answers were actually wrong.
This is a massive problem. If the gold standard for "AGI-level reasoning" contains errors, the leaderboard becomes a measure of which model is best at matching the benchmark's mistakes. For a Chief Technology Officer, this is a critical reminder. Benchmarks like Humanity’s Last Exam 2026 are a signal, not a guarantee.
Frontier Comparison: Gemini 3.1 vs. GPT-5.5 vs. Claude Mythos
| Model | HLE Score (May 2026) | Primary Strength |
|---|---|---|
| Gemini 3.1 Pro Preview | 44.7% | Multimodal reasoning & Math |
| GPT-5.5 (xhigh) | 44.3% | Instruction following & Logic |
| Claude Mythos Preview | 43.8% | Nuance & Creative Problem Solving |
While Gemini leads on the raw numbers for Humanity’s Last Exam 2026, Claude Mythos is reportedly more "honest." It is more likely to admit when a question is poorly phrased. GPT-5.5 has a higher tendency to "force" a reasoning path to reach the expected answer.
See my AI projects
I test these frontier models daily in real-world OpenClaw workflows. Check out what I'm building.
Humanity’s Last Exam 2026: May Leaderboard Analysis
The May data for Humanity’s Last Exam 2026 confirms that the 50% barrier is within reach. But what do these numbers tell us about true intelligence? If a model excels at HLE but fails at simple logic tasks, we are witnessing a form of benchmark specialization. In my daily tests, I've found that Humanity’s Last Exam 2026 is a great stress test for multi-step reasoning.
What This Means for Businesses
For most SMEs, an AI model's score on a graduate-level physics exam is irrelevant. However, Humanity’s Last Exam 2026 serves as a "stress test" for the reasoning capabilities that do matter:
- Complex Troubleshooting: If a model can solve a PhD math problem, it can probably debug a complex logistics chain.
- Policy Analysis: The ability to reason over 256k tokens of context is a direct extension of the skills Humanity’s Last Exam 2026 rewards.
- Reliability: The 30% error controversy teaches us to build "Verification Loops." Never trust a single model's output for critical business decisions; always cross-check results with a multi-agent system like OpenClaw, as in the sketch after this list.
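To make the Verification Loop idea concrete, here is a minimal sketch. It is not an OpenClaw API; the `cross_check` function, the stub model callables, and the `quorum` threshold are assumptions for illustration. In practice, each callable would wrap a real provider SDK or agent.

```python
from collections import Counter
from typing import Callable

# Hypothetical model callables: each takes a question and returns an answer.
# In practice these would wrap your provider SDKs or agents.
ModelFn = Callable[[str], str]

def cross_check(question: str, models: dict[str, ModelFn], quorum: int = 2) -> dict:
    """Ask several models the same question; accept an answer only if
    at least `quorum` of them agree, otherwise escalate to a human."""
    answers = {name: fn(question).strip().lower() for name, fn in models.items()}
    tally = Counter(answers.values())
    best, votes = tally.most_common(1)[0]
    if votes >= quorum:
        return {"status": "accepted", "answer": best, "votes": votes, "answers": answers}
    # No consensus: flag for review instead of trusting any single model.
    return {"status": "needs_review", "answers": answers}

# Stub models standing in for real API calls.
models = {
    "model_a": lambda q: "42",
    "model_b": lambda q: "42",
    "model_c": lambda q: "41",
}
print(cross_check("What is 6 * 7?", models))
```

The design point is that disagreement is treated as a signal to escalate, not as noise to average away; for critical decisions, the "needs_review" path should route to a person.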
FAQ: Humanity’s Last Exam 2026
AI benchmarks are evolving fast. Here are answers to the most common questions about HLE.
What is Humanity’s Last Exam 2026?
It is a graduate-level AI benchmark designed to test advanced reasoning in fields like mathematics, physics, and biology.
Who is leading Humanity’s Last Exam 2026 in May?
Gemini 3.1 Pro Preview is currently in first place with a score of 44.7%. OpenAI's GPT-5.5 is a close second.
Why is there an error controversy in Humanity’s Last Exam 2026?
Research from FutureHouse suggests that up to 30% of the answers in the chemistry and biology sections of HLE may be scientifically incorrect.
How does Humanity’s Last Exam 2026 compare to MMLU?
MMLU has become saturated and leaked into training data. Humanity’s Last Exam 2026 uses secret, crowdsourced questions to provide a more accurate measure of frontier intelligence.
Is Humanity’s Last Exam 2026 a good measure of AGI?
It is one of the best proxies we have for PhD-level reasoning. But the error controversy shows that no single benchmark is perfect.
Can I run Humanity’s Last Exam 2026 locally?
The dataset is publicly available for research, so you can test your own models against the HLE questions and see how they rank. A minimal sketch follows below.
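Here is a minimal sketch of a local evaluation, assuming the HLE dataset is hosted on Hugging Face (the `cais/hle` ID, the split name, and the field names are assumptions; check the dataset card, which may also require accepting access terms) and that you wrap your own model in an `answer_question` function. Official HLE scoring uses an LLM judge for free-form answers, so the exact-match accuracy below is only a crude proxy.

```python
from datasets import load_dataset

def answer_question(question: str) -> str:
    # Replace with a call to the model you want to evaluate.
    return "placeholder answer"

# Dataset ID, split, and column names ("question", "answer") are
# assumptions; verify them against the dataset card before running.
dataset = load_dataset("cais/hle", split="test")

correct = 0
for row in dataset:
    prediction = answer_question(row["question"])
    # Crude exact-match scoring; real HLE grading uses an LLM judge.
    if prediction.strip().lower() == str(row["answer"]).strip().lower():
        correct += 1

print(f"Accuracy: {correct / len(dataset):.1%}")
```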
As we move toward the second half of 2026, expect the Humanity’s Last Exam 2026 benchmark to be updated. The race for AGI reasoning is far from over. The gap between human and machine is closing faster than anyone predicted.
Written by Matteo Giardino, CTO and founder. I build AI-powered solutions and agents for the Italian and global markets. My projects.
