Humanity's Last Exam (HLE) was designed to be the benchmark that AI couldn't solve. Unlike MMLU or GSM8K, which models now routinely saturate with 90%+ scores, HLE was built by subject matter experts to stump even the best PhDs. Yet, as of May 2026, we are seeing a dramatic shift in the leaderboard that challenges our assumptions about AI reasoning. Check out the latest Humanity’s Last Exam 2026 results below.
I have been tracking the Humanity’s Last Exam 2026 leaderboard since its release, and the progress is staggering. When I last analyzed the limits of HLE, models were still struggling to gain a foothold. Today, we are nearing 50%. But there's a catch: the benchmark itself is under fire for accuracy.
The State of Humanity’s Last Exam 2026
The current leaderboard for Humanity’s Last Exam 2026 has a new king. Gemini 3.1 Pro Preview currently holds the top spot with a score of 44.7%. It is followed closely by GPT-5.5 (xhigh) at 44.3% and Claude Mythos Preview at 43.8%.
This cluster of scores around the 44% mark suggests we've hit a new plateau. These models aren't just guessing. They are solving graduate-level mathematics and physics problems. However, the race for the highest score in Humanity’s Last Exam 2026 is increasingly being overshadowed by a fundamental question: who is grading the graders?
Need help with AI strategy?
If you're trying to navigate which frontier model is right for your business use case, let's talk.
Why Humanity’s Last Exam 2026 Is Different From Previous Benchmarks
Standard benchmarks like MMLU have leaked into the training sets of modern LLMs. Humanity’s Last Exam 2026 tries to solve this with three distinct features:
- PhD-Level Expertise: 41% of the exam focuses on advanced mathematics. Physics, biology, and chemistry make up the bulk of the rest.
- Crowdsourced Secrecy: The questions were crowdsourced from experts worldwide and pre-screened against frontier models to ensure they weren't already answerable from training data.
- Low Baseline: Human experts average around 60-70%, while early AI models scored near random-guessing levels (2-5%).
The fact that we've jumped from 5% to 44.7% in a year is evidence of the incredible pace of the "Agentic Era." Models like DeepSeek V4 Pro now use internal reasoning loops, which is exactly what Humanity’s Last Exam 2026 rewards.
The 30% Error Controversy in Humanity’s Last Exam 2026
The biggest story in May 2026 isn't the scores themselves. It is a report from FutureHouse. Their researchers investigated the chemistry and biology sections of HLE. They found that roughly 30% of the "correct" answers were actually wrong.
This is a massive problem. If the gold standard for "AGI-level reasoning" contains errors, the leaderboard becomes a measure of which model is best at matching the benchmark's mistakes. For a Chief Technology Officer, this is a critical reminder. Benchmarks like Humanity’s Last Exam 2026 are a signal, not a guarantee.
Frontier Comparison: Gemini 3.1 vs. GPT-5.5 vs. Claude Mythos
| Model | HLE Score (May 2026) | Primary Strength |
|---|---|---|
| Gemini 3.1 Pro Preview | 44.7% | Multimodal reasoning & Math |
| GPT-5.5 (xhigh) | 44.3% | Instruction following & Logic |
| Claude Mythos Preview | 43.8% | Nuance & Creative Problem Solving |
While Gemini leads on the raw numbers for Humanity’s Last Exam 2026, Claude Mythos is reportedly more "honest." It is more likely to admit when a question is poorly phrased. GPT-5.5 has a higher tendency to "force" a reasoning path to reach the expected answer.
See my AI projects
I test these frontier models daily in real-world OpenClaw workflows. Check out what I'm building.
Humanity’s Last Exam 2026: May Leaderboard Analysis
The May data for Humanity’s Last Exam 2026 confirms that the 50% barrier is within reach. But what do these numbers tell us about true intelligence? If a model excels at HLE but fails at simple logic tasks, we are witnessing a form of benchmark specialization. In my daily tests, I've found that Humanity’s Last Exam 2026 is a great stress test for multi-step reasoning.
What This Means for Businesses
For most SMEs, an AI model's score on a graduate-level physics exam is irrelevant. However, Humanity’s Last Exam 2026 serves as a "stress test" for the reasoning capabilities that do matter:
- Complex Troubleshooting: If a model can solve a PhD math problem, it can probably debug a complex logistics chain.
- Policy Analysis: The ability to reason over 256k tokens of context is a direct extension of the skills Humanity’s Last Exam 2026 rewards.
- Reliability: The 30% error controversy teaches us to build "Verification Loops." Never trust a single model's output for critical business decisions; always cross-check results with a multi-agent system like OpenClaw, as in the sketch after this list.
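To make the Verification Loop idea concrete, here is a minimal sketch. It is not an OpenClaw API; the `cross_check` function, the stub model callables, and the `quorum` threshold are assumptions for illustration. In practice, each callable would wrap a real provider SDK or agent.

```python
from collections import Counter
from typing import Callable

# Hypothetical model callables: each takes a question and returns an answer.
# In practice these would wrap your provider SDKs or agents.
ModelFn = Callable[[str], str]

def cross_check(question: str, models: dict[str, ModelFn], quorum: int = 2) -> dict:
    """Ask several models the same question; accept an answer only if
    at least `quorum` of them agree, otherwise escalate to a human."""
    answers = {name: fn(question).strip().lower() for name, fn in models.items()}
    tally = Counter(answers.values())
    best, votes = tally.most_common(1)[0]
    if votes >= quorum:
        return {"status": "accepted", "answer": best, "votes": votes, "answers": answers}
    # No consensus: flag for review instead of trusting any single model.
    return {"status": "needs_review", "answers": answers}

# Stub models standing in for real API calls.
models = {
    "model_a": lambda q: "42",
    "model_b": lambda q: "42",
    "model_c": lambda q: "41",
}
print(cross_check("What is 6 * 7?", models))
```

The design point is that disagreement is treated as a signal to escalate, not as noise to average away; for critical decisions, the "needs_review" path should route to a person.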
FAQ: Humanity’s Last Exam 2026
AI benchmarks are evolving fast. Here are answers to the most common questions about HLE.
What is Humanity’s Last Exam 2026?
It is a graduate-level AI benchmark designed to test advanced reasoning in fields like mathematics, physics, and biology.
Who is leading Humanity’s Last Exam 2026 in May?
Gemini 3.1 Pro Preview is currently in first place with a score of 44.7%. OpenAI's GPT-5.5 is a close second.
Why is there an error controversy in Humanity’s Last Exam 2026?
Research from FutureHouse suggests that up to 30% of the answers in the chemistry and biology sections of HLE may be scientifically incorrect.
How does Humanity’s Last Exam 2026 compare to MMLU?
MMLU has become saturated and leaked into training data. Humanity’s Last Exam 2026 uses secret, crowdsourced questions to provide a more accurate measure of frontier intelligence.
Is Humanity’s Last Exam 2026 a good measure of AGI?
It is one of the best proxies we have for PhD-level reasoning. But the error controversy shows that no single benchmark is perfect.
Can I run Humanity’s Last Exam 2026 locally?
The dataset is publicly available for research, so you can test your own models against the HLE questions and see how they rank. A minimal sketch follows below.
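Here is a minimal sketch of a local evaluation, assuming the HLE dataset is hosted on Hugging Face (the `cais/hle` ID, the split name, and the field names are assumptions; check the dataset card, which may also require accepting access terms) and that you wrap your own model in an `answer_question` function. Official HLE scoring uses an LLM judge for free-form answers, so the exact-match accuracy below is only a crude proxy.

```python
from datasets import load_dataset

def answer_question(question: str) -> str:
    # Replace with a call to the model you want to evaluate.
    return "placeholder answer"

# Dataset ID, split, and column names ("question", "answer") are
# assumptions; verify them against the dataset card before running.
dataset = load_dataset("cais/hle", split="test")

correct = 0
for row in dataset:
    prediction = answer_question(row["question"])
    # Crude exact-match scoring; real HLE grading uses an LLM judge.
    if prediction.strip().lower() == str(row["answer"]).strip().lower():
        correct += 1

print(f"Accuracy: {correct / len(dataset):.1%}")
```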
As we move toward the second half of 2026, expect the Humanity’s Last Exam 2026 benchmark to be updated. The race for AGI reasoning is far from over. The gap between human and machine is closing faster than anyone predicted.
Written by Matteo Giardino, CTO and founder. I build AI-powered solutions and agents for the Italian and global markets. My projects.
