GPT-4o scores 2.7%. GPT-5 reaches 25%. Gemini 3.1 Pro hits 44.7%. No, these aren't Black Friday discounts on ChatGPT Plus - they're the scores of the world's most advanced AI models on Humanity's Last Exam, the benchmark that's redefining what "artificial intelligence" actually means.
And the fact that even the best models are struggling? That's exactly the point.
The Problem with Benchmarks Getting Too Easy
For years, we've measured AI progress with standardized benchmarks. Models answer questions about math, biology, medicine, programming. More correct answers mean a "smarter" model.
The problem? AI models started scoring over 90% on these tests.
When everyone aces the exam, the exam becomes useless. You can't distinguish genuine improvement from simple overfitting on public datasets. It's like everyone having access to last year's test answers.
Enter Humanity's Last Exam.
What Is Humanity's Last Exam?
Humanity's Last Exam (HLE) is an AI benchmark created by an international consortium of experts - over 1,000 researchers from 50 countries - collaborating with the Center for AI Safety and Scale AI.
The technical specs:
- 2,500 public questions (thousands more kept private)
- Graduate-level expertise required
- Multi-disciplinary coverage: math (41%), physics (9%), biology/medicine, humanities, linguistics
- Two answer types: exact match or multiple-choice (a grading sketch follows this list)
- No answers available online (original expert-created questions)
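Mechanically, both answer formats reduce to comparing a model's final answer against an expert-written reference. Here's a minimal Python sketch of what such grading could look like - the field names (`answer_type`, `correct_option`, `reference_answer`) are my illustrative assumptions, not the actual HLE schema:

```python
# Minimal sketch of HLE-style automated grading. Field names are
# illustrative assumptions, not the real HLE dataset schema.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count."""
    return " ".join(text.lower().split())

def grade(question: dict, model_answer: str) -> bool:
    if question["answer_type"] == "multiple_choice":
        # Compare the chosen option letter, e.g. "B".
        return normalize(model_answer) == normalize(question["correct_option"])
    # Exact match: the model's final answer must equal the reference string.
    return normalize(model_answer) == normalize(question["reference_answer"])

def score(questions: list[dict], answers: list[str]) -> float:
    """Benchmark accuracy: the fraction of questions graded correct."""
    correct = sum(grade(q, a) for q, a in zip(questions, answers))
    return correct / len(questions)
```

Exact-match grading is deliberately strict: a model can reason correctly and still score zero if its final answer doesn't match the reference, which is why normalization of trivial formatting matters.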
Real examples from the test:
- "How do you translate a Roman inscription found on a tombstone?"
- "How many pairs of tendons are supported by one bone in hummingbirds?"
- "Based on the latest research on Tiberian pronunciation, identify all syllables ending in a consonant sound from this Hebrew text"
These aren't questions you solve with a Google search.
How Are AI Models Performing?
When the test launched in early 2025, the results were... humbling.
Initial scores (early 2025):
- GPT-4o: 2.7%
- Most models: single digits
Current scores (March 2026):
- Gemini 3.1 Pro Preview: 44.7%
- GPT-5.4: 41.6%
- GPT-5.3 Codex: 39.9%
- GPT-5: ~25%
Even the world's best model answers less than half the questions correctly. For comparison, a human expert should approach 90%+ on questions in their own specialization.
How the Question Selection Process Works
Not every difficult question makes it into HLE. The curation process is rigorous:
1. Expert submissions: Thousands of researchers submit graduate-level questions in their fields
2. AI testing: Questions are tested against multiple AI models. Only those that stump the models advance (a sketch of this gate follows below)
3. Expert review: Other experts evaluate usefulness and originality using strict guidelines
4. Public/private split: 2,500 questions released publicly, thousands kept private to prevent overfitting
Around 70,000 initial submissions were narrowed down to a few thousand through this process.
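Step 2 is the interesting gate. As a rough illustration, here's what that filter could look like in Python - the `model.answer()` interface and the stump-every-model rule are assumptions based on the description above, not the actual HLE tooling, and `grade` is any grading function like the one sketched earlier:

```python
from typing import Callable

# Illustrative sketch of the "AI testing" gate: a candidate question
# advances only if every tested frontier model gets it wrong.
# model.answer() is a hypothetical interface, not a real API.

def stumps_all_models(question: dict, models: list, grade: Callable) -> bool:
    for model in models:
        answer = model.answer(question["prompt"])  # hypothetical model call
        if grade(question, answer):
            return False  # one model solved it: the question is discarded
    return True

def filter_submissions(submissions: list[dict], models: list,
                       grade: Callable) -> list[dict]:
    # Roughly 70,000 submissions go in; only the model-stumping survivors
    # move on to human expert review and the public/private split.
    return [q for q in submissions if stumps_all_models(q, models, grade)]
```

One design consequence of this gate: the benchmark is adversarial by construction, so early scores near zero are expected rather than surprising.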
Why "Last Exam" Is a Controversial Name
The benchmark's name itself has sparked debate. "Humanity's Last Exam" sounds apocalyptic, as if after this test we won't need to test AI anymore because it will have reached human level.
Main criticisms:
1. Expertise ≠ Intelligence
As MIT researchers Katherine Collins and Joshua Tenenbaum note, HLE measures performance on academic problems, not true "intelligence." Real expertise also includes:
- Evaluating whether a question makes sense
- Recognizing when multiple answers are possible
- Knowing how confident you are in your answer
- Asking new questions, not just answering existing ones
2. Limited Format
Questions require short answers or multiple-choice selections. But many complex problems demand extended, articulated responses, interdisciplinary reasoning, or entire scientific papers. These forms of expertise aren't captured by HLE.
3. Gaming the System
An improvement in HLE score can mean one of two things:
- The model genuinely became more capable
- The model got extra training on the public dataset (like studying last year's questions)
It's not always clear which one.
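This is exactly what the private held-out split is for. One simple diagnostic - my assumption about how such a split could be used, not a documented HLE procedure - is to compare a model's accuracy on the public and private questions:

```python
# Contamination check using a public/private benchmark split.
# The scores and the 10-point threshold are hypothetical examples.

def public_private_gap(public_acc: float, private_acc: float) -> float:
    """A genuinely capable model scores similarly on both splits;
    memorizing public questions inflates only the public score."""
    return public_acc - private_acc

# Hypothetical model: 45% on public questions, 30% on the private set.
if public_private_gap(0.45, 0.30) > 0.10:
    print("Large gap: the model may have trained on the public questions")
```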
The Benchmark Arms Race
HLE is just the latest chapter in a longer story: the continuous race between AI capabilities and AI benchmarks.
The pattern repeats:
- A new "difficult" benchmark is created
- AI models improve rapidly
- After 1-2 years, they score close to 100%
- The benchmark becomes obsolete
- A new, harder benchmark is needed
It happened with:
- ImageNet (image recognition)
- GLUE and SuperGLUE (natural language understanding)
- MMLU (massive multitask language understanding)
And now it's happening with HLE. The state of the art went from GPT-4o's 2.7% to over 40% in just over a year.
What This Means for AI's Future
HLE tells us some important things:
1. We're Still Far from AGI
If even the best models correctly answer less than half of graduate-level questions, we're still very far from artificial general intelligence (AGI) that competes with human experts across the board.
2. Progress Is Real but Uneven
Models are improving rapidly. But this progress is uneven - they excel in some domains (math, programming) while struggling in others (ancient linguistics, specialized biology).
3. We Need a New Evaluation Paradigm
As Subbarao Kambhampati (former president of the Association for the Advancement of Artificial Intelligence) notes: "Humanity is not contained in any static test, but in our ability to continually evolve both in asking and answering questions we never, in our wildest dreams, thought we would - generation after generation."
OpenAI and others are exploring new ways to evaluate AI - scientific creativity, collaborative thinking with humans, real-world scenarios instead of academic tests.
The Benchmark That Wants to Become Obsolete
The most interesting thing about HLE? The team that created it hopes it becomes obsolete.
Not because AI will completely surpass it (though it will), but because its purpose is to force the development of innovative paradigms for AI evaluation.
As Collins and Tenenbaum write: "The project will ideally make itself obsolete by forcing the development of innovative paradigms for AI evaluation."
When Gemini or GPT reach 90%+ on HLE, we'll need something even harder. Not to test AI on academic questions, but to measure its ability to collaborate with humans, generate new ideas, navigate real-world ambiguity.
Humanity's Last Exam won't truly be humanity's last exam. But for now, it's the best we have.
Want to stay updated on AI, benchmarks, and model progress? Follow me on this blog or get in touch for consulting on how to choose and integrate the right AI models for your business.
