GPT-4o scores 2.7%. GPT-5 reaches 25%. Gemini 3.1 Pro hits 44.7%. No, these aren't Black Friday discounts on ChatGPT Plus - they're the scores of the world's most advanced AI models on Humanity's Last Exam, the benchmark that's redefining what "artificial intelligence" actually means.
And the fact that even the best models are struggling? That's exactly the point.
The Problem with Benchmarks Getting Too Easy
For years, we've measured AI progress with standardized benchmarks. Models answer questions about math, biology, medicine, programming. More correct answers mean a "smarter" model.
The problem? AI models started scoring over 90% on these tests.
When everyone aces the exam, the exam becomes useless. You can't distinguish genuine improvement from simple overfitting on public datasets. It's like everyone having access to last year's test answers.
Enter Humanity's Last Exam.
What Is Humanity's Last Exam?
Humanity's Last Exam (HLE) is an AI benchmark created by an international consortium of experts - over 1,000 researchers from 50 countries - collaborating with the Center for AI Safety and Scale AI.
The technical specs:
- 2,500 public questions (thousands more kept private)
- Graduate-level expertise required
- Multi-disciplinary coverage: math (41%), physics (9%), biology/medicine, humanities, linguistics
- Two answer types: exact match or multiple-choice (a grading sketch follows this list)
- No answers available online (original expert-created questions)
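Mechanically, both answer formats reduce to comparing a model's final answer against an expert-written reference. Here's a minimal Python sketch of what such grading could look like - the field names (`answer_type`, `correct_option`, `reference_answer`) are my illustrative assumptions, not the actual HLE schema:

```python
# Minimal sketch of HLE-style automated grading. Field names are
# illustrative assumptions, not the real HLE dataset schema.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count."""
    return " ".join(text.lower().split())

def grade(question: dict, model_answer: str) -> bool:
    if question["answer_type"] == "multiple_choice":
        # Compare the chosen option letter, e.g. "B".
        return normalize(model_answer) == normalize(question["correct_option"])
    # Exact match: the model's final answer must equal the reference string.
    return normalize(model_answer) == normalize(question["reference_answer"])

def score(questions: list[dict], answers: list[str]) -> float:
    """Benchmark accuracy: the fraction of questions graded correct."""
    correct = sum(grade(q, a) for q, a in zip(questions, answers))
    return correct / len(questions)
```

Exact-match grading is deliberately strict: a model can reason correctly and still score zero if its final answer doesn't match the reference, which is why normalization of trivial formatting matters.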
Real examples from the test:
- "How do you translate a Roman inscription found on a tombstone?"
- "How many pairs of tendons are supported by one bone in hummingbirds?"
- "Based on the latest research on Tiberian pronunciation, identify all syllables ending in a consonant sound from this Hebrew text"
These aren't questions you solve with a Google search.
How Are AI Models Performing?
When the test launched in early 2025, the results were... humbling.
Initial scores (early 2025):
- GPT-4o: 2.7%
- Most models: single digits
Current scores (March 2026):
- Gemini 3.1 Pro Preview: 44.7%
- GPT-5.4: 41.6%
- GPT-5.3 Codex: 39.9%
- GPT-5: ~25%
Even the world's best model answers less than half the questions correctly. For comparison, a human expert should approach 90%+ on questions in their own specialization.
How the Question Selection Process Works
Not every difficult question makes it into HLE. The curation process is rigorous:
1. Expert submissions: Thousands of researchers submit graduate-level questions in their fields
2. AI testing: Questions are tested against multiple AI models. Only those that stump the models advance (a sketch of this gate follows below)
3. Expert review: Other experts evaluate usefulness and originality using strict guidelines
4. Public/private split: 2,500 questions released publicly, thousands kept private to prevent overfitting
Around 70,000 initial submissions were narrowed down to a few thousand through this process.
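Step 2 is the interesting gate. As a rough illustration, here's what that filter could look like in Python - the `model.answer()` interface and the stump-every-model rule are assumptions based on the description above, not the actual HLE tooling, and `grade` is any grading function like the one sketched earlier:

```python
from typing import Callable

# Illustrative sketch of the "AI testing" gate: a candidate question
# advances only if every tested frontier model gets it wrong.
# model.answer() is a hypothetical interface, not a real API.

def stumps_all_models(question: dict, models: list, grade: Callable) -> bool:
    for model in models:
        answer = model.answer(question["prompt"])  # hypothetical model call
        if grade(question, answer):
            return False  # one model solved it: the question is discarded
    return True

def filter_submissions(submissions: list[dict], models: list,
                       grade: Callable) -> list[dict]:
    # Roughly 70,000 submissions go in; only the model-stumping survivors
    # move on to human expert review and the public/private split.
    return [q for q in submissions if stumps_all_models(q, models, grade)]
```

One design consequence of this gate: the benchmark is adversarial by construction, so early scores near zero are expected rather than surprising.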
Why "Last Exam" Is a Controversial Name
The benchmark's name itself has sparked debate. "Humanity's Last Exam" sounds apocalyptic, as if after this test we won't need to test AI anymore because it will have reached human level.
Main criticisms:
1. Expertise ≠ Intelligence
As MIT researchers Katherine Collins and Joshua Tenenbaum note, HLE measures performance on academic problems, not true "intelligence." Real expertise also includes:
- Evaluating whether a question makes sense
- Recognizing when multiple answers are possible
- Knowing how confident you are in your answer
- Asking new questions, not just answering existing ones
2. Limited Format
Questions require short answers or multiple-choice selections. But many complex problems demand extended, articulated responses, interdisciplinary reasoning, or entire scientific papers. These forms of expertise aren't captured by HLE.
3. Gaming the System
An improvement in HLE score can mean one of two things:
- The model genuinely became more capable
- The model got extra training on the public dataset (like studying last year's questions)
It's not always clear which one.
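This is exactly what the private held-out split is for. One simple diagnostic - my assumption about how such a split could be used, not a documented HLE procedure - is to compare a model's accuracy on the public and private questions:

```python
# Contamination check using a public/private benchmark split.
# The scores and the 10-point threshold are hypothetical examples.

def public_private_gap(public_acc: float, private_acc: float) -> float:
    """A genuinely capable model scores similarly on both splits;
    memorizing public questions inflates only the public score."""
    return public_acc - private_acc

# Hypothetical model: 45% on public questions, 30% on the private set.
if public_private_gap(0.45, 0.30) > 0.10:
    print("Large gap: the model may have trained on the public questions")
```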
The Benchmark Arms Race
HLE is just the latest chapter in a longer story: the continuous race between AI capabilities and AI benchmarks.
The pattern repeats:
- A new "difficult" benchmark is created
- AI models improve rapidly
- After 1-2 years, they score close to 100%
- The benchmark becomes obsolete
- A new, harder benchmark is needed
It happened with:
- ImageNet (image recognition)
- GLUE and SuperGLUE (natural language understanding)
- MMLU (massive multitask language understanding)
And now it's happening with HLE. The state of the art went from GPT-4o's 2.7% to over 40% in just over a year.
What This Means for AI's Future
HLE tells us some important things:
1. We're Still Far from AGI
If even the best models correctly answer less than half of graduate-level questions, we're still very far from artificial general intelligence (AGI) that competes with human experts across the board.
2. Progress Is Real but Uneven
Models are improving rapidly. But this progress is uneven - they excel in some domains (math, programming) while struggling in others (ancient linguistics, specialized biology).
3. We Need a New Evaluation Paradigm
As Subbarao Kambhampati (former president of the Association for the Advancement of Artificial Intelligence) notes: "Humanity is not contained in any static test, but in our ability to continually evolve both in asking and answering questions we never, in our wildest dreams, thought we would - generation after generation."
OpenAI and others are exploring new ways to evaluate AI - scientific creativity, collaborative thinking with humans, real-world scenarios instead of academic tests.
The Benchmark That Wants to Become Obsolete
The most interesting thing about HLE? The team that created it hopes it becomes obsolete.
Not because AI will completely surpass it (though it will), but because its purpose is to force the development of innovative paradigms for AI evaluation.
As Collins and Tenenbaum write: "The project will ideally make itself obsolete by forcing the development of innovative paradigms for AI evaluation."
When Gemini or GPT reach 90%+ on HLE, we'll need something even harder. Not to test AI on academic questions, but to measure its ability to collaborate with humans, generate new ideas, navigate real-world ambiguity.
Humanity's Last Exam won't truly be humanity's last exam. But for now, it's the best we have.
Want to stay updated on AI, benchmarks, and model progress? Follow me on this blog or get in touch for consulting on how to choose and integrate the right AI models for your business.
