Humanity's Last Exam: The AI Benchmark Stumping GPT-5

By Matteo Giardino, Fractional AI CTO

TL;DR: Humanity's Last Exam is one of the toughest AI benchmarks created so far. Its public set has 2,500 graduate-level questions spanning biology to math. Currently, even top models like GPT-5 and Gemini 3 score under 45%, which suggests we are still far from AGI. In my experience testing these models, they struggle with deep logic.

GPT-4o scores 2.7%. GPT-5 reaches about 25%. Gemini 3.1 Pro Preview hits 44.7%. No, these aren't Black Friday discounts on ChatGPT Plus - they're the scores of the world's most advanced AI models on Humanity's Last Exam, the benchmark that's redefining what "artificial intelligence" actually means.

And the fact that even the best models are struggling? That's exactly the point.

Why Humanity's Last Exam Replaces Easy Benchmarks

For years, we've measured AI progress with standardized benchmarks. Models answer questions about math, biology, medicine, programming. More correct answers mean a "smarter" model.

The problem? AI models started scoring over 90% on these tests. As we explored in our guide on testing AI models, easy tests hide real flaws.

When everyone aces the exam, the exam becomes useless. You can't distinguish genuine improvement from simple overfitting on public datasets. It's like everyone having access to last year's test answers.

Enter Humanity's Last Exam.

What Is Humanity's Last Exam?

Humanity's Last Exam (HLE) is a definitive AI benchmark created by an international consortium of experts - over 1,000 researchers from 50 countries - collaborating with the Center for AI Safety and Scale AI.

The technical specs:

2,500 public questions (thousands more kept private)
Graduate-level expertise required
Multi-disciplinary coverage: math (41%), physics (9%), biology/medicine, humanities, linguistics
Two answer types: exact match or multiple-choice
No answers available online (original expert-created questions)
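To make the two answer types concrete, here is a minimal Python sketch of how each could be graded. The function names, the `answer_type` labels, and the normalization rules are my own assumptions for illustration; the official HLE grading pipeline may work differently.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", text.strip().lower())


def grade(answer_type: str, predicted: str, expected: str) -> bool:
    """Return True if the prediction matches the reference answer.

    The answer_type labels below are hypothetical, not an official schema.
    """
    if answer_type == "multiple_choice":
        # Compare only the chosen option letter, so "b)" matches "B".
        return normalize(predicted)[:1] == normalize(expected)[:1]
    if answer_type == "exact_match":
        # Compare the whole answer after normalization.
        return normalize(predicted) == normalize(expected)
    raise ValueError(f"unknown answer type: {answer_type}")
```

Either way the result is a plain right-or-wrong verdict with no partial credit, which is part of why the scores look so harsh.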

Real examples from the test:

"How do you translate a Roman inscription found on a tombstone?"
"How many pairs of tendons are supported by one bone in hummingbirds?"
"Based on the latest research on Tiberian pronunciation, identify all syllables ending in a consonant sound from this Hebrew text"

These aren't questions you solve with a Google search.

Here is a simplified illustration of how a question could be represented in JSON. The field names and the answer value are placeholders, not the official dataset schema:

{
  "question": "How many pairs of tendons are supported by one bone in hummingbirds?",
  "subject": "biology",
  "difficulty": "graduate",
  "correct_answer": "Multiple answers possible depending on species"
}
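If you wanted to work with records shaped like the one above, a small loader could look like this. It is a sketch that assumes the four illustrative field names from the JSON example; the `Question` class and `parse_question` helper are hypothetical names, not part of any HLE tooling.

```python
import json
from dataclasses import dataclass

# Field names taken from the illustrative record above (not an official schema).
REQUIRED = ("question", "subject", "difficulty", "correct_answer")


@dataclass(frozen=True)
class Question:
    question: str
    subject: str
    difficulty: str
    correct_answer: str


def parse_question(raw: str) -> Question:
    """Parse one JSON record and fail loudly if a field is missing."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return Question(**{key: str(data[key]) for key in REQUIRED})
```

Validating up front keeps a malformed record from silently skewing a score later on.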

How Are AI Models Performing?

Results were humbling when the test launched in 2025.

Initial scores (early 2025):

GPT-4o: 2.7%
Most models: single digits

Current scores (March 2026):

Gemini 3.1 Pro Preview: 44.7%
GPT-5.4: 41.6%
GPT-5.3 Codex: 39.9%
GPT-5: ~25%

Even the top models miss over half the answers. For comparison, a human expert would be expected to score around 90% or higher on questions within their own specialization.
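The "miss over half" claim is simple arithmetic on the scores listed above. A quick sketch, using those figures as given (GPT-5 is approximate in the list, so its value here is too):

```python
# Accuracy figures as quoted in this article, in percent.
scores = {
    "Gemini 3.1 Pro Preview": 44.7,
    "GPT-5.4": 41.6,
    "GPT-5.3 Codex": 39.9,
    "GPT-5": 25.0,  # listed as "~25%", so approximate
    "GPT-4o": 2.7,
}


def miss_rate(accuracy_pct: float) -> float:
    """Share of questions answered incorrectly, in percent."""
    return round(100.0 - accuracy_pct, 1)


for model, accuracy in scores.items():
    print(f"{model}: {accuracy:.1f}% correct, {miss_rate(accuracy):.1f}% missed")
```

Every model in the list lands above a 50% miss rate, including the current leader.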

How Humanity's Last Exam Selects Questions

Not every question makes it. The process is strict:

Expert submissions: Thousands of researchers submit graduate-level questions in their fields
AI testing: Questions are tested against multiple AI models. Only those that stump the models advance
Expert review: Other experts evaluate usefulness and originality using strict guidelines
Public/private split: 2,500 questions released publicly, thousands kept private to prevent overfitting

Around 70,000 initial submissions were narrowed down to a few thousand through this process.
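The funnel described above can be sketched in a few lines of Python. This is only an illustration of the filtering logic: the `Submission` fields, the function names, and the simple slice used for the public/private split are all my assumptions, not the organizers' actual code.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    question: str
    model_results: list[bool]  # True where a tested model answered correctly
    expert_approved: bool      # outcome of the expert review step


def select(submissions: list[Submission]) -> list[Submission]:
    """Keep questions that stumped every tested model and passed review."""
    stumped = [s for s in submissions if not any(s.model_results)]
    return [s for s in stumped if s.expert_approved]


def split_public_private(selected: list[Submission], public_size: int):
    """Release the first public_size questions and hold back the rest."""
    return selected[:public_size], selected[public_size:]
```

Note the order of the filters: model testing comes first, so experts only spend review time on questions that already beat the models.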

Why "Humanity's Last Exam" Is Controversial

The benchmark's name itself has sparked debate. "Humanity's Last Exam" sounds apocalyptic, as if after this test we won't need to test AI anymore because it will have reached human level.

Main criticisms:

1. Expertise ≠ Intelligence

As MIT researchers Katherine Collins and Joshua Tenenbaum note, HLE measures performance on academic problems, not true "intelligence." Real expertise also includes:

Evaluating whether a question makes sense
Recognizing when multiple answers are possible
Knowing how confident you are in your answer
Asking new questions, not just answering existing ones

2. Limited Format

Questions require short answers or multiple-choice. But many complex problems require articulated responses, interdisciplinary reasoning, scientific papers. These forms of expertise aren't captured by HLE.

3. Gaming the System

An improvement in HLE score can mean two things:

The model genuinely became more capable
The model got extra training on the public dataset (like studying last year's questions)

It's not always clear which one.
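One rough way to probe the difference, assuming you have per-question results on both the public set and a held-out private set, is to compare the two accuracies. The function names are hypothetical and the interpretation is a heuristic, not a proof of contamination:

```python
def accuracy(results: list[bool]) -> float:
    """Percentage of correct answers in a list of per-question results."""
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(results) / len(results)


def public_private_gap(public: list[bool], private: list[bool]) -> float:
    """Public accuracy minus private accuracy, in percentage points.

    A large positive gap hints that the model may have seen the public
    questions during training; a gap near zero is more consistent with
    a genuine capability gain.
    """
    return accuracy(public) - accuracy(private)
```

For example, a model scoring 50% on public questions but 25% on private ones would show a 25-point gap worth investigating.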

Humanity's Last Exam and the AI Arms Race

HLE is just the latest chapter in a longer story: the continuous race between AI capabilities and AI benchmarks.

The pattern repeats:

A new "difficult" benchmark is created
AI models improve rapidly
After 1-2 years, they score close to 100%
The benchmark becomes obsolete
A new, harder benchmark is needed

It happened with:

ImageNet (image recognition)
GLUE and SuperGLUE (natural language understanding)
MMLU (massive multitask language understanding)

And now it's happening with HLE. The top score went from GPT-4o's 2.7% to over 40% with the latest models in just over a year.

What This Means for AI's Future

HLE tells us some important things:

1. We're Still Far from AGI

When I test models on local enterprise workloads, I see similar limits. We remain far from AGI. Current models cannot match human experts in broad domains.

2. Progress Is Real but Uneven

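Models improve fast but unevenly. They do comparatively well on math yet stumble on niche specialist questions, like the hummingbird anatomy and Tiberian Hebrew examples above.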

3. We Need a New Evaluation Paradigm

As Subbarao Kambhampati (former president of the Association for the Advancement of Artificial Intelligence) notes: "Humanity is not contained in any static test, but in our ability to continually evolve both in asking and answering questions we never, in our wildest dreams, thought we would - generation after generation."

OpenAI and others are exploring new ways to evaluate AI - scientific creativity, collaborative thinking with humans, real-world scenarios instead of academic tests.

The Benchmark That Wants to Become Obsolete

The most interesting thing about HLE? The team that created it hopes it becomes obsolete.

Not because AI will completely surpass it (though it likely will), but because its real purpose is to push the field toward better ways of evaluating AI.

As Collins and Tenenbaum write: "The project will ideally make itself obsolete by forcing the development of innovative paradigms for AI evaluation."

When Gemini or GPT reach 90%+ on HLE, we'll need something even harder. Not to test AI on academic questions, but to measure its ability to collaborate with humans, generate new ideas, navigate real-world ambiguity.

Humanity's Last Exam won't truly be humanity's last exam. But for now, it's the best we have.

FAQ

What is Humanity's Last Exam?
It is a benchmark of 2,500 public questions (plus a private held-out set) designed to test AI models on graduate-level academic knowledge.

Why do AI models fail Humanity's Last Exam?
AI models struggle because the questions require deep, multi-step logical reasoning rather than simple pattern matching.

Want to stay updated on AI, benchmarks, and model progress? Follow me on this blog or get in touch for consulting on how to choose and integrate the right AI models for your business.

Written by Matteo Giardino, CTO and founder. I build AI agents for SMEs in Italy.