
HLE Benchmark: Why Humanity's Last Exam is the Final AI Challenge

Humanity's Last Exam (HLE) is the final closed-ended benchmark for frontier LLMs. See latest scores and why calibration is the new bottleneck in 2026.

Matteo Giardino

May 13, 2026


Written by Matteo Giardino - CTO and AI consultant.

AI benchmarks are saturating. Models now score over 90% on tests like MMLU, which makes those tests close to useless for measuring true frontier capabilities. Humanity's Last Exam (HLE) is the solution: a graduate-level challenge designed to be the final academic test for AI.

In this post, I break down what HLE is, analyze the latest scores from GPT-5.4 and Gemini 3.1, and explain why the biggest bottleneck in 2026 is not just raw accuracy - it is overconfidence and calibration.

What is Humanity's Last Exam (HLE)?

Humanity's Last Exam is a multi-modal, closed-ended benchmark created by roughly 1,000 subject matter experts. Unlike previous tests, HLE targets the frontier of human understanding in fields such as math, physics, and medicine.

The dataset consists of 2,500 questions so difficult that even the most advanced LLMs struggle to break the 50% accuracy barrier. It is not about facts; it is about deep reasoning and the ability to interpret multi-modal diagrams.

Beyond MMLU: Why HLE Matters in 2026

Why is this challenge fundamental? Because we've reached "benchmark saturation." When every model hits 95% on MMLU, the signal is lost. HLE sets the bar at the post-graduate level, providing a clean metric for the reasoning depth of a frontier AI system.

Specifically, it combats training data contamination by using expert-vetted questions that aren't easily searchable. For a CTO, it's the best way to verify if a model actually "understands" complex technical domains.


Latest Scores & Leaderboard Analysis

The current leaderboard shows a fascinating shift in the AI landscape. As of May 2026, here are the top performers on the HLE (Humanity's Last Exam):

  1. Gemini 3.1 Pro Preview (Thinking High): 46.44%
  2. GPT-5.4 Pro (2026-03-05): 44.32%
  3. Muse Spark: 40.56%
  4. Claude Opus 4.7: 36.20%

What's interesting here isn't just the ranking, but the gap. We are seeing a plateau. Even with "High Thinking" modes enabled, models are struggling to move past the 50% mark. This suggests that we are hitting a fundamental bottleneck in how LLMs handle specialized, world-class scientific problems.

The Overconfidence Problem: Calibration Error

The most striking metric in the HLE results isn't accuracy - it is Calibration Error. A well-calibrated model knows when it is likely to be wrong. If a model says it is 90% confident, it should be right 90% of the time.

In HLE, we see systematic overconfidence. Many models exhibit calibration errors higher than 50%. They provide "confidently wrong" answers to graduate-level physics or math problems. For enterprise applications, this is a massive red flag. It is often better to have a model that says "I don't know" than one that hallucinates a plausible-sounding but incorrect solution. This expert-level bottleneck is what the HLE benchmark was built to identify.
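To make "calibration error" concrete, here is a minimal Python sketch of Expected Calibration Error (ECE), one common way to measure it: answers are bucketed by the confidence the model stated, and each bucket's average confidence is compared to how often the model was actually right. The confidence values and correctness flags below are made-up illustration data, not HLE results.

```python
# Minimal sketch of Expected Calibration Error (ECE).
# Inputs: each answer's stated confidence (0-1) and whether it was correct.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bucket answers by stated confidence, compare each bucket's average
    confidence to its actual accuracy, and return the weighted gap."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        bucket_acc = correct[mask].mean()        # how often it was right
        bucket_conf = confidences[mask].mean()   # how confident it claimed to be
        ece += mask.mean() * abs(bucket_conf - bucket_acc)
    return ece

# Illustration: a model that claims ~90% confidence but is right well under
# half the time has a large ECE -- the "confidently wrong" pattern HLE surfaces.
conf = [0.9, 0.95, 0.85, 0.9, 0.92]
hits = [1, 0, 0, 1, 0]
print(f"ECE: {expected_calibration_error(conf, hits):.2f}")
```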

How to Use HLE for Model Selection

When you are choosing a model for complex tasks like autonomous coding or server management, don't just look at the raw accuracy. You might want to explore how to run models locally or check out free API options to test these capabilities yourself.

  1. Check the Calibration: Look for models with lower calibration errors. They are more reliable for automation where human oversight is minimal (see the sketch after this list).
  2. Multi-modal Performance: If your workflow involves charts or diagrams, HLE is the best way to test if a model can actually "see" the logic in an image.
  3. Test with OpenClaw: I always recommend running these models through the OpenClaw framework to see how they handle real-world tools and long-running tasks beyond static benchmarks.
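To put point 1 into practice, here is a rough sketch of how you might fold calibration error into a selection ranking instead of sorting by raw accuracy alone. The model names, numbers, and the 0.5 weighting are placeholders I chose for illustration, not leaderboard figures; the point is the structure, and the right weight depends on how much unsupervised automation you plan to run.

```python
# Hypothetical ranking of candidate models by accuracy and calibration.
from dataclasses import dataclass

@dataclass
class ModelResult:
    name: str
    accuracy: float           # fraction of HLE-style questions answered correctly
    calibration_error: float  # lower is better

def reliability_score(r: ModelResult, calibration_weight: float = 0.5) -> float:
    # Penalize overconfident models: one point of calibration error costs
    # `calibration_weight` points of accuracy. The weight is a judgment call.
    return r.accuracy - calibration_weight * r.calibration_error

candidates = [
    ModelResult("model-a", accuracy=0.46, calibration_error=0.55),
    ModelResult("model-b", accuracy=0.44, calibration_error=0.30),
]

for r in sorted(candidates, key=reliability_score, reverse=True):
    print(f"{r.name}: score={reliability_score(r):.3f}")
```

With these placeholder numbers, the slightly less accurate but better-calibrated model ranks first, which is usually what you want for low-oversight automation.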

FAQ

What is the passing score for the HLE benchmark?

There is no "passing" score for AI. However, current frontier models are all scoring below 50%, highlighting the extreme difficulty of the benchmark.

How does the HLE benchmark compare to MMLU?

MMLU is essentially "undergraduate" level and is now saturated (models score >90%). HLE is "graduate" level and multi-modal, designed to be the final academic benchmark of its kind.

Can I run the HLE benchmark locally?

The public set is available for the research community, but you'll need significant compute and proper evaluation scripts to test models locally.
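For the general shape of a local run, here is a rough sketch. It assumes the public questions are hosted on the Hugging Face Hub under an identifier like cais/hle (access may be gated) and that fields are named "question" and "answer"; both are assumptions to verify against the actual release, and the naive exact-match scoring is only a stand-in for the official grading scripts.

```python
# Rough sketch of a local HLE-style evaluation loop.
# Dataset ID, split, and field names below are assumptions, not confirmed.
from datasets import load_dataset

def my_model(prompt: str) -> str:
    # Placeholder: swap in your own local model or API call here.
    return "I don't know"

dataset = load_dataset("cais/hle", split="test")  # adjust ID/split as needed

correct = 0
for example in dataset:
    prediction = my_model(example["question"])
    # Naive exact-match scoring, only for illustration.
    correct += int(prediction.strip() == str(example["answer"]).strip())

print(f"Exact-match accuracy: {correct / len(dataset):.2%}")
```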

Written by Matteo Giardino, CTO and founder. My projects.
