Scientists built the hardest AI test in history – and even the best models are failing it

When AI started breezing through the tests humans built to challenge it, researchers from around the world decided to build something it genuinely couldn't pass – and the scores are quite telling

Researchers have developed a new benchmark designed to push beyond existing AI tests that have become too easy to meaningfully measure progress.
Lee Bell, Meteored United Kingdom

Researchers have long used standardised benchmarks to measure how capable AI systems actually are.

Tests like the Massive Multitask Language Understanding (MMLU) benchmark were designed to be demanding – covering a broad range of academic subjects – and were long thought to be a reliable gauge of what AI could and couldn't do. The problem, however, is that modern AI systems have become good enough that those tests no longer tell researchers much.

So a worldwide group of nearly 1,000 researchers, spanning disciplines from mathematics and linguistics to medicine and ancient history, set about building something harder. The result is "Humanity's Last Exam" – a 2,500-question assessment covering everything from advanced mathematics to translating ancient Palmyrene inscriptions, identifying tiny anatomical structures in birds, and analysing features of Biblical Hebrew pronunciation.

How the questions were chosen

Among the contributors to the test is Dr Tung Nguyen, an instructional associate professor in computer science and engineering at Texas A&M, who wrote 73 of the publicly available questions – the second-highest number of any individual contributor.

"When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding," he said. "But HLE reminds us that intelligence isn't just about pattern recognition – it's about depth, context and specialised expertise."

Every question was tested against leading AI models before being finalised. If any model answered correctly, the question was removed. The filtering process was designed to ensure the exam sat just beyond what current systems can reliably handle.
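The filtering step described above can be sketched in a few lines of code. This is a hypothetical illustration only – the model panel and the `ask` helper are placeholders, not the HLE team's actual tooling – but it captures the logic: a candidate question survives only if every model in the panel gets it wrong.

```python
def filter_questions(candidates, models, ask):
    """Keep only questions that no model in the panel answers correctly.

    candidates: list of (question, correct_answer) pairs
    models:     list of model identifiers to test against
    ask:        callable(model, question) -> the model's answer string
    """
    kept = []
    for question, answer in candidates:
        # Discard the question if any model answers it correctly.
        if any(ask(model, question).strip() == answer for model in models):
            continue
        kept.append((question, answer))
    return kept


# Toy demonstration with a stubbed "model" that only knows basic arithmetic.
def stub_ask(model, question):
    return "4" if question == "2 + 2?" else "I don't know"

survivors = filter_questions(
    [("2 + 2?", "4"), ("Translate this Palmyrene inscription.", "...")],
    ["model-a"],
    stub_ask,
)
# The easy arithmetic question is filtered out; the hard one survives.
```

In practice the real pipeline would also have to handle free-form answers that don't match the reference string exactly, which is why exact-match grading here is a simplification.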

Early results showed that even the most advanced AI systems have struggled with complex, specialised questions that demand depth and expert-level understanding.

Early results bear that out. GPT-4o scored 2.7%, Claude 3.5 Sonnet reached 4.1%, and OpenAI's o1 came in around 8%. More recent systems, including Gemini 2.1 Pro and Claude Opus, have reached somewhere between 40% and 50%. To stop models training on the questions in advance, the majority are kept hidden, with only a portion released publicly.

Nguyen said the need for reliable benchmarking goes beyond academic interest:

"Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do."

Benchmarks, he added, provide the foundation for measuring progress and identifying risks.

Not a warning – a measuring tool

Despite the name, the researchers stressed the exam isn't meant as a statement about AI overtaking human expertise. The aim is to give the field a clearer, more honest picture of where AI systems still fall short, and to produce a benchmark that stays useful as models continue to improve.

"This isn't a race against AI," Nguyen said. "It's a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters."

Nguyen added that experts from nearly every discipline contributed to the collaboration, and it is that breadth of human knowledge, he says, that makes the gaps in AI performance visible in ways that narrower tests do not.

News reference:

Scientists built the hardest AI test ever and the results are surprising, published by Texas A&M University, March 2026.