The rapid evolution of artificial intelligence is bringing a new milestone within reach: mastering one of the most demanding knowledge tests ever created, the “Humanity’s Last Exam” (HLE). According to its developers, top AI systems could approach a perfect score within months.
Designed as an extreme benchmark, HLE consists of 2,500 questions spanning roughly 100 disciplines, from rocket science to mythology. Each question demands PhD-level understanding; a model achieving a near-perfect score would effectively function as a "universal expert."
Until recently, AI systems struggled with the test. OpenAI's ChatGPT scored just 3%, while models from Google and Anthropic performed only marginally better.
That picture is now changing rapidly. Google’s Gemini model reached 45.9% in February, nearly doubling its performance within months. According to Calvin Zhang, head of research at Scale AI, a perfect score is no longer out of reach.
“We aimed to build a benchmark at the level of top human experts — something only a handful of people in the world could solve,” Zhang said. Meanwhile, researchers at Google DeepMind highlight major advances in reasoning capabilities, with product manager Kate Olszewska describing recent progress as “remarkable.”
Anthropic's Claude model has already achieved 34.2% and continues to improve quickly. Reaching 100% would mark a major turning point: HLE is a closed benchmark, built entirely on existing human knowledge, so a perfect score would mean the test has nothing left to measure.
If that milestone is reached, the next challenge will be even more ambitious: evaluating AI using questions whose answers are unknown to humanity.
The test was developed through a collaboration between Scale AI and the Center for AI Safety, aiming to measure both breadth of knowledge and depth of reasoning. Experts from around 50 countries submitted 70,000 questions in response to a global call.
The selection process was rigorous. Questions had to have clear answers while remaining difficult to find online. Ultimately, 2,500 were chosen, with many kept undisclosed to preserve the test’s integrity.
The significance of such an achievement is already being compared to historic AI milestones, such as IBM’s Deep Blue defeating world chess champion Garry Kasparov in 1997.
Despite the rapid progress, experts stress that human expertise remains essential — especially in fields requiring judgment, creativity, and hands-on skill.
Still, one question looms: if artificial intelligence reaches the limits of human knowledge, what comes next?