Humanity's Last Exam


Humanity's Last Exam is a language model benchmark consisting of over 2,500 expert-level questions across a broad range of subjects. It was created jointly by the Center for AI Safety and Scale AI, and was designed to test reasoning abilities and human-like intelligence, as opposed to just pattern recognition.

History

Benchmark tests like Humanity's Last Exam have long been used to evaluate reasoning and learning capabilities in machines. Early benchmarks, such as the Turing test, measured whether machines could demonstrate human-like conversational abilities. Other early benchmarks evaluated computer vision, such as MNIST for handwritten digit recognition and ImageNet for image classification. The emergence of large language models in the 2020s led benchmark tests to evolve, with greater emphasis on interpretability, reproducibility, and clearer evaluation criteria. Recent foundation model benchmarks, such as MMLU, HellaSwag, and the ARC Challenge, illustrate this shift.

Creation

Humanity's Last Exam was created to keep pace with the rapid progression of LLMs and to provide a more rigorous assessment of these models. Leading models were scoring around 90% on earlier benchmarks, creating the need for a more difficult exam. Stanford HAI's AI Index 2025 Annual Report cites Humanity's Last Exam as one of the "more challenging benchmarks" developed in response to popular AI benchmarks having reached "saturation". The test has been described as the brainchild of Dan Hendrycks, a machine learning researcher and the director of the Center for AI Safety. Hendrycks stated that he was inspired to create the test after a conversation with Elon Musk, who thought the existing language model benchmarks, such as MMLU, were too easy.

Hendrycks worked with Scale AI to compile the questions, which were crowdsourced from subject matter experts at institutions across the world. Submissions were first screened against leading AI models; if the models failed to answer a question, or did no better than random guessing on a multiple-choice question, it was reviewed by human experts for accuracy and wording in two rounds and then approved for inclusion in the dataset. The submitters of the top-rated questions received prize money from a pool of 500,000 U.S. dollars: $5,000 for each of the top 50 questions and $500 for each of the next 500. After the initial release, a "community feedback bug bounty program" was opened to "identify and remove major errors in the dataset".

AI systems are able to master narrower, task-oriented tests, yet few perform well on broader assessments of general ability. HLE was designed to test reasoning abilities, which are considered a marker of "human" intelligence.
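A rough sketch of that pre-filtering criterion is given below. It is purely illustrative: the Candidate structure, its field names, and the passes_prefilter function are hypothetical and are not the benchmark's actual tooling. The sketch simply forwards a question to expert review only if the surveyed models fail it or, for a multiple-choice question, do no better than chance.

```python
# Hypothetical sketch of the pre-filtering step described above; names and
# thresholds are illustrative, not the benchmark's actual code.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    question: str
    answer: str
    choices: Optional[list[str]] = None  # None for exact-match questions


def passes_prefilter(candidate: Candidate, model_answers: list[str]) -> bool:
    """Forward to expert review only if the models fail the question or,
    for multiple-choice questions, do no better than random guessing."""
    correct = sum(a.strip() == candidate.answer for a in model_answers)
    if candidate.choices:  # multiple-choice: at or below chance accuracy
        chance = 1.0 / len(candidate.choices)
        return correct / len(model_answers) <= chance
    return correct == 0  # exact-match: every surveyed model must get it wrong


# Example: five models attempt a five-option multiple-choice question and only
# one answers correctly, which is exactly chance level, so the question advances.
sample = Candidate(question="(expert-level question)", answer="C",
                   choices=["A", "B", "C", "D", "E"])
print(passes_prefilter(sample, ["C", "A", "B", "D", "E"]))  # True
```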

Composition

The benchmark consists of 2,500 questions in the publicly released set. The paper classifies the questions into the following broad subjects: mathematics, physics, biology/medicine, humanities/social science, computer science/artificial intelligence, engineering, chemistry, and other. Around 14% of the questions require the ability to understand both text and images, i.e., multi-modality. 24% of the questions are multiple-choice; the rest are short-answer, exact-match questions. A private set is also maintained to test for benchmark overfitting.
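The public set can be inspected directly, for example with the Hugging Face datasets library, as in the minimal sketch below. The dataset identifier "cais/hle" and the field names ("category", "image", "answer_type") are assumptions about the released schema rather than documented facts, so they may need adjusting.

```python
# Minimal sketch of inspecting the public HLE set with Hugging Face "datasets".
# The dataset ID and field names are assumptions and may differ from the release.
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("cais/hle", split="test")  # public set of questions

# Broad subject breakdown (mathematics, physics, biology/medicine, ...).
print(Counter(example["category"] for example in dataset).most_common())

# Share of multi-modal questions, i.e. those pairing an image with the text.
multimodal = sum(1 for example in dataset if example["image"]) / len(dataset)
print(f"multi-modal: {multimodal:.0%}")  # roughly 14% according to the paper

# Share of multiple-choice questions; the remainder are exact-match.
mc = sum(example["answer_type"] == "multipleChoice" for example in dataset)
print(f"multiple-choice: {mc / len(dataset):.0%}")  # roughly 24% per the paper
```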
An independent investigation by FutureHouse, published in July 2025, suggested that around 30% of the HLE answers for text-only chemistry and biology questions could be incorrect; the benchmark's team partially replicated the findings and said they hope to institute a continuous revision process.

Results

Alongside accuracy, the leaderboard reports calibration error, which measures the extent to which AI systems are underconfident or overconfident. To measure calibration, test administrators prompt models to provide both an answer and their confidence in it, from 0% to 100%. Selected leaderboard results, with accuracy and calibration error given as percentages:
Organization         Model                          Accuracy (%) ↑  Calibration Error (%) ↓
OpenAI               gpt-oss-120b                   15.48           76
Alibaba Cloud        Qwen3-235B-A22B-Thinking-2507  15.43           78
DeepSeek             DeepSeek-R1-0528               14.04           78
Moonshot AI          Kimi-K2-Instruct                4.68           82
Amazon Web Services  Nova Micro                      4.41           84
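A common way to compute a calibration error of this kind is to bin answers by stated confidence and compare each bin's average confidence with its observed accuracy. The sketch below uses a generic root-mean-square, equal-width-bin formulation; the leaderboard's exact binning and weighting are not specified here and may differ.

```python
# Generic binned calibration-error computation: compare stated confidence with
# observed accuracy within equal-width confidence bins, then take the weighted
# root mean square of the gaps. Illustrative only; HLE's exact metric may differ.
import math


def calibration_error(confidences: list[float], correct: list[bool],
                      num_bins: int = 10) -> float:
    """RMS gap between mean confidence and accuracy over equal-width bins.
    Confidences are fractions in [0, 1] (i.e. the stated 0-100% divided by 100)."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    squared_gap = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        squared_gap += (len(bucket) / total) * (mean_conf - accuracy) ** 2
    return math.sqrt(squared_gap)


# An overconfident model: high stated confidence, mostly wrong answers,
# which yields a large calibration error.
print(calibration_error([0.9, 0.85, 0.95, 0.8], [False, False, True, False]))
```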