| Benchmark | Focus | Links |
|---|---|---|
| MMLU | Evaluates broad knowledge and reasoning across many academic and professional subjects. | Paper |
| HellaSwag | Measures commonsense reasoning by selecting the most plausible ending for a grounded scenario. | Project Page, Paper |
| GSM8K | Evaluates grade-school math word-problem solving and multi-step numerical reasoning. | Paper, Dataset |
| TruthfulQA | Tests factual truthfulness and resistance to generating common misconceptions. | Paper, Repository |
| ARC (Challenge) | Assesses science question answering and reasoning over grade-school level knowledge. | Paper, Dataset |
| BIG-bench Hard (BBH) | Stress-tests difficult reasoning tasks selected from BIG-bench for stronger discrimination. | Paper, Repository |
| MATH | Evaluates advanced mathematical reasoning on competition-style problems. | Paper, Repository |
| BIG-bench | Evaluates broad and diverse emergent capabilities across many challenging language tasks. | Repository, Paper |
| MT-Bench | Measures multi-turn instruction-following quality and chat performance, often with LLM-as-a-judge. | Paper, Implementation |
| GPQA | Evaluates graduate-level domain knowledge and difficult reasoning in science-focused questions. | Paper |
| DROP | Tests reading comprehension with discrete reasoning over passages (counts, arithmetic, and comparisons). | Dataset, Paper |
| SQuAD | Evaluates extractive question answering and reading comprehension on Wikipedia passages. | Dataset, Paper |
| HumanEval | Measures code generation quality by checking functional correctness of generated Python functions. | Repository |
| SWE-bench | Tests software engineering ability by solving real GitHub issues in open-source repositories. | Leaderboard, Paper |
| MMMU | Assesses multimodal understanding and reasoning with image-plus-text tasks across disciplines. | Project Page, Paper |
| HELM | Provides holistic evaluation across scenarios and metrics such as accuracy, calibration, and robustness. | Dashboard, Paper |
| LMArena | Ranks models through crowdsourced, blind, head-to-head battles in which users compare responses from different models (formerly Chatbot Arena). | Website, YouTube tutorials |
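
The head-to-head battles described for LMArena have to be turned into a ranking somehow. One common way to do that is an Elo-style rating update, sketched below. This is an illustration only and is not claimed to be the method LMArena itself uses; the model names, the starting rating of 1000, and the `k` factor of 32 are all assumptions chosen for the example.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    model_a: str,
    model_b: str,
    outcome: float,
    k: float = 32.0,         # assumed update step size
    initial: float = 1000.0  # assumed starting rating for unseen models
) -> Dict[str, float]:
    """Apply one battle result in place.

    outcome is 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie.
    """
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    ea = expected_score(ra, rb)
    # The two updates are equal and opposite, so total rating is conserved.
    ratings[model_a] = ra + k * (outcome - ea)
    ratings[model_b] = rb - k * (outcome - ea)
    return ratings


if __name__ == "__main__":
    # Hypothetical battle log: (model_a, model_b, outcome for model_a).
    battles = [
        ("model-x", "model-y", 1.0),
        ("model-x", "model-y", 0.5),
        ("model-y", "model-x", 1.0),
    ]
    ratings: Dict[str, float] = {}
    for a, b, outcome in battles:
        update_elo(ratings, a, b, outcome)
    for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Sequential updates like this depend on the order in which battles arrive; fitting all battles jointly (for example with a Bradley-Terry model) avoids that, at the cost of a slightly more involved computation.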