| Benchmark | Focus | Links |
| --- | --- | --- |
| MMLU | Evaluates broad knowledge and reasoning across many academic and professional subjects. | Paper |
| HellaSwag | Measures commonsense reasoning by selecting the most plausible ending for a grounded scenario. | Project Page, Paper |
| GSM8K | Evaluates grade-school math word-problem solving and multi-step numerical reasoning. | Paper, Dataset |
| TruthfulQA | Tests factual truthfulness and resistance to generating common misconceptions. | Paper, Repository |
| ARC (Challenge) | Assesses science question answering and reasoning over grade-school-level knowledge. | Paper, Dataset |
| BIG-bench Hard (BBH) | Stress-tests reasoning on difficult tasks selected from BIG-bench to discriminate better between models. | Paper, Repository |
| MATH | Evaluates advanced mathematical reasoning on competition-style problems. | Paper, Repository |
| BIG-bench | Evaluates broad and diverse emergent capabilities across many challenging language tasks. | Repository, Paper |
| MT-Bench | Measures multi-turn instruction-following quality and chat performance, often scored with an LLM-as-a-judge. | Paper, Implementation |
| GPQA | Evaluates graduate-level domain knowledge and difficult reasoning in science-focused questions. | Paper |
| DROP | Tests reading comprehension with discrete reasoning over passages (counts, arithmetic, and comparisons). | Dataset, Paper |
| SQuAD | Evaluates extractive question answering and reading comprehension on Wikipedia passages. | Dataset, Paper |
| HumanEval | Measures code generation quality by checking the functional correctness of generated Python functions. | Repository |
| SWE-bench | Tests software engineering ability by requiring models to resolve real GitHub issues in open-source repositories. | Leaderboard, Paper |
| MMMU | Assesses multimodal understanding and reasoning with image-plus-text tasks across disciplines. | Project Page, Paper |
| HELM | Provides holistic evaluation across scenarios and metrics such as accuracy, calibration, and robustness. | Dashboard, Paper |
| LMArena | Crowdsourced evaluation platform (formerly Chatbot Arena) where users compare models in blind, head-to-head battles. | Website, YouTube tutorials |
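
The HumanEval entry above describes scoring by functional correctness, that is, running generated code rather than comparing it to a reference string. The following is a minimal sketch of that idea, not the benchmark's actual harness: the helper `passes_tests`, the toy `add` problem, and the candidate strings are all hypothetical, and a real evaluation would execute untrusted code inside a sandbox with timeouts.

```python
from typing import Dict, List, Tuple


def passes_tests(candidate_src: str, entry_point: str,
                 tests: List[Tuple[tuple, object]]) -> bool:
    """Return True if the generated function passes every test case.

    Hypothetical helper for illustration only.
    """
    namespace: Dict[str, object] = {}
    try:
        # WARNING: exec runs arbitrary code. This is acceptable for a toy
        # example, but real harnesses isolate execution in a sandbox.
        exec(candidate_src, namespace)
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count
        # as failures.
        return False


# Hypothetical model outputs for a toy "add two numbers" problem.
candidates = [
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b):\n    return a - b\n",   # wrong logic
    "def add(a, b)\n    return a + b\n",    # syntax error (missing colon)
]

# Each test is (positional arguments, expected return value).
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

results = [passes_tests(src, "add", tests) for src in candidates]
pass_rate = sum(results) / len(results)
print(results, round(pass_rate, 3))
```

Only the first candidate passes here, so one of three samples is counted as functionally correct. Treating any exception as a failure keeps the check strict, which reflects the idea that a solution either satisfies all tests or does not.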