Benchmarks Table

Benchmark	Focus	Links
MMLU	Evaluates broad knowledge and reasoning across many academic and professional subjects.	Paper
HellaSwag	Measures commonsense reasoning by selecting the most plausible ending for a grounded scenario.	Project Page Paper
GSM8K	Evaluates grade-school math word-problem solving and multi-step numerical reasoning.	Paper Dataset
TruthfulQA	Tests factual truthfulness and resistance to generating common misconceptions.	Paper Repository
ARC (Challenge)	Assesses science question answering and reasoning over grade-school-level knowledge.	Paper Dataset
BIG-bench Hard (BBH)	Stress-tests models on a subset of BIG-bench tasks where earlier language models fell short of average human-rater performance.	Paper Repository
MATH	Evaluates advanced mathematical reasoning on competition-style problems.	Paper Repository
BIG-bench	Evaluates broad and diverse emergent capabilities across many challenging language tasks.	Repository Paper
MT-Bench	Measures multi-turn instruction-following quality and chat performance, often scored by an LLM acting as judge.	Paper Implementation
GPQA	Evaluates graduate-level domain knowledge and difficult reasoning in science-focused questions.	Paper
DROP	Tests reading comprehension with discrete reasoning over passages (counts, arithmetic, and comparisons).	Dataset Paper
SQuAD	Evaluates extractive question answering and reading comprehension on Wikipedia passages.	Dataset Paper
HumanEval	Measures code generation quality by checking functional correctness of generated Python functions.	Repository
SWE-bench	Tests software engineering ability by solving real GitHub issues in open-source repositories.	Leaderboard Paper
MMMU	Assesses multimodal understanding and reasoning with image-plus-text tasks across disciplines.	Project Page Paper
HELM	Provides holistic evaluation across scenarios and metrics such as accuracy, calibration, and robustness.	Dashboard Paper
LMArena	Ranks models through crowdsourced, blind, head-to-head comparisons voted on by users (formerly Chatbot Arena).	Website YouTube tutorials
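
The head-to-head comparisons in the last row have to be turned into a ranking somehow. A minimal sketch of one common approach, an Elo-style rating update, is shown below. This is an illustrative assumption and not necessarily the method LMArena uses. The function names, the model names, the K-factor of 32, and the starting rating of 1000 are all hypothetical.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rate_battles(battles, k=32.0, initial=1000.0):
    """Apply sequential Elo updates to a list of pairwise battles.

    Each battle is (model_a, model_b, outcome), where outcome is
    1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        ea = expected_score(ra, rb)
        # The winner gains exactly what the loser gives up (zero-sum update).
        ratings[model_a] = ra + k * (outcome - ea)
        ratings[model_b] = rb - k * (outcome - ea)
    return dict(ratings)


# Hypothetical votes: model-x wins twice, model-y and model-z tie.
battles = [
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-x", "model-z", 1.0),
]
ratings = rate_battles(battles)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Sequential updates like this depend on the order in which votes arrive, so a production leaderboard may prefer fitting all votes at once with a statistical model instead.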