100+ LLM benchmarks and evaluation datasets
AlpacaEval
MT-Bench-101
Chatbot Arena
HarmBench
MMLU Pro
MixEval
SimpleQA
CRUXEval (Code Reasoning, Understanding, and Execution Evaluation)
AgentHarm
StrongReject
BFCL (Berkeley Function-Calling Leaderboard)
TrustLLM
BigCodeBench
AIR-Bench
WildChat
API-Bank
CrossCodeEval
SEED-Bench
Chain-of-Thought Hub
ForbiddenQuestions
MuSR
ToolLLM
FreshQA
MT-Bench
ToolBench
AgentBench
Q-bench
EvalPlus
MaliciousInstruct
SycophancyEval
DecodingTrust
AdvBench
XSTest
ClassEval
MetaTool
M3Exam
OpinionQA
SafetyBench
GPQA
RepoBench
IFEval
AGIEval
HarmfulQA
QHarm
LegalBench
MMMU
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT
SWE-bench
Webarena
BeaverTails
Code Lingua
SummEdits
EvalCrafter
MME
DoNotAnswer
ScienceQA
DS-1000
MedMCQA
ToxiGen
HELM
HHH (Helpfulness, Honesty, Harmlessness)
PersonalInfoLeak
e-CARE (explainable CAusal REasoning dataset)
MGSM (Multilingual Grade School Math)
BigBench Hard
PlanBench
BigBench
AnthropicRedTeam
SVAMP
MATH
BEIR
SpartQA
TAT-QA
CodeXGLUE
TruthfulQA
APPS (Automated Programming Progress Standard)
BOLD
BBQ
MBPP (Mostly Basic Programming Problems)
HumanEval
GSM8K
StereoSet
ETHICS
Social Chemistry 101
RealToxicityPrompts
MMLU
CrowS-Pairs (Crowdsourced Stereotype Pairs)
MLSUM
MedQA
Natural Questions
ANLI
SEAT (Sentence Encoder Association Test)
BoolQ
SuperGLUE
DROP (Discrete Reasoning Over Paragraphs)
OpenDialKG
HellaSwag
PubMedQA
Winogrande
PIQA (Physical Interaction QA)
HotpotQA
GLUE (General Language Understanding Evaluation)
OpenBookQA
WinoGender
CoQA (Conversational Question Answering)
SQuAD2.0
ARC
SWAG
CommonsenseQA
QuAC (Question Answering in Context)
RACE (ReAding Comprehension Dataset From Examinations)
SciQ
TriviaQA
MultiNLI (Multi-Genre Natural Language Inference)
SQuAD (Stanford Question Answering Dataset)
LAMBADA (LAnguage Modelling Broadened to Account for Discourse Aspects)
MS MARCO
Type
instruction-following,conversation & chatbots
conversation & chatbots
conversation & chatbots
safety
knowledge,language & reasoning
conversation & chatbots
safety
coding
safety
safety
agents & tools use
safety,bias & ethics
coding
safety
conversation & chatbots
agents & tools use
coding
multimodal
language & reasoning
safety
language & reasoning
agents & tools use
knowledge
conversation & chatbots
agents & tools use
agents & tools use
multimodal
coding
safety
safety
safety
safety
safety
coding
agents & tools use
multimodal
safety
safety
language & reasoning
coding
language & reasoning,instruction-following
language & reasoning
safety
safety
domain-specific
multimodal,language & reasoning
multimodal
coding
agents & tools use
safety
coding
language & reasoning
video,multimodal
multimodal
safety
knowledge,language & reasoning,multimodal
coding
domain-specific
safety
language & reasoning,safety
safety
safety
language & reasoning
math
knowledge,language & reasoning
language & reasoning
knowledge,language & reasoning
safety
math
math
information retrieval,language & reasoning
language & reasoning
domain-specific
coding
knowledge,language & reasoning,safety
coding
safety,bias & ethics
safety,bias & ethics
coding
coding
math
safety,bias & ethics
safety,bias & ethics
language & reasoning,bias & ethics
safety
knowledge,language & reasoning
safety,bias & ethics
summarization,language & reasoning
domain-specific
language & reasoning
language & reasoning
safety,bias & ethics
language & reasoning
language & reasoning
language & reasoning
conversation & chatbots
language & reasoning
domain-specific
language & reasoning
language & reasoning
language & reasoning
language & reasoning
language & reasoning
safety,bias & ethics
conversation & chatbots
language & reasoning
knowledge,language & reasoning
language & reasoning
language & reasoning
conversation & chatbots
language & reasoning
language & reasoning
language & reasoning
language & reasoning
language & reasoning
language & reasoning
language & reasoning
Description
An automatic evaluator for instruction-following LLMs.
Multi-turn dialogues.
Open-source platform for comparing LLMs in a competitive environment.
Adversarial behaviors including cybercrime, copyright violations, and generating misinformation (https://www.harmbench.org).
An enhanced dataset designed to extend the MMLU benchmark with more challenging questions and a choice set of ten options.
A ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures.
Measures the ability of language models to answer short, fact-seeking questions, as a probe for hallucinations.
A set of Python functions and input-output pairs that consists of two tasks: input prediction and output prediction.
Explicitly malicious agent tasks, including fraud, cybercrime, and harassment.
Tests a model’s resistance against common attacks from the literature.
A set of function-calling tasks, including multiple and parallel function calls.
A benchmark across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. Consists of over 30 datasets covering these dimensions.
Function-level code generation tasks with complex instructions and diverse function calls.
AI safety benchmark aligned with emerging regulations. Considers operational, content safety, legal, and societal risks.
A collection of 1 million conversations between human users and ChatGPT, alongside demographic data.
Specifically designed for tool-augmented LLMs.
Multilingual code completion tasks built on real-world GitHub repositories in Python, Java, TypeScript, and C#.
A benchmark for evaluating Multimodal LLMs using multiple-choice questions.
Curated complex reasoning tasks, including math, science, coding, and long-context.
A set of questions targeting 13 behavior scenarios disallowed by OpenAI.
Multistep reasoning tasks based on text narratives (e.g., 1,000-word murder mysteries).
An instruction-tuning dataset for tool use.
Tests factuality of LLM-generated text in the context of answering questions that test current world knowledge. The dataset is regularly updated to stay current.
Multi-turn questions: an open-ended question and a follow-up question.
A tool manipulation benchmark consisting of software tools for real-world tasks.
Evaluates LLM-as-Agent across 8 environments, including Operating System (OS), Database (DB), Knowledge Graph (KG), Digital Card Game (DCG), and Lateral Thinking Puzzles (LTP).
Evaluates MLLMs on three dimensions: low-level visual perception, low-level visual description, and overall visual quality assessment.
Extends HumanEval and MBPP with 80x/35x more tests for rigorous evaluation.
Covers ten 'malicious intentions', including psychological manipulation, theft, cyberbullying, and fraud.
Tests if human feedback encourages model responses to match user beliefs over truthful ones, a behavior known as sycophancy.
Evaluates trustworthiness of LLMs across 8 perspectives, including toxicity, stereotypes, adversarial robustness, privacy, machine ethics, and fairness.
A set of 500 harmful strings that the model should not reproduce and 500 harmful instructions.
A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models.
Class-level Python code generation tasks.
A set of user queries in the form of prompts that trigger LLMs to use tools, including both single-tool and multi-tool scenarios.
A set of human exam questions in 9 diverse languages with three educational levels, where about 23% of the questions require processing images.
A dataset for evaluating the alignment of LM opinions with those of 60 US demographic groups.
Multiple-choice questions concerning offensive content, bias, illegal activities, and mental health.
A set of multiple-choice questions written by domain experts in biology, physics, and chemistry.
Consists of three interconnected evaluation tasks: retrieve the most relevant code snippets, predict the next line of code, and handle both retrieval and completion in an end-to-end pipeline.
A set of prompts with verifiable instructions, such as "write in more than 400 words".
A collection of standardized tests, including GRE, GMAT, SAT, LSAT.
Harmful questions covering 10 topics and ~10 subtopics each.
Dataset consists of human-written entries sampled randomly from AnthropicHarmlessBase.
Collaboratively curated tasks for evaluating legal reasoning in English LLMs.
Evaluates multimodal models on massive multi-discipline tasks demanding college-level subject knowledge. Includes 11.5K questions.
A framework for quantitatively evaluating interactive LLMs such as ChatGPT using 23 data sets covering 8 common NLP tasks.
Real-world software issues collected from GitHub.
An environment for autonomous agents that perform tasks on the web.
A set of prompts sampled from AnthropicRedTeam that cover 14 harm categories.
Compares the ability of LLMs to understand what the code implements in a source language and translate the same semantics into a target language.
Inconsistency detection in summaries.
A framework and pipeline for evaluating generated videos on aspects such as visual quality, content quality, motion quality, and text-video alignment.
Measures both perception and cognition abilities on a total of 14 subtasks.
The dataset consists of prompts across 12 harm types to which responsible LLMs do not answer.
Multimodal multiple choice questions with diverse science topics and annotations of their answers with corresponding lectures and explanations.
Code generation benchmark with data science problems spanning seven Python libraries, such as NumPy and Pandas.
Four-option multiple-choice questions from Indian medical entrance examinations. Covers 2,400 healthcare topics and 21 medical subjects.
A set of toxic and benign statements about minority groups.
Reasoning tasks in several domains (reusing other benchmarks) with a focus on multi-metric evaluation (https://crfm.stanford.edu/helm/).
Human preference data about helpfulness and harmlessness.
Evaluates whether LLMs are prone to leaking PII, contains name-email pairs.
A human-annotated dataset that contains causal reasoning questions.
Grade-school math problems from the GSM8K dataset, translated into 10 languages.
A suite of BigBench tasks for which LLMs did not outperform the average human-rater.
A benchmark designed to evaluate the ability of LLMs to generate plans of action and reason about change.
Set of questions crowdsourced by domain experts in math, biology, physics, and beyond.
Human-generated and annotated red teaming dialogues.
Grade-school-level math word problems that require models to perform single-variable arithmetic operations. Created by applying simple variations to existing problems.
Tasks from US mathematics competitions that cover algebra, calculus, geometry, and statistics.
BEIR is a heterogeneous benchmark for information retrieval (IR) tasks, contains 15+ IR datasets.
A textual question answering benchmark for spatial reasoning on natural language text.
Questions and associated hybrid contexts from real-world financial reports.
14 datasets for program understanding and generation and three baseline systems, including BERT-style, GPT-style, and Encoder-Decoder models.
Evaluates how well models generate truthful responses.
A dataset for code generation, including introductory to competitive programming problems.
A set of unfinished sentences from Wikipedia designed to assess bias in text generation.
Evaluates social biases of LLMs in question answering.
Crowd-sourced entry-level programming tasks.
Programming tasks and unit tests to check model-generated code.
Grade school math word problems.
A large-scale natural dataset in English to measure stereotypical biases in four domains: gender, profession, race, and religion.
A set of binary-choice questions on ethics with two actions to choose from.
A conceptual formalism to study people’s everyday social norms and moral judgments.
A dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores.
Multi-choice tasks across 57 subjects, high school to expert level.
Covers stereotypes dealing with nine types of bias, like race, religion, and age.
Multilingual summarization dataset crawled from different news websites.
Free-form multiple-choice OpenQA dataset for solving medical problems collected from the professional medical board exams
User questions issued to Google search, and answers found from Wikipedia by annotators.
Large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure.
Measures bias in sentence encoders.
Yes/No questions from Google searches, paired with Wikipedia passages.
Improved and more challenging version of GLUE benchmark.
Tasks to resolve references in a question and perform discrete operations over them (such as addition, counting, or sorting).
A dataset of conversations between two crowdsourcing agents engaging in a dialog about a given topic.
Predict the most likely ending of a sentence, multiple-choice.
A dataset for biomedical research question answering.
Fill-in-a-blank tasks resolving ambiguities in pronoun references with binary options.
Naive physics reasoning tasks focusing on how we interact with everyday objects in everyday situations.
A set of Wikipedia-based question-answer pairs with multi-hop questions.
Tool for evaluating and analyzing the performance of models on NLU tasks. Was quickly outperformed by LLMs and replaced by SuperGLUE.
Question answering dataset, modeled after open book exams.
Pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in coreference resolution.
Questions with answers collected from 8000+ conversations.
Combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
Grade-school level, multiple-choice science questions.
Multi-choice tasks of grounded commonsense inference with adversarial filtering.
Multiple-choice question answering dataset that requires commonsense knowledge to predict the correct answers.
Question-answer pairs, simulating student-teacher interactions.
Reading comprehension tasks collected from the English exams for middle and high school Chinese students.
Multiple choice science exam questions.
A large-scale question-answering dataset.
A crowdsourced collection of sentence pairs annotated with textual entailment information.
A reading comprehension dataset consisting of 100,000 questions posed by crowdworkers on a set of Wikipedia articles.
A set of passages composed of a context and a target sentence. The task is to guess the last word of the target sentence.
Questions sampled from Bing's search query logs and passages from web documents.
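Many of the math entries above (GSM8K, SVAMP, MATH, MGSM) are scored by comparing the model's final numeric answer against a reference. Below is a minimal sketch of such exact-match scoring, assuming the GSM8K convention that the gold solution ends with "#### <answer>"; the parsing regex and the model output string are illustrative, not any benchmark's official scorer.

```python
# Minimal sketch: exact-match scoring for GSM8K-style math word problems.
# Assumes the reference solution ends with "#### <answer>" (GSM8K convention).
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text, with commas stripped."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "").rstrip(".") if matches else None

def is_correct(model_output: str, reference_answer: str) -> bool:
    """Compare the model's final number to the number after '####' in the reference."""
    gold = reference_answer.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and pred == gold

# Illustrative example (not a real dataset record):
print(is_correct("9 * 2 = 18, so the answer is 18.", "She makes 9 * 2 = 18 dollars. #### 18"))
```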
Benchmark paper
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
https://arxiv.org/abs/2404.04475
MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
https://arxiv.org/abs/2402.14762
https://arxiv.org/abs/2403.04132
https://arxiv.org/abs/2402.04249
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
https://arxiv.org/abs/2406.01574
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
https://arxiv.org/abs/2406.06565
Measuring short-form factuality in large language models
https://arxiv.org/abs/2411.04368
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
https://arxiv.org/abs/2401.03065
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
https://arxiv.org/abs/2410.09024
A StrongREJECT for Empty Jailbreaks
https://arxiv.org/abs/2402.10260
Berkeley Function-Calling Leaderboard
https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html
TrustLLM: Trustworthiness in Large Language Models
https://arxiv.org/abs/2401.05561
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
https://arxiv.org/abs/2406.15877
https://arxiv.org/abs/2407.17436
WildChat: 1M ChatGPT Interaction Logs in the Wild
https://arxiv.org/abs/2405.01470
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
https://arxiv.org/abs/2304.08244
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
https://arxiv.org/abs/2310.11248
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
https://arxiv.org/abs/2307.16125
Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance
https://arxiv.org/abs/2305.17306
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
https://arxiv.org/abs/2308.03825
https://arxiv.org/abs/2310.16049
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
https://arxiv.org/abs/2307.16789
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
https://arxiv.org/abs/2310.03214
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
https://arxiv.org/abs/2306.05685
On the Tool Manipulation Capability of Open-source Large Language Models
https://arxiv.org/abs/2305.16504
AgentBench: Evaluating LLMs as Agents
https://arxiv.org/abs/2308.03688
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
https://arxiv.org/abs/2309.14181
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
https://arxiv.org/abs/2305.01210
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
https://arxiv.org/abs/2310.06987
Towards Understanding Sycophancy in Language Models
https://arxiv.org/abs/2310.13548
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
https://arxiv.org/abs/2306.11698
Universal and Transferable Adversarial Attacks on Aligned Language Models
https://arxiv.org/abs/2307.15043
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
https://arxiv.org/abs/2308.01263
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
https://arxiv.org/abs/2308.01861
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
https://arxiv.org/abs/2310.03128
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
https://arxiv.org/abs/2306.05179
Whose Opinions Do Language Models Reflect?
https://arxiv.org/abs/2303.17548
SafetyBench: Evaluating the Safety of Large Language Models
https://arxiv.org/abs/2309.07045
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
https://arxiv.org/abs/2311.12022
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
https://arxiv.org/abs/2306.03091
https://arxiv.org/abs/2311.07911
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
https://arxiv.org/abs/2304.06364
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
https://arxiv.org/abs/2308.09662
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
https://arxiv.org/abs/2309.07875
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
https://arxiv.org/abs/2308.11462
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
https://arxiv.org/abs/2311.16502
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
https://arxiv.org/abs/2302.04023
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://arxiv.org/abs/2310.06770
WebArena: A Realistic Web Environment for Building Autonomous Agents
https://arxiv.org/abs/2307.13854
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
https://arxiv.org/abs/2307.04657
Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code
https://arxiv.org/abs/2308.03109
https://arxiv.org/abs/2305.14540
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
https://arxiv.org/abs/2310.11440
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
https://arxiv.org/abs/2306.13394
Do-Not-Answer: Evaluating Safeguards in LLMs
https://arxiv.org/abs/2308.13387
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
https://arxiv.org/abs/2209.09513
DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
https://arxiv.org/abs/2211.11501
MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering
https://arxiv.org/abs/2203.14371
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
https://arxiv.org/abs/2203.09509
https://arxiv.org/abs/2211.09110
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
https://arxiv.org/abs/2204.05862
Are Large Pre-Trained Language Models Leaking Your Personal Information?
https://arxiv.org/abs/2205.12628
e-CARE: a New Dataset for Exploring Explainable Causal Reasoning
https://arxiv.org/abs/2205.05849
Language Models are Multilingual Chain-of-Thought Reasoners
https://arxiv.org/abs/2210.03057
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
https://arxiv.org/abs/2210.09261
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
https://arxiv.org/abs/2206.10498
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
https://arxiv.org/abs/2206.04615
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
https://arxiv.org/abs/2209.07858
Are NLP Models really able to Solve Simple Math Word Problems?
https://arxiv.org/abs/2103.07191
Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
https://arxiv.org/abs/2104.08663
SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning
https://arxiv.org/abs/2104.05832
TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance
https://arxiv.org/abs/2105.07624
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
https://arxiv.org/abs/2102.04664
TruthfulQA: Measuring How Models Mimic Human Falsehoods
https://arxiv.org/abs/2109.07958v2
Measuring Coding Challenge Competence With APPS
https://arxiv.org/abs/2105.09938
BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
https://arxiv.org/abs/2101.11718
BBQ: A Hand-Built Bias Benchmark for Question Answering
https://arxiv.org/abs/2110.08193
Program Synthesis with Large Language Models
https://arxiv.org/abs/2108.07732
Evaluating Large Language Models Trained on Code
https://arxiv.org/abs/2107.03374
Training Verifiers to Solve Math Word Problems
https://arxiv.org/abs/2110.14168
StereoSet: Measuring stereotypical bias in pretrained language models
https://arxiv.org/abs/2004.09456
Aligning AI With Shared Human Values
https://arxiv.org/abs/2008.02275
Social Chemistry 101: Learning to Reason about Social and Moral Norms
https://arxiv.org/abs/2011.00620
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
https://arxiv.org/abs/2009.11462
Measuring Massive Multitask Language Understanding
https://arxiv.org/abs/2009.03300
CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
https://arxiv.org/abs/2010.00133
MLSUM: The Multilingual Summarization Corpus
https://arxiv.org/abs/2004.14900
What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
https://arxiv.org/abs/2009.13081
Natural Questions: A Benchmark for Question Answering Research
https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00276/43518/Natural-Questions-A-Benchmark-for-Question
Adversarial NLI: A New Benchmark for Natural Language Understanding
https://arxiv.org/abs/1910.14599
On Measuring Social Biases in Sentence Encoders
https://arxiv.org/abs/1903.10561
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
https://arxiv.org/abs/1905.10044
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
https://arxiv.org/abs/1905.00537
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
https://arxiv.org/abs/1903.00161
OpenDialKG: Explainable Conversational Reasoning with Attention-based Walks over Knowledge Graphs
https://aclanthology.org/P19-1081/
HellaSwag: Can a Machine Really Finish Your Sentence?
https://arxiv.org/abs/1905.07830
PubMedQA: A Dataset for Biomedical Research Question Answering
https://arxiv.org/abs/1909.06146
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
https://arxiv.org/abs/1907.10641
PIQA: Reasoning about Physical Commonsense in Natural Language
https://arxiv.org/abs/1911.11641
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
https://arxiv.org/abs/1809.09600
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
https://arxiv.org/abs/1804.07461
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
https://arxiv.org/abs/1809.02789
Gender Bias in Coreference Resolution
https://arxiv.org/abs/1804.09301
CoQA: A Conversational Question Answering Challenge
https://arxiv.org/abs/1808.07042
Know What You Don't Know: Unanswerable Questions for SQuAD
https://arxiv.org/abs/1806.03822
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
https://arxiv.org/abs/1803.05457
https://arxiv.org/abs/1808.05326
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
https://arxiv.org/abs/1811.00937
QuAC : Question Answering in Context
https://arxiv.org/abs/1808.07036
RACE: Large-scale ReAding Comprehension Dataset From Examinations
https://arxiv.org/abs/1704.04683
Crowdsourcing Multiple Choice Science Questions
https://arxiv.org/abs/1707.06209
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
https://arxiv.org/abs/1705.03551
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
https://arxiv.org/abs/1704.05426
SQuAD: 100,000+ Questions for Machine Comprehension of Text
https://arxiv.org/abs/1606.05250
The LAMBADA dataset: Word prediction requiring a broad discourse context
https://arxiv.org/abs/1606.06031
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
https://arxiv.org/abs/1611.09268
Code repository
https://github.com/tatsu-lab/alpaca_eval
https://github.com/mtbench101/mt-bench-101
https://github.com/lm-sys/FastChat/tree/main
https://github.com/centerforaisafety/HarmBench/tree/main
https://github.com/TIGER-AI-Lab/MMLU-Pro
https://github.com/Psycoy/MixEval
https://github.com/openai/simple-evals
https://github.com/facebookresearch/cruxeval
https://github.com/UKGovernmentBEIS/inspect_evals
https://github.com/dsbowen/strong_reject
https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
https://github.com/HowieHwong/TrustLLM
https://github.com/bigcode-project/bigcodebench
https://github.com/stanford-crfm/air-bench-2024
n/a
https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank
https://github.com/amazon-science/cceval
https://github.com/AILab-CVC/SEED-Bench
https://github.com/FranxYao/chain-of-thought-hub/
https://github.com/verazuo/jailbreak_llms
https://github.com/Zayne-sprague/MuSR
https://github.com/OpenBMB/ToolBench
https://github.com/freshllms/freshqa
https://github.com/lm-sys/FastChat/tree/main
https://github.com/sambanova/toolbench/tree/main
https://github.com/THUDM/AgentBench
https://github.com/Q-Future/Q-Bench
https://github.com/evalplus/evalplus
https://github.com/Princeton-SysML/Jailbreak_LLM/tree/main
https://github.com/meg-tong/sycophancy-eval
https://github.com/AI-secure/DecodingTrust
https://github.com/llm-attacks/llm-attacks
https://github.com/paul-rottger/exaggerated-safety
https://github.com/FudanSELab/ClassEval
https://github.com/HowieHwong/MetaTool
https://github.com/DAMO-NLP-SG/M3Exam
https://github.com/tatsu-lab/opinions_qa
https://github.com/thu-coai/SafetyBench
https://github.com/idavidrein/gpqa
https://github.com/Leolty/repobench
https://github.com/google-research/google-research/tree/master/instruction_following_eval
https://github.com/ruixiangcui/AGIEval/tree/main
https://github.com/declare-lab/red-instruct
https://github.com/vinid/safety-tuned-llamas
https://github.com/HazyResearch/legalbench/
https://github.com/MMMU-Benchmark/MMMU
https://github.com/HLTCHKUST/chatgpt-evaluation
https://github.com/princeton-nlp/SWE-bench
https://github.com/web-arena-x/webarena
https://github.com/PKU-Alignment/beavertails
https://github.com/codetlingua/codetlingua
https://github.com/salesforce/factualNLG
https://github.com/EvalCrafter/EvalCrafter
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation
https://github.com/Libr-AI/do-not-answer
https://github.com/lupantech/ScienceQA
https://github.com/xlang-ai/DS-1000
https://github.com/medmcqa/medmcqa
https://github.com/microsoft/TOXIGEN/tree/main
https://github.com/stanford-crfm/helm
https://github.com/anthropics/hh-rlhf
https://github.com/jeffhj/LM_PersonalInfoLeak
https://github.com/Waste-Wood/e-CARE
https://github.com/google-research/url-nlp
https://github.com/suzgunmirac/BIG-Bench-Hard
https://github.com/karthikv792/LLMs-Planning/tree/main/plan-bench
https://github.com/google/BIG-bench
https://github.com/anthropics/hh-rlhf
https://github.com/arkilpatel/SVAMP
https://github.com/hendrycks/math/?tab=readme-ov-file
https://github.com/beir-cellar/beir
https://github.com/HLR/SpartQA-baselines
https://github.com/NExTplusplus/TAT-QA
https://github.com/microsoft/CodeXGLUE
https://github.com/sylinrl/TruthfulQA
https://github.com/hendrycks/apps
https://github.com/amazon-science/bold
https://github.com/nyu-mll/BBQ
https://github.com/google-research/google-research/blob/master/mbpp/README.md
https://github.com/openai/human-eval
https://github.com/openai/grade-school-math
https://github.com/moinnadeem/StereoSet
https://github.com/hendrycks/ethics
https://github.com/mbforbes/social-chemistry-101
https://github.com/allenai/real-toxicity-prompts?tab=readme-ov-file
https://github.com/hendrycks/test/tree/master
https://github.com/nyu-mll/crows-pairs
https://github.com/ThomasScialom/MLSUM
https://github.com/jind11/MedQA
https://github.com/google-research-datasets/natural-questions
https://github.com/facebookresearch/anli
https://github.com/W4ngatang/sent-bias
https://github.com/google-research-datasets/boolean-questions
https://github.com/nyu-mll/jiant
https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/drop/README.md
https://github.com/facebookresearch/opendialkg
https://github.com/rowanz/hellaswag/tree/master
https://github.com/pubmedqa/pubmedqa
https://github.com/allenai/winogrande
https://github.com/ybisk/ybisk.github.io/tree/master/piqa
https://github.com/hotpotqa/hotpot
https://github.com/nyu-mll/GLUE-baselines
https://github.com/allenai/OpenBookQA
https://github.com/rudinger/winogender-schemas
https://stanfordnlp.github.io/coqa/
https://rajpurkar.github.io/SQuAD-explorer/
https://github.com/allenai/aristo-leaderboard/tree/master/arc
https://github.com/rowanz/swagaf
https://github.com/jonathanherzig/commonsenseqa
https://quac.ai/
n/a
n/a
https://github.com/mandarjoshi90/triviaqa
https://github.com/nyu-mll/multiNLI
https://rajpurkar.github.io/SQuAD-explorer/
n/a
https://microsoft.github.io/msmarco/
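The coding entries in this list (HumanEval, MBPP, EvalPlus, BigCodeBench, and others) are scored by executing model completions against unit tests rather than by string matching. As one concrete example, the openai/human-eval repository linked above expects a JSONL file of completions and then runs its own sandboxed checker. A rough sketch of the generation side follows; generate_one_completion is a placeholder for your own model call, and the exact entry-point names should be verified against the repository.

```python
# Sketch of preparing completions for the openai/human-eval harness (linked above).
# generate_one_completion is a placeholder; plug in your own model call.
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Call your model here and return only the code completing the given function stub.
    raise NotImplementedError

problems = read_problems()  # maps task_id -> {"prompt", "test", "entry_point", ...}
samples = [
    {"task_id": task_id, "completion": generate_one_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Scoring (running the unit tests and computing pass@k) is then done by the
# repository's evaluate_functional_correctness command on samples.jsonl.
```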
Dataset
https://huggingface.co/datasets/tatsu-lab/alpaca_eval
https://github.com/mtbench101/mt-bench-101/tree/main/data/subjective
https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
https://github.com/centerforaisafety/HarmBench/tree/main/data/behavior_datasets
https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
https://huggingface.co/datasets/MixEval/MixEval
https://huggingface.co/datasets/basicv8vc/SimpleQA
https://huggingface.co/datasets/cruxeval-org/cruxeval
https://huggingface.co/datasets/ai-safety-institute/AgentHarm
https://github.com/dsbowen/strong_reject/tree/main/docs/api
https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard
https://github.com/HowieHwong/TrustLLM?tab=readme-ov-file#dataset-download
https://github.com/bigcode-project/bigcodebench
https://huggingface.co/datasets/stanford-crfm/air-bench-2024
https://huggingface.co/datasets/allenai/WildChat-1M
https://huggingface.co/datasets/liminghao1630/API-Bank
https://github.com/amazon-science/cceval/tree/main/data
https://huggingface.co/datasets/AILab-CVC/SEED-Bench-2
see repository
https://github.com/verazuo/jailbreak_llms
https://github.com/Zayne-sprague/MuSR/tree/main/datasets
https://github.com/OpenBMB/ToolBench?tab=readme-ov-file#data-release
https://github.com/freshllms/freshqa?tab=readme-ov-file#freshqa
https://huggingface.co/datasets/lmsys/mt_bench_human_judgments
https://github.com/sambanova/toolbench/tree/main
https://github.com/THUDM/AgentBench/tree/main/data
https://huggingface.co/datasets/q-future/Q-Bench-HF
https://github.com/evalplus/evalplus/tree/master/evalplus/data
https://github.com/Princeton-SysML/Jailbreak_LLM/tree/main/data
https://huggingface.co/datasets/meg-tong/sycophancy-eval
https://huggingface.co/datasets/AI-Secure/DecodingTrust
https://github.com/llm-attacks/llm-attacks/tree/main/data/advbench
https://github.com/paul-rottger/exaggerated-safety/blob/main/xstest_v2_prompts.csv
https://huggingface.co/datasets/FudanSELab/ClassEval
https://github.com/HowieHwong/MetaTool/tree/master/dataset
https://github.com/DAMO-NLP-SG/M3Exam?tab=readme-ov-file#data
https://worksheets.codalab.org/worksheets/0x6fb693719477478aac73fc07db333f69
https://huggingface.co/datasets/thu-coai/SafetyBench
https://huggingface.co/datasets/Idavidrein/gpqa
https://huggingface.co/datasets/tianyang/repobench-c
https://huggingface.co/datasets/tianyang/repobench-p
https://github.com/google-research/google-research/tree/master/instruction_following_eval
https://github.com/ruixiangcui/AGIEval/tree/main/data
https://huggingface.co/datasets/declare-lab/HarmfulQA
https://github.com/vinid/safety-tuned-llamas
https://huggingface.co/datasets/nguha/legalbench
https://huggingface.co/datasets/MMMU/MMMU
https://github.com/HLTCHKUST/chatgpt-evaluation/tree/main/src
https://huggingface.co/datasets/princeton-nlp/SWE-bench
https://github.com/web-arena-x/webarena/blob/main/config_files/test.raw.json
https://huggingface.co/datasets/PKU-Alignment/BeaverTails
https://huggingface.co/iidai
https://github.com/salesforce/factualNLG/tree/master/data/summedits
https://huggingface.co/datasets/RaphaelLiu/EvalCrafter_T2V_Dataset
https://huggingface.co/datasets/lmms-lab/MME
https://huggingface.co/datasets/LibrAI/do-not-answer
https://huggingface.co/datasets/derek-thomas/ScienceQA
https://huggingface.co/datasets/xlangai/DS-1000
https://github.com/medmcqa/medmcqa
https://huggingface.co/datasets/toxigen/toxigen-data
see repository
https://github.com/anthropics/hh-rlhf
https://github.com/jeffhj/LM_PersonalInfoLeak/tree/main/data
https://github.com/Waste-Wood/e-CARE/tree/main/dataset
https://huggingface.co/datasets/juletxara/mgsm
https://huggingface.co/datasets/maveriq/bigbenchhard
https://huggingface.co/datasets/tasksource/planbench
https://huggingface.co/datasets/google/bigbench
https://huggingface.co/datasets/Anthropic/hh-rlhf
https://github.com/arkilpatel/SVAMP/tree/main/data
https://github.com/hendrycks/math/?tab=readme-ov-file
https://huggingface.co/BeIR
https://github.com/HLR/SpartQA_generation
https://github.com/NExTplusplus/TAT-QA/tree/master/dataset_raw
https://huggingface.co/datasets?search=code_x_glue
https://huggingface.co/datasets/truthfulqa/truthful_qa
https://huggingface.co/datasets/codeparrot/apps
https://github.com/amazon-science/bold/tree/main/prompts
https://github.com/nyu-mll/BBQ/tree/main/data
https://github.com/google-research/google-research/blob/master/mbpp/mbpp.jsonl
https://huggingface.co/datasets/openai/openai_humaneval
https://github.com/openai/grade-school-math/tree/master/grade_school_math/data
https://huggingface.co/datasets/McGill-NLP/stereoset
https://huggingface.co/datasets/hendrycks/ethics
https://github.com/mbforbes/social-chemistry-101?tab=readme-ov-file#data
https://huggingface.co/datasets/allenai/real-toxicity-prompts
https://huggingface.co/datasets/cais/mmlu
https://github.com/nyu-mll/crows-pairs/blob/master/data/crows_pairs_anonymized.csv
https://huggingface.co/datasets/GEM/mlsum
https://github.com/jind11/MedQA
https://ai.google.com/research/NaturalQuestions
https://huggingface.co/datasets/facebook/anli
https://github.com/W4ngatang/sent-bias/tree/master/tests
https://github.com/google-research-datasets/boolean-questions
https://huggingface.co/datasets/aps/super_glue
https://huggingface.co/datasets/ucinlp/drop
https://github.com/facebookresearch/opendialkg/tree/main/data
https://github.com/rowanz/hellaswag/tree/master/data
https://github.com/pubmedqa/pubmedqa
https://huggingface.co/datasets/allenai/winogrande
https://huggingface.co/datasets/ybisk/piqa
https://hotpotqa.github.io/
https://huggingface.co/datasets/nyu-mll/glue
https://huggingface.co/datasets/allenai/openbookqa
https://huggingface.co/datasets/oskarvanderwal/winogender
https://stanfordnlp.github.io/coqa/
https://huggingface.co/datasets/bayes-group-diffusion/squad-2.0
https://huggingface.co/datasets/allenai/ai2_arc
https://github.com/rowanz/swagaf/tree/master/data
https://github.com/jonathanherzig/commonsenseqa
https://quac.ai/
https://www.cs.cmu.edu/~glai1/data/race/
https://huggingface.co/datasets/allenai/sciq
https://huggingface.co/datasets/mandarjoshi/trivia_qa
https://huggingface.co/datasets/nyu-mll/multi_nli
https://huggingface.co/datasets/rajpurkar/squad
https://huggingface.co/datasets/cimec/lambada
https://huggingface.co/datasets/microsoft/ms_marco
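Many of the dataset links above point to the Hugging Face Hub, so the corresponding benchmarks can be pulled with the `datasets` library. The sketch below loads two of the linked datasets; the config and split names are taken from the respective dataset cards and should be treated as assumptions to verify there.

```python
# Sketch: loading two of the Hub-hosted datasets linked above with the `datasets` library.
# Config and split names may change; check the dataset cards (cais/mmlu, allenai/ai2_arc).
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

print(mmlu[0]["question"], mmlu[0]["choices"], mmlu[0]["answer"])
print(arc[0]["question"], arc[0]["answerKey"])
```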
Number of examples License Initial publication year Cited by:
n/a CC-BY-NC-4.0 2024 New
4208 Apache-2.0 license 2024 New
33000 see dataset page 2024 New
510 MIT License 2024 New
12100 MIT License 2024 New
5000 Apache-2.0 license 2024 New
4326 MIT License 2024 New
800 MIT License 2024 New
110 MIT License,see dataset page 2024 New
n/a MIT License 2024 New
2000 Apache-2.0 license 2024 New
n/a MIT License 2024 New
1140 Apache-2.0 license 2024 New
5,694 Apache-2.0 license 2024 New
1,000,000+ ODC-BY license 2024 New
n/a MIT License 2023 51-100
10000 Apache-2.0 license 2023 51-100
24000 CC-BY-NC-4.0 2023 101-500
1000+ MIT License 2023 11-50
15140 MIT License 2023 101-500
756 MIT License 2023 11-50
n/a Apache-2.0 license 2023 101-500
599 Apache-2.0 license 2023 101-500
3300 CC-BY-4.0 2023 >500
n/a Apache-2.0 license 2023 51-100
1360 Apache-2.0 license 2023 101-500
2990 S-Lab License 1.0 2023 51-100
n/a Apache-2.0 license 2023 >500
100 n/a 2023 101-500
n/a MIT License 2023 101-500
243,877 CC-BY-SA-4.0 2023 101-500
1000 MIT License 2023 >500
450 CC-BY-4.0 2023 51-100
100 MIT License 2023 101-500
20879 MIT License 2023 51-100
12317 n/a 2023 51-100
1498 n/a 2023 101-500
11435 MIT License 2023 101-500
448 CC-BY-4.0 2023 101-500
unspecified CC-BY-NC-ND 4.0 2023 51-100
500 Apache-2.0 license 2023 101-500
n/a MIT License 2023 101-500
1960 Apache-2.0 license 2023 51-100
100 CC-BY-SA-4.0 2023 101-500
162 see dataset page 2023 101-500
11500 Apache-2.0 license 2023 101-500
n/a see dataset page 2023 >500
2200 MIT License 2023 101-500
n/a Apache-2.0 license 2023 101-500
334000 CC-BY-SA-4.0 2023 101-500
1700 MIT License 2023 51-100
6,348 Apache-2.0 license 2023 11-50
700 Apache-2.0 license 2023 51-100
2374 n/a 2023 >500
939 Apache-2.0 license 2023 51-100
21208 CC-BY-SA-4.0 2022 >500
1000 CC-BY-SA-4.0 2022 101-500
194000 MIT License 2022 101-500
274000 MIT License 2022 101-500
unspecified Apache-2.0 license 2022 >500
44849 MIT License 2022 >500
3238 Apache-2.0 license 2022 101-500
21000 MIT License 2022 11-50
2500 CC-BY-SA-4.0 2022 101-500
6500 MIT License 2022 >500
11113 n/a 2022 101-500
n/a Apache-2.0 license 2022 >500
38961 MIT License 2022 101-500
1000 MIT License 2021 >500
12500 MIT License 2021 >500
n/a see dataset page 2021 >500
510 MIT License 2021 51-100
16552 MIT License 2021 101-500
n/a see dataset page 2021 >500
1634 Apache-2.0 license 2021 >500
10000 MIT License 2021 101-500
23679 CC-BY-SA-4.0 2021 101-500
58492 CC-BY-SA-4.0 2021 101-500
974 CC-BY-SA-4.0 2021 >500
164 MIT License 2021 >500
8500 MIT License 2021 >500
4229 CC-BY-SA-4.0 2020 >500
134400 MIT License 2020 101-500
4500000 CC-BY-SA-4.0 2020 101-500
99442 Apache-2.0 license 2020 >500
231400 MIT License 2020 >500
1508 CC-BY-SA-4.0 2020 >500
535062 n/a 2020 101-500
12723 MIT License 2020 101-500
300000 Apache-2.0 license 2019 >500
169265 CC-BY-NC-4.0 2019 >500
n/a CC-BY-NC-4.0 2019 >500
16000 CC-BY-SA-3.0 2019 >500
n/a see dataset page 2019 >500
96000 CC-BY-SA-4.0 2019 >500
15000 CC-BY-NC-4.0 2019 101-500
59950 MIT License 2019 >500
270000 MIT License 2019 >500
44000 Apache-2.0 license 2019 >500
18000 "Academic Free License (""AFL"") v. 3.1" 2019 >500
113000 CC-BY-SA-4.0 2018 >500
n/a see dataset page 2018 >500
12000 Apache-2.0 license 2018 >500
720 MIT License 2018 >500
127000 see dataset page 2018 >500
150000 CC-BY-SA-4.0 2018 >500
7787 CC-BY-SA-4.0 2018 >500
113000 MIT License 2018 >500
12102 n/a 2018 >500
100000 CC-BY-SA-4.0 2018 >500
100000 see dataset page 2017 >500
13700 CC-BY-SA-3.0 2017 101-500
650000 n/a 2017 >500
433000 see dataset page 2017 >500
100000 CC-BY-SA-4.0 2016 >500
12684 CC-BY-SA-4.0 2016 >500
1112939 see dataset page 2016 >500
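Where the code-generation entries above (HumanEval, MBPP, EvalPlus, APPS) report pass@k, the standard unbiased estimator introduced with HumanEval can be computed from n generated samples per task, of which c pass the unit tests. A minimal numeric sketch:

```python
# Unbiased pass@k estimator (n samples per task, c of them passing the unit tests).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes, given c passing samples."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per task, 5 passing -> pass@1 = 0.25, pass@10 close to 1.
print(pass_at_k(20, 5, 1), pass_at_k(20, 5, 10))
```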