100+ LLM benchmarks and evaluation datasets
AlpacaEval
MT-Bench-101
Chatbot Arena
HarmBench
MMLU Pro
MixEval
SimpleQA
CRUXEval (Code Reasoning, Understanding, and Execution Evaluation)
AgentHarm
StrongReject
BFCL (Berkeley Function-Calling Leaderboard)
TrustLLM
BigCodeBench
AIR-Bench
WildChat
API-Bank
CrossCodeEval
SEED-Bench
Chain-of-Thought Hub
ForbiddenQuestions
MuSR
ToolLLM
FreshQA
MT-Bench
ToolBench
AgentBench
Q-bench
EvalPlus
MaliciousInstruct
SycophancyEval
DecodingTrust
AdvBench
XSTest
ClassEval
MetaTool
M3Exam
OpinionQA
SafetyBench
GPQA
RepoBench
IFEval
AGIEval
HarmfulQA
QHarm
LegalBench
MMMU
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT
SWE-bench
Webarena
BeaverTails
Code Lingua
SummEdits
EvalCrafter
MME
DoNotAnswer
ScienceQA
DS-1000
MedMCQA
ToxiGen
HELM
HHH (Helpfulness, Honesty, Harmlessness)
PersonalInfoLeak
e-CARE (explainable CAusal REasoning dataset)
MGSM (Multilingual Grade School Math)
BigBench Hard
PlanBench
BigBench
AnthropicRedTeam
SVAMP
MATH
BEIR
SpartQA
TAT-QA
CodeXGLUE
TruthfulQA
APPS (Automated Programming Progress Standard)
BOLD
BBQ
MBPP (Mostly Basic Programming Problems)
HumanEval
GSM8K
StereoSet
ETHICS
Social Chemistry 101
RealToxicityPrompts
MMLU
CrowS-Pairs (Crowdsourced Stereotype Pairs)
MLSUM
MedQA
Natural Questions
ANLI
SEAT (Sentence Encoder Association Test)
BoolQ
SuperGLUE
DROP (Discrete Reasoning Over Paragraphs)
OpenDialKG
HellaSwag
PubMedQA
Winogrande
PIQA (Physical Interaction QA)
HotpotQA
GLUE (General Language Understanding Evaluation)
OpenBookQA
WinoGender
CoQA (Conversational Question Answering)
SQuAD2.0
ARC
SWAG
CommonsenseQA
QuAC (Question Answering in Context)
RACE (ReAding Comprehension Dataset From Examinations)
SciQ
TriviaQA
MultiNLI (Multi-Genre Natural Language Inference)
SQuAD (Stanford Question Answering Dataset)
LAMBADA (LAnguage Modelling Broadened to Account for Discourse Aspects)
MS MARCO
Type
instruction-following,conversation & chatbots
conversation & chatbots
conversation & chatbots
safety
knowledge,language & reasoning
conversation & chatbots
safety
coding
safety
safety
agents & tools use
safety,bias & ethics
coding
safety
conversation & chatbots
agents & tools use
coding
multimodal
language & reasoning
safety
language & reasoning
agents & tools use
knowledge
conversation & chatbots
agents & tools use
agents & tools use
multimodal
coding
safety
safety
safety
safety
safety
coding
agents & tools use
multimodal
safety
safety
language & reasoning
coding
language & reasoning,instruction-following
language & reasoning
safety
safety
domain-specific
multimodal,language & reasoning
multimodal
coding
agents & tools use
safety
coding
language & reasoning
video,multimodal
multimodal
safety
knowledge,language & reasoning,multimodal
coding
domain-specific
safety
language & reasoning,safety
safety
safety
language & reasoning
math
knowledge,language & reasoning
language & reasoning
knowledge,language & reasoning
safety
math
math
information retrieval,language & reasoning
language & reasoning
domain-specific
coding
knowledge,language & reasoning,safety
coding
safety,bias & ethics
safety,bias & ethics
coding
coding
math
safety,bias & ethics
safety,bias & ethics
language & reasoning,bias & ethics
safety
knowledge,language & reasoning
safety,bias & ethics
summarization,language & reasoning
domain-specific
language & reasoning
language & reasoning
safety,bias & ethics
language & reasoning
language & reasoning
language & reasoning
conversation & chatbots
language & reasoning
domain-specific
language & reasoning
language & reasoning
language & reasoning
language & reasoning
language & reasoning
safety,bias & ethics
conversation & chatbots
language & reasoning
knowledge,language & reasoning
language & reasoning
language & reasoning
conversation & chatbots
language & reasoning
language & reasoning
language & reasoning
language & reasoning
language & reasoning
language & reasoning
language & reasoning
Description
An automatic evaluator for instruction-following LLMs.
Multi-turn dialogues.
Open-source platform for comparing LLMs in a competitive environment.
Adversarial behaviors including cybercrime, copyright violations, and generating misinformation (https://www.harmbench.org).
An enhanced dataset designed to extend the MMLU benchmark with more challenging questions and a choice set of ten options.
A ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures.
Measures the ability of language models to answer short, fact-seeking questions, as a probe for hallucinations.
A set of Python functions and input-output pairs that consists of two tasks: input prediction and output prediction.
Explicitly malicious agent tasks, including fraud, cybercrime, and harassment.
Tests a model’s resistance against common attacks from the literature.
A set of function-calling tasks, including multiple and parallel function calls.
A benchmark across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. Consists of over 30 datasets covering these dimensions.
Function-level code generation tasks with complex instructions and diverse function calls.
AI safety benchmark aligned with emerging regulations. Considers operational, content safety, legal, and societal risks.
A collection of 1 million conversations between human users and ChatGPT, alongside demographic data.
Specifically designed for tool-augmented LLMs.
Multilingual code completion tasks built on real-world GitHub repositories in Python, Java, TypeScript, and C#.
A benchmark for evaluating Multimodal LLMs using multiple-choice questions.
Curated complex reasoning tasks, including math, science, coding, and long-context.
A set of questions targeting 13 behavior scenarios disallowed by OpenAI.
Multistep reasoning tasks based on text narratives (e.g., 1,000-word murder mysteries).
An instruction-tuning dataset for tool use.
Tests factuality of LLM-generated text in the context of answering questions that test current world knowledge. The dataset is regularly updated to stay current.
Multi-turn questions: an open-ended question and a follow-up question.
A tool manipulation benchmark consisting of software tools for real-world tasks.
Evaluates LLM-as-Agent across 8 environments, including Operating System (OS), Database (DB), Knowledge Graph (KG), Digital Card Game (DCG), and Lateral Thinking Puzzles (LTP).
Evaluates MLLMs on three dimensions: low-level visual perception, low-level visual description, and overall visual quality assessment.
Extends HumanEval and MBPP with 80x/35x more tests for rigorous evaluation.
Covers ten 'malicious intentions', including psychological manipulation, theft, cyberbullying, and fraud.
Tests if human feedback encourages model responses to match user beliefs over truthful ones, a behavior known as sycophancy.
Evaluates trustworthiness of LLMs across 8 perspectives, including toxicity, stereotypes, adversarial robustness, privacy, machine ethics, and fairness.
A set of 500 harmful strings that the model should not reproduce and 500 harmful instructions.
A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models.
Class-level Python code generation tasks.
A set of user queries in the form of prompts that trigger LLMs to use tools, including both single-tool and multi-tool scenarios.
A set of human exam questions in 9 diverse languages with three educational levels, where about 23% of the questions require processing images.
A dataset for evaluating the alignment of LM opinions with those of 60 US demographic groups.
Multiple-choice questions concerning offensive content, bias, illegal activities, and mental health.
A set of multiple-choice questions written by domain experts in biology, physics, and chemistry.
Consists of three interconnected evaluation tasks: retrieve the most relevant code snippets, predict the next line of code, and handle both retrieval and completion in an end-to-end pipeline.
A set of prompts with verifiable instructions, such as "write in more than 400 words".
A collection of standardized tests, including GRE, GMAT, SAT, LSAT.
Harmful questions covering 10 topics and ~10 subtopics each.
Dataset consists of human-written entries sampled randomly from AnthropicHarmlessBase.
Collaboratively curated tasks for evaluating legal reasoning in English LLMs.
Evaluates multimodal models on massive multi-discipline tasks demanding college-level subject knowledge. Includes 11.5K questions.
A framework for quantitatively evaluating interactive LLMs such as ChatGPT using 23 data sets covering 8 common NLP tasks.
Real-world software issues collected from GitHub.
An environment for autonomous agents that perform tasks on the web.
A set of prompts sampled from AnthropicRedTeam that cover 14 harm categories.
Compares the ability of LLMs to understand what the code implements in a source language and translate the same semantics into a target language.
Inconsistency detection in summaries.
A framework and pipeline for evaluating generated videos on aspects such as visual quality, content quality, motion quality, and text-video alignment.
Measures both perception and cognition abilities on a total of 14 subtasks.
The dataset consists of prompts across 12 harm types to which responsible LLMs do not answer.
Multimodal multiple choice questions with diverse science topics and annotations of their answers with corresponding lectures and explanations.
Code generation benchmark with data science problems spanning seven Python libraries, such as NumPy and Pandas.
Four-option multiple-choice questions from Indian medical entrance examinations. Covers 2,400 healthcare topics and 21 medical subjects.
A set of toxic and benign statements about minority groups.
Reasoning tasks in several domains (reusing other benchmarks) with a focus on multi-metric evaluation (https://crfm.stanford.edu/helm/).
Human preference data about helpfulness and harmlessness.
Evaluates whether LLMs are prone to leaking PII, contains name-email pairs.
A human-annotated dataset that contains causal reasoning questions.
Grade-school math problems from the GSM8K dataset, translated into 10 languages.
A suite of BigBench tasks for which LLMs did not outperform the average human-rater.
A benchmark designed to evaluate the ability of LLMs to generate plans of action and reason about change.
Set of questions crowdsourced by domain experts in math, biology, physics, and beyond.
Human-generated and annotated red teaming dialogues.
Grade-school-level math word problems that require models to perform single-variable arithmetic operations. Created by applying simple variations to existing problems.
Tasks from US mathematics competitions that cover algebra, calculus, geometry, and statistics.
BEIR is a heterogeneous benchmark for information retrieval (IR) tasks, contains 15+ IR datasets.
A textual question answering benchmark for spatial reasoning on natural language text.
Questions and associated hybrid contexts from real-world financial reports.
14 datasets for program understanding and generation and three baseline systems, including BERT-style, GPT-style, and Encoder-Decoder models.
Evaluates how well models generate truthful responses.
A dataset for code generation, including introductory to competitive programming problems.
A set of unfinished sentences from Wikipedia designed to assess bias in text generation.
Evaluates social biases of LLMs in question answering.
Crowd-sourced entry-level programming tasks.
Programming tasks and unit tests to check model-generated code.
Grade school math word problems.
A large-scale natural dataset in English to measure stereotypical biases in four domains: gender, profession, race, and religion.
A set of binary-choice questions on ethics with two actions to choose from.
A conceptual formalism to study people’s everyday social norms and moral judgments.
A dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores.
Multi-choice tasks across 57 subjects, high school to expert level.
Covers stereotypes dealing with nine types of bias, like race, religion, and age.
Multilingual summarization dataset crawled from different news websites.
Free-form multiple-choice OpenQA dataset for solving medical problems collected from the professional medical board exams
User questions issued to Google search, and answers found from Wikipedia by annotators.
Large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure.
Measures bias in sentence encoders.
Yes/No questions from Google searches, paired with Wikipedia passages.
Improved and more challenging version of GLUE benchmark.
Tasks to resolve references in a question and perform discrete operations over them (such as addition, counting, or sorting).
A dataset of conversations between two crowdsourcing agents engaging in a dialog about a given topic.
Predict the most likely ending of a sentence, multiple-choice.
A dataset for biomedical research question answering.
Fill-in-a-blank tasks resolving ambiguities in pronoun references with binary options.
Naive physics reasoning tasks focusing on how we interact with everyday objects in everyday situations.
A set of Wikipedia-based question-answer pairs with multi-hop questions.
Tool for evaluating and analyzing the performance of models on NLU tasks. Was quickly outperformed by LLMs and replaced by SuperGLUE.
Question answering dataset, modeled after open book exams.
Pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in coreference resolution.
Questions with answers collected from 8000+ conversations.
Combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
Grade-school level, multiple-choice science questions.
Multi-choice tasks of grounded commonsense inference with adversarial filtering.
Multiple-choice question answering dataset that requires commonsense knowledge to predict the correct answers.
Question-answer pairs, simulating student-teacher interactions.
Reading comprehension tasks collected from the English exams for middle and high school Chinese students.
Multiple choice science exam questions.
A large-scale question-answering dataset.
A crowdsourced collection of sentence pairs annotated with textual entailment information.
A reading comprehension dataset consisting of 100,000 questions posed by crowdworkers on a set of Wikipedia articles.
A set of passages composed of a context and a target sentence. The task is to guess the last word of the target sentence.
Questions sampled from Bing's search query logs and passages from web documents.
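Many of the math entries above (GSM8K, SVAMP, MATH, MGSM) are scored by comparing the model's final numeric answer against a reference. Below is a minimal sketch of such exact-match scoring, assuming the GSM8K convention that the gold solution ends with "#### <answer>"; the parsing regex and the model output string are illustrative, not any benchmark's official scorer.

```python
# Minimal sketch: exact-match scoring for GSM8K-style math word problems.
# Assumes the reference solution ends with "#### <answer>" (GSM8K convention).
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text, with commas stripped."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "").rstrip(".") if matches else None

def is_correct(model_output: str, reference_answer: str) -> bool:
    """Compare the model's final number to the number after '####' in the reference."""
    gold = reference_answer.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and pred == gold

# Illustrative example (not a real dataset record):
print(is_correct("9 * 2 = 18, so the answer is 18.", "She makes 9 * 2 = 18 dollars. #### 18"))
```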
Benchmark paper
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
https://arxiv.org/abs/2404.04475
MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
https://arxiv.org/abs/2402.14762
https://arxiv.org/abs/2403.04132
https://arxiv.org/abs/2402.04249
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
https://arxiv.org/abs/2406.01574
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
https://arxiv.org/abs/2406.06565
Measuring short-form factuality in large language models
https://arxiv.org/abs/2411.04368
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
https://arxiv.org/abs/2401.03065
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
https://arxiv.org/abs/2410.09024
A StrongREJECT for Empty Jailbreaks
https://arxiv.org/abs/2402.10260
Berkeley Function-Calling Leaderboard
https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html
TrustLLM: Trustworthiness in Large Language Models
https://arxiv.org/abs/2401.05561
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
https://arxiv.org/abs/2406.15877
https://arxiv.org/abs/2407.17436
WildChat: 1M ChatGPT Interaction Logs in the Wild
https://arxiv.org/abs/2405.01470
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
https://arxiv.org/abs/2304.08244
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
https://arxiv.org/abs/2310.11248
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
https://arxiv.org/abs/2307.16125
Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance
https://arxiv.org/abs/2305.17306
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
https://arxiv.org/abs/2308.03825
https://arxiv.org/abs/2310.16049
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
https://arxiv.org/abs/2307.16789
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
https://arxiv.org/abs/2310.03214
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
https://arxiv.org/abs/2306.05685
On the Tool Manipulation Capability of Open-source Large Language Models
https://arxiv.org/abs/2305.16504
AgentBench: Evaluating LLMs as Agents
https://arxiv.org/abs/2308.03688
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
https://arxiv.org/abs/2309.14181
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
https://arxiv.org/abs/2305.01210
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
https://arxiv.org/abs/2310.06987
Towards Understanding Sycophancy in Language Models
https://arxiv.org/abs/2310.13548
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
https://arxiv.org/abs/2306.11698
Universal and Transferable Adversarial Attacks on Aligned Language Models
https://arxiv.org/abs/2307.15043
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
https://arxiv.org/abs/2308.01263
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
https://arxiv.org/abs/2308.01861
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
https://arxiv.org/abs/2310.03128
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
https://arxiv.org/abs/2306.05179
Whose Opinions Do Language Models Reflect?
https://arxiv.org/abs/2303.17548
SafetyBench: Evaluating the Safety of Large Language Models
https://arxiv.org/abs/2309.07045
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
https://arxiv.org/abs/2311.12022
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
https://arxiv.org/abs/2306.03091
https://arxiv.org/abs/2311.07911
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
https://arxiv.org/abs/2304.06364
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
https://arxiv.org/abs/2308.09662
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
https://arxiv.org/abs/2309.07875
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
https://arxiv.org/abs/2308.11462
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
https://arxiv.org/abs/2311.16502
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
https://arxiv.org/abs/2302.04023
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://arxiv.org/abs/2310.06770
WebArena: A Realistic Web Environment for Building Autonomous Agents
https://arxiv.org/abs/2307.13854
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
https://arxiv.org/abs/2307.04657
Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code
https://arxiv.org/abs/2308.03109
https://arxiv.org/abs/2305.14540
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
https://arxiv.org/abs/2310.11440
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
https://arxiv.org/abs/2306.13394
Do-Not-Answer: Evaluating Safeguards in LLMs
https://arxiv.org/abs/2308.13387
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
https://arxiv.org/abs/2209.09513
DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
https://arxiv.org/abs/2211.11501
MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering
https://arxiv.org/abs/2203.14371
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
https://arxiv.org/abs/2203.09509
https://arxiv.org/abs/2211.09110
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
https://arxiv.org/abs/2204.05862
Are Large Pre-Trained Language Models Leaking Your Personal Information?
https://arxiv.org/abs/2205.12628
e-CARE: a New Dataset for Exploring Explainable Causal Reasoning
https://arxiv.org/abs/2205.05849
Language Models are Multilingual Chain-of-Thought Reasoners
https://arxiv.org/abs/2210.03057
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
https://arxiv.org/abs/2210.09261
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
https://arxiv.org/abs/2206.10498
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
https://arxiv.org/abs/2206.04615
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
https://arxiv.org/abs/2209.07858
Are NLP Models really able to Solve Simple Math Word Problems?
https://arxiv.org/abs/2103.07191
Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
https://arxiv.org/abs/2104.08663
SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning
https://arxiv.org/abs/2104.05832
TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance
https://arxiv.org/abs/2105.07624
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
https://arxiv.org/abs/2102.04664
TruthfulQA: Measuring How Models Mimic Human Falsehoods
https://arxiv.org/abs/2109.07958v2
Measuring Coding Challenge Competence With APPS
https://arxiv.org/abs/2105.09938
BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
https://arxiv.org/abs/2101.11718
BBQ: A Hand-Built Bias Benchmark for Question Answering
https://arxiv.org/abs/2110.08193
Program Synthesis with Large Language Models
https://arxiv.org/abs/2108.07732
Evaluating Large Language Models Trained on Code
https://arxiv.org/abs/2107.03374
Training Verifiers to Solve Math Word Problems
https://arxiv.org/abs/2110.14168
StereoSet: Measuring stereotypical bias in pretrained language models
https://arxiv.org/abs/2004.09456
Aligning AI With Shared Human Values
https://arxiv.org/abs/2008.02275
Social Chemistry 101: Learning to Reason about Social and Moral Norms
https://arxiv.org/abs/2011.00620
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
https://arxiv.org/abs/2009.11462
Measuring Massive Multitask Language Understanding
https://arxiv.org/abs/2009.03300
CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
https://arxiv.org/abs/2010.00133
MLSUM: The Multilingual Summarization Corpus
https://arxiv.org/abs/2004.14900
What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
https://arxiv.org/abs/2009.13081
Natural Questions: A Benchmark for Question Answering Research
https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00276/43518/Natural-Questions-A-Benchmark-for-Question
Adversarial NLI: A New Benchmark for Natural Language Understanding
https://arxiv.org/abs/1910.14599
On Measuring Social Biases in Sentence Encoders
https://arxiv.org/abs/1903.10561
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
https://arxiv.org/abs/1905.10044
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
https://arxiv.org/abs/1905.00537
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
https://arxiv.org/abs/1903.00161
OpenDialKG: Explainable Conversational Reasoning with Attention-based Walks over Knowledge Graphs
https://aclanthology.org/P19-1081/
HellaSwag: Can a Machine Really Finish Your Sentence?
https://arxiv.org/abs/1905.07830
PubMedQA: A Dataset for Biomedical Research Question Answering
https://arxiv.org/abs/1909.06146
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
https://arxiv.org/abs/1907.10641
PIQA: Reasoning about Physical Commonsense in Natural Language
https://arxiv.org/abs/1911.11641
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
https://arxiv.org/abs/1809.09600
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
https://arxiv.org/abs/1804.07461
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
https://arxiv.org/abs/1809.02789
Gender Bias in Coreference Resolution
https://arxiv.org/abs/1804.09301
CoQA: A Conversational Question Answering Challenge
https://arxiv.org/abs/1808.07042
Know What You Don't Know: Unanswerable Questions for SQuAD
https://arxiv.org/abs/1806.03822
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
https://arxiv.org/abs/1803.05457
https://arxiv.org/abs/1808.05326
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
https://arxiv.org/abs/1811.00937
QuAC : Question Answering in Context
https://arxiv.org/abs/1808.07036
RACE: Large-scale ReAding Comprehension Dataset From Examinations
https://arxiv.org/abs/1704.04683
Crowdsourcing Multiple Choice Science Questions
https://arxiv.org/abs/1707.06209
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
https://arxiv.org/abs/1705.03551
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
https://arxiv.org/abs/1704.05426
SQuAD: 100,000+ Questions for Machine Comprehension of Text
https://arxiv.org/abs/1606.05250
The LAMBADA dataset: Word prediction requiring a broad discourse context
https://arxiv.org/abs/1606.06031
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
https://arxiv.org/abs/1611.09268
Code repository
https://github.com/tatsu-lab/alpaca_eval
https://github.com/mtbench101/mt-bench-101
https://github.com/lm-sys/FastChat/tree/main
https://github.com/centerforaisafety/HarmBench/tree/main
https://github.com/TIGER-AI-Lab/MMLU-Pro
https://github.com/Psycoy/MixEval
https://github.com/openai/simple-evals
https://github.com/facebookresearch/cruxeval
https://github.com/UKGovernmentBEIS/inspect_evals
https://github.com/dsbowen/strong_reject
https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
https://github.com/HowieHwong/TrustLLM
https://github.com/bigcode-project/bigcodebench
https://github.com/stanford-crfm/air-bench-2024
n/a
https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank
https://github.com/amazon-science/cceval
https://github.com/AILab-CVC/SEED-Bench
https://github.com/FranxYao/chain-of-thought-hub/
https://github.com/verazuo/jailbreak_llms
https://github.com/Zayne-sprague/MuSR
https://github.com/OpenBMB/ToolBench
https://github.com/freshllms/freshqa
https://github.com/lm-sys/FastChat/tree/main
https://github.com/sambanova/toolbench/tree/main
https://github.com/THUDM/AgentBench
https://github.com/Q-Future/Q-Bench
https://github.com/evalplus/evalplus
https://github.com/Princeton-SysML/Jailbreak_LLM/tree/main
https://github.com/meg-tong/sycophancy-eval
https://github.com/AI-secure/DecodingTrust
https://github.com/llm-attacks/llm-attacks
https://github.com/paul-rottger/exaggerated-safety
https://github.com/FudanSELab/ClassEval
https://github.com/HowieHwong/MetaTool
https://github.com/DAMO-NLP-SG/M3Exam
https://github.com/tatsu-lab/opinions_qa
https://github.com/thu-coai/SafetyBench
https://github.com/idavidrein/gpqa
https://github.com/Leolty/repobench
https://github.com/google-research/google-research/tree/master/instruction_following_eval
https://github.com/ruixiangcui/AGIEval/tree/main
https://github.com/declare-lab/red-instruct
https://github.com/vinid/safety-tuned-llamas
https://github.com/HazyResearch/legalbench/
https://github.com/MMMU-Benchmark/MMMU
https://github.com/HLTCHKUST/chatgpt-evaluation
https://github.com/princeton-nlp/SWE-bench
https://github.com/web-arena-x/webarena
https://github.com/PKU-Alignment/beavertails
https://github.com/codetlingua/codetlingua
https://github.com/salesforce/factualNLG
https://github.com/EvalCrafter/EvalCrafter
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation
https://github.com/Libr-AI/do-not-answer
https://github.com/lupantech/ScienceQA
https://github.com/xlang-ai/DS-1000
https://github.com/medmcqa/medmcqa
https://github.com/microsoft/TOXIGEN/tree/main
https://github.com/stanford-crfm/helm
https://github.com/anthropics/hh-rlhf
https://github.com/jeffhj/LM_PersonalInfoLeak
https://github.com/Waste-Wood/e-CARE
https://github.com/google-research/url-nlp
https://github.com/suzgunmirac/BIG-Bench-Hard
https://github.com/karthikv792/LLMs-Planning/tree/main/plan-bench
https://github.com/google/BIG-bench
https://github.com/anthropics/hh-rlhf
https://github.com/arkilpatel/SVAMP
https://github.com/hendrycks/math/?tab=readme-ov-file
https://github.com/beir-cellar/beir
https://github.com/HLR/SpartQA-baselines
https://github.com/NExTplusplus/TAT-QA
https://github.com/microsoft/CodeXGLUE
https://github.com/sylinrl/TruthfulQA
https://github.com/hendrycks/apps
https://github.com/amazon-science/bold
https://github.com/nyu-mll/BBQ
https://github.com/google-research/google-research/blob/master/mbpp/README.md
https://github.com/openai/human-eval
https://github.com/openai/grade-school-math
https://github.com/moinnadeem/StereoSet
https://github.com/hendrycks/ethics
https://github.com/mbforbes/social-chemistry-101
https://github.com/allenai/real-toxicity-prompts?tab=readme-ov-file
https://github.com/hendrycks/test/tree/master
https://github.com/nyu-mll/crows-pairs
https://github.com/ThomasScialom/MLSUM
https://github.com/jind11/MedQA
https://github.com/google-research-datasets/natural-questions
https://github.com/facebookresearch/anli
https://github.com/W4ngatang/sent-bias
https://github.com/google-research-datasets/boolean-questions
https://github.com/nyu-mll/jiant
https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/drop/README.md
https://github.com/facebookresearch/opendialkg
https://github.com/rowanz/hellaswag/tree/master
https://github.com/pubmedqa/pubmedqa
https://github.com/allenai/winogrande
https://github.com/ybisk/ybisk.github.io/tree/master/piqa
https://github.com/hotpotqa/hotpot
https://github.com/nyu-mll/GLUE-baselines
https://github.com/allenai/OpenBookQA
https://github.com/rudinger/winogender-schemas
https://stanfordnlp.github.io/coqa/
https://rajpurkar.github.io/SQuAD-explorer/
https://github.com/allenai/aristo-leaderboard/tree/master/arc
https://github.com/rowanz/swagaf
https://github.com/jonathanherzig/commonsenseqa
https://quac.ai/
n/a
n/a
https://github.com/mandarjoshi90/triviaqa
https://github.com/nyu-mll/multiNLI
https://rajpurkar.github.io/SQuAD-explorer/
n/a
https://microsoft.github.io/msmarco/
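The coding entries in this list (HumanEval, MBPP, EvalPlus, BigCodeBench, and others) are scored by executing model completions against unit tests rather than by string matching. As one concrete example, the openai/human-eval repository linked above expects a JSONL file of completions and then runs its own sandboxed checker. A rough sketch of the generation side follows; generate_one_completion is a placeholder for your own model call, and the exact entry-point names should be verified against the repository.

```python
# Sketch of preparing completions for the openai/human-eval harness (linked above).
# generate_one_completion is a placeholder; plug in your own model call.
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Call your model here and return only the code completing the given function stub.
    raise NotImplementedError

problems = read_problems()  # maps task_id -> {"prompt", "test", "entry_point", ...}
samples = [
    {"task_id": task_id, "completion": generate_one_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Scoring (running the unit tests and computing pass@k) is then done by the
# repository's evaluate_functional_correctness command on samples.jsonl.
```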
Dataset
https://huggingface.co/datasets/tatsu-lab/alpaca_eval
https://github.com/mtbench101/mt-bench-101/tree/main/data/subjective
https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
https://github.com/centerforaisafety/HarmBench/tree/main/data/behavior_datasets
https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
https://huggingface.co/datasets/MixEval/MixEval
https://huggingface.co/datasets/basicv8vc/SimpleQA
https://huggingface.co/datasets/cruxeval-org/cruxeval
https://huggingface.co/datasets/ai-safety-institute/AgentHarm
https://github.com/dsbowen/strong_reject/tree/main/docs/api
https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard
https://github.com/HowieHwong/TrustLLM?tab=readme-ov-file#dataset-download
https://github.com/bigcode-project/bigcodebench
https://huggingface.co/datasets/stanford-crfm/air-bench-2024
https://huggingface.co/datasets/allenai/WildChat-1M
https://huggingface.co/datasets/liminghao1630/API-Bank
https://github.com/amazon-science/cceval/tree/main/data
https://huggingface.co/datasets/AILab-CVC/SEED-Bench-2
see repository
https://github.com/verazuo/jailbreak_llms
https://github.com/Zayne-sprague/MuSR/tree/main/datasets
https://github.com/OpenBMB/ToolBench?tab=readme-ov-file#data-release
https://github.com/freshllms/freshqa?tab=readme-ov-file#freshqa
https://huggingface.co/datasets/lmsys/mt_bench_human_judgments
https://github.com/sambanova/toolbench/tree/main
https://github.com/THUDM/AgentBench/tree/main/data
https://huggingface.co/datasets/q-future/Q-Bench-HF
https://github.com/evalplus/evalplus/tree/master/evalplus/data
https://github.com/Princeton-SysML/Jailbreak_LLM/tree/main/data
https://huggingface.co/datasets/meg-tong/sycophancy-eval
https://huggingface.co/datasets/AI-Secure/DecodingTrust
https://github.com/llm-attacks/llm-attacks/tree/main/data/advbench
https://github.com/paul-rottger/exaggerated-safety/blob/main/xstest_v2_prompts.csv
https://huggingface.co/datasets/FudanSELab/ClassEval
https://github.com/HowieHwong/MetaTool/tree/master/dataset
https://github.com/DAMO-NLP-SG/M3Exam?tab=readme-ov-file#data
https://worksheets.codalab.org/worksheets/0x6fb693719477478aac73fc07db333f69
https://huggingface.co/datasets/thu-coai/SafetyBench
https://huggingface.co/datasets/Idavidrein/gpqa
https://huggingface.co/datasets/tianyang/repobench-c
https://huggingface.co/datasets/tianyang/repobench-p
https://github.com/google-research/google-research/tree/master/instruction_following_eval
https://github.com/ruixiangcui/AGIEval/tree/main/data
https://huggingface.co/datasets/declare-lab/HarmfulQA
https://github.com/vinid/safety-tuned-llamas
https://huggingface.co/datasets/nguha/legalbench
https://huggingface.co/datasets/MMMU/MMMU
https://github.com/HLTCHKUST/chatgpt-evaluation/tree/main/src
https://huggingface.co/datasets/princeton-nlp/SWE-bench
https://github.com/web-arena-x/webarena/blob/main/config_files/test.raw.json
https://huggingface.co/datasets/PKU-Alignment/BeaverTails
https://huggingface.co/iidai
https://github.com/salesforce/factualNLG/tree/master/data/summedits
https://huggingface.co/datasets/RaphaelLiu/EvalCrafter_T2V_Dataset
https://huggingface.co/datasets/lmms-lab/MME
https://huggingface.co/datasets/LibrAI/do-not-answer
https://huggingface.co/datasets/derek-thomas/ScienceQA
https://huggingface.co/datasets/xlangai/DS-1000
https://github.com/medmcqa/medmcqa
https://huggingface.co/datasets/toxigen/toxigen-data
see repository
https://github.com/anthropics/hh-rlhf
https://github.com/jeffhj/LM_PersonalInfoLeak/tree/main/data
https://github.com/Waste-Wood/e-CARE/tree/main/dataset
https://huggingface.co/datasets/juletxara/mgsm
https://huggingface.co/datasets/maveriq/bigbenchhard
https://huggingface.co/datasets/tasksource/planbench
https://huggingface.co/datasets/google/bigbench
https://huggingface.co/datasets/Anthropic/hh-rlhf
https://github.com/arkilpatel/SVAMP/tree/main/data
https://github.com/hendrycks/math/?tab=readme-ov-file
https://huggingface.co/BeIR
https://github.com/HLR/SpartQA_generation
https://github.com/NExTplusplus/TAT-QA/tree/master/dataset_raw
https://huggingface.co/datasets?search=code_x_glue
https://huggingface.co/datasets/truthfulqa/truthful_qa
https://huggingface.co/datasets/codeparrot/apps
https://github.com/amazon-science/bold/tree/main/prompts
https://github.com/nyu-mll/BBQ/tree/main/data
https://github.com/google-research/google-research/blob/master/mbpp/mbpp.jsonl
https://huggingface.co/datasets/openai/openai_humaneval
https://github.com/openai/grade-school-math/tree/master/grade_school_math/data
https://huggingface.co/datasets/McGill-NLP/stereoset
https://huggingface.co/datasets/hendrycks/ethics
https://github.com/mbforbes/social-chemistry-101?tab=readme-ov-file#data
https://huggingface.co/datasets/allenai/real-toxicity-prompts
https://huggingface.co/datasets/cais/mmlu
https://github.com/nyu-mll/crows-pairs/blob/master/data/crows_pairs_anonymized.csv
https://huggingface.co/datasets/GEM/mlsum
https://github.com/jind11/MedQA
https://ai.google.com/research/NaturalQuestions
https://huggingface.co/datasets/facebook/anli
https://github.com/W4ngatang/sent-bias/tree/master/tests
https://github.com/google-research-datasets/boolean-questions
https://huggingface.co/datasets/aps/super_glue
https://huggingface.co/datasets/ucinlp/drop
https://github.com/facebookresearch/opendialkg/tree/main/data
https://github.com/rowanz/hellaswag/tree/master/data
https://github.com/pubmedqa/pubmedqa
https://huggingface.co/datasets/allenai/winogrande
https://huggingface.co/datasets/ybisk/piqa
https://hotpotqa.github.io/
https://huggingface.co/datasets/nyu-mll/glue
https://huggingface.co/datasets/allenai/openbookqa
https://huggingface.co/datasets/oskarvanderwal/winogender
https://stanfordnlp.github.io/coqa/
https://huggingface.co/datasets/bayes-group-diffusion/squad-2.0
https://huggingface.co/datasets/allenai/ai2_arc
https://github.com/rowanz/swagaf/tree/master/data
https://github.com/jonathanherzig/commonsenseqa
https://quac.ai/
https://www.cs.cmu.edu/~glai1/data/race/
https://huggingface.co/datasets/allenai/sciq
https://huggingface.co/datasets/mandarjoshi/trivia_qa
https://huggingface.co/datasets/nyu-mll/multi_nli
https://huggingface.co/datasets/rajpurkar/squad
https://huggingface.co/datasets/cimec/lambada
https://huggingface.co/datasets/microsoft/ms_marco
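Many of the dataset links above point to the Hugging Face Hub, so the corresponding benchmarks can be pulled with the `datasets` library. The sketch below loads two of the linked datasets; the config and split names are taken from the respective dataset cards and should be treated as assumptions to verify there.

```python
# Sketch: loading two of the Hub-hosted datasets linked above with the `datasets` library.
# Config and split names may change; check the dataset cards (cais/mmlu, allenai/ai2_arc).
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

print(mmlu[0]["question"], mmlu[0]["choices"], mmlu[0]["answer"])
print(arc[0]["question"], arc[0]["answerKey"])
```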
Number of examples License Initial publication year Cited by:
n/a CC-BY-NC-4.0 2024 New
4208 Apache-2.0 license 2024 New
33000 see dataset page 2024 New
510 MIT License 2024 New
12100 MIT License 2024 New
5000 Apache-2.0 license 2024 New
4326 MIT License 2024 New
800 MIT License 2024 New
110 MIT License,see dataset page 2024 New
n/a MIT License 2024 New
2000 Apache-2.0 license 2024 New
n/a MIT License 2024 New
1140 Apache-2.0 license 2024 New
5,694 Apache-2.0 license 2024 New
1,000,000+ ODC-BY license 2024 New
n/a MIT License 2023 51-100
10000 Apache-2.0 license 2023 51-100
24000 CC-BY-NC-4.0 2023 101-500
1000+ MIT License 2023 11-50
15140 MIT License 2023 101-500
756 MIT License 2023 11-50
n/a Apache-2.0 license 2023 101-500
599 Apache-2.0 license 2023 101-500
3300 CC-BY-4.0 2023 >500
n/a Apache-2.0 license 2023 51-100
1360 Apache-2.0 license 2023 101-500
2990 S-Lab License 1.0 2023 51-100
n/a Apache-2.0 license 2023 >500
100 n/a 2023 101-500
n/a MIT License 2023 101-500
243,877 CC-BY-SA-4.0 2023 101-500
1000 MIT License 2023 >500
450 CC-BY-4.0 2023 51-100
100 MIT License 2023 101-500
20879 MIT License 2023 51-100
12317 n/a 2023 51-100
1498 n/a 2023 101-500
11435 MIT License 2023 101-500
448 CC-BY-4.0 2023 101-500
unspecified CC-BY-NC-ND 4.0 2023 51-100
500 Apache-2.0 license 2023 101-500
n/a MIT License 2023 101-500
1960 Apache-2.0 license 2023 51-100
100 CC-BY-SA-4.0 2023 101-500
162 see dataset page 2023 101-500
11500 Apache-2.0 license 2023 101-500
n/a see dataset page 2023 >500
2200 MIT License 2023 101-500
n/a Apache-2.0 license 2023 101-500
334000 CC-BY-SA-4.0 2023 101-500
1700 MIT License 2023 51-100
6,348 Apache-2.0 license 2023 11-50
700 Apache-2.0 license 2023 51-100
2374 n/a 2023 >500
939 Apache-2.0 license 2023 51-100
21208 CC-BY-SA-4.0 2022 >500
1000 CC-BY-SA-4.0 2022 101-500
194000 MIT License 2022 101-500
274000 MIT License 2022 101-500
unspecified Apache-2.0 license 2022 >500
44849 MIT License 2022 >500
3238 Apache-2.0 license 2022 101-500
21000 MIT License 2022 11-50
2500 CC-BY-SA-4.0 2022 101-500
6500 MIT License 2022 >500
11113 n/a 2022 101-500
n/a Apache-2.0 license 2022 >500
38961 MIT License 2022 101-500
1000 MIT License 2021 >500
12500 MIT License 2021 >500
n/a see dataset page 2021 >500
510 MIT License 2021 51-100
16552 MIT License 2021 101-500
n/a see dataset page 2021 >500
1634 Apache-2.0 license 2021 >500
10000 MIT License 2021 101-500
23679 CC-BY-SA-4.0 2021 101-500
58492 CC-BY-SA-4.0 2021 101-500
974 CC-BY-SA-4.0 2021 >500
164 MIT License 2021 >500
8500 MIT License 2021 >500
4229 CC-BY-SA-4.0 2020 >500
134400 MIT License 2020 101-500
4500000 CC-BY-SA-4.0 2020 101-500
99442 Apache-2.0 license 2020 >500
231400 MIT License 2020 >500
1508 CC-BY-SA-4.0 2020 >500
535062 n/a 2020 101-500
12723 MIT License 2020 101-500
300000 Apache-2.0 license 2019 >500
169265 CC-BY-NC-4.0 2019 >500
n/a CC-BY-NC-4.0 2019 >500
16000 CC-BY-SA-3.0 2019 >500
n/a see dataset page 2019 >500
96000 CC-BY-SA-4.0 2019 >500
15000 CC-BY-NC-4.0 2019 101-500
59950 MIT License 2019 >500
270000 MIT License 2019 >500
44000 Apache-2.0 license 2019 >500
18000 "Academic Free License (""AFL"") v. 3.1" 2019 >500
113000 CC-BY-SA-4.0 2018 >500
n/a see dataset page 2018 >500
12000 Apache-2.0 license 2018 >500
720 MIT License 2018 >500
127000 see dataset page 2018 >500
150000 CC-BY-SA-4.0 2018 >500
7787 CC-BY-SA-4.0 2018 >500
113000 MIT License 2018 >500
12102 n/a 2018 >500
100000 CC-BY-SA-4.0 2018 >500
100000 see dataset page 2017 >500
13700 CC-BY-SA-3.0 2017 101-500
650000 n/a 2017 >500
433000 see dataset page 2017 >500
100000 CC-BY-SA-4.0 2016 >500
12684 CC-BY-SA-4.0 2016 >500
1112939 see dataset page 2016 >500
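Where the code-generation entries above (HumanEval, MBPP, EvalPlus, APPS) report pass@k, the standard unbiased estimator introduced with HumanEval can be computed from n generated samples per task, of which c pass the unit tests. A minimal numeric sketch:

```python
# Unbiased pass@k estimator (n samples per task, c of them passing the unit tests).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes, given c passing samples."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per task, 5 passing -> pass@1 = 0.25, pass@10 close to 1.
print(pass_at_k(20, 5, 1), pass_at_k(20, 5, 10))
```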