A Systematic Survey and Critical Review on Evaluating Large Language Models- Challenges, Limitations, and Recommendations
† York University, § Princess Nourah Bint Abdulrahman University, ‡ Nanyang Technological University, ¶ National Center for AI, Saudi Arabia, $ Qatar Computing Research Institute (QCRI), || Dialpad Canada Inc., • Royal Bank of Canada, ° Salesforce Research
Figure 1: Typology of the LLM Evaluation Workflow. A more detailed description of the challenges and the
limitations can be found in Table 5.
reproducibility, reliability, and robustness (see Section 3). Based on our findings, we provide a principled guideline in Section 4 to address current limitations in LLM evaluation. The data and the code used in this paper are publicly available here: https://github.com/ntunlp/Critical-Review-of-LLM-Eval.

2 Overview of LLM Evaluation Process

The following components are crucial for LLM evaluation: Evaluation Setup, Response Generation, and Evaluation Methodology (Chang et al., 2024). Each component has its own challenges, which we discuss in Section 3. These components in an evaluation workflow are shown in Figure 1.

2.1 Evaluation Setup

Benchmark Selection: To initiate the evaluation process of LLMs, the first step is selecting appropriate benchmarks. We categorize the benchmarking datasets into the following: general capability benchmarks, specialized benchmarks, and other diverse benchmarks. We refer to general capability benchmarks as the ones that are often used for evaluation upon the release of an LLM (e.g., MMLU (Hendrycks et al., 2020b), HumanEval (Chen et al., 2021)). In addition, there are specialized benchmarks that measure specific capabilities of LLMs (e.g., MT-Bench for chatting capabilities (Zheng et al., 2024)). There are also other benchmarks that usually combine multiple benchmarks to evaluate LLMs on diverse tasks (e.g., HELM (Liang et al., 2022)). We provide more details on each category in Appendix A.1.

Model Selection: Selecting the appropriate model from the numerous LLMs currently available is crucial for ensuring a fair evaluation, as it helps to avoid risks such as data contamination and unfair comparisons. For a detailed discussion on prominent LLMs, see Appendix A.2.

2.2 Response Generation

Once the benchmarks and the models are selected, the next step in the evaluation process is to design the prompt and set up the decoding parameters for response generation. In the prompt design step, decisions are made on what type of prompting (e.g., zero-shot or few-shot) will be used. Moreover, configuring the decoding parameters (e.g., temperature) is important to ensure optimal performance (Shi et al., 2024). More discussions on this are provided in Appendix A.3 and A.4.

2.3 Evaluation Methodology

Parsing Script Design: Evaluating LLM-generated responses is difficult because they often produce verbose outputs (see Table 6 for some examples). Therefore, parsing scripts are often necessary (Jahan et al., 2024; Laskar et al., 2023a) to extract target labels before applying evaluation metrics, ensuring alignment with evaluation criteria to maintain reliability.
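To make this step concrete, a minimal parsing script for a multiple-choice benchmark might look like the sketch below; the regular expression, option labels, and example responses are illustrative assumptions rather than the exact scripts used in the studies cited above.

```python
import re

# Candidate option labels assumed for a multiple-choice benchmark such as MMLU.
OPTION_LABELS = {"A", "B", "C", "D"}

def parse_option(response: str) -> str | None:
    """Extract a single option letter from a verbose LLM response.

    Returns None when no unambiguous label is found, so such cases can be
    routed to manual inspection instead of being silently scored as wrong.
    """
    # Common patterns: "Answer: B", "The correct answer is (C) ...", or a bare letter.
    match = re.search(r"\banswer\s*(?:is)?\s*[:\-]?\s*\(?([A-D])\)?\b",
                      response, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fall back to a response that is just the letter itself.
    stripped = response.strip().strip("().")
    if stripped.upper() in OPTION_LABELS:
        return stripped.upper()
    return None

if __name__ == "__main__":
    print(parse_option("Sure! The correct answer is (C) because ..."))  # -> C
    print(parse_option("I cannot decide between B and D."))             # -> None (flag for review)
```

Even such a small script embeds assumptions (answer phrasing, label format) that can change the measured accuracy, which is why the later sections argue for releasing and validating these scripts.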
                Availability (%)                                  Comparison (%)
Prompt    Code    Prompt + Code    Model Version                  Fair    Unfair
90.6      53.3    50.0             29.3                           20.7    79.3

Table 1: Availability of resources and fairness in model comparisons (out of 212 papers), analyzed by Balloccu et al. (2024).

Evaluation Approach: The evaluation approach can be divided into the following: automatic evaluation, human evaluation, and LLMs as evaluators. In automatic evaluation, before applying task-specific metrics (e.g., F1, Exact Match, Perplexity (Jelinek et al., 1977)), parsing scripts are often utilized to extract the targeted answer, especially in discriminative tasks. Human evaluation is required to ensure qualitative assessments of LLM responses (e.g., measuring clarity, coherence, factuality) (van der Lee et al., 2021). Recently, human evaluation based on the Elo-based rating system (Zheng et al., 2024) has gained a lot of attention. Since human evaluation is time-consuming, the utilization of LLMs as evaluators to assess other LLMs has become a popular evaluation approach (Chiang and Lee, 2023; Huang et al., 2024a). More details on LLM evaluation approaches are in Appendix A.6.1.

3 Challenges in Evaluating LLMs

We examine challenges and limitations in the evaluation process of LLMs based on three dimensions: reproducibility, reliability, and robustness.

3.1 Reproducibility

Reproducibility, the ability to consistently replicate model results under the same conditions, is a major challenge in generative models (Biderman et al., 2024). The primary challenge is the lack of comprehensive documentation for each part of the evaluation cycle, including benchmarking datasets, prompt construction, model details, decoding strategy, response parsing, and evaluation methodology (Kosch and Feger, 2024; McIntosh et al., 2024). Table 1 presents an analysis by Balloccu et al. (2024), revealing that a relatively low percentage of the analyzed papers shared their resources. Below, we discuss factors impacting reproducibility in the evaluation step.

3.1.1 Missing Details on Data & Models Used

Benchmarking Data: One factor that can negatively impact the ability to reproduce results is not releasing the exact data used for evaluation (Balloccu et al., 2024). Many studies evaluate LLMs on only a subset of existing datasets (Bang et al., 2023; Kocoń et al., 2023), while others use the exact benchmarking datasets (Laskar et al., 2023a; Qin et al., 2023). Despite the expectation not to compare results across studies using different subsets of the data, such comparisons often occur, as discussed by Balloccu et al. (2024). Nonetheless, without explaining the sampling strategy or releasing the subsets used for evaluation (and possibly their responses), reproducing results using different data subsets of the same size is challenging.

Model Versions: The information regarding the version of a model being used is also missing in many studies (Balloccu et al., 2024; Biderman et al., 2024), creating reproducibility concerns (see Table 1). The continuous updates of closed-source models, often with undisclosed changes, can also impact reproducibility. With these updates, earlier versions are often deprecated, and results from these versions may not apply to newer models (Chen et al., 2023b), making prior evaluation results no longer reproducible (Bang et al., 2023; Kocoń et al., 2023; Laskar et al., 2023a; Qin et al., 2023). Therefore, it is crucial to specify the model versions used (Balloccu et al., 2024; Biderman et al., 2024), while model owners should keep earlier versions available.

3.1.2 Lacking Response Generation Details

Prompting: The lack of details behind how the prompts are designed may make the findings in different literature inconsistent. For instance, variations in prompt design can lead to significantly different results, as seen in various studies (Bang et al., 2023; Jahan et al., 2024; Laskar et al., 2023a; Qin et al., 2023). While few-shot learning is found to outperform zero-shot in the original evaluations conducted by the authors of various LLMs (Anil et al., 2023; OpenAI, 2023; Touvron et al., 2023b), many independent evaluations demonstrate that adding few-shot examples does not necessarily outperform zero-shot models in every task (Jahan et al., 2024; Ye et al., 2023a). This raises the concern of whether certain prompt engineering techniques or optimizations to select few-shot samples were applied in the original evaluations. Hence, not disclosing the details behind how the prompt is designed or how the few-shot examples are selected can hinder reproducibility.

Decoding Strategy: LLMs are sensitive to
decoding parameters, leading to significant performance variations based on the chosen settings (Roziere et al., 2023; Touvron et al., 2023b). However, crucial details on their selection are excluded in existing literature (Bang et al., 2023; Kocoń et al., 2023; Laskar et al., 2023a; OpenAI, 2023; Qin et al., 2023; Team et al., 2023). This lack of transparency raises reproducibility concerns, which could be responsible for inconsistent results across studies even when similar prompts are used. For instance, Qin et al. (2023) found that adding output length restrictions in the prompt to generate summaries in no more than N words led to a performance drop in the SAMSum dataset (Gliwa et al., 2019). However, Laskar et al. (2023a) found that such controlled experiments led to a gain in performance in the SAMSum dataset.

3.1.3 Evaluation Methods Unavailable

Parsing Scripts: LLM-generated responses often require parsing scripts to extract desired information. However, as demonstrated in Table 1, Balloccu et al. (2024) observed in their analysis that almost half of the LLM evaluation papers do not release any code. We also observe that most studies (including both the LLM technical reports and independent evaluations) do not release their parsing scripts (Bang et al., 2023; Kocoń et al., 2023; OpenAI, 2023; Qin et al., 2023; Team et al., 2023, 2024). Nonetheless, inaccurate design of parsing scripts may lead to different evaluation results (Laskar et al., 2023a). Thus, the unavailability of parsing scripts complicates result comparisons while impacting reproducibility (Balloccu et al., 2024; Biderman et al., 2024).

Evaluation Approach: LLMs are increasingly used to evaluate other LLMs in development (Zheng et al., 2024). Concerns arise due to the use of closed-source LLMs as evaluators, as their frequent updates can affect reproducibility (Chen et al., 2023b; Verga et al., 2024). Moreover, Chen et al. (2023b) observed significant behavioral changes in closed-source LLMs over short periods. Such reproducibility concerns are also observed in prior research that used LLMs as evaluators. For instance, Chiang and Lee (2023); Zheng et al. (2024) found that using closed-source LLMs as the judge could coincide with human evaluations, whereas Fu et al. (2023b) observed the opposite. Since the recently proposed Prometheus-2 (Kim et al., 2024a) model is an open-source alternative and demonstrates a strong correlation with humans, utilizing open-source LLMs as the judge can help mitigate the reproducibility issues prevalent with closed-source LLMs.

3.2 Reliability

Reliability, the ability to trust that outcomes are as intended, is another challenge encountered during evaluation. Issues like contamination or inaccurate labels in the data, irrelevant evaluation methods, and unfair comparisons may impact the reliability of the findings, which we discuss below.

3.2.1 Data and Model Integrity Issues

Data Integrity: Errors in benchmarks undermine accurate conclusions and model comparisons, rendering evaluations of LLMs unreliable. An integrity-compromising factor is the presence of incorrect gold labels. For instance, existing issues in the gold labels of the widely used MMLU (Hendrycks et al., 2020b) dataset have led to the development of MMLU-Pro (Wang et al., 2024b) and MMLU-Redux (Gema et al., 2024). Recently, it was also found that the coding benchmark HumanEval (Chen et al., 2021) lacked essential test cases, leading to the development of an advanced version, HumanEvalPlus (Liu et al., 2024b).

Despite these improvements, many recent studies continue to use the older versions of datasets. For instance, despite the release of HumanEvalPlus, HumanEval is still used to benchmark LLM coding performance (Gloeckle et al., 2024; Jiang et al., 2023; Li et al., 2023c; Roziere et al., 2023; Team et al., 2023, 2024; Wong et al., 2023), potentially providing misleading insights.

In addition, outdated labels in existing benchmarks undermine the reliability of gold references. For example, in tasks like open-domain question answering, which demand real-world knowledge, many gold labels become outdated over time, as noted by Laskar et al. (2023a). Consequently, even if LLMs produce correct answers, comparing them to obsolete gold labels can yield inaccurate results. Moreover, in tasks like summarization, LLM-generated summaries are often favored over human-annotated gold references (Ding et al., 2022; Pu et al., 2023; Zhang et al., 2024b).

Contamination in Existing Models: Contamination occurs when a benchmarking dataset is used in training, reducing result reliability and validity (Sainz et al., 2023; Shi et al., 2023; Zhou et al., 2023b). Ensuring benchmarking examples are excluded from training data is essential to maintain
reliable results. Since LLMs are pre-trained on vast amounts of text data available on the internet, this could lead to unfair evaluations if LLMs have already encountered these datasets during their pre-training phase (Balloccu et al., 2024; Ravaut et al., 2024; Xu et al., 2024).

Nonetheless, most prior LLM evaluation work focusing on zero-shot evaluation did not conduct any data contamination tests (Bang et al., 2023; Laskar et al., 2023a; OpenAI, 2023; Qin et al., 2023; Team et al., 2023), raising concerns about whether these evaluations truly represent the zero-shot capabilities of LLMs. Recent research has also demonstrated a strong possibility of data contamination in many datasets used to evaluate different LLMs (Balloccu et al., 2024; Golchin and Surdeanu, 2023; Li and Flanigan, 2023; Matton et al., 2024; Oren et al., 2023; Ravaut et al., 2024; Sainz et al., 2023; Xu et al., 2024; Zhang et al., 2024a). With the current generation of LLMs being extremely capable of learning new skills with minimal amounts of data, exposing them to evaluation data may undermine the measurement of their true capabilities. Since the possibility of data contamination has led to the development of new versions of existing datasets (e.g., utilizing GSM-8K to construct GSM-1K (Zhang et al., 2024a)), it is crucial to use fair evaluation datasets.

3.2.2 Lack of Fairness by Manipulating Response Generation

Prompt Hacking: One major concern in terms of lack of fairness in LLM evaluation is the possibility of prompt hacking (Schulhoff et al., 2023), which involves manipulating input prompts to a language model to elicit desired responses (e.g., biasing the outputs, or taking unfair advantage by using specific few-shot examples). While the performance of LLMs depends on many factors relevant to how the prompt is structured, most work (Bang et al., 2023; Laskar et al., 2023a; Qin et al., 2023), and even the official technical reports (Anthropic, 2024; OpenAI, 2023; Team et al., 2023) of different LLMs, lack the necessary details behind prompt construction (e.g., missing scientific validity on why a certain prompt was preferred over others, how the few-shot examples are selected, etc.). This makes the claims regarding the effectiveness and limitations of certain LLMs in comparison to others questionable (see https://crfm.stanford.edu/2024/05/01/helm-mmlu.html). Recognizing these parallels underscores the need for transparency and robust methodologies to ensure fairness in AI research and development.

Lack of Transparency in Decoding Parameters: Shi et al. (2024) demonstrated that extensive tuning of decoding parameters could improve the performance during inference. However, how the different decoding parameters are selected is often underexplored in existing evaluations (Bang et al., 2023; Laskar et al., 2023a,b; OpenAI, 2023; Qin et al., 2023; Team et al., 2023), as discussed in Section 3.1. This poses the risk of optimizing the parameters on test sets to improve performance.

3.2.3 Inappropriate Evaluation Methodology

Inaccurate Design of Parsing Scripts: As Laskar et al. (2023a) observed, evaluating LLMs entirely with an automated approach based on the answer extracted using parsing scripts may lead to an error of more than 10% in many tasks. This raises questions about the reliability of LLM evaluations that solely depend on parsing scripts without validating the scripts' effectiveness for the task. To tackle this, Laskar et al. (2023a) proposed a hybrid approach combining parsing script-based automatic evaluation with human-in-the-loop verification (Laskar et al., 2022a; Wu et al., 2022). Initially, the parsing script extracts answers from LLM-generated responses. If any issues arise, humans resolve them, enhancing the reliability of parsing-based automatic evaluation.

In Figure 2, we demonstrate the differences between automatic and hybrid evaluation in Open-Domain QA (NQ-Open (Kwiatkowski et al., 2019), WebQuestions (Talmor and Berant, 2018), TriviaQA (Joshi et al., 2017)) and reading comprehension datasets (SQuAD-V2 (Rajpurkar et al., 2018), Race-High and Race-Middle (Lai et al., 2017)). The figure highlights the influence of human intervention on results in open-domain QA, where LLMs may generate synonymous or time-sensitive correct answers, potentially rendering gold answers outdated (Laskar et al., 2023a). Parsing script-based automatic evaluation is found to be reliable in the Race datasets for reading comprehension, whereas notable discrepancies are observed in the SQuAD-V2 dataset. Therefore, there is a need for designing dependable parsing scripts and involving humans when appropriate.

Evaluation Approaches Lacking Relevancy: In generative tasks, utilizing automatic string-based matching techniques may not be reliable either.
Figure 3: Performance Comparison: LLaMA-3 and Qwen2.

Figure 2: Comparing Automatic and Hybrid Evaluation.

Table 2 (column headers): Tokenizer, Vocab, MMLU, MMLU-Pro, MixEval, MixEval-Hard.
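The vocabulary-coverage analysis referenced in Table 2 and discussed below could be approximated with a sketch like the following; the coverage definition (fraction of the tokenizer vocabulary that appears when tokenizing a benchmark), the Hugging Face model identifier, and the dataset-loading call are assumptions made for illustration, not necessarily the exact procedure behind Table 2.

```python
# pip install transformers datasets
from datasets import load_dataset
from transformers import AutoTokenizer

def vocab_coverage(model_name: str, texts) -> float:
    """Fraction of a tokenizer's vocabulary observed when tokenizing a set of texts."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    seen = set()
    for text in texts:
        seen.update(tokenizer.tokenize(text))
    return len(seen) / len(tokenizer.get_vocab())

if __name__ == "__main__":
    # Assumed dataset/model identifiers; swap in the benchmarks and models under study.
    mmlu = load_dataset("cais/mmlu", "all", split="test")
    questions = [row["question"] for row in mmlu]
    print(f"Coverage: {vocab_coverage('mistralai/Mistral-7B-v0.1', questions):.2%}")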
models use different tokenizers to represent the benchmarking dataset, it also leads to variations in what is evaluated across models.

As can be seen in Table 2, we conducted a small-scale analysis for LLaMA-2 (Touvron et al., 2023b), LLaMA-3 (https://llama.meta.com/llama3/), Mistral (Jiang et al., 2023), and Qwen2 (https://github.com/QwenLM/Qwen2) on two benchmarking datasets with varying complexities: MMLU (Hendrycks et al., 2020b) and its more challenging version, MMLU-Pro (Wang et al., 2024b), as well as MixEval (Ni et al., 2024) and its harder version, MixEval-Hard. Our findings indicate that these datasets cover a relatively small portion of the model's capabilities. Specifically, for MixEval, as the datasets became more diverse and dynamic, the vocabulary coverage for the tokenizer decreased. This trend continued as the datasets increased in difficulty, with vocabulary coverage further declining.

3.3.2 No Tuning of Prompt and Decoding Parameters

While various combinations of decoding parameters may lead to differences in results (Shi et al., 2024), possibly due to high computing requirements, existing LLM evaluation work mostly overlooks the necessity of evaluating how model performance may vary depending on such variations. Similar to the absence of decoding parameter tuning, most prior work also evaluated LLMs using only a single prompt (Bang et al., 2023; Jahan et al., 2024; Kocoń et al., 2023; Laskar et al., 2023a; Qin et al., 2023). However, in the real world, users express themselves with diverse word choices, varying semantics and syntax, alongside minor discrepancies (e.g., misspellings or differing punctuation styles). To further examine the effects of prompt variations, we conduct an experiment using GPT-4o (2024-04-09) and GPT-3.5-Turbo (0125) (OpenAI, 2023), as well as Claude-3-Opus (2024-02-29) (Anthropic, 2024), with the prompts used by Laskar et al. (2023a) and Qin et al. (2023) in the SAMSum dataset. For this experiment, the default parameters for the respective LLMs are used.

Figure 4: ROUGE-1 scores in the SAMSum dataset based on Prompt Tuning.

As shown in Figure 4, the restricted prompting method by Laskar et al. (2023a) consistently outperforms the unrestricted approach across all three models. Conversely, the restricted prompting method by Qin et al. (2023) fails to surpass the unrestricted approach for GPT-3.5 and GPT-4o. However, it surprisingly outperforms the unrestricted method, indicating the significant impact of prompt tuning across models. Evaluating language models with a single prompt lacks fairness (Zhu et al., 2023b), yet it remains common practice (Bang et al., 2023; Laskar et al., 2023a; Qin et al., 2023). Minor prompt variations can lead to diverse outcomes for different models (Alzahrani et al., 2024; An et al., 2023; Biderman et al., 2024; Lanham et al., 2023; Sclar et al., 2023; Wei et al., 2024; Zhang et al., 2024a), highlighting the need to compare benchmarks across multiple prompts. Using automated prompt tuning techniques like Meta Probing Agents (Zhu et al., 2024) can ensure robustness to prompt variations.

3.3.3 Evaluation Method's Generalizability and Correlation Shortcomings

While automatic evaluations are usually utilized in discriminative tasks, they may not be applicable to every task, as demonstrated by Jahan et al. (2024), who showed that parsing scripts are not usable in certain discriminative tasks like relation extraction. Jahan et al. (2024) also noted a significant performance gap between the string-matching-based ROUGE metric (Lin, 2004) and the contextual similarity-based metric BERTScore (Zhang et al., 2019) in text summarization. While larger models achieve better accuracy, they involve a speed-accuracy trade-off (Parvez et al., 2019), leading to higher costs and latency (Fu et al., 2024b; Laskar et al., 2023b). While metrics like perplexity are widely used to evaluate language models (Chen et al., 2023c), Huang et al. (2024b) found that quantized LLaMA-3 versions have lower output confidence than the original. They noted similar model rankings for perplexity and a commonsense QA dataset. However, Hu et al. (2024) found no correlation between perplexity and long context understanding tasks, highlighting the need for robust evaluations with human-correlated metrics.

This raises another question: whether automated evaluations and LLM-as-a-judge correlate with human evaluations (e.g., Elo ratings). Zheng et al. (2024) demonstrated significant correlations between Elo ratings, LLM-as-a-judge, and automated evaluations. However, recent research (Alzahrani et al., 2024) suggests that automated evaluations, especially those using multiple-choice questions, can yield unstable rankings with minor changes in evaluation methods. Given this instability, it prompts us to question why these automated tests should align with human Elo ratings despite demonstrating such inconsistencies. In our view, we should focus not only on correlating scores but also on how well a benchmark's rankings align with the gold standards. Analysis in Table 3 for GPT-4 (OpenAI, 2023), Gemini (Team et al., 2023), and Claude-3 (Anthropic, 2024) reveals two key observations: (i) MMLU rankings disagree with LMSys Chatbot Arena and (ii) MMLU rankings vary among themselves due to implementation differences.

Model                        Chatbot Arena    HELM MMLU    Vellum MMLU
GPT-4o-2024-05-13            1 (1)            2 (2)        1 (1)
GPT-4-Turbo-2024-04-09       5 (3)            3 (3)        3 (3)
GPT-4-0125-preview           6 (4)            5 (5)        4 (4)
Gemini-1.5-Pro               4 (2)            4 (4)        13 (6)
Gemini-1.5-Flash             10 (6)           10 (6)       10 (5)
Claude-3-Opus-2024-02-29     7 (5)            1 (1)        2 (2)

Table 3: Rankings of models on LMSys Chatbot Arena vs two MMLU implementations. The relative rank of each model in MMLU is shown in parentheses.

4 Recommendations and Best Practices

So far, we've outlined the primary challenges in evaluating LLMs. In light of these challenges, a crucial question arises: How can we enhance the evaluation of LLMs? Crafting a structured framework that's both practical and easy to implement is daunting, given the complexities of generative LLM development. Previous studies tended to focus on specific evaluation aspects without offering comprehensive guidelines for the entire evaluation cycle, leaving researchers without clear guidance. Before diving into recommendations for each evaluation stage, it's important to acknowledge three key factors shaping current LLM evaluation practices: inherent randomness in generative models, significant computational demands, and insufficient documentation across stages.

Evaluation Setup: Selecting benchmarks for model assessment is crucial. Rather than simply replicating past choices, researchers should align datasets with required capabilities. To ensure robustness, datasets should vary across expected LLM capabilities (e.g., long-context understanding), tasks (e.g., summarization), and language complexity (e.g., vocabulary coverage). Ideally, a metric should measure dataset diversity. For model selection, conduct contamination tests between the chosen model and benchmarks using relevant techniques (Ravaut et al., 2024). This acts as an additional filter for benchmarking datasets, ensuring selection of unseen ones measuring intended capabilities. Meanwhile, for reproducibility, document any subset use of benchmarking datasets, along with the selected model version. In addition, throughout scientific history, the measurement of intelligence has evolved across generations. Tests from a decade ago may appear simplistic compared to today's standards (e.g., Math Olympiads, ICPC programming contests). Refreshing LLM evaluations periodically can effectively communicate standard capabilities in both open and closed-source LLM markets and ecosystems (e.g., chatbots). Hence, to ensure reliability, verify if the dataset has updated versions and incorporate them if available (e.g., HumanEvalPlus (Liu et al., 2024b), MMLU-Pro (Wang et al., 2024b), GSM-1K (Zhang et al., 2024a)).

Response Generation: For reproducibility, thorough documentation of prompts (e.g., explaining the selection of few-shot samples) and parameter settings (e.g., using tools like mlflow (https://mlflow.org/) or Weights & Biases (W&B) (https://wandb.ai/site)) is essential. To ensure reliability, it's crucial to justify why specific prompts and parameters are chosen over others by providing comparisons with alternative options. As for robustness, experimenting with diverse prompts and parameters is the key to showcasing their effectiveness and limitations in different scenarios. In resource-constrained environments, conducting experiments with diverse evaluation settings may pose challenges, yet it remains vital to perform robust evaluations on at least a subset of samples.
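As a minimal sketch of the documentation practice recommended above (assuming the mlflow package; the parameter values and prompt template are illustrative placeholders rather than the settings used in the experiments reported here):

```python
# pip install mlflow
import mlflow

# Illustrative evaluation settings; record whatever your actual run uses.
decoding_params = {
    "model": "gpt-4o-2024-05-13",
    "temperature": 0.0,
    "top_p": 1.0,
    "max_tokens": 256,
    "prompt_type": "zero-shot",
}

prompt_template = "Summarize the following dialogue in no more than 30 words:\n{dialogue}"

with mlflow.start_run(run_name="samsum-zero-shot-eval"):
    # Log decoding parameters so the exact generation setup can be reproduced later.
    mlflow.log_params(decoding_params)
    # Store the full prompt template as a text artifact alongside the run.
    mlflow.log_text(prompt_template, artifact_file="prompt_template.txt")
    # After scoring the generated outputs, log the resulting metric(s) as well.
    mlflow.log_metric("rouge1_f", 0.41)  # placeholder value
```

Logging the prompt, parameters, and scores in one place makes it possible to rerun or audit an evaluation even after models or APIs are updated.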
Step | Sub-Step | Recommendation | Implementation: Suggested Tools or Techniques

Evaluation Setup / Benchmark Selection
Recommendation: Selected benchmarks should align with the capabilities required, and updated versions of the datasets should be used to ensure reliability; diversity in the selected benchmarks is required to ensure robustness; and proper documentation of the dataset subsets is required for reproducibility.
Implementation: Reliability: Use refined benchmarks like MMLU-Pro, HumanEvalPlus, and GSM-1K to address the limitations in existing benchmarks. Reproducibility: Document the data sampling technique and release the data subset used for evaluation alongside the model-generated responses. Robustness: Check tokenizer vocabulary coverage in the selected benchmarks.

Evaluation Setup / Model Selection
Recommendation: A data contamination check on the selected model is required for reliability, proper versioning of the model is required for reproducibility, and diverse capability evaluation (e.g., latency, memory usage, format-following capability, etc.) is important to ensure robustness.
Implementation: Reliability: Use tools like the LLMSanitize library (Ravaut et al., 2024) for contamination checks. Reproducibility: Use mlflow or W&B for documentation. Robustness: Use tools like pyNVML to measure GPU memory requirements, FOFO for format following, compare the accuracy vs. latency trade-off, etc.

Response Generation / Prompt Design
Recommendation: Release the prompts and few-shot examples for reproducibility, justify the selection of certain prompts and few-shot examples to ensure reliability, and compare with alternative prompts to ensure robustness.
Implementation: Reliability: Justify the choice of certain prompts to ensure no potential for prompt hacking and compare the alternatives; clearly demonstrate what few-shot examples are selected and how. Reproducibility: Use tools like LM-Evaluation-Harness. Robustness: Use PromptBench or Meta-Probing Agents.

Response Generation / Decoding Parameters
Recommendation: Document the decoding parameters to ensure reproducibility, justify their selection to ensure reliability, and experiment with various parameters to ensure robustness.
Implementation: Reliability: Justify the choice of certain parameters to eliminate the risk of optimization on the test data. Reproducibility: Use mlflow or W&B. Robustness: Compare the performance based on different decoding parameters, at least on a subset of the data.

Evaluation Methodology / Parsing Script Design
Recommendation: Accurate parsing of the response is required for reliability, availability of these scripts is needed for reproducibility, and parsing scripts should show robustness across different models and datasets.
Implementation: Reliability: Validate the reliability based on human evaluation, at least on a subset. Reproducibility: Release the code. Robustness: Evaluate multiple models and datasets, across all types of labels and corner cases.

Evaluation Methodology / Evaluation Approach
Recommendation: Availability of the evaluation output is required for reproducibility, selected evaluation metrics should maintain correlation with humans to ensure reliability, and multiple evaluation metrics are required for evaluation robustness.
Implementation: Reliability: Validate the effectiveness of selected metrics (e.g., measure correlation with humans) and use techniques like LLM-as-juries to mitigate bias. Reproducibility: Release the evaluation output. Robustness: Use multiple evaluation metrics (e.g., in summarization, use both word-based (e.g., ROUGE) and contextualized (e.g., BERTScore) metrics); measure latency and GPU usage via pyNVML.
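For the GPU-usage and latency checks suggested above, a minimal sketch could look like the following; it assumes the pynvml (nvidia-ml-py) package and a placeholder generate() function standing in for whatever inference call is being profiled.

```python
# pip install nvidia-ml-py  (provides the pynvml module)
import time
import pynvml

def gpu_memory_used_mb(device_index: int = 0) -> float:
    """Current memory usage of one GPU in megabytes, queried via NVML."""
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return info.used / (1024 ** 2)

def profile_generation(generate, prompt: str) -> dict:
    """Report latency and GPU memory before/after a single generation call."""
    pynvml.nvmlInit()
    try:
        mem_before = gpu_memory_used_mb()
        start = time.perf_counter()
        output = generate(prompt)  # placeholder: your model's inference call
        latency = time.perf_counter() - start
        mem_after = gpu_memory_used_mb()
        return {"output": output, "latency_s": latency,
                "gpu_mem_mb": mem_after, "gpu_mem_delta_mb": mem_after - mem_before}
    finally:
        pynvml.nvmlShutdown()
```

Reporting such resource numbers alongside accuracy makes the accuracy vs. latency trade-off explicit when comparing models.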
Acknowledgements

We would like to thank all the anonymous reviewers for their excellent review comments. This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the York Research Chairs (YRC) program. We also acknowledge Compute Canada for the computing resources. Finally, we thank Mir Tafseer Nayeem for providing valuable feedback.

Limitations

One limitation of this work is that it is focused only on the evaluation phase of the LLM development cycle. Therefore, the challenges and limitations that happen during the training phase of LLMs are left out of the scope of this paper. Nonetheless, with the rapid growth of LLM technologies and huge financial incentives, it is essential to conduct a fair and reliable evaluation of LLMs, alongside ensuring robustness and reproducibility, which is the focus of this work.

Another limitation of this study is that it does not study how to prevent closed-source LLMs from getting access to the online benchmarks. For instance, assume we have two entities: model developers and evaluators. Evaluators do not want to expose their data to the modeling team. Conversely, model developers do not want to release their model weights due to significant financial incentives. If evaluators use an API to get the responses, there is a risk that the queries may get exposed to the model developers. Therefore, without getting access to the weights, evaluators cannot reliably assess the models on their queries. Mathematically and technically, there is no fundamental way to solve this problem without altering the training dynamics, which may not be an option for training teams.

Moreover, given the limited amount of study to evaluate LLMs on non-English data, our work was more focused on the monolingual scenario (mostly on English data). Therefore, investigating the challenges and limitations of LLM evaluation in multilingual and resource-constrained scenarios could be studied in the future, alongside studying the performance of various tokenizers (both multilingual and monolingual) in LLM benchmarking (Choo and Kim, 2023; Rust et al., 2021).

Finally, the multimodal capability, in other words, the ability to understand both language and vision, is another interesting capability of recently proposed LLMs (Bai et al., 2023; Chen et al., 2023a; Dai et al., 2024; Liu et al., 2023b, 2024a; Luo et al., 2024; Ye et al., 2023b; Zhang et al., 2023; Zhu et al., 2023a). This has led to the development of many multi-modal benchmarks (Chen et al., 2024b; Fu et al., 2023a, 2024a; Guan et al., 2023; Li et al., 2023a,b,d; Liu et al., 2024a, 2023e; Lu et al., 2022; Qiu et al., 2024; Yu et al., 2023). However, this paper was mostly focused on text-based NLP tasks, and the evaluation of LLMs on multimodal benchmarks is left out for future work.

Ethics Statement

This paper only reviews the existing challenges and limitations in LLM evaluations and provides an opinion piece and recommendations to ensure reliable, robust, and reproducible evaluations of LLMs. Thus, this review does not pose any ethical concerns.

References

Ahmed Abdelali, Hamdy Mubarak, Shammur Chowdhury, Maram Hasanain, Basel Mousi, Sabri Boughorbel, Samir Abdaljalil, Yassine El Kheir, Daniel Izham, Fahim Dalvi, et al. 2024. Larabench: Benchmarking arabic ai with large language models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 487–520.

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.

Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, et al. 2023. Mega: Multilingual evaluation of generative ai. arXiv preprint arXiv:2303.12528.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay,
Quentin Malartic, et al. 2023. The falcon series of open language models. arXiv preprint arXiv:2311.16867.

Iñigo Alonso, Maite Oronoz, and Rodrigo Agerri. 2024. Medexpqa: Multilingual benchmarking of large language models for medical question answering. arXiv preprint arXiv:2404.05590.

Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, and Haidar Khan. 2024. When benchmarks are targets: Revealing the sensitivity of large language model leaderboards.

Shengnan An, Bo Zhou, Zeqi Lin, Qiang Fu, Bei Chen, Nanning Zheng, Weizhu Chen, and Jian-Guang Lou. 2023. Skill-based few-shot selection for in-context learning. arXiv preprint arXiv:2305.14210.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.

Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.

Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, et al. 2024. Benchmarking foundation models with language-model-as-an-examiner. Advances in Neural Information Processing Systems, 36.

Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. 2024. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms. arXiv preprint arXiv:2402.03927.

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, and Andy Zou. 2024. Lessons from the trenches on reproducible evaluation of language models.

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439.

Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, and Marzieh Fadaee. 2023. Elo uncovered: Robustness and best practices in language model evaluation. arXiv preprint arXiv:2311.17295.

Sabri Boughorbel, MD Parvez, and Majd Hawasly. 2024. Improving language models trained with translated data via continual pre-training and dictionary learning analysis. arXiv preprint arXiv:2405.14277.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin Boerschinger, and Tal Schuster.
2022. Tomayto, tomahto. beyond token-level answer equivalence for question answering evaluation. arXiv preprint arXiv:2202.07654.

Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2024. A survey on mixture of experts. arXiv preprint arXiv:2407.06204.

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol., 15(3).

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. 2024. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318.

Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2020. Mocha: A dataset for training and evaluating generative reading comprehension metrics. arXiv preprint arXiv:2010.03636.

Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024a. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17754–17762.

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. 2024b. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330.

Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2023a. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793.

Lingjiao Chen, Matei Zaharia, and James Zou. 2023b. How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2023c. Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307.

Steffi Chern, Ethan Chern, Graham Neubig, and Pengfei Liu. 2024. Can large language models be trusted for evaluation? scalable meta-evaluation of llms as evaluators via agent debate. arXiv preprint arXiv:2401.16788.

Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937.

Sanghyun Choo and Wonjoon Kim. 2023. A study on the evaluation of tokenizer performance in natural language processing. Applied Artificial Intelligence, 37(1):2175112.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.

Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. arXiv preprint arXiv:2003.05002.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2024. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36.

Fahim Dalvi, Maram Hasanain, Sabri Boughorbel, Basel Mousi, Samir Abdaljalil, Nizi Nazar, Ahmed Abdelali, Shammur Absar Chowdhury, Hamdy Mubarak, Ahmed Ali, et al. 2023. Llmebench: A flexible framework for accelerating llms benchmarking. arXiv preprint arXiv:2308.04945.

Bosheng Ding, Chengwei Qin, Linlin Liu, Lidong Bing, Shafiq Joty, and Boyang Li. 2022. Is gpt-3 a good data annotator? arXiv preprint arXiv:2212.10450.

Markus Freitag and Yaser Al-Onaizan. 2017. Beam search strategies for neural machine translation. arXiv preprint arXiv:1702.01806.

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. 2023a. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024a. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390.

Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen, and Shashi Bhushan Tn. 2023b. Are large language models reliable judges? a study on the factuality evaluation capabilities of LLMs. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 310–316, Singapore. Association for Computational Linguistics.

Xue-Yong Fu, Md Tahmid Rahman Laskar, Elena Khasanova, Cheng Chen, and Shashi Bhushan TN. 2024b. Tiny titans: Can smaller large language models punch above their weight in the real world for meeting summarization? arXiv preprint arXiv:2402.00841.

Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. 2023a. Human-like summarization evaluation with ChatGPT. arXiv preprint arXiv:2304.02554.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023b. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. 2024. Are we done with mmlu? arXiv preprint arXiv:2406.04127.

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79.

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. 2024. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737.

Shahriar Golchin and Mihai Surdeanu. 2023. Time travel in llms: Tracing data contamination in large language models. arXiv preprint arXiv:2308.08493.

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. 2023. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models. arXiv preprint arXiv:2310.14566.

Yue Guo, Zian Xu, and Yi Yang. 2023a. Is ChatGPT a financial expert? evaluating language models on financial natural language processing. arXiv preprint arXiv:2310.12664.

Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong, et al. 2023b. Evaluating large language models:
A comprehensive survey. arXiv preprint arXiv:2310.19736.

Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. 2023. Are large language model-based evaluators the solution to scaling up multilingual evaluation? arXiv preprint arXiv:2309.07462.

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with APPS. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2020a. Aligning ai with shared human values. arXiv preprint arXiv:2008.02275.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020b. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

Yutong Hu, Quzhe Huang, Mingxu Tao, Chen Zhang, and Yansong Feng. 2024. Can perplexity reflect large language model's ability in long text understanding? arXiv preprint arXiv:2405.06105.

Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024a. An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers. arXiv preprint arXiv:2403.02839.

Wei Huang, Xudong Ma, Haotong Qin, Xingyu Zheng, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, and Michele Magno. 2024b. How good are low-bit quantized llama3 models? an empirical study. arXiv preprint arXiv:2404.14047.

Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. 2024a. Mapcoder: Multi-agent code generation for competitive problem solving. arXiv preprint arXiv:2405.11403.

Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, and Enamul Hoque. 2024b. Are large vision language models up to the challenge of chart comprehension and reasoning? an extensive investigation into the capabilities and limitations of lvlms. arXiv preprint arXiv:2406.00257.

Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, and Jimmy Huang. 2023. Evaluation of ChatGPT on biomedical tasks: A zero-shot comparison with fine-tuned generative transformers. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 326–336, Toronto, Canada. Association for Computational Linguistics.

Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, and Jimmy Xiangji Huang. 2024. A comprehensive evaluation of large language models on benchmark biomedical text processing tasks. Computers in Biology and Medicine, page 108189.

Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. 1977. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.

Mohsinul Kabir, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, M Saiful Bari, and Enamul Hoque. 2023.
Benllmeval: A comprehensive evaluation into the potentials and pitfalls of large language models on bengali NLP. arXiv preprint arXiv:2309.13173.

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. 2016. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–251. Springer.

Zachary Kenton, Noah Y Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D Goodman, et al. 2024. On scalable oversight with weak llms judging strong llms. arXiv preprint arXiv:2407.04622.

Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2023. xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. arXiv preprint arXiv:2303.03004.

Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, and Muhammad Abdul-Mageed. 2023. Gptaraeval: a comprehensive evaluation of ChatGPT on arabic NLP. arXiv preprint arXiv:2305.14976.

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2024a. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535.

Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, and Juho Kim. 2024b. Evallm: Interactive evaluation of large language model prompts on user-defined criteria. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–21.

Masamune Kobayashi, Masato Mita, and Mamoru Komachi. 2024. Large language models are state-of-the-art evaluator for grammatical error correction. arXiv preprint arXiv:2403.17540.

Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. arXiv preprint arXiv:2302.14520.

Jan Kocoń, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, et al. 2023. Chatgpt: Jack of all trades, master of none. Information Fusion, 99:101861.

Thomas Kosch and Sebastian Feger. 2024. Risk or chance? large language models and reproducibility in human-computer interaction research. arXiv preprint arXiv:2404.15782.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. 2023. ChatGPT beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv preprint arXiv:2304.05613.

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. 2024. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787.

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. 2023. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.

Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, and Jimmy Huang. 2023a. A systematic study and comprehensive evaluation of
ChatGPT on benchmark datasets. In Findings of the Association for Computational Linguistics: ACL 2023, pages 431–469, Toronto, Canada. Association for Computational Linguistics.

Md Tahmid Rahman Laskar, Cheng Chen, Xue-yong Fu, and Shashi Bhushan Tn. 2022a. Improving named entity recognition in telephone conversations via effective active learning with human in the loop. arXiv preprint arXiv:2211.01354.

Md Tahmid Rahman Laskar, Xue-Yong Fu, Cheng Chen, and Shashi Bhushan Tn. 2023b. Building real-world meeting summarization systems using large language models: A practical perspective. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 343–352.

Md Tahmid Rahman Laskar, Enamul Hoque, and Jimmy Xiangji Huang. 2022b. Domain adaptation with pre-trained transformers for query-focused abstractive text summarization. Computational Linguistics, 48(2):279–320.

Md Tahmid Rahman Laskar, Xiangji Huang, and Enamul Hoque. 2020. Contextualized embeddings based transformer encoder for sentence similarity modeling in answer selection task. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5505–5514.

Md Tahmid Rahman Laskar, Mizanur Rahman, Israt Jahan, Enamul Hoque, and Jimmy Huang. 2023c. Can large language models fix data annotation errors? an empirical study using debatepedia for query-focused text summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10245–10255.

Md Tahmid Rahman Laskar, Mizanur Rahman, Israt Jahan, Enamul Hoque, and Jimmy Huang. 2023d. CQSumDP: a ChatGPT-annotated resource for query-focused abstractive summarization based on debatepedia. arXiv preprint arXiv:2305.06147.

Chris van der Lee, Albert Gatt, Emiel van Miltenburg, and Emiel Krahmer. 2021. Human evaluation of automatically generated text: Current trends and best practice guidelines. Computer Speech & Language, 67:101151.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. 2023a. Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092.

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. 2023b. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125.

Changmao Li and Jeffrey Flanigan. 2023. Task contamination: Language models may not be few-shot anymore. arXiv preprint arXiv:2312.16337.

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023c. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161.

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023d. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355.

Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. 2023e. Large language models in finance: A survey. In Proceedings of the Fourth ACM International Conference on AI in Finance, pages 374–382.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland
13800
Robson, Pushmeet Kohli, Nando de Fre- and Percy Liang. 2024c. Lost in the mid-
itas, Koray Kavukcuoglu, and Oriol Vinyals. dle: How language models use long contexts.
2022. Competition-level code generation with Transactions of the Association for Computa-
alphacode. Science, 378(6624):1092–1097. tional Linguistics, 12:157–173.
Zongxia Li, Ishani Mondal, Yijun Liang, Huy Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue
Nghiem, and Jordan Lee Boyd-Graber. 2024. Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng,
Pedants (precise evaluations of diverse answer Pei Ke, Yifan Xu, Weng Lam Tam, et al.
nominee text for skinflints): Efficient evaluation 2023c. Alignbench: Benchmarking chinese
analysis and benchmarking for open-domain alignment of large language models. arXiv
question answering. preprint arXiv:2311.18743.
Percy Liang, Rishi Bommasani, Tony Lee, Dim-
Yi Liu, Lianzhe Huang, Shicheng Li, Sishuo
itris Tsipras, Dilara Soylu, Michihiro Yasunaga,
Chen, Hao Zhou, Fandong Meng, Jie Zhou,
Yian Zhang, Deepak Narayanan, Yuhuai Wu,
and Xu Sun. 2023d. Recall: A benchmark for
Ananya Kumar, et al. 2022. Holistic eval-
llms robustness against external counterfactual
uation of language models. arXiv preprint
knowledge. arXiv preprint arXiv:2311.08147.
arXiv:2211.09110.
Chin-Yew Lin. 2004. ROUGE: A package for au- Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li,
tomatic evaluation of summaries. In Text sum- Songyang Zhang, Wangbo Zhao, Yike Yuan,
marization branches out, pages 74–81. Jiaqi Wang, Conghui He, Ziwei Liu, et al.
2023e. Mmbench: Is your multi-modal
Stephanie Lin, Jacob Hilton, and Owain Evans. model an all-around player? arXiv preprint
2021. Truthfulqa: Measuring how mod- arXiv:2307.06281.
els mimic human falsehoods. arXiv preprint
arXiv:2109.07958. Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu,
Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng,
Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Kai-Wei Chang, Michel Galley, and Jianfeng
Yaser Yacoob, and Lijuan Wang. 2023a. Miti- Gao. 2023. Mathvista: Evaluating mathemat-
gating hallucination in large multi-modal mod- ical reasoning of foundation models in visual
els via robust instruction tuning. In The Twelfth contexts. arXiv preprint arXiv:2310.02255.
International Conference on Learning Repre-
sentations. Pan Lu, Swaroop Mishra, Tanglin Xia, Liang
Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind
Haotian Liu, Chunyuan Li, Yuheng Li, and Tafjord, Peter Clark, and Ashwin Kalyan. 2022.
Yong Jae Lee. 2023b. Improved baselines Learn to explain: Multimodal reasoning via
with visual instruction tuning. arXiv preprint thought chains for science question answer-
arXiv:2310.03744. ing. Advances in Neural Information Process-
ing Systems, 35:2507–2521.
Haotian Liu, Chunyuan Li, Qingyang Wu, and
Yong Jae Lee. 2024a. Visual instruction tun-
Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric
ing. Advances in neural information processing
Wang, and William Yang Wang. 2024. Llm-
systems, 36.
score: Unveiling the power of large language
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and models in text-to-image synthesis evaluation.
Lingming Zhang. 2024b. Is your code gener- Advances in Neural Information Processing
ated by ChatGPT really correct? rigorous eval- Systems, 36.
uation of large language models for code gen-
eration. Advances in Neural Information Pro- Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen,
cessing Systems, 36. Xiaoshuai Sun, and Rongrong Ji. 2024. Cheap
and quick: Efficient vision-language instruction
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin tuning for large language models. Advances in
Paranjape, Michele Bevilacqua, Fabio Petroni, Neural Information Processing Systems, 36.
13801
Zheheng Luo, Qianqian Xie, and Sophia Anani- Todor Mihaylov, Peter Clark, Tushar Khot, and
adou. 2023. ChatGPT as a factual inconsis- Ashish Sabharwal. 2018. Can a suit of ar-
tency evaluator for text summarization. arXiv mor conduct electricity? a new dataset for
preprint arXiv:2303.15621. open book question answering. arXiv preprint
arXiv:1809.02789.
Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu
Xiong, Bo Tang, Wenjin Wang, Hao Wu, Shervin Minaee, Tomas Mikolov, Narjes Nikzad,
Huanyong Liu, Tong Xu, and Enhong Chen. Meysam Chenaghlu, Richard Socher, Xavier
2024. Crud-rag: A comprehensive chinese Amatriain, and Jianfeng Gao. 2024. Large lan-
benchmark for retrieval-augmented generation guage models: A survey.
of large language models. arXiv preprint
Abhika Mishra, Akari Asai, Vidhisha Balachan-
arXiv:2401.17043.
dran, Yizhong Wang, Graham Neubig, Yu-
Oscar Mañas, Benno Krojer, and Aishwarya lia Tsvetkov, and Hannaneh Hajishirzi. 2024.
Agrawal. 2024. Improving automatic vqa eval- Fine-grained hallucination detection and edit-
uation using large language models. In Pro- ing for language models. arXiv preprint
ceedings of the AAAI Conference on Artificial arXiv:2401.06855.
Intelligence, volume 38, pages 4171–4179.
Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng,
Rui Mao, Guanyi Chen, Xulang Zhang, Frank Mahir Shah, Kabir Jain, Graham Neubig, and
Guerin, and Erik Cambria. 2023. Gpteval: A Yang You. 2024. Mixeval: Deriving wisdom of
survey on assessments of chatgpt and gpt-4. the crowd from llm benchmark mixtures.
arXiv preprint arXiv:2308.12488.
OpenAI. 2023. Gpt-4 technical report.
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Yonatan Oren, Nicole Meister, Niladri Chatterji,
Shafiq Joty, and Enamul Hoque. 2022. Chartqa: Faisal Ladhak, and Tatsunori B Hashimoto.
A benchmark for question answering about 2023. Proving test set contamination in
charts with visual and logical reasoning. arXiv black box language models. arXiv preprint
preprint arXiv:2203.10244. arXiv:2310.17623.
Ahmed Masry, Mehrad Shahmohammadi, Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo
Md Rizwan Parvez, Enamul Hoque, and Shafiq Almeida, Carroll Wainwright, Pamela Mishkin,
Joty. 2024. Chartinstruct: Instruction tuning Chong Zhang, Sandhini Agarwal, Katarina
for chart comprehension and reasoning. arXiv Slama, Alex Ray, et al. 2022. Training language
preprint arXiv:2403.09028. models to follow instructions with human feed-
back. Advances in Neural Information Process-
Minesh Mathew, Dimosthenis Karatzas, and
ing Systems, 35:27730–27744.
CV Jawahar. 2021. Docvqa: A dataset for vqa
on document images. In Proceedings of the Kishore Papineni, Salim Roukos, Todd Ward, and
IEEE/CVF winter conference on applications of Wei-Jing Zhu. 2002. Bleu: a method for auto-
computer vision, pages 2200–2209. matic evaluation of machine translation. In Pro-
ceedings of the 40th annual meeting of the As-
Alexandre Matton, Tom Sherborne, Dennis Au-
sociation for Computational Linguistics, pages
miller, Elena Tommasone, Milad Alizadeh,
311–318.
Jingyi He, Raymond Ma, Maxime Voisin,
Ellen Gilsenan-McMahon, and Matthias Gallé. Md Rizwan Parvez. 2024. Evidence to generate
2024. On leakage of code generation evaluation (e2g): A single-agent two-step prompting for
datasets. context grounded and retrieval augmented rea-
soning. arXiv preprint arXiv:2401.05787.
Timothy R McIntosh, Teo Susnjak, Tong Liu, Paul
Watters, and Malka N Halgamuge. 2024. Inad- Md Rizwan Parvez, Wasi Ahmad, Saikat
equacies of large language model benchmarks Chakraborty, Baishakhi Ray, and Kai-Wei
in the era of generative artificial intelligence. Chang. 2021. Retrieval augmented code
arXiv preprint arXiv:2402.09880. generation and summarization. In Findings
13802
of the Association for Computational Linguis- tional Conference on Language Resources and
tics: EMNLP 2021, pages 2719–2734, Punta Evaluation (LREC 2018).
Cana, Dominican Republic. Association for
Computational Linguistics. Xiao Pu, Mingqi Gao, and Xiaojun Wan. 2023.
Summarization is (almost) dead. arXiv preprint
Md Rizwan Parvez, Tolga Bolukbasi, Kai-Wei arXiv:2309.09558.
Chang, and Venkatesh Saligrama. 2019. Ro-
Chengwei Qin, Aston Zhang, Zhuosheng Zhang,
bust text classifier on test-time budgets. In Pro-
Jiaao Chen, Michihiro Yasunaga, and Diyi
ceedings of the 2019 Conference on Empirical
Yang. 2023. Is ChatGPT a general-purpose nat-
Methods in Natural Language Processing and
ural language processing task solver? arXiv
the 9th International Joint Conference on Nat-
preprint arXiv:2302.06476.
ural Language Processing (EMNLP-IJCNLP),
pages 1167–1172, Hong Kong, China. Associa- Haoyi Qiu, Wenbo Hu, Zi-Yi Dou, and Nanyun
tion for Computational Linguistics. Peng. 2024. Valor-eval: Holistic coverage and
faithfulness evaluation of large vision-language
Md Rizwan Parvez, Saikat Chakraborty,
models. arXiv preprint arXiv:2404.13874.
Baishakhi Ray, and Kai-Wei Chang. 2018.
Building language models for text with named Qwen2. 2024. Hello qwen2.
entities. In Proceedings of the 56th Annual
Meeting of the Association for Computational Alec Radford, Karthik Narasimhan, Tim Sali-
Linguistics (Volume 1: Long Papers), pages mans, Ilya Sutskever, et al. 2018. Improv-
2373–2383, Melbourne, Australia. Association ing language understanding by generative pre-
for Computational Linguistics. training.
Md Rizwan Parvez and Kai-Wei Chang. 2021. Raian Rahman, Rizvi Hasan, Abdullah Al Farhad,
Evaluating the values of sources in transfer Md. Tahmid Rahman Laskar, Md. Hamjajul
learning. In Proceedings of the 2021 Confer- Ashmafee, and Abu Raihan Mostofa Kamal.
ence of the North American Chapter of the As- 2023. Chartsumm: A comprehensive bench-
sociation for Computational Linguistics: Hu- mark for automatic chart summarization of long
man Language Technologies, pages 5084–5116, and short summaries. Proceedings of the Cana-
Online. Association for Computational Linguis- dian Conference on Artificial Intelligence.
tics. Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don’t know: Unan-
Md Rizwan Parvez, Jianfeng Chi, Wasi Uddin Ah-
swerable questions for squad. arXiv preprint
mad, Yuan Tian, and Kai-Wei Chang. 2023. Re-
arXiv:1806.03822.
trieval enhanced data augmentation for question
answering on privacy policies. In Proceedings Mathieu Ravaut, Bosheng Ding, Fangkai Jiao,
of the 17th Conference of the European Chapter Hailin Chen, Xingxuan Li, Ruochen Zhao,
of the Association for Computational Linguis- Chengwei Qin, Caiming Xiong, and Shafiq
tics, pages 201–210, Dubrovnik, Croatia. Asso- Joty. 2024. How much are llms contaminated?
ciation for Computational Linguistics. a comprehensive survey and the llmsanitize li-
brary. arXiv preprint arXiv:2404.00699.
Ethan Perez, Saffron Huang, Francis Song, Trevor
Cai, Roman Ring, John Aslanides, Amelia Vipula Rawte, Amit Sheth, and Amitava
Glaese, Nat McAleese, and Geoffrey Irv- Das. 2023. A survey of hallucination in
ing. 2022. Red teaming language mod- large foundation models. arXiv preprint
els with language models. arXiv preprint arXiv:2309.05922.
arXiv:2202.03286.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle,
Sarah Masud Preum, Md Rizwan Parvez, Kai-Wei Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi
Chang, and John Stankovic. 2018. A corpus Adi, Jingyu Liu, Tal Remez, Jérémy Rapin,
of drug usage guidelines annotated with type of et al. 2023. Code llama: Open foundation mod-
advice. In Proceedings of the Eleventh Interna- els for code. arXiv preprint arXiv:2308.12950.
13803
Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebas- report: A systematic survey of prompting tech-
tian Ruder, and Iryna Gurevych. 2021. How niques.
good is your tokenizer? on the monolingual
performance of multilingual language models. Sander Schulhoff, Jeremy Pinto, Anaum Khan,
In Proceedings of the 59th Annual Meeting of Louis-François Bouchard, Chenglei Si, Svetlina
the Association for Computational Linguistics Anati, Valen Tagliabue, Anson Kost, Christo-
and the 11th International Joint Conference on pher Carnahan, and Jordan Boyd-Graber. 2023.
Natural Language Processing (Volume 1: Long Ignore this title and hackaprompt: Exposing
Papers), pages 3118–3135, Online. Association systemic vulnerabilities of llms through a global
for Computational Linguistics. prompt hacking competition. In Proceedings
of the 2023 Conference on Empirical Methods
Mobashir Sadat, Zhengyu Zhou, Lukas Lange, in Natural Language Processing, pages 4945–
Jun Araki, Arsalan Gundroo, Bingqing Wang, 4977.
Rakesh Menon, Md Parvez, and Zhe Feng.
2023. DelucionQA: Detecting hallucinations in Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and
domain-specific question answering. In Find- Alane Suhr. 2023. Quantifying language mod-
ings of the Association for Computational Lin- els’ sensitivity to spurious features in prompt
guistics: EMNLP 2023, pages 822–835, Sin- design or: How i learned to start worry-
gapore. Association for Computational Linguis- ing about prompt formatting. arXiv preprint
tics. arXiv:2310.11324.
Oscar Sainz, Jon Ander Campos, Iker García- Shreya Shankar, JD Zamfirescu-Pereira, Björn
Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Hartmann, Aditya G Parameswaran, and Ian
and Eneko Agirre. 2023. NLP evaluation in Arawjo. 2024. Who validates the validators?
trouble: On the need to measure llm data con- aligning llm-assisted evaluation of llm out-
tamination for each benchmark. arXiv preprint puts with human preferences. arXiv preprint
arXiv:2310.18018. arXiv:2404.12272.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bha- Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen,
gavatula, and Yejin Choi. 2021. Winogrande: Yang You, and Lidong Bing. 2023. Large lan-
An adversarial winograd schema challenge at guage models are not yet human-level evalua-
scale. Communications of the ACM, 64(9):99– tors for abstractive summarization. In Findings
106. of the Association for Computational Linguis-
tics: EMNLP 2023, pages 4215–4233.
SambaNova. 2024. Samba-coe v0.3: The power
of routing ml models at scale. Chufan Shi, Haoran Yang, Deng Cai, Zhisong
Zhang, Yifan Wang, Yujiu Yang, and Wai Lam.
Maarten Sap, Hannah Rashkin, Derek Chen, Ro-
2024. A thorough examination of decoding
nan LeBras, and Yejin Choi. 2019. Socialiqa:
methods in the era of llms. arXiv preprint
Commonsense reasoning about social interac-
arXiv:2402.06925.
tions. arXiv preprint arXiv:1904.09728.
Weijia Shi, Anirudh Ajith, Mengzhou Xia,
Sander Schulhoff, Michael Ilie, Nishant Balepur,
Yangsibo Huang, Daogao Liu, Terra Blevins,
Konstantine Kahadze, Amanda Liu, Chen-
Danqi Chen, and Luke Zettlemoyer. 2023. De-
glei Si, Yinheng Li, Aayush Gupta, Hyo-
tecting pretraining data from large language
Jung Han, Sevien Schulhoff, Pranav Sandeep
models. arXiv preprint arXiv:2310.16789.
Dulepet, Saurav Vidyadhara, Dayeon Ki, Sweta
Agrawal, Chau Pham, Gerson C. Kroiz, Feileen Rickard Stureborg, Dimitris Alikaniotis, and
Li, Hudson Tao, Ashay Srivastava, Hevan- Yoshi Suhara. 2024. Large language models
der Da Costa, Saloni Gupta, Megan L. are inconsistent and biased evaluators. arXiv
Rogers, Inna Goncearenco, Giuseppe Sarli, Igor preprint arXiv:2405.01724.
Galynker, Denis Peskoff, Marine Carpuat, Jules
White, Shyamal Anadkat, Alexander Miserlis Lichao Sun, Yue Huang, Haoran Wang, Siyuan
Hoyle, and Philip Resnik. 2024. The prompt Wu, Qihui Zhang, Chujie Gao, Yixin Huang,
13804
Wenhan Lyu, Yixuan Zhang, Xiner Li, with juries: Evaluating llm generations with
et al. 2024. Trustllm: Trustworthiness a panel of diverse models. arXiv preprint
in large language models. arXiv preprint arXiv:2404.18796.
arXiv:2401.05561.
Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen,
Alon Talmor and Jonathan Berant. 2018. The web Runkai Zheng, Yidong Wang, Linyi Yang, Hao-
as a knowledge-base for answering complex jun Huang, Wei Ye, Xiubo Geng, et al. 2023a.
questions. arXiv preprint arXiv:1803.06643. On the robustness of ChatGPT: An adversar-
ial and out-of-distribution perspective. arXiv
Yixuan Tang and Yi Yang. 2024. Multihop- preprint arXiv:2302.12095.
rag: Benchmarking retrieval-augmented gen-
eration for multi-hop queries. arXiv preprint Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei
arXiv:2401.15391. Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu
Liu, and Zhifang Sui. 2023b. Large language
Gemini Team, Rohan Anil, Sebastian Borgeaud, models are not fair evaluators. arXiv preprint
Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, arXiv:2305.17926.
Radu Soricut, Johan Schalkwyk, Andrew M
Dai, Anja Hauth, et al. 2023. Gemini: a fam- Tong Wang, Ninad Kulkarni, and Yanjun Qi.
ily of highly capable multimodal models. arXiv 2024a. Less is more for improving auto-
preprint arXiv:2312.11805. matic evaluation of factual consistency. arXiv
preprint arXiv:2404.06579.
Gemma Team, Thomas Mesnard, Cassidy Hardin,
Robert Dadashi, Surya Bhupatiraju, Shreya Yubo Wang, Xueguang Ma, Ge Zhang, Yuan-
Pathak, Laurent Sifre, Morgane Rivière, Mi- sheng Ni, Abhranil Chandra, Shiguang Guo,
hir Sanjay Kale, Juliette Love, et al. 2024. Weiming Ren, Aaran Arulraj, Xuan He, Ziyan
Gemma: Open models based on gemini Jiang, Tianle Li, Max Ku, Kai Wang, Alex
research and technology. arXiv preprint Zhuang, Rongqi Fan, Xiang Yue, and Wenhu
arXiv:2403.08295. Chen. 2024b. Mmlu-pro: A more robust and
challenging multi-task language understanding
Simone Tedeschi, Felix Friedrich, Patrick benchmark.
Schramowski, Kristian Kersting, Roberto Nav-
Zhiruo Wang, Jun Araki, Zhengbao Jiang,
igli, Huu Nguyen, and Bo Li. 2024. Alert: A
Md Rizwan Parvez, and Graham Neubig.
comprehensive benchmark for assessing large
2023c. Learning to filter context for
language models’ safety through red teaming.
retrieval-augmented generation. arXiv preprint
arXiv preprint arXiv:2404.08676.
arXiv:2311.08377.
Hugo Touvron, Thibaut Lavril, Gautier Izacard,
Jason Wei, Yi Tay, and Quoc V Le. 2022a. Inverse
Xavier Martinet, Marie-Anne Lachaux, Timo-
scaling can become u-shaped. arXiv preprint
thée Lacroix, Baptiste Rozière, Naman Goyal,
arXiv:2211.02011.
Eric Hambro, Faisal Azhar, et al. 2023a. Llama:
Open and efficient foundation language models. Jason Wei, Xuezhi Wang, Dale Schuurmans,
arXiv preprint arXiv:2302.13971. Maarten Bosma, Ed Chi, Quoc Le, and Denny
Zhou. 2022b. Chain of thought prompting elic-
Hugo Touvron, Louis Martin, Kevin Stone, Pe-
its reasoning in large language models. arXiv
ter Albert, Amjad Almahairi, Yasmine Babaei,
preprint arXiv:2201.11903.
Nikolay Bashlykov, Soumya Batra, Prajjwal
Bhargava, Shruti Bhosale, et al. 2023b. Llama Sheng-Lun Wei, Cheng-Kuang Wu, Hen-Hsen
2: Open foundation and fine-tuned chat models. Huang, and Hsin-Hsi Chen. 2024. Unveil-
arXiv preprint arXiv:2307.09288. ing selection biases: Exploring order and to-
ken sensitivity in large language models. arXiv
Pat Verga, Sebastian Hofstatter, Sophia Altham- preprint arXiv:2406.03009.
mer, Yixuan Su, Aleksandra Piktus, Arkady
Arkhangorodsky, Minjie Xu, Naomi White, Man-Fai Wong, Shangxin Guo, Ching-Nam Hang,
and Patrick Lewis. 2024. Replacing judges Siu-Wai Ho, and Chee-Wei Tan. 2023. Natural
13805
language generation and understanding of big Xing Xie, and Yue Zhang. 2022. Glue-x: Eval-
code for ai-assisted programming: A review. uating natural language understanding models
Entropy, 25(6):888. from an out-of-distribution generalization per-
spective. arXiv preprint arXiv:2211.08073.
Honghan Wu, Minhong Wang, Jinge Wu, Farah
Francis, Yun-Hsuan Chang, Alex Shavick, Junjie Ye, Xuanting Chen, Nuo Xu, Can Zu,
Hang Dong, Michael TC Poon, Natalie Fitz- Zekai Shao, Shichun Liu, Yuhan Cui, Zeyang
patrick, Adam P Levine, et al. 2022. A sur- Zhou, Chao Gong, Yang Shen, et al. 2023a.
vey on clinical natural language processing in A comprehensive capability analysis of gpt-
the united kingdom from 2007 to 2022. NPJ 3 and gpt-3.5 series models. arXiv preprint
digital medicine, 5(1):186. arXiv:2303.10420.
Minghao Wu and Alham Fikri Aji. 2023. Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye,
Style over substance: Evaluation biases for Ming Yan, Yiyang Zhou, Junyang Wang, An-
large language models. arXiv preprint wen Hu, Pengcheng Shi, Yaya Shi, et al. 2023b.
arXiv:2307.03025. mplug-owl: Modularization empowers large
language models with multimodality. arXiv
Congying Xia, Chen Xing, Jiangshu Du, Xinyi preprint arXiv:2304.14178.
Yang, Yihao Feng, Ran Xu, Wenpeng Yin, and
Caiming Xiong. 2024. Fofo: A benchmark Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng
to evaluate llms’ format-following capability. Wang, Kevin Lin, Zicheng Liu, Xinchao Wang,
arXiv preprint arXiv:2402.18667. and Lijuan Wang. 2023. Mm-vet: Evaluating
large multimodal models for integrated capabil-
Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and ities. arXiv preprint arXiv:2308.02490.
Aidong Zhang. 2024. Benchmarking retrieval-
Weizhe Yuan, Graham Neubig, and Pengfei Liu.
augmented generation for medicine. arXiv
2021. Bartscore: Evaluating generated text as
preprint arXiv:2402.13178.
text generation. Advances in Neural Informa-
Ruijie Xu, Zengzhi Wang, Run-Ze Fan, and tion Processing Systems, 34:27263–27277.
Pengfei Liu. 2024. Benchmarking bench-
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu
mark leakage in large language models. arXiv
Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens,
preprint arXiv:2404.18824.
Dongfu Jiang, Weiming Ren, Yuxuan Sun,
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, et al. 2023. Mmmu: A massive multi-
Bowen Yu, Chang Zhou, Chengpeng Li, discipline multimodal understanding and rea-
Chengyuan Li, Dayiheng Liu, Fei Huang, soning benchmark for expert agi. arXiv
Guanting Dong, Haoran Wei, Huan Lin, Jia- preprint arXiv:2311.16502.
long Tang, Jialin Wang, Jian Yang, Jianhong Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Farhadi, and Yejin Choi. 2019. Hellaswag: Can
Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai a machine really finish your sentence? arXiv
Dang, Keming Lu, Keqin Chen, Kexin Yang, preprint arXiv:1905.07830.
Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng
Wang, Ru Peng, Rui Men, Ruize Gao, Runji Yuheng Zha, Yichi Yang, Ruichen Li, and Zhit-
Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tian- ing Hu. 2023. Alignscore: Evaluating factual
hang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, consistency with a unified alignment function.
Xiaodong Deng, Xiaohuan Zhou, Xingzhang arXiv preprint arXiv:2305.16739.
Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren,
Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robin-
Yunfei Chu, Zeyu Cui, Zhenru Zhang, and Zhi- son, Catherine Wu, Will Song, Tiffany Zhao,
hao Fan. 2024. Qwen2 technical report. Pranav Raja, Dylan Slack, Qin Lyu, et al. 2024a.
A careful examination of large language model
Linyi Yang, Shuibai Zhang, Libo Qin, Yafu Li, performance on grade school arithmetic. arXiv
Yidong Wang, Hanmeng Liu, Jindong Wang, preprint arXiv:2405.00332.
13806
Pan Zhang, Xiaoyi Dong Bin Wang, Yuhang Cao, with advanced large language models. arXiv
Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuan- preprint arXiv:2304.10592.
grui Ding, Songyang Zhang, Haodong Duan,
Hang Yan, et al. 2023. Internlm-xcomposer: A Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen
vision-language large model for advanced text- Xu, and Xing Xie. 2024. Dyval 2: Dynamic
image comprehension and composition. arXiv evaluation of large language models by meta
preprint arXiv:2309.15112. probing agents.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen
Weinberger, and Yoav Artzi. 2019. Bertscore: Wang, Hao Chen, Yidong Wang, Linyi Yang,
Evaluating text generation with bert. In Inter- Wei Ye, Neil Zhenqiang Gong, Yue Zhang,
national Conference on Learning Representa- et al. 2023b. Promptbench: Towards evalu-
tions. ating the robustness of large language mod-
els on adversarial prompts. arXiv preprint
Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy
arXiv:2306.04528.
Liang, Kathleen McKeown, and Tatsunori B
Hashimoto. 2024b. Benchmarking large lan- Yutao Zhu, Huaying Yuan, Shuting Wang,
guage models for news summarization. Trans- Jiongnan Liu, Wenhan Liu, Chenlong Deng,
actions of the Association for Computational Zhicheng Dou, and Ji-Rong Wen. 2023c. Large
Linguistics, 12:39–57. language models for information retrieval: A
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi survey. arXiv preprint arXiv:2308.07107.
Tang, Xiaolei Wang, Yupeng Hou, Yingqian
Ziyu Zhuang, Qiguang Chen, Longxuan Ma,
Min, Beichen Zhang, Junjie Zhang, Zican
Mingda Li, Yi Han, Yushan Qian, Haopeng
Dong, et al. 2023a. A survey of large language
Bai, Zixian Feng, Weinan Zhang, and Ting Liu.
models. arXiv preprint arXiv:2303.18223.
2023. Through the lens of core competency:
Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Survey on evaluation of large language models.
Khalman, Mohammad Saleh, and Peter J Liu. arXiv preprint arXiv:2308.07902.
2023b. Slic-hf: Sequence likelihood calibra-
tion with human feedback. arXiv preprint Terry Yue Zhuo, Yujin Huang, Chunyang Chen,
arXiv:2305.10425. and Zhenchang Xing. 2023. Red team-
ing ChatGPT via jailbreaking: Bias, robust-
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, ness, reliability and toxicity. arXiv preprint
Siyuan Zhuang, Zhanghao Wu, Yonghao arXiv:2301.12867.
Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric
Xing, et al. 2024. Judging llm-as-a-judge with Caleb Ziems, William Held, Omar Shaikh, Jiaao
mt-bench and chatbot arena. Advances in Neu- Chen, Zhehao Zhang, and Diyi Yang. 2024.
ral Information Processing Systems, 36. Can large language models transform computa-
tional social science? Computational Linguis-
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Sid- tics, pages 1–55.
dhartha Brahma, Sujoy Basu, Yi Luan, Denny
Zhou, and Le Hou. 2023a. Instruction- A Appendix
following evaluation for large language models.
arXiv preprint arXiv:2311.07911. A.1 Benchmarking Datasets
Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong General Capability Benchmarks: To bench-
Chen, Wayne Xin Zhao, Xu Chen, Yankai mark the performance of LLMs, researchers typ-
Lin, Ji-Rong Wen, and Jiawei Han. 2023b. ically use a set of widely recognized datasets.
Don’t make your llm an evaluation benchmark These common benchmarks are employed by au-
cheater. arXiv preprint arXiv:2311.01964. thors upon the release of an LLM to evaluate
its general capabilities. One of the most fre-
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang quently used benchmarks is the MMLU bench-
Li, and Mohamed Elhoseiny. 2023a. Minigpt- mark (Hendrycks et al., 2020b), which assesses
4: Enhancing vision-language understanding LLMs’ overall knowledge and reasoning abilities
13807
Table 5: Challenges and Limitations in terms of Reproducibility, Reliability, and Robustness in LLM Evaluation.

Reproducibility:
• Missing Experimental Details: Lack of documentation on the data subsets used for evaluation, which few-shot examples were added to the prompt, what decoding parameters were used, etc., will impact reproducibility.
• Not Releasing the Data: The detailed prompts as well as the responses generated by the LLMs are often missing.
• Code Unavailable: Many studies do not release the necessary code (e.g., parsing scripts). This may impact reproducibility of the results.
• Model Updates and Depreciation: Continuous updates to closed-source models (alongside possible depreciation of the models) will create challenges for reproducing previous results.

Reliability:
• Not Documenting Model Versions: The exact version of the model being used is often missing. This creates another reproducibility concern.
• Data Integrity: Incorrect gold labels and outdated benchmark datasets compromise evaluation reliability.
• Unfair Comparisons: Comparing models evaluated on the full dataset against models evaluated on a subset of the dataset, different few-shot examples being selected, etc.
• Contamination: LLMs may encounter evaluation data during pre-training, leading to contamination.
• Prompt Hacking: Manipulating input prompts to elicit desired responses can undermine fair evaluation.
• Transparency in Decoding Parameters: Lack of transparency in how decoding parameters are selected can lead to unfair comparisons.

Robustness:
• Evaluation Methodology and Metrics: Reliance on string-based metrics and automated evaluation methods without proper validation can lead to unreliable results.
• Limiting Evaluation to Certain Benchmarks: Evaluating LLMs only on a set of common benchmarks does not ensure generalizability.
• Lack of Diversity in Prompts and Parameters: Most existing research used only a single prompt while also not tuning any of the decoding parameters, restricting the robustness of the evaluation.
• Insufficient Evaluation Metrics: Lack of correlation between existing evaluation metrics impacts evaluation robustness.
A Appendix

A.1 Benchmarking Datasets

General Capability Benchmarks: To benchmark the performance of LLMs, researchers typically use a set of widely recognized datasets. These common benchmarks are employed by authors upon the release of an LLM to evaluate its general capabilities. One of the most frequently used benchmarks is the MMLU benchmark (Hendrycks et al., 2020b), which assesses LLMs' overall knowledge and reasoning abilities across various subjects. Other common benchmarks focus primarily on evaluating the commonsense reasoning capabilities of LLMs (Wei et al., 2022a), such as HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), WinoGrande (Sakaguchi et al., 2021), OpenBookQA (Mihaylov et al., 2018), and ARC (Clark et al., 2018). In addition, the TruthfulQA dataset (Lin et al., 2021) is used to measure the truthfulness of an LLM, while the TyDi QA dataset (Clark et al., 2020) is used to evaluate information-seeking question answering across diverse languages. For assessing coding capabilities, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) are two widely used benchmarks. Additional problem-solving datasets include APPS (Hendrycks et al., 2021), CodeContests (Li et al., 2022), and xCodeEval (Khan et al., 2023), among others.
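To make this concrete, the following is a minimal, illustrative sketch of scoring an LLM on a multiple-choice benchmark such as MMLU; it is not the evaluation code used in this paper. It assumes the Hugging Face copy of MMLU ("cais/mmlu", with question/choices/answer fields), and query_llm() is a hypothetical stand-in for whichever model is being evaluated.

```python
# Minimal sketch: accuracy on a multiple-choice benchmark (illustrative only).
# Assumes the "cais/mmlu" dataset layout on the Hugging Face Hub; query_llm()
# is a hypothetical wrapper around the model under evaluation.
from datasets import load_dataset

def query_llm(prompt: str) -> str:
    """Placeholder for an API call or a local model.generate() call."""
    raise NotImplementedError

def build_prompt(example) -> str:
    options = "\n".join(
        f"{letter}. {choice}"
        for letter, choice in zip("ABCD", example["choices"])
    )
    return (
        f"Question: {example['question']}\n{options}\n"
        "Answer with a single letter (A, B, C, or D)."
    )

def evaluate_subject(subject: str = "abstract_algebra") -> float:
    data = load_dataset("cais/mmlu", subject, split="test")
    correct = 0
    for example in data:
        response = query_llm(build_prompt(example))
        predicted = response.strip().upper()[:1]   # first letter of the reply
        gold = "ABCD"[example["answer"]]           # gold label stored as an index
        correct += int(predicted == gold)
    return correct / len(data)
```

Even in this simple setting, the prompt template, the answer-extraction rule, and the data subset all affect the reported score, which is why such details matter for the reproducibility concerns listed in Table 5.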
Specialized Benchmarks: There are also specialized benchmarks that measure specific capabilities of LLMs. For instance, MT-Bench (Zheng et al., 2024) evaluates whether LLMs can properly engage in conversations, while RewardBench (Lambert et al., 2024) assesses the performance of reward models. Other specialized benchmarks include AlpacaEval10 for instruction-following capabilities (Zhou et al., 2023a), the Open Medical-LLM Leaderboard11 for the biomedical capabilities of LLMs, the HHEM leaderboard12 for hallucination detection (Mishra et al., 2024; Sadat et al., 2023), BigCodeBench13 and LiveCodeBench14 for code generation capability evaluation, and SWE-bench (Jimenez et al., 2023) for software engineering capability evaluation. The recently proposed FOFO benchmark (Xia et al., 2024) measures language models' ability to adhere to the formats requested in prompts across different domains. Moreover, there are also some specialized benchmarks that are used for LLM safety15 (Chao et al., 2024) and red teaming16 (Tedeschi et al., 2024) evaluation. The ability to understand both language and vision is another interesting capability of recently proposed LLMs (Bai et al., 2023; Chen et al., 2023a; Dai et al., 2024; Liu et al., 2023b, 2024a; Luo et al., 2024; Ye et al., 2023b; Zhang et al., 2023; Zhu et al., 2023a). This has led to the development of many multi-modal benchmarks (Chen et al., 2024b; Fu et al., 2023a, 2024a; Guan et al., 2023; Li et al., 2023a,b,d; Liu et al., 2024a, 2023e; Lu et al., 2022; Qiu et al., 2024; Yu et al., 2023). These benchmarks study the multimodal capabilities of LLMs across various domains, such as math and reasoning (Lu et al., 2023; Yue et al., 2023), science diagrams (Kembhavi et al., 2016), chart understanding and reasoning (Islam et al., 2024b; Masry et al., 2022, 2024; Rahman et al., 2023), and document understanding (Mathew et al., 2021).

10: https://tatsu-lab.github.io/alpaca_eval/
11: https://huggingface.co/blog/leaderboard-medicalllm
12: https://huggingface.co/spaces/vectara/leaderboard
13: https://huggingface.co/blog/leaderboard-bigcodebench
14: https://huggingface.co/blog/leaderboard-livecodebench
15: https://huggingface.co/spaces/AI-Secure/llm-trustworthy-leaderboard
16: https://huggingface.co/spaces/HaizeLabs/red-teaming-resistance-benchmark

Other Diverse Benchmarks: To enable a more comprehensive evaluation of LLMs across a wide range of scenarios, some studies have also focused on introducing new benchmarks covering various aspects, such as HELM (Liang et al., 2022), PromptBench (Zhu et al., 2023b), OpenLLM17, MixEval (Ni et al., 2024), etc. These benchmarks cover diverse tasks and usually include existing benchmarking datasets (e.g., MMLU, HellaSwag, BoolQ (Clark et al., 2019), etc.). Additionally, despite the availability of numerous benchmarks (both general and specialized), existing widely-used benchmarks still do not cover the full variety of tasks (Parvez et al., 2018; Preum et al., 2018). Therefore, some researchers have independently evaluated LLMs using additional diverse NLP datasets and tasks (Bang et al., 2023; Kocoń et al., 2023; Laskar et al., 2023a; Qin et al., 2023). They have also employed domain-specific benchmarks in fields such as biomedicine (Jahan et al., 2023, 2024), finance (Guo et al., 2023a; Li et al., 2023e), language-specific evaluation (Abdelali et al., 2024; Ahuja et al., 2023; Kabir et al., 2023; Khondaker et al., 2023; Lai et al., 2023; Liu et al., 2023c), social science (Ziems et al., 2024), coding (Liu et al., 2024b), and information retrieval (Zhu et al., 2023c). In addition, ethics, bias, toxicity, robustness, and trustworthiness are also independently evaluated by researchers across various datasets (Hendrycks et al., 2020a; Liu et al., 2023a; McIntosh et al., 2024; Rawte et al., 2023; Sun et al., 2024; Wang et al., 2023a; Yang et al., 2022; Zhuo et al., 2023).

17: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

A.2 Prominent LLMs

The impressive success of ChatGPT has led to the development of many LLMs in recent years. Since hundreds of LLMs have been released in recent years (Zhao et al., 2023a), we only discuss some of the prominent LLMs that have recently achieved top rankings in various public leaderboards. LLMs can be categorized into two groups: (i) Closed-Source LLMs, which are only available for use through an API or web interface, and (ii) Open-Source LLMs, for which the pre-trained weights of the model are available, allowing further training of such models. Below, we present some prominent LLMs in these two categories.

A.2.1 Closed Source LLMs

In the following, we categorize LLMs based on the organizations that develop them:

OpenAI models (OpenAI, 2023):

• GPT-3.5: This model is an iteration of the GPT-3 architecture, emphasizing improvements in response quality through the application of the reinforcement learning from human feedback (RLHF) technique. GPT-3.5 is known for its robust performance on zero-shot tasks, where no specific training examples are provided during task execution. This model has been instrumental due to its strong foundational capabilities in understanding and generating human-like text.
• GPT-4: It extends GPT-3.5's capabilities by incorporating multimodal functionalities, allowing the model to process not just text but also visual inputs. This advancement significantly broadens its application scope, making it adept at handling more complex tasks that require an understanding of both textual and visual information. It features enhanced safety protocols and a sophisticated training regime that includes a safety reward signal during its reinforcement learning phase.

• GPT-4 Turbo: This version builds upon GPT-4's foundation with substantial upgrades in computational efficiency and functionality. GPT-4 Turbo boasts an increased model capacity and an extended knowledge base that encompasses more recent data, up to April 2023. It features a longer context window of up to 128,000 tokens and includes significant improvements in the model's economy and output consistency.

• GPT-4o: OpenAI's most sophisticated model, GPT-4o ("o" for "omni") is a multimodal powerhouse capable of handling both text and image inputs to generate text outputs. It improves upon GPT-4 Turbo by offering double the text generation speed and reducing operational costs by 50%.

Google models:

• PaLM-2: Released by Google in 2023, it is an advanced large language model that builds on the foundations set by its predecessor, the original PaLM. This iteration incorporates a sophisticated 'mixture of objectives' technique, allowing it to significantly surpass the capabilities of the earlier model (Anil et al., 2023).

• Gemini: A multimodal model released by Google in December 2023 that understands and processes a variety of information types, including text, images, audio, and video, seamlessly. Gemini's architecture allows it to perform exceptionally across multiple platforms, from large-scale data centers to mobile devices, adapting efficiently to the needs of different applications. This model sets new benchmarks in AI with its ability to excel in tasks that require complex multimodal integrations (Team et al., 2023).

Anthropic Models: The Claude series models, developed by Anthropic, represent a series of advanced language models designed to enhance user interaction through natural language understanding and generation. Starting with the original Claude, which excelled in tasks like summarization and creative writing, each subsequent model (Claude Instant, Claude 2.0, and the Claude 3 family: Haiku, Sonnet, and Opus) has introduced significant improvements in processing speed, reasoning capabilities, and multimodal functionality. These models have a variety of uses, from quick response generation in Claude Instant to sophisticated multimodal understanding in Claude 3 Opus, showcasing their versatility in meeting different user and enterprise needs18. The latest model in the Claude 3 series is Claude-3.5-Sonnet19.

18: https://www.anthropic.com/news/claude-3-family
19: https://www.anthropic.com/news/claude-3-5-sonnet

A.2.2 Open Source LLMs

We similarly categorize the open-source LLMs based on the organizations that develop them:

Meta Models:

• Llama: Launched in February 2023 by Meta AI, Llama was the first in the Llama series, showcasing strong performance on a range of natural language processing tasks. It competed well against larger models like GPT-3 with a smaller parameter size and was made available under a non-commercial license, primarily for academic research (Touvron et al., 2023a).

• Llama 2: Released in July 2023, Llama 2 improved on its predecessor by expanding model sizes up to 70 billion parameters. It maintained the original architecture but included better training data and enhanced functionality. Notably, Llama 2 was more accessible, available for both academic and some commercial uses (Touvron et al., 2023b).

• Llama 3: In April 2024, Meta AI introduced Llama 320, the most advanced version, with up to 70 billion parameters. This version added longer context capabilities and improved multimodal functions, marking a significant advancement in AI technology application across various fields.

20: https://llama.meta.com/llama3/
Mistral Models: Mistral AI, founded in April 2023, specializes in the development of open-source large language models. Rapidly gaining recognition in the AI industry, Mistral AI emphasizes the importance of open-source software, providing a viable alternative to proprietary models. The company has released several models, including Mistral 7B, Mixtral 8x7B, and Mixtral 8x22B, which are known for their high performance and their innovative use of mixture-of-experts architectures (Cai et al., 2024; Jiang et al., 2023). Codestral 22B, introduced on May 29, 2024, is a pioneering code generation model designed to enhance coding efficiency across more than 80 programming languages. With its specialized focus and lightweight architecture, Codestral significantly outperforms other leading models on the HumanEval FIM benchmark, making it a critical tool for developers seeking advanced AI-assisted coding capabilities.

Alibaba Models: The QWEN series models are transformer-based large language models developed by Alibaba Cloud (Bai et al., 2023). These models, pre-trained on diverse data sources including web texts, books, code, and more, come in various sizes ranging from 0.5 billion to 110 billion parameters. Qwen models support long context lengths and demonstrate strong performance on multiple Chinese and English evaluation tasks, including common-sense reasoning, code, and mathematics. The latest versions, Qwen 1.5 and Qwen 2, offer significant improvements in chat model performance, multilingual support, and stable support for up to a 32K context length. With a comprehensive vocabulary of over 150K tokens, Qwen models are designed to handle multiple languages effectively, making them a versatile tool for various AI applications.

Microsoft Models: The Phi series (Abdin et al., 2024) by Microsoft consists of small language models (SLMs) designed to provide high performance with lower computational requirements. The newly announced Phi-3 family includes models like Phi-3-mini, Phi-3-small, and Phi-3-medium, ranging from 3.8 billion to 14 billion parameters. These models excel on various benchmarks, offering capabilities similar to larger models but in a smaller, more cost-effective package. Phi-3 models are particularly suited for simpler tasks, local device operations, and environments with limited resources, making AI more accessible and efficient for diverse applications. They are available through the Microsoft Azure AI Model Catalog, Hugging Face, and as NVIDIA NIM microservices. Several follow-up works extend Phi models or their synthetic data into the multilingual space (e.g., Boughorbel et al., 2024).

Technology Innovation Institute Models: The Technology Innovation Institute released the Falcon series models (Almazrouei et al., 2023), such as the Falcon 2 series that includes models with parameter sizes of 1.3B, 7.5B, 40B, and 180B. These models are notable for their use of the REFINEDWEB dataset. Falcon models are designed for both research and commercial use, with Falcon 2 models featuring multilingual and multimodal capabilities, including vision-to-language. The Falcon 180B model, in particular, is accessible under a royalty-free license.

Cohere Models: Cohere offers a variety of advanced large language models designed for multiple use cases, including text generation, embeddings, and reranking. The Command family models, such as Command R+ and Command R, excel in conversational tasks and complex workflows like code generation and retrieval-augmented generation (RAG)21 (Alonso et al., 2024; Chen et al., 2024a; Gao et al., 2023b; Lewis et al., 2020; Liu et al., 2023d; Lyu et al., 2024; Parvez et al., 2021, 2023; Tang and Yang, 2024; Wang et al., 2023c; Xiong et al., 2024). The Embed models enhance search, classification, and clustering capabilities with both English and multilingual support. The Rerank models improve search algorithms by reorganizing results based on specified parameters. Cohere models are accessible across platforms like Amazon SageMaker, Microsoft Azure, and Oracle GenAI Service, enabling seamless integration into diverse applications and retrieval-augmented generation.

21: https://cohere.com/command

Google Gemma Models: While the early LLMs released by Google are mostly closed-source (e.g., PaLM-2, Gemini), Google has also recently released some lightweight open-source LLMs, named the Gemma22 family, which also have multimodal capabilities23.

22: https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
23: https://huggingface.co/blog/paligemma
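Since open-weight models such as those above are typically evaluated locally, the following is a minimal, hedged sketch of loading an open-weight checkpoint for evaluation with the Hugging Face transformers library; the checkpoint identifier is only a placeholder and should be replaced with whichever open-weight LLM is being assessed.

```python
# Minimal sketch: loading an open-weight LLM for local evaluation.
# The checkpoint name below is a placeholder, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "meta-llama/Llama-2-7b-chat-hf"  # placeholder open-weight model id

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,   # reduced precision for evaluation-only use
    device_map="auto",            # spread layers across available devices
)
model.eval()                      # inference mode; no gradient updates needed
```

Documenting the exact checkpoint (and revision) used here is precisely the kind of experimental detail that Table 5 lists under reproducibility.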
A.3 Prompting Techniques

Prompts can be designed in various ways (Brown et al., 2020; Chung et al., 2022; Islam et al., 2024a; Parvez, 2024; Schulhoff et al., 2024; Wei et al., 2022b), as described below; a short illustrative sketch follows this list.

• In-Context Learning (Zero-shot): The prompt used to interact with the model contains no examples or demonstrations. The model relies on its pre-existing knowledge, obtained from its initial training on diverse data, to generate a response or perform the task based solely on the instructions given. For example, "classify the sentence as biased or unbiased text".

• In-Context Learning (Few-shot): The prompt used to interact with the model includes a small number of examples or demonstrations. The model uses these examples to quickly adapt and understand how to perform a specific task, leveraging the details within these examples. This technique allows the model to extend its pre-existing knowledge to new tasks by closely analyzing the limited examples given. For instance, classify the sentence as biased or unbiased based on a few similar examples provided.

• Chain-of-Thought Prompting (CoT): This technique encourages models to generate intermediate reasoning steps before arriving at a final answer, mimicking a human-like problem-solving approach. It can be combined with few-shot prompting to achieve better results on more complex tasks. For example, if asked to determine whether the number "15" is odd or even, the model might outline its reasoning as follows: "An even number is divisible by 2 without a remainder. 15 divided by 2 is 7 with a remainder of 1. Therefore, 15 is an odd number." This step-by-step explanation helps clarify the model's thought process and supports its conclusion.

• Decomposition Techniques: These techniques break down complex problems into simpler sub-problems that can be solved sequentially by the GenAI model. Each component of the problem is addressed individually, and the solutions are integrated to form a comprehensive response. Decomposition is especially useful in tasks that require layered reasoning or have multiple steps. For example, in solving a math word problem, decomposition might involve separately calculating the distances each person travels and then combining these calculations to determine when they meet.

• Role-based and Style-based Prompting: In these techniques, prompts are designed to induce a specific style or persona in the model's responses. By specifying a role (e.g., a scientist explaining a concept) or a style (e.g., formal or poetic), users can guide the tone and formality of the AI's output. This technique is valuable in applications requiring genre-specific content generation or when the output needs to fit a particular communicative context.

• Prompt Chaining: A complex task is divided into simpler sub-tasks, each addressed by its own prompt. The response from one prompt is used as the input for the next, creating a sequential chain of prompts that gradually builds towards the final answer. This method enhances the performance and reliability of large language models by breaking down tasks into manageable parts, making it easier to control and refine the model's responses at each step. For example, in a document analysis task, the first prompt might extract key facts from a text, and the second prompt would use these facts to generate a summary.

• Tree of Thoughts (ToT): A technique that structures problem-solving into a tree of possible solutions. It uses strategies like breadth-first or depth-first search to evaluate each potential solution path. For example, in solving a puzzle, ToT might explore different moves to find the quickest solution path.

• Directional Stimulus Prompting (DSP): A technique that enhances how large language models (LLMs) respond to tasks by using dynamically generated prompts. A secondary, tuneable model creates specific hints that guide the main, unchangeable LLM to produce more targeted and relevant outputs. This method uses reinforcement learning to refine these prompts based on how well they perform, making DSP a more adaptive and precise approach compared to standard prompting techniques. For instance, in summarizing complex documents, DSP might generate a prompt like "Summarize focusing on economic impacts," guiding the LLM to tailor its output specifically to the economic aspects mentioned in the text.

• Multimodal Prompting: Extending beyond text, multimodal prompting involves using inputs like images, audio, or video along with textual descriptions. This technique leverages the model's capability to process and integrate information from diverse data types, enhancing its applicability in scenarios where multiple forms of data are available. For example, a model may interpret a scene from a video by analyzing both the spoken dialogue and the visual content to determine the mood of the conversation.

• Meta-Prompting: This involves creating prompts that instruct the AI to generate or refine its own prompts, essentially using AI to improve the efficiency and effectiveness of prompt engineering. This recursive use of prompting can lead to more dynamic and contextually adaptive AI behaviors. For example, one may ask the AI to optimize a prompt that instructs another AI to summarize news articles, thereby refining the instructions to enhance summary relevance and conciseness.
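As an illustration of the first three techniques above, the following is a minimal sketch of how zero-shot, few-shot, and chain-of-thought prompts could be assembled for the bias-classification example used in this section; query_llm() is a hypothetical stand-in for any chat or completion API, and the demonstration sentences are invented for illustration.

```python
# Minimal sketch: zero-shot, few-shot, and CoT prompt construction (illustrative only).
# query_llm() is a hypothetical wrapper around whichever LLM is being evaluated.

def query_llm(prompt: str) -> str:
    raise NotImplementedError

sentence = "The new policy was welcomed by every sensible person in the country."

# Zero-shot: instruction only, no demonstrations.
zero_shot = (
    "Classify the sentence as biased or unbiased text.\n"
    f"Sentence: {sentence}\nLabel:"
)

# Few-shot: a handful of labeled demonstrations precede the test instance.
demonstrations = [
    ("The committee approved the budget on Tuesday.", "unbiased"),
    ("Only a fool would oppose this brilliant plan.", "biased"),
]
demo_block = "\n".join(f"Sentence: {s}\nLabel: {l}" for s, l in demonstrations)
few_shot = (
    "Classify each sentence as biased or unbiased.\n"
    f"{demo_block}\nSentence: {sentence}\nLabel:"
)

# Chain-of-thought: ask for intermediate reasoning before the final label.
cot = (
    "Classify the sentence as biased or unbiased. "
    "First explain your reasoning step by step, then give the final label.\n"
    f"Sentence: {sentence}\nReasoning:"
)

for prompt in (zero_shot, few_shot, cot):
    print(query_llm(prompt))
```

Which of these templates is used, and which demonstrations are selected, should be reported explicitly, since prompt diversity is one of the robustness concerns highlighted in Table 5.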
A.4 Decoding Parameters

Several decoding parameters need to be set before generating responses; a minimal configuration sketch follows this list. For instance:

• Temperature: It is used to control the randomness of the output and is typically between 0 and 1. Lower values (e.g., 0.1) make the model more deterministic and focused on the most likely next token, while higher values (e.g., 0.9) introduce more randomness and diversity.

• Beam Size: It refers to the number of beams in beam search (Freitag and Al-Onaizan, 2017), a decoding strategy that keeps track of multiple possible sequences (beams) at each step of generation to find the most likely sequence. A higher number of beams usually leads to more accurate results, but at the cost of increased computation.

• Top-K: The number of top probable tokens to consider. For example, if K=10, the model will choose the next token only from the top 10 most likely tokens.

• Top-P: The cumulative probability threshold. For example, if P=0.9, the model will sample from the smallest set of tokens whose combined probability is at least 90%.

• Maximum Output Tokens: It sets the maximum number of tokens to generate.
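A minimal sketch of how these parameters could be passed to a Hugging Face causal LM at generation time is shown below; the specific values are arbitrary examples rather than recommended settings, and a tokenizer and model are assumed to have already been loaded (e.g., via AutoTokenizer and AutoModelForCausalLM).

```python
# Minimal sketch: configuring decoding parameters with transformers' generate()
# (illustrative values only; tokenizer and model are assumed to be loaded already).
inputs = tokenizer(
    "Summarize the meeting notes below:\n...", return_tensors="pt"
).to(model.device)

# Sampling-based decoding: temperature, top-k, and top-p control randomness.
sampled = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    max_new_tokens=256,   # maximum output tokens
)

# Beam search decoding: deterministic, explores multiple candidate sequences.
beamed = model.generate(
    **inputs,
    do_sample=False,
    num_beams=4,          # beam size
    max_new_tokens=256,
)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```

Reporting these values alongside the results addresses the "Transparency in Decoding Parameters" concern listed in Table 5.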
A.5 Parsing Script Design

While various evaluation software packages (Biderman et al., 2024; Dalvi et al., 2023) are currently available, they are limited to certain scenarios (e.g., to certain datasets, benchmarks, and prompts). Thus, for the evaluation of LLMs across diverse settings, researchers often need to write parsing scripts. We present some scenarios in Table 6 to demonstrate why a parsing script is required in such cases and why validating parsing scripts is important.

Scenario 1: For the response generated, designing a parsing script to extract the answer "Lionel Messi" is straightforward. However, the parsing script should also be robust to cases like abbreviations, uppercase-lowercase sensitivity, punctuation, synonyms, stemming, lemmatization, paraphrases, etc.
Prompt: Which player has won the best player award in Fifa world cup 2022?
Sample LLM Response (GPT-4o): Lionel Messi won the Best Player award (Golden Ball) in the FIFA World Cup 2022. He was instrumental in leading Argentina to victory in the tournament, culminating in their triumph in the final against France.
Correct Answer: Lionel Messi

Scenario 2: While extraction of the answer "Lionel Messi" is required, due to the LLM knowledge cut-off date of September 2021, it may answer about 2018. However, the target answer "Lionel Messi" is also present in the output, so if the parsing script only searches for the target answer, it may consider the response correct whereas the response is wrong.
Prompt: Which player has won the best player award in the last Fifa world cup?
Sample LLM Response (older ChatGPT-3.5 with a knowledge cut-off date of September 2021): The Best Player award (Golden Ball) in the previous FIFA World Cup, which was held in 2018 in Russia, was won by Luka Modric from Croatia. Prior to that, Lionel Messi had won it in 2014.
Correct Answer: Lionel Messi

Table 6: Some examples of LLM-generated responses requiring a parsing script to extract the target answer. For Scenario 2, human evaluation is usually needed to ensure accurate parsing of the answer.
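The sketch below illustrates the kind of parsing logic Table 6 motivates: normalizing both the gold label and the generated text before matching handles the robustness issues of Scenario 1, while Scenario 2 shows why such scripts still need human validation, since naive substring matching would wrongly score that response as correct. The function names are illustrative, not taken from an existing library.

```python
# Minimal sketch of a parsing script for extracting a target label from a
# verbose LLM response (illustrative; not the parsing scripts used in this paper).
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def contains_answer(response: str, gold: str) -> bool:
    """Scenario 1: robust matching of the gold answer inside a verbose response.

    Scenario 2 in Table 6 exposes the limitation of this approach: the gold string
    may appear inside an otherwise wrong response, so sampled human validation of
    the parsed outputs is still needed.
    """
    return normalize(gold) in normalize(response)

response = ("Lionel Messi won the Best Player award (Golden Ball) "
            "in the FIFA World Cup 2022.")
print(contains_answer(response, "Lionel Messi"))  # True
```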
are usually utilized in discriminative tasks (Bang et al., 2023; Laskar et al., 2023a; Qin et al., 2023). Since metrics like exact match have several limitations (e.g., they do not consider synonyms of the gold label), various metrics for certain tasks (e.g., question answering (Bulian et al., 2022; Chen et al., 2020; Li et al., 2024; Mañas et al., 2024)) have been proposed.

Generative Tasks: For generative tasks such as summarization or machine translation, parsing scripts are usually not required (Jahan et al., 2024; Laskar et al., 2023a), and so the full response generated by the LLM is compared against the gold reference. In this regard, ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002), which are based on n-gram word matching, are widely used. Meanwhile, various contextualized similarity metrics (Laskar et al., 2020; Parvez and Chang, 2021), e.g., BERTScore (Zhang et al., 2019), BARTScore (Yuan et al., 2021), and AlignScore (Wang et al., 2024a; Zha et al., 2023), that do not depend on word-based similarity measures are also utilized.

A.6.2 Human Evaluation
Since LLMs generate human-like responses, it is often required to conduct qualitative evaluation of their responses. Earlier, qualitative evaluation of model-generated responses in terms of fluency, coherence, and informativeness was very popular (Laskar et al., 2022b). However, with LLMs usually generating informative, fluent, and coherent responses (Bang et al., 2023; Kocoń et al., 2023; Laskar et al., 2023a; Qin et al., 2023), the evaluation of the factual consistency of LLM-generated responses has become more important recently (Fu et al., 2023b). Moreover, qualitative evaluation that compares LLM-generated responses by leveraging humans based on the Elo rating system (Zheng et al., 2024) has gained a lot of attention.

Elo Rating: Elo rating works by comparing LLMs in pairwise “A vs B” comparisons, where each model is assigned an initial numerical rating (Boubdir et al., 2023; Zhao et al., 2023b). The outcome of each comparison adjusts these ratings based on the Elo algorithm: if a model performs better than expected, its rating increases; if it performs worse, its rating decreases. The expectation of a model’s performance is calculated using its rating relative to its opponent’s, adjusted by a factor that represents the sensitivity of expected scores to differences in ratings. To ensure a robust evaluation of LLMs using the Elo benchmark, it is important to follow key indicators like reliability and transitivity (Boubdir et al., 2023). Reliability keeps Elo ratings consistent across various comparison sequences and prevents them from being overly sensitive to changes in hyperparameters, such as the K-factor. Transitivity is crucial, indicating that if model A is rated higher than model B, and model B is rated higher than model C, model A should logically rank above model C.
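To make the update rule described above concrete, here is a minimal sketch assuming the standard Elo formulation (logistic expected score with a K-factor); it is an illustration rather than the exact implementation of any leaderboard. Because ratings are updated sequentially, the final ratings depend on the order of comparisons, which motivates the Bradley-Terry alternative discussed below.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one pairwise comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    The K-factor controls how sensitive ratings are to a single outcome,
    which is one of the hyperparameters that reliability checks should cover.
    """
    expected_a = elo_expected(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1000; model A wins the comparison.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```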
Figure 5: Ownership attack on the blind evaluation of LLMs: reviewers can pose ownership-related questions and select their preferred model solely based on the model’s ownership. LMSys does not count votes if the models’ identities are revealed during the conversation.
Extensive testing with both synthetic and real-world data is essential to verify that Elo scores accurately and stably reflect model performance (Boubdir et al., 2023). This involves making precise adjustments to the comparison order, selecting hyperparameters carefully, and utilizing numerous permutations to ensure outcome consistency. Due to the sensitivity of the Elo rating system to the order in which updates are performed, Zheng et al. (2024) used the Bradley-Terry (BTL) model for their Chatbot Arena ranking. It is observed that model A can have a higher win rate than model B both empirically and statistically but a lower Elo rating. Since win rate serves as the stand-in measure for the probability of one model being better than another, this supports the findings of Boubdir et al. (2023) that Elo rating is non-transitive, with or without BTL. On the other hand, BTL-based rating is tolerant to an imbalanced number of votes per model, as shown by Zheng et al. (2024); they also propose a different probability of win rates derived from the ratings found from BTL, which is transitive but does not correlate with the empirical win rates.

Elo hacking: Crowdsourced Elo-based ranking has gained popularity through the LMSys leaderboard (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) and has been accepted by various organizations, prompting them to release their LLMs early into this ecosystem for human evaluation. However, such setups can be easily exploited on a large scale using simple techniques. Figure 5 illustrates how someone can initially bypass the blind scoring mechanism through ownership hacking. Additionally, the evaluation of knowledge bases is not easily tracked, making votes on highly complex reasoning questions equivalent to those on simpler queries. Furthermore, upon the release of a popular model, systematic attacks or boosting can be initiated through ownership hacking. In addition, assigning the same score to tie and both-bad votes can significantly change leaderboard positions; we recommend scoring a tie as 0.5 points and a both-bad vote as 0 points.
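Since the Bradley-Terry (BTL) model fits all pairwise outcomes jointly rather than sequentially, its ratings do not depend on vote order. The following is a rough sketch, assuming a simple minorization-maximization (Zermelo) fit and the tie/both-bad scoring recommended above; it is illustrative and not LMSys's actual ranking pipeline.

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 500, tol: float = 1e-9) -> np.ndarray:
    """Estimate Bradley-Terry strengths from a pairwise win-credit matrix.

    wins[i, j] = credit model i earned against model j: 1 per win,
    0.5 per tie (a common heuristic), and 0 for a "both-bad" vote,
    following the scoring recommended above. Uses the classic
    minorization-maximization (Zermelo) updates; the result does not
    depend on the order in which the votes were collected.
    """
    n = wins.shape[0]
    games = wins + wins.T                      # total credited games per pair
    strengths = np.ones(n) / n
    for _ in range(iters):
        updated = np.empty(n)
        for i in range(n):
            denom = sum(
                games[i, j] / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i and games[i, j] > 0
            )
            updated[i] = wins[i].sum() / denom if denom > 0 else strengths[i]
        updated /= updated.sum()
        if np.max(np.abs(updated - strengths)) < tol:
            strengths = updated
            break
        strengths = updated
    return strengths

# Toy example with three models; entries include 0.5-per-tie credits.
wins = np.array([
    [0.0, 8.5, 6.0],   # credit earned by model A against A, B, C
    [3.5, 0.0, 7.0],   # credit earned by model B
    [4.0, 3.0, 0.0],   # credit earned by model C
])
print(fit_bradley_terry(wins))  # normalized strengths, higher = stronger
```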
A.6.3 LLMs as Evaluators
Since human evaluation is time-consuming
(Laskar et al., 2023c,d) and difficult to reproduce,
the instruction-following capabilities of LLMs
have also inspired researchers to use certain LLMs
as a judge to evaluate the responses generated by
other LLMs (Chern et al., 2024; Fu et al., 2023b;
Gao et al., 2023a; Hada et al., 2023; Huang et al.,
2024a; Kenton et al., 2024; Kim et al., 2024b;
Kobayashi et al., 2024; Kocmi and Federmann,
2023; Lu et al., 2024; Luo et al., 2023; Perez
et al., 2022; Shankar et al., 2024). While prior work has mostly utilized general-purpose closed-source LLMs as judges, the recently proposed Prometheus 2 (Kim et al., 2024a) model is an open-source variant that is specifically trained for qualitative evaluation of model-generated responses and demonstrates a higher correlation with human judgments.
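As one possible shape for an LLM-as-a-judge setup, here is a minimal, client-agnostic sketch; the prompt wording, the 1-5 scale, and the generate callable are illustrative assumptions rather than the protocol of any of the cited works.

```python
import re
from typing import Callable, Optional

# Hypothetical judge prompt; the wording and scale are assumptions for illustration.
JUDGE_PROMPT = """You are an impartial judge. Rate the quality of the response
to the given instruction on a scale of 1 (very poor) to 5 (excellent).
First give a one-sentence explanation, then output the score as "Score: X".

Instruction: {instruction}
Response: {response}"""

def judge_response(generate: Callable[[str], str],
                   instruction: str, response: str) -> Optional[int]:
    """Ask a judge LLM to score a response; return the parsed 1-5 score.

    `generate` is any function that sends a prompt to a judge LLM and returns
    its text output (a closed-source API or an open judge such as Prometheus 2).
    Returns None if no score can be parsed, so failures can be routed to
    human review instead of being silently scored.
    """
    verdict = generate(JUDGE_PROMPT.format(instruction=instruction, response=response))
    match = re.search(r"Score:\s*([1-5])", verdict)
    return int(match.group(1)) if match else None
```

Asking the judge for a short explanation alongside the score also ties into the interpretability consideration discussed below.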
However, research by Wang et al. (2023b) and Shen et al. (2023) has highlighted potential limitations in using LLMs as evaluators, suggesting that while LLMs can excel in specific areas like translation quality and grammatical error correction (Kobayashi et al., 2024; Kocmi and Federmann, 2023), their effectiveness as evaluators may vary significantly across tasks. Moreover, using closed-source LLMs as evaluators also has associated costs. This highlights the ongoing debate and research into the capabilities and limitations of LLMs as evaluators in diverse linguistic domains. Therefore, to use LLMs as evaluators, it is important to consider the following:
• Consistency: Using a consistent combination of LLMs as judges or juries across evaluations to ensure consistency and reproducibility of assessments (see the aggregation sketch after this list).
• Bias and Hallucination Detection: Devel-
oping methods to identify and mitigate bias
and hallucinations in the outputs of LLM
judges/juries to ensure the reliability and ro-
bustness of the evaluation.
• Interpretability: Enhancing the inter-
pretability of LLM outputs (e.g., asking
LLMs to provide reasoning/explanations) to
improve understanding and trustworthiness
of the evaluation.
• Cost Efficiency: Advancing the develop-
ment of efficient LLMs to reduce costs.
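To connect these considerations, below is a small illustrative sketch of aggregating scores from a fixed jury of judge LLMs and flagging disagreements for human review; the aggregation rule and the disagreement threshold are assumptions for illustration, not an established standard from the literature.

```python
from statistics import mean
from typing import Dict, Optional, Tuple

def aggregate_jury(scores: Dict[str, Optional[int]],
                   max_spread: int = 1) -> Tuple[Optional[float], bool]:
    """Combine per-judge scores (1-5) from a fixed jury of judge LLMs.

    `scores` maps judge-model names to their parsed scores (None = parse failure).
    Returns (mean_score, needs_human_review): the case is flagged for human
    review when any judge failed to produce a score or when the judges
    disagree by more than `max_spread`, rather than trusting a single LLM.
    """
    valid = [s for s in scores.values() if s is not None]
    if not valid or len(valid) < len(scores):
        return None, True
    spread = max(valid) - min(valid)
    return mean(valid), spread > max_spread

# Example: three judges, one outlier triggers human review.
print(aggregate_jury({"judge-a": 4, "judge-b": 5, "judge-c": 2}))  # (3.67, True)
```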