M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark For Chinese Large Language Models

Abstract

Large language models have recently made remarkable progress in many aspects, e.g., cross-task generalization and instruction following. Comprehensively evaluating the capability of large language models in multiple tasks is of great importance. In this paper, we propose M3KE, a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark, which is developed to measure knowledge acquired by Chinese large language models by testing their multitask accuracy in zero- and few-shot settings. We have collected 20,477 questions from 71 tasks. Our selection covers all major levels of the Chinese education system, ranging from primary school to college, as well as a wide variety of subjects, including humanities, history, politics, law, education, psychology, science, technology, art and religion. All questions are multiple-choice questions with four options, hence guaranteeing a standardized and unified assessment process. We have assessed a number of state-of-the-art open-source Chinese large language models on the proposed benchmark. The size of these models varies from 335M to 130B parameters. Experiment results demonstrate that they perform significantly worse than GPT-3.5, which reaches an accuracy of ~48% on M3KE. The dataset is available at https://github.com/tjunlp-lab/M3KE.

1 Introduction

Large Language Models (LLMs) (Raffel et al., 2020; Xue et al., 2021; Zhang et al., 2022; Brown et al., 2020; Touvron et al., 2023; Scao et al., 2022; Zhao et al., 2023; Zhou et al., 2023) have achieved remarkable progress in recent years, especially with the release of ChatGPT (https://openai.com/blog/chatgpt), which is widely acknowledged to revolutionize the world of natural language processing and to transform AI and society (Altman, 2023; Bubeck et al., 2023; Huang et al., 2023). LLMs are usually pre-trained on massive textual data (Zhu et al., 2015; Liu et al., 2019b; Zellers et al., 2019; Gokaslan et al., 2019), which cover a wide range of genres, e.g., encyclopedias, news, books, social media, etc. Many studies have demonstrated that LLMs are able to acquire broad knowledge of many types and subjects (Zhao et al., 2023; Paperno et al., 2016; Hoffmann et al., 2022; Touvron et al., 2023; Rae et al., 2021; Raffel et al., 2020; Du et al., 2022a).

The paradigm for eliciting and applying the knowledge acquired by LLMs on downstream tasks has shifted from fine-tuning to instruction-tuning. Early LLMs usually adopt fine-tuning, which, however, suffers from a lack of cross-task generalization, as the fine-tuned LLMs are often task-specific, and is not parameter-efficient, as all pre-trained LLM parameters usually have to be updated for each downstream task. As LLMs reach the scale of billions of parameters, a more efficient alternative for eliciting knowledge, in-context learning (ICL) (Brown et al., 2020; Xie et al., 2022; Dong et al., 2023), has emerged, which uses only a few demonstration examples concatenated in a prompt. To enhance the cross-task generalization of LLMs across a variety of downstream tasks, instruction-tuning (Wei et al., 2022; Bach et al., 2022; Wang et al., 2022b), which is performed via multi-task learning (Chung et al., 2022; Liu et al., 2019a), has been proposed. In instruction-tuning, the instructions for different tasks are different but share a unified form. Supervised Fine-tuning (SFT) (Ouyang et al., 2022) and Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022) are successful methods of instruction-tuning, which not only achieve generalization to unseen instructions but also align LLMs with human values and intents (Sanh et al., 2022; Wei et al., 2022; Chung et al., 2022).

As the capability of knowledge acquisition and application in LLMs is constantly and rapidly evolving, a natural question arises: how can we assess such knowledge? Traditional single-task evaluation benchmarks (Rajpurkar et al., 2016; Khot et al., 2020) are no longer adequate for evaluating them. Multi-task benchmarks like GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019) and BIG-bench (Srivastava et al., 2022) aggregate multiple NLP tasks to evaluate LLMs, but they are not sufficient either to assess knowledge acquired by LLMs. To address this issue, Hendrycks et al. (2021) propose MMLU, a widely used benchmark to test the knowledge acquisition and application capability of LLMs, which uses test questions across multiple subjects that humans learn to assess LLMs in zero- and few-shot settings. As MMLU is an English benchmark, it cannot be directly used for measuring LLMs trained with data in other languages. Even if it is translated into other languages, like the way used in evaluating GPT-4 (OpenAI, 2023), there are still gaps in knowledge across different languages, as they usually have different education systems and knowledge structures.

Similar to LLMs in English, LLMs dedicated to Chinese have also achieved rapid advances recently (Du et al., 2022b; Zeng et al., 2021; Zhang et al., 2021; Sun et al., 2021; Zeng et al., 2022; Ren et al., 2023; Wu et al., 2021; Wang et al., 2021; Chen et al., 2023). However, a massive knowledge evaluation benchmark that measures Chinese LLMs in line with the Chinese education system is a desideratum. To bridge this gap, we propose M3KE, a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark, which is designed to measure the knowledge acquired by Chinese LLMs by testing their multitask accuracy in zero- and few-shot settings. M3KE contains 20,477 questions collected from 71 tasks. In particular, unlike the recent benchmarks MMCU (Zeng, 2023) and AGIEval (Zhong et al., 2023), M3KE covers all major levels of the Chinese education system, ranging from primary school to college, as well as a wide variety of subjects, including humanities, history, politics, law, education, psychology, science, technology, art and religion. All questions are multiple-choice questions with four options, hence ensuring a standardized and unified assessment process. Table 1 shows the comparison between M3KE and other related benchmarks.

Benchmark Language # Tasks # Questions
MMLU (Hendrycks et al., 2021) En 57 15,908
AGIEval (Zhong et al., 2023) En & Zh 20 8,062
MMCU (Zeng, 2023) Zh 51 11,900
M3KE Zh 71 20,477

Table 1: The comparison between M3KE and other related benchmarks.

With M3KE, we have tested recently released Chinese LLMs to track the progress of Chinese LLMs in knowledge acquisition and application. The evaluated models are either only pre-trained on massive data or pre-trained and then fine-tuned with SFT or RLHF. The model sizes vary from 335M to 130B parameters.

With extensive experiments, we observe that most evaluated Chinese LLMs have near random-chance accuracy, even for primary school tasks. The best performance is achieved by an SFT model built on the open-source BLOOM (Scao et al., 2022), which is 14.8 points lower than the accuracy of GPT-3.5-turbo.

Our main contributions are summarized as follows.

• We propose M3KE, a knowledge evaluation benchmark for Chinese LLMs, which to date covers the largest number of tasks in line with the Chinese education system.

• We have tested a wide range of open-source Chinese LLMs, with model sizes varying from 335M to 130B, against GPT-3.5-turbo.

• We have analyzed the performance of each model on different subject clusters and education levels in both zero- and five-shot settings.

2 Related Work

Chinese Large Language Models. Recent years have witnessed a rapid development of Chinese LLMs, following the efforts of their English counterparts, e.g., GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021) and LLaMA (Touvron et al., 2023). Chinese LLMs, such as Pangu-α with 200B parameters (Zeng et al., 2021), Yuan 1.0 with 245B parameters (Wu et al., 2021) and ERNIE 3.0 Titan with 260B parameters (Sun et al., 2021), have been trained on Chinese textual data containing from 180B to 329B tokens. These models are developed in industry and are usually not open-source.
With the success of open-source LLMs (Taori et al., 2023; Peng et al., 2023) based on LLaMA, Chinese counterparts, such as ChatGLM-6B (https://github.com/THUDM/ChatGLM-6B), MOSS (https://github.com/OpenLMLab/MOSS) and Phoenix (Chen et al., 2023), have emerged very recently. These models usually contain far fewer parameters than their industrial predecessors.

Most existing evaluation benchmarks (Wang et al., 2018, 2019; Srivastava et al., 2022; Xu et al., 2020) are designed to evaluate LLMs on various NLP tasks and are not tailored to knowledge acquisition and application assessment. To comprehensively measure knowledge in LLMs, MMLU (Hendrycks et al., 2021) is proposed, which collects multiple-choice questions from 57 tasks that humans learn. As a different education system is used in China, on the one hand, knowledge in Chinese LLMs may not be exhibited by a translated-into-Chinese version of MMLU, e.g., for Chinese Medicine or the Chinese Legal System. On the other hand, knowledge to be assessed in MMLU may be absent from the Chinese textual data used to train Chinese LLMs.

Our work is related to three datasets that have been developed concurrently with M3KE. MMCU (Zeng, 2023) is a Chinese benchmark that assesses knowledge in four domains: medicine, education, law, and psychology. AGIEval (Zhong et al., 2023) is a bilingual benchmark that measures the capability of LLMs on tasks from the Chinese college entrance exam and the American college admission test, targeting high-school graduates. DomMa (Gu et al., 2023) is another Chinese benchmark that focuses on domain-specific knowledge. In contrast to these benchmarks, M3KE is a comprehensive Chinese benchmark that spans major stages of the Chinese education system, from primary school to college, with a broader range of subject categories, such as art, religion, traditional Chinese medicine, and classical literature.

3 M3KE

M3KE covers major Chinese education levels, including primary school, middle school, high school, college and professional exams, as well as multiple tasks, as shown in Figure 1, while the detailed subjects are listed in Appendix A. We collect and organize multiple-choice questions from public websites. To ensure the quality and comprehensiveness of the questions, entrance exam questions are selected as much as possible. For the primary school, middle school and high school education levels, we choose the subjects according to the corresponding entrance exams for Chinese students. For the college level, we select subjects according to the national entrance exam for master's degrees in China. In addition to subjects under the major Chinese education system, we also collect comprehensive tasks to expand the knowledge coverage of M3KE, including the computer grade exam, ancient Chinese language, novels and the Chinese national civil service exam, which covers commonsense knowledge, arts, religion, etc.

Figure 1: The distribution of tasks in M3KE.

In total, we have 71 tasks and 20,477 questions. We divide each task into a test set and a few-shot set, where the few-shot set includes 5 questions per task for the few-shot evaluation setting. The test set includes 20,122 questions, and each task contains at least 100 questions. Instances of M3KE are listed in Table 2.
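To make the task organization concrete, the sketch below shows how a single M3KE task could be loaded and split into its 5-question few-shot set and its test set. The JSON field names and file layout here are illustrative assumptions rather than the released format (see https://github.com/tjunlp-lab/M3KE for the actual files).

import json

def load_task(path):
    # Load one task file, assumed here to hold a list of multiple-choice records.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def split_task(records, num_shots=5):
    # Split a task into its few-shot set (5 questions) and its test set.
    return records[:num_shots], records[num_shots:]

# A hypothetical record with a question, four options and a gold answer letter.
example_record = {
    "question": "一个示例问题?",
    "A": "选项一",
    "B": "选项二",
    "C": "选项三",
    "D": "选项四",
    "answer": "A",  # illustrative gold label, not a dataset annotation
}

Under this assumed layout, the 5 few-shot questions per task would be the ones concatenated as demonstrations in the five-shot setting described later in Section 4.2.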
Arts & Humanities
下面关于拉斯科洞穴壁画说法错误的是? (Which statement about the Lascaux cave murals is incorrect?)
A 这个壁画是在法国发现的 (This fresco was found in France)
B 发现的动物形象有100多个 (There are more than 100 animal images found)
C 发现的时间为1940年 (The discovery was made in 1940)
D 壁画颜色以黑色为主 (Mural color is mainly black)

Social Sciences
甲欲杀乙,将毒药投入乙的饭食中.乙服食后,甲后悔,赶紧说明情况,并将乙送往医院抢救.医院在抢救过程中检查发现,甲所投放的"毒药"根本没有毒性,乙安然无恙.甲的行为属于? (A wants to kill B, and puts poison into B's food. After B consumed it, A regretted it and rushed to explain the situation and sent B to the hospital for rescue. The hospital found that the poison was not toxic at all and B was unharmed. A's behavior belongs to?)
A 不构成犯罪 (Not a crime)
B 犯罪未遂 (Attempted crime)
C 犯罪中止 (Crime suspension)
D 犯罪既遂 (Crime reached)

Natural Sciences
使用普鲁卡因麻醉神经纤维,影响了神经纤维传导兴奋的哪一项特征? (Which characteristic of nerve fiber conduction excitation is affected by the use of procaine anesthesia?)
A 生理完整性 (Physiological integrity)
B 绝缘性 (Insulation)
C 双向传导性 (Bidirectional conduction)
D 相对不疲劳性 (Relative non-fatigability)

Other
以前有几项研究表明,食用巧克力会增加食用者患心脏病的可能性。而一项最新的、更为可靠的研究得出的结论是:食用巧克力与心脏病发病率无关。估计这项研究成果公布以后,巧克力的消费量将会大大增加。上述推论基于以下哪项假设? (Several studies have previously suggested that consuming chocolate increases the likelihood of developing heart disease. However, a recent and more reliable study concluded that there is no association between chocolate consumption and incidence of heart disease. It is estimated that the consumption of chocolate will significantly increase after the publication of this research. The above inference is based on the assumption that the reliability of the previous studies was lower than that of the latest study.)
A 尽管有些人知道食用巧克力会增加患心脏病的可能性,却照样大吃特吃 (Although some people are aware that consuming chocolate increases the likelihood of developing heart disease, they still indulge in it.)
B 人们从来也不相信进食巧克力会更容易患心脏病的说法 (People have never believed the claim that eating chocolate makes it more likely to develop heart disease.)
C 现在许多人吃巧克力是因为他们没有听过巧克力会导致心脏病的说法 (Nowadays, many people eat chocolate because they have not heard of the claim that chocolate can lead to heart disease.)
D 现在许多人不吃巧克力完全是因为他们相信巧克力会诱发心脏病 (Nowadays, many people abstain from eating chocolate solely because they believe that chocolate can trigger heart disease.)

Table 2: Examples from M3KE. Bolded items represent correct answers. Examples from top to bottom are from the Fine Arts, Criminal Jurisprudence, Animal Physiology and Chinese Civil Service Examination tasks, respectively.
Arts & Humanities Social Sciences Natural Sciences Other
# Tasks 12 21 31 7
# Questions 3,612 6,222 8,162 2,126
Avg. # questions per task 301 296 263 303
Max. # questions per task 352 374 347 425
Min. # questions per task 190 190 100 129
Avg. # tokens per question 30.33 38.75 38.54 33.21
Avg. # tokens per answer choice 53.92 30.99 44.57 52.53

Table 3: Overall statistics of M3KE across the four subject clusters.
3.1 Arts & Humanities

Arts & Humanities comprise a range of disciplines that cover Chinese, literature, arts and history. These disciplines focus on the analysis and interpretation of literary and cultural artifacts, rather than on practical applications. For instance, the Chinese subject in primary school aims to evaluate students' proficiency in language use and literary appreciation for ages 7 to 13, such as the usage of synonyms and antonyms. The historical studies cover both Chinese and world history, from ancient to modern times. M3KE also incorporates artistic subjects, such as dance, fine arts, music and film, because we believe that art is an essential aspect of human culture and should be relevant to LLMs as well.

3.2 Social Sciences

Social sciences differ from Arts & Humanities in that they emphasize practical aspects of humanistic studies, such as law, politics, education and psychology. These subjects are mainly taught at the college level. Although ideological and political courses are also part of the Chinese middle school and high school curriculum, they primarily involve moral education. Social sciences also encompass economic and management studies, which largely consist of questions from the joint exams for graduate students majoring in these fields in China. These studies include microeconomics, macroeconomics, management and logic at the undergraduate level.

3.3 Natural Sciences

Natural sciences encompass engineering, science, medicine and fundamental disciplines such as math, physics, chemistry and biology. These subjects often require a high degree of computation, analysis and logical reasoning skills. The same subject may assess different types of knowledge at different levels according to the Chinese education system. For instance, primary school math mainly tests basic arithmetic operations, while high school math covers more advanced mathematical concepts, such as sequences, derivatives and geometry.

3.4 Other

Other types of tasks include religion, the Chinese civil service exam, and specialized tasks, like ancient Chinese language and the novel reasoning task. These tasks require knowledge that is not limited to a single level or subject as described above. The Chinese civil service exam involves knowledge of commonsense, humanities, logic and other domains, which we consider an assessment of the comprehensive knowledge of LLMs. Similarly, the questions in the novel task draw on information from many classical novels.

3.5 Overall Statistics

Table 3 shows the overall statistics of M3KE. The numbers of tasks in the four subject clusters described above are 12, 21, 31 and 7, respectively, while the numbers of questions in the four subject clusters are 3,612, 6,222, 8,162 and 2,126, respectively. The maximum number of questions per task is 425, while the minimum is 100. Questions in the social and natural sciences are usually longer than those in arts & humanities and other, while their answer choices are shorter.

4 Experiments

We assessed state-of-the-art large language models recently developed for Chinese on M3KE, attempting to understand and track the progress of Chinese LLMs in learning and applying knowledge from massive data.

4.1 Assessed Models

The assessed Chinese LLMs can be divided into two categories: models that are only pre-trained and models that are further instruction-tuned with SFT/RLHF. For the former, we selected GLM-335M (Du et al., 2022b), GLM-10B (Du et al., 2022b), GLM-130B (Zeng et al., 2022) and BLOOM-7.1B (Scao et al., 2022).
Models Arts & Humanities Social Sciences Natural Sciences Other Average
GLM-335M 0.070 0.046 0.084 0.044 0.062
BLOOM-7.1B 0.163 0.159 0.161 0.158 0.161
GLM-10B 0.180 0.229 0.219 0.150 0.197
GLM-130B 0.326 0.352 0.274 0.359 0.328
ChatGLM-6B 0.246 0.267 0.168 0.263 0.236
MOSS-SFT-16B 0.260 0.263 0.207 0.275 0.251
BELLE-7B-0.2M 0.247 0.296 0.260 0.260 0.266
BELLE-7B-2M 0.328 0.367 0.282 0.355 0.333
GPT-3.5-turbo 0.460 0.538 0.444 0.481 0.481
Table 4: Average zero-shot accuracy for each model on the four subject clusters.
Models Arts & Humanities Social Sciences Natural Sciences Other Average
GLM-335M 0.220 0.247 0.193 0.126 0.196
BLOOM-7.1B 0.247 0.260 0.235 0.246 0.247
GLM-10B 0.294 0.304 0.232 0.211 0.260
GLM-130B 0.297 0.329 0.246 0.228 0.275
ChatGLM-6B 0.188 0.175 0.121 0.198 0.171
MOSS-SFT-16B 0.266 0.264 0.258 0.284 0.268
BELLE-7B-0.2M 0.292 0.327 0.273 0.307 0.299
BELLE-7B-2M 0.287 0.309 0.284 0.313 0.298
GPT-3.5-turbo 0.453 0.540 0.464 0.476 0.483
Table 5: Average five-shot accuracy for each model on the four subject clusters.
For the latter, we included ChatGLM-6B (https://github.com/THUDM/ChatGLM-6B), MOSS-SFT-16B (https://huggingface.co/fnlp/moss-moon-003-sft) and BELLE-7B (Yunjie Ji and Li, 2023), where BELLE-7B is the SFT version based on BLOOMZ-7.1B-MT (Muennighoff et al., 2022). We used the two variants of BELLE fine-tuned on 200K and 2M instructions, namely BELLE-7B-0.2M (https://huggingface.co/BelleGroup/BELLE-7B-0.2M) and BELLE-7B-2M (https://huggingface.co/BelleGroup/BELLE-7B-2M). We also evaluated GPT-3.5-turbo (https://openai.com/product) from OpenAI as a reference.

4.2 Prompts

All models were tested in an n-shot setting with a unified prompt, where n is an integer from 0 to 5. For the zero-shot setting (i.e., n = 0), the unified prompt provided to all models is "Please choose the correct option from 'A', 'B', 'C', 'D' based on the following question". For the few-shot setting (i.e., n > 0), the unified prompt is "Please choose the correct option from 'A', 'B', 'C', 'D' based on the following examples and question". The input to all LLMs consists of the prompt, the question, the answer choices and a suffix, which is "the correct option is: ". Even though we tell models to output only the correct answer choice indicator (i.e., one of {A, B, C, D}) in the prompt, not all models follow this instruction. Sometimes they output both the answer choice indicator and a rationale for it (the order of these two types of output is random). We hence keep only the output answer choice indicator as the final answer when calculating accuracy.
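As a minimal sketch of this protocol, the snippet below assembles the zero- and few-shot prompts described above and keeps only the answer choice indicator found in a model's output; the helper names and the exact formatting of questions and demonstrations are our assumptions, not the authors' released evaluation code.

import re

ZERO_SHOT_INSTRUCTION = ("Please choose the correct option from 'A', 'B', 'C', 'D' "
                         "based on the following question")
FEW_SHOT_INSTRUCTION = ("Please choose the correct option from 'A', 'B', 'C', 'D' "
                        "based on the following examples and question")
SUFFIX = "the correct option is: "

def format_record(record, with_answer=False):
    # Render one multiple-choice record as the question followed by its four options.
    text = record["question"] + "\n" + "\n".join(
        f"{key}. {record[key]}" for key in ("A", "B", "C", "D"))
    if with_answer:  # demonstrations in the few-shot prompt carry their gold answers
        text += "\n" + SUFFIX + record["answer"]
    return text

def build_prompt(record, demonstrations=()):
    # Concatenate the instruction, any demonstrations, the question and the suffix.
    instruction = FEW_SHOT_INSTRUCTION if demonstrations else ZERO_SHOT_INSTRUCTION
    parts = [instruction]
    parts += [format_record(d, with_answer=True) for d in demonstrations]
    parts += [format_record(record), SUFFIX]
    return "\n\n".join(parts)

def extract_choice(model_output):
    # Keep only the first answer choice indicator; models may also emit a rationale.
    match = re.search(r"[ABCD]", model_output)
    return match.group(0) if match else None

Accuracy is then the fraction of test questions for which the extracted indicator matches the gold answer.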
Models Primary School Middle School High School College Other Average
GLM-335M 0.075 0.099 0.099 0.054 0.046 0.075
BLOOM-7.1B 0.173 0.142 0.173 0.160 0.164 0.163
GLM-10B 0.190 0.199 0.197 0.213 0.152 0.190
GLM-130B 0.243 0.303 0.229 0.324 0.359 0.292
ChatGLM-6B 0.180 0.243 0.191 0.213 0.250 0.216
MOSS-SFT-16B 0.224 0.223 0.213 0.242 0.260 0.232
BELLE-7B-0.2M 0.233 0.269 0.259 0.268 0.263 0.258
BELLE-7B-2M 0.248 0.313 0.263 0.332 0.349 0.301
GPT-3.5-turbo 0.328 0.403 0.395 0.509 0.484 0.435
Table 6: Average zero-shot accuracy for each model on five major education levels.
Models Primary School Middle School High School College Other Average
GLM-335M 0.206 0.229 0.232 0.223 0.114 0.201
BLOOM-7.1B 0.262 0.222 0.245 0.249 0.246 0.245
GLM-10B 0.229 0.263 0.270 0.278 0.197 0.248
GLM-130B 0.268 0.293 0.272 0.294 0.208 0.267
ChatGLM-6B 0.089 0.150 0.137 0.155 0.196 0.146
MOSS-SFT-16B 0.272 0.223 0.263 0.266 0.281 0.261
BELLE-7B-0.2M 0.260 0.256 0.273 0.298 0.310 0.280
BELLE-7B-2M 0.258 0.264 0.268 0.306 0.299 0.279
GPT-3.5-turbo 0.308 0.565 0.373 0.517 0.475 0.448
Table 7: Average five-shot accuracy for each model on five major education levels.
4.3 Results

We compared the zero-shot accuracy of each model in Table 4 in terms of subject clusters. For the pre-trained models, there is a clear positive correlation between accuracy and model size: the model with 130B parameters significantly outperforms the models with 335M/7B/10B parameters, even though they have different backbones. The accuracy of GPT-3.5-turbo is significantly higher than that of the evaluated Chinese LLMs and currently provides an upper bound for open-source Chinese LLMs. All pre-trained LLMs with ≤ 10B parameters achieve an accuracy lower than random chance (i.e., 25%), indicating that the knowledge acquired by these models is not adequate for M3KE. In addition, we observe that the number of instructions used for SFT is an important factor, as the BELLE model fine-tuned with 2M instructions is significantly better than the one fine-tuned with 0.2M instructions. The zero-shot performance of GPT-3.5-turbo is much higher than that of the compared open-source Chinese LLMs, but still below 50% accuracy, suggesting that M3KE is a very challenging benchmark.

We further compared the accuracy of different models under the 5-shot setting; results are shown in Table 5. For the pre-trained models, ICL in the few-shot setting significantly improves performance, and the smaller the pre-trained model is, the larger the achieved improvement. The exception is GLM-130B, which performs significantly worse under the 5-shot setting than under the zero-shot setting. We conjecture that GLM-130B is already able to understand questions without examples, because it uses instances in instruction format as part of its pre-training corpus (Zeng et al., 2022), so demonstrations may interfere with the final prediction of the model. The 5-shot results of the SFT models are mixed in comparison to those in the zero-shot setting. We find that for ChatGLM-6B and BELLE-7B-2M, 5-shot is worse than the zero-shot setting, similar to the results observed on GLM-130B. In contrast, 5-shot has a positive impact on MOSS-SFT-16B and BELLE-7B-0.2M. As these models differ from each other in terms of model size, training data, instruction data, etc., we leave an in-depth analysis of these mixed results to future work.

We finally provide the results of each model on different education levels in Table 6 for the zero-shot setting and Table 7 for the few-shot setting. Interestingly, we observe that LLMs do not reach higher performance at lower education levels than at higher education levels, not even GPT-3.5-turbo. This suggests that tasks from lower education levels remain challenging for these state-of-the-art Chinese LLMs.
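The averages in Tables 4-7 aggregate accuracy over subject clusters and education levels; the sketch below shows one plausible aggregation, a macro-average over per-task accuracies, which is an assumption about the exact procedure and is illustrated with dummy numbers rather than measured results.

from collections import defaultdict

def average_by(task_results, group_key):
    # Average per-task accuracy grouped by "cluster" or "level".
    buckets = defaultdict(list)
    for info in task_results.values():
        buckets[info[group_key]].append(info["accuracy"])
    return {group: sum(values) / len(values) for group, values in buckets.items()}

# Dummy per-task accuracies for illustration only (not measured numbers).
task_results = {
    "Fine Arts": {"accuracy": 0.30, "cluster": "Arts & Humanities", "level": "Other"},
    "Civil Law": {"accuracy": 0.25, "cluster": "Social Sciences", "level": "College"},
    "Physiology": {"accuracy": 0.40, "cluster": "Natural Sciences", "level": "College"},
}
print(average_by(task_results, "cluster"))
print(average_by(task_results, "level"))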
5 Conclusion

We have presented a new benchmark, M3KE, to assess the capability of Chinese LLMs in learning and applying knowledge in multiple subjects at multiple levels of the Chinese education system. M3KE contains 71 tasks and 20,477 questions. We find that all evaluated state-of-the-art open-source Chinese LLMs significantly lag behind GPT-3.5. We hope that this benchmark can be used to track and promote further progress in Chinese LLMs.
References 30: Annual Conference on Neural Information Pro-
cessing Systems 2017, December 4-9, 2017, Long
Sam Altman. 2023. Planning for agi and beyond. Ope- Beach, CA, USA, pages 4299–4307.
nAI Blog.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret
Stephen H. Bach, Victor Sanh, Zheng Xin Yong, Al- Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang,
bert Webson, Colin Raffel, Nihal V. Nayak, Ab- Mostafa Dehghani, Siddhartha Brahma, Albert Web-
heesht Sharma, Taewoon Kim, M. Saiful Bari, son, Shixiang Shane Gu, Zhuyun Dai, Mirac Suz-
Thibault Févry, Zaid Alyafeai, Manan Dey, An- gun, Xinyun Chen, Aakanksha Chowdhery, Sha-
drea Santilli, Zhiqing Sun, Srulik Ben-David, Can- ran Narang, Gaurav Mishra, Adams Yu, Vincent Y.
wen Xu, Gunjan Chhablani, Han Wang, Jason Alan Zhao, Yanping Huang, Andrew M. Dai, Hongkun
Fries, Maged Saeed AlShaibani, Shanya Sharma, Ur- Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin,
mish Thakker, Khalid Almubarak, Xiangru Tang, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason
Dragomir R. Radev, Mike Tian-Jian Jiang, and Wei. 2022. Scaling instruction-finetuned language
Alexander M. Rush. 2022. Promptsource: An inte- models. CoRR, abs/2210.11416.
grated development environment and repository for
natural language prompts. In ACL (demo), pages 93– Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiy-
104. Association for Computational Linguistics. ong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei
Li, and Zhifang Sui. 2023. A survey for in-context
Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari learning. CoRR, abs/2301.00234.
Morcos, Shashank Shekhar, Tom Goldstein, Florian
Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong,
Tian, et al. 2023. A cookbook of self-supervised Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun,
learning. arXiv preprint arXiv:2304.12210. Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret
Zoph, Liam Fedus, Maarten P. Bosma, Zongwei
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Zhou, Tao Wang, Yu Emma Wang, Kellie Webster,
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Marie Pellat, Kevin Robinson, Kathleen S. Meier-
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Hellstern, Toju Duke, Lucas Dixon, Kun Zhang,
Askell, Sandhini Agarwal, Ariel Herbert-Voss, Quoc V. Le, Yonghui Wu, Zhifeng Chen, and Claire
Gretchen Krueger, Tom Henighan, Rewon Child, Cui. 2022a. Glam: Efficient scaling of language
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, models with mixture-of-experts. In International
Clemens Winter, Christopher Hesse, Mark Chen, Conference on Machine Learning, ICML 2022, 17-
Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin 23 July 2022, Baltimore, Maryland, USA, pages
Chess, Jack Clark, Christopher Berner, Sam Mc- 5547–5569.
Candlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-shot learn- Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding,
ers. In Advances in Neural Information Processing Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022b.
Systems 33: Annual Conference on Neural Informa- GLM: General language model pretraining with au-
tion Processing Systems 2020, NeurIPS 2020, De- toregressive blank infilling. In Proceedings of the
cember 6-12, 2020, virtual. 60th Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
S’ebastien Bubeck, Varun Chandrasekaran, Ronen El- 320–335, Dublin, Ireland. Association for Computa-
dan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Pe- tional Linguistics.
ter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg,
Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Aaron Gokaslan, Vanya Cohen Ellie Pavlick, and Ste-
and Yi Zhang. 2023. Sparks of artificial general in- fanie Tellex. 2019. Openwebtext corpus. http:
telligence: Early experiments with gpt-4. volume //Skylion007.github.io/OpenWebTextCorpus.
abs/2303.12712.
Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang,
Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Zhuozhi Xiong, Zihan Li, Qianyu He, Sihang Jiang,
Dai, Philip S Yu, and Lichao Sun. 2023. A com- Hongwei Feng, and Yanghua Xiao. 2023. Do-
prehensive survey of ai-generated content (aigc): A main mastery benchmark: An ever-updating bench-
history of generative ai from gan to chatgpt. arXiv mark for evaluating holistic domain knowledge of
preprint arXiv:2303.04226. large language model–a preliminary release. arXiv
preprint arXiv:2304.11679.
Zhihong Chen, Feng Jiang, Junying Chen, Tiannan
Wang, Fei Yu, Guiming Chen, Hongbo Zhang, Dan Hendrycks, Collin Burns, Steven Basart, Andy
Juhao Liang, Chen Zhang, Zhiyi Zhang, et al. 2023. Zou, Mantas Mazeika, Dawn Song, and Jacob Stein-
Phoenix: Democratizing chatgpt across languages. hardt. 2021. Measuring massive multitask language
arXiv preprint arXiv:2304.10453. understanding. In ICLR. OpenReview.net.
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch,
Martic, Shane Legg, and Dario Amodei. 2017. Deep Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
reinforcement learning from human preferences. In Diego de Las Casas, Lisa Anne Hendricks, Johannes
Advances in Neural Information Processing Systems Welbl, Aidan Clark, Tom Hennigan, Eric Noland,
Katie Millican, George van den Driessche, Bogdan Baolin Peng, Chunyuan Li, Pengcheng He, Michel Gal-
Damoc, Aurelia Guy, Simon Osindero, Karen Si- ley, and Jianfeng Gao. 2023. Instruction tuning with
monyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, gpt-4. arXiv preprint arXiv:2304.03277.
and Laurent Sifre. 2022. Training compute-optimal
large language models. abs/2203.15556. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie
Millican, Jordan Hoffmann, H. Francis Song, John
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Aslanides, Sarah Henderson, Roman Ring, Susan-
Saksham Singhal, Shuming Ma, Tengchao Lv, Lei nah Young, Eliza Rutherford, Tom Hennigan, Ja-
Cui, Owais Khan Mohammed, Barun Patra, Qiang cob Menick, Albin Cassirer, Richard Powell, George
Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, van den Driessche, Lisa Anne Hendricks, Mari-
Vishrav Chaudhary, Subhojit Som, Xia Song, and beth Rauh, Po-Sen Huang, Amelia Glaese, Jo-
Furu Wei. 2023. Language is not all you need: hannes Welbl, Sumanth Dathathri, Saffron Huang,
Aligning perception with language models. CoRR, Jonathan Uesato, John Mellor, Irina Higgins, An-
abs/2302.14045. tonia Creswell, Nat McAleese, Amy Wu, Erich
Elsen, Siddhant M. Jayakumar, Elena Buchatskaya,
Tushar Khot, Peter Clark, Michal Guerquin, Peter David Budden, Esme Sutherland, Karen Simonyan,
Jansen, and Ashish Sabharwal. 2020. QASC: A Michela Paganini, Laurent Sifre, Lena Martens,
dataset for question answering via sentence com- Xiang Lorraine Li, Adhiguna Kuncoro, Aida
position. In The Thirty-Fourth AAAI Conference Nematzadeh, Elena Gribovskaya, Domenic Do-
on Artificial Intelligence, AAAI 2020, The Thirty- nato, Angeliki Lazaridou, Arthur Mensch, Jean-
Second Innovative Applications of Artificial Intelli- Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grig-
gence Conference, IAAI 2020, The Tenth AAAI Sym- orev, Doug Fritz, Thibault Sottiaux, Mantas Pa-
posium on Educational Advances in Artificial Intel- jarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama,
ligence, EAAI 2020, New York, NY, USA, February Cyprien de Masson d’Autume, Yujia Li, Tay-
7-12, 2020, pages 8082–8090. fun Terzi, Vladimir Mikulik, Igor Babuschkin,
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian- Aidan Clark, Diego de Las Casas, Aurelia Guy,
feng Gao. 2019a. Multi-task deep neural networks Chris Jones, James Bradbury, Matthew J. Johnson,
for natural language understanding. In ACL (1), Blake A. Hechtman, Laura Weidinger, Iason Gabriel,
pages 4487–4496. Association for Computational William S. Isaac, Edward Lockhart, Simon Osin-
Linguistics. dero, Laura Rimell, Chris Dyer, Oriol Vinyals, Ka-
reem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Hassabis, Koray Kavukcuoglu, and Geoffrey Irv-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, ing. 2021. Scaling language models: Methods,
Luke Zettlemoyer, and Veselin Stoyanov. 2019b. analysis & insights from training gopher. CoRR,
Roberta: A robustly optimized BERT pretraining ap- abs/2112.11446.
proach. CoRR, abs/1907.11692.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Adam Roberts, Stella Biderman, Teven Le Scao, Wei Li, and Peter J. Liu. 2020. Exploring the limits
M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hai- of transfer learning with a unified text-to-text trans-
ley Schoelkopf, Xiangru Tang, Dragomir Radev, former. J. Mach. Learn. Res., pages 140:1–140:67.
Alham Fikri Aji, Khalid Almubarak, Samuel Al-
banie, Zaid Alyafeai, Albert Webson, Edward Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
Raff, and Colin Raffel. 2022. Crosslingual gen- Percy Liang. 2016. Squad: 100, 000+ questions for
eralization through multitask finetuning. CoRR, machine comprehension of text. In Proceedings of
abs/2211.01786. the 2016 Conference on Empirical Methods in Nat-
ural Language Processing, EMNLP 2016, Austin,
OpenAI. 2023. Gpt-4 technical report. OpenAI. Texas, USA, November 1-4, 2016, pages 2383–2392.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car- Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing
roll L. Wainwright, Pamela Mishkin, Chong Zhang, Huang, Yadao Wang, Weichao Wang, Pengfei Li,
Sandhini Agarwal, Katarina Slama, Alex Ray, John Xiaoda Zhang, Alexander Podolskiy, Grigory Arshi-
Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, nov, Andrey Bout, Irina Piontkovskaya, Jiansheng
Maddie Simens, Amanda Askell, Peter Welinder, Wei, Xin Jiang, Teng Su, Qun Liu, and Jun Yao.
Paul F. Christiano, Jan Leike, and Ryan Lowe. 2023. Pangu-Σ: Towards trillion parameter lan-
2022. Training language models to follow instruc- guage model with sparse heterogeneous computing.
tions with human feedback. CoRR, abs/2203.02155. CoRR, abs/2303.10845.
Denis Paperno, Germán Kruszewski, Angeliki Lazari- Victor Sanh, Albert Webson, Colin Raffel, Stephen H.
dou, Quan Ngoc Pham, Raffaella Bernardi, San- Bach, Lintang Sutawika, Zaid Alyafeai, Antoine
dro Pezzelle, Marco Baroni, Gemma Boleda, and Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey,
Raquel Fernández. 2016. The LAMBADA dataset: M Saiful Bari, Canwen Xu, Urmish Thakker,
Word prediction requiring a broad discourse context. Shanya Sharma Sharma, Eliza Szczechla, Taewoon
In ACL (1). The Association for Computer Linguis- Kim, Gunjan Chhablani, Nihal V. Nayak, De-
tics. bajyoti Datta, Jonathan Chang, Mike Tian-Jian
Jiang, Han Wang, Matteo Manica, Sheng Shen, Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai
Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Yu, Hao Tian, Hua Wu, and Haifeng Wang. 2021.
Thomas Wang, Trishala Neeraj, Jos Rozen, Ab- ERNIE 3.0: Large-scale knowledge enhanced pre-
heesht Sharma, Andrea Santilli, Thibault Févry, Ja- training for language understanding and generation.
son Alan Fries, Ryan Teehan, Teven Le Scao, Stella CoRR, abs/2107.02137.
Biderman, Leo Gao, Thomas Wolf, and Alexan-
der M. Rush. 2022. Multitask prompted training Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann
enables zero-shot task generalization. In The Tenth Dubois, Xuechen Li, Carlos Guestrin, Percy Liang,
International Conference on Learning Representa- and Tatsunori B. Hashimoto. 2023. Stanford al-
tions, ICLR 2022, Virtual Event, April 25-29, 2022. paca: An instruction-following llama model. https:
OpenReview.net. //github.com/tatsu-lab/stanford_alpaca.
Teven Le Scao, Angela Fan, Christopher Akiki, El- Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
lie Pavlick, Suzana Ilic, Daniel Hesslow, Ro- Martinet, Marie-Anne Lachaux, Timothée Lacroix,
man Castagné, Alexandra Sasha Luccioni, François Baptiste Rozière, Naman Goyal, Eric Hambro,
Yvon, Matthias Gallé, Jonathan Tow, Alexan- Faisal Azhar, Aurélien Rodriguez, Armand Joulin,
der M. Rush, Stella Biderman, Albert Webson, Edouard Grave, and Guillaume Lample. 2023.
Pawan Sasanka Ammanamanchi, Thomas Wang, Llama: Open and efficient foundation language mod-
Benoît Sagot, Niklas Muennighoff, Albert Villanova els. CoRR.
del Moral, Olatunji Ruwase, Rachel Bawden, Stas
Bekman, Angelina McMillan-Major, Iz Beltagy, Alex Wang, Yada Pruksachatkun, Nikita Nangia,
Huu Nguyen, Lucile Saulnier, Samson Tan, Pe- Amanpreet Singh, Julian Michael, Felix Hill, Omer
dro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Levy, and Samuel R. Bowman. 2019. Superglue: A
Yacine Jernite, Julien Launay, Margaret Mitchell, stickier benchmark for general-purpose language un-
Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor derstanding systems. In NeurIPS, pages 3261–3275.
Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Alex Wang, Amanpreet Singh, Julian Michael, Fe-
Ariel Kreisberg Nitzav, Canwen Xu, Chenghao lix Hill, Omer Levy, and Samuel R. Bowman.
Mou, Chris Emezue, Christopher Klamm, Colin 2018. GLUE: A multi-task benchmark and anal-
Leong, Daniel van Strien, David Ifeoluwa Ade- ysis platform for natural language understand-
lani, and et al. 2022. BLOOM: A 176b-parameter ing. In Proceedings of the Workshop: Analyzing
open-access multilingual language model. CoRR, and Interpreting Neural Networks for NLP, Black-
abs/2211.05100. boxNLP@EMNLP 2018, Brussels, Belgium, Novem-
ber 1, 2018, pages 353–355. Association for Com-
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, putational Linguistics.
Abu Awal Md Shoeb, Abubakar Abid, Adam
Fisch, Adam R. Brown, Adam Santoro, Aditya Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu,
Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Siyu Ding, Weibao Gong, Shikun Feng, Junyuan
Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Shang, Yanbin Zhao, Chao Pang, Jiaxiang Liu, Xuyi
Alex Ray, Alex Warstadt, Alexander W. Kocurek, Chen, Yuxiang Lu, Weixin Liu, Xi Wang, Yangfan
Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Par- Bai, Qiuliang Chen, Li Zhao, Shiyong Li, Peng Sun,
rish, Allen Nie, Aman Hussain, Amanda Askell, Dianhai Yu, Yanjun Ma, Hao Tian, Hua Wu, Tian
Amanda Dsouza, Ameet Rahane, Anantharaman S. Wu, Wei Zeng, Ge Li, Wen Gao, and Haifeng Wang.
Iyer, Anders Andreassen, Andrea Santilli, Andreas 2021. ERNIE 3.0 titan: Exploring larger-scale
Stuhlmüller, Andrew M. Dai, Andrew La, An- knowledge enhanced pre-training for language un-
drew K. Lampinen, Andy Zou, Angela Jiang, Angel- derstanding and generation. CoRR, abs/2112.12731.
ica Chen, Anh Vuong, Animesh Gupta, Anna Got-
tardi, Antonio Norelli, Anu Venkatesh, Arash Gho- Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Al-
lamidavoodi, Arfa Tabassum, Arul Menezes, Arun isa Liu, Noah A. Smith, Daniel Khashabi, and Han-
Kirubarajan, Asher Mullokandov, Ashish Sabhar- naneh Hajishirzi. 2022a. Self-instruct: Aligning lan-
wal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla guage model with self generated instructions. CoRR,
Karakas, and et al. 2022. Beyond the imitation abs/2212.10560.
game: Quantifying and extrapolating the capabilities
of language models. CoRR, abs/2206.04615. Yizhong Wang, Swaroop Mishra, Pegah Alipoormo-
labashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Naik, Arjun Ashok, Arut Selvan Dhanasekaran, An-
Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, jana Arunkumar, David Stap, Eshaan Pathak, Gian-
Dario Amodei, and Paul F. Christiano. 2020. Learn- nis Karamanolakis, Haizhi Gary Lai, Ishan Puro-
ing to summarize from human feedback. CoRR, hit, Ishani Mondal, Jacob Anderson, Kirby Kuz-
abs/2009.01325. nia, Krima Doshi, Kuntal Kumar Pal, Maitreya Pa-
tel, Mehrad Moradshahi, Mihir Parmar, Mirali Puro-
Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, hit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit
Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Verma, Ravsehaj Singh Puri, Rushang Karia, Savan
Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhi- Doshi, Shailaja Keyur Sampat, Siddhartha Mishra,
hua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Sujan Reddy A, Sumanta Patro, Tanay Dixit, and
Xudong Shen. 2022b. Super-naturalinstructions: Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang,
Generalization via declarative instructions on 1600+ Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu,
NLP tasks. In Proceedings of the 2022 Conference Wendi Zheng, Xiao Xia, Weng Lam Tam, Zix-
on Empirical Methods in Natural Language Process- uan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen,
ing, EMNLP 2022, Abu Dhabi, United Arab Emi- Peng Zhang, Yuxiao Dong, and Jie Tang. 2022.
rates, December 7-11, 2022, pages 5085–5109. GLM-130B: an open bilingual pre-trained model.
abs/2210.02414.
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin
Guu, Adams Wei Yu, Brian Lester, Nan Du, An- Hui Zeng. 2023. Measuring massive multitask chinese
drew M. Dai, and Quoc V. Le. 2022. Finetuned lan- understanding. arXiv preprint arXiv:2304.12986.
guage models are zero-shot learners. In The Tenth
International Conference on Learning Representa- Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang,
tions, ICLR 2022, Virtual Event, April 25-29, 2022. Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang
OpenReview.net. Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li,
Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang,
Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin
Zhang, Chong Shen, Hongli Liu, Feng Li, Hong Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang
Zhu, Jiangang Luo, Liang Xu, et al. 2021. Yuan Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng,
1.0: Large-scale pre-trained language model in Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie
zero-shot and few-shot learning. arXiv preprint Zhang, Mingyue Guo, Shanzhi Gu, Gaojun Fan,
arXiv:2110.04725. Yaowei Wang, Xuefeng Jin, Qun Liu, and Yonghong
Tian. 2021. Pangu-α: Large-scale autoregres-
Sang Michael Xie, Aditi Raghunathan, Percy Liang, sive pretrained chinese language models with auto-
and Tengyu Ma. 2022. An explanation of in-context parallel computation. CoRR, abs/2104.12369.
learning as implicit bayesian inference. In The Tenth
International Conference on Learning Representa- Susan Zhang, Stephen Roller, Naman Goyal, Mikel
tions, ICLR 2022, Virtual Event, April 25-29, 2022. Artetxe, Moya Chen, Shuohui Chen, Christopher
Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin,
Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shus-
Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, ter, Daniel Simig, Punit Singh Koura, Anjali Srid-
Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, har, Tianlu Wang, and Luke Zettlemoyer. 2022.
Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao OPT: open pre-trained transformer language models.
Wang, Weijian Xie, Yanting Li, Yina Patterson, CoRR, abs/2205.01068.
Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua
Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhengyan Zhang, Yuxian Gu, Xu Han, Shengqi Chen,
Zhang, Zhengliang Yang, Kyle Richardson, and Chaojun Xiao, Zhenbo Sun, Yuan Yao, Fanchao Qi,
Zhenzhong Lan. 2020. CLUE: A chinese language Jian Guan, Pei Ke, Yanzheng Cai, Guoyang Zeng,
understanding evaluation benchmark. In COLING, Zhixing Tan, Zhiyuan Liu, Minlie Huang, Wentao
pages 4762–4772. International Committee on Com- Han, Yang Liu, Xiaoyan Zhu, and Maosong Sun.
putational Linguistics. 2021. CPM-2: large-scale cost-effective pre-trained
language models. CoRR, abs/2106.10715.
Linting Xue, Noah Constant, Adam Roberts, Mi-
hir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xi-
Barua, and Colin Raffel. 2021. mt5: A massively aolei Wang, Yupeng Hou, Yingqian Min, Beichen
multilingual pre-trained text-to-text transformer. In Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen
Proceedings of the 2021 Conference of the North Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang,
American Chapter of the Association for Computa- Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu,
tional Linguistics: Human Language Technologies, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023.
NAACL-HLT 2021, Online, June 6-11, 2021, pages A survey of large language models. arXiv preprint
483–498. arXiv:2303.18223.
Yan Gong Yiping Peng Qiang Niu Baochang Ma Yun- Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo
jie Ji, Yong Deng and Xiangang Li. 2023. Belle: Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu
Be everyone’s large language model engine. https: Chen, and Nan Duan. 2023. Agieval: A human-
//github.com/LianjiaTech/BELLE. centric benchmark for evaluating foundation models.
arXiv preprint arXiv:2304.06364.
Rowan Zellers, Ari Holtzman, Hannah Rashkin,
Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu,
Yejin Choi. 2019. Defending against neural fake Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan,
news. In Advances in Neural Information Process- Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu,
ing Systems 32: Annual Conference on Neural Infor- Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu,
mation Processing Systems 2019, NeurIPS 2019, De- and Lichao Sun. 2023. A comprehensive survey on
cember 8-14, 2019, Vancouver, BC, Canada, pages pretrained foundation models: A history from BERT
9051–9062. to chatgpt. CoRR, abs/2302.09419.
Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan
Salakhutdinov, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. 2015. Aligning books and movies:
Towards story-like visual explanations by watching
movies and reading books. In 2015 IEEE Interna-
tional Conference on Computer Vision, ICCV 2015,
Santiago, Chile, December 7-13, 2015, pages 19–27.
IEEE Computer Society.
A All Subjects

See Table 8 for all 71 tasks.

Tasks Subjects Education System
Chinese Arts & Humanities Primary school
Math Natural Sciences Primary school
Chinese Arts & Humanities Junior high school
History Arts & Humanities Junior high school
Politics Social Sciences Junior high school
Math Natural Sciences Junior high school
Physics Natural Sciences Junior high school
Biology Natural Sciences Junior high school
Chemistry Natural Sciences Junior high school
Geography Natural Sciences Junior high school
Chinese Arts & Humanities High school
History Arts & Humanities High school
Politics Social Sciences High school
Math Natural Sciences High school
Physics Natural Sciences High school
Biology Natural Sciences High school
Chemistry Natural Sciences High school
Geography Natural Sciences High school
Modern History Arts & Humanities College
History Foundation Arts & Humanities College
Modern World History Arts & Humanities College
Chinese Constitutional Law Social Sciences College
History of Chinese Education Social Sciences College
History of the Chinese Legal System Social Sciences College
Developmental and Educational Psychology Social Sciences College
History of Foreign Education Social Sciences College
Experimental Psychology Social Sciences College
Introduction to Psychology Social Sciences College
Moral Cultivation Social Sciences College
Psychology of Teaching Social Sciences College
Principles of Pedagogy Social Sciences College
Educational Research Methods Social Sciences College
Current Affairs and Politics Social Sciences College
Introduction to Mao Tsetung Thoughts Social Sciences College
Civil Law Social Sciences College
Jurisprudence Social Sciences College
Sociology Social Sciences College
Basic Principle of Marxism Social Sciences College
Criminal Jurisprudence Social Sciences College
Outline of Chinese Modern History Social Sciences College
Humanistic Medicine Natural Sciences College
Internal Medicine Natural Sciences College
Animal Physiology Natural Sciences College
Surgical Sciences Natural Sciences College
Operating Systems Natural Sciences College
Data Structures Natural Sciences College
Probability Theory Natural Sciences College
Biochemistry Natural Sciences College
Biochemistry and Pathology Natural Sciences College
Physiology Natural Sciences College
Principles of Computer Composition Natural Sciences College
Computer Networks Natural Sciences College
Advanced Mathematics Natural Sciences College
Linear Algebra Natural Sciences College
Stomatology Natural Sciences College
Anthropotomy Natural Sciences College
Pharmacology Natural Sciences College
Immunology Natural Sciences College
Management Natural Sciences College
Economics Natural Sciences College
Film Arts & Humanities Other
Music Arts & Humanities Other
Dance Arts & Humanities Other
Fine Arts Arts & Humanities Other
Computer Fundamentals Natural Sciences Other
Computer Programming Language Natural Sciences Other
Chinese Medicine Other Other
Ancient Chinese Language Other Other
Novels Other Other
Religion Other Other
Chinese Civil Service Examination Other Other
Table 8: All 71 tasks in M3KE, with their subject clusters and education levels.