M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark For Chinese Large Language Models

Abstract

Large language models have recently made remarkable progress in many aspects, e.g., cross-task generalization and instruction following. Comprehensively evaluating the capability of large language models in multiple tasks is of great importance. In this paper, we propose M3KE, a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark, which is developed to measure knowledge acquired by Chinese large language models by testing their multitask accuracy in zero- and few-shot settings. We have collected 20,477 questions from 71 tasks. Our selection covers all major levels of the Chinese education system, ranging from primary school to college, as well as a wide variety of subjects, including humanities, history, politics, law, education, psychology, science, technology, art and religion. All questions are multiple-choice questions with four options, hence guaranteeing a standardized and unified assessment process. We have assessed a number of state-of-the-art open-source Chinese large language models on the proposed benchmark. The size of these models varies from 335M to 130B parameters. Experiment results demonstrate that they perform significantly worse than GPT-3.5, which reaches an accuracy of ~48% on M3KE. The dataset is available at https://github.com/tjunlp-lab/M3KE.

1 Introduction

Large Language Models (LLMs) (Raffel et al., 2020; Xue et al., 2021; Zhang et al., 2022; Brown et al., 2020; Touvron et al., 2023; Scao et al., 2022; Zhao et al., 2023; Zhou et al., 2023) have achieved remarkable progress in recent years, especially with the release of ChatGPT (https://openai.com/blog/chatgpt), which is widely acknowledged to revolutionize the world of natural language processing and to transform AI and society (Altman, 2023; Bubeck et al., 2023; Huang et al., 2023). LLMs are usually pre-trained on massive textual data (Zhu et al., 2015; Liu et al., 2019b; Zellers et al., 2019; Gokaslan et al., 2019), which cover a wide range of genres, e.g., encyclopedias, news, books, social media, etc. Many studies have demonstrated that LLMs are able to acquire broad knowledge of many types and subjects (Zhao et al., 2023; Paperno et al., 2016; Hoffmann et al., 2022; Touvron et al., 2023; Rae et al., 2021; Raffel et al., 2020; Du et al., 2022a).

The paradigm for eliciting and applying the knowledge acquired by LLMs on downstream tasks has shifted from fine-tuning to instruction-tuning. Early LLMs usually adopt fine-tuning, which, however, suffers from a lack of cross-task generalization, as the fine-tuned LLMs are often task-specific, and is not parameter-efficient, as all pre-trained LLM parameters usually have to be updated for each downstream task. As LLMs reach the scale of billions of parameters, a more efficient alternative for eliciting knowledge, in-context learning (ICL) (Brown et al., 2020; Xie et al., 2022; Dong et al., 2023), has emerged, which uses only a few demonstration examples concatenated in a prompt. To enhance the cross-task generalization of LLMs across a variety of downstream tasks, instruction-tuning (Wei et al., 2022; Bach et al., 2022; Wang et al., 2022b), which is performed via multi-task learning (Chung et al., 2022; Liu et al., 2019a), has been proposed. In instruction-tuning, the instructions for different tasks are different but share a unified form. Supervised Fine-tuning (SFT) (Ouyang et al., 2022) and Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022) are successful methods of instruction-tuning, which not only achieve generalization to unseen instructions but also align LLMs with human values and intents (Sanh et al., 2022; Wei et al., 2022; Chung et al., 2022).

As the capability of knowledge acquisition and application in LLMs is constantly and rapidly evolving, a natural question arises: how can we assess such knowledge? Traditional single-task evaluation benchmarks (Rajpurkar et al., 2016; Khot et al., 2020) are no longer adequate for evaluating them. Multi-task benchmarks like GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019) and BIG-bench (Srivastava et al., 2022) aggregate multiple NLP tasks to evaluate LLMs, but they are not sufficient either to assess knowledge acquired by LLMs. To address this issue, Hendrycks et al. (2021) propose MMLU, a widely used benchmark to test the knowledge acquisition and application capability of LLMs, which uses test questions across multiple subjects that humans learn to assess LLMs in zero- and few-shot settings. As MMLU is an English benchmark, it cannot be directly used for measuring LLMs trained with data in other languages. Even if it is translated into other languages, like the way used in evaluating GPT-4 (OpenAI, 2023), there are still gaps in knowledge across different languages, as they usually have different education systems and knowledge structures.

Similar to LLMs in English, LLMs dedicated to Chinese have also achieved rapid advances recently (Du et al., 2022b; Zeng et al., 2021; Zhang et al., 2021; Sun et al., 2021; Zeng et al., 2022; Ren et al., 2023; Wu et al., 2021; Wang et al., 2021; Chen et al., 2023). However, a massive knowledge evaluation benchmark that measures Chinese LLMs in line with the Chinese education system is a desideratum. To bridge this gap, we propose M3KE, a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark, which is designed to measure the knowledge acquired by Chinese LLMs by testing their multitask accuracy in zero- and few-shot settings. M3KE contains 20,477 questions collected from 71 tasks. In particular, unlike the recent benchmarks MMCU (Zeng, 2023) and AGIEval (Zhong et al., 2023), M3KE covers all major levels of the Chinese education system, ranging from primary school to college, as well as a wide variety of subjects, including humanities, history, politics, law, education, psychology, science, technology, art and religion. All questions are multiple-choice questions with four options, hence ensuring a standardized and unified assessment process. Table 1 shows the comparison between M3KE and other related benchmarks.

Benchmark Language # Tasks # Questions
MMLU (Hendrycks et al., 2021) En 57 15,908
AGIEval (Zhong et al., 2023) En & Zh 20 8,062
MMCU (Zeng, 2023) Zh 51 11,900
M3KE Zh 71 20,477

Table 1: The comparison between M3KE and other related benchmarks.

With M3KE, we have tested recently released Chinese LLMs to track the progress of Chinese LLMs in knowledge acquisition and application. The evaluated models are either only pre-trained on massive data or pre-trained and then fine-tuned with SFT or RLHF. The model sizes vary from 335M to 130B parameters.

With extensive experiments, we observe that most evaluated Chinese LLMs have near random-chance accuracy, even for primary school tasks. The best performance is achieved by an SFT model built on the open-source BLOOM (Scao et al., 2022), which is 14.8 points lower than the accuracy of GPT-3.5-turbo.

Our main contributions are summarized as follows.

• We propose M3KE, a knowledge evaluation benchmark for Chinese LLMs, which to date covers the largest number of tasks in line with the Chinese education system.

• We have tested a wide range of open-source Chinese LLMs, with model sizes varying from 335M to 130B, against GPT-3.5-turbo.

• We have analyzed the performance of each model on different subject clusters and education levels in both zero- and five-shot settings.

2 Related Work

Chinese Large Language Models. Recent years have witnessed a rapid development of Chinese LLMs, following the efforts of their English counterparts, e.g., GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021) and LLaMA (Touvron et al., 2023). Chinese LLMs, such as Pangu-α with 200B parameters (Zeng et al., 2021), Yuan 1.0 with 245B parameters (Wu et al., 2021) and ERNIE 3.0 Titan with 260B parameters (Sun et al., 2021), have been trained on Chinese textual data containing from 180B to 329B tokens. These models are developed in industry and are usually not open-source.
With the success of open-source LLMs (Taori et al., 2023; Peng et al., 2023) based on LLaMA, Chinese counterparts, such as ChatGLM-6B (https://github.com/THUDM/ChatGLM-6B), MOSS (https://github.com/OpenLMLab/MOSS) and Phoenix (Chen et al., 2023), have emerged very recently. These models usually contain far fewer parameters than their industrial predecessors.

Most existing evaluation benchmarks (Wang et al., 2018, 2019; Srivastava et al., 2022; Xu et al., 2020) are designed to evaluate LLMs on various NLP tasks and are not tailored to knowledge acquisition and application assessment. To comprehensively measure knowledge in LLMs, MMLU (Hendrycks et al., 2021) is proposed, which collects multiple-choice questions from 57 tasks that humans learn. As a different education system is used in China, on the one hand, knowledge in Chinese LLMs may not be exhibited by a translated-into-Chinese version of MMLU, e.g., for Chinese Medicine or the Chinese Legal System. On the other hand, knowledge to be assessed in MMLU may be absent from the Chinese textual data used to train Chinese LLMs.

Our work is related to three datasets that have been developed concurrently with M3KE. MMCU (Zeng, 2023) is a Chinese benchmark that assesses knowledge in four domains: medicine, education, law, and psychology. AGIEval (Zhong et al., 2023) is a bilingual benchmark that measures the capability of LLMs on tasks from the Chinese college entrance exam and the American college admission test, targeting high-school graduates. DomMa (Gu et al., 2023) is another Chinese benchmark that focuses on domain-specific knowledge. In contrast to these benchmarks, M3KE is a comprehensive Chinese benchmark that spans major stages of the Chinese education system, from primary school to college, with a broader range of subject categories, such as art, religion, traditional Chinese medicine, and classical literature.

3 M3KE

M3KE covers major Chinese education levels, including primary school, middle school, high school, college and professional exams, as well as multiple tasks, as shown in Figure 1, while the detailed subjects are listed in Appendix A. We collect and organize multiple-choice questions from public websites. To ensure the quality and comprehensiveness of the questions, entrance exam questions are selected as much as possible. For the primary school, middle school and high school education levels, we choose the subjects according to the corresponding entrance exams for Chinese students. For the college level, we select subjects according to the national entrance exam for master's degrees in China. In addition to subjects under the major Chinese education system, we also collect comprehensive tasks to expand the knowledge coverage of M3KE, including the computer grade exam, ancient Chinese language, novels and the Chinese national civil service exam, which covers commonsense knowledge, arts, religion, etc.

Figure 1: The distribution of tasks in M3KE.

In total, we have 71 tasks and 20,477 questions. We divide each task into a test set and a few-shot set, where the few-shot set includes 5 questions per task for the few-shot evaluation setting. The test set includes 20,122 questions, and each task contains at least 100 questions. Instances of M3KE are listed in Table 2.
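To make the task organization concrete, the sketch below shows how a single M3KE task could be loaded and split into its 5-question few-shot set and its test set. The JSON field names and file layout here are illustrative assumptions rather than the released format (see https://github.com/tjunlp-lab/M3KE for the actual files).

import json

def load_task(path):
    # Load one task file, assumed here to hold a list of multiple-choice records.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def split_task(records, num_shots=5):
    # Split a task into its few-shot set (5 questions) and its test set.
    return records[:num_shots], records[num_shots:]

# A hypothetical record with a question, four options and a gold answer letter.
example_record = {
    "question": "一个示例问题?",
    "A": "选项一",
    "B": "选项二",
    "C": "选项三",
    "D": "选项四",
    "answer": "A",  # illustrative gold label, not a dataset annotation
}

Under this assumed layout, the 5 few-shot questions per task would be the ones concatenated as demonstrations in the five-shot setting described later in Section 4.2.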
Arts & Humanities
下面关于拉斯科洞穴壁画说法错误的是? (Which statement about the Lascaux cave murals is incorrect?)
A 这个壁画是在法国发现的 (This fresco was found in France)
B 发现的动物形象有100多个 (There are more than 100 animal images found)
C 发现的时间为1940年 (The discovery was made in 1940)
D 壁画颜色以黑色为主 (Mural color is mainly black)

Social Sciences
甲欲杀乙,将毒药投入乙的饭食中.乙服食后,甲后悔,赶紧说明情况,并将乙送往医院抢救.医院在抢救过程中检查发现,甲所投放的"毒药"根本没有毒性,乙安然无恙.甲的行为属于? (A wants to kill B, and puts poison into B's food. After B consumed it, A regretted it and rushed to explain the situation and sent B to the hospital for rescue. The hospital found that the poison was not toxic at all and B was unharmed. A's behavior belongs to?)
A 不构成犯罪 (Not a crime)
B 犯罪未遂 (Attempted crime)
C 犯罪中止 (Crime suspension)
D 犯罪既遂 (Crime reached)

Natural Sciences
使用普鲁卡因麻醉神经纤维,影响了神经纤维传导兴奋的哪一项特征? (Which characteristic of nerve fiber conduction excitation is affected by the use of procaine anesthesia?)
A 生理完整性 (Physiological integrity)
B 绝缘性 (Insulation)
C 双向传导性 (Bidirectional conduction)
D 相对不疲劳性 (Relative non-fatigability)

Other
以前有几项研究表明,食用巧克力会增加食用者患心脏病的可能性。而一项最新的、更为可靠的研究得出的结论是:食用巧克力与心脏病发病率无关。估计这项研究成果公布以后,巧克力的消费量将会大大增加。上述推论基于以下哪项假设? (Several studies have previously suggested that consuming chocolate increases the likelihood of developing heart disease. However, a recent and more reliable study concluded that there is no association between chocolate consumption and incidence of heart disease. It is estimated that the consumption of chocolate will significantly increase after the publication of this research. The above inference is based on the assumption that the reliability of the previous studies was lower than that of the latest study.)
A 尽管有些人知道食用巧克力会增加患心脏病的可能性,却照样大吃特吃 (Although some people are aware that consuming chocolate increases the likelihood of developing heart disease, they still indulge in it.)
B 人们从来也不相信进食巧克力会更容易患心脏病的说法 (People have never believed the claim that eating chocolate makes it more likely to develop heart disease.)
C 现在许多人吃巧克力是因为他们没有听过巧克力会导致心脏病的说法 (Nowadays, many people eat chocolate because they have not heard of the claim that chocolate can lead to heart disease.)
D 现在许多人不吃巧克力完全是因为他们相信巧克力会诱发心脏病 (Nowadays, many people abstain from eating chocolate solely because they believe that chocolate can trigger heart disease.)

Table 2: Examples from M3KE. Bolded items represent correct answers. Examples from top to bottom are from the Fine Arts, Criminal Jurisprudence, Animal Physiology and Chinese Civil Service Examination tasks, respectively.
Arts & Humanities Social Sciences Natural Sciences Other
# Tasks 12 21 31 7
# Questions 3,612 6,222 8,162 2,126
Avg. # questions per task 301 296 263 303
Max. # questions per task 352 374 347 425
Min. # questions per task 190 190 100 129
Avg. # tokens per question 30.33 38.75 38.54 33.21
Avg. # tokens per answer choice 53.92 30.99 44.57 52.53

Table 3: Overall statistics of M3KE across the four subject clusters.
3.1 Arts & Humanities

Arts & Humanities comprise a range of disciplines that cover Chinese, literature, arts and history. These disciplines focus on the analysis and interpretation of literary and cultural artifacts, rather than on practical applications. For instance, the Chinese subject in primary school aims to evaluate students' proficiency in language use and literary appreciation for ages 7 to 13, such as the usage of synonyms and antonyms. The historical studies cover both Chinese and world history, from ancient to modern times. M3KE also incorporates artistic subjects, such as dance, fine arts, music and film, because we believe that art is an essential aspect of human culture and should be relevant to LLMs as well.

3.2 Social Sciences

Social sciences differ from Arts & Humanities in that they emphasize practical aspects of humanistic studies, such as law, politics, education and psychology. These subjects are mainly taught at the college level. Although ideological and political courses are also part of the Chinese middle school and high school curriculum, they primarily involve moral education. Social sciences also encompass economic and management studies, which largely consist of questions from the joint exams for graduate students majoring in these fields in China. These studies include microeconomics, macroeconomics, management and logic at the undergraduate level.

3.3 Natural Sciences

Natural sciences encompass engineering, science, medicine and fundamental disciplines such as math, physics, chemistry and biology. These subjects often require a high degree of computation, analysis and logical reasoning skills. The same subject may assess different types of knowledge at different levels according to the Chinese education system. For instance, primary school math mainly tests basic arithmetic operations, while high school math covers more advanced mathematical concepts, such as sequences, derivatives and geometry.

3.4 Other

Other types of tasks include religion, the Chinese civil service exam, and specialized tasks, like ancient Chinese language and the novel reasoning task. These tasks require knowledge that is not limited to a single level or subject as described above. The Chinese civil service exam involves knowledge of commonsense, humanities, logic and other domains, which we consider an assessment of the comprehensive knowledge of LLMs. Similarly, the questions in the novel task draw on information from many classical novels.

3.5 Overall Statistics

Table 3 shows the overall statistics of M3KE. The numbers of tasks in the four subject clusters described above are 12, 21, 31 and 7, respectively, while the numbers of questions in the four subject clusters are 3,612, 6,222, 8,162 and 2,126, respectively. The maximum number of questions per task is 425, while the minimum is 100. Questions in the social and natural sciences are usually longer than those in arts & humanities and other, while their answer choices are shorter.

4 Experiments

We assessed state-of-the-art large language models recently developed for Chinese on M3KE, attempting to understand and track the progress of Chinese LLMs in learning and applying knowledge from massive data.

4.1 Assessed Models

The assessed Chinese LLMs can be divided into two categories: models that are only pre-trained and models that are further instruction-tuned with SFT/RLHF. For the former, we selected GLM-335M (Du et al., 2022b), GLM-10B (Du et al., 2022b), GLM-130B (Zeng et al., 2022) and BLOOM-7.1B (Scao et al., 2022).
Models Arts & Humanities Social Sciences Natural Sciences Other Average
GLM-335M 0.070 0.046 0.084 0.044 0.062
BLOOM-7.1B 0.163 0.159 0.161 0.158 0.161
GLM-10B 0.180 0.229 0.219 0.150 0.197
GLM-130B 0.326 0.352 0.274 0.359 0.328
ChatGLM-6B 0.246 0.267 0.168 0.263 0.236
MOSS-SFT-16B 0.260 0.263 0.207 0.275 0.251
BELLE-7B-0.2M 0.247 0.296 0.260 0.260 0.266
BELLE-7B-2M 0.328 0.367 0.282 0.355 0.333
GPT-3.5-turbo 0.460 0.538 0.444 0.481 0.481
Table 4: Average zero-shot accuracy for each model on the four subject clusters.
Models Arts & Humanities Social Sciences Natural Sciences Other Average
GLM-335M 0.220 0.247 0.193 0.126 0.196
BLOOM-7.1B 0.247 0.260 0.235 0.246 0.247
GLM-10B 0.294 0.304 0.232 0.211 0.260
GLM-130B 0.297 0.329 0.246 0.228 0.275
ChatGLM-6B 0.188 0.175 0.121 0.198 0.171
MOSS-SFT-16B 0.266 0.264 0.258 0.284 0.268
BELLE-7B-0.2M 0.292 0.327 0.273 0.307 0.299
BELLE-7B-2M 0.287 0.309 0.284 0.313 0.298
GPT-3.5-turbo 0.453 0.540 0.464 0.476 0.483
Table 5: Average five-shot accuracy for each model on the four subject clusters.
For the latter, we included ChatGLM-6B (https://github.com/THUDM/ChatGLM-6B), MOSS-SFT-16B (https://huggingface.co/fnlp/moss-moon-003-sft) and BELLE-7B (Yunjie Ji and Li, 2023), where BELLE-7B is the SFT version based on BLOOMZ-7.1B-MT (Muennighoff et al., 2022). We used the two variants of BELLE fine-tuned on 200K and 2M instructions, namely BELLE-7B-0.2M (https://huggingface.co/BelleGroup/BELLE-7B-0.2M) and BELLE-7B-2M (https://huggingface.co/BelleGroup/BELLE-7B-2M). We also evaluated GPT-3.5-turbo (https://openai.com/product) from OpenAI as a reference.

4.2 Prompts

All models were tested in an n-shot setting with a unified prompt, where n is an integer from 0 to 5. For the zero-shot setting (i.e., n = 0), the unified prompt provided to all models is "Please choose the correct option from 'A', 'B', 'C', 'D' based on the following question". For the few-shot setting (i.e., n > 0), the unified prompt is "Please choose the correct option from 'A', 'B', 'C', 'D' based on the following examples and question". The input to all LLMs consists of the prompt, the question, the answer choices and a suffix, which is "the correct option is: ". Even though we tell models to output only the correct answer choice indicator (i.e., one of {A, B, C, D}) in the prompt, not all models follow this instruction. Sometimes they output both the answer choice indicator and a rationale for it (the order of these two types of output is random). We hence keep only the output answer choice indicator as the final answer when calculating accuracy.
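As a minimal sketch of this protocol, the snippet below assembles the zero- and few-shot prompts described above and keeps only the answer choice indicator found in a model's output; the helper names and the exact formatting of questions and demonstrations are our assumptions, not the authors' released evaluation code.

import re

ZERO_SHOT_INSTRUCTION = ("Please choose the correct option from 'A', 'B', 'C', 'D' "
                         "based on the following question")
FEW_SHOT_INSTRUCTION = ("Please choose the correct option from 'A', 'B', 'C', 'D' "
                        "based on the following examples and question")
SUFFIX = "the correct option is: "

def format_record(record, with_answer=False):
    # Render one multiple-choice record as the question followed by its four options.
    text = record["question"] + "\n" + "\n".join(
        f"{key}. {record[key]}" for key in ("A", "B", "C", "D"))
    if with_answer:  # demonstrations in the few-shot prompt carry their gold answers
        text += "\n" + SUFFIX + record["answer"]
    return text

def build_prompt(record, demonstrations=()):
    # Concatenate the instruction, any demonstrations, the question and the suffix.
    instruction = FEW_SHOT_INSTRUCTION if demonstrations else ZERO_SHOT_INSTRUCTION
    parts = [instruction]
    parts += [format_record(d, with_answer=True) for d in demonstrations]
    parts += [format_record(record), SUFFIX]
    return "\n\n".join(parts)

def extract_choice(model_output):
    # Keep only the first answer choice indicator; models may also emit a rationale.
    match = re.search(r"[ABCD]", model_output)
    return match.group(0) if match else None

Accuracy is then the fraction of test questions for which the extracted indicator matches the gold answer.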
Models Primary School Middle School High School College Other Average
GLM-335M 0.075 0.099 0.099 0.054 0.046 0.075
BLOOM-7.1B 0.173 0.142 0.173 0.160 0.164 0.163
GLM-10B 0.190 0.199 0.197 0.213 0.152 0.190
GLM-130B 0.243 0.303 0.229 0.324 0.359 0.292
ChatGLM-6B 0.180 0.243 0.191 0.213 0.250 0.216
MOSS-SFT-16B 0.224 0.223 0.213 0.242 0.260 0.232
BELLE-7B-0.2M 0.233 0.269 0.259 0.268 0.263 0.258
BELLE-7B-2M 0.248 0.313 0.263 0.332 0.349 0.301
GPT-3.5-turbo 0.328 0.403 0.395 0.509 0.484 0.435
Table 6: Average zero-shot accuracy for each model on five major education levels.
Models Primary School Middle School High School College Other Average
GLM-335M 0.206 0.229 0.232 0.223 0.114 0.201
BLOOM-7.1B 0.262 0.222 0.245 0.249 0.246 0.245
GLM-10B 0.229 0.263 0.270 0.278 0.197 0.248
GLM-130B 0.268 0.293 0.272 0.294 0.208 0.267
ChatGLM-6B 0.089 0.150 0.137 0.155 0.196 0.146
MOSS-SFT-16B 0.272 0.223 0.263 0.266 0.281 0.261
BELLE-7B-0.2M 0.260 0.256 0.273 0.298 0.310 0.280
BELLE-7B-2M 0.258 0.264 0.268 0.306 0.299 0.279
GPT-3.5-turbo 0.308 0.565 0.373 0.517 0.475 0.448
Table 7: Average five-shot accuracy for each model on five major education levels.
4.3 Results

We compared the zero-shot accuracy of each model in Table 4 in terms of subject clusters. For the pre-trained models, there is a clear positive correlation between accuracy and model size: the model with 130B parameters significantly outperforms the models with 335M/7B/10B parameters, even though they have different backbones. The accuracy of GPT-3.5-turbo is significantly higher than that of the evaluated Chinese LLMs and currently provides an upper bound for open-source Chinese LLMs. All pre-trained LLMs with ≤ 10B parameters achieve an accuracy lower than random chance (i.e., 25%), indicating that the knowledge acquired by these models is not adequate for M3KE. In addition, we observe that the number of instructions used for SFT is an important factor, as the BELLE model fine-tuned with 2M instructions is significantly better than the one fine-tuned with 0.2M instructions. The zero-shot performance of GPT-3.5-turbo is much higher than that of the compared open-source Chinese LLMs, but still below 50% accuracy, suggesting that M3KE is a very challenging benchmark.

We further compared the accuracy of different models under the 5-shot setting; results are shown in Table 5. For the pre-trained models, ICL in the few-shot setting significantly improves performance, and the smaller the pre-trained model is, the larger the achieved improvement. The exception is GLM-130B, which performs significantly worse under the 5-shot setting than under the zero-shot setting. We conjecture that GLM-130B is already able to understand questions without examples, because it uses instances in instruction format as part of its pre-training corpus (Zeng et al., 2022), so demonstrations may interfere with the final prediction of the model. The 5-shot results of the SFT models are mixed in comparison to those in the zero-shot setting. We find that for ChatGLM-6B and BELLE-7B-2M, 5-shot is worse than the zero-shot setting, similar to the results observed on GLM-130B. In contrast, 5-shot has a positive impact on MOSS-SFT-16B and BELLE-7B-0.2M. As these models differ from each other in terms of model size, training data, instruction data, etc., we leave an in-depth analysis of these mixed results to future work.

We finally provide the results of each model on different education levels in Table 6 for the zero-shot setting and Table 7 for the few-shot setting. Interestingly, we observe that LLMs do not reach higher performance at lower education levels than at higher education levels, not even GPT-3.5-turbo. This suggests that tasks from lower education levels remain challenging for these state-of-the-art Chinese LLMs.
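The averages in Tables 4-7 aggregate accuracy over subject clusters and education levels; the sketch below shows one plausible aggregation, a macro-average over per-task accuracies, which is an assumption about the exact procedure and is illustrated with dummy numbers rather than measured results.

from collections import defaultdict

def average_by(task_results, group_key):
    # Average per-task accuracy grouped by "cluster" or "level".
    buckets = defaultdict(list)
    for info in task_results.values():
        buckets[info[group_key]].append(info["accuracy"])
    return {group: sum(values) / len(values) for group, values in buckets.items()}

# Dummy per-task accuracies for illustration only (not measured numbers).
task_results = {
    "Fine Arts": {"accuracy": 0.30, "cluster": "Arts & Humanities", "level": "Other"},
    "Civil Law": {"accuracy": 0.25, "cluster": "Social Sciences", "level": "College"},
    "Physiology": {"accuracy": 0.40, "cluster": "Natural Sciences", "level": "College"},
}
print(average_by(task_results, "cluster"))
print(average_by(task_results, "level"))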
5 Conclusion

We have presented a new benchmark, M3KE, to assess the capability of Chinese LLMs in learning and applying knowledge in multiple subjects at multiple levels of the Chinese education system. M3KE contains 71 tasks and 20,477 questions. We find that all evaluated state-of-the-art open-source Chinese LLMs significantly lag behind GPT-3.5. We hope that this benchmark can be used to track and promote further progress in Chinese LLMs.
References 30: Annual Conference on Neural Information Pro-
cessing Systems 2017, December 4-9, 2017, Long
Sam Altman. 2023. Planning for agi and beyond. Ope- Beach, CA, USA, pages 4299–4307.
nAI Blog.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret
Stephen H. Bach, Victor Sanh, Zheng Xin Yong, Al- Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang,
bert Webson, Colin Raffel, Nihal V. Nayak, Ab- Mostafa Dehghani, Siddhartha Brahma, Albert Web-
heesht Sharma, Taewoon Kim, M. Saiful Bari, son, Shixiang Shane Gu, Zhuyun Dai, Mirac Suz-
Thibault Févry, Zaid Alyafeai, Manan Dey, An- gun, Xinyun Chen, Aakanksha Chowdhery, Sha-
drea Santilli, Zhiqing Sun, Srulik Ben-David, Can- ran Narang, Gaurav Mishra, Adams Yu, Vincent Y.
wen Xu, Gunjan Chhablani, Han Wang, Jason Alan Zhao, Yanping Huang, Andrew M. Dai, Hongkun
Fries, Maged Saeed AlShaibani, Shanya Sharma, Ur- Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin,
mish Thakker, Khalid Almubarak, Xiangru Tang, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason
Dragomir R. Radev, Mike Tian-Jian Jiang, and Wei. 2022. Scaling instruction-finetuned language
Alexander M. Rush. 2022. Promptsource: An inte- models. CoRR, abs/2210.11416.
grated development environment and repository for
natural language prompts. In ACL (demo), pages 93– Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiy-
104. Association for Computational Linguistics. ong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei
Li, and Zhifang Sui. 2023. A survey for in-context
Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari learning. CoRR, abs/2301.00234.
Morcos, Shashank Shekhar, Tom Goldstein, Florian
Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong,
Tian, et al. 2023. A cookbook of self-supervised Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun,
learning. arXiv preprint arXiv:2304.12210. Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret
Zoph, Liam Fedus, Maarten P. Bosma, Zongwei
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Zhou, Tao Wang, Yu Emma Wang, Kellie Webster,
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Marie Pellat, Kevin Robinson, Kathleen S. Meier-
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Hellstern, Toju Duke, Lucas Dixon, Kun Zhang,
Askell, Sandhini Agarwal, Ariel Herbert-Voss, Quoc V. Le, Yonghui Wu, Zhifeng Chen, and Claire
Gretchen Krueger, Tom Henighan, Rewon Child, Cui. 2022a. Glam: Efficient scaling of language
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, models with mixture-of-experts. In International
Clemens Winter, Christopher Hesse, Mark Chen, Conference on Machine Learning, ICML 2022, 17-
Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin 23 July 2022, Baltimore, Maryland, USA, pages
Chess, Jack Clark, Christopher Berner, Sam Mc- 5547–5569.
Candlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-shot learn- Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding,
ers. In Advances in Neural Information Processing Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022b.
Systems 33: Annual Conference on Neural Informa- GLM: General language model pretraining with au-
tion Processing Systems 2020, NeurIPS 2020, De- toregressive blank infilling. In Proceedings of the
cember 6-12, 2020, virtual. 60th Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
S’ebastien Bubeck, Varun Chandrasekaran, Ronen El- 320–335, Dublin, Ireland. Association for Computa-
dan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Pe- tional Linguistics.
ter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg,
Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Aaron Gokaslan, Vanya Cohen Ellie Pavlick, and Ste-
and Yi Zhang. 2023. Sparks of artificial general in- fanie Tellex. 2019. Openwebtext corpus. http:
telligence: Early experiments with gpt-4. volume //Skylion007.github.io/OpenWebTextCorpus.
abs/2303.12712.
Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang,
Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Zhuozhi Xiong, Zihan Li, Qianyu He, Sihang Jiang,
Dai, Philip S Yu, and Lichao Sun. 2023. A com- Hongwei Feng, and Yanghua Xiao. 2023. Do-
prehensive survey of ai-generated content (aigc): A main mastery benchmark: An ever-updating bench-
history of generative ai from gan to chatgpt. arXiv mark for evaluating holistic domain knowledge of
preprint arXiv:2303.04226. large language model–a preliminary release. arXiv
preprint arXiv:2304.11679.
Zhihong Chen, Feng Jiang, Junying Chen, Tiannan
Wang, Fei Yu, Guiming Chen, Hongbo Zhang, Dan Hendrycks, Collin Burns, Steven Basart, Andy
Juhao Liang, Chen Zhang, Zhiyi Zhang, et al. 2023. Zou, Mantas Mazeika, Dawn Song, and Jacob Stein-
Phoenix: Democratizing chatgpt across languages. hardt. 2021. Measuring massive multitask language
arXiv preprint arXiv:2304.10453. understanding. In ICLR. OpenReview.net.
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch,
Martic, Shane Legg, and Dario Amodei. 2017. Deep Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
reinforcement learning from human preferences. In Diego de Las Casas, Lisa Anne Hendricks, Johannes
Advances in Neural Information Processing Systems Welbl, Aidan Clark, Tom Hennigan, Eric Noland,
Katie Millican, George van den Driessche, Bogdan Baolin Peng, Chunyuan Li, Pengcheng He, Michel Gal-
Damoc, Aurelia Guy, Simon Osindero, Karen Si- ley, and Jianfeng Gao. 2023. Instruction tuning with
monyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, gpt-4. arXiv preprint arXiv:2304.03277.
and Laurent Sifre. 2022. Training compute-optimal
large language models. abs/2203.15556. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie
Millican, Jordan Hoffmann, H. Francis Song, John
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Aslanides, Sarah Henderson, Roman Ring, Susan-
Saksham Singhal, Shuming Ma, Tengchao Lv, Lei nah Young, Eliza Rutherford, Tom Hennigan, Ja-
Cui, Owais Khan Mohammed, Barun Patra, Qiang cob Menick, Albin Cassirer, Richard Powell, George
Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, van den Driessche, Lisa Anne Hendricks, Mari-
Vishrav Chaudhary, Subhojit Som, Xia Song, and beth Rauh, Po-Sen Huang, Amelia Glaese, Jo-
Furu Wei. 2023. Language is not all you need: hannes Welbl, Sumanth Dathathri, Saffron Huang,
Aligning perception with language models. CoRR, Jonathan Uesato, John Mellor, Irina Higgins, An-
abs/2302.14045. tonia Creswell, Nat McAleese, Amy Wu, Erich
Elsen, Siddhant M. Jayakumar, Elena Buchatskaya,
Tushar Khot, Peter Clark, Michal Guerquin, Peter David Budden, Esme Sutherland, Karen Simonyan,
Jansen, and Ashish Sabharwal. 2020. QASC: A Michela Paganini, Laurent Sifre, Lena Martens,
dataset for question answering via sentence com- Xiang Lorraine Li, Adhiguna Kuncoro, Aida
position. In The Thirty-Fourth AAAI Conference Nematzadeh, Elena Gribovskaya, Domenic Do-
on Artificial Intelligence, AAAI 2020, The Thirty- nato, Angeliki Lazaridou, Arthur Mensch, Jean-
Second Innovative Applications of Artificial Intelli- Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grig-
gence Conference, IAAI 2020, The Tenth AAAI Sym- orev, Doug Fritz, Thibault Sottiaux, Mantas Pa-
posium on Educational Advances in Artificial Intel- jarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama,
ligence, EAAI 2020, New York, NY, USA, February Cyprien de Masson d’Autume, Yujia Li, Tay-
7-12, 2020, pages 8082–8090. fun Terzi, Vladimir Mikulik, Igor Babuschkin,
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian- Aidan Clark, Diego de Las Casas, Aurelia Guy,
feng Gao. 2019a. Multi-task deep neural networks Chris Jones, James Bradbury, Matthew J. Johnson,
for natural language understanding. In ACL (1), Blake A. Hechtman, Laura Weidinger, Iason Gabriel,
pages 4487–4496. Association for Computational William S. Isaac, Edward Lockhart, Simon Osin-
Linguistics. dero, Laura Rimell, Chris Dyer, Oriol Vinyals, Ka-
reem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Hassabis, Koray Kavukcuoglu, and Geoffrey Irv-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, ing. 2021. Scaling language models: Methods,
Luke Zettlemoyer, and Veselin Stoyanov. 2019b. analysis & insights from training gopher. CoRR,
Roberta: A robustly optimized BERT pretraining ap- abs/2112.11446.
proach. CoRR, abs/1907.11692.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Adam Roberts, Stella Biderman, Teven Le Scao, Wei Li, and Peter J. Liu. 2020. Exploring the limits
M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hai- of transfer learning with a unified text-to-text trans-
ley Schoelkopf, Xiangru Tang, Dragomir Radev, former. J. Mach. Learn. Res., pages 140:1–140:67.
Alham Fikri Aji, Khalid Almubarak, Samuel Al-
banie, Zaid Alyafeai, Albert Webson, Edward Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
Raff, and Colin Raffel. 2022. Crosslingual gen- Percy Liang. 2016. Squad: 100, 000+ questions for
eralization through multitask finetuning. CoRR, machine comprehension of text. In Proceedings of
abs/2211.01786. the 2016 Conference on Empirical Methods in Nat-
ural Language Processing, EMNLP 2016, Austin,
OpenAI. 2023. Gpt-4 technical report. OpenAI. Texas, USA, November 1-4, 2016, pages 2383–2392.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car- Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing
roll L. Wainwright, Pamela Mishkin, Chong Zhang, Huang, Yadao Wang, Weichao Wang, Pengfei Li,
Sandhini Agarwal, Katarina Slama, Alex Ray, John Xiaoda Zhang, Alexander Podolskiy, Grigory Arshi-
Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, nov, Andrey Bout, Irina Piontkovskaya, Jiansheng
Maddie Simens, Amanda Askell, Peter Welinder, Wei, Xin Jiang, Teng Su, Qun Liu, and Jun Yao.
Paul F. Christiano, Jan Leike, and Ryan Lowe. 2023. Pangu-Σ: Towards trillion parameter lan-
2022. Training language models to follow instruc- guage model with sparse heterogeneous computing.
tions with human feedback. CoRR, abs/2203.02155. CoRR, abs/2303.10845.
Denis Paperno, Germán Kruszewski, Angeliki Lazari- Victor Sanh, Albert Webson, Colin Raffel, Stephen H.
dou, Quan Ngoc Pham, Raffaella Bernardi, San- Bach, Lintang Sutawika, Zaid Alyafeai, Antoine
dro Pezzelle, Marco Baroni, Gemma Boleda, and Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey,
Raquel Fernández. 2016. The LAMBADA dataset: M Saiful Bari, Canwen Xu, Urmish Thakker,
Word prediction requiring a broad discourse context. Shanya Sharma Sharma, Eliza Szczechla, Taewoon
In ACL (1). The Association for Computer Linguis- Kim, Gunjan Chhablani, Nihal V. Nayak, De-
tics. bajyoti Datta, Jonathan Chang, Mike Tian-Jian
Jiang, Han Wang, Matteo Manica, Sheng Shen, Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai
Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Yu, Hao Tian, Hua Wu, and Haifeng Wang. 2021.
Thomas Wang, Trishala Neeraj, Jos Rozen, Ab- ERNIE 3.0: Large-scale knowledge enhanced pre-
heesht Sharma, Andrea Santilli, Thibault Févry, Ja- training for language understanding and generation.
son Alan Fries, Ryan Teehan, Teven Le Scao, Stella CoRR, abs/2107.02137.
Biderman, Leo Gao, Thomas Wolf, and Alexan-
der M. Rush. 2022. Multitask prompted training Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann
enables zero-shot task generalization. In The Tenth Dubois, Xuechen Li, Carlos Guestrin, Percy Liang,
International Conference on Learning Representa- and Tatsunori B. Hashimoto. 2023. Stanford al-
tions, ICLR 2022, Virtual Event, April 25-29, 2022. paca: An instruction-following llama model. https:
OpenReview.net. //github.com/tatsu-lab/stanford_alpaca.
Teven Le Scao, Angela Fan, Christopher Akiki, El- Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
lie Pavlick, Suzana Ilic, Daniel Hesslow, Ro- Martinet, Marie-Anne Lachaux, Timothée Lacroix,
man Castagné, Alexandra Sasha Luccioni, François Baptiste Rozière, Naman Goyal, Eric Hambro,
Yvon, Matthias Gallé, Jonathan Tow, Alexan- Faisal Azhar, Aurélien Rodriguez, Armand Joulin,
der M. Rush, Stella Biderman, Albert Webson, Edouard Grave, and Guillaume Lample. 2023.
Pawan Sasanka Ammanamanchi, Thomas Wang, Llama: Open and efficient foundation language mod-
Benoît Sagot, Niklas Muennighoff, Albert Villanova els. CoRR.
del Moral, Olatunji Ruwase, Rachel Bawden, Stas
Bekman, Angelina McMillan-Major, Iz Beltagy, Alex Wang, Yada Pruksachatkun, Nikita Nangia,
Huu Nguyen, Lucile Saulnier, Samson Tan, Pe- Amanpreet Singh, Julian Michael, Felix Hill, Omer
dro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Levy, and Samuel R. Bowman. 2019. Superglue: A
Yacine Jernite, Julien Launay, Margaret Mitchell, stickier benchmark for general-purpose language un-
Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor derstanding systems. In NeurIPS, pages 3261–3275.
Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Alex Wang, Amanpreet Singh, Julian Michael, Fe-
Ariel Kreisberg Nitzav, Canwen Xu, Chenghao lix Hill, Omer Levy, and Samuel R. Bowman.
Mou, Chris Emezue, Christopher Klamm, Colin 2018. GLUE: A multi-task benchmark and anal-
Leong, Daniel van Strien, David Ifeoluwa Ade- ysis platform for natural language understand-
lani, and et al. 2022. BLOOM: A 176b-parameter ing. In Proceedings of the Workshop: Analyzing
open-access multilingual language model. CoRR, and Interpreting Neural Networks for NLP, Black-
abs/2211.05100. boxNLP@EMNLP 2018, Brussels, Belgium, Novem-
ber 1, 2018, pages 353–355. Association for Com-
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, putational Linguistics.
Abu Awal Md Shoeb, Abubakar Abid, Adam
Fisch, Adam R. Brown, Adam Santoro, Aditya Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu,
Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Siyu Ding, Weibao Gong, Shikun Feng, Junyuan
Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Shang, Yanbin Zhao, Chao Pang, Jiaxiang Liu, Xuyi
Alex Ray, Alex Warstadt, Alexander W. Kocurek, Chen, Yuxiang Lu, Weixin Liu, Xi Wang, Yangfan
Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Par- Bai, Qiuliang Chen, Li Zhao, Shiyong Li, Peng Sun,
rish, Allen Nie, Aman Hussain, Amanda Askell, Dianhai Yu, Yanjun Ma, Hao Tian, Hua Wu, Tian
Amanda Dsouza, Ameet Rahane, Anantharaman S. Wu, Wei Zeng, Ge Li, Wen Gao, and Haifeng Wang.
Iyer, Anders Andreassen, Andrea Santilli, Andreas 2021. ERNIE 3.0 titan: Exploring larger-scale
Stuhlmüller, Andrew M. Dai, Andrew La, An- knowledge enhanced pre-training for language un-
drew K. Lampinen, Andy Zou, Angela Jiang, Angel- derstanding and generation. CoRR, abs/2112.12731.
ica Chen, Anh Vuong, Animesh Gupta, Anna Got-
tardi, Antonio Norelli, Anu Venkatesh, Arash Gho- Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Al-
lamidavoodi, Arfa Tabassum, Arul Menezes, Arun isa Liu, Noah A. Smith, Daniel Khashabi, and Han-
Kirubarajan, Asher Mullokandov, Ashish Sabhar- naneh Hajishirzi. 2022a. Self-instruct: Aligning lan-
wal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla guage model with self generated instructions. CoRR,
Karakas, and et al. 2022. Beyond the imitation abs/2212.10560.
game: Quantifying and extrapolating the capabilities
of language models. CoRR, abs/2206.04615. Yizhong Wang, Swaroop Mishra, Pegah Alipoormo-
labashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Naik, Arjun Ashok, Arut Selvan Dhanasekaran, An-
Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, jana Arunkumar, David Stap, Eshaan Pathak, Gian-
Dario Amodei, and Paul F. Christiano. 2020. Learn- nis Karamanolakis, Haizhi Gary Lai, Ishan Puro-
ing to summarize from human feedback. CoRR, hit, Ishani Mondal, Jacob Anderson, Kirby Kuz-
abs/2009.01325. nia, Krima Doshi, Kuntal Kumar Pal, Maitreya Pa-
tel, Mehrad Moradshahi, Mihir Parmar, Mirali Puro-
Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, hit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit
Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Verma, Ravsehaj Singh Puri, Rushang Karia, Savan
Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhi- Doshi, Shailaja Keyur Sampat, Siddhartha Mishra,
hua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Sujan Reddy A, Sumanta Patro, Tanay Dixit, and
Xudong Shen. 2022b. Super-naturalinstructions: Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang,
Generalization via declarative instructions on 1600+ Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu,
NLP tasks. In Proceedings of the 2022 Conference Wendi Zheng, Xiao Xia, Weng Lam Tam, Zix-
on Empirical Methods in Natural Language Process- uan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen,
ing, EMNLP 2022, Abu Dhabi, United Arab Emi- Peng Zhang, Yuxiao Dong, and Jie Tang. 2022.
rates, December 7-11, 2022, pages 5085–5109. GLM-130B: an open bilingual pre-trained model.
abs/2210.02414.
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin
Guu, Adams Wei Yu, Brian Lester, Nan Du, An- Hui Zeng. 2023. Measuring massive multitask chinese
drew M. Dai, and Quoc V. Le. 2022. Finetuned lan- understanding. arXiv preprint arXiv:2304.12986.
guage models are zero-shot learners. In The Tenth
International Conference on Learning Representa- Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang,
tions, ICLR 2022, Virtual Event, April 25-29, 2022. Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang
OpenReview.net. Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li,
Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang,
Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin
Zhang, Chong Shen, Hongli Liu, Feng Li, Hong Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang
Zhu, Jiangang Luo, Liang Xu, et al. 2021. Yuan Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng,
1.0: Large-scale pre-trained language model in Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie
zero-shot and few-shot learning. arXiv preprint Zhang, Mingyue Guo, Shanzhi Gu, Gaojun Fan,
arXiv:2110.04725. Yaowei Wang, Xuefeng Jin, Qun Liu, and Yonghong
Tian. 2021. Pangu-α: Large-scale autoregres-
Sang Michael Xie, Aditi Raghunathan, Percy Liang, sive pretrained chinese language models with auto-
and Tengyu Ma. 2022. An explanation of in-context parallel computation. CoRR, abs/2104.12369.
learning as implicit bayesian inference. In The Tenth
International Conference on Learning Representa- Susan Zhang, Stephen Roller, Naman Goyal, Mikel
tions, ICLR 2022, Virtual Event, April 25-29, 2022. Artetxe, Moya Chen, Shuohui Chen, Christopher
Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin,
Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shus-
Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, ter, Daniel Simig, Punit Singh Koura, Anjali Srid-
Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, har, Tianlu Wang, and Luke Zettlemoyer. 2022.
Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao OPT: open pre-trained transformer language models.
Wang, Weijian Xie, Yanting Li, Yina Patterson, CoRR, abs/2205.01068.
Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua
Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhengyan Zhang, Yuxian Gu, Xu Han, Shengqi Chen,
Zhang, Zhengliang Yang, Kyle Richardson, and Chaojun Xiao, Zhenbo Sun, Yuan Yao, Fanchao Qi,
Zhenzhong Lan. 2020. CLUE: A chinese language Jian Guan, Pei Ke, Yanzheng Cai, Guoyang Zeng,
understanding evaluation benchmark. In COLING, Zhixing Tan, Zhiyuan Liu, Minlie Huang, Wentao
pages 4762–4772. International Committee on Com- Han, Yang Liu, Xiaoyan Zhu, and Maosong Sun.
putational Linguistics. 2021. CPM-2: large-scale cost-effective pre-trained
language models. CoRR, abs/2106.10715.
Linting Xue, Noah Constant, Adam Roberts, Mi-
hir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xi-
Barua, and Colin Raffel. 2021. mt5: A massively aolei Wang, Yupeng Hou, Yingqian Min, Beichen
multilingual pre-trained text-to-text transformer. In Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen
Proceedings of the 2021 Conference of the North Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang,
American Chapter of the Association for Computa- Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu,
tional Linguistics: Human Language Technologies, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023.
NAACL-HLT 2021, Online, June 6-11, 2021, pages A survey of large language models. arXiv preprint
483–498. arXiv:2303.18223.
Yan Gong Yiping Peng Qiang Niu Baochang Ma Yun- Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo
jie Ji, Yong Deng and Xiangang Li. 2023. Belle: Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu
Be everyone’s large language model engine. https: Chen, and Nan Duan. 2023. Agieval: A human-
//github.com/LianjiaTech/BELLE. centric benchmark for evaluating foundation models.
arXiv preprint arXiv:2304.06364.
Rowan Zellers, Ari Holtzman, Hannah Rashkin,
Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu,
Yejin Choi. 2019. Defending against neural fake Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan,
news. In Advances in Neural Information Process- Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu,
ing Systems 32: Annual Conference on Neural Infor- Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu,
mation Processing Systems 2019, NeurIPS 2019, De- and Lichao Sun. 2023. A comprehensive survey on
cember 8-14, 2019, Vancouver, BC, Canada, pages pretrained foundation models: A history from BERT
9051–9062. to chatgpt. CoRR, abs/2302.09419.
Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan
Salakhutdinov, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. 2015. Aligning books and movies:
Towards story-like visual explanations by watching
movies and reading books. In 2015 IEEE Interna-
tional Conference on Computer Vision, ICCV 2015,
Santiago, Chile, December 7-13, 2015, pages 19–27.
IEEE Computer Society.
A All Subjects

See Table 8 for all 71 tasks.

Tasks Subjects Education System
Chinese Arts & Humanities Primary school
Math Natural Sciences Primary school
Chinese Arts & Humanities Junior high school
History Arts & Humanities Junior high school
Politics Social Sciences Junior high school
Math Natural Sciences Junior high school
Physics Natural Sciences Junior high school
Biology Natural Sciences Junior high school
Chemistry Natural Sciences Junior high school
Geography Natural Sciences Junior high school
Chinese Arts & Humanities High school
History Arts & Humanities High school
Politics Social Sciences High school
Math Natural Sciences High school
Physics Natural Sciences High school
Biology Natural Sciences High school
Chemistry Natural Sciences High school
Geography Natural Sciences High school
Modern History Arts & Humanities College
History Foundation Arts & Humanities College
Modern World History Arts & Humanities College
Chinese Constitutional Law Social Sciences College
History of Chinese Education Social Sciences College
History of the Chinese Legal System Social Sciences College
Developmental and Educational Psychology Social Sciences College
History of Foreign Education Social Sciences College
Experimental Psychology Social Sciences College
Introduction to Psychology Social Sciences College
Moral Cultivation Social Sciences College
Psychology of Teaching Social Sciences College
Principles of Pedagogy Social Sciences College
Educational Research Methods Social Sciences College
Current Affairs and Politics Social Sciences College
Introduction to Mao Tsetung Thoughts Social Sciences College
Civil Law Social Sciences College
Jurisprudence Social Sciences College
Sociology Social Sciences College
Basic Principle of Marxism Social Sciences College
Criminal Jurisprudence Social Sciences College
Outline of Chinese Modern History Social Sciences College
Humanistic Medicine Natural Sciences College
Internal Medicine Natural Sciences College
Animal Physiology Natural Sciences College
Surgical Sciences Natural Sciences College
Operating Systems Natural Sciences College
Data Structures Natural Sciences College
Probability Theory Natural Sciences College
Biochemistry Natural Sciences College
Biochemistry and Pathology Natural Sciences College
Physiology Natural Sciences College
Principles of Computer Composition Natural Sciences College
Computer Networks Natural Sciences College
Advanced Mathematics Natural Sciences College
Linear Algebra Natural Sciences College
Stomatology Natural Sciences College
Anthropotomy Natural Sciences College
Pharmacology Natural Sciences College
Immunology Natural Sciences College
Management Natural Sciences College
Economics Natural Sciences College
Film Arts & Humanities Other
Music Arts & Humanities Other
Dance Arts & Humanities Other
Fine Arts Arts & Humanities Other
Computer Fundamentals Natural Sciences Other
Computer Programming Language Natural Sciences Other
Chinese Medicine Other Other
Ancient Chinese Language Other Other
Novels Other Other
Religion Other Other
Chinese Civil Service Examination Other Other
Table 8: All 71 tasks in M3KE, with their subject clusters and education levels.