etc.) using all synonyms. For example, a tweet is considered a complaint only if the LLM predicts it as a complaint using all synonyms; otherwise it is considered a non-complaint. We only report this metric for datasets with binary classes.
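As a concrete illustration, the following is a minimal sketch of how the two ensemble settings reported later in Table 4 can be computed from per-synonym predictions; the function and variable names are illustrative, not part of any released code.

```python
# Illustrative computation of the two ensemble settings reported in Table 4.
# `preds_by_synonym` maps each synonym label set to a list of binary predictions
# (True = positive class, e.g. complaint), one entry per tweet.
def ensemble_majority(preds_by_synonym):
    runs = list(preds_by_synonym.values())
    n = len(runs[0])
    # Positive class wins if more than half of the synonym prompts predict it.
    return [sum(run[i] for run in runs) > len(runs) / 2 for i in range(n)]

def ensemble_all_agreed(preds_by_synonym):
    # A tweet is labelled as the positive class only if every synonym prompt
    # predicts it; otherwise it is treated as the negative class.
    runs = list(preds_by_synonym.values())
    n = len(runs[0])
    return [all(run[i] for run in runs) for i in range(n)]
```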
3 Data

In order to ensure a comprehensive evaluation of LLM performance, we select six datasets that cover a wide range of computational social science tasks and different time spans. In particular, some of them were created before September 2021, while others were collected after the release of the LLMs used in this paper. All datasets are in English with manually annotated class labels. We detail dataset specifications and statistics in Table 2:

• Complaint This task aims to identify whether a tweet expresses a complaint, which is defined as 'a negative mismatch between reality and expectations in a particular situation' (e.g., customer complaints on Twitter) (Olshtain and Weinbach, 1987). We use a dataset developed by Preoţiuc-Pietro et al. (2019) consisting of 3,449 English tweets annotated with one of two categories, i.e., complaints or not complaints.

• Vaccine Stance This task aims to automatically predict the stance of tweets towards COVID-19 vaccination (Cotfas et al., 2021; Mu et al., 2023). The dataset developed by Cotfas et al. (2021) provides 2,792 tweets belonging to one of three stance categories: pro-vaccine, anti-vaccine, or neutral.

• Bragging This task aims to classify whether a tweet is bragging or not bragging. We evaluate on a dataset developed by Jin et al. (2022) which contains 6,696 tweets labelled as either bragging or not bragging.

• Rumour Stance We use the RumourEval 2017 dataset developed by Derczynski et al. (2017). Here, we use the dataset for 4-way rumour stance classification, i.e., determining the stance of a reply towards a given source post (i.e., rumour) as either supporting, denying, questioning, or commenting.

• Sarcasm The sarcasm detection task is to identify whether a given tweet is intended to be sarcastic or not. We evaluate the task on the SemEval-2022 Task 6 dataset (Farha et al., 2022), which contains 4,868 tweets labelled as either sarcasm or non-sarcasm.

• Hate Speech The task of hate speech detection aims to study anti-social behaviours, e.g., racism and sexism in social media. We evaluate on a dataset developed by Waseem and Hovy (2016) with a binary classification setup, i.e., offensive or non-offensive.
Dataset          # of Posts   Class (# of Posts)
Rumour Stance    5,568        Support (1,004) / Deny (415) / Query (464) / Comment (3,685)
Vaccine Stance   2,792        Pro Vaccine (991) / Anti Vaccine (791) / Neutral (1,010)
Complaint        3,449        Complaint (1,232) / Not Complaint (2,217)
Bragging         6,696        Bragging (781) / Not Bragging (5,915)
Sarcasm          4,868        Sarcasm (1,067) / Not Sarcasm (3,801)
Hate Speech      16,907       Offensive (5,348) / Non-offensive (11,559)
Table 2: Dataset specifications and statistics.
Table 3: LLM zero-shot classification results across all prompt settings. All datasets are evaluated with accuracy and macro-F1 scores. Blue highlighted cells denote prompt settings where zero-shot LLMs beat the strong supervised baseline (i.e., BERT-large fine-tuned on the training set). Bold text denotes the best result per task.
We use a low temperature (i.e., 0.2)8 for GPT to make the model more focused and deterministic. For OA, we follow the 'precise hyper-parameter setup'9 indicated in the OpenAssistant web interface, where the Temperature is 0.1, Top P is 0.95, Repetition Penalty is 1.2 and Top K is 50.

8 https://platform.openai.com/docs/api-reference/chat/create
9 https://open-assistant.io/dashboard
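For concreteness, a minimal sketch of how these decoding settings can be passed to the two models is shown below; the model identifiers and prompt text are placeholders, not the exact configuration used in our experiments.

```python
# Illustrative sketch only: model names and the prompt are placeholders.
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT via the OpenAI chat completions API, with a low temperature (0.2).
client = OpenAI()
gpt_response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    temperature=0.2,
    messages=[{"role": "user",
               "content": "Classify the tweet as Complaint or not Complaint: ..."}],
)
print(gpt_response.choices[0].message.content)

# OA via Hugging Face transformers, using the 'precise' decoding preset.
checkpoint = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
inputs = tokenizer("Classify the tweet as Complaint or not Complaint: ...",
                   return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_p=0.95,
    top_k=50,
    repetition_penalty=1.2,
    max_new_tokens=32,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```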
For BERT-large, we set the learning rate to 2e-5, the batch size to 16, and the maximum sequence length to 256. We run all baseline models three times with different random seeds and report average results. We fine-tune BERT-large on an Nvidia RTX Titan GPU with 24GB memory and run OA on an Nvidia A100 GPU with 40GB memory. The inference rates of OA and GPT are approximately 1,200 and 3,000 samples per hour, respectively.
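A minimal sketch of this baseline configuration follows, assuming the uncased BERT-large checkpoint and the Hugging Face Trainer; dataset loading, the number of epochs, and the seed values are assumptions rather than details stated in the text.

```python
# Sketch of the BERT-large baseline configuration; data loading is omitted.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")  # assumed: uncased variant
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2)  # 2 labels for the binary tasks

def tokenize(batch):
    # Truncate/pad tweets to the maximum sequence length of 256.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

args = TrainingArguments(
    output_dir="bert-large-baseline",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,   # assumption: epoch count not specified in the text
    seed=42,              # repeated with three different seeds in practice
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=tokenized_train, eval_dataset=tokenized_dev)
# trainer.train()
```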
4.6 Reproducibility of LLM Output

As noted above, to ensure a consistent output, we utilize low temperature values of 0.2 and 0.1 for GPT and OA, respectively. To evaluate the reproducibility of the models' output, we execute the basic prompt setting of the Complaint dataset five times for each language model. Our observations reveal that OA consistently generates identical outputs, whereas GPT achieves approximately 99% similarity in its outputs. Note that we consistently run OA on our own servers with the identical hardware described in Section 4.5.
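A sketch of this check is given below; the `classify` helper wrapping the basic prompt and the agreement measure are illustrative, not the exact procedure used.

```python
# Illustrative reproducibility check: re-run the same basic prompt several times
# and measure how often the predicted labels are identical across all runs.
def reproducibility(classify, texts, n_runs=5):
    """classify: hypothetical helper mapping a tweet to a predicted label."""
    runs = [[classify(t) for t in texts] for _ in range(n_runs)]
    identical = sum(
        all(run[i] == runs[0][i] for run in runs) for i in range(len(texts))
    )
    # Fraction of tweets receiving identical predictions in every run.
    return identical / len(texts)
```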
5 Results

The experimental results are shown in Table 3 and Table 4. Next we discuss them in relation to each of our three research questions.

(RQ 1) What level of zero-shot performance can LLMs achieve on social media classification tasks? How does zero-shot LLM performance compare against smaller state-of-the-art language models fine-tuned on the specific analysis task?
Synonyms                                                      GPT Accuracy   GPT F1-macro   OA Accuracy   OA F1-macro
Task 1: Complaint
Complaint / not Complaint 87.8 86.4 80.1 79.9
Grievance / not Grievance 87.3 85.7 82.3 81.9
Criticism / not Criticism 80.4 77.9 76.7 76.4
Dissatisfaction / no Dissatisfaction 84.6 83.9 66.7 66.7
Discontent / no Discontent 80.7 80.0 55.2 54.2
Ensemble Majority 84.8 83.5 76.1 76.0
Ensemble All Agreed 86.8 85.1 84.5 83.8
Task 2: Vaccine Stance
Pro Vaccine / Anti Vaccine / Neutral 72.4 73.6 64.2 63.7
In Favour of the Vaccine / Against the Vaccine / Neutral 73.5 74.2 64.4 63.9
Positive Sentiment / Negative Sentiment / Neutral 70.8 70.8 58.9 52.5
Belief in vaccine / not Belief in Vaccine / Neutral 74.4 75.2 61.9 59.5
Positive Attitude to Vaccine / Negative Attitude / Neutral 72.3 72.3 63.7 61.3
Ensemble Majority 74.7 75.4 64.2 63.5
Task 3: Bragging
Bragging / not Bragging 84.3 66.2 82.8 62.6
Boasting / not Boasting 82.7 65.2 78.4 60.9
Showing off / not Showing off 78.8 62.9 88.4 56.3
Self-aggrandizing / not Self-aggrandizing 81.1 62.0 88.1 60.1
Excessively Proud / not Excessively Proud 75.2 58.0 77.9 58.1
Ensemble Majority 83.4 65.4 86.0 63.9
Ensemble All Agreed 84.9 64.4 88.1 59.8
Task 4: Rumour Stance
Support / Deny / Query / Comment 51.5 33.3 46.1 27.9
Backing / Dismiss / Questioning / Comment 40.4 30.2 52.1 43.8
Support / Dismiss / Questioning / Comment 39.7 30.4 55.4 39.3
Ensemble 41.7 30.6 55.5 39.4
Task 5: Sarcasm
Sarcasm / not Sarcasm 62.9 59.7 64.4 54.8
Ironic / not Ironic 74.9 67.2 63.9 54.7
Insincere / Sincere 73.8 64.8 68.2 42.7
Disingenuous / Genuine 77.8 61.9 56.8 49.3
Satire / not Satire 76.9 62.8 75.2 53.1
Ensemble Majority 74.9 65.7 70.5 53.9
Ensemble All Agreed 80.1 58.9 76.9 51.2
Task 6: Hate Speech
Offensive / Non-offensive 70.4 69.1 69.8 68.2
Toxic / not Toxic 64.1 63.5 70.7 67.8
Abusive / not Abusive 72.2 69.3 64.8 64.2
Hateful / not Hateful 73.9 71.2 75.6 72.5
Derogatory / not Derogatory 68.2 66.8 58.1 58.1
Ensemble Majority 71.4 69.7 73.6 71.1
Ensemble All Agreed 75.1 71.6 75.0 70.6
Table 4: LLM zero-shot classification results using synonyms across all tasks. Green highlights are the original class names. Light grey highlighted cells denote where synonym prompt settings beat the original label. Bold text denotes the best result per model per task.
In general, LLMs (GPT and OA) with zero-shot settings are able to achieve better results than the simple supervised Logistic Regression model. However, the traditional smaller fine-tuned language model (BERT-large) still outperforms the two LLMs on the majority of the tasks (4 out of 6 tasks). Furthermore, we observe that GPT consistently outperforms OA across all prompt settings and tasks when considering only the F1-macro measure. However, our results show that the accuracy of OA is better than that of GPT on some imbalanced datasets, such as 'Bragging' and 'Sarcasm'. This may be due to OA defaulting to the neutral class (labels without any specific speech act, such as 'Not Bragging' and 'Not Sarcastic').

GPT achieves the best predictive performance on two speech act detection downstream tasks, namely Complaint (89.7 accuracy and 88.7 F1-macro) and Sarcasm (62.1 F1-macro). This suggests that LLMs can be employed as strong baseline models for zero-shot classification tasks.
With respect to prompts, when the results of T/L Desc and Memory Recall are compared against Basic Instruction, we observe that using a more complex prompt (e.g., adding label and paper information) does not necessarily improve model performance and may even introduce additional noise, leading to a degradation in performance.

For speech act detection tasks such as Complaint and Bragging, the accuracy of LLMs exceeds 85%, indicating that LLMs can potentially be used for data annotation as a way to reduce human annotation costs. Standard data annotation tasks typically rely on at least two annotators in the first round, so one of them could be replaced by an LLM. According to the annotation details10 of the vaccine stance task (Poddar et al., 2022), the agreement rate between the two annotators is approximately 62%.

10 https://github.com/sohampoddar26/covid-vax-stance/tree/main/dataset
(RQ 2) What are the most effective LLM prompt strategies for social media classification tasks in a zero-shot setting?

Table 3 compares prompts of different complexity and shows that the simple prompt strategy works reasonably well. For GPT, adding task and label descriptions typically achieves better results, i.e., these prompts achieved the best results on 4 out of 6 datasets compared to other GPT prompt strategies. On the other hand, OA achieves mixed results. On average, for OA, simple prompts outperform their complex counterparts. This may happen because complex prompts add additional noise to the model. We also note that adding a few examples to the prompt actually damages classification performance for both GPT and OA. We hypothesise that the longer prompt affects the model's interpretation of the instructions.
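To make the comparison concrete, the sketch below illustrates how prompts of increasing complexity could be assembled for the Complaint task. The wording is hypothetical: it only approximates the Basic Instruction, T/L Desc and few-shot settings, and the actual task descriptions are those listed in Table 8.

```python
# Hypothetical prompt templates for the Complaint task; the exact wording used
# in our experiments differs (see Table 8 for the real task descriptions).
def basic_instruction(tweet: str) -> str:
    return (f"Classify the following tweet as Complaint or not Complaint.\n"
            f"Tweet: {tweet}\nLabel:")

def task_label_description(tweet: str) -> str:
    task_desc = ("A complaint expresses a negative mismatch between reality "
                 "and expectations in a particular situation.")
    return (f"{task_desc}\nClassify the following tweet as Complaint or "
            f"not Complaint.\nTweet: {tweet}\nLabel:")

def few_shot(tweet: str, examples: list[tuple[str, str]]) -> str:
    # `examples` holds (tweet, label) demonstrations prepended to the prompt.
    demos = "\n".join(f"Tweet: {t}\nLabel: {l}" for t, l in examples)
    return (f"Classify tweets as Complaint or not Complaint.\n{demos}\n"
            f"Tweet: {tweet}\nLabel:")
```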
Table 4 shows all zero-shot results when synonyms are used in prompts for all six datasets. We observe that revising prompts with synonyms can substantially improve the zero-shot performance of OA, except for the Bragging dataset. It is worth noting that the Sarcasm dataset is the only one where the prompt using the original categories performs worse. This suggests that replacing original labels with synonyms allows the OA model to better understand the task requirements. The diversity in the distribution of training examples used in the RLHF fine-tuning of both GPT and OA may explain the models' behaviour. For example, the OA model might have been fine-tuned on examples like: '[Text including offensive language] + [Category: Abusive]'. Therefore, we believe that it is important to test similar words in place of the original labels when designing instructions, as well as to use ensemble methods.
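The synonym substitution itself amounts to swapping the class names inside an otherwise fixed prompt template and collecting one prediction per synonym pair, which can then be aggregated as described earlier. The label sets below are taken from Table 4; the prompt wording and the `classify` helper are illustrative.

```python
# Build one prompt per synonym pair for the Complaint task and collect predictions.
COMPLAINT_SYNONYMS = [
    ("Complaint", "not Complaint"),
    ("Grievance", "not Grievance"),
    ("Criticism", "not Criticism"),
    ("Dissatisfaction", "no Dissatisfaction"),
    ("Discontent", "no Discontent"),
]

def synonym_predictions(tweet, classify):
    preds = []
    for pos, neg in COMPLAINT_SYNONYMS:
        prompt = f"Classify the following tweet as {pos} or {neg}.\nTweet: {tweet}\nLabel:"
        answer = classify(prompt)  # hypothetical model call
        # True if the model answered with the positive-class label.
        preds.append(answer.strip().lower().startswith(pos.lower()))
    return preds  # one prediction per synonym prompt, ready for ensembling
```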
(RQ 3) What are the potential risks arising from the implementation of these prompt strategies?

We also test different prompting strategies (e.g., asking for the authors, the task details and the name of each category) to determine whether LLMs have been exposed to the dataset before. Given that some models can recall the task details given the title of the arXiv paper (i.e., memory recall), it remains unclear whether the dataset has been seen in the training corpus.

6 Error Analysis

To better understand the limitations of LLMs, we conduct an error analysis focusing on shared errors across all synonym settings, following Ziems et al. (2023). We manually check these wrong predictions and observe that some unanimous errors (Ziems et al., 2023) (i.e., when the model agrees on an incorrect answer using different synonyms) are caused by incorrect or controversial ground truth labels.11

11 We summarize the number of wrong predictions from the synonyms experiments on GPT in Appendix A, Table 6.

On the other hand, we observe that OA often defaults to the majority category, such as 'not bragging' and 'not sarcasm', which leads to higher accuracy but a lower macro-F1 measure. However, considering the high accuracy of LLM zero-shot classification performance, LLMs can still be utilized as data annotation tools (combined with human efforts) for NLP downstream tasks in CSS. We can utilise LLMs for data annotation and also to identify incorrect annotations.
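A sketch of how such unanimous errors can be counted is given below, using the same per-synonym prediction structure as the earlier sketch; the names are illustrative.

```python
# Count unanimous errors: instances where every synonym prompt yields the same
# prediction and that shared prediction disagrees with the gold label.
def unanimous_errors(preds_by_synonym, gold):
    runs = list(preds_by_synonym.values())
    errors = []
    for i, gold_label in enumerate(gold):
        first = runs[0][i]
        if all(run[i] == first for run in runs) and first != gold_label:
            errors.append(i)
    return errors  # indices summarised as '# of Unanimous Error' in Table 6
```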
7 Related Work

Both models evaluated in this work, GPT (also referred to as ChatGPT) and OA, have been trained using Reinforcement Learning from Human Feedback (RLHF) in conjunction with instruction tuning, as first explored in Ouyang et al. (2022). Instruction tuning is the fine-tuning of language models on NLP tasks rephrased as instructions, and prior work has shown that it is an effective way of training LLMs to perform zero-shot on unseen tasks (Wei et al., 2021; Sanh et al., 2021).
Longpre et al. (2023) carried out a detailed ablation study on non-RLHF instruction tuning methods across the general NLP tasks in the Flan 2022 collection and found that T5 instruction-tuned on Flan performed surprisingly well on held-out tasks when compared to models directly fine-tuned on said tasks. Tuning with human feedback could be the next step in improving instruction tuning in this area.

Ziems et al. (2023) set a roadmap for employing LLMs as data annotators by establishing prompting best practices and an evaluation of the zero-shot performance of 13 language models on 24 tasks in computational social science. Our work is distinct from this research as we evaluate LLM performance on a different set of benchmarks and models, and experiment with different prompt modification strategies such as using synonyms for class labels and adding arXiv paper titles.

To evaluate the zero-shot performance of ChatGPT for text classification, Kuzman et al. (2023) compare against a fine-tuned XLM-RoBERTa model for the task of automatic genre classification in English and Slovenian. They show that ChatGPT outperforms the baseline on unseen datasets and that there is no drop in performance when provided with Slovenian examples.

In their study, Ganesan et al. (2023) use Facebook posts to classify user personality traits, based on openness, conscientiousness, extroversion, agreeableness, and neuroticism. They find that GPT-3 performs poorly on binary and worse yet on tertiary ranking for each trait.

LLMs have also been applied in mental health applications. Lamichhane (2023) evaluates ChatGPT's ability to classify stress, depression, and suicidal inclination from Reddit posts. Although ChatGPT significantly outperforms their baseline, the baseline consisted of a simple prediction of the majority class.

For toxicity detection, Wang and Chang (2022) analyse GPT-3's generative and discriminative zero-shot capabilities, finding that performance is only slightly better than a random baseline. However, the authors argue that the generative task allows for a nuanced distinction of toxicity in the somewhat subjective binary setting.

Törnberg (2023) finds that ChatGPT-4 outperforms non-expert annotators in identifying the political affiliation of Democratic or Republican party members based on their tweets during the 2020 US election. Wu et al. (2023) use ChatGPT to rank the conservatism of representatives in the 116th US Congress through a series of pairwise match-ups, showing a high correlation with DW-NOMINATE scores.

As LLMs improve their performance on language generation tasks, the risk of misinformation and propaganda increases. Mitchell et al. (2023) propose DetectGPT, a perturbation-based zero-shot method for identifying machine-generated passages. Su et al. (2023) further develop this approach with DetectLLM-LRR and -NPR, achieving improved efficiency and improved performance, respectively.

Since our focus is primarily on out-of-the-box performance, we experiment with simple alterations of the prompts. Other research, e.g. Arora et al. (2022), has considered prompt aggregation as well as using LLMs to auto-generate prompts. We also do not explore advanced methods such as chain-of-thought prompting, which improves LM performance by encouraging it to output its intermediate reasoning steps (Wei et al., 2022; Suzgun et al., 2022).

8 Conclusion

This paper explored a number of prompting strategies for the application of Large Language Models (LLMs) in computational social science tasks. It presented a range of controlled experiments that establish the efficacy of different prompt strategies on six publicly available datasets. Our main findings are summarised as follows:

• Task-specific fine-tuned models generally tend to outperform LLMs in zero-shot settings.

• More detailed and complex prompts (e.g., adding the arXiv paper title and a few examples) do not necessarily enhance classification performance.

• The selection of specific words or phrases as the class label can considerably affect classification outcomes.

We therefore argue that developing prompts for zero-shot classification presents a significant challenge and recommend testing different prompt configurations before proceeding with experiments, while keeping in mind the time constraints12 and financial costs associated with LLMs (see Table 7 in Appendix A).

12 https://platform.openai.com/docs/guides/rate-limits/overview
Limitations

In this paper, we assess the zero-shot classification performance on six downstream tasks in Computational Social Science. We acknowledge that further experiments on other fine-grained tasks would be beneficial in future work.

We also tried to explore potential data leakage issues (Ziems et al., 2023) by testing various prompts to verify whether our test sets have been exposed to GPT and OA. However, due to the black box nature of the training datasets of these two LLMs, we are unable to confirm the presence of data leakage.

Ethics Statement

Our work has received ethical approval from the Ethics Committee of our university and complies with the research policies of Twitter. All datasets are obtained through the links provided in the respective research papers or by requesting them directly from the authors. Furthermore, we can confirm that the data has been fully anonymised before being fed to the LLMs for model inference.
References

Simran Arora, Avanika Narayan, Mayee F Chen, Laurel J Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. 2022. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Liviu-Adrian Cotfas, Camelia Delcea, Ioan Roxin, Corina Ioanăş, Dana Simona Gherai, and Federico Tajariol. 2021. The longest month: analyzing covid-19 vaccination opinions dynamics from tweets in the month following the first vaccine announcement. IEEE Access, 9:33203–33223.

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 69–76, Vancouver, Canada. Association for Computational Linguistics.

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 2021. 8-bit optimizers via block-wise quantization. arXiv preprint arXiv:2110.02861.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Ibrahim Abu Farha, Silviu Vlad Oprea, Steven Wilson, and Walid Magdy. 2022. SemEval-2022 task 6: iSarcasmEval, intended sarcasm detection in English and Arabic. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 802–814.

Adithya V Ganesan, Yash Kumar Lal, August Håkan Nilsson, and H Andrew Schwartz. 2023. Systematic evaluation of GPT-3 for zero-shot personality estimation. arXiv preprint arXiv:2306.01183.

Mali Jin, Daniel Preoţiuc-Pietro, A Doğruöz, and Nikolaos Aletras. 2022. Automatic identification and classification of bragging in social media. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3945–3959.

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. 2023. OpenAssistant conversations – democratizing large language model alignment. arXiv preprint arXiv:2304.07327.

Taja Kuzman, Nikola Ljubešić, and Igor Mozetič. 2023. ChatGPT: Beginning of an end of manual annotation? Use case of automatic genre identification. arXiv preprint arXiv:2303.03953.

Bishal Lamichhane. 2023. Evaluation of ChatGPT for NLP-based mental health applications. arXiv preprint arXiv:2303.15727.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. 2023. The Flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688.

Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. 2023. DetectGPT: Zero-shot machine-generated text detection using probability curvature. arXiv preprint arXiv:2301.11305.

Yida Mu, Mali Jin, Charlie Grimshaw, Carolina Scarton, Kalina Bontcheva, and Xingyi Song. 2023. VaxxHesitancy: A dataset for studying hesitancy towards COVID-19 vaccination on Twitter. In Proceedings of the International AAAI Conference on Web and Social Media, volume 17, pages 1052–1062.

Elite Olshtain and Liora Weinbach. 1987. Complaints: A study of speech act behavior among native and non-native speakers of Hebrew. In The Pragmatic Perspective, page 195. John Benjamins.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Soham Poddar, Mainack Mondal, Janardan Misra, Niloy Ganguly, and Saptarshi Ghosh. 2022. Winds of change: Impact of COVID-19 on vaccine-related opinions of Twitter users. In Proceedings of the International AAAI Conference on Web and Social Media, volume 16, pages 782–793.

Daniel Preoţiuc-Pietro, Mihaela Gaman, and Nikolaos Aletras. 2019. Automatically identifying complaints in social media. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5008–5019.

Michael V Reiss. 2023. Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark. arXiv preprint arXiv:2304.11085.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

Jinyan Su, Terry Yue Zhuo, Di Wang, and Preslav Nakov. 2023. DetectLLM: Leveraging log rank information for zero-shot detection of machine-generated text. arXiv preprint arXiv:2306.05540.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. 2022. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California. Association for Computational Linguistics.

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.

Patrick Y Wu, Joshua A Tucker, Jonathan Nagler, and Solomon Messing. 2023. Large language models can be used to estimate the ideologies of politicians in a zero-shot learning setting. arXiv preprint arXiv:2303.12057.

Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. 2023. Can large language models transform computational social science? arXiv preprint arXiv:2305.03514.

Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. 2018. Detection and resolution of rumours in social media: A survey. ACM Computing Surveys (CSUR), 51(2):1–36.

A Appendix
Table 6: We conduct further error analysis on the model outputs across all datasets. # of Unanimous Error denotes
cases in which the LLM unanimously agrees on an incorrect answer while using different synonyms.
Task             Round 1 Tokens   Round 1 USD   Round 2 Tokens   Round 2 USD
Rumour Stance    35k / 51         <0.1          82k / 119        0.2
Vaccine Stance   31k / 127        <0.1          86k / 45         0.2
Complaint        23k / 33         <0.1          62k / 91         0.1
Bragging         52k / 76         0.1           96k / 140        0.2
Hate Speech      62k / 90         0.1           94k / 137        0.2
Sarcasm          28k / 41         <0.1          50k / 86         0.1
Table 7: Number of tokens and cost (USD) per task across the two rounds of experiments.
Table 8: Task descriptions used for the prompting strategy ‘Task and Label Description (T/L Desc)’.