
Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science

Yida Mu, Ben P. Wu, William Thorne, Ambrose Robinson, Nikolaos Aletras, Carolina Scarton, Kalina Bontcheva, Xingyi Song
Department of Computer Science, The University of Sheffield
{y.mu, bpwu1, wthorne1, arobinson10, n.aletras, c.scarton, k.bontcheva, x.song}@sheffield.ac.uk

arXiv:2305.14310v2 [cs.CL] 20 Sep 2023

Abstract

Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands associated with training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, in the context of six Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiments investigate the impact of prompt complexity, including the effect of incorporating label definitions into the prompt, the use of synonyms for label names, and the influence of integrating past memories during foundation model training. The findings indicate that, in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10%.

1 Introduction

Instruction fine-tuning (Ouyang et al., 2022) has facilitated transfer learning for Large Language Models (LLMs) to unseen tasks at scale. To leverage LLMs as versatile natural language processors, there is an immediate effort to ascertain their zero-shot performance on challenging tasks. Social media analysis is an active area of research with a number of complex, domain-specific tasks which can be utilised for harm reduction (Waseem et al., 2017) and preventing the spread of misinformation (Zubiaga et al., 2018). LLMs have great potential to assist with such computational social science tasks, both in automatic data annotation and social media analysis (Kuzman et al., 2023; Reiss, 2023; Törnberg, 2023). Hence, it is important to understand the capabilities and limitations of the latest instruction fine-tuned LLMs for addressing such computational social science tasks. In this paper, we primarily focus on answering the following research questions (RQs):

• (RQ 1) What level of zero-shot performance can LLMs achieve in social media classification tasks? How does zero-shot LLM performance compare against smaller state-of-the-art language models fine-tuned to the specific analysis task?

• (RQ 2) What are the most effective LLM prompt strategies for social media classification tasks in a zero-shot setting?

• (RQ 3) What are the potential risks arising from the implementation of these prompt strategies?

To answer these research questions, we conduct a series of controlled experiments to investigate the zero-shot performance of two off-the-shelf instruction fine-tuned large language models using different prompting strategies. Namely, we experiment with GPT-3.5-turbo (GPT; https://openai.com/blog/chatgpt), the most widely used proprietary instruction fine-tuned large language model, and OpenAssistant-LLaMA (OA) (Köpf et al., 2023), an open-source LLM instruction fine-tuned based on LLaMA (Touvron et al., 2023). We use six social media analysis NLP tasks to evaluate the classification performance of LLMs using different prompt complexity levels (including providing few-shot examples and publication information of the benchmark datasets in the prompt). The findings are also compared against baselines employing standard techniques such as fine-tuning BERT.

It must be noted that the scope of this paper is evaluating the performance of off-the-shelf, instruction fine-tuned language models on social media classification tasks in a zero-shot setting. The evaluation of foundation language models without instruction fine-tuning is out of the scope of this paper.

Our main findings are:

• (i) Task-specific fine-tuned models still generally tend to outperform LLMs in most zero-shot settings, even when the fully fine-tuned model (e.g., a BERT-large model) is significantly smaller.

• (ii) Using prompting ensemble methods (e.g., over synonyms) can increase the performance and robustness of LLMs.

• (iii) Detailed and complex prompting strategies are not always necessary.
2 Methodology

2.1 Prompting Strategies

Following the prompting approaches described by Child et al. (2019); Ziems et al. (2023), we develop prompts by (i) adding instructions after the context (e.g., the task description) and (ii) using constraints (e.g., 'Only reply with Bragging' or 'Not Bragging.') at the end. We observe that using constraints can effectively avoid cases of model uncertainty (e.g., 'As an AI model, I cannot answer this question.') and guide models to generate the expected outputs.

For consistency, we use the same prompts for both GPT and OA. Examples of the different prompt strategies are displayed in Table 1. To examine the zero-shot predictive performance of LLMs, we carry out a comprehensive set of experiments using four different prompting strategies.

Basic Instruction (Basic): We only provide a basic instruction without including detailed task and label descriptions. For example, for the bragging detection task, our prompt is: 'Identify whether or not a tweet includes a bragging statement. + Constraints + Text'. Two possible configurations are tested, namely adding the prompt before or after the text respectively.
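To make the mechanics concrete, the sketch below shows how the 'Basic Instruction + Constraints + Text' prompt for the bragging task could be assembled and sent to GPT-3.5-turbo. This is an illustrative sketch rather than the authors' released code: it assumes the pre-1.0 openai Python client, and the classify_tweet helper and placeholder API key are hypothetical.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

INSTRUCTION = "Identify whether or not a tweet includes a bragging statement."
CONSTRAINT = "Only reply (bragging) or (not bragging)."

def classify_tweet(tweet: str, prompt_before_text: bool = True) -> str:
    # 'Basic Before' places the instruction and constraint before the tweet;
    # 'Basic After' appends them after it.
    if prompt_before_text:
        prompt = f"{INSTRUCTION} {CONSTRAINT}\nTweet: {tweet}"
    else:
        prompt = f"Tweet: {tweet}\n{INSTRUCTION} {CONSTRAINT}"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature for near-deterministic outputs (see Section 4.5)
    )
    return response["choices"][0]["message"]["content"].strip()

The same helper covers both the 'Basic Before' and 'Basic After' configurations by switching where the instruction is placed relative to the text.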
Task and Label Description (T/L Desc): Building upon the Basic Instruction round, we provide additional information in the prompt by including task and label descriptions (see Table 8). Note that we use the labels and task descriptions detailed in the original papers on the respective datasets. The format of the prompts used for the Task and Label Description round is: 'Basic Instruction + Task and Label Descriptions + Constraints + Text'.

Few-sample Prompting: We also test a few-sample prompting strategy by adding one example selected from the training set for each label. The prompt designed for the few-sample experiments is: 'Basic Instruction + Few-shot Examples + Constraints + Text'. Note that using few-sample input is still a type of zero-shot setup, as we do not fine-tune the model.

Memory Recall (Recall): We observe that both GPT and OA can recall papers published before September 2021 (for prompts and responses see Table 5 in Appendix A). Since arXiv papers are part of the training data of the LLMs, we also include the title of the source paper in the prompt when evaluating the model's zero-shot performance. For example, we include paper information by using this prompt: 'Recall this paper [Paper Title] + Basic Instruction + Constraints + Text'. For such recall prompts, we only perform experiments on datasets published before September 2021. For reference, we examine the variations in performance across different checkpoints to assess whether instruction fine-tuning might influence the efficacy of the classification task.

2.2 Synonyms

LLMs trained with reinforcement learning from human feedback (RLHF) can generate different outputs when using prompts which are semantically similar (e.g., synonyms; appropriate synonyms were selected by consulting https://www.thesaurus.com). To test the generalisability of LLMs, we substitute the names of each class with words that have the same or similar meaning. For example, we test the synonyms 'hateful', 'toxic', and 'abusive' to replace the original category 'offensive'. We also use two ensemble learning approaches to improve predictive performance by combining the outputs from all synonym settings for each dataset:

• Ensemble Majority: We select the category that has been predicted the most times across all synonym experiments.

• Ensemble All Agreed: We also experiment with a stricter setting that assigns a class (e.g., Complaint, Criticism, Dissatisfaction, etc.) only when all synonym runs agree. For example, a text is considered a complaint only if the LLM predicts the complaint class under every synonym; otherwise it is considered a non-complaint. We only report this metric for datasets with binary classes.
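For concreteness, here is a minimal sketch of the two ensemble rules applied to per-synonym predictions. It is not taken from the paper's codebase; the data layout (a mapping from each synonym setting to a list of per-instance labels, already normalised back to the original label names) is an assumption made for illustration.

from collections import Counter
from typing import Dict, List

def ensemble_majority(predictions: Dict[str, List[str]]) -> List[str]:
    # For each instance, keep the label predicted most often across synonym runs.
    runs = list(predictions.values())
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*runs)]

def ensemble_all_agreed(predictions: Dict[str, List[str]],
                        positive: str, negative: str) -> List[str]:
    # Binary-only rule: assign the positive class only if every synonym run agrees on it.
    runs = list(predictions.values())
    return [positive if all(label == positive for label in labels) else negative
            for labels in zip(*runs)]

# Example with two synonym runs over three tweets:
# preds = {"complaint": ["complaint", "not complaint", "complaint"],
#          "grievance": ["complaint", "not complaint", "not complaint"]}
# ensemble_all_agreed(preds, "complaint", "not complaint")
#   -> ["complaint", "not complaint", "not complaint"]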
Task: Bragging — Basic
Basic Instruction (i.e., Identify whether or not a tweet includes a bragging statement.)
+ Constraints (i.e., Only reply (bragging) or (not bragging).)
+ Text (e.g., Tweet: Come watch me and @USER face off in 2K best of 3 series #braggingrights @USER you next boiiii.)

Task: Vaccine Stance — Basic + T/L Desc
Basic Instruction
+ T/L Desc (i.e., Tweets that have been assigned to the class 'pro vaccine' express a positive opinion regarding the vaccination. Tweets belonging to the 'anti vaccine' class express a negative opinion towards COVID-19 vaccination. The 'neutral' class mainly includes news related to the development of vaccines, tweets that do not express a clear opinion, such as questions regarding the vaccine, and informative tweets concerning vaccination.)
+ Constraints + Text

Task: Complaint — Few-sample
Basic Instruction
+ Few-shot Examples (e.g., (i) Complaint: @USER @USER give the timeline by which I'll receive my cashback which I should have received by 15th October 2017. (ii) Not Complaint: I just gave 5 stars to Nancy at @USER for the great service I received!)
+ Constraints + Text

Task: Hate Speech — Memory Recall
Basic Instruction
+ arXiv Paper Title (i.e., Recall this paper: Hateful symbols or hateful people? predictive features for hate speech detection on twitter.)
+ Constraints + Text

Table 1: Prompt Examples.

3 Data

In order to ensure a comprehensive evaluation of LLM performance, we select six datasets that cover a wide range of computational social science tasks and different time spans. In particular, some of them were created before September 2021, while others were collected after the release of the LLMs used in this paper. All datasets are in English with manually annotated class labels. We detail dataset specifications and statistics in Table 2.

• Complaint This task aims to identify whether a tweet expresses a complaint, which is defined as 'a negative mismatch between reality and expectations in a particular situation' (e.g., customer complaints on Twitter) (Olshtain and Weinbach, 1987). We use a dataset developed by Preoţiuc-Pietro et al. (2019) consisting of 3,449 English tweets annotated with one of two categories, i.e., complaint or not complaint.

• Vaccine Stance This task aims to automatically predict the stance of tweets towards COVID-19 vaccination (Cotfas et al., 2021; Mu et al., 2023). The dataset developed by Cotfas et al. (2021) provides 2,792 tweets belonging to one of three stance categories: pro-vaccine, anti-vaccine, or neutral.

• Bragging This task aims to classify whether a tweet is bragging or not bragging. We evaluate on a dataset developed by Jin et al. (2022) which contains 6,696 tweets labelled as either bragging or not bragging.

• Rumour Stance We use the RumourEval 2017 dataset developed by Derczynski et al. (2017). Here, we use the dataset for 4-way rumour stance classification, i.e., determining the stance of a reply towards a given source post (i.e., a rumour) as either supporting, denying, questioning, or commenting.

• Sarcasm The sarcasm detection task is to identify whether a given tweet is intended to be sarcastic or not. We evaluate the task on the SemEval-2022 Task 6 dataset (Farha et al., 2022), which contains 4,868 tweets labelled as either sarcasm or non-sarcasm.

• Hate Speech The task of hate speech detection aims to study anti-social behaviours, e.g., racism and sexism, in social media. We evaluate on a dataset developed by Waseem and Hovy (2016) with a binary classification setup, i.e., offensive or non-offensive.
Dataset # of Posts Class (# of Posts)
Rumour Stance 5,568 Support (1,004) / Deny (415) / Query (464) / Comment (3,685)
Vaccine Stance 2,792 Pro Vaccine (991) / Anti Vaccine (791) / Neutral (1,010)
Complaint 3,449 Complaint (1,232) / Not Complaint (2,217)
Bragging 6,696 Bragging (781) / Not Bragging (5,915)
Sarcasm 4,868 Sarcasm (1,067) / Not Sarcasm (3,801)
Hate speech 16,907 Offensive (5,348) / Non-offensive (11,559)

Table 2: Dataset Specifications.

4 Experimental Setup

4.1 Large Language Models

Our experiments are conducted using two publicly accessible large language models:

GPT-3.5-turbo (GPT) is an enhanced version of the GPT-3 language model with instruction fine-tuning (https://platform.openai.com/docs/models/gpt-3-5). GPT can be employed for a wide range of NLP tasks, including machine translation, common sense reasoning, and question answering. The experiments use the GPT model via the official OpenAI API (https://platform.openai.com/docs/api-reference).

OpenAssistant-LLaMA (OA) We employ the OpenAssistant (OA) model developed by LAION-AI, which fine-tunes the LLaMA 30B model (Touvron et al., 2023) using the OpenAssistant dataset (Köpf et al., 2023). We use the OASST-SFT-6-LLaMA-30B version of the model. Since the original LLaMA weights may not be redistributed by individuals, LAION-AI could not release the OA weights directly on Hugging Face; instead, it released XOR (i.e., 'exclusive or') weights (https://huggingface.co/OpenAssistant/oasst-sft-7-llama-30b-xor) that are applied to the original LLaMA weights, together with checksum calculations to validate the conversion. In order to run the experiments locally under hardware constraints, we apply 8-bit quantisation at model load time via BitsAndBytes (Dettmers et al., 2021) to decrease the inference memory requirements.
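The paper does not include loading code; the following sketch shows one way the 8-bit quantised model could be loaded with the Hugging Face transformers/bitsandbytes integration, assuming the XOR weights have already been applied to the original LLaMA checkpoint and stored at a hypothetical local path.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/models/oasst-sft-6-llama-30b"  # hypothetical merged (de-XORed) checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # 8-bit weights via bitsandbytes to reduce inference memory
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)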
4.2 Baselines

The zero-shot classification performance of the two LLMs is compared against a weak Logistic Regression baseline and a strong fully fine-tuned BERT-large baseline:

Logistic Regression We represent the text using TF-IDF and consider tokens that appear more than 5 times.

BERT-large We fine-tune BERT-large (https://huggingface.co/bert-large-uncased) (Devlin et al., 2019) by adding a linear classifier on top of the 24-layer transformer blocks. The special token '[CLS]' is used as the representation of each text.
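A minimal sketch of the logistic regression baseline as described above, using scikit-learn; anything the paper does not specify (solver, regularisation, the exact token-frequency cut-off) is left at library defaults or approximated.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    # keep tokens appearing in more than five posts (an approximation of the
    # frequency cut-off described above)
    TfidfVectorizer(min_df=6),
    LogisticRegression(max_iter=1000),
)
# baseline.fit(train_texts, train_labels)
# predictions = baseline.predict(test_texts)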
4.3 Data Splits

For each benchmark task, we divide the dataset into training (80%) and test (20%) sets using stratified random splits (generated with the dataset splitting utilities from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection). The training set is used for supervised fine-tuning and is further sub-divided into training and validation subsets (in a 3:1 ratio) for hyperparameter tuning (e.g., early stopping) purposes. Subsequently, we evaluate the performance of the fine-tuned baselines and the zero-shot LLMs on the 20% test set.
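A sketch of the stratified splitting described above with scikit-learn; texts and labels stand for the posts and gold labels of one dataset, and the random seed is an arbitrary illustrative choice.

from sklearn.model_selection import train_test_split

# texts, labels: parallel lists of posts and gold labels for one dataset
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42)

# 3:1 train/validation sub-split of the 80% training portion
tr_texts, val_texts, tr_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.25, stratify=train_labels, random_state=42)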
4.4 Evaluation Metrics

Performance results are reported using two evaluation metrics: 1) accuracy, which is a direct comparison between the model predictions and the ground truth labels; and 2) macro-F1, which is reported for situations where accuracy may not provide an adequate representation of performance, particularly for imbalanced datasets such as Bragging and Rumour Stance.
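To illustrate why both metrics are reported, the short sketch below computes them with scikit-learn on a toy imbalanced example: a classifier that always predicts the majority class obtains high accuracy but a low macro-F1.

from sklearn.metrics import accuracy_score, f1_score

gold = ["not bragging"] * 9 + ["bragging"]
preds = ["not bragging"] * 10  # always predict the majority class

print(accuracy_score(gold, preds))                              # 0.9
print(f1_score(gold, preds, average="macro", zero_division=0))  # ~0.47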
Model | Complaint (Accuracy / F1-macro) | Vaccine Stance (Accuracy / F1-macro) | Bragging (Accuracy / F1-macro)
Logistic Regression 81.4 79.7 72.8 73.1 88.6 58.8
BERT-large 89.4 88.6 81.5 81.3 91.3 76.1
GPT Basic After 84.9 84.1 65.5 65.8 81.1 62.7
GPT Basic Before 89.7 88.7 72.4 73.6 84.3 66.2
GPT T/L Desc 89.0 88.0 73.3 73.7 84.9 67.4
GPT Memory Recall 87.1 86.4 66.2 66.9 79.8 64.6
GPT Few-sample 85.6 85.2 68.2 69.4 77.3 61.8
OA Basic After 65.5 65.4 60.5 57.8 57.8 50.1
OA Basic Before 80.1 79.9 64.2 63.7 82.8 62.6
OA Basic (OAT 7) 83.9 83.4 66.4 65.9 64.1 42.0
OA T/L Desc 65.3 65.2 73.7 73.6 88.4 48.2
OA Memory Recall 82.6 82.1 64.2 63.8 88.1 46.8
OA Memory Recall (OA 7) 76.4 76.3 67.8 67.9 67.9 43.0
OA Few-sample 87.7 86.9 66.5 67.3 75.4 59.8
Model | Rumour Stance (Accuracy / F1-macro) | Sarcasm (Accuracy / F1-macro) | Hate Speech (Accuracy / F1-macro)
Logistic Regression 68.5 40.9 76.1 53.5 83.2 79.2
BERT-large 73.2 48.2 78.9 58.4 84.5 81.2
GPT Basic After 53.0 36.2 74.3 65.8 72.9 77.0
GPT Basic Before 51.5 33.3 62.9 59.7 70.4 69.1
GPT T/L Desc 59.2 45.7 61.3 57.9 76.9 72.1
GPT Memory Recall 40.2 30.9 52.8 51.7 71.7 69.6
GPT Few-sample 40.8 30.6 68.9 64.9 74.8 71.8
OA Basic After 61.7 29.3 41.6 41.6 56.0 55.9
OA Basic Before 46.1 27.9 64.4 54.8 69.8 68.2
OA Basic (OAT 7) 63.1 35.4 61.4 38.8 58.1 58.1
OA T/L Desc 56.2 29.0 75.9 49.9 75.5 73.3
OA Memory Recall 52.4 34.6 78.1 43.9 55.4 55.4
OA Memory Recall (OA 7) 48.8 33.1 71.9 42.9 58.7 58.7
OA Few-sample 28.3 20.7 71.3 42.6 70.0 68.4

Table 3: LLM zero-shot classification results across all prompt settings. All datasets are evaluated with accuracy and macro-F1 scores. Blue highlighted cells denote prompt settings where the zero-shot LLMs beat the strong supervised baseline (i.e., BERT-large fine-tuned on the training set). Bold text denotes the best result per task.

4.5 Hyper-parameters

During initial explorations, we observed that using a higher temperature (e.g., 0.8 for GPT and 2 for OA) results in inadequate classification performance, as it introduces more randomness in the model outputs. Higher temperature settings can also cause the model outputs to be non-reproducible. Therefore, in this study we use a low temperature (i.e., 0.2) for GPT to make the model more focused and deterministic (see https://platform.openai.com/docs/api-reference/chat/create). For OA, we follow the 'precise' hyper-parameter setup indicated in the OpenAssistant web interface (https://open-assistant.io/dashboard), where the temperature is 0.1, top-p is 0.95, the repetition penalty is 1.2 and top-k is 50.

For BERT-large, we set the learning rate to 2e-5, the batch size to 16, and the maximum sequence length to 256. We run all baseline models three times with different random seeds and report averaged results. We fine-tune BERT-large on an Nvidia RTX Titan GPU with 24GB memory and run OA on an Nvidia A100 GPU with 40GB memory. The inference rates of OA and GPT are approximately 1,200 and 3,000 samples per hour respectively.
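As an illustration only (the exact inference code is not given in the paper), the 'precise' decoding settings above map onto standard Hugging Face generation arguments. The prompt formatting and token budget below are assumptions, and model and tokenizer refer to the objects loaded in the Section 4.1 sketch.

# prompt: the assembled instruction + constraints + text string
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_p=0.95,
    top_k=50,
    repetition_penalty=1.2,
    max_new_tokens=16,  # the expected reply is just a short class label
)
reply = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                         skip_special_tokens=True)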
4.6 Reproducibility of LLM Output

As noted above, to ensure consistent outputs, we utilise low temperature values of 0.2 and 0.1 for GPT and OA respectively. To evaluate the reproducibility of the models' output, we execute the basic prompt setting of the Complaint dataset five times for each language model. Our observations reveal that OA consistently generates identical outputs, whereas GPT achieves approximately 99% similarity in its outputs. Note that we consistently run OA on our own servers with the identical hardware described in Section 4.5.

5 Results

The experimental results are shown in Table 3 and Table 4. Next we discuss them in relation to each of our three research questions.

(RQ 1) What level of zero-shot performance can LLMs achieve on social media classification tasks? How does zero-shot LLM performance compare against smaller state-of-the-art language models fine-tuned on the specific analysis task?

In general, LLMs (GPT and OA) with zero-shot settings are able to achieve better results than the simple supervised Logistic Regression model.
Synonyms | GPT (Accuracy / F1-macro) | OA (Accuracy / F1-macro)
Task 1
Complaint / not Complaint 87.8 86.4 80.1 79.9
Grievance / not Grievance 87.3 85.7 82.3 81.9
Criticism / not Criticism 80.4 77.9 76.7 76.4
Dissatisfaction / no Dissatisfaction 84.6 83.9 66.7 66.7
Discontent / no Discontent 80.7 80.0 55.2 54.2
Ensemble Majority 84.8 83.5 76.1 76.0
Ensemble All Agreed 86.8 85.1 84.5 83.8
Task 2
Pro Vaccine / Anti Vaccine / Neutral 72.4 73.6 64.2 63.7
In Favour of the Vaccine / Against the Vaccine / Neutral 73.5 74.2 64.4 63.9
Positive Sentiment / Negative Sentiment / Neutral 70.8 70.8 58.9 52.5
Belief in vaccine / not Belief in Vaccine / Neutral 74.4 75.2 61.9 59.5
Positive Attitude to Vaccine / Negative Attitude / Neutral 72.3 72.3 63.7 61.3
Ensemble Majority 74.7 75.4 64.2 63.5
Task 3
Bragging / not Bragging 84.3 66.2 82.8 62.6
Boasting / not Boasting 82.7 65.2 78.4 60.9
Showing off / not Showing off 78.8 62.9 88.4 56.3
Self-aggrandizing / not Self-aggrandizing 81.1 62.0 88.1 60.1
Excessively Proud / not Excessively Proud 75.2 58.0 77.9 58.1
Ensemble Majority 83.4 65.4 86.0 63.9
Ensemble All Agreed 84.9 64.4 88.1 59.8
Task 4
Support / Deny / Query / Comment 51.5 33.3 46.1 27.9
Backing / Dismiss / Questioning / Comment 40.4 30.2 52.1 43.8
Support / Dismiss / Questioning / Comment 39.7 30.4 55.4 39.3
Ensemble 41.7 30.6 55.5 39.4
Task 5
Sarcasm / not Sarcasm 62.9 59.7 64.4 54.8
Ironic / not Ironic 74.9 67.2 63.9 54.7
Insincere / Sincere 73.8 64.8 68.2 42.7
Disingenous / Genuine 77.8 61.9 56.8 49.3
Satire / not Satire 76.9 62.8 75.2 53.1
Ensemble Majority 74.9 65.7 70.5 53.9
Ensemble All Agreed 80.1 58.9 76.9 51.2
Task 6
Offensive / Non-offensive 70.4 69.1 69.8 68.2
Toxic / not Toxic 64.1 63.5 70.7 67.8
Abusive / not Abusive 72.2 69.3 64.8 64.2
Hateful / not Hateful 73.9 71.2 75.6 72.5
Derogatory / not Derogatory 68.2 66.8 58.1 58.1
Ensemble Majority 71.4 69.7 73.6 71.1
Ensemble All Agreed 75.1 71.6 75.0 70.6

Table 4: LLMs zero-shot classification results using synonyms across all tasks. Green highlights are the original
class names. Light grey highlighted cells denote where synonyms prompt settings beat the original label. Bold text
denotes the best result per model per task.

However, the traditional smaller fine-tuned language model (BERT-large) still outperforms the two LLMs on the majority of the tasks (4 out of 6). Furthermore, we observe that GPT consistently outperforms OA across all prompt settings and tasks when considering only the F1-macro measure. However, our results show that the accuracy of OA is better than that of GPT on some imbalanced datasets, such as Bragging and Sarcasm. This may be due to OA defaulting to the neutral class (labels without any specific speech act, such as 'Not Bragging' and 'Not Sarcastic').

GPT achieves the best predictive performance on two speech act detection downstream tasks, namely Complaint (89.7 accuracy and 88.7 F1-macro) and Sarcasm (62.1 F1-macro). This suggests that LLMs can be employed as strong baseline models for zero-shot classification tasks.

With respect to prompts, when the results of T/L Desc and Memory Recall are compared against Basic Instruction, it is observed that using a more complex prompt (e.g., adding label and paper information) does not necessarily improve model performance and may even introduce additional noise, leading to a degradation in performance.

For speech act detection tasks such as Complaint and Bragging, the accuracy of LLMs exceeds 85%, indicating that LLMs can potentially be used for data annotation as a way to reduce human annotation costs. Standard data annotation tasks typically rely on at least two annotators in the first round, so one of them could be replaced by an LLM. According to the annotation details of the vaccine stance task (Poddar et al., 2022; https://github.com/sohampoddar26/covid-vax-stance/tree/main/dataset), the agreement rate between the two annotators is approximately 62%.
prompt strategies for social media classification To better understand the limitations of LLMs, we
tasks in a zero-shot setting? conduct an error analysis focusing on shared er-
Table 3 compares different prompt complexity, rors across all synonym settings following (Ziems
and shows that the simple prompt strategy works et al., 2023). We manually check these wrong pre-
reasonably well. For GPT, adding task and label dictions and observe that some unanimous errors
descriptions typically achieves better results, i.e. (Ziems et al., 2023) (i.e., when the model agreed on
these prompts achieved the best results on 4 out an incorrect answer using different synonyms) are
of 6 datasets as compared to other GPT prompt caused by incorrect or controversial ground truth
strategies. On the other hand, OA achieves mixed labels.11
results. On average, for OA, simple prompts out- On the other hand, we observe that OA often
perform complex counterparts. This may happen defaults to the majority category, such as ‘not a
because complex prompts add additional noise to bragging’ and ‘not sarcasm’, which leads to higher
the model. We also note that adding a few exam- accuracy but a lower macro-F1 measure. However,
ples to the prompt actually damages classification considering the high accuracy of LLM zero-shot
performance, for both GPT and OA. We hypothe- classification performance, LLMs can still be uti-
sise that the longer prompt is affecting the model lized as data annotation tools (combined with hu-
interpretation of instructions. man efforts) for NLP downstream tasks in CSS. We
Table 4 shows all zero-shot results when syn- can utilise LLMs for data annotation and also to
onyms are used in prompts for all six datasets. We identify incorrect annotations.
observe that revising prompts with synonyms can
substantially improve the zero-shot performance of 7 Related Work
OA, except for the Bragging dataset. It is worth not- Both models evaluated in this work, GPT (also re-
ing that the Sarcasm dataset is the only one where ferred to as ChatGPT) and OA, have been trained
the prompt using the original categories performs using Reinforcement Learning with Human Feed-
worse. This suggests that replacing original labels back (RLHF) in conjunction with instruction tun-
with synonyms allows the OA model to better un- ing, as first explored in Ouyang et al. (2022). In-
derstand the task requirements. The diversity in struction tuning is the fine-tuning of language mod-
the distribution of training examples used in the els on NLP tasks rephrased as instructions and
RLHF fine-tuning for both GPT and OA may ex- prior work has shown that it is an effective way
plain the models behaviour. For example, the OA of training LLMs to perform zero-shot on unseen
model might be fine-tuned on a dataset like: ‘[Text tasks. (Wei et al., 2021; Sanh et al., 2021) Longpre
including offensive language] + [Category: Abu- et al. (2023) carried out a detailed ablation study
sive]’. Therefore, we believe that it is important on non-RLHF instruction tuning methods across
to test similar words in place of the original labels the general NLP tasks in the Flan 2022 collection
10 11
https://github.com/sohampoddar26/ We summarize the number of wrong predictions from the
covid-vax-stance/tree/main/dataset synonyms experiments on GPT in Appendix A, Table 6.
and found that T5 instruction tuned on Flan performed surprisingly well on held-out tasks when compared to models directly fine-tuned on said tasks. Tuning with human feedback could be the next step in improving instruction tuning in this area.

Ziems et al. (2023) set a roadmap for employing LLMs as data annotators by establishing prompting best practices and an evaluation of the zero-shot performance of 13 language models on 24 tasks in the computational social sciences. Our work is distinct from this research as we evaluate LLM performance on a different set of benchmarks and models and experiment with different prompt modification strategies, such as using synonyms for class labels and adding arXiv paper titles.

To evaluate the zero-shot performance of ChatGPT for text classification, Kuzman et al. (2023) compare against a fine-tuned XLM-RoBERTa model for the task of automatic genre classification in English and Slovenian. They show that ChatGPT outperforms the baseline on unseen datasets and that there is no drop in performance when provided with Slovenian examples.

In their study, Ganesan et al. (2023) use Facebook posts to classify user personality traits, based on openness, conscientiousness, extroversion, agreeableness, and neuroticism. They find that GPT-3 performs poorly on binary and worse yet on tertiary ranking for each trait.

LLMs have also been applied in mental health applications. Lamichhane (2023) evaluates ChatGPT's ability to classify stress, depression, and suicidal inclination from Reddit posts. Although ChatGPT significantly outperforms their baseline, the baseline consisted of a simple prediction of the majority class.

For toxicity detection, Wang and Chang (2022) analyse GPT-3's generative and discriminative zero-shot capabilities, finding that performance is only slightly better than a random baseline. However, the authors argue that the generative task allows for a nuanced distinction of toxicity in the somewhat subjective binary setting.

Törnberg (2023) finds that ChatGPT-4 outperforms non-expert annotators in identifying the political affiliation of Democratic or Republican party members based on their tweets during the 2020 US election. Wu et al. (2023) use ChatGPT to rank the conservatism of representatives in the 116th US Congress through a series of pairwise match-ups, showing a high correlation with DW-NOMINATE scores.

As LLMs improve their performance on language generation tasks, the risk of misinformation and propaganda increases. Mitchell et al. (2023) propose DetectGPT, a perturbation-based zero-shot method for identifying machine-generated passages. Su et al. (2023) further develop this approach with DetectLLM-LRR and -NPR, achieving improved efficiency and improved performance respectively.

Since our focus is primarily on out-of-the-box performance, we experiment with simple alterations of the prompts. Other research, e.g. Arora et al. (2022), has considered prompt aggregation as well as using LLMs to auto-generate prompts. We also do not explore advanced methods such as chain-of-thought prompting, which improves LM performance by encouraging it to output its intermediate reasoning steps (Wei et al., 2022; Suzgun et al., 2022).

8 Conclusion

This paper explored a number of prompting strategies for the application of Large Language Models (LLMs) in computational social science tasks. It presented a range of controlled experiments that establish the efficacy of different prompt strategies on six publicly available datasets. Our main findings are summarised as follows:

• Task-specific fine-tuned models generally tend to outperform LLMs in zero-shot settings.

• More detailed and complex prompts (e.g., adding the arXiv paper title or few-shot samples) do not necessarily enhance classification performance.

• The selection of specific words or phrases as the class label can considerably affect classification outcomes.

We therefore argue that developing prompts for zero-shot classification presents a significant challenge and recommend testing different prompt configurations before proceeding with experiments, while keeping in mind the time constraints (see https://platform.openai.com/docs/guides/rate-limits/overview) and financial costs associated with LLMs (see Table 7 in Appendix A).
Limitations

In this paper, we assess the zero-shot classification performance on six downstream tasks in Computational Social Science. We acknowledge that further experiments on other fine-grained tasks would be beneficial in future work.

We also tried to explore potential data leakage issues (Ziems et al., 2023) by testing various prompts to verify whether our test sets have been exposed to GPT and OA. However, due to the black-box nature of the training datasets of these two LLMs, we are unable to confirm the presence of data leakage.

Ethics Statement

Our work has received ethical approval from the Ethics Committee of our university and complies with the research policies of Twitter. All datasets are obtained through the links provided in the respective research papers or by requesting them directly from the authors. Furthermore, we can confirm that the data has been fully anonymised before being fed to the LLMs for model inference.
References

Simran Arora, Avanika Narayan, Mayee F Chen, Laurel J Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. 2022. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Liviu-Adrian Cotfas, Camelia Delcea, Ioan Roxin, Corina Ioanăş, Dana Simona Gherai, and Federico Tajariol. 2021. The longest month: Analyzing COVID-19 vaccination opinions dynamics from tweets in the month following the first vaccine announcement. IEEE Access, 9:33203–33223.

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 69–76, Vancouver, Canada. Association for Computational Linguistics.

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 2021. 8-bit optimizers via block-wise quantization. arXiv preprint arXiv:2110.02861.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Ibrahim Abu Farha, Silviu Vlad Oprea, Steven Wilson, and Walid Magdy. 2022. SemEval-2022 task 6: iSarcasmEval, intended sarcasm detection in English and Arabic. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 802–814.

Adithya V Ganesan, Yash Kumar Lal, August Håkan Nilsson, and H Andrew Schwartz. 2023. Systematic evaluation of GPT-3 for zero-shot personality estimation. arXiv preprint arXiv:2306.01183.

Mali Jin, Daniel Preoţiuc-Pietro, A Doğruöz, and Nikolaos Aletras. 2022. Automatic identification and classification of bragging in social media. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3945–3959.

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. 2023. OpenAssistant Conversations – democratizing large language model alignment. arXiv preprint arXiv:2304.07327.

Taja Kuzman, Nikola Ljubešić, and Igor Mozetič. 2023. ChatGPT: Beginning of an end of manual annotation? Use case of automatic genre identification. arXiv preprint arXiv:2303.03953.

Bishal Lamichhane. 2023. Evaluation of ChatGPT for NLP-based mental health applications. arXiv preprint arXiv:2303.15727.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. 2023. The Flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688.

Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. 2023. DetectGPT: Zero-shot machine-generated text detection using probability curvature. arXiv preprint arXiv:2301.11305.

Yida Mu, Mali Jin, Charlie Grimshaw, Carolina Scarton, Kalina Bontcheva, and Xingyi Song. 2023. VaxxHesitancy: A dataset for studying hesitancy towards COVID-19 vaccination on Twitter. In Proceedings of the International AAAI Conference on Web and Social Media, volume 17, pages 1052–1062.

Elite Olshtain and Liora Weinbach. 1987. Complaints: A study of speech act behavior among native and non-native speakers of Hebrew. In The Pragmatic Perspective, page 195. John Benjamins.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Soham Poddar, Mainack Mondal, Janardan Misra, Niloy Ganguly, and Saptarshi Ghosh. 2022. Winds of change: Impact of COVID-19 on vaccine-related opinions of Twitter users. In Proceedings of the International AAAI Conference on Web and Social Media, volume 16, pages 782–793.

Daniel Preoţiuc-Pietro, Mihaela Gaman, and Nikolaos Aletras. 2019. Automatically identifying complaints in social media. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5008–5019.

Michael V Reiss. 2023. Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark. arXiv preprint arXiv:2304.11085.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

Jinyan Su, Terry Yue Zhuo, Di Wang, and Preslav Nakov. 2023. DetectLLM: Leveraging log rank information for zero-shot detection of machine-generated text. arXiv preprint arXiv:2306.05540.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. 2022. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.

Petter Törnberg. 2023. ChatGPT-4 outperforms experts and crowd workers in annotating political Twitter messages with zero-shot learning. arXiv preprint arXiv:2304.06588.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Yau-Shian Wang and Yingshan Chang. 2022. Toxicity detection with generative prompt-based inference. arXiv preprint arXiv:2205.12390.

Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online, pages 78–84.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California. Association for Computational Linguistics.

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.

Patrick Y Wu, Joshua A Tucker, Jonathan Nagler, and Solomon Messing. 2023. Large language models can be used to estimate the ideologies of politicians in a zero-shot learning setting. arXiv preprint arXiv:2303.12057.

Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. 2023. Can large language models transform computational social science? arXiv preprint arXiv:2305.03514.

Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. 2018. Detection and resolution of rumours in social media: A survey. ACM Computing Surveys (CSUR), 51(2):1–36.

A Appendix
Prompt (Rumour Stance): Do you know this paper: SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours?

Prompt (Sarcasm): Do you know this paper: Semeval-2022 task 6: isarcasmeval, intended sarcasm detection in English and Arabic?

GPT (Rumour Stance): Yes, I'm familiar with the paper titled SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. This paper is related to the SemEval-2017 workshop, specifically Task 8, which focused on rumour detection and veracity classification on social media platforms. Participants were provided with a dataset containing rumours along with associated tweets that either supported or denied the rumours. The task involved classifying the veracity of the rumours as true, false, or unverified, and also identifying the tweets that provided support or denial for the rumours.

OA (Rumour Stance): Yes, I am aware of the paper titled SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. This paper presents the results of the RumourEval task held as part of the 2017 edition of the SEMEVAL workshop. The task involved developing algorithms for determining the veracity of rumors and identifying sources that supported or refuted them using Twitter data.

GPT (Sarcasm): I apologise for the confusion, but as an AI language model, my training only goes up until September 2021, and I do not have access to specific details about SemEval-2022 or any subsequent tasks. The last SemEval workshop I'm aware of is SemEval-2021.

OA (Sarcasm): Semeval-2022 Task 6: ISARCASMEVAL is a subtask focusing specifically on the identification of ironic and sarcastic expressions in English and Arabic text. Irony and sarcasm are common forms of expression found in both written and spoken communication, but they pose significant challenges for automatic sentiment analysis due to their subtle nature and dependence on contextual cues.

Table 5: Asking the LLMs whether they can recall the original papers.

Dataset | Test Set Size | Average # of Errors | # of Unanimous Errors


Complaint 690 89 43
Vaxx Stance 559 145 82
Bragging 1,340 201 160
Rumor Stance 1,114 557 475
Sarcasm 974 194 58
Hate Speech 3,380 845 302

Table 6: We conduct further error analysis on the model outputs across all datasets. # of Unanimous Error denotes
cases in which the LLM unanimously agrees on an incorrect answer while using different synonyms.

Task | Round 1 (Tokens / USD) | Round 2 (Tokens / USD)
Rumour Stance 35k / 51 <0.1 82k / 119 0.2
Vaccine Stance 31k / 127 <0.1 86k / 45 0.2
Complaint 23k / 33 <0.1 62k / 91 0.1
Bragging 52k / 76 0.1 96k / 140 0.2
Hate Speech 62k / 90 0.1 94k / 137 0.2
Sarcasm 28k / 41 <0.1 50k / 86 0.1

Table 7: The cost of running GPT-3.5 for each task.


Task Prompt with Task Description
You need to annotate a response into one of four rumour stance categories: support,
deny, query, or comment. Support: the author of the response supports the veracity
of the rumour. Deny: the author of the response denies the veracity of the rumour.
Rumour Stance
Query: the author of the response asks for additional evidence in relation to the
veracity of the rumour. Comment: the author of the response makes their own
comment without a clear contribution to assessing the veracity of the rumour.
You need to annotate a tweet into one of three stance categories: pro vaccine,
anti vaccine, or neutral. Tweets that have been assigned to the class pro vaccine
express a positive opinion regarding the vaccination. Tweets belonging to the
Vaccine Stance anti vaccine class express a negative opinion towards COVID-19
vaccination. The neutral class mainly includes news related to the development
of vaccines, tweets that do not express a clear opinion, such as questions
regarding the vaccine, informative tweets concerning vaccination.
Complaining is a basic speech act used to express a negative mismatch between
Complaint reality and expectations towards a state of affairs, product, organization or event.
Key to the definition of complaints is the expression of the breach of expectations.
You need to identify whether or not a tweet includes a bragging statement.
Only reply yes or no. Bragging is a speech act which explicitly or implicitly
attributes credit to the speaker for some ‘good’ (possession, accomplishment,
Bragging skill, etc.) which is positively valued by the speaker and the potential audience.
As such, bragging includes announcements of accomplishments, and explicit positive
evaluations of some aspect of self. A bragging statement should clearly express
what the author is bragging about (i.e. the target of bragging).
Annotate whether a tweet is Sarcasm or non-Sarcasm. Sarcasm is a form of
verbal irony that occurs when there is a discrepancy between the literal and
Sarcasm intended meanings of an utterance. Through this discrepancy, the speaker
expresses their position towards a prior proposition, often in the form of
surface contempt or derogation.
Annotate a tweet into one of three categories: racism, sexism, or non-offensive.
A tweet is offensive if it: 1. uses a sexist or racial slur. 2. attacks a minority.
3. seeks to silence a minority. 4. criticizes a minority (without a well founded
argument). 5. promotes, but does not directly use, hate speech or violent crime.
6. criticizes a minority and uses a straw man argument. 7. blatantly misrepresents
Hate Speech
truth or seeks to distort views on a minority with unfounded claims. 8. shows support
of problematic hash tags. E.g. “#BanIslam”, “#whoriental”, “#whitegenocide”.
9. negatively stereotypes a minority. 10. defends xenophobia or sexism. 11. contains a
screen name that is offensive, as per the previous criteria, the tweet is ambiguous
(at best), and the tweet is on a topic that satisfies any of the above criteria.

Table 8: Task descriptions used for the prompting strategy ‘Task and Label Description (T/L Desc)’.
