
Received 27 November 2023, accepted 10 December 2023, date of publication 4 January 2024, date of current version 16 January 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3349952

A Survey of Text Classification With Transformers: How Wide? How Large? How Long? How Accurate? How Expensive? How Safe?

JOHN FIELDS 1,2, (Graduate Student Member, IEEE), KEVIN CHOVANEC 2, AND PRAVEEN MADIRAJU 2

1 Business Analytics, Concordia University Wisconsin-Ann Arbor, Mequon, WI 53097, USA
2 Department of Computer Science, Marquette University, Milwaukee, WI 53233, USA

Corresponding author: John Fields ([email protected])

This work was supported by the Northwestern Mutual Data Science Institute and NSF under Award 1950826.

ABSTRACT Text classification in natural language processing (NLP) is evolving rapidly, particularly
with the surge in transformer-based models, including large language models (LLMs). This paper presents
an in-depth survey of text classification techniques across diverse benchmarks, addressing applications
from sentiment analysis to chatbot-driven question-answering. Methodologically, it utilizes NLP-facilitated
approaches such as co-citation and bibliographic coupling alongside traditional research techniques. Because
new use cases continue to emerge in this dynamic field, the study proposes an expanded taxonomy of text
classification applications, extending the focus beyond unimodal (text-only) inputs to explore the emerging
field of multimodal classification. While offering a comprehensive review of text classification with LLMs,
this review highlights novel questions that arise when approaching the task with transformers: It evaluates
the use of multimodal data, including text, numeric, and columnar data, and discusses the evolution of
text input lengths (tokens) for long text classification; it covers the historical development of transformer-
based models, emphasizing recent advancements in LLMs; it evaluates model accuracy on 358 datasets
across 20 applications, with results challenging the assumption that LLMs are universally superior, revealing
unexpected findings related to accuracy, cost, and safety; and it explores issues related to cost and access as
models become increasingly expensive. Finally, the survey discusses new social and ethical implications
raised when using LLMs for text classification, including bias and copyright. Throughout, the review
emphasizes the importance of a nuanced understanding of model performance and a holistic approach to
deploying transformer-based models in real-world applications.

INDEX TERMS NLP, text classification, transformers, survey.

I. INTRODUCTION

In the past five years, large language models have revolutionized natural language processing (NLP), achieving state-of-the-art results across several classic NLP tasks. One of these tasks, text classification, is a diverse and growing set of aims in academia and industry related to categorizing and organizing text. In text classification, the goal is to assign some label, category, or tag to a body of text (sentence, paragraph, document). Traditionally, text classification, like classification tasks more generally, can be divided into three types:
• Binary classification: classifying texts into one of two mutually exclusive categories (for example, Spam or Not Spam)
• Multiclass classification: dividing texts into one of three or more mutually exclusive categories (for example, classifying a text's genre)

The associate editor coordinating the review of this manuscript and approving it for publication was Maria Chiara Caschera.
2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
6518 VOLUME 12, 2024

• Multilabel classification: labeling texts with three or more potentially overlapping labels, in which each text can receive multiple labels (such as offensive comment labeling, in which a comment might be marked for both violence and hate speech)

However, as automated text classification has expanded, common aims and data sources have reappeared often. For example, researchers often work with social media, surveys, scraped web data, emails, and user reviews or comments, and they often attempt similar kinds of classification: sentiment analysis, news classification, topic labeling, emotion detection, offensive language labeling, etc.

Text data offers rich information, but it has historically been difficult and expensive to process. LLMs, especially open-source LLMs such as BERT, offer generalized models that can greatly facilitate processing this data. An enormous amount of research over the last half decade has thus been invested in deploying, fine-tuning, and adapting LLMs to text classification tasks. In this literature survey, we aim to provide an overview of the various ways LLMs have been deployed for text classification, along with a new taxonomy of the common subtypes and an explanation of the central methods that have appeared in the research.

Our paper builds on two recent, related literature surveys. Reference [1] published a survey of deep learning text classification methods in 2021, covering research through 2020. In the same year, [2] published a detailed account of pre-trained language models used in NLP tasks. We extend the work of these earlier papers, contributing to research in three primary ways:
1) First, we fill in the last two years, which, in this rapidly developing field, have seen several new models, approaches, and benchmarks. Indeed, the cultural impact of ChatGPT spurred rapid development in the field, and many of the current benchmark models, such as GPT-4, Llama 2, and Titan, were not yet released when these surveys were published.
2) Next, our survey focuses specifically on text classification and LLMs, which allows us to offer a comprehensive overview of the work done in this area. Reference [1], for example, offers only a relatively short section on LLMs, and [2] devote their survey to all uses of LLMs, only briefly discussing classification. Our focus allows us to explore how LLMs perform across several subtasks within text classification. Moreover, [2] offer an excellent technical overview of LLMs, covering model architecture and fine-tuning in detail, while our survey leans more toward the novel practical and ethical questions that researchers have addressed when deploying LLMs for text classification.
3) Finally, based on the articles we survey, we propose an expansion of existing taxonomies of text classification. While still rooted in traditional tasks such as sentiment analysis, news classification, or topic labeling, recent approaches, propelled by the success of LLMs, have seen increasingly granular subtasks that appear often enough in the literature to merit space in the taxonomy, such as fine-grained sentiment analysis or intent recognition. As multimodal approaches are perhaps the most exciting subfield of text classification, this survey also goes beyond previous surveys to include classification methods that pair text data with other kinds of data.

By large language model, we refer to the pre-trained, transformer-based architectures that have been widely adopted since the seminal work on attention by [3]. These language models, beginning with GPT [4] and BERT [5] in 2018, employ a neural network using a parallel multihead attention mechanism, with an encoder-only, decoder-only, or encoder-decoder structure, to transform tokens into embeddings, or word vectors, with billions or even trillions of parameters, which can then be used in downstream tasks. While many of the most famous gains from these LLMs have been in classic question-answer problems, they have also been adopted widely for text classification. As Figure 1 shows, LLMs have also recently exploded after the popularity of ChatGPT (GPT-3.5/4), offering new opportunities for integrating these recently released models into research on text classification.

Our survey is organized around the central questions raised by the use of transformer-based LLMs for text classification. Many of these relate to the training or fine-tuning of the model itself, while others relate to social and ethical questions. In the next section, we present our methodology, which blends traditional survey methods with an approach using new NLP tools. Then, in the third and central section of the work, we offer our novel taxonomy of text classification. The rest of the survey's body is then organized around questions uniquely related to text classification with LLMs. First, we consider the type of data used in the model, or the 'width' of the model; next, we explore questions related to the size of the model (in 'How Large'); then, in 'How Long,' we consider the length of the documents being classified; 'How Accurate' summarizes benchmarking across text classification subtasks; and 'How Safe' considers the ethical and legal issues related to text classification with LLMs.

Finally, we offer a conclusion that summarizes our findings and suggests directions for future research in this rapidly evolving field.

II. METHODOLOGY

We follow the guidelines for comprehensive literature reviews proposed by [6]. We first formulated the research questions above, and then systematically searched for primary studies through keyword searches focusing on text classification and large language models, including all LLMs listed in Table 3: for example, we searched ''BERT'' and ''Text Classification''; ''Llama'' and ''Text Classification''; ''GPT'' and ''Text Classification,'' etc. Using IEEE Xplore and the ACM database, this returned 277 results, after removing duplicates. Initially, Google Scholar was used as a third database, but the results were almost all either redundant


FIGURE 1. Timeline of major transformer-based large language model introductions.
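The models in Figure 1's timeline all build on the attention mechanism introduced in [3]. As a point of reference, the core operation can be sketched in a few lines of numpy; this is an illustrative single-head, scaled dot-product self-attention only, not the implementation of any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
    return weights @ V                  # weighted mixture of value vectors

# Toy example: 4 tokens, embedding dimension 8; self-attention sets Q = K = V.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

In the multihead variant used by the models above, this operation is run in parallel over several learned projections of the input and the results are concatenated.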

TABLE 1. Current taxonomy of text classification.

or outside the scope of our survey, and we therefore narrowed the search to include only sources from the IEEE and ACM databases.
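The keyword protocol above can be sketched as follows. This is a hedged illustration: the model list shown is a small subset (the survey paired every LLM in Table 3 with the task keyword), and the records and deduplication key are hypothetical stand-ins for the database exports:

```python
# Illustrative subset of model names; the survey used every LLM in Table 3.
models = ["BERT", "GPT", "Llama"]
task = "Text Classification"

# One quoted query per model, as run against IEEE Xplore and the
# ACM Digital Library.
queries = [f'"{m}" AND "{task}"' for m in models]

def deduplicate(records):
    """Drop duplicate hits returned by overlapping queries, keyed on a
    normalized title (a simplification of real record matching)."""
    seen, unique = set(), []
    for record in records:
        key = record["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Toy records standing in for database results (titles are hypothetical).
hits = [{"title": "Survey A"}, {"title": "survey a "}, {"title": "Survey B"}]
print(queries[0])              # "BERT" AND "Text Classification"
print(len(deduplicate(hits)))  # 2
```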
We then reviewed the primary studies for their quality and
relevance to our topic, and extracted information about the
methods, area, and focus of the work. We excluded works not
primarily focused on text classification and works not using a
transformer-based, pre-trained large language model. We also
excluded works published before 2020, since previous
surveys cover deep learning techniques through 2020. After
removing articles using our exclusion criteria, we found
231 articles relevant to the topic. However, many of these
papers overlapped significantly in focus and methodology,
as the last three years have seen a proliferation of work using
LLMs for text classification; to avoid redundancy, we chose
the most representative and most cited works from each
category.
Finally, we supplemented this traditional survey method-
ology in two ways. First, most simply, we used backward
snowballing to identify frequently cited papers missed in
our original search. We also integrated new NLP tools to
enhance our search, namely, Connected Papers, a visual
networking tool that explores interrelations between research
papers [6]. Connected Papers uses similarity metrics (rather
than citations alone) to link papers together, and for any paper searched, it returns a few dozen of the most frequently cited similar papers. As in other studies that have employed this tool for the purposes of a literature review, we chose seed articles from our initial search for each section in our paper and fed these articles into Connected Papers. An example of the Connected Papers graph and prior/derivative works for [1] can be accessed at [7].

III. TAXONOMY OF TEXT CLASSIFICATION

Classification philosophies have existed for thousands of years to provide structure and order to the world around us [8]. Only in the 1950s and 1960s did we begin to utilize computers to aid in the classification of text-based documents, which was a time-consuming and laborious manual task [9]. These early systems also relied on handcrafted, rule-based approaches that had limited effectiveness and scalability. Developments from the 1960s to the 1990s progressed slowly during the ''AI winter,'' and significant progress was not made until the 1990s, with new machine learning techniques and faster computers [10]. More recently, transformer-based algorithms were applied to classification problems in the late 2010s with great success, and this expanded the potential use cases for text classification, as shown in Table 1.

The taxonomy in Table 1 has been sufficient for the early period of transformer-based text classification, but there is a need for improved test datasets and an expanded taxonomy of applications [14]. With the advent of ChatGPT and other LLMs, text classification capabilities and applications have expanded, and we propose evolving the above taxonomy as shown in Table 2 [15].

Some additional classification applications that are emerging are fine-grained sentiment analysis, aspect-based sentiment analysis, offensive language detection, intent recognition, document classifiers, fake news detection, cross-lingual classification, stance detection, emotion/mental health detection, malicious software detection, cause-and-effect classifiers, sentence classification, multilabel and multimodal classification, and other applications. Some of these new areas, like emotion/mental health detection, involve more complex social and ethical issues that will be described more fully in Section IX. However, as we develop systems with more ''human-like'' capabilities, we should consider how these fit


TABLE 2. Additional taxonomy categories for text classification.

into the taxonomy and propose rules and guidelines for how we implement these safely, or not at all.

IV. HOW WIDE?

In addition to the applications described in Section III, we also consider ''How wide?'' in reference to the different data types and methods used for analysis. Text classification tasks with transformers can be categorized as unimodal or multimodal.

A. UNIMODAL

Unimodal text classification uses only textual information and applies the transformer model to make a classification prediction. Many survey papers have focused on traditional NLP methods and newer transformer-based methods applied to text classification [1], [13], [27], [28]. Since the launch of ChatGPT in late 2022, there has been a significant increase in research on and investment in LLMs. The primary application for these models is typically question-answer chatbots, explored more in Section VII. Table 3 below provides a summary of the major releases of LLMs that can perform unimodal text classification.

TABLE 3. Large language model examples and release dates.

The primary differentiator for these LLMs is the size of the training data, and this aspect of LLMs is covered in detail in Section V. The technical differences between encoder-only/encoder-decoder (BERT-method) and decoder-only (GPT-method) architectures are covered in detail in the papers by [12] and [13].

These models have also been fine-tuned for domain-specific unimodal text classification. One example is the BloombergGPT model, which is trained on financial text data for internal use by the company. This is an area for future research to determine whether the investment in proprietary, domain-specific LLMs could be a competitive advantage for the companies who make these investments.

Multimethod approaches for unimodal text classification are also achieving best-in-class results for some applications such as aspect-based sentiment analysis. The combination of graph database methods with transformer-based text classification is another emerging area in NLP. Table 4 below shows two recent uses of this technique.

TABLE 4. Multimethod examples in aspect-based sentiment analysis.

B. MULTIMODAL

Multimodal classification uses text, video, signal, image, audio, and columnar data for classification. A taxonomy for multimodal machine learning proposed by [43] included Representation, Translation, Alignment, Fusion, and Co-learning.

This is an emerging area of investment by large technology companies, as evidenced by Elon Musk's goal of creating


TABLE 5. Estimated data types in businesses and other organizations.

an ''everything app'' from his new company X.ai [44] and Google's refocusing of its AI teams on ''. . . a series of powerful, multimodal AI models'' [45]. Some multimodal models, such as UniMSE [46] for sentiment analysis and SEMI-FND [47] for Fake News Detection, are already emerging as best-in-class, as shown in Table 10 below.

As we consider multimodal classification, prior research has primarily focused on the use of image data and text [43]. We hypothesize that the primary types of data in most companies and organizations are text and numeric/columnar data. We were unable to find academic research on the breakdown of data types most common in a business context, so we utilized ChatGPT [48] and Bard [49] to provide the following estimates.

Note: ChatGPT would not provide the sources for the estimates in Table 5. Bard provided the following sources:
• The Data & Analytics Association (D&AA) 2022 State of Data & Analytics report
• Gartner's 2022 Magic Quadrant for Data Integration Tools
• IDC's 2022 Worldwide DataSphere Market Forecast
• Statista's 2022 statistics on corporate data storage

If these estimates are accurate, it would indicate that research on multimodal classification is too focused on video/voice and not enough on numeric or tabular data. For example, the paper by [50] includes a summary of 46 papers on multimodal classification with deep learning. However, only one of these papers ([51]) focuses on the use of text and tabular data, which is commonly found in a variety of business applications such as health care and risk classification [52]. The more recent 2023 paper by [53] includes a similar list of 31 multimodal datasets with a mix of images, text, video, and audio, but none with numeric/columnar data.

There are several recent surveys by [54], [55], [56], and [57] on the use of deep learning with columnar data, but these focus on categorical/numeric data and not text. Since there is a gap in the research on multimodal text and columnar data applications, we will provide additional detail on this important but overlooked area of text classification. One example of a solution using text and tabular data is the Multimodal Toolkit package for Python that was developed by Georgian.io [58]. As mentioned previously, the focus on text and video may be of interest to academics due to the availability of datasets, but corporations and organizations have access to text and columnar data with information that could provide valuable insights.

However, solutions like Multimodal Toolkit have so far shown only limited gains from using transformers on text and numeric data [50]. Since this is only one example, we hope this paper will highlight the need to look at the emerging research on using transformers for columnar data and the opportunities to synthesize this research with the extensive effort and money that technology companies are focusing on multimodal research, as described above. Future studies may delve into optimizing model architectures, data fusion strategies, and addressing ethical considerations to unlock the full capabilities of multimodal text classification in domains ranging from media analysis to content recommendation systems.

V. HOW LARGE?

''Our results strongly suggest that larger models will continue to perform better, and will also be much more sample efficient than has been previously appreciated'' [59].

The use of transformers for text classification began with BERT and GPT in 2018. Data scientists with an interest in NLP began to see the types of exponential improvements that the field of computer vision experienced from neural networks in 2012 [60]. The success of ''small'' transformers like BERT, with 110-340 million parameters,1 and RoBERTa, with 123-354 million parameters, established many new benchmarks in text classification since 2018.

The subsequent creation of HuggingFace gave data scientists a platform for utilizing these transformer-based models [61]. The launch of Google Colab and other cloud-based solutions then provided simpler and more affordable access to the Graphical Processing Units (GPUs) that run these compute-intensive applications [62].

The general public did not have extensive direct exposure to transformer-based NLP until the release of ChatGPT in late 2022. This application set a new record by reaching 100 million users two months after launch [63]. The ensuing press, investment, and hype over ChatGPT has increased interest in question-answer-based NLP applications. However, the use of this technology for other text classification tasks has not received the same attention, though it should benefit from the increased focus. This paper will also survey the recent developments and explain how these can be leveraged to improve text classification using transformers. Table 6 below adds a size element to Table 3 to show the rapid increase in the scale of these LLMs.

Although the size of the various LLMs continues to grow, there are several other strategic factors that should be considered when choosing the best option for text classification. Some of the considerations include:
• Amount of training data.
• Privacy and security.
• Complexity and uniqueness of the task.
• Scalability.
• Speed.
• Compute resources available.

1 Parameters are the weights and biases that are adjusted during training and used to make predictions.
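The ''compute resources available'' consideration above can be made concrete with a back-of-the-envelope estimate: the memory needed just to hold a model's weights scales with parameter count times bytes per parameter. The parameter counts below are illustrative round numbers, and the estimate deliberately ignores activations, optimizer state, and framework overhead:

```python
def inference_memory_gb(n_params, bytes_per_param=2):
    """Rough weight-storage estimate in GB: parameters x bytes per
    parameter (fp32 = 4 bytes, fp16 = 2 bytes)."""
    return n_params * bytes_per_param / 1024**3

# Illustrative, order-of-magnitude parameter counts only.
models = {
    "BERT-base (110M)": 110e6,
    "BERT-large (340M)": 340e6,
    "7B-class LLM": 7e9,
    "175B-class LLM": 175e9,
}

for name, n in models.items():
    fp32 = inference_memory_gb(n, 4)  # full precision
    fp16 = inference_memory_gb(n, 2)  # half precision
    print(f"{name}: ~{fp32:.1f} GB fp32, ~{fp16:.1f} GB fp16")
```

Even this lower bound shows why a BERT-sized model fits on a commodity GPU while the largest LLMs require clusters of accelerators, which is the accessibility gap discussed in this section.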


TABLE 6. Size of selected large language models released since ChatGPT (Nov 2022).

As the amount of data available on the internet could eventually limit the growth of LLMs, these other factors may become more important as we continue to improve these models [64]. Additional research efforts, like the Pythia suite, are providing new tools to analyze LLMs and address this issue [65]. Other recent survey papers, such as [66], seek to address the issue of how to apply more efficient transformer methods to NLP tasks. Approaches are grouped together by sparse attention, factorized attention, and architectural change. However, [66] concludes that there are ''. . . no simple and universal guidelines regarding the current Transformer alternatives.''

The cost to develop LLMs is another limiting factor, where only large technology companies and research institutions have the resources to afford investments in the millions of US dollars [67]. The ability to predict the gains in performance could be beneficial to better understand the value of these investments. However, the reality is that the ''democratization'' of this technology will not occur in the near future until Graphical Processing Units (GPUs) are more accessible and affordable [68].

VI. HOW LONG?

Another factor to consider for text classification is the length of the input text. The original BERT model had a limit of 512 tokens, or ≈ 400 words. Extensions of BERT such as Big Bird [69] and Longformer [70] extended the token limit to 4096 to handle longer text such as essays. Most of the new LLMs have implemented token limits of ≤ 2048, as shown in Table 7. However, GPT-4 is the exception, with a limit of 8,192 tokens. This new capability is already being tested in text classification challenges in health care [71] and will likely expand to many other domains. Additional research into hierarchical attention [72] and long-document summarization [73] techniques is emerging, as there are many practical applications that would benefit from going beyond words and sentences to long documents.

One of the sensational headlines and subsequent journal article titles was ''GPT-4 Passes the Bar Exam'' [74]. One major factor attributed to the improved results from GPT-3 to GPT-4 was the increased context window, allowing for more accurate processing of long sequences of text.

TABLE 7. Token length of selected large language models released since ChatGPT (Nov 2022).

Another extension of long-text capability is in multimodal applications. The significant research in multimodal applications should also lead to new longer-document capabilities for this application. One of the authors of this paper recently modified the Multimodal Transformers package to add the capability to utilize Longformer for text classification with text up to 4096 tokens alongside numeric/categorical data [75]. This resulted in a 0.9% increase in the F1 score on the Women's E-Commerce Clothing Review dataset compared to using only 512 tokens. The authors plan an additional paper to publish these results.

VII. HOW ACCURATE?

By utilizing attention mechanisms and self-attention layers, transformers have demonstrated state-of-the-art accuracy across a variety of text classification tasks. Pre-trained transformers like BERT and GPT have become the standard approach for many general text classification tasks such as sentiment analysis, document classification, and named entity recognition. Transfer learning with transformers has also pushed the boundaries of accuracy by using pre-trained models that can be fine-tuned on smaller datasets with excellent results.

To understand the accuracy of different models on text classification tasks, a review of Papers With Code [76] provided a summary of the best models from 358 datasets used in 20 different NLP classification tasks (as shown in Tables 1 and 2). Papers With Code was chosen since it is the most complete source to date on benchmarks in NLP and other machine learning tasks, although it should be noted that this source is not comprehensive, as described by [77]. The methodology to determine whether a model is ''transformer-based'' was to review the abstracts and search for the keywords ''transformer'', ''attention'', ''BERT'', or ''GPT''. If a model used attention or was ''transformer-like'', it was classified as a transformer-based model in Table 10 below. Overall, transformer-based models account for ≈ 68% of the ''best models''.

To simplify the resulting analysis, any dataset that had only one best model in its application category was not included in the results shown in Table 10. This reduced the number of datasets from 358 to 151 and the number of applications from 20 to 15. One unexpected observation is that the new LLMs are almost exclusively the top models in the


TABLE 8. Current research on the efficiency of large language models. use, including issues related to privacy and security, fairness
and equity, and copyright.
A primary concern of research into the security of
large language models has been memorization, or the
model’s ability to repeat content from training verbatim [84].
Reference [85] have shown memorization becomes more
frequent as the size of the model increases, or with a higher
number of duplicates, and they predict that memorization will
become more prevalent as models continue to expand.
As a technical concern, memorization may also lead to
downstream data contamination, or when the pretrained cor-
pus contains some of the material from the test set. Multiple
studies have shown this to be the case in major training
datasets, meaning LLM performance on benchmarking tasks
may be misleading [31], [86]. Reference [87] attempt to
measure the impact of this contamination on classification
tasks.
Arguably more problematically, memorization can also
question-answer application. For the other 14 applications, open these models to adversarial attacks. Reference [88] have
many of the best models are transformer-based but the results shown the use of web-scraped data has led to mining private
also include non-transformer models like XGBoost and Spark information from LLMs, with duplication in the training
NLP. This could be because the new LLM’s work best on data making the model more vulnerable. Reference [84]
question-answer tasks or because they are so new that tests demonstrated GPT-2’s vulnerability to adversarial attack,
have not been conducted on a wide range of applications. noting that public, personal information, including names,
Table 10 can be used as a guide for selecting models based on phone numbers, and email addresses can be extracted
accuracy, but there are no one-size-fits-all models. The best verbatim from the language model, and also finding that
choice depends on the specific characteristics of your dataset, larger models were more vulnerable than smaller models.
the problem at hand, and the resources available to you. Reference [89] similarly found LLMs risked leaking personal
information, though they distinguish between memorization
and association, arguing that models do not pose significant
VIII. HOW EXPENSIVE?
The expense of transformers used for text classification varies depending on several factors, such as hardware, software, cost of personnel, and the type of model. Cloud computing cost estimates for early transformer models like BERT and GPT-2 ranged from $2,074 to $43,008 [78]. The cost for newer LLMs is an open research question. When OpenAI's ChatGPT was asked the cost of GPT-3, the response was "... it is widely believed that the training cost for GPT-3 is in the range of tens of millions of dollars." Google's Bard was asked a similar question and responded, "The exact cost to train Bard is not publicly known, but it is estimated to be in the millions of dollars."

The cost to train these LLMs, and the associated economic and environmental concerns, have emerged as significant issues in recent years. However, most of the research to alleviate these issues has focused on the efficiency of the models, as illustrated in Table 8.

Creative solutions to the economic and environmental issues remain an open challenge for organizations and researchers. It is our hope that more attention will be focused on how to improve access and sustainability.

IX. HOW SAFE?
The increasing popularity of LLMs across several NLP tasks has also raised novel questions about the safety and ethics of

risk because of their low association, since private information will only be leaked randomly rather than extracted. As these models are frequently used in email, text, and code auto-completion, they also risk revealing personal data when fine-tuned for sensitive tasks. Reference [90] explored how fine-tuning a model impacts the risks to privacy, observing that fine-tuning the head of the model leaves it most susceptible to extraction attacks.

On the other hand, [91] attempted an adversarial attack on BERT to mine patient names and conditions from clinical notes used in pre-training and found that the model did not meaningfully associate names and conditions.

Even if the risk of specific data leakage is low, several scholars have begun calling for privacy-preserving LLMs, either by adapting the training to ensure privacy or by removing sensitive information from the training text [92], [93], [94], [95], [96], [97].

The memorization of demographic features also raises concerns over potential bias in downstream tasks. Other scholars, interested in fairness and equity rather than the privacy and security of these LLMs, have explored the extent to which LLMs have learned protected demographic features in training [98], [99], [100] and, relatedly, how this learned bias might impact downstream classification tasks. Perhaps most research in this area has been devoted to inequities in how the language models handle gender.
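Such analyses are often operationalized as a probing task: encode each text, then test whether a lightweight classifier can predict a protected attribute from the embedding at above-chance accuracy; this is the logistic-regression-on-BERT-outputs setup used by [105]. The sketch below is illustrative only: the synthetic vectors, dimensions, and leakage strength are all invented stand-ins for real encoder outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sentence embeddings (in the real probe, X would be
# BERT outputs for each text and y a protected attribute of the author).
rng = np.random.default_rng(0)
n, dim = 2000, 32
y = rng.integers(0, 2, size=n)        # hypothetical binary protected attribute
X = rng.normal(size=(n, dim))
X[:, 0] += 0.8 * y                    # the attribute "leaks" into one direction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)

# Accuracy well above 50% indicates the embeddings encode the attribute.
print(f"probe accuracy: {acc:.2f}")
```

With a real model, X would come from pooled encoder states (e.g., the [CLS] token or mean-pooled token embeddings); probe accuracy near chance suggests the attribute is not linearly recoverable from the representation.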
6524 VOLUME 12, 2024
J. Fields et al.: Survey of Text Classification With Transformers
TABLE 9. Transparency of selected large language models released since ChatGPT (Nov 2022).

For example, [100] found gender bias impacted classification results when working with medical text. Similarly, several scholars have shown that large language models tend to classify text written by women as more emotional [98], [101], [102], [103]. In addition to gender, other scholarship notes bias when dealing with race [104]. Reference [105] recently attempted to quantify the extent to which BERT has incorporated demographic information, using BERT outputs in a logistic regression model to predict sensitive features.

Because of these known biases embedded in many pre-trained language models, scholars have also explored ways of reducing or mitigating this bias, generally during fine-tuning. Reference [102], for example, developed a method for identifying and removing semantic features which contained sensitive information. Reference [100] incorporated a loss term during training to minimize bias learned during fine-tuning. Others attempted to modify the data used for fine-tuning [106], [107], [108], removing traits that indicate gender [109], [110] or race [103]. A third approach uses ensemble methods [104], and, most recently, [105] used active sampling of protected-attribute-uninformative data, which was then used to fine-tune the model, also reducing downstream inequalities.

This bias stems in part from the kind of data used when building the language model [109], [111], [112], and research has recently highlighted the fact that almost all training data for these large language models remains closed, which is another security concern. When training data is known, as [86] have shown with the C4 dataset, the filtering applied often disproportionately removes texts from and about minoritized groups. Reference [113] demonstrate that GPT-2 was more likely to detect negative sentiment in texts written in African American English; [114], investigating toxic content produced by LLMs, argue that the cause of this toxic content can be found in the training datasets, exploring two corpora used to train LLMs, including GPT-2, both of which contain toxic content. Reference [115] note that these learned biases also impact downstream Question Answering tasks.

As language models have grown larger and more profitable, they have also increasingly become proprietary, making potential issues in the data regarding personal privacy or general equity more difficult to explore. Many of even the 'open' large language models, shown in Table 9, have not released information about their training. Reference [116] recently showed the lack of openness and transparency in ChatGPT, Llama, and other large language models, providing a structure for discussing degrees of openness and noting, for example, that even if a model can be used, many crucial aspects of it remain closed, such as the training data, the processes for instruction tuning, and the code used to train the model.

Because full details on training are not available, some have also accused these models of incorporating copyrighted materials, potentially violating intellectual property laws. Reference [117], for example, have very recently shown that both BERT and GPT-4 know a wide range of copyrighted material. Further, [118] have argued that training data must be made open to assess sources of bias, thus dovetailing the concerns over fairness and equity with concerns over intellectual property.

Relating to the broader concerns over bias and fairness, significant research has gone into incorporating eXplainable Artificial Intelligence (XAI) methods into LLMs, especially when they are used in downstream classification tasks.

For example, there have been various approaches to modifying or highlighting BERT's architecture to make outputs explainable. ExBERT offers a dashboard that overviews the model's attention and internal representations [119]. Reference [120] similarly developed VisBERT, which tracks tokens as they are processed by BERT: hidden states are extracted from each transformer block and Principal Component Analysis is applied to map tokens to a 2D space where distance represents semantic similarity. Transformers Interpret, a Python package, uses Integrated Gradients to determine and visualize the significance of words in any task done with pre-trained language models [121].

Another approach is to use post hoc, model-agnostic explainability methods in classification. Reference [122] use LIME and Anchors to explain "fake news" classifications made with BERT, highlighting the words with the highest contribution to the classification result. Reference [123] apply "explanations-by-example" to a BERT-based model, using the twin-systems approach (that is, pairing a white-box model, Case-Based Reasoning (CBR), with the black-box BERT model).

Finally, several researchers have incorporated XAI into the text classification task even when the main focus of their work was not on explainability. These approaches have been deployed in almost every category of the novel taxonomy we propose above. For example, [124] apply GradientSHAP to a multimodal model using BERT for emotion classification. In news classification, [125] have added LIME to a BERT-based classifier detecting misinformation about COVID-19, which shows users how the decision was reached and which data sources were used to make the classification, extracting sentences from relevant
TABLE 10. Best models by application on 151 datasets with Count of Best Model ≥ 2 for text classification tasks from Papers with Code (24 Jul 2023).
news articles to explain the classification. Similar approaches have been employed in the medical field [126] and in the analysis of online reviews [127].

X. CONCLUSION AND FURTHER RESEARCH
In conclusion, this paper has provided a unique exploration into the use of text classification with transformer-based models. The transformative impact of transformers on NLP tasks, particularly text classification, is evident through their ability to capture complex contextual relationships and semantic nuances. Throughout this study, we have examined the questions of How wide? How large? How long? How accurate? How expensive? and How safe? as they apply to this new technology.

The comparative analysis conducted on various transformer models across diverse datasets underscores their remarkable performance and versatility across many, but not all, text classification tasks. The current hype related to the use of LLMs for question-answering chatbots is understandable, as this capability moves AI closer to human-level performance on this task. However, this is only 1 of 20 applications reviewed in this paper, and similar gains are likely possible in other areas.

We anticipate that the substantial volume of research, investment, and significant financial commitments directed towards multimodal applications will enhance proficiency across various domains of text classification.

However, challenges such as computational requirements, model size, and potential biases present in pre-trained representations call for continued research and innovation. Efforts to address these challenges, alongside emerging techniques for model interpretability and explainability, pave the way for more responsible and ethically sound applications of transformer-based text classification.

As researchers and practitioners alike, we must strive to harness the full potential of transformers while staying vigilant to ethical considerations and the broader societal impact of our work. The insights gained from this paper serve as a foundation for future advancements in the field, guiding us towards a deeper understanding of transformer-based text classification and its implications for the ever-evolving landscape of natural language processing.

In the future, potential areas for further research related to transformer-based text classification include:
• Multimodal classification using text and columnar/numeric data.
• Multiclass classification applications.
• Hierarchical classification for multilabel and topic modeling.
• Changes over time (drift).
• Cost and access issues.
• Legal issues related to copyrighted data used in training.

REFERENCES
[1] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao, "Deep learning-based text classification: A comprehensive review," ACM Comput. Surv., vol. 54, no. 3, pp. 1–40, Apr. 2021.
[2] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth, "Recent advances in natural language processing via large pre-trained language models: A survey," ACM Comput. Surv., vol. 56, no. 2, pp. 1–40, Sep. 2023.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 5999–6009.
[4] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving Language Understanding by Generative Pre-Training. Accessed: Jun. 29, 2023. [Online]. Available: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[6] S. Keele, "Guidelines for performing systematic literature reviews in software engineering," Ver. 2.3, EBSE, Tech. Rep., 2007, vol. 5.
[7] Connected Papers. Accessed: Sep. 18, 2023. [Online]. Available: https://www.connectedpapers.com
[8] P. Studtmann, "Aristotle's categories," in The Stanford Encyclopedia of Philosophy, E. N. Zalta, Ed. Stanford, CA, USA: Stanford Univ.-Metaphysics Research Lab, 2021.
[9] M. E. Maron, "Automatic indexing: An experimental inquiry," J. ACM, vol. 8, no. 3, pp. 404–417, Jul. 1961.
[10] T. Poibeau, "The 1966 ALPAC report and its consequences," in Machine Translation. Cambridge, MA, USA: MIT Press, 2017, pp. 75–89.
[11] T. Wolf et al., "Transformers: State-of-the-art natural language processing," in Proc. Conf. Empirical Methods Natural Lang. Process., Syst. Demonstration, 2020, pp. 38–45.
[12] Q. Li, H. Peng, J. Li, C. Xia, R. Yang, L. Sun, P. S. Yu, and L. He, "A survey on text classification: From shallow to deep learning," 2020, arXiv:2008.00364.
[13] A. Gasparetto, M. Marcuzzo, A. Zangari, and A. Albarelli, "A survey on text classification algorithms: From text to predictions," Information, vol. 13, no. 2, p. 83, Feb. 2022.
[14] D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts, and A. Williams, "Dynabench: Rethinking benchmarking in NLP," 2021, arXiv:2104.14337.
[15] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, Z. Wu, L. Zhao, D. Zhu, X. Li, N. Qiang, D. Shen, T. Liu, and B. Ge, "Summary of ChatGPT-related research and perspective towards the future of large language models," 2023, arXiv:2304.01852.
[16] M. Munikar, S. Shakya, and A. Shrestha, "Fine-grained sentiment classification using BERT," 2019, arXiv:1910.03474.
[17] H. H. Do, P. Prasad, A. Maag, and A. Alsadoon, "Deep learning for aspect-based sentiment analysis: A comparative review," Expert Syst. Appl., vol. 118, pp. 272–299, Mar. 2019.
[18] A. Elmadany, C. Zhang, M. Abdul-Mageed, and A. Hashemi, "Leveraging affective bidirectional transformers for offensive language detection," 2020, arXiv:2006.01266.
[19] J. Zhang, K. Hashimoto, Y. Wan, Z. Liu, Y. Liu, C. Xiong, and P. S. Yu, "Are pretrained transformers robust in intent classification? A missing ingredient in evaluation of out-of-scope intent detection," 2021, arXiv:2106.04564.
[20] M. Yasunaga, J. Leskovec, and P. Liang, "LinkBERT: Pretraining language models with document links," 2022, arXiv:2203.15827.
[21] H. Jwa, D. Oh, K. Park, J. Kang, and H. Lim, "ExBAKE: Automatic fake news detection model based on bidirectional encoder representations from transformers (BERT)," Appl. Sci., vol. 9, no. 19, p. 4062, Sep. 2019.
[22] C. Wang and M. Banko, "Practical transformer-based multilingual text classification," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., Ind. Papers, 2021, pp. 121–129.
[23] B. Zhang, D. Ding, and L. Jing, "How would stance detection techniques evolve after the launch of ChatGPT?" 2022, arXiv:2212.14548.
[24] C. M. Greco, A. Simeri, A. Tagarelli, and E. Zumpano, "Transformer-based language models for mental health issues: A survey," Pattern Recognit. Lett., vol. 167, pp. 204–211, Mar. 2023.
[25] A. Cohan, I. Beltagy, D. King, B. Dalvi, and D. S. Weld, "Pretrained language models for sequential sentence classification," 2019, arXiv:1909.04054.
[26] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, and I. S. Dhillon, "Taming pretrained transformers for extreme multi-label text classification," in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, New York, NY, USA, Aug. 2020, pp. 3163–3171.
[27] K. Kowsari, K. J. Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, "Text classification algorithms: A survey," Information, vol. 10, no. 4, p. 150, Apr. 2019.
[28] Q. Xipeng, S. TianXiang, X. Yige, S. Yunfan, D. Ning, and H. Xuanjing, "Pre-trained models for natural language processing: A survey," Sci. China Technol. Sci., vol. 63, pp. 1872–1897, Sep. 2020.
[29] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[30] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 5754–5764.
[31] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, and A. Askell, "Language models are few-shot learners," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 1877–1901.
[32] Introducing LLaMA: A Foundational, 65-Billion-Parameter Language Model. Accessed: Jul. 5, 2023. [Online]. Available: https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
[33] S. Pichai. (Feb. 2023). An Important Next Step on Our AI Journey. Accessed: Jul. 5, 2023. [Online]. Available: https://blog.google/technology/ai/bard-google-ai-search-updates/
[34] OpenAI, "GPT-4 technical report," 2023, arXiv:2303.08774.
[35] Introducing BloombergGPT. (Mar. 2023). Bloomberg's 50-Billion Parameter Large Language Model, Purpose-Built From Scratch for Finance. Accessed: Jul. 5, 2023. [Online]. Available: https://www.bloomberg.com/company/press/bloomberggpt-50-billion-parameter-llm-tuned-finance/
[36] (Apr. 2023). Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM. Accessed: Jul. 5, 2023. [Online]. Available: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
[37] StableLM: Stability AI Language Models. Github. Accessed: Jul. 5, 2023. [Online]. Available: https://github.com/Stability-AI/StableLM
[38] Amazon Titan. Accessed: Jul. 5, 2023. [Online]. Available: https://aws.amazon.com/bedrock/titan/
[39] Bing Chat. Accessed: Jul. 5, 2023. [Online]. Available: https://www.microsoft.com/en-us/edge/features/bing-chat?form=MT00D8
[40] Llama 2. Accessed: Jul. 5, 2023. [Online]. Available: https://ai.meta.com/llama/
[41] L. Xu, X. Pang, J. Wu, M. Cai, and J. Peng, "Learn from structural scope: Improving aspect-level sentiment analysis with hybrid graph convolutional networks," Neurocomputing, vol. 518, pp. 373–383, Jan. 2023.
[42] O. Wallaart and F. Frasincar, A Hybrid Approach for Aspect-Based Sentiment Analysis Using a Lexicalized Domain Ontology and Attentional Neural Models. New York, NY, USA: Springer, 2019.
[43] T. Baltrusaitis, C. Ahuja, and L.-P. Morency, "Multimodal machine learning: A survey and taxonomy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019.
[44] B. Jin and D. Seetharaman, "Elon Musk creates new artificial intelligence company X.AI," Wall St. J., Apr. 2023. [Online]. Available: https://www.wsj.com/articles/elon-musks-new-artificialintelligence-business-x-ai-incorporates-in-nevada-962c7c2f
[45] S. Pichai. (Apr. 2023). Google DeepMind: Bringing Together Two World-Class AI Teams. Accessed: Jul. 6, 2023. [Online]. Available: https://blog.google/technology/ai/april-ai-update/
[46] G. Hu, T.-E. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, "UniMSE: Towards unified multimodal sentiment analysis and emotion recognition," 2022, arXiv:2211.11256.
[47] P. Singh, R. Srivastava, K. P. S. Rana, and V. Kumar, "SEMI-FND: Stacked ensemble based multimodal inference for faster fake news detection," 2022, arXiv:2205.08159.
[48] ChatGPT. Accessed: Jul. 31, 2023. [Online]. Available: https://chat.openai.com
[49] Try Bard, an AI Experiment by Google. Accessed: Jul. 31, 2023. [Online]. Available: https://bard.google.com
[50] W. C. Sleeman IV, R. Kapoor, and P. Ghosh, "Multimodal classification: Current landscape, taxonomy and future directions," 2021, arXiv:2109.09020.
[51] K. Xu, M. Lam, J. Pang, X. Gao, C. Band, P. Mathur, F. Papay, A. K. Khanna, J. B. Cywinski, K. Maheshwari, P. Xie, and E. P. Xing, "Multimodal machine learning for automated ICD coding," in Proc. 4th Mach. Learn. Healthcare Conf., vol. 106, Maastricht, The Netherlands, 2019, pp. 197–215.
[52] N. Holtz and J. M. Gomez, "Multimodal transformer for risk classification: Analyzing the impact of different data modalities," in Natural Language Processing and Machine Learning. Zürich, Switzerland: Academy and Industry Research Collaboration Center (AIRCC), May 2023.
[53] X. Wang, G. Chen, G. Qian, P. Gao, X.-Y. Wei, Y. Wang, Y. Tian, and W. Gao, "Large-scale multi-modal pre-trained models: A comprehensive survey," Mach. Intell. Res., vol. 20, no. 4, pp. 447–482, Jun. 2023.
[54] G. Badaro, M. Saeed, and P. Papotti, "Transformers for tabular data representation: A survey of models and applications," Trans. Assoc. Comput. Linguistics, vol. 11, pp. 227–249, Mar. 2023.
[55] V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci, "Deep neural networks and tabular data: A survey," IEEE Trans. Neural Netw. Learn. Syst., doi: 10.1109/TNNLS.2022.3229161.
[56] Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, "Revisiting deep learning models for tabular data," in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 18932–18943.
[57] R. Shwartz-Ziv and A. Armon, "Tabular data: Deep learning is not all you need," Inf. Fusion, vol. 81, pp. 84–90, May 2022.
[58] K. Gu and A. Budhkar, "A package for learning on tabular and text data with transformers," in Proc. 3rd Workshop Multimodal Artif. Intell., 2021, pp. 69–73.
[59] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," 2020, arXiv:2001.08361.
[60] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[61] S. M. Jain, Introduction to Transformers for NLP: With the Hugging Face Library and Models to Solve Problems. New York, NY, USA: Apress, Oct. 2022.
[62] E. Bisong, "Google colaboratory," in Building Machine Learning and Deep Learning Models on Google Cloud Platform. Berkeley, CA, USA: Apress, 2019, pp. 59–64.
[63] D. Milmo, "ChatGPT reaches 100 million users two months after launch," Guardian, Feb. 2023. [Online]. Available: https://www.theguardian.com/technology/2023/feb/02/chatgpt-100-million-users-open-ai-fastest-growing-app
[64] N. Muennighoff, A. M. Rush, B. Barak, T. Le Scao, A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel, "Scaling data-constrained language models," 2023, arXiv:2305.16264.
[65] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. Aflah Khan, S. Purohit, U. Sai Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal, "Pythia: A suite for analyzing large language models across training and scaling," 2023, arXiv:2304.01373.
[66] Q. Fournier, G. M. Caron, and D. Aloise, "A practical survey on faster and lighter transformers," ACM Comput. Surv., vol. 55, no. 14s, pp. 1–40, Dec. 2023.
[67] C. Li. (Jun. 2020). OpenAI's GPT-3 Language Model: A Technical Overview. Accessed: Jul. 17, 2023. [Online]. Available: https://lambdalabs.com/blog/demystifying-gpt-3
[68] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, B. Yin, and X. Hu, "Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond," 2023, arXiv:2304.13712.
[69] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed, "Big bird: Transformers for longer sequences," 2020, arXiv:2007.14062.
[70] I. Beltagy, M. E. Peters, and A. Cohan, "Longformer: The long-document transformer," 2020, arXiv:2004.05150.
[71] Z. Liu, Y. Huang, X. Yu, L. Zhang, Z. Wu, C. Cao, H. Dai, L. Zhao, Y. Li, P. Shu, F. Zeng, L. Sun, W. Liu, D. Shen, Q. Li, T. Liu, D. Zhu, and X. Li, "DeID-GPT: Zero-shot medical text de-identification by GPT-4," 2023, arXiv:2303.11032.
[72] X. Zhang, F. Wei, and M. Zhou, "HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization," 2019, arXiv:1905.06566.
[73] X. Dai, I. Chalkidis, S. Darkner, and D. Elliott, "Revisiting transformer-based models for long document classification," 2022, arXiv:2204.06683.
[74] D. M. Katz, M. J. Bommarito, S. Gao, and P. Arredondo, "GPT-4 passes the bar exam," 2023, doi: 10.2139/ssrn.4389233.
[75] Multimodal-Toolkit: Multimodal Model for Text and Tabular Data With Huggingface Transformers as Building Block for Text Data. Github. Accessed: Jul. 13, 2023. [Online]. Available: https://github.com/georgian-io/Multimodal-Toolkit
[76] Papers With Code—The Latest in Machine Learning. Accessed: Jul. 27, 2023. [Online]. Available: https://paperswithcode.com
[77] F. Martínez-Plumed, P. Barredo, S. Ó. H. Éigeartaigh, and J. Hernández-Orallo, "Research community dynamics behind popular AI benchmarks," Nature Mach. Intell., vol. 3, no. 7, pp. 581–589, May 2021.
[78] E. Strubell, A. Ganesh, and A. McCallum, "Energy and policy considerations for deep learning in NLP," 2019, arXiv:1906.02243.
[79] K. Subramanyam Kalyan, A. Rajasekharan, and S. Sangeetha, "AMMUS: A survey of transformer-based pretrained models in natural language processing," 2021, arXiv:2108.05542.
[80] J. W. Rae et al., "Scaling language models: Methods, analysis & insights from training Gopher," 2021, arXiv:2112.11446.
[81] M. Artetxe et al., "Efficient large scale language modeling with mixtures of experts," 2021, arXiv:2112.10684.
[82] J. Hoffmann et al., "Training compute-optimal large language models," 2022, arXiv:2203.15556.
[83] S. Borgeaud et al., "Improving language models by retrieving from trillions of tokens," in Proc. 39th Int. Conf. Mach. Learn., vol. 162, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., Maastricht, The Netherlands, 2022, pp. 2206–2240.
[84] N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel, "Extracting training data from large language models," in Proc. 30th USENIX Secur. Symp., 2021, pp. 2633–2650.
[85] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, "Quantifying memorization across neural language models," 2022, arXiv:2202.07646.
[86] J. Dodge, M. Sap, A. Marasovic, W. Agnew, G. Ilharco, D. Groeneveld, M. Mitchell, and M. Gardner, "Documenting large webtext corpora: A case study on the colossal clean crawled corpus," 2021, arXiv:2104.08758.
[87] I. Magar and R. Schwartz, "Data contamination: From memorization to exploitation," 2022, arXiv:2203.08242.
[88] N. Kandpal, E. Wallace, and C. Raffel, "Deduplicating training data mitigates privacy risks in language models," in Proc. 39th Int. Conf. Mach. Learn., vol. 162, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., Maastricht, The Netherlands, 2022, pp. 10697–10707.
[89] J. Huang, H. Shao, and K. C.-C. Chang, "Are large pre-trained language models leaking your personal information?" 2022, arXiv:2205.12628.
[90] F. Mireshghallah, A. Uniyal, T. Wang, D. Evans, and T. Berg-Kirkpatrick, "An empirical analysis of memorization in fine-tuned autoregressive language models," in Proc. Conf. Empirical Methods Natural Lang. Process., Abu Dhabi, United Arab Emirates, 2022, pp. 1816–1826.
[91] E. Lehman, S. Jain, K. Pichotta, Y. Goldberg, and B. C. Wallace, "Does BERT pretrained on clinical notes reveal sensitive data?" 2021, arXiv:2104.07762.
[92] R. Anil, B. Ghazi, V. Gupta, R. Kumar, and P. Manurangsi, "Large-scale differentially private BERT," 2021, arXiv:2108.01624.
[93] X. Li, F. Tramèr, P. Liang, and T. Hashimoto, "Large language models can be strong differentially private learners," 2021, arXiv:2110.05679.
[94] D. Yu, S. Naik, A. Backurs, S. Gopi, H. A. Inan, G. Kamath, J. Kulkarni, Y. T. Lee, A. Manoel, L. Wutschitz, S. Yekhanin, and H. Zhang, "Differentially private fine-tuning of language models," 2021, arXiv:2110.06500.
[95] W. Shi, A. Cui, E. Li, R. Jia, and Z. Yu, "Selective differential privacy for language modeling," 2021, arXiv:2108.12944.
[96] S. Hoory, A. Feder, A. Tendler, A. Cohen, S. Erell, I. Laish, H. Nakhost, U. Stemmer, A. Benjamini, A. Hassidim, and Y. Matias, "Learning and evaluating a differentially private pre-trained language model," in Proc. 3rd Workshop Privacy Natural Lang. Process., Punta Cana, Dominican Republic, 2021, pp. 1178–1189.
[97] H. Brown, K. Lee, F. Mireshghallah, R. Shokri, and F. Tramèr, "What does it mean for a language model to preserve privacy?" in Proc. ACM Conf. Fairness, Accountability, Transparency, New York, NY, USA, Jun. 2022, pp. 2280–2292.
[98] X. Jin, F. Barbieri, B. Kennedy, A. M. Davani, L. Neves, and X. Ren, "On transferability of bias mitigation effects in language model fine-tuning," 2020, arXiv:2010.12864.
[99] K. Lu, P. Mardziel, F. Wu, P. Amancharla, and A. Datta, "Gender bias in neural natural language processing," in Logic, Language, and Security, V. Nigam, T. B. Kirigin, C. Talcott, J. Guttman, S. Kuznetsov, B. T. Loo, and M. Okada, Eds. Cham, Switzerland: Springer, 2020, pp. 189–202.
[100] A. Silva, P. Tambwekar, and M. Gombolay, "Towards a comprehensive understanding and accurate evaluation of societal biases in pre-trained transformers," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., 2021, pp. 2383–2389.
[101] S. Touileb, L. Øvrelid, and E. Velldal, "Using gender- and polarity-informed models to investigate bias," in Proc. 3rd Workshop Gender Bias Natural Lang. Process., 2021, pp. 66–74.
[102] R. Bhardwaj, N. Majumder, and S. Poria, "Investigating gender bias in BERT," Cognit. Comput., vol. 13, no. 4, pp. 1008–1018, Jul. 2021.
[103] M. Mozafari, R. Farahbakhsh, and N. Crespi, "Hate speech detection and racial bias mitigation in social media based on BERT model," PLoS ONE, vol. 15, no. 8, Aug. 2020, Art. no. e0237861.
[104] M. Halevy, C. Harris, A. Bruckman, D. Yang, and A. Howard, "Mitigating racial biases in toxic language detection with an equity-based ensemble framework," in Proc. EAAMO, New York, NY, USA, Nov. 2021, pp. 1–11.
[105] L. Sha, Y. Li, D. Gasevic, and G. Chen, "Bigger data or fairer data? Augmenting BERT via active sampling for educational text classification," in Proc. 29th Int. Conf. Comput. Linguistics, Oct. 2022, pp. 1275–1285.
[106] F. Prost, N. Thain, and T. Bolukbasi, "Debiasing embeddings for reduced gender bias in text classification," 2019, arXiv:1908.02810.
[107] R. Islam, K. N. Keya, Z. Zeng, S. Pan, and J. Foulds, "Debiasing career recommendations with neural fair collaborative filtering," in Proc. Web Conf., New York, NY, USA, Jun. 2021, pp. 3779–3790.
[108] Y. Pruksachatkun, S. Krishna, J. Dhamala, R. Gupta, and K.-W. Chang, "Does robustness improve fairness? Approaching fairness with word substitution robustness methods for text classification," 2021, arXiv:2106.10826.
[109] D. de Vassimon Manela, D. Errington, T. Fisher, B. van Breugel, and P. Minervini, "Stereotype and skew: Quantifying gender bias in pre-trained and fine-tuned language models," in Proc. 16th Conf. Eur. Chapter Assoc. Comput. Linguistics, Main Volume, 2021, pp. 2232–2242.
[110] J. R. Minot, N. Cheney, M. Maier, D. C. Elbers, C. M. Danforth, and P. S. Dodds, "Interpretable bias mitigation for textual data: Reducing gender bias in patient notes while maintaining classification performance," 2021, arXiv:2103.05841.
[111] L. Lucy and D. Bamman, "Gender and representation bias in GPT-3 generated stories," in Proc. 3rd Workshop Narrative Understand., 2021, pp. 48–55.
[112] M. Nadeem, A. Bethke, and S. Reddy, "StereoSet: Measuring stereotypical bias in pretrained language models," 2020, arXiv:2004.09456.
[113] S. Groenwold, L. Ou, A. Parekh, S. Honnavalli, S. Levy, D. Mirza, and W. Y. Wang, "Investigating African–American vernacular English in transformer-based text generation," 2020, arXiv:2010.02510.
[114] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, "RealToxicityPrompts: Evaluating neural toxic degeneration in language
[125] L. Moe, A. Kundu, and U. T. Nguyen. A BERT-Based Explainable System for COVID-19 Misinformation Identification. Accessed: Oct. 5, 2023. [Online]. Available: https://workshop-proceedings.icwsm.org/pdf/2023_46.pdf
[126] M. T. Rietberg, V. B. Nguyen, J. Geerdink, O. Vijlbrief, and C. Seifert, "Accurate and reliable classification of unstructured reports on their diagnostic goal using BERT models," Diagnostics, vol. 13, no. 7, p. 1251, Mar. 2023.
[127] F. Ahmed, S. Sultana, M. T. Reza, S. K. S. Joy, and Md. G. R. Alam, "Interpretable movie review analysis using machine learning and transformer models leveraging XAI," in Proc. IEEE Asia–Pacific Conf. Comput. Sci. Data Eng. (CSDE), Dec. 2022, pp. 1–6.
[128] K. Scaria, H. Gupta, S. Goyal, S. Arjun Sawant, S. Mishra, and C. Baral, "InstructABSA: Instruction learning for aspect based sentiment analysis," 2023, arXiv:2302.08624.
[129] T. Tang, J. Li, W. Xin Zhao, and J.-R. Wen, "MVP: Multi-task supervised pre-training for natural language generation," 2022, arXiv:2206.12131.
[130] C. Sun, L. Huang, and X. Qiu, "Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence," 2019, arXiv:1903.09588.
[131] G. Nikolentzos, A. J.-P. Tixier, and M. Vazirgiannis, "Message passing attention networks for document understanding," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 5754–5764.
[132] K. Skianis, G. Nikolentzos, S. Limnios, and M. Vazirgiannis, "Rep the set: Neural networks for learning set representations," in Proc. 23rd Int. Conf. Artif. Intell. Statist., vol. 108, S. Chiappa and R. Calandra, Eds., Maastricht, The Netherlands, 2020, pp. 1410–1420.
[133] K. Kowsari, M. Heidarysafa, D. E. Brown, K. J. Meimandi, and L. E. Barnes, "RMDL: Random multimodel deep learning for classification," in Proc. 2nd Int. Conf. Inf. Syst. Data Mining (ICISDM), Apr. 2018, pp. 19–28.
[134] S. Gouws, Y. Bengio, and G. Corrado, "BilBOWA: Fast bilingual distributed representations without word alignments," in Proc. 32nd Int. Conf. Int. Conf. Mach. Learn., vol. 37, 2015, pp. 748–756.
[135] A. Adhikari, A. Ram, R. Tang, and J. Lin, "DocBERT: BERT for document classification," 2019, arXiv:1904.08398.
[136] P. H. L. de Araujo, T. E. de Campos, F. A. Braz, and N. C. da Silva, "VICTOR: A dataset for Brazilian legal documents classification," in
models,’’ 2020, arXiv:2009.11462.
Proc. 12th Lang. Resour. Eval. Conf., Marseille, France, May 2020,
[115] T. Li, T. Khot, D. Khashabi, A. Sabharwal, and V. Srikumar,
pp. 1449–1458.
‘‘UnQovering stereotyping biases via underspecified questions,’’ 2020,
arXiv:2010.02428. [137] X. Huang, B. Chen, L. Xiao, J. Yu, and L. Jing, ‘‘Label-aware
[116] A. Liesenfeld, A. Lopez, and M. Dingemanse, ‘‘Opening up ChatGPT: document representation via hybrid attention for extreme multi-label text
Tracking openness, transparency, and accountability in instruction-tuned classification,’’ Neural Process. Lett., vol. 54, no. 5, pp. 3601–3617,
text generators,’’ in Proc. 5th Int. Conf. Conversational User Interface, Oct. 2022.
New York, NY, USA, Jul. 2023, pp. 1–6. [138] J.-S. Lee and J. Hsiang, ‘‘PatentBERT: Patent classification with fine-
[117] K. K. Chang, M. Cramer, S. Soni, and D. Bamman, ‘‘Speak, mem- tuning a pre-trained BERT model,’’ 2019, arXiv:1906.02124.
ory: An archaeology of books known to ChatGPT/GPT-4,’’ 2023, [139] R.-C. Chang, C.-M. Lai, K.-L. Chang, and C.-H. Lin, ‘‘Dataset of
arXiv:2305.00118. propaganda techniques of the state-sponsored information operation of
[118] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, the people’s republic of China,’’ 2021, arXiv:2106.07544.
H. D. Iii, and K. Crawford, ‘‘Datasheets for datasets,’’ Commun. ACM, [140] A. Pal, M. Selvakumar, and M. Sankarasubbu, ‘‘Multi-label text
vol. 64, no. 12, pp. 86–92, Nov. 2021. classification using attention-based graph neural network,’’ 2020,
[119] B. Hoover, H. Strobelt, and S. Gehrmann, ‘‘ExBERT: A visual analysis arXiv:2003.11644.
tool to explore learned representations in transformers models,’’ 2019, [141] V. Kocaman and D. Talby, ‘‘Biomedical named entity recognition at
arXiv:1910.05276. scale,’’ in Proc. Int. Conf. Pattern Recognit. New York, NY, USA:
[120] B. V. Aken, B. Winter, A. Löser, and F. A. Gers, ‘‘VisBERT: Hidden- Springer, 2021, pp. 635–646.
state visualizations for transformers,’’ in Proc. Companion Web Conf., [142] X. Wang, Y. Jiang, N. Bach, T. Wang, Z. Huang, F. Huang, and K. Tu,
New York, NY, USA, Apr. 2020, pp. 207–211. ‘‘Automated concatenation of embeddings for structured prediction,’’
[121] C. Pierse. (Feb. 2021) Introducing Transformers Interpret—Explainable 2020, arXiv:2010.05006.
AI for Transformers. Accessed: Aug. 24, 2023. [Online]. Avail-
[143] C. Wang, X. Liu, Z. Chen, H. Hong, J. Tang, and D. Song, ‘‘Deep-
able: https://towardsdatascience.com/introducing-transformers-interpret-
Struct: Pretraining of language models for structure prediction,’’ 2022,
explainable-ai-for-transformers-890a403a9470
arXiv:2205.10475.
[122] M. Szczepanski, M. Pawlicki, R. Kozik, and M. Choras, ‘‘New
[144] J. Hu, Y. Shen, Y. Liu, X. Wan, and T.-H. Chang, ‘‘Hero-gang neural
explainability method for BERT-based model in fake news detection,’’
model for named entity recognition,’’ 2022, arXiv:2205.07177.
Sci. Rep., vol. 11, no. 1, p. 23705, Dec. 2021.
[123] E. M. Kenny and M. T. Keane, ‘‘Explaining deep learning using [145] M. Jeong and J. Kang, ‘‘Enhancing label consistency on document-level
examples: Optimal feature weighting methods for twin systems using named entity recognition,’’ 2022, arXiv:2210.12949.
post-hoc, explanation-by-example in XAI,’’ Knowl.-Based Syst., vol. 233, [146] Z. Zhong and D. Chen, ‘‘A frustratingly easy approach for entity and
Dec. 2021, Art. no. 107530. relation extraction,’’ 2020, arXiv:2010.12812.
[124] T. Shaikh, A. Khalane, R. Makwana, and A. Ullah, ‘‘Evaluating [147] T. Shavrina, A. Fenogenova, A. Emelyanov, D. Shevelev, E. Artemova,
significant features in context-aware multimodal emotion recognition V. Malykh, V. Mikhailov, M. Tikhonova, A. Chertok, and A. Evlampiev,
with XAI methods,’’ Authorea Preprints, 2023. [Online]. Available: ‘‘RussianSuperGLUE: A Russian language understanding evaluation
https://doi.org/10.22541/au.167407909.97031004 benchmark,’’ 2020, arXiv:2010.15925.
6530 VOLUME 12, 2024
J. Fields et al.: Survey of Text Classification With Transformers
[148] Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang, ‘‘ERNIE 2.0: A continual pre-training framework for language understanding,’’ in Proc. AAAI, Apr. 2020, vol. 34, no. 5, pp. 8968–8975.
[149] A. Chowdhery et al., ‘‘PaLM: Scaling language modeling with pathways,’’ 2022, arXiv:2204.02311.
[150] Z. Chen, Q. Gao, and L. S. Moss, ‘‘NeuralLog: Natural language inference with joint neural and logical reasoning,’’ 2021, arXiv:2105.14167.
[151] X.-Q. Dao, N.-B. Le, T.-D. Vo, X.-D. Phan, B.-B. Ngo, V.-T. Nguyen, T.-M.-M. Nguyen, and H.-P. Nguyen, ‘‘VNHSGE: Vietnamese high school graduation examination dataset for large language models,’’ 2023, arXiv:2305.12199.
[152] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. Wei Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, ‘‘Finetuned language models are zero-shot learners,’’ 2021, arXiv:2109.01652.
[153] Y. Tan, D. Min, Y. Li, W. Li, N. Hu, Y. Chen, and G. Qi, ‘‘Can ChatGPT replace traditional KBQA models? An in-depth analysis of the question answering performance of the GPT LLM family,’’ 2023, arXiv:2303.07992.
[154] G. Moraes Rosa, L. Bonifacio, V. Jeronymo, H. Abonizio, M. Fadaee, R. Lotufo, and R. Nogueira, ‘‘No parameter left behind: How distillation and model size affect zero-shot retrieval,’’ 2022, arXiv:2206.02873.
[155] E. Taktasheva, T. Shavrina, A. Fenogenova, D. Shevelev, N. Katricheva, M. Tikhonova, A. Akhmetgareeva, O. Zinkevich, A. Bashmakova, S. Iordanskaia, A. Spiridonova, V. Kurenshchikova, E. Artemova, and V. Mikhailov, ‘‘TAPE: Assessing few-shot Russian language understanding,’’ 2022, arXiv:2210.12813.
[156] J. Huang, S. Shane Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han, ‘‘Large language models can self-improve,’’ 2022, arXiv:2210.11610.
[157] R. Anil et al., ‘‘PaLM 2 technical report,’’ 2023, arXiv:2305.10403.
[158] M. Yasunaga, J. Leskovec, and P. Liang, ‘‘LinkBERT: Pretraining language models with document links,’’ in Proc. ICML, 2022, pp. 8003–8016.
[159] H. Cheng, Y. Shen, X. Liu, P. He, W. Chen, and J. Gao, ‘‘UnitedQA: A hybrid approach for open domain question answering,’’ 2021, arXiv:2101.00178.
[160] X. Ma, Z. Zhang, and H. Zhao, ‘‘Enhanced speaker-aware multi-party multi-turn dialogue comprehension,’’ IEEE/ACM Trans. Audio, Speech, Language Process., vol. 31, pp. 2410–2423, 2023.
[161] I. Schlag, T. Munkhdalai, and J. Schmidhuber, ‘‘Learning associative inference using fast weight memory,’’ 2020, arXiv:2011.07831.
[162] G. T. Hudson and N. A. Moubayed, ‘‘MuLD: The multitask long document benchmark,’’ 2022, arXiv:2202.07362.
[163] I. Beltagy, K. Lo, and A. Cohan, ‘‘SciBERT: A pretrained language model for scientific text,’’ 2019, arXiv:1903.10676.
[164] W. Antoun, F. Baly, and H. Hajj, ‘‘AraBERT: Transformer-based model for Arabic language understanding,’’ 2020, arXiv:2003.00104.
[165] Y. Cui, W. Che, T. Liu, B. Qin, and Z. Yang, ‘‘Pre-training with whole word masking for Chinese BERT,’’ IEEE/ACM Trans. Audio, Speech, Language Process., vol. 29, pp. 3504–3514, 2021.
[166] I. Abu Farha and W. Magdy, ‘‘Mazajak: An online Arabic sentiment analyser,’’ in Proc. 4th Arabic Natural Lang. Process. Workshop, Stroudsburg, PA, USA, 2019, pp. 192–198.
[167] I. M. Moosa, M. E. Akhter, and A. B. Habib, ‘‘Does transliteration help multilingual language modeling?’’ 2022, arXiv:2201.12501.
[168] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, ‘‘Unsupervised data augmentation for consistency training,’’ in Proc. NIPS, vol. 33, 2020, pp. 6256–6268.
[169] M. Cliche, ‘‘BB_twtr at SemEval-2017 task 4: Twitter sentiment analysis with CNNs and LSTMs,’’ 2017, arXiv:1704.06125.
[170] D. Araci, ‘‘FinBERT: Financial sentiment analysis with pre-trained language models,’’ 2019, arXiv:1908.10063.
[171] Y. Meng, Y. Zhang, J. Huang, Y. Zhang, C. Zhang, and J. Han, ‘‘Hierarchical topic mining via joint spherical tree and text embedding,’’ in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, New York, NY, USA, Aug. 2020, pp. 1908–1917.
[172] G. Lai, B. Oguz, Y. Yang, and V. Stoyanov, ‘‘Bridging the domain gap in cross-lingual document classification,’’ 2019, arXiv:1909.07009.
[173] H. Soyer, P. Stenetorp, and A. Aizawa, ‘‘Leveraging monolingual data for crosslingual compositional word representations,’’ 2014, arXiv:1412.6334.
[174] J. Martin Eisenschlos, S. Ruder, P. Czapla, M. Kardas, S. Gugger, and J. Howard, ‘‘MultiFiT: Efficient multi-lingual language model fine-tuning,’’ 2019, arXiv:1909.04761.
[175] A. Rashed, M. Kutlu, K. Darwish, T. Elsayed, and C. Bayrak, ‘‘Embeddings-based clustering for target specific stances: The case of a polarized Turkey,’’ in Proc. Int. AAAI Conf. Web Social Media, vol. 15, May 2021, pp. 537–548.

JOHN FIELDS (Graduate Student Member, IEEE) received the B.S. degree in engineering (industrial distribution) from Texas A&M University and the M.S. degree in applied data science from Syracuse University. He is currently pursuing the Ph.D. degree in computer science with Marquette University, Milwaukee, WI, USA.
He is an Assistant Professor of business analytics with Concordia University Wisconsin-Ann Arbor. He is a coauthor of an upcoming paper scheduled for presentation at the IEEE Big Data 2023 Conference on the application of natural language processing (NLP) in education, and a co-inventor on a pending patent (62/935,928) that explores the use of machine learning and AI in higher education. His research interests include the integration of artificial intelligence (AI) in education and its impact on student success, with a specific interest in text classification (including multimodal approaches), graph databases, and AI bias and fairness.
Mr. Fields was a recipient of the 2015 SAP IGgie Award for information governance.

KEVIN CHOVANEC is a data scientist in the Office of Institutional Research at Marquette University, Milwaukee, WI, USA. He received the B.S. degree in math and English from Marquette University, the M.A. degree from the University of Chicago, and the Ph.D. degree in English and comparative literature from the University of North Carolina. He is currently pursuing the Ph.D. degree in computer science at Marquette University. His research takes an interdisciplinary approach, focusing on the digital humanities, editing, natural language processing, fairness and accountability in AI, and educational analytics, and his work has appeared in journals and conference proceedings across academic disciplines, including Digital Humanities Quarterly, Renaissance Drama, and a forthcoming paper in IEEE Big Data 2023.

PRAVEEN MADIRAJU received the Ph.D. degree in computer science from Georgia State University.
He is currently a Professor with the Department of Computer Science, Marquette University, Milwaukee, WI, USA, where he directs the Data Science and Text Analytics Laboratory. The laboratory focuses on solving real-world problems by applying techniques from the broad area of data science and data analytics to both structured and unstructured data, and also conducts research on applying machine learning techniques to analyze textual and social media data. He is also the Graduate Program Chair of the Computer Science Program, Marquette University. He has published over 50 peer-reviewed articles and has organized workshops on middleware systems in conjunction with ACM SAC and IEEE COMPSAC. His research interests include data science, healthcare informatics, text analytics, and databases. He regularly serves on NSF panels.