Automated Text Classification
A Practical Guide
Pablo Barberá 1, Amber E. Boydstun 2, Suzanna Linn 3, Ryan McMahon 4 and Jonathan Nagler 5
1 Associate Professor of Political Science and International Relations, University of Southern California, Los Angeles, CA 90089, USA. Email: [email protected]
5 Professor of Politics and co-Director of the Center for Social Media and Politics, New York University, New York, NY 10012, USA. Email: [email protected]
Abstract
Automated text analysis methods have made possible the classification of large corpora of text by measures
such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions
they need to make before any measure can be produced from the text. We consider, both theoretically
and empirically, the effects of such choices using as a running example efforts to measure the tone of New
York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield
radically different corpora and we advocate for the use of keyword searches rather than predefined subject
categories provided by news archives. We demonstrate the benefits of coding using article segments instead
of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the
number of unique documents coded rather than the number of coders for each document. Finally, we find
that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we
intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to
text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without
attending to the methodological choices therein.
1 Introduction
The analysis of text is central to a large and growing number of research questions in the social
sciences. While analysts have long been interested in the tone and content of such things as media
coverage of the economy (De Boef and Kellstedt 2004; Tetlock 2007; Goidel et al. 2010; Soroka,
Stecula, and Wlezien 2015), congressional bills (Hillard, Purpura, and Wilkerson 2008; Jurka et al.
2013), party platforms (Laver, Benoit, and Garry 2003; Monroe, Colaresi, and Quinn 2008; Grimmer,
Messing, and Westwood 2012), and presidential campaigns (Eshbaugh-Soha 2010), the advent of
automated text classification methods combined with the broad reach of digital text archives
has led to an explosion in the extent and scope of textual analysis. Whereas researchers were
once limited to analyses based on text that was read and hand-coded by humans, machine
coding by dictionaries and supervised machine learning (SML) tools is now the norm (Grimmer
and Stewart 2013). The time and cost of the analysis of text has thus dropped precipitously. But
the use of automated methods for text analysis requires the analyst to make multiple decisions
that are often given little consideration yet have consequences that are neither obvious nor
benign.
Before proceeding to classify documents, the analyst must: (1) select a corpus; and (2) choose
whether to use a dictionary method or a machine learning method to classify each document in
the corpus. If an SML method is selected, the analyst must also: (3) decide how to produce the
training dataset—select the unit of analysis, the number of objects (i.e., documents or units of
text) to code, and the number of coders to assign to each object.1 In each section below, we first
identify the options open to the analyst and present the theoretical trade-offs associated with
each. Second, we offer empirical evidence illustrating the degree to which these decisions matter
for our ability to predict the tone of coverage of the US national economy in the New York Times,
as perceived by human readers. Third, we provide recommendations to the analyst on how to
best evaluate their choices. Throughout, our goal is to provide a guide for analysts facing these
decisions in their own work.2
Some of what we present here may seem self-evident. If one chooses the wrong corpus
to code, for example, it is intuitive that no coding scheme will accurately capture the “truth.”
Yet less obvious decisions can also matter a lot. We show that two reasonable attempts to
select the same population of news stories can produce dramatically different outcomes. In
our running example, using keyword searches produces a larger corpus than using predefined
subject categories (developed by LexisNexis), with a higher proportion of relevant articles.
Since keywords also offer the advantage of transparency over using subject categories, we
conclude that keywords are to be preferred over subject categories. If the analyst will be
using SML to produce a training dataset, we show that it makes surprisingly little difference
whether the analyst codes text at the sentence level or the article-segment level. Thus, we
suggest taking advantage of the efficiency of coding at the segment level. We also show that
maximizing the number of objects coded rather than having fewer objects, each coded by
more coders, provides the most efficient means to optimize the performance of SML methods.
Finally, we demonstrate that SML outperforms dictionary methods on a number of different
criteria, including accuracy and precision, and thus we conclude that the use of SML is to be
preferred, provided the analyst is able to produce a sufficiently high-quality/quantity training
dataset.
Before proceeding to describe the decisions at hand, we make two key assumptions. First, we
assume the analyst’s goal is to produce a measure of tone that accurately represents the tone of
the text as read by humans.3 Second, we assume, on average, the tone of a given text is interpreted
by all people in the same way; in other words, there is a single “true” tone inherent in the text that
has merely to be extracted. Of course, this second assumption is harder to maintain. Yet we rely
on the extensive literature on the concept of the wisdom of the crowds—the idea that aggregating
multiple independent judgments about a particular question can lead to the correct answer,
even if those individual assessments are coming from individuals with low levels of information.4
Thus we proceed with these assumptions in describing below each decision the analyst must
make.
1 Because our discussion of SML and dictionaries necessarily relies on having produced a training dataset, we first discuss
selection of the corpus, then we discuss issues in creating a training dataset, and then we compare the benefits and
trade-offs associated with choosing to use SML versus dictionary methods.
2 The discussion in this paper is limited to coding the tone of text (referred to as “sentiment” in the computer science
literature), rather than other variables such as topics or events. But much of what we present is applicable to the analysis
of text more broadly, both when using a computational approach and even (in the first stage we discuss below) when using
manual content analysis.
3 In other words, we assume that the analyst is less interested in capturing the tone of the text that was intended by its
author, and more interested in capturing how the public, in general, would interpret the tone. This assumption may not
always hold. When studying social media, for example, an analyst might be more interested in the intended tone of a tweet
than the tone as it comes across to a general audience.
4 The “wisdom of the crowds,” first introduced by Condorcet in his jury theorem (1972), is now widely applied in social science
(Surowiecki 2005; Page 2008; Lyon and Pacuit 2013). As Benoit (2016) demonstrated, this logic can also be applied to coding
political text in order to obtain correct inferences about its content.
5 In some cases the corpus may correspond to the entire population, but in our running example, as in any example based
on media sources, they will be distinct.
6 As another example, the Policy Agendas Project (www.comparativeagendas.net) offers an extensive database of items (e.g.,
legislative texts, executive texts) that other scholars have already categorized based on an institutionalized list of policy
topics (e.g., macroeconomics) and subtopics (e.g., interest rates).
7 See King, Lam, and Roberts (2016) for discussion of keyword generation algorithms. Most such algorithms rely on
co-occurrence: if a term co-occurs with “unemployment,” but does not occur too frequently without it, then it is likely to
also be about the economy and should be added to the set of keywords. Note that these methods must start with a human
selection of keywords to seed the algorithm, meaning there is no escaping the need for vigilance in thinking about which
keywords are both relevant and representative.
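To make the co-occurrence logic in footnote 7 concrete, the following is a minimal sketch; the seed list, thresholds, and scoring rule are illustrative assumptions rather than the algorithms surveyed by King, Lam, and Roberts (2016).

```python
from collections import Counter
import re

# A minimal sketch of co-occurrence-based keyword expansion. The seed terms,
# thresholds, and scoring rule are illustrative assumptions only.
def expand_keywords(documents, seeds, min_cooccur=0.3, max_alone=0.1, top_k=20):
    """Suggest terms that often appear with a seed term but rarely appear without one."""
    seeds = {s.lower() for s in seeds}
    with_seed, without_seed = Counter(), Counter()
    n_with = n_without = 0
    for doc in documents:
        tokens = set(re.findall(r"[a-z']+", doc.lower()))
        if tokens & seeds:
            n_with += 1
            with_seed.update(tokens - seeds)
        else:
            n_without += 1
            without_seed.update(tokens)
    candidates = []
    for term, count in with_seed.items():
        rate_with = count / max(n_with, 1)                   # co-occurrence rate with a seed
        rate_alone = without_seed[term] / max(n_without, 1)  # rate of appearing without a seed
        if rate_with >= min_cooccur and rate_alone <= max_alone:
            candidates.append((term, round(rate_with, 2), round(rate_alone, 2)))
    return sorted(candidates, key=lambda x: -x[1])[:top_k]

# Toy usage: suggest additions to a seed list containing "unemployment".
docs = [
    "unemployment rose as layoffs spread across the manufacturing sector",
    "the unemployment rate fell while payrolls expanded",
    "the senate debated a foreign policy resolution",
]
print(expand_keywords(docs, seeds=["unemployment"]))
```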
8 LexisNexis now states that “Due to proprietary reasons, we aren’t allowed to share this information [the list of subject
categories].” Correspondence with authors, May 30, 2019.
9 Note that because the underlying set of archived articles can vary over time, based on the media provider’s contracts with
news outlets and the provider’s internal archiving parameters, even the same keyword search performed at two points in
time may yield maddeningly different results, although the differences should be less than those suffered using proprietary
subject categories (Fan, Geddes, and Flory 2013).
10 A number of analysts have coded the tone of news coverage of the US national economy. The universe of text defined in
this body of work varies widely from headlines or front-page stories in the New York Times (Goidel and Langley 1995; Blood
and Phillips 1997; Wu et al. 2002; Fogarty 2005) to multiple newspapers (Soroka, Stecula, and Wlezien 2015), to as many as
30 newspapers (Doms and Morin 2004). Blood and Phillips (1997) coded the full universe of text while others used subject
categories and/or keyword searches to produce a sample of stories from the population of articles about the economy.
11 Soroka and colleagues generously agreed to share their dataset with us, for which we are deeply grateful, allowing us to
perform many of the comparisons in this article.
12 We used ProQuest because LexisNexis does not have historical coverage for the New York Times earlier than 1980, and we
wanted to base some of our analyses below on a longer time span.
13 Articles were kept if they had a relevance score of 85 or higher, as defined by LexisNexis, for any of the subcategories listed
above. After collecting the data, Soroka et al. manually removed articles not focused solely on the US domestic economy,
irrelevant to the domestic economy, shorter than 100 words, or “just long lists of reported economic figures and indicators,”
(Soroka, Stecula, and Wlezien 2015, 461–462).
14 We obtained articles from two sources: the ProQuest Historical New York Times Archive and the ProQuest Newsstand
Database. Articles in the first database span the 1947–2010 period and are only available in PDF format and thus had to be
converted to plain text using OCR (optical character recognition) software. Articles for the 1980–2014 period are available
in plain text through ProQuest Newsstand. We used machine learning techniques to match articles in both datasets and to
delete duplicated articles, keeping the version available in full text through ProQuest Newsstand. We used a filter to remove
any articles that mentioned a country name, country capital, nationality or continent name that did not also mention U.S.,
U.S.A., or United States in the headline or first 1,000 characters of the article (Schrodt 2011).
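A minimal sketch of the kind of country-name filter described in footnote 14 follows. The short gazetteer lists (FOREIGN_TERMS, US_TERMS) are placeholder assumptions; the actual lists of country names, capitals, nationalities, and continents would be far longer, and here both checks are applied to the headline plus the first 1,000 characters for simplicity.

```python
import re

# Illustrative placeholders; the real gazetteer is much longer.
FOREIGN_TERMS = ["Germany", "Berlin", "German", "Japan", "Tokyo", "Japanese", "Europe", "Asia"]
US_TERMS = ["U.S.", "U.S.A.", "United States"]

def keep_article(headline, body):
    """Drop articles that mention a foreign place/nationality but never mention
    the United States in the headline or first 1,000 characters."""
    window = headline + " " + body[:1000]
    mentions_foreign = any(re.search(r"\b" + re.escape(t) + r"\b", window) for t in FOREIGN_TERMS)
    mentions_us = any(t in window for t in US_TERMS)
    return (not mentions_foreign) or mentions_us

print(keep_article("Japan's exports surge", "Tokyo reported strong growth ..."))     # False: dropped
print(keep_article("U.S. economy slows", "The United States added fewer jobs ..."))  # True: kept
```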
15 We could also have generated keywords using a supervised or unsupervised method or query expansion (Rocchio 1971; Schütze and
Pedersen 1994; Xu and Croft 1996; Mitra, Singhal, and Buckley 1998; Bai et al. 2005; King, Lam, and Roberts 2016). However,
those methods are difficult to implement because they generally require unfettered access to the entire population of
documents, which we lacked in our case due to access limitations imposed by ProQuest.
16 Note that, although in theory LexisNexis and ProQuest should have identical archives of New York Times articles for
overlapping years, one or both archives might have idiosyncrasies that contribute to some portion of the differences
between the keyword-based and subject category-based corpora presented below. Indeed, the quirks of LexisNexis alone
are well documented (Fan, Geddes, and Flory 2013, 106).
17 See Appendix Section 3 for article counts unique to each corpus and common between them.
18 See Appendix Section 3 for details on how we determined overlap and uniqueness of the corpora.
corpus, we would omit 86.1% of the articles in the keyword corpus. There is, in short, shockingly
little article overlap between two corpora produced using reasonable strategies designed to
capture the same population: the set of New York Times articles relevant to the state of the US
economy.
Having more articles does not necessarily indicate that one corpus is better than the other.
The lack of overlap may indicate the subject category search is too narrow and/or the keyword
search is too broad. Perhaps the use of subject categories eliminates articles that provide no
information about the state of the US national economy, despite containing terms used in the
keyword search. In order to assess these possibilities, we recruited coders through the online
crowd-coding platform CrowdFlower (now Figure Eight), who coded the relevance of: (1) 1,000
randomly selected articles unique to the subject category corpus; (2) 1,000 randomly selected
articles unique to the keyword corpus, and (3) 1,000 randomly selected articles in both corpora.19
We present the results in Table 1.20
Overall, both search strategies yield a sample with a large proportion of irrelevant articles,
suggesting the searches are too broad.21 Unsurprisingly the proportion of relevant articles was
highest, 0.44, in articles that appeared in both the subject category and keyword corpora. Nearly
the same proportion of articles unique to the keyword corpus was coded as relevant (0.42), while
the proportion of articles unique to the subject category corpus coded relevant dropped by 13.5%,
to 0.37. This suggests the LexisNexis subject categories do not provide any assurance that an article
provides “information about the state of the economy.” Because the set of relevant articles in each
corpus is really a sample of the population of articles about the economy and since we want to
estimate the population values, we prefer a larger to a smaller sample, all else being equal. In
19 The materials required to replicate the analyses below are available on Dataverse (Barberá et al. 2020).
20 Relevance codings were based on coders’ assessment of relevance upon reading the first five sentences of the article. See
Section 1 of the Appendix for the coding instrument. Three different coders annotated each article (based on its first five
sentences), producing 9,000 total codings. Each coder was assigned a weight based on his/her overall performance (the
level of the coder’s agreement with that of other coders) before computing the proportion of articles deemed relevant. If
two out of three (weighted) coders concluded an article was relevant, the aggregate response is coded as “relevant”. This
is de facto a majority-rule criterion as coder weights were such that a single heavily weighted coder did not overrule the
decisions of two coders when there was disagreement. The coding-level proportions were qualitatively equivalent and are
presented in Table 3 in Section 3 of the Appendix.
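A minimal sketch of the weighted majority rule described in footnote 20 follows; the coder weights shown are hypothetical stand-ins for the performance-based weights computed by the crowd-coding platform.

```python
# Illustrative sketch of aggregating three coders' relevance judgments with
# per-coder weights. The weights here are placeholders; in practice they reflect
# each coder's agreement with other coders.
def aggregate_relevance(codings, weights):
    """codings: dict coder_id -> 1 (relevant) or 0 (not relevant).
    Returns 'relevant' if the weighted vote for relevance exceeds half the total weight."""
    total = sum(weights[c] for c in codings)
    relevant_weight = sum(weights[c] for c, label in codings.items() if label == 1)
    return "relevant" if relevant_weight > total / 2 else "not relevant"

# No single weight exceeds the sum of the other two, so one heavily weighted
# coder cannot overrule two coders who disagree, matching footnote 20.
weights = {"coder_a": 0.9, "coder_b": 0.8, "coder_c": 0.7}    # hypothetical trust scores
article_codings = {"coder_a": 1, "coder_b": 0, "coder_c": 1}  # two of three say relevant
print(aggregate_relevance(article_codings, weights))          # -> "relevant"
```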
21 Recall, however, that coders only read the first five sentences of each article. It may be that some (or even many) of the
articles deemed irrelevant contained relevant information after the first five sentences.
Table 1. Proportion of articles coded as relevant, by corpus.
Note: Cell entries indicate the proportion of articles in each dataset (and their overlap) coded as providing
information about how the US economy is doing. One thousand articles from each dataset were coded
by three CrowdFlower workers located in the US. Each coder was assigned a weight based on her overall
performance before computing the proportion of articles deemed relevant. If two out of three (weighted)
coders concluded an article was relevant, the aggregate response is coded as “relevant”.
this case, the subject category corpus has 7,291 relevant articles versus the keyword corpus with
13,016.22 Thus the keyword dataset would give us on average 34 relevant articles per month with
which to estimate tone, compared to 19 from the subject category dataset. Further, the keyword
dataset is not providing more observations at a cost of higher noise: the proportion of irrelevant
articles in the keyword corpus is lower than the proportion of irrelevant articles in the subject
category corpus.
These comparisons demonstrate that the given keyword and subject category searches
produced highly distinct corpora and that both corpora contained large portions of irrelevant
articles. Do these differences matter? The highly unique content of each corpus suggests the
potential for bias in both measures of tone. And the large proportion of irrelevant articles suggests
both resulting measures of tone will contain measurement error. But given that we do not know
the true tone as captured by a corpus that includes all relevant articles and no irrelevant articles
(i.e., in the population of articles on the US national economy), we cannot address these concerns
directly.23 We can, however, determine how much the differences between the two corpora affect
the estimated measures of tone. Applying Lexicoder, the dictionary used by Soroka, Stecula,
and Wlezien (2015), to both corpora we find a correlation of 0.48 between the two monthly
series while application of our SML algorithm resulted in a correlation of 0.59.24 Longitudinal
changes in tone are often the quantity of interest and the correlations of changes in tone are
much lower, 0.19 and 0.36 using Lexicoder and SML, respectively. These low correlations are
due in part to measurement error in each series, but these are disturbingly low correlations for
two series designed to measure exactly the same thing. Our analysis suggests that regardless
of whether one uses a dictionary method or an SML method, the resulting estimates of tone
may vary significantly depending on the method used for choosing the corpus in the first
place.
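To make this comparison concrete, the following sketch shows one way to build two monthly tone series and correlate them in levels and in month-to-month changes. The tiny positive/negative word lists and the column names are illustrative assumptions, standing in for Lexicoder scores or the SML classifier's predictions.

```python
import pandas as pd

POS = {"gain", "growth", "recovery"}   # placeholder positive terms
NEG = {"loss", "recession", "layoffs"} # placeholder negative terms

def tone(text):
    # Dictionary-style score: (positive terms - negative terms) / total terms.
    tokens = text.lower().split()
    return (sum(t in POS for t in tokens) - sum(t in NEG for t in tokens)) / max(len(tokens), 1)

def monthly_tone(df):
    # df is assumed to have columns 'date' (datetime) and 'text'.
    df = df.assign(tone=df["text"].map(tone))
    return df.set_index("date").resample("MS")["tone"].mean()

def compare_series(keyword_df, subject_df):
    kw, sc = monthly_tone(keyword_df), monthly_tone(subject_df)
    aligned = pd.concat({"kw": kw, "sc": sc}, axis=1).dropna()
    return {
        "levels_corr": aligned["kw"].corr(aligned["sc"]),
        "changes_corr": aligned["kw"].diff().corr(aligned["sc"].diff()),  # month-to-month changes
    }
```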
The extent to which our findings generalize is unclear—keyword searches may be ineffective
and subject categorization may be quite good in other cases. However, keyword searches are
within the analyst’s control, transparent, reproducible, and portable. Subject category searches
are not. We thus recommend analysts use keyword searches rather than subject categories, but
22 The subject category corpus contains 4,290 articles in common with the keyword corpus, of which 44% are relevant, and
14,605 articles unique to the subject category corpus, of which 37% are relevant. Of the 26,497 articles unique to the
keyword corpus, 42% are relevant.
23 As we discuss in Section 3 of the Appendix, an analyst could train the classifier for relevance and then omit articles classified
as irrelevant. However, we found that it was difficult to train an accurate relevance classifier, which meant that using it as
a filter could lead to sampling bias in the resulting final sample. Since our tests did not show a large difference in the
estimates of tone with respect to the training set, we opted for the one-stage classifier as it was a more parsimonious
choice.
24 Lexicoder tone scores for documents are calculated by taking the number of positive minus the number of negative terms
over the total number of terms (Eshbaugh-Soha 2010; Young and Soroka 2012). We use our baseline SML algorithm which
we define in more detail later on, but which is based on logistic regression with an L2 penalty. The monthly tone estimates
for both measures are the simple averages across the articles in a given month.
Findings: In our comparison, these two approaches yield dramatically different corpora,
with the keyword search producing a larger corpus with a higher proportion of relevant
articles.
Advice: Use keyword searches, following an iterative vetting process to evaluate trade-offs
between broader versus narrower sets of keywords. Avoid subject categories, as their
black-box nature can make replication and extension impossible.
25 For good practices in iterative keyword selection, see Atkinson, Lovett & Baumgartner (2014, 379–380).
26 The most important part of this task is likely the creation of a training instrument: a set of questions to ask humans to code
about the objects to be analyzed. But here we assume the analyst has an instrument at hand and focus on the question of
how to apply the instrument. The creation of a training instrument is covered in other contexts and is beyond the scope of
our work. Central works include Sudman, Bradburn, and Schwartz (1995), Bradburn, Sudman, and Wansink (2004), Groves
et al. (2009), and Krippendorff (2018).
27 This list is not exhaustive of the necessary steps to create a training dataset. We discuss a method of choosing coders based
on comparison of their performance and cost in Section 7 of the Appendix.
28 As with the other decisions we examine here, this choice may have downstream consequences.
29 We compared the performance of a number of classifiers with regard to accuracy and precision in both out of sample and
cross-validated samples before selecting logistic regression with an L2 penalty. See Figures 1 and 2 in Section 4 of the
Appendix for details.
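A minimal sketch of the kind of comparison mentioned in footnote 29 follows, assuming hypothetical texts and labels arrays of coded articles; the candidate classifiers and vectorizer settings are illustrative, not the configurations compared in the Appendix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Compare candidate classifiers by cross-validated accuracy before settling on
# L2-penalized logistic regression. Candidate set and settings are assumptions.
def compare_classifiers(texts, labels, cv=5):
    candidates = {
        "logit_l2": LogisticRegression(penalty="l2", max_iter=1000),
        "naive_bayes": MultinomialNB(),
        "linear_svm": LinearSVC(),
    }
    results = {}
    for name, clf in candidates.items():
        pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=5), clf)
        scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
        results[name] = scores.mean()
    return results
```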
30 Identifying exactly the first five sentences proved difficult in cases where we were working with original PDFs
converted to text via OCR. Errors in the OCR conversion sometimes resulted in initial segments of more than five
sentences.
31 Variation in the number of coders was a function of how many undergraduates completed tasks. The 5-category coding scheme
used here was employed before we switched to the 1–9 scheme later in the project.
32 Of course, compiling a training dataset of segments is more cost-effective than coding articles. We do not claim the first
five sentences are representative of an article’s tone. However, we assume the relationship between features and tone of
a given coded unit is the same regardless of where the unit occurs in the text.
33 See Section 5 of the Appendix for an additional exploration of how relevance and sentiment varies within article segments.
34 Coding was conducted using our 9-point scale. Sentences were randomized, so individual coders were not coding
sentences grouped by segment.
Findings: In our test, the tone of a segment tends to be consistent with the predominant
tone of its sentences. But regardless of a segment’s tone, more than half of the sentences
tend to be irrelevant, suggesting that segment-level coding may be noisier. However,
classifiers trained on sentences and on segments performed nearly identically in
accurately predicting segment-level tone, suggesting that the easier and cheaper
segment-level coding is preferable.
Advice: Code by segment, unless there is reason to suspect wide variance in tone across
sentences, and/or a high proportion of irrelevant sentences within segments.
38 One can create more complex schemes that would allocate a document to two coders, and only go to additional coders if
there is disagreement. Here we only consider cases where the decision is made ex ante.
39 See Appendix Section 6 for a way to use variance of individual coders to measure coder quality.
* These trade-offs assume a fixed amount of resources and a fixed number of codings,
remembering that increasing coders or documents will always improve predictive accuracy.
Findings: Simulations show that, given a fixed number of codings, accuracy is always
higher by maximizing the unique number of documents coded.
Advice: For any fixed number of total codings, maximize the number of unique documents
coded.
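The following is a stylized simulation of the trade-off summarized above, assuming a fixed budget of codings and a simple noise model for coders. It is not the simulation used in the paper, only an illustration of why, for a fixed budget, spreading codings over more unique documents tends to help.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate(budget=3000, coders_per_doc=1, n_features=50, noise=0.35, n_test=2000):
    """Stylized simulation: documents have latent features and a true binary tone;
    each coder's label flips the truth with probability `noise`; with a fixed
    budget of codings, more coders per document means fewer unique documents."""
    n_docs = budget // coders_per_doc
    beta = rng.normal(size=n_features)
    X = rng.normal(size=(n_docs, n_features))
    truth = rng.random(n_docs) < 1 / (1 + np.exp(-X @ beta))
    votes = np.array([np.where(rng.random(n_docs) < noise, ~truth, truth)
                      for _ in range(coders_per_doc)])
    labels = votes.mean(axis=0) >= 0.5          # majority vote over coders
    X_test = rng.normal(size=(n_test, n_features))
    y_test = rng.random(n_test) < 1 / (1 + np.exp(-X_test @ beta))
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return clf.score(X_test, y_test)            # held-out predictive accuracy

for k in (1, 3, 5):
    print(k, "coders/doc:", round(simulate(coders_per_doc=k), 3))
```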
40 Note that, depending on the task at hand, analysts may choose to use SML for one task and dictionaries for another (see,
for example, Stecula and Merkley 2019).
41 The terminology “training a classifier” is unique to machine learning, but easily translates to traditional econometrics as
“choose the model specification and estimate model parameters.”
42 Some dictionaries, for example, SentiStrength, allow users to optimize weights using a training set.
43 The analyst is not prohibited from bringing prior information to bear by, for example, including prespecified combinations
of words as features whose weights are estimated from the data.
44 As another example, Thelwall et al. (2010) compare human coding of short MySpace texts with the positive and negative
SentiStrength scores of those texts to validate their dictionary.
45 Based on an iterative procedure designed to maximize convergent validity, Hopkins et al. used fifteen negative terms: “bad”,
“bear”, “debt”, “drop”, “fall”, “fear”, “jobless”, “layoff”, “loss”, “plung”, “problem”, “recess”, “slow”, “slump”, “unemploy”, and
six positive terms: “bull”, “grow”, “growth”, “inflat”, “invest”, and “profit”.
46 One could simultaneously model relevance and tone, or model topics and then assign tone within topics—allowing the
impact of words to vary by topic. Those are considerations for future work.
stemming.47 Appendix Table 12 presents the n-grams most predictive of positive and negative
tone.
We can now compare the performance of each approach. We begin by assessing them against
the CF Truth dataset, comparing (1) the percent of articles for which each approach correctly
predicts the direction of tone coded at the article-segment level by humans (accuracy) and (2)
the percent of articles predicted to be positive whose predictions align with human annotations
(precision).48,49 Then for SML, we consider the role of training dataset size and the threshold
selected for classification. Then we assess accuracy for the baseline SML classifier and Lexicoder
for articles humans have coded as particularly negative or positive and those about which our
coders are more ambivalent.50
47 Selecting the optimal classifier to compare to the dictionaries requires a number of decisions that are beyond the scope of
this paper (but see Raschka 2015, James 2013, Hastie 2009, Caruana 2006), including how to preprocess the text, whether
to stem the text (truncate words to their base), how to select and handle stopwords (commonly used words that do not
contain relevant information), and the nature and number of features (n-grams) of the text to include. Denny and Spirling
(2018) show how the choice of preprocessing methods can have profound consequences, and we examine the effects of
some of these decisions on accuracy and precision in Section 4 of the Appendix.
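As an illustration of how the preprocessing choices listed in footnote 47 translate into concrete settings, the following sketch builds a stemmed document-term matrix. The specific choices (Snowball stemmer, English stopword list, unigrams plus bigrams, a minimum document frequency) are assumptions, not the configuration used in the paper.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")

class StemmedCountVectorizer(CountVectorizer):
    # Stem tokens before stopword removal and n-gram construction.
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: [stemmer.stem(tok) for tok in tokenize(doc)]

vectorizer = StemmedCountVectorizer(
    stop_words="english",   # drop common words that carry little information
    ngram_range=(1, 2),     # unigram and bigram features
    min_df=5,               # ignore very rare n-grams
)
# X = vectorizer.fit_transform(training_texts)  # document-term matrix for the classifier
```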
48 It could be the case that the true tone of an article does not match the true tone of its first five sentences (i.e., article
segment). Yet we have no reason to suspect that a comparison of article-level classifications versus article-segment human
coding systematically advantages or disadvantages dictionaries or SML methods.
49 All articles for which the SML classifier generated a probability of being positive greater than 0.5 were coded as positive.
For each of the dictionaries we coded an article as positive if the sentiment score generated by the dictionary was greater
than zero. This assumes an article with more positive (weighted) terms than negative (weighted) terms is positive. This
rule is somewhat arbitrary and different decision rules will change the accuracy (and precision) of the classifier.
50 See Section 9 in the Appendix for a comparison of the relationship between monthly measures of tone produced by each
classification method and standard measures of economic performance. These comparisons demonstrate the convergent
validity of the measures produced by each classifier.
articles in the modal category. Any classifier can achieve this level of accuracy simply by always
assigning each document to the modal category. Figure 3 shows that only SML outperforms the
naive guess of the modal category. The baseline SML classifier correctly predicts coding by crowd
workers in 71.0% of the articles they coded. In comparison, SentiStrength correctly predicts 60.5%,
Lexicoder 58.6%, and the Hopkins 21-Word Method 56.9% of the articles in CF Truth. The relative
performance of the SML classifier is even more pronounced with respect to precision, which is
the more difficult task here as positive articles are the rare category. Positive predictions from our
baseline SML model are correct 71.3% of the time, compared with 37.5% for SentiStrength, 45.7%
for Lexicoder, and 38.5% for the Hopkins 21-Word Method.
In sum, each of the dictionaries is both less accurate and less precise than the baseline SML
model.51
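A minimal sketch of this evaluation follows, assuming hypothetical arrays of CF Truth labels and method predictions; it computes accuracy, positive-class precision, and the always-modal baseline described above, with the classification rules from footnote 49 indicated in the usage comment.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# `y_true` holds the human (CF Truth) direction-of-tone labels (1 = positive,
# 0 = negative); each entry of `predictions` holds one method's predictions.
def evaluate(y_true, predictions):
    y_true = np.asarray(y_true)
    modal = int(np.bincount(y_true).argmax())                  # most common human label
    baseline = accuracy_score(y_true, np.full_like(y_true, modal))
    rows = {"always-modal baseline": {"accuracy": baseline, "precision": float("nan")}}
    for name, y_pred in predictions.items():
        rows[name] = {
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),      # correctness of positive calls
        }
    return rows

# predictions might be, e.g.:
# {"SML (p > 0.5)": sml_probs > 0.5, "Lexicoder (score > 0)": lexicoder_scores > 0}
```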
What is the role of training dataset size in explaining the better accuracy and precision rates of
the SML classifier? To answer this question, we drew ten random samples of 250 articles each from
the full CF Truth training dataset. Using the same method as above, we estimated the parameters
of the SML classifier on each of these ten samples. We then used each of these estimates of the
classifier to predict the tone of articles in CF Truth, recording accuracy, precision, and recall for
each replication.52 We repeated this process for sample sizes of 250 to 8,750 by increments of 250.
Figure 4 presents the accuracy results, with shaded areas indicating the 95% confidence interval.
The x-axis gives the size of the training dataset and the y-axis reports the average accuracy in CF
Truth for the given sample size. The final point represents the full training dataset, and as such
there is only one accuracy rate (and thus no confidence interval).
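The following sketch reproduces the logic of this exercise under illustrative assumptions about the pipeline and the held-out evaluation set: for each training-set size, draw repeated random samples of coded articles, refit the classifier, and record held-out accuracy with an approximate 95% band.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def learning_curve(train_texts, train_labels, test_texts, test_labels,
                   sizes=range(250, 9000, 250), replications=10, seed=1):
    """For each training-set size, refit on random samples and record held-out accuracy."""
    rng = np.random.default_rng(seed)
    train_texts, train_labels = np.asarray(train_texts), np.asarray(train_labels)
    results = []
    for n in sizes:
        accs = []
        for _ in range(replications):
            idx = rng.choice(len(train_texts), size=min(n, len(train_texts)), replace=False)
            pipe = make_pipeline(TfidfVectorizer(min_df=5),
                                 LogisticRegression(penalty="l2", max_iter=1000))
            pipe.fit(train_texts[idx], train_labels[idx])
            accs.append(pipe.score(test_texts, test_labels))
        accs = np.array(accs)
        half_width = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))  # ~95% band over replications
        results.append((n, accs.mean(), half_width))
    return results
```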
51 Similar accuracy and precision are obtained with respect to the UG Truth dataset. See the Appendix.
52 Results for recall and precision rates by training dataset size may be found in Appendix Section 10. Briefly, we find that
recall—the fraction of positive articles correctly coded as positive by our classifier—behaves similarly to accuracy. However,
precision—the fraction of articles we predict as positive that coders identified as being positive—is quite low (about 47%)
for N = 250 but jumps up and remains relatively flat between 65% and 70% for all sized training datasets 500 and greater.
What do we learn from this exercise? Using the smallest training dataset (250), the accuracy
of the SML classifier equals the percent of articles in the modal category (about 63%). Further,
accuracy improves quickly as the size of the training dataset increases. With 2,000 observations,
SML is quite accurate, and there appears to be very little return for a training dataset with
more than 3,000 articles. While it is clear that in this case 250 articles is not a large enough
training dataset to develop an accurate SML classifier, even using this small training dataset the
SML classifier has greater accuracy with respect to CF Truth than that obtained by any of the
dictionaries.53
An alternative way to compare SML to dictionary classifiers is to use a receiver operating
characteristic, or ROC, curve. An ROC curve shows the ability of each classifier to correctly
predict whether the tone of an article is positive in CF Truth for any given classification
threshold. In other words, it provides a visual description of a classifier’s ability to separate
negative from positive articles across all possible classification rules. Figure 5 presents the ROC
curve for the baseline SML classifier and the Lexicoder dictionary.54 The x-axis gives the false
positive rate—the proportion of all negatively toned articles in CF Truth that were classified as
positively toned—and the y-axis gives the true positive rate—the proportion of all positively
toned articles in CF Truth that were classified as positive. Each point on the curve represents
the misclassification rate for a given classification threshold. Two things are of note. First, for
almost any classification threshold, the SML classifier gives a higher true positive rate than
Lexicoder. Only in the very extreme cases in which articles are classified as positive only if the
predicted probability generated by the classifier is very close to 1.0 (bottom left corner of the figure)
53 The results of this exercise do not suggest that a training dataset of 250 will consistently produce accuracy rates equal to the
percent in the modal category, nor that 2,000 or even 3,000 observations is adequate to the task in any given application.
The size of the training dataset required will depend both on the quality of the training data, likely a function of the quality
of the coders and the difficulty of the coding task, as well as the ability of the measured features to predict the outcome.
54 Lexicoder scores were standardized to range between zero and one for this comparison.
does Lexicoder misclassify articles slightly less often. Second, the larger the area under the ROC
curve (AUC), the better the classifier’s performance. In this case the AUC of the SML classifier
(0.744) is significantly greater (p = 0.00) than for Lexicoder (0.602). This finding confirms that
the SML classifier has a greater ability to distinguish between more positive versus less positive
articles.
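A minimal sketch of this comparison follows, assuming y_true holds the CF Truth labels and that dictionary scores have been rescaled to the unit interval as in footnote 54; the bootstrap test of the AUC difference is an illustrative choice, since the text does not specify which test was used.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def compare_roc(y_true, sml_probs, dict_scores, n_boot=2000, seed=2):
    """ROC curves, AUC difference, and a bootstrap p-value (illustrative test choice)."""
    y_true = np.asarray(y_true)
    sml_probs, dict_scores = np.asarray(sml_probs), np.asarray(dict_scores)
    fpr_sml, tpr_sml, _ = roc_curve(y_true, sml_probs)
    fpr_dic, tpr_dic, _ = roc_curve(y_true, dict_scores)
    diff = roc_auc_score(y_true, sml_probs) - roc_auc_score(y_true, dict_scores)
    rng = np.random.default_rng(seed)
    boot_diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:     # need both classes to compute an AUC
            continue
        boot_diffs.append(roc_auc_score(y_true[idx], sml_probs[idx])
                          - roc_auc_score(y_true[idx], dict_scores[idx]))
    p_value = np.mean(np.asarray(boot_diffs) <= 0)  # one-sided: SML AUC > dictionary AUC
    return {"auc_difference": diff, "p_value": p_value,
            "roc_sml": (fpr_sml, tpr_sml), "roc_dictionary": (fpr_dic, tpr_dic)}
```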
Findings: In our tests, SML outperforms dictionary methods in terms of accuracy and
precision, and the ability to discriminate between more and less positive articles. A
relatively small training dataset produced a high-quality SML classifier.
Advice: Use SML if resources allow for the building of a high-quality training dataset. If using
dictionaries, choose a dictionary appropriate to the task at hand, and validate the utility
of the dictionary by confirming that a sample of dictionary-generated scores of text in the
corpus conform to human coding of the text for the measure of interest.
Acknowledgments
We thank Ken Benoit, Bryce Dietrich, Justin Grimmer, Jennifer Pan, and Arthur Spirling for their
helpful comments and suggestions to a previous version of the paper. We thank Christopher
Schwarz for writing code for simulations for the section on more documents versus more coders
(Figure 2). We are indebted to Stuart Soroka and Christopher Wlezien for generously sharing data
used in their past work, and for constructive comments on previous versions of this paper.
Funding
This material is based in part on work supported by the National Science Foundation under
IGERT Grant DGE-1144860, Big Data Social Science, and funding provided by the Moore and Sloan
Foundations. Nagler’s work was supported in part by a grant from the INSPIRE program of the
National Science Foundation (Award SES-1248077).
Supplementary material
For supplementary material accompanying this paper, please visit
https://doi.org/10.1017/pan.2020.8.
References
Atkinson, M. L., J. Lovett, and F. R. Baumgartner. 2014. “Measuring the Media Agenda.” Political
Communication 31(2):355–380.
Bai, J., D. Song, P. Bruza, J.-Y. Nie, and G. Cao. 2005. “Query Expansion Using Term Relationships in Language
Models for Information Retrieval.” In Proceedings of the 14th ACM International Conference on Information
and Knowledge Management, 688–695. Bremen, Germany: Association for Computing Machinery.