Computational Methods in Authorship Attribution

Moshe Koppel
Dept. of Computer Science, Bar-Ilan University

Jonathan Schler
Dept. of Computer Science, Bar-Ilan University

Shlomo Argamon
Dept. of Computer Science, Illinois Institute of Technology
Abstract
Statistical authorship attribution has a long history, culminating in the use of modern machine learning
classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed
set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems,
however, typically fall short of this ideal. Thus, following detailed discussion of previous work, three scenarios
are considered here for which solutions to the basic attribution problem are inadequate. In the first variant, the
profiling problem, there is no candidate set at all; in this case, the challenge is to provide as much demographic
or psychological information as possible about the author. In the second variant, the needle-in-a-haystack
problem, there are many thousands of candidates for each of whom we might have a very limited writing sample.
In the third variant, the verification problem, there is no closed candidate set but there is one suspect; in this case,
the challenge is to determine if the suspect is or is not the author. For each variant, it is shown how machine
learning methods can be adapted to handle the special challenges of that variant.
1. Introduction
The task of determining or verifying the authorship of an anonymous text based solely on internal
evidence is a very old one, dating back at least to the medieval scholastics, for whom the reliable
attribution of a given text to a known ancient authority was essential to determining the text’s veracity.
More recently, this problem of authorship attribution has gained greater prominence due to new
applications in forensic analysis, humanities scholarship, and electronic commerce, and the development of powerful computational methods for addressing it.
In the simplest form of the problem, we are given examples of the writing of a number of
candidate authors and are asked to determine which of them authored a given anonymous text. In this
straightforward form, the authorship attribution problem fits the standard modern paradigm of a text
categorization problem (Lewis & Ringuette 1994, Sebastiani 2002). The components of text
categorization systems are by now fairly well-understood: documents are represented as numerical
vectors that capture statistics of potentially relevant features of the text and machine learning methods
are used to find classifiers that separate documents that belong to different classes.
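To make this concrete, here is a minimal sketch of such a pipeline using scikit-learn; the training texts, labels, and anonymous text are placeholders, and the feature representation (tf-idf over words) is just one common choice rather than the setup of any particular study discussed here.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Placeholder training data: known writings of the candidate authors.
    train_texts = ["first known writing sample by author A",
                   "second known writing sample by author A",
                   "first known writing sample by author B",
                   "second known writing sample by author B"]
    train_labels = ["A", "A", "B", "B"]

    # Documents are turned into numerical feature vectors, and a linear
    # classifier is trained to separate the candidate authors.
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(train_texts, train_labels)

    print(model.predict(["an anonymous writing sample to be attributed"]))

In practice the vectorizer would be replaced by whatever feature set is appropriate (function words, character n-grams, and so on), as surveyed in Section 3.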
However, real-life authorship attribution problems are rarely as elegant as straightforward text
categorization problems, in which we have a small closed set of candidate authors and essentially
unlimited training text for each. There are a number of varieties of attribution problems that fall short
of this ideal. For example, we may encounter scenarios such as the following:
1. There is no candidate set at all. In this case, the challenge is to provide as much demographic
or psychological information as possible about the author. This is the profiling problem.
2. There are many thousands of candidates for each of whom we might have a very limited writing sample. This is the needle-in-a-haystack problem.
3. There is no closed candidate set but there is one suspect. In this case, the challenge is to
determine if the suspect is or is not the author. This is the verification problem.
Our goal in this paper is to survey the history of methods used for the basic authorship
attribution scenario and to discuss some recent solutions for the more complex variants mentioned
above.
In the following section, we offer a brief history of the analytical approaches to authorship
attribution, from nineteenth century work on statistical authorial invariants to recent application of
machine learning techniques. These modern techniques, together with recent advances in natural
language processing, have enabled the development of a plethora of potential markers of authorial
style, which we survey in Section 3. In Section 4, we describe the results of a systematic comparison of
learning algorithms and feature sets on several representative testbeds to determine the best combinations of feature types and learning methods for the basic attribution problem.
We then turn to consideration of variant scenarios where we do not have a small closed
candidate set. After giving an overview of the problems and approaches in Section 5, we consider the
profiling problem in Section 6, the needle-in-a-haystack problem in Section 7, and the verification problem in Section 8.
2. History of Methods
Over the last century and more, a great variety of methods have been applied to authorship attribution
problems of various sorts (cf. Juola 2008). For convenience, we divide them into three classes of
approach: the earliest, unitary invariant, approach, in which a single numeric function of a text is
sought to discriminate between authors, the multivariate analysis approach, in which statistical
multivariate discriminant analysis is applied to word frequencies and related numerical features, and
the most recent, the machine learning approach, in which modern machine learning methods are
applied to sets of training documents to construct classifiers that can be applied to new anonymous
documents.
2.1. Unitary Invariant Approach
A scientific approach to the authorship attribution problem was first proposed in the late nineteenth
century, in the work of Mendenhall (1887), who studied the authorship of texts attributed to Bacon,
Marlowe and Shakespeare, and of Mascol (1888a,1888b) who studied the authorship of the gospels of
the New Testament. The key idea was that the writing of each author could be characterized by a
unique curve expressing the relationship between word length and relative frequency of occurrence;
these characteristic curves thus would provide a basis for author attribution of anonymous texts. This
early work was put on a firmer statistical basis in the early twentieth century with the search for
invariant properties of textual statistics (Zipf 1932). The existence of such invariants suggested the
possibility that some related feature might be found that was at least invariant for any given author,
though possibly varying among different authors. Thus, for example, Yule (1944) considered sentence
length as a potential method for authorship discrimination, though he determined that this method was
not reliable. A number of other measures have been proposed as authorial markers (see Section 3.1
below), but for the most part this approach has not proved stable (Sichel 1986; Burrows 1992; Grieve 2007).
2.2. Multivariate Analysis Approach
Mosteller and Wallace's work (1964) on the authorship of the Federalist Papers ushered in a new set
of methods for stylometric authorship attribution, based on combining information from multiple
textual clues. Mosteller and Wallace (1964) applied a then-novel method of Bayesian classification to
the papers (essentially what is now called “Naïve Bayes” classification), using as features the
frequencies of a set of a few dozen function words (i.e., words with primarily grammatical functions
such as the, of, and about). The fundamental insight was that a rigorous Bayesian methodology, applied
to the frequencies of a set of topic-independent words, could yield a measurably reliable method for
attributing authorship. This opened up the field to the exploration of new types of textual features and of new classification methods.
A basic intuition behind these methods is that finding the most probable attribution can be
viewed as taking documents as points in some space, and assigning a questioned document to the
author whose documents are 'closest' to it, according to an appropriate distance measure. This simple
notion is quite powerful, so such distance measures continue to be used in recent studies examining the
efficacy of different metrics and feature sets. One such method is Burrows's (2002a) Delta, which has
been extended and used for a variety of attribution problems (Burrows 2002b; Hoover 2004a, 2004b), and which has been given a probabilistic interpretation in terms of a distribution over frequently appearing words (Stein & Argamon 2006; Argamon 2008). A number of
other similarity functions computed as distance measures for authorship attribution have been applied
to different feature sets as well (Craig 1999; Chaski 2001; Stamatatos et al. 2001; Keselj et al. 2003;
van Halteren et al. 2005; Burrows 2007). Recently, Grieve (2007) has run an exhaustive battery of tests comparing the effectiveness of a wide range of such measures and feature sets.
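As an illustration, the following rough sketch implements one common formulation of Delta: relative frequencies of the most frequent words are z-scored against the comparison corpus, and the candidate whose known text has the smallest mean absolute z-score difference from the questioned text is preferred. The texts here are placeholders, and real applications draw the word list from a much larger reference corpus.

    import numpy as np
    from collections import Counter

    def rel_freqs(text, vocab):
        counts = Counter(text.lower().split())
        total = sum(counts.values()) or 1
        return np.array([counts[w] / total for w in vocab])

    # Placeholder known texts; a real study would use substantial samples.
    known = {"AuthorA": "the cat sat on the mat and the cat slept",
             "AuthorB": "a dog ran in a field and a dog barked"}
    questioned = "the dog sat on the mat and the dog slept"

    # Vocabulary: most frequent words in the pooled known texts (function
    # words dominate such a list in realistic corpora).
    pooled = Counter(" ".join(known.values()).lower().split())
    vocab = [w for w, _ in pooled.most_common(150)]

    X = np.array([rel_freqs(t, vocab) for t in known.values()])
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-9   # avoid division by zero
    z_q = (rel_freqs(questioned, vocab) - mu) / sigma

    for author, row in zip(known, X):
        delta = np.mean(np.abs(z_q - (row - mu) / sigma))
        print(author, round(float(delta), 3))          # smaller Delta = closer style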
A related class of techniques was developed earlier by Burrows (1987; 1989), who applied
principal components analysis (PCA) on word frequencies to analyze authorship. The idea is to
visualize the differences between texts written by different authors by projecting high-dimensional
word-frequency vectors computed for those texts onto the 2-dimensional subspace spanned by the two
principal components; if good separation is seen between documents known to be written by different
authors, then new texts may be attributed by seeing which authors' comparison documents are closest
to them in this space. This method was elaborated on by Binongo and Smith (1999), and has been used
to resolve several outstanding authorship problems (Burrows 1992; Binongo 2003; Holmes 2003). A
related method is ANOVA, as applied, for example, by Holmes and Forsyth (1995) to the Federalist.
From the probabilistic standpoint, these methods take into account, to some extent, the statistical dependence among the frequencies of different words.
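A minimal sketch of this kind of analysis, assuming scikit-learn and placeholder documents, is the following; real studies use far more text and typically restrict attention to function words.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import CountVectorizer

    # Placeholder documents with known authors.
    docs = ["a first text known to be by author A",
            "a second text known to be by author A",
            "a first text known to be by author B",
            "a second text known to be by author B"]
    authors = ["A", "A", "B", "B"]

    # Relative frequencies of frequent words (several dozen function words
    # in the studies cited above).
    X = CountVectorizer(max_features=100).fit_transform(docs).toarray().astype(float)
    X = X / X.sum(axis=1, keepdims=True)

    # Project onto the first two principal components and inspect separation.
    coords = PCA(n_components=2).fit_transform(X)
    for author, (pc1, pc2) in zip(authors, coords):
        print(author, round(float(pc1), 3), round(float(pc2), 3))
    # A questioned text would be attributed to the author whose known texts
    # lie closest to its projection in this 2-D space.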
Another form of dependence between words is taken into account by methods that model the
sequencing of words in a document. This may be accounted for by using a probabilistic distance
measure such as K-L divergence between Markov model probability distributions of the texts (Juola
1998; Khmelev 2001; Khmelev and Tweedie 2002; Juola & Baayen 2003; Sanderson and Guenter
2006), possibly implicitly in the context of compression methods (Kukushkina et al. 2001; Benedetto et al. 2002).
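The following sketch illustrates the general idea with a smoothed character-bigram model per candidate, scoring the anonymous text by cross-entropy under each model; the texts are placeholders and the smoothing scheme is a simple illustrative choice, not the specific method of any study cited above.

    import math
    from collections import Counter

    def bigram_model(text):
        # Character-bigram counts and single-character context counts.
        return Counter(zip(text, text[1:])), Counter(text[:-1])

    def cross_entropy(text, model, alpha=0.5, alphabet=256):
        pairs, context = model
        total = 0.0
        for a, b in zip(text, text[1:]):
            # Additive smoothing so unseen bigrams get nonzero probability.
            p = (pairs[(a, b)] + alpha) / (context[a] + alpha * alphabet)
            total -= math.log2(p)
        return total / max(len(text) - 1, 1)

    # Placeholder known writings and questioned text.
    known = {"AuthorA": "a long sample of author A's writing goes here",
             "AuthorB": "a long sample of author B's writing goes here"}
    questioned = "the questioned document goes here"

    models = {a: bigram_model(t) for a, t in known.items()}
    scores = {a: cross_entropy(questioned, m) for a, m in models.items()}
    print(min(scores, key=scores.get), scores)   # lowest cross-entropy wins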
2.3. Machine Learning Approach
The emergence of text categorization techniques rooted in machine learning marked an important
turning point in authorship attribution studies. The application of such methods is straightforward:
training texts are represented as labeled numerical vectors and learning methods are used to find
boundaries between classes (authors) that minimize some classification loss function. The nature of the
learned boundaries depends on the learning method used but in any case these methods facilitate the
use of classes of boundaries that extend well beyond those implicit in methods that minimize distance.
Among the earliest methods to be applied were various types of neural networks, typically using
small sets of function words as features (Matthews & Merriam 1993; Merriam & Matthews 1994; Kjell
1994a; Lowe & Matthews 1995; Tweedie et al. 1996; Hoorn 1999; Waugh et al. 2000). More recently,
Graham et al. (2005) and Zheng et al. (2006) used neural networks on a wide variety of features. Other
studies used k-nearest neighbor (Kjell et al 1995; Hoorn et al. 1999; Zhao & Zobel 2005), Naive Bayes
(Kjell 1994a; Hoorn et al. 1999; Peng et al 2004), rule learners (Holmes & Forsyth 1995; Holmes
1998; Argamon et al. 1998; Koppel & Schler 2003; Abbasi & Chen 2005; Zheng et al. 2006), support
vector machines (De Vel et al. 2001; Diederich et al. 2003; Koppel & Schler 2003, Abbasi & Chen
2005; Koppel et al. 2005; Zheng et al 2006), Winnow (Koppel et al. 2002; Argamon et al. 2003;
Koppel et al. 2006a), and Bayesian regression (Genkin et al. 2006; Madigan et al. 2006; Argamon et al.
2008). Further details regarding these studies can be found in the Appendix.
Comparative studies on machine learning methods for topic-based text categorization problems
(Dumais et al. 1998; Yang 1999) have shown that in general, support vector machine (SVM) learning is
at least as good for text categorization as any other learning method and the same has been found for
authorship attribution (Abbasi & Chen 2005; Zheng et al. 2006). Some recent studies (Koppel et al.
2003; Genkin et al. 2006) have shown that some variations of Winnow and Bayesian regression are
also very promising. Below, we compare the performance of several representative learning methods
for authorship attribution. As we will see below, however, the choice of the learning algorithm is no
more important than the choice of the features by which the texts are to be represented. We discuss this issue in the following section.
3. Types of Features
One of the advantages of modern machine learning methods is that they permit us to consider a wide
variety of potentially relevant features without suffering great degradation in accuracy if most of these
features prove to be irrelevant. In this section, we consider a number of feature types that have been, or
might be, used for the attribution problem. A number of earlier works that have surveyed and/or
compared various types of feature sets include Forsyth & Holmes (1996), Holmes (1998), McEnery &
Oakes (2000), Love (2002), Zheng et al. (2006), Abbasi & Chen (2008) and Juola (2008). We note that
in addition to the above-cited work dealing with attribution of texts in a variety of genres, there has
also been a fair amount of work on attribution of programming code, music, art and other media; we do not consider such work here.
3.1. Complexity Measures
As noted in Section 2.1, early work on authorship focused on the search for a single feature that
remained invariant for a given author but varied among different authors. The search for such
invariants centered on measures of text complexity. These measures included average word length (or
more generally, word length distribution) in terms of syllables (Fucks 1952) or letters (Mendenhall
1887; Brinegar 1963) and average number of words per sentence (Yule 1944; Morton 1965). When these
measures proved inadequate, more sophisticated measures were invented, involving type-token ratio
and the number of words appearing with given frequency in a text (such as hapax legomena). Among
the better known of these are Yule's K-measure (1944), Sichel's S-measure (1975), and Honore's R-
measure (1979). Ultimately, none of these measures has proved especially useful on its own (Burrows
1992; Grieve 2007), though it may be that these features have marginal value as additional inputs
together with the features that we consider below (De Vel et al. 2001; Corney et al. 2002; Abbasi &
Chen 2005; Zheng et al. 2006; Li et al. 2006; Abbasi & Chen 2008).
3.2. Function Words
The search for a single invariant measure of textual style was natural in the early stages of stylometric
analysis, but with the development of more sophisticated multivariate analysis techniques, larger sets of
features could be considered. Among the earliest studies to use multivariate approaches was that of
Mosteller and Wallace (1964) noted above, who considered distributions of function words. The
reason for using function words in preference to others is that we do not expect their frequencies to
vary greatly with the topic of the text, and hence we may hope to recognize texts by the same author on
different topics. It is also unlikely that the frequency of function word use can be consciously
controlled, so one may hope that use of function words for attribution will minimize the risk of being misled by deliberate stylistic deception.
Many studies since that of Mosteller and Wallace have shown the efficacy of function words for
authorship attribution in different scenarios (Morton 1978; Burrows 1987; Karlgren & Cutting 1994;
Merriam & Matthews 1994; Kessler et al. 1997; Argamon et al. 1998; Holmes 1998; de Vel et al. 2001;
Holmes et al. 2001a, 2001b; Baayen et al. 2002; Binongo 2003; Juola & Baayen 2003; Zhao & Zobel
2005; Argamon & Levitan 2005; Koppel et al. 2005, 2006a), confirming the hypothesis that different authors use function words in distinctive and consistent ways.
Typical modern studies using function words in English use lists of a few hundred words,
including pronouns, prepositions, auxiliary and modal verbs, conjunctions, and determiners. Numbers
and interjections are usually included as well since they are essentially independent of topic, though
they are not, strictly speaking, function words. Results of different studies using somewhat different
lists of function words have been similar, indicating that the precise choice of function words is not
crucial. Discriminators built from function word frequencies often perform at levels competitive with those based on much richer feature sets.
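A function-word representation is easy to compute; the sketch below uses a tiny illustrative word list and naive whitespace tokenization, whereas the studies cited above use lists of a few hundred words and proper tokenization.

    from collections import Counter

    # Tiny illustrative list; real studies use a few hundred function words.
    FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in", "that", "it",
                      "with", "as", "for", "was", "on", "but", "not", "or"]

    def function_word_vector(text):
        tokens = text.lower().split()
        counts = Counter(tokens)
        total = len(tokens) or 1
        return [counts[w] / total for w in FUNCTION_WORDS]

    print(function_word_vector("It was the best of times, it was the worst of times."))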
3.3. Syntactic Features
A different type of feature set is based on relative frequencies of different syntactic constructions,
made possible by development of fast and reliable statistical natural language processing techniques. A
number of studies used the output of syntactic text chunkers and parsers to create features, and found
that they could considerably improve results based on traditional word based analysis alone (Baayen et
al. 1996; Stamatatos et al. 2000, 2001; Gamon 2004; van Halteren 2004; Chaski 2005; Uzuner and
Katz 2005; Hirst & Feiguina 2007). Many studies have used the frequencies of short sequences of
parts-of-speech (or combinations of parts-of-speech and other classes of words) as a simple method for
approximating syntactic features for this purpose (Argamon-Engelson et al. 1998; Kukushkina et al.
2001; De Vel et al. 2001; Koppel et al. 2002; Koppel & Schler 2003; Chaski 2005; Koppel et al. 2005,
2006a; van Halteren et al 2005; Zhao et al. 2006; Zheng et al. 2006).
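For illustration, the following sketch computes relative frequencies of POS unigrams and bigrams with NLTK; it assumes the NLTK tokenizer and tagger models are installed, and real studies normalize and select these features more carefully.

    from collections import Counter
    import nltk   # assumes the 'punkt' tokenizer and POS tagger models are installed

    def pos_ngram_features(text):
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
        unigrams, bigrams = Counter(tags), Counter(zip(tags, tags[1:]))
        total = sum(unigrams.values()) or 1
        feats = {"POS_" + t: c / total for t, c in unigrams.items()}
        feats.update({"POS_" + a + "_" + b: c / total for (a, b), c in bigrams.items()})
        return feats

    print(pos_ngram_features("The cat sat on the mat, and then it slept."))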
3.4. Functional Lexical Taxonomies
Function words and some parts-of-speech features can be subsumed by considering taxonomies, based
on Systemic Functional Linguistics (Halliday & Matthiessen 2003), which represent grammatical and
semantic distinctions between classes of function words at different levels of abstraction. Such
taxonomies are represented as trees whose roots are labeled by sets of parts-of-speech (articles,
auxiliary verbs, conjunctions, prepositions, pronouns). Each node's children are labeled by meaningful
subclasses of the parent node (such as the various sorts of personal pronouns). This bottoms out at the
leaves, which are labeled by sets of individual words. These taxonomies can be used to construct
features for stylistic text classification as has been done for authorship attribution on texts in English
(Whitelaw et al. 2004; Argamon et al. 2007, 2008) and Portuguese (Pavelec 2007).
Such feature sets might include most function words and some POS unigrams, as well as
features at intermediate levels of abstraction. It is important to note that the features so constructed are
all closed sets of words so that no part-of-speech tagging is required for identifying such features in a
text.
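The sketch below shows the flavor of such features with a tiny, made-up taxonomy fragment (it is not the actual SFL-derived taxonomy used in the cited studies): each occurrence of a word in a leaf set is counted both for its fine-grained subclass and for its coarser parent class.

    from collections import Counter

    # Illustrative fragment only; not the actual SFL-derived taxonomy.
    TAXONOMY = {
        "PRONOUN": {
            "PERSONAL_FIRST": {"i", "me", "we", "us"},
            "PERSONAL_THIRD": {"he", "she", "it", "they", "him", "her", "them"},
        },
        "CONJUNCTION": {
            "ADDITIVE": {"and", "moreover", "furthermore"},
            "ADVERSATIVE": {"but", "however", "yet"},
        },
    }

    def taxonomy_features(text):
        tokens = text.lower().split()
        total = len(tokens) or 1
        counts = Counter()
        for root, subclasses in TAXONOMY.items():
            for subclass, words in subclasses.items():
                hits = sum(1 for t in tokens if t in words)
                counts[subclass] += hits   # fine-grained feature
                counts[root] += hits       # coarser feature, same word occurrences
        return {name: c / total for name, c in counts.items()}

    print(taxonomy_features("but she and i went however they stayed"))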
3.5. Content Words
There are aspects of authorial identity that are not easily captured by the sorts of stylistic features
described above. For example, one author may prefer to use the words start and large, where another
may prefer begin and big (Mosteller and Wallace 1964, Koppel et al. 2006a). Such patterns of lexical
choice can be represented by modeling the relative frequencies of content words (Martindale and
McKenzie 1995; Craig 1999; Waugh et al. 2000; Diederich et al. 2003; Hoover 2004a, 2004b;
Argamon et al. 2008). Typically very rare words and those with near-uniform distribution over the
corpus of interest can be omitted (Forman 2003), so that a reasonable set of perhaps several thousand
words may be used. Sequences and collocations of content words can also be useful (Hoover 2002,
2003a, 2003b).
As noted earlier, the use of content-based features for authorship studies can be problematic.
Content markers might just be artifacts of a particular writing situation or experimental setup and might
thus produce overly optimistic results that will not be borne out in real-life applications. Thus, if one
author’s training documents are all on a particular topic, the trained classifier may do very poorly at
identifying documents by that author on a different topic. We are therefore careful in this paper to
distinguish results that exploit content-based features from those that do not.
3.6. Character N-grams
Several authors have proposed that the frequencies of various character n-grams might be useful for
capturing lexical preferences -- and even grammatical and orthographic preferences -- without the need
for linguistic background knowledge (making application to different languages trivial). Thus, Kjell
(1994a,1994b) and Kjell et al. (1995) used relative frequencies of character n-grams for attribution of
the Federalist papers and others have used character n-grams for authorship attribution of texts in
English (Ledger & Merriam 1994; Clement and Sharp 2003; Houvardas and Stamatatos 2006;
Stamatatos 2008), Dutch (Hoorn et al. 1999), Russian (Kukushkina 2001), Italian (Benedetto et al.
2002) and Greek (Keselj et al. 2003; Peng et al. 2004). Grieve (2007) has found that character bigrams
work surprisingly well for attribution of newspaper opinion columns. Chaski (2005, 2007) found
character n-grams to work well for attribution in a forensic context. Character n-grams have also been
shown useful for related stylistic classification tasks such as document similarity (Damashek 1995) or
determining the native language of the writer (Zigdon 2005), though Graham et al. (2005) found that
character n-grams did not work as well as syntax-based features for stylistic text segmentation. Zhang
and Lee (2006) find clusters of character n-grams that prove useful for a variety of text categorization
problems.
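Character n-gram features are straightforward to extract; the sketch below uses scikit-learn to compute relative frequencies of character trigrams over placeholder documents, with frequency-based truncation standing in for the more careful feature selection used in the studies above.

    from sklearn.feature_extraction.text import CountVectorizer

    # Placeholder documents.
    docs = ["First sample document by some author.",
            "Second sample document by another author."]

    vec = CountVectorizer(analyzer="char", ngram_range=(3, 3), max_features=5000)
    X = vec.fit_transform(docs).toarray().astype(float)
    X = X / X.sum(axis=1, keepdims=True)   # relative frequency of each trigram
    print(X.shape, list(vec.get_feature_names_out()[:5]))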
The caveats regarding content words apply also to the use of character n-grams, as many frequent n-grams are simply fragments of topic-related content words.
3.7. Other Features
Some other features have been found useful for authorship and stylistic classification in particular
cases. Morphological analysis has been shown to be useful for authorship attribution in languages with
richer morphology than English, such as Greek (Stamatatos et al. 2001) and Hebrew (Koppel et al.
2006b), where many function words are represented by prefixes and suffixes. Other potentially useful features include punctuation habits (O'Donnell 1966; Chaski 2001) and orthographic/syntactic errors and idiosyncrasies
(de Vel et al. 2001, Koppel and Schler 2003). Thus, Koppel and Schler (2002) analyzed email texts by
running them through the MS-Word spelling and grammar checker, automatically assigning each error
found an “error type” such as repeated letter (e.g. remmit instead of remit), letter substitution (e.g. firsd
instead of first), letter inversion (e.g. fisrt instead of first), or conflated words (e.g., stucktogether). This
approximates methods used in manual analyses of authorship whose goal is to identify idiosyncratic
characteristics of the author that can be recognized in a questioned text (Foster 2000).
Finally, for documents such as email, blogs and other online content, formatting and other
structural features can also be profitably exploited for authorship attribution (De Vel et al. 2001; Abbasi & Chen 2008).
3.8. Summary
Above, we have described a wide variety of feature sets and analysis methods that have been applied to
various authorship attribution problems over the years. In principle, any feature set can be used with
nearly any classification method, provided proper methodology is followed in study design (cf.
Rudman 1997). In practice, however, certain combinations have been more often applied and studied.
As a reference, Appendix 1 contains a summary of the methods and feature sets that have been used in previous authorship attribution studies.
4. Comparison Studies
In this section, we consider and compare methods and features applied to three authorship attribution
problems representative of the range of classical attribution problems. The corpora are as follows:
1. A large set of emails between two of the authors of this paper (Koppel and Schler),
covering the year 2005. The set consisted of 246 emails from Koppel and 242 emails
from Schler, each stripped of headers, named greetings, signatures and quotes from
previous posts in the thread. Some of the texts were as short as a single word. The
messages prior to July 1 were used for training and the second half for testing.
2. Two books by each of nine nineteenth and early twentieth century authors of American and British literature (Hawthorne, Melville, Cooper, Charlotte Bronte, Anne Bronte, Shaw, Wilde, Thoreau, Emerson). One book of each was used for training and the other for testing.
3. The full set of posts of twenty prolific bloggers, harvested in August 2004. The number
of posts of the individual bloggers ranged from 217 to 745 with an average of just over
250 words per post. The last 30 posts of each blogger were used as a test corpus.
As can be seen, these corpora differ along a variety of dimensions, including – most prominently – the
size of the candidate sets (2, 9, 20) and the nature of the material (emails, novels, blogs).
For each corpus, we run experiments comparing the effectiveness of various combinations of
feature types and machine learning methods. The feature types and machine learning methods that we
use are given in Table 1. Note that for each feature type that we consider, there are parameters that
need to be chosen. It is beyond the scope of this paper to determine the optimal parameter settings in
each case. We show results for plausible settings that earlier work, or our own preliminary tests,
suggest work reasonably well. Thus, for POS, we use all unigrams, sufficiently frequent bigrams and
no trigrams (Koppel & Schler 2003). We show results for SFL alone, but not in combination with FW
and/or POS, since the overlap of SFL with each of the other types is very large. We consider only
character trigrams, since these are long enough to capture morphology without mapping too obviously
to specific words. For both content words and character n-grams, we choose the 1000 features with
highest infogain from among those that are among the 10,000 most frequent in the corpus.
Each document in each corpus is processed to produce a numerical vector, each of whose
elements represents the relative frequency of some feature in the selected feature set. Models learned
on the training sets are then applied to the corresponding test sets to estimate generalization accuracy.
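The feature-selection step can be sketched as follows, with mutual information standing in for infogain and placeholder training data; this is an illustration of the procedure described above, not the exact code used in these experiments.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    # Placeholder training texts and author labels.
    train_texts = ["first text by author A", "first text by author B",
                   "second text by author A", "second text by author B"]
    train_labels = ["A", "B", "A", "B"]

    # Step 1: restrict to the most frequent terms in the corpus.
    vec = CountVectorizer(max_features=10000)
    X = vec.fit_transform(train_texts)

    # Step 2: keep the features most informative about the author labels
    # (mutual information here stands in for infogain).
    k = min(1000, X.shape[1])
    selector = SelectKBest(mutual_info_classif, k=k).fit(X, train_labels)
    print(selector.transform(X).shape)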
Table 2 shows results for each combination of features and learning method for the email corpus. Table
3 shows the results for the classic authors corpus. Table 4 shows results for the blog corpus.
As can be seen, Naïve Bayes, which we use as a representative for multivariate methods,
performs very poorly for all feature sets. Moreover, SVM and Bayesian regression are far superior to
the other learning algorithms, for all feature sets. Moreover, for these learners, SFL features perform about as well as the combination of function words and parts-of-speech.
In the corpora we consider here, content words prove to be very useful; in no case do they lead
us astray. Especially surprising is the effectiveness of the character n-gram feature set. Note that
character n-grams perform almost identically to content words for the first two corpora and
significantly outperform content words for the blog corpus. Consideration of some examples of useful
character n-grams suggests that character n-grams serve as proxies for content words (e.g., dsh for
spreadsheet), as well as for function words, parts-of-speech and even formatting (e.g., the string
colon—newline—1 suggests a numbered list). In the case of the blog corpus, character n-grams have
the additional benefit of capturing acronyms and abbreviations characteristic of blog writing.
We tentatively conclude, therefore, that when the context indicates that purely stylistic features
are appropriate, the combination of parts-of-speech and function words constitutes a reasonable choice
of feature set and that SFL features can be used as an efficient proxy for this combination. When
content features are appropriate, properly chosen unigrams are a good choice, with similarly chosen
character tri-grams an efficient and language-independent proxy. Using these features, as appropriate,
in conjunction with either Bayesian regression or SVM, constitutes a convenient and effective method for solving the standard authorship attribution problem.
5. Beyond the Basic Attribution Problem
The attribution problem we have considered thus far is the standard one in which we are given a
relatively small closed set of candidate authors and are asked to determine which of them is the author
of a given document. In the sections that follow we consider three variations in which no small closed candidate set is given.
First, we consider the case in which no candidate set is available at all so that the best we can
hope to do is to profile the anonymous author. We will see in Section 6 that essentially the same
methods that we used above for distinguishing individual authors can be used to distinguish between
classes of authors, such as males and females or writers of different ages. The discussion in Section 6 is an expansion of that given in Argamon et al. (2008).
Next, we consider the case in which the candidate set consists of many thousands of authors so
that learning a classifier to distinguish them is infeasible. We will see in Section 7 that this problem can
be solved if we are willing to accept Don't Know as an answer for those cases where the document to
be attributed is not sufficiently distinct to permit attribution. We use meta-learning to identify such
cases and find that in the remaining cases, where the system believes attribution is reliable, we are able
to provide highly accurate results. The discussion in Section 7 is an expansion of that given in Koppel
et al. (2006c).
Finally, we consider the case where there is a single candidate author and our task is to
determine if the anonymous document was written by that author. In Section 8, we show that this
problem is solvable if the anonymous text is sufficiently long. The method used entails measuring the
"depth" of the differences between the known texts of the candidate author and the anonymous text. In
particular, we check how accurately we can distinguish between the two as the best features for doing
so are iteratively eliminated. The discussion in Section 8 is drawn from Koppel et al. (2007).
6. Profiling
As noted above, even in cases where we have an anonymous text and no candidate authors, we would
like to say something about the anonymous author. That is, we wish to exploit the sociolinguistic
observation that different groups of people speaking or writing in a particular genre and in a particular
language use that language differently (cf. Chambers et al. 2004). More specifically, we wish to use the
features and methods used above to distinguish between individual authors in order to distinguish
As in Argamon et al. (2008), we consider the following profile dimensions: author gender
(Koppel et al. 2002; Argamon et al. 2003), age (Burger and Henderson 2006; Schler et al. 2006), native
language (Koppel et al. 2005) and neuroticism level (Pennebaker & King 1999; Pennebaker, Mehl, &
Niederhoffer, 2003). For each of these, we assemble an appropriately labeled corpus and proceed
exactly as described above. Thus, for example, we learn a classifier to distinguish between male and
female writers using the same procedure we used above to distinguish between individual authors.
Other authors have considered dimensions we do not consider here, such as education level (Corney et
al. 2002).
Following our observations in Section 4, we use here SFL as our stylistic feature set. For
comparison, we also consider content features alone and stylistic features and content features together.
The content features are chosen exactly as described in Section 4. We use Bayesian regression (BMR)
as our learning algorithm. For each of the three feature sets, we run ten-fold cross-validation tests to
test the extent to which each profiling problem is solvable. We also present the most discriminating features of each type for each problem.
6.1 Gender
Our corpus for both gender and age, first described by Schler et al. (2006), was assembled by taking as
an initial set all 47,000 blogs in blogger.com (as of August 2004) that self-reported both age and
gender, and included at least 200 occurrences of common English words. After dividing the set into age
intervals, we selected equal numbers of male and female bloggers in each age interval by randomly
eliminating surplus. The final corpus consists of the full set of postings of 19,320 blog authors (each
text is the full set of posts by a given author) ranging in length from several hundred to tens of thousands of words.
Classification results for gender are shown in the first line of Table 5. As is evident, all feature
sets give effective classification, with the content features slightly better than the style features.
In the first line of Table 6, we show the most discriminating style and content features,
respectively, for gender. As can be seen, the style features most useful for gender discrimination are
determiners and prepositions (markers of male writing) and pronouns (markers of female writing). The
content features most useful for gender discrimination are words related to technology (male) and
words related to personal life and relationships (female). Earlier studies (Argamon et al. 2003) on
author gender in both fiction and non-fiction have shown that the style features found here to be useful are effective in those genres as well.
6.2 Age
Based on each blogger’s reported age, we label each blog in our corpus as belonging to one of three age
groups: 13-17 (42.7%), 23-27 (41.9%) and 33-47 (15.5%). Intermediate age groups were removed to
avoid ambiguity, since many of the blogs were written over a period of several years. Our objective is to determine to which of the three age groups the author of a given blog belongs.
Accuracy results for age classification are shown in the second line of Table 5. Both style and
content features give us over 76% accuracy for this three-way classification problem, well above the 42.7% majority-class baseline.
The style features most useful for age classification (Table 6) are contractions without
apostrophes (younger writing), and determiners and prepositions (older writing). Note that the strongest
style features for 20s and 30s are identical; they are those that distinguish both of these classes from
teenagers. The content features that prove to be most useful for discrimination are words related to
school and mood for teens, to work and social life for 20s, and to family life for 30s.
6.3 Native Language
For the problem of determining an author's native language, we use a portion of the International
Corpus of Learner English (Granger et al. 2002). All the writers in the corpus are university students
(mostly in their third or fourth year) studying English as a second language and assigned to the same
proficiency level in English. We consider 1290 texts in five sub-corpora, each comprising 258 writers, from Russia, the Czech Republic, Bulgaria, France, and Spain, respectively. All texts in the resulting corpus
are between 579 and 846 words long. Our objective is to determine which of the five languages is the writer's native language.
Accuracy results are shown in the third line of Table 5. Both style and content features give results well above the 20% chance baseline for this five-way problem.
In Table 6, we can see some consistent patterns of usage in the style features. For example, as
might be expected, native speakers of Slavic languages (Russian, Bulgarian, Czech) tend to omit the
definite article the which does not exist in these languages. (Since we list only features that are over-
represented in a given class, this feature is seen by examining the list of features for Spanish. Indeed,
many of the most discriminating features are those that are under-represented for particular languages.)
Furthermore, those words with commonly used analogs in a given language are used with greater
frequency by native speakers of that language, such as indeed (French), over (Russian), and however
(Bulgarian).
Elsewhere (Koppel et al. 2005), we have shown that for determining native language, features
that measure stylistic idiosyncrasies and errors are particularly useful. Using such features together
with the style features considered in this section yields classification accuracy of over 80% for this
task.
Regarding content words, it should be noted that, unlike the text collections used in the other
experiments described in this paper, writers in the learner corpus did not necessarily freely choose their
writing topics, so that differences in content word usage here are plausibly artifacts of the experimental
setup.
6.4 Personality
To examine the extent to which personality type can be determined from writing style, we use a corpus
of essays written by psychology undergraduates at the University of Texas. Students were instructed to
write a short “stream of consciousness” essay wherein they tracked their thoughts and feelings over a
20-minute free-writing period. The essays range in length from 251 to 1951 words. Each writer also
filled out a questionnaire testing for the “Big Five” personality dimensions: neuroticism, extraversion,
openness, conscientiousness, and agreeableness (John et al. 1991). To illustrate personality profiling,
we consider just the dimension of neuroticism; methods and results for other personality factors are
qualitatively similar. To formulate this as a classification problem, we define ‘positive’ examples to be
the participants with neuroticism scores in the upper third, and ‘negative’ examples to be those with
scores in the lowest third. The rest of the data are ignored; the final corpus consists of 198 writing
samples.
Accuracy results are shown in the fourth line of Table 5. Notably, style features give a great deal more traction than content features, which are of little use here. Indeed, the accuracy obtained using style features alone is surprisingly high; independent studies of individuals who attempted to guess others' neuroticism levels have given an average accuracy of 69% -- even among people who have known each other for several years.
As shown in Table 6, the most discriminating style features for this task suggest that neurotics
tend more to refer to themselves, to use pronouns as subjects rather than as objects in a clause, to use
reflexive pronouns, and to consider explicitly who benefits from some action (through prepositional
phrases involving, e.g., "for" and "in order to"); non-neurotics, on the other hand, tend to be less
concrete and use less precise specification of objects or events (determiners and adjectives such as "a"
or "little"), and to show more concern with how things are or should be done (via prepositions such as "by").
In fact, classifiers learned using only the ten style features shown in Table 6 give classification
accuracy of 63.6%. More surprisingly, although the results in Table 5 indicate that content words
overall are useless for classifying texts by neuroticism, using as features the ten most informative
content features (those in Table 6) gives an accuracy of 68.2%. Apparently, the vast majority of content
is irrelevant to this classification problem and masks a small number of features involving worry about
personal problems (neurotics) and relaxation activities (non-neurotics) that are quite useful for this task.
7. Finding a needle-in-a-haystack
Consider now the scenario where we seek to determine the specific identity of a document's author, but
there are many thousands of potential candidates. We call this the needle-in-a-haystack attribution
problem. In this case, standard text-classification techniques are unlikely to give reasonable accuracy,
and may require excessive computation time to learn classification models. But we will show in this
section that if we are willing to tolerate our system telling us it doesn't know the answer, we can
achieve high accuracy for the cases where the system does give us an attribution it considers reliable.
The blogosphere forms a convenient testbed for this problem, as it provides us with text written
by an essentially unlimited number of authors. For this study we use the blog corpus described above
in Section 6, choosing the 20,000 longest blogs in our initial set. We took 10,000 blogs randomly to
create a test set of “snippets”, each snippet comprising enough of the most recent posts of a blog to
total at least 500 words; the remainder of that blog is termed the author’s “known work”. The other
10,000 blogs are held out for training purposes as is described below.
The goal is then to determine, for each snippet, to which of the 10,000 blogs it belongs, by
comparison to the various known works. We address the problem independently for each of the
snippets in the test set, i.e., we do not make use of the fact that there is a one-to-one correspondence between snippets and authors.
7.1 A Similarity-Based Approach
Learning classification models for a 10,000-class problem with thousands of features is impractical.
So, as a first approximation, we apply the standard information retrieval technique in which we define
some distance measure over meaningful textual features and attribute each snippet to the closest blog in
that feature space. Related approaches to authorship problems have been considered by Novak et al.
(2004) and Abbasi & Chen (2008).
We represent each text in four different ways: three varieties of tf-idf representations based on
the 1000 most frequent content features in the text, and a fourth tf-idf representation based on
style features. For each of these representation methods, we use the standard cosine measure (Salton &
Buckley 1988) to quantify the similarity of each author's known work with a given snippet. The
various authors can then be ranked according to the similarity between their known works and the
snippet under consideration, with the hope being that the highest-ranked author is the author of the
snippet. The idea is that some distinctive feature or features might render the snippet particularly similar to its actual author's known work.
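A sketch of this first-pass ranking, with a handful of placeholder blogs standing in for the 10,000 known works and a single tf-idf representation standing in for the four representations described above:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Placeholder known works (in the experiment there are 10,000 of these).
    known_works = {"blog_001": "concatenated posts of the first blogger",
                   "blog_002": "concatenated posts of the second blogger",
                   "blog_003": "concatenated posts of the third blogger"}
    snippet = "the snippet of recent posts to be attributed"

    authors = list(known_works)
    vec = TfidfVectorizer(max_features=1000)
    X = vec.fit_transform([known_works[a] for a in authors])
    q = vec.transform([snippet])

    sims = cosine_similarity(q, X).ravel()
    ranking = sorted(zip(authors, sims), key=lambda pair: -pair[1])
    print(ranking)   # the top-ranked author is the tentative attribution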
This simple approach to the problem actually works surprisingly well. The three content
representations assign the snippet to the actual author between 52% and 56% of the time, while the
style representation lags behind with only 6% of snippets assigned to the actual author. 64% of the snippets are most similar to their actual author's known works in at least one of the four representation schemes.
7.2 Meta-Learning
While 56% may seem to be quite a high level of accuracy, given the large number of candidates and the
simplicity of the method, it is also quite useless in the sense that we are still unable to confidently assert
that a given snippet was written by a given author; after all, the system is still wrong almost half of the
time. Thus we would like to automatically determine which attributions by which representation
schemes have a high likelihood of being correct; when none of them do, the system will report that
results are inconclusive. The goal is to return specific attributions as often as possible, while ensuring that the attributions that are returned are very likely to be correct.
To accomplish this, we apply a meta-learning scheme, using the holdout set of 10,000 blogs
(those not included in the test set), set aside for this purpose. We consider each pair consisting of a
snippet and an author ranked most similar to that snippet for at least one representation method, in a
given blog set (holdout or test). We call the pair a successful pair if the candidate author is in fact the
actual author. The pairs over the holdout set are used as training to learn a model that distinguishes
successful pairs from unsuccessful pairs. Each example (pair) is represented in terms of a set of meta-
features reflecting, for each representation, the similarity of the author to the snippet, both absolutely
and relative to other authors, and the author’s rank in similarity relative to other authors.
A linear SVM is used for each representation method to learn a “meta-model” that decides
whether a given pair is reliable or not. To do this, we use the meta-model to compute a reliability
score, which is a monotonic function with range [0,1] of the distance of the pair's representation from the meta-model's separating hyperplane.
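The following sketch shows the shape of this computation, with invented meta-feature vectors and labels standing in for the holdout pairs; the particular squashing function (a logistic of the SVM margin) is one simple choice of monotonic map into [0,1], not necessarily the one used in the original experiments.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Invented meta-features per (snippet, top-ranked author) pair, e.g.:
    # [top similarity, gap to second-ranked author, mean similarity, rank].
    holdout_meta = np.array([[0.41, 0.20, 0.05, 1.0],
                             [0.12, 0.01, 0.04, 1.0],
                             [0.35, 0.15, 0.06, 1.0],
                             [0.10, 0.00, 0.05, 2.0]])
    holdout_success = np.array([1, 0, 1, 0])   # was the top-ranked author correct?

    meta_model = LinearSVC().fit(holdout_meta, holdout_success)

    def reliability(meta_features):
        margin = meta_model.decision_function([meta_features])[0]
        return 1.0 / (1.0 + np.exp(-margin))   # monotonic map of the margin into [0, 1]

    score = reliability([0.38, 0.18, 0.05, 1.0])
    print("attribute" if score > 0.8 else "Don't Know", round(float(score), 3))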
Given reliability scores for each of the representations, the system chooses the attribution of the representation with the highest reliability score, provided that this score exceeds some fixed threshold. Otherwise, the output is Don't Know. Varying this threshold will change the number of attributions
made and the accuracy of those attributions. This enables us to plot recall/precision curves (Figure 1,
upper curve), where recall is defined as the fraction of possible attributions (number of authors
represented by snippets and by known works, in this case 10,000) that were correctly attributed, and
precision is defined as the fraction of attempted attributions that were in fact correct. Note that, for
example, we can achieve recall of 40% with precision of 87%, but if we can settle for recall of 30%, we can achieve still higher precision.
To test the sensitivity of these results for snippet length, we ran the experiment for snippets
limited to 200 words. In this case (see Figure 1, lower curve), at a recall level of 30% we achieve noticeably lower precision than with the longer snippets.
Thus, provided we are willing to live with the response Don't Know in a number of cases, we can achieve reasonably reliable authorship attribution even where the number of candidate authors numbers in the many thousands.
In the real world, however, we cannot assume that the author of a questioned text will in fact be
contained in our candidate set, even if that set is very large. To evaluate the performance of our
method in such a scenario, we randomly discarded 5,000 of the known works from the candidate set,
and evaluated performance on the original 10,000 snippets. Now, half of our test cases ought to result
in an output of Don’t Know in the best case, since their actual authors are not in the candidate set. The
precision/recall curve for this case is the lower curve in Figure 2 (note that recall here is defined as the
fraction of the 5,000 possible attributions that are correctly made) shown along with the original 600-
word curve for comparison. In this case, at a recall level of 30% we achieve precision of 81% and at
recall of 40% we get precision of 72%. Clearly, performance is noticeably degraded relative to the case
where all snippets have authors in the candidate set, though useful accuracy levels are still attainable.
We should note that as the number of alternative candidates becomes much smaller, the
problem might, somewhat counter-intuitively, become more difficult. This is because our method
implicitly leverages the fact that if a document is much more similar to one author’s writing than to
those of all others, it is very likely the document was written by that author. As the number of
alternative authors decreases, the reliability of such a conclusion will similarly decrease. Thus, in the
extreme case of authorship verification, where we are faced with a single candidate author, we need an entirely different approach; this is the subject of the next section.
FIGURE 2 ABOUT HERE
8. Authorship Verification
Consider the case in which we are given examples of the writing of a single author and are asked to
verify that a given target text was or was not written by this author. As a categorization problem,
verification is significantly more difficult than basic attribution and virtually no work has been done on
it (but see van Halteren (2004)), outside the framework of plagiarism detection (Clough 2000; Meyer
zu Eissen et al. 2007). If, for example, all we wished to do is to determine if a text had been written by
Shakespeare or by Marlowe, it would be sufficient to use their respective known writings, to construct
a model distinguishing them, and to test the unknown text against the model. If, on the other hand, we
need to determine if a text was written by Shakespeare or not, it is difficult to assemble a representative sample of the writings of everybody other than Shakespeare.
The situation in which we suspect that a given author may have written some text but do not
have an exhaustive list of alternative candidates is a common one. The problem is complicated by the
fact that a single author may vary his or her style from text to text or may unconsciously drift
stylistically over time, not to mention the possibility of conscious deception. Thus we must learn to
somehow distinguish between relatively shallow differences that reflect conscious or unconscious
changes in an author’s style and deeper differences that reflect styles of different authors.
Verification can be thought of as a one-class classification problem (Manevitz & Yousef 2001,
Scholkopf et al. 2001, Tax 2001). But, perhaps, a better way to think about authorship verification is
that we are given two example sets and are asked whether these sets were generated by the same
process (author) or by two different processes. This section, drawn from Koppel et al. (2007), describes
a method for assessing the depth of difference between two example sets, which may have far-reaching consequences for determining the reliability of classification models. The idea is to test the
extent to which the accuracy of learned models degrades as the most distinguishing features are iteratively removed.
This method provides a robust solution to the authorship verification problem that is
independent of language, period and genre and has already been used to settle at least one outstanding
literary attribution problem (Koppel and Schler 2004; Koppel et al. 2007).
Let us begin by considering two naive approaches to the problem. Although neither of them will prove adequate, each is instructive.
One possibility that suggests itself is what we will call the “impostors” method: assemble a
representative collection of works by other authors and use a two-class learner, such as SVM, to
learn a model for A vs. not-A. Then chunk the mystery work X and run the chunks through the learned
model. If the preponderance of chunks of X are classed as A, then X is deemed to have been written by A.
This method is straightforward but it suffers from a conceptual flaw. While it is indeed
reasonable to conclude that A is not the author if most chunks are attributed to not-A, the converse is
not true. Any author who is neither A nor represented in the sample not-A, but who happens to have a
style more similar to A than to not-A, will be falsely determined by this method to be A. Despite this
flaw, we will see later that this approach can be used to augment other methods.
Another approach, which does not depend on negative examples, is to learn a model for A vs. X
and assess the extent of the difference between A and X by evaluating generalization accuracy by
cross-validation. If cross-validation accuracy is high, then conclude that A did not write X; if cross-
validation accuracy is low, i.e., we fail to correctly classify test examples better than chance, conclude
that A did write X. This intuitive method does not actually work well at all.
Let us consider exactly why the last method fails, by examining a real-world example. Suppose
we are given known works by three of the authors considered in Section 4, Herman Melville, James
Fenimore Cooper and Nathaniel Hawthorne. For each of the three authors, we are asked if that author
was or was not also the author of The House of the Seven Gables (henceforth: Gables). Using the method
just described and using a feature set consisting of the 250 most frequently used words in A and X, we
find that we can distinguish Gables from the works of each author with cross-validation accuracy of
above 98%. If we were to conclude, therefore, that none of these authors wrote Gables, we would be wrong, since Hawthorne did in fact write it.
If we look closely at the models that successfully distinguish Gables from Hawthorne’s other work (in
this case, The Scarlet Letter), we find that only a small number of features are doing all the work of
distinguishing between them. These features include he (more frequent in The Scarlet Letter) and she
(more frequent in Gables). The situation in which an author will use a small number of features in a
consistently different way between works is typical. These differences might result from thematic
differences between the works, from differences in genre or purpose, from chronological stylistic drift, or from deliberate variation on the part of the author.
Our main point is to show how this problem can be overcome by determining not only if A is
distinguishable from X but also how great the depth of difference between A and X is. To do this we
use a technique we call “unmasking”. The idea is to remove, by stages, those features that are most
useful for distinguishing between A and X and to gauge the speed with which cross-validation accuracy
degrades as more features are removed. Our main hypothesis is that if A and X are by the same author,
then whatever differences there are between them will be reflected in only a relatively small number of
features, despite possible differences in theme, genre and the like.
In Figure 3, we show the result of unmasking when comparing Gables to known works of
Melville, Cooper and Hawthorne. This graph illustrates our hypothesis: when comparing Gables to
works by other authors the degradation as we remove distinguishing features from consideration is
slow and smooth but when comparing it to another work by Hawthorne, the degradation is sudden and
dramatic. Once a relatively small number of distinguishing markers are removed, the two works by Hawthorne become nearly indistinguishable.
This phenomenon is actually quite general, as we will show below. As we will also see, the
suddenness of the degradation can be quantified in a fashion optimal for this task. Thus by taking into
account the depth of difference between two works, we can determine if they were authored by the same person.
We use as our corpus the collection of classic nineteenth and early twentieth century books considered
in Section 4 above. To break up the two-books-per-author pattern in the corpus, we add to the corpus one additional work by Melville and one by Hawthorne, as well as a work by Emily Bronte, who has no other works in the corpus.
Our objective is to run 209 independent authorship verification experiments representing all
possible author/book pairs (21 books * 10 authors, but excluding just the pair Emily Bronte/Wuthering Heights, since in that case no known text by the candidate author would remain).
As above, we partitioned each book into approximately equal-length sections of at least 500
words without breaking up paragraphs. For each author A and each book X, let AX consist of all the
works by A in the corpus unless X is in fact written by A, in which case AX consists of all works by A
except X. Our objective is to assign to each pair <AX,X> the value same-author if X is by A and the value different-author otherwise.
Now let us introduce the details of our new method based on our observations above regarding iterative
elimination of features. We choose as an initial feature set the n words with highest average frequency
in AX and X (that is, the average of the frequency in AX and the frequency in X, giving equal weight
to AX and X). Note that our objective here is not to maximize accuracy, but rather to measure the
degradation of accuracy; thus, it is enough to choose a simple feature set rather than the best possible one.
Using an SVM with linear kernel, we run the following unmasking scheme (sketched in code below):
1. Determine the accuracy of a model learned to distinguish AX from X. (If one of the sets, AX or X, includes more chunks than the other, we randomly discard the surplus. Accuracy results are the average of five runs of ten-fold cross-validation.)
2. For the model obtained in each fold, eliminate the k most strongly weighted positive features and the k most strongly weighted negative features.
3. Go to step 1, repeating for m iterations in all.
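The sketch below implements the scheme with scikit-learn on placeholder chunk vectors; for simplicity it refits one model per iteration on all chunks (rather than per fold) before removing the k strongest features in each direction, so it should be read as an approximation of the procedure above rather than the original code.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def unmasking_curve(X_a, X_x, k=3, iterations=10, folds=10):
        """X_a, X_x: arrays of chunk feature vectors (relative word frequencies)."""
        X = np.vstack([X_a, X_x])
        y = np.array([0] * len(X_a) + [1] * len(X_x))
        active = np.arange(X.shape[1])           # features still in play
        curve = []
        for _ in range(iterations):
            acc = cross_val_score(LinearSVC(), X[:, active], y, cv=folds).mean()
            curve.append(float(acc))
            clf = LinearSVC().fit(X[:, active], y)
            w = clf.coef_.ravel()
            drop = np.concatenate([np.argsort(w)[-k:], np.argsort(w)[:k]])
            active = np.delete(active, drop)     # remove strongest features, both signs
        return curve                             # same-author pairs degrade fast and deep

    # Placeholder data: 30 chunks per side, 250 word-frequency features.
    rng = np.random.default_rng(0)
    print(unmasking_curve(rng.random((30, 250)), rng.random((30, 250))))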
In this way, we construct degradation curves for each pair <AX,X>. In Figure 4, we show such
curves (using n=250 and k=3) for An Ideal Husband against each of ten authors, including Oscar
Wilde.
FIGURE 4 ABOUT HERE
We wish now to quantify the difference between same-author curves and different-author curves. To
do so, we first represent each curve as a numerical vector in terms of its essential features. These features include, for example, the accuracy after each iteration and the size of the accuracy drops between successive iterations.
We sort these vectors into two subsets: those in which AX and X are by the same author and those in
which AX and X are by different authors. We then apply a meta-learning scheme in which we use
learners to determine what role to assign to various features of the curves. (Note that although we have
20 same-author pairs, we really only have 13 distinct same-author curves, since for authors with
exactly two works in our corpus, the comparison of AX with X is identical for each of the two books.)
In order to assess the accuracy of the method, we use the following cross-validation
methodology. For each book B in our corpus, we run a trial in which B is completely eliminated from
consideration. We use unmasking to construct curves for all author/book pairs <AX,X> (where B does
not appear in AX and is not X) and then we use a linear SVM to meta-learn to distinguish same-author
curves from different-author curves. Then, for each author A in the corpus, we use unmasking to
construct a curve for the pair <AB,B> and use the meta-learned model to determine if the curve is a same-author curve or a different-author curve.
Using this testing protocol, we obtain the following results: All but one (Pygmalion by Shaw) of
the twenty same-author pairs are correctly classified. In addition, 181 of 189 different-author pairs are
correctly classified. Among the exceptions are the attributions of The Professor by Charlotte Bronte to
each of her sisters. Thus, we obtain overall accuracy of 95.7%, with error rates nearly the same for same-author and different-author pairs. (It should be noted that some of the 8 misclassified
different-author pairs result in a single book being attributed to two authors, which is obviously impossible.)
Note that the algorithm includes three parameters: n, the size of the initial feature set; k, the
number of eliminated features from each extreme in each iteration; m, the number of iterations we
consider. The results reported above are based on experiments using n=250, k=3, and m=10. We chose
n=250 because experimentation indicated that this was a reasonable rough boundary between common
words and words tightly tied to a particular work. In Koppel et al. (2007), it is shown that results are
somewhat robust with regard to the choice of k and m (in fact, some parameter choices turn out to be better
than those shown here), but recall on same-author pairs degrades considerably as the size of the
initial feature set increases. Moreover, parameter settings that proved successful on the English
literature corpus considered here also proved successful on a corpus of Hebrew legal writings, thus suggesting that these settings are not overly sensitive to language or genre.
It is further shown in Koppel et al. (2007) that unmasking can be augmented by exploiting
known negative examples, using the “impostors” method described above; the augmented method
correctly classifies all 189 different-author pairs and 18 of the 20 same-author pairs. Finally, one limitation
of unmasking that should be noted is that it requires a large amount of training text (Sanderson and
Guenter 2006); preliminary tests suggest that the minimum is in the area of 5,000 to 10,000 words.
In Figure 5, we summarize the entire algorithm (including the optional augmentation using
negative examples).
9. Conclusions
We have surveyed the variety of feature types and categorization methods that have been proposed in
the past for authorship attribution. These methods range from early attempts to find individual
statistical markers that could serve as authorial fingerprints, through multivariate methods of varying
degrees of sophistication, and ultimately to text categorization methods rooted in machine learning. We
conclude that two of the most sophisticated machine learning methods, SVM and Bayesian regression,
used in conjunction with word classes derived from systemic functional linguistics or with character n-
grams, offer easily scalable, efficient and effective solutions to the ordinary authorship attribution
problem, assuming proper methodological controls for text genre and the like.
Since many realistic authorship problems do not fit the standard attribution paradigm, we
consider also three variations that are likely to arise in practice. For the profiling problem, where no
individual candidates are known, we find that we can identify, with varying degrees of accuracy, an
author’s gender, age, native language and personality type. For the needle-in-a-haystack problem,
where there are possibly many thousands of candidate authors, we find that information retrieval
methods can be used to identify the correct author, even of very short texts, with high accuracy for a
considerable fraction of cases. These cases can be isolated using meta-learning methods that take
into account the degree to which a single author is more likely than any of the other candidates to be
the actual author. Finally, for the verification problem, where we need to determine whether a given author
wrote a given text, we find that our unmasking technique is highly effective at identifying actual
authors, though it is limited to cases in which the texts in question are sufficiently long.
References
Abbasi, A., and Chen, H. (2005), Applying authorship analysis to extremist-group Web forum messages, IEEE
Intelligent Systems,
Abbasi, A. and Chen, H. 2008. Writeprints: A stylometric approach to identity-level identification and similarity
detection. ACM Transactions on Information Systems (26:2), no. 7.
Argamon, S. (2008), Interpreting Burrows’s Delta: Geometric and probabilistic foundations, Literary and
Linguistic Computing, in press.
Argamon, S. and S. Levitan (2005), Measuring the usefulness of function words for authorship attribution. In
the Proceedings of the ACH/ALLC Conference, Victoria, BC, Canada, June 2005.
Argamon, S., Whitelaw, C., Chase, P., Hota, S., Garg, N., Levitan, S. (2007), Stylistic text classification using
functional lexical features, Journal of the American Society for Information Science and Technology 58(6),
802-821.
Argamon, S., Koppel, M., Fine, J. and Shimoni, A. (2003), Gender, Genre, and Writing Style in Formal Written
Texts, Text 23(3), August 2003.
Argamon, S., Koppel, M., Pennebaker, J. and Schler, J. (2008), Automatically Profiling the Author of an
Anonymous Text, Communications of the ACM, in press.
Argamon-Engelson, S., Koppel, M., Avneri, G. (1998), Style-based text categorization: What newspaper am I
reading?, in Proc. of AAAI Workshop on Learning for Text Categorization, 1998, pp. 1-4
Baayen, H., van Halteren, H., Neijt, A., Tweedie, F. (2002), An Experiment in Authorship Attribution, Journées
internationales d'Analyse statistique des Données Textuelles 6.
Baayen, H., Van Halteren, H. and Tweedie, F.J. (1996), Outside the Cave of Shadows: Using Syntactic
Annotation to Enhance Authorship Attribution, Literary and Linguistic Computing, 11, 121-131.
Benedetto, D., Caglioti, E. and Loreto, V. (2002), Language Trees and Zipping, Phys. Rev. Lett. 88(4), 487-490
Binongo, J.N.G. and Smith, M.W.A. (1999), The application of principal component analysis to stylometry, Lit
Linguist Computing 14: 445-466
Binongo, J. N. G. (2003), Who wrote the 15th Book of Oz? An application of multivariate analysis to
authorship attribution. Chance 16(2), pp. 9-17.
Brill, E. (1992), A simple rule-based part-of-speech tagger, Proceedings of 3rd Conference on Applied Natural
Language Processing, pp. 152-155
Brinegar, C. S. (1963), "Mark Twain and the Quintus Curtius Snodgrass Letters: A Statistical Test of
Authorship," Journal of the American Statistical Association 58, pp. 85–96.
Burger J. and Henderson, J. (2006), An exploration of features for predicting blogger age. In AAAI Spring
Symposium on Computational Approaches to Analyzing Weblogs.
Burrows, J. F. (1992), Computers and the study of literature. In C. Butler, editor, Computers and Written Text,
Applied Language Studies, pages 167-204. Blackwell, Oxford.
Burrows, J.F. (1987), "Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style", Literary
and Linguistic Computing, 2, 61-70.
Burrows, J.F. (1989), ‘An ocean where each kind..’: Statistical analysis and some major determinants of literary
style, Computers and the Humanities 23(4), 309-321.
Burrows, J.F. (1992), Not Unless You Ask Nicely: The Interpretative Nexus Between Analysis and Information,
Literary and Linguistic Computing 1992 7(2):91-109
Burrows, J.F. (2002a), Delta: a measure of stylistic difference and a guide to likely authorship, Liter-
ary and Linguistic Computing 17, pp. 267–287
Burrows, J.F. (2002b), The Englishing of Juvenal: Computational stylistics and translated texts,” Style 36, pp.
677–699.
Burrows, J. (2007), All the Way Through: Testing for Authorship in Different Frequency Strata, Literary and
Linguistic Computing 21, pp. 27-47
Chambers, J. K., P. Trudgill, and N. Schilling-Estes (2004), The Handbook of Language Variation and Change.
Blackwell, Oxford.
Chang, C.C. and Lin, C. (2001), LIBSVM: a Library for Support Vector Machines (Version 2.3)
Chaski, C. (2005), “Who’s at the keyboard: Authorship attribution in digital evidence investigations,”
International Journal of Digital Evidence, vol. 4, no. 1, 2005
Chaski, C. (2007), "Multilingual Forensic Author Identification through N-Gram Analysis", Presented at the 8th
Biennial Conference on Forensic Linguistics/Language and Law, July 2007, Seattle, WA.
Chung, C.K. & Pennebaker, J.W. (2007). The psychological function of function words. In K. Fiedler (Ed.),
Social communication: Frontiers of social psychology (pp. 343-359). New York: Psychology Press.
Clement, R. and Sharp, D. (2003). Ngram and Bayesian classification of documents. Literary and Linguistic
Computing, 18: 423-47.
Clough, P. (2000), Plagiarism in natural and programming languages: an overview of current tools and
technologies, Research Memoranda: CS-00-05, Department of Computer Science, University of Sheffield, UK.
Corney, M., de Vel, O., Anderson, A. and Mohay, G. (2002), "Gender-Preferential Text Mining of E-mail
Discourse", in Proc. of 18th Annual Computer Security Applications Conference
Craig, H. (1999), Authorial attribution and computational stylistics: if you can tell authors apart, have you
learned anything about them?, Lit Linguist Computing 14: 103-113
Damashek, M. (1995), Gauging similarity with n-grams: language independent categorization of text. Science,
267(5199), 843--848.
De Vel, O., Anderson, A., Corney, M., Mohay, G. M. (2001), Mining e-mail content for author identification
forensics. SIGMOD Record 30(4), pp. 55-64
De Vel, O., M. Corney, A. Anderson and G. Mohay (2002), E-mail Authorship Attribution for Computer
Forensics, in Applications of Data Mining in Computer Security, Barbará, D. and Jajodia, S. (eds.), Kluwer.
Diederich, J., Kindermann, J., Leopold, E. and Paass, G. (2003), Authorship Attribution with Support Vector
Machines, Applied Intelligence 19(1), pp. 109-123
Dumais, S., J. Platt, J., Heckerman, D. and Sahami, M. (1998). Inductive learning algorithms and
representations for text categorization, Proceedings of ACM-CIKM98, 148-155
Forman, G. (2003), An extensive empirical study of feature selection metrics for text classification. Journal of
Machine Learning Research 3(1), pages 1289-1305.
Forsyth, R.S. and Holmes, D.I. (1996), Feature-Finding for Text Classification, Literary and Linguistic
Computing, 11(4).
Foster, D. (2000), Author Unknown: On the Trail of Anonymous, New York: Henry Holt, 2000.
Fucks, W. (1952). On the mathematical analysis of style. Biometrika 39, pp. 122-129.
Gamon, M. (2004), Linguistic correlates of style: authorship classification with deep linguistic analysis features.
In Proc. 20th Int. Conf. Computational Linguistics (COLING), pages 611–617, Geneva.
Genkin, A., Lewis, D. and Madigan, D. (2006), Large-scale Bayesian logistic regression for text categorization,
Technometrics
Graham, N., Hirst, G. and Marthi, B. (2005), Segmenting documents by stylistic character. Natural Language
Engineering, 11(4), December 2005, 397-415.
Granger, S., Dagneaux, E., Meunier, F. (2002), The International Corpus of Learner English. Handbook and
CD-ROM. Louvain-la-Neuve: Presses Universitaires de Louvain
Grieve, J. (2007), Quantitative Authorship Attribution: An Evaluation of Techniques, Literary and Linguistic
Computing 22(3):251-270.
Hirst, G. and Feiguina, O. (2007), Bigrams of syntactic labels for authorship discrimination of short texts,
Literary and Linguistic Computing, 22(4), pp. 405-417.
Holmes, D. (1998), The evolution of stylometry in humanities scholarship, Literary and Linguistic Computing,
13, 3, 1998, pp. 111-117.
Holmes, D. I., Gordon, L., Wilson, C. (2001a), A Widow and her Soldier: Stylometry and the American Civil
War, Literary and Linguistic Computing 16(4), pp. 403-420
Holmes, D. I., Robertson, M., Paez, R. (2001b), Stephen Crane and the New-York Tribune: A case study in
traditional and non-traditional authorship attribution. Computers and the Humanities 35(3) pp. 315-331.
Holmes, D. and Forsyth, R. (1995), The Federalist revisited: New directions in authorship attribution, Literary
and Linguistic Computing, pp. 111--127.
Honore (1979), Some Simple Measures of Richness of Vocabulary, Association for Literary and Linguistic
Computing Bulletin 7(2), pp. 172-177.
Hoorn, J., Frank, S., Kowalczyk, W., van der Ham, F. (1999), Neural network identification of poets using letter
sequences. Literary and Linguistic Computing, 14(3) pp. 311-338.
Hoover, D. L. (2002), Frequent Word Sequences and Statistical Stylistics, Literary and Linguistic Computing
17: 157–180
Hoover, D. L. (2003a), Frequent Collocations and Authorial Style, Literary and Linguistic Computing 18: 261–
286.
Hoover, D. L. (2003b), Multivariate Analysis and the Study of Style Variation, Literary and Linguistic
Computing 18: 341–360.
Hoover, D. L. (2003c), Another Perspective on Vocabulary Richness, Computers and the Humanities 37: 151–
178.
Hoover, D. (2004a), Testing Burrows’s Delta, Literary and Linguistic Computing 19(4), pp. 453–475,
Hoover, D. (2004b), Delta prime?, Literary and Linguistic Computing 19(4), pp. 477–495.
Houvardas, J. and E. Stamatatos (2006), N-gram feature selection for authorship identification, in Proc. of the
12th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications, pp. 77-86
Joachims, T. (1998), Text categorization with support vector machines: learning with many relevant features. In
Proc. 10th European Conference on Machine Learning ECML-98, pp. 137-142
John, O. P., (1990) , The “Big Five” factor taxonomy: Dimensions of personality in the natural language and in
questionnaires, in John, O. P. and L. A. Pervin, eds., Handbook of Personality: Theory and Research, Guilford
Press, pp. 66-100.
Juola, P. (1998). Cross-entropy and linguistic typology. In Proceedings of New Methods in Language
Processing 3. Sydney, Australia.
Juola, P. (2008), Authorship Attribution, Foundations and Trends in Information Retrieval, in press.
Karlgren, J. and Cutting, D. (1994), Recognizing text genres with simple metrics using discriminant analysis. In
Proceedings of the 15th Conference on Computational Linguistics, Kyoto, Japan, August 1994, pp. 1071-1075.
Keselj, V., Peng, F., Cercone, N., Thomas, C. (2003), N- Gram-Based Author Profiles for Authorship
Attribution. In proceeding of PACLING'03, Halifax, Canada pp. 255-264.
Kessler, B., G. Nunberg, and H. Schütze (1997), Automatic detection of genre. In Proc. 35th Annual Meeting
of the Association for Computational Linguistics and the 8th Meeting of the European Chapter of the
Association for Computational Linguistics, pp. 32-38.
Khmelev, D.V. (2001), Disputed Authorship Resolution through Using Relative Empirical Entropy for Markov
Chains of Letters in Human Language Text, Journal of Quantitative Linguistics, 7(3), 201-207.
Khmelev D. V., Teahan W. J. (2003), A repetition based measure for verification of text collections and for text
categorization, Proceedings of the 26th SIGIR conference, pp. 104-110
Khmelev, D. V., Tweedie, F. J. (2002), Using Markov chains for identification of writers. Literary and
Linguistic Computing, 16(4) pp. 299-307.
Kjell, B. (1994a), Authorship attribution of text samples using neural networks and Bayesian classifiers. In
IEEE International Conference on Systems, Man and Cybernetics, San Antonio, TX.
Kjell, B. (1994b), Authorship determination using letter pair frequencies with neural network classifiers.
Literary and Linguistic Computing, 9(2) pp. 119-124.
Kjell, B., Woods, W. A., Frieder, O. (1995), Information retrieval using letter tuples with neural network and
nearest neighbor classifiers. In IEEE International Conference on Systems, Man and Cybernetics, volume 2, pp.
1222-1225, Vancouver, BC.
Koppel, M., Akiva, N. and Dagan, I. (2006a), Feature Instability as a Criterion for Selecting Potential Style
Markers, Journal of the American Society for Information Science and Technology 57(11), pp. 1519-1525.
Koppel, M., Argamon, S. Shimoni, A. (2002), Automatically categorizing written texts by author gender,
Literary and Linguistic Computing 17(4), pp. 401-412
Koppel, M., Mughaz, D. and Akiva, N. (2006b), New Methods for Attribution of Rabbinic Literature, Hebrew
Linguistics: A Journal for Hebrew Descriptive, Computational and Applied Linguistics 57, Jan. 2006, pp. 5-18.
Koppel, M. and Schler, J. (2003), Exploiting Stylistic Idiosyncrasies for Authorship Attribution, in Proceedings
of IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, pp. 69-72.
Koppel, M. and Schler, J. (2004), Authorship Verification as a One Class Classification Problem, in
Proceedings of ECML, Banff, Canada
Koppel M., Schler J. and Zigdon K. (2005), Determining an Author’s Native Language by Mining a Text for
Errors, Proceedings of KDD ’05, Chicago IL.
Koppel, M., Schler, J., Argamon, S. and Messeri,E. (2006c). Authorship Attribution with Thousands of
Candidate Authors, in Proc. 29th ACM SIGIR Conference on Research & Development on Information
Retrieval.
Koppel, M., Schler, J. and Bonchek-Dokow, E. (2007), Measuring Differentiability: Unmasking Pseudonymous
Authors, JMLR 8, pp. 1261-1276
Kukushkina, O. V., Polikarpov, A. A., and Khmelev, D. V. (2001), Using Literal and Grammatical Statistics for
Authorship Attribution, Probl. Inf. Transm. 37, 2 (Apr. 2001), 172-184
Ledger, G. and Merriam, T. (1994), Shakespeare, Fletcher, and the Two Noble Kinsmen, Lit Linguist
Computing 9: 235-248
Lewis, D.D. and Ringuette, M. (1994), Comparison of two learning algorithms for text categorization. In
Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR 94).
Li, J., Zheng, R., and Chen, H. 2006. From fingerprint to writeprint. Communications of the ACM (49:4), pp.
76-82.
Littlestone, N. (1988), Learning quickly when irrelevant attributes abound: A new linear threshold algorithm.
Machine Learning 2(4), pp. 285-318.
Lowe, D. and Matthews, R.(1995), Shakespeare vs. Fletcher: A stylometric analysis by Radial Basis Functions.
Computers and the Humanities, 29 pp. 449-461.
Madigan, D., Genkin, A., Lewis, D.D., Argamon, S., Fradkin, D. & Ye, L. (2006), Author Identification on the
Large Scale, Proc. of Classification Society of N. America, 2005
Manevitz, L. M., Yousef, M. (2001), One-class SVMs for document classification. Journal of Machine Learning
Research 2, pp. 139-154.
Martindale, C. and McKenzie, D. (1995), On the Utility of Content Analysis in Author Attribution: The 'Federalist',
Computers and the Humanities, 29, 259-270.
Marton, Y. Wu, N. and Hellerstein, L (2005), On compression-based text classification, in Proceedings of the
27th European Conference on IR Research, pp. 300--314
Mascol, C. (1888a), Curves of Pauline and pseudo-Pauline style I. Unitarian Review, 30, 452-460.
Mascol, C. (1888b), Curves of Pauline and pseudo-Pauline style II. Unitarian Review, 30, 539-546.
Matthews, R., Merriam, T. (1993), Neural computation in stylometry : An application to the works of
Shakespeare and Fletcher. Literary and Linguistic Computing, 8(4), pp. 203-209.
McEnery, A. and Oakes, M. (2000), Authorship studies/textual statistics, in R. Dale, H. Moisl, H. Somers eds.,
Handbook of Natural Language Processing (Marcel Dekker, 2000).
Mealand, D.L. (1995), Correspondence Analysis of Luke, Lit Linguist Computing 10: 171-182
Merriam, T. (1996), Marlowe's hand in Edward III revisited. Literary and Linguistic Computing, 11(1) pp. 19-
22.
Merriam, T. and Matthews, R. (1994), Neural computation in stylometry II: An application to the works of
Shakespeare and Marlowe. Literary and Linguistic Computing 9, pp. 1-6.
Meyer zu Eissen, S., Stein, B. and Kulig, M. (2007), Plagiarism detection without reference collections, in R.
Decker and H. J. Lenz (eds.), Advances in Data Analysis, pages 359-366
Morton, A.Q. (1965), The Authorship of Greek Prose, Journal of the Royal Statistical Society (A), 128, 169-233.
Mosteller, F., Wallace, D. L. (1964), Inference and Disputed Authorship: The Federalist. Reading, Mass.
Addison Wesley.
Novak, J., Raghavan, P., and Tomkins, A. 2004. Anti-aliasing on the web. In Proceedings of the 13th
International World Wide Web Conference, pp. 30-39
O’Donnell, B. (1966). Stephen Crane’s The O’Ruddy: A Problem In Authorship Discrimination. In Leed (ed.),
The Computer and Literary Style. Kent, OH: Kent State University Press, pp. 107–15.
Pavelec, D., Justino, E., and Oliveira, L. S. 2007. Author identification using stylometric features. Inteligencia
Artificial (11:36), pp. 59-65.
Peng, F., Schuurmans, D., Wang, S. (2004), Augmenting Naive Bayes Text Classifier with Statistical
Language Models, Information Retrieval, 7(3-4), pp. 317-345.
Pennebaker, J. W. and King, L. A. (1999), Linguistic Styles: Language Use as an Individual Difference. Journal
of Personality and Social Psychology, 77 (6) pp. 1296-1312.
Pennebaker, J.W., Mehl, M.R., & Niederhoffer, K. (2003). Psychological aspects of natural language use: Our
words, our selves. Annual Review of Psychology, 54, pp. 547-577.
Platt, J. (1998), Sequential minimal optimization: A fast algorithm for training support vector machines. In
Microsoft Research Technical Report MSR-TR-98-14.
Quinlan, J.R. (1986), Induction of Decision Trees. Machine Learning 1(1), 81-106.
Rudman, J. (1997), The State of Authorship Attribution Studies: Some Problems and Solutions, Computers and
the Humanities 31(4), pp. 351-365.
Salton, G. and Buckley, C. (1988), Term-weighting approaches in automatic text retrieval, Information
Processing and Management: an International Journal 24(5), pp. 513-523.
Sanderson, C. and Guenter, S. (2006), Short Text Authorship Attribution via Sequence Kernels, Markov Chains
and Author Unmasking: An Investigation, in Int’l Conference on Empirical Methods in Natural Language
Processing, pp. 482-491
Schler, J. (2007), Authorship attribution in the absence of a closed candidate set, Ph.D. Dissertation, Dept. of
Computer Science, Bar-Ilan University, 2007.
Schler, J., Koppel, M., Argamon, S. and Pennebaker, J. (2006), Effects of Age and Gender on Blogging. In
proceedings of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., Williamson, R. C. (2001), Estimating the support of a
high-dimensional distribution. Neural Computation, 13, pp. 1443-1471.
Sebastiani, F. (2002), Machine learning in automated text categorization, ACM Computing Surveys 34(1), pp.
1-47.
Sichel, H. S. (1986), Word frequency distributions and type-token characteristics, Mathematical Scientist, 11,
pp. 45-72.
Sichel, H.S. (1975), On a Distribution Law for Word Frequencies, Journal of the American Statistical Association,
70, 542-547.
Stamatatos, E. 2008. Author identification: Using text sampling to handle the class imbalance problem.
Information Processing and Management (44:2), pp. 790-799
Stamatatos, E., Fakotakis, N., Kokkinakis, G. (2000), Automatic text categorization in terms of genre and
author, Computational Linguistics 26(4), pp. 471-495
Stamatatos, E., Fakotakis, N., Kokkinakis, G. (2001), Computer-based authorship attribution without lexical
measures, Computers and the Humanities 35, pp. 193-214.
Stein, S. and S. Argamon (2006), A mathematical explanation of Burrows’s Delta. In the Proceedings of the
Digital Humanities Conference, Paris, 2006.
Tweedie, F. J., Baayen, R. H. (1998), How Variable May a Constant Be? Measures of Lexical Richness in
Perspective, Computers and the Humanities, 32, 323-352.
Tweedie, F. J., Singh, S. and Holmes, D. I. (1996), Neural network applications in stylometry: The Federalist
Papers. Computers and the Humanities, 30(1), pp. 1-10.
Uzuner, O. and Katz, B. (2005), A Comparative Study of Language Models for Book and Author Recognition,
in Springer Lecture Notes in Computer Science, Vol. 3651, pp. 969-980
van Halteren, H. (2004) Linguistic profiling for authorship recognition and verification, Proc. of 42nd Conf. Of
ACL, July 2004, pp. 199-206
van Halteren, H., Baayen, H., Tweedie, F., Haverkort, M. and Neijt, A. (2005), New Machine Learning Methods
Demonstrate the Existence of a Human Stylome, Journal of Quantitative Linguistics 12(1), pp. 65-77.
Vazire, S. (2006), Informant reports: A cheap, fast, and easy method for personality assessment. Journal of
Research in Personality 40(5), pp. 472-481.
Waugh, S., Adams, A., and Tweedie, F. J. (2000), Computational stylistics using Artificial Neural
Networks. Literary and Linguistic Computing, 15(2) pp. 187-198.
Whitelaw, C., Herke-Couchman, M. and Patrick, J. (2004), Identifying interpersonal distance using systemic
features, AAAI Spring Symp. on Exploring Attitude and Affect in Text
Witten, I. H., Frank, E. (2000), Data Mining: Practical Machine Learning Tools with Java Implementations.
Morgan Kaufmann, San Francisco.
Yang, Y. (1999), An evaluation of statistical approaches to text categorization. Journal of Information Retrieval,
1 (1-2), pp 67--88.
Yule, G. U. (1938), On Sentence Length as a Statistical Characteristic of Style in Prose with Application to Two
Cases of Disputed Authorship, Biometrika, 30, 363-390.
Yule, G. U. (1944), The Statistical Study of Literary Vocabulary. Cambridge University Press, Cambridge.
Zhang, D. and Lee, W. S. (2006), Extracting key-substring-group features for text classification, in Proc. of the
12th ACM Int’l Conference on Knowledge Discovery and Data Mining, pp. 474-483
Zhao, Y. & Zobel, J. (2005), Effective authorship attribution using function words, in 'Proc. 2nd AIRS Asian
Information Retrieval Symposium', Springer, pp. 174-190
Zhao, Y. and Zobel, J. (2007), Searching with style: authorship attribution in classic literature, Proc.of 30th
Australasian Conference on Computer Science, Vol. 62, pp. 59-68
Zhao, Y., Zobel, J. & Vines, P. (2006), Using relative entropy for authorship attribution, in ‘Proc. 3rd AIRS
Asian Information Retrieval Symposium’, pp. 92–105
Zheng, R., Li, J., Chen, H. and Huang, Z. (2006), “A framework for authorship identification of online
messages: Writing-style features and classification techniques,” Journal of the American Society for
Information Science and Technology, vol. 57, no. 3, pp. 378–393.
Zigdon, K. (2005), Automatically determining an author’s native language, M.Sc. Thesis, Dept. of Computer
Science, Bar-Ilan University, 2005
Zipf, G. K. (1932), Selected Studies of the Principle of Relative Frequency in Language. Harvard University
Press, Cambridge, MA.
FW: a list of 512 function words, including conjunctions, prepositions, pronouns, modal verbs, determiners
and numbers
POS: 38 part-of-speech unigrams and the 1000 most common part-of-speech bigrams, obtained using the Brill
(1992) part-of-speech tagger
SFL: all 372 nodes in SFL trees for conjunctions, prepositions, pronouns and modal verbs, based on
Matthiessen (1992)
CW: the 1000 words with highest information gain (Quinlan 1986) in the training corpus, among the 10,000
most common words in the corpus
CNG: the 1000 character trigrams with highest information gain in the training corpus, among the 10,000
most common trigrams in the corpus (cf. Keselj et al. 2003)
NB: WEKA's implementation (Witten and Frank 2000) of Naïve Bayes (Lewis 1998) with Laplace smoothing
J4.8: WEKA's implementation of the J4.8 decision tree method (Quinlan 1986) with no pruning
SMO: WEKA's implementation of Platt's (1998) SMO algorithm for SVM with a linear kernel and default settings
Table 1. Feature types and machine learning methods used in our experiments.
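As an indication of how feature sets such as FW or CW might be realized with current open-source tools (the original experiments used WEKA, not this stack; the function-word list below is a placeholder, and information gain is approximated here by mutual information):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# FW-style features: counts of a fixed list of function words (placeholder list).
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "with", "not", "but"]
fw_vectorizer = CountVectorizer(vocabulary=FUNCTION_WORDS)

def cw_selector(train_texts, train_labels):
    # CW-style features: among the 10,000 most common words, keep the 1,000
    # most informative about the author labels.
    vectorizer = CountVectorizer(max_features=10000)
    X = vectorizer.fit_transform(train_texts)
    selector = SelectKBest(mutual_info_classif, k=1000).fit(X, train_labels)
    return vectorizer, selector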
features/learner NB J4.8 RMW BMR SMO
FW 60.2% 58.7% 66.1% 68.2% 63.8%
POS 61.0% 59.0% 66.1% 66.3% 67.1%
FW+POS 65.9% 61.6% 68.0% 67.8% 71.7%
SFL 57.2% 57.2% 65.6% 67.2% 62.7%
CW 67.1% 66.9% 74.9% 78.4% 74.7%
CNG 72.3% 65.1% 73.1% 80.1% 74.9%
CW+CNG 73.2% 68.9% 74.2% 83.6% 78.2%
Table 2: Accuracy on test set attribution for a variety of feature sets and
learning algorithms applied to authorship classification for the email corpus.
Problem                   Baseline   Style   Content   Style+Content
Gender (2 classes)        50.0       72.0    75.1      76.1
Age (3 classes)           42.7       66.9    75.5      77.7
Language (5 classes)      20.0       65.1    82.3      79.3
Neuroticism (2 classes)   50.0       65.7    53.0      63.1
Table 5: Classification accuracy (%) for profiling problems using different feature sets.
Class       Style features                          Content features
Teens       im, so, thats, dont, cant               haha, school, lol, wanna, bored
Twenties    preposition, determiner, of, the, in    apartment, office, work, job, bar
Thirties+   preposition, the, determiner, of, in    years, wife, husband, daughter, children
Table 6: Most important Style and Content features (by information gain) for each class of texts in
each profiling problem.
Figure 1: Precision/Recall curves for attribution, adjusting the SVM threshold for deciding whether the
highest-scoring attribution should in fact be made. Upper curve is for snippets limited to 600 words
and lower is for snippets limited to 200 words. Recall (percentage of possible attributions correctly
made) is on the x-axis, and Precision (percentage of actual attributions correctly made) on the y-axis.
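For readers who wish to reproduce curves of this kind, the following sketch (with a hypothetical data layout, not the authors' code) shows how precision and recall can be traded off by raising the threshold on the top attribution score:

import numpy as np

def precision_recall_points(top_scores, is_correct):
    # top_scores[i]: score of the highest-ranked candidate for snippet i.
    # is_correct[i]: True if that candidate is the true author.
    top_scores = np.asarray(top_scores, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    n_possible = len(top_scores)  # every snippet could in principle be attributed
    points = []
    for thr in np.sort(top_scores):
        attributed = top_scores >= thr  # attribute only when the top score clears the threshold
        made = int(attributed.sum())
        if made == 0:
            continue
        correct = int((attributed & is_correct).sum())
        points.append((correct / n_possible,   # recall: share of possible attributions correctly made
                       correct / made))        # precision: share of actual attributions correctly made
    return points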
Figure 2: Precision/recall curves (as in Figure 1) for attribution of 10,000 snippets where all 10,000 are
theoretically attributable (upper curve) and where only 5,000 are theoretically attributable (lower curve).
[Figure 3: plot of cross-validation accuracy (y-axis, 50-100) against number of elimination iterations (x-axis, 0-8).]
Figure 3. Ten-fold cross-validation accuracy of models distinguishing House of Seven Gables from each of Hawthorne, Melville and
Cooper. The x-axis represents the number of iterations of eliminating the best features from the previous iteration. The curve well
below the others is that of Hawthorne, the actual author.
[Figure 4: plot of cross-validation accuracy (y-axis, 60-100) against number of elimination iterations (x-axis, 0-8).]
Figure 4. Unmasking An Ideal Husband against each of the ten authors (n=250, k=3). The curve below all the others is that of Oscar
Wilde, the actual author. (Several curves are indistinguishable.)
Given: anonymous book X, works of suspect author A,
       (optionally) impostors {A1,…,An}
Step 2 - Unmasking
    Build degradation curve <A,X>
    Represent degradation curve as feature vector (see text)
    Test degradation curve vector (see text)
    if test result positive
        return same-author
    else
        return different-author
Unmasking_END
Figure 5. Summary of the unmasking algorithm (with the optional use of impostors as negative examples).
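Tying the illustrative sketches together, a hypothetical end-to-end call for a single <suspect A, book X> query might look as follows (vectorize_chunks and meta_model are placeholders for a chunking/vectorizing helper and a meta-learner trained as sketched earlier; this is not the authors' code):

# Hypothetical end-to-end use of the earlier sketches.
feats = initial_feature_set(ax_text, x_text, n=250)               # choose the word set
X_ax = vectorize_chunks(ax_text, feats)                           # placeholder helper
X_x = vectorize_chunks(x_text, feats)
curve = degradation_curve(X_ax, X_x, k=3, iterations=10)          # unmasking
vec = curve_features(curve).reshape(1, -1)
verdict = "same-author" if meta_model.predict(vec)[0] == 1 else "different-author"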
Appendix. History of studies on authorship attribution problems. For each, we identify the corpus on
which methods were tested, the feature types used and the categorization method used.
(NB=Naïve Bayes; NN=neural nets; k-NN=k nearest neighbors; MVA=multivariate analysis; PCA=principal component analysis;
LDA=linear discriminant analysis)
Study                             | Corpus                           | Feature types                                                | Method
Keselj et al. 2003                | English novels, Greek newspapers | character n-grams                                            | MVA
Khmelev & Teahan 2003             | Russian texts                    | character n-grams                                            | distance (Markov)
Koppel & Schler 2003              | Emails                           | FW(100s), POS n-grams, idiosyncrasies                        | SVM, J4.8
Argamon et al. 2003               | BNC                              | FW(100s), POS n-grams                                        | Winnow
Hoover 2004a                      | American novels                  | words                                                        | MVA+PCA
Hoover 2004b                      | novels and articles              | words                                                        | MVA+PCA
Peng et al. 2004                  | Greek newspapers                 | character n-grams, word n-grams                              | NB
van Halteren 2004                 | Dutch texts                      | word n-grams, syntax                                         | MVA
Abbasi & Chen 2005                | Arabic forum posts               | characters, words, vocabulary richness, various              | SVM, J4.8
Chaski 2005                       | 10 anonymous authors             | character n-grams, word n-grams, POS n-grams, various        | distance (LDA)
Juola & Baayen 2005               | Dutch texts                      | FW(10s)                                                      | distance (cross-entropy)
Zhao & Zobel 2005                 | newswire stories                 | FW(100s)                                                     | NB, J4.8, k-NN
Koppel et al. 2005                | Learner English                  | FW(100s), POS n-grams, idiosyncrasies                        | SVM
Koppel et al. 2006a               | Brontes, BNC                     | FW(100s), POS n-grams                                        | Balanced Winnow
Zhao et al. 2006                  | AP stories, English novels       | FW(100s), POS, punctuation                                   | SVM, distance
Madigan et al. 2006               | Federalist papers                | characters, FW(100s), words, various                         | Bayesian regression
Zheng et al. 2006, Li et al. 2006 | English and Chinese newsgroups   | characters, FW(100s), syntax, vocabulary richness, various   | NN, J4.8, SVM
Argamon et al. 2007               | novels and articles              | FW(100s), syntax, SFL                                        | SVM
Burrows 2007                      | Restoration poets                | words                                                        | MVA+zeta
Hirst & Feiguina 2007             | Brontes                          | syntax                                                       | SVM
Pavelec et al. 2007               | Portuguese newspapers            | conjunction types                                            | SVM
Zhao & Zobel 2007                 | Shakespeare, Marlowe, various    | FW(100s), POS, POS n-grams                                   | distance (infogain)
Abbasi & Chen 2008                | emails, online comments, chats   | characters, FW(100s), syntax, vocabulary richness, various   | SVM, PCA, other
Argamon et al. 2008               | blogs, student essays, learner English | words, SFL                                             | Bayesian regression
Stamatatos 2008                   | English and Arabic news          | character n-grams                                            | SVM