
Predicting cross-linguistic adjective order with information gain

William Dyer, Oracle Corporation, [email protected]
Richard Futrell, University of California, Irvine, [email protected]
Zoey Liu, Boston College, [email protected]
Gregory Scontras, University of California, Irvine, [email protected]

Abstract

Languages vary in their placement of multiple adjectives before, after, or surrounding the noun, but they typically exhibit strong intra-language tendencies on the relative order of those adjectives (e.g., the preference for 'big blue box' in English, 'grande boîte bleue' in French, and 'als.undūq al'azraq alkabı̄r' in Arabic). We advance a new quantitative account of adjective order across typologically-distinct languages based on maximizing information gain. Our model addresses the left-right asymmetry of French-type ANA sequences with the same approach as AAN and NAA orderings, without appeal to other mechanisms. We find that, across 32 languages, the preferred order of adjectives mirrors an efficient algorithm of maximizing information gain.

1 Introduction

Languages that allow multiple sequential adjective modifiers tend to exhibit strong tendencies on the relative order of adjectives, as in 'big blue box' vs. 'blue big box' in English (Dixon, 1982). To date, most of the research on adjective ordering has focused on preferences in pre-nominal languages like English where adjectives precede the modified noun (Futrell et al., 2020a), or in post-nominal languages like Arabic where adjectives follow the noun (Kachakeche and Scontras, 2020). This research usually posits a metric, such as information locality (Futrell et al., 2020b) or subjectivity (Scontras et al., 2017), which governs the preferred distance between a noun and its adjectives. Because these theories predict only the relative linear distance between noun and adjective, they cannot be straightforwardly applied to mixed languages like French, where adjectives regularly appear both before and after the modified noun, at least not without added assumptions about hierarchical distance (Cinque, 1994). Instead, these mixed languages are often modeled with constraints on which adjective classes or functions can appear before or after a noun (Cinque, 2010; Fox and Thuilier, 2012).

Traditional accounts of adjective ordering in the linguistics literature often assume a tree structure in which the target measure is the hierarchical distance from noun (N) to adjective (A). According to syntactic accounts, ordering regularities are predicted by a universal hierarchy of lexical semantic classes (e.g., color adjectives are hierarchically closer to the modified noun than size adjectives; Cinque, 1994; Scott, 2002). Alternative accounts use aspects of adjective meaning to predict adjective order, making appeal to notions like 'inherentness' (Whorf, 1945) or 'definiteness of denotation' (Martin, 1969). Recently, Scontras et al. (2017) provide experimental evidence that their synthesis of semantic predictors into a continuum based on subjectivity reliably predicts ordering preference in English; followup studies have found subjectivity to be a reliable predictor in other languages as well (Tagalog: Samonte and Scontras, 2019; Mandarin: Shi and Scontras, 2020; Arabic: Kachakeche and Scontras, 2020; Spanish: Rosales Jr. and Scontras, 2019; Scontras et al., 2020). Explanations for the role of subjectivity in adjective ordering show how subjectivity-based orderings are more efficient than alternative orderings, thereby maximizing communicative success (Simonič, 2018; Hahn et al., 2018; Franke et al., 2019; Scontras et al., 2019).

Other efficiency-based approaches to adjective order quantify efficiency with information-theoretic measures of word distributions such as surprisal or entropy (Cover and Thomas, 2006; Levy, 2008). Models in this vein have a long conceptual history in the field, originating with the idea that semantic closeness between words is reflected in syntactic closeness in a surface realization (Sweet, 1900; Jespersen, 1922; Behaghel, 1932).

Modern quantitative incarnations include integration cost (Dyer, 2017) and information locality (Futrell et al., 2020b), both generalizations of the widely-accepted principle of dependency distance minimization (Liu et al., 2017; Temperley and Gildea, 2018).

Crucially, while previous approaches are able to model symmetrical structures within the noun phrase, as in the mirror-image A1 A2 N orders of English and the N A2 A1 orders of Arabic, a hierarchical approach cannot model the left–right asymmetry of Romance A1 N A2 without an appeal to other, usually syntactic, mechanisms (Cinque, 2009, 2010).

We advance an information-theoretic factor that predicts adjective ordering across the three typological 'templates' of adjective order—pre (AAN), mixed (ANA), and post (NAA)—based on information gain (IG), a measure of the reduction in uncertainty attained by transforming a dataset. IG is used in machine learning for ordering the nodes of a decision tree (Quinlan, 1986; Norouzi et al., 2015), where nodes are most often ordered in a greedy fashion such that the information gain of each node is maximized. By analogy, we view the noun phrase as a decision tree for reducing a listener's uncertainty about a speaker's intended meaning. Each word acts as a node in the decision tree; preferred adjective orders thus reflect an efficient ordering of nodes.

2 Empirical background

Empirical investigations of adjective ordering have focused on the cross-linguistic stability of these preferences across a host of unrelated languages (e.g., Dixon, 1982; Hetzron, 1978; Sproat and Shih, 1991). For example, where English speakers prefer 'big blue box' to 'blue big box', Mandarin speakers similarly prefer dà-de lán-de xiāng-zi 'big blue box' to lán-de dà-de xiāng-zi 'blue big box' (Shi and Scontras, 2020). In post-nominal languages, we find the mirror-image of the English pattern, such that adjectives that are preferred closer to the noun in pre-nominal languages are also preferred closer to the noun in post-nominal languages.[1] For example, speakers of Arabic prefer als.undūq al'azraq alkabı̄r 'the box blue big' to als.undūq alkabı̄r al'azraq 'the box big blue'.

In support of the cross-linguistic stability of adjective ordering preferences, Leung et al. (2020) present a latent-variable model capable of accurately predicting adjective order in 24 languages from seven different language families, achieving a mean accuracy of 78.9% on an average of 1335 sequences per language. Importantly, the model succeeds even when the training and testing languages are different, thus demonstrating that different languages rely on similar preferences. However, Leung et al.'s study was limited to AAN and NAA templates. There has been very little corpus-based empirical work on ordering preferences in the mixed ANA template, where adjectives both precede and follow the modified noun.[2]

While Leung et al. (2020) learn adjective order by training on observed adjective pairs, an alternate strategy is to posit one or more a priori metrics as an underlying motivation for adjective order (e.g., Malouf, 2000, in part). This approach allows for the study of why adjective orders might have come about. To that end, Futrell et al. (2020a) report an accuracy of 72.3% for English triples based on a combination of subjectivity and information-theoretic measures derived from the distribution of adjectives and nouns.

One of the information-theoretic measures analyzed by Futrell et al. (2020a) is an implementation of information gain based on the partitioning an adjective performs on the space of possible noun referents. However, it is unclear how this formulation of information gain could be implemented for post-nominal adjectives, in which the noun has presumably already been identified. Instead, the current study implements information gain based on feature vectors, as outlined in §3.

To our knowledge, the current study is the first attempt at predicting adjective order across all three templates, with an eye not only to raw accuracy, but in hopes of illuminating the functional pressures which might contribute to word ordering preferences in general. While we acknowledge that multiple factors are likely involved in adjective order preferences, our contribution here is a single quantitative factor capable of predicting adjective order across typologically distinct languages.

[1] Celtic languages have been claimed to be an exception to this trend (Sproat and Shih, 1991), though our own investigations into Irish suggest that it behaves like other post-nominal languages, at least with respect to information gain.

[2] We note three empirical studies that have examined the placement of a single adjective or adjective phrase before or after the noun in Romance languages: Thuilier (2014), Gulordava et al. (2015) and Gulordava and Merlo (2015). However, these studies do not tackle the question of order preferences among ANA triples.

3 Information gain

3.1 Picture of communication

We assume that a speaker is trying to communicate a meaning to a listener, with a meaning represented as a binary vector, where each dimension of the vector corresponds to a feature. Multiple features can be true simultaneously. For example, a speaker might have in mind a vector like m1 = [111...0] in Figure 1, where the vector has value 1 in the dimensions for 'is-big' (f0), 'is-grey' (f1), and 'is-elephant' (f2), and 0 for all other features. A meaning of this sort would be conveyed by the noun phrase 'big grey elephant'. We call m a feature vector and the set of feature vectors M.

The listener does not know which meaning m the speaker has in mind; the listener's state of uncertainty can be represented as a probability distribution over all possible feature vectors, P(m), corresponding to the prior probability of encountering a given feature vector. We call this distribution the listener distribution L.

By conveying information, each word in a sequence causes a change in the listener's prior distribution. Suppose as in Figure 1 that a listener starts with probability distribution L, then hears a word w conveying a feature (f2), resulting in the new distribution L′. The amount of change from L to L′ is properly measured using the Kullback–Leibler (KL) divergence DKL[L′||L] (Cover and Thomas, 2006). Therefore, the divergence DKL[L′||L] measures the amount of information about meaning conveyed by the word.

Another measure of the change induced by a word is the information gain, an extension of KL divergence to include the notion of negative evidence. Let L̄′ represent the listener's probability distribution over feature vectors conditional on the negation of w. By taking a weighted sum of the positive and negative KL divergence, we recover information gain (Quinlan, 1986):

    IG = (|L′|/|L|) DKL[L′||L] + (|L̄′|/|L|) DKL[L̄′||L],    (1)

where |L| indicates the number of elements in the support of L with non-zero probability. Information gain represents the information conveyed by a word and also the information conveyed by its negation.

Figure 1: A toy universe composed of four feature vectors m (columns m0-m3) defined by k binary features f (rows f0-fk) and an associated probability distribution L = [0.1, 0.3, 0.2, 0.4]. Partitioning L on f2 yields L′ = [0.0, 0.6, 0.4, 0.0], the probability distribution of the feature vectors containing a 1 for f2, viz. m1 and m2, as well as L̄′ = [0.2, 0.0, 0.0, 0.8], the distribution of feature vectors containing a 0 for f2, or f̄2.

3.2 Relationship to other quantities

Our IG quantity in Eq. 1 is drawn from the ID3 algorithm for generating decision trees (Quinlan, 1986). The goal of ID3 is to produce a classifier for some random variable (call it L) which works by successively evaluating some set of binary features in some order. The optimal order of these features is given by greedily maximizing information gain, where information gain for a feature f is a measure of how much the entropy of L is decreased by partitioning the dataset into positive and negative subsets based on whether f is present or absent. Our application of information gain to word order comes from treating each word as a binary indicator for the presence or absence of the associated feature, and then applying the ID3 algorithm to determine the optimal order of these features.

The first term of Eq. 1, the divergence DKL[L′||L], measures the amount of information about L conveyed by the word w and has been the subject of a great deal of study in psycholinguistics. In particular, Levy (2008) shows that if the word w and the context c can be reconstructed perfectly from the updated belief state L′, then the amount of information conveyed by w reduces to the surprisal of word w in context c:

    DKL[L′||L] = −log p(w|c).    (2)

Importantly for our purposes, the positive evidence term DKL[L′||L] in Eq. 1 alone is unlikely to make useful predictions about cross-linguistic ordering preferences, because surprisal is invariant to reversal of word order across a language as a whole (Levy, 2005; Futrell, 2019): the same surprisal values would be measured for any given language and a language with all the same sentences in reverse order. As such, these metrics are unable to predict any a priori asymmetries in word-order preferences between pre- and post-nominal positions.

3.3 Negative evidence

The new feature of information gain, which has not been presented in previous information-theoretic models of language, is the negative evidence term DKL[L̄′||L], indicating the change in the listener's belief about L given the negation of the features indicated by word w, a quantity related to extropy (Lad et al., 2015). For example, consider académie/NOUN militaire/ADJ 'military/ADJ academy/NOUN' in French. Let L represent a listener's belief state after having heard the noun académie 'academy'. Upon hearing the adjective militaire 'military', L is partitioned into L′—the portion of L in which militaire is a feature—and L̄′, the portion of L in which militaire is not a feature. Put another way, L̄′ is the probability distribution over non-military academies.

The negative evidence portion of information gain is of primary interest to us because it breaks the symmetry to word-order reversal that we would have if we used the positive evidence term alone. That is, because the sum of surprisals of words w1 and w2 in the context of w1 is the negative log joint probability of the sequence:

    −log p(w1) − log p(w2|w1) = −log p(w1, w2),    (3)

the sum of the surprisals of w2 and of w1 in the context of w2 necessarily yields the same quantity. Conversely, IG's negative-evidence value is related to the log probability of w2 conditional on the event of not observing w1, and as such the sum of negative evidence values is not equivalent to the joint surprisal.

Information gain can therefore predict left–right asymmetrical word-order preferences such as the order of adjectives in ANA templates. Further, it maps onto a well-known decision rule for the ordering of trees.

3.4 An efficient algorithm

The goal of algorithms such as ID3 is to produce a decision tree which divides a dataset into equal-sized and mutually-exclusive partitions, thereby creating a shallow tree (Quinlan, 1986). While finding the smallest possible binary decision tree is NP-complete (Hyafil and Rivest, 1976), ID3's locally-optimal approach has proven quite effective at producing shallow trees capable of accurate classification (Dobkin et al., 1996).

By analogy, ordering the adjectives in a noun phrase by maximizing information gain likewise produces a tree with balanced positive and negative partitions at each node. Specifically, adjectives that minimize the entropy of both the positive and negative evidence are placed before adjectives which are less 'decisive' at partitioning feature vectors.

4 Methodology

4.1 Data

Our study relies on two types of source data, both extracted from the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Ginter et al., 2017; Zeman et al., 2017): a set of Common Crawl and Wikipedia text data from a variety of languages, automatically parsed according to the Universal Dependencies scheme with UDPipe (Straka and Straková, 2017). First, we extract noun phrases (NPs) containing at least one adjective to populate feature vectors (§4.3). Second, we extract triples, instances of a noun and two dependent adjectives, where the three words are sequential in the surface order and neither the noun nor the adjectives have other dependents.

We restrict triples in this way to minimize the effect that other dependents might have on order preferences. For example, while single-word adjectives tend to precede the noun in English, as in 'the nice people', adjectives in larger right-branching phrases often follow: 'the people nice to us' (Matthews, 2014), a trend also seen in Romance (Gulordava et al., 2015; Gulordava and Merlo, 2015). Similarly, conjunctions have been shown to weaken or neutralize preferences (Fox and Thuilier, 2012; Rosales Jr. and Scontras, 2019; Scontras et al., 2020).

NPs and triples extracted from the Wikipedia dumps are used to generate feature vectors and to train our regression (§4.4). We use triples from the Common Crawl dumps to perform hold-out accuracy testing.

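The triple-extraction step of §4.1 can be sketched with the conllu package as below. The tooling choice is an assumption (the released code may differ), and the sketch simplifies by indexing tokens positionally, ignoring multiword tokens.

```python
from conllu import parse_incr

def extract_triples(conllu_file):
    """Collect (lemma, lemma, lemma) triples: a NOUN with exactly two ADJ dependents,
    all three adjacent in the surface string, and no other dependents on any of them."""
    triples = []
    for sent in parse_incr(conllu_file):
        deps = {}  # head id -> list of dependent tokens
        for tok in sent:
            deps.setdefault(tok["head"], []).append(tok)
        for noun in sent:
            if noun["upos"] != "NOUN":
                continue
            children = deps.get(noun["id"], [])
            if len(children) != 2 or any(c["upos"] != "ADJ" for c in children):
                continue                      # the noun's only dependents are the two adjectives
            if any(deps.get(c["id"]) for c in children):
                continue                      # the adjectives themselves have no dependents
            ids = sorted([noun["id"]] + [c["id"] for c in children])
            if ids[2] - ids[0] == 2:          # the three words are sequential in surface order
                triples.append(tuple(sent[i - 1]["lemma"] for i in ids))
    return triples

# Usage with a hypothetical file name:
# with open("wiki.conllu", encoding="utf-8") as f:
#     triples = extract_triples(f)
```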
4.2 Normalization

Because our source data are extracted from dumps of automatically-parsed text, they contain a large amount of noise, such as incorrectly assigned syntactic categories, HTML, nonstandard orthography, and so on. To combat this noise, we extract all lemmas with UPOS marked as ADJ and NOUN in all Universal Dependencies (UD) v2.7 corpora (Zeman et al., 2020) for a given language—the idea being that the UD corpora are of higher quality—and include only NPs and triples in which the adjectives and nouns are in the UD lists. All characters are case-normalized, where applicable.

4.3 Feature vectors

Each NP attested in the Wikipedia corpus for a given language corresponds to a feature vector with value 1 in the dimension associated with each adjective or noun lemma. For example, an NP such as 'the best room available' generates a vector containing 1 for 'is-available', 'is-best', and 'is-room'. The relative count of each NP in the Wikipedia corpus yields a probability distribution over feature vectors. It is this distribution which is transformed by partitioning on each lemma in a triple.

4.4 Evaluation

For a given typological template (AAN, ANA, or NAA) there are two competing variants; our tasks are to (i) predict which of the variants will be attested in a corpus and (ii) show a cross-linguistic consistency in how that prediction comes about.

Because we are limiting our study to the two competing variants within each template, the position of the noun is invariant, leaving only the relative order of the two adjectives to determine the order of a triple. Our problem thus reduces to whether the information gain of the first linear adjective is greater than that of the second.

In the case of AAN and ANA triples, the IG of each adjective is calculated by partitioning the entire set of feature vectors L on each of the two adjectives. In the case of NAA triples, however, IG is calculated by partitioning only those feature vectors which 'survive' the initial partition by the noun, and are therefore part of L′. Thus we calculate IG(L, a) before the noun and IG(L′, a) after.

Rather than simply implement the ID3 algorithm and choose adjectives based on their raw information gain, we train a logistic regression to predict surface orders based on the difference of IG between the attested first and second adjective, a method previously used by Morgan and Levy (2016) and Futrell et al. (2020a). The benefits of this approach are two-fold: we are able to account for bias in the distribution of adjectival IGs, and we can more easily deconstruct how strong information gain is as a predictor of adjective order.

Within each template, for each attested triple τ, let π1 be the lexicographically-sorted first permutation of τ and π2 be the second, with α1 being the first linear adjective in π1 and α2 being the first linear adjective in π2. Our dependent variable p is whether π1 is attested in the corpus, and our independent variable is the difference between the information gain of α1 and that of α2. We train the coefficients β0 and β1 in a logistic regression of the form

    p = 1 if π1 is attested; p = 0 if π2 is attested
    log(p/(1−p)) ∼ β0 + β1 [IG(α1) − IG(α2)].    (4)

A positive value for β1 tells us that permutations in which the larger-IG adjective is placed first tend to be attested. The value of β0 tells us whether there is a generalized bias towards a positive or negative IG(π1) − IG(π2). The accuracy we achieve by running the logistic regression on held-out testing data tells us the effectiveness of an IG-based algorithm at predicting adjective order.

4.5 Reporting results

We report results for languages from which at least 5k triples could be analyzed, and for templates representing at least 10% of a language's triples in UD corpora. The count of analyzable triples for each language is a product of those available in the 2017 CoNLL Shared Task, those with sufficiently large UD v2.7 corpora, and those that meet our extraction requirements (§4.1).

Because we are interested in exploring a cross-linguistic predictor of adjective order, we report macro-average accuracies and β1 coefficients. That is, each language's accuracy and coefficient are calculated independently and are then averaged. We report both type- and token-accuracy, using the latter in our analysis based on the intuition that the preference for the order of a commonly-occurring triple is stronger than that of a rarer one.

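The regression in Eq. 4 can be fit per template with, for example, scikit-learn. The rows below are hypothetical, and the library choice is an assumption rather than a description of the released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per attested triple: IG(alpha1) - IG(alpha2), where alpha1 comes from the
# lexicographically first permutation pi1; the label is whether pi1 is the attested order.
ig_diff = np.array([[0.8], [-0.3], [1.2], [0.1], [-0.6]])   # hypothetical values, in bits
attested_pi1 = np.array([1, 0, 1, 1, 0])                    # p in Eq. 4

model = LogisticRegression().fit(ig_diff, attested_pi1)
beta0, beta1 = model.intercept_[0], model.coef_[0, 0]
# A positive beta1 means orders that place the larger-IG adjective first tend to be attested;
# held-out accuracy (model.score on test triples) is what the per-language results report.
```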
AAN
language    n      β1       P      token acc.  type acc.
Bulgarian   13018  20.058   0.000  0.650       0.649
Chinese     5909   18.604   0.000  0.724       0.766
Croatian    15555  21.246   0.000  0.666       0.634
Czech       27899  28.207   0.000  0.671       0.665
Danish      11226  17.506   0.000  0.786       0.770
Dutch       11279  12.201   0.000  0.609       0.605
English     23311  22.076   0.000  0.643       0.647
Finnish     12605  15.342   0.000  0.655       0.644
German      16391  16.210   0.000  0.601       0.606
Greek       5506   18.383   0.000  0.631       0.643
Latvian     5290   15.826   0.000  0.594       0.551
Russian     25397  25.697   0.000  0.658       0.651
Slovak      11933  25.935   0.000  0.700       0.651
Slovenian   18859  28.192   0.000  0.670       0.661
Swedish     10937  11.462   0.000  0.717       0.711
Turkish     12115  12.579   0.000  0.576       0.577
Ukrainian   11474  15.949   0.000  0.593       0.592
Urdu        6432   9.170    0.000  0.673       0.593
AAN means: β1 = 18.591 [15.740, 21.443]; token accuracy = 0.656 [0.630, 0.683]; type accuracy = 0.645 [0.616, 0.674]

ANA
language    n      β1       P      token acc.  type acc.
Basque      3322   -9.623   0.000  0.703       0.678
Catalan     3117   45.135   0.000  0.818       0.814
Croatian    4912   -3.411   0.106  0.608       0.604
French      5673   43.349   0.000  0.771       0.756
Galician    5020   68.290   0.000  0.805       0.806
Indonesian  1521   -2.462   0.138  0.543       0.524
Italian     9484   36.658   0.000  0.681       0.698
Persian     2598   43.242   0.000  0.794       0.766
Polish      13481  24.873   0.000  0.684       0.655
Portuguese  7580   32.374   0.000  0.734       0.725
Romanian    2426   46.823   0.000  0.730       0.739
Spanish     9212   57.813   0.000  0.744       0.738
Vietnamese  2636   24.013   0.000  0.962       0.931
ANA means: β1 = 31.313 [16.786, 45.841]; token accuracy = 0.737 [0.674, 0.799]; type accuracy = 0.726 [0.665, 0.787]

NAA
language    n      β1       P      token acc.  type acc.
Arabic      11595  4.595    0.000  0.693       0.660
Basque      4899   1.957    0.000  0.626       0.635
Catalan     2878   5.024    0.000  0.710       0.722
French      8368   5.143    0.000  0.737       0.749
Galician    1334   5.776    0.000  0.716       0.694
Hebrew      6751   1.115    0.000  0.558       0.560
Indonesian  5724   4.631    0.000  0.740       0.734
Italian     4523   4.057    0.000  0.713       0.739
Persian     12683  1.583    0.000  0.605       0.606
Portuguese  5139   5.329    0.000  0.726       0.730
Romanian    8492   5.333    0.000  0.742       0.746
Spanish     6245   6.214    0.000  0.713       0.745
Vietnamese  3354   3.068    0.000  0.561       0.606
NAA means: β1 = 4.140 [3.128, 5.152]; token accuracy = 0.680 [0.639, 0.721]; type accuracy = 0.687 [0.647, 0.726]

Comprehensive means: β1 = 18.08; token accuracy = 0.687; type accuracy = 0.681

Table 1: Results by template and language: n triples analyzed, regression coefficient β1 and P-value, and test accuracies. Means with 95% confidence intervals are shown for each template.
Figure 2: Plot of accuracy and β1 coefficient for each language, categorized by template type (AAN, ANA, NAA).

5 Results

We extracted and analyzed at least 5k triples from 32 languages across a variety of families.[3] Because some languages contain triples in two typological templates, we report results for 44 sets of triples. Table 1 reports language-specific results and means for each template, including n triples analyzed, regression coefficient β1 and P-value, token and type accuracy, and 95% confidence intervals. Figure 2 shows a plot of accuracy and β1 coefficient for each language, categorized by template.

As reported in Table 1, we find above-chance (>50%) accuracy for all languages tested. We accurately predict 65.6% of AAN triples, 73.7% of ANA triples, and 68.0% of NAA triples, for a comprehensive accuracy across all languages of 68.7%. Overlapping 95% confidence intervals across templates suggest that IG-based prediction performs equally well across templates.

The high performance on Vietnamese ANA triples (96.2%) is largely due to the algorithm correctly predicting that the highly-frequent adjective nhiều 'many' should be placed before the noun, while most other adjectives are placed after.[4]

Though we cannot make a direct comparison to other studies due to a lack of shared data, Table 2 shows that our cross-linguistic accuracy of 68.7% bests any single predictor applied to a similar set of English AAN triples by Futrell et al. (2020a).

metric  n   accuracy  confidence interval
IG-FV   44  0.687     [0.686, 0.688]
Subj.   1   0.661     [0.657, 0.666]
PMI     1   0.659     [0.654, 0.664]
IG-NR   1   0.650     [0.645, 0.654]
IC      1   0.642     [0.634, 0.646]

Table 2: Comparison across n languages of the current metric, IG of feature vectors (IG-FV), and subjectivity, PMI, IG of noun referents (IG-NR), and integration cost (IC) (Futrell et al., 2020a).

The learned β1 coefficient is not significantly different between AAN (18.591) and ANA (31.313) triples, though that of NAA (4.140) triples is significantly smaller than the other two. More generally, of the 44 datasets tested, β1 is positive in 41 (93.2%), suggesting that there is a strong preference to maximize information gain. Further, of the three instances of a negative β1, two (Croatian and Indonesian ANA) do not reach significance, perhaps due to a paucity of data. The sole significant negative β1 is from Basque ANA triples.

[3] https://github.com/wmdyer/infogain

[4] One might worry about the classification of 'many' as an adjective. While widely extant across languages, the class of adjectives is not entirely homogeneous. As such, the equivalent of a word like 'many' in some languages might be marked as an adjective, determiner, or other syntactic category. For the current study, we simply follow the UD annotation scheme.

6 Discussion

6.1 β1 coefficient

Our results show a strong tendency across typological templates and across languages for the adjective which yields a larger information gain to be placed before the other, as evidenced by a positive β1. However, the absolute value of β1 is difficult to interpret without understanding the relative magnitudes of the underlying IG scores, magnitudes that vary across datasets and word distributions.

In general, we observe that a larger value of β1 indicates that IG is a more reliable predictor within a dataset. More specifically, a value of β1 = 1 indicates that if the IG difference between orders is equal to one bit, then the log odds of the order with larger IG increases by one.
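As a worked illustration of this interpretation, using hypothetical numbers (β0 = 0 and β1 = 20, roughly the scale of the AAN coefficients in Table 1), Eq. 4 converts an IG difference into the probability that the larger-IG-first order is the attested one:

```python
import math

def p_larger_ig_first(ig_diff_bits, beta0=0.0, beta1=20.0):
    """Probability, under Eq. 4, that the order placing the larger-IG adjective first
    is attested. beta0 and beta1 here are illustrative, not fitted values."""
    log_odds = beta0 + beta1 * ig_diff_bits
    return 1.0 / (1.0 + math.exp(-log_odds))

print(p_larger_ig_first(0.05))  # an IG difference of 0.05 bits gives ~0.73
print(p_larger_ig_first(0.20))  # an IG difference of 0.20 bits gives ~0.98
```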

6.2 Asymmetries

The preference for one variant of an ANA triple over the other is an asymmetry without a straightforward explanation in a distance-based model; there is no clear mapping from ANA onto the other templates, which means that an adjective's relative distance to the noun is not informative. Our algorithm is novel in that the placement of the adjectives is governed by greedy IG, not distance to the noun—an innovation that allows us to break the symmetry between the adjectives in ANA triples. Similarly, IG makes no a priori prediction as to whether a mirror- or same-order will emerge between AAN and NAA triples: both pre- and post-nominal behavior is a product of ordering adjectives such that information gain is maximized, and IG itself is fundamentally derived from the distribution of adjectives and nouns that populate a language's possible feature vectors for conveying meaning.

Another left–right asymmetry that has been posited in the linguistics literature holds that dependents placed before the head in a surface realization (e.g., the adjectives in an AAN triple) follow a more rigid ordering than those placed after (e.g., the adjectives in a NAA triple; Hawkins, 1983). Both noun modifiers in general and adjectives specifically have been reported to follow this pattern, with a largely-universal pre-nominal ordering and a mirror, same, or 'free' post-nominal order (Hetzron, 1978). There is as yet no large-scale empirical evidence for this claim, though Trainin and Shetreet (2021) suggest that Hebrew NAA order preferences may be weaker than English AAN for a restricted set of adjective classes.

In an effort to empirically assess the claim that post-nominal orderings are more flexible than pre-nominal orderings across languages, Table 3 reports the average prevalence of adjective pairs attested in both possible orders (e.g., A1 A2 N and A2 A1 N, where N can be any noun) within each template in our dataset. At 95% confidence the difference between AAN and NAA does not reach significance, though the rate for ANA is significantly lower than the other two. More generally, the mean rate of just 1.6% across templates reinforces the notion that ordering preferences are quite robust regardless of template, at least for our normalized triples from the languages analyzed here.

template  n   rate   confidence interval
AAN       18  0.017  [0.012, 0.022]
ANA       13  0.007  [0.002, 0.011]
NAA       13  0.022  [0.013, 0.032]
all       44  0.016  [0.012, 0.020]

Table 3: Macro-average rate of adjectives attested in both possible orders within each template, showing n languages, rate of attestation, and 95% confidence intervals.

6.3 Ablation

Equation 1 defines information gain as the conditioned sum of two elements, the positive evidence DKL[L′||L] and the negative evidence DKL[L̄′||L]. The positive evidence alone is akin to surprisal, a well-studied quantity in psycholinguistics (§3.2), while the negative evidence is related to extropy (§3.3). By ablating the IG formulation into the two terms discretely, we can show empirically that the proportionally-combined positive and negative evidence yield more accurate and consistent results than either of the two constituent terms alone.

Table 4 shows the mean accuracy and polarity proportion of the β1 coefficient across languages and templates. The polarity of β1 tells us whether maximizing IG (positive) or minimizing IG (negative) is the better strategy. Thus a polarity percentage close to 0 or 1 indicates more consistent behavior across templates.

For example, while the accuracy of using only positive evidence, DKL[L′||L], for AAN triples is 0.565, that accuracy is realized due to a 0.000 rate of positive β1 coefficient—that is, the 56.5% accuracy is achieved by minimizing IG, placing the adjective with the lower IG first. On the other hand, while using only positive evidence to predict NAA triples yields the same accuracy, 0.565, the coefficient polarity proportion of 0.769 means that, in most NAA cases, IG should be maximized. The three templates together reflect a modest accuracy (0.566) and an inconsistent coefficient polarity proportion (0.273).

Using only negative evidence, DKL[L̄′||L], yields even worse accuracies and similarly inconsistent coefficients as positive evidence alone. The accuracy across templates is little better than chance at 0.535, and the average coefficient polarity proportion of 0.273 likewise demonstrates that using negative evidence alone does not produce consistent behavior across templates.

                accuracy                          proportion of positive β1
                AAN    ANA    NAA    all          AAN    ANA    NAA    all
DKL[L′||L]      0.565  0.567  0.565  0.566        0.000  0.154  0.769  0.273
DKL[L̄′||L]      0.533  0.548  0.526  0.535        0.167  0.231  0.462  0.273
IG              0.657  0.737  0.680  0.687        1.000  0.769  1.000  0.932

Table 4: Ablation on accuracy and the proportion of positive coefficients for positive evidence (DKL[L′||L]) alone, negative evidence (DKL[L̄′||L]) alone, and proportionally combined terms (IG). Boldfaced values in the original indicate the highest accuracy or coefficient polarity proportion in each column (the IG row throughout).

The full IG calculation, including both positive and negative evidence, yields the highest accuracy across templates (0.687), as well as the highest for each template—AAN (0.657), ANA (0.737) and NAA (0.680). IG also demonstrates the most consistent behavior across languages and templates: at a rate of 0.932, maximizing IG yields the highest accuracy, regardless of whether adjectives precede or follow the noun.

7 Summary

We have taken a novel approach to the problem of predicting the surface order of adjectives across languages, casting it as a decision tree operating on a probability distribution over binary feature vectors. As each adjective is uttered, probability mass is partitioned into positive and negative subsets: those vectors that contain the feature and those that do not. The information gained by this partition can be used to order adjectives in a greedy manner, similarly to well-known algorithms for ordering nodes in a decision tree.

An IG-based approach allows us to provide the first quantitative information-theoretic account predicting the order of ANA triples. Further, with this approach we need not stipulate mirror- or same-orders for AAN and NAA triples. Because IG is not a distance metric between adjective and noun, and because IG incorporates negative evidence, both ANA and pre- or post-nominal asymmetries are able to emerge within an IG framework, without appeal to other mechanisms.

Our results show that information gain is a good predictor of adjective order across languages. Importantly, IG-based prediction follows a consistent pattern across the three typological templates, namely that adjectives that maximize information gain tend to be placed first.

References

Otto Behaghel. 1932. Deutsche Syntax eine geschichtliche Darstellung, volume IV. Carl Winters Universitätsbuchhandlung, Heidelberg.

Guglielmo Cinque. 1994. On the Evidence for Partial N-Movement in the Romance DP. In Guglielmo Cinque, Jan Koster, Jean-Yves Pollack, Luigi Rizzi, and Raffaella Zanuttini, editors, Paths Towards Universal Grammar: Studies in Honor of Richard S. Kayne, pages 85–110. Georgetown University Press, Washington, DC.

Guglielmo Cinque. 2009. The Fundamental Left-Right Asymmetry of Natural Languages, pages 165–184. Springer Netherlands, Dordrecht.

Guglielmo Cinque. 2010. The Syntax of Adjectives: A Comparative Study. The MIT Press, Cambridge, MA.

Thomas M. Cover and Joy A. Thomas. 2006. Elements of Information Theory. John Wiley & Sons, Hoboken, NJ.

Robert M. W. Dixon. 1982. Where have all the adjectives gone? And other essays in semantics and syntax. Mouton, Berlin, Germany.

David Dobkin, Truxton Fulton, Dimitrios Gunopulos, Simon Kasif, and Steven Salzberg. 1996. Induction of shallow decision trees. Submitted to IEEE PAMI.

William E. Dyer. 2017. Minimizing integration cost: A general theory of constituent order. Ph.D. thesis, University of California, Davis, Davis, CA.

Gwendoline Fox and Juliette Thuilier. 2012. Predicting the position of attributive adjectives in the French NP. In New Directions in Logic, Language and Computation, pages 1–15. Springer.

Michael Franke, Gregory Scontras, and Mihael Simonič. 2019. Subjectivity-based adjective ordering maximizes communicative success. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society, pages 344–350.

Richard Futrell. 2019. Information-theoretic locality properties of natural language. In Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019), pages 2–15, Paris, France. Association for Computational Linguistics.

Richard Futrell, William Dyer, and Greg Scontras. 2020a. What determines the order of adjectives in English? Comparing efficiency-based theories using dependency treebanks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2003–2012, Online. Association for Computational Linguistics.

Richard Futrell, Edward Gibson, and Roger P. Levy. 2020b. Lossy-context surprisal: An information-theoretic model of memory effects in sentence processing. Cognitive Science, 44(3):e12814.

Filip Ginter, Jan Hajič, Juhani Luotolahti, Milan Straka, and Daniel Zeman. 2017. CoNLL 2017 shared task - automatically annotated raw texts and word embeddings. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Kristina Gulordava and Paola Merlo. 2015. Structural and lexical factors in adjective placement in complex noun phrases across Romance languages. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 247–257, Beijing, China. Association for Computational Linguistics.

Kristina Gulordava, Paola Merlo, and Benoit Crabbé. 2015. Dependency length minimisation effects in short spans: a large-scale analysis of adjective placement in complex noun phrases. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 477–482, Beijing, China. Association for Computational Linguistics.

Michael Hahn, Judith Degen, Noah Goodman, Daniel Jurafsky, and Richard Futrell. 2018. An information-theoretic explanation of adjective ordering preferences. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society, pages 1766–1772, Madison, WI. Cognitive Science Society.

John A. Hawkins. 1983. Word Order Universals: Quantitative analyses of linguistic structure. Academic Press, New York.

Robert Hetzron. 1978. On the relative order of adjectives. In Hansjakob Seiler, editor, Language Universals, pages 165–184. Gunter Narr Verlag, Tübingen.

Laurent Hyafil and R. Rivest. 1976. Constructing optimal binary decision trees is NP-complete. Information Processing Letters.

Otto Jespersen. 1922. Language: its nature and development. George Allen & Unwin Ltd., London.

Zeinab Kachakeche and Gregory Scontras. 2020. Adjective ordering in Arabic: Post-nominal structure and subjectivity-based preferences. In Proceedings of the Linguistic Society of America, volume 5, pages 419–430.

Frank Lad, Giuseppe Sanfilippo, Gianna Agro, et al. 2015. Extropy: complementary dual of entropy. Statistical Science, 30(1):40–58.

Jun Yen Leung, Guy Emerson, and Ryan Cotterell. 2020. Investigating cross-linguistic adjective ordering tendencies with a latent-variable model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.

Roger Levy. 2005. Probabilistic Models of Word Order and Syntactic Discontinuity. Ph.D. thesis, Stanford University, Stanford, CA.

Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.

Haitao Liu, Chunshan Xu, and Junying Liang. 2017. Dependency distance: A new perspective on syntactic patterns in natural languages. Physics of Life Reviews, 21:171–93.

Robert Malouf. 2000. The order of prenominal adjectives in natural language generation. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 85–92.

J. E. Martin. 1969. Semantic determinants of preferred adjective order. Journal of Verbal Learning and Verbal Behavior, 8:697–704.

Peter Matthews. 2014. The Positions of Adjectives in English. Oxford University Press, New York.

Emily Morgan and Roger Levy. 2016. Abstract knowledge versus direct experience in processing of binomial expressions. Cognition, 157:382–402.

Mohammad Norouzi, Maxwell D. Collins, Matthew Johnson, David J. Fleet, and Pushmeet Kohli. 2015. Efficient non-greedy optimization of decision trees. arXiv:1511.04056 [cs].

J. Ross Quinlan. 1986. Induction of decision trees. Machine Learning, 1(1):81–106.

Cesar Manuel Rosales Jr. and Gregory Scontras. 2019. On the role of conjunction in adjective ordering preferences. Proceedings of the Linguistic Society of America, 4(32):1–12.

Suttera Samonte and Gregory Scontras. 2019. Adjective ordering in Tagalog: A cross-linguistic comparison of subjectivity-based preferences. In Proceedings of the Linguistic Society of America, volume 4, pages 1–13.

Gregory Scontras, Galia Bar-Sever, Zeinab Kachakeche, Cesar Manuel Rosales Jr., and Suttera Samonte. 2020. Incremental semantic restriction and subjectivity-based adjective ordering. Proceedings of Sinn und Bedeutung 24, pages 253–270.

Gregory Scontras, Judith Degen, and Noah D. Goodman. 2017. Subjectivity predicts adjective ordering preferences. Open Mind: Discoveries in Cognitive Science, 1(1):53–65.

Gregory Scontras, Judith Degen, and Noah D. Goodman. 2019. On the grammatical source of adjective ordering preferences. Semantics and Pragmatics, 12(7).

Gary-John Scott. 2002. Stacked adjectival modification and the structure of nominal phrases. In Functional Structure in DP and IP: The Cartography of Syntactic Structures, volume 1, pages 91–210. Oxford University Press, New York.

Yuxin Shi and Gregory Scontras. 2020. Mandarin has subjectivity-based adjective ordering preferences in the presence of 'de'. In Proceedings of the Linguistic Society of America, volume 5, pages 410–418.

Mihael Simonič. 2018. Functional explanation of adjective ordering preferences using probabilistic programming. Master's thesis, University of Tübingen.

Richard Sproat and Chilin Shih. 1991. The Cross-Linguistic Distribution of Adjective Ordering Restrictions. In Carol Georgopoulos and Roberta Ishihara, editors, Interdisciplinary Approaches to Language, pages 565–93. Kluwer Academic Publishers, Boston.

Milan Straka and Jana Straková. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver. Association for Computational Linguistics.

Henry Sweet. 1900. A new English grammar, logical and historical, volume 1. Clarendon Press, Oxford.

David Temperley and Daniel Gildea. 2018. Minimizing syntactic dependency lengths: Typological/cognitive universal? Annual Review of Linguistics, 4:1–15.

Juliette Thuilier. 2014. An Experimental Approach to French Attributive Adjective Syntax. In Christopher Piñón, editor, Empirical Issues in Syntax and Semantics, volume 10 of Experimental Syntax and Semantics, pages 287–304.

Nitzan Trainin and Einat Shetreet. 2021. It's a dotted blue big star: on adjective ordering in a post-nominal language. Language, Cognition and Neuroscience, 36(3):320–341.

Benjamin Lee Whorf. 1945. Grammatical Categories. Language, 21(1):1–11.

Daniel Zeman, Joakim Nivre, ... Abrams, and Anna Zhuravleva. 2020. Universal dependencies 2.7. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Daniel Zeman, Martin Popel, ..., and Josie Li. 2017. Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19. Association for Computational Linguistics.
