1932). Modern quantitative incarnations include integration cost (Dyer, 2017) and information locality (Futrell et al., 2020b), both generalizations of the widely-accepted principle of dependency distance minimization (Liu et al., 2017; Temperley and Gildea, 2018).

Crucially, while previous approaches are able to model symmetrical structures within the noun phrase, as in the mirror-image A1 A2 N orders of English and the N A2 A1 orders of Arabic, a hierarchical approach cannot model the left–right asymmetry of Romance A1 N A2 without an appeal to other, usually syntactic, mechanisms (Cinque, 2009, 2010).

We advance an information-theoretic factor that predicts adjective ordering across the three typological ‘templates’ of adjective order—pre (AAN), mixed (ANA), and post (NAA)—based on information gain (IG), a measure of the reduction in uncertainty attained by transforming a dataset. IG is used in machine learning for ordering the nodes of a decision tree (Quinlan, 1986; Norouzi et al., 2015), where nodes are most often ordered in a greedy fashion such that the information gain of each node is maximized. By analogy, we view the noun phrase as a decision tree for reducing a listener’s uncertainty about a speaker’s intended meaning. Each word acts as a node in the decision tree; preferred adjective orders thus reflect an efficient ordering of nodes.

2 Empirical background

Empirical investigations of adjective ordering have focused on the cross-linguistic stability of these preferences across a host of unrelated languages (e.g., Dixon, 1982; Hetzron, 1978; Sproat and Shih, 1991). For example, where English speakers prefer ‘big blue box’ to ‘blue big box’, Mandarin speakers similarly prefer dà-de lán-de xiāng-zi ‘big blue box’ to lán-de dà-de xiāng-zi ‘blue big box’ (Shi and Scontras, 2020). In post-nominal languages, we find the mirror-image of the English pattern, such that adjectives that are preferred closer to the noun in pre-nominal languages are also preferred closer to the noun in post-nominal languages.¹ For example, speakers of Arabic prefer alṣundūq al’azraq alkabīr ‘the box blue big’ to alṣundūq alkabīr al’azraq ‘the box big blue’.

In support of the cross-linguistic stability of adjective ordering preferences, Leung et al. (2020) present a latent-variable model capable of accurately predicting adjective order in 24 languages from seven different language families, achieving a mean accuracy of 78.9% on an average of 1335 sequences per language. Importantly, the model succeeds even when the training and testing languages are different, thus demonstrating that different languages rely on similar preferences. However, Leung et al.’s study was limited to AAN and NAA templates. There has been very little corpus-based empirical work on ordering preferences in the mixed ANA template, where adjectives both precede and follow the modified noun.²

While Leung et al. (2020) learn adjective order by training on observed adjective pairs, an alternate strategy is to posit one or more a priori metrics as an underlying motivation for adjective order (e.g., Malouf, 2000, in part). This approach allows for the study of why adjective orders might have come about. To that end, Futrell et al. (2020a) report an accuracy of 72.3% for English triples based on a combination of subjectivity and information-theoretic measures derived from the distribution of adjectives and nouns.

One of the information-theoretic measures analyzed by Futrell et al. (2020a) is an implementation of information gain based on the partitioning an adjective performs on the space of possible noun referents. However, it is unclear how this formulation of information gain could be implemented for post-nominal adjectives, for which the noun has presumably already been identified. Instead, the current study implements information gain based on feature vectors, as outlined in §3.

To our knowledge, the current study is the first attempt at predicting adjective order across all three templates, with an eye not only to raw accuracy, but in hopes of illuminating the functional pressures which might contribute to word ordering preferences in general. While we acknowledge that multiple factors are likely involved in adjective order preferences, our contribution here is a single quantitative factor capable of predicting adjective order across typologically distinct languages.

¹ Celtic languages have been claimed to be an exception to this trend (Sproat and Shih, 1991), though our own investigations into Irish suggest that it behaves like other post-nominal languages, at least with respect to information gain.

² We note three empirical studies that have examined the placement of a single adjective or adjective phrase before or after the noun in Romance languages: Thuilier (2014), Gulordava et al. (2015) and Gulordava and Merlo (2015). However, these studies do not tackle the question of order preferences among ANA triples.
3 Information gain

3.1 Picture of communication

We assume that a speaker is trying to communicate a meaning to a listener, with a meaning represented as a binary vector, where each dimension of the vector corresponds to a feature. Multiple features can be true simultaneously. For example, a speaker might have in mind a vector like m1 = [111...0] in Figure 1, where the vector has value 1 in the dimensions for ‘is-big’ (f0), ‘is-grey’ (f1), and ‘is-elephant’ (f2), and 0 for all other features. A meaning of this sort would be conveyed by the noun phrase ‘big grey elephant’. We call m a feature vector and the set of feature vectors M.

The listener does not know which meaning m the speaker has in mind; the listener’s state of uncertainty can be represented as a probability distribution over all possible feature vectors, P(m), corresponding to the prior probability of encountering a given feature vector. We call this distribution the listener distribution L.

            m0   m1   m2   m3
      f0     1    1    0    1
      f1     0    1    1    1
M:    f2     0    1    1    0
      ...   ...  ...  ...  ...
      fk     1    0    1    0

L:          0.1  0.3  0.2  0.4

                  w (conveys f2)
            m0   m1   m2   m3             m0   m1   m2   m3
L′:         0.0  0.6  0.4  0.0      L̄′:   0.2  0.0  0.0  0.8

Figure 1: A toy universe composed of four feature vectors m defined by k binary features f and an associated probability distribution L. Partitioning L on f2 yields L′, the probability distribution of the feature vectors containing a 1 for f2, viz. m1 and m2, as well as L̄′, the distribution of feature vectors containing a 0 for f2, or f̄2.
By conveying information, each word in a sequence causes a change in the listener’s prior distribution. Suppose as in Figure 1 that a listener starts with probability distribution L, then hears a word w conveying a feature (f2), resulting in the new distribution L′. The amount of change from L to L′ is properly measured using the Kullback–Leibler (KL) divergence DKL[L′||L] (Cover and Thomas, 2006). Therefore, the divergence DKL[L′||L] measures the amount of information about meaning conveyed by the word.

Another measure of the change induced by a word is the information gain, an extension of KL divergence to include the notion of negative evidence. Let L̄′ represent the listener’s probability distribution over feature vectors conditional on the negation of w. By taking a weighted sum of the positive and negative KL divergence, we recover information gain (Quinlan, 1986):

    IG = \frac{|L'|}{|L|} D_{KL}[L' \,\|\, L] + \frac{|\bar{L}'|}{|L|} D_{KL}[\bar{L}' \,\|\, L],    (1)

where |L| indicates the number of elements in the support of L with non-zero probability. Information gain represents the information conveyed by a word and also the information conveyed by its negation.

3.2 Relationship to other quantities

Our IG quantity in Eq. 1 is drawn from the ID3 algorithm for generating decision trees (Quinlan, 1986). The goal of ID3 is to produce a classifier for some random variable (call it L) which works by successively evaluating some set of binary features in some order. The optimal order of these features is given by greedily maximizing information gain, where information gain for a feature f is a measure of how much the entropy of L is decreased by partitioning the dataset into positive and negative subsets based on whether f is present or absent. Our application of information gain to word order comes from treating each word as a binary indicator for the presence or absence of the associated feature, and then applying the ID3 algorithm to determine the optimal order of these features.

The first term of Eq. 1, the divergence DKL[L′||L], measures the amount of information about L conveyed by the word w and has been the subject of a great deal of study in psycholinguistics. In particular, Levy (2008) shows that if the word w and the context c can be reconstructed perfectly from the updated belief state L′, then the amount of information conveyed by w reduces to the surprisal of word w in context c:

    D_{KL}[L' \,\|\, L] = -\log p(w \mid c).    (2)
Importantly for our purposes, the positive evidence term DKL[L′||L] in Eq. 1 alone is unlikely to make useful predictions about cross-linguistic ordering preferences, because surprisal is invariant to reversal of word order across a language as a whole (Levy, 2005; Futrell, 2019): the same surprisal values would be measured for any given language and a language with all the same sentences in reverse order. As such, these metrics are unable to predict any a priori asymmetries in word-order preferences between pre- and post-nominal positions.

3.3 Negative evidence

The new feature of information gain, which has not been presented in previous information-theoretic models of language, is the negative evidence term DKL[L̄′||L], indicating the change in the listener’s belief about L given the negation of the features indicated by word w, a quantity related to extropy (Lad et al., 2015). For example, consider académie/NOUN militaire/ADJ ‘military/ADJ academy/NOUN’ in French. Let L represent a listener’s belief state after having heard the noun académie ‘academy’. Upon hearing the adjective militaire ‘military’, L is partitioned into L′—the portion of L in which militaire is a feature—and L̄′, the portion of L in which militaire is not a feature. Put another way, L̄′ is the probability distribution over non-military academies.

The negative evidence portion of information gain is of primary interest to us because it breaks the symmetry under word-order reversal that we would have if we used the positive evidence term alone. That is, because the sum of surprisals of words w1 and w2 in the context of w1 is the log joint probability of a sequence:

    -\log p(w_1) - \log p(w_2 \mid w_1) = -\log p(w_1, w_2),    (3)

the sum of surprisals of w2 and w1 in the context of w2 necessarily yields the same quantity. Conversely, IG’s negative-evidence value is related to the log probability of w2 conditional on the event of not observing w1, and as such the sum of negative evidence values is not equivalent to the joint surprisal.

Information gain can therefore predict left–right asymmetrical word-order preferences such as the order of adjectives in ANA templates. Further, it maps onto a well-known decision rule for ordering the nodes of decision trees.

3.4 An efficient algorithm

The goal of algorithms such as ID3 is to produce a decision tree which divides a dataset into equal-sized and mutually-exclusive partitions, thereby creating a shallow tree (Quinlan, 1986). While finding the smallest possible binary decision tree is NP-complete (Hyafil and Rivest, 1976), ID3’s locally-optimal approach has proven quite effective at producing shallow trees capable of accurate classification (Dobkin et al., 1996).

By analogy, ordering the adjectives in a noun phrase by maximizing information gain likewise produces a tree with balanced positive and negative partitions at each node. Specifically, adjectives that minimize the entropy of both the positive and negative evidence are placed before adjectives which are less ‘decisive’ at partitioning feature vectors.

4 Methodology

4.1 Data

Our study relies on two types of source data, both extracted from the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Ginter et al., 2017; Zeman et al., 2017): a set of Common Crawl and Wikipedia text data from a variety of languages, automatically parsed according to the Universal Dependencies scheme with UDPipe (Straka and Straková, 2017). First, we extract noun phrases (NPs) containing at least one adjective to populate feature vectors (§4.3). Second, we extract triples, instances of a noun and two dependent adjectives, where the three words are sequential in the surface order and neither the noun nor the adjectives have other dependents.

We restrict triples in this way to minimize the effect that other dependents might have on order preferences. For example, while single-word adjectives tend to precede the noun in English, as in ‘the nice people’, adjectives in larger right-branching phrases often follow: ‘the people nice to us’ (Matthews, 2014), a trend also seen in Romance (Gulordava et al., 2015; Gulordava and Merlo, 2015). Similarly, conjunctions have been shown to weaken or neutralize preferences (Fox and Thuilier, 2012; Rosales Jr. and Scontras, 2019; Scontras et al., 2020).

NPs and triples extracted from the Wikipedia dumps are used to generate feature vectors and to train our regression (§4.4). We use triples from the Common Crawl dumps to perform hold-out accuracy testing.
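As an illustration of this extraction step, the following sketch (ours, under simplifying assumptions; the paper does not publish its pipeline, and the field handling here follows the standard ten-column CoNLL-U layout) counts noun–adjective triples in which a NOUN has exactly two ADJ dependents, the three words are adjacent in the surface order, and none of the three has any other dependent.

```python
from collections import Counter

def read_conllu_sentences(path):
    """Yield sentences from a CoNLL-U file as lists of token dicts."""
    sent = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                if sent:
                    yield sent
                sent = []
                continue
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            # skip multiword tokens (1-2), empty nodes (1.1), malformed heads
            if "-" in cols[0] or "." in cols[0] or not cols[6].isdigit():
                continue
            sent.append({"id": int(cols[0]), "lemma": cols[2].lower(),
                         "upos": cols[3], "head": int(cols[6])})
    if sent:
        yield sent

def extract_triples(path):
    """Count lemma triples (in surface order): a NOUN with exactly two ADJ
    dependents, all three adjacent, and no other dependents on any of them."""
    triples = Counter()
    for sent in read_conllu_sentences(path):
        n_deps = Counter(tok["head"] for tok in sent)  # dependents per head id
        for noun in sent:
            if noun["upos"] != "NOUN":
                continue
            adjs = [t for t in sent if t["upos"] == "ADJ" and t["head"] == noun["id"]]
            if len(adjs) != 2:
                continue
            window = sorted([noun] + adjs, key=lambda t: t["id"])
            ids = [t["id"] for t in window]
            adjacent = ids == list(range(ids[0], ids[0] + 3))
            bare = all(n_deps[t["id"]] == (2 if t is noun else 0) for t in window)
            if adjacent and bare:
                triples[tuple(t["lemma"] for t in window)] += 1
    return triples
```

A parallel pass that keeps every NP containing at least one adjective would supply the counts behind the feature-vector distribution described in §4.3.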
4.2 Normalization

Because our source data are extracted from dumps of automatically-parsed text, they contain a large amount of noise, such as incorrectly assigned syntactic categories, HTML, nonstandard orthography, and so on. To combat this noise, we extract all lemmas with UPOS marked as ADJ and NOUN in all Universal Dependencies (UD) v2.7 corpora (Zeman et al., 2020) for a given language—the idea being that the UD corpora are of higher quality—and include only NPs and triples in which the adjectives and nouns are in the UD lists. All characters are case-normalized, where applicable.

4.3 Feature vectors

Each NP attested in the Wikipedia corpus for a given language corresponds to a feature vector with value 1 in the dimension associated with each adjective or noun lemma. For example, an NP such as “the best room available” generates a vector containing 1 for ‘is-available’, ‘is-best’, and ‘is-room’. The relative count of each NP in the Wikipedia corpus yields a probability distribution on feature vectors. It is this distribution which is transformed by partitioning on each lemma in a triple.

4.4 Evaluation

For a given typological template (AAN, ANA, or NAA) there are two competing variants; our tasks are to (i) predict which of the variants will be attested in a corpus and (ii) show a cross-linguistic consistency in how that prediction comes about.

Because we are limiting our study to the two competing variants within each template, the position of the noun is invariant, leaving only the relative order of the two adjectives to determine the order of a triple. Our problem thus reduces to whether the information gain of the first linear adjective is greater than that of the second.

In the case of AAN and ANA triples, the IG of each adjective is calculated by partitioning the entire set of feature vectors L on each of the two adjectives. In the case of NAA triples, however, IG is calculated by partitioning only those feature vectors which ‘survive’ the initial partition by the noun, and are therefore part of L′. Thus we calculate IG(L, a) before the noun and IG(L′, a) after.

Rather than simply implement the ID3 algorithm and choose adjectives based on their raw information gain, we train a logistic regression to predict surface orders based on the difference of IG between the attested first and second adjective, a method previously used by Morgan and Levy (2016) and Futrell et al. (2020a). The benefits of this approach are two-fold: we are able to account for bias in the distribution of adjectival IGs, and we can more easily deconstruct how strong information gain is as a predictor of adjective order.

Within each template, for each attested triple τ, let π1 be the lexicographically-sorted first permutation of τ and π2 be the second, with α1 being the first linear adjective in π1 and α2 being the first linear adjective in π2. Our dependent variable p is whether π1 is attested in the corpus, and our independent variable is the difference between the information gain of α1 and α2. We train the coefficients β0 and β1 in a logistic regression of the form

    p = \begin{cases} 1, & \text{if } \pi_1 \text{ is attested} \\ 0, & \text{if } \pi_2 \text{ is attested} \end{cases}
    \log \frac{p}{1-p} \sim \beta_0 + \beta_1\,[IG(\alpha_1) - IG(\alpha_2)].    (4)

A positive value for β1 tells us that permutations in which the larger-IG adjective is placed first tend to be attested. The value of β0 tells us whether there is a generalized bias towards a positive or negative IG(π1) − IG(π2). The accuracy we achieve by running the logistic regression on held-out testing data tells us the effectiveness of an IG-based algorithm at predicting adjective order.

4.5 Reporting results

We report results for languages from which at least 5k triples could be analyzed, and for templates representing at least 10% of a language’s triples in UD corpora. The count of analyzable triples for each language is a product of those available in the 2017 CoNLL Shared Task, those with sufficiently large UD v2.7 corpora, and those that meet our extraction requirements (§4.1).

Because we are interested in exploring a cross-linguistic predictor of adjective order, we report macro-average accuracies and β1 coefficients. That is, each language’s accuracy and coefficient are calculated independently and are then averaged. We report both type- and token-accuracy, using the latter in our analysis based on the intuition that the preference for the order of a commonly-occurring triple is stronger than that of a rarer one.
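A minimal sketch of this evaluation (our own illustration; the per-adjective IG scores, the triple representation, and the use of scikit-learn are assumptions rather than the paper's stated implementation) codes, for each attested triple, whether the lexicographically first permutation π1 is the one observed, and regresses that outcome on IG(α1) − IG(α2) as in Eq. 4:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ig_difference(adjs, ig):
    """IG(alpha1) - IG(alpha2), where alpha1/alpha2 are the first adjectives of the
    two permutations pi1/pi2 of a triple; `ig` maps an adjective lemma to its
    information gain (computed on L, or on the post-noun partition L' for NAA)."""
    a1, a2 = sorted(adjs)          # pi1 is the lexicographically sorted permutation
    return ig[a1] - ig[a2]

def fit_and_score(train, test, ig):
    """Fit the Eq. 4 regression on training triples; score held-out triples.
    Each triple is (adjectives_in_attested_order, count)."""
    def design(triples):
        x = np.array([[ig_difference(adjs, ig)] for adjs, _ in triples])
        y = np.array([int(list(adjs) == sorted(adjs)) for adjs, _ in triples])  # p
        w = np.array([count for _, count in triples])  # frequency weights
        return x, y, w

    x_tr, y_tr, w_tr = design(train)
    x_te, y_te, w_te = design(test)
    # note: scikit-learn regularizes by default; plain Eq. 4 is an unpenalized GLM
    model = LogisticRegression().fit(x_tr, y_tr, sample_weight=w_tr)
    beta0, beta1 = model.intercept_[0], model.coef_[0, 0]
    accuracy = model.score(x_te, y_te, sample_weight=w_te)
    return beta0, beta1, accuracy
```

Keeping the sample weights corresponds to token accuracy in the sense of §4.5; dropping them, so that each distinct triple counts once, corresponds to type accuracy.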
AAN
    mean β1:              18.591  [15.740, 21.443]
    mean token accuracy:  0.656   [0.630, 0.683]
    mean type accuracy:   0.645   [0.616, 0.674]

    language     n      β1      P      token acc.  type acc.
    Bulgarian    13018  20.058  0.000  0.650       0.649
    Chinese       5909  18.604  0.000  0.724       0.766
    Croatian     15555  21.246  0.000  0.666       0.634
    Czech        27899  28.207  0.000  0.671       0.665
    Danish       11226  17.506  0.000  0.786       0.770
    Dutch        11279  12.201  0.000  0.609       0.605
    English      23311  22.076  0.000  0.643       0.647
    Finnish      12605  15.342  0.000  0.655       0.644
    German       16391  16.210  0.000  0.601       0.606
    Greek         5506  18.383  0.000  0.631       0.643
    Latvian       5290  15.826  0.000  0.594       0.551
    Russian      25397  25.697  0.000  0.658       0.651
    Slovak       11933  25.935  0.000  0.700       0.651
    Slovenian    18859  28.192  0.000  0.670       0.661
    Swedish      10937  11.462  0.000  0.717       0.711
    Turkish      12115  12.579  0.000  0.576       0.577
    Ukrainian    11474  15.949  0.000  0.593       0.592
    Urdu          6432   9.170  0.000  0.673       0.593

Table 1: Results by template and language: n triples analyzed, regression coefficient β1 and P-value, and test accuracies. Means with 95% confidence intervals shown for each template.
[Figure: held-out accuracy (y-axis, 0.4–1.0) plotted against the regression coefficient β1 (x-axis, −10 to 70) for each language, with separate markers for the AAN, ANA, and NAA templates; each point is labelled with a language code.]
          n    rate    confidence interval
    AAN   18   0.017   [0.012, 0.022]
    ANA   13   0.007   [0.002, 0.011]
    NAA   13   0.022   [0.013, 0.032]
    all   44   0.016   [0.012, 0.020]

Table 3: Macro-average rate of adjectives attested in both possible orders within each template, showing n languages, rate of attestation, and 95% confidence intervals.

6.2 Asymmetries

The preference for one variant of an ANA triple over the other is an asymmetry without a straightforward explanation in a distance-based model; there is no clear mapping from ANA onto the other templates, which means that an adjective’s relative distance to the noun is not informative. Our algorithm is novel in that the placement of the adjectives is governed by greedy IG, not distance to the noun—an innovation that allows us to break the symmetry between the adjectives in ANA triples. Similarly, IG makes no a priori prediction as to whether a mirror or same order will emerge between AAN and NAA triples: both pre- and post-nominal behavior is a product of ordering adjectives such that information gain is maximized, and IG itself is fundamentally derived from the distribution of adjectives and nouns that populate a language’s possible feature vectors for conveying meaning.

Another left–right asymmetry that has been posited in the linguistics literature holds that dependents placed before the head in a surface realization (e.g., the adjectives in an AAN triple) follow a more rigid ordering than those placed after (e.g., the adjectives in an NAA triple; Hawkins, 1983). Both noun modifiers in general and adjectives specifically have been reported to follow this pattern, with a largely-universal pre-nominal ordering and a mirror, same, or ‘free’ post-nominal order (Hetzron, 1978). There is as yet no large-scale empirical evidence for this claim, though Trainin and Shetreet (2021) suggest that Hebrew NAA order preferences may be weaker than English AAN for a restricted set of adjective classes.

In an effort to empirically assess the claim that post-nominal orderings are more flexible than pre-nominal orderings across languages, Table 3 reports the average prevalence of adjective pairs attested in both possible orders (e.g., A1 A2 N and A2 A1 N, where N can be any noun) within each template in our dataset. At 95% confidence the difference between AAN and NAA does not reach significance, though the rate for ANA is significantly lower than the other two. More generally, the mean rate of just 1.6% across templates reinforces the notion that ordering preferences are quite robust regardless of template, at least for our normalized triples from the languages analyzed here.

6.3 Ablation

Equation 1 defines information gain as the conditioned sum of two elements, the positive evidence DKL[L′||L] and the negative evidence DKL[L̄′||L]. The positive evidence alone is akin to surprisal, a well-studied quantity in psycholinguistics (§3.2), while the negative evidence is related to extropy (§3.3). By ablating the IG formulation into its two terms, we can show empirically that the proportionally-combined positive and negative evidence yields more accurate and consistent results than either of the two constituent terms alone.

Table 4 shows the mean accuracy and the polarity proportion of the β1 coefficient across languages and templates. The polarity of β1 tells us whether maximizing IG (positive) or minimizing IG (negative) is the better strategy. Thus a polarity percentage close to 0 or 1 indicates more consistent behavior across templates.

For example, while the accuracy of using only positive evidence, DKL[L′||L], for AAN triples is 0.565, that accuracy is realized due to a 0.000 rate of positive β1 coefficients—that is, the 56.5% accuracy is achieved by minimizing IG, placing the adjective with the lower IG first. On the other hand, while using only positive evidence to predict NAA triples yields the same accuracy, 0.565, the coefficient polarity proportion of 0.769 means that, in most NAA cases, IG should be maximized. The three templates together reflect a modest accuracy (0.566) and an inconsistent coefficient polarity proportion (0.273).

Using only negative evidence, DKL[L̄′||L], yields even lower accuracies and coefficients that are just as inconsistent as those from positive evidence alone. The accuracy across templates is little better than chance at 0.535, and the average coefficient polarity proportion of 0.273 likewise demonstrates that using negative evidence alone does not produce consistent behavior across templates.
                   accuracy                        proportion of positive β1
                   AAN    ANA    NAA    all        AAN    ANA    NAA    all
    DKL[L′||L]     0.565  0.567  0.565  0.566      0.000  0.154  0.769  0.273
    DKL[L̄′||L]     0.533  0.548  0.526  0.535      0.167  0.231  0.462  0.273
    IG             0.657  0.737  0.680  0.687      1.000  0.769  1.000  0.932

Table 4: Ablation on accuracy and the proportion of positive coefficients for positive evidence (DKL[L′||L]) alone, negative evidence (DKL[L̄′||L]) alone, and the proportionally combined terms (IG). Boldfaced values indicate the highest accuracy or coefficient polarity proportion in each column.
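The ablation can be read as three runs of the same Eq. 4 regression that differ only in the per-adjective score supplied as the predictor. A sketch (ours, with hypothetical score dictionaries rather than the paper's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ablation(train, test, pos_kl, neg_kl, ig):
    """Refit the Eq. 4 regression three times, swapping in the positive-evidence
    term D_KL[L'||L], the negative-evidence term D_KL[L-bar'||L], and full IG.
    `train`/`test` hold (alpha1, alpha2, pi1_attested) tuples; each score dict
    maps an adjective lemma to the corresponding quantity."""
    results = {}
    for name, score in (("positive", pos_kl), ("negative", neg_kl), ("IG", ig)):
        x_tr = np.array([[score[a1] - score[a2]] for a1, a2, _ in train])
        y_tr = np.array([int(p) for _, _, p in train])
        x_te = np.array([[score[a1] - score[a2]] for a1, a2, _ in test])
        y_te = np.array([int(p) for _, _, p in test])
        model = LogisticRegression().fit(x_tr, y_tr)
        results[name] = {"beta1": model.coef_[0, 0],         # polarity: sign of beta1
                         "accuracy": model.score(x_te, y_te)}
    return results
```

Aggregating the signs of the per-language β1 estimates gives the polarity proportions reported in Table 4.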
Richard Futrell. 2019. Information-theoretic locality properties of natural language. In Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019), pages 2–15, Paris, France. Association for Computational Linguistics.

Richard Futrell, William Dyer, and Greg Scontras. 2020a. What determines the order of adjectives in English? Comparing efficiency-based theories using dependency treebanks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2003–2012, Online. Association for Computational Linguistics.

Richard Futrell, Edward Gibson, and Roger P. Levy. 2020b. Lossy-context surprisal: An information-theoretic model of memory effects in sentence processing. Cognitive Science, 44(3):e12814.

Filip Ginter, Jan Hajič, Juhani Luotolahti, Milan Straka, and Daniel Zeman. 2017. CoNLL 2017 shared task – automatically annotated raw texts and word embeddings. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Kristina Gulordava and Paola Merlo. 2015. Structural and lexical factors in adjective placement in complex noun phrases across Romance languages. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 247–257, Beijing, China. Association for Computational Linguistics.

Kristina Gulordava, Paola Merlo, and Benoit Crabbé. 2015. Dependency length minimisation effects in short spans: a large-scale analysis of adjective placement in complex noun phrases. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 477–482, Beijing, China. Association for Computational Linguistics.

Michael Hahn, Judith Degen, Noah Goodman, Daniel Jurafsky, and Richard Futrell. 2018. An information-theoretic explanation of adjective ordering preferences. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society, pages 1766–1772, Madison, WI. Cognitive Science Society.

John A. Hawkins. 1983. Word Order Universals: Quantitative analyses of linguistic structure. Academic Press, New York.

Robert Hetzron. 1978. On the relative order of adjectives. In Hansjakob Seiler, editor, Language Universals, pages 165–184. Gunter Narr Verlag, Tübingen.

Laurent Hyafil and Ronald L. Rivest. 1976. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17.

Otto Jespersen. 1922. Language: its nature and development. George Allen & Unwin Ltd., London.

Zeinab Kachakeche and Gregory Scontras. 2020. Adjective ordering in Arabic: Post-nominal structure and subjectivity-based preferences. In Proceedings of the Linguistic Society of America, volume 5, pages 419–430.

Frank Lad, Giuseppe Sanfilippo, and Gianna Agro. 2015. Extropy: complementary dual of entropy. Statistical Science, 30(1):40–58.

Jun Yen Leung, Guy Emerson, and Ryan Cotterell. 2020. Investigating cross-linguistic adjective ordering tendencies with a latent-variable model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.

Roger Levy. 2005. Probabilistic Models of Word Order and Syntactic Discontinuity. Ph.D. thesis, Stanford University, Stanford, CA.

Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.

Haitao Liu, Chunshan Xu, and Junying Liang. 2017. Dependency distance: A new perspective on syntactic patterns in natural languages. Physics of Life Reviews, 21:171–193.

Robert Malouf. 2000. The order of prenominal adjectives in natural language generation. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 85–92.

J. E. Martin. 1969. Semantic determinants of preferred adjective order. Journal of Verbal Learning and Verbal Behavior, 8:697–704.

Peter Matthews. 2014. The Positions of Adjectives in English. Oxford University Press, New York.

Emily Morgan and Roger Levy. 2016. Abstract knowledge versus direct experience in processing of binomial expressions. Cognition, 157:382–402.

Mohammad Norouzi, Maxwell D. Collins, Matthew Johnson, David J. Fleet, and Pushmeet Kohli. 2015. Efficient non-greedy optimization of decision trees. arXiv:1511.04056 [cs].

J. Ross Quinlan. 1986. Induction of decision trees. Machine Learning, 1(1):81–106.

Cesar Manuel Rosales Jr. and Gregory Scontras. 2019. On the role of conjunction in adjective ordering preferences. Proceedings of the Linguistic Society of America, 4(32):1–12.

Suttera Samonte and Gregory Scontras. 2019. Adjective ordering in Tagalog: A cross-linguistic comparison of subjectivity-based preferences. In Proceedings of the Linguistic Society of America, volume 4, pages 1–13.

Gregory Scontras, Galia Bar-Sever, Zeinab Kachakeche, Cesar Manuel Rosales Jr., and Suttera Samonte. 2020. Incremental semantic restriction and subjectivity-based adjective ordering. Proceedings of Sinn und Bedeutung 24, pages 253–270.

Gregory Scontras, Judith Degen, and Noah D. Goodman. 2017. Subjectivity predicts adjective ordering preferences. Open Mind: Discoveries in Cognitive Science, 1(1):53–65.
Gregory Scontras, Judith Degen, and Noah D. Goodman. 2019. On the grammatical source of adjective ordering preferences. Semantics and Pragmatics, 12(7).

Gary-John Scott. 2002. Stacked adjectival modification and the structure of nominal phrases. In Functional Structure in DP and IP: The Cartography of Syntactic Structures, volume 1, pages 91–210. Oxford University Press, New York.

Yuxin Shi and Gregory Scontras. 2020. Mandarin has subjectivity-based adjective ordering preferences in the presence of ‘de’. In Proceedings of the Linguistic Society of America, volume 5, pages 410–418.

Mihael Simonič. 2018. Functional explanation of adjective ordering preferences using probabilistic programming. Master’s thesis, University of Tübingen.

Richard Sproat and Chilin Shih. 1991. The cross-linguistic distribution of adjective ordering restrictions. In Carol Georgopoulos and Roberta Ishihara, editors, Interdisciplinary Approaches to Language, pages 565–593. Kluwer Academic Publishers, Boston.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver. Association for Computational Linguistics.

Henry Sweet. 1900. A new English grammar, logical and historical, volume 1. Clarendon Press, Oxford.

David Temperley and Daniel Gildea. 2018. Minimizing syntactic dependency lengths: Typological/cognitive universal? Annual Review of Linguistics, 4:1–15.

Juliette Thuilier. 2014. An experimental approach to French attributive adjective syntax. In Christopher Piñón, editor, Empirical Issues in Syntax and Semantics, volume 10 of Experimental Syntax and Semantics, pages 287–304.

Nitzan Trainin and Einat Shetreet. 2021. It’s a dotted blue big star: on adjective ordering in a post-nominal language. Language, Cognition and Neuroscience, 36(3):320–341.

Benjamin Lee Whorf. 1945. Grammatical categories. Language, 21(1):1–11.

Daniel Zeman, Joakim Nivre, ... Abrams, and Anna Zhuravleva. 2020. Universal Dependencies 2.7. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Daniel Zeman, Martin Popel, ..., and Josie Li. 2017. Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19. Association for Computational Linguistics.