[Parse fragment: (NN insurance) (NN company) (PP (IN with) (NP (JJ financial) (NN problems)))]
Here, the prepositional phrase is inside the scope of the determiner, and the noun phrase
has at least some internal structure other than the PP (many linguists would want even
more). However, the latter structure would score badly against the former: the N
nodes,
though reasonable or even superior, are either over-articulations (precision errors) or cross-
ing brackets (both precision and recall errors). When our systems propose alternate analy-
ses along these lines, it will be noted, but in any case it complicates the process of automatic
evaluation.
To be clear about what the dangers are, it is worth pointing out that a system which produced
both analyses above in free variation would score better than one which only produced
the latter. However, like choosing which side of the road to drive on, either convention is
preferable to inconsistency.
While there are issues with measuring parsing accuracy against gold standard treebanks,
it has the substantial advantage of providing hard empirical numbers, and is therefore an
important evaluation tool. We now discuss the specific metrics used in this work.
2.2.2 Unlabeled Brackets
Consider the pair of parse trees shown in figure 2.2 for the sentence

0 the 1 screen 2 was 3 a 4 sea 5 of 6 red 7
The tree in figure 2.2(a) is the gold standard tree from the Penn treebank, while the tree in figure 2.2(b) is an example output of a version of the induction system described in chapter 5. This system doesn't actually label the brackets in the tree; it just produces a
(a) Gold Tree: (S (NP (DT the) (NN screen)) (VP (VBD was) (NP (NP (DT a) (NN sea)) (PP (IN of) (NP (NN red))))))

(b) Predicted Tree: (c (c (c (DT the) (NN screen)) (VBD was)) (c (c (DT a) (NN sea)) (c (IN of) (NN red))))

Figure 2.2: A predicted tree and a gold treebank tree for the sentence, The screen was a sea of red.
nested set of brackets. Moreover, for systems which do label brackets, there is the problem
of knowing how to match up induced symbols with gold symbols. This latter problem is
the same issue which arises in evaluating clusterings, where cluster labels have no inherent
link to true class labels. We avoid both the no-label and label-correspondence problems by
measuring the unlabeled brackets only.
Formally, we consider a labeled tree T to be a set of labeled constituent brackets, one
for each node n in the tree, of the form (x : i, j), where x is the label of n, i is the index
of the left extent of the material dominated by n, and j is the index of the right extent of
the material dominated by n. Terminal and preterminal nodes are excluded, as are non-
terminal nodes which dominate only a single terminal. For example, the gold tree (a)
consists of the labeled brackets:
Constituent Material Spanned
(NP : 0, 2) the screen
(NP : 3, 5) a sea
(PP : 5, 7) of red
(NP : 3, 7) a sea of red
(VP : 2, 7) was a sea of red
(S : 0, 7) the screen was a sea of red
From this set of labeled brackets, we can define the corresponding set of unlabeled brackets:

brackets(T) = {⟨i, j⟩ : ∃x s.t. (x : i, j) ∈ T}
Note that even if there are multiple labeled constituents over a given span, there will be only
a single unlabeled bracket in this set for that span. The definitions of unlabeled precision (UP) and recall (UR) of a proposed corpus P = [P_i] against a gold corpus G = [G_i] are:
UP(P, G) ≡ Σ_i |brackets(P_i) ∩ brackets(G_i)| / Σ_i |brackets(P_i)|

UR(P, G) ≡ Σ_i |brackets(P_i) ∩ brackets(G_i)| / Σ_i |brackets(G_i)|
In the example above, both trees have 6 brackets, with one mistake in each direction, giving
a precision and recall of 5/6.
As a synthesis of these two quantities, we also report unlabeled F1, their harmonic mean:

UF1(P, G) = 2 / (UP(P, G)^-1 + UR(P, G)^-1)
Note that these measures differ from the standard PARSEVAL measures (Black et al. 1991)
over pairs of labeled corpora in several ways: multiplicity of brackets is ignored, brackets
of span one are ignored, and bracket labels are ignored.
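To make the computation concrete, the following minimal Python sketch (not from the thesis; function and variable names are illustrative) computes UP, UR, and UF1 from per-sentence bracket sets, using the bracket sets of the two trees in figure 2.2.

```python
def unlabeled_prf(proposed, gold):
    """Corpus-level unlabeled precision, recall, and F1 over bracket sets.

    proposed, gold: parallel lists of sets of (i, j) span tuples, one set per
    sentence (span-one and preterminal brackets are excluded beforehand).
    """
    match = sum(len(p & g) for p, g in zip(proposed, gold))
    up = match / sum(len(p) for p in proposed)
    ur = match / sum(len(g) for g in gold)
    uf1 = 2 * up * ur / (up + ur)
    return up, ur, uf1

# Bracket sets for the trees in figure 2.2 ("the screen was a sea of red").
gold_22      = {(0, 7), (0, 2), (2, 7), (3, 7), (3, 5), (5, 7)}
predicted_22 = {(0, 7), (0, 3), (0, 2), (3, 7), (3, 5), (5, 7)}

print(unlabeled_prf([predicted_22], [gold_22]))  # -> (0.833..., 0.833..., 0.833...)
```

As in the worked example above, five of the six brackets on each side match, giving precision and recall of 5/6.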
2.2.3 Crossing Brackets and Non-Crossing Recall
Consider the pair of trees in figure 2.3(a) and (b), for the sentence a full four-color page in newsweek will cost 100,980.³
There are several precision errors: due to the flatness of the gold treebank, the analysis inside the NP a full four-color page creates two incorrect brackets in the proposed tree. However, these two brackets do not actually cross, or contradict, any brackets in the gold tree. On the other hand, the bracket over the verb group will cost does contradict the gold tree's VP node. Therefore, we define several additional measures which count as mistakes only contradictory brackets. We write b ∼ S for an unlabeled bracket b and a set of unlabeled brackets S if b does not cross any bracket b′ ∈ S. Unlabeled non-crossing precision (UNCP) and recall (UNCR) are then defined as

UNCP(P, G) ≡ Σ_i |{b ∈ brackets(P_i) : b ∼ brackets(G_i)}| / Σ_i |brackets(P_i)|

UNCR(P, G) ≡ Σ_i |{b ∈ brackets(G_i) : b ∼ brackets(P_i)}| / Σ_i |brackets(G_i)|
and unlabeled non-crossing F1 is defined as their harmonic mean as usual. Note that these measures are more lenient than UP/UR/UF1. Where the former metrics count all proposed analyses of structure inside underspecified gold structures as wrong, these measures count

³ Note the removal of the $ from what was originally $ 100,980 here.
(a) Gold Tree: (S (NP (NP (DT a) (JJ full) (JJ four-color) (NN page)) (PP (IN in) (NP (NNP newsweek)))) (VP (MD will) (VP (VB cost) (NP (CD 100,980)))))

(b) Predicted Tree: (c (c (c (DT a) (c (JJ full) (c (JJ four-color) (NN page)))) (c (IN in) (NNP newsweek))) (c (c (MD will) (VB cost)) (CD 100,980)))

Figure 2.3: A predicted tree and a gold treebank tree for the sentence, A full, four-color page in Newsweek will cost $100,980.
all such analyses as correct. The truth usually appears to be somewhere in between.
Another useful statistic is the crossing brackets rate (CB), the average number of guessed
brackets per sentence which cross one or more gold brackets:
CB(P, G) ≡ Σ_i |{b ∈ brackets(P_i) : ¬(b ∼ brackets(G_i))}| / |P|
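A sketch of how these crossing-based measures might be computed follows (illustrative Python, not the thesis' code; the helper names crosses and consistent are my own).

```python
def crosses(b1, b2):
    """True if spans b1 = (i, j) and b2 = (k, l) overlap without nesting."""
    i, j = b1
    k, l = b2
    return (i < k < j < l) or (k < i < l < j)

def consistent(b, brackets):
    """b ~ brackets: b crosses no bracket in the set."""
    return not any(crosses(b, other) for other in brackets)

def non_crossing_scores(proposed, gold):
    """Unlabeled non-crossing precision and recall over parallel corpora."""
    uncp_num = sum(sum(consistent(b, g) for b in p) for p, g in zip(proposed, gold))
    uncr_num = sum(sum(consistent(b, p) for b in g) for p, g in zip(proposed, gold))
    uncp = uncp_num / sum(len(p) for p in proposed)
    uncr = uncr_num / sum(len(g) for g in gold)
    return uncp, uncr

def crossing_brackets_rate(proposed, gold):
    """Average number of proposed brackets per sentence that cross a gold bracket."""
    crossing = sum(sum(not consistent(b, g) for b in p) for p, g in zip(proposed, gold))
    return crossing / len(proposed)
```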
2.2.4 Per-Category Unlabeled Recall
Although the proposed trees will either be unlabeled or have labels with no inherent link
to gold treebank labels, we can still report a per-label recall rate for each label in the gold
label vocabulary. For a gold label x, that category's labeled recall rate (LR) is defined as

LR(x, P, G) ≡ Σ_i |{(x : j, k) ∈ G_i : k > j + 1, ⟨j, k⟩ ∈ brackets(P_i)}| / Σ_i |{(x : j, k) ∈ G_i : k > j + 1}|
In other words, we take all occurrences of nodes labeled x in the gold treebank which
dominate more than one terminal. Then, we count the fraction which, as unlabeled brackets,
have a match in their corresponding proposed tree.
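A minimal sketch of this per-category computation, assuming gold trees are given as sets of labeled (label, i, j) brackets and proposed trees as sets of unlabeled (i, j) spans (names illustrative):

```python
def per_label_recall(label, proposed, gold_labeled):
    """Recall of gold brackets bearing `label`, matched as unlabeled spans.

    proposed:     list of sets of (i, j) spans, one per sentence.
    gold_labeled: list of sets of (x, i, j) labeled brackets, one per sentence.
    Only gold brackets spanning more than one terminal (j > i + 1) are counted.
    """
    relevant = [(s, (i, j))
                for s, g in enumerate(gold_labeled)
                for (x, i, j) in g
                if x == label and j > i + 1]
    if not relevant:
        return 0.0
    matched = sum(span in proposed[s] for s, span in relevant)
    return matched / len(relevant)
```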
2.2.5 Alternate Unlabeled Bracket Measures
In some sections, we report results according to an alternate unlabeled bracket measure,
which was originally used in earlier experiments. The alternate unlabeled bracket precision (UP′) and recall (UR′) are defined as
UP′(P, G) ≡ (1/|P|) Σ_i (|brackets(P_i) ∩ brackets(G_i)| - 1) / (|brackets(P_i)| - 1)

UR′(P, G) ≡ (1/|P|) Σ_i (|brackets(P_i) ∩ brackets(G_i)| - 1) / (|brackets(G_i)| - 1)
with alternate F1 (UF′1) defined as their harmonic mean, as usual. In the rare cases where it occurred, a ratio of 0/0 was taken to be equal to 1. These alternate measures do not count the top bracket as a constituent, since, like span-one constituents, all well-formed trees contain the top bracket. This exclusion tended to lower the scores. On the other hand, the scores
were macro-averaged at the sentence level, which tended to increase the scores. The net
differences were generally fairly slight, but for the sake of continuity we report some older
results by this metric.
2.2.6 EVALB
For comparison to earlier work which tested on the ATIS corpus using the EVALB program
with the supplied unlabeled evaluation settings, we report (though only once) the results
of running our predicted and gold versions of the ATIS sentences through EVALB (see
section 5.3). The difference between the measures above and the EVALB program is that
the program has a complex treatment of multiplicity of brackets, while the measures above
simply ignore multiplicity.
2.2.7 Dependency Accuracy
The models in chapter 6 model dependencies, linkages between pairs of words, rather than
top-down nested structures (although certain equivalences do exist between the two repre-
sentations, see section 6.1.1). In this setting, we view trees not as collections of constituent
brackets, but rather as sets of word pairs. The general meaning of a dependency pair is that
one word either modifies or predicates the other word. For example, in the screen was a
sea of red, we get the following dependency structure:
[Dependency diagram over the tagged sentence: (DT the) (NN screen) (VBD was) (DT a) (NN sea) (IN of) (NN red)]
Σ_x p(x) log (p(x) / q(x))
The lowest divergence (highest similarity) pairs are primarily of two types. First, there are
pairs like ⟨VBD, VBZ⟩ (past vs. 3sg present tense finite verbs) and ⟨NN, NNS⟩ (singular vs. plural common nouns) where the original distinction was morphological, with minimal distributional reflexes. Second, there are pairs like ⟨DT, PRP$⟩ (determiners vs. possessive
Tag Top Linear Contexts by Frequency
CC NNP NNP), NN NN), NNS NNS), CD CD), NN JJ)
CD IN CD), CD IN), IN NN), IN NNS), TO CD)
DT IN NN), IN JJ), IN NNP), NN), VB NN)
EX VBZ), VBP), IN VBZ), VBD), CC VBZ)
FW NNP NNP), NN FW), DT FW), FW NN), FW FW)
IN NN DT), NN NNP), NNS DT), NN NN), NN JJ)
JJR DT NN), IN IN), RB IN), IN NNS), VBD IN)
JJS DT NN), IN CD), POS NN), DT NNS), DT JJ)
JJ DT NN), IN NNS), IN NN), JJ NN), DT NNS)
LS VB), JJ), IN), NN), PRP)
MD NN VB), PRP VB), NNS VB), NNP VB), WDT VB)
NNPS NNP NNP), NNP ), NNP VBD), NNP IN), NNP CC)
NNP NNP NNP), IN NNP), NNP), NNP VBD), DT NNP)
NNS JJ IN), JJ ), NN IN), IN IN), NN )
NN DT IN), JJ IN), DT NN), NN IN), JJ )
PDT IN DT), VB DT), DT), RB DT), VBD DT)
POS NNP NN), NN NN), NNP JJ), NNP NNP), NN JJ)
PR$ IN NN), IN JJ), IN NNS), VB NN), VBD NN)
PRP VBD), VBP), VBZ), IN VBD), VBD VBD)
RBR NN ), DT JJ), RB JJ), RB IN), NN IN)
RBS DT JJ), POS JJ), CC JJ), PR$ JJ), VBZ JJ)
RB MD VB), NN IN), RB IN), VBZ VBN), VBZ JJ)
RP VB DT), VBN IN), VBD IN), VB IN), VBD DT)
SYM IN), VBZ), NN), JJ), VBN)
TO NN VB), NNS VB), VBN VB), VBD VB), JJ VB)
UH PRP), DT), ), UH), UH )
VBD NNP DT), NN DT), NN VBN), NN IN), NNP PRP)
VBG IN DT), NN DT), DT NN), IN NNS), IN NN)
VBN NN IN), VBD IN), NNS IN), VB IN), RB IN)
VBP NNS VBN), NNS RB), PRP RB), NNS DT), NNS IN)
VBZ NN VBN), NN RB), NN DT), NNP VBN), NNP DT)
VB TO DT), TO IN), MD DT), MD VBN), TO JJ)
WDT NN VBZ), NNS VBP), NN VBD), NNP VBZ), NN MD)
W$ NNP NN), NNP NNS), NN NN), NNS NNS), NNP JJ)
WP NNS VBP), NNP VBD), NNP VBZ), NNS VBD), NN VBZ)
WRB NN DT), NN PRP), DT), PRP), NN NNS)
Figure 3.1: The most frequent left/right tag context pairs for the part-of-speech tags in the
Penn Treebank.
pronouns) and ⟨WDT, WP⟩ (wh-determiners vs. wh-pronouns) where the syntactic role is truly different at some deep level, but where the syntactic peculiarities of English prevent there from being an easily detected distributional reflex of that difference. For example, English noun phrases can begin with either a DT like the in the general idea or a PRP$ like his in his general idea. However, they both go in the same initial position and cannot co-occur.⁴ However, in general, similar signatures do reflect broadly similar syntactic functions.
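The similarity measure behind these rankings is the Jensen-Shannon divergence between context signatures; a minimal sketch follows (illustrative Python; signatures are assumed to be dictionaries mapping (left tag, right tag) contexts to probabilities, and the toy sequences and numbers are made up).

```python
from math import log

def kl(p, q):
    """KL divergence sum_x p(x) log(p(x)/q(x)), taken over p's support."""
    return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two context signatures."""
    contexts = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in contexts}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy signatures over (left tag, right tag) contexts for two sequences.
sig_dt_nn    = {("IN", "VBD"): 0.5, ("#", "VBD"): 0.3, ("VBD", "#"): 0.2}
sig_dt_jj_nn = {("IN", "VBD"): 0.4, ("#", "VBD"): 0.4, ("VBD", "#"): 0.2}
print(js_divergence(sig_dt_nn, sig_dt_jj_nn))
```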
One might hope that this correlation between similarity of local linear context signa-
tures and syntactic function would extend to units of longer length. For example, DT JJ NN
and DT NN, both noun phrases, might be expected to have similar signatures. Figure 3.2
shows the top pairs of multi-tag subsequences by the same signature divergence metric.⁵
Of course, linguistic arguments of syntactic similarity involve more than linear con-
text distributions. For one, traditional argumentation places more emphasis on potential
substitutability (what contexts items can be used in) and less emphasis on empirical substi-
tutability (what contexts they are used in) (Radford 1988). We might attempt to model this
in some way, such as by flattening the empirical distribution over context counts to blunt the effects of non-uniform empirical usage of grammatical contexts. For example, we could
use as our context signatures the distribution which is uniform over observed contexts, and
zero elsewhere.
Another difference between linear context distributions and traditional linguistic no-
tions of context is that traditional contexts refer to the surrounding high-level phrase struc-
ture. For example, the subsequence factory payrolls in the sentence below is, linearly,
followed by fell (or, at the tag level, VBD). However, in the treebank parse
⁴ Compare languages such as Italian where the PRP$ would require a preceding DT, as in la sua idea, and where this distributional similarity would not appear.
⁵ The list of candidates was restricted to pairs of items each of length at most 4 and each occurring at least 50 times in the treebank; otherwise the top examples are mostly long singletons with chance zero divergence.
Rank Tag Pairs Sequence Pairs
1 VBZ, VBD ) NNP NNP, NNP NNP NNP )
2 DT, PRP$ ) DT JJ NN IN, DT NN IN )
3 NN, NNS ) NNP NNP NNP NNP, NNP NNP NNP )
4 WDT, WP ) DT NNP NNP, DT NNP )
5 VBG, VBN ) IN DT JJ NN, IN DT NN )
6 VBP, VBD ) DT JJ NN, DT NN NN )
7 VBP, VBZ ) DT JJ NN, DT NN )
8 EX, PRP ) IN JJ NNS, IN NNS )
9 POS, WP$ ) IN NN IN, IN DT NN IN )
10 RB, VBN ) IN NN, IN JJ NN )
11 CD, JJ ) DT JJ NN NN, DT NN NN )
12 NNPS, NNP ) IN NNP, IN NNP NNP )
13 CC, IN ) IN JJ NNS, IN NN NNS )
14 JJS, JJR ) NN IN DT, NN DT )
15 RB, VBG ) IN DT NNP NNP, IN DT NNP )
16 JJR, JJ ) IN DT NN IN, IN NNS IN )
17 JJR, VBG ) NNP NNP POS, NNP POS )
18 CC, VBD ) NNP NNP IN, NNP IN )
19 JJR, VBN ) TO VB DT, TO DT )
20 DT, JJ ) IN NN IN, IN NNS IN )
21 CD, VBG ) NNS MD, NN MD )
22 LS, SYM) JJ NNS, JJ NN NNS )
23 NN, JJ ) JJ NN NN, JJ JJ NN )
24 VBG, JJ ) NN NNS, JJ NNS )
25 JJR, RBR ) PRP VBZ, PRP VBD )
26 CC, VBZ ) NN IN NNP, NN IN NNP NNP )
27 CC, RB ) NNP NNP CC, NNP CC )
28 DT, CD ) NN VBZ, NN VBD )
29 NN, NNP ) IN NNP NNP NNP, IN NNP NNP )
30 VBG, VBD ) IN DT JJ NN, IN DT NN NN )
31 CC, VBG ) DT JJ NNS, DT NNS )
32 TO, CC ) JJ NN, JJ JJ NN )
33 WRB, VBG ) DT JJ JJ, PR$ JJ )
34 CD, NNS ) VBZ DT, VBD DT )
35 IN, VBD ) DT JJ JJ, DT JJ )
36 RB, NNS ) CC NNP, CC NNP NNP )
37 RP, JJR ) JJ NN, JJ NN NN )
38 VBZ, VBG ) DT NNP NN, DT NN NN )
39 RB, RBR ) NN IN, NN NN IN )
40 RP, RBR ) NN IN DT, NNS IN DT )
Figure 3.2: The most similar part-of-speech pairs and part-of-speech sequence pairs, based
on the Jensen-Shannon divergence of their left/right tag signatures.
(S (NP (NN factory) (NNS payrolls)) (VP (VBD fell) (PP (IN in) (NP (NNP september)))))
the corresponding NP node in the tree is followed by a verb phrase. Since we do have
gold-standard parse trees for the sentences in the Penn Treebank, we can do the following
experiment. For each constituent node x in each treebank parse tree t, we record the yield
of x as well as its context, for two definitions of context. First, we look at the local linear context as before. Second, we define the left context of x to be the left sibling of the lowest ancestor of x (possibly x itself) which has a left sibling, or the sentence boundary if x is sentence-initial. We define the right context symmetrically. For example, in the parse tree above, factory payrolls is a noun phrase whose lowest right sibling is the VP node, and whose lowest left sibling is the beginning of the sentence. This is the local hierarchical context. Figure 3.3 shows the most similar pairs of frequent sequences according to Jensen-Shannon divergence between signatures for these two definitions of context. Since we only took counts for tree nodes x, these lists only contain sequences which are frequently constituents. The lists are relatively similar, suggesting that the patterns detected by the two definitions of context are fairly well correlated, supporting the earlier assumption that the local linear context should be largely sufficient. This correlation is fortunate: some of the methods we will investigate are greatly simplified by the ability to appeal to linear context when hierarchical context might be linguistically more satisfying (see chapter 5).
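A sketch of the two context definitions over a simple tree data structure (the Node class and the example tree below are illustrative, not the thesis' own representation):

```python
BOUNDARY = "#"

class Node:
    def __init__(self, label, children=None, word=None):
        self.label, self.word = label, word
        self.children = children or []
        self.parent = None
        for c in self.children:
            c.parent = self

def leaves(node):
    return [node] if not node.children else [l for c in node.children for l in leaves(c)]

def linear_context(root, node):
    """Tags immediately left and right of the node's yield (boundary marker at edges)."""
    all_leaves = leaves(root)
    node_leaves = set(leaves(node))
    positions = [i for i, l in enumerate(all_leaves) if l in node_leaves]
    i, j = positions[0], positions[-1] + 1
    left = all_leaves[i - 1].label if i > 0 else BOUNDARY
    right = all_leaves[j].label if j < len(all_leaves) else BOUNDARY
    return left, right

def hierarchical_context(node):
    """Left/right sibling of the lowest ancestor (or the node) that has such a sibling."""
    def side(n, offset):
        while n.parent is not None:
            sibs = n.parent.children
            k = sibs.index(n) + offset
            if 0 <= k < len(sibs):
                return sibs[k].label
            n = n.parent
        return BOUNDARY
    return side(node, -1), side(node, +1)

# (S (NP (NN factory) (NNS payrolls)) (VP (VBD fell) (PP (IN in) (NP (NNP september)))))
np = Node("NP", [Node("NN", word="factory"), Node("NNS", word="payrolls")])
vp = Node("VP", [Node("VBD", word="fell"),
                 Node("PP", [Node("IN", word="in"), Node("NP", [Node("NNP", word="september")])])])
s = Node("S", [np, vp])
print(linear_context(s, np))       # ('#', 'VBD')
print(hierarchical_context(np))    # ('#', 'VP')
```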
A final important point is that traditional linguistic argumentation for constituency goes far beyond distributional facts (substitutability). Some arguments, like the tendency of targets of dislocation to be constituents, might have distributional correlates. For exam-
ple, dislocatable sequences might be expected to occur frequently at sentence boundary
contexts, or have high context entropy. Other arguments for phrasal categories, like those
Rank Constituent Sequences by Linear Context Constituent Sequences by Hierarchical Context
1 NN NNS, JJ NNS ) NN NNS, JJ NNS )
2 IN NN, IN DT NN ) IN NN, IN DT NN )
3 DT JJ NN, DT NN ) IN DT JJ NN, IN JJ NNS )
4 DT JJ NN, DT NN NN ) VBZ VBN, VBD VBN )
5 IN DT JJ NN, IN DT NN ) NN NNS, JJ NN NNS )
6 NN NNS, JJ NN NNS ) DT JJ NN NN, DT NN NN )
7 IN JJ NNS, IN NNS ) IN DT JJ NN, IN DT NN )
8 DT JJ NN NN, DT NN NN ) IN JJ NNS, IN DT NN )
9 NNP NNP POS, NNP POS ) DT JJ NN, DT NN )
10 IN JJ NNS, IN JJ NN ) DT JJ NN, DT NN NN )
11 IN NNP, IN NNP NNP ) IN NNS, IN NN NNS )
12 JJ NNS, JJ NN NNS ) IN NNP, IN NNP NNP )
13 IN DT JJ NN, IN JJ NNS ) IN DT NN, IN NNP )
14 IN NNS, IN NN NNS ) IN JJ NNS, IN JJ NN )
15 IN JJ NNS, IN DT NN ) DT NNP NNP, DT NNP )
16 DT NNP NNP, DT NNP ) IN JJ NNS, IN NNS )
17 JJ NNS, DT NNS ) IN JJ NNS, IN NNP )
18 DT JJ NNS, DT NNS ) VBZ VBN, MD VB )
19 IN JJ NNS, IN NN ) JJ NNS, JJ NN NNS )
20 NN NNS, DT NNS ) IN DT NN NN, IN DT NN )
21 IN DT JJ NN, IN NN ) IN DT NN NN, IN DT NNS )
22 JJ JJ NNS, JJ NN NNS ) IN JJ NNS, IN NN )
23 DT NN POS, NNP NNP POS ) DT JJ NN, JJ NNS )
24 IN NNS, IN JJ NN ) DT NNP NN, DT NN NN )
25 JJ NN, DT JJ NN ) JJ NNS, DT NN NN )
26 IN DT NN NN, IN DT NN ) DT NNS, DT NN NN )
27 IN NN NNS, IN JJ NN ) IN JJ NNS, IN NN NNS )
28 DT NNP NN, DT NN NN ) NN NNS, DT NNS )
29 IN JJ NNS, IN NN NNS ) IN DT NN NN, IN JJ NN )
30 IN NN, IN NNS ) IN DT JJ NN, IN NNP )
31 IN NN, IN JJ NN ) IN DT NN NN, IN NN NNS )
32 JJ NN, DT NN NN ) DT NNP NNP, DT NNP NN )
33 VB DT NN, VB NN ) IN DT NN NN, IN JJ NNS )
34 IN DT NN NN, IN JJ NN ) JJ JJ NNS, JJ NN NNS )
35 DT NN, DT NN NN ) VBD VBN, VBD JJ )
36 DT NNP NNP, DT NNP NN ) IN NN, IN NNP )
37 JJ JJ NNS, JJ NNS ) VB DT NN, VB NN )
38 IN DT JJ NN, IN DT NN NN ) IN NN NNS, IN JJ NN )
39 JJ NN, NN NN ) NN NNS, DT NN NN )
40 DT JJ NNS, JJ NN NNS ) IN NN NNS, IN NNP NNP )
Figure 3.3: The most similar sequence pairs, based on the Jensen-Shannon divergence of
their signatures, according to both a linear and a hierarchical definition of context.
which reference internal consistency (e.g., noun phrases all having a nominal head), are not
captured by distributional similarity, but can potentially be captured in other ways. How-
ever, scanning figure 3.2, it is striking that similar pairs do tend to have similar internal structure: the chief difficulty isn't telling that DT JJ NN IN is somehow similar to DT NN IN, it's telling that neither is a constituent.
Chapter 4
A Structure Search Experiment
A broad division in statistical methods for unsupervised grammar induction is between
structure search methods and parameter search methods. In structure search, the primary
search operator is a symbolic change to the grammar. For example, one might add a pro-
duction to a context-free grammar. In parameter search, one takes a parameterized model
with a fixed topology, and the primary search operator is to nudge the parameters around a continuous space, using some numerical optimization procedure. Most of the time, the optimization procedure is the expectation-maximization algorithm, and it is used to fit a parameterized probabilistic model to the data. A classic instance of this method is estimating the production weights for a PCFG with an a priori fixed set of rewrites.
Of course, the division is not perfect: a parameter search can have symbolic effects, for example by zeroing out certain rewrite probabilities, and a structure search procedure often incorporates parameter search inside each new guess at the symbolic structure.
Nonetheless, the distinction is broadly applicable, and the two approaches have contrast-
ing motivations. We will discuss the potential merits of parameter search methods later
(section 5.1, section 6.1.2), but their disadvantages are easy to see.
First, the historical/empirical stigma: early attempts at parameter search were extremely
discouraging, even when applied to toy problems. Lari and Young (1990) report that, when
using EM to recover extremely simple context-free grammars, the learned grammar would
require several times the number of non-terminals to recover the structure of a target gram-
mar, and even then it would often learn weakly equivalent variants of that target grammar.
When applied to real natural language data, the results were, unsurprisingly, even worse.
Carroll and Charniak (1992) describes experiments running the EM algorithm from ran-
dom starting points, resulting in widely varying grammars of extremely poor quality (for
more on these results, see section 5.1).
Second, parameter search methods all essentially maximize the data likelihood, either conditioned on the model or jointly with the model. Of course, outside of language modeling scenarios, we don't generally care about data likelihood for its own sake: we want our grammars to parse accurately, or be linguistically plausible, or we have some goal extrinsic to the training corpus in front of us. While it's always possible that data likelihood in our model family will correspond to whatever our real goal is, in practice it's not guaranteed, and often demonstrably false. As far as it goes, this objection holds equally well for structure search methods which are guided by data- or model-posterior-likelihood metrics. However, in structure search methods one only needs a local heuristic for evaluating symbolic search actions. This heuristic can be anything we want, whether we understand what it's (greedily) maximizing or not. This property invites an approach to grammar induction which is far more readily available in structure search approaches than in parameter search approaches: dream up a local heuristic, grow a grammar using greedy structure search, and hope for the best. To the extent that we can invent a heuristic that embodies our true goals better than data likelihood, we might hope to win out with structure search.¹
The following chapter is a good faith attempt to engineer just such a structure search system, using the observations in chapter 3. While the system does indeed produce encouragingly linguistically sensible context-free grammars, the structure search procedure turns out to be very fragile and the grammars produced do not successfully cope with the complexities of broad-coverage parsing. Some flaws in our system are solved in various other works; we will compare our system to other structure-search methods in section 4.3. Nonetheless, our experiences with structure search led us to the much more robust parameter search systems presented in later chapters.
¹ Implicit in this argument is the assumption that inventing radical new objectives for parameter search procedures is much harder, which seems to be the case.
4.1 Approach
At the heart of any structure search-based grammar induction system is a method, implicit
or explicit, for deciding how to update the grammar. In this section, we attempt to engineer
a local heuristic which identifies linguistically sensible grammar changes, then use that
heuristic to greedily construct a grammar. The core idea is to use distributional statistics to
identify sequences which are likely to be constituents, to create categories (grammar non-
terminals) for those sequences, and to merge categories which are distributionally similar.
Two linguistic criteria for constituency in natural language grammars motivate our
choices of heuristics (Radford 1988):
1. External distribution: A constituent is a sequence of words which appears in various
structural positions (within larger constituents).
2. Substitutability: A constituent is a sequence of words with (simple) variants which
can be substituted for that sequence.
To make use of these intuitions, we use a local notion of distributional context, as described in chapter 3. Let α be a part-of-speech tag sequence. Every occurrence of α will be in some context x α y, where x and y are the adjacent tags or sentence boundaries. The distribution over contexts in which α occurs is called its signature, which we denote by σ(α).
Criterion 1 regards constituency itself. Consider the tag sequences IN DT NN and IN
DT. The former is a canonical example of a constituent (of category PP), while the latter,
though strictly more common, is, in general, not a constituent. Frequency alone does not
distinguish these two sequences, but Criterion 1 points to a distributional fact which does.
In particular, IN DT NN occurs in many environments. It can follow a verb, begin a sentence,
end a sentence, and so on. On the other hand, IN DT is generally followed by some kind of
a noun or adjective.
This argument suggests that a sequence's constituency might be roughly indicated by the entropy of its signature, H(σ(α)). Entropy, however, turns out to be only a weak indicator of true constituency. To illustrate, figure 4.1 shows the actual most frequent constituents
in the WSJ10 data set (see section 2.1.1), along with their rankings by several other mea-
sures. Despite the motivating intuition of constituents occurring in many contexts, entropy
by itself gives a list that is not substantially better-correlated with the true list than simply
listing sequences by frequency. There are two primary causes for this. One is that uncommon but possible contexts have little impact on the tag entropy value, yet in classical linguistic argumentation, configurations which are less common are generally not taken to be less grammatical.
To correct for the empirical skew in observed contexts, let σ_u(α) be the uniform distribution over the observed contexts for α. This signature flattens out the information about what contexts are more or less likely, but preserves the count of possible contexts. Using the entropy of σ_u(α) instead of the entropy of σ(α) would therefore have the direct effect of boosting the contributions of rare contexts, along with the more subtle effect of boosting the rankings of more common sequences, since the available samples of common sequences will tend to have collected nonzero counts of more of their rare contexts. However, while H(σ(α)) presumably converges to some sensible limit given infinite data, H(σ_u(α)) will not, as noise eventually makes all or most counts non-zero. Let u be the uniform distribution over all contexts. The scaled entropy

H_s(σ(α)) = H(σ(α)) [H(σ_u(α)) / H(u)]

turned out to be a useful quantity in practice.² Multiplying entropies is not theoretically meaningful, but this quantity does converge to H(σ(α)) given infinite (noisy) data. The list for scaled entropy still has notable flaws, mainly relatively low ranks for common NPs, which does not hurt system performance, and overly high ranks for short subject-verb sequences, which does.
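A minimal sketch of the scaled entropy computation (illustrative Python; signatures are assumed to be dictionaries of raw context counts, and the size of the full context vocabulary is passed in to define u):

```python
from math import log2

def entropy(dist):
    """Shannon entropy (bits) of a probability distribution given as a dict."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def normalize(counts):
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}

def scaled_entropy(context_counts, num_possible_contexts):
    """H_s(sigma) = H(sigma) * [H(sigma_u) / H(u)].

    sigma   : empirical signature (normalized context counts)
    sigma_u : uniform over the observed contexts only
    u       : uniform over all possible contexts (assumes at least two)
    """
    sigma = normalize(context_counts)
    sigma_u = {x: 1.0 / len(context_counts) for x in context_counts}
    h_u_all = log2(num_possible_contexts)   # entropy of the uniform distribution u
    return entropy(sigma) * (entropy(sigma_u) / h_u_all)
```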
The other fundamental problem with these entropy-based rankings stems from the context features themselves. The entropy values will change dramatically if, for example, all noun tags are collapsed, or if functional tags are split. This dependence on the tagset

² There are certainly other ways to balance the flattened and unflattened distributions, including interpolation or discounting. We found that other mechanisms were less effective in practice, but little of the following rests crucially on this choice.
Sequence Actual Freq Entropy Scaled Boundary
DT NN 1 2 4 2 1
NNP NNP 2 1 - - 4
CD CD 3 9 - - -
JJ NNS 4 7 3 3 2
DT JJ NN 5 - - - 10
DT NNS 6 - - - 9
JJ NN 7 3 - 7 6
CD NN 8 - - - -
IN NN 9 - - 9 10
IN DT NN 10 - - - -
NN NNS - - 5 6 3
NN NN - 8 - 10 7
TO VB - - 1 1 -
DT JJ - 6 - - -
MD VB - - 10 - -
IN DT - 4 - - -
PRP VBZ - - - - 8
PRP VBD - - - - 5
NNS VBP - - 2 4 -
NN VBZ - 10 7 5 -
RB IN - - 8 - -
NN IN - 5 - - -
NNS VBD - - 9 8 -
NNS IN - - 6 - -
Figure 4.1: Candidate constituent sequences by various ranking functions. Top non-trivial
sequences by actual constituent counts, raw frequency, raw entropy, scaled entropy, and
boundary scaled entropy in the WSJ10 corpus. The upper half of the table lists the ten most
common constituent sequences, while the bottom half lists all sequences which are in the
top ten according to at least one of the rankings.
for constituent identification is very undesirable. One appealing way to remove this dependence is to distinguish only two tags: one for the sentence boundary (#) and another for words. Scaling entropies by the entropy of this reduced signature produces the improved list labeled Boundary. This quantity was not used in practice because, although it is an excellent indicator of NP, PP, and intransitive S constituents, it gives too strong a bias against other constituents, which do not appear so frequently both sentence-initially and sentence-finally. However, the present system is not driven exclusively by the entropy measure used, and duplicating the above rankings more accurately did not always lead to better end results.

In summary, we have a collection of functions of distributional signatures which loosely, but very imperfectly, seem to indicate the constituency of a sequence.
Criterion 2 suggests we then use similarity of distributional signatures to identify when two constituent sequences are of the same constituent type. This seems reasonable enough: NNP and PRP are both NP yields, and occur in similar environments characteristic of NPs. This criterion has a serious flaw: even if our data were actually generated by a PCFG, it need not be the case that all possible yields of a symbol X will have identical distributions. As a concrete example, PRP and NNP differ in that NNP occurs as a subsequence of longer NPs like NNP NNP, while PRP generally doesn't. The context-freedom of a PCFG process doesn't guarantee that all sequences which are possible NP yields have identical distributions; it only guarantees that the NP occurrences of such sequences have identical distributions. Since we generally don't have this kind of information available in a structure search system, at least to start out with, one generally just has to hope that signature similarity will, in practice, still be reliable as an indicator of syntactic similarity. Figure 3.2 shows that if two sequences have similar raw signatures, then they do tend to have similar syntactic behavior. For example, DT JJ NN and DT NN have extremely similar signatures, and both are common noun phrases. Also, NN IN and NN NN IN have very similar signatures, and both are primarily non-constituents.
Given these ideas, section 4.2 discusses a system, called GREEDY-MERGE, whose
grammar induction steps are guided by sequence entropy and interchangeability. The out-
put of GREEDY-MERGE is a symbolic CFG suitable for partial parsing. The rules it learns
appear to be of high linguistic quality (meaning they pass the dubious glance test, see
figures 4.4 and 4.5), but parsing coverage is very low.
4.2 GREEDY-MERGE
GREEDY-MERGE is a precision-oriented system which, to a first approximation, can be seen as an agglomerative clustering process over sequences, where the sequences are taken from increasingly structured analyses of the data. A run of the system shown in figure 4.3 will be used as a concrete example of this process.

We begin with all subsequences occurring in the WSJ10 corpus. For each pair of such sequences, a scaled divergence is calculated as follows:
d(α, β) = D_JS(σ(α), σ(β)) / (H_s(σ(α)) + H_s(σ(β)))

Small scaled divergence between two sequences indicates some combination of similarity between their signatures and high rank according to the scaled entropy constituency heuristic. The pair with the least scaled divergence is selected for merging.³ In this case, the initial top candidates were:
³ We required that the candidates be among the 250 most frequent sequences. The exact threshold was not important, but without some threshold, long singleton sequences with zero divergence were always chosen. This suggests that we need a greater bias towards quantity of evidence in our basic method.
Rank Proposed Merge
1 NN NN NN
2 NNP NNP NNP
3 NN JJ NN
4 NNS NN NNS
5 NNP NNP NNP NNP NNP
6 DT NN DT JJ NN
7 JJ NN NN NN
8 DT PRP$
9 DT DT JJ
10 VBZ VBD
11 NN NNS
12 PRP VBD PRP VBZ
13 VBD MD VB
14 NNS VBP NN VBZ
15 DT NN DT NN NN
16 VBZ VBZ RB
17 NNP NNP NNP NNP
18 DT JJ PRP$
19 IN NN IN DT NN
20 RB RB RB
Note that the top few merges are all linguistically sensible noun phrase or N̄ unit merges. Candidates 8, 10, and 11 are reasonable part-of-speech merges, and lower on the list (19) there is a good prepositional phrase pair. But the rest of the list shows what could easily go wrong in a system like this one. Candidate 9 suggests a strange determiner-adjective grouping, and many of the remaining candidates either create verb-(ad)verb groups or subject-verb groupings instead of the standard verb-object verb phrases. Either of these kinds of merges will take the learned grammar away from the received linguistic standard. While admittedly neither mistake is really devastating provided the alternate analysis is systematic in the learned grammar, this system has no operators for backing out of early mistakes.
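A sketch of the scoring and ranking step described above might look like the following (illustrative Python, reusing js_divergence, scaled_entropy, and normalize from the earlier sketches; the 250-sequence cutoff follows the footnote above, and degenerate single-context signatures are simply skipped):

```python
from itertools import combinations

def rank_merge_candidates(context_counts, num_contexts, frequency, top_k=250):
    """Rank sequence pairs by scaled divergence (smallest = best merge candidate).

    context_counts: dict mapping each tag sequence to a dict of raw context counts.
    frequency:      dict mapping each tag sequence to its corpus frequency.
    Only the top_k most frequent sequences are considered (cf. the footnote).
    """
    frequent = sorted(context_counts, key=lambda s: -frequency[s])[:top_k]
    scored = []
    for a, b in combinations(frequent, 2):
        denom = (scaled_entropy(context_counts[a], num_contexts) +
                 scaled_entropy(context_counts[b], num_contexts))
        if denom == 0:   # degenerate single-context signatures: skip
            continue
        d = js_divergence(normalize(context_counts[a]),
                          normalize(context_counts[b])) / denom
        scored.append((d, a, b))
    return sorted(scored)
```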
At this point, however, only the single pair ⟨NN, NN NN⟩ is selected.

Merging two sequences involves the creation of a single new non-terminal category for those sequences, which rewrites as either sequence. These created categories have arbitrary names, such as z17, though in the following exposition, we give them useful descriptors. In this case, a category z1 is created, along with the rules

z1 → NN
z1 → NN NN
Unary rules are disallowed in this system; a learned unary is instead interpreted as a merge of the parent and child grammar symbols, giving

NN → NN NN

This single rule forms the entire grammar after the first merge, and roughly captures how noun-noun compounding works in English: any sequence of NNs can be parsed as an NN.
P(S|B) = ∏_{⟨i,j⟩ ∈ spans(S)} P(α_ij, x_ij | B_ij) = ∏_{⟨i,j⟩} P(α_ij | B_ij) P(x_ij | B_ij)
The distribution P(α_ij | B_ij) is a pair of multinomial distributions over the set of all possible yields: one for constituents (B_ij = c) and one for distituents (B_ij = d). Similarly for P(x_ij | B_ij) and contexts. The marginal probability assigned to the sentence S is given by summing over all possible bracketings of S: P(S) = Σ_B P(B) P(S|B). Note that this is
(a) Tree-equivalent (b) Binary (c) Crossing

Figure 5.2: Three bracketings of the sentence factory payrolls fell in september. Constituent spans are shown in black. The bracketing in (b) corresponds to the binary parse in figure 5.1; (a) does not contain the ⟨2,5⟩ VP bracket, while (c) contains a ⟨0,3⟩ bracket crossing that VP bracket.
a more severe set of independence assumptions than, say, in a naive-bayes model. There, document positions are filled independently, and the result can easily be an ungrammatical document. Here, the result need not even be a structurally consistent sentence.¹
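As a concrete (and deliberately naive) illustration of the generative score, the sketch below computes P(B) P(S|B) for one sentence and one bracketing, representing a bracketing as the set of spans marked as constituents and treating every other span as a distituent; all names are illustrative, and taking spans(S) to be all spans of length at least one is an assumption rather than the thesis' exact definition.

```python
def spans(n):
    """All spans (i, j) with 0 <= i < j <= n (length-one spans included)."""
    return [(i, j) for i in range(n) for j in range(i + 1, n + 1)]

def ccm_joint(tags, constituent_spans, p_yield, p_context, p_bracketing=1.0):
    """P(B) * product over spans of P(alpha_ij | B_ij) P(x_ij | B_ij) for one sentence.

    tags: list of POS tags; constituent_spans: set of (i, j) spans with B_ij = c
    (every other span counts as a distituent, B_ij = d).
    p_yield[b] and p_context[b] are dicts giving the two multinomials, b in {'c', 'd'}.
    """
    n = len(tags)
    prob = p_bracketing                      # the P(B) factor
    for i, j in spans(n):
        b = 'c' if (i, j) in constituent_spans else 'd'
        alpha = tuple(tags[i:j])             # the yield of the span
        x = (tags[i - 1] if i > 0 else '#',  # the context: adjacent tags or boundary
             tags[j] if j < n else '#')
        prob *= p_yield[b].get(alpha, 0.0) * p_context[b].get(x, 0.0)
    return prob
```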
To induce structure, we run EM over this model, treating the sentences S as observed and the bracketings B as unobserved. The parameters Θ of the model are the constituency-conditional yield and context distributions P(α|b) and P(x|b). If P(B) is uniform over all (possibly crossing) bracketings, then this procedure will be equivalent to soft-clustering with two equal-prior classes.
There is reason to believe that such soft clusterings alone will not produce valuable distinctions, even with a significantly larger number of classes. The distituents must necessarily outnumber the constituents, and so such distributional clustering will result in mostly distituent classes. Clark (2001a) finds exactly this effect, and must resort to a filtering heuristic to separate constituent and distituent clusters. To underscore the difference between the bracketing and labeling tasks, consider figure 5.3. In both plots, each point is a frequent tag sequence, assigned to the (normalized) vector of its context frequencies. Each plot has been projected onto the first two principal components of its respective data set. The left
¹ Viewed as a model generating sentences, this model is deficient, placing mass on yield and context choices which will not tile into a valid sentence, either because specifications for positions conflict or because yields of incorrect lengths are chosen. We might in principle renormalize by dividing by the mass placed on proper sentences and zeroing the probability of improper bracketings. In practice, there does not seem to be an easy way to carry out this computation.
(a) Constituent Types (NP, VP, PP) (b) Constituents ("usually a constituent") vs. Distituents ("rarely a constituent")

Figure 5.3: Clustering vs. detecting constituents. The most frequent yields of (a) three constituent types and (b) constituents and distituents, as context vectors, projected onto their first two principal components. Clustering is effective at labeling, but not detecting, constituents.
plot shows the most frequent sequences of three constituent types. Even in just two dimen-
sions, the clusters seem coherent, and it is easy to believe that they would be found by a
clustering algorithm in the full space. On the right, sequences have been labeled according
to whether their occurrences are constituents more or less of the time than a cutoff (of 0.2).
The distinction between constituent and distituent seems much less easily discernible.
We can turn what at first seems to be distributional clustering into tree induction by confining P(B) to put mass only on tree-equivalent bracketings. In particular, consider P_bin(B), which is uniform over binary bracketings and zero elsewhere. If we take this bracketing distribution, then when we sum over data completions, we will only involve bracketings which correspond to valid binary trees. This restriction is the basis for the next algorithm.
5.2.2 The Induction Algorithm
We now essentially have our induction algorithm. We take P(B) to be P_bin(B), so that all binary trees are equally likely. We then apply the EM algorithm:

E-Step: Find the conditional completion likelihoods P(B|S, Θ) according to the current Θ.

M-Step: Fix P(B|S, Θ) and find the Θ′ which maximizes Σ_B P(B|S, Θ) log P(S, B|Θ′).
LBRANCH 13, RANDOM 30, DEP-PCFG 48, RBRANCH 60, CCM 71, SUP-PCFG 82, UBOUND 87 (F1, percent)

Figure 5.4: Bracketing F1 for various models on the WSJ10 data set.
The completions (bracketings) cannot be efficiently enumerated, and so a cubic dynamic program similar to the inside-outside algorithm is used to calculate the expected counts of each yield and context, both as constituents and distituents (see the details in appendix A.1). Relative frequency estimates (which are the ML estimates for this model) are used to set Θ′.
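For very short sentences, the E-step and M-step can be sketched by brute force, enumerating binary bracketings explicitly instead of using the cubic dynamic program; the following illustrative Python (names, the tiny backoff constant, and the span conventions are mine) is a stand-in for the actual procedure, not a reproduction of it.

```python
from collections import defaultdict

def spans(n):
    return [(i, j) for i in range(n) for j in range(i + 1, n + 1)]

def binary_bracketings(i, j):
    """Span sets (spans of length >= 2) of all binary trees over terminals i..j."""
    if j - i <= 1:
        return [frozenset()]
    out = []
    for k in range(i + 1, j):
        for left in binary_bracketings(i, k):
            for right in binary_bracketings(k, j):
                out.append(frozenset({(i, j)}) | left | right)
    return out

def ccm_em_step(sentences, p_yield, p_context):
    """One brute-force EM iteration for the CCM over short tagged sentences."""
    counts = {'yield': {'c': defaultdict(float), 'd': defaultdict(float)},
              'context': {'c': defaultdict(float), 'd': defaultdict(float)}}
    for tags in sentences:
        n = len(tags)
        feats = {(i, j): (tuple(tags[i:j]),
                          (tags[i - 1] if i > 0 else '#', tags[j] if j < n else '#'))
                 for i, j in spans(n)}
        brs = binary_bracketings(0, n)
        weights = []
        for B in brs:                        # E-step: P(S|B); uniform P(B) cancels
            w = 1.0
            for span, (a, x) in feats.items():
                b = 'c' if span in B else 'd'
                w *= p_yield[b].get(a, 1e-6) * p_context[b].get(x, 1e-6)
            weights.append(w)
        total = sum(weights) or 1.0
        for B, w in zip(brs, weights):       # accumulate expected counts
            for span, (a, x) in feats.items():
                b = 'c' if span in B else 'd'
                counts['yield'][b][a] += w / total
                counts['context'][b][x] += w / total
    def rel_freq(c):                         # M-step: relative-frequency estimates
        z = sum(c.values())
        return {k: v / z for k, v in c.items()}
    return ({b: rel_freq(c) for b, c in counts['yield'].items()},
            {b: rel_freq(c) for b, c in counts['context'].items()})
```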
5.3 Experiments
The experiments that follow used the WSJ10 data set, as described in chapter 2, using the
alternate unlabeled metrics described in section 2.2.5, with the exception of figure 5.15, which uses the standard metrics, and figure 5.6, which reports numbers given by the EVALB
program. The basic experiments do not label constituents. An advantage to having only a
single constituent class is that it encourages constituents of one type to be proposed even
when they occur in a context which canonically holds another type. For example, NPs
and PPs both occur between a verb and the end of the sentence, and they can transfer
constituency to each other through that context.
[Plot: CCM precision, recall, and F1, and DEP-PCFG F1, by bracket span size 2-9]

Figure 5.5: Scores for CCM-induced structures by span size. The drop in precision for span length 2 is largely due to analysis inside NPs which is omitted by the treebank. Also shown is F1 for the induced PCFG. The PCFG shows higher accuracy on small spans, while the CCM is more even.
Figure 5.4 shows the F1 score for various methods of parsing. RANDOM chooses a tree uniformly at random from the set of binary trees.² This is the unsupervised baseline. DEP-PCFG is the result of duplicating the experiments of Carroll and Charniak (1992), using EM to train a dependency-structured PCFG. LBRANCH and RBRANCH choose the left- and right-branching structures, respectively. RBRANCH is a frequently used baseline for supervised parsing, but it should be stressed that it encodes a significant fact about English structure, and an induction system need not beat it to claim a degree of success. CCM is our system, as described above. SUP-PCFG is a supervised PCFG parser trained on a 90-10 split of this data, using the treebank grammar, with the Viterbi parse right-binarized.³ UBOUND is the upper bound of how well a binary system can do against the treebank sentences, which are generally flatter than binary, limiting the maximum precision.
CCM is doing quite well at 71.1%, substantially better than right-branching structure. One common issue with grammar induction systems is a tendency to chunk in a bottom-up fashion. Especially since the CCM does not model recursive structure explicitly, one might be concerned that the high overall accuracy is due to a high accuracy on short-span

² This is different from making random parsing decisions, which gave a higher score of 35%.
³ Without post-binarization, the F1 score was 88.9.
System UP UR F1 CB
EMILE 51.6 16.8 25.4 0.84
ABL 43.6 35.6 39.2 2.12
CDC-40 53.4 34.6 42.0 1.46
RBRANCH 39.9 46.4 42.9 2.18
CCM 55.4 47.6 51.2 1.45

Figure 5.6: Comparative ATIS parsing results.
constituents. Figure 5.5 shows that this is not true. Recall drops slightly for mid-size constituents, but longer constituents are as reliably proposed as short ones. Another effect illustrated in this graph is that, for span 2, constituents have low precision for their recall. This contrast is primarily due to the single largest difference between the system's induced structures and those in the treebank: the treebank does not parse into NPs such as DT JJ NN, while our system does, and generally does so correctly, identifying N̄ units like JJ NN. This overproposal drops span-2 precision. In contrast, figure 5.5 also shows the F1 for DEP-PCFG, which does exhibit a drop in F1 over larger spans.
The top row of figure 5.8 shows the recall of non-trivial brackets, split according to the brackets' labels in the treebank. Unsurprisingly, NP recall is highest, but other categories are also high. Because we ignore trivial constituents, the comparatively low S recall represents only embedded sentences, which are somewhat harder even for supervised systems.
To facilitate comparison to other recent work, figure 5.6 shows the accuracy of our system when trained on the same WSJ data, but tested on the ATIS corpus, and evaluated according to the EVALB program. EMILE and ABL are lexical systems described in (van Zaanen 2000, Adriaans and Haas 1999), both of which operate on minimal pairs of sentences, deducing constituents from regions of variation. CDC-40, from (Clark 2001a), reflects training on much more data (12M words), and is described in section 5.1. The F1 numbers are lower for this corpus and evaluation method.⁴ Still, CCM beats not only RBRANCH (by 8.3%), but the next closest unsupervised system by slightly more.

⁴ The primary cause of the lower F1 is that the ATIS corpus is replete with span-one NPs; adding an extra bracket around all single words raises our EVALB recall to 71.9; removing all unaries from the ATIS gold standard gives an F1 of 63.3%.
Rank Overproposed Underproposed
1 JJ NN NNP POS
2 MD VB TO CD CD
3 DT NN NN NNS
4 NNP NNP NN NN
5 RB VB TO VB
6 JJ NNS IN CD
7 NNP NN NNP NNP POS
8 RB VBN DT NN POS
9 IN NN RB CD
10 POS NN IN DT
Figure 5.7: Constituents most frequently over- and under-proposed by our system.
5.3.1 Error Analysis
Parsing figures can only be a component of evaluating an unsupervised induction system. Low scores may indicate systematic alternate analyses rather than true confusion, and the Penn treebank is a sometimes arbitrary or even inconsistent gold standard. To give a better sense of the kinds of errors the system is or is not making, we can look at which sequences are most often overproposed, or most often underproposed, compared to the treebank parses.

Figure 5.7 shows the 10 most frequently over- and under-proposed sequences. The system's main error trends can be seen directly from these two lists. It forms MD VB verb groups systematically, and it attaches the possessive particle to the right, like a determiner, rather than to the left.⁵ It provides binary-branching analyses within NPs, normally resulting in correct extra N̄ constituents, like JJ NN, which are not bracketed in the treebank. More seriously, it tends to attach post-verbal prepositions to the verb and gets confused by long sequences of nouns. A significant improvement over some earlier systems (both ours and other researchers') is the absence of subject-verb groups, which disappeared when we switched to P_split(B) for initial completions (see section 5.3.6); the more balanced subject-verb analysis had a substantial combinatorial advantage with P_bin(B).

⁵ Linguists have at times argued for both analyses: Halliday (1994) and Abney (1987), respectively.
5.3.2 Multiple Constituent Classes
We also ran the system with multiple constituent classes, using a slightly more complex
generative model in which the bracketing generates a labeling L (a mapping from spans to
label classes C) which then generates the constituents and contexts.
P(S, L, B) = P(B) P(L|B) P(S|L)

P(L|B) = ∏_{⟨i,j⟩ ∈ spans(S)} P(L_ij | B_ij)

P(S|L) = ∏_{⟨i,j⟩ ∈ spans(S)} P(α_ij, x_ij | L_ij) = ∏_{⟨i,j⟩} P(α_ij | L_ij) P(x_ij | L_ij)
The set of labels for constituent spans and distituent spans are forced to be disjoint, so P(L_ij | B_ij) is given by

P(L_ij | B_ij) =
  1             if B_ij = false and L_ij = d
  0             if B_ij = false and L_ij ≠ d
  0             if B_ij = true and L_ij = d
  1/(|C| - 1)   if B_ij = true and L_ij ≠ d
where d is a distinguished distituent-only label, and the other labels are sampled uniformly at each constituent span.

Intuitively, it seems that more classes should help, by allowing the system to distinguish different types of constituents and constituent contexts. However, it seemed to slightly hurt parsing accuracy overall. Figure 5.8 compares the performance for 2 versus 12 classes; in both cases, only one of the classes was allocated for distituents. Overall F1 dropped very slightly with 12 classes, but the category recall numbers indicate that the errors shifted around substantially. PP accuracy is lower, which is not surprising considering that PPs
Classes Tags Precision Recall F1 NP Recall PP Recall VP Recall S Recall
2 Treebank 63.8 80.2 71.1 83.4 78.5 78.6 40.7
12 Treebank 63.6 80.0 70.9 82.2 59.1 82.8 57.0
2 Induced 56.8 71.1 63.2 52.8 56.2 90.0 60.5

Figure 5.8: Scores for the 2- and 12-class model with Treebank tags, and the 2-class model with induced tags.
Class 0 Class 1 Class 2 Class 3 Class 4 Class 5 Class 6
NNP NNP NN VBD DT NN NNP NNP CD CD VBN IN MD VB JJ NN
NN IN NN NN JJ NNS NNP NNP NNP CD NN JJ IN MD RB VB JJ NNS
IN DT NNS VBP DT NNS CC NNP IN CD CD DT NN VBN IN JJ JJ NN
DT JJ NNS VBD DT JJ NN POS NN CD NNS JJ CC WDT VBZ CD NNS
NN VBZ TO VB NN NNS NNP NNP NNP NNP CD CD IN CD CD DT JJ NN JJ IN NNP NN
Figure 5.9: Most frequent members of several classes found.
tend to appear rather optionally and in contexts in which other, easier categories also frequently appear. On the other hand, embedded sentence recall is substantially higher, possibly because of more effective use of the top-level sentences which occur in the sentence-boundary context.

The classes found, as might be expected, range from clearly identifiable to nonsense. Note that simply directly clustering all sequence types into 12 categories based on their local linear distributional signatures produced almost entirely the latter, with clusters representing various distituent types. Figure 5.9 shows several of the 12 classes. Class 0 is the model's distituent class. Its most frequent members are a mix of obvious distituents (IN DT, DT JJ, IN DT, NN VBZ) and seemingly good sequences like NNP NNP. However, there are many sequences of 3 or more NNP tags in a row, and not all adjacent pairs can possibly be constituents at the same time. Class 1 is mainly common NP sequences, class 2 is proper NPs, class 3 is NPs which involve numbers, and class 6 is N̄ sequences, which tend to be linguistically right but unmarked in the treebank. Class 4 is a mix of seemingly good NPs, often from positions like VBZ-NN where they were not constituents, and other sequences that share such contexts with otherwise good NP sequences. This is a danger of not jointly modeling yield and context, and of not modeling any kind of recursive structure: our model cannot learn that a sequence is a constituent only in certain contexts (the best we can hope for is that such contexts will be learned as strong distituent contexts). Class 5 is mainly composed of verb phrases and verb groups. No class corresponded neatly to PPs: perhaps because they have no signature contexts. The 2-class model is effective at identifying them
only because they share contexts with a range of other constituent types (such as NPs and
VPs).
5.3.3 Induced Parts-of-Speech
A reasonable criticism of the experiments presented so far, and some other earlier work,
is that we assume treebank part-of-speech tags as input. This criticism could be two-fold.
First, state-of-the-art supervised PCFGs do not perform nearly so well with their input
delexicalized. We may be reducing data sparsity and making it easier to see a broad picture
of the grammar, but we are also limiting how well we can possibly do. It is certainly worth
exploring methods which supplement or replace tagged input with lexical input. However,
we address here the more serious criticism: that our results stem from clues latent in the
treebank tagging information which are conceptually posterior to knowledge of structure.
For instance, some treebank tag distinctions, such as particle (RP) vs. preposition (IN) or
predeterminer (PDT) vs. determiner (DT) or adjective (JJ), could be said to import into the
tag set distinctions that can only be made syntactically.
To show results from a complete grammar induction system, we also did experiments starting with an automatic clustering of the words in the treebank (details in section 2.1.4). We do not believe that the quality of our tags matches that of the better methods of Schütze (1995), much less the recent results of Clark (2000). Nevertheless, using these tags as input still gave induced structure substantially above right-branching. Figure 5.8 shows the performance with induced tags compared to correct tags. Overall F1 has dropped, but, interestingly, VP and S recall are higher. This seems to be due to a marked difference between the induced tags and the treebank tags: nouns are scattered among a disproportionately large number of induced tags, increasing the number of common NP sequences, but decreasing the frequency of each.
5.3.4 Convergence and Stability
A common issue with many previous systems is their sensitivity to initial choices. While
the model presented here is clearly sensitive to the quality of the input tagging, as well
as the qualitative properties of the initial completions, it does not suffer from the need to
[Plot: F1 (percent) and shifted log-likelihood vs. EM iterations 0-40]

Figure 5.10: F1 is non-decreasing until convergence.
inject noise to avoid an initial saddle point. Training on random subsets of the training data brought lower performance, but constantly lower over equal-size splits.

Figure 5.10 shows the overall F1 score and the data likelihood according to our model during convergence.⁶ Surprisingly, both are non-decreasing as the system iterates, indicating that data likelihood in this model corresponds well with parse accuracy.⁷ Figure 5.12 shows recall for various categories by iteration. NP recall exhibits the more typical pattern of a sharp rise followed by a slow fall, but the other categories, after some initial drops, all increase until convergence.⁸ These graphs stop at 40 iterations. The time to convergence varied according to smoothing amount, number of classes, and tags used, but the system almost always converged within 80 iterations, usually within 40.
5.3.5 Partial Supervision
For many practical applications, supplying a few gold parses may not be much more expensive than deploying a fully unsupervised system. To test the effect of partial supervision, we trained the CCM model on 90% of the WSJ10 corpus, and tested it on the remaining

⁶ The data likelihood is not shown exactly, but rather we show the linear transformation of it calculated by the system (internal numbers were scaled to avoid underflow).
⁷ Pereira and Schabes (1992) find otherwise for PCFGs.
⁸ Models in the next chapter also show good correlation between likelihood and evaluation metrics, but generally not monotonic as in the present case.
[Plot: held-out F1 (70-78) vs. percent supervision (0-100)]

Figure 5.11: Partial supervision.
10%. Various fractions of that 90% were labeled with their gold treebank parses; during the learning phase, analyses which crossed the brackets of the labeled parses were given zero weight (but the CCM still filled in binary analyses inside flat gold trees). Figure 5.11 shows F1 on the held-out 10% as the supervision percentage increased. Accuracy goes up initially, though it drops slightly at very high supervision levels. The most interesting conclusion from this graph is that small amounts of supervision do not actually seem to help the CCM very much, at least when used in this naive fashion.
5.3.6 Details
There are several details necessary to get good performance out of this model.
Initialization
The completions in this model, just as in the inside-outside algorithm for PCFGs, are distributions over trees. For natural language trees, these distributions are very non-uniform. Figure 5.13 shows empirical bracketing distributions for three languages. These distributions show, over treebank parses of 10-word sentences, the fraction of trees with a constituent over each start and end point. On the other hand, figure 5.14(b) shows the bracket fractions in a distribution which puts equal weight on each (unlabeled) binary tree. The
[Plot: recall (percent) for Overall, NP, PP, VP, and S vs. iterations 0-40]

Figure 5.12: Recall by category during convergence.
most important difference between the actual and tree-uniform bracketing distributions is
that uniform trees are dramatically more likely to have central constituents, while in natural
language constituents tend to either start at the beginning of a sentence or end at the end of
the sentence.
What this means for an induction algorithm is important. Most uniform grammars,
such as a PCFG in which all rewrites have equal weight, or our current proposal with the
constituent and context multinomials being uniform, have the property that all trees
receive equal scores (or roughly so, modulo any initial perturbation). Therefore, if we
begin with an E-step using such a grammar, most first M-steps will be presented with a
posterior that looks like figure 5.14(b). If we have a better idea about what the posteriors
should look like, we can instead begin with an E-step such as the one where all non-trivial
brackets are equally likely, shown in figure 5.14(a) (this bracket distribution does
not correspond to any distribution over binary trees).
Now, we don't necessarily know what the posterior should look like, and we don't want
to bias it too much towards any particular language. However, we found that another relatively
neutral distribution over trees made a good initializer. In particular, consider the
following uniform-splitting process of generating binary trees over k terminals: choose
a split point at random, then recursively build trees by this process on each side of the split.
Figure 5.13: Empirical bracketing distributions for 10-word sentences in three languages: (a) English, (b) German, (c) Chinese (see chapter 2 for corpus descriptions).
Figure 5.14: Bracketing distributions for several notions of uniform: (a) all brackets having equal likelihood, (b) all trees having equal likelihood, (c) all recursive splits having equal likelihood.
Initialization       Precision   Recall   F1     CB
Tree Uniform         55.5        70.5     62.1   1.58
Bracket Uniform      55.6        70.6     62.2   1.57
Split Uniform        64.7        82.2     72.4   0.99
Empirical            65.5        83.2     73.3   1.00

Figure 5.15: CCM performance on WSJ10 as the initializer is varied. Unlike other numbers in this chapter, these values are micro-averaged at the bracket level, as is typical for supervised evaluation, and give credit for the whole-sentence bracket.
This process gives a distribution P_SPLIT which puts relatively more weight on unbalanced
trees, but only in a very general, non-language-specific way. The posterior of the
split-uniform distribution is shown in figure 5.14(c). Another useful property of the split
distribution is that it can be calculated in closed form (details in appendix B.2).
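As a quick illustration of the uniform-splitting process (a sketch, not code from the thesis), the following compares bracket frequencies from sampled trees against the closed-form bracket posterior derived in appendix B.2; the function and variable names are illustrative.

```python
import random
from collections import Counter

def sample_split_tree(i, j, brackets):
    """Sample a binary tree over leaves [i, j) by uniform splitting, recording its brackets."""
    if j - i <= 1:
        return
    brackets.add((i, j))
    k = random.randint(i + 1, j - 1)      # choose a split point uniformly at random
    sample_split_tree(i, k, brackets)     # recurse on the left part
    sample_split_tree(k, j, brackets)     # recurse on the right part

def p_bracket_split(i, j, n):
    """Closed-form bracket posterior under the split-uniform distribution (appendix B.2)."""
    if i == 0 and j == n:
        return 1.0
    if i == 0 or j == n:
        return 1.0 / (j - i)
    return 2.0 / ((j - i) * (j - i + 1))

n, trials = 10, 20000
counts = Counter()
for _ in range(trials):
    brackets = set()
    sample_split_tree(0, n, brackets)
    counts.update(brackets)

# The Monte Carlo estimate should be close to the closed form, e.g. for the span (3, 6):
print(counts[(3, 6)] / trials, p_bracket_split(3, 6, n))
```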
In figure 5.13, aside from the well-known right-branching tendency of English (and
Chinese), a salient characteristic of all three languages is that central brackets are relatively
rare. The split-uniform distribution also shows this property, while the bracket-uniform
distribution and the natural tree-uniform distribution do not. Unsurprisingly, results
when initializing with the bracket-uniform and tree-uniform distributions were substantially
worse than with the split-uniform one. Using the actual empirical posterior was, interestingly,
only slightly better (see figure 5.15).
While the split distribution was used as an initial completion, it was not used in the
model itself. It seemed to bias too strongly against balanced structures, and led to entirely
linear-branching structures.
Smoothing
The smoothing used was straightforward, but very important. For each yield or context
x, we added 10 counts of that item: 2 as a constituent and 8 as a distituent. This reflected
the relative skew of random spans being more likely to be distituents.
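A minimal sketch of this smoothing scheme (illustrative names and toy counts, not the thesis code): each span or context type receives 2 pseudo-counts as a constituent and 8 as a distituent before the multinomials are normalized.

```python
from collections import Counter

def smoothed_multinomial(counts, vocab, pseudo):
    """Add `pseudo` counts to every item in `vocab`, then normalize to a distribution."""
    total = sum(counts[x] for x in vocab) + pseudo * len(vocab)
    return {x: (counts[x] + pseudo) / total for x in vocab}

# Illustrative expected counts from one E-step for a few span types.
constituent_counts = Counter({"DT NN": 40.0, "VBD DT NN": 3.0})
distituent_counts  = Counter({"DT NN": 10.0, "VBD DT NN": 47.0})
vocab = set(constituent_counts) | set(distituent_counts)

p_span_given_true  = smoothed_multinomial(constituent_counts, vocab, pseudo=2)
p_span_given_false = smoothed_multinomial(distituent_counts, vocab, pseudo=8)
print(p_span_given_true, p_span_given_false)
```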
Sentence Length
A weakness of the current model is that it performs much better on short sentences than
longer ones: F1 drops all the way to 53.4% on sentences of length up to 15 (see figure 6.9 in
section 6.3). One likely cause is that as spans get longer, span type counts get smaller, and
so the parsing is driven by the less-informative context multinomials. Indeed, the primary
strength of this system is that it chunks simple NP and PP groups well; longer sentences are
less well-modeled by linear spans and have more complex constructions: relative clauses,
coordination structures, and so on. The improved models in chapter 6 degrade substantially
less with increased sentence length (section 6.3).
5.4 Conclusions
We have presented a simple generative model for the unsupervised distributional induction
of hierarchical linguistic structure. The system achieves above-baseline unsupervised
parsing scores on the WSJ10 and ATIS data sets. The induction algorithm combines the
benefits of EM-based parameter search and distributional clustering methods. We have
shown that this method acquires a substantial amount of correct structure, to the point that
the most frequent discrepancies between the induced trees and the treebank gold standard
are systematic alternate analyses, many of which are linguistically plausible. We have
shown that the system is not overly reliant on supervised POS tag input, and demonstrated
increased accuracy, speed, simplicity, and stability compared to previous systems.
Chapter 6
Dependency Models
6.1 Unsupervised Dependency Parsing
Most recent work (and progress) in unsupervised parsing has come from tree or phrase-structure
based models, but there are compelling reasons to reconsider unsupervised dependency
parsing as well. First, most state-of-the-art supervised parsers make use of specific
lexical information in addition to word-class level information; perhaps lexical information
could be a useful source of information for unsupervised methods. Second, a central
motivation for using tree structures in computational linguistics is to enable the extraction
of dependencies (function-argument and modification structures), and it might be more
advantageous to induce such structures directly. Third, as we show below, for languages
such as Chinese, which have few function words, and for which the definition of lexical
categories is much less clear, dependency structures may be easier to detect.
6.1.1 Representation and Evaluation
An example dependency representation of a short sentence is shown in figure 6.1(a), where,
following the traditional dependency grammar notation, the regent or head of a dependency
is marked with the tail of the dependency arrow, and the dependent is marked with the arrowhead
(Mel'čuk 1988). It will be important in what follows to see that such a representation
is isomorphic (in terms of strong generative capacity) to a restricted form of phrase
structure grammar, where the set of terminals and nonterminals is identical, and every rule
is of the form X → X Y or X → Y X (Miller 1999), giving the isomorphic representation
of figure 6.1(a) shown in figure 6.1(b).¹ Depending on the model, part-of-speech categories
may be included in the dependency representation, as suggested here, or dependencies may
be directly between words (bilexical dependencies). Below, we will assume an additional
reserved nonterminal ROOT, whose sole dependent is the head of the sentence. This simplifies
the notation, math, and the evaluation metric.
A dependency analysis will always consist of exactly as many dependencies as there are
words in the sentence. For example, in the dependency structure of figure 6.1(b), the dependencies
are (ROOT, fell), (fell, payrolls), (fell, in), (in, September), and (payrolls, Factory).
The quality of a hypothesized dependency structure can hence be evaluated by accuracy as
compared to a gold-standard dependency structure, by reporting the percentage of dependencies
shared between the two analyses.
It is important to note that the Penn treebanks do not include dependency annotations;
however, the automatic dependency rules from (Collins 1999) are sufficiently accurate to be
a good benchmark for unsupervised systems for the time being (though see below for specific
issues). Similar head-finding rules were used for the Chinese experiments. The NEGRA
corpus, however, does supply hand-annotated dependency structures.
Where possible, we report an accuracy figure for both directed and undirected dependencies.
Reporting undirected numbers has two advantages: first, it facilitates comparison
with earlier work, and, more importantly, it allows one to partially obscure the effects of
alternate analyses, such as the systematic choice between a modal and a main verb for the
head of a sentence (in either case, the two verbs would be linked, but the direction would
vary).
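A small sketch of this evaluation (illustrative only; the thesis does not give code): directed accuracy counts exact (head, dependent) matches, while undirected accuracy also credits pairs whose direction is flipped.

```python
def dependency_accuracy(guess, gold):
    """Compute directed and undirected accuracy of a guessed dependency set against gold.

    Dependencies are (head, dependent) pairs; both sets contain one entry per word.
    """
    directed = len(guess & gold) / len(gold)
    undirected_matches = sum(1 for h, d in guess if (h, d) in gold or (d, h) in gold)
    return directed, undirected_matches / len(gold)

gold  = {("ROOT", "fell"), ("fell", "payrolls"), ("fell", "in"),
         ("in", "September"), ("payrolls", "Factory")}
guess = {("ROOT", "fell"), ("fell", "payrolls"), ("fell", "in"),
         ("September", "in"), ("payrolls", "Factory")}
print(dependency_accuracy(guess, gold))  # (0.8, 1.0)
```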
6.1.2 Dependency Models
The dependency induction task has received relatively little attention; the best known work
is Carroll and Charniak (1992), Yuret (1998), and Paskin (2002).
¹ Strictly, such phrase structure trees are isomorphic not to flat dependency structures, but to specific derivations of those structures, which specify orders of attachment among multiple dependents which share a common head.
Figure 6.1: Three kinds of parse structures for "Factory payrolls fell in September": (a) classical dependency structure, (b) dependency structure as a context-free tree, (c) CFG structure.
Figure 6.2: Dependency graph with skeleton chosen, but words not populated.
All systems that we are aware of operate under the assumption that the probability of a dependency structure is the
product of the scores of the dependencies (attachments) in that structure. Dependencies
are seen as ordered (head, dependent) pairs of words, but the score of a dependency can
optionally condition on other characteristics of the structure, most often the direction of the
dependency (whether the arrow points left or right).
Some notation before we present specific models: a dependency d is a pair ⟨h, a⟩ of a
head and an argument, which are words in a sentence s, in a corpus S. For uniformity of
notation with chapter 5, words in s are specified as size-one spans of s: for example the
first word would be $_0s_1$. A dependency structure D over a sentence is a set of dependencies
(arcs) which form a planar, acyclic graph rooted at the special symbol ROOT, and in
which each word in s appears as an argument exactly once. For a dependency structure D,
there is an associated graph G which represents the number of words and arrows between
them, without specifying the words themselves (see figure 6.2). A graph G and sentence
s together thus determine a dependency structure. The dependency structure is the object
generated by all of the models that follow; the steps in the derivations vary from model to
model.
Existing generative dependency models intended for unsupervised learning have chosen
to first generate a word-free graph G, then populate the sentence s conditioned on G. For
instance, the model of Paskin (2002), which is broadly similar to the semi-probabilistic
model in Yuret (1998), first chooses a graph G uniformly at random (such as figure 6.2),
then fills in the words, starting with a fixed root symbol (assumed to be at the rightmost
end), and working down G until an entire dependency structure D is filled in (figure 6.1(a)).
The corresponding probabilistic model is
$$P(D) = P(s, G) = P(G)\,P(s \mid G) = P(G) \prod_{(i,j,dir) \in G} P({}_{i-1}s_i \mid {}_{j-1}s_j, dir).$$
In Paskin (2002), the distribution P(G) is fixed to be uniform, so the only model parameters
are the conditional multinomial distributions P(a | h, dir) that encode which head words
take which other words as arguments. The parameters for left and right arguments of
a single head are completely independent, while the parameters for first and subsequent
arguments in the same direction are identified.
In those experiments, the model above was trained on over 30M words of raw newswire,
using EM in an entirely unsupervised fashion, and at great computational cost. However,
as shown in figure 6.3, the resulting parser predicted dependencies at below chance level
(measured by choosing a random dependency structure). This below-random performance
seems to be because the model links word pairs which have high mutual information (such
as occurrences of congress and bill) regardless of whether they are plausibly syntactically
related. In practice, high mutual information between words is often stronger between
two topically similar nouns than between, say, a preposition and its object (worse, it is also
usually stronger between a verb and a selected preposition than between that preposition and its
object).
Model                              Dir.   Undir.
English (WSJ)
  Paskin 01                               39.7
  RANDOM                                  41.7
  Charniak and Carroll 92-inspired        44.7
  ADJACENT                                53.2
  DMV                                     54.4
English (WSJ10)
  RANDOM                           30.1   45.6
  ADJACENT                         33.6   56.7
  DMV                              43.2   63.7
German (NEGRA10)
  RANDOM                           21.8   41.5
  ADJACENT                         32.6   51.2
  DMV                              36.3   55.8
Chinese (CTB10)
  RANDOM                           35.9   47.3
  ADJACENT                         30.2   47.3
  DMV                              42.5   54.2

Figure 6.3: Parsing performance (directed and undirected dependency accuracy) of various dependency models on various treebanks, along with baselines.
Figure 6.4: Dependency configurations in a lexicalized tree: (a) right attachment, (b) left attachment, (c) right stop, (d) left stop. h and a are head and argument words, respectively, while i, j, and k are positions between words. Not shown is the step (if modeled) where the head chooses to generate right arguments before left ones, or the configurations if left arguments are to be generated first.
The specific connection which argues why this model roughly learns to maximize mutual
information is that in maximizing
$$P(D) = P(G) \prod_{(i,j,dir) \in G} P({}_{i-1}s_i \mid {}_{j-1}s_j, dir)$$
it is also maximizing
$$\frac{P(D)}{P'(D)} = \frac{P(G) \prod_{(i,j,dir) \in G} P({}_{i-1}s_i \mid {}_{j-1}s_j, dir)}{P(G) \prod_i P({}_{i-1}s_i)}$$
which, dropping the dependence on directionality, gives
$$\frac{P(D)}{P'(D)} = \frac{P(G) \prod_{(i,j) \in G} P({}_{i-1}s_i \mid {}_{j-1}s_j)}{P(G) \prod_i P({}_{i-1}s_i)} = \prod_{(i,j) \in G} \frac{P({}_{i-1}s_i, {}_{j-1}s_j)}{P({}_{i-1}s_i)\, P({}_{j-1}s_j)}$$
which is a product of (pointwise) mutual information terms.
One might hope that the problem with this model is that the actual lexical items are
too semantically charged to represent workable units of syntactic structure. If one were
to apply the Paskin (2002) model to dependency structures parameterized simply on the
word-classes, the result would be isomorphic to the dependency PCFG models described
in Carroll and Charniak (1992) (see section 5.1). In these models, Carroll and Charniak
considered PCFGs with precisely the productions (discussed above) that make them isomorphic
to dependency grammars, with the terminal alphabet being simply parts-of-speech.
Here, the rule probabilities are equivalent to P(Y | X, right) and P(Y | X, left) respectively.²
The actual experiments in Carroll and Charniak (1992) do not report accuracies that we
can compare to, but they suggest that the learned grammars were of extremely poor quality.
As discussed earlier, a main issue in their experiments was that they randomly initialized
the production (attachment) probabilities. As a result, their learned grammars were of very
poor quality and had high variance.
² There is another, more subtle distinction: in the Paskin work, a canonical ordering of multiple attachments was fixed, while in the Carroll and Charniak work all attachment orders are considered to be different (equal scoring) structures when listing analyses, giving a relative bias in the Carroll and Charniak work towards structures where heads take more than one argument.
However, one nice property of their structural constraint,
which all dependency models share, is that the symbols in the grammar are not
symmetric. Even with a grammar in which the productions are initially uniform, a symbol
X can only possibly have non-zero posterior likelihood over spans which contain a
matching terminal X. Therefore, one can start with uniform rewrites and let the interaction
between the data and the model structure break the initial symmetry. If one recasts
their experiments in this way, they achieve an accuracy of 44.7% on the Penn treebank,
which is higher than choosing a random dependency structure, but lower than simply linking
all adjacent words into a left-headed (and right-branching) structure (53.2%). That this
should outperform the bilexical model is in retrospect unsurprising: a major source of non-syntactic
information has been hidden from the model, and accordingly there is one fewer
unwanted trend that might be detected in the process of maximizing data likelihood.
A huge limitation of both of the above models, however, is that they are incapable
of encoding even first-order valence facts, valence here referring in a broad way to the
regularities in number and type of arguments a word or word class takes (i.e., including
but not limited to subcategorization effects). For example, the former model will attach all
occurrences of "new" to "york", even if they are not adjacent, and the latter model learns
that nouns to the left of the verb (usually subjects) attach to the verb. But then, given a
NOUN NOUN VERB sequence, both nouns will attach to the verb; there is no way that
the model can learn that verbs have exactly one subject. We now turn to an improved
dependency model that addresses this problem.
6.2 An Improved Dependency Model
The dependency models discussed above are distinct from dependency models used inside
high-performance supervised probabilistic parsers in several ways. First, in supervised
models, a head-outward process is modeled (Eisner 1996, Collins 1999). In such processes,
heads generate a sequence of arguments outward to the left or right, conditioning not
only on the identity of the head and the direction of the attachment, but also on some notion of
distance or valence. Moreover, in a head-outward model, it is natural to model stop steps,
where the final argument on each side of a head is always the special symbol STOP. Models
like Paskin (2002) avoid modeling STOP by generating the graph skeleton G first, uniformly
at random, then populating the words of s conditioned on G. Previous work (Collins 1999)
has stressed the importance of including termination probabilities, which allows the graph
structure to be generated jointly with the terminal words, precisely because it does allow
the modeling of required dependents.
We propose a simple head-outward dependency model over word classes which includes
a model of valence, which we call DMV (for dependency model with valence). We
begin at the ROOT. In the standard way (see below), each head generates a series of non-STOP
arguments to one side, then a STOP argument to that side, then non-STOP arguments
to the other side, then a second STOP.
For example, in the dependency structure in figure 6.1, we first generate a single child of
ROOT, here fell. Then we recurse to the subtree under fell. This subtree begins with generating
the right argument in. We then recurse to the subtree under in (generating September
to the right, a right STOP, and a left STOP). Since there are no more right arguments after
in, its right STOP is generated, and the process moves on to the left arguments of fell.
In this process, there are two kinds of derivation events, whose local probability factors
constitute the model's parameters. First, there is the decision at any point whether to terminate
(generate STOP) or not: P_STOP(STOP | h, dir, adj). This is a binary decision conditioned
on three things: the head h, the direction (generating to the left or right of the head), and
the adjacency (whether or not an argument has been generated yet in the current direction,
a binary variable). The stopping decision is estimated directly, with no smoothing. If a
stop is generated, no more arguments are generated for the current head to the current side.
If the current head's argument generation does not stop, another argument is chosen using
P_CHOOSE(a | h, dir). Here, the argument is picked conditionally on the identity of the
head (which, recall, is a word class) and the direction. This term, also, is not smoothed in
any way. Adjacency has no effect on the identity of the argument, only on the likelihood
of termination. After an argument is generated, its subtree in the dependency structure is
recursively generated.
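The head-outward generative process can be summarized in a short sketch (a simplified illustration, not the thesis implementation; the parameter tables p_stop and p_choose and the toy values below are assumptions standing in for the two multinomial families described above):

```python
import random

def sample_subtree(head, p_stop, p_choose, rng=random):
    """Recursively sample the dependency subtree under `head` (a word class).

    p_stop[(head, direction, adjacent)] is the probability of generating STOP;
    p_choose[(head, direction)] maps argument classes to probabilities.
    """
    deps = []
    for direction in ("right", "left"):          # arguments to one side, then the other
        adjacent = True                          # no argument generated yet on this side
        while rng.random() >= p_stop[(head, direction, adjacent)]:
            args, probs = zip(*p_choose[(head, direction)].items())
            arg = rng.choices(args, probs)[0]    # pick an argument class
            deps.append((direction, arg, sample_subtree(arg, p_stop, p_choose, rng)))
            adjacent = False
    return deps

# A toy parameter setting where VBD takes roughly one NN to each side and NN takes nothing.
p_stop = {("VBD", d, adj): (0.1 if adj else 0.9) for d in ("left", "right") for adj in (True, False)}
p_stop.update({("NN", d, adj): 1.0 for d in ("left", "right") for adj in (True, False)})
p_choose = {("VBD", "left"): {"NN": 1.0}, ("VBD", "right"): {"NN": 1.0}}
print(sample_subtree("VBD", p_stop, p_choose))
```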
This process should be compared to what is generally done in supervised parsers (Collins
1999, Charniak 2000, Klein and Manning 2003). The largest difference is that supervised
parsers condition actions on the identity of the head word itself. The lexical identity is a
good feature to have around in a supervised system, where syntactic lexical facts can be
learned effectively. In our unsupervised experiments, having lexical items in the model led
to distant topical associations being preferentially modeled over class-level syntactic patterns,
though it would clearly be advantageous to discover a mechanism for acquiring the
richer kinds of models used in the supervised case. Supervised parsers' decisions to stop or
continue generating arguments are also typically conditioned on finer notions of distance
than adjacent/non-adjacent (buckets or punctuation-defined distance). Moreover, decisions
about argument identity are conditioned on the identity of previous arguments, not just a
binary indicator of whether there were any previous ones. This richer history allows for
the explicit modeling of inter-argument correlations, such as subcategorization/selection
preferences and argument ordering trends. Again, for the unsupervised case, this much
freedom can be dangerous. While we did not use such richer histories, their success in supervised
systems suggests that they could be exploited here, perhaps in a system which originally
ignored richer context, then gradually began to model it.
Formally, for a dependency structure D, let each word h have left dependents deps_D(h, l)
and right dependents deps_D(h, r). The following recursion defines the probability of the
fragment D(h) of the dependency tree rooted at h:
$$P(D(h)) = \prod_{dir \in \{l,r\}} \ \prod_{a \in deps_D(h,dir)} P_{STOP}(\neg STOP \mid h, dir, adj)\, P_{CHOOSE}(a \mid h, dir)\, P(D(a)) \ \cdot\ P_{STOP}(STOP \mid h, dir, adj)$$
One can view a structure generated by this derivational process as a lexicalized
tree composed of the local binary and unary context-free configurations shown in figure 6.4.³
Each configuration equivalently represents either a head-outward derivation step
or a context-free rewrite rule. There are four such configurations. Figure 6.4(a) shows a
head h taking a right argument a. The tree headed by h contains h itself, possibly some right
arguments of h, but no left arguments of h (they attach after all the right arguments).
³ It is lexicalized in the sense that the labels in the tree are derived from terminal symbols, but in our experiments the terminals were word classes, not individual lexical items.
The tree headed by a contains a itself, along with all of its left and right children. Figure 6.4(b)
shows a head h taking a left argument a; the tree headed by h must have already generated
its right stop to do so. Figure 6.4(c) and figure 6.4(d) show the sealing operations, where
STOP derivation steps are generated. The left and right marks on node labels represent left
and right STOPs that have been generated.⁴
The basic inside-outside algorithm (Baker 1979) can be used for re-estimation. For
each sentence s ∈ S, it gives us c_s(x : i, j), the expected fraction of parses of s with a node
labeled x extending from position i to position j. The model can be re-estimated from these
counts. For example, to re-estimate an entry of P_STOP(STOP | h, left, non-adj) according to
a current model, we calculate two quantities.⁵ The first is the (expected) number of
trees headed by a fully sealed $\overline{h}$ spanning some ⟨i, k⟩ with i strictly to the left of loc(h); the
second is the corresponding count for trees headed by an h which has generated its right STOP but
can still take left arguments. The re-estimated entry is their ratio:
$$\frac{\sum_{s \in S} \sum_{i < loc(h)} \sum_{k} c_s(\overline{h} : i, k)}{\sum_{s \in S} \sum_{i < loc(h)} \sum_{k} c_s(\overleftarrow{h} : i, k)}$$
This can be intuitively thought of as the relative number of times a tree headed by h had already
taken at least one argument to the left, had an opportunity to take another, but didn't.⁶
Section A.2 has a more detailed exposition of the details of calculating the necessary expectations.
Initialization is important to the success of any local search procedure. As in chapter 5,
we chose to initialize EM not with an initial model, but with an initial guess at posterior
distributions over dependency structures (completions).
⁴ Note that the asymmetry of the attachment rules enforces the right-before-left attachment convention. This is harmless and arbitrary as far as dependency evaluations go, but imposes an X-bar-like structure on the constituency assertions made by this model. This bias/constraint is dealt with in section 6.3.
⁵ To simplify notation, we assume each word h occurs at most one time in a given sentence, between indexes loc(h) and loc(h) + 1.
⁶ As a final note, in addition to enforcing the right-argument-first convention, we constrained ROOT to have at most a single dependent, by a similar device.
For the first round, we constructed a somewhat ad-hoc harmonic completion in which all non-ROOT words took the same
number of arguments, and each took other words as arguments in inverse proportion to (a constant
plus) the distance between them. The ROOT always had a single argument and took
each word with equal probability. This structure had two advantages: first, when testing
multiple models, it is easier to start them all off in a common way by beginning with an
M-step, and, second, it allowed us to point the model in the vague general direction of
what linguistic dependency structures should look like. It should be emphasized that this
initialization was important in getting reasonable patterns out of this model.
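A sketch of such a harmonic completion (an illustration of the idea only; the constant, the relative ROOT weight, and the normalization details are assumptions, not values from the thesis):

```python
def harmonic_completion(n, c=2.0):
    """Initial guess at attachment posteriors for a sentence of n words.

    P(head = h | dependent = d) is proportional to 1 / (|h - d| + c) for word heads,
    with a uniform-style score for attaching to ROOT.
    """
    posteriors = {}
    for d in range(n):
        scores = {h: 1.0 / (abs(h - d) + c) for h in range(n) if h != d}
        scores["ROOT"] = 1.0 / n            # ROOT takes each word with equal probability
        z = sum(scores.values())
        posteriors[d] = {h: s / z for h, s in scores.items()}
    return posteriors

print(harmonic_completion(4)[0])
```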
On the WSJ10 corpus, the DMV model recovers a substantial fraction of the broad dependency
trends: 43.2% of guessed directed dependencies were correct (63.7% ignoring
direction). To our knowledge, this is the first published result to break the adjacent-word
heuristic (at 33.6% for this corpus). Verbs are the sentence heads, prepositions take following
noun phrases as arguments, adverbs attach to verbs, and so on. Figure 6.5 shows
the most frequent discrepancies between the test dependencies and the model's guesses.
Most of the top mismatches stem from the model systematically choosing determiners to
be the heads of noun phrases, where the test trees have the rightmost noun as the head.
The model's choice is supported by a good deal of linguistic research (Abney 1987), and is
sufficiently systematic that we also report the scores where the NP headship rule is changed
to percolate determiners when present. On this adjusted metric, the score jumps hugely to
55.7% directed (and 67.9% undirected). There are other discrepancy types, such as modals
dominating main verbs, choice of the wrong noun as the head of a noun cluster, and some
sentences headed by conjunctions.
This model also works on German and Chinese at above-baseline levels (55.8% and
54.2% undirected, respectively), with no modifications whatsoever. In German, the largest
source of errors is also the systematic postulation of determiner-headed noun phrases.
The second largest source is that adjectives are (incorrectly) considered to be the head
in adjective-noun units. The third source is the incorrect attachment of adjectives into determiners
inside definite NPs. Mistakes involving verbs and other classes are less common,
but include choosing past participles rather than auxiliaries as the head of periphrastic verb
constructions. In Chinese, there is even more severe confusion inside nominal sequences,
possibly because the lack of functional syntax makes the boundaries between adjacent NPs unclear.
English using DMV
  Overproposals          Underproposals
  DT NN        3083      DT NN        3079
  NNP NNP      2108      NNP NNP      1899
  CC ROOT      1003      IN NN         779
  IN DT         858      DT NNS        703
  DT NNS        707      NN VBZ        688
  MD VB         654      NN IN         669
  DT IN         567      MD VB         657
  DT VBD        553      NN VBD        582
  TO VB         537      VBD NN        550
  DT VBZ        497      VBZ NN        543

English using CCM+DMV
  Overproposals          Underproposals
  DT NN        3474      DT NN        3079
  NNP NNP      2096      NNP NNP      1898
  CD CD         760      IN NN         838
  IN DT         753      NN VBZ        714
  DT NNS        696      DT NNS        672
  DT IN         627      NN IN         669
  DT VBD        470      CD CD         638
  DT VBZ        420      NN VBD        600
  NNP ROOT      362      VBZ NN        553
  NNS IN        347      VBD NN        528

Figure 6.5: Dependency types most frequently overproposed and underproposed for English, with the DMV alone and with the combination model.
For example, temporal nouns often take adjacent proper nouns as arguments; all
other classes of errors are much less salient.
This dependency induction model is reasonably successful. However, the model can be
substantially improved by paying more attention to syntactic constituency at the same time
as modeling dependency structure. To this end, we next present a combined model that
exploits both kinds of structure. As we will see, the combined model finds correct dependencies
more successfully than the model above, and finds constituents more successfully than the
model of chapter 5.
6.3 A Combined Model
The CCM and the DMV models have a common ground. Both can be seen as models over
lexicalized trees composed of the configurations in figure 6.4. For the DMV, it is already a
model over these structures. At the attachment rewrite for the CCM in (a/b), we assign
the quantity
$$\frac{P({}_i s_k \mid true)\; P(\langle {}_{i-1}s_i,\ {}_k s_{k+1}\rangle \mid true)}{P({}_i s_k \mid false)\; P(\langle {}_{i-1}s_i,\ {}_k s_{k+1}\rangle \mid false)}$$
which is the odds ratio of generating the subsequence and context for span ⟨i, k⟩ as a constituent
as opposed to as a non-constituent. If we multiply all trees' attachment scores by
$$\prod_{\langle i,j \rangle} P({}_i s_j \mid false)\; P(\langle {}_{i-1}s_i,\ {}_j s_{j+1}\rangle \mid false)$$
the denominators of the odds ratios cancel, and we are left with each tree being assigned
the probability it would have received under the CCM.⁷
In this way, both models can be seen as generating either constituency or dependency
structures. Of course, the CCM will generate fairly random dependency structures (constrained
only by bracketings). Getting constituency structures from the DMV is also problematic,
because the choice of which side to attach arguments on first has ramifications for
constituency: it forces X-bar-like structures, even though it is an arbitrary convention as
far as dependency evaluations are concerned.
⁷ This scoring function as described is not a generative model over lexicalized trees, because it has no generation step at which nodes' lexical heads are chosen. This can be corrected by multiplying in a head choice factor of 1/(k − j) at each final sealing configuration (d). In practice, this correction factor was harmful for the model combination, since it duplicated a strength of the dependency model, badly.
For example, if we attach right arguments
first, then a verb with a left subject and a right object will attach the object first, giving
traditional VPs, while the other attachment order gives subject-verb groups. To avoid this
bias, we alter the DMV in the following ways. When using the dependency model alone,
we allow each word equal probability for either generation order, this order being
chosen as the first step in a head's outward dependency generation process (in each actual
head derivation, only one order occurs). When using the models together, better performance
was obtained by releasing the one-side-attaching-first requirement entirely.⁸
In figure 6.6, we give the behavior of the CCM constituency model and the DMV
dependency model on both constituency and dependency induction. Unsurprisingly, their
strengths are complementary. The CCM is better at recovering constituency (except for
Chinese, where neither is working particularly well), and the dependency model is better
at recovering dependency structures. It is reasonable to hope that a combination model
might exhibit the best of both. In the supervised parsing domain, for example, scoring a
lexicalized tree with the product of a simple lexical dependency model and a PCFG model
can outperform each factor on its respective metric (Klein and Manning 2003).
In the combined model, we score each tree with the product of the probabilities from
the individual models above. We use the inside-outside algorithm to sum over all lexicalized
trees, similarly to the situation in section 6.2. The tree configurations are shown
in figure 6.4. For each configuration, the relevant scores from each model are multiplied
together. For example, consider figure 6.4(a). From the CCM we generate ${}_i s_k$ as a constituent
and its corresponding context. From the dependency model, we pay the cost of
h taking a as a right argument (P_CHOOSE), as well as the cost of not stopping (P_STOP). The
other configurations are similar. We then run the inside-outside algorithm over this product
model. From the results, we can extract the statistics needed to re-estimate both individual
models.⁹
The models in combination were initialized in the same way as when they were run
individually. Sufficient statistics were separately taken off these individual completions.
From then on, the resulting models were used together during re-estimation.
⁸ With no one-side-first constraint, the proper derivation process chooses whether to stop entirely before each dependent, and if not, chooses a side to generate on, then generates an argument to that side.
⁹ The product, like the CCM itself, is mass-deficient.
                               Constituency             Dependency
Model                          UP     UR     UF1        Dir     Undir
English (WSJ10, 7422 sentences)
  LBRANCH/RHEAD                25.6   32.6   28.7       33.6    56.7
  RANDOM                       31.0   39.4   34.7       30.1    45.6
  RBRANCH/LHEAD                55.1   70.0   61.7       24.0    55.9
  DMV                          46.6   59.2   52.1       43.2    62.7
  CCM                          64.2   81.6   71.9       23.8    43.3
  DMV+CCM (POS)                69.3   88.0   77.6       47.5    64.5
  DMV+CCM (DISTR.)             65.2   82.8   72.9       42.3    60.4
  UBOUND                       78.8  100.0   88.1      100.0   100.0
German (NEGRA10, 2175 sentences)
  LBRANCH/RHEAD                27.4   48.8   35.1       32.6    51.2
  RANDOM                       27.9   49.6   35.7       21.8    41.5
  RBRANCH/LHEAD                33.8   60.1   43.3       21.0    49.9
  DMV                          38.4   69.5   49.5       40.0    57.8
  CCM                          48.1   85.5   61.6       25.5    44.9
  DMV+CCM                      49.6   89.7   63.9       50.6    64.7
  UBOUND                       56.3  100.0   72.1      100.0   100.0
Chinese (CTB10, 2437 sentences)
  LBRANCH/RHEAD                26.3   48.8   34.2       30.2    43.9
  RANDOM                       27.3   50.7   35.5       35.9    47.3
  RBRANCH/LHEAD                29.0   53.9   37.8       14.2    41.5
  DMV                          35.9   66.7   46.7       42.5    54.2
  CCM                          34.6   64.3   45.0       23.8    40.5
  DMV+CCM                      33.3   62.0   43.3       55.2    60.3
  UBOUND                       53.9  100.0   70.1      100.0   100.0

Figure 6.6: Parsing performance of the combined model on various treebanks, along with baselines.
Figure 6.6 summarizes the results. The combined model beats the CCM on English
F1: 77.6 vs. 71.9. To give a concrete indication of how the constituency analyses differ
with and without the addition of the dependency model, figure 6.7 shows the sequences
which were most frequently overproposed and underproposed as constituents, as well as
the crossing proposals, which are the overproposals that actually cross a gold bracket.
Note that in the combination model, verb groups disappear and adverbs are handled more
correctly (this can be seen in both mistake summaries).
Figure 6.6 also shows the combination model's score when using word classes which
were induced entirely automatically, using the same induced classes as in chapter 5. These
classes show some degradation, e.g. 72.9 F1, but it is worth noting that these totally unsupervised
numbers are better than the performance of the CCM model running off of Penn
treebank word classes. Again, if we modify the gold standard so as to make determiners
the head of NPs, then this model with distributional tags scores even better, with 50.6%
directed and 64.8% undirected dependency accuracy.
On the German data, the combination again outperforms each factor alone, though
while the combination was most helpful at boosting constituency quality for English, for
German it provided a larger boost to the dependency structures. Figure 6.8 shows the
common mistake sequences for German, with and without the DMV component. The most
dramatic improvement is the more consistent use of verb-object VPs instead of subject-verb
groups. Note that for the German data, the gold standard is extremely flat. This is
why the precision is so low (49.6% in the combination model) despite the rather high recall
(89.7%): in fact the crossing bracket rate is extremely low (0.39, cf. 0.68 for the English
combination model).
Finally, on the Chinese data, the combination did substantially boost dependency accuracy
over either single factor, but actually suffered a small drop in constituency.¹⁰ Overall,
the combination is able to combine the individual factors in an effective way.
To point out one final advantage of the combined model over the CCM (though not the
DMV), consider figure 6.9, which shows how the accuracy of the combined model degrades
with longer maximum sentence length (10 being WSJ10).
¹⁰ This seems to be partially due to the large number of unanalyzed fragments in the Chinese gold standard, which leave a very large fraction of the posited bracketings completely unjudged.
English using CCM
  Overproposals        Underproposals       Crossing
  JJ NN       1022     NNP NNP      183     MD VB              418
  NNP NNP      453     TO CD CD     159     RB VB              306
  MD VB        418     NNP POS      159     IN NN              192
  DT NN        398     NN NNS       140     POS NN             172
  RB VB        349     NN NN        101     CD CD IN CD CD     154
  JJ NNS       320     CD CD         72     MD RB VB           148
  NNP NN       227     IN CD         69     RB VBN             103
  RB VBN       198     TO VB         66     CD NNS              80
  IN NN        196     RB JJ         63     VNB TO              72
  POS NN       172     IN NNP        62     NNP RB              66

English using CCM+DMV
  Overproposals                    Underproposals       Crossing
  JJ NN                   1022     NNP NNP      167     CD CD IN CD CD       154
  NNP NNP                  447     TO CD CD     154     NNS RB               133
  DT NN                    398     IN NN         76     NNP NNP NNP           67
  JJ NNS                   294     IN DT NN      65     JJ NN                 66
  NNP NN                   219     IN CD         60     NNP RB                59
  NNS RB                   164     CD NNS        56     NNP NNP NNP NNP       51
  NNP NNP NNP              156     NNP NNP NNP   54     NNP NNP               50
  CD CD IN CD CD           155     IN NNP        54     NNS DT NN             41
  TO CD CD IN CD CD        154     NN NNS        49     IN PRP                41
  CD NN TO CD CD IN CD CD  120     RB JJ         47     RB PRP                33

Figure 6.7: Sequences most frequently overproposed, underproposed, and proposed in locations crossing a gold bracket for English, for the CCM and the combination model.
German using CCM
  Overproposals          Underproposals              Crossing
  ADJA NN        461     APPR ART NN         97      ADJA NN            30
  ART NN         430     APPR NN             84      ART NN VVPP        23
  ART ADJA NN     94     APPR NE             46      CARD NN            18
  KON NN          71     NE NE               32      NN VVPP            16
  CARD NN         67     APPR ADJA NN        31      NE NE              15
  PPOSAT NN       37     ADV ADJD            24      VVPP VAINF         13
  ADJA NN NE      36     ADV ADV             23      ART NN PTKVZ       12
  APPRART NN      33     APPR ART ADJA NN    21      NE NN              12
  NE NE           30     NN NE               20      NE VVPP            12
  ART NN VVPP     29     NN KON NN           19      ART NN VVINF       11

German using CCM+DMV
  Overproposals          Underproposals       Crossing
  ADJA NN        461     NE NE         30     ADJA NN            30
  ART NN         430     NN KON NN     22     CARD NN            18
  ART ADJA NN     94     NN NE         12     NE NE              17
  KON NN          71     APPR NE        9     APPR NN            15
  APPR NN         68     ADV PTKNEG     9     ART NN ART NN      14
  CARD NN         67     VVPP VAINF     9     NE ADJA NN         11
  NE ADJA NN      62     ADV ADJA       9     ADV ADJD            9
  NE NE           38     ADV CARD       9     NN APPR NN          8
  PPOSAT NN       37     ADJD ADJA      8     ADV NN              7
  APPR ART NN     36     CARD CARD      7     APPRART NN NN       7

Figure 6.8: Sequences most frequently overproposed, underproposed, and proposed in locations crossing a gold bracket for German, for the CCM and the combination model.
Figure 6.9: Change in parse quality as maximum sentence length increases: (a) CCM alone vs. combination (constituency F1), (b) DMV alone vs. combination (dependency accuracy).
On the constituency F1 measure, the combination degrades substantially more slowly with sentence length than the CCM
alone. This is not too surprising: the CCM's strength is finding common short constituent
chunks, while the DMV's representation is less scale-sensitive at inference time. What is a little
surprising is that the DMV and the combination actually converge in dependency accuracy
as sentences get longer; this may well be because, as sentences get longer, the pressure
from the CCM gets relatively weaker: it is essentially agnostic about longer spans.
6.4 Conclusion
We have presented a successful new dependency-based model for the unsupervised induction
of syntactic structure, which picks up the key ideas that have made dependency
models successful in supervised statistical parsing work. We proceeded to show that it
works cross-linguistically. We then demonstrated how this model could be combined with
the constituent-induction model of chapter 5 to produce a combination which, in general,
substantially outperforms either individual model, on either metric. A key reason that these
models are capable of recovering structure more accurately than previous work is that they
minimize the amount of hidden structure that must be induced. In particular, neither model
attempts to learn intermediate, recursive categories with no direct connection to surface
statistics. For example, the CCM models nesting but not recursion. The dependency model
is a recursive model, but each tree node is headed by one of the leaves it dominates, so no
hidden categories must be posited. Moreover, in their basic forms presented here, neither
model (nor their combination) requires any symmetry-breaking perturbations at initialization.
Chapter 7
Conclusions
There is a great deal of structure in human languages that is not explicit in the observed
input. One way or another, human language learners figure out how to analyze, process,
generalize, and produce the languages to which they are exposed. If we wish to build systems
which interact with natural languages, we either have to take a supervised approach and
supply detailed annotation which makes all this hidden structure explicit, or we have to
develop methods of inducing this structure automatically. The former problem is well-studied
and well-understood; the latter problem is merely well-studied. This work has
investigated a specific corner of the language learning task: inducing constituency and
dependency tree structured analyses given only observations of grammatical sentences (or
word-class sequences). We have demonstrated for this task several systems which exceed
previous systems' performance in extracting linguistically reasonable structure. Hopefully,
we have also provided some useful ideas about what does and does not work for this task,
both in our own systems and in other researchers' work.
Many open questions and avenues of research remain, ranging from the extremely technical
to the extremely broad. On the narrow side, machine learning issues exist for the
present models. Aside from the multi-class CCM, the models have the virtue that symbol
symmetries do not arise, so randomness is not needed at initialization. However, all models
are sensitive to initialization to some degree. Indeed, for the CCM, one of the contributions
of this work is the presentation of a better uniform initializer. In addition, the CCM model
family is probabilistically mass-deficient; it redundantly generates the observations, and,
most severely, its success may well rely on biases latent in this redundant generation. It
performs better on corpora with shorter sentences than longer sentences; part of this issue
is certainly that the linear sequences get extremely sparse for long sentences.
A little more broadly, the DMV models, which are motivated by dependency approaches
that have performed well in the supervised case, still underperform the theoretically less
satisfying CCM models. Despite an enduring feeling that lexical information beyond the
word-class must be useful for learning language, it seems to be a statistical distraction
in some cases. General methods for starting with low-capacity models and gradually releasing
model parameters are needed; otherwise we will be stuck with a trade-off between
expressivity and learnability that will cap the achievable quality of inducible models.
Much more broadly, an ideal language learning system should not be disconnected from
other aspects of language understanding and use, such as the context in which the utterances
are situated. Without attempting to learn the meaning of sentences, success at learning their
grammatical structure is at best an illuminating stepping stone to other tasks and at worst
a data point for linguists interested in nativism. Moreover, from the standpoint of an NLP
researcher in need of a parser for a language lacking supervised tools, approaches which
are weakly supervised, requiring, say, 100 or fewer example parses, are likely to be just as
reasonable as fully unsupervised methods, and one could reasonably hope that they would
provide better results. In fact, from an engineering standpoint, supplying a few parses is
generally much easier than tuning an unsupervised algorithm for a specific language.
Nonetheless, it is important to emphasize that this work has shown that much progress
in the unsupervised learning of real, broad-coverage parsers can come from careful understanding
of the interaction between the representation of a probabilistic model and the
kinds of trends it detects in the data. We cannot expect that unsupervised methods will
ever exceed supervised methods in cases where there is plenty of labeled training data,
but we can hope that, when only unlabeled data is available, unsupervised methods will
be important, useful tools, which additionally can shed light on how human languages are
structured, used, and learned.
Appendix A
Calculating Expectations for the Models
A.1 Expectations for the CCM
In estimating parameters for the CCM model, the computational bottleneck is the E-step,
where we must calculate posterior expectations of various tree configurations according
to a fixed parameter vector (chapter 5). This section gives a cubic dynamic program for
efficiently collecting these expectations.
The CCM model family is parameterized by two kinds of multinomials: the class-conditional
span generation terms P_SPAN(α | c) and the class-conditional context generation
terms P_CONTEXT(β | c), where c is a boolean indicating the constituency of the span, α is the
sequence filling that span, and β is the local linear context of that span. The score assigned
to a sentence ${}_0s_n$ under a single bracketing B is
$$P(s, B) = P_{TREE}(B) \prod_{\langle i,j \rangle} P_{SPAN}(\alpha(i, j, s) \mid B_{ij})\, P_{CONTEXT}(\beta(i, j, s) \mid B_{ij})$$
In P_TREE(B), the bracketings B with non-zero mass are in one-to-one correspondence with
the set of binary tree structures T. Therefore, we can rewrite this expression in terms of the
constituent brackets in the tree T(B):
$$P(s, B) = P_{TREE}(B) \prod_{\langle i,j \rangle \in T(B)} P_{SPAN}(\alpha(i, j, s) \mid true)\, P_{CONTEXT}(\beta(i, j, s) \mid true) \prod_{\langle i,j \rangle \notin T(B)} P_{SPAN}(\alpha(i, j, s) \mid false)\, P_{CONTEXT}(\beta(i, j, s) \mid false)$$
Since most spans in any given tree are distituents, we can also calculate the score for a
bracketing B by starting with the score for the all-distituent bracketing and multiplying in
correction factors for the spans which do occur as constituents in B:
$$P(s, B) = P_{TREE}(B) \prod_{\langle i,j \rangle} P_{SPAN}(\alpha(i, j, s) \mid false)\, P_{CONTEXT}(\beta(i, j, s) \mid false) \prod_{\langle i,j \rangle \in T(B)} \frac{P_{SPAN}(\alpha(i, j, s) \mid true)\, P_{CONTEXT}(\beta(i, j, s) \mid true)}{P_{SPAN}(\alpha(i, j, s) \mid false)\, P_{CONTEXT}(\beta(i, j, s) \mid false)}$$
Since all binary trees have the same score in P_TREE, and the all-distituent product does not
depend on the bracketing, we can write this as
$$P(s, B) = K(s) \prod_{\langle i,j \rangle \in T(B)} \phi(i, j, s)$$
where
$$\phi(i, j, s) = \frac{P_{SPAN}(\alpha(i, j, s) \mid true)\, P_{CONTEXT}(\beta(i, j, s) \mid true)}{P_{SPAN}(\alpha(i, j, s) \mid false)\, P_{CONTEXT}(\beta(i, j, s) \mid false)}$$
and K(s) is some sentence-specific constant. This expression for P(s, B) now factors according
to the nested tree structure of T(B). Therefore, we can define recursions analogous
to the standard inside/outside recurrences and use these to calculate the expectations we're
interested in.
First, we define I(i, j, s), which is analogous to the inside score in the inside-outside
algorithm for PCFGs:
$$I(i, j, s) = \sum_{T \in \mathcal{T}(i,j)} \ \prod_{\langle a,b \rangle \in T} \phi(a, b, s)$$
Figure A.1: The inside and outside configurational recurrences for the CCM: (a) the inside recurrence combines I(i, k) and I(k, j) with the local score φ(i, j) to build I(i, j); (b) the outside recurrence combines O(i, k) with I(j, k) to build O(i, j).
In other words, this is the sum, over all binary tree structures T spanning ⟨i, j⟩, of the
products of the local scores of the brackets in those trees (see figure A.1). This quantity
has a nice recursive decomposition:
$$I(i, j, s) = \begin{cases} \phi(i, j, s) \sum_{i<k<j} I(i, k, s)\, I(k, j, s) & \text{if } j - i > 1 \\ \phi(i, j, s) & \text{if } j - i = 1 \\ 0 & \text{if } j - i = 0 \end{cases}$$
From this recurrence, either a dynamic programming solution or a memoized solution for
calculating the table of I scores in time O(n³) and space O(n²) is straightforward.
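A direct translation of this recurrence into a cubic-time dynamic program might look like the following sketch (names are illustrative, not from the thesis; phi stands for the local odds-ratio table φ):

```python
def inside_scores(n, phi):
    """Compute the CCM inside table I(i, j) for a sentence of n terminals.

    phi[(i, j)] is the local constituent/distituent odds ratio for span (i, j).
    """
    I = {}
    for length in range(0, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == 0:
                I[(i, j)] = 0.0
            elif length == 1:
                I[(i, j)] = phi[(i, j)]
            else:
                I[(i, j)] = phi[(i, j)] * sum(I[(i, k)] * I[(k, j)] for k in range(i + 1, j))
    return I

# Toy check: with phi = 1 everywhere, I(0, n) counts the binary trees over n leaves.
n = 5
phi = {(i, j): 1.0 for i in range(n + 1) for j in range(i, n + 1)}
print(inside_scores(n, phi)[(0, n)])  # 14.0, the 4th Catalan number
```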
Similarly, we define O(i, j, s), which is analogous to the outside score for PCFGs:
$$O(i, j, s) = \sum_{T \in \mathcal{T}(n-(j-i-1))} \ \prod_{\langle a,b \rangle \in T,\ \langle a,b \rangle \neq \langle i,j \rangle} \phi(a, b, s)$$
where the outer sum ranges over binary tree structures in which the span ⟨i, j⟩ is treated as a single
leaf. This quantity is the sum of the scores of all tree structures outside the ⟨i, j⟩ bracket (again
see figure A.1). Note that here, O excludes the factor for the local score at ⟨i, j⟩.
The outside sum also decomposes recursively:
$$O(i, j, s) = \begin{cases} \sum_{0 \le k < i} \phi(k, j, s)\, I(k, i, s)\, O(k, j, s) + \sum_{j < k \le n} \phi(i, k, s)\, I(j, k, s)\, O(i, k, s) & \text{if } j - i < n \\ 1 & \text{if } j - i = n \end{cases}$$
Again, the table of O values can be computed by dynamic programming or memoization.
The expectations we need for re-estimation of the CCM are the posterior bracket counts
P_BRACKET(i, j | s), the fraction of trees (bracketings) that contain the span ⟨i, j⟩ as a constituent:
$$P_{BRACKET}(i, j \mid s) = \frac{\sum_{B : B(i,j)=true} P(s, B)}{\sum_{B} P(s, B)}$$
We can calculate the terms in this expression using the I and O quantities. Since the set
of trees containing a certain bracket is exactly the cross product of partial trees inside that
bracket and partial trees outside that bracket, we have
$$\sum_{B : B(i,j)=true} P(s, B) = K(s)\, I(i, j, s)\, O(i, j, s)$$
and
$$\sum_{B} P(s, B) = K(s)\, I(0, n, s)\, O(0, n, s)$$
Since the constants K(s) cancel and O(0, n, s) = 1, we have
$$P_{BRACKET}(i, j \mid s) = \frac{I(i, j, s)\, O(i, j, s)}{I(0, n, s)}$$
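Continuing the sketch above (again illustrative code, not from the thesis, reusing inside_scores from the previous block), the outside table and the bracket posteriors can be computed in the same style:

```python
def outside_scores(n, phi, I):
    """Compute the CCM outside table O(i, j), given the inside table I."""
    O = {(0, n): 1.0}
    for length in range(n - 1, 0, -1):
        for i in range(0, n - length + 1):
            j = i + length
            left = sum(phi[(k, j)] * I[(k, i)] * O[(k, j)] for k in range(0, i))
            right = sum(phi[(i, k)] * I[(j, k)] * O[(i, k)] for k in range(j + 1, n + 1))
            O[(i, j)] = left + right
    return O

def bracket_posteriors(n, phi):
    """Posterior probability that each span <i, j> is a constituent: I * O / I(0, n)."""
    I = inside_scores(n, phi)
    O = outside_scores(n, phi, I)
    return {(i, j): I[(i, j)] * O[(i, j)] / I[(0, n)]
            for i in range(n) for j in range(i + 1, n + 1)}

# With phi = 1 everywhere, the posteriors reduce to the tree-uniform bracket
# fractions discussed in appendix B.1.
n = 5
phi = {(i, j): 1.0 for i in range(n + 1) for j in range(i, n + 1)}
print(bracket_posteriors(n, phi)[(0, 2)])  # 5/14 = T(2)*T(4)/T(5)
```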
A.2 Expectations for the DMV
The DMV model can be most simply (though not most efficiently) described as decomposing
over a lexicalized tree structure, as shown in figure A.2. These trees are essentially
context-free trees in which the terminal symbols are w ∈ W for some terminal
vocabulary W (here, W is the set of word-classes). The non-terminal symbols are decorated
versions of the terminals, recording how much of the head-outward derivation under a word
has been completed: unsealed symbols $\vec{w}_R$ and $\vec{w}_L$ (no STOP generated yet, still taking
arguments on the first side, right-first or left-first respectively), half-sealed symbols $\overleftarrow{w}$
and $\overrightarrow{w}$ (one STOP generated; the arrow indicates the side on which arguments may still be
taken), and sealed symbols $\overline{w}$ (both STOPs generated).
Figure A.2: The lexicalized derivation trees over which the DMV decomposes, illustrated for the sentence "stocks fell yesterday", together with the local configuration schemas (right-first and left-first attachments, stops, and seals).
h
We imagine all sentences to end with the symbol , and treat as the start symbol.
Trees with root h are called sealed trees, since their head h cannot take any further argu-
ments in this grammar topology. Trees with roots
h and
w, i, j) +
P
STOP
(stop[w, right, adj(j, w))P
INSIDE
(
[ w, i, j)
Half-sealed scores are expressed in terms of either smaller half-sealed scores or unsealed
scores:
$$P_{INSIDE}(\overleftarrow{w}, i, j) = \left[\sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, left, adj(k, w))\, P_{ATTACH}(a \mid w, left)\, P_{INSIDE}(\overline{a}, i, k)\, P_{INSIDE}(\overleftarrow{w}, k, j)\right] + P_{STOP}(STOP \mid w, right, adj(j, w))\, P_{INSIDE}(\vec{w}_R, i, j)$$
$$P_{INSIDE}(\overrightarrow{w}, i, j) = \left[\sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, right, adj(k, w))\, P_{ATTACH}(a \mid w, right)\, P_{INSIDE}(\overrightarrow{w}, i, k)\, P_{INSIDE}(\overline{a}, k, j)\right] + P_{STOP}(STOP \mid w, left, adj(i, w))\, P_{INSIDE}(\vec{w}_L, i, j)$$
Note the dependency on adjacency: the function adj(i, w) indicates whether the index i
is adjacent to word w (on either side).¹
¹ Note also that we are abusing notation so that w indicates not just a terminal symbol, but a specific instance of that symbol in the sentence.
Unsealed scores (for spans larger than one) are expressed in terms of smaller unsealed scores:
$$P_{INSIDE}(\vec{w}_R, i, j) = \sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, right, adj(k, w))\, P_{ATTACH}(a \mid w, right)\, P_{INSIDE}(\vec{w}_R, i, k)\, P_{INSIDE}(\overline{a}, k, j)$$
$$P_{INSIDE}(\vec{w}_L, i, j) = \sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, left, adj(k, w))\, P_{ATTACH}(a \mid w, left)\, P_{INSIDE}(\overline{a}, i, k)\, P_{INSIDE}(\vec{w}_L, k, j)$$
The outside recurrences for P_OUTSIDE(x, i, j) are similar. Both can be calculated in
O(n⁵) time using the standard inside-outside algorithms for headed (lexicalized) PCFGs, or
with memoization techniques. Of course, the O(n⁴) and O(n³) techniques of Eisner and Satta
(1999) apply here as well, but not in the case of combination with the CCM model, and
they are slightly more complex to present.
In any case, once we have the inside and outside scores, we can easily calculate the
fraction of trees over a given sentence which contain any of the structural configurations
which are necessary to re-estimate the model multinomials.
A.3 Expectations for the CCM+DMV Combination
For the combination of the CCM and DMV models, we used the simplest technique which
admitted a dynamic programming solution. For any lexicalized derivation tree structure
of the form shown in figure A.2, we can read off a list of DMV derivation steps (stops,
attachments, etc.). However, we can equally well read off a list of assertions that certain
sequences and their contexts are constituents or distituents. Both models therefore assign a
score to a lexicalized derivation tree, though multiple distinct derivation trees will contain
the same set of constituents.
To combine the models, we took lexicalized derivation trees T and scored them with
$$P_{COMBO}(T) = P_{CCM}(T)\, P_{DMV}(T)$$
The quantity P_COMBO is a deficient probability function, in that, even if P_CCM were not
itself deficient, the sum of probabilities over all trees T will be less than one.²
Calculating expectations with respect to P_COMBO is almost exactly the same as working
with P_DMV. We use a set of O(n⁵) recurrences, as in section A.2. The only difference is
that we must premultiply all our probabilities by the CCM base product
$$\prod_{\langle i,j \rangle} P_{SPAN}(\alpha(i, j, s) \mid false)\, P_{CONTEXT}(\beta(i, j, s) \mid false)$$
and we must multiply in the local CCM correction factor φ(i, j, s) at each constituent that
spans more than one terminal. To do the latter, we simply adjust the binary-branching
recurrences above. Instead of
$$P_{INSIDE}(\overleftarrow{w}, i, j) = \left[\sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, left, adj(k, w))\, P_{ATTACH}(a \mid w, left)\, P_{INSIDE}(\overline{a}, i, k)\, P_{INSIDE}(\overleftarrow{w}, k, j)\right] + P_{STOP}(STOP \mid w, right, adj(j, w))\, P_{INSIDE}(\vec{w}_R, i, j)$$
$$P_{INSIDE}(\overrightarrow{w}, i, j) = \left[\sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, right, adj(k, w))\, P_{ATTACH}(a \mid w, right)\, P_{INSIDE}(\overrightarrow{w}, i, k)\, P_{INSIDE}(\overline{a}, k, j)\right] + P_{STOP}(STOP \mid w, left, adj(i, w))\, P_{INSIDE}(\vec{w}_L, i, j)$$
² Except in degenerate cases, such as if both components put mass one on a single tree.
we get
$$I(\overleftarrow{w}, i, j \mid s) = \left[\sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, left, adj(k, w))\, P_{ATTACH}(a \mid w, left)\, I(\overline{a}, i, k \mid s)\, I(\overleftarrow{w}, k, j \mid s)\, \phi(i, j, s)\right] + P_{STOP}(STOP \mid w, right, adj(j, w))\, I(\vec{w}_R, i, j \mid s)$$
$$I(\overrightarrow{w}, i, j \mid s) = \left[\sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, right, adj(k, w))\, P_{ATTACH}(a \mid w, right)\, I(\overrightarrow{w}, i, k \mid s)\, I(\overline{a}, k, j \mid s)\, \phi(i, j, s)\right] + P_{STOP}(STOP \mid w, left, adj(i, w))\, I(\vec{w}_L, i, j \mid s)$$
and similarly
$$P_{INSIDE}(\vec{w}_R, i, j) = \sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, right, adj(k, w))\, P_{ATTACH}(a \mid w, right)\, P_{INSIDE}(\vec{w}_R, i, k)\, P_{INSIDE}(\overline{a}, k, j)$$
$$P_{INSIDE}(\vec{w}_L, i, j) = \sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, left, adj(k, w))\, P_{ATTACH}(a \mid w, left)\, P_{INSIDE}(\overline{a}, i, k)\, P_{INSIDE}(\vec{w}_L, k, j)$$
becomes
$$I(\vec{w}_R, i, j \mid s) = \sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, right, adj(k, w))\, P_{ATTACH}(a \mid w, right)\, I(\vec{w}_R, i, k \mid s)\, I(\overline{a}, k, j \mid s)\, \phi(i, j, s)$$
$$I(\vec{w}_L, i, j \mid s) = \sum_{k}\sum_{a} P_{STOP}(\neg STOP \mid w, left, adj(k, w))\, P_{ATTACH}(a \mid w, left)\, I(\overline{a}, i, k \mid s)\, I(\vec{w}_L, k, j \mid s)\, \phi(i, j, s)$$
Again, the outside expressions are similar. Notice that the scores which were inside probabilities
in the DMV case are now only sums of products of scores, relativized to the current
sentence, just as in the CCM case.
Appendix B
Proofs
B.1 Closed Form for the Tree-Uniform Distribution
The tree-uniform distribution, P_TREE(T | n), is simply the uniform distribution over the set
of binary trees spanning n leaves. The statistic needed from this distribution in this work
is the posterior bracketing distribution, P(i, j | n), the fraction of trees, according to the tree
distribution, which contain the bracket ⟨i, j⟩. In the case of the tree-uniform distribution,
these posterior bracket counts can be calculated in (basically) closed form.
Since each tree has equal mass, we know that P(i, j | n) is simply the number of trees
containing the ⟨i, j⟩ bracket divided by the total number T(n) of trees over n terminals.
The latter is well-known to be C(n − 1), the (n − 1)st Catalan number:
$$T(n) = C(n-1) = \frac{1}{n}\binom{2n-2}{n-1}$$
So how many trees over n leaves contain a bracket b = ⟨i, j⟩? Each such tree can be
described by the pairing of a tree over j − i leaves, which describes what the tree looks like
inside the bracket b, and another tree, over n − (j − i − 1) leaves, which describes what
the tree looks like outside of the bracket b. Indeed, the choices of inside and outside trees
will give rise to all and only the trees over n symbols containing b. Therefore, the number
of trees with b is T(j − i) · T(n − (j − i − 1)), and the final fraction is
$$P_{BRACKET}(i, j \mid n) = \frac{T(j-i)\, T(n - (j - i - 1))}{T(n)}$$
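A quick numerical check of this closed form (a sketch, not code from the thesis):

```python
from math import comb

def num_trees(n):
    """Number of binary tree structures over n leaves: the (n-1)st Catalan number."""
    if n <= 1:
        return 1
    return comb(2 * n - 2, n - 1) // n

def p_bracket_tree_uniform(i, j, n):
    """Fraction of binary trees over n leaves containing the bracket <i, j>."""
    return num_trees(j - i) * num_trees(n - (j - i - 1)) / num_trees(n)

# Sanity check: summing over all spans gives the expected number of nodes in a
# binary tree over n leaves, which is 2n - 1.
n = 10
total = sum(p_bracket_tree_uniform(i, j, n) for i in range(n) for j in range(i + 1, n + 1))
print(total)  # 19.0
```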
B.2 Closed Form for the Split-Uniform Distribution
The split-uniform distribution over the set of binary trees spanning n leaves, P_SPLIT(T | n),
is defined (recursively) as follows. If n is 1, there is only the trivial tree, T₁, which has
conditional probability 1:
$$P_{SPLIT}(T_1 \mid 1) \equiv 1$$
Otherwise, there are n − 1 options for top-level split points. One is chosen uniformly at
random. This splits the n leaves into a set of k left leaves and n − k right leaves, for some k.
The left and right leaves are independently chosen from P_SPLIT(·|k) and P_SPLIT(·|n − k).
Formally, for a specific tree T over n leaves consisting of a left child T_L over k leaves and
a right child T_R over n − k leaves:
\[
P_{\mathrm{SPLIT}}(T \mid n) \equiv P(k \mid n) \, P_{\mathrm{SPLIT}}(T_L \mid k) \, P_{\mathrm{SPLIT}}(T_R \mid n - k)
\]
where
\[
P(k \mid n) \equiv \frac{1}{n - 1}
\]
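The recursive definition translates directly into a sampler; the sketch below (the function name and the bracket-set representation of trees are our own choices, and the use of Python's random module is an assumption) draws trees from P_SPLIT and can be used to spot-check the posterior bracket fractions derived next.

```python
# A direct transcription of the recursive definition as a sampler: pick a split
# point uniformly among the n - 1 options, then recurse on the two sides.
import random

def sample_split_uniform(lo, hi, rng=random):
    """Draw a binary tree over leaves lo .. hi-1 from P_SPLIT(. | hi - lo)."""
    if hi - lo == 1:
        return set()                     # a single leaf: only the trivial tree
    k = rng.randrange(lo + 1, hi)        # uniform over the hi - lo - 1 split points
    return ({(lo, hi)}
            | sample_split_uniform(lo, k, rng)
            | sample_split_uniform(k, hi, rng))

# For example, the fraction of sampled trees over 4 leaves containing the bracket
# (1, 3) should come out near 1/3, matching the closed form derived below.
draws = [sample_split_uniform(0, 4) for _ in range(20000)]
print(sum((1, 3) in t for t in draws) / len(draws))
```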
The statistics of this distribution needed in this work are the posterior bracket expectations
P(i, j|n), the fraction of trees over n nodes which contain a given bracket ⟨i, j⟩ according
to this distribution. This quantity can be recursively defined, as well. Since all trees contain
a bracket over the entire sentence,
\[
P_{\mathrm{BRACKET}}(0, n \mid n) \equiv 1
\]
For smaller spans, consider a tree T chosen at random from P_SPLIT(·|n). If it contains
a bracket b = ⟨i, j⟩, then the bracket c immediately dominating b must be of the form
c = ⟨i′, j′⟩, where either i′ = i and j < j′, or i′ < i and j = j′. Summing over the possible
choices of this dominating bracket, and the probability that it splits exactly at the edge of b,
gives the recurrence
\[
P_{\mathrm{BRACKET}}(i, j \mid n) = \sum_{0 \le i' < i} P_{\mathrm{BRACKET}}(i', j \mid n) \, P(i - i' \mid j - i') + \sum_{j < j' \le n} P_{\mathrm{BRACKET}}(i, j' \mid n) \, P(j' - j \mid j' - i)
\]
The relevant solution to this recurrence is:
\[
P_{\mathrm{BRACKET}}(i, j \mid n) =
\begin{cases}
1 & \text{if } i = 0 \text{ and } j = n \\
1/(j - i) & \text{if exactly one of } i = 0,\ j = n \text{ holds} \\
2/[(j - i)(j - i + 1)] & \text{if } 0 < i < j < n
\end{cases}
\]
This solution was suggested by Noah Smith, p.c., and can be proven by induction as follows.
The base case (i = 0, j = n) holds because all binary trees have a root bracket over
all leaves. Now, assume for some k, all brackets of size |j − i| > k obey the given solution.
Consider b = ⟨i, j⟩, |j − i| = k.
Case I: Assume i = 0. Then, we can express P_BRACKET(i, j|n) in terms of larger
brackets' likelihoods as
\[
P_{\mathrm{BRACKET}}(i, j \mid n) = \sum_{j' : j < j' \le n} P_{\mathrm{BRACKET}}(0, j' \mid n) \, P(j - 0 \mid j' - 0)
= \Big[ \sum_{j' : j < j' < n} \frac{1}{j'} \cdot \frac{1}{j' - 1} \Big] + 1 \cdot \frac{1}{n - 1}
\]
A partial fraction expansion gives
\[
\frac{1}{x} \cdot \frac{1}{x - 1} = \frac{1}{x - 1} - \frac{1}{x}
\]
and so the sum telescopes:
\[
P_{\mathrm{BRACKET}}(i, j \mid n) = \Big[ \sum_{j' : j < j' < n} \Big( \frac{1}{j' - 1} - \frac{1}{j'} \Big) \Big] + \frac{1}{n - 1}
= \Big[ \Big( \frac{1}{j} - \frac{1}{j + 1} \Big) + \Big( \frac{1}{j + 1} - \frac{1}{j + 2} \Big) + \dots + \Big( \frac{1}{n - 2} - \frac{1}{n - 1} \Big) \Big] + \frac{1}{n - 1}
= \frac{1}{j} = \frac{1}{j - i}
\]
and so the hypothesis holds.
Case II: Assume j = n. Since the definition of P_SPLIT is symmetric, a similar argument
holds as in Case I.
Case III: Assume 0 < i < j < n. Again we can expand P(i, j|n) in terms of known
quantities:
\[
P_{\mathrm{BRACKET}}(i, j \mid n) = \sum_{j' : j < j' \le n} P_{\mathrm{BRACKET}}(i, j' \mid n) \, P(j - i \mid j' - i) + \sum_{i' : 0 \le i' < i} P_{\mathrm{BRACKET}}(i', j \mid n) \, P(j - i \mid j - i')
\]
It turns out that the two sums are equal:
\[
S_1(i, j \mid n) = \sum_{j' : j < j' \le n} P_{\mathrm{BRACKET}}(i, j' \mid n) \, P(j - i \mid j' - i) = \frac{1}{(j - i)(j - i + 1)}
\]
and
\[
S_2(i, j \mid n) = \sum_{i' : 0 \le i' < i} P_{\mathrm{BRACKET}}(i', j \mid n) \, P(j - i \mid j - i') = \frac{1}{(j - i)(j - i + 1)}
\]
We will show the S_1 equality; the S_2 equality is similar. Substituting the uniform split
probabilities, we get
\[
S_1(i, j \mid n) = \sum_{j' : j < j' \le n} P_{\mathrm{BRACKET}}(i, j' \mid n) \, \frac{1}{j' - i - 1}
\]
and substituting the assumed values for the larger spans, we get
\[
S_1(i, j \mid n) = \Big[ \sum_{j' : j < j' < n} P_{\mathrm{BRACKET}}(i, j' \mid n) \, \frac{1}{j' - i - 1} \Big] + P_{\mathrm{BRACKET}}(i, n \mid n) \, \frac{1}{n - i - 1}
= \Big[ \sum_{j' : j < j' < n} \frac{2}{(j' - i)(j' - i + 1)} \cdot \frac{1}{j' - i - 1} \Big] + \frac{1}{n - i} \cdot \frac{1}{n - i - 1}
\]
Here, the relevant partial fraction expansion is
\[
\frac{2}{(x - 1)(x)(x + 1)} = \frac{1}{x - 1} - \frac{2}{x} + \frac{1}{x + 1}
\]
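Both of the partial-fraction identities used in this proof can be checked mechanically; a minimal sketch using sympy (an assumed dependency — any computer algebra system would do):

```python
# Mechanical check of the two partial-fraction expansions used in this appendix.
from sympy import symbols, apart, simplify

x = symbols('x')
assert simplify(apart(1 / (x * (x - 1)), x) - (1 / (x - 1) - 1 / x)) == 0
assert simplify(apart(2 / ((x - 1) * x * (x + 1)), x)
                - (1 / (x - 1) - 2 / x + 1 / (x + 1))) == 0
```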
Again the sum telescopes:
\[
S_1(i, j \mid n) = \Big[ \sum_{j' : j < j' < n} \frac{2}{(j' - i - 1)(j' - i)(j' - i + 1)} \Big] + \frac{1}{(n - i)(n - i - 1)}
\]
\[
= \Big[ \Big( \frac{1}{j - i} - \frac{2}{j - i + 1} + \frac{1}{j - i + 2} \Big) + \Big( \frac{1}{j - i + 1} - \frac{2}{j - i + 2} + \frac{1}{j - i + 3} \Big) + \dots + \Big( \frac{1}{n - i - 2} - \frac{2}{n - i - 1} + \frac{1}{n - i} \Big) \Big] + \Big( \frac{1}{n - i - 1} - \frac{1}{n - i} \Big)
\]
\[
= \Big( \frac{1}{j - i} - \frac{1}{j - i + 1} - \frac{1}{n - i - 1} + \frac{1}{n - i} \Big) + \Big( \frac{1}{n - i - 1} - \frac{1}{n - i} \Big)
= \frac{1}{j - i} - \frac{1}{j - i + 1} = \frac{1}{(j - i)(j - i + 1)}
\]
Since S_2 is similar, we have
\[
P(i, j \mid n) = S_1(i, j \mid n) + S_2(i, j \mid n) = 2 \, S_1(i, j \mid n) = \frac{2}{(j - i)(j - i + 1)}
\]
and so the hypothesis holds again.
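As a cross-check on the closed form (and on the telescoping above), the sketch below evaluates the recursive definition of P_BRACKET exactly with rational arithmetic and compares it against the stated solution for small n; the function names are our own.

```python
# Cross-check of the closed form for P_BRACKET(i, j | n) under the split-uniform
# distribution against a direct evaluation of the recursive definition.
from fractions import Fraction
from functools import lru_cache

def closed_form(i, j, n):
    if i == 0 and j == n:
        return Fraction(1)
    if i == 0 or j == n:                 # exactly one endpoint at the sentence edge
        return Fraction(1, j - i)
    return Fraction(2, (j - i) * (j - i + 1))

@lru_cache(maxsize=None)
def bracket_prob(i, j, lo, hi):
    """Probability that a split-uniform tree over leaves lo .. hi-1 contains (i, j)."""
    if (i, j) == (lo, hi):
        return Fraction(1)
    if hi - lo <= 1:
        return Fraction(0)
    total, splits = Fraction(0), hi - lo - 1
    for k in range(lo + 1, hi):          # uniform top-level split point
        if j <= k:                       # bracket falls entirely in the left part
            total += Fraction(1, splits) * bracket_prob(i, j, lo, k)
        elif i >= k:                     # ... or entirely in the right part
            total += Fraction(1, splits) * bracket_prob(i, j, k, hi)
    return total

for n in range(2, 10):
    for i in range(n - 1):
        for j in range(i + 2, n + 1):
            assert closed_form(i, j, n) == bracket_prob(i, j, 0, n), (i, j, n)
```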