Abstract
This work studies the application of algorithms and techniques from the Machine Learning (ML) field to the Word Sense Disambiguation (WSD) task.
The first issue addressed is the adaptation of several alternative ML algorithms to deal with word senses as classes. These methods are then compared under the same conditions. The evaluation measures used to compare them are the usual precision and recall, but also agreement rates and the kappa statistic.
The use of unlabelled data to train classifiers for Word Sense Disambiguation is a very challenging line of research towards developing a really robust, complete and accurate Word Sense Tagger. For this reason, the next topic treated in this work is the application of two bootstrapping approaches to WSD: Transductive Support Vector Machines and the Greedy Agreement bootstrapping algorithm by Steven Abney.
During the development of this research we have been interested in the construction and evaluation of several WSD systems. We have participated in the last two editions of the English Lexical Sample task of the Senseval evaluation exercises. The Lexical Sample tasks are mainly oriented to evaluating supervised Machine Learning systems. Our systems achieved a very good performance in both editions of this competition; that is, our systems are among the state of the art in WSD. As complementary work to this participation, some other issues have been explored: 1) a comparative study of the features used to represent examples for WSD; 2) a comparison of different feature selection procedures; and 3) a study of the results of the best systems of Senseval-2 that shows the real behaviour of these systems with respect to the data.
Summarising, this work has: 1) studied the application of Machine Learning to Word Sense Disambiguation; and 2) described our participation in the English Lexical Sample task of both the Senseval-2 and Senseval-3 international evaluation exercises. This work has clarified several open questions which we think will help in understanding this complex problem.
Acknowledgements
We especially want to thank Victoria Arranz for her help with the English writing; David Martínez for his useful comments and discussions; Adrià Gispert for his Perl scripts; Yarowsky's group for the feature extractor of syntactic patterns; the referees of international conferences and of the previous version of this document for their helpful comments; Victoria Arranz, Jordi Atserias, Laura Benítez and Montse Civit for their friendship; and Montse Beserán for her infinite patience and love.
We also want to thank Lluís Màrquez and German Rigau for all these years of dedication; Horacio Rodríguez for his humanity; Eneko Agirre for his comments; and all the members of the TALP Research Centre of the Software Department of the Technical University of Catalonia for working together.
This research has been partially funded by the European Commission via
EuroWordNet (LE4003), NAMIC (IST-1999-12392) and MEANING (IST-2001-34460)
projects; by the Spanish Research Department via ITEM (TIC96-1243-C03-03),
BASURDE (TIC98-0423-C06) and HERMES (TIC2000-0335-C03-02); and by the Catalan Research Department via CIRIT's consolidated research group 1999SGR-150, CREL's Catalan WordNet project, and CIRIT's grant 1999FI-00773.
Contents
Abstract i
Acknowledgements iii
1 Introduction 1
2 State-of-the-Art 7
2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.4 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.1 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4 Domain Dependence 73
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5 Bootstrapping 83
5.1.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7 Conclusions 131
Bibliography 137
Chapter 1
Introduction
Since the appearance of the first computers in the early 1950s, humans have been thinking about Natural Language Understanding (NLU). Since then, lots of talking computers have appeared in fiction novels and films. Humans usually see computers as intelligent devices. However, the pioneers of Natural Language Processing (NLP) underestimated the complexity of the task. Unfortunately, natural language systems seem to require extensive knowledge about the world, which in turn is not easy to acquire. The NLP community has been researching NLU for 50 years, and satisfactory results have been obtained only for very restricted domains1.
It is commonly assumed that NLU is related to semantics. The NLP community has applied different approaches to tackle semantics, such as logical formalisms or frames to represent the necessary knowledge. The current usual way to tackle NLP is as a processing chain, in which each step of the chain is devoted to one Natural Language task, usually trying to solve a particular Natural Language ambiguity problem. Natural language presents many types of ambiguity, ranging from morphological to pragmatic ambiguity, passing through syntactic and semantic ambiguities. Thus, most efforts in Natural Language Processing are devoted to solving different types of ambiguities, such as part-of-speech (POS) tagging –dealing with morphosyntactic categories– or parsing –dealing with syntax.
The work presented here focuses on lexical or word sense ambiguity, that is, the ambiguity related to the polysemy of words (isolated words are ambiguous). The determination of the meaning of each word in a context seems to be a necessity in order to understand the language. Lexical ambiguity is a long-standing problem in NLP, which already appears in early references (Kaplan, 1950; Yngve, 1955) on Machine Translation2.
1 See Winograd's SHRDLU program (Winograd, 1972).
2 Wikipedia points to Word Sense Disambiguation as one of the big difficulties for NLP.
Table 1.1: Two of the senses of the word age from WordNet 1.5, illustrated with example sentences.
  sense   example sentence
  age 1   He was mad about stars at the age of nine.
  age 2   About 20,000 years ago the last ice age ended.
The WSD task is a potential intermediate task (Wilks and Stevenson, 1996) for many other NLP systems, including mono- and multilingual Information Retrieval, Information Extraction, Machine Translation and Natural Language Understanding.
Resolving the sense ambiguity of words is obviously essential for many Natural Language Understanding applications (Ide and Véronis, 1998). However, current NLU applications are mostly domain specific (Kilgarriff, 1997; Viegas et al., 1999). Obviously, when restricting the application to a domain we are, in fact, using the "one sense per discourse" hypothesis to eliminate most of the semantic ambiguity of polysemous words (Gale et al., 1992b).
3 The English information in this table has been extracted from WordNet 1.5 (Miller et al., 1990).
4 These sentences have been extracted from the DSO corpus (Ng and Lee, 1996).
• Machine Translation: This is the field in which the first attempts to perform WSD were carried out (Weaver, 1955; Yngve, 1955; Bar-Hillel, 1960). There is no doubt that some kind of WSD is essential for the proper translation of polysemous words.
• Semantic Parsing: Alshawi and Carter (1994) suggest the utility of WSD in restricting the space of competing parses, especially for dependencies such as prepositional phrases.
• Speech Synthesis and Recognition: WSD could be useful for the correct phonetisation
of words in Speech Synthesis (Sproat et al., 1992; Yarowsky, 1997), and for word
segmentation and homophone discrimination in Speech Recognition (Connine, 1990;
Seneff, 1992).
• Lexicography: Kilgarriff (1997) suggests that lexicographers can be not only suppliers
of NLP resources, but also customers of WSD systems.
The main contribution of this thesis is the exploration of the application of algorithms and techniques from the Machine Learning field to Word Sense Disambiguation. We summarise below the contributions we consider most important in this work:
This section provides an overview of the contents of the rest of the thesis:
comparison works, and explains how the algorithms should be adapted to deal with
the word senses.
• Chapter 6 describes our participation in the English Lexical Sample task of both the Senseval-2 and Senseval-3 evaluation exercises. It also shows a study of the best performing systems in Senseval-2.
There are four appendices in this document. The first one shows the formulae of the statistical measures applied throughout the document for evaluation purposes. The second appendix describes in detail (with an example) the output of the feature extractor resulting from this work. The third appendix contains a list of web references for all resources related to WSD. And the final one contains a list of the acronyms used throughout the text.
Chapter 2
State-of-the-Art
This chapter is devoted to presenting the state of the art in WSD. Several approaches have been proposed for assigning the correct sense to a word in context1, some of them achieving remarkably high accuracy figures. Initially, these methods were usually tested only on a small set of words with few and clear sense distinctions; e.g., Yarowsky (1995b) reports 96% precision for twelve words with only two clear sense distinctions each. Despite the wide range of approaches investigated and the large effort devoted to tackling this problem, it is a fact that, to date, no large-scale, broad-coverage and highly accurate WSD system has been built. It still remains an open problem if we look at the main conclusions of the ACL SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What and How?: "WSD is perhaps the great open problem at the lexical level of NLP" (Resnik and Yarowsky, 1997), or at the results of the Senseval (Kilgarriff and Rosenzweig, 2000), Senseval-2 (Kilgarriff, 2001) and Senseval-3 (Mihalcea et al., 2004) evaluation exercises for WSD, in which none of the systems presented achieved 80% accuracy on both the English Lexical Sample and All Words tasks. Actually, in the last two competitions the best systems achieved accuracy values near 65%.
WSD has been described in the literature as an AI-complete problem, that is, "its solution requires a solution to all the general Artificial Intelligence (AI) problems of representing and reasoning about arbitrary real-world knowledge" (Kilgarriff, 1997), or "a problem which can be solved only by first resolving all the difficult problems in AI, such as the representation of common sense and encyclopedic knowledge" (Ide and Véronis, 1998). This fact evinces the hardness of the task and the large amount of resources needed to tackle the problem.
1 The simplest approach ignores context and selects for each ambiguous word its most frequent sense (Gale et al., 1992a; Miller et al., 1994).
This chapter is organised as follows: section 2.1 describes the task at hand and the different ways to address the problem. Then, we focus on the corpus-based approach, which is the framework of this thesis. In section 2.2 we extend the corpus-based explanation by showing some WSD data, the main corpus-based approaches and the way WSD data is represented. Section 2.3 presents a set of promising research lines in the field, and section 2.4 discusses several evaluation issues. Finally, section 2.5 contains references for further reading on several aspects of the state of the art.
WSD typically involves two main tasks: on the one hand, (1) determining the different possible senses (or meanings) of each word; and, on the other hand, (2) tagging each word of a text with its appropriate sense with high accuracy and efficiency.
The former task, that is, the precise definition of a sense, is still under discussion and remains an open problem within the NLP community. Section 2.2.1 enumerates the most important sense repositories used by the Computational Linguistics community. At the moment, the most widely used sense repository is WordNet. However, many problems arise, mainly related to the granularity of the senses. Contrary to the intuition that the agreement between human annotators should be very high in the WSD task (using a particular sense repository), some papers report surprisingly low figures. For instance, Ng et al. (1999) report an agreement rate of 56.7% and a kappa value of 0.317 when comparing the annotations of a subset of the DSO corpus performed by two independent research groups2. Similarly, Véronis (1998) reports kappa values near zero when annotating some special words for the Romanseval corpus3. These reports evince the fragility of existing sense repositories. Humans often cannot agree on the meaning of a word due to the subtle and fine-grained distinctions of the sense repositories, the lack of clear guidelines, and the tight time constraints and limited experience of the human annotators. Moreover, Senseval-3 has shown the dependence of system accuracy on the sense repository. The English Lexical Sample task used WordNet as its sense repository, and the best system achieved an accuracy of 72.9% (Mihalcea et al., 2004). In contrast, for the Catalan and Spanish Lexical Sample tasks, the organisers developed a sense repository based on coarser-grained senses, and the best systems achieved 85.8% and 84.2% accuracy, respectively (Màrquez et al., 2004b,a). However, the discussion on the most appropriate sense repository for WSD is beyond the scope of the work presented here.

2 The kappa statistic k (Cohen, 1960) (see appendix A) is a good measure of inter-annotator agreement because it reduces the effect of chance agreement. A kappa value of 1 indicates perfect agreement, while 0.8 is considered as indicating good agreement (Carletta, 1996).
3 Romanseval was, at the time of the first Senseval event, a competition between WSD systems for Romance languages, analogous to Senseval for English. In the Senseval-2 and Senseval-3 events, the Romanseval tasks became part of Senseval.
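For reference, the observed agreement rate and Cohen's kappa statistic quoted above can be computed as in the following minimal sketch (the exact formulae used in this thesis are given in appendix A; the parallel annotations shown are hypothetical):

```python
from collections import Counter

def agreement_and_kappa(tags_a, tags_b):
    """Observed agreement and Cohen's kappa for two annotations of the same examples."""
    assert len(tags_a) == len(tags_b)
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement: probability that both annotators choose the same sense at random,
    # estimated from each annotator's own sense distribution.
    dist_a, dist_b = Counter(tags_a), Counter(tags_b)
    expected = sum((dist_a[s] / n) * (dist_b[s] / n) for s in set(tags_a) | set(tags_b))
    return observed, (observed - expected) / (1 - expected)

# Hypothetical sense tags assigned by two annotators to ten occurrences of a word
annotator1 = ["age%1", "age%1", "age%2", "age%1", "age%2", "age%1", "age%2", "age%2", "age%1", "age%1"]
annotator2 = ["age%1", "age%2", "age%2", "age%1", "age%2", "age%1", "age%1", "age%2", "age%1", "age%1"]
print(agreement_and_kappa(annotator1, annotator2))  # observed agreement is 0.8 here
```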
The second task, the tagging of each word with a sense, involves the development of a system capable of tagging polysemous words in running text with sense labels4. The WSD community accepts a classification of these systems into two main general categories: knowledge-based and corpus-based methods. All methods build a representation of the examples to be tagged using some previous information; the difference between them is the source of this information. Knowledge-based methods obtain the information from external knowledge sources such as Machine Readable Dictionaries (MRDs) or lexico-semantic ontologies. This knowledge exists prior to the disambiguation process and has usually been generated manually (probably without taking into account its intended use when it was developed). On the contrary, in corpus-based methods the information is gathered from the contexts of previously annotated instances (examples) of the word. These methods extract the knowledge from examples by applying Statistical or Machine Learning methods. When these examples are previously hand-tagged we talk about supervised learning, while if the examples do not come with a sense label we talk about unsupervised learning. Supervised learning requires a large amount of semantically annotated training examples labelled with their correct senses (Ng and Lee, 1996). On the contrary, knowledge-based systems do not require the development of training corpora (Rigau et al., 1997)5. Our work focuses on the supervised corpus-based framework, although we briefly discuss knowledge-based methods in the following section.
These methods mainly try to avoid the need for the large amounts of training material required by supervised methods. Knowledge-based methods can be classified according to the type of resource they use: 1) Machine-Readable Dictionaries; 2) Thesauri; or 3) Computational Lexicons or Lexical Knowledge Bases.
MRDs contain inconsistencies and are created for human use, not for machine exploitation. There is a lot of knowledge in a dictionary that is only really useful when performing a complete WSD process on the whole set of definitions (Richardson, 1997; Rigau, 1998; Harabagiu and Moldovan, 1998). See also the results of the Senseval-3 task devoted to disambiguating WordNet glosses (Castillo et al., 2004).
WSD techniques using MRDs can be classified according to: the lexical resource used (mono- or bilingual MRDs); the MRD information exploited by the method (words in definitions, semantic codes, etc.); and the similarity measure used to relate words from the context with MRD senses.
Lesk (1986) created a method for guessing the correct word sense by counting word overlaps between the definitions of the word and the definitions of the context words, and selecting the sense with the greatest number of overlapping words (a simplified sketch of this overlap idea is given below, after this overview of knowledge-based methods). Although this method is very sensitive to the presence or absence of particular words in the definitions, it has served as the basis for most of the subsequent MRD-based disambiguation systems. Among others, Guthrie et al. (1991) proposed the use of the subject semantic codes of the Longman Dictionary of Contemporary English (LDOCE) to improve the results. Cowie et al. (1992) improved Lesk's method by using the Simulated Annealing algorithm. Wilks et al. (1993) and Rigau (1998) used co-occurrence data extracted from LDOCE and the Diccionario General de la Lengua Española (DGILE), respectively, to construct word-context vectors.
• In the late 1980s and throughout the 1990s, a large effort was devoted to manually developing large-scale knowledge bases: WordNet (Miller et al., 1990), CyC (Lenat and Ramanathan, 1990), ACQUILEX (Briscoe, 1991), and Mikrokosmos (Viegas et al., 1999) are examples of such resources. Currently, WordNet is the best-known and most widely used resource for WSD in English. In WordNet, concepts are defined as synonym sets called synsets, linked to each other through semantic relations (hyperonymy, hyponymy, meronymy, antonymy, and so on). Each sense of a word is linked to a synset.
Taking the WordNet structure as a source, some methods using semantic distance metrics have been developed; most of these metrics consider only nouns. Sussna's metric (Sussna, 1993) computes the distance as a function of the length of the shortest path between nodes (word senses); it is interesting because it uses many relations, and not only the hyperonymy-hyponymy relation. Conceptual Density, a more complex semantic distance measure between words, is defined in (Agirre and Rigau, 1995) and tested on the Brown Corpus as a proposal for WSD in (Agirre and Rigau, 1996). Viegas et al. (1999) illustrate the Mikrokosmos approach to WSD, applying an ontological graph search mechanism, Onto-Search, to check constraints.
Magnini et al. (2001) have studied the influence of domains in WSD, using WordNet Domains (Magnini and Cavaglia, 2000). The main aim of this work is to reduce the polysemy of words by first selecting the domain of the text. They obtained high precision but low recall in the Senseval-2 English Lexical Sample task (see section 2.4). At Senseval-3, domains were used as a component of their complete system, which included, among other modules, a corpus-based one, obtaining very good results.
Knowledge-based systems are gaining importance in the All Words tasks of the last Senseval events for tagging words with a low number of training examples. Most systems combine both knowledge-based and corpus-based approaches when all-words data have to be processed: they learn those words for which labelled examples are available, and apply an unsupervised approach to those words without enough training examples.
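As a concrete illustration of the definition-overlap idea introduced by Lesk (1986) above, the following is a minimal sketch of a simplified variant that compares each sense gloss directly with the context words (rather than with the definitions of the context words); the sense identifiers and glosses are hypothetical:

```python
def simplified_lesk(sense_glosses, context_words):
    """Pick the sense whose gloss shares the most words with the context.

    sense_glosses: dict mapping a sense id to its gloss string (hypothetical inventory).
    context_words: list of words from the sentence around the target word.
    """
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for two senses of "bank"
senses = {
    "bank%1": "a financial institution that accepts deposits and channels money into lending",
    "bank%2": "sloping land beside a body of water such as a river",
}
sentence = "they pulled the boat up the bank of the river"
print(simplified_lesk(senses, sentence.split()))  # -> bank%2 (overlaps with "of" and "river")
```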
The information used as input by corpus-based methods is usually called knowledge sources or sources of information by the community (Lee et al., 2004). Some authors argue that knowledge bases can be seen as sources of information for representing example features (see as examples section 6.4 and appendix B).
6 Currently, several national and international projects are funding the construction and improvement of WordNets (see the Global WordNet Association web page for a complete list of developed WordNets).
Corpus-based approaches are those that build a classification model from examples. These methods involve two phases: learning and classification. The learning phase consists of learning a sense classification model from the training examples. The classification phase consists of applying this model to new examples in order to assign them output senses. Most of the algorithms and techniques used to build models from examples come from the Machine Learning area of AI. Section 2.2.3 enumerates some of these algorithms that have been applied to WSD, and section 2.2.1 enumerates the most important corpora used by the WSD community.
One of the first and most important issues to take into account is the representation of the examples by means of features/attributes, that is, which information could and should be provided to the learning component from the examples. The representation of examples highly affects the accuracy of the systems; it seems to be as important as, or even more important than, the learning method used by the system. Section 2.2.2 discusses the most common example representations appearing in the literature.
So far, corpus-based approaches have obtained the best absolute results. However, it is well known that supervised methods suffer from the lack of widely available semantically tagged corpora from which to construct really broad-coverage WSD systems. This is known as the "knowledge acquisition bottleneck" (Gale et al., 1993). Ng (1997b) estimated that, to obtain a high-accuracy domain-independent system, at least 3,200 words should be tagged with about 1,000 occurrences each. The effort needed to construct such a training corpus is estimated at 16 person-years, according to the experience of the authors in building the DSO corpus (see section 2.2.1).
Unfortunately, many people think that Ng's estimate might fall short, as the annotated corpus produced in this way is not guaranteed to enable high-accuracy WSD. In fact, recent studies using DSO have shown that: 1) the performance of state-of-the-art supervised WSD systems continues to be in the 60-70% range for this corpus (see chapter 4); and 2) some highly polysemous words get very low performance (20-40% accuracy).
Some works have explored the learning curves of individual words to investigate the amount of training data required. In (Ng, 1997b), LEXAS, an exemplar-based supervised learning system, was trained on a set of 137 words with at least 500 examples each, and on a set of 43 words with at least 1,300 examples each. In both settings, the accuracy of the system on the learning curve was still rising when the whole training data was used.
In an independent work (Agirre and Martínez, 2000), the learning curves of two small sets of words (containing nouns, verbs, adjectives, and adverbs) were studied using different corpora (SemCor and DSO). Words of different types were selected, taking into account their characteristics: high/low polysemy, high/low frequency, and high/low skew of the most frequent sense in SemCor. The results reported, using Decision Lists as the learning algorithm, showed that the SemCor data is not enough, but that in the DSO corpus the results seemed to stabilise for nouns and verbs before using all the training material. The word set tested in DSO had, on average, 927 examples per noun and 1,370 examples per verb.
Another important issue is that of the selection and quality of the examples. In most of the semantically tagged corpora available it is difficult to find a minimum number of occurrences for each sense of a word. In order to overcome this particular problem, also known as the "knowledge acquisition bottleneck", four main lines of research are currently being pursued: 1) automatic acquisition of training examples; 2) active learning; 3) learning from labelled and unlabelled examples; and 4) combining training examples from different words.
• In Mihalcea (2002a), a sense-tagged corpus (GenCor) is generated using a set of seeds consisting of sense-tagged examples from four sources: SemCor (see section 2.2.1), WordNet, examples created with the method described in (Mihalcea and Moldovan, 1999b), and hand-tagged examples from other sources (e.g., the Senseval-2 corpus). A corpus with about 160,000 examples was generated from these seeds. A comparison of the results obtained by her WSD system when training with the generated corpus or with the hand-tagged data provided in Senseval-2 was reported. She concluded that the precision achieved using the generated corpus is comparable to, and sometimes better than, that obtained when learning from hand-tagged examples. She also showed that the addition of both corpora further improved the results. Her method has been tested in the Senseval-2 framework with remarkable results.
• Active learning is used to choose informative examples for hand tagging, in order
to reduce the acquisition cost. Argamon-Engelson and Dagan (1999) describe two
main types of active learning: membership queries and selective sampling. In the
first approach, the learner constructs examples and asks a teacher to label them.
This approach would be difficult to apply to WSD. Instead, in selective sampling
the learner selects the most informative examples from unlabelled data. The
informativeness of the examples can be measured using the amount of uncertainty
in their classification, given the current training data. Lewis and Gale (1994) use a single learning model and select those examples for which the classifier is most uncertain (uncertainty sampling; a minimal sketch of this idea is given after this list). Argamon-Engelson and Dagan (1999) propose
another method, called committee-based sampling, which randomly derives several
classification models from the training set, and the degree of disagreement between
them is used to measure the informativeness of the examples. In order to build a verb database, Fujii et al. (1998) applied selective sampling to the disambiguation of verb senses, based on their case fillers. The disambiguation method was based on nearest-neighbour classification, and the selection of examples on the notion of training utility, which has two criteria: the number of neighbours in the unsupervised data (i.e., examples with many neighbours will be more informative in later iterations), and the dissimilarity of the example with respect to other supervised examples (to avoid redundancy).
A comparison of their method with uncertainty and committee-based sampling was
reported, obtaining significantly better results on the learning curve.
The Open Mind Word Expert (Chklovski and Mihalcea, 2002) is a project to collect word sense tagged examples from web users. The examples to be tagged are selected by applying a selective sampling method. Two different classifiers are independently applied to the untagged data: an instance-based classifier that uses active feature selection, and a constraint-based tagger. The two systems have a low inter-annotator agreement (54.96%), high accuracy when they agree (82.5%), and low accuracy when they disagree (52.4% and 30.09%, respectively). This makes the disagreement cases the hardest to annotate, and these are the ones presented to the users.
• Some methods have been devised for learning from labelled and unlabelled data, which are also referred to as bootstrapping methods (Abney, 2002). Among them, we can highlight co-training (Blum and Mitchell, 1998) and its derivatives (Collins and Singer, 1999; Abney, 2002). These techniques seem to be very appropriate for WSD and other NLP tasks because of the wide availability of untagged data and the scarcity of tagged data. In a well-known work in the WSD field, Yarowsky (1995b) exploited some discourse properties (see section 2.2.3) in an iterative bootstrapping process to induce a classifier based on Decision Lists (see also Abney, 2004). With a minimum set of seed (annotated) examples, the system obtained results comparable to those obtained by supervised methods on a limited set of binary sense distinctions. Lately, this kind of method has been gaining importance. Section 2.3.1 explains these techniques in more detail.
• Recent works build classifiers for semantic classes instead of word classifiers. Kohomban and Lee (2005) build semantic classifiers by merging training examples from words in the same semantic class, obtaining good results when applying their method to Senseval-3 data.
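As mentioned in the active learning item above, uncertainty-based selective sampling can be sketched as follows; the classifier interface (a predict_proba method returning a sense-to-probability mapping) is a hypothetical assumption:

```python
def select_most_uncertain(classifier, unlabelled_examples, batch_size=10):
    """Selective sampling: pick the unlabelled examples the current model is least sure about."""
    def margin(example):
        # A small margin between the two most likely senses means high uncertainty.
        probs = sorted(classifier.predict_proba(example).values(), reverse=True)
        return probs[0] - (probs[1] if len(probs) > 1 else 0.0)

    # The examples with the smallest margin are the ones handed to the human annotators.
    return sorted(unlabelled_examples, key=margin)[:batch_size]
```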
Porting WSD systems to new corpora of different genres/domains also presents important challenges. Some studies show that, in practice, the assumptions of supervised learning do not hold when using different corpora, even when many training examples are available: there is a dramatic degradation of performance when training and testing on different corpora. Sense usage depends very much on the domain. That is why corpora should be large and diverse, covering many domains and topics, in order to provide enough examples for each of the senses.
BC, forcing the same number of examples per sense in both sets. The results obtained when training and testing across corpora were disappointing for all the ML algorithms tested, since significant decreases in performance were observed in all cases (in some of them the cross-corpus accuracy was even lower than the "most frequent sense" baseline). The incremental addition of a percentage of supervised training examples from the target corpus did not help much to raise the accuracy of the systems. In the best case, the accuracy achieved was only slightly better than that obtained by training on the small supervised part of the target corpus, making no use of the whole set of examples from the source corpus. The third experiment showed that WSJ and BC have very different sense distributions and that the relevant features acquired by the ML algorithms are not portable across corpora, since in some cases they correlated with different senses in different corpora.
In (Martínez and Agirre, 2000), the main reason for the low performance in cross-corpora tagging was also attributed to the change in domain and genre. Again, they used the DSO corpus and a disjoint selection of the sentences from its WSJ and BC parts. In BC, texts are classified according to some predefined categories (Reportage, Religion, Science Fiction, etc.). This fact allowed them to test the effect of domain and genre on cross-corpora sense tagging. The reported experiments, training on WSJ and testing on BC and vice versa, showed that the performance dropped significantly with respect to the results on each corpus separately. This happened mainly because there were few common collocations (features were mainly based on collocations), and also because some collocations received systematically different tags in each corpus –a similar observation to that of (Escudero et al., 2000b). Subsequent experiments were conducted taking into account the category of the BC documents, showing that results were better when two independent corpora shared genre/topic than when using the same corpus with different genre/topic. The main conclusion is that the "one sense per collocation" constraint does hold across corpora, but that collocations vary from one corpus to another, following genre and topic variations. They argued that a system trained on a specific genre/topic would have difficulties adapting to new genres/topics. Besides, methods that try to automatically extend the number of training examples should also take genre and topic variations into account.
Another current trend in WSD, and in Machine Learning in general, is the automatic selection of features (Hoste et al., 2002b; Daelemans and Hoste, 2002; Decadt et al., 2004). Prior feature selection may be needed because: 1) of computational issues; 2) some learning algorithms are very sensitive to irrelevant or redundant features; and 3) the addition of new attributes does not necessarily improve the performance.
Some recent works have focused on defining separate feature sets for each word, claiming that different features help to disambiguate different words. For instance, Mihalcea (2002b) applied an exemplar-based algorithm to WSD, an algorithm that is very sensitive to irrelevant features. In order to overcome this problem she used a forward-selection iterative process to select the optimal features for each word. She ran cross-validation on the training set, adding the best feature to the optimal set at each iteration, until no improvement was observed. The final system achieved very competitive results in the Senseval-2 competition (see section 2.4.2).
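The forward-selection loop described above can be sketched as follows; the evaluate helper, which trains and cross-validates a classifier on the training set using only the given features and returns its accuracy, is a hypothetical assumption:

```python
def forward_feature_selection(candidate_features, evaluate):
    """Greedy forward selection: grow the feature set while cross-validated accuracy improves.

    candidate_features: list of feature names (strings).
    evaluate(feature_subset): returns cross-validated accuracy (hypothetical helper).
    """
    selected, best_score = [], evaluate([])
    remaining = list(candidate_features)
    while remaining:
        score, best_feature = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:      # stop when no remaining feature improves accuracy
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = score
    return selected
```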
Martínez et al. (2002) make use of feature selection for high-precision disambiguation at the cost of coverage. By using cross-validation on the training corpus, a set of individual features with a discriminative power above a certain threshold was extracted for each word. The threshold parameter allows adjusting the desired precision of the final system. This method was used to train decision lists, obtaining 86% precision for 26% coverage, or 95% precision for 8% coverage, on the Senseval-2 data. However, the study did not provide an analysis of the senses for which the systems performed correctly. One potential use of a high-precision system is the acquisition of almost error-free new examples in a bootstrapping framework.
Finally, Escudero et al. (2004) also achieve good results on the Senseval-3 English Lexical Sample task by performing per-word feature selection, taking as input an initial feature set obtained by per-POS feature selection over the Senseval-2 English Lexical Sample corpus.
Initially, Machine Readable Dictionaries (MRDs) were used as the main repositories of word sense distinctions with which to annotate word examples. For instance, LDOCE, the Longman Dictionary of Contemporary English (Procter, 1973), was frequently used as a research lexicon (Wilks et al., 1993) and for tagging word sense usages (Bruce and Wiebe, 1999). In the first Senseval edition, in 1998, the English Lexical Sample task used the HECTOR dictionary to label each sense instance; this dictionary was jointly produced by Oxford University Press and the DEC dictionary research project. However, WordNet (Miller et al., 1990; Fellbaum, 1998) and EuroWordNet (Vossen, 1998) are nowadays becoming the common knowledge sources for sense discrimination.
Many corpora have been annotated using WordNet and EuroWordNet. From version 1.4 up to 1.6, Princeton has also provided SemCor (Miller et al., 1993)7 (see the next section for a description of the main corpora for WSD). DSO is annotated using a slightly modified version of WordNet 1.5, the same version used for the "line, hard and serve" corpora. The Open Mind Word Expert initiative uses WordNet 1.7. The English tasks of Senseval-2 were annotated using a preliminary version of WordNet 1.7, and most of the Senseval-2 non-English tasks were labelled using EuroWordNet. Although using different WordNet versions can be seen as a problem for the standardisation of these valuable lexical resources, successful methods have been developed for providing compatibility across the European wordnets and the different versions of the Princeton WordNet (Daudé et al., 1999, 2000, 2001).

7 Rada Mihalcea automatically created SemCor 1.7a from SemCor 1.6 by mapping WordNet 1.6 senses into WordNet 1.7 senses.
case, the ambiguity corresponds to the same word with and without a written accent, like the Spanish words {"cantara", "cantará"}.
The DSO corpus (Ng and Lee, 1996) is the first medium-large sized corpus that was really semantically annotated. It contains 192,800 occurrences of 121 nouns and 70 verbs, corresponding to a subset of the most frequent and ambiguous English words. These examples, consisting of the full sentence in which the ambiguous word appears, are tagged with a set of labels corresponding, with minor changes, to the senses of WordNet 1.5 (some details in Ng et al., 1999). Ng and colleagues from the University of Singapore compiled this corpus in 1996 and, since then, it has been widely used. It is currently available from the Linguistic Data Consortium. The DSO corpus contains sentences from two different corpora, namely the Wall Street Journal (WSJ) and the Brown Corpus (BC). The former is focused on the financial domain, while the latter is a general corpus.
Apart from the DSO corpus, there is another major sense-tagged corpus available for English, SemCor (Miller et al., 1993), which stands for Semantic Concordance; it is available from the WordNet web site. The texts used to create SemCor were extracted from the Brown Corpus (80%) and a novel, The Red Badge of Courage (20%), and then manually linked to senses from the WordNet lexicon. The Brown Corpus is a collection of 500 documents, which are classified into fifteen categories. The SemCor corpus makes use of 352 of the 500 Brown Corpus documents. In 166 of these documents only verbs are annotated (totalling 41,525 occurrences). In the remaining 186 documents all open-class words (nouns, verbs, adjectives and adverbs) are linked to WordNet (for a total of 193,139 occurrences). For an extended description of the Brown Corpus see (Francis and Kucera, 1982).
Several authors have also provided the research community with the corpora developed for their experiments. This is the case of the "line, hard and serve" corpora, with more than 4,000 examples per word (Leacock et al., 1998). In this case, the sense repository was WordNet 1.5 and the text examples were selected from the Wall Street Journal, the American Printing House for the Blind, and the San Jose Mercury newspaper. Another sense-tagged corpus is the "interest" corpus, with 2,369 examples coming from the Wall Street Journal and using the LDOCE sense distinctions.
Furthermore, new initiatives like the Open Mind Word Expert (Chklovski and Mihalcea, 2002) appear to be very promising. This system makes use of Web technology to help volunteers manually annotate sense examples. The system includes an active learning component that automatically selects for human tagging those examples that are most difficult for the automatic tagging systems to classify. The corpus is growing daily and currently contains more than 70,000 instances of 230 words, using WordNet 1.7 for the sense distinctions. In order to ensure the quality of the acquired examples, the system requires redundant tagging. The examples are extracted from three sources: the Penn Treebank corpus, the Los Angeles Times collection (as provided for the TREC conferences), and Open Mind Common Sense. While the first two sources are well known, Open Mind Common Sense provides sentences that are not usually found in current corpora. They consist mainly of explanations and assertions similar to the glosses of a dictionary, but phrased in less formal language, and with many examples per sense. The authors of the project suggest that these sentences could be a good source of keywords to be used for disambiguation. They also propose a new task for Senseval-4 based on this resource.
Finally, as a result of the Senseval competitions, small sets of tagged corpora have been developed for several languages, including Basque, Catalan, Chinese, Czech, English, Estonian, French, Italian, Japanese, Portuguese, Korean, Romanian, Spanish, and Swedish. Most of these resources and/or corpora are available at the Senseval web page8.
Before applying any ML algorithm, all the sense examples of a particular word have to be codified in a way that the learning algorithm can handle. The most usual way of codifying training examples is as feature vectors; in this way, they can be seen as points in an n-dimensional feature space, where n is the total number of features used.
Features try to capture information and knowledge about the context and the target words to be disambiguated. However, they necessarily codify only a simplification (or generalisation) of the word sense examples.
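As a minimal sketch of this codification step, symbolic features can be mapped to dimensions of a sparse binary vector as follows (the feature strings shown are hypothetical):

```python
def build_feature_index(training_examples):
    """Map every distinct feature string seen in training to a dimension of the feature space."""
    index = {}
    for features in training_examples:
        for f in features:
            index.setdefault(f, len(index))
    return index

def to_sparse_vector(features, index):
    """Represent one example as the sorted list of its active dimensions (binary sparse vector)."""
    return sorted(index[f] for f in features if f in index)

# Hypothetical features for two training occurrences of "age"
train = [["w-1=the", "w+1=of", "pos=NN", "bow=stars"], ["w-1=ice", "pos=NN", "bow=years"]]
idx = build_feature_index(train)
print(to_sparse_vector(["w-1=the", "pos=NN", "bow=glacier"], idx))  # [0, 2]; unseen features are dropped
```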
In principle, this preprocessing step, in which each example is converted into a feature vector, can be seen as an independent process with respect to the ML algorithm to be used. However, there are strong dependencies between the kind and codification of the features and the appropriateness of each learning algorithm (e.g., exemplar-based learning is very sensitive to irrelevant features, decision tree induction does not properly handle attributes with many values, etc.). Section 3.3 discusses how the feature representation affects both the efficiency and the accuracy of two learning systems for WSD. See also (Agirre and Martínez, 2001) for a survey on the types of knowledge sources that could be relevant for codifying training examples. In addition, the WSD problem (as well as other NLP tasks) has some properties which become very important when considering the application of ML techniques: a large number of features (thousands); both the learning instances and the target concept to be learned reside very sparsely in the feature space9; the presence of many irrelevant and highly dependent features; and the presence of many noisy examples.

8 http://www.senseval.org
9 In (Agirre et al., 2005) we find an example of recent work dealing with the sparseness of the data by means of combining classifiers with different feature spaces.
The feature sets most commonly used in the WSD literature can be grouped as follows:
Local features, representing the local context of a word usage. The local context features comprise bigrams and trigrams of word forms, lemmas, and POS tags, and their positions with respect to the target word. Sometimes, local features also include a bag of words (or lemmas) in a small window around the target word (the positions of these words are not taken into account). These features are able to capture knowledge about collocations, argument-head relations and limited syntactic cues.
Topic features, representing more general contexts (wide windows of words, other sentences, paragraphs, documents), usually as a bag of words.
Syntactic dependencies, at the sentence level, have also been used to try to better model syntactic behaviour and argument-head relations.
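A minimal sketch of how the first two groups of features could be extracted from a POS-tagged sentence follows; the (word, lemma, POS) token format is a simplifying assumption, and syntactic dependencies are omitted since they require a parser:

```python
def extract_features(tokens, target_position):
    """Extract local and topic features for the token at target_position.

    tokens: list of (word, lemma, pos) triples for one sentence (hypothetical format).
    """
    features = []
    # Local features: word forms and POS tags at fixed positions around the target word.
    for offset in (-2, -1, 1, 2):
        i = target_position + offset
        if 0 <= i < len(tokens):
            word, lemma, pos = tokens[i]
            features.append(f"w{offset:+d}={word.lower()}")
            features.append(f"p{offset:+d}={pos}")
    # A local bigram of the word forms surrounding the target.
    if 0 < target_position < len(tokens) - 1:
        features.append(f"bigram={tokens[target_position - 1][0].lower()}_{tokens[target_position + 1][0].lower()}")
    # Topic features: bag of lemmas over the whole sentence, positions ignored.
    for i, (word, lemma, pos) in enumerate(tokens):
        if i != target_position:
            features.append(f"bow={lemma.lower()}")
    return features
```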
Stevenson and Wilks (2001) also propose the combination of different linguistic knowledge sources. They integrate the answers of three partial taggers based on different knowledge sources into a feature-vector representation for each sense. The vector is completed with information about the sense (including its rank in the lexicon) and simple collocations extracted from the context. The partial taggers apply the following knowledge: (i) dictionary definition overlap, optimised for all-words by means of simulated annealing; (ii) selectional preferences based on syntactic dependencies and LDOCE codes; and (iii) subject codes from LDOCE using the algorithm by Yarowsky (1992).
We can classify the supervised methods according to the induction principle they use for acquiring the classification models or rules from examples. The following classification does not aim to be exhaustive or unique. Of course, the combination of several paradigms is another possibility, which has been applied recently.

10 http://www.lsi.upc.es/~meaning
Statistical methods usually estimate a set of probabilistic parameters that express the conditional probability of each category given a particular context (described as features). These parameters can then be combined in order to assign to new examples the category that maximises its probability.
The Naive Bayes algorithm (Duda and Hart, 1973) is the simplest algorithm of this type; it uses the Bayes rule and assumes the conditional independence of the features given the class label. It has been applied in many WSD investigations with considerable success (Gale et al., 1992b; Leacock et al., 1993; Pedersen and Bruce, 1997; Escudero et al., 2000d; Yuret, 2004). Its main problem is the independence assumption. Bruce and Wiebe (1999) present a more complex model known as the "decomposable model", which considers different characteristics to be dependent on each other. The main drawback of this approach is the enormous number of parameters to be estimated, which is proportional to the number of different combinations of the interdependent characteristics. Therefore, this technique requires a great quantity of training examples in order to estimate all the parameters appropriately. To address this problem, Pedersen and Bruce (1997) propose an automatic method for identifying the optimal model (high performance and low effort in parameter estimation) by means of the iterative modification of the complexity degree of the model. Despite its simplicity, Naive Bayes is claimed to obtain state-of-the-art accuracy on supervised WSD in many papers (Mooney, 1996; Ng, 1997a; Leacock et al., 1998; Yuret, 2004).
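The following is a minimal sketch of Naive Bayes for WSD over symbolic features, with add-one smoothing; the (feature list, sense) example format is a hypothetical assumption, and real systems would handle priors and unseen features more carefully:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_list, sense) pairs for one target word."""
    sense_counts = Counter(sense for _, sense in examples)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, sense in examples:
        feature_counts[sense].update(features)
        vocabulary.update(features)
    return sense_counts, feature_counts, len(vocabulary)

def nb_classify(features, model):
    sense_counts, feature_counts, vocab_size = model
    total_examples = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        # log P(sense) + sum over features of log P(feature | sense), with add-one smoothing
        score = math.log(count / total_examples)
        denominator = sum(feature_counts[sense].values()) + vocab_size
        for f in features:
            score += math.log((feature_counts[sense][f] + 1) / denominator)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```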
The Maximum Entropy approach (Berger et al., 1996) provides a flexible way to combine statistical evidence from many sources. The estimation of the probabilities assumes no prior knowledge of the data, and it has proven to be very robust. It has been applied to many NLP problems and it also appears as a competitive alternative for WSD (Suárez and Palomar, 2002; Suárez, 2004).
The methods in this family perform disambiguation by taking into account a similarity metric. This can be done by comparing new examples to a set of prototypes (one for each word sense) and assigning the sense of the most similar prototype, or by searching in a base of annotated examples for the most similar ones.
There are many ways to calculate the similarity between two examples. Assuming the Vector Space Model (VSM), one of the simplest similarity measures is to consider the angle that the two example vectors form. Schütze (1992) applied this model, codifying each word of the context with a word vector representing the frequency of its collocations. In this way, each target word is represented by a vector, calculated as the sum of the vectors of the words that are related to the words appearing in the context. Leacock et al. (1993) compared VSM, Neural Networks, and Naive Bayes methods, and drew the conclusion that the first two methods slightly surpass the last one in WSD. Yarowsky et al. (2001) built a system consisting of the combination of up to six supervised classifiers, which obtained very good results in Senseval-2. One of the included systems was VSM. To train it, they applied a rich set of features (including syntactic information) and weighting of feature types (Agirre et al., 2005).
Another representative algorithm of this family, and the most widely used, is the k-Nearest Neighbour (kNN) algorithm. In this algorithm, the classification of a new example is performed by searching for the set of the k most similar examples (or nearest neighbours) among a pre-stored set of labelled examples, and selecting the most frequent sense among them as the prediction. In the simplest case, the training step stores all the examples in memory (this is why this technique is called Memory-based, Exemplar-based, Instance-based, or Case-based learning) and the generalisation is postponed until a new example has to be classified (this is why it is sometimes also called lazy learning). A very important issue in this technique is the definition of an appropriate similarity (or distance) metric for the task, which should take into account the relative importance of each attribute and should be efficiently computable. The combination scheme used for deciding the resulting sense among the k nearest neighbours also leads to several alternative algorithms.
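A minimal kNN sketch over sets of symbolic features, using plain feature overlap as the similarity measure and majority voting among the neighbours, is shown below; the data format is a simplifying assumption:

```python
from collections import Counter

def knn_classify(features, training_examples, k=5):
    """features: set of feature strings; training_examples: list of (feature_set, sense) pairs.

    Similarity is plain feature overlap here; weighted attributes and metrics such as
    MVDM would replace this in a more realistic system (see chapter 3).
    """
    neighbours = sorted(training_examples,
                        key=lambda example: len(features & example[0]),
                        reverse=True)[:k]
    votes = Counter(sense for _, sense in neighbours)
    return votes.most_common(1)[0][0]   # most frequent sense among the k nearest neighbours
```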
The first works on kNN for WSD were developed by Ng and Lee (1996) on the DSO corpus. Later, Ng (1997a) automatically identified the optimal value of k for each word, improving the previously obtained results. Section 3.3 of this document focuses on certain contradictory results in the literature regarding the comparison of Naive Bayes and kNN methods for WSD. The kNN approach seemed to be very sensitive to the attribute representation and to the presence of irrelevant features. For that reason, alternative representations were developed, which proved to be more efficient and effective. The experiments demonstrated that kNN was clearly superior to Naive Bayes when an adequate feature representation was applied together with feature and example weighting and quite sophisticated similarity metrics (see chapter 3). Hoste et al. (2002a) also used a kNN system in the English All Words task of Senseval-2 (Hoste et al., 2001) and won Senseval-3 (Decadt et al., 2004) (see section 2.4). The system was trained on the SemCor corpus, combined different types of ML algorithms, like Memory-based learning (TiMBL) and rule induction (Ripper), and used several knowledge sources11. Mihalcea and Faruque (2004) also obtained very good results in the Senseval-3 English All Words task with a very interesting system that includes kNN as its learning component. Daelemans et al. (1999) provide empirical evidence about the appropriateness of Memory-based learning for general NLP problems. They show the danger of dropping exceptions in generalisation processes. However, Memory-based learning has the known drawback that it is very sensitive to feature selection. In fact, Decadt et al. (2004) devote a lot of effort to performing feature selection applying Genetic Algorithms.

11 See appendix C for the web references of both the resources and the systems.
Corpus-based methods can exploit several discourse properties for WSD: the "one sense per discourse" (Gale et al., 1992b), "one sense per collocation" and "attribute redundancy" (Yarowsky, 1995b) properties:
• One sense per discourse. The occurrences of the same word in a discourse (or document) usually denote the same sense. For instance, in a business document, it is more likely that the occurrences of the word "company" will denote <business institution> rather than <military unit>.
• One sense per collocation. There are certain word senses that are completely determined by means of a collocation. For example, the word "party" in the collocation "Communist Party" unambiguously refers to the political sense.
These methods acquire selective rules associated with each word sense. Given an occurrence of a polysemous word, the system assigns the sense whose associated rules are verified by the example.
• Methods based on Decision Lists. Decision Lists are ordered lists of rules of the form (condition, class, weight). According to (Rivest, 1987), Decision Lists can be considered as weighted "if-then-else" rules, where the highly discriminant conditions appear at the beginning of the list (high weights), the general conditions appear at the bottom (low weights), and the last condition of the list is a default that accepts all remaining cases. Weights are calculated with a scoring function describing the association between the condition and the particular class, and they are estimated from the training corpus. When classifying a new example, each rule in the list is tested sequentially, and the class of the first rule whose condition matches the example is assigned as the result.
Yarowsky (1994) used Decision Lists to solve a particular type of lexical ambiguity, Spanish and French accent restoration. In another work, Yarowsky (1995b) applied Decision Lists to WSD. In this work, each condition corresponds to a feature, the values are the word senses, and the weights are calculated with a log-likelihood measure indicating the probability of the sense given the feature value (a minimal sketch of this scheme is given after this list).
Some recent experiments suggest that Decision Lists can also be very productive for high-precision feature selection (Martínez et al., 2002; Agirre and Martínez, 2004b), which could be potentially useful for bootstrapping.
nodes corresponding to rules that cover very few training examples do not produce reliable estimates of the class label. The first two factors make it very difficult for a DT-based algorithm to include lexical information about the words in the context to be disambiguated, while the third problem is especially harmful in the presence of small training sets. It has to be noted that these problems can be partially mitigated by using simpler related methods such as Decision Lists.
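As referenced in the Decision List item above, the following is a minimal sketch of learning and applying a Decision List with smoothed log-likelihood weights; the (feature list, sense) example format and the one-sense-versus-rest weighting are simplifying assumptions with respect to Yarowsky's original formulation:

```python
import math
from collections import Counter, defaultdict

def train_decision_list(examples, smoothing=0.1):
    """examples: list of (feature_list, sense) pairs; returns (weight, feature, sense) rules."""
    feature_sense_counts = defaultdict(Counter)
    for features, sense in examples:
        for f in features:
            feature_sense_counts[f][sense] += 1
    rules = []
    for f, counts in feature_sense_counts.items():
        total = sum(counts.values())
        for sense, count in counts.items():
            # Smoothed log-likelihood of the sense versus the other senses, given the feature.
            weight = math.log((count + smoothing) / (total - count + smoothing))
            rules.append((weight, f, sense))
    return sorted(rules, reverse=True)   # most discriminant conditions first

def dl_classify(features, rules, default_sense):
    present = set(features)
    for weight, f, sense in rules:
        if f in present:                 # the first matching condition decides
            return sense
    return default_sense                 # default rule for unmatched examples
```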
Linear classifiers have been very popular in the field of Information Retrieval (IR), since they have been successfully used as simple and efficient models for text categorisation. A linear (binary) classifier is a hyperplane in an n-dimensional feature space that can be represented by a weight vector w and a bias b indicating the distance of the hyperplane to the origin: h(x) = <w, x> + b (where <·, ·> stands for the dot product). The weight vector has one component per feature, expressing the "importance" of this feature in the classification rule, which can be stated as: f(x) = +1 if h(x) ≥ 0 and f(x) = −1 otherwise. There are many on-line learning algorithms for training such linear classifiers (Perceptron,
One of the biggest advantages of using kernels is the possibility of designing specific kernels for the task at hand. For the WSD task, we can design kernels that incorporate semantic information from the features. Strapparava et al. (2004) and Popescu (2004) use a rich combination of features, with a separate kernel for each kind of feature, achieving very good results.
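To make the linear classifier notation h(x) = <w, x> + b above concrete, the following is a minimal perceptron-style training sketch over sparse binary features; SVMs optimise a different (margin-based) criterion but produce the same kind of hyperplane:

```python
def train_perceptron(examples, n_features, epochs=10):
    """examples: list of (active_feature_indices, label) pairs with label in {+1, -1}.

    Learns a weight vector w and bias b defining h(x) = <w, x> + b, which predicts
    +1 when h(x) >= 0 and -1 otherwise.
    """
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for active, label in examples:
            h = sum(w[i] for i in active) + b
            prediction = 1 if h >= 0 else -1
            if prediction != label:        # mistake-driven update
                for i in active:
                    w[i] += label
                b += label
    return w, b
```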
Steven Abney (2002) defines bootstrapping as a problem setting in which the task is to induce a classifier from a small set of labelled data and a large set of unlabelled data. In principle, bootstrapping is a very interesting approach for Word Sense Disambiguation due to the availability of large amounts of unlabelled data and the lack of labelled data.
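As an illustration of this general setting, the following is a minimal self-training sketch (it is not Abney's Greedy Agreement algorithm nor a Transductive SVM; the classifier interface with a predict_with_confidence method is a hypothetical assumption):

```python
def self_train(train_fn, labelled, unlabelled, threshold=0.9, rounds=5):
    """Iteratively grow the labelled set with the model's most confident predictions.

    train_fn(labelled_pairs) is assumed to return a classifier exposing
    predict_with_confidence(example) -> (sense, confidence).
    """
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        classifier = train_fn(labelled)
        confident, remaining = [], []
        for example in pool:
            sense, confidence = classifier.predict_with_confidence(example)
            (confident if confidence >= threshold else remaining).append((example, sense))
        if not confident:
            break                       # no example is confidently labelled any more
        labelled.extend(confident)
        pool = [example for example, _ in remaining]
    return train_fn(labelled)
```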
seeds, obtaining better results than using inductive SVMs. Unfortunately, Escudero and Màrquez (2003) report very irregular and poor performance when using the DSO WSD corpus (see chapter 5).
However, bootstrapping is one of the most promising research lines for opening up the knowledge acquisition bottleneck in Word Sense Disambiguation, due to the lack of labelled data and the availability of a huge amount of unlabelled data. The research community is devoting a lot of effort to this issue. Recently, Ando and Zhang (2005) have improved a system by using unlabelled examples to create auxiliary problems in learning structural data from the CoNLL'00 (syntactic chunking) and CoNLL'03 (named entity chunking) tasks. Unfortunately, to date, neither theoretical nor experimental evidence of the usefulness of bootstrapping techniques for WSD has been shown.
2.4 Evaluation
In the WSD field there has been an inflexion point since the organisation of the Senseval evaluation exercises. This section describes evaluation in the WSD field before and after this point.
There were very few direct comparisons between alternative methods for WSD before the Senseval era. It was commonly stated that Naive Bayes, Neural Networks and Exemplar-based learning represented state-of-the-art accuracy for supervised WSD (Mooney, 1996; Ng, 1997a; Leacock et al., 1998; Fujii et al., 1998; Pedersen and Bruce, 1998). There were two well-known comparisons between Naive Bayes and Exemplar-based methods: the works by Mooney (1996) and by Ng (1997a). The first comparison used more algorithms, but on a small set of words; the second used fewer algorithms, but on more words and with more examples per word.
Mooney's paper showed that the Bayesian approach was clearly superior to the Exemplar-based approach. Although it was not explicitly said, the overall accuracy of Naive Bayes was about 16 points higher than that of the Exemplar-based algorithm, and the latter was only slightly above the accuracy of a Most Frequent Sense classifier. In the Exemplar-based approach, the algorithm applied for classifying new examples was a standard k-Nearest-Neighbour (kNN), using the Hamming distance for measuring closeness. Neither example weighting nor attribute weighting were applied, k was set to 3, and the number of attributes used was said to be almost 3,000.
The second paper compared the Naive Bayes approach with PEBLS (Cost and
Salzberg, 1993), a more sophisticated Exemplar-based learner especially designed for
dealing with examples having symbolic features. This paper showed that, for a large
number of nearest-neighbours, the performance of both algorithms was comparable, while
if cross-validation was used for parameter setting, PEBLS slightly outperformed Naive
Bayes. It has to be noted that the comparison was carried out in a limited setting,
using only 7 feature types, and that the attribute/example-weighting facilities provided
by PEBLS were not used. The author suggests that the poor results obtained in Mooney's work were due to the metric associated with the kNN algorithm, but he did not test whether the Modified Value Difference Metric (MVDM), proposed by Cost and Salzberg (1993) and used in PEBLS, was superior to the standard Hamming distance.
Another surprising result that appears in Ng’s paper is that the accuracy results
obtained were 1-1.6% higher than those reported by the same author one year before (Ng
and Lee, 1996), when running exactly the same algorithm on the same data, but using a
larger and richer set of attributes. This apparently paradoxical difference is attributed,
by the author, to the feature pruning process performed in the older paper.
Apart from the contradictory results obtained by the previous papers, some methodological drawbacks of both comparisons should also be pointed out. On the one hand, Ng applies the algorithms on a broad-coverage corpus but reports the accuracy results of a single testing experiment, providing no statistical tests of significance. On the other hand, Mooney performs thorough and rigorous experiments, but he compares the alternative methods on a limited domain consisting of a single word with a reduced set of six senses. Obviously, this extremely specific domain does not guarantee reliable conclusions about the relative performances of alternative methods when applied to broad-coverage domains.
Section 3.3 clarifies some of the contradictory information in both previous papers and proposes changes in the representation of features to better fit the Naive Bayes and Exemplar-based algorithms to WSD data. Section 3.4 is devoted to the adaptation of the AdaBoost algorithm to WSD data and compares it with both previous algorithms, showing a better performance of AdaBoost.
Lee and Ng (2002) compared different algorithms and showed the importance and influence of the knowledge sources, that is, of how we represent and provide examples to the machine learning algorithms. They pointed to Support Vector Machines as the best option, and their system obtained very good results in the Senseval-3 English Lexical Sample task. Chapter 4 reports poor experimental results when training and testing across corpora; the best performance is achieved by margin-based classifiers. Finally, section 3.5 is devoted to a comparison that takes into account not only the typical accuracy measures, but also measures like kappa values and agreement rates. It shows the best performance for Support Vector Machines when a small training set is available and for AdaBoost when a large amount of examples is available.
Since 1998, broad-coverage comparisons have been performed within the Senseval evaluation exercises (Kilgarriff, 1998a; Kilgarriff and Rosenzweig, 2000). Under the auspices of ACL-SIGLEX and EURALEX, Senseval is attempting to run ARPA-like comparisons between WSD systems. Like other international competitions in the style of those sponsored by the American government, such as MUC or TREC, Senseval was designed to compare, within a controlled framework, the performance of different approaches and systems for Word Sense Disambiguation. The goal of Senseval is to evaluate the strengths and weaknesses of WSD programs with respect to different words, different languages and different tasks. Typically, in an "all words" task, the evaluation consists of assigning the correct sense to almost all the words of a text. In a "lexical sample" task, the evaluation consists of assigning the correct sense to different occurrences of the same word. In a "translation task", senses correspond to distinct translations of a word in another language.
Basically, Senseval classifies the systems into two different types: supervised and unsupervised systems. However, some systems are difficult to classify. In principle, knowledge-based systems (mostly unsupervised) can be applied to all three tasks, whereas exemplar-based systems (mostly supervised) participate preferably in the lexical-sample and translation tasks.
The first Senseval edition (hereinafter Senseval-1) was carried out during the summer of 1998 and was designed for English, French and Italian, with 25 participating systems. It included only the lexical sample task. Up to 17 systems participated in the English lexical-sample task (Kilgarriff and Rosenzweig, 2000), and the best results (all supervised) achieved 75-80% accuracy. Although the senses were based on the HECTOR dictionary, most systems relied on WordNet senses; these systems were at a severe disadvantage due to the required sense mappings.
The second Senseval contest (hereinafter Senseval-2) was held in July 2001 and included tasks for 12 languages: Basque, Czech, Dutch, English, Estonian, Chinese, Danish, Italian, Japanese, Korean, Spanish and Swedish (Edmonds and Cotton, 2001). About 35 teams participated, presenting up to 94 systems. Some teams participated in several tasks, allowing the analysis of the performance across tasks and languages. Furthermore, some words in several tasks were selected to be "translation-equivalents" of
some English words in order to perform further experiments after the official competition. All the results and data of the evaluation are now in the public domain, including: results (system scores and plots), data (system answers, scores, training and testing corpora, and scoring software), system descriptions, task descriptions and scoring criteria. About 26 systems took part in the English lexical-sample task, and the best results achieved 60-65% precision (see table 2.1). After Senseval-2, Lee and Ng (2002) and Florian and Yarowsky (2002) reported 65.4% and 66.5% accuracy, respectively; the first work focussed on the study of the features and the second on the combination of classifiers. Table 2.2 shows the accuracy of the best systems on the All-Words task of the Senseval-2 event. The best system achieved an accuracy of 69%.
An explanation for these lower results, with respect to Senseval-1, is the change of dictionary and the addition of new senses. In Senseval-1, the English lexical-sample task was built using the HECTOR dictionary, whereas in Senseval-2 a preliminary version of WordNet 1.7 was used; in the first case, manual mappings to WordNet 1.5 and 1.6 were also provided. A second factor is the addition of many proper nouns, phrasal verbs and multiwords to the different tasks.
system                   precision
SMUaw                      .69
AVe-Antwerp                .64
LIA-Sinequa-AllWords       .62
david-fa-UNED-AW-T         .58
david-fa-UNED-AW-U         .56

Table 2.2: Precision of the best systems of the Senseval-2 English All-Words task.
It is very difficult to make a direct comparison among the Senseval systems because
they differ very much in the methodology, the knowledge used, the task in which
they participated, and the measurement that they wanted to optimise (“coverage” or
“precision”). Instead, we prefer to briefly describe some of the top-ranked supervised
systems that participated in Senseval-2. With respect to the English-all-words task we
describe SMUaw (Mihalcea and Moldovan, 2001) and Antwerp (Hoste et al., 2001, 2002a)
systems. Regarding the English lexical-sample we describe JHU-English (Yarowsky et al.,
2001) and TALP (Escudero et al., 2001) systems.
The SMUaw system acquired patterns and rules from the SemCor corpus and WordNet, via a set of heuristics. The semantic disambiguation of a word was performed by means of its semantic relations with respect to the surrounding words in the context. These semantic relations were extracted from GenCor, a large corpus of 160,000 sense-annotated word pairs obtained from SemCor and WordNet. As this initial system did not have full coverage, a second step propagated the senses of the disambiguated words to the rest of the ambiguous words in the context. For the words remaining ambiguous, the most frequent sense of WordNet was applied. The performance of this system was 63.8% precision and recall.
The Antwerp system also used SemCor. This system applied ML classifiers, combining, through a voting scheme, different types of learning algorithms: MBL (TiMBL) and rule induction (Ripper). They learnt a classifier for each word (a word expert), choosing the best option after an optimisation process. The performance of this system was 64% precision and recall.
The JHU-English system integrated 6 supervised subsystems, and the final result was obtained by combining the output of all the classifiers. The six subsystems were based on: decision lists (Yarowsky, 2000), error-driven transformation-based learning (Brill, 1995; Ngai and Florian, 2001), a vector space model, decision trees, and two Naive Bayes probabilistic models (one trained on words and the other on lemmas). Each subsystem was trained using a set of words in a fixed context window, its morphological analysis, and a variety of syntactic features such as direct subjects, objects, circumstantial complements and several modifiers. These syntactic features were obtained using a set of heuristics based on patterns (Ngai and Florian, 2001). The output of each subsystem was combined by means of a meta-classifier. Results achieved by this system were 64.2% precision and recall, and 66.5% in an extended version (Florian and Yarowsky, 2002).
The TALP system used the LazyBoosting algorithm to classify the senses of the words
using a rich set of attributes, including domain codes associated to the WordNet synsets
(Magnini and Cavaglia, 2000). Like other systems, the examples were preprocessed (for
example, eliminating the collocations) to decrease the degree of polysemy. The system also used a hierarchical model in the style of (Yarowsky, 2000), although this decomposition was later seen to be counter-productive. The method itself produced a combination of weak rules (decision stumps) without explicitly combining the different outputs. The performance achieved by this system was 59.4% precision and recall. A further description of the TALP system can be found in section 6.2 of this document.
Section 6.3 is devoted to a systematic study of the outputs of the best six systems of the Senseval-2 English Lexical Sample task. This study aims at understanding the characteristics and influence of the different components of these systems on the final performance. Some interesting conclusions are drawn from this study, pointing out the importance of the preprocessing of the data and of the treatment of special cases, such as named entities or phrasal verbs.
Senseval-3
The third edition of Senseval (hereinafter Senseval-3) was held in July 2004. 27 teams participated in the English Lexical Sample task, presenting 47 systems (Mihalcea et al., 2004); and 16 teams in the All-Words task, presenting 26 systems (Snyder and Palmer, 2004). Senseval is growing in interest, as shown by the large number of participants. The best systems obtained 72.9% and 65.2% accuracy for the two tasks, respectively. A remarkable result was that the first 17 systems in the Lexical Sample task achieved very similar accuracy (with a maximum difference of 3.8%).
It seems that in the English Lexical Sample task there were no qualitative improvements compared with previous editions of the task. There is an increment of 10% in accuracy from Senseval-2 to Senseval-3, but it is probably due to the decrease in the number of senses: in the Senseval-3 English Lexical Sample task there are no multiwords, phrasal verbs or other kinds of satellite senses. Table 2.3 shows the results of the best system of each team. The results of this event evidence a clear dependence on the knowledge (features) used rather than on the Machine Learning algorithm applied. Another important issue is how features are combined. The most widely used algorithms are those based on kernels.
Carpuat et al. (2004) and Agirre and Martínez (2004b) perform a combination of several ML components. Lee et al. (2004) and Escudero et al. (2004) use SVMs as learning method, focussing their attention on the feature set: the former has a finely tuned set of features (paying special attention to the syntactic features), while the latter has a huge set of them and performs feature selection and multiclass binarisation selection (between one-vs-all and constraint classification approaches). Strapparava et al. (2004) also use SVMs, but developing Latent Semantic Kernels for the different classes of features; they obtained the second position in the ranking, achieving an accuracy of 72.6%. Popescu (2004) and Grozea (2004) used the RLSC learning algorithm. The first used a standard feature set, and the second extended the feature set, converting the classification problem into a regression one. A postprocess of the output corrects the tendency to vote for the most frequent sense by taking into account the prior distribution of senses in the training data. This system won the task, achieving an accuracy of 72.9%. The most novel aspects of these systems were: the postprocessing of the output weights by Grozea (2004); the use of kernels in the SVMs for the different classes of features by Strapparava et al. (2004); and the study of the features used by Lee et al. (2004), Escudero et al. (2004) and Decadt et al. (2004).
Table 2.4 shows the results of the English All-Words task (Snyder and Palmer, 2004). The best system achieved an accuracy of 65.2% (Decadt et al., 2004). This system is the continuation of the one presented at Senseval-2, plus a feature and parameter selection performed using Genetic Algorithms. They developed a supervised system based on memory-based learning, taking as training data all available corpora: SemCor (the version included in WordNet 1.7.1); those of the previous editions of Senseval (lexical sample and all-words); the line, hard and serve corpora; and the examples of WordNet 1.7.1. They tuned the features and parameters of the system with a Genetic Algorithm.
They applied the same system to the English Lexical Sample task but did not achieve such good results. The second best-performing system (Mihalcea and Faruque, 2004) achieved an accuracy of 64.6%. It was composed of two components: a regular memory-based classifier with a small set of features focussed on syntactic dependencies, and an interesting generalisation learner. This last component learns, through syntactic dependencies and the WordNet hierarchies, generalisation patterns that can be applied to disambiguate unseen words12. The next system (Yuret, 2004) used Naive Bayes with the standard set of features and studied the size of the context. Vázquez et al. (2004) and Villarejo et al. (2004) perform a combination of multiple systems from different research teams. The most interesting issue presented in this task is that of the generalisation learner for the treatment of unseen words (Mihalcea and Faruque, 2004).
system                       precision
GAMBL-AW-S                     65.2
SenseLearner-S                 64.6
Koc University-S               64.1
R2D2: English-all-words        64.1
Meaning-allwords-S             62.4
upv-shmm-eaw-S                 60.9
LCCaw                          60.7
UJAEN-S                        59.0
IRST-DDD-00-U                  58.3
University of Sussex-Prob5     57.2
DFA-Unsup-AW-U                 54.8
KUNLP-Eng-All-U                50.0
merl.system3                   45.8
autoPS-U                       43.6
clr04-aw                       43.4
DLSI-UA-all-Nosu               28.0

Table 2.4: Precision of the best systems of the Senseval-3 English All-Words task.
12 This component is capable of disambiguating take tea from a training example containing drink water, by using the WordNet hierarchy and the verb-object dependency.
During the compilation of this state of the art we have benefited from a number of good papers and surveys. These references, most of which are cited throughout this chapter, represent an excellent starting point for beginning research in both the WSD and ML fields.
Chapter 3
A Comparison of Supervised ML Algorithms for WSD
Section 3.3 analyses how to represent the information for Naive-Bayes and
Exemplar-Based algorithms, and compares both algorithms under the same framework.
The proposed data representation is also used later on Decision Lists.
Section 3.4 describes the application of the AdaBoost algorithm to WSD. This work also contains some ways of speeding up the algorithm by reducing the feature space to explore. The best variant has been called LazyBoosting.
The final section contains a comparison of the preceding algorithms under the same conditions: Naive Bayes, Exemplar-Based, Decision Lists, AdaBoost, and Support Vector Machines. The comparison not only takes into account the typical precision-recall measures, but also the kappa statistic and the agreement ratio. A paired Student's t-test has been applied to check the statistical significance of the observed differences. See appendix A for all the formulae.
3.1 Machine Learning for Classification

The training set contains n training examples, s = {(x1 , y1 ), . . . , (xn , yn )}, which are
pairs (x, y) where x belongs to X and y = f (x). The x component of each example
is typically a vector x = (x1 , . . . , xn ), whose components, called attributes (or features),
are discrete- or real-valued and describe the relevant information/properties about the
example. The values of the output space Y associated to each training example are called
classes (or categories). Therefore, each training example is completely described by a set
of attribute-value pairs, and a class label.
In the Statistical Learning Theory field (Vapnik, 1998), the function f is viewed as a probability distribution P(X, Y) and not as a deterministic mapping, and the training examples are considered as a set of i.i.d. examples generated from this distribution. Additionally, X is usually identified with $\mathbb{R}^n$, and each example x with a point in $\mathbb{R}^n$ having one real-valued feature for each dimension. All along this document we will try to maintain the descriptions and notation compatible with both approaches.
The goal of the learning algorithm is to select, according to some preference criterion, one hypothesis h among the several that can be compatible with the training set (e.g., simplicity, maximal margin, etc.). Given new x vectors, h is used to predict the corresponding y values, i.e., to classify the new examples, and it is expected to coincide with f in the majority of the cases or, equivalently, to make a small number of errors.
The measure of the error rate on unseen examples is called generalisation (or true) error. It is obvious that the generalisation error cannot be directly minimised by the learning algorithm, since the true function f, or the distribution P(X, Y), is unknown. Therefore, an inductive principle is needed. The most common way to proceed is to directly minimise the training (or empirical) error, that is, the number of errors on the training set. This principle is known as Empirical Risk Minimisation, and it gives a good estimation of the generalisation error in the presence of many training examples. However, in domains with few training examples, forcing a zero training error can lead to overfitting the training data and to generalising badly. A notion of complexity of the induced functions is also related to the risk of overfitting (see again (Vapnik, 1998)), and this risk is increased in the presence of outliers and noise (i.e., very exceptional and wrongly classified training examples, respectively). This tradeoff between training error and complexity of the induced classifier is something that has to be faced in any experimental setting in order to guarantee a low generalisation error.
Finally, we would like to briefly comment on a terminology issue that can be rather confusing in the WSD literature. Recall that, in machine learning, the term supervised learning refers to the fact that the training examples are annotated with class labels, which are taken from a pre-specified set. Instead, non-supervised learning refers to the problem of learning from examples when there is no set of pre-specified class labels; that is, the learning consists of acquiring the similarities between examples to form clusters that can later be interpreted as classes (this is why it is usually referred to as clustering). In the WSD literature, the term unsupervised learning is used with another meaning, namely the acquisition of disambiguation models or rules from non-annotated examples and external sources of information (e.g., lexical databases, aligned corpora, etc.). Note that in this case the set of class labels (which are the senses of the words) is also specified in advance.
The Naive Bayes (NB) algorithm classifies each example with the sense k that maximises the conditional probability of the sense given the observed sequence of features of that example. That is,

\[ \arg\max_{k} P(k \mid x_1, \ldots, x_m) \;=\; \arg\max_{k} \frac{P(x_1, \ldots, x_m \mid k)\, P(k)}{P(x_1, \ldots, x_m)} \;\approx\; \arg\max_{k} P(k) \prod_{i=1}^{m} P(x_i \mid k) \]
where P (k) and P (xi |k) are estimated from the training set, using relative frequencies
(maximum likelihood estimation). To avoid the effects of zero counts when estimating
the conditional probabilities of the model, a very simple smoothing technique, proposed
in (Ng, 1997a), has been used. It consists in replacing zero count of P (xi |k) with P (k)/n
where n is the number of training examples.
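The following Python fragment is a minimal sketch of this classifier, including the zero-count replacement by P(k)/n described above; the data layout (each example as a list of attribute values) and the use of log-probabilities for numerical stability are implementation assumptions.

import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (features, sense) pairs; features is a list of attribute values."""
    n = len(examples)
    sense_count = Counter(sense for _, sense in examples)
    feat_count = defaultdict(Counter)          # feat_count[sense][feature]
    for feats, sense in examples:
        feat_count[sense].update(feats)
    priors = {k: c / n for k, c in sense_count.items()}
    return priors, feat_count, sense_count, n

def classify_nb(model, feats):
    priors, feat_count, sense_count, n = model
    best, best_score = None, float("-inf")
    for k, prior in priors.items():
        score = math.log(prior)
        for f in feats:
            c = feat_count[k][f]
            # zero counts of P(f|k) are replaced by P(k)/n, following the smoothing above
            p = c / sense_count[k] if c > 0 else prior / n
            score += math.log(p)
        if score > best_score:
            best, best_score = k, score
    return best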
The Exemplar-based (EB) algorithm, in its basic form, classifies a new example by retrieving its k nearest neighbours from the training set and assigning the majority sense among them. The distance between a test example x and a stored example xi is a (possibly weighted) Hamming distance,

\[ d(x, x_i) \;=\; \sum_{j=1}^{m} w_j\, \delta(x_j, x_{ij}) \]

where wj is the weight of the j-th feature and δ(xj, xij) is the distance between two values, which is 0 if xj = xij and 1 otherwise.

In the implementation tested in these experiments, we used the Hamming distance to measure closeness and the Gain Ratio measure (Quinlan, 1993) for estimating feature weights. For values of k greater than 1, the resulting sense is the weighted majority sense of the k nearest neighbours, where each example votes its sense with a strength proportional to its closeness to the test example. The kNN algorithm is run several times using different numbers of nearest neighbours (1, 3, 5, 7, 10, 15, 20 and 25) and the results corresponding to the best choice for each word are reported1.
1 In order to construct a real kNN-based system for Word Sense Disambiguation, the k parameter should have been estimated by cross-validation using only the training set (Ng, 1997a); however, in our case, this cross-validation inside the cross-validation involved in the testing process would have introduced a prohibitive overhead.
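A minimal sketch of the weighted-Hamming kNN classifier with example weighting is shown below. The exact vote-strength function (1/(1+distance)) is an illustrative assumption for "proportional to its closeness", and Gain Ratio attribute weights are simply passed in as a precomputed list.

def hamming_distance(x, z, weights=None):
    """Weighted Hamming distance between two equal-length attribute vectors."""
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w for w, a, b in zip(weights, x, z) if a != b)

def knn_classify(train, x, k=7, weights=None):
    """train: list of (vector, sense); each neighbour votes with strength 1/(1+d)."""
    neighbours = sorted(train, key=lambda ex: hamming_distance(x, ex[0], weights))[:k]
    votes = {}
    for vec, sense in neighbours:
        d = hamming_distance(x, vec, weights)
        votes[sense] = votes.get(sense, 0.0) + 1.0 / (1.0 + d)
    return max(votes, key=votes.get)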
One of the main issues of section 3.3 is the adaptation of NB and kNN to WSD. In order to test some of the hypotheses about the differences between the Naive Bayes (NB) and Exemplar-based (EBh,k)2 approaches, some variants of the basic kNN algorithm have been implemented:

• Example weighting (e): each of the k nearest neighbours votes for its sense with a strength proportional to its closeness to the test example.

• Attribute weighting (a): attributes are weighted (using the Gain Ratio measure) when computing the distance between examples.

When both modifications are put together, the resulting algorithm will be referred to as EBh,k,e,a. Finally, we have also investigated the effect of using an alternative metric, the MVDM metric of Cost and Salzberg (1993), giving rise to the EBcs variants. TiMBL3 is an available software package that implements this method with all the options explained in this section (Daelemans et al., 2001).
2 h stands for the Hamming distance and k for the number of nearest neighbours.
3 http://ilk.kub.nl/software.html
A decision list consists of a set of ordered rules of the form (feature-value, sense, weight). In this setting, the decision list algorithm works as follows: the training data are used to estimate the features, which are weighted with a log-likelihood measure (Yarowsky, 1995b) indicating the likelihood of a particular sense given a particular feature value. The list of all rules is sorted by decreasing values of this weight. When testing new examples, the decision list is checked and the feature with the highest weight that matches the test example selects the winning word sense.
\[ \mathrm{weight}(sense_k, feature_i) = \log\!\left( \frac{P(sense_k \mid feature_i)}{\sum_{j \neq k} P(sense_j \mid feature_i)} \right) \]
These probabilities can be estimated using the maximum likelihood estimate, and
some kind of smoothing to avoid the problem of 0 counts. There are many approaches
for smoothing probabilities –see, for instance, (Chen, 1996). In these experiments a very
simple solution has been adopted, which consists of replacing the denominator by 0.1 when
the frequency is zero.
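A minimal sketch of decision-list training and classification under these choices is given below. Examples are assumed to be sets of "feature=value" strings, and the 0.1 replacement of a zero denominator follows the smoothing described above.

import math
from collections import Counter, defaultdict

def train_decision_list(examples):
    """examples: list of (features, sense); features is a set of 'feat=value' strings."""
    count = defaultdict(Counter)               # count[feature][sense]
    for feats, sense in examples:
        for f in feats:
            count[f][sense] += 1
    rules = []
    for f, senses in count.items():
        total = sum(senses.values())
        for k, c in senses.items():
            rest = total - c                   # counts of the competing senses
            # denominator replaced by 0.1 when the competing-sense frequency is zero
            weight = math.log((c / total) / (rest / total if rest > 0 else 0.1))
            rules.append((weight, f, k))
    return sorted(rules, reverse=True)         # sorted by decreasing weight

def classify_dl(rules, feats, default_sense):
    for weight, f, k in rules:
        if f in feats:                         # first (highest-weight) matching rule wins
            return k
    return default_sense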
3.1.4 AdaBoost
AdaBoost (AB) is a general method for obtaining a highly accurate classification rule
by combining many weak classifiers, each of which may be only moderately accurate.
A generalised version of the AdaBoost algorithm, which combines weak classifiers
with confidence-rated predictions (Schapire and Singer, 1999), has been used in these
experiments. This algorithm is able to work efficiently in very high dimensional feature
spaces, and has been applied, with significant success, to a number of practical problems.
The weak hypotheses are learned sequentially, one at a time, and, conceptually, at each iteration the weak hypothesis is biased to classify the examples which were most difficult to classify by the ensemble of the preceding weak hypotheses. AdaBoost maintains a vector of weights as a distribution Dt over examples. At round t, the goal of the weak learner algorithm is to find a weak hypothesis, $h_t : X \rightarrow \mathbb{R}$, with moderately low error with respect to the weight distribution Dt. In this setting, weak hypotheses ht(x) make real-valued confidence-rated predictions. Initially, the distribution D1 is uniform, but after each iteration the boosting algorithm exponentially increases (or decreases) the weights Dt(i) for which ht(xi) makes a bad (or good) prediction, with a variation exponentially dependent on the confidence |ht(xi)|. The final combined hypothesis, $f : X \rightarrow \mathbb{R}$, computes its predictions using a weighted vote of the weak hypotheses,

\[ f(x) = \sum_{t=1}^{T} \alpha_t \cdot h_t(x) \]
For each example x, the sign of f (x) is interpreted as the predicted class (the basic
AdaBoost works only with binary outputs, -1 or +1), and the magnitude |f (x)| is
interpreted as a measure of confidence in the prediction. Such a function can be used
either for classifying new unseen examples or for ranking them according to the confidence
degree.
Let S = {(x1 , Y1 ), . . . , (xm , Ym )} be the set of m training examples, where each instance
xi belongs to an instance space X and each Yi is a subset of a finite set of labels or classes
Y. The size of Y is denoted by k = |Y|. AdaBoost.MH maintains an m × k matrix of
weights as a distribution D over m examples and k labels. The goal of the WeakLearner
algorithm is to find a weak hypothesis with moderately low error with respect to these
weights. Initially, the distribution D1 is uniform, but the boosting algorithm updates the
weights on each round to force the weak learner to concentrate on the pairs (example,
label) which are hardest to predict.
More precisely, let Dt be the distribution at round t, and $h_t : X \times \mathcal{Y} \rightarrow \mathbb{R}$ the weak rule acquired according to Dt. The sign of ht(x, l) is interpreted as a prediction of whether label l should be assigned to example x or not. The magnitude of the prediction |ht(x, l)| is interpreted as a measure of confidence in the prediction. In order to correctly understand the updating formula, one last piece of notation should be defined: given $Y \subseteq \mathcal{Y}$ and $l \in \mathcal{Y}$, let Y[l] be +1 if $l \in Y$ and −1 otherwise.
Now, it becomes clear that the updating function decreases (or increases) the weights Dt(i, l) for which ht makes a good (or bad) prediction, and that this variation is proportional to |ht(x, l)|.
Note that WSD is not a multi-label classification problem since a unique sense is
expected for each word in context. In our implementation, the algorithm runs exactly in
procedure AdaBoost.MH
  Input: S = {(x_i, Y_i), ∀ i = 1..m}
    ### S is the set of training examples
    ### m is the number of training examples; k is the number of classes
  ### Initialise the distribution D_1
  D_1(i, l) = 1/(mk),  ∀ i = 1..m, ∀ l = 1..k
  ### Perform T rounds:
  for t := 1 to T do
    ### Get the weak hypothesis h_t : X × Y → R
    h_t = WeakLearner(X, D_t)
    ### Update the distribution D_t
    D_{t+1}(i, l) = D_t(i, l) · exp(−Y_i[l] · h_t(x_i, l)) / Z_t,  ∀ i = 1..m, ∀ l = 1..k
    ### Z_t is a normalisation factor (chosen so that D_{t+1} will be a distribution)
  end for
  return the combined hypothesis: f(x, l) = Σ_{t=1}^{T} h_t(x, l)
end AdaBoost.MH
the same way as explained above, except that sets Yi are reduced to a unique label, and
that the combined hypothesis is forced to output a unique label, which is the one that
maximises f (x, l).
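The following Python fragment is a minimal sketch of this single-label use of AdaBoost.MH: the distribution update, the renormalisation, and the final arg-max prediction. The weak_learner argument is assumed to return a confidence-rated function h(x, l); a possible implementation of such a weak learner is sketched later in this section.

import math

def adaboost_mh(examples, senses, weak_learner, rounds=250):
    """examples: list of (x, true_sense). Y[i][l] = +1 if l is the sense of example i."""
    m, k = len(examples), len(senses)
    Y = [{l: (+1 if l == y else -1) for l in senses} for _, y in examples]
    D = [{l: 1.0 / (m * k) for l in senses} for _ in range(m)]
    weak_rules = []
    for _ in range(rounds):
        h = weak_learner(examples, D)                      # h(x, l) -> real value
        # exponential update, then renormalise so D stays a distribution
        for i, (x, _) in enumerate(examples):
            for l in senses:
                D[i][l] *= math.exp(-Y[i][l] * h(x, l))
        z = sum(D[i][l] for i in range(m) for l in senses)
        for i in range(m):
            for l in senses:
                D[i][l] /= z
        weak_rules.append(h)
    def classify(x):                                       # forced unique output label
        return max(senses, key=lambda l: sum(h(x, l) for h in weak_rules))
    return classify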
Up to now, only the form of the WeakLearner remains to be defined. Schapire and Singer (1999) prove that the Hamming loss of the AdaBoost.MH algorithm on the training set4 is at most $\prod_{t=1}^{T} Z_t$, where $Z_t$ is the normalisation factor computed on round t. This upper bound is used to guide the design of the WeakLearner algorithm, which attempts to find a weak hypothesis ht that minimises:
\[ Z_t = \sum_{i=1}^{m} \sum_{l \in \mathcal{Y}} D_t(i, l) \exp(-Y_i[l]\, h_t(x_i, l)) \]
As in (Abney et al., 1999), very simple weak hypotheses are used to test the value of a
boolean predicate and make a prediction based on that value. The predicates used are of
the form “f = v”, where f is a feature and v is a value (e.g.: “previous word = hospital”).
4 i.e., the fraction of training examples i and labels l for which the sign of f(xi, l) differs from Yi[l].
Formally, based on a given predicate p, our interest lies in weak hypotheses h which make predictions of the form:

\[ h(x, l) = \begin{cases} c_{1l} & \text{if } p \text{ holds in } x \\ c_{0l} & \text{otherwise} \end{cases} \]

where the $c_{jl}$'s are real numbers.
For a given predicate p, and bearing the minimisation of Zt in mind, values cjl should
be calculated as follows. Let X1 be the subset of examples for which the predicate p holds
and let X0 be the subset of examples for which the predicate p does not hold. Let [[π]],
for any predicate π, be 1 if π holds and 0 otherwise. Given the current distribution Dt ,
the following real numbers are calculated for each possible label l, for j ∈ {0, 1}, and for
b ∈ {+1, −1}:
\[ W_b^{jl} = \sum_{i=1}^{m} D_t(i, l)\, [[\, x_i \in X_j \wedge Y_i[l] = b \,]] \]
That is, $W_{+1}^{jl}$ ($W_{-1}^{jl}$) is the weight (with respect to distribution $D_t$) of the training examples in partition $X_j$ which are (or are not) labelled by l.
As it is shown in (Schapire and Singer, 1999), Zt is minimised for a particular predicate
by choosing:
\[ c_{jl} = \frac{1}{2} \ln\!\left( \frac{W_{+1}^{jl}}{W_{-1}^{jl}} \right) \]

which gives

\[ Z_t = 2 \sum_{j \in \{0,1\}} \sum_{l \in \mathcal{Y}} \sqrt{W_{+1}^{jl}\, W_{-1}^{jl}} \]
Thus, the predicate p chosen is that for which the value of Zt is smallest.
Very small or zero values of the parameters $W_b^{jl}$ cause the $c_{jl}$ predictions to be large or infinite in magnitude. In practice, such large predictions may cause numerical problems to the algorithm, and they seem to increase the tendency to overfit. As suggested in (Schapire and Singer, 1999), smoothed values for $c_{jl}$ have been used.
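A minimal sketch of this WeakLearner is given below: for a set of candidate predicates p(x) it accumulates the weights W_b^{jl}, keeps the predicate with the smallest Z_t, and returns the corresponding smoothed confidence-rated rule. The smoothing constant eps and the predicate representation (boolean functions over examples) are assumptions of the sketch; to plug it into the boosting loop sketched earlier one would fix senses and predicates, e.g. with functools.partial.

import math

def weak_learner(examples, D, senses, predicates, eps=1e-4):
    """predicates: list of functions p(x) -> bool. Returns the best weak rule h(x, l)."""
    best = None
    for p in predicates:
        # W[b][j][l]: weight of examples in partition X_j (j=1 iff p holds) with Y_i[l] = b
        W = {b: {j: {l: 0.0 for l in senses} for j in (0, 1)} for b in (+1, -1)}
        for i, (x, y) in enumerate(examples):
            j = 1 if p(x) else 0
            for l in senses:
                b = +1 if l == y else -1
                W[b][j][l] += D[i][l]
        Z = 2.0 * sum(math.sqrt(W[+1][j][l] * W[-1][j][l])
                      for j in (0, 1) for l in senses)
        if best is None or Z < best[0]:
            # smoothed confidence-rated predictions c_{jl}
            c = {j: {l: 0.5 * math.log((W[+1][j][l] + eps) / (W[-1][j][l] + eps))
                     for l in senses} for j in (0, 1)}
            best = (Z, p, c)
    _, p, c = best
    return lambda x, l: c[1 if p(x) else 0][l]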
A simple modification of the AdaBoost algorithm, which consists of dynamically selecting a much reduced feature set at each iteration, has been used to significantly increase the efficiency of the learning process with no loss in accuracy (see section 3.4 for the details about LazyBoosting and about how to set the number of rounds when learning). This variant has been called LazyBoosting.
Support Vector Machines (SVM) were introduced by Boser et al. (1992), but they have been gaining popularity in the machine learning community only since 1996. Nowadays, SVM have been successfully applied to a number of pattern recognition problems in bioinformatics and image recognition. Regarding text processing, SVM have obtained the best results to date in text categorisation (Joachims, 1998) and they are being used in more and more NLP-related problems, e.g., chunking (Kudo and Matsumoto, 2001), parsing (Collins, 2002), or WSD (Murata et al., 2001; Lee and Ng, 2002; Escudero et al., 2004; Lee et al., 2004).
SVM are based on the Structural Risk Minimisation principle from the Statistical
Learning Theory (Vapnik, 1998) and, in their basic form, they learn a linear hyperplane
that separates a set of positive examples from a set of negative examples with maximum
margin (the margin γ is defined by the distance of the hyperplane to the nearest of the
positive and negative examples). This learning bias has proved to have good properties in
terms of generalisation bounds for the induced classifiers. The linear classifier is defined by two elements: a weight vector w (with one component for each feature), and a bias b which stands for the distance of the hyperplane to the origin: h(x) = <x, w> + b (see figure 3.2). The classification rule f(x) assigns +1 to a new example x when h(x) ≥ 0, and −1 otherwise. The positive and negative examples closest to the (w, b) hyperplane are called support vectors.
Learning the maximal margin hyperplane (w, b) can be stated as a convex quadratic optimisation problem with a unique solution, consisting of (primal formulation):

\[ \text{minimise:}\quad \|w\| \qquad \text{subject to:}\quad y_i (\langle w, x_i \rangle + b) \ge 1, \;\; \forall\, 1 \le i \le N \]

From the dual formulation of this problem, the vector orthogonal to the hyperplane can be expressed as a linear combination of the training examples, so that

\[ h(x) = \langle w, x \rangle + b = b + \sum_{i=1}^{N} y_i \alpha_i \langle x_i, x \rangle \]

where $\alpha_i \neq 0$ only for the training examples lying on the margin (the support vectors) and $\alpha_i = 0$ for the others. The classification rule obtained from the dual formulation is again f(x) = +1 when h(x) ≥ 0 and −1 otherwise.
Despite the linearity of the basic algorithm, SVM can be transformed into a dual form (in which the training examples only appear inside dot product operations), allowing the use of kernel functions to produce non-linear classifiers. Kernel functions allow SVM to work efficiently and implicitly in very high dimensional feature spaces, where new features can be expressed as combinations of many basic features (Cristianini and Shawe-Taylor, 2000). If we suppose we have a non-linear transformation $\Phi$ from $\mathbb{R}^D$ to some Hilbert space $\mathcal{H}$, then we can define a dot product through a kernel function $K(x, y) = \langle \Phi(x), \Phi(y) \rangle$ and obtain the following dual formulation:
\[ \text{maximise:}\quad \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \qquad \text{subject to:}\quad \sum_{i=1}^{N} y_i \alpha_i = 0, \;\; \alpha_i \ge 0, \;\; \forall\, 1 \le i \le N \]
In this case, the classifier is expressed as:

\[ h(x) = \langle w, \Phi(x) \rangle + b = b + \sum_{i=1}^{N} y_i \alpha_i K(x_i, x) \]
Sometimes the training examples are not linearly separable or, simply, it is not desirable to obtain a perfect separating hyperplane. In these cases it is preferable to allow some errors in the training set so as to obtain a "better" solution hyperplane (see figure 3.3). This is achieved by a variant of the optimisation problem, referred to as soft margin, in which the contribution to the objective function of the margin maximisation and of the training errors can be balanced through a parameter called C. This parameter affects the slack variables $\xi_i$ in the objective function (see figure 3.3). This problem can be formulated as:
\[ \text{minimise:}\quad \frac{1}{2} \langle w, w \rangle + C \sum_{i=1}^{N} \xi_i \]
In the experiments presented below, the SVMlight software5 has been used. We have worked only with linear kernels, performing a tuning of the C parameter directly on the DSO corpus. It has to be noted that no significant improvements were achieved by using non-linear kernels (polynomial or Gaussian).
5 See appendix C for the web reference of this freely available implementation.
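For illustration only, the fragment below sketches the same setup (linear kernel, C tuned by cross-validation, one classifier per word) with scikit-learn rather than the SVMlight package actually used; the feature encoding through DictVectorizer and the grid of C values are assumptions of the sketch.

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def train_svm_for_word(feature_dicts, sense_labels):
    """feature_dicts: list of dicts of binary features; sense_labels: list of sense labels."""
    vec = DictVectorizer()
    X = vec.fit_transform(feature_dicts)               # sparse binary feature matrix
    # tune the soft-margin parameter C by cross-validation, linear kernel only
    search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=10)
    search.fit(X, sense_labels)
    return vec, search.best_estimator_

# usage: vec, clf = train_svm_for_word(feature_dicts, labels)
#        prediction = clf.predict(vec.transform([new_feature_dict]))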
3.2 Setting
The comparison of algorithms has been performed in a series of controlled experiments using exactly the same training and test sets for each method. The experimental methodology consisted of a 10-fold cross-validation. All accuracy/error-rate results have been averaged over the results of the 10 folds. The statistical tests of significance have been performed using a 10-fold cross-validation paired Student's t-test (Dietterich, 1998) with a confidence threshold of t9,0.975 = 2.262.
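The following small sketch shows this paired test, following Dietterich (1998): the t statistic is computed from the per-fold error differences of two systems over the same 10 folds and compared with the t_{9,0.975} = 2.262 threshold quoted above.

import math

def paired_cv_t_test(errors_a, errors_b, threshold=2.262):
    """errors_a, errors_b: per-fold error rates of two systems over the same folds."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance of the differences
    t = mean / math.sqrt(var / n)
    return t, abs(t) > threshold        # True: the difference is significant at the 95% level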
3.2.1 Corpora
In our experiments, all approaches have been evaluated on the DSO corpus, a semantically
annotated corpus containing 192,800 occurrences of 121 nouns and 70 verbs corresponding
to the most frequent and ambiguous English words. These examples, consisting of the
full sentence in which the ambiguous word appears, are tagged with a set of labels
corresponding, with minor changes, to the senses of WordNet 1.5 (Miller et al., 1990).
This corpus was collected by Ng and colleagues (Ng and Lee, 1996) and it is available
from the Linguistic Data Consortium6 .
In the experiments reported in sections 3.3 and 3.4, a smaller group of 10 nouns (age, art, car, child, church, cost, head, interest, line and work) and 5 verbs (fall, know, set, speak and take), which frequently appear in the WSD literature, has been selected. Since our goal is to acquire a classifier for each word, each word represents a classification problem. Table 3.1 shows these words with their number of classes (senses), the number of training examples and the accuracy of a naive Most Frequent Sense classifier (MFS).
Two sets of attributes have been used, which will be referred to as SetA and SetB,
respectively. Let “. . . w−3 w−2 w−1 w w+1 w+2 w+3 . . .” be the context of consecutive
words around the word w to be disambiguated. Attributes refer to this context as follows.
• SetA contains the seven following attributes: w−2 , w−1 , w+1 , w+2 , (w−2 , w−1 ),
(w−1 , w+1 ), and (w+1 , w+2 ), where the last three correspond to collocations of two
consecutive words. These attributes, which are exactly those used in (Ng, 1997a),
represent the local context of the ambiguous word and they have been proven to
be very informative features for WSD. Note that whenever an attribute refers to a position that falls beyond the boundaries of the sentence for a certain example, a default value “ ” is assigned.

6 http://www.ldc.upenn.edu/Catalog/LDC97T12.html
Let p±i be the part-of-speech tag7 of word w±i , and c1 , . . . , cm the unordered set of open
class words appearing in the sentence.
• SetB enriches the local context: w−1, w+1, (w−2, w−1), (w−1, w+1), (w+1, w+2), (w−3, w−2, w−1), (w−2, w−1, w+1), (w−1, w+1, w+2) and (w+1, w+2, w+3), with the part-of-speech information p−3, p−2, p−1, p+1, p+2, p+3, and, additionally, it incorporates broad context information: c1 . . . cm. SetB is intended to represent a more realistic set of attributes for WSD8. Note that the ci attributes are binary-valued, denoting the presence or absence of a content word in the sentence context. (A small sketch of this kind of feature extraction is given below, after the footnotes.)
7 We have used the FreeLing tagger (Atserias et al., 2006) to get those tags. See appendix C for the Web references.
8 Actually, it incorporates all the attributes used in (Ng and Lee, 1996), with the exception of the morphology of the target word and the verb-object syntactic relation.
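As announced above, the fragment below is a minimal sketch of the SetA local-context extraction (word forms and collocations around the target); the sentence is assumed to be a list of tokens, the feature names and the "_" out-of-sentence marker are illustrative, and the PoS and content-word attributes of SetB are omitted.

def set_a_features(sentence, i):
    """SetA attributes for the target word at position i: w-2, w-1, w+1, w+2 and
    the collocations (w-2,w-1), (w-1,w+1), (w+1,w+2); '_' outside the sentence limits."""
    def w(j):
        return sentence[i + j] if 0 <= i + j < len(sentence) else "_"
    return {
        "w-2": w(-2), "w-1": w(-1), "w+1": w(+1), "w+2": w(+2),
        "w-2,w-1": w(-2) + "/" + w(-1),
        "w-1,w+1": w(-1) + "/" + w(+1),
        "w+1,w+2": w(+1) + "/" + w(+2),
    }

print(set_a_features(["the", "interest", "rate", "rose"], 1))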
3.3 Adapting Naive-Bayes and Exemplar-Based Algorithms
Regarding the comparison between Naive Bayes and Exemplar-based methods, the works by Mooney (1996) and Ng (1997a) are the main references in this section. Mooney's paper claims that the Bayesian approach is clearly superior to the
Exemplar-based approach. Although it is not explicitly said, the overall accuracy of
Naive Bayes is about 16 points higher than that of the Example-based algorithm, and
the latter is only slightly above the accuracy that a Most Frequent Sense Classifier
would obtain. In the Exemplar-based approach, the algorithm applied for classifying
new examples was a standard k -NN, using the Hamming distance for measuring closeness.
Neither example weighting nor attribute weighting are applied, k is set to 3, and the
number of attributes used is said to be almost 3,000. The second paper compares the
Naive Bayes approach with PEBLS (Cost and Salzberg, 1993), an Exemplar-based learner
especially designed for dealing with symbolic features. This paper shows that, for a large
number of nearest-neighbours, the performance of both algorithms is comparable, while if
cross-validation is used for parameter setting, PEBLS slightly outperforms Naive Bayes.
It has to be noted that the comparison was carried out in a limited setting, using only 7 feature types, and that the attribute/example-weighting facilities provided by PEBLS were not used.
Another surprising result that appears in Ng’s paper is that the accuracy results
obtained were 1-1.6% higher than those reported by the same author one year before (Ng
and Lee, 1996), when running exactly the same algorithm on the same data, but using a
larger and richer set of attributes. This apparently paradoxical difference is attributed, by
the author, to the feature pruning process performed in the older paper. Apart from the
contradictory results obtained by the previous two papers, some methodological drawbacks
of both comparisons should also be pointed out. On the one hand, Ng applies the
algorithms on a broad-coverage corpus but reports the accuracy results of a single testing
experiment, providing no statistical tests of significance. On the other hand, Mooney
performs thorough and rigorous experiments, but he compares the alternative methods
on a limited domain consisting of a single word with a reduced set of six senses. Thus,
it is our claim that this extremely specific domain does not guarantee reliable conclusions
about the relative performances of alternative methods when applied to broad-coverage
domains. Consequently, the aim of this section is twofold: 1) To study the source of the
differences between both approaches in order to clarify the contradictory and incomplete
information. 2) To empirically test the alternative algorithms and their extensions on a
broad-coverage sense tagged corpus, in order to estimate which is the most appropriate
choice.
Both Naive Bayes and Exemplar-based (during this section EBh,k ) classifiers have been
used as they are explained in section 3.1. All settings of the experiments have been already
described in section 3.2.
This experiment is devoted to testing Naive Bayes and all the variants of the Exemplar-based algorithm on a simple and commonly used set of local features. Table 3.2 shows the results of all methods and variants tested on the 15 reference words, using the SetA set of attributes: Most Frequent Sense (MFS), Naive Bayes (NB), Exemplar-based using the Hamming distance (EBh variants, 4th to 8th columns), and Exemplar-based using the MVDM metric (EBcs variants, 9th to 11th columns) are included. The best
                                      EBh                            EBcs
word         MFS     NB      1      7    15,e    7,a   7,e,a      1     10   10,e
age.n 62.1 73.8 71.4 69.4 71.0 74.4 75.9 70.8 73.6 73.6
art.n 46.7 54.8 44.2 59.3 58.3 58.5 57.0 54.1 59.5 61.0
car.n 95.1 95.4 91.3 95.5 95.8 96.3 96.2 95.4 96.8 96.8
child.n 80.9 86.8 82.3 89.3 89.5 91.0 91.2 87.5 91.0 90.9
church.n 61.1 62.7 61.9 62.7 63.0 62.5 64.1 61.7 64.6 64.3
cost.n 87.3 86.7 81.1 87.9 87.7 88.1 87.8 82.5 85.4 84.7
fall.v 70.1 76.5 73.3 78.2 79.0 78.1 79.8 78.7 81.6 81.9
head.n 36.9 76.9 70.0 76.5 76.9 77.0 78.7 74.3 78.6 79.1
interest.n 45.1 64.5 58.3 62.4 63.3 64.8 66.1 65.1 67.3 67.4
know.v 34.9 47.3 42.2 44.3 46.7 44.9 46.8 45.1 49.7 50.1
line.n 21.9 51.9 46.1 47.1 49.7 50.7 51.9 53.3 57.0 56.9
set.v 36.9 55.8 43.9 53.0 54.8 52.3 54.3 49.7 56.2 56.0
speak.v 69.1 74.3 64.6 72.2 73.7 71.8 72.9 67.1 72.5 72.9
take.v 35.6 44.8 39.3 43.7 46.1 44.5 46.0 45.3 48.8 49.1
work.n 31.7 51.9 42.5 43.7 47.2 48.5 48.9 48.5 52.0 52.5
nouns 57.4 71.7 65.8 70.0 71.1 72.1 72.6 70.6 73.6 73.7
verbs 46.6 57.6 51.1 56.3 58.1 56.4 58.1 55.9 60.3 60.5
all 53.3 66.4 60.2 64.8 66.2 66.1 67.2 65.0 68.6 68.7
Table 3.2: Accuracy results of all algorithms on the set of 15 reference words using SetA.
result for each word is printed in boldface. From these figures, several conclusions can be
drawn:
• Referring to the EBh variants, EBh,7 performs significantly better than EBh,1, confirming the result of Ng (1997a) that values of k greater than one are needed in order to achieve good performance with the kNN approach. Additionally, both example weighting (EBh,15,e) and attribute weighting (EBh,7,a) significantly improve over EBh,7. Further, the combination of both (EBh,7,e,a) achieves an additional improvement.
POS      MFS     NB    PNB   EBh,15
nouns    57.4   72.2   72.4   64.3
verbs    46.6   55.2   55.3   43.0
all      53.3   65.8   66.0   56.2

                        PEBh                             PEBcs
POS       1      7    7,e    7,a   10,e,a       1     10   10,e
nouns   70.6   72.4   73.7   72.5    73.4     73.2   75.4   75.6
verbs   54.7   57.7   59.5   58.9    60.2     58.6   61.9   62.1
all     64.6   66.8   68.4   67.4    68.4     67.7   70.3   70.5
Table 3.4: Results of all algorithms on the set of 15 reference words using SetB.
• Table 3.3 shows that EBh is fifty times faster than EBcs using SetA9.
• The Exemplar-based approach achieves better results than the Naive Bayes algorithm. This difference is statistically significant when comparing the EBcs,10 and EBcs,10,e variants against NB.
The aim of the experiments with SetB is to test both methods with a realistic, large set of features. Table 3.4 summarises the results of these experiments10. Let us first consider only NB and EBh (65.8 and 56.2 accuracy). A very surprising result is observed: while NB achieves almost the same accuracy as in the previous experiment, the Exemplar-based approach shows a very low performance. The accuracy of EBh drops 8.6 points (from 66.0 to 56.2) and is only slightly higher than that of MFS.
The problem is that the binary representation of the broad-context attributes is not
appropriate for the kNN algorithm. Such a representation leads to an extremely sparse
vector representation of the examples, since in each example only a few words, among all
9 The programs were implemented using PERL-5.003 and they ran on a Sun UltraSPARC-2 machine with 192Mb of RAM.
10 Detailed results for each word are not included.
possible, are observed. Thus, the examples are represented by a vector of about 5,000 0's and only a few 1's. In this situation, two examples will coincide in the majority of the attribute values (roughly speaking, in "all" the zeros) and, with high probability, will differ in the positions corresponding to 1's. This fact wrongly biases the similarity measure (and, thus, the classification) in favour of the stored examples that have fewer 1's, that is, those corresponding to the shortest sentences.
This situation could explain the poor results obtained by the kNN algorithm in
Mooney’s work, in which a large number of attributes was used. Moreover, it could explain
why the results of Ng’s system working with a rich attribute set (including binary-valued
contextual features) were lower than those obtained with a simpler set of attributes11 .
To avoid this drawback, an alternative representation has been adopted in which each example is described only by the features that are active in it. This approach implies that a test example is classified taking into account the information about the words it contains (positive information), but not the information about the words it does not contain. Besides, it allows a very efficient implementation, which will be referred to as PEB (standing for Positive Exemplar-Based).
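The fragment below is a minimal sketch of this "positive information only" idea: stored examples are sets of active features, and neighbours are ranked by a similarity computed from the features present in the test example only, avoiding the sparse all-zeros comparison. The concrete similarity (set intersection) and the plain majority vote are illustrative choices, not necessarily the exact PEB implementation.

def positive_similarity(test_feats, stored_feats):
    """Similarity based only on the active features of the test example."""
    return len(test_feats & stored_feats)

def peb_classify(train, test_feats, k=7):
    """train: list of (set_of_active_features, sense)."""
    neighbours = sorted(train, key=lambda ex: -positive_similarity(test_feats, ex[0]))[:k]
    votes = {}
    for feats, sense in neighbours:
        votes[sense] = votes.get(sense, 0) + 1
    return max(votes, key=votes.get)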
In the same direction, we have tested the Naive Bayes algorithm combining only the conditional probabilities corresponding to the words that appear in the test examples13. This variant is referred to as PNB. The results of both PEB and PNB are included in table 3.4, from which the following conclusions can be drawn.
• NB does not perform better with SetB than with SetA. The addition of new features to an NB classifier should be done carefully, due to its sensitivity to redundant features.

• The Exemplar-based approach achieves better results than the Naive Bayes algorithm. Furthermore, the accuracy of PEBh,7,e is now significantly higher than that of PNB.
Global Results
In order to ensure that the results obtained so far also hold on a realistic broad-coverage domain, the PNB and PEB algorithms have been tested on the whole sense-tagged corpus, using both sets of attributes. This corpus contains about 192,800 examples of 121 nouns and 70 verbs. The average number of senses is 7.2 for nouns, 12.6 for verbs, and 9.2 overall. The average number of training examples is 933.9 for nouns, 938.7 for verbs, and 935.6 overall.
The results obtained are presented in table 3.5. It has to be noted that the results of PEBcs using SetB were not calculated due to the extremely large computational effort required by the algorithm (see table 3.3). The results are coherent with those reported previously, that is:
• Using SetA, the Exemplar-based approach using the MVDM metric is significantly
superior to the rest.
• Using SetB, the Exemplar-based approach using Hamming distance and example
weighting significantly outperforms the Bayesian approach. Although the use of
the MVDM metric could lead to better results, the current implementation is
computationally prohibitive.
• Contrary to the Exemplar-based approach, Naive Bayes does not improve accuracy
when moving from SetA to SetB, that is, the simple addition of attributes does not
guarantee accuracy improvements in the Bayesian framework.
3.3.3 Conclusions
This section has focused on clarifying some contradictory results obtained when comparing
Naive Bayes and Exemplar-based approaches to Word Sense Disambiguation. Different
alternative algorithms have been tested using two different attribute sets on a large
sense-tagged corpus. The experiments carried out show that Exemplar-based algorithms
have generally better performance than Naive Bayes, when they are extended with
example/attribute weighting, richer metrics, etc. The experiments reported also show that
the Exemplar-based approach is very sensitive to the representation of a concrete type of
attributes, frequently used in Natural Language problems. To avoid this drawback, an
alternative representation of the attributes has been proposed and successfully tested.
Furthermore, this representation also improves the efficiency of the algorithms, when
using a large set of attributes. The test on the whole corpus allows us to estimate
that, in a realistic scenario, the best tradeoff between performance and computational
requirements is achieved by using the Positive Exemplar-based algorithm, both local and
topical attributes, Hamming distance, and example-weighting.
3.4 Applying the AdaBoost Algorithm

In this section Schapire and Singer's AdaBoost.MH boosting algorithm has been applied to
Word Sense Disambiguation. Initial experiments on a set of 15 selected polysemous words
show that the boosting approach surpasses Naive Bayes and Exemplar-based approaches.
In order to make boosting practical for a real learning domain of thousands of words,
several ways of accelerating the algorithm by reducing the feature space are studied. The
best variant, which we call LazyBoosting, is tested on the DSO corpus. Again, boosting
compares favourably to the other benchmark algorithms.
The setting details of the experiments reported in this section are those of section 3.2. The data used are the 15 words selected from the DSO corpus, and the feature set is the one called SetA. AdaBoost.MH has been compared to the Naive Bayes (NB) and Exemplar-Based (EBh,15,e) algorithms, using them as baselines (see sections 3.2 and 3.3). Figure 3.4 shows the error rate curve of AdaBoost.MH, averaged over the 15 reference words, for an increasing number of weak rules per word. This plot shows that the error obtained by AdaBoost.MH is lower than those obtained by NB and EB for a number of rules above 100. It also shows that the error rate decreases slightly and monotonically as it approaches the maximum number of rules reported14.
Figure 3.4: Error rate (%) of AdaBoost.MH with respect to the number of weak rules per word, compared with the MFS, Naive Bayes and Exemplar Based baselines.
According to the plot in figure 3.4, no overfitting is observed while increasing the
number of rules per word. Although it seems that the best strategy could be “learn as
many rules as possible”, the number of rounds must be determined individually for each
14 The maximum number of rounds considered is 750.
word, since they have different behaviour. The adjustment of the number of rounds can be done by cross-validation on the training set, as suggested in (Abney et al., 1999). However, in our case, this cross-validation inside the cross-validation of the general experiment would generate a prohibitive overhead. Instead, a very simple stopping criterion (sc) has been used, which consists in stopping the acquisition of weak rules whenever the error rate on the training set falls below 5%, with an upper bound of 750 rules. This variant, referred to as ABsc, obtained results comparable to AB750 while generating only 370.2 weak rules per word on average, which represents a very moderate storage requirement for the combined classifiers.
Table 3.6: Set of 15 reference words and results of the main algorithms
As it can be seen, in 14 out of 15 cases, the best results correspond to the boosting
algorithm. When comparing global results, accuracies of either AB750 or ABsc are
significantly higher than those of any of the other methods.
Up to now, it has been seen that AdaBoost.MH is a simple and competitive algorithm for the WSD task: it achieves an accuracy superior to that of the Naive Bayes and Exemplar-based baselines. However, AdaBoost.MH has the drawback of its computational cost, which prevents the algorithm from scaling properly to real WSD domains of thousands of words.
The space and time-per-round requirements of AdaBoost.MH are O(mk) (recall that m is the number of training examples and k the number of senses), not including the call to the weak learner. This cost is unavoidable since AdaBoost.MH is inherently sequential: in order to learn the (t+1)-th weak rule it needs the calculation of the t-th weak rule, which properly updates the matrix Dt. Furthermore, inside the WeakLearner there is another iterative process that examines, one by one, all attributes so as to decide which one minimises Zt. Since there are thousands of attributes, this is also a time-consuming part, which can be straightforwardly sped up either by reducing the number of attributes or by relaxing the need to examine all attributes at each iteration.
Four methods have been tested in order to reduce the cost of searching for weak rules.
The first three, consisting of aggressively reducing the feature space, are frequently applied
in Text Categorisation. The fourth consists in reducing the number of attributes that are
examined at each round of the boosting algorithm.
• Frequency filtering (Freq): This method consists in selecting the N most frequent features of each word over the whole training set, discarding the rest.

• Local frequency filtering (LFreq): This method works similarly to Freq but considers the frequency of events locally, at the sense level. More particularly, it selects the N most frequent features of each sense and then merges them.
• RLM ranking: This third method consists in making a ranking of all attributes
according to the RLM distance measure (de Mántaras, 1991) and selecting the N
most relevant features. This measure has been commonly used for attribute selection
in decision tree induction algorithms15 .
15 The RLM distance belongs to the distance-based and information-based families of attribute selection functions. It has been selected because it showed better performance than seven other alternatives in an experiment of decision tree induction for POS tagging (Màrquez, 1999).
• LazyBoosting: The last method does not filter out any attribute but reduces the
number of those that are examined at each iteration of the boosting algorithm.
More specifically, a small proportion p of attributes are randomly selected and the
best weak rule is selected among them. The idea behind this method is that if the
proportion p is not too small, probably a sufficiently good rule can be found at each
iteration. Besides, the chance for a good rule to appear in the whole learning process
is very high. Another important characteristic is that no attribute needs to be
discarded and so we avoid the risk of eliminating relevant attributes. This approach
has one main drawback: since boosting changes the weight distribution at each iteration, a good feature may lose relevance at the precise iteration in which it happens to be sampled, simply because it is not informative for the examples carrying the highest weights at that point. This method will be called LazyBoosting, in reference to the work by Samuel and colleagues (Samuel, 1998), who applied the same technique to accelerate the learning algorithm in a Dialogue Act tagging system. A minimal sketch of the attribute sampling step is given after this list.
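The following minimal sketch illustrates the attribute sampling idea only; the function and parameter names are ours, and z_score is a hypothetical callable returning the Zt value of the best weak rule built on a given attribute.

import random

def select_weak_rule_lazy(attributes, z_score, p=0.10, rng=random):
    """One LazyBoosting round: score only a randomly sampled fraction p of
    the attributes and return the one with the lowest Z_t value, instead of
    examining the whole attribute set as plain AdaBoost.MH does."""
    sample_size = max(1, int(p * len(attributes)))
    candidates = rng.sample(attributes, sample_size)
    return min(candidates, key=z_score)   # z_score(a): Z_t of the best rule on a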
The four methods above have been compared for the set of 15 reference words. Figure
3.5 contains the average error-rate curves obtained by the four variants at increasing
levels of attribute reduction. The top horizontal line corresponds to the MFS error rate,
while the bottom horizontal line stands for the error rate of AdaBoost.MH working with
all attributes. The results contained in figure 3.5 are calculated running the boosting
algorithm 250 rounds for each word.
• All methods seem to work quite well since no important degradation is observed in
performance for values lower than 95% in rejected attributes. This may indicate
that there are many irrelevant or highly dependent attributes in our domain; and
that concepts depend only on a small set of features. All attribute filtering methods
capture the really important features.
• LFreq is slightly better than Freq, indicating a preference to make frequency counts
for each sense rather than globally.
• The more informed RLM ranking performs better than frequency-based reduction
methods Freq and LFreq.
Figure 3.5: Error rate obtained by the four methods, at 250 weak rules per word, with
respect to the percentage of rejected attributes
• LazyBoosting, which obtains comparable performance and runs about 7 times faster than AdaBoost.MH working with all attributes, will be selected for the experiments in the next section.
Evaluating LazyBoosting
The LazyBoosting algorithm has been tested on the full DSO (all 191 words) with p = 10%
and the same stopping criterion described above, which will be referred to as ABl10sc .
The ABl10sc algorithm learned an average of 381.1 rules per word, and took about
4 days of cpu time to complete16 . It has to be noted that this time includes the
cross-validation overhead. Eliminating it, it is estimated that about 4 cpu days would be enough to acquire a boosting-based word sense disambiguation system covering about 2,000 words.
The ABl10sc has been compared again to the benchmark algorithms using the 10-fold
cross-validation methodology described in section 3.2. The average accuracy results are
reported in the upper-side of table 3.7. The best figures correspond to the LazyBoosting
algorithm ABl10sc and, again, the differences are statistically significant using the 10-fold cross-validation paired t-test.
16 The implementation was written in PERL-5.003 and it was run on a SUN UltraSparc2 machine with 194Mb of RAM.
                accuracy (%)
              MFS     NB      EB15    ABl10sc
nouns (121)   56.4    68.7    68.0    70.8
verbs (70)    46.7    64.8    64.9    67.5
all (191)     52.3    67.1    66.7    69.5

                wins-ties-losses
              ABl10sc vs NB       ABl10sc vs EB15
nouns (121)   99(51)-1-21(3)      100(68)-5-16(1)
verbs (70)    63(35)-1-6(2)       64(39)-2-4(0)
all (191)     162(86)-2-27(5)     164(107)-7-20(1)
Table 3.7: Results of LazyBoosting and the benchmark methods on the 191-word corpus
The lower side of the table shows the comparison of ABl10sc versus the NB and EB15 algorithms, respectively. Each cell contains the number of wins, ties, and losses of the competing algorithms. The counts of statistically significant differences are included in brackets. It is important to point out that EB15 beats ABl10sc significantly in only one case, while NB does so in five cases. Conversely, a significant superiority of ABl10sc over EB15 and NB is observed in 107 and 86 cases, respectively.
3.4.2 Conclusions
In this section, Schapire and Singer’s AdaBoost.MH algorithm has been evaluated on
the Word Sense Disambiguation task. As it has been shown, the boosting approach
outperforms Naive Bayes and Exemplar-based learning. In addition, a faster variant
has been suggested and tested, which is called LazyBoosting. This variant scales the
algorithm to broad-coverage real Word Sense Disambiguation domains, and is as accurate
as AdaBoost.MH.
3.5 Comparison of Machine Learning Methods
This section is devoted to comparing the five algorithms described in section 3.1. That is:
Naive Bayes (NB), Exemplar-based (kNN), Decision Lists (DL), AdaBoost.MH (AB) and
Support Vector Machines (SVM). We experimentally evaluated and compared them on
the DSO corpus.
From the 191 words represented in the DSO corpus, a group of 21 words which frequently
appear in the WSD literature has been selected to perform the comparative experiment.
These words are 13 nouns (age, art, body, car, child, cost, head, interest, line, point, state,
thing, work ) and 8 verbs (become, fall, grow, lose, set, speak, strike, tell) and they have
been treated as independent classification problems. The number of examples per word
ranges from 202 to 1,482 with an average of 801.1 examples per word (840.6 for nouns and
737.0 for verbs). The level of ambiguity is quite high in this corpus. The number of senses
is between 3 and 25, with an average of 10.1 senses per word (8.9 for nouns and 12.1 for
verbs).
This comparison uses SetB as the feature set (see section 3.2.2). The different methods tested translate this information into features in different ways. The AdaBoost and SVM algorithms require binary features. Therefore, local context attributes have to be binarised in a preprocessing step, while the topical context attributes remain as binary tests about the presence or absence of a concrete word in the sentence. As a result of this binarisation, the number of features is expanded to several thousands (from 1,764 to 9,900 depending on the particular word).
The binary representation of features is not appropriate for Naive Bayes and kNN
algorithms. Therefore, the 15 local-context attributes are considered straightforwardly.
Regarding the binary topical-context attributes, the variants described in section 3.3 are
considered. For kNN, the topical information is codified as a single set-valued attribute
(containing all words appearing in the sentence) and the calculation of closeness is modified
so as to handle this type of attributes. For Naive Bayes, the topical context is conserved
as binary features, but when classifying new examples only the information of words
appearing in the example (positive information) is taken into account.
Regarding kNN, we do not use MVDM here for simplicity reasons. Its implementation has a very serious computational overhead, and the difference in performance is reduced when the Hamming distance is used jointly with feature and example weighting, as we do in this setting (see section 3.3).
We have performed a 10-fold cross-validation experiment in order to estimate the
performance of the systems. The accuracy figures reported below are (micro) averaged
over the results of the 10 folds and over the results on each of the 21 words. We have applied
a statistical test of significance, which is a paired Student’s t-test with a confidence value
of t9,0.995 =3.250 (see appendix A for all the formulae about evaluation). When classifying
test examples, all methods are forced to output a unique sense, resolving potential ties
between senses by choosing the most frequent sense among all those tied.
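For reference, the paired test reduces to the following computation over the per-fold accuracies of two systems; this is an illustrative transcription of the standard formula, not the evaluation code used in the experiments.

from math import sqrt

def paired_t_statistic(acc_a, acc_b):
    """Paired Student's t statistic over per-fold accuracies of two systems.
    With 10 folds, |t| > 3.250 (t_{9,0.995}) indicates a significant
    difference at the confidence level used in the text."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / sqrt(var / n)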
3.5.2 Discussion
Table 3.8 presents the results (accuracy and standard deviation) of all methods in the
reference corpus. MFC stands for a Most Frequent Sense classifier. Averaged results are
presented for nouns, verbs, and overall and the best results are printed in boldface.
All methods significantly outperform the baseline Most Frequent Classifier, obtaining
accuracy gains between 15 and 20.5 points. The best performing methods are SVM and
AdaBoost (SVM achieves a slightly better accuracy but this difference is not statistically
significant). On the other extreme, Naive Bayes and Decision Lists are the methods with
the lowest accuracy with no significant differences among them. The kNN method is in
the middle of the previous two groups. That is, according to the paired t-test, the partial order between methods is the following:

SVM ≈ AB > kNN > NB ≈ DL
where ‘≈’ means that the accuracies of both methods are not significantly different, and
‘>’ means that the left method accuracy is significantly better than the right one.
The low results of DL seem to contradict some previous research, in which very good
results were obtained with this method. One possible reason for this failure is the simple
smoothing method applied. Yarowsky (1995a) showed that smoothing techniques can help
to obtain good estimates for different feature types, which is crucial for methods like DL.
These techniques are applied to different learning methods in (Agirre and Martı́nez, 2004a),
and they report an improvement of 4% in recall over the simple smoothing. Another reason
for the low performance is that when DL is forced to make decisions with few data points
it does not make reliable predictions. Rather than trying to force 100% coverage, the
DL paradigm seems to be more appropriate for obtaining high precision estimates. In
(Martı́nez et al., 2002) DL are shown to have a very high precision for low coverage,
achieving 94.90% accuracy at 9.66% coverage, and 92.79% accuracy at 20.44% coverage.
These experiments were performed on the Senseval-2 datasets.
Regarding standard deviation between experiments in the cross-validation framework,
SVM has shown the most stable behaviour, achieving the lowest values.
Table 3.9: Overall accuracy of AB and SVM classifiers by groups of words of increasing
average number of examples per sense
In this corpus subset, the average accuracy values achieved for nouns and verbs are very close, with the baseline MFC results being almost identical (46.59% for nouns and 46.49% for verbs). This is quite different from the results reported in many papers that take into account the whole set of 191 words of the DSO corpus. For instance, differences of between 3 and 4 points in favour of nouns can be observed in the previous section. This is due to the singularities of the subset of 13 nouns studied here, which are particularly difficult. Note that in the whole DSO corpus the MFC over nouns (56.4%) is considerably higher than in this subset (46.6%), and that an AdaBoost-based system is able to achieve 70.8% on nouns in the previous section compared to 66.0% in this section. Also, the average number of
senses per noun is higher than in the entire corpus. Despite this fact, a difference between
two groups of methods can be observed regarding the accuracy on nouns and verbs. On
the one hand, the worst performing methods (NB and DLs) do better on nouns than on
verbs. On the other hand, the best performing methods (kNN, AB, and SVM) are able to
better learn the behaviour of verb examples, achieving an accuracy value around 1 point
higher than for nouns.
Some researchers, Schapire (2002) for instance, argue that the AdaBoost algorithm may
perform very poorly when training using small samples. In order to verify this statement
in the present study we calculated the accuracy obtained by AdaBoost in several groups
of words sorted by increasing size of training set. The size of a training set is taken as the
ratio between the number of examples and the number of senses of that word, that is, the
average number of examples per sense. Table 3.9 shows the results obtained, including
a comparison with the SVM method. As expected, the accuracy of SVM is significantly superior to that of AdaBoost for small training sets (up to 60 examples per sense). On the contrary, AdaBoost outperforms SVM on the large training sets (over 120 examples
per sense). Recall that the overall accuracy is comparable in both classifiers (see table
3.8).
In absolute terms, the overall results of all methods can be considered quite low
(61-67%). We do not claim that these results can not be improved by using richer feature
representations, by a more accurate tuning of the systems, or by the addition of more
training examples. Additionally, it is known that the DSO words included in this study are
among the most polysemous English words and that WordNet is a very fine-grained sense
Table 3.10: Kappa statistic (below diagonal) and percentage of agreement (above diagonal)
between all pairs of systems on the DSO corpus
repository. Supposing that we had enough training examples for every ambiguous word in
the language, it is reasonable to think that a much more accurate all-words system could
be constructed based on the current supervised technology. However, this assumption is
totally unrealistic and, actually, the best current supervised systems for English all-words
disambiguation achieve figures around 65% (see Senseval-3 results). Our opinion is that
state-of-the-art supervised systems still have to be qualitatively improved in order to be
really practical.
Apart from accuracy figures, the observation of the predictions made by the classifiers
provides interesting information about the comparison between methods. Table 3.10
presents the percentage of agreement and the Kappa statistic between all pairs of systems
on the test sets. ‘DSO’ stands for the annotation of the DSO corpus, which is taken as
the correct annotation. Therefore, the agreement rates with respect to DSO contain the
accuracy results previously reported. The Kappa statistic (Cohen, 1960) is a measure
of inter-annotator agreement, which reduces the effect of chance agreement, and which
has been used for measuring inter-annotator agreement during the construction of some
semantically annotated corpora (Véronis, 1998; Ng et al., 1999). A Kappa value of
1 indicates perfect agreement, values around 0.8 are considered to indicate very good
agreement (Carletta, 1996), and negative values are interpreted as systematic disagreement
on non-frequent cases.
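As an illustration of the statistic, a direct transcription of Cohen's definition over two lists of predictions could look as follows (names are ours; it assumes both lists are aligned with the same test examples and that chance agreement is below 1):

from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two labelings of the same examples:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from the marginals."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[s] * freq_b[s] for s in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)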
NB obtains the most similar results with regard to MFC in agreement rate and Kappa value. An agreement of 73.9% means that almost 3 out of 4 times it predicts the most frequent sense (which is correct in less than half of the cases). SVM and AB obtain the most similar results with regard to DSO (the manual annotation) in agreement rate and Kappa values, and they have the least similar Kappa and agreement values with regard to
MFC. This indicates that SVM and AB are the methods that best learn the behaviour of
the DSO examples. It is also interesting to note that the three highest values of kappa
(0.65, 0.61, and 0.50) are between the top performing methods (SVM, AB, and kNN), and that, although NB and DLs achieve a very similar accuracy on the corpus, their predictions are quite different, since the Kappa value between them is one of the lowest (0.39).
The Kappa values between the methods and the DSO annotation are very low. But,
as suggested in (Véronis, 1998), evaluation measures should be computed relative to
the agreement between the human annotators of the corpus and not to a theoretical 100%.
It seems pointless to expect more agreement between the system and the reference corpus
than between the annotators themselves. Contrary to the intuition that the agreement
between human annotators could be very high in the Word Sense Disambiguation task,
some papers report surprisingly low figures. For instance, Ng et al. (1999) report an
accuracy rate of 56.7% and a Kappa value of 0.317 when comparing the annotation of
a subset of the DSO corpus performed by two independent research groups. Similarly,
Véronis (1998) reports values of Kappa near to zero when annotating some special words
for the ROMANSEVAL17 corpus.
From this point of view, the Kappa value of 0.44 achieved by SVM and AB could be considered a fairly high result. Unfortunately, the subset of the DSO corpus treated in this work does not coincide with that in (Ng et al., 1999) and, therefore, a direct comparison
is not possible.
3.6 Conclusions
In section 3.5, a comparison between five of the state-of-the-art algorithms used for the Word Sense Disambiguation task has been performed. The systems' performance has been compared in terms of accuracy, but other measures, namely agreement rates and kappa values between each pair of methods, have been used to compare the systems' outputs. With all these data, a deeper study of the behaviour of the algorithms is possible. Among the five algorithms compared, the best performing ones (in terms of accuracy and behaviour on all the senses to be learned) are AdaBoost.MH and Support Vector Machines: SVM for training sets with a small number of examples per sense (up to 60) and AdaBoost for larger sets (over 120). Since no large sense-tagged corpus is available at present, probably the best system would be a combination of both, training each word with one of the two algorithms depending on its number of examples per sense.
17 http://www.lpl.univ-aix.fr/projectes/romanseval
Chapter 4
Domain Dependence
This chapter describes a set of experiments carried out to explore the domain dependence
of alternative supervised Word Sense Disambiguation algorithms. The aims of the work are: to study the performance of these algorithms when tested on a corpus different from the one they were trained on; and to explore their ability to be tuned to new domains. As we will see,
the domain dependence of Word Sense Disambiguation systems seems very strong and
suggests that some kind of adaptation or tuning is required for cross-corpus applications.
4.1 Introduction
Supervised learning assumes that the training examples are somehow reflective of the task
that will be performed by the trainee on other data. Consequently, the performance of
such systems is commonly estimated by testing the algorithm on a separate part of the
set of training examples (say 10-20% of them), or by N -fold cross-validation, in which
the set of examples is partitioned into N disjoint sets (or folds), and the training-test
procedure is repeated N times using all combinations of N − 1 folds for training and 1
fold for testing. In both cases, test examples are different from those used for training,
but they belong to the same corpus, and, therefore, they are expected to be quite similar.
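As a concrete illustration of the N-fold protocol just described, a minimal sketch (assuming the examples are given as a plain list) is:

def n_fold_splits(examples, n=10):
    """Partition the examples into n disjoint folds and yield the
    (training, test) pairs used in N-fold cross-validation."""
    folds = [examples[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [e for j, f in enumerate(folds) if j != i for e in f]
        yield train, test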
Although this methodology could be valid for certain NLP problems, such as English Part-of-Speech tagging, we think that there exists reasonable evidence to say that, in WSD, accuracy results cannot be simply extrapolated to other domains, contrary to the opinion of other authors (Ng, 1997b). On the one hand, WSD is very dependent on the domain of application (Gale et al., 1992a) – see also Ng and Lee (1996) and Ng (1997a), in which quite different accuracy figures are obtained when testing an exemplar-based WSD classifier on two different corpora. On the other hand, it does not seem reasonable to
think that the training material can be large and representative enough to cover “all” potential types of examples. We think that a study of the domain dependence of WSD – in the style of other studies devoted to parsing (Sekine, 1997) – is needed to assess the validity of the supervised approach, and to determine to what extent a tuning process is necessary to make real WSD systems portable to new corpora. In order to corroborate the previous hypotheses, this chapter explores the portability and tuning of four different ML algorithms (previously applied to WSD) by training and testing them on different corpora.
Additionally, supervised methods suffer from the “knowledge acquisition bottleneck” Gale
et al. (1992b). Ng (1997b) estimates that the manual annotation effort necessary to build
a broad coverage semantically annotated English corpus is about 16 person-years. This
overhead for supervision could be much greater if a costly tuning procedure is required
before applying any existing system to each new domain.
Due to this fact, recent works have focused on reducing the acquisition cost as well as the need for supervision in corpus-based methods. It is our belief that the research by Leacock et al. (1998) and Mihalcea and Moldovan (1999b)1 provides enough evidence towards the “opening” of the bottleneck in the near future. For that reason, it is worth further investigating the robustness and portability of existing supervised ML methods to better
resolve the WSD problem. It is important to note that the focus of this work will be on the
empirical cross-corpus evaluation of several ML supervised algorithms. Other important
issues, such as: selecting the best attribute set, discussing an appropriate definition of
senses for the task, etc., are not addressed in this chapter.
4.2 Setting
The set of comparative experiments has been carried out on a subset of 21 words of the
DSO corpus. Each word is treated as a different classification problem. They are 13 nouns
(age, art, body, car, child, cost, head, interest, line, point, state, thing, work ) and 8 verbs
(become, fall, grow, lose, set, speak, strike, tell )2 . The average number of senses per word
is close to 10 and the number of training examples is close to 1,000.
The DSO corpus contains sentences from two different corpora. The first is Wall Street
Journal (WSJ), which belongs to the financial domain, and the second is the Brown Corpus (BC), which is a general corpus of English usage. Therefore, it is easy to perform experiments
about the portability of alternative systems by training them on the WSJ part (A part,
hereinafter) and testing them on the BC part (B part, hereinafter), or vice-versa.
1 In the line of using lexical resources and search engines to automatically collect training examples from large text collections or the Web.
2 This set of words is the same used in the final comparison of the previous chapter (section 3.5).
4.3 First Experiment: Across Corpora evaluation
Two kinds of information are used to train classifiers: local and topical context. The
former consists of the words and part-of-speech tags appearing in a window of ± 3 items
around the target word, and collocations of up to three consecutive words in the same
window. The latter consists of the unordered set of content words appearing in the whole
sentence3 .
The four methods tested (Naive Bayes, Exemplar-Based, Decision Lists, and
LazyBoosting) translate this information into features in different ways. LazyBoosting
algorithm requires binary features. Therefore, local context attributes have to be binarised
in a preprocess, while the topical context attributes remain as binary tests about the
presence/absence of a concrete word in the sentence. As a result the number of attributes
is expanded to several thousands (from 1,764 to 9,900 depending on the particular
word). The binary representation of attributes is not appropriate for Naive Bayes
and Exemplar-Based algorithms. Therefore, the 15 local-context attributes are taken
straightforwardly. Regarding the binary topical-context attributes, we have used the
variants described in chapter 3. For Exemplar-Based, the topical information is codified
as a single set-valued attribute (containing all words appearing in the sentence) and the
calculation of the closeness measure is modified so as to handle this type of attribute. For
Naive Bayes, the topical context is conserved as binary features, but when classifying new
examples only the information of words appearing in the example (positive information)
is taken into account.
The comparison of algorithms has been performed in series of controlled experiments using
exactly the same training and test sets. There are 7 combinations of training-test sets
called: A+B→A+B, A+B→A, A+B→B, A→A, B→B, A→B, and B→A, respectively. In
this notation, the training set is placed at the left hand side of symbol “→”, while the test
set is at the right hand side. For instance, A→B means that the training set is corpus A and
the test set is corpus B. The symbol “+” stands for set union, therefore A+B→B means
that the training set is A union B and the test set is B. When comparing the performance
of two algorithms, two different statistical tests of significance have been applied depending
on the case. A→B and B→A combinations represent a single training-test experiment.
In these cases, McNemar's test of significance is used (with a confidence value of χ²1,0.95 = 3.842), which is proven to be more robust than a simple test for the difference of
two proportions. In the other combinations, a 10-fold cross-validation was performed in
3 This set of features is the SetB defined in section 3.2.2.
order to prevent testing on the same material used for training. The associated statistical test of significance is a paired Student's t-test with a confidence value of t9,0.975 = 2.262
(see appendix A for evaluation formulae).
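For reference, McNemar's statistic can be computed directly from the two classifiers' predictions on the shared test set; the sketch below uses the continuity-corrected form, which is one common variant, and is illustrative rather than the exact formulation used in appendix A.

def mcnemar_statistic(gold, pred_a, pred_b):
    """McNemar's statistic for two classifiers evaluated on the same test set.
    b = examples that A gets right and B gets wrong; c = the reverse.
    Values above 3.842 (chi-square, 1 d.o.f., 95%) are taken as significant."""
    b = sum(pa == g and pb != g for g, pa, pb in zip(gold, pred_a, pred_b))
    c = sum(pa != g and pb == g for g, pa, pb in zip(gold, pred_a, pred_b))
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)   # with continuity correction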
The four algorithms, jointly with a naive Most-Frequent-sense Classifier (MFC), have
been tested on 7 different combinations of training-test sets. Accuracy figures, averaged
over the 21 words, are reported in table 4.1. The comparison leads to the following
conclusions:
MFC NB EB DL LB
A+B→A+B 46.55±0.71 61.55±1.04 63.01±0.93 61.58±0.98 66.32±1.34
A+B→A 53.90±2.01 67.25±1.07 69.08±1.66 67.64±0.94 71.79±1.51
A+B→B 39.21±1.90 55.85±1.81 56.97±1.22 55.53±1.85 60.85±1.81
A→A 55.94±1.10 65.86±1.11 68.98±1.06 67.57±1.44 71.26±1.15
B→B 45.52±1.27 56.80±1.12 57.36±1.68 56.56±1.59 58.96±1.86
A→B 36.40 41.38 45.32 43.01 47.10
B→A 38.71 47.66 51.13 48.83 51.99∗
Table 4.1: Accuracy results (± standard deviation) of the methods on all training-test
combinations
• Regarding the portability of the systems, very disappointing results are obtained.
Restricting to LazyBoosting results, we observe that the accuracy obtained in A→B is
47.1% while the accuracy in B→B (which can be considered an upper bound for
LazyBoosting in B corpus) is 59.0%, that is, a drop of 12 points. Furthermore,
47.1% is only slightly better than the most frequent sense in corpus B, 45.5%. The
comparison in the reverse direction is even worse: a drop from 71.3% (A→A) to
52.0% (B→A), which is lower than the most frequent sense of corpus A, 55.9%.
76
4.4 Second Experiment: tuning to new domains
The previous experiment shows that classifiers trained on the A corpus do not work well
on the B corpus, and vice-versa. Therefore, it seems that some kind of tuning process
is necessary to adapt supervised systems to each new domain. This experiment explores
the effect of a simple tuning process consisting of adding to the original training set a
relatively small sample of manually sense tagged examples of the new domain. The size
of this supervised portion varies from 10% to 50% of the available corpus in steps of 10%
(the remaining 50% is kept for testing). This set of experiments will be referred to as
A+%B→B , or conversely, B+%A→A.
In order to determine to which extent the original training set contributes to accurately
disambiguate in the new domain, we also calculate the results for %A→A (and %B→B ),
that is, using only the tuning corpus for training.
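A minimal sketch of how the A+%B→B training sets could be assembled is given below; it assumes a fixed, deterministic split of corpus B for illustration, whereas the actual experiments rely on the cross-validation folds already described.

def tuning_splits(corpus_a, corpus_b, step=0.10, held_out=0.50):
    """Yield the (training set, test set) pairs of the tuning experiment:
    corpus A plus an increasing fraction of corpus B (10%, 20%, ... 50%),
    with the remaining 50% of B always kept for testing."""
    cut = int(len(corpus_b) * held_out)
    tune_pool, test = corpus_b[:cut], corpus_b[cut:]
    steps = int(round(held_out / step))
    for i in range(1, steps + 1):
        k = int(len(corpus_b) * step * i)
        yield corpus_a + tune_pool[:k], test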
Figure 4.1 graphically presents the results obtained by all methods. Each plot contains
X+%Y →Y and %Y →Y curves, and the straight lines corresponding to the lower bound
MFC, and to the upper bounds Y →Y and X+Y →Y .
As expected, the accuracy of all methods grows (towards the upper bound) as more
tuning corpus is added to the training set. However, the relation between X+%Y →Y and
%Y →Y reveals some interesting facts. In plots 1b, 2a, and 3b the contribution of the
original training corpus is almost negligible. Furthermore, in plots 1a, 2b, and 3a a
degradation on the accuracy performance is observed. Summarising, these six plots show
that for Naive Bayes, Exemplar-Based and Decision Lists methods it is not worth keeping
the original training examples. Instead, a better (but disappointing) strategy would be
simply using the tuning corpus.
However, this is not the situation of LazyBoosting (plots 4a and 4b), for which
a moderate (but consistent) improvement of accuracy is observed when retaining the
original training set. Therefore, LazyBoosting again shows a better behaviour than its competitors when moving from one domain to another.
The bad results about portability may be explained by, at least, two reasons: 1) Corpus
A and B have a very different distribution of senses, and, therefore, different a-priori
biases; 2) Examples of corpus A and B contain different information, and, therefore, the
learning algorithms acquire different (and non interchangeable) classification cues from
both corpora.
Figure 4.1: Accuracy rate (%) as a function of the percentage of tuning examples (0-50%). Panels: (1a) NB on A, (1b) NB on B, (2a) EB on A, (2b) EB on B, (3a) DL on A, (3b) DL on B, (4a) LB on A, (4b) LB on B; each panel shows the MFS lower bound, the Y→Y and X+Y→Y upper bounds, and the X+%Y→Y and %Y→Y curves.
4.5 Third Experiment: training data quality
The first hypothesis is confirmed by observing the bar plots of figure 4.2, which contain
the distribution of the four most frequent senses of some sample words in the corpora A and
B, respectively. In order to check the second hypothesis, two new sense-balanced corpora
have been generated from the DSO corpus, by balancing the number of examples of each sense between the A and B parts. In this way, the first difficulty is artificially overridden and the algorithms should be portable if the examples of both parts are quite similar. Table
4.2 shows the results obtained by LazyBoosting on these new corpora.
Figure 4.2: Distribution of the four most frequent senses for two nouns (head, interest)
and two verbs (fall, grow). Black bars = A corpus; Grey bars = B corpus
The observation of the rules acquired by LazyBoosting could also help to improve data quality. It is known that mislabelled examples resulting from annotation errors tend to be hard examples to classify correctly and, therefore, tend to have large weights in the final distribution. Abney et al. (1999) applied this idea from AdaBoost to PoS tagging, improving the data quality. This observation allows us both to identify the noisy examples and to use LB as a way to improve the training corpus. An experiment has been carried
             MFC           LB
A+B→A+B      48.55±1.16    64.35±1.16
A+B→A        48.64±1.04    66.20±2.12
A+B→B        48.46±1.21    62.50±1.47
A→A          48.62±1.09    65.22±1.50
B→B          48.46±1.21    61.74±1.18
A→B          48.70         56.12
B→A          48.70         58.05
Table 4.2: Accuracy results (± standard deviation) of MFC and LazyBoosting on the sense-balanced corpora
out in this direction by studying the rules acquired by LB from the training examples of the word state. The manual inspection of the 50 highest scored rules allowed us to identify a high number of noisy training examples – there were 11 tagging errors – and, additionally, 17 examples that were not coherently tagged, probably due to too fine-grained or unclear distinctions between the senses involved in these examples. Thus, 28 of the 50 best scored examples presented some kind of problem.
See, for instance, the two rules presented in table 4.3, acquired from corpus A and corpus B respectively, which account for the collocation state court. For each rule and each sense there are two real numbers, which are the rule outputs when the feature holds in the example and when it does not, respectively. The sign of each real number is interpreted as a prediction as to whether the sense should or should not be assigned to the example, and its magnitude as a measure of confidence in that prediction. Therefore, according to the first rule, the presence of the collocation state court gives positive evidence to sense number 3 and negative evidence to the rest of the senses. On the contrary, according to the second rule the presence of the same collocation contributes positively to sense number 5 and negatively to all the rest (including sense number 3).
                                 senses
source           0        1        2        3        4        5
rule 1 (A)    -0.5064  -1.0970  -0.6992   1.2580  -1.0125  -0.4721
               0.0035   0.0118   0.0066  -0.0421   0.0129   0.0070
rule 2 (B)    -0.0696  -0.2988  -0.1476  -1.1580  -0.5095   1.2326
              -0.0213   0.0019  -0.0037   0.0102   0.0027  -0.0094
Table 4.3: Weak rules acquired from corpus A and corpus B for the collocation state court (first row of each rule: feature present; second row: feature absent)
It has to be noted that these rules are completely coherent with the senses assigned
to the examples containing state court in both corpora and, therefore, the contradiction comes from the different information that characterises the examples of the two corpora.
4.6 Conclusions
This work has pointed out some difficulties regarding the portability of supervised WSD systems, a very important issue that has received little attention up to the present. According to our experiments, it seems that the performance of supervised sense taggers is not guaranteed when moving from one domain to another (e.g. from a balanced corpus, such as BC, to an economic domain, such as WSJ). These results imply that some kind of adaptation is required for cross-corpus application. Furthermore, these results are in contradiction with the idea of “robust broad-coverage WSD” introduced by Ng (1997b), in which a supervised system trained on a large enough corpus (say a thousand examples per word) should provide accurate disambiguation on any corpus (or, at least, significantly better than MFS). Consequently, it is our belief that a number of issues regarding portability, tuning, knowledge acquisition, etc., should be thoroughly studied before stating that the supervised ML paradigm is able to resolve a realistic WSD problem.
Regarding the ML algorithms tested, the contribution of this work consists of empirically demonstrating that the LazyBoosting algorithm outperforms the other three supervised ML methods for WSD. Furthermore, this algorithm is proven to have better properties when it is applied to new domains.
Chapter 5
Bootstrapping
As we have seen in previous chapters, supervised systems are one of the most promising
approaches to Word Sense Disambiguation. The most important drawback with them is
the need for a large quantity of labelled examples to train the classifiers. One of the most
important open lines of research in the field is the use of bootstrapping techniques to take
advantage of the unlabelled examples in learning classifiers. It is usually easy to obtain
unlabelled examples from large resources, such as the Web or large text collections. The
main aim of this chapter is to explore the usefulness of unlabelled examples in Supervised
Word Sense Disambiguation systems. We explore in this chapter two approaches of those
described in section 2.3.1: the transductive learning for SVMs (Joachims, 1999) and the
bootstrapping algorithm proposed by Abney (2002, 2004).
5.1 Transductive SVMs
The Transductive approach (Vapnik, 1998) consists of training a system with labelled and unlabelled examples at the same time. Specifically, the transductive approach for Support Vector Machines begins by learning a usual SVM from the labelled examples and ends by adjusting the separating hyperplane by means of an iterative process that guesses the class of the unlabelled examples. The proportion of examples assigned to each class is maintained throughout the iterations.
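The following rough sketch only illustrates this idea of guessing the classes of the unlabelled examples while preserving the labelled class proportion; it is a simplified self-labelling loop built on scikit-learn's LinearSVC, not the label-switching optimisation actually implemented in SVMlight, and all names are ours.

import numpy as np
from sklearn.svm import LinearSVC

def transductive_sketch(X_lab, y_lab, X_unlab, iterations=5, C=0.1):
    """Fit an inductive linear SVM on the labelled data, then repeatedly
    guess binary labels (0/1) for the unlabelled examples, keeping the
    labelled class proportion, and refit on the union (simplified sketch)."""
    y_lab = np.asarray(y_lab)
    n_pos = int(round(np.mean(y_lab == 1) * len(X_unlab)))
    clf = LinearSVC(C=C).fit(X_lab, y_lab)
    for _ in range(iterations):
        scores = clf.decision_function(X_unlab)
        y_guess = np.zeros(len(X_unlab), dtype=int)
        y_guess[np.argsort(-scores)[:n_pos]] = 1   # top-scoring examples -> class 1
        clf = LinearSVC(C=C).fit(np.vstack([X_lab, X_unlab]),
                                 np.concatenate([y_lab, y_guess]))
    return clf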
Joachims (1999) suggested the utility of the transductive approach for Text
Categorisation. The experiments show the viability of training with a small set of labelled
examples and a large set of “cheap” unlabelled examples.
Word Sense Disambiguation and Text Categorisation share some important
characteristics: high dimensional input space, sparse example vectors and few irrelevant
features. The transductive approach seems to work well in Text Categorisation because
documents usually share words (treated as features) of the domain. That is, each domain
or category has some typical words associated with it, and the documents of the same category usually contain a subset of them. Once identified with one of the categories, the unlabelled
examples are used to generalise the classification model with other complementary features
not included in the labelled set. See details at (Joachims, 1998, 1999). This section reports
the experiment of Joachims (1999) ported to the Word Sense Disambiguation problem,
using as data the DSO corpus.
5.1.1 Setting
The experiments have been carried out using the SVMlight implementation of the Inductive (SVM) and Transductive (TSVM) Support Vector Machines. This software, developed
by Thorsten Joachims (1999), is publicly available and open source.
For this work, eight words of the DSO corpus (Ng and Lee, 1996) have been selected.
Seven of them have the highest number of examples in the DSO corpus. The other, art,
has been selected because of the availability of a large unlabelled corpus for this word.
They are 4 nouns (art, law, state, and system) and 4 verbs (become, leave, look, and think ).
Table 5.1 shows, as a reference, the number of examples, and the 10 fold cross-validation
accuracy of the naive Most Frequent Classifier and the implementation of AdaBoost.MH
described in section 3.4. The English version of the FreeLing part-of-speech tagger has been used for tagging (Atserias et al., 2006). The accuracy improvement over the MFC baseline is 21.7 points, comparable to those obtained in previous chapters.
Feature Set
Let “. . . w−3 w−2 w−1 w w+1 w+2 w+3 . . .” be the context of consecutive words around
the word w to be disambiguated, and p±i (−3≤i≤3) be the part-of-speech tag of word
w±i. The feature patterns referring to local context are the following: w, p−3, p−2, p−1, p+1, p+2, p+3, w−3, w−2, w−1, w+1, w+2 and w+3.
The features referring to topical context consist of: the unordered set of open-class
words appearing in the sentence, all bigrams of open-class words (after deleting the
84
5.1 Transductive SVMs
Table 5.1: Number of examples and accuracy results of the two baseline systems (in %)
closed-class words of the sentence), and, finally, all the word bigrams appearing in the
sentence (including open- and closed-class words).
This feature set is a small modification of SetB in section 3.2.2. Local trigrams have
been deleted and the topical bigrams added. This decision has been motivated by the experimental results of the Senseval exercises (see chapter 6).
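An illustrative extraction of these feature patterns, assuming a tokenised and PoS-tagged sentence and a set of closed-class tags (names and encodings are ours), could look as follows:

def extract_features(tokens, tags, i, closed_tags):
    """Local words/PoS tags in a +-3 window around position i, plus topical
    bags of open-class words, open-class bigrams and all word bigrams."""
    feats = {"w=" + tokens[i]}
    for off in range(-3, 4):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats.add("p%+d=%s" % (off, tags[j]))     # local PoS tags
            feats.add("w%+d=%s" % (off, tokens[j]))   # local words
    open_words = [w for w, t in zip(tokens, tags) if t not in closed_tags]
    feats.update("topic=" + w for w in open_words)                  # bag of words
    feats.update("obig=%s_%s" % b for b in zip(open_words, open_words[1:]))
    feats.update("big=%s_%s" % b for b in zip(tokens, tokens[1:]))
    return feats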
Parameter Tuning
As in the previous chapters, one of the first things to decide when using the Support
Vector Machines is the kernel. Simple linear SVMs have been used, since the use of more
complex kernels (polynomials, gaussian functions, etc.) did not lead to better results in
some exploratory experiments.
There is another parameter to estimate, the C value. This parameter measures the trade-off between training error and margin maximisation. Table 5.2 contains the C values for the SVMs and TSVMs estimated in a 10-fold cross-validation experiment (over the PA partition), by selecting the values maximising accuracy. The values ranged from 0 to 2.5 in steps of 0.05. In previous experiments we have seen that good values for this parameter on WSD data are those near 0. In the SVM case, 9 folds are considered as the training set and 1 fold as the test set each time, while in the TSVM case, 1 fold is taken as the training set and the remaining 9 as the test set, in order to better resemble the real usage scenario.
              C value
word          SVM     TSVM
art.n         0.1     1.35
law.n         0       0.05
state.n       0       2.05
system.n      0.05    1.25
become.v      0.1     0.05
leave.v       0.05    0.05
look.v        0.05    0.05
think.v       0       1.1
Table 5.2: C values estimated for SVM and TSVM for each word
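The grid search just described can be sketched as follows; evaluate is a hypothetical callable that trains and tests the (inductive or transductive) SVM for a given C value, and the fold arrangement mirrors the 9-versus-1 and 1-versus-9 settings mentioned above.

def tune_c(folds, evaluate, step=0.05, c_max=2.5, transductive=False):
    """Grid search for the C parameter over the cross-validation folds."""
    grid = [round(i * step, 2) for i in range(int(round(c_max / step)) + 1)]
    best_c, best_acc = grid[0], -1.0
    for c in grid:
        accs = []
        for i in range(len(folds)):
            rest = [x for j, f in enumerate(folds) if j != i for x in f]
            # SVM: 9 folds for training, 1 for test; TSVM: the reverse
            train, test = (folds[i], rest) if transductive else (rest, folds[i])
            accs.append(evaluate(c, train, test))
        mean_acc = sum(accs) / len(accs)
        if mean_acc > best_acc:
            best_c, best_acc = c, mean_acc
    return best_c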
SVMlight is a binary implementation of SVM. That is, it cannot directly deal with a multiclass classification problem (see section 6.4 for some binarisation examples). In order to provide a classifier for that kind of problem, a binarisation process has to be carried out. Thus, the parameter tuning could be done at the level of each binary problem. However, first results in this direction showed no significant improvements.
There are two main experiments in Joachims (1999): the first one aims to show the effect of the training set size (fixing the size of the test set), and the other one the effect of the test set size (fixing the training set to a small size).
Both experiments reported a better accuracy for the transductive approach than for the inductive one, but not for all the classes. The first experiment showed the largest differences when using very small training sets, which tend to disappear when increasing the number of training examples. The second experiment showed a logarithmic-like accuracy curve, always over the accuracy result of the inductive approach.
Main Experiment
This section is devoted to translating the experiments in (Joachims, 1999) to the Word Sense Disambiguation domain. It consists of carrying out two experiments for each problem. In this case, the PB partition of the corpus has been used. The first experiment (hereinafter VarTrain) tries to show the effect of the number of examples in the training set, fixing the test set; and the second (hereinafter VarTest) the effect of the number of examples in the test set, fixing the training set.
Figure 5.1 shows the results of these experiments for the 4 nouns considered, and figure
5.2 for the 4 verbs. Left plots of both figures show the results of the VarTrain experiment,
and right plots show the VarTest results. For better understanding, table 5.3 shows the folds used as training and test sets in both experiments. VarTrain outputs a learning curve in which the test set is always composed of folds 10 to 19 and the training set is variable: in the first iteration the training set is fold 0, in the second iteration it consists of folds 0 and 1, and so on, until folds 0 to 9 are used for training. VarTest has the same behaviour but fixes the training set to fold 0 and varies the test set from the use of only fold 1 to the use of folds 1 to 19.
VarTrain VarTest
Training Set Folds 0, 0-1, . . . , 0-9 Fold 0
Testing Set Folds 10-19 Folds 1, 1-2, . . . , 1-19
Table 5.3: Number of folds used as training and test sets in each experiment
The learning curves are very disappointing. All VarTrain plots show a worse performance for TSVM than for SVM; only in the case of the verb become is the accuracy comparable. The VarTest plots show a very irregular behaviour, which in many cases tends to decrease the accuracy when using TSVM. There is only one case (the verb become) in which the results are similar to those observed in the Text Categorisation problem by Joachims.
Second Experiment
A second experiment has been conducted in order to test the integrity of the transductive models. It consists of running SVM and TSVM under the same conditions and with a small number of examples in the test set (9 folds for training and 1 fold for test), in order to minimise the effect of the transductive examples.
The results of this experiment are shown in table 5.4. It can be seen that, when using the transductive approach, the accuracy of the system decreases for all words. The parameters have been tuned on the inductive approach, but this is not a plausible reason for the comparatively low results of TSVM: in the experiments of Joachims (1999) the accuracy of TSVM is always higher than that of SVM. This fact is evidence of the differences between the Text Categorisation and Word Sense Disambiguation problems.
Figure 5.1: Accuracy (%) of SVM and TSVM for the four nouns (art.n, law.n, state.n, system.n) in the VarTrain experiment (left, against the number of training folds) and the VarTest experiment (right, against the number of transductive folds).
Figure 5.2: Accuracy (%) of SVM and TSVM for the four verbs (become.v, leave.v, look.v, think.v) in the VarTrain experiment (left) and the VarTest experiment (right).
Table 5.4: Accuracy figures of SVM and TSVM compared under the same conditions (in
%)
In the representation of examples in Word Sense Disambiguation there are different kinds of features, each with a small number of values in a sentence. In Text Categorisation there is only one kind of feature (the words in the document) with lots of values; that is, there are many more words in Text Categorisation than in Word Sense Disambiguation, and the probability of word intersection between documents is far higher in Text Categorisation.
Third Experiment
This experiment is devoted to showing the time needed by both the transductive and the inductive approaches. Table 5.5 shows the time cost of running, under the same conditions, the learning and test processes for the word art (the one with the lowest number of examples).
Processes were run on a Pentium III with 128Mb of RAM under Linux.
method time
SVMs (9 → 1) 0m14.91s
TSVMs (9 → 1) 1m14.14s
TSVMs (1 → 9) 17m58.76s
Table 5.5: Processing time of SVM and TSVM for the word art
The inductive SVM algorithm performed a single learning (using 9 folds) and testing
(using 1 fold) experiment in less than 15 seconds, while TSVM performed the same
experiment in 1 minute and 15 seconds (5 times more). When learning TSVM on 1 fold
and testing over 9 folds the time requirements grow up to 18 minutes. As expected, the
TSVM is very time consuming when a large amount of unlabelled examples are included
in the training process.
The high cost of TSVM is due to the need for retraining at every classification of an unlabelled example. This cost could be alleviated by an incremental (on-line) implementation of the SVM paradigm, which is currently not available in the SVMlight software.
5.2 Greedy Agreement Bootstrapping Algorithm
This section describes an experiment carried out to explore the bootstrapping algorithm described in (Abney, 2002, 2004) applied to Word Sense Disambiguation (see section 2.3.1 for a discussion of the different bootstrapping algorithms). It consists of learning two views (independent classifiers), called F and G respectively, which are assumed to be conditionally independent given the target label. In practice, both views F and G are classifiers trained with the same ML algorithm but with different feature sets. The bootstrapping procedure is an iterative process (figure 5.3 shows its pseudo-code) that alternately learns an atomic rule for one of the two views at each iteration. The algorithm needs as input a set of seeds, that is, an initial set of atomic rules with very high precision but low coverage. The result of the learning process consists of two classifiers, one for each view, which may be combined to classify new instances.
procedure Greedy Agreement
  Input: seed rules F, G
  loop
    foreach atomic rule H
      G' := G + H
      evaluate cost of (F, G')
    keep lowest-cost G'
    if G' is worse than G, quit
    swap F, G'
  return F and G
end Greedy Agreement

Figure 5.3: Pseudo-code of the Greedy Agreement algorithm
Classifiers F and G induced from both views can abstain when predicting: F(x) = ⊥ or G(x) = ⊥. The cost function needed by the algorithm is defined in (Abney, 2002) as:

cost(F, G) = (cost(F) + cost(G)) / 2

where

cost(F) = δ / (µ − δ) · lab(F) + unlab(F)

where δ = Pr[F ≠ G | F, G ≠ ⊥], µ = min_u Pr[F = u | F ≠ ⊥], lab(F) is the number of predictions of classifier F, and unlab(F) is the number of times classifier F abstains. cost(G) is computed in the same way.
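A direct transcription of this cost into code, assuming each view's predictions are stored as a dictionary mapping the same example identifiers to a label or to None for abstention (names are ours), is:

def view_cost(pred_f, pred_g):
    """cost(F) = delta / (mu - delta) * lab(F) + unlab(F), with delta and mu
    estimated from the two views' predictions (dicts: example id -> label/None)."""
    both = [x for x in pred_f if pred_f[x] is not None and pred_g[x] is not None]
    delta = (sum(pred_f[x] != pred_g[x] for x in both) / len(both)) if both else 0.0
    labelled = [x for x in pred_f if pred_f[x] is not None]
    if not labelled:
        return float("inf")
    counts = {}
    for x in labelled:
        counts[pred_f[x]] = counts.get(pred_f[x], 0) + 1
    mu = min(counts.values()) / len(labelled)
    if mu <= delta:                       # degenerate case: cost is unbounded
        return float("inf")
    lab, unlab = len(labelled), len(pred_f) - len(labelled)
    return delta / (mu - delta) * lab + unlab

def agreement_cost(pred_f, pred_g):
    """cost(F, G) = (cost(F) + cost(G)) / 2"""
    return (view_cost(pred_f, pred_g) + view_cost(pred_g, pred_f)) / 2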
Each atomic rule tests the value of a single feature and emits a classification decision. The algorithm does not delete atomic rules; that is, an atomic rule can be selected more than once. This makes it possible to assign some kind of “weight” to the atomic rules (or features).
Finally, the combination of both views, in order to get a final classification, is not
a trivial issue due to the fact that both views could abstain or predict a label. We
have implemented several alternatives that are tested in the subsequent experiments. All
variants we tried are summarised in table 5.6 together with a brief explanation of each
strategy.
5.2.1 Setting
Four words of the DSO corpus have been selected. These words have the highest number
of examples for their two most frequent senses. They are two nouns (law and system)
and two verbs (become and think ). Table 5.7 shows the two most frequent senses and the
number of examples for each of these senses. The experiment has been performed taking
into account only these two most frequent senses for each word. Abney (2002) applied his algorithm to Named Entity Recognition, a task for which a very large data set was available. In WSD there is no such large annotated corpus. This is why we have selected only the two most frequent senses of the words with the most examples.
Table 5.8 shows the set of features selected for each view and a short description of
each feature. Considering the assumption of conditional independence of the views, we
have chosen three local features for view F and three topical features for view G.
A random partition of examples in 10 folds has been generated. These folds have been
numbered from 0 to 9. Tables 5.9 and 5.10 show the results (precision, recall, F1 and
coverage) calculated to serve as baselines for nouns and verbs respectively. “0 → (1 − 8)”
stands for fold 0 as training data and folds from 1 to 8 as test data. “0 → 9” stands for
key             definition
STRICT AGREE    the final classifier only predicts a label if both classifiers (F and G) have a prediction and the label is the same
RELAX AGREE     the final classifier predicts a label e if both classifiers F and G predict the same label e, or if one predicts e and the other abstains
PRIORITY X      the final classifier predicts a label e if one or both views predict e; if the two labels differ, it predicts that of view X, where X ∈ {F, G}
STRICT VOTING   the final classifier only predicts a label when both classifiers have a prediction; it then outputs the highest scoring label; otherwise it abstains
RELAX VOTING    as STRICT VOTING, but if one of the views abstains, the final classifier takes the label from the non-abstaining view
ONLY VOTING     the predicted label is the highest scoring label over both classifiers and all classes
Table 5.6: Strategies used to combine the predictions of the two views
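As an illustration, three of these strategies can be written directly from their definitions (function names are ours; abstention is represented by None). The voting strategies additionally need the views' scores, which are omitted here.

def strict_agree(f_label, g_label):
    """Predict only if both views predict and agree; otherwise abstain."""
    return f_label if f_label is not None and f_label == g_label else None

def relax_agree(f_label, g_label):
    """Predict the common label, or the only prediction if one view abstains."""
    if f_label is None:
        return g_label
    if g_label is None or f_label == g_label:
        return f_label
    return None   # both views predict but disagree: abstain

def priority(f_label, g_label, prefer="F"):
    """Predict if either view predicts; on disagreement, follow the preferred view."""
    if f_label is None:
        return g_label
    if g_label is None:
        return f_label
    return f_label if prefer == "F" else g_label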
word          senses          examples
law.n         14:00  10:00    444  361
system.n      06:00  09:02    417  141
become.v      30:00  42:00    606  532
think.v       31:01  31:00    690  429
Table 5.7: Senses and number of examples for each word selected from the DSO corpus
fold 0 as training data and fold 9 as test data. The distribution of the folds has been designed to be comparable with the experiments in the next subsection. In this experiment, fold 0 is used to generate the set of seeds, folds 1 to 8 are used as the unlabelled training set, and fold 9 as the test set. MFC, DLs, and LB stand, respectively, for the Most Frequent Classifier, Decision Lists, and LazyBoosting (see chapter 3 for a description of these algorithms). In previous chapters DLs had a coverage of 100%, whereas here it is lower: previously we backed off to the Most Frequent Sense when DL abstained or a tie was observed, whereas in this experiment ties and abstentions are not resolved, since DLs are used here to generate the set of seeds. In all three cases the training set contains only the examples of fold 0.
0 → (1 − 8)
        law.n                            system.n
        prec    rec     F1     cov       prec    rec     F1     cov
MFC     55.16   55.16   55.16  100.00    74.73   74.73   74.73  100.00
DLs     59.87   42.40   49.64  70.81     60.59   46.19   52.42  76.23
LB      59.75   59.75   59.75  100.00    62.33   62.33   62.33  100.00
0 → 9
        law.n                            system.n
        prec    rec     F1     cov       prec    rec     F1     cov
MFC     55.16   55.16   55.16  100.00    74.73   74.73   74.73  100.00
DLs     52.73   38.16   44.28  72.37     71.43   51.28   59.70  71.79
LB      64.47   64.47   64.47  100.00    71.79   71.79   71.79  100.00
Table 5.9: Baseline results (precision, recall, F1 and coverage) for the nouns law and system (in %)
The main conclusions arising from these results are the following: the results are lower for nouns than for verbs; the classifiers learn the behaviour of the verbs with only fold 0 as a training set; and, finally, the accuracies obtained by DL for the noun law and by both methods for the noun system are under the most frequent sense baseline.
0 → (1 − 8)
        become.v                         think.v
        prec    rec     F1     cov       prec    rec     F1     cov
MFC     53.25   53.25   53.25  100.00    61.66   61.66   61.66  100.00
DLs     85.17   80.81   82.93  94.87     89.54   85.14   87.28  95.08
LB      87.24   87.24   87.24  100.00    87.15   87.15   87.15  100.00
0 → 9
        become.v                         think.v
        prec    rec     F1     cov       prec    rec     F1     cov
MFC     53.25   53.25   53.25  100.00    61.66   61.66   61.66  100.00
DLs     83.50   78.90   81.13  94.50     86.36   84.07   85.20  97.35
LB      85.32   85.32   85.32  100.00    86.73   86.73   86.73  100.00
Table 5.10: Baseline results (precision, recall, F1 and coverage) for the verbs become and think (in %)
respectively. These values have been selected to evaluate the trade-off between accuracy
and coverage during the bootstrapping process.
The first line of each table (Iterations) shows the total number of iterations applied.
The bootstrapping algorithm stops when the addition of the best rule selected for the next
iteration is not able to improve the cost function. For instance, the number of iterations
for the noun system in table 5.13 (threshold 2.0) is 0 because for this threshold no further
rules were included in the bootstrapping process. For comparison purposes, in all
tables, section seed stands for the results of applying directly the seed rules learned from
fold 0 on the test folds 1-8 and fold 9; and the Abney section for the results of applying
the bootstrapping algorithm using folds 1-8 as bootstrapping examples on the same test
folds. Thus, 0 → (1 − 8) → (1 − 8) stands for extracting the seed rules from fold 0,
performing the bootstrapping process on folds (1-8) and testing also on folds (1-8); while
0 → (1 − 8) → 9 provides the tests on fold 9.
We also obtain very different performances depending on the way of combining both
views. The best results are obtained by priority F, priority G and only voting. Also, as expected, Abney's algorithm generally obtains better figures than the initial seed
algorithm. This is not the case for think.v using threshold 3.0 and using as test folds
1-8, for both verbs using threshold 2.5 in both test sets and with the verb become using
Table 5.12: Results of Abney’s algorithm with a seed threshold of 2.5 (in %)
Machine Learning algorithm (LB with an F1 of 62.3). For the verb become, the best results are achieved when applying the priority G combination rule and threshold 3.0, achieving an F1 measure of 84.9, which is 2.3 points below the best baseline for this word (LB). Finally, for the verb think, the best results are achieved when applying the priority G combination rule and threshold 2.0, achieving an F1 measure of 86.8, which is 0.4 points below the best baseline for this word (LB).
Summarising, the threshold parameter directly affects the initial coverage and precision. Obviously, the higher the threshold, the higher the initial precision and the lower the initial coverage, which, not surprisingly, has a direct relationship with the behaviour of the bootstrapping algorithm. The best results are obtained using the priority F or priority G combination rules. When performing the test on fold 9 (not seen by the bootstrapping algorithm), Abney's algorithm is able to obtain better performance than the baselines for three of the four words. However, when using folds 1-8 for testing (those used by the bootstrapping algorithm), the baselines are outperformed for only one word.
There are at least two points in which these experiments differ from those previously reported for the Named Entity Recognition task (Collins and Singer, 1999). The first one is the amount of unlabelled examples used for the bootstrapping: the experiments on Named Entity Recognition used around 90,000 unlabelled examples, two orders of magnitude more than those reported here, which consist of about 500 unlabelled examples. The second substantial difference is the high precision rate of the initial seeds: Collins and Singer report rules of almost 100% precision, while we are reporting precisions in the range of 60% to 90%.
Thus, our next steps in this research will consist of evaluating the performance of non-absolute thresholds. Instead, we plan to use relative thresholds associated with precision, recall, coverage or F1 measures. We also plan to use a larger number of unlabelled examples gathered from the British National Corpus. For instance, in the BNC there are 19,873 examples for the noun “law”, 40,478 examples for the noun “system”, 53,844 occurrences of the verb “become” and 86,564 examples for the verb “think”. After this initial set of experiments, we also plan to use a more realistic evaluation scenario: all the senses of the words (7 for law.n, 8 for system.n, 4 for become.v and 8 for think.v) and their whole set of labelled examples from the DSO corpus (around 1,500), half as a bootstrapping corpus (750 examples) and half as a test corpus (750 examples).
5.3 Conclusions and Current/Further Work
The use of unlabelled data to train classifiers for Word Sense Disambiguation is a very challenging line of research for developing a really robust, complete and accurate Word Sense Tagger (Agirre et al., 2005). Looking at the results of the experiments of the first section of this chapter, we can conclude that, at this moment, the use of unlabelled examples with the transductive approach is not useful for Word Sense Disambiguation. At this point, there is a line of research to explore: a further study of the differences and similarities between Text Categorisation and Word Sense Disambiguation models, in order to develop a Word Sense Disambiguation model “compatible” with the transductive approach.
The second section of this chapter explores the application of the Greedy Agreement algorithm of Abney (2002) to Word Sense Disambiguation. It relies on the prior use of Decision Lists to obtain the seeds; the atomic rules of Decision Lists are similar to those of Abney's algorithm. We have applied it to only 2 senses of 4 words of the DSO corpus in order to deal with a large number of examples per sense. Then we have compared the results with those of a naive Most Frequent Classifier, Decision Lists and LazyBoosting, and obtained comparable or higher results (in accuracy). This approach seems to be useful for applying bootstrapping to Word Sense Disambiguation data.
Since the bootstrapping algorithm on this setting is providing interesting but not
conclusive results, we plan to continue the study of the behaviour of this approach on a
more realistic scenario. It should consist of:
• testing the framework with all word senses (not only with 2 of them),
• increasing the feature sets of the views with those described in appendix B1 ,
• optimising the seed extraction threshold for each word by applying cross-validation,
• and finally, testing the Greedy Agreement algorithm using large sets of unlabelled
data extracted from the British National Corpus; enlarging both the training set
and the set used for the extraction of the seed rules.
1 One of the difficulties when applying the greedy agreement algorithm is the use of a limited number of features in the views; thus, the results of the views are far lower than those obtained with the complete feature set.
Chapter 6
Evaluation in Senseval Exercises
This section explains in detail the Senseval English Lexical Sample data1. The Senseval-2 corpus consists of 8,611 training and 4,328 test examples of 73 words. These examples were extracted from the British National Corpus2. Figure 6.1 shows an example from the Senseval-2 corpus. It corresponds to the instance art.40012, which has been labelled with two senses: P and arts%1:09:00::. Most of the examples have only one sense, but there are exceptions with two or three senses. The word to be disambiguated is marked by the <head> label and is always located in the last sentence of the context. The Senseval-2 sense inventory is a prerelease of WordNet 1.7. Table 6.1 shows the noun art in that inventory. For each word the corpus has some extra “sense labels”: label P for proper names and label U for unassignable. Unassignable stands for examples in which human annotators
1 http://www.senseval.org
2 http://www.natcorp.ox.ac.uk
had problems in labelling. Nouns and adjectives can also be monosemous multiwords –such as art class%1:04:00:: or art collection%1:14:00:: for the word art–, and verbs can be polysemous phrasal verbs –such as work on%2:41:00:: for the verb work. Phrasal verbs have been explicitly tagged, so they should be treated as independent words (see figure 6.2 for an example). Senseval providers referred to the sense labels of the examples as keys and to the labels provided by the participants as system votes.
Table 6.1: Sense Inventory for the noun art in Senseval-2 corpus
The Senseval-3 corpus consists of 7,860 training and 3,944 test examples of 57 words. Nouns and adjectives of the Senseval-3 data have been annotated using the WordNet 1.7.1 sense inventory, and verbs with labels based on Wordsmyth definitions3 (see appendix C for Web references to both resources). The Senseval-3 sense inventory has neither multiwords, nor phrasal verbs, nor proper nouns. That is, the senses used are only senses like those of table 6.1 and the unassignable U label. This corpus has a large number of examples with more than one sense assigned. Figure 6.3 shows an example from the Senseval-3 corpus.
6.2 TALP System at Senseval-2
This section describes the architecture and results of the TALP system presented at the Senseval-2 exercise for the English Lexical Sample task. The system was developed on the basis of LazyBoosting (Escudero et al., 2000c), the boosting-based approach for Word Sense Disambiguation explained in section 3.4. In order to better fit the Senseval-2 domain, some improvements were made, including: (1) features that take into account domain information, (2) a specific treatment of multiwords, and (3) a hierarchical decomposition of the multiclass classification problem, similar to that of Yarowsky (2000). All these issues will be briefly described in the following sections.
3 A mapping between Wordsmyth and WordNet verb senses was provided.
Three kinds of information were used to describe the examples and to train the classifiers: local context, topical context, and domain labels. These features correspond to SetA plus the topic features contained in SetB (see section 3.2.2) and the domain information detailed below.
<instance id="work.097">
<answer instance="work.097" senseid="work_on%2:41:00::"/>
<context>
How could anyone know what to do with an assortment like that ? ?
Perhaps he had better have someone help him put up the pegboard and
build the workbench - someone who knew what he was about .
Then at least he would have a place to hang his tools and something
to <head sats="work_on.097:0">work</head>
<sat id="work_on.097:0">on</sat> .
</context>
</instance>
\[ Weight(d) = \frac{ContextDomains(d)}{DomainFreqWN(d)} \cdot LabelsDomain \]
where LabelsDomain is the ratio of the total number of labelled WordNet synsets over the total number of domain labels; DomainFreqWN(d) is the frequency of the domain d on WordNet; and ContextDomains(d) is
\[ ContextDomains(d) = \sum_{w \in nouns(e)} \frac{WordDomains(w,d)}{SumWeights(w)} \]
where
\[ WordDomains(w,d) = \sum_{\{s \in Synsets(w) \mid s \in d\}} \left( 1 - \frac{Position(w,s) - 1}{|Synsets(w)|} \right) \]
and
\[ SumWeights(w) = \sum_{s \in Synsets(w)} |Domains(s)| \cdot \left( 1 - \frac{Position(w,s) - 1}{|Synsets(w)|} \right) \]
where Position(w,s) is the frequency rank in WordNet of the synset s of the word w (e.g., 2 stands for the second most frequent sense of the word); |Synsets(w)| stands for the number of synsets of the word w; and |Domains(s)| refers to the number of domains to which the synset s is connected.
This function assigns to each example a set of domain labels depending on the number of domain labels assigned to each noun and their relative frequencies in the whole WordNet. Each domain label of the example is ranked by a weight, and the list of domains is pruned by a threshold. That is, the result of this procedure is the set of domain labels that achieve a score higher than a threshold of 0.04, which are incorporated as regular features for describing the example. Figure 6.4 shows a text example and the domains extracted by applying the algorithm, with their corresponding weights4. We used this information to build a bag of domains.
4 The algorithm has been applied using not only nouns, but also verbs, adjectives and adverbs. At Senseval-2 time, domains were available for nouns only.
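A minimal sketch of this weighting scheme follows. The WordNet Domains look-ups are abstracted as plain dictionaries (word → frequency-ordered list of synsets, synset → list of domains), and the constants LabelsDomain, DomainFreqWN and the 0.04 threshold are assumed to be precomputed; this is an illustration of the formula above, not the original implementation.

# Sketch of the domain-weighting formula (illustrative; WordNet Domains
# look-ups are passed in as plain dictionaries).

def word_domains(word, domain, synsets_of, domains_of):
    """Sum of positional weights of the synsets of `word` mapped to `domain`."""
    synsets = synsets_of[word]                        # ordered by frequency
    return sum(1.0 - (pos - 1) / len(synsets)
               for pos, s in enumerate(synsets, start=1)
               if domain in domains_of[s])

def sum_weights(word, synsets_of, domains_of):
    synsets = synsets_of[word]
    return sum(len(domains_of[s]) * (1.0 - (pos - 1) / len(synsets))
               for pos, s in enumerate(synsets, start=1))

def domain_features(nouns_in_example, synsets_of, domains_of,
                    domain_freq_wn, labels_domain, threshold=0.04):
    """Return the bag of domain labels whose weight exceeds the threshold."""
    candidates = {d for w in nouns_in_example
                  for s in synsets_of[w] for d in domains_of[s]}
    features = set()
    for d in candidates:
        context = sum(word_domains(w, d, synsets_of, domains_of)
                      / sum_weights(w, synsets_of, domains_of)
                      for w in nouns_in_example)
        weight = context / domain_freq_wn[d] * labels_domain
        if weight > threshold:
            features.add(d)
    return features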
Example: “A new art fair will take place at the G-Mex Centre , Manchester from 18 to 21 March . Paintings , drawings and sculpture from every period of art during the last 350 years will be on display ranging from a Tudor portrait to contemporary British art .”

domain         weight
painting        7.07
art             6.79
time period     1.64
anthropology    1.39
sociology       1.21
quality         0.25
factotum        0.16
Full senses
Senses            Exs.
bar%1:06:04::     127
bar%1:06:00::      29
bar%1:06:05::      28
bar%1:14:00::      17
bar%1:10:00::      12
bar%1:06:06::      11
bar%1:04:00::       5
bar%1:06:02::       4
bar%1:23:00::       3
bar%1:17:00::       1

1st level
Senses      Exs.
bar%1:06    199
bar%1:14     17
bar%1:10     12

2nd level (bar%1:06)
Senses   Exs.
04::     127
00::      29
05::      28
06::      11
step in order to distinguish between its particular senses. Note that the same simplifying
rules of the previous level are also applied. For instance, the bottom-right table of figure
6.5 shows the labels for bar%1:06, where 02:: has been rejected (few examples). When
classifying a new test example, the classifiers of the two levels are applied sequentially.
That is, the semantic-file classifier is applied first. Then, depending on the semantic-file
label output by this classifier, the appropriate 2nd level classifier is selected. The resulting
label assigned to the test example is formed by the concatenation of both previous labels.
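The sequential application of the two levels can be sketched as follows. Here first_level and second_level stand for already-trained classifiers exposing a predict method (LazyBoosting classifiers in our experiments); the exact interface and the handling of semantic files that kept a single sense are illustrative assumptions.

# Sketch of the two-level sequential classification (illustrative).
# first_level: classifier over semantic-file labels (e.g. "bar%1:06").
# second_level: dict mapping each semantic-file label either to a classifier
#   over its sense suffixes (e.g. "04::") or directly to a suffix string when
#   only one sense was kept for that semantic file.

def classify(example, first_level, second_level):
    semantic_file = first_level.predict(example)       # e.g. "bar%1:06"
    refiner = second_level[semantic_file]
    if isinstance(refiner, str):                        # single remaining sense
        suffix = refiner
    else:
        suffix = refiner.predict(example)               # e.g. "04::"
    return semantic_file + ":" + suffix                 # e.g. "bar%1:06:04::"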
Despite the simplifying assumptions and the loss of information, we observed that all these changes together significantly improved the accuracy on the training set. The next section includes further evaluation of this issue.
6.2.3 Evaluation
Figure 6.6 graphically shows the official results for the fine-grained accuracy of the
Senseval-2 English Lexical Sample Task6 . Table 6.2 shows the fine- and coarse-grained
results of the 6 best systems on the same exercise. The evaluation setting corresponding
to these results contains all the modifications explained in the previous sections, including
the hierarchical approach to all words. The best system obtained 64.2% precision on
fine-grained evaluation and 71.3% precision on coarse-grained. The TALP system was the
6 This graphic has been downloaded from the official Senseval Web Page. See appendix C for web references.
sixth of 21 supervised systems, and the difference from the first was only 4.8% and 4.2% in fine-grained and coarse-grained precision, respectively.
After the Senseval-2 event, we added to the part-of-speech tagger a very simple Named-Entity Recogniser that was not finished at the time of the event, although the system continued to ignore the ‘U’ label completely. We also evaluated which parts of the system contributed most to the improvement in performance. Figure 6.7 shows the accuracy results of the four combinations resulting from using (or not) domain-label features and hierarchical decomposition. These results have been calculated over the test set of Senseval-2.
On the one hand, it becomes clear that enriching the feature set with domain labels systematically improves the results in all cases, and that this difference is especially noticeable in the case of nouns (over 3 points of improvement). On the other hand, the use of the hierarchies is unexpectedly useless in all cases. Although it is productive for some particular words (3 nouns, 12 verbs and 5 adjectives), the overall performance is significantly lower. A fact that can explain this situation is that the first-level classifiers do not succeed in classifying semantic-file labels with high precision (the average accuracy of the first-level classifiers is only slightly over 71%) and that this important error is
dramatically propagated to the second level, not allowing the greedy sequential application of classifiers. A possible explanation of this fact is the way semantic classes are defined in WordNet. Consider for instance work#1 (activity) and work#2 (production): they seem quite close, but a system trying to differentiate among semantic fields will try to distinguish between these two senses. At the other extreme, such a classifier should group house#2 (legislature) with house#4 (family), which are quite different. The combination of both situations makes the task quite difficult.
Regarding multiword preprocessing (not included in figure 6.7), we have seen that it is slightly useful in all cases. It improves the non-hierarchical scheme with domain information by almost 1 point in accuracy. By part-of-speech, the improvement is about 1 point for nouns, 0.1 for verbs and about 2 points for adjectives.
In conclusion, the best results obtained by our system on this test set correspond to the
application of multiword preprocessing and domain-labels for all words, but no hierarchical
decomposition at all, achieving a fine-grained accuracy of 61.51% and a coarse-grained
accuracy of 69.00% (from figure 6.7). We know that it is not fair to consider these results
for comparison, since the system is tuned over the test set. Instead, we only wanted to
fully inspect the TALP system and to know which parts were useful for future Word Sense
Disambiguation systems.
6.3 From Senseval-2 to Senseval-3
This section is devoted to studying the results of the best six systems of the Senseval-2 English Lexical Sample task, trying to understand the behaviour of their components (e.g., whether they use a Named-Entity Recogniser and how well it works). After this study, we show the architecture of a new system built taking into account this new information. This system outperforms all six previous systems.
                      no dom.         with dom.
                    fine   coar.     fine   coar.
nouns     no hier.  64.3   72.4      67.9   75.6
          hier.     63.0   71.1      64.3   71.5
verbs     no hier.  51.6   61.6      52.1   62.6
          hier.     50.3   60.8      51.1   62.0
adjs.     no hier.  66.2   66.2      68.9   68.9
          hier.     65.4   65.4      68.2   68.2

Figure 6.7: Fine/coarse-grained evaluation (in %) for different settings and part-of-speech
The Senseval-2 sense labels can be classified into different groups, and these groups of labels can be tackled by different modules.
Table 6.3 shows the overall results of the six best systems of Senseval-2: jhu(r) (Yarowsky et al., 2001), smuis (Mihalcea and Moldovan, 2001), kunlp (Seo et al., 2001), cs224n (Ilhan et al., 2001), sinequa (Crestan et al., 2001) and talp (Escudero et al., 2001). The columns stand, respectively, for the fine-grained accuracy, the number of correctly labelled examples in the test sample (in fine-grained evaluation), the coarse-grained accuracy and the number of examples correctly classified. There are 4,328 test examples in the corpus. All examples were treated by all systems (coverage = 100%), so precision and recall are the same. Throughout all this study we have used the scorer of Senseval-27. A difference of only 5 points in fine-grained and of 4 points in coarse-grained accuracy can be observed between the first and the last system.
Table 6.4 shows the agreement and kappa values between every pair of systems. Cells above the diagonal contain the agreement between every pair of systems and those below the diagonal contain the kappa values. The maximum difference between pairs of systems is of 8 points of agreement (from 0.60 to 0.68). The low agreement (around 65%) of all pairs shows
7 Available at the web site (see appendix C).
the large differences in the outputs of the systems. The kappa values differ more than the simple agreement (from 0.22 to 0.35); this measure shows evidence of a large difference in performance among some of the systems.
Table 6.4: Agreement (above diagonal) and kappa (below diagonal) values between every
pair of systems
Now we try to decompose the systems' outputs in order to understand their results and the source of the differences among them. First, we clustered the test examples as follows: P stands for Proper-names, mw for Multiwords, pv for Phrasal Verbs, U for Unassignable and ml for the set of examples with regular senses (e.g., sense bar%1:06:04:: of the noun bar).
Table 6.5 shows the performance of the six systems taking into account only the P label. To do that, all files (both keys and system votes) were pruned so that only P labels remained for testing. There are 146 P test examples in the corpus. This table shows, respectively, precision (pr), recall (rc), F1, the number of correctly labelled examples (ok), and the number of examples attempted (att). Fine-grained and coarse-grained results are the same, due to the fact that label P has no hierarchy. Two of these systems (kunlp and talp) had no Named-Entity Recogniser. Two of them, Sinequa and CS224N, detected proper nouns with high precision but a very poor recall (probably they tackled proper
nouns as another label of the learning problem). And finally, the two best systems had a component to preprocess proper nouns; SMUIs had the best results.
system pr rc F1 ok att
JHU(R) 54.8 34.9 42.6 51 93
SMUIs 81.6 48.6 60.9 71 87
KUNLP N/A 0 N/A 0 0
CS224N 66.7 1.4 2.7 2 3
Sinequa 88.9 5.5 10.4 8 9
TALP N/A 0 N/A 0 0
Table 6.6 shows the multiword treatment and has the same format as table 6.5. There are 170 multiword test examples in the corpus. Multiwords appeared only for nouns and adjectives, not for verbs. In this group there is again a significant difference between the two best systems and the rest: they had the best multiword processors.
system pr rc F1 ok att
JHU(R) 85.8 74.7 79.9 127 148
SMUIs 88.8 68.0 77.0 151 222
KUNLP 53.9 48.8 51.2 83 154
CS224N 78.4 34.1 47.5 58 74
Sinequa 90.3 54.7 68.1 93 103
TALP 89.3 64.1 74.6 109 122
Table 6.7 has the same format as the previous two. It shows the results on the set of phrasal verbs. There are 182 phrasal verb test examples in the corpus. Phrasal verb labels were in a hierarchy; for this reason this table contains fine-grained and coarse-grained results. Again, there are big differences: they range from 43.0% to 66.5% in precision and from 26.9% to 65.4% in recall.
Next, table 6.8 shows the behaviour on the U-labelled examples. This label has no hierarchy. Only 3 systems treated this label, and they achieved very low recall. There are 160 U test examples in the corpus.
Finally, the results on the core ml set of examples are shown in table 6.9. There are 3,733 ml test examples in the corpus. The results for nouns (1,412 examples), verbs
fine coarse
system pr rc F1 ok pr rc F1 ok att
JHU(R) 66.5 65.4 66.0 119 71.5 70.3 70.9 128 179
SMUIs 58.0 57.7 57.9 105 65.7 65.4 65.6 119 181
KUNLP 49.8 58.8 53.9 107 53.5 63.2 58.0 115 215
CS224N 43.0 26.9 33.1 49 50.9 31.9 39.2 58 114
Sinequa 59.5 42.9 49.9 78 65.6 47.3 55.0 86 131
TALP 45.5 39.0 42.0 71 55.1 47.3 50.9 86 156
system pr rc F1 ok att
JHU(R) 70.4 23.7 35.5 38 54
SMUIs N/A 0 N/A 0 0
KUNLP N/A 0 N/A 0 0
CS224N 55.8 15.0 23.6 24 43
Sinequa 61.5 10.0 17.2 16 26
TALP N/A 0 N/A 0 0
(1,630 examples) and adjectives (691 examples) are presented in tables 6.10, 6.11 and 6.12, respectively.
From these results, the first system in the competition, JHU(R), is not the best system (from the machine learning perspective) for nouns and verbs, and nouns and verbs represent more than half of the test set. The main reason for obtaining the best results lies in performing a better preprocessing of the examples and a better treatment of the U, phrasal-verb, multiword and proper-noun cases, while maintaining a competitive performance on the ml set. There are no significant overall differences on the ml set: SMUIs and KUNLP are better on verbs, and SMUIs on nouns; but JHU(R) is extremely good with adjectives.
The interaction between all the components of the systems makes it difficult to compare the Machine Learning algorithms. For this reason, we have extracted the intersection set of all the examples for which every system has output a regular sense. This set has 3,520 examples. The results are shown in table 6.13. Here, we can see that the four best systems have a very similar performance, with KUNLP slightly better than the rest.
fine coarse
system pr rc F1 ok pr rc F1 ok att
JHU(R) 60.1 62.1 61.1 2,317 68.1 70.3 69.2 2,623 3,854
SMUIs 60.3 62.0 61.1 2,315 68.5 70.4 69.4 2,629 3,838
KUNLP 59.1 62.6 60.8 2,338 66.4 70.5 68.4 2,630 3,959
CS224N 57.0 62.5 59.6 2,334 64.5 70.7 67.5 2,640 4,094
Sinequa 55.8 60.6 58.1 2,263 63.2 68.7 65.8 2,565 4,059
TALP 54.3 59.0 56.6 2,201 62.2 67.5 64.7 2,521 4,050
fine coarse
system pr rc F1 ok pr rc F1 ok att
JHU(R) 61.5 65.9 63.6 930 69.3 74.3 71.7 1,049 1,513
SMUIs 63.8 67.8 65.7 957 72.5 77.1 74.7 1,088 1,501
KUNLP 58.0 67.4 62.4 951 65.4 75.9 70.3 1072 1,639
CS224N 59.1 69.3 63.8 978 66.1 77.5 71.4 1,094 1,655
Sinequa 57.0 65.9 61.1 931 64.4 74.6 69.1 1,053 1,634
TALP 55.6 65.5 60.2 925 63.4 74.7 68.6 1,055 1,664
Table 6.10: Senseval-2 results on the nouns of the ml set of examples (in %)
Summarising, the best system did not focus only on the learning algorithm, but also had the best components for treating all the special labels. Thus, preprocessing was very important to construct a competitive WSD system at Senseval-2.
After the previous study we developed a new system (hereinafter TalpSVM), taking into account several points to improve the TALP system presented at competition time:
• First, new features have been generated to improve the performance of the system. After extracting the features, those that appear in fewer than 4 examples have been eliminated.
• Second, Support Vector Machines have been chosen as the learning algorithm, due to their better performance in the presence of small training corpora (see chapter 3). Again, we have used the SVMlight implementation of Support Vector Machines. The learner
fine coarse
system pr rc F1 ok pr rc F1 ok att
JHU(R) 54.9 54.8 54.9 894 66.4 66.3 66.4 1,081 1,627
SMUIs 55.5 55.3 55.4 902 66.8 66.6 66.7 1,085 1,625
KUNLP 58.3 56.9 57.6 927 69.0 67.4 68.2 1,098 1,591
CS224N 52.4 54.4 53.4 887 63.7 66.1 64.9 1,077 1,692
Sinequa 52.4 53.9 53.1 878 63.2 64.9 64.0 1,058 1,675
TALP 51.4 52.0 51.7 848 62.9 63.7 63.3 1,038 1,650
Table 6.11: Senseval-2 results on the verbs of the ml set of examples (in %)
fine coarse
system pr rc F1 ok pr rc F1 ok att
JHU(R) 69.0 71.3 70.1 493 69.0 71.3 70.1 493 714
SMUIs 64.0 66.0 65.0 456 64.0 66.0 65.0 456 712
KUNLP 63.1 66.6 64.8 460 63.1 66.6 64.8 460 729
CS224N 62.8 67.9 65.3 469 62.8 67.9 65.3 469 747
Sinequa 60.5 65.7 63.0 454 60.5 65.7 63.0 454 750
TALP 58.2 61.9 60.0 428 58.2 61.9 60.0 428 736
Table 6.12: Senseval-2 results on the adjectives of the ml set of examples (in %)
takes as input all the examples of the training set that are not tagged as P, U, multiword or phrasal verb, and the U label has been considered neither in the training step nor in the test step.
• Third, the addition of new examples from other corpora has been tried, to compensate for the small number of examples of the Senseval-2 corpus.
accuracy
Methods fine coarse
JHU(R) 65.5 72.9
SMUIS 65.2 72.9
KUNLP 66.0 73.4
CS224N 65.7 73.4
Sinequa 63.4 70.8
TALP 61.6 69.6
Table 6.13: Comparison of the learning component of the best performing systems at
Senseval-2 (accuracy in %)
Features
We intended to test the impact of different feature groups. The set of features was
incrementally increased by adding big groups of related features. All the features included
are binary. These feature groups are:
• forms: includes features about: the form of the target word, the part-of-speech
and the location of the words in a ±3-word-window around the target word, the
form and the location of the words in a ±3-word-window around the target word,
collocations of part-of-speech and forms in a ±3-word-window, a bag of word forms of
the text, a bag of word forms in a ±10-word-window, bigrams of the text, bigrams in
a ±10-word-window, pairs of open-class-words in the text, pairs of open-class-words
in a ±10-word-window, the form of the first noun/verb/adjective/adverb on the
left/right, and finally, the form of the first noun/verb/adjective/adverb on the
left/right in a ±10-word-window.
• lemmas: the same as forms, but taking the lemmas instead of word forms.
• domains: a bag of domains extracted with the algorithm of section 6.2.1, but using
all nouns, verbs, adjectives and adverbs instead of only nouns.
• sumo: the same type but using sumo ontology labels9 (Niles and Pease, 2001)
instead of those of Magnini and Cavaglia (2000).
9 http://www.ontologyportal.org
Table 6.14 shows the results of the system using the different sets of features. As we can see, all feature sets seem to improve the performance of the system.
New Examples
Table 6.16 shows the effect of the addition of training examples from other corpora. wne
stands for the examples in WordNet glosses; wng for the definitions in WordNet glosses;
sc stands for sentences in the Semcor corpus10 ; and se2 represents the original Senseval-2
English Lexical Sample task corpus. Table 6.15 shows the amount of examples extracted
from each of these corpora.
Surprisingly, the addition of examples from other sources implies a decrease in accuracy (see table 6.16). This could be caused by the differences in the format of the examples across the corpora.
10 See appendix C for Web references.
Preprocess
Table 6.17 shows the results of the proper-name component. Each row represents, respectively, the use of the P label, the use of the corresponding ml label, and the use of both labels for each example. The best option corresponds to ml, which shows the lack of utility of the component and evinces that a better Named-Entity Recogniser is needed.
Phrasal verbs have been tagged with the most frequent sense of the training set or, if they did not appear in the training set, with the most frequent sense of WordNet. The behaviour of the system including this component is shown in table 6.19. An increase of 2% in accuracy is observed when using the phrasal verb component.
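This heuristic amounts to a two-step most-frequent-sense fallback, as in the following small sketch; the two dictionaries of most frequent senses are assumed to be precomputed from the training set and from WordNet.

# Sketch of the phrasal-verb tagging heuristic: most frequent sense in the
# training set, falling back to the most frequent sense according to WordNet.
def tag_phrasal_verb(phrasal_verb, mfs_training, mfs_wordnet):
    if phrasal_verb in mfs_training:
        return mfs_training[phrasal_verb]
    return mfs_wordnet[phrasal_verb]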
After its development, TalpSVM has been compared with the six best systems of the Senseval-2 competition. Tables 6.20 and 6.21 contain the results of the system, to be compared with those in tables 6.3 to 6.13. P stands for the use of the P label in the test and noP for ignoring it. TalpSVM.P outperforms all six systems in both fine- and coarse-grained accuracy (table 6.2); the same applies to TalpSVM.noP. Also, it is the most similar (tables 6.4 and 6.21) to JHU(R), the winner, and to KUNLP, the best learning system. Table 6.20 (proper names) shows a high recall of TalpSVM in tagging P labels but a very low precision, which means that the proper-names component is overassigning P labels to the test examples; this component is counterproductive. The multiword component works very well (tables 6.6 and 6.20): it achieves the best precision and a very high recall. Phrasal verbs have been tagged with the most frequent sense; this component performs on a par with the average of the systems and is one of the candidates to be improved. And finally, tables 6.10 to 6.13 show that TalpSVM.noP has the best precision on nouns, on verbs, and overall. This seems to be due to the Support Vector Machines algorithm, which works well with small sets of training examples and is very robust to different kinds of features. We must remark that the final system outperforms all the systems presented at Senseval-2.
6.4 Senseval-3
This section describes the TALP system on the English Lexical Sample task of the Senseval-3 event (Mihalcea et al., 2004). The system is fully supervised and also relies on Support Vector Machines. It does not use any extra examples beyond those provided by the
fine coarse
subcorpus pr rc F1 ok pr rc F1 ok att table
global.P 64.6 – – 2,794 71.1 – – 3,078 4,328 6.3
global.noP 64.8 – – 2,806 71.7 – – 3,105 4,328 6.3
proper-names 54.3 82.9 65.6 121 – – – – 223 6.5
multiwords 92.4 85.9 89.0 146 – – – – 158 6.6
phrasal-verbs 52.5 51.1 51.8 93 61.6 59.9 60.7 109 177 6.7
unassignable N/A 0 N/A 0 – – – – 0 6.8
ml.all.P 62.2 62.8 62.5 2,346 69.5 70.2 69.9 2,622 3,770 6.9
ml.all.noP 60.3 64.5 62.3 2,408 67.3 72.0 69.6 2,689 3,993 6.9
ml.nouns.P 64.1 65.9 65.0 931 71.1 73.2 72.1 1,033 1,453 6.10
ml.nouns.noP 59.6 69.1 64.0 975 66.1 76.6 71.0 1,082 1,637 6.10
ml.verbs.P 58.6 58.6 58.6 955 69.3 69.3 69.3 1,129 1,629 6.11
ml.verbs.noP 58.6 58.6 58.6 955 69.3 69.3 69.3 1,129 1,629 6.11
ml.adjs.P 66.9 66.6 66.8 460 66.9 66.6 66.8 460 688 6.12
ml.adjs.noP 65.7 69.2 67.4 478 65.7 69.2 67.4 478 727 6.12
learner 66.8 – – – 73.6 – – – 3,520 6.13
Table 6.20: TalpSVM performance on Senseval-2 data; the comparison with the best performing systems can be made by referring to the previous tables of results [last column]
Senseval-3 organisers, though it uses external tools and ontologies to extract part of the
representation features.
Three main characteristics of the system architecture have to be highlighted. The first is the way in which the multiclass classification problem is addressed using binary SVM classifiers. Two different approaches for binarising multiclass problems have been tested: one-versus-all and constraint classification. In a cross-validation experimental setting, the best strategy has been selected at word level.
The second characteristic is the rich set of features used to represent training and test examples, extending the feature set of the previous section. Topical and local context features are used as usual, but syntactic relations and semantic features indicating the predominant semantic classes in the example context are now also taken into account.
And finally, since each word represents a learning problem with different characteristics, a per-word feature selection has been applied.
agree. kappa
jhu 0.70 0.41
smui 0.66 0.36
kunlp 0.71 0.40
cs224n 0.69 0.29
sinequa 0.67 0.29
talp 0.68 0.37
Table 6.21: Agreement and kappa values between TalpSVM and the best six systems of
Senseval-2
One of the issues when using SVM for WSD is how to binarise the multiclass classification problem. The two approaches tested in the TALP system are: (1) the usual one-versus-all scheme and (2) the recently introduced constraint-classification framework (Har-Peled et al., 2002).
the expanded training set to solve the multiclass problem. See Har-Peled et al. (2002) for
details.
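As an illustration of the first scheme, the sketch below trains one binary linear SVM per sense and labels a test example with the sense whose classifier yields the highest decision value. It uses scikit-learn's LinearSVC as a stand-in for the SVMlight implementation actually employed in this thesis (with the same C = 0.1 reported below); the constraint-classification variant is not reproduced here.

# One-versus-all binarisation sketch (LinearSVC stands in for SVMlight).
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, senses, C=0.1):
    """Train one binary linear SVM per sense label."""
    classifiers = {}
    for sense in set(senses):
        y_bin = np.array([1 if s == sense else -1 for s in senses])
        classifiers[sense] = LinearSVC(C=C).fit(X, y_bin)
    return classifiers

def predict_one_vs_all(classifiers, x):
    """Choose the sense whose binary classifier gives the largest margin."""
    x = np.asarray(x).reshape(1, -1)
    return max(classifiers,
               key=lambda sense: classifiers[sense].decision_function(x)[0])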
6.4.2 Features
The basic feature set is that detailed in section 6.3.2, with a few extensions. Local and topic features are the same. Knowledge-based features have been extended by applying the formula in section 6.2.1 to other semantic resources integrated in the Multilingual Central Repository (Mcr) (Atserias et al., 2004) of the MEANING project: Sumo labels, WordNet Lexicographic Files, and the EuroWordNet Top Ontology. The set has also been extended with features encoding the output of Yarowsky's dependency pattern extractor11.
Appendix B details part of the output of the feature extractor for an example of the Senseval-2 corpus.
Since the set of features is highly redundant, a selection process has been applied. It
is detailed in the next section.
For each binarisation approach, we performed a feature selection process consisting of two
consecutive steps:
The result of this preprocess is the selection of the best binarisation approach with its
best feature set for each word.
Regarding the implementation details of the system, we again used SVMlight (Joachims, 2002). A simple linear kernel with a regularisation parameter C of 0.1 was applied. This value was empirically decided on the basis of our previous experiments on the Senseval-2 corpus; again, previous tests using non-linear kernels did not provide better results. The selection of the best feature set and binarisation scheme per word described above has been performed using 5-fold cross-validation on the Senseval-3 training set. The five partitions of the training set were obtained maintaining, as much as possible, the initial distribution of examples per sense (i.e., using a stratified sampling scheme).
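A minimal sketch of such a stratified partition follows: the examples of each sense are dealt out in turn across the five folds, so that each fold approximately preserves the per-sense distribution. It illustrates the sampling scheme only, not the exact partitioning code that was used.

# Sketch of a stratified 5-fold partition: the examples of every sense are
# distributed round-robin over the folds, approximately preserving the
# per-sense distribution in each fold.
from collections import defaultdict

def stratified_folds(senses, n_folds=5):
    """senses: list with the sense label of each training example.
    Returns a list of folds, each a list of example indices."""
    by_sense = defaultdict(list)
    for index, sense in enumerate(senses):
        by_sense[sense].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_sense.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds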
After several experiments considering the U label as an additional regular class, we found that we obtained better results by simply ignoring it. Therefore, if a training example was tagged only with this label, it was removed from the training set; if the example was tagged with this label and others, the U label was removed from the learning example. In this way, the TALP system does not assign U labels to the test examples.
Due to lack of time, the TALP system presented at competition time did not include a complete model selection for the constraint-classification binarisation setting. More precisely, 14 words were processed within the complete model selection framework, and 43 were adjusted with only the one-vs-all approach with feature selection, but not the constraint-classification approach. After the competition was closed, we implemented the constraint-classification setting more efficiently and reprocessed the data.
6.4.4 Evaluation
Table 6.23 shows the accuracy obtained on the training set by 5-fold cross-validation. OVA(base) stands for the results of the one-versus-all approach on the starting feature set; CL(base) for constraint classification on the starting feature set; and OVA(best) and CL(best) mean one-versus-all and constraint classification with their respective feature selection. Final stands for the full architecture on the training set and SE3 for the results obtained at competition time12.
Globally, feature selection consistently improves the accuracy by 3 points (both in OVA and CL). Constraint classification is slightly better than the one-vs-all approach, but the Final (full architecture) results show that this is not true for all the words. Finally, the combined feature selection increases the accuracy by half a point.
11 This software was kindly provided by David Yarowsky's group, from Johns Hopkins University.
12 Only 14 words were processed at competition time due to lack of time.
method acc.
OVA(base) 72.38
CL(base) 72.28
OVA(best) 75.27
CL(best) 75.70
SE3 75.62
Final 76.02
Table 6.23: Results on the Senseval-3 training set using 5-fold cross-validation (accuracy in %)
Table 6.24 shows the official results of the system. SE3 means the official result13, Final means the complete architecture, and Best means the winner of the competition. When testing the complete architecture on the official test set, we obtained an accuracy decrease of more than 4 points. Nevertheless, the system obtained highly competitive results, ranking in the 7th team position, only 1.3 points below the winner in fine-grained accuracy and 1.1 points below in coarse-grained accuracy.
This section presents the results of applying three new feature selection methods to the Senseval-3 corpora.
The first method is a greedy iterative process (hereinafter bf) in which the best feature (in accuracy) is selected at each step. The initial feature set of this method is the empty set. This approach is local and greedy: sometimes a pair of features would have to be selected together in order to improve the overall accuracy, and once a feature is selected it cannot be deleted. A variant of this method, consisting of using as initial set the features in table 6.22, has been called wbf.
The third process, hill climbing (hereinafter hc), aims to avoid the problem of selecting a bad feature. It is also a greedy iterative process. The first and second steps are like those of the previous method; that is, the process selects the best two features in two iterations. From the third step onwards, each step looks for the best feature (in accuracy) to add and the best one to delete (if there exists a feature whose deletion increases the accuracy). The initial set of this method is also the empty set.
13 Only 14 words were processed with the full architecture.
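The following sketch contrasts the forward-selection (bf) and hill-climbing (hc) procedures just described. Here evaluate stands for any function returning the cross-validated accuracy obtained with a given set of feature types, and is assumed to be provided by the caller; the hc variant below picks a single best move (addition or deletion) per step, a simplification of the procedure above.

# Sketch of the greedy feature selection procedures (illustrative).
# evaluate(feature_set) is assumed to return the cross-validated accuracy
# obtained when training with exactly that set of feature types.

def best_first(candidates, evaluate, initial=frozenset()):
    """bf / wbf: repeatedly add the single best feature; never delete."""
    selected, score = set(initial), evaluate(set(initial))
    while True:
        additions = [(evaluate(selected | {f}), f)
                     for f in candidates if f not in selected]
        if not additions:
            return selected
        best_score, best_feature = max(additions)
        if best_score <= score:
            return selected
        selected.add(best_feature)
        score = best_score

def hill_climbing(candidates, evaluate):
    """hc: after the first two additions, a step may also delete a feature."""
    selected, score = set(), evaluate(set())
    while True:
        moves = [(evaluate(selected | {f}), "add", f)
                 for f in candidates if f not in selected]
        if len(selected) >= 2:
            moves += [(evaluate(selected - {f}), "del", f) for f in selected]
        if not moves:
            return selected
        best_score, action, feature = max(moves)
        if best_score <= score:
            return selected
        if action == "add":
            selected.add(feature)
        else:
            selected.discard(feature)
        score = best_score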
Evaluation
Table 6.25 shows the results of the feature selection methods on the Senseval-3 corpora for nouns, verbs, adjectives and overall. The first method (wrd) is the one we presented at the Senseval-3 event, achieving a fine-grained accuracy of 71.6 and a coarse-grained one of 78.2. Although every procedure achieves the best performance on some words, the best overall system is wrd. That is, none of the new feature selection methods improved the original results.
The basic aim of table 6.26 is to show the impact on accuracy of the feature selection procedures for the first seven words of the Senseval-3 English Lexical Sample corpus. ova-wrd stands for one-versus-all binarisation plus the wrd feature selection procedure, cc-wrd for constraint classification plus wrd selection, and so on. top, loc, dom and syn stand, respectively, for topical, local, domain and syntactical features. So, the first four columns of the table show the number of words for which the method has selected some of those features. The last column (avg) shows the average number of features selected by the method.
It seems that, generally, one-versus-all selects more features than constraint classification. The wrd and wbf feature selection procedures gather many more features (around 50 feature types) than the others (fewer than 20). While the wrd and wbf methods start with a large feature set (approximately 40 feature types, depending on the part-of-speech), bf and hc begin with the empty set; this explains the differences between them. Furthermore, bf selects twice as many features as hc, because hc deletes features during its iterative process.
Regarding the 4 clusters of features, all methods have selected local features for all the
words. The domain features from the MCR are selected as often as topical features, and more often than syntactical features.
6.5 Conclusions
We have participated in the English Lexical Sample task of the last two editions of the Senseval international evaluation exercises. Our systems have achieved a very high position in both competitions; that is, they are part of the state of the art in supervised WSD systems.
At Senseval-2, we proposed a system based on LazyBoosting, an improved version of AdaBoost.MH. We enriched the typical local and topical context with domain information. After the exercise, we studied the outputs of the six best systems and proposed a new system based on SVM that outperformed all six systems presented at the competition.
At Senseval-3, we proposed and tested a supervised system in which the examples are represented through a very rich and redundant set of features (using the information integrated within the Multilingual Central Repository of the MEANING project), and which performs a specialised feature selection and multiclass binarisation process for each word. All these characteristics lead to a better system that can be automatically configured for every word. The main drawback of the approach presented in this work is the high computational overhead of the feature selection process for every word.
The system based on Support Vector Machines and a rich set of features achieves a highly competitive performance. The features based on domains bring new complementary and useful information.
This work suggests some promising lines to be studied in detail: 1) Searching new ways
of representing the examples: for instance, the learning algorithms could use the output
Finally, there are some open questions: 1) Which set of features best represents each sense of each word? 2) How should we binarise the multiclass problem? And 3) How should we treat examples with more than one possible sense (multilabel assignments)?
14 That is, starting from the full feature set and deleting features at each iteration.
Chapter 7
Conclusions
7.1 Summary
Word Sense Disambiguation is one of the tasks dealing with semantics in NLP. The first references to Lexical Ambiguity are sixty years old. Despite the work devoted to the problem, it can be said that no large-scale, broad-coverage, accurate WSD system has been built to date (Kilgarriff and Rosenzweig, 2000). Therefore, WSD is one of the most important open problems in NLP, with current state-of-the-art accuracy in the 60-70% range. Machine Learning classifiers are effective but, due to the knowledge acquisition bottleneck, they will not be feasible until reliable methods for acquiring large sets of training examples with a minimum of human annotation effort are available. These automatic methods for helping in the collection of examples should be robust to noisy data and to changes in sense frequency distributions and corpus domain (or genre). The ML algorithms used to induce the WSD classifiers should also be noise-tolerant (both in class labels and feature values), easy to adapt to new domains, robust to overfitting, and efficient enough to learn thousands of classifiers from very large training sets and feature spaces.
Many NLP researchers consider word sense ambiguity a central problem for several Human Language Technology applications (e.g., Machine Translation, Information Extraction and Retrieval, etc.). This is also the case for most of the associated sub-tasks, for instance, reference resolution and parsing. For this reason, many international research groups have been actively working on WSD during the last years, using a wide range of approaches. This can be noticed throughout the recent publications in the main NLP conferences and journals (including special issues on the WSD task), the organisation of specialised conferences on this topic, a number of current NLP research projects including WSD among their goals, and the increasing number of participants in the different editions of the Senseval evaluation exercises.
The best systems at the English Lexical Sample task in the last Senseval editions obtained accuracy results lower than 75%. Obviously, these figures are not useful for practical purposes. We think that the best way to improve the current systems is to exploit the representation of the examples with linguistic knowledge that is not currently available in wide-coverage lexical knowledge bases. We refer, for instance, to sub-categorisation frames, syntactic structure, selectional preferences, and domain information. That is, an in-depth study of the features is needed (the better we represent the examples, the better the learning systems we will obtain). One of the biggest problems of this approach is the dependency of the classifiers on the training corpora. We need an in-depth study of this corpus dependency, together with either general corpora or effective and simple ways to adapt to new application domains.
The best systems at the All-words task in the last Senseval edition obtained accuracy results near 65%. Following the supervised approach, we need a large amount of labelled corpora, which are very expensive to obtain, in order to build a broad-coverage Word Sense Tagger. The most promising line to tackle this problem is the application of bootstrapping techniques to profit from unlabelled examples. This is one of the most interesting current research lines in the Word Sense Disambiguation field and in other NLP-related tasks. Moreover, future WSD systems will need to automatically detect and collapse spurious sense distinctions, as well as to discover, probably in an on-line learning setting, occurrences of new senses in running text.
We also think that it might be very useful to develop more complex systems dealing
with Semantics from several points of view, like systems based on Semantic Parsing.
Finally, we think that there are some general outstanding problems in Word Sense Disambiguation and semantics. The most important one is the demonstration, on a real application, of the usefulness of Word Sense Disambiguation. Another open issue is the utility of WSD, within semantics, for developing Natural Language Understanding systems.
7.2 Contributions
This section summarises the main contributions of the thesis. We have grouped them according to the chapters of this document:
application of feature selection procedures, and the comparison between them; and 3)
the application of different ways of binarising the multiclass problem when using binary
classifiers, such as SVM.
7.3 Publications
This section lists the publications ordered by the contribution topics of the previous
section:
7.4 Further Work
This section is devoted to explaining the current work and the further research lines that we plan to address:
• First, the representation of the examples. Which is the best way to represent the
examples? Which is the best feature selection procedure?
• The third line is the integration of Semantic Parsing models within Word Sense
Disambiguation classifiers in order to develop a better and complementary hybrid
system to deal with both tasks.
Bibliography
E. Agirre and D. Martı́nez. The Basque Country University system: English and Basque
tasks. In Proceedings of the International Workshop on the Evaluation of Systems for
the Semantic Analysis of Text, Senseval–3, 2004b.
E. Agirre and G. Rigau. A Proposal for Word Sense Disambiguation using Conceptual
Distance. In Proceedings of the International Conference: Recent Advances in Natural
Language Processing, RANLP, 1995.
H. Alshawi and D. Carter. Training and scaling preference functions for disambiguation.
Computational Linguistics, 20(4), 1994.
A. Berger, S. Della Pietra, and V. Della Pietra. A Maximum Entropy Approach to Natural
Language Processing. Computational Linguistics, 22(1), 1996.
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co–training. In
Proceedings of the Workshop on Computational Learning Theory, COLT, 1998.
B. Boser, I. Guyon, and V. Vapnik. A Training Algorithm for Optimal Margin Classifiers.
In Proceedings of the Workshop on Computational Learning Theory, COLT, 1992.
C. Cabezas, P. Resnik, and J. Stevens. Supervised Sense Tagging using Support Vector
Machines. In Proceedings of the International Workshop on Evaluating Word Sense
Disambiguation Systems, 2001.
C. Cardie and R. J. Mooney. Guest Editors’ Introduction: Machine Learning and Natural
Language. Machine Learning (Special Issue on Natural Language Learning), 34, 1999.
M. Carpuat, W. Su, and D. Wu. Augmenting Ensemble Classification for Word Sense
Disambiguation with a Kernel PCA Model. In Proceedings of the International
Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Senseval–3,
2004.
M. Carpuat and D. Wu. Word Sense Disambiguation vs. Statistical Machine Translation.
In Proceedings of the Annual Meeting of the Association for Computational Linguistics,
ACL, 2005.
M. Castillo, F. Real, J. Atserias, and G. Rigau. The TALP Systems for Disambiguating
WordNet Glosses. In Proceedings of the International Workshop on the Evaluation of
Systems for the Semantic Analysis of Text, Senseval–3, 2004.
S. F. Chen. Building Probabilistic Models for Natural Language. PhD thesis, Center for
Research in Computing Technology. Harvard University, 1996.
T. Chklovski and R. Mihalcea. Building a Sense Tagged Corpus with Open Mind Word
Expert. In Proceedings of the ACL workshop: Word Sense Disambiguation: Recent
Successes and Future Directions, 2002.
M. Collins. Parameter Estimation for Statistical Parsing Models: Theory and Practice of Distribution–Free Methods. To appear as a book chapter, 2002. Available at www.ai.mit.edu/people/mcollins.
S. Cost and S. Salzberg. A Weighted Nearest Neighbor Algorithm for Learning with
Symbolic Features. Machine Learning, 10, 1993.
W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. Timbl: Tilburg memory
based learner, version 4.0, reference guide. Technical Report, University of Antwerp,
2001.
B. Decadt, V. Hoste, W. Daelemans, and A. van den Bosch. GAMBL, Genetic Algorithm
Optimization of Memory–Based WSD. In Proceedings of the International Workshop
on Evaluating Word Sense Disambiguation Systems, Senseval–3, 2004.
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley &
Sons, 1973.
G. Escudero, L. Màrquez, and G. Rigau. Using Lazy Boosting for Word Sense
Disambiguation. In Proceedings of the International Workshop on Evaluating Word
Sense Disambiguation Systems, Senseval–2, 2001.
G. Escudero, L. Màrquez, and G. Rigau. TALP System for the English Lexical
Sample Task. In Proceedings of the International Workshop on Evaluating Word Sense
Disambiguation Systems, Senseval–3, 2004.
C. Fellbaum, editor. WordNet. An Electronic Lexical Database. The MIT Press, 1998.
R. Florian and D. Yarowsky. Modelling consensus: Classifier combination for word sense
disambiguation. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, EMNLP, 2002.
W. Gale, K. W. Church, and D. Yarowsky. Estimating Upper and Lower Bounds on the
Performance of Word Sense Disambiguation. In Proceedings of the Annual Meeting of
the Association for Computational Linguistics, ACL, 1992a.
C. Grozea. Finding optimal parameter settings for high performance word sense
disambiguation. In Proceedings of the International Workshop on the Evaluation of
Systems for the Semantic Analysis of Text, Senseval–3, 2004.
V. Hoste, W. Daelemans, I. Hendrickx, and A. van den Bosch. Evaluating the Results of
a Memory–Based Word–Expert Approach to Unrestricted Word Sense Disambiguation.
In Proceedings of the Workshop on Word Sense Disambiguation: Recent Successes and
Future Directions, 2002a.
N. Ide and J. Véronis. Introduction to the Special Issue on Word Sense Disambiguation:
The State of the Art. Computational Linguistics, 24(1), 1998.
T. Joachims. Text Categorization with Support Vector Machines: Learning with Many
Relevant Features. In Proceedings of the European Conference on Machine Learning,
ECML, 1998.
T. Joachims. Learning to Classify Text Using Support Vector Machines. Kluwer, 2002.
U. S. Kohomban and W. S. Lee. Learning semantic classes for word sense disambiguation.
In Proceedings of the Annual Meeting of the Association for Computational Linguistics,
ACL, 2005.
Y. K. Lee, H. T. Ng, and T. K. Chia. Supervised Word Sense Disambiguation with Support
Vector Machines and Multiple Knowledge Sources. In Proceedings of the International
Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Senseval–3,
2004.
B. Magnini and G. Cavaglia. Integrating Subject Field Codes into WordNet. In Proceedings
of the International Conference on Language Resources and Evaluation, LREC, 2000.
D. Martı́nez and E. Agirre. One Sense per Collocation and Genre/Topic Variations. In
Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language
Processing and Very Large Corpora, EMNLP/VLC, 2000.
D. Martı́nez, E. Agirre, and L. Màrquez. Syntactic Features for High Precision Word
Sense Disambiguation. In Proceedings of International Conference on Computational
Linguistics, COLING, 2002.
R. Mihalcea. Instance Based Learning with Automatic Feature Selection Applied to Word
Sense Disambiguation. In Proceedings of the International Conference on Computational
Linguistics, COLING, 2002b.
R. Mihalcea and D. Moldovan. Pattern Learning and Active Feature Selection for Word
Sense Disambiguation. In Proceedings of the International Workshop on Evaluating
Word Sense Disambiguation Systems, Senseval–2, 2001.
H. T. Ng. Getting Serious about Word Sense Disambiguation. In Proceedings of the ACL
SIGLEX Workshop: Standardizing Lexical Resources, 1997b.
H. T. Ng, C. Y. Lim, and S. K. Foo. A Case Study on Inter–Annotator Agreement for Word
Sense Disambiguation. In Proceedings of the ACL SIGLEX Workshop: Standardizing
Lexical Resources, 1999.
I. Niles and A. Pease. Towards a Standard Upper Ontology. In Proceedings of the 2nd
International Conference on Formal Ontology in Information Systems, FOILS’2001,
2001.
T. Pedersen and R. Bruce. A New Supervised Learning Algorithm for Word Sense
Disambiguation. In Proceedings of the National Conference on Artificial Intelligence,
AAAI, 1997.
H. Seo, S. Lee, H. Rim, and H. Lee. KUNLP system using Classification Information
Model at Senseval–2. In Proceedings of the International Workshop on evaluating Word
Sense Disambiguation Systems, Senseval–2, 2001.
B. Snyder and M. Palmer. The English All–Words Task. In Proceedings of the International
Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Senseval–3,
2004.
C. Strapparava, A. Gliozzo, and C. Giuliano. Pattern Abstraction and Term Similarity for
Word Sense Disambiguation: IRST at Senseval–3. In Proceedings of the International
Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Senseval–3,
2004.
M. Sussna. Word sense disambiguation for free–text indexing using a massive semantic
network. In Proceedings of the International Conference on Information and Knowledge
Base Management, CIKM, 1993.
E. Voorhees. Using WordNet to disambiguate word senses for text retrieval. In Proceedings
of the Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval, 1993.
Y. Wilks and M. Stevenson. The Grammar of Sense: Is word–sense tagging much more
than part–of–speech tagging? Technical Report CS-96-05, University of Sheffield,
Sheffield, UK, 1996.
D. Yarowsky. Three Machine Learning Algorithms for Lexical Ambiguity Resolution. PhD
thesis, Department of Computer and Information Sciences. University of Pennsylvania,
1995a.
D. Yarowsky. Hierarchical Decision Lists for Word Sense Disambiguation. Computers and
the Humanities, 34(2), 2000.
Appendix A
Statistical Measures
This appendix contains the description and formulae of the statistics and measures used in this work. All the measures in this section have been applied for comparing different classifiers on different data.
Given the predictions of a classifier F , where N denotes then total number of examples,
V the number of examples with prediction and A the number of examples with a correct
prediction:
\[
\mathrm{precision} = \frac{A}{V} \qquad
\mathrm{recall} = \frac{A}{N} \qquad
\mathrm{coverage} = \frac{V}{N}
\]
\[
F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2 \times A}{V + N}
\]
When V = N (that is, when the classifier outputs a prediction for every example and never
abstains), we talk about accuracy instead of precision and recall, because both measures
coincide.
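As an illustration only (this is not part of the thesis software), a minimal Python sketch
of these four measures, using the same N, V and A defined above:

def evaluation_measures(N, V, A):
    # N: total number of examples
    # V: number of examples for which the classifier emits a prediction
    # A: number of examples with a correct prediction
    precision = A / V if V else 0.0
    recall = A / N if N else 0.0
    coverage = V / N if N else 0.0
    # with these definitions F1 reduces to 2*A / (V + N)
    f1 = (2 * A) / (V + N) if (V + N) else 0.0
    return precision, recall, coverage, f1

# Hypothetical figures: 1000 test examples, 900 answered, 630 of them correct
print(evaluation_measures(1000, 900, 630))  # (0.70, 0.63, 0.90, 0.663...)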
We have applied two significance tests to complement the previous measures (a minimal
code sketch of both tests is given after the list below). Given two classifiers F and G:
• the paired Student's t-test, with a confidence value of $t_{9,0.975} = 2.262$ or the
equivalent value depending on the number of folds, has been used when working within a
cross-validation framework:
\[
\frac{\bar{p} \times \sqrt{n}}{\sqrt{\dfrac{\sum_{i=1}^{n} \left( p(i) - \bar{p} \right)^{2}}{n-1}}} > 2.262
\]
where n is the number of folds, p̄ = (1/n) Σ p(i), p(i) = pF(i) − pG(i), and pX(i) is the
number of examples misclassified by classifier X (with X ∈ {F, G}) in the i-th fold.
• and McNemar's test, with a confidence value of $\chi^{2}_{1,0.95} = 3.842$, has been
used when no cross-validation framework was applied:
\[
\frac{\left( |A - B| - 1 \right)^{2}}{A + B} > 3.842
\]
where A is the number of examples misclassified by classifier F but not by G, and B
is the number of examples misclassified by G but not by F.
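A minimal Python sketch of both tests, assuming the per-fold misclassification counts (for
the t-test) and the disagreement counts A and B (for McNemar's test) have already been
collected; the thresholds are the ones quoted above, and the absolute value of the t statistic
is used for a two-sided check. The function names and figures are illustrative, not taken
from the thesis software.

import math

def paired_t_significant(errors_f, errors_g, threshold=2.262):
    # errors_f, errors_g: misclassification counts of F and G per fold
    n = len(errors_f)
    p = [f - g for f, g in zip(errors_f, errors_g)]       # p(i) = p_F(i) - p_G(i)
    p_mean = sum(p) / n
    variance = sum((pi - p_mean) ** 2 for pi in p) / (n - 1)
    t = abs(p_mean) * math.sqrt(n) / math.sqrt(variance)  # two-sided check
    return t > threshold

def mcnemar_significant(a, b, threshold=3.842):
    # a: examples misclassified by F but not by G; b: the opposite
    chi2 = (abs(a - b) - 1) ** 2 / (a + b)
    return chi2 > threshold

# Hypothetical counts over a 10-fold cross-validation and a single test set
print(paired_t_significant([30, 28, 31, 29, 27, 30, 32, 28, 29, 31],
                           [26, 25, 27, 28, 24, 26, 29, 25, 27, 26]))  # True
print(mcnemar_significant(40, 22))                                     # True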
Given the predictions of two classifiers F and G, the observed agreement is computed as:
\[
P_a = \frac{A}{N}
\]
where A denotes the number of examples for which both classifiers have output the same
prediction and N is the total number of examples.
The kappa statistic is computed as:
\[
\mathrm{kappa} = \frac{P_a - P_e}{1 - P_e}
\]
where Pe (the expected agreement) is computed as:
\[
P_e = \sum_{i=1}^{ns} \frac{P_F(i) \times P_G(i)}{N^{2}}
\]
where ns is the number of senses, and PX(i) is the number of predictions of classifier X
(with X ∈ {F, G}) for sense i.
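A minimal Python sketch of the observed agreement and the kappa statistic, assuming the
two classifiers' predictions are given as parallel lists of sense labels (illustrative code, not
the thesis software):

from collections import Counter

def agreement_and_kappa(preds_f, preds_g):
    # preds_f, preds_g: parallel lists with the sense predicted for each example
    N = len(preds_f)
    # observed agreement: proportion of examples with identical predictions
    pa = sum(1 for f, g in zip(preds_f, preds_g) if f == g) / N
    # expected agreement: sum over senses of P_F(i) * P_G(i) / N^2
    counts_f, counts_g = Counter(preds_f), Counter(preds_g)
    pe = sum(counts_f[s] * counts_g[s] for s in counts_f) / (N ** 2)
    kappa = (pa - pe) / (1 - pe)
    return pa, kappa

preds_f = ["s1", "s1", "s2", "s3", "s1", "s2"]
preds_g = ["s1", "s2", "s2", "s3", "s1", "s1"]
print(agreement_and_kappa(preds_f, preds_g))  # (0.666..., 0.4545...)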
Appendix B
Example of the Feature Extractor
The following example has been extracted from the test set of the English Lexical Sample
task corpus of Senseval-2. It is followed by some of the features output by the feature
extractor used in this work. The target word is 'Pop Art' and its sense is
'pop art%1:06:00::':
Table B.1: Local features

feat.   description
form    form of the target word
locat   all part-of-speech/forms/lemmas in the local context
coll    all collocations of two part-of-speech/forms/lemmas
coll2   all collocations of a form/lemma and a part-of-speech (and the reverse)
first   form/lemma of the first noun/verb/adjective/adverb on the left/right of the target word
• f fln (first): the form of the first noun on the left of the target word (e.g. f fln
Pop).
• l frv (first): the lemma of the first verb on the right of the target word (e.g. l frv
be).
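Purely as an illustration (this is not the original feature extractor), a sketch of how the
'first' type of local features could be computed, assuming each example is a list of
(form, lemma, pos) triples with coarse PoS tags 'n', 'v', 'a', 'r' and target_idx pointing
to the target word; the feature names only approximate the naming used above:

def first_features(tokens, target_idx):
    # tokens: list of (form, lemma, pos) triples for one example
    feats = []
    for pos_tag in ("n", "v", "a", "r"):
        # first matching word on the left of the target word
        for form, lemma, pos in reversed(tokens[:target_idx]):
            if pos == pos_tag:
                feats.append(f"f_fl{pos_tag} {form}")   # e.g. f_fln Pop
                feats.append(f"l_fl{pos_tag} {lemma}")
                break
        # first matching word on the right of the target word
        for form, lemma, pos in tokens[target_idx + 1:]:
            if pos == pos_tag:
                feats.append(f"f_fr{pos_tag} {form}")
                feats.append(f"l_fr{pos_tag} {lemma}")  # e.g. l_frv be
                break
    return feats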
Three types of topical features are shown in table B.2. Topical features try to capture
non-local information from the words of the context. For each type, two overlapping sets
of redundant topical features are considered: one extracted from a ±10-word window and
another one considering the whole example.

Table B.2: Topical features

feat.   description
topic   bag of forms/lemmas
sbig    all form/lemma bigrams of the example
comb    forms/lemmas of consecutive (or not) pairs of the open-class-words in the example

Some topical features generated from the previous example are shown next (a minimal
extraction sketch follows the list):

• f top (topic): bag of forms of the open-class-words in the example (e.g. f top
market).
• l top (topic): bag of lemmas of the open-class-words in the example (e.g. l top
signal).
• fsbig (sbig): collocations of forms of every two consecutive words in the example
(e.g. fsbig of material).
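A minimal sketch of how the topic and sbig features could be generated under the same
assumptions as the previous sketch (tokenised examples of (form, lemma, pos) triples);
again, illustrative code only, with feature names that approximate the ones above:

OPEN_CLASS = {"n", "v", "a", "r"}  # nouns, verbs, adjectives, adverbs

def topical_features(tokens, target_idx, window=10):
    # tokens: list of (form, lemma, pos) triples for one example
    feats = set()
    spans = {
        "win": tokens[max(0, target_idx - window): target_idx + window + 1],
        "all": tokens,                                  # the whole example
    }
    for name, span in spans.items():
        # topic: bag of forms/lemmas of the open-class words
        for form, lemma, pos in span:
            if pos in OPEN_CLASS:
                feats.add(f"f_top_{name} {form}")       # e.g. f_top_all market
                feats.add(f"l_top_{name} {lemma}")      # e.g. l_top_all signal
        # sbig: form/lemma bigrams of every two consecutive words
        for (f1, l1, _), (f2, l2, _) in zip(span, span[1:]):
            feats.add(f"fsbig_{name} {f1} {f2}")        # e.g. fsbig_all of material
            feats.add(f"lsbig_{name} {l1} {l2}")
    return feats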
Table B.3 presents the knowledge-based features. These features have been obtained
using the knowledge contained in the Multilingual Central Repository (Mcr)
(Atserias et al., 2004) of the Meaning project. For each example, the program obtains, from
each context, all the nouns, all their synsets and their associated semantic information: Sumo
labels, domain labels, WordNet Lexicographic Files, and EuroWordNet Top Ontology labels.
We also assign to each label a weight, which depends on the number of labels assigned to each
noun and on their relative frequencies in the whole WordNet (see section 6.2.1 for a more
detailed description). For each kind of semantic knowledge, after summing up all these weights,
the program finally selects the semantic labels with the highest weights (a minimal sketch of
this selection is given after table B.3 below). Some knowledge-based features generated from
the previous example are shown next:
• f mag (f magn): the first domain label obtained by applying the formula explained
in section 6.2.1 to the IRST Domains (e.g. f mag quality).
• a mag (a magn): bag of domain labels obtained by applying the formula explained
in section 6.2.1 to the IRST Domains (e.g. a mag psychology).
• f sum (f sumo): the first SUMO label obtained by applying the formula explained
in section 6.2.1 to the SUMO Labels (e.g. f sum capability).
• f ton (f tonto): the first EuroWordNet Top Ontology label obtained by applying the
formula explained in section 6.2.1 to the Top Ontology (e.g. f ton Property).
• f smf (f semf ): the first WordNet Semantic File label obtained by applying the
formula explained in section 6.2.1 to the Semantic Files (e.g. f smf 42).

Table B.3: Knowledge-based features

feat.    description
f sumo   first sumo label
a sumo   all sumo labels
f semf   first wn semantic file label
a semf   all wn semantic file labels
f tonto  first ewn top ontology label
a tonto  all ewn top ontology labels
f magn   first domain label
a magn   all domain labels
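The label-selection scheme described above could be sketched roughly as follows. The
mapping from context nouns to their candidate semantic labels and the table of relative
label frequencies are assumed to have been pre-computed from the Mcr, and the weighting
only approximates the actual formula of section 6.2.1:

from collections import defaultdict

def select_labels(context_nouns, labels_of, relative_freq, top_k=1):
    # context_nouns: nouns found in the context of the example
    # labels_of:     noun -> semantic labels of its synsets (SUMO, domain, ...)
    # relative_freq: label -> relative frequency in the whole WordNet
    weights = defaultdict(float)
    for noun in context_nouns:
        labels = labels_of.get(noun, [])
        if not labels:
            continue
        for label in labels:
            # a label weighs less when the noun has many candidate labels,
            # and more when the label is frequent in WordNet
            weights[label] += relative_freq.get(label, 0.0) / len(labels)
    # keep the label(s) with the highest accumulated weight
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:top_k]]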
Finally, table B.4 describes the syntactic features, which have been extracted using two
different tools: Dekang Lin's Minipar and Yarowsky's dependency pattern extractor.

Table B.4: Syntactic features

feat.     description
tgt mnp   syntactical relations of the target word from minipar
rels mnp  all syntactical relations from minipar
yar noun  NounModifier, ObjectTo, SubjectTo for nouns
yar verb  Object, ObjectToPreposition, Preposition for verbs
yar adjs  DominatingNoun for adjectives

Some syntactic features generated from the previous example are shown next (a minimal
encoding sketch follows the list):

• mptgt (tgt mnp): minipar dependency relation to or from the target word (e.g.
mptgt s:be).
• mprls (rels mnp): minipar dependency relations of the example (e.g. mprls
dealer:nn:market).
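As an illustration of how the Minipar-based features could be encoded, assuming
dependency triples (word, relation, word) are already available from the parser output
(the triple orientation and the helper below are assumptions, not the original extraction
code):

def minipar_features(dependencies, target_form):
    # dependencies: (word, relation, word) triples taken from the parser output
    feats = []
    for w1, rel, w2 in dependencies:
        # mprls (rels_mnp): every dependency relation of the example
        feats.append(f"mprls {w1}:{rel}:{w2}")
        # mptgt (tgt_mnp): relations to or from the target word
        if w1 == target_form:
            feats.append(f"mptgt {rel}:{w2}")
        elif w2 == target_form:
            feats.append(f"mptgt {rel}:{w1}")
    return feats

# Hypothetical triples roughly matching the feature values quoted above
deps = [("dealer", "nn", "market"), ("Pop Art", "s", "be")]
print(minipar_features(deps, "Pop Art"))
# ['mprls dealer:nn:market', 'mprls Pop Art:s:be', 'mptgt s:be']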
Appendix C
Resources and Web References
3. DSO Corpus
http://www.ldc.upenn.edu/Catalog/LDC97T12.html
4. EuroWordNet
http://www.illc.uva.nl/EuroWordNet/
5. FreeLing
http://garraf.epsevg.upc.es/freeling
8. interest corpus
http://crl.nmsu.edu/cgi-bin/Tools/CLR/clrcat#I9
9. LazyBoosting
http://www.lsi.upc.es/∼escudero
13. Minipar
http://www.cs.ualberta.ca/∼lindek/minipar.htm
16. Romanseval
http://www.lpl.univ-aix.fr/projectes/romanseval
18. Senseval
http://www.senseval.org
19. SUMO
http://www.ontologyportal.org
20. SVM-light
http://svmlight.joachims.org
22. TNT
http://www.coli.uni-saarland.de/∼thorsten/tnt
24. WordNet
http://wordnet.princeton.edu
25. WordSmyth
http://www.wordsmyth.net
Appendix D
List of Acronyms
AB AdaBoost.MH
AI Artificial Intelligence
BC Brown Corpus
cc constraint classification
CL Computational Linguistics
DL Decision Lists
EB Exemplar-based
IR Information Retrieval
kNN k Nearest Neighbour
LB LazyBoosting
LDC Linguistic Data Consortium
MBL Memory-Based Learning
MFC Most Frequent Sense Classifier
MFS Most Frequent Sense
ML Machine Learning
MT Machine Translation
MVDM Modified Value Difference Metric
NB Naive Bayes
NL Natural Language
NLP Natural Language Processing