(ISR)
Chapter 2
Document Representation and Text Operations
1
Document representation
• An IR system/search engine does not scan each document to see if it satisfies the query
• It uses an index to quickly locate the relevant documents
• Index: a list of concepts with pointers to the documents that discuss (represent) them
  – What goes in the index is very important
• Document representation: deciding which concepts should go in the index
2
Document representation cont. …
Two options:
1. Controlled vocabulary – a set of manually constructed concepts that describe the major topics covered in the collection
2. Free-text indexing – the set of individual terms that occur in the collection
3
1. Document representation cont. …
• Controlled vocabulary: a set of well-defined concepts
  – Assigned to documents by humans (or automatically)
    • E.g. subject headings, keywords, etc.
  – May include parent-child relations between concepts
    • E.g. Computers > Software/Hardware > Information Retrieval
  – Facilitate non-query-based browsing and exploration, because users can browse the parent-child relationships
4
Controlled Vocabulary: Advantages
• Concepts do not need to appear explicitly in the text
• Relationships between concepts facilitate non-query-based navigation and exploration
• Developed by experts who know the data and the users
• Represent the concepts/relationships that users (presumably) care the most about
• Describe the concepts that are most central to the document
• Concepts are unambiguous and recognizable (necessary for
annotators and good for users)
5
Controlled Vocabulary: Disadvantages
• Time consuming
• Users must know the concepts in the index
• Labor intensive
6
2. Free Text Indexing
• Represent documents using terms within the document
• Which terms? Only the most descriptive terms? Only the unambiguous ones? All of them?
  – Usually, all of them (a.k.a. full-text indexing)
• The user will use term combinations to express higher-level concepts
• Query terms will hopefully disambiguate each other (e.g., “volkswagen golf”)
• The search engine will determine which terms are important
7
How are the texts handled?
• What happens if you take the words exactly as they appear in the original text?
• What about punctuation, capitalization, etc.?
• What about spelling errors?
• What about plural vs. singular forms of words?
• What about cases and declension (the variation of the form of a noun, pronoun, or adjective to express its grammatical case) in non-English languages?
• What about non-Roman alphabets?
8
Free Text Indexing: Steps
1. Mark-up removal
2. Normalization – e.g., down-casing
   – Information and information
   – Retrieval and RETRIEVAL
   – US and us – down-casing can change the meaning of words
3. Tokenization – splitting text into words (based on sequences of non-alphanumeric characters)
   – Problematic cases: ph.d. → ph d, isn’t → isn t
4. Stop word removal
5. Apply steps 1–4 to every document in the collection, and
6. Create an index using the union of all remaining terms (see the sketch below)
9
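The following is a minimal sketch (not from the slides) of steps 1–6; the toy stop list and the regular expressions are illustrative assumptions, not a production design.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}  # toy stop list (assumption)

def preprocess(text):
    """Apply steps 1-4 to one document."""
    text = re.sub(r"<[^>]+>", " ", text)          # 1. mark-up removal (strip tag-like spans)
    text = text.lower()                           # 2. normalization (down-casing)
    tokens = re.split(r"[^a-z0-9]+", text)        # 3. tokenization on non-alphanumeric runs
    return [t for t in tokens if t and t not in STOPWORDS]  # 4. stop word removal

def build_index(docs):
    """Steps 5-6: preprocess every document and index the union of remaining terms."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in preprocess(text):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = ["<p>Information Retrieval is fun.</p>",
        "Retrieval of information from an index."]
print(build_index(docs))
# e.g. {'information': {0, 1}, 'retrieval': {0, 1}, 'fun': {0}, 'from': {1}, 'index': {1}}
```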
Controlled Vs. Free Text Indexing
• Comparison dimensions: cost of assigning index terms, ambiguity of index terms, and detail of representation
11
Statistical Properties of words in a Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
• Such properties of a text collection greatly affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system
• There are three well-known results (each named after a researcher) that describe statistical properties of words in a text:
1. Zipf’s law: models the word distribution in a text corpus
2. Luhn’s idea: measures word significance
3. Heaps’ law: shows how vocabulary size grows with the growth of the corpus size
12
1. Zipf's Law: Word Distribution/Frequency
• A few words are very common.
  – The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
• Most words are very rare.
  – Half the words in a corpus appear only once; they are called “read only once” words, or hapax legomena (in Greek)
13
Zipf's Law: Word distribution cont. ...
• Zipf's law, named after the Harvard linguistics professor George Kingsley Zipf (1902–1950), attempts to capture the distribution of the frequencies (i.e., numbers of occurrences) of the words within a text.
• For all the words in a collection of documents, for each word w:
  f : the frequency of w
  r : the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
• Zipf's law states that f is proportional to 1/r, i.e. f × r ≈ constant
14
Word distribution: Zipf's Law
• Zipf's law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf.
• The table shows the most frequently occurring words from a corpus of 336,310 documents containing 125,720,891 total words, of which 508,209 are unique.
15
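As a quick sketch (not from the slides), the rank-frequency behaviour can be checked on any plain-text corpus; the file name corpus.txt below is a placeholder assumption.

```python
from collections import Counter
import re

def zipf_table(text, top=10):
    """Rank terms by frequency; Zipf's law predicts rank * frequency to be roughly constant."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    ranked = Counter(tokens).most_common(top)
    return [(rank, term, freq, rank * freq)
            for rank, (term, freq) in enumerate(ranked, start=1)]

with open("corpus.txt", encoding="utf-8") as f:   # placeholder corpus file
    for rank, term, freq, rf in zipf_table(f.read()):
        print(f"{rank:>4}  {term:<15} {freq:>8}  rank*freq = {rf}")
```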
Methods that Build on Zipf's Law
• Stop lists: ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words: take the words in between the most frequent (upper cut-off) and least frequent (lower cut-off) words
• Term weighting: give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
16
Zipf’s Law: Impact on IR
17
2. Word significance: Luhn’s Ideas
• Luhn’s idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance
• For this, Luhn specifies two cutoff points, an upper and a lower cutoff, based on which non-significant words are excluded
  – Words exceeding the upper cutoff were considered to be too common
  – Words below the lower cutoff were considered to be too rare
  – Hence neither group contributes significantly to the content of the text
  – The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cutoffs
20
Vocabulary Growth: Heaps’ Law
• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
  – This determines how the size of the inverted index will scale with the size of the corpus.
• Heaps’ law estimates the vocabulary size of a given corpus
  – The vocabulary size grows as K·n^β, where β is a constant between 0 and 1.
  – If V is the size of the vocabulary and n is the length of the corpus in words, Heaps’ law gives:
      V = K · n^β
    where the constants typically are:
      K ≈ 10–100
      β ≈ 0.4–0.6 (approximately square-root growth)
21
Heaps’ law distribution
• (Figure) Distribution of the vocabulary size vs. the total number of terms extracted from a text corpus
22
Example: Heaps’ Law
• We want to estimate the size of the vocabulary for a corpus of 1,000,000 words
• Assume that, based on statistical analysis of smaller corpora:
  – a corpus of 100,000 words contains 50,000 unique words, and
  – a corpus of 500,000 words contains 150,000 unique words
• Estimate the vocabulary size for the 1,000,000-word corpus (a worked sketch follows below)
  – What about for a corpus of 1,000,000,000 words?
23
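A worked sketch of this example, assuming Heaps’ law V = K·n^β fits the two observations exactly:

```python
import math

def fit_heaps(n1, v1, n2, v2):
    """Fit V = K * n**beta from two (corpus size, vocabulary size) observations."""
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    K = v1 / n1 ** beta
    return K, beta

K, beta = fit_heaps(100_000, 50_000, 500_000, 150_000)
print(f"K = {K:.1f}, beta = {beta:.3f}")                         # K ~ 19.3, beta ~ 0.683
print(f"V(1,000,000)     = {K * 1_000_000 ** beta:,.0f}")        # ~ 241,000 unique words
print(f"V(1,000,000,000) = {K * 1_000_000_000 ** beta:,.0f}")    # ~ 27 million unique words
```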
Text Operations
• Not all words in a document are equally significant for representing the contents/meaning of the document
  – Some words carry more meaning than others
  – Nouns are the most representative of a document's content
• Therefore, we need to preprocess the text of the documents in a collection before using it as a source of index terms
• Tokenization issues
  – numbers, hyphens, punctuation marks, apostrophes …
28
Issues in Tokenization
• One word or multiple: how to handle special cases involving hyphens, apostrophes, punctuation marks, etc.? C++, C#, URLs, e-mail, …
29
Cont. …
• Two words may be connected by hyphens
  – Should two words connected by a hyphen be taken as one word or two? Should a hyphenated sequence be broken up into two tokens?
30
Cont. …
• Two words may be connected by punctuation marks
  – Punctuation marks: remove entirely unless significant, e.g. program code: x.exe vs. xexe. What about Kebede’s, www.command.com?
• Two words (a phrase) may be separated by a space
  – E.g. Addis Ababa, San Francisco, Los Angeles
31
Issues in Tokenization
• Numbers: are numbers/digits words and used as index terms?
– dates (3/12/91 vs. Mar. 12, 1991);
– phone numbers (+251923415--)
– IP addresses (100.2.86.144)
  – Numbers alone (like 1910, 1999) are not good index terms, but “510 B.C.” is unique. Generally, don’t index numbers as text, even though they can be very useful (one possible policy is sketched below).
• What about the case of letters (e.g. Data vs. data vs. DATA)?
  – Case is usually not important, so there is a need to convert everything to upper or lower case. Which one do human beings mostly use?
34
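One possible tokenization policy for these cases, as a sketch (the specific choices below are assumptions, not the only reasonable answer): lowercase everything, keep internal hyphens and apostrophes, and optionally drop tokens that are pure numbers.

```python
import re

# Keep letters/digits, allowing internal hyphens and apostrophes (one policy among many).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def tokenize(text, index_numbers=False):
    tokens = TOKEN_RE.findall(text.lower())
    if not index_numbers:
        # Drop tokens that consist only of digits (dates, phone numbers, IP fragments).
        tokens = [t for t in tokens if not t.replace("-", "").replace("'", "").isdigit()]
    return tokens

print(tokenize("Addis Ababa's anti-discriminatory law of 3/12/91, see www.command.com"))
# ['addis', "ababa's", 'anti-discriminatory', 'law', 'of', 'see', 'www', 'command', 'com']
```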
2. Stop-word Removal
• Stopwords: words that we ignore because we expect them not to be useful in distinguishing between relevant and non-relevant documents for any query
• A stopword is a term that is discarded from the document representation
• Stopwords are extremely common words across document collections that have no discriminatory power
• Assumption: stopwords are unimportant because they are frequent in every document
  – They may occur in 80% of the documents in a collection.
  – They appear to be of little value in helping select documents matching a user need, and so should be filtered out as potential index terms
35
Stopword Removal cont. …
• Stopwords are typically function words:
  – Examples of stopwords are articles, prepositions, conjunctions, etc.:
    • articles (a, an, the); pronouns (I, he, she, it, their, his)
    • some prepositions (on, of, in, about, besides, against)
    • conjunctions/connectors (and, but, for, nor, or, so, yet), verbs (is, are, was, were)
    • adverbs (here, there, out, because, soon, after) and
    • adjectives (all, any, each, every, few, many, some) can also be treated as stopwords
• Stopwords are language dependent
36
Why Stopword Removal?
• Intuition:
  – Stopwords have little semantic content; it is typical to remove such high-frequency words
  – Stopwords take up about 50% of the text; hence, document size reduces by 30–50% after their removal
• Smaller indices for information retrieval
  – Better compression of indices: the 30 most common words account for about 30% of the tokens in written text
• With the removal of stopwords, we can get a better approximation of term importance for text classification, text categorization, text summarization, etc.
37
How to detect a stopword?
38
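The slide leaves this question open. One common answer, sketched below as an assumption rather than the slides' method, is to derive a stop list from document frequency using the ~80% heuristic mentioned earlier, where `index` maps each term to the set of documents containing it (as in the indexing sketch above).

```python
def detect_stopwords(index, num_docs, df_threshold=0.80):
    """Treat any term that occurs in more than df_threshold of all documents as a stopword
    (the 80% document-frequency heuristic from the previous slide)."""
    return {term
            for term, postings in index.items()
            if len(postings) / num_docs > df_threshold}

# Usage with the build_index() sketch shown earlier:
# index = build_index(docs)
# stopwords = detect_stopwords(index, num_docs=len(docs))
```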
Trends in Stopwords
• Stopword elimination used to be standard in older IR systems, but the trend is away from doing this nowadays.
• Most web search engines index stopwords:
  – Good query optimization techniques mean you pay little extra at query time for including stopwords.
– You need stopwords for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
  – Elimination of stopwords might reduce recall (e.g. “To be or not to be”: all terms eliminated except “be”, resulting in no retrieval or irrelevant retrieval)
• Therefore, stopword handling still needs further improvement.
39
3. Normalization
• Normalization is canonicalizing tokens (words or terms) so that matches occur despite superficial differences in the character sequences of the tokens
  – Need to “normalize” terms in the indexed text as well as query terms into the same form
  – Example: we want to match U.S.A. and USA, by deleting periods in a term (see the sketch below)
• Case folding: often best to lowercase everything, since users will use lowercase regardless of ‘correct’ capitalization…
  – Republican vs. republican
  – Fasil vs. fasil vs. FASIL
  – Anti-discriminatory vs. anti-discriminatory
  – Car vs. Automobile?
40
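A minimal normalization sketch for the U.S.A./USA example and case folding (an illustration, not a complete normalizer):

```python
def normalize(term):
    """Delete periods and case-fold so superficially different tokens match."""
    return term.replace(".", "").lower()

assert normalize("U.S.A.") == normalize("USA") == "usa"
assert normalize("Republican") == normalize("republican")
```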
Normalization issues
• Good for
  – Allows instances of Automobile at the beginning of a sentence to match a query for automobile
  – Helps a search engine when most users type ferrari while they are interested in a Ferrari car
• Bad for
– Proper names vs. common nouns
• E.g. General Motors, Associated Press, Kebede…
• Solution:
– lowercase only words at the beginning of the sentence
• In IR, lowercasing is most practical because of the way
users issue their queries
41
4. Stemming/Morphological analysis
• Inflectional morphology: varies the form of words in order to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
• Derivational morphology: makes new words from old ones. E.g. creation is formed from create, but they are two separate words. Similarly, destruction → destroy.
42
Stemming/morphological analysis
• Basic question: words occur in different forms. Do we
want to treat different forms as different index terms?
• Conflation: treating different (inflectional and derivational) variants as the same index term
• What are we trying to achieve by conflating morphological variants?
• Goal: help the system ignore unimportant variations of language usage.
43
Stemming cont. …
• The final output from a conflation algorithm is a set of classes, one for each stem detected
  – A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or suffixes).
  – Example: ‘connect’ is the stem for {connected, connecting, connection, connections}
  – Thus, [automate, automatic, automation] all reduce to automat
• A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document
  – A document representative then becomes a list of class names, which are often referred to as the document's index terms/keywords
• Queries: queries are handled in the same way
44
Ways to implement stemming
There are basically two ways to implement stemming (a sketch of both follows below):
  – The first approach is to create a big dictionary that maps words to their stems
    • The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear
  – The second approach is to use a set of rules that extract stems from words
    • Techniques widely used include rule-based, statistical, machine-learning, and hybrid approaches
    • The advantages of this approach are that the code is typically small and it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes
  – But since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen
45
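A sketch of the two approaches; the dictionary entries and suffix list below are illustrative assumptions, far smaller than anything real.

```python
# Approach 1: dictionary lookup (exact, but expensive to build and maintain).
STEM_DICT = {
    "connected": "connect", "connecting": "connect",
    "connection": "connect", "connections": "connect",
}

# Approach 2: rule-based suffix stripping (small code, handles new words, makes mistakes).
SUFFIXES = ["ations", "ation", "ings", "ing", "ions", "ion", "ies", "ed", "es", "s"]

def stem(word):
    word = word.lower()
    if word in STEM_DICT:                 # try the dictionary first
        return STEM_DICT[word]
    for suffix in SUFFIXES:               # otherwise fall back to crude rules
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["connections", "automation", "cats", "running"]])
# ['connect', 'autom', 'cat', 'runn']  (note the occasional mistakes)
```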
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a word, leaving its stem
  – Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when users ask for a web page that contains the word connect.
• In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right
  – Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html
46
Porter stemmer
• Most common algorithm for stemming English words to
their common grammatical root
• It is a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used:
  – SSES → SS      caresses → caress
  – IES → I        ponies → poni
  – SS → SS        caress → caress
  – S → (null)     cats → cat
  – EMENT → (null) replacement → replac
                   cement → cement (unchanged, because removing EMENT would leave too short a stem)
47
Porter stemmer
• The Porter stemmer works in steps.
  – Step 1a gets rid of plurals (-s and -es), while
  – step 1b removes -ed and -ing.
  e.g.
    agreed -> agree      disabled -> disable
    matting -> mat       mating -> mate
    meeting -> meet      milling -> mill
    messing -> mess      meetings -> meet
    feed -> feed
48
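These rules are implemented, for example, in NLTK's PorterStemmer (assuming the nltk package is available); a quick check against the slide's examples:

```python
from nltk.stem import PorterStemmer   # pip install nltk

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "cats", "replacement", "cement",
             "agreed", "disabled", "matting", "mating", "meetings", "feed"]:
    print(f"{word:<12} -> {stemmer.stem(word)}")
```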
Stemming: challenges
• May produce unusual stems that are not English words:
  – e.g. removing ‘UAL’ from FACTUAL and EQUAL gives FACT and EQ
49
5. Thesaurus Construction
• A thesaurus demonstrates inter-term relationships. It is like a book that lists words in groups of synonyms and related concepts.
• Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts (for example "broader" and "related") are made explicit
51
Thesaurus Construction
Example: a thesaurus built to assist IR when searching for cars and vehicles:
Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport
52
More Example
Example: a thesaurus built to assist IR in the field of Information Systems:
TERM: natural languages
– UF natural language processing (UF = used for)
– BT languages (BT = broader term)
– TT languages (TT = top term)
– RT artificial intelligence (RT = related term/s)
     computational linguistics
     formal languages
     query languages
     speech recognition
53
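One simple way to represent such an entry and use it for query expansion, as a sketch (the relationship codes follow the slides; the data structure itself is an assumption):

```python
# Thesaurus entries keyed by preferred term; relationship codes as on the slides.
THESAURUS = {
    "motor vehicles": {
        "UF": ["automobiles", "cars", "trucks"],       # used for (non-preferred synonyms)
        "BT": ["vehicles"],                            # broader term
        "RT": ["road engineering", "road transport"],  # related terms
    },
}

def expand_query(term):
    """Add UF and RT entries to a query term when the term appears in the thesaurus."""
    entry = THESAURUS.get(term.lower(), {})
    return [term.lower()] + entry.get("UF", []) + entry.get("RT", [])

print(expand_query("Motor vehicles"))
# ['motor vehicles', 'automobiles', 'cars', 'trucks', 'road engineering', 'road transport']
```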
Language-specificity
• Many of the above features embody transformations
that are:
– Language-specific and
– Often, application-specific
• These are “plug-in” addenda to the indexing
process
• Both open source and commercial plug-ins are
available for handling these.
54
Index Term Selection
• The index language is the language used to describe documents and requests
• The elements of the index language are index terms, which may be derived from the text of the document to be described, or may be arrived at independently.
  – If a full-text representation of the text is adopted, then all words in the text are used as index terms (= full-text indexing)
  – Otherwise, we need to select the words to be used as index terms, reducing the size of the index file, which is basic to designing an efficient IR search system
55
• The end
56