
Information Storage and Retrieval (ISR)
Chapter 2

Document Representation and Text Operations
1
Document representation
• IR system/search engine does not scan each document to see if it satisfies the query
• It uses an index to quickly locate the relevant documents
• Index: a list of concepts with pointers to documents that discuss (represent) them
– What goes in the index is very important
• Document representation: deciding what concepts should go in the index

2
Document representation cont. …
Two options:
1. Controlled vocabulary – a set of manually constructed concepts that describe the major topics covered in the collection
2. Free-text indexing – the set of individual terms that occur in the collection

3
Document representation cont. …
1. Controlled Vocabulary: a set of well-defined concepts
– Assigned to documents by humans (or automatically)
• E.g. subject headings, keywords, etc.
– May include parent-child relations between concepts
• E.g. Computers: Software/Hardware: Information Retrieval
– Facilitate non-query-based browsing and exploration, because users can browse the parent-child relationships

4
Controlled Vocabulary: Advantages
• Concepts do not need to appear explicitly in the text
• Relationships between concepts facilitate non-query-based navigation and exploration
• Developed by experts who know the data and the user
• Represent the concepts/relationships that users (presumably) care the most about
• Describe the concepts that are most central to the document
• Concepts are unambiguous and recognizable (necessary for annotators and good for users)

5
Controlled Vocabulary: Disadvantages
• Time consuming
• Users must know the concepts in the index
• Labor intensive

6
2. Free Text Indexing
• Represent documents using terms within the document
• Which terms? Only the most descriptive terms? Only the unambiguous ones? All of them?
– Usually, all of them (a.k.a. full-text indexing)
• The user will use term combinations to express higher-level concepts
• Query terms will hopefully disambiguate each other (e.g., “volkswagen golf”)
• The search engine will determine which terms are important

7
How are the texts handled?
• What happens if you take the words exactly as they appear in the original text?
• What about punctuation, capitalization, etc.?
• What about spelling errors?
• What about plural vs. singular forms of words?
• What about cases and declension (the variation in the form of a noun, pronoun, or adjective by which its grammatical case is expressed) in non-English languages?
• What about non-Roman alphabets?

8
Free Text Indexing: Steps
1. Mark-up removal
2. Normalization – e.g. down-casing
– Information and information
– Retrieval and RETRIEVAL
– US and us – down-casing can change the meaning of words
3. Tokenization – splitting text into words (based on sequences of non-alphanumeric characters)
– Problematic cases: ph.d. = ph d, isn’t = isn t
4. Stop word removal
5. Do steps 1–4 for every document in the collection and
6. Create an index using the union of all remaining terms (a minimal sketch of these steps follows below)

9
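The following is a minimal Python sketch of steps 1–4 plus the union of terms in steps 5–6. The regular expressions, the tiny stop list and the helper names are illustrative assumptions, not a particular system's implementation.

```python
# A minimal sketch of free-text indexing steps 1-4 (illustrative, simplified).
import re

STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "is"}   # tiny illustrative stop list

def preprocess(document: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", document)       # 1. mark-up removal (strip HTML-like tags)
    text = text.lower()                            # 2. normalization (down-casing)
    tokens = re.split(r"[^a-z0-9]+", text)         # 3. tokenization on non-alphanumeric runs
    return [t for t in tokens if t and t not in STOPWORDS]   # 4. stop word removal

def build_vocabulary(collection: list[str]) -> set[str]:
    # 5-6. apply steps 1-4 to every document and take the union of the remaining terms
    vocab: set[str] = set()
    for doc in collection:
        vocab.update(preprocess(doc))
    return vocab

print(build_vocabulary(["<p>Information Retrieval is fun.</p>", "The index of a corpus"]))
```

On the two toy documents this prints the vocabulary {'information', 'retrieval', 'fun', 'index', 'corpus'}.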
Controlled Vs. Free Text Indexing

                         Cost of assigning    Ambiguity of         Detail of
                         index terms          index terms          representation
Controlled vocabulary    High                 Not ambiguous        Can’t represent arbitrary detail
Free-text indexing       Low                  Can be ambiguous     Any level of detail

• Both are effective and used often
• We will focus on free-text indexing in this course
– cheap and easy
– most search engines use it
10
Free/Full Text Indexing
• Our goal is to describe content using content
• Are all words equally descriptive?
• What are the most descriptive words?
• How might a computer identify these?
• We know that language use is varied
– There are many ways to convey the same information (which makes IR difficult)
• But, are there statistical properties of word usage that are predictable? Across languages? Across modalities? Across genres (literature)?

11
Statistical Properties of words in a Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
• Such properties of a text collection greatly affect the performance of an IR system & can be used to select suitable term weights & other aspects of the system
• There are three well-known researchers who defined statistical properties of words in a text:
1. Zipf’s Law: models word distribution in a text corpus
2. Luhn’s idea: measures word significance
3. Heaps’ Law: shows how vocabulary size grows with the growth of the corpus size

12
1. Zipf's Law: Word Distribution/Frequency
• A few words are very common.
– The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
• Most words are very rare.
– Half the words in a corpus appear only once; such words are called hapax legomena (Greek for “read only once”)

13
Zipf's Law: Word distribution cont. ...
• Zipf's Law – named after the Harvard linguistics professor George Kingsley Zipf (1902–1950) – attempts to capture the distribution of the frequencies (i.e., number of occurrences) of the words within a text.
• For all the words in a collection of documents, for each word w:
f : the frequency of w
r : the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: Zipf’s rank-frequency distribution – sorted word frequencies f plotted against rank r, according to Zipf’s law; word w has rank r and frequency f]
14
Word distribution: Zipf's Law
• Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf.

• If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation:
    f ∝ 1/r, i.e. r * f = c
– Different collections have different constants c.

• The table shows the most frequently occurring words from a 336,310-document corpus containing 125,720,891 total words, of which 508,209 are unique words.
15
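As a quick empirical check of the relation r * f ≈ c, the sketch below (an illustration, not from the slides) counts word frequencies in an arbitrary plain-text corpus, sorts them by frequency, and prints rank, frequency and their product for the top-ranked words; the file name corpus.txt and the simple alphabetic tokenizer are assumptions.

```python
# Empirical check of Zipf's relation r * f ≈ c on a plain-text corpus.
from collections import Counter
import re

def zipf_table(text: str, top: int = 10) -> None:
    words = re.findall(r"[a-z]+", text.lower())    # simple alphabetic tokenizer (an assumption)
    counts = Counter(words)
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>4}  {word:<15} f={freq:<10} r*f={rank * freq}")

# assumes a local file corpus.txt; a roughly constant r*f column indicates Zipf-like behaviour
zipf_table(open("corpus.txt", encoding="utf-8").read())
```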
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper cut-
off). Used by almost all systems.
• Significant words: Take words in between the most
frequent (upper cut-off) and least frequent words
(lower cut-off)
• Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.

16
Zipf's Law Impact
• Zipf’s Law Impact on IR

– Good News: Stopwords will account for a large fraction of text, so eliminating them greatly reduces inverted-index storage costs.

– Bad News: For most words, gathering sufficient data for meaningful statistical analysis (e.g. for correlation analysis for query expansion) is difficult since they are extremely rare.

17
2. Word significance: Luhn’s Ideas
• Luhn’s idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance

• Luhn suggested that both extremely common and extremely uncommon words were not very useful for indexing

• For this, Luhn specifies two cutoff points: an upper and a lower cutoff, based on which non-significant words are excluded
– The words exceeding the upper cutoff were considered to be common
– The words below the lower cutoff were considered to be rare
– Hence they do not contribute significantly to the content of the text
– The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cutoffs

• Let f be the frequency of occurrence of words in a text, and r their rank in decreasing order of word frequency; then a plot relating f & r yields the following curve
18
Luhn’s Ideas

Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for document representation & indexing
19
3. Heaps’ Law
• As the corpus grows, the number of new terms
will increase dramatically at first, but then will
increase at a slower rate
• Nevertheless, as the corpus grows, new terms will always be found (even if the corpus becomes huge)
– there is no end to vocabulary growth
– invented words, proper nouns (people, products),
misspellings, email addresses, etc.

20
Vocabulary Growth: Heaps’ Law
• How does the size of the overall vocabulary (number of
unique words) grow with the size of the corpus?
– This determines how the size of the inverted index will scale with
the size of the corpus.
• Heaps’ law estimates the size of the vocabulary in a given corpus
– The vocabulary size grows as K·n^β, where β is a constant between 0 and 1.
– If V is the size of the vocabulary and n is the length of the corpus in words, Heaps’ law provides the following equation:
    V = K · n^β
• where the constants are typically:
– K ≈ 10–100
– β ≈ 0.4–0.6 (approximately square-root growth)
21
Heaps’ distribution
• Distribution of the size of the vocabulary vs. the total number of terms extracted from a text corpus

22
Example: Heaps’ Law
• We want to estimate the size of the vocabulary for a corpus of 1,000,000 words
• Assume that, based on statistical analysis of smaller corpora:
– A corpus with 100,000 words contains 50,000 unique words; and
– A corpus with 500,000 words contains 150,000 unique words
• Estimate the vocabulary size for the 1,000,000-word corpus (a worked sketch follows below)
– What about for a corpus of 1,000,000,000 words?
23
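One way to work the example is to fit the two constants of Heaps' law to the two sample corpora and then extrapolate. The sketch below is an illustration of that approach, not part of the slides.

```python
# Fit Heaps' law V = K * n**beta to the two sample points, then extrapolate.
import math

n1, v1 = 100_000, 50_000        # 100,000-word corpus, 50,000 unique words
n2, v2 = 500_000, 150_000       # 500,000-word corpus, 150,000 unique words

beta = math.log(v2 / v1) / math.log(n2 / n1)   # dividing the two equations eliminates K
K = v1 / n1 ** beta

def vocab(n: int) -> int:
    return round(K * n ** beta)

print(f"beta ≈ {beta:.3f}, K ≈ {K:.1f}")             # beta ≈ 0.68 (a bit above the typical 0.4-0.6)
print("V(1,000,000)      ≈", vocab(1_000_000))       # roughly 241,000 unique words
print("V(1,000,000,000)  ≈", vocab(1_000_000_000))   # roughly 27 million unique words
```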
Text Operations
• Not all words in a document are equally significant in representing the contents/meanings of the document
– Some words carry more meaning than others
– Nouns are the most representative of a document’s content
• Therefore, we need to preprocess the text of the documents in a collection to be used as a source of index terms

• Using the set of all words in a collection to index documents creates too much noise for the retrieval task
24
Text Operations
• Preprocessing is the process of controlling the size of the vocabulary, or the number of distinct words used as index terms
– Preprocessing will lead to an improvement in information retrieval performance
• However, some search engines on the Web omit preprocessing
– Every word in the document is an index term

• Text operations are the process of transforming the document text into its logical representation
25
Text Operations
• Five (5) main text operations for selecting index terms, i.e. for choosing words/stems (or groups of words) to be used as indexing terms:
1. Tokenization of the text: generate a set of words from the text collection
2. Elimination of stop words – filter out words which are not useful in the retrieval process
3. Normalization – bringing words to one form – e.g. down-casing
4. Stemming words – remove affixes (prefixes and suffixes) and group together word variants with similar meaning
5. Construction of term categorization structures such as a thesaurus, to capture relationships that allow expanding the original query with related terms
26
1. Tokenization
 Tokenization is the process of breaking up a piece of text into many pieces, such as sentences and words. It works by separating words using spaces and punctuation.

– It is the process of demarcating and possibly classifying sections of a string of input characters into words
– For example,
The quick brown fox jumps over the lazy dog
• Objective – identify words in the text
– What does “word” mean?
• Is it a sequence of characters, numbers, or alphanumeric strings?
– How do we identify the set of words that exist in a text document?
27
Tokenization Cont. …
• Tokenization is one of the steps used to convert the text of the documents into a sequence of words, w1, w2, …, wn, to be adopted as index terms

• Tokenization Issues
– numbers, hyphens, punctuation marks, apostrophes, …

28
Issues in Tokenization
• One word or multiple? How to handle special cases involving hyphens, apostrophes, punctuation marks, etc.? C++, C#, URLs, e-mail, …

– Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
– However, frequently they are not

29
Cont. …
• Two words may be connected by hyphens
– Can two words connected by hyphens be taken as one word or two? Should a hyphenated sequence be broken up into two tokens?

• In most cases, break up the hyphenated words (e.g. state-of-the-art → state of the art), but some words, e.g. MS-DOS, are unique words that require hyphens

30
Cont. …
• Two words may be connected by punctuation marks
– Punctuation marks: remove totally unless significant, e.g. in program code: x.exe and xexe. What about Kebede’s, www.command.com?
• Two words (a phrase) may be separated by a space
– E.g. Addis Ababa, San Francisco, Los Angeles

• Two words may be written in different ways
– lowercase, lower-case, lower case? data base, database, data-base?

31
Issues in Tokenization
• Numbers: are numbers/digits words, and should they be used as index terms?
– dates (3/12/91 vs. Mar. 12, 1991);
– phone numbers (+251923415--)
– IP addresses (100.2.86.144)
– Numbers are not good index terms (like 1910, 1999); but 510 B.C. is unique.
Generally, don’t index numbers as text, though they can be very useful.
• What about the case of letters (e.g. Data or data or DATA)?
– Case is usually not important, and there is a need to convert everything to upper or lower case. Which one do human beings mostly use?

• The simplest approach is to ignore all numbers and punctuation marks (period, colon, comma, brackets, semi-colon, apostrophe, …) & use only case-insensitive unbroken strings of alphabetic characters as words.
– Will often index “meta-data”, including creation date, format, etc. separately
• Issues of tokenization are language-specific
– Requires the language to be known
32
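A minimal sketch of this simplest approach (the regular expression is an assumption about one possible implementation, not a particular engine's tokenizer):

```python
# Keep only case-insensitive, unbroken strings of alphabetic characters as words.
import re

def simple_tokens(text: str) -> list[str]:
    # lower-case everything, then keep maximal runs of alphabetic characters
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokens("On 3/12/91 Dr. O'Neill paid 510 B.C. coins, didn't he?"))
# -> ['on', 'dr', 'o', 'neill', 'paid', 'b', 'c', 'coins', 'didn', 't', 'he']
```

Note how the numbers disappear and apostrophes split tokens (didn't → didn, t), which illustrates the problematic cases mentioned earlier.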
Tokenization
• Analyze text into a sequence of discrete tokens (words)
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing)
 Friends
 Romans
 and
 Countrymen
• Each such token is now a candidate for an index entry, after further processing
• But what are valid tokens to emit?
33
Exercise: Tokenization
• The cat slept peacefully in the living room. It’s
a very old cat.

• The instructor (Dr. O’Neill) thinks that the boys’ stories about Chile’s capital aren’t amusing.

34
2. Stop-word Removal
• Stopwords: words that we ignore because we expect them not to be useful in distinguishing between relevant/non-relevant documents for any query
• A stopword is a term that is discarded from the document representation
• Stopwords are extremely common words across document collections that have no discriminatory power
• Assumption: stopwords are unimportant because they are frequent in every document
– They may occur in 80% of the documents in a collection.
– They would appear to be of little value in helping select documents matching a user need, and need to be filtered out as potential index terms

35
Stopword Removal cont. …
• Stopwords are typically function words:
– Examples of stopwords are articles, prepositions, conjunctions,
etc.:
• articles (a, an, the); pronouns: (I, he, she, it, their, his)
• Some prepositions (on, of, in, about, besides, against);
• conjunctions/ connectors (and, but, for, nor, or, so,
yet), verbs (is, are, was, were),
• adverbs (here, there, out, because, soon, after) and
• adjectives (all, any, each, every, few, many, some) can
also be treated as stopwords.
• Stopwords are language dependent
36
Why Stopword Removal?
• Intuition:
– Stopwords have little semantic content; it is typical to remove such high-frequency words
– Stopwords take up 50% of the text. Hence, removing them reduces document size by 30–50%
• Smaller indices for information retrieval
– Good compression techniques for indices: the 30 most common words account for 30% of the tokens in written text
• With the removal of stopwords, we can get a better approximation of term importance for text classification, text categorization, text summarization, etc.
37
How to detect a stopword?

• One method: Sort terms (in decreasing order) by document frequency (DF) and take the most frequent ones, based on a cutoff point

• Another method: Build a stop word list that contains a set of articles, pronouns, etc.
– Why do we need stop lists? With a stop list, we can compare terms against the list and exclude the commonest words from the index terms entirely.
– Can you identify common words in Amharic and build a stop list?

38
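A minimal sketch of the first method (document-frequency cut-off); the function name, the toy documents and the cut-off value are illustrative assumptions.

```python
# Detect stopwords by ranking terms by document frequency and taking the top ones.
from collections import Counter

def detect_stopwords(tokenized_docs: list[list[str]], cutoff: int = 10) -> set[str]:
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))                 # count each term once per document
    return {term for term, _ in df.most_common(cutoff)}

docs = [["the", "cat", "slept", "in", "the", "room"],
        ["the", "dog", "slept", "in", "a", "kennel"],
        ["a", "cat", "and", "a", "dog"]]
print(detect_stopwords(docs, cutoff=3))     # picks 3 of the highest-DF terms (ties broken arbitrarily)
```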
Trends in Stopwords
• Stopword elimination used to be standard in older IR sys-
tems. But the trend is away from doing this nowadays.
• Most web search engines index stopwords:
– Good query optimization techniques mean you pay little at query time for including stopwords.
– You need stopwords for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
– Elimination of stopwords might reduce recall (e.g. “To be or not to be” – all terms eliminated except “be” – no retrieval, or irrelevant retrieval)
• Therefore stopword handling still needs improvement.
39
3. Normalization
• Normalization is canonicalizing (normalizing) tokens (words or terms) so that matches occur despite superficial differences in the character sequences of the tokens
– Need to “normalize” terms in the indexed text as well as query terms into the same form
– Example: we want to match U.S.A. and USA, by deleting periods in a term
• Case folding: often best to lower-case everything, since users will use lowercase regardless of ‘correct’ capitalization…
– Republican vs. republican
– Fasil vs. fasil vs. FASIL
– Anti-discriminatory vs. anti-discriminatory
– Car vs. Automobile?
40
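A minimal sketch of the two normalization rules above (period deletion and case folding); a real normalizer would be more careful, so treat these rules as simplifying assumptions.

```python
# Simple normalization: delete periods inside tokens and case-fold everything.
def normalize(token: str) -> str:
    token = token.replace(".", "")   # U.S.A. -> USA
    return token.lower()             # case folding: Republican -> republican

for t in ["U.S.A.", "USA", "Republican", "FASIL", "Anti-discriminatory"]:
    print(t, "->", normalize(t))
# U.S.A. and USA both normalize to "usa", so a query with either form will match
```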
Normalization issues
• Good for
– Allow instances of Automobile at the beginning of a
sentence to match with a query of automobile
– Helps a search engine when most users type ferrari
while they are interested in a Ferrari car
• Bad for
– Proper names vs. common nouns
• E.g. General Motors, Associated Press, Kebede…
• Solution:
– lowercase only words at the beginning of the sentence
• In IR, lowercasing is most practical because of the way
users issue their queries
41
4. Stemming/Morphological analysis
 The final output from a conflation algorithm is a set of classes, one for each stem detected

• Stemming reduces tokens to their “root” form in order to recognize morphological variation
– The process involves removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to the same stem

– Stemming often removes the inflectional and derivational morphology of a word

• Inflectional morphology: varies the form of words in order to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.

• Derivational morphology: makes new words from old ones. E.g. creation is formed from create, but they are two separate words. Also, destruction → destroy.

• Compounding – combining words to form new ones, e.g. beefsteak

42
Stemming/morphological analysis
• Basic question: words occur in different forms. Do we
want to treat different forms as different index terms?
• Conflation: treating different (inflectional and derivational) variants as the same index term
• What are we trying to achieve by conflating morphological variants?
• Goal: help the system ignore unimportant variations of
language usage.

43
Stemming cont. …
• The final output from a conflation algorithm is a set of classes, one for each stem detected
– A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or suffixes).
– Example: ‘connect’ is the stem for {connected, connecting, connection, connections}
– Thus, [automate, automatic, automation] all reduce to → automat
• A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document
– A document representative then becomes a list of class names, which are often referred to as the document’s index terms/keywords
• Queries: queries are handled in the same way

44
Ways to implement stemming
There are basically two ways to implement stemming
– The first approach is to create a big dictionary that maps words to their stems
• The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear
– The second approach is to use a set of rules that extract stems from words
• Techniques widely used include: rule-based, statistical, machine learning or hybrid
• The advantages of this approach are that the code is typically small, & it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes
– But, since stemming is imperfectly defined anyway, occasional mistakes are tolerable, & the rule-based approach is the one that is generally chosen
45
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a word, leaving its stem
– Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when users ask for a web page that contains the word connect.
• In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right
– Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html

46
Porter stemmer
• The most common algorithm for stemming English words to their common grammatical root
• It is a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used:
– SSES → SS    caresses → caress
– IES → I      ponies → poni
– SS → SS      caress → caress
– S → (null)   cats → cat

– EMENT → (null)
– replacement → replac
– cement → cement (the rule does not apply, since the remaining stem would be too short)

47
Porter stemmer
• The Porter stemmer works in steps.
– While step 1a gets rid of plurals (-s and -es),
– step 1b removes -ed or -ing.
e.g.
agreed → agree        disabled → disable
matting → mat         mating → mate
meeting → meet        milling → mill
messing → mess        meetings → meet
feed → feed

48
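For experimentation, the NLTK library ships an implementation of the Porter stemmer. The short sketch below (assuming the nltk package is installed, e.g. via pip install nltk; this is an external library, not the reference C implementation linked above) applies it to some of the example words.

```python
# Apply NLTK's Porter stemmer to a few of the example words from the slides.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["caresses", "ponies", "cats", "replacement", "cement",
         "connected", "connecting", "connection", "meetings"]
for word in words:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, ponies -> poni, replacement -> replac,
#      cement -> cement, connection -> connect
```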
Stemming: challenges
• May produce unusual stems that are not English
words:
– Removing ‘UAL’ from FACTUAL and EQUAL

• May conflate (reduce to the same token) words that are actually distinct.
– “computer”, “computational”, “computation” all reduced to the same token “comput”

• Note: recognize all morphological derivations.

49
5. Thesaurus Construction
• A thesaurus demonstrates inter-term relationships. It is like a book that lists words in groups of synonyms and related concepts.
• Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts (for example as "broader" and “related") are made explicit

• Full-text searching mostly cannot be accurate, since different authors may select different words to represent the same concept
– Problem: the same meaning can be expressed using different terms that are synonyms or related terms
– How can we achieve that, for the same meaning, identical terms are used in the index and in the query?
• A thesaurus contains terms and relationships between terms
– IR thesauri typically rely upon the use of symbols such as USE/UF (UF = used for), BT (broader term), TT (top term) and RT (related term) to demonstrate inter-term relationships
– e.g., car UF automobile, truck, bus, taxi, motor vehicle
– color UF colour, paint
50
Aim of Thesaurus
• A thesaurus tries to control the use of the vocabulary by showing a set of related words to handle synonyms

• The aims of a thesaurus are therefore:
– to provide a standard vocabulary for indexing and querying
• The thesaurus rewrites terms to form equivalence classes, and we index such equivalence classes
• When a document contains automobile, index it under car as well (usually, also vice-versa)
– to assist users with locating terms for proper query formulation: when the query contains automobile, look under car as well when expanding the query

51
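A minimal sketch of how UF ("used for") entries can drive indexing and query expansion; the toy dictionary and the function names are illustrative assumptions, not a standard thesaurus format.

```python
# Thesaurus-based query expansion using a toy UF ("used for") table.
UF = {  # preferred term -> the variants it is "used for"
    "car": ["automobile", "truck", "bus", "taxi", "motor vehicle"],
    "color": ["colour", "paint"],
}
# invert the table so every variant maps back to its preferred term
PREFERRED = {variant: pref for pref, variants in UF.items() for variant in variants}

def expand_query(terms: list[str]) -> list[str]:
    expanded = []
    for t in terms:
        pref = PREFERRED.get(t, t)          # rewrite variant -> preferred term
        expanded.append(pref)
        expanded.extend(UF.get(pref, []))   # also add the related variants
    return expanded

print(expand_query(["automobile", "colour"]))
# -> ['car', 'automobile', 'truck', 'bus', 'taxi', 'motor vehicle', 'color', 'colour', 'paint']
```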
Thesaurus Construction
Example: thesaurus built to assist IR for searching
cars and vehicles :
Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport

52
More Example
Example: thesaurus built to assist IR in the field of Information Systems:
TERM: natural languages
– UF natural language processing (UF=used for NLP)
– BT languages (BT=broader term is languages)
– TT languages (TT = top term is languages)
– RT artificial intelligence (RT=related term/s)
computational linguistic
formal languages
query languages
speech recognition

53
Language-specificity
• Many of the above features embody transformations
that are:
– Language-specific and
– Often, application-specific
• These are “plug-in” addenda to the indexing
process
• Both open source and commercial plug-ins are
available for handling these.

54
Index Term Selection
• The index language is the language used to describe documents and requests
• Elements of the index language are index terms, which may be derived from the text of the document to be described, or may be arrived at independently.
– If a full-text representation of the text is adopted, then all words in the text are used as index terms = full-text indexing
– Otherwise, we need to select the words to be used as index terms in order to reduce the size of the index file, which is basic to designing an efficient searching IR system

55
• The end

56
