
Chapter 2: Text / Document Operations

Information Storage and Retrieval (Baeza-Yates & Ribeiro-Neto, 2022)


Statistical Properties of Text

• How is the frequency of different words distributed?
• How fast does the vocabulary size grow with the size of a corpus?
• Three well-known laws describe the statistical properties of words in text:
  – Zipf's Law: models the word frequency distribution in a text corpus
  – Luhn's idea: measures word significance
  – Heaps' Law: shows how vocabulary size grows with corpus size
Statistical Properties of Text…

• Such properties of a text collection greatly affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.


Word Distribution

• A few words are very common.
  – The two most frequent words (e.g. "the", "of") can account for about 10% of word occurrences.
• Most words are very rare.
  – About half the words in a corpus appear only once ("read only once", i.e. hapax legomena).




Word distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950), attempts to capture the distribution of the frequencies (i.e., numbers of occurrences) of the words within a text.
• For all the words in a collection of documents, for each word w:
  – f : the frequency with which w appears
  – r : the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
Word distribution: Zipf's Law...

• Zipf’s distributions: Rank Frequency Distribution

• Distribution of sorted word frequencies, according to


Zipf’s law

w has rank r &


frequency f

r
Information Storage and Retrieval 2.7 Baeza-Yates, Berthier Ribeiro-Neto, 2022
Word distribution: Zipf's Law...

• Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf.
• If the words w in a collection are ranked by their frequency f, with rank r, they roughly fit the relation:
  r * f = c
• Different collections have different constants c (see the sketch below).
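As a small illustration (not part of the original slides), the sketch below counts word frequencies in a plain-text corpus and prints the rank-frequency product r*f, which Zipf's law predicts to stay roughly constant; the file name corpus.txt is a placeholder.

```python
from collections import Counter
import re

def zipf_table(text, top_n=10):
    """Rank words by frequency and report the rank-frequency product r*f,
    which Zipf's law predicts to be roughly constant across ranks."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1)]

# usage: any plain-text corpus will do; the products stabilise better on large corpora
sample = open("corpus.txt", encoding="utf-8").read()
for rank, word, freq, product in zipf_table(sample):
    print(f"{rank:>4}  {word:<12} f={freq:<8} r*f={product}")
```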



Word distribution: Zipf's Law...

• The table shows the most frequently occurring words from a 336,310-document corpus containing 125,720,891 total words, of which 508,209 are unique.
More Example: Zipf’s Law

• Illustration of the rank-frequency law. Let the total number of word occurrences in the sample be N = 1,000,000.

  Rank (R)   Term   Frequency (F)   R*(F/N)
      1      the        69,971       0.070
      2      of         36,411       0.073
      3      and        28,852       0.086
      4      to         26,149       0.104
      5      a          23,237       0.116
      6      in         21,341       0.128
      7      that       10,595       0.074
      8      is         10,099       0.081
      9      was         9,816       0.088
     10      he          9,543       0.095


Zipf’s law: modeling word distribution

• Given that the most frequent word occurs f1 times, the collection frequency of the i-th most common term is proportional to 1/i.
• If the most frequent term occurs f1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many, and so on:

  fi ∝ 1/i
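As a quick check (an illustrative computation, not from the slides), the prediction fi ≈ f1/i can be compared against the table above, using f1 = 69,971 for "the":

```python
# Predicted frequencies under fi = f1 / i, with f1 = 69,971 ("the") from the table above
f1 = 69_971
observed = {2: 36_411, 3: 28_852, 4: 26_149, 5: 23_237}   # ranks 2-5 from the table

for i, actual in observed.items():
    predicted = f1 / i
    print(f"rank {i}: predicted {predicted:,.0f}  vs  observed {actual:,}")
# rank 2: predicted 34,986 vs observed 36,411 -- reasonably close, as Zipf's law suggests
```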



Methods that Build on Zipf's Law

• Stop lists: ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words: take the words in between the most frequent (upper cut-off) and least frequent (lower cut-off) words.
• Term weighting: give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods (see the sketch below).
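One common frequency-based weighting is inverse document frequency, idf = log(N/df); the sketch below is an illustrative example of such weighting, not a scheme prescribed by the slides.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Weight each term by log(N/df): terms occurring in many documents
    (e.g. stop words) get low weights, rare terms get high weights."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))        # document frequency, not raw counts
    return {term: math.log(n_docs / count) for term, count in df.items()}

docs = ["the cat sat on the mat",
        "the dog barked at the cat",
        "the bird sang"]
weights = idf_weights(docs)
print(weights["the"], weights["bird"])   # "the" (in all docs) gets 0.0; "bird" gets log(3)
```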



Word significance: Luhn’s Ideas

• Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely uncommon words are not very useful for indexing.
• For this, Luhn specified two cut-off points, an upper and a lower cut-off, based on which non-significant words are excluded:


Word significance: Luhn’s Ideas

• Words exceeding the upper cut-off were considered to be common, and words below the lower cut-off were considered to be rare; hence neither group contributes significantly to the content of the text.
• The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cut-offs.
• Let f be the frequency of occurrence of words in a text and r their rank in decreasing order of word frequency; a plot relating f and r yields the curve on the next slide (a selection sketch follows below).
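A minimal sketch of Luhn-style selection: keep only the terms whose collection frequency falls between a lower and an upper cut-off. The particular cut-off values below are illustrative assumptions, not values given in the slides.

```python
from collections import Counter
import re

def significant_words(text, lower_cutoff=2, upper_fraction=0.01):
    """Keep words whose frequency lies between a lower cut-off (drops rare words)
    and an upper cut-off (drops the most common words), following Luhn's idea.
    The cut-off values here are tuning choices, shown only for illustration."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    upper_cutoff = upper_fraction * len(words)      # e.g. > 1% of all tokens = too common
    return {w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}
```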
Luhn’s Ideas

Luhn (1958) suggested that both extremely common and extremely uncommon words are not very useful for document representation and indexing.

[Figure: word frequency f plotted against rank r, with the upper and lower cut-offs marked and the significant words lying between them]
Vocabulary Growth: Heaps’ Law

• How does the size of the overall vocabulary (the number of unique words) grow with the size of the corpus?
  – This determines how the size of the inverted index will scale with the size of the corpus.
• Heaps' law estimates the number of vocabulary terms in a given corpus.


Vocabulary Growth: Heaps’ Law

– The vocabulary size grows as O(n^β), where β is a constant between 0 and 1.
– If V is the size of the vocabulary and n is the length of the corpus in words, Heaps' law provides the following equation:

  V = K * n^β

• where typically the constants are:
  – K ≈ 10-100
  – β ≈ 0.4-0.6 (approximately a square root)


Heap’s distributions

• Distribution of size of the vocabulary vs. total number of


terms extracted from text corpus

Example: from 1,000,000,000 documents, there may be


1,000,000 distinct words. Can you agree?
Information Storage and Retrieval 2.18 Baeza-Yates, Berthier Ribeiro-Neto, 2022
Example: Heaps' Law

• Assume that statistical analysis of smaller corpora shows:
  – a corpus with 100,000 words contains 50,000 unique words, and
  – a corpus with 500,000 words contains 150,000 unique words.
• Estimate the vocabulary size for a 1,000,000-word corpus (a worked sketch follows below).
  – What about a corpus of 1,000,000,000 words?
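A worked sketch (not from the slides) that fits K and β of V = K * n^β to the two observations above and then extrapolates. Note that the β fitted from these particular numbers (about 0.68) comes out slightly above the typical 0.4-0.6 range.

```python
import math

# Fit V = K * n**beta to the two observations given above
n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

beta = math.log(v2 / v1) / math.log(n2 / n1)   # taking the ratio eliminates K
K = v1 / n1 ** beta

def vocab(n):
    return K * n ** beta

print(f"beta = {beta:.3f}, K = {K:.1f}")                   # beta ~ 0.683, K ~ 19.3
print(f"V(1,000,000)     = {vocab(1_000_000):,.0f}")       # roughly 240,000 unique words
print(f"V(1,000,000,000) = {vocab(1_000_000_000):,.0f}")   # roughly 27 million unique words
```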



Text Operations
• Not all words in a document are equally significant for representing its contents/meaning.
  – Some words carry more meaning than others.
  – Nouns are the most representative of a document's content.
• Therefore, the text of the documents in a collection needs to be preprocessed before the words are used as index terms.


Text Operations…
• Using the set of all words in a collection to index documents creates too much noise for the retrieval task.
  – Reducing noise means reducing the number of words that can be used to refer to a document.
• Text operation is the task of preprocessing text documents to control the size of the vocabulary, i.e. the number of distinct words used as index terms.


Text Operations…
• Preprocessing leads to an improvement in information retrieval performance.
• However, some Web search engines omit preprocessing:
  – every word in the document is an index term.
• Text operations are the transformations of text into its logical representation.


Text Operations…

• The main operations for selecting index terms, i.e. for choosing the words (or groups of words) to be used as index terms, are:
  – Lexical analysis/tokenization of the text: generate the set of words from the text collection.
  – Elimination of stop words: filter out words that are not useful in the retrieval process.
  – Stemming of words: remove affixes (prefixes and suffixes) and group together word variants with similar meaning.
  – Construction of term categorization structures, such as a thesaurus, to capture relationships among words and allow the expansion of the original query with related terms.
Generating Document Representatives
• Text processing system
  – Input: full text, abstract, or title.
  – Output: a document representative adequate for use in an automatic retrieval system.
• The document representative consists of a list of class names, each name representing a class of words occurring in the total input text.
• A document will be indexed by a name if one of its significant words occurs as a member of that class.


Generating Document Representatives

Document corpus (free text) → Tokenization → Stop-word removal → Stemming → Thesaurus → Index terms


Lexical Analysis/Tokenization of Text
• Tokenization is the step that converts the text of the documents into a sequence of words, w1, w2, ..., wn, to be adopted as index terms.
• It is the process of demarcating and possibly classifying sections of a string of input characters into words.
• For example (a naive sketch follows below):

  The quick brown fox jumps over the lazy dog
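A deliberately naive tokenization sketch, splitting on whitespace only; the issues listed on the next slides are exactly what this simple approach ignores.

```python
text = "The quick brown fox jumps over the lazy dog"
tokens = text.split()          # naive: split on whitespace only
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```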



Lexical Analysis/Tokenization of Text
• The objective of tokenization is to identify the words in the text.
  – What does "word" mean here? Is it any sequence of alphabetic, numeric, or alphanumeric characters?
  – How do we identify the set of words that exist in a text document?
• Tokenization issues: numbers, hyphens, punctuation marks, apostrophes, ...
Issues in Tokenization
• Two words may be connected by hyphens.
  – Should two words connected by hyphens or punctuation marks be taken as one word or as two? Should a hyphenated sequence be broken up into two tokens?
  – In most cases the hyphen is broken up (e.g. state-of-the-art → state of the art), but some words, e.g. MS-DOS or B-49, are unique terms that require their hyphens.


Issues in Tokenization
• Two words may be connected by punctuation marks.
  – Remove punctuation marks entirely unless they are significant, e.g. in program code: x.exe vs. xexe.
• Two words may be separated by a space.
  – E.g. Addis Ababa, San Francisco, Los Angeles.
• The same word may be written in different ways.
  – lowercase, lower-case, lower case?
  – data base, database, data-base?


Issues in Tokenization

• Numbers: are numbers/digits words, and should they be used as index terms?
  – dates (3/12/91 vs. Mar. 12, 1991)
  – phone numbers (+251923415005)
  – IP addresses (100.2.86.144)
• Generally, numbers are not indexed as text; most numbers are not good index terms (like 1910, 1999).


Issues in Tokenization

• What about the case of letters (e.g. Data vs. data vs. DATA)?
  – Case is usually not important, so everything is converted to upper or lower case. Which one is mostly used by people?
• The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens (see the sketch below).
• Issues of tokenization are language specific.
  – They require the language to be known.
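A sketch of that simplest approach, assuming plain English text:

```python
import re

def simple_tokens(text):
    """Case-insensitive tokens: lowercase everything and keep only unbroken
    runs of alphabetic characters (numbers and punctuation are ignored)."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokens("State-of-the-art IP 100.2.86.144, in 1999!"))
# ['state', 'of', 'the', 'art', 'ip', 'in']
```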



Tokenization
• Analyze text into a sequence of discrete tokens (words)
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (an instance of a sequence of
characters that are grouped together)
– Friends
– Romans
– and
– Countrymen
• Each such token is now a candidate for an index entry,
after further processing
Elimination of Stop-words
• Stop-words are extremely common words across document collections that have no discriminatory power.
  – They may occur in 80% of the documents in a collection.
  – Stop-words have little semantic content.
  – It is typical to remove such high-frequency words.
  – They appear to be of little value in helping to select documents matching a user need, and they need to be filtered out as potential index terms.


Elimination of Stop-words
• The following examples can be treated as stop-words:
  – articles (a, an, the)
  – pronouns (I, he, she, it, their, his)
  – prepositions (on, of, in, about, besides, against)
  – conjunctions (and, but, for, nor, or, so, yet)
  – verbs (is, are, was, were)
  – adverbs (here, there, out, because, soon, after)
  – adjectives (all, any, each, every, few, many, some)
• Stop-word removal is language dependent.


How to detect a stop-word?
• One method: sort terms in decreasing order of document frequency and take the most frequent ones.
  – In a collection about insurance practices, "insurance" would be a stop word.
• Another method: build a stop word list that contains a set of articles, pronouns, etc.
  – Why do we need stop lists? With a stop list, we can compare tokens against it and exclude the commonest words from the index terms entirely (see the sketch below).
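A sketch combining both methods, deriving a stop list from document frequency and from a hand-made word list, then filtering; the threshold and the word list below are illustrative choices.

```python
from collections import Counter

HAND_MADE_STOPS = {"a", "an", "the", "of", "in", "and", "is", "to"}   # illustrative list

def stop_words_by_df(tokenized_docs, top_k=2):
    """Method 1: treat the top_k terms by document frequency as stop words."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))                     # document frequency, not raw counts
    return {term for term, _ in df.most_common(top_k)}

def remove_stops(tokens, stop_list):
    return [t for t in tokens if t not in stop_list]

docs = [["the", "insurance", "policy", "of", "the", "firm"],
        ["insurance", "claims", "in", "the", "city"]]
stops = stop_words_by_df(docs) | HAND_MADE_STOPS
print(remove_stops(docs[0], stops))             # ['policy', 'firm']
```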



Stop words
• Stop-word elimination used to be standard in older IR systems.
• Most Web search engines index stop words:
  – good query optimization techniques mean you pay little at query time for including stop words;
  – you need stop-words for "relational" queries such as "flights to London";
  – eliminating stop-words might reduce recall (e.g. for "To be or not to be", everything except "be" is eliminated, giving no or irrelevant retrieval).
Normalization
• Normalization is canonicalizing tokens so that matches occur despite superficial differences in the character sequences.
  – Terms in the indexed text, as well as query terms, need to be "normalized" into the same form.
  – Example: we want to match U.S.A. and USA by deleting the periods in a term.
• Case folding: it is often best to lowercase everything, since users will use lowercase regardless of 'correct' capitalization: Fasil vs. fasil vs. FASIL (see the sketch below).
• Car vs. automobile?
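A minimal normalization sketch, assuming period deletion and case folding are the only rules applied:

```python
def normalize(token):
    """Canonicalize a token: drop periods (U.S.A. -> USA) and case-fold."""
    return token.replace(".", "").lower()

print(normalize("U.S.A."), normalize("USA"), normalize("Fasil"))
# 'usa' 'usa' 'fasil'  -- U.S.A. and USA now match
```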
Stemming/Morphological analysis
• Stemming reduces tokens to their "root" form in order to recognize morphological variation.
• The process involves removal of affixes (i.e. prefixes and suffixes), with the aim of reducing variants to the same stem.
• Stemming often removes the inflectional and derivational morphology of a word.


Stemming/Morphological analysis
• Inflectional morphology varies the form of words in order to express grammatical features, such as singular/plural or past/present tense.
  – E.g. boy → boys, cut → cutting.
• Derivational morphology makes new words from old ones. E.g. creation is formed from create, but they are two separate words; likewise destruction → destroy.
• Stemming is language dependent; correct stemming is language specific and can be complex.
Stemming
• The final output of a conflation algorithm is a set of classes, one for each stem detected.
• A stem is the portion of a word that is left after the removal of its affixes (i.e., prefixes and/or suffixes).
• Example: 'connect' is the stem of {connected, connecting, connection, connections}.
• Thus, [automate, automatic, automation] all reduce to the stem 'automat'.


Stemming…
• A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
• A document representative then becomes a list of class names, which are often referred to as the document's index terms/keywords.
• Queries are handled in the same way.


Ways to implement stemming

• There are basically two ways to implement stemming.
• The first approach is to create a big dictionary that maps words to their stems.
  – The advantage of this approach is that it works perfectly (as far as the stem of a word is defined).
  – The disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear.


Ways to implement stemming…
• The second approach is to use a set of rules that extract stems from words.
  – The advantages of this approach are that the code is typically small and that it can gracefully handle new words.
  – The disadvantage is that it occasionally makes mistakes.
• But since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a word, leaving its stem.
  – Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when users ask for a web page that contains the word connect.


Porter Stemmer
• In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from words; although it makes some mistakes, most common words seem to work out right.
  – Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html


Porter stemmer
• The Porter stemmer is the most common algorithm for stemming English words to their common grammatical root.
• It is a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used (a sketch follows below):
  – SSES → SS      caresses → caress
  – IES → I        ponies → poni
  – SS → SS        caress → caress
  – S → (null)     cats → cat
  – EMENT → (null) (delete the final -ement only if what remains is longer than 1 character)
                   replacement → replac, cement → cement
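A minimal sketch of just the suffix rules listed above (not the full Porter algorithm):

```python
def stem_step1a(word):
    """Plural rules from the slide: SSES -> SS, IES -> I, SS -> SS, S -> (null)."""
    if word.endswith("sses"):
        return word[:-2]           # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]           # ponies -> poni
    if word.endswith("ss"):
        return word                # caress -> caress
    if word.endswith("s"):
        return word[:-1]           # cats -> cat
    return word

def strip_ement(word):
    """Delete a final -ement only if more than one character remains."""
    if word.endswith("ement") and len(word) - 5 > 1:
        return word[:-5]           # replacement -> replac, but cement stays cement
    return word

for w in ["caresses", "ponies", "caress", "cats", "replacement", "cement"]:
    print(w, "->", strip_ement(stem_step1a(w)))
```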
Porter stemmer
• While step 1a gets rid of plurals, step 1b removes -ed or -ing, e.g.:
  agreed → agree        disabled → disable
  matting → mat         mating → mate
  meeting → meet        milling → mill
  messing → mess        meetings → meet
  feed → feed



Stemming: challenges
• Stemming may produce unusual stems that are not English words:
  – e.g. removing 'UAL' from FACTUAL and EQUAL.
• It may conflate (reduce to the same token) words that are actually distinct:
  – "computer", "computational" and "computation" are all reduced to the same token "comput".
• It may not recognize all morphological derivations.


Thesauri
• Full-text searching alone is often not accurate, since different authors may select different words to represent the same concept.
  – Problem: the same meaning can be expressed using different terms that are synonyms, homonyms or related terms.
  – How can we ensure that, for the same meaning, identical terms are used in the index and in the query?


Thesauri
• Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts are made explicit.
• A thesaurus contains terms and relationships between terms.
  – IR thesauri typically rely on symbols such as USE/UF (UF = used for), BT (broader term) and RT (related term) to express inter-term relationships.
  – e.g. car = automobile, truck, bus, taxi, motor vehicle
    color = colour, paint


Aim of Thesaurus
• A thesaurus tries to control the use of the vocabulary by showing a set of related words to handle synonyms and homonyms.
• The aims of a thesaurus are therefore:
  – to provide a standard vocabulary for indexing and searching;
  – to assist users in locating terms for proper query formulation: when the query contains automobile, look under car as well to expand the query (see the sketch below).
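A toy sketch of thesaurus-based query expansion; the dictionary entries below are illustrative, in the spirit of the examples on the following slides, not a real controlled vocabulary.

```python
# Illustrative thesaurus: each term maps to terms it may be expanded with
THESAURUS = {
    "automobile": ["car", "motor vehicle"],
    "car":        ["automobile", "motor vehicle"],
    "colour":     ["color", "paint"],
}

def expand_query(query_terms):
    """Add related/equivalent thesaurus terms to the original query terms."""
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(THESAURUS.get(term, []))
    return sorted(set(expanded))

print(expand_query(["automobile", "insurance"]))
# ['automobile', 'car', 'insurance', 'motor vehicle']
```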



Thesaurus Construction
Example: a thesaurus built to assist IR for searching cars and vehicles:

  Term: Motor vehicles
    UF:  Automobiles
         Cars
         Trucks
    BT:  Vehicles
    RT:  Road Engineering
         Road Transport


More Example
Example: a thesaurus built to assist IR in the field of computer science:

  TERM: natural languages
    UF:  natural language processing (UF = Used For)
    BT:  languages (BT = Broader Term)
    TT:  languages (TT = Top Term)
    RT:  artificial intelligence (RT = Related Terms)
         computational linguistics
         formal languages
         query languages, speech recognition
Language-specificity
• Many of the above features embody transformations that are:
  – language-specific, and
  – often application-specific.
• These are "plug-in" additions to the indexing process.
• Both open-source and commercial plug-ins are available for handling these.


Index Term Selection
• The index language is the language used to describe documents and requests.
• The elements of the index language are the index terms, which may be derived from the text of the document to be described, or arrived at independently.
  – If a full-text representation is adopted, then all words in the text are used as index terms (full-text indexing).
  – Alternatively, content-bearing words are selected to be used as index terms, reducing the size of the index file, which is basic to designing an efficient IR searching system.
