
Chapter Two

Text Operations

Statistical Properties of Text
 How is the frequency of different words distributed?

 How fast does vocabulary size grow with the size of a corpus?
◦ Such factors affect the performance of an IR system and can be used to select
suitable term weights and other aspects of the system.

 A few words are very common.


◦ The two most frequent words (e.g. “the”, “of”) can account for about 10% of word
occurrences.
Statistical Properties…
 Most words are very rare.
◦ Half the words in a corpus appear only once; such words are called
hapax legomena (Greek for “read only once”).
Sample Word Frequency Data

Word distribution: Zipf's Law
 Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
◦ attempts to capture the distribution of the frequencies
(number of occurrences) of the words within a text.

 Zipf's Law states that when the distinct words in a text


are ranked by frequency from most frequent to least
frequent, the product of rank and frequency is a constant.
Zipf's Law...
Frequency * Rank = constant

That is, if the words w in a collection are ranked by their frequency f,
so that word w has rank r, they roughly fit the relation:
r * f = c
◦ Different collections have different constants c.

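A minimal sketch of this check in Python (the file name corpus.txt is a placeholder for any plain-text collection, not something from the slides): rank words by frequency and print rank * frequency, which should stay roughly constant if Zipf's law holds.

from collections import Counter

def zipf_check(text, top_n=10):
    # Count word occurrences, rank them, and print rank * frequency,
    # which Zipf's law predicts to be roughly constant.
    counts = Counter(text.lower().split())
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        print(f"{rank:>3}  {word:<15} f={freq:<8} r*f={rank * freq}")

# "corpus.txt" is a placeholder for any plain-text collection.
zipf_check(open("corpus.txt", encoding="utf-8").read())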
Zipf's distributions
Rank Frequency Distribution
For all the words in a collection of documents, for each word w
• f : the frequency with which w appears
• r : the rank of w in order of frequency (the most commonly occurring word has rank 1,
etc.)
[Figure: distribution of sorted word frequencies according to Zipf's law; the word w with rank r has frequency f.]
Example: Zipf's Law

 The table shows the most frequently occurring words from a 336,310-document
collection containing 125,720,891 total words, of which 508,209 are unique.
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words
(upper cut-off). Used by almost all systems.
• Significant words: Take words in between the
most frequent (upper cut-off) and least frequent
words (lower cut-off).
• Term weighting: Give differing weights to terms
based on their frequency, with the most frequent
words weighted less. Used by almost all ranking
methods.
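As an illustration of the upper and lower cut-off idea, a small sketch (the cut-off values here are arbitrary assumptions, not taken from the slides):

from collections import Counter

def significant_words(tokens, lower_cutoff=2, upper_cutoff=100):
    # Keep words whose frequency lies between the two cut-offs:
    # words above upper_cutoff behave like stopwords, while words below
    # lower_cutoff are too rare to carry reliable statistical evidence.
    counts = Counter(tokens)
    return {w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}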
Zipf's Law Impact on IR
◦ Good News: Stopwords will account for a large fraction
of text so eliminating them greatly reduces inverted-
index storage costs.
◦ Bad News: For most words, gathering sufficient data for
meaningful statistical analysis (e.g. for correlation analysis
for query expansion) is difficult since they are extremely
rare.
Word significance: Luhn’s Ideas
 Luhn's idea (1958): the frequency of word occurrence in a text
furnishes a useful measurement of word significance.

 Luhn suggested that both extremely common and extremely


uncommon words were not very useful for indexing.

 For this, Luhn specified two cut-off points, an upper and a lower
cut-off, based on which non-significant words are excluded.
Word significance: Luhn’s Ideas
 The words exceeding the upper cut-off were considered to be common.
 The words below the lower cut-off were considered to be rare.
 Hence neither group contributes significantly to the content of the text.
 The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cut-offs.
 Let f be the frequency of occurrence of words in a text and r their
rank in decreasing order of word frequency; a plot relating f and r then
yields the curve on which Luhn's two cut-offs are placed.
Luhn’s Ideas

Luhn (1958) suggested that both extremely common and
extremely uncommon words were not very useful for document
representation and indexing.
Vocabulary size : Heaps’ Law
 How does the size of the overall vocabulary (number of
unique words) grow with the size of the corpus?
◦ This determines how the size of the inverted index will
scale with the size of the corpus.

Vocabulary Growth: Heaps’ Law
 Heaps' law estimates the size of the vocabulary in a given corpus.
◦ The vocabulary size grows as O(n^β), where β is a constant
between 0 and 1.
◦ If V is the size of the vocabulary and n is the length of the corpus
in words, Heaps' law gives:

V = K · n^β

 Where, typically:
◦ K ≈ 10–100
◦ β ≈ 0.4–0.6 (approximately square-root growth)
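A small sketch of the formula in code, using illustrative values K = 50 and β = 0.5 (both chosen from the typical ranges above, not measured on any particular corpus):

def heaps_vocabulary(n, K=50, beta=0.5):
    # Heaps' law: V = K * n**beta, the expected number of distinct
    # words (vocabulary size) in a corpus of n word occurrences.
    return K * n ** beta

print(heaps_vocabulary(1_000_000))      # about 50,000 distinct words
print(heaps_vocabulary(1_000_000_000))  # about 1.6 million distinct words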
Heaps' distributions
• Distribution of the size of the vocabulary: vocabulary size and the
number of tokens follow the power-law relationship above, which
appears as a linear relationship on a log-log plot.

 Example: from 1,000,000,000 documents, there may be 1,000,000
distinct words. Do you agree?
Example
 We want to estimate the size of the vocabulary
for a corpus of 1,000,000 words. However, we
only know statistics computed on smaller corpora:
◦ For 100,000 words, there are 50,000 unique words
◦ For 500,000 words, there are 150,000 unique words
◦ Estimate the vocabulary size for the 1,000,000-word corpus.
◦ How about for a corpus of 1,000,000,000 words?
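One way to work this exercise, assuming both smaller corpora follow Heaps' law exactly so that β and K can be solved from the two data points:

import math

n1, V1 = 100_000, 50_000      # first corpus: size and vocabulary
n2, V2 = 500_000, 150_000     # second corpus: size and vocabulary

# From V = K * n**beta:  beta = log(V2/V1) / log(n2/n1),  K = V1 / n1**beta
beta = math.log(V2 / V1) / math.log(n2 / n1)
K = V1 / n1 ** beta

print(f"beta ~ {beta:.3f}, K ~ {K:.1f}")
print(f"V(1,000,000)     ~ {K * 1_000_000 ** beta:,.0f}")       # roughly 240,000
print(f"V(1,000,000,000) ~ {K * 1_000_000_000 ** beta:,.0f}")   # roughly 27 million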
Text Operations
 Not all words in a document are equally significant to
represent the contents/meanings of a document
◦ Some words carry more meaning than others
◦ Nouns tend to be the most representative of a
document's content

 Therefore, the text of the documents in a collection needs to be
preprocessed to select the terms to be used as index terms.
Text Op….
 Preprocessing is the process of controlling the size of the
vocabulary or the number of distinct words used as index terms
◦ Preprocessing will lead to an improvement in the information
retrieval performance
 However, some search engines on the Web omit preprocessing
◦ Every word in the document is an index term

 Text operations are the processes that transform text into logical
representations.

 The main operations for selecting index terms are:


 Lexical analysis/tokenization of the text - handling digits, hyphens, punctuation marks, and the
case of letters

 Elimination of stop words - filter out words which are not useful in the retrieval
process

 Stemming words - remove affixes (prefixes and suffixes)

 Construction of term categorization structures such as a thesaurus/wordlist, to capture
term relationships and allow expansion of the original query with related terms
Generating Document Representatives
 Text Processing System
◦ Input text – full text, abstract or title
◦ Output – a document representative adequate for use in an
automatic retrieval system
documents → tokenization → stop-word removal → stemming → thesaurus → index terms
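A hedged sketch of that pipeline in Python; the stopword list and the suffix-stripping "stemmer" below are deliberately tiny stand-ins for illustration, not the components any real system uses, and the thesaurus step is omitted:

import re

STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "is", "it"}  # tiny illustrative list

def tokenize(text):
    # Lowercase the text and keep only runs of alphabetic characters.
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(document):
    # documents -> tokenization -> stop-word removal -> stemming -> index terms
    return [stem(t) for t in tokenize(document) if t not in STOPWORDS]

print(index_terms("The cats were sleeping peacefully in the living room."))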
Lexical Analysis/Tokenization of Text
 Convert the text of the documents into words to be adopted
as index terms

 Objective - identify words in the text

◦ Digits, hyphens, punctuation marks, case of letters

◦ Numbers alone are usually not good index terms (like 1910, 1999),

but combined forms such as “510 B.C.” can be unique and worth keeping
Lexical Analysis…..
 Hyphens – often broken up (e.g. state-of-the-art → state of
the art), but some words, e.g. gilt-edged, B-49, are unique terms
that require their hyphens

 Punctuation marks – removed entirely unless significant,

e.g. in program code, where x.exe and xexe differ
 Case of letters – usually not important; all letters can be
converted to upper or lower case
Tokenization
 Analyze text into a sequence of discrete tokens (words).
 Input: “Friends, Romans and Countrymen”

 Output: tokens (a token is an instance of a sequence of characters
grouped together as a useful semantic unit for processing)

◦ Friends, Romans, and, Countrymen

 Each such token is now a candidate for an index entry,
after further processing.

 But what are valid tokens to emit?
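A minimal sketch of such a tokenizer (splitting on anything that is not a letter and lowercasing, which is only one of many reasonable policies):

import re

def tokenize(text):
    # Split into lowercase alphabetic tokens; digits and punctuation are dropped.
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Friends, Romans and Countrymen"))
# -> ['friends', 'romans', 'and', 'countrymen']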


Issues in Tokenization
 One word or multiple: how do you decide whether it is one token or
two or more?
◦ Hewlett-Packard → Hewlett and Packard as two tokens?
 state-of-the-art: break up hyphenated sequence.
 San Francisco, Los Angeles
 Addis Ababa, Bahir Dar
◦ lowercase, lower-case, lower case ?
 data base, database, data-base
• Numbers:
 dates (3/12/91 vs. Mar. 12, 1991);
 phone numbers,
 IP addresses (100.2.86.144)
Issues in Tokenization
 How to handle special cases involving apostrophes, hyphens
etc? C++, C#, URLs, emails, …
◦ Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token.
◦ However, frequently they are not.
Issues in Tokenization
 The simplest approach is to ignore all numbers and punctuation and
use only case-insensitive unbroken strings of alphabetic
characters as tokens.
◦ Generally, don’t index numbers as text, but they are often very useful. Systems will often
index such “meta-data” (creation date, format, etc.) separately

 Issues of tokenization are language specific


◦ Requires the language to be known

Exercise: Tokenization
 The cat slept peacefully in the living room. It’s a
very old cat.

 Mr. O’Neill thinks that the boys’ stories about


Chile’s capital aren’t amusing.

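One possible tokenization of the exercise sentences, using the same naive letters-only policy; note how it splits “It’s”, “O’Neill”, “boys’”, and “aren’t” at the apostrophes, which is exactly the kind of decision a real tokenizer has to make deliberately:

import re

sentences = [
    "The cat slept peacefully in the living room. It's a very old cat.",
    "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.",
]
for s in sentences:
    # Naive policy: keep only runs of letters, lowercased.
    print(re.findall(r"[a-z]+", s.lower()))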
Term Weights: Term Frequency
 More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j

 May want to normalize term frequency (tf) by


dividing by the frequency of the most common
term in the document:
tfij = fij / maxi{fij}
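A small sketch of this normalization (raw counts divided by the count of the most frequent term in the document):

from collections import Counter

def normalized_tf(tokens):
    # tf_ij = f_ij / max_i{f_ij}: scale each term's count by the largest count.
    counts = Counter(tokens)
    max_f = max(counts.values())
    return {term: f / max_f for term, f in counts.items()}

print(normalized_tf(["text", "retrieval", "text", "operations"]))
# -> {'text': 1.0, 'retrieval': 0.5, 'operations': 0.5}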
Term Weights: Inverse Document Frequency
 Terms that appear in many different documents are
less indicative of overall topic.
dfi = document frequency of term i
    = number of documents containing term i
idfi = inverse document frequency of term i
     = log2(N / dfi)
(N: total number of documents)
 An indication of a term’s discrimination power.
 Log used to dampen the effect relative to tf.
TF-IDF Weighting
 A typical combined term importance indicator
is tf-idf weighting:
wij = tfij · idfi = tfij · log2(N / dfi)
 A term occurring frequently in the document
but rarely in the rest of the collection is given
high weight.
 Many other ways of determining term weights
have been proposed.
 Experimentally, tf-idf has been found to work
well.
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
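The same computation as a short sketch in code (log base 2, with tf normalized by the most frequent term in the document, as above):

import math

N = 10_000                             # documents in the collection
f = {"A": 3, "B": 2, "C": 1}           # raw term frequencies in this document
df = {"A": 50, "B": 1300, "C": 250}    # document frequencies in the collection

max_f = max(f.values())
for term in f:
    tf = f[term] / max_f
    idf = math.log2(N / df[term])
    print(f"{term}: tf = {tf:.2f}, idf = {idf:.1f}, tf-idf = {tf * idf:.1f}")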
Similarity Measure
 A similarity measure is a function that computes
the degree of similarity between two vectors.

 Using a similarity measure between the query


and each document:
◦ It is possible to rank the retrieved documents in the
order of presumed relevance.
◦ It is possible to enforce a certain threshold so that
the size of the retrieved set can be controlled.
Similarity Measure - Inner Product
 Similarity between vectors for the document dj and query q can be
computed as the vector inner product (a.k.a. dot product):

sim(dj, q) = dj • q = Σi wij · wiq
where wij is the weight of term i in document j and wiq is the weight of term i in
the query
 For binary vectors, the inner product is the number of matched
query terms in the document (size of intersection).
 For weighted term vectors, it is the sum of the products of the
weights of the matched terms.

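A minimal sketch of the inner-product similarity over sparse term-weight vectors, here represented as dictionaries mapping terms to weights (a representation assumed for illustration):

def inner_product(doc_weights, query_weights):
    # sim(dj, q) = sum over shared terms of w_ij * w_iq
    return sum(w * query_weights[t] for t, w in doc_weights.items() if t in query_weights)

# With binary (0/1) weights the score is just the number of matched query terms.
print(inner_product({"text": 1, "retrieval": 1, "zipf": 1},
                    {"retrieval": 1, "zipf": 1}))          # -> 2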
Properties of Inner Product
 The inner product is unbounded.

 Favors long documents with a large number


of unique terms.

 Measures how many terms matched but not


how many terms are not matched.
