Chapter 2 Text Operations
Statistical Properties of Text
How is the frequency of different words distributed?
How fast does vocabulary size grow with the size of a corpus?
◦ Such factors affect the performance of an IR system and can be used to select
suitable term weights and other aspects of the system.
Sample Word Frequency Data
Word distribution: Zipf's Law
Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
◦ attempts to capture the distribution of the frequencies
(number of occurrences) of the words within a text.
Zipf's distributions
Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
• f is the frequency with which w appears
• r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies according to Zipf's law (axis label: f)]
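The rank-frequency table described above can be sketched in Python as follows. The function name `rank_frequency` and the toy token list are illustrative assumptions, not part of the original slides. Zipf's law says frequency is roughly inversely proportional to rank, so the product r × f should stay roughly constant on a large corpus; a sample this small only shows how the table is built.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically so the
    # output is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Toy sample (hypothetical data, far too small to exhibit Zipf's law).
tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)
for rank, word, freq in table:
    # The last column is r * f, the quantity Zipf's law predicts to be
    # roughly constant on real corpora.
    print(rank, word, freq, rank * freq)
```

On a real collection, `tokens` would come from tokenizing every document rather than from a single hand-written sentence.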
Example: Zipf's Law
Vocabulary Growth: Heaps’ Law
Heaps' law estimates the vocabulary size (the number of distinct words) in a
given corpus
◦ The vocabulary size grows as $O(n^{\beta})$, where β is a constant
between 0 and 1.
◦ If V is the size of the vocabulary and n is the length of the corpus
in words, Heaps' law gives the following equation:
$$V = K n^{\beta}$$
where the constants are:
◦ K ≈ 10–100
◦ β ≈ 0.4–0.6 (approx. square-root)
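As a quick numeric illustration of the equation, the following minimal sketch evaluates V = K · n^β for two corpus sizes. The function name `heaps_vocabulary` and the parameter choice K = 50, β = 0.5 are assumptions picked from inside the ranges quoted above, not values fitted to any real corpus.

```python
def heaps_vocabulary(n, K, beta):
    """Estimate vocabulary size V = K * n**beta for a corpus of n word tokens."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return K * n ** beta

# Illustrative parameters (an assumption): K = 50, beta = 0.5.
for n in (10_000, 1_000_000):
    print(n, round(heaps_vocabulary(n, K=50, beta=0.5)))
```

With these assumed parameters, a corpus 100 times longer yields a vocabulary only 10 times larger, which is the square-root behaviour noted above.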
Heaps' distributions
• Distribution of the size of the vocabulary: on a log-log scale, there is a
linear relationship between vocabulary size and number of
tokens
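One way to examine this relationship on real data is to record the vocabulary size after each token and fit a straight line to log V against log n. The sketch below assumes an ordinary least-squares fit; the helper names `vocabulary_growth` and `fit_heaps` are hypothetical, and the demonstration curve is synthetic rather than measured.

```python
import math

def vocabulary_growth(tokens):
    """Return (n, V) pairs: V distinct words seen in the first n tokens."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve

def fit_heaps(curve):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta).

    Needs at least two points with different n.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta

# Synthetic curve that follows V = 20 * n**0.5 exactly (not real data).
synthetic = [(n, 20 * n ** 0.5) for n in (100, 400, 1600, 6400)]
K, beta = fit_heaps(synthetic)
print(round(K, 3), round(beta, 3))
```

On a real corpus the points would come from `vocabulary_growth(tokens)`, and the fitted slope is the estimate of β.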
Text operations are the process of transforming text into logical
representations
Elimination of stop words - filter out words which are not useful in the retrieval
process
Index terms
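Stop-word elimination, as described above, can be sketched as a simple filter over a token list. The stop list here is a short hypothetical one chosen for illustration; real systems use much longer lists, and the name `remove_stop_words` is an assumption.

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words(["The", "cat", "slept", "in", "the", "room"])
print(filtered)  # ['cat', 'slept', 'room']
```

The surviving tokens are the candidates for index terms.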
Lexical Analysis/Tokenization of Text
Convert the text of the documents into words to be adopted
as index terms
Tokenization input: “Friends, Romans and Countrymen”
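A minimal tokenizer for this input might look like the sketch below. It assumes that tokens are runs of word characters and that case is folded to lowercase; both are illustrative design choices, not rules stated in the slides.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    # \w+ matches runs of letters, digits, and underscores.
    return re.findall(r"\w+", text.lower())

tokens = tokenize("Friends, Romans and Countrymen")
print(tokens)  # ['friends', 'romans', 'and', 'countrymen']
```

A pattern this simple splits contractions at the apostrophe, so a production tokenizer would need extra rules for such cases.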
Exercise: Tokenization
The cat slept peacefully in the living room. It’s a
very old cat.
Term Weights: Term Frequency
More frequent terms in a document are more
important, i.e. more indicative of the topic.
$f_{ij}$ = frequency of term i in document j
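Raw term frequencies for a single document can be counted as in this sketch. The document tokens are hypothetical and assumed to be already tokenized and filtered; `term_frequencies` is an illustrative name.

```python
from collections import Counter

def term_frequencies(tokens):
    """Map each term i to f_ij, its raw count in one document j."""
    return Counter(tokens)

doc_tokens = ["cat", "slept", "old", "cat"]  # hypothetical document j
tf = term_frequencies(doc_tokens)
# A Counter returns 0 for terms that never occur in the document.
print(tf["cat"], tf["slept"], tf["dog"])  # 2 1 0
```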
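The inner product similarity between document $d_j$ and query $q$ is
$$sim(d_j, q) = \sum_{i} w_{ij} \, w_{iq}$$
where $w_{ij}$ is the weight of term i in document j and $w_{iq}$ is the weight of term i in
the query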
For binary vectors, the inner product is the number of matched
query terms in the document (size of intersection).
For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
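Both cases can be illustrated with one small function. The sketch assumes vectors are stored sparsely as dictionaries mapping terms to weights, which is a representation choice made here for brevity; the terms and weights are made-up examples.

```python
def inner_product(doc_weights, query_weights):
    """Sum w_ij * w_iq over the terms shared by the document and the query."""
    return sum(w * query_weights[term]
               for term, w in doc_weights.items()
               if term in query_weights)

# Binary vectors: the score is the number of matched query terms.
doc_bin = {"retrieval": 1, "database": 1, "architecture": 1}
query_bin = {"retrieval": 1, "architecture": 1, "text": 1}
binary_score = inner_product(doc_bin, query_bin)
print(binary_score)  # 2

# Weighted vectors: the score is the sum of products of matched weights.
doc_w = {"retrieval": 2, "database": 3, "computer": 5}
query_w = {"retrieval": 1, "computer": 2}
weighted_score = inner_product(doc_w, query_w)
print(weighted_score)  # 2*1 + 5*2 = 12
```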
Properties of Inner Product
The inner product is unbounded.
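A small sketch makes this property concrete: scaling every weight in a document scales its score by the same factor, so the score has no upper limit. The weights and the "repeated three times" document are hypothetical.

```python
def inner_product(doc_weights, query_weights):
    """Sum of w_ij * w_iq; terms missing from the query contribute 0."""
    return sum(w * query_weights.get(term, 0)
               for term, w in doc_weights.items())

query = {"cat": 1, "room": 1}
doc = {"cat": 2, "room": 1, "old": 1}
# Hypothetical longer document: the same text repeated three times,
# so every raw weight is tripled.
long_doc = {term: 3 * w for term, w in doc.items()}

short_score = inner_product(doc, query)
long_score = inner_product(long_doc, query)
print(short_score, long_score)  # 3 9
```

The longer document scores three times higher although its content is the same.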