Chapter-2 - Automatic Text Analysis
•Indexing text
–Organizing indexes
• What techniques to use? How to select them?
–Storage of indexes
• Is compression required? Do we store in memory or on disk?
•Accessing text
–Accessing indexes
• How to access indexes? What data/file structure to use?
–Processing indexes
• How to search a given query in the index? How to update the index?
–Accessing documents
Indexing: Basic Concepts
• Indexing is used to speed up access to desired information from a document collection as per a user's query, such that
– it enhances efficiency in terms of retrieval time; relevant documents are searched and retrieved quickly.
Example: author catalog in library
• An index file consists of records, called index entries.
• Index files are much smaller than the original file.
– Remember Heaps' Law: for 1 GB of TREC text collection the vocabulary has a size of only 5 MB.
– This size may be further reduced by linguistic pre-processing (like stemming & other normalization methods).
• Common features
–allow Boolean searches
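The Heaps' Law size estimate above can be sketched numerically. The constants k and beta below, and the token count assumed for a 1 GB collection, are typical illustrative values, not figures from the slides:

```python
# Heaps' law: V = k * n^beta estimates vocabulary size V (distinct words)
# from collection size n (in tokens). k = 44 and beta = 0.49 are assumed
# typical values for English text.
def heaps_vocabulary(n_tokens, k=44.0, beta=0.49):
    return int(k * n_tokens ** beta)

# A 1 GB text collection is very roughly 150 million tokens (assumption);
# the predicted vocabulary is a few hundred thousand words, far smaller
# than the collection itself.
print(heaps_vocabulary(150_000_000))
```

This is why the index file can be orders of magnitude smaller than the collection it indexes.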
Major Steps in Index Construction
• Source file: Collection of text document
–A document can be described by a set of representative keywords called
index terms.
Token stream → Tokenizer → Friends Romans countrymen
Tokenizer output → Indexer → Index File (Inverted file):
friend → 2, 4
roman → 1, 2
countryman → 13, 16
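The tokenizer → indexer pipeline above can be sketched in a few lines. The document contents and the normalization (lowercasing only) are illustrative assumptions, not the slides' exact procedure:

```python
from collections import defaultdict

def tokenize(text):
    # Minimal tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    """Map each index term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "friends romans countrymen", 2: "romans and friends"}
print(build_index(docs)["romans"])  # → [1, 2]
```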
Building Index file
•An index file of a document is a file consisting of a list of index terms and a link to one or more documents that have the index term
–A good index file maps each keyword Ki to a set of documents Di that contain the keyword
•An index file is a list of search terms that are organized for associative look-up, i.e., to answer a user's query:
–In which documents does a specified search term appear?
–Where within each document does each term appear? (There may be several occurrences.)
•For organizing an index file for a collection of documents, there are various options available:
–Decide what data structure and/or file structure to use: a sequential file, an inverted file, etc.
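Both associative-lookup questions above (which documents, and where within each document) can be answered by a positional index. This is a minimal sketch; the documents and structure are illustrative assumptions:

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id -> [word offsets where the term occurs]}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

docs = {1: "to be or not to be", 2: "not to worry"}
idx = build_positional_index(docs)
print(sorted(idx["to"].keys()))  # documents containing "to" → [1, 2]
print(idx["to"][1])              # offsets within document 1 → [0, 4]
```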
Index file Evaluation Metrics
• Running time
–Access/search time
–Update time (Insertion time, Deletion time, ….)
• Space overhead
–Computer storage space consumed.
Word distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950),
– attempts to capture the distribution of the frequencies (i.e., number of occurrences) of the words within a text.
• Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf:
Frequency * Rank = constant
That is, if the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation: r * f = c
– Different collections have different constants c.
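The rank-frequency product can be computed directly. This sketch uses a toy text (an assumption; real collections fit the law only approximately over large vocabularies):

```python
from collections import Counter

def zipf_table(text):
    # Rank distinct words by descending frequency and report rank * frequency.
    freqs = Counter(text.lower().split())
    ranked = sorted(freqs.values(), reverse=True)
    return [(rank, f, rank * f) for rank, f in enumerate(ranked, start=1)]

text = "the cat and the dog and the bird saw the cat"
for rank, f, product in zipf_table(text):
    print(rank, f, product)  # the product stays roughly constant
```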
Example: Zipf's Law
(Figure: rank-frequency distribution of index terms.)
Lexical Analysis/Tokenization of Text
• Change the text of the documents into words to be adopted as index terms
• Objective - identify words in the text
– Digits, hyphens, punctuation marks, case of letters
– Numbers are usually not good index terms (like 1910, 1999); but 510 B.C. is unique
– Hyphens - break up the words (e.g. state-of-the-art = state of the art), but some words, e.g. gilt-edged, B-49, are unique words which require hyphens
– Punctuation marks - remove totally unless significant, e.g. in program code: x.exe vs. xexe
– Case of letters - not important; can convert all to upper or lower case
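The normalization decisions above can be sketched as a small function. Keeping internal dots (so "x.exe" survives) while stripping edge punctuation is an illustrative approximation of "remove unless significant":

```python
def normalize(token):
    # Case of letters: fold everything to lowercase.
    token = token.lower()
    # Hyphens: break up hyphenated sequences (state-of-the-art -> 4 tokens).
    parts = token.split("-")
    cleaned = []
    for p in parts:
        # Punctuation: strip it from token edges only, so internal
        # punctuation that is significant (e.g. "x.exe") is preserved.
        p = p.strip(",.;:!?\"'()")
        if p:
            cleaned.append(p)
    return cleaned

print(normalize("State-of-the-art,"))  # → ['state', 'of', 'the', 'art']
print(normalize("x.exe"))             # → ['x.exe']
```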
Tokenization
• Analyze text into a sequence of discrete tokens
(words).
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (an instance of a sequence of characters
that are grouped together as a useful semantic unit for pro-
cessing)
– Friends
– Romans
– and
– Countrymen
• Each such token is now a candidate for an index entry, after further processing
• But what are valid tokens to emit?
Issues in Tokenization
• One word or multiple: How do you decide it is
one token or two or more?
–Hewlett-Packard: one token, or Hewlett and Packard as two tokens?
• state-of-the-art: break up hyphenated sequence.
• San Francisco, Los Angeles
• Addis Ababa, Arba Minch
–lowercase, lower-case, lower case ?
• data base, database, data-base
• Numbers:
• dates (3/12/91 vs. Mar. 12, 1991);
• phone numbers,
• IP addresses (100.2.86.144)
Issues in Tokenization
• How to handle special cases involving apostrophes,
hyphens etc? C++, C#, URLs, emails, …
– Sometimes punctuation (e-mail), numbers (1999), and
case (Republican vs. republican) can be a meaningful part
of a token.
– However, frequently they are not.
• Simplest approach is to ignore all numbers and
punctuation and use only case-insensitive unbroken
strings of alphabetic characters as tokens.
– Generally, don't index numbers as text, but they are often very useful. Will often index "meta-data", including creation date, format, etc., separately
• Issues of tokenization are language specific
– Requires the language to be known
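The "simplest approach" above (ignore numbers and punctuation; keep only case-insensitive unbroken alphabetic strings) is one regular expression. The `[a-z]+` pattern assumes English-only text, which ties in to the language-specific caveat:

```python
import re

def simple_tokenize(text):
    # Lowercase, then keep only unbroken runs of alphabetic characters;
    # numbers and punctuation are dropped entirely.
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("Republican donors gave $1,999 via e-mail!"))
# → ['republican', 'donors', 'gave', 'via', 'e', 'mail']
```

Note how this splits "e-mail" into two tokens and loses "1999" — exactly the trade-offs discussed above.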
Exercise: Tokenization
• The cat slept peacefully in the living room.
It’s a very old cat.
•Ziv-Lempel compression
–Does not rely on previous knowledge about the data
–Rather builds this knowledge in the course of data transmission/data storage
–The Ziv-Lempel algorithm (called LZ) uses a table of codewords created during data transmission;
• each time it replaces strings of characters with a reference to a previous occurrence of the string.
Lempel-Ziv Compression Algorithm
• The multi-symbol patterns are of the form C0C1...Cn-1Cn. The prefix of a pattern consists of all the pattern symbols except the last: C0C1...Cn-1
1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA.
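The phrase-building rule above (each new pattern is a previously seen pattern, its prefix, plus one new symbol) can be sketched in LZ78 style. The output pairing (prefix index, new symbol) is one common presentation, used here as an assumption:

```python
def lz78_compress(data):
    dictionary = {"": 0}  # phrase -> index; the empty prefix has index 0
    output = []
    phrase = ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the longest known prefix
        else:
            # Emit (index of the prefix, the one new symbol) and record
            # the new pattern C0C1...Cn-1Cn in the dictionary.
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:  # flush any remaining matched phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

print(lz78_compress("Mississippi"))
```

Tracing string 1 above by hand: M, i, s are new single symbols; "si" extends "s"; "ss" extends "s"; "ip" extends "i"; p is new; and the final "i" is flushed.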