
Chapter Two

Automatic Text Analysis


Text Collections and IR
• Large collections of documents from various sources:
news articles, research papers, books, digital libraries,
Web pages, etc.
Sample Statistics of Text Collections
• Dialog:
  – claims to have more than 15 terabytes of data in >600 databases,
    >800 million unique records
• LEXIS/NEXIS:
  – claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers,
    11,400 databases; >200,000 searches per day; 9 mainframes,
    300 Unix servers, 200 NT servers
• Web Search Engines:
  – Google claims to index over 1.5 billion pages.
  – How many search engines are available these days?
• TREC collections:
  – a total of about 5 gigabytes of text
Designing an IR System
Our focus during IR system design is:
• Improving the effectiveness of the system
  – Effectiveness is measured in terms of precision, recall, …
  – Relevant techniques: stemming, stopwords, weighting schemes,
    matching algorithms

• Improving the efficiency of the system
  – The concern here is storage space usage, access time, …
  – Relevant techniques: compression, data/file structures,
    space-time tradeoffs

• The two subsystems of an IR system:
  – Searching and
  – Indexing
Indexing Subsystem
documents → assign document identifier → document IDs
text → tokenize document → tokens
tokens → stop list → non-stoplist tokens
non-stoplist tokens → stemming & normalization → stemmed terms
stemmed terms → term weighting → terms with weights
terms with weights → Index
Searching Subsystem
query → parse query → query tokens
query tokens → stop list → non-stoplist tokens
non-stoplist tokens → stemming & normalization → stemmed terms
stemmed terms → term weighting → query terms
query terms + index terms (from the Index) → similarity measure
similarity measure → ranking → ranked document set → relevant document set
Basic assertion
Indexing and searching are inexorably connected:
– you cannot search what has not first been indexed in some
  manner or other
– documents or objects are indexed in order to make them
  searchable
• there are many ways to do indexing
  – to index, one needs an indexing language
    • there are many indexing languages
    • even taking every word in a document is an indexing language

Knowing searching is knowing indexing.

Implementation Issues
• Storage of text:
  – The need for text compression: to reduce storage space

• Indexing text
  – Organizing indexes
    • What techniques to use? How to select them?
  – Storage of indexes
    • Is compression required? Do we store indexes in memory or on disk?

• Accessing text
  – Accessing indexes
    • How do we access the indexes? What data/file structure to use?
  – Processing indexes
    • How do we search a given query in the index? How do we update the index?
  – Accessing documents
Indexing: Basic Concepts
• Indexing is used to speed up access to desired information from a
  document collection as per a user's query, such that
  – It enhances efficiency in terms of retrieval time: relevant
    documents are searched and retrieved quickly.
    Example: the author catalog in a library
• An index file consists of records, called index entries.
• Index files are much smaller than the original file.
  – Remember Heaps' Law: for 1 GB of the TREC text collection,
    the vocabulary has a size of only 5 MB.
  – This size may be further reduced by linguistic pre-processing
    (like stemming & other normalization methods).

• The usual unit for indexing is the word
  – Index terms are used to look up records in a file.
INDEX TERM SELECTION
• The index language is the language used to describe documents
  and requests

• Elements of the index language are index terms, which may be
  derived from the text of the document to be described, or may be
  arrived at independently.
  – If a full-text representation of the text is adopted, then all
    words in the text are used as index terms = full-text indexing
  – Otherwise, we need to select the words to be used as index
    terms in order to reduce the size of the index file, which is
    essential for designing an efficient IR system
How Current Search Engines index?
• Indexes are built using a web crawler, which retrieves each page
  on the Web for indexing.
  – After indexing, the local copy of each page is discarded, unless
    stored in a cache.
• Some search engines index automatically
  – such search engines include:
    Google, AltaVista, Excite, HotBot, InfoSeek, Lycos

• Some others index semi-automatically
  – Partially human-indexed, hierarchically organized
  – Such search engines include:
    Yahoo, Magellan, Galaxy, WWW Virtual Library

• Common features
–allow Boolean searches
Major Steps in Index Construction
• Source file: a collection of text documents
  – A document can be described by a set of representative keywords called
    index terms.

• Index term selection:
  – Tokenize: identify words in a document, so that each document is
    represented by a list of keywords or attributes
  – Stop words: removal of high-frequency words
    • A stop list of words is used for comparing the input text
  – Word stem: reduce words with similar meaning to their stem/root word
    • Suffix stripping is the common method
  – Term relevance weight: different index terms have varying relevance
    when used to describe document contents.
    • This effect is captured through the assignment of numerical weights to
      each index term of a document.
    • There are different index term weighting methods: TF, TF*IDF, …
      (see the sketch after this list)

• Output: a set of index terms (the vocabulary) to be used for indexing
  the documents that each term occurs in.
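To make the term-weighting step above concrete, here is a minimal sketch of TF*IDF weighting in Python. It assumes documents are already tokenized (and stemmed) lists of index terms; the function name tf_idf_weights and the choice of raw term frequency with a logarithmic IDF are illustrative, not a formula prescribed by these slides.

import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute TF*IDF weights for a list of tokenized documents.
    docs: list of lists of index terms (already tokenized/stemmed)."""
    n_docs = len(docs)
    # document frequency: number of documents each term occurs in
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({
            term: freq * math.log(n_docs / df[term])  # TF * IDF
            for term, freq in tf.items()
        })
    return weights

docs = [["automat", "text", "analysi"], ["text", "index", "index"]]
print(tf_idf_weights(docs))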
Basic Indexing Process
Documents to be indexed:     "Friends, Romans, countrymen."
  → Tokenizer →
Token stream:                Friends, Romans, countrymen
  → Linguistic preprocessing →
Modified tokens:             friend, roman, countryman
  → Indexer →
Index file (inverted file):  friend → 2, 4
                             roman → 1, 2
                             countryman → 13, 16
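A minimal sketch of the indexer step in Python, assuming documents have already been reduced to lists of modified tokens. The toy postings mirror the slide's example (friend → 2, 4; roman → 1, 2; countryman → 13, 16); the function name and data layout are illustrative.

from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Toy collection: document ID -> modified tokens
docs = {
    1: ["roman"],
    2: ["friend", "roman"],
    4: ["friend"],
    13: ["countryman"],
    16: ["countryman"],
}
print(build_inverted_index(docs))
# {'roman': [1, 2], 'friend': [2, 4], 'countryman': [13, 16]}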
Building Index file
• An index file of a document collection is a file consisting of a list of
  index terms and a link to one or more documents that contain each index term
  – A good index file maps each keyword Ki to a set of documents Di that
    contain the keyword

• An index file usually has its index terms in sorted order.
  – The sort order of the terms in the index file provides an order on a
    physical file

• An index file is a list of search terms that are organized for associative
  look-up, i.e., to answer a user's query:
  – In which documents does a specified search term appear?
  – Where within each document does each term appear? (There may be several
    occurrences.)

• For organizing the index file for a collection of documents, there are
  various options available:
  – Decide what data structure and/or file structure to use: a sequential
    file, an inverted file, …
Index file Evaluation Metrics
• Running time
–Access/search time
–Update time (Insertion time, Deletion time, ….)

• Space overhead
–Computer storage space consumed.

• Access types supported efficiently. Does the indexing structure
  allow access to:
  – records with a specified term, or
  – records with terms falling in a specified range of values?
Statistical Properties of Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of
a corpus?
  – Such factors affect the performance of an IR system and can be
    used to select suitable term weights and other aspects of the
    system.
• A few words are very common.
  – The 2 most frequent words (e.g. "the", "of") can account for
    about 10% of word occurrences.
• Most words are very rare.
  – Half the words in a corpus appear only once (hapax legomena,
    literally "read only once")
• This is called a "heavy-tailed" distribution, since most of the
  probability mass is in the "tail"
Sample Word Frequency Data
(sample frequency table not reproduced here)
Zipf's Distributions
Rank-Frequency Distribution
• For all the words in a collection of documents, for each word w:
  – f: the frequency with which w appears
  – r: the rank of w in order of frequency (the most commonly
    occurring word has rank 1, etc.)
• [Plot: sorted word frequencies f against rank r, according to
  Zipf's law; each word w has rank r and frequency f.]
Word distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor
  George Kingsley Zipf (1902-1950),
  – attempts to capture the distribution of the frequencies
    (i.e., number of occurrences) of the words within a text.
• Zipf's Law states that when the distinct words in a text are
  arranged in decreasing order of their frequency of occurrence
  (most frequent words first), the occurrence characteristics of the
  vocabulary can be characterized by the constant rank-frequency law
  of Zipf:
      Frequency * Rank = constant
  That is, if the words w in a collection are ranked by their
  frequency f, with rank r, they roughly fit the relation: r * f = c
  – Different collections have different constants c.
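A small Python sketch of the rank-frequency law: rank the words of a corpus by frequency and print rank × frequency, which Zipf's law predicts to be roughly constant. On the toy sentence below the products are noisy; on a large collection they cluster around a single constant c.

from collections import Counter

def zipf_table(text, top=10):
    """Rank words by frequency and report rank * frequency,
    which Zipf's law predicts to be roughly constant."""
    freqs = Counter(text.lower().split())
    ranked = freqs.most_common()
    return [(rank, word, f, rank * f)
            for rank, (word, f) in enumerate(ranked[:top], start=1)]

sample = "the cat sat on the mat and the dog sat on the log"
for rank, word, f, product in zipf_table(sample):
    print(rank, word, f, product)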
Example: Zipf's Law

• The table (not reproduced here) shows the most frequently
  occurring words from a 336,310-document collection containing
  125,720,891 total words, out of which 508,209 are unique.
Zipf’s law: modeling word distribution
• The collection frequency f_i of the i-th most common term is
  proportional to 1/i:
      f_i ∝ 1/i
  – If the most frequent term occurs f_1 times, then the second most
    frequent term has half as many occurrences, the third most
    frequent term has a third as many, etc.
• Zipf's Law states that the frequency of the i-th most frequent
  word is 1/i^θ times that of the most frequent word, for some θ ≥ 1
  – the occurrence probability of some event (P_i), as a function of
    the rank i when the rank is determined by the frequency of
    occurrence, is a power-law function P_i ~ 1/i^θ with the
    exponent θ close to 1.
More Example: Zipf’s Law
• Illustration of the Rank-Frequency Law. Let the total number of
  word occurrences in the sample be N = 1,000,000

  Rank (r)   Term   Frequency (f)   r·(f/N)
      1      the        69,971       0.070
      2      of         36,411       0.073
      3      and        28,852       0.086
      4      to         26,149       0.104
      5      a          23,237       0.116
      6      in         21,341       0.128
      7      that       10,595       0.074
      8      is         10,099       0.081
      9      was         9,816       0.088
     10      he          9,543       0.095
Methods that Build on Zipf's Law
• Stop lists: ignore the most frequent words (upper cut-off). Used
  by almost all systems.
• Significant words: take words in between the most frequent (upper
  cut-off) and least frequent (lower cut-off) words.
• Term weighting: give differing weights to terms based on their
  frequency, with the most frequent words weighted less. Used by
  almost all ranking methods.
Explanations for Zipf’s Law
• The law has been explained by the "principle of least effort",
  which makes it easier for a speaker or writer of a language to
  repeat certain words instead of coining new and different words.
  – Zipf's explanation was his "principle of least effort", which
    balances the speaker's desire for a small vocabulary against the
    hearer's desire for a large one.
• Zipf's Law: impact on IR
  – Good news: stopwords account for a large fraction of the text,
    so eliminating them greatly reduces inverted-index storage costs.
  – Bad news: for most words, gathering sufficient data for
    meaningful statistical analysis (e.g. for correlation analysis
    for query expansion) is difficult, since they are extremely rare.
Word significance: Luhn’s Ideas
• Luhn's idea (1958): the frequency of word occurrence in a text
  furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely uncommon
  words are not very useful for indexing.
• For this, Luhn specified two cut-off points, an upper and a lower
  cut-off, based on which non-significant words are excluded:
  – Words exceeding the upper cut-off are considered to be common
  – Words below the lower cut-off are considered to be rare
  – Hence they do not contribute significantly to the content of the text
  – The ability of words to discriminate content reaches a peak at a
    rank-order position halfway between the two cut-offs
• Let f be the frequency of occurrence of words in a text, and r
  their rank in decreasing order of word frequency; a plot relating
  f and r yields the curve with Luhn's two cut-offs marked (figure
  not reproduced here).
Luhn’s Ideas

• Luhn (1958) suggested that both extremely common and extremely
  uncommon words are not very useful for document representation
  and indexing.
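A minimal sketch of Luhn-style significant-word selection in Python, assuming two illustrative cut-offs (an absolute lower cut-off and a relative upper cut-off); Luhn did not prescribe these particular values.

from collections import Counter

def significant_words(tokens, lower_cutoff=2, upper_cutoff=0.05):
    """Keep words whose frequency lies between Luhn's two cut-offs:
    drop words rarer than lower_cutoff occurrences and words more
    frequent than upper_cutoff (as a fraction of all tokens).
    The cut-off values here are illustrative, not Luhn's."""
    freqs = Counter(tokens)
    n = len(tokens)
    return {w for w, f in freqs.items()
            if f >= lower_cutoff and f / n <= upper_cutoff}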
Vocabulary size : Heaps’ Law
• Dictionaries contain 600,000+ words
  – and these do not include names of people, locations,
    products, etc.

• Heaps' law estimates the vocabulary size (number of distinct
  words) of a given corpus
Vocabulary Growth: Heaps’ Law
• How does the size of the overall vocabulary (number of unique
  words) grow with the size of the corpus?
  – This determines how the size of the inverted index will scale
    with the size of the corpus.
• Heaps' law estimates the vocabulary size of a given corpus:
  – The vocabulary size grows as O(n^β), where β is a constant
    between 0 and 1.
  – If V is the size of the vocabulary and n is the length of the
    corpus in words, Heaps' law provides the following equation:
        V = K · n^β
    where the constants are typically:
      – K ≈ 10-100
      – β ≈ 0.4-0.6 (approximately square-root growth)
Heaps' distributions
• Distribution of the size of the vocabulary: there is a roughly
  linear relationship between vocabulary size and the number of
  tokens when both are plotted on a logarithmic scale

• Example: from 1,000,000,000 documents, there may be about
  1,000,000 distinct words. Can you agree?
Example
• We want to estimate the size of the vocabulary for a corpus of
  1,000,000 words. However, we only know statistics computed on
  smaller corpora:
  – For 100,000 words, there are 50,000 unique words
  – For 500,000 words, there are 150,000 unique words

  – Estimate the vocabulary size for the 1,000,000-word corpus.
  – How about for a corpus of 1,000,000,000 words?
    (A worked sketch follows below.)
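A worked sketch for this exercise, assuming Heaps' law V = K·n^β and fitting K and β from the two given data points (the fitted β here comes out somewhat above the typical 0.4-0.6 range, which is fine for an exercise):

import math

def fit_heaps(n1, v1, n2, v2):
    """Fit V = K * n**beta from two (corpus size, vocabulary size) points."""
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    k = v1 / n1 ** beta
    return k, beta

k, beta = fit_heaps(100_000, 50_000, 500_000, 150_000)
print(f"K = {k:.1f}, beta = {beta:.2f}")
print("V(1,000,000)     =", round(k * 1_000_000 ** beta))
print("V(1,000,000,000) =", round(k * 1_000_000_000 ** beta))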
Text Operations
• Not all words in a document are equally significant for
  representing the contents/meaning of the document
  – Some words carry more meaning than others
  – Nouns tend to be the most representative of a document's content
• Therefore, one needs to preprocess the text of the documents in a
  collection in order to select index terms
• Using the set of all words in a collection to index documents
  creates too much noise for the retrieval task
  – Reducing noise means reducing the number of words that can be
    used to refer to a document
• Preprocessing is the process of controlling the size of the
  vocabulary, i.e. the number of distinct words used as index terms
  – Preprocessing will lead to an improvement in information
    retrieval performance
• However, some search engines on the Web omit preprocessing
Text Operations
• Text operations are the process of transforming text into logical
  representations
• The main operations for selecting index terms, i.e. for choosing
  the words/stems (or groups of words) to be used as indexing terms,
  are:
  – Lexical analysis/tokenization of the text: handling digits,
    hyphens, punctuation marks, and the case of letters
  – Elimination of stop words: filter out words which are not useful
    in the retrieval process
  – Stemming words: remove affixes (prefixes and suffixes)
  – Construction of term categorization structures such as a
    thesaurus, to capture relationships and allow expansion of the
    original query with related terms
Generating Document Representatives
• Text Processing System
– Input text – full text, abstract or title
– Output – a document representative adequate for use in
an automatic retrieval system
• The document representative consists of a list of class
names, each name representing a class of words occurring in
the total input text. A document will be indexed by a name if
one of its significant words occurs as a member of that class.

documents → tokenization → stop-word removal → stemming → thesaurus → index terms
Lexical Analysis/Tokenization of Text
• Change the text of the documents into words to be adopted as
  index terms
• Objective: identify words in the text
  – Handle digits, hyphens, punctuation marks, and the case of letters
  – Numbers are usually not good index terms (like 1910, 1999);
    but some, like 510 B.C., are unique
  – Hyphens: break up hyphenated words (e.g. state-of-the-art =
    state of the art), but some words, e.g. gilt-edged, B-49, are
    unique words which require hyphens
  – Punctuation marks: remove totally unless significant,
    e.g. in program code: x.exe versus xexe
  – Case of letters: usually not important; convert all to upper or
    lower case
Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Input: "Friends, Romans and Countrymen"
• Output: tokens (a token is an instance of a sequence of characters
  that are grouped together as a useful semantic unit for processing)
  – Friends
  – Romans
  – and
  – Countrymen
• Each such token is now a candidate for an index entry, after
  further processing
• But which tokens are valid, and which should be omitted?
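A minimal tokenizer sketch in Python. Unlike the slide's output, it also case-folds and drops punctuation, anticipating the "simplest approach" discussed a few slides later; the regular expression is an illustrative choice.

import re

def tokenize(text):
    """Split text into lowercase alphabetic tokens, dropping punctuation
    and numbers. Real tokenizers must also handle hyphens, apostrophes,
    dates, etc., as discussed below."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Friends, Romans and Countrymen"))
# ['friends', 'romans', 'and', 'countrymen']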
Issues in Tokenization
• One word or multiple: How do you decide it is
one token or two or more?
–Hewlett-Packard  Hewlett and Packard as two
tokens?
• state-of-the-art: break up hyphenated sequence.
• San Francisco, Los Angeles
• Addis Ababa, Arba Minch
–lowercase, lower-case, lower case ?
• data base, database, data-base

• Numbers:
• dates (3/12/91 vs. Mar. 12, 1991);
• phone numbers,
• IP addresses (100.2.86.144)
Issues in Tokenization
• How to handle special cases involving apostrophes, hyphens, etc.?
  C++, C#, URLs, emails, …
  – Sometimes punctuation (e-mail), numbers (1999), and case
    (Republican vs. republican) can be a meaningful part of a token.
  – However, frequently they are not.
• The simplest approach is to ignore all numbers and punctuation and
  use only case-insensitive unbroken strings of alphabetic
  characters as tokens.
  – Generally, don't index numbers as text, but they are often very
    useful. Systems will often index "meta-data" (creation date,
    format, etc.) separately.
• Issues of tokenization are language-specific
  – This requires the language to be known
Exercise: Tokenization
• The cat slept peacefully in the living room.
It’s a very old cat.

• Mr. O’Neill thinks that the boys’ stories


about Chile’s capital aren’t amusing.
Elimination of Stopwords
• Stopwords are extremely common words across document collections
  that have no discriminatory power
  – They may occur in 80% of the documents in a collection.
  – They appear to be of little value in helping select documents
    matching a user need and need to be filtered out from the
    potential index terms
• Examples of stopwords are articles, pronouns, prepositions,
  conjunctions, etc.:
  – articles (a, an, the); pronouns (I, he, she, it, their, his)
  – some prepositions (on, of, in, about, besides, against, over),
  – conjunctions/connectors (and, but, for, nor, or, so, yet),
  – verbs (is, are, was, were),
  – adverbs (here, there, out, because, soon, after) and
  – adjectives (all, any, each, every, few, many, some)
  can also be treated as stopwords
• Stopwords are language dependent.
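A minimal stopword-removal sketch; the stop list here is a tiny illustrative set, not a standard list.

STOPWORDS = {"a", "an", "the", "and", "or", "of", "in", "is", "it", "to"}

def remove_stopwords(tokens, stoplist=STOPWORDS):
    """Filter out tokens that appear in the stop list."""
    return [t for t in tokens if t not in stoplist]

print(remove_stopwords(["the", "cat", "slept", "in", "the", "living", "room"]))
# ['cat', 'slept', 'living', 'room']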
Stopwords
• Intuition:
  – Stopwords have little semantic content; it is typical to remove
    such high-frequency words
  – Stopwords take up about 50% of the text; hence, document (index)
    size reduces by 30-50% after their removal

• Smaller indices for information retrieval
  – Good compression techniques for indices: the 30 most common
    words account for about 30% of the tokens in written text

• Better approximation of importance for classification,
  summarization, etc.
How to determine a list of stopwords?
• One method: sort terms by collection frequency (in decreasing
  order) and take the most frequent ones
  – Problem: in a collection about insurance practices, "insurance"
    would become a stop word
• Another method: build a stop word list that contains a set of
  articles, pronouns, etc.
  – Why do we need stop lists? With a stop list, we can compare the
    input text against it and exclude the commonest words from the
    index terms entirely.

• With the removal of stopwords, we get a better approximation of
  importance for classification, summarization, etc.
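A sketch of the first method above (frequency-based stop list construction); the cut-off of 30 terms is an illustrative choice, and as noted, the resulting list usually needs manual review.

from collections import Counter

def build_stoplist(docs, top_k=30):
    """Take the top_k terms by collection frequency as stopword candidates.
    Purely frequency-based lists can wrongly include domain terms
    (e.g. 'insurance' in an insurance collection)."""
    freqs = Counter(term for doc in docs for term in doc)
    return {term for term, _ in freqs.most_common(top_k)}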
Stop words
• Stop word elimination used to be standard in older IR
systems.
• But the trend is getting away from doing this. Most
web search engines index stop words:
– Good query optimization techniques mean you pay little at
query time for including stop words.
– You need stopwords for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
  – Elimination of stopwords might reduce recall (e.g. for "To be or
    not to be", if all terms except "be" are eliminated, retrieval
    may return nothing or only irrelevant documents)
Normalization
• Normalization is canonicalizing tokens so that matches occur
  despite superficial differences in the character sequences of the
  tokens. It is, in a way, a standardization of the text.
  – We need to "normalize" terms in the indexed text as well as
    query terms into the same form
  – Example: we want to match U.S.A. and USA by deleting periods in
    a term
• Case folding: often it is best to lowercase everything, since
  users will use lowercase regardless of "correct" capitalization…
  – Republican vs. republican
  – Fasil vs. fasil vs. FASIL
  – Anti-discriminatory vs. antidiscriminatory
  – Car vs. automobile?
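A minimal normalization sketch covering the two operations mentioned above, case folding and period deletion:

def normalize(term):
    """Canonicalize a term: case-fold and delete periods,
    so 'U.S.A.' and 'usa' map to the same form."""
    return term.lower().replace(".", "")

print(normalize("U.S.A."), normalize("Anti-discriminatory"))
# usa anti-discriminatory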
Normalization issues
• Good for:
  – Allowing instances of Automobile at the beginning of a sentence
    to match a query for automobile
  – Helping a search engine when most users type ferrari while they
    are interested in a Ferrari car
• Bad for:
  – Proper names vs. common nouns
    • e.g. General Motors, Associated Press, …
• Solution:
  – lowercase only words at the beginning of a sentence
• In IR, lowercasing everything is most practical because of the way
  users issue their queries
Stemming/Morphological analysis
• Stemming reduces tokens to their "root" form in order to recognize
  morphological variation.
  – The process involves removal of affixes (i.e. prefixes and
    suffixes) with the aim of reducing variants to the same stem
  – It often removes the inflectional and derivational morphology of
    a word
    • Inflectional morphology: varies the form of words in order to
      express grammatical features, such as singular/plural or
      past/present tense. E.g. boy → boys, cut → cutting.
    • Derivational morphology: makes new words from old ones. E.g.
      creation is formed from create, but they are two separate
      words. Similarly, destruction → destroy.
• Stemming is language dependent
  – Correct stemming is language specific and can be complex.
    For example, the text "compressed and compression are both
    accepted" stems to "compress and compress are both accept".
Stemming
• The final output of a conflation algorithm (one that reduces word
  variants to the same token) is a set of classes, one for each stem
  detected.
  – A stem: the portion of a word which is left after the removal of
    its affixes (i.e., prefixes and/or suffixes).
  – Example: 'connect' is the stem for {connected, connecting,
    connection, connections}
  – Thus, [automate, automatic, automation] all reduce to → automat
• A class name is assigned to a document if and only if one of its
  members occurs as a significant word in the text of the document.
  – A document representative then becomes a list of class names,
    which are often referred to as the document's index
    terms/keywords.
• Queries: queries are handled in the same way.
Ways to implement stemming
There are basically two ways to implement stemming.
– The first approach is to create a big dictionary that maps
words to their stems.
• The advantage of this approach is that it works perfectly (insofar as
the stem of a word can be defined perfectly); the disadvantages are
the space required by the dictionary and the investment required to
maintain the dictionary as new words appear.
– The second approach is to use a set of rules that extract
stems from words.
• The advantages of this approach are that the code is typically small,
and it can gracefully handle new words; the disadvantage is that it
occasionally makes mistakes.
• But, since stemming is imperfectly defined, anyway, occasional mis-
takes are tolerable, and the rule-based approach is the one that is
generally chosen.
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a word,
  leaving its stem.
  – Google, for instance, uses stemming to search for web pages
    containing the words connected, connecting, connection and
    connections when users ask for a web page that contains the word
    connect.
• In 1979, Martin Porter developed a stemming algorithm that uses a
  set of rules to extract stems from words, and though it makes some
  mistakes, most common words seem to work out right.
  – Porter describes his algorithm and provides a reference
    implementation in C at
    http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer
• It is the most common algorithm for stemming English words to
  their common grammatical root
• It uses a simple procedure for removing known affixes in English
  without using a dictionary. To get rid of plurals, the following
  rules are used:
  – SSES → SS         caresses → caress
  – IES  → I          ponies → poni
  – SS   → SS         caress → caress
  – S    → (nil)      cats → cat
  – EMENT → (nil)     (delete the final "ement" if what remains is
                       longer than 1 character)
                      replacement → replac
                      cement → cement
• While step 1a gets rid of plurals, step 1b removes -ed or -ing.
  – e.g.
    agreed → agree        disabled → disable
    matting → mat         mating → mate
    meeting → meet        milling → mill
    messing → mess        meetings → meet
    feed → feed
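A toy Python sketch of step 1a (plural removal) only; the real Porter algorithm adds measure conditions, the -ed/-ing rules of step 1b, and several further steps, so this is not a full stemmer.

def porter_step_1a(word):
    """Toy sketch of Porter step 1a (plural removal) only."""
    if word.endswith("sses"):
        return word[:-2]          # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"    # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))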
Stemming: challenges
• May produce unusual stems that are not English words:
  – e.g. removing 'UAL' from FACTUAL and EQUAL

• May conflate (reduce to the same token) words that are actually
  distinct.
  – "computer", "computational", "computation" are all reduced to
    the same token "comput"

• May not recognize all morphological derivations.
Thesauri
• Full-text searching alone is often not accurate, since different
  authors may select different words to represent the same concept
  – Problem: the same meaning can be expressed using different terms
    that are synonyms (a word or phrase which has the same or nearly
    the same meaning as another word or phrase in the same
    language), homonyms (a word that sounds the same or is spelled
    the same as another word but has a different meaning), and
    related terms
  – How can we ensure that, for the same meaning, identical terms
    are used in the index and in the query?
Thesauri
• Thesaurus: the vocabulary of a controlled indexing language,
  formally organized so that a priori relationships between concepts
  (for example "broader" and "related") are made explicit.
• A thesaurus contains terms and relationships between terms
  – IR thesauri typically rely upon the use of symbols such as
    USE/UF (UF = used for), BT (broader term), and RT (related term)
    to represent inter-term relationships.
  – e.g., car = automobile, truck, bus, taxi, motor vehicle
          color = colour, paint
Aim of Thesaurus
• A thesaurus tries to control the use of the vocabulary by showing
  a set of related words, to handle synonyms and homonyms
• The aims of a thesaurus are therefore:
  – to provide a standard vocabulary for indexing and searching
    • The thesaurus rewrites terms to form equivalence classes, and
      we index such equivalences
    • When a document contains automobile, index it under car as
      well (usually, also vice-versa)
  – to assist users with locating terms for proper query
    formulation: when the query contains automobile, look under car
    as well when expanding the query
  – to provide classified hierarchies that allow broadening and
    narrowing of the current request according to the user's needs
Thesaurus Construction
Example: a thesaurus built to assist IR for searching cars and vehicles:
Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport
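A minimal sketch of thesaurus-based query expansion; the tiny thesaurus dictionary below is hand-built for illustration and flattens the UF/BT/RT relations into simple equivalence sets.

# A tiny hand-built thesaurus, flattened into synonym sets for illustration.
THESAURUS = {
    "automobile": {"car", "motor vehicle"},
    "car": {"automobile", "motor vehicle"},
    "colour": {"color", "paint"},
}

def expand_query(terms, thesaurus=THESAURUS):
    """Add thesaurus equivalents of each query term to the query."""
    expanded = set(terms)
    for t in terms:
        expanded |= thesaurus.get(t, set())
    return expanded

print(expand_query({"automobile", "engine"}))
# {'automobile', 'car', 'motor vehicle', 'engine'}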
More Example
Example: thesaurus built to assist IR in the fields
of computer science:
TERM: natural languages
– UF natural language processing (UF=used for NLP)
– BT languages (BT=broader term is languages)
– NT languages (NT= Narrower related term)
– TT languages (TT = top term is languages)
– RT artificial intelligence (RT=related term/s)
computational linguistic
formal languages
query languages
speech recognition
Language-specificity
• Many of the above features embody trans-
formations that are
– Language-specific and
– Often, application-specific
• These are “plug-in” addenda to the index-
ing process
• Both open source and commercial plug-ins
are available for handling these
Text Compression
• Text compression is about finding ways to represent the text in
  fewer bits or bytes
• Advantages:
  – saves storage space
  – speeds up document transmission
  – takes less time to search the compressed text
• Common compression methods
  – Static methods: require statistical information about the
    frequency of occurrence of symbols in the document,
    e.g. Huffman coding
    • Estimate the probabilities of symbols and code them one at a
      time, with shorter codes for higher probabilities
  – Adaptive methods: construct the dictionary in the course of
    compression, e.g. Ziv-Lempel compression:
    • Replace words or symbols with a pointer to dictionary entries
Statistical methods: Huffman coding
• Developed in the 1950s by David Huffman; widely used for text
  compression, multimedia codecs and message transmission
• The problem: given a set of n symbols and their weights (or
  frequencies), construct a tree structure (a binary tree for a
  binary code) with the objective of reducing memory space and
  decoding time per symbol.
• Huffman coding is constructed based on the frequency of occurrence
  of letters in text documents
• [Figure: a small Huffman tree whose leaves are D1-D4, giving the
  codes D1 = 000, D2 = 001, D3 = 01, D4 = 1]
How to construct Huffman coding
Step 1: Create a forest of trees, one for each symbol t1, t2, …, tn
Step 2: Sort the forest of trees according to falling probabilities
        of symbol occurrence
Step 3: WHILE more than one tree exists DO
  – Merge the two trees t1 and t2 with the least probabilities p1 and p2
  – Label their root with the sum p1 + p2
  – Associate the binary code: 1 with the right branch and 0 with
    the left branch
Step 4: Create a unique codeword for each symbol by traversing the
        tree from the root to the leaf.
  – Concatenate all 0s and 1s encountered during the traversal

• The resulting tree has a probability of 1 at its root and the
  symbols at its leaf nodes.
Example
• Consider the 7-symbol alphabet given in the following table and
  construct the Huffman coding.

  Symbol   Probability
    a         0.05
    b         0.05
    c         0.1
    d         0.2
    e         0.3
    f         0.2
    g         0.1

• The Huffman encoding algorithm picks, at each step, the two
  symbols (or subtrees) with the smallest frequency and combines
  them.
Huffman code tree
• [Figure: the Huffman tree for the 7-symbol example. Leaves a and b
  merge into a node of weight 0.1, which merges with c into 0.2,
  then with g into 0.3, then with e into 0.6; d and f merge into
  0.4; the subtrees of weight 0.4 and 0.6 join at the root with
  probability 1.]
• Using the Huffman coding, a table can be constructed by working
  down the tree, left to right. This gives the binary equivalents
  for each symbol in terms of 1s and 0s.
• What is the Huffman binary representation for 'café'?
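A minimal Huffman-coding sketch using a priority queue, applied to the 7-symbol example above. Because ties between equal probabilities can be broken differently, the exact bits and even individual code lengths may differ from the slide's tree, but the average code length is the same optimal value. Since 'é' is not one of the seven symbols, the example encodes "cafe".

import heapq

def huffman_codes(freqs):
    """Build a Huffman code from a {symbol: probability} table."""
    # Each heap entry: (probability, tie-breaker, {symbol: code-so-far})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}   # 0 = left branch
        merged.update({s: "1" + c for s, c in right.items()})  # 1 = right
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(freqs)
print(codes)
print("cafe ->", "".join(codes[ch] for ch in "cafe"))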
Exercise
1. Given the following, apply the Huffman algorithm
to find an optimal binary code:
Character: a b c d e t

Frequency: 16 5 12 17 10 25

2. Given text: “for each rose, a rose is a rose”


– Construct the Huffman coding
Ziv-Lempel compression
• The problem with Huffman coding is that it requires knowledge
  about the data before encoding takes place.
  – Huffman coding requires the frequencies of symbol occurrence
    before codewords are assigned to symbols

• Ziv-Lempel compression
  – does not rely on previous knowledge about the data
  – rather, it builds this knowledge in the course of data
    transmission/data storage
  – the Ziv-Lempel algorithm (called LZ) uses a table of codewords
    created during data transmission;
    • each time, it replaces strings of characters with a reference
      to a previous occurrence of the string.
Lempel-Ziv Compression Algorithm
• The multi-symbol patterns are of the form C0C1…Cn-1Cn. The prefix
  of a pattern consists of all the pattern symbols except the last:
  C0C1…Cn-1

• Lempel-Ziv output: there are three options for assigning a code to
  each pattern in the list
  • If a one-symbol pattern is not in the dictionary, assign (0, symbol)
  • If a multi-symbol pattern is not in the dictionary, assign
    (dictionaryPrefixIndex, lastPatternSymbol)
  • If the last input symbol or the last pattern is in the
    dictionary, assign (dictionaryPrefixIndex, )
Example: LZ Compression
Encode (i.e., compress) the string ABBCBCABABCAABCAAB
using the LZ algorithm.

The compressed message is: (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B)


Note: The above is just a representation, the commas and parentheses
are not transmitted;
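A sketch of an LZ78-style encoder in Python that reproduces the pairs above; the output-pair convention follows the three options listed on the previous slide, and index 0 denotes the empty prefix.

def lz78_encode(text):
    """LZ78-style encoder: emit (dictionary prefix index, next symbol) pairs.
    Index 0 denotes the empty prefix; the dictionary is built as we go."""
    dictionary = {}          # pattern -> index (1-based)
    output = []
    pattern = ""
    for ch in text:
        if pattern + ch in dictionary:
            pattern += ch                      # keep extending the match
        else:
            prefix_index = dictionary.get(pattern, 0)
            output.append((prefix_index, ch))  # new pattern = prefix + ch
            dictionary[pattern + ch] = len(dictionary) + 1
            pattern = ""
    if pattern:                                # input ended inside a known pattern
        output.append((dictionary[pattern], ""))
    return output

print(lz78_encode("ABBCBCABABCAABCAAB"))
# [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A'), (4, 'A'), (6, 'B')]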
Example: Decompression
Decode (i.e., decompress) the sequence (0, A) (0, B) (2, C) (3, A)
(2, A) (4, A) (6, B)

The decompressed message is: ABBCBCABABCAABCAAB
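And a matching decoder sketch, which rebuilds the dictionary as it reads the pairs:

def lz78_decode(pairs):
    """Decode (prefix index, symbol) pairs produced by the encoder above."""
    dictionary = [""]                      # index 0 is the empty prefix
    out = []
    for prefix_index, symbol in pairs:
        pattern = dictionary[prefix_index] + symbol
        dictionary.append(pattern)         # the new pattern gets the next index
        out.append(pattern)
    return "".join(out)

pairs = [(0, "A"), (0, "B"), (2, "C"), (3, "A"), (2, "A"), (4, "A"), (6, "B")]
print(lz78_decode(pairs))   # ABBCBCABABCAABCAAB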
Exercise
Encode (i.e., compress) the following strings using the
Lempel-Ziv algorithm.

1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA.
