3. Text Processing
Characteristics
Kron
Text
• Text parsing
– Tokenization, terms
– A bit of linguistics
• Text characteristics
– Zipf's law
A Typical Web Search Engine
(Diagram: Web → Crawler → Text processing → Indexer → Index; Users ↔ Interface ↔ Query Engine ↔ Index)
Focus on documents
• Decide what constitutes an individual document
– Can vary depending on the problem
• Documents are the basic units to be indexed; each consists of a sequence of
tokens or terms.
• Terms (derived from tokens) are words or roots of words, semantic units, or
phrases; they are the atoms of indexing.
• Repositories (databases) and corpora are collections of documents.
• A query is a request for documents on a given topic.
Building an index
• Collect documents to be indexed
– Create your corpus
• Tokenize the text
• Linguistic processing
• Build the inverted index from terms
What is a Document?
• A document is a digital object
– Indexable
• Can be queried and potentially retrieved.
• Types:
– Written vs Spoken
– General vs Specialized
– Monolingual vs Multilingual
• e.g. Parallel, Comparable
– Synchronic (at a particular point in time) vs Diachronic (over time)
– Annotated vs Unannotated
– Indexed vs Unindexed
– Static vs Dynamic
Text Processing
• Standard Steps:
– Recognize document structure
• titles, sections, paragraphs, etc.
– Break into tokens (a type of markup)
• Tokens are delimited text
– "Hello, how are you."
– _hello_,_how_are_you_._
• usually delimited by spaces and punctuation
• special issues with Asian languages
– Stemming/morphological analysis
– What is left are terms
– Store terms in an inverted index
• Lexical analysis is the process of converting a sequence of
characters into a sequence of tokens.
– A program or function which performs lexical analysis is called a lexical
analyzer, lexer or scanner.
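As a minimal sketch of such a lexer (assuming English text where spaces and punctuation delimit tokens, as described above):

```python
import re

def tokenize(text):
    """Lexical analysis: convert a character sequence into a token sequence.
    Whitespace and punctuation act as delimiters; tokens are lowercased."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Hello, how are you.")
print(tokens)  # ['hello', 'how', 'are', 'you']
```

A real scanner would also have to handle markup, hyphenation, and languages without explicit word delimiters.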
Basic indexing pipeline
Documents to be indexed: "Friends, Romans, countrymen."
↓ Tokenizer
Token stream: Friends Romans Countrymen
↓ Linguistic modules
Modified tokens (terms): friend roman countryman
↓ Indexer
Inverted index: friend → 2, 4; roman → 1, 2; countryman → 13, 16
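The pipeline can be sketched in a few lines; the normalization table here is a stand-in for real linguistic modules, and the document IDs are illustrative:

```python
from collections import defaultdict

# Illustrative lookup table standing in for the linguistic modules
NORMALIZE = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def build_index(docs):
    """Build an inverted index mapping each term to a sorted list of doc IDs."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().replace(",", " ").replace(".", " ").split():
            term = NORMALIZE.get(token, token)  # tokens -> terms
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

docs = {2: "Friends, Romans, countrymen.", 4: "my friends"}
idx = build_index(docs)
print(idx["friend"])  # [2, 4]
```

Storing sorted postings lists is what makes later query-time merging efficient.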
Parsing a document
(lexical analysis)
• What format is it in?
– pdf/word/excel/html?
• What language is it in?
• What character set is in use?
Each of these is a classification problem
which can be solved using heuristics or
Machine Learning methods.
But there are complications …
Format/language stripping
• Documents being indexed can include docs from
many different languages
– A single index may have to contain terms of several
languages.
• Sometimes a document or its components can
contain multiple languages/formats
– French email with a Portuguese pdf attachment.
• What is a unit document?
– An email?
– With attachments?
– An email with a zip containing documents?
Document preprocessing
• Convert byte sequences into a linear sequence of
characters
• Trivial with ASCII, but not with Unicode and other encodings
– Use ML classifiers or heuristics.
7 月 30 日 vs. 7/30
• Character-level alphabet detection and
conversion
– Tokenization is not separable from this step.
– Sometimes ambiguous: in "Morgen will ich in MIT …", is "MIT" the German word "mit"?
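One crude heuristic along these lines (the fallback order is an assumption, not a real classifier): attempt a strict UTF-8 decode, then fall back to Latin-1, which never fails but can silently mis-decode:

```python
def to_text(raw: bytes) -> str:
    """Convert a byte sequence into a linear sequence of characters.
    Heuristic: assume UTF-8; fall back to Latin-1 on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")  # always succeeds, but may be wrong

print(to_text("Morgen will ich in MIT".encode("utf-8")))
```

Production systems instead use trained charset detectors, since the fallback above will happily mis-interpret, say, Windows-1251 Cyrillic text.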
Stop Lists
• Very common words, such as of, and, the, are rarely of use
in information retrieval.
• A stop list is a list of such words that are removed during
lexical analysis.
• A long stop list saves space in indexes, speeds processing,
and eliminates many false hits.
• However, common words are sometimes significant in
information retrieval, which is an argument for a short stop
list. (Consider the query, "To be or not to be?")
Suggestions for Including
Words in a Stop List
• Include the most common words in the English
language (perhaps 50 to 250 words).
• Do not include words that might be important for
retrieval (Among the 200 most frequently
occurring words in general literature in English
are time, war, home, life, water, and world).
• In addition, include words that are very common
in context (e.g., computer, information, system in a
set of computing documents).
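Following these suggestions, stop-word removal is a simple filter; the stop list below is an illustrative fragment, not a recommended list:

```python
# Illustrative short stop list (50 to 250 words would be typical)
STOP = {"the", "of", "and", "to", "a", "in", "is", "are", "be", "or", "not"}

def remove_stopwords(tokens):
    """Drop stop-list words during lexical analysis."""
    return [t for t in tokens if t not in STOP]

print(remove_stopwords(["time", "of", "war"]))
# ['time', 'war']
print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# []  -- the whole query vanishes: the argument for a short stop list
```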
Example: the WAIS stop list
(first 84 of 363 multi-letter words)
about above according across actually adj
after afterwards again against all almost
alone along already also although always
among amongst an another any anyhow
anyone anything anywhere are aren't around
at be became because become
becomes becoming been before beforehand begin
beginning behind being below beside besides
between beyond billion both but by
can can't cannot caption co
could couldn't
did didn't do does doesn't don't
down during each eg eight eighty
either else elsewhere end ending enough
Stop list policies
How many words should be in the stop list?
• Long list lowers recall
Which words should be in list?
• Some common words may have retrieval importance:
-- war, home, life, water, world
• In certain domains, some words are very common:
-- computer, program, source, machine, language
There is very little systematic evidence to use in selecting
a stop list.
Stop Lists in Practice
text document → break into tokens → tokens → stop list* → non-stoplist tokens
→ stemming* → stemmed terms → term weighting* → terms
(numbers and field numbers are carried along; * indicates an optional operation)
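This flow can be sketched with the starred operations as optional flags (the "stemmer" here is a one-rule placeholder, not a real morphological analyzer):

```python
def process(tokens, stop_list=None, stem=False):
    """tokens -> (non-stoplist tokens) -> (stemmed) terms.
    The stop-list and stemming steps are optional, as in the diagram."""
    if stop_list is not None:
        tokens = [t for t in tokens if t not in stop_list]
    if stem:
        # Toy one-rule stemmer: strip a trailing 's'
        tokens = [t[:-1] if t.endswith("s") else t for t in tokens]
    return tokens

terms = process(["the", "programs", "run"], stop_list={"the"}, stem=True)
print(terms)  # ['program', 'run']
```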
• Zipf's law as a formula: the frequency f of the r-th most common term satisfies
f ≈ C × (1/r), where C ≈ N/10 (N = total number of token occurrences)
• Another way to state this is with an approximately correct rule of thumb:
– Say the most common term occurs C times
– The second most common occurs C/2 times
– The third most common occurs C/3 times
– …
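Plugging in a hypothetical C = 1000 shows the rule of thumb numerically:

```python
C = 1000  # hypothetical count of the most common term

# Frequency of the r-th most common term is roughly C / r
freqs = [C // r for r in range(1, 6)]
print(freqs)  # [1000, 500, 333, 250, 200]
```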
Zipf Distribution
(linear and log scale)
What Kinds of Data Exhibit a
Zipf Distribution?
• Words in a text collection
– Virtually any language usage
• Library book checkout patterns
• Incoming Web Page Requests (Nielsen)
• Outgoing Web Page Requests (Cunha &
Crovella)
• Document Size on Web (Cunha & Crovella)
• Many sales with certain retailers
Power Laws
Power Law Statistics - problems with means
Power-law distributions
• The degree distributions of most real-life networks follow a power law
p(k) = C k^(-α)
• Right-skewed/heavy-tailed distribution
– there is a non-negligible fraction of nodes with very high degree (hubs)
– scale-free: no characteristic scale; the average is not informative
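A tiny illustration of why the average is not informative for heavy-tailed data (a hypothetical degree sequence with a single hub):

```python
# Hypothetical network: 99 low-degree nodes plus one hub
degrees = [1] * 99 + [1000]

mean = sum(degrees) / len(degrees)           # 10.99: typical of no node
median = sorted(degrees)[len(degrees) // 2]  # 1: what most nodes look like

print(mean, median)
```

The mean is dragged far above the median by a single hub, which is exactly the "problems with means" that heavy tails cause.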
The phrase The Long Tail, as a proper noun, was first coined by Chris
Anderson. The concept drew in part from an influential February 2003
essay by Clay Shirky, "Power Laws, Weblogs and Inequality" that noted
that a relative handful of weblogs have many links going into them but
"the long tail" of millions of weblogs may have only a handful of links
going into them. Beginning in a series of speeches in early 2004 and
culminating with the publication of a Wired magazine article in October
2004, Anderson described the effects of the long tail on current and
future business models. Anderson later extended it into the book The
Long Tail: Why the Future of Business is Selling Less of More (2006).
Anderson argued that products that are in low demand or have low sales
volume can collectively make up a market share that rivals or exceeds the
relatively few current bestsellers and blockbusters, if the store or
distribution channel is large enough. Examples of such mega-stores
include the online retailer Amazon.com and the online video rental
service Netflix. The Long Tail is a potential market and, as the examples
illustrate, the distribution and sales channel opportunities created by the
Internet often enable businesses to tap into that market successfully.
Word Frequency vs. Resolving Power
The most frequent words are not the most descriptive. (van Rijsbergen, 1979)
Consequences of Zipf for IR
• There are always a few very frequent tokens
that are not good discriminators.
– Called “stop words” in IR
– Usually correspond to linguistic notion of
“closed-class” words
• English examples: to, from, on, and, the, ...
• Grammatical classes that don’t take on new members.
• There are always a large number of tokens
that occur once and can mess up algorithms.
• Medium-frequency words are the most descriptive
Text
• Perform lexical analysis: process text into tokens
– Many issues: normalization, lemmatization
• Stemming reduces the number of tokens
– The Porter stemmer is the most common
• Stop words are removed to improve performance
• What remains are the terms to be indexed
• Text follows a power-law distribution
– Words with resolving power lie in the middle and
tail of the distribution
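As a toy illustration of suffix stripping, far simpler than the actual Porter algorithm:

```python
def stem(token):
    """Strip a few common English suffixes (illustrative only, not Porter).
    Keeps at least a 3-character stem to avoid mangling short words."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print([stem(t) for t in ["friends", "walking", "indexed"]])
# ['friend', 'walk', 'index']
```

The real Porter stemmer applies ordered rule phases with measure conditions on the stem; the sketch above only conveys the idea of conflating surface forms to a common term.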