IR Ch23 Text Representation
CS444 INFORMATION RETRIVAL & WEB SEARCH ENGIN BY ZAINAB AHMED MOHAMMED 2
IR System
IR: Relevance
Relevance is a subjective judgment and may include:
◦ Being on the proper subject.
◦ Being timely (recent information).
◦ Being authoritative (from a trusted source).
◦ Satisfying the goals of the user and his/her intended use of the information
(information need).
IR System Architecture
[Figure: IR system architecture. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index (inverted file), Text Database. Flows: user need, user feedback, text, logical view, query, retrieved docs, ranked docs.]
IR System Components
Text Operations form index words (tokens).
◦ Stopword removal
◦ Stemming
Indexing constructs an inverted index of word to document pointers.
Searching retrieves documents that contain a given query token from the
inverted index.
Ranking scores all retrieved documents according to a relevance metric.
IR System Components (continued)
User Interface manages interaction with the user:
◦ Query input and document output.
◦ Relevance feedback.
◦ Visualization of results.
Query Operations transform the query to improve retrieval:
◦ Query expansion using a thesaurus.
◦ Query transformation using relevance feedback.
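As an illustration of thesaurus-based query expansion, here is a minimal sketch. The `THESAURUS` dictionary and the `expand_query` function are hypothetical names introduced for this example; a real system would load a curated thesaurus rather than a hand-written dictionary.

```python
# Hypothetical toy thesaurus; the entries are illustrative assumptions.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}


def expand_query(terms, thesaurus=THESAURUS):
    """Return the original terms followed by any synonyms not already present."""
    expanded = list(terms)
    for term in terms:
        for synonym in thesaurus.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


print(expand_query(["cheap", "car"]))  # ['cheap', 'car', 'automobile', 'vehicle']
```

The expanded query can match documents that use "automobile" even though the user typed "car".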
IR system: Performance
System: returns documents (or not)
User: finds document relevant (or not)
Precision: fraction of retrieved docs that are relevant to the user’s information need
Recall: fraction of relevant docs in the collection that are retrieved
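The two definitions can be computed directly from sets of document IDs. The sketch below is a minimal illustration; the function name `precision_recall` and the document IDs are assumptions made for this example.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical case: 4 documents returned, 5 relevant documents exist, 2 overlap.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5", "d6", "d7"})
print(p, r)  # 0.5 0.4
```

Precision here is 2/4 and recall is 2/5: the system returned few irrelevant documents relative to what it found, but missed most of the relevant ones.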
Intelligent IR
Taking into account the meaning of the words used.
Taking into account the order of words in the query.
Adapting to the user based on direct or indirect feedback.
Taking into account the authority of the source.
Other IR-Related Tasks
Automated document categorization
Information filtering (spam filtering)
Information routing
Automated document clustering
Recommending information or products
Information extraction
Information integration
Question answering
CS444: Information Retrieval
and Web Search
Spring 2020
CHAPTER 3:
IR MODEL: INFORMATION REPRESENTATION
IR System: Information Representation
A fundamental component of an IR system is the representation of the information
Information is represented so that it can be processed
Information retrieval comprises many subfields, such as:
text retrieval,
image retrieval,
speech retrieval,
information generation,
query answering,
and text summarization
IR System: Text-Based Modeling
Modeling in IR is a complex process aimed at producing a ranking function
A ranking is an ordering of the documents that (hopefully) reflects their relevance to a user query
Ranking function: a function that assigns scores to documents with regard to a given query
An IR system has to deal with the problem of predicting which documents the users will find relevant
This problem naturally embodies a degree of uncertainty, or vagueness
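To make the idea of a ranking function concrete, here is a deliberately simple sketch. It scores a document by the number of distinct query terms it contains; this scoring rule, the function names `score` and `rank`, and the sample documents are assumptions for illustration, not the model used in the course.

```python
def score(query_terms, doc_terms):
    """Toy ranking function: number of distinct query terms found in the document."""
    return len(set(query_terms) & set(doc_terms))


def rank(query_terms, docs):
    """Return (doc_id, score) pairs ordered from highest to lowest score."""
    scored = [(doc_id, score(query_terms, terms)) for doc_id, terms in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical collection of already tokenized documents.
docs = {
    "d1": ["web", "search", "engine"],
    "d2": ["information", "retrieval", "models"],
    "d3": ["information", "theory"],
}
print(rank(["information", "retrieval"], docs))
# [('d2', 2), ('d3', 1), ('d1', 0)]
```

The ordering is only a prediction of relevance: "d3" matches one query term but may still be irrelevant to the user, which is the uncertainty described above.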
IR System: The Process
IR System: The Model
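An IR model is a quadruple [D, Q, F, R(qi, dj)] where:
◦ D is a set of logical views (representations) of the documents in the collection
◦ Q is a set of logical views (representations) of the user information needs, called queries
◦ F is a framework for modeling document representations, queries, and their relationships
◦ R(qi, dj) is a ranking function that assigns a real number to a query qi in Q and a document dj in D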
Abstraction of search engine architecture
[Figure: abstraction of a search engine architecture. Labels: Crawler, Doc Analyzer, Doc Representation, Indexed corpus, (Query), Query Rep, Ranking procedure, User, Feedback, Evaluation.]
IR System: Text Operation (Preprocessing)
In IR, we most often use unstructured text representations
Text is represented as an unordered set of terms (i.e., bag of words)
However, many details about the exact representation are still undefined
How do we “split” text into terms?
Can this be done in more than one way?
Do we consider all terms, or do we want to eliminate some?
E.g., function words that have little meaning, like articles and prepositions?
How do we treat different forms of the same word?
E.g., should “house” be treated the same as “houses”? What about “housing”?
What about synonyms or the same concepts in different languages?
On a more technical side: what about different document formats?
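A bag-of-words representation can be sketched in a few lines. The example below assumes the simplest possible choices (lower-casing and splitting on whitespace), which is only one of the many ways to answer the questions above; `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter


def bag_of_words(text):
    """Map each lower-cased, whitespace-separated token to its count; order is discarded."""
    return Counter(text.lower().split())


a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")
print(a["the"])  # 2
print(a == b)    # True
```

The two sentences mean different things but get the same representation, which shows what is lost when word order is ignored.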
Text Preprocessing
The preprocessing (i.e., preparing text for the retrieval process) usually involves the following
steps:
Extracting pure textual content (e.g., from HTML, PDF, Word)
Language detection
◦ Optional: only needed if you are dealing with multilingual document collections
Tokenization (separating text into character sequences)
Morphological normalization (lemmatization or stemming)
Stopword removal
Text Preprocessing: Definitions
Word is a delimited string of characters as it appears in the text
Term is a normalized form of the word (accounting for morphology, spelling, etc.)
Word and term are in the same equivalence class; in informal speech they are often used interchangeably
Tokenization is a process, typically automated, of breaking down the text (one long string) into a
sequence of tokens (shorter strings)
Stemming is the procedure of reducing the word to its grammatical root
Stop-words are very common words (such as “of”, “the”, and “and”) that carry little content and are therefore rarely useful in IR
Text Preprocessing: tokenization
Two types of methods for tokenization
Rule-based (i.e., heuristic)
Based on supervised machine learning models
Learn from manually tokenized texts
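A rule-based tokenizer can be as small as one regular expression. The pattern below is an assumed heuristic for English text (runs of letters or digits, optionally joined by an apostrophe or hyphen); `TOKEN_PATTERN` and `tokenize` are names chosen for this sketch.

```python
import re

# Heuristic rule: a token is a run of letters/digits, optionally continued
# by an apostrophe or hyphen followed by more letters/digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into tokens using the heuristic pattern above."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("State-of-the-art IR isn't easy, is it?"))
# ['State-of-the-art', 'IR', "isn't", 'easy', 'is', 'it']
```

A different rule (for example, splitting on hyphens) would give a different token sequence, which is why the choice of tokenizer has to be the same for documents and queries.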
Text Preprocessing: normalization
Normalization or standardization can involve various changes to the token
Error/spelling correction: repairing the incorrect word
Case-folding: converting all letters to lower case
Often best to lower case everything (queries and documents)
Stop-word removal: dropping very common words, such as “of”, “the”, and “and”, that are rarely useful in IR
Removing stop-words may:
Save space
Speed up processing
Eliminate many false hits
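Case-folding and stop-word removal can be sketched as follows. The `STOP_WORDS` set is a small illustrative list assumed for this example, not a standard one, and `normalize` is a hypothetical helper name.

```python
# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"of", "the", "and", "a", "an", "in", "to", "is"}


def normalize(tokens, stop_words=STOP_WORDS):
    """Lower-case every token, then drop the ones that are stop-words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in stop_words]


print(normalize(["The", "History", "of", "Information", "Retrieval"]))
# ['history', 'information', 'retrieval']
```

Case-folding is applied first so that "The" and "the" are both recognized as the same stop-word.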
Text Preprocessing: normalization
Stemming is the procedure of reducing the word to its grammatical (morpho-syntactic) root
The result of stemming is not necessarily a valid word of the language
E.g., “recognized” -> “recogniz”, “incredibly” -> “incredibl”
Stemming removes suffixes with heuristics
E.g., “automates”, “automatic”, “automation” will all be reduced to “automat”
Stemming is “more aggressive” than lemmatization and “less aggressive” than normalization
“More aggressive” means more different words are normalized to the same form
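The suffix-stripping idea can be illustrated with a toy stemmer. The suffix list and the minimum stem length below are assumptions chosen so that the examples above come out as stated; real stemmers such as the Porter stemmer use far more elaborate rules and may produce different stems.

```python
# Hypothetical rule list, checked in order; not the rules of any real stemmer.
SUFFIXES = ("ion", "ic", "ed", "es", "y", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["recognized", "incredibly", "automates", "automatic", "automation"]:
    print(word, "->", toy_stem(word))
# recognized -> recogniz
# incredibly -> incredibl
# automates -> automat
# automatic -> automat
# automation -> automat
```

None of the outputs is a valid English word, yet the three "automat" forms now share one index term.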
Preprocessing Summary
IR Model: Text (Query and Document) Representation
Text document retrieval is the most traditional subfield of IR
IR System: Text (Queries) Representation
Queries are the representation of information needs of a user.
Queries can be characterized by the structure and semantics of text.
An information need can be of three types:
◦ known item information need (users search or verify the existence of documents they know)
◦ conscious information need (users search for documents they do not know, but they know the subject)
◦ confused information need (users know neither the documents nor the subject)
IR Model: Text (Document) Representation
A document is the representation of the information the author wished to encode
A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation marks.
The individual words and symbols are known as tokens or terms
An IR system deals with documents in three types of data format:
Structured data:
is bound to a predefined data model and is therefore straightforward to analyze.
e.g. Excel files or SQL databases.
Unstructured data:
either does not have a predefined data model or is not organized in a predefined manner.
e.g. audio and video files or NoSQL databases.
Semi-structured data:
is known as a self-describing structure that has tags and markers.
e.g. JSON and XML
Any text (document) can be characterized by using four attributes:
syntax, structure, semantics, and style.
IR System: Text (Document) Representation
Document representation classes are:
Stream of characters: text is represented as a stream of characters and no interpretation is made of its structure or semantic content.
Structural: the main idea is to enrich documents with additional information that allows a computer to make part of the semantic content explicit (e.g., XML).
Latent semantic
Fuzzy subset
N-grams
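The n-gram class can be illustrated with a short sketch. The slide does not say whether character or word n-grams are meant; the example assumes character n-grams with n = 3, and `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("retrieval"))
# ['ret', 'etr', 'tri', 'rie', 'iev', 'eva', 'val']
```

A document is then represented by its collection of n-grams instead of whole words, which makes the representation tolerant of small spelling variations.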
What we have now (the Indexer)
Documents have been
◦ Crawled from Web
◦ Tokenized/normalized
◦ Represented as Bag-of-Words (the incidence matrices)
Let’s do search!
◦ Query: “information retrieval”
Complexity analysis
Space complexity analysis
◦ 𝑂(𝐷 ∗ 𝑉)
◦ D is total number of documents and V is vocabulary size
◦ By Zipf’s law, each document contains only a small fraction of the vocabulary (about 10%)
◦ So about 90% of the space is wasted!
◦ Space efficiency can be greatly improved by storing only the words that actually occur in each document
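The space argument can be checked on a tiny example. The three documents below are invented for illustration; the sketch compares the number of cells in a dense term-document incidence matrix with the number of entries kept when only occurring words are stored.

```python
# Hypothetical, already tokenized collection.
docs = {
    "Doc1": ["information", "retrieval", "is", "helpful"],
    "Doc2": ["web", "search", "engine"],
    "Doc3": ["information", "need"],
}
vocabulary = sorted({term for terms in docs.values() for term in terms})

# Dense incidence matrix: one 0/1 cell per (document, term) pair, i.e. D * V cells.
dense = [[1 if term in terms else 0 for term in vocabulary] for terms in docs.values()]
dense_cells = len(docs) * len(vocabulary)

# Sparse alternative: keep only the terms that occur in each document.
sparse = {doc_id: set(terms) for doc_id, terms in docs.items()}
sparse_entries = sum(len(terms) for terms in sparse.values())

print(dense_cells, sparse_entries)  # 24 9
```

Even in this small case most dense cells are zeros; the gap grows as the vocabulary gets larger relative to the length of each document.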
Complexity analysis
Time complexity analysis
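◦ 𝑂(|𝑞| ∗ 𝐷 ∗ |𝐷|)
◦ |𝑞| is the length of the query, 𝐷 is the number of documents, and |𝐷| is the length of a document
def linear_scan(q, D):
    """Return every document that contains at least one query term."""
    doclist = []
    for wi in q:                  # each query term
        for d in D:               # bottleneck: every document is scanned, yet most will not match
            for wj in d:          # each token of the document
                if wi == wj:
                    if d not in doclist:   # avoid adding the same document twice
                        doclist.append(d)
                    break         # stop scanning this document after the first match
    return doclist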
Solution: INVERTED INDEX
Build a look-up table for each word in vocabulary
◦ From word to find documents!
Dictionary      Postings
information     Doc1, Doc2
retrieval       Doc1
retrieved       Doc2
is              Doc1, Doc2
helpful         Doc1, Doc2
for             Doc1, Doc2
you             Doc2
everyone        Doc1
Query: information retrieval
Time complexity:
• 𝑂(|𝑞| ∗ |𝐿|)
• |𝐿| is the average length of a posting list
• By Zipf’s law, |𝐿| ≪ 𝐷
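The look-up table above can be built and queried with a short sketch. The two document texts are assumptions chosen so that they reproduce the postings shown on the slide; the function names are hypothetical, and the search uses "any query term matches" semantics.

```python
from collections import defaultdict

# Assumed document texts, chosen to reproduce the postings in the table.
docs = {
    "Doc1": "information retrieval is helpful for everyone",
    "Doc2": "helpful information is retrieved for you",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing at least one query term."""
    result = set()
    for term in query.split():
        result.update(index.get(term, []))  # one posting-list look-up per query term
    return sorted(result)


index = build_inverted_index(docs)
print(index["retrieval"])                      # ['Doc1']
print(search(index, "information retrieval"))  # ['Doc1', 'Doc2']
```

Each query term costs one dictionary look-up plus a pass over its posting list, so no document is touched unless it contains the term.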