
IR Ch23 Text Representation

The document discusses information representation in information retrieval systems. It describes how documents and queries are modeled, including preprocessing text through steps like tokenization, normalization, and removing stopwords. The goal is to transform natural language texts into representations that can be matched and compared by the retrieval system.

Uploaded by

Bushra Mamoud
Copyright
© All Rights Reserved

CS444: Information Retrieval and Web Search
Spring 2020
CHAPTER 2:
IR SYSTEM
IR System: Typical Task
Given:
◦ A corpus of textual natural-language documents.
◦ A user query in the form of a textual string.
Find:
◦ A ranked set of documents that are relevant to the query.

CS444 INFORMATION RETRIVAL & WEB SEARCH ENGIN BY ZAINAB AHMED MOHAMMED 2
IR System

IR: Relevance
Relevance is a subjective judgment and may include:
◦ Being on the proper subject.
◦ Being timely (recent information).
◦ Being authoritative (from a trusted source).
◦ Satisfying the goals of the user and his/her intended use of the information
(information need).

IR System Architecture
[Figure: IR system architecture. Components shown: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index (inverted file), and Text Database. Labelled flows: User Need, User Feedback, Text, Logical View, Query, Retrieved Docs, Ranked Docs.]
IR System Components
Text Operations form index words (tokens).
◦ Stopword removal
◦ Stemming
Indexing constructs an inverted index of word-to-document pointers.
Searching retrieves documents that contain a given query token from the
inverted index.
Ranking scores all retrieved documents according to a relevance metric.

IR System Components (continued)
User Interface manages interaction with the user:
◦ Query input and document output.
◦ Relevance feedback.
◦ Visualization of results.
Query Operations transform the query to improve retrieval:
◦ Query expansion using a thesaurus.
◦ Query transformation using relevance feedback.

IR system: Performance
System: returns documents (or not)
User: finds document relevant (or not)

Precision: fraction of retrieved docs that are relevant to the user’s information need
Recall: fraction of relevant docs in the collection that are retrieved
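The two definitions above can be computed directly from two sets of document ids. The following is a minimal sketch, not part of the original slides; the function name and the example ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: set of document ids returned by the system
    relevant:  set of document ids the user judges relevant
    """
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 relevant documents in the collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, so the two measures differ.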

Intelligent IR
Taking into account the meaning of the words used.
Taking into account the order of words in the query.
Adapting to the user based on direct or indirect feedback.
Taking into account the authority of the source.

Other IR-Related Tasks
Automated document categorization
Information filtering (spam filtering)
Information routing
Automated document clustering
Recommending information or products
Information extraction
Information integration
Question answering

CHAPTER 3:
IR MODEL: INFORMATION REPRESENTATION
IR System: Information Representation
A fundamental component of an IR system is the representation of information
Information is represented so that it can be processed
Information retrieval comprises many subfields, such as:
◦ Text retrieval
◦ Image retrieval
◦ Speech retrieval
◦ Information generation
◦ Query answering
◦ Text summarization

IR System: Text-Based Modeling
Modeling in IR is a complex process aimed at producing a ranking function
◦ A ranking is an ordering of the documents that (hopefully) reflects their relevance to a user query
◦ Ranking function: a function that assigns scores to documents with regard to a given query

This modeling process consists of two main tasks:
◦ The conception of a logical framework for representing documents and queries
◦ The definition of a ranking function that quantifies the similarity between documents and queries

An IR system has to deal with the problem of predicting which documents the users will find relevant
◦ This problem naturally embodies a degree of uncertainty, or vagueness

IR System: The Process

IR System: The Model
An IR model is a quadruple [D, Q, F, R(qi, dj)] where:
◦ D is a set of logical views for the documents in the collection
◦ Q is a set of logical views for the user queries
◦ F is a framework for modeling documents and queries
◦ R(qi, dj) is a ranking function
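One way to make the quadruple concrete is sketched below. This is an illustration only: the slides do not fix a framework F, so the sketch assumes that logical views are sets of terms and uses a hypothetical `overlap` function as R(qi, dj). The class name, field names, and sample data are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

# Assumed framework F: every logical view is a set of terms.
View = FrozenSet[str]


@dataclass
class IRModel:
    D: Dict[str, View]                # logical views of the documents
    Q: Dict[str, View]                # logical views of the queries
    R: Callable[[View, View], float]  # ranking function R(q_i, d_j)

    def rank(self, query_id: str) -> List[str]:
        """Order documents by decreasing score, dropping zero scores."""
        q = self.Q[query_id]
        scored = [(self.R(q, d), doc_id) for doc_id, d in self.D.items()]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for score, doc_id in scored if score > 0]


def overlap(q: View, d: View) -> float:
    """Hypothetical ranking function: fraction of query terms found in the document."""
    return len(q & d) / len(q) if q else 0.0


model = IRModel(
    D={
        "d1": frozenset({"information", "retrieval"}),
        "d2": frozenset({"information", "extraction"}),
        "d3": frozenset({"web", "search"}),
    },
    Q={"q1": frozenset({"information", "retrieval"})},
    R=overlap,
)
ranking = model.rank("q1")
print(ranking)
```

Swapping in a different R (or a different kind of view) changes the model without touching the rest of the structure, which is the point of separating the four parts.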

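Abstraction of search engine architecture
[Figure: search engine architecture. Components shown: Crawler, Doc Analyzer, Doc Representation, Indexer, Index (together labelled "Indexed corpus"); User, (Query), Query Rep, Ranker, results (together labelled "Ranking procedure"); plus Feedback and Evaluation.]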
IR System: Text Operation (Preprocessing)
In IR, we most often use unstructured text representations
◦ Text is represented as an unordered set of terms (i.e., a bag of words)
However, many details about the exact representation are still undefined
◦ How do we "split" text into terms?
  ◦ Can this be done in more than one way?
◦ Do we consider all terms, or do we want to eliminate some?
  ◦ E.g., function words that carry little meaning, like articles and prepositions?
◦ How do we treat different forms of the same word?
  ◦ E.g., should "house" be treated the same as "houses"? What about "housing"?
◦ What about synonyms, or the same concepts in different languages?
◦ On a more technical side: what about different document formats?

Text Preprocessing
The preprocessing (i.e., preparing text for the retrieval process) usually involves the following steps:
◦ Extracting pure textual content (e.g., from HTML, PDF, Word)
◦ Language detection (optional: only needed for multilingual document collections)
◦ Tokenization (separating text into character sequences, i.e., tokens)
◦ Morphological normalization (lemmatization or stemming)
◦ Stopword removal

After preprocessing, the text (i.e., the document) is ready to be indexed
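The steps above can be chained into a small pipeline. The sketch below is a simplification under stated assumptions: text extraction is a crude regular expression that only handles simple HTML, language detection is skipped, normalization is case folding only, and the stop-word list is a tiny illustrative one. All function names are hypothetical.

```python
import html
import re

STOPWORDS = {"of", "the", "and", "a", "is", "to", "in"}  # tiny illustrative list


def extract_text(markup):
    """Crude text extraction: drop anything that looks like an HTML tag."""
    return html.unescape(re.sub(r"<[^>]+>", " ", markup))


def tokenize(text):
    """Rule-based tokenization: keep runs of ASCII letters or digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(token):
    """Case folding only; stemming or lemmatization would be applied here."""
    return token.lower()


def preprocess(markup):
    tokens = tokenize(extract_text(markup))
    terms = [normalize(t) for t in tokens]
    return [t for t in terms if t not in STOPWORDS]


terms = preprocess("<p>The <b>Retrieval</b> of Information is the goal.</p>")
print(terms)
```

The resulting list of terms is what would be handed to the indexer.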

Text Preprocessing: Definitions
Word is a delimited string of characters as it appears in the text
Term is a normalized form of the word (accounting for morphology, spelling, etc.)
◦ A word and its term are in the same equivalence class; in informal speech the two are often used interchangeably
Token is an instance of a word or term occurring in a document
◦ Tokens are "words" in the general sense
◦ But numbers, punctuation, and special characters are also tokens
Tokenization is a process, typically automated, of breaking down the text (one long string) into a sequence of tokens (shorter strings)
Stemming is the procedure of reducing a word to its grammatical root
Stop-words are very common words, such as "of", "the", and "and", that are rarely used for retrieval

Text Preprocessing: tokenization
Two types of methods for tokenization:
◦ Rule-based (i.e., heuristic)
◦ Based on supervised machine learning models
  ◦ Learn from manually tokenized texts

Tokenization might seem simple, but it is not always unambiguous
◦ E.g., a simple rule: split the string on all whitespace
  ◦ "Hewlett-Packard declared losses" -> "Hewlett-Packard", "declared", "losses"
  ◦ Would we want to split "Hewlett" from "Packard"? What about "lower-case"?
  ◦ What about "Denmark's mountains": "Denmark" and "'s", or "Denmarks", or "Denmark"?

What about tokenizing numbers and punctuation?
◦ "19/1/2017", "55 B.C.", "+49 176 832 40 332", "IP: 192.168.0.1"
◦ Sometimes a space does not indicate the end of a token
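To see how much the choice of rule matters, the sketch below runs two simple rule-based tokenizers over the examples above. Both functions are hypothetical illustrations, not a recommended tokenizer.

```python
import re


def whitespace_tokenize(text):
    """Rule 1: split on whitespace only."""
    return text.split()


def word_tokenize(text):
    """Rule 2: split on anything that is not an ASCII letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


ws_tokens = whitespace_tokenize("Hewlett-Packard declared losses")
re_tokens = word_tokenize("Hewlett-Packard declared losses")
phone_tokens = whitespace_tokenize("+49 176 832 40 332")
ip_tokens = word_tokenize("IP: 192.168.0.1")

print(ws_tokens)     # keeps "Hewlett-Packard" as one token
print(re_tokens)     # splits it into "Hewlett" and "Packard"
print(phone_tokens)  # the phone number falls apart into five tokens
print(ip_tokens)     # the IP address falls apart into its four numbers
```

Neither rule handles every case: the first keeps the company name together but breaks the phone number, and the second breaks both the company name and the IP address.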

Text Preprocessing: normalization
Normalization or standardization can involve various changes to the token
◦ Error/spelling correction: repairing the incorrect word
◦ Case-folding: converting all letters to lower case
  ◦ Often best to lower-case everything (queries and documents)

Morphological normalization (lemmatization)
◦ Reducing different forms of the "same" word to a common representative form
  ◦ Nouns: singular form in the "nominative" case
  ◦ Verbs: infinitive form
◦ Lemmatization reduces words to dictionary headword entries
  ◦ I.e., the resulting lemma is a string that is again a valid word in the language
  ◦ E.g., "houses" -> "house", "tried" -> "try"

Stop-words are very common words, such as "of", "the", and "and", that are rarely used for retrieval
Removing stop-words may:
◦ Save space
◦ Speed up processing
◦ Eliminate many false hits

But how about the query "to be or not to be", which consists almost entirely of stop-words?
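The problem with that query shows up as soon as case folding and stop-word removal are applied together. A minimal sketch, assuming a small hypothetical stop-word list (real lists differ):

```python
# Assumed stop-word list for illustration only.
STOPWORDS = {"to", "be", "or", "not", "of", "the", "and"}


def remove_stopwords(text):
    """Lower-case the text, split on whitespace, and drop stop-words."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


hamlet = remove_stopwords("To be or not to be")
history = remove_stopwords("The history of the Web")
print(hamlet)
print(history)
```

With this list the first query is reduced to nothing at all, so there is nothing left to look up, while the second keeps its content words.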

Text Preprocessing: normalization
Stemming is the procedure of reducing a word to its grammatical (morpho-syntactic) root
◦ The result of stemming is not necessarily a valid word of the language
  ◦ E.g., "recognized" -> "recogniz", "incredibly" -> "incredibl"
◦ Stemming removes suffixes with heuristics
  ◦ E.g., "automates", "automatic", "automation" will all be reduced to "automat"
Stemming is "more aggressive" than lemmatization and "less aggressive" than normalization
◦ "More aggressive" means that more different words are normalized to the same form

Stemming is more frequently used in IR systems than lemmatization
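The difference between the two approaches can be illustrated with a toy example. The sketch below is not the Porter stemmer or any published algorithm: the suffix list is an assumption chosen only to reproduce the examples on these slides, and the lemma dictionary holds just the two words mentioned earlier.

```python
# Assumed toy suffix list, checked in this order.
SUFFIXES = ("ion", "ic", "es", "ed", "y", "s")

# Assumed toy dictionary; a real lemmatizer uses a full lexicon.
LEMMAS = {"houses": "house", "tried": "try"}


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


stems = [stem(w) for w in ("automates", "automatic", "automation")]
print(stems)
print(stem("recognized"), stem("incredibly"))
print(stem("houses"), lemmatize("houses"))
```

The stemmer maps all three "automat-" words to the same string, and it turns "houses" into a string that is not a word, whereas the lemmatizer returns the dictionary form "house".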

Preprocessing Summary

IR Model: Text (Query and Document) Representation
Text document retrieval is the most traditional subfield of IR

A text information retrieval system is about representing queries and documents

IR System: Text (Queries) Representation
Queries are the representation of the information needs of a user.
Queries can be characterized by the structure and semantics of text.
An information need can be of three types:
◦ Known-item information need (users search for, or verify the existence of, documents they know)
◦ Conscious information need (users search for documents they do not know, but they know the subject)
◦ Confused information need (users know neither the documents nor the subject)

Query representation classes:
◦ Keyword-based
  ◦ This is the simplest form of a query.
  ◦ It is composed of keywords, and the documents containing those keywords are searched for.
  ◦ A keyword query can be:
    ◦ a single word
    ◦ a complex combination of Boolean operations (AND, OR) applied to several words
◦ Pattern-based
  ◦ Allows the specification of text having some properties.
◦ Structural
  ◦ A mechanism to improve the retrieval quality of structured information.
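For the keyword-based class, Boolean operators map naturally onto set operations over the documents that contain each word. The following is a minimal sketch with a hypothetical three-word collection; the function names and the data are assumptions made for illustration.

```python
# Hypothetical collection: for each word, the set of documents containing it.
POSTINGS = {
    "information": {"Doc1", "Doc2"},
    "retrieval": {"Doc1", "Doc3"},
    "web": {"Doc2", "Doc3"},
}


def docs(word):
    """Documents containing the word (empty set for unknown words)."""
    return POSTINGS.get(word, set())


def and_(a, b):
    return a & b  # documents matching both operands


def or_(a, b):
    return a | b  # documents matching at least one operand


single = docs("retrieval")
both = and_(docs("information"), docs("retrieval"))
# information AND (retrieval OR web)
combined = and_(docs("information"), or_(docs("retrieval"), docs("web")))
print(sorted(single), sorted(both), sorted(combined))
```

A single-word query is just one lookup; more complex queries are built by nesting the two operators.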

IR Model: Text (Document) Representation
A document is the representation of the information the author wished to encode
A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation marks.
The individual words and symbols are known as tokens or terms
An IR system deals with documents in three types of data format:
◦ Structured data
  ◦ Bound to a pre-defined data model and therefore straightforward to analyze.
  ◦ E.g., Excel files or SQL databases.
◦ Unstructured data
  ◦ Either does not have a pre-defined data model or is not organized in a pre-defined manner.
  ◦ E.g., audio and video files or NoSQL databases.
◦ Semi-structured data
  ◦ Has a self-describing structure with tags and markers.
  ◦ E.g., JSON and XML.
Any text (document) can be characterized by four attributes: syntax, structure, semantics, and style.

IR System: Text (Document) Representation
Document representation classes are:
◦ Stream of characters
  ◦ Text is represented as a stream of characters and no interpretation is made of its structure or semantic content.
◦ Vector space (next lectures)
  ◦ Each document is described by a vector of components that are representative of the semantic content of the document.
  ◦ A term that appears in a document describes the content of that document
  ◦ Vector representations can be categorized as:
    ◦ Binary: the text document is represented as a binary vector of terms. Each element of the vector represents a term, and its value is '1' if the term appears in the document, '0' otherwise. (Indexing: incidence matrices and inverted index)
    ◦ Weighted: the element values are real numbers between 0 and 1, called term weights, which represent the affinity of the term with respect to the document. (Classic Model)
◦ Structural: the main idea is to enrich documents with additional information that allows a computer to make part of the semantic content explicit. (XML)
◦ Latent semantic
◦ Fuzzy subset
◦ N-grams
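The binary and weighted vector representations listed above can be sketched as follows. The weighting scheme here is an assumption made for illustration (term count divided by the count of the most frequent term, which keeps every weight between 0 and 1); the slides do not specify one. The vocabulary and tokens are hypothetical.

```python
from collections import Counter


def binary_vector(tokens, vocabulary):
    """1 if the vocabulary term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def weighted_vector(tokens, vocabulary):
    """Assumed weighting: count of the term / count of the most frequent term."""
    counts = Counter(tokens)
    top = max(counts.values(), default=1)
    return [counts[term] / top for term in vocabulary]


vocabulary = ["information", "retrieval", "web"]
tokens = ["information", "retrieval", "information", "system"]

binary = binary_vector(tokens, vocabulary)
weighted = weighted_vector(tokens, vocabulary)
print(binary)
print(weighted)
```

The binary vector only records presence, while the weighted vector also reflects that "information" occurs twice as often as "retrieval" in this document.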

What we have now (the Indexer): The Incidence Matrices
Documents have been
◦ Crawled from Web
◦ Tokenized/normalized
◦ Represented as Bag-of-Words
Let’s do search!
◦ Query: “information retrieval”

      information  retrieval  retrieved  is  helpful  for  you  everyone
Doc1  1            1          0          1   1        1    0    1
Doc2  1            0          1          1   1        1    1    0
Doc3  0            1          1          1   1        1    0    0
…     …            …          …          …   …        …    …    …
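A search over the incidence matrix can be sketched by checking the columns of the query words in every row. The rows below are copied from the table; the matching rule (a document must contain all query words) is an assumption, since the slide does not state one, and the function name is hypothetical.

```python
VOCAB = ["information", "retrieval", "retrieved", "is",
         "helpful", "for", "you", "everyone"]

# Rows of the incidence matrix, one per document.
MATRIX = {
    "Doc1": [1, 1, 0, 1, 1, 1, 0, 1],
    "Doc2": [1, 0, 1, 1, 1, 1, 1, 0],
    "Doc3": [0, 1, 1, 1, 1, 1, 0, 0],
}


def search(query):
    """Return documents whose row has a 1 for every query word (assumed AND)."""
    words = query.split()
    if any(w not in VOCAB for w in words):
        return []  # a word outside the vocabulary cannot match any document
    columns = [VOCAB.index(w) for w in words]
    return [doc for doc, row in MATRIX.items() if all(row[c] for c in columns)]


result = search("information retrieval")
print(result)
```

Only Doc1 has a 1 in both the "information" and the "retrieval" column.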

Complexity analysis
Space complexity analysis
◦ 𝑂(𝐷 ∗ 𝑉)
◦ D is the total number of documents and V is the vocabulary size
◦ Zipf’s law: most words are rare, so each document contains only a small part (about 10%) of the vocabulary
◦ About 90% of the space is wasted!
◦ Space efficiency can be greatly improved by storing only the words that occur

Solution: a linked list for each document
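The saving can be illustrated by converting dense rows into per-document lists that store only the words that occur. This is a sketch under assumptions: it uses Python lists in place of linked lists, and it reuses the three rows of the small incidence matrix from the earlier slide, where the vocabulary is so small that the saving is modest. With a realistic vocabulary most cells of each row would be zero.

```python
VOCAB = ["information", "retrieval", "retrieved", "is",
         "helpful", "for", "you", "everyone"]

dense = {
    "Doc1": [1, 1, 0, 1, 1, 1, 0, 1],
    "Doc2": [1, 0, 1, 1, 1, 1, 1, 0],
    "Doc3": [0, 1, 1, 1, 1, 1, 0, 0],
}


def to_sparse(row):
    """Keep only the words whose cell is 1."""
    return [VOCAB[i] for i, bit in enumerate(row) if bit]


sparse = {doc: to_sparse(row) for doc, row in dense.items()}

dense_cells = sum(len(row) for row in dense.values())       # D * V cells
sparse_cells = sum(len(words) for words in sparse.values())  # occurring words only
print(dense_cells, sparse_cells)
print(sparse["Doc3"])
```

The dense form always costs D * V cells, while the sparse form grows only with the number of distinct words each document actually contains.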

Complexity analysis
Time complexity analysis
◦ 𝑂(𝑞 ∗ 𝐷 ∗ |𝑑|)
◦ 𝑞 is the length of the query, 𝐷 is the number of documents, and |𝑑| is the length of a document

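def linear_scan(q, D):
    """q: list of query words; D: list of documents, each a list of words."""
    doclist = []
    for wi in q:
        for d in D:        # bottleneck, since most documents won't match!
            for wj in d:
                if wi == wj:
                    doclist.append(d)
                    break  # stop scanning d after the first match
    return doclist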

Solution: INVERTED INDEX
Build a look-up table for each word in vocabulary
◦ From word to find documents!

Dictionary    Postings
information   Doc1, Doc2
retrieval     Doc1
retrieved     Doc2
is            Doc1, Doc2
helpful       Doc1, Doc2
for           Doc1, Doc2
you           Doc2
everyone      Doc1

Query: information retrieval

Time complexity:
◦ 𝑂(𝑞 ∗ |𝐿|)
◦ |𝐿| is the average length of a posting list
◦ By Zipf’s law, |𝐿| ≪ 𝐷
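Building and querying such an index can be sketched in a few lines. The two documents below contain exactly the words listed in the postings above; the word order inside each document is an assumption, as is the matching rule (a document must appear in the posting list of every query word). Function names are hypothetical.

```python
from collections import defaultdict

# Bags of words consistent with the postings table (word order assumed).
DOCS = {
    "Doc1": ["information", "retrieval", "is", "helpful", "for", "everyone"],
    "Doc2": ["information", "retrieved", "is", "helpful", "for", "you"],
}


def build_index(docs):
    """Map each word to the sorted list of documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):
            index[term].append(doc_id)
    return dict(index)


def search(index, query):
    """Intersect the posting lists of the query words (assumed AND semantics)."""
    postings = [set(index.get(word, [])) for word in query.split()]
    return sorted(set.intersection(*postings)) if postings else []


index = build_index(DOCS)
result = search(index, "information retrieval")
print(index["information"], index["retrieval"])
print(result)
```

Each query word costs one dictionary lookup plus a pass over its posting list, so no document that lacks the word is ever touched.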
