module 7

The document discusses Information Retrieval (IR) and its various models, including the Boolean model and Vector Space Model, focusing on how documents and queries are represented and matched. It outlines the IR cycle, the concept of term weighting, and the importance of understanding user queries in relation to document terms. Additionally, it provides examples of text processing steps and retrieval methods to enhance search accuracy.

Information Retrieval and Extraction

First, nomenclature…
• Information retrieval (IR)
• Focus on textual information (= text/document retrieval)
• Other possibilities include image, video, music, …
• What do we search?
• Generically, “collections”
• Less-frequently used, “corpora”
• What do we find?
• Generically, “documents”
• Even though we may be referring to web pages, PDFs, PowerPoint slides,
paragraphs, etc.

Information Retrieval Cycle

[Figure: the IR cycle] Source selection yields a resource; query formulation yields a query; search yields results; selection yields documents; examination yields information, which is then delivered. Feedback loops run back through the cycle: system discovery, vocabulary discovery, concept discovery, and document discovery feed back into query formulation and search, and source reselection returns to source selection.

The Central Problem in Search

[Figure] An author and a searcher each start from concepts, but express them in different terms: the searcher's query terms (e.g., "tragic love story") must be matched against the author's document terms (e.g., "fateful star-crossed romance"). Do these represent the same concepts?

Abstract IR Architecture

[Figure] A query (processed online) and the documents (processed offline) each pass through a representation function, producing a query representation and document representations. The document representations are stored in an index; a comparison function matches the query representation against the index and returns hits.

How do we represent text?
• Remember: computers don’t “understand” anything!
• “Bag of words”
• Treat all the words in a document as index terms
• Assign a “weight” to each term based on “importance”
(or, in simplest case, presence/absence of word)
• Disregard order, structure, meaning, etc. of the words
• Simple, yet effective!
• Assumptions
• Term occurrence is independent
• Document relevance is independent
• “Words” are well-defined
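
A minimal sketch of this idea in Python (whitespace tokenization only; the function name is ours):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Treat every whitespace-delimited token as an index term and count it;
    # order, structure, and meaning are deliberately discarded.
    return Counter(text.lower().split())

print(bag_of_words("new fries taste like the old fries"))
# e.g. Counter({'fries': 2, 'new': 1, 'taste': 1, ...})
```
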
What's a word?
Tokenization is language-dependent, as these examples illustrate:

• Chinese: 天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。 ("Pope John Paul II was hospitalized again because of a cold; it is his second hospitalization this year for the same ailment.")
• Arabic: وقال مارك ريجيف - الناطق باسم الخارجية الإسرائيلية - إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982. ("Mark Regev, spokesman for the Israeli foreign ministry, said that Sharon accepted the invitation and will make his first visit to Tunisia, which was for a long time the official headquarters of the PLO after its departure from Lebanon in 1982.")
• Russian: Выступая в Мещанском суде Москвы, экс-глава ЮКОСа заявил, что не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России. ("Speaking in Moscow's Meshchansky court, the ex-head of YUKOS stated that he had done nothing illegal, contrary to what Russia's Prosecutor General accuses him of.")
• Hindi: भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है ("In its Economic Survey, the Government of India projects a seven percent growth rate for fiscal year 2005-06 and stresses tax reform.")
• Japanese: 日米連合で台頭中国に対処…アーミテージ前副長官提言 ("Deal with a rising China through a Japan-US alliance: a proposal by former Deputy Secretary Armitage.")
• Korean: 조재영 기자 = 서울시는 25일 이명박 시장이 '행정중심복합도시' 건설안에 대해 '군대라도 동원해 막고싶은 심정'이라고 말했다는 일부 언론의 보도를 부인했다. ("Reporter Cho Jae-young: Seoul City denied on the 25th some media reports that Mayor Lee Myung-bak had said he felt like blocking the multifunctional administrative city plan 'even if it took mobilizing the army'.")

Sample Document: "Bag of Words"

McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.

Resulting bag of words (top terms):
14 × McDonalds
12 × fat
11 × fries
8 × new
7 × french
6 × company, said, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…

Information retrieval models
• An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.
• Main models:
• Boolean model
• Vector space model
• Statistical language model
• etc.

Boolean model
• Each document or query is treated as a "bag" of words or terms. Word sequence is not considered.
• Given a collection of documents D, let V = {t_1, t_2, ..., t_|V|} be the set of distinct words/terms in the collection. V is called the vocabulary.
• A weight w_ij > 0 is associated with each term t_i of a document d_j ∈ D. For a term that does not appear in document d_j, w_ij = 0.
  d_j = (w_1j, w_2j, ..., w_|V|j)

Boolean model (contd)
• Query terms are combined logically using the Boolean
operators AND, OR, and NOT.
• E.g., ((data AND mining) AND (NOT text))
• Retrieval
• Given a Boolean query, the system retrieves every
document that makes the query logically true.
• Called exact match.
• The retrieval results are usually quite poor because
term frequency is not considered.

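An illustrative sketch (ours, not from the slides) of exact-match Boolean retrieval over an inverted index, using Python set algebra; the toy corpus echoes the running example later in this module:

```python
# Build an inverted index (term -> set of doc IDs), then evaluate
# ((data AND mining) AND (NOT text)) with set operations.
docs = {
    1: "data mining course",
    2: "text mining is a subfield of data mining",
    3: "mining text is interesting",
}

index: dict[str, set[int]] = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)
hits = (index.get("data", set()) & index.get("mining", set())) \
       & (all_ids - index.get("text", set()))
print(hits)  # {1} -- exact match: no ranking among retrieved documents
```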

Boolean queries: Exact match
• In the Boolean retrieval model, a query is a Boolean expression:
– Boolean queries use AND, OR and NOT to join query terms
• Views each document as a set of words
• Is precise: a document either matches the condition or it does not.
– Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight

Strengths and Weaknesses
• Strengths
• Precise, if you know the right strategies
• Precise, if you have an idea of what you’re looking for
• Implementations are fast and efficient
• Weaknesses
• Users must learn Boolean logic
• Boolean logic insufficient to capture the richness of language
• No control over size of result set: either too many hits or none
• When do you stop reading? All documents in the result set are considered
“equally good”
• What about partial matches? Documents that “don’t quite match” the query
may be useful also

Vector Space Model

[Figure: documents d1–d5 drawn as vectors in a space whose axes are terms t1, t2, t3; θ and φ mark angles between vectors]

Assumption: Documents that are "close together" in vector space "talk about" the same things.
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness").

Similarity Metric
• Use the "angle" between the vectors:

  cos(θ) = (d_j · d_k) / (|d_j| |d_k|)

  sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|)
                = Σ_{i=1..n} w_{i,j} w_{i,k} / ( sqrt(Σ_{i=1..n} w_{i,j}²) · sqrt(Σ_{i=1..n} w_{i,k}²) )

• Or, more generally, inner products:

  sim(d_j, d_k) = d_j · d_k = Σ_{i=1..n} w_{i,j} w_{i,k}
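
A minimal sketch of this metric in Python (dense weight vectors; names are ours):

```python
import math

def cosine(d_j, d_k):
    # Cosine of the angle between two term-weight vectors.
    dot = sum(w_j * w_k for w_j, w_k in zip(d_j, d_k))
    norm_j = math.sqrt(sum(w * w for w in d_j))
    norm_k = math.sqrt(sum(w * w for w in d_k))
    if norm_j == 0 or norm_k == 0:
        return 0.0
    return dot / (norm_j * norm_k)

print(cosine([1, 1, 0], [1, 1, 1]))  # 0.816...
```
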
Vector space model
• Documents are also treated as a “bag” of words or terms.
• Each document is represented as a vector.
• However, the term weights are no longer 0 or 1. Each
term weight is computed based on some variations of TF
or TF-IDF scheme.

Term Weighting
• Term weights consist of two components
• Local: how important is the term in this document?
• Global: how important is the term in the collection?
• Here’s the intuition:
• Terms that appear often in a document should get high weights
• Terms that appear in many documents should get low weights
• How do we capture this mathematically?
• Term frequency (local)
• Inverse document frequency (global)

TF.IDF Term Weighting

  w_{i,j} = tf_{i,j} × log(N / n_i)

where
  w_{i,j} = weight assigned to term i in document j
  tf_{i,j} = number of occurrences of term i in document j
  N = number of documents in the entire collection
  n_i = number of documents containing term i
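
In code this is a one-liner (a sketch; the base-10 logarithm matches the worked example later in these slides):

```python
import math

def tf_idf(tf_ij: int, N: int, n_i: int) -> float:
    # w_ij = tf_ij * log10(N / n_i); a term occurring in every document
    # gets IDF 0 and therefore weight 0.
    return tf_ij * math.log10(N / n_i)

print(tf_idf(1, N=3, n_i=1))  # 0.477...
print(tf_idf(3, N=3, n_i=3))  # 0.0
```
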
Retrieval in vector space model
• Query q is represented in the same way or slightly differently.
• Relevance of d_i to q: compare the similarity of query q and document d_i.
• Cosine similarity (the cosine of the angle between the two vectors)
• Cosine is also commonly used in text clustering
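
Ranked retrieval then reduces to scoring every document against the query and sorting, as in this sketch (reusing the cosine function defined above):

```python
def rank(query_vec, doc_vecs: dict):
    # Score each document vector against the query; best match first.
    scores = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```
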
An Example
• A document space is defined by three terms:
• hardware, software, users
• the vocabulary
• A set of documents is defined as:
• A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
• A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
• A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the query is "hardware and software"
• what documents should be retrieved?

An Example (cont.)
• In Boolean query matching:
• documents A4, A7 will be retrieved ("AND")
• retrieved: A1, A2, A4, A5, A6, A7, A8, A9 ("OR")
• In similarity matching (cosine):
• q=(1, 1, 0)
• S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
• S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
• S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
• Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
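
These scores can be reproduced with the cosine sketch from earlier (a quick check, not part of the original slides):

```python
docs = {"A1": [1, 0, 0], "A2": [0, 1, 0], "A3": [0, 0, 1],
        "A4": [1, 1, 0], "A5": [1, 0, 1], "A6": [0, 1, 1],
        "A7": [1, 1, 1], "A8": [1, 0, 1], "A9": [0, 1, 1]}
q = [1, 1, 0]
for name, vec in docs.items():
    print(name, round(cosine(q, vec), 2))
# A4 scores 1.0; A7 0.82; A1, A2 0.71; A5, A6, A8, A9 0.5; A3 0.0
```
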
Some formulas for Sim
Let D = (a_1, ..., a_t) be the document vector and Q = (b_1, ..., b_t) the query vector.

Dot product:  Sim(D, Q) = Σ_i (a_i × b_i)

Cosine:       Sim(D, Q) = Σ_i (a_i × b_i) / ( sqrt(Σ_i a_i²) × sqrt(Σ_i b_i²) )

Dice:         Sim(D, Q) = 2 Σ_i (a_i × b_i) / ( Σ_i a_i² + Σ_i b_i² )

Jaccard:      Sim(D, Q) = Σ_i (a_i × b_i) / ( Σ_i a_i² + Σ_i b_i² − Σ_i (a_i × b_i) )

[Figure: D and Q drawn as vectors over term axes t1 and t2]
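
For completeness, a sketch of the remaining coefficients in Python (function names ours; cosine was defined earlier):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dice(a, b):
    return 2 * dot(a, b) / (dot(a, a) + dot(b, b))

def jaccard(a, b):
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)

print(dice([1, 1, 0], [1, 1, 1]))     # 0.8
print(jaccard([1, 1, 0], [1, 1, 1]))  # 0.666...
```
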
Vector Space Model
A Running Example

A Running Example
• Step 1 – Extract the text (i.e., strip punctuation)

Document 1: "This is a data mining course."
→ This is a data mining course
Document 2: "We are studying text mining. Text mining is a subfield of data mining."
→ We are studying text mining Text mining is a subfield of data mining
Document 3: "Mining text is interesting, and I am interested in it."
→ Mining text is interesting and I am interested in it

A Running Example
• Step 2 – Remove stopwords (here: this, is, a, we, are, of, and, I, am, in, it)

Document 1: data mining course
Document 2: studying text mining Text mining subfield data mining
Document 3: Mining text interesting interested

A Running Example
• Step 3 – Convert all words to lower case (Text → text, Mining → mining)

Document 1: data mining course
Document 2: studying text mining text mining subfield data mining
Document 3: mining text interesting interested
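
Steps 1–3 fit in a few lines of Python; this and the following snippets form one running script (the stopword list is a toy one chosen to match this example, and lowercasing is done before the stopword check for simplicity):

```python
import string

STOPWORDS = {"this", "is", "a", "we", "are", "of", "and", "i", "am", "in", "it"}

def preprocess(text: str) -> list[str]:
    # Steps 1-3: strip punctuation, lowercase, drop stopwords.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(preprocess("This is a data mining course."))
# ['data', 'mining', 'course']
```
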
A Running Example
• Step 4 – Stemming (mining → mine, studying → study, interesting/interested → interest)

Document 1: data mine course
Document 2: study text mine text mine subfield data mine
Document 3: mine text interest interest
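
In practice a real stemmer (e.g., Porter's) would be used; to reproduce exactly the stems shown here, a lookup-table stand-in is enough (purely illustrative):

```python
STEMS = {"mining": "mine", "studying": "study",
         "interesting": "interest", "interested": "interest"}

def toy_stem(word: str) -> str:
    # Toy "stemmer": rewrite the handful of words in this example,
    # pass everything else through unchanged.
    return STEMS.get(word, word)

print([toy_stem(w) for w in
       preprocess("Mining text is interesting, and I am interested in it.")])
# ['mine', 'text', 'interest', 'interest']
```
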
A Running Example
• Step 5 – Count the word frequencies

Document 1: course×1, data×1, mine×1
Document 2: data×1, mine×3, study×1, subfield×1, text×2
Document 3: interest×2, mine×1, text×1
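
Counting is one line with collections.Counter (continuing the script):

```python
from collections import Counter

tokens = [toy_stem(w) for w in preprocess(
    "We are studying text mining. Text mining is a subfield of data mining.")]
print(Counter(tokens))
# Counter({'mine': 3, 'text': 2, 'study': 1, 'subfield': 1, 'data': 1})
```
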
A Running Example
• Step 6 – Create an indexing file

ID  word      document frequency
1   course    1
2   data      2
3   interest  1
4   mine      3
5   study     1
6   subfield  1
7   text      2
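
Document frequencies can be collected in one pass over the corpus (continuing the script; a term counts at most once per document):

```python
corpus = [
    "This is a data mining course.",
    "We are studying text mining. Text mining is a subfield of data mining.",
    "Mining text is interesting, and I am interested in it.",
]
doc_tokens = [[toy_stem(w) for w in preprocess(d)] for d in corpus]

df = Counter()
for tokens in doc_tokens:
    df.update(set(tokens))
print(sorted(df.items()))
# [('course', 1), ('data', 2), ('interest', 1), ('mine', 3),
#  ('study', 1), ('subfield', 1), ('text', 2)]
```
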
A Running Example
• Step 7 – Create the vector space model
• Each document becomes a vector of term frequencies over the vocabulary, ordered by term ID (course, data, interest, mine, study, subfield, text):

Document 1: (1, 1, 0, 1, 0, 0, 0)
Document 2: (0, 1, 0, 3, 1, 1, 2)
Document 3: (0, 0, 2, 1, 0, 0, 1)

A Running Example
• Step 8 – Compute the inverse document frequency

  IDF(word) = log10( total documents / document frequency )

ID  word      document frequency  IDF
1   course    1                   0.477
2   data      2                   0.176
3   interest  1                   0.477
4   mine      3                   0
5   study     1                   0.477
6   subfield  1                   0.477
7   text      2                   0.176
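
In code (continuing the script; the base-10 logarithm matches the table's values):

```python
import math

N = len(doc_tokens)  # 3 documents
idf = {term: math.log10(N / n_i) for term, n_i in df.items()}
print(round(idf["course"], 3), round(idf["mine"], 3))  # 0.477 0.0
```
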
A Running Example
• Step 9 – Compute the weights of the words

  w(word_i) = TF(word_i) × IDF(word_i)
  TF(word_i) = number of times word_i appears in the document

Document 1: (1, 1, 0, 1, 0, 0, 0) → (0.477, 0.176, 0, 0, 0, 0, 0)
Document 2: (0, 1, 0, 3, 1, 1, 2) → (0, 0.176, 0, 0, 0.477, 0.477, 0.352)
Document 3: (0, 0, 2, 1, 0, 0, 1) → (0, 0, 0.954, 0, 0, 0, 0.176)

A Running Example
• Step 10 – Normalize all documents to unit length

  w(word_i) ← w(word_i) / sqrt( w²(word_1) + w²(word_2) + … + w²(word_n) )

Document 1: (0.477, 0.176, 0, 0, 0, 0, 0) → (0.938, 0.346, 0, 0, 0, 0, 0)
Document 2: (0, 0.176, 0, 0, 0.477, 0.477, 0.352) → (0, 0.225, 0, 0, 0.611, 0.611, 0.450)
Document 3: (0, 0, 0.954, 0, 0, 0, 0.176) → (0, 0, 0.983, 0, 0, 0, 0.181)
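
The normalization step as a function (continuing the script):

```python
def normalize(vec):
    # Scale to unit (L2) length; leave all-zero vectors unchanged.
    norm = math.sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec] if norm else vec

print([round(w, 3) for w in normalize([0.477, 0.176, 0, 0, 0, 0, 0])])
# [0.938, 0.346, 0, 0, 0, 0, 0]
```
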
A Running Example
• Finally, we obtain the following:
• Everything becomes structured!
• We can perform classification, clustering, etc.!

Document 1: (0.938, 0.346, 0, 0, 0, 0, 0)
Document 2: (0, 0.225, 0, 0, 0.611, 0.611, 0.450)
Document 3: (0, 0, 0.983, 0, 0, 0, 0.181)
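
Assembling the pieces from the previous sketches, the whole pipeline is a few lines, and the resulting vectors can be searched or clustered directly (treating the query like a document is our own simplification):

```python
vocab = sorted(df)  # ['course', 'data', 'interest', 'mine', 'study', 'subfield', 'text']

def vectorize(tokens):
    # TF x IDF weights over the vocabulary, normalized to unit length.
    counts = Counter(tokens)
    return normalize([counts[t] * idf[t] for t in vocab])

doc_vecs = [vectorize(t) for t in doc_tokens]
q_vec = vectorize([toy_stem(w) for w in preprocess("text mining")])
print([round(cosine(q_vec, d), 3) for d in doc_vecs])
# [0.0, 0.45, 0.181] -- Document 2, the one about text mining, ranks highest
```
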
PAGE RANK ALGORITHM
