Module 7: Information Retrieval and Extraction
First, nomenclature…
• Information retrieval (IR)
• Focus on textual information (= text/document retrieval)
• Other possibilities include image, video, music, …
• What do we search?
• Generically, “collections”
• Less-frequently used, “corpora”
• What do we find?
• Generically, “documents”
• Even though we may be referring to web pages, PDFs, PowerPoint slides,
paragraphs, etc.
Information Retrieval Cycle
[Figure: the information retrieval cycle — source selection (resource) → query formulation (query) → search (results) → selection (documents) → examination (information) → delivery, with feedback through system discovery, vocabulary discovery, concept discovery, document discovery, and information source reselection]
The Central Problem in Search
[Figure: the author's concepts are mapped by a representation function into an index (offline); the searcher's concepts are mapped by a representation function into a query (online); a comparison function matches the query against the index to produce hits]
How do we represent text?
• Remember: computers don’t “understand” anything!
• “Bag of words”
• Treat all the words in a document as index terms
• Assign a “weight” to each term based on “importance”
(or, in simplest case, presence/absence of word)
• Disregard order, structure, meaning, etc. of the words
• Simple, yet effective!
• Assumptions
• Term occurrence is independent
• Document relevance is independent
• “Words” are well-defined
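To make the bag-of-words idea concrete, here is a minimal Python sketch (the vocabulary and sample sentence are invented for illustration):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs; order and structure are discarded."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["data", "mining", "text", "course"]
vec = bag_of_words("text mining is a subfield of data mining", vocab)  # [1, 2, 1, 0]
```

In the simplest (presence/absence) variant, each count would be clamped to 0 or 1.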
What’s a word?
• Chinese: 天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。
(Pope John Paul II was hospitalized again because of a cold; this is his second hospitalization this year for the same cause.)
• Arabic: وقال مارك ريجيف الناطق باسم الخارجية الإسرائيلية إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.
(Mark Regev, the Israeli Foreign Ministry spokesman, said that Sharon accepted the invitation and will make his first visit to Tunis, which was for a long time the official headquarters of the PLO after its departure from Lebanon in 1982.)
• Hindi: भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है
(In the Economic Survey, the Government of India estimates a seven percent growth rate for fiscal year 2005-06 and stresses tax reform.)
• Japanese: 日米連合で台頭中国に対処…アーミテージ前副長官提言
(Former Deputy Secretary of State Armitage proposes a Japan-U.S. alliance to deal with a rising China.)
Boolean model
• Each document or query is treated as a “bag” of
words or terms. Word sequence is not considered.
• Given a collection of documents D, let V = {t1, t2, ...,
t|V|} be the set of distinctive words/terms in the
collection. V is called the vocabulary.
• A weight wij > 0 is associated with each term ti of a document dj ∈ D. For a term that does not appear in document dj, wij = 0. Each document dj is thus represented as the vector
dj = (w1j, w2j, ..., w|V|j)
Boolean model (contd)
• Query terms are combined logically using the Boolean
operators AND, OR, and NOT.
• E.g., ((data AND mining) AND (NOT text))
• Retrieval
• Given a Boolean query, the system retrieves every
document that makes the query logically true.
• Called exact match.
• The retrieval results are usually quite poor because
term frequency is not considered.
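A minimal sketch of exact-match Boolean retrieval, using invented toy documents and the example query ((data AND mining) AND (NOT text)); each term's postings are the set of documents containing it:

```python
# Toy collection: each document is the set of terms it contains (invented for illustration).
docs = {
    "d1": {"data", "mining", "course"},
    "d2": {"text", "mining", "subfield", "data"},
    "d3": {"text", "interesting"},
}
all_ids = set(docs)

def postings(term):
    """Set of documents containing the term."""
    return {d for d, terms in docs.items() if term in terms}

# ((data AND mining) AND (NOT text)): AND -> intersection, NOT -> complement.
hits = (postings("data") & postings("mining")) & (all_ids - postings("text"))
```

Every document in `hits` makes the query logically true; there is no ranking among them.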
Strengths and Weaknesses
• Strengths
• Precise, if you know the right strategies
• Precise, if you have an idea of what you’re looking for
• Implementations are fast and efficient
• Weaknesses
• Users must learn Boolean logic
• Boolean logic insufficient to capture the richness of language
• No control over size of result set: either too many hits or none
• When do you stop reading? All documents in the result set are considered
“equally good”
• What about partial matches? Documents that “don’t quite match” the query
may be useful also
Vector Space Model
[Figure: documents d1–d5 plotted as vectors in a space with term axes t1, t2, t3; the angle θ between two document vectors reflects their similarity]
sim(dj, dk) = dj · dk = Σi=1..n wi,j × wi,k
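The similarity above can be sketched directly in Python; `sim` is the inner product from the formula, and the length-normalized cosine variant is shown alongside for comparison:

```python
import math

def sim(dj, dk):
    """Inner-product similarity: sum over i of w_ij * w_ik."""
    return sum(a * b for a, b in zip(dj, dk))

def cosine(dj, dk):
    """Length-normalized variant: the cosine of the angle between the vectors."""
    denom = math.sqrt(sum(a * a for a in dj)) * math.sqrt(sum(b * b for b in dk))
    return sim(dj, dk) / denom if denom else 0.0
```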
Term Weighting
• Term weights consist of two components
• Local: how important is the term in this document?
• Global: how important is the term in the collection?
• Here’s the intuition:
• Terms that appear often in a document should get high weights
• Terms that appear in many documents should get low weights
• How do we capture this mathematically?
• Term frequency (local)
• Inverse document frequency (global)
TF.IDF Term Weighting
wi,j = tfi,j × log(N / ni)
where:
• wi,j = weight assigned to term i in document j
• tfi,j = number of occurrences of term i in document j
• N = total number of documents in the collection
• ni = number of documents containing term i
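A sketch of the weighting formula in Python; base-10 logarithms are assumed here because they reproduce the IDF values in the worked example later (log 3 ≈ 0.477):

```python
import math

def tfidf(tf, df, n_docs):
    """w_ij = tf_ij * log10(N / n_i): high for terms frequent in the document
    but rare in the collection; zero for terms appearing in every document."""
    return tf * math.log10(n_docs / df)
```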
An Example
• A document space is defined by three terms:
• hardware, software, users
• the vocabulary
• A set of documents is defined as:
• A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
• A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
• A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the Query is “hardware and software”
• what documents should be retrieved?
An Example (cont.)
• In Boolean query matching:
• with AND, documents A4 and A7 are retrieved
• with OR, documents A1, A2, A4, A5, A6, A7, A8, A9 are retrieved
• In similarity matching (cosine):
• q=(1, 1, 0)
• S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
• S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
• S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
• Document retrieved set (with ranking)=
• {A4, A7, A1, A2, A5, A6, A8, A9}
Some formulas for Sim
(ai and bi are the weights of term i in document D and query Q respectively)
• Cosine: Sim(D, Q) = Σi (ai × bi) / (√Σi ai² × √Σi bi²)
• Dice: Sim(D, Q) = 2 Σi (ai × bi) / (Σi ai² + Σi bi²)
• Jaccard: Sim(D, Q) = Σi (ai × bi) / (Σi ai² + Σi bi² − Σi (ai × bi))
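The Dice and Jaccard variants can be sketched as follows (cosine appeared earlier); a and b are term-weight vectors:

```python
def dice(a, b):
    """Dice: 2 * sum(a_i*b_i) / (sum(a_i^2) + sum(b_i^2))."""
    dot = sum(x * y for x, y in zip(a, b))
    return 2 * dot / (sum(x * x for x in a) + sum(y * y for y in b))

def jaccard(a, b):
    """Jaccard: sum(a_i*b_i) / (sum(a_i^2) + sum(b_i^2) - sum(a_i*b_i))."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)
```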
Vector Space Model: A Running Example
• Step 1 – Extract text (i.e., strip punctuation)
• "This is a data mining course." → This is a data mining course
• "We are studying text mining. Text mining is a subfield of data mining." → We are studying text mining Text mining is a subfield of data mining
• "Mining text is interesting, and I am interested in it." → Mining text is interesting and I am interested in it
A Running Example
• Step 2 – Remove stopwords
• This is a data mining course → data mining course
• We are studying text mining Text mining is a subfield of data mining → studying text mining text mining subfield data mining
• Mining text is interesting and I am interested in it → mining text interesting interested
A Running Example
• Step 4 – Stemming
• data mining course → data mine course
• studying text mining text mining subfield data mining → study text mine text mine subfield data mine
• mining text interesting interested → mine text interest interest
A Running Example
• Step 5 – Count the word frequencies
• data mine course → course×1, data×1, mine×1
• study text mine text mine subfield data mine → data×1, mine×3, study×1, subfield×1, text×2
• mine text interest interest → interest×2, mine×1, text×1
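Steps 1–5 can be sketched as one small pipeline; the stopword list and the stem table below are hand-picked to reproduce this example (a real system would use a standard stopword list and a real stemmer such as Porter's):

```python
import re
from collections import Counter

# Hand-picked for this example; real systems use standard stopword lists.
STOPWORDS = {"this", "is", "a", "of", "we", "are", "and", "i", "am", "in", "it"}

# Toy stem table standing in for a real stemmer (e.g., the Porter stemmer).
STEMS = {"mining": "mine", "studying": "study",
         "interesting": "interest", "interested": "interest"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # Step 1: extract words
    tokens = [t for t in tokens if t not in STOPWORDS]  # Step 2: remove stopwords
    tokens = [STEMS.get(t, t) for t in tokens]          # Step 4: stem
    return Counter(tokens)                              # Step 5: count frequencies
```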
A Running Example
• Step 6 – Create an indexing file
ID  word      doc freq
1   course    1
2   data      2
3   interest  1
4   mine      3
5   study     1
6   subfield  1
7   text      2
• Term-frequency vectors over this vocabulary:
• Document 1: (1, 1, 0, 1, 0, 0, 0)
• Document 2: (0, 1, 0, 3, 1, 1, 2)
• Document 3: (0, 0, 2, 1, 0, 0, 1)
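The indexing file and document vectors can be derived mechanically; this sketch builds both from the Step 5 term counts:

```python
# Term counts per document, as produced in Step 5.
docs = {
    "d1": {"course": 1, "data": 1, "mine": 1},
    "d2": {"data": 1, "mine": 3, "study": 1, "subfield": 1, "text": 2},
    "d3": {"interest": 2, "mine": 1, "text": 1},
}

# Vocabulary sorted alphabetically, matching the ID order in the indexing file.
vocab = sorted({t for counts in docs.values() for t in counts})

# Document frequency: in how many documents each term occurs.
doc_freq = {t: sum(1 for counts in docs.values() if t in counts) for t in vocab}

# Term-frequency vector for each document over the vocabulary.
vectors = {d: [counts.get(t, 0) for t in vocab] for d, counts in docs.items()}
```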
A Running Example
• Step 9 – Compute the weights of the words
w(wordi) = TF(wordi) × IDF(wordi)
TF(wordi) = number of times wordi appears in the document
IDF(wordi) = log(N / document frequency of wordi), where N = 3 documents here
• Document 1: TF vector (1, 1, 0, 1, 0, 0, 0) → weight vector (0.477, 0.176, 0, 0, 0, 0, 0)
ID  word      document frequency  IDF
1   course    1                   0.477
2   data      2                   0.176
3   interest  1                   0.477
4   mine      3                   0
• Document 2: TF vector (0, 1, 0, 3, 1, 1, 2)
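The weight computation for document 1 can be sketched as follows (base-10 log assumed, since log 3 ≈ 0.477 matches the IDF table):

```python
import math

N = 3  # number of documents in the collection
doc_freq = {"course": 1, "data": 2, "interest": 1, "mine": 3,
            "study": 1, "subfield": 1, "text": 2}
tf_doc1 = {"course": 1, "data": 1, "mine": 1}  # document 1 term counts

# w(word) = TF(word) * IDF(word), with IDF = log10(N / document frequency).
weights = {w: tf_doc1[w] * math.log10(N / doc_freq[w]) for w in tf_doc1}
```

Note that "mine" gets weight 0: it appears in every document, so it cannot discriminate between them.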