Module 7: Information Retrieval and Extraction
First, nomenclature…
• Information retrieval (IR)
• Focus on textual information (= text/document retrieval)
• Other possibilities include image, video, music, …
• What do we search?
• Generically, “collections”
• Less-frequently used, “corpora”
• What do we find?
• Generically, “documents”
• Even though we may be referring to web pages, PDFs, PowerPoint slides,
paragraphs, etc.
Information Retrieval Cycle
[Figure: the information retrieval cycle — source selection (resource) → query formulation (query) → search (results) → selection (documents) → examination (information) → delivery, with feedback through system discovery, vocabulary discovery, concept discovery, document discovery, and information source reselection]
The Central Problem in Search
[Figure: the author's concepts are mapped by a representation function into an index (offline); the searcher's concepts are mapped by a representation function into a query (online); a comparison function matches the query against the index to produce hits]
How do we represent text?
• Remember: computers don’t “understand” anything!
• “Bag of words”
• Treat all the words in a document as index terms
• Assign a “weight” to each term based on “importance”
(or, in simplest case, presence/absence of word)
• Disregard order, structure, meaning, etc. of the words
• Simple, yet effective!
• Assumptions
• Term occurrence is independent
• Document relevance is independent
• “Words” are well-defined
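To make the bag-of-words idea concrete, here is a minimal Python sketch (the vocabulary and sample sentence are invented for illustration):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs; order and structure are discarded."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["data", "mining", "text", "course"]
vec = bag_of_words("text mining is a subfield of data mining", vocab)  # [1, 2, 1, 0]
```

In the simplest (presence/absence) variant, each count would be clamped to 0 or 1.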
What’s a word?
• Chinese: 天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。
(Pope John Paul II was hospitalized again because of a cold; this is his second hospitalization this year for the same cause.)
• Arabic: وقال مارك ريجيف الناطق باسم الخارجية الإسرائيلية إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.
(Mark Regev, the Israeli Foreign Ministry spokesman, said that Sharon accepted the invitation and will make his first visit to Tunis, which was for a long time the official headquarters of the PLO after its departure from Lebanon in 1982.)
• Hindi: भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है
(In the Economic Survey, the Government of India estimates a seven percent growth rate for fiscal year 2005-06 and stresses tax reform.)
• Japanese: 日米連合で台頭中国に対処…アーミテージ前副長官提言
(Former Deputy Secretary of State Armitage proposes a Japan-U.S. alliance to deal with a rising China.)
Boolean model
• Each document or query is treated as a “bag” of
words or terms. Word sequence is not considered.
• Given a collection of documents D, let V = {t1, t2, ...,
t|V|} be the set of distinctive words/terms in the
collection. V is called the vocabulary.
• A weight wij > 0 is associated with each term ti of a document dj ∈ D. For a term that does not appear in document dj, wij = 0. Each document dj is thus represented as the vector
dj = (w1j, w2j, ..., w|V|j)
Boolean model (contd)
• Query terms are combined logically using the Boolean
operators AND, OR, and NOT.
• E.g., ((data AND mining) AND (NOT text))
• Retrieval
• Given a Boolean query, the system retrieves every
document that makes the query logically true.
• Called exact match.
• The retrieval results are usually quite poor because
term frequency is not considered.
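A minimal sketch of exact-match Boolean retrieval, using invented toy documents and the example query ((data AND mining) AND (NOT text)); each term's postings are the set of documents containing it:

```python
# Toy collection: each document is the set of terms it contains (invented for illustration).
docs = {
    "d1": {"data", "mining", "course"},
    "d2": {"text", "mining", "subfield", "data"},
    "d3": {"text", "interesting"},
}
all_ids = set(docs)

def postings(term):
    """Set of documents containing the term."""
    return {d for d, terms in docs.items() if term in terms}

# ((data AND mining) AND (NOT text)): AND -> intersection, NOT -> complement.
hits = (postings("data") & postings("mining")) & (all_ids - postings("text"))
```

Every document in `hits` makes the query logically true; there is no ranking among them.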
Strengths and Weaknesses
• Strengths
• Precise, if you know the right strategies
• Precise, if you have an idea of what you’re looking for
• Implementations are fast and efficient
• Weaknesses
• Users must learn Boolean logic
• Boolean logic insufficient to capture the richness of language
• No control over size of result set: either too many hits or none
• When do you stop reading? All documents in the result set are considered
“equally good”
• What about partial matches? Documents that “don’t quite match” the query
may be useful also
Vector Space Model
[Figure: documents d1–d5 plotted as vectors in a space with term axes t1, t2, t3; the angle θ between two document vectors reflects their similarity]
sim(dj, dk) = dj · dk = Σi=1..n wi,j × wi,k
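The similarity above can be sketched directly in Python; `sim` is the inner product from the formula, and the length-normalized cosine variant is shown alongside for comparison:

```python
import math

def sim(dj, dk):
    """Inner-product similarity: sum over i of w_ij * w_ik."""
    return sum(a * b for a, b in zip(dj, dk))

def cosine(dj, dk):
    """Length-normalized variant: the cosine of the angle between the vectors."""
    denom = math.sqrt(sum(a * a for a in dj)) * math.sqrt(sum(b * b for b in dk))
    return sim(dj, dk) / denom if denom else 0.0
```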
Term Weighting
• Term weights consist of two components
• Local: how important is the term in this document?
• Global: how important is the term in the collection?
• Here’s the intuition:
• Terms that appear often in a document should get high weights
• Terms that appear in many documents should get low weights
• How do we capture this mathematically?
• Term frequency (local)
• Inverse document frequency (global)
TF.IDF Term Weighting
wi,j = tfi,j × log(N / ni)
where:
• wi,j = weight assigned to term i in document j
• tfi,j = number of occurrences of term i in document j
• N = total number of documents in the collection
• ni = number of documents containing term i
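A sketch of the weighting formula in Python; base-10 logarithms are assumed here because they reproduce the IDF values in the worked example later (log 3 ≈ 0.477):

```python
import math

def tfidf(tf, df, n_docs):
    """w_ij = tf_ij * log10(N / n_i): high for terms frequent in the document
    but rare in the collection; zero for terms appearing in every document."""
    return tf * math.log10(n_docs / df)
```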
An Example
• A document space is defined by three terms:
• hardware, software, users
• the vocabulary
• A set of documents is defined as:
• A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
• A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
• A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the Query is “hardware and software”
• what documents should be retrieved?
An Example (cont.)
• In Boolean query matching:
• with AND, documents A4 and A7 are retrieved
• with OR, documents A1, A2, A4, A5, A6, A7, A8, A9 are retrieved
• In similarity matching (cosine):
• q=(1, 1, 0)
• S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
• S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
• S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
• Document retrieved set (with ranking)=
• {A4, A7, A1, A2, A5, A6, A8, A9}
Some formulas for Sim
(ai and bi are the weights of term i in document D and query Q respectively)
• Cosine: Sim(D, Q) = Σi (ai × bi) / (√Σi ai² × √Σi bi²)
• Dice: Sim(D, Q) = 2 Σi (ai × bi) / (Σi ai² + Σi bi²)
• Jaccard: Sim(D, Q) = Σi (ai × bi) / (Σi ai² + Σi bi² − Σi (ai × bi))
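The Dice and Jaccard variants can be sketched as follows (cosine appeared earlier); a and b are term-weight vectors:

```python
def dice(a, b):
    """Dice: 2 * sum(a_i*b_i) / (sum(a_i^2) + sum(b_i^2))."""
    dot = sum(x * y for x, y in zip(a, b))
    return 2 * dot / (sum(x * x for x in a) + sum(y * y for y in b))

def jaccard(a, b):
    """Jaccard: sum(a_i*b_i) / (sum(a_i^2) + sum(b_i^2) - sum(a_i*b_i))."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)
```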
Vector Space Model: A Running Example
• Step 1 – Extract text (i.e., strip punctuation)
• "This is a data mining course." → This is a data mining course
• "We are studying text mining. Text mining is a subfield of data mining." → We are studying text mining Text mining is a subfield of data mining
• "Mining text is interesting, and I am interested in it." → Mining text is interesting and I am interested in it
A Running Example
• Step 2 – Remove stopwords
• This is a data mining course → data mining course
• We are studying text mining Text mining is a subfield of data mining → studying text mining text mining subfield data mining
• Mining text is interesting and I am interested in it → mining text interesting interested
A Running Example
• Step 4 – Stemming
• data mining course → data mine course
• studying text mining text mining subfield data mining → study text mine text mine subfield data mine
• mining text interesting interested → mine text interest interest
A Running Example
• Step 5 – Count the word frequencies
• data mine course → course×1, data×1, mine×1
• study text mine text mine subfield data mine → data×1, mine×3, study×1, subfield×1, text×2
• mine text interest interest → interest×2, mine×1, text×1
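Steps 1–5 can be sketched as one small pipeline; the stopword list and the stem table below are hand-picked to reproduce this example (a real system would use a standard stopword list and a real stemmer such as Porter's):

```python
import re
from collections import Counter

# Hand-picked for this example; real systems use standard stopword lists.
STOPWORDS = {"this", "is", "a", "of", "we", "are", "and", "i", "am", "in", "it"}

# Toy stem table standing in for a real stemmer (e.g., the Porter stemmer).
STEMS = {"mining": "mine", "studying": "study",
         "interesting": "interest", "interested": "interest"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # Step 1: extract words
    tokens = [t for t in tokens if t not in STOPWORDS]  # Step 2: remove stopwords
    tokens = [STEMS.get(t, t) for t in tokens]          # Step 4: stem
    return Counter(tokens)                              # Step 5: count frequencies
```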
A Running Example
• Step 6 – Create an indexing file
ID  word      doc freq
1   course    1
2   data      2
3   interest  1
4   mine      3
5   study     1
6   subfield  1
7   text      2
• Term-frequency vectors over this vocabulary:
• Document 1: (1, 1, 0, 1, 0, 0, 0)
• Document 2: (0, 1, 0, 3, 1, 1, 2)
• Document 3: (0, 0, 2, 1, 0, 0, 1)
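The indexing file and document vectors can be derived mechanically; this sketch builds both from the Step 5 term counts:

```python
# Term counts per document, as produced in Step 5.
docs = {
    "d1": {"course": 1, "data": 1, "mine": 1},
    "d2": {"data": 1, "mine": 3, "study": 1, "subfield": 1, "text": 2},
    "d3": {"interest": 2, "mine": 1, "text": 1},
}

# Vocabulary sorted alphabetically, matching the ID order in the indexing file.
vocab = sorted({t for counts in docs.values() for t in counts})

# Document frequency: in how many documents each term occurs.
doc_freq = {t: sum(1 for counts in docs.values() if t in counts) for t in vocab}

# Term-frequency vector for each document over the vocabulary.
vectors = {d: [counts.get(t, 0) for t in vocab] for d, counts in docs.items()}
```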
A Running Example
• Step 9 – Compute the weights of the words
w(wordi) = TF(wordi) × IDF(wordi)
TF(wordi) = number of times wordi appears in the document
IDF(wordi) = log(N / document frequency of wordi), where N = 3 documents here
• Document 1: TF vector (1, 1, 0, 1, 0, 0, 0) → weight vector (0.477, 0.176, 0, 0, 0, 0, 0)
ID  word      document frequency  IDF
1   course    1                   0.477
2   data      2                   0.176
3   interest  1                   0.477
4   mine      3                   0
• Document 2: TF vector (0, 1, 0, 3, 1, 1, 2)
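The weight computation for document 1 can be sketched as follows (base-10 log assumed, since log 3 ≈ 0.477 matches the IDF table):

```python
import math

N = 3  # number of documents in the collection
doc_freq = {"course": 1, "data": 2, "interest": 1, "mine": 3,
            "study": 1, "subfield": 1, "text": 2}
tf_doc1 = {"course": 1, "data": 1, "mine": 1}  # document 1 term counts

# w(word) = TF(word) * IDF(word), with IDF = log10(N / document frequency).
weights = {w: tf_doc1[w] * math.log10(N / doc_freq[w]) for w in tf_doc1}
```

Note that "mine" gets weight 0: it appears in every document, so it cannot discriminate between them.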