Unit V Easy To Learn
Introduction
Text mining refers to data mining using text documents as data.
• Most text mining tasks use Information Retrieval (IR) methods to pre-process
text documents.
• These methods are quite different from traditional data pre-processing methods
used for relational tables.
• Web search also has its roots in IR.
Information Retrieval
Common text pre-processing steps in IR include the following (see the sketch after this list):
1. Lower casing.
2. Removal of Punctuations.
3. Removal of Stopwords.
4. Removal of Frequent words.
5. Removal of Rare words.
6. Stemming.
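A minimal sketch of steps 1-6, assuming a toy stopword list, illustrative frequency thresholds (high, low), and a crude suffix-stripping stemmer in place of a real Porter stemmer:

```python
import string
from collections import Counter

STOPWORDS = {"the", "of", "and", "to", "a", "an", "in", "is", "it"}  # toy list

def naive_stem(token):
    # Crude suffix stripping; a real system would use a Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(doc, corpus_counts, high=1000, low=2):
    doc = doc.lower()                                               # 1. lower casing
    doc = doc.translate(str.maketrans("", "", string.punctuation))  # 2. punctuation
    tokens = [t for t in doc.split() if t not in STOPWORDS]         # 3. stopwords
    # 4./5. drop too-frequent and too-rare words (needs corpus-wide counts)
    tokens = [t for t in tokens if low <= corpus_counts[t] <= high]
    return [naive_stem(t) for t in tokens]                          # 6. stemming

corpus = ["The users of the software...", "Hardware and software users..."]
counts = Counter(t for d in corpus
                 for t in d.lower().translate(
                     str.maketrans("", "", string.punctuation)).split())
print(preprocess(corpus[0], counts, high=10, low=1))  # ['user', 'software']
```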
Typical IR Task
• Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to the query.
IR System Components
Let V = {t1, t2, ..., t|V|} be the set of distinctive words/terms in the collection. V
is called the vocabulary.
• A weight wij ≥ 0 is associated with each term ti of a document dj ∈ D
(wij = 0 if ti does not appear in dj):
– wij = 0/1 (absence/presence) in the Boolean model
– dj = (w1j, w2j, ..., w|V|j)
Boolean Model
Query terms are combined logically using the Boolean operators AND, OR, and
NOT.
• Retrieval – Given a Boolean query, the system retrieves every document that
makes the query logically true – Exact match.
Boolean Models Problems
Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved.
– All matched documents will be returned.
• Difficult to rank output.
– All matched documents logically satisfy the query.
• Difficult to perform relevance feedback
– If a document is identified by the user as relevant or irrelevant, how should
the query be modified?
Boolean model: an Example
Consider a document space defined by three terms,
i.e., the vocabulary / lexicon: – hardware, software, users
• A set of documents is defined as:
– A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1), A4=(1, 1, 0),
A5=(1, 0, 1), A6=(0, 1, 1), A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the query is: “hardware, software”
i.e., (1, 1, 0)
what documents should be retrieved?
– AND: documents A4, A7
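A minimal sketch of this exact-match AND retrieval over the nine example vectors:

```python
# Documents as 0/1 vectors over the vocabulary (hardware, software, users)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
query = (1, 1, 0)  # "hardware AND software"

# AND semantics: every query term must be present in the document
matches = [name for name, vec in docs.items()
           if all(d == 1 for d, q in zip(vec, query) if q == 1)]
print(matches)  # ['A4', 'A7']
```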
Vector Space Model
• A collection of n documents can be represented in the vector space model by a
term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document;
zero means the term has no significance in the document or it simply doesn’t
exist in the document.
      T1    T2    ...   Tt
D1    w11   w21   ...   wt1
D2    w12   w22   ...   wt2
:      :     :           :
Dn    w1n   w2n   ...   wtn
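A small sketch that builds such a term-document matrix from raw texts; the documents are made up for illustration, and raw term frequency is used as one simple choice of "weight":

```python
from collections import Counter

# Toy collection (made-up documents for illustration)
docs = ["hardware and software for users",
        "software users help software users",
        "hardware design"]

# Vocabulary: all distinct terms in the collection, in a fixed order
vocab = sorted({t for d in docs for t in d.split()})

# Term-document matrix: one row per document, one column per term;
# each entry is the raw term frequency
matrix = []
for d in docs:
    counts = Counter(d.split())
    matrix.append([counts[t] for t in vocab])

print(vocab)
for row in matrix:
    print(row)
```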
Query Vector
• The query vector is typically treated as a document and is also tf-idf weighted.
• An alternative is for the user to supply weights for the given query terms.
Similarity Measure
• A similarity measure is a function that computes the degree of similarity between
two vectors.
• Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in the order of presumed
relevance.
– It is possible to enforce a certain threshold so that the size of the retrieved
set can be controlled.
Similarity Measure - Inner Product
Similarity between the vectors for document dj and query q can be
computed as the vector inner product (a.k.a. dot product):
sim(dj, q) = dj · q = Σi wij · wiq
where wij is the weight of term ti in document dj and wiq is the weight of ti in q.
Properties of Inner Product
The inner product is unbounded.
• Favors long documents with a large number of unique terms.
• Measures how many terms matched but not how many terms are not
matched.
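A minimal sketch of inner-product scoring with ranking and a threshold to control the size of the retrieved set; the weight vectors are made up for illustration:

```python
docs = {
    "d1": [3, 0, 1],   # term weights over a 3-term vocabulary
    "d2": [0, 2, 2],
    "d3": [1, 1, 0],
}
query = [1, 0, 1]

def inner_product(d, q):
    # Dot product: sum of pairwise weight products
    return sum(wd * wq for wd, wq in zip(d, q))

scores = {name: inner_product(vec, query) for name, vec in docs.items()}

# Rank by presumed relevance; keep only scores above a threshold
threshold = 0
ranked = sorted(((s, n) for n, s in scores.items() if s > threshold),
                reverse=True)
print(ranked)  # [(4, 'd1'), (2, 'd2'), (1, 'd3')]
```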
Statistical Models
• A document is typically represented by a bag of words
(unordered words with frequencies).
• Bag = a set that allows multiple occurrences of the same element.
• User specifies a set of desired terms with optional weights:
– Weighted query terms:
Q = <database 0.5; text 0.8; information 0.2>
– Unweighted query terms:
Q = <database; text; information>
– No Boolean conditions specified in the query.
• Automatic relevance feedback can be supported (sketched below):
– Relevant documents “added” to query.
– Irrelevant documents “subtracted” from query.
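This add/subtract feedback is the idea behind Rocchio-style query modification; a minimal sketch, where the mixing constants alpha, beta, gamma are illustrative assumptions:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query vector toward relevant documents and away from
    # irrelevant ones (Rocchio-style relevance feedback).
    n_rel, n_irr = max(len(relevant), 1), max(len(irrelevant), 1)
    return [alpha * q
            + beta / n_rel * sum(d[i] for d in relevant)
            - gamma / n_irr * sum(d[i] for d in irrelevant)
            for i, q in enumerate(query)]

q = [1.0, 0.0, 1.0]
print(rocchio(q, relevant=[[2, 1, 0]], irrelevant=[[0, 3, 0]]))
# [2.5, 0.3, 1.0] -- weight shifts toward the relevant document's terms
```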
Text pre-processing
Document parsing for word (term) extraction: easy
• Stopwords removal
• Stemming
• Frequency counts and computing TF-IDF term weights
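A minimal sketch of frequency counts and TF-IDF weighting, assuming the common tf × log(N/df) variant (other variants normalize tf or smooth the idf); the documents are made up:

```python
import math
from collections import Counter

docs = [["web", "mining", "text"],
        ["text", "mining", "mining"],
        ["web", "search"]]
N = len(docs)

# Document frequency: number of documents containing each term
df = Counter(t for d in docs for t in set(d))

def tfidf(doc):
    tf = Counter(doc)
    # Weight = term frequency x inverse document frequency
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[1]))  # 'mining' occurs twice, scaled by log(3/2)
```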
Stopwords removal
Many of the most frequently used words in English are useless in IR and text
mining
– these words are called stop words – “the”, “of”, “and”, “to”, ….
Inverted index
The inverted index of a document collection is basically a data structure that
– attaches each distinctive term with a list of all documents that contain the
term.
• Thus, in retrieval, it takes constant time to
– find the documents that contains a query term.
• Multiple query terms are also easy handled as we will see soon.
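A minimal sketch, assuming a Python dict of sets as the posting-list structure; the document IDs and texts are made up:

```python
from collections import defaultdict

docs = {1: "web mining uses text mining",
        2: "text mining and web search",
        3: "database systems"}

# Build the inverted index: term -> set of IDs of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# A single-term lookup is one dict access (constant expected time)
print(index["mining"])  # {1, 2}

# Multi-term AND query: intersect the posting lists
def and_query(terms):
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(and_query(["web", "mining"]))  # {1, 2}
```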
Web
A huge, widely-distributed, highly heterogeneous, semi-structured,
interconnected, evolving, hypertext/hypermedia information repository.
Application of IR to HTML documents on the World Wide Web.
Differences:
– Must assemble document corpus by spidering the web.
– Can exploit the structural layout information in HTML (XML).
– Documents change uncontrollably.
– Can exploit the link structure of the web.
Other IR-Related Tasks
• Automated document categorization
• Information filtering (spam filtering)
• Information routing
• Automated document clustering
• Recommending information or products
• Information extraction
• Information integration
• Question answering
Related Areas
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning
Database Management
• Focused on structured data stored in relational tables rather than free-
form text.
Web Dynamics
Besides the growth in the number of pages, pages are also continuously
updated or removed:
– About 23% of all pages are modified daily.
– In the .com domain, this percentage rises to 40%.
– On average, after 10 days, half of the new pages are removed.
Structure of the Web
• Core of the network: about 28% of all pages.
– Important pages, highly interconnected with each other.
• About 22% of all pages are reachable from the core, but not vice versa.
Power Law
• The degree of a node is the number of incoming/outgoing links.
• If we call k the degree of a node, a scale-free network is defined by the
power law, which corresponds to the distribution P(k) ∝ k^(−γ), i.e., the
fraction of nodes with degree k falls off polynomially in k.
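A small numeric illustration of this distribution; the exponent value γ = 2.1 is an assumption (often reported for web in-degrees), not taken from the text:

```python
# For P(k) ~ k**(-gamma), the relative frequency of two degrees
# depends only on their ratio, independent of the normalization.
gamma = 2.1  # assumed exponent, not taken from the notes above

for k in (1, 10, 100):
    print(f"degree {k:>3}: relative frequency {k ** -gamma:.6f}")
# Nodes of degree 10 are ~10**2.1 ≈ 126 times rarer than nodes of degree 1.
```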