
UNIT - V - Easy to learn

Advanced Database Technology (Anna University)


UNIT V INFORMATION RETRIEVAL AND WEB SEARCH

IR concepts – Retrieval Models – Queries in IR system – Text Preprocessing –


Inverted Indexing – Evaluation Measures – Web Search and Analytics – Current
trends.

Introduction
Text mining refers to data mining using text documents as data.
• Most text mining tasks use Information Retrieval (IR) methods to pre-process
text documents.
• These methods are quite different from traditional data pre-processing methods
used for relational tables.
• Web search also has its roots in IR.

Information Retrieval

1. The indexing and retrieval of textual documents.


2. Searching for pages on the World Wide Web is the “killer app.”
3. Concerned firstly with retrieving relevant documents for a query.
4. Concerned secondly with retrieving from large sets of documents
efficiently.
5. IR helps users find information that matches their information needs
expressed as queries
6. Historically, IR is about document retrieval, emphasizing document as the
basic unit. – Finding documents relevant to user queries
7. Technically, IR studies the acquisition, organization, storage, retrieval, and
distribution of information.

What are the steps in text preprocessing?


Some of the common text preprocessing / cleaning steps are:

1. Lower casing.
2. Removal of Punctuations.


3. Removal of Stopwords.
4. Removal of Frequent words.
5. Removal of Rare words.
6. Stemming.
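The six steps above can be sketched as a small pipeline. This is a minimal illustration only: the stopword list is a placeholder, the suffix-stripping regex is a crude stand-in for a real stemmer (e.g. Porter's), and steps 4-5 (frequent/rare word removal) would normally be driven by corpus-wide counts.

```python
import re

# Tiny illustrative stopword list; real systems use lists of 400-500 words.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is"}

def preprocess(text):
    text = text.lower()                        # 1. lower casing
    text = re.sub(r"[^\w\s]", " ", text)       # 2. remove punctuation
    tokens = [t for t in text.split()
              if t not in STOPWORDS]           # 3. remove stopwords
    # 6. crude suffix stripping as a stand-in for real stemming
    return [re.sub(r"(ing|ed|es|s)$", "", t) or t for t in tokens]

print(preprocess("The users engineered the indexing of documents."))
# ['user', 'engineer', 'index', 'document']
```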

Typical IR Task
• Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to the query.


Problems with Keywords


• May not retrieve relevant documents that include synonymous terms.
– “restaurant” vs. “café”
– “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
– “bat” (baseball vs. mammal)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)

IR System Components

• Text Operations forms index words (tokens).


• Stopword removal
• Stemming


• Indexing constructs an inverted index of word to document pointers.


• Searching retrieves documents that contain a given query token from the
inverted index.
• Ranking scores all retrieved documents according to a relevance metric.
• User Interface manages interaction with the user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to improve retrieval:
– Query transformation using relevance feedback
Retrieval Models
• A retrieval model specifies the details of:
– Document representation
– Query representation
– Retrieval function
• Determines a notion of relevance.
• Notion of relevance can be binary or continuous (i.e., ranked retrieval).
Information retrieval models
An IR model governs how a document and a query are represented and how the
relevance of a document to a user query is defined.
Main models:
– Boolean model
– Vector space model
– Statistical language model
– etc
Boolean model
Each document or query is treated as a “bag” of words or terms.
Ordering of words is not considered.
Given a collection of documents D,


let V = {t1, t2, ..., t|V|} be the set of distinctive words/terms in the collection. V
is called the vocabulary
• A weight wij > 0 is associated with each term ti of a document dj ∈ D
– wij = 0/1 (absence/presence)
– dj = (w1j, w2j, ...,w|V|j)
Query terms are combined logically using the Boolean operators AND, OR, and
NOT.
• Retrieval – Given a Boolean query, the system retrieves every document that
makes the query logically true – Exact match
Problems with the Boolean model
Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved.
– All matched documents will be returned.
• Difficult to rank output.
– All matched documents logically satisfy the query.
• Difficult to perform relevance feedback
– If a document is identified by the user as relevant or irrelevant, how should
the query be modified?
Boolean model: an Example
Consider a document space defined by three terms,
i.e., the vocabulary / lexicon: – hardware, software, users
• A set of documents is defined as:
– A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1),A4=(1, 1, 0),
A5=(1, 0, 1),A6=(0, 1, 1) , A7=(1, 1, 1) A8=(1, 0, 1), A9=(0, 1, 1)
• If the query is: “hardware, software”
i.e., (1, 1, 0)
what documents should be retrieved?
– AND: documents A4, A7


– OR: all documents except A3
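The AND/OR behaviour of this example can be reproduced in a few lines of code, with the document vectors and vocabulary order (hardware, software, users) exactly as given:

```python
# Vocabulary order: (hardware, software, users)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware, software"

def matches_and(q, d):
    # Exact match: every query term must be present in the document.
    return all(d[i] == 1 for i in range(len(q)) if q[i] == 1)

def matches_or(q, d):
    # At least one query term must be present.
    return any(d[i] == 1 for i in range(len(q)) if q[i] == 1)

and_result = [n for n, d in docs.items() if matches_and(q, d)]
or_result  = [n for n, d in docs.items() if matches_or(q, d)]
print(and_result)  # ['A4', 'A7']
print(or_result)   # every document except A3
```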


Similarity matching: an Example
A set of documents is defined as:
A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1), A4=(1, 1, 0), A5=(1, 0, 1),
A6=(0, 1, 1), A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
In similarity matching (cosine in the Boolean vector space):
– q=(1, 1, 0)
– S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
– S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
– S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
– Retrieved document set (ranked, where cosine > 0): {A4, A7, A1, A2, A5, A6, A8, A9}
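A short script reproduces these cosine scores and the ranking (the listed values are rounded to two decimals):

```python
import math

docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

scores = {n: cosine(q, d) for n, d in docs.items()}
# Rank documents with cosine > 0, highest score first.
ranked = sorted((n for n, s in scores.items() if s > 0),
                key=lambda n: -scores[n])
print(ranked)  # ['A4', 'A7', 'A1', 'A2', 'A5', 'A6', 'A8', 'A9']
```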
Vector space model
Documents are still treated as a “bag” of words or terms.
• Each document is still represented as a vector.
• However, the term weights are not forced to be 0 or 1, like in the Boolean model
– Each term weight is computed on the basis of some variations of TF or TF-
IDF scheme.
• Term Frequency (TF) Scheme: The weight of a term ti in document dj is the
number of times that ti appears in dj, denoted by fij. Normalization may also be
applied.
The Vector-Space Model
Assume t distinct terms remain after preprocessing; call them index terms or the
vocabulary.
• These “orthogonal” terms form a vector space. Dimensionality = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors: dj = (w1j, w2j, …,wtj)
Example:


D1 = 2T1 + 3T2 + 5T3; D2 = 3T1 + 7T2 + T3; Q = 0T1 + 0T2 + 2T3

• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

      T1    T2    …    Tt
D1   w11   w21   …   wt1
D2   w12   w22   …   wt2
 :     :     :         :
Dn   w1n   w2n   …   wtn
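For the running example (D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3, Q = 2T3), cosine similarity ranks D1 above D2, since D2's weight is concentrated in terms the query does not ask for. A quick check:

```python
import math

# Rows of the term-document matrix (columns T1..T3), plus the query.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

sim1 = cosine(Q, D1)   # 10 / (2 * sqrt(38)) ≈ 0.81
sim2 = cosine(Q, D2)   #  2 / (2 * sqrt(59)) ≈ 0.13
```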


Term Weights: Term Frequency


More frequent terms in a document are more important, i.e. more indicative of the
topic. fij = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing by the frequency of the
most common term in the document: tfij = fij / maxi{fij}
Term Weights: Inverse Document Frequency
Terms that appear in many different documents are less indicative of overall topic.
dfi = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i = log2(N / dfi) (N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
TF-IDF Weighting
• A typical combined term importance indicator is tf-idf weighting:
wij = tfij · idfi = tfij · log2(N / dfi)
• A term occurring frequently in the document but rarely in the rest of the
collection is given high weight.
• Many other ways of determining term weights have been proposed.
• Experimentally, tf-idf has been found to work well.
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1) Assume collection contains 10,000 documents and document
frequencies of these terms are:
A(50), B(1300), C(250)
Then: A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
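The same arithmetic in code, with tf normalized by the most frequent term in the document, as above:

```python
import math

N = 10000                               # documents in the collection
freq = {"A": 3, "B": 2, "C": 1}         # term counts in this document
df   = {"A": 50, "B": 1300, "C": 250}   # document frequencies

max_f = max(freq.values())
# tf-idf: (fij / max f) * log2(N / dfi)
tfidf = {t: (freq[t] / max_f) * math.log2(N / df[t]) for t in freq}

for t, w in sorted(tfidf.items()):
    print(t, round(w, 1))   # A 7.6, B 2.0, C 1.8
```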


Query Vector
• Query vector is typically treated as a document and also tf-idf weighted.
• Alternative is for the user to supply weights for the given query terms.
Similarity Measure
• A similarity measure is a function that computes the degree of similarity between
two vectors.
• Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in the order of presumed
relevance.
– It is possible to enforce a certain threshold so that the size of the retrieved
set can be controlled.
Similarity Measure - Inner Product
Similarity between vectors for the document dj and query q can be
computed as the vector inner product (a.k.a. dot product):
sim(dj, q) = Σi wij · wiq
Properties of Inner Product
The inner product is unbounded.
• Favors long documents with a large number of unique terms.
• Measures how many terms matched but not how many terms are not
matched.
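A two-line check makes the "favors long documents" point concrete (the vectors are illustrative, not from the text):

```python
def inner_product(q, d):
    # Unnormalized dot product of query and document vectors.
    return sum(qi * di for qi, di in zip(q, d))

q       = [1, 1, 0, 0]
d_short = [2, 0, 0, 0]   # matches one query term
d_long  = [2, 3, 4, 5]   # same match on T1, plus many non-query terms

s_short = inner_product(q, d_short)   # 2
s_long  = inner_product(q, d_long)    # 2 + 3 = 5: higher score, even though
                                      # the extra terms never match the query
```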


Statistical Models
• A document is typically represented by a bag of words (unordered words with frequencies).
• Bag = a set that allows multiple occurrences of the same element.
• User specifies a set of desired terms with optional weights:
– Weighted query terms: Q = <database 0.5; text 0.8; information 0.2>
– Unweighted query terms: Q = <database; text; information>
– No Boolean conditions specified in the query.
• Retrieval based on similarity between query and documents.
• Output documents are ranked according to similarity to query.
• Similarity based on occurrence frequencies of keywords in query and document.
• Automatic relevance feedback can be supported:
– Relevant documents “added” to query.
– Irrelevant documents “subtracted” from query.

Text pre-processing
Document parsing for word (term) extraction: easy
• Stopwords removal
• Stemming
• Frequency counts and computing TF-IDF term weights
Stopwords removal
Many of the most frequently used words in English are useless in IR and text
mining
– these words are called stop words – “the”, “of”, “and”, “to”, ….


– Typically about 400 to 500 such words
– For an application, an additional domain-specific stopword list may be constructed
Why do we need to remove stopwords?
– Reduce indexing (or data) file size
• stopwords account for 20-30% of total word counts.
– Improve efficiency and effectiveness
• stopwords are not useful for searching or text mining
• they may also confuse the retrieval system
• Current Web Search Engines generally do not use stopword lists for “phrase
search queries”
Stemming

Techniques used to find out the root/stem of a word.


e.g., user, users, used, using → use
engineering, engineered, engineer → engineer
Usefulness:
• improving effectiveness of IR and text mining
– Matching similar words
– Mainly improve recall
• reducing indexing size
– combining words with the same roots may reduce indexing size by as much as 40-
50%
– Web Search Engine may need to index un-stemmed words too for “phrase
search”


Basic stemming methods


Using a set of rules. e.g., English rules
remove ending
– if a word ends with a consonant other than s, followed by an s, then delete
s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists only
of one letter or of th.
– If a word ends with ed, preceded by a consonant, delete the ed unless this
leaves only a single letter.
transform words – if a word ends with “ies”, but not “eies” or “aies”, then
“ies” → “y”
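These rules translate almost directly into code; note that the "ies → y" transform must be tested before the "es" and "s" rules, or "queries" would lose only its final "s". This sketch implements only the rule set listed above, not a full stemmer:

```python
def simple_stem(word):
    # "ies" -> "y", but not for "eies" or "aies"
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # ends in "es": drop the "s"
    if word.endswith("es"):
        return word[:-1]
    # consonant (other than s) followed by "s": drop the "s"
    if len(word) > 1 and word.endswith("s") and word[-2] not in "aeious":
        return word[:-1]
    # drop "ing" unless only one letter or "th" would remain
    if word.endswith("ing"):
        stem = word[:-3]
        if len(stem) > 1 and stem != "th":
            return stem
    # consonant + "ed": drop "ed" unless a single letter would remain
    if word.endswith("ed") and len(word) > 3 and word[-3] not in "aeiou":
        return word[:-2]
    return word
```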
Evaluation: Precision and Recall
• Given a query: – Are all retrieved documents relevant?
– Have all the relevant documents been retrieved?
• Measures for system performance:
– The first question is about the precision of the search
– The second is about the completeness (recall) of the search
• By increasing the number of retrieved items, we usually increase the recall,
but also reduce precision
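Both measures are simple set computations over the retrieved and relevant document sets; this toy example shows the usual trade-off when the retrieved set grows (document IDs are made up for illustration):

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall    = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}
p1, r1 = precision_recall({"d1", "d2"}, relevant)
p2, r2 = precision_recall({"d1", "d2", "d3", "d5", "d6"}, relevant)
# Retrieving more documents raised recall (0.5 -> 0.75)
# but lowered precision (1.0 -> 0.6).
```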
Precision-recall curve


Inverted index
The inverted index of a document collection is basically a data structure that
– attaches each distinctive term with a list of all documents that contain the
term.
• Thus, in retrieval, it takes constant time to
– find the documents that contain a query term.


• Multiple query terms are also easily handled, as we will see soon.
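A minimal inverted index is a dictionary from term to posting list, and a conjunctive (AND) query is then just an intersection of posting lists. The document texts here are made up for illustration:

```python
from collections import defaultdict

def build_index(docs):
    # Attach each distinct term to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_and(index, terms):
    # Multi-term query: intersect the posting lists of all query terms.
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "hardware and software", 2: "software users", 3: "hardware users"}
index = build_index(docs)
print(sorted(index["hardware"]))                 # [1, 3] (hash lookup, ~O(1))
print(search_and(index, ["software", "users"]))  # [2]
```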


Web Search System


Web
A huge, widely-distributed, highly heterogeneous, semistructured,
interconnected, evolving, hypertext/hypermedia information repository.
Application of IR to HTML documents on the World Wide Web.
Differences:
– Must assemble document corpus by spidering the web.
– Can exploit the structural layout information in HTML (XML).
– Documents change uncontrollably.
– Can exploit the link structure of the web.
Other IR-Related Tasks
• Automated document categorization
• Information filtering (spam filtering)
• Information routing
• Automated document clustering
• Recommending information or products
• Information extraction
• Information integration
• Question answering
Related Areas
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning
Database Management
• Focused on structured data stored in relational tables rather than free-
form text.


• Focused on efficient processing of well-defined queries in a formal


language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML) brings it closer to IR.
Library and Information Science
• Focused on the human user aspects of information retrieval (human-
computer interaction, user interface, visualization).
• Concerned with effective categorization of human knowledge.
• Concerned with citation analysis and bibliometrics (structure of
information).
• Recent work on digital libraries brings it closer to CS & IR.
Artificial Intelligence
• Focused on the representation of knowledge, reasoning, and intelligent
action.
• Formalisms for representing knowledge and queries:
– First-order Predicate Logic
– Bayesian Networks
• Recent work on web ontologies and intelligent information agents brings
it closer to IR.
Natural Language Processing
• Focused on the syntactic, semantic, and pragmatic analysis of natural
language text and discourse.
• Ability to analyze syntax (phrase structure) and semantics could allow
retrieval based on meaning rather than keywords.
Machine Learning
• Focused on the development of computational systems that improve their
performance with experience.
• Automated classification of examples based on learning concepts from
labeled training examples (supervised learning).
• Automated methods for clustering unlabeled examples into meaningful
groups (unsupervised learning).


Machine Learning:IR Directions


• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining and Learning to Rank

Google in July 2007 announced it had identified 1 trillion (10^12)
unique pages/URLs on the Web – after removing duplicates (about 30%-40%)!
– Estimated growth: several billion pages per day.
How many disks for all the Web pages?
Consider only the text (HTML) – on average, 10 KB per page –
for a trillion pages, about 10^16 bytes – using hard disks
of 1 terabyte (about 10^12 bytes), about 10,000 disks are needed.


Besides the growth in the number of pages, existing pages are also continuously
updated or removed – about 23% of all the pages are modified daily –
in the .com domain, this percentage rises to 40% – on average,
after 10 days, half of the new pages are removed.
Structure of the Web graph:
– 28% of all the pages
• Core of the network
• Important pages, highly interconnected with each other
– 22% of all the pages
• Reachable from the core, but not vice versa
Power law – The degree of a node is the number of incoming/outgoing
links – If we call k the degree of a node, a scale-free network is defined
by the power law, which corresponds to this distribution:

P(k) ∝ k^(-γ)

– Most of the nodes are poorly interconnected
