
IR AND LEXICAL RESOURCES
Information Retrieval And
Lexical Resources
• Information Retrieval
• Design features of Information Retrieval Systems
• Indexing
• Eliminating Stop Words
• Stemming
• Zipf’s Law
Information Retrieval And
Lexical Resources
• Information Retrieval Models
• Classical Models of IR
• Boolean Model
• Probabilistic model
• Vector Space Model
• Non-classical Models of IR
• Alternative Models of Information Retrieval
• Cluster Model
• Fuzzy Model
• Latent Semantic Indexing Model
Information Retrieval And
Lexical Resources
• Evaluation of the IR System
• Lexical Resources:
• WordNet
• FrameNet
• Stemmers
• Part-of-Speech (POS) Taggers
• Research Corpora
Information Retrieval
• Information retrieval (IR) deals with the organisation, storage,
retrieval, and evaluation of information relevant to a user’s query.
• A user in need of information formulates a request in the form of a
query written in a natural language.
• The retrieval system responds by retrieving documents that seem relevant to the query.
• An information retrieval system does not inform (i.e., change the knowledge of) the user on the subject of her inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to her request.
Design features of Information Retrieval Systems
[Figure 1: the basic process of information retrieval]
• Fig. 1 illustrates the basic process of IR.
• It begins with the user’s information need.
• Based on this need, he/she formulates a query.
• The IR system returns documents that seem relevant to the query.
• The retrieval is performed by matching the query representation with
document representation.
1. Indexing
• A collection of raw documents is usually transformed into an easily accessible
representation. This process is known as indexing.
• Most indexing techniques involve identifying good document descriptors, such as
keywords or terms which describe the information content of documents.
• Luhn (1957, 1958) is considered the first to advance the notion of automatic indexing of documents based on their content. He assumed that the frequency of certain word occurrences in an article gives a meaningful indication of the article's content. He proposed that the discrimination power of index terms is a function of the rank order of their frequency of occurrence, and that middle-frequency terms have the highest discrimination power. This model was proposed for the extraction of salient terms from a document.
1. Indexing
• A term can be a single word or a multiword phrase.
• For example, the sentence, Design features of information retrieval
systems, can be represented as follows:
• Design, features, information, retrieval, systems.
• It can also be represented by the set of terms:
• Design, features, information retrieval, information retrieval systems.
• These multiword terms can be obtained by looking at frequently appearing sequences of words (n-grams), by using part-of-speech tags, by applying NLP to identify meaningful phrases, or by handcrafting.
1. Indexing
In the Text REtrieval Conference (TREC), the method used for phrase extraction is as follows:
• Any pair of adjacent non-stop words is regarded as a potential phrase.
• The final list of phrases is composed of those pairs of words that
occur in, say, 25 or more documents in the document collection.
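A minimal sketch of this procedure in Python (the stop list and the 25-document threshold are illustrative):

from collections import defaultdict

STOP_WORDS = {"of", "the", "a", "an", "in", "is", "and"}   # illustrative subset

def candidate_phrases(tokens):
    """Yield each pair of adjacent non-stop words as a potential phrase."""
    for w1, w2 in zip(tokens, tokens[1:]):
        if w1 not in STOP_WORDS and w2 not in STOP_WORDS:
            yield (w1, w2)

def extract_phrases(documents, min_doc_freq=25):
    """Keep pairs that occur in at least min_doc_freq documents."""
    doc_freq = defaultdict(int)
    for doc in documents:
        for pair in set(candidate_phrases(doc.lower().split())):
            doc_freq[pair] += 1
    return {pair for pair, df in doc_freq.items() if df >= min_doc_freq}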
2. Eliminating Stop Words
• The lexical processing of index terms involves elimination of stop words.
• Stop words are high-frequency words which have little semantic weight and are thus unlikely to help in retrieval.
• Typical examples of stop words are articles and prepositions.
• Eliminating them considerably reduces the number of index terms. The drawback of eliminating stop words is that it can sometimes result in the elimination of useful index terms, for instance the stop word A in Vitamin A. Some phrases, like to be or not to be, consist entirely of stop words.
• Eliminating stop words in such cases makes it impossible to correctly search for a document.
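A minimal sketch of stop word elimination (the stop list is a small illustrative subset; real systems use lists of a few hundred words):

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "be", "or", "not"}

def remove_stop_words(tokens):
    """Drop high-frequency function words from the token list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# "to be or not to be" disappears entirely -- the drawback noted above
print(remove_stop_words("to be or not to be".split()))    # []
print(remove_stop_words("design features of information retrieval systems".split()))
# ['design', 'features', 'information', 'retrieval', 'systems']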
3. Stemming
• Stemming normalizes morphological variants, though in a crude manner, by removing affixes from words to reduce them to their stem; e.g., the words compute, computing, computes, and computer are all reduced to the same word stem, comput.
• The stemmed representation of the text, Design features of information retrieval systems, is (design, feature, inform, retriev, system).
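The compute example above can be reproduced with an off-the-shelf stemmer; a sketch using NLTK's implementation of Porter's algorithm (assuming NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compute", "computing", "computes", "computer"]:
    print(word, "->", stemmer.stem(word))   # all four reduce to "comput"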
3. Stemming
• One of the problems associated with stemming is that it may throw away useful distinctions. In some cases, it usefully conflates similar terms, resulting in increased recall.
• In others, it may be harmful, resulting in reduced precision (e.g., when documents containing the term computation are returned in response to the query phrase personal computer).
4. Zipf’s Law
• Zipf’s law says that the frequency of a word multiplied by its rank in a large corpus is more or less constant. More formally,
• frequency × rank ≈ constant
• This means that if we compute the frequencies of the words in a corpus and arrange them in decreasing order of frequency, then the product of the frequency of a word and its rank is approximately equal to the product of the frequency and rank of any other word.
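A sketch that tabulates frequency × rank for any tokenized corpus (tokens stands in for an arbitrary list of words):

from collections import Counter

def zipf_table(tokens, top=10):
    """Print frequency * rank for the most frequent words; by Zipf's law
    the product should stay roughly constant down the table."""
    counts = Counter(tokens)
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>4}  {word:<15} {freq:>6} {rank * freq:>8}")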
4. Zipf’s Law
• This indicates that the frequency of a word is inversely proportional to
its rank.
• This relationship is shown in the figure below.
[Figure: word frequency plotted against rank order, falling as rank increases]
4. Zipf’s Law
• Empirical investigations of Zipf’s law on large corpora suggest that human languages contain a small number of words that occur with high frequency and a large number of words that occur with low frequency.
• High-frequency words, being common, have less discriminating power and are thus not useful for indexing.
• Low-frequency words are less likely to be included in a query and are also not useful for indexing. As there are a large number of rare (low-frequency) words, dropping them considerably reduces the size of the list of index terms.
4. Zipf’s Law
• The remaining medium frequency words are content-bearing terms
and can be used for indexing.
• This can be implemented by defining thresholds for high and low
frequency, and dropping words that have frequencies above or below
these thresholds.
• Stop word elimination can be thought of as an implementation of
Zipf’s law, where high frequency terms are dropped from a set of
index terms.
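A minimal sketch of this thresholding (the cut-off values are illustrative; in practice they are tuned to the collection):

from collections import Counter

def select_index_terms(tokens, low=5, high=1000):
    """Keep medium-frequency words as index terms: drop words rarer
    than `low` or more frequent than `high` (hypothetical thresholds)."""
    counts = Counter(tokens)
    return {term for term, freq in counts.items() if low <= freq <= high}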
Information Retrieval Models
• The IR system consists of a model for documents, a model for queries,
and a matching function which compares queries to documents.
• The central objective of the model is to retrieve all documents relevant to a query. This defines the central task of an IR system.
• IR models can be classified as follows:
1. Classical Models of IR
2. Non-classical Models of IR
3. Alternative Models of Information Retrieval
Classical Models of IR
• The three classical IR models (Boolean, vector, and probabilistic) are based on mathematical foundations that are easily recognized and well understood.
• These models are simple, efficient, and easy to implement. Almost all
existing commercial systems are based on the mathematical models
of IR.
• That is why they are called classical models of IR.
CLASSICAL INFORMATION
RETRIEVAL MODELS
• Boolean Model
• Probabilistic model
• Vector Space Model
Boolean Model
• Description: This model treats documents and queries as sets of index
terms.
• Functionality: It uses Boolean logic (AND, OR, NOT) to define precise
matches between the query and documents.
• Characteristics: It is a strict, all-or-nothing system where documents
either match the query or they don't.
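A minimal sketch of Boolean retrieval over an inverted index (documents and IDs are illustrative):

def build_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {1: "information retrieval systems", 2: "database systems", 3: "information theory"}
index = build_index(docs)
all_ids = set(docs)

print(index["information"] & index["systems"])            # AND -> {1}
print(index["information"] | index["database"])           # OR  -> {1, 2, 3}
print(index["systems"] & (all_ids - index["database"]))   # AND NOT -> {1}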
Vector Space Model (VSM)
• Description: This model represents documents and queries as vectors
in a t-dimensional space, where 't' is the number of unique index
terms in the entire collection.
• Functionality: It calculates the similarity between a query vector and
document vectors using algebraic methods, such as the cosine
similarity, based on weighted terms (like TF-IDF).
• Characteristics: It allows for partial matching and ranks documents by
their relevance to the query.
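A sketch using scikit-learn's TF-IDF vectorizer and cosine similarity (assuming scikit-learn is installed; the documents are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["design of information retrieval systems",
        "database management systems",
        "retrieval models and ranking"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)           # one weighted vector per document
query_vector = vectorizer.transform(["information retrieval"])

scores = cosine_similarity(query_vector, doc_vectors)[0]
print(sorted(zip(scores, docs), reverse=True))         # documents ranked by similarity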
Probabilistic Model
• Description: This model frames information retrieval as a process of
estimating the probability of relevance between a query and a
document.
• Functionality: It aims to assign a probability score indicating the
likelihood that a document is relevant to a user's information need.
• Characteristics: It views the IR task as a statistical inference problem, providing a ranked list of documents based on these probabilities.
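One classical instantiation is the Robertson–Spärck Jones term weighting, which with no relevance information reduces to an IDF-like form; a sketch of that simplified scheme (not the document's own formulation; counts are illustrative):

import math

def rsj_weight(N, n):
    """Weight of a term occurring in n of N documents, with no relevance
    information available; the 0.5 terms smooth the estimate."""
    return math.log((N - n + 0.5) / (n + 0.5))

def score(query_terms, doc_terms, doc_freq, N):
    """Rank score: sum the weights of query terms present in the document."""
    return sum(rsj_weight(N, doc_freq.get(t, 0))
               for t in query_terms if t in doc_terms)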
Non-classical models
• Non-classical models perform retrieval based on principles other than those used by classical models, i.e., similarity, probability, and Boolean operations.
• These are best exemplified by models based on special logic techniques, situation theory, or the concept of interaction.
Non-classical models
• Non-classical IR models are based on principles other than similarity, probability,
Boolean operations, etc., on which classical retrieval models are based.
• Examples include information logic model, situation theory model, and
interaction model.
• The information logic model is based on a special logic technique called logical
imaging.
• Retrieval is performed by making inferences from document to query. This is
unlike classical models, where a search process is used.
• Unlike the usual implication, which is true in all cases except when the antecedent is true and the consequent is false, this inference is uncertain.
• Hence, a measure of uncertainty is associated with this inference.
Non-classical models
• These models often operate on different principles than traditional
models.
• Information Logic Model: Focuses on inference and measures
uncertainty in the retrieval process.
• Situation Theory Model: Views retrieval as an information flow and
utilizes "infons" to represent information in documents.
• Interaction Model: Models documents and queries as neurons in a
neural network and measures their interaction to determine
relevance.
Alternative Models of
Information Retrieval
• The third category of IR models, namely alternative models, consists of enhancements of classical models that make use of specific techniques from other fields.
• The cluster model, fuzzy model, and latent semantic indexing (LSI)
model are examples of alternative models of IR.
Alternative Models of
Information Retrieval
Cluster Model:
• Groups documents based on similarity, allowing for retrieval based on clusters rather
than individual documents.
Fuzzy Model:
• Extends the Boolean model by allowing for terms to be matched partially rather than
requiring exact matches, providing more flexible retrieval.
Latent Semantic Indexing (LSI):
• A technique that goes beyond simple keyword matching by identifying underlying semantic relationships between terms and documents, improving retrieval accuracy (see the sketch after this list).
Generalized Vector Model:
• Unlike classical vector models that enforce term independence, the generalized vector model allows for correlation between index terms.
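A sketch of LSI via truncated SVD over a TF-IDF matrix, using scikit-learn (the corpus and the number of latent dimensions are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is driven on the road",
        "the truck is driven on the highway",
        "a plane flies in the sky"]
tfidf = TfidfVectorizer().fit(docs)
svd = TruncatedSVD(n_components=2)              # 2 latent "concept" dimensions
doc_latent = svd.fit_transform(tfidf.transform(docs))

query_latent = svd.transform(tfidf.transform(["a vehicle on the road"]))
print(cosine_similarity(query_latent, doc_latent))   # similarity in concept space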
Evaluation of the IR System
• IR systems are commonly evaluated in terms of precision and recall, computed over a test collection with known relevance judgements.
• Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved.
• The two are often combined into a single figure, the F-measure, their harmonic mean.
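A minimal sketch of these measures (document IDs and relevance judgements are illustrative):

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a set of relevant documents."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(p, r)                         # 0.5 0.666...
f_measure = 2 * p * r / (p + r)     # harmonic mean of precision and recall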
Lexical Resources
• WordNet
• FrameNet
• Stemmers
• Part-of-Speech (POS) Taggers
• Research Corpora
WordNet
• A large English lexical database with three parts: nouns, verbs, and
adjectives/adverbs.
• Words are grouped into synsets (sets of synonyms) representing one concept,
linked via lexical (word-form) and semantic (meaning) relations like synonymy,
hypernymy/hyponymy, antonymy, meronymy/holonymy, troponymy.
• A word may appear in multiple synsets (different senses). Each sense includes
synonyms and a gloss (definition + example).
• Nouns/verbs organized hierarchically (hypernyms), adjectives clustered by
antonyms.
• Freely downloadable. Multilingual versions exist (EuroWordNet, Hindi
WordNet).
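A sketch of these lookups using NLTK's WordNet interface (assuming NLTK and its WordNet data are installed):

from nltk.corpus import wordnet as wn

# a word may appear in multiple synsets, one per sense
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())

car = wn.synsets("car")[0]
print(car.lemma_names())    # synonyms grouped in the synset
print(car.hypernyms())      # more general concepts in the noun hierarchy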
WordNet
Applications:
• Concept identification, word sense disambiguation (Voorhees used
WordNet noun hierarchy in IR), query expansion, document
categorization, summarization (lexical chains).
FrameNet
• A large database of semantically annotated English sentences based
on frame semantics.
• Words evoke frames (situations) with participants called frame
elements (semantic roles).
• Example: [Authorities The police] nabbed [Suspect the snatcher] (“nab” evokes the ARREST frame).
• Frames may inherit roles (STATEMENT frame inherits from
COMMUNICATION frame).
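NLTK also ships a FrameNet corpus reader; a sketch of looking up the ARREST frame and its frame elements (assuming the FrameNet data is downloaded; attribute names follow NLTK's reader):

from nltk.corpus import framenet as fn

frame = fn.frame("Arrest")        # the frame evoked by "nab"
print(frame.definition)
print(sorted(frame.FE))           # frame elements (semantic roles)
print(list(frame.lexUnit))        # lexical units that evoke the frame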
FrameNet
Applications:
• Automatic semantic parsing (Gildea & Jurafsky), information
extraction, question answering (e.g., sender/recipient roles), IR,
machine translation, summarization, word sense disambiguation.
Stemmers
• Reduce inflected/derived words to a stem (not necessarily a valid
root).
• Common in search engines for indexing/query expansion.
• Popular algorithms: Porter’s, Lovins, Paice/Husk.
• Snowball provides stemmers for many European languages.
• For Indian languages, cluster-based approaches (Majumder et al.)
improve recall.
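A sketch of stemmer use via NLTK, which bundles the Snowball family (one stemmer per supported language):

from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)    # languages Snowball supports
english = SnowballStemmer("english")
print(english.stem("astronauts"))   # astronaut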
Stemmers
Applications:
• Reduces index size, retrieves documents with word variants
(“astronaut” ↔ “astronauts”), used in text
summarization/categorization.
• May slightly reduce precision in English systems.
Part-of-Speech (POS) Taggers
Assign grammatical tags (noun, verb, etc.) early in text processing for IR, MT, speech
synthesis.
Popular Taggers:
• Stanford POS Tagger (MaxEnt Markov Model).
• Bi-directional MEMM Tagger (outperforms unidirectional).
• TnT (HMM-based, efficient).
• Brill Tagger (rule-based, transformation learning).
• CLAWS (probabilistic + rule-based hybrid).
• Tree-Tagger (decision tree for transition probabilities).
• ACOPOST (Maximum Entropy, Trigram, Transformation-based, Example-based taggers).
• Limited tools for Indian languages due to lack of annotated corpora.
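A sketch of tagging with NLTK's default English tagger, a pre-trained averaged-perceptron model (assuming the required NLTK models are downloaded):

import nltk

tokens = nltk.word_tokenize("The police nabbed the snatcher.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('police', 'NN'), ('nabbed', 'VBD'), ...]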
Research Corpora
Standard collections for NLP tasks:
• IR Test Collections: LETOR (OHSUMED, TREC datasets) for learning-to-
rank.
• Summarization: DUC with gold summaries.
• Word Sense Disambiguation: SEMCOR (Brown subset tagged with
WordNet synsets); Open Mind Word Expert (crowdsourced).
• Asian Languages: EMILLE (South Asian languages); CIIL (Indian
languages, multiple genres).
