
IR AND LEXICAL RESOURCES
Information Retrieval And
Lexical Resources
• Information Retrieval
• Design features of Information Retrieval Systems
• Indexing
• Eliminating Stop Words
• Stemming
• Zipf’s Law
Information Retrieval And
Lexical Resources
• Information Retrieval Models
• Classical Models of IR
• Boolean Model
• Probabilistic model
• Vector Space Model
• Non-classical Models of IR
• Alternative Models of Information Retrieval
• Cluster Model
• Fuzzy Model
• Latent Semantic Indexing Model
Information Retrieval And
Lexical Resources
• Evaluation of the IR System
• Lexical Resources:
• WordNet
• FrameNet
• Stemmers
• Part-of-Speech (POS) Taggers
• Research Corpora
Information Retrieval
• Information retrieval (IR) deals with the organisation, storage,
retrieval, and evaluation of information relevant to a user’s query.
• A user in need of information formulates a request in the form of a
query written in a natural language.
• The retrieval system responds by retrieving documents that seem relevant to the query.
• An information retrieval system does not inform (i.e., change the knowledge of) the user on the subject of her inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to her request.
Design features of Information Retrieval Systems
[Figure 1: the basic process of information retrieval]
• Fig. 1 illustrates the basic process of IR.
• It begins with the user’s information need.
• Based on this need, he/she formulates a query.
• The IR system returns documents that seem relevant to the query.
• The retrieval is performed by matching the query representation with
document representation.
1. Indexing
• A collection of raw documents is usually transformed into an easily accessible
representation. This process is known as indexing.
• Most indexing techniques involve identifying good document descriptors, such as
keywords or terms which describe the information content of documents.
• Luhn (1957, 1958) is considered the first to advance the notion of automatic indexing of documents based on their content. He assumed that the frequency of certain word occurrences in an article gives a meaningful indication of the article's content. He proposed that the discrimination power of index terms is a function of the rank order of their frequency of occurrence, and that middle-frequency terms have the highest discrimination power. This model was proposed for the extraction of salient terms from a document.
1. Indexing
• A term can be a single word or a multiword phrase.
• For example, the sentence, Design features of information retrieval
systems, can be represented as follows:
• Design, features, information, retrieval, systems.
• It can also be represented by the set of terms:
• Design, features, information retrieval, information retrieval systems.
• These multiword terms can be obtained by looking at frequently appearing sequences of words (n-grams), by using part-of-speech tags, by applying NLP to identify meaningful phrases, or by handcrafting.
1. Indexing
In the Text REtrieval Conference (TREC), the method used for phrase extraction is as follows:
• Any pair of adjacent non-stop words is regarded as a potential phrase.
• The final list of phrases is composed of those pairs of words that
occur in, say, 25 or more documents in the document collection.
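A minimal sketch of this procedure in Python (the stop list and the 25-document threshold are illustrative):

from collections import defaultdict

STOP_WORDS = {"of", "the", "a", "an", "in", "is", "and"}   # illustrative subset

def candidate_phrases(tokens):
    """Yield each pair of adjacent non-stop words as a potential phrase."""
    for w1, w2 in zip(tokens, tokens[1:]):
        if w1 not in STOP_WORDS and w2 not in STOP_WORDS:
            yield (w1, w2)

def extract_phrases(documents, min_doc_freq=25):
    """Keep pairs that occur in at least min_doc_freq documents."""
    doc_freq = defaultdict(int)
    for doc in documents:
        for pair in set(candidate_phrases(doc.lower().split())):
            doc_freq[pair] += 1
    return {pair for pair, df in doc_freq.items() if df >= min_doc_freq}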
2. Eliminating Stop Words
• The lexical processing of index terms involves elimination of stop words.
• Stop words are high-frequency words which have little semantic weight and are thus unlikely to help in retrieval.
• Typical examples of stop words are articles and prepositions.
• Eliminating them considerably reduces the number of index terms. The drawback of eliminating stop words is that it can sometimes result in the elimination of useful index terms, for instance the stop word A in Vitamin A. Some phrases, like to be or not to be, consist entirely of stop words.
• Eliminating stop words in such cases makes it impossible to correctly search for a document.
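A minimal sketch of stop word elimination (the stop list is a small illustrative subset; real systems use lists of a few hundred words):

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "be", "or", "not"}

def remove_stop_words(tokens):
    """Drop high-frequency function words from the token list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# "to be or not to be" disappears entirely -- the drawback noted above
print(remove_stop_words("to be or not to be".split()))    # []
print(remove_stop_words("design features of information retrieval systems".split()))
# ['design', 'features', 'information', 'retrieval', 'systems']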
3. Stemming
• Stemming normalizes morphological variants, though in a crude manner, by removing affixes from words to reduce them to their stem; e.g., the words compute, computing, computes, and computer are all reduced to the same word stem, comput.
• The stemmed representation of the text, Design features of information retrieval systems, is (design, feature, inform, retriev, system).
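The compute example above can be reproduced with an off-the-shelf stemmer; a sketch using NLTK's implementation of Porter's algorithm (assuming NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compute", "computing", "computes", "computer"]:
    print(word, "->", stemmer.stem(word))   # all four reduce to "comput"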
3. Stemming
• One of the problems associated with stemming is that it may throw away useful distinctions. In some cases, it usefully conflates similar terms, resulting in increased recall.
• In others, it may be harmful, resulting in reduced precision (e.g., when documents containing the term computation are returned in response to the query phrase personal computer).
4. Zipf’s Law
• Zipf’s law says that the frequency of a word multiplied by its rank in a large corpus is more or less constant. More formally,
• frequency × rank ≈ constant
• This means that if we compute the frequencies of the words in a corpus and arrange them in decreasing order of frequency, then the product of the frequency of a word and its rank is approximately equal to the product of the frequency and rank of any other word.
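A sketch that tabulates frequency × rank for any tokenized corpus (tokens stands in for an arbitrary list of words):

from collections import Counter

def zipf_table(tokens, top=10):
    """Print frequency * rank for the most frequent words; by Zipf's law
    the product should stay roughly constant down the table."""
    counts = Counter(tokens)
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>4}  {word:<15} {freq:>6} {rank * freq:>8}")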
4. Zipf’s Law
• This indicates that the frequency of a word is inversely proportional to
its rank.
• This relationship is shown in the figure below.
[Figure: word frequency plotted against rank order, falling as rank increases]
4. Zipf’s Law
• Empirical investigations of Zipf’s law on large corpora suggest that human languages contain a small number of words that occur with high frequency and a large number of words that occur with low frequency.
• High-frequency words, being common, have less discriminating power and are thus not useful for indexing.
• Low-frequency words are less likely to be included in a query and are also not useful for indexing. As there are a large number of rare (low-frequency) words, dropping them considerably reduces the size of the list of index terms.
4. Zipf’s Law
• The remaining medium frequency words are content-bearing terms
and can be used for indexing.
• This can be implemented by defining thresholds for high and low
frequency, and dropping words that have frequencies above or below
these thresholds.
• Stop word elimination can be thought of as an implementation of
Zipf’s law, where high frequency terms are dropped from a set of
index terms.
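A minimal sketch of this thresholding (the cut-off values are illustrative; in practice they are tuned to the collection):

from collections import Counter

def select_index_terms(tokens, low=5, high=1000):
    """Keep medium-frequency words as index terms: drop words rarer
    than `low` or more frequent than `high` (hypothetical thresholds)."""
    counts = Counter(tokens)
    return {term for term, freq in counts.items() if low <= freq <= high}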
Information Retrieval Models
• The IR system consists of a model for documents, a model for queries,
and a matching function which compares queries to documents.
• The central objective of the model is to retrieve all documents relevant to a query. This defines the central task of an IR system.
• IR models can be classified as follows:
1. Classical Models of IR
2. Non-classical Models of IR
3. Alternative Models of Information Retrieval
Classical Models of IR
• The three classical IR models (Boolean, vector, and probabilistic) are based on mathematical foundations that are easily recognized and well understood.
• These models are simple, efficient, and easy to implement. Almost all
existing commercial systems are based on the mathematical models
of IR.
• That is why they are called classical models of IR.
CLASSICAL INFORMATION
RETRIEVAL MODELS
• Boolean Model
• Probabilistic model
• Vector Space Model
Boolean Model
• Description: This model treats documents and queries as sets of index
terms.
• Functionality: It uses Boolean logic (AND, OR, NOT) to define precise
matches between the query and documents.
• Characteristics: It is a strict, all-or-nothing system where documents
either match the query or they don't.
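A minimal sketch of Boolean retrieval over an inverted index (documents and IDs are illustrative):

def build_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {1: "information retrieval systems", 2: "database systems", 3: "information theory"}
index = build_index(docs)
all_ids = set(docs)

print(index["information"] & index["systems"])            # AND -> {1}
print(index["information"] | index["database"])           # OR  -> {1, 2, 3}
print(index["systems"] & (all_ids - index["database"]))   # AND NOT -> {1}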
Vector Space Model (VSM)
• Description: This model represents documents and queries as vectors
in a t-dimensional space, where 't' is the number of unique index
terms in the entire collection.
• Functionality: It calculates the similarity between a query vector and
document vectors using algebraic methods, such as the cosine
similarity, based on weighted terms (like TF-IDF).
• Characteristics: It allows for partial matching and ranks documents by
their relevance to the query.
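A sketch using scikit-learn's TF-IDF vectorizer and cosine similarity (assuming scikit-learn is installed; the documents are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["design of information retrieval systems",
        "database management systems",
        "retrieval models and ranking"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)           # one weighted vector per document
query_vector = vectorizer.transform(["information retrieval"])

scores = cosine_similarity(query_vector, doc_vectors)[0]
print(sorted(zip(scores, docs), reverse=True))         # documents ranked by similarity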
Probabilistic Model
• Description: This model frames information retrieval as a process of
estimating the probability of relevance between a query and a
document.
• Functionality: It aims to assign a probability score indicating the
likelihood that a document is relevant to a user's information need.
• Characteristics: It views the IR task as a statistical inference problem, providing a ranked list of documents based on these probabilities.
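One classical instantiation is the Robertson–Spärck Jones term weighting, which with no relevance information reduces to an IDF-like form; a sketch of that simplified scheme (not the document's own formulation; counts are illustrative):

import math

def rsj_weight(N, n):
    """Weight of a term occurring in n of N documents, with no relevance
    information available; the 0.5 terms smooth the estimate."""
    return math.log((N - n + 0.5) / (n + 0.5))

def score(query_terms, doc_terms, doc_freq, N):
    """Rank score: sum the weights of query terms present in the document."""
    return sum(rsj_weight(N, doc_freq.get(t, 0))
               for t in query_terms if t in doc_terms)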
Non-classical models
• Non-classical models perform retrieval based on principles other than those used by classical models, i.e., similarity, probability, and Boolean operations.
• These are best exemplified by models based on special logic techniques, situation theory, or the concept of interaction.
Non-classical models
• Non-classical IR models are based on principles other than similarity, probability,
Boolean operations, etc., on which classical retrieval models are based.
• Examples include information logic model, situation theory model, and
interaction model.
• The information logic model is based on a special logic technique called logical
imaging.
• Retrieval is performed by making inferences from document to query. This is
unlike classical models, where a search process is used.
• Unlike the usual implication, which is true in all cases except when the antecedent is true and the consequent is false, this inference is uncertain.
• Hence, a measure of uncertainty is associated with this inference.
Non-classical models
• These models often operate on different principles than traditional
models.
• Information Logic Model: Focuses on inference and measures
uncertainty in the retrieval process.
• Situation Theory Model: Views retrieval as an information flow and
utilizes "infons" to represent information in documents.
• Interaction Model: Models documents and queries as neurons in a
neural network and measures their interaction to determine
relevance.
Alternative Models of
Information Retrieval
• The third category of IR models, namely alternative models, consists of enhancements of classical models that make use of specific techniques from other fields.
• The cluster model, fuzzy model, and latent semantic indexing (LSI)
model are examples of alternative models of IR.
Alternative Models of
Information Retrieval
Cluster Model:
• Groups documents based on similarity, allowing for retrieval based on clusters rather
than individual documents.
Fuzzy Model:
• Extends the Boolean model by allowing for terms to be matched partially rather than
requiring exact matches, providing more flexible retrieval.
Latent Semantic Indexing (LSI):
• A technique that goes beyond simple keyword matching by identifying underlying semantic relationships between terms and documents, improving retrieval accuracy (see the sketch after this list).
Generalized Vector Model:
• Unlike classical vector models that enforce term independence, the generalized vector model allows for correlation between index terms.
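A sketch of LSI via truncated SVD over a TF-IDF matrix, using scikit-learn (the corpus and the number of latent dimensions are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is driven on the road",
        "the truck is driven on the highway",
        "a plane flies in the sky"]
tfidf = TfidfVectorizer().fit(docs)
svd = TruncatedSVD(n_components=2)              # 2 latent "concept" dimensions
doc_latent = svd.fit_transform(tfidf.transform(docs))

query_latent = svd.transform(tfidf.transform(["a vehicle on the road"]))
print(cosine_similarity(query_latent, doc_latent))   # similarity in concept space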
Evaluation of the IR System
• IR systems are commonly evaluated in terms of precision and recall, computed over a test collection with known relevance judgements.
• Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved.
• The two are often combined into a single figure, the F-measure, their harmonic mean.
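A minimal sketch of these measures (document IDs and relevance judgements are illustrative):

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a set of relevant documents."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(p, r)                         # 0.5 0.666...
f_measure = 2 * p * r / (p + r)     # harmonic mean of precision and recall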
Lexical Resources
• WordNet
• FrameNet
• Stemmers
• Part-of-Speech (POS) Taggers
• Research Corpora
WordNet
• A large English lexical database with three parts: nouns, verbs, and
adjectives/adverbs.
• Words are grouped into synsets (sets of synonyms) representing one concept,
linked via lexical (word-form) and semantic (meaning) relations like synonymy,
hypernymy/hyponymy, antonymy, meronymy/holonymy, troponymy.
• A word may appear in multiple synsets (different senses). Each sense includes
synonyms and a gloss (definition + example).
• Nouns/verbs organized hierarchically (hypernyms), adjectives clustered by
antonyms.
• Freely downloadable. Multilingual versions exist (EuroWordNet, Hindi
WordNet).
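A sketch of these lookups using NLTK's WordNet interface (assuming NLTK and its WordNet data are installed):

from nltk.corpus import wordnet as wn

# a word may appear in multiple synsets, one per sense
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())

car = wn.synsets("car")[0]
print(car.lemma_names())    # synonyms grouped in the synset
print(car.hypernyms())      # more general concepts in the noun hierarchy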
WordNet
Applications:
• Concept identification, word sense disambiguation (Voorhees used
WordNet noun hierarchy in IR), query expansion, document
categorization, summarization (lexical chains).
FrameNet
• A large database of semantically annotated English sentences based
on frame semantics.
• Words evoke frames (situations) with participants called frame
elements (semantic roles).
• Example: [Authorities The police] nabbed [Suspect the snatcher] (“nab” evokes the ARREST frame).
• Frames may inherit roles (STATEMENT frame inherits from
COMMUNICATION frame).
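NLTK also ships a FrameNet corpus reader; a sketch of looking up the ARREST frame and its frame elements (assuming the FrameNet data is downloaded; attribute names follow NLTK's reader):

from nltk.corpus import framenet as fn

frame = fn.frame("Arrest")        # the frame evoked by "nab"
print(frame.definition)
print(sorted(frame.FE))           # frame elements (semantic roles)
print(list(frame.lexUnit))        # lexical units that evoke the frame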
FrameNet
Applications:
• Automatic semantic parsing (Gildea & Jurafsky), information
extraction, question answering (e.g., sender/recipient roles), IR,
machine translation, summarization, word sense disambiguation.
Stemmers
• Reduce inflected/derived words to a stem (not necessarily a valid
root).
• Common in search engines for indexing/query expansion.
• Popular algorithms: Porter’s, Lovins, Paice/Husk.
• Snowball provides stemmers for many European languages.
• For Indian languages, cluster-based approaches (Majumder et al.)
improve recall.
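A sketch of stemmer use via NLTK, which bundles the Snowball family (one stemmer per supported language):

from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)    # languages Snowball supports
english = SnowballStemmer("english")
print(english.stem("astronauts"))   # astronaut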
Stemmers
Applications:
• Reduces index size, retrieves documents with word variants
(“astronaut” ↔ “astronauts”), used in text
summarization/categorization.
• May slightly reduce precision in English systems.
Part-of-Speech (POS) Taggers
Assign grammatical tags (noun, verb, etc.) early in text processing for IR, MT, speech
synthesis.
Popular Taggers:
• Stanford POS Tagger (MaxEnt Markov Model).
• Bi-directional MEMM Tagger (outperforms unidirectional).
• TnT (HMM-based, efficient).
• Brill Tagger (rule-based, transformation learning).
• CLAWS (probabilistic + rule-based hybrid).
• Tree-Tagger (decision tree for transition probabilities).
• ACOPOST (Maximum Entropy, Trigram, Transformation-based, Example-based taggers).
• Limited tools for Indian languages due to lack of annotated corpora.
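A sketch of tagging with NLTK's default English tagger, a pre-trained averaged-perceptron model (assuming the required NLTK models are downloaded):

import nltk

tokens = nltk.word_tokenize("The police nabbed the snatcher.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('police', 'NN'), ('nabbed', 'VBD'), ...]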
Research Corpora
Standard collections for NLP tasks:
• IR Test Collections: LETOR (OHSUMED, TREC datasets) for learning-to-
rank.
• Summarization: DUC with gold summaries.
• Word Sense Disambiguation: SEMCOR (Brown subset tagged with
WordNet synsets); Open Mind Word Expert (crowdsourced).
• Asian Languages: EMILLE (South Asian languages); CIIL (Indian
languages, multiple genres).
