Information Retrieval and XML Data: ADBMS Unit-4
Topics
Colliding Worlds: Databases, IR & XML
Introduction to Information Retrieval
Indexing for Text Search
Web Search Engines, Managing Text in DBMS
A Data Model for XML: XQuery
Efficient Evaluation of XML Queries
An information retrieval (IR) system is a set of
algorithms that finds the documents relevant to a
user’s query and displays them in order of relevance.
In simple words, it sorts and ranks documents based on
the queries of a user. The query and the document text
are represented in a uniform way so that they can be
compared directly.
This also allows a matching function to rank each
document formally using its Retrieval Status Value
(RSV). The document contents are represented by a
collection of descriptors, known as terms, that belong
to a vocabulary V. An IR system may also gather
feedback on the usefulness of the displayed results by
tracking the user’s behaviour.
Types of Information Retrieval Model
An information retrieval model comprises the following
four key elements:
D − Document Representation.
Q − Query Representation.
F − A framework to match and establish a relationship
between D and Q.
R (q, di) − A ranking function that determines the
similarity between the query and the document to
display relevant information.
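As a rough illustration of how D, Q, F and R(q, di) fit together, the sketch below represents documents and the query as term-count vectors and uses cosine similarity as the ranking function. Cosine similarity is only one common choice for R, and the sample documents and function names are assumptions made for this example.

```python
import math
from collections import Counter

def to_vector(text):
    """D and Q: represent a text as a bag of lowercase terms with counts."""
    return Counter(text.lower().split())

def cosine(q_vec, d_vec):
    """R(q, d): cosine similarity between two term-count vectors."""
    dot = sum(q_vec[t] * d_vec[t] for t in q_vec)
    q_norm = math.sqrt(sum(c * c for c in q_vec.values()))
    d_norm = math.sqrt(sum(c * c for c in d_vec.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

def rank(query, docs):
    """F: match the query against every document and sort by score."""
    q_vec = to_vector(query)
    scored = [(doc_id, cosine(q_vec, to_vector(text)))
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample collection.
docs = {
    "d1": "xml database stores xml documents",
    "d2": "search engines rank web pages",
    "d3": "relational database tables",
}
ranking = rank("xml database", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Here d1 shares both query terms and ranks first, d3 shares one term, and d2 shares none and receives a score of zero.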
There are three types of Information Retrieval
(IR) models:
Classical IR Model: It is built on basic
mathematical concepts and is the most widely used type
of IR model.
Classical IR models can be implemented with ease.
Examples include the Vector-space, Boolean and
Probabilistic IR models.
In the Boolean model, retrieval depends only on whether
a document contains the terms specified in the query;
there is no ranking or grading of any kind. The
vector-space and probabilistic models, by contrast,
assign each document a score. All classical IR models
take Document Representation, Query Representation,
and the Retrieval/Matching function into account in
their modelling.
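A minimal sketch of Boolean retrieval, assuming a tiny hypothetical collection: each term maps to the set of documents that contain it, and AND, OR and NOT become set intersection, union and difference. The result is an unranked set of matching documents.

```python
def build_term_sets(docs):
    """Map each term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

# Hypothetical sample collection.
docs = {1: "xml query language", 2: "sql query language", 3: "xml storage"}
index = build_term_sets(docs)
all_ids = set(docs)

def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())

xml_and_query = postings("xml") & postings("query")   # xml AND query
xml_or_sql = postings("xml") | postings("sql")        # xml OR sql
query_not_xml = postings("query") - postings("xml")   # query AND NOT xml
not_xml = all_ids - postings("xml")                   # NOT xml

print(xml_and_query, xml_or_sql, query_not_xml, not_xml)
```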
Non-Classical IR Model: These differ from classical
models in that they are built upon propositional logic.
Examples of non-classical IR models include the
Information Logic, Situation Theory, and Interaction
models.
Alternative IR Model: These take the principles of the
classical IR models and build on them to create more
functional models, such as the Cluster model, the
Fuzzy Set model (an alternative set-theoretic model),
the Latent Semantic Indexing (LSI) model, and the
Generalized Vector Space Model (an alternative
algebraic model).
Components of Information Retrieval Model
Here are the prerequisites for an IR model:
An automated or manually operated indexing system
that indexes the documents and supports the search
techniques and procedures.
A collection of documents in any one of the following
formats: text, image or multimedia.
A set of queries that serve as the input to the system,
from a human or a machine.
An evaluation metric to measure the system’s
effectiveness (for instance, precision and recall), that
is, to check how useful the information displayed to the
user is.
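Precision and recall can be computed as in the sketch below. Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgements.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

In this example two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).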
XML Databases:
An XML database is used to store large amounts of
information in XML format. As the use of XML grows in
every field, a secure and reliable place to store XML
documents is needed. The data stored in the database
can be queried using XQuery, serialized, and exported
into a desired format.
XML Database Types
There are two major types of XML databases:
XML-enabled
Native XML (NXD)
XML-Enabled Database
An XML-enabled database is a relational database
extended with support for converting XML documents to
and from its own format. Data is stored in tables
consisting of rows and columns. Each table contains a
set of records, which in turn consist of fields.
Native XML Database
A native XML database is based on containers rather
than on a table format. It can store large amounts of
XML documents and data. A native XML database is
queried using XPath expressions.
A native XML database has an advantage over an
XML-enabled database: it is better able to store, query
and maintain XML documents.
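As a small illustration of querying XML with path expressions, here is a sketch using Python's standard xml.etree.ElementTree module. It parses the document in memory and supports only a limited subset of XPath, so it stands in for the idea rather than for a native XML database. The contact document mirrors the example given in this section.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<?xml version="1.0"?>
<contact-info>
  <contact1>
    <name>Tanmay Patil</name>
    <company>Tegapoint</company>
    <phone>(011) 123-4567</phone>
  </contact1>
  <contact2>
    <name>Manisha Patil</name>
    <company>Tegapoint</company>
    <phone>(011) 789-4567</phone>
  </contact2>
</contact-info>"""

root = ET.fromstring(XML_TEXT)

# Path expression: the <name> child of every child of the root element.
names = [elem.text for elem in root.findall("./*/name")]

# Path with a predicate: the phone of the contact whose name matches.
phone = root.findtext("./*[name='Manisha Patil']/phone")

print(names)
print(phone)
```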
Example
The following example shows an XML document of the
kind stored in an XML database:
<?xml version = "1.0"?>
<contact-info>
  <contact1>
    <name>Tanmay Patil</name>
    <company>Tegapoint</company>
    <phone>(011) 123-4567</phone>
  </contact1>
  <contact2>
    <name>Manisha Patil</name>
    <company>Tegapoint</company>
    <phone>(011) 789-4567</phone>
  </contact2>
</contact-info>
Indexing for Text Search
To implement full-text search in a SQL database, you
must create a full-text index on each column you want
to search.
In MySQL, this is done with the FULLTEXT keyword. You
can then query the database using MATCH and AGAINST.
Full-text search refers to searching for some text inside
extensive text data stored electronically and returning
results that contain some or all of the words from the query.
In contrast, a traditional search returns only exact matches.
While traditional databases are great for storing and
retrieving general data, performing full-text searches in
them has been challenging. Frequently, additional tooling
is required to achieve this.
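The difference between an exact match and a full-text match can be sketched in plain Python as follows. This is only an in-memory illustration of the matching behaviour, not how MySQL implements FULLTEXT; the documents and function names are assumptions.

```python
def tokenize(text):
    """Split a text into a set of lowercase words."""
    return set(text.lower().split())

def exact_match(query, docs):
    """Traditional search: the stored value must equal the query."""
    return [doc_id for doc_id, text in docs.items() if text == query]

def full_text_match(query, docs, require_all=False):
    """Full-text search: match documents containing some (or all) query words."""
    q_terms = tokenize(query)
    if not q_terms:
        return []
    results = []
    for doc_id, text in docs.items():
        d_terms = tokenize(text)
        matched = q_terms <= d_terms if require_all else bool(q_terms & d_terms)
        if matched:
            results.append(doc_id)
    return results

# Hypothetical sample rows.
docs = {1: "native xml database", 2: "xml query processing", 3: "web crawler"}

print(exact_match("xml database", docs))
print(full_text_match("xml database", docs))
print(full_text_match("xml database", docs, require_all=True))
```

The exact match finds nothing because no row equals the query string, while the full-text match returns rows 1 and 2 for "some words" and only row 1 for "all words".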
Indexing in Databases
Indexing is a way to optimize the performance of a database
by minimizing the number of disk accesses required when a
query is processed. It is a data structure technique which is
used to quickly locate and access the data in a database.
An index is built on one or more columns of a table, and
each index entry has two parts.
The first is the search key, which contains a copy of the
primary key or candidate key of the table. These values are
stored in sorted order so that the corresponding data can be
accessed quickly.
Note: The data itself may or may not be stored in sorted order.
The second is the data reference or pointer, which
contains a set of pointers holding the address of the disk
block where that particular key value can be found.
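The two-part index entry can be sketched as two parallel lists: sorted search keys and the block number each key points to. The "disk blocks" here are a hypothetical in-memory dictionary, and binary search over the sorted keys stands in for the reduced number of disk accesses.

```python
import bisect

# Hypothetical data file: block number -> records (key, value), unsorted.
blocks = {
    0: [(42, "Asha"), (7, "Ravi")],
    1: [(19, "Meena"), (88, "Karan")],
}

def build_index(blocks):
    """Build sorted search keys and, for each key, a pointer to its block."""
    entries = sorted((key, block_no)
                     for block_no, records in blocks.items()
                     for key, _ in records)
    keys = [key for key, _ in entries]
    pointers = [block_no for _, block_no in entries]
    return keys, pointers

def lookup(key, keys, pointers, blocks):
    """Binary-search the index, then read only the block it points to."""
    pos = bisect.bisect_left(keys, key)
    if pos == len(keys) or keys[pos] != key:
        return None                      # key is not in the index
    for k, value in blocks[pointers[pos]]:
        if k == key:
            return value
    return None

keys, pointers = build_index(blocks)
print(keys, pointers)
print(lookup(19, keys, pointers, blocks))
```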
Web Search Engines
A search engine maintains a huge database of internet
resources such as web pages, newsgroups, programs, images,
etc. It helps to locate information on the World Wide Web.
A user can search for any information by passing a query in
the form of keywords or a phrase. The search engine then
searches for relevant information in its database and
returns it to the user.
Search Engine Components
Generally there are three basic components of a search
engine as listed below:
Web Crawler
Database
Search Interfaces
Web crawler
It is also known as a spider or bot. It is a software
component that traverses the web to gather
information.
Database
All the information gathered from the web is stored in
the database. It holds a huge number of web resources.
Search Interfaces
This component is an interface between the user and
the database. It helps the user to search through the
database.
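How the crawler feeds the database can be sketched as a breadth-first traversal. To stay self-contained, the sketch crawls a hypothetical in-memory "web" (a dictionary of pages and their links) instead of making network requests; the URLs and the crawl function are assumptions for illustration.

```python
from collections import deque

# Hypothetical in-memory web: URL -> (page text, outgoing links).
WEB = {
    "a.example/home": ("welcome page", ["a.example/docs", "b.example/news"]),
    "a.example/docs": ("xml docs", ["a.example/home"]),
    "b.example/news": ("ir news", ["c.example/missing"]),
}

def crawl(seed, fetch, max_pages=100):
    """Visit pages breadth-first from the seed and store their text."""
    store = {}                  # plays the role of the search engine database
    frontier = deque([seed])    # URLs waiting to be visited
    seen = {seed}               # URLs already queued, to avoid revisiting
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:        # page could not be fetched
            continue
        text, links = page
        store[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store

pages = crawl("a.example/home", WEB.get)
print(list(pages))
```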
Search Engine Working
The web crawler, the database and the search interface
are the major components that make a search engine
work. Search engines use the Boolean operators AND, OR
and NOT to restrict or widen the results of a search.
The search engine performs the following steps:
The search engine looks for the keyword in the index
of its pre-built database instead of going directly to
the web to search for the keyword.
The database it searches was filled in advance by the
web crawler, the software component that traverses the
web and gathers pages.
Once matching pages are found in the database, the
search engine shows the relevant web pages as a result.
These retrieved web pages generally include the title of
the page, the size of the text portion, the first several
sentences, etc.
The user can click on any of the search results to open it.
Architecture
The search engine architecture comprises the three
basic layers listed below:
Content collection and refinement
Search core
User and application interfaces
Search Engine Processing
Indexing Process
The indexing process comprises the following three
tasks:
Text acquisition
Text transformation
Index creation
Text acquisition
It identifies and stores documents for indexing.
Text transformation
It transforms documents into index terms or features.
Index creation
It takes the index terms created by text transformation
and creates data structures to support fast searching.
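The three indexing tasks can be sketched as three small functions. The stop-word list, the punctuation handling and the choice of a term-to-frequency inverted index are assumptions made for this example; real engines use richer transformations such as stemming.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to"}  # illustrative list

def acquire(raw_docs):
    """Text acquisition: give each document an ID and store it."""
    return {doc_id: text for doc_id, text in enumerate(raw_docs, start=1)}

def transform(text):
    """Text transformation: lowercase, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def create_index(store):
    """Index creation: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in store.items():
        for term in transform(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

store = acquire([
    "The storage of XML data.",
    "Querying XML data in a database, and XML indexes.",
])
index = create_index(store)
print(index["xml"])
```

Looking up a term in this index returns the matching documents directly, without scanning every document's text.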
Query Process
The query process comprises the following three tasks:
User interaction
Ranking
Evaluation
User interaction
It supports the creation and refinement of the user
query and displays the results.
Ranking
It uses the query and the indexes to create a ranked
list of documents.
Evaluation
It monitors and measures effectiveness and
efficiency. It is done offline.
Examples
The following are some of the search engines available
today: