
Information Retrieval and XML Data: ADBMS Unit-4

This document discusses information retrieval and XML data. It provides an overview of information retrieval systems, how they index and retrieve documents, and different IR models. It also discusses XML databases and how they can store XML documents more efficiently than traditional databases. Key components of search engines like crawlers, databases, and interfaces are explained. Indexing of text for search and examples of popular search engines are also summarized.

Uploaded by

sdesfesf

Information Retrieval and XML Data
ADBMS Unit-4
Topics
Colliding Worlds: Databases, IR & XML
Introduction to Information Retrieval
Indexing for Text Search
Web Search Engines, Managing Text in DBMS
A Data Model for XML: XQuery
Efficient Evaluation of XML Queries
An information retrieval (IR) system is a set of
algorithms that finds the documents most relevant to a
user's query and displays them in order of relevance.
In simple words, it sorts and ranks documents based on
the queries of a user. The query and the document text
are represented in a uniform way so that the two can be
compared.
This uniform representation allows a matching function
to rank each document formally by its Retrieval Status
Value (RSV). The document contents are represented by a
collection of descriptors, known as terms, that belong
to a vocabulary V. An IR system can also gather
feedback on the usefulness of the displayed results by
tracking the user's behaviour.
Types of Information Retrieval Model
An information retrieval model comprises the following
four key elements:
D − Document Representation.
Q − Query Representation.
F − A framework to match and establish a relationship
between D and Q.
R (q, di) − A ranking function that determines the
similarity between the query and the document to
display relevant information.
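The four elements above can be sketched in a few lines. This is a minimal illustration, assuming a bag-of-words representation for D and Q and cosine similarity as the ranking function R(q, di). The names to_vector and rank_score and the sample documents are hypothetical, and cosine similarity is only one of many possible choices of R.

```python
import math
from collections import Counter

def to_vector(text):
    """D and Q: represent a text as a bag of lowercase terms (term -> count)."""
    return Counter(text.lower().split())

def rank_score(query_vec, doc_vec):
    """R(q, di): cosine similarity between the query and a document (an assumed choice)."""
    dot = sum(query_vec[t] * doc_vec[t] for t in query_vec)
    norm_q = math.sqrt(sum(c * c for c in query_vec.values()))
    norm_d = math.sqrt(sum(c * c for c in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical document collection.
documents = {
    "d1": "xml databases store xml documents",
    "d2": "search engines rank web documents",
    "d3": "relational databases store rows",
}

query = to_vector("xml documents")

# F: the matching framework here is simply "score every document, then sort".
ranked = sorted(
    documents,
    key=lambda d: rank_score(query, to_vector(documents[d])),
    reverse=True,
)
print(ranked)
```

Documents that share more terms with the query receive a higher score and appear earlier in the ranked list.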
There are three types of Information Retrieval
(IR) models:
Classical IR Model: It is built on basic mathematical
concepts and is the most widely used type of IR model.
Classical IR models are easy to implement. Examples
include the Boolean, Vector-space and Probabilistic
IR models.
In the simplest of these, the Boolean model, a document
is retrieved only if it contains the terms specified by
the query, and the results are not ranked or graded.
The classical IR models take Document Representation,
Query Representation, and the Retrieval/Matching
function into account in their modelling.
Non-Classical IR Model: These differ from classical
models in that they are built upon propositional logic.
Examples of non-classical IR models include
Information Logic, Situation Theory, and Interaction
models.
Alternative IR Model: These take the principles of the
classical IR models and enhance them to create more
functional models, such as the Cluster model,
alternative set-theoretic models (for example, the
Fuzzy Set model), the Latent Semantic Indexing (LSI)
model, and alternative algebraic models (for example,
the Generalized Vector Space Model).
Components of Information Retrieval Model
Here are the prerequisites for an IR model:
An automated or manually operated indexing system that
implements the indexing and search techniques and
procedures.
A collection of documents in any one of the following
formats: text, image or multimedia.
A set of queries that serve as the input to the system,
from a human or a machine.
An evaluation metric to measure the system's
effectiveness (for instance, precision and recall),
that is, how useful the information displayed to the
user is.
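Precision and recall, mentioned above, can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below is illustrative only; the function name precision_recall and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the system returned four documents, three are truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant, so precision is 2/4, and two of the three relevant documents were found, so recall is 2/3.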
XML Databases:
An XML database is used to store large amounts of
information in XML format. As the use of XML grows in
every field, a secure place is needed to store XML
documents. The data stored in the database can be
queried using XQuery, serialized, and exported into a
desired format.
XML Database Types
There are two major types of XML databases −
XML-enabled
Native XML (NXD)
XML-Enabled Database
An XML-enabled database is a relational database with
an extension for converting XML documents. Data is
stored in tables consisting of rows and columns. The
tables contain sets of records, which in turn consist
of fields.
Native XML Database
A native XML database is based on containers rather
than tables. It can store large amounts of XML
documents and data. A native XML database is queried
using XPath expressions.
A native XML database has an advantage over an
XML-enabled database: it is better suited to storing,
querying and maintaining XML documents.
Example
The following example shows an XML document of the kind
stored in an XML database:
<?xml version="1.0"?>
<contact-info>
  <contact1>
    <name>Tanmay Patil</name>
    <company>Tegapoint</company>
    <phone>(011) 123-4567</phone>
  </contact1>
  <contact2>
    <name>Manisha Patil</name>
    <company>Tegapoint</company>
    <phone>(011) 789-4567</phone>
  </contact2>
</contact-info>
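As a rough illustration of path-based querying over such a document, the sketch below loads the same contact data with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. It is not a native XML database; it only shows what path expressions over this document look like.

```python
import xml.etree.ElementTree as ET

# The contact document from the example above, inlined so the sketch is self-contained.
CONTACTS_XML = """<?xml version="1.0"?>
<contact-info>
  <contact1>
    <name>Tanmay Patil</name>
    <company>Tegapoint</company>
    <phone>(011) 123-4567</phone>
  </contact1>
  <contact2>
    <name>Manisha Patil</name>
    <company>Tegapoint</company>
    <phone>(011) 789-4567</phone>
  </contact2>
</contact-info>"""

root = ET.fromstring(CONTACTS_XML)

# Path expression: the <name> child of every child of the root element.
names = [el.text for el in root.findall("./*/name")]

# Path expression with a predicate: phones of contacts whose company is Tegapoint.
phones = [el.text for el in root.findall("./*[company='Tegapoint']/phone")]

print(names)
print(phones)
```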
Indexing for Text Search
To implement full-text search in a SQL database, you
must create a full-text index on each column you want
to search.
In MySQL, this is done with the FULLTEXT keyword. You
can then query the database using MATCH and AGAINST.
Full-text search refers to searching for some text inside
extensive text data stored electronically and returning
results that contain some or all of the words from the query.
In contrast, a traditional search returns only exact matches.
While traditional databases are great for storing and
retrieving general data, performing full-text searches has
been challenging. Frequently, additional tooling is required
to achieve this.
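The idea behind a full-text index can be sketched with an inverted index, a map from each word to the documents that contain it. This is a simplified illustration in plain Python, not how MySQL implements FULLTEXT; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Inverted index: word -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing some or all query words, best match first."""
    matches = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            matches[doc_id] += 1
    # More matched words first; ties broken by document id.
    return sorted(matches, key=lambda d: (-matches[d], d))

# Hypothetical documents.
docs = {
    1: "Full-text search in MySQL",
    2: "Indexing text for search engines",
    3: "Relational tables and rows",
}
index = build_index(docs)
results = search(index, "text search index")
print(results)
```

Note that the query word "index" does not match "indexing" here, because this sketch does no stemming; real full-text engines usually normalize words further.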
Indexing in Databases
Indexing is a way to optimize the performance of a database
by minimizing the number of disk accesses required when a
query is processed. It is a data structure technique which is
used to quickly locate and access the data in a database.
An index is built from a few database columns, and each
index entry has two parts.
The first part is the search key, which contains a copy of the
primary key or candidate key of the table. These values are
stored in sorted order so that the corresponding data can be
accessed quickly.
Note: The data itself may or may not be stored in sorted order.
The second part is the data reference or pointer, which
holds the address of the disk block where that particular
key value can be found.
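The two-part index structure described above can be sketched as a sorted list of (search key, block address) pairs searched with binary search. The "disk blocks" below are a hypothetical in-memory stand-in, and the names are illustrative only.

```python
import bisect

# Hypothetical "disk blocks": block address -> records (key, value) stored in that block.
# The data is deliberately not stored in key order.
blocks = {
    0: [(17, "Asha"), (42, "Ravi")],
    1: [(5, "Meena"), (63, "Kiran")],
}

# Index entries: (search key, block address), kept sorted by search key.
index = sorted((key, addr) for addr, records in blocks.items() for key, _ in records)
keys = [key for key, _ in index]

def lookup(key):
    """Binary-search the sorted keys, then read only the block the pointer names."""
    pos = bisect.bisect_left(keys, key)
    if pos == len(keys) or keys[pos] != key:
        return None  # key is not in the index
    addr = index[pos][1]
    for k, value in blocks[addr]:
        if k == key:
            return value
    return None

print(lookup(63))
print(lookup(7))
```

Only one block is read per lookup, which is the point of the index: the sorted keys locate the block without scanning all the data.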
Web Search Engines
A search engine refers to a huge database of internet
resources such as web pages, newsgroups, programs and images.
It helps to locate information on the World Wide Web.
A user can search for any information by passing a query in
the form of keywords or a phrase. The search engine then
looks for relevant information in its database and returns
it to the user.
Search Engine Components
Generally there are three basic components of a search
engine as listed below:
Web Crawler
Database
Search Interfaces
Web crawler
It is also known as a spider or bot. It is a software
component that traverses the web to gather
information.
Database
The information gathered from the web is stored in the
database. It holds a huge collection of web resources.
Search Interfaces
This component is an interface between the user and the
database. It helps the user to search through the
database.
Search Engine Working
The web crawler, database and search interface are the
major components that make a search engine work.
Search engines use the Boolean operators AND, OR and
NOT to restrict and widen the results of a search. The
search engine performs the following steps:
The search engine looks for the keyword in the index
of its predefined database instead of going directly to
the web to search for the keyword.
It then uses software to search for the information in
the database. The database itself was filled
beforehand by the web crawler.
Once the matching pages are found, the search engine
shows the relevant web pages as results. These
results generally include the title of the page, the
size of the text portion, the first several sentences, etc.
The user can click on any of the search results to open it.
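The Boolean operators mentioned above map naturally onto set operations over an index of terms. The sketch below is a toy illustration with hypothetical pages; AND narrows the result set, OR widens it, and NOT excludes pages.

```python
# Hypothetical pages already gathered into the engine's database.
pages = {
    "p1": "xml query language",
    "p2": "sql query language",
    "p3": "xml data model",
}

# Index: term -> set of ids of the pages that contain the term.
index = {}
for page_id, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(page_id)

all_pages = set(pages)

def term(word):
    """Pages containing the word (empty set if the word is not indexed)."""
    return index.get(word, set())

xml_and_query = term("xml") & term("query")   # AND restricts the results
xml_or_sql = term("xml") | term("sql")        # OR widens the results
query_not_sql = term("query") - term("sql")   # "query AND NOT sql"
not_xml = all_pages - term("xml")             # NOT alone: every page without the term

print(xml_and_query, xml_or_sql, query_not_sql, not_xml)
```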
Architecture
The search engine architecture comprises the three
basic layers listed below:
Content collection and refinement.
Search core
User and application interfaces
Search Engine Processing
Indexing Process
The indexing process comprises the following three
tasks:
Text acquisition
Text transformation
Index creation
Text acquisition
It identifies and stores documents for indexing.
Text Transformation
It transforms documents into index terms or features.
Index Creation
It takes the index terms created by text transformation
and creates data structures to support fast searching.
Query Process
Query process comprises of the following three tasks:
User interaction
Ranking
Evaluation
User interaction
It supports creation and refinement of the user query and
displays the results.
Ranking
It uses the query and indexes to create a ranked list of
documents.
Evaluation
It monitors and measures the effectiveness and
efficiency. It is done offline.
Examples
The following are some of the search engines available
today:

Search Engine | Description
--------------|------------
Google | It was originally called BackRub. It is the most popular search engine globally.
Bing | It was launched in 2009 by Microsoft. It is the latest web-based search engine that also delivers Yahoo's results.
Ask | It was launched in 1996 and was originally known as Ask Jeeves. It includes support for match, dictionary, and conversation questions.
AltaVista | It was launched by Digital Equipment Corporation in 1995. Since 2003, it has been powered by Yahoo technology.
AOL.Search | It is powered by Google.
LYCOS | It is a top-5 internet portal and the 13th largest online property according to Media Metrix.
Alexa | It is a subsidiary of Amazon and is used for providing website traffic information.
A data model for XML:
The data model for XML is very simple - or very abstract,
depending on one's point of view. XML provides no more
than a baseline on which more complex models can be
built. All those more restricted applications will share
some common invariants, however, and it is those that
are given below.
Think of an XML document as a linearization of a tree
structure. At every node in the tree there are several
character strings. The tree structure and the character
strings together form the information content of an XML
document. Almost everything will follow naturally from
that. Some of the characters in the document are only
there to support the linearization, others are part of the
information content.
A tree and a graph overlaid
The main structure of an XML document is tree-like,
and most of the lexical structure is devoted to defining
that tree, but there is also a way to make connections
between arbitrary nodes in a tree.
<p>
<q id="x7">The first q</q>
<q id="x8">The second q</q>
<q href="#x7">The third q</q>
</p>
[Figure: the tree corresponding to this document. The p
element has three q children, and the third q is linked
to the first q because its href="#x7" refers to id="x7".]
The tree that an XML document represents has a
number of different types of nodes:
element
document
processing instruction [not needed?]
comment
data
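The "tree plus overlaid graph" idea can be sketched with Python's standard xml.etree.ElementTree module. The sketch parses the small p/q document shown earlier, lists the parent-child edges of the tree, and then resolves the href reference to the element carrying the matching id. Treating href="#..." as a pointer to an id is an assumption made for this illustration; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<p>
  <q id="x7">The first q</q>
  <q id="x8">The second q</q>
  <q href="#x7">The third q</q>
</p>"""

root = ET.fromstring(DOC)

# Tree structure: one (parent tag, child tag) edge per child element.
tree_edges = [(root.tag, child.tag) for child in root]

# Graph overlay: map each id to its element, then follow href="#id" references.
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
links = []
for el in root.iter():
    href = el.get("href")
    if href and href.startswith("#"):
        target = by_id.get(href[1:])
        if target is not None:
            links.append((el.text, target.text))

print(tree_edges)
print(links)
```

The tree edges come from element nesting alone, while the link between the third and the first q exists only through the attribute values.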
XQuery:
XQuery is to XML what SQL is to relational databases.
XQuery is designed to query XML data.
XQuery is built on XPath expressions
XQuery is supported by all major databases
XQuery is a W3C Recommendation
XQuery is a language for finding and extracting elements
and attributes from XML documents.
Here is an example of what XQuery could solve:
"Select all CD records with a price less than $10 from the
CD collection stored in cd_catalog.xml"
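In XQuery, a request like this is typically written as a path expression with a predicate, along the lines of doc("cd_catalog.xml")/catalog/cd[price<10]. The element names catalog, cd and price are assumptions about the file's structure. The same selection can be sketched in Python, using a small hypothetical stand-in for cd_catalog.xml:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for cd_catalog.xml; titles and prices are made up.
CD_CATALOG = """<catalog>
  <cd><title>Album A</title><price>9.90</price></cd>
  <cd><title>Album B</title><price>10.90</price></cd>
  <cd><title>Album C</title><price>7.50</price></cd>
</catalog>"""

catalog = ET.fromstring(CD_CATALOG)

# Keep every <cd> whose <price> is below 10, and report its title.
cheap_titles = [
    cd.findtext("title")
    for cd in catalog.findall("cd")
    if float(cd.findtext("price")) < 10
]
print(cheap_titles)
```

ElementTree's XPath subset has no numeric comparisons, so the price test is done in Python rather than in the path expression.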
XQuery can be used to:
Extract information to use in a Web Service
Generate summary reports
Transform XML data to XHTML
Search Web documents for relevant information

Another simple XQuery example (sample XML data to query):


<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
Evaluation of XQuery:
The following describes a dataflow processor (the
wording matches Apache NiFi's EvaluateXQuery
processor) that evaluates one or more XQueries against
the content of a FlowFile. The results of those
XQueries are assigned to FlowFile Attributes or are
written to the content of the FlowFile itself,
depending on the configuration of the Processor.
XQueries are entered by adding user-defined
properties; the name of the property maps to the
Attribute Name into which the result will be placed (if
the Destination is 'flowfile-attribute'; otherwise, the
property name is ignored). The value of the property
must be a valid XQuery.
If the XQuery returns more than one result, new
attributes or FlowFiles (for Destinations of 'flowfile-
attribute' or 'flowfile-content' respectively) will be
created for each result (attributes will have a '.n' one-
up number appended to the specified attribute name).
If any provided XQuery returns a result, the
FlowFile(s) will be routed to 'matched'. If no provided
XQuery returns a result, the FlowFile will be routed to
'unmatched'. If the Destination is 'flowfile-attribute'
and the XQueries match nothing, no attributes will
be applied to the FlowFile.
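The routing rules described above can be summarized in a small simulation. This is only a sketch of the described behaviour, not the processor's real API: the function route, its arguments, and the assumption that the '.n' suffix starts at 1 are all hypothetical.

```python
def route(results_by_property, destination="flowfile-content"):
    """Simulate the routing rules; results_by_property maps property name -> list of results."""
    attributes = {}   # filled when destination is 'flowfile-attribute'
    contents = []     # one entry per output FlowFile when destination is 'flowfile-content'
    matched = False
    for name, results in results_by_property.items():
        if not results:
            continue
        matched = True
        if destination == "flowfile-attribute":
            if len(results) == 1:
                attributes[name] = results[0]
            else:
                # Several results: append a one-up number (assumed to start at 1).
                for n, value in enumerate(results, start=1):
                    attributes[f"{name}.{n}"] = value
        else:
            # Property names are ignored for 'flowfile-content'.
            contents.extend(results)
    relationship = "matched" if matched else "unmatched"
    return relationship, attributes, contents

print(route({"title": ["Harry Potter"], "price": []}, "flowfile-attribute"))
print(route({"author": ["A", "B"]}, "flowfile-attribute"))
print(route({"title": []}, "flowfile-attribute"))
```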
In the list below, the names of required properties
appear in bold. Any other properties (not in bold) are
considered optional. The table also indicates any
default values.
Name | Default Value | Allowable Values | Description
-----|---------------|------------------|------------
Destination | flowfile-content | flowfile-content, flowfile-attribute | Indicates whether the results of the XQuery evaluation are written to the FlowFile content or a FlowFile attribute.
Output: Method | xml | xml, html, text | Identifies the overall method that should be used for outputting a result tree.
Output: Omit XML Declaration | false | | Specifies whether the processor should output an XML declaration when transforming a result tree.
Output: Indent | false | | Specifies whether the processor may add additional whitespace when outputting a result tree.
Validate DTD | true | true, false | Specifies whether or not the XML content should be validated against the DTD.
