Information Retrieval and XML Data: ADBMS Unit-4

This document discusses information retrieval and XML data. It provides an overview of information retrieval systems, how they index and retrieve documents, and different IR models. It also discusses XML databases and how they can store XML documents more efficiently than traditional databases. Key components of search engines like crawlers, databases, and interfaces are explained. Indexing of text for search and examples of popular search engines are also summarized.

Uploaded by

sdesfesf

Information Retrieval and XML Data
ADBMS Unit-4
Topics
Colliding Worlds: Databases, IR & XML
Introduction to Information Retrieval
Indexing for Text Search
Web Search Engines, Managing Text in DBMS
A Data Model for XML: XQuery
Efficient Evaluation of XML Queries
An information retrieval (IR) system is a set of algorithms that finds the documents most relevant to a user's query and presents them in order of relevance.
In simple words, it sorts and ranks documents based on the user's queries. The query and the document text are represented in a uniform way so that they can be compared.
This uniform representation allows a matching function to rank each document formally by its Retrieval Status Value (RSV). Document contents are represented by a collection of descriptors, known as terms, that belong to a vocabulary V. An IR system may also gather feedback on the usefulness of the displayed results by tracking the user’s behaviour.
Types of Information Retrieval Model
An information retrieval model comprises the following four key elements:
D − Document representation.
Q − Query representation.
F − A framework to match and establish a relationship between D and Q.
R(q, di) − A ranking function that scores the similarity between the query q and the document di so that the most relevant information is displayed first.
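The four elements can be illustrated with a minimal Python sketch. It assumes a bag-of-words representation for D and Q and cosine similarity for R(q, di), which is one common vector-space choice and not the only one. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def represent(text):
    """D and Q: represent a text as a bag of lowercase terms with counts."""
    return Counter(text.lower().split())

def rank_score(query_vec, doc_vec):
    """R(q, di): cosine similarity between two term-count vectors (assumed choice)."""
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[t] * doc_vec[t] for t in shared)
    norm_q = math.sqrt(sum(c * c for c in query_vec.values()))
    norm_d = math.sqrt(sum(c * c for c in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

def retrieve(query, documents):
    """F: match the query against every document and sort by score."""
    q = represent(query)
    scored = [(rank_score(q, represent(d)), d) for d in documents]
    scored = [(s, d) for s, d in scored if s > 0]  # drop documents with no shared term
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical sample collection.
docs = [
    "xml databases store xml documents",
    "search engines index web pages",
    "relational databases store rows",
]
results = retrieve("xml databases", docs)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Documents that share no term with the query receive a score of zero and are dropped; the rest are listed from most to least similar.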
There are three types of Information Retrieval (IR) models:
Classical IR Model: It is built on basic mathematical concepts and is the most widely used kind of IR model. Classical models are easy to implement. Examples include the Boolean, Vector-space and Probabilistic IR models. In the Boolean model, a document is retrieved only if it satisfies the terms of the query, and there is no ranking or grading of any kind. All classical IR models take document representation, query representation, and a retrieval/matching function into account in their modelling.
Non-Classical IR Model: These differ from classical models in that they are built upon propositional logic. Examples of non-classical IR models include Information Logic, Situation Theory, and Interaction models.
Alternative IR Model: These take the principles of the classical IR models and enhance them to create more functional models, such as the Cluster model, the Fuzzy Set model (an alternative set-theoretic model), the Latent Semantic Indexing (LSI) model, and the Generalized Vector Space Model (an alternative algebraic model).
Components of Information Retrieval Model
Here are the prerequisites for an IR model:
An automated or manually operated indexing system, together with the techniques and procedures used to index and search.
A collection of documents in any one of the following formats: text, image or multimedia.
A set of queries that serve as the input to the system, issued by a human or a machine.
An evaluation metric (for instance, precision and recall) to measure the system’s effectiveness, that is, how useful the information displayed to the user is.
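Precision and recall, the evaluation metrics named above, can be computed as in the following sketch. The document identifiers and the relevance judgments are hypothetical sample data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the system returned d1..d4, while d1, d3 and d5 are relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.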
XML Databases:
An XML database is used to store large amounts of information in XML format. As the use of XML grows in every field, a secure place to store XML documents is required. The data stored in the database can be queried using XQuery, serialized, and exported into a desired format.
XML Database Types
There are two major types of XML databases −
XML-enabled
Native XML (NXD)
XML-Enabled Database
An XML-enabled database is a relational database with an extension for converting XML documents. Data is stored in tables consisting of rows and columns. The tables contain sets of records, which in turn consist of fields.
Native XML Database
A native XML database is based on containers rather than on a table format. It can store large amounts of XML documents and data. A native XML database is queried with XPath expressions.
A native XML database has an advantage over an XML-enabled database: it is better able to store, query and maintain XML documents.
Example
The following example shows an XML document that could be stored in an XML database −
<?xml version = "1.0"?>
<contact-info>
   <contact1>
      <name>Tanmay Patil</name>
      <company>Tegapoint</company>
      <phone>(011) 123-4567</phone>
   </contact1>
   <contact2>
      <name>Manisha Patil</name>
      <company>Tegapoint</company>
      <phone>(011) 789-4567</phone>
   </contact2>
</contact-info>
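As a rough illustration of querying such a document with path expressions, the sketch below uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. It stands in for a real XML database, which is an assumption made purely for illustration, and the XML declaration is omitted from the embedded string for brevity.

```python
import xml.etree.ElementTree as ET

CONTACTS_XML = """<contact-info>
  <contact1>
    <name>Tanmay Patil</name>
    <company>Tegapoint</company>
    <phone>(011) 123-4567</phone>
  </contact1>
  <contact2>
    <name>Manisha Patil</name>
    <company>Tegapoint</company>
    <phone>(011) 789-4567</phone>
  </contact2>
</contact-info>"""

root = ET.fromstring(CONTACTS_XML)

# "./*/name" selects every <name> element that is a grandchild of the root.
names = [el.text for el in root.findall("./*/name")]

# Build a name -> phone lookup by visiting each contact element in turn.
phones = {contact.findtext("name"): contact.findtext("phone") for contact in root}

print(names)
print(phones["Manisha Patil"])
```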
Indexing for Text Search
To implement full-text search in a SQL database, you must create a full-text index on each column you want to search.
In MySQL, this is done with the FULLTEXT keyword. You can then query the database using MATCH and AGAINST.
Full-text search means searching within large bodies of electronically stored text and returning results that contain some or all of the words from the query. In contrast, a traditional search returns only exact matches.
While traditional databases are great for storing and retrieving general data, performing full-text searches with them has been challenging. Frequently, additional tooling is required to achieve this.
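The difference between exact matching and full-text matching can be sketched in a few lines of Python. This is an in-memory illustration only, not how MySQL implements FULLTEXT. The function names and sample rows are hypothetical.

```python
import re

def tokenize(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def exact_match(rows, query):
    """Traditional search: the stored value must equal the query."""
    return [row for row in rows if row == query]

def full_text_match(rows, query, require_all=False):
    """Full-text search: match rows containing some (or all) query words."""
    query_words = tokenize(query)
    hits = []
    for row in rows:
        words = tokenize(row)
        if require_all:
            matched = query_words <= words       # every query word present
        else:
            matched = bool(query_words & words)  # at least one query word present
        if matched:
            hits.append(row)
    return hits

# Hypothetical column values.
rows = ["Native XML database", "XML-enabled relational database", "Web crawler"]

print(exact_match(rows, "native database"))
print(full_text_match(rows, "native database"))
print(full_text_match(rows, "native database", require_all=True))
```

The exact match finds nothing, the "some words" search finds both database rows, and the "all words" search finds only the first row.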
Indexing in Databases
Indexing is a way to optimize the performance of a database by minimizing the number of disk accesses required when a query is processed. An index is a data structure used to quickly locate and access the data in a database.
Indexes are created using a few database columns.
The first column is the search key, which contains a copy of the primary key or candidate key of the table. These values are stored in sorted order so that the corresponding data can be accessed quickly.
Note: The data itself may or may not be stored in sorted order.
The second column is the data reference or pointer, which contains a set of pointers holding the address of the disk block where that particular key value can be found.
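The two-column structure described above can be sketched as a sorted list of (search key, block address) pairs searched with binary search. The block layout, keys and names below are hypothetical sample data, and a dictionary stands in for disk blocks.

```python
import bisect

# Simulated disk blocks: block address -> rows stored there (not sorted by key).
blocks = {
    0: [(42, "Manisha"), (7, "Tanmay")],
    1: [(19, "Asha"), (3, "Ravi")],
}

# The index: (search key, pointer to block) pairs kept in sorted key order.
index = sorted((key, addr) for addr, rows in blocks.items() for key, _ in rows)
keys = [key for key, _ in index]

def lookup(key):
    """Binary-search the index, then read only the block the pointer names."""
    pos = bisect.bisect_left(keys, key)
    if pos == len(keys) or keys[pos] != key:
        return None  # key is not in the index
    _, addr = index[pos]
    for row_key, value in blocks[addr]:
        if row_key == key:
            return value
    return None

print(lookup(19))
print(lookup(5))
```

Only the index is kept sorted. The rows inside each block stay in arbitrary order, which matches the note that the data may or may not be stored sorted.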
Web Search Engines
A search engine refers to a huge database of internet resources such as web pages, newsgroups, programs, images etc. It helps to locate information on the World Wide Web.
A user can search for any information by passing a query in the form of keywords or a phrase. The search engine then searches for relevant information in its database and returns it to the user.
Search Engine Components
Generally there are three basic components of a search engine, as listed below:
Web Crawler
Database
Search Interfaces
Web crawler
It is also known as a spider or bot. It is a software component that traverses the web to gather information.
Database
The information gathered from the web is stored in a database, which holds a huge collection of web resources.
Search Interfaces
This component is an interface between the user and the database. It helps the user to search through the database.
Search Engine Working
The web crawler, the database and the search interface are the major components that make a search engine work. Search engines use the Boolean operators AND, OR and NOT to restrict and widen the results of a search. The following steps are performed by the search engine:
The search engine looks for the keyword in the index of its predefined database instead of going directly to the web to search for the keyword.
The database it searches was filled in advance by a software component known as the web crawler, which traverses the web to gather pages.
Once the matching pages are found, the search engine shows the relevant web pages as a result. The retrieved results generally include the title of the page, the size of the text portion, the first several sentences etc.
The user can click on any of the search results to open it.
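How AND, OR and NOT restrict or widen a result set can be sketched with Python set operations over a small keyword index. The index contents and page numbers are hypothetical. A real engine would build the index from crawled pages.

```python
# Hypothetical prebuilt index: keyword -> set of page ids containing it.
index = {
    "xml": {1, 2},
    "database": {1, 3},
    "crawler": {4},
}
all_pages = {1, 2, 3, 4}

def pages(term):
    """Look the keyword up in the index instead of scanning the web."""
    return index.get(term, set())

and_result = pages("xml") & pages("database")  # AND restricts: both keywords
or_result = pages("xml") | pages("database")   # OR widens: either keyword
not_result = all_pages - pages("xml")          # NOT excludes pages with the keyword

print(and_result, or_result, not_result)
```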
Architecture
The search engine architecture comprises the three basic layers listed below:
Content collection and refinement
Search core
User and application interfaces
Search Engine Processing
Indexing Process
The indexing process comprises the following three tasks:
Text acquisition
Text transformation
Index creation
Text Acquisition
It identifies and stores documents for indexing.
Text Transformation
It transforms documents into index terms or features.
Index Creation
It takes the index terms created by text transformation and creates data structures to support fast searching.
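The three indexing tasks can be sketched as a small pipeline that ends in an inverted index, a structure commonly used for fast term lookup. The sample documents, the stopword list and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}

def acquire():
    """Text acquisition: identify and store documents (hypothetical samples)."""
    return {
        1: "The crawler traverses the web.",
        2: "An index supports fast searching of the web.",
    }

def transform(text):
    """Text transformation: turn a document into index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def create_index(docs):
    """Index creation: map each term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in transform(text):
            index[term].add(doc_id)
    return index

inverted = create_index(acquire())
print(sorted(inverted["web"]))
```

Looking up a term is then a single dictionary access, with no scan over the documents themselves.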
Query Process
The query process comprises the following three tasks:
User interaction
Ranking
Evaluation
User Interaction
It supports creation and refinement of the user query and displays the results.
Ranking
It uses the query and the indexes to create a ranked list of documents.
Evaluation
It monitors and measures effectiveness and efficiency. It is done offline.
Examples
The following are several of the search engines available today:

Search Engine | Description
Google | It was originally called BackRub. It is the most popular search engine globally.
Bing | It was launched in 2009 by Microsoft. It is the latest web-based search engine that also delivers Yahoo’s results.
Ask | It was launched in 1996 and was originally known as Ask Jeeves. It includes support for match, dictionary, and conversation question.
AltaVista | It was launched by Digital Equipment Corporation in 1995. Since 2003, it has been powered by Yahoo technology.
AOL.Search | It is powered by Google.
LYCOS | It is a top 5 internet portal and the 13th largest online property according to Media Metrix.
Alexa | It is a subsidiary of Amazon and is used for providing website traffic information.
A Data Model for XML:
The data model for XML is very simple, or very abstract, depending on one's point of view. XML provides no more than a baseline on which more complex models can be built. All those more restricted applications will share some common invariants, however, and it is those that are given below.
Think of an XML document as a linearization of a tree structure. At every node in the tree there are several character strings. The tree structure and the character strings together form the information content of an XML document. Almost everything will follow naturally from that. Some of the characters in the document are only there to support the linearization; others are part of the information content.
A tree and a graph overlaid
The main structure of an XML document is tree-like,
and most of the lexical structure is devoted to defining
that tree, but there is also a way to make connections
between arbitrary nodes in a tree.
<p>
  <q id="x7">The first q</q>
  <q id="x8">The second q</q>
  <q href="#x7">The third q</q>
</p>
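In the tree corresponding to this document, the p element is the root and the three q elements are its children. The href="#x7" attribute of the third q refers to the id of the first q, which overlays a link between two nodes on top of the tree.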
The tree that an XML document represents has a
number of different types of nodes:
element
document
processing instruction [not needed?]
comment
data
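The tree-plus-graph idea can be sketched with Python's standard xml.etree.ElementTree module: parsing yields the tree, and resolving each href against the id attributes yields the overlaid links. Treating href="#x7" as a reference to id="x7" is an interpretation assumed here. The parser itself attaches no meaning to these attributes.

```python
import xml.etree.ElementTree as ET

DOC = """<p>
  <q id="x7">The first q</q>
  <q id="x8">The second q</q>
  <q href="#x7">The third q</q>
</p>"""

root = ET.fromstring(DOC)

# Tree edges: parent tag -> text of each child element.
tree_edges = [(root.tag, child.text) for child in root]

# Map every id attribute to the element that carries it.
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}

# Graph edges: resolve each "#id" reference to its target element.
links = []
for el in root.iter():
    href = el.get("href")
    if href and href.startswith("#"):
        target = by_id.get(href[1:])
        if target is not None:
            links.append((el.text, target.text))

print(tree_edges)
print(links)
```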
XQUERY:
XQuery is to XML what SQL is to databases.
XQuery is the language for querying XML data.
XQuery is built on XPath expressions.
XQuery is supported by many major database systems.
XQuery is a W3C Recommendation.
XQuery is a language for finding and extracting elements and attributes from XML documents.
Here is an example of a question XQuery could answer:
"Select all CD records with a price less than $10 from the CD collection stored in cd_catalog.xml"
XQuery can be used to:
Extract information to use in a Web Service
Generate summary reports
Transform XML data to XHTML
Search Web documents for relevant information

A Simple XML Document to Query with XQuery

<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
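The kind of selection XQuery performs on such data can be approximated in Python. This sketch is not XQuery: it uses the standard xml.etree.ElementTree module and filters by price in ordinary code. The enclosing bookstore element and the second book are hypothetical additions so that the filter has something to exclude.

```python
import xml.etree.ElementTree as ET

# The <bookstore> wrapper and the second <book> are hypothetical additions.
BOOKSTORE = """<bookstore>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="EXAMPLE">
    <title lang="en">Hypothetical Title</title>
    <price>49.99</price>
  </book>
</bookstore>"""

def titles_cheaper_than(xml_text, limit):
    """Return titles of books whose price is below limit.

    Roughly corresponds to an XQuery expression such as:
      for $b in /bookstore/book where $b/price < 30 return $b/title
    """
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")
        for book in root.findall("book")
        if float(book.findtext("price")) < limit
    ]

cheap = titles_cheaper_than(BOOKSTORE, 30)
print(cheap)
```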
Evaluation of XQuery:
The processor evaluates one or more XQueries against the content of a FlowFile. The results of those XQueries are assigned to FlowFile attributes or are written to the content of the FlowFile itself, depending on the configuration of the processor.
XQueries are entered by adding user-defined properties; the name of the property maps to the attribute name into which the result will be placed (if the Destination is 'flowfile-attribute'; otherwise, the property name is ignored). The value of the property must be a valid XQuery.
If the XQuery returns more than one result, new attributes or FlowFiles (for Destinations of 'flowfile-attribute' or 'flowfile-content' respectively) will be created for each result (attributes will have a '.n' one-up number appended to the specified attribute name).
If any provided XQuery returns a result, the FlowFile(s) will be routed to 'matched'. If no provided XQuery returns a result, the FlowFile will be routed to 'unmatched'. If the Destination is 'flowfile-attribute' and the XQueries match nothing, no attributes will be applied to the FlowFile.
In the list below, the names of required properties
appear in bold. Any other properties (not in bold) are
considered optional. The table also indicates any
default values.
Name | Default Value | Allowable Values | Description
Destination | flowfile-content | flowfile-content, flowfile-attribute | Indicates whether the results of the XQuery evaluation are written to the FlowFile content or a FlowFile attribute.
Output: Method | xml | xml, html, text | Identifies the overall method that should be used for outputting a result tree.
Output: Omit XML Declaration | false | (not listed) | Specifies whether the processor should output an XML declaration when transforming a result tree.
Output: Indent | false | (not listed) | Specifies whether the processor may add additional whitespace when outputting a result tree.
Validate DTD | true | true, false | Specifies whether or not the XML content should be validated against the DTD.
