Semantic Web
(CS1145)
Department Elective (Final Year)
Department of Computer Science & Engineering
UNIT I
THE BASICS OF SEMANTIC WEB
From Traditional Web to Semantic Web
What is WWW?
World Wide Web/Internet
How are we using the Internet? (Usage)
Search: e.g., a search for "SOAP" (the protocol) also returns pages about soaps.
Integration: e.g., restaurant selection; Composition: e.g., airline ticketing.
Web data mining: e.g., a crawler (agent) that reads the take-off rate from the air
traffic control tower of Atlanta international airport, or one that collects
stock information (prices) every 10 minutes.
The Internet is, in effect, a huge distributed database.
Drawbacks of Internet usage
Web search
Web servers: integration
Web data mining
The Internet is constructed in such a way that its documents
only contain enough information for the computers to present
them, not to understand them.
Semantic Web
The Semantic Web is an extension of the current Web in which
information is given well-defined meaning, better enabling
computers and people to work in cooperation. . . . a web of data
that can be processed directly and indirectly by machines.
Tim Berners-Lee, James Hendler, Ora Lassila
... the idea of having data on the Web defined and linked in a way
that it can be used by machines not just for display purposes, but
for automation, integration, and reuse of data across various
applications.
W3C Semantic Web Activity
Summarizing the Semantic Web in general
(machine-readable view)
The current Web is made up of many Web documents
(pages).
Any given Web document, in its current form (HTML tags
and natural text), only gives the machine instructions about
how to present information in a browser for human eyes.
Therefore, machines have no idea about the meaning of the
document they are presenting; in fact, every single document
on the Web looks exactly the same to machines.
Machines have no way to understand the documents and
cannot make any intelligent decisions about these documents.
Developers cannot process the documents on a global scale
(and search engines will never deliver satisfactory
performance).
One possible solution is to modify the Web documents, and
one such modification is to add some extra data to these
documents; the purpose of this extra information is to enable
the computers to understand the meaning of these documents.
Assuming that this modification is feasible, we can then
construct tools and agents running on this new Web to process
the document on a global scale; and this new Web is now
called the Semantic Web.
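To make this idea concrete, here is a minimal sketch (the page, the added statements, and the property names are all invented for illustration, not part of any standard): a program sees a plain HTML page only as markup plus opaque text, but once a few machine-readable statements are attached to the page, even a very simple agent can start answering questions about it.

```python
# Hypothetical illustration: the same page content with and without extra
# machine-readable data. Nothing here is part of any standard.

plain_page = "<html><body><h1>Nikon D70</h1><p>A 6.1 MP digital SLR camera.</p></body></html>"
# To a program, plain_page is just markup plus opaque text: it says how to
# display the page, not what the page is about.

# Extra data attached to the page as simple (property, value) statements;
# the property names are made up for this example.
extra_data = {
    "type": "digital camera",
    "brand": "Nikon",
    "resolution_megapixels": 6.1,
}

def is_camera_page(statements):
    # A trivial "agent": it can answer a question that plain_page alone
    # could never answer for it.
    return statements.get("type") == "digital camera"

print(is_camera_page(extra_data))  # True
```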
Metadata
Data about data; that is, data that describes information
resources.
Metadata is a systematic method for describing resources and
thereby improving their access. In the Web world, systematic
means structured and, furthermore, structured data implies
machine readability and understandability.
Without an agreed-on standard, the metadata of each Web document has its
own unique structure, and it is simply not possible for an automated agent
to process these metadata in a uniform and global way.
Metadata provides the essential link between the page content
and content meaning.
A standard is a set of agreed-on criteria for describing data. For
instance, a standard may specify that each metadata record should
consist of a number of predefined elements representing some
specific attributes of a resource (in this case, the Web document),
and each element can have one or more values. This kind of
standard is called a metadata schema.
Dublin Core (DC) is one such standard. It was developed in
the March 1995 Metadata Workshop sponsored by the Online
Computer Library Center (OCLC) and the National Center for
Supercomputing Applications (NCSA). It has 13 elements
(subsequently increased to 15), which are called Dublin Core
Metadata Element Set (DCMES); it is proposed as the minimum
number of metadata elements required to facilitate the discovery
of document-like objects in a networked environment such as
the Internet
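As an informal sketch (the record values below are invented; only the element names come from DCMES), a DC metadata record can be pictured as a fixed set of predefined elements, each holding one or more values:

```python
# The 15 elements of the Dublin Core Metadata Element Set (DCMES).
DCMES_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# A hypothetical DC record for a Web page; each element may hold one or more values.
record = {
    "title": ["Introduction to the Semantic Web"],
    "creator": ["J. Smith", "A. Kumar"],      # multiple values are allowed
    "subject": ["Semantic Web", "metadata"],
    "date": ["2006-03-01"],
    "language": ["en"],
}

# Because every record follows the same schema, an agent can check and
# process records from different sites in a uniform way.
unknown = set(record) - DCMES_ELEMENTS
assert not unknown, f"non-DC elements found: {unknown}"
```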
Metadata Considerations
Embedding the Metadata in Your Page
<meta> tags in the <head> section
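A small sketch of what this looks like in practice (the helper function and the sample record are hypothetical); Dublin Core metadata is commonly embedded using <meta> names such as DC.title and DC.creator:

```python
# Minimal sketch (hypothetical helper): emit Dublin Core metadata as <meta>
# tags for the <head> section of an HTML page, using "DC.<element>" names.
from html import escape

def dc_meta_tags(record):
    tags = []
    for element, values in record.items():
        for value in values:
            tags.append(f'<meta name="DC.{element}" content="{escape(str(value))}">')
    return "\n".join(tags)

record = {"title": ["Introduction to the Semantic Web"], "creator": ["J. Smith"]}
print(dc_meta_tags(record))
# <meta name="DC.title" content="Introduction to the Semantic Web">
# <meta name="DC.creator" content="J. Smith">
```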
Using a text-parsing crawler to create Metadata
Once the crawler reaches a page and finds that it does not have any
metadata, it attempts to discover some meaningful information by
scanning through the text and creates some metadata for the page
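A rough sketch of this idea, with invented heuristics (a real text-parsing crawler is far more sophisticated): use the <title> tag as the title, and take the most frequent non-trivial words as a crude guess at the subject.

```python
# Hypothetical sketch: derive simple metadata for a page that carries none,
# by scanning its text. Real text-parsing crawlers use far better heuristics.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

def guess_metadata(html_text):
    # Use the <title> tag (if any) as the title.
    m = re.search(r"<title>(.*?)</title>", html_text, re.IGNORECASE | re.DOTALL)
    title = m.group(1).strip() if m else ""

    # Strip the remaining tags and count the most frequent non-trivial words.
    text = re.sub(r"<[^>]+>", " ", html_text).lower()
    words = [w for w in re.findall(r"[a-z]+", text) if w not in STOP_WORDS and len(w) > 3]
    keywords = [w for w, _ in Counter(words).most_common(5)]

    return {"title": [title] if title else [], "subject": keywords}

page = "<html><head><title>Camera Reviews</title></head><body>Nikon camera reviews and camera tests.</body></html>"
print(guess_metadata(page))
# {'title': ['Camera Reviews'], 'subject': ['camera', 'reviews', 'nikon', 'tests']}
```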
Using Metadata Tools to Add Metadata to Existing Pages
http://www.ukoln.ac.uk/metadata/dcdot/
The problem with this solution is that you have to visit the Web pages one
by one to generate the metadata, and the metadata that is generated is only
DC metadata
Also, the generated metadata often cannot really be added to the page itself,
because you normally do not have access to that page; you need to find some
other place to store it (see the sketch below)
DCdot can be used to generate DC metadata for the
page you submit
The DC metadata generated by DCdot
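Because the generated metadata often cannot be embedded in pages you do not control, one possibility (a sketch only; the file name and record format are made up) is to keep it in an external store keyed by the page URL:

```python
# Hypothetical sketch: keep generated DC metadata outside the pages
# themselves, in a simple JSON store keyed by URL.
import json
from pathlib import Path

STORE = Path("dc_metadata_store.json")  # made-up file name

def save_metadata(url, record):
    store = json.loads(STORE.read_text()) if STORE.exists() else {}
    store[url] = record
    STORE.write_text(json.dumps(store, indent=2))

def load_metadata(url):
    if not STORE.exists():
        return None
    return json.loads(STORE.read_text()).get(url)

save_metadata("http://example.org/page1.html",
              {"title": ["Example Page"], "creator": ["J. Smith"]})
print(load_metadata("http://example.org/page1.html"))
```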
Search Engine For The Traditional Web
Search Engines: Google, Yahoo, AltaVista, Lycos
Indexation process: spider or crawler
Building the index table
Indexation process
The quality of the generated index table to a large extent
decides the quality of a query result.
The indexation process is conducted by a special piece of software
usually called a spider, or crawler.
A crawler visits the Web to collect literally everything it can
find, constructing the index table during its journey.
To initially kick off the process, the main control component of
a search engine will provide the crawler with a seed URL (a set
of seed URLs), and the crawler, after receiving the seed URL, will
begin its journey by accessing this URL:
it downloads the page pointed to by this URL
and does the following:
Step 1: Read every single word on this page and add each of them
to the index table.
Step 2: From the current page, find the first link (which is
again a URL pointing to another page) and crawl to this
link, meaning to download the page pointed to by this link
Step 3: After downloading this page, start reading each
word on this page, and add them all to the index table.
Step 4: Go to step 2, until no unvisited link exists
Two possibilities when adding a word (handled in the sketch below):
1. the current word from this new page has never been added to the
index table, so a new entry is created for it, or
2. it already exists in the index table, so the current page is simply
added to that word's entry
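Putting the four steps and the two possibilities together, here is a minimal single-process sketch of the indexation loop (the seed URL and helper names are hypothetical, and everything is heavily simplified; a real crawler must also handle robots.txt, politeness delays, deduplication, and scale):

```python
# Minimal sketch of a crawler building an inverted index table.
# Hypothetical and heavily simplified: no robots.txt, no politeness delay,
# no error handling beyond the bare minimum.
import re
from collections import defaultdict
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_url, max_pages=50):
    index_table = defaultdict(set)   # word -> set of URLs containing it
    to_visit, visited = [seed_url], set()

    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue

        # Steps 1 and 3: add every word on this page to the index table.
        text = re.sub(r"<[^>]+>", " ", html).lower()
        for word in re.findall(r"[a-z]+", text):
            # Two possibilities: the word is new (defaultdict creates a new
            # entry), or it already exists (the URL is added to its entry).
            index_table[word].add(url)

        # Steps 2 and 4: follow the links found on this page to unvisited pages.
        for link in re.findall(r'href="([^"#]+)"', html):
            to_visit.append(urljoin(url, link))

    return index_table

# index = crawl("http://example.org/")   # hypothetical seed URL
# print(sorted(index)[:10])
```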