This seminar report summarizes research on the deep web and its value. The deep web refers to web content not indexed by typical search engines as it is stored in databases rather than static pages. While the surface web contains an estimated 2.5 billion documents, over 90% of information is buried in the deep web. Techniques like directed query engines are needed to uncover the deep web's hidden values across various content types and domains.


A

SEMINAR REPORT
ON
“ DEEP WEB : Surfacing Hidden Values ”
By
Mr. Kumar Rahul
Roll No. 05
B.E COMP I

Guided by
Prof. D.M. Thakore
Submitted in Partial Fulfillment for the Award of
Bachelor of Engineering
In
Computer Engineering

DEPARTMENT OF COMPUTER ENGINEERING


BHARATI VIDYAPEETH UNIVERSITY
COLLEGE OF ENGINEERING
PUNE – 43
2009-10
INTRODUCTION TO
DEEP WEB

• The deep Web (also called Deepnet, the invisible Web, dark Web,
or the hidden Web) refers to World Wide Web content that is not part
of the surface Web, which is indexed by standard search engines.

• Searching on the Internet today can be compared to dragging a net
across the surface of the ocean.
Surface Web

• The surface Web (also known as the visible Web or
indexable Web) is that portion of the World Wide Web that
is indexed by conventional search engines.

• Search engines construct a database of the Web by using
programs called spiders or Web crawlers that begin with a list
of known Web pages.

• The surface Web contains an estimated 2.5
billion documents, growing at a rate of 7.5
million documents per day.

Figure : Search Engines: Dragging a Net Across the Web's Surface
BACKGROUND
 In the earliest days of the Web, there were relatively
few documents and sites.

 All documents/pages could be crawled easily by
conventional search engines.

 First, database technology was introduced to the
Internet with Bluestone's Sapphire/Web (Bluestone was
later bought by HP) and later Oracle.

 Then, Los Alamos National Laboratory (LANL)
founded Innovative Web Applications in 1996.

 Finally, the first "deep Web" application in
the Federal government, the Environmental Science
Network, was deployed in February 1999.
The Deep Web: Surfacing Hidden
Value
 There is still a wealth of information that is deep, and
therefore missed. The reason is simple: most of the
Web's information is buried far down on dynamically
generated sites, and standard search engines never find
it.

 Traditional search engines create their indices by
spidering or crawling surface Web pages.

 Deep Web sources store their content in databases.

How Search Engines Work

Search engines obtain their listings in two
ways: authors may submit their own Web
pages, or the search engines "crawl" or
"spider" the Web.

Crawlers work by recording every hypertext
link in every page they index while crawling.
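The crawling process described above can be sketched as a breadth-first traversal that records every link on each page it visits. A minimal sketch in Python, using a tiny in-memory "Web" of invented pages rather than real network fetches (the URLs and page contents are illustrative assumptions):

```python
import re
from collections import deque

# A tiny in-memory "Web": URL -> HTML (hypothetical pages, not real sites)
PAGES = {
    "http://site/a": '<a href="http://site/b">B</a> <a href="http://site/c">C</a>',
    "http://site/b": '<a href="http://site/a">A</a>',
    "http://site/c": '<a href="http://site/d">D</a>',
    "http://site/d": "no outgoing links here",
}

def crawl(seeds):
    """Breadth-first crawl: record every hypertext link on every page indexed."""
    seen, queue, index = set(seeds), deque(seeds), {}
    while queue:
        url = queue.popleft()
        html = PAGES.get(url, "")
        links = re.findall(r'href="([^"]+)"', html)   # record the page's links
        index[url] = links
        for link in links:                            # follow links not yet seen
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl(["http://site/a"])
print(sorted(index))   # all four pages are reachable from the seed
```

Note that the crawler only ever reaches pages some other page links to: a database-backed page with no inbound link, like the deep Web sources discussed next, would never enter the queue.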
Invisible Web

 In this study, we have avoided the term "invisible
Web" because it is inaccurate.

 The only thing "invisible" about searchable databases is that
they are not indexable by conventional crawlers. Using BrightPlanet
technology, they are totally "visible" to those who need to access them.
FIGURE 1.2 : Harvesting the Deep and Surface Web with a Directed Query Engine
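The directed-query approach in the figure can be illustrated as follows: rather than following links, the engine submits the same query to each searchable database and merges the answers. A minimal sketch (the source names and records are hypothetical, not BrightPlanet's actual sources):

```python
# Two searchable "databases" that a link-following crawler cannot index:
# their records are reachable only by submitting a query.
SOURCES = {
    "medical": ["aspirin trial results", "deep vein thrombosis study"],
    "patents": ["deep learning patent 123", "net fabrication patent 456"],
}

def directed_query(term):
    """Send one query to every deep Web source and merge the answers,
    tagging each hit with the database it came from."""
    hits = []
    for name, records in SOURCES.items():
        hits.extend((name, r) for r in records if term in r)
    return hits

print(directed_query("deep"))   # hits from both databases, surfaced together
```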
TECHNICAL CHALLENGES
 BrightPlanet's technology is uniquely suited to tap the
deep Web and bring its results to the surface.

 Why is the Hidden Web better than Google?

If more than 90% of all information is accessible only
through the Hidden Web, then searching on Google
may return only 10% of the possible results on the
Web. Only by using the Hidden Web will you find the
rest, comprising the bulk of everything stored on the Internet.
TECHNICAL CHALLENGES

 Sites like Google rely on HTML text-based searching.

 Content such as documents in PDF format is difficult
for Google to index. All of this is stored on the Web,
but without a hidden Web search engine, we will not
easily find all of the results we need.

 Also known as the deep Web or the invisible Web, the
hidden Web contains thousands of search engines that
each focus only on their own little corner of the Web.
Some Applications
1. Deep Web as a Search Engine

 When you’re searching the Web for what you need,
you’re missing about 90 percent of all the information on
the Web if you aren’t using deep Web search
engines.

 Deep Web search engines offer specific searches
across the Web for sites whose stored data can’t be
easily spidered by Google or other surface Web
search engines.
These major search engine sites are only
able to read text, or HTML tags and labels
inside that text.

They are unable to spider/access any
audio file, video file, or anything else that isn’t
text-based.
2. The Deep Web in Google

 The concept of the deep Web is becoming more complex as search
engines such as Google have found ways to integrate deep Web content
into their central search function.

 However, even a search engine as far-reaching as Google provides
access to only a very small part of the deep Web.
Figure : Distribution of Deep Web Sites by content type
Why the deep Web?

 The deep Web is massive -- approximately 500
times greater than the Web visible to conventional
search engines -- with much higher quality
throughout.

 It is fast and economical, and provides in-depth knowledge.

 Deep Web coverage is broad and relevant.

 Deep Web searchable databases and search
engines combined total more than 250,000
sites.
Analysis & characteristics
Deep Web Site Qualification

 An initial pool of 53,220 possible deep Web candidate
URLs was identified; after harvesting and tests for
duplicates, this pool resulted in 45,732 actual unique
listings.

 The BrightPlanet technology was used to retrieve the
complete pages and fully index them for both the initial
unique sources and the one-link-removed sources. A total
of 43,348 resulting URLs were actually retrieved.
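The duplicate-testing step above amounts to canonicalizing candidate URLs and keeping one copy of each. A minimal sketch, with illustrative normalization rules and sample URLs (an assumption for exposition, not BrightPlanet's actual qualification procedure):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical(url):
    """Normalize a candidate URL: lowercase the scheme and host,
    drop the fragment, and strip a trailing slash from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def unique_listings(candidates):
    """Collapse a pool of candidate URLs into sorted unique listings."""
    return sorted({canonical(u) for u in candidates})

pool = [
    "http://Example.com/db/",
    "http://example.com/db",
    "http://example.com/db#top",
    "http://example.com/other",
]
print(unique_listings(pool))
# ['http://example.com/db', 'http://example.com/other']
```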
Conclusion

 The deep Web thus appears to be a critical source
when it is imperative to find a "needle in a haystack."

 At present, the Internet is functionally divided into two
areas ---

a) 10% of the information content is in the surface
Web (Yahoo, Google, etc.), and

b) 90% is in the deep Web.


Deep Web Growing Faster than Surface Web

Figure : Comparative Deep and Surface Web Site Growth Rates


THANKS TO EVERYONE

ANY QUERIES?
