This seminar report summarizes research on the deep web and its value. The deep web refers to web content not indexed by typical search engines as it is stored in databases rather than static pages. While the surface web contains an estimated 2.5 billion documents, over 90% of information is buried in the deep web. Techniques like directed query engines are needed to uncover the deep web's hidden values across various content types and domains.


A

SEMINAR REPORT
ON
“ DEEP WEB : Surfacing Hidden Values ”
By
Mr. Kumar Rahul
Roll No. 05
B.E COMP I

Guided by
Prof. D.M. Thakore
Submitted in Partial Fulfillment for the Award of
Bachelor of Engineering
In
Computer Engineering

DEPARTMENT OF COMPUTER ENGINEERING


BHARATI VIDYAPEETH UNIVERSITY
COLLEGE OF ENGINEERING
PUNE – 43
2009-10
INTRODUCTION TO
DEEP WEB

• The deep Web (also called Deepnet, the invisible Web, dark Web,
or the hidden Web) refers to World Wide Web content that is not part
of the surface Web, which is indexed by standard search engines.

• Searching on the Internet today can be compared to dragging a net
across the surface of the ocean.
Surface Web

• The surface Web (also known as the visible Web or
indexable Web) is that portion of the World Wide Web that
is indexed by conventional search engines.

• Search engines construct a database of the Web by using
programs called spiders or Web crawlers that begin with a list
of known Web pages.

• The surface Web contains an estimated 2.5
billion documents, growing at a rate of 7.5
million documents per day.

Figure : Search Engines: Dragging a Net Across the Web's Surface
BACKGROUND
 In the earliest days of the Web, there were relatively
few documents and sites.

 All documents/pages could be crawled easily by
conventional search engines.

 First, database technology was introduced to the
Internet with Bluestone's Sapphire/Web (Bluestone was
later bought by HP) and later Oracle.

 Then, Los Alamos National Laboratory (LANL)
founded Innovative Web Applications in 1996.

 Finally, the first "deep Web" application in
the Federal government, the Environmental Science
Network, was deployed in February 1999.
The Deep Web: Surfacing Hidden
Value
 There is still a wealth of information that is deep, and
therefore missed. The reason is simple: most of the
Web's information is buried far down on dynamically
generated sites, and standard search engines never find
it.

 Traditional search engines create their indices by
spidering or crawling surface Web pages.

 Deep Web sources store their content in databases.

How Search Engines Work

Search engines obtain their listings in two
ways: authors may submit their own Web
pages, or the search engines "crawl" or
"spider" the Web.

Crawlers work by recording every hypertext
link in every page they index while crawling.
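The crawling process described above can be sketched as a breadth-first traversal that records every link on each page it visits. A minimal sketch in Python, using a tiny in-memory "Web" of invented pages rather than real network fetches (the URLs and page contents are illustrative assumptions):

```python
import re
from collections import deque

# A tiny in-memory "Web": URL -> HTML (hypothetical pages, not real sites)
PAGES = {
    "http://site/a": '<a href="http://site/b">B</a> <a href="http://site/c">C</a>',
    "http://site/b": '<a href="http://site/a">A</a>',
    "http://site/c": '<a href="http://site/d">D</a>',
    "http://site/d": "no outgoing links here",
}

def crawl(seeds):
    """Breadth-first crawl: record every hypertext link on every page indexed."""
    seen, queue, index = set(seeds), deque(seeds), {}
    while queue:
        url = queue.popleft()
        html = PAGES.get(url, "")
        links = re.findall(r'href="([^"]+)"', html)   # record the page's links
        index[url] = links
        for link in links:                            # follow links not yet seen
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl(["http://site/a"])
print(sorted(index))   # all four pages are reachable from the seed
```

Note that the crawler only ever reaches pages some other page links to: a database-backed page with no inbound link, like the deep Web sources discussed next, would never enter the queue.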
Invisible Web

 In this study, we have avoided the term "invisible
Web" because it is inaccurate.

 The only thing "invisible" about searchable databases is that
they are not indexable by conventional crawlers. Using BrightPlanet
technology, they are totally "visible" to those who need to access them.
FIGURE 1.2 : Harvesting the Deep and Surface Web with a Directed Query Engine
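The directed-query approach in the figure can be illustrated as follows: rather than following links, the engine submits the same query to each searchable database and merges the answers. A minimal sketch (the source names and records are hypothetical, not BrightPlanet's actual sources):

```python
# Two searchable "databases" that a link-following crawler cannot index:
# their records are reachable only by submitting a query.
SOURCES = {
    "medical": ["aspirin trial results", "deep vein thrombosis study"],
    "patents": ["deep learning patent 123", "net fabrication patent 456"],
}

def directed_query(term):
    """Send one query to every deep Web source and merge the answers,
    tagging each hit with the database it came from."""
    hits = []
    for name, records in SOURCES.items():
        hits.extend((name, r) for r in records if term in r)
    return hits

print(directed_query("deep"))   # hits from both databases, surfaced together
```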
TECHNICAL CHALLENGES
 BrightPlanet's technology is uniquely suited to tap the
deep Web and bring its results to the surface.

 Why is the Hidden Web better than Google?

If more than 90% of all information is accessible only
through the Hidden Web, then searching on Google
may return only 10% of the possible results on the
Web. Only by using the Hidden Web will you find the
rest, comprising the bulk of everything stored on the Internet.
TECHNICAL CHALLENGES

 Sites like Google rely on HTML text-based searching.

 Content such as documents in PDF format is difficult
for Google to index. All of this is stored on the Web,
but without a hidden Web search engine, we will not
easily find all of the results we need.

 Also known as the deep Web or the invisible Web, the
hidden Web contains thousands of search engines that
each focus only on their own little corner of the Web.
Some Applications
1. Deep Web as a Search Engine

 When you’re searching the Web for what you need,
you’re missing about 90 percent of all the information on
the Web if you aren’t using deep Web search
engines.

 Deep Web search engines offer specific searches
across the Web for sites whose stored data can’t be
easily spidered by Google or other surface Web
search engines.
These major search engine sites are only
able to read text, or HTML tags and labels
inside that text.

They are unable to spider/access any
audio file, video file, or anything else that isn’t
text-based.
2. The Deep Web in Google

 The concept of the deep Web is becoming more complex as search
engines such as Google have found ways to integrate deep Web content
into their central search function.

 However, even a search engine as far-reaching as Google provides
access to only a very small part of the deep Web.
Figure : Distribution of Deep Web Sites by content type
Why the deep Web?

 The deep Web is massive -- approximately 500
times greater than the Web visible to conventional
search engines -- with much higher quality
throughout.

 It is fast and economical, and provides in-depth knowledge.

 Deep Web coverage is broad and relevant.

 Deep Web searchable databases and search
engines combined total more than 250,000
sites.
Analysis & characteristics
Deep Web Site Qualification

 An initial pool of 53,220 possible deep Web candidate
URLs was identified; after harvesting and tests for
duplicates, this pool resulted in 45,732 actual unique
listings.

 The BrightPlanet technology was used to retrieve the
complete pages and fully index them for both the initial
unique sources and the one-link-removed sources. A total
of 43,348 resulting URLs were actually retrieved.
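The duplicate-testing step above amounts to canonicalizing candidate URLs and keeping one copy of each. A minimal sketch, with illustrative normalization rules and sample URLs (an assumption for exposition, not BrightPlanet's actual qualification procedure):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical(url):
    """Normalize a candidate URL: lowercase the scheme and host,
    drop the fragment, and strip a trailing slash from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def unique_listings(candidates):
    """Collapse a pool of candidate URLs into sorted unique listings."""
    return sorted({canonical(u) for u in candidates})

pool = [
    "http://Example.com/db/",
    "http://example.com/db",
    "http://example.com/db#top",
    "http://example.com/other",
]
print(unique_listings(pool))
# ['http://example.com/db', 'http://example.com/other']
```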
Conclusion

 The deep Web thus appears to be a critical source
when it is imperative to find a "needle in a haystack."

 At present, the Internet is functionally divided into two
areas ---

a) 10% of the information content is in the surface
Web (Yahoo, Google, etc.), and

b) 90% is in the deep Web.


Deep Web Growing Faster than Surface Web

Figure : Comparative Deep and Surface Web Site Growth Rates


THANKS TO EVERYONE

ANY QUERIES?
