Web Mining: Presented By: Vikash Kumar
Web Mining
Web mining is the use of data mining
techniques to automatically discover and extract
information from web documents and services
Discovering useful information from the World
Wide Web and its usage patterns
My Definition: Using data mining techniques to
make the web more useful and more profitable
(for some) and to increase the efficiency of our
interaction with the web
Web Mining
Data Mining Techniques
Association rules
Sequential patterns
Classification
Clustering
Classification of Web Mining Techniques
Search Engines
Personalization
Website Design
Website Usage Analysis
Website Usage Analysis
Why analyze Website usage?
Knowledge about how visitors use a website could:
Provide guidelines for website reorganization and help prevent
visitor disorientation
Help designers place important information where visitors look
for it
Support pre-fetching and caching of web pages
Enable adaptive websites (personalization)
Questions that could be answered:
What are the differences in usage and access patterns among users?
Which user behaviors change over time?
How do usage patterns change with quality of service (slow/fast)?
What is the distribution of network traffic over time?
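As an illustration of the last question, the distribution of traffic over time can be estimated by counting requests per hour in a server access log. This is a minimal sketch, assuming the log uses the Common Log Format with a bracketed timestamp; the function name `requests_per_hour` and the sample lines are hypothetical and not part of the original slides.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: each log line carries a Common Log Format timestamp,
# e.g. [10/Oct/2000:13:55:36 +0000]
LOG_TIME = re.compile(r"\[([^\]]+)\]")


def requests_per_hour(lines):
    """Count requests per hour of day (0-23) from access-log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_TIME.search(line)
        if match is None:
            continue  # skip lines without a timestamp
        try:
            stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # skip lines whose timestamp does not parse
        counts[stamp.hour] += 1
    return counts


# Hypothetical sample data for illustration only
sample_log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:58:01 +0000] "GET /about.html HTTP/1.0" 200 1045',
    '10.0.0.1 - - [10/Oct/2000:14:02:11 +0000] "GET /contact.html HTTP/1.0" 200 512',
]
hourly = requests_per_hour(sample_log)
```

Grouping by user or by page instead of by hour would address the other questions in the same way.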
Web Content Mining
The process of information or resource
discovery from the content of millions of sources
across the World Wide Web
E.g. web data contents: text, images, audio, video,
metadata, and hyperlinks
Goes beyond keyword extraction or
simple statistics of words and phrases in
documents
Examples of Discovered Patterns
Association rules
98% of AOL users also have E-trade accounts
Classification
People younger than 40 with a salary above 40k
trade online
Clustering
Users A and B access similar URLs
Outlier Detection
User A spends more than twice the average
amount of time surfing on the Web
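Two of these pattern types can be sketched in a few lines. The example below is a minimal illustration, not the method behind the figures quoted above: it computes the confidence of an association rule as support(antecedent and consequent) / support(antecedent), and flags users whose surfing time exceeds twice the average. All data and the names `support`, `confidence`, and `time_outliers` are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / base


def time_outliers(minutes_by_user, factor=2.0):
    """Users whose time exceeds factor times the average time."""
    if not minutes_by_user:
        return []
    mean = sum(minutes_by_user.values()) / len(minutes_by_user)
    return sorted(u for u, m in minutes_by_user.items() if m > factor * mean)


# Hypothetical data: the set of services each user holds an account with
accounts = [
    {"aol", "etrade"},
    {"aol", "etrade"},
    {"aol"},
    {"etrade"},
]
rule_conf = confidence(accounts, {"aol"}, {"etrade"})

# Hypothetical surfing time per user, in minutes
surf_minutes = {"A": 50, "B": 10, "C": 12, "D": 8}
outliers = time_outliers(surf_minutes)
```

In this toy data, two of the three AOL users also hold an E-trade account, and user A is the only one above twice the 20-minute average.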
Web Mining
Engage Technologies
Tracks web traffic to create anonymous user profiles of Web surfers
Has profiles for more than 35 million anonymous users
Problems with Web Search Today
Today’s search engines are plagued by
problems:
the abundance problem (99% of information is of no interest to
99% of people)
limited coverage of the Web (Internet sources
hidden behind search interfaces)
Largest crawlers cover < 18% of all web pages
limited query interface based on keyword-oriented
search
limited customization to individual users
Problems with Web Search Today (cont.)
Today’s search engines are plagued by
problems:
Web is highly dynamic
Many pages are added, removed, and updated
every day
Very high dimensionality
Web Mining Issues
Size
Grows at about 1 million pages a day
Google indexes 9 billion documents
Number of web sites
Netcraft survey says 72 million sites
(http://news.netcraft.com/archives/web_server_survey.html)
Network Management
Performance management
Fault management
Retrieval of Similar Images
Given:
A set of images
Find:
All images similar to a given image
All pairs of similar images
Sample applications:
Medical diagnosis
Weather prediction
Web search engine for images
E-commerce
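The two retrieval tasks above can be sketched as follows, assuming each image has already been reduced to a numeric feature vector (for example a color histogram) and that cosine similarity with a fixed threshold is an acceptable similarity measure. Both are assumptions made for illustration; the slides do not specify a feature representation or measure, and the names `similar_to` and `similar_pairs` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_to(query_id, features, threshold):
    """All images whose similarity to the query image meets the threshold."""
    query = features[query_id]
    return sorted(
        image_id
        for image_id, vector in features.items()
        if image_id != query_id and cosine_similarity(query, vector) >= threshold
    )


def similar_pairs(features, threshold):
    """All unordered pairs of images whose similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(features), 2)
        if cosine_similarity(features[a], features[b]) >= threshold
    ]


# Hypothetical 3-bin color histograms for three images
image_features = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 0.0, 1.0],
}
matches = similar_to("a", image_features, threshold=0.9)
pairs = similar_pairs(image_features, threshold=0.9)
```

The all-pairs search compares every pair of images, so its cost grows quadratically with the collection size; large collections would typically need an index rather than this brute-force loop.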
Fraud
With the growing popularity of E-commerce, systems to
detect and prevent fraud on the Web become important