Web Mining

The document discusses web mining and its various types and techniques. Web mining involves extracting useful information and patterns from web data sources, and it is classified into web content mining, web structure mining, and web usage mining which analyzes user behavior and clickstream data from web server logs. The goals of web mining include finding relevant information, creating new knowledge, and personalizing information for users.

Uploaded by

rohitg5955
Copyright
© Attribution Non-Commercial (BY-NC)

By

Uday Kumar

WEB MINING

8/12/10
Agenda

World Wide Web – a brief history

Introduction to Data Mining

Data Mining Process & Techniques

Web Mining

Data Mining Vs Web Mining

Classification of Web Mining

Benefits & Application Areas of Web Mining

Web Mining Software

Summary

World-Wide Web - a brief history
Who invented the World-Wide Web?
(Sir) Tim Berners-Lee invented the World Wide Web in 1989 while
working at CERN, including the URL scheme and HTML, and in 1990
wrote the first server (httpd) and the first browser.
Web’s Characteristics:

Billions of documents authored by millions of diverse people

Distributed over millions of computers, connected by a variety
of media

Large size, dynamic content, time dimension and
multilingual content

Different data types: text, images, hyperlinks and user usage
information.
Mining Large Data Sets - Motivation

There is often information “hidden” in the data that is not
readily evident

Human analysts may take weeks to discover useful information

Much of the data is never analyzed at all
[Figure: "The Data Gap" chart. Vertical axis from 0 to 4,000,000; series: total new disk (TB) and number of analysts; horizontal axis: 1995 to 1999.]
Data Mining

Data Mining - Definition

» It is commonly defined as the process of extracting
meaningful information from data sources, e.g. databases,
texts, images, the web, etc.

» It is the process of performing automated extraction and
generating predictive information from large data banks,
which enables us to understand current market trends
and to take proactive measures to gain maximum
benefit from them.

Data Mining Process

Data Mining Tasks
» Data mining makes use of various algorithms to perform a
variety of tasks. These algorithms examine the sample data
of a problem and determine a model that comes close to
solving the problem.

» A predictive model enables you to predict the values of
data by making use of known results from a different set
of sample data. The tasks that form part of a
predictive model are:

Classification

Regression

Time Series Analysis

Data Mining Tasks Contd..

» A descriptive model enables you to determine the
patterns and relationships in sample data. The
tasks that form part of a descriptive model are:

Clustering

Summarization

Association rules

Sequence discovery

Data Mining Tasks Contd..
» Classification: enables you to classify data in a large data
bank into a predefined set of classes. Ex: People aged under
40 with a salary > 40k trade online.
» Regression: enables you to forecast data values based on
present and past values. Ex: helps an organization predict
the need for recruiting new employees and for purchases, based on
the past and current growth rate.
» Time Series Analysis: enables you to predict future values when
the current set of values is time dependent (monthly, yearly, ...).
» Summarization: enables you to summarize a large chunk of data
contained in a web page.

Data Mining Tasks Contd..
» Clustering: enables you to create new groups (clusters) based
on the study of patterns and relationships between values of data in
a data bank. It is similar to classification but does not require
you to predefine groups (also called Unsupervised
Learning). Ex: Users A and B access similar URLs.

» Association Rules: define rules of association
between data items and then use those rules to establish
relationships. Ex: Find the items that tend to be purchased together
and specify their relationship.

» Sequence Discovery: enables you to determine the sequential
patterns that might exist in a large and unorganized data bank.
Ex: crime detection.
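The "items purchased together" example for association rules can be sketched in a few lines. This is a minimal illustration that only considers pairs of items and uses the usual support and confidence measures; the transactions, the function name pair_rules and the 0.5 support threshold are all assumptions made for the example, not something taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    # How many baskets contain each single item.
    item_counts = Counter(item for t in transactions for item in t)
    # How many baskets contain each unordered pair of items.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # confidence(a -> b) = P(b in basket | a in basket)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


rules = pair_rules(transactions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f} confidence={confidence:.2f}")
```

Production rule miners such as Apriori extend the same counting idea to item sets of any size and prune candidates that cannot reach the support threshold.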
Data Mining Techniques
» Data mining is not so much a single technique as the idea that
there is more knowledge hidden in the data than shows itself
on the surface. Any technique that helps extract more out of
your data is useful. Data mining techniques include:

Statistical techniques: statistics is the branch of mathematics that
deals with the collection and analysis of numerical data by
using various methods and techniques.

Machine Learning: the process of building a computer
system that is capable of acquiring data and integrating the
data to generate useful knowledge.

Decision trees: a decision tree is a tree-shaped structure in which each
branch represents a classification question, while the leaves of
the tree represent the partitions of classified information.

Data Mining Techniques
» Hidden Markov Models: enable you to predict future
actions in a time series. The model provides the
probability of a future event, given the present
and previous events.
» Neural networks: a large set of historical data is
analyzed in order to predict the output of a particular future
situation or problem.
» Genetic algorithms: given a set of sample data,
a GA enables you to determine the best possible model out of a
set of models to represent the sample data.

Data Mining vs. Web Mining
Traditional data mining

Data is structured and relational

Well-defined tables, columns, rows, keys, and
constraints.
Web data

Semi-structured (HTML documents) and
unstructured (free text)

Readily available data

Rich in features and patterns

Problems when interacting with the Web

» Finding relevant information

» Creating new knowledge out of the
information available on the Web

» Personalization of the information

» Learning about consumers or individual users

Web Mining

Web Mining - Definition
» “Web mining refers to the overall process of discovering
potentially useful and previously unknown information or
knowledge from the Web data.”

» The web mining process is similar to the data mining
process; the difference is usually in the data collection.
» In data mining, the data is often already collected and
stored in a data warehouse.
» In web mining, data collection can be a substantial task,
especially for web structure and content mining, which
involve crawling a large number of target web pages.

Web Mining - Subtasks

Resource finding

Retrieving intended documents

Information selection/pre-processing

Select and pre-process specific information from selected
documents

Generalization

Discover general patterns at individual web sites as well as
across multiple web sites

Analysis

Validation and/or interpretation of mined patterns

Web Mining Contd..
Web Mining is not IR:

Information retrieval (IR) is the automatic retrieval of all
relevant documents while at the same time retrieving as few
of the non-relevant documents as possible

Web Mining is not IE:



Information extraction (IE) aims to extract the relevant facts
from given documents

IE systems for the general Web are not feasible

Most focus on specific Web sites or content

Classification of Web Mining

Web Usage Mining


Web Usage Mining refers to the discovery of user access
patterns from the web usage logs, which record every click
made by each user.

The usage data records the user’s behavior when the user
browses or makes transactions on the web site; it is mined in order
to better understand and serve the needs of users or Web-based
applications.

It is an activity that involves the automatic discovery of
patterns from one or more Web servers.
Web Usage Mining Contd..
Organizations often generate and collect large volumes of data;
most of this information is generated automatically by
Web servers and collected in server logs.

Analyzing such data can help these organizations to determine:

the value of particular customers

cross-marketing strategies across products

the effectiveness of promotional campaigns, etc.

Typical Sources of Data

automatically generated data stored in server access logs,
proxy server logs, referrer logs, browser logs, bookmark
data, mouse clicks and scrolls, and client-side cookies

user profiles

meta data: page attributes, content attributes, usage data
Web Usage Mining Contd..

The first web analysis tools simply provided mechanisms to
report user activity as recorded in the servers. Using such tools,
it was possible to determine such information as:

the number of accesses to the server

the times or time intervals of visits

the domain names and the URLs of users of the Web server.

Two main categories:

Learning a user profile (personalized)
Web users would be interested in techniques that learn
their needs and preferences automatically

Learning user navigation patterns (impersonalized)
Information providers would be interested in techniques
that improve the effectiveness of their Web site or bias
users towards the goals of the site
Web Usage Mining Contd..

Web servers, Web proxies, and client applications can quite
easily capture Web Usage data.

Web server log:
Records every visit to the pages: which files were requested
and when, the IP address of the request, the error code, the
number of bytes sent to the user, and the type of browser used…

By analyzing the Web usage data, web mining systems can
discover useful knowledge about a system’s usage
characteristics and the users’ interests, which has various
applications:

Personalization and Collaboration in Web-based systems

Marketing

Web site design and evaluation

Decision support
Web Server Log - A Sample
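A minimal sketch of reading one server log entry, assuming the server writes the widely used Combined Log Format. The sample line, its IP address and its URLs are invented for illustration, and LOG_PATTERN and parse_log_line are hypothetical names.

```python
import re
from datetime import datetime

# Assumed layout: Common Log Format, optionally followed by the quoted
# referrer and user-agent fields of the Combined Log Format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no body was sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# Invented sample line (documentation IP range, example.com URLs).
sample = (
    '192.0.2.10 - - [12/Aug/2010:10:15:32 +0000] '
    '"GET /products/index.html HTTP/1.1" 200 5120 '
    '"http://example.com/" "Mozilla/5.0"'
)
entry = parse_log_line(sample)
print(entry["ip"], entry["path"], entry["status"], entry["bytes"])
```

The parsed fields correspond to the items a web server log records: the requesting IP address, the requested file and time, the status code, the bytes sent and the browser type.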

Web Usage Mining Contd..
The technique of retrieving visitor-based information from web
server log files and applying this information to analyze
data is known as Web Log Mining.
The major types of log files are:

Access Log: maintains a list of all the web pages that
visitors have requested.

Agent Log: holds information about the browser
that was used to explore the various web pages.

Web Content Mining

Web Content Mining extracts or mines useful information or
knowledge from web page contents.

In this mining, patterns are extracted from online sources such
as

HTML files

Text documents

Images

E-books or email messages

Audio or Video

The concept of WCM is far wider than searching for any specific
term or only keyword extraction or some simple statistics of words
and phrases in documents.

A tool that performs WCM can summarize a web page so that you
need not read the complete document, saving you time and energy.
Web Content Mining Contd..
The two basic approaches or models to implement WCM are

Local Knowledge Base Model:
The abstract characterizations of several web pages
are stored locally (i.e. references to several web sites relating
to the categories are stored in a database, and based on the
selected category the search is performed within the
web site).

Agent Based Model:
This approach applies Artificial Intelligence
systems known as Web Agents that can perform a search on
behalf of a particular user to discover and organize
documents on the web. Some web agents can apply individual
user profiles to search for information on the web and
organize and interpret the discovered information.
Preprocessing Content
Content Preparation:

Extract text from HTML.

Perform Stemming.

Remove Stop Words.

Calculate Collection Wide Word Frequencies (DF).

Calculate per Document Term Frequencies (TF).
Vector Creation:

Common Information Retrieval Technique.

Each document (HTML page) is represented by a sparse
vector of term weights.

Typically, additional weight is given to terms appearing as
keywords or in titles.
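The preparation and vector-creation steps above can be sketched end to end with the standard library. This is only an illustration under stated assumptions: the stop-word list is tiny, stem() is a crude suffix stripper standing in for a real stemmer, TITLE_BOOST = 2.0 is an arbitrary choice for the "additional weight" on title terms, and script or style content is not filtered out. All names are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}
TITLE_BOOST = 2.0  # assumed extra weight for terms found in <title>


class TextExtractor(HTMLParser):
    """Collects lower-cased words, keeping title words separate."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_words = []
        self.body_words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        words = re.findall(r"[a-z]+", data.lower())
        (self.title_words if self.in_title else self.body_words).extend(words)


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def term_counts(html):
    """Weighted per-document term frequencies (TF) after stop-word removal."""
    parser = TextExtractor()
    parser.feed(html)
    counts = Counter()
    for words, weight in ((parser.title_words, TITLE_BOOST),
                          (parser.body_words, 1.0)):
        for word in words:
            if word not in STOP_WORDS:
                counts[stem(word)] += weight
    return counts


def tfidf_vectors(pages):
    """One sparse {term: weight} dict per page, using TF * log(N / DF)."""
    tfs = [term_counts(page) for page in pages]
    df = Counter(term for tf in tfs for term in tf)  # document frequencies
    n = len(pages)
    # Terms present in every page get weight 0, so they are left out.
    return [
        {term: count * math.log(n / df[term])
         for term, count in tf.items() if df[term] < n}
        for tf in tfs
    ]


pages = [
    "<html><head><title>Web Mining</title></head>"
    "<body>Mining the web for patterns</body></html>",
    "<html><head><title>Cooking</title></head>"
    "<body>Recipes for the web</body></html>",
]
vectors = tfidf_vectors(pages)
print(vectors)
```

Storing each vector as a dictionary keeps it sparse: only terms that occur in the page and carry a non-zero weight take up space.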

Common Mining Techniques
The more basic and popular data mining techniques include:

Classification: applied to server logs using decision trees or a
Naive Bayes classifier to discover the profiles of users
belonging to a particular category.

Clustering: can be used to group users exhibiting similar
browsing patterns.

Associations: can be used to relate pages that are most often
referenced together in a single server session.
The other significant ideas are:

Topic identification, tracking and drift analysis

Concept hierarchy creation

Relevance of content.

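Grouping users with similar browsing patterns can be illustrated with a very small sketch. It assumes each user is described by the set of pages visited, measures similarity with the Jaccard coefficient, and uses a simple greedy grouping rather than a full clustering algorithm such as k-means. The visit data, the 0.5 threshold and the function names are invented for the example.

```python
def jaccard(a, b):
    """Share of pages two users have in common, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_users(visits, threshold=0.5):
    """Greedy grouping: a user joins the first group whose seed is similar enough."""
    groups = []  # list of (seed_pages, [user, ...])
    for user, pages in sorted(visits.items()):
        for seed_pages, members in groups:
            if jaccard(pages, seed_pages) >= threshold:
                members.append(user)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append((pages, [user]))
    return [members for _, members in groups]


# Hypothetical per-user page sets, e.g. derived from server sessions.
visits = {
    "A": {"/home", "/laptops", "/cart"},
    "B": {"/home", "/laptops", "/cart", "/checkout"},
    "C": {"/home", "/blog", "/about"},
}
groups = group_users(visits)
print(groups)
```

Users A and B share three of four distinct pages and end up together, while C, who shares only the home page with them, forms a separate group.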
Web Structure Mining

Web Structure Mining discovers useful knowledge from
hyperlinks, which represent the structure of the web.

Web structure mining can be divided into two kinds:

Extracting patterns from hyperlinks in the web. A hyperlink is a
structural component that connects the web page to a
different location.

Mining the document structure. It uses the tree-like
structure to analyze and describe the HTML or XML tags
within the web page.

It is the process of using graph theory to analyze the node and
connection structure of a web site.
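As a rough illustration of the first kind of structure mining, the hyperlinks of a page can be collected with Python's standard HTML parser and resolved against the page's own address. The page content and URLs below are invented, and LinkExtractor and out_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute target of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def out_links(base_url, html):
    """Return the hyperlinks found in one page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Invented page fragment for illustration.
page = ('<p><a href="/about.html">About</a> '
        '<a href="http://example.org/">Elsewhere</a></p>')
links = out_links("http://example.com/index.html", page)
print(links)
```

Running this over every crawled page gives the edges of the site graph, which is the input that graph-based analysis needs.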
Web Structure Mining Contd..
Web Structure is a useful source for extracting information
such as
Web Page Classification

Classifying web pages according to various topics
Quality of Web Page

The authority of a page on a topic

Ranking of web pages
Which pages to crawl

Deciding which web pages to add to the collection of web
pages
Finding Related Pages

Given one relevant page, find all related pages

Web Structure Mining Contd..
Hyperlink-Induced Topic Search (HITS) is a common
algorithm for knowledge discovery in the Web. The
concept of HITS is
Web Structure Mining
Identification of

Authorities: authoritative, high-quality web pages on broad
topics

Hubs: web pages that link to a collection of authorities

A good authority is pointed to by many good hubs

A good hub points to many good authorities

Web structure mining has been largely influenced by research
in

Social network analysis

Citation analysis (bibliometrics).

in-links: the hyperlinks pointing to a page

out-links: the hyperlinks found in a page.

Usually, the larger the number of in-links, the better a page is.
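The mutual reinforcement between hubs and authorities can be sketched as a short iteration. This follows the commonly described form of HITS (authority score = sum of the hub scores of in-linking pages, hub score = sum of the authority scores of out-linked pages, then normalization); the tiny graph and the iteration count are assumptions for illustration only.

```python
import math


def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores over the page's in-links.
        auth = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub: sum of authority scores over the page's out-links.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Normalize both score vectors to unit length to keep them bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Invented link graph: three hub-like pages pointing at two targets.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
print(sorted(auth.items(), key=lambda item: -item[1]))
```

In this toy graph the page with the most in-links from good hubs receives the highest authority score, and the hubs that point to both authorities score higher than the hub that points to only one.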
Web Structure Mining Contd..

Each Web page is a node of the Web graph.

The out-degree of a node is the number of distinct links originating at that node and pointing to other nodes.
The probability, at any step, that the person browsing will continue following links is the damping factor d = 0.85.
N: the number of web pages
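The quantities named here (out-degree, damping factor d, number of pages N) are the ingredients of the standard PageRank recurrence, PR(p) = (1 - d)/N + d * sum over pages q linking to p of PR(q)/outdegree(q). Assuming that is the formula intended, a minimal power-iteration sketch looks like this; the three-page graph is invented, and spreading the rank of pages without out-links evenly over all pages is one common convention, not something stated in the slides.

```python
def pagerank(graph, d=0.85, iterations=100):
    """graph maps each page to the pages it links to; returns {page: rank}."""
    pages = sorted(set(graph) | {q for targets in graph.values() for q in targets})
    n = len(pages)
    # Only distinct out-links count towards the out-degree.
    out = {p: sorted(set(graph.get(p, ()))) for p in pages}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is shared equally (assumption).
        dangling = sum(rank[p] for p in pages if not out[p])
        new_rank = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            for q in out[p]:
                new_rank[q] += d * rank[p] / len(out[p])
        rank = new_rank
    return rank


# Invented graph: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(graph)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

With these conventions the ranks always sum to 1, so each value can be read as the long-run share of time a random surfer spends on that page.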

Application Areas of Web Mining

E-commerce

Search Engines

Personalization

Website Design

Web mining applications

Amazon.com

Google

DoubleClick

AOL

eBay

MyYahoo

CiteSeer

I-MODE

v-TAG Web Mining Server

Applications Contd..
Amazon:
A host of Web mining techniques, e.g. associations between
pages visited, click-path analysis, etc., are used to improve the
customer’s experience during a 'store visit'. Knowledge gained
from Web mining is the key intelligence behind Amazon’s
features such as 'instant recommendations', 'purchase circles',
'wish-lists', etc.

Applications Contd..
Google

Earlier search engines concentrated on Web content to
return the pages relevant to a query. Google was the first to
introduce the importance of the link structure in mining
information from the web. PageRank, which measures the
importance of a page, is the underlying technology in all
Google search products.

The PageRank technology, which makes use of the structural
information of the Web graph, is the key to returning quality
results relevant to a query.

Benefits of Web Mining
Match your available resources to visitor interests

Increase the value of each visitor


Improve the visitor's experience at the website


Perform targeted resource management


Collect information in new ways


Test the relevance of content and web site architecture


Web Mining Software
Web Miner

Sinope Summarizer

Teleport Pro

ClickTracks

Summary
Major Limitations of Web Mining research:

Difficult to collect Web Usage data across different Web
Sites.

Lack of suitable test collections that can be reused by
researchers

Future research directions:



Multimedia data mining: A picture is worth a thousand words.

Multilingual knowledge extraction: Web page translations

The Hidden Web: Forms, Dynamically generated web pages.

Semantic Web

Wireless Web: WML and HDML.