Bda Class - Feb 7th

Web mining involves applying data mining techniques to extract knowledge from web data. There are three types of web mining: web content mining, web structure mining, and web usage mining. Web content mining analyzes the contents of web documents. Web structure mining analyzes the hyperlink structure between documents. Web usage mining analyzes patterns from user interactions with websites through web server logs. Popular algorithms for web structure mining include PageRank, which Google uses to rank web pages, and Hubs and Authorities (HITS), which identifies hubs and authorities in linked data.

Uploaded by

Neeraj Sivadas K

MODULE I

Contents
• Web Mining
• Types of Web mining
• Data mining VS Web mining
WEB MINING
• Web mining is the application of data mining & machine
learning techniques to extract useful knowledge from the
content, structure & usage of web resources.
• Web mining is an application of data mining techniques to
find information patterns from the web data.
Types of Web Mining
Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
TYPES OF WEB MINING

1. WEB CONTENT MINING
✔ Extracts useful knowledge from the contents of web pages or
web documents.
✔ Content data may consist of text, images, audio, and video.
✔ Web content mining scans and mines the text and images of web
pages and groups the pages according to the content of the
input (query), displaying the results as a list in a search engine.
✔ For example: if a user searches for a particular book,
the search engine provides a list of suggestions.
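The idea of grouping and listing pages according to the content of a query can be sketched in a few lines. This is only an illustration under stated assumptions: the page texts, the `rank_pages` helper, and the simple smoothed TF-IDF score are all hypothetical choices, not the method of any particular search engine.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_pages(pages, query):
    """Return page names sorted by a simple TF-IDF score for the query."""
    term_counts = {name: Counter(tokenize(text)) for name, text in pages.items()}
    n_pages = len(pages)
    scores = {}
    for name, counts in term_counts.items():
        total = sum(counts.values()) or 1
        score = 0.0
        for term in tokenize(query):
            doc_freq = sum(1 for c in term_counts.values() if term in c)
            if doc_freq == 0:
                continue  # the term appears on no page
            # Smoothed inverse document frequency (an assumed variant).
            idf = math.log((1 + n_pages) / (1 + doc_freq)) + 1.0
            score += (counts[term] / total) * idf
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical page contents, used only for illustration.
pages = {
    "page_a": "Data mining book for beginners with data mining examples",
    "page_b": "Cooking recipes and kitchen tips",
    "page_c": "A book about travel",
}
ranking = rank_pages(pages, "data mining book")
print(ranking)
```

Pages that share more (and rarer) terms with the query come first, which mirrors the book-search example above.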
Agent-Based Approach
• The agent approach uses web agents to collect relevant
information from the world wide web.
• A web agent is a program that visits a web site and filters the
information the user is interested in.
• Intelligent Search Agents: search for relevant information and use
domain characteristics to organize and interpret what they discover.
• Information Filtering/Categorization:
⮚ Uses various information retrieval techniques and
characteristics of open hypertext Web documents to
automatically retrieve, filter, and categorize them.
⮚ Examples: HyPursuit, BO (Bookmark Organizer).
• Personalized Web Agents: Development of sophisticated AI
systems acting on behalf of users autonomously or semi-
autonomously to discover and organize information.
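A very small sketch of the filtering step of a web agent is given below. It assumes the pages have already been fetched (the `fetched_pages` dictionary and its URLs are made up), and it treats "interest" as a plain keyword match, which is far simpler than what a real agent would do.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text chunks of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def filter_pages(fetched_pages, interests):
    """Keep only pages whose text mentions at least one interest keyword."""
    wanted = {word.lower() for word in interests}
    relevant = {}
    for url, html in fetched_pages.items():
        text = extract_text(html)
        if wanted & set(text.lower().split()):
            relevant[url] = text
    return relevant

# Hypothetical, already-downloaded pages (no network access is performed).
fetched_pages = {
    "http://example.com/a": "<html><body><h1>Web mining</h1><p>Mining web logs</p></body></html>",
    "http://example.com/b": "<html><body><p>Gardening tips</p></body></html>",
}
relevant = filter_pages(fetched_pages, ["mining"])
print(relevant)
```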
Database Approaches
• Used for transforming unstructured data into more
structured and high-level collections of resources, such
as in relational databases, and using standard database
querying mechanisms and data mining techniques to
access and analyze this information.
• Multilevel-Databases
⮚ Lowest level: semi-structured information is kept.
⮚ Higher levels: generalizations from lower levels, organized into
relations and objects.
• Web-Query Systems
⮚ Web-based query systems and languages have been developed that
use SQL-like queries and natural language processing (NLP) to extract data.
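The database approach can be illustrated with Python's built-in `sqlite3` module. In this sketch the table layout, column names, and rows are assumptions made for the example: extracted page facts are stored in a relation, and a standard SQL query then produces a higher-level summary per topic.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, topic TEXT, word_count INTEGER)"
)

# Rows standing in for information extracted from web documents.
rows = [
    ("http://example.com/a", "Intro to Web Mining", "web mining", 1200),
    ("http://example.com/b", "PageRank Explained", "web mining", 800),
    ("http://example.com/c", "Pasta Recipes", "cooking", 500),
]
conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", rows)

# Higher-level generalization: number of pages and average length per topic.
topic_summary = conn.execute(
    "SELECT topic, COUNT(*), AVG(word_count) FROM pages GROUP BY topic ORDER BY topic"
).fetchall()
conn.close()
print(topic_summary)
```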
• Applications of web content mining:
❖ Document clustering or categorization
❖ Topic identification/tracking
❖ Concept discovery
❖ Focused crawling
❖ Content-based personalization
❖ Intelligent search tools
2. WEB USAGE MINING
• Extracting interesting patterns from user interactions
with resources on one or more websites.
• Web usage mining is used for mining the web log
records (access information of web pages) and helps
to discover the user access patterns of web pages.
• The web server records a web log entry for every web
page request.
• Analysis of similarities in web log records can be
useful to identify the potential customers for e-
commerce companies.
• Goal: analyze the behavioral patterns and profiles of users
interacting with a Web site.
• The discovered patterns are usually represented as
collections of pages, objects, or resources that are frequently
accessed by groups of users with common interests.
• Data in Web Usage Mining:
a. Web server logs
b. Site contents
c. Data about the visitors, gathered from external
channels
Data Preparation
• Data cleaning
– Irrelevant entries are removed by checking the suffix of the URL name;
for example, all log entries with filename suffixes such as gif, jpeg, etc.
can be removed
• User identification
– If a page is requested that is not directly linked to the previous
pages, multiple users are assumed to exist on the same machine
– Other heuristics involve using a combination of IP address, machine
name, browser agent, and temporal information to identify users
• Transaction identification
– A transaction groups the page references made by a user during a
single visit to a site
– Size of a transaction can range from a single page reference to all of
the page references in a user session
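The three preparation steps above can be sketched as follows. The log format (dictionaries with `ip`, `agent`, `time`, `url`), the list of ignored suffixes, and the 30-minute session timeout are assumptions for illustration; a real server log would need parsing first, and user identification here uses only the IP address plus browser agent heuristic.

```python
from collections import defaultdict

# Assumed suffix list and timeout; tune both for a real site.
IGNORED_SUFFIXES = (".gif", ".jpeg", ".jpg", ".png", ".css", ".js")
SESSION_TIMEOUT = 30 * 60  # seconds

def clean(entries):
    """Data cleaning: drop requests for images and other embedded files."""
    return [e for e in entries if not e["url"].lower().endswith(IGNORED_SUFFIXES)]

def identify_users(entries):
    """User identification: treat each (IP address, browser agent) pair as one user."""
    users = defaultdict(list)
    for e in entries:
        users[(e["ip"], e["agent"])].append(e)
    return users

def split_sessions(user_entries, timeout=SESSION_TIMEOUT):
    """Transaction identification: start a new visit after a long pause."""
    sessions = []
    last_time = None
    for e in sorted(user_entries, key=lambda entry: entry["time"]):
        if last_time is None or e["time"] - last_time > timeout:
            sessions.append([])
        sessions[-1].append(e["url"])
        last_time = e["time"]
    return sessions

# Hypothetical log entries (time is in seconds).
log = [
    {"ip": "10.0.0.1", "agent": "Firefox", "time": 0, "url": "/index.html"},
    {"ip": "10.0.0.1", "agent": "Firefox", "time": 5, "url": "/logo.gif"},
    {"ip": "10.0.0.1", "agent": "Firefox", "time": 60, "url": "/products.html"},
    {"ip": "10.0.0.1", "agent": "Chrome", "time": 70, "url": "/index.html"},
    {"ip": "10.0.0.1", "agent": "Firefox", "time": 4000, "url": "/index.html"},
]
transactions = {
    user: split_sessions(entries)
    for user, entries in identify_users(clean(log)).items()
}
print(transactions)
```

Note how the same IP address yields two users because the browser agents differ, and how the Firefox user's late request starts a second transaction.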
Pattern Discovery Tasks
• 1. Clustering and Classification
– Clustering of users helps to discover groups of users with
similar navigation patterns => provide personalized Web
content
– Clustering of pages helps to discover groups of pages having
related content => search engines
– E.g. clients who often access web miner software products
tend to be from educational institutions.
– Clients who placed an online order for software tend to be
students in the 20-25 age group who live in the United
States.
– 75% of clients who download software and visit between
7:00 and 11:00 pm on weekends are engineering students
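Clustering users by navigation pattern can be sketched with a simple greedy grouping on Jaccard similarity of the visited page sets. This is an assumed, deliberately minimal scheme (not k-means or any specific published algorithm); the user names, pages, and the 0.5 threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_users(visits, threshold=0.5):
    """Put each user into the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative page set, member names)
    for user, pages in visits.items():
        for rep_pages, members in clusters:
            if jaccard(pages, rep_pages) >= threshold:
                members.append(user)
                break
        else:
            # No similar cluster found: this user starts a new one.
            clusters.append((set(pages), [user]))
    return [members for _, members in clusters]

# Hypothetical sets of pages visited by each user.
visits = {
    "u1": {"/software", "/download", "/docs"},
    "u2": {"/software", "/download"},
    "u3": {"/news", "/sports"},
    "u4": {"/news", "/sports", "/weather"},
}
groups = cluster_users(visits)
print(groups)
```

Each resulting group could then be served personalized content based on the pages its members have in common.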
Pattern Discovery Tasks
• 2. Sequential Patterns:
✔ Extract frequently occurring inter-session patterns in which
a set of items is followed by another item in
time order
✔ Used to predict future user visit patterns => placing ads or
recommendations
• 3. Association Rules:
✔ Discover correlations among pages accessed together by a
client
✔ Help restructure the Web site
✔ Develop e-commerce marketing strategies (e.g. Grocery Mart)
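Association rules between pages are usually judged by support (how often the pages occur together) and confidence (how often the right-hand page is visited given the left-hand page). The sketch below computes both for page pairs only; the sessions and the two thresholds are made-up values, and full algorithms such as Apriori handle larger itemsets.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) rules over pairs of pages."""
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))  # count each page once per session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Hypothetical user sessions.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
rules = pair_rules(sessions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as `/cart -> /products` with high confidence suggests the two pages should be closely linked when the site is restructured.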
Pattern Analysis Tasks
• Pattern Analysis is the final stage of Web Usage Mining, which
involves the validation and interpretation of the mined
patterns
• Validation:
– to eliminate the irrelevant rules or patterns and to extract
the interesting rules or patterns from the output of the
pattern discovery process
• Interpretation:
– the output of mining algorithms is mainly in mathematical
form and not suitable for direct human interpretation
• Applications of web usage mining:
• User and customer behavior modeling
• Website optimization
• E-customer relationship management
• Web marketing
• Targeted advertising
• Recommender systems
3. WEB STRUCTURE MINING
• Web structure mining is the process of discovering structure
information from the web.
• The structure of a typical web graph consists of Web pages as
nodes, and hyperlinks as edges connecting
related pages.
• Web structure mining can be used to discover the link
structure of hyperlinks.
• This type of mining can be performed either at the document
level (intra-page) or at the hyperlink level (inter-page).
• That is, it either mines the document structure or extracts
patterns from the hyperlinks in the web.
• The research at the hyperlink level is called Hyperlink
analysis.
• Hyperlink structure can be used to retrieve useful
information on the web.
• It is used to identify whether web pages are linked by
related information or by a direct link connection.
• The purpose of structure mining is to produce a
structural summary of a website and of similar web
pages.
• Example: Web structure mining can be very useful to
companies to determine the connection between
two commercial websites.

• There are two main algorithms used in web structure mining:
● PageRank (used by Google to rank web pages)
● Hubs and Authorities - HITS
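As an illustration of the first algorithm, here is a minimal power-iteration sketch of PageRank on a tiny web graph. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the example graph are all assumptions for this sketch; it is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a graph given as {page: [linked pages]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base score (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical web graph: pages as nodes, hyperlinks as edges.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C is linked from three other pages and ends up with the highest score, while D, which nothing links to, keeps only the base score.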
Assignment topics

Explain in detail the two popular algorithms used in web structure
mining:
(1) PageRank (used by Google to rank web pages)
(2) Hubs and Authorities - HITS
Data Mining vs Web Mining

• Data Mining: the process of identifying significant patterns
in data that lead to better outcomes.
• Web Mining: the process of performing data mining on
the web, that is, extracting web documents and discovering
patterns from them.
• Applications of web structure mining:
• Document retrieval & ranking (eg., Google)
• Discovery of “hubs” and “authorities”
• Discovery of web communities
• Social network analysis
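The discovery of hubs and authorities can be sketched with the basic HITS iteration: a good authority is pointed to by good hubs, and a good hub points to good authorities. The graph, page names, and fixed iteration count below are assumptions for illustration, and the sketch runs on the whole graph rather than on a query-specific subgraph.

```python
import math

def normalize(scores):
    """Scale scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

def hits(links, iterations=50):
    """Return (hub, authority) scores for a graph given as {page: [linked pages]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        auth = normalize(auth)
        # Hub score: sum of authority scores of the pages linked to.
        hub = normalize({p: sum(auth[t] for t in links.get(p, [])) for p in pages})
    return hub, auth

# Hypothetical graph: two link-collecting pages pointing at three sites.
links = {"portal": ["site1", "site2", "site3"], "blog": ["site1", "site2"]}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("authorities:", {p: round(s, 3) for p, s in auth.items() if s > 0})
```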
Web Data Mining Process
• Issues with web mining
• Web data sets can be very large
⮚ Tens to hundreds of terabytes
• Cannot mine on a single server
⮚ Need large farms of servers
● Proper organization of hardware and software to mine multi-
terabyte data sets
● Difficulty in finding relevant information
● Extracting new knowledge from the web
Web Mining Taxonomy
THANK YOU
