
Web Mining

 Introduction to Web Mining


 Web data nature
 Knowledge discovery from web data
 Web Structure Mining
 Web Content Mining
 Web Usage Mining
Web Mining – The Idea
• In recent years the growth of the World Wide
Web has exceeded all expectations.
• The Web is the single largest data source in the world.
– There are several billion HTML documents,
pictures and other multimedia files available via
the Internet, and the number is still rising.
• But given the impressive variety of the
web, retrieving interesting content has become a
very difficult task.
– This is due to the heterogeneity and lack of structure of
web data.
Web Mining
• The new sales window
– Markets are moving towards business virtualization.
– Many companies have begun to use the Web as a new
channel with massive reach and worldwide coverage.
– How can we improve our web site?
• By understanding what describes user behavior.

• OK, but how can we do it?
– By analyzing user browsing behavior.
– By analyzing web page preferences.
– By analyzing user profiles.
– In fact, "by extracting knowledge from web data", i.e.,
by applying Web mining techniques.
Size of the Web
• Determining the size of the WWW is extremely
difficult.
– The size grows at the rate of about 1 million pages a day
• Number of pages
– Technically, infinite number of pages because of
dynamically generated content
– Much duplication (accounts for around 30-40%)
• Best estimate of HTML pages comes from search
engine claims
– By 2012, Google had indexed over 30 trillion web pages
and was serving 100 billion queries per month.
– Yahoo also claimed that their index contained 20 billion
pages.
The web as a graph
• The Web can be naturally
modeled as a directed graph,
consisting of:
– a set of abstract nodes (the web
pages) joined by directed edges (the
hyperlinks).
• The hyperlinks into a page are referred to as
its in-links (their count is the page's in-degree),
and those out of a page as its out-links
(their count is its out-degree).
– There is high linkage among web
pages.
• A page has 10-20 in-links on average.
• Power-law degree distribution:
the number of links into a web page
follows a power-law distribution, in
which the total number of web pages
with in-degree k is proportional to
1/k^α, where α ≈ 2.1.
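The power-law claim can be checked empirically by tabulating an in-degree distribution. A minimal sketch over a hypothetical hyperlink edge list:

```python
from collections import Counter

# Hypothetical hyperlink edge list: (source_page, target_page).
edges = [("a", "c"), ("b", "c"), ("d", "c"),
         ("a", "b"), ("c", "b"), ("d", "b"), ("b", "d")]

# In-degree of each page, then: how many pages have in-degree k?
in_degree = Counter(target for _, target in edges)
degree_histogram = Counter(in_degree.values())

for k in sorted(degree_histogram):
    print(f"pages with in-degree {k}: {degree_histogram[k]}")
# On a real crawl, plotting k vs. count on log-log axes should give
# roughly a straight line with slope -alpha (about -2.1 for in-links).
```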
Web Graph

What can the graph tell us?
• Distinguish “important” pages from unimportant ones
– Page rank: The information provided by the Web-graph is for
instance at the basis of link analysis algorithms for ranking Web
documents, PageRank.
• Discover communities of related pages
– Hubs and Authorities: Some pages, the most prominent sources of
primary content, are the authorities on the topic.
• Other pages, equally intrinsic to the structure, assemble high-
quality guides and resource lists that act as focused hubs,
directing users to recommended authorities.
• Detect web spam
– TrustRank: a technique for separating useful web pages
from spam; the algorithm was created to combat web spam.
• It computes trust scores over the web graph: good sites
receive relatively high trust scores, while spam sites
receive low ones.
Is Web only about Data?
• Data and Information:
– The coverage of Web information is very wide & diverse.
One can find information about almost anything.
– Information/data of almost all types exist on the Web; e.g.,
structured tables, texts, multimedia data, etc.
• The Web is also about services:
– Many Web sites & pages enable people to perform
operations with input parameters, i.e., they provide services.
• Above all, the Web is a virtual society:
– It is not only about data, information &
services, but also about social network
(interactions among people, organizations
& automatic systems, i.e., communities).
– Top Popular Social Networking Sites (May
2012): Facebook, Twitter, LinkedIn,
MySpace, Google Plus+, DeviantArt,
LiveJournal, Tagged, Orkut, CafeMom,
Ning, myLife, etc.
Opportunities and Challenges
• Web offers an unprecedented opportunity & challenges
to DM
− The amount of information on the Web is huge & easily
accessible.
− The coverage of Web information is very wide & diverse.
▪ One can find information about almost anything.
− Information/data of almost all types exist on the Web,
▪ e.g., structured tables, texts, multimedia data, etc.
− Much of the Web information is semi-structured due to
the nested structure of HTML code.
− Much of the Web information is linked.
▪ There are hyperlinks among pages within a site, & across
different sites.
Opportunities and Challenges …
− Much of the Web information is redundant.
▪ The same piece of information or its variants may appear in
many pages.
− The Web is noisy.
▪ A Web page typically contains a mixture of many kinds of
information, e.g., main contents, advertisements, navigation
panels, copyright notices, etc.
− The Web is dynamic.
▪ Information on the Web changes constantly. Keeping up with the
changes & monitoring the changes are important issues.
− The Web is also about services.
▪ Many Web sites & pages enable people to perform operations
with input parameters, i.e., they provide services.
− Above all, the Web is a virtual society.
▪ It is not only about data, information & services, but also about
interactions among people, organizations & automatic systems,
i.e., communities.
Web Mining: Definition
• The term was coined by Oren Etzioni (1996)
• Web mining is the application of data mining
techniques to automatically discover and extract useful
information or patterns from Web data.
– It is the process of discovering useful information from the
World-Wide Web and its usage patterns.
– It is the extraction of interesting and potentially useful
patterns and implicit information from artifacts or activity
related to the World-Wide Web.

• Web mining refers to the overall process of discovering
potentially useful and previously unknown information or
knowledge from Web data.
Web Mining: Not IR
• Information retrieval (IR) is the area of study
concerned with searching for all relevant documents
while retrieving as few non-relevant documents as
possible, in order to satisfy users' information needs.

• Web mining discovers hidden useful information
that helps to classify web documents.
– Web document classification, which is a Web Mining
task, could be part of an IR system (e.g. indexing for
a search engine).
Web Mining: Not IE
• The goal of Information extraction (IE) is to
automatically extract structured information from
unstructured and/or semi-structured documents.
– Information extraction (IE) aims to extract the relevant facts
from given documents
– IE systems for the general Web are not feasible
– Most focus on specific Web sites or content

• Web content mining is an automatic process that goes
beyond keyword extraction, or some simple statistics
of words and phrases in documents.
Data Mining vs. Web Mining
• Data mining
– data is structured and relational
– well-defined tables, columns, rows, keys, and constraints.
• Web data
– Semi-structured collection of data
– readily available data
– rich in features and patterns
– Structure
• Textual information and linkage structure
– Scale
• Data generated per day is comparable to largest conventional
data warehouses
• Google’s usage logs are much bigger than their web crawl!
– Order of magnitude: terabytes per day
– Speed
• Often need to react to evolving usage patterns in real-time
(e.g., merchandising)
Data Mining: Going deeper
• Sequence & association rules discovery
– Prediction of the next event
– Discovery of associated events or application objects
• Clustering
– Discovery of visitor groups with common properties and interests
– Discovery of visitor groups with common behaviour
• Classification
– Characterization of visitors with respect to a set of predefined classes
– Card fraud detection

Web Data
• Web Structure
– The actual linkage structure between web pages
– The HTML or XML code of the page
• Web Usage
– Usage data (log data) that describe how web pages
are accessed by visitors
– User profiles: include demographic and
registration information
– User sessions
• Web Content
– Content of actual web pages: text, audio, image
& video
Web Mining: Subtasks
• Resource finding (web data collection)
✓ Retrieving intended documents

• Information selection/pre-processing
✓ Select and pre-process specific information from selected
documents

• Generalization
✓ Discover general patterns within and across web sites

• Analysis
✓ Validation and/or interpretation of mined patterns
Web Mining Architecture
Web Mining Taxonomy
Web Mining is divided into three areas: Web Content Mining,
Web Structure Mining and Web Usage Mining.

• Web Content Mining
✓ Discovering useful information from the content of web data &
documents.
• Web Structure Mining
✓ Discovering the model underlying link structures (topology) on
the Web, e.g. discovering authorities and hubs.
• Web Usage Mining
✓ Making sense of data generated by surfers, such as usage data
from logs, user profiles, user sessions, cookies, user queries,
bookmarks, mouse clicks and scrolls, etc.
Web Content Mining
• It is the process of information or resource discovery from
content of millions of sources across the World-Wide
Web.
– E.g. Web data contents: multimedia, metadata and
hyperlinks
• Unstructured – free text
• Semi-structured – HTML
• More structured – Table or Database generated HTML pages
• Multimedia data (such as image, audio, video, text, …) –
receive less attention than text or hypertext
• Web content mining is the process of extracting
knowledge from the content of documents or their
descriptions.
– Web document text mining or resource discovery
based on concept indexing may also fall into this
category.
Web Content Mining
• Web content mining is related to data mining
and text mining.
– Do you agree? Why?

– It is related to data mining because many data


mining techniques can be applied in Web content
mining.
– It is related to text mining because much of the
web contents are texts.
– Web data are mainly semi-structured and/or
unstructured, whereas data mining assumes structured
data and text mining assumes unstructured text.
Web Content Mining Strategies
• There are two groups of web
content mining strategies:
– The content of web pages:
Those that directly mine
the content of documents
– The result of web
searching: those that
improve on the content
search of other tools like
search engines.
• Techniques for Web Content Mining
– Classifications
– Clustering
– Association rule discovery
Document Classification
• Supervised Learning
–Supervised learning is a 'machine learning' technique for
creating a function from training data.
–Documents are categorized
–The output can predict a class label of the input object (called
classification).
• Techniques used are
–Feature Selection
• Removes terms in the training documents which are
statistically uncorrelated with the class labels
• Simple heuristics
– Stop words like “a”, “an”, “the” etc.
– Empirically chosen thresholds for ignoring “too frequent” or “too rare”
terms
–Classification tools used : Nearest Neighbor Classifier, Decision
Tree & Neural Networks
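The feature-selection and nearest-neighbor ideas above can be sketched in a few lines. A minimal, illustrative example (the stop-word list, training documents and labels are all hypothetical):

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}  # toy stop list

def features(text):
    # Bag of words after the simple stop-word heuristic from the slide.
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def nearest_neighbor(query, training):
    # 1-NN: return the label of the most similar training document.
    q = features(query)
    return max(training, key=lambda doc: cosine(q, features(doc[0])))[1]

training = [  # hypothetical labelled documents
    ("beer wine cider brandy", "drinks"),
    ("camera lens tripod flash", "photography"),
]
print(nearest_neighbor("the best wine and brandy", training))
```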
Document Clustering
• Unsupervised Learning : a data set of input objects is
gathered
• Goal : Evolve measures of similarity to cluster a collection of
documents/terms into groups within which similarity of
members of a cluster is larger than across clusters.
• Hypothesis : Given a `suitable' clustering of a collection, if
the user is interested in a document (or term), they are
likely to be interested in other members of the cluster to
which it belongs.
• Hierarchical clustering
– Bottom-Up
– Top-Down
• Partitional clustering
– K- means
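The K-means flavour of partitional clustering can be sketched as follows. This is a teaching sketch, not a production clusterer: the sample points are hypothetical two-feature document vectors, and seeding centroids with the first k points is a deliberate simplification.

```python
def kmeans(points, k, iters=20):
    # Partitional clustering sketch: assign each point to its nearest
    # centroid, then recompute each centroid as its cluster's mean.
    # (Simplification: the first k points seed the centroids.)
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2
                                  + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# Two well-separated groups of hypothetical two-feature vectors.
points = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (4.8, 5.0)]
clusters = kmeans(points, 2)
print(clusters)
```

Similarity of members within each returned cluster is larger than across clusters, matching the goal stated above.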
Semi-Supervised Learning
• A collection of documents is available, out of which a
subset of the collection has known labels.
• Goal: to label the rest of the collection.
• Approach
– Train a supervised learner using the labeled subset.
– Apply the trained learner on the remaining
documents.
• Idea
– Harness information in the labeled subset to enable
better learning.
– Also, check the collection for emergence of new
topics.
Web Structure Mining
• Interested in mining the structure (links, graph) between
Web documents (not within a document).
– Web structure mining is the process of inferring knowledge from the
World Wide Web organization and links between references and
referents in the Web.
• WWW can reveal more information than just the information
contained in documents. Web structure mining generates
structural summary about the Web site & Web page.
–Web pages categorization depending upon the hyperlink
–Discovering the Web Page Structure.
–Discovering the nature of the hierarchy of hyperlinks in the
website.
–Measuring the “completeness” of a Web site
• Finding Information about web pages
–Retrieving information about the relevance and the quality of the
web page.
–Finding the authoritative pages on a topic and their content
Web Structure Mining: Inference on Hyperlink
• A web page contains not only information but also
hyperlinks, which carry a huge amount of annotation.
• Hyperlink identifies author’s endorsement of the other
web page.
– links pointing to a document indicate the popularity of
the document, while links coming out of a document
indicate the richness or perhaps the variety of topics
covered in the document.
– This can be compared to bibliographical citations.
When a paper is cited often, it ought to be important.
• The PageRank and CLEVER methods take advantage of this
information conveyed by the links to find pertinent web
pages.
PageRank
• Used by Google to prioritize pages returned from a search
by looking at the Web structure.
• The importance of a page is calculated based on the number
of pages which point to it – its backlinks.
• Weighting is used to give more importance to
backlinks coming from important pages.

• Given a page p, its PageRank is defined as:

PR(p) = c (PR(p1)/N1 + … + PR(pn)/Nn)

– PR(pi): PageRank of a page pi which points to the target page p
– Ni: number of links coming out of page pi
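The definition above can be turned into a small power-iteration sketch. Note that this common variant adds a damping term (1 − c)/N so the scores form a probability distribution; the slide's formula is the undamped core of the update. The toy graph is hypothetical.

```python
def pagerank(links, c=0.85, iters=50):
    # links: page -> list of pages it points to (a hypothetical toy web).
    n = len(links)
    pr = {p: 1.0 / n for p in links}
    for _ in range(iters):
        # (1 - c)/n is the damping term; the inner sum realizes the
        # slide's c * (PR(p1)/N1 + ... + PR(pn)/Nn) over pages linking to p.
        new = {p: (1.0 - c) / n for p in links}
        for i, outs in links.items():
            for p in outs:
                new[p] += c * pr[i] / len(outs)
        pr = new
    return pr

toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
print(max(ranks, key=ranks.get))  # "c": it collects links from both a and b
```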
CLEVER
• A search can be viewed as having a goal of
finding the best pages that are hubs and
authorities in a given topic
– CLEVER attempts to identify authoritative and hub
pages.
• Authoritative Pages :
– Highly important pages.
– Best source for requested information.
• Hub Pages :
– Contain links to highly important pages.
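The hub/authority idea behind CLEVER comes from the HITS algorithm: authority and hub scores reinforce each other iteratively. A minimal sketch (the toy link data is hypothetical, and this is the generic HITS iteration, not CLEVER's full method):

```python
def hits(links, iters=30):
    # links: page -> list of pages it points to (hypothetical toy data).
    pages = set(links) | {t for outs in links.values() for t in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority score: sum of hub scores of the pages linking to it.
        auth = {p: sum(hub[i] for i, outs in links.items() if p in outs)
                for p in pages}
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        for d in (auth, hub):
            norm = max(d.values()) or 1.0
            for p in d:
                d[p] /= norm
    return hub, auth

links = {"guide": ["siteA", "siteB"], "other": ["siteA"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

"guide" ends up as the best hub (a resource list pointing at authorities), while "siteA" ends up as the best authority (the most linked-to source).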
Web Usage Mining
• What is usage mining? Web usage mining tries to predict
user behavior from interaction with the Web.
– It attempts to discover user 'navigation patterns' from web
data.
– Predicts user behavior while the user interacts with the web.
– Helps to restructure a website & improve a large collection of
resources.
– Present dynamic information to users based on their interests and
profiles
• Web usage mining, also known as Web Log Mining, is the
process of extracting interesting patterns in web access logs.
• Typical problems: Distinguishing among unique users,
server sessions, episodes, etc in the presence of caching
and proxy servers.
Web Usage Mining
• Web servers record and accumulate data about user
interactions whenever requests for resources are received.
Analyzing the web access logs of different web sites can help
understand the user behavior and the web structure,
thereby improving the design of this colossal collection of
resources.
• While it is encouraging and exciting to see the various
potential applications of web log file analysis, it is important
to know that the success of such applications depends on
what and how much valid and reliable knowledge one can
discover from the large raw log data.
• For effective web usage mining, an important cleaning
and data transformation step may be needed before
analysis.
Two main Web Usage Mining strategies
– The general access pattern tracking
analyzes the web logs to
understand access patterns and
trends. These analyses can shed
light on better structure & grouping
of resource providers. Many web analysis tools exist,
but they are limited and usually unsatisfactory.
• Applying data mining techniques on access logs unveils interesting
access patterns that can be used to restructure sites in a more
efficient grouping, pinpoint effective advertising locations, and
target specific users for specific selling ads.
– Customized usage tracking (personalization) analyzes individual
trends. Its purpose is to customize web sites to users. The
information displayed, the depth of the site structure and the
format of the resources can all be dynamically customized for each
user over time based on their access patterns.
Personalization
• Web access or contents tuned to better fit
the desires of each user.
• Manual techniques identify user’s
preferences based on profiles or
demographics.
• Collaborative filtering identifies preferences
based on ratings from similar users.
• Content based filtering retrieves pages based
on similarity between pages and user
profiles.
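Collaborative filtering can be illustrated with a deliberately tiny sketch: pick the most similar user and recommend pages they rated that the target user has not seen. The user names, pages, and the "count of co-rated pages" similarity are all illustrative simplifications (real systems use rating correlations over many users).

```python
def recommend(ratings, user):
    # ratings: user -> {page: rating}. Similarity here is simply the
    # number of co-rated pages (a deliberate simplification).
    def overlap(u, v):
        return len(ratings[u].keys() & ratings[v].keys())
    peer = max((v for v in ratings if v != user),
               key=lambda v: overlap(user, v))
    # Suggest the peer's pages that the user has not rated yet.
    return sorted(set(ratings[peer]) - set(ratings[user]))

ratings = {  # hypothetical page ratings
    "alice": {"/home": 5, "/products": 4},
    "bob": {"/home": 4, "/products": 5, "/support": 3},
    "carol": {"/blog": 2},
}
print(recommend(ratings, "alice"))  # bob is alice's closest peer
```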
Web usage mining process
Web Usage Mining Activities
• Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
• Pattern Discovery (navigational patterns and sequence
patterns)
– Count patterns that occur in sessions
– A pattern is a sequence of page references in a session.
– Similar to association rules
• Transaction: session
• Itemset: pattern (or subset)
• Order is important
• Pattern Analysis
– Interpretation of the discovered patterns
Data in Web Usage Mining:
• Wide range of web usage data (logs): The record of what
actions a user takes with his mouse and keyboard while
visiting a site.
– Web client data: Client-side cookies
– Proxy server data
– Web server logs: Server access logs
– Site contents
– Search engine logs
– Agent logs
– Database logs
– User profiles: Data about the visitors, gathered from external
channels
– Further application data
• A large part of Web usage mining is about processing usage/
clickstream data.
– After that various data mining algorithm can be applied.
Transfer / Access Log
• The transfer/access log contains detailed information about
each request that the server receives from user’s web
browsers.

The recorded fields typically include:
– Time and date of the request
– Hostname of the client
– File requested
– Amount of data transferred
– Status of the request
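An access-log line in the Common Log Format can be pulled apart with a single regular expression; a sketch (the sample line is made up):

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

line = ('192.168.1.10 - - [21/Jun/2005:17:30:05 +0000] '
        '"GET /company/products HTTP/1.1" 200 2326')
m = CLF.match(line)
host, ident, authuser, timestamp, request, status, size = m.groups()
print(host, request.split()[1], status, size)
```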
Error Log
• The error log keeps a record of errors and failed requests.
• A request may fail if the page contains links to a file that does
not exist or if the user is not authorized to access a specific
page or file.
Data Preprocessing
•Data cleaning
–remove irrelevant references and fields in server logs
–Clean/Filter raw data to eliminate redundancy
–remove erroneous references
–add missing references due to caching (done after sessionization)
•Data integration
–synchronize data from multiple server logs
–Integrate semantics, e.g.,
• meta-data (e.g., content labels)
• e-commerce and application server data
–integrate demographic / registration data
•Data Transformation
–user identification; sessionization / episode identification
–pageview identification
• a pageview is a set of page files and associated objects that contribute to a
single display in a Web Browser
•Data Reduction
–sampling and dimensionality reduction (ignoring certain pageviews / items)
•Identifying User Transactions (i.e., sets or sequences of pageviews
possibly with associated weights)
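The sessionization step listed above is commonly done with a timeout heuristic; a minimal sketch (the 30-minute timeout and the sample log entries are assumptions):

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout_minutes=30):
    # requests: (user, timestamp, page) tuples in time order per user.
    # A gap longer than the timeout starts a new session -- a common
    # heuristic, since raw server logs carry no session boundaries.
    sessions, last_seen = {}, {}
    for user, ts, page in requests:
        within = (user in last_seen and
                  ts - last_seen[user] <= timedelta(minutes=timeout_minutes))
        if within:
            sessions[user][-1].append(page)
        else:
            sessions.setdefault(user, []).append([page])
        last_seen[user] = ts
    return sessions

log = [  # hypothetical cleaned log entries
    ("u1", datetime(2005, 6, 21, 17, 30), "/company"),
    ("u1", datetime(2005, 6, 21, 17, 35), "/company/products"),
    ("u1", datetime(2005, 6, 21, 19, 0), "/company/product1"),
]
print(sessionize(log)["u1"])  # the 19:00 request starts a second session
```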
Web-Usage Mining cont…
• Data Mining Techniques – Navigation Patterns
[Figure: web page hierarchy of a web site]
Example:
• 70% of users who accessed /company/product2 did so by
starting at /company and proceeding through /company/new,
/company/products and /company/product1
✓ 80% of users who accessed the site started from
/company/products
✓ 65% of users left the site after four or fewer page
references
Web-Usage Mining cont…
• Data Mining Techniques – Sequential Patterns

Example: Supermarket

Customer | Transaction Time   | Purchased Items
John     | 6/21/05 5:30 pm    | Beer
John     | 6/22/05 10:20 pm   | Brandy
Frank    | 6/20/05 10:15 am   | Juice, Coke
Frank    | 6/20/05 11:50 am   | Beer
Frank    | 6/20/05 12:50 pm   | Wine, Cider
Mary     | 6/20/05 2:30 pm    | Beer
Mary     | 6/21/05 6:17 pm    | Wine, Cider
Mary     | 6/22/05 5:05 pm    | Brandy
Web-Usage Mining cont…
• Data Mining Techniques – Sequential Patterns

Example: Supermarket Customer Sequences

Customer | Customer Sequence
John     | (Beer) (Brandy)
Frank    | (Juice, Coke) (Beer) (Wine, Cider)
Mary     | (Beer) (Wine, Cider) (Brandy)

Mining Result

Sequential Pattern with Support >= 40% | Supporting Customers
(Beer) (Brandy)                        | John, Mary
(Beer) (Wine, Cider)                   | Frank, Mary
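The support counts in the supermarket example can be reproduced by checking, for each customer sequence, whether the pattern occurs as an in-order subsequence of that customer's transactions; a sketch:

```python
def is_subsequence(pattern, sequence):
    # Each pattern itemset must be contained in some later transaction,
    # in order -- the shared iterator enforces the ordering.
    it = iter(sequence)
    return all(any(p <= t for t in it) for p in pattern)

def support(pattern, customer_sequences):
    matches = [c for c, seq in customer_sequences.items()
               if is_subsequence(pattern, seq)]
    return matches, len(matches) / len(customer_sequences)

sequences = {  # the customer sequences from the supermarket example
    "John": [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary": [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}
matches, sup = support([{"Beer"}, {"Brandy"}], sequences)
print(sorted(matches), round(sup, 2))  # 2 of 3 customers, support >= 40%
```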
Web-Usage Mining cont…
• Data Mining Techniques – Sequential Patterns

Web usage examples


◼ Within the past week, 30% of users who visited
/company/product/ had searched Google with 'camera'
in the query text.
◼ 60% of users who placed an online order in
/company/product1 also placed an order in
/company/product4 within 15 days
Public Software
• STstat (ST Software) – Reports and statistics. A set of CGI scripts
(written in C) that produce HTML reports based on the access logs the
HTTP server keeps; suitable for almost all HTTP server software
(Unix & Windows), supporting three log formats (Common, Extended and IIS).
• weblog_parse (ACME Labs Software) – Log-file processing. Extracts
specified fields from a web log file: reads a web server log in either
"Common Logfile Format" or "Combined Logfile Format", parses it, and
writes out only the user-specified fields, separated by tabs for easier
handling.
• WebLog (Darryl C. Burgdorf) – Log-file analysis tool. A comprehensive
access log analysis tool that lets you track activity on your site by
month, week, day and hour, monitor total hits, bytes transferred and
page views, and keep track of your most popular pages.
• Analog (University of Cambridge Statistical Laboratory) – Log-file
analyzer. A program to analyse the log files from your web server. It
tells you which pages are most popular, which countries people are
visiting from, which sites they tried to follow broken links from, etc.
References
• Mining the Web: Discovering Knowledge from Hypertext Data,
by Soumen Chakrabarti (Morgan Kaufmann Publishers)
• Web Mining: Accomplishments & Future Directions,
by Jaideep Srivastava
• The World Wide Web: Quagmire or Goldmine?, by Oren Etzioni
• http://www.galeas.de/webmining.html
