6 WebMining
What can the graph tell us?
• Distinguish “important” pages from unimportant ones
– PageRank: The link structure of the Web graph is the basis of
link analysis algorithms for ranking Web documents, such as
PageRank.
• Discover communities of related pages
– Hubs and Authorities: Some pages, the most prominent sources of
primary content, are the authorities on a topic.
– Other pages, equally intrinsic to the structure, assemble
high-quality guides and resource lists that act as focused hubs,
directing users to recommended authorities.
• Detect web spam
– TrustRank: A technique created to combat Web spam by separating
useful Web pages from spam pages.
– It computes trust scores over the Web graph. Good sites receive
relatively high trust scores, while spam sites receive low trust
scores.
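The three link-analysis ideas above can be sketched on a toy graph. This is a minimal illustration, not a reference implementation: the six page names, the damping factor 0.85, and the iteration count are assumptions chosen for the example. TrustRank is shown only as its propagation step, i.e. PageRank whose random jump lands on a hand-picked trusted seed page; the seed-selection stage of the published method is left out.

```python
# Hypothetical toy Web graph: page -> pages it links to.
links = {
    "guide":  ["paperA", "paperB"],   # a hub-like resource list
    "list":   ["paperA", "paperB"],   # another hub-like page
    "paperA": [],                     # primary content, no out-links
    "paperB": ["paperA"],
    "spam1":  ["spam2"],              # two spam pages linking to each other
    "spam2":  ["spam1"],
}


def all_pages(links):
    """Every page that appears as a link source or a link target."""
    return sorted(set(links) | {q for outs in links.values() for q in outs})


def pagerank(links, teleport=None, damping=0.85, iterations=50):
    """Power iteration; `teleport` is the random-jump distribution."""
    pages = all_pages(links)
    if teleport is None:
        # Uniform jump gives ordinary PageRank.
        teleport = {p: 1.0 / len(pages) for p in pages}
    rank = {p: teleport.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) * teleport.get(p, 0.0) for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
            else:
                # Dangling page: hand its score back via the jump distribution.
                for q in pages:
                    new[q] += damping * rank[p] * teleport.get(q, 0.0)
        rank = new
    return rank


def hits(links, iterations=50):
    """Hub and authority scores by repeated mutual reinforcement."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in links.get(p, []):
                auth[q] += hub[p]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        for scores in (hub, auth):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


rank = pagerank(links)                             # importance
trust = pagerank(links, teleport={"guide": 1.0})   # trust from one seed page
hub, auth = hits(links)                            # hubs and authorities
```

In this toy graph the spam pages keep a positive PageRank because the uniform jump reaches them, but they receive no trust, since no path leads to them from the trusted seed.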
Is the Web only about Data?
• Data and Information:
– The coverage of Web information is very wide & diverse.
One can find information about almost anything.
– Information/data of almost all types exist on the Web; e.g.,
structured tables, texts, multimedia data, etc.
• The Web is also about services:
– Many Web sites & pages enable people to perform
operations with input parameters, i.e., they provide services.
• Above all, the Web is a virtual society:
– It is not only about data, information &
services, but also about social networks
(interactions among people, organizations
& automatic systems, i.e., communities).
– Popular social networking sites (May
2012): Facebook, Twitter, LinkedIn,
MySpace, Google+, DeviantArt,
LiveJournal, Tagged, Orkut, CafeMom,
Ning, myLife, etc.
Opportunities and Challenges
• The Web offers unprecedented opportunities & challenges
for data mining (DM)
− The amount of information on the Web is huge & easily
accessible.
− The coverage of Web information is very wide & diverse.
▪ One can find information about almost anything.
− Information/data of almost all types exist on the Web,
▪ e.g., structured tables, texts, multimedia data, etc.
− Much of the Web information is semi-structured due to
the nested structure of HTML code.
− Much of the Web information is linked.
▪ There are hyperlinks among pages within a site, & across
different sites.
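The points about nested HTML and hyperlinks can be made concrete with a small sketch that walks a page's markup and separates links within the site from links to other sites. It uses only Python's standard `html.parser` and `urllib.parse` modules. The class name `LinkCollector`, the helper `split_links`, the sample markup, and the `site.example` and `other.example` URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute target of every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def split_links(html, base_url):
    """Return (links within the same site, links to other sites)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    site = urlparse(base_url).netloc
    internal = [u for u in collector.links if urlparse(u).netloc == site]
    external = [u for u in collector.links if urlparse(u).netloc != site]
    return internal, external


# Hypothetical page: links nested inside a list inside a div.
page = (
    '<div><ul>'
    '<li><a href="/about.html">About</a></li>'
    '<li><a href="http://other.example/x">Elsewhere</a></li>'
    '</ul></div>'
)
internal, external = split_links(page, "http://site.example/index.html")
```

Collecting these links over many pages is what yields a Web graph for link analysis.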
Opportunities and Challenges …
− Much of the Web information is redundant.
▪ The same piece of information or its variants may appear in
many pages.
− The Web is noisy.
▪ A Web page typically contains a mixture of many kinds of
information, e.g., main contents, advertisements, navigation
panels, copyright notices, etc.
− The Web is dynamic.
▪ Information on the Web changes constantly. Keeping up with the
changes & monitoring the changes are important issues.
− The Web is also about services.
▪ Many Web sites & pages enable people to perform operations
with input parameters, i.e., they provide services.
− Above all, the Web is a virtual society.
▪ It is not only about data, information & services, but also about
interactions among people, organizations & automatic systems,
i.e., communities.
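As an illustration of the redundancy point above, one common way to spot variants of the same text is to compare sets of overlapping word sequences (shingles) with the Jaccard coefficient. The slides do not prescribe this technique; it is offered here as an assumption, and the shingle length of 3 and the sample sentences are made up for the example.

```python
def shingles(text, k=3):
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(text_a, text_b, k=3):
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    return len(a & b) / len(a | b)


original = "the web is a huge and diverse source of information"
variant = "the web is a huge and diverse source of data"
unrelated = "beer and brandy were purchased by the same customer"

near_duplicate_score = jaccard(original, variant)
unrelated_score = jaccard(original, unrelated)
```

A score close to 1 suggests two pages carry the same information in slightly different wording, while a score near 0 suggests unrelated content.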
Web Mining: Definition
• The term was coined by Oren Etzioni (1996)
• Web mining is the application of data mining
techniques to automatically discover and extract useful
information or patterns from Web data.
– It is the process of discovering useful information from the
World-Wide Web and its usage patterns.
– It is the extraction of interesting and potentially useful
patterns and implicit information from artifacts or activity
related to the World-Wide Web.
• Information selection/pre-processing
✓ Select and pre-process specific information from selected
documents
• Generalization
✓ Discover general patterns within and across web sites
• Analysis
✓ Validation and/or interpretation of mined patterns
Web Mining Architecture
Web Mining Taxonomy
[Figure labels: Web Mining, CLIENT, SERVER]
Example: Supermarket
Customer   Transaction Time    Purchased Items
John       6/21/05  5:30 pm    Beer
John       6/22/05 10:20 pm    Brandy
Frank      6/20/05 10:15 am    Juice, Coke
Frank      6/20/05 11:50 am    Beer
Frank      6/20/05 12:50 pm    Wine, Cider
Mary       6/20/05  2:30 pm    Beer
Mary       6/21/05  6:17 pm    Wine, Cider
Mary       6/22/05  5:05 pm    Brandy
Web-Usage Mining cont…
• Data Mining Techniques – Sequential Patterns
Mining Result
Sequential Patterns with Support >= 40%    Supporting Customers
(Beer) (Brandy)                            John, Mary
(Beer) (Wine, Cider)                       Frank, Mary
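The mining result can be checked with a short sketch. A customer supports a pattern when the pattern's itemsets appear, in order, in distinct transactions of that customer's purchase sequence. With three customers, a 40% threshold means at least two supporters. The code below only verifies the two listed patterns and is not a full sequential-pattern miner. The function names are hypothetical, and each customer's transactions are taken in the order the supermarket table lists them.

```python
# Each customer's purchases as an ordered list of itemsets (table order).
sequences = {
    "John":  [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary":  [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}


def contains(sequence, pattern):
    """True if the pattern's itemsets occur in order in distinct transactions."""
    position = 0
    for itemset in pattern:
        # Advance to the next transaction that includes this itemset.
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # later itemsets must match strictly later transactions
    return True


def supporters(pattern):
    """Customers whose purchase sequence contains the pattern."""
    return sorted(c for c, seq in sequences.items() if contains(seq, pattern))


def support(pattern):
    """Fraction of all customers that support the pattern."""
    return len(supporters(pattern)) / len(sequences)


beer_then_brandy = [{"Beer"}, {"Brandy"}]
beer_then_wine_cider = [{"Beer"}, {"Wine", "Cider"}]
```

Both patterns are supported by two of the three customers, about 67%, which clears the 40% threshold.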
Web-Usage Mining cont…
• Data Mining Techniques – Sequential Patterns