DWDM Unit 4
Clustering
Clustering is the process of partitioning a set of data into meaningful, similar subclasses, each of which is called a cluster.
[Or]
Clustering is grouping a set of objects in such a way that objects of the same group are placed together, i.e., in cluster analysis we first partition the data into groups based on similarity.
Typical applications:
Marketing
Land use
Insurance
City-planning
A categorization of major clustering methods:
1. Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.
2. Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion.
Typical methods: BIRCH, ROCK, CHAMELEON.
3. Density-based approach:
Based on connectivity and density functions.
Typical methods: DBSCAN, OPTICS, DenCLUE.
4. Grid-based approach:
Based on a multiple-level granularity structure.
Typical methods: STING, WaveCluster.
5. Model-based methods:
A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model.
Typical methods: EM, COBWEB.
6. Constraint-based methods:
Clustering is performed by considering user-specified or application-specific constraints.
Partitioning methods:
1. k-means algorithm:
Step 1: Arbitrarily choose k objects as the initial cluster means (centers).
Step 2: Assign each object to the cluster whose mean is nearest.
Step 3: Recompute the mean of each cluster, and repeat Steps 2 and 3 until the assignments no longer change.
Step 4: The k-means method typically uses the square-error criterion function
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$
where $m_i$ is the mean of cluster $C_i$.
Example: cluster the data {2, 3, 4, 10, 11, 12, 20, 25, 30} with k = 2.
M1=4, M2=12 → C1={2,3,4}, C2={10,11,12,20,25,30}
M1=3, M2=18 → C1={2,3,4,10}, C2={11,12,20,25,30}
M1=4.75, M2=19.6 → C1={2,3,4,10,11,12}, C2={20,25,30}
M1=7, M2=25 → C1={2,3,4,10,11,12}, C2={20,25,30}
M1=7, M2=25; the clusters no longer change, so the algorithm terminates.
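The iterations above can be scripted directly. Below is a minimal 1-D k-means sketch in Python; the data set and initial means are taken from the example, and everything else is a plain implementation of the standard algorithm:

```python
# Minimal 1-D k-means, reproducing the worked example above.
import numpy as np

def kmeans_1d(data, means, max_iter=100):
    data = np.asarray(data, dtype=float)
    means = np.asarray(means, dtype=float)
    for _ in range(max_iter):
        # Assign each point to the nearest mean.
        labels = np.argmin(np.abs(data[:, None] - means[None, :]), axis=1)
        # Recompute each mean from its assigned points.
        new_means = np.array([data[labels == j].mean() for j in range(len(means))])
        if np.allclose(new_means, means):   # converged: means are stable
            break
        means = new_means
    return labels, means

labels, means = kmeans_1d([2, 3, 4, 10, 11, 12, 20, 25, 30], [4, 12])
print(means)   # -> [ 7. 25.], matching M1 = 7, M2 = 25 above
```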
2. K-Medoids algorithm:
The k-means algorithm is sensitive to outliers, because an object with an extremely large value may substantially distort the distribution of data.
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster.
Each remaining object is clustered with the representative object to which it is the most
similar.
[Figure: re-clustering before and after swapping a representative object; legend: dot = data object, + = cluster center, dashed = clustering before swapping.]
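To make the representative-object idea concrete, here is a minimal sketch of the PAM-style swap test: the total cost is the sum of distances from each object to its nearest medoid, and a swap of a medoid with a non-medoid is kept only if it lowers that cost. The data is reused from the k-means example, and the medoid choices are illustrative.

```python
# Total clustering cost under a set of medoids, plus one swap test.
import numpy as np

def total_cost(data, medoids):
    data = np.asarray(data, dtype=float)
    medoids = np.asarray(medoids, dtype=float)
    # Distance from every object to its closest medoid, summed.
    return np.min(np.abs(data[:, None] - medoids[None, :]), axis=1).sum()

data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
current = [4, 20]          # current representative objects (medoids)
candidate = [4, 25]        # try swapping medoid 20 for non-medoid 25
if total_cost(data, candidate) < total_cost(data, current):
    current = candidate    # keep the swap only if it reduces total cost
print(current, total_cost(data, current))
```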
CLARA (Clustering LARge Applications):
1. CLARA was developed to handle data sets that are too large for PAM.
2. The idea behind CLARA is that instead of taking the whole set of data into consideration, a small portion of the actual data is chosen as a sample.
3. CLARA draws multiple such samples of the data set.
4. CLARA applies PAM on each sample and returns its best clustering as the output.
5. CLARA can deal with larger data sets than PAM. The complexity of each iteration now becomes O(ks² + k(n − k)), where s is the sample size, k the number of clusters, and n the total number of objects.
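A minimal sketch of the CLARA loop just described: draw small random samples, search for good medoids on each sample (brute force here, standing in for PAM), and keep the medoids that score best on the full data set. It reuses total_cost from the sketch above; the sample sizes are illustrative.

```python
# CLARA-style sampling loop around a medoid search.
import numpy as np
from itertools import combinations

def best_medoids(sample, k):
    # Brute-force stand-in for PAM, fine for tiny samples.
    return min(combinations(sample, k),
               key=lambda m: total_cost(sample, m))

def clara(data, k=2, samples=5, sample_size=6, seed=0):
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(samples):
        sample = rng.choice(data, size=sample_size, replace=False)
        medoids = best_medoids(sample, k)
        cost = total_cost(data, medoids)   # evaluate on the WHOLE data set
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best

print(clara(np.array([2, 3, 4, 10, 11, 12, 20, 25, 30])))
```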
Hierarchical methods:
Agglomerative hierarchical clustering:
It starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster.
Divisive hierarchical clustering:
It does the reverse of agglomerative hierarchical clustering, by starting with all objects in one cluster and splitting it into smaller and smaller clusters.
Disadvantages:
1. Once a merge or split step is done, it can never be undone.
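A minimal sketch of agglomerative clustering using SciPy's hierarchical clustering routines: each object starts in its own cluster, the closest clusters are merged step by step, and cutting the resulting tree yields a chosen number of clusters. The 1-D data is reused from the earlier example.

```python
# Agglomerative clustering with SciPy: build the merge tree, then cut it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[2.0], [3.0], [4.0], [10.0], [11.0], [12.0],
              [20.0], [25.0], [30.0]])
Z = linkage(X, method="single")                  # merge history (dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```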
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
Given n d-dimensional data objects or points in a cluster, we can define the centroid $x_0$, radius R, and diameter D:
$$x_0 = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad R = \sqrt{\frac{\sum_{i=1}^{n}(x_i - x_0)^2}{n}}, \qquad D = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - x_j)^2}{n(n-1)}}$$
A clustering feature (CF) summarizes a cluster as the triple CF = (n, LS, SS), where n is the number of points, LS is the linear sum of the points, and SS is the square sum of the points.
Example: suppose that there are three points (2,5), (3,2), and (4,3) in a cluster C1. Then CF1 = ⟨3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)⟩ = ⟨3, (9,10), (29,38)⟩.
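A minimal sketch computing the clustering feature and the derived centroid, radius, and diameter for the three points in the example:

```python
# CF = (n, LS, SS) and the derived statistics for the example cluster.
import numpy as np

pts = np.array([[2, 5], [3, 2], [4, 3]], dtype=float)
n = len(pts)
LS = pts.sum(axis=0)            # linear sum -> (9, 10)
SS = (pts ** 2).sum(axis=0)     # square sum -> (29, 38)

centroid = LS / n
radius = np.sqrt(((pts - centroid) ** 2).sum() / n)
# Diameter: averaged squared distance over all ordered pairs i != j.
diffs = pts[:, None, :] - pts[None, :, :]
diameter = np.sqrt((diffs ** 2).sum() / (n * (n - 1)))
print(n, LS, SS, centroid, radius, diameter)
```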
CF-tree: It is a height-balanced tree that stores the clustering features for a hierarchical clustering. The size of a clustering feature tree depends on two factors:
1. Branching factor: It decides the maximum number of child nodes for a non-leaf node.
2. Threshold: It decides the maximum diameter of the subclusters stored at the leaf nodes.
CHAMELEON:
Measures the similarity based on a dynamic model: two clusters are merged only if the interconnectivity and closeness between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.
Density-based Clustering
The Density-based Clustering tool works by detecting areas where points are concentrated and
where they are separated by areas that are empty or sparse. Points that are not part of a cluster are
labeled as noise.
This tool uses unsupervised machine learning clustering algorithms which automatically detect
patterns based purely on spatial location and the distance to a specified number of neighbors.
These algorithms are considered unsupervised because they do not require any training on what
it means to be a cluster.
Clustering Methods
The Density-based Clustering tool provides three different clustering methods with which to find clusters in your point data: Defined distance (DBSCAN), Self-adjusting (HDBSCAN), and Multi-scale (OPTICS).
While only Multi-scale (OPTICS) uses the reachability plot to detect clusters, the plot can be used to explain, conceptually, how these methods differ from each other: it reveals clusters of varying densities and separation distances.
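A minimal sketch of density-based clustering using scikit-learn's DBSCAN (one of the three methods named above): points in dense regions are grouped, and sparse points receive the noise label −1. The coordinates and parameter values are illustrative.

```python
# DBSCAN groups dense regions and labels sparse points as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1],     # a dense group
              [8, 8], [8, 9], [9, 8],     # another dense group
              [25, 25]])                   # an isolated (noise) point
labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(X)
print(labels)   # -> [0 0 0 1 1 1 -1]: two clusters plus one noise point
```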
Grid-based methods:
STING (Statistical Information Grid):
→ Each cell at a high level is partitioned into a number of smaller cells at the next lower level.
→ Statistical information for each cell is calculated and stored beforehand and is used to answer queries.
→ Parameters of higher-level cells can easily be calculated from the parameters of lower-level cells.
→ When finished examining the current layer, proceed to the next lower level.
Advantages: Query-independent, easy to parallelize, and supports incremental update, since the grid statistics are computed once and stored.
Disadvantages: All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.
WaveCluster:
Summarizes the data by imposing a multidimensional grid structure onto the data space.
These multidimensional spatial data objects are represented in an n-dimensional feature space.
A wavelet transform is applied on the feature space to find the dense regions in the feature space.
COBWEB is a popular and simple method of incremental conceptual clustering. Its input objects
are described by categorical attribute-value pairs. COBWEB creates a hierarchical clustering in
the form of a classification tree.
Outlier Analysis
“What is an outlier?” Very often, there exist data objects that do not comply with the general
behavior or model of the data. Such data objects, which are grossly different from or inconsistent
with the remaining set of data, are called outliers. Outliers can be caused by measurement or
execution error.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information, because one
person’s noise could be another person’s signal. In other words, the outliers may be of particular
interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity.
Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier
mining.
Outlier mining has wide applications. As mentioned previously, it can be used in fraud detection,
for example, by detecting unusual usage of credit cards or telecommunication services. In
addition, it is useful in customized marketing for identifying the spending behavior of customers
with extremely low or extremely high incomes, or in medical analysis for finding unusual
responses to various medical treatments.
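The text does not prescribe a specific detection method; as one common statistical approach, a minimal z-score sketch flags values that lie unusually far from the mean:

```python
# Flag values whose z-score (distance from the mean in units of standard
# deviation) exceeds a chosen threshold. Data and threshold are toy values.
import numpy as np

values = np.array([10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 42.0])
z = (values - values.mean()) / values.std()
outliers = values[np.abs(z) > 2.0]   # threshold of 2 is a common default
print(outliers)                       # -> [42.]
```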
The web has multiple aspects that yield different approaches for the mining process: web pages consist of text, web pages are linked via hyperlinks, and user activity can be monitored via web server logs. These three features lead to the differentiation between three areas: web content mining, web structure mining, and web usage mining.
Web mining is the process of discovering patterns, structures, and relationships in web data. It
involves using data mining techniques to analyze web data and extract valuable insights. The
applications of web mining are wide-ranging and include:
Personalized marketing:
Web mining can be used to analyze customer behavior on websites and social media platforms.
This information can be used to create personalized marketing campaigns that target customers
based on their interests and preferences.
E-commerce
Web mining can be used to analyze customer behavior on e-commerce websites. This
information can be used to improve the user experience and increase sales by recommending
products based on customer preferences.
Search engine optimization:
Web mining can be used to analyze search engine queries and search engine results pages
(SERPs). This information can be used to improve the visibility of websites in search engine
results and increase traffic to the website.
Fraud detection:
Web mining can be used to detect fraudulent activity on websites. This information can be
used to prevent financial fraud, identity theft, and other types of online fraud.
Sentiment analysis:
Web mining can be used to analyze social media data and extract sentiment from posts,
comments, and reviews. This information can be used to understand customer sentiment
towards products and services and make informed business decisions.
Web content analysis:
Web mining can be used to analyze web content and extract valuable information such as
keywords, topics, and themes. This information can be used to improve the relevance of web
content and optimize search engine rankings.
Customer service:
Web mining can be used to analyze customer service interactions on websites and social media
platforms. This information can be used to improve the quality of customer service and
identify areas for improvement.
Healthcare:
Web mining can be used to analyze health-related websites and extract valuable information
about diseases, treatments, and medications. This information can be used to improve the
quality of healthcare and inform medical research.
Process of Web Mining:
Web mining can be broadly divided into three different types of mining techniques: Web Content Mining, Web Structure Mining, and Web Usage Mining. These are explained below.
What is Web Content Mining?
Web Content Mining is the mining of useful data, information, and knowledge from web page content. It performs scanning and mining of the text, images, and groups of web pages according to the content of the input query, for example to display the result list in search engines.
It is also quite different from data mining because web data are mainly semi-structured or
unstructured, while data mining deals primarily with structured data. Web content mining is also
different from text mining because of the semi-structured nature of the web, while text mining
focuses on unstructured texts. Thus, Web content mining requires creative applications of data
mining and text mining techniques and its own unique approaches.
In the past few years, there has been a rapid expansion of activities in the web content mining
area. This is not surprising because of the phenomenal growth of web content and the significant
economic benefit of such mining. However, due to the heterogeneity and the lack of structure of web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems. Web content mining can be carried out with two approaches, such as:
1. Agent-based Approach
This approach involves intelligent systems. It aims to improve information finding and filtering, and usually relies on autonomous agents that can identify relevant websites. It can be placed into the following three categories:
o Intelligent Search Agents: These agents search for relevant information using domain
characteristics and user profiles to organize and interpret the discovered information.
o Information Filtering or Categorization: These agents use information retrieval techniques and the characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them.
o Personalized Web Agents: These agents learn user preferences and discover Web
information based on other users' preferences with similar interests.
2. Database Approach
The database approach is used to organize the semi-structured data present on the internet into structured data. It aims to model the web data in a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it.
Web content mining also faces the following problems or challenges, along with their solutions:
o Data Extraction: Extraction of structured data from Web pages, such as products and search results (see the sketch after this list). Extracting such data allows one to provide value-added services. Two main types of techniques, machine learning and automatic extraction, are used to solve this problem.
o Web Information Integration and Schema Matching: Although the Web contains a
huge amount of data, each website (or even page) represents similar information
differently. Identifying or matching semantically similar data is an important problem
with many practical applications.
o Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs, and chat rooms. Mining opinions is of great importance for marketing intelligence and product benchmarking.
o Knowledge synthesis: Concept hierarchies and ontologies are useful in many applications. However, generating them manually is very time-consuming. The main application is to synthesize and organize the pieces of information on the web to give the user a coherent picture of the topic domain. Existing methods typically exploit the web's information redundancy.
o Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of the Web page, without advertisements, navigation links, or copyright notices. Automatically segmenting Web pages to extract their main content is an interesting problem.
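As a concrete illustration of the data-extraction challenge above, here is a minimal scraping sketch using requests and BeautifulSoup. The URL and the CSS class names ("item", "name", "price") are hypothetical; extracting from a real page requires inspecting its actual markup.

```python
# Extract (name, price) records from a hypothetical product listing page.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products").text
soup = BeautifulSoup(html, "html.parser")

records = []
for item in soup.find_all("div", class_="item"):   # one block per product
    name = item.find(class_="name")
    price = item.find(class_="price")
    if name and price:
        records.append({"name": name.get_text(strip=True),
                        "price": price.get_text(strip=True)})
print(records)
```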
What is Web Structure Mining?
The challenge for Web structure mining is to deal with the structure of the hyperlinks within the
web itself. Link analysis is an old area of research. However, with the growing interest in Web
mining, the research of structure analysis has increased. These efforts resulted in a newly
emerging research area called Link Mining, which is located at the intersection of the work in
link analysis, hypertext, web mining, relational learning, inductive logic programming, and graph
mining.
Web structure mining uses graph theory to analyze a website's node and connection structure. According to the type of web structural data, web structure mining can be divided into two kinds: extracting patterns from the hyperlinks that connect web pages, and mining the document structure, i.e., analyzing the tree-like structure of HTML or XML tags within a page.
The web contains a variety of objects with almost no unifying structure, with differences in the
authoring style and content much greater than in traditional collections of text documents. The
objects in the WWW are web pages, and links are in, out, and co-citation (two pages linked to by
the same page). Attributes include HTML tags, word appearances, and anchor texts. Web
structure mining includes the following terminology, such as:
An example of a web structure mining technique is the PageRank algorithm used by Google to rank search results. A page's rank is decided by the number and quality of the links pointing to it.
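A minimal power-iteration sketch of the PageRank idea: each page repeatedly shares its rank among the pages it links to, so pages with many high-quality inlinks accumulate rank. The four-page graph and damping factor are illustrative.

```python
# Power iteration for PageRank on a tiny link graph.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> pages it links to
n, d = 4, 0.85                                 # number of pages, damping factor
rank = np.full(n, 1.0 / n)
for _ in range(50):
    new = np.full(n, (1 - d) / n)
    for page, outs in links.items():
        for target in outs:                    # each page shares its rank
            new[target] += d * rank[page] / len(outs)
    rank = new
print(rank)   # page 2, with the most inlinks, ends up ranked highest
```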
Link mining has stirred up several traditional data mining tasks. Below we summarize some of these link mining tasks that are applicable to Web structure mining:
1. Link-based Classification: A recent extension of a classic data mining task to linked domains. The task is to predict the category of a web page based on the words that occur on the page, the links between pages, anchor text, HTML tags, and other possible attributes found on the web page.
2. Link-based Cluster Analysis: The data is segmented into groups, where similar objects
are grouped together, and dissimilar objects are grouped into different groups. Unlike the
previous task, link-based cluster analysis is unsupervised and can be used to discover
hidden patterns from data.
3. Link Type: There is a wide range of tasks concerning predicting the existence of links,
such as predicting the type of link between two entities or predicting the purpose of a
link.
4. Link Strength: Links could be associated with weights.
5. Link Cardinality: The main task is to predict the number of links between objects.
Web structure mining and page categorization are also used for:
o Finding related pages.
o Finding duplicated websites and finding out the similarity between them.
What is Web Usage Mining?
Web Usage Mining focuses on techniques that can predict the behavior of users while they interact with the WWW. Web usage mining discovers user navigation patterns from web data, trying to extract useful information from the secondary data derived from users' interactions while surfing the web. Web usage mining collects data from Weblog records to
discover user access patterns of web pages. Several available research projects and commercial
tools analyze those patterns for different purposes. The insight knowledge could be utilized in
personalization, system improvement, site modification, business intelligence, and usage
characterization.
The only information left behind by many users visiting a Web site is the path through the pages
they have accessed. Most of the Web information retrieval tools only use textual information,
while they ignore the link information, which could be very valuable. In general, there are four main kinds of data mining techniques applied in the web mining domain to discover user navigation patterns, such as:
1. Association Rules
Association rules are the most basic data mining method, and the one used most often in web usage mining. This method enables a website to organize its content more efficiently or to provide recommendations for effective cross-selling of products.
These rules are statements of the form X => Y, where X and Y are sets of items available in a series of transactions. The rule X => Y states that transactions that contain the items in X may also include the items in Y. Association rules in web usage mining are used to find relationships between pages that frequently appear next to one another in user sessions.
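A minimal sketch of evaluating a rule X => Y on page-view sessions: support is the fraction of sessions containing X and Y together, and confidence is how often sessions containing X also contain Y. The session data is illustrative.

```python
# Support and confidence of a rule X => Y over toy page-view sessions.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
    {"products", "cart"},
]

def support(itemset):
    # Fraction of sessions that contain every page in the itemset.
    return sum(itemset <= s for s in sessions) / len(sessions)

X, Y = {"products"}, {"cart"}
print("support:", support(X | Y))                  # 2/4 = 0.5
print("confidence:", support(X | Y) / support(X))  # 0.5 / 0.75 ~= 0.67
```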
2. Sequential Patterns
Sequential patterns are used to discover subsequences in a large volume of sequential data. In web usage mining, sequential patterns are used to find user navigation patterns that frequently appear in sessions. Sequential patterns may seem similar to association rules, but sequential patterns include time: the sequence in which events occurred is part of the pattern. Algorithms used to extract association rules can also be adapted to generate sequential patterns. Two types of algorithms are used for sequential pattern mining.
o The first type of algorithm is based on association rule mining. Many common sequential pattern mining algorithms are modified association rule mining algorithms. For example, GSP and AprioriAll are two variants of the Apriori algorithm used to extract such patterns. However, some researchers believe that association rule mining algorithms do not perform well enough on long sequential patterns.
o The second type of sequential pattern mining algorithm uses a tree structure and Markov chains to represent access patterns. For example, in one of these algorithms, called WAP-mine, a tree structure called the WAP-tree is used to explore web access patterns. Evaluation results show that its performance is higher than that of an algorithm such as GSP.
3. Clustering
Clustering techniques identify groups of similar items among high volumes of data. This is done based on distance functions, which measure the degree of similarity between different items. Clustering in web usage mining is used for grouping similar sessions. What is important in this type of analysis is the contrast between groups of users and individual users. Two types of interesting clustering can be found in this area: user clustering and page clustering.
Clustering of user records is commonly used in web mining and web analytics tasks. Knowledge derived from clustering is used, for example, to partition the market in e-commerce. Different methods and techniques are used for clustering, including:
o Using the similarity graph and the amount of time spent viewing a page to estimate the
similarity of meetings.
o Using genetic algorithms and user feedback.
o Clustering matrix.
o The K-means algorithm, which is the most classic clustering method.
In other clustering methods, repetitive patterns are first extracted from the users' sessions using association rules. These patterns are then used to construct a graph whose nodes are the visited pages. The edges of the graph connect two or more pages; if these pages occur together in an extracted pattern, a weight is assigned to the edge to show the strength of the relationship between the nodes. For clustering, this graph is then recursively partitioned until groups reflecting user behavior are detected.
4. Classification Mining
Discovering classification rules allows one to develop a profile of items belonging to a particular group according to their common attributes. This profile can be used to classify new data items added to the database. In web mining, classification techniques allow one to develop a profile for clients who access particular server files, based on demographic information available on those clients or on their navigation patterns.
Search engines are built in such a way that they effectively generate the required information by crawling across the web and searching the available databases on the internet.
Returning to the earlier example, the search engine acts as a librarian that gathers the relevant books, i.e., the required information, from the library of data available on the internet.
To summarize, when a user searches for particular data, the web crawlers scan, or crawl through, the data available on the web and gather all the relevant information (crawling). The search engine then picks the most relevant results according to their ranking and finally displays them on the results page, or SERP. It is quite a technical process, but all of this happens so quickly that the user gets the results as soon as they search for something on the search engine.
Architecture Of Search Engine
If we talk about the architecture or framework of a search engine, it can be described by three main components:
• Web crawlers – As the name suggests, these act as spiders which crawl all over the web to collect required information. They are special bots that search throughout the internet and accumulate data by following links.
• Database – It is a collection of data which is gathered by the web crawlers after searching
throughout the World Wide Web.
• Search Interface – It provides a medium or interface for users so that they can access and
search on the database for required information.
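A minimal sketch of the crawler component described above: fetch a page, extract its links, and follow them breadth-first within the same site up to a small budget. The seed URL is a placeholder, and real crawlers also need politeness rules (robots.txt, rate limits) omitted here.

```python
# Breadth-first toy crawler: fetch pages, collect same-site links.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from collections import deque

seed = "https://example.com"
seen, queue = {seed}, deque([seed])
while queue and len(seen) < 20:          # small page budget for the sketch
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=5).text
    except requests.RequestException:
        continue                          # skip pages that fail to load
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"])    # resolve relative links
        if link.startswith(seed) and link not in seen:
            seen.add(link)
            queue.append(link)
print(len(seen), "pages discovered")
```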
Prediction:
Prediction is a type of classification technique for linear and continuous data. To model such data, prediction uses a method called linear regression.
Linear regression:
It is a method used in prediction to model data. This technique involves two types of variables: one is called the response variable (y) and the other is called the predictor variable (x).
In this technique, we compute the response variable y as
$$y = w_0 + w_1 x$$
where
$$w_0 = \bar{y} - w_1 \bar{x}, \qquad w_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Solution:
Step 1: Compute the mean of the given y values (x̄ is computed in the same way from the given x values):
$$\bar{y} = \frac{84 + 63 + 71 + 78 + 96 + 75}{6} \approx 77.83$$
Step 2: Fit the regression line $y = w_0 + w_1 x$ using
$$w_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
Substituting the data gives $w_1 = 0.42$.
Step 3: For the given value x = 87, the fitted line predicts y ≈ 83.
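The formulas above translate directly into code. A minimal sketch follows; since the original x values are not reproduced in the notes, the x column here is illustrative, while the y values are taken from the example.

```python
# Least-squares fit of y = w0 + w1 * x using the formulas above.
import numpy as np

x = np.array([72.0, 58.0, 65.0, 70.0, 90.0, 80.0])   # illustrative x values
y = np.array([84.0, 63.0, 71.0, 78.0, 96.0, 75.0])   # y values from the example

w1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
w0 = y.mean() - w1 * x.mean()
print("y =", w0, "+", w1, "* x")
print("prediction for x = 87:", w0 + w1 * 87)
```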