U5 unsupervised learning
Unsupervised Machine Learning
K. D. Polytechnic, Patan
CO5: Apply unsupervised learning algorithms based on dataset characteristics
5.1 Introduction to Unsupervised Learning
◦ Brief explanation of unsupervised machine learning
◦ Need for unsupervised learning
◦ Working of unsupervised learning
◦ Real-world examples of unsupervised learning
◦ List of unsupervised learning algorithms
In unsupervised learning, the objective is to take a dataset as input and find natural groupings or patterns within its data elements or records.
Unsupervised learning is therefore often termed a descriptive model, and the process of unsupervised learning is called pattern discovery or knowledge discovery.
Need for Unsupervised Learning
Exploratory Data Analysis (EDA): Helps us understand the underlying structure of data without any predefined labels. We gain insight into the data distribution, which supports assessing data quality, identifying outliers, and making informed decisions.
Clustering: Clustering algorithms group similar data points together based on their features, e.g. customer segmentation, image segmentation, and anomaly detection.
Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) reduce the number of features while preserving the essential information.
Recommendation Systems: Helps build personalized recommendations. Collaborative filtering and matrix factorization are common techniques, e.g. movie recommendations (Netflix) and product recommendations (Amazon).
Feature Engineering: Unsupervised learning can create new features from existing ones.
Data Preprocessing: Imputing missing values, scaling features, and handling outliers. Unsupervised methods help prepare data for subsequent modeling.
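The PCA technique mentioned above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the 3-feature sample data is made up, and real projects would typically use a library implementation such as scikit-learn's PCA.

```python
# Minimal PCA sketch: project 3-D points onto the 2 directions of
# highest variance, reducing the number of features from 3 to 2.
import numpy as np

def pca(X, n_components):
    # Center the data so the principal axes pass through the mean.
    Xc = X - X.mean(axis=0)
    # Eigen-decompose the covariance matrix; eigh returns eigenvalues
    # in ascending order, so reverse to take the top components.
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    components = eigvecs[:, ::-1][:, :n_components]
    # Project the centered data onto the principal components.
    return Xc @ components

X = np.array([[2.5, 2.4, 0.1],
              [0.5, 0.7, 0.0],
              [2.2, 2.9, 0.2],
              [1.9, 2.2, 0.1],
              [3.1, 3.0, 0.3]])
Z = pca(X, n_components=2)  # 5 samples, reduced from 3 features to 2
```

The projected data keeps the directions along which the samples vary most, which is why PCA preserves the "essential information" while dropping features.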
Applications of Unsupervised Learning
❖Segmentation of target consumer populations by an advertisement consulting agency on the basis of a few dimensions such as demography, financial data, and purchasing habits, so that advertisers can reach their target consumers efficiently.
❖Anomaly or fraud detection in the banking sector by identifying the pattern of loan defaulters.
❖Image processing and image segmentation, such as face recognition and expression identification.
❖Utilization by data scientists to reduce the dimensionality of sample data to simplify modeling.
➢Clustering algorithms split data into natural groups by finding similar structures or patterns in uncategorized data.
➢Partition method: Data is grouped in a way where a single data point can only exist in one cluster. This is also referred to as
“hard” clustering. A common example of exclusive clustering is the K-means clustering algorithm, which partitions data
points into a user-defined number K of clusters.
➢Density-based method: Finds clusters as dense regions of data points separated by sparser regions (e.g., DBSCAN).
➢Hierarchical clustering: Data points are organized into a tree of nested clusters, typically by repeatedly merging the most similar clusters (agglomerative) or repeatedly splitting larger ones (divisive).
➢Probabilistic clustering: Data is grouped into clusters based on the probability of each data point belonging to each cluster.
This approach differs from the other methods, which group data points based on their similarities to others in a cluster.
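The K-means partition method described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the sample points and K = 2 are made-up values, and the initial centroids are simply the first K points (real implementations use randomized or smarter initialization).

```python
# Minimal K-means sketch illustrating "hard" (partition) clustering:
# every point belongs to exactly one cluster.

def kmeans(points, k, iters=10):
    # Illustrative initialization: first k points as starting centroids.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: (p[0] - centroids[i][0]) ** 2
                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters

# Two well-separated groups of 2-D points (made-up data).
points = [(1, 1), (8, 8), (1.5, 2), (9, 9), (1, 0.5), (8.5, 7.5)]
centroids, clusters = kmeans(points, k=2)
```

Because each point is assigned to exactly one cluster in the assignment step, this is the "exclusive" or hard clustering behaviour the partition method describes, in contrast to the probabilistic approach above.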
Applications of Clustering
Market Segmentation: Companies use clustering to group customers based on purchasing behavior, demographics,
and engagement levels. Segmented groups allow targeted marketing strategies and personalized recommendations.
Social Network Analysis: Clustering helps identify communities or groups within social networks. It reveals patterns of
connections, influencers, and subgroups.
Search Result Grouping: Search engines use clustering to group similar search results. Users benefit from organized
and relevant search results.
Medical Imaging: Clustering helps segment medical images (e.g., MRI, CT scans). It aids in identifying tumors,
lesions, or other anomalies.
Image Segmentation: In computer vision, clustering segments images into meaningful regions. Useful for object
detection, image recognition, and scene understanding.
Anomaly Detection: Clustering identifies unusual patterns or outliers. Examples: Fraud detection, network intrusion
detection.
Types of Unsupervised Learning: Association Analysis
•Association rule mining presents a methodology that is useful for identifying interesting relationships hidden in large data sets. It is also known as association analysis, and the discovered relationships can be represented in the form of association rules comprising sets of frequent items.
•A common application of this analysis is Market Basket Analysis, which retailers use for cross-selling their products.
•Association analysis focuses on identifying associations between data elements: it uncovers how items are associated with each other and which items appear together in a transaction or relation.
•It is widely used by retailers, grocery stores, and online marketplaces that maintain large transactional databases.
Association analysis: methods
•Common Algorithm: Apriori is a well-known algorithm for association rule learning.
•Itemset: One or more items are grouped together and are surrounded by brackets to indicate that
they form a set, or more specifically, an itemset that appears in the data with some regularity.
•Support Count: Denotes the number of transactions in which a particular itemset is present. This is a very important property of an itemset, as it denotes the frequency of occurrence of the itemset. For example, if the itemset {Bread, Milk, Egg} occurs together in 3 transactions, it has a support count of 3.
Association rule
The result of the market basket analysis is expressed as a set of association rules that specify
patterns of relationships among items.
Support and confidence are two concepts for measuring the strength of an association rule.
Support denotes how often a rule is applicable to a given data set; a low support may indicate that the rule has occurred by chance.
For a rule X→Y, confidence indicates how often the items in Y appear in the transactions that contain X. Confidence denotes the predictive power or accuracy of the rule, and thus provides a measure of the reliability of the inference made by the rule.
C({Bread, Milk}→{Egg}) = S({Bread, Milk, Egg})/S({Bread, Milk}) = 3/4 = 0.75
Note that confidence is not symmetric: C({Bread, Milk}→{Egg}) ≠ C({Egg}→{Bread, Milk}), since C({Egg}→{Bread, Milk}) = 3/5 = 0.6.
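These support and confidence calculations can be checked with a small sketch in plain Python. The six transactions below are hypothetical, constructed so that the support counts match the figures in the text: S({Bread, Milk, Egg}) = 3, S({Bread, Milk}) = 4, and S({Egg}) = 5.

```python
# Transactions as sets of items (made-up data matching the counts above).
transactions = [
    {"Bread", "Milk", "Egg"},
    {"Bread", "Milk", "Egg"},
    {"Bread", "Milk", "Egg"},
    {"Bread", "Milk"},
    {"Egg"},
    {"Egg"},
]

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

def confidence(X, Y, transactions):
    # C(X -> Y) = S(X union Y) / S(X)
    return support_count(X | Y, transactions) / support_count(X, transactions)

c1 = confidence({"Bread", "Milk"}, {"Egg"}, transactions)  # 3/4 = 0.75
c2 = confidence({"Egg"}, {"Bread", "Milk"}, transactions)  # 3/5 = 0.6
```

Running this reproduces the asymmetry of confidence: the same itemset {Bread, Milk, Egg} supports both rules, but the denominators differ, giving 0.75 in one direction and 0.6 in the other.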
Limitations of Unsupervised Learning
No Ground Truth for Evaluation: Without labeled data, it is difficult to evaluate the performance of unsupervised models objectively.
Metrics like accuracy or precision are not applicable, making model assessment less straightforward.
Difficulty in Generalization: Unsupervised models may struggle to generalize well to unseen examples, since they learn only from the patterns present within the training data.