
Assignment: 3

Subject: Machine Learning

Submitted by: Bidya Sagar Lekhi (Roll: 10)
Submitted to: Er. Pradip Sharma
1. Define Unsupervised Learning. How is it different from Supervised Learning?
Ans:- Unsupervised learning is a type of machine learning where
algorithms analyze and draw inferences from data that is unlabeled,
uncategorized, and untagged. The goal is for the algorithm to
discover hidden patterns, structures, or groupings within the data
without any prior knowledge of the outcomes or explicit instructions on
what to look for. Common tasks in unsupervised learning
include clustering (grouping similar data points), association (finding
relationships between variables), and dimensionality
reduction (simplifying data by reducing the number of features).
How Unsupervised Learning Differs from Supervised Learning
• Data: Unsupervised learning uses unlabeled data (no output or category labels); supervised learning uses labeled data (input-output pairs with known outcomes).
• Goal: Unsupervised learning discovers hidden patterns, groupings, or structures; supervised learning learns a mapping from inputs to known outputs (classification/regression).
• Human intervention: Unsupervised learning needs no human-provided labels, as the model finds structure on its own; supervised learning requires human-labeled data for training.
• Examples: Unsupervised: clustering (K-means), association rules, dimensionality reduction (PCA). Supervised: image classification, spam detection, sentiment analysis.
• Output: Unsupervised learning produces groups, clusters, or new representations; supervised learning produces predicted labels or values.
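This difference is easy to see in code. Below is a minimal sketch, assuming scikit-learn is available, that fits an unsupervised K-means model using only the features and a supervised logistic regression model that also needs the labels; the dataset and parameter values are illustrative choices.

# Minimal sketch contrasting the two settings on the Iris data (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Unsupervised: no labels are given; K-means discovers 3 groups on its own.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", kmeans.labels_[:10])

# Supervised: the labels y are required; the model learns a mapping from X to y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted labels:   ", clf.predict(X[:10]))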

2. Discuss the applications and challenges of Unsupervised Learning.
Ans:- Unsupervised learning is a type of machine learning where
algorithms analyze unlabeled data to discover hidden patterns,
structures, or groupings without any prior guidance or labeled
outputs. It is widely used to explore complex datasets and extract
meaningful insights when explicit labels are unavailable.
Applications of Unsupervised Learning
1. Clustering and Customer Segmentation
Algorithms like k-means clustering group similar data points,
enabling businesses to segment customers based on behavior,
preferences, or demographics. This helps in targeted marketing,
personalized recommendations, and improving customer
experience.
2. Anomaly Detection
Detecting unusual patterns or outliers in data, such as fraudulent
transactions in finance or identifying rare events in healthcare, is
a key application. Techniques like Isolation Forest or DBSCAN
help flag anomalies for further investigation.
3. Dimensionality Reduction
Methods like Principal Component Analysis (PCA) reduce the
number of variables in large datasets while preserving important
information. This simplifies data visualization, speeds up
learning, and helps identify underlying factors influencing data,
useful in finance, image compression, and more.
4. Pattern Recognition and Natural Language Processing
Hidden Markov Models and clustering techniques are used for
speech recognition, text classification, and optical character
recognition, enabling machines to understand and process
human language and audio-visual data.
5. Association Rule Mining
Discovering relationships between variables, such as products
frequently bought together in e-commerce, helps design cross-
selling strategies and targeted promotions.
6. Image and Video Analysis
Unsupervised learning assists in segmenting images,
compressing data, and identifying patterns in video streams,
which is valuable in surveillance, medical imaging, and content
management.
Challenges of Unsupervised Learning
1. Lack of Ground Truth
Since there are no labeled outputs, evaluating the quality or
accuracy of the learned patterns is difficult. The results may be
ambiguous or hard to interpret.
2. Determining the Number of Clusters or Groups
Many algorithms require specifying the number of clusters in
advance, which is often unknown and can significantly affect
outcomes.
3. Sensitivity to Noise and Outliers
Unsupervised methods can be misled by noisy data or outliers,
resulting in poor or misleading segmentation.
4. Computational Complexity
Some algorithms, especially on large or high-dimensional
datasets, can be computationally expensive and require careful
tuning.
5. Interpretability
The discovered patterns may not always correspond to
meaningful or actionable insights, making it challenging to
apply results in real-world decisions.
6. Dependence on Feature Selection
The quality of results heavily depends on the choice and quality
of input features, which may require domain expertise.
3. What is Clustering? Explain its importance in Unsupervised Learning.
Ans:- Clustering is an unsupervised machine learning technique
that organizes and classifies data points into groups or clusters based on
their similarities or patterns, without requiring any labeled data.
What is Clustering?
1. It groups data points such that those within the same cluster are
more similar to each other than to those in other clusters.
2. Clustering helps to find natural groupings in data, revealing
underlying structures or relationships.
3. Each cluster can be assigned an ID or label to identify its
characteristics.
4. It is widely used in exploratory data analysis to simplify complex
datasets and discover patterns.
Importance of Clustering in Unsupervised Learning
1. No Need for Labeled Data: Clustering works on unlabeled data,
making it ideal when labeling is expensive or impractical.
2. Pattern Discovery: It helps uncover hidden patterns, groupings,
or anomalies in data that humans might not easily detect.
3. Data Simplification: By grouping similar data points, clustering
reduces complexity and aids in data summarization.
4. Preprocessing Step: Clustering can be used to segment data
before applying other machine learning algorithms or
dimensionality reduction.
5. Versatile Applications: Used in customer segmentation,
anomaly detection, image processing, document classification,
and more.
6. Anomaly Detection: Points that do not fit well into any cluster
can be flagged as outliers or anomalies.
4. Compare Clustering with Classification.
Ans:- Classification vs. Clustering: A Comparison

• Definition: Clustering is an unsupervised learning technique used to group similar data points into clusters based on their features; classification is a supervised learning technique used to assign predefined labels (classes) to input data.
• Learning type: Clustering is unsupervised; classification is supervised.
• Labeled data: Clustering does not require labeled data; classification requires labeled training data.
• Output: Clustering produces groups or clusters (e.g., Cluster 1, Cluster 2); classification produces class labels (e.g., Spam or Not Spam).
• Goal: Clustering finds natural groupings or structure in the data; classification predicts a category or label for new data.
• Examples: Clustering: customer segmentation, document clustering. Classification: email spam detection, disease diagnosis.
• Algorithms: Clustering: K-Means, hierarchical clustering, DBSCAN. Classification: decision trees, SVM, k-NN, logistic regression.
• Use case: Clustering explores data patterns or reduces dimensions; classification builds predictive models based on known labels.
• Evaluation metrics: Clustering: silhouette score, Davies-Bouldin index. Classification: accuracy, precision, recall, F1-score.

5. Explain how dimensionality reduction helps in data visualization and noise removal.
Ans:- Dimensionality reduction is a technique that transforms high-
dimensional data into a lower-dimensional space while preserving its
essential structure and information. It plays a crucial role in both data
visualization and noise removal.
How Dimensionality Reduction Helps in Data Visualization
1. Simplifies Complex Data: High-dimensional data (with many
features) is difficult to visualize directly because humans can
only perceive in 2D or 3D. Dimensionality reduction techniques
like PCA (Principal Component Analysis) or t-SNE reduce
data to two or three dimensions, enabling effective visualization.
2. Reveals Patterns and Clusters: By projecting data onto lower
dimensions, clusters, groupings, and boundaries become more
apparent, facilitating intuitive understanding and exploratory
analysis. For example, handwritten digit datasets become
visually separable after dimensionality reduction.
3. Facilitates Interpretation: Reduced-dimensional plots help
data scientists and stakeholders interpret complex relationships
and detect anomalies or trends that would otherwise remain
hidden in high-dimensional space.
How Dimensionality Reduction Helps in Noise Removal
1. Filters Out Irrelevant Features: Real-world datasets often
contain redundant or noisy features that do not contribute
meaningful information. Dimensionality reduction compresses
data by retaining only the most significant features or
components, effectively filtering noise.
2. Improves Signal-to-Noise Ratio: By focusing on the principal
components or most informative directions in the data, these
techniques enhance the relevant signal while suppressing
random noise or minor variations.
3. Enhances Model Performance: Removing noise reduces
overfitting and improves the generalization ability of machine
learning models, making them more robust and accurate.
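As a concrete illustration of both points, the following sketch (assuming scikit-learn and matplotlib are installed) projects the 64-dimensional digits dataset to two principal components for plotting, and reconstructs the data from the components that explain roughly 90% of the variance; the dataset and thresholds are illustrative choices.

# Illustrative sketch: dimensionality reduction for visualization and denoising.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)            # 1797 samples, 64 features

# Visualization: reduce to 2 components and scatter-plot, colored by digit.
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=8, cmap="tab10")
plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.title("Digits projected to 2D")
plt.show()

# Noise removal: keep the components explaining ~90% of variance, then map
# back to the original space; minor, noise-like variation is smoothed away.
pca = PCA(n_components=0.90)
X_denoised = pca.inverse_transform(pca.fit_transform(X))
print("Components kept:", pca.n_components_)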

6. What is association rule learning? Mention its applications with examples.
Ans:- Association rule learning is a rule-based machine learning
method used to discover interesting relationships, patterns, or
associations between variables in large datasets, especially in
transactional databases. It identifies strong rules of the form X ⇒ Y,
meaning "if X occurs, then Y is likely to occur," where X and Y are sets
of items or attributes. This technique is unsupervised, meaning it does
not require labeled data, and is widely used to uncover hidden patterns
that can inform decision-making.
Applications of Association Rule Learning
• Market Basket Analysis:
Retailers use association rules to analyze customer purchase
data and discover which items are frequently bought together.
For example, if customers often buy bread and butter together,
the rule {bread} ⇒ {butter} can help in product
placement, promotions, or bundling strategies.
Example:
If a supermarket finds that "onions and potatoes" are often
purchased with "burger meat," they might place these items
closer together or offer combo discounts.
• Recommendation Systems:
E-commerce and streaming platforms use association rules to
recommend products or content. For instance, if a user watches
a comedy trailer and then watches the full movie, the system
might recommend similar comedies or preload the full movie
for other users who watch comedy trailers.
• Fraud Detection:
Financial institutions use association rules to spot unusual
patterns in transactions that may indicate fraud, such as multiple
purchases from the same merchant in a short time frame.
• Web Usage Mining:
Association rules help analyze website navigation patterns to
improve site structure, content recommendations, and targeted
advertising.
• Healthcare and Bioinformatics:
Used to discover relationships between symptoms, diagnoses,
and treatments, or to identify potential adverse drug interactions
from patient records.
• Customer Segmentation:
Businesses segment customers based on purchasing patterns to
tailor marketing strategies and improve customer targeting.
Example Rule
In a grocery store, analysis of transaction data may reveal the rule:
{Bread, Milk} ⇒ {Eggs}
This means that customers who buy bread and milk together are likely
to also buy eggs. The store can use this insight to optimize shelf
placement or run targeted promotions.
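A small sketch of how such rules can be mined in practice is shown below. It assumes the third-party mlxtend library is installed, and the tiny transaction list is made up purely for illustration.

# Hypothetical transactions; assumes the mlxtend package is installed.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Bread", "Milk", "Eggs"],
    ["Bread", "Milk"],
    ["Bread", "Milk", "Eggs", "Butter"],
    ["Milk", "Butter"],
    ["Bread", "Eggs"],
]

# One-hot encode the transactions into a boolean item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Find itemsets appearing in at least 40% of transactions, then derive rules.
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])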

7. Explain the K-means clustering algorithm with steps.


Ans:- K-means clustering is a popular unsupervised machine learning
algorithm used to partition a dataset into K distinct, non-overlapping
clusters based on feature similarity. The goal is to group data points so
that points within the same cluster are more similar to each other than
to those in other clusters.
Steps of the K-means Clustering Algorithm
1. Choose the number of clusters (K):
Decide how many clusters you want to divide your data into.
This can be based on prior knowledge or methods like the
Elbow method.
2. Initialize centroids:
Randomly select K data points from the dataset as the initial
centroids (cluster centers).
3. Assign data points to the nearest centroid:
For each data point, calculate the distance (commonly Euclidean
distance) to each centroid and assign the point to the cluster
whose centroid is closest.
4. Update centroids:
After all points are assigned, recalculate the centroid of each
cluster by taking the mean of all data points assigned to that
cluster.
5. Repeat steps 3 and 4 until convergence:
Continue reassigning points and updating centroids until the
centroids no longer move significantly or the assignments stop
changing. This indicates that the algorithm has converged.
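These steps map almost line-for-line onto code. The following NumPy sketch is a minimal, illustrative implementation (the toy data, the value of k, and the stopping test are assumptions, not a production-ready version; empty clusters are not handled).

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated blobs, clustered with k = 2 (step 1).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print("Centroids:\n", centroids)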

8. What are the limitations of K-means clustering?


Ans:- The limitations of K-means clustering are well-documented and
include the following key points:
1. Need to Specify Number of Clusters (K) in Advance:
You must decide the number of clusters before running the
algorithm. Choosing an incorrect K can lead to poor clustering
results. Methods like the elbow method or silhouette analysis
help, but don’t always give a clear answer.
2. Sensitivity to Initial Centroid Selection:
K-means randomly initializes centroids, which can lead to
different results on different runs and may cause the algorithm
to converge to a local optimum rather than the global best
solution. Improved initialization methods like K-means++ help
mitigate this.
3. Assumes Clusters are Spherical and Evenly Sized:
K-means works best when clusters are roughly spherical and of
similar size. It struggles with clusters of varying shapes,
densities, or sizes, often producing poor or misleading clusters
in such cases.
4. Sensitive to Outliers and Noise:
Outliers can disproportionately affect centroid positions,
dragging them away from the true cluster centers or causing
outliers to form their own clusters, reducing clustering accuracy.
5. Difficulty Handling Categorical Data:
K-means relies on calculating distances, which is
straightforward for numerical data but problematic for
categorical data. Encoding categorical variables (e.g., one-hot
encoding) can increase dimensionality and degrade
performance.
6. Curse of Dimensionality:
As the number of features (dimensions) increases, the distance
between points becomes less meaningful, making it harder for
K-means to distinguish clusters. Dimensionality reduction
techniques like PCA or spectral clustering are often needed
before applying K-means.
7. Computational Complexity on Large or High-Dimensional
Datasets:
The algorithm’s time complexity depends on the number of
clusters, data points, dimensions, and iterations, which can make
it computationally expensive for very large or high-dimensional
datasets.
8. Equal Weighting of Features:
K-means assumes all features contribute equally to the
clustering, which may not be true in practice. It does not
inherently weigh features differently.

9. Differentiate between agglomerative and divisive clustering with diagrams.
Ans:-
• Approach: Agglomerative clustering is bottom-up; divisive clustering is top-down.
• Initial state: Agglomerative starts with each point as its own cluster; divisive starts with all points in one big cluster.
• Process: Agglomerative merges clusters; divisive splits clusters.
• Computation: Agglomerative is less expensive; divisive is more expensive.
• Common usage: Agglomerative is more commonly used; divisive is less commonly used.
• Stopping condition: Agglomerative proceeds until all points are merged into one cluster or the required number of clusters is reached; divisive proceeds until each point is its own cluster or a distance threshold is met.

Diagram of Agglomerative clustering:


Stepwise merging (bottom-up):
A    B    C    D    E        (start with 5 individual points)
|    |    |     \   /
|    |    |    Clust4        (D and E merge)
|    |     \      /
|    |     Clust3            (C joins D and E)
|     \       /
|      Clust2                (B joins C, D and E)
 \        /
  Clust1                     (eventually all points merge into one cluster)

Diagram of Divisive Clustering:


Stepwise splitting (top-down):

        Clust1               (all points together)
       /      \
    Cl2        Cl3           (split into two clusters)
   /   \      /   \
  A     B    C     D         (eventually split into single points)

10. What is a dendrogram? How is it used in hierarchical clustering?


Ans:- A dendrogram is a tree-like diagram that visually represents the
arrangement of clusters produced by hierarchical clustering. It
illustrates the hierarchical relationships between data points or clusters,
showing how individual elements group together step-by-step to form
larger clusters.
What is a Dendrogram?
• It displays clusters as branches (called clades) that merge or
split at different levels.
• The leaves (at the bottom) represent individual data points.
• Branches connect these points or clusters based on their
similarity or distance.
• The height of the branches (y-axis) indicates the distance or
dissimilarity at which clusters are joined or split.
• Shorter branches mean higher similarity; taller branches indicate
greater dissimilarity between clusters.
How is a Dendrogram Used in Hierarchical Clustering?
• Visualizing Cluster Formation:
It shows the order and levels at which clusters merge
(agglomerative) or split (divisive), providing insight into the
data’s structure.
• Determining the Number of Clusters:
By “cutting” the dendrogram at a certain height (distance), you
can decide how many clusters to form. A horizontal cut across
the dendrogram intersects branches, and the number of
intersections corresponds to the number of clusters.
• Interpreting Similarities:
The dendrogram helps identify which data points or clusters are
most similar based on where they join. For example, points
joined at lower heights are more similar than those joined higher
up.
• Exploring Data at Different Granularities:
Because hierarchical clustering does not require pre-specifying
the number of clusters, the dendrogram allows exploring data
groupings at various levels of detail.
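As a brief illustration, the sketch below (assuming SciPy and matplotlib are installed) builds an agglomerative hierarchy with Ward linkage on toy data, draws its dendrogram, and cuts the tree to obtain a flat clustering; the data and the linkage choice are illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# Merge history: which clusters join, and at what distance.
Z = linkage(X, method="ward")

# Leaves are individual points; branch height is the merge distance.
dendrogram(Z)
plt.ylabel("Merge distance")
plt.show()

# Cutting the tree at a chosen level yields a flat set of clusters.
labels = fcluster(Z, t=2, criterion="maxclust")   # ask for 2 clusters
print("Cluster labels:", labels)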

11. What is DBSCAN? How does it identify noise and clusters of arbitrary shapes?
Ans:- DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) is a popular unsupervised clustering algorithm that groups
data points based on their density in the feature space. Unlike
algorithms such as K-means, DBSCAN does not require you to specify
the number of clusters in advance and can find clusters of arbitrary
shapes and sizes. It is especially effective at identifying outliers (noise)
and handling real-world data irregularities.
How DBSCAN Works
DBSCAN uses two main parameters:
• ε (epsilon): The radius that defines the neighborhood around a
data point.
• MinPts: The minimum number of points required within the ε-
radius to form a dense region (cluster).
Key Concepts
• Core Point: A point with at least MinPts neighbors within ε.
These are typically in the interior of a cluster.
• Border Point: A point that is within ε of a core point but has
fewer than MinPts neighbors itself.
• Noise Point (Outlier): A point that is not a core point and not
within ε of any core point; it does not belong to any cluster.
Step-by-Step Process
1. Identify Core Points: For each point, count the number of
points within its ε-neighborhood. If this count ≥ MinPts, mark it
as a core point.
2. Form Clusters: Start with an unvisited core point and form a
new cluster. Recursively add all points that are density-
reachable (directly or indirectly connected through other core
points within ε).
3. Assign Border Points: Points within ε of a core point but not
themselves core points are assigned to the cluster as border
points.
4. Identify Noise: Points that are neither core nor border points are
labeled as noise (outliers) and are not assigned to any cluster.
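A minimal usage sketch of this procedure, assuming scikit-learn is installed, is given below; eps and min_samples correspond to ε and MinPts, and the toy data is invented for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (50, 2)),     # dense cluster 1
    rng.normal(5, 0.3, (50, 2)),     # dense cluster 2
    rng.uniform(-3, 8, (10, 2)),     # scattered points, likely noise
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print("Cluster labels found:", set(db.labels_))              # label -1 marks noise
print("Number of noise points:", int((db.labels_ == -1).sum()))
print("Number of core points:", len(db.core_sample_indices_))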
Identifying Clusters of Arbitrary Shapes
• Density-Based Approach: DBSCAN defines clusters as dense
regions separated by areas of lower density. This allows it to
find clusters of any shape—linear, elongated, curved, or
irregular—unlike algorithms that assume clusters are spherical.
• Density Reachability: Clusters are formed by chaining together
core points that are within ε of each other, allowing the cluster
to grow in any direction as long as the density condition is met.
• No Need for Predefined Cluster Count: DBSCAN
automatically determines the number of clusters based on the
data’s density structure.
Identifying Noise
• Noise Points: Any point that does not meet the density criteria
(not a core point and not within ε of a core point) is labeled as
noise. These are typically isolated points or outliers in sparse
regions of the data.
• Robustness: This feature makes DBSCAN robust to outliers
and noise, as such points are not forced into clusters.

12. Compare DBSCAN with K-means and hierarchical clustering.


Ans:- Comparison of DBSCAN, K-means, and Hierarchical
Clustering
Overview
• K-means: A centroid-based, partitioning algorithm that divides
data into a predefined number (K) of clusters by minimizing
within-cluster variance.
• Hierarchical Clustering: Builds a tree (dendrogram) of clusters
either by merging (agglomerative) or splitting (divisive) data
points, without requiring the number of clusters in advance.
• DBSCAN: A density-based algorithm that groups data points
into clusters based on density, identifying outliers as noise and
discovering clusters of arbitrary shape.
Key Differences
• Type: K-means is a partitioning method; hierarchical clustering is agglomerative or divisive; DBSCAN is density-based.
• Input required: K-means needs the number of clusters (k); hierarchical clustering can stop at any number of clusters; DBSCAN needs ε (radius) and MinPts (minimum points).
• Cluster shape assumption: K-means assumes spherical/convex clusters; hierarchical clustering is flexible (depends on the linkage method); DBSCAN allows arbitrary (non-convex) shapes.
• Scalability (large datasets): K-means is fast and scalable; hierarchical clustering is slower and not ideal for large datasets; DBSCAN is moderate (faster than hierarchical, slower than K-means).
• Noise handling: K-means is poor (every point is assigned to a cluster); hierarchical clustering is poor (no explicit noise detection); DBSCAN is good (can detect and exclude noise).
• Outlier detection: Not provided by K-means or hierarchical clustering; provided by DBSCAN (outliers are labeled as noise).
• Parameter sensitivity: K-means is sensitive to k and the initial centroids; hierarchical clustering is sensitive to the distance threshold; DBSCAN is sensitive to ε and MinPts.
• Data requirement: K-means works on numerical features; hierarchical clustering works on any data type with a distance metric; DBSCAN suits spatial or continuous numeric data.
• Cluster merging/splitting: Not possible during a K-means run; hierarchical clustering merges or splits clusters stepwise; DBSCAN grows clusters through density connectivity.
• Use cases: K-means: market segmentation, customer grouping. Hierarchical: gene clustering, text data grouping. DBSCAN: geospatial data, anomaly detection, image segmentation.

Strengths and Limitations


K-means
• Strengths:
• Simple, fast, and efficient for large datasets.
• Works well for compact, spherical clusters.
• Limitations:
• Requires K in advance.
• Poor with non-spherical clusters and outliers.
• Sensitive to initial centroid placement.
Hierarchical Clustering
• Strengths:
• No need to predefine the number of clusters.
• Produces a dendrogram for multi-level analysis.
• Flexible with distance/linkage choices.
• Limitations:
• Computationally expensive for large datasets.
• Sensitive to noise and outliers.
• Once merged or split, cannot undo steps.
DBSCAN
• Strengths:
• Finds clusters of arbitrary shape and size.
• Automatically detects the number of clusters.
• Robust to outliers and noise.
• Limitations:
• Struggles with clusters of varying density.
• Parameter selection (ε, MinPts) can be challenging.
• Less effective in high-dimensional spaces.
Practical Considerations
• K-means is preferred for large, well-separated, and spherical
clusters.
• Hierarchical clustering is ideal for exploring data structure and
relationships, especially with smaller datasets.
• DBSCAN excels in spatial data, noise-rich environments, and
when clusters are irregularly shaped or the number of clusters is
unknown.

13. Why is DBSCAN suitable for clusters of arbitrary shape? Provide an example.
Ans:- DBSCAN (Density-Based Spatial Clustering of Applications with
Noise) is designed to find clusters based on the density of data
points rather than assuming any specific cluster shape (like spherical
clusters in K-means). It defines clusters as maximal sets of density-
connected points, which allows it to:
• Group together points that are closely packed, regardless of
the shape or size of the cluster.
• Handle clusters of complex, irregular, or elongated
shapes because it expands clusters by connecting core points that
have dense neighborhoods.
• Distinguish noise and outliers effectively, as points in low-
density regions are labeled as noise and not forced into clusters.
This density-based approach means DBSCAN can discover clusters that
are non-convex, intertwined, or arbitrarily shaped, which many
traditional algorithms struggle with.
How DBSCAN identifies clusters of arbitrary shape:
• It starts from a core point (a point with at least MinPts neighbors
within radius ε).
• It recursively adds all points density-reachable from the core
point, forming a cluster that can grow in any direction as long as
the density condition is satisfied.
• This process naturally follows the shape of the dense regions in
the data, allowing clusters to take any form.
Example:
Imagine a dataset with two clusters:
• One cluster shaped like a crescent moon.
• Another cluster shaped like a circle.
Traditional algorithms like K-means would struggle because they
assume spherical clusters and may split or merge these shapes
incorrectly.
DBSCAN, however, will:
• Identify the crescent-shaped cluster by connecting core points
along the crescent, following its curved shape.
• Identify the circular cluster as another dense region.
• Label points between the clusters (low-density areas) as noise.
This way, DBSCAN successfully clusters data points according to their
true spatial distribution without shape constraints.
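The crescent example can be reproduced with a short sketch, assuming scikit-learn is installed; make_moons generates two interleaved half-moons, and the parameter values are illustrative.

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-means cuts the moons with a straight boundary; DBSCAN follows their curved shape.
print("K-means clusters:", set(km_labels))
print("DBSCAN clusters (-1 = noise):", set(db_labels))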

14. What is the "curse of dimensionality" and how does it affect clustering algorithms?
Ans:- The curse of dimensionality refers to the various problems and
challenges that arise when working with data in very high-dimensional
spaces (i.e., when the number of features or attributes is very large). As
dimensionality increases, the volume of the space grows exponentially,
causing data points to become extremely sparse and distances between
points to lose meaning.
What is the Curse of Dimensionality?
• Data Sparsity: In high dimensions, data points spread out so
much that the space becomes mostly empty. This sparsity makes
it difficult to find meaningful patterns or clusters because points
are far apart from each other.
• Distance Concentration: The difference between the nearest
and farthest points tends to shrink, making distance-based
measures (like Euclidean distance) less effective for
distinguishing points.
• Exponential Data Requirement: To maintain the same data
density as dimensionality grows, the number of required data
points increases exponentially. Without sufficient data, models
tend to overfit or fail to generalize.
• Increased Computational Complexity: Algorithms become
slower and require more resources because they must process
more features and data points.
How Does It Affect Clustering Algorithms?
Clustering algorithms often rely on distance or similarity measures to
group data points. The curse of dimensionality impacts clustering in the
following ways:
1. Reduced Meaningfulness of Distance Metrics:
In high-dimensional spaces, distances between points become
similar, making it hard to differentiate clusters based on
proximity. This undermines algorithms like K-means or
hierarchical clustering which depend on distance calculations.
2. Difficulty in Identifying Dense Regions:
Density-based algorithms (like DBSCAN) rely on finding dense
neighborhoods. When data is sparse due to high dimensionality,
these dense regions become less distinct, reducing clustering
effectiveness.
3. Overfitting and Noise Sensitivity:
High-dimensional data often contains irrelevant or noisy
features, which can mislead clustering algorithms into finding
spurious clusters or failing to find meaningful ones.
4. Increased Computational Cost:
More dimensions mean more calculations for distance, density,
or similarity, leading to slower clustering, especially for large
datasets.
Mitigation Strategies
• Dimensionality Reduction: Techniques like Principal
Component Analysis (PCA) or t-SNE reduce the number of
features while preserving important information, improving
clustering performance.
• Feature Selection: Selecting only relevant features helps reduce
noise and computational load.
• Using Appropriate Distance Measures: Some algorithms use
distance metrics better suited for high-dimensional data.
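The distance-concentration effect can be demonstrated with a short NumPy experiment; the sample sizes and dimensions below are arbitrary choices for illustration.

# As the dimensionality grows, the relative gap between a query point's
# nearest and farthest neighbors shrinks, so distances become less informative.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(1000, d))          # 1000 random points in [0, 1]^d
    q = rng.uniform(size=d)                  # a query point
    dists = np.linalg.norm(X - q, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast (max-min)/min = {contrast:.3f}")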

15. What is Principal Component Analysis (PCA)? Explain with steps.
Ans:- Principal Component Analysis (PCA) is a statistical technique
used for dimensionality reduction. It transforms a large set of correlated
variables into a smaller set of uncorrelated variables called principal
components, which retain most of the original data’s variability. PCA
helps simplify complex datasets, reduce noise, and improve
visualization and computational efficiency.
How PCA Works: Step-by-Step
1. Standardize the Data
Since variables may have different units and scales, PCA starts
by standardizing the data so each feature has a mean of 0 and a
standard deviation of 1. This ensures all variables contribute
equally to the analysis.
2. Compute the Covariance Matrix
Calculate the covariance matrix to understand how variables
vary together. This matrix captures the relationships between
pairs of variables.
3. Calculate Eigenvectors and Eigenvalues
Perform eigen decomposition on the covariance matrix to find
eigenvectors (directions of maximum variance) and eigenvalues
(amount of variance explained by each eigenvector). Each
eigenvector corresponds to a principal component.
4. Sort Eigenvectors by Eigenvalues
Rank the principal components based on their eigenvalues in
descending order. The first principal component explains the
largest variance, the second the next largest, and so on.
5. Project the Data onto Principal Components
Transform the original data onto the new coordinate system
defined by the top principal components. This reduces
dimensionality by keeping only the most significant
components.
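The steps translate directly into a short NumPy sketch; the random data and the choice of two retained components are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # 200 samples, 5 features

# Step 1: standardize (zero mean, unit variance per feature).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data.
cov = np.cov(Xs, rowvar=False)

# Step 3: eigenvectors (directions) and eigenvalues (variance explained).
eigvals, eigvecs = np.linalg.eigh(cov)              # eigh, since cov is symmetric

# Step 4: sort components by eigenvalue, largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project onto the top-k principal components.
k = 2
X_reduced = Xs @ eigvecs[:, :k]
print("Explained variance ratio:", (eigvals[:k] / eigvals.sum()).round(3))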

17. What is the difference between PCA and LDA? In what situations is LDA preferred?
Ans:- Comparison Table: PCA vs. LDA
• Type of learning: PCA (Principal Component Analysis) is unsupervised; LDA (Linear Discriminant Analysis) is supervised.
• Goal: PCA maximizes variance in the data; LDA maximizes class separability.
• Input requirement: PCA uses only features (no labels); LDA requires class labels.
• Projection direction: PCA projects along directions of maximum variance; LDA projects along directions that best separate the classes.
• Output components: PCA uses eigenvectors of the covariance matrix; LDA uses eigenvectors of the scatter matrices (within-class and between-class).
• Focus: PCA captures the overall data structure; LDA captures class discrimination.
• Best for: PCA suits data compression, visualization, and noise reduction; LDA suits classification and pattern recognition.
• Class information: Not considered by PCA; fully utilized by LDA.
• Resulting components: PCA components may not enhance class separability; LDA components are specifically designed to enhance it.
• Maximum number of components: PCA: at most the number of features; LDA: at most the number of classes minus 1.

When is LDA Preferred?


• When you have labeled data and the goal is to improve class
separability or classification accuracy.
• In supervised learning tasks where dimensionality reduction is
needed but preserving class discrimination is critical.
• When the dataset has multiple classes, and you want to find the
best linear combinations of features that separate these classes.
• For classification problems where maximizing the distance
between classes and minimizing variance within classes leads to
better predictive models.
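A minimal sketch of the contrast, assuming scikit-learn is installed: both methods reduce the four-feature Iris data to two components, but only LDA consumes the class labels.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                             # unsupervised: X only
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # supervised: X and y

print("PCA output shape:", X_pca.shape)
print("LDA output shape:", X_lda.shape)     # at most (n_classes - 1) components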
18. Explain the role of eigenvalues and eigenvectors in PCA.
Ans:- In Principal Component Analysis
(PCA), eigenvalues and eigenvectors play a central role in identifying
the principal components that capture the most important patterns in
the data.
Role of Eigenvectors in PCA
• Eigenvectors of the covariance matrix represent the directions
(axes) in the feature space along which the data varies the most.
• Each eigenvector defines a principal component, which is a new
axis formed by a linear combination of the original variables.
• These eigenvectors are orthogonal to each other, ensuring that the
principal components are uncorrelated.
• By projecting the original data onto these eigenvectors, PCA
transforms the data into a new coordinate system aligned with the
directions of maximum variance.
Role of Eigenvalues in PCA
• Each eigenvector has a corresponding eigenvalue that quantifies
the amount of variance (information) in the data explained by that
principal component.
• Larger eigenvalues indicate principal components that capture
more of the data’s variability.
• By ranking eigenvectors according to their eigenvalues (from
highest to lowest), PCA identifies the most significant
components to keep for dimensionality reduction.
• The proportion of total variance explained by each principal
component is given by the ratio of its eigenvalue to the sum of all
eigenvalues.
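A short NumPy sketch of this idea is given below; the synthetic two-dimensional data is an illustrative assumption.

# Each eigenvalue of the covariance matrix, divided by the sum of all
# eigenvalues, gives the share of total variance captured by its eigenvector
# (principal component); the eigenvectors themselves are orthogonal.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 0.8]], size=500)

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest variance first

print("Eigenvectors (component directions):\n", eigvecs.round(3))
print("Variance explained by each component:", (eigvals / eigvals.sum()).round(3))
print("Orthogonality check (should be ~0):", float(eigvecs[:, 0] @ eigvecs[:, 1]))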
Thank You!
