ML_DA
2) Standardization
This method of scaling is based on the central tendency (mean) and variance (standard deviation) of the data.
1. First, we calculate the mean and standard deviation of the data we would like to standardize.
2. Then, we subtract the mean value from each entry and divide the result by the standard deviation.
This rescales the data so that it has a mean equal to zero and a standard deviation equal to 1.
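A minimal sketch of standardization using NumPy (the feature values below are hypothetical; scikit-learn's StandardScaler performs the same computation on full feature matrices):

```python
import numpy as np

# Hypothetical 1-D feature (e.g., house areas in square metres)
x = np.array([50.0, 60.0, 80.0, 100.0, 120.0])

# Step 1: compute the mean and standard deviation
mean = x.mean()
std = x.std()

# Step 2: subtract the mean and divide by the standard deviation
x_standardized = (x - mean) / std

print(x_standardized.mean())  # approximately 0
print(x_standardized.std())   # approximately 1
```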
Why use Feature Scaling?
In machine learning, feature scaling is used for a number of purposes:
• Range: Scaling guarantees that all features are on a comparable scale and have comparable ranges. This process is known as feature normalisation. This is significant because the magnitude of the features has an impact on many machine learning techniques. Larger-scale features may dominate the learning process and have an excessive impact on the outcomes.
• Algorithm performance improvement: When the features are scaled, several machine learning methods, including gradient descent-based algorithms, distance-based algorithms (such as k-nearest neighbours) and support vector machines, perform better or converge more quickly. Scaling the features can enhance an algorithm's performance by preventing it from converging slowly or settling on a suboptimal outcome.
• Preventing numerical instability: Numerical instability can be prevented by avoiding significant scale disparities between features. Examples include distance calculations, where features with very different scales can cause numerical overflow or underflow problems. Scaling the features keeps these computations stable.
• Equal importance: Scaling features makes sure that each feature is given the same consideration during the learning process. Without scaling, larger-scale features could dominate the learning, producing skewed outcomes. Scaling removes this bias so that each feature contributes fairly to model predictions.
4) ML | Data Preprocessing in Python
Data preprocessing is an important step in data science that transforms raw data into a clean, structured format for analysis. It involves tasks like handling missing values, normalizing data and encoding categorical variables. Mastering preprocessing in Python ensures reliable insights for accurate predictions and effective decision-making. Pre-processing refers to the transformations applied to data before feeding it to the algorithm.
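A minimal sketch of these preprocessing steps with pandas and scikit-learn (the column names and values below are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a missing value and a categorical column
df = pd.DataFrame({
    "age":    [25, 32, None, 41],
    "salary": [50000, 64000, 58000, 72000],
    "city":   ["Delhi", "Mumbai", "Delhi", "Chennai"],
})

# Handle missing values: fill the numeric gap with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Encode the categorical variable as one-hot columns
df = pd.get_dummies(df, columns=["city"])

# Standardize the numeric features
scaler = StandardScaler()
df[["age", "salary"]] = scaler.fit_transform(df[["age", "salary"]])

print(df.head())
```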
Supervised Learning
Supervised learning algorithms are generally categorized into two
main types:
• Classification - where the goal is to predict discrete labels or
categories
• Regression - where the aim is to predict continuous numerical
values.
There are many algorithms used in supervised learning, each suited to
different types of problems. Some of the most commonly used
supervised learning algorithms include:
Linear Regression
Linear regression is a statistical method used to model the relationship
between a dependent variable and one or more independent variables. It
provides valuable insights for prediction and data analysis.
Linear regression is also a type of supervised machine-learning algorithm that learns from labelled datasets and maps the data points to the most optimized linear function, which can then be used for prediction on new datasets. It computes the linear relationship between the dependent variable and one or more independent features by fitting a linear equation to the observed data. It predicts continuous output variables based on the independent input variables.
For example, if we want to predict house prices we consider various factors such as house age, distance from the main road, location, area and number of rooms. Linear regression uses all these parameters to predict the house price, as it assumes a linear relationship between these features and the price of the house.
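A minimal sketch of linear regression with scikit-learn (the house features and prices below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: [house age (years), area (sq. m), number of rooms]
X = np.array([
    [5, 120, 3],
    [10, 90, 2],
    [2, 150, 4],
    [20, 80, 2],
])
# Hypothetical house prices
y = np.array([75, 55, 95, 40])

model = LinearRegression()
model.fit(X, y)

# Predict the price of a new house: 8 years old, 100 sq. m, 3 rooms
print(model.predict([[8, 100, 3]]))
print(model.coef_, model.intercept_)
```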
Logistic Regression
Logistic regression is used for binary classification where we use sigmoid function, that takes
input as independent variables and produces a probability value between 0 and 1.
For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), it belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.
Key Points:
• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog" or "sheep".
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "Medium", or "High".
Key assumptions of Logistic Regression include:
1. Binary dependent variable: It is assumed that the dependent variable is binary or dichotomous, meaning it can take only two values. For more than two categories, the SoftMax function is used instead.
2. Linear relationship between independent variables and log odds: The relationship between the independent variables and the log odds of the dependent variable should be linear.
So far, we’ve covered the basics of logistic regression, but now let’s focus on the most
important function that forms the core of logistic regression.
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within the range of 0 and 1. Since the output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, the function forms a curve like the "S" shape.
• The S-shaped curve is called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend towards 1, and values below the threshold tend towards 0.
The logistic regression model transforms the continuous value output of the linear regression function into a categorical value output using a sigmoid function, which maps any real-valued combination of the independent variable inputs into a value between 0 and 1. This function is known as the logistic function.
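A minimal sketch of the sigmoid mapping and the 0.5 threshold rule described above (the linear scores z are hypothetical):

```python
import numpy as np

def sigmoid(z):
    # Map any real value z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w·x + b) for three inputs
z = np.array([-2.0, 0.3, 4.0])

probs = sigmoid(z)                   # roughly [0.12, 0.57, 0.98]
labels = (probs > 0.5).astype(int)   # threshold at 0.5 -> [0, 1, 1]

print(probs, labels)
```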
• Dependent variable: The target variable in a logistic regression model, which we are
trying to predict.
• Logistic function: The formula used to represent how the independent and dependent
variables relate to one another. The logistic function transforms the input variables
into a probability value between 0 and 1, which represents the likelihood of the
dependent variable being 1 or 0.
• Log-odds: The log-odds, also known as the logit function, is the natural logarithm of
the odds. In logistic regression, the log odds of the dependent variable are modeled as
a linear combination of the independent variables and the intercept.
• Coefficients: The logistic regression model’s estimated parameters, which show how the independent and dependent variables relate to one another.
• Intercept: A constant term in the logistic regression model, which represents the log
odds when all independent variables are equal to zero.
• Maximum likelihood estimation: The method used to estimate the coefficients of the
logistic regression model, which maximizes the likelihood of observing the data given
the model
Decision Tree
A decision tree is a supervised learning algorithm used for
both classification and regression tasks. It models decisions as a tree-like
structure where internal nodes represent attribute tests, branches
represent attribute values, and leaf nodes represent final decisions or
predictions. Decision trees are versatile, interpretable, and widely used in
machine learning for predictive modeling.
Intuition behind the Decision Tree
• The first question is: “Is the person’s age less than 15?”
• If the person is younger than 15, they are likely to enjoy computer games (+2
prediction score).
• If the person is 15 or older, ask the next question: “Is the person male?”
• If the person is male, they are somewhat likely to enjoy computer games (+0.1
prediction score).
• If the person is not male, they are less likely to enjoy computer games (-1
prediction score)
Example: Predicting Whether a Person Likes Computer Games Using Two Decision Trees
Tree 1: Age and Gender
1. The first tree asks two questions:
• “Is the person’s age less than 15?”
o If Yes, they get a score of +2.
o If No, proceed to the next question.
• “Is the person male?”
o If Yes, they get a score of +0.1.
o If No, they get a score of -1.
Tree 2: Computer Usage
1. The second tree focuses on daily computer usage:
• “Does the person use a computer daily?”
o If Yes, they get a score of +0.9.
o If No, they get a score of -0.9.
Combining Trees: Final Prediction
The final prediction score is the sum of the scores from both trees.
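A minimal sketch of how the two trees' scores combine for one person (the rules and scores are exactly those listed above; the example person is hypothetical):

```python
def tree1_score(age, is_male):
    # Tree 1: age and gender
    if age < 15:
        return 2.0
    return 0.1 if is_male else -1.0

def tree2_score(uses_computer_daily):
    # Tree 2: daily computer usage
    return 0.9 if uses_computer_daily else -0.9

# Hypothetical person: a 20-year-old male who uses a computer daily
score = tree1_score(age=20, is_male=True) + tree2_score(uses_computer_daily=True)
print(score)  # 0.1 + 0.9 = 1.0
```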
Information Gain and Gini Index in Decision Tree
So far we have covered the basic intuition and approach of how a decision tree works, so let's move on to the attribute selection measures of a decision tree:
1. Information Gain
2. Gini Index
Building a Decision Tree using Information Gain. The essentials:
• Start with all training instances associated with the root node
• Use info gain to choose which attribute to label each node with
• Note: No root-to-leaf path should contain the same discrete
attribute twice
• Recursively construct each subtree on the subset of training
instances that would be classified down that path in the tree.
• If all positive or all negative training instances remain, label that node “yes” or “no” accordingly
• If no attributes remain, label with a majority vote of training
instances left at that node
• If no instances remain, label with a majority vote of the parent’s
training instances.
Example: Now, let us draw a Decision Tree for the following data using
Information gain. Training set: 3 features and 2 classes
2. Gini Index
• Gini Index is a metric to measure how often a randomly chosen element would be
incorrectly identified. It means an attribute with a lower Gini index should be
preferred.
• Sklearn supports the "gini" criterion for the Gini Index, and it uses "gini" by default.
For example, if we have a group of people where all bought the product (100% “Yes”), the
Gini Index is 0, indicating perfect purity. But if the group has an equal mix of “Yes” and “No”,
the Gini Index would be 0.5, showing higher impurity or uncertainty.
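A minimal sketch computing the Gini Index for the two groups described above (the "Yes"/"No" purchase labels are hypothetical), using Gini = 1 minus the sum of squared class proportions:

```python
from collections import Counter

def gini_index(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions p_i
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure_group = ["Yes", "Yes", "Yes", "Yes"]   # everyone bought the product
mixed_group = ["Yes", "No", "Yes", "No"]    # equal mix of Yes and No

print(gini_index(pure_group))   # 0.0 -> perfect purity
print(gini_index(mixed_group))  # 0.5 -> maximum impurity for two classes
```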
• A lower Gini Index indicates a more homogeneous or pure distribution, while a higher
Gini Index indicates a more heterogeneous or impure distribution.
• In decision trees, the Gini Index is used to evaluate the quality of a split by measuring
the difference between the impurity of the parent node and the weighted impurity of
the child nodes.
• Compared to other impurity measures like entropy, the Gini Index is faster to compute
and more sensitive to changes in class probabilities.
• One disadvantage of the Gini Index is that it tends to favour splits that create equally
sized child nodes, even if they are not optimal for classification accuracy.
• In practice, the choice between using the Gini Index or other impurity measures
depends on the specific problem and dataset, and often requires experimentation and
tuning.
So far we have understood the attributes and components of a decision tree. Now let's jump to a real-life use case to see how a decision tree works step by step.
• Cloudy → “Hiking”.
• If the Sunny subset is mixed, ask: “Is the humidity high or normal?”
• Example: If the outlook is Sunny and the humidity is High, follow the tree:
o Start at Outlook.
o Check the humidity.
o Result: “Swimming”.
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. While it can handle regression problems, SVM is
particularly well-suited for classification tasks.
SVM aims to find the optimal hyperplane in an N-dimensional space to separate data points
into different classes. The algorithm maximizes the margin between the closest points of
different classes.
• Support Vectors: The closest data points to the hyperplane, crucial for determining
the hyperplane and margin in SVM.
• Margin: The distance between the hyperplane and the support vectors. SVM aims to
maximize this margin for better classification performance.
• Dual Problem: Involves solving for Lagrange multipliers associated with support
vectors, facilitating the kernel trick and efficient computation.
The key idea behind the SVM algorithm is to find the hyperplane that best separates two
classes by maximizing the margin between them. This margin is the distance from the
hyperplane to the nearest data points (support vectors) on each side.
Multiple hyperplanes separate the data from two classes
The best hyperplane, also known as the “hard margin,” is the one that maximizes the distance
between the hyperplane and the nearest data points from both classes. This ensures a clear
separation between the classes. So, from the figure referenced above, we would choose the hyperplane labelled L2 as the hard margin.
However, when an outlier is present, for example a blue ball lying within the boundary of the red balls, a hard margin can no longer separate the classes perfectly.
A soft margin allows for some misclassifications or violations of the margin to improve
generalization. The SVM optimizes the following equation to balance margin maximization
and penalty minimization:
Objective Function
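For reference, the standard soft-margin SVM objective (with slack variables ζ_i measuring margin violations and a penalty parameter C) is:

```latex
\min_{w,\,b,\,\zeta}\ \ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\zeta_i
\quad \text{subject to} \quad
y_i\,(w^{\top}x_i + b) \ge 1 - \zeta_i,\qquad \zeta_i \ge 0
```

Minimizing ||w||^2 maximizes the margin, while the C-weighted sum of slack variables penalizes misclassifications and margin violations.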
When data is not linearly separable (i.e., it can’t be divided by a straight line), SVM uses a
technique called kernels to map the data into a higher-dimensional space where it becomes
separable. This transformation helps SVM find a decision boundary even for non-linear data.
Original 1D dataset for classification
A kernel is a function that maps data points into a higher-dimensional space without explicitly
computing the coordinates in that space. This allows SVM to work efficiently with non-linear
data by implicitly performing the mapping.
For example, consider data points that are not linearly separable. By applying a kernel
function, SVM transforms the data points into a higher-dimensional space where they become
linearly separable.
• Radial Basis Function (RBF) Kernel: Transforms data into a space based on distances
between data points.
In this case, the new variable y is created as a function of distance from the origin.
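A minimal sketch of an SVM with an RBF kernel in scikit-learn (the toy non-linear data, an inner cluster surrounded by an outer ring, is hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical non-linearly separable data: an inner cluster (class 0)
# surrounded by an outer ring of points (class 1)
inner = rng.normal(0, 0.3, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
outer = np.c_[2 * np.cos(angles), 2 * np.sin(angles)] + rng.normal(0, 0.1, (50, 2))

X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# The RBF kernel implicitly maps points into a higher-dimensional space
# based on distances, where the two classes become linearly separable.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print(clf.predict([[0.1, 0.0], [2.0, 0.0]]))  # expected: [0, 1]
```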
K-Nearest Neighbor(KNN)
K-Nearest Neighbors (KNN) is a simple way to classify things by looking
at what’s nearby. Imagine a streaming service wants to predict if a new
user is likely to cancel their subscription (churn) based on their age.
It checks the ages of its existing users and whether they churned or stayed. If most of the “K” users closest in age to the new user cancelled their subscription, KNN will predict that the new user might churn too. The key idea is that users with similar ages tend to have similar behaviours, and KNN uses this closeness to make decisions.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
As an example, consider the following table of data points containing two features:
The new point is classified as Category 2 because most of its closest neighbors are blue
squares. KNN assigns the category based on the majority of nearby points.
The image shows how KNN predicts the category of a new data point based on its closest
neighbours.
• The red diamonds represent Category 1 and the blue squares represent Category 2.
• The new data point checks its closest neighbours (circled points).
• Since the majority of its closest neighbours are blue squares (Category 2) KNN predicts
the new data point belongs to Category 2.
In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the algorithm how
many nearby points (neighbours) to look at when it makes a decision.
Example:
Imagine you’re deciding which fruit a new fruit is, based on its shape and size, by comparing it to the k = 3 most similar fruits you already know.
• If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple because most of its neighbours are apples.
• Cross-Validation: A robust method for selecting the best k is to perform k-fold cross-validation. This involves splitting the data into several subsets (folds), training the model on some folds, testing it on the remaining ones, and repeating this for each fold. The value of k that results in the highest average validation accuracy is usually the best choice.
• Elbow Method: In the elbow method we plot the model’s error rate or accuracy for different values of k. As we increase k, the error usually decreases initially. However, after a certain point the error rate starts to decrease more slowly. The point where the curve forms an “elbow” is considered the best k.
• Odd Values for k: It’s also recommended to choose an odd value for k especially in
classification tasks to avoid ties when deciding the majority class.
KNN uses distance metrics to identify the nearest neighbours; these neighbours are then used for the classification or regression task. To identify the nearest neighbours we use distance metrics such as the following:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or
space. You can think of it like the shortest path you would walk if you were to go directly from
one point to another.
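A minimal sketch of the Euclidean distance between two points (the coordinates are hypothetical):

```python
import math

def euclidean_distance(p, q):
    # Straight-line distance: square root of the sum of squared differences
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Hypothetical 2-D points
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```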
Step-by-Step explanation of how KNN works is discussed below:
• To measure the similarity between the target and training data points, Euclidean distance is used. The distance is calculated between each data point in the dataset and the target point.
• The k data points with the smallest distances to the target point are nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
• When you want to classify a data point into a category (like spam or not spam), the K-
NN algorithm looks at the K closest points in the dataset. These closest points are
called neighbors. The algorithm then looks at which category the neighbors belong to
and picks the one that appears the most. This is called majority voting.
• In regression, the algorithm still looks for the K closest points. But instead of voting for
a class in classification, it takes the average of the values of those K neighbors. This
average is the predicted value for the new point for the algorithm.
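A minimal sketch of KNN classification with scikit-learn (the ages and churn labels below are hypothetical, following the streaming-service example above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: user age -> churned (1) or stayed (0)
X = np.array([[18], [22], [25], [30], [35], [40], [45], [50]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

# k = 3: look at the 3 users closest in age and take a majority vote
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[24]]))  # likely [1]: most of the nearest users churned
print(knn.predict([[42]]))  # likely [0]: most of the nearest users stayed
```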
Random Forest
Random Forest is based on ensemble learning. It takes different random parts of the dataset to train each tree and then combines the results by averaging them. This approach helps improve the accuracy of predictions.
Imagine asking a group of friends for advice on where to go for vacation. Each friend gives
their recommendation based on their unique perspective and preferences (decision trees
trained on different subsets of data). You then make your final decision by considering the
majority opinion or averaging their suggestions (ensemble prediction).
As explained in the image: the process starts with a dataset with rows and their corresponding class labels (columns).
• Then - Multiple Decision Trees are created from the training data. Each tree is trained
on a random subset of the data (with replacement) and a random subset of features.
This process is known as bagging or bootstrap aggregating.
• When presented with a new, unseen instance, each Decision Tree in the ensemble
makes a prediction.
The final prediction is made by combining the predictions of all the Decision Trees. This is
typically done through a majority vote (for classification) or averaging (for regression).
• Scales Well with Large and Complex Data without significant performance
degradation.
• Algorithm is versatile and can be applied to both classification tasks (e.g., predicting
categories) and regression tasks (e.g., predicting continuous values).
How Random Forest Algorithm Works?
• Random Forest builds multiple decision trees using random samples of the data. Each
tree is trained on a different subset of the data which makes each tree unique.
• When creating each tree the algorithm randomly selects a subset of features or
variables to split the data rather than using all available features at a time. This adds
diversity to the trees.
• Each decision tree in the forest makes a prediction based on the data it was trained
on. When making final prediction random forest combines the results from all the
trees.
o For classification tasks the final prediction is decided by a majority vote. This
means that the category predicted by most trees is the final prediction.
o For regression tasks the final prediction is the average of the predictions from
all the trees.
• The randomness in data samples and feature selection helps to prevent the model
from overfitting making the predictions more accurate and reliable.
• Each tree makes its own decisions: Every tree in the forest makes its own predictions
without relying on others.
• Random parts of the data are used: Each tree is built using random samples and
features to reduce mistakes.
• Enough data is needed: Sufficient data ensures the trees are different and learn
unique patterns and variety.
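A minimal sketch of a Random Forest classifier in scikit-learn (the dataset is a hypothetical toy set; in practice you would plug in your own features and labels):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 samples, 4 features, binary labels
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 100 trees, each trained on a bootstrap sample with random feature subsets
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# Final prediction = majority vote across all trees
print(forest.score(X_test, y_test))
```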
Gradient Boosting
Gradient Boosting is a popular boosting algorithm in machine learning used
for classification and regression tasks. Boosting is one kind of ensemble
Learning method which trains the model sequentially and each new model
tries to correct the previous model.
Gradient Boosting is a powerful boosting algorithm that combines several
weak learners into strong learners, in which each new model is trained to
minimize the loss function such as mean squared error or cross-entropy of
the previous model using gradient descent. In each iteration, the algorithm
computes the gradient of the loss function with respect to the predictions of
the current ensemble and then trains a new weak model to minimize this
gradient. The predictions of the new model are then added to the ensemble,
and the process is repeated until a stopping criterion is met.
In contrast to AdaBoost, the weights of the training instances are not
tweaked, instead, each predictor is trained using the residual errors of the
predecessor as labels. There is a technique called the Gradient Boosted
Trees whose base learner is CART (Classification and Regression Trees). The
below diagram explains how gradient-boosted trees are trained for
regression problems.
The ensemble consists of M trees. Tree1 is trained using the feature
matrix X and the labels y. The predictions labeled y1(hat) are used to
determine the training set residual errors r1. Tree2 is then trained using
the feature matrix X and the residual errors r1 of Tree1 as labels. The
predicted results r1(hat) are then used to determine the residual r2. The
process is repeated until all the M trees forming the ensemble are trained.
There is an important parameter used in this technique known as Shrinkage. Shrinkage refers to the fact that the prediction of each tree in the ensemble is shrunk by multiplying it by the learning rate (eta), which ranges between 0 and 1. There is a trade-off between eta and the number of estimators: decreasing the learning rate needs to be compensated with an increasing number of estimators in order to reach a certain model performance.
Since all trees are trained now, predictions can be made. Each tree predicts
a label and the final prediction is given by the formula,
y(pred) = y1 + (eta * r1) + (eta * r2) + ... + (eta * rN)
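A minimal sketch of gradient-boosted trees for regression, where each new tree is fitted to the residuals of the current ensemble and its prediction is shrunk by the learning rate eta (the toy data is hypothetical; scikit-learn's GradientBoostingRegressor implements the same idea):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 100)

eta = 0.1        # learning rate (shrinkage)
n_trees = 50
prediction = np.zeros_like(y)
trees = []

for _ in range(n_trees):
    residuals = y - prediction            # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                # new weak learner fits the residuals
    prediction += eta * tree.predict(X)   # shrink and add its contribution
    trees.append(tree)

def predict(X_new):
    return sum(eta * t.predict(X_new) for t in trees)

print(predict(np.array([[1.5]])))  # roughly sin(1.5), i.e. close to 1.0
```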
Unsupervised Learning
The image shows a set of animals: elephants, camels, and cows, representing the raw data that the unsupervised learning algorithm will process.
• The “Interpretation” stage signifies that the algorithm doesn’t have predefined labels
or categories for the data. It needs to figure out how to group or organize the data
based on inherent patterns.
• The algorithm stage represents the core of the unsupervised learning process, using techniques like clustering, dimensionality reduction, or anomaly detection to identify patterns and structures in the data.
The output shows the results of the unsupervised learning process. In this case, the
algorithm might have grouped the animals into clusters based on their species (elephants,
camels, cows).
K-means clustering is a technique used to organize data into groups based on their
similarity. For example, an online store uses K-Means to group customers based on purchase frequency and spending, creating segments like Budget Shoppers, Frequent Buyers and Big Spenders for personalised marketing.
The algorithm works by first randomly picking some central points called centroids and
each data point is then assigned to the closest centroid forming a cluster. After all the
points are assigned to a cluster the centroids are updated by finding the average position
of the points in each cluster. This process repeats until the centroids stop changing, forming the final clusters. The goal of clustering is to divide the data points into clusters so that similar data points belong to the same group.
We are given a data set of items with certain features and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the
K-means algorithm. ‘K’ in the name of the algorithm represents the number of
groups/clusters we want to classify our items into.
K-means Clustering
The algorithm will categorize the items into k groups or clusters of similarity. To calculate
that similarity, we will use the Euclidean distance as a measurement. The algorithm works
as follows:
1. First, we randomly initialize k points, called means or cluster centroids.
2. We categorize each item to its closest mean, and we update the mean’s coordinates,
which are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our
clusters.
The “points” mentioned above are called means because they are the mean values of the
items categorized in them. To initialize these means, we have a lot of options. An intuitive
method is to initialize the means at random items in the data set. Another method is to
initialize the means at random values between the boundaries of the data set. For example, if a feature x takes values in [0, 3], we will initialize the means with values for x drawn from [0, 3].
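A minimal sketch of K-Means in scikit-learn (the customer data, two features per customer, is hypothetical and mirrors the online-store example above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [purchases per month, average spend]
X = np.array([
    [1, 20], [2, 25], [1, 15],      # budget shoppers
    [8, 30], [9, 35], [10, 28],     # frequent buyers
    [3, 200], [4, 250], [2, 220],   # big spenders
])

# k = 3: initialize 3 centroids, assign each point to the nearest centroid,
# update each centroid to the mean of its cluster, and repeat until stable
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index for each customer
print(kmeans.cluster_centers_)  # final centroid positions
```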
Limitations
K-means clustering is a useful algorithm for grouping data into clusters. However, it has a
few limitations that we need to be aware of. Below, we explore these limitations in simple terms and discuss their impact on the clustering process.
1. Sensitivity to Initialization
When using K-means, we have to start by guessing the initial positions of the cluster centers. The final clustering results can be affected by this initial guess. Sometimes, the algorithm may not find the best solution, leading to less accurate clusters.
2. Sensitivity to Outliers
K-means treats all data points equally and can be sensitive to outliers, which are unusual
or extreme data points. Outliers can distort the clustering process, causing the algorithm
to create less reliable clusters. Handling outliers properly is important to get better results.
3. Assumption of Spherical, Equally Sized Clusters
K-means assumes that clusters are round or spherical in shape and have roughly the same size. However, in real-world data, clusters can have different shapes and sizes. K-means may struggle to handle such irregular clusters, resulting in less accurate clusters. Other algorithms like DBSCAN or Gaussian Mixture Models can handle more complex cluster shapes.
4. Choosing the Number of Clusters
With K-means, we have to tell the algorithm how many clusters we expect in the data. This can be tricky, especially if we don’t have prior knowledge about the data. Choosing the wrong number of clusters can lead to misleading results. Methods like the elbow method or silhouette analysis can help estimate the appropriate number of clusters, but it’s still a challenge.
5. Scalability with Large Datasets
When dealing with large datasets, K-means may become computationally expensive and slow. As the number of data points increases, the algorithm’s efficiency decreases. For very large datasets, alternative techniques like Mini-Batch K-means or distributed frameworks can be used to handle the scaling issue.
Hierarchical Clustering
Hierarchical clustering is a technique used to group similar data points together based on their similarity, creating a hierarchy or tree-like structure. The key idea is to begin with
each data point as its own separate cluster and then progressively merge or split them
based on their similarity.
Imagine you have four fruits with different weights: an apple (100g), a banana (120g), a
cherry (50g), and a grape (30g). Hierarchical clustering starts by treating each fruit as its
own group.
• First, the cherry and grape are grouped together because they are the lightest.
• Next, the apple and banana are grouped together because their weights are close.
• Finally, all the fruits are merged into one large group, showing how hierarchical clustering progressively combines the most similar data points.
A dendrogram is like a family tree for clusters. It shows how individual data points or
groups of data merge together. The bottom shows each data point as its own group, and
as you move up, similar groups are combined. The lower the merge point, the more similar
the groups are. It helps you see how things are grouped step by step.
The working of the dendrogram can be explained using the below diagram:
Types of Hierarchical Clustering
Now that we understand the basics of hierarchical clustering, let’s explore the two main
types of hierarchical clustering.
1. Agglomerative Clustering
2. Divisive clustering
Agglomerative clustering (the bottom-up approach) works as follows:
1. Start with individual points: Each data point is its own cluster. For example, if you have 5 data points you start with 5 clusters, each containing just one data point.
2. Calculate distances between clusters: Calculate the distance between every pair of
clusters. Initially since each cluster has one point this is the distance between the two
data points.
3. Merge the closest clusters: Identify the two clusters with the smallest distance and
merge them into a single cluster.
4. Update distance matrix: After merging you now have one less cluster. Recalculate the
distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance
matrix until you have only one cluster left.
6. Create a dendrogram: As the process continues you can visualize the merging of
clusters using a tree-like diagram called a dendrogram. It shows the hierarchy of how
clusters are merged.
Divisive clustering is also known as a top-down approach. This algorithm does not require us to prespecify the number of clusters. Top-down clustering requires a method for splitting a cluster that contains the whole data and proceeds by splitting clusters recursively until each individual data point has been split into a singleton cluster. It works as follows:
1. Start with one cluster: Begin with all the data points in a single cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically
done by finding the two most dissimilar points in the cluster and using them to
separate the data into two parts.
3. Repeat the process: For each of the new clusters, repeat the splitting process:
4. Stop when each data point is in its own cluster: Continue this process until every data
point is its own cluster, or the stopping condition (such as a predefined number of
clusters) is met.
While merging two clusters we check the distance between every pair of clusters and merge the pair with the least distance/most similarity. But how is that distance determined? There are different ways of defining inter-cluster distance/similarity. Some of them are:
1. Min Distance: Find the minimum distance between any two points of the cluster.
2. Max Distance: Find the maximum distance between any two points of the cluster.
3. Group Average: Find the average distance between every two points of the clusters.
4. Ward’s Method: The similarity of two clusters is based on the increase in squared error
when two clusters are merged.
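A minimal sketch of agglomerative clustering with SciPy, using the hypothetical fruit weights from the earlier example and Ward's method as the linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical fruit weights in grams: apple, banana, cherry, grape
weights = np.array([[100.0], [120.0], [50.0], [30.0]])

# Build the merge hierarchy; "ward" merges the pair that gives the
# smallest increase in squared error
Z = linkage(weights, method="ward")

# Cut the tree into 2 clusters: {apple, banana} and {cherry, grape}
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would draw the tree-like merge structure (requires matplotlib)
```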
Introduction to Dimensionality Reduction
There are several techniques for dimensionality reduction, including principal
component analysis (PCA), singular value decomposition (SVD), and linear
discriminant analysis (LDA). Each technique uses a different method to project the data
onto a lower-dimensional space while preserving important information.
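A minimal sketch of one of these techniques, PCA, with scikit-learn (the 3-D toy data is hypothetical; most of its variation lies along two directions, as in the figure discussed later):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical 3-D data: X and Y vary a lot, Z is mostly noise
xy = rng.normal(0, 5, size=(100, 2))
z = rng.normal(0, 0.1, size=(100, 1))
data = np.hstack([xy, z])

# Project the 3 original dimensions onto 2 principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

print(reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)  # almost all variance kept by 2 components
```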
Imagine you are building a machine learning model to predict house prices based on
features like the number of bedrooms, square footage, location, age of the
house, number of bathrooms, and so on. If you have too many features like additional
ones for each room’s condition, flooring type, or neighborhood amenities, your
dataset can become very large and complex.
With too many features, your model may become slow to train, and it might also pick
up unnecessary details or noise. For example, if the flooring type doesn’t significantly
impact house prices, it might lead the model to make less accurate predictions,
especially when the data is noisy or when there are many irrelevant features.
Let's understand how dimensionality reduction is used with the help of the figure described below:
On the left, data points exist in a 3D space (X, Y, Z), but the Z-dimension appears
unnecessary since the data primarily varies along the X and Y axes. The goal of
dimensionality reduction is to remove less important dimensions without losing
valuable information.
On the right, after reducing the dimensionality, the data is represented in lower-
dimensional spaces. The top plot (X-Y) maintains the meaningful structure, while the
bottom plot (Z-Y) shows that the Z-dimension contributed little useful information.
This process makes data analysis more efficient, improving computation speed and visualization while minimizing redundancy.
So far, we have discussed dimensionality reduction and how it helps in reducing the number of features while preserving important information. Now, let's explore two key approaches to achieving this: Feature Selection and Feature Extraction.
Feature Selection
Feature selection chooses the most relevant features from the dataset without
altering them. It helps remove redundant or irrelevant features, improving model
efficiency. There are several methods for feature selection including filter methods,
wrapper methods, and embedded methods.
• Filter methods rank the features based on their relevance to the target variable.
• Wrapper methods use the model performance as the criterion for selecting features.
• Embedded methods perform feature selection as part of the model training process itself, for example through Lasso regularization.
Feature Extraction
Feature extraction, in contrast, transforms the original features into a new, smaller set of features that still capture most of the important information; PCA, which projects the data onto its principal components, is a common example. As seen earlier, high dimensionality makes models inefficient. Let's now summarize the key advantages of reducing dimensionality.
• Faster Computation: With fewer features, machine learning algorithms can
process data more quickly. This results in faster model training and testing, which
is particularly useful when working with large datasets.
• Prevent Overfitting: With fewer features, models are less likely to memorize the
training data and overfit. This helps the model generalize better to new, unseen
data, improving its ability to make accurate predictions.
However, dimensionality reduction also comes with a drawback:
• Data Loss & Reduced Accuracy: Some important information may be lost during dimensionality reduction, potentially affecting model performance.