ML_DA

Machine learning, a subset of AI, enables computers to learn from data and make predictions through various techniques such as supervised, unsupervised, and reinforcement learning. The machine learning pipeline involves steps like data cleaning, feature engineering, and model evaluation, which are crucial for developing effective models. Key algorithms discussed include linear regression for continuous predictions and logistic regression for classification tasks, each with specific assumptions and applications.


Machine Learning

Machine learning is a subset of Artificial Intelligence (AI) that enables


computers to learn from data and make predictions without being
explicitly programmed.

Introduction to Machine Learning


Machine learning teaches computers to recognize patterns and make
decisions automatically using data and algorithms.
It can be broadly categorized into three types:
• Supervised Learning: Trains models on labeled data to predict
or classify new, unseen data.
• Unsupervised Learning: Finds patterns or groups in unlabeled
data, like clustering or dimensionality reduction.
• Reinforcement Learning: Learns through trial and error to
maximize rewards, ideal for decision-making tasks.
In addition to these categories, there are also Semi-Supervised Learning
and Self-Supervised Learning.
• Semi-Supervised Learning uses a mix of labeled and
unlabeled data, making it helpful when labeling data is costly
or time-consuming.
• Self-Supervised Learning creates its own labels from raw
data, allowing it to learn patterns without needing labeled
examples.
Machine Learning Pipeline
Machine learning is fundamentally built upon data, which serves as the foundation
for training and testing models. Data consists of inputs (features) and outputs
(labels). A model learns patterns during training and is tested on unseen data to
evaluate its performance and generalization. Before a machine learning model
can make predictions, the data must pass through several essential steps:
1. ML workflow
2. Data Cleaning
3. Feature Scaling
4. Data Preprocessing in Python
1) ML - Workflow
The machine learning lifecycle is a process that guides the development and
deployment of machine learning models in a structured way. It consists
of various steps, each of which plays a crucial role in ensuring the success
and effectiveness of the machine learning model. By following the
machine learning lifecycle we can solve complex problems, gain
data-driven insights and create scalable and sustainable models. The
steps are:
1. Problem Definition
2. Data Collection
3. Data Cleaning and Preprocessing
4. Exploratory Data Analysis (EDA)
5. Feature Engineering and Selection
6. Model Selection
7. Model Training
8. Model Evaluation and Tuning
9. Model Deployment
10. Model Monitoring and Maintenance
2) Data Cleaning
Data cleaning is an important step in the machine learning (ML) pipeline,
as it involves identifying and handling missing, duplicate or
irrelevant data. The goal of data cleaning is to ensure that the data is
accurate, consistent and free of errors, since raw data is often noisy,
incomplete and inconsistent, which can negatively impact the
accuracy of the model and the reliability of insights derived from it.

Python Implementation for Data Cleaning


Let’s understand each step of data cleaning using the Titanic dataset.
Below are the necessary steps:
• Import the necessary libraries
• Load the dataset
• Check the data information using df.info()
Data Inspection and Exploration
Let’s first understand the data by inspecting its structure, identifying
missing values, outliers and inconsistencies, and checking for duplicate
rows with the Python code below:
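Below is a minimal sketch of these inspection steps (the file name 'titanic.csv' is an assumed local copy of the Titanic dataset; adjust the path to wherever your copy lives):

import pandas as pd

# Load the dataset (assumed local CSV of the Titanic data).
df = pd.read_csv('titanic.csv')

# Structure: column names, dtypes and non-null counts.
df.info()

# Basic statistics for the numeric columns.
print(df.describe())

# Number of exact duplicate rows.
print(df.duplicated().sum())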
From the df.info() output we can see that Age and Cabin have
fewer non-null counts than the other columns. Some columns are categorical
with data type object, while others hold integer and float values.
Handling Missing Data
Missing data is a common issue in real-world datasets and it can occur
due to various reasons such as human errors, system failures or data
collection issues. Various techniques can be used to handle missing
data, such as imputation, deletion or substitution.
Let’s check the missing values column-wise. df.isnull() checks whether
each value is null and returns boolean values, sum() adds up the number of
null values in each column, and dividing by the total number of rows in the
dataset (then multiplying by 100) gives the percentage of missing values
per column.
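A minimal sketch of that computation, continuing with the df loaded above:

# Percentage of missing values per column.
missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct)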
3) Feature Engineering: Scaling, Normalization,
and Standardization

Feature Scaling is a technique to standardize the independent features present in


the data. It is performed during data pre-processing to handle features with
highly varying values. If feature scaling is not done, a machine learning
algorithm tends to treat larger numeric values as more important and smaller
values as less important, regardless of the unit of the values. For example,
it would treat 10 m and 10 cm as equivalent because it only sees the
number 10 and ignores the unit. In this article we will learn about the
different techniques used to perform feature scaling.

1. Absolute Maximum Scaling


This method of scaling requires two steps (a short sketch follows below):
1. First select the maximum absolute value out of all the
entries of a particular feature.
2. Then divide each entry of the column by this maximum
absolute value.
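A minimal sketch of absolute maximum scaling, using an invented toy matrix (scikit-learn's MaxAbsScaler implements the same idea):

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[10., 200.],
              [ 5., 100.],
              [ 2.,  50.]])          # toy feature matrix

# Steps 1 and 2: divide each column by its maximum absolute value.
X_scaled = X / np.max(np.abs(X), axis=0)
print(X_scaled)

# Equivalent result with scikit-learn.
print(MaxAbsScaler().fit_transform(X))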

2) Standardization
This method of scaling is basically based on the central tendencies and
variance of the data.
1. First we calculate the mean and standard deviation of
the data we would like to standardize.
2. Then we subtract the mean value from each
entry and divide the result by the standard deviation.
This helps us achieve a normal distribution of the data with a mean
equal to zero and a standard deviation equal to 1.
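A minimal sketch of standardization on the same kind of toy data, both manually and with scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10., 200.],
              [ 5., 100.],
              [ 2.,  50.]])                     # toy feature matrix

# Subtract the column mean and divide by the column standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Equivalent result with scikit-learn.
X_sklearn = StandardScaler().fit_transform(X)

print(X_manual)
print(X_sklearn)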
Why use Feature Scaling?
In machine learning feature scaling is used for number of purposes:
• Range: Scaling guarantees that all features are on a
comparable scale and have comparable ranges. This process is
known as feature normalisation. This is significant because the
magnitude of the features has an impact on many machine
learning techniques. Larger scale features may dominate the
learning process and have an excessive impact on the
outcomes.
• Algorithm performance improvement: When the features are
scaled, several machine learning methods, including gradient
descent-based algorithms, distance-based algorithms (such as k-
nearest neighbours) and support vector machines, perform
better or converge more quickly. Scaling the features helps
these algorithms converge toward the optimal solution rather
than being dominated by features with large ranges.
• Preventing numerical instability: Numerical instability can be
prevented by avoiding significant scale disparities between
features. Examples include distance calculations, where
features with differing scales can result in numerical
overflow or underflow problems. Scaling the features keeps
these computations stable.
• Equal importance: Scaling features makes sure that each
characteristic is given the same consideration during the
learning process. Without scaling, larger-scale features could
dominate the learning, producing skewed outcomes. This bias
is removed through scaling and each feature contributes fairly
to model predictions.
4) ML | Data Preprocessing in Python
Data preprocessing is an important step in data
science, transforming raw data into a clean, structured format for
analysis. It involves tasks like handling missing values,
normalizing data and encoding categorical variables. Mastering preprocessing
in Python ensures reliable insights for accurate predictions and
effective decision-making. Pre-processing refers to
the transformations applied to data before feeding it to the
algorithm, as in the short sketch below.
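A minimal preprocessing sketch (the tiny DataFrame and its column names are invented for illustration): missing values are imputed, numeric features are scaled and the categorical feature is one-hot encoded.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({'age':  [22, None, 35],
                   'fare': [7.25, 71.28, 8.05],
                   'sex':  ['male', 'female', 'male']})

numeric = Pipeline([('impute', SimpleImputer(strategy='mean')),
                    ('scale', StandardScaler())])
categorical = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                        ('encode', OneHotEncoder())])

preprocess = ColumnTransformer([('num', numeric, ['age', 'fare']),
                                ('cat', categorical, ['sex'])])

X = preprocess.fit_transform(df)   # clean, numeric matrix ready for an algorithm
print(X)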
Supervised Learning
Supervised learning algorithms are generally categorized into two
main types:
• Classification - where the goal is to predict discrete labels or
categories
• Regression - where the aim is to predict continuous numerical
values.
There are many algorithms used in supervised learning, each suited to
different types of problems. Some of the most commonly used
supervised learning algorithms include:
Linear Regression
Linear regression is a statistical method used to model the relationship
between a dependent variable and one or more independent variables. It
provides valuable insights for prediction and data analysis.
Linear regression is also a type of supervised machine-learning
algorithm that learns from labelled datasets and maps the data points
to the most optimized linear function, which can be used for prediction on
new datasets. It computes the linear relationship between the dependent
variable and one or more independent features by fitting a linear equation
to the observed data. It predicts a continuous output variable based on
the independent input variables.
For example, if we want to predict a house price we consider various factors
such as house age, distance from the main road, location, area and number
of rooms. Linear regression uses all these parameters to predict the house price,
as it assumes a linear relationship between these features and the price of the
house.

Why Linear Regression is Important?


The interpretability of linear regression is one of its greatest strengths. The
model’s equation offers clear coefficients that illustrate the influence of
each independent variable on the dependent variable, enhancing our
understanding of the underlying relationships. Its simplicity is a significant
advantage; linear regression is transparent, easy to implement, and serves
as a foundational concept for more advanced algorithms.

What is the best Fit Line?


Our primary objective while using linear regression is to locate the best-fit
line, which implies that the error between the predicted and actual values
should be kept to a minimum. There will be the least error in the best-fit
line.
The best Fit Line equation provides a straight line that represents the
relationship between the dependent and independent variables. The slope
of the line indicates how much the dependent variable changes for a unit
change in the independent variable(s).
Here Y is called a dependent or target variable and X is called an
independent variable also known as the predictor of Y. There are many
types of functions or modules that can be used for regression. A linear
function is the simplest type of function. Here, X may be a single feature
or multiple features representing the problem.
Linear regression performs the task of predicting a dependent variable value
(y) based on a given independent variable (x); hence the name linear
regression. In the figure above, X (input) is the work experience and Y
(output) is the salary of a person. The regression line is the best-fit line for
our model.
In linear regression some assumptions are made to ensure the reliability of the
model’s results.

Hypothesis function in Linear Regression

Assumptions are:

• Linearity: It assumes that there is a linear relationship between the


independent and dependent variables. This means that changes in
the independent variable lead to proportional changes in the
dependent variable.

• Independence: The observations should be independent of each
other; that is, the errors from one observation should not influence
another.
As discussed, our independent feature is the experience, X,
and the respective salary, Y, is the dependent variable. Assuming there
is a linear relationship between X and Y, the salary can be predicted
using:
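A common way to write this hypothesis (theta_1 and theta_2 are the intercept and slope parameters learned from the data):

\hat{Y} = \theta_1 + \theta_2 X

where \hat{Y} is the predicted salary and X is the years of experience.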
Use Case of Simple Linear Regression
• In a case study evaluating student performance, analysts use
simple linear regression to examine the relationship between
study hours and exam scores. By collecting data on the number
of hours students studied and their corresponding exam results,
the analysts developed a model that reveals a correlation: for each
additional hour spent studying, students' exam scores increased
by an average of 5 points. This case highlights the utility of simple
linear regression in understanding and improving academic
performance.
• Another case study focuses on marketing and sales, where
businesses use simple linear regression to forecast sales based
on historical data, particularly examining how factors like
advertising expenditure influence revenue. By collecting data on
past advertising spending and corresponding sales figures,
analysts develop a regression model that quantifies the relationship
between these variables. For instance, the analysis might reveal that
for every additional dollar spent on advertising, sales increase by
$10. This predictive capability enables companies to optimize
their advertising strategies and allocate resources effectively.
Assumptions of Multiple Linear Regression
For Multiple Linear Regression, the assumptions from Simple
Linear Regression apply. In addition to these, below are a few more:
1. No multicollinearity: There is no high correlation between the
independent variables. This indicates that there is little or no
correlation between the independent variables. Multicollinearity
occurs when two or more independent variables are highly
correlated with each other, which can make it difficult to
determine the individual effect of each variable on the dependent
variable. If there is multicollinearity, then multiple linear
regression will not be an accurate model.
2. Additivity: The model assumes that the effect of changes in a
predictor variable on the response variable is consistent
regardless of the values of the other variables. This assumption
implies that there is no interaction between variables in their
effects on the dependent variable.
3. Feature Selection: In multiple linear regression, it is essential to
carefully select the independent variables that will be included in
the model. Including irrelevant or redundant variables may lead
to overfitting and complicate the interpretation of the model.
4. Overfitting: Overfitting occurs when the model fits the training
data too closely, capturing noise or random fluctuations that do
not represent the true underlying relationship between variables.
This can lead to poor generalization performance on new, unseen
data.
Use Case of Multiple Linear Regression
Multiple linear regression allows us to analyze relationship between multiple
independent variables and a single dependent variable. Here are some use cases:
• Real Estate Pricing: In real estate MLR is used to predict property prices
based on multiple factors such as location, size, number of bedrooms, etc.
This helps buyers and sellers understand market trends and set
competitive prices.
• Financial Forecasting: Financial analysts use MLR to predict stock prices
or economic indicators based on multiple influencing factors such as
interest rates, inflation rates and market trends. This enables better
investment strategies and risk management.
• Agricultural Yield Prediction: Farmers can use MLR to estimate crop
yields based on several variables like rainfall, temperature, soil quality
and fertilizer usage. This information helps in planning agricultural
practices for optimal productivity.
• E-commerce Sales Analysis: An e-commerce company can utilize MLR
to assess how various factors such as product price, marketing
promotions and seasonal trends impact sales.

Evaluation Metrics for Linear Regression


A variety of evaluation measures can be used to determine the strength of
any linear regression model. These assessment metrics often give an
indication of how well the model is producing the observed outputs.
The most common measurements are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and the coefficient of determination (R²), as computed in the sketch below.
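A minimal sketch computing these metrics with scikit-learn (y_true and y_pred are invented values, not the output of a real model):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.6, 9.4])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                 # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)

print(mae, mse, rmse, r2)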
Logistic Regression

Logistic regression is a supervised machine learning algorithm used for classification


tasks where the goal is to predict the probability that an instance belongs to a given class or
not. Logistic regression is a statistical algorithm which analyzes the relationship between two
data factors. This article explores the fundamentals of logistic regression, its types and
implementations.

Logistic regression is used for binary classification where we use sigmoid function, that takes
input as independent variables and produces a probability value between 0 and 1.

For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an
input is greater than 0.5 (the threshold value), then it belongs to Class 1; otherwise it belongs to
Class 0. It’s referred to as regression because it is an extension of linear regression but is
mainly used for classification problems.

Key Points:

• Logistic regression predicts the output of a categorical dependent variable. Therefore,


the outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values
0 and 1, it gives probabilistic values which lie between 0 and 1.

• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).

Types of Logistic Regression

On the basis of the categories, Logistic Regression can be classified into three types:

1. Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.

2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible


unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”

3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as “low”, “Medium”, or “High”.

Assumptions of Logistic Regression

We will explore the assumptions of logistic regression, as understanding these assumptions is


important to ensure that we are applying the model appropriately. The assumptions
include:

1. Independent observations: Each observation is independent of the others, meaning
the observations do not influence one another.

2. Binary dependent variables: It takes the assumption that the dependent variable must
be binary or dichotomous, meaning it can take only two values. For more than two
categories SoftMax functions are used.

3. Linearity relationship between independent variables and log odds: The relationship
between the independent variables and the log odds of the dependent variable should
be linear.

4. No outliers: There should be no outliers in the dataset.

5. Large sample size: The sample size is sufficiently large

Understanding Sigmoid Function

So far, we’ve covered the basics of logistic regression, but now let’s focus on the most
important function that forms the core of logistic regression.

• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within a range of 0 and 1. The value of the
logistic regression must be between 0 and 1, which cannot go beyond this limit, so it
forms a curve like the “S” form.

• The S-form curve is called the Sigmoid function or the logistic function.

• In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Such as values above the threshold value tends to 1, and a
value below the threshold values tends to 0.

How does Logistic Regression work?

The logistic regression model transforms the continuous output of the linear regression
function into a categorical output using a sigmoid function, which maps any real-valued
combination of the independent variables into a value between 0 and 1. This function is known as the
logistic function.

Let the independent input features be:
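A common way to write this (assuming n input features x_1, ..., x_n with weights w_1, ..., w_n and a bias b):

z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = w \cdot x + b

\sigma(z) = \frac{1}{1 + e^{-z}}

P(y = 1 \mid x) = \sigma(w \cdot x + b)

Here \sigma(z) is the sigmoid (logistic) function described above, and the predicted probability is compared against the threshold (typically 0.5) to assign a class.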


Terminologies involved in Logistic Regression

Here are some common terms involved in logistic regression:

• Independent variables: The input characteristics or predictor factors applied to the


dependent variable’s predictions.

• Dependent variable: The target variable in a logistic regression model, which we are
trying to predict.

• Logistic function: The formula used to represent how the independent and dependent
variables relate to one another. The logistic function transforms the input variables
into a probability value between 0 and 1, which represents the likelihood of the
dependent variable being 1 or 0.

• Odds: It is the ratio of something occurring to something not occurring. it is different


from probability as the probability is the ratio of something occurring to everything
that could possibly occur.

• Log-odds: The log-odds, also known as the logit function, is the natural logarithm of
the odds. In logistic regression, the log odds of the dependent variable are modeled as
a linear combination of the independent variables and the intercept.

• Coefficient: The logistic regression model’s estimated parameters, which show how the
independent and dependent variables relate to one another.

• Intercept: A constant term in the logistic regression model, which represents the log
odds when all independent variables are equal to zero.

• Maximum likelihood estimation: The method used to estimate the coefficients of the
logistic regression model, which maximizes the likelihood of observing the data given
the model.

Decision Tree
A decision tree is a supervised learning algorithm used for
both classification and regression tasks. It models decisions as a tree-like
structure where internal nodes represent attribute tests, branches
represent attribute values, and leaf nodes represent final decisions or
predictions. Decision trees are versatile, interpretable, and widely used in
machine learning for predictive modeling.
Intuition behind the Decision Tree

Here’s an example to make it simple to understand the intuition of decision tree:

Imagine you’re deciding whether to buy an umbrella:

1. Step 1 – Ask a Question (Root Node):


Is it raining?
If yes, you might decide to buy an umbrella. If no, you move to the next question.

2. Step 2 – More Questions (Internal Nodes):


If it’s not raining, you might ask:
Is it likely to rain later?
If yes, you buy an umbrella; if no, you don’t.

3. Step 3 – Decision (Leaf Node):


Based on your answers, you either buy or skip the umbrella

1. Start with the Root Question (Age):

• The first question is: “Is the person’s age less than 15?”

• If Yes, move to the left.

• If No, move to the right.

2. Branch Based on Age:

• If the person is younger than 15, they are likely to enjoy computer games (+2
prediction score).
• If the person is 15 or older, ask the next question: “Is the person male?”

3. Branch Based on Gender (For Age 15+):

• If the person is male, they are somewhat likely to enjoy computer games (+0.1
prediction score).

• If the person is not male, they are less likely to enjoy computer games (-1
prediction score)

Example: Predicting Whether a Person Likes Computer Games Using Two Decision Trees
Tree 1: Age and Gender
1. The first tree asks two questions:
• “Is the person’s age less than 15?”
o If Yes, they get a score of +2.
o If No, proceed to the next question.
• “Is the person male?”
o If Yes, they get a score of +0.1.
o If No, they get a score of -1.
Tree 2: Computer Usage
1. The second tree focuses on daily computer usage:
• “Does the person use a computer daily?”
o If Yes, they get a score of +0.9.
o If No, they get a score of -0.9.
Combining Trees: Final Prediction
The final prediction score is the sum of scores from both trees
Information Gain and Gini Index in Decision Tree

Till now we have covered the basic intuition and approach of how a decision tree works, so
let’s move on to the attribute selection measures of decision trees.

We have two popular attribute selection measures used:

1. Information Gain

2. Gini Index
Building Decision Tree using Information Gain
The essentials:
• Start with all training instances associated with the root node
• Use info gain to choose which attribute to label each node with
• Note: No root-to-leaf path should contain the same discrete
attribute twice
• Recursively construct each subtree on the subset of training
instances that would be classified down that path in the tree.
• If all positive or all negative training instances remain, label
that node “yes” or “no” accordingly
• If no attributes remain, label with a majority vote of training
instances left at that node
• If no instances remain, label with a majority vote of the parent’s
training instances.
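A minimal sketch of entropy and information gain for a candidate split (the labels below are invented toy values):

import numpy as np

def entropy(labels):
    # Shannon entropy of a list of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Entropy of the parent minus the weighted entropy of the child subsets.
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ['yes', 'yes', 'yes', 'no', 'no']
children = [['yes', 'yes', 'yes'], ['no', 'no']]   # a candidate split on some attribute
print(information_gain(parent, children))          # about 0.971: the split is perfectly pure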
Example: Now, let us draw a Decision Tree for the following data using
Information gain. Training set: 3 features and 2 classes
2. Gini Index

• Gini Index is a metric to measure how often a randomly chosen element would be
incorrectly identified. It means an attribute with a lower Gini index should be
preferred.

• Sklearn supports the “gini” criterion for the Gini Index, and it is the default value of the criterion parameter.

For example, if we have a group of people where all bought the product (100% “Yes”), the
Gini Index is 0, indicating perfect purity. But if the group has an equal mix of “Yes” and “No”,
the Gini Index would be 0.5, showing higher impurity or uncertainty.

The Formula for Gini Index is given by :
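A common way to state the formula, where p_i is the proportion of samples belonging to class i at a node and C is the number of classes:

\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2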

Some additional features and characteristics of the Gini Index are:

• It is calculated by summing the squared probabilities of each outcome in a distribution


and subtracting the result from 1.

• A lower Gini Index indicates a more homogeneous or pure distribution, while a higher
Gini Index indicates a more heterogeneous or impure distribution.

• In decision trees, the Gini Index is used to evaluate the quality of a split by measuring
the difference between the impurity of the parent node and the weighted impurity of
the child nodes.

• Compared to other impurity measures like entropy, the Gini Index is faster to compute
and more sensitive to changes in class probabilities.

• One disadvantage of the Gini Index is that it tends to favour splits that create equally
sized child nodes, even if they are not optimal for classification accuracy.

• In practice, the choice between using the Gini Index or other impurity measures
depends on the specific problem and dataset, and often requires experimentation and
tuning.

Understanding Decision Tree with a Real-life Use Case:

Till now we have understood the attributes and components of a decision tree. Now let’s
jump to a real-life use case to see how a decision tree works step by step.

Step 1. Start with the Whole Dataset


We begin with all the data, which is treated as the root node of the decision tree.
Step 2. Choose the Best Question (Attribute)
Pick the best question to divide the dataset. For example, ask: “What is the outlook?”

• Possible answers: Sunny, Cloudy, or Rainy.

Step 3. Split the Data into Subsets


Divide the dataset into groups based on the question:

• If Sunny, go to one subset.

• If Cloudy, go to another subset.

• If Rainy, go to the last subset.

Step 4. Split Further if Needed (Recursive Splitting)


For each subset, ask another question to refine the groups. For example:

• If the Sunny subset is mixed, ask: “Is the humidity high or normal?”

o High humidity → “Swimming”.

o Normal humidity → “Hiking”.

Step 5. Assign Final Decisions (Leaf Nodes)


When a subset contains only one activity, stop splitting and assign it a label:

• Cloudy → “Hiking”.

• Rainy → “Stay Inside”.

• Sunny + High Humidity → “Swimming”.

• Sunny + Normal Humidity → “Hiking”.

Step 6. Use the Tree for Predictions


To predict an activity, follow the branches of the tree:

• Example: If the outlook is Sunny and the humidity is High, follow the tree:

o Start at Outlook.

o Take the branch for Sunny.

o Then go to Humidity and take the branch for High Humidity.

o Result: “Swimming”.
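A minimal sketch of this weather example with scikit-learn (the tiny dataset below is invented to mirror the walk-through, not a real dataset):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    'outlook':  ['Sunny', 'Sunny', 'Cloudy', 'Rainy', 'Rainy', 'Sunny'],
    'humidity': ['High', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'activity': ['Swimming', 'Hiking', 'Hiking', 'Stay Inside', 'Stay Inside', 'Swimming'],
})

enc = OrdinalEncoder()
X = enc.fit_transform(data[['outlook', 'humidity']])
y = data['activity']

tree = DecisionTreeClassifier(criterion='gini').fit(X, y)
print(export_text(tree, feature_names=['outlook', 'humidity']))

# Follow the branches for Sunny + High humidity.
query = pd.DataFrame([['Sunny', 'High']], columns=['outlook', 'humidity'])
print(tree.predict(enc.transform(query)))   # expected: ['Swimming']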
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. While it can handle regression problems, SVM is
particularly well-suited for classification tasks.

SVM aims to find the optimal hyperplane in an N-dimensional space to separate data points
into different classes. The algorithm maximizes the margin between the closest points of
different classes.

Support Vector Machine (SVM) Terminology

• Hyperplane: A decision boundary separating different classes in feature space,


represented by the equation wx + b = 0 in linear classification.

• Support Vectors: The closest data points to the hyperplane, crucial for determining
the hyperplane and margin in SVM.

• Margin: The distance between the hyperplane and the support vectors. SVM aims to
maximize this margin for better classification performance.

• Kernel: A function that maps data to a higher-dimensional space, enabling SVM to


handle non-linearly separable data.

• Hard Margin: A maximum-margin hyperplane that perfectly separates the data


without misclassifications.

• Soft Margin: Allows some misclassifications by introducing slack variables, balancing


margin maximization and misclassification penalties when data is not perfectly
separable.

• C: A regularization term balancing margin maximization and misclassification


penalties. A higher C value enforces a stricter penalty for misclassifications.

• Hinge Loss: A loss function penalizing misclassified points or margin violations,


combined with regularization in SVM.

• Dual Problem: Involves solving for Lagrange multipliers associated with support
vectors, facilitating the kernel trick and efficient computation.

How does Support Vector Machine Algorithm Work?

The key idea behind the SVM algorithm is to find the hyperplane that best separates two
classes by maximizing the margin between them. This margin is the distance from the
hyperplane to the nearest data points (support vectors) on each side.
Multiple hyperplanes separate the data from two classes

The best hyperplane, also known as the “hard margin,” is the one that maximizes the distance
between the hyperplane and the nearest data points from both classes. This ensures a clear
separation between the classes. So, from the above figure, we choose L2 as hard margin.

Let’s consider a scenario like shown below:

Selecting hyperplane for data with outlier

Here, we have one blue ball in the boundary of the red ball.

How does SVM classify the data?


It’s simple! The blue ball in the boundary of red ones is an outlier of blue balls. The SVM
algorithm has the characteristics to ignore the outlier and finds the best hyperplane that
maximizes the margin. SVM is robust to outliers.

Hyperplane which is the most optimized one

A soft margin allows for some misclassifications or violations of the margin to improve
generalization. The SVM optimizes the following equation to balance margin maximization
and penalty minimization:

Objective Function
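A common way to write the soft-margin objective (w and b are the hyperplane parameters, \zeta_i the slack variables, C the regularization term and y_i \in \{-1, +1\} the class labels):

\min_{w, b, \zeta} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \zeta_i, \ \zeta_i \ge 0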

What to do if data are not linearly separable?

When data is not linearly separable (i.e., it can’t be divided by a straight line), SVM uses a
technique called kernels to map the data into a higher-dimensional space where it becomes
separable. This transformation helps SVM find a decision boundary even for non-linear data.
Original 1D dataset for classification

A kernel is a function that maps data points into a higher-dimensional space without explicitly
computing the coordinates in that space. This allows SVM to work efficiently with non-linear
data by implicitly performing the mapping.

For example, consider data points that are not linearly separable. By applying a kernel
function, SVM transforms the data points into a higher-dimensional space where they become
linearly separable.

• Linear Kernel: For linear separability.

• Polynomial Kernel: Maps data into a polynomial space.

• Radial Basis Function (RBF) Kernel: Transforms data into a space based on distances
between data points.

Mapping 1D data to 2D to become able to separate the two classes

In this case, the new variable y is created as a function of distance from the origin.
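A minimal sketch of this idea with scikit-learn (the 1D points below are invented: the positive class sits in the middle of the line, so no single threshold separates it):

import numpy as np
from sklearn.svm import SVC

X = np.array([[-3.], [-2.], [-1.], [0.], [1.], [2.], [3.]])
y = np.array([0, 0, 1, 1, 1, 0, 0])            # middle points belong to class 1

linear = SVC(kernel='linear').fit(X, y)        # a straight threshold cannot separate this
rbf = SVC(kernel='rbf', gamma=1.0).fit(X, y)   # the RBF kernel maps the data implicitly

print(linear.score(X, y))   # stays below 1.0 on the training data
print(rbf.score(X, y))      # typically 1.0 on this toy data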
K-Nearest Neighbor(KNN)
K-Nearest Neighbors (KNN) is a simple way to classify things by looking
at what’s nearby. Imagine a streaming service wants to predict whether a new
user is likely to cancel their subscription (churn) based on their age.
It checks the ages of its existing users and whether they churned or
stayed. If most of the “K” users closest in age to the new user canceled their
subscription, KNN will predict that the new user might churn too. The key
idea is that users with similar ages tend to have similar behaviors, and
KNN uses this closeness to make decisions.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from
the training set immediately; instead it stores the dataset and performs an action on it at
the time of classification.

As an example, consider the following table of data points containing two features:

KNN Algorithm working visualization

The new point is classified as Category 2 because most of its closest neighbors are blue
squares. KNN assigns the category based on the majority of nearby points.

The image shows how KNN predicts the category of a new data point based on its closest
neighbours.

• The red diamonds represent Category 1 and the blue squares represent Category 2.

• The new data point checks its closest neighbours (circled points).

• Since the majority of its closest neighbours are blue squares (Category 2) KNN predicts
the new data point belongs to Category 2.

KNN works by using proximity and majority voting to make predictions.

What is ‘K’ in K Nearest Neighbour ?

In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the algorithm how
many nearby points (neighbours) to look at when it makes a decision.

Example:

Imagine you’re deciding which fruit it is based on its shape and size. You compare it to fruits
you already know.

• If k = 3, the algorithm looks at the 3 closest fruits to the new one.

• If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is
an apple because most of its neighbours are apples.

How to choose the value of k for KNN Algorithm?


The value of k is critical in KNN as it determines the number of neighbors to consider when
making predictions. Selecting the optimal value of k depends on the characteristics of the
input data. If the dataset has significant outliers or noise, a higher k can help smooth out the
predictions and reduce the influence of noisy data. However, choosing a very high value can
lead to underfitting, where the model becomes too simplistic.

Statistical Methods for Selecting k:

• Cross-Validation: A robust method for selecting the best k is cross-
validation. This involves splitting the data into several folds, training the model on some
folds and testing it on the remaining one, and repeating this for each fold. The
value of k that results in the highest average validation accuracy is usually the best
choice.

• Elbow Method: In the elbow method we plot the model’s error rate or accuracy for
different values of k. As we increase k the error usually decreases initially. However
after a certain point the error rate starts to decrease more slowly. This point where the
curve forms an “elbow” that point is considered as best k.

• Odd Values for k: It’s also recommended to choose an odd value for k especially in
classification tasks to avoid ties when deciding the majority class.

Distance Metrics Used in KNN Algorithm

KNN uses distance metrics to identify the nearest neighbours, and these neighbours are then
used for the classification or regression task. To identify the nearest neighbours we use the
distance metrics below:

1. Euclidean Distance

Euclidean distance is defined as the straight-line distance between two points in a plane or
space. You can think of it like the shortest path you would walk if you were to go directly from
one point to another.
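In two dimensions, for points (x_1, y_1) and (x_2, y_2), it is written as:

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}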
Step-by-Step explanation of how KNN works is discussed below:

Step 1: Selecting the optimal value of K

• K represents the number of nearest neighbors that needs to be considered while


making prediction.

Step 2: Calculating distance

• To measure the similarity between target and training data points Euclidean distance
is used. Distance is calculated between data points in the dataset and target point.

Step 3: Finding Nearest Neighbors

• The k data points with the smallest distances to the target point are nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression

• When you want to classify a data point into a category (like spam or not spam), the K-
NN algorithm looks at the K closest points in the dataset. These closest points are
called neighbors. The algorithm then looks at which category the neighbors belong to
and picks the one that appears the most. This is called majority voting.

• In regression, the algorithm still looks for the K closest points. But instead of voting for
a class in classification, it takes the average of the values of those K neighbors. This
average is the predicted value for the new point for the algorithm.
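A minimal sketch of these four steps with scikit-learn, reusing the streaming-service churn idea from the introduction (the ages and labels are invented):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

ages = np.array([[18], [22], [25], [30], [41], [45], [50]])   # feature: user age
churned = np.array([1, 1, 1, 0, 0, 0, 0])                     # label: 1 = churned

knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')  # Step 1: choose K
knn.fit(ages, churned)                                         # lazy learner: it just stores the data

new_user = np.array([[24]])
print(knn.predict(new_user))        # Steps 2-4: distances, 3 nearest neighbours, majority vote
print(knn.predict_proba(new_user))  # share of the neighbours in each class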

Random Forest Algorithm


A Random Forest is a collection of decision trees that work together to make predictions.
In this article, we'll explain how the Random Forest algorithm works and how to use it.

Understanding Intuition for Random Forest Algorithm

The Random Forest algorithm is a powerful tree-learning technique in Machine Learning: it
trains many decision trees and then combines their outputs by voting to make the final
prediction. Random forests are widely used for classification and regression tasks.

• It is a type of classifier that uses many decision trees to make predictions.

• It takes different random parts of the dataset to train each tree and then it combines
the results by averaging them. This approach helps improve the accuracy of
predictions. Random Forest is based on ensemble learning.

Imagine asking a group of friends for advice on where to go for vacation. Each friend gives
their recommendation based on their unique perspective and preferences (decision trees
trained on different subsets of data). You then make your final decision by considering the
majority opinion or averaging their suggestions (ensemble prediction).
As explained in the image: the process starts with a dataset consisting of rows (samples)
and their corresponding class labels (columns).

• Then - Multiple Decision Trees are created from the training data. Each tree is trained
on a random subset of the data (with replacement) and a random subset of features.
This process is known as bagging or bootstrap aggregating.

• Each Decision Tree in the ensemble learns to make predictions independently.

• When presented with a new, unseen instance, each Decision Tree in the ensemble
makes a prediction.

The final prediction is made by combining the predictions of all the Decision Trees. This is
typically done through a majority vote (for classification) or averaging (for regression).

Key Features of Random Forest

• Handles Missing Data: Automatically handles missing values during training,


eliminating the need for manual imputation.

• Algorithm ranks features based on their importance in making predictions offering


valuable insights for feature selection and interpretability.

• Scales Well with Large and Complex Data without significant performance
degradation.

• Algorithm is versatile and can be applied to both classification tasks (e.g., predicting
categories) and regression tasks (e.g., predicting continuous values).
How Random Forest Algorithm Works?

The Random Forest algorithm works in several steps (a minimal sketch follows after the list):

• Random Forest builds multiple decision trees using random samples of the data. Each
tree is trained on a different subset of the data which makes each tree unique.

• When creating each tree the algorithm randomly selects a subset of features or
variables to split the data rather than using all available features at a time. This adds
diversity to the trees.

• Each decision tree in the forest makes a prediction based on the data it was trained
on. When making final prediction random forest combines the results from all the
trees.

o For classification tasks the final prediction is decided by a majority vote. This
means that the category predicted by most trees is the final prediction.

o For regression tasks the final prediction is the average of the predictions from
all the trees.

• The randomness in data samples and feature selection helps to prevent the model
from overfitting making the predictions more accurate and reliable.
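A minimal sketch of this process with scikit-learn, using the built-in iris dataset rather than a problem from the text:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 trees, each trained on a bootstrap sample with a random subset of features per split.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))   # accuracy of the majority-vote prediction
print(forest.feature_importances_)    # the feature-importance ranking mentioned above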

Assumptions of Random Forest

• Each tree makes its own decisions: Every tree in the forest makes its own predictions
without relying on others.

• Random parts of the data are used: Each tree is built using random samples and
features to reduce mistakes.

• Enough data is needed: Sufficient data ensures the trees are different and learn
unique patterns and variety.

• Different predictions improve accuracy: Combining the predictions from different


trees leads to a more accurate final results.

Gradient Boosting
Gradient Boosting is a popular boosting algorithm in machine learning used
for classification and regression tasks. Boosting is one kind of ensemble
Learning method which trains the model sequentially and each new model
tries to correct the previous model.
Gradient Boosting is a powerful boosting algorithm that combines several
weak learners into strong learners, in which each new model is trained to
minimize the loss function such as mean squared error or cross-entropy of
the previous model using gradient descent. In each iteration, the algorithm
computes the gradient of the loss function with respect to the predictions of
the current ensemble and then trains a new weak model to minimize this
gradient. The predictions of the new model are then added to the ensemble,
and the process is repeated until a stopping criterion is met.
In contrast to AdaBoost, the weights of the training instances are not
tweaked, instead, each predictor is trained using the residual errors of the
predecessor as labels. There is a technique called the Gradient Boosted
Trees whose base learner is CART (Classification and Regression Trees). The
below diagram explains how gradient-boosted trees are trained for
regression problems.
The ensemble consists of M trees. Tree1 is trained using the feature
matrix X and the labels y. The predictions labeled y1(hat) are used to
determine the training set residual errors r1. Tree2 is then trained using
the feature matrix X and the residual errors r1 of Tree1 as labels. The
predicted results r1(hat) are then used to determine the residual r2. The
process is repeated until all the M trees forming the ensemble are trained.
There is an important parameter used in this technique known
as Shrinkage. Shrinkage refers to the fact that the prediction of each tree
in the ensemble is shrunk after it is multiplied by the learning rate (eta)
which ranges between 0 to 1. There is a trade-off between eta and the
number of estimators, decreasing learning rate needs to be compensated
with increasing estimators in order to reach certain model performance.
Since all trees are trained now, predictions can be made. Each tree predicts
a label and the final prediction is given by the formula,
y(pred) = y1 + (eta * r1) + (eta * r2) + ....... + (eta * rN)
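A minimal sketch of gradient-boosted regression trees with scikit-learn (the noisy sine data is invented; learning_rate plays the role of eta, the shrinkage described above):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))                    # toy inputs
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)     # noisy target

gbr = GradientBoostingRegressor(n_estimators=200,        # M trees in the ensemble
                                learning_rate=0.1,       # eta (shrinkage)
                                max_depth=2)
gbr.fit(X, y)

# Final prediction = initial estimate + eta * (sum of the trees' outputs).
print(gbr.predict([[2.5]]))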

Unsupervised Learning

The image shows a set of animals: elephants, camels and cows, representing the raw data
that the unsupervised learning algorithm will process.
• The “Interpretation” stage signifies that the algorithm doesn’t have predefined labels
or categories for the data. It needs to figure out how to group or organize the data
based on inherent patterns.

• The algorithm represents the core of the unsupervised learning process, using techniques like
clustering, dimensionality reduction, or anomaly detection to identify patterns and
structures in the data.

• The processing stage shows the algorithm working on the data.

The output shows the results of the unsupervised learning process. In this case, the
algorithm might have grouped the animals into clusters based on their species (elephants,
camels, cows).

How does unsupervised learning work?


Unsupervised learning works by analyzing unlabeled data to identify patterns
and relationships. The data is not labeled with any predefined categories or
outcomes, so the algorithm must find these patterns and relationships on its
own. This can be a challenging task, but it can also be very rewarding, as it
can reveal insights into the data that would not be apparent from a labeled
dataset.
The dataset in Figure A is mall data that contains information about the clients
who subscribe to the mall. Once subscribed, they are provided a membership
card and the mall has complete information about the customer and his/her
every purchase. Using this data and unsupervised learning techniques,
the mall can easily group clients based on the parameters we feed in.
K means Clustering
K-Means Clustering is an Unsupervised Machine Learning algorithm which groups the
unlabeled dataset into different clusters. The article aims to explore the fundamentals and
working of k means clustering along with its implementation.

Understanding K-means Clustering

K-means clustering is a technique used to organize data into groups based on their
similarity. For example, an online store may use K-Means to group customers based on purchase
frequency and spending, creating segments like Budget Shoppers, Frequent Buyers and
Big Spenders for personalised marketing.

The algorithm works by first randomly picking some central points called centroids and
each data point is then assigned to the closest centroid forming a cluster. After all the
points are assigned to a cluster the centroids are updated by finding the average position
of the points in each cluster. This process repeats until the centroids stop changing forming
clusters. The goal of clustering is to divide the data points into clusters so that similar data
points belong to same group.

How k-means clustering works?

We are given a data set of items with certain features and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the
K-means algorithm. ‘K’ in the name of the algorithm represents the number of
groups/clusters we want to classify our items into.

K means Clustering

The algorithm will categorize the items into k groups or clusters of similarity. To calculate
that similarity, we will use the Euclidean distance as a measurement. The algorithm works
as follows:
1. First, we randomly initialize k points, called means or cluster centroids.

2. We categorize each item to its closest mean, and we update the mean’s coordinates,
which are the averages of the items categorized in that cluster so far.

3. We repeat the process for a given number of iterations and at the end, we have our
clusters.

The “points” mentioned above are called means because they are the mean values of the
items categorized in them. To initialize these means, we have several options. An intuitive
method is to initialize the means at random items in the data set. Another method is to
initialize the means at random values within the boundaries of the data set. For example,
if a feature x takes values in [0, 3], we would initialize the means with values for x
in [0, 3]. A minimal sketch of the algorithm follows below.
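A minimal sketch of K-Means with scikit-learn on invented mall-style data (two features: spending score and purchase frequency):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[15,  2], [18,  3], [16,  2],    # budget shoppers
              [60, 20], [65, 22], [62, 25],    # frequent buyers
              [90,  8], [95, 10], [88,  9]])   # big spenders

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                    # cluster index assigned to each customer
print(kmeans.cluster_centers_)   # final centroids (mean position of each cluster)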

Limitations
K-means clustering is a useful algorithm for grouping data into clusters. However, it has a
few limitations that we need to be aware of. In this blog post, we’ll explore these
limitations in simple terms and discuss their impact on the clustering process.

1. Dependency on Initial Guess

When using K-means, we have to start by guessing the initial positions of the cluster
centers. The final clustering results can be affected by this initial guess. Sometimes, the
algorithm may not find the best solution, leading to less accurate clusters.

2. Sensitivity to Outliers

K-means treats all data points equally and can be sensitive to outliers, which are unusual
or extreme data points. Outliers can distort the clustering process, causing the algorithm
to create less reliable clusters. Handling outliers properly is important to get better results.

3. Assumption of Round Clusters

K-means assumes that clusters are round or spherical in shape and have roughly the same
size. However, in real-world data, clusters can have different shapes and sizes. K-means
may struggle to handle such irregular clusters, resulting in less accurate clusters. Other
algorithms like DBSCAN or Gaussian Mixture Models can handle more complex cluster
shapes.

4. Need to Know the Number of Clusters

With K-means, we have to tell the algorithm how many clusters we expect in the data. This
can be tricky, especially if we don’t have prior knowledge about the data. Choosing the
wrong number of clusters can lead to misleading results. Methods like the elbow method
or silhouette analysis can help estimate the appropriate number of clusters, but it’s still a
challenge.

5. Handling Large Datasets

When dealing with large datasets, K-means may become computationally expensive and
slow. As the number of data points increases, the algorithm’s efficiency decreases. For very
large datasets, alternative techniques like Mini-Batch K-means or distributed frameworks
can be used to handle the scaling issue.

Hierarchical Clustering
Hierarchical clustering is a technique used to group similar data points together based on
their similarity creating a hierarchy or tree-like structure. The key idea is to begin with
each data point as its own separate cluster and then progressively merge or split them
based on their similarity.

Lets understand this with the help of an example

Imagine you have four fruits with different weights: an apple (100g), a banana (120g), a
cherry (50g), and a grape (30g). Hierarchical clustering starts by treating each fruit as its
own group.

• It then merges the closest groups based on their weights.

• First, the cherry and grape are grouped together because they are the lightest.

• Next, the apple and banana are grouped together.

Finally, all the fruits are merged into one large group, showing how hierarchical clustering
progressively combines the most similar data points.

Getting Started with Dendrograms

A dendrogram is like a family tree for clusters. It shows how individual data points or
groups of data merge together. The bottom shows each data point as its own group, and
as you move up, similar groups are combined. The lower the merge point, the more similar
the groups are. It helps you see how things are grouped step by step.

The working of the dendrogram can be explained using the below diagram:
Types of Hierarchical Clustering

Now that we understand the basics of hierarchical clustering, let’s explore the two main
types of hierarchical clustering.

1. Agglomerative Clustering

2. Divisive clustering

Hierarchical Agglomerative Clustering

It is also known as the bottom-up approach or hierarchical agglomerative clustering


(HAC). Unlike flat clustering, hierarchical clustering provides a structured way to group
data, and this clustering algorithm does not require us to prespecify the number of clusters.
Bottom-up algorithms treat each data point as a singleton cluster at the outset and then
successively agglomerate pairs of clusters until all clusters have been merged into a single
cluster that contains all the data.
Hierarchical Agglomerative Clustering

Workflow for Hierarchical Agglomerative clustering

1. Start with individual points: Each data point is its own cluster. For example if you have
5 data points you start with 5 clusters each containing just one data point.

2. Calculate distances between clusters: Calculate the distance between every pair of
clusters. Initially since each cluster has one point this is the distance between the two
data points.

3. Merge the closest clusters: Identify the two clusters with the smallest distance and
merge them into a single cluster.

4. Update distance matrix: After merging you now have one less cluster. Recalculate the
distances between the new cluster and the remaining clusters.

5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance
matrix until you have only one cluster left.

6. Create a dendrogram: As the process continues you can visualize the merging of
clusters using a tree-like diagram called a dendrogram. It shows the hierarchy of how
clusters are merged.
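A minimal sketch of the fruit-weight example using SciPy (the weights match the example above; 'average' linkage corresponds to the group-average distance discussed later):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

weights = np.array([[100], [120], [50], [30]])   # apple, banana, cherry, grape
Z = linkage(weights, method='average')           # agglomerative merging, bottom-up

dendrogram(Z, labels=['apple', 'banana', 'cherry', 'grape'])
plt.show()   # the two light fruits and the two heavy fruits merge first, then everything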

Hierarchical Divisive clustering

It is also known as the top-down approach. This algorithm also does not require us to
prespecify the number of clusters. Top-down clustering starts with a single cluster
containing the whole dataset and proceeds by splitting clusters recursively
until individual data points have been split into singleton clusters.

Workflow for Hierarchical Divisive clustering :


1. Start with all data points in one cluster: Treat the entire dataset as a single large
cluster.

2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically
done by finding the two most dissimilar points in the cluster and using them to
separate the data into two parts.

3. Repeat the process: For each of the new clusters, repeat the splitting process:

1. Choose the cluster with the most dissimilar points.

2. Split it again into two smaller clusters.

4. Stop when each data point is in its own cluster: Continue this process until every data
point is its own cluster, or the stopping condition (such as a predefined number of
clusters) is met.

Hierarchical Divisive clustering

Computing Distance Matrix

While merging two clusters we check the distance between every pair of clusters
and merge the pair with the least distance/most similarity. But how is
that distance determined? There are different ways of defining inter-cluster
distance/similarity. Some of them are:

1. Min Distance: Find the minimum distance between any two points of the cluster.

2. Max Distance: Find the maximum distance between any two points of the cluster.

3. Group Average: Find the average distance between every two points of the clusters.

4. Ward’s Method: The similarity of two clusters is based on the increase in squared error
when two clusters are merged.
Introduction to Dimensionality Reduction
There are several techniques for dimensionality reduction, including principal
component analysis (PCA), singular value decomposition (SVD), and linear
discriminant analysis (LDA). Each technique uses a different method to project the data
onto a lower-dimensional space while preserving important information.

Let’s take an example to explain this better:

Imagine you are building a machine learning model to predict house prices based on
features like the number of bedrooms, square footage, location, age of the
house, number of bathrooms, and so on. If you have too many features like additional
ones for each room’s condition, flooring type, or neighborhood amenities, your
dataset can become very large and complex.

Before Dimensionality Reduction

With too many features, your model may become slow to train, and it might also pick
up unnecessary details or noise. For example, if the flooring type doesn’t significantly
impact house prices, it might lead the model to make less accurate predictions,
especially when the data is noisy or when there are many irrelevant features.

How Dimensionality Reduction Works?

Lets understand how dimensionality Reduction is used with the help of the figure
below:

On the left, data points exist in a 3D space (X, Y, Z), but the Z-dimension appears
unnecessary since the data primarily varies along the X and Y axes. The goal of
dimensionality reduction is to remove less important dimensions without losing
valuable information.
On the right, after reducing the dimensionality, the data is represented in lower-
dimensional spaces. The top plot (X-Y) maintains the meaningful structure, while the
bottom plot (Z-Y) shows that the Z-dimension contributed little useful information.

This process makes data analysis more efficient, improving computation speed and
visualization while minimizing redundancy.

What is Feature selection and Feature Extraction?

Till now, we have discussed Dimensionality Reduction and how it helps in reducing the
number of features while preserving important information. Now, let’s explore two key
approaches to achieving this: Feature Selection and Feature Extraction

Feature Selection

Feature selection chooses the most relevant features from the dataset without
altering them. It helps remove redundant or irrelevant features, improving model
efficiency. There are several methods for feature selection including filter methods,
wrapper methods, and embedded methods.

• Filter methods rank the features based on their relevance to the target
variable.

• Wrapper methods use the model performance as the criteria for selecting
features.

• Embedded methods combine feature selection with the model training


process.

Feature Extraction

Feature extraction involves creating new features by combining or transforming the


original features. Several methods for feature extraction were stated in the
introductory part (PCA, SVD and LDA), and they are responsible for creating and transforming the
features. PCA is a popular technique that projects the original features onto a lower-
dimensional space while preserving as much of the variance as possible, as in the
sketch below.
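A minimal sketch of feature extraction with PCA in scikit-learn, using the built-in iris dataset (four original features projected onto two components):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # (150, 2): 4 features reduced to 2 components
print(pca.explained_variance_ratio_)    # share of variance kept by each component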

Advantages of Dimensionality Reduction

As seen earlier, high dimensionality makes models inefficient. Let’s now summarize
the key advantages of reducing dimensionality.
• Faster Computation: With fewer features, machine learning algorithms can
process data more quickly. This results in faster model training and testing, which
is particularly useful when working with large datasets.

• Better Visualization: As we saw in the earlier figure, reducing dimensions makes it


easier to visualize data, revealing hidden patterns.

• Prevent Overfitting: With fewer features, models are less likely to memorize the
training data and overfit. This helps the model generalize better to new, unseen
data, improving its ability to make accurate predictions.

Disadvantages of Dimensionality Reduction

• Data Loss & Reduced Accuracy – Some important information may be lost during
dimensionality reduction, potentially affecting model performance.

• Interpretability Challenges – The transformed features (e.g., principal


components) may not have clear meanings, making it harder to understand
relationships in the original data.

• Choosing the Right Components – Deciding how many dimensions to keep is


difficult, as keeping too few may lose valuable information, while keeping too many
can lead to overfitting.
