machine learning
Here are some regression algorithms:
Linear Regression
Polynomial Regression
Ridge Regression
Lasso Regression
Decision tree
Random Forest
Advantages of Supervised Machine Learning
Supervised learning models can achieve high accuracy because they are trained on labelled data.
The decision-making process of many supervised learning models is interpretable.
Pre-trained supervised models can often be reused, which saves time and resources compared with
developing new models from scratch.
Disadvantages of Supervised Machine Learning
It can only learn patterns that are present in the training data and may struggle with unseen
or unexpected patterns.
It can be time-consuming and costly because it relies on labeled data.
It may generalize poorly to new data.
Applications:
Image classification
Natural language processing (NLP)
Speech recognition
Recommendation system
Predictive analytics
Fraud detection
Medical diagnosis, etc.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an
algorithm discovers patterns and relationships using unlabeled data. Unlike supervised learning,
unsupervised learning doesn’t involve providing the algorithm with labeled target outputs. The
primary goal of Unsupervised learning is often to discover hidden patterns, similarities, or clusters
within the data, which can then be used for various purposes, such as data exploration, visualization,
dimensionality reduction, and more.
There are two main categories of unsupervised learning that are mentioned below:
Clustering
Association
Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This technique
is useful for identifying patterns and relationships in data without the need for labeled examples.
Here are some clustering algorithms:
K-Means Clustering algorithm
Mean-shift algorithm
DBSCAN Algorithm
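Principal Component Analysis and Independent Component Analysis are often listed alongside these
algorithms, but they are dimensionality reduction techniques rather than clustering algorithms.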
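The grouping idea behind K-Means can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the points, the starting centroids, and the fixed iteration count are all assumptions chosen for the example.

```python
def k_means(points, centroids, iterations=10):
    """Minimal K-Means sketch: assign points to the nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

No labels are used anywhere: the grouping comes only from distances between points, which is what makes this an unsupervised method.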
Association
Association rule learning is a technique for discovering relationships between items in a dataset. It
identifies rules that indicate the presence of one item implies the presence of another item with a
specific probability.
Here are some association rule learning algorithms:
Apriori Algorithm
Eclat
FP-growth Algorithm
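Association rules are usually scored by support (how often the items occur together) and confidence (the estimated probability of the consequent given the antecedent). The sketch below computes both for one rule; the transactions are made-up data and the function names are illustrative, not taken from any library.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) = support(A and C) / support(A)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


# Rule: bread -> milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Algorithms such as Apriori differ mainly in how efficiently they search for itemsets whose support exceeds a chosen threshold; the scoring itself is as simple as shown here.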
Advantages of Unsupervised Machine Learning
It helps to discover hidden patterns and various relationships between the data.
Used for tasks such as customer segmentation, anomaly detection, and data exploration.
It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning
Without labels, it can be difficult to evaluate the quality of the model’s output.
The resulting clusters may be hard to interpret and may not correspond to meaningful groups.
It has techniques such as autoencoders and dimensionality reduction that can be used to
extract meaningful features from raw data.
Applications of Unsupervised Learning
Clustering
Anomaly detection
Market basket analysis
Recommendation systems
Image segmentation
Density estimation
Data preprocessing
4. Reinforcement Machine Learning
Reinforcement learning is a learning method in which an agent interacts with an environment by
taking actions and learning from the resulting rewards or errors. Trial-and-error search and delayed
reward are its most relevant characteristics. The model keeps improving its performance by using
reward feedback to learn a behavior or pattern. These algorithms are typically tailored to a
particular problem, for example Google's self-driving car, or AlphaGo, in which a bot competes
against humans and against itself to become a better and better player of the game of Go. Each
interaction adds to the experience the agent learns from, so the more it is trained, the more
experienced it becomes.
Here are some of the most common reinforcement learning algorithms:
Q-learning
SARSA (State-Action-Reward-State-Action)
Deep Q-learning
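As a rough illustration of how Q-learning uses reward feedback, the sketch below trains a tabular agent on a toy five-state corridor. The environment, the reward of 1 at the goal, and the values of ALPHA, GAMMA and EPSILON are all assumptions made for this example.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (terminal)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # assumed learning rate, discount, exploration rate


def step(state, action):
    """Toy environment: reward 1 only when the goal state is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: explore sometimes (and on ties), otherwise act greedily.
            if rng.random() < EPSILON or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Q-learning update: bootstrap from the best action in the next state.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q_table = train()
greedy_policy = [0 if q[0] > q[1] else 1 for q in q_table[:-1]]
print(greedy_policy)
```

SARSA differs only in the target: it uses the value of the action actually taken in the next state instead of the maximum over actions.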
Types of Reinforcement Machine Learning
There are two main types of reinforcement learning:
Positive reinforcement
Rewards the agent for taking a desired action.
Encourages the agent to repeat the behavior.
Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct answer.
Negative reinforcement
Removes an undesirable stimulus to encourage a desired behavior.
Also encourages the agent to repeat the behavior, because the behavior makes the unpleasant stimulus stop.
Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty by
completing a task.
Advantages of Reinforcement Machine Learning
It supports autonomous decision-making and is well suited to tasks that require learning a
sequence of decisions, such as robotics and game playing.
It is preferred for achieving long-term results that are otherwise very difficult to achieve.
It can be used to solve complex problems that cannot be solved by conventional techniques.
Disadvantages of Reinforcement Machine Learning
Training reinforcement learning agents can be computationally expensive and time-consuming.
Reinforcement learning is not preferable for solving simple problems.
It needs a lot of data and computation, which can make it impractical and costly.
Applications of Reinforcement Machine Learning
Game playing
Robotics
Recommendation systems
Healthcare
Natural language processing (NLP)
Finance and trading
Virtual reality (VR) and augmented reality (AR)
Design a Learning System in Machine Learning
According to Arthur Samuel, machine learning enables a machine to learn automatically from data,
improve its performance with experience, and make predictions without being explicitly programmed.
In simple words, when we feed training data to a machine learning algorithm, the algorithm produces
a mathematical model, and with the help of that model the machine makes predictions and takes
decisions without being explicitly programmed. Also, the more training data the machine works with,
the more experience it gains and the more accurate its results become.
According to Tom Mitchell, “A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.”
Example: In Spam E-Mail detection,
Task, T: To classify mails into Spam or Not Spam.
Performance measure, P: Percentage of mails correctly classified as “Spam” or “Not Spam”.
Experience, E: A set of mails labelled “Spam” or “Not Spam”.
Steps for designing a learning system:
Step 1- Choosing the Training Experience: The first and most important task is to choose
the training data or training experience that will be fed to the machine learning algorithm.
The data or experience we feed to the algorithm has a significant impact on the success or
failure of the model, so it should be chosen wisely. Three attributes of the training
experience matter:
The first attribute is whether the training experience provides direct or indirect feedback
regarding the choices made.
The second attribute is the degree to which the learner controls the sequence of training
examples.
The third attribute is how well the training experience represents the distribution of
examples over which performance will be measured.
Step 2- Choosing the Target Function: The next step is choosing the target function, that is,
deciding exactly what type of knowledge will be learned. For example, in a game-playing program
the system may learn a NextMove function that describes which of the legal moves should be taken.
Step 3- Choosing a Representation for the Target Function: Once the algorithm knows all the
possible legal moves, the next step is to choose a representation that it will use to pick the
best move, for example linear equations, a hierarchical graph representation, or a tabular form.
Step 4- Choosing a Function Approximation Algorithm: The best move cannot be chosen from the
training data alone. The algorithm works through a set of training examples, uses them to
approximate which moves should be chosen, and then receives feedback on those choices.
Step 5- Final Design: The final design emerges after the system has worked through many
examples, with their successes and failures and correct and incorrect decisions, and has
learned what the next step should be.
ML | Find S Algorithm
Introduction :
The Find-S algorithm is a basic concept learning algorithm in machine learning. It finds the
most specific hypothesis that fits all the positive examples. Note that the algorithm considers
only the positive training examples. Find-S starts with the most specific hypothesis and
generalizes it each time it fails to classify an observed positive training example. Hence, the
Find-S algorithm moves from the most specific hypothesis toward more general hypotheses.
Important Representation :
2. Take the next example and if it is negative, then no changes occur to the hypothesis.
3. If the example is positive and we find that our current hypothesis is too specific, then we
update the hypothesis to a more general condition.
4. Keep repeating the above steps until all the training examples have been processed.
5. After all the training examples have been processed, we have the final hypothesis, which we
can use to classify new examples.
Example :
Consider the following data set having the data about which particular seeds are poisonous.
First, we take the hypothesis to be the most specific hypothesis. Since each example has four
attributes, our hypothesis would be :
h = {ϕ, ϕ, ϕ, ϕ}
Consider example 1 :
The data in example 1 is { GREEN, HARD, NO, WRINKLED }. Our initial hypothesis is too specific
to cover this positive example, so we have to generalize it. Hence, the hypothesis
becomes :
h = { GREEN, HARD, NO, WRINKLED }
Consider example 2 :
Here we see that this example has a negative outcome. Hence we neglect this example and our
hypothesis remains the same.
h = {GREEN, HARD, NO, WRINKLED}
Consider example 3 :
Here we see that this example has a negative outcome. Hence we neglect this example and our
hypothesis remains the same.
h = {GREEN, HARD, NO, WRINKLED}
Consider example 4:
The data present in example 4 is {ORANGE, HARD, NO, WRINKLED}. We compare every
single attribute with the current hypothesis, and if any mismatch is found we replace that
particular attribute with a general case ("?"). After doing the process the hypothesis becomes :
h = {?, HARD, NO, WRINKLED }
Consider example 5:
The data present in example 5 is {GREEN, SOFT, YES, SMOOTH}. We compare every single
attribute with the current hypothesis, and if any mismatch is found we replace that particular
attribute with a general case ("?"). After doing the process the hypothesis becomes :
h = { ?, ?, ?, ? }
Since we have reached a point where all the attributes in our hypothesis have the general
condition, example 6 and example 7 would result in the same hypothesis with all general
attributes.
h = { ?, ?, ?, ? }
Hence, for the given data the final hypothesis would be :
Final Hypothesis: h = { ?, ?, ?, ? }
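The walk-through above can be reproduced with a short sketch of Find-S. The three positive examples are the ones quoted in the text (examples 1, 4 and 5); the attribute values of the negative examples are not given here, so the two negative rows below are hypothetical placeholders, which is harmless because Find-S ignores negative examples.

```python
def find_s(examples):
    """Return the most specific hypothesis consistent with the positive examples.

    `examples` is a list of (attribute_tuple, is_positive) pairs.
    None stands for the all-phi hypothesis; "?" matches any value.
    """
    hypothesis = None  # most specific hypothesis: matches nothing yet
    for attributes, is_positive in examples:
        if not is_positive:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(attributes)  # the first positive example is copied as-is
            continue
        for i, value in enumerate(attributes):
            if hypothesis[i] != value:
                hypothesis[i] = "?"  # generalize the mismatching attribute
    return hypothesis


examples = [
    (("GREEN", "HARD", "NO", "WRINKLED"), True),     # example 1 (from the text)
    (("GREEN", "HARD", "YES", "SMOOTH"), False),     # hypothetical negative example
    (("BROWN", "SOFT", "NO", "WRINKLED"), False),    # hypothetical negative example
    (("ORANGE", "HARD", "NO", "WRINKLED"), True),    # example 4 (from the text)
    (("GREEN", "SOFT", "YES", "SMOOTH"), True),      # example 5 (from the text)
]

after_example_4 = find_s(examples[:4])
final_hypothesis = find_s(examples)
print(after_example_4)
print(final_hypothesis)
```

Because negative examples are never consulted, Find-S cannot detect that a hypothesis such as { ?, ?, ?, ? } also covers negative cases; that limitation is visible in this data set.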
The linear regression model can be written as:
$Y = a_0 + a_1 X + \varepsilon$
Here,
Y = Dependent variable (target variable)
X = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error
The values of the X and Y variables come from the training dataset used to fit the linear regression model.
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the
X-axis, then such a relationship is termed a positive linear relationship.
Cost function-
o Different values of the weights or coefficients (a0, a1) give different regression lines,
and the cost function is used to estimate the coefficient values for the best-fit line.
o The cost function is used to optimize the regression coefficients or weights. It measures
how well a linear regression model is performing.
o We can use the cost function to assess the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as the hypothesis
function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average
of the squared errors between the predicted values and the actual values. For the above linear
equation, MSE can be calculated as:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - (a_1 x_i + a_0)\right)^2$$
Where,
N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value
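To make the cost function concrete, here is a small sketch that evaluates the MSE of a candidate line, using the same symbols a0 and a1 for the intercept and slope. The data points are made up for illustration.

```python
def mean_squared_error(xs, ys, a0, a1):
    """Average squared difference between actual values and predictions a1*x + a0."""
    n = len(xs)
    return sum((y - (a1 * x + a0)) ** 2 for x, y in zip(xs, ys)) / n


# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect_fit = mean_squared_error(xs, ys, a0=0.0, a1=2.0)   # predictions match exactly
shifted_fit = mean_squared_error(xs, ys, a0=1.0, a1=2.0)   # every prediction is off by 1
print(perfect_fit, shifted_fit)
```

A lower MSE means the line sits closer to the data, so comparing the cost of different (a0, a1) pairs is how the best-fit line is chosen.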
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
o It starts from randomly selected coefficient values and then iteratively updates them
to reach the minimum of the cost function.
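The update rule can be sketched as follows for the single-variable model with intercept a0 and slope a1. The learning rate, step count and data are assumptions for the example, and the coefficients start at zero rather than at random values so that the run is reproducible.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = a1*x + a0 by repeatedly stepping against the gradient of the MSE."""
    a0, a1 = 0.0, 0.0  # fixed start (assumption); random starting values also work
    n = len(xs)
    for _ in range(steps):
        errors = [(a1 * x + a0) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(errors^2) w.r.t. a0 and a1.
        grad_a0 = 2.0 / n * sum(errors)
        grad_a1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a0, a1 = gradient_descent(xs, ys)
print(round(a0, 4), round(a1, 4))
```

If the learning rate is set too high the updates overshoot and the cost grows instead of shrinking, which is why it is treated as a tuning parameter.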
UNIT-2
Multi-Layer Perceptron and Backpropagation:
The Architecture of Multi-Layer Perceptron (MLP)
An MLP is a type of artificial neural network that consists of multiple layers of neurons, which are the
basic computational units. The architecture of an MLP includes the following components:
Input Layer
The input layer is the first layer in the MLP. It consists of input neurons that pass the data to the
subsequent layers. These neurons do not perform any computations; they merely transmit the input
features to the next layer. The number of neurons in this layer corresponds to the number of features
in the input data.
Hidden Layers
The hidden layers are the core of the MLP. These layers consist of neurons called Threshold Logic
Units (TLUs). Each hidden layer transforms the input from the previous layer using a set of weights
and biases. The transformed input is then passed through an activation function, which introduces
non-linearity into the model. This non-linearity allows the MLP to capture complex patterns in the
data.
Output Layer
The final layer in the MLP is the output layer, which produces the network’s predictions. The number
of neurons in the output layer depends on the type of problem being solved. For binary classification,
there is typically one output neuron, while for multi-class classification, there are multiple output
neurons.
Bias Neurons and Fully Connected Layers
Each layer, except for the output layer, includes a bias neuron. Bias neurons are special neurons that
always output the value 1. Every neuron in a layer is fully connected to every neuron in the
subsequent layer. This means that each neuron in the current layer sends its output to every neuron in
the next layer.
Feedforward Neural Network (FNN)
The MLP is a type of Feedforward Neural Network (FNN). In an FNN, the signal flows in one
direction, from the input layer to the output layer, without any cycles or loops. This straightforward
flow of information is crucial for understanding how predictions are made and how errors are
propagated back through the network during training.
Deep Neural Networks (DNN)
When an MLP has multiple hidden layers, it is often referred to as a Deep Neural Network (DNN).
The term “deep” signifies the presence of many layers between the input and output layers. DNNs can
model highly complex functions due to their deep architecture.
Backpropagation: The Key to Training MLPs
Backpropagation is a powerful and efficient algorithm for training MLPs. It involves computing the
gradient of the loss function with respect to each weight by propagating the error backward through
the network.
Forward Pass
The forward pass is the first step in the backpropagation algorithm. During this step, the input data is
passed through the network, layer by layer, until it reaches the output layer. Each neuron computes a
weighted sum of its inputs, applies an activation function to this sum, and passes the result to the next
layer. This process continues until the network produces an output.
Error Calculation
Once the network’s output is obtained, the error is calculated using a loss function. The loss function
measures the difference between the predicted output and the actual target values. Common loss
functions include Mean Squared Error (MSE) for regression tasks and Cross-Entropy Loss for
classification tasks. The goal of training is to minimize this error.
Backward Pass
The backward pass is where the magic of backpropagation happens. During this step, the algorithm
computes the gradient of the loss function with respect to each weight in the network. This is done by
applying the chain rule of calculus. The chain rule allows the algorithm to efficiently compute the
gradient of the loss with respect to each weight by multiplying gradients along the computational
graph.
1. Gradient Calculation: The algorithm starts at the output layer and works its way backward
through the network, layer by layer. For each layer, it calculates the gradient of the loss with
respect to the weights. This involves computing how much each weight contributed to the
overall error.
2. Error Propagation: The error is propagated backward through the network, from the output
layer to the input layer. At each layer, the algorithm updates the weights by subtracting a
fraction of the gradient from the current weights. This fraction is determined by the learning
rate, a hyperparameter that controls the size of the weight updates.
Gradient Descent Step
The final step in the backpropagation algorithm is the Gradient Descent step. During this step, the
weights are updated based on the computed gradients. This involves taking a step in the direction that
reduces the error. The size of this step is determined by the learning rate.
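The forward pass, backward pass and gradient descent step described above can be sketched for a tiny network with one hidden layer. This is an illustrative sketch only: the 2-2-1 layout, sigmoid activations, squared-error loss, learning rate and the AND-gate data are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: returns the hidden activations and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


def train(data, n_hidden=2, learning_rate=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = 0.0

    def loss():
        # Mean squared error over the whole data set.
        return sum((forward(x, w_hidden, b_hidden, w_out, b_out)[1] - y) ** 2
                   for x, y in data) / len(data)

    initial_loss = loss()
    for _ in range(epochs):
        for x, y in data:
            hidden, output = forward(x, w_hidden, b_hidden, w_out, b_out)
            # Backward pass: chain rule for the loss (output - y)^2.
            delta_out = 2.0 * (output - y) * output * (1.0 - output)
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            # Gradient descent step: move each weight against its gradient.
            for j in range(n_hidden):
                w_out[j] -= learning_rate * delta_out * hidden[j]
                for i in range(n_in):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]
                b_hidden[j] -= learning_rate * delta_hidden[j]
            b_out -= learning_rate * delta_out
    return initial_loss, loss()


# Hypothetical training set: the logical AND of two inputs.
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
initial_loss, final_loss = train(and_data)
print(initial_loss, final_loss)
```

Note that the hidden-layer deltas are computed before the output weights are changed, so every gradient in one update refers to the same set of weights.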
Activation Functions
Activation functions are a critical component of MLPs. They introduce non-linearity into the network,
allowing it to model complex functions. Without activation functions, an MLP would be equivalent to
a single-layer linear model, regardless of the number of layers.
Logistic Function (Sigmoid)
The logistic function, also known as the sigmoid function, maps any real-valued number to a value
between 0 and 1. It is S-shaped and has a well-defined nonzero derivative, which makes it suitable for
use with the backpropagation algorithm.
Hyperbolic Tangent Function (tanh)
The hyperbolic tangent function is similar to the logistic function but outputs values between -1 and 1.
Because its outputs are centered around zero, it can help speed up convergence during training.
Rectified Linear Unit (ReLU)
The ReLU function is widely used in deep learning due to its simplicity and effectiveness. It outputs
the input value if it is positive; otherwise, it outputs zero. Although ReLU is not differentiable at zero
and has a zero gradient for negative inputs, it performs well in practice and helps mitigate the
vanishing gradient problem.
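The three activation functions discussed above are short enough to write out directly. The sketch below also includes the derivatives that backpropagation needs; treating the ReLU derivative as 0 at exactly zero is a common convention and an assumption here.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh(z):
    """Hyperbolic tangent: maps any real number into (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Rectified Linear Unit: passes positive inputs, outputs zero otherwise."""
    return max(0.0, z)


def relu_derivative(z):
    # Not differentiable at z = 0; using 0 there is a convention (assumption).
    return 1.0 if z > 0 else 0.0


for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```

The sigmoid derivative never exceeds 0.25, which is one reason gradients shrink as they pass backward through many sigmoid layers, while the ReLU derivative stays at 1 for positive inputs.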
Radial Basis Function (RBF) Neural Network
What is RBF in Machine Learning?
In machine learning, "RBF" stands for "Radial Basis Function." An RBF network is a specific type of
neural network architecture that uses radial basis functions (RBFs) as activation functions. RBF
neural networks differ from traditional multi-layer perceptrons and recurrent neural networks in
how they process input data and perform computations.
RBF Neural Network Structure
The input layer receives the data, with each neuron representing a feature. Unlike traditional
networks, the hidden layer of an RBF network is composed of radial basis functions (RBFs): each
hidden neuron corresponds to a function centered at a specific point in the input space, rather
than a unit in a stack of densely interconnected layers. These RBFs, which can take various forms
such as Gaussian or Multiquadric functions, produce scalar outputs based on the distance between
the input and the function's center, enabling the network to capture intricate relationships in
the data.
In the output layer, which generates the final network output, neurons may represent class labels,
continuous values, or categories depending on the task. The outputs of RBFs in the hidden layer are
typically combined linearly through weighted sums, with weights learned during training. This
weighted linear combination process allows for the integration of RBF outputs into the final network
output, enhancing its ability to approximate functions, recognize patterns, and classify data
effectively, especially in scenarios where traditional neural network architectures encounter
challenges in generalization.
Radial basis functions are mathematical functions whose value depends only on the distance from a
specified center or origin. Commonly used radial basis functions include Gaussian, Multiquadric, and
Inverse Multiquadric functions. Mathematically, a radial basis function $\varphi(x)$ can be represented
as $\varphi(\lVert x - c \rVert)$, where $c$ is the center of the function and $\lVert x - c \rVert$
denotes the distance between the input $x$ and the center.
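A minimal sketch of an RBF network's forward computation is shown below, assuming Gaussian basis functions with a shared width. The centers, weights and width are hand-picked illustrative values; in practice they would be learned or chosen from the data.

```python
import math


def gaussian_rbf(x, center, width=1.0):
    """Gaussian radial basis function: exp(-||x - c||^2 / (2 * width^2))."""
    distance_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-distance_sq / (2.0 * width ** 2))


def rbf_network_output(x, centers, weights, bias=0.0, width=1.0):
    """Hidden layer of RBF units followed by a weighted linear sum."""
    activations = [gaussian_rbf(x, c, width) for c in centers]
    return bias + sum(w * a for w, a in zip(weights, activations))


# Hypothetical network with two hidden units (values chosen by hand).
centers = [(0.0, 0.0), (3.0, 4.0)]
weights = [1.0, -1.0]

near_first = rbf_network_output((0.0, 0.0), centers, weights)
near_second = rbf_network_output((3.0, 4.0), centers, weights)
print(near_first, near_second)
```

Each hidden unit responds strongly only to inputs near its own center, so the sign of the output here indicates which center the input is closer to.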
Hyperplane: There can be multiple lines or decision boundaries that segregate the classes in
n-dimensional space, but we need to find the best decision boundary for classifying the data
points. This best boundary is known as the hyperplane of SVM.
We always choose the hyperplane with the maximum margin, that is, the maximum distance
between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect its position are
termed support vectors. Since these vectors support the hyperplane, they are called support
vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood with an example. Suppose we have a
dataset with two tags (green and blue) and two features, x1 and x2. We want a classifier
that can classify a pair (x1, x2) of coordinates as either green or blue. Consider the below
image:
Since this is a 2-D space, we can easily separate the two classes with a straight line. But there
can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary is
called a hyperplane. The SVM algorithm finds the points from both classes that lie closest to the
boundary. These points are called support vectors. The distance between the support vectors and
the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The
hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If the data is linearly separable, we can separate it with a straight line, but for non-linear data
we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data we used
two dimensions, x and y, so for non-linear data we add a third dimension, z. It can be calculated
as:
$z = x^2 + y^2$
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it
back to 2-D space with z = 1, it becomes a circle of radius 1 (x² + y² = 1).
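The effect of adding the third dimension can be checked numerically. In the sketch below, the two groups of points are made up: one group lies inside the unit circle and the other outside it, so no straight line separates them in the (x, y) plane, yet a single threshold on z does.

```python
def lift(point):
    """Map a 2-D point (x, y) to 3-D by adding z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)


# Hypothetical data: one class near the origin, the other surrounding it.
inner_class = [(0.5, 0.0), (0.0, -0.5), (0.3, 0.4)]
outer_class = [(2.0, 0.0), (0.0, -2.0), (1.5, 1.5)]

inner_z = [lift(p)[2] for p in inner_class]
outer_z = [lift(p)[2] for p in outer_class]

# In the lifted space the plane z = 1 separates the two classes.
print(all(z < 1.0 for z in inner_z), all(z > 1.0 for z in outer_z))
```

Kernel functions let an SVM obtain the same effect without computing the extra coordinates explicitly.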
Advantages of SVM:
1. Effective in high-dimensional spaces.
2. Memory efficient.
3. Robust to overfitting.
4. Versatile with different kernels.
5. Strong theoretical foundations.
6. Works well with clear margin separation.
Disadvantages of SVM:
1. Computationally intensive.
2. Not suitable for very large datasets.
3. Requires careful tuning of parameters.
4. Sensitive to noise.
5. Hard to interpret results.
6. Primarily for binary classification.
UNIT-3
What is a Decision Tree?
A decision tree is a flowchart-like structure used to make decisions or predictions. It consists of
nodes representing decisions or tests on attributes, branches representing the outcome of these
decisions, and leaf nodes representing final outcomes or predictions. Each internal node corresponds
to a test on an attribute, each branch corresponds to the result of the test, and each leaf node
corresponds to a class label or a continuous value.
Structure of a Decision Tree
1. Root Node: Represents the entire dataset and the initial decision to be made.
2. Internal Nodes: Represent decisions or tests on attributes. Each internal node has one or
more branches.
3. Branches: Represent the outcome of a decision or test, leading to another node.
4. Leaf Nodes: Represent the final decision or prediction. No further splits occur at these nodes.
How Do Decision Trees Work?
The process of creating a decision tree involves:
1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or information gain,
the best attribute to split the data is selected.
2. Splitting the Dataset: The dataset is split into subsets based on the selected attribute.
3. Repeating the Process: The process is repeated recursively for each subset, creating a new
internal node or leaf node until a stopping criterion is met (e.g., all instances in a node belong
to the same class or a predefined depth is reached).
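Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is split
on an attribute.
o $\mathrm{InformationGain} = \mathrm{Entropy}_{\text{parent}} - \sum_{i} \frac{|D_i|}{|D|}\,\mathrm{Entropy}(D_i)$, where $D$ is the parent dataset and $D_i$ are the subsets produced by the split.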
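As a worked illustration, information gain can be computed as the parent's entropy minus the size-weighted average entropy of the child subsets. The sketch below uses the standard Shannon entropy in bits; the labels are made-up data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with two "yes" and two "no" examples.
parent = ["yes", "yes", "no", "no"]
perfect_split = information_gain(parent, [["yes", "yes"], ["no", "no"]])
useless_split = information_gain(parent, [["yes", "no"], ["yes", "no"]])
print(perfect_split, useless_split)
```

A split that produces pure subsets recovers all of the parent's entropy, while a split that leaves the class mix unchanged gains nothing, so the attribute with the highest gain is preferred.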
Tree structure: CART builds a tree-like structure consisting of nodes and branches. The
nodes represent different decision points, and the branches represent the possible outcomes of
those decisions. The leaf nodes in the tree contain a predicted class label or value for the
target variable.
Splitting criteria: CART uses a greedy approach to split the data at each node. It evaluates
all possible splits and selects the one that best reduces the impurity of the resulting subsets.
For classification tasks, CART uses Gini impurity as the splitting criterion. The lower the
Gini impurity, the purer the subset. For regression tasks, CART uses residual reduction as
the splitting criterion. The greater the reduction in residual error, the better the fit of the
model to the data.
Pruning: To prevent overfitting, pruning removes the nodes that contribute little to the
model's accuracy. Cost complexity pruning and information gain pruning are two popular
pruning techniques. Cost complexity pruning weighs each subtree's reduction in error against
the complexity it adds and removes subtrees whose benefit does not justify their cost.
Information gain pruning calculates the information gain of each node and removes nodes
that have a low information gain.
How does the CART algorithm work?
The CART algorithm works via the following process:
1. The best-split point of each input is obtained.
2. Based on the best-split points of each input in Step 1, the new “best” split point is identified.
3. Split the chosen input according to the “best” split point.
4. Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does that by
searching for the most homogeneous sub-nodes, with the help of the Gini index criterion.
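The search for a best split point on a single numeric input can be sketched as follows. This is a simplified illustration of the Gini-based step only, not the full CART procedure: candidate thresholds are taken as midpoints between adjacent distinct values, and the data is made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return the (threshold, weighted Gini) pair with the lowest impurity."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical feature values with their class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, score = best_split(values, labels)
print(threshold, score)
```

A full tree would repeat this search over every input, pick the overall best split, and then recurse into the two resulting subsets.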
Ensemble Learning
Ensemble means ‘a collection of things’ and in Machine Learning terminology, Ensemble learning
refers to the approach of combining multiple ML models to produce a more accurate and robust
prediction than any individual model. It trains an ensemble of fast base learners (for example,
decision tree classifiers) and lets them vote on the prediction.
Ensemble Learning Techniques
Gradient Boosting Machines (GBM): Gradient Boosting is a popular ensemble learning
technique that sequentially builds a group of decision trees, each correcting the residual
errors made by the previous trees, which enhances predictive accuracy. Each new weak learner
is trained to fit the residuals of the previous ensemble's predictions.
Extreme Gradient Boosting (XGBoost): XGBoost features tree pruning, regularization, and
parallel processing, which makes it a preferred choice for data scientists seeking robust and
accurate predictive models.
CatBoost: It is designed to handle categorical features directly, which eliminates the need for
extensive pre-processing. CatBoost is known for its high predictive accuracy, fast training, and
built-in mechanisms for reducing overfitting.
Stacking: It combines the outputs of multiple base models by training a combiner (an
algorithm that takes the predictions of the base models as input) to generate a more accurate
prediction. Stacking allows for more flexibility in combining diverse models, and the
combiner can be any machine learning algorithm.
Random Subspace Method (Random Subspace Ensembles): It is an ensemble learning
approach that improves the predictive accuracy by training base models on random subsets of
input features. It mitigates overfitting and improves the generalization by introducing
diversity in the model space.
Random Forest Variants: They introduce variations in tree construction, feature selection, or
model optimization to enhance performance.
Selecting the right advanced ensemble technique depends on the nature of the data, the specific
problem trying to be solved, and the computational resources available. It often requires
experimentation and changes to achieve the best results.
Algorithm based on Bagging and Boosting
Bagging Algorithm
Bagging is an ensemble technique for supervised learning that can be used for both regression and
classification tasks. Here is an overview of the steps in the Bagging classifier algorithm:
Bootstrap Sampling: Creates ‘N’ training subsets by randomly sampling rows from the original
training data with replacement, so a row can appear more than once in a subset or not at all.
This step ensures that the base models are trained on diverse subsets of the data.
Base Model Training: For each bootstrapped sample, train a base model independently on
that subset of data. These weak models are trained in parallel to increase computational
efficiency and reduce time consumption.
Prediction Aggregation: To make a prediction on test data, combine the predictions of all
base models. For classification tasks this can be majority voting or weighted majority voting,
while for regression it involves averaging the predictions.
Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of a
particular base model during bootstrapping. These “out-of-bag” samples can be
used to estimate the model’s performance without the need for cross-validation.
Final Prediction: After aggregating the predictions from all the base models, Bagging
produces a final prediction for each instance.
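The bagging steps above can be sketched in plain Python. The weak learner here is a hypothetical one-dimensional threshold rule invented for this example (it assumes exactly two classes), and the data, seed and number of models are illustrative choices.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) rows at random with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Hypothetical weak learner: threshold halfway between the two class means."""
    by_class = {}
    for x, label in sample:
        by_class.setdefault(label, []).append(x)
    if len(by_class) < 2:  # the bootstrap sample happened to contain one class only
        only_label = next(iter(by_class))
        return lambda x: only_label
    (low_label, low_mean), (high_label, high_mean) = sorted(
        ((label, sum(xs) / len(xs)) for label, xs in by_class.items()),
        key=lambda item: item[1])
    threshold = (low_mean + high_mean) / 2
    return lambda x: low_label if x <= threshold else high_label


def bagging_predict(models, x):
    """Aggregate the base models' predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
data = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (8.0, "b"), (8.5, "b"), (9.0, "b")]
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagging_predict(models, 1.2), bagging_predict(models, 8.7))
```

Each base model sees a slightly different sample, so individual mistakes tend to be outvoted when the predictions are combined.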
Boosting Algorithm
Boosting is an ensemble technique that combines multiple weak learners to create a strong learner.
The weak models are trained in series, such that each new model tries to correct the errors of the
previous one, until the training data is predicted well enough or a maximum number of models is
reached. One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting).
Here are a few popular boosting algorithm frameworks:
AdaBoost (Adaptive Boosting): AdaBoost assigns different weights to data points, focusing
on challenging examples in each iteration. It combines weighted weak classifiers to make
predictions.
Gradient Boosting: Gradient Boosting, including implementations such as Gradient Boosting
Machines (GBM), XGBoost, and LightGBM, optimizes a loss function by training a sequence
of weak learners, each fitted to the residuals (the errors between the current predictions
and the actual values) of the ensemble built so far, producing strong predictive models.
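As a rough sketch of the sequential idea, the following code boosts one-split regression stumps on the residuals of a squared-error loss. The helper names (`fit_stump`, `gradient_boost`), the learning rate, and the toy data are assumptions for illustration; libraries such as XGBoost and LightGBM use far more elaborate trees and optimizations.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)            # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Each new weak learner is trained on the current errors.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def training_mse(model, xs, ys):
    return sum((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(10))
ys = [x * x for x in xs]
print(training_mse(gradient_boost(xs, ys, n_rounds=0), xs, ys))   # mean-only baseline
print(training_mse(gradient_boost(xs, ys, n_rounds=50), xs, ys))  # after boosting
```

With squared error, the residual is exactly the negative gradient of the loss, which is why fitting residuals counts as "gradient" boosting.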
The Gaussian Mixture Model (GMM) is a probabilistic clustering and density estimation method
used in machine learning, based on a mixture of Gaussian distributions. It is effective for modeling
complex, continuous datasets where the underlying structure can be represented as a combination of
multiple Gaussian (normal) distributions.
2. Soft Clustering:
Unlike K-means, where each data point belongs to exactly one cluster, GMM provides soft
assignments.
Each point has a probability of belonging to each Gaussian component, offering greater
flexibility in clustering.
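To make the idea of soft assignment concrete, the sketch below computes the membership probabilities of a point under a one-dimensional mixture of two Gaussians. The mixture parameters are made-up values, and the function names are assumptions; in practice the parameters would be estimated from data (commonly with the Expectation-Maximization algorithm), which is not shown here.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """Probability that x belongs to each Gaussian component (sums to 1)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical mixture: two equally weighted components centred at 0 and 4.
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

print(responsibilities(0.0, weights, means, stds))  # almost entirely component 0
print(responsibilities(2.0, weights, means, stds))  # halfway: split evenly
```

A hard-assignment method such as K-means would place the point x = 2.0 in a single cluster, whereas the mixture reports that it is equally likely to come from either component.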
Advantages:
1. Soft Assignments: Handles overlapping clusters better than K-means.
2. Flexible Covariance: Clusters can have various shapes (elliptical, spherical) due to
covariance matrices.
3. Probabilistic Model: Can compute the likelihood of data belonging to clusters, useful in
uncertainty modeling.
Disadvantages:
1. Sensitive to initialization (poor starting parameters may lead to a poor local optimum).
2. Higher computational cost than K-means, especially for large datasets.
3. Assumes each cluster is approximately Gaussian, which may not always be valid.
Applications:
1. Clustering:
o Identifying patterns in customer segmentation.
2. Anomaly Detection:
o Outliers have low probability under the fitted Gaussian mixture model.
3. Density Estimation:
o Modeling the probability distribution of data, often used in speech recognition or
financial modeling.
4. Dimensionality Reduction:
o By approximating the data’s distribution, GMMs help in understanding and
compressing high-dimensional datasets.
Conclusion:
The Gaussian Mixture Model is a powerful probabilistic method for clustering, capable of modeling
complex data distributions with flexibility. By combining multiple Gaussian components, GMM
provides robust clustering and density estimation, making it ideal for applications where soft
clustering and probabilistic modeling are required.
K-Means Clustering
K-Means is an unsupervised machine learning algorithm used for clustering data into K groups. It
minimizes the intra-cluster distances by iteratively assigning each data point to its nearest
centroid and then recomputing each centroid as the mean of the points assigned to it.
Key Features
1. Hard Assignment: Each data point belongs to one cluster.
2. Distance-Based: Uses Euclidean distance to measure closeness.
3. K (Number of Clusters): Predefined; must be specified before running the algorithm.
Advantages
1. Simple and easy to implement.
2. Efficient for large datasets.
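3. Guaranteed to converge, though possibly to a local optimum rather than the best clustering.
Disadvantages
1. Sensitive to the initial choice of centroids.
2. Works best on roughly spherical clusters; struggles with elongated or irregular shapes.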
3. Affected by outliers.
Applications
Customer Segmentation in marketing.
Image Compression (grouping similar pixels).
Document Clustering in text analysis (grouping similar documents).
K-Means is a fast, scalable algorithm suitable for clustering tasks, but it assumes clusters are spherical
and requires manual tuning of K.
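A minimal sketch of the K-Means loop is given below, assuming points are tuples of floats, Euclidean distance, and initial centroids picked at random from the data. The function name `kmeans` and the toy points are illustrative assumptions.

```python
import math
import random

def kmeans(points, k, rng, max_iters=100):
    # Initial centroids: k distinct points chosen at random (one common choice).
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]      # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:        # assignments are stable: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, clusters = kmeans(points, k=2, rng=random.Random(0))
print(sorted(centroids))
```

Because the result depends on the starting centroids, it is common to run the algorithm several times with different seeds and keep the clustering with the smallest total intra-cluster distance.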
UNIT-4
What is Independent Component Analysis?
Independent Component Analysis (ICA) is a statistical and computational technique used in machine
learning to separate a multivariate signal into its independent non-Gaussian components. The goal of
ICA is to find a linear transformation of the data such that the transformed data is as close to being
statistically independent as possible.
The heart of ICA lies in the principle of statistical independence. ICA identifies components within
mixed signals that are statistically independent of each other.
Statistical Independence Concept:
Statistical independence is a concept from probability theory. Two random variables X and Y are
statistically independent if the joint probability distribution of the pair is equal to the product
of their individual probability distributions, which means that knowing the outcome of one variable
does not change the probability of the other outcome.
P(X and Y) = P(X)*P(Y)
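The product rule can be checked numerically on a small discrete example. The sketch below is illustrative only: `is_independent` is a hypothetical helper, and the two joint distributions are made up. Exact fractions are used so that the equality test is not affected by floating-point rounding.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """joint maps (x, y) -> P(X = x and Y = y)."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginal distributions P(X) and P(Y).
    p_x = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    # Independent only if P(X and Y) = P(X) * P(Y) for every pair of outcomes.
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y]
               for x, y in product(xs, ys))

quarter, half = Fraction(1, 4), Fraction(1, 2)

# Two fair coins tossed separately.
two_coins = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
# Y is always a copy of X, so knowing X fixes Y.
copied_coin = {(0, 0): half, (1, 1): half}

print(is_independent(two_coins))    # True
print(is_independent(copied_coin))  # False
```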
Assumptions in ICA
1. The first assumption asserts that the source signals (original signals) are statistically
independent of each other.
2. The second assumption is that each source signal exhibits non-Gaussian distributions.
Mathematical Representation of Independent Component Analysis
The observed random vector is X = (x1, x2, ..., xm)^T, representing the observed data with m
components. The hidden components are represented by the random vector S = (s1, s2, ..., sn)^T,
where n is the number of hidden sources.
Linear Static Transformation
The observed data X is transformed into the hidden components S using a linear static transformation
represented by the matrix W.
S=WX
Here, W = transformation matrix.
The goal is to transform the observed data X in such a way that the resulting hidden components are
independent. The independence is measured by some function F(s1, ..., sn). The task is to find
the optimal transformation matrix W that maximizes the independence of the hidden components.
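The relation S = WX can be illustrated with two toy signals. The sketch below makes a strong simplifying assumption: the mixing matrix A is known, so W is simply its inverse. A real ICA algorithm has to estimate W from X alone by maximizing an independence measure, which is not shown. The signals, the matrix A, and the helper names are all hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Two hidden source signals (rows of S): a square wave and a sawtooth.
S = [[1.0 if t % 2 == 0 else -1.0 for t in range(10)],
     [(t % 5) / 5 for t in range(10)]]

A = [[1.0, 0.5],          # assumed mixing matrix (unknown in a real problem)
     [0.3, 1.0]]
X = matmul(A, S)          # observed mixed signals

W = inverse_2x2(A)        # the transformation ICA would try to estimate
S_recovered = matmul(W, X)

max_error = max(abs(r - s)
                for row_r, row_s in zip(S_recovered, S)
                for r, s in zip(row_r, row_s))
print(max_error)
```

Even a real ICA estimate recovers the sources only up to their order and scale, because permuting or rescaling independent components leaves them independent.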
Advantages of Independent Component Analysis (ICA):
ICA is a powerful tool for separating mixed signals into their independent components. This
is useful in a variety of applications, such as signal processing, image analysis, and data
compression.
ICA does not require a specific parametric form for the probability distribution of the
sources, beyond the assumption that they are non-Gaussian.
ICA is an unsupervised learning technique, which means that it can be applied to data
without the need for labeled examples. This makes it useful in situations where labeled data is
not available.
ICA can be used for feature extraction, which means that it can identify important features
in the data that can be used for other tasks, such as classification.
Disadvantages of Independent Component Analysis (ICA):
ICA assumes that the underlying sources are non-Gaussian, which may not always be true. If
more than one source is Gaussian, ICA cannot separate those sources from one another.
ICA assumes that the sources are mixed linearly, which may not always be the case. If the
sources are mixed nonlinearly, ICA may not be effective.
ICA can be computationally expensive, especially for large datasets. This can make it difficult
to apply ICA to real-world problems.
ICA can suffer from convergence issues, which means that it may not always be able to find a
solution. This can be a problem for complex datasets with many sources.
o Fitness Function: The fitness function is used to determine the individual's fitness level in
the population. It means the ability of an individual to compete with other individuals. In
every iteration, individuals are evaluated based on their fitness function.
o Genetic Operators: In a genetic algorithm, the fittest individuals mate with the aim of
producing offspring that are better than their parents. Genetic operators play a role in
changing the genetic composition of the next generation.
o Selection
After calculating the fitness of every individual in the population, a selection process is used to
determine which individuals in the population will get to reproduce and produce the offspring
that will form the next generation.
How Does a Genetic Algorithm Work?
The genetic algorithm works on the evolutionary generational cycle to generate high-quality solutions.
These algorithms use different operations that either enhance or replace the population to give a
better-fitting solution.
It involves five phases to solve complex optimization problems, which are given
below:
o Initialization
o Fitness Assignment
o Selection
o Reproduction
o Termination
1. Initialization
The process of a genetic algorithm starts by generating a set of individuals, which is called the
population. Each individual is a candidate solution for the given problem. An individual is
characterized by a set of parameters called genes. Genes are combined into a string to form a
chromosome, which encodes the solution to the problem. One of the most popular techniques for
initialization is the use of random binary strings.
2. Fitness Assignment
The fitness function is used to determine how fit an individual is, that is, the ability of an
individual to compete with other individuals. In every iteration, individuals are evaluated based on
their fitness function. The fitness function provides a fitness score to each individual. This score
further determines the probability of being selected for reproduction. The higher the fitness score,
the greater the chance of being selected for reproduction.
3. Selection
The selection phase involves the selection of individuals for the reproduction of offspring. The
selected individuals are then arranged in pairs for reproduction. These individuals then
transfer their genes to the next generation.
There are three types of Selection methods available, which are:
o Roulette wheel selection
o Tournament selection
o Rank-based selection
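The three selection methods can be sketched as follows. These are simplified illustrations with assumed function names; the roulette wheel version assumes non-negative fitness scores with a positive total, and the tournament size is an arbitrary choice.

```python
import random

def roulette_wheel_select(population, fitnesses, rng):
    # Selection probability is proportional to the fitness score.
    return rng.choices(population, weights=fitnesses, k=1)[0]

def tournament_select(population, fitnesses, rng, size=3):
    # Pick `size` distinct individuals at random and keep the fittest of them.
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitnesses[i])]

def rank_select(population, fitnesses, rng):
    # Selection probability is proportional to the rank (1 = least fit),
    # which reduces the dominance of a few very fit individuals.
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    ranks = [0] * len(population)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return rng.choices(population, weights=ranks, k=1)[0]

rng = random.Random(0)
population = ["a", "b", "c", "d"]
fitnesses = [1, 2, 3, 94]
print(roulette_wheel_select(population, fitnesses, rng))
print(tournament_select(population, fitnesses, rng))
print(rank_select(population, fitnesses, rng))
```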
4. Reproduction
After the selection process, the creation of a child occurs in the reproduction step. In this step, the
genetic algorithm uses two variation operators that are applied to the parent population. The two
operators involved in the reproduction phase are given below:
o Crossover: Crossover plays the most significant role in the reproduction phase of the
genetic algorithm. In this process, a crossover point is selected at random within the genes.
Then the crossover operator swaps genetic information of two parents from the current
generation to produce a new individual representing the offspring.
The genes of the parents are exchanged among themselves up to the crossover point. The
newly generated offspring are added to the population. This process is called
crossover. Types of crossover available:
o One point crossover
o Two-point crossover
o Uniform crossover
o Mutation
The mutation operator inserts random genes in the offspring (new child) to maintain the
diversity in the population. It can be done by flipping some bits in the chromosomes.
Mutation helps in solving the issue of premature convergence and enhances diversification.
The below image shows the mutation process:
Types of mutation available:
o Flip bit mutation
o Gaussian mutation
o Exchange/Swap mutation
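The two variation operators can be sketched on chromosomes stored as Python lists of bits. The function names, the mutation rate, and the 50/50 gene choice in uniform crossover are assumptions for illustration. Gaussian mutation is omitted because it applies to real-valued genes rather than bits.

```python
import random

def one_point_crossover(p1, p2, rng):
    point = rng.randrange(1, len(p1))          # crossover point inside the string
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def two_point_crossover(p1, p2, rng):
    a, b = sorted(rng.sample(range(1, len(p1)), 2))
    # Only the segment between the two points is exchanged.
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

def uniform_crossover(p1, p2, rng):
    # Each gene position is swapped between the parents with probability 0.5.
    pairs = [(x, y) if rng.random() < 0.5 else (y, x) for x, y in zip(p1, p2)]
    c1, c2 = zip(*pairs)
    return list(c1), list(c2)

def flip_bit_mutation(chromosome, rng, rate=0.1):
    # Each bit is flipped independently with probability `rate`.
    return [1 - g if rng.random() < rate else g for g in chromosome]

def swap_mutation(chromosome, rng):
    # Exchange the genes at two randomly chosen positions.
    child = list(chromosome)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

rng = random.Random(0)
parent1, parent2 = [0] * 8, [1] * 8
print(one_point_crossover(parent1, parent2, rng))
print(two_point_crossover(parent1, parent2, rng))
print(uniform_crossover(parent1, parent2, rng))
print(flip_bit_mutation(parent1, rng))
print(swap_mutation([0, 0, 0, 0, 1, 1, 1, 1], rng))
```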
5. Termination
After the reproduction phase, a stopping criterion is applied as the basis for termination. The
algorithm terminates when a solution reaches the threshold fitness, or when another limit such as a
maximum number of generations is reached. It then returns the best solution in the population as the
final solution.
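Putting the five phases together, the following sketch runs a small genetic algorithm on a toy problem: maximizing the number of 1s in a binary string. The problem, the parameter values, the use of tournament selection, and the choice to copy the best individual unchanged into the next generation are all assumptions made for illustration, not the only way to build a genetic algorithm.

```python
import random

def fitness(chromosome):
    return sum(chromosome)                      # toy fitness: count of 1 bits

def tournament(population, scores, rng, size=3):
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: scores[i])]

def genetic_algorithm(n_genes=20, pop_size=30, max_generations=200,
                      mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # 1. Initialization: random binary strings.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(max_generations):
        # 2. Fitness assignment.
        scores = [fitness(ind) for ind in population]
        best = max(population + [best], key=fitness)
        # 5. Termination: stop once the threshold fitness is reached.
        if fitness(best) == n_genes:
            break
        next_population = [best[:]]             # keep the best individual
        while len(next_population) < pop_size:
            # 3. Selection.
            p1 = tournament(population, scores, rng)
            p2 = tournament(population, scores, rng)
            # 4. Reproduction: one-point crossover, then flip-bit mutation.
            point = rng.randrange(1, n_genes)
            child = p1[:point] + p2[point:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
    best = max(population + [best], key=fitness)
    return best, fitness(best)

best, score = genetic_algorithm()
print(best, score)
```

The generation cap guarantees that the loop ends even if the threshold fitness is never reached.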