MACHINE
LEARNING
EFFORTS BY RIDA SHAFIQ
Machine Learning Course Contents:
Introduction to machine learning and statistical pattern recognition.
Supervised Learning:
Part 1: Graphical Models (full Bayes, Naïve Bayes); Decision Trees for classification & regression for both categorical & numerical data; Ensemble methods; Random Forests; Boosting (AdaBoost and XGBoost); Stacking.
Part 2: Four components of a machine learning algorithm (hypothesis, loss functions, derivatives, and optimization algorithms); Gradient Descent; Stochastic Gradient Descent; Linear Regression; Nonlinear Regression; Perceptron; Support Vector Machines; Kernel Methods; Logistic Regression; Softmax; Neural Networks. Unsupervised Learning: K-means, density-based clustering methods (DBSCAN, etc.), Gaussian mixture models, EM algorithm, etc. Reinforcement Learning; tuning model complexity; Bias-Variance Tradeoff; Grid Search, Random Search; evaluation metrics; reporting predictive performance.
Reference Material Books:
1. Elements of Statistical Learning
2. Pattern Recognition & Machine Learning, 1st Edition, Chris Bishop
3. Machine Learning: A Probabilistic Perspective, 1st Edition, Kevin P. Murphy
4. Applied Machine Learning, Online edition, David Forsyth
Introduction to Graphical Models:
A graphical model is a branch of machine learning that uses a graph to represent a
domain problem. The graph expresses the conditional dependence structure between random
variables. Graphical models underlie many machine learning algorithms, for example:
Naive Bayes’ algorithm
The Hidden Markov Model
Restricted Boltzmann machine
Neural Networks
Description:
There are many reasons to learn about graphical (probabilistic) modeling. One of
them is that it is a fascinating scientific field with a beautiful theory that bridges,
in surprising ways, two very different branches of mathematics:
Probability and
Graph theory
At the same time, probabilistic modeling is widely used throughout machine learning and
in many real-world applications. These methods can be used to solve problems
in fields as varied as medicine, language processing, vision, and many others. This
combination of elegant theory and powerful applications makes graphical models one
of the most fascinating topics in modern artificial intelligence and computer science.
Types of Graphical Model:
The two main types of graphical models are;
Directed graphical models (Bayesian networks BNs)
Undirected graphical models (Markov networks or Markov random fields,
MRFs)
Directed graphical models
We introduce a graphical language for specifying a probabilistic model, called the
directed graphical model.
It provides a compact and concise way to specify probabilistic models.
It allows the reader to visually parse dependencies between random variables.
A graphical model visually captures the way in which the joint distribution
over all random variables can be decomposed into a product of factors,
each depending only on a subset of these variables.
The joint distribution by itself may be quite complicated.
It does not tell us anything about the structural properties of the probabilistic model.
For instance, the joint distribution p(a, b, c) does not tell us anything about
independence relations.
This is where graphical models come into play.
In a graphical model, nodes are random variables.
In the figure, the nodes represent the random variables a, b, c.
Edges denote probabilistic relations between variables.
Not every distribution can be represented by a particular choice of graphical
model.
They are a simple way to visualize the structure of a probabilistic model.
They can be used to design or motivate new kinds of statistical models.
Inspection of the graph alone gives us insight into properties such as
conditional independence.
Complex computations for inference and learning in statistical models can be
expressed in terms of graphical manipulations.
We can construct the corresponding directed graphical model from a factorized
joint distribution as follows:
1. Create a node for every random variable.
2. For each conditional distribution, add a directed link (arrow) into the graph
from the nodes corresponding to the variables on which that distribution is
conditioned.
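As a minimal sketch of this construction (the variable names and probability numbers below are illustrative, not from the notes), consider three binary variables whose joint factorizes as p(a, b, c) = p(a) p(b|a) p(c|a, b). Each factor is stored as a NumPy array, and multiplying them back together recovers a valid joint; in the corresponding directed graph there is a node for each of a, b, c, with arrows from a to b and from a and b to c.
import numpy as np
# Factors of p(a, b, c) = p(a) * p(b | a) * p(c | a, b); numbers are made up
p_a = np.array([0.6, 0.4])                              # p(a)
p_b_given_a = np.array([[0.7, 0.3],                     # p(b | a = 0)
                        [0.2, 0.8]])                    # p(b | a = 1)
p_c_given_ab = np.array([[[0.9, 0.1], [0.5, 0.5]],      # p(c | a = 0, b)
                         [[0.3, 0.7], [0.6, 0.4]]])     # p(c | a = 1, b)
# Multiply the factors to reconstruct the full joint distribution
joint = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_ab
print(joint.sum())   # 1.0, so the factors combine into a valid joint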
Undirected Graphical Model:
The undirected graph shown here can have one of several
interpretations.
The common interpretation is that the presence of an edge implies some sort of
dependence between the corresponding random variables.
We might infer from this graph that B, C, and D are all mutually independent once
A is known.
Equivalently, in this case,
p(A, B, C, D) ∝ f_AB(A, B) · f_AC(A, C) · f_AD(A, D)
for some non-negative functions f_AB, f_AC, f_AD.
Applications:
Applications of graphical models include:
Causal inference
Information extraction
Speech recognition
Computer vision
Decoding of low-density parity-check codes
Modeling of gene regulatory networks
Gene finding and diagnosis of diseases
Graphical models for protein structure.
Naïve Bayes theorem:
o Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification
problems.
o It is mainly used in text classification that includes a high-
dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective
classification algorithms, and it helps in building fast machine
learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam
filtration, Sentimental analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words Naïve and Bayes,
which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of
a certain feature is independent of the occurrence of other features.
For example, if a fruit is identified on the basis of color, shape, and
taste, then a red, spherical, and sweet fruit is recognized as an apple.
Hence each feature individually contributes to identifying it as an
apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle
of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law,
which is used to determine the probability of a hypothesis with
prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the
observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that
the hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the
evidence.
P(B) is Marginal Probability: Probability of Evidence.
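As a small worked example of the formula (the numbers are made up purely for illustration), let hypothesis A be "has a disease" and evidence B be "test is positive":
# Illustrative numbers only: prior, likelihood, and false-positive rate
p_a = 0.01              # P(A): prior probability of the hypothesis
p_b_given_a = 0.90      # P(B|A): likelihood of the evidence if A is true
p_b_given_not_a = 0.05  # P(B|not A): probability of the evidence if A is false
# Marginal probability of the evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # about 0.154, despite the strong test result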
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fastest and easiest ML algorithms for
predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated,
so it cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes
Classifier is an eager learner.
o It is used in Text classification such as Spam
filtering and Sentiment analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a
normal distribution. This means if predictors take continuous
values instead of discrete, then the model assumes that these values
are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used
when the data is multinomially distributed. It is primarily used for
document classification problems, i.e., deciding which category a
particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the
Multinomial classifier, but the predictor variables are independent
Boolean variables, such as whether a particular word is present in a
document or not. This model is also well known for document
classification tasks.
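A minimal sketch of the Gaussian variant using scikit-learn is shown below; the choice of the built-in Iris dataset and the split size are illustrative assumptions, not part of the notes above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Iris has continuous-valued features, which suits the Gaussian model
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the Gaussian Naive Bayes classifier and evaluate its accuracy
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))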
Decision Trees for Classification & Regression for
both Categorical & Numerical data
Decision Tree: A Decision Tree is a supervised learning algorithm. It is
a graphical representation of all the possible solutions to a decision, where
each decision is made based on some condition.
It starts from the root node and branches off into a number of solutions, just like a
tree: the tree starts from the root and grows its branches as it gets bigger and bigger.
To understand this better, take the example represented in the tree model below.
Here the root node is a question asking whether you are hungry or not. If you are not
hungry, then go back to sleep. If you are hungry, then check whether you have 100
dollars or not. If you have sufficient money, then go to a restaurant. If you don’t
have enough money, then just go and buy some juice.
In this way, the Decision Tree divides the data into different groups based on some
conditions.
Let us take a dataset to understand this better.
In this dataset, fruits are labelled as either Mango, Grape, or Lemon based on colour and diameter.
Here we take the root node as diameter. If the diameter is greater than 3, then the
colour is either green or yellow; if the diameter is less than 3, then it is certainly a
grape, which is red.
Now we have to check the colour. If it is not yellow, then it is certainly a mango. But
if it is yellow, then we have two options: it can be either mango or lemon. So there
is a 50% chance of mango and a 50% chance of lemon.
But which question should come at the root node, and which question comes next? Here
we need to see which attribute best unmixes the labels at that particular point. We can
measure the amount of uncertainty at a single node with something known as Gini
impurity, and we can also measure how much a question reduces that uncertainty with
something known as Information Gain.
CART Algorithm
The CART (Classification and Regression Tree) algorithm is a predictive model which
shows how the value of an outcome variable can be predicted from the values of other variables.
Let’s have a look at this data.
Here there are attributes like Outlook, temperature, humidity, wind, and a label Play.
All these attributes decide whether to play or not.
So among all of them which one should we pick first? The attribute which classifies
the training data best will be picked first.
To decide that we need to learn some terms.
Gini Index: The Gini Index is the measure of impurity (or purity) used when
building a decision tree with the CART algorithm.
Information Gain: Information gain is the measure of how much information a
feature gives about the class. It is the decrease in entropy after splitting the dataset
based on the attribute.
Constructing a decision tree is all about finding the attribute that has the highest
information gain.
Reduction in Variance: In general, variance measures how much your data varies.
Here, too, the attribute with the lower variance is split first.
Chi-square: It is an algorithm to find the statistical significance of the differences
between sub-nodes and parent nodes.
The first step before building the decision tree is to compute entropy, which
is used to find information gain. As we know, splitting is done based on
information gain: the attribute with the highest information gain is selected first.
Entropy: Entropy is the measure of uncertainty. It is a metric that measures the
impurity of something.
Let’s understand what is impurity first.
Imagine a basket full of cherries and a bowl that contains labels with cherry written
on it. Now if you take 1 fruit from the basket and 1 label from a bowl. So the
probability of matching cherry with cherry is 1 and there is no impurity here.
Now imagine another situation, with different fruits in the basket and labels with
different fruit names in the bowl. Now if you pick one random fruit from the basket
and one random label from the bowl, the probability of matching cherry with cherry
is certainly not 1; it is less than 1. Here there is impurity.
Entropy = -Σ P(x) log2 P(x)
Entropy(s) = -P(yes) log2 P(yes) - P(no) log2 P(no)
where,
s = total sample space
P(yes) = Probability of yes
P(no) = Probability of no
If the number of yes = number of no, then
P(yes) = P(no) = 0.5 and Entropy(s) = 1
If it contains either all yes or all no, then
P(yes) = 1 or P(no) = 1 and Entropy(s) = 0
Let’s see the first case, where the number of yes = number of no:
Entropy(s) = -P(yes) log2 P(yes) - P(no) log2 P(no)
E(s) = -0.5 log2(0.5) - 0.5 log2(0.5)
E(s) = 0.5 + 0.5
E(s) = 1
Let’s see the second case, where it contains either all yes or all no:
Entropy(s) = -P(yes) log2 P(yes)
E(s) = -1 log2(1)
E(s) = 0
Similarly, with all no:
Entropy(s) = -P(no) log2 P(no)
E(s) = -1 log2(1)
E(s) = 0
Calculating Information Gain
Information Gain = Entropy(s) - [ (weighted average) * Entropy(each feature) ]
Let’s calculate the entropy for this dataset. Here total we have 14 data points in
which 9 are yes and 5 are no.
Entropy(s) = -P(yes) log2 P(yes) - P(no) log2 P(no)
E(s) = -(9/14) log2(9/14) - (5/14) log2(5/14)
E(s)=0.41+0.53
E(s)=0.94
So entropy for this dataset is 0.94.
Now out of outlook, Temperature, Humidity, Windy which node is selected as
a root node?
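A minimal sketch of how entropy and information gain can be computed is shown below. The parent counts (9 yes, 5 no) come from the dataset above; the per-branch counts assume the standard Outlook split (Sunny, Overcast, Rain) of the classic play-tennis data, so treat them as illustrative.
import math
def entropy(counts):
    """Entropy (in bits) of a label distribution given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
# Parent node: 9 "yes" and 5 "no"
parent = entropy([9, 5])                       # about 0.94
# Candidate split on Outlook: Sunny, Overcast, Rain (assumed counts)
branches = [[2, 3], [4, 0], [3, 2]]
n_total = sum(sum(b) for b in branches)
weighted = sum(sum(b) / n_total * entropy(b) for b in branches)
info_gain = parent - weighted
print(round(parent, 3), round(info_gain, 3))   # about 0.94 and 0.247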
Random Forest Algorithm
Random Forest is one of the most popular and commonly used algorithms by Data
Scientists. Random forest is a Supervised Machine Learning Algorithm that is used widely
in Classification and Regression problems. It builds decision trees on different
samples and takes their majority vote for classification and average in case of
regression.
One of the most important features of the Random Forest algorithm is that it can
handle data sets containing continuous variables, as in the case of regression,
and categorical variables, as in the case of classification. It performs well on both
classification and regression tasks.
Working of Random Forest Algorithm:
Before understanding the working of the random forest algorithm in machine
learning, we must look into the ensemble learning technique. Ensemble simply
means combining multiple models. Thus a collection of models is used to make
predictions rather than an individual model.
Ensemble uses two types of methods:
1. Bagging– It creates different training subsets from the sample training data with
replacement, and the final output is based on majority voting. For example, Random
Forest.
2. Boosting– It combines weak learners into strong learners by creating sequential
models such that the final model has the highest accuracy. For example, AdaBoost
and XGBoost.
Steps Involved in Random Forest Algorithm:
Step 1: In the Random forest model, a subset of data points and a subset of features
is selected for constructing each decision tree. Simply put, n random records and m
features are taken from the data set having k number of records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: Final output is considered based on Majority Voting or Averaging for
Classification and regression, respectively.
For example: consider the fruit basket as the data as shown in the figure below. Now
n number of samples are taken from the fruit basket, and an individual decision tree
is constructed for each sample. Each decision tree will generate an output, as shown
in the figure. The final output is considered based on majority voting. In the below
figure, you can see that the majority decision tree gives output as an apple when
compared to a banana, so the final output is taken as an apple.
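A minimal sketch of this procedure with scikit-learn is shown below; the Iris dataset, the number of trees, and the split size are illustrative assumptions rather than part of the example above.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load a small dataset and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 100 trees, each trained on a bootstrap sample with random feature subsets;
# the predicted class is decided by majority voting across the trees
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))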
Important Features of Random Forest
Diversity: Not all attributes/variables/features are considered while making
an individual tree; each tree is different.
Immune to the curse of dimensionality: Since each tree does not consider
all the features, the feature space is reduced.
Parallelization: Each tree is created independently out of different data and
attributes. This means we can fully use the CPU to build random forests.
Train-Test split: In a random forest, we don’t have to segregate the data into
train and test sets, as there will always be roughly 30% of the data (the
out-of-bag samples) that is not seen by a given decision tree.
Stability: Stability arises because the result is based on majority voting/
averaging.
Difference Between Decision Tree and Random Forest
Random forest is a collection of decision trees; still, there are a lot of differences
in their behavior.
Decision Trees | Random Forest
1. Decision trees normally suffer from the problem of overfitting if allowed to grow without any control. | 1. Random forests are created from subsets of data, and the final output is based on averaging or majority ranking; hence the problem of overfitting is taken care of.
2. A single decision tree is faster in computation. | 2. A random forest is comparatively slower.
3. When a data set with features is taken as input by a decision tree, it formulates some rules to make predictions. | 3. A random forest randomly selects observations, builds decision trees, and takes the average result. It does not use any fixed set of rules.
Thus random forests are much more successful than single decision trees, but only if
the individual trees are diverse and reasonably accurate.
What is Boosting?
Boosting is an ensemble learning technique that sequentially combines
multiple weak classifiers to create a strong classifier. A model is first trained
on the training data and then evaluated. The next model is built to try to
correct the errors made by the first model. This procedure continues, and
models are added, until either the complete training data set is predicted
correctly or a predefined number of iterations is reached.
Think of it like a class where the teacher focuses more on the weaker students to
improve their academic performance; boosting works in a similar way.
1. AdaBoost
AdaBoost works on the principle of the stagewise addition method, where
multiple weak learners are combined to form a strong learner. The key idea
is to assign higher weights to the misclassified instances, so that
subsequent weak learners focus more on the difficult cases. The alpha
parameter, which is related to the error of a weak learner, determines
the weight given to that learner in the final combination: weak learners
with lower errors receive more weight.
Example code for AdaBoost:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
X, y = make_regression(n_samples=100, n_features=10,
n_informative=5, n_targets=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2)
adr = AdaBoostRegressor()
adr.fit(X_train, y_train)
y_pred = adr.predict(X_test)
print("AdaBoost - R2: ", r2_score(y_test, y_pred))
2. XGBoost
XGBoost is an advanced version of the gradient boosting algorithm,
incorporating regularization techniques to prevent overfitting and improve
performance. It is known for its speed and efficiency, handling both
numerical and categorical features effectively. XGBoost uses a regularized
objective function, which includes both the loss function and a
regularization term.
Example code for XGBoost:
from xgboost import XGBRegressor
xgr = XGBRegressor()
xgr.fit(X_train, y_train)
y_pred = xgr.predict(X_test)
print("XGBoost - R2: ", r2_score(y_test, y_pred))
AdaBoost VS XGBoost
AdaBoost (Adaptive Boosting) and XGBoost (Extreme Gradient Boosting)
are both popular boosting algorithms used in machine learning to improve
the performance of weak learners.
Key Differences
1. Regularization: XGBoost includes regularization techniques to prevent
overfitting, while AdaBoost does not.
2. Speed and Scalability: XGBoost is faster and more scalable compared
to AdaBoost, making it suitable for large datasets.
3. Handling Categorical Variables: XGBoost can handle categorical
variables more effectively than AdaBoost.
4. Performance: XGBoost generally performs better than AdaBoost due to
its regularization and optimization techniques.
In conclusion, while both AdaBoost and XGBoost are powerful boosting
algorithms, XGBoost is often preferred for its speed, scalability, and ability to
handle complex datasets. AdaBoost, on the other hand, is simpler and can be
useful for smaller datasets or when interpretability is a priority.
Stacking
Stacking, also known as stacked generalization, is an ensemble learning
technique that combines multiple classification or regression models to
improve overall performance. Unlike other ensemble methods like bagging
and boosting, stacking leverages different types of models to capture
various aspects of the problem space.
How Stacking Works
The process of stacking involves two levels of models:
1. Base Models (Level-0 Models): These are the initial models that are
trained on the training data. Each base model makes predictions on the
validation set.
2. Meta-Model (Level-1 Model): This model is trained on the predictions
made by the base models. The meta-model learns how to best combine
these predictions to make the final prediction.
Steps to Implement Stacking
1. Split the Training Data: Divide the training data into K-folds, similar to
K-fold cross-validation.
2. Train Base Models: Train each base model on K-1 folds and make
predictions on the Kth fold. Repeat this for all folds.
3. Train Meta-Model: Use the predictions from the base models as
features to train the meta-model.
4. Make Final Predictions: Use the meta-model to make final predictions
on the test set.
Example Code
Here is an example of implementing stacking using Python and
the mlxtend library:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from mlxtend.classifier import StackingClassifier
# Load dataset
df = pd.read_csv('heart.csv')
X = df.drop('target', axis=1)
y = df['target']
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# Standardize data
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Define base models
knc = KNeighborsClassifier()
nb = GaussianNB()
# Train base models
knc.fit(X_train, y_train)
nb.fit(X_train, y_train)
# Define meta-model
lr = LogisticRegression()
# Create stacking classifier
clf_stack = StackingClassifier(classifiers=[knc, nb],
meta_classifier=lr, use_probas=True,
use_features_in_secondary=True)
# Train stacking classifier
clf_stack.fit(X_train, y_train)
# Make predictions
pred_stack = clf_stack.predict(X_test)
# Evaluate accuracy
acc_stack = accuracy_score(y_test, pred_stack)
print('Accuracy of Stacked model:', acc_stack * 100)
Benefits and Considerations
Stacking can significantly improve model performance by leveraging the
strengths of different models. However, it requires careful selection and
tuning of base models and the meta-model. Additionally, stacking can be
computationally expensive and may not always guarantee better
performance.
Hypothesis
In machine learning, a hypothesis refers to a proposed explanation or
model that describes the relationship between input data and output
predictions. It is also commonly known as a model or a learning
algorithm. The hypothesis is created based on the available training data
and is used to make predictions on new, unseen data.
In supervised learning, the hypothesis is typically represented as a
function that maps input variables (features) to output variables (labels or
target values). The goal of supervised learning is to find the best
hypothesis that accurately predicts the output variable for new input data.
For example, in a linear regression problem, the hypothesis might be a
linear equation that relates the input features to the target variable. The
hypothesis could be represented as:
h(x) = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ
where h(x) is the predicted output, x₁, x₂, ..., xₙ are the input features, θ₀,
θ₁, θ₂, ..., θₙ are the parameters (also called weights), and n is the number
of features. The goal is to find the optimal values for the parameters that
minimize the difference between the predicted output and the actual
output in the training data.
The process of finding the best hypothesis involves training the model
using an appropriate algorithm and adjusting the parameters based on an
objective function (e.g., minimizing the mean squared error). Various
machine learning algorithms, such as linear regression, decision trees,
support vector machines, and neural networks, can be used to generate
different types of hypotheses depending on the problem at hand.
Python Code:
from sklearn.linear_model import LinearRegression
# Sample input features and target values
X = [[1], [2], [3], [4], [5]] # Input features
y = [2, 4, 6, 8, 10] # Target values
# Create a linear regression model
model = LinearRegression()
# Train the model
model.fit(X, y)
# Predict using the trained model
new_data = [[6], [7], [8]] # New input data for prediction
predictions = model.predict(new_data)
# Print the predictions
for i in range(len(new_data)):
    print("Input:", new_data[i][0], "Prediction:", predictions[i])
In this example, we first import the LinearRegression class from scikit-
learn. Then we define our input features X and target values y. Next, we
create an instance of the linear regression model using
LinearRegression(). We train the model by calling the fit() method and
passing in the input features X and target values y.
After training, we can use the trained model to make predictions on new,
unseen data. In this case, we define the new input data new_data and use
the predict() method of the model to obtain the predictions.
Finally, we print the input values along with their corresponding
predictions to see the results.
Loss Function: -
In machine learning, a loss function, also known as a cost function or an
objective function, is a mathematical function that quantifies the
discrepancy or error between the predicted output of a model and the
actual output or target value. It serves as a measure of how well the model
is performing and provides a way to optimize or train the model.
The choice of a loss function depends on the specific learning task and the
nature of the problem being solved. Different types of machine learning
problems, such as regression, classification, and clustering, often require
different loss functions.
Here are a few common types of loss functions:
1. Mean Squared Error (MSE): Used in regression problems, MSE
calculates the average squared difference between the predicted
and actual values. It penalizes larger errors more heavily.
2. Binary Cross-Entropy: Typically used in binary classification
problems, this loss function measures the dissimilarity between
predicted probabilities and true binary labels. It quantifies how
well the predicted probabilities match the true labels.
3. Categorical Cross-Entropy: Employed in multiclass classification
problems, categorical cross-entropy measures the dissimilarity
Page 21 of 70
between predicted class probabilities and true class labels. It
encourages the model to assign high probabilities to the correct
class.
4. Log Loss: Similar to cross-entropy, log loss is used in binary or
multiclass classification problems. It measures the performance of
a classification model by penalizing incorrect predictions based on
the logarithm of the predicted probabilities.
5. Kullback-Leibler Divergence: This loss function is often used in
tasks involving probabilistic models. It quantifies the difference
between two probability distributions, such as the predicted and
target distributions.
The goal of training a machine learning model is to minimize the chosen
loss function, typically achieved through optimization algorithms like
gradient descent. By iteratively adjusting the model's parameters based on
the gradients of the loss function, the model can learn to make better
predictions and minimize the error between its predictions and the ground
truth.
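As a small sketch of how two of these loss functions can be computed directly (the predictions, targets, and probabilities below are made-up numbers for illustration):
import numpy as np
# Regression example: Mean Squared Error between targets and predictions
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.4])
mse = np.mean((y_true - y_pred) ** 2)
# Classification example: binary cross-entropy between true labels and
# predicted probabilities of the positive class
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.7])
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print("MSE:", mse, "Binary cross-entropy:", bce)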
Gradient Descent:
Gradient descent is an optimization algorithm commonly used in machine
learning to minimize the loss or cost function of a model. It is an iterative
approach that adjusts the parameters of the model by moving in the
direction of steepest descent of the loss function. The goal is to find the
set of parameters that results in the minimum value of the loss function,
effectively improving the model's performance.
Here's how gradient descent works:
1. Initialization: The algorithm starts by initializing the parameters of
the model randomly or with some predefined values. These
parameters are the ones that will be updated during the optimization
process.
2. Compute the loss: The next step is to compute the value of the loss
function for the current set of parameters. The loss function
measures how well the model is performing, and the goal is to
minimize it.
3. Compute the gradient: The gradient represents the direction of the
steepest ascent of the loss function. It is a vector that points in the
direction of the greatest rate of increase of the loss. To compute the
gradient, partial derivatives of the loss function with respect to each
parameter are calculated.
4. Update the parameters: Once the gradient has been computed, the
parameters are updated by taking a small step in the opposite
direction of the gradient. This step is determined by the learning rate,
which controls the size of the parameter updates. The learning rate
is a hyperparameter that needs to be carefully chosen to ensure
convergence.
5. Repeat: Steps 2-4 are repeated iteratively until a stopping criterion
is met. This criterion could be a maximum number of iterations,
reaching a desired level of convergence, or any other condition
defined by the user.
By iteratively adjusting the parameters in the direction of the negative
gradient, gradient descent allows the model to find the optimal set of
parameters that minimize the loss function. There are different variants of
gradient descent, such as batch gradient descent, stochastic gradient
descent, and mini-batch gradient descent, each with its own characteristics
and trade-offs in terms of convergence speed and computational
efficiency.
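A minimal sketch of batch gradient descent for simple linear regression (y ≈ w*x + b, minimizing mean squared error) is shown below; the synthetic data, learning rate, and iteration count are illustrative choices.
import numpy as np
# Synthetic data: y = 3x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 200)
w, b = 0.0, 0.0      # 1. initialize the parameters
lr = 0.1             # learning rate (step size)
for _ in range(500):
    y_hat = w * x + b                 # 2. predictions for the current parameters
    error = y_hat - y
    grad_w = 2 * np.mean(error * x)   # 3. gradient of the MSE loss
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w                  # 4. step opposite to the gradient
    b -= lr * grad_b
print(round(w, 2), round(b, 2))       # should approach 3.0 and 1.0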
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variant of the gradient descent
algorithm that addresses some of the computational inefficiencies
associated with the standard batch gradient descent. While batch gradient
descent computes the gradient using the entire training dataset, stochastic
gradient descent updates the parameters based on the gradient of a single
training example at a time. This makes it more computationally efficient,
especially for large datasets.
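The same toy problem can be solved with stochastic updates: instead of averaging the gradient over all examples, the parameters are nudged after each individual example. This is a sketch under the same illustrative assumptions as the batch example above.
import numpy as np
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 200)
w, b, lr = 0.0, 0.0, 0.05
for _ in range(5):                      # a few passes (epochs) over the data
    for i in rng.permutation(len(x)):   # visit the examples in random order
        error = (w * x[i] + b) - y[i]
        w -= lr * 2 * error * x[i]      # gradient step from a single example
        b -= lr * 2 * error
print(round(w, 2), round(b, 2))         # again close to 3.0 and 1.0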
Linear Regression
Linear regression is an algorithm that provides a linear relationship between an
independent variable and a dependent variable to predict the outcome of future
events. It is a statistical method used in data science and machine learning for
predictive analysis.
The independent variable is also the predictor or explanatory variable that remains
unchanged due to the change in other variables. However, the dependent variable
changes with fluctuations in the independent variable. The regression model predicts
the value of the dependent variable, which is the response or outcome variable being
analyzed or studied.
Thus, linear regression is a supervised learning algorithm that simulates a
mathematical relationship between variables and makes predictions for continuous
or numeric variables such as sales, salary, age, product price, etc.
This analysis method is advantageous when at least two variables are available in
the data, as observed in stock market forecasting, portfolio management, scientific
analysis, etc.
A sloped straight line represents the linear regression model.
In the above figure,
X-axis = Independent variable
Y-axis = Output / dependent variable
Line of regression = Best fit line for a model
Here, a line is plotted that suitably fits all the given data points. Hence,
it is called the ‘best fit line.’ The goal of the linear regression algorithm is to find
this best fit line, as seen in the above figure.
Key benefits of linear regression
Linear regression is a popular statistical tool used in data science, thanks to the
several benefits it offers, such as:
1. Easy implementation
The linear regression model is computationally simple to implement as it does not
demand a lot of engineering overheads, neither before the model launch nor during
its maintenance.
2. Interpretability
Unlike other deep learning models (neural networks), linear regression is relatively
straightforward. As a result, this algorithm stands ahead of black-box models that
fall short in justifying which input variable causes the output variable to change.
3. Scalability
Linear regression is not computationally heavy and, therefore, fits well in cases
where scaling is essential. For example, the model can scale well regarding increased
data volume (big data).
4. Optimal for online settings
The ease of computation of these algorithms allows them to be used in online
settings. The model can be trained and retrained with each new example to generate
predictions in real-time, unlike the neural networks or support vector machines that
are computationally heavy and require plenty of computing resources and substantial
waiting time to retrain on a new dataset. All these factors make such compute-
intensive models expensive and unsuitable for real-time applications.
Equation:
Y = m*X + b
Where Y = dependent variable (target)
X = independent variable
m = slope of the line (slope is defined as the ‘rise’ over the ‘run’)
b = the intercept, i.e., the value of Y when X = 0
Non Linear Regression:
Nonlinear regression is a type of regression analysis in machine learning
where the relationship between the input features and the target variable
is modeled using a nonlinear function. Unlike linear regression, which
assumes a linear relationship, nonlinear regression allows for more
complex and flexible relationships between the variables.
In nonlinear regression, the goal is still to find the best-fitting curve or
surface that minimizes the difference between the predicted values and
the actual values of the target variable. However, instead of a linear
equation, a nonlinear function is used to model the relationship. The form
of the function is typically determined based on prior knowledge of the
problem domain or by exploring different functional forms through
experimentation.
The parameters of the nonlinear function, such as coefficients or
parameters that define the shape of the curve, are learned during the
training phase. This is usually done by minimizing a cost function, such
as the sum of squared residuals, using optimization techniques like
gradient descent, Levenberg-Marquardt, or genetic algorithms.
Nonlinear regression models can handle a wide range of nonlinear
relationships, including polynomial, exponential, logarithmic, sigmoidal,
and trigonometric functions. They are often used when there are clear
indications of a nonlinear relationship between the input features and the
target variable, or when linear models fail to capture the underlying
patterns in the data.
However, nonlinear regression can be more challenging than linear
regression due to the increased complexity and the potential for
overfitting. Overfitting occurs when the model becomes too complex and
fits the noise in the training data, leading to poor generalization
performance on unseen data. Regularization techniques, such as adding
regularization terms or using techniques like cross-validation, can help
mitigate overfitting in nonlinear regression.
Nonlinear regression is widely used in various fields, including finance,
economics, biology, physics, and engineering, where the relationships
between variables are often nonlinear and require more flexible modeling
approaches.
Example:
import numpy as np
import matplotlib.pyplot as plt
# Generate some synthetic data with a quadratic relationship
x = np.linspace(-5, 5, 100)
y = 3 * x**2 + 2 * x + 1 + np.random.normal(0, 5, 100)
# Fit the data to a quadratic function
coefficients = np.polyfit(x, y, 2)
# Extract the fitted coefficients
a_fit, b_fit, c_fit = coefficients
# Generate predictions using the fitted coefficients
y_pred = a_fit * x**2 + b_fit * x + c_fit
# Plot the original data and the fitted curve
plt.scatter(x, y, label='Original Data')
plt.plot(x, y_pred, color='red', label='Fitted Curve')
plt.legend()
plt.xlabel('x')
plt.ylabel('y')
plt.title('Nonlinear Regression Example')
plt.show()
Perceptron
The perceptron is a fundamental building block of artificial neural
networks and is a simple model of a biological neuron. It was introduced
by Frank Rosenblatt in 1957 and is considered one of the earliest forms
of a machine learning algorithm.
The perceptron is a binary classifier that takes an input vector and
produces a binary output, indicating whether the input belongs to a certain
class or not. It consists of three main components: input values, weights,
and an activation function.
Here's how a perceptron works:
1. Input values: The perceptron receives a set of input values
represented as a vector. Each input is associated with a weight,
which determines the significance of that input.
2. Weighted sum: The input values are multiplied by their
corresponding weights, and the weighted values are summed up.
3. Activation function: The weighted sum is then passed through an
activation function, which introduces nonlinearity to the model. The
activation function determines the output of the perceptron based on
the weighted sum.
4. Thresholding: The output of the activation function is compared to
a threshold value. If the output is above the threshold, the perceptron
predicts one class; otherwise, it predicts the other class.
During the learning phase, the weights of the perceptron are adjusted
based on the training data. The adjustment is done through an algorithm
called the perceptron learning rule, which updates the weights to minimize
the prediction errors. The learning rule iteratively updates the weights
until the perceptron reaches a satisfactory level of accuracy on the training
data.
Perceptrons can be combined to form more complex models, such as
multilayer perceptrons (MLPs), which have one or more layers of
interconnected perceptrons. MLPs are capable of learning more complex
decision boundaries and are widely used in various machine learning
tasks, including classification, regression, and pattern recognition.
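A minimal from-scratch sketch of a single perceptron is shown below; the toy AND problem, learning rate, and number of passes are illustrative choices, not part of the notes above.
import numpy as np
# Toy, linearly separable problem: the logical AND of two inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = np.zeros(2)   # weights, one per input
b = 0.0           # bias term
lr = 0.1          # learning rate
for _ in range(10):                    # a few passes over the training data
    for xi, target in zip(X, y):
        # step activation: predict 1 if the weighted sum crosses the threshold
        pred = 1 if np.dot(w, xi) + b > 0 else 0
        # perceptron learning rule: adjust weights by the prediction error
        update = lr * (target - pred)
        w += update * xi
        b += update
preds = [1 if np.dot(w, xi) + b > 0 else 0 for xi in X]
print(preds)   # expected to reproduce the AND labels: [0, 0, 0, 1]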
Support vector machine (SVM)
Support vector machines are a set of supervised learning methods used for
classification, regression, and outliers detection. All of these are common
tasks in machine learning.
These can be used to detect cancerous cells based on millions of images
or to predict future driving routes with a well-fitted regression model.
There are specific types of SVMs for particular machine learning
problems, like support vector regression (SVR) which is an extension of
support vector classification (SVC).
SVMs are different from other classification algorithms because of the
way they choose the decision boundary that maximizes the distance from
the nearest data points of all the classes. The decision boundary created
by SVMs is called the maximum margin classifier or the maximum
margin hyper plane.
How an SVM works?
A simple linear SVM classifier works by making a straight line between
two classes. That means all of the data points on one side of the line will
represent a category and the data points on the other side of the line will
be put into a different category. This means there can be an infinite
number of lines to choose from.
What makes the linear SVM algorithm better than some of the other
algorithms, like k-nearest neighbors, is that it chooses the best line to
classify your data points: the line that separates the data and is as far
away from the closest data points as possible.
A 2-D example helps to make sense of all the machine learning jargon.
Basically you have some data points on a grid. You're trying to separate
these data points by the category they should fit in, but you don't want to
have any data in the wrong category. That means you're trying to find the
line between the two closest points that keeps the other data points
separated.
So the two closest data points give you the support vectors you'll use to
find that line. That line is called the decision boundary.
Types of SVMs
There are two different types of SVMs, each used for different things:
Simple SVM: Typically used for linear regression and classification
problems.
Kernel SVM: Has more flexibility for non-linear data because you
can add more features to fit a hyperplane instead of a two-
dimensional space.
Why SVMs are used in machine learning
SVMs are used in applications like handwriting recognition, intrusion
detection, face detection, email classification, gene classification, and in
web pages. This is one of the reasons we use SVMs in machine learning.
It can handle both classification and regression on linear and non-linear
data.
Another reason we use SVMs is because they can find complex
relationships between your data without you needing to do a lot of
transformations on your own. It's a great option when you are working
with smaller datasets that have tens to hundreds of thousands of features.
They typically find more accurate results when compared to other
algorithms because of their ability to handle small, complex datasets.
Pros
Effective on datasets with multiple features, like financial or
medical data.
Effective in cases where number of features is greater than the
number of data points.
Uses a subset of training points in the decision function called
support vectors which makes it memory efficient.
Different kernel functions can be specified for the decision function.
You can use common kernels, but it's also possible to specify custom
kernels.
Cons
If the number of features is much larger than the number of data
points, avoiding over-fitting when choosing the kernel function and
regularization term is crucial.
SVMs don't directly provide probability estimates. Those are
calculated using an expensive five-fold cross-validation.
Works best on small sample sets because of its high training time.
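A minimal sketch with scikit-learn's SVC is shown below; the breast-cancer dataset, the RBF kernel, and C=1.0 are illustrative assumptions. Features are standardized first, since SVMs are sensitive to feature scales.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Load a small binary-classification dataset and hold out a test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features, then fit a kernel SVM (RBF handles non-linear boundaries)
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))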
Softmax
In machine learning, softmax refers to a mathematical function that is
often used as the activation function in the output layer of a neural
network for multi-class classification problems. It is especially useful
when the classes are mutually exclusive, meaning that each input belongs
to only one class.
The softmax function takes a vector of real-valued scores or logits as input
and transforms them into a probability distribution over the classes. It
computes the exponential of each score and normalizes them by dividing
by the sum of the exponentiated scores. The resulting values range
between 0 and 1, and their sum is equal to 1, making them interpretable
as probabilities.
Mathematically, given a vector of scores or logits,
denoted as z = [z1, z2, ..., zK], where K is the number of classes, the
softmax function computes the probability p for each class as follows:
p_i = exp(z_i) / sum(exp(z_j)) for i = 1 to K
Here, exp(x) represents the exponential function.
The softmax function is often used in combination with the categorical
cross-entropy loss function during training. The categorical cross-entropy
measures the difference between the predicted class probabilities and the
true class labels.
By using softmax as the final activation function, the neural network
learns to produce probability distributions that maximize the likelihood of
the correct class while minimizing the likelihood of the incorrect classes.
This allows the model to make probabilistic predictions for multi-class
classification tasks.
In Python, you can apply softmax to a vector of scores using libraries such
as NumPy or TensorFlow. Here's a simple example using NumPy:
import numpy as np
# Example scores for three classes
scores = np.array([2.0, 1.0, 0.5])
# Applying softmax function
probabilities = np.exp(scores) / np.sum(np.exp(scores))
print(probabilities)
In this example, the scores array represents the raw scores or logits for
three classes. We compute the softmax probabilities using NumPy
operations by exponentiating the scores and dividing by the sum of the
exponentiated scores. The resulting probabilities array contains the
probabilities assigned to each class based on the softmax function.
Neural Network
A neural network, also known as an artificial neural network (ANN) or a
deep neural network (DNN), is a machine learning model inspired by the
structure and function of biological neurons in the human brain. It consists
of interconnected nodes called artificial neurons or units, organized in
layers. Neural networks are used for a variety of tasks, including
classification, regression, pattern recognition, and more.
Here are some key components and concepts related to neural networks:
1. Neurons (or Nodes): Neurons are the fundamental building blocks
of a neural network. They receive inputs, perform computations, and
produce an output. Each neuron applies an activation function to its
weighted inputs to determine its output.
2. Layers: Neurons are organized into layers. A neural network
typically consists of an input layer, one or more hidden layers, and
an output layer. The input layer receives the initial input data, the
hidden layers process and transform the information, and the output
layer produces the final predictions or outputs.
3. Weights and Biases: Each connection between neurons in adjacent
layers is associated with a weight. These weights represent the
strength or importance of the connection. Additionally, each neuron
has an associated bias term, which allows it to adjust the output
based on an additional learned parameter.
4. Activation Functions: Activation functions introduce nonlinearity to
the neural network, enabling it to learn complex patterns and
relationships. Common activation functions include the sigmoid
function, hyperbolic tangent (tanh), and rectified linear unit (ReLU).
5. Forward Propagation: Forward propagation is the process of passing
the input data through the neural network, layer by layer, to produce
an output. Each neuron in a layer receives inputs from the previous
layer, applies the activation function to the weighted sum of its
inputs, and passes the output to the next layer.
6. Backpropagation: Backpropagation is a crucial algorithm used in
neural network training. It calculates the gradient of the loss
function with respect to the network's weights and biases. By
iteratively adjusting the weights and biases based on the gradient,
backpropagation allows the network to learn and improve its
predictions.
7. Loss Function: The loss function measures the discrepancy between
the predicted outputs of the neural network and the true labels or
target values. The choice of the loss function depends on the specific
task, such as mean squared error (MSE) for regression or cross-
entropy for classification.
8. Optimization: Neural networks use optimization algorithms, such as
gradient descent, to minimize the loss function. These algorithms
iteratively adjust the weights and biases of the network based on the
calculated gradients, gradually improving the network's
performance.
Neural networks are powerful models capable of learning complex
representations and patterns from data. They have been successfully
applied to various domains, including image and speech recognition,
natural language processing, recommender systems, and many others.
Python Example:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
# Generate some random data for demonstration
X = np.random.random((100, 5))
y = np.random.randint(2, size=(100, 1))
# Create a sequential model
model = Sequential()
# Add a dense (fully connected) layer with 10 units and ReLU activation
model.add(Dense(10, input_dim=5, activation='relu'))
# Add an output layer with 1 unit and sigmoid activation for binary classification
model.add(Dense(1, activation='sigmoid'))
# Compile the model with a binary cross-entropy loss function and Adam optimizer
model.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
# Train the model on the data
model.fit(X, y, epochs=10, batch_size=32)
# Make predictions on new data
new_data = np.random.random((10, 5))
predictions = model.predict(new_data)
print(predictions)
In this example, we first generate some random data for demonstration
purposes. Then, we create a sequential model using the Keras library. We
add a dense (fully connected) layer with 10 units and ReLU activation as
the input layer, followed by an output layer with 1 unit and sigmoid
activation for binary classification.
We compile the model by specifying a binary cross-entropy loss function
and the Adam optimizer. Then, we train the model on the generated data
using the fit function, specifying the number of epochs and the batch size.
After training, we can make predictions on new data using the trained
model. In this example, we generate some new random data (new_data)
and use the predict function to obtain the predictions.
Note that this is a minimalistic example, and in practice, you would
typically preprocess your data, split it into training and testing sets, and
perform more comprehensive model evaluation.
Supervised Learning
What is supervised learning?
Supervised learning, also known as supervised machine learning, is a
subcategory of machine learning and artificial intelligence. It is defined
by its use of labeled datasets to train algorithms to classify data or
predict outcomes accurately. As input data is fed into the model, it adjusts
its weights until the model has been fitted appropriately, which occurs as
part of the cross validation process. Supervised learning helps
organizations solve for a variety of real-world problems at scale, such as
classifying spam in a separate folder from your inbox.
How supervised learning works?
Supervised learning uses a training set to teach models to yield the desired
output. This training dataset includes inputs and correct outputs, which
allow the model to learn over time. The algorithm measures its accuracy
through the loss function, adjusting until the error has been sufficiently
minimized.
Supervised learning can be separated into two types of problems when
data mining—classification and regression:
Classification uses an algorithm to accurately assign test data into
specific categories. It recognizes specific entities within the dataset
and attempts to draw some conclusions on how those entities
should be labeled or defined. Common classification algorithms
are linear classifiers, support vector machines (SVM), decision
trees, k-nearest neighbor, and random forest, which are described
in more detail below.
Regression is used to understand the relationship between
dependent and independent variables. It is commonly used to make
projections, such as for sales revenue for a given business. Linear
regression, logistical regression, and polynomial regression are
popular regression algorithms.
Supervised learning algorithms
Various algorithms and computation techniques are used in supervised
machine learning processes. Below are brief explanations of some of the
most commonly used learning methods, typically calculated through use
of programs like R or Python:
Neural networks: Primarily leveraged for deep learning
algorithms, neural networks process training data by mimicking
the interconnectivity of the human brain through layers of nodes.
Each node is made up of inputs, weights, a bias (or threshold), and
an output. If that output value exceeds a given threshold, it “fires”
or activates the node, passing data to the next layer in the network.
Neural networks learn this mapping function through supervised
learning, adjusting based on the loss function through the process
of gradient descent. When the cost function is at or near zero, we
can be confident in the model’s accuracy to yield the correct
answer.
Naive Bayes: Naive Bayes is a classification approach that adopts
the principle of class conditional independence from the Bayes
Theorem. This means that the presence of one feature does not
impact the presence of another in the probability of a given
outcome, and each predictor has an equal effect on that result.
There are three types of Naïve Bayes classifiers: Multinomial
Naïve Bayes, Bernoulli Naïve Bayes, and Gaussian Naïve Bayes.
This technique is primarily used in text classification, spam
identification, and recommendation systems.
Linear regression: Linear regression is used to identify the
relationship between a dependent variable and one or more
independent variables and is typically leveraged to make
predictions about future outcomes. When there is only one
independent variable and one dependent variable, it is known as
simple linear regression. As the number of independent variables
increases, it is referred to as multiple linear regression. For each
type of linear regression, it seeks to plot a line of best fit, which is
calculated through the method of least squares. However, unlike
other regression models, this line is straight when plotted on a
graph.
Logistic regression: While linear regression is leveraged when
dependent variables are continuous, logistic regression is selected
when the dependent variable is categorical, meaning they have
binary outputs, such as "true" and "false" or "yes" and "no." While
both regression models seek to understand relationships between
data inputs, logistic regression is mainly used to solve binary
classification problems, such as spam identification.
Support vector machines (SVM): A support vector machine is a
popular supervised learning model developed by Vladimir Vapnik,
used for both data classification and regression. That said, it is
typically leveraged for classification problems, constructing a
hyperplane where the distance between two classes of data points
is at its maximum. This hyperplane is known as the decision
boundary, separating the classes of data points (e.g., oranges vs.
apples) on either side of the plane.
K-nearest neighbor: K-nearest neighbor, also known as the KNN
algorithm, is a non-parametric algorithm that classifies data points
based on their proximity and association to other available data.
This algorithm assumes that similar data points can be found near
each other. As a result, it seeks to calculate the distance between
data points, usually through Euclidean distance, and then it assigns
a category based on the most frequent category or average. Its ease
of use and low calculation time make it a preferred algorithm by
data scientists, but as the test dataset grows, the processing time
lengthens, making it less appealing for classification tasks. KNN is
typically used for recommendation engines and image recognition.
Random forest: Random forest is another flexible supervised
machine learning algorithm used for both classification and
regression purposes. The "forest" references a collection of
uncorrelated decision trees, which are then merged together to
reduce variance and create more accurate data predictions.
Unsupervised learning
Unsupervised learning is a branch of machine learning that deals with unlabeled
data. Unlike supervised learning, where the data is labeled with a specific category
or outcome, unsupervised learning algorithms are tasked with finding patterns and
relationships within the data without any prior knowledge of the data’s meaning.
Unsupervised machine learning algorithms find hidden patterns and data without
any human intervention, i.e., we don’t give output to our model. The training model
has only input parameter values and discovers the groups or patterns on its own.
The image shows a set of animals (elephants, camels, and cows) that represents
the raw data the unsupervised learning algorithm will process.
The “Interpretation” stage signifies that the algorithm doesn’t have predefined
labels or categories for the data. It needs to figure out how to group or organize
the data based on inherent patterns.
“Algorithm” represents the core of the unsupervised learning process, using
techniques like clustering, dimensionality reduction, or anomaly detection to
identify patterns and structures in the data.
The “Processing” stage shows the algorithm working on the data.
The output shows the results of the unsupervised learning process. In this case, the
algorithm might have grouped the animals into clusters based on their species
(elephants, camels, cows).
How does unsupervised learning work?
Unsupervised learning works by analyzing unlabeled data to identify patterns and
relationships. The data is not labeled with any predefined categories or outcomes,
so the algorithm must find these patterns and relationships on its own. This can be
a challenging task, but it can also be very rewarding, as it can reveal insights into
the data that would not be apparent from a labeled dataset.
The dataset in Figure A is mall data containing information about the clients
who subscribe to the mall. Once subscribed, each client is given a membership
card, so the mall has complete information about the customer and his/her every
purchase. Using this data and unsupervised learning techniques, the mall can
easily group clients based on the parameters we feed in.
The input to the unsupervised learning models is as follows:
Unstructured data: May contain noisy (meaningless) data, missing values, or unknown data
Unlabeled data: The data contains values only for the input parameters; there is
no target value (output). It is easier to collect than the labeled data used in
the supervised approach.
Unsupervised Learning Algorithms:
There are mainly three types of algorithms used for unsupervised datasets:
Clustering
Association Rule Learning
Dimensionality Reduction
1. Clustering Algorithms
Clustering in unsupervised machine learning is the process of grouping unlabeled
data into clusters based on their similarities. The goal of clustering is to identify
patterns and relationships in the data without any prior knowledge of the data’s
meaning.
Broadly, this technique is applied to group data based on the different patterns,
such as similarities or differences, that our model finds. These algorithms are
used to process raw, unclassified data objects into groups. For example, in the
figure above, we have not given output parameter values, so this technique will
be used to group clients based on the input parameters provided by our data.
Some common clustering algorithms:
K- means clustering
Hierarchical clustering
Density-based clustering (DBSCAN)
Mean-shift clustering
Spectral clustering
2. Association Rule Learning:
Association rule learning, also known as association rule mining, is a common
technique used to discover associations in unsupervised machine learning. It is
a rule-based ML technique that finds useful relations between the parameters of
a large dataset. It is mainly used for market basket analysis, which helps to
better understand the relationships between different products.
For example, shopping stores use algorithms based on this technique to find the
relationship between the sale of one product and the sales of others, based on
customer behavior. If a customer buys milk, they may also buy bread, eggs,
or butter. Once trained well, such models can be used to increase sales by
planning different offers.
Some common Association Rule Learning algorithms:
Apriori Algorithm: Finds patterns by exploring frequent item combinations step-by-
step.
FP-Growth Algorithm: An efficient alternative to Apriori that quickly identifies
frequent patterns without generating candidate sets.
Eclat Algorithm: Uses intersections of itemsets to efficiently find frequent patterns.
Efficient Tree-based Algorithms: Scale to handle large datasets by organizing data
in tree structures.
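As a rough sketch of market basket analysis, the Apriori algorithm and
association rules can be run with the third-party mlxtend library (assumed
to be installed via pip install mlxtend; exact signatures may differ slightly
between versions). The transactions below are invented for illustration.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# Invented customer baskets
transactions = [
    ['milk', 'bread', 'eggs'],
    ['milk', 'bread'],
    ['milk', 'butter'],
    ['bread', 'eggs'],
    ['milk', 'bread', 'butter'],
]
# One-hot encode the baskets into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
# Itemsets appearing in at least 40% of baskets
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
# Rules such as {milk} -> {bread} with at least 60% confidence
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])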
3. Dimensionality Reduction:
Dimensionality reduction is the process of reducing the number of features in a
dataset while preserving as much information as possible. This technique is useful
for improving the performance of machine learning algorithms and for data
visualization.
Imagine a dataset of 100 features about students (height, weight, grades, etc.). To
focus on key traits, you reduce it to just 2 features: height and grades, making it
easier to visualize or analyze the data.
Here are some popular Dimensionality Reduction algorithms:
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Non-negative Matrix Factorization (NMF)
Locally Linear Embedding (LLE)
Isomap
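As a minimal sketch, PCA can be applied with scikit-learn to reduce the
four iris features to two principal components for easier visualization.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
X = load_iris().data            # 150 samples x 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance kept by each component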
Challenges of Unsupervised Learning:
Noisy Data: Outliers and noise can distort patterns and reduce the effectiveness
of algorithms.
Assumption Dependence: Algorithms often rely on assumptions (e.g., cluster
shapes), which may not match the actual data structure.
Overfitting Risk: Overfitting can occur when models capture noise instead of
meaningful patterns in the data.
Limited Guidance: The absence of labels restricts the ability to guide the algorithm
toward specific outcomes.
Cluster Interpretability: Results, such as clusters, may lack clear meaning or
alignment with real-world categories.
Sensitivity to Parameters: Many algorithms require careful tuning of
hyperparameters, such as the number of clusters in k-means.
Lack of Ground Truth: Unsupervised learning lacks labeled data, making it difficult
to evaluate the accuracy of results.
Applications of Unsupervised learning:
Unsupervised learning has diverse applications across industries and domains. Key
applications include:
Customer Segmentation: Algorithms cluster customers based on purchasing
behavior or demographics, enabling targeted marketing strategies.
Anomaly Detection: Identifies unusual patterns in data, aiding fraud detection,
cybersecurity, and equipment failure prevention.
Recommendation Systems: Suggests products, movies, or music by analyzing user
behavior and preferences.
Image and Text Clustering: Groups similar images or documents for tasks like
organization, classification, or content recommendation.
Social Network Analysis: Detects communities or trends in user interactions on
social media platforms.
Astronomy and Climate Science: Classifies galaxies or groups weather patterns to
support scientific research.
K-means
K-means is a popular unsupervised learning algorithm in machine
learning used for clustering tasks. It aims to partition a given dataset into
K clusters, where K is a predefined number determined by the user.
Here's a simplified explanation of the K-means algorithm:
1. Initialization: Randomly select K data points from the dataset as
initial cluster centroids.
2. Assignment: For each data point, calculate its distance to each
centroid and assign it to the cluster with the nearest centroid.
3. Update: Recalculate the centroids of the clusters based on the mean
of the data points assigned to each cluster.
4. Repeat Steps 2 and 3: Iterate the assignment and update steps until
convergence, which occurs when the centroids no longer change
significantly or a maximum number of iterations is reached.
5. Final Clustering: After convergence, the algorithm returns K
clusters, where each data point is assigned to one of the clusters
based on its nearest centroid.
The choice of K is critical and often requires domain knowledge or
exploration of the dataset. Different initializations can result in different
final clusterings, so it's common to run the algorithm multiple times with
different random initializations and select the clustering with the lowest
overall distortion or within-cluster sum of squares.
K-means is known for its simplicity, scalability, and interpretability.
However, it has limitations, such as sensitivity to the initial centroid
selection and the assumption of isotropic and equally sized clusters.
Additionally, K-means can struggle with clusters of non-convex shape or
when the cluster sizes differ significantly.
In Python, you can use various machine learning libraries such as scikit-
learn to apply K-means clustering. Here's a short example of how to use
scikit-learn to perform K-means clustering:
from sklearn.cluster import KMeans
import numpy as np
# Generate some random data for clustering
X = np.random.rand(100, 2)
# Create a K-means clustering model with K=3
kmeans = KMeans(n_clusters=3)
# Fit the model to the data
kmeans.fit(X)
# Get the cluster labels for each data point
labels = kmeans.labels_
# Get the cluster centroids
centroids = kmeans.cluster_centers_
# Print the cluster labels and centroids
print ("Cluster Labels:", labels)
print ("Cluster Centroids:", centroids)
In this example, we generate random data (X) with 100 samples and 2
features. We create a K-means model with n_clusters=3 to generate three
clusters. We fit the model to the data using the fit method, and then obtain
the cluster labels for each data point using the labels_ attribute. Finally,
we retrieve the cluster centroids using the cluster_centers_ attribute.
Clustering and its Types:
Clustering is a technique in machine learning and data analysis that
involves grouping similar data points together into clusters or segments.
The goal of clustering is to discover inherent patterns or structures in the
data without using predefined labels. Different types of clustering
algorithms can be used based on their underlying principles and
characteristics. Here are some common types of clustering algorithms:
1. K-Means Clustering:
K-Means is one of the most popular and widely used
clustering algorithms.
It aims to partition the data into K clusters, where K is a user-
defined parameter.
Data points are assigned to the nearest cluster centroid based
on the Euclidean distance.
K-Means works well when clusters are spherical and equally
sized.
2. Hierarchical Clustering:
Hierarchical clustering creates a tree-like structure of
clusters, also known as a dendrogram.
It can be agglomerative (bottom-up) or divisive (top-down).
Agglomerative hierarchical clustering starts with individual
data points and merges them into clusters.
Divisive hierarchical clustering starts with all data points in a
single cluster and recursively divides them into smaller
clusters.
Hierarchical clustering is useful when the data has a nested or
hierarchical structure.
3. DBSCAN (Density-Based Spatial Clustering of Applications
with Noise):
DBSCAN identifies dense regions in the data and groups
points that are close to each other in terms of a distance
metric.
It can discover clusters of arbitrary shapes and is robust to
noise.
Points not belonging to any cluster are considered as noise or
outliers.
DBSCAN does not require the user to specify the number of
clusters.
4. Mean Shift Clustering:
Mean Shift is a non-parametric clustering algorithm that
iteratively shifts points towards the mode (peak) of the
density function.
It can find clusters of varying shapes and sizes.
Mean Shift is especially useful for cases where the number of
clusters is not known in advance.
5. Gaussian Mixture Models (GMM):
GMM assumes that the data is generated from a mixture of
several Gaussian distributions.
It estimates the parameters (mean, covariance, and mixing
coefficients) of these distributions to fit the data.
GMM can model clusters with different shapes and
orientations and allows for probabilistic cluster assignments.
6. Agglomerative Clustering:
Agglomerative clustering starts with individual data points as
separate clusters and merges them iteratively.
The algorithm uses linkage criteria (e.g., average linkage,
complete linkage) to determine how to merge clusters.
Agglomerative clustering creates a hierarchy of clusters that
can be cut at different levels to obtain different numbers of
clusters.
7. Fuzzy C-Means Clustering:
Fuzzy C-Means assigns each data point a membership value
for each cluster, indicating the degree of belongingness.
Unlike K-Means, where a point belongs to only one cluster,
Fuzzy C-Means allows points to belong to multiple clusters
with varying degrees of membership.
These are just a few examples of clustering algorithms. The choice of
algorithm depends on the characteristics of your data, the desired number
of clusters, and the goals of your analysis. It's important to explore and
experiment with different algorithms to find the one that best suits your
specific problem.
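As one concrete sketch, agglomerative (bottom-up) hierarchical clustering
can be run with scikit-learn; the blob data below is synthetic and only
for illustration.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
# Three well-separated synthetic clusters
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
# Average linkage measures cluster-to-cluster distance when merging
agg = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = agg.fit_predict(X)
print(labels[:10])   # cluster index assigned to the first ten points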
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
is a popular clustering algorithm used in machine learning for identifying
clusters of data points based on their density distribution. Unlike k-means
or hierarchical clustering, which rely on fixed distance or linkage criteria,
DBSCAN works by defining clusters as regions of high data point density
separated by regions of low density. This makes it particularly effective
at identifying clusters of arbitrary shapes and handling noise in the data.
Key concepts of DBSCAN:
1. Core Points: A data point is considered a core point if it has at
least a specified number of other points (minPts) within a specified
radius (eps) around it. These core points are the foundation of
clusters.
2. Border Points: A data point is considered a border point if it is
within the radius of eps of a core point, but does not have enough
neighboring points to be considered a core point itself.
3. Noise Points: Data points that are not core points or border points
are treated as noise points, as they do not belong to any cluster.
The algorithm works as follows:
1. Select a Core Point: Choose a random data point and check if it
has at least minPts data points within distance eps.
2. Expand the Cluster: If the point is a core point, start a new cluster
and add it to the cluster. Then, recursively add all directly
reachable points (within eps distance) to the cluster. These points
can be core or border points.
3. Explore Neighbor Points: For each core point found in step 2,
recursively add all directly reachable points to the cluster. This
step ensures that the cluster is expanded to cover all dense regions.
4. Continue Exploration: Continue this process until no more core
points can be found, and all dense regions have been covered by
clusters.
5. Assign Noise Points: Any remaining data points that have not
been assigned to a cluster are treated as noise points.
DBSCAN advantages:
Can identify clusters of various shapes and sizes.
Can handle noise and outliers effectively by designating them as
noise points.
Does not require specifying the number of clusters beforehand.
Does not assume a specific distribution of data points.
DBSCAN parameters:
eps (Epsilon): The maximum distance between two data points for
one to be considered a neighbor of the other.
minPts (Minimum Points): The minimum number of data points
required within the epsilon neighborhood of a point for it to be
considered a core point.
DBSCAN limitations:
Sensitivity to parameter settings: The choice of eps and minPts can
significantly affect the resulting clusters.
Struggles with clusters of varying densities: If there are clusters of
vastly different densities, DBSCAN might not work well out of the
box.
Difficulty handling high-dimensional data: The concept of
"density" becomes less clear in high-dimensional spaces.
To use DBSCAN, you need to preprocess your data, set appropriate values
for eps and minPts, and then apply the algorithm. Libraries like Scikit-
learn in Python provide implementations of DBSCAN.
Python code:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
# Load the iris dataset (you can replace this with your own dataset)
data = load_iris()
X = data.data
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create a DBSCAN instance
eps = 0.5 # Adjust the value based on your dataset
min_samples = 5 # Adjust the value based on your dataset
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
# Fit the DBSCAN model
labels = dbscan.fit_predict(X_scaled)
# Calculate silhouette score (optional, for evaluation)
silhouette_avg = silhouette_score(X_scaled, labels)
print (f"Silhouette Score: {silhouette_avg}")
# Add the cluster labels to the original data
result = pd.DataFrame(data.data, columns=data.feature_names)
result['cluster'] = labels
# Print the count of points in each cluster
print(result['cluster'].value_counts())
Remember to replace the dataset loading part (load_iris()) with loading your own
dataset using pandas or any other suitable method. Adjust the eps and min_samples
parameters based on your data and the desired behavior of the algorithm.
Keep in mind that DBSCAN doesn't require you to specify the number of clusters
beforehand, but it's important to choose suitable values for eps and min_samples to
capture the underlying structure of your data. Additionally, clustering results might
vary based on the dataset characteristics and parameters used.
Gaussian Mixture Model
A Gaussian Mixture Model (GMM) is a probabilistic model used in
machine learning for modeling and representing data that comes from
multiple Gaussian distributions. It's a powerful technique for
unsupervised clustering, density estimation, and generating synthetic data
that resembles a given dataset.
The fundamental idea behind a GMM is that a dataset is assumed to be
generated by a mixture of several Gaussian distributions, each
representing a different cluster or mode. Each Gaussian distribution in the
mixture is often referred to as a "component" of the mixture model. The
GMM tries to estimate the parameters (mean, covariance, and mixing
coefficients) of these Gaussian components to best fit the observed data.
Key concepts of GMM:
1. Components: A GMM consists of a predefined number of
Gaussian components, each characterized by its mean vector and
covariance matrix. Each component represents a potential cluster
in the data.
2. Mixture Weights: Each component is associated with a mixing
weight that indicates the probability of a data point belonging to
that component. The sum of all mixing weights equals 1.
3. Probability Density Function: The GMM defines a probability
density function that represents the likelihood of observing a data
point given the mixture of Gaussian components. The overall
density is the weighted sum of the densities of the individual
components.
4. Expectation-Maximization (EM) Algorithm: The process of
estimating the parameters of a GMM often involves using the EM
algorithm. EM iteratively optimizes the model's parameters by
alternating between two steps: the E-step (expectation) computes
the responsibilities of each component for each data point, and the
M-step (maximization) updates the component parameters using
the computed responsibilities.
GMM advantages:
Flexibility: GMM can capture complex data distributions by
modeling them as combinations of simpler Gaussian distributions.
Soft Clustering: GMM provides soft assignments of data points to
clusters, meaning that each data point is associated with
probabilities for each component, indicating how likely it belongs
to each cluster.
Generative Capability: GMM can be used to generate new data
points that resemble the original dataset by sampling from the
learned Gaussian components.
GMM limitations:
Number of Components: The number of components needs to be
specified beforehand. Choosing an appropriate number of
components can be challenging and may require domain
knowledge or additional techniques.
Computationally Intensive: GMM, especially with a large
number of dimensions or components, can be computationally
demanding and may require careful initialization to converge to
meaningful results.
Sensitive to Initialization: The EM algorithm used for GMM
estimation can converge to different solutions based on the initial
parameter values.
Here's a simple example of using GMM in Python with the Scikit-learn
library for data clustering:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
# Generate synthetic data
np.random.seed(0)
n_samples = 300
n_components = 3
X = np.concatenate([
np.random.normal(0, 1, int(0.8 * n_samples)),
np.random.normal(5, 1, int(0.1 * n_samples)),
np.random.normal(10, 1, int(0.1 * n_samples))
]).reshape(-1, 1)
# Fit GMM to the data
gmm = GaussianMixture(n_components=n_components)
gmm.fit(X)
# Predict the cluster assignments
labels = gmm.predict(X)
# Plot the data and GMM components
plt.scatter(X, np.zeros_like(X), c=labels, cmap='viridis', s=40)
plt.xlabel('Data')
plt.title('Gaussian Mixture Model Clustering')
plt.show()
In this example, GMM is used to cluster a univariate dataset into three
components. The GMM algorithm fits Gaussian components to the data
and assigns each data point to one of the components based on the highest
posterior probability.
Remember that GMM can be applied to multivariate data as well, and it's
important to choose an appropriate number of components based on your
problem and dataset characteristics.
EM Algorithm
The Expectation-Maximization (EM) algorithm is an iterative
optimization technique used for estimating the parameters of statistical
models, particularly when dealing with situations where data is
incomplete, unobserved, or involves latent variables. EM is widely used
in various fields, including statistics, machine learning, and signal
processing.
At its core, the EM algorithm is designed to find the maximum likelihood
or maximum a posteriori (MAP) estimates of model parameters when
there is missing information in the data. It alternates between two steps:
the Expectation (E) step and the Maximization (M) step. These steps are
iteratively repeated until convergence.
Here's a high-level overview of how the EM algorithm works:
1. Expectation (E) Step:
In the E-step, the algorithm computes the expected value of
the log-likelihood function given the observed data and the
current estimates of the parameters. This involves estimating
the missing or latent variables using the current parameter
estimates.
The E-step computes the posterior probabilities or
responsibilities of the latent variables based on the observed
data and the current parameter estimates.
2. Maximization (M) Step:
In the M-step, the algorithm updates the parameter estimates
to maximize the expected log-likelihood computed in the E-
step. This step involves treating the estimated latent variables
as observed data and performing a regular optimization to find
the parameter values that maximize the expected log-
likelihood.
The M-step involves solving an optimization problem to find
the parameters that improve the fit of the model to the
observed and estimated latent data.
3. Iteration:
The E and M steps are iteratively performed, with the
parameter estimates being refined in each iteration. The
algorithm aims to improve the likelihood of the data under the
model in each iteration.
The iterations continue until the parameter estimates converge
or a stopping criterion is met.
The EM algorithm is versatile and is applied in a variety of scenarios,
including:
Gaussian Mixture Models (GMMs): EM is used to estimate the
parameters of a mixture of Gaussian distributions, which is often
used for clustering and density estimation.
Hidden Markov Models (HMMs): EM is used to estimate the
parameters of HMMs, which are used in time series analysis,
speech recognition, and other sequential data applications.
Missing Data Imputation: EM can be used to fill in missing
values in datasets by estimating the values based on the available
data and the model.
Latent Variable Models: EM is used to estimate parameters in
models with hidden or latent variables that are not directly
observed.
EM algorithm advantages:
Handles incomplete and unobserved data.
Provides a principled way to estimate parameters in the presence of
latent variables.
Often leads to improved parameter estimates with each iteration.
EM algorithm limitations:
Convergence to local optima is possible, so the choice of initial
parameter values can impact results.
Computationally intensive, especially for complex models and
large datasets.
Overall, the EM algorithm is a fundamental tool for parameter estimation
in scenarios where data is incomplete or involves hidden variables,
enabling the development of sophisticated models that can capture
complex patterns in the data.
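The following is a toy NumPy implementation of EM for a one-dimensional
mixture of two Gaussians, written directly from the E-step/M-step
description above. It is only a sketch; in practice a tested library such
as sklearn.mixture.GaussianMixture should be used.
import numpy as np
rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
# Initial guesses for means, variances, and mixing weights
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])
def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
for _ in range(100):
    # E-step: responsibility of each component for each data point
    weighted = weights[:, None] * gauss(x[None, :], mu[:, None], var[:, None])
    resp = weighted / weighted.sum(axis=0)
    # M-step: update parameters using the responsibilities
    nk = resp.sum(axis=1)
    mu = (resp * x).sum(axis=1) / nk
    var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk
    weights = nk / len(x)
print("means:", mu)
print("variances:", var)
print("weights:", weights)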
Example #1
Imagine you have a puzzle, but some of the pieces are missing. You want to
put the puzzle together, but you're not sure where the missing pieces should
go. The EM algorithm is like a strategy to solve this puzzle.
1. Guessing and Adjusting (E-step):
First, you take a guess at where the missing puzzle pieces might
fit based on the pieces you have. This is like trying to complete
the puzzle as best you can.
After you've made your guesses, you look at how well the pieces
fit together. Some might fit perfectly, while others might not
match well.
You adjust your guesses to make the best fit possible. If some
pieces don't fit well, you change your guess about where they
should go. You're trying to make the puzzle look more complete.
2. Improving (M-step):
Once you've adjusted your guesses, you focus on improving the
overall picture. You try to move the pieces around to make the
entire puzzle look better, even if some pieces aren't in the right
places yet.
As you make these improvements, you notice that the parts you've already completed start to look better too. By
making the whole puzzle look better, you're also making each individual part fit better.
3. Repeat and Converge:
You keep repeating the process of guessing, adjusting, and
improving. Each time, your puzzle becomes more complete and
looks better overall.
You keep doing this until you can't make the puzzle look any
better. The puzzle pieces settle into their places, and you're done.
Your puzzle is as complete as it can be!
In simpler terms, the EM algorithm is like trying to complete a puzzle with
missing pieces. You start by making guesses about where the missing pieces
go, then adjust those guesses to make the puzzle look better. As you make
improvements, the whole puzzle starts to come together, and you keep
repeating this process until the puzzle looks as complete as possible.
Example
Let's break down the Expectation-Maximization (EM) algorithm in a simple
and easy-to-understand way:
1. Imagine a Puzzle:
Think of the EM algorithm like solving a puzzle where some
pieces are missing. You want to put the puzzle together, but you
can't see all the pieces at once.
Each puzzle piece represents a part of the data you're trying to
understand.
2. Two Steps, One Goal:
The EM algorithm has two steps that it repeats over and over:
Expectation (E) and Maximization (M).
In the Expectation step, you make a guess about the missing
puzzle pieces based on the ones you can see. This helps you
figure out where the missing pieces might fit.
In the Maximization step, you adjust your guess to make the
whole puzzle fit better. You adjust the pieces you have to make
the picture look better as a whole.
3. Repeating the Steps:
You keep going back and forth between the E and M steps. In
each round, you get a clearer picture of the puzzle.
The more you go back and forth, the closer you get to a complete
puzzle that looks just right.
4. Finishing the Puzzle:
You repeat the E and M steps until you can't make the puzzle
look any better. At this point, you've found the best way to
arrange the pieces based on the ones you can see.
You've also figured out where the missing pieces should go to
create the most accurate and complete picture.
In simpler terms, the EM algorithm is like solving a puzzle where some pieces
are missing. You keep guessing where the missing pieces should go, adjusting
your guess to make the whole puzzle look better each time. By repeating these
steps, you end up with a complete and accurate picture of the puzzle, even if
you couldn't see all the pieces at once. The EM algorithm helps us understand
complex data by filling in missing parts and finding the best way to explain
what we observe.
Tuning Model Complexity
Tuning model complexity involves finding the right balance between a
model that is too simple (underfitting) and a model that is too complex
(overfitting). It's a crucial step in machine learning to ensure that your
model generalizes well to new, unseen data.
Here's a simple explanation of tuning model complexity:
1. Underfitting (Too Simple):
Imagine you're trying to teach a robot to recognize different
types of animals. If you give the robot a very basic set of
instructions, it might struggle to tell one animal from another.
This is like having an underfitting model.
An underfitting model is too simple to capture the underlying
patterns in the data. It doesn't perform well on both the
training data and new, unseen data.
2. Overfitting (Too Complex):
On the other hand, if you give the robot too many specific
instructions for each animal, it might remember those exact
details but fail to recognize new animals. This is like having
an overfitting model.
An overfitting model is too complex and tries to remember
the noise and randomness in the training data. It performs
extremely well on the training data but poorly on new data.
3. Balancing Act (Tuning Complexity):
Your goal is to find a middle ground between too simple and
too complex. You want the robot to understand general
characteristics of animals without memorizing every detail.
Similarly, you want a model that captures the important
patterns in the data without fitting the noise. This is achieved
by selecting the right complexity, which involves choosing
appropriate features, hyperparameters, and algorithms.
4. Validation and Testing:
To tune model complexity, you use validation and testing
datasets. You train your model on the training data and
evaluate its performance on the validation data.
You adjust the model's complexity (e.g., by adding more
features or adjusting hyperparameters) and keep checking its
performance on the validation data until you find the best
balance.
5. Generalization:
When your model performs well on both the training and
validation data, it's likely to generalize well to new, unseen
data. This is the ultimate goal of tuning model complexity.
In essence, tuning model complexity is like finding the right level of detail
and generality. You want your model to capture the essential patterns in
the data while avoiding both over-simplification and over-complication.
This process helps you build models that are accurate, robust, and capable
of making good predictions on new data.
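As a hedged sketch of this process, scikit-learn's validation_curve can
compare training and cross-validated scores while a single complexity
knob (here, the depth of a decision tree on the iris data) is varied.
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
depths = [1, 2, 3, 5, 8, 12]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name='max_depth', param_range=depths, cv=5)
# Low depths underfit (both scores low); very high depths overfit
# (training score stays high while the validation score stops improving).
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d}: train={tr:.3f}, validation={va:.3f}")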
Bias-Variance Tradeoff
The bias-variance tradeoff is a fundamental concept in machine learning
that helps you understand the balance between two types of errors that a
model can make: bias error and variance error. Finding the right balance
between these errors is crucial for creating a model that generalizes well
to new, unseen data.
Here's a simple explanation of the bias-variance tradeoff:
1. Bias Error (Underfitting):
Imagine you're training a dog to recognize the difference
between cats and dogs. If your dog always thinks every
animal is a cat, it's making a bias error.
In machine learning, a model with high bias error is too
simple and doesn't capture the underlying patterns in the data.
It's underfitting and performs poorly on both the training
data and new data.
2. Variance Error (Overfitting):
On the other hand, if your dog memorizes the appearance of
every single animal it has seen, it might not be able to
recognize new animals. This is a variance error.
In machine learning, a model with high variance error is too
complex and captures noise and randomness in the training
data. It's overfitting and performs extremely well on the
training data but poorly on new data.
3. Balancing Act (Bias-Variance Tradeoff):
Just like you want your dog to recognize both cats and dogs
without overthinking, you want a model that captures the
essential patterns without memorizing every detail.
The bias-variance tradeoff is the balance between having a
model that is too simple (high bias) and a model that is too
complex (high variance).
4. Optimal Point:
The goal is to find the sweet spot where the model has the
right amount of complexity to perform well on both the
training data and new data.
This balance minimizes both the bias and variance errors,
leading to a model that generalizes well.
5. Model Complexity:
Adjusting the complexity of the model impacts the bias and
variance. As you make the model more complex, bias
decreases, but variance increases.
6. Validation and Testing:
To find the right balance, you use validation and testing
datasets. You try different model complexities and see how
they perform on new data.
In short, the bias-variance tradeoff is like training your dog to recognize
animals. You want the dog to understand the differences without being
too rigid or too flexible. Similarly, you want a model that captures patterns
without being too simple or too complex. Striking this balance helps you
build a model that makes accurate predictions on new, unseen data.
Grid Search
Grid search is a technique used for hyperparameter optimization in machine
learning. Hyperparameters are parameters that are set before the learning
process begins and control the behavior of the model. Finding the optimal
values for these hyperparameters is crucial for improving the performance
of the model.
How Grid Search Works
Grid search involves specifying a list of possible values for each
hyperparameter and then training the model for every combination of these values.
The performance of each model is evaluated, and the combination of
hyperparameters that yields the best performance is selected. This method is
exhaustive and can be computationally expensive, especially when dealing
with a large number of hyperparameters or a wide range of values.
Here is an example of how to implement grid search using
the GridSearchCV class from the sklearn.model_selection module:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
# Load dataset
iris = datasets.load_iris()
X = iris['data']
y = iris['target']
# Define the model
model = LogisticRegression(max_iter=10000)
# Define the hyperparameters and their possible values
param_grid = {'C': [0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2]}
# Perform grid search
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X, y)
# Print the best hyperparameters
print(grid_search.best_params_)
Advantages and Disadvantages
One of the main advantages of grid search is that it is straightforward and
guarantees finding the best combination of hyperparameters within the grid,
provided the search space is well-defined. However, it can be very time-consuming
and computationally expensive, especially when the number of hyperparameters
and their possible values is large.
Comparison with Randomized Search
Randomized search is another method for hyperparameter optimization that
can be more efficient than grid search. Instead of trying every possible
combination, randomized search samples a fixed number of hyperparameter
combinations from a specified distribution. This can be more efficient and
faster, especially when the hyperparameter space is large.
Here is an example of how to implement randomized search using
the RandomizedSearchCV class from
the sklearn.model_selection module:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
# Define the hyperparameters and their distributions
# (model, X, and y are reused from the grid search example above)
param_distributions = {'C': uniform(0.25, 1.75)}
# Perform randomized search
random_search = RandomizedSearchCV(model,
param_distributions, n_iter=100, cv=5)
random_search.fit(X, y)
# Print the best hyperparameters
print(random_search.best_params_)
Conclusion
Grid search is a powerful technique for hyperparameter optimization, but it
can be computationally expensive. Randomized search offers a more
efficient alternative by sampling a subset of hyperparameter combinations.
Both methods are essential tools in the machine learning practitioner's toolkit
and can significantly improve model performance when used appropriately.
Random Search
Random search is a technique used in machine learning and artificial
intelligence to optimize hyperparameters by generating and evaluating
random combinations of hyperparameters. This method does not assume
anything about the structure of the objective function, making it effective for
high-dimensional problems and for problems where domain expertise is limited.
How Random Search Works
The random search algorithm starts by initializing random hyperparameter
values from the search space. It then evaluates the cost function at these
points and iteratively updates the hyperparameters based on the cost
function's value. The process continues until a termination requirement, such
as a set number of iterations or an acceptable fitness level, is met.
Here is a step-by-step description of the random search algorithm:
1. Set a random point x in the search space.
2. Repeat until a termination requirement is met: sample a new position y from
the hypersphere around the current position x. If f(y) < f(x), then assign
x = y as the new position.
Advantages of Random Search
Random search has several advantages over other hyperparameter
optimization techniques like grid search:
Efficiency: Random search can yield better results than grid search when
the dimensionality and the number of hyperparameters are high.
Flexibility: It allows limiting the number of hyperparameter combinations,
whereas grid search checks all possible combinations.
Speed: Random search usually requires fewer iterations to find a good
solution compared to grid search.
Python Implementation
In Python, the RandomizedSearchCV function from Scikit-learn can be
used to implement random search. Here is an example:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
# Load dataset
data = load_iris()
X, y = data.data, data.target
# Define the model
model = DecisionTreeClassifier()
# Define the parameter space
param_distributions = {
    'max_depth': [3, None],
    'max_features': ['sqrt', 'log2', None],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3],
    'criterion': ['gini', 'entropy']
}
# Create the random search object
random_search = RandomizedSearchCV(model,
param_distributions, n_iter=10, scoring='accuracy', cv=5,
random_state=42)
# Fit the model
random_search.fit(X, y)
# Print the best parameters
print("Best parameters found: ", random_search.best_params_)
Conclusion
Random search is a powerful technique for hyperparameter optimization in
machine learning and AI. It is particularly useful when dealing with high-
dimensional search spaces and can often find good solutions more efficiently
than exhaustive methods like grid search. It does not guarantee finding the
best hyperparameters, but it provides a practical approach to exploring the
search space.
Evaluation Metrics in Machine Learning
Evaluation metrics in machine learning are used to measure the
performance of a model. Here are some common evaluation metrics:
Classification Metrics
1) Accuracy: Proportion of correct predictions.
2) Precision: Proportion of true positives among all positive
predictions.
3) Recall: Proportion of true positives among all actual positive
instances.
4) F1-score: Harmonic mean of precision and recall.
5) ROC-AUC: Area under the receiver operating characteristic curve.
6) Confusion Matrix: Table showing true positives, false positives,
true negatives, and false negatives.
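As a minimal sketch, these classification metrics can be computed with
scikit-learn's metrics module; the labels and probabilities below are
made up for illustration.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]  # predicted probabilities
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))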
Regression Metrics
1) Mean Squared Error (MSE): Average squared difference
between predicted and actual values.
2) Mean Absolute Error (MAE): Average absolute
difference between predicted and actual values.
3) Root Mean Squared Error (RMSE): Square root of MSE.
4) R-squared: Proportion of variance in the dependent
variable explained by the model.
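Similarly, the regression metrics above can be computed with scikit-learn
on made-up values:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_true = [3.0, 5.0, 7.5, 9.0]
y_pred = [2.8, 5.4, 7.0, 9.5]
mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("MAE :", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mse))
print("R^2 :", r2_score(y_true, y_pred))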
Other Metrics
1. Mean Absolute Percentage Error (MAPE): Average absolute
percentage difference between predicted and actual values.
2. Cohen's Kappa: Measure of agreement between predicted and actual
values.
3. Log Loss: Measure of difference between predicted probabilities and
actual labels.
Choosing Evaluation Metrics
1. Problem type: Choose metrics relevant to the problem type
(classification, regression, etc.).
2. Model goals: Select metrics that align with the model's goals and
objectives.
3. Data characteristics: Consider data characteristics, such as class
imbalance or outliers.
Importance of Evaluation Metrics
1. Model comparison: Compare performance of different models.
2. Hyperparameter tuning: Optimize model hyperparameters.
3. Model evaluation: Assess model performance on test data.
4. Model improvement: Identify areas for improvement.
By using evaluation metrics, you can assess the performance of your
machine learning model and make informed decisions to improve it.
Reporting Predictive Performance
When reporting predictive performance in machine learning, consider the
following:
Metrics
1. Accuracy: Proportion of correct predictions.
2. Precision: Proportion of true positives among all positive predictions.
3. Recall: Proportion of true positives among all actual positive instances.
4. F1-score: Harmonic mean of precision and recall.
5. ROC-AUC: Area under the receiver operating characteristic curve.
6. Mean Squared Error (MSE): Average squared difference between predicted and
actual values.
7. Mean Absolute Error (MAE): Average absolute difference between predicted
and actual values.
Reporting Guidelines
1. Choose relevant metrics: Select metrics that align with the problem and model
goals.
2. Provide context: Include information about the dataset, model, and evaluation
methodology.
3. Use clear and concise language: Avoid technical jargon and ensure results are
easily understandable.
4. Include visualizations: Use plots and charts to illustrate model performance,
such as ROC curves or confusion matrices.
5. Report limitations: Discuss potential biases, limitations, and areas for
improvement.
Best Practices
1. Use cross-validation: Evaluate model performance on unseen data to ensure
generalizability.
2. Compare to baselines: Compare model performance to baseline models or
existing solutions.
3. Monitor performance over time: Track model performance over time to detect
degradation or changes.
4. Consider multiple evaluation metrics: Use multiple metrics to get a
comprehensive understanding of model performance.
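As a hedged sketch of these practices, cross-validated accuracy can be
reported as a mean and standard deviation and compared to a simple
baseline; the dataset and baseline strategy here are chosen only for
illustration.
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
baseline = DummyClassifier(strategy='most_frequent')
model = LogisticRegression(max_iter=10000)
base_scores = cross_val_score(baseline, X, y, cv=5)
model_scores = cross_val_score(model, X, y, cv=5)
# Report mean and spread across folds, not a single number
print(f"Baseline accuracy: {base_scores.mean():.3f} +/- {base_scores.std():.3f}")
print(f"Model accuracy   : {model_scores.mean():.3f} +/- {model_scores.std():.3f}")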
Reporting Format
1. Tables: Use tables to summarize model performance metrics.
2. Figures: Use plots and charts to visualize model performance.
3. Narrative: Provide a clear and concise narrative explaining the results and
implications.
By following these guidelines and best practices, you can effectively report
predictive performance in machine learning and communicate results to
stakeholders.
//The End//
//Best of Luck//