
THEORY FILE : Machine Learning

(FULL NOTES: BY SAHIL RAUNIYAR / PTU-CODER)

SUBJECT CODE: UGCA-1950

BACHELOR OF COMPUTER APPLICATIONS

MAINTAINED BY (TEACHER): Prof. Sahil Kumar

COLLEGE ROLL NO: 226617

UNIVERSITY ROLL NO: 2200315


DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

BABA BANDA SINGH BAHADUR ENGINEERING COLLEGE, FATEHGARH SAHIB



Program ➖ BCA
Course Name ➖ Machine Learning (Theory)
Semester ➖ 6th

UNIT ➖ 01

●​ # Introduction : What is Machine Learning, Unsupervised Learning, Reinforcement Learning, Machine Learning Use-Cases, Machine Learning Process Flow, Machine Learning Categories, Linear regression and Gradient descent
Introduction to Machine Learning ➖
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn from data, identify patterns, and make decisions without being explicitly programmed. In simple terms, ML allows computers to automatically improve their performance on a task through experience.

The process of machine learning involves training a model using data, which allows the model to make
predictions, detect patterns, and solve problems without human intervention.

What is Machine Learning? ➖


Machine Learning is the field of study that gives computers the ability to learn without being explicitly
programmed. It's a data-driven approach that focuses on finding patterns or insights from data. Machine
learning involves algorithms that can learn from and make predictions based on data.
Machine learning is primarily categorized into three types:

1.​ Supervised Learning
2.​ Unsupervised Learning
3.​ Reinforcement Learning
Unsupervised Learning
In unsupervised learning, the model is given unlabeled data, and the system tries to learn patterns or
structures from the data itself. Unlike supervised learning, where the data includes input-output pairs,
unsupervised learning has no defined labels for the data.

Main goals of unsupervised learning:

●​ Clustering: Group similar data points together (e.g., grouping customers based on purchasing
behavior).
●​ Dimensionality Reduction: Reduce the number of input variables (e.g., Principal Component
Analysis or PCA).

Examples:
●​ K-means clustering: A method to divide data into K groups based on similarity.
●​ Hierarchical clustering: Groups data in a tree-like structure, where similar items are clustered
together at each level.

Reinforcement Learning ➖
Reinforcement learning (RL) is a type of machine learning where an agent learns by interacting with its
environment and receiving rewards or penalties for actions. The goal of RL is to find an optimal
strategy, known as a policy, to maximize the cumulative reward over time.

In RL, the system is not told what to do explicitly but instead learns through trial and error.

Components of RL:

●​ Agent: The learner or decision-maker.


●​ Environment: The external system the agent interacts with.
●​ Actions: The possible moves the agent can make.
●​ Rewards: The feedback the agent gets after performing an action.
●​ State: The current condition of the environment.

Example:

●​ AlphaGo: The AI developed by Google DeepMind to play the game Go, using RL to improve over time.

Machine Learning Use-Cases ➖

Machine learning is used in various industries and applications, such as:

1.​ Healthcare
○​ Predictive modeling for patient diagnoses.
○​ Disease detection using image recognition (e.g., detecting cancer from radiology images).
2.​ Finance
○​ Fraud detection by analyzing transaction patterns.
○​ Algorithmic trading based on market data.


3.​ E-commerce
○​ Personalized product recommendations (e.g., Amazon, Netflix).
○​ Predictive analytics for demand forecasting.
4.​ Autonomous Vehicles
○​ Self-driving cars use ML algorithms to interpret sensor data and make driving decisions.
5.​ Natural Language Processing (NLP)
○​ Speech recognition, machine translation, and sentiment analysis.
Machine Learning Process Flow ➖

The process of machine learning involves several key steps:

1.​ Data Collection: Gather data relevant to the problem you're trying to solve.
2.​ Data Preprocessing: Clean and organize the data. This may involve removing missing or
duplicate values, and encoding categorical variables.
3.​ Model Selection: Choose the appropriate machine learning algorithm (e.g., linear regression,
decision trees, etc.).
4.​ Training the Model: Use training data to fit the model.
5.​ Model Evaluation: Assess the model's performance using a test dataset. Common metrics include
accuracy, precision, recall, and F1-score.

6.​ Tuning and Optimization: Fine-tune the model by adjusting hyperparameters to improve
performance.
7.​ Deployment: Deploy the model in a production environment for real-time prediction.
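As a minimal sketch of steps 1–6, here is one possible end-to-end flow using scikit-learn; the Iris dataset, the decision tree model, and the hyperparameter grid are assumptions chosen purely for illustration:

python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Data Collection: load a ready-made dataset
X, y = load_iris(return_X_y=True)

# 2. Data Preprocessing: here, just split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3./4. Model Selection and Training: fit a decision tree classifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# 5. Model Evaluation on the held-out test set
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6. Tuning and Optimization: search over a small hyperparameter grid
search = GridSearchCV(DecisionTreeClassifier(random_state=42), {"max_depth": [2, 3, 4]}, cv=5)
search.fit(X_train, y_train)
print("Best depth:", search.best_params_)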


Machine Learning Categories

Machine Learning is broadly classified into three categories:

1.​ Supervised Learning:


○​ The model is trained on labeled data (input-output pairs).
○​ Goal: Learn a mapping from input to output.
○​ Example algorithms: Linear Regression, Decision Trees, Support Vector Machines (SVM),
k-Nearest Neighbors (k-NN).
2.​ Unsupervised Learning:
○​ The model is trained on unlabeled data.
○​ Goal: Identify patterns or structures in the data.
○​ Example algorithms: K-means clustering, Hierarchical clustering, Principal Component Analysis (PCA).
3.​ Reinforcement Learning:
○​ The model learns by interacting with the environment and receiving feedback.
○​ Goal: Maximize cumulative reward.
○​ Example algorithms: Q-learning, Deep Q-Networks (DQN).

Linear Regression ➖
Linear Regression is a supervised learning algorithm used for predicting a continuous value based on the
relationship between the dependent variable and one or more independent variables. It assumes a linear
relationship between inputs (features) and the output (target).

Equation for Simple Linear Regression:

$y = \beta_0 + \beta_1 x + \epsilon$

Where:

●​ $y$ is the predicted value (target).
●​ $x$ is the feature or independent variable.
●​ $\beta_0$ is the intercept, and $\beta_1$ is the slope (the coefficient).
●​ $\epsilon$ is the error term.
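As a small illustrative sketch with NumPy (the data here is made up: y ≈ 2 + 3x plus noise), the coefficients can be estimated by an ordinary least-squares fit:

python

import numpy as np

# Hypothetical data following y = 2 + 3x with added noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)

# Degree-1 least-squares fit: np.polyfit returns [slope, intercept]
beta1, beta0 = np.polyfit(x, y, 1)
print(f"Intercept (beta0): {beta0:.2f}, Slope (beta1): {beta1:.2f}")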

Gradient Descent ➖
Gradient Descent is an optimization algorithm used to minimize the loss function (or cost function) in
machine learning models, particularly for regression. The algorithm works by updating the model
parameters (weights) in the direction of the steepest descent of the cost function.

How Gradient Descent Works:

1.​ Initialize the parameters (weights) randomly.


2.​ Compute the gradient (slope) of the cost function with respect to each parameter.
3.​ Update the parameters using the formula: $\theta \leftarrow \theta - \alpha \nabla J(\theta)$, where:
○​ $\theta$ represents the parameters.
○​ $\alpha$ is the learning rate.
○​ $\nabla J(\theta)$ is the gradient of the cost function.
4.​ Repeat this process until the cost function converges to a minimum value.
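A minimal NumPy sketch of these four steps, assuming a mean-squared-error cost for the simple linear regression above (toy data and learning rate chosen for illustration, not a production implementation):

python

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2 + 3 * x + rng.normal(0, 1, 100)

theta0, theta1 = 0.0, 0.0   # 1. Initialize the parameters
alpha = 0.01                # learning rate

for _ in range(2000):       # 4. Repeat until (approximate) convergence
    error = (theta0 + theta1 * x) - y
    # 2. Gradients of the MSE cost J = mean(error^2) / 2
    grad0 = error.mean()
    grad1 = (error * x).mean()
    # 3. Update step: theta = theta - alpha * grad J(theta)
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(f"theta0 ≈ {theta0:.2f}, theta1 ≈ {theta1:.2f}")  # should approach 2 and 3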
Conclusion ➖
Machine learning provides powerful tools for solving complex problems across different industries. By
applying the correct ML model, organizations can leverage data to make informed decisions, automate
processes, and gain valuable insights. Understanding the types of learning (supervised, unsupervised, and
reinforcement), and the foundational algorithms like linear regression and gradient descent, is essential
for building effective machine learning systems.

HAPPY ENDING BY : SAHIL RAUNIYAR & PTU-CODER !! ☺️

UNIT ➖ 02

●​ # Supervised Learning : Classification and its use cases, Decision Tree, Algorithm for Decision Tree Induction

Supervised Learning ➖
Supervised Learning is one of the most common types of machine learning. In this approach, the model
is trained on a labeled dataset, where the input data is paired with the correct output (also called the target variable). The goal of supervised learning is to learn a mapping from inputs to outputs in such a way that, given new data, the model can predict the output or label for unseen examples.

Types of Supervised Learning

1.​ Classification: The task of predicting a categorical label (e.g., spam vs. not spam, disease vs. no
disease).
2.​ Regression: The task of predicting a continuous value (e.g., predicting the price of a house based
on its features).


Classification and Its Use Cases

Classification is a type of supervised learning where the model learns to categorize data into different classes or categories. Given a set of features or attributes, classification models predict a discrete label or class. The labels are predefined and can take one of a limited set of possible values.

Common Classification Algorithms:



1.​ Logistic Regression


2.​ Support Vector Machines (SVM)
3.​ k-Nearest Neighbors (k-NN)
4.​ Naive Bayes
5.​ Decision Trees


6.​ Random Forest

Use Cases of Classification:

●​ Spam Detection: Identifying whether an email is spam or not.


●​ Sentiment Analysis: Classifying customer reviews as positive, neutral, or negative.
●​ Medical Diagnosis: Predicting whether a patient has a certain disease based on diagnostic data
(e.g., classifying medical images into benign or malignant).
●​ Image Recognition: Classifying images into categories (e.g., dog, cat, car).
Decision Tree ➖

A Decision Tree is a supervised learning algorithm used for both classification and regression tasks. It
works by splitting the data into subsets based on feature values, recursively, until it reaches a point where
no further division is needed.

Components of a Decision Tree:

●​ Root Node: The topmost node that represents the entire dataset.
●​ Internal Nodes: Nodes representing feature tests or conditions.
●​ Leaf Nodes: Terminal nodes that represent the predicted label or output value.
●​ Branches: Connections between nodes that represent the outcomes of tests.

How a Decision Tree Works:

1.​ Starting at the root, the algorithm splits the data into two or more homogeneous sets based on the
most significant attribute.

2.​ This process is repeated recursively on each branch until a stopping condition is met (e.g., if all
data points belong to the same class or the maximum depth is reached).
3.​ At each node, the algorithm chooses the attribute that provides the best split based on a criterion
like Gini impurity, Entropy (Information Gain), or Mean Squared Error.


Algorithm for Decision Tree Induction
The Decision Tree Induction Algorithm involves the following key steps:
1.​ Start with the entire dataset as the root node.
2.​ Select the best attribute to split the data: This is done using a criterion that measures how well a
given attribute separates the data. Common criteria include:

○​ Gini Impurity (for classification)


○​ Entropy/Information Gain (for classification)
○​ Mean Squared Error (MSE) (for regression)
3.​ Split the dataset into subsets based on the selected attribute.

4.​ Recursively apply the same procedure to each subset (i.e., select the best attribute and split
again).
5.​ Stop the recursion when one of the following conditions is met:
○​ All instances in a subset belong to the same class.
○​ The maximum tree depth is reached.
○​ There are no more attributes to split on.
6.​ Assign labels to leaf nodes: Once the tree has finished growing, each leaf node is assigned the
most frequent class (for classification) or the average value (for regression).

Steps in Building a Decision Tree:

1.​ Start with the entire dataset: Begin at the root node with all data points.
2.​ Select the best feature: Using a splitting criterion, select the feature that will best separate the
data. For classification, this might be based on information gain (entropy) or Gini index.
3.​ Split the dataset: Divide the data based on the selected feature into subsets.
4.​ Repeat the process: Apply the same steps to each subset until stopping conditions are met (e.g.,
no further useful splits, or a certain depth of the tree is reached).
5.​ Assign labels to leaf nodes: Once the tree stops growing, assign each leaf a class label or value.

Example of a Decision Tree for Classification:

Let’s say we have a dataset that contains information about whether a person buys a product based on
Age and Income. The dataset looks like this:

Age    Income    Buys Product?
22     Low       No
34     High      Yes
45     High      Yes
25     Low       No

The algorithm would work as follows:


1.​ Split by Age or Income: The algorithm tests each feature to see which one best divides the data
based on the Entropy or Gini Impurity.
2.​ The best feature might be Income since it most effectively separates the classes.
3.​ Create the first split: If Income is High, then the decision is Yes (they will buy). If Income is
Low, then the decision is No.
4.​ End the process: We have no further splitting to do.
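A hedged sketch of the same idea with scikit-learn, encoding the hypothetical Age/Income table above (Income encoded as Low = 0, High = 1; this encoding is an assumption for the example):

python

from sklearn.tree import DecisionTreeClassifier, export_text

# The toy dataset above: [Age, Income] with Income encoded as Low=0, High=1
X = [[22, 0], [34, 1], [45, 1], [25, 0]]
y = ["No", "Yes", "Yes", "No"]   # Buys Product?

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Print the learned rules: a single split on Income suffices for this data
print(export_text(tree, feature_names=["Age", "Income"]))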

Advantages of Decision Trees: ➖


1.​ Easy to Understand and Interpret: Decision trees can be visualized and are easy to interpret, even by non-experts.
2.​ Handles both Categorical and Numerical Data: It can handle a mix of numerical and categorical
data well.
3.​ No Need for Feature Scaling: Unlike many other algorithms, decision trees do not require
features to be normalized or scaled.

Disadvantages of Decision Trees: ➖


1.​ Overfitting: Decision trees tend to create very deep trees that may overfit the data. Techniques like
pruning or setting a maximum depth can mitigate this.
2.​ Instability: Small changes in the data can lead to significant changes in the tree structure.
3.​ Bias toward dominant classes: If the data is imbalanced, decision trees can favor the majority
class.

Conclusion ➖
In summary, Decision Trees are one of the most widely used algorithms for classification and regression tasks in machine learning. They offer intuitive and easy-to-understand models, and when used with techniques like pruning, they can provide highly accurate predictions. The algorithm for decision tree induction is based on a recursive process of selecting the best features and splitting the data accordingly, making it a robust tool for both simple and complex classification problems.
●​ # Creating a Perfect Decision Tree, Confusion Matrix, Random Forest. What is Naïve Bayes, How Naïve Bayes works, Implementing Naïve Bayes Classifier, Support Vector Machine, Illustration how Support Vector Machine works, Hyperparameter Optimization, Grid Search Vs Random Search, Implementation of Support Vector Machine for Classification.

Creating a Perfect Decision Tree ➖


A perfect decision tree is a tree that accurately classifies the dataset without overfitting or underfitting.
To create a perfect decision tree, the following factors must be considered:

1.​ Choosing the Best Splitting Criteria: The decision tree's performance depends on how well it
splits the data at each node. The most commonly used splitting criteria are:
○​ Gini Impurity: Measures the impurity of a node; lower values are better.
○​ Entropy (Information Gain): Measures the uncertainty reduction after a split. A higher gain is better.
○​ Chi-Square Test: Often used in decision trees for classification problems.
2.​ Avoiding Overfitting: A tree that is too complex may fit the training data perfectly but fail to
generalize to new data. To avoid overfitting, techniques like pruning (removing branches that
have little predictive power) and limiting the maximum depth of the tree are used.
3.​ Handling Missing Values: Ensure that missing values are handled properly before training the
decision tree. You can either remove instances with missing values or use techniques to impute the
missing values.
4.​ Ensuring Balanced Data: A decision tree can be biased if the dataset is imbalanced. To address
this, use class weights or balanced sampling.

Confusion Matrix ➖

A Confusion Matrix is a tool used to evaluate the performance of a classification model. It shows the
comparison between the predicted labels and the actual labels, providing insights into the types of errors
the model makes.

The matrix is usually represented as follows:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

From this matrix, we can derive several important metrics:


●​ Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
●​ Precision: $\text{Precision} = \frac{TP}{TP + FP}$
●​ Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$
●​ F1 Score: The harmonic mean of precision and recall: $F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
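These metrics can be computed directly with scikit-learn; a small sketch with hypothetical binary labels (1 = positive, 0 = negative), chosen only for illustration:

python

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Hypothetical actual vs. predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))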

Random Forest ➖

A Random Forest is an ensemble learning method that combines multiple decision trees to improve the
performance and generalization ability of a model.

How Random Forest Works:

●​ Bootstrap Aggregating (Bagging): Random Forest creates multiple decision trees by training each tree
on a different subset of the data using sampling with replacement (bootstrap sampling).
●​ Random Feature Selection: At each node, instead of considering all features, a random subset of
features is selected, which leads to more diverse trees.
●​ Voting: For classification, the final prediction is made by aggregating the votes of all individual trees (i.e., majority voting). For regression, the output is the average of all tree predictions.

Advantages of Random Forest:

1.​ Reduces Overfitting: By averaging multiple decision trees, Random Forest reduces the variance.
2.​ Handles Large Datasets: It works well with large datasets that have a large number of features.
3.​ Handles Missing Data: Random Forest can handle missing data efficiently.

4.​ Feature Importance: It can provide insights into the importance of features used in making
decisions.
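A minimal sketch with scikit-learn's RandomForestClassifier, following the same Iris-dataset pattern used by the other code examples in these notes (the number of trees is an assumption):

python

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 trees, each trained on a bootstrap sample with random feature selection
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
print("Feature importances:", rf.feature_importances_)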

Naïve Bayes ➖

Naïve Bayes is a probabilistic classifier based on Bayes' Theorem. It assumes that the features used to
predict the class are conditionally independent given the class, which is why it’s called “naïve.”

How Naïve Bayes Works:

1.​ Bayes' Theorem: Bayes' Theorem helps to compute the probability of a class given the data:​
$P(C|X) = \frac{P(X|C)\,P(C)}{P(X)}$​
Where:
○​ $P(C|X)$ is the posterior probability of class $C$ given features $X$.
○​ $P(X|C)$ is the likelihood, or the probability of features given the class.
○​ $P(C)$ is the prior probability of the class.
○​ $P(X)$ is the probability of the features.
2.​ Conditional Independence Assumption: Naïve Bayes assumes that the features are conditionally independent given the class, which simplifies the calculation of $P(X|C)$ as the product of the individual feature probabilities:​
$P(X|C) = P(x_1|C) \cdot P(x_2|C) \cdots P(x_n|C)$

Types of Naïve Bayes:

●​ Gaussian Naïve Bayes: Assumes that the features follow a Gaussian distribution.
●​ Multinomial Naïve Bayes: Used when features represent counts or frequencies (e.g., word counts in text classification).
●​ Bernoulli Naïve Bayes: Used for binary/boolean features.

Advantages:

1.​ Simple and Efficient: Especially when the dataset has a large number of features.
2.​ Good for Text Classification: It is widely used in spam filtering and sentiment analysis.

Disadvantages:

1.​ Assumption of Independence: The assumption of independence is often unrealistic, and it may affect the performance in certain cases.
Implementing Naïve Bayes Classifier: ➖

python

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Initialize and train the Naïve Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Make predictions
y_pred = nb_classifier.predict(X_test)

# Evaluate the classifier
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Support Vector Machine (SVM) ➖



Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and
regression tasks. The primary objective of SVM is to find a hyperplane that best separates the data into
different classes.

How SVM Works:

1.​ Find the Optimal Hyperplane: The optimal hyperplane is the one that maximizes the margin between
the two classes. The margin is the distance between the closest points (support vectors) to the
hyperplane.
2.​ Support Vectors: These are the data points that lie closest to the hyperplane and are critical for
defining the margin.
3.​ Kernel Trick: For non-linearly separable data, SVM uses a kernel function (e.g., linear,
polynomial, radial basis function) to map data into a higher-dimensional space where it becomes
linearly separable.

Illustration of How SVM Works: ➖


Consider a 2D dataset with two classes, "A" and "B". The SVM algorithm tries to find a line (hyperplane)
that separates the two classes with the largest margin. Points close to this line are the support vectors.

In the case of non-linearly separable data, SVM will transform the data into a higher-dimensional space
using a kernel, making it easier to separate the data.

Hyperparameter Optimization
Hyperparameter optimization refers to the process of selecting the best combination of
hyperparameters for a machine learning model to achieve the best performance. Common hyperparameters include learning rate, regularization strength, kernel type (for SVM), and the number of trees (for Random Forest).

Methods for Hyperparameter Optimization:

1.​ Grid Search: A method to exhaustively search over a specified hyperparameter grid.
2.​ Random Search: Randomly samples hyperparameter combinations and evaluates the model. This method is often more efficient than grid search.

Grid Search Vs Random Search: ➖


1.​ Grid Search:

○​ Exhaustively checks all possible combinations in the specified parameter grid.


○​ Computationally expensive, especially when the search space is large.
○​ Suitable when you have a small search space and you want to test all possible combinations.
2.​ Random Search:
○​ Randomly selects combinations from the hyperparameter space.

○​ Can often yield better results in less time than grid search, especially when the
hyperparameter space is large.
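A hedged sketch comparing the two with scikit-learn's GridSearchCV and RandomizedSearchCV for an SVM; the Iris data and the particular parameter ranges are assumptions for illustration:

python

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid Search: tries every combination (2 kernels x 3 C values = 6 candidates)
grid = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print("Grid search best params:", grid.best_params_)

# Random Search: samples 6 random combinations from a continuous range of C
rand = RandomizedSearchCV(
    SVC(),
    {"kernel": ["linear", "rbf"], "C": loguniform(0.1, 10)},
    n_iter=6, cv=5, random_state=0,
)
rand.fit(X, y)
print("Random search best params:", rand.best_params_)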

Implementation of Support Vector Machine for Classification: ➖


python

from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Initialize and train the SVM classifier
svm_classifier = SVC(kernel='linear')  # You can choose different kernels: linear, rbf, poly
svm_classifier.fit(X_train, y_train)

# Make predictions
y_pred = svm_classifier.predict(X_test)

# Evaluate the classifier
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")


Conclusion

●​ Decision Trees, Random Forest, Naïve Bayes, and Support Vector Machines (SVM) are some
of the most widely used machine learning algorithms, each with its strengths and weaknesses.
●​ Random Forest and SVM offer high performance with more robust models by using ensemble
learning and maximizing margins between classes, respectively.

●​ Hyperparameter Optimization is crucial for fine-tuning models and improving their performance using techniques like Grid Search and Random Search.

HAPPY ENDING BY : SAHIL RAUNIYAR & PTU-CODER !! ☺️

UNIT ➖ 03

●​ # Clustering : What is Clustering & its Use Cases, K-means Clustering, How does K-means algorithm work, C-means Clustering, Hierarchical Clustering, How Hierarchical Clustering works.

Clustering & its Use Cases ➖

Clustering is a type of unsupervised learning where the objective is to group similar data points
together based on certain characteristics or features, with the assumption that data points in the same
group (cluster) are more similar to each other than to those in other groups. It’s widely used in various
fields for exploratory data analysis and pattern recognition.

Use Cases of Clustering:

1.​ Customer Segmentation: Businesses can use clustering to segment customers based on purchasing
behavior, helping in targeted marketing.
2.​ Document Clustering: In natural language processing (NLP), clustering is used to group similar
documents, making information retrieval more efficient.
3.​ Anomaly Detection: Clustering can be used to detect outliers in the data, which are instances that
don't fit well into any cluster.
4.​ Image Segmentation: Clustering can help in segmenting an image into different regions, useful in
medical imaging, object detection, and image compression.
5.​ Genetic Data Analysis: In bioinformatics, clustering is used to find genes with similar expression
patterns.
K-means Clustering ➖

K-means is one of the most widely used clustering algorithms. It is a partitioning method where the
data is divided into K clusters, and the goal is to minimize the variance within each cluster.

How K-means Algorithm Works:

1.​ Initialize: Randomly select K initial centroids (the center points of the clusters).
2.​ Assignment Step: Assign each data point to the nearest centroid based on a distance metric
(usually Euclidean distance).
3.​ Update Step: After assigning the data points, recompute the centroids as the mean of the data
points in each cluster.
4.​ Repeat: Repeat the assignment and update steps until convergence (when centroids no longer change significantly, or a fixed number of iterations is reached).

Steps of K-means Algorithm:

1.​ Choose the number of clusters K.
2.​ Randomly initialize K centroids.
3.​ Assign each data point to the nearest centroid.
4.​ Recompute the centroids of the clusters.
5.​ Repeat steps 3 and 4 until the centroids converge.

Advantages of K-means:
uC
●​ Efficiency: K-means is relatively efficient and can handle large datasets well.
●​ Simplicity: It’s easy to understand and implement.

Disadvantages of K-means:

●​ Requires K to be predefined: The number of clusters K needs to be specified beforehand, and finding
the right K can be tricky.

●​ Sensitive to Initial Centroids: Different initializations can lead to different results.


●​ Assumes spherical clusters: K-means performs poorly when the clusters are non-convex or have
complex shapes.
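A minimal scikit-learn sketch of K-means on synthetic data; make_blobs and the choice of 3 clusters are assumptions used just to generate toy data for illustration:

python

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points scattered around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init restarts with different initial centroids, reducing sensitivity to initialization
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centers:\n", kmeans.cluster_centers_)
print("First 10 labels:", labels[:10])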



C-means Clustering
C-means Clustering, specifically Fuzzy C-means (FCM), is an extension of the K-means algorithm.
Unlike K-means, which assigns each data point to a single cluster, fuzzy clustering allows each data
point to belong to multiple clusters with varying degrees of membership.

How C-means Algorithm Works:

1.​ Initialize: Randomly initialize the centroids of the clusters.


2.​ Membership Assignment: For each data point, calculate the degree of membership to each cluster
using a membership function (usually a fuzzy membership value between 0 and 1).
3.​ Update Centroids: Recalculate the centroids as a weighted average, where the weights are the
degree of membership of each point in each cluster.
4.​ Repeat: Iterate the membership assignment and centroid update steps until convergence (when the
centroids and memberships no longer change).

Advantages of C-means:

●​ Flexibility: Fuzzy C-means allows data points to belong to multiple clusters, making it more
suitable for real-world data where boundaries between clusters are not always clear.
●​ Soft Clustering: Instead of forcing each data point into a single cluster, fuzzy clustering allows the
model to express uncertainty.

Disadvantages of C-means:

●​ Sensitive to Initial Centroids: Similar to K-means, fuzzy C-means is sensitive to the initialization of centroids.
●​ Computational Complexity: Fuzzy C-means can be more computationally expensive than
K-means due to the membership values.


Hierarchical Clustering

Hierarchical Clustering builds a hierarchy of clusters, where clusters are merged (agglomerative) or
divided (divisive) in a tree-like structure called a dendrogram. It does not require the number of clusters
to be specified beforehand.
How Hierarchical Clustering Works:

There are two main types of hierarchical clustering:

1.​ Agglomerative Hierarchical Clustering (Bottom-up approach):


○​ Start with each data point as a separate cluster.
○​ Iteratively merge the closest clusters until all data points belong to a single cluster or until
the desired number of clusters is reached.

2.​ Divisive Hierarchical Clustering (Top-down approach):


○​ Start with all data points in one cluster.
○​ Iteratively split the clusters into smaller ones until each data point is its own cluster or the
desired number of clusters is achieved.

Steps for Agglomerative Hierarchical Clustering:

1.​ Compute Distance: Calculate the pairwise distance between each data point.
2.​ Merge Closest Clusters: Identify the two clusters that are closest and merge them.
3.​ Repeat: Recalculate the pairwise distances between clusters and merge the closest clusters
iteratively.
4.​ Dendrogram: The result is represented as a dendrogram, a tree-like diagram that shows the
merging process.
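A short SciPy sketch of these steps on made-up 2D data; Ward's linkage (one of the criteria listed below) and the two-cluster cut are assumptions for the example:

python

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy 2D data: two well-separated groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])

# Agglomerative clustering: compute pairwise distances and merge with Ward's linkage
Z = linkage(X, method="ward")

# Cut the dendrogram to obtain 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would plot the merge tree (requires matplotlib)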

Linkage Criteria:

●​ Single linkage: The minimum distance between members of two clusters.


●​ Complete linkage: The maximum distance between members of two clusters.
●​ Average linkage: The average distance between all pairs of points in two clusters.
●​ Ward's linkage: Minimizes the variance within clusters by merging the clusters that result in the
smallest increase in the sum of squared errors.

Advantages of Hierarchical Clustering:

●​ No need to pre-specify K: Unlike K-means, there is no need to define the number of clusters before
applying the algorithm.
●​ Dendrogram Representation: The dendrogram provides a clear view of the data structure and can
be useful for understanding data at different levels of granularity.

Disadvantages of Hierarchical Clustering:

●​ Computational Complexity: Hierarchical clustering is computationally expensive, especially for large
datasets.
●​ Scalability: It does not scale well to very large datasets.
●​ Sensitive to Noise: Outliers or noisy data can significantly affect the clustering process.

Comparison of Clustering Methods

Clustering Algorithm       Type                      Requires Predefined     Sensitivity     Computation
                                                     Number of Clusters      to Outliers     Complexity

K-means                    Partitional               Yes                     High            Low

C-means                    Partitional (Fuzzy)       Yes                     High            Moderate
(Fuzzy C-means)

Hierarchical Clustering    Agglomerative/Divisive    No                      Low             High
Conclusion ➖

●​ K-means and Fuzzy C-means (C-means) are partition-based algorithms, with K-means being
suitable for hard clustering and C-means for soft clustering, where data points can belong to
multiple clusters.
●​ Hierarchical Clustering is more flexible since it doesn't require the number of clusters to be
predefined and provides a dendrogram to visualize the clustering process.
●​ The choice of clustering algorithm depends on the nature of the data, the scale of the dataset, and
whether the number of clusters is known beforehand or not.

HAPPY ENDING BY : SAHIL RAUNIYAR & PTU-CODER !! ☺️


UNIT ➖ 04

●​ # Why Reinforcement Learning, Elements of Reinforcement Learning, Exploration vs Exploitation dilemma, Epsilon Greedy Algorithm, Markov Decision Process (MDP)

Reinforcement Learning (RL) ➖


Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by
interacting with an environment to maximize a cumulative reward. In RL, the agent takes actions within the environment, observes the results, and then adjusts its future actions based on past experiences.
Unlike supervised learning where the model is trained with labeled data, in RL the agent learns from trial
and error, gradually improving its ability to make decisions.

In RL, there is typically a goal or task the agent is trying to accomplish, and the environment provides feedback in the form of rewards or penalties based on the agent's actions. The aim is for the agent to learn a strategy or policy that maximizes the long-term reward.

Elements of Reinforcement Learning ➖


Reinforcement Learning involves several key elements:

1.​ Agent: The decision maker, which interacts with the environment by taking actions.
2.​ Environment: The external system or world with which the agent interacts. It provides feedback
in the form of rewards or penalties based on the agent's actions.
3.​ State (S): A representation of the current situation of the agent within the environment. States can
be simple or complex depending on the environment (e.g., position of an object in a game or the temperature in a heating system).


4.​ Action (A): A set of all possible actions the agent can take. The actions are chosen by the agent to
transition between states.
5.​ Reward (R): A scalar feedback signal received after each action taken by the agent. It indicates
the immediate benefit of the agent's action, and the agent aims to maximize cumulative reward over time.
6.​ Policy (π): A strategy or mapping from states to actions. It defines the agent's behavior, i.e., the
way the agent chooses actions based on the current state.
7.​ Value Function (V): A function that estimates the long-term reward for each state, helping the
agent decide which states are more desirable. It tells the agent how good a particular state is in
terms of future rewards.
8.​ Q-Function (Q): The Q-function (or action-value function) estimates the expected return (reward)
for taking a certain action in a particular state and following a policy thereafter.
9.​ Model: Some reinforcement learning systems use a model, which represents the environment's
dynamics (i.e., how the state transitions when an action is taken). This is often used in
Model-based RL, but it is not always necessary in Model-free RL.

Exploration vs Exploitation Dilemma ➖


In Reinforcement Learning, one of the fundamental challenges is the exploration vs exploitation
trade-off. The agent must balance between exploring new actions and exploiting its existing knowledge to
maximize rewards.

●​ Exploration: Refers to trying new actions or exploring new states that the agent hasn't
encountered before. Exploration helps the agent gather more information about the environment
and discover potentially better strategies.
●​ Exploitation: Refers to using the knowledge the agent has gained so far to select actions that have
already yielded high rewards. Exploitation leverages the current understanding to maximize rewards, but it risks missing potentially better actions or strategies.

This dilemma arises because focusing too much on exploration may delay the agent's convergence to the
optimal strategy, while focusing too much on exploitation may prevent the agent from discovering new,
possibly better strategies. Therefore, finding an optimal balance is key.

Epsilon Greedy Algorithm ➖


The Epsilon-Greedy Algorithm is a popular strategy used to balance exploration and exploitation in Reinforcement Learning. The idea is to use a random action with a certain probability (exploration) and use the best-known action with the remaining probability (exploitation).

●​ Epsilon (ε) is a parameter that determines the probability of exploration. Typically, ε is a small
value (like 0.1), which means there’s a 10% chance the agent will explore and a 90% chance it will
exploit.
●​ Greedy Action: The action with the highest known reward (i.e., the best action based on current
knowledge).

The Epsilon-Greedy Algorithm works as follows:

1.​ With probability ε, the agent explores by selecting a random action.


2.​ With probability 1 - ε, the agent exploits the best-known action.

Over time, ε can be reduced (decaying epsilon) to gradually shift the focus from exploration to
exploitation as the agent learns more about the environment.
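A hedged sketch of epsilon-greedy action selection for a toy multi-armed bandit; the three reward probabilities are made up for illustration:

python

import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]      # hypothetical expected reward of each action
Q = np.zeros(3)                   # estimated value of each action
counts = np.zeros(3)
epsilon = 0.1

for step in range(1000):
    if rng.random() < epsilon:            # explore with probability epsilon
        action = rng.integers(3)
    else:                                 # exploit the best-known action
        action = int(np.argmax(Q))
    reward = float(rng.random() < true_means[action])   # Bernoulli reward
    counts[action] += 1
    # Incremental mean update of the action-value estimate
    Q[action] += (reward - Q[action]) / counts[action]

print("Estimated Q-values:", Q.round(2))   # should approach the true means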

Markov Decision Process (MDP) ➖


A Markov Decision Process (MDP) is a mathematical framework used to describe environments in
Reinforcement Learning where the outcome of each action depends only on the current state and action
(the Markov property).

An MDP is defined by the following components:

1.​ States (S): A finite set of possible states that represent the environment's configurations.
2.​ Actions (A): A finite set of actions that the agent can take to transition between states.
3.​ Transition Function (T): The probability distribution that defines the likelihood of transitioning
from one state to another when a certain action is taken. Mathematically, this is denoted as P(s'|s,
a), the probability of reaching state s' when taking action a in state s.
4.​ Reward Function (R): A function that gives the immediate reward received after taking an action
in a given state. R(s, a, s') indicates the reward received when transitioning from state s to state s'
after action a.
5.​ Discount Factor (γ): A factor that discounts future rewards. It helps prioritize immediate rewards
over long-term rewards. γ is a value between 0 and 1, where a higher value places more
importance on future rewards.
6.​ Policy (π): A strategy that defines the agent's action selection. It can be deterministic or stochastic
and guides the agent to make decisions at each state.

In summary, an MDP is a framework that formalizes the decision-making process where an agent
interacts with an environment, making decisions to maximize long-term rewards, considering current
states, actions, transitions, rewards, and the discount factor.

Summary of Key Concepts in RL


Concept                          Description

Exploration vs Exploitation      Balancing trying new actions (exploration) and using the best-known actions (exploitation).

Epsilon-Greedy                   An algorithm that chooses random actions with probability ε (exploration) and the best-known actions with probability 1-ε (exploitation).

Markov Decision Process (MDP)    A formal framework for modeling decision-making, where the outcome of each action depends on the current state, including states, actions, rewards, transitions, and policy.

Reinforcement Learning relies on these key elements to enable agents to learn optimal behaviors over
time by balancing exploration and exploitation in complex, uncertain environments.
●​ # Q values and V values, Q-Learning, α values ➖

Q-Values and V-Values in Reinforcement Learning ➖


In Reinforcement Learning (RL), Q-values (action-value function) and V-values (value function) are
two key concepts used to estimate the expected return (reward) of states or state-action pairs. These
values help the agent to make decisions about which actions to take in a given state to maximize
long-term rewards.

V-Values (Value Function) :-

●​ The V-value of a state represents the expected long-term reward starting from that state and following a certain policy. It gives the agent an indication of how good it is to be in a particular state based on the expected rewards.
●​ V(s): The value of state s is the expected cumulative reward the agent will receive starting from
state s and following a specific policy π. This can be mathematically expressed as:​
$V(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s \right]$​
Where:
○​ V(s): The value of state s.
○​ γ: The discount factor, which represents the importance of future rewards.
○​ R_t: The reward received at time step t.
○​ S_0 = s: The agent starts in state s.
●​ Goal: The goal of the agent is to learn the value function for each state, which can guide its
decision-making process. The higher the value of a state, the more desirable it is to be in that state.

Q-Values (Action-Value Function)

●​ The Q-value is a more detailed version of the value function, as it evaluates the expected
long-term reward starting from a given state s, taking a specific action a, and then following a

policy. In essence, Q-values represent how good it is to take a particular action in a particular state.
●​ Q(s, a): The Q-value of a state-action pair is the expected cumulative reward after taking action a
in state s and continuing to follow the policy π. It is mathematically represented as:​
$Q(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, A_0 = a \right]$​
Where:
○​ Q(s, a): The Q-value for state s and action a.
○​ γ: The discount factor (like in the V-value).
○​ R_t: The reward received at time step t.
○​ S_0 = s, A_0 = a: The agent starts in state s and takes action a.
●​ Goal: The goal of Q-learning is to estimate the Q-values for all state-action pairs, and these values
guide the agent in selecting the best action in any given state.
Q-Learning: A Model-Free Reinforcement Learning Algorithm

Q-learning is a popular model-free RL algorithm where the agent learns the optimal policy by iteratively
updating the Q-values for state-action pairs based on feedback from the environment. It does not require
knowledge of the environment’s dynamics (i.e., transition probabilities or reward function), and instead,
the agent learns purely from its experiences.

The Q-learning update rule is:

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$

Where:

●​ Q(s, a): The Q-value of the state-action pair (s, a).
●​ α (alpha): The learning rate, which determines to what extent new information overrides the old
information. If α is 0, no learning happens, and if α is 1, the agent completely ignores previous
values.

●​ R(s, a): The immediate reward the agent gets after taking action a in state s.
●​ γ (gamma): The discount factor, which determines the importance of future rewards. A value of 0
makes the agent short-sighted (only cares about immediate rewards), while a value closer to 1
makes it focus more on future rewards.
●​ max_{a'} Q(s', a'): The maximum Q-value of the next state s' after taking action a.

Explanation of the Q-learning Update Rule:

●​ The agent starts with an initial Q-table, where each state-action pair has an initial Q-value (often
initialized to zero).
●​ Upon taking action a in state s, the agent receives a reward R(s, a) and transitions to a new state s'.
●​ The agent then updates the Q-value for the state-action pair (s, a) using the above update rule. The
update considers the immediate reward and the maximum Q-value from the new state s' (which
represents the best possible action from the new state).
●​ This iterative process continues, and over time, the Q-values converge to the optimal Q-values,
which represent the best possible policy.
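A minimal NumPy sketch of the tabular Q-learning update on a hypothetical chain environment (5 states in a row; reaching the last state by moving right earns a reward of 1 and ends the episode). The environment and hyperparameters are assumptions for illustration:

python

import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = np.zeros((n_states, n_actions))   # Q-table initialized to zero
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:          # episode ends at the last state
        # Epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # The Q-learning update rule from above
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q.round(2))   # the "right" action should dominate in every state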

α (Alpha) Values in Reinforcement Learning ➖


The learning rate (α) controls how quickly the agent updates its knowledge. Specifically, it determines
the weight given to the most recent information. A high value of α means the agent will quickly adjust its
values based on new experiences, while a low value means the agent will place more emphasis on previously learned values.

●​ High α (e.g., 0.9): This means that the agent will rely heavily on the latest reward information. The
learning process is faster but can be noisy and unstable.
●​ Low α (e.g., 0.1): This means that the agent will slowly adapt its Q-values, giving more weight to the past experiences. This makes the learning process more stable but slower.

In practical applications, α can be adjusted dynamically during training to balance between fast learning
and stable convergence. For example, it can be decreased over time to encourage the agent to settle on a
final policy after a certain number of episodes.


Summary:

●​ V-values (Value Function) estimate the expected return for a given state under a policy.
●​ Q-values (Action-Value Function) estimate the expected return for a given state-action pair under
a policy.
●​ Q-learning is an algorithm that updates Q-values iteratively to learn the optimal policy without
knowing the environment’s dynamics.

●​ α (alpha) controls the learning rate, which determines how much new experiences override
previous knowledge. A higher value leads to faster learning, but potentially less stable behavior.

In conclusion, Q-values and V-values are fundamental concepts in reinforcement learning. Q-values are
typically used in algorithms like Q-learning to learn optimal policies, while V-values provide a more
general representation of how valuable a state is under a given policy. Together, they help guide the
decision-making process of RL agents toward achieving their objectives efficiently.

HAPPY ENDING BY : SAHIL RAUNIYAR & PTU-CODER !! ☺️
