ml_unit1
Machine Learning (ML) is a subset of artificial intelligence (AI) that focuses on the
development of algorithms and statistical models that enable computers to perform
specific tasks without explicit instructions.
Instead of being programmed to perform a task, machine learning systems learn from data,
identify patterns, and make decisions based on that data.
1. Approach to Problem-Solving
Traditional Programming:
A developer writes explicit rules and logic, and the program applies those rules to the input to produce an output.
Machine Learning:
The system is given example data, learns the underlying patterns, and builds a model that produces the output itself.
2. Data Dependency
Traditional Programming:
The program's behaviour does not depend on data; it is determined entirely by the rules the developer writes.
Machine Learning:
Machine learning relies heavily on data. The quality, quantity, and diversity
of the training data significantly affect the model's performance. A well-
trained model can generalize to new data, but it may also fail if the training
data is biased or insufficient.
3. Adaptability
Traditional Programming:
Adapting to new requirements or new patterns in the data requires a developer to modify the code manually.
Machine Learning:
Machine learning models can adapt to new data without needing explicit
reprogramming. If new patterns emerge in the data, the model can be
retrained to accommodate these changes, making it more flexible in dynamic
environments.
4. Complexity of Problems
Traditional Programming:
Best suited for well-defined problems with clear rules and logic. It works
effectively for tasks that can be easily expressed through algorithms.
Machine Learning:
Better suited for complex problems, such as image recognition or natural
language understanding, where the rules are too intricate to specify by hand
and must instead be learned from examples.
5. Nature of Output
Traditional Programming:
The output is deterministic and predictable based on the input and the
defined rules. If the input is the same, the output will always be the same.
Machine Learning:
The output can be probabilistic. For instance, a model might predict that an
email is 80% likely to be spam. The model's predictions can vary based on
the data it has seen and the inherent uncertainty in the learning process.
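To make the contrast concrete, here is a minimal sketch of such a probabilistic prediction (assuming scikit-learn; the email features and labels here are made up for illustration):

```python
from sklearn.linear_model import LogisticRegression

# Toy data: [number of links, number of ALL-CAPS words] per email
X = [[1, 0], [8, 5], [0, 1], [10, 7], [2, 0], [9, 6]]
y = [0, 1, 0, 1, 0, 1]          # 0 = not spam, 1 = spam (made-up labels)

model = LogisticRegression().fit(X, y)

# Probabilistic output: e.g. "this email is ~80% likely to be spam"
print(model.predict_proba([[7, 4]]))
```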
6. Development Time
Traditional Programming:
Development is often faster for well-understood problems, since no data
collection or training is needed, but every change in requirements means
reprogramming the rules.
Machine Learning:
Developing a machine learning model often requires more time for data
collection, preprocessing, and model training. However, once a model is
trained, it can be reused and adapted with new data.
Types of Machine Learning
1. Supervised Learning
Supervised learning is a type of machine learning in which the machine is trained on a
labelled dataset, where each input example is paired with the correct output, and the
model learns to map inputs to outputs.
Supervised learning can be grouped into two categories of problems:
o Classification
o Regression
a) Classification:
Classification algorithms are used to solve classification problems in which the output
variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc.
b) Regression:
Regression algorithms are used to solve regression problems in which the output
variable is continuous and there is a relationship between the input and output variables.
These are used to predict continuous output variables, such as market trends, weather
prediction, etc.
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact
idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior
experience.
Disadvantages:
o It may predict the wrong output if the test data is different from the training data.
Applications of Supervised Learning:
o Image Segmentation
o Medical Diagnosis
o Fraud Detection
o Spam Detection
o Speech Recognition
2. Unsupervised Learning
Unsupervised learning is different from the supervised learning technique; there is no
need for supervision. It means, in unsupervised machine learning, the machine is trained
using the unlabeled dataset, and the machine predicts the output without any
supervision.
In unsupervised learning, the models are trained with the data that is neither
classified nor labelled.
The main aim of the unsupervised learning algorithm is to group the unsorted
dataset according to the similarities, patterns, and differences. Machines are
instructed to find the hidden patterns from the input dataset.
Unsupervised learning can be further classified into two types:
o Clustering
o Association
1) Clustering
It is a way of grouping objects into clusters such that the objects with the most
similarities remain in one group and have few or no similarities with the objects of other
groups.
Some popular clustering algorithms are:
o K-Means Clustering Algorithm
o DBSCAN Algorithm
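As an illustrative sketch of clustering (assuming scikit-learn; the points are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two loose groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [8, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two learned group centers
```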
2) Association
The main aim of this learning algorithm is to find the dependency of one data item on
another data item and map those variables accordingly so that it can generate maximum
profit.
This algorithm is mainly applied in Market Basket analysis, Web usage mining, Customer
Segmentation, etc.
o Apriori Algorithm
o FP Growth Algorithm
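As a hedged sketch of Market Basket analysis with the Apriori algorithm (assuming the third-party mlxtend library; the transactions are made up):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["milk", "bread", "butter"],
                ["bread", "butter"],
                ["milk", "bread"],
                ["milk", "butter"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

itemsets = apriori(df, min_support=0.5, use_colnames=True)   # frequent itemsets
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "confidence"]])
```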
Advantages:
o These algorithms can be used for complicated tasks compared to the supervised
ones because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not
labelled and the algorithms are not trained with the exact output beforehand.
o Working with unsupervised learning is more difficult, as it works with unlabelled
data that does not map to an output.
Applications of Unsupervised Learning:
o Recommendation Systems
o Anomaly Detection
3. Semi-Supervised Learning
It represents the intermediate ground between Supervised (With Labelled training data)
and Unsupervised learning (with no labelled training data) algorithms and uses the
combination of labelled and unlabeled datasets during the training period.
It is different from supervised and unsupervised learning, which are based on the
presence and absence of labels, respectively.
The main aim of semi-supervised learning is to effectively use all the available data,
rather than only the labelled data as in supervised learning.
Initially, similar data is clustered using an unsupervised learning algorithm, and the
clusters then help to label the unlabeled data. This is done because labelled data
is comparatively more expensive to acquire than unlabeled data.
Advantages:
o It is highly efficient.
Disadvantages:
o Iteration results may not be stable.
o Accuracy is low.
4. Reinforcement Learning:
In reinforcement learning, an agent learns by interacting with an environment. The agent
gets rewarded for each good action and punished for each bad action; hence the goal of
the reinforcement learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data like supervised learning, and agents
learn from their experiences only.
Applications of Reinforcement Learning:
o Video Games:
Some popular games that use RL algorithms are AlphaGO and AlphaGO Zero.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed that
how to use RL in computer to automatically learn and schedule resources to wait for
di erent jobs in order to minimize average job slowdown.
o Robotics:
Robots are used in the industrial and manufacturing area, and these robots are
made more powerful with reinforcement learning.
o Text Mining
Text mining, one of the great applications of NLP, is now being implemented with
the help of reinforcement learning by Salesforce.
Advantages
o The learning model of RL is similar to how human beings learn; hence, highly
accurate results can be obtained.
Disadvantage
o Too much reinforcement learning can lead to an overload of states, which can
weaken the results.
o The curse of dimensionality limits reinforcement learning for real physical systems.
Applications:
1. Healthcare
Disease Diagnosis: ML algorithms can analyze medical images (like X-rays, MRIs,
and CT scans) to assist in diagnosing diseases such as cancer, pneumonia, and
other conditions.
2. Finance
Fraud Detection: ML algorithms analyze transaction patterns to identify potentially
fraudulent activities in real-time.
3. E-commerce
Product Recommendations: ML models analyze browsing and purchase history to
suggest products a customer is likely to buy.
4. Transportation
Self-Driving Vehicles and Traffic Prediction: ML models process sensor data to
navigate roads and predict traffic conditions.
6. Security
Face Recognition and Intrusion Detection: ML systems identify people in images
and flag unusual activity on networks.
Linear Regression
Linear regression is a type of supervised machine-learning algorithm that learns from
labelled datasets and maps the data points to an optimized linear function, which can
then be used for prediction on new datasets.
It computes the linear relationship between the dependent variable and one or more
independent features by fitting a linear equation with observed data.
It predicts the continuous output variables based on the independent input variable.
Assumptions are:
o Linearity: The relationship between the independent and dependent variables is linear.
o Independence: The observations are independent of each other.
o Homoscedasticity: The errors have constant variance across all levels of the independent variables.
o Normality: The errors are normally distributed.
Our primary objective while using linear regression is to locate the best-fit line, which
implies that the error between the predicted and actual values should be kept to a
minimum. There will be the least error in the best-fit line.
The slope of the line indicates how much the dependent variable changes for a unit change
in the independent variable(s).
A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable
increases on the X-axis, then such a relationship is termed a Positive linear
relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable
increases on the X-axis, then such a relationship is called a Negative linear
relationship.
The best-fit line can be written as y = a0 + a1x, where a0 is the intercept and a1 is the
slope of the line. We need to calculate the best values for a0 and a1 to find the best-fit
line, and to calculate these we use a cost function.
Cost function:
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable.
o For Linear Regression, we use the Mean Squared Error (MSE) cost function, which
is the average of squared error occurred between the predicted values and actual
values.
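For N observations, with $y_i$ the actual value and $\hat{y}_i = a_0 + a_1 x_i$ the predicted value, this can be written as:
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$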
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the
cost function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o This is done by randomly selecting initial values of the coefficients and then
iteratively updating the values to reach the minimum of the cost function.
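Below is a minimal sketch of this procedure for simple linear regression, assuming NumPy; the learning rate, epoch count, and toy data are illustrative choices, not prescribed values:

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=5000):
    """Fit y = a0 + a1*x by minimizing MSE with batch gradient descent."""
    a0, a1 = 0.0, 0.0                      # initial coefficient values
    n = len(x)
    for _ in range(epochs):
        y_pred = a0 + a1 * x               # current predictions
        error = y_pred - y
        # Gradients of the MSE with respect to a0 and a1
        grad_a0 = (2 / n) * error.sum()
        grad_a1 = (2 / n) * (error * x).sum()
        a0 -= lr * grad_a0                 # step opposite the gradient
        a1 -= lr * grad_a1
    return a0, a1

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)   # generated from y = 1 + 2x
print(gradient_descent(x, y))                  # approaches (1.0, 2.0)
```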
Naive Bayes Classifier:
Naive Bayes classifiers are supervised machine learning algorithms used for classification
tasks, based on Bayes' Theorem to find probabilities.
The Naive Bayes classifier is a simple probabilistic classifier, used to build ML models
that can predict at a faster speed than other classification algorithms.
The "Bayes" part of the name refers to its basis in Bayes' Theorem.
Bayes' Theorem:
o is used to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
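Bayes' Theorem is stated as:
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$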
Where,
P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
P(A) is Prior probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: Probability of the evidence.
Advantages:
o It is a fast and easy algorithm to predict the class of a dataset.
o It performs well in multi-class predictions and is a popular choice for text
classification problems such as spam filtering.
Disadvantages:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
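As an illustrative sketch (assuming scikit-learn's GaussianNB and its bundled Iris dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()          # assumes features are conditionally independent
model.fit(X_train, y_train)
print(model.score(X_test, y_test))        # accuracy on held-out data
print(model.predict_proba(X_test[:1]))    # per-class probabilities
```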
Decision Tree Algorithm:
o Decision Tree is a supervised learning technique that can be used for both
classification and regression problems, but mostly it is preferred for solving
classification problems.
o It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents
the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any
further branches.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it
further splits the tree into subtrees.
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
o Decision Trees usually mimic human thinking ability while making a decision, so
it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a
tree-like structure.
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues this process until it reaches the leaf node of the tree.
The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain possible values for the best
attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where you cannot
classify the nodes further; the final node is then called a leaf node.
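These steps are implemented by libraries such as scikit-learn; a minimal sketch (the dataset and parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects splits by information gain; "gini" is the default
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))
print(export_text(tree))          # the learned root/decision/leaf nodes as rules
```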
While implementing a decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems, there is a
technique called the Attribute Selection Measure (ASM). Two popular ASM techniques are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of the change in entropy after the segmentation
of a dataset based on an attribute.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain, and
a node/attribute having the highest information gain is split first. It can be calculated
using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy is a metric for measuring the impurity of a given attribute; for a binary
classification it is:
$$Entropy(S) = -P(yes)\log_2 P(yes) - P(no)\log_2 P(no)$$
Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
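It can be calculated using the formula below, where $P_j$ is the probability of class $j$:
$$Gini\;Index = 1 - \sum_{j} P_j^{2}$$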
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning.
Advantages of the Decision Tree:
o It is simple to understand.
Disadvantages of the Decision Tree:
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may
increase.
K-Nearest Neighbors (KNN) Algorithm:
Imagine a streaming service wants to predict whether a new user is likely to cancel their
subscription (churn) based on their age. It checks the ages of its existing users and
whether they churned or stayed. If most of the "K" users closest in age to the new user
canceled their subscription, KNN will predict that the new user might churn too. The key
idea is that users with similar ages tend to have similar behaviors, and KNN uses this
closeness to make decisions.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn
from the training set immediately; instead, it stores the dataset, and at the time of
classification it performs an action on the dataset.
For example, a new point would be classified as Category 2 if most of its closest
neighbors were blue squares, since KNN assigns the category based on the majority
of nearby points.
What is ‘K’ in K Nearest Neighbour ?
In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the algorithm
how many nearby points (neighbours) to look at when it makes a decision.
Example:
Imagine you're deciding which fruit a new fruit is based on its shape and size. You compare
it to fruits you already know. If k = 3, the algorithm looks at the 3 fruits most similar to the
new one. If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit
is an apple because most of its neighbours are apples.
Selecting the optimal value of k depends on the characteristics of the input data. If the
dataset has significant outliers or noise, a higher k can help smooth out the predictions
and reduce the influence of noisy data. However, choosing a very high value can lead to
underfitting, where the model becomes too simplistic.
Elbow Method: In the elbow method, we plot the model's error rate for different
values of k. As we increase k, the error usually decreases initially. However, after a
certain point, the error rate starts to change more slowly. The point where the
curve forms an "elbow" is considered the best k.
Odd Values for k: It’s also recommended to choose an odd value for k especially in
classification tasks to avoid ties when deciding the majority class.
KNN uses distance metrics to identify the nearest neighbours; these neighbours are then
used for classification and regression tasks. To identify the nearest neighbours, we use the
below distance metrics:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or
space. You can think of it like the shortest path you would walk if you were to go directly
from one point to another.
$$distance(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}$$
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and
vertical lines (like a grid or city streets). It’s also called “taxicab distance” because a taxi
can only drive along the grid-like streets of a city.
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it
predicts the label or value of a new data point by considering the labels of its nearest
neighbors:
Distance is calculated between the data points in the dataset and the target point.
The k data points with the smallest distances to the target point are the nearest
neighbors.
When you want to classify a data point into a category (like spam or not spam), the
K-NN algorithm looks at the K closest points in the dataset. These closest points
are called neighbors. The algorithm then looks at which category the neighbors
belong to and picks the one that appears the most. This is called majority voting.
In regression, the algorithm still looks for the K closest points. But instead of voting
for a class in classification, it takes the average of the values of those K neighbors.
This average is the predicted value for the new point for the algorithm.
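A minimal classification sketch, assuming scikit-learn and its bundled Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k=5 neighbors, Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)                 # "lazy" learner: just stores the data
print(knn.score(X_test, y_test))          # majority vote of the 5 nearest points
```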
Logistic Regression:
Logistic regression is a supervised learning algorithm used for classification, which
predicts the probability that an input belongs to a class. For example, if we have two
classes, Class 0 and Class 1, and the value of the logistic function for an input is greater
than 0.5 (the threshold value), then it belongs to Class 1; otherwise it belongs to Class 0.
It is referred to as regression because it is an extension of linear regression, but it is
mainly used for classification problems.
Key Points:
The output can be either Yes or No, 0 or 1, True or False, etc., but instead of giving
the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
So far, we’ve covered the basics of logistic regression, but now let’s focus on the most
important function that forms the core of logistic regression.
The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
It maps any real value into another value within the range of 0 and 1. The output of
logistic regression must be between 0 and 1 and cannot go beyond this limit, so it
forms a curve like the "S" form.
The S-form curve is called the Sigmoid function or the logistic function.
In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Values above the threshold tend to 1, and values below
the threshold tend to 0.
The logistic regression model transforms the continuous output of the linear regression
function into a categorical output using the sigmoid function, which maps any real-
valued set of independent variables to a value between 0 and 1. This function is
known as the logistic function.
Sigmoid Function
Now we use the sigmoid function, where the input z is the linear combination of the
inputs (z = w·X + b), and we find the probability between 0 and 1, i.e. the predicted y.
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
As shown above, the sigmoid function converts continuous variable data into a
probability between 0 and 1.
$$P(y=1) = \sigma(z), \qquad P(y=0) = 1 - \sigma(z)$$
$$p(X; b, w) = \frac{e^{w \cdot X + b}}{1 + e^{w \cdot X + b}} = \frac{1}{1 + e^{-(w \cdot X + b)}}$$
Logistic function: The formula used to represent how the independent and
dependent variables relate to one another. The logistic function transforms the input
variables into a probability value between 0 and 1, which represents the likelihood
of the dependent variable being 1 or 0.
Log-odds: The log-odds, also known as the logit function, is the natural logarithm of
the odds. In logistic regression, the log odds of the dependent variable are modeled
as a linear combination of the independent variables and the intercept.
Coefficient: The logistic regression model's estimated parameters, which show how
the independent and dependent variables relate to one another.
Intercept: A constant term in the logistic regression model, which represents the
log odds when all independent variables are equal to zero.
Maximum likelihood estimation: The method used to estimate the coefficients of
the logistic regression model, which maximizes the likelihood of observing the data
given the model.
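A minimal sketch, assuming scikit-learn; the hours-studied data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied -> passed (1) or failed (0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)                       # coefficients found by maximum likelihood

print(model.predict_proba([[2.2]]))   # sigmoid output: P(y=0), P(y=1)
print(model.predict([[2.2]]))         # class chosen via the 0.5 threshold
```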
Support Vector Machine (SVM) Algorithm:
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put new data points
in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine.
Example: Suppose we see a strange cat that also has some features of dogs, and we want
a model that can accurately identify whether it is a cat or a dog. Such a model can be
created by using the SVM algorithm. We first train our model with lots of images of cats
and dogs so that it can learn about the different features of cats and dogs, and then we
test it with this strange creature. The SVM creates a decision boundary between the two
classes (cat and dog) and chooses the extreme cases (support vectors); on the basis of
the support vectors, it will classify the creature as a cat.
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight line, then such
data is termed linearly separable data, and the classifier used is called a Linear SVM
classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separated data, which
means if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
The dimensions of the hyperplane depend on the features present in the dataset, which
means if there are 2 features, then the hyperplane will be a straight line. And if there are
3 features, then the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the
position of the hyperplane are termed Support Vectors. Since these vectors support the
hyperplane, they are called support vectors.
Linear SVM:
Suppose we have a dataset that has two tags (green and blue), and the dataset has two
features x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in
either green or blue.
As it is 2-D space, by just using a straight line we can easily separate these two
classes. But there can be multiple lines that can separate these classes.
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane.
SVM algorithm finds the closest point of the lines from both the classes. These points are
called support vectors. The distance between the vectors and the hyperplane is called
as margin.
And the goal of SVM is to maximize this margin. The hyperplane with maximum margin
is called the optimal hyperplane.
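A minimal linear SVM sketch, assuming scikit-learn; the blob data is synthetic:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable clusters of points
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

clf = SVC(kernel="linear", C=1.0)   # linear SVM: maximum-margin hyperplane
clf.fit(X, y)

print(clf.support_vectors_)         # the extreme points defining the margin
print(clf.predict([[0.0, 0.0]]))    # classify a new point
```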
Random Forest Algorithm:
It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to
solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset."
Instead of relying on one decision tree, the random forest takes the prediction from each
tree, and based on the majority vote of predictions, it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
Below are two assumptions for a better random forest classifier:
o There should be some actual values in the feature variable of the dataset so that
the classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
o It predicts output with high accuracy, and even for a large dataset it runs efficiently.
Random Forest works in two phases: the first is to create the random forest by combining
N decision trees, and the second is to make predictions for each tree created in the first
phase.
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step 1 and Step 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result,
and when a new data point occurs, then based on the majority of results, the Random
Forest classifier predicts the final decision.
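A minimal sketch of this majority-vote behaviour, assuming scikit-learn and its bundled Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = 100 trees, each trained on a random subset of rows and features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))   # final output by majority vote of the trees
print(forest.predict(X_test[:3]))
```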
1. Banking: Banking sector mostly uses this algorithm for the identification of loan
risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease
can be identified.
o It enhances the accuracy of the model and prevents the overfitting issue.
o Although random forest can be used for both classification and regression tasks, it
is less suitable for regression tasks.