
UNIT – 1

What is Machine Learning?

Machine Learning (ML) is a subset of artificial intelligence (AI) that focuses on the
development of algorithms and statistical models that enable computers to perform
specific tasks without explicit instructions.

Instead of being programmed to perform a task, machine learning systems learn from data,
identify patterns, and make decisions based on that data.

Difference between traditional programming and machine learning

1. Approach to Problem-Solving

 Traditional Programming:

 In traditional programming, a developer writes explicit rules and instructions to solve a problem.

 Example: A program to calculate the area of a rectangle would require the developer to specify the formula (length × width) and how to handle inputs.

 Machine Learning:

 In machine learning, the focus is on training a model using data. Instead of writing explicit rules, the model learns patterns and relationships from the data itself. The model can then make predictions or decisions based on new, unseen data.

 Example: A machine learning model for predicting house prices would be trained on historical data (features like size, location, etc.) and would learn the relationships between these features and the prices.
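To make the contrast concrete, here is a minimal sketch in Python; the toy numbers and the use of scikit-learn's LinearRegression are illustrative assumptions, not part of the original notes:

```python
from sklearn.linear_model import LinearRegression

# Traditional programming: the rule is written by the developer.
def rectangle_area(length, width):
    return length * width

# Machine learning: the rule is learned from labelled examples.
# Toy data: [size_sqft, num_rooms] -> price (illustrative values only).
X = [[1000, 2], [1500, 3], [2000, 3], [2500, 4]]
y = [150000, 210000, 260000, 320000]

model = LinearRegression().fit(X, y)    # learn the pattern from data
print(rectangle_area(4, 5))             # 20, from a fixed rule
print(model.predict([[1800, 3]]))       # learned estimate for unseen input
```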

2. Data Dependency

 Traditional Programming:

 The performance of traditional programs is largely dependent on the quality of the algorithms and the logic implemented by the programmer.

 Machine Learning:

 Machine learning relies heavily on data. The quality, quantity, and diversity of the training data significantly affect the model's performance. A well-trained model can generalize to new data, but it may also fail if the training data is biased or insufficient.

3. Flexibility and Adaptability

 Traditional Programming:

 Traditional programs are rigid. If the requirements change or if new scenarios arise, the programmer must modify the code and logic explicitly.

 Machine Learning:

 Machine learning models can adapt to new data without needing explicit reprogramming. If new patterns emerge in the data, the model can be retrained to accommodate these changes, making it more flexible in dynamic environments.

4. Complexity of Problems

 Traditional Programming:

 Best suited for well-defined problems with clear rules and logic. It works effectively for tasks that can be easily expressed through algorithms.

 Machine Learning:

 More effective for complex problems where the relationships between inputs and outputs are not easily defined. This includes tasks like image recognition, natural language processing, and predictive analytics, where patterns may be intricate and not easily captured by traditional programming.

5. Output Interpretation

 Traditional Programming:

 The output is deterministic and predictable based on the input and the
defined rules. If the input is the same, the output will always be the same.

 Machine Learning:

 The output can be probabilistic. For instance, a model might predict that an
email is 80% likely to be spam. The model's predictions can vary based on
the data it has seen and the inherent uncertainty in the learning process.

6. Development Time and Maintenance

 Traditional Programming:

 Development can be straightforward for simple tasks, but as complexity increases, maintaining and updating the code can become cumbersome.

 Machine Learning:

 Developing a machine learning model often requires more time for data collection, preprocessing, and model training. However, once a model is trained, it can be reused and adapted with new data.

Types of Machine Learning

1. Supervised Machine Learning

 It is based on supervision. In the supervised learning technique, we train the machines using a "labelled" dataset, and based on that training, the machine predicts the output.
 Here, labelled data means that some of the inputs are already mapped to the output.
 First, we train the machine with the input and corresponding output, and then we ask the machine to predict the output using the test dataset.
 The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y).
 Some real-world applications are risk assessment, fraud detection, spam filtering, etc.

Categories of Supervised Machine Learning

o Classification

o Regression

a) Classification:

Classification algorithms are used to solve classification problems, in which the output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc.

The classification algorithms predict the categories present in the dataset.

Some real-world examples are spam detection, email filtering, etc.

Some popular classification algorithms are given below; a short example follows the list:

o Random Forest Algorithm

o Decision Tree Algorithm

o Logistic Regression Algorithm

o Support Vector Machine Algorithm
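A minimal, hedged sketch of the supervised classification workflow with scikit-learn (an assumed library); the bundled Iris dataset stands in for any labelled data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                  # features X, labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)            # one of the listed classifiers
clf.fit(X_train, y_train)                          # learn from labelled data
print(clf.score(X_test, y_test))                   # accuracy on unseen test data
```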

b) Regression

Regression algorithms are used to solve regression problems, in which the output variable is continuous and depends on the input variables (often through an approximately linear relationship).

These are used to predict continuous output variables, such as market trends, weather predictions, etc.

Some popular Regression algorithms are given below:


o Linear Regression Algorithm

o Decision Tree Algorithm

Advantages and Disadvantages of Supervised Learning

Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.

o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

o These algorithms are not able to solve highly complex tasks.

o They may predict the wrong output if the test data is different from the training data.

o Training the algorithm requires lots of computational time.

Applications of Supervised Learning

 Image Segmentation

 Medical Diagnosis

 Fraud Detection

 Spam detection

 Speech Recognition

2. Unsupervised Machine Learning

 Unlike the supervised learning technique, there is no need for supervision.
 In unsupervised machine learning, the machine is trained using an unlabeled dataset, and the machine predicts the output without any supervision.
 The models are trained with data that is neither classified nor labelled.
 The main aim of the unsupervised learning algorithm is to group the unsorted dataset according to similarities, patterns, and differences. The machine is instructed to find hidden patterns in the input dataset.

Categories of Unsupervised Machine Learning


o Clustering

o Association

1) Clustering

It is a way to group objects into clusters such that the objects with the most similarities remain in one group and have few or no similarities with the objects of other groups.

An example of a clustering algorithm is grouping customers by their purchasing behavior (see the sketch after the list below).

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm

o DBSCAN Algorithm
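A minimal clustering sketch using scikit-learn's KMeans (an assumed library); the "customer" data and k = 2 are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "customer" data: [annual_spend, visits_per_month]
X = np.array([[500, 2], [520, 3], [480, 2],          # low-spend group
              [5000, 12], [5200, 15], [4900, 11]])   # high-spend group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each customer
print(kmeans.cluster_centers_)   # learned group centres
```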

2) Association

Association rule learning finds interesting relations among variables within a large dataset.

The main aim of this learning algorithm is to find the dependency of one data item on
another data item and map those variables accordingly so that it can generate maximum
profit.

This algorithm is mainly applied in Market Basket analysis, Web usage mining, Customer
Segmentation, etc.

Some of the popular association rule algorithms are given below (a short example follows the list):

o Apriori Algorithm

o FP Growth Algorithm
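A market-basket sketch using the third-party mlxtend package (an assumption; the notes do not name a library). The transactions and thresholds are toy values:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk", "butter"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)      # one-hot item matrix

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "confidence"]])
```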

Advantages and Disadvantages of Unsupervised Learning Algorithm

Advantages:

o These algorithms can be used for more complicated tasks than supervised ones, because they work on unlabeled datasets.

o Unsupervised algorithms are preferable for various tasks, as obtaining an unlabeled dataset is easier than obtaining a labelled dataset.

Disadvantages:

o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output beforehand.

o Working with unsupervised learning is more difficult, as it works with unlabelled datasets that do not map to an output.

Applications of Unsupervised Learning

o Recommendation Systems

o Anomaly Detection

3. Semi-Supervised Learning

Semi-supervised learning is a type of machine learning algorithm that lies between supervised and unsupervised machine learning.

It represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data) and uses a combination of labelled and unlabeled datasets during the training period.

Although semi-supervised learning operates on data that contains a few labels, the data mostly consists of unlabeled examples.

It differs from both supervised and unsupervised learning, which are defined by the complete presence or absence of labels.

The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning algorithms.

The main aim of semi-supervised learning is to effectively use all the available data, rather than only the labelled data, as in supervised learning.

Initially, similar data is clustered with an unsupervised learning algorithm, and this clustering then helps to label the unlabeled data. This is done because labelled data is comparatively more expensive to acquire than unlabeled data.

Advantages:

o The algorithm is simple and easy to understand.

o It is highly efficient.

o It is used to solve the drawbacks of supervised and unsupervised learning algorithms.

Disadvantages:

o Iteration results may not be stable.

o We cannot apply these algorithms to network-level data.

o Accuracy is low.

4. Reinforcement Learning :

Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by hit and trial, taking actions, learning from experiences, and improving its performance.

The agent gets rewarded for each good action and gets punished for each bad action; hence the goal of the reinforcement learning agent is to maximize the rewards.

In reinforcement learning, there is no labelled data as in supervised learning, and agents learn from their experiences only.

Reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.

A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
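A minimal Q-learning sketch on a toy MDP, illustrating the agent/environment/reward loop described above. The chain world, rewards, and hyperparameters are all illustrative assumptions, not from the notes:

```python
import numpy as np

n_states, n_actions = 5, 2        # a tiny chain world: move left/right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration

def step(state, action):
    """Environment: action 1 moves right, 0 moves left; reward at the end."""
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == n_states - 1 else 0.0
    return nxt, reward

rng = np.random.default_rng(0)
for _ in range(500):                      # episodes
    s = 0
    while s != n_states - 1:
        a = rng.integers(n_actions) if rng.random() < epsilon else Q[s].argmax()
        s2, r = step(s, a)
        # reward feedback updates the agent's value estimates
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(Q.argmax(axis=1))   # learned policy: non-terminal states prefer "right"
```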

Categories of Reinforcement Learning

Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the tendency that the required behavior will occur again by adding something. It enhances the strength of the behavior of the agent and positively impacts it.

o Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to positive RL. It increases the tendency that the specific behavior will occur again by avoiding the negative condition.

Real-world Use cases of Reinforcement Learning

o Video Games:
Some popular game-playing systems that use RL algorithms are AlphaGo and AlphaGo Zero.

o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how to use RL to automatically learn to allocate and schedule computer resources among waiting jobs in order to minimize average job slowdown.

o Robotics:
Robots are used in industrial and manufacturing areas, and these robots are made more powerful with reinforcement learning.

o Text Mining:
Text mining, one of the great applications of NLP, is now being implemented with the help of reinforcement learning by the company Salesforce.

Advantages and Disadvantages of Reinforcement Learning

Advantages

o It helps in solving complex real-world problems which are difficult to solve with general techniques.

o The learning model of RL is similar to the learning of human beings; hence it can produce highly accurate results.

o Helps in achieving long term results.

Disadvantage

o RL algorithms are not preferred for simple problems.

o RL algorithms require huge data and computations.

o Too much reinforcement learning can lead to an overload of states which can
weaken the results.

o The curse of dimensionality limits reinforcement learning for real physical systems.

 Applications of Machine Learning:

1. Healthcare

 Disease Diagnosis: ML algorithms can analyze medical images (like X-rays, MRIs, and CT scans) to assist in diagnosing diseases such as cancer, pneumonia, and other conditions.

 Predictive Analytics: ML models can predict patient outcomes, readmission rates, and disease progression based on historical data.

2. Finance

 Fraud Detection: ML algorithms analyze transaction patterns to identify potentially fraudulent activities in real time.

 Credit Scoring: Assessing the creditworthiness of individuals by analyzing their financial history and behavior.

3. E-commerce

 Recommendation Systems: ML algorithms analyze user behavior and preferences to suggest products, enhancing the shopping experience (e.g., Amazon, Netflix).

 Inventory Management: Predicting demand for products to optimize inventory levels and reduce costs.

 Customer Segmentation: Analyzing customer data to identify distinct segments for targeted marketing campaigns.

4. Transportation

 Autonomous Vehicles: Self-driving cars use ML to interpret sensor data, recognize objects, and make driving decisions.

 Route Optimization: ML algorithms optimize delivery routes based on traffic patterns, weather conditions, and other factors.

5. Natural Language Processing (NLP)

 Chatbots and Virtual Assistants: ML powers conversational agents that can understand and respond to user queries (e.g., Siri, Alexa).

 Sentiment Analysis: Analyzing social media posts, reviews, and customer feedback to gauge public sentiment about products or services.

6. Security

 Anomaly Detection: Identifying unusual patterns in network traffic to detect potential security breaches or cyberattacks.

 Facial Recognition: Using ML for identity verification and surveillance applications.

 Linear Regression

Linear regression is a type of supervised machine-learning algorithm that learns from labelled datasets and fits the most optimized linear function to the data points, which can then be used for prediction on new datasets.

It computes the linear relationship between the dependent variable and one or more independent features by fitting a linear equation to the observed data.

It predicts continuous output variables based on the independent input variables.

Assumptions are:

 Linearity: It assumes that there is a linear relationship between the independent and dependent variables. This means that changes in the independent variable lead to proportional changes in the dependent variable.

 Independence: The observations should be independent of each other; that is, the errors from one observation should not influence the others.

What is the best Fit Line?

Our primary objective while using linear regression is to locate the best-fit line, which
implies that the error between the predicted and actual values should be kept to a
minimum. There will be the least error in the best-fit line.

The slope of the line indicates how much the dependent variable changes for a unit change
in the independent variable(s).

Linear Regression Line

A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line, meaning the error between the predicted and actual values should be minimized. The best fit line will have the least error.

For a line of the form y = a0 + a1·x, we need to calculate the best values for the intercept a0 and the slope a1; to calculate these, we use a cost function.

Cost function-

o The cost function optimizes the regression coefficients or weights.

o We can use the cost function to measure the accuracy of the mapping function, which maps the input variable to the output variable.

o For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted and actual values:

MSE = (1/n) ∑ i=1..n (yi − ŷi)²

Gradient Descent:

o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.

o A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.

o This is done by randomly selecting initial values for the coefficients and then iteratively updating the values to reach the minimum of the cost function.
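A from-scratch sketch of gradient descent for simple linear regression, minimizing the MSE for y ≈ a0 + a1·x; the toy data and learning rate are assumptions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])   # roughly y = 2x

a0, a1 = 0.0, 0.0    # initial coefficient values
lr = 0.01            # learning rate

for _ in range(2000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # gradients of MSE with respect to a0 and a1
    grad_a0 = 2 * error.mean()
    grad_a1 = 2 * (error * x).mean()
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(a0, a1)   # should approach intercept ≈ 0 and slope ≈ 2
```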

 Naïve Bayes Classifier :

Naive Bayes classifiers are supervised machine learning algorithms used for classification tasks, based on Bayes' Theorem to find probabilities.

The Naive Bayes Classifier is a simple probabilistic classifier, used to build ML models that can predict at a faster speed than other classification algorithms.

It is a probabilistic classifier that assumes one feature in the model is independent of the existence of any other feature.

The Naïve Bayes Algorithm is used in spam filtration, sentiment analysis, etc.

Why is it Called Naive Bayes?

It is named "Naive" because it assumes that the occurrence of a certain feature is independent of the occurrence of other features.

The "Bayes" part of the name refers to its basis in Bayes' Theorem.

Bayes' Theorem:

o It is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.

o The formula for Bayes' theorem is given as:

P(A|B) = [ P(B|A) × P(A) ] / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.
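A worked sketch of Bayes' theorem in Python, using made-up numbers for a spam-filtering question (all probabilities below are assumptions for illustration):

```python
# What is the probability an email is spam given it contains the word "free"?
p_spam = 0.3             # P(A): prior probability of spam (assumed)
p_free_given_spam = 0.6  # P(B|A): likelihood of "free" in spam (assumed)
p_free = 0.25            # P(B): marginal probability of "free" (assumed)

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)   # 0.72 -> likely spam
```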

Advantages :

o It is fast and easy to use for predicting the class of a dataset.

o It can be used for binary as well as multi-class classification.

o It performs well in multi-class predictions.

Disadvantages :

o Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.

o It is used in medical data classification.

o It is used in Text classification such as Spam filtering and Sentiment analysis.

 Decision Tree Classification Algorithm :

o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
o It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents
the outcome.

o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any
further branches.

o It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.

o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.

o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.

o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?

o Decision Trees usually mimic human thinking ability while making a decision, so
it is easy to understand.

o The logic behind the decision tree can be easily understood because it shows a
tree-like structure.

Decision Tree Terminologies

 Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the unwanted branches from the tree.
 Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given dataset, the algorithm starts from the root node of the tree. The algorithm compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with those of the sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

o Step-3: Divide S into subsets that contain possible values for the best attribute.

o Step-4: Generate the decision tree node, which contains the best attribute.

o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; such final nodes are called leaf nodes (see the sketch below).
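A minimal sketch of these steps using scikit-learn's CART-based DecisionTreeClassifier (an assumed library); the toy data and feature names are illustrative:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income]; label: 1 = bought product, 0 = did not
X = [[25, 30000], [35, 60000], [45, 80000], [20, 20000], [50, 90000]]
y = [0, 1, 1, 0, 1]

clf = DecisionTreeClassifier(criterion="gini").fit(X, y)   # CART uses Gini
print(export_text(clf, feature_names=["age", "income"]))   # learned decision rules
print(clf.predict([[30, 50000]]))                          # classify a new record
```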

Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems, there is a technique called the Attribute Selection Measure (ASM).

There are two popular techniques for ASM, which are:

o Information Gain

o Gini Index

1. Information Gain:

o Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute.

o It calculates how much information a feature provides us about a class.

o According to the value of information gain, we split the node and build the decision tree.

o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula below:

Information Gain = Entropy(S) − [ (Weighted Avg) × Entropy(each feature) ]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

o S= Total number of samples

o P(yes)= probability of yes

o P(no)= probability of no

2. Gini Index:

o The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.

o An attribute with a low Gini index should be preferred.

o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.

o The Gini index can be calculated using the formula below:

Gini Index = 1 − ∑j Pj²
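A short sketch computing entropy, information gain, and Gini impurity for a toy yes/no split, matching the formulas above (the counts are illustrative):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

parent = ["yes"] * 9 + ["no"] * 5                       # 9 yes / 5 no
left = ["yes"] * 6 + ["no"] * 1                         # one child after a split
right = ["yes"] * 3 + ["no"] * 4                        # the other child

# Information gain = parent entropy - weighted average child entropy
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print("entropy(parent):", entropy(parent))
print("information gain:", entropy(parent) - weighted)
print("gini(parent):", gini(parent))
```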

A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning.

Advantages of the Decision Tree

o It is simple to understand.

o It is useful for solving decision-related problems.

o It helps to think about all the possible outcomes for a problem.

o There is less requirement for data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.

o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.

o For more class labels, the computational complexity of the decision tree may
increase.

 K-Nearest Neighbor (KNN) Algorithm :

KNN is a simple way to classify things by looking at what's nearby.

Imagine a streaming service wants to predict whether a new user is likely to cancel their subscription (churn) based on their age. It checks the ages of its existing users and whether they churned or stayed. If most of the "K" users closest in age to the new user canceled their subscription, KNN will predict that the new user might churn too. The key idea is that users with similar ages tend to have similar behaviors, and KNN uses this closeness to make decisions.

K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset, and at the time of classification it performs an action on the dataset.

KNN Algorithm working visualization

[Figure: KNN classification example. A new data point is classified as Category 2 because most of its closest neighbors are blue squares; KNN assigns the category based on the majority of nearby points.]
What is ‘K’ in K Nearest Neighbour ?

In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the algorithm
how many nearby points (neighbours) to look at when it makes a decision.

Example:

Imagine you're deciding which fruit a new fruit is, based on its shape and size, by comparing it to fruits you already know.

 If k = 3, the algorithm looks at the 3 closest fruits to the new one.

 If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple, because most of its neighbours are apples.

How to choose the value of k for the KNN Algorithm?

The value of k is critical in KNN, as it determines the number of neighbors to consider when making predictions.

Selecting the optimal value of k depends on the characteristics of the input data. If the dataset has significant outliers or noise, a higher k can help smooth out the predictions and reduce the influence of noisy data. However, choosing a very high value can lead to underfitting, where the model becomes too simplistic.

Statistical Methods for Selecting k:

 Elbow Method: In the elbow method, we plot the model's error rate for different values of k. As we increase k, the error usually decreases initially; however, after a certain point, the error rate changes more slowly. The point where the curve forms an "elbow" is considered the best k.

 Odd Values for k: It is also recommended to choose an odd value for k, especially in classification tasks, to avoid ties when deciding the majority class.

Distance Metrics Used in KNN Algorithm

KNN uses distance metrics to identify the nearest neighbours; these neighbours are then used for the classification or regression task. To identify the nearest neighbours, we use the distance metrics below:

1. Euclidean Distance

Euclidean distance is defined as the straight-line distance between two points in a plane or
space. You can think of it like the shortest path you would walk if you were to go directly
from one point to another.
distance(x, Xi) = √( ∑ j=1..d (xj − Xij)² )

2. Manhattan Distance

This is the total distance you would travel if you could only move along horizontal and
vertical lines (like a grid or city streets). It’s also called “taxicab distance” because a taxi
can only drive along the grid-like streets of a city.

d(x, y) = ∑ i=1..n |xi − yi|
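A small sketch of the two distance metrics in NumPy (the points are toy values):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

euclidean = np.sqrt(((x - y) ** 2).sum())   # straight-line distance -> 5.0
manhattan = np.abs(x - y).sum()             # grid/taxicab distance  -> 7.0
print(euclidean, manhattan)
```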

Working of KNN algorithm

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the label or value of a new data point by considering the labels of its nearest neighbours.

Step 1: Selecting the optimal value of K

 K represents the number of nearest neighbors that need to be considered while making the prediction.

Step 2: Calculating distance

 Distance is calculated between the data points in the dataset and the target point.

Step 3: Finding Nearest Neighbors

 The k data points with the smallest distances to the target point are the nearest neighbors.

Step 4: Voting for Classification or Taking Average for Regression

 When you want to classify a data point into a category (like spam or not spam), the
K-NN algorithm looks at the K closest points in the dataset. These closest points
are called neighbors. The algorithm then looks at which category the neighbors
belong to and picks the one that appears the most. This is called majority voting.

 In regression, the algorithm still looks for the K closest points. But instead of voting
for a class in classification, it takes the average of the values of those K neighbors.
This average is the predicted value for the new point for the algorithm.
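A from-scratch sketch of the four steps above (choose k, compute distances, find neighbors, majority vote) for the churn example; the ages and k are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 4: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[22], [25], [27], [45], [50], [52]])   # user ages
y_train = np.array(["churn", "churn", "churn", "stay", "stay", "stay"])
print(knn_predict(X_train, y_train, np.array([24])))       # -> "churn"
```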

 What is Logistic Regression?

 Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class or not.
 Logistic regression is a statistical algorithm that analyzes the relationship between two data factors.
 Logistic regression is used for binary classification, where we use the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1.

For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), then it belongs to Class 1; otherwise, it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.

Key Points:

 Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.

 It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.

 In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).

Assumptions of Logistic Regression

We will explore the assumptions of logistic regression, as understanding them is important to ensure appropriate application of the model. The assumptions include:

1. Independent observations: Each observation is independent of the others, meaning there is no correlation between any input variables.

2. Linear relationship between independent variables and log odds: The relationship between the independent variables and the log odds of the dependent variable should be linear.

Understanding Sigmoid Function

So far, we’ve covered the basics of logistic regression, but now let’s focus on the most
important function that forms the core of logistic regression.

 The sigmoid function is a mathematical function used to map the predicted values to probabilities.

 It maps any real value into another value within the range of 0 and 1. The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" shape.

 The S-form curve is called the sigmoid function or the logistic function.

 In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.

How does Logistic Regression work?

The logistic regression model transforms the continuous value output of the linear regression function into a categorical value output using a sigmoid function, which maps any real-valued set of independent variable inputs into a value between 0 and 1. This function is known as the logistic function.

Sigmoid Function

Now we use the sigmoid function, where the input is z, and we find the probability between 0 and 1, i.e., the predicted y:

σ(z) = 1 / (1 + e^(−z))

The sigmoid function converts the continuous variable data into a probability, i.e., a value between 0 and 1:

 σ(z) tends towards 1 as z → ∞

 σ(z) tends towards 0 as z → −∞

 σ(z) is always bounded between 0 and 1

The probability of being in a class can be measured as:

P(y = 1) = σ(z)
P(y = 0) = 1 − σ(z)
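A small NumPy sketch of the sigmoid and the 0.5 threshold rule; the z values are made-up linear scores:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.8, 4.0])   # linear scores w·x + b
probs = sigmoid(z)                           # probabilities in (0, 1)
labels = (probs > 0.5).astype(int)           # Class 1 if p > 0.5, else Class 0
print(probs, labels)
```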

Equation of Logistic Regression:

Substituting z = w·X + b, the final logistic regression equation is:

p(X; b, w) = e^(w·X + b) / (1 + e^(w·X + b)) = 1 / (1 + e^(−(w·X + b)))

Likelihood Function for Logistic Regression

The predicted probabilities will be:

 for y = 1, the predicted probability is: p(X; b, w) = p(x)

 for y = 0, the predicted probability is: 1 − p(X; b, w) = 1 − p(x)

Combining both cases, the likelihood of the observed data is L(b, w) = ∏ i=1..n p(xi)^yi · (1 − p(xi))^(1 − yi), and taking the logarithm gives the log-likelihood.

Gradient of the log-likelihood function

To find the maximum likelihood estimates, we differentiate the log-likelihood with respect to w, which gives ∂l/∂w = ∑ i=1..n (yi − p(xi)) · xi; the estimates are obtained by setting this gradient to zero or ascending along it.

Terminologies involved in Logistic Regression

 Logistic function: The formula used to represent how the independent and
dependent variables relate to one another. The logistic function transforms the input
variables into a probability value between 0 and 1, which represents the likelihood
of the dependent variable being 1 or 0.

 Odds: It is the ratio of something occurring to something not occurring. It is different from probability, as probability is the ratio of something occurring to everything that could possibly occur.

 Log-odds: The log-odds, also known as the logit function, is the natural logarithm of
the odds. In logistic regression, the log odds of the dependent variable are modeled
as a linear combination of the independent variables and the intercept.

 Coefficient: The logistic regression model's estimated parameters, which show how the independent and dependent variables relate to one another.
 Intercept: A constant term in the logistic regression model, which represents the
log odds when all independent variables are equal to zero.

 Maximum likelihood estimation: The method used to estimate the coefficients of the logistic regression model, which maximizes the likelihood of observing the data given the model.

 Support Vector Machine Algorithm :

SVM is a supervised learning algorithm, which is used for classification as well as regression problems.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed the Support Vector Machine.

Example: Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn about the different features of cats and dogs, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of cat and dog. On the basis of the support vectors, it will classify the creature as a cat.

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes using a single straight line, then such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.

o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified using a straight line, then such data is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane.

The dimensions of the hyperplane depend on the features present in the dataset: if there are 2 features, the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum distance between the data points.

Support Vectors:

The data points or vectors that are closest to the hyperplane, and which affect the position of the hyperplane, are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:

Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue.

Since it is a 2-D space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes.
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.

The SVM algorithm finds the closest points of both classes to the boundary. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin.

The goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
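A minimal sketch of a linear SVM using scikit-learn's SVC (an assumed library); the two "tags" below are toy, linearly separable points:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1],        # class "blue"
              [6, 5], [7, 7], [8, 6]])       # class "green"
y = ["blue", "blue", "blue", "green", "green", "green"]

clf = SVC(kernel="linear").fit(X, y)   # fits the maximum-margin hyperplane
print(clf.support_vectors_)            # the extreme points defining the margin
print(clf.predict([[3, 2], [7, 6]]))   # classify new points
```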
 Random Forest Algorithm :

It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to
solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset."

Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.

A greater number of trees in the forest generally leads to higher accuracy and reduces the risk of overfitting.

Assumptions for Random Forest

o There should be some actual values in the feature variable of the dataset so that
the classifier can predict accurate results rather than a guessed result.

o The predictions from each tree must have very low correlations.

Why use Random Forest?

o It takes less training time as compared to other algorithms.

o It predicts output with high accuracy, and it runs efficiently even on large datasets.

o It can also maintain accuracy when a large proportion of data is missing.


How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.

Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result,
and when a new data point occurs, then based on the majority of results, the Random
Forest classifier predicts the final decision.
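A minimal sketch of majority voting over many trees, using scikit-learn's RandomForestClassifier (an assumed library) with the bundled Iris dataset standing in for the fruit-image example:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = 100 trees, each trained on a random subset of rows and features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # majority-vote accuracy on new data
```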

Applications of Random Forest

1. Banking: Banking sector mostly uses this algorithm for the identification of loan
risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease
can be identified.

3. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o It is capable of performing both classification and regression tasks.

o It is capable of handling large datasets with high dimensionality.

o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

o Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
