Applied Machine Learning I

The document provides an overview of various supervised learning algorithms for classification tasks, including K-Nearest Neighbor (K-NN), Decision Trees, Random Forest, Support Vector Machines (SVM), and Logistic Regression. Each algorithm is explained in terms of its working principles, advantages, disadvantages, and specific terminologies. The document also includes examples and mathematical formalism relevant to the discussed algorithms.


Lecture Week 8:

Applied Machine Learning


Supervised Learning Algorithms for Classification Task
Mr. Nwachukwu Victor C.
K-Nearest Neighbor Algorithm



K-Nearest Neighbor Algorithm
The K-Nearest Neighbors (K-NN)
algorithm is a popular technique in
machine learning used for both
classification and regression tasks.

It works on the principle that data points that are close together are likely to have similar characteristics.



KNN Classification
Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an n-dimensional feature vector and $y_i$ is the class label:
1. Compute Distance: For a test point $x$, compute the distance $d(x, x_i)$ between $x$ and all points in the training set.
2. Select Neighbors: Select the $K$ closest points.
3. Majority Vote: Determine the class label by majority vote among the $K$ neighbours.

Distance Metrics
1. Euclidean Distance: $d(x, x') = \sqrt{\sum_{j=1}^{n} (x_j - x'_j)^2}$
2. Manhattan Distance: $d(x, x') = \sum_{j=1}^{n} |x_j - x'_j|$
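To make the three steps concrete, here is a minimal from-scratch sketch in Python; the toy training points and the `knn_predict` helper are illustrative, not from the lecture:

```python
# Minimal K-NN classifier sketch (illustrative; toy data).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify a single test point x by majority vote of its k nearest neighbours."""
    # 1. Compute Distance: Euclidean distance from x to every training point.
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # 2. Select Neighbors: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # 3. Majority Vote: most common label among the k neighbours.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [6.0, 5.0], [7.0, 7.0]])
y_train = np.array(["A", "A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))  # -> "A"
```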



KNN Classification
Choosing K: The value of K significantly impacts the performance of the algorithm. A low K value is more sensitive to noise in the data and can overfit; a high K value reduces the impact of noise but can oversmooth the decision boundary and underfit.

Non-parametric: K-NN does not make any assumptions about the underlying data distribution, making it suitable for various data types.

Lazy Learner: K-NN doesn't learn a model from the data during training. Instead, it waits until a new data point arrives and then calculates the distances to determine the nearest neighbours.

Distance Metric: The choice of distance metric (e.g., Euclidean, Manhattan) affects how the distance between points is calculated. We will be using the Euclidean distance metric.
KNN Classification
Advantages of K-NN
• Simplicity: Easy to understand and implement.
• No Training Phase: Useful for applications where the dataset is frequently updated.
• Adaptability: Can be used for both classification and regression tasks.

Disadvantages of K-NN
• Computational Cost: Can be expensive in terms of memory and computation, especially with large datasets, as it needs to store all training data and compute distances for each query.
• Curse of Dimensionality: Performance can degrade with high-dimensional data because the distance metrics become less informative.
• Sensitive to Noise: Can be influenced by irrelevant features or noisy data points.
KNN Question I
Suppose you have a dataset containing information about houses, with each house represented by two features: square footage (in square feet) and number of bedrooms. You are given the following dataset:

[Table of houses omitted in the source.]

Using the KNN algorithm with K = 3 and the Euclidean distance measure, find the 3 nearest neighbours of a new house with the following features: square footage = 1600 sq ft and bedrooms = 2.

Solution: [distance calculations omitted in the source.] Thus the 3 nearest neighbours to the new house are House 1, House 4, and House 3.
KNN Question II
A KNN classifier assigns a test instance to the majority class associated with its K nearest training instances. Distance between instances is measured using Euclidean distance. Suppose we have the following training set of positive (+) and negative (-) instances and a single test instance (o). All instances are projected onto a vector space of two real-valued features (X and Y). Answer the following questions. Assume "unweighted" KNN (every nearest neighbour contributes equally to the final vote).

[Scatter plot of the training instances and test instance omitted in the source.]

(a) What would be the class assigned to this test instance for K = 1? Give your reason.
(b) What would be the class assigned to this test instance for K = 3? Give your reason.
(c) What would be the class assigned to this test instance for K = 5? Give your reason.
(d) Setting K to a large value seems like a good idea. We get more votes! Given this particular training set, would you recommend setting K = 11? Why or why not?
Decision Tree Algorithm



Decision Tree (DT) Algorithm
The Decision Tree algorithm is a popular
supervised learning algorithm used for
both classification and regression tasks.

It splits the data into subsets based on the most significant attribute in the dataset, recursively doing so to form a tree-like model of decisions.

Decision Tree Terminologies
• Root Node: The topmost node of the tree. It represents the first feature or decision from which the tree branches.

• Internal Nodes (Decision Nodes): Nodes where the data is tested against the values of particular attributes. Each internal node has branches leading to further nodes.

• Leaf Nodes (Terminal Nodes): The end points of the branches, where the final decision or prediction is made. Leaf nodes have no further branches.

• Branches (Edges): Links between nodes that represent the outcome of a decision under particular conditions.



Decision Tree Terminologies
• Splitting: The process of dividing a node into two or more sub-nodes based on a
decision criterion. It involves selecting a feature and a threshold to create subsets
of data.

• Parent Node: A node that is split into child nodes. The original node from which a
split originates.

• Child Node: Nodes created as a result of a split from a parent node.


• Decision Criterion: The rule or condition used to determine how the data should
be split at a decision node. It involves comparing feature values against a
threshold.

• Pruning: The process of removing branches or nodes from a decision tree to improve its generalisation and prevent overfitting.
How Decision Trees Work (Classification Task)
1. Starting Point: Begin with the entire dataset as the root.
2. Splitting: At each node, choose the best feature to split the data. The best split is based on a criterion such as Gini impurity or Information Gain.
3. Stopping Criteria: Recursively split the subsets until one of the stopping criteria is met, such as maximum depth of the tree, minimum number of samples in a node, or no further gain from splitting.
4. Prediction: For classification, the label of a new instance is determined by traversing the tree from the root to a leaf, following the splits.

Splitting Criteria
1. Gini impurity index:
$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$
where $p_i$ is the probability of a randomly chosen element being correctly classified (the proportion of class $i$ in the node) and $C$ is the number of classes.
2. Entropy:
$$H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i$$
Information gain for an attribute $A$:
$$IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v)$$
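The two impurity measures in a short Python sketch (the function names are mine; this is a sketch of the formulas, not a full tree builder):

```python
# Impurity measures used for decision-tree splits (illustrative sketch).
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2 over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_i p_i * log2(p_i) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini(["Y", "Y", "N", "N"]))     # 0.5 (maximally impure for two classes)
print(entropy(["Y", "Y", "N", "N"]))  # 1.0
```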
Building DTs using Gini Index
Given the dataset:

Employee | Study Hrs. | Pass Exam
---------|------------|----------
Fresher  | > 2hrs     | Y
Fresher  | > 2hrs     | Y
Senior   | > 2hrs     | Y
Junior   | < 2hrs     | Y
Fresher  | > 2hrs     | N
Senior   | > 2hrs     | N
Fresher  | < 2hrs     | N

Build a decision tree using the Gini index split criterion.

Step I: Weighted Gini Index of Employee
Gini index (Fresher) = $1 - [(2/4)^2 + (2/4)^2] = 0.5$
Gini index (Senior) = $1 - [(1/2)^2 + (1/2)^2] = 0.5$
Gini index (Junior) = $1 - [(1/1)^2 + (0/1)^2] = 0$
Weighted Gini for Employee = $(4/7)(0.5) + (2/7)(0.5) + (1/7)(0) \approx 0.429$


Building DTs using Gini Index
Step II: Weighted Gini Index of Study Hrs.
Gini index (>2hrs) = $1 - [(3/5)^2 + (2/5)^2] = 0.48$
Gini index (<2hrs) = $1 - [(1/2)^2 + (1/2)^2] = 0.5$
Weighted Gini for Study Hrs. = $(5/7)(0.48) + (2/7)(0.5) \approx 0.486$

"Employee" has the lower weighted Gini index (0.429 vs. 0.486), and thus the higher information gain, so we build the tree with "Employee" as the root node.

[Diagram omitted in the source: a tree with "Employee" at the root branching to Fresher, Junior, and Senior, each branch split further on Study Hrs. (>2 / <2) and ending in Y/N leaves.]
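Both steps can be checked with a few lines of Python (my own recomputation of the slide's arithmetic; the helper names are illustrative):

```python
# Recompute the weighted Gini indices for the Employee dataset (sketch).
from collections import Counter

data = [  # (Employee, Study Hrs., Pass Exam) rows from the slide
    ("Fresher", ">2", "Y"), ("Fresher", ">2", "Y"), ("Senior", ">2", "Y"),
    ("Junior", "<2", "Y"), ("Fresher", ">2", "N"), ("Senior", ">2", "N"),
    ("Fresher", "<2", "N"),
]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(column):
    """Weighted Gini of 'Pass Exam' after splitting on the given column index."""
    total = len(data)
    score = 0.0
    for v in {row[column] for row in data}:
        subset = [row[2] for row in data if row[column] == v]
        score += len(subset) / total * gini(subset)
    return score

print(f"Employee:   {weighted_gini(0):.3f}")  # ~0.429 -> chosen as the root
print(f"Study Hrs.: {weighted_gini(1):.3f}")  # ~0.486
```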
Decision Tree (DT) Algorithm Pros and Cons
Pros:
1. Simple to understand and interpret.
2. Can handle both numerical and categorical data.
3. Requires little data preprocessing (no need for normalization or scaling).
4. Capable of capturing non-linear relationships.
5. Can handle multi-output problems.

Cons:
1. Prone to overfitting, especially with deep trees.
2. Can be unstable; small variations in the data can result in a completely different tree.
3. Greedy nature does not guarantee the globally optimal solution.
4. Decision trees can be biased if some classes dominate.


Random Forest Algorithm



Random Forest Algorithm
The Random Forest algorithm is an
ensemble learning method that
combines the predictions of multiple
decision trees to improve classification
and regression accuracy.

By aggregating the results of several trees, Random Forest mitigates the overfitting problem inherent in individual decision trees and provides more robust predictions.



How Random Forest Works (Classification Task)
1. Bootstrap Sampling: The algorithm generates multiple subsets of the training data through a process called bootstrap sampling. Each subset is created by randomly selecting samples from the original dataset with replacement, meaning some samples may appear multiple times in a subset.

2. Decision Tree Construction: For each subset, a decision tree is constructed. Unlike standard decision trees, Random Forest introduces additional randomness: at each split in the tree, a random subset of features is selected, and the best split is chosen only from this subset. This process is known as "feature bagging."

3. Aggregation of Predictions: For classification tasks, each decision tree in the forest casts a vote for the predicted class. The final prediction is the class that receives the majority of votes.
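These three steps are what scikit-learn's RandomForestClassifier implements; a minimal usage sketch on synthetic data (assuming scikit-learn is installed; the hyperparameter values are illustrative):

```python
# Random Forest via scikit-learn (usage sketch; synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,     # number of trees (one per bootstrap sample)
    max_features="sqrt",  # feature bagging: features considered at each split
    bootstrap=True,       # sample the training set with replacement
    random_state=0,
)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # majority-vote accuracy on held-out data
```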
Mathematical Formalism
1. Bagging (Bootstrap Sampling)
Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ of size $N$, generate $B$ bootstrap samples $D_1, D_2, \dots, D_B$, where each bootstrap sample $D_b$ is created by randomly selecting $N$ instances from $D$ with replacement. Here the $x_i$ are the data instances and the $y_i$ their corresponding labels.

2. Growing Decision Trees with Feature Bagging
For each bootstrap sample $D_b$, grow an unpruned tree $T_b$. At each node of the tree, randomly select $m$ features out of the total $p$ features. From these $m$ features, select the feature that provides the best split according to a chosen impurity measure (Gini impurity, entropy).


Mathematical Formalism
3. Aggregating Predictions
Once all $B$ decision trees are trained, each tree $T_b$ predicts a class label $\hat{y}_b = T_b(x)$ for an instance $x$. The final prediction is determined by majority vote:
$$\hat{y} = \operatorname{mode}\{\hat{y}_1, \hat{y}_2, \dots, \hat{y}_B\}$$

Pros and Cons
Pros:
• High accuracy
• Robustness
• Feature importance
• Versatility
• Handles missing values

Cons:
• Complex
• Computationally intensive
• High memory usage
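The formalism above, condensed into a from-scratch sketch that reuses scikit-learn's tree learner (all names and parameter values are mine; a sketch, not a production implementation):

```python
# From-scratch bagging + majority vote around scikit-learn trees (sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
rng = np.random.default_rng(1)
B = 25  # number of bootstrap samples / trees

trees = []
for _ in range(B):
    # 1. Bootstrap sample D_b: N draws from D with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # 2. Unpruned tree T_b with feature bagging (m = sqrt(p) features per split).
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# 3. Aggregate: each tree votes; the majority class wins.
votes = np.array([t.predict(X[:5]) for t in trees])  # shape (B, 5)
y_hat = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print(y_hat)
```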
Support Vector Machine (SVM) Algorithm



Support Vector Machine (SVM) Algorithm
Support Vector Machine (SVM) is a
machine learning algorithm used for
linear or nonlinear classification.
Key Concepts
1. Hyperplane: In SVM, the goal is to find
a hyperplane that best separates the
classes in the feature space.

2. Support Vectors: These are the data points that are closest to the hyperplane and influence its position and orientation. The SVM algorithm uses these support vectors to find the optimal hyperplane.
Support Vector Machine (SVM) Algorithm
3. Margin: The margin is the distance between the hyperplane and the nearest data points (the support vectors). The main objective of the support vector machine algorithm is to maximize the margin; a wider margin generally indicates better classification performance.

4. Kernel: A kernel is the mathematical function used in SVM to map the original input data points into high-dimensional feature spaces. Some of the common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid.
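A minimal scikit-learn sketch showing the kernel choice in practice (synthetic data; the hyperparameter values are illustrative):

```python
# SVM classification with an RBF kernel via scikit-learn (usage sketch).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Feature scaling matters for SVMs because they are margin/distance based.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
# The support vectors that determine the optimal hyperplane:
print(clf.named_steps["svc"].support_vectors_.shape)
```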
Support Vector Machine (SVM) Algorithm
Advantages
1. Effective in high-dimensional spaces: SVM is particularly effective in cases where the number of dimensions exceeds the number of samples.
2. Memory efficient: SVM uses a subset of training points (support vectors) in the decision function.
3. Versatile: Different kernel functions can be specified for the decision function, making it adaptable to various data types and distributions.

Disadvantages
1. Not suitable for large datasets: SVMs are not suitable for very large datasets due to their high computational complexity.
2. Difficult to choose the right kernel: The choice of kernel and its parameters can significantly affect the performance of SVM.
3. Sensitive to noisy data: SVM does not perform well when the data has a lot of noise.


Logistic Regression



Logistic Regression
Logistic regression is used for binary classification using a sigmoid function (logistic function), which takes the independent variables as input and produces a probability value between 0 and 1.

The logistic function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$ is a linear combination of the input features $x_1, \dots, x_n$ and the model parameters $\beta_0, \dots, \beta_n$ (coefficients).
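In code, the logistic function is a one-liner (the coefficient values below are illustrative, not from the slides):

```python
# The logistic (sigmoid) function: sigma(z) = 1 / (1 + e^(-z)) (sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# z is a linear combination of features and coefficients: z = beta0 + beta . x
beta0, beta = -1.0, np.array([2.0, 0.5])  # illustrative coefficients
x = np.array([0.8, 1.2])
print(sigmoid(beta0 + beta @ x))  # probability in (0, 1), here ~0.77
```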


Logistic Regression
Common Terms
1. Independent variables: The input features or predictor variables used to predict the dependent variable.
2. Dependent variable: The target variable in a logistic regression model, which we are trying to predict.
3. Coefficients: The logistic regression model's estimated parameters, which describe how the independent and dependent variables relate to one another.
4. Maximum likelihood estimation: The method used to estimate the coefficients of the logistic regression model, which maximizes the likelihood of observing the data given the model. The log-likelihood function is given as:
$$\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \log \sigma(z_i) + (1 - y_i) \log(1 - \sigma(z_i)) \right]$$
Maximizing the log-likelihood function with respect to the parameters $\beta$ gives the estimated model coefficients. This is typically done using iterative optimization algorithms such as gradient descent.
Logistic Regression Steps
1. Initialization: Initialize the model parameters $\beta$ (to zero or small random values).
2. Model Specification: Define the linear combination of features, $z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$, and apply the logistic function to obtain probabilities: $p = \sigma(z) = \frac{1}{1 + e^{-z}}$.
3. Optimization: Use an optimization algorithm to maximize the likelihood function and find the optimal parameters $\beta$.
4. Prediction: For a new instance $x$, compute the linear combination $z$, compute the probability $p = \sigma(z)$, and classify the instance based on the decision rule (e.g., threshold at 0.5).
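Putting the four steps together, a minimal gradient-ascent sketch on synthetic data (the learning rate, iteration count, and data generation are my own illustrative choices):

```python
# Logistic regression trained by maximizing the log-likelihood (sketch).
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # synthetic binary labels

X1 = np.hstack([np.ones((N, 1)), X])  # prepend a column of 1s for beta_0

# Step 1: Initialization.
beta = np.zeros(X1.shape[1])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):
    # Step 2: Model specification -- z = X beta, p = sigma(z).
    p = sigmoid(X1 @ beta)
    # Step 3: Optimization -- the log-likelihood gradient is X^T (y - p).
    beta += 0.1 / N * X1.T @ (y - p)

# Step 4: Prediction with a 0.5 threshold.
x_new = np.array([1.0, 0.3, -0.2])  # [1, x1, x2]
p_new = sigmoid(x_new @ beta)
print(p_new, int(p_new >= 0.5))
```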
