
Classification in Machine Learning - Unit 3 Notes

1. Introduction to Classification
What is Classification?
Classification is a supervised learning technique used to predict categorical labels (e.g., spam vs. non-spam, disease vs. no disease) for given input data. The model learns from labeled training data to classify new, unseen instances.
General Approach to Classification:
1. Data Preparation: Clean and preprocess the data.
2. Feature Selection: Select relevant features for the task.
3. Model Selection: Choose an appropriate classification algorithm.
4. Training: Use labeled data to train the model.
5. Evaluation: Validate model performance using metrics like accuracy, precision, recall, and F1-score.
6. Prediction: Apply the trained model to classify new data.
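The sketch below walks these six steps end to end with scikit-learn; the Iris dataset, the scaler, and the k-NN classifier are illustrative stand-ins, not choices prescribed by these notes.

```python
# A minimal sketch of the six-step workflow (illustrative dataset and model).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)               # 1. data (already clean here)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)       # hold out data for evaluation

scaler = StandardScaler().fit(X_train)          # 2. preprocess/scale features
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = KNeighborsClassifier(n_neighbors=5)     # 3. model selection
model.fit(X_train, y_train)                     # 4. training

print(classification_report(y_test, model.predict(X_test)))  # 5. evaluation
label = model.predict(X_test[:1])               # 6. prediction on new data
```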
Classification Algorithms Overview:
• k-Nearest Neighbor (k-NN)
• Random Forests
• Fuzzy Set Approaches
• Support Vector Machines (SVM)
• Decision Trees
• Bayesian Learning
• Ensemble Methods

2. k-Nearest Neighbor (k-NN) Algorithm
Overview:
k-NN is a simple, non-parametric algorithm: it does not have a fixed number of parameters, and the effective number of parameters grows with the size of the training data. It classifies a data point based on its proximity to its k nearest neighbors in the feature space.
Steps:
1. Select the number of neighbors (k).
2. Calculate the distance between the query point and all other points in the dataset.
3. Identify the k closest neighbors.
4. Assign the class label based on majority voting among the neighbors (sketched in code below).
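A minimal from-scratch sketch of the four steps, assuming NumPy and Euclidean distance; library implementations add optimizations such as KD-trees that this sketch omits.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Step 2: Euclidean distance from the query point to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 3: indices of the k closest neighbors.
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative dataset: two well-separated clusters.
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [6.0, 6.2], [5.8, 6.1]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> "A"
```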
Distance Metrics:
• Euclidean Distance: \sqrt{\sum_i (x_i - y_i)^2}
• Manhattan Distance: \sum_i |x_i - y_i|
• Minkowski Distance: \left(\sum_i |x_i - y_i|^p\right)^{1/p} (generalizes both: p = 2 gives Euclidean, p = 1 gives Manhattan)
Advantages:
• Simple to implement.
• Effective for small datasets.
Disadvantages:
• Computationally expensive for large datasets.
• Sensitive to the choice of k and to the scaling of features.

3. Random Forests
Overview:
Random Forest is an ensemble learning method (one that aggregates two or more learners to produce better predictions). It builds multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression).
Key Features:
• Combines multiple decision trees to improve accuracy.
• Reduces the risk of overfitting.
Steps:
1. Bootstrap sampling to create subsets of the dataset.
2. Train a decision tree on each subset using random feature selection.
3. Aggregate the results using majority voting (see the sketch below).
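A minimal sketch using scikit-learn's RandomForestClassifier, which performs the bootstrap sampling, random feature selection, and majority voting described above internally; the synthetic dataset and hyperparameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrapped decision trees
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
).fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
```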
Advantages:
• High accuracy and robustness.
• Handles large datasets efficiently.
Disadvantages:
• Less interpretable than a single decision tree.

4. Fuzzy Set Approaches
Overview:
Fuzzy set approaches deal with uncertainty and vagueness in data. Unlike classical sets, fuzzy sets allow partial membership, represented by a membership function.
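A minimal sketch of partial membership, assuming a triangular membership function for a hypothetical fuzzy set "normal body temperature"; the shape and breakpoints are invented for illustration.

```python
def triangular_membership(x, a, b, c):
    """Degree of membership in [0, 1]: rises from a to a peak at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Illustrative breakpoints: 36.0 (low edge), 36.8 (peak), 38.0 (high edge).
for temp in (36.0, 36.8, 37.5, 38.5):
    print(temp, "->", round(triangular_membership(temp, 36.0, 36.8, 38.0), 2))
# Unlike a classical set, 37.5 is neither fully "normal" nor fully
# "not normal": it belongs to the set with degree ~0.42.
```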
Applications in Classification:
• Handling imprecise and noisy data.
• Medical diagnosis.
• Risk assessment.

5. Support Vector Machines (SVM)
Introduction:
SVM is a powerful classification algorithm that constructs a hyperplane to separate data into different classes.
Types of Kernels:
1. Linear Kernel: Separates data linearly.
2. Polynomial Kernel: Captures non-linear relationships.
3. Gaussian Kernel (RBF): Models complex boundaries (see the comparison sketch below).
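A minimal sketch comparing the linear and RBF kernels on a deliberately non-linear dataset, using scikit-learn's SVC; the dataset and the C value are illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    print(kernel, "accuracy:", clf.score(X_test, y_test))
    # The support vectors are the training points closest to the hyperplane.
    print(kernel, "support vectors:", len(clf.support_vectors_))
```

On this kind of data the RBF kernel typically scores noticeably higher than the linear one, which is the practical motivation for kernel selection.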
Key Concepts:
• Hyperplane: Decision boundary that separates classes.
• Margin: Distance between the hyperplane and the nearest data points.
• Support Vectors: Data points closest to the hyperplane that influence its position.
Properties of SVM:
• Effective in high-dimensional spaces.
• Memory-efficient (the decision function uses only the support vectors).
Issues:
• Computationally expensive for large datasets.
• Requires careful kernel selection.

6. Decision Trees
Overview:
A decision tree is a flowchart-like structure where internal nodes represent tests on features, branches represent
outcomes, and leaf nodes represent class labels.
Algorithm:
 Use metrics like entropy and informa on gain to split nodes.
 Repeat spli ng un l stopping criteria are met (e.g., maximum depth).
Key Concepts:
• Entropy: Measure of randomness or impurity.
o Formula: E = -\sum_i p_i \log_2 p_i
• Information Gain: Reduction in entropy after a split.
o Formula: IG = E_{parent} - \sum_{children} \frac{|child|}{|parent|} E_{child}
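A minimal sketch computing both formulas for a hypothetical split; the class counts are invented for illustration.

```python
import math

def entropy(counts):
    # E = -sum(p_i * log2(p_i)) over the class proportions p_i.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [8, 6]                  # e.g. 8 positive, 6 negative examples
children = [[7, 1], [1, 5]]      # class counts after a candidate split

# IG = E(parent) - weighted average of the children's entropies.
weighted = sum(sum(c) / sum(parent) * entropy(c) for c in children)
info_gain = entropy(parent) - weighted
print(round(entropy(parent), 3), round(info_gain, 3))  # 0.985 0.396
```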
ID-3 Algorithm:
• Iterative Dichotomiser 3 (ID-3) uses information gain to construct a decision tree.
Issues:
• Overfitting.
• Sensitivity to noisy data.

7. Bayesian Learning
Bayes Theorem:
Bayes' theorem provides a mathematical framework for updating probabilities based on new evidence.
Formula:
P(A|B) = \frac{P(B|A) P(A)}{P(B)}
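A short worked example of the formula, with made-up probabilities: estimating P(spam | word) from a spam base rate and per-class word likelihoods.

```python
# All probabilities below are invented illustration values.
p_spam = 0.2              # prior P(A): fraction of mail that is spam
p_word_given_spam = 0.6   # likelihood P(B|A): word appears in spam
p_word_given_ham = 0.05   # likelihood P(B|not A): word appears in non-spam

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.12 / 0.16 = 0.75
```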
Key Concepts:
• Bayes Optimal Classifier: Makes predictions with the lowest possible error.
• Naive Bayes Classifier: Assumes independence among features.
• Bayesian Belief Networks: Graphical models representing probabilistic relationships.
• EM Algorithm: Used for parameter estimation in probabilistic models.
Applications:
• Text classification.
• Spam filtering.
• Sentiment analysis.

8. Ensemble Methods
Overview:
Ensemble methods combine multiple models to improve performance.
Techniques:
1. Bagging:
o Reduces variance by training models on different bootstrapped subsets of the data.
o Example: Random Forest.
2. Boosting:
o Focuses on correcting errors made by previous models.
o Example: AdaBoost, XGBoost.
3. Stacking:
o Combines predictions from different models using a meta-model (see the sketch below).
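A minimal sketch of all three techniques with scikit-learn; the synthetic dataset and base learners are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    "boosting": AdaBoostClassifier(n_estimators=50),
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()),
                    ("lr", LogisticRegression())],
        final_estimator=LogisticRegression()),  # the meta-model
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```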

9. Classification Model Evaluation and Selection
Metrics:
1. Sensitivity (Recall):
o Formula: Recall = \frac{TP}{TP + FN}
o Measures the true positive rate.
2. Specificity:
o Formula: Specificity = \frac{TN}{TN + FP}
o Measures the true negative rate.
3. Positive Predictive Value (Precision):
o Formula: Precision = \frac{TP}{TP + FP}
o Measures the accuracy of positive predictions.
4. Negative Predictive Value:
o Formula: NPV = \frac{TN}{TN + FN}
5. ROC Curve:
o Plots the true positive rate against the false positive rate at different thresholds.
o The Area Under the Curve (AUC) measures overall performance.
6. Lift and Gain Curves:
o Evaluate the effectiveness of classification models.
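A minimal sketch computing the four ratio metrics above from confusion-matrix counts; the TP/FP/TN/FN values are invented for illustration.

```python
# Illustrative confusion-matrix counts.
tp, fp, tn, fn = 40, 10, 45, 5

recall      = tp / (tp + fn)   # sensitivity / true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision   = tp / (tp + fp)   # positive predictive value
npv         = tn / (tn + fn)   # negative predictive value

print(f"recall={recall:.2f} specificity={specificity:.2f} "
      f"precision={precision:.2f} npv={npv:.2f}")
# recall=0.89 specificity=0.82 precision=0.80 npv=0.90
```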
Real-World Concerns:
• Misclassification Cost Adjustment:
o Balance the costs of false positives against false negatives.
• Decision Cost/Benefit Analysis:
o Assess the trade-offs of classification decisions.

Conclusion
Classification is a cornerstone of machine learning, with diverse algorithms tailored to different tasks. From simple methods like k-NN to advanced techniques like ensemble methods, understanding their strengths and limitations ensures effective model selection and deployment.
