SML
SML
Seaborn: Built on Matplotlib, prettier and easier (e.g., scatterplot(), lineplot(), barplot(),
histplot()).
Common Plots:
What is an Outlier?
Types of Outliers:
Detection Techniques:
Z-Score: Measures how many standard deviations away a value is from the mean.
IQR (Interquartile Range): Outliers lie below Q1 – 1.5×IQR or above Q3 + 1.5×IQR.
Treatment Methods:
✅ Feature Selection
Helps in:
o Improving accuracy
o Reducing overfitting
Methods:
Feature Engineering
Common Techniques:
REGRESSION MODELS
Regression models are used to predict a target (dependent variable) based on one or more inputs
(independent variables). Let’s explore each one:
Formula:
x: Input variable
ε: Error term
Goal: Find the straight line (best fit) that minimizes the difference between actual and predicted
values.
Evaluation Metrics:
R-squared (R²)
Formula:
Example: Predicting a student's marks based on hours studied, sleep hours, and attendance.
Assumptions:
Use: For modeling complex situations where multiple factors influence the result.
3. Logistic Regression
Purpose: Used for classification problems, especially binary classification (yes/no, 0/1).
Types:
Evaluation Metrics:
Accuracy
ROC-AUC
4. Poisson Regression
Formula:
Example:
Assumption:
TREE-BASED MODELS
Tree-based models use a "divide and conquer" strategy. They split the data into branches based on
feature values to make decisions or predictions.
1. Decision Tree
How it works:
The model asks questions and splits data based on answers (like a flowchart).
For classification:
For regression:
Advantages:
Disadvantages:
How it works:
Creates many decision trees using random subsets of data and features.
Reduces overfitting
Example:
Spam detection
Advantages:
High accuracy
Disadvantages:
Less interpretable
3. Boosting Algorithms
Purpose: Boosting builds models sequentially, each new one fixing the mistakes of the previous one.
Popular Types:
XGBoost: An optimized version of gradient boosting (very popular in real-world projects and
competitions).
How it works:
Advantages:
Disadvantages:
Detection Techniques:
Z-Score: Measures how many standard deviations away a value is from the mean.
Treatment Methods:
✅ Feature Selection
Helps in:
o Improving accuracy
o Reducing overfitting
Feature Engineering
Common Techniques:
Feature Splitting: Break down features (e.g., extract "year" from "date").