SML

The document provides an overview of data visualization, outlier detection, feature selection, and various regression models in Python. It highlights key libraries like Matplotlib and Seaborn, explains outlier types and detection methods, and discusses feature engineering techniques. Additionally, it details different regression models such as Simple Linear Regression, Multiple Linear Regression, and Logistic Regression, along with tree-based models like Decision Trees and Random Forests.

1. Data Visualization Using Python

Purpose of Data Visualization:

 Helps understand patterns, trends, and outliers in the data.

 Makes it easier to interpret and analyze complex datasets.

Key Libraries in Python:

 Matplotlib: Low-level, flexible (e.g., plot(), scatter(), bar(), hist()).

 Seaborn: Built on top of Matplotlib; higher-level and easier to use, with nicer default styles (e.g., scatterplot(), lineplot(), barplot(), histplot()).

Common Plots:

 Scatter Plot – shows relationships between variables.

 Line Chart – shows trends over time.

 Bar Chart – compares categories.

 Histogram – shows frequency distribution.

 Boxplot – detects outliers using quartiles.

Steps for Visualization:

1. Import libraries (import matplotlib.pyplot as plt, import seaborn as sns)

2. Load dataset (pd.read_csv())

3. Clean and inspect (df.head(), df.info(), df.isnull().sum())

4. Create plots (sns.histplot(df['column']), etc.)
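
A minimal sketch of these four steps, assuming a hypothetical file data.csv with a numeric column named price (both names are placeholders):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Step 2: load the dataset ("data.csv" and "price" are placeholder names)
df = pd.read_csv("data.csv")

# Step 3: inspect and check for missing values
print(df.head())
print(df.info())
print(df.isnull().sum())

# Step 4: create plots
sns.histplot(df["price"])      # frequency distribution
plt.title("Distribution of price")
plt.show()

sns.boxplot(x=df["price"])     # quartiles and potential outliers
plt.show()
```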

2. Outlier Detection and Treatment

What is an Outlier?

 A data point that is significantly different from other values.

 Can cause errors or bias in machine learning models.

Types of Outliers:

 Global: Far away from most data points.

 Contextual: Unusual only in certain situations.

Detection Techniques:

 Z-Score: Measures how many standard deviations away a value is from the mean.

 IQR (Interquartile Range): Outliers lie below Q1 – 1.5×IQR or above Q3 + 1.5×IQR.

 Box Plot: Visual method to see outliers.

 Distance Methods: KNN, LOF.

 Clustering Methods: DBSCAN.

 Isolation Forest, One-class SVM: Model-based detection.
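
A minimal sketch of the two statistical rules above (Z-score and IQR) on a small, made-up pandas Series; the numbers are illustrative only:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # toy data; 95 is an obvious outlier

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z) > 3]

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# On this tiny sample the IQR rule flags 95, while |z| > 3 does not trigger;
# with more data points the Z-score rule becomes useful as well.
print(iqr_outliers)
```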

Treatment Methods:

 Remove: Drop outliers.

 Transform: Apply log, square root, or other functions.

 Cap: Set max/min limits (Winsorization).

 Modeling separately: Treat outliers as a special group.
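
A minimal sketch of three of these treatments (cap, transform, remove) on the same kind of Series; the 5th/95th percentile caps are an illustrative choice, not a fixed rule:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])

# Cap (winsorize): limit values to the 5th and 95th percentiles
lower, upper = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=lower, upper=upper)

# Transform: log1p compresses large values and reduces right skew
logged = np.log1p(s)

# Remove: drop the rows outside the chosen limits
trimmed = s[(s >= lower) & (s <= upper)]
```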

3. Feature Selection and Engineering

✅ Feature Selection

 Choosing the most relevant features for the model.

 Helps in:

o Improving accuracy

o Reducing overfitting

o Reducing training time

Methods:

1. Filter Methods: Based on statistics like correlation, chi-square.

2. Wrapper Methods: Try different combinations (e.g., RFE).

3. Embedded Methods: Built into algorithms (e.g., Lasso, Tree importance).
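
A minimal sketch showing one example of each method with scikit-learn, using the built-in breast cancer dataset purely for illustration (keeping 10 features is an arbitrary choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale so the linear models converge

# Filter method: score each feature independently with an ANOVA F-test
filt = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper method: RFE recursively drops the weakest features of a base model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# Embedded method: an L1 (Lasso-style) penalty pushes unhelpful coefficients to zero
lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print(filt.get_support().sum(), rfe.support_.sum())
```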

Feature Engineering

 Creating new features or modifying existing ones to improve model performance.

Common Techniques:

 Imputation: Handle missing values (mean, mode).

 Outlier Handling: Remove or replace.

 Log Transformation: Reduce skewness.

 Binning: Convert continuous values into categories.


 Feature Splitting: Break down features (e.g., extract "year" from "date").

 Encoding: Convert categories to numbers (label encoding, one-hot encoding).

 Scaling: Normalize or standardize features.
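
A minimal sketch of several of these techniques on a tiny, made-up DataFrame (the column names, values, and bin edges are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":  [22, 35, None, 58],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "date": pd.to_datetime(["2021-01-05", "2022-06-30", "2023-03-14", "2020-11-02"]),
})

# Imputation: fill missing ages with the mean
df["age"] = df["age"].fillna(df["age"].mean())

# Binning: convert continuous age into categories
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                         labels=["young", "mid", "senior"])

# Feature splitting: extract the year from a date column
df["year"] = df["date"].dt.year

# Encoding: one-hot encode the city column
df = pd.get_dummies(df, columns=["city"])

# Scaling: standardize the numeric age column
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]]).ravel()
```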

REGRESSION MODELS

Regression models are used to predict a target (dependent variable) based on one or more inputs
(independent variables). Let’s explore each one:

1. Simple Linear Regression (SLR)

Purpose: Predict a numerical value using one input variable.

Formula:

y = \beta_0 + \beta_1 x + \varepsilon

 y: Target variable (what you want to predict)

 x: Input variable

 β0: Intercept (value of y when x = 0)

 β1: Slope (how much y changes with x)

 ε: Error term

Example: Predicting a person’s salary based on years of experience.

Goal: Find the straight line (best fit) that minimizes the difference between actual and predicted
values.

Evaluation Metrics:

 Mean Squared Error (MSE)

 R-squared (R²)
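
A minimal sketch with scikit-learn, using made-up experience/salary numbers purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data: years of experience vs. salary
X = np.array([[1], [2], [3], [4], [5], [6]])          # one input variable
y = np.array([30000, 35000, 41000, 45000, 52000, 56000])

model = LinearRegression().fit(X, y)
pred = model.predict(X)

print("intercept (beta_0):", model.intercept_)
print("slope (beta_1):", model.coef_[0])
print("MSE:", mean_squared_error(y, pred))
print("R^2:", r2_score(y, pred))
```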

2. Multiple Linear Regression (MLR)

Purpose: Predict a numerical value using two or more input variables.

Formula:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon

 x1, x2, ..., xn: Multiple independent variables

Example: Predicting a student's marks based on hours studied, sleep hours, and attendance.

Assumptions:

 Linear relationship between inputs and output

 No multicollinearity (inputs shouldn’t be highly correlated)

 Errors are normally distributed

Use: For modeling complex situations where multiple factors influence the result.
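
A minimal sketch with made-up student data (the numbers are illustrative); the correlation matrix is a quick way to spot highly correlated inputs (multicollinearity):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: marks predicted from hours studied, sleep hours, attendance %
df = pd.DataFrame({
    "hours_studied": [2, 4, 5, 7, 8, 10],
    "sleep_hours":   [8, 7, 7, 6, 6, 5],
    "attendance":    [60, 70, 75, 85, 90, 95],
    "marks":         [45, 55, 60, 72, 78, 88],
})

# Check correlations between the input variables
print(df[["hours_studied", "sleep_hours", "attendance"]].corr())

X = df[["hours_studied", "sleep_hours", "attendance"]]
y = df["marks"]
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # beta_0 and beta_1 ... beta_n
```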

3. Logistic Regression

Purpose: Used for classification problems, especially binary classification (yes/no, 0/1).

It does not predict a numeric value directly; it predicts a probability, which is then converted into a class label.

Sigmoid Function (S-shaped curve):

P = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots

If P > 0.5, output = 1 (positive class); else 0.

Types:

 Binary Logistic Regression: Two classes (e.g., spam or not)

 Multinomial Logistic Regression: More than two unordered classes

 Ordinal Logistic Regression: More than two ordered classes

Example: Will a customer buy a product? (Yes = 1, No = 0)

Evaluation Metrics:

 Accuracy

 Precision, Recall, F1-score

 ROC-AUC
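
A minimal sketch of binary logistic regression with scikit-learn, using the built-in breast cancer dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # binary target (0/1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

proba = model.predict_proba(scaler.transform(X_test))[:, 1]  # P from the sigmoid
pred = (proba > 0.5).astype(int)                             # threshold at 0.5

print(accuracy_score(y_test, pred))
print(classification_report(y_test, pred))   # precision, recall, F1-score
print(roc_auc_score(y_test, proba))          # ROC-AUC
```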

4. Poisson Regression

Purpose: Predict count-based outcomes (like number of occurrences).

Used when the target variable is a non-negative integer (0, 1, 2, …).

Formula:

\log(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots

 μ is the expected count

 log(μ) (the log link) makes the model linear in the coefficients and keeps predicted counts non-negative

Example:

 Number of calls received in a day

 Number of accidents in a week

Assumption:

 Conditional on the predictors, the mean and variance of the count variable are equal (equidispersion).
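
A minimal sketch using scikit-learn's PoissonRegressor, which fits the log-link model above; the data (staff on duty, promotion flag, calls received) is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Hypothetical data: [staff on duty, promotion running] vs. calls received per day
X = np.array([[2, 0], [3, 0], [3, 1], [4, 1], [5, 1], [6, 1]])
y = np.array([4, 6, 9, 11, 15, 20])   # non-negative integer counts

# PoissonRegressor uses the log link, so coefficients act on log(mu)
model = PoissonRegressor(alpha=0).fit(X, y)
print(model.intercept_, model.coef_)
print(model.predict([[4, 0]]))        # expected count mu for a new day
```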

TREE-BASED MODELS

Tree-based models use a "divide and conquer" strategy. They split the data into branches based on
feature values to make decisions or predictions.

1. Decision Tree

Purpose: Used for both classification and regression.

How it works:

 The model asks questions and splits data based on answers (like a flowchart).

 Each split is based on a condition (e.g., “Is age > 18?”)

 Continues until it reaches a final decision (leaf node).

For classification:

 Uses Gini index or Entropy to split data.

 Example: Approve loan or not based on age, salary.

For regression:

 Splits are made to minimize variance in numeric output.

Advantages:

 Easy to understand and interpret

 Handles both numerical and categorical data

Disadvantages:

 Can overfit the data

 Not very accurate alone
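
A minimal sketch with scikit-learn, using the built-in iris dataset purely for illustration (the depth limit of 3 is an arbitrary choice to curb overfitting):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" (default) or "entropy"; max_depth limits overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # classification accuracy
print(export_text(tree))            # the flowchart of splits, printed as text
```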


2. Random Forest

Purpose: A stronger model made by combining many decision trees.

How it works:

 Creates many decision trees using random subsets of data and features.

 Takes average of predictions (for regression) or majority vote (for classification).

Why better than one tree?

 Reduces overfitting

 More accurate and stable

Example:

 Spam detection

 Customer churn prediction

Advantages:

 High accuracy

 Handles missing values well

Disadvantages:

 Slower than a single tree

 Less interpretable
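
A minimal sketch with scikit-learn (200 trees is an arbitrary choice; the breast cancer dataset is used purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 200 trees, each trained on a bootstrap sample with random feature subsets
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))      # majority-vote accuracy
print(forest.feature_importances_[:5])   # per-feature importance scores
```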

3. Boosting Algorithms

Purpose: Boosting builds models sequentially, each new one fixing the mistakes of the previous one.

Popular Types:

 AdaBoost: Gives misclassified samples more weight in the next round.

 Gradient Boosting: Fits each new tree to the remaining errors (the gradient of the loss) of the current model.

 XGBoost: An optimized version of gradient boosting (very popular in real-world projects and
competitions).

How it works:

 Trains weak learners (like small decision trees).

 Combines them to make a strong overall model.


Example: Fraud detection, product recommendation.

Advantages:

 Very high accuracy

 Works well with large and complex datasets

Disadvantages:

 Can overfit if not tuned properly

 Slower training time
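
A minimal sketch using scikit-learn's GradientBoostingClassifier; XGBoost is a separate library with a similar fit/predict interface. All hyperparameter values here are illustrative, not tuned:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Sequential ensemble of small trees; each new tree corrects the previous errors
boost = GradientBoostingClassifier(
    n_estimators=200,    # number of boosting rounds (weak learners)
    learning_rate=0.1,   # how strongly each new tree corrects the model
    max_depth=3,         # keep the weak learners small
    random_state=42,
)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```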
