SML

The document provides an overview of data visualization, outlier detection, feature selection, and various regression models in Python. It highlights key libraries like Matplotlib and Seaborn, explains outlier types and detection methods, and discusses feature engineering techniques. Additionally, it details different regression models such as Simple Linear Regression, Multiple Linear Regression, and Logistic Regression, along with tree-based models like Decision Trees and Random Forests.

1. Data Visualization Using Python

Purpose of Data Visualization:

 Helps understand patterns, trends, and outliers in the data.

 Makes it easier to interpret and analyze complex datasets.

Key Libraries in Python:

 Matplotlib: Low-level, flexible (e.g., plot(), scatter(), bar(), hist()).

 Seaborn: Built on top of Matplotlib; higher-level and easier to use, with nicer default styles (e.g., scatterplot(), lineplot(), barplot(), histplot()).

Common Plots:

 Scatter Plot – shows relationships between variables.

 Line Chart – shows trends over time.

 Bar Chart – compares categories.

 Histogram – shows frequency distribution.

 Boxplot – detects outliers using quartiles.

Steps for Visualization:

1. Import libraries (import matplotlib.pyplot as plt, import seaborn as sns)

2. Load dataset (pd.read_csv())

3. Clean and inspect (df.head(), df.info(), df.isnull().sum())

4. Create plots (sns.histplot(df['column']), etc.)
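
A minimal sketch of these four steps, assuming a hypothetical file data.csv with a numeric column named price (both names are placeholders):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Step 2: load the dataset ("data.csv" and "price" are placeholder names)
df = pd.read_csv("data.csv")

# Step 3: inspect and check for missing values
print(df.head())
print(df.info())
print(df.isnull().sum())

# Step 4: create plots
sns.histplot(df["price"])      # frequency distribution
plt.title("Distribution of price")
plt.show()

sns.boxplot(x=df["price"])     # quartiles and potential outliers
plt.show()
```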

2. Outlier Detection and Treatment

What is an Outlier?

 A data point that is significantly different from other values.

 Can cause errors or bias in machine learning models.

Types of Outliers:

 Global: Far away from most data points.

 Contextual: Unusual only in certain situations.

Detection Techniques:

 Z-Score: Measures how many standard deviations away a value is from the mean.

 IQR (Interquartile Range): Outliers lie below Q1 – 1.5×IQR or above Q3 + 1.5×IQR.

 Box Plot: Visual method to see outliers.

 Distance Methods: KNN, LOF.

 Clustering Methods: DBSCAN.

 Isolation Forest, One-class SVM: Model-based detection.
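
A minimal sketch of the two statistical rules above (Z-score and IQR) on a small, made-up pandas Series; the numbers are illustrative only:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # toy data; 95 is an obvious outlier

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z) > 3]

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# On this tiny sample the IQR rule flags 95, while |z| > 3 does not trigger;
# with more data points the Z-score rule becomes useful as well.
print(iqr_outliers)
```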

Treatment Methods:

 Remove: Drop outliers.

 Transform: Apply log, square root, or other functions.

 Cap: Set max/min limits (Winsorization).

 Modeling separately: Treat outliers as a special group.
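
A minimal sketch of three of these treatments (cap, transform, remove) on the same kind of Series; the 5th/95th percentile caps are an illustrative choice, not a fixed rule:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])

# Cap (winsorize): limit values to the 5th and 95th percentiles
lower, upper = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=lower, upper=upper)

# Transform: log1p compresses large values and reduces right skew
logged = np.log1p(s)

# Remove: drop the rows outside the chosen limits
trimmed = s[(s >= lower) & (s <= upper)]
```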

3. Feature Selection and Engineering

✅ Feature Selection

 Choosing the most relevant features for the model.

 Helps in:

o Improving accuracy

o Reducing overfitting

o Reducing training time

Methods:

1. Filter Methods: Based on statistics like correlation, chi-square.

2. Wrapper Methods: Try different combinations (e.g., RFE).

3. Embedded Methods: Built into algorithms (e.g., Lasso, Tree importance).
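
A minimal sketch showing one example of each method with scikit-learn, using the built-in breast cancer dataset purely for illustration (keeping 10 features is an arbitrary choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale so the linear models converge

# Filter method: score each feature independently with an ANOVA F-test
filt = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper method: RFE recursively drops the weakest features of a base model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# Embedded method: an L1 (Lasso-style) penalty pushes unhelpful coefficients to zero
lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print(filt.get_support().sum(), rfe.support_.sum())
```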

Feature Engineering

 Creating new features or modifying existing ones to improve model performance.

Common Techniques:

 Imputation: Handle missing values (mean, mode).

 Outlier Handling: Remove or replace.

 Log Transformation: Reduce skewness.

 Binning: Convert continuous values into categories.


 Feature Splitting: Break down features (e.g., extract "year" from "date").

 Encoding: Convert categories to numbers (label encoding, one-hot encoding).

 Scaling: Normalize or standardize features.
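
A minimal sketch of several of these techniques on a tiny, made-up DataFrame (the column names, values, and bin edges are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":  [22, 35, None, 58],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "date": pd.to_datetime(["2021-01-05", "2022-06-30", "2023-03-14", "2020-11-02"]),
})

# Imputation: fill missing ages with the mean
df["age"] = df["age"].fillna(df["age"].mean())

# Binning: convert continuous age into categories
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                         labels=["young", "mid", "senior"])

# Feature splitting: extract the year from a date column
df["year"] = df["date"].dt.year

# Encoding: one-hot encode the city column
df = pd.get_dummies(df, columns=["city"])

# Scaling: standardize the numeric age column
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]]).ravel()
```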

REGRESSION MODELS

Regression models are used to predict a target (dependent variable) based on one or more inputs
(independent variables). Let’s explore each one:

1. Simple Linear Regression (SLR)

Purpose: Predict a numerical value using one input variable.

Formula:

y = \beta_0 + \beta_1 x + \varepsilon

 y: Target variable (what you want to predict)

 x: Input variable

 β0: Intercept (value of y when x = 0)

 β1: Slope (how much y changes with x)

 ε: Error term

Example: Predicting a person’s salary based on years of experience.

Goal: Find the straight line (best fit) that minimizes the difference between actual and predicted
values.

Evaluation Metrics:

 Mean Squared Error (MSE)

 R-squared (R²)
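
A minimal sketch with scikit-learn, using made-up experience/salary numbers purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data: years of experience vs. salary
X = np.array([[1], [2], [3], [4], [5], [6]])          # one input variable
y = np.array([30000, 35000, 41000, 45000, 52000, 56000])

model = LinearRegression().fit(X, y)
pred = model.predict(X)

print("intercept (beta_0):", model.intercept_)
print("slope (beta_1):", model.coef_[0])
print("MSE:", mean_squared_error(y, pred))
print("R^2:", r2_score(y, pred))
```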

2. Multiple Linear Regression (MLR)

Purpose: Predict a numerical value using two or more input variables.

Formula:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon

 x1, x2, ..., xn: Multiple independent variables

Example: Predicting a student's marks based on hours studied, sleep hours, and attendance.

Assumptions:

 Linear relationship between inputs and output

 No multicollinearity (inputs shouldn’t be highly correlated)

 Errors are normally distributed

Use: For modeling complex situations where multiple factors influence the result.
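
A minimal sketch with made-up student data (the numbers are illustrative); the correlation matrix is a quick way to spot highly correlated inputs (multicollinearity):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: marks predicted from hours studied, sleep hours, attendance %
df = pd.DataFrame({
    "hours_studied": [2, 4, 5, 7, 8, 10],
    "sleep_hours":   [8, 7, 7, 6, 6, 5],
    "attendance":    [60, 70, 75, 85, 90, 95],
    "marks":         [45, 55, 60, 72, 78, 88],
})

# Check correlations between the input variables
print(df[["hours_studied", "sleep_hours", "attendance"]].corr())

X = df[["hours_studied", "sleep_hours", "attendance"]]
y = df["marks"]
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # beta_0 and beta_1 ... beta_n
```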

3. Logistic Regression

Purpose: Used for classification problems, especially binary classification (yes/no, 0/1).

It does not predict a numeric value directly; it predicts a probability, which is then converted into a class label.

Sigmoid Function (S-shaped curve):

P = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots

If P > 0.5, output = 1 (positive class); else 0.

Types:

 Binary Logistic Regression: Two classes (e.g., spam or not)

 Multinomial Logistic Regression: More than two unordered classes

 Ordinal Logistic Regression: More than two ordered classes

Example: Will a customer buy a product? (Yes = 1, No = 0)

Evaluation Metrics:

 Accuracy

 Precision, Recall, F1-score

 ROC-AUC
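
A minimal sketch of binary logistic regression with scikit-learn, using the built-in breast cancer dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # binary target (0/1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

proba = model.predict_proba(scaler.transform(X_test))[:, 1]  # P from the sigmoid
pred = (proba > 0.5).astype(int)                             # threshold at 0.5

print(accuracy_score(y_test, pred))
print(classification_report(y_test, pred))   # precision, recall, F1-score
print(roc_auc_score(y_test, proba))          # ROC-AUC
```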

4. Poisson Regression

Purpose: Predict count-based outcomes (like number of occurrences).

Used when the target variable is a non-negative integer (0, 1, 2, …).

Formula:

\log(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots

 μ is the expected count

 log(μ) (the log link) makes the model linear in the coefficients and keeps predicted counts non-negative

Example:

 Number of calls received in a day

 Number of accidents in a week

Assumption:

 Conditional on the predictors, the mean and variance of the count variable are equal (equidispersion).
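
A minimal sketch using scikit-learn's PoissonRegressor, which fits the log-link model above; the data (staff on duty, promotion flag, calls received) is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Hypothetical data: [staff on duty, promotion running] vs. calls received per day
X = np.array([[2, 0], [3, 0], [3, 1], [4, 1], [5, 1], [6, 1]])
y = np.array([4, 6, 9, 11, 15, 20])   # non-negative integer counts

# PoissonRegressor uses the log link, so coefficients act on log(mu)
model = PoissonRegressor(alpha=0).fit(X, y)
print(model.intercept_, model.coef_)
print(model.predict([[4, 0]]))        # expected count mu for a new day
```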

TREE-BASED MODELS

Tree-based models use a "divide and conquer" strategy. They split the data into branches based on
feature values to make decisions or predictions.

1. Decision Tree

Purpose: Used for both classification and regression.

How it works:

 The model asks questions and splits data based on answers (like a flowchart).

 Each split is based on a condition (e.g., “Is age > 18?”)

 Continues until it reaches a final decision (leaf node).

For classification:

 Uses Gini index or Entropy to split data.

 Example: Approve loan or not based on age, salary.

For regression:

 Splits are made to minimize variance in numeric output.

Advantages:

 Easy to understand and interpret

 Handles both numerical and categorical data

Disadvantages:

 Can overfit the data

 Not very accurate alone
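
A minimal sketch with scikit-learn, using the built-in iris dataset purely for illustration (the depth limit of 3 is an arbitrary choice to curb overfitting):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" (default) or "entropy"; max_depth limits overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # classification accuracy
print(export_text(tree))            # the flowchart of splits, printed as text
```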


2. Random Forest

Purpose: A stronger model made by combining many decision trees.

How it works:

 Creates many decision trees using random subsets of data and features.

 Takes average of predictions (for regression) or majority vote (for classification).

Why better than one tree?

 Reduces overfitting

 More accurate and stable

Example:

 Spam detection

 Customer churn prediction

Advantages:

 High accuracy

 Handles missing values well

Disadvantages:

 Slower than a single tree

 Less interpretable
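
A minimal sketch with scikit-learn (200 trees is an arbitrary choice; the breast cancer dataset is used purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 200 trees, each trained on a bootstrap sample with random feature subsets
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))      # majority-vote accuracy
print(forest.feature_importances_[:5])   # per-feature importance scores
```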

3. Boosting Algorithms

Purpose: Boosting builds models sequentially, each new one fixing the mistakes of the previous one.

Popular Types:

 AdaBoost: Gives misclassified samples more weight in the next round.

 Gradient Boosting: Fits each new tree to the remaining errors (the gradient of the loss) of the current model.

 XGBoost: An optimized version of gradient boosting (very popular in real-world projects and
competitions).

How it works:

 Trains weak learners (like small decision trees).

 Combines them to make a strong overall model.


Example: Fraud detection, product recommendation.

Advantages:

 Very high accuracy

 Works well with large and complex datasets

Disadvantages:

 Can overfit if not tuned properly

 Slower training time
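
A minimal sketch using scikit-learn's GradientBoostingClassifier; XGBoost is a separate library with a similar fit/predict interface. All hyperparameter values here are illustrative, not tuned:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Sequential ensemble of small trees; each new tree corrects the previous errors
boost = GradientBoostingClassifier(
    n_estimators=200,    # number of boosting rounds (weak learners)
    learning_rate=0.1,   # how strongly each new tree corrects the model
    max_depth=3,         # keep the weak learners small
    random_state=42,
)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```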
