
Lecture_material_10

March 16, 2024

1 Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to
learn without being explicitly programmed.
It uses data to detect patterns and adjusts program actions accordingly.
Libraries for machine learning
Scikit-learn
TensorFlow and Keras
PyTorch
Natural language toolkit (NLTK)
OpenCV

2 AI vs machine learning
Artificial general intelligence is a form of intelligence similar or equal to human intelligence: it would have language and be able to learn and make decisions.
Machine learning is a type of artificial intelligence where we create an algorithm that learns from given data.
We give the system inputs and outputs, and it learns a mapping between the two; once that mapping is learned, we can feed the system new inputs and it will produce outputs.
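As a minimal sketch of this idea (the numbers and their meanings here are invented purely for illustration):

[ ]: from sklearn.linear_model import LinearRegression

# Inputs (X) and known outputs (y): the algorithm learns the mapping between them
X = [[1], [2], [3], [4]]   # e.g. hours studied
y = [10, 20, 30, 40]       # e.g. marks obtained
model = LinearRegression().fit(X, y)

# Once the mapping is learned, a new input produces a predicted output
model.predict([[5]])       # approximately 50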

3 Types of machine learning


Supervised: both input and output (labels) are provided.
Unsupervised: only input is provided; there are no labelled outputs.
Reinforcement: the machine learns from feedback from its environment and decides on actions. It is typical for automated systems that have to take decisions without human interference. Example: should a self-driving car accelerate or slow down at a yellow traffic light?

4 Data cleaning in machine learning
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

[ ]: df=sns.load_dataset("titanic")
df.head()

[ ]: # Check missing values


df.isnull().sum().sort_values(ascending=False)

[ ]: # pip install distutils


#%pip install distutils
#pip install setuptools

5 Imputation of missing values


For various reasons, many real-world datasets contain missing values, often encoded as blanks,
NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators
which assume that all values in an array are numerical, and that all have and hold meaning.
A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing
missing values. However, this comes at the price of losing data which may be valuable (even though
incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known
part of the data.
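As a quick illustration of the first strategy on the titanic dataframe loaded above (shown only to compare shapes; the rest of the lecture uses imputation instead):

[ ]: # Strategy 1: discard rows or columns that contain missing values (loses data)
rows_dropped = df.dropna()        # drop every row containing any NaN
cols_dropped = df.dropna(axis=1)  # drop every column containing any NaN
print(df.shape, rows_dropped.shape, cols_dropped.shape)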

6 Univariate feature imputation


The SimpleImputer class provides basic strategies for imputing missing values.
Missing values can be imputed with a provided constant value, or using a statistic (mean, median
or most frequent) of each column in which the missing values are located.

[ ]: from sklearn.impute import SimpleImputer

[ ]: imputer=SimpleImputer(strategy='median')
df['age']=imputer.fit_transform(df[['age']])
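SimpleImputer is not limited to numeric columns. As a sketch, the titanic column 'embark_town' (which also contains missing values) can be filled with the most frequent category; the column choice here is only an example:

[ ]: # 'most_frequent' (or a 'constant' fill value) also works for categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
df['embark_town'] = cat_imputer.fit_transform(df[['embark_town']]).ravel()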

7 Imputation of multi-variate features


A more sophisticated approach is to use the IterativeImputer class, which models each feature
with missing values as a function of the other features and uses that estimate for imputation.
It does so in an iterated round-robin fashion: at each step, a feature column is designated as output
y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y.

Then, the regressor is used to predict the missing values of y. This is done for each feature in an
iterative fashion, and then is repeated for max_iter imputation rounds

[ ]: from sklearn.experimental import enable_iterative_imputer


from sklearn.impute import IterativeImputer

[ ]: imputer = IterativeImputer(max_iter=20, n_nearest_features=5, random_state=0)
# Fit on several numeric columns (an illustrative choice) so 'age' is modelled from the other features
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
df[num_cols] = imputer.fit_transform(df[num_cols])

8 Forward and backward fill


[ ]: df=sns.load_dataset("titanic")
df.head()

[ ]: df.isnull().sum().sort_values(ascending=False)

[ ]: # Replace the missing values with forward fill


df['age'] = df['age'].ffill()

[ ]: # Replace the missing values with backward fill


df['age'] = df['age'].bfill()

9 Using KNN Imputer


[ ]: from sklearn.impute import KNNImputer
# Create an imputer object with a KNN filling strategy
imputer = KNNImputer(n_neighbors=5)
# Fill the missing values using the KNN imputer
df['age'] = imputer.fit_transform(df[['age']])
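With a single column, the KNN imputer has no other features to measure distance on, so the neighbours are not very informative. A sketch of the more typical multi-column usage (the column selection is illustrative):

[ ]: # Neighbours are now found using several numeric features at once
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])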

10 Inconsistencies in data
[ ]: data={"Date":
↪["2020-01-01","01-02-2000","2020-03-01","20-2020-04","2020-01-05","2020-01-06","2020-01-07",

"Country":["China","USA","China","America","China","USA","China","United␣
↪States","China","USA"],

"Name":
↪["John","Alice","John","Alice","John","Alice","John","Alice","John","Alice"],

"Sales_2020":[100,200,300,400,500,600,700,800,900,1000],
"Sales_2021":[120,220,320,420,520,620,720,820,920,1020]}

[ ]: data=pd.DataFrame(data)
data

[ ]: data["Date"] = pd.to_datetime(data["Date"], errors='coerce')
data["Date"] = data["Date"].dt.strftime("%Y-%m-%d")
data

[ ]: # Fill the null values of the Date column with a constant value
data["Date"] = data["Date"].fillna("2020-01-01")

[ ]: # Harmonize the country names
data['Country'] = data['Country'].replace({'America': 'United States', 'USA': 'United States'})

[ ]: data = pd.DataFrame(data=data)

[ ]: data = data.drop_duplicates(subset=["Name"])
data

11 Merging of data
[ ]: data1={"id":[1,2,3,4,5],"Name":["Ali","Abdullah","Ahmed","Sultan","Haider"],␣
↪"Age":[20,21,22,23,24]}

data1=pd.DataFrame(data1)
data1

[ ]: data2={"id":[1,2,3,4,6],"City":
↪["Lahore","Qasur","Karachi","Faislabad","Multan"], "Occupation":

↪["Engineer","Doctor","Teacher","Businessman","Lawyer"]}

data2=pd.DataFrame(data2)
data2

[ ]: # Merge the two dataframes based on the id


data3 = pd.merge(data1, data2, on='id', how='inner')
data3

12 Assignment
Please read the details of the following join types (a short sketch using data1 and data2 follows below):
1. Left join
2. Right join
3. Inner join
4. Outer join
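For reference, a minimal sketch of the four join types using the data1 and data2 frames defined above; only the how= argument changes:

[ ]: # left: keep all ids from data1; right: keep all ids from data2
# inner: keep only ids present in both; outer: keep ids from either frame
for how in ['left', 'right', 'inner', 'outer']:
    print(how)
    print(pd.merge(data1, data2, on='id', how=how), '\n')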

13 Concatenate different data sets


[ ]: print(data1)
print(data2)

[ ]: # Concatenate both dataframes
data4=pd.concat([data1, data2], axis=0)
data4

[ ]: data4=pd.concat([data1, data2], axis=1)


data4

14 Normalization and Non linear transformation of data


Two types of transformations are available: quantile transforms and power transforms. Both quantile and power transforms are based on monotonic transformations of the features and thus preserve the rank of the values along each feature.
In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are
a family of parametric, monotonic transformations that aim to map data from any distribution to
as close to a Gaussian distribution as possible in order to stabilize variance and minimize skewness.
PowerTransformer currently provides two such power transformations, the Yeo-Johnson transform
and the Box-Cox transform.
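For reference, the Box-Cox transform (defined only for strictly positive x) maps each value as

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(x), & \lambda = 0 \end{cases}$$

where the parameter $\lambda$ is estimated by maximum likelihood. The Yeo-Johnson transform is a variant that also handles zero and negative values.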

[ ]: # Generate non-normally distributed data (exponential distribution)


data = np.random.exponential(scale=2, size=1000)
data = pd.DataFrame(data, columns=['values'])
data

[ ]: data.shape

[ ]: sns.histplot(data['values'], kde=True)

[ ]: from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer

# Create the transformer objects
pt_box_cox = PowerTransformer(method='box-cox')           # requires strictly positive values
pt_yeo_johnson = PowerTransformer(method='yeo-johnson')   # also handles zero and negative values
qt_normal = QuantileTransformer(output_distribution="normal")  # maps any distribution to a normal one

[ ]: data.min()

[ ]: data["box_cox"] = pt_box_cox.fit_transform(data[["values"]])

[ ]: data["yeo_johnson"]=pt_yeo_johnson.fit_transform(data[["values"]])

[ ]: data["quantile"]=qt_normal.fit_transform(data[["values"]])

[ ]: data

[ ]: # Plot a histogram (with KDE) for each column of the dataframe
for col in data:
    sns.histplot(data[col], kde=True)
    plt.show()

15 Assignment: Read the complete details of the box_cox, yeo_johnson and quantile transforms

16 Normalization
Normalization is the process of scaling individual samples to have unit norm. This process can be
useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify
the similarity of any pair of samples.
L1 normalization is the process of making the sum of the absolute values of each row equal to 1
L2 normalization is the process of making the sum of the squares of each row equal to 1

[ ]: from sklearn.preprocessing import Normalizer


X = [[ 1, 1, 1],
[ 1, 1, 0],
[ 1, 0, 0]]
X_normalized = Normalizer(norm='l2')

X_normalized

[ ]: X_normalized.fit_transform(X)

[ ]: X_normalized = Normalizer(norm='l1')
X_normalized.fit_transform(X)
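A quick manual check of what the two norms do to these rows, using numpy (already imported above); this simply reproduces what Normalizer computes:

[ ]: # L2: divide each row by the square root of its sum of squares
# L1: divide each row by its sum of absolute values
X_arr = np.array(X, dtype=float)
print(X_arr / np.linalg.norm(X_arr, axis=1, keepdims=True))   # unit L2 norm per row
print(X_arr / np.abs(X_arr).sum(axis=1, keepdims=True))       # unit L1 norm per row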

17 Feature encoding
Feature encoding, in the context of machine learning and data preprocessing, refers to the process
of converting categorical or text-based features into numerical representations that can be used as
input for machine learning algorithms.
Label encoding
Ordinal encoding
One-hot encoding
Binary encoding

[ ]: tip = sns.load_dataset("tips")
tip.head()

[ ]: print(tip["time"].value_counts())

[ ]: from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

[ ]: le=LabelEncoder()
tip["encoded_time"] = le.fit_transform(tip["time"])

[ ]: print(tip["encoded_time"].value_counts())

[ ]: tip.head()

[ ]: print(tip["day"].value_counts())

[ ]: # Apply the OrdinalEncoder


oe = OrdinalEncoder(categories=[["Thur", "Fri", "Sat", "Sun"]])

[ ]: tip["encoded_day"] = oe.fit_transform(tip[["day"]])

[ ]: tip["encoded_day"].value_counts()

[ ]: tip["smoker"].value_counts()

[ ]: ohe=OneHotEncoder()

[ ]: ohe.fit_transform(tip[["smoker"]]).toarray()
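The encoded array above loses its column labels. In recent scikit-learn versions the encoder exposes get_feature_names_out, which can be used to rebuild a labelled dataframe (a sketch):

[ ]: # Wrap the one-hot array in a DataFrame with readable column names
encoded = ohe.fit_transform(tip[["smoker"]]).toarray()
pd.DataFrame(encoded, columns=ohe.get_feature_names_out(["smoker"]))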

[ ]: %pip install category_encoders

[ ]: from category_encoders import BinaryEncoder

[ ]: binary_encoder = BinaryEncoder()
binary_encoder=binary_encoder.fit_transform(tip[["day"]])
binary_encoder

18 Pandas get dummies


[ ]: dummies=pd.get_dummies(tip["day"])
dummies

[ ]: get_dummies=pd.get_dummies(tip, columns=["day"])
get_dummies

19 Data discretization
Data discretization is a process used in data preprocessing to transform continuous data into discrete
intervals or categories. This technique is particularly useful when dealing with numerical features
or variables that have a wide range of values and can simplify analysis, reduce complexity, and
improve the performance of certain machine learning algorithms.

[ ]: from sklearn.preprocessing import KBinsDiscretizer

[ ]: titanic=sns.load_dataset("titanic")

[ ]: # Impute missing values


titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['fare'] = titanic['fare'].fillna(titanic['fare'].median())

[ ]: # Age discretizer
age_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

[ ]: titanic['age_discretized'] = age_discretizer.fit_transform(titanic[['age']])

[ ]: titanic.head()

[ ]: sns.histplot(data=titanic, x=titanic["age"], hue=titanic["age_discretized"])

[ ]: age_dis = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')


titanic['age_dis'] = age_dis.fit_transform(titanic[['age']])
sns.histplot(data=titanic, x=titanic["age"], hue=titanic["age_dis"])

[ ]: age_dis = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')


titanic['age_dis'] = age_dis.fit_transform(titanic[['age']])
sns.histplot(data=titanic, x=titanic["age"], hue=titanic["age_dis"])

[ ]: # Pandas method for binning


titanic.head()

[ ]: titanic["age_bins"] = pd.cut(titanic["age"].values, bins=3, labels=[0,1,2])


titanic.head()

[ ]: sns.histplot(data=titanic, x=titanic["age"], hue=titanic["age_bins"])

20 Use of loc and iloc function

21 loc
loc is primarily label-based indexing, meaning that you use the row and column labels to access
data.
You specify the row label(s) and column label(s) inside the brackets to select specific rows and
columns.

[ ]: titanic.head()

[ ]: # use the loc function in titanic data


titanic.loc[titanic["age"]>=50]
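loc can also select explicit row labels and column labels at the same time. A small sketch on the same dataframe (note that loc slices include the end label):

[ ]: # Row labels 0 to 4 (inclusive) and the listed column labels
titanic.loc[0:4, ["age", "fare", "class"]]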

[ ]: # use iloc function in titanic data
titanic.iloc[0:5, 0:3]

22 Different terms in Machine learning


Algorithm: A set of rules and instructions given to an AI system to help it learn from data.
Example: a decision tree is an algorithm used in regression and classification tasks.
Training data: The data set used to train the ML model. It is labelled data for supervised
learning. Example: a set of images of cats and dogs, each labelled as cat or dog.
Testing data: Data used to evaluate the performance of the model after training. It is unseen by
the model during training.
Example: a new set of images, not included in the training data, used to check the accuracy of the
trained model.
Features: Individual measurable properties or characteristics of the phenomenon being observed,
used as input variables in the model. Example: in a house price prediction dataset, features
might include square footage, number of bedrooms and age of the house.
Model: In machine learning, a model refers to the specific representation learned from the data,
based on which predictions and decisions are made. Example: a neural network trained to identify
objects in images.
