
Lecture_material_10

March 16, 2024

1 Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to
learn without being explicitly programmed.
It uses data to detect patterns and adjusts program actions accordingly.
Libraries for machine learning
Scikit-learn
TensorFlow and Keras
PyTorch
Natural language toolkit (NLTK)
OpenCV

2 AI vs machine learning
Artificial general intelligence is a form of intelligence similar or equal to human intelligence: it would have language and be able to learn and make decisions.
Machine learning is a type of artificial intelligence where we create an algorithm that learns from given data.
We give the system inputs and outputs, and it learns a mapping between the two; once that mapping is learned, we can feed the system new inputs and it will produce outputs.
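As a minimal sketch of this idea (the numbers and their meanings here are invented purely for illustration):

[ ]: from sklearn.linear_model import LinearRegression

# Inputs (X) and known outputs (y): the algorithm learns the mapping between them
X = [[1], [2], [3], [4]]   # e.g. hours studied
y = [10, 20, 30, 40]       # e.g. marks obtained
model = LinearRegression().fit(X, y)

# Once the mapping is learned, a new input produces a predicted output
model.predict([[5]])       # approximately 50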

3 Types of machine learning


Supervised: both input and output (labels) are provided.
Unsupervised: only input is provided; there are no labelled outputs.
Reinforcement: the machine learns from feedback from its environment and decides on actions. It is typical for automated systems that have to take decisions without human interference. Example: should a self-driving car accelerate or slow down at a yellow traffic light?

4 Data cleaning in machine learning
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

[ ]: df=sns.load_dataset("titanic")
df.head()

[ ]: # Check missing values


df.isnull().sum().sort_values(ascending=False)

[ ]: # pip install distutils


#%pip install distutils
#pip install setuptools

5 Imputation of missing values


For various reasons, many real-world datasets contain missing values, often encoded as blanks,
NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators
which assume that all values in an array are numerical, and that all have and hold meaning.
A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing
missing values. However, this comes at the price of losing data which may be valuable (even though
incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known
part of the data.
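As a quick illustration of the first strategy on the titanic dataframe loaded above (shown only to compare shapes; the rest of the lecture uses imputation instead):

[ ]: # Strategy 1: discard rows or columns that contain missing values (loses data)
rows_dropped = df.dropna()        # drop every row containing any NaN
cols_dropped = df.dropna(axis=1)  # drop every column containing any NaN
print(df.shape, rows_dropped.shape, cols_dropped.shape)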

6 Univariate feature imputation


The SimpleImputer class provides basic strategies for imputing missing values.
Missing values can be imputed with a provided constant value, or using a statistic (mean, median
or most frequent) of each column in which the missing values are located.

[ ]: from sklearn.impute import SimpleImputer

[ ]: imputer=SimpleImputer(strategy='median')
df['age']=imputer.fit_transform(df[['age']])
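SimpleImputer is not limited to numeric columns. As a sketch, the titanic column 'embark_town' (which also contains missing values) can be filled with the most frequent category; the column choice here is only an example:

[ ]: # 'most_frequent' (or a 'constant' fill value) also works for categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
df['embark_town'] = cat_imputer.fit_transform(df[['embark_town']]).ravel()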

7 Imputation of multi-variate features


A more sophisticated approach is to use the IterativeImputer class, which models each feature
with missing values as a function of the other features and uses that estimate for imputation.
It does so in an iterated round-robin fashion: at each step, a feature column is designated as output
y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y.

Then, the regressor is used to predict the missing values of y. This is done for each feature in an
iterative fashion, and then is repeated for max_iter imputation rounds

[ ]: from sklearn.experimental import enable_iterative_imputer


from sklearn.impute import IterativeImputer

[ ]: imputer = IterativeImputer(max_iter=20, n_nearest_features=5, random_state=0)
# Fit on several numeric columns (an illustrative choice) so 'age' is modelled from the other features
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
df[num_cols] = imputer.fit_transform(df[num_cols])

8 Forward and backward fill


[ ]: df=sns.load_dataset("titanic")
df.head()

[ ]: df.isnull().sum().sort_values(ascending=False)

[ ]: # Replace the missing values with forward fill


df['age'] = df['age'].ffill()

[ ]: # Replace the missing values with backward fill


df['age'] = df['age'].bfill()

9 Using KNN Imputer


[ ]: from sklearn.impute import KNNImputer
# Create an imputer object with a KNN filling strategy
imputer = KNNImputer(n_neighbors=5)
# Fill the missing values using the KNN imputer
df['age'] = imputer.fit_transform(df[['age']])
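With a single column, the KNN imputer has no other features to measure distance on, so the neighbours are not very informative. A sketch of the more typical multi-column usage (the column selection is illustrative):

[ ]: # Neighbours are now found using several numeric features at once
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])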

10 Inconsistencies in data
[ ]: data={"Date":
↪["2020-01-01","01-02-2000","2020-03-01","20-2020-04","2020-01-05","2020-01-06","2020-01-07",

"Country":["China","USA","China","America","China","USA","China","United␣
↪States","China","USA"],

"Name":
↪["John","Alice","John","Alice","John","Alice","John","Alice","John","Alice"],

"Sales_2020":[100,200,300,400,500,600,700,800,900,1000],
"Sales_2021":[120,220,320,420,520,620,720,820,920,1020]}

[ ]: data=pd.DataFrame(data)
data

[ ]: data["Date"] = pd.to_datetime(data["Date"], errors='coerce')
data["Date"] = data["Date"].dt.strftime("%Y-%m-%d")
data

[ ]: # Fill the null values of the Date column with a constant value
data["Date"] = data["Date"].fillna("2020-01-01")

[ ]: # Harmonize the country names
data['Country'] = data['Country'].replace({'America': 'United States', 'USA': 'United States'})

[ ]: data = pd.DataFrame(data=data)

[ ]: data = data.drop_duplicates(subset=["Name"])
data

11 Merging of data
[ ]: data1={"id":[1,2,3,4,5],"Name":["Ali","Abdullah","Ahmed","Sultan","Haider"],␣
↪"Age":[20,21,22,23,24]}

data1=pd.DataFrame(data1)
data1

[ ]: data2={"id":[1,2,3,4,6],"City":
↪["Lahore","Qasur","Karachi","Faislabad","Multan"], "Occupation":

↪["Engineer","Doctor","Teacher","Businessman","Lawyer"]}

data2=pd.DataFrame(data2)
data2

[ ]: # Merge the two dataframes based on the id


data3 = pd.merge(data1, data2, on='id', how='inner')
data3

12 Assignment
Please read the details of the following join types (a short sketch using data1 and data2 follows below):
1. Left join
2. Right join
3. Inner join
4. Outer join
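For reference, a minimal sketch of the four join types using the data1 and data2 frames defined above; only the how= argument changes:

[ ]: # left: keep all ids from data1; right: keep all ids from data2
# inner: keep only ids present in both; outer: keep ids from either frame
for how in ['left', 'right', 'inner', 'outer']:
    print(how)
    print(pd.merge(data1, data2, on='id', how=how), '\n')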

13 Concatenate different data sets


[ ]: print(data1)
print(data2)

[ ]: # Concatenate both dataframes
data4=pd.concat([data1, data2], axis=0)
data4

[ ]: data4=pd.concat([data1, data2], axis=1)


data4

14 Normalization and Non linear transformation of data


Two types of transformations are available: quantile transforms and power transforms. Both quantile and power transforms are based on monotonic transformations of the features and thus preserve the rank of the values along each feature.
In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are
a family of parametric, monotonic transformations that aim to map data from any distribution to
as close to a Gaussian distribution as possible in order to stabilize variance and minimize skewness.
PowerTransformer currently provides two such power transformations, the Yeo-Johnson transform
and the Box-Cox transform.
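For reference, the Box-Cox transform (defined only for strictly positive x) maps each value as

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(x), & \lambda = 0 \end{cases}$$

where the parameter $\lambda$ is estimated by maximum likelihood. The Yeo-Johnson transform is a variant that also handles zero and negative values.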

[ ]: # Generate non-normally distributed data (exponential distribution)


data = np.random.exponential(scale=2, size=1000)
data = pd.DataFrame(data, columns=['values'])
data

[ ]: data.shape

[ ]: sns.histplot(data['values'], kde=True)

[ ]: from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer

# Create the transformer objects
pt_box_cox = PowerTransformer(method='box-cox')           # requires strictly positive values
pt_yeo_johnson = PowerTransformer(method='yeo-johnson')   # also handles zero and negative values
qt_normal = QuantileTransformer(output_distribution="normal")  # maps any distribution to a normal one

[ ]: data.min()

[ ]: data["box_cox"] = pt_box_cox.fit_transform(data[["values"]])

[ ]: data["yeo_johnson"]=pt_yeo_johnson.fit_transform(data[["values"]])

[ ]: data["quantile"]=qt_normal.fit_transform(data[["values"]])

[ ]: data

[ ]: # Plot a histogram (with KDE) for each column of the dataframe
for col in data:
    sns.histplot(data[col], kde=True)
    plt.show()

15 Assignment: Read the complete details of the box_cox, yeo_johnson and quantile transforms

16 Normalization
Normalization is the process of scaling individual samples to have unit norm. This process can be
useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify
the similarity of any pair of samples.
L1 normalization is the process of making the sum of the absolute values of each row equal to 1
L2 normalization is the process of making the sum of the squares of each row equal to 1

[ ]: from sklearn.preprocessing import Normalizer


X = [[ 1, 1, 1],
[ 1, 1, 0],
[ 1, 0, 0]]
X_normalized = Normalizer(norm='l2')

X_normalized

[ ]: X_normalized.fit_transform(X)

[ ]: X_normalized = Normalizer(norm='l1')
X_normalized.fit_transform(X)
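A quick manual check of what the two norms do to these rows, using numpy (already imported above); this simply reproduces what Normalizer computes:

[ ]: # L2: divide each row by the square root of its sum of squares
# L1: divide each row by its sum of absolute values
X_arr = np.array(X, dtype=float)
print(X_arr / np.linalg.norm(X_arr, axis=1, keepdims=True))   # unit L2 norm per row
print(X_arr / np.abs(X_arr).sum(axis=1, keepdims=True))       # unit L1 norm per row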

17 Feature encoding
Feature encoding, in the context of machine learning and data preprocessing, refers to the process
of converting categorical or text-based features into numerical representations that can be used as
input for machine learning algorithms.
Label encoding
Ordinal encoding
One-hot encoding
Binary encoding

[ ]: tip = sns.load_dataset("tips")
tip.head()

[ ]: print(tip["time"].value_counts())

[ ]: from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

[ ]: le=LabelEncoder()
tip["encoded_time"] = le.fit_transform(tip["time"])

[ ]: print(tip["encoded_time"].value_counts())

[ ]: tip.head()

[ ]: print(tip["day"].value_counts())

[ ]: # Apply the OrdinalEncoder


oe = OrdinalEncoder(categories=[["Thur", "Fri", "Sat", "Sun"]])

[ ]: tip["encoded_day"] = oe.fit_transform(tip[["day"]])

[ ]: tip["encoded_day"].value_counts()

[ ]: tip["smoker"].value_counts()

[ ]: ohe=OneHotEncoder()

[ ]: ohe.fit_transform(tip[["smoker"]]).toarray()
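The encoded array above loses its column labels. In recent scikit-learn versions the encoder exposes get_feature_names_out, which can be used to rebuild a labelled dataframe (a sketch):

[ ]: # Wrap the one-hot array in a DataFrame with readable column names
encoded = ohe.fit_transform(tip[["smoker"]]).toarray()
pd.DataFrame(encoded, columns=ohe.get_feature_names_out(["smoker"]))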

[ ]: %pip install category_encoders

[ ]: from category_encoders import BinaryEncoder

[ ]: binary_encoder = BinaryEncoder()
binary_encoder=binary_encoder.fit_transform(tip[["day"]])
binary_encoder

18 Pandas get dummies


[ ]: dummies=pd.get_dummies(tip["day"])
dummies

[ ]: get_dummies=pd.get_dummies(tip, columns=["day"])
get_dummies

19 Data discretization
Data discretization is a process used in data preprocessing to transform continuous data into discrete
intervals or categories. This technique is particularly useful when dealing with numerical features
or variables that have a wide range of values and can simplify analysis, reduce complexity, and
improve the performance of certain machine learning algorithms.

[ ]: from sklearn.preprocessing import KBinsDiscretizer

[ ]: titanic=sns.load_dataset("titanic")

[ ]: # Impute missing values


titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['fare'] = titanic['fare'].fillna(titanic['fare'].median())

[ ]: # Age discretizer
age_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

[ ]: titanic['age_discretized'] = age_discretizer.fit_transform(titanic[['age']])

[ ]: titanic.head()

[ ]: sns.histplot(data=titanic, x=titanic["age"], hue=titanic["age_discretized"])

[ ]: age_dis = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')


titanic['age_dis'] = age_dis.fit_transform(titanic[['age']])
sns.histplot(data=titanic, x=titanic["age"], hue=titanic["age_dis"])

[ ]: age_dis = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')


titanic['age_dis'] = age_dis.fit_transform(titanic[['age']])
sns.histplot(data=titanic, x=titanic["age"], hue=titanic["age_dis"])

[ ]: # Pandas method for binning


titanic.head()

[ ]: titanic["age_bins"] = pd.cut(titanic["age"].values, bins=3, labels=[0,1,2])


titanic.head()

[ ]: sns.histplot(data=titanic, x=titanic["age"], hue=titanic["age_bins"])

20 Use of loc and iloc function

21 loc
loc is primarily label-based indexing, meaning that you use the row and column labels to access
data.
You specify the row label(s) and column label(s) inside the brackets to select specific rows and
columns.

[ ]: titanic.head()

[ ]: # use the loc function in titanic data


titanic.loc[titanic["age"]>=50]
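loc can also select explicit row labels and column labels at the same time. A small sketch on the same dataframe (note that loc slices include the end label):

[ ]: # Row labels 0 to 4 (inclusive) and the listed column labels
titanic.loc[0:4, ["age", "fare", "class"]]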

[ ]: # use iloc function in titanic data
titanic.iloc[0:5, 0:3]

22 Different terms in Machine learning


Algorithm: A set of rules and instructions given to an AI system to help it learn from data.
Example: a decision tree is an algorithm used in regression and classification tasks.
Training data: The data set used to train the ML model. It is labelled data for supervised
learning. Example: a set of images of cats and dogs, each labelled as cat or dog.
Testing data: Data used to evaluate the performance of the model after training. It is unseen by
the model during training.
Example: a new set of images, not included in the training data, used to check the accuracy of the
trained model.
Features: Individual measurable properties or characteristics of the phenomenon being observed,
used as input variables in the model. Example: in a house price prediction dataset, features
might include square footage, number of bedrooms and age of the house.
Model: In machine learning, a model refers to the specific representation learned from the data,
based on which predictions and decisions are made. Example: a neural network trained to identify
objects in images.
