Lecture Material 10
1 Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to
learn without being explicitly programmed.
A machine learning system uses data to detect patterns and adjusts its actions accordingly.
Libraries for machine learning
Scikit-learn
TensorFlow and Keras
PyTorch
Natural language toolkit (NLTK)
OpenCV
2 AI vs machine learning
Artificial general intelligence is a form of intelligence similar or equal to human intelligence: it would have language and be able to learn and make decisions.
Machine learning is a type of artificial intelligence in which we create an algorithm that learns from given data.
We give the system inputs and the corresponding outputs, and it effectively writes the code that links the two; once that code is written, we can feed the system new inputs and it will produce outputs.
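This input/output idea can be sketched with scikit-learn; the toy data below (a simple y = 2x pattern) is made up for illustration:

```python
from sklearn.linear_model import LinearRegression

# Toy inputs (X) and outputs (y): the hidden pattern is y = 2x
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# fit() "writes the code" linking input to output
model = LinearRegression()
model.fit(X, y)

# Feed a new input; the learned mapping produces the output
print(model.predict([[5]]))  # ≈ [10.]
```

The same fit-then-predict pattern applies to every estimator used later in these notes.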
4 Data cleaning in machine learning
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
[ ]: df=sns.load_dataset("titanic")
df.head()
[ ]: from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
df['age'] = imputer.fit_transform(df[['age']])
Then, a regressor is used to predict the missing values of y. This is done for each feature in an
iterative fashion, and the whole process is repeated for max_iter imputation rounds.
[ ]: from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=20, n_nearest_features=5, random_state=0)
df['age'] = imputer.fit_transform(df[['age']])
[ ]: df.isnull().sum().sort_values(ascending=False)
10 Inconsistencies in data
[ ]: data={"Date":
↪["2020-01-01","01-02-2000","2020-03-01","20-2020-04","2020-01-05","2020-01-06","2020-01-07",
"Country":["China","USA","China","America","China","USA","China","United␣
↪States","China","USA"],
"Name":
↪["John","Alice","John","Alice","John","Alice","John","Alice","John","Alice"],
"Sales_2020":[100,200,300,400,500,600,700,800,900,1000],
"Sales_2021":[120,220,320,420,520,620,720,820,920,1020]}
[ ]: data=pd.DataFrame(data)
data
[ ]: data["Date"] = pd.to_datetime(data["Date"], errors='coerce')
data["Date"] = data["Date"].dt.strftime("%Y-%m-%d")
data
[ ]: data = pd.DataFrame(data=data)
[ ]: data = data.drop_duplicates(subset=["Name"])
data
11 Merging of data
[ ]: data1 = {"id": [1, 2, 3, 4, 5],
      "Name": ["Ali", "Abdullah", "Ahmed", "Sultan", "Haider"],
      "Age": [20, 21, 22, 23, 24]}
data1 = pd.DataFrame(data1)
data1
[ ]: data2 = {"id": [1, 2, 3, 4, 6],
      "City": ["Lahore", "Qasur", "Karachi", "Faislabad", "Multan"],
      "Occupation": ["Engineer", "Doctor", "Teacher", "Businessman", "Lawyer"]}
data2 = pd.DataFrame(data2)
data2
12 Assignment
Please read the details of:
1. Left join
2. Right join
3. Inner join
4. Outer join
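The four join types can be sketched with pd.merge on the data1/data2 frames defined above; note how the unmatched ids (5 and 6) behave in each case:

```python
import pandas as pd

data1 = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                      "Name": ["Ali", "Abdullah", "Ahmed", "Sultan", "Haider"],
                      "Age": [20, 21, 22, 23, 24]})
data2 = pd.DataFrame({"id": [1, 2, 3, 4, 6],
                      "City": ["Lahore", "Qasur", "Karachi", "Faislabad", "Multan"],
                      "Occupation": ["Engineer", "Doctor", "Teacher", "Businessman", "Lawyer"]})

# Inner join: only ids present in both frames (1-4)
inner = pd.merge(data1, data2, on="id", how="inner")
# Left join: all ids from data1; id 5 gets NaN for City/Occupation
left = pd.merge(data1, data2, on="id", how="left")
# Right join: all ids from data2; id 6 gets NaN for Name/Age
right = pd.merge(data1, data2, on="id", how="right")
# Outer join: the union of ids (1-6), with NaN wherever a side is missing
outer = pd.merge(data1, data2, on="id", how="outer")

print(len(inner), len(left), len(right), len(outer))  # 4 5 5 6
```

Inner keeps the intersection of keys, outer the union; left and right keep one side in full.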
[ ]: # Concatenate both dataframes
data4=pd.concat([data1, data2], axis=0)
data4
[ ]: data.shape
[ ]: # 'values' is assumed to be a skewed numeric column created in an omitted cell, e.g.:
data = pd.DataFrame({"values": np.random.exponential(size=1000)})
sns.histplot(data['values'], kde=True)
[ ]: data.min()
[ ]: data["box_cox"] = pt_box_cox.fit_transform(data[["values"]])
[ ]: data["yeo_johnson"]=pt_yeo_johnson.fit_transform(data[["values"]])
[ ]: data["quantile"]=qt_normal.fit_transform(data[["values"]])
[ ]: data
[ ]: # Plot a histogram for every column in the dataframe
for col in data:
    sns.histplot(data[col], kde=True)
    plt.show()
16 Normalization
Normalization is the process of scaling individual samples to have unit norm. This process can be
useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify
the similarity of any pair of samples.
L1 normalization scales each row so that the sum of the absolute values of its entries equals 1.
L2 normalization scales each row so that the sum of the squares of its entries equals 1.
[ ]: from sklearn.preprocessing import Normalizer

# X is assumed to be a numeric feature matrix defined in an earlier cell
X_normalized = Normalizer(norm='l2')  # L2 is the default norm
X_normalized.fit_transform(X)
[ ]: X_normalized = Normalizer(norm='l1')
X_normalized.fit_transform(X)
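A quick check of the unit-norm property, on a small made-up matrix:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Small example matrix (assumed data, for illustration only)
X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

X_l2 = Normalizer(norm='l2').fit_transform(X)
X_l1 = Normalizer(norm='l1').fit_transform(X)

# After L2 normalization, each row's squares sum to 1
print((X_l2 ** 2).sum(axis=1))   # [1. 1.]
# After L1 normalization, each row's absolute values sum to 1
print(np.abs(X_l1).sum(axis=1))  # [1. 1.]
```

For example, the row [3, 4] has L2 norm 5, so it becomes [0.6, 0.8].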
17 Feature encoding
Feature encoding, in the context of machine learning and data preprocessing, refers to the process
of converting categorical or text-based features into numerical representations that can be used as
input for machine learning algorithms.
Label encoding
Ordinal encoding
One-hot encoding
Binary encoding
[ ]: tip = sns.load_dataset("tips")
tip.head()
[ ]: print(tip["time"].value_counts())
[ ]: from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
tip["encoded_time"] = le.fit_transform(tip["time"])
[ ]: print(tip["encoded_time"].value_counts())
[ ]: tip.head()
[ ]: print(tip["day"].value_counts())
[ ]: tip["encoded_day"] = oe.fit_transform(tip[["day"]])
[ ]: tip["encoded_day"].value_counts()
[ ]: tip["smoker"].value_counts()
[ ]: from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
[ ]: ohe.fit_transform(tip[["smoker"]]).toarray()
[ ]: from category_encoders import BinaryEncoder

binary_encoder = BinaryEncoder()
encoded_day = binary_encoder.fit_transform(tip[["day"]])
encoded_day
[ ]: get_dummies=pd.get_dummies(tip, columns=["day"])
get_dummies
19 Data discretization
Data discretization is a process used in data preprocessing to transform continuous data into discrete
intervals or categories. This technique is particularly useful when dealing with numerical features
or variables that have a wide range of values and can simplify analysis, reduce complexity, and
improve the performance of certain machine learning algorithms.
[ ]: titanic=sns.load_dataset("titanic")
[ ]: from sklearn.preprocessing import KBinsDiscretizer

# age discretizer: 5 equal-width bins, ordinal-encoded
age_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
[ ]: # KBinsDiscretizer cannot handle NaN, so fill missing ages first
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['age_discretized'] = age_discretizer.fit_transform(titanic[['age']])
[ ]: titanic.head()
21 loc
loc is primarily label-based indexing, meaning that you use the row and column labels to access
data.
You specify the row label(s) and column label(s) inside the brackets to select specific rows and
columns.
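A minimal sketch of loc versus iloc, on a small made-up frame with string index labels (titanic-style column names, assumed data):

```python
import pandas as pd

df = pd.DataFrame({"survived": [0, 1, 1],
                   "pclass": [3, 1, 3],
                   "age": [22.0, 38.0, 26.0]},
                  index=["a", "b", "c"])

# loc is label-based: row and column labels, with the end label INCLUDED
subset_loc = df.loc["a":"b", ["pclass", "age"]]
print(subset_loc.shape)  # (2, 2)

# iloc is position-based and end-exclusive: the same selection by position
subset_iloc = df.iloc[0:2, 1:3]
print(subset_iloc.shape)  # (2, 2)
```

Note the slicing difference: `loc["a":"b"]` includes row "b", while `iloc[0:2]` stops before position 2.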
[ ]: titanic.head()
[ ]: # iloc, by contrast, is position-based (end-exclusive)
titanic.iloc[0:5, 0:3]