Data Mining Lab 03
2 Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format. Before applying
machine learning or data mining algorithms, the quality of the data should be checked as follows.
Implementation in Python
import pandas as pd
import numpy as np
df = pd.read_csv("titanic_dataset.csv")

print(df.head())
Input/Output
Output of the programs is given below.
Note that the dataset contains many columns, such as PassengerId, Name, and Age. We will delete
the columns that are not needed: Name, Ticket, PassengerId, Cabin, and Embarked.
Implementation in Python
# Drop all unnecessary columns in a single call
df.drop(columns=["Name", "Ticket", "PassengerId", "Cabin", "Embarked"],
        inplace=True)
Implementation in Python
df.info()
Input/Output
Output of the programs is given below.
Note that there are null values in the column Age. A second way of checking whether the data has null
values is the "isnull()" function.
Implementation in Python
print(df.isnull().sum())
Input/Output
Output of the programs is given below.
Note that all the null values in the dataset are in the column "Age". Let's try handling these missing
values using the following techniques.
Implementation in Python
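# Work on a copy so that the original df keeps its null values
updated_df = df.copy()
# Replace null ages with the mean age
updated_df['Age'] = updated_df['Age'].fillna(updated_df['Age'].mean())
updated_df.info()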
Input/Output
Implementation in Python
updated_df = df.dropna(axis=1)
updated_df.info()
The problem with this method is that we may lose valuable information on that feature, as we have deleted it
completely due to some null values.
Implementation in Python
# Drop every row that contains a null value
updated_df = df.dropna(axis=0)

# Separate the target column from the features
y1 = updated_df['Survived']
updated_df = updated_df.drop("Survived", axis=1)

updated_df.info()
Input/Output
2.2.1 Discretization
Discretization splits continuous data into intervals, which reduces the data size. For example, rather than
specifying the exact class time, we can assign it to an interval such as 3 pm-5 pm or 6 pm-8 pm.
Implementation in Python
# Importing pandas and numpy libraries
import pandas as pd
import numpy as np

# Creating a dummy DataFrame of 14 ages
# ranging from 1-100
df = pd.DataFrame({'Age': [42, 15, 67, 55, 1, 29, 75, 89, 4,
                           10, 15, 38, 22, 77]})

# Printing DataFrame before sorting continuous values
# into categories
print(df)

# A column named 'Label' is created in the DataFrame,
# categorizing Age into 4 categories
# Baby/Toddler: (0,3], 0 is excluded & 3 is included
# Child: (3,17], 3 is excluded & 17 is included
# Adult: (17,63], 17 is excluded & 63 is included
# Elderly: (63,99], 63 is excluded & 99 is included
df['Label'] = pd.cut(x=df['Age'], bins=[0, 3, 17, 63, 99],
                     labels=['Baby/Toddler', 'Child', 'Adult',
                             'Elderly'])

# Printing DataFrame after sorting continuous values
# into categories
print(df)

# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())
Input/Output
• Nominal: Categories without any implied order. For example, blood groups (such as A+ve), gender,
fruit names etc.
• Ordinal: Categories with an implied ordering. For example, rank, education level, grade etc.
Datasets may also contain categorical values, which must be converted to numbers using an encoding such as
Label Encoding or One-Hot Encoding. To handle categorical data, we will learn the following 3 methods:
Label encoding
Label Encoding converts the labels into a numeric form so that they are machine-readable.
Machine learning algorithms can then operate on those labels directly.
Implementation in Python
# Import libraries
import numpy as np
import pandas as pd

# Import dataset
df = pd.read_csv('../../data/Iris.csv')

# Unique class labels before encoding
print(df['species'].unique())

# Import label encoder
from sklearn import preprocessing

# label_encoder object maps each word label to an integer.
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'species'.
df['species'] = label_encoder.fit_transform(df['species'])

# Unique class labels after encoding
print(df['species'].unique())
Input/Output
Ordinal encoding
In ordinal encoding, each unique category value is assigned an integer value. For example, “red” is 1, “green” is
2, and “blue” is 3.
Implementation in Python
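The following is a minimal sketch of ordinal encoding with pandas, using the color example above. The DataFrame, the column names 'color' and 'color_encoded', and the dictionary name color_order are assumptions made for illustration; they do not come from a dataset used elsewhere in this lab.

```python
import pandas as pd

# Hypothetical example data (assumed for illustration only)
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red']})

# Explicit category-to-integer mapping, matching the example in the text
color_order = {'red': 1, 'green': 2, 'blue': 3}

# Replace each category with its assigned integer
df['color_encoded'] = df['color'].map(color_order)

print(df)
```

Writing the mapping by hand keeps the integer order under our control, which matters when the categories have a real ranking (for example, education level).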
One-Hot Encoding
For categorical variables where no ordinal relationship exists, integer encoding may be insufficient at best
and misleading to the model at worst, because the integers imply an order that does not exist. In one-hot
encoding, the integer-encoded variable is removed and one new binary variable is added for each unique
value of the variable.
Implementation in Python
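The following is a minimal sketch of one-hot encoding using the pandas function get_dummies. The DataFrame and the column name 'color' are assumptions made for illustration; any categorical column can be passed in the same way.

```python
import pandas as pd

# Hypothetical example data (assumed for illustration only)
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Create one binary column per unique category;
# the original 'color' column is removed from the result
one_hot_df = pd.get_dummies(df, columns=['color'], dtype=int)

print(one_hot_df)
```

Each row of the result has exactly one 1 among the new columns, marking the category that row belongs to.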
Input/Output
2.2.2 Normalization
In machine learning, the values of some features can be many times larger than those of others, and features
with larger values tend to dominate the learning process. Data normalization, a common practice in machine
learning, addresses this by transforming numeric columns to a common scale. We'll learn about 2 normalizing
techniques given as follows:
Implementation in Python
import pandas as pd

# create data
df = pd.DataFrame([
    [180000, 110, 18.9, 1400],
    [360000, 905, 23.4, 1800],
    [230000, 230, 14.0, 1300],
    [60000, 450, 13.5, 1500]],

    columns=['Col A', 'Col B',
             'Col C', 'Col D'])
df_min_max_scaled = df.copy()

# apply min-max normalization: (x - min) / (max - min) for each column
for column in df_min_max_scaled.columns:
    col_min = df_min_max_scaled[column].min()
    col_max = df_min_max_scaled[column].max()
    df_min_max_scaled[column] = (df_min_max_scaled[column] - col_min) / (col_max - col_min)

# view normalized data
print(df_min_max_scaled)
Input/Output
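Another widely used scaling technique is z-score standardization, which rescales each column to have mean 0 and standard deviation 1. It is shown here as an assumption about what a second technique could look like, since only min-max scaling is implemented above. The sketch below reuses the same example data; the name df_z_scaled is chosen for illustration.

```python
import pandas as pd

# Same example data as in the min-max example
df = pd.DataFrame([
    [180000, 110, 18.9, 1400],
    [360000, 905, 23.4, 1800],
    [230000, 230, 14.0, 1300],
    [60000, 450, 13.5, 1500]],
    columns=['Col A', 'Col B', 'Col C', 'Col D'])

# Subtract each column's mean and divide by its standard deviation
# (pandas uses the sample standard deviation, ddof=1, by default)
df_z_scaled = (df - df.mean()) / df.std()

# view standardized data
print(df_z_scaled)
```

Unlike min-max scaling, the result is not confined to the range 0 to 1, but it is less affected by a single extreme value setting the scale.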
Lab Task (Please implement yourself and show the output to the instructor)
• Download and load a dataset. Then write a Python program to impute null values (if any) using the
average of the previous and next values.
• Using one-hot encoding, convert categorical values into numerical values.
5 Policy
Copying from the internet, classmates, seniors, or any other source is strictly prohibited. 100% of the marks
will be deducted if any such copying is detected.