Exp 6
Categorical Encoding:
Since we are going to work with categorical variables in this experiment, here is a quick refresher on
them with a few examples. Categorical variables are usually represented as 'strings' or 'categories':
1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human resources, IT, Production.
3. The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.
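In the above examples, the variables can take only a fixed set of possible values.
Regardless of the encoding method, all encoders aim to replace instances of a categorical variable with a
fixed-length vector. Before moving on to the next section, it is important to know that there are two types of
categorical variables:
1. Ordinal: the categories have an inherent order, such as the highest degree a person has.
2. Nominal: the categories have no inherent order, such as the city where a person lives or the department
a person works in.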
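As a rough illustration of the difference between an ordinal variable (order matters) and a nominal one (no order), pandas can store both as categorical columns. This is only a sketch: the sample rows and the names `df` and `degree_order` are made up for illustration and are not part of the experiment's dataset.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Bangalore"],
    "degree": ["Masters", "High school", "PhD", "Diploma"],
})

# Nominal variable: categories without any order
df["city"] = pd.Categorical(df["city"])

# Ordinal variable: categories listed from lowest to highest, with ordered=True
degree_order = ["High school", "Diploma", "Bachelors", "Masters", "PhD"]
df["degree"] = pd.Categorical(df["degree"], categories=degree_order, ordered=True)

# The integer codes follow the declared category order
degree_codes = df["degree"].cat.codes.tolist()
city_is_ordered = df["city"].cat.ordered
degree_is_ordered = df["degree"].cat.ordered

print(degree_codes)
print(city_is_ordered, degree_is_ordered)
```

Because the degree column is declared as ordered, comparisons such as `df["degree"] > "Diploma"` are meaningful, while the same comparison on the city column is not.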
Label Encoding:
We use label encoding when the categorical feature is ordinal. In this case, retaining the order is
important, so the encoding should reflect the sequence.
In label encoding, each label is converted into an integer value. For example, a height feature with the
values tall, medium, and short can be encoded so that 0 is the label for tall, 1 is the label for medium,
and 2 is the label for short height.
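The height example can be sketched with an explicit mapping in pandas. The sample rows and the names `heights` and `height_mapping` are assumptions made for illustration; only the 0/1/2 assignment comes from the text.

```python
import pandas as pd

# Hypothetical height data, made up for illustration
heights = pd.DataFrame({"Height": ["tall", "medium", "short", "tall", "short"]})

# Explicit mapping taken from the text: 0 = tall, 1 = medium, 2 = short
height_mapping = {"tall": 0, "medium": 1, "short": 2}

# Replace every label with its integer code
heights["Height_encoded"] = heights["Height"].map(height_mapping)
encoded = heights["Height_encoded"].tolist()

print(heights)
```

An explicit mapping is used here because scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) label order, which would give medium = 0, short = 1, tall = 2 rather than the order described above.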
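# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# importing the dataset
data_set = pd.read_csv('/content/Iris.csv')
data_set

# distinct species labels before encoding
data_set['Species'].unique()

# label-encode the Species column
# (assumed step: the original listing does not show the encoding call)
label_encoder = LabelEncoder()
data_set['Species'] = label_encoder.fit_transform(data_set['Species'])

# distinct species labels after encoding
data_set['Species'].unique()
data_set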
One Hot Encoding:
One hot encoding is a technique used to represent categorical variables as numerical values in a
machine learning model. The advantages of using one hot encoding include:
1. It allows the use of categorical variables in models that require numerical input.
2. It can improve model performance by giving the model a separate indicator for each category instead
of a single arbitrary number.
3. It avoids the problem of ordinality: when categories are encoded as integers, a model may read the
numbers as a ranking (e.g. 0 < 1 < 2) even when the categories have no natural ordering. For a variable
that does have a natural ordering (e.g. “small”, “medium”, “large”), an order-preserving encoding such as
label encoding may be more suitable.
In this technique, each category of the variable gets its own column. For a variable with the labels Male
and Female, for example, separate Male and Female columns are created: wherever the label is Male, the
value is 1 in the Male column and 0 in the Female column, and vice versa. Let's understand with an
example: consider the data below, where fruits and their prices are given.
Fruit    Price
apple    5
mango    10
apple    15
orange   20

After one hot encoding, the data becomes:

apple  mango  orange  Price
1      0      0       5
0      1      0       10
1      0      0       15
0      0      1       20
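The fruit example above can be reproduced with `pandas.get_dummies`. This is a minimal sketch, not necessarily the method used later in the experiment; the variable names `fruits`, `dummies`, and `encoded` are chosen here for illustration.

```python
import pandas as pd

# The fruit data from the example above
fruits = pd.DataFrame({
    "Fruit": ["apple", "mango", "apple", "orange"],
    "Price": [5, 10, 15, 20],
})

# One indicator column per fruit; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(fruits["Fruit"], dtype=int)

# Place the indicator columns before the price, as in the example table
encoded = pd.concat([dummies, fruits["Price"]], axis=1)

print(encoded)
```

Each row has exactly one 1 among the fruit columns, so no order between apple, mango, and orange is implied.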
Step-1: Data Pre-processing Step:
The first step is data pre-processing, which we have already discussed in this tutorial. This process
consists of the following steps:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('50_CompList.csv')
data_set

# count the missing values in each column
data_set.isna().sum()

# summary statistics of the numerical columns
data_set.describe()
# Calculation r2_score