
Exp 6

Experiment-6

Write a program to implement Categorical Encoding, One-hot Encoding:

Categorical Encoding:

What is categorical data?

Since this experiment works with categorical variables, here is a quick refresher with a few examples. Categorical variables are usually represented as 'strings' or 'categories' and take a finite number of values. Here are a few examples:

1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.

2. The department a person works in: Finance, Human resources, IT, Production.

3. The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.

4. The grades of a student: A+, A, B+, B, B- etc.

In the above examples, each variable can take only a fixed set of possible values. Further, there are two kinds of categorical data:

• Ordinal data: the categories have an inherent order.

• Nominal data: the categories do not have an inherent order.

Whatever the encoding method, the aim is the same: to replace each instance of a categorical variable with a fixed-length numeric vector. Before moving on to the next section, here are further examples of the two types of categorical variables:

1. Nominal → Athens, Cairo, Paris, Tokyo, New Delhi, etc.

2. Ordinal → High School Diploma, BS, MS, PhD

Label Encoding or Ordinal Encoding

Use this encoding technique when the categorical feature is ordinal. In that case the order carries information, so the encoding should preserve the sequence.

In label encoding, each label is converted into an integer value. As an example, take a variable that holds the categories representing a person's education qualification.
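The education-qualification case can be sketched as follows. This is a minimal illustration, not the handout's own code: the sample rows and the names `df`, `degree_order` and `Degree_encoded` are assumptions, and the order of the degrees is taken from the list of examples earlier in this experiment. It uses scikit-learn's `OrdinalEncoder` with an explicit category order so that the integers follow the ranking of the degrees.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data: highest degree held by five people.
df = pd.DataFrame(
    {"Degree": ["Masters", "High school", "PhD", "Diploma", "Bachelors"]}
)

# Explicit order from lowest to highest qualification (assumed ranking).
degree_order = ["High school", "Diploma", "Bachelors", "Masters", "PhD"]

# OrdinalEncoder assigns 0, 1, 2, ... following the order given in `categories`.
encoder = OrdinalEncoder(categories=[degree_order])
df["Degree_encoded"] = encoder.fit_transform(df[["Degree"]]).ravel().astype(int)

print(df)
```

Passing the order explicitly matters because, without it, the categories would be sorted alphabetically and the integers would no longer reflect the ranking.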


Label encoding converts labels into numeric form so that they are machine-readable. Machine learning algorithms, which operate on numbers, can then work with those labels. It is an important pre-processing step for structured datasets in supervised learning.
• Example:
Suppose we have a column Height in some dataset, with the values tall, medium and short.

After applying label encoding, the values in the Height column are replaced by integers, where 0 is the label for tall, 1 is the label for medium, and 2 is the label for short.
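The Height example can be reproduced with an explicit mapping. This is only a sketch: the sample values and the names `heights` and `height_labels` are assumptions, since the original table is not available here. The mapping itself (tall → 0, medium → 1, short → 2) is the one stated in the text.

```python
import pandas as pd

# Hypothetical Height column; the original table is not reproduced here.
heights = pd.Series(["tall", "medium", "short", "tall", "medium"], name="Height")

# Mapping stated in the text: tall -> 0, medium -> 1, short -> 2.
height_labels = {"tall": 0, "medium": 1, "short": 2}

# Series.map replaces each string with the integer it is mapped to.
encoded = heights.map(height_labels)
print(encoded.tolist())
```

An explicit dictionary keeps the choice of integers under your control, which is useful when a particular assignment such as the one above is required.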

# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('/content/Iris.csv')

# Inspect the data and the distinct species names
print(data_set)
print(data_set['Species'].unique())

# Import the label encoder
from sklearn import preprocessing

# The label_encoder object maps each distinct string label to an integer
label_encoder = preprocessing.LabelEncoder()

# Encode the labels in the column 'Species'
data_set['Species'] = label_encoder.fit_transform(data_set['Species'])

# After encoding, the column contains integer codes instead of names
print(data_set['Species'].unique())
print(data_set)

Here the three species names are converted into the labels 0, 1 and 2.
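To see which integer belongs to which species, the fitted encoder can be inspected. The sketch below is self-contained and does not read the CSV; the three class names are an assumption based on the commonly distributed Iris file. `LabelEncoder` stores the sorted labels in `classes_`, and `inverse_transform` converts codes back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Small sample of Species values (class names are assumed, not read from the file).
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(species)

# classes_ holds the labels in sorted order; each label's position is its code.
mapping = {label: code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original strings.
decoded = list(label_encoder.inverse_transform(codes))
print(decoded)
```

Because the codes follow alphabetical order of the labels, they carry no meaning beyond identifying the class.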


One-hot Encoding:

One-hot encoding is a technique for representing categorical variables as numerical values in a machine learning model. Its advantages include:
1. It allows the use of categorical variables in models that require numerical input.
2. It can improve model performance, because each category gets its own column and the model can learn a separate effect for each one.
3. It can help to avoid the problem of ordinality, where integer codes make a model treat the categories as ordered and evenly spaced (e.g. coding “small”, “medium”, “large” as 0, 1, 2).

In this technique, each category gets its own column. For a variable with the labels Male and Female, for example, there are separate Male and Female columns: wherever the label is Male, the value is 1 in the Male column and 0 in the Female column, and vice versa. Let's understand this with an example. Consider the following data, which lists fruits and their prices.

Fruit     Price
apple     5
mango     10
apple     15
orange    20

The output after one-hot encoding of the data is as follows:

apple   mango   orange   price
1       0       0        5
0       1       0        10
1       0       0        15
0       0       1        20
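
The fruit table above can be reproduced with pandas. This is a minimal sketch that uses `pandas.get_dummies` rather than the scikit-learn encoder used later in this experiment; the variable names `df`, `dummies` and `encoded` are assumptions for illustration.

```python
import pandas as pd

# The fruit/price data from the table above.
df = pd.DataFrame(
    {"Fruit": ["apple", "mango", "apple", "orange"], "Price": [5, 10, 15, 20]}
)

# One column per distinct fruit; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df["Fruit"], dtype=int)

# Put the price back next to the indicator columns.
encoded = pd.concat([dummies, df["Price"]], axis=1)
print(encoded)
```

Each row has exactly one 1 among the fruit columns, which marks the category that row belonged to.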
Step 1: Data pre-processing
The very first step is data pre-processing, which we have already discussed in this tutorial. This process contains the steps below:

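# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('50_CompList.csv')

# Inspect the data, count missing values and summarize the numeric columns
print(data_set)
print(data_set.isna().sum())
print(data_set.describe())

# Extracting the independent variables (all columns except the last)
# and the dependent variable (the column at index 4)
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 4].values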

Encoding Dummy Variables:

The dataset has one categorical variable (State), which cannot be passed to the model directly, so we encode it. The LabelEncoder class could convert the categories into integers, but that is not sufficient: the integer codes still imply a relational order between the states, which may produce a wrong model. To remove this problem, we use OneHotEncoder, which creates the dummy variables. The code is given below:

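from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encode the column at index 3 (State); pass the other columns through unchanged.
# "Country" is only the name given to this transformer step.
ct = ColumnTransformer([("Country", OneHotEncoder(), [3])], remainder='passthrough')
x = ct.fit_transform(x)

# Avoiding the dummy variable trap: drop the first dummy column
x = x[:, 1:]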
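
The encoding step above drops the first dummy column by slicing. As an alternative sketch, the encoder itself can drop one category per feature through its `drop="first"` option. The tiny DataFrame, its column names and its values below are assumptions for illustration only and are not taken from 50_CompList.csv.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "Spend": [100.0, 200.0, 300.0, 400.0],
        "State": ["New York", "California", "Florida", "New York"],
    }
)

# drop="first" removes the first (alphabetically smallest) category,
# so three states produce two dummy columns.
# sparse_threshold=0 makes the ColumnTransformer always return a dense array.
ct = ColumnTransformer(
    [("state", OneHotEncoder(drop="first"), ["State"])],
    remainder="passthrough",
    sparse_threshold=0,
)
x_demo = ct.fit_transform(df)

# Columns are: Florida, New York, Spend (California is the dropped baseline).
print(x_demo)
```

A row of all zeros in the dummy columns then stands for the dropped category, so no information is lost while the redundant column is avoided.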

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Fitting the MLR (multiple linear regression) model to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(x_test)

print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))
print(" Actual output \n {}".format(y_test), "\n predict outputs:\n {}".format(y_pred))

# Calculation of Mean Squared Error (MSE)
from sklearn.metrics import mean_squared_error
print("Mean Squared Error (MSE):\n")
print(mean_squared_error(y_test, y_pred))

# Calculation of the r2 score
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print('r2 score of the model (in percent) is', r2*100)
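
To make the two metrics concrete, the sketch below computes them by hand on a few made-up numbers and compares the results with scikit-learn. The arrays `y_true` and `y_hat` are hypothetical values, not results from this experiment. MSE is the mean of the squared errors, and the r2 score is 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values (not from the experiment's dataset).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 10.0])

# MSE: average of the squared differences between actual and predicted values.
mse_manual = np.mean((y_true - y_hat) ** 2)

# r2: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual, "sklearn:", mean_squared_error(y_true, y_hat))
print("r2: ", r2_manual, "sklearn:", r2_score(y_true, y_hat))
```

An r2 score of 1 corresponds to predictions that match the actual values exactly, and lower values mean that more of the variation in the target is left unexplained.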
