
Computer Laboratory-I Class: BE (AI & DS)

Assignment No:1-A

Title: To use PCA Algorithm for dimensionality reduction.


Problem Statement:
You have a dataset that includes measurements for different variables on wine (alcohol, ash, magnesium, and so on). Apply the PCA algorithm to transform this data so that most of the variation in the measurements of the variables is captured by a small number of principal components, making it easier to distinguish between red and white wine by inspecting these principal components.

Dataset Link: https://media.geeksforgeeks.org/wp-content/uploads/Wine.csv

Objectives:

• To make use of the PCA algorithm

• To transform the data into a reduced form

Theory:

Principal Component Analysis (PCA) is an unsupervised learning algorithm used for dimensionality reduction in machine learning. It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components. It is one of the popular tools used for exploratory data analysis and predictive modeling. It extracts strong patterns from the given dataset by reducing the number of dimensions while retaining as much of the variance as possible. The PCA algorithm is based on mathematical concepts such as:

• Variance and Covariance

• Eigenvalues and Eigenvectors

Some common terms used in the PCA algorithm:

• Dimensionality: It is the number of features or variables present in the given dataset; more simply, it is the number of columns in the dataset.

• Correlation: It signifies how strongly two variables are related to each other, i.e., if one changes, the other also changes. The correlation value ranges from -1 to +1, where -1 indicates that the variables are inversely proportional and +1 indicates that they are directly proportional.


• Orthogonal: It means that the variables are uncorrelated with each other, and hence the correlation between each pair of variables is zero.

• Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v, i.e., Mv = λv, where the scalar λ is the corresponding eigenvalue. A brief numeric check is shown after this list.
• Covariance Matrix: A matrix containing the covariances between every pair of variables is called the covariance matrix.
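
As a brief numeric check of the eigenvector definition (a minimal sketch added for illustration; the matrix values are made up):

import numpy as np

M = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eig_vals, eig_vecs = np.linalg.eig(M)
v = eig_vecs[:, 0]                           # first eigenvector
print(np.allclose(M @ v, eig_vals[0] * v))   # True: M v equals lambda v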

Steps for PCA algorithm


1. Getting the dataset
Firstly, we take the input dataset and divide it into two subparts X and Y, where X is the training set and Y is the validation set.
2. Representing the data in a structure

Now we represent our dataset as a structure, i.e., a two-dimensional matrix of the independent variable X. Here each row corresponds to a data item and each column corresponds to a feature. The number of columns is the dimensionality of the dataset.
3. Standardizing the data

In this step, we standardize the dataset. Within a particular column, features with high variance would otherwise appear more important than features with lower variance. If the importance of a feature should be independent of its variance, we divide each data item in a column by the standard deviation of that column. The resulting matrix is named Z.
4. Calculating the Covariance of Z

To calculate the covariance of Z, we take the matrix Z, transpose it, and multiply the transpose by Z (dividing by n - 1 gives the sample covariance, since Z is already centred). The output matrix is the covariance matrix of Z.
5. Calculating the Eigenvalues and Eigenvectors

Now we calculate the eigenvalues and eigenvectors of the resultant covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes with the most information (variance), and the corresponding eigenvalues give the amount of variance captured along each of those directions.
6. Sorting the Eigenvectors

In this step, we take all the eigenvalues and sort them in decreasing order, i.e., from largest to smallest, and simultaneously sort the eigenvectors accordingly in a matrix P. The resultant sorted matrix is named P*.
7. Calculating the new features, or Principal Components

Here we calculate the new features. To do this, we multiply the standardized matrix Z by P*. In the resultant matrix Z*, each observation is a linear combination of the original features, and the columns of Z* are uncorrelated with each other.


8. Remove less important features from the new dataset

Once the new feature set is obtained, we decide what to keep and what to remove: only the relevant or important components (those with the largest eigenvalues) are kept in the new dataset, and the unimportant ones are removed. A minimal NumPy sketch of these steps is given below.
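
The following is a minimal illustration of steps 3-8 in NumPy (not part of the original notebook; the toy data values and variable names are assumptions):

import numpy as np

# Toy data matrix X: rows are observations, columns are features (made-up values)
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 3: standardize each column
cov = (Z.T @ Z) / (Z.shape[0] - 1)         # Step 4: covariance matrix of Z
eig_vals, eig_vecs = np.linalg.eigh(cov)   # Step 5: eigenvalues and eigenvectors
order = np.argsort(eig_vals)[::-1]         # Step 6: sort by decreasing eigenvalue
P_star = eig_vecs[:, order]
Z_star = Z @ P_star                        # Step 7: project the data onto the components
Z_reduced = Z_star[:, :1]                  # Step 8: keep only the most important component
print(Z_reduced)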

Sample Code

In[1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
#
# For example, running this (by clicking run or pressing Shift+Enter) will
# list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that
# gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be
# saved outside of the current session

/kaggle/input/wineuci/Wine.csv

In [2]:
#------------------Import_libraries------------------
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [3]:
df = pd.read_csv("/kaggle/input/wineuci/Wine.csv")

In [4]:
#--------------print_sample_of_dataset------------------
df.head()


Out[4]:
       1  14.23  1.71  2.43  15.6  127   2.8  3.06   .28  2.29  5.64  1.04  3.92  1065
0      1  13.20  1.78  2.14  11.2  100  2.65  2.76  0.26  1.28  4.38  1.05  3.40  1050
1      1  13.16  2.36  2.67  18.6  101  2.80  3.24  0.30  2.81  5.68  1.03  3.17  1185
2      1  14.37  1.95  2.50  16.8  113  3.85  3.49  0.24  2.18  7.80  0.86  3.45  1480
3      1  13.24  2.59  2.87  21.0  118  2.80  2.69  0.39  1.82  4.32  1.04  2.93   735
4      1  14.20  1.76  2.45  15.2  112  3.27  3.39  0.34  1.97  6.75  1.05  2.85  1450
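
The numeric column names above show that the CSV has no header row, so the first data row is being used as the header. A hedged fix is sketched below (the descriptive column names are assumed from the standard UCI Wine attribute list, not from the original notebook; if used, later references to the column '1' would become 'Class'):

cols = ['Class', 'Alcohol', 'Malic_Acid', 'Ash', 'Alcalinity_of_Ash', 'Magnesium',
        'Total_Phenols', 'Flavanoids', 'Nonflavanoid_Phenols', 'Proanthocyanins',
        'Color_Intensity', 'Hue', 'OD280_OD315', 'Proline']   # assumed UCI names
df = pd.read_csv("/kaggle/input/wineuci/Wine.csv", header=None, names=cols)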

In [5]:
#---------------Check_dataset_information--------------
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1 177 non-null int64
1 14.23 177 non-null float64
2 1.71 177 non-null float64
3 2.43 177 non-null float64
4 15.6 177 non-null float64
5 127 177 non-null int64
6 2.8 177 non-null float64
7 3.06 177 non-null float64
8 .28 177 non-null float64
9 2.29 177 non-null float64
10 5.64 177 non-null float64
11 1.04 177 non-null float64
12 3.92 177 non-null float64
13 1065 177 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB

In [6]:
#---------------Check_distribution_of_dataset----------------------


df.describe()

Out[6]:
                1       14.23        1.71        2.43        15.6         127         2.8
count  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000
mean     1.943503   12.993672    2.339887    2.366158   19.516949   99.587571    2.292260
std      0.773991    0.808808    1.119314    0.275080    3.336071   14.174018    0.626465
min      1.000000   11.030000    0.740000    1.360000   10.600000   70.000000    0.980000
25%      1.000000   12.360000    1.600000    2.210000   17.200000   88.000000    1.740000
50%      2.000000   13.050000    1.870000    2.360000   19.500000   98.000000    2.350000
75%      3.000000   13.670000    3.100000    2.560000   21.500000  107.000000    2.800000
max      3.000000   14.830000    5.800000    3.230000   30.000000  162.000000    3.880000

             3.06         .28        2.29        5.64        1.04        3.92         1065
count  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000   177.000000
mean     2.023446    0.362316    1.586949    5.054802    0.956983    2.604294   745.096045
std      0.998658    0.124653    0.571545    2.324446    0.229135    0.705103   314.884046
min      0.340000    0.130000    0.410000    1.280000    0.480000    1.270000   278.000000
25%      1.200000    0.270000    1.250000    3.210000    0.780000    1.930000   500.000000
50%      2.130000    0.340000    1.550000    4.680000    0.960000    2.780000   672.000000
75%      2.860000    0.440000    1.950000    6.200000    1.120000    3.170000   985.000000
max      5.080000    0.660000    3.580000   13.000000    1.710000    4.000000  1680.000000

In [7]:
#-----------------Check_null_values_in_dataset--------------------
df.isnull().sum()

Out[7]:


1 0
14.23 0
1.71 0
2.43 0
15.6 0
127 0
2.8 0
3.06 0
.28 0
2.29 0
5.64 0
1.04 0
3.92 0
1065 0

dtype: int64

In [8]:
#-------------Check_imbalance_in_dataset--------------------
sns.countplot(x = '1',data=df)

Out[8]:

<AxesSubplot:xlabel='1', ylabel='count'>

In [9]:
target = df['1']
df = df.drop('1',axis=1)

In [10]:
#-----------Split_dataset_into_train_test_set--------------


from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(df,target,test_size =0.20,random_state=42)

In [11]:
sns.pairplot(X_train)

Out[11]:

<seaborn.axisgrid.PairGrid at 0x7ff464bfd610>

In [12]:
#------------Implement_scaling-----------
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [13]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

In [14]:
sns.pairplot(X_train)

Out[14]:

<seaborn.axisgrid.PairGrid at 0x7ff44f4dab10>


In [15]:
#-----------------Build_classifier_model_using_all_available_variables------
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train,y_train)
model

Out[15]:

LogisticRegression()

In [16]:
#--------Check_model_performance-------------------
from sklearn.metrics import classification_report
print("The classification_report
is:{}".format(classification_report(y_test,model.predict(X_test))))
The classification_report is: precision recall f1-score support

1 1.00 1.00 1.00 14


2 1.00 0.71 0.83 14
3 0.67 1.00 0.80 8

accuracy 0.89 36
macro avg 0.89 0.90 0.88 36
weighted avg 0.93 0.89 0.89 36

In [17]:
#-----------------Check_correlation_between_independent_variables---------------
plt.figure(figsize =(10,8))
sns.heatmap(X_train.corr(),annot=True)

Out[17]:

<AxesSubplot:>


In [18]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
tr_comp = pca.fit_transform(X_train)
ts_comp = pca.transform(X_test)
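
To judge how much information the two components retain, one can inspect the explained variance ratio (a small illustrative addition, not in the original notebook):

print(pca.explained_variance_ratio_)        # fraction of variance captured by each component
print(pca.explained_variance_ratio_.sum())  # total variance captured by the two components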

In [19]:
#--------------Plot_PCA-----------------------
sns.scatterplot(x=tr_comp[:, 0], y=tr_comp[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")

Out[19]:

Text(0, 0.5, 'PC2')


The components look orthogonal to each other.

In [20]:
#---------------Build_ml_model_on_extracted_components---------------
from sklearn.linear_model import LogisticRegression
pc_model = LogisticRegression()
pc_model.fit(tr_comp,y_train)
pc_model

Out[20]:

LogisticRegression()

In [21]:
#------------Evaluate_model_performance---------------
from sklearn.metrics import classification_report
print("The classification report is:
{}".format(classification_report(y_test,pc_model.predict(ts_comp))))
The classification report is: precision recall f1-score support

1 1.00 1.00 1.00 14


2 1.00 0.93 0.96 14
3 0.89 1.00 0.94 8

accuracy 0.97 36
macro avg 0.96 0.98 0.97 36
weighted avg 0.98 0.97 0.97 36


The performance of the logistic regression model improved after performing principal component analysis. PCA not only removed some redundancy among the features but also concentrated most of the variance of the dataset into a small number of components.
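
The scaling, PCA, and classification steps can also be chained into a single scikit-learn Pipeline, which keeps test-set statistics out of the preprocessing. This is a sketch, not part of the original notebook, and it assumes the unscaled X_train, X_test, y_train, y_test from the split in In [10]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scale', StandardScaler()),    # standardize the features
    ('pca', PCA(n_components=2)),   # keep the first two principal components
    ('clf', LogisticRegression()),  # classify the wine classes
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))   # accuracy on the held-out test set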

Conclusion:
We have successfully implemented the PCA algorithm for dimensionality reduction on the Wine dataset. Students will be able to analyse the importance of PCA in dimensionality reduction.

Reference:
https://www.kaggle.com/code/bhavesh302/pca-on-wine-dataset

Department of AI & DS MCOERC, NASHIK
