Assignment No: 1-A
Problem Statement:
Apply PCA algorithm & transform this data so that most variations in the measurements of the
variables are captured by a small number of principal components so that it is easier to
distinguish between red and white wine by inspecting these principal components.
Objectives:
Theory:
Principal Component Analysis (PCA) is an unsupervised learning algorithm used for dimensionality reduction in machine learning. It is a statistical procedure that converts observations of correlated features into a set of linearly uncorrelated features by means of an orthogonal transformation. These new transformed features are called the principal components. PCA is a popular tool for exploratory data analysis and predictive modelling: it draws out strong patterns in a dataset by reducing the number of dimensions while retaining as much of the variance as possible. The PCA algorithm is based on the following mathematical concepts:
• Correlation: It measures how strongly two variables are related to each other, i.e. how one changes when the other changes. The correlation value ranges from -1 to +1: -1 means the variables are inversely related, and +1 means they are directly related.
• Orthogonal: Two variables are orthogonal when they are not correlated with each other, i.e. the correlation between the pair of variables is zero (a short illustration follows this list).
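As a small illustration (not part of the original assignment text), the following NumPy snippet shows a correlation close to +1 for two strongly related variables and close to 0 for two unrelated ones; the variable names are made up for this example.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y_related = 2 * x + rng.normal(scale=0.1, size=1000)   # changes together with x
y_unrelated = rng.normal(size=1000)                    # generated independently of x

print(np.corrcoef(x, y_related)[0, 1])    # close to +1
print(np.corrcoef(x, y_unrelated)[0, 1])  # close to 0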
Steps in the PCA algorithm:
2. Representing the data in a structure
We represent the dataset as a two-dimensional matrix X of independent variables. Each row corresponds to a data item (observation) and each column corresponds to a feature. The number of columns is the dimensionality of the dataset.
3. Standardizing the data
In this step we standardize the dataset. Within a given column, features with high variance would otherwise be treated as more important than features with low variance. If the importance of a feature should be independent of its variance, we subtract the column mean from each data item and divide by the standard deviation of that column. We name the resulting matrix Z.
4. Calculating the covariance matrix of Z
To calculate the covariance matrix of Z, we take the matrix Z, transpose it, and multiply the transpose by Z. The resulting matrix is the covariance matrix of Z (up to a constant factor given by the number of observations, which does not affect the directions found in the next step).
5. Calculating the eigenvalues and eigenvectors
Now we calculate the eigenvalues and eigenvectors of the resulting covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes that carry the most information (highest variance), and the corresponding eigenvalues measure the amount of variance along each of those directions.
6. Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and will sort them in decreasing order,
which means from largest to smallest. And simultaneously sort the eigenvectors
accordingly in matrix P of eigenvalues. The resultant matrix will be named as P*.
7. Calculating the new features, or principal components
Here we calculate the new features by multiplying the standardized matrix Z by P*. In the resulting matrix Z*, each new feature (column) is a linear combination of the original features, and the columns of Z* are uncorrelated with each other. A minimal end-to-end sketch of these steps is given after this list.
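The following is a minimal NumPy sketch of steps 3-7 on a small made-up matrix X, included here only as an illustration; the variable names Z, P_star and Z_star follow the notation used above.

import numpy as np

# Data matrix X: rows = observations, columns = features (illustrative values)
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Step 3: standardize each column -> Z
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of Z (Z transposed times Z, scaled by n - 1)
cov = (Z.T @ Z) / (Z.shape[0] - 1)

# Step 5: eigenvalues and eigenvectors of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Step 6: sort eigenvectors by decreasing eigenvalue -> P*
order = np.argsort(eig_vals)[::-1]
P_star = eig_vecs[:, order]

# Step 7: project the standardized data onto the sorted eigenvectors -> Z*
Z_star = Z @ P_star
print(Z_star[:, 0])   # scores along the first principal component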
Sample Code
In[1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/wineuci/Wine.csv
In [2]:
#------------------Import_libraries------------------
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [3]:
df = pd.read_csv("/kaggle/input/wineuci/Wine.csv")
In [4]:
#--------------print_sample_of_dataset------------------
df.head()
Out[4]:
   1  14.23  1.71  2.43  15.6  127   2.8  3.06   .28  2.29  5.64  1.04  3.92  1065
0  1  13.20  1.78  2.14  11.2  100  2.65  2.76  0.26  1.28  4.38  1.05  3.40  1050
1  1  13.16  2.36  2.67  18.6  101  2.80  3.24  0.30  2.81  5.68  1.03  3.17  1185
2  1  14.37  1.95  2.50  16.8  113  3.85  3.49  0.24  2.18  7.80  0.86  3.45  1480
3  1  13.24  2.59  2.87  21.0  118  2.80  2.69  0.39  1.82  4.32  1.04  2.93   735
4  1  14.20  1.76  2.45  15.2  112  3.27  3.39  0.34  1.97  6.75  1.05  2.85  1450
In [5]:
#---------------Check_dataset_information--------------
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1 177 non-null int64
1 14.23 177 non-null float64
2 1.71 177 non-null float64
3 2.43 177 non-null float64
4 15.6 177 non-null float64
5 127 177 non-null int64
6 2.8 177 non-null float64
7 3.06 177 non-null float64
8 .28 177 non-null float64
9 2.29 177 non-null float64
10 5.64 177 non-null float64
11 1.04 177 non-null float64
12 3.92 177 non-null float64
13 1065 177 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB
In [6]:
#---------------Check_distribution_of_dataset----------------------
df.describe()
Out[6]:
                1       14.23        1.71        2.43        15.6         127         2.8        3.06         .28        2.29        5.64        1.04        3.92         1065
count  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000   177.000000
mean     1.943503   12.993672    2.339887    2.366158   19.516949   99.587571    2.292260    2.023446    0.362316    1.586949    5.054802    0.956983    2.604294   745.096045
std      0.773991    0.808808    1.119314    0.275080    3.336071   14.174018    0.626465    0.998658    0.124653    0.571545    2.324446    0.229135    0.705103   314.884046
min      1.000000   11.030000    0.740000    1.360000   10.600000   70.000000    0.980000    0.340000    0.130000    0.410000    1.280000    0.480000    1.270000   278.000000
25%      1.000000   12.360000    1.600000    2.210000   17.200000   88.000000    1.740000    1.200000    0.270000    1.250000    3.210000    0.780000    1.930000   500.000000
50%      2.000000   13.050000    1.870000    2.360000   19.500000   98.000000    2.350000    2.130000    0.340000    1.550000    4.680000    0.960000    2.780000   672.000000
75%      3.000000   13.670000    3.100000    2.560000   21.500000  107.000000    2.860000    2.840000    0.450000    1.950000    6.200000    1.120000    3.170000   985.000000
max      3.000000   14.830000    5.800000    3.230000   30.000000  162.000000    3.880000    5.080000    0.660000    3.580000   13.000000    1.710000    4.000000  1680.000000
In [7]:
#-----------------Check_null_values_in_dataset--------------------
df.isnull().sum()
Out[7]:
1 0
14.23 0
1.71 0
2.43 0
15.6 0
127 0
2.8 0
3.06 0
.28 0
2.29 0
5.64 0
1.04 0
3.92 0
1065 0
dtype: int64
In [8]:
#-------------Check_imbalance_in_dataset--------------------
sns.countplot(x = '1',data=df)
Out[8]:
<AxesSubplot:xlabel='1', ylabel='count'>
In [9]:
target = df['1']
df = df.drop('1',axis=1)
In [10]:
#-----------Split_dataset_into_train_test_set--------------
from sklearn.model_selection import train_test_split
# Hold out 20% of the data for testing (36 of the 177 samples, as seen in the reports below)
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)
In [11]:
sns.pairplot(X_train)
Out[11]:
<seaborn.axisgrid.PairGrid at 0x7ff464bfd610>
In [12]:
#------------Implement_scaling-----------
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [13]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
In [14]:
sns.pairplot(X_train)
Out[14]:
<seaborn.axisgrid.PairGrid at 0x7ff44f4dab10>
In [15]:
#-----------------Build_classifier_model_using_all_available_variables------
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train,y_train)
model
Out[15]:
LogisticRegression()
In [16]:
#--------Check_model_performance-------------------
from sklearn.metrics import classification_report
print("The classification_report
is:{}".format(classification_report(y_test,model.predict(X_test))))
The classification_report is: precision recall f1-score support
accuracy 0.89 36
macro avg 0.89 0.90 0.88 36
weighted avg 0.93 0.89 0.89 36
In [17]:
#-----------------Check_correlation_between_independent_variables---------------
plt.figure(figsize =(10,8))
sns.heatmap(X_train.corr(),annot=True)
Out[17]:
<AxesSubplot:>
In [18]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
tr_comp = pca.fit_transform(X_train)
ts_comp = pca.transform(X_test)
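Not part of the original notebook, but as a quick check one could also print how much of the total variance the two retained components capture; explained_variance_ratio_ is an attribute of a fitted scikit-learn PCA object.

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())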
In [19]:
#--------------Plot_PCA-----------------------
sns.scatterplot(x=tr_comp[:,0], y=tr_comp[:,1])
plt.xlabel("PC1")
plt.ylabel("PC2")
Out[19]:
In [20]:
#---------------Build_ml_model_on_extracted_components---------------
from sklearn.linear_model import LogisticRegression
pc_model = LogisticRegression()
pc_model.fit(tr_comp,y_train)
pc_model
Out[20]:
LogisticRegression()
In [21]:
#------------Evaluate_model_performance---------------
from sklearn.metrics import classification_report
print("The classification report is:
{}".format(classification_report(y_test,pc_model.predict(ts_comp))))
The classification report is: precision recall f1-score support
accuracy 0.97 36
macro avg 0.96 0.98 0.97 36
weighted avg 0.98 0.97 0.97 36
Conclusion:
We have successfully implemented the PCA algorithm for dimensionality reduction on the Wine dataset: the classifier built on just two principal components performed better (accuracy 0.97) than the classifier built on all thirteen original features (accuracy 0.89).
Students will be able to analyse the importance of PCA in dimensionality reduction.
Reference:
https://www.kaggle.com/code/bhavesh302/pca-on-wine-dataset