
Birla Institute of Technology and Science, Pilani

Department of Computer Science & Information Systems


BITS F464 - Machine Learning
I Semester 2020-21
3-Sep-20 Lab Sheet-03 – Principal Component Analysis

Singular-Value Decomposition

The best-known and most widely used matrix decomposition method is the Singular-Value
Decomposition, or SVD. Every matrix has an SVD, which makes it more stable than other
methods such as the eigendecomposition.
The SVD is widely used both in the calculation of other matrix operations, such as the matrix
inverse, and as a data reduction, compression and denoising method in machine learning.

Calculate Singular-Value Decomposition

The SVD can be calculated by calling the svd() function from scipy.linalg.


The function takes a matrix and returns the U, Sigma and V^T elements. The diagonal Sigma
matrix is returned as a vector of singular values, and the V matrix is returned in transposed
form, i.e. V^T.
The example below defines a 3×2 matrix and calculates its Singular-Value Decomposition.

# Singular-value decomposition
from numpy import array
from scipy.linalg import svd
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# SVD
U, s, VT = svd(A)
print(U)
print(s)
print(VT)

Reconstruct Matrix from SVD

The original matrix can be reconstructed from the U, Sigma, and V^T elements.
The U, s, and V^T elements returned by svd() cannot be multiplied together directly.
The s vector must first be converted into a diagonal matrix using the diag() function. By default,
this function creates a square n x n matrix, while the original matrix is m x n. This causes a
problem because the matrix sizes do not satisfy the rules of matrix multiplication, where the
number of columns in one matrix must match the number of rows in the next. The example
below therefore places the n x n diagonal block inside an m x n matrix of zeros before
reconstructing.
# Reconstruct SVD
from numpy import array
from numpy import diag
from numpy import dot
from numpy import zeros
from scipy.linalg import svd
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# Singular-value decomposition
U, s, VT = svd(A)
# create m x n Sigma matrix
Sigma = zeros((A.shape[0], A.shape[1]))
# populate Sigma with n x n diagonal matrix
Sigma[:A.shape[1], :A.shape[1]] = diag(s)
# reconstruct matrix
B = U.dot(Sigma.dot(VT))
print(B)
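
As a quick sanity check (not part of the original lab sheet), the reconstruction can be compared against the original matrix; a minimal sketch using numpy's allclose:

# Verify the reconstruction matches A up to floating-point error
from numpy import allclose
print(allclose(A, B))   # expected: True
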
SVD for Pseudoinverse

The pseudoinverse is the generalization of the matrix inverse from square matrices to rectangular
matrices, where the numbers of rows and columns are not equal.
It is also called the Moore-Penrose inverse, after its two independent discoverers, or the
generalized inverse.
# Pseudoinverse via SVD
from numpy import array
from numpy.linalg import svd
from numpy import zeros
from numpy import diag
# define matrix
A = array([
[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6],
[0.7, 0.8]])
print(A)
# calculate svd
U, s, VT = svd(A)
# reciprocals of s
d = 1.0 / s
# create m x n D matrix
D = zeros(A.shape)
# populate D with n x n diagonal matrix
D[:A.shape[1], :A.shape[1]] = diag(d)
# calculate pseudoinverse
B = VT.T.dot(D.T).dot(U.T)
print(B)
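
For reference, numpy also ships a built-in pseudoinverse. A minimal cross-check sketch (assuming the matrix A and the SVD-based result B from above):

# Cross-check against numpy's built-in pseudoinverse
from numpy.linalg import pinv
from numpy import allclose
print(pinv(A))
print(allclose(B, pinv(A)))   # expected: True
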

Principal Component Analysis


PCA is mathematically defined as an orthogonal linear transformation that transforms the data to
a new coordinate system such that the greatest variance by some projection of the data comes
to lie on the first coordinate (called the first principal component), the second greatest variance
on the second coordinate, and so on.
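
To make this definition concrete, below is a minimal sketch (illustrative only; the small matrix and variable names are our own) that computes principal components directly from the SVD of the mean-centred data, which is essentially what scikit-learn's PCA does internally:

# Illustrative: principal components from the SVD of the centred data
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Xc = X - X.mean(axis=0)                      # centre each column
U, s, VT = np.linalg.svd(Xc, full_matrices=False)
components = VT                              # rows are the principal directions
explained_variance = (s ** 2) / (len(X) - 1)
scores = Xc.dot(components.T)                # data in the new coordinate system
print(components)
print(explained_variance)
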

Import packages and download the wine dataset from


“https://archive.ics.uci.edu/ml/datasets/wine”

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Read in the data and perform basic exploratory analysis


df = pd.read_csv('./Datasets/wine.data.csv')
df.head(10)

Basic statistics
df.iloc[:,1:].describe()

Boxplots by output labels/classes


for c in df.columns[1:]:
    df.boxplot(c, by='Class', figsize=(7,4), fontsize=14)
    plt.title("{}\n".format(c), fontsize=16)
    plt.xlabel("Wine Class", fontsize=16)

It can be seen that some features separate the wine classes fairly clearly. For example,
Alcalinity, Total Phenols, or Flavanoids produce boxplots with well-separated medians, which are
clearly indicative of the wine classes.

Below is an example of class separation using two variables


plt.figure(figsize=(10,6))
plt.scatter(df['OD280/OD315 of diluted wines'], df['Flavanoids'], c=df['Class'], edgecolors='k', alpha=0.75, s=150)
plt.grid(True)
plt.title("Scatter plot of two features showing the \ncorrelation and class separation", fontsize=15)
plt.xlabel("diluted wines", fontsize=15)
plt.ylabel("Flavanoids", fontsize=15)
plt.show()

Are the features independent? Plot the correlation matrix

It can be seen that there is a good amount of correlation between some of the features, i.e. they
are not independent of each other.
def correlation_matrix(df):
    from matplotlib import pyplot as plt
    from matplotlib import cm as cm

    fig = plt.figure(figsize=(16,12))
    ax1 = fig.add_subplot(111)
    cmap = cm.get_cmap('jet', 30)
    cax = ax1.imshow(df.corr(), interpolation="nearest", cmap=cmap)
    ax1.grid(True)
    plt.title('Wine data set features correlation\n', fontsize=15)
    labels = df.columns
    ax1.set_xticklabels(labels, fontsize=9)
    ax1.set_yticklabels(labels, fontsize=9)
    # Add colorbar, make sure to specify tick locations to match desired ticklabels
    fig.colorbar(cax, ticks=[0.1*i for i in range(-11,11)])
    plt.show()

correlation_matrix(df)
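
If a plot is not needed, the same information can be inspected numerically with pandas alone (a small optional sketch):

# Numeric view of the pairwise correlations (excluding the Class column)
corr = df.iloc[:, 1:].corr().round(2)
print(corr)
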

Principal Component Analysis

Data scaling

PCA requires scaling/normalization of the data to work properly

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X = df.drop('Class',axis=1)
y = df['Class']
X = scaler.fit_transform(X)
dfx = pd.DataFrame(data=X,columns=df.columns[1:])
dfx.head(10)
dfx.describe()

PCA class import and analysis


from sklearn.decomposition import PCA
pca = PCA(n_components=None)
dfx_pca = pca.fit(dfx)
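
Before plotting, it can be useful to print the explained variance ratios and their cumulative sum (a small optional sketch; np is the numpy import from above):

# Explained variance per component and its running total
print(dfx_pca.explained_variance_ratio_)
print(np.cumsum(dfx_pca.explained_variance_ratio_))
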

Plot the explained variance ratio


plt.figure(figsize=(10,6))
plt.scatter(x=[i+1 for i in range(len(dfx_pca.explained_variance_ratio_))], y=dfx_pca.explained_variance_ratio_, s=200, alpha=0.75, c='orange', edgecolor='k')
plt.grid(True)
plt.title("Explained variance ratio of the \nfitted principal component vector\n", fontsize=25)
plt.xlabel("Principal components", fontsize=15)
plt.xticks([i+1 for i in range(len(dfx_pca.explained_variance_ratio_))], fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel("Explained variance ratio", fontsize=15)
plt.show()

The above plot shows that the 1st principal component explains about 36% of the total variance
in the data and the 2nd component explains a further 20%. Therefore, the first two components
together explain about 56% of the total variance.
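
scikit-learn can also choose the number of components automatically: passing a float between 0 and 1 as n_components keeps just enough components to explain that fraction of the variance. A minimal sketch, assuming the scaled data frame dfx from above:

# Keep enough components to explain 95% of the total variance
from sklearn.decomposition import PCA
pca95 = PCA(n_components=0.95)
dfx_95 = pca95.fit_transform(dfx)
print(pca95.n_components_)   # number of components actually retained
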
Showing better class separation using principal components

Transform the scaled data set using the fitted PCA object
dfx_trans = pca.transform(dfx)

Put it in a data frame


dfx_trans = pd.DataFrame(data=dfx_trans)
dfx_trans.head(10)

Plot the first two columns of this transformed data set with the color set to original ground truth
class label
plt.figure(figsize=(10,6))
plt.scatter(dfx_trans[0], dfx_trans[1], c=df['Class'], edgecolors='k', alpha=0.75, s=150)
plt.grid(True)
plt.title("Class separation using first two principal components\n", fontsize=20)
plt.xlabel("Principal component-1",fontsize=15)
plt.ylabel("Principal component-2",fontsize=15)
plt.show()

Lab 03 Exercise (submit the code in the given time):

Download any dataset with integer-valued attributes (you can also use the wine dataset), split
the data into training and testing sets, and perform linear regression using any of the methods
introduced in the previous lab. Then compare the prediction accuracy with and without PCA
applied to the training data.
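
One possible workflow is sketched below (illustrative only; X and y stand for the feature matrix and target of whatever dataset you choose, and the regression method and number of components are placeholders to adapt):

# Sketch: compare linear regression with and without PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Without PCA
reg = LinearRegression().fit(X_train, y_train)
print("R^2 without PCA:", reg.score(X_test, y_test))

# With PCA: scale on the training data, then project both splits
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))
Xtr_p = pca.transform(scaler.transform(X_train))
Xte_p = pca.transform(scaler.transform(X_test))
reg_pca = LinearRegression().fit(Xtr_p, y_train)
print("R^2 with PCA:", reg_pca.score(Xte_p, y_test))
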
