ml2020 Pythonlab03
Singular-Value Decomposition
The best-known and most widely used matrix decomposition method is the Singular-Value
Decomposition, or SVD. All matrices have an SVD, which makes it more stable than other
methods, such as the eigendecomposition. The SVD factorizes an m x n matrix A into the
product A = U Sigma V^T, where U is an m x m orthogonal matrix, Sigma is an m x n diagonal
matrix of singular values, and V^T is an n x n orthogonal matrix.
The SVD is widely used both in the calculation of other matrix operations, such as the matrix
inverse, and as a data reduction, compression, and denoising method in machine learning.
# Singular-value decomposition
from numpy import array
from scipy.linalg import svd
# define a 3 x 2 matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# SVD: U is 3 x 3, s holds the two singular values, VT is 2 x 2
U, s, VT = svd(A)
print(U)
print(s)
print(VT)
The original matrix can be reconstructed from the U, Sigma, and V^T elements.
However, the U, s, and V^T elements returned from svd() cannot be multiplied together directly.
The s vector must first be converted into a diagonal matrix using the diag() function. By default,
this function creates a square n x n matrix, where n is the number of columns of our original
matrix. This causes a problem, as the sizes of the matrices then do not conform to the rules of
matrix multiplication, where the number of columns in one matrix must match the number of
rows in the next. To fix this, the n x n diagonal matrix is placed into an m x n matrix of zeros so
that U (m x m), Sigma (m x n), and V^T (n x n) can be multiplied.
# Reconstruct SVD
from numpy import array
from numpy import diag
from numpy import dot
from numpy import zeros
from scipy.linalg import svd
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# Singular-value decomposition
U, s, VT = svd(A)
# create m x n Sigma matrix
Sigma = zeros((A.shape[0], A.shape[1]))
# populate Sigma with n x n diagonal matrix
Sigma[:A.shape[1], :A.shape[1]] = diag(s)
# reconstruct matrix
B = U.dot(Sigma.dot(VT))
print(B)
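Because the singular values in s are sorted from largest to smallest, keeping only the first k of
them gives a reduced-rank approximation of the original matrix, which is the basis of using the
SVD for data reduction. Continuing from the variables defined above, a minimal sketch (the
choice k = 1 is purely illustrative):
# rank-k approximation using only the first k singular values
# (k = 1 is an arbitrary illustrative choice)
k = 1
B1 = U[:, :k].dot(diag(s[:k])).dot(VT[:k, :])
print(B1)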
SVD for Pseudoinverse
The pseudoinverse is the generalization of the matrix inverse from square matrices to rectangular
matrices, where the numbers of rows and columns are not equal.
It is also called the Moore-Penrose inverse, after two independent discoverers of the method,
or the generalized inverse. It can be calculated from the SVD as A^+ = V D^+ U^T, where D^+ is
obtained by taking the reciprocal of each non-zero singular value of A.
# Pseudoinverse via SVD
from numpy import array
from numpy.linalg import svd
from numpy import zeros
from numpy import diag
# define matrix
A = array([
[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6],
[0.7, 0.8]])
print(A)
# calculate svd
U, s, VT = svd(A)
# reciprocals of s
d = 1.0 / s
# create m x n D matrix
D = zeros(A.shape)
# populate D with n x n diagonal matrix
D[:A.shape[1], :A.shape[1]] = diag(d)
# calculate pseudoinverse
B = VT.T.dot(D.T).dot(U.T)
print(B)
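NumPy also provides the pinv() function, which computes the pseudoinverse directly and can be
used to cross-check the result above:
# cross-check with NumPy's built-in pseudoinverse (should closely match B)
from numpy.linalg import pinv
print(pinv(A))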
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
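The wine DataFrame df used below is not constructed in this handout. A minimal sketch, assuming
the UCI wine data bundled with scikit-learn (placing the class label in the first column is an
assumption made so that the code that follows works unchanged):
# load the UCI wine data via scikit-learn (an assumed data source for this lab)
from sklearn.datasets import load_wine
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
# put the class label in the first column so df.iloc[:, 1:] selects only the features
df.insert(0, 'Class', wine.target)
df.head()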
Basic statistics
df.iloc[:,1:].describe()
It can be seen that some features separate the wine classes fairly clearly. For example,
Alcalinity, Total Phenols, or Flavanoids produce boxplots with well-separated medians, which are
clearly indicative of wine classes.
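The boxplots referred to above are not reproduced here; a minimal sketch of how they could be
drawn (the lower-case column names assume the scikit-learn loading step sketched earlier):
# boxplots of a few features grouped by wine class
for col in ['alcalinity_of_ash', 'total_phenols', 'flavanoids']:
    df.boxplot(column=col, by='Class', figsize=(6, 4))
    plt.suptitle('')
    plt.title(col)
plt.show()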
It can also be seen that there is a fair amount of correlation between features, i.e. they are not
independent of each other.
def correlation_matrix(df):
    from matplotlib import pyplot as plt
    from matplotlib import cm as cm
    fig = plt.figure(figsize=(16,12))
    ax1 = fig.add_subplot(111)
    cmap = cm.get_cmap('jet', 30)
    cax = ax1.imshow(df.corr(), interpolation="nearest", cmap=cmap)
    ax1.grid(True)
    plt.title('Wine data set features correlation\n', fontsize=15)
    labels = df.columns
    # place one tick per feature so the labels line up with the matrix cells
    ax1.set_xticks(range(len(labels)))
    ax1.set_yticks(range(len(labels)))
    ax1.set_xticklabels(labels, fontsize=9)
    ax1.set_yticklabels(labels, fontsize=9)
    # Add colorbar, make sure to specify tick locations to match desired ticklabels
    fig.colorbar(cax, ticks=[0.1*i for i in range(-11, 11)])
    plt.show()
correlation_matrix(df)
Data scaling
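The scaling and PCA-fitting steps are not shown in this handout; a minimal sketch, assuming
scikit-learn's StandardScaler and PCA (the variable names dfx and pca are chosen to match the
code further below):
# scale the features to zero mean and unit variance, then fit PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
dfx = scaler.fit_transform(df.iloc[:, 1:])
pca = PCA()
pca.fit(dfx)
# plot the fraction of variance explained by each principal component
plt.figure(figsize=(8, 5))
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.show()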
The above plot shows that the 1st principal component explains about 36% of the total variance
in the data and the 2nd component explains a further 20%. Therefore, if we consider just the first
two components, they together explain about 56% of the total variance.
Showing better class separation using principal components
Transform the scaled data set using the fitted PCA object
# wrap the transformed array in a DataFrame so the first two principal components
# can be selected as columns 0 and 1
dfx_trans = pd.DataFrame(pca.transform(dfx))
Plot the first two columns of this transformed data set with the color set to the original
ground-truth class label
plt.figure(figsize=(10,6))
plt.scatter(dfx_trans[0], dfx_trans[1], c=df['Class'], edgecolors='k', alpha=0.75, s=150)
plt.grid(True)
plt.title("Class separation using first two principal components\n", fontsize=20)
plt.xlabel("Principal component-1", fontsize=15)
plt.ylabel("Principal component-2", fontsize=15)
plt.show()
Download any dataset with integer-valued attributes (you can also use the wine dataset), split
the data into training and test sets, and then perform linear regression using any of the methods
introduced in the previous lab. Then compare the prediction accuracy with and without PCA
applied to the training data. One possible setup is sketched below.
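A minimal sketch of one way the comparison could be set up, assuming the wine data loaded
earlier with 'Class' used as the regression target and scikit-learn's LinearRegression (these
choices are assumptions, not the required solution):
# compare linear regression accuracy with and without PCA on the training data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
X = StandardScaler().fit_transform(df.iloc[:, 1:])
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# without PCA
reg = LinearRegression().fit(X_train, y_train)
print('R^2 without PCA:', reg.score(X_test, y_test))
# with PCA (5 components is an arbitrary illustrative choice)
pca5 = PCA(n_components=5).fit(X_train)
reg_pca = LinearRegression().fit(pca5.transform(X_train), y_train)
print('R^2 with PCA:', reg_pca.score(pca5.transform(X_test), y_test))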