6 Dimension Reduction Theory

Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?
An intuitive example of dimensionality reduction is a simple e-mail classification problem, where we need to classify whether an e-mail is spam or not. This can involve a large number of features, such as whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, and so on. However, some of these features may overlap. Similarly, a classification problem that relies on both humidity and rainfall can collapse the two into a single underlying feature, since they are highly correlated. Hence, we can reduce the number of features in such problems.

Reasons for reducing dimensionality include the following:

• Making the dataset easier to use
• Reducing the computational cost of many algorithms
• Removing noise
• Making the results easier to understand

Advantages of Dimensionality Reduction:


• It helps in data compression, and hence reduces storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
Singular Value Decomposition (SVD)
Factorize the following matrix using SVD:

A = [ 4   3 ]
    [ 0  −5 ]
Step 1: Compute its transpose Aᵀ and the product AᵀA.

Step 2: Determine the eigenvalues of AᵀA and sort them in descending order of absolute value. Take the square roots of these eigenvalues to obtain the singular values of A.

Step 3: Construct the diagonal matrix S by placing the singular values in descending order along its diagonal. Compute its inverse, S⁻¹.
Step 4: Using the ordered eigenvalues from Step 2, compute the corresponding eigenvectors of AᵀA. Place these eigenvectors along the columns of V and compute its transpose, Vᵀ.
Step 5: Compute U as U = AVS⁻¹. To verify the factorization, check that A = USVᵀ.
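To make the procedure concrete, here is a minimal sketch in Python with NumPy (the library choice and variable names are ours, not part of the original notes) that follows Steps 1–5 for the 2 × 2 matrix above and confirms A = USVᵀ:

```python
import numpy as np

# Matrix from the example above
A = np.array([[4.0, 3.0],
              [0.0, -5.0]])

# Step 1: transpose and A^T A
AtA = A.T @ A

# Step 2: eigenvalues of A^T A in descending order; their square roots are the singular values
eigvals, eigvecs = np.linalg.eigh(AtA)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
singular_values = np.sqrt(eigvals)      # approx. [6.3246, 3.1623]

# Step 3: diagonal matrix S and its inverse
S = np.diag(singular_values)
S_inv = np.linalg.inv(S)

# Step 4: V holds the eigenvectors of A^T A as columns
V = eigvecs

# Step 5: U = A V S^-1, then verify A = U S V^T
U = A @ V @ S_inv
print(np.allclose(A, U @ S @ V.T))      # True
```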

Principal Component Analysis (PCA)


In PCA, the dataset is transformed from its original coordinate system to a new coordinate system. The new coordinate system is chosen by the data itself. The first new axis is chosen in the direction of the greatest variance in the data. The second axis is orthogonal to the first and points in the direction of the largest remaining variance. This procedure is repeated for as many features as there were in the original data. We find that the majority of the variance is contained in the first few axes. Therefore, we can ignore the remaining axes and thereby reduce the dimensionality of our data.

• Data on p variables; these variables may be correlated.
• Correlation indicates that information contained in one variable is also contained in some of the other p−1 variables.
• PCA transforms the p original correlated variables into p uncorrelated components (also called orthogonal components or principal components).
• These components are linear functions of the original variables.

The transformation is written as Z = XA,
where X is the n × p matrix of n observations on p variables,
Z is the n × p matrix of n values for each of the p components, and
A is the p × p matrix of coefficients defining the linear transformation.

1. Find the mean of each of the p variables (for two variables, x̅ and y̅).

2. Form the matrix X of deviations from the respective means, i.e. X contains the columns xᵢ − x̅ and yᵢ − y̅.
3. Find the covariances between the p variables.

4. Construct the covariance matrix C (distinct from the coefficient matrix A in Z = XA). For two variables:

C = [ cov(x, x)   cov(x, y) ]
    [ cov(y, x)   cov(y, y) ]

5. Find the eigenvalues and eigenvectors of the covariance matrix.

6. Choose the principal components. The fraction of the total variance accounted for by the jth principal component is λⱼ / (λ₁ + λ₂ + … + λₚ), where λⱼ is the jth eigenvalue of the covariance matrix.

7. Form a feature vector from the chosen eigenvectors.
8. Derive the new data set Z:
Z = XA, where X is the matrix containing the columns xᵢ − x̅ and yᵢ − y̅, and A is the matrix of chosen eigenvectors.
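As an illustrative sketch of Steps 1–8, assuming a small synthetic two-variable dataset (the data and the use of NumPy below are for illustration only, not from the original notes):

```python
import numpy as np

# Synthetic data: n = 100 observations on p = 2 correlated variables (illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.3, size=100)
data = np.column_stack([x, y])                 # n x p

# Steps 1-2: subtract the column means to get the deviation matrix X
X = data - data.mean(axis=0)

# Steps 3-4: covariance matrix of the p variables
C = np.cov(X, rowvar=False)

# Step 5: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 6: fraction of total variance explained by each component
explained = eigvals / eigvals.sum()

# Steps 7-8: keep the leading eigenvector(s) as the feature vector A and project
A = eigvecs[:, :1]                             # first principal component only
Z = X @ A                                      # n x 1 reduced data set
print(explained)
```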
Singular Value Decomposition (SVD)
It is one of the most widely used unsupervised learning techniques and is at the center of many recommendation systems and dimensionality-reduction methods.
In simple terms, SVD is the factorization of a matrix into three matrices. So if we have a matrix A, then its SVD is represented by:

A = U𝚺Vᵀ
where A is an m × n matrix,
U is an m × m orthogonal matrix (its columns are the left singular vectors),
𝚺 is an m × n nonnegative rectangular diagonal matrix (its diagonal entries are the singular values), and
V is an n × n orthogonal matrix (its columns are the right singular vectors).
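A minimal sketch of these shapes, assuming NumPy and an arbitrary 4 × 3 example matrix chosen only for illustration:

```python
import numpy as np

# Arbitrary m x n example matrix (m = 4, n = 3), chosen only to show the shapes
A = np.arange(12, dtype=float).reshape(4, 3)

U, s, Vt = np.linalg.svd(A)                    # full SVD
Sigma = np.zeros((4, 3))
Sigma[:3, :3] = np.diag(s)                     # rectangular diagonal m x n matrix

print(U.shape, Sigma.shape, Vt.shape)          # (4, 4) (4, 3) (3, 3)
print(np.allclose(A, U @ Sigma @ Vt))          # True
```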

Independent component analysis (ICA)


“It is a method for finding underlying factors or components from multivariate (multi-dimensional) statistical data. What distinguishes ICA from other methods is that it looks for components that are both statistically independent and non-Gaussian.”
-- A. Hyvärinen, J. Karhunen, E. Oja
ICA estimation principles
• Principle 1: “Nonlinear decorrelation.” Find the matrix W so that for any i ≠ j, the components yᵢ and yⱼ are uncorrelated, and the transformed components g(yᵢ) and h(yⱼ) are uncorrelated, where g and h are some suitable nonlinear functions.
• Principle 2: “Maximum non-Gaussianity.” Find the local maxima of non-Gaussianity of a linear combination y = Wx under the constraint that the variance of y is constant.
• Each local maximum gives one independent component.

Applications include audio processing, medical data, finance, array processing (beamforming), etc.
ICA mathematical approach
Given a set of observations of random variables x₁(t), x₂(t), …, xₙ(t), where t is the time or sample index, assume that they are generated as a linear mixture of independent components. Independent component analysis consists of finding a matrix W such that the components of y = Wx are statistically independent, i.e. estimating both the matrix W and the yᵢ(t) when we only observe the xᵢ(t).

Example: Simple “Cocktail Party” Problem


Simple scenario: Two people are speaking simultaneously in a room. Their speech is recorded by two microphones in separate locations.

• Let s₁(t), s₂(t) be the speech signals emitted by the two speakers.
• The time signals recorded by the two microphones are denoted by x₁(t), x₂(t).

The recorded time signals can be expressed as a linear equation:

x₁(t) = a₁₁s₁(t) + a₁₂s₂(t)

x₂(t) = a₂₁s₁(t) + a₂₂s₂(t)

We use a statistical “latent variable” model: the random variables sₖ (used in place of the time signals) are latent, i.e. unknown, and the mixing matrix A is also unknown. The task is to estimate A and s using only the observable random vector x.

xⱼ = aⱼ₁s₁ + aⱼ₂s₂ + … + aⱼₙsₙ, for all j


x = As

Let us assume that the number of independent components equals the number of observed mixtures, and that A is square and invertible.

So after estimating A, we can compute W = A⁻¹ and hence s = Wx = A⁻¹x.
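The cocktail-party setup can be sketched numerically as follows; the synthetic source signals, the example mixing matrix, and the use of scikit-learn's FastICA estimator are illustrative assumptions rather than part of the original notes:

```python
import numpy as np
from sklearn.decomposition import FastICA   # scikit-learn assumed to be installed

# Two synthetic "speech" sources s1(t), s2(t) (illustrative signals, not real audio)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.column_stack([s1, s2])

# Assumed mixing matrix A; the microphones observe x = A s
A_mix = np.array([[1.0, 0.5],
                  [0.4, 1.2]])
X = S @ A_mix.T                             # observed mixtures x1(t), x2(t)

# Estimate the unmixing matrix W and recover the sources (up to order and scaling)
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                # recovered independent components
W_est = ica.components_                     # estimated unmixing matrix W
```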

Some related concepts to understand the above method:

Independent component analysis (ICA) is a method for finding underlying factors or components from multivariate (multi-dimensional) statistical data. What distinguishes ICA from other methods is that it looks for components that are both statistically independent and non-Gaussian.

In probability theory, two events are independent (statistically independent, or stochastically independent) if the occurrence of one does not affect the probability of occurrence of the other.

In physics, a non-Gaussianity is a correction that modifies the expected Gaussian estimate for the measurement of a physical quantity.

Gaussian functions are widely used in statistics to describe normal distributions.

In probability theory, the normal (or Gaussian, Gauss, or Laplace–Gauss) distribution is a very common continuous probability distribution.

Invertible matrix: A square matrix A is called invertible if there exists a matrix B such that AB = BA = I. If this is the case, then the matrix B is uniquely determined by A and is called the inverse of A, denoted by A⁻¹. A square matrix that is not invertible is called singular or degenerate. A square matrix is singular if and only if its determinant is 0.
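As a quick numerical illustration of this point (a minimal sketch assuming NumPy; the example matrices are arbitrary):

```python
import numpy as np

# A square matrix is invertible exactly when its determinant is nonzero
A = np.array([[4.0, 3.0],
              [0.0, -5.0]])
print(np.linalg.det(A))       # -20.0 (nonzero), so A is invertible
print(np.linalg.inv(A))       # its inverse, A^-1

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(np.linalg.det(B))       # 0.0 (up to floating point), so B is singular and has no inverse
```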

Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).