ML - Unit 3
Subset selection is a process in machine learning that involves selecting a subset of features from a dataset.
Subset selection involves identifying and removing irrelevant and redundant information. This reduces the
dimensionality of the data and allows learning algorithms to operate faster and more effectively.
There are several techniques for subset selection in machine learning:
Forward Selection:
• Start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not decrease the error (or decreases it only slightly).
• Iteratively add the most informative feature at each step until a certain criterion is met.
• At each step, choose the candidate feature xj where
j = arg min_i E(F ∪ xi)
and add xj to F if E(F ∪ xj) < E(F). (A code sketch of this greedy procedure follows at the end of this list.)
Backward Elimination:
• Start with all variables and remove them one by one, at each step removing the one whose removal decreases the error the most (or increases it only slightly), until any further removal increases the error significantly.
• Iteratively remove the least informative feature at each step until a certain criterion is met.
• At each step, choose the candidate feature xj where
j = arg min_i E(F − xi)
and remove xj from F if E(F − xj) < E(F).
• Stop if removing a feature does not decrease the error. To decrease complexity, we may decide to remove a feature if its removal causes only a slight increase in error.
Floating Search:
• Floating search is a variant of subset selection methods that combines forward and backward selection.
• It starts with an initial subset of features, then iteratively adds and removes features based on some criterion, floating towards a more optimal subset.
• This method aims to strike a balance between the computational cost of exhaustive search methods and the suboptimality of stepwise methods.
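Below is a minimal Python sketch of sequential forward selection as described above. The `error` callable is a placeholder for whatever subset error E(F) the user measures (for example, validation error of a model trained on those features); the function name, the empty-set baseline, and the tolerance `tol` are illustrative assumptions, not part of the original notes.

```python
def forward_selection(all_features, error, tol=1e-4):
    """Greedy sequential forward selection.

    all_features: iterable of candidate feature names/indices.
    error: callable mapping a set of features F to an error E(F)
           (hypothetical; supplied by the user).
    tol: minimum decrease in error required to keep adding features.
    """
    selected = set()                      # F, the current feature subset
    current_error = error(selected)       # E(F) with no features (baseline)
    while True:
        remaining = [f for f in all_features if f not in selected]
        if not remaining:
            break
        # choose xj with j = arg min_i E(F U xi)
        best = min(remaining, key=lambda f: error(selected | {f}))
        best_error = error(selected | {best})
        # add xj only if it decreases the error by at least tol
        if current_error - best_error < tol:
            break
        selected.add(best)
        current_error = best_error
    return selected
```

Backward elimination is the mirror image of this sketch: start from the full feature set and repeatedly remove the feature xj with j = arg min_i E(F − xi), as long as the removal does not increase the error significantly.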
Principal Component Analysis (PCA)
• Initially, the features are standardized, that is, subtract the mean and divide by the standard deviation. This step ensures that all features contribute equally to the analysis, as PCA is sensitive to the scale of the features.
• Calculate the covariance matrix of the standardized data. The covariance matrix is a square
matrix that represents the relationships between all pairs of features.
• The primary goal of PCA is to capture the maximum variance in the data with a reduced number of features, simplifying the analysis while retaining the essential information. Each new variable is a projection of the input x onto a direction w:
z = wᵀx
• Find the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors represent
the principal components, and the corresponding eigenvalues indicate the amount of
variance explained by each principal component.
• For a dataset with n features, the covariance matrix is an n × n matrix that quantifies the relationships between pairs of features. It is calculated from the standardized data.
• The first principal component is w1, the direction such that the sample, after projection onto w1, is most spread out, so that the differences between the sample points become most apparent.
• For a unique solution, and to make the direction the important factor, we require ||w1|| = 1.
If z1 = w1ᵀx with Cov(x) = Σ, then
Var(z1) = w1ᵀΣw1
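This variance formula follows directly from the definition of covariance; spelling out the standard one-line derivation (added here for completeness):
Var(z1) = E[(w1ᵀx − w1ᵀμ)²] = E[w1ᵀ(x − μ)(x − μ)ᵀw1] = w1ᵀ E[(x − μ)(x − μ)ᵀ] w1 = w1ᵀΣw1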
• The eigenvalue problem for the covariance matrix Σ is expressed as: Σv = λv,
where v is an eigenvector, λ is the corresponding eigenvalue, and Σv is the product of the covariance matrix and the eigenvector.
• Arrange the eigenvalues in descending order. The larger eigenvalues correspond to the
principal components that capture more variance in the data.
• Choose the top k eigenvectors based on the amount of variance you want to retain. The sum
of the selected eigenvalues represents the total variance retained.
Total Variance Retained = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λn)
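For example (with purely illustrative eigenvalues): if λ1 = 4, λ2 = 2, λ3 = 1, λ4 = 0.5 and λ5 = 0.5, then keeping the top k = 2 components retains (4 + 2) / (4 + 2 + 1 + 0.5 + 0.5) = 6 / 8 = 75% of the total variance.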
• A common approach is to choose the number of principal components that capture a certain
percentage (e.g., 95%) of the total variance.
• Form a matrix with the selected eigenvectors as columns. This matrix, often denoted as P, is called
the projection matrix.
P = [v1, v2, ..., vk]
• Multiply the standardized data matrix X by the projection matrix P to obtain the lower-dimensional
representation Z.
Z = X · P
• The columns of Z represent the principal components of the data.
• The principal components, obtained as eigenvectors, are orthogonal (uncorrelated) and form a new
basis for the data.
• The eigenvalues indicate the importance of each principal component in capturing the variability of
the original data.
• By selecting a subset of these principal components, dimensionality reduction is achieved while retaining as much variance as possible (the whole procedure is sketched in the code below).
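The steps above can be collected into a short NumPy sketch (standardize, compute the covariance matrix, eigendecompose, sort, select the top k eigenvectors, project). The function and variable names and the random example data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def pca(X, k):
    """Project data X (n_samples x n_features) onto its top-k principal components."""
    # 1. Standardize: subtract the mean and divide by the standard deviation
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the standardized data (n_features x n_features)
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigenvalues and eigenvectors (eigh, since the covariance matrix is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort eigenvalues (and matching eigenvectors) in descending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # 5. Projection matrix P = [v1, v2, ..., vk] and retained-variance ratio
    P = eigenvectors[:, :k]
    retained = eigenvalues[:k].sum() / eigenvalues.sum()

    # 6. Lower-dimensional representation Z = X_std . P
    Z = X_std @ P
    return Z, P, retained

# Example usage with random data (illustrative only)
X = np.random.rand(100, 5)
Z, P, retained = pca(X, k=2)
print(Z.shape, round(retained, 3))
```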
Factor Analysis
• While PCA focuses on capturing the maximum variance in the data, Factor Analysis aims to identify
the underlying factors that drive the observed correlations between variables.
• Factor Analysis assumes that observed variables are influenced by underlying latent factors.
• These latent factors are not directly observable but are inferred from the correlations between
observed variables.
• In PCA, from the original dimensions xi, i = 1, ..., d, we form a new set of variables z that are linear combinations of xi:
z = Wᵀ(x − μ)
• In factor analysis (FA), we assume that there is a set of unobservable, latent factors zj, j = 1, ..., k, which when acting in combination generate x.
• The goal is to characterize the dependency among the observed variables by means of a smaller
number of factors.
• Suppose there is a group of variables that have high correlation among themselves and low
correlation with all the other variables. Then there may be a single underlying factor that gave rise to
these variables.
• If the other variables can be similarly grouped into subsets, then a few factors can represent these
groups of variables.
• Though factor analysis always partitions the variables into factor clusters, whether the factors mean
anything, or really exist, is open to question.
• FA, like PCA, is a one-group procedure and is unsupervised. The aim is to model the data in a
smaller dimensional space without loss of information. In FA, this is measured as the correlation
between variables.
• As in PCA, we have a sample X = {x^t}_t drawn from some unknown probability density with E[x] = μ and Cov(x) = Σ.
• Principal components analysis generates new variables that are linear combinations of the original
input variables. In factor analysis, however, there are factors that when linearly combined generate
the input variables.
• If x1 and x2 have high covariance, then they are related through a factor.
• If it is the first factor, then v11 and v21 will both be high; if it is the second factor, then v12 and v22
will both be high. In either case, the sum v11v21 + v12v22 will be high.
• If the covariance is low, then x1 and x2 depend on different factors and in the products in the sum,
one term will be high and the other will be low and the sum will be low.
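To make the role of the loadings vij in the last two bullets explicit, the factor analysis model can be written out (stated here for reference, using the same notation as above): each observed variable is a weighted sum of the k latent factors plus a variable-specific noise term,
xi − μi = vi1 z1 + vi2 z2 + ... + vik zk + εi
and, assuming unit-variance, uncorrelated factors and independent noise, the covariance between two observed variables comes only from the factors they share:
Cov(x1, x2) = v11 v21 + v12 v22 + ... + v1k v2k
With k = 2 factors this is exactly the sum v11v21 + v12v22 discussed above.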
Expectation-Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a statistical method for finding maximum likelihood estimates of parameters in probabilistic models with latent (unobserved) variables. It is an iterative optimization technique that alternates between two steps: the "Expectation" (E) step and the "Maximization" (M) step.
Let us express the Expectation-Maximization (EM) algorithm more formally in mathematical terms.
Suppose we have observed data X and latent (unobserved) variables Z, and we want to maximize the
likelihood function L(θ; X, Z), where θ represents the model parameters.
EM Algorithm steps:
• Step 1 (Initialization): Choose initial values θ^(0) for the model parameters.
• Step 2 (E Step): Using the current estimate θ^(t), compute the expected log-likelihood of the complete data, where the expectation is taken over the latent variables Z given X and θ^(t):
Q(θ | θ^(t)) = E_{Z | X, θ^(t)} [log L(θ; X, Z)]
• Step 3 (M Step): Maximize the expected log-likelihood function obtained in the E-step with respect to the model parameters:
θ^(t+1) = arg max_θ Q(θ | θ^(t))
• Repeat the E-step and M-step until convergence. Convergence is often determined by checking the change in the log-likelihood or the parameters between successive iterations.
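To make the E and M steps concrete, here is a minimal NumPy/SciPy sketch of EM for a two-component, one-dimensional Gaussian mixture (the GMM application mentioned under Applications below). The initialization, the fixed two-component structure, and the convergence tolerance are simple illustrative choices, not prescribed by these notes.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=100, tol=1e-6):
    """EM for a two-component 1-D Gaussian mixture model."""
    # Simple illustrative initialization
    pi = 0.5                               # mixing proportion of component 1
    mu = np.array([x.min(), x.max()])      # component means
    sigma = np.array([x.std(), x.std()])   # component standard deviations
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E step: responsibility r[i] = P(component 1 | x[i]) under current parameters
        p1 = pi * norm.pdf(x, mu[0], sigma[0])
        p2 = (1 - pi) * norm.pdf(x, mu[1], sigma[1])
        r = p1 / (p1 + p2)

        # Convergence check on the observed-data log-likelihood
        ll = np.log(p1 + p2).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

        # M step: re-estimate parameters from the responsibilities
        pi = r.mean()
        mu = np.array([np.average(x, weights=r), np.average(x, weights=1 - r)])
        sigma = np.array([
            np.sqrt(np.average((x - mu[0]) ** 2, weights=r)),
            np.sqrt(np.average((x - mu[1]) ** 2, weights=1 - r)),
        ])

    return pi, mu, sigma

# Example usage with synthetic data (illustrative only)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 300)])
print(em_gmm_1d(x))
```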
Applications:
Missing Data: EM is particularly useful when dealing with missing or incomplete data where latent variables
play a role in the model.
Clustering: It is commonly applied in clustering problems, such as Gaussian Mixture Models (GMMs).
Image Analysis: Used in various image processing tasks, including segmentation and denoising.