Model Selection with Probabilistic PCA and Factor Analysis (FA) in Scikit Learn
In the field of machine learning, model selection plays a vital role in finding the most suitable algorithm for a given dataset. When dealing with dimensionality reduction tasks, methods such as Principal Component Analysis (PCA) and Factor Analysis (FA) are commonly employed. However, when PCA's assumption of uniform (homoscedastic) noise does not hold, for example when different features are measured with different noise levels, Factor Analysis can be a more appropriate alternative. In this article, we will explore how to perform model selection using Probabilistic PCA and Factor Analysis in Scikit-Learn, a popular Python library for machine learning.
Concepts related to the topic:
- Probabilistic PCA (PPCA): PPCA extends traditional PCA with a probabilistic framework. It assumes that the observed data are generated by linearly mapping low-dimensional latent variables into the high-dimensional observation space and then adding isotropic Gaussian noise (the same noise variance for every feature). The loading matrix and noise variance are estimated by maximum likelihood, which gives the low-dimensional representation a probabilistic interpretation.
- Factor Analysis (FA): FA assumes a generative model in which the observed variables are linear combinations of the latent variables plus Gaussian noise, with a separate noise variance allowed for each observed variable. The goal is to estimate the loading matrix that encodes the linear relationships between the observed and latent variables, so FA also provides a probabilistic interpretation of the dimensionality reduction; a small generative-model sketch follows this list.
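Both methods share the generative model x = Wz + mu + noise, where z is the latent vector and W the loading matrix; PPCA constrains the noise to have the same variance in every dimension, while FA lets each dimension have its own. The snippet below is only an illustrative simulation of that model (all names and sizes are made up for this example), not scikit-learn code.
Python3
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_latent = 500, 10, 3

W = rng.normal(size=(n_features, n_latent))    # loading matrix
z = rng.normal(size=(n_samples, n_latent))     # latent variables
mu = rng.normal(size=n_features)               # mean of the observed data

# PPCA-style noise: one shared standard deviation for every feature
noise_ppca = rng.normal(scale=0.5, size=(n_samples, n_features))

# FA-style noise: a different standard deviation for each feature
scales = rng.uniform(0.1, 2.0, size=n_features)
noise_fa = rng.normal(size=(n_samples, n_features)) * scales

X_ppca_style = z @ W.T + mu + noise_ppca   # matches PPCA's assumptions
X_fa_style = z @ W.T + mu + noise_fa       # matches FA's assumptions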
Homoscedastic Noise
Homoscedastic noise has the same variance at every value of the independent variables: every feature (and every observation) is corrupted by noise drawn with a single, constant variance.
1. Import the necessary libraries and create the Homoscedastic Noise dataset
Python3
import numpy as np

n_samples = 250
n_features = 30
mean = 0
sigma = 5

rng = np.random.RandomState(23)

# constant per-feature offset; the Gaussian term below has the same
# variance (sigma**2) for every feature, i.e. homoscedastic noise
homo_noise = sigma * rng.rand(n_features)
X_homoscedastic = rng.normal(mean, sigma, (n_samples, n_features)) + homo_noise

print("Homoscedastic Noise Dataset Shape:", X_homoscedastic.shape)
Output:
Homoscedastic Noise Dataset Shape: (250, 30)
2. Fit PCA and Factor Analysis and compute the cross-validation scores
Python3
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score

pca = PCA(svd_solver="full")
fa = FactorAnalysis()


def compute_score(X, model, n_components):
    # cross_val_score with no target uses the model's own score method,
    # i.e. the average log-likelihood of the held-out data
    scores = []
    for n in n_components:
        model.n_components = n
        scores.append(np.mean(cross_val_score(model, X)))
    return scores


n_components = [0, 5, 10, 15, 20, 25, 30]
pca_scores = compute_score(X_homoscedastic, pca, n_components)
fa_scores = compute_score(X_homoscedastic, fa, n_components)
3. Plot the PCA and FA scores against the number of components
Python3
import matplotlib.pyplot as plt

plt.plot(n_components, pca_scores, "b", label="PCA scores")
plt.plot(n_components, fa_scores, "r", label="FA scores")
plt.xlabel("Number of components")
plt.ylabel("CV scores")
plt.legend()
plt.title("Homoscedastic Noise")
plt.show()
Output:
[Figure: PCA vs. FA cross-validation scores on the homoscedastic noise dataset]
Heteroscedastic Noise
Heteroscedastic noise has an unequal variance across values of the independent variables: different features (or regions of the data) are corrupted by noise of different magnitudes, which often arises from complex relationships between variables.
1. Import the necessary libraries and create the Heteroscedastic Noise dataset
Python3
n_samples = 1000
n_features = 30
mean = 0
sigma = 2.5

rng = np.random.RandomState(23)

# each feature gets its own noise scale, so the noise variance differs
# from feature to feature (heteroscedastic noise)
sigmas = sigma * rng.rand(n_features)
hetero_noise = sigmas * rng.normal(mean, sigma, (n_samples, n_features))
X_heteroscedastic = rng.normal(mean, sigma, (n_samples, n_features)) + hetero_noise

print("Heteroscedastic Noise Dataset Shape:", X_heteroscedastic.shape)
Output:
Heteroscedastic Noise Dataset Shape: (1000, 30)
2. Fit PCA and Factor Analysis and compute the cross-validation scores
Python3
pca = PCA(svd_solver="full")
fa = FactorAnalysis()


def compute_score(X, model, n_components):
    scores = []
    for n in n_components:
        model.n_components = n
        scores.append(np.mean(cross_val_score(model, X)))
    return scores


n_components = [0, 5, 10, 15, 20, 25, 30]
pca_scores = compute_score(X_heteroscedastic, pca, n_components)
fa_scores = compute_score(X_heteroscedastic, fa, n_components)
3. Plot the PCA and FA scores against the number of components
Python3
import matplotlib.pyplot as plt

plt.plot(n_components, pca_scores, "g", label="PCA scores")
plt.plot(n_components, fa_scores, "r", label="FA scores")
plt.xlabel("Number of components")
plt.ylabel("CV scores")
plt.legend()
plt.title("Heteroscedastic Noise")
plt.show()
Output:
[Figure: PCA vs. FA cross-validation scores on the heteroscedastic noise dataset]
Example:
To illustrate model selection with Probabilistic PCA and Factor Analysis (FA) in Scikit-learn, let's apply both techniques to the Digits dataset. We will use the GridSearchCV class to search over the number of components and find the best parameters for each model (Scikit-learn's PCA class provides the probabilistic PCA likelihood through its score method, so it serves as the PPCA model here). The code below loads the dataset, defines the parameter grid, fits both models, accesses the best estimators and their parameters, and transforms the data with the best models. The output shows the selected parameters and the transformed data obtained from both PCA and Factor Analysis.
Python3
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import GridSearchCV

# load the Digits dataset (1797 samples, 64 features)
X = datasets.load_digits().data

# search over the number of components; with no scoring argument,
# GridSearchCV uses each model's own score (average log-likelihood)
param_grid = {'n_components': [2, 5, 10]}
ppcamodel = GridSearchCV(PCA(), param_grid=param_grid)
famodel = GridSearchCV(FactorAnalysis(), param_grid=param_grid)

ppcamodel.fit(X)
famodel.fit(X)

best_ppca_model = ppcamodel.best_estimator_
best_ppca_params = ppcamodel.best_params_
best_fa_model = famodel.best_estimator_
best_fa_params = famodel.best_params_

X_ppca = best_ppca_model.transform(X)
X_fa = best_fa_model.transform(X)

print("Best PCA Model:")
print(best_ppca_model)
print("Best PCA Parameters:")
print(best_ppca_params)
print("\nBest Factor Analysis Model:")
print(best_fa_model)
print("Best Factor Analysis Parameters:")
print(best_fa_params)
print("\nTransformed Data using PCA:")
print(X_ppca)
print("\nTransformed Data using Factor Analysis:")
print(X_fa)
Output:
Best PCA Model:
PCA(n_components=10)
Best PCA Parameters:
{'n_components': 10}
Best Factor Analysis Model:
FactorAnalysis(n_components=10)
Best Factor Analysis Parameters:
{'n_components': 10}
Transformed Data using PCA:
[[ -1.25946749 21.27488252 -9.46305634 ... 2.55462354 -0.58278883
3.62919484]
[ 7.95760992 -20.76870158 4.43950645 ... -4.6158487 3.58974259
-1.07981018]
[ 6.99191503 -9.95598027 2.95855308 ... -16.41785644 0.71701599
4.25521831]
...
[ 10.80128272 -6.960248 5.59955483 ... -7.4183565 -3.96726241
-13.06151415]
[ -4.87210282 12.42395632 -10.17086414 ... -4.36248613 3.93943916
-13.15159048]
[ -0.3443907 6.36555315 10.77370724 ... 0.66827285 -4.11461914
-12.56011197]]
Transformed Data using Factor Analysis:
[[-0.13967445 -0.34673074 0.5564195 ... -0.83954767 0.09716367
0.34834512]
[-0.87463488 -0.21243303 -0.4980917 ... 0.00387598 -0.26655999
0.79413425]
[-1.07614501 0.64196322 -0.27097307 ... -1.32789623 -0.91709769
-1.66106236]
...
[-0.70284004 -0.07191784 -0.69943904 ... -0.62932638 -1.31438831
1.22479397]
[-0.33269469 -0.0346382 1.36587899 ... -0.87243003 -0.0784538
0.63416391]
[ 0.60585414 0.83341048 -0.34026351 ... 0.1495371 -0.94304955
0.74673146]]
The output demonstrates the results of the model selection process using Probabilistic PCA and Factor Analysis in Scikit-learn. It includes the best model and its parameters, as well as the transformed data obtained from the best models.
The output begins by displaying the best PCA model and its parameters. In this example, the best PCA model has n_components=10, indicating that it reduces the dimensionality of the input data to 10 components. Similarly, the best Factor Analysis model and its parameters are shown, where n_components=10 denotes the number of components retained in the transformed data.
Following the model information, the transformed data from PCA and Factor Analysis is presented. It is the original input projected onto the lower-dimensional space chosen by model selection; both transformed arrays have shape (n_samples, n_components), here (1797, 10).
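A quick check of those shapes is shown below, assuming the variables from the snippet above are still defined; the expected values follow from the 1,797 samples in the Digits dataset and the selected 10 components.
Python3
# each reduced dataset has one row per sample and one column per component
print(X_ppca.shape)  # (1797, 10)
print(X_fa.shape)    # (1797, 10)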
Comparison and Choosing the Right Method:
Probabilistic PCA and Factor Analysis are both popular methods for dimensionality reduction, but they have distinct characteristics that may influence the choice between them. Here are some points to consider when deciding which method to use:
- Objective: Probabilistic PCA finds a low-dimensional representation by maximizing the likelihood of the observed data under a generative model in which each observation is produced from a lower-dimensional latent vector. Factor Analysis likewise assumes a linear relationship between the observed variables and the latent factors plus a noise term, and its objective is to estimate the latent factors (and the loading matrix) underlying the observed data.
- Assumptions: Both are linear-Gaussian models; the key difference lies in the noise. Probabilistic PCA assumes isotropic noise, i.e. every feature shares the same noise variance, whereas Factor Analysis allows a separate noise variance for each observed variable (a diagonal noise covariance). If the data deviate strongly from these assumptions, the results may be affected.
- Dimensionality Reduction: Both methods produce a linear projection onto a low-dimensional space, so neither captures non-linear structure. In practice, PPCA can underfit data whose features are measured at very different noise levels, while Factor Analysis handles such heteroscedastic data more gracefully, as the experiments above and the short sketch after this list illustrate.
- Interpretability: Factor Analysis provides a more interpretable representation since it explicitly estimates the relationship between the observed variables and the latent factors. The latent factors can be interpreted as underlying factors influencing the observed data. In contrast, Probabilistic PCA focuses on finding the low-dimensional representation without explicitly interpreting the latent factors.
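A minimal sketch of that difference, assuming the X_heteroscedastic array generated earlier is still in scope: after fitting, PCA exposes a single estimated noise variance (noise_variance_ is a scalar when n_components < n_features), while FactorAnalysis estimates one noise variance per feature (noise_variance_ is an array of length n_features).
Python3
from sklearn.decomposition import PCA, FactorAnalysis

pca = PCA(n_components=5, svd_solver="full").fit(X_heteroscedastic)
fa = FactorAnalysis(n_components=5).fit(X_heteroscedastic)

# PCA assumes one shared (isotropic) noise variance
print("PCA noise variance:", pca.noise_variance_)

# Factor Analysis fits a separate noise variance for every feature
print("FA noise variances:", fa.noise_variance_)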
Comparison table between Probabilistic PCA and Factor Analysis:
| S.No. | Features | Probabilistic PCA | Factor Analysis |
| --- | --- | --- | --- |
| 1. | Objective | Maximizes the likelihood of the observed data under a latent-variable model | Estimates the latent factors underlying the observed data |
| 2. | Assumptions | Linear-Gaussian model with isotropic noise (the same variance for every feature) | Linear-Gaussian model with a separate noise variance for each observed variable |
| 3. | Dimensionality | Linear projection onto a low-dimensional latent space | Linear projection via the estimated loading matrix |
| 4. | Interpretability | Less interpretable; focuses on the low-dimensional representation | More interpretable; the loading matrix explicitly relates observed variables to latent factors |
| 5. | Data Distribution | Gaussian observations with homoscedastic noise | Gaussian observations formed as a linear combination of latent factors plus heteroscedastic noise |
| 6. | Noise Handling | Can underfit when features have very different noise levels | Handles feature-specific (heteroscedastic) noise naturally |
| 7. | Data Type | Suitable for high-dimensional data with roughly uniform noise | Suitable for understanding underlying factors, especially with feature-specific noise |
| 8. | Performance | Fast dimensionality reduction (closed-form SVD solution) | Good for interpreting relationships; fitted iteratively and typically slower |
In general, Probabilistic PCA is suitable for high-dimensional data whose features share a roughly uniform noise level, and when interpretability of the latent factors is not the primary concern.
Factor Analysis, on the other hand, is preferred when interpretability and an understanding of the underlying factors driving the observed data are important, or when different features carry different amounts of noise.
The choice between Probabilistic PCA and Factor Analysis depends on the characteristics of the dataset and the goals of the analysis. It is recommended to experiment with both methods and evaluate their performance in terms of dimensionality reduction and the interpretability of the results.
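As a hedged illustration of that recommendation, the sketch below compares the two model families by mean cross-validated log-likelihood and reports whichever scores higher; the helper name choose_model is hypothetical, and X_heteroscedastic is the dataset generated earlier.
Python3
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score


def choose_model(X, n_components=5):
    # the higher mean held-out log-likelihood wins
    candidates = {
        "PPCA": PCA(n_components=n_components, svd_solver="full"),
        "FA": FactorAnalysis(n_components=n_components),
    }
    scores = {name: np.mean(cross_val_score(model, X))
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


best_name, cv_scores = choose_model(X_heteroscedastic)
print(best_name, cv_scores)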
Conclusion:
In this article, we explored the utilization of Probabilistic PCA and Factor Analysis in Scikit-Learn for model selection in dimensionality reduction tasks. By leveraging Scikit-Learn’s GridSearchCV, we efficiently evaluated various parameter combinations and identified the best models based on the specified scoring metric.
Both Probabilistic PCA and Factor Analysis offer valuable techniques for dimensionality reduction, each with its own strengths. Probabilistic PCA handles high-dimensional datasets with a roughly uniform noise level efficiently, while Factor Analysis provides interpretable representations by uncovering latent factors and accommodates feature-specific noise.
The choice between Probabilistic PCA and Factor Analysis depends on the specific characteristics of the dataset and the objectives of the analysis. Probabilistic PCA is suitable when the noise can be treated as homoscedastic, whereas Factor Analysis is preferable when interpretability, the underlying factors, or heteroscedastic noise matter.
By applying these techniques, researchers and practitioners can reduce the dimensionality of datasets, which often improves subsequent machine learning tasks: dimensionality reduction lowers computational cost and can suppress noise and irrelevant features, as sketched below.
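As one hedged example of such a downstream use (this pipeline is an illustration, not part of the model-selection code above, and the parameter values are arbitrary), the reduced representation can be fed directly into a classifier with Scikit-learn's Pipeline:
Python3
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = datasets.load_digits(return_X_y=True)

# reduce to 10 components, then classify the reduced representation
clf = Pipeline([
    ("reduce", PCA(n_components=10, svd_solver="full")),
    ("model", LogisticRegression(max_iter=1000)),
])
print("Mean CV accuracy:", cross_val_score(clf, X, y).mean())
Swapping FactorAnalysis(n_components=10) in for the PCA step is a one-line change, which makes it easy to compare the two reducers on the downstream task as well.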
In summary, Probabilistic PCA and Factor Analysis serve as powerful tools for dimensionality reduction in Scikit-Learn. Understanding their strengths and characteristics enables us to select the most appropriate approach for a specific dataset and analysis goal, and model selection techniques such as GridSearchCV allow us to fine-tune parameters and identify the optimal models. By harnessing these techniques, we can extract insights from high-dimensional data and enhance the efficiency and accuracy of our machine learning workflows.