Implementing PCA in Python with scikit-learn
Last Updated :
18 May, 2025
Principal Component Analysis (PCA) is a dimensionality reduction technique. It transform high-dimensional data into a smaller number of dimensions called principal components and keeps important information in the data. In this article, we will learn about how we implement PCA in Python using scikit-learn. Here are the steps:
Step 1: Import necessary libraries
We import all the libraries needed like numpy , pandas, matplotlib, seaborn and scikit learn.
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
Step 2: Load the Data
We will use breast cancer dataset. This dataset has 569 data items with 30 input attributes. There are two output classes-benign and malignant. This reads the dataset file and displays the first 5 rows. You can download the dataset from here.
Python
df = pd.read_csv('data.csv')
df.head()
Output:
DatasetStep 3: Data Cleaning and Preprocessing
It drops unnecessary columns like id, Unnamed: 32 and converts diagnosis co
lumn: Malignant to 1 and Benign to 0.
Python
df.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
Step 4: Separate Features and Target
In this separate features X
contains input features (30 columns) and y
contains the target labels (0 or 1)
Python
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']
Step 5: Standardize the Data
StandardScaler transforms features so they all have a mean = 0 and standard deviation = 1 which helps PCA to treat all features equally.
Python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled[:2])
Output:
Standard ScalerStep 6: Apply PCA Algorithm
It reduces the data to 2 principal components. PCA finds combinations of original features that explain the most variation in the data.
Python
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(X_pca[:2])
Output:
[[ 9.19283683 1.94858307]
[ 2.3878018 -3.76817174]]
We reduce 30 features to 2 components. Each row now has 2 values (PC1, PC2) instead of 30. These components contain the most variation from original data.
Step 7: Explained Variance
It tells how much information each principal component holds.
Python
print("Explained variance:", pca.explained_variance_ratio_)
print("Cumulative:", np.cumsum(pca.explained_variance_ratio_))
Output:
- Explained variance: [0.44272026 0.18971182]
- Cumulative: [0.44272026 0.63243208]
PC1 explains 44% of data and PC2 explains 19%. Combined these 2 components explain 63% of all data variation.
Step 8: Visualization Before vs After PCA
First plot shows original scaled data using first 2 features and second plot shows reduced data using PCA's 2 components. Colors represent diagnosis Benign or Malignant.
Python
plt.figure(figsize=(8,6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, cmap='coolwarm', edgecolor='k')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Original Data (First Two Features)")
plt.colorbar(label="Diagnosis")
plt.show()
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm', edgecolor='k')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Transformed Data")
plt.colorbar(label="Diagnosis")
plt.show()
Output:
Original Data
PCA Transformed DataStep 9: Train a Model on PCA Data
It splits PCA data into training and test sets. Train a Logistic Regression model to classify tumors and predicts and evaluate the model.
Python
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Output:
Classification reportStep 10: Confusion Matrix
It shows how many predictions were correct or incorrect and helps to visualize true vs. false predictions.
Python
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Benign', 'Malignant'],
yticklabels=['Benign', 'Malignant'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Output:
Confusion matrixPCA reduces data size but some information is lost. This step converts reduced data back to its original shape and measures how much data was lost in the reduction process.
Python
X_reconstructed = pca.inverse_transform(X_pca)
reconstruction_loss = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Reconstruction Loss: {reconstruction_loss:.4f}")
Output:
Reconstruction Loss: 0.3676
As shows how much info was lost during PCA. A loss of 0.3676
means PCA with 2 components retains good structure.
Complete Code: click here
Similar Reads
Python Tutorial - Learn Python Programming Language Python is one of the most popular programming languages. Itâs simple to use, packed with features and supported by a wide range of libraries and frameworks. Its clean syntax makes it beginner-friendly. It'sA high-level language, used in web development, data science, automation, AI and more.Known fo
10 min read
Machine Learning Tutorial Machine learning is a branch of Artificial Intelligence that focuses on developing models and algorithms that let computers learn from data without being explicitly programmed for every task. In simple words, ML teaches the systems to think and understand like humans by learning from the data.Machin
5 min read
Non-linear Components In electrical circuits, Non-linear Components are electronic devices that need an external power source to operate actively. Non-Linear Components are those that are changed with respect to the voltage and current. Elements that do not follow ohm's law are called Non-linear Components. Non-linear Co
11 min read
Linear Regression in Machine learning Linear regression is a type of supervised machine-learning algorithm that learns from the labelled datasets and maps the data points with most optimized linear functions which can be used for prediction on new datasets. It assumes that there is a linear relationship between the input and output, mea
15+ min read
Spring Boot Tutorial Spring Boot is a Java framework that makes it easier to create and run Java applications. It simplifies the configuration and setup process, allowing developers to focus more on writing code for their applications. This Spring Boot Tutorial is a comprehensive guide that covers both basic and advance
10 min read
Support Vector Machine (SVM) Algorithm Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It tries to find the best boundary known as hyperplane that separates different classes in the data. It is useful when you want to do binary classification like spam vs. not spam or
9 min read
Logistic Regression in Machine Learning Logistic Regression is a supervised machine learning algorithm used for classification problems. Unlike linear regression which predicts continuous values it predicts the probability that an input belongs to a specific class. It is used for binary classification where the output can be one of two po
11 min read
Class Diagram | Unified Modeling Language (UML) A UML class diagram is a visual tool that represents the structure of a system by showing its classes, attributes, methods, and the relationships between them. It helps everyone involved in a projectâlike developers and designersâunderstand how the system is organized and how its components interact
12 min read
100+ Machine Learning Projects with Source Code [2025] This article provides over 100 Machine Learning projects and ideas to provide hands-on experience for both beginners and professionals. Whether you're a student enhancing your resume or a professional advancing your career these projects offer practical insights into the world of Machine Learning an
5 min read
K means Clustering â Introduction K-Means Clustering is an Unsupervised Machine Learning algorithm which groups unlabeled dataset into different clusters. It is used to organize data into groups based on their similarity. Understanding K-means ClusteringFor example online store uses K-Means to group customers based on purchase frequ
4 min read