Incremental Learning with Scikit-learn

Last Updated : 31 Jul, 2025

Incremental learning is a technique where a machine learning model learns from data in small chunks or batches rather than all at once. This is useful when working with very large datasets or streaming data that can't fit into memory. Scikit-learn, a popular machine learning library for Python, supports incremental learning through estimators that implement the partial_fit() method, which lets you train a model on one batch at a time, update it continuously with new data and avoid retraining from scratch.

Incremental Learning

Incremental learning trains a model gradually, one small batch of data at a time, instead of on the entire dataset at once.
This approach is particularly useful for large-scale or streaming data that cannot fit into memory.
Rather than starting over every time new data becomes available, the model updates its existing parameters with each new batch, so what it learned from earlier batches is carried forward.
This makes incremental learning a good fit for real-time applications such as fraud detection, recommendation systems and monitoring systems, where data evolves continuously.

Implementation

Step 1: Import Required Libraries

This code imports the Python libraries needed to build and evaluate the model. pandas and numpy are used for data manipulation and numerical operations.
SGDClassifier from sklearn.linear_model is a linear classifier trained with stochastic gradient descent, and StandardScaler standardizes features so that gradient-based training behaves better.
accuracy_score and classification_report measure the performance of the model, and shuffle randomizes the order of the samples.

Python

import pandas as pd
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import shuffle

Step 2: Load Dataset

This line reads the creditcard.csv file into a pandas DataFrame named df.
It loads the dataset into memory so it can be processed and analyzed using pandas functions.

Python

df = pd.read_csv("creditcard.csv")
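
When the file itself is too large for memory, pandas can read it in pieces instead. The sketch below is an illustration under assumptions: it writes a small hypothetical CSV (demo.csv, with made-up columns) to a temporary folder so that it runs anywhere, then iterates over it with the chunksize argument of read_csv.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write a small stand-in CSV so the sketch is runnable (hypothetical data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["Time", "V1", "Amount"])
demo["Class"] = rng.integers(0, 2, size=1000)
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
demo.to_csv(path, index=False)

# With chunksize set, read_csv returns an iterator of DataFrames
rows_seen = 0
n_chunks = 0
for chunk in pd.read_csv(path, chunksize=250):
    X_chunk = chunk.drop("Class", axis=1).values
    y_chunk = chunk["Class"].values
    # A model's partial_fit(X_chunk, y_chunk, ...) call would go here
    rows_seen += len(chunk)
    n_chunks += 1

print(n_chunks, rows_seen)
```

Only one chunk is held in memory at a time, so the same loop works for files far larger than RAM.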

Step 3: Separate Features and Target

These lines separate the dataset into features and target labels.
X contains all the input features, obtained by dropping the "Class" column, while y stores the target values from the "Class" column, which indicates whether a transaction is fraudulent or not.

Python

X = df.drop("Class", axis=1).values
y = df["Class"].values

Step 4: Normalize Time and Amount Features

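This code creates a StandardScaler to standardize the "Time" and "Amount" columns. Assuming the usual creditcard.csv layout (Time, V1 to V28, Amount, Class), these are the first and last columns of X once "Class" has been dropped.
It scales them to have zero mean and unit variance, which generally helps gradient-based models such as SGDClassifier. Note that the scaler is fitted on the full dataset here, which is a simplification for this demonstration.

Python

scaler = StandardScaler()
# Column 0 is "Time" and the last column is "Amount" (assumes the standard file layout)
X[:, [0, -1]] = scaler.fit_transform(X[:, [0, -1]])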
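
In a genuinely streaming setting the full dataset is not available for fitting the scaler. StandardScaler also implements partial_fit, so its statistics can be accumulated chunk by chunk. The following sketch uses made-up random data to show that the incremental statistics match those of a single fit on everything.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for two unscaled columns
rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=(1000, 2))

# Accumulate mean and variance one chunk at a time
inc_scaler = StandardScaler()
for chunk in np.array_split(data, 4):
    inc_scaler.partial_fit(chunk)

# Reference: statistics computed from all rows at once
full_scaler = StandardScaler().fit(data)

means_match = np.allclose(inc_scaler.mean_, full_scaler.mean_)
scales_match = np.allclose(inc_scaler.scale_, full_scaler.scale_)
print(means_match, scales_match)
```

One design choice remains open in practice: either make a first pass to fit the scaler and a second pass to train, or transform each chunk with the statistics seen so far and accept that early chunks are scaled with rougher estimates.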

Step 5: Shuffle Data to Simulate Streaming

This line shuffles the feature matrix X and target vector y in unison, so that any pattern in the original row order (for example, transactions sorted by time) does not influence training and each batch resembles a random sample of the stream.
Setting random_state=42 makes the shuffle reproducible.

Python

X, y = shuffle(X, y, random_state=42)

Step 6: Initialize the Incremental Model

This line initializes an SGDClassifier with logistic loss for binary classification.
The incremental behavior comes from partial_fit, which performs a single pass over each batch it receives and keeps the learned weights between calls. The max_iter and warm_start parameters only affect fit(), so max_iter=1 and warm_start=True are harmless here but not required.

Python

model = SGDClassifier(loss='log_loss', max_iter=1, warm_start=True)

Step 7: Define Classes for partial_fit

This line extracts the unique class labels from the target array y using np.unique().
partial_fit needs the complete list of classes on its first call, because an individual batch may not contain every class (for example, a batch with no fraudulent transactions).

Python

classes = np.unique(y)

Step 8: Define Batch Size and Number of Batches

This code sets a batch size of 10,000 and calculates the number of full batches using integer division of the total number of samples by the batch size.
Any samples left over after the last full batch are not used for training in the loop that follows.

Python

batch_size = 10000
n_batches = X.shape[0] // batch_size
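
If the leftover samples should be used as well, one option is a small generator that yields a final, shorter batch. The helper name iter_batches is hypothetical and the arrays are toy data.

```python
import numpy as np

def iter_batches(X, y, batch_size):
    """Yield (X_batch, y_batch) slices, including a final short batch."""
    for start in range(0, X.shape[0], batch_size):
        end = start + batch_size
        # Slicing past the end of an array is safe; it simply stops at the last row
        yield X[start:end], y[start:end]

# Toy arrays: 25 samples split into batches of 10
X_demo = np.arange(50).reshape(25, 2)
y_demo = np.arange(25)
sizes = [len(y_b) for _, y_b in iter_batches(X_demo, y_demo, batch_size=10)]
print(sizes)  # [10, 10, 5]
```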

Step 9: Train Model Incrementally in Batches

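This loop trains the model incrementally on batches of data. For each batch it selects a slice of features and targets, then uses partial_fit to update the model.
The first call passes the full list of classes to initialize the model properly. Every 5 batches it predicts on the current batch and prints the accuracy so you can monitor training progress. Because this score is computed on a batch the model has just been trained on, it tends to be optimistic, and since fraudulent transactions are rare, accuracy alone can look high even when most fraud is missed.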

Python

for i in range(n_batches):
    start = i * batch_size
    end = start + batch_size
    X_batch = X[start:end]
    y_batch = y[start:end]
    
    if i == 0:
        model.partial_fit(X_batch, y_batch, classes=classes)
    else:
        model.partial_fit(X_batch, y_batch)

    if i % 5 == 0:
        y_pred = model.predict(X_batch)
        acc = accuracy_score(y_batch, y_pred)
        print(f"Batch {i + 1}, Accuracy: {acc:.4f}")

Step 10: Final Evaluation on Last Batch

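This code predicts the labels for the last batch_size rows of the data and then prints a detailed classification report.
The report includes precision, recall and F1 score for each class, which are more informative than accuracy when one class is rare. Keep in mind that some of these rows were also used for training, so this is a sanity check rather than an estimate of performance on unseen data.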

Python

y_pred = model.predict(X[-batch_size:])
print("\nFinal Batch Classification Report:\n")
print(classification_report(y[-batch_size:], y_pred))

Applications

Fraud Detection: Financial fraud is dynamic, with new attack patterns emerging regularly. Incremental learning helps update models quickly with recent transactions to detect anomalies in real time without full retraining.
Recommendation Systems: User interests change rapidly on platforms such as e-commerce sites or streaming services. By learning incrementally from each user interaction, models stay up to date and deliver more relevant, personalized content.
Sensor and IoT Analytics: Smart devices and industrial IoT generate massive continuous data streams. Incremental models can analyze this data on the fly, helping with tasks like predictive maintenance or real-time monitoring.
Social Media Monitoring: Trends and opinions on platforms like Twitter and Instagram shift constantly. Incremental learning allows sentiment analysis or topic classification models to stay current by processing recent posts in batches.

shrurfu5
