
MNIST (Modified National Institute of Standards and Technology)

MNIST: The Handwritten Digit Classification Benchmark
• The MNIST (Modified National Institute of Standards and Technology)
dataset is a widely used benchmark for image classification tasks, specifically
focused on handwritten digit recognition (0-9). It's a cornerstone project for
beginners in machine learning and deep learning due to its:
• Simplicity: The images are grayscale (28x28 pixels), making them
relatively easy to process and understand compared to more complex image
datasets.
• Availability: The dataset is readily available from various sources,
including scikit-learn's fetch_openml function.
• Well-Documented: Extensive documentation explains the dataset's
structure, format, and usage.
Structure of the MNIST Dataset:
• The MNIST dataset consists of two primary parts: training and
testing sets.
• Training Set: Contains 60,000 handwritten digit images and
their corresponding labels (0-9). This set is used to train your
machine learning model.
• Testing Set: Contains 10,000 handwritten digit images and their
labels. This set is used to evaluate the performance of your
trained model on unseen data.
Key Components:
• Images: Each image in the dataset represents a handwritten
digit (0-9). These images are grayscale, meaning each pixel has
an intensity value ranging from 0 (black) to 255 (white). The
images are typically flattened into a one-dimensional array of 784
values (28 pixels x 28 pixels).
• Labels: Each image has a corresponding label indicating the
digit it represents (0-9). These labels are typically provided as
integers or strings.
MNIST Project Workflow:
1.Data Loading: Use scikit-learn's fetch_openml function or other libraries to
load the MNIST dataset.
2.Data Preprocessing (Optional): Depending on your chosen model, you
might need to perform preprocessing steps like normalization or
standardization. This can improve the training process and model performance.
3.Model Selection: Choose a suitable machine learning model for image
classification, such as:
– K-Nearest Neighbors (KNN): A simple yet effective approach for classification tasks.
– Support Vector Machines (SVMs): Powerful classifiers that can create decision
boundaries to separate different classes.
– Multi-Layer Perceptrons (MLPs) or Convolutional Neural Networks (CNNs): Deep
learning models particularly well-suited for image classification.
Cont..
4.Model Training: Train your chosen model using the training set. This involves
feeding the image data and corresponding labels into the model, allowing it to
learn the patterns that distinguish different handwritten digits.
5.Model Evaluation: Evaluate the performance of your trained model using the
testing set. This involves testing the model on unseen data and measuring its
accuracy (how well it predicts the correct digit for new images). Metrics like
classification accuracy or confusion matrix can be used.
6.Hyperparameter Tuning (Optional): If necessary, you can fine-tune the
hyperparameters of your model to improve its performance. Hyperparameters
are settings that control the learning process of the model.
7.Visualization (Optional): You can visualize the learned features or decision
boundaries of your model to gain insights into how it differentiates between
digits.
MNIST Project Benefits:
• Learning Fundamentals: Working with MNIST provides a
hands-on introduction to essential machine learning concepts like
data loading, preprocessing, model selection, training, evaluation,
and visualization.
• Deep Learning Exploration: MNIST can be a stepping stone
towards exploring deep learning techniques like CNNs, which are
powerful for various image recognition tasks.
• Benchmarking: You can compare your model's performance
with other implementations or baseline models to evaluate its
effectiveness.
MNIST
Scikit-learn provides a convenient way to access various datasets for machine
learning tasks. These datasets come in a structured format that makes it easy
to work with them. Let's break down this structure in detail:
1. DESCR Key:
• Imagine a dataset as a box of cards. The DESCR key acts like a label on the
box, describing the contents. It contains information about the dataset itself,
such as:
– The number of samples (data points)
– The number of features (attributes) for each sample
– The meaning of each feature
– Any other relevant details about the data
Cont..
2. data Key:
This key holds the heart of the data – the actual samples. It's like
the main stack of cards in the box. Each card represents a single
instance or data point. These cards are organized in a two-
dimensional NumPy array:
• Rows: Each row represents a single data point (like an image).
• Columns: Each column represents a specific feature of that
data point.
For example, if the dataset contains images of handwritten digits,
each row might represent a single image, and each column might
hold the intensity value of a particular pixel in that image.
Cont..

3. target Key:
• This key stores the labels or target values associated with
each data point. These are like separate label cards
accompanying the data cards. In the handwritten digit
example, the target card would tell you the actual digit (0-
9) represented by the image in the corresponding data
card.
Example with MNIST Dataset:
• The MNIST dataset is a popular example used in image classification tasks.
It contains thousands of handwritten digit images. Let's see how Scikit-learn
provides access to this data:
• Python
– from sklearn.datasets import fetch_openml
– mnist = fetch_openml('mnist_784', as_frame=False) # Load MNIST; in recent scikit-learn versions, as_frame=False returns NumPy arrays instead of a DataFrame
– X, y = mnist["data"], mnist["target"] # Separate data and target arrays
• Here, X holds the image data (features) and y holds the corresponding digit
labels (targets).
Understanding the Data Size:
• We can use the shape attribute to understand the
dimensions of the data and target arrays:
• Python
– print(X.shape) # Output: (70000, 784)
– print(y.shape) # Output: (70000,)
• The first number in X.shape (70,000) represents the total
number of images (data points).
• The second number (784) represents the number of features
for each image. Since each image in MNIST is 28x28 pixels,
784 represents the total number of pixels (28 * 28).
Visualizing a Single Image:
To visualize a single image from the dataset, we can:
1.Pick a specific data point (e.g., X[0]).
2.Reshape it to a 28x28 array to represent the image dimensions.
3.Use libraries like Matplotlib to display the image, as in the sketch below.
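A minimal sketch of these three steps, assuming X and y are the NumPy arrays loaded earlier (with as_frame=False):
Python
import matplotlib.pyplot as plt

some_digit = X[0]                              # 1. pick one data point (784 values)
some_digit_image = some_digit.reshape(28, 28)  # 2. reshape to the 28x28 image grid

plt.imshow(some_digit_image, cmap="binary")    # 3. display it as a grayscale image
plt.axis("off")
plt.show()

print(y[0])  # the corresponding label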
Splitting Data into Training and Testing Sets:
• Before training a machine learning model, it's crucial to split the data into
training and testing sets. The training set is used to train the model, while the
testing set is used to evaluate its performance on unseen data.
• MNIST comes pre-split:
– Python
– X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
• This code separates the data and target arrays into training and testing sets.
The first 60,000 images and labels are used for training, and the remaining
10,000 are kept for testing.
Cont..
• Shuffling the Training Data:
– Some Scikit-learn estimators (such as SGDClassifier) shuffle the training
data by default, and you can also shuffle the training set yourself (see the
sketch below). This ensures that the model is exposed to a diverse set of
examples during training and avoids biases from any ordering of the data.
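One way to shuffle the training set yourself (a minimal sketch, assuming X_train and y_train are the 60,000-entry NumPy arrays from the split above):
Python
import numpy as np

shuffle_index = np.random.permutation(60000)  # a random ordering of the 60,000 indices
X_train = X_train[shuffle_index]              # reorder the images
y_train = y_train[shuffle_index]              # reorder the labels the same way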
• Conclusion:
– Understanding the structure of Scikit-learn datasets is essential for
working with them effectively. The DESCR key provides context, the data
key holds the actual features, and the target key stores the labels. By
separating data into training and testing sets and shuffling training data,
Scikit-learn helps you prepare your data for robust machine learning
experiments.
Binary Classifier for Digit Recognition:
Building a model that can identify just one digit: the number 5. This type of
model is called a binary classifier because it can only distinguish between two
classes - images containing the digit 5 and images that don't (everything else).
• Preparing Training Data:
• 1.Target Vectors: We need to modify the original target labels (y_train and
y_test) to fit our binary classification problem. We create new target vectors
(y_train_5 and y_test_5) using a comparison trick.
– 1.For each data point in the original target (y_train or y_test), if the label is
5, the corresponding element in the new target vector (y_train_5 or
y_test_5) becomes True. Otherwise, it becomes False.
– 2.This essentially creates a new label that only indicates "yes" for the digit
5 and "no" for all other digits.
illustrate how target vectors are modified for binary
classification:
Original Target Vectors (y_train) and Sample Images:
Imagine we have a small subset of the MNIST dataset with just 4
images and their corresponding original target labels:
y_train = [3, 1, 5, 7] # Original target labels (digit the image
represents)
Here, each element in y_train represents the digit in the
corresponding image:
Image 1: Digit 3
Image 2: Digit 1
Image 3: Digit 5
Image 4: Digit 7
Creating Binary Target Vectors (y_train_5) for Digit 5
Classification:
Now, we want to create a new target vector (y_train_5) for a binary
classification task focusing on digit 5:
Python
y_train_5 = []
for label in y_train:
    if label == 5:
        y_train_5.append(True)   # True for digit 5
    else:
        y_train_5.append(False)  # False for other digits
• This code iterates through each label in y_train and creates a new element in
y_train_5:
• If the label is 5, True is appended to y_train_5, indicating the image belongs
to the "5" class.
• If the label is any other digit (3, 1, or 7 in this case), False is
appended, indicating the image belongs to the "not 5" class.
Resulting Binary Target Vector (y_train_5):
• y_train_5 = [False, False, True, False]
• Here's the breakdown of the new target vector:
• Image 1 (originally digit 3): False (not 5)
• Image 2 (originally digit 1): False (not 5)
• Image 3 (originally digit 5): True (is 5)
• Image 4 (originally digit 7): False (not 5)
Using Binary Target Vectors:
• This new binary target vector y_train_5 can now be used to train a
machine learning model to distinguish images containing the digit
5 from all other digits in the dataset. The model will learn to
identify patterns and features specific to the digit 5 based on the
association with True in y_train_5.
• This is a simplified example, but it demonstrates how target
vectors are adapted for binary classification tasks by converting
the original multi-class labels into a binary representation that
focuses on a particular class of interest.
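In practice, the same binary target vectors are usually built with a single vectorized comparison instead of a loop. A minimal sketch, assuming y_train and y_test are NumPy arrays (fetch_openml returns the labels as strings, hence the cast to integers first):
Python
import numpy as np

y_train = y_train.astype(np.uint8)  # labels arrive as strings; convert to integers
y_test = y_test.astype(np.uint8)

y_train_5 = (y_train == 5)  # True for every image of a 5, False for all other digits
y_test_5 = (y_test == 5)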
Choosing a Classifier
• Scikit-learn offers various machine learning algorithms. Here, we'll
use a Stochastic Gradient Descent (SGD) classifier implemented
by the SGDClassifier class.
• Advantages of SGD Classifier:
• Efficiency with Large Datasets: SGD is efficient for handling
large datasets because it processes data points individually
instead of requiring the entire dataset at once. This makes it
suitable for problems with massive amounts of data.
• Online Learning: SGD can be used for online learning, where data
arrives continuously, and the model updates itself incrementally
with each new data point.
Working of SGD
• Start with guess: You start with some basic ideas of what spam might look
like (all caps, weird symbols, free money offers). These are like your initial
guesses for spam filters (parameters).
• Random email check: Every now and then, you grab a single email from the
incoming pile (data point) or a small stack (mini-batch).
• Spam detective: You check the email for suspicious signs (calculate error).
Is it full of weird symbols? Does it have bad grammar?
• Learn from mistakes: If you miss a spam email (high error), you adjust your
spam filter slightly (update parameters) to be stricter on similar emails in the
future. The learning rate determines how much you adjust.
• Keep learning: You keep checking emails, learning from mistakes (both
catching spam and accidentally marking good emails as spam), until you
become a spam-fighting master (minimum error)
Training the Model:
1.Importing the Class: We import SGDClassifier from
sklearn.linear_model.
2.Creating the Classifier: We create an instance of SGDClassifier
and set the random_state parameter to a fixed value (e.g., 42) to
ensure reproducible results during training (the model might use
some randomness during the process).
3.Training: We call the fit method on the classifier object (sgd_clf)
and provide the training data (X_train) and the modified target
vectors (y_train_5) for the "5" classification task.
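A minimal sketch of these three steps, assuming X_train and y_train_5 were prepared as above:
Python
from sklearn.linear_model import SGDClassifier

# Create the classifier with a fixed random_state for reproducible results
sgd_clf = SGDClassifier(random_state=42)

# Train it on the full training set for the "5 vs. not-5" task
sgd_clf.fit(X_train, y_train_5)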
Making Predictions:
Once trained, we can use the classifier to predict the class (5 or not-
5) for a new image. We do this using the predict method:
Python
>>> sgd_clf.predict([some_digit])  # some_digit is a new image we want to classify
array([ True])
The output array([True]) indicates that the classifier predicts the new
image (some_digit) to be a 5. However, this is just one example. We need
to evaluate the model's overall performance on a separate test set.
Next Steps:
• The next section will likely focus on evaluating the
model's performance using the test data (X_test and
y_test_5). This will involve metrics like accuracy,
precision, recall, etc., to assess how well the model
generalizes to unseen data.
Performance Measures
• Evaluating classifiers is often trickier than evaluating regressors.
• Classifiers vs. Regressors:
• Classifiers: These models predict discrete categories (classes) for
data points. Examples include spam detection (spam/not-spam)
or image classification (cat/dog/other).
• Regressors: These models predict continuous values for data
points. Examples include predicting house prices or weather
forecasts.
Cont..
• Challenges in Evaluating Classifiers:
• Multiple Classes: Classifiers can have multiple output categories,
unlike regressors with a single continuous output value. This
makes it more complex to define a single "goodness-of-fit" metric.
• Classification Threshold: Classifier outputs might need to be
converted into concrete predictions. For instance, a spam
classifier might output a probability (0-1). We need a threshold
(e.g., 0.7) to decide if an email is classified as spam (above the
threshold) or not-spam (below the threshold). Choosing the right
threshold can impact evaluation metrics.
various performance metrics used
• Accuracy: Proportion of correctly classified data points.
• Precision: Ratio of true positives (correctly classified positives) to
all predicted positives.
• Recall: Ratio of true positives to all actual positives in the data.
• F1-score: Harmonic mean of precision and recall, combining both
metrics into a single score.
• AUC-ROC (Area Under the Receiver Operating Characteristic
Curve): A performance measure for binary classifiers that
considers all possible classification thresholds.
Implementing Cross-Validation
• Cross-validation is a technique used in machine learning to assess
how well a model performs on unseen data.
• Steps
– Split the data: The available data is divided into multiple folds or subsets.
– Train-test loop:
• One fold is used as the validation set, and the remaining folds are combined to form the
training set.
• The model is trained on the training set.
• The model's performance is evaluated on the validation set.
– Repeat and average: This process (steps 2a-2c) is repeated multiple times,
each time using a different fold as the validation set.
– Final evaluation: The results from each validation step (e.g., accuracy score)
are averaged to get a more robust estimate of the model's overall performance
on unseen data.
create your own cross-validation function
Step 1 : Splitting data multiple times:
– Instead of one split, you divide your data (X_train and y_train) into
smaller folds (like 3 folds in this example).
– This ensures the model is trained and tested on various data
combinations.
Step 2 :Stratified Sampling (for imbalanced data):
– The code uses StratifiedKFold which is especially useful when you have
imbalanced data (e.g., mostly not-spam emails).
– This ensures each fold has a similar proportion of classes (spam/not-
spam) as the whole data.
Cont..
Step 3: Training and Testing in a loop:
– The code loops through each fold.
– Inside the loop:
• It creates a copy (clone) of your classifier model (sgd_clf). This prevents
changes in one fold from affecting others.
• It trains the cloned model on the training data from the current fold
(X_train_folds, y_train_folds).
• It tests the trained model on the testing data from the current fold (X_test_fold,
y_test_fold) and makes predictions.
• It checks how many predictions were correct.
• Step 4:Evaluating Performance:
– After looping through all folds, it calculates the average accuracy (correct
predictions / total predictions) across all folds.
– This gives a more robust estimate of the model's performance on unseen data
compared to a single train-test split.
Code
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# Define the number of folds (3 in this case); shuffle with a fixed random state for reproducibility
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Loop through each fold in the StratifiedKFold object
for train_index, test_index in skfolds.split(X_train, y_train_5):
    # Clone the classifier to avoid affecting the original model
    clone_clf = clone(sgd_clf)
    # Separate training and testing data for the current fold
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    # Train the cloned classifier on the training data from the current fold
    clone_clf.fit(X_train_folds, y_train_folds)
    # Make predictions on the testing data from the current fold
    y_pred = clone_clf.predict(X_test_fold)
    # Calculate and print the accuracy (correct predictions / total predictions) for the current fold
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))
k-fold cross-validation with a simplified example
Scenario: Imagine you're building a machine learning model to predict
house prices based on features like size, location, and number of
bedrooms. You have a dataset of 20 houses.
K-Fold Cross-Validation (k=3):
• Split the Data: Divide the 20 houses into 3 roughly equal folds (groups) of 7, 7,
and 6 houses.
• Iteration 1:
– Validation Set: Fold 1 (7 houses) becomes the validation set for testing.
– Training Set: Folds 2 and 3 (combined 13 houses) become the training set.
– Train the model on the training set and evaluate its performance (e.g., average
error in predicting house prices) on the validation set (Fold 1).
– Record this performance metric.
Cont..
• Iteration 2:
Validation Set: Fold 2 (7 houses) is now the validation set.
Training Set: Folds 1 and 3 (combined 13 houses) become the training set.
Train the model again and evaluate on the new validation set (Fold 2).
Record this performance metric.
Iteration 3:
Validation Set: Fold 3 (6 houses) is the validation set.
Training Set: Folds 1 and 2 (combined 14 houses) become the training set.
Train and evaluate the model on the final validation set (Fold 3).
Record this performance metric.
Average the Results: After all 3 iterations, you have 3 performance metrics. Take the
average of these metrics to get a more robust estimate of how well the model
generalizes to unseen data.
Handling Remainders:
• If your dataset size isn't perfectly divisible by k, you have two
options:
– Stratified Folds (Preferred): When dealing with classification problems,
try to create stratified folds. This ensures each fold has a similar
proportion of classes (house types) as the original dataset. You might
end up with folds of slightly unequal sizes to achieve this.
– Unequal Folds: If stratification isn't crucial, create folds as close to
equal size as possible, even if it means the last fold has a few extra
houses.
Accuracy
• Accuracy can be a misleading metric for evaluating classifiers,
particularly when dealing with imbalanced datasets.
• K-Fold Cross-Validation with cross_val_score:
– In this case, the training data is split into 3 folds (k=3). The model
(SGDClassifier) is trained on k-1 folds and tested on the remaining fold. This
process is repeated for all folds, and the performance metric (accuracy in this
case) is reported for each fold.
• High Accuracy Doesn't Always Mean Good Performance:
– The example shows an SGDClassifier achieving high accuracy (over
93%) on all folds using k-fold cross-validation. This might seem
impressive.
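A minimal sketch of this evaluation with cross_val_score, assuming the sgd_clf, X_train, and y_train_5 objects from the earlier steps:
Python
from sklearn.model_selection import cross_val_score

# 3-fold cross-validation, reporting accuracy on each held-out fold
scores = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print(scores)  # typically three accuracy values above 0.9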
Accuracy
• The "Never-5" Classifier Trap:
– The code introduces a dummy classifier, Never5Classifier, that always predicts
"not-5" for every image.
– Surprisingly, this dummy model also achieves high accuracy (over 90%) using k-
fold cross-validation.
• Why Accuracy Fails with Imbalanced Data:
– The reason for the dummy model's high accuracy is the imbalanced nature of the
dataset. Only 10% of the images belong to the "5" class.
– By always predicting "not-5", the dummy model is essentially guessing the
majority class most of the time, which leads to high accuracy despite not actually
learning anything meaningful.
• Alternative Performance Measures:
– accuracy can be misleading with imbalanced datasets. Other performance
metrics, like precision, recall, or F1-score, might be more suitable in such cases.
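A minimal sketch of such a dummy classifier; it implements just enough of the scikit-learn estimator interface to be passed to cross_val_score (assumes X_train and y_train_5 from earlier):
Python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self  # learns nothing at all

    def predict(self, X):
        return np.zeros(len(X), dtype=bool)  # always predicts "not-5"

never_5_clf = Never5Classifier()
print(cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy"))
# roughly 0.90 accuracy, despite never detecting a single 5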
Confusion Matrix
• Confusion matrix to evaluate a classifier's performance and
introduces two other metrics, precision and recall
• Confusion Matrix:
– A confusion matrix is a table that helps visualize the performance of a
classification model.
– It shows how many times the model correctly or incorrectly classified
instances of each class.
– Rows represent actual classes, and columns represent predicted classes.
– Ideally, a perfect classifier would have non-zero values only on its diagonal
(correct classifications).
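A minimal sketch of how such a matrix is produced in scikit-learn, using cross_val_predict to get out-of-fold predictions first (assumes sgd_clf, X_train, and y_train_5 from earlier):
Python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Predictions made on data the model never saw during training (fold by fold)
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

# Rows = actual classes (not-5, 5); columns = predicted classes (not-5, 5)
print(confusion_matrix(y_train_5, y_train_pred))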
Cont..
• The image shows a confusion matrix
for a classifier that distinguishes
between images of "5" and "not-5".
• In the actual "not-5" class (first row),
53,057 images were correctly classified
(true negatives), and 1,522 were
incorrectly classified as "5" (false
positives).
• In the actual "5" class (second row),
1,325 images were incorrectly
classified as "not-5" (false negatives),
and 4,096 were correctly classified
(true positives).
• Precision and Recall:
Precision and Recall
• Precision: It tells you the accuracy of positive predictions (Eq. 3-1):
precision = TP / (TP + FP). A high precision means the model isn't making
many false positive mistakes.
• Recall: It tells you how good the model is at finding all positive
instances (Eq. 3-2): recall = TP / (TP + FN). A high recall means the
model isn't missing many true positives.
Combining Precision and Recall: F1-score
• F1-score is a single metric combining precision and recall (Eq. 3-3):
F1 = 2 × precision × recall / (precision + recall). It favors models with
balanced precision and recall (both high).
• A high F1-score (0.742 in this example) indicates a good overall
performance, but it doesn't reveal the balance between precision
and recall.
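A minimal sketch of computing these three metrics from the out-of-fold predictions (y_train_pred) obtained earlier with cross_val_predict:
Python
from sklearn.metrics import precision_score, recall_score, f1_score

print(precision_score(y_train_5, y_train_pred))  # TP / (TP + FP)
print(recall_score(y_train_5, y_train_pred))     # TP / (TP + FN)
print(f1_score(y_train_5, y_train_pred))         # harmonic mean of the two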
Precision vs. Recall: A Tradeoff
• Often, you can't have high precision and high recall
simultaneously. Increasing one typically reduces the other.
• The choice between precision and recall depends on the specific
application.
• High Precision:
Medical diagnosis: A medical diagnosis system for a serious
disease might prioritize high precision. A false positive (incorrectly
diagnosing someone with the disease) could lead to unnecessary
worry and treatment. Here, it's crucial that the positive diagnoses the
system does make are correct, even if it means missing some
cases (lower recall).
Cont..
• High Recall:
– Fire alarm: A fire alarm system in a building should have high
recall. A false negative (not detecting a fire) could have
devastating consequences. While some false positives
(accidental alarms) can be disruptive, it's better to err on the
side of caution.
Precision/Recall Tradeoff (precision is hurt by false positives, recall by false negatives)
ROC (Receiver Operating Characteristic) Curve
• A graph used to assess the performance of a binary classification
model. The ROC curve plots the true positive rate (TPR) on the y-
axis against the false positive rate (FPR) on the x-axis.
– The TPR, also known as recall, sensitivity or probability of detection, is
the proportion of positive cases that were correctly identified by the
model.
– The FPR, also known as probability of false alarm or 1-specificity, is the
proportion of negative cases that were incorrectly classified as positive
by the model.
ROC
• An ROC curve typically starts at the bottom
left corner (0,0) and ends in the upper right
corner (1,1). The closer the curve is to the
top left corner (0,1), the better the
performance of the model. A perfect classifier
would have an ROC curve that follows the
left and top borders of the graph.
• In the diagram, the model performs well at
first but then its performance levels off. This
means that as the threshold for classifying
something as positive is lowered, the true
positive rate increases quickly at first, but
then slows down as the model starts to
include more and more false positives.
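A minimal sketch of plotting the ROC curve and computing the area under it, using decision scores from cross_val_predict (assumes sgd_clf, X_train, and y_train_5 from earlier):
Python
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, roc_auc_score

# Decision scores instead of hard predictions, so every threshold can be explored
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

plt.plot(fpr, tpr, label="SGD classifier")
plt.plot([0, 1], [0, 1], "k--", label="purely random classifier")  # diagonal = no skill
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()

print(roc_auc_score(y_train_5, y_scores))  # area under the curve; 1.0 is perfect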
Multiclass Classification
• Classifies data points into more than two categories.
• Some algorithms like Random Forest can handle this
directly.
• Other algorithms like SVM are binary and require
strategies to handle multiple classes.
• Scikit-learn supports multiclass classification through two common
strategies: One-vs-All (OvA) and One-vs-One (OvO).
Cont..
• One-vs-All (OvA): Imagine you have a classifier that can identify digits.
Using OvA, you would train ten separate classifiers, one for each digit (0, 1,
2, ..., 9). When you have a new image to classify, you'd pass it through all ten
classifiers. The classifier with the highest score "wins" and determines the
image's class.
• One-vs-One (OvO): This strategy involves training a classifier for every
possible pair of classes. For example, to classify digits 0-9, you'd need 45
classifiers (one for 0 vs 1, another for 0 vs 2, and so on). When classifying a
new image, you'd put it through all 45 classifiers and see which class wins
the most "duels".

• OvA is generally preferred for most binary classification algorithms. OvO is
mainly useful for algorithms (such as SVMs) that scale poorly with the size of
the training set, because each OvO classifier only trains on the subset of the
data belonging to its two classes.
Classifying Handwritten Digits (0-9):
• We can use a classifier to predict which digit (0-9) is in an image.
• Scikit-learn automatically uses OvA for a Stochastic Gradient
Descent Classifier (SGDClassifier).
• It trains 10 binary classifiers (one for each digit) and predicts the
class with the highest score from these classifiers.
• decision_function() method: Returns an array of scores, one for
each class.
• classes_ attribute: Stores the class labels used during training.
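A minimal sketch of this behaviour, assuming X_train, the full multiclass labels y_train, and some_digit (a single image) from earlier:
Python
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)  # multiclass labels: one binary classifier is trained per digit

print(sgd_clf.predict([some_digit]))  # predicted digit for one image

some_digit_scores = sgd_clf.decision_function([some_digit])
print(some_digit_scores)   # 10 scores, one per class
print(sgd_clf.classes_)    # the class labels, in the same order as the scores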
Other Multiclass Classifiers:
• Random Forest Classifier can handle multiclass classification
directly.
• predict_proba() method: Returns the probability assigned to
each class for a data point.
• Evaluation:
– Use cross-validation to get a more reliable estimate of performance (e.g.,
accuracy).
– The example shows how cross_val_score can be used to evaluate the
SGDClassifier on the MNIST dataset.
– Preprocessing techniques like scaling the data can significantly improve
performance.
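A minimal sketch combining these ideas: a Random Forest with per-class probabilities, and scaling the inputs before cross-validating the SGDClassifier (assumes X_train, y_train, sgd_clf, and some_digit from earlier):
Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Random Forest handles the 10 classes directly
forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train)
print(forest_clf.predict_proba([some_digit]))  # one probability per class for this image

# Scaling the pixel values often improves the SGDClassifier's cross-validated accuracy
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype("float64"))
print(cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy"))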
Error Analysis for classification task
Helps identify weaknesses in a classification model
and find ways to improve its performance. Common
techniques include the confusion matrix and analyzing
individual errors.
Confusion matrix - plot the confusion matrix using
Matplotlib.
– It recommends normalizing the confusion matrix to
compare error rates instead of absolute errors
(especially for imbalanced datasets).
– By looking at rows and columns, you can see which
classes are frequently confused.
– In the example, the model confuses digits 3 and 5
often.
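A minimal sketch of this error analysis, normalizing each row of the multiclass confusion matrix by the number of images in that class and plotting it (assumes sgd_clf, X_train_scaled, and y_train from above):
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)

# Divide each row by the class size to compare error rates rather than absolute counts
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

np.fill_diagonal(norm_conf_mx, 0)  # zero out correct predictions to highlight the errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()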
Multilabel Classification
• Multi-label classification allows data points to have multiple class labels (e.g.,
an image can contain both a cat and a dog).
• A common way to evaluate a multilabel classifier is the F1 score, computed for
each label and then averaged; the simple binary example below gives an F1 score of 0.5.
from sklearn.metrics import f1_score
# True labels
y_true = [0, 1, 0, 1, 0]
# Predicted labels
y_pred = [0, 0, 1, 1, 0]
# Calculate F1 score
f1 = f1_score(y_true, y_pred)
# Print the F1 score
print(f"F1 score: {f1}")
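For a genuinely multilabel setup, each data point carries several binary labels at once. A minimal sketch, assuming the integer y_train labels, X_train, and some_digit from earlier: it attaches two labels to every digit (is it large, i.e. 7, 8 or 9? and is it odd?) and trains a K-Nearest Neighbors classifier on both at the same time.
Python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

y_train_large = (y_train >= 7)                    # first label: digit is 7, 8 or 9
y_train_odd = (y_train % 2 == 1)                  # second label: digit is odd
y_multilabel = np.c_[y_train_large, y_train_odd]  # two labels per image

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
print(knn_clf.predict([some_digit]))  # e.g., [[False, True]] if the image is a 5

# Average the F1 score over both labels
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
print(f1_score(y_multilabel, y_train_knn_pred, average="macro"))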
Multioutput Classification
• A more complex classification task where the model outputs multiple labels,
and each label can take more than two possible values (classes).
• Steps
– creating a noisy dataset from the MNIST digit dataset.
– Random noise is added to the pixel intensities of the original
images using NumPy's randint function.
– The target labels remain the original clean images.
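A minimal sketch of these steps: noisy copies of the digits become the inputs, the clean digits become the (multioutput) targets, and a K-Nearest Neighbors classifier learns to remove the noise (assumes X_train and X_test as NumPy arrays from earlier):
Python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Add random pixel noise to the images; the clean images become the targets
noise_train = np.random.randint(0, 100, (len(X_train), 784))
noise_test = np.random.randint(0, 100, (len(X_test), 784))
X_train_mod = X_train + noise_train
X_test_mod = X_test + noise_test
y_train_mod = X_train
y_test_mod = X_test

# One output label (a pixel intensity) per pixel: 784 labels, each with 256 possible values
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])  # a denoised 784-value image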
