Handwritten Digit Recognition with ML Models
Project Description
The goal of this project is to accurately classify handwritten digits using a dataset of digit samples.
Each sample is described by 16 integer features that capture characteristics of the pen trajectory
used to write the digit.
localhost:8888/nbconvert/html/Desktop/00-GitHub-Repo/00-PROJECTS/00_ML_Projects/HandWritten_Digit_Recognition_Multi-Class/Handwritten_D… 1/41
17/08/2024, 01:04 Handwritten_Digit_Recognition_SVM_DT_RF
Problem Statement
The accurate classification of handwritten digits is a critical task in the field of computer vision, with
applications ranging from automated postal mail sorting to digit recognition in educational tools.
Despite the availability of advanced machine learning techniques, achieving high accuracy in digit
classification remains challenging due to variations in handwriting styles, sizes, and shapes.
This project aims to address this challenge by developing and evaluating three machine learning
models—Support Vector Machine (SVM), Random Forest, and Decision Tree—to classify handwritten
digits based on their trajectory characteristics. The goal is to determine which model performs best in
terms of accuracy, precision, recall, and F1-score, thereby providing a robust solution for digit
recognition tasks.
The outcomes of this study will contribute to the understanding of model performance on pen-trajectory
data and may be applied in various practical scenarios requiring digit recognition.
Objectives
Data Preprocessing: Clean and prepare the data for model training.
Model Development: Implement Support Vector Machine (SVM), Random Forest, and Decision
Tree classifiers.
Model Training and Evaluation: Train the models on the training set and evaluate their
performance using metrics such as accuracy, precision, recall, and F1-score.
Model Comparison: Compare the models to identify the best performer in terms of
classification accuracy and generalization.
Conclusion: Summarize the findings and provide recommendations for potential improvements.
The dataset and results are used for educational purposes, demonstrating the application of machine
learning techniques to pen-trajectory data. The aim is to build effective machine learning models to
classify handwritten digits and to gain a deeper understanding of these techniques.
INPUTS             Description
input1 - input16   Integer values representing different characteristics of the digit's trajectory,
                   such as coordinates and angles.
The dataset is commonly used for training and evaluating machine learning models to recognize
handwritten digits, making it an ideal candidate for classification tasks in computer vision.
Relevant Paper:
Alimoglu, F., & Alpaydin, E. (1997, August). Combining multiple representations and classifiers for
pen-based handwritten digit recognition. In Proceedings of the Fourth International Conference
on Document Analysis and Recognition (Vol. 2, pp. 637-640). IEEE. DOI: 10.24432/C5MG6K
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set_style("whitegrid")
In [2]: df = pd.read_csv("/kaggle/input/pen-based-handwritten-digit/pendigits_txt.csv")
df.head()
Out[2]: input1 input2 input3 input4 input5 input6 input7 input8 input9 input10 input11 inpu
0 47 100 27 81 57 37 26 0 0 23 56
1 0 89 27 100 42 75 29 45 15 15 37
2 0 57 31 68 72 90 100 100 76 75 50
3 0 100 7 92 5 68 19 45 86 34 100
4 0 67 49 83 100 100 81 80 60 60 40
Basic Statistics
In [3]: # Basic statistics summary of Numerical features
df.describe().T
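def summary(df, pred=None):
    # The top of this cell is missing from the export; the six Series below are an assumed reconstruction.
    Types = df.dtypes
    Counts = df.count()
    Uniques = df.nunique()
    Nulls = df.isnull().sum()
    Min = df.min()
    Max = df.max()
    if pred is None:
        cols = ['Types', 'Counts', 'Uniques', 'Nulls', 'Min', 'Max']
        info = pd.concat([Types, Counts, Uniques, Nulls, Min, Max], axis=1, sort=True)
    info.columns = cols
    print('___________________________\nData Types:')
    print(info.Types.value_counts())
    print('___________________________')
    return info
summary(df)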
Duplicated Values
In [5]: def duplicate_values(df):
            print("Duplicate check...")
            num_duplicates = df.duplicated(subset=None, keep='first').sum()
            if num_duplicates > 0:
                print("There are", num_duplicates, "duplicated observations in the dataset.")
                df.drop_duplicates(keep='first', inplace=True)
                print(num_duplicates, "duplicates were dropped!")
                print("No more duplicate rows!")
            else:
                print("There are no duplicated observations in the dataset.")
        duplicate_values(df)
Duplicate check...
There are no duplicated observations in the dataset.
Missing Values
def missing_values(df):
    # Count and percentage of missing values per column
    missing_count = df.isnull().sum()
    value_count = df.isnull().count()
    missing_percentage = round(missing_count / value_count * 100, 2)
    missing_df = pd.DataFrame({"count": missing_count, "percentage": missing_percentage})
    return missing_df
missing_values(df)
input1 0 0.0
input2 0 0.0
input3 0 0.0
input4 0 0.0
input5 0 0.0
input6 0 0.0
input7 0 0.0
input8 0 0.0
input9 0 0.0
input10 0 0.0
input11 0 0.0
input12 0 0.0
input13 0 0.0
input14 0 0.0
input15 0 0.0
input16 0 0.0
class 0 0.0
Distributions
In [11]: # Target Feature `Class`
In [59]: plt.figure(figsize=(4,4))
sns.boxplot(y=df['class'],palette='Blues')
plt.title('Boxplot of Class Feature')
sns.kdeplot(df[col])
plt.tight_layout();
The density plots show that many features have multiple peaks, indicating the presence of
different subgroups in the data.
Some features are skewed, and the range of values varies across the features.
Overall, the dataset is complex, which may pose challenges for modeling.
Correlations
In [43]: plt.figure(figsize=(20,15))
sns.heatmap(df.corr(), vmin = -1, vmax = 1, annot = True, fmt = '.3f', cmap='Blues');
ax = price_corr.plot(kind='bar',figsize=(15,6))
ax.bar_label(ax.containers[0], fmt='%.2f')
plt.show()
Highest Correlation: input1 has the highest positive correlation with class (0.35), which is a
weak-to-moderate linear relationship rather than a strong one.
Negative Correlations: input9 has the most negative correlation (-0.17), suggesting a weak
inverse relationship with class.
Low Correlations: Some features, such as input11, input14, and input15, have
correlations close to zero, indicating little to no linear relationship with class.
Because class is a categorical digit label rather than an ordered quantity, these coefficients are
only a rough guide to feature relevance.
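The ranking discussed here can be computed directly instead of being read off the heatmap. The sketch below is illustrative only: it uses a small synthetic frame named `demo` in place of `df` (the values are made up, and `class` is deliberately tied to `input1`), then sorts features by their Pearson correlation with `class`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's df (assumption: integer features
# plus a 'class' column holding the digit label).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "input1": rng.integers(0, 101, size=200),
    "input2": rng.integers(0, 101, size=200),
})
demo["class"] = (demo["input1"] // 10).clip(upper=9)  # tied to input1 on purpose

# Correlation of every feature with the target, strongest first.
class_corr = (
    demo.corr()["class"]
    .drop("class")
    .sort_values(ascending=False)
)
print(class_corr.round(2))
```

On the real data the same three chained calls on `df` would give the values plotted in the bar chart.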
Outlier Analysis
In [34]: # Boxplot of all features by target
plt.figure(figsize=(20,7))
sns.boxplot(data=df.drop("class", axis=1),palette='Blues');
plt.figure(figsize=(20, 30))
for i, col in enumerate(df.columns[:-1], 1):
    plt.subplot(9, 2, i)
    plt.title(col)
    sns.boxplot(x='class', y=col, data=df, palette='Blues')
plt.tight_layout()
plt.show()
Wide Distributions: Some features, particularly input5 , input7 , and input11 , have a wide
range of data values. This indicates that the data points for these features are more varied.
Outliers: Features like input2 and input4 exhibit significant outliers. This suggests that
some data points in these features deviate considerably from the general distribution.
Symmetric and Asymmetric Distributions: While most features show relatively symmetric
boxplots, a few exhibit noticeable asymmetry, indicating skewness in the distribution.
I will leave the outliers untouched for now and revisit them later if the models' predictive
performance calls for it.
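To put numbers on what the boxplots show, outliers can be counted per feature with the same 1.5 x IQR rule that boxplot whiskers use by default. This is a minimal sketch on made-up data; the helper name `iqr_outlier_counts` is hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

# Toy data (made up): column 'a' has one extreme value, column 'b' has none.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
counts = iqr_outlier_counts(toy)
print(counts)
```

Applied to the notebook's data, the call would be `iqr_outlier_counts(df.drop("class", axis=1))`.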
MACHINE LEARNING
Data Preprocessing
In [61]: # Train Test Split
X = df.drop("class", axis = 1)
y = df["class"]
In [77]: # Scaler
#scaler = StandardScaler()
scaler = MinMaxScaler()
def eval_metric(model, X_train, y_train, X_test, y_test, i):
    # Signature inferred from how eval_metric is called later in the notebook.
    y_train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)
    print(f"{i} Test_Set")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print()
    print('--------------------------------------------------------')
    print(f"{i} Train_Set")
    print(confusion_matrix(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))
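The estimator diagram later in the notebook shows MinMaxScaler followed by SVC, which suggests the scaler and classifier were chained in a pipeline. The sketch below shows one way to set that up. It runs on synthetic data from `make_classification` rather than the pen-digit file, and the `test_size` and `random_state` values are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for X and y (assumption): 16 features, 10 classes.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=16, n_informative=10,
    n_redundant=0, n_classes=10, n_clusters_per_class=1, random_state=0,
)

# Stratifying keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# Scaling inside the pipeline means the scaler is fit on training data only.
pipe = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])
pipe.fit(X_tr, y_tr)

y_hat = pipe.predict(X_te)
test_acc = accuracy_score(y_te, y_hat)
test_f1 = f1_score(y_te, y_hat, average="macro")
print(f"accuracy={test_acc:.3f}  macro F1={test_f1:.3f}")
```

Keeping the scaler inside the pipeline also matters for cross-validation and grid search, because each fold then refits the scaler on its own training portion.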
print('----------------------------------------------------')
print('SVM_accuracy_test:', SVM_accuracy_test)
print('SVM_accuracy_train:', SVM_accuracy_train)
print('svm_f1_test:', svm_f1_test)
print('svm_f1_train:', svm_f1_train)
print('----------------------------------------------------')
----------------------------------------------------
SVM_accuracy_test: 0.9890909090909091
SVM_accuracy_train: 0.9957541447634453
svm_f1_test: 0.9890717989037039
svm_f1_train: 0.9957670665493271
----------------------------------------------------
svm_model Test_Set
[[106 0 0 0 0 0 0 0 0 0]
[ 0 108 1 0 0 0 0 1 0 0]
[ 0 1 119 0 0 0 0 0 0 0]
[ 0 0 0 115 0 0 0 0 0 1]
[ 0 0 0 0 117 2 0 0 0 0]
[ 0 0 0 2 0 96 0 0 0 1]
[ 0 0 0 0 0 0 105 0 0 0]
[ 0 1 0 0 0 0 0 109 0 0]
[ 0 0 0 0 0 0 0 0 104 0]
[ 0 0 0 0 0 0 0 2 0 109]]
precision recall f1-score support
--------------------------------------------------------
svm_model Train_Set
[[1035 1 0 0 0 0 1 0 0 0]
[ 0 1018 8 5 1 0 0 1 0 0]
[ 0 2 1021 0 0 0 0 1 0 0]
[ 0 1 2 933 0 1 0 1 0 1]
[ 0 0 0 0 1023 1 0 1 0 0]
[ 0 0 0 3 0 950 0 0 1 2]
[ 0 0 0 0 0 1 950 0 0 0]
[ 0 2 0 0 0 0 0 1030 0 0]
[ 0 0 0 0 0 1 0 1 949 0]
[ 0 0 0 0 0 0 0 2 1 941]]
precision recall f1-score support
Model Validation
In [73]: # Cross Validation Scores of the Model Performance
scores = cross_validate(model,
                        X_train,
                        y_train,
                        scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
                        cv=5,
                        return_train_score=True)
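The dictionary returned by `cross_validate` is easiest to read as a table with one row per fold. The sketch below shows that step under stated assumptions: it uses synthetic data and a plain `SVC()` in place of the notebook's `model`, and only two of the four scoring metrics to keep the output short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for X_train and y_train (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=8, n_classes=3, random_state=0
)

scores = cross_validate(
    SVC(),
    X_demo, y_demo,
    scoring=["accuracy", "f1_macro"],
    cv=5,
    return_train_score=True,
)

# One row per fold; the column means are what gets compared across models.
fold_scores = pd.DataFrame(scores, index=range(1, 6))
summary_row = fold_scores.mean()[
    ["train_accuracy", "test_accuracy", "train_f1_macro", "test_f1_macro"]
]
print(summary_row.round(3))
```

A large difference between the train and test columns in this table is an early sign of overfitting, before the held-out test set is touched.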
svm_grid_model = GridSearchCV(model,
param_grid,
scoring = "accuracy",
n_jobs = -1, # Uses all available cores
verbose=1,
return_train_score=True).fit(X_train, y_train) # fit the model
svm_grid_model
▸ MinMaxScaler
▸ SVC
In [86]: svm_grid_model.best_params_
print('svm_grid_accuracy_test:', svm_grid_accuracy_test)
print('svm_grid_accuracy_train:', svm_grid_accuracy_train)
print('svm_grid_f1_test:', svm_grid_f1_test)
print('svm_grid_f1_train:', svm_grid_f1_train)
print('---------------------------------------------')
# Evaluating the Model Performance using Classification Metrics
eval_metric(svm_grid_model, X_train, y_train, X_test, y_test, 'svm_grid_model')
svm_grid_accuracy_test: 0.9927272727272727
svm_grid_accuracy_train: 0.9989890820865346
svm_grid_f1_test: 0.9927187020388594
svm_grid_f1_train: 0.9989841997564615
---------------------------------------------
svm_grid_model Test_Set
[[106 0 0 0 0 0 0 0 0 0]
[ 0 109 0 0 0 0 0 1 0 0]
[ 0 0 120 0 0 0 0 0 0 0]
[ 0 0 0 115 0 0 0 0 0 1]
[ 0 0 0 0 119 0 0 0 0 0]
[ 0 0 0 1 0 97 0 0 0 1]
[ 0 0 0 0 0 0 105 0 0 0]
[ 0 1 0 0 0 0 0 109 0 0]
[ 0 0 0 0 0 0 0 0 104 0]
[ 0 0 0 1 0 0 0 2 0 108]]
precision recall f1-score support
--------------------------------------------------------
svm_grid_model Train_Set
[[1037 0 0 0 0 0 0 0 0 0]
[ 0 1031 1 1 0 0 0 0 0 0]
[ 0 0 1023 0 0 0 0 1 0 0]
[ 0 1 2 934 0 1 0 1 0 0]
[ 0 0 0 0 1025 0 0 0 0 0]
[ 0 0 0 0 0 956 0 0 0 0]
[ 0 0 0 0 0 1 950 0 0 0]
[ 0 0 0 0 0 0 0 1032 0 0]
[ 0 0 0 0 0 0 0 0 951 0]
[ 0 1 0 0 0 0 0 0 0 943]]
precision recall f1-score support
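visualizer = ConfusionMatrix(svm_grid_model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show();
# Constructor line inferred from the "Class Prediction Error" title in the cell output
visualizer = ClassPredictionError(svm_grid_model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()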
Out[88]: <Axes: title={'center': 'Class Prediction Error for GridSearchCV'}, xlabel='actual class',
ylabel='number of predicted class'>
Comparison:
The SVM Grid Model slightly outperforms the standard SVM Model in both accuracy and F1
score on both sets: test accuracy rises from 0.989 to 0.993 and training accuracy from 0.996 to 0.999.
The differences are small but consistent, indicating that grid search provided a
marginal performance improvement.
Both models perform well. The Grid model misclassifies 8 of the 1,100 test samples, compared
with 12 for the standard model.
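The `param_grid` used for the SVM search is not visible in this export, so the sketch below uses a hypothetical grid purely to show the mechanics of tuning a scaler-plus-SVC pipeline with `GridSearchCV`. The data is synthetic, and the values for `C` and `gamma` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train and y_train (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=8, n_classes=3, random_state=0
)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])

# Hypothetical grid: the notebook's actual param_grid is not shown.
# The "svc__" prefix routes each parameter to the pipeline step named "svc".
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}

grid = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5,
                    return_train_score=True)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(round(grid.best_score_, 3))  # mean cross-validated accuracy of the best setting
```

After fitting, the `GridSearchCV` object predicts with the best pipeline refit on all the training data, which is why it can be passed straight to an evaluation helper.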
Out[105… ▾ DecisionTreeClassifier
DecisionTreeClassifier(random_state=101)
print('----------------------------------------------------')
print('DT_accuracy_test:', DT_accuracy_test)
print('DT_accuracy_train:', DT_accuracy_train)
print('DT_f1_test:', DT_f1_test)
print('DT_f1_train:', DT_f1_train)
print('----------------------------------------------------')
----------------------------------------------------
DT_accuracy_test: 0.9572727272727273
DT_accuracy_train: 1.0
DT_f1_test: 0.9571347237434311
DT_f1_train: 1.0
----------------------------------------------------
DT_model Test_Set
[[106 0 0 0 0 0 0 0 0 0]
[ 0 105 5 0 0 0 0 0 0 0]
[ 0 1 118 0 0 0 0 1 0 0]
[ 0 3 0 110 0 0 0 2 0 1]
[ 0 0 0 0 117 1 1 0 0 0]
[ 0 0 0 5 0 90 0 1 0 3]
[ 3 1 1 0 0 0 100 0 0 0]
[ 0 6 2 0 0 1 0 99 1 1]
[ 2 0 0 0 0 0 0 0 102 0]
[ 0 0 0 2 1 0 1 1 0 106]]
precision recall f1-score support
--------------------------------------------------------
DT_model Train_Set
[[1037 0 0 0 0 0 0 0 0 0]
[ 0 1033 0 0 0 0 0 0 0 0]
[ 0 0 1024 0 0 0 0 0 0 0]
[ 0 0 0 939 0 0 0 0 0 0]
[ 0 0 0 0 1025 0 0 0 0 0]
[ 0 0 0 0 0 956 0 0 0 0]
[ 0 0 0 0 0 0 951 0 0 0]
[ 0 0 0 0 0 0 0 1032 0 0]
[ 0 0 0 0 0 0 0 0 951 0]
[ 0 0 0 0 0 0 0 0 0 944]]
precision recall f1-score support
Model Validation
In [107… # Cross Validation Scores of the Model Performance
model = DecisionTreeClassifier()
scores = cross_validate(model,
                        X_train,
                        y_train,
                        scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
                        cv=5,
                        return_train_score=True)
model = DecisionTreeClassifier()
param_grid = {
    'criterion': ["entropy", "gini"],
    'max_depth': [7, 8],
    # 'auto' meant the same as 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'max_features': ['sqrt', 0.8],
    'max_leaf_nodes': [180, 200],
}
DT_grid_model = GridSearchCV(model,
param_grid,
scoring = "accuracy",
n_jobs = -1, # Uses all available cores
verbose=1,
return_train_score=True).fit(X_train, y_train) # fit the model
In [112… DT_grid_model
Out[112… ▸ GridSearchCV
▸ estimator: DecisionTreeClassifier
▸ DecisionTreeClassifier
In [114… DT_grid_model.best_params_
print('DT_grid_accuracy_test:', DT_grid_accuracy_test)
print('DT_grid_accuracy_train:', DT_grid_accuracy_train)
print('DT_grid_f1_test:', DT_grid_f1_test)
print('DT_grid_f1_train:', DT_grid_f1_train)
print('---------------------------------------------')
# Evaluating the Model Performance using Classification Metrics
eval_metric(DT_grid_model, X_train, y_train, X_test, y_test, 'DT_grid_model')
DT_grid_accuracy_test: 0.9563636363636364
DT_grid_accuracy_train: 0.9711888394662354
DT_grid_f1_test: 0.9567899509438416
DT_grid_f1_train: 0.9710680122624465
---------------------------------------------
DT_grid_model Test_Set
[[105 0 0 0 0 0 0 0 1 0]
[ 0 98 11 1 0 0 0 0 0 0]
[ 0 3 114 1 0 0 0 0 0 2]
[ 0 0 0 115 0 1 0 0 0 0]
[ 0 1 0 0 118 0 0 0 0 0]
[ 0 0 0 2 0 91 0 0 1 5]
[ 1 0 1 1 0 0 101 0 1 0]
[ 0 3 2 1 0 0 0 103 0 1]
[ 1 0 0 0 0 0 0 0 103 0]
[ 0 2 0 3 0 0 0 2 0 104]]
precision recall f1-score support
--------------------------------------------------------
DT_grid_model Train_Set
[[1034 0 0 0 1 0 0 0 2 0]
[ 0 963 31 19 1 5 0 3 0 11]
[ 0 13 1001 7 0 2 0 0 0 1]
[ 0 7 1 882 1 42 0 0 0 6]
[ 0 1 0 0 1017 1 0 1 0 5]
[ 0 0 0 9 0 941 0 2 1 3]
[ 1 0 0 0 1 4 940 5 0 0]
[ 0 6 6 48 0 0 0 965 1 6]
[ 1 0 0 1 0 0 1 0 948 0]
[ 1 5 0 3 4 13 0 2 0 916]]
precision recall f1-score support
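visualizer = ConfusionMatrix(DT_grid_model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show();
# Constructor line inferred from the "Class Prediction Error" title in the cell output
visualizer = ClassPredictionError(DT_grid_model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()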
Out[117… <Axes: title={'center': 'Class Prediction Error for GridSearchCV'}, xlabel='actual class',
ylabel='number of predicted class'>
Comparison:
The standard DT Model has perfect training scores (1.000) but a test accuracy of 0.957. A gap of
this size points to overfitting.
The DT Grid Model narrows that gap, with a training accuracy of 0.971 and a test accuracy of 0.956.
The smaller gap suggests less overfitting, although test performance is essentially unchanged.
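The overfitting argument comes down to the difference between training and test accuracy. A short calculation with the accuracies printed in the cell outputs (rounded to four decimals) makes the two gaps explicit.

```python
# Accuracy values copied from the DT outputs in this notebook, rounded to four decimals.
reported = {
    "DT (default)": {"train": 1.0000, "test": 0.9573},
    "DT (grid)":    {"train": 0.9712, "test": 0.9564},
}

# Train-test gap per model: a larger gap means more overfitting.
gaps = {name: acc["train"] - acc["test"] for name, acc in reported.items()}

for name, gap in gaps.items():
    acc = reported[name]
    print(f"{name:13s} train={acc['train']:.4f} test={acc['test']:.4f} gap={gap:.4f}")
```

The gap shrinks from about 0.043 to about 0.015 after tuning, while test accuracy moves by less than 0.001.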
Out[122… ▾ RandomForestClassifier
RandomForestClassifier(random_state=101)
print('----------------------------------------------------')
print('RF_accuracy_test:', RF_accuracy_test)
print('RF_accuracy_train:', RF_accuracy_train)
print('RF_f1_test:', RF_f1_test)
print('RF_f1_train:', RF_f1_train)
print('----------------------------------------------------')
----------------------------------------------------
RF_accuracy_test: 0.9854545454545455
RF_accuracy_train: 1.0
RF_f1_test: 0.9856326999333506
RF_f1_train: 1.0
----------------------------------------------------
RF_model Test_Set
[[106 0 0 0 0 0 0 0 0 0]
[ 0 105 5 0 0 0 0 0 0 0]
[ 0 2 118 0 0 0 0 0 0 0]
[ 0 0 0 116 0 0 0 0 0 0]
[ 0 0 0 0 119 0 0 0 0 0]
[ 0 0 0 3 0 95 0 0 0 1]
[ 0 0 0 0 0 0 105 0 0 0]
[ 0 2 1 0 0 0 0 107 0 0]
[ 0 0 0 0 0 0 0 0 104 0]
[ 0 0 0 0 0 0 0 2 0 109]]
precision recall f1-score support
--------------------------------------------------------
RF_model Train_Set
[[1037 0 0 0 0 0 0 0 0 0]
[ 0 1033 0 0 0 0 0 0 0 0]
[ 0 0 1024 0 0 0 0 0 0 0]
[ 0 0 0 939 0 0 0 0 0 0]
[ 0 0 0 0 1025 0 0 0 0 0]
[ 0 0 0 0 0 956 0 0 0 0]
[ 0 0 0 0 0 0 951 0 0 0]
[ 0 0 0 0 0 0 0 1032 0 0]
[ 0 0 0 0 0 0 0 0 951 0]
[ 0 0 0 0 0 0 0 0 0 944]]
precision recall f1-score support
Model Validation
In [124… # Cross Validation Scores of the Model Performance
model = RandomForestClassifier()
scores = cross_validate(model,
                        X_train,
                        y_train,
                        scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
                        cv=5,
                        return_train_score=True)
model = RandomForestClassifier()
param_grid = {
    'criterion': ["entropy", "gini"],
    'max_depth': [7, 8],
    # 'auto' meant the same as 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'max_features': ['sqrt', 0.8],
    'max_leaf_nodes': [180, 200],
    'n_estimators': [50, 100],
}
RF_grid_model = GridSearchCV(model,
param_grid,
scoring = "accuracy",
n_jobs = -1, # Uses all available cores
verbose=1,
return_train_score=True).fit(X_train, y_train) # fit the model
In [129… RF_grid_model.best_params_
print('RF_grid_accuracy_test:', RF_grid_accuracy_test)
print('RF_grid_accuracy_train:', RF_grid_accuracy_train)
print('RF_grid_f1_test:', RF_grid_f1_test)
print('RF_grid_f1_train:',RF_grid_f1_train)
print('---------------------------------------------')
# Evaluating the Model Performance using Classification Metrics
eval_metric(RF_grid_model, X_train, y_train, X_test, y_test, 'RF_grid_model')
RF_grid_accuracy_test: 0.9781818181818182
RF_grid_accuracy_train: 0.9927213910230489
RF_grid_f1_test: 0.9782903956812433
RF_grid_f1_train: 0.9927662392429477
---------------------------------------------
RF_grid_model Test_Set
[[106 0 0 0 0 0 0 0 0 0]
[ 0 103 7 0 0 0 0 0 0 0]
[ 0 0 119 1 0 0 0 0 0 0]
[ 0 0 0 116 0 0 0 0 0 0]
[ 0 0 0 0 119 0 0 0 0 0]
[ 0 0 0 3 0 94 0 0 0 2]
[ 1 0 0 0 0 0 103 0 1 0]
[ 0 4 1 2 0 0 0 103 0 0]
[ 0 0 0 0 0 0 0 0 104 0]
[ 0 1 0 0 0 0 0 1 0 109]]
precision recall f1-score support
--------------------------------------------------------
RF_grid_model Train_Set
[[1037 0 0 0 0 0 0 0 0 0]
[ 0 995 22 15 0 0 0 1 0 0]
[ 0 2 1018 3 0 0 0 1 0 0]
[ 0 4 2 928 0 2 0 2 0 1]
[ 0 1 0 0 1023 0 0 0 0 1]
[ 0 0 0 1 0 953 0 0 0 2]
[ 0 0 0 0 1 1 949 0 0 0]
[ 0 1 1 3 0 0 0 1027 0 0]
[ 0 0 0 0 0 0 0 0 951 0]
[ 0 1 0 1 1 1 0 1 0 939]]
precision recall f1-score support
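visualizer = ConfusionMatrix(RF_grid_model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show();
# Constructor line inferred from the "Class Prediction Error" title in the cell output
visualizer = ClassPredictionError(RF_grid_model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()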
Out[88]: <Axes: title={'center': 'Class Prediction Error for GridSearchCV'}, xlabel='actual class',
ylabel='number of predicted class'>
Comparison:
The RF Model shows perfect training scores (1.000) against a test accuracy of 0.985, indicating
some overfitting.
The RF Grid Model has lower training scores (0.993) and therefore a smaller train-test gap than
the standard RF Model, although its test accuracy is marginally lower (0.978).
Overall, the RF Grid Model offers a better balance between training and test performance, suggesting
it might generalize better to unseen data.
def labels(ax):
    for p in ax.patches:
        width = p.get_width()                    # bar length
        ax.text(width,                           # x position: the end of the bar
                p.get_y() + p.get_height() / 2,  # y position: vertical center of the bar
                '{:1.3f}'.format(width),         # label text, 3 decimals
                ha='left',                       # horizontal alignment
                va='center')                     # vertical alignment
plt.figure(figsize=(14,10))
plt.subplot(311)
compare = compare.sort_values(by="Accurecy", ascending=False)
ax=sns.barplot(x="Accurecy", y="Models", data=compare, palette="Blues")
labels(ax)
plt.subplot(312)
compare = compare.sort_values(by="F1", ascending=False)
ax=sns.barplot(x="F1", y="Models", data=compare, palette="Blues")
labels(ax)
plt.show()
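The `compare` frame used by the plotting code is not shown in this export. One way it could be assembled is sketched below, using the test-set scores printed for the tuned models and rounded to four decimals. The model labels are assumptions, and the column names follow the plotting code, including its "Accurecy" spelling.

```python
import pandas as pd

# Test-set scores copied from the tuned-model outputs in this notebook.
# Model labels are hypothetical; column names match the plotting code.
compare = pd.DataFrame({
    "Models": ["SVM (grid)", "Random Forest (grid)", "Decision Tree (grid)"],
    "Accurecy": [0.9927, 0.9782, 0.9564],
    "F1": [0.9927, 0.9783, 0.9568],
})

# Highest-accuracy row, the same ordering the bar chart uses.
best = compare.sort_values(by="Accurecy", ascending=False).iloc[0]
print(best["Models"], best["Accurecy"])
```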
svm_grid_model = GridSearchCV(model,
param_grid,
scoring = "accuracy",
n_jobs = -1, # Uses all available cores
verbose=1,
return_train_score=True).fit(X_train, y_train) # fit the model
svm_grid_model
▸ MinMaxScaler
▸ SVC
import pickle
with open("final_digit_class_model", "wb") as f:  # the context manager closes the file
    pickle.dump(final_svm_model, f)
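A saved model is only useful if it loads back and predicts the same way. The sketch below checks that round trip with a stand-in `SVC` trained on synthetic data; `demo_model` and the temporary file name are hypothetical and unrelated to the notebook's `final_svm_model`.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Stand-in model (assumption): any fitted scikit-learn estimator pickles the same way.
X_demo, y_demo = make_classification(n_samples=200, n_features=16, random_state=0)
demo_model = SVC().fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), "demo_digit_model")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(demo_model, f)

# Reload and check that the restored model predicts exactly as the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = bool((restored.predict(X_demo) == demo_model.predict(X_demo)).all())
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.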
Conclusion
Parameters:
Accuracy: 0.993
f1: 0.993
Incorrect Predictions: 8
Overall:
In this project, we used SVM, Random Forest, and Decision Tree models to classify handwritten digits
based on coordinate-based features. Given the complexity of the dataset, we prioritized accuracy and
F1 scores to evaluate model performance, ensuring both precision and recall were balanced.
SVM Model: Demonstrates the highest performance with both accuracy and F1 scores at 0.993.
Random Forest (RF) Model: Follows closely behind SVM with accuracy and F1 scores of 0.978.
Decision Tree (DT) Model: Shows the lowest performance among the three, with accuracy at
0.956 and F1 score at 0.957.
Based on the results, the SVM model was selected as the final model. It consistently outperforms the
other models in both accuracy and F1 score, indicating its strong ability to generalize and accurately
classify the data.
Reason:
High Accuracy and F1 Score: The SVM model achieves the highest accuracy and F1 score, which
suggests it is the best at both correctly predicting the class labels (accuracy) and balancing
precision and recall (F1 score).
Data Characteristics: The dataset involves complex patterns across 16 coordinate-based features.
SVM's ability to find a maximum-margin decision boundary, including a non-linear one when a
kernel is used, makes it well suited to this kind of data.
Importance of F1 Score: The F1 score matters most when a dataset is imbalanced, because it
reflects false positives and false negatives as well as overall accuracy. The classes here are close
to balanced (939 to 1,037 training samples per class), so accuracy and F1 agree closely, and SVM
has the top score on both.
Thank you...