Pattern Recognition Sahil Malek
Pattern Recognition
(3171613)
B.E. Semester 7
(Information Technology)
Place:
Date:
Preface
The main aim of any laboratory, practical, or field work is to enhance the required skills and to develop students' ability to solve real-time problems by building relevant competencies in the psychomotor domain. With this in view, GTU has designed a competency-focused, outcome-based curriculum for its engineering degree programmes in which sufficient weightage is given to practical work. This underlines the importance of skill enhancement among students and of utilising every minute of the time allotted for practicals, so that students, instructors and faculty members achieve the intended outcomes by actually performing the experiments rather than treating them as merely study-type exercises. For effective implementation of a competency-focused, outcome-based curriculum, every practical must be keenly designed to serve as a tool for developing and enhancing the industry-relevant competencies required of every student. These psychomotor skills are very difficult to develop through the traditional chalk-and-board method of classroom delivery. Accordingly, this lab manual is designed to focus on industry-defined, relevant outcomes rather than the old practice of conducting practicals merely to prove concepts and theory.
By using this lab manual, students can go through the relevant theory and procedure in advance of the actual performance, which creates interest and gives them a basic idea of the experiment beforehand. This, in turn, strengthens the predetermined outcomes. Each experiment in this manual begins with the competency, industry-relevant skills, course outcomes and practical outcomes (objectives). Students are also made aware of the safety measures and necessary precautions to be taken while performing the practical.
This manual also provides guidelines to faculty members to facilitate student-centric lab activities in each experiment by arranging and managing the necessary resources, so that students follow the procedures with the required safety measures and necessary precautions to achieve the outcomes. It also indicates, through rubrics, how students will be assessed.
Pattern Recognition is a fundamental course which deals with the automatic classification of data into meaningful categories. It provides a platform for students to apply Bayesian decision theory, discriminant functions, clustering, and dimensionality reduction techniques such as PCA and Fisher discriminant analysis. Students also learn classification using perceptrons, support vector machines, neural networks (MLPs and RNNs), and decision trees.
Utmost care has been taken while preparing this lab manual; however, there is always scope for improvement. We therefore welcome constructive suggestions for improvement and for the removal of errors, if any.
Pattern Recognition (3171613)
Index
(Progressive Assessment Sheet)
Experiment No: 0
• To impart affordable and quality education in order to meet the needs of industries and
achieve excellence in teaching-learning process.
• To create a conducive research ambience that drives innovation and nurtures research-oriented
scholars and outstanding professionals.
• To collaborate with other academic & research institutes as well as industries in order to
strengthen education and multidisciplinary research.
• To promote equitable and harmonious growth of students, academicians, staff, society and
industries, thereby becoming a center of excellence in technical education.
• To practice and encourage high standards of professional ethics, transparency and
accountability.
To shape the young minds of aspiring Information Technology engineers to become the
front runner in the sustainable technological growth of our country, conserving its rich cultural
heritage and catering to its socioeconomic needs.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for
the public health and safety, and the cultural, societal, and environmental considerations.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions
in societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms
of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team,
to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
Experiment No: 1
Theory:
Ref : https://www.geeksforgeeks.org/bayes-theorem/
Case study:
A hospital wants to screen patients for a rare disease that affects 1 in 10,000 people. The screening
test has a sensitivity of 95% (meaning that it correctly identifies 95% of people who have the
disease) and a specificity of 99% (meaning that it correctly identifies 99% of people who do not
have the disease). However, the test is not perfect and produces false positive and false negative
results. If a patient tests positive for the disease, what is the probability that they actually have the
disease?
Procedure:
1) Open a Python programming terminal.
2) Apply Bayes' theorem to the case study given above.
3) Calculate the probability that the patient actually has the disease.
Code:
# Prior probability of having the disease
p_disease = 0.0001
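The code above only sets the prior; a minimal completion of the Bayes' theorem calculation, using the sensitivity and specificity stated in the case study (variable names are illustrative), is sketched below:
p_no_disease = 1 - p_disease

# Test characteristics from the case study
p_pos_given_disease = 0.95      # sensitivity
p_pos_given_no_disease = 0.01   # 1 - specificity (false positive rate)

# Bayes' theorem: P(disease | positive test)
p_pos = p_pos_given_disease * p_disease + p_pos_given_no_disease * p_no_disease
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_pos

print("P(disease | positive test) =", round(p_disease_given_pos, 4))
# Roughly 0.0094, i.e. under 1% despite the positive result, because the disease is so rare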
Observations:
2. If the probability of event A is 0.4 and the probability of event B is 0.6, what is the probability
of both events happening simultaneously if they are independent?
a) 0.24
b) 0.1
c) 0.6
d) 0.4
Suggested Reference: https://machinelearningmastery.com/bayes-theorem-for-machine-learning/
Experiment No: 2
Objectives:
(a) To understand the concept of minimum-error-rate classification and decision surfaces.
(b) To learn how to use discriminant functions for classification and analysis of data.
(c) To understand the concept of normal density and its importance in classification.
(d) To learn how to implement classification algorithms using Python programming language.
Equipment/Instruments: Desktop/laptop
Theory:
https://machinelearningmastery.com/linear-discriminant-analysis-for-machine-learning/
https://machinelearningmastery.com/linear-discriminant-analysis-with-python/
Procedure:
In this lab, we will learn how to implement minimum-error-rate classification using discriminant
functions. Discriminant functions are used to classify observations into different classes based on
a set of predictors. We will use the iris dataset to demonstrate how to implement discriminant
functions.
Dataset Description:
The iris dataset contains 150 observations of iris flowers. There are three different species of iris
flowers: setosa, versicolor, and virginica. The dataset has four predictors: sepal length, sepal width,
petal length, and petal width.
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
#loading dataset
df = pd.read_csv("Iris.csv")
df.head()
df.describe()
#visualize dataset
sns.pairplot(df, hue='Species')
# Select the feature columns and the target column from the DataFrame
x = df.iloc[:, 1:5].values
y = df.iloc[:, 5].values
print(y)

# Split into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Fit a classifier (a logistic regression model, as implied by the name model_LR)
from sklearn.linear_model import LogisticRegression
model_LR = LogisticRegression(max_iter=200)
model_LR.fit(x_train, y_train)

prediction = model_LR.predict(x_test)
#Calculate Accuracy
from sklearn.metrics import accuracy_score
print (accuracy_score(y_test, prediction)*100)
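Since this experiment is about discriminant functions, a linear discriminant analysis classifier can be fitted on the same split in the same way; a brief sketch (reusing the x_train, x_test, y_train and y_test variables defined above):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Fit an LDA classifier on the training split and score it on the test split
model_LDA = LinearDiscriminantAnalysis()
model_LDA.fit(x_train, y_train)
lda_prediction = model_LDA.predict(x_test)
print(accuracy_score(y_test, lda_prediction) * 100)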
Observations:
#Accuracy Score:
#Output Result :
Suggested Reference:
https://www.analyticsvidhya.com/blog/2021/05/bayesian-decision-theory-discriminant-functions-and-
normal-densitypart-3/
Experiment No: 3
Apply unsupervised learning and clustering methods on a WINE dataset using KMeans,
Hierarchical, and Gaussian mixture models. The results will be validated using criterion
functions such as silhouette score and Calinski-Harabasz index.
Date:
• Objectives: To learn how to use different techniques to transform and preprocess data to
make it suitable for clustering. To learn how to interpret and visualize clustering results to
gain insights into the underlying data structure.
Equipment/Instruments:
Desktop/laptop with Materials:
Python programming environment
NumPy and Scikit-learn libraries
Theory:
Ref : https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/
Procedure:
1. Load the WINE dataset using the scikit-learn library.
2. Preprocess the data by scaling the features to have zero mean and unit variance.
3. Split the data into training and testing sets.
4. Implement the KMeans clustering algorithm by setting the number of clusters and fitting
the model to the training data.
5. Predict the cluster assignments for the testing data using the trained KMeans model.
6. Evaluate the clustering results using the silhouette score and Calinski-Harabasz index.
7. Apply step 1 to 6 for Hierarchical clustering algorithm and Gaussian mixture model
8. Evaluate the clustering results using the silhouette score and Calinski-Harabasz index.
Code:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# 1. Load the WINE dataset
data = load_wine()
X = data.data  # Features
y = data.target  # Target labels

# 2. Preprocess: scale the features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# 4. Fit the KMeans model on the training data
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
kmeans.fit(X_train)

# 5. Predict the cluster assignments for the testing data
y_pred = kmeans.predict(X_test)

# 6. Evaluate the clustering results using the silhouette score and Calinski-Harabasz index
from sklearn.metrics import silhouette_score, calinski_harabasz_score
silhouette_kmeans = silhouette_score(X_test, y_pred)
calinski_harabasz_kmeans = calinski_harabasz_score(X_test, y_pred)

# 7. Apply steps 1 to 6 for Hierarchical clustering algorithm and Gaussian mixture model
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Hierarchical clustering (no separate predict method, so fit_predict is applied to the test data)
agg_clustering = AgglomerativeClustering(n_clusters=n_clusters)
y_agg_pred = agg_clustering.fit_predict(X_test)

# Gaussian mixture model
gmm = GaussianMixture(n_components=n_clusters, random_state=42)
gmm.fit(X_train)
y_gmm_pred = gmm.predict(X_test)

# 8. Evaluate the clustering results for Hierarchical clustering and Gaussian mixture model
silhouette_agg = silhouette_score(X_test, y_agg_pred)
calinski_harabasz_agg = calinski_harabasz_score(X_test, y_agg_pred)
silhouette_gmm = silhouette_score(X_test, y_gmm_pred)
calinski_harabasz_gmm = calinski_harabasz_score(X_test, y_gmm_pred)

# Visualize the cluster assignments on the first two scaled features
# KMeans
plt.figure(figsize=(12, 4))
plt.subplot(131)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, cmap='viridis')
plt.title("KMeans Clustering")

# Hierarchical clustering
plt.subplot(132)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_agg_pred, cmap='viridis')
plt.title("Hierarchical Clustering")

# Gaussian mixture model
plt.subplot(133)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_gmm_pred, cmap='viridis')
plt.title("Gaussian Mixture Model")
plt.show()
Observation:
A common way to choose the number of clusters is the elbow method: plot the within-cluster
sum of squares against the number of clusters and look for the point on the graph where the
reduction in this value slows down. That's often the right number of clusters.
Silhouette analysis is another approach: it measures how similar each point is to its own cluster
compared to others. The higher the silhouette score, the better the clustering.
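A short sketch of the elbow method described above, assuming the scaled WINE features X_scaled from the Code section:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Within-cluster sum of squares (inertia) for a range of cluster counts
inertias = []
k_values = range(2, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow method')
plt.show()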
3. How does the initialization of KMeans algorithm affect the final clustering results?
The initialization of the KMeans algorithm can have a big impact on the final clustering results.
KMeans starts with initial guesses for the cluster centroids, and if those guesses are poorly
chosen, it can lead to suboptimal clusters. This is because KMeans can get stuck in local
optima, meaning it might not find the best possible clusters.
A common way to improve initialization is by using the KMeans++ algorithm. It spreads out
the initial centroids in a more informed way, helping to lead to better and more stable
clustering results.
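A small illustration of this point, comparing a single random initialization with k-means++ on the scaled data (X_scaled is assumed from the Code section):
from sklearn.cluster import KMeans

# 'random' places initial centroids uniformly at random; 'k-means++' spreads them out
for init in ['random', 'k-means++']:
    km = KMeans(n_clusters=3, init=init, n_init=1, random_state=0).fit(X_scaled)
    print(init, "-> final inertia:", round(km.inertia_, 2))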
4. Explain the concept of Gaussian mixture model and how it is used in clustering.
Gaussian Mixture Models (GMMs) are like KMeans' sophisticated cousin. They assume that
data points are generated from a mixture of several Gaussian distributions with unknown
parameters. Essentially, instead of just assigning each data point to a single cluster, GMM
assigns a probability of belonging to each cluster.
This is useful in clustering because it allows for soft clustering, where a data point can belong
to multiple clusters with different probabilities. This flexibility can model more complex data
distributions and overlap between clusters, which KMeans can't handle as well.
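A brief sketch of the soft assignments described above (refitting a GMM on the scaled WINE features X_scaled assumed from the Code section):
from sklearn.mixture import GaussianMixture

# Each row of predict_proba sums to 1: the probability of belonging to each cluster
gmm = GaussianMixture(n_components=3, random_state=42).fit(X_scaled)
print(gmm.predict_proba(X_scaled[:5]).round(3))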
Suggested Reference:
1. https://towardsdatascience.com/unsupervised-learning-with-k-means-clustering-generate-color-
palettes-from-images-94bb8e6a1416
2. https://neptune.ai/blog/clustering-algorithms
Experiment No: 4
To implement dimensionality reduction techniques such as Principal Component Analysis
(PCA) and Fisher Discriminant Analysis (FDA) on MNIST data and visualize the data.
Date:
Equipment/Instruments:
Desktop/laptop with Materials:
Python programming environment
NumPy and Scikit-learn libraries
Theory:
Ref :
https://medium.com/machine-learning-researcher/dimensionality-reduction-pca-and-lda-
6be91734f567
Procedure:
1. Load the MNIST dataset
The MNIST dataset contains 70,000 handwritten digit images. Each image is of size 28 x
28 pixels. The dataset is divided into 60,000 training images and 10,000 testing images.
2. Preprocess the dataset
The dataset needs to be preprocessed before applying any machine learning algorithms. The
preprocessing steps include normalization and flattening.
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import fetch_openml

# 1. Load the MNIST dataset (70,000 handwritten digit images, flattened to 784 features)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X = mnist.data
y = mnist.target

# 2. Preprocess: normalize pixel values to the range [0, 1]
X = X / 255.0

# PCA projection onto two components (unsupervised)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# FDA/LDA projection onto two components (supervised, uses the class labels)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit(X, y).transform(X)
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.title('PCA')
for i in range(10):
plt.scatter(X_pca[y == str(i)][:, 0], X_pca[y == str(i)][:, 1], label=str(i))
plt.legend()
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.subplot(1, 2, 2)
plt.title('FDA')
for i in range(10):
plt.scatter(X_lda[y == str(i)][:, 0], X_lda[y == str(i)][:, 1], label=str(i))
plt.legend()
plt.xlabel('LDA 1')
plt.ylabel('LDA 2')
plt.tight_layout()
plt.show()
Observation:
Suggested Reference:
Experiment No: 5
Implement linear discriminant functions using gradient descent procedures, the Perceptron
algorithm, and Support Vector Machines (SVM).
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
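The imports above do not show the classifiers themselves; a minimal sketch of the three approaches named in the aim, on a synthetic two-class dataset (the actual dataset used in the lab is not shown, so make_classification is a stand-in):
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.svm import SVC

# Synthetic two-class data, standardized and split
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Linear discriminant functions learned three ways
models = {
    "Perceptron": Perceptron(max_iter=1000),
    "Gradient descent (SGD, hinge loss)": SGDClassifier(loss='hinge', max_iter=1000),
    "SVM (linear kernel)": SVC(kernel='linear'),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))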
Observations:
2. How does the non-linearly separable nature of the dataset affect the performance and
convergence of the Perceptron algorithm compared to gradient descent and SVM
algorithms?
Ans:
Non-linearly separable data is tough terrain for the Perceptron algorithm. It struggles
because it can't find a single hyperplane to separate the data, leading to no convergence.
Gradient descent algorithms, particularly in combination with more complex models like
neural networks, can navigate this better because they can capture non-linear
relationships.
Support Vector Machines (SVMs) excel here; they use kernel tricks to transform the data
into a higher dimension where it becomes linearly separable. This way, SVMs can still find an
optimal separating boundary even when the original data is not linearly separable.
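A compact illustration of this point on data that is not linearly separable (make_moons is an assumed stand-in, not the lab's dataset):
from sklearn.datasets import make_moons
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Two interleaving half-moons: no single hyperplane separates the classes
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("Perceptron", Perceptron(max_iter=1000)),
                    ("Linear SVM", SVC(kernel='linear')),
                    ("RBF-kernel SVM", SVC(kernel='rbf'))]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))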
Experiment No: 6
Design and train a Multilayer Perceptron (MLP) feedforward neural network for a classification
task using the CIFAR-10 Dataset.
Competency and Practical Skills:
1. Enhances competencies in neural network design, implementation, data preprocessing,
model training and optimization, performance evaluation, hyperparameter tuning, and
result analysis.
2. It develops practical skills in utilizing deep learning frameworks, handling real-world
datasets, and making informed decisions for classification tasks.
Relevant CO: CO2, CO3, CO4
Objectives:
1. Train the MLP network using the CIFAR-10 dataset and evaluate its performance in
classifying the images.
2. Analyze the impact of hyperparameters and network architecture on performance:
experiment with different hyperparameters, such as learning rate, batch size, and number
of hidden neurons, and observe their impact on the MLP network's performance.
Equipment/Instruments:
Computer/laptop with Python and necessary libraries (such as TensorFlow, Keras, or PyTorch)
installed. CIFAR-10 dataset (readily available in Keras or other libraries).
Theory:
Ref : https://towardsdatascience.com/implementing-a-deep-neural-network-for-the-cifar-10-
dataset-c6eb493008a5
Procedure:
1. Dataset Selection: The CIFAR-10 dataset is a widely used dataset for image classification
tasks. It contains 60,000 color images of size 32x32 pixels, belonging to 10 different classes
(such as airplanes, cars, cats, etc.). The dataset is divided into 50,000 training images and
10,000 testing images.
2. Data Preprocessing: Import the CIFAR-10 dataset using the Keras library. The dataset is
already preprocessed, so you can skip this step.
3. Network Architecture Design: Design the architecture of the MLP network for image
classification. Since CIFAR-10 images are relatively small, a simple MLP with fully
connected layers can be used. Determine the number of neurons in the input layer based on the
image size (32x32x3). Decide on the number of hidden layers, the number of neurons in each
layer, and the activation functions to be used. You can start with a few hidden layers and
gradually increase the complexity if needed.
4. Network Implementation: Implement the MLP network using a deep learning framework such
as Keras or TensorFlow. Define the network architecture by specifying the number of layers,
the number of neurons in each layer, and the activation functions. Set the appropriate input
shape to match the image size (32x32x3).
5. Training and Testing: Split the CIFAR-10 dataset into training and testing sets (usually 80:20
or 70:30 ratio). Use the training set to train the MLP network. Select an appropriate optimizer
(e.g., stochastic gradient descent) and a suitable loss function (e.g., categorical cross-entropy)
for multi-class classification. Train the network for a specified number of epochs, observing
the training loss and accuracy.
6. Hyperparameter Tuning: Experiment with different hyperparameters to improve the
network's performance. Adjust hyperparameters such as learning rate, batch size, number of
hidden neurons, and number of epochs. Observe the impact of these hyperparameters on the
network's accuracy and convergence. You can also explore techniques like regularization or
dropout to mitigate overfitting.
7. Performance Evaluation: Evaluate the trained MLP network on the testing set. Calculate and
analyze various performance metrics such as accuracy, precision, recall, and F1-score to
assess the network's effectiveness in classifying CIFAR-10 images. Additionally, generate a
confusion matrix to visualize the classification results and identify any class-specific
performance issues.
8. Comparison and Discussion: Compare the performance of the MLP network with other
classification algorithms applied to the CIFAR-10 dataset. Discuss the strengths and
weaknesses of the MLP approach for the given image classification task. Analyze the impact
of different hyperparameters and network architectures on the performance.
9. Visualization: Visualize the training progress by plotting the training loss and accuracy curves
over the epochs. Additionally, visualize the network's learned features or extract intermediate
layer outputs to gain insights into how the MLP network is processing the CIFAR-10 images.
10. Documentation and Analysis: Summarize the experimental setup, results, and findings in a
comprehensive report. Discuss the accuracy achieved by the MLP network and the impact of
different hyperparameters, and analyze the strengths and limitations observed.
Code:
import tensorflow as tf
from tensorflow import keras
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.utils import to_categorical
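# --- Assumed sketch (the data loading, model definition and training steps are not
# --- shown in the manual): load CIFAR-10 and train a simple MLP so that the
# --- `history` object used by the plots below exists.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0              # normalize pixel values
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Flatten(input_shape=(32, 32, 3)),      # 32x32x3 image -> 3072 inputs
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(10, activation='softmax')        # 10 CIFAR-10 classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=10, batch_size=64,
                    validation_data=(x_test, y_test))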
# Step 9: Visualization
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))

# Training and validation loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')

# Training and validation accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.show()
Observations:
1. How does increasing the number of hidden layers and neurons affect the
performance of the MLP network trained on the CIFAR-10 dataset?
Ans:
Increasing the number of hidden layers and neurons in a Multi-Layer Perceptron (MLP)
network can have a couple of effects:
• Capacity to Learn: With more layers and neurons, the network can capture more
complex patterns and structures in the CIFAR-10 dataset. It can potentially achieve
higher accuracy and better representation of the data.
• Risk of Overfitting: However, adding too many layers and neurons can lead to
overfitting, where the network performs exceptionally well on training data but poorly
on unseen test data.
• Computational Complexity: More layers and neurons increase the computational
resources required for training—longer training times and more memory usage.
• Optimization Difficulty: As the network becomes deeper and wider, it can be harder to
train. Issues like vanishing and exploding gradients can make convergence a
challenge.
So, it's about balancing complexity and performance to avoid overfitting and ensure
generalization.
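As a sketch of how depth and width can be increased while guarding against the overfitting mentioned above (layer sizes are illustrative; Dropout is one of the regularization options referred to in the procedure):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout

# A wider and deeper MLP variant with dropout between the hidden layers
deeper_model = Sequential([
    Flatten(input_shape=(32, 32, 3)),
    Dense(1024, activation='relu'),
    Dropout(0.3),
    Dense(512, activation='relu'),
    Dropout(0.3),
    Dense(256, activation='relu'),
    Dense(10, activation='softmax')
])
deeper_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])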
2. Do using different activation functions in the hidden layers of the MLP network have
an impact on the classification accuracy when training on the CIFAR-10 dataset?
Ans:
Absolutely. Activation functions are like the secret sauce of neural networks. Different
functions can significantly affect how well your MLP network learns and classifies.
• ReLU (Rectified Linear Unit) is popular because it helps mitigate the vanishing
gradient problem, enabling deep networks to learn better. It often performs well with
image data like CIFAR-10.
• Sigmoid and Tanh functions can struggle with deep networks because they saturate,
leading to vanishing gradients. They can still be useful for specific applications where
their properties align well with the data.
• Leaky ReLU and ELU (Exponential Linear Unit) offer variations that try to fix the
"dying ReLU" problem by allowing a small gradient when the input is negative.
Each activation function can lead to different outcomes, so it's worth trying a few to see what
works best for your specific dataset and network architecture.
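A brief sketch of swapping the hidden-layer activation, e.g. Leaky ReLU in place of ReLU (layer sizes are illustrative):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, LeakyReLU

# Same MLP shape, but Leaky ReLU activations in the hidden layers
model_leaky = Sequential([
    Flatten(input_shape=(32, 32, 3)),
    Dense(512), LeakyReLU(),
    Dense(256), LeakyReLU(),
    Dense(10, activation='softmax')
])
model_leaky.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])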
Experiment No: 7
Implementing Recurrent Neural Networks (RNNs) for Sequential Data Analysis.
Procedure:
1. Prepare the dataset:
▪ Datasets such as the LibriSpeech dataset or the TIMIT dataset can be used for
speech recognition tasks where the sequential data consists of audio signals.
▪ Preprocess the dataset by cleaning, normalizing, and transforming it into a suitable
format.
▪ Split the dataset into training and testing sets.
2. Implement an RNN model using the chosen deep learning framework:
▪ Import the necessary libraries and modules.
▪ Load the dataset into the program.
▪ Encode the sequential data into a suitable numerical representation, such as one-
hot encoding or word embeddings.
▪ Split the dataset into input sequences and corresponding target sequences.
▪ Split the data into training and testing sets.
3. Configure the RNN architecture:
▪ Select the appropriate RNN layer type (basic RNN, LSTM, or GRU) based on the
problem and dataset.
▪ Determine the number of RNN layers and their parameters, such as the number of
units or hidden states.
▪ Set other hyperparameters, such as learning rate, batch size, and number of
epochs.
▪ Define the loss function, optimizer, and evaluation metrics for training the model.
4. Train the RNN model:
▪ Initialize the RNN model with the defined architecture.
▪ Train the model using the training dataset.
▪ Monitor the training progress, such as loss convergence and model performance.
5. Evaluate the trained model:
▪ Use the trained RNN model to make predictions on the testing dataset.
▪ Calculate performance metrics, such as accuracy, precision, recall, and F1-score.
▪ Visualize the model's performance using appropriate plots, such as accuracy
curves or confusion matrices.
6. Fine-tune the model:
▪ Experiment with different hyperparameters, such as learning rate, number of
hidden units, or dropout rate, to observe their impact on performance.
▪ Iterate the training and evaluation process to find the optimal configuration.
Code:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
dataset = dataset.map(preprocess).batch(1)
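Because the dataset object and the preprocess function are not shown above, the following is a self-contained sketch of the same workflow on synthetic sequential data (a sine wave), using an LSTM to predict the next value of the sequence:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

# 1-2. Synthetic sequential data: predict the next point of a sine wave from the last 20 points
series = np.sin(np.linspace(0, 100, 2000))
window = 20
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]                      # shape: (samples, timesteps, features)

# Split into training and testing sets
split = int(0.8 * len(X))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# 3-4. Configure and train a small LSTM model
model = Sequential([
    LSTM(32, input_shape=(window, 1)),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test))

# 5. Evaluate the trained model
print("Test MSE:", model.evaluate(X_test, y_test, verbose=0))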
Observation:
Experiment No: 8
Non-metric Methods for Pattern Classification: Analyzing Nominal Data using Decision
Trees.
Relevant CO: CO1, CO2, CO3, CO4
Objectives:
• Gain hands-on experience in implementing decision tree algorithms for classification or
regression tasks on datasets with nominal data.
• Discuss the strengths and limitations of decision trees for analyzing nominal data and
compare them to other machine learning algorithms.
Equipment/Instruments:
Computer with Python installed.
Car Evaluation Dataset: This dataset includes attributes related to the evaluation of car features,
such as buying price, maintenance cost, number of doors, and luggage capacity. It can be used to
predict the acceptability of a car using decision trees.
Decision tree libraries (e.g., scikit-learn in Python).
Theory:
Ref : https://towardsdatascience.com/decision-trees-d07e0f420175
Procedure:
1. Data Loading and Exploration:
• Load the Car Evaluation Dataset: This dataset includes attributes related to the evaluation
of car features, such as buying price, maintenance cost, number of doors, and luggage
capacity. It can be used to predict the acceptability of a car using decision trees
• Perform initial data exploration, including checking for missing values and
understanding the distribution of the target variable.
• Explore the non-numeric or nominal attributes and identify any unique categories or
patterns.
2. Data Preprocessing:
• Handle missing values: Decide on an appropriate strategy for handling missing values,
such as imputation or removal.
• Encode categorical variables: Convert non-numeric or nominal attributes into a numerical
format suitable for decision tree analysis. This can be done using techniques like label
encoding or one-hot encoding.
3. Dataset Split:
Split the preprocessed dataset into training and testing sets. The recommended split is usually 70-
30 or 80-20 for training and testing, respectively.
4. Decision Tree Implementation:
Import the necessary decision tree libraries in the chosen programming language (e.g., scikit-learn
in Python). Define the decision tree model and set any desired hyperparameters (e.g., maximum
tree depth, minimum samples for a split). Fit the decision tree model to the training data using the
appropriate function or method.
5. Model Training and Evaluation:
• Once the decision tree model is fitted to the training data, evaluate its performance on the
testing data.
• Calculate relevant evaluation metrics such as accuracy, precision, recall, or mean squared
error (for regression tasks).
• Interpret the results and analyze the model's effectiveness in handling non-numeric or
nominal data.
6. Model Visualization and Interpretation:
• Visualize the decision tree model to understand the learned decision rules and important
features.
• Use libraries or functions to generate visual representations of the decision tree, such as tree
diagrams or rule sets.
• Interpret the decision rules and discuss the significance of each node and branch in the tree.
7. Fine-tuning and Optimization:
• Experiment with different hyperparameters to optimize the decision tree model's
performance. For example, vary the tree depth or the minimum number of samples required
for a split.
• Evaluate the model's performance after each adjustment and compare the results to
determine the optimal hyperparameter settings.
Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import export_text, export_graphviz
import graphviz

# 1. Data Loading and Exploration (column names follow the UCI Car Evaluation dataset; file name assumed)
columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
data = pd.read_csv("car_evaluation.csv", names=columns)
data.tail()
data.info()

# 2. Data Preprocessing: label-encode the nominal attributes
for col in data.columns:
    data[col] = LabelEncoder().fit_transform(data[col])

# 3. Dataset Split
X = data.drop("class", axis=1)
y = data["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 4. Decision Tree Implementation
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)

# 5. Model Training and Evaluation
y_pred = decision_tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred))

# 6. Model Visualization: decision rules as text and as a Graphviz diagram
print(export_text(decision_tree, feature_names=list(X.columns)))
dot_data = export_graphviz(decision_tree, out_file=None, feature_names=list(X.columns), filled=True)
graph = graphviz.Source(dot_data)
graph.format = 'png'  # You can choose the format you prefer (e.g., 'png', 'pdf', 'svg', etc.)
graph.render("car_decision_tree", view=False)  # Save the visualization to a file (optional)
Observations:
The choice of splitting criterion, Gini index or information gain, affects both the construction and performance of decision trees.
• Gini Index measures the impurity of a dataset. A lower Gini index indicates a
purer node. It's computationally less expensive and can handle splits better with
categorical attributes but might be biased towards attributes with more levels.
• Information Gain is based on entropy and calculates the reduction in randomness
or disorder. It's more robust and works well with nominal data, providing clearer
insights. However, it's computationally more intensive.
Example:
• For datasets with non-numeric, high-cardinality features (many unique values), the
Gini Index might be faster and simpler.
• For datasets requiring nuanced decisions where interpretability is key, information
gain could provide more meaningful splits.
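A short sketch comparing the two splitting criteria on the same encoded split (reusing X_train, X_test, y_train, y_test from the Code section):
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# criterion='gini' uses the Gini index; criterion='entropy' uses information gain
for criterion in ['gini', 'entropy']:
    tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
    tree.fit(X_train, y_train)
    print(criterion, accuracy_score(y_test, tree.predict(X_test)))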
2. Describe the concept of ensemble learning and its relevance to decision trees.
Discuss two popular ensemble techniques, namely Bagging and Boosting, and
explain how they can enhance the performance of decision tree models.
Ans:
Ensemble learning is about combining multiple models to create one strong predictive
model. It leverages the idea that a group of weak learners can come together to form a robust
model.
Bagging (Bootstrap Aggregating) involves training multiple versions of a model on
different subsets of the data (drawn with replacement). By averaging the predictions, it
reduces variance and helps the model generalize better.
Boosting focuses on training models sequentially. Each model tries to correct the errors of
the previous one. This iterative process reduces bias and improves the model's performance
on difficult cases.
So, bagging focuses on reducing overfitting by averaging, while boosting hones in on
reducing errors and refining the model iteratively. Both techniques supercharge decision
trees, making them more accurate and resilient.
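A brief sketch of both ensemble techniques applied to decision trees (reusing the encoded Car Evaluation split from the Code section; both classifiers use a decision tree as their default base learner):
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Bagging: many trees trained on bootstrap samples, combined by majority vote
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: shallow trees trained sequentially, each correcting the previous one's errors
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

for name, model in [("Bagging", bagging), ("AdaBoost", boosting)]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))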