
Enrollment No :202103103510280

PRACTICAL:5

Aim: To implement principal component analysis.


Principal Component Analysis (PCA) is a statistical technique used for reducing the
dimensionality of data while preserving its important information. It's commonly employed in
data analysis and machine learning to simplify datasets with a large number of variables into a
smaller set of derived variables called principal components.

The key steps in PCA include:


1. Standardization: Normalize the dataset to have a mean of zero and a standard
deviation of one for each variable to ensure they are on the same scale.
2. Calculation of Covariance Matrix: Determine the covariance matrix of the standardized
data, which shows the relationships between variables.
3. Eigenvalue Decomposition: Compute the eigenvectors and eigenvalues of the
covariance matrix. Eigenvectors represent the directions (principal components)
of maximum variance, and eigenvalues indicate the magnitude of variance
along these directions.
4. Selection of Principal Components: Sort the eigenvectors based on their corresponding
eigenvalues in descending order. The principal components are chosen according to the
top eigenvalues, as they explain the most variance in the data.
5. Projection: Transform the original data into the new feature space formed by the
selected principal components. This transformation reduces the dimensions while
retaining most of the information present in the original dataset.
PCA is widely used in various fields such as image processing, pattern recognition,
finance, and many others to simplify complex datasets, remove redundant information, and
facilitate further analysis or visualization of data.

Here, we are performing Principal Component Analysis (PCA) on the Iris dataset
using Python.


Step 1: Import necessary libraries

This code block imports necessary libraries for data visualization and dimensionality reduction
using Principal Component Analysis (PCA). It includes Matplotlib for plotting, Pandas for data
manipulation, and scikit-learn's StandardScaler for feature scaling and PCA for dimensionality
reduction.
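A minimal sketch of this import block, matching the libraries described above, is:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA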

Step 2: Load the Iris dataset

Output:


This code block fetches the Iris dataset from the given URL and loads it into a Pandas
DataFrame named 'df'. The dataset contains measurements of sepal length, sepal width, petal
length, and petal width for different iris flowers, with the corresponding target variable
indicating the species of each iris. The 'names' parameter assigns column names to the
DataFrame. The resulting DataFrame 'df' is then printed, displaying the tabular representation
of the Iris dataset.
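The exact URL is not reproduced here; assuming the commonly used UCI Machine Learning Repository copy of the Iris data, the loading step might look like this:

# Assumed source URL for the Iris data (UCI Machine Learning Repository)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'target']
df = pd.read_csv(url, names=names)
print(df)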

Step 3: Standardize the data

Output:

In this code block, a list named 'features' is defined, containing the names of the four features in
the Iris dataset. The features (sepal length, sepal width, petal length, and petal width) are then
extracted from the previously loaded DataFrame 'df' and stored in the variable 'x'. The target
variable ('target', indicating the iris species) is extracted and stored in the variable 'y'. The
features in 'x' are then standardized using the StandardScaler from scikit-learn, ensuring that
they have a mean of 0 and a standard deviation of 1. Finally, the standardized feature values for
the first 5 rows are printed to the console using 'print(x[:5, :])'.


Standardizing features is a common preprocessing step in machine learning to ensure that all features contribute equally to the analysis, particularly in methods sensitive to the scale of input variables.
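A sketch of this standardization step, consistent with the variable names used above, is:

features = ['sepal length', 'sepal width', 'petal length', 'petal width']
x = df.loc[:, features].values         # feature matrix
y = df.loc[:, ['target']].values       # target (iris species)
x = StandardScaler().fit_transform(x)  # zero mean, unit variance
print(x[:5, :])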

Step 4: PCA projection to 2D


Output:

In this code block, a Principal Component Analysis (PCA) with two components is applied to
the standardized feature matrix 'x' using the 'PCA' class from scikit-learn. The resulting
principal components are stored in 'principalComponents'. These components are then used to


create a new DataFrame 'principalDf' with columns named 'principal component 1' and
'principal component 2'. The first 5 rows of 'principalDf' are printed. Additionally, the target
variable ('target') and the first 5 rows of the original DataFrame 'df' are printed to demonstrate
the correspondence between the reduced-dimensional data and the original dataset. Finally, a
new DataFrame 'finalDf' is created by concatenating 'principalDf' with the 'target' column
from the original DataFrame, providing a consolidated DataFrame that includes the principal
components along with the target variable for further analysis or visualization. This process is
often used to reduce the dimensionality of the data for visualization or modeling purposes
while retaining essential information captured by the principal components.
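A sketch of this projection step is:

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
print(principalDf.head(5))
print(df[['target']].head())
finalDf = pd.concat([principalDf, df[['target']]], axis=1)  # components plus target
print(finalDf.head(5))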

Step 5: Visualize 2D Projection

Output:
CGPIT/CE/SEM-6/Machine Intelligence
Enrollment No :202103103510280

This code block creates a scatter plot using Matplotlib to visualize the reduced-dimensional
representation of the Iris dataset obtained through PCA. The figure is set to be 8x6 inches, and a
subplot is added to the figure. The x-axis and y-axis labels are set, and the title is specified.
Three target classes ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica') are assigned different colors ('r'
for red, 'g' for green, 'b' for blue). For each target class, a scatter plot is generated by identifying
the corresponding indices in the 'finalDf' DataFrame and plotting the values of the first two
principal components. The size of the points is set to 50, and a legend is added to the plot
indicating the target classes. Finally, grid lines are added to enhance readability. This
visualization provides insights into the distribution and separation of the iris species in the
reduced two-dimensional space.
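A sketch of the plotting code described above is:

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()
plt.show()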


PRACTICAL:6

Aim: Write a program to apply a decision tree classifier on the Pima Indian diabetes dataset.

A decision tree classifier is a machine learning model that learns to classify data by creating a
tree-like structure of rules based on the features and labels of the training data. The tree
consists of nodes and branches, where each node represents a test or a decision on a feature,
and each branch represents an outcome or a value of that feature. The leaf nodes at the bottom
of the tree contain the class labels or the predictions for the data.
The decision tree classifier works by recursively splitting the data into smaller subsets based
on the feature that best separates the classes. The feature is chosen by using a criterion such as
entropy or Gini impurity, which measures the level of disorder or uncertainty in the data. The
goal is to find the feature that maximizes the information gain or the reduction in entropy or
impurity after the split. The process stops when all the data in a subset belong to the same
class, or when a predefined limit such as the maximum depth of the tree or the minimum
number of samples in a node is reached.

The decision tree classifier can handle both numerical and categorical features, and can also
deal with missing values by assigning them to the most frequent value or the most probable
class. The decision tree classifier is easy to understand and interpret, as it provides a visual
representation of the logic behind the classification. However, it also has some drawbacks,
such as being prone to overfitting, being sensitive to noise and outliers, and being unstable due
to small changes in the data.

How does the decision tree algorithm work?


The basic idea behind any decision tree algorithm is as follows:

1. Select the best attribute using Attribute Selection Measures (ASM) to split the records.
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Start tree building by repeating this process recursively for each child until one of the following conditions is met:
• All the tuples belong to the same attribute value.
• There are no more remaining attributes.
• There are no more instances.

Here, we are applying decision tree classifier on the Pima Indian diabetes dataset using Python.


Step 1: Import necessary libraries

This code block imports the necessary libraries for the decision tree classifier. It starts by importing
necessary libraries, including Pandas for data manipulation and scikit-learn for machine learning.
The DecisionTreeClassifier from scikit-learn is then imported to create a Decision Tree model.
Additionally, the train_test_split function is imported for splitting the dataset into training and
testing sets, and the metrics module is imported to evaluate the model's accuracy.
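A minimal sketch of these imports is:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier        # Decision Tree model
from sklearn.model_selection import train_test_split   # train/test splitting
from sklearn import metrics                             # accuracy evaluation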

Step 2: Load the dataset

Output:

In this code block, a dataset related to Pima Indian women's health, specifically focusing on
diabetes, is loaded into a Pandas DataFrame. The dataset is obtained from a given URL and
has columns representing attributes such as the number of pregnancies, glucose levels, blood
pressure, skin thickness, insulin levels, body mass index (BMI), diabetes pedigree function,
age, and a binary label indicating the presence or absence of diabetes. The read_csv function is

used to read the data, skipping the first row (header) and assigning custom column names
specified in the col_names list. The resulting DataFrame, named 'pima,' is then displayed using
the head() function, providing a glimpse of the first few rows of the dataset.
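The URL and the short column names below are assumptions (a commonly used mirror of the Pima Indians diabetes CSV and the usual abbreviated names); adjust them to match the file actually used in the practical:

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"  # assumed mirror
pima = pd.read_csv(url, skiprows=1, names=col_names)  # skiprows=1 skips the first (header) row
print(pima.head())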

Step 3: Split the dataset into features and target variable

In this code block, the dataset is split into features (X) and the target variable (y). The features
are selected from the 'pima' DataFrame using the specified columns in the 'feature_cols' list,
which includes attributes such as the number of pregnancies, insulin levels, BMI, age, glucose
levels, blood pressure, and the diabetes pedigree function. These features are stored in the
variable X. The target variable y is assigned the values from the 'label' column in the 'pima' DataFrame, representing whether an individual has diabetes (1) or not (0). This separation of features and the target variable is a common preprocessing step before training a machine learning model, allowing for effective training and evaluation.
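A sketch of this step is:

feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols]   # features
y = pima['label']        # target: 1 = diabetes, 0 = no diabetes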

Step 4: Split dataset into training set and test set

This code block utilizes the train_test_split function from scikit-learn to partition the dataset
into training and test sets. The feature matrix (X) and target variable (y) obtained from the
previous step are split into training sets (X_train and y_train) and test sets (X_test and
y_test). The parameter test_size=0.3 indicates that 30% of the data will be used for testing,
while the remaining 70% will be utilized for training the machine learning model. The
random_state=1 parameter ensures reproducibility by fixing the random seed during the
splitting process, resulting in consistent training and evaluation sets across multiple runs.
This step is crucial for assessing the model's generalization performance on unseen data.
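A sketch of the split is:

# 70% training, 30% testing; random_state fixed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)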


Step 5: Building decision tree model

In this code block, a Decision Tree Classifier is instantiated using the DecisionTreeClassifier
class from scikit-learn, creating an object named clf. Subsequently, the classifier is trained
using the training sets (X_train and y_train) through the fit method. This process involves the
algorithm recursively splitting the data based on features to construct a decision tree that can
make predictions. Once the model is trained, it is applied to the test dataset (X_test) using the
predict method, generating predictions stored in the variable y_pred. The decision tree model
is now ready for evaluation and analysis of its predictive performance on the test data.
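A sketch of the model-building step is:

clf = DecisionTreeClassifier()     # create the Decision Tree classifier object
clf = clf.fit(X_train, y_train)    # train on the training sets
y_pred = clf.predict(X_test)       # predict on the test set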

Step 6: Evaluating the model

Output:
Accuracy: 0.6796536796536796
This code block assesses the accuracy of the Decision Tree Classifier model on the test dataset. The accuracy_score function from scikit-learn's metrics module is utilized to compare the predicted labels (y_pred) with the actual labels (y_test). The result, printed as "Accuracy," represents the proportion of correctly classified instances. In this specific output, the accuracy is approximately 0.68, indicating that the model correctly predicted the target variable for around 68% of the instances in the test set. Evaluating accuracy is a common metric to gauge the overall performance of a classification model.
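A sketch of the evaluation step is:

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))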


Step 7: Visualizing decision trees

Output:


In this code block, the Decision Tree Classifier (clf) is visualized using Graphviz. The export_graphviz function generates a DOT format representation of the decision tree, which is stored in the dot_data variable. The pydotplus library is then used to create a graphical representation of the tree from the DOT data. The resulting image is saved as 'diabetes.png,' and the Image module from IPython is employed to display the visualized decision tree directly within the Colab notebook. This visualization provides a detailed overview of the decision-making process and the structure of the trained decision tree model.
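A sketch of this visualization, assuming Graphviz is installed and using the standard export_graphviz/pydotplus workflow, is:

from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols,
                class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  # build graph from the DOT data
graph.write_png('diabetes.png')                             # save the tree image
Image(graph.create_png())                                   # display inline in the notebook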

PRACTICAL:7

Aim: Write a program to classify the various types of iris flowers in the Iris dataset using Support
Vector Machine (SVM).
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. The primary objective of SVM is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data into different classes. This hyperplane is chosen in such a way that it maximizes the margin between the classes. The margin is the distance between the hyperplane and the nearest data point of each class.

Key Concepts:
1. Support Vectors:
• SVM works by finding the hyperplane that best separates the data into
different classes.
• Support Vectors are the data points that lie closest to the decision
boundary (hyperplane).
• These vectors are critical in determining the optimal hyperplane and


ultimately the classification boundary.


2. Hyperplane:
• In a two-dimensional space, a hyperplane is a simple line.
• In a three-dimensional space, it's a plane.
• For higher dimensions, it's referred to as a hyperplane.
• The goal of SVM is to find the hyperplane that best separates the data
into classes.
3. Margin:
• The margin is the distance between the hyperplane and the nearest
data point from either class.
• SVM aims to maximize this margin, resulting in a more robust classifier.
4. Kernel Trick:
• SVM can handle non-linear decision boundaries by transforming the
input features into a higher-dimensional space.

• This is done using kernel functions (e.g., polynomial, radial basis function)
to map the data into a space where a hyperplane can effectively
separate it.
5. C parameter:
• SVM has a regularization parameter denoted as 'C.'
• C determines the trade-off between having a smooth decision boundary
and classifying training points correctly.
• A smaller C allows for a softer decision boundary, while a larger C aims
for a more accurate classification on the training data.

Steps in SVM:
1. Data Collection:
• Gather a dataset with labelled samples.


2. Choose a Kernel Function:
• Select a suitable kernel function based on the nature of the data. Common choices include linear, polynomial, and radial basis function (RBF) kernels.
3. Model Training:
• Train the SVM model using the training dataset. The algorithm optimizes the hyperplane to maximize the margin.
4. Parameter Tuning:
• Adjust hyperparameters, such as the choice of kernel and the regularization parameter C, to optimize the model's performance.
5. Prediction:
• Use the trained model to predict the class labels of new, unseen data.

Advantages of SVM:
• Effective in high-dimensional spaces.
• Robust in the presence of outliers.
• Versatile with various kernel functions for different data types.

Limitations of SVM:
• Computationally expensive, especially for large datasets.
• The choice of the kernel and parameters requires careful tuning.
• It may not perform well when the number of features is much greater than the number of samples.

Step 1: Import necessary libraries


This code block imports the necessary Python libraries for implementing and evaluating a Support Vector Machine (SVM) on a dataset. It includes NumPy for numerical operations, Matplotlib for data visualization, scikit-learn's datasets module to load the Iris dataset, train_test_split for splitting the data into training and testing sets, SVC for creating an SVM classifier, and accuracy_score for evaluating the accuracy of the classifier. The SVM will be trained and tested on the Iris dataset, with the ultimate goal of predicting and assessing the accuracy of the classification results.
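A minimal sketch of these imports is:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score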

Step 2: Load the Iris dataset

In this code block, the Iris dataset is loaded using scikit-learn's datasets module. The dataset
consists of four features for each sample, but for simplicity, only the first two features are
selected and stored in the variables X and y. X represents the feature matrix containing sepal
length and sepal width, while y contains the corresponding target labels denoting the species of
iris flowers. This reduced feature set simplifies the visualization and classification task while
retaining essential information for training an SVM classifier.
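A sketch of this step is:

iris = datasets.load_iris()
X = iris.data[:, :2]   # first two features: sepal length, sepal width
y = iris.target        # species labels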

Step 3: Splits the data into training and testing sets

This code block uses scikit-learn's train_test_split function to split the Iris dataset into training and testing sets. The features (X) and target labels (y) are divided into X_train, X_test, y_train, and y_test, respectively. The parameter test_size=0.2 indicates that 20% of the data will be used for testing, while the remaining 80% will be used for training the Support Vector Machine (SVM) classifier. The random_state=42 parameter ensures reproducibility by fixing the random seed for the data split, allowing consistent results across different runs.
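A sketch of the split is:

# 80% training, 20% testing, with a fixed random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)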

Step 4: Creates an SVM classifier with a linear kernel and trains it on the
training data.


In this code block, a Support Vector Machine (SVM) classifier is created and trained using
scikit-learn's SVC (Support Vector Classification) class with a linear kernel. The kernel='linear' parameter specifies that a linear decision boundary should be used for classification. The classifier is then trained on the training set (X_train, y_train). Subsequently,
predictions are made on the test set (X_test), and the predicted labels are stored in the variable
y_pred. This process allows the evaluation of the classifier's performance on unseen data, which
will be assessed further using accuracy metrics.
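A sketch of this step is:

svm_classifier = SVC(kernel='linear')    # linear decision boundary
svm_classifier.fit(X_train, y_train)     # train on the training set
y_pred = svm_classifier.predict(X_test)  # predict on the test set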

Step 5: Evaluates the accuracy of the classifier on the test set

Output:
Accuracy: 0.90
In this code block, the accuracy of the Support Vector Machine (SVM) classifier is evaluated by
comparing its predictions (y_pred) on the test set (X_test) with the true labels (y_test). The
accuracy_score function from scikit-learn's metrics module is used to calculate the accuracy,
which represents the proportion of correctly classified instances. The result is then printed to the
console, providing a quantitative measure of the SVM classifier's performance on the unseen
data. The accuracy score is a value between 0 and 1, with higher values indicating better
classification performance.
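A sketch of the evaluation step is:

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")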

Step 6: Visualizes the data points and decision boundary of the SVM classifier


Output:


PRACTICAL:8


Aim: Write a program to implement K-means clustering on the Iris dataset.

K-Means is an iterative and unsupervised machine learning algorithm that partitions a dataset
into K distinct, non-overlapping subsets or clusters. The algorithm aims to group similar data
points together and separate different groups based on certain features or attributes.
Algorithm Steps:

1. Initialization:
• Choose the number of clusters (K) that you want to identify in the dataset.
• Randomly initialize K centroids, one for each cluster. Centroids represent the mean position of all the points in a cluster.
2. Assignment:
• For each data point in the dataset, calculate the Euclidean distance to each centroid.
• Assign the data point to the cluster whose centroid is closest.
3. Update:
• Recalculate the centroids for each cluster as the mean of all data points assigned to that cluster.
4. Iteration:
• Repeat the assignment and update steps until convergence.
• Convergence occurs when the assignment of data points to clusters stabilizes, and centroids no longer change significantly.

Key Characteristics:
• Centroids: K-Means defines clusters by their centroids, which represent the center of mass for
the points in a cluster.
• Euclidean Distance: The algorithm uses Euclidean distance to measure the dissimilarity
between data points and centroids.
• Scalability: K-Means is computationally efficient and scalable to large datasets.


• Sensitivity to Initialization: The final clustering result can be sensitive to the initial placement
of centroids. Multiple runs with different initializations may be performed to mitigate this.
• Number of Clusters (K): The number of clusters needs to be predefined, and the algorithm
assumes that the data can be well-represented by this number. K-Means clustering is widely used
due to its simplicity, efficiency, and effectiveness in a variety of applications.
Here, we are applying K-means clustering on the iris dataset using Python.
Step 1: Import necessary libraries:

In this code block, the necessary libraries for implementing K-Means clustering on the Iris
dataset are imported. NumPy (np) is used for numerical operations, Pandas (pd) for data
manipulation, and Matplotlib (plt) for data visualization. The KMeans class from Scikit-Learn is
imported to perform the K-Means clustering algorithm, and the load_iris function is used to load
the Iris dataset. Additionally, StandardScaler from Scikit-Learn is imported to standardize the
features, ensuring that they have zero mean and unit variance, which is a common preprocessing
step for K-Means clustering to achieve better performance.
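A minimal sketch of these imports is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler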

Step 2: Load the Iris dataset:

In this code block, the Iris dataset is loaded using the load_iris function from Scikit-Learn. The
data matrix X contains the features of the dataset, and feature_names stores the names of these
features. The Iris dataset is a well-known benchmark dataset in machine learning, containing
measurements of sepal length, sepal width, petal length, and petal width for three different
species of iris flowers. This code block prepares the data for subsequent processing and analysis
within the K-Means clustering algorithm.
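A sketch of this step is:

iris = load_iris()
X = iris.data                       # feature matrix
feature_names = iris.feature_names  # names of the four features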

Step 3: Standardizes the features using StandardScaler:


In this code block, the features of the Iris dataset stored in the matrix X are standardized to have
zero mean and unit variance using the StandardScaler from Scikit-Learn. Standardization is a
preprocessing step commonly applied in K-Means clustering to ensure that all features contribute
equally to the clustering process, as it minimizes the impact of differences in the scales of
different features. The standardized feature matrix X_std is then obtained by fitting the scaler to
the original data (X) and transforming it accordingly. This standardization enhances the
performance and convergence of the K-Means algorithm by preventing features with larger
scales from dominating the clustering process.
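A sketch of the standardization step is:

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # zero mean, unit variance for every feature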

Step 4: Applies K-Means clustering with K=3 clusters:

In this code block, the K-Means clustering algorithm is applied to the standardized Iris dataset
(X_std) using the KMeans class from Scikit-Learn. The parameter n_clusters=3 specifies that the
algorithm should identify three clusters, corresponding to the three different species of iris
flowers in the dataset. The n_init=10 parameter determines the number of times the algorithm is
run with different initial centroids, and the result with the lowest inertia (sum of squared
distances from points to centroids) is selected. Setting random_state=42 ensures reproducibility
of results. The fit method then performs the actual clustering, assigning each data point to one
of the identified clusters based on their similarity to the cluster centroids.
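A sketch of this step is:

# K-Means with 3 clusters, 10 centroid initializations, and a fixed random seed
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X_std)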

Step 5: Get cluster labels and centroids:


In this code block, after applying the K-Means clustering algorithm, the cluster labels assigned
to each data point are obtained using the labels_ attribute of the K-Means model (kmeans).
Each label indicates the cluster to which the corresponding data point belongs. Additionally, the
coordinates of the centroids of the identified clusters are retrieved using the cluster_centers_
attribute. These centroids represent the average position of the data points within their
respective clusters. Both the cluster labels and centroids are important outputs for further
analysis and interpretation of the clustering results.
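A sketch of this step is:

labels = kmeans.labels_              # cluster assignment for each sample
centroids = kmeans.cluster_centers_  # coordinates of the cluster centroids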

Step 6: Visualizes the clustered data points in a 2D space and marks the cluster centroids with red 'X' markers:
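The plotting code is not described in detail; an illustrative sketch using the first two standardized features is:

plt.figure(figsize=(8, 6))
plt.scatter(X_std[:, 0], X_std[:, 1], c=labels, cmap='viridis', s=50)  # clustered points
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X',
            s=200, label='Centroids')                                  # cluster centroids
plt.xlabel(feature_names[0] + ' (standardized)')
plt.ylabel(feature_names[1] + ' (standardized)')
plt.title('K-Means clustering of the Iris dataset (K=3)')
plt.legend()
plt.show()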

Output:
