Lab Practice-II Manual
CLASS: TE-IT
SEMESTER: I
hrs/week TW: 25 Marks
Prerequisites:
1. Python programming language
Course Objectives:
1. The objective of this course is to provide students with the fundamental elements of machine
learning for classification, regression and clustering.
2. To design and evaluate the performance of different machine learning models.
Course Outcomes:
On completion of the course, students will be able to–
CO1: Implement different supervised and unsupervised learning algorithms.
CO2: Evaluate performance of machine learning algorithms for real-world applications.
Guidelines for Instructor's Manual
The faculty member should prepare the laboratory manual for all the experiments, and it should be made
available to students and the laboratory instructor/assistant.
Guidelines for Student's Lab Journal
1. Students should submit term work in the form of a handwritten journal based on the specified list of
assignments.
2. Practical Examination will be based on the term work.
3. Students are expected to know the theory involved in the experiment.
4. The practical examination should be conducted if and only if the journal of the candidate is
complete in all respects.
Guidelines for Lab /TW Assessment
Through the diagnosis test I predicted 100 reports as COVID positive, but only 45 of those were actually
positive. In total, 50 people in my sample were actually COVID positive, and I have 500 samples in total.
Create a confusion matrix based on the above data and find:
I. Accuracy
II. Precision
III. Recall
IV. F-1 score
2 Assignment on Regression technique
a. Apply Linear Regression using a suitable library function and predict the month-wise temperature.
This data consists of temperatures of INDIA, averaging the temperatures of all places month-wise.
Temperature values are recorded in Celsius.
b. Assess the performance of the regression model using MSE, MAE and R-Square metrics.
c. Visualize the simple regression model.
The counselor of the firm is supposed to check whether the student will get an admission or not based on
This dataset is composed of just one text file, where each line has the correct class followed by the raw
message.
This dataset gives the data of income and money spent by the customers visiting a Shopping Mall. The
data set contains Customer ID, Gender, Age, Annual Income and Spending Score. Therefore, as the mall
owner you need to find the group of people who are the most profitable customers. Apply at least two
clustering algorithms (based on Spending Score) to find the groups of customers.
a. Apply Data pre-processing (Label Encoding, Data Transformation, …) techniques if necessary.
b. Perform data preparation (Train-Test Split).
c. Apply Machine Learning Algorithm
d. Evaluate Model.
e. Apply Cross-Validation and Evaluate Model.
Through the diagnosis test I predicted 100 reports as COVID positive, but only 45 of those were actually
positive. In total, 50 people in my sample were actually COVID positive, and I have 500 samples in total.
Create a confusion matrix based on the above data and find:
I. Accuracy
II. Precision
III. Recall
IV. F-1 score
Theory:
Data Preparation: Data preparation is the process of transforming raw data into a form that data
scientists and analysts can run through machine learning algorithms to uncover insights or make
predictions. All projects have the same general steps; they are:
1. Data Cleaning: Identifying and correcting mistakes or errors in the data, such as missing values,
duplicate rows and corrupt records.
2. Feature Selection: Feature selection refers to techniques for selecting a subset of input features that
are most relevant to the target variable that is being predicted. Feature selection techniques are generally
grouped into those that use the target variable (supervised) and those that do not (unsupervised).
Additionally, the supervised techniques can be further divided into models that automatically select
features as part of fitting the model (intrinsic), those that explicitly choose features that result in the best
performing model (wrapper) and those that score each input feature and allow a subset to be selected
(filter).
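As an illustrative sketch (not prescribed by the syllabus), a filter-style supervised selection can be done with scikit-learn's SelectKBest, which scores each feature against the target and keeps the top-scoring subset; the Iris data below is only a stand-in dataset:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
# Score each feature against the target with an ANOVA F-test and keep the best two
X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)        # per-feature scores
print(X_selected.shape)        # (150, 2)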
3. Data Transforms: Data transforms are used to change the type or distribution of data variables.
Numeric Data Type: Number values.
Integer: Integers with no fractional part.
Real: Floating point values.
Categorical Data Type: Label values.
Ordinal: Labels with a rank ordering.
Nominal: Labels with no rank ordering.
Boolean: Values True and False.
4. Feature Engineering: Feature engineering refers to the process of creating new input variables from
the available data. Engineering new features is highly specific to your data and data types. As such, it
often requires the collaboration of a subject matter expert to help identify new features that could be
constructed from the data.
5. Dimensionality Reduction: The number of input features for a dataset may be considered the
dimensionality of the data. This motivates feature selection, although an alternative to feature selection is
to create a projection of the data into a lower-dimensional space that still preserves the most important
properties of the original data. The most common approach to dimensionality reduction is to use a matrix
factorization technique such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD).
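For instance, a minimal sketch of PCA with scikit-learn (the Iris data here is only an illustration, not part of the assignment):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)          # project the 4 original features onto 2 components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)             # (150, 2)
print(pca.explained_variance_ratio_)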
Confusion Matrix
A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model,
where N is the number of target classes. The matrix compares the actual target values with those predicted
by the machine learning model.
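As a worked sketch of the exercise stated above, the given counts can be arranged into the four confusion-matrix cells and the metrics computed directly (plain Python, no library needed):
# Counts derived from the problem statement: 500 samples, 100 predicted positive
# (45 of them correctly), and 50 actually positive overall.
TP = 45                      # predicted positive and actually positive
FP = 100 - TP                # predicted positive but actually negative
FN = 50 - TP                 # actually positive but predicted negative
TN = 500 - TP - FP - FN      # everything else
accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)   # 0.88 0.45 0.9 0.6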
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
df = pd.read_csv('heart.csv')   # file name assumed for the heart-disease dataset used below
df.info()
#Details of Rows & Columns (Count, Datatypes, Null Values & Memory Usage)
print()
print()
print(df.describe())
#5. 25%, 50%, and 75% are the percentiles/quartiles of each feature.
plt.figure(figsize=(8,6)) #8 by 6 inch
# Gender counts; the 'sex' column (1 = male, 0 = female) is assumed from the heart-disease dataset
male = len(df[df['sex'] == 1])
female = len(df[df['sex'] == 0])
labels = 'Male','Female'
sizes = [male,female]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.show()
#2. cp: chest pain type (values 0-3)
plt.figure(figsize=(8,6))
# Data to plot
labels = 'cp = 0', 'cp = 1', 'cp = 2', 'cp = 3'
sizes = [len(df[df['cp'] == 0]),
         len(df[df['cp'] == 1]),
         len(df[df['cp'] == 2]),
         len(df[df['cp'] == 3])]
# Plot specifications
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.axis('equal')
plt.show()
#3. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
plt.figure(figsize=(8,6))
# Data to plot
labels = 'fasting blood sugar < 120 mg/dl','fasting blood sugar > 120 mg/dl'
sizes = [len(df[df['fbs'] == 0]), len(df[df['fbs'] == 1])]
# Plot
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.axis('equal')
plt.show()
#4. exang: exercise induced angina (1 = yes; 0 = no) -- the plotted column is assumed here
plt.figure(figsize=(8,6))
# Data to plot
labels = 'No','Yes'
sizes = [len(df[df['exang'] == 0]), len(df[df['exang'] == 1])]
# Plot
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.axis('equal')
plt.show()
#1. Heatmap of feature correlations
plt.figure(figsize=(14,8)) #14/8
#cmap: Colourmap
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()
sns.distplot(df['thalach'],kde=False,bins=30,color='violet')
sns.distplot(df['chol'],kde=False,bins=30,color='red')
plt.show()
sns.distplot(df['trestbps'],kde=False,bins=30,color='blue')
plt.show()
plt.figure(figsize=(15,6))
plt.show()
plt.figure(figsize=(8,6))
sns.scatterplot(x='chol',y='thalach',data=df,hue='target')
plt.show()
plt.figure(figsize=(8,6))
sns.scatterplot(x='trestbps',y='thalach',data=df,hue='target')
plt.show()
#Making Predictions
from sklearn.model_selection import train_test_split
X = df.drop('target', axis=1)   # features: all columns except the target
y = df['target']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.3,random_state=42)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train_scaled)
X_test_scaled = scaler.transform(X_test)
X_test = pd.DataFrame(X_test_scaled)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
knn = KNeighborsClassifier()
params = {'n_neighbors': list(range(1, 20)),
          'p': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
          'leaf_size': list(range(1, 20)),
          'weights': ['uniform', 'distance']}
#'model' is assumed to be a grid search over the KNN hyper-parameters above
model = GridSearchCV(knn, params, cv=5)
model.fit(X_train, y_train)
predict = model.predict(X_test)
#Checking accuracy
from sklearn.metrics import accuracy_score
print('Accuracy:', round(accuracy_score(y_test, predict), 5) * 100, '%')
#Confusion Matrix
from sklearn.metrics import confusion_matrix
class_names = [0,1]
fig,ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks,class_names)
plt.yticks(tick_marks,class_names)
cnf_matrix = confusion_matrix(y_test,predict)
print(cnf_matrix)
#Plot the confusion matrix as a heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap='YlGnBu', fmt='g')
ax.xaxis.set_label_position('top')
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
#Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test,predict))
from sklearn.metrics import roc_curve
y_probabilities = model.predict_proba(X_test)[:,1]
false_positive_rate_knn,true_positive_rate_knn,threshold_knn = roc_curve(y_test,y_probabilities)
plt.figure(figsize=(10,6))
plt.title('ROC curve for the KNN classifier')
plt.plot(false_positive_rate_knn,true_positive_rate_knn)
plt.plot([0,1],ls='--')
plt.plot([0,0],[1,0],c='.5')
plt.plot([1,1],c='.5')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()
Output:
Theory:
Classification: Classification may be defined as the process of predicting a class or category from
observed values or given data points. The categorized output can have a form such as "Black" or
"White", or "spam" or "not spam". Mathematically, classification is the task of approximating a mapping
function (f) from input variables (X) to output variables (Y).
Common classification algorithms: Naive Bayes, Logistic Regression, K-Nearest Neighbours, (Kernel) SVM, Decision Tree.
1. Logistic Regression Algorithm: The Logistic Regression equation can be obtained from the Linear
Regression equation. The mathematical steps to get the Logistic Regression equation are given below:
o The equation of the straight line can be written as: y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression y can be between 0 and 1 only, so we divide the above equation by (1-y):
y/(1-y), which is 0 for y = 0 and infinity for y = 1.
o But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation it
becomes: log[y/(1-y)] = b0 + b1x1 + b2x2 + ... + bnxn
Steps in Logistic Regression: To implement Logistic Regression using Python, we will use the
same steps as we have done in previous topics of Regression. Below are the steps:
1. Data pre-processing
2. Fitting Logistic Regression to the training set
3. Predicting the test result
4. Evaluating the model (e.g., accuracy and confusion matrix)
5. Visualizing the result
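A minimal, hedged sketch of these steps with scikit-learn is shown below; the file name 'data.csv' and the binary 'label' column are placeholders, not the assignment dataset:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Placeholder dataset: 'data.csv' with numeric features and a binary 'label' column
df = pd.read_csv('data.csv')
X = df.drop('label', axis=1)
y = df['label']
# Train-test split and feature scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Fit the logistic regression model and evaluate it
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))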
2. Decision Tree Algorithm: Decision trees can be constructed by an algorithmic approach that
splits the dataset in different ways based on different conditions. Decision trees are among the most
powerful algorithms that fall under the category of supervised algorithms.
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values for the best attribute.
Step-4: Generate the decision tree node which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further; the final
node is called a leaf node.
To choose the best attribute at each step of building a decision tree, a technique called the Attribute
Selection Measure (ASM) is used. The popular techniques for ASM are:
1. Information Gain: Information gain is the measurement of the change in entropy after the
segmentation of a dataset based on an attribute. It calculates how much information a feature
provides about a class. According to the value of information gain, we split the node and build
the decision tree: Information Gain = Entropy(S) - [(Weighted Avg) x Entropy(each feature)].
2. Entropy: Entropy is a metric to measure the impurity in a given attribute; it specifies the randomness
in the data. Entropy can be calculated as: Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no),
where S is the set of samples and P(yes), P(no) are the probabilities of the two classes.
3. Gini Index: The Gini index is a measure of impurity or purity used while creating a decision tree in
the CART (Classification and Regression Tree) algorithm: Gini Index = 1 - Σj pj². An attribute with a
low Gini index should be preferred over one with a high Gini index.
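For illustration, a small hedged sketch of training a decision tree with scikit-learn, where criterion='entropy' corresponds to information gain and criterion='gini' to the Gini index (the Iris data is only a stand-in, not the assignment dataset):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load a small built-in dataset just to demonstrate the API
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# criterion='entropy' uses information gain; criterion='gini' would use the Gini index
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(accuracy_score(y_test, tree.predict(X_test)))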
3. SVM Algorithm: Support Vector Machine (SVM) is one of the most popular supervised learning
algorithms, used for classification as well as regression problems. The goal of the SVM algorithm is to
create the best line or decision boundary that can segregate n-dimensional space into classes so that we
can easily put a new data point in the correct category in the future. This best decision boundary is called
a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using a
decision boundary or hyperplane.
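A minimal hedged sketch of an SVM classifier with scikit-learn (synthetic data, only to show the API):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Synthetic 2-class data just to demonstrate the API
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# 'rbf' is the default kernel; kernel='linear' would fit a linear hyperplane
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)
print(accuracy_score(y_test, svm.predict(X_test)))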
Applications of Classification:
1. Sentiment Analysis
2. Email Spam Classification
3. Document Classification
4. Image Classification
Code:
# To load the dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('Admission_Predict.csv')
print(df.head(5))
##########################################################################
#To drop the irrelevant column and check if there are any null values in the dataset
df = df.drop('Serial No.', axis=1)   # column name assumed for the Admission_Predict dataset
print(df.isnull().sum())
plt.show()
plt.show()
plt.show()
plt.show()
#Show CGPA distribution
sns.distplot(df['CGPA'], kde=False, bins=25)   # 'CGPA' column assumed from the surrounding comments
plt.title("Distribution of CGPA")
plt.show()
#It is clear from the distributions that students with varied merit apply to the university.
#Understanding the relation between different factors responsible for graduate admissions: GRE Score vs TOEFL Score
plt.show()
#People with higher GRE Scores also have higher TOEFL Scores, which is justified because both TOEFL and GRE have a verbal section which, although not similar, are relatable
plt.show()
#Although there are exceptions, people with higher CGPA usually have higher GRE scores, maybe because they are smart or hard working
#hue: Variables that define subsets of the data, which will be drawn on separate facets in the grid.
plt.title("LOR vs CGPA")
plt.show()
#LORs (Letter of Recommendation strength) are not that related with CGPA, so it is clear that a person's LOR is not dependent on that person's academic excellence
#Having research experience is usually related with a good LOR, which might be justified by the fact that supervisors have personal interaction with the students performing research, which usually results in good LORs
plt.show()
#GRE scores and LORs are also not that related. People with different kinds of LORs have all kinds of GRE scores
#SOP vs CGPA
plt.title("SOP vs CGPA")
plt.show()
#CGPA and SOP are not that related because the Statement of Purpose is related to academic performance, but since people with good CGPA tend to be more hard working, they have good things to say in their SOP, which might explain the slight shift towards higher CGPA along with good SOPs
plt.show()
#SOP vs TOEFL
plt.title("SOP vs TOEFL")
plt.show()
import numpy as np
corr = df.corr()
#zeros_like(): Returns an array of the given shape and type as the given array, filled with zeros.
dropSelf = np.zeros_like(corr)
dropSelf[np.triu_indices_from(dropSelf)] = True
sns.heatmap(corr, annot=True, mask=dropSelf)   # mask hides the duplicated upper triangle
plt.show()
print("Results...")
#For loop for generating model results
#'models' (a list of (name, estimator) pairs) and the X_train/X_test split are assumed to be defined earlier in the original program
for name, model in models:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
from sklearn.ensemble import RandomForestRegressor
X = df.drop('Chance of Admit', axis=1)   # target column name assumed (some versions of the CSV have a trailing space in the header)
y = df['Chance of Admit']
classifier = RandomForestRegressor()
classifier.fit(X, y)
feature_names = X.columns
print(feature_names)
importance_frame = pd.DataFrame()
importance_frame['Features'] = X.columns
importance_frame['Importance'] = classifier.feature_importances_
importance_frame = importance_frame.sort_values(by=['Importance'], ascending=True)
plt.barh([1,2,3,4,5,6,7], importance_frame['Importance'], align='center', alpha=0.5)
plt.yticks([1,2,3,4,5,6,7], importance_frame['Features'])
plt.xlabel('Importance')
#Clearly, CGPA is the most important factor for graduate admissions, followed by GRE Score.
plt.title('Feature Importances')
plt.show()
Output:
Theory:
Regression:
Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict a continuous output variable based on one or more predictor
variables. It is mainly used for prediction, forecasting, time-series modelling and determining the
cause-and-effect relationship between variables. In regression, we plot a graph between the variables
which best fits the given data points; using this plot, the machine learning model can make
predictions about the data.
Dependent Variable: The main factor in regression analysis which we want to predict or understand is
called the dependent variable. It is also called the target variable.
Independent Variable: The factors which affect the dependent variable, or which are used to predict the
values of the dependent variable, are called independent variables, also known as predictors.
Outliers: An outlier is an observation with either a very low or a very high value in comparison with the
other observed values. An outlier may hamper the result, so it should be avoided.
Multicollinearity: If the independent variables are highly correlated with each other, this condition is
called multicollinearity. It should not be present in the dataset, because it creates problems when
ranking the most influential variable.
Cost Functions:
1. Mean Absolute Error (MAE): MAE is a very simple metric which calculates the absolute
difference between the actual and predicted values: MAE = (1/n) Σ |y - ŷ|.
2. Mean Squared Error (MSE): MSE is the mean of the squared differences between the actual and
predicted values: MSE = (1/n) Σ (y - ŷ)². Squaring avoids the cancellation of positive and negative
errors, which is the benefit of MSE.
3. Root Mean Squared Error (RMSE): As the name suggests, RMSE is simply the square root of the
mean squared error: RMSE = √MSE.
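A small hedged sketch of computing these metrics (and R-Square, as required in the assignment) with scikit-learn; the numbers are toy values, not assignment data:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Toy actual vs. predicted values, only to illustrate the metric functions
y_true = np.array([30.1, 31.5, 29.8, 33.2])
y_pred = np.array([29.7, 31.0, 30.5, 32.4])
print('MAE :', mean_absolute_error(y_true, y_pred))
print('MSE :', mean_squared_error(y_true, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_true, y_pred)))
print('R2  :', r2_score(y_true, y_pred))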
Linear Regression: Linear regression is a statistical regression method which is used for predictive
analysis. It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables. It shows the linear relationship between the independent
variable (X-axis) and the dependent variable (Y-axis), hence called linear regression.
Code:
#Importing required libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
#Reading the input dataset
trainData = pd.read_csv("temperatures.csv")
#Print first 10 records
print(trainData.head(n=10))
#Printing datatypes and columns in the dataset
#datatypes columwise
print("Below are the datatypes of columns:")
print(trainData.dtypes)
print()
#column names
print("Columns in the dataset:", list(trainData.columns))
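# A possible continuation (not the original listing): fit a simple linear regression that predicts the
# annual mean temperature from the year, report MSE, MAE and R-Square, and visualize the fit.
# The column names 'YEAR' and 'ANNUAL' are assumed; adjust them (or use a month column such as
# 'JAN') to match the actual CSV header.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
X = trainData[['YEAR']]           # predictor: year
y = trainData['ANNUAL']           # target: annual mean temperature in Celsius
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print('MSE :', mean_squared_error(y_test, y_pred))
print('MAE :', mean_absolute_error(y_test, y_pred))
print('R2  :', r2_score(y_test, y_pred))
# Visualize the simple regression model
plt.scatter(X, y, s=10, label='observed')
plt.plot(X, reg.predict(X), color='red', label='fitted line')
plt.xlabel('Year')
plt.ylabel('Annual mean temperature (°C)')
plt.legend()
plt.show()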
Output:
This dataset gives the data of Income and money spent by the customers visiting a Shopping Mall. The
data set contains Customer ID, Gender, Age, Annual Income, Spending Score. Therefore, as a mall owner
you need to find the group of people who are the profitable customers for the mall owner. Apply at least
two clustering algorithms (based on Spending Score) to find the group of customers.
Theory:
Approach of Clustering: Clustering or cluster analysis is a machine learning technique which groups
an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters,
consisting of similar data points; the objects with possible similarities remain in a group that has little
or no similarity with another group."
Applications of Clustering: Market Segmentation, Statistical data analysis, Social network analysis,
Image segmentation, Anomaly detection, etc.
K-Means Clustering:
K-Means clustering is the most popular unsupervised learning algorithm. It is used when we have
unlabelled data, which is data without defined categories or groups. The algorithm follows a simple
way to classify a given data set through a certain number of clusters, fixed a priori.
K-Means Algorithm:
1. Data Pre-processing
2. Finding the optimal number of clusters using the elbow method
3. Training the K-means algorithm on the training dataset
4. Visualizing the clusters
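Before the full listing below, a minimal hedged sketch of K-Means applied only to the Spending Score, as the assignment asks; the file and column names ('Mall_Customers.csv', 'Spending Score (1-100)') are assumed and should be adjusted to the actual CSV header:
import pandas as pd
from sklearn.cluster import KMeans
data = pd.read_csv('Mall_Customers.csv')
X = data[['Spending Score (1-100)']]              # cluster on Spending Score only, as the assignment asks
kmeans = KMeans(n_clusters=5, random_state=42)    # 5 clusters chosen here purely for illustration
data['Cluster'] = kmeans.fit_predict(X)
print(data.groupby('Cluster')['Spending Score (1-100)'].mean())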
Code:
import pandas as pd
# Visualization Library
import matplotlib.pyplot as plt
# Scaling
from sklearn.preprocessing import StandardScaler
# Dimensionality reduction
from sklearn.decomposition import PCA
# Clustering
from sklearn.cluster import KMeans
#import numpy as np
data = pd.read_csv('Mall_Customers.csv')
print(data.head())
print(data.shape)
print()
print(data.info())
print()
#Describe Datasets
print(data.describe())
print()
#Missing values
print(data.isnull().sum())
print()
print(data.dtypes)
print()
print(data['Gender'].value_counts())
print(data.info())
#Label encoding and Scaling
#The exact pre-processing in the original listing is not shown; encoding Gender and dropping the
#identifier column is one reasonable choice before scaling.
data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})
data = data.drop('CustomerID', axis=1)   # identifier column name assumed
#Clustering algorithms such as K-means do need feature scaling before they are fed to the algorithm.
sc = StandardScaler()
data_scaled = sc.fit_transform(data)
#Dimensionality reduction
pca = PCA(n_components = 2)
data_pca = pca.fit_transform(data_scaled)
print("data_pca is:",data_pca)
# KMeans Clustering
'''WCSS: Within Cluster Sum of Squares, in other words the sum of squared distances of the points
from their cluster centre. Lower WCSS shows a better clustering (because points in a cluster are more
similar to each other). After the elbow point the speed at which WCSS decreases is lowered.'''
wcss_list = []
for i in range(1, 15):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(data_pca)
    wcss_list.append(kmeans.inertia_)
plt.plot(range(1,15) , wcss_list)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
#Elbow line
plt.show()
#KMeans Algorithm
kmeans = KMeans(n_clusters=4, random_state=42)   # number of clusters read off the elbow plot (assumed here)
kmeans.fit(data_pca)
cluster_id = kmeans.predict(data_pca)
result_data = pd.DataFrame()
result_data['PC1'] = data_pca[:,0]
result_data['PC2'] = data_pca[:,1]
result_data['ClusterID'] = cluster_id
#Colour for each cluster (the first entries of the original mapping are not shown; these are placeholders)
cluster_colors = {0:'tab:red', 1:'tab:green', 2:'tab:blue', 3:'tab:pink'}
#Scatter data
plt.scatter(result_data['PC1'], result_data['PC2'], c = result_data['ClusterID'].map(cluster_colors))
#Heading details
plt.title('Customer segments (K-Means on the PCA components)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
Output:
Theory:
NumPy:
NumPy is a Python library used for working with arrays; it provides fast numerical operations on
one-dimensional and multidimensional arrays.
Methods of NumPy:
1. np.array: This method is useful for creating one-dimensional and multidimensional arrays.
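For example (an illustrative snippet, not from the original listing):
import numpy as np
a = np.array([1, 2, 3])            # one-dimensional array
b = np.array([[1, 2], [3, 4]])     # two-dimensional array
print(a.shape, b.shape)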
PANDAS:
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning,
exploring, and manipulating data. The name "Pandas" has a reference to both "Panel Data", and "Python
Data Analysis" and was created by Wes McKinney in 2008.
Why Use Pandas? Pandas allows us to analyze big data and make conclusions based on statistical
theories. Pandas can clean messy data sets and make them readable and relevant.
Methods of Pandas:
1. pd.Series: A Pandas Series is like a column in a table. It is a one-dimensional array holding data of
any type. With the index argument, you can name your own labels.
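For example (an illustrative snippet):
import pandas as pd
s = pd.Series([7, 2, 9], index=['a', 'b', 'c'])
print(s['b'])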
2. pd.DataFrame: A Pandas DataFrame is a two-dimensional data structure, like a table with rows and
columns.
import pandas as pd
pd.DataFrame({"Name":['Bob','Sam','Anne'],"Marks":[76,25,92]})
3. pd.read_csv: A simple way to store big data sets is to use CSV (comma-separated values) files. CSV
files contain plain text and are a well-known format that can be read by everyone, including Pandas.
import pandas as pd
iris=pd.read_csv('iris.csv')
4. head(): Returns the headers and a specified number of rows, starting from the top.
iris.head()
5. tail(): Returns the headers and a specified number of rows, starting from the bottom.
iris.tail()
6. shape: Returns a tuple with the number of rows and columns in the DataFrame.
iris.shape
7. describe(): Returns basic statistical details (count, mean, std, min, quartiles, max) of the numeric columns.
iris.describe()
8. min(): Returns the minimum value of each column.
iris.min()