ML-LAB-Manual
Student Name :
Register Number :
Signature                                   Signature
Lab Coordinator                             Head of the Department
CONTENT LIST

SL.NO.  EXPERIMENT NAME                                                         PAGE NO.
1.      Introduction                                                            3-4
2.      Program 1: Develop a program to create histograms for all numerical
        features and analyze the distribution of each feature. Generate box
        plots for all numerical features and identify any outliers. Use
        California Housing dataset.                                             5
3.      Program 2: Develop a program to compute the correlation matrix to
        understand the relationships between pairs of features. Visualize the
        correlation matrix using a heatmap to know which variables have strong
        positive/negative correlations. Create a pair plot to visualize
        pairwise relationships between features. Use California Housing
        dataset.                                                                6
4.      Program 3: Implement Principal Component Analysis (PCA) for reducing
        the dimensionality of the Iris dataset from 4 features to 2.            7
5.      Program 4: Implement and demonstrate the Find-S algorithm for training
        data examples stored in a .CSV file.                                    8
6.      Program 5: Implement the k-Nearest Neighbour algorithm to classify
        randomly generated values of x in the range [0,1].                      9
7.      Program 6: Implement the non-parametric Locally Weighted Regression
        algorithm to fit data points.                                           10
8.      Program 7: Demonstrate the working of Linear Regression and
        Polynomial Regression.                                                  11-12
9.      Program 8: Demonstrate the working of the decision tree algorithm
        using the Breast Cancer dataset.                                        13
10.     Program 9: Implement the Naive Bayesian classifier using the Olivetti
        Face dataset.                                                           14-15
11.     Program 10: Implement k-means clustering using the Wisconsin Breast
        Cancer dataset.                                                         16-17
12.     Viva Questions                                                          18
INTRODUCTION
Machine Learning
Machine learning is used everywhere, from automating mundane tasks to offering
intelligent insights, and industries in every sector try to benefit from it. You may
already be using a device that utilizes it, for example a wearable fitness tracker
like Fitbit or an intelligent home assistant like Google Home. There are many more
examples of ML in use:
• Prediction: Machine learning can be used in prediction systems. Considering a loan
example, to compute the probability of a default, the system needs to classify the
available data into groups.
• Image recognition: Machine learning can be used for face detection in an image as
well. There is a separate category for each person in a database of several people.
• Speech recognition: This is the translation of spoken words into text. It is used in
voice searches and more. Voice user interfaces include voice dialing, call routing,
and appliance control. It can also be used for simple data entry and the preparation
of structured documents.
• Medical diagnosis: ML models can be trained to recognize cancerous tissue.
• Financial industry and trading: Companies use ML in fraud investigations and credit
checks.
Types of Machine Learning
Machine learning can be classified into three types of algorithms:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
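A minimal scikit-learn sketch (illustrative only, not one of the lab programs) contrasting the first two types:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression  # supervised: trains on labelled data
from sklearn.cluster import KMeans                   # unsupervised: groups data without labels

X, y = load_iris(return_X_y=True)
print(LogisticRegression(max_iter=1000).fit(X, y).score(X, y))  # uses the labels y
print(KMeans(n_clusters=3, n_init=10).fit_predict(X)[:10])      # never sees y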
Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface (GUI) included in
Anaconda distribution that allows users to launch applications and manage conda
packages, environments and channels without using command-line commands.
Navigator can search for packages on Anaconda Cloud or in a local Anaconda
Repository, install them in an environment, run the packages and update them. It is
available for Windows, macOS and Linux.
The following applications are available by default in Navigator:
JupyterLab, Jupyter Notebook, QtConsole, Spyder, Glue, Orange, RStudio, Visual Studio Code.
Conda
Conda is an open source cross-platform, language-agnostic package manager and
environment management system that installs, runs, and updates packages and
their dependencies. It was created for Python programs, but it can package and
distribute software for any language (e.g., R), including multi-language projects. The
conda package and environment manager is included in all versions of Anaconda,
Miniconda, and Anaconda Repository.
Jupyter Notebook
Jupyter Notebook can colloquially refer to two different concepts, either the user-
facing application to edit code and text, or the underlying file format which is
interoperable across many implementations.
1. Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify any
outliers. Use California Housing dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing()
df = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)

# Histograms for all numerical features
df.hist(figsize=(12, 8), bins=30)
plt.suptitle("Histograms of Numerical Features", fontsize=16)
plt.show()

# Box plots for all numerical features
plt.figure(figsize=(12, 8))
df.boxplot(rot=45)
plt.title("Box Plots of Numerical Features", fontsize=16)
plt.show()

# Count outliers per feature using the 1.5 * IQR rule
def detect_outliers(df):
    outliers_dict = {}
    for column in df.columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
        outliers_dict[column] = outliers.shape[0]
    return outliers_dict

outliers = detect_outliers(df)
print("Outlier count per feature:", outliers)
OUTPUT:-
Outlier count per feature: {'MedInc': 681, 'HouseAge': 0, 'AveRooms': 511, 'AveBedrms': 1424, 'Population': 1196, 'AveOccup': 711, 'Latitude': 0, 'Longitude': 0}
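As a quick sanity check of the 1.5 * IQR rule used above, a toy example (values are illustrative):

import numpy as np

values = np.array([1, 2, 3, 4, 100])                 # 100 is an obvious outlier
q1, q3 = np.percentile(values, [25, 75])             # q1 = 2.0, q3 = 4.0
iqr = q3 - q1
print(values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])  # -> [100]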
2. Develop a program to compute the correlation matrix to understand the relationships
between pairs of features. Visualize the correlation matrix using a heatmap to know which
variables have strong positive/negative correlations. Create a pair plot to visualize pairwise
relationships between features. Use California Housing dataset.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the dataset as a DataFrame (features plus the MedHouseVal target)
housing = fetch_california_housing(as_frame=True)
data = housing.frame

# Compute and visualize the correlation matrix
correlation_matrix = data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of California Housing Features')
plt.show()

# Pair plot of pairwise relationships (a sample keeps rendering fast)
sns.pairplot(data.sample(500, random_state=42), diag_kind='kde')
plt.show()
OUTPUT:-
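As a numeric companion to the heatmap, the correlations with the target column (MedHouseVal in the as_frame DataFrame) can be printed directly; a one-line sketch continuing the listing above:

print(correlation_matrix['MedHouseVal'].sort_values(ascending=False))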
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
data = iris.data
labels = iris.target
label_names = iris.target_names

# Convert to a DataFrame for better visualization
iris_df = pd.DataFrame(data, columns=iris.feature_names)

# Perform PCA to reduce dimensionality to 2
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)

# Create a DataFrame for the reduced data
reduced_df = pd.DataFrame(data_reduced, columns=['Principal Component 1', 'Principal Component 2'])
reduced_df['Label'] = labels

# Plot the reduced data
plt.figure(figsize=(8, 6))
colors = ['r', 'g', 'b']
for i, label in enumerate(np.unique(labels)):
    plt.scatter(
        reduced_df[reduced_df['Label'] == label]['Principal Component 1'],
        reduced_df[reduced_df['Label'] == label]['Principal Component 2'],
        label=label_names[label],
        color=colors[i])
plt.title('PCA on Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid()
plt.show()
OUTPUT:-
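The viva questions below reference computing PCA by hand from the covariance matrix; a minimal NumPy sketch of that equivalent route:

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X_centered = X - X.mean(axis=0)            # center each feature
cov = np.cov(X_centered, rowvar=False)     # 4x4 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # sort components by variance, largest first
X_reduced = X_centered @ eigenvectors[:, order[:2]]   # same subspace as PCA(n_components=2), up to sign
print(eigenvalues[order] / eigenvalues.sum())         # explained variance ratio per component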
4. For a given set of training data examples stored in a .CSV file, implement and demonstrate
the Find-S algorithm to output a description of the set of all hypotheses consistent with the
training examples.
import pandas as pd

def find_s_algorithm(file_path):
    data = pd.read_csv(file_path)
    print("Training data:")
    print(data)
    attributes = data.columns[:-1]
    class_label = data.columns[-1]
    hypothesis = [None] * len(attributes)  # start with the most specific hypothesis
    for _, row in data.iterrows():
        if row[class_label] == 'Yes':  # only positive examples update the hypothesis
            for i, attribute in enumerate(attributes):
                if hypothesis[i] is None:
                    hypothesis[i] = row[attribute]
                elif hypothesis[i] != row[attribute]:
                    hypothesis[i] = '?'  # generalize mismatching attributes
    return hypothesis

file_path = 'sample.csv'
hypothesis = find_s_algorithm(file_path)
print("\nThe final hypothesis is:", hypothesis)
OUTPUT:-
Training data:
id first last gender Marks selected
1 John Doe M 85 Yes
2 Jane Smith F 90 No
3 Jim Brown M 75 Yes
4 Jill White F 88 No
The final hypothesis is: ['?', '?', '?', 'M', '?']
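The program expects a sample.csv file in the working directory; a minimal sketch that writes one matching the training data shown above:

import pandas as pd

rows = [[1, "John", "Doe", "M", 85, "Yes"],
        [2, "Jane", "Smith", "F", 90, "No"],
        [3, "Jim", "Brown", "M", 75, "Yes"],
        [4, "Jill", "White", "F", 88, "No"]]
cols = ["id", "first", "last", "gender", "Marks", "selected"]
pd.DataFrame(rows, columns=cols).to_csv("sample.csv", index=False)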
5. Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly
generated values of x in the range [0,1]. Perform the following based on the dataset
generated.
a. Label the first 50 points {x1,……,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
b. Classify the remaining points, x51,……,x100 using KNN. Perform this for k=1,2,3,4,5,20,30
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

x_values = np.random.rand(100)
y_labels = np.array([1 if x <= 0.5 else 2 for x in x_values[:50]])
x_train = x_values[:50].reshape(-1, 1)
y_train = y_labels
x_test = x_values[50:].reshape(-1, 1)

def classify_and_plot(k_values):
    plt.figure(figsize=(10, 6))
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(x_train, y_train)
        y_pred = knn.predict(x_test)
        plt.scatter(x_test, y_pred, label=f'k={k}')
    plt.xlabel('x values')
    plt.ylabel('Predicted Class')
    plt.title('KNN Classification of Random Points')
    plt.legend()
    plt.show()

k_values = [1, 2, 3, 4, 5, 20, 30]
classify_and_plot(k_values)
OUTPUT:-
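Because the true class of every point follows the same x ≤ 0.5 rule, the test accuracy per k can also be measured; a sketch that reuses the variables from the listing above:

from sklearn.metrics import accuracy_score

y_true = np.array([1 if x <= 0.5 else 2 for x in x_values[50:]])  # labels implied by the rule
for k in [1, 2, 3, 4, 5, 20, 30]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    print(f"k={k}: accuracy = {accuracy_score(y_true, knn.predict(x_test)):.2f}")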
6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.
import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(x, x_query, tau):
    # weight each training point by its distance to the query point
    return np.exp(-np.sum((x - x_query) ** 2, axis=1) / (2 * tau ** 2))

def locally_weighted_regression(x_query, X, y, tau):
    W = np.diag(gaussian_kernel(X, x_query, tau))
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y  # pinv handles near-singular matrices
    return x_query @ theta

np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones(X.shape), X]  # add a bias term (1) to each input

plt.figure(figsize=(12, 8))
for i, tau in enumerate([0.1, 0.5, 1.0, 5.0], start=1):
    y_pred = np.array([locally_weighted_regression(np.r_[1, x], X_bias, y, tau) for x in X])
    plt.subplot(2, 2, i)  # one subplot per bandwidth tau
    plt.scatter(X, y, alpha=0.5, label='Data')
    plt.plot(X, y_pred, color='red', label=f'tau={tau}')
    plt.legend()
plt.suptitle('Locally Weighted Regression')
plt.show()
OUTPUT:-
7. Develop a program to demonstrate the working of Linear Regression and Polynomial
Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for
vehicle fuel efficiency prediction) for Polynomial Regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression_california():
    housing = fetch_california_housing(as_frame=True)
    X = housing.data[["AveRooms"]]
    y = housing.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.plot(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Average number of rooms (AveRooms)")
    plt.ylabel("Median value of homes ($100,000)")
    plt.title("Linear Regression - California Housing Dataset")
    plt.legend()
    plt.show()
    print("Linear Regression - California Housing Dataset")
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))

def polynomial_regression_auto_mpg():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
    column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                    "acceleration", "model_year", "origin"]
    data = pd.read_csv(url, sep=r'\s+', names=column_names, na_values="?")
    data = data.dropna()
    X = data["displacement"].values.reshape(-1, 1)
    y = data["mpg"].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    poly_model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
    poly_model.fit(X_train, y_train)
    y_pred = poly_model.predict(X_test)
    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.scatter(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Displacement")
    plt.ylabel("Miles per gallon (mpg)")
    plt.title("Polynomial Regression - Auto MPG Dataset")
    plt.legend()
    plt.show()
    print("Polynomial Regression - Auto MPG Dataset")
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))

if __name__ == "__main__":
    print("Demonstrating Linear Regression and Polynomial Regression\n")
    linear_regression_california()
    polynomial_regression_auto_mpg()
OUTPUT:-
Polynomial Regression - Auto MPG Dataset
Mean Squared Error: 0.743149055720586
R^2 Score: 0.7505650609469626
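A hypothetical extension, placed inside polynomial_regression_auto_mpg() so it can reuse that function's train/test split, compares polynomial degrees by test R^2:

for degree in [1, 2, 3]:
    m = make_pipeline(PolynomialFeatures(degree=degree), StandardScaler(), LinearRegression())
    m.fit(X_train, y_train)
    print("degree", degree, "R^2:", r2_score(y_test, m.predict(X_test)))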
8. Develop a program to demonstrate the working of the decision tree algorithm. Use Breast
Cancer Data set for building the decision tree and apply this knowledge to classify a new
sample.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

data = load_breast_cancer()
X = data.data
y = data.target

# Train the decision tree and evaluate it on a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Classify a new sample (here, the first test sample)
new_sample = X_test[0].reshape(1, -1)
prediction = clf.predict(new_sample)
prediction_class = "Benign" if prediction == 1 else "Malignant"
print("Predicted class for the new sample:", prediction_class)

plt.figure(figsize=(12, 8))
tree.plot_tree(
    clf,
    filled=True,
    feature_names=data.feature_names.tolist(),  # ensure it's a list
    class_names=data.target_names.tolist()      # convert to list
)
plt.title("Decision Tree - Breast Cancer Dataset")
plt.show()
OUTPUT:-
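As a small follow-up (a sketch assuming the fitted clf from the listing above), the features that drive the tree's splits can be ranked:

import pandas as pd

importances = pd.Series(clf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(5))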
9. Develop a program to implement the Naive Bayesian classifier considering Olivetti Face
Data set for training. Compute the accuracy of the classifier, considering a few test data sets.
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Load the Olivetti faces (400 images of 40 people, flattened to 4096 features)
faces = fetch_olivetti_faces()
X = faces.data
y = faces.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=1))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Show a few test faces with their true and predicted labels
fig, axes = plt.subplots(1, 5, figsize=(12, 3))
for ax, img, t, p in zip(axes, X_test, y_test, y_pred):
    ax.imshow(img.reshape(64, 64), cmap='gray')
    ax.set_title(f"True: {t}\nPred: {p}")
    ax.axis('off')
plt.show()
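The listing imports cross_val_score; a minimal sketch of how it can be used to evaluate the model (assuming 5-fold cross-validation on the full data):

scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(f"Cross-validation accuracy: {scores.mean() * 100:.2f}% (+/- {scores.std() * 100:.2f})")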
OUTPUT:-
Accuracy: 80.83%
Classification Report:
10. Develop a program to implement k-means clustering using Wisconsin Breast Cancer data
set and visualize the clustering result.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report

data = load_breast_cancer()
X = data.data
y = data.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Cluster into two groups (malignant / benign)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X_scaled)

print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))

# Project to 2D for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100,
                edgecolor='black', alpha=0.7)
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label', palette='coolwarm', s=100,
                edgecolor='black', alpha=0.7)
plt.title('True Labels of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="True Label")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100,
                edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()
OUTPUT:-
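One caveat: K-Means assigns arbitrary cluster ids, so the confusion matrix above may appear inverted. A sketch (reusing y and y_kmeans from the listing) that flips the ids when they disagree with the majority of labels:

# if the cluster ids mostly disagree with the true labels, swap them before reporting
if np.mean(y_kmeans == y) < 0.5:
    y_kmeans = 1 - y_kmeans
print(confusion_matrix(y, y_kmeans))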
VIVA QUESTIONS
1. What does the fetch_california_housing() function do?
2. Why are histograms and boxplots used in the code?
3. What is the purpose of the detect_outliers() function?
4. Why is plt.figure() used before the boxplot?
5. Why is it important to detect outliers in a dataset?
6. What is the purpose of the fetch_california_housing(as_frame=True) function?
7. What does the data.corr() function return?
8. What is the significance of using a heatmap in this context?
9. Why is pairplot() used in data analysis?
10. What does diag_kind='kde' do in the pairplot function?
11. What is the purpose of using StandardScaler() in PCA?
12. What does the covariance matrix represent in PCA?
13. Why are eigenvalues and eigenvectors computed from the covariance matrix?
14. What does np.argsort(eigenvalues)[::-1] achieve in the code?
15. What does the final scatter plot represent?
16. What is the main goal of the Find-S algorithm?
17. Why do we initialize the hypothesis with None values?
18. What is the role of the condition if row[class_label] == 'Yes'?
19. Why do we replace values with '?' in the hypothesis?
20. What does the final hypothesis represent?
21. What is the purpose of using the KNeighborsClassifier in this code?
22. How are class labels assigned to the training data?
23. Why is the data reshaped using reshape(-1, 1) before fitting the model?
24. What is the effect of changing the value of k in KNN?
25. What does the final scatter plot represent?
26. What is the role of the gaussian_kernel function in the code?
27. Why is a bias term (1) added to the input features in the locally weighted regression function?
28. What is the purpose of using np.linalg.pinv in the locally weighted regression function?
29. What do the different subplots in the final plot represent?
30. What does the tau parameter control in the locally weighted regression model?
31. What is the purpose of the DecisionTreeClassifier in this code?
32. Why is the dataset split into training and testing sets using train_test_split?
33. How is the accuracy of the decision tree model evaluated?
34. What does the line prediction_class = "Benign" if prediction == 1 else "Malignant" do?
35. What is the purpose of plotting the decision tree using tree.plot_tree?
36. What is the purpose of the LinearRegression model in the linear_regression_california() function?
37. Why is train_test_split used in both regression functions?
38. What does the mean_squared_error metric tell you about the model's performance?
39. How does polynomial regression differ from linear regression in the polynomial_regression_auto_mpg() function?
40. What is the significance of using make_pipeline in polynomial regression?
41. What is the role of the GaussianNB model in this code?
42. Why is the dataset split into training and testing sets using train_test_split?
43. What does the accuracy_score metric measure in this case?
44. How does cross_val_score help evaluate the model?
45. What is the purpose of displaying the images of test samples with true and predicted labels?
46. What is the purpose of using StandardScaler in this code?
47. What does PCA (Principal Component Analysis) do in this code?
48. Why is the confusion_matrix used here?
49. What is the significance of the scatter plots in the visualization?
50. Why are the cluster centroids marked in the final scatter plot?