DWDM Lab All

This lab collection covers feature scaling (standard and min-max), clustering (k-means, k-means++, mini-batch k-means, k-medoids, agglomerative), classification (naive Bayes, an ID3-style decision tree, support vector machine, multilayer perceptron), and Apriori association-rule mining. Each lab gives Python code and its output on datasets such as the Iris, diabetes, and breast cancer data.

Uploaded by

PoojaDevi Sharma

1.1 Standard Scaler
1. Write a Python program to implement Standard Scaler
import pandas as pd


class StandardNorm:
    def scale(self, df):
        # Work on a copy so the caller's DataFrame is left unchanged
        df = df.copy()
        for col in df.columns:
            mean = df[col].mean()
            sd = df[col].std()  # pandas default: sample standard deviation (ddof=1)
            df[col] = (df[col] - mean) / sd
        return df


df = pd.DataFrame(
    [[45000, 42], [32000, 26], [58000, 48], [37000, 32]],
    columns=["Salary", "Age"]
)
print("Original Data")
print(df)

s = StandardNorm()
df_scaled = s.scale(df)

print("\nScaled Data")
print(df_scaled)

Original Data
Salary Age
0 45000 42
1 32000 26
2 58000 48
3 37000 32

Scaled Data
Salary Age
0 0.176318 0.506803
1 -0.969750 -1.114967
2 1.322386 1.114967
3 -0.528954 -0.506803
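
The scaled values above depend on which standard deviation is used. pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below reuses the same four rows to check both variants; the variable names are illustrative and not part of the original lab.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    [[45000, 42], [32000, 26], [58000, 48], [37000, 32]],
    columns=["Salary", "Age"],
)

# Sample standard deviation (ddof=1): reproduces the "Scaled Data" table
sample_scaled = (df - df.mean()) / df.std(ddof=1)

# Population standard deviation (ddof=0): what StandardScaler computes
population_scaled = (df - df.mean()) / df.std(ddof=0)
sklearn_scaled = StandardScaler().fit_transform(df)

print(sample_scaled.round(6))
print("Matches StandardScaler:", np.allclose(population_scaled.to_numpy(), sklearn_scaled))
```

With only four rows the two conventions differ noticeably; on large datasets the gap becomes negligible.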

1.2 Min-max scaler


1. Write a Python program to implement Min-max Scaler
import pandas as pd


class MinMaxNorm:
    def scale(self, df):
        # Work on a copy so the caller's DataFrame is left unchanged
        df = df.copy()
        for col in df.columns:
            # Named lo/hi to avoid shadowing the built-in min() and max()
            lo = df[col].min()
            hi = df[col].max()
            df[col] = (df[col] - lo) / (hi - lo)
        return df


df = pd.DataFrame(
    [[45000, 42], [32000, 26], [58000, 48], [37000, 32]],
    columns=["Salary", "Age"]
)
print("Original Data")
print(df)

s = MinMaxNorm()
df_scaled = s.scale(df)

print("\nScaled Data")
print(df_scaled)

Original Data
Salary Age
0 45000 42
1 32000 26
2 58000 48
3 37000 32

Scaled Data
Salary Age
0 0.500000 0.727273
1 0.000000 0.000000
2 1.000000 1.000000
3 0.192308 0.272727
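
Min-max scaling maps each value with x' = (x - min) / (max - min), so the column minimum becomes 0 and the maximum becomes 1. As a minimal sketch, the same idea can be extended to an arbitrary target range [low, high]; the function name `min_max_scale` and its `low`/`high` parameters are assumptions made for this illustration.

```python
import pandas as pd


def min_max_scale(df, low=0.0, high=1.0):
    """Rescale every column of df linearly so its values span [low, high]."""
    col_min = df.min()
    col_max = df.max()
    unit = (df - col_min) / (col_max - col_min)  # values in [0, 1]
    return unit * (high - low) + low


df = pd.DataFrame(
    [[45000, 42], [32000, 26], [58000, 48], [37000, 32]],
    columns=["Salary", "Age"],
)

unit_scaled = min_max_scale(df)                  # same range as the table above
symmetric_scaled = min_max_scale(df, -1.0, 1.0)  # rescaled to [-1, 1]

print(unit_scaled)
print(symmetric_scaled)
```

For example, the salary 37000 maps to (37000 - 32000) / (58000 - 32000) = 5000 / 26000, which is about 0.192308. A constant column would make the denominator zero, so it needs separate handling.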

2.1 K-means clustering


1. Write a Python program to implement K-means Clustering algorithm. Generate 1000
2D data points in the range 0-100 randomly. Divide data points into 3 clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# 1000 random 2D points with coordinates in [0, 100)
data = np.random.rand(1000, 2) * 100

km = KMeans(n_clusters=3, init="random")
km.fit(data)

centers = km.cluster_centers_
labels = km.labels_

print("Cluster centers: ", *centers)
# print("Cluster labels: ", *labels)

colors = ["r", "g", "b"]
markers = ["+", "x", "*"]

# Draw each cluster with its own color and marker
for k in range(len(centers)):
    points = data[labels == k]
    plt.scatter(points[:, 0], points[:, 1], color=colors[k], marker=markers[k])

# Mark the cluster centers with large squares
plt.scatter(centers[:, 0], centers[:, 1], marker="s", s=100, linewidths=5)
plt.show()

Cluster centers: [69.00427765 76.70986483] [70.62044459 24.38044043] [20.03717511 48.04890483]
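
The program above delegates the clustering to scikit-learn. To show what happens inside `fit`, here is a minimal NumPy sketch of the classic k-means iteration (Lloyd's algorithm): assign every point to its nearest center, then move each center to the mean of its points. The function names, the fixed seed, and the iteration cap are assumptions for this illustration; scikit-learn's implementation is more elaborate.

```python
import numpy as np


def nearest_center(data, centers):
    """Index of the closest center for every point (Euclidean distance)."""
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def lloyd_kmeans(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as starting centers
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(data, centers)
        # Move each center to the mean of its points; keep it if the cluster is empty
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: centers no longer move
        centers = new_centers
    return centers, nearest_center(data, centers)


rng = np.random.default_rng(42)
data = rng.random((1000, 2)) * 100

centers, labels = lloyd_kmeans(data, k=3)
print("Cluster centers:", *centers)
```

Because the starting centers are random, different seeds can converge to different partitions, which is why library implementations usually run several restarts and keep the best one.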

2.2 K-means++ clustering


1. Write a Python program to implement K-means++ Clustering algorithm. Generate
1000 2D data points in the range 0-200 randomly. Divide data points into 4 clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# 1000 random 2D points with coordinates in [0, 200)
data = np.random.rand(1000, 2) * 200

km = KMeans(n_clusters=4, init="k-means++")
km.fit(data)

centers = km.cluster_centers_
labels = km.labels_

print("Cluster centers: ", *centers)
# print("Cluster labels: ", *labels)

colors = ["r", "g", "b", "y"]
markers = ["+", "x", "*", "."]

# Draw each cluster with its own color and marker
for k in range(len(centers)):
    points = data[labels == k]
    plt.scatter(points[:, 0], points[:, 1], color=colors[k], marker=markers[k])

# Mark the cluster centers with large squares
plt.scatter(centers[:, 0], centers[:, 1], marker="s", s=100, linewidths=5)
plt.show()

Cluster centers: [149.84988926 48.45735275] [ 49.86238183 151.59163234] [55.0320991 49.91663519] [144.96502525 151.80352045]
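
The only difference from plain k-means is how the initial centers are chosen. A minimal sketch of the k-means++ seeding step follows: the first center is a uniformly random data point, and each further center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name `kmeans_pp_init` and the fixed seeds are assumptions for this illustration.

```python
import numpy as np


def kmeans_pp_init(data, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: a data point chosen uniformly at random
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        chosen = np.array(centers)
        # Squared distance from every point to its nearest chosen center
        sq_dists = ((data[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2)
        d2 = sq_dists.min(axis=1)
        # Far-away points are proportionally more likely to become the next center
        probs = d2 / d2.sum()
        centers.append(data[rng.choice(len(data), p=probs)])
    return np.array(centers)


rng = np.random.default_rng(42)
data = rng.random((1000, 2)) * 200

seeds = kmeans_pp_init(data, k=4)
print("Initial centers:", *seeds)
```

Points that are already centers have distance zero and therefore cannot be picked twice, so the seeds tend to be spread over the data. scikit-learn's `init="k-means++"` option uses a refinement of this sampling scheme.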
3.1 K-means Clustering
1. Write a Python program to implement K-means Clustering algorithm. Generate
10000 2D data points in the range 0-100 randomly. Divide data points into 5 clusters.
Find time taken by the algorithm to find clusters.
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# 10000 random 2D points with coordinates in [0, 100)
data = np.random.rand(10000, 2) * 100

km = KMeans(n_clusters=5, init="random")

# Time only the clustering step (CPU time of this process)
t0 = time.process_time()
km.fit(data)
t1 = time.process_time()

tt = t1 - t0
print("Total Time:", tt)

centers = km.cluster_centers_
labels = km.labels_

print("Cluster Centers:", centers)
# print("Cluster Labels:", *labels)

colors = ["g", "r", "b", "y", "m"]
markers = ["+", "x", "*", ".", "d"]

# Draw each cluster with its own color and marker
for k in range(len(centers)):
    points = data[labels == k]
    plt.scatter(points[:, 0], points[:, 1], color=colors[k], marker=markers[k])

# Mark the cluster centers
plt.scatter(centers[:, 0], centers[:, 1], marker="o", s=50, linewidths=5)
plt.show()

Total Time: 0.859375


Cluster Centers: [[24.75092479 76.18202085]
[74.95770536 76.31237726]
[15.76193527 26.71298897]
[84.14572431 27.64304157]
[50.1416164 26.58749603]]
3.2 Mini-Batch K-means Clustering
1. Write a Python program to implement Mini-batch K-means Clustering algorithm.
Generate 10000 2D data points in the range 0-100 randomly. Divide data points into 5
clusters. Find time taken by the algorithm to find clusters. Vary the batch size from
100 to 1500, find time taken by the algorithm in each case and find best value of the
batch size.
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

# 10000 random 2D points with coordinates in [0, 100)
data = np.random.rand(10000, 2) * 100

mbk = MiniBatchKMeans(n_clusters=5, init="random", batch_size=500)

# Time only the clustering step (wall-clock time)
t0 = time.time()
mbk.fit(data)
t1 = time.time()

tt = t1 - t0
print("Total Time: ", tt)

centers = mbk.cluster_centers_
labels = mbk.labels_

print("Cluster Centers:", centers)
# print("Labels:", labels)

colors = ["g", "r", "b", "y", "m"]
markers = ["+", "x", "*", ".", "d"]

# Draw each cluster with its own color and marker
for k in range(len(centers)):
    points = data[labels == k]
    plt.scatter(points[:, 0], points[:, 1], color=colors[k], marker=markers[k])

# Mark the cluster centers
plt.scatter(centers[:, 0], centers[:, 1], marker="o", s=50, linewidths=5)
plt.show()

C:\Users\Tirtha Raj Poudel\miniconda3\envs\test_venv\lib\site-packages\sklearn\cluster\_kmeans.py:1046: UserWarning: MiniBatchKMeans
is known to have a memory leak on Windows with MKL, when there are
less chunks than available threads. You can prevent it by setting
batch_size >= 1024 or by setting the environment variable
OMP_NUM_THREADS=2
  warnings.warn(

Total Time: 0.3155672550201416


Cluster Centers: [[71.60844294 18.24470869]
[21.28126715 73.60137663]
[23.66927609 24.79247076]
[67.68171442 84.73901063]
[77.10017603 53.80019715]]
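
The task also asks to vary the batch size from 100 to 1500, while the listing above runs a single value (500). One possible way to do the sweep is sketched below. The step of 100, the fixed seed, `n_init=3`, and the use of `time.perf_counter` are assumptions for this illustration, and the measured times depend on the machine, so no particular batch size should be expected to win everywhere.

```python
import time
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
data = rng.random((10000, 2)) * 100

timings = {}
for batch_size in range(100, 1501, 100):
    mbk = MiniBatchKMeans(
        n_clusters=5, init="random", batch_size=batch_size,
        n_init=3, random_state=0,
    )
    # perf_counter is a monotonic, high-resolution wall clock
    t0 = time.perf_counter()
    mbk.fit(data)
    timings[batch_size] = time.perf_counter() - t0

for batch_size, seconds in timings.items():
    print(f"batch_size={batch_size:5d}  time={seconds:.4f} s")

best = min(timings, key=timings.get)
print("Fastest batch size in this run:", best)
```

Single measurements are noisy, so repeating each fit a few times and averaging gives a more trustworthy choice of batch size.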
4. KMedoids Clustering and Agglomerative Clustering
1. Write a Python program to find clusters of Iris Dataset using KMedoids Clustering
Algorithm.
2. Write a Python program to find clusters of Iris Dataset using Agglomerative Clustering
Algorithm. Compare them in terms of different performance measures.

4.1 KMedoids Clustering


# !pip install scikit-learn-extra

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn_extra.cluster import KMedoids
from sklearn import metrics
import matplotlib.pyplot as plt

iris_data = load_iris()

x = iris_data.data
y = iris_data.target

# print(x[:5])
# print(y[:5])

# Standardize the four features before clustering
scaler = StandardScaler().fit(x)
sx = scaler.transform(x)

km = KMedoids(n_clusters=3)
py = km.fit_predict(sx)  # fits the model and returns one cluster label per sample
# print("Predicted: ", py)

# 3D scatter plot of the first three standardized features
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection="3d")

colors = ["g", "r", "b"]
markers = ["+", "x", "*"]

for k in range(3):
    pts = sx[py == k]
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], color=colors[k], marker=markers[k])
plt.show()

ri = metrics.rand_score(y, py)
print("Rand Index:", ri)

hs = metrics.homogeneity_score(y, py)
print("Homogeneity Score:", hs)

cs = metrics.completeness_score(y, py)
print("Completeness Score:", cs)

sil = metrics.silhouette_score(sx, py, metric="euclidean")
print("Silhouette Coefficient:", sil)

Rand Index: 0.8367785234899329
Homogeneity Score: 0.6672491406379297
Completeness Score: 0.6701843437329579
Silhouette Coefficient: 0.4590416105554613
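
The Rand index reported above is the fraction of sample pairs on which the clustering and the true classes agree: either both put the pair together or both keep it apart. A minimal pair-counting sketch follows. The function name and the six toy labels are hypothetical and serve only to illustrate the definition; `metrics.rand_score` computes the same quantity more efficiently.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of pairs treated the same way by both labelings."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred  # together in both, or apart in both
        total += 1
    return agree / total


truth = [0, 0, 0, 1, 1, 1]     # hypothetical class labels
clusters = [1, 1, 0, 0, 2, 2]  # hypothetical cluster labels

print("Rand index:", rand_index(truth, clusters))
print("Relabeled clusters:", rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Only the grouping matters, not the label values, which is why cluster numbers do not have to match class numbers.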

4.2 Agglomerative Clustering


from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics
import matplotlib.pyplot as plt

iris_data = load_iris()

x = iris_data.data
y = iris_data.target

# print(x[:5])
# print(y[:5])

# Standardize the four features before clustering
scaler = StandardScaler().fit(x)
sx = scaler.transform(x)

ac = AgglomerativeClustering(n_clusters=3)
py = ac.fit_predict(sx)  # fits the model and returns one cluster label per sample
# print("Predicted: ", py)

# 3D scatter plot of the first three standardized features
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection="3d")

colors = ["g", "r", "b"]
markers = ["+", "x", "*"]

for k in range(3):
    pts = sx[py == k]
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], color=colors[k], marker=markers[k])
plt.show()

ri = metrics.rand_score(y, py)
print("Rand Index:", ri)

hs = metrics.homogeneity_score(y, py)
print("Homogeneity Score:", hs)

cs = metrics.completeness_score(y, py)
print("Completeness Score:", cs)

sil = metrics.silhouette_score(sx, py, metric="euclidean")
print("Silhouette Coefficient:", sil)

Rand Index: 0.8252348993288591
Homogeneity Score: 0.6578818079976051
Completeness Score: 0.6940248415952218
Silhouette Coefficient: 0.4466890410285909
5. Naive Bayes Classifier and ID3 Decision Tree Classifier
1. Write a Python program to predict diabetes using Naive Bayes Classification.
2. Write a Python program to predict diabetes using ID3 Decision Tree Classifier. Compare
the performance of both classifiers.

5.1 Naive Bayes Classifier


import pandas as pd
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

dataset = pd.read_csv("Diabetes.csv")

print("Dataset Size: ", len(dataset))

split = int(len(dataset) * 0.7)


train, test = dataset.iloc[:split], dataset.iloc[split:]

# Feature column names, spelled as in the header of the Diabetes.csv file used here
features = ["Pragnency", "Glucose", "Blod Pressure", "Skin Thikness",
            "Insulin", "BMI", "DFP", "Age"]

traininput = train[features].values
trainlabels = train["Diabetes"].values
# print(traininput)

model = GaussianNB()
model.fit(traininput, trainlabels)

testinput = test[features].values
d = test["Diabetes"].values  # actual classes of the test rows

predicted = model.predict(testinput)
# print('Actual Class:', *d)
# print('Predicted Class:', *predicted)

print("Confusion Matrix:")
print(metrics.confusion_matrix(d, predicted))

print("\nClassification Measures:")
print("Accuracy:", metrics.accuracy_score(d, predicted))
print("Recall:", metrics.recall_score(d, predicted))
print("Precision:", metrics.precision_score(d, predicted))
print("F1-score:", metrics.f1_score(d, predicted))

Dataset Size: 767


Confusion Matrix:
[[128 24]
[ 30 49]]

Classification Measures:
Accuracy: 0.7662337662337663
Recall: 0.620253164556962
Precision: 0.6712328767123288
F1-score: 0.6447368421052632

5.2 ID3 Decision Tree Classifier


import pandas as pd
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

dataset = pd.read_csv("Diabetes.csv")

print("Dataset Size: ", len(dataset))

split = int(len(dataset) * 0.7)


train, test = dataset.iloc[:split], dataset.iloc[split:]

# Feature column names, spelled as in the header of the Diabetes.csv file used here
features = ["Pragnency", "Glucose", "Blod Pressure", "Skin Thikness",
            "Insulin", "BMI", "DFP", "Age"]

traininput = train[features].values
trainlabels = train["Diabetes"].values
# print(traininput)

# scikit-learn builds CART-style binary trees; criterion="entropy" makes it
# choose splits by information gain, the measure that ID3 uses
model = DecisionTreeClassifier(criterion="entropy", max_depth=4)
model.fit(traininput, trainlabels)

testinput = test[features].values
d = test["Diabetes"].values  # actual classes of the test rows

predicted = model.predict(testinput)
# print('Actual Class:', *d)
# print('Predicted Class:', *predicted)

print("Confusion Matrix:")
print(metrics.confusion_matrix(d, predicted))

print("\nClassification Measures:")
print("Accuracy:", metrics.accuracy_score(d, predicted))
print("Recall:", metrics.recall_score(d, predicted))
print("Precision:", metrics.precision_score(d, predicted))
print("F1-score:", metrics.f1_score(d, predicted))

Dataset Size: 767


Confusion Matrix:
[[118 34]
[ 17 62]]

Classification Measures:
Accuracy: 0.7792207792207793
Recall: 0.7848101265822784
Precision: 0.6458333333333334
F1-score: 0.7085714285714286
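
All four measures reported for the two classifiers follow directly from their confusion matrices, which makes the comparison easy to check by hand. The sketch below recomputes them from the two matrices printed above, assuming the scikit-learn layout [[TN, FP], [FN, TP]]; the function name `binary_metrics` is an assumption for this illustration.

```python
def binary_metrics(cm):
    """Derive the standard measures from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # share of actual positives that were found
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Confusion matrices as reported in sections 5.1 and 5.2
reported = {
    "Naive Bayes": [[128, 24], [30, 49]],
    "Decision tree": [[118, 34], [17, 62]],
}

results = {name: binary_metrics(cm) for name, cm in reported.items()}
for name, scores in results.items():
    print(name, {key: round(value, 4) for key, value in scores.items()})
```

On this particular split the decision tree finds more of the diabetic cases (62 of 79 versus 49 of 79), giving it the higher recall and F1-score, while naive Bayes is slightly more precise.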

6. Support Vector Machine and Multilayer Perceptron
1. Write a Python program to classify breast cancer data using support vector machine.
2. Write a Python program to predict breast cancer data using multilayer perceptron.
Compare the performance of both classifiers.

6.1 Support Vector Machine


from sklearn import datasets
from sklearn.svm import SVC
from sklearn import metrics

cancer = datasets.load_breast_cancer()
x = cancer.data
y = cancer.target
print("Length of Data:", len(cancer.data))

split = int(len(x) * 0.7)


trainx, testx = x[:split], x[split:]
trainy, testy = y[:split], y[split:]

print("Number of features: ", len(cancer.feature_names))


# print("Features: ", *cancer.feature_names)
print("Number of classes: ", len(cancer.target_names))
print("Class Labels: ", cancer.target_names)

model = SVC(kernel="linear") # Linear Kernel


model.fit(trainx, trainy)
yp = model.predict(testx)

# print("Actual Class: ", *testy)


# print("Predicted Class: ", *yp)

print("\nConfusion Matrix:")
print(metrics.confusion_matrix(testy, yp))

print("\nClassification Measures:")
print("Accuracy:", metrics.accuracy_score(testy, yp))
print("Recall:", metrics.recall_score(testy, yp))
print("Precision:", metrics.precision_score(testy, yp))
print("F1-score:", metrics.f1_score(testy, yp))

Length of Data: 569


Number of features: 30
Number of classes: 2
Class Labels: ['malignant' 'benign']

Confusion Matrix:
[[ 39 0]
[ 9 123]]

Classification Measures:
Accuracy: 0.9473684210526315
Recall: 0.9318181818181818
Precision: 1.0
F1-score: 0.9647058823529412
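
The program above takes the first 70% of the rows for training, in file order, and feeds unscaled features to the SVM. As a hedged alternative, the sketch below shuffles with a stratified split and standardizes the features inside a pipeline. The choice of `random_state=0` and the pipeline itself are assumptions for this illustration, so its scores will not match the numbers reported above.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = load_breast_cancer(return_X_y=True)

# Shuffle the rows and keep the class proportions equal in both parts
trainx, testx, trainy, testy = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0
)

# The scaler is fitted on the training part only, then applied to the test part
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(trainx, trainy)
yp = model.predict(testx)

accuracy = metrics.accuracy_score(testy, yp)
print(metrics.confusion_matrix(testy, yp))
print("Accuracy:", accuracy)
```

Stratifying matters when the rows are ordered or the classes are imbalanced, because an unshuffled split can leave the test part with a class mix unlike the training part.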

6.2 Multilayer Perceptron


from sklearn import datasets
from keras.models import Sequential
from keras.layers import Dense
from sklearn import metrics
import numpy as np

cancer = datasets.load_breast_cancer()
x = cancer.data
y = cancer.target

split = int(len(x) * 0.7)


trainx, testx = x[:split], x[split:]
trainy, testy = y[:split], y[split:]

print("Number of features: ", len(cancer.feature_names))


# print("Features: ", *cancer.feature_names)
print("Number of classes: ", len(cancer.target_names))
print("Class Labels: ", cancer.target_names)

# Define the keras model


model = Sequential()
model.add(Dense(128, input_dim=30, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

# Compile and fit the model


model.compile(loss="binary_crossentropy", optimizer="adam",
metrics=["accuracy"])
model.fit(trainx, trainy, epochs=200, batch_size=16, verbose=0)

# Make class predictions with the model


yp = model.predict(testx)

# Threshold the sigmoid outputs at 0.5 to get 0/1 class labels
pred = (yp > 0.5).astype(int).ravel()
# print("Actual Class: ", *testy)
# print("Predicted Class: ", *pred)

print("\nConfusion Matrix:")
print(metrics.confusion_matrix(testy, pred))

print("\nClassification Measures:")
print("Accuracy:", metrics.accuracy_score(testy, pred))
print("Recall:", metrics.recall_score(testy, pred))
print("Precision:", metrics.precision_score(testy, pred))
print("F1-score:", metrics.f1_score(testy, pred))

Number of features: 30
Number of classes: 2
Class Labels: ['malignant' 'benign']

Confusion Matrix:
[[ 36 3]
[ 2 130]]

Classification Measures:
Accuracy: 0.9707602339181286
Recall: 0.9848484848484849
Precision: 0.9774436090225563
F1-score: 0.981132075471698

7. Multi-class Classification Using MLP


import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from sklearn.metrics import classification_report

dataset = pd.read_csv("iris.csv")
dataset = dataset.values
dataset = shuffle(dataset)

x = dataset[:, 0:4].astype(float)
y = dataset[:, 4]

# Encode class values as integers
encoder = LabelEncoder()
encoder.fit(y)
ey = encoder.transform(y)
# print(*y)
# print(*ey)

# Convert integers to dummy variables (i.e. one hot encoded)
dy = to_categorical(ey)

# Normalize input attributes


sc = StandardScaler().fit(x)
sx = sc.transform(x)

# Train/Test split
split = int(len(x) * 0.7)
trainx, testx = sx[:split], sx[split:]
trainy, testy = dy[:split], dy[split:]

# Define the keras model


model = Sequential()
model.add(Dense(64, input_dim=4, activation="relu"))
model.add(Dense(32, activation="relu"))
model.add(Dense(16, activation="relu"))
model.add(Dense(units=3, activation="softmax"))

# Compile and fit the model


model.compile(loss="categorical_crossentropy", optimizer="adam",
metrics=["accuracy"])
model.fit(trainx, trainy, epochs=20, batch_size=8, verbose=0)

# Make class predictions with the model


yp = model.predict(testx)
yp = np.argmax(yp, axis=-1)
yp = yp.ravel()

# Recover the integer class index from each one-hot encoded test label
a = np.argmax(testy, axis=1)
al = encoder.inverse_transform(a)
pl = encoder.inverse_transform(yp)

# print('Actual Class: ', *al)


# print('Predicted Class: ', *pl)

print(classification_report(al, pl))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        13
Iris-versicolor       0.85      0.94      0.89        18
 Iris-virginica       0.92      0.79      0.85        14

       accuracy                           0.91        45
      macro avg       0.92      0.91      0.91        45
   weighted avg       0.91      0.91      0.91        45
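
The program encodes the class names twice (names to integers, integers to one-hot rows) and later reverses both steps with `argmax` and `inverse_transform`. The round trip can be sketched in plain NumPy as follows; the four sample labels are hypothetical, and `np.unique` with `np.eye` stand in for `LabelEncoder` and `to_categorical` only for the sake of illustration.

```python
import numpy as np

# Hypothetical labels, in the style of the last column of iris.csv
labels = np.array(["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"])

# np.unique sorts the class names and returns each label's index in that order
classes, encoded = np.unique(labels, return_inverse=True)

# One row per sample, with a single 1 in the column of its class
one_hot = np.eye(len(classes))[encoded]

# argmax undoes the one-hot step; indexing into classes undoes the integer encoding
decoded = classes[np.argmax(one_hot, axis=1)]

print(classes)
print(encoded)
print(one_hot)
print(decoded)
```

The softmax output of the network has the same shape as `one_hot`, so the same `argmax` step turns predicted probabilities into class indices.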

8. Apriori Algorithm
# !pip install apyori

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

dataset = pd.read_csv("store_data.csv", header=None)


# print(dataset)

records = []
for i in range(len(dataset)):
    # Each row is one transaction; drop the NaN padding of shorter rows
    items = dataset.iloc[i].dropna()
    records.append([str(item) for item in items])
# print(records)

association_rules = apriori(
records, min_support=0.005, min_confidence=0.2, min_lift=3,
min_length=2
)
association_results = list(association_rules)

for item in association_results:
    # print(item)
    # print(item[2])
    # print(item[2][0])
    # item[2][0] is the first ordered statistic: (items_base, items_add, confidence, lift)
    print(list(item[2][0][0]), "->", list(item[2][0][1]))

['mushroom cream sauce'] -> ['escalope']


['pasta'] -> ['escalope']
['herb & pepper'] -> ['ground beef']
['tomato sauce'] -> ['ground beef']
['whole wheat pasta'] -> ['olive oil']
['pasta'] -> ['shrimp']
['frozen vegetables', 'chocolate'] -> ['shrimp']
['spaghetti', 'frozen vegetables'] -> ['ground beef']
['mineral water', 'shrimp'] -> ['frozen vegetables']
['spaghetti', 'frozen vegetables'] -> ['olive oil']
['spaghetti', 'frozen vegetables'] -> ['shrimp']
['spaghetti', 'frozen vegetables'] -> ['tomatoes']
['spaghetti', 'grated cheese'] -> ['ground beef']
['herb & pepper', 'mineral water'] -> ['ground beef']
['spaghetti', 'herb & pepper'] -> ['ground beef']
['ground beef', 'shrimp'] -> ['spaghetti']
['spaghetti', 'milk'] -> ['olive oil']
['mineral water', 'soup'] -> ['olive oil']
['pancakes', 'spaghetti'] -> ['olive oil']
