Name: Muhammad Sarfraz
Seat: EP1850086
Section: A
Course Code: 514
Course Name: Data Warehousing and Data Mining
LAB 01 : CONDITIONS
Write an if-else statement in Python that checks whether the student is enrolled in 2 or 3 subjects with an extra certification.
In [6]:
subjectFee = 1000
certificationFee = 700
noOfSubjects = 3
noOfCertifications = 2
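The if-else itself did not survive the export; a minimal sketch, assuming the intended output is the total fee for a student enrolled in 2 or 3 subjects plus the extra certifications:

# Sketch of the missing if-else (assumed logic; the original cell was truncated).
if noOfSubjects == 2 or noOfSubjects == 3:
    totalFee = noOfSubjects * subjectFee + noOfCertifications * certificationFee
    print('Enrolled with extra certification. Total fee:', totalFee)
else:
    print('Student must be enrolled in 2 or 3 subjects.')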
LAB 02 : LOOPS
Initialize a list of 10 passwords. If a password exceeds 500, break the loop and print "Password cannot be greater than 500"; otherwise print each password on a new line with the message "Your new password".
In [2]:
passwords = [121,55,86,1,147,635,98,63,453,100]
for p in passwords:
    if p > 500:
        print('Password cannot be greater than 500')
        break
    else:
        print('Your new password : ', p)
LAB 03 : DATAFRAMES
import numpy as np
import pandas as pd
In [12]:
df = pd.DataFrame(np.random.randn(4,3), index=['a','b','c','d'], columns=['one','two','three'])
In [13]:
df
Out[13]:
(4x3 DataFrame of random values; rows a-d, columns one, two, three)
In [14]:
df['one']
Out[14]:
a    1.968427
b    0.545311
c   -1.270482
d   -1.487337
Name: one, dtype: float64
In [15]:
df.loc['a']
Out[15]:
one 1.968427
two 0.360732
three 0.526789
Name: a, dtype: float64
In [16]:
df = df.reindex(['a','b','c','d','e'])
In [17]:
df
Out[17]:
(rows a-d unchanged; new row e is all NaN)
In [18]:
df.fillna('0')
Out[18]:
e 0 0 0
In [19]:
df
Out[19]:
(row e is still NaN; df.fillna above returned a copy and was not assigned back)
In [20]:
df = df.fillna('0')
In [21]:
df
Out[21]:
e 0 0 0
In [22]:
df = df.reindex(columns=['one','two','three','four','fiver'])
In [23]:
df
Out[23]:
e 0 0 0 NaN NaN
In [24]:
df= df.fillna(1)
In [25]:
df
Out[25]:
e 0 0 0 1.0 1.0
In [29]:
df =df.rename(columns={'fiver':'five'})
In [30]:
df
Out[30]:
e 0 0 0 1.0 1.0
In [31]:
In [ ]:
In [ ]:
In [41]:
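The body of this cell did not survive extraction; judging from the output below, it presumably created the frame along these lines (a sketch, not the original code):

# Reconstruction of the lost cell, matching the Out[42] display below.
data_frame = pd.DataFrame({'A': [1, 2, 3],
                           'B': [4, 5, 6],
                           'C': [7, 8, 9]})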
In [42]:
data_frame
Out[42]:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
In [43]:
data_frame = data_frame.reindex(columns=['A','B','C','D','E'])
In [44]:
data_frame
Out[44]:
A B C D E
0 1 4 7 NaN NaN
1 2 5 8 NaN NaN
2 3 6 9 NaN NaN
In [46]:
for i in data_frame:
print(data_frame[i])
0    1
1    2
2    3
Name: A, dtype: int64
0    4
1    5
2    6
Name: B, dtype: int64
0    7
1    8
2    9
Name: C, dtype: int64
0 NaN
1 NaN
2 NaN
Name: D, dtype: float64
0 NaN
1 NaN
2 NaN
Name: E, dtype: float64
In [47]:
for i in data_frame:
print(data_frame[i].isnull())
0    False
1    False
2    False
Name: A, dtype: bool
0    False
1    False
2    False
Name: B, dtype: bool
0    False
1    False
2    False
Name: C, dtype: bool
0    True
1    True
2    True
Name: D, dtype: bool
0    True
1    True
2    True
Name: E, dtype: bool
In [ ]:
LAB 04 : LINEAR REGRESSION
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [3]:
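The data-loading cell is empty in the export; the values below match the classic ex1 population/profit dataset, so it was presumably something like this (file name assumed):

# Assumed source file; only the resulting frame survived in the export.
data = pd.read_csv('ex1data1.txt', header=None, names=['population', 'profit'])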
In [4]:
In [5]:
data
Out[5]:
population profit
0 6.1101 17.59200
1 5.5277 9.13020
2 8.5186 13.66200
3 7.0032 11.85400
4 5.8598 6.82330
... ... ...
92 5.8707 7.20290
93 5.3054 1.98690
94 8.2934 0.14454
95 13.3940 9.05510
96 5.4369 0.61705
97 rows × 2 columns
In [6]:
X_df = pd.DataFrame(data.population)
y_df = pd.DataFrame(data.profit)
m = len(y_df)
In [7]:
X_df
Out[7]:
population
0 6.1101
1 5.5277
2 8.5186
3 7.0032
4 5.8598
... ...
92 5.8707
93 5.3054
94 8.2934
95 13.3940
96 5.4369
97 rows × 1 columns
In [8]:
plt.figure(figsize=(10,8))
plt.plot(X_df, y_df, 'kx')
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
Out[8]:
(scatter plot: Population of City in 10,000s vs. Profit in $10,000s)
In [9]:
iter = 1000
alpha = 0.01
In [10]:
X_df['intercept'] = 1
In [11]:
X = np.array(X_df)
y = np.array(y_df).flatten()
theta = np.array([0, 0])
In [12]:
def cost_function(X, y, theta):
    # Squared-error cost. Only "return J" survived extraction; the body is
    # reconstructed to match the standard formulation and the output below.
    m = len(y)
    J = np.sum((X.dot(theta) - y) ** 2) / (2 * m)
    return J
In [13]:
cost_function(X, y, theta)
Out[13]:
32.072733877455676
In [14]:
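The cell defining gradient_descent was likewise lost; a minimal sketch consistent with how it is called below (batch gradient descent returning the optimized parameters) could be:

def gradient_descent(X, y, theta, alpha, iterations):
    # A reconstruction, not the original cell: repeated batch updates.
    m = len(y)
    for _ in range(iterations):
        theta = theta - (alpha / m) * X.T.dot(X.dot(theta) - y)
    return theta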
In [28]:
gd = gradient_descent(X,y,theta,alpha, iter)
In [16]:
print(theta)   # still [0 0]; gradient_descent returned a new array into gd
[0 0]
In [17]:
In [ ]:
In [ ]:
Search for a dataset suitable for linear regression and apply the same algorithm to it. Print the optimized parameters and the visualizations and attach them in your file. Also attach the code for this part in your file.
In [18]:
data = pd.read_csv('../exam_result.csv')
In [19]:
data.head()
Out[19]:
SAT GPA
0 1714 2.40
1 1664 2.52
2 1760 2.54
3 1685 2.74
4 1693 2.83
In [20]:
X_df = pd.DataFrame(data.SAT)
y_df = pd.DataFrame(data.GPA)
m = len(y_df)
In [21]:
plt.figure(figsize=(10,8))
plt.plot(X_df, y_df, 'kx')
plt.xlabel('Score of SAT')
plt.ylabel('Obtained GPA')
Out[21]:
(scatter plot: Score of SAT vs. Obtained GPA)
In [22]:
iter = 1000
alpha = 0.01
In [23]:
X_df['intercept'] = 1
In [24]:
X = np.array(X_df)
y = np.array(y_df).flatten()
theta = np.array([0, 0])
In [25]:
cost_function(X, y, theta)
Out[25]:
5.581691666666667
In [29]:
gd = gradient_descent(X,y,theta,alpha, iter)
In [27]:
In [ ]:
LAB 05 : NAIVE BAYES
In [1]:
In [2]:
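The import and training-data cells are empty in the export; a hypothetical setup consistent with the prediction below (classes 3 and 4) could be:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training data (the original values were lost); chosen only so
# that predict([[1,2],[3,4]]) plausibly yields [3 4] as shown below.
x = np.array([[1, 2], [2, 2], [3, 4], [4, 4]])
y = np.array([3, 3, 4, 4])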
In [8]:
model = GaussianNB()
model.fit(x, y)   # cell body reconstructed; only the repr below survived extraction
Out[8]:
GaussianNB()
In [13]:
#Predict Output
predicted= model.predict([[1,2],[3,4]])
print (predicted)
[3 4]
In [ ]:
In [ ]:
Convert the "Play Tennis" example discussed in class into numeric form and initialize X and y values based on that example.
Now run the code for the new X values as discussed in class and print the output.
Attach the code and output in your file.
In [18]:
# 0 - Overcast
# 1 - Sunny
# 2 - Rainy
X_data = np.array([[1,0],[0,1],[2,1],[1,1],[1,1],[0,1],[2,0],[2,0],[1,1],[2,1],[1,0],[0,1],[0,1],[2,0]])
In [20]:
Y_data = np.array([0,0,1,1,1,0,1,0,1,1,1,1,1,0])
In [23]:
model = GaussianNB()
model.fit(X_data, Y_data)
Out[23]:
GaussianNB()
In [28]:
predicted= model.predict([[2,0],[2,1],[2,2]])
print (predicted)
[0 1 1]
In [ ]:
LAB 06 : DECISION TREES
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
In [13]:
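The loading cell is missing; the B/R class labels and 1-5 attribute values below match the UCI Balance Scale dataset, so it was presumably along these lines (path assumed):

# Assumed file path; only the head() output survived in the export.
balance_data = pd.read_csv('balance-scale.data', header=None)
balance_data.head()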
0 1 2 3 4
0 B 1 1 1 1
1 R 1 1 1 2
2 R 1 1 1 3
3 R 1 1 1 4
4 R 1 1 1 5
In [14]:
In [15]:
X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]
In [18]:
In [19]:
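The bodies of the two cells above (train/test split and classifier fit) were lost in the export; a sketch consistent with the predictions below, with parameter values assumed:

# Assumed split and tree parameters; only the downstream results survived.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)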
In [20]:
y_pred_en = clf_entropy.predict(X_test)
print(y_pred_en)
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L'
'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L'
'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L'
'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L'
'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']
In [21]:
print('Accuracy is', accuracy_score(y_test, y_pred_en) * 100)   # cell body reconstructed; only the printed value survived
Accuracy is 70.74468085106383
In [22]:
In [23]:
In [ ]:
Apply the same code to any other dataset from the UCI Machine Learning Repository and write down the outputs (accuracy, the tree, and its visualization).
In [149]:
machine_data = pd.read_csv('../machine.data',header=None)
In [150]:
machine_data.head()
Out[150]:
(head of the dataset: 10 columns, 0-9; the row values did not survive extraction)
In [151]:
In [152]:
X = machine_data.values[:, 2:3]
Y = machine_data.values[:,0]
In [153]:
In [155]:
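As above, the split-and-fit cells are missing here; presumably the same pattern was applied to the new feature slice (a sketch with assumed parameters):

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)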
In [156]:
y_pred_en = clf_entropy.predict(X_test)
print(y_pred_en)
In [157]:
In [158]:
Image(filename='lab_06_2.PNG')
Out[158]:
LAB 07 : EVALUATION METRICS
In [13]:
X_actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
Y_predic = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]
In [14]:
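The cell body is missing; the matrix below matches sklearn's confusion_matrix on these lists, so the cell was presumably:

from sklearn.metrics import confusion_matrix

print('Confusion Matrix :')
print(confusion_matrix(X_actual, Y_predic))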
Confusion Matrix :
[[3 3]
[1 3]]
In [15]:
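The body of this cell was also lost; the surviving numbers are consistent with the standard sklearn metrics, roughly:

from sklearn.metrics import classification_report, roc_auc_score, log_loss

print(classification_report(X_actual, Y_predic))
print('AUC-ROC:', roc_auc_score(X_actual, Y_predic))
print('LOGLOSS Value is', log_loss(X_actual, Y_predic))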
              precision    recall  f1-score   support

           0       0.75      0.50      0.60         6
           1       0.50      0.75      0.60         4

    accuracy                           0.60        10
   macro avg       0.62      0.62      0.60        10
weighted avg       0.65      0.60      0.60        10

AUC-ROC: 0.625
LOGLOSS Value is 13.815750437193334
In [ ]:
Task
We have a confusion matrix that indicates the number of cancer patients tested and the actual outcomes. Write Python code to calculate the classification accuracy and the classification report for the given data.
In [638]:
X_actual = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0]
In [639]:
Y_predic = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0,
1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 1, 1, 0, 0, 0,
1, 1, 0, 1, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0]
In [640]:
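The cell body is missing; presumably the same confusion-matrix call as above:

from sklearn.metrics import confusion_matrix

print('Confusion Matrix :')
print(confusion_matrix(X_actual, Y_predic))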
Confusion Matrix :
[[ 50 10]
[ 5 100]]
In [641]:
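The final cell (the actual task answer) was lost; a sketch of the requested accuracy and classification report:

from sklearn.metrics import accuracy_score, classification_report

print('Accuracy :', accuracy_score(X_actual, Y_predic))
print(classification_report(X_actual, Y_predic))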
LAB 09 : K-Means
In [19]:
In [20]:
In [21]:
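The setup cells above are empty in the export; a typical sketch for this four-cluster demo (imports and synthetic data generation assumed):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed synthetic data; the original generation cell did not survive.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)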
In [22]:
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
In [23]:
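The plotting cell is empty in the export; the usual visualization for this example would be along these lines:

# Plot the points colored by assigned cluster, with cluster centers overlaid.
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()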
In [ ]:
Advantages of k-means
In [47]:
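The data cell for this ten-cluster run is empty; if it followed the common digits example, it would be something like this (an assumption):

from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data   # assumed input for the 10-cluster run below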
In [48]:
kmeans = KMeans(n_clusters=10)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
In [52]:
Lab 10 : Hierarchical Clustering
In [15]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc
%matplotlib inline
In [16]:
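The data-loading cell is missing; given the Milk and Grocery columns used later, this is presumably the Wholesale customers dataset (file name assumed):

data = pd.read_csv('Wholesale customers data.csv')   # assumed file name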
In [17]:
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()
Out[17]:
(first five rows of the normalized data)
In [18]:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
In [19]:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.axhline(y=6, color='r', linestyle='--')
Out[19]:
<matplotlib.lines.Line2D at 0x233621b01c0>
In [20]:
cluster = AgglomerativeClustering(n_clusters=2,affinity='euclidean',linkage='ward')
cluster.fit_predict(data_scaled)
plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)
Out[20]:
<matplotlib.collections.PathCollection at 0x23362332850>
In [ ]:
LAB 12 : PCA
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
# import warnings
# warnings.filterwarnings("ignore")
In [3]:
m_data = pd.read_csv('../mushrooms.csv')
In [4]:
m_data.head()
Out[4]:
(head of the mushroom dataset: one-letter categorical codes; columns include class,
cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing,
gill-size, gill-color, ..., stalk-surface-below-ring, ...; class values of the
first five rows are p, e, e, p, e)
5 rows × 23 columns
In [5]:
encoder = LabelEncoder()
# Now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])

X_features = m_data.iloc[:, 1:23]
y_label = m_data.iloc[:, 0]
In [6]:
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)
In [7]:
# Visualize
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()
In [8]:
pca2 = PCA(n_components=17)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data['class'])
plt.show()
In [ ]:
In [ ]:
In [11]:
m_data = pd.read_csv('../breast-cancer-wisconsin.data',header=None)
In [12]:
m_data.head()
Out[12]:
0 1 2 3 4 5 6 7 8 9 10
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2
In [13]:
encoder = LabelEncoder()
# Now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])

X_features = m_data.iloc[:, 1:23]   # this frame has 11 columns, so the slice takes columns 1-10
y_label = m_data.iloc[:, 0]
In [16]:
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)
In [24]:
# Visualize
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_

plt.figure(figsize=(8, 6))
plt.bar(range(10), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()
In [33]:
pca2 = PCA(n_components=10)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data[0])
plt.show()