Name: Muhammad Sarfraz
Seat: EP1850086
Section: A
Course Code: 514
Course Name: Data Warehousing and Data Mining
LAB 01 : CONDITIONS
Write an if-else statement in Python that checks whether the student is enrolled in 2 or 3 subjects with an extra certification.
In [6]:
subjectFee = 1000
certificationFee = 700
noOfSubjects = 3
noOfCertifications = 2
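The if-else itself did not survive the export; a minimal sketch, assuming the intended output is the total fee for a student enrolled in 2 or 3 subjects plus the extra certifications:

# Sketch of the missing if-else (assumed logic; the original cell was truncated).
if noOfSubjects == 2 or noOfSubjects == 3:
    totalFee = noOfSubjects * subjectFee + noOfCertifications * certificationFee
    print('Enrolled with extra certification. Total fee:', totalFee)
else:
    print('Student must be enrolled in 2 or 3 subjects.')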
LAB 02 : LOOPS
Initialize a list of 10 passwords. If a password exceeds 500, break the loop and print "Password cannot be greater than 500"; otherwise print each password on a new line with the message "Your new password".
In [2]:
passwords = [121,55,86,1,147,635,98,63,453,100]
for p in passwords:
    if p > 500:
        print('Password cannot be greater than 500')
        break
    else:
        print('Your new password : ', p)
LAB 03 : DATAFRAMES
import numpy as np
import pandas as pd
In [12]:
df = pd.DataFrame(np.random.randn(4,3), index=['a','b','c','d'], columns=['one','two','three'])
In [13]:
df
Out[13]:
(4x3 DataFrame of random values; rows a-d, columns one, two, three)
In [14]:
df['one']
Out[14]:
a    1.968427
b    0.545311
c   -1.270482
d   -1.487337
Name: one, dtype: float64
In [15]:
df.loc['a']
Out[15]:
one 1.968427
two 0.360732
three 0.526789
Name: a, dtype: float64
In [16]:
df = df.reindex(['a','b','c','d','e'])
In [17]:
df
Out[17]:
(rows a-d unchanged; new row e is all NaN)
In [18]:
df.fillna('0')
Out[18]:
e 0 0 0
In [19]:
df
Out[19]:
(row e is still NaN; df.fillna above returned a copy and was not assigned back)
In [20]:
df = df.fillna('0')
In [21]:
df
Out[21]:
e 0 0 0
In [22]:
df = df.reindex(columns=['one','two','three','four','fiver'])
In [23]:
df
Out[23]:
e 0 0 0 NaN NaN
In [24]:
df= df.fillna(1)
In [25]:
df
Out[25]:
e 0 0 0 1.0 1.0
In [29]:
df =df.rename(columns={'fiver':'five'})
In [30]:
df
Out[30]:
e 0 0 0 1.0 1.0
In [31]:
In [ ]:
In [ ]:
In [41]:
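The body of this cell did not survive extraction; judging from the output below, it presumably created the frame along these lines (a sketch, not the original code):

# Reconstruction of the lost cell, matching the Out[42] display below.
data_frame = pd.DataFrame({'A': [1, 2, 3],
                           'B': [4, 5, 6],
                           'C': [7, 8, 9]})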
In [42]:
data_frame
Out[42]:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
In [43]:
data_frame = data_frame.reindex(columns=['A','B','C','D','E'])
In [44]:
data_frame
Out[44]:
A B C D E
0 1 4 7 NaN NaN
1 2 5 8 NaN NaN
2 3 6 9 NaN NaN
In [46]:
for i in data_frame:
print(data_frame[i])
0    1
1    2
2    3
Name: A, dtype: int64
0    4
1    5
2    6
Name: B, dtype: int64
0    7
1    8
2    9
Name: C, dtype: int64
0 NaN
1 NaN
2 NaN
Name: D, dtype: float64
0 NaN
1 NaN
2 NaN
Name: E, dtype: float64
In [47]:
for i in data_frame:
print(data_frame[i].isnull())
0    False
1    False
2    False
Name: A, dtype: bool
0    False
1    False
2    False
Name: B, dtype: bool
0    False
1    False
2    False
Name: C, dtype: bool
0    True
1    True
2    True
Name: D, dtype: bool
0    True
1    True
2    True
Name: E, dtype: bool
In [ ]:
LAB 04 : LINEAR REGRESSION
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [3]:
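The data-loading cell is empty in the export; the values below match the classic ex1 population/profit dataset, so it was presumably something like this (file name assumed):

# Assumed source file; only the resulting frame survived in the export.
data = pd.read_csv('ex1data1.txt', header=None, names=['population', 'profit'])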
In [4]:
In [5]:
data
Out[5]:
population profit
0 6.1101 17.59200
1 5.5277 9.13020
2 8.5186 13.66200
3 7.0032 11.85400
4 5.8598 6.82330
... ... ...
92 5.8707 7.20290
93 5.3054 1.98690
94 8.2934 0.14454
95 13.3940 9.05510
96 5.4369 0.61705
97 rows × 2 columns
In [6]:
X_df = pd.DataFrame(data.population)
y_df = pd.DataFrame(data.profit)
m = len(y_df)
In [7]:
X_df
Out[7]:
population
0 6.1101
1 5.5277
2 8.5186
3 7.0032
4 5.8598
... ...
92 5.8707
93 5.3054
94 8.2934
95 13.3940
96 5.4369
97 rows × 1 columns
In [8]:
plt.figure(figsize=(10,8))
plt.plot(X_df, y_df, 'kx')
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
Out[8]:
(scatter plot: Population of City in 10,000s vs. Profit in $10,000s)
In [9]:
iter = 1000
alpha = 0.01
In [10]:
X_df['intercept'] = 1
In [11]:
X = np.array(X_df)
y = np.array(y_df).flatten()
theta = np.array([0, 0])
In [12]:
def cost_function(X, y, theta):
    # Squared-error cost. Only "return J" survived extraction; the body is
    # reconstructed to match the standard formulation and the output below.
    m = len(y)
    J = np.sum((X.dot(theta) - y) ** 2) / (2 * m)
    return J
In [13]:
cost_function(X, y, theta)
Out[13]:
32.072733877455676
In [14]:
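The cell defining gradient_descent was likewise lost; a minimal sketch consistent with how it is called below (batch gradient descent returning the optimized parameters) could be:

def gradient_descent(X, y, theta, alpha, iterations):
    # A reconstruction, not the original cell: repeated batch updates.
    m = len(y)
    for _ in range(iterations):
        theta = theta - (alpha / m) * X.T.dot(X.dot(theta) - y)
    return theta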
In [28]:
gd = gradient_descent(X,y,theta,alpha, iter)
In [16]:
print(theta)   # still [0 0]; gradient_descent returned a new array into gd
[0 0]
In [17]:
In [ ]:
In [ ]:
Search for a dataset suitable for linear regression and apply the same algorithm to it. Print the optimized parameters and the visualizations and attach them in your file. Also attach the code for this part in your file.
In [18]:
data = pd.read_csv('../exam_result.csv')
In [19]:
data.head()
Out[19]:
SAT GPA
0 1714 2.40
1 1664 2.52
2 1760 2.54
3 1685 2.74
4 1693 2.83
In [20]:
X_df = pd.DataFrame(data.SAT)
y_df = pd.DataFrame(data.GPA)
m = len(y_df)
In [21]:
plt.figure(figsize=(10,8))
plt.plot(X_df, y_df, 'kx')
plt.xlabel('Score of SAT')
plt.ylabel('Obtained GPA')
Out[21]:
(scatter plot: Score of SAT vs. Obtained GPA)
In [22]:
iter = 1000
alpha = 0.01
In [23]:
X_df['intercept'] = 1
In [24]:
X = np.array(X_df)
y = np.array(y_df).flatten()
theta = np.array([0, 0])
In [25]:
cost_function(X, y, theta)
Out[25]:
5.581691666666667
In [29]:
gd = gradient_descent(X,y,theta,alpha, iter)
In [27]:
In [ ]:
LAB 05 : NAIVE BAYES
In [1]:
In [2]:
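The import and training-data cells are empty in the export; a hypothetical setup consistent with the prediction below (classes 3 and 4) could be:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training data (the original values were lost); chosen only so
# that predict([[1,2],[3,4]]) plausibly yields [3 4] as shown below.
x = np.array([[1, 2], [2, 2], [3, 4], [4, 4]])
y = np.array([3, 3, 4, 4])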
In [8]:
model = GaussianNB()
model.fit(x, y)   # cell body reconstructed; only the repr below survived extraction
Out[8]:
GaussianNB()
In [13]:
#Predict Output
predicted= model.predict([[1,2],[3,4]])
print (predicted)
[3 4]
In [ ]:
In [ ]:
Convert the "Play Tennis" example discussed in class into numeric form and initialize X and y values based on that example.
Now run the code for the new X values as discussed in class and print the output.
Attach the code and output in your file.
In [18]:
# 0 - Overcast
# 1 - Sunny
# 2 - Rainy
X_data = np.array([[1,0],[0,1],[2,1],[1,1],[1,1],[0,1],[2,0],[2,0],[1,1],[2,1],[1,0],[0,1],[0,1],[2,0]])
In [20]:
Y_data = np.array([0,0,1,1,1,0,1,0,1,1,1,1,1,0])
In [23]:
model = GaussianNB()
model.fit(X_data, Y_data)
Out[23]:
GaussianNB()
In [28]:
predicted= model.predict([[2,0],[2,1],[2,2]])
print (predicted)
[0 1 1]
In [ ]:
LAB 06 : DECISION TREES
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
In [13]:
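The loading cell is missing; the B/R class labels and 1-5 attribute values below match the UCI Balance Scale dataset, so it was presumably along these lines (path assumed):

# Assumed file path; only the head() output survived in the export.
balance_data = pd.read_csv('balance-scale.data', header=None)
balance_data.head()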
0 1 2 3 4
0 B 1 1 1 1
1 R 1 1 1 2
2 R 1 1 1 3
3 R 1 1 1 4
4 R 1 1 1 5
In [14]:
In [15]:
X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]
In [18]:
In [19]:
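The bodies of the two cells above (train/test split and classifier fit) were lost in the export; a sketch consistent with the predictions below, with parameter values assumed:

# Assumed split and tree parameters; only the downstream results survived.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)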
In [20]:
y_pred_en = clf_entropy.predict(X_test)
print(y_pred_en)
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L'
'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L'
'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L'
'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L'
'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']
In [21]:
print('Accuracy is', accuracy_score(y_test, y_pred_en) * 100)   # cell body reconstructed; only the printed value survived
Accuracy is 70.74468085106383
In [22]:
In [23]:
In [ ]:
Apply the same code to any other dataset from the UCI Machine Learning Repository and write down the outputs (accuracy, the tree, and its visualization).
In [149]:
machine_data = pd.read_csv('../machine.data',header=None)
In [150]:
machine_data.head()
Out[150]:
(head of the dataset: 10 columns, 0-9; the row values did not survive extraction)
In [151]:
In [152]:
X = machine_data.values[:, 2:3]
Y = machine_data.values[:,0]
In [153]:
In [155]:
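As above, the split-and-fit cells are missing here; presumably the same pattern was applied to the new feature slice (a sketch with assumed parameters):

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)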
In [156]:
y_pred_en = clf_entropy.predict(X_test)
print(y_pred_en)
In [157]:
In [158]:
Image(filename='lab_06_2.PNG')
Out[158]:
LAB 07 : EVALUATION METRICS
In [13]:
X_actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
Y_predic = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]
In [14]:
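The cell body is missing; the matrix below matches sklearn's confusion_matrix on these lists, so the cell was presumably:

from sklearn.metrics import confusion_matrix

print('Confusion Matrix :')
print(confusion_matrix(X_actual, Y_predic))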
Confusion Matrix :
[[3 3]
[1 3]]
In [15]:
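The body of this cell was also lost; the surviving numbers are consistent with the standard sklearn metrics, roughly:

from sklearn.metrics import classification_report, roc_auc_score, log_loss

print(classification_report(X_actual, Y_predic))
print('AUC-ROC:', roc_auc_score(X_actual, Y_predic))
print('LOGLOSS Value is', log_loss(X_actual, Y_predic))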
              precision    recall  f1-score   support

           0       0.75      0.50      0.60         6
           1       0.50      0.75      0.60         4

    accuracy                           0.60        10
   macro avg       0.62      0.62      0.60        10
weighted avg       0.65      0.60      0.60        10

AUC-ROC: 0.625
LOGLOSS Value is 13.815750437193334
In [ ]:
Task
We have a confusion matrix that indicates the number of cancer patients tested and the actual outcomes. Write Python code to calculate the classification accuracy and the classification report for the given data.
In [638]:
X_actual = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0]
In [639]:
Y_predic = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0,
1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 1, 1, 0, 0, 0,
1, 1, 0, 1, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0]
In [640]:
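The cell body is missing; presumably the same confusion-matrix call as above:

from sklearn.metrics import confusion_matrix

print('Confusion Matrix :')
print(confusion_matrix(X_actual, Y_predic))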
Confusion Matrix :
[[ 50 10]
[ 5 100]]
In [641]:
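The final cell (the actual task answer) was lost; a sketch of the requested accuracy and classification report:

from sklearn.metrics import accuracy_score, classification_report

print('Accuracy :', accuracy_score(X_actual, Y_predic))
print(classification_report(X_actual, Y_predic))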
LAB 09 : K-Means
In [19]:
In [20]:
In [21]:
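The setup cells above are empty in the export; a typical sketch for this four-cluster demo (imports and synthetic data generation assumed):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed synthetic data; the original generation cell did not survive.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)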
In [22]:
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
In [23]:
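The plotting cell is empty in the export; the usual visualization for this example would be along these lines:

# Plot the points colored by assigned cluster, with cluster centers overlaid.
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()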
In [ ]:
Advantages of k-means
In [47]:
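The data cell for this ten-cluster run is empty; if it followed the common digits example, it would be something like this (an assumption):

from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data   # assumed input for the 10-cluster run below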
In [48]:
kmeans = KMeans(n_clusters=10)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
In [52]:
Lab 10 : Hierarchical Clustering
In [15]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc
%matplotlib inline
In [16]:
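The data-loading cell is missing; given the Milk and Grocery columns used later, this is presumably the Wholesale customers dataset (file name assumed):

data = pd.read_csv('Wholesale customers data.csv')   # assumed file name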
In [17]:
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()
Out[17]:
(first five rows of the normalized data)
In [18]:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
In [19]:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.axhline(y=6, color='r', linestyle='--')
Out[19]:
<matplotlib.lines.Line2D at 0x233621b01c0>
In [20]:
cluster = AgglomerativeClustering(n_clusters=2,affinity='euclidean',linkage='ward')
cluster.fit_predict(data_scaled)
plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)
Out[20]:
<matplotlib.collections.PathCollection at 0x23362332850>
In [ ]:
LAB 12 : PCA
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
# import warnings
# warnings.filterwarnings("ignore")
In [3]:
m_data = pd.read_csv('../mushrooms.csv')
In [4]:
m_data.head()
Out[4]:
(head of the mushroom dataset: one-letter categorical codes; columns include class,
cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing,
gill-size, gill-color, ..., stalk-surface-below-ring, ...; class values of the
first five rows are p, e, e, p, e)
5 rows × 23 columns
In [5]:
encoder = LabelEncoder()
# Now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])

X_features = m_data.iloc[:, 1:23]
y_label = m_data.iloc[:, 0]
In [6]:
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)
In [7]:
# Visualize
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()
In [8]:
pca2 = PCA(n_components=17)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data['class'])
plt.show()
In [ ]:
In [ ]:
In [11]:
m_data = pd.read_csv('../breast-cancer-wisconsin.data',header=None)
In [12]:
m_data.head()
Out[12]:
0 1 2 3 4 5 6 7 8 9 10
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2
In [13]:
encoder = LabelEncoder()
# Now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])

X_features = m_data.iloc[:, 1:23]   # this frame has 11 columns, so the slice takes columns 1-10
y_label = m_data.iloc[:, 0]
In [16]:
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)
In [24]:
# Visualize
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_

plt.figure(figsize=(8, 6))
plt.bar(range(10), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()
In [33]:
pca2 = PCA(n_components=10)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data[0])
plt.show()