
T2_Machine Learning _NOTE_VHA

June 4, 2024

1 Unit-4: Introduction to Machine Learning with Python

2 For data cleaning, refer to the pandas notes (Unit-1)
[1]: import pandas as pd

# Defining features
room_length = [18, 20, 10, 12, 18, 11]
room_breadth = [20, 20, 10, 11, 19, 10]
room_type = ['Big', 'Big', 'Normal', 'Normal', 'Big', 'Normal']

[2]: # Creating a data frame
data = pd.DataFrame({'Length': room_length, 'Breadth': room_breadth,
                     'Type': room_type})
data

[2]:    Length  Breadth    Type
     0      18       20     Big
     1      20       20     Big
     2      10       10  Normal
     3      12       11  Normal
     4      18       19     Big
     5      11       10  Normal

[3]: # Adding a feature called Area, derived from Length and Breadth
data['Area'] = data['Length'] * data['Breadth']
data

[3]:    Length  Breadth    Type  Area
     0      18       20     Big   360
     1      20       20     Big   400
     2      10       10  Normal   100
     3      12       11  Normal   132
     4      18       19     Big   342
     5      11       10  Normal   110

[4]: import pandas as pd

# Creating features
age = [18, 20, 23, 19, 18, 22]
city = ['City A', 'City B', 'City B', 'City A', 'City C', 'City B']

# Creating a data frame
data1 = pd.DataFrame({'age': age, 'city': city})

# The get_dummies function of the pandas library can be used to
# dummy-code categorical variables.
df = pd.get_dummies(data=data1, drop_first=True)

[5]: df

[5]:    age  city_City B  city_City C
     0   18        False        False
     1   20         True        False
     2   23         True        False
     3   19        False        False
     4   18        False         True
     5   22         True        False
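Note that recent pandas versions return boolean dummy columns, as seen above. If 0/1 integers
are preferred, get_dummies accepts a dtype parameter. A minimal sketch, reusing the data1
frame defined above:

# Sketch: request integer (0/1) dummies instead of booleans
df_int = pd.get_dummies(data=data1, drop_first=True, dtype=int)
df_int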

3 Transforming numeric (continuous) features to categorical features
[8]: # Defining features
apartment_area = [4720, 2430, 4368, 3969, 6142, 7912]
apartment_price = [2360000, 1215000, 2184000, 1984500, 3071000, 3956000]

# Creating a data frame
data4 = pd.DataFrame({'Area': apartment_area, 'Price': apartment_price})
data4

[8]:    Area    Price
     0  4720  2360000
     1  2430  1215000
     2  4368  2184000
     3  3969  1984500
     4  6142  3071000
     5  7912  3956000

[9]: import numpy as np

# Bin prices into three categories: High (> 3,000,000),
# Low (< 2,000,000), and Medium (in between)
data4['Price'] = np.where(data4['Price'] > 3000000, 'High',
                          np.where(data4['Price'] < 2000000, 'Low', 'Medium'))
data4

[9]:    Area   Price
     0  4720  Medium
     1  2430     Low
     2  4368  Medium
     3  3969     Low
     4  6142    High
     5  7912    High
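An alternative to nested np.where calls is pandas' cut function, which bins a continuous
column by explicit edges. A minimal sketch using the same thresholds as cell [9] (boundary
handling differs slightly at the exact edge values, since cut's intervals are right-closed
by default):

# Sketch: bin Price with pd.cut instead of nested np.where
data5 = pd.DataFrame({'Area': apartment_area, 'Price': apartment_price})
data5['PriceBand'] = pd.cut(data5['Price'],
                            bins=[-np.inf, 2000000, 3000000, np.inf],
                            labels=['Low', 'Medium', 'High'])
data5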

4 Units 5 and 6: Supervised Machine Learning

5 Simple Linear Regression


Simple linear regression models the relationship between two variables by fitting a straight line to
the data. It predicts the value of a dependent variable y based on the value of an independent
variable x.
The equation is: y = β0 + β1·x, where β0 is the intercept and β1 is the slope.
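Before reaching for scikit-learn, it helps to see that the fitted line has a closed form:
β1 = cov(x, y) / var(x) and β0 = ȳ − β1·x̄. A small self-contained sketch with made-up data
(the numbers below are illustrative, not from the Boston dataset):

import numpy as np

# Hypothetical data: rooms vs. price
x = np.array([4, 5, 6, 7, 8], dtype=float)
y = np.array([12.0, 17.0, 20.0, 26.0, 30.0])

# Closed-form least-squares estimates
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope
b0 = y.mean() - b1 * x.mean()                    # intercept
print(f"y = {b1:.2f}x + {b0:.2f}")               # y = 4.50x + -6.00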
Example using the Boston Housing Dataset
We’ll use only one feature (average number of rooms per dwelling, RM) for simplicity.

6 The Boston Housing Dataset


The Boston Housing Dataset is derived from information collected by the U.S. Census Service
concerning housing in the area of Boston, MA. The following describes the dataset columns:
• CRIM - per capita crime rate by town
• ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
• INDUS - proportion of non-retail business acres per town.
• CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
• NOX - nitric oxides concentration (parts per 10 million)
• RM - average number of rooms per dwelling
• AGE - proportion of owner-occupied units built prior to 1940
• DIS - weighted distances to five Boston employment centres
• RAD - index of accessibility to radial highways
• TAX - full-value property-tax rate per $10,000
• PTRATIO - pupil-teacher ratio by town
• B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
• LSTAT - % lower status of the population
• MEDV - Median value of owner-occupied homes in $1000’s

[1]: import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the dataset and drop rows with missing values
df = pd.read_csv("HousingData.csv")
df = df.dropna()
X = df[['RM']]
y = df[['MEDV']]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Initialize and fit the model
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)

# Make predictions
y_pred = linear_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R²: {r2}")
print("coefficient", linear_regressor.coef_)
print("intercept", linear_regressor.intercept_)
print("equation", f"y={linear_regressor.coef_}x+{linear_regressor.intercept_}")

# Plotting the results
plt.scatter(X, y, color="blue")
plt.plot(X_test, y_pred, color="red")
plt.xlabel("Average number of rooms per dwelling (RM)")
plt.ylabel("House Price")
plt.title("Simple Linear Regression")
plt.show()

Mean Squared Error: 43.97149760658451
R²: 0.4786797724382229
coefficient [[9.40002419]]
intercept [-36.95412883]
equation y=[[9.40002419]]x+[-36.95412883]
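Reading off the fitted model: the slope is about 9.40 and the intercept about −36.95, so each
additional room adds roughly 9.4 to the predicted MEDV (which is in $1000's). A quick sketch
using the recorded coefficients, hard-coded here purely for illustration:

# Predicted MEDV for a dwelling with 6 rooms, using the fitted equation
rm = 6
medv_pred = 9.40002419 * rm - 36.95412883
print(medv_pred)  # ≈ 19.45, i.e. about $19,450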

[2]: df.columns

[2]: Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'MEDV'],
dtype='object')

[3]: df.head()

[3]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3 222 18.7

B LSTAT MEDV
0 396.90 4.98 24.0
1 396.90 9.14 21.6
2 392.83 4.03 34.7
3 394.63 2.94 33.4
5 394.12 5.21 28.7

7 Multiple Linear Regression


• Multiple linear regression models the relationship between a dependent variable and multiple
independent variables.
• The equation is: y = β0 + β1x1 + β2x2 + … + βnxn
• Example using the Boston Housing Dataset
• We’ll use multiple features to predict the house prices.

[4]: import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the dataset and drop rows with missing values
df = pd.read_csv("HousingData.csv")
df = df.dropna()
y = df[['MEDV']]
X = df.drop("MEDV", axis=1)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Initialize and fit the model
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)

# Make predictions
y_pred = linear_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R²: {r2}")
print("coefficient", linear_regressor.coef_)
print("intercept", linear_regressor.intercept_)

# Print the fitted equation term by term (the intercept is added, not equated)
print("y = ", end=" ")
n = 0
for i in linear_regressor.coef_[0]:
    n = n + 1
    print(i, "x", n, end=" ")
print("+", linear_regressor.intercept_[0])

Mean Squared Error: 31.45404766495098
R²: 0.6270849941673178
coefficient [[-1.12187394e-01 4.24404148e-02 2.56728238e-02 1.98383708e+00
-1.70792571e+01 4.25809072e+00 -2.17413906e-02 -1.42418883e+00
2.35587949e-01 -1.19971379e-02 -9.75834850e-01 9.59377961e-03
-3.88619588e-01]]
intercept [33.65240504]
y = -0.11218739411170192 x 1 0.04244041483285869 x 2 0.025672823789500122 x 3
1.983837084104812 x 4 -17.07925707304045 x 5 4.258090716067691 x 6
-0.021741390648340866 x 7 -1.4241888342746127 x 8 0.23558794897000693 x 9
-0.01199713790147483 x 10 -0.9758348496021871 x 11 0.009593779607598565 x 12
-0.3886195878856782 x 13 + 33.65240504056575

8 Polynomial Regression on Boston Housing Data


• Polynomial regression is an extension of linear regression where the relationship between the
independent variable x and the dependent variable y is modeled as an nth-degree polynomial.
This allows for more complex relationships between the variables.
• The polynomial regression model is represented as: y = β0 + β1x + β2x² + … + βnxⁿ
• Steps to Perform Polynomial Regression on the Boston Housing Data
1. Load the Dataset: Read the Boston Housing data from a CSV file.
2. Data Preprocessing: Handle any missing values and split the data into features (X) and target
(y).
3. Feature Engineering: Create polynomial features from the original features.
4. Model Training: Train a polynomial regression model using the transformed features.
5. Model Evaluation: Evaluate the model’s performance using metrics such as mean squared
error (MSE) and R² score.

[5]: import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the dataset and drop rows with missing values
boston = pd.read_csv("HousingData.csv")
boston = boston.dropna()

# Display the first few rows of the dataset
print(boston.head())

# Separate features and target variable
X = boston.drop('MEDV', axis=1)  # 'MEDV' is the target variable
y = boston['MEDV']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Transform features into polynomial features (degree 2)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Initialize and train the model
poly_regressor = LinearRegression()
poly_regressor.fit(X_train_poly, y_train)

# Make predictions
y_train_pred = poly_regressor.predict(X_train_poly)
y_test_pred = poly_regressor.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print(f"Training Mean Squared Error: {train_mse}")
print(f"Testing Mean Squared Error: {test_mse}")
print(f"Training R²: {train_r2}")
print(f"Testing R²: {test_r2}")

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3 222 18.7

B LSTAT MEDV
0 396.90 4.98 24.0
1 396.90 9.14 21.6
2 392.83 4.03 34.7
3 394.63 2.94 33.4
5 394.12 5.21 28.7
Training Mean Squared Error: 55.2252324506448
Testing Mean Squared Error: 95.57171547225752
Training R²: 0.33036607859294875
Testing R²: -0.13308554792106753
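The testing R² here is negative, meaning the degree-2 model predicts unseen data worse than
simply predicting the mean of y: a sign of overfitting. A small sketch comparing degrees side
by side (it reuses X_train, X_test, y_train, y_test and the imports from cell [5], and assumes
the same HousingData.csv file):

# Sketch: compare polynomial degrees on the same train/test split
for degree in [1, 2, 3]:
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression()
    model.fit(poly.fit_transform(X_train), y_train)
    r2 = r2_score(y_test, model.predict(poly.transform(X_test)))
    print(f"degree={degree}: test R² = {r2:.3f}")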

9 K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a classification algorithm that assigns the class of a data point based
on the classes of its k nearest neighbors.
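Conceptually, KNN computes the distance from a query point to every training point, takes the
k closest, and lets them vote. A minimal from-scratch sketch with made-up 2-D points (purely
illustrative; the scikit-learn version on the iris data follows below):

import numpy as np
from collections import Counter

# Hypothetical training data: four 2-D points with binary labels
X_train_toy = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train_toy = np.array([0, 0, 1, 1])

def knn_predict(query, k=3):
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train_toy - query, axis=1)
    # Labels of the k nearest neighbors
    nearest = y_train_toy[np.argsort(dists)[:k]]
    # Majority vote
    return Counter(nearest).most_common(1)[0][0]

print(knn_predict(np.array([1.2, 1.5])))  # expected: 0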

[6]: from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, \
    recall_score, classification_report

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Initialize and fit the model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')

# Specificity calculation
# Caveat: tn/fp/fn/tp unpacking is only well-defined for a 2x2 (binary)
# confusion matrix; taking the first four entries of this 3x3 matrix is a
# rough shortcut, not a true multiclass specificity.
tn, fp, fn, tp = conf_matrix.ravel()[:4]
specificity = tn / (tn + fp)

print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Accuracy: {accuracy}")
print(f"Error Rate: {error_rate}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"Specificity: {specificity}")
print("\nClassification Report:\n",
      classification_report(y_test, y_pred, target_names=iris.target_names))

Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
Accuracy: 1.0
Error Rate: 0.0
Precision: 1.0
Recall: 1.0
Specificity: 1.0

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
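For a genuine multiclass specificity, each class can be treated one-vs-rest: for class i, the
true negatives are all confusion-matrix entries outside row i and column i. A sketch reusing
conf_matrix and iris from the cell above:

import numpy as np

# Per-class (one-vs-rest) specificity from a multiclass confusion matrix
cm = np.asarray(conf_matrix)
total = cm.sum()
for i, name in enumerate(iris.target_names):
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp   # predicted class i, but actually another class
    fn = cm[i, :].sum() - tp   # actually class i, but predicted another class
    tn = total - tp - fp - fn  # everything outside row i and column i
    print(f"{name}: specificity = {tn / (tn + fp):.3f}")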

10 Example using the Breast Cancer Dataset


• We’ll use the Breast Cancer dataset from scikit-learn to demonstrate binary classification
using KNN. This dataset contains features computed from digitized images of a breast mass,
with the target variable indicating whether the mass is malignant or benign.
• Steps to Perform KNN Binary Classification
1. Load the Dataset: Load the dataset and understand its structure.
2. Data Preprocessing: Split the data into features (X) and target (y) and then into training
and testing sets.
3. Model Training: Train the KNN model with the training data.
4. Model Prediction: Predict the target values for the test data.
5. Model Evaluation: Evaluate the model’s performance using a confusion matrix, accuracy,
error rate, precision, recall, and specificity.

[7]: import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, \
    recall_score, classification_report

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Initialize and train the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# Specificity calculation (valid here: the confusion matrix is 2x2)
tn, fp, fn, tp = conf_matrix.ravel()
specificity = tn / (tn + fp)

print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Accuracy: {accuracy}")
print(f"Error Rate: {error_rate}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"Specificity: {specificity}")
print("\nClassification Report:\n",
      classification_report(y_test, y_pred, target_names=data.target_names))

Confusion Matrix:
[[38 5]
[ 0 71]]
Accuracy: 0.956140350877193
Error Rate: 0.04385964912280704
Precision: 0.9342105263157895
Recall: 1.0
Specificity: 0.8837209302325582

Classification Report:
              precision    recall  f1-score   support

   malignant       1.00      0.88      0.94        43
      benign       0.93      1.00      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.94      0.95       114
weighted avg       0.96      0.96      0.96       114
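The headline metrics can be checked directly against the 2x2 confusion matrix above
(tn=38, fp=5, fn=0, tp=71). A quick verification sketch with those recorded values hard-coded:

tn, fp, fn, tp = 38, 5, 0, 71  # values read off the confusion matrix above
print("accuracy:   ", (tp + tn) / (tp + tn + fp + fn))  # 0.9561...
print("precision:  ", tp / (tp + fp))                   # 0.9342...
print("recall:     ", tp / (tp + fn))                   # 1.0
print("specificity:", tn / (tn + fp))                   # 0.8837...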

11 Decision Tree with Entropy

A decision tree is a model that makes decisions by splitting the data into subsets based on
feature values. Entropy measures the impurity or randomness in the data; a decision tree aims
to minimize entropy at each split.
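For a node with class proportions p1, …, pk, entropy is H = −Σ pi·log2(pi); it is 0 for a pure
node and maximal for a uniform mix. A small sketch computing this for a few hypothetical label
sets:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 0, 0]))  # 0.0    (pure node)
print(entropy([0, 0, 1, 1]))  # 1.0    (50/50 split)
print(entropy([0, 0, 0, 1]))  # ~0.811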
[8]: from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, \
    recall_score, classification_report
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Initialize and fit the model (entropy as the split criterion)
decision_tree = DecisionTreeClassifier(criterion='entropy', random_state=42)
decision_tree.fit(X_train, y_train)

# Make predictions
y_pred = decision_tree.predict(X_test)

# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')

# Specificity calculation (same binary-only caveat as in the KNN example)
tn, fp, fn, tp = conf_matrix.ravel()[:4]
specificity = tn / (tn + fp)

print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Accuracy: {accuracy}")
print(f"Error Rate: {error_rate}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"Specificity: {specificity}")
print("\nClassification Report:\n",
      classification_report(y_test, y_pred, target_names=iris.target_names))

# Plotting the decision tree
plt.figure(figsize=(20, 10))
plot_tree(decision_tree, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()

Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
Accuracy: 1.0
Error Rate: 0.0
Precision: 1.0
Recall: 1.0
Specificity: 1.0

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

12 Summary
• Simple Linear Regression: Models the relationship between two variables by fitting a straight
line.
• Multiple Linear Regression: Models the relationship between one dependent variable and
multiple independent variables.
• Polynomial Regression: Extends linear regression with polynomial terms to capture non-linear
relationships.
• K-Nearest Neighbors (KNN): Classifies a data point based on the classes of its nearest neigh-
bors.
• Decision Tree with Entropy: Splits data into subsets to minimize impurity or randomness.
