
Learning Systems Project Report

Abhilash Kashyap

Anh Khoa Dang

March 2020

Introduction

We use machine learning to understand the data at hand, which helps us build a framework or model for predicting or estimating outputs from the data. Machine learning is divided into supervised and unsupervised learning; the datasets in this project mainly involve supervised learning.

Supervised machine learning has data in the form of inputs and outputs. We use various algorithms to find a mapping between input and output, which lets us estimate or predict the outputs for new input data that is similar to the data originally used for learning the mapping.

The project is divided into two main parts:

 3 datasets for regression.


 3 datasets for classification.

Classification and regression models differ from each other.

 Classification models predict outputs that are discrete labels. The predictions of a classification model can be measured by calculating the accuracy.
 Regression models predict outputs that are continuous in nature. The predictions of a regression model can be measured by calculating the root mean squared error.
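To make the two metrics concrete, here is a small sketch with made-up numbers (not data from the project), computing accuracy for discrete labels and RMSE for continuous predictions:

```python
import numpy as np

# Classification: accuracy = fraction of predictions matching the true labels
y_true_cls = np.array([0, 1, 1, 0, 1])
y_pred_cls = np.array([0, 1, 0, 0, 1])
accuracy = np.mean(y_true_cls == y_pred_cls)  # 4 of 5 correct -> 0.8

# Regression: root mean squared error on continuous outputs
y_true_reg = np.array([2.0, 4.0, 6.0])
y_pred_reg = np.array([2.5, 3.5, 6.0])
rmse = np.sqrt(np.mean((y_true_reg - y_pred_reg) ** 2))
```

Accuracy rewards exact label matches, while RMSE penalizes large deviations of continuous predictions more heavily than small ones.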

The main goal of the project is to apply various machine learning algorithms to all six provided datasets and to predict the output from input data consisting of various features, using regression and classification.

The first 3 datasets involve predicting quantitative continuous data using regression: fuel consumption, cooling requirements for a specific gas process, and power load for a specific company.

The last 3 datasets involve predicting which class the output belongs to using classification. The data represent the condition of a patient for different diseases.
State-of-the-art
The author of the reviewed paper [1] gives a brief overview of the various machine learning algorithms that have been used in the field of medical diagnosis, with a strong emphasis on Naïve Bayes, neural networks and decision trees.

The author compares some state-of-the-art systems from different branches of machine learning on various medical diagnostic problems.

Naïve Bayes
Taking a particular interest in the Naïve Bayes algorithm, the author states that it is a very simple yet powerful algorithm, and that the comprehensive information obtained from it was confirmed to be accurate in further discussions with physicians about various medical diagnoses.

It is efficient and can outperform many other machine learning algorithms in medical as well as non-medical applications. When compared with six other algorithms, the Naïve Bayes classifier performed best on five of the eight medical diagnostic problems.

The author regards the Naïve Bayes classifier as a benchmark for any problem in the medical domain, to be tried before any more advanced algorithm. Recent advances in the Naïve Bayes algorithm have led to several specialized variants that improve it further.

Despite his own preference, the author states that other machine learning algorithms may always achieve better results in terms of classification accuracy; hence no algorithm can be excluded when considering the performance criteria for a specific problem.

The author summarizes the advantages and disadvantages of various machine learning algorithms in
a tabular format.
Methodology
Standard Scaling
We use the standard scaler to transform the data so that each feature has a mean of 0 and a standard deviation of 1:

z = (x − μ) / σ
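A minimal sketch of this transformation in plain NumPy (column-wise; scikit-learn's StandardScaler computes the same thing):

```python
import numpy as np

# Toy data: 3 samples, 2 features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

mu = X.mean(axis=0)      # per-feature mean
sigma = X.std(axis=0)    # per-feature standard deviation
Z = (X - mu) / sigma     # z = (x - mu) / sigma, applied column-wise
```

After the transformation every column of Z has mean 0 and standard deviation 1, so features measured in different units contribute comparably to distance-based models such as KNN and SVM.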

Principal component analysis (PCA)


We use PCA on training data with high-dimensional features to reduce it to a low-dimensional representation while retaining as much information as possible. The explained_variance_ratio_ attribute gives the share of variance captured by each principal component, in descending order.

For example, reducing the given n features to 2 principal components retains about 47% of the information (variance).
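The idea can be sketched with plain NumPy: an SVD-based PCA on synthetic data (in the project we used scikit-learn's PCA, whose explained_variance_ratio_ attribute reports the same quantity):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 samples, 5 features that are linear mixes of 2 latent factors
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 features, intrinsic rank 2

Xc = X - X.mean(axis=0)                       # centre the data before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)  # variance share per component
X2 = Xc @ Vt[:2].T                            # project onto the first 2 components
```

Because this toy data really lives in 2 dimensions, the first two components capture essentially all the variance; on real datasets the ratio tells us how much information a 2-D projection keeps.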

Regressions
In the regression problems, we compare multiple models using a Python for loop and choose the model with the lowest cross-validation mean squared error.

Starting with four linear machine learning algorithms:


1. Linear Regression.
   A Linear Regression model makes a prediction by simply computing a weighted
   sum of the input features, plus a constant called the bias term (also called the
   intercept term):

   y = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

2. Ridge Regression.
   Ridge Regression (also called Tikhonov regularization) is a regularized version of
   Linear Regression:

   E(θ) = MSE(θ) + (α/2) Σᵢ θᵢ²

3. LASSO Linear Regression.
   Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso
   Regression) is another regularized version of Linear Regression:

   E(θ) = MSE(θ) + α Σᵢ |θᵢ|

4. Elastic Net Regression.
   Elastic Net is a middle ground between Ridge Regression and Lasso Regression,
   mixing both penalties with ratio r:

   E(θ) = MSE(θ) + r·α Σᵢ |θᵢ| + ((1 − r)/2)·α Σᵢ θᵢ²
Then looking at three nonlinear machine learning algorithms:
1. k-Nearest-Neighbour.
2. Classification and Regression Trees.
3. Support Vector Machines.
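The model-comparison loop described above can be sketched as follows, using scikit-learn with synthetic data and arbitrary hyperparameters (not the project's datasets or tuned settings):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=80)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "KNN": KNeighborsRegressor(),
    "CART": DecisionTreeRegressor(random_state=0),
    "SVM": SVR(),
}

# 10-fold cross-validation MSE for every model; pick the lowest
cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

best = min(cv_mse, key=cv_mse.get)
```

The same loop structure works for classifiers by swapping the model list and using an accuracy scorer instead.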

Classifications

1 Support Vector Machine (SVM)


SVM finds a line (more generally, a hyperplane) that separates the data of two different classes. Using this line as a reference, new data points are assigned to one of the two classes depending on which side of the line they fall.
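A minimal sketch with scikit-learn's SVC on toy, linearly separable data (illustrative only; the project's datasets and settings differ):

```python
import numpy as np
from sklearn.svm import SVC

# Two clearly separated clusters, one per class
X = np.array([[0, 0], [0, 1], [1, 0],
              [3, 3], [3, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Fit a linear separating boundary
clf = SVC(kernel="linear").fit(X, y)

# New points are classified by which side of the boundary they fall on
pred = clf.predict([[0.5, 0.5], [3.5, 3.5]])  # -> [0, 1]
```

With non-separable or nonlinear data, SVC's kernel parameter (e.g. "rbf") lets the same interface find curved boundaries.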

2 Naïve Bayes
Bayes' theorem lets us calculate the probability that a data point belongs to a certain class, based on the training data. Using this probability for each class, we predict the class with the highest posterior probability. Bayes' theorem is given by:

P(class|data) = (P(data|class) * P(class)) / P(data)
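A hand-worked instance of the formula above, with made-up toy probabilities (pure Python, no library):

```python
# Toy numbers for a binary problem (assumed values, for illustration only)
p_class = 0.4                 # P(class): prior probability of the class
p_data_given_class = 0.8      # P(data | class)
p_data_given_other = 0.3      # P(data | not class)

# P(data) by total probability over both classes
p_data = p_data_given_class * p_class + p_data_given_other * (1 - p_class)

# Bayes' theorem: P(class | data) = P(data | class) * P(class) / P(data)
posterior = p_data_given_class * p_class / p_data  # -> 0.64
```

The "naïve" part of Naïve Bayes is that, with several features, P(data | class) is approximated as the product of per-feature likelihoods, assuming the features are conditionally independent given the class.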

3 K-Nearest-Neighbour (KNN)
The KNN algorithm is a supervised machine learning algorithm that computes the distance between a new data point and the training data points and selects the "k" nearest ones. The new data point is assigned to the class that occurs most often among those k neighbours.

We can observe that the new data point (x) is assigned to the red class: with a value of k=3, data points from the red class occur most often among its nearest neighbours.
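The distance-and-vote procedure can be sketched directly (toy red/blue points mirroring the figure's setup; scikit-learn's KNeighborsClassifier implements the same idea):

```python
import numpy as np
from collections import Counter

# Toy training set: two clusters labelled "red" and "blue"
X_train = np.array([[1, 1], [1, 2], [2, 1],
                    [6, 6], [6, 7], [7, 6]], dtype=float)
y_train = ["red", "red", "red", "blue", "blue", "blue"]

def knn_predict(x_new, k=3):
    # Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)    # majority vote among them
    return votes.most_common(1)[0][0]

label = knn_predict(np.array([1.5, 1.5]), k=3)      # -> "red"
```

Choosing k too small makes the vote sensitive to noise; too large and distant clusters start to outvote the local neighbourhood, which is why the report tunes k per dataset (k=9, 7, 19).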
Data
1.1 - Estimating cetane number for diesel fuel (Regression)
We need to find a model that fits the mapping between the input X and the output Y and gives the smallest cross-validation error (an estimate of the generalization error), so that it produces good predictions for unseen data.

The output Y lies in the range 40 to 60, so the predicted Y is expected to fall in the same range.

1.2 - Modeling the need for cooling in a H2O2 process (Regression)


The output Y lies in the range 0 to 100, so the predicted Y is expected to fall in the same range.

1.3 - Predicting power load (Regression)


The output Y lies in the range 1500 to 4000, so the predicted Y is expected to fall in the same range.
1.4 – Thyroid classification
We need to classify whether a patient is normal or suffers from hypothyroidism or hyperthyroidism. The output labels are encoded as normal=2, hypothyroid=4 and hyperthyroid=8.

We scale the input training data and use PCA to reduce the number of features from 21 to the 2 components that retain the most information.
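The scale-then-project preprocessing used here (and in the two following datasets) can be sketched as a scikit-learn pipeline on synthetic stand-in data with 21 mixed-scale features:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Stand-in for the thyroid inputs: 50 samples, 21 features on different scales
X_train = rng.normal(size=(50, 21)) * rng.uniform(1, 10, size=21)

# Scale first so no feature dominates the variance, then project to 2 components
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pipe.fit_transform(X_train)

# Variance share retained by the 2 components
ratio = pipe.named_steps["pca"].explained_variance_ratio_
```

Fitting scaler and PCA inside one pipeline also ensures that, at prediction time, test data is transformed with statistics learned from the training data only.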

Thyroid data representation

1.5 – Breast cancer classification


We need to classify whether a patient's tumour is benign=0 or malignant=1.

Similar steps are implemented: the data is scaled and PCA reduces the features from 30 to the 2 components that retain the most information.

Breast cancer data representation

1.6 – Electrocardiogram classification


We need to classify whether the patient has transmural ischemia=1 or not=0.
The same process is repeated: the input data is scaled and PCA is used to reduce the dimensionality of the data while retaining as much information as possible.

ECG data representation

Results
1.1 - Estimating cetane number for diesel fuel (Regression)
Regression Model         Training MSE   10-fold Cross-Validation MSE
Linear Regression        4.491          5.341
Lasso Regression         4.624          5.337
Ridge Regression         4.491          5.340
Elastic Net Regression   4.553          5.277
KNN                      4.318          6.777
CART                     0.0            12.378
SVM                      5.465          8.072

 We choose Ridge Regression as the model for this problem.
1.2 - Modeling the need for cooling in a H2O2 process (Regression)

Regression Model         Training MSE   10-fold Cross-Validation MSE
Linear Regression        27.279         68.015
Lasso Regression         52.759         64.269
Ridge Regression         27.282         65.119
Elastic Net Regression   64.674         84.232
KNN                      8.088          188.538
CART                     6.782          137.573
SVM                      36.614         76.312

 We choose SVM as the model for this problem.
1.3 - Predicting power load (Regression)

Regression Model         Training MSE   10-fold Cross-Validation MSE
Linear Regression        2604.3         4387.6
Lasso Regression         4199.5         4669.8
Ridge Regression         3240.8         4149.7
Elastic Net Regression   6249.6         6528.6
KNN                      7473.5         11502.3
CART                     0.0            5095.7
SVM                      4657.3         5095.7

 We choose Ridge Regression as the model for this problem.


1.4 - Thyroid classification
Classification Model   Training Accuracy   10-fold Cross-Validation Accuracy
SVM                    92.56%              92.72%
Naïve Bayes            93.44%              93.52%
KNN (k=9)              91.92%              93.6%

1.5 - Breast cancer classification


Classification Model   Training Accuracy   10-fold Cross-Validation Accuracy
SVM                    81.0%               81.0%
Naïve Bayes            89.0%               88.0%
KNN (k=7)              82.0%               88.9%

1.6 - Electrocardiogram classification


Classification Model   Training Accuracy   10-fold Cross-Validation Accuracy
SVM                    60.0%               60.0%
Naïve Bayes            68.0%               60.0%
KNN (k=19)             66.0%               66.0%


Learning Curve
[Figures: learning curves for datasets 1.1–1.6]


Discussion

We can observe from the results that the various models trained for regression and classification achieved decent accuracy on the predicted data when compared with the actual output data.

We can also see that the results achieved in our classification experiments using Naïve Bayes are consistent with what we learned from the research paper: Naïve Bayes reached higher accuracy than the other classification algorithms, in the same fashion the author describes.

References

1. https://www.sciencedirect.com/science/article/pii/S093336570100077X#SEC6
2. https://scikit-learn.org/stable/
3. Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurelien Geron.
4. Machine Learning Mastery with Python by Jason Brownlee.
