LS_Project_Report
Abhilash Kashyap
March 2020
Introduction
We use machine learning to understand the data at hand and to build a framework or model for predicting or estimating quantities of interest from it. Machine learning is further divided into supervised and unsupervised learning. The datasets in this project mainly revolve around supervised machine learning.
Supervised machine learning has data in the format of inputs and outputs. We use various algorithms to find a mapping between the input and the output, which helps us estimate or predict the outputs for new input data that is similar to the data originally used for learning the mapping, or relation, between the input and the output.
Classification models revolve around predicting outputs which are discrete labels. The predictions of a classification model can be measured by calculating the accuracy.
Regression models revolve around predicting outputs which are continuous in nature. The predictions of a regression model can be measured by calculating the root mean square error (RMSE).
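As a minimal illustration of these two metrics, the sketch below computes accuracy for discrete labels and RMSE for continuous predictions with scikit-learn; the y_true / y_pred arrays are made-up placeholder values, not results from the project datasets.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# classification: fraction of correctly predicted labels (made-up example data)
y_true_cls = [0, 1, 1, 0]
y_pred_cls = [0, 1, 0, 0]
print("accuracy:", accuracy_score(y_true_cls, y_pred_cls))  # 3 of 4 correct -> 0.75

# regression: root mean square error of continuous predictions
y_true_reg = [42.0, 51.5, 58.0]
y_pred_reg = [44.0, 50.0, 57.0]
print("RMSE:", np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)))
```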
The main goal of the project is to apply various machine learning algorithms to all six provided datasets and to predict the output from input data consisting of various features, using regression and classification.
The first three datasets revolve around predicting quantitative, continuous data using regression: fuel consumption, the cooling requirement for a specific gas, and the power load for a specific company.
The last three datasets are for predicting which class the output belongs to using classification. The data represent the condition of patients for different diseases.
State-of-the-art
The author of the paper [1] gives a brief overview of the various machine learning algorithms which have been used in the field of medical diagnosis, with a strong emphasis on Naïve Bayes, neural networks and decision trees.
The author compares some of the state-of-the-art systems from the branches of machine learning on various medical diagnostic problems.
Naïve Bayes
Taking a particular interest in the Naïve Bayes algorithm, the author states that it is a very simple yet powerful algorithm. The information obtained from this algorithm was confirmed to be accurate in further discussions with physicians regarding various medical diagnoses.
It is efficient and has the ability to outperform various machine learning algorithms in medical as well as non-medical applications. It was observed that, when compared with six other algorithms, the Naïve Bayes classifier performed better than the others on five out of the eight medical diagnostic problems.
The author regards the Naïve Bayes classifier as a benchmark to try on any problem in the medical domain before attempting more advanced algorithms. There have been recent advancements in the Naïve Bayes algorithm which have led to various specialised variants of it, improving it further.
Despite having his own preference, the author states that there is always a possibility that other machine learning algorithms achieve better results in terms of classification accuracy. Hence none of the algorithms can be excluded when considering the performance criteria for a specific problem.
The author summarizes the advantages and disadvantages of various machine learning algorithms in
a tabular format.
Methodology
Standard Scaling
We use the standard scaler so that the transformed data has a mean value of 0 and a standard deviation of 1, which is represented as z = (x − μ) / σ.
We can observe that, by reducing the given n features to 2 features with PCA, we are able to retain 47% of the information.
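A minimal sketch of these preprocessing steps, assuming a hypothetical NumPy feature matrix X (the placeholder data and variable names are illustrative, not the project's actual arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 21)  # placeholder for the real input feature matrix

# transform every feature to zero mean and unit standard deviation: z = (x - mu) / sigma
X_scaled = StandardScaler().fit_transform(X)

# project onto the 2 principal components that retain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# fraction of the original variance ("information") kept by the 2 components
print("explained variance retained:", pca.explained_variance_ratio_.sum())
```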
Regressions
In regression problems, we compare multiple models using a Python for loop, and from these we choose the best model, i.e. the one with the lowest cross-validation mean squared error.
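A minimal sketch of this comparison loop, assuming placeholder training arrays X_train and y_train (the data, model list and variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# placeholder training data standing in for the scaled project data
X_train = np.random.rand(60, 2)
y_train = 40 + 20 * np.random.rand(60)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Ridge Regression": Ridge(),
    "Elastic Net Regression": ElasticNet(),
    "KNN": KNeighborsRegressor(),
    "CART": DecisionTreeRegressor(),
    "SVM": SVR(),
}

results = {}
for name, model in models.items():
    # cross_val_score returns negative MSE, so negate it to get the error
    scores = -cross_val_score(model, X_train, y_train, cv=10,
                              scoring="neg_mean_squared_error")
    results[name] = scores.mean()

best = min(results, key=results.get)  # model with the lowest cross-validation error
print("best model:", best, results[best])
```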
Classifications
2 Naïve Bayes
Bayes' theorem helps us calculate the probability that a data point belongs to a certain class, based on the training data. Using the probability of each data point with respect to each class, we can predict which class the data point belongs to. Bayes' theorem is given by:
P(class | data) = P(data | class) · P(class) / P(data)
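A minimal sketch using scikit-learn's Gaussian Naïve Bayes classifier; the arrays below are illustrative placeholders, not the project's patient data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# placeholder arrays standing in for the patient features and disease labels
X = np.random.rand(100, 10)
y = np.random.randint(0, 2, size=100)

nb = GaussianNB()
# 10-fold cross-validation accuracy, the metric used for the classification datasets
print("CV accuracy:", cross_val_score(nb, X, y, cv=10, scoring="accuracy").mean())
```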
3 K-Nearest-Neighbour (KNN)
The KNN algorithm is a supervised machine learning algorithm which calculates the distances between a new data point and the training data points and selects the k nearest neighbours. The new data point is assigned to the class that occurs most frequently among those k neighbours.
In the illustration, we can observe that the new data point (x) is assigned to the red class because, with a value of k = 3, data points from the red class have the highest occurrence among its neighbours.
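A minimal sketch of this idea with k = 3, using a hypothetical two-class toy dataset (the coordinates and class labels are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy two-class training data: class 0 ("red") and class 1 ("blue")
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [3.0, 3.0], [3.2, 2.8]])
y_train = np.array([0, 0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# the new point's three nearest neighbours all belong to class 0, so it is assigned class 0
print(knn.predict([[1.5, 1.2]]))  # -> [0]
```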
Data
1.1 - Estimating cetane number for diesel fuel (Regression)
We need to find a model that fits the input X to the output Y and gives the smallest cross-validation error (generalization error), so that it will produce good predictions for unseen data. The Y output of the data is in the range 40 to 60; therefore the predicted Y is expected to lie in the same range.
We scale the input training data and use PCA to reduce the number of features from 21 to 2 features which retain the most information.
Similar steps are implemented to scale the data and to use PCA to reduce the features from 30 to 2 which retain the most information.
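A minimal end-to-end sketch of this workflow, assuming placeholder arrays X and y standing in for the cetane-number dataset, and chaining scaling, PCA and a regression model in one scikit-learn pipeline so that cross-validation covers the whole preprocessing and modelling chain:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# placeholder data: 21 input features, outputs in the 40-60 range
X = np.random.rand(120, 21)
y = 40 + 20 * np.random.rand(120)

# scaling and PCA are fitted inside each cross-validation fold via the pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("model", Ridge()),
])

mse = -cross_val_score(pipe, X, y, cv=10, scoring="neg_mean_squared_error")
print("10-fold cross-validation error:", mse.mean())
```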
Results
1.1 - Estimating cetane number for diesel fuel (Regression)
Regression Model         Training Error    10-fold Cross-Validation Error
Linear Regression        4.491             5.341
Lasso Regression         4.624             5.337
Ridge Regression         4.491             5.340
Elastic Net Regression   4.553             5.277
KNN                      4.318             6.777
CART                     0.0               12.378
SVM                      5.465             8.072
We choose Ridge Regression as the model for this problem.
1.2 - Modeling the need for cooling in a H2O2 process (Regression)
We can observe from the results that the various models trained for regression and classification give us a decent accuracy on the predicted data when compared to the true output data.
We can also see that the results we achieved in our classification training using Naïve Bayes are consistent with what we understood from reading the research paper: we achieved higher accuracy with the Naïve Bayes model than with the other classification algorithms, just as the author mentioned.
References
1. https://www.sciencedirect.com/science/article/pii/S093336570100077X#SEC6
2. https://scikit-learn.org/stable/
3. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.
4. Machine Learning Mastery with Python by Jason Brownlee.