Module 4- ML-21EC744 (1)
Module 4
Syllabus
Embedding a Machine Learning Model into a Web Application
Serializing fitted scikit-learn estimators, Setting up a SQLite database for data storage,
Developing a web application with Flask, Turning the movie classifier into a web application,
Deploying the web application to a public server
Predicting Continuous Target Variables with Regression Analysis
Introducing a simple linear regression model, Exploring the Housing Dataset, Implementing an
ordinary least squares linear regression model, Fitting a robust regression model using
RANSAC, Evaluating the performance of linear regression models, Using regularized methods
for regression- Turning a linear regression model into a curve – polynomial regression
Textbook 1: Chapter 9 and 10
Training a machine learning model can be computationally quite expensive, as we saw when applying machine learning to sentiment analysis. Surely, we don't want to retrain our model every time we close our Python interpreter and want to make a new prediction or reload our web application. One option for model persistence is Python's built-in pickle module (https://docs.python.org/3.6/library/pickle.html), which allows us to serialize and deserialize Python object structures to compact bytecode, so that we can save our classifier in its current state and reload it when we want to classify new samples, without needing the model to learn from the training data all over again.
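A sketch of that serialization step (clf and the stop-word list are stand-ins here for the fitted scikit-learn classifier and the NLTK stop words from the sentiment-analysis chapter):

```python
import os
import pickle

# create the movieclassifier/pkl_objects directory structure
dest = os.path.join('movieclassifier', 'pkl_objects')
os.makedirs(dest, exist_ok=True)

# stand-ins: clf would be the fitted scikit-learn classifier and
# stop the NLTK stop-word list from the sentiment-analysis chapter
clf = {'model': 'placeholder for the fitted classifier'}
stop = ['a', 'an', 'the']

# protocol=4 keeps the pickle files compatible across recent Python versions
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=4)
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)
```

Loading the objects back is simply the reverse, for example clf = pickle.load(open(os.path.join(dest, 'classifier.pkl'), 'rb')).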
Using the preceding code, we created a movieclassifier directory where we will later store the files
and data for our web application. Within this movieclassifier directory, we created a pkl_objects
subdirectory to save the serialized Python objects to our local drive.
DEPT OF ECE/SJBIT 1
MODULE 4
We will set up a simple SQLite database to collect optional feedback about the predictions from
users of the web application. We can use this feedback to update our classification model. SQLite
is an open-source SQL database engine that doesn't require a separate server to operate, which
makes it ideal for smaller projects and simple web applications. Essentially, a SQLite database can
be understood as a single, self-contained database file that allows us to directly access storage files.
Furthermore, SQLite doesn't require any system-specific configuration and is supported by all
common operating systems. It has gained a reputation for being very reliable as it is used by
popular companies such as Google, Mozilla, Adobe, Apple, Microsoft, and many more.
Let us now create a new SQLite database inside the movieclassifier directory and store two example movie reviews.
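A sketch of that step (the table and column names follow the textbook's example):

```python
import sqlite3

# connecting creates reviews.sqlite in the current directory if it is absent
conn = sqlite3.connect('reviews.sqlite')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS review_db')
c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')

# store two example reviews with their sentiment labels and a timestamp
example1 = 'I love this movie'
c.execute("INSERT INTO review_db (review, sentiment, date)"
          " VALUES (?, ?, DATETIME('now'))", (example1, 1))
example2 = 'I disliked this movie'
c.execute("INSERT INTO review_db (review, sentiment, date)"
          " VALUES (?, ?, DATETIME('now'))", (example2, 0))
conn.commit()
conn.close()
```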
Following the preceding code example, we created a connection (conn) to a SQLite database file
by calling the connect method of the sqlite3 library, which created the new database file
reviews.sqlite in the movieclassifier directory if it didn't already exist.
After Armin Ronacher's initial release of Flask in 2010, the framework has gained huge popularity
over the years, and examples of popular applications that make use of Flask include LinkedIn and
Pinterest. Since Flask is written in Python, it provides us Python programmers with a convenient
interface for embedding existing Python code, such as our movie classifier.
If the Flask library is not already installed in your current Python environment, you can simply
install it via conda or pip from your Terminal.
conda install flask
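A minimal Flask application looks like the following (the route and message here are illustrative stand-ins, not the book's final movie-classifier app):

```python
from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    # the real app would render an HTML template with the review form
    return 'Hello, movie classifier!'

# to serve the app locally, call: app.run(debug=True)
```

Calling app.run(debug=True) starts Flask's built-in development server, which by default listens at http://127.0.0.1:5000.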
The goal of simple (univariate) linear regression is to model the relationship between a single feature (the explanatory variable x) and a continuous-valued response (the target variable y). The equation of a linear model with one explanatory variable is defined as follows:

y = w0 + w1x

Here, the weight w0 represents the y-axis intercept and w1 is the weight coefficient of the
explanatory variable. Our goal is to learn the weights of the linear equation to describe the
relationship between the explanatory variable and the target variable, which can then be used to
predict the responses of new explanatory variables that were not part of the training dataset.
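As a quick numeric sketch (with made-up data), the weights of such a line can be estimated by ordinary least squares using NumPy's polyfit:

```python
import numpy as np

# made-up data following y = 3 + 2x plus a little noise
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, x.size)

# least-squares estimates; polyfit returns [w1, w0] for degree 1
w1, w0 = np.polyfit(x, y, deg=1)
```

The recovered w0 and w1 will be close to the true intercept 3 and slope 2, and can then be used to predict responses for new x values via w0 + w1 * x_new.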
Based on the linear equation that we defined previously, linear regression can be understood as
finding the best-fitting straight line through the sample points, as shown in the following figure:
This best-fitting line is also called the regression line, and the vertical lines from the regression
line to the sample points are the so-called offsets or residuals—the errors of our prediction.
The special case of linear regression with one explanatory variable that we introduced in the
previous subsection is also called simple linear regression. Of course, we can also generalize the
linear regression model to multiple explanatory variables; this process is called multiple linear
regression:

y = w0x0 + w1x1 + … + wmxm = Σ(i=0 to m) wixi = wᵀx

Here, w0 is the y-axis intercept with x0 = 1.
The following figure shows how the two-dimensional, fitted hyperplane of a multiple linear
regression model with two features could look:
As we can see, visualizing a multiple linear regression fit in a three-dimensional scatterplot is already challenging to interpret when looking at static figures. Since we have no good means of visualizing hyperplanes with two or more dimensions in a scatterplot (multiple linear regression models are fit to datasets with three or more features), the examples and visualizations in this chapter will mainly focus on the univariate case, using simple linear regression. However, simple and multiple linear
regression
are based on the same concepts and the same evaluation techniques; the code implementations that
we will discuss in this chapter are also compatible with both types of regression model.
In this section, we will explore a new dataset, the Housing dataset, which contains information about houses in the suburbs of Boston collected by D. Harrison and D.L. Rubinfeld in 1978. The Housing dataset has been made freely available and is included in the code bundle of this book. The dataset was recently removed from the UCI Machine Learning Repository but is available online at https://raw.githubusercontent.com/rasbt/python-machine-learning-book-2nd-edition/master/code/ch10/housing.data.txt. As with each new dataset, it is always helpful to explore the data through a simple visualization, to get a better feeling of what we are working with.
We will load the Housing dataset using the pandas read_csv function, which is fast and versatile, and a recommended tool for working with tabular data stored in a plaintext format.
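In practice, we would pass the dataset URL straight to read_csv; the sketch below uses a two-row in-memory excerpt of the file so it runs without a network connection:

```python
import io
import pandas as pd

# two rows standing in for housing.data.txt, which is
# whitespace-separated and has no header row
sample = io.StringIO(
    "0.00632 18.0 2.31 0 0.538 6.575 65.2 4.09 1 296.0 15.3 396.90 4.98 24.0\n"
    "0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6\n"
)
df = pd.read_csv(sample, header=None, sep=r'\s+')
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
              'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
```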
The features of the 506 samples in the Housing dataset, taken from the original source previously shared on https://archive.ics.uci.edu/ml/datasets/Housing, can be summarized as follows:
CRIM: Per-capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: Nitric oxide concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property-tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
B: 1000(Bk − 0.63)², where Bk is the proportion of Black residents by town
LSTAT: Percentage of lower-status population
MEDV: Median value of owner-occupied homes in $1000s
The first few rows of the loaded data can then be inspected as a pandas DataFrame, for example via df.head().
Exploratory Data Analysis (EDA) is an important and recommended first step prior to the training
of a machine learning model. In the rest of this section, we will use some simple yet useful
techniques from the graphical EDA toolbox that may help us to visually detect the presence of
outliers, the distribution of the data, and the relationships between features.
First, we will create a scatterplot matrix that allows us to visualize the pair-wise correlations
between the different features in this dataset in one place. To plot the scatterplot matrix, we will
use the pairplot function from the Seaborn library.
Due to space constraints and in the interest of readability, we only plotted five columns from the
dataset: LSTAT, INDUS, NOX, RM, and MEDV. However, you are encouraged to create a
scatterplot matrix of the whole DataFrame to explore the dataset further by choosing different
column names in the previous sns.pairplot call, or include all variables in the scatterplot matrix by
omitting the column selector (sns.pairplot(df)).
Using this scatterplot matrix, we can now quickly eyeball how the data is distributed and whether
it contains outliers. For example, we can see that there is a linear relationship between RM and
house prices, MEDV (the fifth column of the fourth row).
Next, we will create a correlation matrix to quantify and summarize the linear relationships between variables. A correlation matrix is closely related to the covariance matrix. Intuitively, we can interpret the
correlation matrix as a rescaled version of the covariance matrix. In fact, the correlation matrix is
identical to a covariance matrix computed from standardized features.
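We can verify this claim in code (with made-up feature columns) by checking that np.corrcoef matches the covariance of the standardized features:

```python
import numpy as np

# three made-up feature columns; np.corrcoef expects variables as rows
rng = np.random.RandomState(1)
x = rng.normal(size=200)
data = np.vstack([x,
                  2.0 * x + rng.normal(size=200),  # correlated with x
                  rng.normal(size=200)])           # independent noise

# correlation matrix
cm = np.corrcoef(data)

# standardize each feature, then take the (population) covariance
z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
cov_std = np.cov(z, bias=True)  # bias=True divides by N, matching the std above
```

The two matrices agree element for element, with ones on the diagonal and the pairwise Pearson correlation coefficients off the diagonal.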
Recall that in Adaline we minimized a cost function J(·) to learn the weights via optimization algorithms, such as Gradient Descent (GD) and Stochastic Gradient Descent (SGD). This cost function in Adaline is the Sum of Squared Errors (SSE), which is identical to the cost function that we use for OLS:

J(w) = (1/2) Σ(i=1 to n) (y(i) − ŷ(i))²

Here, ŷ(i) is the predicted value for the ith training sample.
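Following the Adaline code from earlier chapters, a minimal gradient-descent OLS regressor might be sketched like this (the class name matches the text; the default hyperparameter values are illustrative):

```python
import numpy as np

class LinearRegressionGD:
    """Linear regression via batch gradient descent on the SSE cost."""

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta        # learning rate
        self.n_iter = n_iter  # passes over the training set

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])  # weights, incl. intercept w0
        self.cost_ = []
        for _ in range(self.n_iter):
            output = self.net_input(X)
            errors = y - output
            # batch gradient-descent updates derived from the SSE cost
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            self.cost_.append((errors ** 2).sum() / 2.0)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)
```

Because the raw gradient updates are sensitive to feature scale, standardizing X and y before calling fit helps the algorithm converge with a fixed learning rate.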
To see our LinearRegressionGD regressor in action, let's use the RM (number of rooms) variable
from the Housing dataset as the explanatory variable and train a model that can predict MEDV
(house prices).
It is always a good idea to plot the cost as a function of the number of epochs (passes over the training dataset) when we are using optimization algorithms, such as gradient descent, to check that the algorithm has converged to a cost minimum (here, a global cost minimum):
Linear regression models can be heavily impacted by the presence of outliers. In certain situations,
a very small subset of our data can have a big effect on the estimated model coefficients. There are
many statistical tests that can be used to detect outliers. However, removing outliers always
requires our own judgment as data scientists as well as our domain knowledge.
As an alternative to throwing out outliers, we will look at a robust method of regression using the
RANdom SAmple Consensus (RANSAC) algorithm, which fits a regression model to a subset of
the data, the so-called inliers.
We can summarize the iterative RANSAC algorithm as follows:
1. Select a random number of samples to be inliers and fit the model.
2. Test all other data points against the fitted model and add those points that fall within a user-given tolerance to the inliers.
3. Refit the model using all inliers.
4. Estimate the error of the fitted model versus the inliers.
5. Terminate the algorithm if the performance meets a certain user-defined threshold or if a fixed number of iterations was reached; otherwise, go back to step 1.
RANSAC Implementation
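A sketch of RANSAC via scikit-learn's RANSACRegressor, using made-up data with injected outliers (the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# made-up data with a linear trend (slope 2, intercept 1) plus noise
rng = np.random.RandomState(0)
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.3, X.shape[0])
y[:3] += 25.0  # inject three gross outliers

ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,          # maximum iterations
                         min_samples=0.5,         # fraction drawn per trial
                         residual_threshold=5.0,  # inlier tolerance
                         random_state=0)
ransac.fit(X, y)

inlier_mask = ransac.inlier_mask_   # boolean mask of the consensus inliers
slope = ransac.estimator_.coef_[0]  # fitted on inliers only
```

Because the final model is refit on the inliers only, the recovered slope stays close to 2 even though an ordinary least-squares fit on all points would be pulled up by the three outliers.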
Another useful quantitative measure of a model's performance is the so-called Mean Squared Error (MSE), which is simply the averaged value of the SSE cost that we minimized to fit the linear regression model. The MSE is useful for comparing different regression models or for tuning their parameters via grid search and cross-validation, as it normalizes the SSE by the sample size:

MSE = (1/n) Σ(i=1 to n) (y(i) − ŷ(i))²
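A small numeric sketch of the MSE computation (the target values and predictions are made up):

```python
import numpy as np

# made-up target values and model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: the SSE divided by the number of samples n
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```

scikit-learn provides the same computation as sklearn.metrics.mean_squared_error(y_true, y_pred).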
So far, we have assumed a linear relationship between explanatory and response variables. One way to account for a violation of the linearity assumption is to use a polynomial regression model by adding polynomial terms:

y = w0 + w1x + w2x² + … + wdx^d

Here, d denotes the degree of the polynomial.
We will now use the PolynomialFeatures transformer class from scikit-learn to add a quadratic term (d = 2) to a simple regression problem with one explanatory variable, and then compare the polynomial fit to the linear fit, following these steps:
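A sketch of those steps, using the small example values from the textbook's polynomial regression demo:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# example data from the textbook's polynomial regression demo
X = np.array([258.0, 270.0, 294.0, 320.0, 342.0, 368.0,
              396.0, 446.0, 480.0, 586.0])[:, np.newaxis]
y = np.array([236.4, 234.4, 252.8, 298.6, 314.2, 342.2,
              360.8, 368.0, 391.2, 390.8])

# step 1: add a quadratic term (d = 2)
quadratic = PolynomialFeatures(degree=2)
X_quad = quadratic.fit_transform(X)   # columns: [1, x, x^2]

# step 2: fit a simple linear model and the quadratic model
lr = LinearRegression().fit(X, y)
pr = LinearRegression().fit(X_quad, y)

# step 3: compare the two fits on the training data
y_lin_pred = lr.predict(X)
y_quad_pred = pr.predict(X_quad)
```

Comparing the training MSE of the two fits shows that the quadratic model captures the flattening trend in the data much better than the straight line.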