Model Development
Model development is an iterative process in which many models are derived, tested, and
built upon until a model that fits the desired criteria is obtained.
The first step toward model creation involves selecting the appropriate algorithm(s).
These algorithms rely on prepared data to create and train the model.
There are hundreds of machine learning algorithms that data scientists can access, and
new ones emerge every day.
To produce a functional business tool, the chosen algorithm must be aligned with the
machine learning problem being solved.
In this phase the data science team needs to develop data sets for training, testing, and
production purposes.
These data sets enable the data scientists to develop the analytical method and train it, while
holding aside some of the data for testing the model.
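As a minimal sketch of holding data aside for testing, assuming a hypothetical prepared feature matrix X and target y:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Hypothetical prepared data: 100 samples, 3 features, one target column
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = rng.normal(size=100)

    # Hold aside 20% of the data for testing the model
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    print(X_train.shape, X_test.shape)   # (80, 3) (20, 3)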
Simple regression
Regression is the analysis of the relation between one variable and some other
variable(s), assuming a linear relation.
Also referred to as least squares regression and ordinary least squares (OLS)
a) The purpose is to explain the variation in a variable (that is, how a variable
differs from its mean value) using the variation in one or more other variables.
b) Suppose we want to describe, explain, or predict why a variable differs from
its mean. Let the ith observation on this variable be represented as Yᵢ, and let n indicate the
number of observations.
Introduction
Regression analysis is used when you want to predict a continuous dependent variable from a
number of independent variables.
The independent variables used in regression can be either continuous or dichotomous (having only two levels).
Independent variables with more than two levels can also be used in regression analyses, but
they first must be converted into variables that have only two levels.
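For instance, a multi-level categorical predictor can be converted into two-level (dummy) variables with pandas; the column name and categories below are hypothetical:

    import pandas as pd

    # Hypothetical data with a three-level categorical independent variable
    df = pd.DataFrame({"region": ["north", "south", "west", "north", "south"]})

    # Convert the multi-level variable into two-level (0/1) dummy variables;
    # drop_first uses one level as the reference category to avoid redundancy
    dummies = pd.get_dummies(df, columns=["region"], drop_first=True)
    print(dummies)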
One point to keep in mind with regression analysis is that causal relationships among the
variables cannot be determined.
While the terminology is such that we say that X "predicts" Y, we cannot say that X "causes"
Y.
Assumptions of regression
Number of cases
a) When doing regression, the cases-to-Independent Variables (IVs) ratio should
ideally be 20:1; that is 20 cases for every IV in the model.
b) The lowest your ratio should be is 5:1 (i.e., 5 cases for every IV in the model).
Accuracy of data
a) If you have entered the data (rather than using an established dataset), it is a
good idea to check the accuracy of the data entry.
b) If you don't want to re-check each data point, you should at least check the
minimum and maximum value for each variable to ensure that all values for
each variable are "valid."
c) For example, a variable that is measured using a 1 to 5 scale should not have
a value of 8.
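As a quick sketch of this kind of range check (the DataFrame and the 1-to-5 column name are hypothetical):

    import pandas as pd

    # Hypothetical responses on a 1-to-5 scale, with one invalid entry (8)
    df = pd.DataFrame({"satisfaction": [1, 3, 5, 8, 2]})

    # Check the minimum and maximum value for the variable
    print(df["satisfaction"].min(), df["satisfaction"].max())

    # Flag any rows whose values fall outside the valid 1-5 range
    invalid = df[~df["satisfaction"].between(1, 5)]
    print(invalid)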
Missing data
a) You also want to look for missing data. If specific variables have a lot of missing
values, you may decide not to include those variables in your analyses.
b) If only a few cases have any missing values, then you might want to delete those
cases.
c) If there are missing values for several cases on different variables, then you
probably don't want to delete those cases (because a lot of your data will be lost).
d) If there is not too much missing data, and there does not seem to be any pattern
to what is missing, then you don't really need to worry.
e) Just run your regression, and any cases that do not have values for the variables
used in that regression will not be included.
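A minimal pandas sketch of these checks, with made-up columns and values:

    import numpy as np
    import pandas as pd

    # Hypothetical data: "income" has many missing values, "age" has only one
    df = pd.DataFrame({
        "age":    [25, 30, np.nan, 41, 37],
        "income": [50000, np.nan, np.nan, np.nan, 62000],
    })

    # Count missing values per variable; a variable with many missing values
    # might be dropped from the analysis entirely
    print(df.isnull().sum())

    # If only a few cases have missing values, those cases can be deleted
    df_complete = df.dropna()
    print(df_complete)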
Outliers
a) You also need to check your data for outliers (i.e., an extreme value on a particular
item)
b) An outlier is often operationally defined as a value that is at least 3 standard
deviations above or below the mean.
c) If you feel that the cases that produced the outliers are not part of the same
"population" as the other cases, then you might just want to delete those cases.
Normality
a) You also want to check that your data is normally distributed.
b) To do this, you can construct histograms and "look" at the data to see its
distribution.
c) Often the histogram will include a line that depicts what the shape would look like
if the distribution were truly normal (and you can "eyeball" how much the actual
distribution deviates from this line).
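Such a histogram with an overlaid normal curve can be sketched with matplotlib and scipy (the data here are simulated):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    data = rng.normal(loc=0, scale=1, size=500)   # simulated variable

    # Histogram scaled to a density so the normal curve is comparable
    plt.hist(data, bins=30, density=True, alpha=0.6)

    # Line depicting what the shape would look like if the distribution were truly normal
    grid = np.linspace(data.min(), data.max(), 200)
    plt.plot(grid, norm.pdf(grid, loc=data.mean(), scale=data.std()))
    plt.title("Checking normality by eye")
    plt.show()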
Simple linear regression
The simple linear regression model has the form y = β0 + β1x + ε, where:
a) y is the predicted value of the dependent variable for any given value of the
independent variable (x).
b) β0 is the intercept, the predicted value of y when x is 0.
c) β1 is the regression coefficient – how much we expect y to change as x increases.
d) x is the independent variable (the variable we expect is influencing y).
e) ε is the error of the estimate, or how much variation there is in our estimate of the
regression coefficient.
How to perform a simple linear regression
Linear regression finds the line of best fit through your data by searching for the
regression coefficient (β1) that minimizes the total error (ε) of the model.
While you can perform a linear regression by hand, this is a tedious process, so most
people use statistical programs to quickly analyze the data.
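As a rough illustration of what such a program computes, numpy's polyfit returns the least-squares slope and intercept directly (the x/y values are made up):

    import numpy as np

    # Hypothetical x/y data with an approximately linear relationship
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

    # np.polyfit with degree 1 returns the slope (β1) and intercept (β0) of the
    # line that minimizes the total squared error of the model
    b1, b0 = np.polyfit(x, y, 1)
    print(f"y = {b0:.2f} + {b1:.2f} * x")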
Applications of regression analysis:
Forecasting:
a) Different types of regression analysis can be used to forecast future
opportunities and threats for a business.
b) For instance, a customer’s likely purchase volume can be predicted using a
demand analysis.
c) However, when it comes to business, demand isn’t the only variable that
affects profitability.
Comparison with competition:
a) A company’s financial performance can be compared to that of a specific
competitor using this tool.
b) Also, it can be used to determine the correlation between the stock prices of
two different companies within the same industry or different industries.
When compared to a rival company, it can help identify which factors are
influencing its sales, and it can help small businesses achieve rapid success in a
short time.
Problem Identification:
a) In addition to providing factual evidence, a regression can be used to identify
and correct judgment errors.
b) For example, a retail shop owner may believe that extending the hours of
operation will result in a significant increase in sales.
c) However, regression analysis shows that the monetary gains as a result of
increasing the working hours are not enough to offset the increase in
operational costs that comes along with it.
Regression analysis may provide business owners with quantitative support
for their decisions and prevent them from making mistakes based on
intuition alone.
Decision Making:
a) Regression analysis (along with other types of statistical analysis) is now being
used by many businesses and their top executives to make better business
decisions and reduce guesswork and intuition.
b) Scientific management is made possible by regression. Data overload is a
problem for both small and large organizations.
c) To make the best decisions possible, managers can use regression analysis to
sort through data and select relevant factors.
Simple Linear Regression
Simple linear regression is when you want to predict values of one variable, given values of another
variable.
For example, you might want to predict a person's height (in inches) from his weight (in pounds).
Imagine a sample of ten people for whom you know their height and weight.
You could plot the values on a graph, with weight on the x axis and height on the y axis. If there were
a perfect linear relationship between height and weight, then all 10 points on the graph would fit on a
straight line.
But this is never the case (unless your data are rigged). If there is a (non-perfect) linear relationship
between height and weight (presumably a positive one), then you would get a cluster of points on the
graph which slopes upward.
In other words, heavier people should tend to be taller than lighter people.
The purpose of regression analysis is to
come up with an equation of a line that
fits through that cluster of points with
the minimal amount of deviations from
the line.
The deviation of the points from the line
is called "error." Once you have this
regression equation, if you knew a
person's weight, you could then predict
their height.
Simple linear regression is actually the
same as a bivariate correlation between
the independent and dependent
variable.
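A minimal scikit-learn sketch of this height-from-weight example; the ten weight/height pairs are invented for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical sample of ten people: weight in pounds, height in inches
    weight = np.array([[120], [135], [150], [160], [172], [185], [140], [155], [168], [195]])
    height = np.array([62, 64, 66, 67, 69, 71, 65, 66, 68, 72])

    # Fit the regression equation height = b0 + b1 * weight
    model = LinearRegression().fit(weight, height)
    print(model.intercept_, model.coef_[0])

    # Once you have the regression equation, a known weight predicts a height
    print(model.predict([[165]]))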
Distribution plot
When to use it
The distribution plot is suitable for comparing range and distribution for
groups of numerical data.
Advantages
The distribution plot visualizes the distribution of data.
Disadvantages
The distribution plot is not relevant for detailed analysis of the data as it
deals with a summary of the data distribution.
Creating a distribution plot
You can create a distribution plot on the sheet you are editing.
In a distribution plot you need to use one or two dimensions, and one measure. If you
use a single dimension you will receive a single line visualization. If you use two
dimensions, you will get one line for each value of the second, or outer, dimension.
Do the following:
a) From the assets panel, drag an empty distribution plot to the sheet.
b) Add the first dimension.
c) This is the inner dimension, which defines the value points.
d) Add a second dimension.
e) This is the outer dimension, which defines the groups of value points shown on the
dimension axis.
f) Click Add measure and create a measure from a field.
Viewing the distribution of measure values in a dimension with a distribution plot
This example shows how to make a
distribution plot to view the distribution of
measure values in a dimension, using
weather data as an example.
Dataset
In this example, we'll use the following
weather data.
a) Location: Sweden > Gällivare Airport
b) Date range: all data from 2010 to 2017
c) Measurement: Average of the 24 hourly
temperature observations in degrees
Celsius
d) The dataset that is loaded contains a daily
average temperature measurement from a
weather station in the north of Sweden
during the time period of 2010 to 2017.
Measure
We use the average temperature measurement in the dataset as the measure, by creating a measure in Master
items with the name Temperature degrees Celsius, and the expression Avg([Average of the 24 hourly
temperature observations in degrees Celsius]).
Visualization
We add a distribution plot to the sheet and set the following data properties:
a) Dimension: Date (date) and Year (year). The order is important, Date needs to be the first dimension.
b) Measure: Temperature degrees Celsius, the measure that was created as a master item.
c) Distribution plot with the dimensions Date (date) and Year (year) and the measure Temperature degrees
Celsius.
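The steps above are specific to Qlik Sense. As a rough Python analogue (not the tool described here), a similar per-year view of a measure's distribution can be sketched with seaborn, using made-up daily temperatures in place of the weather dataset:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Made-up daily average temperatures for 2010-2017 (stand-in for the weather data)
    rng = np.random.default_rng(2)
    dates = pd.date_range("2010-01-01", "2017-12-31", freq="D")
    temps = rng.normal(loc=1, scale=12, size=len(dates))
    df = pd.DataFrame({"year": dates.year, "temperature_c": temps})

    # One distribution per value of the outer dimension (year), measure on the y axis
    sns.violinplot(data=df, x="year", y="temperature_c")
    plt.show()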
Discovery
A data science pipeline is the set of processes that convert raw data into
actionable answers to business questions. Data science pipelines automate
the flow of data from source to destination, ultimately providing you insights
for making business decisions.
Benefits
Data science pipelines automate the processes of data validation; extract, transform, load (ETL);
machine learning and modeling; revision; and output, such as to a data warehouse or
visualization platform. As a type of data pipeline, data science pipelines eliminate many of the manual,
error-prone processes involved in transporting data between locations, which can result in data
latency and bottlenecks.
The benefits of a modern data science pipeline to your business:
a) Easier access to insights, as raw data is quickly and easily adjusted, analyzed, and modeled
based on machine learning algorithms, then output as meaningful, actionable information
b) Faster decision-making, as data is extracted and processed in real time, giving you up-to-date
information to leverage
c) Agility to meet peaks in demand, as modern data science pipelines offer instant elasticity via
the cloud
DATA SCIENCE PIPELINE FLOW
But the first step in deploying a data science pipeline is identifying the business
problem you need the data to address and the data science workflow.
Formulate questions you need answers to — that will direct the machine
learning and other algorithms to provide solutions you can use.
Once that’s done, the steps for a data science pipeline are:
a) Data collection, including the identification of data sources and extraction of
data from sources into usable formats
b) Data preparation, which may include ETL
c) Data modeling and model validation, in which machine learning is used to find
patterns and apply rules to the data via algorithms, which are then tested on sample
data
d) Model deployment, applying the model to the existing and new data
e) Reviewing and updating the model based on changing business requirements
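On a much smaller scale, the preparation, modeling, and validation steps of such a workflow can be sketched with a scikit-learn Pipeline; the preprocessing and model choices below are illustrative only:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Stand-in for collected and prepared data
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Preparation (scaling) and modeling chained into one reusable pipeline
    pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
    pipe.fit(X_train, y_train)

    # Model validation on held-out sample data
    print(pipe.score(X_test, y_test))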
CHARACTERISTICS OF A DATA SCIENCE PIPELINE
A robust end-to-end data science pipeline can source, collect, manage, analyze,
model, and effectively transform data to discover opportunities and deliver cost-
saving business processes. Modern data science pipelines make extracting
information from the data you collect fast and accessible.
To do this, the best data science pipelines have:
Continuous, extensible data processing
Cloud-enabled elasticity and agility
Independent, isolated data processing resources
Widespread data access and the ability to self-serve
High availability and disaster recovery
These characteristics enable organizations to leverage their data quickly,
accurately, and efficiently to make quicker and better business decisions.
BENEFITS OF A CLOUD PLATFORM FOR
DATA SCIENCE PIPELINES
A modern cloud data platform can satisfy the entire data lifecycle of a data science pipeline, including
machine learning, artificial intelligence, and predictive application development.
A cloud data platform provides:
a) Simplicity, making it unnecessary to manage multiple compute platforms and constantly maintain
integrations
b) Security, with one copy of the data securely stored in the data warehouse environment, user
credentials carefully managed, and all transmissions encrypted
c) Performance, as query results are cached and can be used repeatedly during the machine learning
process, as well as for analytics
d) Workload isolation, with dedicated compute resources for each user and workload
e) Elasticity, with scale-up capacity to accommodate large data processing tasks happening in
seconds
f) Support for structured and semi-structured data, making it easy to load, integrate, and analyze all
types of data inside a unified repository
g) Concurrency, as massive workloads run across shared data at scale
Evaluation Metrics in Machine Learning
Evaluation is always good in any field, right? In the case of machine learning,
it is best practice. In this post, we will cover almost all of the popular and
commonly used metrics in machine learning.
Classification Metrics
In a classification task, our main task is to predict the target variable, which takes
discrete values. To evaluate the performance of such a model, the metrics mentioned
below are used:
• Classification Accuracy
• Logarithmic loss
• Area under Curve
• F1 score
• Precision
• Recall
• Confusion Matrix
Classification Accuracy
Classification accuracy is the ratio of the number of correct predictions to the total
number of input samples. It works well only when there are an equal number of samples
for each class. For example, suppose 90% of the samples in our training set belong to
class A and 10% to class B.
Then a model that simply predicts class A for every sample achieves 90% accuracy on
the training set.
If we test the same model on a test set with 60% of samples from class A and 40% from
class B, the accuracy falls to 60%.
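A quick scikit-learn sketch of this imbalance effect, with invented labels:

    from sklearn.metrics import accuracy_score

    y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # imbalanced: 90% class 0 (class A)
    y_pred = [0] * 10                          # a model that always predicts class 0

    # Accuracy looks high (0.9) only because of the class imbalance
    print(accuracy_score(y_true, y_pred))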
Logarithmic Loss
It is also known as Log loss. Its basic working propaganda is by penalizing the
false (False Positive) classification.
It usually works well with multi-class classification. Working on Log loss, the
classifier should assign a probability for each and every class of all the
samples.
If there are N samples belonging to the M class, then we calculate the Log
loss in this way:
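A sketch of computing Log loss on predicted class probabilities with scikit-learn; the labels and probabilities are invented:

    from sklearn.metrics import log_loss

    # Three samples, three classes; each row holds the predicted probability per class
    y_true = [0, 2, 1]
    y_prob = [
        [0.8, 0.1, 0.1],
        [0.2, 0.2, 0.6],
        [0.1, 0.7, 0.2],
    ]

    # Confident wrong probabilities are penalized heavily; lower values are better
    print(log_loss(y_true, y_prob))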
Area Under Curve (AUC)
It is one of the most widely used metrics and is mainly used for binary classification. The
AUC of a classifier is defined as the probability that the classifier will rank a randomly
chosen positive example higher than a randomly chosen negative example. Before going
into AUC further, let me make you comfortable with a few basic terms.
True Positive Rate
Also termed sensitivity. The True Positive Rate is the portion of positive data points that
are correctly considered as positive, with respect to all data points that are positive.
True Negative Rate
Also termed specificity. The True Negative Rate is the portion of negative data points that are
correctly considered as negative, with respect to all data points that are negative; the False
Positive Rate is the portion of negative data points that are mistakenly considered as positive.
a) False Positive Rate and True Positive Rate both have values in the range [0, 1].
b) So what is AUC then? The ROC curve is obtained by plotting the True Positive Rate against
the False Positive Rate at all different classification thresholds, and AUC is the area under that
curve, which also lies in the range [0, 1].
c) The greater the value of AUC, the better the performance of the model.
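A minimal AUC computation with scikit-learn, using invented labels and scores:

    from sklearn.metrics import roc_auc_score

    y_true  = [0, 0, 1, 1, 0, 1]               # actual binary labels
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted probability of the positive class

    # Probability that a randomly chosen positive is ranked above a randomly chosen negative
    print(roc_auc_score(y_true, y_score))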
F1 Score
It is the harmonic mean of precision and recall. Its range is [0, 1]. This metric tells us
how precise our classifier is (how many of the instances it classifies as positive are
correct) and how robust it is (whether it misses a significant number of instances).
Precision
There is another metric named Precision. Precision is a measure of a model’s
performance that tells you how many of the positive predictions made by the
model are actually correct. It is calculated as the number of true positive
predictions divided by the number of true positive and false positive
predictions.
Recall
Recall is a measure of how many of the actual positive instances the model correctly
identifies. It is calculated as the number of true positive predictions divided by the
number of true positive and false negative predictions.
Lower recall with higher precision can give you high accuracy, but the model then misses a
large number of positive instances. The higher the F1 score, the better the performance.
It can be expressed mathematically in this way:
F1 = 2 · (Precision · Recall) / (Precision + Recall)
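Precision, recall, and the F1 score can all be computed with scikit-learn; the labels below are invented:

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    p = precision_score(y_true, y_pred)   # TP / (TP + FP)
    r = recall_score(y_true, y_pred)      # TP / (TP + FN)
    f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
    print(p, r, f1)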
Confusion Matrix
True Positives: the cases where we predicted Yes and the real output was also Yes.
True Negatives: the cases where we predicted No and the real output was also No.
False Positives: the cases where we predicted Yes but the real output was No.
False Negatives: the cases where we predicted No but the real output was Yes.
The accuracy is calculated from the confusion matrix as the sum of the values on the main
diagonal divided by the total number of predictions, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN).
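A confusion matrix for the same kind of binary predictions, sketched with scikit-learn (invented labels):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    # Rows are actual classes, columns are predicted classes:
    # [[TN, FP],
    #  [FN, TP]]
    cm = confusion_matrix(y_true, y_pred)
    print(cm)

    # Accuracy: values on the main diagonal divided by the total number of predictions
    print(cm.trace() / cm.sum())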
Regression Evaluation Metrics
In a regression task, we are supposed to predict a target variable which takes
continuous values. To evaluate the performance of such a model, the evaluation
metrics mentioned below are used:
Mean Absolute Error
Mean Squared Error
Root Mean Square Error
Root Mean Square Logarithmic Error
R2 – Score
Mean Absolute Error (MAE)
MAE is the average of the absolute differences between the predicted values and the
original values. It tells us how far the predictions are from the actual output on
average, but it does not indicate the direction of the error.
Mean Squared Error (MSE)
It is similar to mean absolute error, but the difference is that it takes the average of the
squares of the differences between the predicted and original values. The main advantage
of this metric is that the gradient is easier to calculate, whereas mean absolute error
requires more complicated tools to compute the gradient. By taking the square of the
errors it pronounces larger errors more than smaller errors, so we can focus more on
larger errors. It can be expressed mathematically in this way:
MSE = (1/N) Σ_i (y_i − ŷ_i)²
Root Mean Square Error (RMSE)
RMSE is a metric obtained by simply taking the square root of the MSE value. Since the
MSE metric is not robust to outliers, the RMSE values are not robust to outliers either;
RMSE gives higher weight to large errors in predictions.
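MAE, MSE, and RMSE can be computed as follows, with made-up predictions (RMSE is taken as the square root of MSE rather than relying on any particular scikit-learn option):

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    y_true = np.array([3.0, 5.0, 2.5, 7.0])
    y_pred = np.array([2.5, 5.0, 4.0, 8.0])

    mae = mean_absolute_error(y_true, y_pred)   # average absolute difference
    mse = mean_squared_error(y_true, y_pred)    # average squared difference
    rmse = np.sqrt(mse)                         # square root of the MSE value
    print(mae, mse, rmse)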
R2 – Score