Data Analytics – UNIT – III

Regression

Regression is a well-known statistical technique for modelling the predictive relationship between several independent variables (IVs) and one dependent variable (DV). The objective is to find the best-fitting curve for the dependent variable in a multidimensional space, with each independent variable being a dimension. The curve could be a straight line, or it could be a nonlinear curve. The quality of the curve's fit to the data can be measured by the coefficient of correlation (r), which is the square root of the proportion of variance explained by the curve.

The key steps for regression are simple:

1. List all the variables available for making the model.


2. Establish a Dependent Variable (DV) of interest.
3. Examine visual (if possible) relationships between variables of interest.
4. Find a way to predict DV using the other variables.

Introduction to Properties of OLS Estimators (BLUE Property Assumptions):


Linear regression models have several applications in real life. In econometrics, the Ordinary Least Squares (OLS) method is widely used to estimate the parameters of a linear regression model. For OLS estimates to be valid, several assumptions are made when running linear regression models:
A1. The linear regression model is "linear in parameters."
A2. There is random sampling of observations.
A3. The conditional mean of the errors should be zero.
A4. There is no multi-collinearity (or perfect collinearity).
A5. Spherical errors: there is homoscedasticity and no auto-correlation.
A6. Optional assumption: the error terms should be normally distributed.

These assumptions are extremely important because violation of any of these


assumptions would make OLS estimates unreliable and incorrect. Specifically, a
violation would result in incorrect signs of OLS estimates, or the variance of OLS
estimates would be unreliable, leading to confidence intervals that are too wide or
too narrow.
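As a quick illustration of checking some of these assumptions in practice, here is a minimal R sketch that fits an OLS model on R's built-in mtcars data (an assumed stand-in for any real dataset) and produces the standard residual diagnostic plots used to judge linearity, homoscedasticity and approximate normality of the errors:

# Fit an OLS model on the built-in mtcars data (illustrative example only)
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)          # coefficient estimates, standard errors, R-squared
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit)             # residuals vs fitted, normal Q-Q, scale-location, leverage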
That being said, it is necessary to investigate why OLS estimators and their assumptions receive so much attention. In this section, the properties of the OLS model are discussed. First, the famous Gauss-Markov theorem is outlined. Thereafter, a detailed description of the properties of the OLS model is given. Finally, the applications of the properties of OLS in econometrics are briefly discussed.

Least Square Estimation:

What is the Least Squares Regression Method?


The least-squares regression method is a technique commonly used in
Regression Analysis. It is a mathematical method used to find the best fit line that
represents the relationship between an independent and dependent variable.
Regression analysis makes use of mathematical methods such as least squares to obtain a definite relationship between the predictor variable(s) and the target variable. The least-squares method is one of the most effective ways to draw the line of best fit. It is based on the idea that the sum of the squares of the errors obtained must be minimized as far as possible, hence the name least squares method.
If we were to plot the best-fit line that depicts the sales of a company over a period of time, it would look something like the figure below:

Notice that the line is as close as possible to all the scattered data points.
This is what an ideal best fit line looks like.
To better understand the whole process, let's see how to calculate the line using Least Squares Regression.
Steps to calculate the Line of Best Fit
To start constructing the line that best depicts the relationship between the variables in the data, we first need to get our basics right. Take a look at the equation below:

y = mx + c

Surely, you've come across this equation before. It is a simple equation that represents a straight line in two-dimensional data, i.e. along the x-axis and y-axis. To better understand this, let's break down the equation:
 y: dependent variable
 m: the slope of the line
 x: independent variable
 c: y-intercept


So the aim is to calculate the values of slope, y-intercept and substitute the
corresponding ‘x’ values in the equation in order to derive the value of the
dependent variable.
Let’s see how this can be done.
As an assumption, let’s consider that there are ‘n’ data points.
Step 1: Calculate the slope 'm' by using the following formula:

m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)

Step 2: Compute the y-intercept (the value of y at the point where the line crosses the y-axis):

c = (Σy − m Σx) / n

Step 3: Substitute the values into the final equation:

y = mx + c

Simple, isn’t it?


Now let’s look at an example and see how you can use the least-squares
regression method to compute the line of best fit.
Least Squares Regression Example
Consider an example. Tom, the owner of a retail shop, recorded the price of different T-shirts versus the number of T-shirts sold at his shop over a period of one week.
He tabulated the data as shown below:

Let us use the concept of least squares regression to find the line of best fit
for the above data.
Step 1: Calculate the slope ‘m’ by using the following formula:

After you substitute the respective values, m = 1.518 approximately.


Step 2: Compute the y-intercept value

After you substitute the respective values, c = 0.305 approximately.


Step 3: Substitute the values in the final equation


Once you substitute the values, the equation becomes:

y = 1.518x + 0.305

Let's construct a graph that represents the y = mx + c line of best fit:

Now Tom can use the above equation to estimate how many T-shirts priced at $8 he can sell at the retail shop:
y = 1.518 × 8 + 0.305 ≈ 12.45 T-shirts
This comes down to 13 T-shirts! That’s how simple it is to make predictions
using Linear Regression.
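The same calculation can be reproduced in R. The price and sales vectors below are assumed values chosen because they reproduce the slope (≈1.518) and intercept (≈0.305) quoted above; the original table is not reproduced here, so treat this as an illustrative sketch rather than the exact source data:

# Hypothetical data consistent with the slope and intercept quoted above
price <- c(2, 3, 5, 7, 9)     # T-shirt price (x)
sold  <- c(4, 5, 7, 10, 15)   # number of T-shirts sold (y)
n  <- length(price)
m  <- (n * sum(price * sold) - sum(price) * sum(sold)) /
      (n * sum(price^2) - sum(price)^2)   # slope, approx. 1.518
c0 <- (sum(sold) - m * sum(price)) / n    # intercept, approx. 0.305
coef(lm(sold ~ price))   # R's built-in linear model gives the same fit
m * 8 + c0               # predicted sales at a price of $8, approx. 12.45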
Now let’s try to understand based on what factors can we confirm that the
above line is the line of best fit.
The least squares regression method works by making the sum of the squared errors as small as possible, hence the name least squares. Basically, the distance between the line of best fit and each data point (the error) must be minimized as much as possible. This is the basic idea behind the least squares regression method.
A few things to keep in mind before implementing the least squares regression method are:
 The data should be free of extreme outliers, because outliers can lead to a biased and misleading line of best fit.
 The line of best fit is the line with the minimum possible sum of squared errors; it can also be found iteratively by adjusting the line until that sum cannot be reduced further.
 Least squares can also be applied to non-linear model forms (non-linear least squares), although the formulas above apply to the straight-line case.
 Technically, the difference between the actual value of 'y' and the predicted value of 'y' is called the residual (it denotes the error).

Least Square Method Definition


The least-squares method is a crucial statistical method that is used to find a regression line, or best-fit line, for a given pattern of data. The method is described by an equation with specific parameters. The method of least squares is widely used in evaluation and regression. In regression analysis, this method

is a standard approach for approximating overdetermined systems, i.e. sets of equations in which there are more equations than unknowns.
The method of least squares defines the solution as the one that minimizes the sum of the squares of the deviations, or errors, in the result of each equation. This sum of squared errors also quantifies the variation in the observed data.
The least-squares method is often applied in data fitting. The best-fit result minimizes the sum of squared errors, or residuals, which are the differences between the observed or experimental values and the corresponding fitted values given by the model.
There are two basic categories of least-squares problems:
 Ordinary or linear least squares
 Nonlinear least squares
These categories depend on whether the residuals are linear or nonlinear in the unknown parameters. Linear problems are often seen in regression analysis in statistics. Non-linear problems, on the other hand, are generally solved by iterative refinement, in which the model is approximated by a linear one at each iteration.
Least Square Method Graph
In linear regression, the line of best fit is a straight line as shown in the following
diagram:

The fit to the given data points is obtained by minimizing the residuals, or offsets, of each point from the line. Vertical offsets are the ones minimized in common practice (including line, polynomial, surface and hyperplane fitting), whereas perpendicular offsets are used in orthogonal (total least squares) regression.

Least Square Method Formula


The least-square method states that the curve that best fits a given set of
observations, is said to be a curve having a minimum sum of the squared residuals
(or deviations or errors) from the given data points. Let us assume that the given
points of data are (x1,y1), (x2,y2), (x3,y3), …, (xn,yn) in which all x’s are independent


variables, while all y's are dependent ones. Also, suppose that f(x) is the fitting curve and d represents the error, or deviation, of each given point from the curve.
Now, we can write:
d1 = y1 − f(x1)
d2 = y2 − f(x2)
d3 = y3 − f(x3)
…..
dn = yn − f(xn)
The least-squares principle states that the best-fitting curve is the one for which the sum of the squares of all the deviations from the given values is a minimum, i.e.:

S = d1² + d2² + … + dn² = Σ (yi − f(xi))² = minimum


Limitations for Least-Square Method
The least-squares method is a very beneficial method of curve fitting. Despite
many benefits, it has a few shortcomings too. One of the main limitations is
discussed here.
In the process of regression analysis, which utilizes the least-squares method for curve fitting, it is implicitly assumed that the errors in the independent variable are negligible or zero. In cases where the independent-variable errors are non-negligible, the model becomes subject to measurement error and the least-squares parameter estimates become biased. Consequently, hypothesis tests and confidence intervals based on those parameter estimates can be misleading because of the errors in the independent variables.

REGRESSION ANALYSIS - MODEL BUILDING:


A regression analysis is typically conducted to obtain a model that may be needed for one of the following reasons:
• to explore whether a hypothesis regarding the relationship between the
response and predictors is true.
• to estimate a known theoretical relationship between the response and
predictors.
The model will then be used for:
• Prediction: the model will be used to predict the response variable from a
chosen set of predictors, and
• Inference: the model will be used to explore the strength of the
relationships between the response and the predictors
Therefore, steps in model building may be summarized as follows:


1. Choosing the predictor variables and response variable on which to collect the
data.
2. Collecting data. You may be using data that already exists (retrospective), or you
may be conducting an experiment during which you will collect data (prospective).
Note that this step is important in determining the researcher’s ability to claim
‘association’ or ‘causality’ based on the regression model.
3. Exploring the data.
• check for data errors and missing values.
• study the bivariate relationships to reveal outliers and influential observations, examine the relationships between variables, and identify possible multicollinearity, so as to suggest possible transformations.
4. Dividing the data into a model-building set and a model-validation set:
• The training set is used to estimate the model.
• The validation set is later used for cross-validation of the selected model.
5. Identify several candidate models:
• Use best subsets regression.
• Use stepwise regression (an R sketch illustrating steps 4–6 follows this list).
6. Evaluate the selected models for violation of the model conditions. The checks below may be performed visually via residual plots as well as with formal statistical tests.
• Check the linearity condition.
• Check for normality of the residuals.
• Check for constant variance of the residuals.
• After time-ordering your data (if appropriate), assess the independence of
the observations.
• Overall goodness-of-fit of the model. If the above checks turn out to be unsatisfactory, then modifications to the model may be needed (such as a different functional form). Regardless, checking the assumptions of your model, as well as the model's overall adequacy, is usually accomplished through residual diagnostic procedures.
7. Select the final model:
• Compare the competing models by cross-validating them against the
validation data. Remember, there is not necessarily only one good model for
a given set of data. There may be a few equally satisfactory models.
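As an illustration of steps 4–6, here is a minimal R sketch using the built-in mtcars data as an assumed stand-in for a real dataset: it splits the data into model-building and validation sets, uses stepwise regression to obtain a candidate model, checks residuals, and cross-validates against the held-out data.

set.seed(1)
n <- nrow(mtcars)
train_rows <- sample(1:n, size = round(0.7 * n))      # model-building set
train <- mtcars[train_rows, ]
valid <- mtcars[-train_rows, ]                        # model-validation set
full  <- lm(mpg ~ ., data = train)                    # model with all predictors
cand  <- step(full, direction = "both", trace = 0)    # stepwise candidate model
summary(cand)
par(mfrow = c(2, 2)); plot(cand)                      # residual diagnostics
sqrt(mean((valid$mpg - predict(cand, newdata = valid))^2))  # validation RMSE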

Logistic Regression

Regression models traditionally work with continuous numeric value data for
dependent and independent variables. Logistic regression models can, however,
work with dependent variables with binary values, such as whether a loan is
approved (yes or no). Logistic regression measures the relationship between a
categorical dependent variable and one or more independent variables. For
example, Logistic regression might be used to predict whether a patient has a given


disease (e.g. diabetes), based on observed characteristics of the patient (age,


gender, body mass index, results of blood tests, etc.).

Logistic regression models use probability scores as the predicted values of


the dependent variable. Logistic regression takes the natural logarithm of the odds
of the dependent variable being a case (referred to as the logit) to create a
continuous criterion as a transformed version of the dependent variable. Thus the
logit transformation is used in logistic regression as the dependent variable. The
net effect is that although the dependent variable in logistic regression is binomial
(or categorical, i.e. has only two possible values), the logit is the continuous
function upon which linear regression is conducted.

What are the types of Logistic Regression techniques?


Logistic Regression isn't just limited to solving binary classification
problems. To solve problems that have multiple classes, we can use extensions of
Logistic Regression, which includes Multinomial Logistic Regression and Ordinal
Logistic Regression. Let's get their basic idea:
1. Multinomial Logistic Regression: Let's say our target variable has K = 4 classes. This technique handles the multi-class problem by fitting K−1 independent binary logistic classifier models. To do this, it chooses one target class as the reference class and fits K−1 regression models that compare each of the remaining classes to the reference class.
Due to its restrictive nature, it isn't used widely because it does not scale
very well in the presence of a large number of target classes. In addition, since it
builds K - 1 models, we would require a much larger data set to achieve reasonable
accuracy.
2. Ordinal Logistic Regression: This technique is used when the target variable is
ordinal in nature. Let's say we want to predict years of work experience (1, 2, 3, 4, 5, etc.). So, there exists an order in the values, i.e., 5 > 4 > 3 > 2 > 1. Unlike the multinomial model, where we train K−1 models, Ordinal Logistic Regression builds a single model with multiple threshold values.
If we have K classes, the model will require K−1 thresholds or cutoff points. Also, it makes the important assumption of proportional odds: on the logit (S-shaped) scale, all of the thresholds lie on a straight line.
Note: Logistic Regression is not a great choice to solve multi-class problems. But,
it's good to be aware of its types. In this tutorial we'll focus on Logistic Regression
for binary classification task.
How does Logistic Regression work?
Logistic Regression assumes that the dependent (or response) variable
follows a binomial distribution. Now, you may wonder, what is binomial
distribution? Binomial distribution can be identified by the following
characteristics:


1. There must be a fixed number of trials denoted by n, i.e. in the data set,
there must be a fixed number of rows.
2. Each trial can have only two outcomes; i.e., the response variable can have
only two unique categories.
3. The outcome of each trial must be independent of each other; i.e., the
unique levels of the response variable must be independent of each other.
4. The probability of success (p) and failure (q) should be the same for each
trial.
Let's understand how Logistic Regression works. For Linear Regression,
where the output is a linear combination of input feature(s), we write the equation
as:
Y = βo + β1X + ε
In Logistic Regression, we use the same equation but with some
modifications made to Y. Let's reiterate a fact about Logistic Regression: we
calculate probabilities. And, probabilities always lie between 0 and 1. In other
words, we can say:
1. The response value must be positive.
2. It should be lower than 1.
First, we'll meet the above two criteria. We know that the exponential of any value is always a positive number, and any number divided by (that number + 1) will always be lower than 1. Combining these two findings gives

p(X) = e^(βo + β1X) / (1 + e^(βo + β1X))

This is the logistic function.


Now we are convinced that the probability value will always lie between 0 and 1. To determine the link function, follow the algebraic calculation carefully. P(Y=1|X) can be read as "the probability that Y = 1 given some value for X." Y can take only two values, 1 or 0. For ease of calculation, let's rewrite P(Y=1|X) as p(X). Solving the logistic function above for the linear part gives

log( p(X) / (1 − p(X)) ) = βo + β1X

As you might recognize, the right side of the equation above depicts the linear combination of independent variables. The left side is known as


the log-odds, or logit, and is the link function for Logistic Regression. This link function follows a sigmoid (S-shaped) curve, which limits the range of probabilities to between 0 and 1.

In Multiple Regression, we use the Ordinary Least Square (OLS) method to


determine the best coefficients to attain good model fit. In Logistic Regression, we
use maximum likelihood method to determine the best coefficients and
eventually a good model fit.
Maximum likelihood works like this: it tries to find values of the coefficients (βo, β1) such that the predicted probabilities are as close to the observed outcomes as possible. In other words, for a binary classification (1/0), maximum likelihood will try to find values of βo and β1 such that the resultant probabilities are closest to either 1 or 0. The likelihood function is written as

L(βo, β1) = Π p(xi)^yi × (1 − p(xi))^(1 − yi)

where the product runs over all observations i, with yi the observed outcome and p(xi) the predicted probability.
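A minimal R sketch of these two building blocks, the logistic (sigmoid) function and the log of the likelihood written above, evaluated for assumed coefficient values and hypothetical data:

sigmoid <- function(z) 1 / (1 + exp(-z))     # maps any real value into (0, 1)
# log-likelihood of binary outcomes y (0/1) given predictor x and coefficients b0, b1
loglik <- function(b0, b1, x, y) {
  p <- sigmoid(b0 + b1 * x)                  # predicted probabilities p(X)
  sum(y * log(p) + (1 - y) * log(1 - p))     # log of the likelihood product
}
x <- c(1, 2, 3, 4, 5); y <- c(0, 0, 1, 1, 1) # hypothetical data
loglik(-3, 1, x, y)   # maximum likelihood searches for the b0, b1 that maximize this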

How can you evaluate Logistic Regression model fit and accuracy?
In Linear Regression, we check adjusted R², F Statistics, MAE, and RMSE to
evaluate model fit and accuracy. But, Logistic Regression employs all different sets
of metrics. Here, we deal with probabilities and categorical values. Following are
the evaluation metrics used for Logistic Regression:
1. Akaike Information Criteria (AIC)
You can look at AIC as the counterpart of adjusted R² in multiple regression. It's an important indicator of model fit, and it follows the rule: the smaller, the better. AIC penalizes an increasing number of coefficients in the model; in other words, unlike R², adding more variables does not automatically improve (lower) AIC. This helps to avoid overfitting.
Looking at the AIC metric of one model wouldn't really help. It is more useful
in comparing models (model selection). So, build 2 or 3 Logistic Regression models
and compare their AIC. The model with the lowest AIC will be relatively better.
2. Null Deviance and Residual Deviance
Deviance of an observation is computed as -2 times log likelihood of that
observation. The importance of deviance can be further understood using its types:


Null and Residual Deviance. Null deviance is calculated from the model with no features, i.e., only the intercept. The null model predicts the class via a constant probability.
Residual deviance is calculated from the model having all the features. In comparison with Linear Regression, think of residual deviance as the residual sum of squares (RSS) and null deviance as the total sum of squares (TSS). The larger the difference between the null and residual deviance, the better the model.
You can also use these metrics to compare multiple models: the lower the residual deviance relative to the null deviance, the better the model explains the data; in general, the lower the residual deviance, the better the model. Practically, AIC is usually given preference over deviance to evaluate model fit.
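As a brief illustration (using R's built-in mtcars data, with the binary variable am as an assumed example response), two logistic models can be compared on AIC and deviance like this:

m1 <- glm(am ~ wt,      family = binomial, data = mtcars)
m2 <- glm(am ~ wt + hp, family = binomial, data = mtcars)
AIC(m1, m2)        # the model with the smaller AIC is relatively better
m1$null.deviance   # deviance of the intercept-only (null) model
m1$deviance        # residual deviance of model m1
m2$deviance        # residual deviance of model m2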
3. Confusion Matrix
The confusion matrix is one of the most important tools used to evaluate classification models. The name can be confusing, but make sure you understand it thoroughly. The skeleton of a confusion matrix looks like this:

                        Predicted: Positive (1)   Predicted: Negative (0)
Actual: Positive (1)    True Positive (TP)        False Negative (FN)
Actual: Negative (0)    False Positive (FP)       True Negative (TN)

As you can see, the confusion matrix avoids "confusion" by laying out the actual and predicted values in a tabular format. In the table above, Positive class = 1
and Negative class = 0. Following are the metrics we can derive from a confusion
matrix:
Accuracy - It determines the overall predicted accuracy of the model. It is
calculated as Accuracy = (True Positives + True Negatives)/(True Positives + True
Negatives + False Positives + False Negatives)
True Positive Rate (TPR) - It indicates how many positive values, out of all the positive values, have been correctly predicted. The formula to calculate the true positive rate is TP/(TP + FN). Also, TPR = 1 − False Negative Rate. It is also known as Sensitivity or Recall.
False Positive Rate (FPR) - It indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula to calculate the false positive rate is FP/(FP + TN). Also, FPR = 1 − True Negative Rate.
True Negative Rate (TNR) - It indicates how many negative values, out of all the negative values, have been correctly predicted. The formula to calculate the true negative rate is TN/(TN + FP). It is also known as Specificity.


False Negative Rate (FNR) - It indicates how many positive values, out of all the positive values, have been incorrectly predicted. The formula to calculate the false negative rate is FN/(FN + TP).
Precision - It indicates how many values, out of all the predicted positive values, are actually positive. It is formulated as TP/(TP + FP).
F Score - The F score is the harmonic mean of precision and recall. It lies between 0 and 1; the higher the value, the better the model. It is formulated as 2 × (precision × recall) / (precision + recall).
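A minimal R sketch computing these metrics from the four cells of a confusion matrix; the TP, FN, FP and TN counts below are hypothetical:

TP <- 50; FN <- 10; FP <- 5; TN <- 35        # hypothetical cell counts
accuracy  <- (TP + TN) / (TP + TN + FP + FN)
tpr       <- TP / (TP + FN)                  # sensitivity / recall
fpr       <- FP / (FP + TN)
tnr       <- TN / (TN + FP)                  # specificity
precision <- TP / (TP + FP)
f_score   <- 2 * (precision * tpr) / (precision + tpr)
c(accuracy = accuracy, TPR = tpr, FPR = fpr, TNR = tnr,
  precision = precision, F = f_score)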
4. Receiver Operator Characteristic (ROC)
ROC determines the accuracy of a classification model at a user defined
threshold value. It determines the model's accuracy using Area Under Curve (AUC).
The area under the curve (AUC), also referred to as index of accuracy (A) or
concordant index, represents the performance of the ROC curve. Higher the area,
better the model. ROC is plotted with the True Positive Rate on the Y axis and the False Positive Rate on the X axis. In this plot, our aim is to push the curve toward the top-left corner and maximize the area under it; the higher the curve, the better the model. The diagonal reference line corresponds to an AUC of 0.5, i.e., a model with no discriminating power.

Pros and Cons of Logistic Regression


Many of the pros and cons of the linear regression model also apply to the logistic regression model. Although logistic regression is widely used for solving various types of problems, it has limitations, and other predictive models can provide better predictive results in some settings.
Pros
 The logistic regression model not only acts as a classification model, but also
gives you probabilities. This is a big advantage over other models where they
can only provide the final classification. Knowing that an instance has a 99%
probability for a class compared to 51% makes a big difference. Logistic
Regression performs well when the dataset is linearly separable.
 Logistic Regression not only gives a measure of how relevant a predictor is (coefficient size), but also its direction of association (positive or negative). Logistic regression is also easy to implement and interpret, and very efficient to train.


Cons
 Logistic regression can suffer from complete separation. If there is a feature
that would perfectly separate the two classes, the logistic regression model
can no longer be trained. This is because the weight for that feature would
not converge, because the optimal weight would be infinite. This is really a
bit unfortunate, because such a feature is really very useful. But you do not
need machine learning if you have a simple rule that separates both classes.
The problem of complete separation can be solved by introducing
penalization of the weights or defining a prior probability distribution of
weights.
 Logistic regression is less prone to overfitting than more flexible models, but it can still overfit in high-dimensional datasets; in that case, regularization techniques should be considered.

Analytics applications to various Business Domains


Let's first understand how logistic regression is used in the business world. Logistic regression has an array of applications. Here are a few applications used in real-world situations.
Marketing: A marketing consultant wants to predict whether the subsidiary of his company will make a profit, a loss or just break even, depending on the characteristics of the subsidiary's operations.
Human Resources: The HR manager of a company wants to predict the absenteeism pattern of his employees based on their individual characteristics.
Finance: A bank wants to predict whether its customers will default, based on their previous transactions and history.

Model Construction (Using R):


R makes it very easy to fit a logistic regression model. The function to be called is glm() and the fitting process is similar to the one used in linear regression. Here, we discuss binary logistic regression with an example, though the procedure for multinomial logistic regression is much the same.

The data which has been used is Bankloan. The dataset has 850 rows and 9
columns. (age, education, employment, address, income, debtinc, creddebt,
othdebt, default). The dependent variable is default (Defaulted and Not Defaulted).
Let’s first load and check the head of data.

bankloan <- read.csv("bankloan.csv")
head(bankloan)
Now, making a subset of the data with the first 700 rows.
mod_bankloan <- bankloan[1:700,]
Setting a random seed (so that the sampling below is reproducible):

set.seed(500)
Let's take a sample of 500 of the 700 rows. So, creating a vector of 500 training row indices:
train <- sample(1:700, 500, replace=FALSE)
Creating training as well as testing data.
trainingdata <- mod_bankloan[train,]
testingdata  <- mod_bankloan[-train,]
Now, let's fit the model. Be sure to specify the parameter family=binomial in the glm() function.
model1 <- glm(default ~ ., family=binomial(link='logit'), data=trainingdata)
summary(model1)
The summary will also include the significance level of all the variables. If the p-value is less than 0.05 then the variable is significant. We can also remove the insignificant variables to make our model more accurate.

In our model, only age, employ, address and creddebt seem to be significant. So, we build another model with only these variables:
model12 <- glm(default ~ age + employ + address + creddebt,
               family = binomial(link='logit'), data = trainingdata)


Let's now generate predictions from the model on the training data.

pred1 <- predict(model12, newdata=trainingdata, type="response")
Now classifying each customer as Defaulted or Not Defaulted using a probability cutoff of 0.5:
predicted_class <- ifelse(pred1 < 0.5, "Defaulted", "Not Defaulted")
Creating a table to see the result.
table(trainingdata$default, predicted_class)

This is also known as confusion matrix. It is a tabular representation of Actual vs


Predicted values. This helps us to find the accuracy or error of the model and avoid
overfitting.
There are 64 customers who actually defaulted and our model also predicted the
same. However, 72 customers defaulted but model predicted them as Not
Defaulted. Also, 36 customers actually Not Defaulted where the model mentioned
them as defaulted. Let’s now find out the error rate.
err_rate <- 1 - sum(trainingdata$default == predicted_class)/500
err_rate
## 0.344
Which is about 34%.
Going ahead, let's test the model on the testing data.


pred2 <- predict(model12, newdata=testingdata, type="response")
predicted_class2 <- ifelse(pred2 < 0.5, "Defaulted", "Not Defaulted")
table(testingdata$default, predicted_class2)
err_rate <- 1 - sum(testingdata$default == predicted_class2)/200
err_rate
0.31
Here the error rate is 31%.
Now, we can plot this in Receiver Operating Characteristics Curve
(commonly known as ROC curve). In R, it can be done by downloading a package
called ROCR. An output of the plot is given below.
ROC traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a perfect model, as the cutoff is lowered, it should mark more of the actual 1's as positives and fewer of the actual 0's as 1's. The area under the curve, known as the index of accuracy, is a
performance metric for the curve. Higher the area under curve, better the
prediction power of the model.
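A minimal sketch of how this is typically done with the ROCR package, assuming pred2 holds the predicted probabilities on the test set and testingdata$default holds the corresponding actual classes (as in the example above):

# install.packages("ROCR")   # if the package is not already installed
library(ROCR)
rocr_pred <- prediction(pred2, testingdata$default)            # probabilities vs actual labels
rocr_perf <- performance(rocr_pred, measure = "tpr", x.measure = "fpr")
plot(rocr_perf, colorize = TRUE)   # ROC curve: TPR against FPR
abline(a = 0, b = 1, lty = 2)      # reference line for a model with no skill
performance(rocr_pred, measure = "auc")@y.values[[1]]          # area under the curve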

Regression Modeling
Regression modeling or analysis is a statistical process for estimating the
relationships among variables. It includes many techniques for modeling and
analyzing several variables, when the focus is on the relationship between a
dependent variable and one or more independent variables (or 'predictors').
Understand influence of changes in dependent variable:
More specifically, regression analysis helps one understand how the typical
value of the dependent variable (or 'criterion variable') changes when any one of the
independent variables is varied, while the other independent variables are held
fixed. Most commonly, regression analysis estimates the conditional expectation of
the dependent variable given the independent variables, i.e., the average value of the
dependent variable when the independent variables are fixed. Less commonly, the
focus is on a quantile, or other location parameter of the conditional distribution of
the dependent variable given the independent variables. In all cases, the estimation


target is a function of the independent variables called the regression function. In


regression analysis, it is also of interest to characterize the variation of the
dependent variable around the regression function which can be described by a
probability distribution.
Estimation of continuous response variables:
Regression may refer specifically to the estimation of continuous response
variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable may be more specifically referred to as metric regression to distinguish it from related problems.
Regression analysis uses:
It is widely used for prediction and forecasting, where its use has substantial
overlap with the field of machine learning. Regression analysis is also used to
understand which among the independent variables are related to the dependent
variable, and to explore the forms of these relationships. In restricted
circumstances, regression analysis can be used to infer causal relationships
between the independent and dependent variables. However, this can lead to spurious or illusory relationships, so caution is advisable; for example, correlation
does not imply causation.
Parametric and non-parametric regression:
Familiar methods such as linear regression and ordinary least squares
regression are parametric, in that the regression function is defined in terms of a
finite number of unknown parameters that are estimated from the data.
Nonparametric regression refers to techniques that allow the regression function to
lie in a specified set of functions, which may be infinite-dimensional.
Performance of regression analysis:
The performance of regression analysis methods in practice depends on the
form of the data generating process, and how it relates to the regression approach
being used. Since the true form of the data-generating process is generally not
known, regression analysis often depends to some extent on making assumptions
about this process. These assumptions are sometimes testable if a sufficient
quantity of data is available. Regression models for prediction are often useful even
when the assumptions are moderately violated, although they may not perform
optimally.

Data Analytics – UNIT - IV

What is Supervised Learning?


In supervised learning, the computer is taught by example. It learns from past data and applies that learning to present data to predict future events. In this case, both the input data and the desired output data help with the prediction of future events.
For accurate predictions, the input data is labeled or tagged with the right answer.

Supervised Machine Learning Categorisation


It is important to remember that all supervised learning algorithms are
essentially complex algorithms, categorized as either classification or regression
models.
1) Classification Models – Classification models are used for problems where the
output variable can be categorized, such as “Yes” or “No”, or “Pass” or “Fail.”
Classification Models are used to predict the category of the data. Real-life examples
include spam detection, sentiment analysis, scorecard prediction of exams, etc.
2) Regression Models – Regression models are used for problems where the output variable is a real value such as a unique number, dollars, salary, weight or pressure, for example. Regression is most often used to predict numerical values based on previous data observations. Some of the more familiar regression algorithms include linear regression, polynomial regression, and ridge regression (logistic regression, despite its name, is usually treated as a classification technique).

There are some very practical applications of supervised learning algorithms in


real life, including:
 Text categorization
 Face Detection
 Signature recognition
 Customer discovery


 Spam detection
 Weather forecasting
 Predicting housing prices based on the prevailing market price
 Stock price predictions, among others

What is Unsupervised Learning?


Unsupervised learning, on the other hand, is the method that trains machines on data that is neither classified nor labeled. This means no labeled training data is provided and the machine is made to learn by itself. The machine must be able to organize the data without any prior information about it.
The idea is to expose the machines to large volumes of varying data and allow them to learn from that data to provide insights that were previously unknown and to identify hidden patterns. As such, there aren't necessarily defined outcomes from unsupervised learning algorithms. Rather, the algorithm determines what is different or interesting in the given dataset.
The machine needs to be programmed to learn by itself. The computer needs to understand and provide insights from both structured and unstructured data.

Unsupervised Machine Learning Categorization


1) Clustering is one of the most common unsupervised learning methods. The method
of clustering involves organizing unlabelled data into similar groups called clusters.
Thus, a cluster is a collection of similar data items. The primary goal here is to find
similarities in the data points and group similar data points into a cluster.
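A minimal R sketch of the clustering idea, using k-means on the numeric measurements of R's built-in iris data (an assumed example; the known species labels are ignored during clustering and only used afterwards for comparison):

set.seed(42)
features <- iris[, 1:4]              # unlabeled numeric measurements
km <- kmeans(features, centers = 3)  # group the observations into 3 clusters
table(km$cluster, iris$Species)      # compare discovered clusters with the (unused) labels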
2) Anomaly detection is the method of identifying rare items, events or observations
which differ significantly from the majority of the data. We generally look for anomalies
or outliers in data because they are suspicious. Anomaly detection is often utilized in
bank fraud and medical error detection.

Applications of Unsupervised Learning Algorithms


Some practical applications of unsupervised learning algorithms include:
 Fraud detection
 Malware detection
 Identification of human errors during data entry
 Conducting accurate basket analysis, etc.
Decision trees used in data mining are of two main types −


 Classification tree − when the response is a nominal variable, for example if an


email is spam or not.
 Regression tree − when the predicted outcome can be considered a real number
(e.g. the salary of a worker).
Decision trees are a simple method, and as such have some problems. One of these issues is the high variance of the models that decision trees produce. In order to alleviate this problem, ensemble methods of decision trees were developed. There are two groups of ensemble methods currently used extensively −
 Bagging decision trees − Bagging builds multiple decision trees by repeatedly resampling the training data with replacement and letting the trees vote for a consensus prediction. A well-known extension of this idea, which also randomizes the features considered at each split, is the random forest (see the brief sketch after this list).
 Boosting decision trees − Gradient boosting combines weak learners (in this case, decision trees) into a single strong learner in an iterative fashion. It fits a weak tree to the data and iteratively keeps fitting weak learners in order to correct the errors of the previous model.
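A minimal R sketch of a bagging-style ensemble using the randomForest package (assuming the package is installed; R's built-in iris data serves as an example):

# install.packages("randomForest")   # if the package is not already installed
library(randomForest)
set.seed(7)
rf <- randomForest(Species ~ ., data = iris, ntree = 200)  # 200 bootstrapped trees vote on the class
print(rf)          # out-of-bag error estimate and confusion matrix
importance(rf)     # variable importance according to the ensemble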

What is a Decision Tree?


Decision tree is a type of supervised learning algorithm (having a pre-defined
target variable) that is mostly used in classification problems. It works for both
categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator among the input variables.

Example:-
Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, we want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate students who play cricket in their leisure time based on the most significant input variable among the three.
This is where a decision tree helps: it segregates the students based on all values of the three variables and identifies the variable that creates the most homogeneous sets of students (sets which are heterogeneous to each other). In this example, the variable Gender is able to identify the most homogeneous sets compared to the other two variables.
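A minimal R sketch of fitting such a tree with the rpart package. The data frame students and its columns Gender, Class, Height and PlaysCricket are hypothetical names standing in for the 30-student sample described above:

library(rpart)
# 'students' is a hypothetical data frame with one row per student
fit <- rpart(PlaysCricket ~ Gender + Class + Height,
             data = students, method = "class")   # classification tree
print(fit)             # shows the chosen splits (e.g., Gender first)
plot(fit); text(fit)   # simple plot of the fitted tree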


As mentioned above, the decision tree identifies the most significant variable and the value of it that gives the most homogeneous sets of the population. Now the question that arises is: how does it identify the variable and the split? To do this, decision trees use various splitting algorithms (for example, Gini index, information gain or chi-square).
Types of Decision Tree
The type of decision tree is based on the type of target variable we have. It can be of two types:
1. Binary Variable Decision Tree: A decision tree that has a binary target variable is called a Binary Variable Decision Tree. Example: in the above student problem, the target variable was "Student will play cricket or not", i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree that has a continuous target variable is called a Continuous Variable Decision Tree.
Example: Let's say we have a problem of predicting whether a customer will pay his renewal premium with an insurance company (yes/no). Here we know that the customer's income is a significant variable, but the insurance company does not have income details for all customers. Since we know this is an important variable, we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values of a continuous variable.
Terminology related to Decision Trees:
Let’s look at the basic terminology used with Decision trees:
ROOT Node: It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
SPLITTING: The process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf/Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.


Pruning: When we remove sub-nodes of a decision node, the process is called pruning. You can think of it as the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
These are the terms commonly used for decision trees. As every algorithm has advantages and disadvantages, some of these are discussed below for decision trees.
Advantages:
1. Easy to Understand: Decision tree output is very easy to understand even for
people from non-analytical background. It does not require any statistical
knowledge to read and interpret them. Its graphical representation is very
intuitive and users can easily relate their hypothesis.
2. Useful in Data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables / features that have better power to predict the target variable. It can also be used in the data exploration stage; for example, if we are working on a problem where information is available in hundreds of variables, a decision tree will help to identify the most significant ones.
3. Less data cleaning required: It requires less data cleaning compared to some other modeling techniques. It is fairly robust to outliers and missing values.
4. Data type is not a constraint: It can handle both numerical and categorical
variables.
5. Non Parametric Method: Decision tree is considered to be a non-parametric
method. This means that decision trees have no assumptions about the space
distribution and the classifier structure.
Disadvantages:
1. Overfit: Overfitting is one of the most practical difficulties for decision tree models. This problem can be addressed by pruning the tree or by using ensembles such as random forests.


2. Not fit for continuous variables: While working with continuous numerical variables, the decision tree loses information when it discretizes them into different categories.

Decision Tree - Overfitting


Overfitting is a significant practical difficulty for decision tree models and many
other predictive models. Overfitting happens when the learning algorithm
continues to develop hypotheses that reduce training set error at the cost of an
increased test set error. There are several approaches to avoiding overfitting in
building decision trees.

 Pre-pruning, which stops growing the tree earlier, before it perfectly classifies the training set.
 Post-pruning, which allows the tree to perfectly classify the training set and then prunes the tree afterwards.
Practically, the second approach of post-pruning overfit trees is more successful because it is not easy to estimate precisely when to stop growing the tree.

The important step of tree pruning is to define a criterion to be used to determine the correct final tree size, using one of the following methods:
1. Use a distinct dataset from the training set (called validation set), to
evaluate the effect of post-pruning nodes from the tree.
2. Build the tree by using the training set, then apply a statistical test to
estimate whether pruning or expanding a particular node is likely to
produce an improvement beyond the training set.
o Error estimation
o Significance testing (e.g., Chi-square test)
3. Minimum Description Length principle: Use an explicit measure of the complexity for encoding the training set and the decision tree, stopping growth of the tree when this encoding size (size(tree) + size(misclassifications(tree))) is minimized.
The first method is the most common approach. In this approach, the available
data are separated into two sets of examples: a training set, which is used to
build the decision tree, and a validation set, which is used to evaluate the impact
of pruning the tree. The second method is also a common approach. Here, we
explain the error estimation and Chi2 test.
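Neither of these procedures is implemented exactly in base R, but a comparable post-pruning effect can be obtained with the cost-complexity pruning built into the rpart package, which uses cross-validation on the training data. A minimal sketch, using rpart's bundled kyphosis data as an assumed example:

library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)    # cross-validated error for each value of the complexity parameter
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)   # keep only splits that justify their complexity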

Post-pruning using Error estimation


The error estimate for a sub-tree is the weighted sum of the error estimates of all its leaves. A commonly used (pessimistic) error estimate e for a node, given an observed error rate f over the N training instances at the node and a value z from the standard normal distribution for the chosen confidence level, is:

e = ( f + z²/(2N) + z·sqrt( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )

In the following example we set z to 0.69, which corresponds to a confidence level of 75%.

The error rate at the parent node is 0.46 and since the error rate for its children
(0.51) increases with the split, we do not want to keep the children.

Post-pruning using Chi2 test


In Chi2 test we construct the corresponding frequency table and calculate the


Chi2 value and its probability.

              Bronze   Silver   Gold
Bad              4        1       4
Good             2        1       2

Chi2 = 0.21, probability = 0.90, degrees of freedom = 2

If we require the probability to be less than a limit (e.g., 0.05), then, since 0.90 is well above that limit, we decide not to split the node.
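The same chi-square value can be reproduced in R from the frequency table above. A minimal sketch (chisq.test will warn about small expected counts, which is expected with such a small example):

tab <- matrix(c(4, 2, 1, 1, 4, 2), nrow = 2,
              dimnames = list(c("Bad", "Good"), c("Bronze", "Silver", "Gold")))
chisq.test(tab)   # X-squared is approximately 0.21, df = 2, p-value approximately 0.90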

Time Series
A time series is a set of statistics, usually collected at regular intervals. Time
series data occur naturally in many application areas.
 economics - e.g., monthly data for unemployment, hospital admissions, etc.
 finance - e.g., daily exchange rate, a share price, etc.
 environmental - e.g., daily rainfall, air quality readings.
 medicine - e.g., EEG brain wave activity sampled every 2−8 seconds.
The methods of time series analysis pre-date those for general stochastic processes and Markov chains. The aims of time series analysis are to describe and summarize time series data, fit low-dimensional models, and make forecasts.
Components of Time Series

 Long term trend – The smooth long term direction of time series
where the data can increase or decrease in some pattern.
 Seasonal variation – Patterns of change in a time series within a
year which tends to repeat every year.
 Cyclical variation – It is much like seasonal variation, but the rise and fall of the time series occur over periods longer than one year.
 Irregular variation – Any variation that is not explainable by any of the three above-mentioned components. It can be classified into stationary and non-stationary variation.
 When the data neither increases nor decreases, i.e. it is completely random, the variation is called stationary.
 When the data has some explainable portion remaining that can be analyzed further, the variation is called non-stationary.


ARIMA & ARMA:

In time series analysis, an autoregressive integrated moving average


(ARIMA) model is a generalization of an autoregressive moving average
(ARMA) model. These models are fitted to time series data either to better
understand the data or to predict future points in the series (forecasting).
They are applied in some cases where the data show evidence of non-stationarity, where an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity.

Non-seasonal ARIMA models are generally denoted ARIMA(p, d, q)


where parameters p, d, and q are non-negative integers, p is the order of the
Autoregressive model, d is the degree of differencing, and q is the order of
the Moving-average model. Seasonal ARIMA models are usually denoted
ARIMA(p, d, q)(P, D, Q)_m, where m refers to the number of periods in each
season, and the uppercase P, D, Q refer to the autoregressive, differencing,
and moving average terms for the seasonal part of the ARIMA model. ARIMA
models form an important part of the Box-Jenkins approach to time-series
modeling.
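A minimal R sketch of fitting a seasonal ARIMA(p, d, q)(P, D, Q)m model with base R's arima() function, using the built-in AirPassengers series as an assumed example; the chosen orders are illustrative rather than the result of a full Box-Jenkins identification:

fit <- arima(AirPassengers,
             order    = c(1, 1, 1),                              # non-seasonal (p, d, q)
             seasonal = list(order = c(0, 1, 1), period = 12))   # seasonal (P, D, Q) with m = 12
fit                                # estimated coefficients and standard errors
predict(fit, n.ahead = 12)$pred    # forecast of the next 12 months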
Univariate stationary processes (ARMA)

A covariance stationary process is an ARMA(p, q) process of autoregressive order p and moving average order q if it can be written as

x_t = δ + φ_1 x_(t−1) + … + φ_p x_(t−p) + ε_t + θ_1 ε_(t−1) + … + θ_q ε_(t−q)

where ε_t is a white-noise error term.

For this process to be stationary, the number of moving average coefficients q must be finite, and the roots of the characteristic equation of the AR(p) part, 1 − φ_1 z − … − φ_p z^p = 0, must all lie outside the unit circle (equivalently, the inverse roots must all lie inside the unit circle).

Measure of Forecast Accuracy:

Forecast accuracy can be defined as the deviation of the forecast or prediction from the actual results:

Error: e_t = A_t − F_t

where A_t is the actual demand and F_t is the forecast for period t.

We measure forecast accuracy by 2 methods:

1. Mean Forecast Error (MFE):
For n time periods where we have actual demand and forecast values:

MFE = (1/n) Σ e_t = (1/n) Σ (A_t − F_t)

Ideal value = 0; if MFE > 0, the model tends to under-forecast; if MFE < 0, the model tends to over-forecast.

2. Mean Absolute Deviation (MAD):
For n time periods where we have actual demand and forecast values:

MAD = (1/n) Σ |e_t| = (1/n) Σ |A_t − F_t|

While MFE is a measure of forecast model bias, MAD indicates the absolute size of the errors.
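A minimal R sketch computing both measures for an assumed pair of actual-demand and forecast vectors:

actual <- c(100, 110, 120, 115, 130)   # hypothetical actual demand A_t
fcast  <- c( 98, 112, 118, 120, 125)   # hypothetical forecasts F_t
e   <- actual - fcast                  # forecast errors e_t
MFE <- mean(e)                         # bias: > 0 under-forecast, < 0 over-forecast
MAD <- mean(abs(e))                    # average absolute size of the errors
c(MFE = MFE, MAD = MAD)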
Uses of Forecast error:
 Forecast model bias
 Absolute size of the forecast errors
 Compare alternative forecasting models
 Identify forecast models that need adjustment


STL Model

A time series can be divided into 3 components: the trend, the seasonality and
the error or residuals of the model.
The STL model is a deterministic decomposition that allows the components to be calculated separately using different methods. It estimates the behavior of the trend using a LOESS regression and, in turn, calculates the seasonal component by selecting one of several models, usually one of two: a seasonal ARIMA model or an ETS model. The main difference between the STL model and the others is that, by estimating the trend with LOESS, it is extremely flexible with respect to changes in the trend of the series, unlike a linear trend, which assumes that the series maintains the same constant slope.

Trend:
As mentioned previously, the way to calculate the trend using the STL model is
to calculate it from a LOESS regression. LOESS combines the simplicity of linear
least squares regression with the flexibility of non-linear regression by fitting simple
models on local subsets of data to create a function that describes the deterministic
part of the variation in point-to-point data. In fact, one of the main attractions of this
method is that it is not necessary to specify a global function to fit a model to the
data. In return, a greater calculation power is necessary.
Because it is so computationally intensive, LOESS would have been practically
impossible to use at the time when the least squares regression was developed. Most
of the other modern methods for process modeling are similar to those of LOESS in
this regard. These methods have been consciously designed to use our current
calculation capacity to achieve objectives not easily achieved by traditional methods.
The key parameter for the LOESS regression estimation is the span, which is the degree of smoothing of the series. Higher smoothing values produce smoother functions that move less in response to fluctuations in the data. The smaller the span, the more closely the regression function will follow the data. Using too small a value of the smoothing parameter is not desirable, because the regression function will begin to capture the random error in the data. Useful values of the smoothing parameter are generally in the range of 0.25 to 0.5 for most LOESS applications. As an example of this smoothing difference we will use different values of the span for the same regression, in order to compare the results, using the following code:
#Estimation:
loessMod10  <- loess(Sales ~ Period, data=Train, span=0.10) # 10% smoothing span
loessMod25  <- loess(Sales ~ Period, data=Train, span=0.25) # 25% smoothing span
loessMod50  <- loess(Sales ~ Period, data=Train, span=0.50) # 50% smoothing span
loessMod75  <- loess(Sales ~ Period, data=Train, span=0.75) # 75% smoothing span
loessMod100 <- loess(Sales ~ Period, data=Train, span=1)    # 100% smoothing span

We save the results of the predictions in data frames so that we can plot each prediction together with the actual training data for comparison. To perform the LOESS regression estimation it was necessary to select an explanatory variable and a response variable. Since this is a time series, we use as the explanatory variable the dummy variable we created with the name Period, and the variable to explain is the level of wine sales. These variables were chosen in this order because we seek to find the relationship (or, in this case, the effect) that time has on the level of wine sales.

#Predictions:
smoothed10  <- predict(loessMod10)
smoothed25  <- predict(loessMod25)
smoothed50  <- predict(loessMod50)
smoothed75  <- predict(loessMod75)
smoothed100 <- predict(loessMod100)

plot(Train$Sales, x=Train$Period, type="l", main="Loess Smoothing and Prediction.",
     xlab="Date", ylab="Sales.")
lines(smoothed10, x=Train$Period, col="red")
lines(smoothed25, x=Train$Period, col="green")
lines(smoothed50, x=Train$Period, col="blue")

The graph resulting from the previously written code is:

We can compare the fits obtained with the different span values. Part of the data
scientist's job is to find the value that gives the best estimate among the candidate
models and thereby avoid problems of overfitting or underfitting. In other words, we
look for the span value that minimizes the estimation error for the series. To achieve
this we will use the loess.as function from the fANCOVA package.


The loess.as function selects the optimal smoothing value using one of two
methods: the bias-corrected Akaike information criterion (aicc) or generalized cross-
validation (gcv). The code to calculate the optimal span value is:

LoessOptim <- fANCOVA::loess.as(Train$Period, Train$Sales, user.span = NULL,
                                plot = FALSE)
LoessOptim[["pars"]][["span"]]
## [1] 0.7906048
We obtain that the span value that minimizes the estimation error of the model is
approximately 0.79. In addition, using the checkresiduals() function of the forecast
package we can carry out a brief analysis of the residuals of the estimate, in order
to check whether the residuals are normally distributed, with constant variance and a
mean equal to 0.
forecast::checkresiduals(LoessOptim$residuals)

The analysis of the residuals reveals interesting and useful results for the overall
analysis of the series. First, the series shows a seasonal behavior: the residuals rise
and fall at specific periods of time. This is not surprising, since so far we are
assuming that the series is composed only of the trend component.
Second, the distribution of the residuals departs from normality, since there is a
pronounced peak in the plotted Gaussian bell curve. According to these results, it is
therefore necessary to also estimate the seasonal component.

Trend + Seasonal:
The stlf function calculates the seasonal component by selecting a model suited
to this specific task. The most common options are the ARIMA model and the ETS
model. Both models have an
important property that facilitates the calculation of seasonality once the trend has
been obtained (which was already calculated with LOESS). To make sure the appropriate
model is selected for the behavior of the seasonal component, both the Akaike criterion
and the RMSE of the two models will be compared, and we will select the one that best
suits our purposes.
As a first step, we must define the training series as a time series with a
periodicity of 12 (since we are considering a monthly seasonality that repeats year after
year). For the 12-month forecast of the series we set s.window = 12, because we look for
the behavior of the seasonal component with a periodicity of 12 months. In turn, since
the optimal span value for the trend estimation has already been calculated, it is passed
to the model through the t.window argument. We will start by making the forecast with the
ETS model.

Ts <- ts(Train$Sales, freq = 12)
ForecastEts <- forecast::stlf(Ts, h = 12, s.window = 12, t.window = 0.7906048,
                              method = c("ets"))
ForecastEts[["model"]][["aic"]]
We can check that the Akaike information criterion reported for this model is
about 3333. Now we repeat the same process, but switching to an ARIMA model, with
the following code:
ForecastArima <- forecast::stlf(Ts, h = 12, s.window = 12, t.window = 0.7906048,
                                method = c("arima"))
ForecastArima[["model"]][["aic"]]
## [1] 2944.866

Computing the selection criterion for the proposed ARIMA model, we find that,
according to the Akaike criterion, the ARIMA model is better suited to forecast the
series than the ETS model. Next, we compare the RMSE values to verify whether the
ARIMA model also has greater predictive power than the ETS model. For this we will use
the following code:

ForecastArima <- as.numeric(ForecastArima$mean)
TestSales <- as.numeric(Test$Sales)
A <- data.frame(forecast::accuracy(ForecastArima, TestSales))

ForecastEts <- as.numeric(ForecastEts$mean)
B <- data.frame(forecast::accuracy(ForecastEts, TestSales))

##                ME     RMSE      MAE       MPE     MAPE
## ARIMA   -1156.386 2783.632 2035.860 -6.477117 9.782419
## ETS      -923.317 2633.770 1973.445 -5.430143 9.304786
When calculating the accuracy criteria for both forecasts on the test data, we find
that, unlike the Akaike information criterion, the best results come from the model
whose seasonal component is calculated with ETS, which has a lower RMSE than the
ARIMA model. We therefore proceed to forecast the series using the ETS model for the
seasonal component.
The comparison between the predictions of the series and the actual data of the
training set is made using the following code:

ForecastEts <- forecast::stlf(Ts, h = 12, s.window = 12, t.window = 0.7906048,
                              method = c("ets"))
plot(Train$Sales, x=Train$Period, type="l",
     main="Comparison between training data and prediction.",
     xlab="Date", ylab="Sales.")
lines(ForecastEts$fitted, x=Train$Period, col="red")

Residual Analysis for the final model.

As mentioned previously, residual analysis is of the utmost importance because it
shows us what kind of behavior of the series still needs to be modeled. In the previous
analysis, for example, it was discovered that there was a seasonal pattern that the
LOESS-only estimate could not capture. Using the stlf function, on the other hand, we
were able to estimate both the trend and the seasonal component. We now repeat the
analysis of the residuals for the new model, in order to check whether their distribution
is approximately normal, with constant variance and a mean equal to 0.


To perform the residual analysis we use, in the same way, the following code:

forecast::checkresiduals(ForecastEts$residuals)

We can now observe that the residuals behave much more like a normal distribution
than in the previously proposed model, since the plot no longer shows values far above
the Gaussian curve. In addition, there are no significant autocorrelation problems among
the residuals. As a further check, we compute a Q-Q plot, which helps us assess the
normality of the residuals. A quantile-quantile (Q-Q) chart shows how close the
distribution of a data set is to some ideal distribution, or compares the distributions
of two data sets. When the comparison is against the Gaussian distribution, it is called
a normal probability plot: the data are sorted and the i-th value is plotted against the
corresponding Gaussian quantile. The code to produce this graph is the following:

qqnorm(ForecastEts$residuals, pch = 1, frame = FALSE);

qqline(ForecastEts$residuals, col = "steelblue", lwd = 2)


As we can see in the Q-Q plot, the residual series behaves approximately normally.
The objective of time series analysis is to decompose the observed series into two
parts: the part that depends on the past and the unpredictable part. The combination of
the LOESS trend with the ETS model captured the systematic behavior of the series
effectively, so that the only thing that "remains" of the series is white noise, that is,
random variations that cannot be predicted.


Data Analytics – UNIT – V

Data Visualization

Data visualization aims to communicate data clearly and effectively through
graphical representation. Data visualization has been used extensively in many
applications—for example, at work for reporting, managing business operations, and
tracking progress of tasks. More popularly, we can take advantage of visualization
techniques to discover data relationships that are otherwise not easily observable by
looking at the raw data. Nowadays, people also use data visualization to create fun
and interesting graphics.

We briefly introduce the basic concepts of data visualization. We start with
multidimensional data such as those stored in relational databases. We discuss
several representative approaches, including pixel-oriented techniques, geometric
projection techniques, icon-based techniques, and hierarchical and graph-based
techniques. We then discuss the visualization of complex data and relations.

1. Pixel-Oriented Visualization Techniques


A simple way to visualize the value of a dimension is to use a pixel where the
color of the pixel reflects the dimension’s value. For a data set of m dimensions,
pixel-oriented techniques create m windows on the screen, one for each dimension.
The m dimension values of a record are mapped to m pixels at the corresponding
positions in the windows. The colors of the pixels reflect the corresponding values.
Inside a window, the data values are arranged in some global order shared by
all windows. The global order may be obtained by sorting all data records in a way
that’s meaningful for the task at hand.

Example 2.16: Pixel-oriented visualization. All Electronics maintains a customer
information table, which consists of four dimensions: income, credit limit, transaction
volume, and age. Can we analyze the correlation between income and the other
attributes by visualization?
We can sort all customers in income-ascending order, and use this order to
lay out the customer data in the four visualization windows, as shown in Figure
2.10. The pixel colors are chosen so that the smaller the value, the lighter the
shading. Using pixel-based visualization, we can easily observe the following: credit
limit increases as income increases; customers whose income is in the middle range
are more likely to purchase more from All Electronics; there is no clear correlation
between income and age.
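As a rough idea of how such a display can be produced in R, the following sketch
(with simulated customer data, not the AllElectronics table) folds each attribute, sorted
in income-ascending order, into its own shaded window using image():

# Toy pixel-oriented display with simulated data (illustration only).
set.seed(1)
n <- 400
income <- sort(runif(n, 20000, 120000))                # global order: income ascending
credit <- income * runif(n, 0.8, 1.2)                  # roughly increases with income
volume <- runif(n) * exp(-((income - mean(income)) / sd(income))^2)  # peaks in the middle
age    <- sample(20:70, n, replace = TRUE)             # unrelated to income
toWindow <- function(x) matrix(x, ncol = 20)           # fold the ordered vector into a 20 x 20 window
op <- par(mfrow = c(1, 4), mar = c(1, 1, 2, 1))
for (nm in c("income", "credit", "volume", "age")) {
  image(toWindow(get(nm)), col = gray.colors(64, start = 0.95, end = 0.05),
        axes = FALSE, main = nm)                       # lighter pixels = smaller values
}
par(op)

Because every window shares the same income-ascending order, correlated attributes show
the same light-to-dark gradient.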

In pixel-oriented techniques, data records can also be ordered in a query-dependent
way. For example, given a point query, we can sort all records in
descending order of similarity to the point query.
Filling a window by laying out the data records in a linear way may not work
well for a wide window. The first pixel in a row is far away from the last pixel in the
previous row, though they are next to each other in the global order. Moreover, a
pixel is next to the one above it in the window, even though the two are not next to
each other in the global order. To solve this problem, we can lay out the data records
in a space-filling curve to fill the windows. A space-filling curve is a curve with a
range that covers the entire
n-dimensional unit hypercube. Since the visualization windows are 2-D, we can use
any 2-D space-filling curve. Figure 2.11 shows some frequently used 2-D space-
filling curves. Note that the windows do not have to be rectangular. For example, the
circle segment technique uses windows in the shape of segments of a circle, as
illustrated in Figure 2.12.
This technique can ease the comparison of dimensions because the dimension
windows are located side by side and form a circle.


2. Geometric Projection Visualization Techniques:


A drawback of pixel-oriented visualization techniques is that they cannot
help us much in understanding the distribution of data in a multidimensional
space. For example, they do not show whether there is a dense area in a
multidimensional subspace. Geometric projection techniques help users find interesting
projections of multidimensional data sets. The central challenge the geometric
projection techniques try to address is
how to visualize a high-dimensional space on a 2-D display.
A scatter plot displays 2-D data points using Cartesian coordinates. A third
dimension can be added using different colors or shapes to represent different data
points. Figure 2.13 shows an example, where X and Y are two spatial attributes and
the third dimension is represented by different shapes. Through this visualization,
we can see that points of types “+” and “×” tend to be colocated.
A 3-D scatter plot uses three axes in a Cartesian coordinate system. If it
also uses color, it can display up to 4-D data points (Figure 2.14).
For data sets with more than four dimensions, scatter plots are usually ineffective.
The scatter-plot matrix technique is a useful extension to the scatter plot.
For an n-dimensional data set, a scatter-plot matrix is an n × n grid of 2-D scatter
plots that provides a visualization of each dimension with every other dimension.
Figure 2.15 shows an example, which visualizes the Iris data set. The data set
consists of 50 samples from each of three species of Iris flowers. There are five
dimensions in the data set: length and width of sepal and petal, and species.
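As a small illustration in R (using the built-in iris data rather than the figures
referenced above), a 2-D scatter plot with a third dimension encoded by the plotting
symbol, and a scatter-plot matrix of the four numeric dimensions, can be drawn as follows:

# Scatter plot: two dimensions on the axes, species encoded by the plotting symbol.
plot(iris$Sepal.Length, iris$Sepal.Width, pch = as.numeric(iris$Species),
     xlab = "Sepal length", ylab = "Sepal width",
     main = "Scatter plot with species shown by symbol")
# Scatter-plot matrix: every numeric dimension plotted against every other one.
pairs(iris[, 1:4], col = as.numeric(iris$Species),
      main = "Scatter-plot matrix of the iris dimensions")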
The scatter-plot matrix becomes less effective as the dimensionality increases.
Another popular technique, called parallel coordinates, can handle higher
dimensionality. To visualize n-dimensional data points, the parallel coordinates
technique draws n equally spaced axes, one for each dimension, parallel to one of
the display axes.


A data record is represented by a polygonal line that intersects each axis
at the point corresponding to the associated dimension value (Figure 2.16).
A major limitation of the parallel coordinates technique is that it cannot
effectively show a data set of many records. Even for a data set of several thousand
records, visual clutter and overlap often reduce the readability of the visualization
and make the patterns hard to find.
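A minimal parallel-coordinates sketch, assuming the MASS package that ships with R,
draws one polygonal line per iris flower across four equally spaced axes:

library(MASS)   # provides parcoord()
parcoord(iris[, 1:4], col = as.numeric(iris$Species), lwd = 0.5)

With only 150 records the lines remain readable; with many thousands of records the
clutter described above appears quickly.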

3. Icon-Based Visualization Techniques:


Icon-based visualization techniques use small icons to represent
multidimensional data values. We look at two popular icon-based techniques:
Chernoff faces and stick figures.

Chernoff faces were introduced in 1973 by statistician Herman Chernoff.
They display multidimensional data of up to 18 variables (or dimensions) as a
cartoon human face (Figure 2.17). Chernoff faces help reveal trends in the data.
Components of the face, such as the eyes, ears, mouth, and nose, represent values
of the dimensions by their shape, size, placement, and orientation. For example,
dimensions can be mapped to the following facial characteristics: eye size, eye
spacing, nose length, nose width, mouth curvature, mouth width, mouth openness,
pupil size, eyebrow slant, eye eccentricity, and head eccentricity.

Chernoff faces make use of the ability of the human mind to recognize small
differences in facial characteristics and to assimilate many facial characteristics at
once.
Viewing large tables of data can be tedious. By condensing the data,
Chernoff faces make the data easier for users to digest. In this way, they
facilitate visualization of regularities and irregularities present in the data,

although their power in relating multiple relationships is limited. Another
limitation is that specific data values are not shown. Furthermore, facial features
vary in perceived importance. This means that the similarity of two faces
(representing two multidimensional data points) can vary depending on the order
in which dimensions are assigned to facial characteristics. Therefore, this
mapping should be carefully chosen. Eye size and eyebrow slant have been
found to be important.

Asymmetrical Chernoff faces were proposed as an extension to the original
technique. Since a face has vertical symmetry (along the y-axis), the left and right
side of a face are identical, which wastes space. Asymmetrical Chernoff faces double
the number of facial characteristics, thus allowing up to 36 dimensions to be
displayed.
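A hedged sketch in R, assuming the add-on aplpack package is installed
(install.packages("aplpack")), maps the columns of a small numeric data set to facial
features such as face shape, eye size, and mouth curvature:

library(aplpack)            # provides faces(); assumed to be installed
faces(mtcars[1:12, 1:7])    # one Chernoff face per car, one facial feature per column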


The stick figure visualization technique maps multidimensional data to
five-piece stick figures, where each figure has four limbs and a body. Two
dimensions are mapped to the display (x and y) axes and the remaining
dimensions are mapped to the angle and/or length of the limbs. Figure 2.18 shows
census data, where age and income are mapped to the display axes, and the
remaining dimensions (gender, education, and so on) are mapped to stick figures. If
the data items are relatively dense with respect to the two display dimensions, the
resulting visualization shows texture patterns, reflecting data trends.


4. Hierarchical Visualization Techniques:


The visualization techniques discussed so far focus on visualizing multiple
dimensions simultaneously. However, for a large data set of high dimensionality, it
would be difficult to visualize all dimensions at the same time. Hierarchical
visualization techniques partition all dimensions into subsets (i.e., subspaces). The
subspaces are visualized in a hierarchical manner.
“Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical
visualization method. Suppose we want to visualize a 6-D data set, where the
dimensions are F, X1, . . . , X5. We want to observe how dimension F changes with
respect to the other dimensions. We can first fix the values of dimensions X3, X4, X5 to
some selected values, say, c3, c4, c5. We can then visualize F, X1, X2 using a 3-D plot,
called a world, as shown in Figure 2.19. The position of the origin of the inner world is
located at the point (c3, c4, c5) in the outer world, which is another 3-D plot using
dimensions X3, X4, X5. A user can interactively change, in the outer world, the location of
the origin of the inner world. The user then views the resulting changes of the inner
world. Moreover, a user can vary the dimensions used in the inner world and the outer
world. Given more dimensions, more levels of worlds can be used, which is why the
method is called “worlds-within-worlds.”
As another example of hierarchical visualization methods, tree-maps display
hierarchical data as a set of nested rectangles. For example, Figure 2.20 shows a
tree-map visualizing Google news stories. All news stories are organized into seven
categories, each shown in a large rectangle of a unique color. Within each category
(i.e., each rectangle at the top level), the news stories are further partitioned into
smaller subcategories.
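A hedged tree-map sketch, assuming the add-on treemap package and a toy table of news
categories and story sizes (not the Google news data of the figure), is:

library(treemap)   # provides treemap(); assumed to be installed
news <- data.frame(
  category = c("World", "World", "Sports", "Sports", "Tech", "Tech"),
  story    = c("Elections", "Summit", "Football", "Tennis", "AI", "Chips"),
  articles = c(40, 25, 30, 12, 22, 18)
)
treemap(news, index = c("category", "story"), vSize = "articles",
        title = "Toy tree-map of news stories")   # nested rectangles, area proportional to articles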


Visualizing Complex Data and Relations


In the early days, visualization techniques were mainly for numeric data.
Recently, more and more non-numeric data, such as text and social networks, have
become available. Visualizing and analyzing such data attracts a lot of interest.
There are many new visualization techniques dedicated to these kinds of data.
For example, many people on the Web tag various objects such as pictures, blog
entries, and product reviews. A tag cloud is a visualization of statistics of user-
generated tags. Often, in a tag cloud, tags are listed alphabetically or in a user-
preferred order. The importance of a tag is indicated by font size or color. Figure
2.21 shows a tag cloud for visualizing the popular tags used in a Web site.
Tag clouds are often used in two ways. First, in a tag cloud for a single item,
we can use the size of a tag to represent the number of times that the tag is applied
to this item by different users. Second, when visualizing the tag statistics on
multiple items, we can use the size of a tag to represent the number of items that
the tag has been applied to, that is, the popularity of the tag.
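A hedged sketch of a tag cloud in R, assuming the add-on wordcloud package and a toy
set of tags with usage counts, is:

library(wordcloud)   # provides wordcloud(); assumed to be installed
tags   <- c("data", "visualization", "analytics", "regression", "forecast", "R")
counts <- c(50, 35, 30, 20, 15, 10)
wordcloud(words = tags, freq = counts, min.freq = 1,
          colors = "steelblue")   # font size encodes how often each tag is used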
In addition to complex data, complex relations among data entries also raise
challenges for visualization. For example, Figure 2.22 uses a disease influence graph
to visualize the correlations between diseases. The nodes in the graph are diseases,
and the size of each node is proportional to the prevalence of the corresponding
disease. Two nodes are linked by an edge if the corresponding diseases have a strong
correlation. The width of an edge is proportional to the strength of the correlation
pattern of the two corresponding diseases.
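A hedged sketch of such an influence graph, assuming the add-on igraph package and toy
disease data (hypothetical prevalences and correlation strengths), is:

library(igraph)   # assumed to be installed
edges <- data.frame(from   = c("Diabetes", "Diabetes", "Obesity"),
                    to     = c("Hypertension", "Obesity", "Hypertension"),
                    weight = c(0.8, 0.6, 0.9))            # toy correlation strengths
nodes <- data.frame(name       = c("Diabetes", "Hypertension", "Obesity"),
                    prevalence = c(9, 30, 20))            # toy prevalence values
g <- graph_from_data_frame(edges, vertices = nodes, directed = FALSE)
plot(g, vertex.size = V(g)$prevalence,                    # node size proportional to prevalence
     edge.width = E(g)$weight * 5,                        # edge width proportional to correlation
     vertex.label.cex = 0.9)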
