Data Analytics – UNIT – III

Regression

Regression is a well-known statistical technique for modelling the predictive relationship between several independent variables (IVs) and one dependent variable (DV). The objective is to find the best-fitting curve for the dependent variable in a multidimensional space, with each independent variable being a dimension. The curve could be a straight line, or it could be a nonlinear curve. The quality of the curve's fit to the data can be measured by the coefficient of correlation (r), which is the square root of the proportion of variance explained by the curve.

The key steps for regression are simple:

1. List all the variables available for making the model.


2. Establish a Dependent Variable (DV) of interest.
3. Examine visual (if possible) relationships between variables of interest.
4. Find a way to predict DV using the other variables.

Introduction to Properties of OLS Estimators (BLUE Property Assumptions):


Linear regression models have several applications in real life. In econometrics, the Ordinary Least Squares (OLS) method is widely used to estimate the parameters of a linear regression model. For OLS estimates to be valid, several assumptions are made when running linear regression models:
A1. The linear regression model is "linear in parameters."
A2. There is random sampling of observations.
A3. The conditional mean of the errors should be zero.
A4. There is no multi-collinearity (or perfect collinearity).
A5. Spherical errors: there is homoscedasticity and no auto-correlation.
A6. Optional assumption: the error terms should be normally distributed.

These assumptions are extremely important because violation of any of these


assumptions would make OLS estimates unreliable and incorrect. Specifically, a
violation would result in incorrect signs of OLS estimates, or the variance of OLS
estimates would be unreliable, leading to confidence intervals that are too wide or
too narrow.
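As a quick illustration of checking some of these assumptions in practice, here is a minimal R sketch that fits an OLS model on R's built-in mtcars data (an assumed stand-in for any real dataset) and produces the standard residual diagnostic plots used to judge linearity, homoscedasticity and approximate normality of the errors:

# Fit an OLS model on the built-in mtcars data (illustrative example only)
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)          # coefficient estimates, standard errors, R-squared
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit)             # residuals vs fitted, normal Q-Q, scale-location, leverage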
That being said, it is necessary to investigate why OLS estimators and their assumptions receive so much attention. In this section, the properties of the OLS model are discussed. First, the famous Gauss-Markov theorem is outlined. Thereafter, a detailed description of the properties of the OLS model is given. Finally, the applications of the properties of OLS in econometrics are briefly discussed.

Least Square Estimation:

What is the Least Squares Regression Method?


The least-squares regression method is a technique commonly used in
Regression Analysis. It is a mathematical method used to find the best fit line that
represents the relationship between an independent and dependent variable.
Regression analysis makes use of mathematical methods such as least squares to obtain a definite relationship between the predictor variable(s) and the target variable. The least-squares method is one of the most effective ways to draw the line of best fit. It is based on the idea that the sum of the squares of the errors obtained must be minimized as far as possible, hence the name least squares method.
If we were to plot the best-fit line that depicts the sales of a company over a period of time, it would look something like the figure below:

Notice that the line is as close as possible to all the scattered data points.
This is what an ideal best fit line looks like.
To better understand the whole process, let's see how to calculate the line using Least Squares Regression.
Steps to calculate the Line of Best Fit
To start constructing the line that best depicts the relationship between the variables in the data, we first need to get our basics right. Take a look at the equation below:

y = mx + c

Surely, you've come across this equation before. It is a simple equation that represents a straight line in two-dimensional data, i.e. along the x-axis and y-axis. To better understand this, let's break down the equation:
 y: dependent variable
 m: the slope of the line
 x: independent variable
 c: y-intercept


So the aim is to calculate the values of slope, y-intercept and substitute the
corresponding ‘x’ values in the equation in order to derive the value of the
dependent variable.
Let’s see how this can be done.
As an assumption, let’s consider that there are ‘n’ data points.
Step 1: Calculate the slope 'm' by using the following formula:

m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)

Step 2: Compute the y-intercept (the value of y at the point where the line crosses the y-axis):

c = (Σy − m Σx) / n

Step 3: Substitute the values into the final equation:

y = mx + c

Simple, isn’t it?


Now let’s look at an example and see how you can use the least-squares
regression method to compute the line of best fit.
Least Squares Regression Example
Consider an example. Tom, the owner of a retail shop, recorded the price of different T-shirts versus the number of T-shirts sold at his shop over a period of one week.
He tabulated the data as shown below:

Let us use the concept of least squares regression to find the line of best fit
for the above data.
Step 1: Calculate the slope ‘m’ by using the following formula:

After you substitute the respective values, m = 1.518 approximately.


Step 2: Compute the y-intercept value

After you substitute the respective values, c = 0.305 approximately.


Step 3: Substitute the values in the final equation


Once you substitute the values, the equation becomes:

y = 1.518x + 0.305

Let's construct a graph that represents the y = mx + c line of best fit:

Now Tom can use the above equation to estimate how many T-shirts priced at $8 he can sell at the retail shop:
y = 1.518 × 8 + 0.305 ≈ 12.45 T-shirts
This comes down to 13 T-shirts! That’s how simple it is to make predictions
using Linear Regression.
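The same calculation can be reproduced in R. The price and sales vectors below are assumed values chosen because they reproduce the slope (≈1.518) and intercept (≈0.305) quoted above; the original table is not reproduced here, so treat this as an illustrative sketch rather than the exact source data:

# Hypothetical data consistent with the slope and intercept quoted above
price <- c(2, 3, 5, 7, 9)     # T-shirt price (x)
sold  <- c(4, 5, 7, 10, 15)   # number of T-shirts sold (y)
n  <- length(price)
m  <- (n * sum(price * sold) - sum(price) * sum(sold)) /
      (n * sum(price^2) - sum(price)^2)   # slope, approx. 1.518
c0 <- (sum(sold) - m * sum(price)) / n    # intercept, approx. 0.305
coef(lm(sold ~ price))   # R's built-in linear model gives the same fit
m * 8 + c0               # predicted sales at a price of $8, approx. 12.45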
Now let’s try to understand based on what factors can we confirm that the
above line is the line of best fit.
The least squares regression method works by making the sum of the squared errors as small as possible, hence the name least squares. Basically, the distance between the line of best fit and each data point (the error) must be minimized as much as possible. This is the basic idea behind the least squares regression method.
A few things to keep in mind before implementing the least squares regression method are:
 The data should be free of extreme outliers, because outliers can lead to a biased and misleading line of best fit.
 The line of best fit is the line with the minimum possible sum of squared errors; it can also be found iteratively by adjusting the line until that sum cannot be reduced further.
 Least squares can also be applied to non-linear model forms (non-linear least squares), although the formulas above apply to the straight-line case.
 Technically, the difference between the actual value of 'y' and the predicted value of 'y' is called the residual (it denotes the error).

Least Square Method Definition


The least-squares method is a crucial statistical method that is used to find a regression line, or best-fit line, for a given pattern of data. The method is described by an equation with specific parameters. The method of least squares is widely used in evaluation and regression. In regression analysis, this method

is a standard approach for approximating overdetermined systems, i.e. sets of equations in which there are more equations than unknowns.
The method of least squares defines the solution as the one that minimizes the sum of the squares of the deviations, or errors, in the result of each equation. This sum of squared errors also quantifies the variation in the observed data.
The least-squares method is often applied in data fitting. The best-fit result minimizes the sum of squared errors, or residuals, which are the differences between the observed or experimental values and the corresponding fitted values given by the model.
There are two basic categories of least-squares problems:
 Ordinary or linear least squares
 Nonlinear least squares
These categories depend on whether the residuals are linear or nonlinear in the unknown parameters. Linear problems are often seen in regression analysis in statistics. Non-linear problems, on the other hand, are generally solved by iterative refinement, in which the model is approximated by a linear one at each iteration.
Least Square Method Graph
In linear regression, the line of best fit is a straight line as shown in the following
diagram:

The fit to the given data points is obtained by minimizing the residuals, or offsets, of each point from the line. Vertical offsets are the ones minimized in common practice (including line, polynomial, surface and hyperplane fitting), whereas perpendicular offsets are used in orthogonal (total least squares) regression.

Least Square Method Formula


The least-square method states that the curve that best fits a given set of
observations, is said to be a curve having a minimum sum of the squared residuals
(or deviations or errors) from the given data points. Let us assume that the given
points of data are (x1,y1), (x2,y2), (x3,y3), …, (xn,yn) in which all x’s are independent


variables, while all y's are dependent ones. Also, suppose that f(x) is the fitting curve and d represents the error, or deviation, of each given point from the curve.
Now, we can write:
d1 = y1 − f(x1)
d2 = y2 − f(x2)
d3 = y3 − f(x3)
…..
dn = yn − f(xn)
The least-squares principle states that the best-fitting curve is the one for which the sum of the squares of all the deviations from the given values is a minimum, i.e.:

S = d1² + d2² + … + dn² = Σ (yi − f(xi))² = minimum


Limitations for Least-Square Method
The least-squares method is a very beneficial method of curve fitting. Despite
many benefits, it has a few shortcomings too. One of the main limitations is
discussed here.
In the process of regression analysis, which utilizes the least-squares method for curve fitting, it is implicitly assumed that the errors in the independent variable are negligible or zero. In cases where the independent-variable errors are non-negligible, the model becomes subject to measurement error and the least-squares parameter estimates become biased. Consequently, hypothesis tests and confidence intervals based on those parameter estimates can be misleading because of the errors in the independent variables.

REGRESSION ANALYSIS - MODEL BUILDING:


A regression analysis is typically conducted to obtain a model that may be needed for one of the following reasons:
• to explore whether a hypothesis regarding the relationship between the
response and predictors is true.
• to estimate a known theoretical relationship between the response and
predictors.
The model will then be used for:
• Prediction: the model will be used to predict the response variable from a
chosen set of predictors, and
• Inference: the model will be used to explore the strength of the
relationships between the response and the predictors
Therefore, steps in model building may be summarized as follows:


1. Choosing the predictor variables and response variable on which to collect the
data.
2. Collecting data. You may be using data that already exists (retrospective), or you
may be conducting an experiment during which you will collect data (prospective).
Note that this step is important in determining the researcher’s ability to claim
‘association’ or ‘causality’ based on the regression model.
3. Exploring the data.
• check for data errors and missing values.
• study the bivariate relationships to reveal outliers and influential observations, examine the relationships between variables, and identify possible multicollinearity, so as to suggest possible transformations.
4. Dividing the data into a model-building set and a model-validation set:
• The training set is used to estimate the model.
• The validation set is later used for cross-validation of the selected model.
5. Identify several candidate models:
• Use best subsets regression.
• Use stepwise regression (an R sketch illustrating steps 4–6 follows this list).
6. Evaluate the selected models for violation of the model conditions. The checks below may be performed visually via residual plots as well as with formal statistical tests.
• Check the linearity condition.
• Check for normality of the residuals.
• Check for constant variance of the residuals.
• After time-ordering your data (if appropriate), assess the independence of
the observations.
• Overall goodness-of-fit of the model. If the above checks turn out to be unsatisfactory, then modifications to the model may be needed (such as a different functional form). Regardless, checking the assumptions of your model, as well as the model's overall adequacy, is usually accomplished through residual diagnostic procedures.
7. Select the final model:
• Compare the competing models by cross-validating them against the
validation data. Remember, there is not necessarily only one good model for
a given set of data. There may be a few equally satisfactory models.
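As an illustration of steps 4–6, here is a minimal R sketch using the built-in mtcars data as an assumed stand-in for a real dataset: it splits the data into model-building and validation sets, uses stepwise regression to obtain a candidate model, checks residuals, and cross-validates against the held-out data.

set.seed(1)
n <- nrow(mtcars)
train_rows <- sample(1:n, size = round(0.7 * n))      # model-building set
train <- mtcars[train_rows, ]
valid <- mtcars[-train_rows, ]                        # model-validation set
full  <- lm(mpg ~ ., data = train)                    # model with all predictors
cand  <- step(full, direction = "both", trace = 0)    # stepwise candidate model
summary(cand)
par(mfrow = c(2, 2)); plot(cand)                      # residual diagnostics
sqrt(mean((valid$mpg - predict(cand, newdata = valid))^2))  # validation RMSE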

Logistic Regression

Regression models traditionally work with continuous numeric value data for
dependent and independent variables. Logistic regression models can, however,
work with dependent variables with binary values, such as whether a loan is
approved (yes or no). Logistic regression measures the relationship between a
categorical dependent variable and one or more independent variables. For
example, Logistic regression might be used to predict whether a patient has a given


disease (e.g. diabetes), based on observed characteristics of the patient (age,


gender, body mass index, results of blood tests, etc.).

Logistic regression models use probability scores as the predicted values of


the dependent variable. Logistic regression takes the natural logarithm of the odds
of the dependent variable being a case (referred to as the logit) to create a
continuous criterion as a transformed version of the dependent variable. Thus the
logit transformation is used in logistic regression as the dependent variable. The
net effect is that although the dependent variable in logistic regression is binomial
(or categorical, i.e. has only two possible values), the logit is the continuous
function upon which linear regression is conducted.

What are the types of Logistic Regression techniques?


Logistic Regression isn't just limited to solving binary classification
problems. To solve problems that have multiple classes, we can use extensions of
Logistic Regression, which includes Multinomial Logistic Regression and Ordinal
Logistic Regression. Let's get their basic idea:
1. Multinomial Logistic Regression: Let's say our target variable has K = 4 classes. This technique handles the multi-class problem by fitting K−1 independent binary logistic classifier models. To do this, it chooses one target class as the reference class and fits K−1 regression models that compare each of the remaining classes to the reference class.
Due to its restrictive nature, it isn't used widely because it does not scale
very well in the presence of a large number of target classes. In addition, since it
builds K - 1 models, we would require a much larger data set to achieve reasonable
accuracy.
2. Ordinal Logistic Regression: This technique is used when the target variable is
ordinal in nature. Let's say we want to predict years of work experience (1, 2, 3, 4, 5, etc.). So, there exists an order in the values, i.e., 5 > 4 > 3 > 2 > 1. Unlike the multinomial model, where we train K−1 models, Ordinal Logistic Regression builds a single model with multiple threshold values.
If we have K classes, the model will require K−1 thresholds or cutoff points. Also, it makes the important assumption of proportional odds: on the logit (S-shaped) scale, all of the thresholds lie on a straight line.
Note: Logistic Regression is not a great choice to solve multi-class problems. But,
it's good to be aware of its types. In this tutorial we'll focus on Logistic Regression
for binary classification task.
How does Logistic Regression work?
Logistic Regression assumes that the dependent (or response) variable
follows a binomial distribution. Now, you may wonder, what is binomial
distribution? Binomial distribution can be identified by the following
characteristics:


1. There must be a fixed number of trials denoted by n, i.e. in the data set,
there must be a fixed number of rows.
2. Each trial can have only two outcomes; i.e., the response variable can have
only two unique categories.
3. The outcome of each trial must be independent of each other; i.e., the
unique levels of the response variable must be independent of each other.
4. The probability of success (p) and failure (q) should be the same for each
trial.
Let's understand how Logistic Regression works. For Linear Regression,
where the output is a linear combination of input feature(s), we write the equation
as:
Y = βo + β1X + ε
In Logistic Regression, we use the same equation but with some
modifications made to Y. Let's reiterate a fact about Logistic Regression: we
calculate probabilities. And, probabilities always lie between 0 and 1. In other
words, we can say:
1. The response value must be positive.
2. It should be lower than 1.
First, we'll meet the above two criteria. We know that the exponential of any value is always a positive number, and any number divided by (that number + 1) will always be lower than 1. Combining these two findings gives

p(X) = e^(βo + β1X) / (1 + e^(βo + β1X))

This is the logistic function.


Now we are convinced that the probability value will always lie between 0 and 1. To determine the link function, follow the algebraic calculation carefully. P(Y=1|X) can be read as "the probability that Y = 1 given some value for X." Y can take only two values, 1 or 0. For ease of calculation, let's rewrite P(Y=1|X) as p(X). Solving the logistic function above for the linear part gives

log( p(X) / (1 − p(X)) ) = βo + β1X

As you might recognize, the right side of the equation above depicts the linear combination of independent variables. The left side is known as


the log-odds, or logit, and is the link function for Logistic Regression. This link function follows a sigmoid (S-shaped) curve, which limits the range of probabilities to between 0 and 1.

In Multiple Regression, we use the Ordinary Least Square (OLS) method to


determine the best coefficients to attain good model fit. In Logistic Regression, we
use maximum likelihood method to determine the best coefficients and
eventually a good model fit.
Maximum likelihood works like this: it tries to find values of the coefficients (βo, β1) such that the predicted probabilities are as close to the observed outcomes as possible. In other words, for a binary classification (1/0), maximum likelihood will try to find values of βo and β1 such that the resultant probabilities are closest to either 1 or 0. The likelihood function is written as

L(βo, β1) = Π p(xi)^yi × (1 − p(xi))^(1 − yi)

where the product runs over all observations i, with yi the observed outcome and p(xi) the predicted probability.
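A minimal R sketch of these two building blocks, the logistic (sigmoid) function and the log of the likelihood written above, evaluated for assumed coefficient values and hypothetical data:

sigmoid <- function(z) 1 / (1 + exp(-z))     # maps any real value into (0, 1)
# log-likelihood of binary outcomes y (0/1) given predictor x and coefficients b0, b1
loglik <- function(b0, b1, x, y) {
  p <- sigmoid(b0 + b1 * x)                  # predicted probabilities p(X)
  sum(y * log(p) + (1 - y) * log(1 - p))     # log of the likelihood product
}
x <- c(1, 2, 3, 4, 5); y <- c(0, 0, 1, 1, 1) # hypothetical data
loglik(-3, 1, x, y)   # maximum likelihood searches for the b0, b1 that maximize this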

How can you evaluate Logistic Regression model fit and accuracy?
In Linear Regression, we check adjusted R², F Statistics, MAE, and RMSE to
evaluate model fit and accuracy. But, Logistic Regression employs all different sets
of metrics. Here, we deal with probabilities and categorical values. Following are
the evaluation metrics used for Logistic Regression:
1. Akaike Information Criteria (AIC)
You can look at AIC as the counterpart of adjusted R² in multiple regression. It's an important indicator of model fit, and it follows the rule: the smaller, the better. AIC penalizes an increasing number of coefficients in the model; in other words, unlike R², adding more variables does not automatically improve (lower) AIC. This helps to avoid overfitting.
Looking at the AIC metric of one model wouldn't really help. It is more useful
in comparing models (model selection). So, build 2 or 3 Logistic Regression models
and compare their AIC. The model with the lowest AIC will be relatively better.
2. Null Deviance and Residual Deviance
Deviance of an observation is computed as -2 times log likelihood of that
observation. The importance of deviance can be further understood using its types:


Null and Residual Deviance. Null deviance is calculated from the model with no features, i.e., only the intercept. The null model predicts the class via a constant probability.
Residual deviance is calculated from the model having all the features. In comparison with Linear Regression, think of residual deviance as the residual sum of squares (RSS) and null deviance as the total sum of squares (TSS). The larger the difference between the null and residual deviance, the better the model.
You can also use these metrics to compare multiple models: the lower the residual deviance relative to the null deviance, the better the model explains the data; in general, the lower the residual deviance, the better the model. Practically, AIC is usually given preference over deviance to evaluate model fit.
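As a brief illustration (using R's built-in mtcars data, with the binary variable am as an assumed example response), two logistic models can be compared on AIC and deviance like this:

m1 <- glm(am ~ wt,      family = binomial, data = mtcars)
m2 <- glm(am ~ wt + hp, family = binomial, data = mtcars)
AIC(m1, m2)        # the model with the smaller AIC is relatively better
m1$null.deviance   # deviance of the intercept-only (null) model
m1$deviance        # residual deviance of model m1
m2$deviance        # residual deviance of model m2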
3. Confusion Matrix
The confusion matrix is one of the most important tools used to evaluate classification models. The name can be confusing, but make sure you understand it thoroughly. The skeleton of a confusion matrix looks like this:

                        Predicted: Positive (1)   Predicted: Negative (0)
Actual: Positive (1)    True Positive (TP)        False Negative (FN)
Actual: Negative (0)    False Positive (FP)       True Negative (TN)

As you can see, the confusion matrix avoids "confusion" by laying out the actual and predicted values in a tabular format. In the table above, Positive class = 1
and Negative class = 0. Following are the metrics we can derive from a confusion
matrix:
Accuracy - It determines the overall predicted accuracy of the model. It is
calculated as Accuracy = (True Positives + True Negatives)/(True Positives + True
Negatives + False Positives + False Negatives)
True Positive Rate (TPR) - It indicates how many positive values, out of all the positive values, have been correctly predicted. The formula to calculate the true positive rate is TP/(TP + FN). Also, TPR = 1 − False Negative Rate. It is also known as Sensitivity or Recall.
False Positive Rate (FPR) - It indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula to calculate the false positive rate is FP/(FP + TN). Also, FPR = 1 − True Negative Rate.
True Negative Rate (TNR) - It indicates how many negative values, out of all the negative values, have been correctly predicted. The formula to calculate the true negative rate is TN/(TN + FP). It is also known as Specificity.


False Negative Rate (FNR) - It indicates how many positive values, out of all the positive values, have been incorrectly predicted. The formula to calculate the false negative rate is FN/(FN + TP).
Precision - It indicates how many values, out of all the predicted positive values, are actually positive. It is formulated as TP/(TP + FP).
F Score - The F score is the harmonic mean of precision and recall. It lies between 0 and 1; the higher the value, the better the model. It is formulated as 2 × (precision × recall) / (precision + recall).
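A minimal R sketch computing these metrics from the four cells of a confusion matrix; the TP, FN, FP and TN counts below are hypothetical:

TP <- 50; FN <- 10; FP <- 5; TN <- 35        # hypothetical cell counts
accuracy  <- (TP + TN) / (TP + TN + FP + FN)
tpr       <- TP / (TP + FN)                  # sensitivity / recall
fpr       <- FP / (FP + TN)
tnr       <- TN / (TN + FP)                  # specificity
precision <- TP / (TP + FP)
f_score   <- 2 * (precision * tpr) / (precision + tpr)
c(accuracy = accuracy, TPR = tpr, FPR = fpr, TNR = tnr,
  precision = precision, F = f_score)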
4. Receiver Operator Characteristic (ROC)
ROC determines the accuracy of a classification model at a user defined
threshold value. It determines the model's accuracy using Area Under Curve (AUC).
The area under the curve (AUC), also referred to as index of accuracy (A) or
concordant index, represents the performance of the ROC curve. Higher the area,
better the model. ROC is plotted with the True Positive Rate on the Y axis and the False Positive Rate on the X axis. In this plot, our aim is to push the curve toward the top-left corner and maximize the area under it; the higher the curve, the better the model. The diagonal reference line corresponds to an AUC of 0.5, i.e., a model with no discriminating power.

Pros and Cons of Logistic Regression


Many of the pros and cons of the linear regression model also apply to the logistic regression model. Although logistic regression is widely used for solving various types of problems, it has limitations, and other predictive models can provide better predictive results in some settings.
Pros
 The logistic regression model not only acts as a classification model, but also
gives you probabilities. This is a big advantage over other models where they
can only provide the final classification. Knowing that an instance has a 99%
probability for a class compared to 51% makes a big difference. Logistic
Regression performs well when the dataset is linearly separable.
 Logistic Regression not only gives a measure of how relevant a predictor is (coefficient size), but also its direction of association (positive or negative). Logistic regression is also easy to implement and interpret, and very efficient to train.


Cons
 Logistic regression can suffer from complete separation. If there is a feature
that would perfectly separate the two classes, the logistic regression model
can no longer be trained. This is because the weight for that feature would
not converge, because the optimal weight would be infinite. This is really a
bit unfortunate, because such a feature is really very useful. But you do not
need machine learning if you have a simple rule that separates both classes.
The problem of complete separation can be solved by introducing
penalization of the weights or defining a prior probability distribution of
weights.
 Logistic regression is less prone to overfitting than more flexible models, but it can still overfit in high-dimensional datasets; in that case, regularization techniques should be considered.

Analytics applications to various Business Domains


Let's first understand how logistic regression is used in the business world. Logistic regression has an array of applications. Here are a few applications used in real-world situations.
Marketing: A marketing consultant wants to predict whether the subsidiary of his company will make a profit, a loss or just break even, depending on the characteristics of the subsidiary's operations.
Human Resources: The HR manager of a company wants to predict the absenteeism pattern of his employees based on their individual characteristics.
Finance: A bank wants to predict whether its customers will default, based on their previous transactions and history.

Model Construction (Using R):


R makes it very easy to fit a logistic regression model. The function to be called is glm() and the fitting process is similar to the one used in linear regression. Here, we discuss binary logistic regression with an example, though the procedure for multinomial logistic regression is much the same.

The data which has been used is Bankloan. The dataset has 850 rows and 9
columns. (age, education, employment, address, income, debtinc, creddebt,
othdebt, default). The dependent variable is default (Defaulted and Not Defaulted).
Let’s first load and check the head of data.

bankloan <- read.csv("bankloan.csv")
head(bankloan)
Now, making a subset of the data with the first 700 rows.
mod_bankloan <- bankloan[1:700,]
Setting a random seed (so that the sampling below is reproducible):

set.seed(500)
Let's take a sample of 500 of the 700 rows. So, creating a vector of 500 training row indices:
train <- sample(1:700, 500, replace=FALSE)
Creating training as well as testing data.
trainingdata <- mod_bankloan[train,]
testingdata  <- mod_bankloan[-train,]
Now, let's fit the model. Be sure to specify the parameter family=binomial in the glm() function.
model1 <- glm(default ~ ., family=binomial(link='logit'), data=trainingdata)
summary(model1)
The summary will also include the significance level of all the variables. If the p-value is less than 0.05 then the variable is significant. We can also remove the insignificant variables to make our model more accurate.

In our model, only age, employ, address and creddebt seem to be significant. So, we build another model with only these variables:
model12 <- glm(default ~ age + employ + address + creddebt,
               family = binomial(link='logit'), data = trainingdata)


Let's now generate predictions from the model on the training data.

pred1 <- predict(model12, newdata=trainingdata, type="response")
Now classifying each customer as Defaulted or Not Defaulted using a probability cutoff of 0.5:
predicted_class <- ifelse(pred1 < 0.5, "Defaulted", "Not Defaulted")
Creating a table to see the result.
table(trainingdata$default, predicted_class)

This is also known as confusion matrix. It is a tabular representation of Actual vs


Predicted values. This helps us to find the accuracy or error of the model and avoid
overfitting.
There are 64 customers who actually defaulted and our model also predicted the
same. However, 72 customers defaulted but model predicted them as Not
Defaulted. Also, 36 customers actually Not Defaulted where the model mentioned
them as defaulted. Let’s now find out the error rate.
err_rate <- 1 - sum(trainingdata$default == predicted_class)/500
err_rate
## 0.344
Which is about 34%.
Going ahead, let's test the model on the testing data.


pred2 <- predict(model12, newdata=testingdata, type="response")
predicted_class2 <- ifelse(pred2 < 0.5, "Defaulted", "Not Defaulted")
table(testingdata$default, predicted_class2)
err_rate <- 1 - sum(testingdata$default == predicted_class2)/200
err_rate
0.31
Here the error rate is 31%.
Now, we can plot this in Receiver Operating Characteristics Curve
(commonly known as ROC curve). In R, it can be done by downloading a package
called ROCR. An output of the plot is given below.
ROC traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a perfect model, as the cutoff is lowered, it should mark more of the actual 1's as positives and fewer of the actual 0's as 1's. The area under the curve, known as the index of accuracy, is a
performance metric for the curve. Higher the area under curve, better the
prediction power of the model.
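A minimal sketch of how this is typically done with the ROCR package, assuming pred2 holds the predicted probabilities on the test set and testingdata$default holds the corresponding actual classes (as in the example above):

# install.packages("ROCR")   # if the package is not already installed
library(ROCR)
rocr_pred <- prediction(pred2, testingdata$default)            # probabilities vs actual labels
rocr_perf <- performance(rocr_pred, measure = "tpr", x.measure = "fpr")
plot(rocr_perf, colorize = TRUE)   # ROC curve: TPR against FPR
abline(a = 0, b = 1, lty = 2)      # reference line for a model with no skill
performance(rocr_pred, measure = "auc")@y.values[[1]]          # area under the curve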

Regression Modeling
Regression modeling or analysis is a statistical process for estimating the
relationships among variables. It includes many techniques for modeling and
analyzing several variables, when the focus is on the relationship between a
dependent variable and one or more independent variables (or 'predictors').
Understand influence of changes in dependent variable:
More specifically, regression analysis helps one understand how the typical
value of the dependent variable (or 'criterion variable') changes when any one of the
independent variables is varied, while the other independent variables are held
fixed. Most commonly, regression analysis estimates the conditional expectation of
the dependent variable given the independent variables, i.e., the average value of the
dependent variable when the independent variables are fixed. Less commonly, the
focus is on a quantile, or other location parameter of the conditional distribution of
the dependent variable given the independent variables. In all cases, the estimation


target is a function of the independent variables called the regression function. In


regression analysis, it is also of interest to characterize the variation of the
dependent variable around the regression function which can be described by a
probability distribution.
Estimation of continuous response variables:
Regression may refer specifically to the estimation of continuous response
variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable may be more specifically referred to as metric regression to distinguish it from related problems.
Regression analysis uses:
It is widely used for prediction and forecasting, where its use has substantial
overlap with the field of machine learning. Regression analysis is also used to
understand which among the independent variables are related to the dependent
variable, and to explore the forms of these relationships. In restricted
circumstances, regression analysis can be used to infer causal relationships
between the independent and dependent variables. However, this can lead to spurious or illusory relationships, so caution is advisable; for example, correlation
does not imply causation.
Parametric and non-parametric regression:
Familiar methods such as linear regression and ordinary least squares
regression are parametric, in that the regression function is defined in terms of a
finite number of unknown parameters that are estimated from the data.
Nonparametric regression refers to techniques that allow the regression function to
lie in a specified set of functions, which may be infinite-dimensional.
Performance of regression analysis:
The performance of regression analysis methods in practice depends on the
form of the data generating process, and how it relates to the regression approach
being used. Since the true form of the data-generating process is generally not
known, regression analysis often depends to some extent on making assumptions
about this process. These assumptions are sometimes testable if a sufficient
quantity of data is available. Regression models for prediction are often useful even
when the assumptions are moderately violated, although they may not perform
optimally.

Data Analytics – UNIT - IV

What is Supervised Learning?


In supervised learning, the computer is taught by example. It learns from past data and applies that learning to present data to predict future events. In this case, both the input data and the desired output data help with the prediction of future events.
For accurate predictions, the input data is labeled or tagged with the right answer.

Supervised Machine Learning Categorisation


It is important to remember that all supervised learning algorithms are
essentially complex algorithms, categorized as either classification or regression
models.
1) Classification Models – Classification models are used for problems where the
output variable can be categorized, such as “Yes” or “No”, or “Pass” or “Fail.”
Classification Models are used to predict the category of the data. Real-life examples
include spam detection, sentiment analysis, scorecard prediction of exams, etc.
2) Regression Models – Regression models are used for problems where the output variable is a real value such as a unique number, dollars, salary, weight or pressure, for example. Regression is most often used to predict numerical values based on previous data observations. Some of the more familiar regression algorithms include linear regression, polynomial regression, and ridge regression (logistic regression, despite its name, is usually treated as a classification technique).

There are some very practical applications of supervised learning algorithms in


real life, including:
 Text categorization
 Face Detection
 Signature recognition
 Customer discovery


 Spam detection
 Weather forecasting
 Predicting housing prices based on the prevailing market price
 Stock price predictions, among others

What is Unsupervised Learning?


Unsupervised learning, on the other hand, is the method that trains machines on data that is neither classified nor labeled. This means no labeled training data is provided and the machine is made to learn by itself. The machine must be able to organize the data without any prior information about it.
The idea is to expose the machines to large volumes of varying data and allow them to learn from that data to provide insights that were previously unknown and to identify hidden patterns. As such, there aren't necessarily defined outcomes from unsupervised learning algorithms. Rather, the algorithm determines what is different or interesting in the given dataset.
The machine needs to be programmed to learn by itself. The computer needs to understand and provide insights from both structured and unstructured data.

Unsupervised Machine Learning Categorization


1) Clustering is one of the most common unsupervised learning methods. The method
of clustering involves organizing unlabelled data into similar groups called clusters.
Thus, a cluster is a collection of similar data items. The primary goal here is to find
similarities in the data points and group similar data points into a cluster.
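A minimal R sketch of the clustering idea, using k-means on the numeric measurements of R's built-in iris data (an assumed example; the known species labels are ignored during clustering and only used afterwards for comparison):

set.seed(42)
features <- iris[, 1:4]              # unlabeled numeric measurements
km <- kmeans(features, centers = 3)  # group the observations into 3 clusters
table(km$cluster, iris$Species)      # compare discovered clusters with the (unused) labels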
2) Anomaly detection is the method of identifying rare items, events or observations
which differ significantly from the majority of the data. We generally look for anomalies
or outliers in data because they are suspicious. Anomaly detection is often utilized in
bank fraud and medical error detection.

Applications of Unsupervised Learning Algorithms


Some practical applications of unsupervised learning algorithms include:
 Fraud detection
 Malware detection
 Identification of human errors during data entry
 Conducting accurate basket analysis, etc.
Decision trees used in data mining are of two main types −


 Classification tree − when the response is a nominal variable, for example if an


email is spam or not.
 Regression tree − when the predicted outcome can be considered a real number
(e.g. the salary of a worker).
Decision trees are a simple method, and as such have some problems. One of these issues is the high variance of the models that decision trees produce. In order to alleviate this problem, ensemble methods of decision trees were developed. There are two groups of ensemble methods currently used extensively −
 Bagging decision trees − Bagging builds multiple decision trees by repeatedly resampling the training data with replacement and letting the trees vote for a consensus prediction. A well-known extension of this idea, which also randomizes the features considered at each split, is the random forest (see the brief sketch after this list).
 Boosting decision trees − Gradient boosting combines weak learners (in this case, decision trees) into a single strong learner in an iterative fashion. It fits a weak tree to the data and iteratively keeps fitting weak learners in order to correct the errors of the previous model.
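A minimal R sketch of a bagging-style ensemble using the randomForest package (assuming the package is installed; R's built-in iris data serves as an example):

# install.packages("randomForest")   # if the package is not already installed
library(randomForest)
set.seed(7)
rf <- randomForest(Species ~ ., data = iris, ntree = 200)  # 200 bootstrapped trees vote on the class
print(rf)          # out-of-bag error estimate and confusion matrix
importance(rf)     # variable importance according to the ensemble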

What is a Decision Tree?


Decision tree is a type of supervised learning algorithm (having a pre-defined
target variable) that is mostly used in classification problems. It works for both
categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator among the input variables.

Example:-
Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, we want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate students who play cricket in their leisure time based on the most significant input variable among the three.
This is where a decision tree helps: it segregates the students based on all values of the three variables and identifies the variable that creates the most homogeneous sets of students (sets which are heterogeneous to each other). In this example, the variable Gender is able to identify the most homogeneous sets compared to the other two variables.
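A minimal R sketch of fitting such a tree with the rpart package. The data frame students and its columns Gender, Class, Height and PlaysCricket are hypothetical names standing in for the 30-student sample described above:

library(rpart)
# 'students' is a hypothetical data frame with one row per student
fit <- rpart(PlaysCricket ~ Gender + Class + Height,
             data = students, method = "class")   # classification tree
print(fit)             # shows the chosen splits (e.g., Gender first)
plot(fit); text(fit)   # simple plot of the fitted tree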


As mentioned above, the decision tree identifies the most significant variable and the value of it that gives the most homogeneous sets of the population. Now the question that arises is: how does it identify the variable and the split? To do this, decision trees use various splitting algorithms (for example, Gini index, information gain or chi-square).
Types of Decision Tree
The type of decision tree is based on the type of target variable we have. It can be of two types:
1. Binary Variable Decision Tree: A decision tree that has a binary target variable is called a Binary Variable Decision Tree. Example: in the above student problem, the target variable was "Student will play cricket or not", i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree that has a continuous target variable is called a Continuous Variable Decision Tree.
Example: Let's say we have a problem of predicting whether a customer will pay his renewal premium with an insurance company (yes/no). Here we know that the customer's income is a significant variable, but the insurance company does not have income details for all customers. Since we know this is an important variable, we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values of a continuous variable.
Terminology related to Decision Trees:
Let’s look at the basic terminology used with Decision trees:
ROOT Node: It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
SPLITTING: The process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf/Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.


Pruning: When we remove sub-nodes of a decision node, the process is called pruning. You can think of it as the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
These are the terms commonly used for decision trees. As every algorithm has advantages and disadvantages, some of these are discussed below for decision trees.
Advantages:
1. Easy to Understand: Decision tree output is very easy to understand even for
people from non-analytical background. It does not require any statistical
knowledge to read and interpret them. Its graphical representation is very
intuitive and users can easily relate their hypothesis.
2. Useful in Data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables / features that have better power to predict the target variable. It can also be used in the data exploration stage; for example, if we are working on a problem where information is available in hundreds of variables, a decision tree will help to identify the most significant ones.
3. Less data cleaning required: It requires less data cleaning compared to some other modeling techniques. It is fairly robust to outliers and missing values.
4. Data type is not a constraint: It can handle both numerical and categorical
variables.
5. Non Parametric Method: Decision tree is considered to be a non-parametric
method. This means that decision trees have no assumptions about the space
distribution and the classifier structure.
Disadvantages:
1. Overfit: Overfitting is one of the most practical difficulties for decision tree models. This problem can be addressed by pruning the tree or by using ensembles such as random forests.


2. Not fit for continuous variables: While working with continuous numerical variables, the decision tree loses information when it discretizes them into different categories.

Decision Tree - Overfitting


Overfitting is a significant practical difficulty for decision tree models and many
other predictive models. Overfitting happens when the learning algorithm
continues to develop hypotheses that reduce training set error at the cost of an
increased test set error. There are several approaches to avoiding overfitting in
building decision trees.

 Pre-pruning, which stops growing the tree earlier, before it perfectly classifies the training set.
 Post-pruning, which allows the tree to perfectly classify the training set and then prunes the tree afterwards.
Practically, the second approach of post-pruning overfit trees is more successful because it is not easy to estimate precisely when to stop growing the tree.

The important step of tree pruning is to define a criterion to be used to determine the correct final tree size, using one of the following methods:
1. Use a distinct dataset from the training set (called validation set), to
evaluate the effect of post-pruning nodes from the tree.
2. Build the tree by using the training set, then apply a statistical test to
estimate whether pruning or expanding a particular node is likely to
produce an improvement beyond the training set.
o Error estimation
o Significance testing (e.g., Chi-square test)
3. Minimum Description Length principle: Use an explicit measure of the complexity for encoding the training set and the decision tree, stopping growth of the tree when this encoding size (size(tree) + size(misclassifications(tree))) is minimized.
The first method is the most common approach. In this approach, the available
data are separated into two sets of examples: a training set, which is used to
build the decision tree, and a validation set, which is used to evaluate the impact
of pruning the tree. The second method is also a common approach. Here, we
explain the error estimation and Chi2 test.
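Neither of these procedures is implemented exactly in base R, but a comparable post-pruning effect can be obtained with the cost-complexity pruning built into the rpart package, which uses cross-validation on the training data. A minimal sketch, using rpart's bundled kyphosis data as an assumed example:

library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)    # cross-validated error for each value of the complexity parameter
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)   # keep only splits that justify their complexity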

Post-pruning using Error estimation


The error estimate for a sub-tree is the weighted sum of the error estimates of all its leaves. A commonly used (pessimistic) error estimate e for a node, given an observed error rate f over the N training instances at the node and a value z from the standard normal distribution for the chosen confidence level, is:

e = ( f + z²/(2N) + z·sqrt( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )

In the following example we set z to 0.69, which corresponds to a confidence level of 75%.

The error rate at the parent node is 0.46 and since the error rate for its children
(0.51) increases with the split, we do not want to keep the children.

Post-pruning using Chi2 test


In Chi2 test we construct the corresponding frequency table and calculate the


Chi2 value and its probability.

              Bronze   Silver   Gold
Bad              4        1       4
Good             2        1       2

Chi2 = 0.21, probability = 0.90, degrees of freedom = 2

If we require the probability to be less than a limit (e.g., 0.05), then, since 0.90 is well above that limit, we decide not to split the node.
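The same chi-square value can be reproduced in R from the frequency table above. A minimal sketch (chisq.test will warn about small expected counts, which is expected with such a small example):

tab <- matrix(c(4, 2, 1, 1, 4, 2), nrow = 2,
              dimnames = list(c("Bad", "Good"), c("Bronze", "Silver", "Gold")))
chisq.test(tab)   # X-squared is approximately 0.21, df = 2, p-value approximately 0.90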

Time Series
A time series is a set of statistics, usually collected at regular intervals. Time
series data occur naturally in many application areas.
 economics - e.g., monthly data for unemployment, hospital admissions, etc.
 finance - e.g., daily exchange rate, a share price, etc.
 environmental - e.g., daily rainfall, air quality readings.
 medicine - e.g., EEG brain wave activity sampled every 2−8 seconds.
The methods of time series analysis pre-date those for general stochastic processes and Markov chains. The aims of time series analysis are to describe and summarize time series data, fit low-dimensional models, and make forecasts.
Components of Time Series

 Long term trend – The smooth long term direction of time series
where the data can increase or decrease in some pattern.
 Seasonal variation – Patterns of change in a time series within a
year which tends to repeat every year.
 Cyclical variation – It is much like seasonal variation, but the rise and fall of the time series occur over periods longer than one year.
 Irregular variation – Any variation that is not explainable by any of the three above-mentioned components. It can be classified into stationary and non-stationary variation.
 When the data neither increases nor decreases, i.e. it is completely random, the variation is called stationary.
 When the data has some explainable portion remaining that can be analyzed further, the variation is called non-stationary.


ARIMA & ARMA:

In time series analysis, an autoregressive integrated moving average


(ARIMA) model is a generalization of an autoregressive moving average
(ARMA) model. These models are fitted to time series data either to better
understand the data or to predict future points in the series (forecasting).
They are applied in some cases where the data show evidence of non-stationarity, where an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity.

Non-seasonal ARIMA models are generally denoted ARIMA(p, d, q)


where parameters p, d, and q are non-negative integers, p is the order of the
Autoregressive model, d is the degree of differencing, and q is the order of
the Moving-average model. Seasonal ARIMA models are usually denoted
ARIMA(p, d, q)(P, D, Q)_m, where m refers to the number of periods in each
season, and the uppercase P, D, Q refer to the autoregressive, differencing,
and moving average terms for the seasonal part of the ARIMA model. ARIMA
models form an important part of the Box-Jenkins approach to time-series
modeling.
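A minimal R sketch of fitting a seasonal ARIMA(p, d, q)(P, D, Q)m model with base R's arima() function, using the built-in AirPassengers series as an assumed example; the chosen orders are illustrative rather than the result of a full Box-Jenkins identification:

fit <- arima(AirPassengers,
             order    = c(1, 1, 1),                              # non-seasonal (p, d, q)
             seasonal = list(order = c(0, 1, 1), period = 12))   # seasonal (P, D, Q) with m = 12
fit                                # estimated coefficients and standard errors
predict(fit, n.ahead = 12)$pred    # forecast of the next 12 months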
Univariate stationary processes (ARMA)

A covariance stationary process is an ARMA(p, q) process of autoregressive order p and moving average order q if it can be written as

x_t = δ + φ_1 x_(t−1) + … + φ_p x_(t−p) + ε_t + θ_1 ε_(t−1) + … + θ_q ε_(t−q)

where ε_t is a white-noise error term.

For this process to be stationary, the number of moving average coefficients q must be finite, and the roots of the characteristic equation of the AR(p) part, 1 − φ_1 z − … − φ_p z^p = 0, must all lie outside the unit circle (equivalently, the inverse roots must all lie inside the unit circle).

Measure of Forecast Accuracy:

Forecast accuracy can be defined as the deviation of the forecast or prediction from the actual results:

Error: e_t = A_t − F_t

where A_t is the actual demand and F_t is the forecast for period t.

We measure forecast accuracy by 2 methods:

1. Mean Forecast Error (MFE):
For n time periods where we have actual demand and forecast values:

MFE = (1/n) Σ e_t = (1/n) Σ (A_t − F_t)

Ideal value = 0; if MFE > 0, the model tends to under-forecast; if MFE < 0, the model tends to over-forecast.

2. Mean Absolute Deviation (MAD):
For n time periods where we have actual demand and forecast values:

MAD = (1/n) Σ |e_t| = (1/n) Σ |A_t − F_t|

While MFE is a measure of forecast model bias, MAD indicates the absolute size of the errors.
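A minimal R sketch computing both measures for an assumed pair of actual-demand and forecast vectors:

actual <- c(100, 110, 120, 115, 130)   # hypothetical actual demand A_t
fcast  <- c( 98, 112, 118, 120, 125)   # hypothetical forecasts F_t
e   <- actual - fcast                  # forecast errors e_t
MFE <- mean(e)                         # bias: > 0 under-forecast, < 0 over-forecast
MAD <- mean(abs(e))                    # average absolute size of the errors
c(MFE = MFE, MAD = MAD)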
Uses of Forecast error:
 Forecast model bias
 Absolute size of the forecast errors
 Compare alternative forecasting models
 Identify forecast models that need adjustment


STL Model

A time series can be divided into 3 components: the trend, the seasonality and
the error or residuals of the model.
The STL model is a deterministic decomposition that allows the components to be calculated separately using different methods. It estimates the behavior of the trend using a LOESS regression and, in turn, calculates the seasonal component by selecting one of several models, usually one of two: a seasonal ARIMA model or an ETS model. The main difference between the STL model and the others is that, by estimating the trend with LOESS, it is extremely flexible with respect to changes in the trend of the series, unlike a linear trend, which assumes that the series maintains the same constant slope.

Trend:
As mentioned previously, the way to calculate the trend using the STL model is
to calculate it from a LOESS regression. LOESS combines the simplicity of linear
least squares regression with the flexibility of non-linear regression by fitting simple
models on local subsets of data to create a function that describes the deterministic
part of the variation in point-to-point data. In fact, one of the main attractions of this
method is that it is not necessary to specify a global function to fit a model to the
data. In return, a greater calculation power is necessary.
Because it is so computationally intensive, LOESS would have been practically
impossible to use at the time when the least squares regression was developed. Most
of the other modern methods for process modeling are similar to those of LOESS in
this regard. These methods have been consciously designed to use our current
calculation capacity to achieve objectives not easily achieved by traditional methods.
The key parameter for the LOESS regression estimation is the span, which is the degree of smoothing of the series. Higher smoothing values produce smoother functions that move less in response to fluctuations in the data. The smaller the span, the more closely the regression function will follow the data. Using too small a value of the smoothing parameter is not desirable, because the regression function will begin to capture the random error in the data. Useful values of the smoothing parameter are generally in the range of 0.25 to 0.5 for most LOESS applications. As an example of this smoothing difference we will use different values of the span for the same regression, in order to compare the results, using the following code:
#Estimation:
loessMod10  <- loess(Sales ~ Period, data=Train, span=0.10) # 10% smoothing span
loessMod25  <- loess(Sales ~ Period, data=Train, span=0.25) # 25% smoothing span
loessMod50  <- loess(Sales ~ Period, data=Train, span=0.50) # 50% smoothing span
loessMod75  <- loess(Sales ~ Period, data=Train, span=0.75) # 75% smoothing span
loessMod100 <- loess(Sales ~ Period, data=Train, span=1)    # 100% smoothing span

We save the results of the predictions in data frames so that we can plot each prediction together with the actual training data for comparison. To perform the LOESS regression estimation it was necessary to select an explanatory variable and a response variable. Since this is a time series, we use as the explanatory variable the dummy variable we created with the name Period, and the variable to explain is the level of wine sales. These variables were chosen in this order because we seek to find the relationship (or, in this case, the effect) that time has on the level of wine sales.

#Predictions:
smoothed10  <- predict(loessMod10)
smoothed25  <- predict(loessMod25)
smoothed50  <- predict(loessMod50)
smoothed75  <- predict(loessMod75)
smoothed100 <- predict(loessMod100)

plot(Train$Sales, x=Train$Period, type="l", main="Loess Smoothing and Prediction.",
     xlab="Date", ylab="Sales.")
lines(smoothed10, x=Train$Period, col="red")
lines(smoothed25, x=Train$Period, col="green")
lines(smoothed50, x=Train$Period, col="blue")

The graph resulting from the previously written code is:

We can compare the fits obtained with the different span values. Part of the data
scientist's job is to find the value that gives the best estimate among the candidate
models and thereby avoid problems of overfitting or underfitting. In other words, we
look for the span value that minimizes the estimation error for the series. To achieve
this we will use the loess.as function from the fANCOVA package.


The loess.as function selects the optimal smoothing value using one of two
methods: the bias-corrected Akaike information criterion (aicc) or generalized cross-
validation (gcv). The code to calculate the optimal span value is:

LoessOptim <- fANCOVA::loess.as(Train$Period, Train$Sales, user.span = NULL,
                                plot = FALSE)
LoessOptim[["pars"]][["span"]]
## [1] 0.7906048
We obtain that the span value that minimizes the estimation error of the model is
approximately 0.79. In addition, using the checkresiduals() function of the forecast
package we can carry out a brief analysis of the residuals of the estimate, in order
to check whether the residuals are normally distributed, with constant variance and a
mean equal to 0.
forecast::checkresiduals(LoessOptim$residuals)

The analysis of the residuals reveals interesting and useful results for the overall
analysis of the series. First, the series shows a seasonal behavior: the residuals rise
and fall at specific periods of time. This is not surprising, since so far we are
assuming that the series is composed only of the trend component.
Second, the distribution of the residuals departs from normality, since there is a
pronounced peak in the plotted Gaussian bell curve. According to these results, it is
therefore necessary to also estimate the seasonal component.

Trend + Seasonal:
The stlf function calculates the seasonal component by selecting a model suited
to this specific task. The most common options are the ARIMA model and the ETS
model. Both models have an
important property that facilitates the calculation of seasonality once the trend has
been obtained (which was already calculated with LOESS). To make sure the appropriate
model is selected for the behavior of the seasonal component, both the Akaike criterion
and the RMSE of the two models will be compared, and we will select the one that best
suits our purposes.
As a first step, we must define the training series as a time series with a
periodicity of 12 (since we are considering a monthly seasonality that repeats year after
year). For the 12-month forecast of the series we set s.window = 12, because we look for
the behavior of the seasonal component with a periodicity of 12 months. In turn, since
the optimal span value for the trend estimation has already been calculated, it is passed
to the model through the t.window argument. We will start by making the forecast with the
ETS model.

Ts <- ts(Train$Sales, freq = 12)
ForecastEts <- forecast::stlf(Ts, h = 12, s.window = 12, t.window = 0.7906048,
                              method = c("ets"))
ForecastEts[["model"]][["aic"]]
We can check that the Akaike information criterion reported for this model is
about 3333. Now we repeat the same process, but switching to an ARIMA model, with
the following code:
ForecastArima <- forecast::stlf(Ts, h = 12, s.window = 12, t.window = 0.7906048,
                                method = c("arima"))
ForecastArima[["model"]][["aic"]]
## [1] 2944.866

Computing the selection criterion for the proposed ARIMA model, we find that,
according to the Akaike criterion, the ARIMA model is better suited to forecast the
series than the ETS model. Next, we compare the RMSE values to verify whether the
ARIMA model also has greater predictive power than the ETS model. For this we will use
the following code:

ForecastArima <- as.numeric(ForecastArima$mean)
TestSales <- as.numeric(Test$Sales)
A <- data.frame(forecast::accuracy(ForecastArima, TestSales))

ForecastEts <- as.numeric(ForecastEts$mean)
B <- data.frame(forecast::accuracy(ForecastEts, TestSales))

##                ME     RMSE      MAE       MPE     MAPE
## ARIMA   -1156.386 2783.632 2035.860 -6.477117 9.782419
## ETS      -923.317 2633.770 1973.445 -5.430143 9.304786
When calculating the accuracy criteria for both forecasts on the test data, we find
that, unlike the Akaike information criterion, the best results come from the model
whose seasonal component is calculated with ETS, which has a lower RMSE than the
ARIMA model. We therefore proceed to forecast the series using the ETS model for the
seasonal component.
The comparison between the predictions of the series and the actual data of the
training set is made using the following code:

ForecastEts <- forecast::stlf(Ts, h = 12, s.window = 12, t.window = 0.7906048,
                              method = c("ets"))
plot(Train$Sales, x=Train$Period, type="l",
     main="Comparison between training data and prediction.",
     xlab="Date", ylab="Sales.")
lines(ForecastEts$fitted, x=Train$Period, col="red")

Residual Analysis for the final model.

As mentioned previously, residual analysis is of the utmost importance because it
shows us what kind of behavior of the series still needs to be modeled. In the previous
analysis, for example, it was discovered that there was a seasonal pattern that the
LOESS-only estimate could not capture. Using the stlf function, on the other hand, we
were able to estimate both the trend and the seasonal component. We now repeat the
analysis of the residuals for the new model, in order to check whether their distribution
is approximately normal, with constant variance and a mean equal to 0.


To perform the residual analysis we use, in the same way, the following code:

forecast::checkresiduals(ForecastEts$residuals)

We can now observe that the residuals behave much more like a normal distribution
than in the previously proposed model, since the plot no longer shows values far above
the Gaussian curve. In addition, there are no significant autocorrelation problems among
the residuals. As a further check, we compute a Q-Q plot, which helps us assess the
normality of the residuals. A quantile-quantile (Q-Q) chart shows how close the
distribution of a data set is to some ideal distribution, or compares the distributions
of two data sets. When the comparison is against the Gaussian distribution, it is called
a normal probability plot: the data are sorted and the i-th value is plotted against the
corresponding Gaussian quantile. The code to produce this graph is the following:

qqnorm(ForecastEts$residuals, pch = 1, frame = FALSE);

qqline(ForecastEts$residuals, col = "steelblue", lwd = 2)


As we can see in the Q-Q plot, the residual series behaves approximately normally.
The objective of time series analysis is to decompose the observed series into two
parts: the part that depends on the past and the unpredictable part. The combination of
the LOESS trend with the ETS model captured the systematic behavior of the series
effectively, so that the only thing that "remains" of the series is white noise, that is,
random variations that cannot be predicted.


Data Analytics – UNIT – V

Data Visualization

Data visualization aims to communicate data clearly and effectively through
graphical representation. Data visualization has been used extensively in many
applications—for example, at work for reporting, managing business operations, and
tracking progress of tasks. More popularly, we can take advantage of visualization
techniques to discover data relationships that are otherwise not easily observable by
looking at the raw data. Nowadays, people also use data visualization to create fun
and interesting graphics.

We briefly introduce the basic concepts of data visualization. We start with
multidimensional data such as those stored in relational databases. We discuss
several representative approaches, including pixel-oriented techniques, geometric
projection techniques, icon-based techniques, and hierarchical and graph-based
techniques. We then discuss the visualization of complex data and relations.

1. Pixel-Oriented Visualization Techniques


A simple way to visualize the value of a dimension is to use a pixel where the
color of the pixel reflects the dimension’s value. For a data set of m dimensions,
pixel-oriented techniques create m windows on the screen, one for each dimension.
The m dimension values of a record are mapped to m pixels at the corresponding
positions in the windows. The colors of the pixels reflect the corresponding values.
Inside a window, the data values are arranged in some global order shared by
all windows. The global order may be obtained by sorting all data records in a way
that’s meaningful for the task at hand.

Example 2.16: Pixel-oriented visualization. All Electronics maintains a customer
information table, which consists of four dimensions: income, credit limit, transaction
volume, and age. Can we analyze the correlation between income and the other
attributes by visualization?
We can sort all customers in income-ascending order, and use this order to
lay out the customer data in the four visualization windows, as shown in Figure
2.10. The pixel colors are chosen so that the smaller the value, the lighter the
shading. Using pixel-based visualization, we can easily observe the following: credit
limit increases as income increases; customers whose income is in the middle range
are more likely to purchase more from All Electronics; there is no clear correlation
between income and age.
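As a rough idea of how such a display can be produced in R, the following sketch
(with simulated customer data, not the AllElectronics table) folds each attribute, sorted
in income-ascending order, into its own shaded window using image():

# Toy pixel-oriented display with simulated data (illustration only).
set.seed(1)
n <- 400
income <- sort(runif(n, 20000, 120000))                # global order: income ascending
credit <- income * runif(n, 0.8, 1.2)                  # roughly increases with income
volume <- runif(n) * exp(-((income - mean(income)) / sd(income))^2)  # peaks in the middle
age    <- sample(20:70, n, replace = TRUE)             # unrelated to income
toWindow <- function(x) matrix(x, ncol = 20)           # fold the ordered vector into a 20 x 20 window
op <- par(mfrow = c(1, 4), mar = c(1, 1, 2, 1))
for (nm in c("income", "credit", "volume", "age")) {
  image(toWindow(get(nm)), col = gray.colors(64, start = 0.95, end = 0.05),
        axes = FALSE, main = nm)                       # lighter pixels = smaller values
}
par(op)

Because every window shares the same income-ascending order, correlated attributes show
the same light-to-dark gradient.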

In pixel-oriented techniques, data records can also be ordered in a query-dependent
way. For example, given a point query, we can sort all records in
descending order of similarity to the point query.
Filling a window by laying out the data records in a linear way may not work
well for a wide window. The first pixel in a row is far away from the last pixel in the
previous row, though they are next to each other in the global order. Moreover, a
pixel is next to the one above it in the window, even though the two are not next to
each other in the global order. To solve this problem, we can lay out the data records
in a space-filling curve to fill the windows. A space-filling curve is a curve with a
range that covers the entire
n-dimensional unit hypercube. Since the visualization windows are 2-D, we can use
any 2-D space-filling curve. Figure 2.11 shows some frequently used 2-D space-
filling curves. Note that the windows do not have to be rectangular. For example, the
circle segment technique uses windows in the shape of segments of a circle, as
illustrated in Figure 2.12.
This technique can ease the comparison of dimensions because the dimension
windows are located side by side and form a circle.


2. Geometric Projection Visualization Techniques:


A drawback of pixel-oriented visualization techniques is that they cannot
help us much in understanding the distribution of data in a multidimensional
space. For example, they do not show whether there is a dense area in a
multidimensional subspace. Geometric projection techniques help users find interesting
projections of multidimensional data sets. The central challenge the geometric
projection techniques try to address is
how to visualize a high-dimensional space on a 2-D display.
A scatter plot displays 2-D data points using Cartesian coordinates. A third
dimension can be added using different colors or shapes to represent different data
points. Figure 2.13 shows an example, where X and Y are two spatial attributes and
the third dimension is represented by different shapes. Through this visualization,
we can see that points of types “+” and “×” tend to be colocated.
A 3-D scatter plot uses three axes in a Cartesian coordinate system. If it
also uses color, it can display up to 4-D data points (Figure 2.14).
For data sets with more than four dimensions, scatter plots are usually ineffective.
The scatter-plot matrix technique is a useful extension to the scatter plot.
For an n-dimensional data set, a scatter-plot matrix is an n × n grid of 2-D scatter
plots that provides a visualization of each dimension with every other dimension.
Figure 2.15 shows an example, which visualizes the Iris data set. The data set
consists of 50 samples from each of three species of Iris flowers. There are five
dimensions in the data set: length and width of sepal and petal, and species.
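As a small illustration in R (using the built-in iris data rather than the figures
referenced above), a 2-D scatter plot with a third dimension encoded by the plotting
symbol, and a scatter-plot matrix of the four numeric dimensions, can be drawn as follows:

# Scatter plot: two dimensions on the axes, species encoded by the plotting symbol.
plot(iris$Sepal.Length, iris$Sepal.Width, pch = as.numeric(iris$Species),
     xlab = "Sepal length", ylab = "Sepal width",
     main = "Scatter plot with species shown by symbol")
# Scatter-plot matrix: every numeric dimension plotted against every other one.
pairs(iris[, 1:4], col = as.numeric(iris$Species),
      main = "Scatter-plot matrix of the iris dimensions")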
The scatter-plot matrix becomes less effective as the dimensionality increases.
Another popular technique, called parallel coordinates, can handle higher
dimensionality. To visualize n-dimensional data points, the parallel coordinates
technique draws n equally spaced axes, one for each dimension, parallel to one of
the display axes.


A data record is represented by a polygonal line that intersects each axis
at the point corresponding to the associated dimension value (Figure 2.16).
A major limitation of the parallel coordinates technique is that it cannot
effectively show a data set of many records. Even for a data set of several thousand
records, visual clutter and overlap often reduce the readability of the visualization
and make the patterns hard to find.
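A minimal parallel-coordinates sketch, assuming the MASS package that ships with R,
draws one polygonal line per iris flower across four equally spaced axes:

library(MASS)   # provides parcoord()
parcoord(iris[, 1:4], col = as.numeric(iris$Species), lwd = 0.5)

With only 150 records the lines remain readable; with many thousands of records the
clutter described above appears quickly.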

3. Icon-Based Visualization Techniques:


Icon-based visualization techniques use small icons to represent
multidimensional data values. We look at two popular icon-based techniques:
Chernoff faces and stick figures.

Chernoff faces were introduced in 1973 by statistician Herman Chernoff.
They display multidimensional data of up to 18 variables (or dimensions) as a
cartoon human face (Figure 2.17). Chernoff faces help reveal trends in the data.
Components of the face, such as the eyes, ears, mouth, and nose, represent values
of the dimensions by their shape, size, placement, and orientation. For example,
dimensions can be mapped to the following facial characteristics: eye size, eye
spacing, nose length, nose width, mouth curvature, mouth width, mouth openness,
pupil size, eyebrow slant, eye eccentricity, and head eccentricity.

Chernoff faces make use of the ability of the human mind to recognize small
differences in facial characteristics and to assimilate many facial characteristics at
once.
Viewing large tables of data can be tedious. By condensing the data,
Chernoff faces make the data easier for users to digest. In this way, they
facilitate visualization of regularities and irregularities present in the data,

although their power in relating multiple relationships is limited. Another
limitation is that specific data values are not shown. Furthermore, facial features
vary in perceived importance. This means that the similarity of two faces
(representing two multidimensional data points) can vary depending on the order
in which dimensions are assigned to facial characteristics. Therefore, this
mapping should be carefully chosen. Eye size and eyebrow slant have been
found to be important.

Asymmetrical Chernoff faces were proposed as an extension to the original
technique. Since a face has vertical symmetry (along the y-axis), the left and right
side of a face are identical, which wastes space. Asymmetrical Chernoff faces double
the number of facial characteristics, thus allowing up to 36 dimensions to be
displayed.
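A hedged sketch in R, assuming the add-on aplpack package is installed
(install.packages("aplpack")), maps the columns of a small numeric data set to facial
features such as face shape, eye size, and mouth curvature:

library(aplpack)            # provides faces(); assumed to be installed
faces(mtcars[1:12, 1:7])    # one Chernoff face per car, one facial feature per column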


The stick figure visualization technique maps multidimensional data to
five-piece stick figures, where each figure has four limbs and a body. Two
dimensions are mapped to the display (x and y) axes and the remaining
dimensions are mapped to the angle and/or length of the limbs. Figure 2.18 shows
census data, where age and income are mapped to the display axes, and the
remaining dimensions (gender, education, and so on) are mapped to stick figures. If
the data items are relatively dense with respect to the two display dimensions, the
resulting visualization shows texture patterns, reflecting data trends.


4. Hierarchical Visualization Techniques:


The visualization techniques discussed so far focus on visualizing multiple
dimensions simultaneously. However, for a large data set of high dimensionality, it
would be difficult to visualize all dimensions at the same time. Hierarchical
visualization techniques partition all dimensions into subsets (i.e., subspaces). The
subspaces are visualized in a hierarchical manner.
“Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical
visualization method. Suppose we want to visualize a 6-D data set, where the
dimensions are F, X1, . . . , X5. We want to observe how dimension F changes with
respect to the other dimensions. We can first fix the values of dimensions X3, X4, X5 to
some selected values, say, c3, c4, c5. We can then visualize F, X1, X2 using a 3-D plot,
called a world, as shown in Figure 2.19. The position of the origin of the inner world is
located at the point (c3, c4, c5) in the outer world, which is another 3-D plot using
dimensions X3, X4, X5. A user can interactively change, in the outer world, the location of
the origin of the inner world. The user then views the resulting changes of the inner
world. Moreover, a user can vary the dimensions used in the inner world and the outer
world. Given more dimensions, more levels of worlds can be used, which is why the
method is called “worlds-within-worlds.”
As another example of hierarchical visualization methods, tree-maps display
hierarchical data as a set of nested rectangles. For example, Figure 2.20 shows a
tree-map visualizing Google news stories. All news stories are organized into seven
categories, each shown in a large rectangle of a unique color. Within each category
(i.e., each rectangle at the top level), the news stories are further partitioned into
smaller subcategories.
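A hedged tree-map sketch, assuming the add-on treemap package and a toy table of news
categories and story sizes (not the Google news data of the figure), is:

library(treemap)   # provides treemap(); assumed to be installed
news <- data.frame(
  category = c("World", "World", "Sports", "Sports", "Tech", "Tech"),
  story    = c("Elections", "Summit", "Football", "Tennis", "AI", "Chips"),
  articles = c(40, 25, 30, 12, 22, 18)
)
treemap(news, index = c("category", "story"), vSize = "articles",
        title = "Toy tree-map of news stories")   # nested rectangles, area proportional to articles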


Visualizing Complex Data and Relations


In the early days, visualization techniques were mainly for numeric data.
Recently, more and more non-numeric data, such as text and social networks, have
become available. Visualizing and analyzing such data attracts a lot of interest.
There are many new visualization techniques dedicated to these kinds of data.
For example, many people on the Web tag various objects such as pictures, blog
entries, and product reviews. A tag cloud is a visualization of statistics of user-
generated tags. Often, in a tag cloud, tags are listed alphabetically or in a user-
preferred order. The importance of a tag is indicated by font size or color. Figure
2.21 shows a tag cloud for visualizing the popular tags used in a Web site.
Tag clouds are often used in two ways. First, in a tag cloud for a single item,
we can use the size of a tag to represent the number of times that the tag is applied
to this item by different users. Second, when visualizing the tag statistics on
multiple items, we can use the size of a tag to represent the number of items that
the tag has been applied to, that is, the popularity of the tag.
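A hedged sketch of a tag cloud in R, assuming the add-on wordcloud package and a toy
set of tags with usage counts, is:

library(wordcloud)   # provides wordcloud(); assumed to be installed
tags   <- c("data", "visualization", "analytics", "regression", "forecast", "R")
counts <- c(50, 35, 30, 20, 15, 10)
wordcloud(words = tags, freq = counts, min.freq = 1,
          colors = "steelblue")   # font size encodes how often each tag is used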
In addition to complex data, complex relations among data entries also raise
challenges for visualization. For example, Figure 2.22 uses a disease influence graph
to visualize the correlations between diseases. The nodes in the graph are diseases,
and the size of each node is proportional to the prevalence of the corresponding
disease. Two nodes are linked by an edge if the corresponding diseases have a strong
correlation. The width of an edge is proportional to the strength of the correlation
pattern of the two corresponding diseases.
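A hedged sketch of such an influence graph, assuming the add-on igraph package and toy
disease data (hypothetical prevalences and correlation strengths), is:

library(igraph)   # assumed to be installed
edges <- data.frame(from   = c("Diabetes", "Diabetes", "Obesity"),
                    to     = c("Hypertension", "Obesity", "Hypertension"),
                    weight = c(0.8, 0.6, 0.9))            # toy correlation strengths
nodes <- data.frame(name       = c("Diabetes", "Hypertension", "Obesity"),
                    prevalence = c(9, 30, 20))            # toy prevalence values
g <- graph_from_data_frame(edges, vertices = nodes, directed = FALSE)
plot(g, vertex.size = V(g)$prevalence,                    # node size proportional to prevalence
     edge.width = E(g)$weight * 5,                        # edge width proportional to correlation
     vertex.label.cex = 0.9)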
