
BUSINESS ANALYTICS
Lessons: Monday, Tuesday, Wednesday, 9:00-11:00 a.m.

NON-COMPULSORY COMPUTING LAB


Prof. Carlo Cavicchia: February 29; March 1, 7, 14, 15
FINAL TEST: 22 March, 9:00-11:00 (only if you attend all the lessons)
A lab exam describing a statistical analysis using SAS on a specific dataset.
You may add up to 2 points to the score of the final exam.
Software: SAS Studio (SAS OnDemand for Academics)

FINAL EXAM: 4 April, multiple-choice/open-answer test, written or online; time: one hour

Books
J. Neter, M. Kutner, C. Nachtsheim, W. Wasserman, 1996, Applied Linear Regression Models, Irwin
J. Lattin, J. Carroll, P. Green, 2003, Analyzing Multivariate Data, Thomson

Office: Room SB5, 3rd floor, Building B


06.72595943 [email protected]
Office hours: Monday, 2:30-3:30 p.m.

Pre-requisites for the course


Basic knowledge of descriptive statistics, elements of probability, random variables and statistical inference (see, for instance, the Statistical Background slides).
Lesson 1 19/02/24

WHY BUSINESS ANALYTICS AND NOT STATISTICS?


In the last 20 years it has become possible to deal with statistical inquiries that include more and more qualitative information, no longer just numerical data.
Today we have real-time information about almost everything. (Thanks, for example, to social media we now have information about many consumer preferences: Facebook alone produces about 500 terabytes of data per day.)

ERA OF BIG DATA: an enormous amount of data with 3 characteristics (the "3 Vs")

VOLUME
We can store tons of data; or rather, there is so much data that physical facilities are needed to store it (clouds have physical supports).

VELOCITY
E.g., years ago a census required almost a year to be processed; now processing is in real time. There has been an increase in computational capability.

VARIETY
Not only traditional data (income, age, ...) but also other formats such as images, audio and video, which require new methods of analysis: Deep Learning and Machine Learning, both based on algorithms.

How does a dependent variable Y change on the basis of the independent, explanatory variables X1, ..., Xn?

We can study two types of relationships:

- Y as influenced by the explanatory variables X (dependence)

- reciprocal relationships between the X variables themselves (interdependence)


MACHINE LEARNING MODELS

SUPERVISED
• Parametric Method = we have assumptions about the behaviour of the variable Y, e.g. Y is a random variable that behaves like a normal random variable
• Non-Parametric Approach = no assumptions on Y

Data-Driven Approach: we have such a large set of data that no assumption is needed; big data is sufficient to tell us something about the behaviour of the variable.

Lecture 2 20/02/2024

STATISTICAL MODELS
Regression

Model Construction
1. Choice of variables: which phenomenon Y we want to study, in relation to which explanatory variables X
2. Data collection: recording observations from primary or secondary sources
3. Model type selection: a model suitable to describe the X,Y relationship and fit the data collected
4. Parameter estimation: β0 and β1, to get the expected value and variance
5. Goodness of fit of the model: R2
6. Purpose of use:
• Explanatory analysis: to analyse the relationships between variables X and Y; mostly parametric
• Forecasting: to predict the behaviour of Y; non-parametric, ignores the relationships
• Simulation: describing different scenarios on the basis of different inputs X

If we assume a linear relationship between Y and a single X, we can use a scatterplot graph:
a positive slope indicates a positive relation, and a negative slope a negative one.
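As an illustration, a minimal SAS sketch (the dataset work.sample, the seed and the coefficients are hypothetical, invented only for the example):

/* Simulate a small sample where y depends linearly on x plus noise */
data work.sample;
    call streaminit(123);                       /* fix the random seed */
    do i = 1 to 100;
        x = rand("uniform") * 10;               /* explanatory variable */
        y = 2 + 0.5*x + rand("normal", 0, 1);   /* linear relation + noise */
        output;
    end;
    drop i;
run;

/* Scatterplot of y against x with the fitted regression line overlaid */
proc sgplot data=work.sample;
    scatter x=x y=y;
    reg x=x y=y / nomarkers;   /* a positive slope indicates a positive relation */
run;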

Sample Data Collection

The variables y and x can be measured as:
1. Cross-section: n different units observed at a single instant (T=1)
2. Time series: the same unit observed at T different times
3. Panel: the same n units observed at T different times


Simple Linear Regression Model

Assumptions Needed For A Parametric Model

- Shape of the function f(x), chosen on the basis of the variable type

- Nature of the disturbance εi

Random variables have different behaviours, so we first need to understand the nature of the disturbance.

E(εi | xi) = 0: the average value of εi given xi is equal to 0.

This is true when the errors take both positive and negative values that balance out:
e.g., the values -100, -20, -2, +2, +20, +100 have expected value (average) = 0.

Var(εi | xi) = σ²: the variance of epsilon is constant.

Homoscedasticity: increasing X, the variability remains the same.
We adopt this assumption in our exercises for simplicity, even if there are exceptions:
for small changes in income (x=1 → x=2), consumption variability remains unchanged;
for big changes in income (x=1 → x=200), consumption variability (the variance) changes.

εi is a normal random variable.

The εi, i=1,…,n, are uncorrelated; since they are normal random variables, uncorrelated also means independent.

Given these assumptions, we can analyse the BEHAVIOUR OF Y.

- The regression function $f(x) = \beta_0 + \beta_1 x$ describes the relationship between X and the conditional expectation of Y:
  $E(Y_i \mid X_i) = \mu_i = \beta_0 + \beta_1 x_i$, which derives from
  $E(\beta_0 + \beta_1 x_i + \varepsilon_i \mid X_i) = E(\beta_0 \mid X_i) + E(\beta_1 x_i \mid X_i) + E(\varepsilon_i \mid X_i) = \beta_0 + \beta_1 x_i + 0$
- $\beta_0 = E(Y_i \mid X_i = 0)$: depending on the inquiry, β0 can be relevant or not
  (price Y per square metre at X=0: β0 irrelevant; consumption Y at income X=0: relevant)
- $\beta_1$ is the average change in Y when increasing X by 1 unit (for linear relationships only)
- εi includes the effect of omitted variables and noise factors on the response variable Y;
  if the noise is very high, maybe we missed an important variable X in our model
RECAP REGRESSION

Equation: $E(Y_i \mid X_i) = \mu_i = \beta_0 + \beta_1 x_i$
Hypothesis: $Y_i \sim N(\mu_i, \sigma^2_\varepsilon)$

ALL THE VARIABILITY OF Y DEPENDS ON THE VARIABILITY OF THE NOISE TERM εi:

$V(Y \mid X) = V(\beta_0 + \beta_1 x_i + \varepsilon_i \mid X) = V(\beta_0 \mid X) + V(\beta_1 x_i \mid X) + V(\varepsilon_i \mid X) = 0 + 0 + \sigma^2_\varepsilon = \sigma^2_\varepsilon$

So we can rewrite our hypothesis as: $Y_i \mid X_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2_\varepsilon)$

Now that we have an idea of the type of model we are about to use, we can go on with the estimation of the parameters, which will reveal useful information for our inquiry.
We now have to find the parameters.


ESTIMATION OF REGRESSION PARAMETERS


Ordinary Least Squares Method
Estimates the regression parameters that minimize the sum-of-squares criterion, i.e. that minimize the distance between the observed data and the estimated regression line.
Here, in fact, yi and xi are observed values and not random variables.
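For reference, the standard least-squares solutions (a textbook result, stated here explicitly) are:

$\hat\beta_1 = \dfrac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$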

We use two estimators: formulas to get the value of the population parameters from a small sample.

They are both unbiased (there is no bias):

this means that they are reliable and close to the real parameter, because their average over all possible samples of the entire population equals the true parameter.
In general, by changing samples we could get different estimates of beta, but averaging over all samples overcomes this problem.
All else being equal, an unbiased estimator is preferable to a biased estimator, although in practice biased estimators are frequently used. When a biased estimator is used, bounds on the bias are calculated.

We also want the estimators to be efficient: showing the smallest possible variance across samples, indicating a small deviation between the estimated value and the "true" value.

Of course, the best estimator is both unbiased and efficient, but what if we had to choose one?
Bias-Variance Tradeoff
In general, a simpler model has higher bias and lower variance; bias goes down while variance goes up as a model becomes more complex. Which is better depends on the aim of the analysis. If interpretation is more important, simpler models serve well (we accept higher bias in exchange for lower variance), while complex, so-called black-box, models may be necessary if prediction is of main interest (we accept higher variance in exchange for lower bias).

Once we have obtained our estimates of the parameters β0 and β1, we can go on and compute the expected value μ and the variance σ², which in the case of linear regression are $\beta_0 + \beta_1 x_i$ and $\sigma^2_\varepsilon$.
For the variance we need to estimate the variability of the error ε.

Lecture 3 21/02/24
In SAS Studio

Is the regression line we drew close to or far from the sample data?

We fit the model so as to have predictions very close to the sample. In this case, the high variability of Y does not allow for good predictions.
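A minimal sketch of the corresponding step in SAS Studio, reusing the hypothetical work.sample dataset simulated earlier:

/* Fit the simple linear regression: the output shows the ANOVA table,
   R-Square, Root MSE and the parameter estimates with their t-tests */
proc reg data=work.sample;
    model y = x;
run;
quit;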

To understand if the model is reliable for making predictions, we need to perform a GOODNESS OF FIT TEST.

We first need the composition of the total variance.

Start with the variability of a single point with respect to the mean of y; then consider the deviance, which describes the variability of Y in the sample data.
R2 tells us about the usefulness of our regression analysis: how much of the variability of Y is explained by the variability of X.
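In formulas (a standard decomposition, with $\hat y_i$ the fitted values and $\bar y$ the sample mean):

$\underbrace{\sum_i (y_i - \bar y)^2}_{SST} = \underbrace{\sum_i (\hat y_i - \bar y)^2}_{SSM} + \underbrace{\sum_i (y_i - \hat y_i)^2}_{SSE}, \qquad R^2 = \dfrac{SSM}{SST} = 1 - \dfrac{SSE}{SST}$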

If R2 = 0 → SSM = 0: E(Y) does not change according to X, so Y is a constant (a flat line); no correlation, and my model is not useful.

The more explanatory variables X I use, the more R2 will increase.

Limitations of R2
First case: the relation is not linear but quadratic, yet the model is still a good fit.
Second case: R2 is low, and the linear regression model is not a good approximation of the real Y = f(x).
It is not that there is no relationship between X and Y; it is only that the relationship is not explained by a linear regression.

Third parameter: the variance of the noise term.

Why this one?

$Y = \beta_0 + \beta_1 x + \varepsilon \;\Rightarrow\; \varepsilon = Y - (\beta_0 + \beta_1 x)$

The problem is that we cannot observe epsilon; we can only estimate it through the residuals, and therefore we apply a correction, particularly important when we have a small sample.
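In formulas: we estimate each $\varepsilon_i$ by the corresponding residual, and the correction consists in dividing by n-2 (the degrees of freedom left after estimating the two parameters):

$e_i = y_i - \hat y_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i), \qquad \hat\sigma^2_\varepsilon = MSE = \dfrac{\sum_{i=1}^n e_i^2}{n-2}$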

Look at an example of the SAS output: from the MSR (mean square of the regression) and the Root MSE (root mean squared error) we have the full information about the variability of our data.

WHICH IS THE REAL FUNCTION Y = f(X)?

It is impossible to look at the whole population's values of Y and X; we only look at a sample when we estimate β0 and β1.
The problem is that samples are subsets of the population of size N, and they can differ from each other, so they lead us to different values of β0 and β1 → sample variability.

We overcome this obstacle by testing our inference about β0 and β1 using a statistical test and a confidence interval.

T-test
Null hypothesis H0: usually implies no correlation, so β1 = 0
Alternative hypothesis H1: some kind of correlation, β1 ≠ 0

TEST STATISTIC

If the standard error is very high (e.g. the estimate is +40 in one sample and -20 in another), then the estimate of β1 is not stable and we need other estimations. When the standard error is close to 0, we are making a good approximation of β1.
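The test statistic (a standard result) is:

$t = \dfrac{\hat\beta_1 - 0}{se(\hat\beta_1)}$, which under $H_0$ follows a Student's t distribution with n-2 degrees of freedom.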

If we compute t from many different random samples, then t is a random variable with a Student's t distribution. If we change the degrees of freedom, we change the shape of the distribution.
In other words, the P-value measures the plausibility of the null hypothesis (close to 0 = rejection).
Alpha is the significance level; usually we admit 95% confidence (α = 0.05).

Consider the t-test for a fixed level of α:

If the P-value is close to 0 → the null hypothesis is rejected.
If instead the P-value is at or above α (e.g. 0.05 = 5%) → we do not reject the null hypothesis.

Lecture 4 26/02/24

RESIDUAL ANALYSIS
Tools to verify that our assumptions (hypotheses) are true.
Properties of Residuals

Standardization = transform the variable so that it has mean E(X) = 0 and variance V(X) = 1 (a normal variable becomes a standard normal):

we obtain new values that are comparable with each other.
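One common version (the semistudentized residuals of Neter et al.) divides each residual by the Root MSE:

$e_i^* = \dfrac{e_i}{\sqrt{MSE}}$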

DEPARTURES FROM THE MODEL TO BE STUDIED BY RESIDUALS

1. The regression function is not linear
2. The error terms don't have constant variance
3. The error terms are not independent
4. The model fits all but one or a few outlier observations (we can check for the presence of outliers and whether the model is able to accommodate them)
5. The error terms are not normally distributed

Diagnostics for residuals based on plots:

1. Plot of residuals against the explanatory variable or against the fitted values
2. Plot of residuals against time
3. Box plot or histogram of residuals
4. Normal probability plot of residuals

The position of the residuals should be random, but in this case they are negative, then positive, then negative: we can tell this is not random behaviour, so the linear regression function is not a good model for Y = f(x).
From the residuals we can see that the true function is not linear.
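A minimal SAS sketch of these diagnostics, again on the hypothetical work.sample dataset:

/* Save residuals, fitted values and studentized residuals */
proc reg data=work.sample;
    model y = x;
    output out=work.diag r=resid p=fitted student=std_resid;
run;
quit;

/* Residuals against fitted values: they should scatter randomly around 0 */
proc sgplot data=work.diag;
    scatter x=fitted y=resid;
    refline 0 / axis=y;
run;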

Nonconstancy of error variance

We can use the plot of residuals against the fitted values.
The position of the residuals is random, but we see that as the fitted values ŷ increase, the variability of the residuals increases: homoscedasticity no longer holds.

NB: if the variance of the error term is not constant, we need to change the formulas used to obtain β0 and β1.

Normality of residuals
A normal quantile plot graphs the quantiles of a variable against the quantiles of a normal (Gaussian) distribution. The qnorm plot is sensitive to non-normality near the tails, and indeed we may see considerable deviations from normal (the diagonal line) in the tails.

The variance of the theoretical normal curve is set equal to the variance of the observed residuals:
we overlay the theoretical curve on the observed data (the histogram) and check for normality: is my empirical distribution equal to the theoretical distribution?

We can more easily use another graph → the QNORM plot.

If the empirical distribution matches the theoretical one (the red curve equals the histogram), the points lie on the line; otherwise they do not.
In this case there is a good match between theoretical and empirical, with a little mismatch in the tails, so the assumption of normality still holds.
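A minimal SAS sketch of both checks, using the residuals saved in the hypothetical work.diag dataset:

/* Histogram with overlaid normal curve, plus a normal Q-Q plot */
proc univariate data=work.diag;
    var resid;
    histogram resid / normal;                  /* empirical vs theoretical distribution */
    qqplot resid / normal(mu=est sigma=est);   /* points should lie on the diagonal line */
run;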

Presence of outliers
Three different types of outliers:
1: a value of Y very different from the whole sample, but not of X
2: values of both X and Y very different from the whole sample
3: a value of X very different from the whole sample, but not of Y

Are points 1, 2 and 3 anomalous or not?

We now apply standardization.

We then plot the standardized residuals e* and note that there is roughly a 99% probability that the values fall between -3 and +3.
So in this case either point 3 is a mistake, or my model (my linear function) is not fit for all the data.

So, the first step is to understand if we do have outliers in the dataset. To do so, we can use a BOXPLOT.
The median is a robust measure of position. In this case the distribution is quite symmetrical, as Q2 is in the middle of the box.
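A minimal SAS sketch (hypothetical dataset as before):

/* Boxplot of y: check symmetry (Q2 centred in the box) and flag
   points beyond the whiskers as candidate outliers */
proc sgplot data=work.sample;
    vbox y;
run;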

SAS
Classification variable = qualitative or discrete quantitative
Continuous variable = quantitative

Volume of the noise = variance of the residual errors.

Correlation index: -1 ≤ R ≤ +1

Squared correlation index: 0 ≤ R2 ≤ 1
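A minimal SAS sketch for computing R on the hypothetical work.sample dataset (squaring the printed correlation gives R2):

/* Pearson correlation between x and y */
proc corr data=work.sample;
    var x y;
run;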

When we use only 1 explanatory variable, we can obtain a SPURIOUS CORRELATION:
two variables Y and X have no direct causal connection, yet it may be wrongly inferred that they do, due to either coincidence or the presence of a certain third, unseen factor Z.
X increases and Y increases, but perhaps only because Z increases.

For instance, we consider a sample of students.

We observe stature and hair length, and we estimate a simple linear regression with Y = hair length and X = stature.

There is only an apparent relationship between stature and hair length, which is not even logical.
Once we include gender in our study, the relationship is more plausible (and we can ignore stature).

Binomial = dummy variable (it takes 2 values, 0 or 1, yes/no).
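A minimal SAS sketch (the dataset work.students_raw and the variable gender are hypothetical):

/* Encode gender as a dummy variable: 1 = female, 0 = male */
data work.students;
    set work.students_raw;       /* hypothetical input dataset */
    female = (gender = "F");     /* a logical comparison in SAS yields 0/1 */
run;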
