Business Analytics
ANALYTICS
Lessons: Monday, Tuesday, Wednesday 9:00-11:00 a.m.
FINAL EXAM: 4 April, multiple choice/open answer test, written or online, duration: one hour
Books
J. Neter, M. Kutner, C. Nachtsheim, W. Wasserman, 1996, Applied Linear Regression Models, Irwin
J. Lattin, J. Carroll, P. Green, 2003, Analyzing Multivariate Data, Thomson
VOLUME
We can store tons of data, or rather, there is so
much data that physical facilities are needed to store
it (clouds have physical supports)
VELOCITY
e.g. years ago the Census required almost a year to be
processed; now it is real time. There has been an
increase in computational capability
VARIETY
not only traditional data (income, age…) but also
other formats such as images, audio, video… which
require new methods of analysis: Deep Learning and
Machine Learning, both based on algorithms.
How does a dependent variable Y change on the basis of the independent (explanatory) variables X1, …, Xn?
SUPERVISED
• Parametric Method = we make assumptions about the behaviour of the variable Y:
Y is a random variable
whose behaviour is like a normal random variable
Lecture 2 20/02/2024
STATISTICAL MODELS
Regression
Model Construction
1. Choice of variables: what phenomenon Y we want to study, in relation to which Xn
2. Data collection: recording observations from primary or secondary sources
3. Model type selection: suitable to describe the X,Y relationship and to fit the data collected
4. Parameter estimation: β0 and β1, to get the expected value and variance
5. Goodness of fit of the model: R2
+
6. Purpose of use
• Explanatory analysis: to analyse the relationships between variables X and Y; mostly parametric
• Forecasting: to predict the behaviour of Y; non-parametric, ignores the relationships
• Simulation: describing different scenarios on the basis of different inputs X
If we see a linear relationship between Y and only one X we can use a Scatterplot Graph.
A positive slope indicates a positive relation, and
a negative slope a negative one.
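As a quick illustration (the course uses SAS Studio, but here is a minimal Python sketch with made-up numbers): the sign of the fitted slope tells us the direction of the relation.

```python
import numpy as np

# Hypothetical data: Y clearly increases with X
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a straight line y ≈ b0 + b1*x; np.polyfit returns [slope, intercept]
b1, b0 = np.polyfit(x, y, 1)

print(b1 > 0)  # True: a positive slope indicates a positive relation
```

With a decreasing Y the fitted slope would come out negative instead.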
- The regression function f(x) = β0 + β1x1 describes the relationship between X and the conditional
expectation of Y:
E(Yi | Xi) = μi = β0 + β1x1, which derives from
E(Y | X) = E(β0 + β1x1 + ε | X) = E(β0 | X) + E(β1x1 | X) + E(ε | X) = β0 + β1x1 + 0
- β0 = E(Yi | Xi = 0) is the expected value of Y when X = 0
depending on the inquiry, β0 can be relevant or not:
for the price Y per square metre, X = 0 makes β0 irrelevant; for consumption Y at income X = 0, it is relevant
- β1 is the average change in Y when X increases by 1 unit (for linear relationships only)
- εi includes the effect of omitted variables and noise factors on the response variable Y;
if the noise is very high, maybe we missed an important variable X in our model
Now that we have an idea of the type of model we are about to use, we can go on with the estimation of
the parameters, which will reveal useful information for our inquiry.
We now have to find the parameters.
We use two Estimators, formulas to obtain the value of a population parameter from a small sample.
We want them to be Unbiased: reliable and very close to the real parameter on average, because the
mean of the estimates over all possible samples of the entire population equals the true value.
In general, by changing samples we could get different estimations of beta, but averaging over all
samples allows us to overcome this problem.
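The averaging argument can be sketched with a small simulation (hypothetical "true" parameters; Python used for illustration since the course software is SAS): each sample gives a different estimate of β1, but their average is very close to the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 1.0, 2.0            # assumed "true" population parameters
x = np.linspace(0, 10, 30)

estimates = []
for _ in range(2000):
    # draw a new sample of Y for the same X values
    y = beta0 + beta1 * x + rng.normal(0.0, 1.0, size=x.size)
    b1, _ = np.polyfit(x, y, 1)    # least-squares estimate of beta1
    estimates.append(b1)

# single estimates vary across samples, but their mean is
# very close to the true beta1: the estimator is unbiased
mean_b1 = np.mean(estimates)
```

With this setup the mean of the 2000 estimates should land very close to the true β1 = 2.0, even though individual estimates fluctuate around it.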
All else being equal, an unbiased estimator is preferable to a biased estimator, although in practice, biased
estimators are frequently used. When a biased estimator is used, bounds of the bias are calculated.
We also want the estimators to be Efficient: showing the smallest possible variance across samples,
indicating a small deviation between the estimated value and the "true" value.
Of course, the best estimator is both Unbiased and Efficient, but if we had to choose one?
Bias-Variance Tradeoff
In general, a simpler model has higher bias and lower variance; bias decreases while variance increases
as a model becomes more complicated. Which is better depends on the aim of the
analysis. If interpretation is more important, simpler models serve well (we prefer biased over
unbiased), while complicated, so-called black-box models are necessary if prediction is of
main interest (we prefer efficient, even if biased).
Once we have obtained our estimates of the parameters β0 and β1, we can go on and compute the Expected
Value μ and Variance σ2, which in the case of Linear Regression are β0 + β1x1 and σε2.
For the variance we need to estimate the variability of the error ε.
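A minimal Python sketch (made-up data) of how the error variance is estimated from the residuals: s² = SSE/(n − 2), whose square root is the Root MSE reported by regression software.

```python
import numpy as np

# made-up sample of 6 observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.3, 4.9, 6.2])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)     # observed minus fitted values

n = x.size
sse = np.sum(residuals ** 2)      # error sum of squares
mse = sse / (n - 2)               # estimate of the error variance sigma^2
root_mse = np.sqrt(mse)           # estimate of sigma, in the units of Y
```

Note that least-squares residuals always sum to zero, so all their information is in their spread, which is exactly what the MSE measures.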
Lecture 3 21/02/24
In SAS Studio
We fit the model to have predictions very close to the sample. In this case the high variability of Y
does not allow for good predictions.
To understand if the model is reliable for making predictions, we need to make a GOODNESS OF FIT TEST
If R2 = 0 then SSM = 0: E(Y) does not change according to X, so Y is a constant (a flat line);
there is no correlation and the model is not useful.
Limitations of R2
First case: the relation is not linear but quadratic, yet the model still appears a good fit.
Second case: R2 is low, and the linear regression is not a good approximation of the real Y = f(X);
it is not that there is no relationship between X and Y, only that it is not explained by linear regression.
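The second limitation can be reproduced with a small Python sketch (hypothetical data): a perfect, but non-linear, relationship can give an R² near zero for a linear fit, even though X fully determines Y.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple linear regression: SSM / SST."""
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x
    ssm = np.sum((y_hat - y.mean()) ** 2)   # model sum of squares
    sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return ssm / sst

x = np.linspace(-3, 3, 61)
y_quad = x ** 2                  # exact quadratic relationship

# For x symmetric around 0 the fitted slope is ~0, so the linear
# R^2 is ~0 even though Y is a deterministic function of X
print(round(r_squared(x, y_quad), 3))   # → 0.0
```

A low R² therefore only rules out a *linear* explanation, not a relationship altogether.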
Y = β0 + β1x1 + ε, hence ε = Y − (β0 + β1x1)
Look at an example
From the MSR and the Root MSE we have full information about the variability of our data.
T-test
The Null Hypothesis H0 usually implies no correlation, so β1 = 0
H1: some kind of correlation, β1 ≠ 0
TEST STATISTIC
If the standard error is very high (e.g. +40 in one sample, −20 in another), then b1 is not stable and we
need other estimations. When it is close to 0, we are making a good approximation of β1.
If we compute t from many different random samples, then t is a random variable with a Student's t distribution.
Changing the degrees of freedom changes the shape of the distribution.
In other words, the p-value is the plausibility of the Null Hypothesis (close to 0 = rejection).
Alpha is the significance level; usually we set it at 5%, admitting 95% confidence.
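A Python sketch of the whole t-test (made-up data; SAS PROC REG reports the same quantities, the standard error, t value and Pr > |t|, in its parameter-estimates table): t = b1 / SE(b1), compared against a Student's t with n − 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

# hypothetical sample with a clear linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
n = x.size
s2 = np.sum(resid ** 2) / (n - 2)                    # MSE
se_b1 = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))    # standard error of b1

t_stat = b1 / se_b1                                  # H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)      # two-sided p-value

print(p_value < 0.05)   # True: reject H0 at alpha = 0.05
```

The same result can be obtained in one call with `scipy.stats.linregress(x, y)`, which returns the slope, its standard error and the two-sided p-value directly.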
Example
Lecture 4 26.02.2024
RESIDUAL ANALYSIS
tools to verify that our assumptions (hypotheses) are true
Properties of Residuals
Normality of residuals
A normal quantile plot graphs the quantiles of a variable against the quantiles of a normal (Gaussian)
distribution. qnorm is sensitive to non-normality near the tails, and indeed we may see considerable
deviations from normality (from the diagonal line) in the tails.
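A minimal Python sketch of the same idea (simulated residuals; `scipy.stats.probplot` computes what qnorm plots): if the residuals are normal, the ordered residuals plotted against the theoretical normal quantiles fall close to a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 1.0, size=200)   # simulated "well-behaved" residuals

# probplot pairs the theoretical normal quantiles (osm) with the
# ordered residuals (osr); plotting osr vs osm gives the Q-Q plot
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# r is the correlation of the Q-Q points with the fitted straight line;
# close to 1 means the residuals look normal
print(round(r, 3))
```

Heavy tails or skewness in the residuals would pull the points in the tails away from the line and lower this correlation.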
Presence of outliers
Three different types of outliers:
1: a value of Y very different from the whole sample but not X
2: values of X and Y very different from the whole sample
3: a value of X very different from the whole sample but not Y
We now plot the standardized residuals e*: there is a 99% probability that their values fall between
−3 and +3.
So in this case, either point 3 is a mistake or my model, the linear function, is not fit for all the data.
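A Python sketch (made-up data, with one deliberately anomalous Y value) of flagging observations via standardized residuals; here a simple ±2 threshold is used, which should contain roughly 95% of well-behaved values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.1, 8.0, 9.2, 25.0])  # last Y is anomalous

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)
e_std = e / e.std(ddof=2)          # simple standardization by the residual std. dev.

outliers = np.where(np.abs(e_std) > 2)[0]   # indices of suspicious observations
print(outliers)                             # → [9]: the anomalous last point
```

Note this is a simplified standardization; textbooks also divide by a leverage-adjusted standard error, which sharpens the flagging further.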
So, the first step is to understand whether we have outliers in the data set. To do so, we can use a BOXPLOT
The median is robust; it is a measure of position. In this
case the distribution is quite symmetrical, as Q2 is in
the middle of the box.
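The boxplot's outlier rule can be sketched in Python (hypothetical numbers): points beyond 1.5 × IQR from the quartiles fall outside the whiskers and are flagged.

```python
import numpy as np

data = np.array([4.1, 4.5, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 6.4, 12.0])  # 12.0 is suspicious

q1, q2, q3 = np.percentile(data, [25, 50, 75])   # quartiles; q2 is the median
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # the boxplot whisker fences

outliers = data[(data < lower) | (data > upper)]
print(outliers)    # → [12.]
```

Because the fences are built from quartiles, a single extreme value does not move them much, which is exactly the robustness of the median mentioned above.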
SAS
Classification Variable = Qualitative or Discrete Quantitative
Continuous Variable = Quantitative
There is only an apparent relationship between Stature and Hair length, which is not even logical.
Once we include gender in our study, the relationship is more plausible (and we can ignore stature).