Mosconi W1
Descriptive statistics is focused on developing indicators whose purpose is to summarize large sets of data regarding a variable of interest. The
basic idea is that a large set of observations can be difficult to interpret unless it is summarized through a few indicators that allow us to grasp the
most important characteristics of the data.
Location Measures: (indicators of the typical or central value around which the data are located)
Average (Mean)
Median
Mode
Dispersion Measures: (order of magnitude of the distances of the observed values with respect to the location)
Variance / Standard Deviation
Range / Interquartile Range
Shape Measures: (whether the distances with respect to the location measure are distributed evenly on both sides or not (skewness), and
whether there is a relevant presence of outliers or not (kurtosis))
Skewness
Kurtosis
Association Measures: (aim to measure the sign and the intensity of the relationship between two variables)
Covariance
Correlation
The histogram provides, at a glance, useful information about location, dispersion and shape.
The boxplot provides the same information in a more concise form, and it is particularly useful for comparing different groups of
individuals with respect to the same variable.
The scatterplot is instead connected to the association measures, and allows us to visualize intuitively the relationship between two
variables.
We may say that the median is not affected by extreme values (sometimes called outliers), while the average is.
If the two indicators are close enough to each other, we can conclude that probably there are no outliers (or, if there are any, they lie on both sides of the average and
therefore balance out). Conversely, a relevant distance between the average and the median indicates that the data present asymmetrically distributed outliers.
The side where the outliers are found can be deduced from the sign of the difference between average and median:
If the average is greater than the median, we infer that large positive distances are present. We define this case "positive skewness"
(or "right skewness").
Conversely, if the average is smaller than the median, large negative distances are present, and we refer to this case as "negative skewness" (or "left skewness").
For continuous variables, the notion of mode can be replaced by the notion of "modal class": the modal class is the one with the
highest frequency (both absolute and relative, of course).
The variance focuses on the distances from the average. For each observation we compute its distance from the average. This distance is then squared, to remove
the sign, so that negative distances also become positive. We then calculate the average of the squared distances, adding them up and then
dividing by n.
If, for example, we are interested in a variable measured in meters, the variance will be measured in square meters; if the variable is measured in euro,
the variance will be measured in squared euro, and so on. For this reason, we introduce another indicator, called 'standard deviation'.
The standard deviation is simply the square root of the variance, and is usually denoted with “S of X”.
So, for example, if the average price of the cars of a given brand is 15,000 euros with a standard deviation of 6,000, I should not be surprised if the price
of a specific car of that brand is 21,000 euros: the distance from the average would be equal to one standard deviation, that is, a “typical” distance.
The relationship between the dispersion of the data and their location is measured by the coefficient of variation, given by the ratio of the
standard deviation to the average.
For example, the coefficient of variation of the top managers' income could be 1.5 (i.e. 150%), while that of the employees could be 0.2 (i.e. 20%): this
means that the income of top managers is much more heterogeneous, relative to its average, than the income of employees, and not merely for the obvious
reason that the income of top managers is much higher.
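As a concrete illustration, here is a minimal Python sketch that computes these location and dispersion indicators on a small set of invented car prices (the numbers are made up for the example, not taken from the notes):

```python
import numpy as np

# Hypothetical car prices in euros (illustrative data, not from the notes)
prices = np.array([9000, 12000, 14000, 15000, 16000, 18000, 21000, 45000])

mean = prices.mean()            # average (location)
median = np.median(prices)      # median (robust location, not affected by outliers)
variance = prices.var()         # average of squared distances from the mean (divides by n)
std = prices.std()              # standard deviation, same unit as the data
cv = std / mean                 # coefficient of variation (dispersion relative to location)

print(f"mean = {mean:.0f}, median = {median:.0f}")
print(f"variance = {variance:.0f} (squared euros), std = {std:.0f} euros, CV = {cv:.2f}")
# A mean well above the median (as here, because of the 45,000 outlier)
# signals positive (right) skewness.
```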
Q: Two variables have standard deviations of 100 and 2, respectively. If the covariance between them were equal to
-50, we could conclude that the correlation is negative and equal to -0.25. Why?
Given that the covariance is -50, and the standard deviations are 100 and 2 for variables X and Y respectively, we can
use the formula for the correlation coefficient:
corr(X, Y) = cov(X, Y) / (s_X · s_Y) = -50 / (100 × 2) = -0.25
So, the correlation coefficient is indeed -0.25. The negative sign indicates a negative correlation, and the magnitude of
0.25 suggests a relatively weak negative linear relationship between the two variables. Therefore, the conclusion is
accurate: the correlation is negative and equal to -0.25.
Skewness
The indicator is based on the average of the cubed distances from the average. This average is then divided by the
cubed standard deviation, in order to normalize it. By raising the distances to the power 3, we assign a large weight
to large distances and a small weight to small distances, without removing the sign.
Kurtosis
The kurtosis indicator is mainly used to compare the relevance of outliers for the same variable in different groups.
It is generally considered only for symmetric distributions, and in economics it is mainly used in the context of
risk management.
For example, if the kurtosis of the returns of the financial asset "alpha" is equal to 7.2 while the kurtosis of the returns
of the financial asset "beta" is equal to 3.4, we can conclude that in both cases there are outliers, but for asset
alpha they are more extreme.
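A minimal sketch of how the two shape indicators can be computed in Python on a simulated, heavy-tailed series; the scipy options bias=True and fisher=False are chosen so that the formulas match the ones described here (division by n, kurtosis of a normal equal to 3):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
returns = rng.standard_t(df=5, size=10_000)   # simulated heavy-tailed "returns"

# Skewness: mean of cubed distances from the mean, divided by the cubed std
g1 = skew(returns, bias=True)

# Kurtosis (normal = 3 convention): mean of fourth-power distances
# from the mean, divided by the fourth power of the std
g2 = kurtosis(returns, fisher=False, bias=True)

print(f"skewness = {g1:.2f}, kurtosis = {g2:.2f}")
# For an outlier-rich series the kurtosis comes out well above 3.
```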
The covariance is the average of the products of the distances from the average of X and Y. The sign of covariance is
easy to interpret, but its size is not easy to interpret, since its unit of measurement is given by the product of the
units of measurement of the two variables.
To make covariance easier to interpret, we normalize it by dividing by the product of the two standard deviations.
This results in the “correlation index”, which is a pure number with no unit of measurement. It can be
demonstrated that the correlation index ranges from negative 1 (maximum negative linear association) to 1
(Maximum positive linear association).
If the correlation is close to zero, it means that there is no linear relationship; the relationship could nevertheless be non-linear, for example initially decreasing and then increasing,
as in the graph shown in green. A low correlation between two variables therefore does not necessarily mean that the
variables have no relationship: they might be unrelated, or they might have a non-linear relationship.
The variance-covariance matrix is a table where the number of rows and the number of columns is equal to the
number of variables (3 in this example). On the diagonal, highlighted in green, we report the variances. The off-
diagonal elements, highlighted in blue, correspond to the covariances. The matrix is symmetric, i.e. the values
above the diagonal are equal to the values below the diagonal.
Note that on the diagonal of the correlation matrix all values are equal to one (the correlation of a variable with itself
is equal to 1 by construction).
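A short numpy sketch with three simulated variables; it shows the variances on the diagonal of the covariance matrix, the ones on the diagonal of the correlation matrix, and the symmetry of both:

```python
import numpy as np

rng = np.random.default_rng(1)
# Three simulated variables, with y linearly related to x
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)
z = rng.normal(size=500)
data = np.vstack([x, y, z])

cov_matrix = np.cov(data)        # 3x3: variances on the diagonal, covariances off-diagonal
corr_matrix = np.corrcoef(data)  # 3x3: ones on the diagonal, correlations off-diagonal

print(np.round(cov_matrix, 2))
print(np.round(corr_matrix, 2))
```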
Random variables may be divided into two groups:
Discrete random variables: these variables can take on a countable number of distinct values
Continuous random variables: these variables can take on an uncountably infinite number of possible outcomes
The objectivist perspective considers probability as an intrinsic property of the random event, which is independent of the
observer.
The subjectivist perspective stems from the remark that often the experiment cannot be repeated under the same conditions,
and therefore what we define probability simply reflects the state of our information.
For example: the probability that a certain team wins a certain match can be evaluated differently by subjects with different
information regarding the form of the players, and would be completely different for someone who has paid the referee to favor one of the
two teams.
Whichever point of view you assume about probability, from a mathematical viewpoint there is complete agreement regarding
the axioms that probability must fulfill:
The probability of an elementary event (outcome of the experiment) must be a non-negative real number
The probability that at least one of the elementary events will occur (certain event) must be equal to 1
The probability of the union of two mutually exclusive events must be equal to the sum of the probabilities of the
single events
The expected value is, in essence, a sort of average of all the values that the random variable could assume, where each value is
weighted by its probability.
The variance is the expected value of the squared distance from the expected value.
The graph shows how the variance of a binary random variable varies as a function of π.
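To make the two definitions concrete, here is a small sketch for a binary random variable that takes value 1 with probability π and 0 otherwise: its expected value is π and its variance is π(1 − π), which is what the graph displays.

```python
# Expected value and variance of a binary (Bernoulli) random variable
def bernoulli_moments(pi):
    values = [0, 1]
    probs = [1 - pi, pi]
    expected = sum(v * p for v, p in zip(values, probs))                     # E[X] = pi
    variance = sum((v - expected) ** 2 * p for v, p in zip(values, probs))   # pi * (1 - pi)
    return expected, variance

for pi in (0.1, 0.5, 0.9):
    e, v = bernoulli_moments(pi)
    print(f"pi = {pi}: E[X] = {e:.2f}, Var[X] = {v:.2f}")
# The variance is maximal (0.25) for pi = 0.5 and shrinks towards 0 as pi approaches 0 or 1.
```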
The Binomial distribution is appropriate to represent the probabilities associated with a specific type of random variable: the
sum of the successes when we replicate a binary (success/failure) experiment n times, independently and with the same probability of success π.
As an economic example of reference, we could think of a car dealer who, in one day, proposes the purchase of a car to 10 potential
buyers (n is equal to 10): how many cars will he be able to sell? We define by “X-i” the random variable "outcome of the i-th
proposal".
Knowing π, the formula for "p of y", P(Y = y) = C(n, y) · π^y · (1 − π)^(n − y), allows us to calculate, for example, the probability that the dealer sells 3 cars.
If, for example, the probability of selling a car to a generic customer is 10%, the probability of selling 3 cars when we meet 10
customers is 5.74%.
The U.S. Department of Transportation reported that in 2009, 82.5% of all domestic U.S. flights were on time. Using the
binomial distribution, what is the probability that at least 14 out of the next 15 flights will be on time?
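Both binomial calculations can be checked with scipy; the first call reproduces the 5.74% of the car dealer example, the second answers the flights question:

```python
from scipy.stats import binom

# Car dealer: n = 10 customers, probability of a sale pi = 0.10
p_three_sales = binom.pmf(3, n=10, p=0.10)
print(f"P(exactly 3 sales) = {p_three_sales:.4f}")   # about 0.0574, i.e. 5.74%

# Flights: n = 15, probability of being on time pi = 0.825
p_at_least_14 = binom.pmf(14, n=15, p=0.825) + binom.pmf(15, n=15, p=0.825)
# equivalently: binom.sf(13, n=15, p=0.825)
print(f"P(at least 14 on-time flights) = {p_at_least_14:.4f}")   # roughly 0.23
```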
Technically, we say that the probability density is higher for the value 70 than for the value 400.
Depending on the type of random variable that is to be represented, the shape of the density function may vary. An important feature is
symmetry versus asymmetry.
The first graph shows a unimodal distribution: this means that the density has a single peak.
The second graph shows a bimodal distribution, where the density has two peaks (not necessarily with the same height).
The third graph shows a multimodal distribution, where the density has several peaks (in this case three).
“Normal" (or "Gaussian") distribution. The name itself suggests that this model is well suited to represent the uncertainty in
many real-life situations.
The central value is usually indicated with the Greek letter µ (mu), and it is the value with the highest probability density. µ can
be any real number (positive, negative or zero).
The final result of the integrals is surprisingly simple: the expected value is equal to mu and the variance is equal to sigma squared (so the
standard deviation is equal to sigma).
It is worth noting that for any symmetric distribution (not just for normal), the expected value is equal to the central value.
Furthermore, in symmetric distributions the central value is also the median of the distribution, since it leaves a 50% probability to
the left and to the right.
If I want to construct a symmetric interval around mu, containing 95 percent of the probability, this would be the interval
between mu-1.96sigma and mu + 1.96sigma.
Increasing the width of the interval, the probability also increases: the probability that a normal random variable takes on values
between mu-2.57sigma and mu + 2.57sigma is 99%.
Obviously, reducing the width of the interval, the probability is also reduced: the probability that a normal random variable
takes on values between mu-1.64sigma and mu + 1.64sigma is equal to 90%. The three values 1.64, 1.96 and 2.57 are very
useful in practice, and should be memorized.
This is a normal in which mu = 0 and sigma = 1. By replacing these values in the previous figures, we obtain some useful
probabilities associated with the standard normal. Any normal random variable can be "standardized", i.e. transformed into a
standard normal.
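A quick sketch that recovers the three quantiles mentioned above and illustrates standardization (the values mu = 15,000 and sigma = 6,000 are just the car-price figures reused for illustration):

```python
from scipy.stats import norm

# Quantiles of the standard normal behind the 90%, 95% and 99% intervals
for level in (0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - level) / 2)
    print(f"{level:.0%} interval: mu +/- {z:.2f} * sigma")
    # approximately 1.64, 1.96 and 2.58 (rounded to 2.57 in the notes)

# Standardizing a normal variable: Z = (X - mu) / sigma is standard normal
mu, sigma = 15_000, 6_000          # car-price example values
x = 21_000
z = (x - mu) / sigma
print(f"z = {z:.2f}")              # 1.00: one standard deviation above the mean
print(f"P(mu - 1.96*sigma < X < mu + 1.96*sigma) = {norm.cdf(1.96) - norm.cdf(-1.96):.3f}")
```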
The Normal distribution is an appropriate model if our experience suggests a continuous, symmetric, unimodal, bell-shaped
distribution.
Consider, however, the question: "What is the probability of having a delivery time exceeding seven days?" The following graph shows data on delivery
times observed in the past.
Some economic variables present a right skewed distribution, but the shape is somewhat different from that seen
previously.
The Lognormal distribution presents right skewness. The distribution has two parameters, mu and sigma, which basically
represent the expected value and the standard deviation of the logarithm of the variable.
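As a sketch of how the delivery-time question could be answered with a Lognormal model: the parameters mu = 1.2 and sigma = 0.5 are invented for illustration (the notes do not report them), and scipy parametrizes the Lognormal with s = sigma and scale = exp(mu).

```python
import numpy as np
from scipy.stats import lognorm

# Hypothetical parameters of the log of the delivery time (not from the notes)
mu, sigma = 1.2, 0.5

delivery = lognorm(s=sigma, scale=np.exp(mu))
p_over_7 = delivery.sf(7)          # P(delivery time > 7 days)
print(f"P(delivery time > 7 days) = {p_over_7:.3f}")
```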
Descriptive statistics aims at summarizing a given set of data related to a certain phenomenon of interest through appropriate
indicators (mean, median, standard deviation, ...) and appropriate graphs (histogram, boxplot, ...)
When we do statistical inference, instead, we are interested in a very large set of individuals, which we call "population" or
"universe". Typically, we do not have the possibility to observe all the individuals in the population, because of time, cost, or
simply because they are not reachable. All we can do is to collect data about a subset of individuals randomly extracted from
the population of interest: we call this subset a "sample".
By independent sampling we mean that the result of one extraction does not influence the probabilities associated with the
others.
Without independence, for example, the second draw could be influenced by the result of the first, the third by the result of the second, and so on.
A typical inferential problem is parametric estimation. In essence, the problem is to use sample data to estimate the value of a
parameter, such as the average of the population.
The first aspect to consider when evaluating whether an estimator is appropriate, is its expected value. An estimator is defined "unbiased" if its
expected value is equal to the parameter we want to estimate.
For example, the sample average is an unbiased estimator for the population average. Another unbiased estimator would be given by the average of
just the first two observations, neglecting all the others. The sample minimum is instead a biased estimator for the population average, given that
obviously its expected value is lower than the average.
It can be proved that the sample average is the BLUE estimator (Best Linear Unbiased Estimator), i.e. it is the most efficient of all
unbiased linear estimators.
The statement "The average of your sample will be equal to 60, and the standard deviation of your sample will be 15" is incorrect. The sample average is a
random variable: its expected value is indeed equal to the population mean, but it will not be exactly 60 in any given sample. Moreover, the standard
deviation of the sample mean (the standard error) is equal to the population standard deviation divided by the square root of the sample size, not to the
population standard deviation itself.
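A small simulation sketch that illustrates the point, using the population values 60 and 15 from the statement and an arbitrary sample size n = 25:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n = 60, 15, 25          # population mean, population std, sample size

# Draw many samples and look at the distribution of the sample mean
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(f"average of the sample means = {sample_means.mean():.2f}  (close to mu = 60)")
print(f"std of the sample means     = {sample_means.std():.2f}  "
      f"(close to sigma/sqrt(n) = {sigma / np.sqrt(n):.2f}, not to sigma = 15)")
```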
A point estimate is a single number that represents our best conjecture about the true value of the parameter of interest. For
example, if we want to estimate the population average mu, the point estimate is given by the sample mean.
The sentence "the sample average is an unbiased and efficient estimator for the population average" means: by using the sample average to
estimate the population average we will not make systematic errors, and the absolute value of the estimation error will be, on average, as small
as possible. Why?
Unbiased Estimator: An estimator is unbiased if, on average, it equals the true parameter it is estimating. In the context of estimating the population
average, if the sample average is an unbiased estimator, it means that, on average, it accurately estimates the true population average. There is no
systematic overestimation or underestimation; any errors in estimation are due to random chance.
Efficient Estimator: Efficiency in estimation refers to the precision or accuracy of the estimator. An efficient estimator minimizes the variability of the
estimation errors. In this case, the sample average is efficient because, among all unbiased linear estimators, it has the smallest possible variance. Using
the sample average minimizes the spread of the estimation errors, making it a precise estimator.
So, when you use the sample average to estimate the population average:
Unbiasedness: You are not introducing systematic errors on average. The sample average, over many samples, will be centered around the true population average.
Efficiency: The spread of the estimation errors (the difference between the sample average and the true population average) is, on average, as small as possible. This
means that the sample average is a precise estimator, and the variability in your estimates is minimized.
In summary, an unbiased and efficient estimator, like the sample average, is desirable because it provides accurate estimates with minimal variability. This is crucial
in statistical inference, where we use sample information to make inferences about the population.
A confidence interval is a region, constructed around the point estimate, within which we expect to find the true value of the
parameter of interest, with a given confidence level (that is, with a given probability).
The figure illustrates in a very general way the idea of a confidence interval for the parameter mu (population mean) based on
the sample mean, “x-bar”. The confidence level is usually 95%.
We can then "standardize" the sample mean: the standardized sample mean will have normal distribution with an expected
value of 0 and standard deviation equal to 1.
Therefore, under our assumptions, we can say that the standardized sample average will be between -1.96 and 1.96 with
probability 95%. This is the starting point for calculating the 95% confidence interval, i.e. the interval within which the
population average falls with 95% probability.
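A minimal sketch of the resulting 95% confidence interval, x-bar ± 1.96 · s / sqrt(n), computed on simulated data:

```python
import numpy as np

rng = np.random.default_rng(7)
sample = rng.normal(loc=1000, scale=80, size=100)   # simulated data, e.g. light-bulb lifetimes

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)              # sample standard deviation (divides by n - 1)
half_width = 1.96 * s / np.sqrt(n)

print(f"95% CI for mu: [{x_bar - half_width:.1f}, {x_bar + half_width:.1f}]")
```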
A very useful result to address this problem is the “central limit theorem”.
Based on this theorem we can state that even if the population is not normal, the sample mean is "asymptotically" normal.
This means that as n increases, the distribution of the sample average becomes more and more similar to normal.
Basically, if n is large enough, we can use the normal to construct the confidence interval even if the population is not
normal. It is important to specify that even if sigma is not known but must be estimated, the asymptotic distribution is
still normal.
There is no univocal answer. It depends on how far from normality the distribution of the population is. If it is almost normal, 30
observations might be enough. If there is a moderate skewness (whether positive or negative) it would be preferable to have a
hundred observations. If skewness is very strong, it is better to have at least 250 observations.
To make this decision, we observe that the width of the 95% asymptotic confidence interval, say W, is about 4 times the
standard deviation divided by the square root of n. Therefore, if we have a target precision level, that is, a width W* that we
consider as satisfactory for our purposes, to find the optimal sample size we just have to set the width of the interval equal to
W* and solve the equation in n.
So we get a simple formula to define the sample size as a function of the target width. Let's see an example: suppose that,
based on a pilot sample of moderate size - say 50 observations, we have obtained a preliminary estimate of the standard
deviation, for example sigma tilde = 20. If the target width of the confidence interval is W* = 5, it is easy to calculate that we
need 256 observations to obtain the desired level of precision. It will therefore be necessary to add about 200 observations to
the 50 we already have.
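The computation of the example, written out as a sketch (the factor 4 is twice the 1.96 of the 95% interval, rounded up, as stated above):

```python
import math

sigma_tilde = 20     # preliminary estimate of the standard deviation (pilot sample)
target_width = 5     # desired width W* of the 95% confidence interval

# W ~= 4 * sigma / sqrt(n)  =>  n = (4 * sigma / W*)^2
n_required = math.ceil((4 * sigma_tilde / target_width) ** 2)
print(f"required sample size: {n_required}")     # 256, as in the example
```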
What is a test? A test is a statistical procedure aimed at deciding whether a hypothesis is true or false, using sample data.
What is a hypothesis? A hypothesis is a statement about some aspect of the distribution of a variable in the population of
interest.
For example: consider the life of light bulbs produced by a given manufacturer. The statement "the average life of light bulbs is equal
to 1000 hours", reported on the package may be seen as a hypothesis, that is a statement about some aspect of the distribution (the
average). Imagine you are hired by a consumer association to verify whether this hypothesis is true or false. Usually hypotheses are formally
written in this way: H0: mu = 1000 versus HA: mu ≠ 1000.
The hypothesis we want to test is called “null hypothesis”, and denoted with “H zero”. In our example the null hypothesis is that the mean, mu,
is equal to 1000. We call alternative hypothesis, denoted by “H A”, the hypothesis that is true if the null hypothesis is false.
In practice, the p-value can be read as a measure of how plausible the null hypothesis is in light of the data (formally, it is the probability, computed assuming
the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed). Therefore, the decision rule is very simple.
If the p-value is low (usually if it is lower than 5%) we conclude that the hypothesis is unlikely to be true, and therefore we reject it.
On the other hand, if the p-value is high (greater than 5%) we conclude that the hypothesis could be true, and therefore we do not
reject it.
In order to reach the decision we need the hypothesized value, the sample size, the sample standard deviation and the sample mean.
In essence, the p-value is the probability, under the null hypothesis, that the absolute value of the test is greater than the computed value. This probability is given by the
area shaded in green in these plots. In the plot on the left the value of the test, T, is close to zero (i.e. the distance between the sample average
and the assumed average "mu zero" is small) and therefore the p-value is high (greater than alpha): the null hypothesis is therefore accepted.
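A sketch of the two-sided test and its p-value under the normal approximation; the sample summaries used here (x-bar = 970, s = 120, n = 50) are invented for illustration and are not the lecture's numbers:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sample summaries (illustrative values only)
mu_0 = 1000          # value under the null hypothesis
x_bar = 970          # sample mean
s = 120              # sample standard deviation
n = 50               # sample size

t = (x_bar - mu_0) / (s / np.sqrt(n))          # test statistic
p_value = 2 * (1 - norm.cdf(abs(t)))           # two-sided p-value

print(f"T = {t:.2f}, two-sided p-value = {p_value:.4f}")
# Reject H0 if the p-value is below the chosen significance level (e.g. 5%).
```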
Let’s now introduce the concepts of one-sided hypothesis and one-tailed tests.
The case of a null hypothesis represented by an inequality is defined as a "unilateral" or "one-sided" hypothesis. Assume, for
example, that you want to test whether the average of the population is greater than or equal to a given value "mu zero".
Clearly, if the sample average is greater than “mu zero”, even much larger, the evidence would be in favor of the null hypothesis,
so I wouldn’t have any reason to reject it. If instead the sample average is less than “mu zero”, and if the distance is large
enough, then the evidence would be against “H zero”, and therefore we should reject it.
For example, if I were the light bulb manufacturer, I would like to reject the hypothesis that the average lifetime is less than 1000, because if that were the case, the
consumer association could sue me. So I will set as the null hypothesis that mu is less than or equal to 1000, and then I will reject only in the right
tail. Unfortunately, as we have seen, in this case the test is equal to negative 6.06, therefore I cannot reject the null hypothesis.
Sometimes we are interested in comparing the averages of two populations, which we could define by “mu one” and “mu two”. Also in this case,
the test can be used both for bilateral hypotheses (equal averages versus different averages) and for unilateral hypotheses. To test the
hypothesis, we have to collect a sample of size “n one” from the first population, and a sample of size “n two” from the second population. We
then compute mean and variance for each of the two samples. In a fairly intuitive way, the test is based on the difference between the two
sample averages, divided by a suitable indicator of variability, which is needed to evaluate whether the distance between the sample averages is large
enough.
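A sketch using scipy's summary-statistics version of the two-sample t test; all the numbers are invented for illustration:

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical summary statistics for the two samples (illustrative values only)
result = ttest_ind_from_stats(
    mean1=1020, std1=90, nobs1=40,    # sample from population 1
    mean2=980, std2=110, nobs2=50,    # sample from population 2
    equal_var=False,                  # Welch version: does not assume equal variances
)
print(f"test statistic = {result.statistic:.2f}, two-sided p-value = {result.pvalue:.4f}")
```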
The linear regression model is a tool used to study the relationship between 2 or more variables.
Specifically, we define “simple regression” the situation in which only two variables are involved, and “multiple regression” the case in which
more than two variables are considered.
Usually the two variables are denoted by Y and X. Y is referred to as “dependent variable” or “output variable”. It represents the focus of the
analysis, being the variable that we want to explain, model, interpret, predict. The variable X is instead referred to as “explanatory variable”, or
“regressor”, or “independent variable”, or “right hand side variable”. Basically the aim of the model is to analyze how Y varies as X varies.
An example could be: Y is the annual consumption of goods and services in a household and X is the income of the household. The
intuition is that these two variables are positively related, and we want to use a sample of households extracted from the population
of interest (say, all Italian households) to precisely quantify the relationship, so that we can measure to what extent an increase in
income translates into an increase in consumption, and to predict the level of consumption of a household given its income.
Essentially the linear regression model assumes that the relationship between Y and X is linear: Y = beta-zero + beta-one · X + epsilon, where epsilon is an error term with zero expected value and standard deviation sigma.
Therefore, the model can be represented as a straight line in the X-Y plane. The intercept, here denoted by beta-zero,
and the slope denoted by beta-one, are unknown parameters. The purpose of the statistical analysis is to estimate
these parameters using sample data. Note that the graph shows an increasing straight line for illustrative purposes:
however, both beta-zero and beta-one can be positive, negative or zero and the purpose of the analysis is to estimate
them.
Sigma represents the “typical” distance of Y from its conditional expected value “beta-zero plus beta-one times X”. Sigma is a
very important parameter: if the regression line is used to predict Y (for example, when X is equal to “X star”), sigma tells us
what is the “usual” distance of the observed value from the expected value.
How can we estimate the unknown parameters beta-zero, beta-one and sigma?
First of all, we collect data, which can be represented as a scatter plot in the X-Y plane (the black dots in the figure).
Next, among the infinitely many lines that can be drawn in the plane, we identify the one which is closest to the dots. The
intuition is very simple: the graph represents just three lines, and the green one is clearly the best fit to the data. Addressing the
problem mathematically, it is easy to identify the best of all straight lines, by solving a simple optimization problem.
Technically, the "Ordinary Least Squares" (or OLS) estimates of beta-zero and beta-one are obtained by solving the optimization
problem that consists in minimizing the sum of the squared distances of the observed values from the straight line.
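A minimal numpy sketch of the OLS estimates for simple regression, using the closed-form solutions beta-one-hat = cov(X, Y) / var(X) and beta-zero-hat = y-bar − beta-one-hat · x-bar; the income/consumption data are simulated:

```python
import numpy as np

rng = np.random.default_rng(3)
income = rng.uniform(20_000, 80_000, size=200)                   # simulated household income (X)
consumption = 5_000 + 0.6 * income + rng.normal(0, 4_000, 200)   # simulated consumption (Y)

# OLS estimates of the intercept and the slope
x_bar, y_bar = income.mean(), consumption.mean()
beta1_hat = np.sum((income - x_bar) * (consumption - y_bar)) / np.sum((income - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Estimate of sigma: typical distance of the observations from the fitted line
residuals = consumption - (beta0_hat + beta1_hat * income)
sigma_hat = np.sqrt(np.sum(residuals ** 2) / (len(income) - 2))

print(f"beta0_hat = {beta0_hat:.1f}, beta1_hat = {beta1_hat:.3f}, sigma_hat = {sigma_hat:.1f}")
```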