ApplStat2007ZK
TEXTBOOK
BRNO 2007
(Second Edition)
Doc. RNDr. Zdeněk Karpíšek, CSc.
Department of Statistics and Optimization
Institute of Mathematics
Faculty of Mechanical Engineering
Brno University of Technology
Technická 2, 616 69 Brno
Phone: +420 541 142 529
E-mail: [email protected]
CONTENT
PREFACE
4. INDEX NUMBERS
Basic notions
Simple index numbers
Group indexes
Composite indexes
Index numbers and absolute quantities
Basic and chain indexes
Exercises
Questions
BIBLIOGRAPHY
PREFACE
Modern times have created modern problems and many of those problems
involve data. Marketing studies, product testing and quality control are typical
application areas that require an intelligent analysis of data. In particular,
business and manufacturing now demand employers as well as employees who are
better prepared to use statistics.
Probability plays a special role in all our lives, because we use it to measure
uncertainty. We are continually faced with decisions leading to uncertain outcomes
and we rely on probability to help us make our choice. A probability is a numerical
value that measures the uncertainty that a particular event will occur. The probability
of an event ordinarily represents the proportion of times under identical
circumstances that the event can be expected to occur.
When data for evaluations are collected at regular intervals from monthly,
quarterly, or annual reports, they are referred to as time-series data. In each case,
values of the variable being predicted are available for several past periods of time.
Statistical procedures that use such values are called time-series analysis.
Chapter No4: Index Numbers
1. FUNDAMENTALS OF PROBABILITY
Random events
and $\bigcup_{i=1}^{\infty} A_i$.
b) The intersection $A \cap B$ of random events A and B occurs when both events
occur. Similarly, we define $\bigcap_{i=1}^{n} A_i$ and $\bigcap_{i=1}^{\infty} A_i$.
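$\emptyset, \Omega \in \Sigma$,
$A \in \Sigma \Rightarrow \bar{A} \in \Sigma$,
$A_i \in \Sigma,\ i = 1, 2, \dots \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \Sigma$.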
E x a m p l e 2.1
A B, A B, B A, 6.
S o l u t i o n:
The sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$ with elementary events {1}, {2}, {3}, {4}, {5},
{6}. Next we have A = {2, 4, 6} and B = {5, 6} so that
$B - A = \{5, 6\} - \{2, 4, 6\} = \{5\}$ ... number 5 comes up.
Since no restrictions are imposed on random events, we can consider the maximal
field of events (the system of all subsets of $\Omega$):
$\Sigma = \{\emptyset, \{1\}, \{2\}, \{3\}, \{4\}, \{5\}, \{6\}, \{1,2\}, \{1,3\}, \dots, \{5,6\}, \dots, \{2, 3, 4, 5, 6\}, \Omega\}$
which contains $2^6 = 64$ random events.
It holds:
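a) $P(\bar{A}) = 1 - P(A)$; $P(\emptyset) = 0$; $0 \le P(A) \le 1$.
c) $P(A_1 \cup \dots \cup A_n) = 1 - P(\bar{A}_1 \cap \dots \cap \bar{A}_n) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \dots + (-1)^{n-1} P(A_1 \cap \dots \cap A_n)$.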
For a finite or countable sample space $\Omega$ (i.e. its elementary events $\{\omega\}$ may
be listed in a sequence) we get
$P(A) = \sum_{\omega \in A} P(\{\omega\})$.
the number of outcomes of the experiment". This is the so-called classical definition
of probability.
E x a m p l e 2.2
Calculate the probabilities $P(A)$, $P(B)$, $P(\bar{A})$, $P(\bar{B})$, $P(A \cup B)$, $P(A \cap B)$, $P(A - B)$,
$P(B - A)$ of the random events from Example 2.1.
S o l u t i o n:
Due to the symmetry and homogeneity of the die, all the elementary events have
the same probability $P(\{\omega\}) = 1/6$ and n = 6. This yields the following probabilities:
$P(A) = 3/6 = 1/2$,
$P(\bar{B}) = 4/6 = 2/3$,
$P(A \cap B) = 1/6$,
$P(B - A) = 1/6$.
$P(\bar{A}) = 1 - P(A) = 1 - \frac{1}{2} = \frac{1}{2}$,
$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{1}{2} + \frac{1}{3} - \frac{1}{6} = \frac{2}{3}$.
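The probabilities of this example can be checked by counting outcomes directly. The following is only a minimal sketch of the classical definition of probability; the helper name `prob` and the dictionary labels are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction

# Sample space and events as given in Example 2.1.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {5, 6}

def prob(event):
    """Classical probability: favourable outcomes over all outcomes."""
    return Fraction(len(event), len(omega))

results = {
    "P(A)": prob(A),
    "P(B)": prob(B),
    "P(not A)": prob(omega - A),
    "P(not B)": prob(omega - B),
    "P(A or B)": prob(A | B),
    "P(A and B)": prob(A & B),
    "P(A - B)": prob(A - B),
    "P(B - A)": prob(B - A),
}

for name, value in results.items():
    print(f"{name} = {value}")
```

Exact fractions are used so that the values can be compared with the hand calculation without rounding.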
E x a m p l e 2.3
In a supply of 100 shafts, 10 items do not comply with the standard diameter,
20 items do not have the required length and 5 items comply neither with the length nor
with the diameter requirement. Calculate the probability that a shaft selected at
random has both the required length and diameter.
S o l u t i o n:
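Denoting by A and B the events that the selected shaft does not comply with the
required diameter and length respectively, we have P(A) = 0.10, P(B) = 0.20 and
$P(A \cap B) = 0.05$. The probability that the selected shaft has both the required
length and diameter is
$P(\bar{A} \cap \bar{B}) = 1 - P(A \cup B) = 1 - [P(A) + P(B) - P(A \cap B)] = 1 - (0.10 + 0.20 - 0.05) = 0.75$.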
$P(A/B) = \frac{P(A \cap B)}{P(B)}$.
It holds:
i = 1, …, n, the probability
$P(A) = \sum_{i=1}^{n} P(B_i) P(A/B_i)$,
$P(B_j/A) = \frac{P(B_j) P(A/B_j)}{\sum_{i=1}^{n} P(B_i) P(A/B_i)}$, j = 1, …, n.
E x a m p l e 2.4
Ten products out of a total of 100 are defective. We choose 3 products at random
without replacement. The probability that the first product chosen is not defective
(random event $A_1$), the second product chosen is not defective (random event $A_2$),
and the third product chosen is defective (random event $A_3$) is calculated below:
$P(A_1 \cap A_2 \cap A_3) = P(A_1) P(A_2/A_1) P(A_3/A_1 \cap A_2) = \frac{90}{100} \cdot \frac{89}{99} \cdot \frac{10}{98} \approx 0.08256$.
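The product of conditional probabilities can be verified with exact arithmetic. This is a small sketch that simply multiplies the three factors as they are printed in the calculation of Example 2.4; the variable names are illustrative.

```python
from fractions import Fraction

# Factors exactly as printed in the calculation of Example 2.4.
factors = [Fraction(90, 100), Fraction(89, 99), Fraction(10, 98)]

probability = Fraction(1)
for factor in factors:
    probability *= factor   # multiplication rule for dependent events

print(probability, float(probability))
```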
E x a m p l e 2.5
a) it is defective,
S o l u t i o n:
Denote by A the random event that the croissant bought is defective and by Bi,
i = 1, 2, 3, the event that the croissant has been supplied by the i-th bakehouse. We
get the following probabilities
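$P(B_1) = \frac{500}{500 + 1000 + 1500} = \frac{1}{6}$, $P(A/B_1) = 0.05$,
$P(B_2) = \frac{1000}{500 + 1000 + 1500} = \frac{2}{6}$, $P(A/B_2) = 0.04$,
$P(B_3) = \frac{1500}{500 + 1000 + 1500} = \frac{3}{6}$, $P(A/B_3) = 0.03$.
$P(A) = \frac{1}{6}(0.05) + \frac{2}{6}(0.04) + \frac{3}{6}(0.03) = \frac{0.22}{6} = 0.03\overline{6} \approx 0.03667$,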
so that, from the customer's point of view, the chance of buying a defective croissant
is approximately 3.667%. Applying Bayes' theorem for j = 2, we have
$P(B_2/A) = \frac{\frac{2}{6}(0.04)}{\frac{0.22}{6}} = \frac{0.08}{0.22} = 0.\overline{36} \approx 0.36364$.
In a similar way, we can obtain $P(B_1/A) \approx 0.22727$ and $P(B_3/A) \approx 0.40909$, which
means that a defective croissant most probably comes from the third bakehouse even
though its percentage of defective products is the lowest of the three. This is due to
the fact that it supplies the largest part of the croissants.
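The total probability rule and Bayes' theorem used in this example can be sketched in a few lines. The quantities (500, 1000 and 1500 croissants, defect rates 0.05, 0.04 and 0.03) are taken from the solution; the variable names are assumptions made for illustration.

```python
from fractions import Fraction

supplied = [500, 1000, 1500]   # croissants supplied by bakehouses 1, 2, 3
defect_rate = [Fraction(5, 100), Fraction(4, 100), Fraction(3, 100)]  # P(A/B_i)

total = sum(supplied)
prior = [Fraction(s, total) for s in supplied]   # P(B_i)

# Total probability rule: P(A) = sum of P(B_i) * P(A/B_i).
p_defective = sum(p * d for p, d in zip(prior, defect_rate))

# Bayes' theorem: P(B_j/A) = P(B_j) * P(A/B_j) / P(A).
posterior = [p * d / p_defective for p, d in zip(prior, defect_rate)]

print(float(p_defective))
print([round(float(q), 5) for q in posterior])
```

The posterior probabilities sum to one, which is a quick sanity check on the calculation.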
$A_i$, $A_j \cap A_k$ for $i \ne j$, $i \ne k$,
etc.
are independent.
E x a m p l e 2.6
What is the probability that, when throwing a die, an even number comes up in the
first trial, (random event A) and, in the second trial, an odd number comes up
(random event B)?
S o l u t i o n:
Random events A and B are independent and their probabilities are P(A) = P(B) = 1/2,
so that $P(A \cap B) = (1/2)(1/2) = 1/4$.
E x a m p l e 2.7
S o l u t i o n:
Since the operations are independent, the random events A1, A2, A3 are mutually
independent and the product is defective if at least one of them occurs so that
c) $\lim_{x \to -\infty} F(x) = 0$, $\lim_{x \to +\infty} F(x) = 1$,
A random variable X is said to be discrete with a discrete distribution of
probabilities if it takes on at most a countable number of values x = x1, x2,… . Its
probability distribution is given in the form of a sequence
We have:
a) $\sum_{x} p(x) = 1$,
b) $F(x) = \sum_{t < x} p(t)$ for all $x \in (-\infty; +\infty)$,
The distribution function for a discrete random variable is a "step-like line with jumps"
– see Fig. 2.1.
[Figure: two panels plotted against X; (a) probability mass, (b) cumulative probability]
Fig. 2.1 The graphs of the distribution of probabilities (a) and the distribution function
(b) of a discrete variable.
E x a m p l e 2.8
The probability of failure for each of three independently operating production lines is
p, 0 < p < 1. The discrete random variable X that expresses the number of production
lines with a failure takes on the values x = 0, 1, 2, and 3 and the values of its
distribution of probabilities are given below
$p(0) = (1 - p)^3$,
$p(1) = 3p(1 - p)^2$,
$p(2) = 3p^2(1 - p)$,
$p(3) = p^3$.
In Fig. 2.1 you can see the graphs of p(x) and F(x) for p = 0.5. The probability of a
failure occurring in at least one of the production lines is given by
$P(X \ge 1) = 1 - p(0) = 1 - (1 - p)^3$.
$F(x) = \int_{-\infty}^{x} f(t)\,dt$ for all $x \in (-\infty; +\infty)$.
a) $\int_{-\infty}^{+\infty} f(x)\,dx = 1$,
E x a m p l e 2.9
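A real variable X has a density function f(x) = cx for $x \in \langle 0; 2\rangle$ and f(x) = 0 for $x \notin \langle 0; 2\rangle$.
Using the properties of a continuous random variable, we can derive the following
results. We have
$1 = \int_{-\infty}^{+\infty} f(x)\,dx = \int_{-\infty}^{0} 0\,dx + \int_{0}^{2} cx\,dx + \int_{2}^{+\infty} 0\,dx = 2c$, so that $c = 1/2$,
$F(x) = \int_{-\infty}^{0} 0\,dt + \int_{0}^{x} \frac{t}{2}\,dt = \dots = \frac{x^2}{4}$ for $x \in \langle 0; 2\rangle$,
$F(x) = \int_{-\infty}^{0} 0\,dt + \int_{0}^{2} \frac{t}{2}\,dt + \int_{2}^{x} 0\,dt = \dots = 1$ for $x \in \langle 2; +\infty)$.
The graphs of f(x) and F(x) are shown in Fig. 2.2. The probability of the random
variable taking on a value $x \in \langle 1; 3\rangle$ is $P(1 \le X \le 3) = F(3) - F(1) = 1 - (1^2/4) = 0.75$.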
[Figure: two panels plotted against x; (a) f(x), (b) F(x)]
Fig. 2.2 Graphs of the density function (a) and the distribution function (b) of a
continuous random variable.
Numerical characteristics of random variables
$E(X) = \sum_{x} x\,p(x)$ for discrete random variable X,
$E(X) = \int_{-\infty}^{+\infty} x f(x)\,dx$ for continuous random variable X,
provided that the sum or the integral is absolutely convergent. The expected value
has the following properties:
b) $E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i)$ for random variables X1,…,Xn.
$D(X) = E\left[(X - E(X))^2\right]$.
The variance has the following properties:
a) $D(X) = \sum_{x} (x - E(X))^2 p(x) = \sum_{x} x^2 p(x) - (E(X))^2$ for discrete random variable X,
b) $D(X) = \int_{-\infty}^{+\infty} (x - E(X))^2 f(x)\,dx = \int_{-\infty}^{+\infty} x^2 f(x)\,dx - (E(X))^2$ for continuous random variable X,
c) $D(X) \ge 0$,
e) $D\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} D(X_i)$ for independent random variables X1,…, Xn.
The standard deviation of a random variable X is defined as $\sigma(X) = \sqrt{D(X)}$.
a) $\sigma(X) \ge 0$;
b) $\sigma(aX + b) = |a|\,\sigma(X)$ for arbitrary real numbers a, b.
E x a m p l e 2.10
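The random variable X from Example 2.9 has the expected value
$E(X) = \int_{-\infty}^{0} x \cdot 0\,dx + \int_{0}^{2} x \cdot \frac{x}{2}\,dx + \int_{2}^{+\infty} x \cdot 0\,dx = \dots = \frac{4}{3} \approx 1.33333$,
the variance
$D(X) = \int_{-\infty}^{0} x^2 \cdot 0\,dx + \int_{0}^{2} x^2 \cdot \frac{x}{2}\,dx + \int_{2}^{+\infty} x^2 \cdot 0\,dx - \left(\frac{4}{3}\right)^2 = 2 - \frac{16}{9} = \frac{2}{9} \approx 0.22222$,
and the standard deviation
$\sigma(X) = \sqrt{\frac{2}{9}} \approx 0.47140$.
The P-quantile $x_P$ is the root of the equation $\frac{x^2}{4} = P$ that lies in the interval $\langle 0; 2\rangle$,
which means that $x_P = 2\sqrt{P}$ so that the median of X is $x_{0.5} = 2\sqrt{0.5} \approx 1.41421$. From
the graph of f(x) in Fig. 2.2, we can see that the mode of X is $\hat{x} = 2$.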
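The characteristics of this continuous random variable can also be approximated numerically. The sketch below assumes the density f(x) = x/2 on the interval from 0 to 2 (that is, c = 1/2) and uses a simple midpoint rule; the function `integrate` is a hypothetical helper written for this illustration, not a library routine.

```python
import math

def f(x):
    """Density from Example 2.9, assuming c = 1/2."""
    return x / 2 if 0 <= x <= 2 else 0.0

def integrate(g, a, b, steps=100_000):
    """Midpoint rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))

total = integrate(f, 0, 2)                                    # should be close to 1
mean = integrate(lambda x: x * f(x), 0, 2)                    # E(X)
variance = integrate(lambda x: x * x * f(x), 0, 2) - mean ** 2  # D(X)
std = math.sqrt(variance)
median = 2 * math.sqrt(0.5)                                   # root of x^2/4 = 0.5

print(total, mean, variance, std, median)
```

Only the interval where the density is non-zero is integrated, since the remaining parts of the integrals contribute nothing.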
Random vectors and their probability distributions
c) $\lim_{(x,y) \to (+\infty,+\infty)} F(x, y) = F(+\infty, +\infty) = 1$,
a) $\sum_{x} \sum_{y} p(x, y) = 1$,
We have:
a) $\int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x, y)\,dx\,dy = 1$,
b) $f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y}$ provided that the partial derivative exists,
If, in a random vector (X,Y), we leave out its constituent X or Y (that is, we
assume that this random variable takes on any value), we get marginal probability
distributions for Y or X. For marginal distributions of probability, density functions,
and distribution functions we have
a) $p_X(x) = \sum_{y} p(x, y)$, $p_Y(y) = \sum_{x} p(x, y)$ for a discrete random vector,
b) $f_X(x) = \int_{-\infty}^{+\infty} f(x, y)\,dy$, $f_Y(y) = \int_{-\infty}^{+\infty} f(x, y)\,dx$ for a continuous random vector,
a) $p_X(x/y) = \frac{p(x, y)}{p_Y(y)}$, $p_Y(y/x) = \frac{p(x, y)}{p_X(x)}$ for a discrete random vector,
b) $f_X(x/y) = \frac{f(x, y)}{f_Y(y)}$, $f_Y(y/x) = \frac{f(x, y)}{f_X(x)}$ for a continuous random vector.
Random variables X and Y are said to be independent if, for all pairs (x,y),
Numerical characteristics of random vectors
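a) $E(X) = \sum_{x} x\,p_X(x) = \sum_{x} \sum_{y} x\,p(x, y)$, $E(Y) = \sum_{y} y\,p_Y(y) = \sum_{x} \sum_{y} y\,p(x, y)$ for a discrete random vector,
b) $E(X) = \int_{-\infty}^{+\infty} x f_X(x)\,dx = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x f(x, y)\,dx\,dy$, $E(Y) = \int_{-\infty}^{+\infty} y f_Y(y)\,dy = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} y f(x, y)\,dx\,dy$ for a continuous random vector,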
b) cov(X,Y) = cov(Y,X),
c) cov(X,X) = D(X),
The covariances of a random vector (X,Y) can be displayed in the form of a
symmetrical covariance matrix cov(X,Y) – see [1], [2], [3] and [4].
$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{D(X) D(Y)}}$.
a) $\rho(X,Y) = \rho(Y,X)$,
b) $\rho(X,X) = \rho(Y,Y) = 1$,
c) $-1 \le \rho(X,Y) \le 1$,
d) $\rho(aX + b, cY + d) = \frac{ac}{|ac|}\,\rho(X, Y)$ for arbitrary real numbers a, b, c, d, $ac \ne 0$,
f) X, Y independent $\Rightarrow$ $\rho(X, Y) = 0$.
If $\rho(X,Y) = 0$, we say that the random variables X and Y are not correlated.
Independent random variables are not correlated but two random variables
that are not correlated are not necessarily independent. However, if the random
vector (X,Y) has a joint normal distribution, uncorrelated X and Y are also
independent. The correlation coefficients of a random vector (X,Y) can be displayed
in the form of a symmetrical correlation matrix $\rho(X,Y)$ - see [1], [2], [3] and [4].
E x a m p l e 2.11
          x
  y      0     1     2     3
 -1     2c     c     0     0
  0      c    2c     c     0
  1      0     0    2c     c
Calculate c, F(2;0), $p_X(1)$, $F_Y(1)$, $p_X(x/y)$ for (x,y) = (1;0), E(X), E(Y), D(X), D(Y),
cov(X,Y), $\rho(X,Y)$ and determine if the random variables X,Y are independent.
S o l u t i o n:
$D(X) = (0^2)(0.2) + (0^2)(0.1) + \dots + (3^2)(0) + (3^2)(0.1) - 1.2^2 = 2.4 - 1.44 = 0.96$;
$D(Y) = (-1)^2(0.2) + (-1)^2(0.1) + \dots + 1^2(0.2) + 1^2(0.1) - 0^2 = 0.6 - 0 = 0.6$;
$\rho(X,Y) = \frac{0.6}{\sqrt{(0.96)(0.6)}} \approx 0.79057 \ne 0$, which means that X, Y are not
independent.
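The numerical characteristics of this random vector can be recomputed from the table. The following sketch assumes the table layout with x = 0, 1, 2, 3 in columns and y = -1, 0, 1 in rows, and determines c from the condition that all probabilities sum to one; the variable names are illustrative.

```python
import math
from fractions import Fraction

xs = [0, 1, 2, 3]
ys = [-1, 0, 1]
# Multiples of c from the table of Example 2.11, one row per value of y.
weights = {-1: [2, 1, 0, 0], 0: [1, 2, 1, 0], 1: [0, 0, 2, 1]}

# The probabilities must sum to 1, which fixes the constant c.
c = Fraction(1, sum(sum(row) for row in weights.values()))
p = {(x, y): c * weights[y][i] for y in ys for i, x in enumerate(xs)}

p_x = {x: sum(p[x, y] for y in ys) for x in xs}   # marginal distribution of X
p_y = {y: sum(p[x, y] for x in xs) for y in ys}   # marginal distribution of Y

e_x = sum(x * q for x, q in p_x.items())
e_y = sum(y * q for y, q in p_y.items())
d_x = sum(x * x * q for x, q in p_x.items()) - e_x ** 2
d_y = sum(y * y * q for y, q in p_y.items()) - e_y ** 2
cov = sum(x * y * q for (x, y), q in p.items()) - e_x * e_y
rho = float(cov) / math.sqrt(float(d_x * d_y))

# X and Y are independent only if p(x, y) = p_X(x) * p_Y(y) for every pair.
independent = all(p[x, y] == p_x[x] * p_y[y] for x in xs for y in ys)

print(c, e_x, e_y, d_x, d_y, cov, round(rho, 5), independent)
```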
Some important probability distributions
$p(x) = \binom{n}{x} p^x (1 - p)^{n - x}$, x = 0, 1, …, n;
This is the probability distribution of a random variable that expresses the number of
occurrences of an observed event in the sequence of n mutually independent trials
(such as the number x of defective products out of a total of n products if p is the
probability of a defective product being manufactured). This probability distribution
may be employed when performing a random sample with replacement such as
checking n products from a supply and replacing each product after it has been
checked. For np(1 – p) > 9 the binomial distribution may be approximated by a
normal distribution where $\mu = np$, $\sigma^2 = np(1 - p)$. For p < 0.1 and n > 30 we can also
approximate this probability distribution by the Poisson probability distribution with
$\lambda = np$. The graphs of the distribution of probability and the distribution function of the
binomial distribution for n = 3 and p = 0.5 are shown in Fig. 2.1.
E x a m p l e 2.12
S o l u t i o n:
The random variable X has the binomial distribution Bi(n,p) with n = 3 and p = 5/50 = 0.1.
X takes on the values x = 0, 1, 2, 3. The distribution of probabilities is
$p(x) = \binom{3}{x}\, 0.1^x\, 0.9^{3 - x}$ for x = 0, 1, 2, 3.
$p(x) = \frac{\binom{M}{x} \binom{N - M}{n - x}}{\binom{N}{n}}$, x = max{0, M – N + n}, …, min{M, n};
$E(X) = n\frac{M}{N}$; $D(X) = n\frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N - n}{N - 1}$; $a - 1 \le \hat{x} \le a$ where $a = \frac{(M + 1)(n + 1)}{N + 2}$.
E x a m p l e 2.13
Among a total of 50 products there are 5 defective ones. Three products are drawn at
random out of the total. The number of defective products among those selected is a
random variable X. Determine the type of its probability distribution, its distribution of
probabilities p(x), expected value E(X), variance D(X), standard deviation $\sigma(X)$,
median $x_{0.5}$, mode $\hat{x}$, and $P(1 < X \le 3)$. Assume, as opposed to Example 2.12, that a
product is not replaced once it has been selected so that this is a random sample
without replacement.
S o l u t i o n:
The random variable X has the hypergeometric probability distribution H(N,M,n) with
N = 50, M = 5 and n = 3. X takes on the values x = 0, 1, 2, and 3. The distribution of
probabilities is given by
$p(x) = \frac{\binom{5}{x} \binom{45}{3 - x}}{\binom{50}{3}}$ for x = 0, 1, 2, 3.
The expected value $E(X) = n\frac{M}{N} = (3)(0.1) = 0.3$,
the variance $D(X) = n\frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N - n}{N - 1} = (3)(0.1)(0.9)(47/49) \approx 0.25898$,
the mode $\hat{x} = 0$ since $a = \frac{(M + 1)(n + 1)}{N + 2} \approx 0.46154$, $a - 1 \approx -0.53846$,
$p(x) = \frac{\lambda^x}{x!} e^{-\lambda}$, x = 0, 1, … ; $E(X) = \lambda$; $D(X) = \lambda$; $\lambda - 1 \le \hat{x} \le \lambda$.
E x a m p l e 2.14
On the average, three customers enter a shop within a given minute. Determine the
appropriate type of probability distribution of a random variable that expresses the
number of customers that enter the shop within a given minute, the expected number
of customers, the variance of this number, and the most likely number of customers
that enter the shop within a given minute. Next calculate the probability that within
that minute a) exactly one customer enters the shop, b) at least one customer enters
the shop.
S o l u t i o n:
If we approximate the expected number of customers that enter the shop within one
given minute by their average number, we can assume that the random variable X
has the Poisson probability distribution Po($\lambda$) with $\lambda = 3$ and the distribution of
probabilities given by
$p(x) = \frac{3^x}{x!} e^{-3}$, x = 0, 1, …
$P(X = 1) = p(1) = \frac{3^1}{1!} e^{-3} \approx 0.14936$,
$P(X \ge 1) = p(1) + p(2) + \dots = 1 - p(0) = 1 - \frac{3^0}{0!} e^{-3} \approx 1 - 0.04979 = 0.95021$.
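The two probabilities can be checked with a short calculation. This sketch assumes the Poisson parameter 3 from the example and uses only the standard library; the names `lam` and `p` are illustrative.

```python
from math import exp, factorial

lam = 3   # average number of customers per minute

def p(x):
    """Poisson probability of exactly x customers."""
    return lam ** x / factorial(x) * exp(-lam)

p_exactly_one = p(1)
p_at_least_one = 1 - p(0)

# The most likely values lie between lam - 1 and lam; here p(2) equals p(3).
most_likely = [x for x in range(10) if p(x) == max(p(k) for k in range(10))]

print(round(p_exactly_one, 5), round(p_at_least_one, 5), most_likely)
```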
Continuous probability distributions
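$f(x) = \frac{1}{b - a}$ for $x \in \langle a; b\rangle$,
$f(x) = 0$ for $x \notin \langle a; b\rangle$,
$F(x) = 0$ for $x \in (-\infty; a)$,
$F(x) = \frac{x - a}{b - a}$ for $x \in \langle a; b\rangle$,
$F(x) = 1$ for $x \in (b; +\infty)$,
$E(X) = x_{0.5} = \frac{a + b}{2}$, $D(X) = \frac{(b - a)^2}{12}$.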
The graphs of the density function and the distribution function for a = -1 and b = 2
are shown in Fig. 2.3. This probability distribution is mostly used to simulate real
processes, in numerical calculations to implement the so-called Monte Carlo method
on a computer, and for calculations using the so-called geometric probability.
E x a m p l e 2.15
An optical cable of a length of 500 m may be disrupted at any distance from its
beginning. The probability of the random event that the cable will be disrupted in a
given section is in direct proportion to the length of the section and is independent of
its position. Determine the probability distribution of the random variable X expressing
the distance of a disruption from the beginning of the cable, the density function, the
basic numerical characteristics, and the probability that the cable will be disrupted in
the section beginning at 300 m and ending at 400 m.
S o l u t i o n:
Random variable X has the probability distribution R(a, b) with a = 0 and b = 500.
The expected distance and the median $E(X) = x_{0.5} = \frac{0 + 500}{2} = 250$ m,
the variance $D(X) = \frac{(500 - 0)^2}{12} \approx 20\,833.3$ m²,
the standard deviation $\sigma(X) = \sqrt{D(X)} \approx \sqrt{20\,833.3} \approx 144.34$ m,
the probability $P(300 \le X \le 400) = F(400) - F(300) = \frac{400}{500} - \frac{300}{500} = 0.2$.
[Figure: two panels plotted against X; (a) probability density, (b) cumulative probability]
Fig. 2.3 The graphs of the density function (a) and the distribution function (b) of a
uniform probability distribution.
b) The normal probability distribution N($\mu$, $\sigma^2$) where $\mu$, $\sigma^2$ are real numbers,
$\sigma^2 > 0$:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$, $x \in (-\infty, +\infty)$;
For $\mu = 1$ and $\sigma = 1$, the graphs of the density function and the distribution function
are plotted in Fig. 2.4. This is the most widely used probability distribution, sometimes
also called the Gauss probability distribution. It is applied to random variables that can be
interpreted as the result of adding up a multitude of independent influences (such as the
error of a measurement, the size deviation of a product and the like). Using the
transformation
$U = \frac{X - \mu}{\sigma}$
we get the standard normal probability distribution N(0;1) whose distribution function
$\Phi(u)$ is tabulated (see Table T1) or its values are approximated. We have
$\Phi(-u) = 1 - \Phi(u)$.
For a random variable X with the normal probability distribution N($\mu$, $\sigma^2$) we have
$F(x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$,
and, for example, $P(\mu - 3\sigma \le X \le \mu + 3\sigma) \approx 0.9973$ (the so-called three-sigma rule).
E x a m p l e 2.16
What is the probability that a random variable X with the normal probability
distribution N(20, 16) will take on a value a) less than 16, b) greater than 20, c) from
12 to 28, d) less than 12 or greater than 28 ?
S o l u t i o n:
Using the formula $F(x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$ and table T1 we get
= 1 – 0.5 = 0.5 ;
c) $P(12 \le X \le 28) = F(28) - F(12) = \Phi((28 - 20)/4) - \Phi((12 - 20)/4) = \Phi(2) - \Phi(-2)$
$= \Phi(2) - (1 - \Phi(2)) = 2\Phi(2) - 1 = (2)(0.97725) - 1 = 0.9545$ ;
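Where table T1 is not at hand, the standard normal distribution function can be evaluated through the error function. The sketch below assumes the parameters of this example (mean 20, variance 16, hence standard deviation 4) and computes all four requested probabilities; the names `phi` and `F` are chosen for illustration.

```python
from math import erf, sqrt

def phi(u):
    """Distribution function of the standard normal distribution N(0;1)."""
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

mu, sigma = 20.0, 4.0   # N(20, 16) has standard deviation sqrt(16) = 4

def F(x):
    """Distribution function of N(mu, sigma^2) via standardisation."""
    return phi((x - mu) / sigma)

p_a = F(16)            # a) less than 16
p_b = 1 - F(20)        # b) greater than 20
p_c = F(28) - F(12)    # c) from 12 to 28
p_d = 1 - p_c          # d) less than 12 or greater than 28

print(round(p_a, 5), round(p_b, 5), round(p_c, 5), round(p_d, 5))
```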
[Figure: two panels plotted against X for the Normal(1, 1) distribution; (a) probability density function, (b) cumulative distribution function]
Fig. 2.4 The graphs of the density function (a) and the distribution function (b) of a
normal probability distribution
2. DESCRIPTIVE STATISTICS
Basic notions
When performing statistical analysis we deal with events and processes which
occur on a mass scale and can be found in a large set of individual objects such as
products or persons. We call this set a population. The objects under investigation
are called statistical items. We observe certain of their properties, called
statistical variables (such as parameters), whose outcomes or values we
record.
Quantitative variables are either discrete if they only take on discrete values
(number of defective products, number of faults, number of pieces, etc.) or
continuous if they assume all the values of an interval of real numbers (size of a
product, time to failure, price index and the like).
Statistical methods are based on the fact that information on the population is
not taken from all its elements but rather from a subset of the population defined by
taking a sample. This is due to certain limitations such as accessibility of all the
statistical items, large size of the population, the way the information is obtained
(service life tests, wear tests, etc.), excessive costs of statistical surveys and others.
The number of statistical items in a sample is called the size of a sample. If the size
of a sample is less than 30 to 50, we say that the sample is small, if the size is several
hundreds or thousands we say that the sample is large. This classification is of
course arbitrary and may differ depending on circumstances. A sample should be
representative, which means that it should provide information without any limitations,
and homogeneous (not affected by other factors). This can seldom be achieved with
sufficient confidence and that is the reason why we usually select the items of a
sample at random, even at the risk of the information about the whole population
contained in the sample being biased.
regional (the population is first divided into partitions and each partition supplies
part of the sample),
The primary sample data or raw scores $x_1, \dots, x_n$ of size n are called
ungrouped sample data. The outcomes $x_i$ may be listed in order of numerical
magnitude, which yields an array of sample data $x_{(1)}, \dots, x_{(n)}$ where $x_{(i)} \le x_{(i+1)}$ for
every i, and $x_{(1)} = x_{\min}$, $x_{(n)} = x_{\max}$. The interval $\langle x_{\min}; x_{\max}\rangle$ is called the domain of
sample data and the number $x_{\max} - x_{\min}$ is called the range of sample data.
If the sample is very large or if the data are to be further processed (some
graphical representations or application of mathematical statistical methods) the raw
scores are grouped. Sample data are grouped by partitioning the domain into a
series of m non-overlapping intervals (usually left-open and right-closed), the so-called
classes, usually of the same width h. Each class is represented by a pair $(x_j, f_j)$,
where $x_j$ is the midpoint and $f_j$ the absolute frequency of
class j, $j = 1, \dots, m$. The absolute frequency $f_j$ is defined as the number of raw scores
that lie in class j. The number $f_j/n$ is called relative frequency and sometimes it is
also shown as a percentage. Obviously, we have $\sum_{j=1}^{m} f_j = n$.
accuracy of the outcomes x i and for the midpoint x j to be a rounded number. For a
The number $F_j = \sum_{k=1}^{j} f_k$ is called cumulative absolute frequency, the number
$x_j$    $x_1$    ...    $x_m$
$f_j$    $f_1$    ...    $f_m$
The properties of sample data are, in a concentrated form, expressed by
different measures.
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ for ungrouped data,
$\bar{x} = \frac{1}{n} \sum_{j=1}^{m} f_j x_j$ for grouped data.
Mathematical properties:
• $\overline{x + y} = \bar{x} + \bar{y}$,
• $x_{\min} \le \bar{x} \le x_{\max}$,
$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
is used where $w_i \ge 0$ are the weights of outcomes $x_i$, which reflect their significance
such as accuracy.
The median
Mathematical properties:
• $y = ax + b \Rightarrow \tilde{y} = a\tilde{x} + b$ for constants a, b,
• $x_{\min} \le \tilde{x} \le x_{\max}$,
• $\tilde{x}$ has the same unit of measurement as variable X.
The median divides the sample data into "the upper part" and "the lower part"
of outcomes xi. This is a robust measure, which, as compared to the arithmetic mean,
is little affected by extreme values. Sometimes a suitable approximation is used to
calculate the median (for example if the data is grouped).
The mode $\hat{x}$ is the number whose neighbourhood contains the most outcomes
$x_i$, or the midpoint $x_j$ of the class with the largest absolute frequency $f_j$. The mode
has the same unit of measurement as the variable X and, if needed, a suitable
approximation is used for calculating it.
Variance (dispersion)
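$s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \left(\frac{1}{n} \sum_{i=1}^{n} x_i^2\right) - \bar{x}^2$ for ungrouped data,
$s^2 = \frac{1}{n} \sum_{j=1}^{m} f_j (x_j - \bar{x})^2 = \left(\frac{1}{n} \sum_{j=1}^{m} f_j x_j^2\right) - \bar{x}^2$ for grouped data.
• $s^2 \ge 0$,
• $y = ax + b \Rightarrow s^2(y) = a^2 s^2(x)$ for constants a, b,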
The more the outcomes of a variable X are scattered, the greater is its
variance and vice versa. For calculations, sometimes an alternative formula for
variance is used by replacing $\frac{1}{n}$ by $\frac{1}{n - 1}$. The variance calculated by this alternative
formula equals $\frac{n}{n - 1} s^2$.
n 1
Standard deviation
$s = \sqrt{s^2}$.
• $s \ge 0$,
• $y = ax + b \Rightarrow s(y) = |a|\,s(x)$ for constants a, b,
The more the outcomes of a variable X are scattered, the greater is its
standard deviation and vice versa.
Coefficient of variation
$v = \frac{s}{\bar{x}}$.
The basic measure of symmetry for sample data is the coefficient of skewness
$A = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}$ for ungrouped data,
$A = \frac{\frac{1}{n} \sum_{j=1}^{m} f_j (x_j - \bar{x})^3}{s^3}$ for grouped data.
Mathematical properties:
• $y = ax + b \Rightarrow A(y) = \frac{a}{|a|} A(x)$ for any constant $a \ne 0$,
• A is a dimensionless number.
is employed instead of the arithmetic one for some variables which describe ratios
such as volume and price indices, interest rates and the like. In special cases we use
the harmonic mean
$\bar{x}_h = \left(\frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}\right)^{-1}$.
For univariate ungrouped or grouped sample data we can use box charts see
Fig 2.1 where the box contains the middle part of grouped data (about one half of all
the outcomes) while about a quarter of the data is placed on either side of the box.
The line on the left (on the right) corresponds to the so-called lower quartile (upper
quartile) and the perpendicular line in the middle is in the place of the median. The
height of the box is proportional to the size of the data and the line segments on both
sides represent acceptable domains for the above quarters of the data. Outcomes
beyond these line segments are considered as suspicious or extremely deviated.
There are also other modifications of this chart and other graphical tools.
[Figure: box chart; horizontal axis from 0 to 16 (× 1000)]
Fig. 2.1
Two other types of charts are frequently used for univariate sorted data:
Histograms - see Fig. 2.2 - a histogram is a system of bars in Cartesian co-ordinates
where the bases of the bars are the classes and their heights
correspond to the absolute (relative, cumulative, etc.) frequencies.
Frequency polygons - see Fig. 2.3 - a frequency polygon is a broken line in Cartesian
co-ordinates connecting points whose abscissas coincide with the midpoints (or
with upper limits) of classes and whose ordinates are proportional to the
frequency.
E x a m p l e 2.1
A total of 10 rollers have been measured with the following results: 5.38; 5.36; 5.35;
5.40; 5.41; 5.34; 5.29; 5.43; 5.42; 5.32. Determine the size, domain, and range,
arithmetic mean, variance, standard deviation, coefficient of variation, and median of
the sample data.
S o l u t i o n:
The data size is n = 10, the domain is $\langle 5.29; 5.43\rangle$ mm and the range is
5.43 − 5.29 = 0.14 mm.
v = 0.0435889894/5.37 ≈ 0.00811713 ≈ 0.8117 %,
$\tilde{x}$ = (5.36 + 5.38)/2 = 5.37 mm.
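The measures requested in this example can be computed with the Python standard library. The sketch below uses `statistics.pvariance` and `statistics.pstdev`, which divide by n as in the variance formula of this chapter (an assumption about which variant is intended, supported by the printed value of s); the variable names are illustrative.

```python
import statistics

# Roller measurements in mm from Example 2.1.
rollers = [5.38, 5.36, 5.35, 5.40, 5.41, 5.34, 5.29, 5.43, 5.42, 5.32]

n = len(rollers)
data_range = max(rollers) - min(rollers)
mean = statistics.fmean(rollers)
variance = statistics.pvariance(rollers)   # divides by n, not by n - 1
std = statistics.pstdev(rollers)
v = std / mean                             # coefficient of variation
median = statistics.median(rollers)

print(n, data_range, mean, variance, std, v, median)
```

Using `statistics.variance` instead would apply the alternative formula with n - 1 in the denominator.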
E x a m p l e 2.2
When checking the volume of beverage in a bottle for a sample of 50 bottles, the
following deviations (in ml) from the values stated on labels have been found:
1.2 2.1 1.7 0.9 0.3 2.0 -1.3 -0.1 3.2 2.8
0.8 4.4 2.9 1.2 0.0 -2.3 1.2 0.9 2.3 -0.2
0.1 1.9 -1.9 -0.2 -1.3 1.5 0.5 2.0 -1.3 3.7
0.9 1.0 0.4 1.9 1.4 -1.3 1.6 1.4 3.1 -0.1
1.8 0.0 4.1 1.3 3.0 0.4 3.8 -0.8 3.1 0.9
Group the data, set up a frequency distribution, and design a graphical
representation. Calculate $\bar{x}$, $s^2$, s, $\hat{x}$.
S o l u t i o n:
The size of the data is n = 50; xmin = - 2.3 ml and xmax = 4.4 ml, which means that
the domain is <-2.3; 4.4> ml, the range being 4.4 - (-2.3) = 6.7 ml. We choose the
number of classes to be m = 7 and the class width h = 1 (approximation of 6.7/7).
The selection of classes, their midpoints, the grouping of the data and the calculation
of absolute and relative frequencies are shown in the table below (//// stands for 5
outcomes):
j    class           $x_j$    classification    $f_j$    $F_j$
1    (-2.5; -1.5⟩     -2      //                 2        2
Histograms and polygons for this sample data are shown in Fig. 2.2 and 2.3. Further
calculations are, for the sake of clarity, shown in the following table:
j     $x_j$    $f_j$    $f_j x_j$    $f_j x_j^2$
1     -2       2       -4          8
2     -1       5       -5          5
3      0      11        0          0
4      1      13       13         13
5      2       9       18         36
6      3       6       18         54
7      4       4       16         64
Σ             50       56        180
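The column sums of the table lead directly to the measures for grouped data. The following sketch takes the midpoints and absolute frequencies from the table and applies the grouped-data formulas for the mean and variance; the results it prints are computed here and are not quoted from the text.

```python
import math
from itertools import accumulate

midpoints = [-2, -1, 0, 1, 2, 3, 4]      # class midpoints x_j
freqs = [2, 5, 11, 13, 9, 6, 4]          # absolute frequencies f_j

n = sum(freqs)
cumulative = list(accumulate(freqs))     # cumulative absolute frequencies F_j

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))

mean = sum_fx / n
variance = sum_fx2 / n - mean ** 2       # grouped-data variance, dividing by n
std = math.sqrt(variance)
mode = midpoints[freqs.index(max(freqs))]  # midpoint of the most frequent class

print(n, sum_fx, sum_fx2, mean, variance, std, mode, cumulative)
```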
[Figure: histograms of the absolute frequency f (left) and the cumulative frequency F (right) against x]
Fig. 2.2
[Figure: polygons of the absolute frequency f (left) and the cumulative frequency F (right) against x]
Fig. 2.3
For univariate sorted sample data with a discrete variable, the following charts
are usually used. Bar chart - see Fig. 2.4 - is similar to a histogram but there
are gaps between the bars and sometimes the bars are positioned horizontally. Pie
chart - see Fig. 2.5 - is a circle divided into sectors whose areas are proportional to
the class frequencies. Some of the sectors may be shifted in the upward or
downward direction. Different colours or types of hatching are used in these charts to
make selected pieces of information more prominent and sometimes the charts are
further geometrically and artistically modified for better presentation.
[Fig. 2.4 (bar chart) and Fig. 2.5 (pie chart): values for the years 1990 to 1994]
Processing bivariate sample data with quantitative variables
The raw scores ((x1, y1),..., (xn, yn)) obtained are called ungrouped data. If we
leave out the first and the second value in each pair, we get two sets of univariate
sample data (x1,..., xn) and (y1,..., yn) respectively. Processing these sets we obtain
measures like $\bar{x}$, $\bar{y}$, $s^2(x)$, $s^2(y)$ etc.
We can group bivariate sample data by grouping each of the sets of univariate
sample data $x_1, \dots, x_n$ and $y_1, \dots, y_n$ where for each data set the number of classes
or the widths of classes may be different. In this way we obtain bivariate classes with
midpoints $(x_j, y_k)$ and absolute frequencies $f_{jk}$, $j = 1, \dots, m_1$ and $k = 1, \dots, m_2$. The relative
$x_j$ \ $y_k$    $y_1$     ...    $y_{m_2}$     $f_{xj}$
$f_{yk}$        $f_{y1}$    ...    $f_{y m_2}$    n
The numbers fxj and fyk are marginal frequencies and the following formulas
hold:
$f_{xj} = \sum_{k=1}^{m_2} f_{jk}$, $f_{yk} = \sum_{j=1}^{m_1} f_{jk}$, $\sum_{j=1}^{m_1} f_{xj} = \sum_{k=1}^{m_2} f_{yk} = \sum_{j=1}^{m_1} \sum_{k=1}^{m_2} f_{jk} = n$.
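For grouped data $(x_j, f_{xj})$, $j = 1, \dots, m_1$, and $(y_k, f_{yk})$, $k = 1, \dots, m_2$, we obtain
$r = \frac{\frac{1}{n} \sum_{j=1}^{m_1} \sum_{k=1}^{m_2} f_{jk} (x_j - \bar{x})(y_k - \bar{y})}{s_x s_y} = \frac{\frac{1}{n} \sum_{j=1}^{m_1} \sum_{k=1}^{m_2} f_{jk} x_j y_k - \bar{x}\bar{y}}{s_x s_y}$ for grouped data.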
The numerators in all fractions define the so-called covariance cov. Sometimes we
write r(x,y) and cov(x,y).
Mathematical properties:
• $r(ax + b, cy + d) = \frac{ac}{|ac|}\, r(x, y)$ for constants a, b, c, d, $a \ne 0$, $c \ne 0$,
• $-1 \le r \le 1$,
• r is a dimensionless number.
Fig. 2.6
Fig.2.7
E x a m p l e 2.3
(30.18; 50.26), (30.19; 50.23), (30.21; 50.27), (30.22; 50.25), (30.25; 50.22),
(30.26; 50.32), (30.26; 50.33), (30.28; 50.29), (30.30; 50.37), (30.33; 50.42).
Calculate $\bar{x}$, $\bar{y}$, $s^2(x)$, $s^2(y)$, s(x), s(y), cov, r.
S o l u t i o n:
As the data size is only n = 10 the data need not be grouped. Using the above
formulas we get:
cov = [(30.18)(50.26) + ... + (30.33)(50.42)]/10 – (30.248)(50.296) = 0.002292 CZK²,
Judging by the value of the correlation coefficient it may be assumed that there is a
dependency between the variables that is fairly close to linear.
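The covariance and the correlation coefficient of the ten pairs can be recomputed as follows. This is a sketch that divides by n throughout, in line with the variance formula of this chapter; the correlation value it prints is computed here rather than quoted from the text, and the variable names are illustrative.

```python
import math

# The ten (x, y) pairs listed in Example 2.3.
pairs = [
    (30.18, 50.26), (30.19, 50.23), (30.21, 50.27), (30.22, 50.25), (30.25, 50.22),
    (30.26, 50.32), (30.26, 50.33), (30.28, 50.29), (30.30, 50.37), (30.33, 50.42),
]

n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n

var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n

r = cov / math.sqrt(var_x * var_y)   # sample correlation coefficient

print(mean_x, mean_y, var_x, var_y, cov, r)
```

Working with deviations from the means, rather than with raw products, keeps rounding errors small for data of this magnitude.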
where x j are all possible values of variable X expressed in words and fj are the
frequencies of these values in the original data, j 1,...,m . Measures are used only
exceptionally (variability). Bar charts and pie charts are mainly used for graphical
representation.
are pairs representing all combinations of the outcomes of variables (X,Y) and fjk are
the frequencies of these outcomes for j 1,...,m1 and k 1,...,m2 . Out of various
measures the most frequently used are measures of the dependence of X and Y.
3D - bar charts are used to graphically represent these data.
Exercises
E x e r c i s e 2.4
A total of ten metal parts have been machined, for each part the wasted material has
been weighed and the corresponding percentage calculated. The following data have
been obtained: 40.60; 40.29; 37.51; 38.90; 38.13; 38.15; 34.81; 37.00; 39.95; 40.43.
Calculate x , s2, s, and v.
E x e r c i s e 2.5
Calculate the domain, range, arithmetic mean, variance, and standard deviation for
the following data describing the precipitation (the amount of rain and snow fallen in
mm) in Brno from 1941 to 1960: 718.5; 492.3; 431.5; 540.5; 514.7; 584.0; 385.0;
532.0; 531.0; 578.3; 551.9; 613.6; 476.0; 661.3; 518.0; 508.5; 488.7; 494.9; 554.6;
673.5.
E x e r c i s e 2.6
2 2 3 5 3 3 2 7 4 7 2 3 5 6 4 4 4 2 4 6 5 3 4 5 5
4 5 7 4 3 4 2 4 4 4 4 4 4 3 2 4 3 3 3 4 2 3 4 2 3
3 3 4 3 5 9 3 3 4 8 5 4 5 3 3 4 3 3 3 4 5 2 3 7 3
5 5 1 4 4 5 3 3 4 3 4 4 4 3 3 4 3 4 2 3 3 5 6 2 4
(a) set up frequency distribution tables for absolute, relative, and cumulative
frequencies
(b) calculate the average number of household members, the mode and the median.
R e s u l t: $\bar{x}$ = 3.82; $\hat{x}$ = 3; $\tilde{x}$ = 4
E x e r c i s e 2.7
For a total of 200 parts processed by an automatic machine tool the differences from
the required size in micrometres have been measured. The following are the resulting
differences:
1.0 1.5 -2.5 0.0 -1.5 1.0 1.0 15.0 -1.0 2.0
2.0 3.0 11.0 -1.0 5.0 4.5 0.5 3.5 8.0 5.0
4.5 3.5 9.5 12.0 7.5 7.5 10.0 8.5 10.0 11.0
14.0 11.0 11.0 13.0 16.0 14.5 19.0 14.0 18.0 19.0
19.0 23.5 22.0 18.5 19.5 17.5 18.0 19.5 17.5 25.5
19.5 22.0 13.5 18.5 21.5 27.5 21.0 13.5 11.5 10.0
7.5 8.5 6.5 8.5 5.5 26.0 12.5 6.5 8.5 7.5
2.5 7.0 4.5 -1.5 4.0 5.5 1.0 4.0 6.5 5.5
4.5 5.0 7.5 5.0 5.5 6.0 6.5 -3.0 5.0 3.5
-3.0 -14.0 17.0 -9.0 -3.0 -12.0 8.5 12.0 6.0 8.5
0.0 7.0 -1.0 -3.0 0.5 0.0 2.0 -4.5 2.0 -10.0
-8.5 -3.5 -11.5 -7.5 -11.5 -6.5 2.0 -11.5 -11.0 -17.5
-15.0 -15.5 1.5 -18.0 -20.0 -15.0 -3.0 -8.0 -1.0 -6.5
-8.0 -13.5 -12.0 -17.0 -10.5 14.5 10.0 9.5 7.0 0.5
21.0 10.5 5.0 0.5 4.0 0.0 0.5 3.5 9.0 2.5
2.0 7.0 7.5 3.5 7.0 4.5 -1.0 11.0 4.0 9.0
4.5 11.5 14.0 10.0 20.0 13.0 7.0 12.0 7.5 2.0
1.0 25.0 0.5 -3.0 4.5 6.0 9.5 12.5 19.0 13.0
1.5 0.5 12.0 4.0 6.5 -9.5 -8.0 -4.5 7.5 -4.0
-9.0 -9.0 2.0 -0.5 3.5 10.5 -5.5 -6.0 -6.5 -8.0
Summarize the data and use the resulting crosstabulation to calculate the arithmetic
mean, the standard deviation and the coefficient of skewness.
R e s u l t: $x_{min} = -17.5$ μm; $x_{max} = 27.5$ μm; $h = 5.0$ μm; $m = 10$; $\bar{x} \approx 4.3$ μm;
$s \approx 9.7$ μm; $A \approx -0.102$
E x e r c i s e 2.8
The below frequency distribution shows how a total of 200 workers have met the
norm. The numbers in the upper line are the midpoints of percentage classes:
Fj 4 21 65 39 24 17 12 9 7 2
Use the frequency distribution to calculate the arithmetic mean, mode, median,
standard deviation, and coefficient of skewness.
R e s u l t: $\bar{x} = 117.9$ %; $\hat{x} = 105$ %; $\tilde{x} = 115$ %; $s^2 \approx 380$ %²; $A \approx 0.92$
E x e r c i s e 2.9
In a statistical survey made by an insurance company each person has been asked
about the bonus they are paying. The following is the resulting frequency distribution.
The upper line shows the bonuses in CZK:
xj 390 410 430 450 470 490 510 530 550 570
Fj 7 10 14 22 25 12 3 3 2 2
E x e r c i s e 2.10
xi 18 19 20 21 22 22 25 26 26 26 27 28 29 30 31 33
yi 26 23 29 27 31 25 22 32 32 33 38 29 36 37 41 42
E x e r c i s e 2.11
xi 2 4 4 5 6 8 10 10 10 10
yi 1 2 3 4 4 4 5 5 5 6
E x e r c i s e 2.12
The following contingency table summarizes last year's (x) and this year's (y) prices
of shares in thirty companies selected at random. Find the average prices of shares
and the correlation coefficient.
yk
1001 - 2000 2001 - 3000 3001 - 4000
xj
501 - 1000 3 6 0
1001 - 1500 5 8 2
1501 - 2000 0 1 3
2001 - 4000 0 1 1
Questions
3. ANALYSIS OF TIME SERIES
Fundamentals
Two main categories of statistical information exist: cross sections and time
series. Economists often estimate consumption by relating the consumers'
costs to the national product or analyse in detail the distribution of consumption at
one particular point of time (cross sections). This approach has a broader
significance for the practice but is not sufficient if we are interested in the dynamics of an
event and in particular in changes over time.
The basic tool employed to study the dynamics of an event is an analysis of its
past development, which helps us grasp the existing laws and estimate its future
development.
A time series is obtained if the data on a particular event over time are
arranged in order of increasing time. A well-established time series that can be used
for an analysis must meet the following requirements.
a) the same period of time over which the data has been acquired,
Failure to comply with any of the above conditions may result in erroneous
conclusions.
From the statistical point of view a time series is a sequence (y1,...,yn) of the
observed values of a statistical variable Y where the index i corresponds to the time ti
or to the i-th interval ending at ti, ti < ti+1, i = 1,...,n. Sometimes we write yt instead
of yi. Graphically, the time series is mostly represented by a graph in the Cartesian
system of co-ordinates with the indices i or times ti as abscissas and the values yi as
ordinates. Fig 3.1 shows an example of a time series.
[Fig. 3.1: example of a time series; vertical axis: Sale, horizontal axis: Time]
If time series are related to periods of time, they are called interval time series,
if they refer to points of time, they are called point time series.
Interval time series are composed of indexes measured for fixed time intervals
such as an hour, day, month, or year, etc. They are characterised by the following
features:
- the data items express quantities,
- they are dependent on the length of the time interval,
- the sum of the data is meaningful.
Point time series contain data that are related to a fixed point in time. They are
characterised by the following features:
To analyse a time series correctly certain differences must be taken into
consideration that follow from the different character of time series data and from
their significance.
b) point
c) quotients
Interval series
It is typical of interval time series that they are related to a fixed time interval and
as such are affected by the length of the interval. The following are the most common
interval quantities: production volume, retail sales, revenues, wages and salaries,
man-hours, number of children born within a certain period, etc.
The quantities used for setting up a time series typically do not refer to an
interval but rather to a time point. This may be the first or the last day of a period, an
arbitrary but fixed day or moment. The number of inhabitants, workers, the amount of
fixed assets, etc may exemplify data of this type. Such data show an instantaneous
condition of the event in question. We use the following numerical characteristic
called the chronological mean to aggregate the data.
Given the values of a point index (outcomes of the observed variable Y)
y1, y2,..., yn
Using these partial averages we now calculate the average for all the aggregate point
values
$$\bar{y}_1 = \frac{y_1 + y_2}{2}, \quad \bar{y}_2 = \frac{y_2 + y_3}{2}, \quad \dots, \quad \bar{y}_{n-1} = \frac{y_{n-1} + y_n}{2}.$$
The number dividing the sum of the above partial averages will be one less than n.
If the intervals between the individual values of a point time series are of equal
length, the chronological mean is calculated as follows. Denoting the distances
between the members of a point time series by
we have
$$d_1 = d_2 = \dots = d_{n-1},$$
$$\bar{y}_{chr} = \frac{\dfrac{y_1 + y_2}{2} + \dfrac{y_2 + y_3}{2} + \dots + \dfrac{y_{n-1} + y_n}{2}}{n - 1}.$$
If the distances between the neighbouring members of a point time series are
not equal, the calculation is similar. We reduce the different lengths of time
$d_1, d_2, \dots, d_{n-1}$ to one value. We use their weighted average to do this:
y1 y2 y3 yn-1 yn
t1 t2 t3 tn-1 tn
d1 d2 dn-1
The chronological mean is then
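$$\bar{y}_{chr} = \frac{\dfrac{y_1 + y_2}{2}\, d_1 + \dfrac{y_2 + y_3}{2}\, d_2 + \dots + \dfrac{y_{n-1} + y_n}{2}\, d_{n-1}}{d_1 + d_2 + \dots + d_{n-1}},$$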
which yields
E x a m p l e 3.1
The following are data on the numbers of employees of a company during the
calendar year:
Calculate the chronological mean of the time series expressing the numbers of
employees.
Solution:
We use the above formulas to calculate the chronological mean, where y1 = 3 500,
y2 = 3 425, y3 = 3 430, y4 = 3 390, y5 = 3 350 and d1 = d2 = d3 = d4. Substituting
these values we get
The chronological mean of the numbers of employees of the company is 3 417.5 for
the given year. For practical use we can round off the value to 3 418.
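The calculation can be checked with a short script. The following is only a minimal sketch: the function name `chronological_mean` and its arguments are illustrative choices, not part of the text, and the data are the employee numbers quoted in the solution of Example 3.1.

```python
def chronological_mean(values, distances=None):
    """Chronological mean of a point time series.

    values    -- observations y_1, ..., y_n at successive time points
    distances -- lengths d_1, ..., d_{n-1} between neighbouring time points;
                 None means equally spaced observations
    """
    if len(values) < 2:
        raise ValueError("at least two observations are needed")
    if distances is None:
        distances = [1.0] * (len(values) - 1)
    if len(distances) != len(values) - 1:
        raise ValueError("need exactly n - 1 distances")
    # Average of each pair of neighbours, weighted by the gap between them.
    weighted = sum((a + b) / 2 * d
                   for a, b, d in zip(values, values[1:], distances))
    return weighted / sum(distances)


# Numbers of employees from Example 3.1 (equally spaced dates).
employees = [3500, 3425, 3430, 3390, 3350]
print(chronological_mean(employees))  # 3417.5
```

Passing a list of unequal distances as the second argument gives the weighted form of the chronological mean.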
E x a m p l e 3.2
A company keeps an inventory of stock. The total figures are in Czech Korunas. The
data for the following dates are available:
Solution:
We use the formula for the calculation of the chronological mean, where y1 = 20.523,
y2 = 16.100, y3 = 17.230, y4 = 21.432 and
The average yearly stock in the company was 17.880 million CZK.
Cumulative time series behave like increasing sums. A cumulative time series
is formed by gradually adding up the values of a given variable starting from a fixed
point. This method is employed for example when monitoring indexes for a certain
period of time such as a month or a year. Cumulative values are of considerable help
in matters of strategic decision making. In the following example use of this method is
demonstrated.
E x a m p l e 3.3
Using the following data on the production volume for each month of the year
analyse the real production for each month and for the whole year.
From the tabulated values it can be concluded (cf. December) that, as compared with
the planned value 415.6 thousand tons, the yearly total was 411.1 thousand tons,
which is 98.9 % of the plan.
Time series of cumulative averages
Time series of cumulative averages are derived from interval series. These
series show how the cumulative averages approach the total average over the given
period of time, which is equal to the last value.
This method is used for example to record the costs in quality control. It is
based on a cumulative time series where the values are divided by the number of
periods over which it has been accumulated. We will use the data from Example 3.3
to demonstrate this. The quantities are shown in thousands of tons.
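As a small illustration of how a cumulative series and its cumulative averages are built, consider the sketch below. The monthly figures are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the data of Example 3.3.

```python
from itertools import accumulate

# Hypothetical monthly production (thousands of tons), for illustration only.
monthly = [30.0, 32.0, 34.0, 36.0]

# Cumulative time series: running totals from the first period onwards.
cumulative = list(accumulate(monthly))

# Cumulative averages: each running total divided by the number of periods.
cumulative_avg = [total / k for k, total in enumerate(cumulative, start=1)]

print(cumulative)      # [30.0, 62.0, 96.0, 132.0]
print(cumulative_avg)  # [30.0, 31.0, 32.0, 33.0]
```

The last cumulative average equals the overall average of the interval series, as stated above.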
E x a m p l e 3.4
E x a m p l e 3.5
The values for 1994 and 1995 clearly show that the production trend is increasing.
The simplest numerical characteristics used to analyse time series are the
absolute and relative measure of growth and decline. An analysis of absolute and
relative measures of growth enables decisions necessary for the selection of a
function used for smoothing a time series. For the following methods, we will always
assume that the neighbouring boundaries or midpoints of the time intervals are
equidistant.
Absolute measures of growth provide absolute comparison between the
members of a time series. The following measures are used:
$$\delta_i = y_i - y_{i-1}, \quad \text{for } i = 2, 3, \dots, n,$$
$$\bar{\delta} = \frac{1}{n-1}\sum_{i=2}^{n} \delta_i = \frac{y_n - y_1}{n-1}.$$
If all increments (also called first differences $\delta_i^{(1)}$) are close to a
constant, the time series has a linear trend, which can be expressed in the form of a
straight line. Second differences $\delta_i^{(2)}$ are obtained by subtracting two neighbouring
first differences. If the second differences of a time series are all close to a constant,
the time series may be represented by a parabola. Third differences are calculated as
the differences of two neighbouring second differences. Further differences are
established in a similar way. If the differences of order k are almost constant, the
corresponding time series may be represented by a k-th order polynomial.
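The rule can be tried out on a small artificial series. The sketch below is only an illustration: the helper `differences` and the sample series (generated from a parabola) are assumptions made for this example.

```python
def differences(series, order=1):
    """Return the differences of the given order of a time series."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


# Hypothetical series generated by y = t**2 + 1 for t = 1, ..., 6.
y = [2, 5, 10, 17, 26, 37]
print(differences(y, 1))  # [3, 5, 7, 9, 11]
print(differences(y, 2))  # [2, 2, 2, 2]
```

The second differences are constant here, which points to a second-order polynomial (a parabola) as the smoothing function.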
coefficient of growth
$$k_i = \frac{y_i}{y_{i-1}} \quad \text{for } i = 2, 3, \dots, n,$$
coefficient of increment
$$\kappa_i = \frac{\delta_i}{y_{i-1}} = k_i - 1 \quad \text{for } i = 2, 3, \dots, n,$$
The average coefficient of growth may also be calculated as the (n - 1)-th root of the
quotient of the last and the first value in the given time series
$$\bar{k} = \sqrt[n-1]{\frac{y_2}{y_1} \cdot \frac{y_3}{y_2} \cdot \frac{y_4}{y_3} \cdots \frac{y_{n-1}}{y_{n-2}} \cdot \frac{y_n}{y_{n-1}}} = \sqrt[n-1]{\frac{y_n}{y_1}}.$$
E x a m p l e 3.6
The GDP figures (in thousands of millions of CZK) in the Czech Republic between
1990 and 1996, recalculated for fixed prices, are given in the table below.
Determine the average GDP, absolute yearly increments, the average yearly
increment, second differences, the average second difference, the coefficient of
growth, and the average coefficient of GDP growth.
Solution:
Part of the results can be found in the table where t as the time variable is used
instead of i:
$$\bar{y} = \frac{7268}{7} \approx 1038.2857 \approx 1038.3 \text{ thousands of millions of CZK.}$$
$$\bar{\delta} = \frac{1015}{7-1} = 169.1666\ldots \approx 169 \text{ thousands of millions of CZK.}$$
The average yearly growth coefficient for the GDP is
$$\bar{k} = \sqrt[7-1]{\frac{1579}{564}} \approx \sqrt[6]{2.7996454} \approx 1.1872 \text{ or } 118.72\,\%.$$
Hence the average yearly coefficient of increment for the GDP is 0.1872 or 18.72 %. The
calculation of the average coefficient of growth that uses the arithmetic mean may
be misleading but unfortunately this often occurs in economic applications. The
average yearly second difference of the GDP (in thousands of millions of CZK) is
$$\bar{\delta}^{(2)} = \frac{84}{7-2} = 16.8 > 0,$$
which means that the overall growth of the GDP is increasing.
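The average measures of growth depend only on the first and the last value of the series, so they can be recomputed from the two endpoint figures that appear in the formulas of Example 3.6. The sketch below assumes nothing beyond those two values; the variable names are illustrative.

```python
n = 7            # years 1990, ..., 1996
y_first = 564    # first value of the series (Example 3.6)
y_last = 1579    # last value of the series (Example 3.6)

# Average absolute increment: (y_n - y_1) / (n - 1).
avg_increment = (y_last - y_first) / (n - 1)

# Average coefficient of growth: (n - 1)-th root of y_n / y_1.
avg_growth = (y_last / y_first) ** (1 / (n - 1))

# Average coefficient of increment: growth coefficient minus one.
avg_increment_coeff = avg_growth - 1

print(round(avg_increment, 1))        # 169.2
print(round(avg_growth, 4))           # 1.1872
print(round(avg_increment_coeff, 4))  # 0.1872
```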
A time series is the measurement of a variable over time. The following major
components, or movements, of time series may be identified:
- trend (long-term influence),
- periodic influences (recurring regularly) affecting the values of a time series,
- irregular influences (occurring at random, forecasting is difficult).
Periodic influences
Periodic influences account for periodic variations of time series over time. The
length of the periods varies as well and can be used for further subdivision of periodic
influences as follows:
- cycles (wavelike repetitive movements fluctuating about the trend of the series),
A number of methods have been devised and computer programs have been written
to express the periodic changes of time series [1], [3], [4], [5] and [6].
Irregular influences
A time series may be thought of as the result of its trend component Tt, periodic
component Pt and random component Et. The periodic component may further be
decomposed into a cyclic component Ct and a seasonal component St. The
decomposition of a time series is mostly based on an additive model
yt = Tt + Pt + Et , or yt = Tt + Ct + St + Et
or a multiplicative model
yt = Tt Pt Et , or yt = Tt Ct St Et .
When analysing the trend component of a time series, we try to identify the
influence of those factors that are stable and determine the trend. Graphically, this
corresponds to drawing a curve that best fits the time series trend when plotted. Such
a curve may be obtained by graphically, mechanically, or analytically smoothing the
time series.
For graphical smoothing the time series is plotted in a graph (Fig. 3.1) and its
trend is estimated (smoothed) graphically. This method can only serve as a guideline
and sometimes it may be misleading.
yct f ( t) e t ,
The function f(t) ought to render the trend of the time series correctly, that is,
to smooth the time series as well as possible. Linear, parabolic, and exponential
functions are the ones most frequently used. Generally, any function may be used -
more details can be found in Chapter 5.
The most frequently used method is linear smoothing if the trend of the time
series appears to be linear. The function f(t) has the form
$$y'_t = b_0 + b_1 t,$$
and the parameters $b_0$ and $b_1$ are determined from the so-called system of normal
equations
$$b_0 T + b_1 \sum_{t=1}^{T} t = \sum_{t=1}^{T} y_t,$$
$$b_0 \sum_{t=1}^{T} t + b_1 \sum_{t=1}^{T} t^2 = \sum_{t=1}^{T} y_t t.$$
The first coefficient $b_0$ determines the point at which the smoothing straight
line intersects the y-axis. It is interpreted as the smoothed value of the time series in
period zero. The second coefficient $b_1$ is the slope of the straight line and expresses
the actual trend. It determines the change of the smoothed values $y'_t$ for a unit
change of t or the average change of the original values $y_t$ when t is increased by
one. We can test the suitability of the smoothing function by looking at the plotted
diagram or by calculating the correlation coefficient of the pairs $(t, y_t)$ or the sum of
the squared differences $\sum (y_t - y'_t)^2$.
This method may be simplified if we shift the time variable so that the sum of its
shifted values equals zero. This can be achieved by shifting the origin (0) to
the central period, that is, by decreasing t by its mean value
$$\bar{t} = \frac{T+1}{2}.$$
Thus, instead of t, we consider the variable $t' = t - \bar{t}$. By this transformation the
terms in the system of normal equations with $\sum t'$ turn to zero, which yields the
following explicit formulas
$$b'_0 = \frac{\sum_{t=1}^{T} y_t}{T}, \qquad b'_1 = \frac{\sum_{t=1}^{T} y_t t'}{\sum_{t=1}^{T} t'^2}.$$
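The explicit formulas for the shifted time variable translate directly into code. The following is a minimal sketch, assuming periods numbered t = 1, ..., T; the function name `linear_trend` and the test series are hypothetical.

```python
def linear_trend(y):
    """Fit y'_t = b0 + b1 * t for t = 1, ..., T by least squares.

    Time is centred at its mean (T + 1) / 2, so the normal equations
    decouple and both coefficients have closed forms.
    """
    T = len(y)
    t_mean = (T + 1) / 2
    t_shifted = [t - t_mean for t in range(1, T + 1)]
    b0_shifted = sum(y) / T  # intercept for the shifted time variable
    b1 = (sum(v * s for v, s in zip(y, t_shifted))
          / sum(s * s for s in t_shifted))
    b0 = b0_shifted - b1 * t_mean  # intercept on the original time axis
    return b0, b1


# Hypothetical series lying exactly on y = 2 + 3 t.
b0, b1 = linear_trend([5, 8, 11, 14, 17])
print(b0, b1)  # 2.0 3.0
```

The slope is the same for the shifted and the original time variable; only the intercept has to be converted back.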
E x a m p l e 3.7
Determine the trend component of the time series representing the development of
the gross domestic product in the Czech Republic from 1990 to 1996 as shown in
Example 3.6.
Solution:
As the original time variable t takes on the values 1990, 1991,…, 1996, we will use
the transformation t´ = t - 1993 since then we have
$b_1 = 164.71428 \approx 164.7$.
Hence we get the following straight line, which smoothes the time series
$$y'_t = -327\,237.3 + 164.7\,t.$$
For t = 1993, say, we obtain $y'_{1993} = -327\,237.3 + (164.7)(1993) = 1009.8$, which is in
good correspondence with the real GDP $y_{1993} = 1015$. Also the value $b_1 = 164.7$ is
close to the average increment $\bar{\delta} = 169$ from Example 3.6.
Exercises
E x e r c i s e 3.8
The data below describe the amount of fixed assets in a company over a calendar
year (the accounting value):
1st Jan   101.230 million CZK
1st Mar   105.100 million CZK
1st Apr   105.500 million CZK
1st Aug   100.250 million CZK
1st Dec    99.800 million CZK
1st Jan   103.150 million CZK
E x e r c i s e 3.9
E x e r c i s e 3.10
Smooth the average fixed assets figures (in millions of CZK) shown in the table below
for the years 1978 to 1985 using the linear smoothing method and calculate a
forecast of the average fixed assets for 1987 assuming that the trend of the time
series does not change.
R e s u l t: $y'_t = -9352.193 + 5.069t$, $y'_{1987} = 719.91$ mill. CZK
E x e r c i s e 3.11
A transport company recorded the following average numbers of vehicles in its fleet:
1993: 54, 1994: 63, 1995: 69, 1996: 72.
The company's real figures as of particular dates of 1997 are shown in the table
below:
Number 70 66 71 80 82 90
Determine the average number of the company's vehicles for 1997, the average
yearly coefficient of growth and characterise the trend of the development from 1993
to 1997 using a linear function.
Questions
4. INDEX NUMBERS
Basic notions
Simple indexes measure the relative change from the base period for a single
item or for a group of homogeneous items. In the former case they are called single
indexes (describing such quantities as the price or the amount of a single product)
and in the latter case they are called group indexes (when used to measure the
change in one variable, such as the price or quantity, for a group of homogeneous
items). As opposed to that, composite indexes measure the relative change from the
base period in a group of inhomogeneous items or an aggregate (such as a bundle of
commodities of different types).
Indexes that measure the number of items, the production volume and the like
are called quantity indexes and are denoted by q while those that measure such
quantities as price or intensity are called value indexes and are denoted by p. Among
the indexes of the first group the most frequently used are volume indexes while
those in the second group are usually price indexes.
Single indexes
$$i_q = \frac{q_1}{q_0} \quad \text{or} \quad i_p = \frac{p_1}{p_0},$$
respectively, where the numerator corresponds to the current period and the
denominator corresponds to the base period. The correct selection of the base period
is important. Those values that best represent the outcomes of the variable should be
chosen as the base. Sometimes it is best to use the average of several values. When
calculating and interpreting index numbers, we must ensure comparability of the
periods and a factual agreement of the aspects concerned otherwise the expressive
power of the index could be significantly impaired.
E x a m p l e 4.1
The production volume of a steel works reached 2780 tons in 1994 when the price of
steel was 8750 CZK and in 1995 the production volume rose to 2950 tons with a
price of 9690 CZK. Calculate indexes for the production and price of steel.
Solution: $i_q = \frac{2950}{2780} \approx 1.061 = 106.1\,\%$, $i_p = \frac{9690}{8750} \approx 1.107 = 110.7\,\%$
Group indexes
While single indexes are used to analyse single items such as the quantity of
cement produced by one plant, group indexes are related to a group of similar items
such as the cement production volumes for a group of plants. The distinction
between single and group indexes is of great importance. For group indexes, the
comparability of periods, facts and composition plays an important role. For a
quantity variable, the group index is given by
$$i_q = \frac{\sum_i q_1^{(i)}}{\sum_i q_0^{(i)}}.$$
E x a m p l e 4.2
The production figures for four cement production plants are given in the following
table:
Plant January (0) February (1) March (2)
If we consider all the plants as one unit, the group indexes for a quantity variable are
as follows (in fact they are what will be later called chain production indexes):
$$i_{q1} = \frac{22\,400}{21\,860} \approx 1.025, \qquad i_{q2} = \frac{22\,090}{22\,400} \approx 0.986.$$
$$i_{\text{var.comp.}} = \frac{\bar{p}_1}{\bar{p}_0} = \frac{\dfrac{\sum_i p_1^{(i)} q_1^{(i)}}{\sum_i q_1^{(i)}}}{\dfrac{\sum_i p_0^{(i)} q_0^{(i)}}{\sum_i q_0^{(i)}}}.$$
E x a m p l e 4.3
The value under investigation is the price. Price changes in individual types of supply
may be measured by single indexes for contractual (1) and surplus (2) supplies:
For contractual supplies the price has risen by 10% while for surplus supplies it has
remained the same. The average price for the base period is
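$$\bar{p}_0 = \frac{\sum_i p_0^{(i)} q_0^{(i)}}{\sum_i q_0^{(i)}} = \frac{111\,000}{8\,500} \approx 13.06 \text{ CZK}$$
and the average price for the current period is
$$\bar{p}_1 = \frac{\sum_i p_1^{(i)} q_1^{(i)}}{\sum_i q_1^{(i)}} = \frac{243\,600}{12\,600} \approx 19.33 \text{ CZK},$$
so that
$$i_{\text{var.comp.}} = \frac{19.33}{13.06} \approx 1.480 = 148\,\%.$$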
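The variable composition index needs only the total values and total quantities of both periods. The sketch below reuses the four totals quoted in Example 4.3; the variable names are illustrative.

```python
# Totals from Example 4.3: sum of p*q and sum of q for each period.
value_0, quantity_0 = 111_000, 8_500     # base period
value_1, quantity_1 = 243_600, 12_600    # current period

p0_avg = value_0 / quantity_0   # average price in the base period
p1_avg = value_1 / quantity_1   # average price in the current period
index = p1_avg / p0_avg         # variable composition index

print(round(p0_avg, 2), round(p1_avg, 2), round(index, 3))  # 13.06 19.33 1.48
```

The average price rises far more than either single price index, because the index also reflects the shift in quantities between the two types of supply.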
Composite indexes
The basic property of composite (aggregate) indexes is that they can be used
to measure changes in quantities and values of inhomogeneous variables. If, for
example, the prices of consumer goods have changed, we would like to know the
percentage of the drop in prices of products as a whole. Since we have to deal with a
range of different types of goods, the average price of a unified item cannot be used.
Then the so-called composite value index (retail trade turnover) is given by
$$I_h = \frac{\sum_i q_1^{(i)} p_1^{(i)}}{\sum_i q_0^{(i)} p_0^{(i)}}.$$
To determine the influence of only one of the variables, the influence of the
other must be eliminated, which is done by fixing it at a given constant level in each
of the aggregates that are being compared. A constant level for a value variable (p)
or a quantity variable (q) to calculate the index may be achieved in two ways: by
fixing it at the level of either the base period or the current period. Accordingly, we get
the following indexes:
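the Laspeyres composite index for quantity
$$I_q^L = \frac{\sum_i q_1^{(i)} p_0^{(i)}}{\sum_i q_0^{(i)} p_0^{(i)}},$$
the Laspeyres composite index for value
$$I_p^L = \frac{\sum_i q_0^{(i)} p_1^{(i)}}{\sum_i q_0^{(i)} p_0^{(i)}},$$
the Paasche composite index for quantity
$$I_q^P = \frac{\sum_i q_1^{(i)} p_1^{(i)}}{\sum_i q_0^{(i)} p_1^{(i)}},$$
and the Paasche composite index for value
$$I_p^P = \frac{\sum_i q_1^{(i)} p_1^{(i)}}{\sum_i q_1^{(i)} p_0^{(i)}}.$$
These indexes satisfy
$$I_h = I_p^L I_q^P = I_q^L I_p^P,$$
since we have
$$I_p^L I_q^P = \frac{\sum_i q_0^{(i)} p_1^{(i)}}{\sum_i q_0^{(i)} p_0^{(i)}} \cdot \frac{\sum_i q_1^{(i)} p_1^{(i)}}{\sum_i q_0^{(i)} p_1^{(i)}} = \frac{\sum_i q_1^{(i)} p_1^{(i)}}{\sum_i q_0^{(i)} p_0^{(i)}},$$
and similarly for $I_q^L I_p^P$.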
Neither the Laspeyres nor the Paasche index expresses a change in a
satisfactory manner since a change in p or q between the base period and the
current one may cause a change in q or p respectively. For example a change in
prices may influence the consumption and vice versa. To eliminate this drawback,
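we sometimes use the Fisher ideal index, the geometric mean of the Laspeyres and
the Paasche index,
$$I_q^F = \sqrt{I_q^L I_q^P}, \qquad I_p^F = \sqrt{I_p^L I_p^P}.$$
However, not even the Fisher index reflects the changes sufficiently (it is only a
compromise between the Laspeyres and the Paasche index) and therefore further
indexes are used as well [1].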
E x a m p l e 4.4
The following table contains sales figures for products A, B, C in a trading company:
q0 q1 p0 p1
A 1 000 1 200 60 69
B 6 000 4 500 10 11
C 8 000 9 000 8 7
Determine: a) the growth of the total retail trade turnover in the company for the
given period,
b) the growth of the sales volume,
c) the change in the price of the commodities sold.
Solution:
Product iq ip q0p0 q1p1 q0p1 q1p0
Total --- --- 184 000 195 300 191 000 189 000
$$I_h = \frac{195\,300}{184\,000} \approx 1.0614 = 106.14\,\%.$$
The above index tells us that the retail trade turnover in the given company rose by
6.14 %. The Laspeyres index for the sales volume is
$$I_q^L = \frac{189\,000}{184\,000} \approx 1.0272 = 102.72\,\%$$
and the Paasche index for price is
$$I_p^P = \frac{195\,300}{189\,000} \approx 1.0333 = 103.33\,\%.$$
The sales volume rose by 2.72% and the price of the commodities sold rose by
3.33%. The following relationship may be established for the two indexes
$$I_q^P = \frac{195\,300}{191\,000} \approx 1.0225 = 102.25\,\%$$
and
$$I_p^L = \frac{191\,000}{184\,000} \approx 1.0380 = 103.80\,\%.$$
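The totals and indexes of this example can be recomputed from the sales table of Example 4.4. The following is a sketch only: the dictionary layout and the helper `total` are illustrative, and the Fisher indexes are computed as geometric means of the Laspeyres and Paasche indexes (the usual definition, assumed here).

```python
from math import sqrt

# Sales data from Example 4.4: product -> (q0, q1, p0, p1).
data = {
    "A": (1000, 1200, 60, 69),
    "B": (6000, 4500, 10, 11),
    "C": (8000, 9000, 8, 7),
}


def total(quantity_period, price_period):
    """Sum of q * p over all products for the chosen periods (0 or 1)."""
    return sum(row[quantity_period] * row[2 + price_period]
               for row in data.values())


q0p0, q1p1 = total(0, 0), total(1, 1)
q0p1, q1p0 = total(0, 1), total(1, 0)
print(q0p0, q1p1, q0p1, q1p0)  # 184000 195300 191000 189000

value_index = q1p1 / q0p0
laspeyres_q, laspeyres_p = q1p0 / q0p0, q0p1 / q0p0
paasche_q, paasche_p = q1p1 / q0p1, q1p1 / q1p0
fisher_q = sqrt(laspeyres_q * paasche_q)
fisher_p = sqrt(laspeyres_p * paasche_p)

print(round(value_index, 4))                         # 1.0614
print(round(laspeyres_q, 4), round(paasche_p, 4))    # 1.0272 1.0333
print(round(paasche_q, 4), round(laspeyres_p, 4))    # 1.0225 1.038
```

The product of a Laspeyres index and the complementary Paasche index reproduces the value index, and so does the product of the two Fisher indexes.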
Indexes express only relative changes in observed variables. They are not
sufficient to analyse the development to the full extent. For this reason we must know
also the absolute value of the change expressed by an index. To do this we proceed
in different ways for quantity and value variables. For quantities, apart from the index
$$i_q = \frac{q_1}{q_0},$$
we are also interested in the absolute value of the quantity under investigation, which
is given by
$$q_1 - q_0.$$
In value variables, the difference between the numerator and denominator of the
index fraction only indicates the increase (decrease) of the level.
a) One period is taken for a base with the value $y_0$ of the variable to be analysed and
the ratios of other outcomes $y_n$ in current (other) periods to this base period are
calculated. In this way we obtain what is called basic indexes or constant base
indexes,
$$i_{n/0} = \frac{y_n}{y_0}, \quad n = 0, 1, 2, \dots$$
b) A given period is always compared with the previous one, $y_n$ being divided by $y_{n-1}$.
In this way we obtain chain indexes or moving base indexes,
$$i_{n/n-1} = \frac{y_n}{y_{n-1}}, \quad n = 1, 2, \dots$$
$$i_{m/k}\, i_{n/m} = i_{n/k}.$$
We use both basic and chain indexes to set up time series to be processed by
methods described in Chapter 3. For example, the geometric mean of chain indexes
is the average index that expresses the same relative change in the given variable
between individual periods of equal length. Note that, in this case, we cannot use the
arithmetic mean.
E x a m p l e 4.5
Calculate the basic and chain indexes for the data in the following table.
January 75 000 t q0
February 75 250 t q1
March 81 000 t q2
April 82 100 t q3
Solution:
Basic indexes: $i_{0/0} = 1.0000$; $i_{1/0} \approx 1.0033$; $i_{2/0} = 1.0800$; $i_{3/0} \approx 1.0947$
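Both kinds of indexes for this example can be obtained in a few lines. The sketch below uses the monthly figures of Example 4.5; the variable names are illustrative, and the average index is computed as the geometric mean of the chain indexes, as discussed before the example.

```python
from math import prod

q = [75_000, 75_250, 81_000, 82_100]  # January to April, Example 4.5

basic = [v / q[0] for v in q]                  # each month against January
chain = [b / a for a, b in zip(q, q[1:])]      # each month against the previous one
mean_chain = prod(chain) ** (1 / len(chain))   # geometric mean of chain indexes

print([round(i, 4) for i in basic])  # [1.0, 1.0033, 1.08, 1.0947]
print([round(i, 4) for i in chain])  # [1.0033, 1.0764, 1.0136]
```

Multiplying the chain indexes together gives the last basic index, which is why the geometric mean, not the arithmetic mean, is the appropriate average.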
Exercises
Exercise 4.6
The beer production of a brewery for individual types of beer with the average yearly
prices in 1996 and 1997 is given in the table where q is the quantity (hl) and p is the
price (CZK/hl):
Period q p q p q p
Since the products are homogeneous, calculate both single and group indexes
related to the base year 1996 and interpret them in terms of absolute changes in
price, quantity, and value of the beer production.
Exercise 4.7
Using single indexes, composite indexes (of the Laspeyres, Paasche, and Fisher
type) for both value and quantity, and the cost index, calculate changes for a "small"
market basket of a typical four-member family. The average retail prices p (CZK/kg or
CZK/l) and the quantities of purchased food q (kg or l) are shown in the following
table. Interpret the resulting indexes in terms of absolute changes in the price of the
basket.
Period q p q p q p q p
Exercise 4.8
The table contains the figures of monthly loans (in millions of CZK) given by a bank in
1997. Calculate the basic and chain indexes of the amounts loaned. For the base
period take (a) January, (b) July. What was the average index of monthly loans?
53.2 56.2 49.5 48.0 47.6 52.8 54.5 42.9 49.2 56.0 57.1 55.0
Exercise 4.9
Use the basic index (with 1978 as the base period) and chain index to analyse the
data from Exercise 3.10 on the average fixed assets (in millions of CZK) for the years
1978 to 1985. On the assumption that the trend of the time series does not change
use the average chain index to make a forecast of the average fixed assets in 1987.
Questions
2. How are single indexes determined and what are their properties?
3. How are group indexes determined and what are their properties?
4. How are composite indexes determined and what are their properties?
6. What are the drawbacks of the Laspeyres and Paasche index? Exemplify by a
particular market basket.
5. MATHEMATICAL STATISTICS
The numbers $x_1, \dots, x_n$ where $x_i$ is an observed value of $X_i$, $i = 1, \dots, n$ are
said to be sample data of size n. Sample data processing is described in Chapter 2.
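a) sample mean $\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$,
b) sample variance $S^2 = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2$,
d) sample correlation coefficient
$$R = \frac{\frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)\left( Y_i - \overline{Y} \right)}{S(X)\, S(Y)}$$
for a random sample from random vector (X, Y) where S(X) and S(Y) are the sample
standard deviations of X and Y.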
We have:
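a) $E\overline{X} = EX$, $D\overline{X} = \frac{DX}{n}$, $ES^2 = \frac{n-1}{n} DX$.
b) If X has a normal distribution $N(\mu, \sigma^2)$, then $\frac{\overline{X} - \mu}{S} \sqrt{n-1}$ has the so-called
t-distribution S(n - 1) or Student's distribution and $\frac{n S^2}{\sigma^2}$ has the so-called
chi-square distribution $\chi^2(n-1)$.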
You can find more information on statistics and their distributions in [4].
Parameter estimation
Further types of estimators (such as maximum likelihood estimators) are described in
[1] and [4]. It holds:
assumes for sample data $x_1, \dots, x_n$. Point estimates of the basic number
characteristics are calculated as follows:
$$EX = \bar{x}, \quad DX = \frac{n}{n-1} s^2, \quad \sigma(X) = \sqrt{\frac{n}{n-1}}\, s, \quad \rho(X, Y) = r.$$
An interval estimator for a parameter $\vartheta$ with confidence $1 - \alpha$ is a pair of
statistics $(T_1; T_2)$ such that
$$P(T_1 \le \vartheta \le T_2) = 1 - \alpha$$
focus on both-sided interval estimations. One-sided estimations and interval
estimations for distributions other than normal can be found in [1].
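A $(1 - \alpha)$ confidence interval for the mean value $\mu$ where the variance $\sigma^2$ is
unknown is given by the formula
$$\left\langle \bar{x} - t_{1-\alpha/2} \frac{s}{\sqrt{n-1}};\ \bar{x} + t_{1-\alpha/2} \frac{s}{\sqrt{n-1}} \right\rangle,$$
and a $(1 - \alpha)$ confidence interval for the variance $\sigma^2$ by
$$\left\langle \frac{n s^2}{\chi^2_{1-\alpha/2}};\ \frac{n s^2}{\chi^2_{\alpha/2}} \right\rangle,$$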
E x a m p l e 5.1
By measuring the lengths of 10 rollers sample data have been obtained with sample
characteristics x = 5.37 mm, s2 = 0.0019 mm2 and s = 0.044 mm (see Example
2.1). Calculate the minimum variance unbiased point estimates for the mean value,
variance and standard deviation. Assuming that the observed length is normally
distributed, calculate 0.95 confidence intervals for these characteristics.
S o l u t i o n:
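variance $\sigma^2 = \frac{10}{9}\, 0.0019 = 0.00211$ mm²,
standard deviation $\sigma = \sqrt{0.00211} \approx 0.046$ mm.
A 0.95 confidence interval for the mean value $\mu$ is calculated for $t_{0.975} = 2.262$ for 9
degrees of freedom using Table T2,
$$\mu \in \left\langle 5.37 - \frac{\sqrt{0.0019}}{\sqrt{10-1}}\, 2.262;\ 5.37 + \frac{\sqrt{0.0019}}{\sqrt{10-1}}\, 2.262 \right\rangle \approx \langle 5.337;\ 5.403 \rangle \text{ mm}.$$
A 0.95 confidence interval for variance $\sigma^2$ is calculated for $\chi^2_{0.025} = 2.70$ and $\chi^2_{0.975} = 19.0$,
$$\sigma^2 \in \left\langle \frac{10(0.0019)}{19.0};\ \frac{10(0.0019)}{2.70} \right\rangle \approx \langle 0.00100;\ 0.00704 \rangle \text{ mm}^2,$$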
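The arithmetic of these intervals can be reproduced as follows. This is a minimal sketch: the critical values are typed in from the text (tables T2 and T3) rather than computed, and the variable names are illustrative.

```python
from math import sqrt

n, mean, s2 = 10, 5.37, 0.0019      # sample characteristics, Example 5.1
t_crit = 2.262                      # t_{0.975} for 9 degrees of freedom
chi2_low, chi2_high = 2.70, 19.0    # chi-square quantiles 0.025 and 0.975

# Point estimate of the variance: n / (n - 1) * s^2.
unbiased_var = n / (n - 1) * s2

# Confidence interval for the mean (s^2 is computed with divisor n).
half_width = t_crit * sqrt(s2) / sqrt(n - 1)
mean_ci = (mean - half_width, mean + half_width)

# Confidence interval for the variance.
var_ci = (n * s2 / chi2_high, n * s2 / chi2_low)

print(round(unbiased_var, 5))          # 0.00211
print([round(v, 3) for v in mean_ci])  # [5.337, 5.403]
print([round(v, 5) for v in var_ci])   # [0.001, 0.00704]
```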
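$$\langle \operatorname{tgh} z_1;\ \operatorname{tgh} z_2 \rangle,$$
where
$$z_1 = w - \frac{u_{1-\alpha/2}}{\sqrt{n-3}}, \quad z_2 = w + \frac{u_{1-\alpha/2}}{\sqrt{n-3}}, \quad w = \frac{1}{2} \left( \ln \frac{1+r}{1-r} + \frac{r}{n-1} \right), \quad \operatorname{tgh} z = \frac{e^z - e^{-z}}{e^z + e^{-z}} = \frac{e^{2z} - 1}{e^{2z} + 1}$$
Table T1. For $1 - \alpha = 0.95$ and $1 - \alpha = 0.99$, we have $u_{0.975} = 1.960$ and $u_{0.995} = 2.576$
respectively.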
E x a m p l e 5.2
coefficient of r = 0.82482 (see Example 2.3). Calculate the minimum variance
unbiased point estimator and find a 0.99 confidence interval for the correlation
coefficient $\rho$ of the parent population.
S o l u t i o n:
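$$w = \frac{1}{2} \left( \ln \frac{1 + 0.82482}{1 - 0.82482} + \frac{0.82482}{10 - 1} \right) \approx 1.21753.$$
$$z_1 = 1.21753 - \frac{2.576}{\sqrt{10-3}} \approx 0.24397, \qquad z_2 = 1.21753 + \frac{2.576}{\sqrt{10-3}} \approx 2.19110$$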
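and a 0.99 confidence interval for the correlation coefficient $\rho$ is
$$\rho \in \langle \operatorname{tgh} 0.24397;\ \operatorname{tgh} 2.19110 \rangle \approx \langle 0.239;\ 0.975 \rangle.$$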
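The steps of this calculation can be sketched in code. The function below follows the formulas for w, z1 and z2 as given in the text; its name and signature are illustrative, and the quantile u is passed in from Table T1 rather than computed.

```python
from math import log, sqrt, tanh


def correlation_ci(r, n, u):
    """Return w and the interval <tgh z1; tgh z2> for the correlation coefficient."""
    w = 0.5 * (log((1 + r) / (1 - r)) + r / (n - 1))
    half = u / sqrt(n - 3)
    return w, tanh(w - half), tanh(w + half)


w, lower, upper = correlation_ci(r=0.82482, n=10, u=2.576)
print(round(w, 5))                       # 1.21753
print(round(lower, 3), round(upper, 3))  # 0.239 0.975
```

The interval is wide because the sample contains only ten pairs.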
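$$\left\langle \frac{x}{n} - u_{1-\alpha/2} \sqrt{\frac{\frac{x}{n} \left( 1 - \frac{x}{n} \right)}{n}};\ \frac{x}{n} + u_{1-\alpha/2} \sqrt{\frac{\frac{x}{n} \left( 1 - \frac{x}{n} \right)}{n}} \right\rangle$$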
E x a m p l e 5.3
When asked about a new product by a marketing research agency, 80 out of the 400
customers of the STAMET supermarket answered that they would buy it. Calculate
the minimum variance unbiased point estimator and find a $(1 - \alpha)$ confidence interval for
the ratio p of such customers to all the STAMET customers.
S o l u t i o n:
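Since x = 80 and n = 400, the point estimator assumes the value $p = \frac{80}{400} = 0.2$, or
20% of the customers.
For confidence 0.95 we have $u_{0.975} = 1.960$, which yields the following 0.95
confidence interval for p
$$p \in \left\langle \frac{80}{400} - 1.960 \sqrt{\frac{\frac{80}{400} \left( 1 - \frac{80}{400} \right)}{400}};\ \frac{80}{400} + 1.960 \sqrt{\frac{\frac{80}{400} \left( 1 - \frac{80}{400} \right)}{400}} \right\rangle = \dots = \langle 0.1608;\ 0.2392 \rangle.$$
Similarly, for confidence 0.99 we have $u_{0.995} = 2.576$ and
$$p \in \langle 0.1485;\ 0.2515 \rangle.$$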
We can say with a 0.95 or 0.99 confidence that about 16% to 24% or 15% to 25% of
the STAMET customers will buy the new product. If there are about 10 000 STAMET
customers it may be expected that 2 000 products will be sold. A 0.95 confidence
interval tells us that STAMET will sell approximately 10 000(0.16) = 1600 to
10 000(0.24) = 2400 new products.
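Both intervals of this example can be recomputed with a short function. This is only a sketch: the name `proportion_ci` is illustrative and the normal quantiles are taken from the text.

```python
from math import sqrt


def proportion_ci(x, n, u):
    """Interval x/n -+ u * sqrt((x/n)(1 - x/n)/n) for a proportion."""
    p_hat = x / n
    half = u * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


for u in (1.960, 2.576):  # confidence 0.95 and 0.99
    low, high = proportion_ci(80, 400, u)
    print(round(low, 4), round(high, 4))
# 0.1608 0.2392
# 0.1485 0.2515
```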
Testing statistical hypotheses
the so-called test criterion or test statistic. For $\vartheta = \vartheta_0$, the range of values of the test
criterion is divided into two disjoint subsets - the critical range $\overline{W}_\alpha$ and its
If, for sample data $x_1, \dots, x_n$, the test criterion assumes a value
$t = T(x_1, \dots, x_n)$ from the critical range, that is $t \in \overline{W}_\alpha$, we reject the hypothesis H and
do not reject the hypothesis $\overline{H}$. If, on the other hand, t lies outside the critical range,
that is $t \in W_\alpha$, we reject the hypothesis $\overline{H}$ and do not reject the hypothesis H at the level
(1) the so-called first type error, when H is true but $t \in \overline{W}_\alpha$, so that we reject it (the
(2) the so-called second type error, when H is not true but $t \in W_\alpha$, so that we do not
Since a test criterion T is a random variable, the range $W_\alpha$ is often in the form
critical values) as in confidence intervals. You can find more about statistical
hypotheses and their testing in [1], [2], [3] and [4].
We assume that random variables and vectors are normally distributed. The
following testing criteria are for two-tailed alternative hypotheses such as $\overline{H}: \mu \neq \mu_0$
except for the variance equality test.
with n - 1 degrees of freedom. These values can be found in Table T2. This is the
so-called one-sample t-test.
E x a m p l e 5.4
$s^2 = 0.0019$ mm² have been established (see Problem 2.1). At the 0.05 level of
significance, test the hypothesis that the mean value of the roller length measured is
5.40 mm, and so $H: \mu = 5.40$.
S o l u t i o n:
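\[ t = \frac{5.37 - 5.40}{\sqrt{0.0019}}\sqrt{10 - 1} \approx -2.0647. \]
For 10 - 1 = 9 degrees of freedom we have $t_{0.975} = 2.262$ from Table T2 so that
$\overline{W}_{0.05} = \langle -2.262;\ 2.262 \rangle$. Since $t \in \overline{W}_{0.05}$, we do not reject the hypothesis. To test this
hypothesis we could also use the 0.95 confidence interval from Example 5.1. Since
this interval includes the hypothetical value of 5.40, we do not reject the hypothesis at
the level of significance 1 - 0.95 = 0.05.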
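The value of the test criterion in Example 5.4 can be reproduced as follows. This is a minimal sketch: the function name is hypothetical, the variance is assumed to be the empirical one with divisor n (which is why the factor is the square root of n - 1), and the critical value 2.262 is the standard Student quantile for 9 degrees of freedom, typed in rather than computed.

```python
import math

def one_sample_t(mean, var_n, n, mu0):
    """Test criterion of the one-sample t-test.

    var_n is the empirical variance with divisor n, so the
    statistic is multiplied by sqrt(n - 1) rather than sqrt(n).
    """
    return (mean - mu0) / math.sqrt(var_n) * math.sqrt(n - 1)

t = one_sample_t(mean=5.37, var_n=0.0019, n=10, mu0=5.40)
t_crit = 2.262                 # t_0.975 for 9 degrees of freedom (Table T2)
reject = abs(t) > t_crit       # two-tailed decision at the 0.05 level
print(round(t, 4), reject)
```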
\[ t = \frac{n s^2}{\sigma_0^2} \]
E x a m p l e 5.5
At the 0.05 level of significance, test the hypothesis that the variance of the value of the
roller length measured in Example 5.2 is 0.0025 mm², so that $H: \sigma^2 = 0.0025$.
S o l u t i o n:
\[ t = \frac{10(0.0019)}{0.0025} = 7.6. \]
For 10 - 1 = 9 degrees of freedom we determine $\chi^2_{0.025} = 2.70$ and $\chi^2_{0.975} = 19.0$ from
Table T3 so that $\overline{W}_{0.05} = \langle 2.70;\ 19.0 \rangle$. Since $t \in \overline{W}_{0.05}$, we do not reject the
hypothesis.
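As a quick check of Example 5.5, the sketch below evaluates the same criterion and compares it with the two quantiles quoted from Table T3. The function name is hypothetical and the quantiles are copied from the text, not computed.

```python
def variance_test_statistic(var_n, n, sigma0_sq):
    """Test criterion t = n * s^2 / sigma_0^2 (s^2 with divisor n)."""
    return n * var_n / sigma0_sq

t = variance_test_statistic(var_n=0.0019, n=10, sigma0_sq=0.0025)

# Quantiles of the Pearson (chi-square) distribution with 9 degrees
# of freedom, as quoted in the text from Table T3.
chi2_lower, chi2_upper = 2.70, 19.0
reject = not (chi2_lower <= t <= chi2_upper)
print(round(t, 4), reject)
```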
\[ t = \left( \ln\frac{1 + r}{1 - r} - \ln\frac{1 + \rho_0}{1 - \rho_0} - \frac{\rho_0}{n - 1} \right)\frac{\sqrt{n - 3}}{2} \]
E x a m p l e 5.6
S o l u t i o n:
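\[ t = \left( \ln\frac{1 + 0.82482}{1 - 0.82482} - \ln\frac{1 + 0}{1 - 0} - \frac{0}{10 - 1} \right)\frac{\sqrt{10 - 3}}{2} \approx 3.1001. \]
For the given level of significance we find $u_{0.995} = 2.576$ in Table T1 so that
$\overline{W}_{0.01} = \langle -2.576;\ 2.576 \rangle$. Since $t \notin \overline{W}_{0.01}$, we reject the hypothesis and consider X, Y as
dependent.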
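The criterion used in Example 5.6 is easy to mistype on a calculator, so a small sketch may help. The function name is hypothetical; the inputs r = 0.82482, the hypothetical value 0 and n = 10 are those appearing in the solution above, and 2.576 is the quantile quoted from Table T1.

```python
import math

def correlation_test_statistic(r, rho0, n):
    """Test criterion for H: rho = rho0 based on the logarithmic transform."""
    z_r = math.log((1 + r) / (1 - r))         # transform of the sample coefficient
    z_0 = math.log((1 + rho0) / (1 - rho0))   # transform of the hypothetical value
    return (z_r - z_0 - rho0 / (n - 1)) * math.sqrt(n - 3) / 2

t = correlation_test_statistic(r=0.82482, rho0=0.0, n=10)
u_crit = 2.576                 # u_0.995 from Table T1
reject = abs(t) > u_crit
print(round(t, 4), reject)
```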
their differences and by $\bar{d}$ and $s^2(d)$ their empirical characteristics. The test criterion
is given by
\[ t = \frac{\bar{d}}{s(d)}\sqrt{n - 1} \]
distribution with n - 1 degrees of freedom, which can be found in Table T2. This is the
so-called t-test for paired values.
Example 5.7
Using two thermometers the following pairs of temperature values have been
measured over eight days: (xi; yi) = (51.8; 49.5), (54.9; 53.3), (52.2; 50.6),
(53.3; 52.0), (51.6; 46.8), (54.1; 50.5), (54.2; 52.1), (53.3; 53.0) (°C). At the level of
significance 1%, test the hypothesis that the difference of the mean values is
insignificant, so that $H: \mu(X) = \mu(Y)$.
S o l u t i o n:
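For $d_i = x_i - y_i$, i = 1,...,8, we get $\bar{d} = 2.2$ °C and $s(d) = 1.3172$ °C. The test criterion
assumes the value
\[ t = \frac{2.2}{1.3172}\sqrt{8 - 1} \approx 4.4190. \]
For 8 - 1 = 7 degrees of freedom we have $t_{0.995} = 3.499$ from Table T2 so that
$\overline{W}_{0.01} = \langle -3.499;\ 3.499 \rangle$. Since $t \notin \overline{W}_{0.01}$, we reject the hypothesis.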
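The paired-values computation of Example 5.7 can be repeated directly from the eight measured pairs. The sketch below assumes, as the quoted value s(d) = 1.3172 suggests, that the empirical standard deviation uses divisor n; the variable names are illustrative only.

```python
import math

# Temperature pairs (x_i, y_i) from Example 5.7, in degrees Celsius.
pairs = [(51.8, 49.5), (54.9, 53.3), (52.2, 50.6), (53.3, 52.0),
         (51.6, 46.8), (54.1, 50.5), (54.2, 52.1), (53.3, 53.0)]

d = [x - y for x, y in pairs]                 # differences d_i = x_i - y_i
n = len(d)
d_mean = sum(d) / n                           # empirical mean of the differences
s_d = math.sqrt(sum((v - d_mean) ** 2 for v in d) / n)   # divisor n, not n - 1
t = d_mean / s_d * math.sqrt(n - 1)           # test criterion of the paired t-test

t_crit = 3.499                                # t_0.995 for 7 degrees of freedom (Table T2)
reject = abs(t) > t_crit
print(round(d_mean, 4), round(s_d, 4), round(t, 4), reject)
```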
For the following tests we assume that by observing two independent random
variables X and Y, normally distributed with parameters $\mu(X)$, $\sigma^2(X)$ and $\mu(Y)$, $\sigma^2(Y)$,
sample data of sizes $n_1$ and $n_2$ have been obtained.
\[ t = \frac{\bar{x} - \bar{y} - \mu_0}{\sqrt{n_1 s^2(x) + n_2 s^2(y)}}\sqrt{\frac{n_1 n_2 (n_1 + n_2 - 2)}{n_1 + n_2}} \]
distribution with $n_1 + n_2 - 2$ degrees of freedom listed in Table T2. This is the
so-called two-sample t-test.
E x a m p l e 5.8
$s^2(y) = 0.4522$ kN². At the level of significance 0.05, test the hypothesis that the
different technologies do not affect the expected value of the wire strength (assuming
that the variances $\sigma^2(X)$ and $\sigma^2(Y)$ are the same), so that $H: \mu(X) - \mu(Y) = 0$.
S o l u t i o n:
The test criterion assumes the value
\[ t_{1-\alpha/2} = \frac{\dfrac{s^2(x)}{n_1 - 1}\,t(x) + \dfrac{s^2(y)}{n_2 - 1}\,t(y)}{\dfrac{s^2(x)}{n_1 - 1} + \dfrac{s^2(y)}{n_2 - 1}} \]
$n_2 - 1$ degrees of freedom respectively shown in Table T2. This is the so-called
two-sample t-test.
E x a m p l e 5.9
Surveys made to determine the mean service life of products in two different systems
of extreme conditions yielded two sets of sample data with the following sample
characteristics n1 = 21, x = 3.581, s2(x) = 0.114, n2 = 23, y = 3.974, s2(y) = 0.041 (the
length of the life is in hours). At the level of significance 0.05, test the hypothesis that
the second system of extreme conditions increases the mean service life by 0.5 h as
compared with the first one (assuming different variances $\sigma^2(X)$ and $\sigma^2(Y)$) so that
$H: \mu(X) - \mu(Y) = -0.5$.
S o l u t i o n:
Entering Table T2 at $1 - \alpha/2 = 0.975$ we see that t(x) = 2.086 for 21 - 1 = 20 degrees
of freedom and t(y) = 2.074 for 23 - 1 = 22 degrees of freedom. Then we can
calculate
\[ t_{0.975} = \frac{\dfrac{0.114}{21 - 1}(2.086) + \dfrac{0.041}{23 - 1}(2.074)}{\dfrac{0.114}{21 - 1} + \dfrac{0.041}{23 - 1}} \approx 2.083 \]
and $\overline{W}_{0.05} = \langle -2.083;\ 2.083 \rangle$. Since $t \in \overline{W}_{0.05}$, we do not reject the hypothesis that
Testing hypothesis $H: \sigma^2(X) = \sigma^2(Y)$ against alternative hypothesis
\[ t = \frac{\dfrac{n_1 s^2(x)}{n_1 - 1}}{\dfrac{n_2 s^2(y)}{n_2 - 1}} \]
E x a m p l e 5.10
At the level of significance 0.05, test the hypothesis that the variances $\sigma^2(X) > \sigma^2(Y)$
are different in Example 5.9 where $s^2(x)$ = 0.114, $n_1$ = 21, $s^2(y)$ = 0.041, $n_2$ = 23.
S o l u t i o n:
\[ t = \frac{\dfrac{21(0.114)}{21 - 1}}{\dfrac{23(0.041)}{23 - 1}} \approx 2.7926. \]
hypothesis. We consider the assumption that the variances are different as correct.
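The ratio computed in Example 5.10 can be verified with the sketch below. The function name is hypothetical; the sample variances are assumed to use divisor n, so each is rescaled by n/(n - 1) before the ratio is formed. The critical value is not computed here and has to be read from Table T4.

```python
def variance_ratio(var_x, n1, var_y, n2):
    """Test criterion for the equality of two variances.

    Each empirical variance (divisor n) is converted to the
    unbiased form n * s^2 / (n - 1) before taking the ratio.
    """
    return (n1 * var_x / (n1 - 1)) / (n2 * var_y / (n2 - 1))

t = variance_ratio(var_x=0.114, n1=21, var_y=0.041, n2=23)
print(round(t, 4))
```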
the fact that x elements out of the n elements in a random sample have the property
(see estimations of parameters).
E x a m p l e 5.11
S o l u t i o n:
\[ t = \frac{\dfrac{62}{400} - 0.2}{\sqrt{\dfrac{(0.2)(1 - 0.2)}{400}}} = \frac{-0.045}{0.02} = -2.25. \]
From Table T1 we get $u_{0.975} = 1.960$. Since $t = -2.25 \notin \overline{W}_{0.05} = \langle -1.960;\ 1.960 \rangle$, we
reject the hypothesis that 20% of customers will be interested at the 0.05 significance
level. The real interest will probably be lower. Note that we would not reject the
hypothesis at the 0.01 significance level since $u_{0.995} = 2.576$.
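The two decisions in Example 5.11 (reject at 0.05, do not reject at 0.01) can be illustrated with a short sketch. The function name is hypothetical; instead of reading Table T1, the quantiles are obtained here from `statistics.NormalDist` in the Python standard library, which is a substitution made only for this illustration.

```python
import math
from statistics import NormalDist

def proportion_test_statistic(x, n, p0):
    """Test criterion for H: p = p0 using the normal approximation."""
    return (x / n - p0) / math.sqrt(p0 * (1 - p0) / n)

t = proportion_test_statistic(x=62, n=400, p0=0.2)

# Quantiles of N(0, 1); Table T1 lists them as 1.960 and 2.576.
u_975 = NormalDist().inv_cdf(0.975)
u_995 = NormalDist().inv_cdf(0.995)

reject_05 = abs(t) > u_975     # decision at the 0.05 level
reject_01 = abs(t) > u_995     # decision at the 0.01 level
print(round(t, 4), reject_05, reject_01)
```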
For the following test we assume that we observe two independent random
variables X, Y which have alternative distributions with parameters $p_1$, $p_2$ and that
two independent sets of sample data of sizes $n_1$, $n_2$ have been obtained, with
the respective numbers x, y of elements with the desired property (see estimations
of parameters).
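\[ t = \frac{\dfrac{x}{n_1} - \dfrac{y}{n_2}}{\sqrt{f(1 - f)}}\sqrt{\frac{n_1 n_2}{n_1 + n_2}}, \]
for $f = \dfrac{x + y}{n_1 + n_2}$ and $\overline{W}_\alpha = \langle -u_{1-\alpha/2};\ u_{1-\alpha/2} \rangle$, where $u_{1-\alpha/2}$ is the $(1 - \alpha/2)$-quantile of the
normal distribution N(0, 1) with values shown in Table T1.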
E x a m p l e 5.12
Shop inspectors bought 250 items of food and 200 items of hard goods to test their
quality. Subsequently, they found 108 items of food and 73 items of hard goods to be
defective. At the 0.05 significance level, test whether the quality of food and hard goods is
equal, that is, the hypothesis $H: p_1 = p_2$, where $p_1$, $p_2$ are the theoretical ratios
(probabilities) of buying defective items for the given kinds of goods.
S o l u t i o n:
\[ f = \frac{108 + 73}{250 + 200} \approx 0.40222, \]
\[ t = \frac{\dfrac{108}{250} - \dfrac{73}{200}}{\sqrt{0.40222(1 - 0.40222)}}\sqrt{\frac{250(200)}{250 + 200}} \approx \frac{0.067(10.5409)}{0.49035} \approx 1.4403. \]
We establish $u_{0.975} = 1.960$ from Table T1. Since $t = 1.4403 \in \overline{W}_{0.05} = \langle -1.960;\ 1.960 \rangle$,
at the 0.05 significance level we do not reject the hypothesis that the
probabilities of buying a defective item are the same for both kinds of goods and
consider both kinds of goods to be of equally bad quality.
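A sketch of the calculation in Example 5.12 follows. The function name is hypothetical; the pooled ratio f and the criterion t are computed as in the solution above, and 1.960 is the quantile quoted from Table T1.

```python
import math

def two_proportion_statistic(x, n1, y, n2):
    """Test criterion for H: p1 = p2; returns (t, f)."""
    f = (x + y) / (n1 + n2)                   # pooled ratio of both samples
    diff = x / n1 - y / n2                    # difference of the sample ratios
    t = diff / math.sqrt(f * (1 - f)) * math.sqrt(n1 * n2 / (n1 + n2))
    return t, f

t, f = two_proportion_statistic(x=108, n1=250, y=73, n2=200)
u_crit = 1.960                                # u_0.975 from Table T1
reject = abs(t) > u_crit
print(round(f, 5), round(t, 4), reject)
```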
Regression analysis
Regression function
\[ y = \varphi(x, \boldsymbol{\beta}) = E(Y \mid X = x), \]
\[ S^* = \sum_{i=1}^{n} \left[ y_i - \varphi(x_i, \boldsymbol{\beta}) \right]^2 \]
\[ y = \sum_{j=1}^{m} \beta_j f_j(x), \]
4. Random variables Yi are not correlated and are normally distributed for i = 1,..., n.
a) For j = 1,..., m, E j is the minimum variance unbiased point estimate for the
Hb = g.
b) The minimum variance unbiased point estimate for the linear regression function
is
\[ y = \sum_{j=1}^{m} b_j f_j(x). \]
where
\[ S^*_{\min} = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} b_j f_{ji} \right)^2 = \sum_{i=1}^{n} y_i^2 - \sum_{j=1}^{m} b_j g_j \]
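\[ \left\langle b_j - s\sqrt{h^{jj}}\,t_{1-\alpha/2};\ b_j + s\sqrt{h^{jj}}\,t_{1-\alpha/2} \right\rangle, \]
j = 1, ..., m, where $h^{jj}$ is the jth diagonal element of the matrix $H^{-1}$ and $t_{1-\alpha/2}$ is
the $(1 - \alpha/2)$-quantile of the Student distribution with n - m degrees of freedom.
\[ t = \frac{b_j - \beta_{j0}}{s\sqrt{h^{jj}}} \]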
The simplest and the most used linear regression function is the so-called
regression line
\[ y = \beta_1 + \beta_2 x. \]
\[ F = \begin{pmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}. \]
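When calculating with a pocket calculator, we can use the following explicit formulas
($\sum$ stands for $\sum_{i=1}^{n}$):
1) $H = \begin{pmatrix} \sum 1 & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}$, $g = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}$, $\sum 1 = n$,
2) $\det H = n\sum x_i^2 - \left(\sum x_i\right)^2$, $b_2 = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{\det H}$, $b_1 = \bar{y} - b_2\bar{x}$,
3) $S^*_{\min} = \sum (y_i - b_1 - b_2 x_i)^2 = \sum y_i^2 - b_1\sum y_i - b_2\sum x_i y_i$, $s^2 = \dfrac{S^*_{\min}}{n - 2}$,
4) $h^{11} = \dfrac{\sum x_i^2}{\det H}$, $h^{22} = \dfrac{n}{\det H}$,
5) $h^* = \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum x_i^2 - n\bar{x}^2} = \dfrac{1}{n} + \dfrac{n(x - \bar{x})^2}{\det H}$,
6) $r^2 = 1 - \dfrac{S^*_{\min}}{\sum y_i^2 - n\bar{y}^2} = r^2(x, y)$, where r(x,y) is the correlation coefficient, see
Chapter 2.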
Example 5.13
xi 3 5 5 8 9 11 12 15
Find out how the annual sales of a company depend on the number of its employees.
Using the formula $y = \beta_1 + \beta_2 x$, calculate a 0.95 confidence interval for $\beta_2$; at a
0.05 level of significance, test the hypothesis $H: \beta_1 = 0.2$; find the minimum variance
unbiased point estimate and a 0.95 confidence interval for y(10). Using a graph and
the correlation coefficient, analyse the suitability of the regression function.
S o l u t i o n:
This is a regression line and so using the above formulas and table, for n = 8,
we can set up the matrix $H = \begin{pmatrix} 8 & 68 \\ 68 & 694 \end{pmatrix}$ with $\det H = (8)(694) - 68^2 = 928$. This
yields the point estimate for $\beta_2$:
\[ b_2 = \frac{(8)(150.2) - (68)(15.2)}{928} = 0.1810344 \approx 0.181. \]
Next we have $\bar{x} = 68/8 = 8.5$ and $\bar{y} = 15.2/8 = 1.9$, so that the point estimate for $\beta_1$
is
\[ b_1 = 1.9 - (0.1810344)(8.5) = 0.3612068 \approx 0.361. \]
Thus we get the point estimate for the regression function y = 0.361 + 0.181x.
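The point estimates of Example 5.13 can be recomputed from the sums used above. In the sketch below the x values are those of the table row for the example, while the sums 15.2 (of the y values) and 150.2 (of the products) are taken from the solution because the individual y values are not reproduced here; the variable names are illustrative.

```python
# Numbers of employees x_i from Example 5.13.
x = [3, 5, 5, 8, 9, 11, 12, 15]
n = len(x)

sum_x = sum(x)                         # 68
sum_xx = sum(v * v for v in x)         # 694
sum_y = 15.2                           # sum of y_i, quoted in the solution
sum_xy = 150.2                         # sum of x_i * y_i, quoted in the solution

det_h = n * sum_xx - sum_x ** 2        # determinant of the matrix H
b2 = (n * sum_xy - sum_x * sum_y) / det_h      # slope estimate
b1 = sum_y / n - b2 * sum_x / n                # intercept estimate

print(det_h, round(b2, 7), round(b1, 7))
```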
The diagonal elements of $H^{-1}$ are $h^{11} = 694/928 \approx 0.7478448$ and $h^{22} = 8/928 \approx 0.00862069$.
In Table T2, for 8 - 2 = 6 degrees of freedom, we find $t_{0.975} = 2.447$.
The 0.95 confidence interval for the regression coefficient $\beta_2$ is approximately $\langle 0.149;\ 0.213 \rangle$.
We have found a point estimate of 181 000 CZK for the increase in annual sales
corresponding to an increase in the number of employees by one. A 0.95 confidence
interval for this value is 149 000 CZK to 213 000 CZK.
\[ t = \frac{0.3612068 - 0.2}{0.1404017\sqrt{0.7478448}} \approx 1.3277. \]
For the alternative hypothesis $\overline{H}: \beta_1 \neq 0.2$ we get $\overline{W}_{0.05} = \langle -2.447;\ 2.447 \rangle$. Since
$t \in \overline{W}_{0.05}$, at this level of significance we do not reject the hypothesis that a
company without employees (the owners themselves work) will have annual sales of
about 200 000 CZK, since $y(0) = \beta_1$.
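The point estimate for the average and individual sales at a company that has
ten employees is the following:
\[ y(10) = 0.3612068 + (0.1810344)(10) \approx 2.172. \]
For the given company annual sales of about 2 172 000 CZK may be expected. Since
\[ h^* = \frac{1}{8} + \frac{8(10 - 8.5)^2}{928} = 0.1443965, \]
the interval $\langle 2.040;\ 2.302 \rangle$ (in millions of CZK)
is a 0.95 confidence interval for the average annual sales of a company with ten employees.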
With probability 0.95 it may be expected that the average annual sales of such a
company will range between 2 040 000 CZK and 2 302 000 CZK.
With a 0.95 probability it may be expected that the annual sales (its individual value)
of such a company will range from 1 804 000 CZK to 2 539 000 CZK.
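The two ranges quoted above can be approximately reproduced as follows. This is a sketch under assumptions: the half-widths are taken as t·s·sqrt(h*) for the average value and t·s·sqrt(1 + h*) for an individual value, which are the usual regression-line formulas and are not spelled out in the text here; the residual standard deviation s = 0.1404017 is the value appearing in the test of the intercept above.

```python
import math

b1, b2 = 0.3612068, 0.1810344     # point estimates from the solution
s = 0.1404017                     # residual standard deviation used in the solution
n, x_mean, det_h = 8, 8.5, 928
t_975 = 2.447                     # t_0.975 for 6 degrees of freedom (Table T2)

x0 = 10                                          # number of employees
y0 = b1 + b2 * x0                                # point estimate y(10)
h_star = 1 / n + n * (x0 - x_mean) ** 2 / det_h  # formula 5) above

half_mean = t_975 * s * math.sqrt(h_star)        # assumed half-width, average value
half_indiv = t_975 * s * math.sqrt(1 + h_star)   # assumed half-width, individual value

mean_ci = (y0 - half_mean, y0 + half_mean)
indiv_ci = (y0 - half_indiv, y0 + half_indiv)
print([round(v, 3) for v in mean_ci], [round(v, 3) for v in indiv_ci])
```

The results are in millions of CZK and agree with the quoted ranges to within rounding.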
Fig. 5.1 (plot of annual sales in millions of CZK against the number of employees)
The regression functions are either linear or nonlinear (with respect to their
regression coefficients). Some of the nonlinear regression functions can be
transformed into linear ones using a suitable linearization (such as taking a logarithm
of a power or an exponential function). You will find more details about linearization,
tests of the suitability of a linear regression function, regression diagnostics and other
topics in [1], [2], [3] and [4].
Exercises
E x e r c i s e 5.14
Find the minimum variance unbiased point estimate and a 0.99 confidence interval
for the parameters $\mu$ and $\sigma^2$ of a normal distribution, using sample data from a
random sample of size n = 18 with the sample mean $\bar{x}$ = 50.1 and variance
$s^2$ = 17.64.
E x e r c i s e 5.15
Sample data of size n = 12 have sample mean $\bar{x}$ = 77.55 and variance $s^2$ = 1045.65.
Calculate the point estimator and find a 0.99 confidence interval for $\mu$ and $\sigma$ of the
parent population.
E x e r c i s e 5.16
A total of 100 workers of the same category selected at random have been asked
about their wages per hour and the empirical characteristics $\bar{x}$ = 28.64 CZK and
$s^2$ = 1.1979 CZK² have been calculated. Find the minimum variance unbiased point
estimate and a 99% confidence interval for the expected value $\mu$ of wages per hour
and the standard deviation $\sigma$ provided that the parent population is normally distributed.
E x e r c i s e 5.17
E x e r c i s e 5.18
A total of fifty values of a random variable with a normal distribution $N(\mu, \sigma^2)$ have been
used to calculate a sample mean $\bar{x}$ = 610 and a variance $s^2$ = 2770.4. Find interval
estimates for $\mu$ with confidences of 0.9, 0.95 and 0.99.
E x e r c i s e 5.19
Five independent and equally accurate measurements have been carried out to
determine the volume of a vessel: 4.781; 4.792; 4.795; 4.779; 4.769 (in litres). Find a
0.99 confidence interval for the expected value of the volume of the
vessel provided that it is normally distributed.
R e s u l t: <4.761; 4.805> l
E x e r c i s e 5.20
A sample of size n = 128 has been used to determine the correlation coefficient and
the result was r = 0.560. Calculate the minimum variance unbiased point estimator
and find a 95% confidence interval for $\rho$.
E x e r c i s e 5.21
In a total of 46 households chosen at random a survey has been made to find the
relationship between income X and expenses Y. A sample correlation coefficient of
r = 0.638 was calculated. Calculate the minimum variance unbiased point estimator
and determine a 0.95 confidence interval for the correlation coefficient provided that
income and expenses have a two-dimensional normal distribution.
E x e r c i s e 5.22
E x e r c i s e 5.23
During a check of expiry periods of a certain type of tinned meat in a food industry
warehouse, a total of 320 tins have been chosen at random. It has been found that in
59 of them the guarantee periods have expired. Calculate the minimum variance
unbiased point estimate and find a 95% confidence interval for
the percentage of tins in the warehouse with expired guarantee periods. Do the same
for a warehouse with 20 000 tins.
E x e r c i s e 5.24
R e s u l t: <0.041; 0.159>; <0.071; 0.129>; <0.085; 0.115>
E x e r c i s e 5.25
E x e r c i s e 5.26
The following sample data have been provided by a random sample from normal
distribution.
xj* -2 -1 0 1 2 3
fj 1 4 7 3 3 2
E x e r c i s e 5.27
The expected value of humidity in roasted coffee is set at 4.2% and the standard
deviation at 0.4%. The actual percentages of humidity determined by analyzing 20
samples are the following: 4.5; 4.3; 4.1; 4.9; 4.6; 3.2; 4.4; 5.1; 4.8; 4.0; 3.7; 4.4; 3.9;
4.1; 4.2; 4.1; 4.7; 4.3; 4.2; 4.4. At a 5% level of significance test the hypotheses that
a) the expected value of humidity for the parent population complies with the
standard and
b) the standard deviation of humidity for the parent population complies with
the standard.
E x e r c i s e 5.28
E x e r c i s e 5.29
Using a random sample from a two-dimensional normal distribution sample data have
been obtained of size n = 44 and the correlation coefficient has been found to be
r = 0.7417. At a 1% level of significance test the hypothesis that the random variables
for the parent population are independent.
E x e r c i s e 5.30
E x e r c i s e 5.31
E x e r c i s e 5.32
Using two analytical scales a total of 10 samples have been weighed with the
following results: (xi; yi) = (25; 28), (30; 31), (28; 26), (50; 52), (20; 24), (40; 36),
(32; 33), (36; 35), (42; 45), (38; 40) (mg). With a 0.01 level of significance find out
whether the different results are statistically insignificant provided that they are
normally distributed.
E x e r c i s e 5.33
The following are the results of measurements made before and after a scale of a
packing machine has been calibrated: $n_1$ = 12, $\bar{x}$ = 31.2 g, $s^2(x)$ = 0.770 g² and
$n_2$ = 18, $\bar{y}$ = 29.2 g, $s^2(y)$ = 0.378 g². Suppose the variances are equal and the
distribution is normal. At a 0.05 level of significance test the hypothesis that the
expected value has not been changed by the calibration.
E x e r c i s e 5.34
The average grading of a total of twenty study groups of students in a particular year
are shown in the table below:
fj 2 3 5 7 2 1
The overall average grading in the previous year for 20 study groups was $\bar{y}$ = 2.201
and the variance was $s^2(y)$ = 0.012. Test the hypothesis that the average gradings of
the two years do not differ if we assume that the distribution is normal and variances
are equal.
E x e r c i s e 5.35
Two types of rope have been tested for tensile strength. Two samples of an equal
size of n = 18 have been taken and the following values have been calculated:
$\bar{x}$ = 3389.3 N, $s^2(x)$ = 1144.4 N², $\bar{y}$ = 3339.2 N, $s^2(y)$ = 3453.5 N². Assuming the
variances to be different test the hypothesis that the expected tensile strengths of the
ropes are the same. Use a 0.05 level of significance.
E x e r c i s e 5.36
E x e r c i s e 5.37
Two different methods have been used to determine the fat content of milk. Using the
first method for a sample of 12 analyses a variance of $s^2(x)$ = 0.0224 has been
calculated and with the second method a sample of 8 analyses has been used to
produce a variance of $s^2(y)$ = 0.0263. At a 0.01 level of significance test the
hypothesis that both methods are equally accurate in terms of variance.
E x e r c i s e 5.38
Test the hypothesis that the variances in Exercise 5.19 are equal. Use a 0.05 level of
significance.
E x e r c i s e 5.39
The Board of Directors of a large company is considering selling shares of the company's
stock to its own employees. They estimate that about 20% of the employees will buy
the shares. A total of 400 employees chosen at random have been asked whether
they will buy the shares. The answer has been yes in 66 cases. Using a 0.05 level of
significance test the hypothesis that the directors' estimate is realistic.
E x e r c i s e 5.40
A sample of size n = 200 has been taken from products manufactured using a new
technology. Out of those 200 products 31 have been found to be defective. Ascertain
whether the new technology has changed the wastage rate of the products as compared
to a previous rate of 10% determined by long experience. Use a 1% level of
significance.
E x e r c i s e 5.40
In two plants the same type of product is manufactured. The wastage rates in the two
plants should be the same as both use the same technology. In the first plant, 10
products out of a total of 200 chosen at random and checked are defective while in
the second plant it is 23 defective products out of a total of 250. Using a 0.01 level of
significance test whether there is a statistically significant difference in quality
between the two plants.
E x e r c i s e 5.41
E x e r c i s e 5.42
When investigating the dependence of quantity y on quantity x the following data has
been measured:
y(5.4) <6.41;7.23>
E x e r c i s e 5.43
To determine the dependence of this year's demand y on last year's demand x for a
certain type of goods the following data have been collected from 6 businessmen
(pieces):
Calculate the minimum variance unbiased point estimate and find a 95% interval
estimate for the coefficients of the regression line and for the value of this year's
demand at 110 pieces of last year's demand. With a 5% level of significance test the
hypothesis that $\beta_1$ = 0 and determine the correlation coefficient.
E x e r c i s e 5.44
The values y* (in thousands of items) of the demand for a certain type of goods at
prices x* (in thousands of CZK) are shown in the following table:
$x_i^*$ 100 110 140 160 200
$y_i^*$ 120 89 56 41 22
Fit the data using a power regression function $y^* = \gamma x^{*\delta}$ and find point and interval
estimates (using a 0.95 confidence) for the regression coefficients and for the
demand at price 120 CZK.
E x e r c i s e 5.45
The values y* (in thousands of CZK) of net sales at a company over the first six years
x* of operation are shown below:
$x_i^*$ 1 2 3 4 5 6
Questions
3. Define the notion of a parameter and its types.
4. Define a point estimator and show point estimators for the basic number
characteristics.
6. How does a change in the confidence level and the sample size affect the length
of a confidence interval?
11. What estimations and tests of statistical hypotheses are used in regression
analysis?
BIBLIOGRAPHY
3. SPRINTHALL, R. C.: Basic Statistical Analysis (5th ed.). Allyn and Bacon, Boston,
1997.
7. BERENSON, M.: Business Statistics; A First Course. New York, Prentice Hall
1997.
9. SEGER, J., and HINDLS, R.: Statistické metody v tržním hospodářství. Victoria
Publishing, Praha, 1995.
11. LIKEŠ, J., CYHELSKÝ, L., and HINDLS, R.: Úvod do statistiky a pravděpodobnosti
(Statistika A). VŠE, Praha, 1995.
15. CIPRA, T.: Analýza časových řad s aplikacemi v ekonomii. SNTL/Alfa, Praha,
1986.
STATISTICAL TABLES
T2 Quantiles $t_P$ of the Student distribution S(k)
T3 Quantiles $\chi^2_P$ of the Pearson distribution $\chi^2(k)$
T3 Quantiles $\chi^2_P$ of the Pearson distribution $\chi^2(k)$ (continued)
T4 Quantiles $F_P$ of the Fisher–Snedecor distribution $F(k_1, k_2)$ for P = 0.95
k2 \ k1 1 2 3 4 5 6 8 12 24 ∞
1 161.446 199.499 215.707 224.583 230.160 233.988 238.884 243.905 249.052 254.313
2 18.513 19.000 19.164 19.247 19.296 19.329 19.371 19.412 19.454 19.496
3 10.128 9.552 9.277 9.117 9.013 8.941 8.845 8.745 8.638 8.526
4 7.709 6.944 6.591 6.388 6.256 6.163 6.041 5.912 5.774 5.628
5 6.608 5.786 5.409 5.192 5.050 4.950 4.818 4.678 4.527 4.365
6 5.987 5.143 4.757 4.534 4.387 4.284 4.147 4.000 3.841 3.669
7 5.591 4.737 4.347 4.120 3.972 3.866 3.726 3.575 3.410 3.230
8 5.318 4.459 4.066 3.838 3.688 3.581 3.438 3.284 3.115 2.928
9 5.117 4.256 3.863 3.633 3.482 3.374 3.230 3.073 2.900 2.707
10 4.965 4.103 3.708 3.478 3.326 3.217 3.072 2.913 2.737 2.538
11 4.844 3.982 3.587 3.357 3.204 3.095 2.948 2.788 2.609 2.404
12 4.747 3.885 3.490 3.259 3.106 2.996 2.849 2.687 2.505 2.296
13 4.667 3.806 3.411 3.179 3.025 2.915 2.767 2.604 2.420 2.206
14 4.600 3.739 3.344 3.112 2.958 2.848 2.699 2.534 2.349 2.131
15 4.543 3.682 3.287 3.056 2.901 2.790 2.641 2.475 2.288 2.066
16 4.494 3.634 3.239 3.007 2.852 2.741 2.591 2.425 2.235 2.010
17 4.451 3.592 3.197 2.965 2.810 2.699 2.548 2.381 2.190 1.960
18 4.414 3.555 3.160 2.928 2.773 2.661 2.510 2.342 2.150 1.917
19 4.381 3.522 3.127 2.895 2.740 2.628 2.477 2.308 2.114 1.878
20 4.351 3.493 3.098 2.866 2.711 2.599 2.447 2.278 2.082 1.843
21 4.325 3.467 3.072 2.840 2.685 2.573 2.420 2.250 2.054 1.812
22 4.301 3.443 3.049 2.817 2.661 2.549 2.397 2.226 2.028 1.783
23 4.279 3.422 3.028 2.796 2.640 2.528 2.375 2.204 2.005 1.757
24 4.260 3.403 3.009 2.776 2.621 2.508 2.355 2.183 1.984 1.733
25 4.242 3.385 2.991 2.759 2.603 2.490 2.337 2.165 1.964 1.711
26 4.225 3.369 2.975 2.743 2.587 2.474 2.321 2.148 1.946 1.691
27 4.210 3.354 2.960 2.728 2.572 2.459 2.305 2.132 1.930 1.672
28 4.196 3.340 2.947 2.714 2.558 2.445 2.291 2.118 1.915 1.654
29 4.183 3.328 2.934 2.701 2.545 2.432 2.278 2.104 1.901 1.638
30 4.171 3.316 2.922 2.690 2.534 2.421 2.266 2.092 1.887 1.622
35 4.121 3.267 2.874 2.641 2.485 2.372 2.217 2.041 1.833 1.558
40 4.085 3.232 2.839 2.606 2.449 2.336 2.180 2.003 1.793 1.509
45 4.057 3.204 2.812 2.579 2.422 2.308 2.152 1.974 1.762 1.470
50 4.034 3.183 2.790 2.557 2.400 2.286 2.130 1.952 1.737 1.438
55 4.016 3.165 2.773 2.540 2.383 2.269 2.112 1.933 1.717 1.412
60 4.001 3.150 2.758 2.525 2.368 2.254 2.097 1.917 1.700 1.389
70 3.978 3.128 2.736 2.503 2.346 2.231 2.074 1.893 1.674 1.353
80 3.960 3.111 2.719 2.486 2.329 2.214 2.056 1.875 1.654 1.325
90 3.947 3.098 2.706 2.473 2.316 2.201 2.043 1.861 1.639 1.302
100 3.936 3.087 2.696 2.463 2.305 2.191 2.032 1.850 1.627 1.283
120 3.920 3.072 2.680 2.447 2.290 2.175 2.016 1.834 1.608 1.254
150 3.904 3.056 2.665 2.432 2.274 2.160 2.001 1.817 1.590 1.223
250 3.879 3.032 2.641 2.408 2.250 2.135 1.976 1.791 1.561 1.166
500 3.860 3.014 2.623 2.390 2.232 2.117 1.957 1.772 1.539 1.113
∞ 3.841 2.996 2.605 2.372 2.214 2.099 1.938 1.752 1.517 1.000
T4 Quantiles $F_P$ of the Fisher–Snedecor distribution $F(k_1, k_2)$ for P = 0.99
k2 \ k1 1 2 3 4 5 6 8 12 24 ∞
1 4052.18 4999.34 5403.53 5624.26 5763.96 5858.95 5980.95 6106.68 6234.27 6365.59
2 98.502 99.000 99.164 99.251 99.302 99.331 99.375 99.419 99.455 99.499
3 34.116 30.816 29.457 28.710 28.237 27.911 27.489 27.052 26.597 26.125
4 21.198 18.000 16.694 15.977 15.522 15.207 14.799 14.374 13.929 13.463
5 16.258 13.274 12.060 11.392 10.967 10.672 10.289 9.888 9.466 9.020
6 13.745 10.925 9.780 9.148 8.746 8.466 8.102 7.718 7.313 6.880
7 12.246 9.547 8.451 7.847 7.460 7.191 6.840 6.469 6.074 5.650
8 11.259 8.649 7.591 7.006 6.632 6.371 6.029 5.667 5.279 4.859
9 10.562 8.022 6.992 6.422 6.057 5.802 5.467 5.111 4.729 4.311
10 10.044 7.559 6.552 5.994 5.636 5.386 5.057 4.706 4.327 3.909
11 9.646 7.206 6.217 5.668 5.316 5.069 4.744 4.397 4.021 3.602
12 9.330 6.927 5.953 5.412 5.064 4.821 4.499 4.155 3.780 3.361
13 9.074 6.701 5.739 5.205 4.862 4.620 4.302 3.960 3.587 3.165
14 8.862 6.515 5.564 5.035 4.695 4.456 4.140 3.800 3.427 3.004
15 8.683 6.359 5.417 4.893 4.556 4.318 4.004 3.666 3.294 2.868
16 8.531 6.226 5.292 4.773 4.437 4.202 3.890 3.553 3.181 2.753
17 8.400 6.112 5.185 4.669 4.336 4.101 3.791 3.455 3.083 2.653
18 8.285 6.013 5.092 4.579 4.248 4.015 3.705 3.371 2.999 2.566
19 8.185 5.926 5.010 4.500 4.171 3.939 3.631 3.297 2.925 2.489
20 8.096 5.849 4.938 4.431 4.103 3.871 3.564 3.231 2.859 2.421
21 8.017 5.780 4.874 4.369 4.042 3.812 3.506 3.173 2.801 2.360
22 7.945 5.719 4.817 4.313 3.988 3.758 3.453 3.121 2.749 2.305
23 7.881 5.664 4.765 4.264 3.939 3.710 3.406 3.074 2.702 2.256
24 7.823 5.614 4.718 4.218 3.895 3.667 3.363 3.032 2.659 2.211
25 7.770 5.568 4.675 4.177 3.855 3.627 3.324 2.993 2.620 2.169
26 7.721 5.526 4.637 4.140 3.818 3.591 3.288 2.958 2.585 2.131
27 7.677 5.488 4.601 4.106 3.785 3.558 3.256 2.926 2.552 2.097
28 7.636 5.453 4.568 4.074 3.754 3.528 3.226 2.896 2.522 2.064
29 7.598 5.420 4.538 4.045 3.725 3.499 3.198 2.868 2.495 2.034
30 7.562 5.390 4.510 4.018 3.699 3.473 3.173 2.843 2.469 2.006
35 7.419 5.268 4.396 3.908 3.592 3.368 3.069 2.740 2.364 1.891
40 7.314 5.178 4.313 3.828 3.514 3.291 2.993 2.665 2.288 1.805
45 7.234 5.110 4.249 3.767 3.454 3.232 2.935 2.608 2.230 1.737
50 7.171 5.057 4.199 3.720 3.408 3.186 2.890 2.563 2.183 1.683
55 7.119 5.013 4.159 3.681 3.370 3.149 2.853 2.526 2.146 1.638
60 7.077 4.977 4.126 3.649 3.339 3.119 2.823 2.496 2.115 1.601
70 7.011 4.922 4.074 3.600 3.291 3.071 2.777 2.450 2.067 1.540
80 6.963 4.881 4.036 3.563 3.255 3.036 2.742 2.415 2.032 1.494
90 6.925 4.849 4.007 3.535 3.228 3.009 2.715 2.389 2.004 1.457
100 6.895 4.824 3.984 3.513 3.206 2.988 2.694 2.368 1.983 1.427
120 6.851 4.787 3.949 3.480 3.174 2.956 2.663 2.336 1.950 1.381
150 6.807 4.749 3.915 3.447 3.142 2.924 2.632 2.305 1.918 1.331
250 6.737 4.691 3.861 3.395 3.091 2.875 2.583 2.256 1.867 1.244
500 6.686 4.648 3.821 3.357 3.054 2.838 2.547 2.220 1.829 1.164
∞ 6.635 4.605 3.782 3.319 3.017 2.802 2.511 2.185 1.791 1.000