Statistics Chapter 1

This document serves as a quick reference guide for introductory statistical methods, covering key concepts such as the definitions of statistics, basic statistical terms, types of data, and methods of data collection. It also explains descriptive and inferential statistics, frequency distributions, measures of central tendency, and methods for calculating the arithmetic mean. Additionally, it includes examples and formulas to illustrate the application of these statistical concepts.

Uploaded by

merhawitareke27
Copyright
© All Rights Reserved

Quick reference in statistics

Review of introductory statistical methods


Introduction
Meaning of statistics
The word statistics means different things to different people, depending on how they use it. In
general, its meanings fall into two categories. Statistics in
its plural sense refers to numerical facts, figures, or statistical data.
Statistics in its singular sense is a branch of mathematics or applied
research concerned with the development and application of methods and
techniques for collecting, organizing, presenting, analyzing, and interpreting
quantitative data in such a way that the reliability of conclusions based on the data may
be evaluated objectively in terms of probability statements.

Basic terms in statistics


Some of the basic terms in statistics are:
Experimental unit: an object (e.g., person, thing, transaction, event) upon which we
collect data.
Data: a set of related observations or facts collected to draw conclusions from.
Quantitative data: measurements that are recorded on a naturally occurring numerical
scale.
Qualitative data: data that cannot be measured on a natural numerical scale; they
can only be classified into one of a group of categories.
Discrete data: data obtained by counting; these data assume only
whole numbers. Example: the number of students in the computer science department.
Continuous data: data obtained by measurement, which can assume
decimal values. Example: age, height, …
Population: the complete collection of individuals, objects, or measurements that have
a characteristic in common in a given investigation.
Variable: a characteristic or property of an individual population unit. We measure a
variable for members of the population. If we measure a variable for every member of
the population, that is a census.
Census: the process of collecting data or observations covering all the units in the
population.
Sample: a subset of the units of the population.
Sample survey: the process of collecting data covering a representative part of a population.
Statistic: a measure obtained from a sample.
Parameter: a measure obtained from a population.
Discrete variable: a variable that assumes integral values only.
Continuous variable: a variable that can assume any value between two given
values.
Constant variable: a variable that assumes only one value.
Sample frame: a list of the entire population from which the sample is drawn.
Attribute data: data yielded from qualitative information.
Variable data: data yielded from quantitative information.

Mekelle University 1

Basic types in statistical analysis


Descriptive statistics: uses numerical and graphical methods to look for patterns in a
data set, to summarize the information revealed in the data set, and to present that information in a
convenient form. It has four elements: the population or sample of interest; one or more variables to
be investigated; tables, graphs, or numerical summary tools; and identification of patterns in the
data.
Inferential statistics: uses sample data to make estimates, decisions, predictions, or
other generalizations about a larger set of data. It has five elements: the population of
interest; one or more variables to be investigated; the sample of population units; the inference
about the population based on information contained in the sample; and a measure of variability for the
inference (a statement about the degree of uncertainty associated with a statistical
inference).

We generally gather data in four different ways:


1. Data from a published source. E.g., studying the distribution of electrical power using
data from the Federal Energy Regulatory Commission.
2. Data from a designed experiment. E.g., testing which treatments do a better job of controlling or
curing leukemia than others.
3. Data from a survey.
4. Data from an observational study. E.g., observing the grades of males and females in a
statistics class in order to compare them, and concluding that there is no important
pattern of difference between the groups.

Frequency distributions
A frequency distribution is a tabular arrangement of data by classes together with the
corresponding frequencies. It may be a discrete frequency distribution or a continuous frequency
distribution.

Example: Table I. Age of 68 students in a class

Age (classes)    No. of students (frequency)
15 – 19          10
20 – 24          20
25 – 29          17
30 – 34          13
35 – 39          8
Total            68

Basic terms used in frequency distribution


1. Raw data are collected data that have not been organized.
2. Array: an arrangement of numerical data in ascending or descending order.
3. Range: the difference between the largest and smallest numbers in the data.
4. Classes (class intervals) are intervals into which data can be grouped, e.g. 15 – 19
in Table I above is a class interval.
5. Class limits: the end numbers of a class. E.g. 15 is the lower
class limit (L.C.L) and 19 the upper class limit (U.C.L) of the first class in the above table.
6. Class boundaries (true class limits) are obtained as follows:

l.c.b = (U.C.L of previous class + L.C.L of current class) / 2

u.c.b = (U.C.L of current class + L.C.L of next class) / 2

7. Class midpoint (class mark) is the midpoint of a class interval, e.g.
(15 + 19)/2 = 17 is the class mark of the 1st class in the above table.
8. Class width (class size or class strength): the difference between the upper and
lower class boundaries of a class, or the difference between successive lower class limits,
successive upper class limits, or successive class marks.
Steps in creating a continuous frequency distribution table
1. Determine the largest and smallest numbers in the raw data
2. Calculate the Range
3. Determine the number of classes you want to have, usually the number of classes
ranges from 5 to 20.
4. Divide Range by the number of classes to get class width
5. Identify the lower class limit of the first class (make sure that the minimum value
is included in the first class and the maximum value in the last class).
6. Identify the remaining lower class limits by adding class width
7. Write all upper class limits
8. Start tallying.
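
The steps above can be sketched in Python. This is only an illustration under stated assumptions: the data are whole numbers, class limits are inclusive, the function name frequency_table and its first_lower argument are made up for this sketch, and the short salary list is hypothetical sample data rather than the full data set of the example that follows.

```python
import math

def frequency_table(data, num_classes, first_lower=None):
    """Tally whole-number data into equal-width classes (hypothetical helper)."""
    smallest, largest = min(data), max(data)           # step 1
    data_range = largest - smallest                    # step 2
    width = math.ceil(data_range / num_classes)        # steps 3-4, rounded up
    lower = smallest if first_lower is None else first_lower  # step 5
    # Widen the classes if the last class would not reach the maximum value.
    while lower + width * num_classes - 1 < largest:
        width += 1
    table = []
    for k in range(num_classes):                       # steps 6-7
        lo = lower + k * width
        hi = lo + width - 1
        count = sum(1 for x in data if lo <= x <= hi)  # step 8: tally
        table.append((lo, hi, count))
    return table

# Hypothetical sample of weekly salaries (not the original 100 values).
salaries = [70, 61, 64, 74, 65, 72, 72, 66, 67, 68, 69, 63]
for lo, hi, count in frequency_table(salaries, 5, first_lower=60):
    print(f"{lo} - {hi}: {count}")
```

Choosing first_lower=60 mirrors the choice of a convenient first lower class limit in step 5; the caller is assumed to pick a value no larger than the minimum of the data.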

Example Create a continuous frequency distribution for the following data of 100
employees and their weekly salaries.
70 61 64 ……….74………65………..72…72…

1. Max = 74 min = 61
2. Range = 74 – 61 = 13
3. Make the number of classes to be 5
4. Class width = Range/5 = 13/5 = 2.6 ≈ 3

Table II
Salary Frequency
60 – 62 5
63 – 65 18
66 – 68 42
69 – 71 27
72 - 74 8

Relative frequency distribution


The relative frequency of a class is frequency of a class divided by total frequency of all
classes and is generally expressed as percentage. The sum of the relative frequencies of
all classes is 1 or 100%.


Cumulative Frequency Distributions


Less than cumulative frequency distribution
Each class is assigned the total frequency of all values less than the upper class boundary of
that class.
Example: the less than cumulative frequency of the 3rd class in the above table is 42 + 18 + 5 =
65, so 65 of the employees earn a salary of less than 68.5.

Or more cumulative frequency distribution

Each class is assigned the total frequency of all values greater than or equal to the lower class boundary (L.C.B) of the class.
Example: the or more cumulative frequency of the 4th class is 27 + 8 = 35, so 35 of the employees earn a salary
greater than or equal to 68.5.

Salary     Frequency   Relative frequency     Less than C.F   Or more C.F
60 – 62    5           5/100 = 0.05 = 5%      5               100
63 – 65    18          18%                    23              95
66 – 68    42          42%                    65              77
69 – 71    27          27%                    92              35
72 – 74    8           8%                     100             8
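
As a rough check of the table above, the relative and cumulative columns can be recomputed from the class frequencies. The variable names are chosen for this sketch only.

```python
from itertools import accumulate

freqs = [5, 18, 42, 27, 8]            # class frequencies from the salary table
total = sum(freqs)

relative = [f / total for f in freqs]                  # relative frequencies
less_than = list(accumulate(freqs))                    # running total from the first class
or_more = list(accumulate(reversed(freqs)))[::-1]      # running total from the last class

print(relative)
print(less_than)
print(or_more)
```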

Measures of central tendency


An average is a value that is typical or representative
of a set of data. There are many types of averages, each
possessing particular properties and each being typical in some unique
way. The most common ones are:
1. Computed averages
- The arithmetic mean
2. Positional averages
- The median
3. The mode

The summation notation


Let the symbol $x_i$ (read as "x sub i") denote any of the values $x_1, x_2, x_3, \dots, x_n$
assumed by the variable x (i is called a subscript or index).
Then the sum of the n numbers can be represented using the Greek
capital letter $\Sigma$ (sigma) as

$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$

Example 1: find the sum of the numbers

4 2 6 -5 8
Let $x_i$ = the value of the ith number; then
$x_1 = 4$, $x_2 = 2$, $x_3 = 6$, $x_4 = -5$, $x_5 = 8$

$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5$

= 4 + 2 + 6 + (-5) + 8
= 15


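Example 1: given $x_1 = 4$, $x_2 = 6$, $x_3 = -5$, $x_4 = 8$ and
$y_1 = -3$, $y_2 = -10$, $y_3 = 2$, $y_4 = 8$,
then calculate

$\sum_{i=1}^{4} x_i = x_1 + x_2 + x_3 + x_4 = 4 + 6 + (-5) + 8 = 13$

$\sum_{i=1}^{4} y_i = -3 - 10 + 2 + 8 = -3$

$\sum_{i=1}^{4} (x_i + y_i) = (4 + (-3)) + (6 + (-10)) + (-5 + 2) + (8 + 8)$

= 1 + (-4) + (-3) + 16
= 10
= $\sum_{i=1}^{4} x_i + \sum_{i=1}^{4} y_i$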
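
The sums in this example can be checked with a few lines of Python. The y values used here are the ones that appear in the worked arithmetic (-3, -10, 2, 8); the variable names are chosen for this sketch.

```python
x = [4, 6, -5, 8]
y = [-3, -10, 2, 8]   # values as used in the worked arithmetic

sum_x = sum(x)                                     # sum of the x values
sum_y = sum(y)                                     # sum of the y values
sum_x_plus_y = sum(a + b for a, b in zip(x, y))    # sum of (x_i + y_i)

# The sum of (x_i + y_i) equals the sum of x_i plus the sum of y_i.
print(sum_x, sum_y, sum_x_plus_y)
```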

Properties of the summation notation

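1. $\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i$

2.

3. $\sum_{i=1}^{n} c x_i = c \sum_{i=1}^{n} x_i$, where c is a constant number

4. $\sum_{i=1}^{n} c = nc$, where c is a constant number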

5.


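Example 2: given $\sum_{j=1}^{4} x_j = 7$, $\sum_{j=1}^{4} y_j = -3$, $\sum_{j=1}^{4} x_j y_j = 5$,

find

a. $\sum_{j=1}^{4} (2x_j + 5y_j)$

= $\sum 2x_j + \sum 5y_j$ . . . property 1
= $2\sum x_j + 5\sum y_j$ . . . property 3

= 2(7) + 5(-3) = 14 – 15 = -1

b. $\sum_{j=1}^{4} (x_j - 3)(2y_j + 1)$

= $\sum (2x_j y_j + x_j - 6y_j - 3)$

= $\sum 2x_j y_j + \sum x_j - \sum 6y_j - \sum 3$ . . . property 1

= $2\sum x_j y_j + \sum x_j - 6\sum y_j - 4 \times 3$ . . . properties 3 and 4

= 2(5) + 7 - 6(-3) - 12
= 10 + 7 + 18 - 12
= 23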

Example 2 = 4 and = 10

Find a. (2xj +3) b. (xj -1) c. (xj – 5)2

Arithmetic mean
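The arithmetic mean, or the mean, of a set of n numbers $x_1, x_2, x_3, \dots, x_n$
is denoted by $\bar{x}$ and is defined as follows.
1. Simple mean

$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$ . . . Formula I

Example 3: find the arithmetic mean of the numbers 9, 2, 5, 12, 8.

$\bar{x} = \frac{9 + 2 + 5 + 12 + 8}{5} = \frac{36}{5} = 7.2$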

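If the numbers $x_1, x_2, \dots, x_k$ occur $f_1, f_2, \dots, f_k$ times respectively (that is,
occur with frequencies $f_1, f_2, \dots, f_k$), the arithmetic mean is

$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_k x_k}{f_1 + f_2 + \dots + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{N}$ . . . Formula II

where $N = \sum_{i=1}^{k} f_i$ = total frequency.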

Example 4: a set of numbers consists of six 6's, seven 7's,
eight 8's, and nine 9's. What is the arithmetic mean of the numbers?
Table 1.
x_i        6    7    8    9    Total
f_i        6    7    8    9    30
x_i f_i    36   49   64   81   230

$\bar{x} = \frac{\sum f_i x_i}{N} = \frac{230}{30} = 7.67$

II weighted mean
If $w_1, w_2, \dots, w_n$ are the weights of the values $x_1, x_2, x_3, \dots, x_n$
respectively, then

$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$ . . . Formula III

Example 5: if the final examination in a course is weighted three times
as much as a quiz, and a student has a final examination grade of 85
and quiz grades of 70 and 90, then the mean grade is
$\bar{x}_w = \frac{3(85) + 1(70) + 1(90)}{3 + 1 + 1} = \frac{415}{5} = 83$
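
A minimal sketch of the weighted mean, using the grades and weights from Example 5. The function name weighted_mean is chosen for this illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

grades = [85, 70, 90]     # final exam, quiz 1, quiz 2
weights = [3, 1, 1]       # the final exam counts three times as much as a quiz
print(weighted_mean(grades, weights))
```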
Note: Formula II can be applied to calculate the arithmetic mean of a
continuous frequency distribution (grouped data) by taking the class
marks as the $x_i$'s.
Example 6: calculate the mean salary of the employees in the
following table.

Salary     Frequency (f_i)   Class mark (x_i)   f_i x_i
60 – 62    5                 61                 305
63 – 65    18                64                 1152
66 – 68    42                67                 2814
69 – 71    27                70                 1890
72 – 74    8                 73                 584
Total      100                                  6745

$\bar{x} = \frac{\sum f_i x_i}{N} = \frac{6745}{100} = 67.45$
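
The grouped-data mean can be sketched as follows, taking the class mark of each class as its representative value. The function name grouped_mean and the tuple layout (lower limit, upper limit, frequency) are assumptions of this sketch.

```python
def grouped_mean(classes):
    """Mean of grouped data; each class is (lower_limit, upper_limit, frequency)."""
    total = sum(f for _, _, f in classes)
    # The class mark (midpoint) stands in for every value in the class.
    weighted_sum = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return weighted_sum / total

salary_classes = [(60, 62, 5), (63, 65, 18), (66, 68, 42), (69, 71, 27), (72, 74, 8)]
print(grouped_mean(salary_classes))
```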

Shortcut methods of computing mean

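1. $\bar{x} = A + \frac{\sum d_i}{n}$ or $\bar{x} = A + \frac{\sum f_i d_i}{N}$, where $d_i = x_i - A$

Where A = assumed mean (preferably the value of $x_i$ at the middle).

2. Coding method (transformation method) is an alternative
method of computing the mean of a continuous frequency distribution
and is given by

$\bar{x} = A + \left(\frac{\sum f_i u_i}{N}\right)c$, where $u_i = d_i/c$, A = assumed mean, c = class width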

Example 7: compute the arithmetic mean of the following data
using the shortcut method.
20020, 20005, 20008, 19992, 19987
Solution: assumed mean A = 20000

d1 = 20020 – 20000 = 20
d2 = 20005 – 20000 = 5
d3 = 20008 – 20000 = 8
d4 = 19992 – 20000 = -8
d5 = 19987 – 20000 = -13

$\sum d_i = 20 + 5 + 8 - 8 - 13 = 12$

$\bar{x} = A + \frac{\sum d_i}{n} = 20000 + \frac{12}{5} = 20000 + 2.4 = 20002.4$

Example 8: calculate the arithmetic mean of Table II using the shortcut
method and the coding method, with A = 67 and c = 3.

Salary   x_i   f_i   d_i = x_i – A   u_i = d_i/c   f_i d_i   f_i u_i
60-62    61    5     -6              -2            -30       -10
63-65    64    18    -3              -1            -54       -18
66-68    67    42    0               0             0         0
69-71    70    27    3               1             81        27
72-74    73    8     6               2             48        16
Total          100                                 45        15

i. Shortcut method

$\bar{x} = A + \frac{\sum f_i d_i}{N} = 67 + \frac{45}{100} = 67 + 0.45 = 67.45$

ii. Coding method

$\bar{x} = A + \left(\frac{\sum f_i u_i}{N}\right)c = 67 + \left(\frac{15}{100}\right)3 = 67 + (0.15)3 = 67 + 0.45 = 67.45$
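
Both shortcut computations can be reproduced in a short Python sketch, using the class marks and frequencies of the salary table with assumed mean A = 67 and class width c = 3. The variable names are chosen for this illustration.

```python
marks = [61, 64, 67, 70, 73]      # class marks x_i
freqs = [5, 18, 42, 27, 8]        # frequencies f_i
A, c = 67, 3                      # assumed mean and class width
N = sum(freqs)

d = [x - A for x in marks]        # deviations from the assumed mean
u = [di / c for di in d]          # coded deviations

mean_shortcut = A + sum(f * di for f, di in zip(freqs, d)) / N
mean_coding = A + (sum(f * ui for f, ui in zip(freqs, u)) / N) * c
print(mean_shortcut, mean_coding)
```

Both routes should agree with the direct grouped mean, since subtracting A and dividing by c only shifts and rescales the values.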

Properties of arithmetic mean


a. The sum of the deviations of a set of numbers from their mean is zero.
That is, $\sum (x_i - \bar{x}) = \sum x_i - \sum \bar{x}$
$= \sum x_i - n\bar{x} = \sum x_i - n \cdot \frac{\sum x_i}{n}$
$= \sum x_i - \sum x_i = 0$
b. If $n_1$ numbers have mean $m_1$, $n_2$ numbers have mean $m_2$, …, $n_k$
numbers have mean $m_k$, then the mean of all the numbers is

$\bar{x} = \frac{n_1 m_1 + n_2 m_2 + \dots + n_k m_k}{n_1 + n_2 + \dots + n_k}$

Example: if the mean scores of three classes are 79, 74, and 82
with sizes 32, 25, and 17 respectively, then the mean score of
all the students is

$\bar{x} = \frac{32(79) + 25(74) + 17(82)}{32 + 25 + 17} = \frac{5772}{74} = 78$

c. If each value in a distribution with mean $\bar{x}$ is increased by a constant number C,
then $\bar{x} + C$ will be the mean of the new distribution.

d. If each value is multiplied by a constant number C, then $C\bar{x}$ will be the mean of the
new data.

Advantages of the arithmetic mean

- It always exists (it can be calculated for any set of numerical data).
- It is unique (any set of numerical data has one mean).
- It is easy to understand and compute.
- It makes use of all values in the data.
- It is stable (means of different samples from the same population do not fluctuate
considerably).

Disadvantages of the arithmetic mean

- It is affected by extreme values in the data.
- It cannot be calculated for data that are not quantifiable.
- It cannot be calculated for continuous grouped data with open-ended classes.

The median
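Definition: the median of a set of n numbers arranged in an array is the middle value or
the arithmetic mean of the two middle values, i.e.

$\tilde{X}$ = the $\left(\frac{n+1}{2}\right)$th observation, if n is odd

$\tilde{X}$ = the mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th observations, if n is even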

Example a. Find the median of the values 8, 2, 4, 22, 23, 15, 17.

Soln. Array: 2, 4, 8, 15, 17, 22, 23; n = 7 is odd.

$\tilde{X}$ = the $\left(\frac{7+1}{2}\right)$th observation
= 4th observation = 15

b. Consider 18, 29, 31, 32, 27, 24, 22, 25.

Soln. Array: 18, 22, 24, 25, 27, 29, 31, 32

n = 8 is even.

$\tilde{X}$ = (4th observation + 5th observation)/2 = (25 + 27)/2 = 52/2 = 26

For a grouped continuous distribution, the median is given by

$\tilde{X} = L + \left(\frac{N/2 - f_b}{f_m}\right)C$

where L = lower class boundary of the median class
N = number of observations
$f_b$ = sum of all the frequencies below the median class
$f_m$ = frequency of the median class
C = class width of the median class.

Example: find the median of the following data.

Salary   f_i
60-62    5
63-65    18
66-68    42
69-71    27
72-74    8

N/2 = 100/2 = 50; therefore the median class is 66-68,
because the 50th observation is found in that class.

$\tilde{X} = 65.5 + \left(\frac{50 - 23}{42}\right)3 = 65.5 + \left(\frac{27}{42}\right)3$

$\tilde{X}$ = 65.5 + (0.64)3 = 65.5 + 1.93 = 67.43


Advantages and disadvantages of median

a. Advantages of the median

- It always exists.
- It is not affected by extreme values.
- It is unique.
- It can be computed for a distribution with open-ended classes.
- It can be computed for ratio, interval, and ordinal data.
b. Disadvantages
- It does not take each and every value into consideration.
- It requires arranging the data in order.
- It does not allow algebraic manipulation (e.g. it is not possible to calculate the combined
median of two or more groups).

The Mode ($\hat{X}$)
Definition: the mode(s) of n values is the value with the highest frequency (it is the most
frequent value).
E.g. what is the mode of the values 8, 3, 2, 3, 4, 7, 3?
Soln. 8, 3, 3, 3, 2, 4, 7
=> the mode is 3 because it is the most frequent value.

Note: i. A distribution with exactly one mode is called unimodal.
ii. A distribution with exactly two modes is called bimodal.
iii. A distribution with many modes is called multimodal.
iv. A distribution in which all observations have identical (equal) frequencies is
said to have no mode.
Example.
X_i   f_i
40    3
45    2
56    7
72    6
83    7
The modes are 56 and 83, so the distribution is bimodal.

For a continuous frequency distribution, the mode is computed by

$\hat{X} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)C$

where L = lower class boundary of the modal class
$\Delta_1 = f_m - f_l$, $\Delta_2 = f_m - f_h$, C = class width
$f_m$ = frequency of the modal class (the class with the highest frequency)
$f_l$ = frequency of the next lower class to the modal class
$f_h$ = frequency of the next higher class to the modal class.

Example: compute the mode of the following continuous frequency distribution (C.F.D.).

Classes   f_i
42-50     3
51-59     4
60-68     4    Next lower class (f_l)
69-77     9    Modal class (f_m)
78-86     2    Next higher class (f_h)
87-95     3

$\hat{X} = 68.5 + \left(\frac{9 - 4}{(9 - 4) + (9 - 2)}\right)9 = 68.5 + \left(\frac{5}{12}\right)9 = 68.5 + 3.75 = 72.25$
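
A sketch of the grouped mode computation for this table. It assumes whole-number class limits with gaps of 1 (so the lower class boundary is the lower limit minus 0.5), a single modal class, and a neighbouring frequency of 0 when the modal class is the first or last class; the function name grouped_mode is chosen for this illustration.

```python
def grouped_mode(classes):
    """Mode of grouped data; each class is (lower_limit, upper_limit, frequency)."""
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))                       # index of the modal class
    lo, hi, fm = classes[m]
    fl = freqs[m - 1] if m > 0 else 0                 # next lower class (assumed 0 if absent)
    fh = freqs[m + 1] if m < len(freqs) - 1 else 0    # next higher class (assumed 0 if absent)
    d1, d2 = fm - fl, fm - fh
    boundary = lo - 0.5                               # lower class boundary (assumes gaps of 1)
    width = hi - lo + 1                               # class width
    return boundary + d1 / (d1 + d2) * width

classes = [(42, 50, 3), (51, 59, 4), (60, 68, 4), (69, 77, 9), (78, 86, 2), (87, 95, 3)]
print(grouped_mode(classes))
```

The sketch does not handle a distribution in which the modal class and both neighbours share the same frequency, because the denominator is then zero.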

Advantages of the mode

- It is easily identifiable.
- It can be applied to qualitative data.
- It is not affected by extreme values.
Disadvantages of the mode
- It may not exist.
- It may not be unique.
- It does not make use of every value in the data.

Measures of Variation

Variation or dispersion is the degree to which numerical data are scattered or
spread about some measure of central tendency (usually the mean).
E.g. consider the following two sets of data.
i. 6, 18, 30 and ii. 17, 18, 19

$\bar{x}_{i} = \frac{6 + 18 + 30}{3} = 18$ and $\bar{x}_{ii} = \frac{17 + 18 + 19}{3} = 18$

Observation: even though the two sets of data have the same arithmetic mean, the
values in i are more scattered or dispersed than those in ii.


Absolute and relative measures of dispersion.

Absolute measures of dispersion

1. The Range (R) = maximum value – minimum value

For grouped data, Range = U.C.B of last class – L.C.B of first class.

4. Mean deviation (average deviation), M.D, measures the scatter of individual observations
around a central value, usually the mean or median.

$M.D = \frac{\sum |x_i - \bar{x}|}{n}$ . . . ungrouped data

Or

$M.D = \frac{\sum f_i |x_i - \bar{x}|}{N}$ . . . grouped data, where $N = \sum f_i$

5. Standard deviation (S)

$S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$ . . . ungrouped data

$S = \sqrt{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2}$ . . . short method (ungrouped data)

Or

$S = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{N}}$ . . . for a frequency distribution

$S = C\sqrt{\frac{\sum f_i u_i^2}{N} - \left(\frac{\sum f_i u_i}{N}\right)^2}$ . . . coding method (frequency distribution)

where C = class width, $N = \sum f_i$

6. Variance (V). Variance is equal to the square of S.

$V = S^2$

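Example 1: compute the Range, MD, standard deviation, and variance of the following data.
2, 3, 6, 8, 11
Solution. i. Range (R) = Max – Min = 11 - 2 = 9

ii. $M.D = \frac{\sum |x_i - \bar{x}|}{n}$, but $\bar{x} = \frac{\sum x_i}{n} = \frac{2 + 3 + 6 + 8 + 11}{5} = \frac{30}{5} = 6$

Then $M.D = \frac{|2-6| + |3-6| + |6-6| + |8-6| + |11-6|}{5} = \frac{4 + 3 + 0 + 2 + 5}{5} = \frac{14}{5} = 2.8$

iv. $S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{16 + 9 + 0 + 4 + 25}{5}}$

$= \sqrt{\frac{54}{5}} = \sqrt{10.8} = 3.29$

v. $V = S^2 = 10.8$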
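
The four measures in this example can be recomputed directly. The sketch divides by n (not n - 1), which matches the arithmetic of the worked solution; the variable names are chosen for this illustration.

```python
import math

data = [2, 3, 6, 8, 11]
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)                       # R = max - min
mean_dev = sum(abs(x - mean) for x in data) / n          # M.D about the mean
variance = sum((x - mean) ** 2 for x in data) / n        # V, dividing by n
std_dev = math.sqrt(variance)                            # S

print(data_range, mean_dev, variance, round(std_dev, 2))
```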

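Example 2: consider the table below and compute the Range, MD, S, and V (A = 14, C = 5).

Classes    f_i   x_i   A    f_i|x_i – x̄|   d_i = x_i – A   u_i   u_i^2   f_i x_i   f_i u_i   f_i u_i^2
2 – 6      1     4     14   9              -10             -2    4       4         -2        4
7 – 11     4     9     14   16             -5              -1    1       36        -4        4
12 – 16    2     14    14   2              0               0     0       28        0         0
17 – 21    2     19    14   12             5               1     1       38        2         2
22 – 26    1     24    14   11             10              2     4       24        2         4
Total      10               50                                           130       -2        14

Solution: $\bar{x} = \frac{\sum f_i x_i}{N} = \frac{130}{10} = 13$

i. Range (R) = u.c.b of last class – l.c.b of first class
= 26.5 – 1.5 = 25

ii. $MD = \frac{\sum f_i |x_i - \bar{x}|}{N} = \frac{50}{10} = 5$ is the mean deviation

iii. $S = C\sqrt{\frac{\sum f_i u_i^2}{N} - \left(\frac{\sum f_i u_i}{N}\right)^2} = 5\sqrt{\frac{14}{10} - \left(\frac{-2}{10}\right)^2} = 5\sqrt{1.4 - 0.04} = 5\sqrt{1.36} = 5.83$

iv. $V = S^2 = (5.83)^2 = 34$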

Properties of the standard deviation


a. The standard deviation of any distribution is non-negative (S ≥ 0). S = 0
<=> the distribution contains identical values.
b. If all values are increased (decreased) by a constant number C, the
standard deviation is not affected.
c. If all values are multiplied by a constant number C, then the new standard
deviation is equal to |C| times the old standard deviation.
d. Generally, the greater the standard deviation, the more dispersed (scattered)
the data are.
e. For two sets of data with $N_1$ and $N_2$ observations and variances
$S_1^2$ and $S_2^2$ respectively, with the same mean, the combined variance is
given by

$S^2 = \frac{N_1 S_1^2 + N_2 S_2^2}{N_1 + N_2}$

Relative measures of dispersion


These are used to measure the size of the absolute measures relative to some measure of
central tendency, usually the mean.
These measures are:
1. Relative range (R.R)

$R.R = \frac{R}{\bar{x}}$

2. Coefficient of variation (C.V)

$C.V = \frac{S}{\bar{x}}$

3. Coefficient of mean deviation (C.M.D)

$C.M.D = \frac{M.D}{\bar{x}}$

Example: refer to the results obtained in the above table and compute the relative
variations.
Solution

1. $R.R = \frac{R}{\bar{x}} = \frac{25}{13} = 1.923 = 192.3\%$

2. $C.V = \frac{S}{\bar{x}} = \frac{5.83}{13} = 0.4485 = 44.85\%$

3. $C.M.D = \frac{M.D}{\bar{x}} = \frac{5}{13} = 0.3846 = 38.46\%$

Standard scores (Z-Scores)


Definition: the standard score Z of an observation (value) X of a distribution is given by
$Z = \frac{X - \bar{X}}{S}$
The value of Z gives the number of standard deviations by which X lies above or below the mean.
Example: suppose the mean and standard deviation of a distribution are given by $\bar{X} = 60$
and S = 10.
a. Compute the Z-scores of the values
i. 70 ii. 40 iii. 30 iv. 95

Solution
i. $Z = \frac{70 - 60}{10} = \frac{10}{10} = 1$ (above the mean)

ii. $Z = \frac{40 - 60}{10} = \frac{-20}{10} = -2$ (below the mean)

b. Compute the values of X with Z-scores
i. –1 ii. 4
Solution i. Z = -1 => $-1 = \frac{X - 60}{10}$ => -10 = X - 60
=> X = -10 + 60 = 50
ii. Z = 4 => $4 = \frac{X - 60}{10}$ => 40 = X - 60
=> X = 40 + 60 = 100
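
The two directions of the Z-score calculation can be sketched as a pair of small functions. The names z_score and value_from_z are chosen for this illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

def value_from_z(z, mean, std_dev):
    """Invert the Z-score formula: X = mean + Z * S."""
    return mean + z * std_dev

mean, std_dev = 60, 10
print(z_score(70, mean, std_dev), z_score(40, mean, std_dev))
print(value_from_z(-1, mean, std_dev), value_from_z(4, mean, std_dev))
```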

Uses of Z-scores
1. Make comparisons in performance
Example: suppose a student scores 90 in a statistics exam with mean 75 and standard
deviation 7.5. The same student scores 85 in English with mean 72 and standard
deviation 6. In which course has the student done better?
Solution

Z-score in statistics = $\frac{90 - 75}{7.5} = 2$; Z-score in English = $\frac{85 - 72}{6} \approx 2.17$

Therefore the student has done better in English.


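2. To convert a set of values $X_i$ with mean $\bar{X}$ and standard deviation $S_x$ into another set of
values $Y_i$ with mean $\bar{Y}$ and standard deviation $S_y$ having the same Z-scores:
$Z_i = \frac{X_i - \bar{X}}{S_x} = \frac{Y_i - \bar{Y}}{S_y}$

=> $Y_i = \bar{Y} + \frac{S_y}{S_x}(X_i - \bar{X})$
Example: convert X: 7, 9, 1, 11, 13 into another distribution Y with $\bar{Y} = 12$ and $S_y = 14$.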

Simple Linear Regression and Correlation


Introduction
Correlation applies to the statistical relationship between two variables.
Examples: the following are examples of correlated variables.
a. Height in cm (x) and weight in kg (y) of people.
b. Amount of rainfall (x) and yield of wheat (y).
c. Input (x) and output (y) of industry.
d. Income (x) and expenditure (y).
Generally: i. Linear regression assumes the existence of a relationship and aims at
making predictions about the second variable (y) from knowledge of the first variable
(usually x).
ii. Linear correlation deals with the relationship between the two variables and aims to
answer the following questions.
- Does there exist a straight-line relationship between the two variables?
- If so, how "strong" is that relationship?
Methods of fitting linear regression
1. Scatter diagrams
Steps i. Gather a random sample (X1, Y1), (X2, Y2), …, (Xn, Yn).
ii. Plot the data on the X-Y plane to obtain a scatter diagram.
iii. Fit the straight line that best fits the data.
2. The least squares method of fitting a regression line
Suppose the equation of the line L (called the regression line of Y on X) is given by
$Y' = a + bX$, $a, b \in \mathbb{R}$. Let $d_i = Y_i - Y_i'$.

The least squares method requires that $\sum d_i^2$ is a minimum

=> $\sum (Y_i - Y_i')^2$ is a minimum

=> $\sum (Y_i - a - bX_i)^2$ is a minimum

Applying calculus gives the following normal equations:

$\sum Y = na + b\sum X$
$\sum XY = a\sum X + b\sum X^2$

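Therefore $b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$

and $a = \bar{Y} - b\bar{X}$
Therefore $Y' = \bar{Y} - b(\bar{X} - X)$ is the equation of the regression line of Y on X.
If we take X to be dependent and Y independent, we obtain the regression equation of X
on Y, $X' = a_0 + b_0 Y$, where $a_0$ and $b_0$ are obtained as follows:

$b_0 = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - (\sum Y)^2}$, $a_0 = \bar{X} - b_0\bar{Y}$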

Example
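Let X = number of hours which some students studied for an exam
Y = grades scored out of 100.
The following information is obtained:
n = 10 students, $\sum X = 95$, $\sum X^2 = 1121$, $\sum Y = 652$, $\sum XY = 6996$

1. Compute i. the regression coefficient
ii. the value of "a"
2. Determine the equation of the regression line of Y on X.
3. Estimate the grades scored for X = 9 hrs and X = 10 hrs.
Solution

1. i. $b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{10(6996) - 95(652)}{10(1121) - 95^2} = \frac{69960 - 61940}{11210 - 9025}$

$b = \frac{8020}{2185} = 3.67$

ii. $a = \bar{Y} - b\bar{X} = 65.2 - 3.67(9.5) = 30.33$

2. Y' = 65.2 – 3.67(9.5 – X) = 30.33 + 3.67X

3. X = 9 => Y' = 30.33 + 3.67(9) = 63.36
X = 10 => Y' = 30.33 + 3.67(10) = 67.03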
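
The same fit can be reproduced from the summary totals alone. This is a sketch under the assumption that the four totals given are ΣX = 95, ΣX² = 1121, ΣY = 652, and ΣXY = 6996; the function name regression_from_sums is chosen for this illustration.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Least squares intercept a and slope b of Y on X from summary totals."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)      # a = mean(Y) - b * mean(X)
    return a, b

a, b = regression_from_sums(n=10, sum_x=95, sum_y=652, sum_x2=1121, sum_xy=6996)
for hours in (9, 10):
    print(hours, round(a + b * hours, 2))
```

Because the code keeps b unrounded, its predictions can differ from the hand calculation in the last decimal place.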

Pearsonian Coefficient of Correlation

The value of r gives a measure of the strength of the relationship between
two variables, provided that the relationship is linear. It is given by

$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$

Note that $-1 \le r \le 1$.
Analysis: i. $r \approx 1$ => there is a strong positive linear relationship.
ii. $r \approx -1$ => there is a strong negative linear relationship.
iii. $r \approx 0$ => there is almost no linear relationship.

Exercise: consider the data below.

        X    Y    X^2   Y^2   XY
        1    2    1     4     2
        3    4    9     16    12
        4    4    16    16    16
        5    8    25    64    40
        7    12   49    144   84
Total   20   30   100   244   154
Questions
1. Determine the equation of the regression line of Y on X
2. Compute the value of “ r “ and interpret

Answer

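1. $\bar{X} = \frac{20}{5} = 4$, $\bar{Y} = \frac{30}{5} = 6$.

$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{5(154) - 20(30)}{5(100) - 20^2}$

$= \frac{770 - 600}{500 - 400} = \frac{170}{100} = 1.7$

Y' = 6 - 1.7(4 – X) = 6 – 6.8 + 1.7X = 1.7X – 0.8

2. $r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{170}{\sqrt{(100)(5(244) - 30^2)}} = \frac{170}{\sqrt{(100)(320)}} = \frac{170}{178.89} = 0.95$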

Interpretation: there is a strong positive linear relationship between the two variables.
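
The exercise can be checked from the raw (X, Y) pairs in the table. Computing directly from the data gives b = 1.7, a = -0.8, and r ≈ 0.95; the variable names are chosen for this sketch.

```python
import math

xs = [1, 3, 4, 5, 7]
ys = [2, 4, 4, 8, 12]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

numerator = n * sum_xy - sum_x * sum_y
b = numerator / (n * sum_x2 - sum_x ** 2)        # slope of Y on X
a = sum_y / n - b * (sum_x / n)                  # intercept
r = numerator / math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

print(round(b, 2), round(a, 2), round(r, 2))
```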
