Lecture 7 Review Basic Statistics
Historically, statistics meant "state arithmetic": in very early times the state or government kept records to aid
in the collection of taxes and the provision of military service.
Objectives:
Since this course covers basic statistics only as a review of previous knowledge, the main objective is
to consolidate understanding of the following important areas:
o Definition of data
o Different types of data
o Sampling techniques
o Different methods of data collection
o Organization of data
o Analysis of Data
   - Measures of central tendency
   - Measures of dispersion
1. Data
The word data refers to a collection of related facts gathered by one of the four data collection
methods. "Data" is the plural form; the singular is "datum".
2. Types of Data
Data collected or gathered may fall into two categories – qualitative or quantitative.
Qualitative Data: refers to data that cannot be expressed numerically but only by some code or category.
For example, data gathered on sex (male/female) or color (red/blue/yellow/pink).
Quantitative Data: refers to data that can be expressed numerically, such as students' heights and ages
or the number of vehicles passing through a certain spot. For example, Peter's height could be
expressed as 1.78 m and his age as 19 years, or the total number of vehicles passing through a certain
point, say the University Circle, as 25 vehicles.
Further divisions are created. For example, quantitative data can further be divided into either discrete
or continuous.
Discrete Data: refers to data that can be expressed using only whole (counting) numbers. Whole
numbers are used to represent counts of objects, such as the total number of vehicles or people.
Continuous Data: refers to data that can be expressed using real (decimal) numbers. Variables like
height or distance can assume any value on the real number line, including decimal values.
3. Collecting Data
In any survey, the surveyor must first identify the quantity to measure. Secondly, based on the type of
measure, he/she chooses an appropriate method of selecting the subjects for the survey.
(i) Census - where all individual elements in the population become subjects of the survey
or data collection exercise. Sampling is therefore not required.
4. Sampling Techniques.
a) Simple Random Sampling – the subject is picked at random from the population
where each member of the population has an equal chance of being selected.
b) Stratified sampling – population is separated into strata that share the same
characteristics before applying simple random sampling.
c) Systematic sampling – where the surveyor selects every kth element of the
population as a subject for the measure.
Example: Use a telephone directory of 10,000 names from which we choose 200 of
the names, so k = 10,000/200 = 50. Randomly select one of the first 50 names, then
choose every 50th name after the randomly selected name.
d) Cluster sampling – first divide the population into sections and then randomly
select a few of those sections. All the members of the selected sections are then chosen.
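The four techniques above can be sketched in a few lines of Python. This is only an illustrative sketch: the directory contents, the two strata, the "page" clusters, the fixed seed and all function names are hypothetical choices made for the example, not part of the lecture material.

```python
import random

def simple_random_sample(population, n, rng):
    """Each member of the population has an equal chance of being selected."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Apply simple random sampling separately inside each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def systematic_sample(population, k, rng):
    """Random start among the first k members, then every kth member after it."""
    start = rng.randrange(k)          # index 0 .. k-1
    return population[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select a few clusters and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(7)  # fixed seed (assumption) so runs are repeatable

# Hypothetical stand-in for the telephone directory of 10,000 names.
directory = [f"name_{i}" for i in range(1, 10_001)]
k = len(directory) // 200                      # 10,000 / 200 = 50
systematic = systematic_sample(directory, k, rng)
simple = simple_random_sample(directory, 200, rng)

# Hypothetical strata and clusters built from the same directory.
strata = {"first_half": directory[:5000], "second_half": directory[5000:]}
stratified = stratified_sample(strata, 100, rng)
clusters = {f"page_{p}": directory[p * 100:(p + 1) * 100] for p in range(100)}
cluster = cluster_sample(clusters, 2, rng)

print(len(systematic), len(simple), len(cluster))  # 200 200 200
```

Each approach yields 200 names here, but only simple random sampling gives every possible group of 200 names the same chance of being drawn.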
OBSERVATION
Observer watches, or walks through, the actual process associated with the subject of interest.
Advantages: Data are of high quality because the actual event is observed. Data have real-time value. Data are highly believable.
Disadvantages: The observer must be on the scene when the event occurs.

QUESTIONNAIRE
A special-purpose document that requests specific information from respondents.
Advantages: High-volume responses, inexpensive to administer, fast and efficient.
Disadvantages: No flexibility of questions, no probing questions, no follow-up questions. May have a low return of questionnaires.

RECORDS REVIEW
A search of old records to extract data.
Advantages: Very inexpensive method of data collection. If records exist, then all the information is accessible.
Disadvantages: Records may not be available.
6. Organization of Data
Tables
- an array of data, as in the fire engine call data
- the frequency distribution in table form, the most popular method
- the frequency distribution table extended to include cumulative frequency
Graphical methods
- The ogive – a graph of the cumulative frequency against scores.
- Frequency histogram – a vertical bar graph of frequencies against scores.
- Frequency polygon – constructed by linking the midpoints of the tops of the
frequency columns.
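As a minimal sketch of the tabular methods above, the following Python builds a frequency distribution table with a cumulative frequency column. The list of scores and the function name are hypothetical and serve only as an illustration.

```python
from collections import Counter

def frequency_table(scores):
    """Return rows of (score, frequency, cumulative frequency), ordered by score."""
    counts = Counter(scores)
    rows = []
    running = 0
    for score in sorted(counts):
        running += counts[score]
        rows.append((score, counts[score], running))
    return rows

scores = [3, 1, 2, 3, 4, 2, 3, 5, 3, 2]  # hypothetical raw scores
table = frequency_table(scores)

print(f"{'x':>3} {'f':>3} {'cum f':>6}")
for x, f, cf in table:
    print(f"{x:>3} {f:>3} {cf:>6}")
```

Plotting the (score, cumulative frequency) pairs of each row gives the points of the ogive, and the (score, frequency) pairs give the heights of the histogram columns.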
7. Data Analysis
In most frequency distributions, the majority of cases or scores tend to cluster about some central value
with a few cases at the upper end and a few cases at the lower end. Of the many measures of central
tendency used to describe this state of affairs, we shall consider the mode, the mean and the median.
The mode is the score which occurs the most. It is the score with the highest frequency.
The mean is the average of all the scores, the sum of all the scores divided by the number of scores.
The median is the middle score when the scores are arranged in order of size from smallest to largest.
Read up on the method of evaluating each of these measures and perform the relevant exercises in your
CLN Text.
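The mean
For a list of scores: $\bar{x} = \frac{\sum x}{n}$
For a frequency distribution: $\bar{x} = \frac{\sum fx}{\sum f}$
For grouped data, with $x$ the class center: $\bar{x} = \frac{\sum fx}{\sum f}$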
The median
The median is the middle observation when all scores are arranged in increasing order.
When n is odd, the median is a single value. When n is even, the median is the average of the two middle
observations.
The median is easily read from a cumulative frequency distribution, where it corresponds to the 50th
percentile of the cumulative frequencies.
Number of magazines (x)    Frequency (f)    Cumulative frequency
0                          2                2
1                          12               14
2                          49               63
3                          64               127
4                          43               170
5                          20               190
6                          9                199
7                          1                200
                           ∑f = 200
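The 63rd score is the last observation of 2, the 64th is the first observation of 3 and the
127th score is the last observation of 3. Since ∑f = 200 is even, the median is the average of the
100th and 101st scores. Both lie between the 64th and 127th scores, so the median is 3.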
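Reading the median off the cumulative frequencies can be sketched as follows, using the magazine table above. The function names are hypothetical, and the approach simply walks the running total until it reaches the required position.

```python
magazines = [0, 1, 2, 3, 4, 5, 6, 7]      # scores x from the table
freq = [2, 12, 49, 64, 43, 20, 9, 1]      # frequencies f from the table

def score_at(position, scores, frequencies):
    """Return the score at a 1-based position in the ordered data."""
    running = 0
    for x, f in zip(scores, frequencies):
        running += f                      # cumulative frequency so far
        if position <= running:
            return x
    raise ValueError("position exceeds the number of observations")

def median_from_table(scores, frequencies):
    n = sum(frequencies)
    if n % 2 == 1:                        # odd n: one middle value
        return score_at((n + 1) // 2, scores, frequencies)
    lower = score_at(n // 2, scores, frequencies)
    upper = score_at(n // 2 + 1, scores, frequencies)
    return (lower + upper) / 2            # even n: average of the two middle values

median = median_from_table(magazines, freq)
print(median)  # 3.0
```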
The mode is, by comparison, the easiest to evaluate from a set of data. It is the score or observation that
occurs most often, or simply the one with the highest frequency. In grouped data, the modal class is the class
with the highest frequency.
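For the magazine distribution above, the mode and the mean can be checked with a short Python sketch. The variable names are illustrative only; the mean uses the frequency-distribution form, the sum of fx divided by the sum of f.

```python
magazines = [0, 1, 2, 3, 4, 5, 6, 7]      # scores x from the table
freq = [2, 12, 49, 64, 43, 20, 9, 1]      # frequencies f from the table

# Mode: the score with the highest frequency.
mode = magazines[freq.index(max(freq))]

# Mean: sum of f*x divided by sum of f.
sum_f = sum(freq)
sum_fx = sum(f * x for f, x in zip(freq, magazines))
mean = sum_fx / sum_f

print(mode)    # 3
print(sum_fx)  # 635
print(mean)    # 3.175
```

Here the mode (3) agrees with the median, while the mean (3.175) is slightly higher.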
b) Measures of Dispersion
The second stage in analyzing the numerical information is to investigate the variability of the data; that
is the spread or scatter of values from the ‘average’.
The Range
This is the difference between the highest and lowest scores. It is not an ideal measure of dispersion; its
weakness is that it is based only on the two extreme values of the distribution and thus may not give
sufficient detail about the scatter of all the scores from the mean.
This measure is important only in certain data reporting scenarios. One example is weather statistics,
where reporting the extremes of atmospheric temperature, rainfall or pressure is compulsory.
The variance is a widely used measure of dispersion but it has the disadvantage that it is expressed in
units which are the squares of the original units. For many purposes it is desirable that a measure of
dispersion be expressed in the same units as the original data and their mean. Such a measure of
dispersion is the standard deviation which is obtained by simply taking the square root of the variance.
The calculation is similar for grouped data, except that 𝑥 is the class center. The class center is used to
calculate the individual deviations from the mean.
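Written out in the notation of the worked example that follows, and assuming the population form that divides by the total frequency (as that example does), the two measures are:

```latex
\sigma^2 = \frac{\sum f(x-\mu)^2}{\sum f},
\qquad
\sigma = \sqrt{\frac{\sum f(x-\mu)^2}{\sum f}}
```

Here $\mu = \frac{\sum fx}{\sum f}$ is the mean, $x$ is the score (or the class center for grouped data) and $f$ is its frequency.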
Example:
Question:
Complete the table and determine the variance and standard deviation of the
distribution.
class      f
21 - 25    5
26 - 30    11
31 - 35    4
36 - 40    13
41 - 45    7
46 - 50    10
Solution
Steps
1. Calculate the class centers $x$ by averaging the lower and upper values of each class.
2. Evaluate $fx$, the product of each score (class center) with its corresponding frequency.
3. Calculate the sum of $fx$, $\sum fx$.
4. Calculate the mean, $\mu = \frac{\sum fx}{\sum f} = \frac{1830}{50} = 36.6$.
5. Use the appropriate formula to calculate the variance and the standard deviation.
Variance = $\sigma^2 = \frac{\sum f(x-\mu)^2}{\sum f} = \frac{3402}{50} = 68.04$
Standard deviation = $\sigma = \sqrt{\frac{\sum f(x-\mu)^2}{\sum f}} = \sqrt{68.04} \approx 8.2$
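The Frequency Distribution Table.
Mean = $\mu = \frac{\sum fx}{\sum f} = \frac{1830}{50} = 36.6$
$\sigma^2 = \frac{\sum f(x-\mu)^2}{\sum f} = \frac{3402}{50} = 68.04$
$\sigma = \sqrt{\frac{\sum f(x-\mu)^2}{\sum f}} = \sqrt{68.04} \approx 8.2$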
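The completed table asked for in the question can be reproduced with a short Python sketch that follows the solution steps: class centers, the products fx, the mean, and then the weighted squared deviations. The column layout and variable names are illustrative choices; the class limits and frequencies are those given in the question.

```python
import math

classes = [(21, 25), (26, 30), (31, 35), (36, 40), (41, 45), (46, 50)]
freq = [5, 11, 4, 13, 7, 10]

# Step 1: class centers (average of the lower and upper values).
centers = [(lo + hi) / 2 for lo, hi in classes]

# Steps 2-4: products fx, their sum, and the mean.
sum_f = sum(freq)
sum_fx = sum(f * x for f, x in zip(freq, centers))
mean = sum_fx / sum_f                      # 1830 / 50

# Step 5: variance and standard deviation (dividing by the total frequency).
sum_fd2 = sum(f * (x - mean) ** 2 for f, x in zip(freq, centers))
variance = sum_fd2 / sum_f                 # 3402 / 50
std_dev = math.sqrt(variance)

print(f"{'class':>7} {'x':>5} {'f':>3} {'fx':>6} {'x-mu':>6} {'f(x-mu)^2':>10}")
for (lo, hi), x, f in zip(classes, centers, freq):
    label = f"{lo}-{hi}"
    d = x - mean
    print(f"{label:>7} {x:>5.0f} {f:>3} {f * x:>6.0f} {d:>6.1f} {f * d * d:>10.2f}")

print(f"sum f = {sum_f}, sum fx = {sum_fx:.0f}, sum f(x-mu)^2 = {sum_fd2:.2f}")
print(f"mean = {mean:.1f}, variance = {variance:.2f}, std dev = {std_dev:.1f}")
```

The printed totals match the values used in the solution: a mean of 36.6, a variance of 68.04 and a standard deviation of about 8.2.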
Identify.