MMW Chapter 4
DATA MANAGEMENT
With the advent of modern technology, data from many fields of expertise are
being collected and stored for further analysis, in the hope of revealing hidden
information that is relevant to each field. Tasks ranging from weather forecasting
and tracking the spread of an epidemic to testing whether a new treatment for a
disease is effective all require some knowledge of data management, or
Statistics. Statistics also serves as a tool for teachers to draw conclusions about
the performance of their students, and for quality control experts to measure how
satisfied customers are with certain products. On a wider scale, economists use
statistical data to describe the progress of a country in terms of various indicators.
Statistics allows one to generate valid and reliable results, especially in the
area of research and in big data. For example, statistical methodologies are used to
construct and evaluate models from large and complex data. Moreover,
understanding and communicating research findings on quantitative information
demand some degree of statistical skills.
This chapter introduces students to the most basic statistical concepts, which
apply to almost any discipline. Acquiring the most vital statistical skills is the
main goal of this chapter.
Lesson 4.1
INTRODUCTION TO STATISTICS
Objectives
At the end of the lesson, students are expected to be able to
1. Define Statistics;
2. Differentiate the areas of Statistics;
3. Identify the area of Statistics applied in a particular scenario;
4. Differentiate a sample from a population, and statistic from a parameter;
5. Distinguish the types of variables;
6. Determine the level of measurement of a given variable.
Definition of Statistics
Areas of Statistics
Population vs Sample
Example 4.1.1
Population
Parameter vs Statistic
Example 4.1.2
PARAMETER STATISTIC
Variable vs Attribute
Types of Variable
Levels of Measurement
Interval level applies to numerical data in which a zero value is not absolute.
Examples of interval data include scores in a test and temperature readings in °C. A
zero score in a test does not mean that one has no knowledge about the subject
matter. Likewise, a temperature reading of 0 °C does not indicate the absence of
temperature.
Ratio level applies to numerical data in which a zero value is absolute. That
is, a zero value indicates the absence of the characteristics being measured.
Examples include distance, floor area and height. Zero values for such data mean
the characteristics being observed do not exist.
Exercise 4.1
Introduction to Statistics
Name:___________________________ Score:_______
Course and Year:__________________ Date:________
B. Identify the type of each variable. Write QnD if the variable is
quantitative-discrete, QnC if it is quantitative-continuous, and Ql if it is
qualitative. Then identify its level of measurement.
Variable Type Level
1. IQ
2. Ethnicity
3. Electric Bill
4. Military Rank
5. Blood Type
6. Number of Coin Flips
7. Detergent Brand
8. Cabbage Yield in kg
9. Number of Votes
10. Level in a Mobile Game
Lesson 4.2
DATA COLLECTION METHODS AND DATA SOURCES
Objectives
At the end of the lesson, students are expected to be able to
1. Differentiate the different data collection methods;
2. Identify the most appropriate data collection method for a given scenario;
3. Differentiate sources of data;
4. Identify the source of data from a given scenario.
Data can be gathered using different methods depending on the nature and
objective of a research. Here are some of the most popular methods.
Data Sources
Name:___________________________ Score:_______
Course and Year:__________________ Date:________
Method
1. To trace possible COVID-19 carriers, people were asked to
answer a standardized form asking about their age, travel
histories, symptoms being experienced, and people they
came in contact with.
Source
1. The data collected in item A1.
Objectives
At the end of the lesson, students are expected to be able to
1. Identify and apply the most appropriate data presentation technique for a
given scenario;
2. Construct a Frequency Distribution Table (FDT) for a given set of data;
3. Use the most appropriate graph for a given set of data;
4. Give contexts or situations appropriate for each technique of presentation.
People are visual creatures. Much of the information that reaches the brain
comes through the sense of sight. This is the reason why presenting data visually
is preferred. Tables and graphs are used to present data in a clear and concise
manner. Presented this way, trends and patterns are easy to spot, which makes
research results easier to understand and communicate.
Generally, data are presented using texts, tables, and graphs. The textual
technique is most appropriate when data are in themselves texts as in the case of
qualitative studies. It is also best suited when presenting small data sets or when it is
more important to emphasize points. Tables and graphs are most applicable for
purely quantitative data.
1. Compute the value of the Range (R), which is the difference between the
highest and lowest values.
2. Approximate the number of class intervals, k, by taking the square root of
the total number of observations, n. Thus, $k = \sqrt{n}$.
3. Obtain the value of the class width, c, by dividing R by k. Thus, $c = R/k$,
rounded off to the nearest odd number.
4. Construct the class intervals (CI).
▪ the first lower limit LL1 is the lowest value in the data set
▪ the first upper limit UL1 = LL1 + c − d .
The number d refers to the decimal unit of the raw data.
✓ d = 1 , if data are whole numbers
✓ d = 0.1 , if data have decimal values in the tenths place
✓ d = 0.01 , if data have decimal values in the hundredths place
▪ the lower and upper limits of the succeeding CIs are obtained by adding the
value of c to the lower and upper limits of the preceding CIs until reaching the
CI that contains the highest value in the data set.
CI
Lower Limits              Upper Limits
LL1 = lowest value        UL1 = LL1 + c − d
LL2 = LL1 + c             UL2 = UL1 + c
LL3 = LL2 + c             UL3 = UL2 + c
LL4 = LL3 + c             UL4 = UL3 + c
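The steps above can be sketched in Python. This is only a minimal illustration under stated assumptions: the names `nearest_odd` and `class_intervals` are hypothetical, the data are assumed to be whole numbers (d = 1), and "round off to the nearest odd number" is read as choosing the odd integer closest to R/k.

```python
import math

def nearest_odd(value):
    """Return the odd integer closest to value (assumed reading of step 3)."""
    lower_odd = 2 * math.floor((value - 1) / 2) + 1   # largest odd integer <= value
    upper_odd = lower_odd + 2
    return lower_odd if value - lower_odd <= upper_odd - value else upper_odd

def class_intervals(data, d=1):
    """Build (lower limit, upper limit) pairs following steps 1 to 4."""
    lowest, highest = min(data), max(data)
    r = highest - lowest                    # step 1: range
    k = math.sqrt(len(data))                # step 2: approximate number of classes
    c = max(nearest_odd(r / k), 1)          # step 3: class width, at least 1
    intervals = []
    lower, upper = lowest, lowest + c - d   # step 4: first class interval
    while True:
        intervals.append((lower, upper))
        if upper >= highest:                # stop at the CI containing the highest value
            break
        lower, upper = lower + c, upper + c
    return c, intervals
```

For a data set of 50 whole numbers with a lowest value of 82 and a highest value of 150, this sketch gives a class width of 9 and eight class intervals, from 82-90 up to 145-153.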
Example 4.3.1
The following set of data represents the initial body weights (in grams) of 50
rats used in a study to determine whether a new vitamin is effective in promoting
weight gain. For 40-day-old rats, normal body weights range from 100 to 130
grams. Construct the frequency distribution table for this data set. Compute the
values in the following columns: class mark, true class boundary, “less than” and
“greater than” cumulative frequencies, and rank.
140 135 137 126 150 119 126 120 118 125
115 127 95 100 100 101 103 142 113 115
129 110 126 106 105 87 126 119 125 130
108 118 119 117 115 133 102 90 110 104
82 105 132 143 95 124 113 96 139 140
1. $R = 150 - 82 = 68$.
2. $k = \sqrt{50} \approx 7.07$.
3. $c = \dfrac{68}{\sqrt{50}} \approx 9.62 \rightarrow 9$ (nearest odd number).
4. Since the raw data are whole numbers, $d = 1$.
$LL_1$ = lowest value = 82 and $UL_1 = 82 + 9 - 1 = 90$.
Each succeeding limit is obtained by adding c = 9 to the preceding one (for
example, LL2 = 82 + 9 = 91 and UL2 = 90 + 9 = 99).
CI         f     RF (%)   Xi     TCB            <CF   >CF   Rank
82-90      3     6        86     81.5-90.5      3     50    6.5
91-99      3     6        95     90.5-99.5      6     47    6.5
100-108    10    20       104    99.5-108.5     16    44    2
109-117    8     16       113    108.5-117.5    24    34    3
118-126    13    26       122    117.5-126.5    37    26    1
127-135    6     12       131    126.5-135.5    43    13    4.5
136-144    6     12       140    135.5-144.5    49    7     4.5
145-153    1     2        149    144.5-153.5    50    1     8
Total      50    100
Based on the table, it is observed that more body weights belong to the
normal weight range (middle class intervals) while few have extreme weights.
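The frequency columns of the table can be checked with a short script. The sketch below assumes the class limits from Example 4.3.1 (width 9, first lower limit 82, eight classes); the function name `frequency_table` and the dictionary keys are illustrative choices, and the rank column is omitted.

```python
weights = [
    140, 135, 137, 126, 150, 119, 126, 120, 118, 125,
    115, 127, 95, 100, 100, 101, 103, 142, 113, 115,
    129, 110, 126, 106, 105, 87, 126, 119, 125, 130,
    108, 118, 119, 117, 115, 133, 102, 90, 110, 104,
    82, 105, 132, 143, 95, 124, 113, 96, 139, 140,
]

def frequency_table(data, first_lower, width, n_classes, d=1):
    """Return one dict per class interval with the FDT columns."""
    n = len(data)
    rows = []
    less_than_cf = 0
    for j in range(n_classes):
        lower = first_lower + j * width
        upper = lower + width - d
        f = sum(lower <= x <= upper for x in data)   # tally of the class
        less_than_cf += f
        rows.append({
            "CI": (lower, upper),
            "f": f,
            "RF": 100 * f / n,                       # relative frequency in percent
            "Xi": (lower + upper) / 2,               # class mark
            "TCB": (lower - d / 2, upper + d / 2),   # true class boundaries
            "<CF": less_than_cf,
            ">CF": n - less_than_cf + f,             # observations in this class or above
        })
    return rows

for row in frequency_table(weights, first_lower=82, width=9, n_classes=8):
    print(row)
```

Plotting the "<CF" values against the upper true class boundaries, and the ">CF" values against the lower ones, gives the points of the two ogives.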
Statistical Maps are among the best techniques for matching data values
to geographical locations. One can use colors, symbols, pictures, and numbers to
show the differences in values of each area on the map. Example 4.3.6 shows the
carbon dioxide (CO2) emission of each country in 2017. The darker colors indicate a
higher CO2 emission. The map shows that Canada, USA, Australia, Kazakhstan, Saudi
Arabia, Kuwait, Qatar and Bahrain are the countries that emitted the most CO2.
Example 4.3.5. Number of COVID-19 Cases by Country
[Pictograph; legend: [symbol] equates to 10 000 cases; countries shown: UK, Iran, France, Germ…, China, Spain, Italy, USA]
Example 4.3.6. 2017 Average Carbon Dioxide Emissions (tons) per Capita
[Statistical map]
Scatterplots are the quickest and easiest way to show whether two variables
are related. The three scatterplots below show that Variable X is positively related
to Variable Y and negatively related to Variable Z, but is not related to
Variable W.
Example 4.3.7
[Three scatterplots: Variable Y, Variable Z, and Variable W, each plotted against Variable X; the horizontal axes run from 0 to 20]
Histograms are bar graphs based on an FDT. The true class boundaries mark the
edges of the bars and are plotted on the horizontal axis.
Histograms show the shape and the spread of a distribution. The histogram in
Example 4.3.8 shows that the majority of the rats have body weights that are within
the normal range, as indicated by the taller bars. Rats with extreme body weights are
the fewest, as represented by the bars at the opposite ends.
Ogives are used to estimate the number of observations that are less than or
greater than a particular value. They are constructed by plotting the cumulative
frequencies against the true class boundaries. Specifically, the less-than cumulative
frequencies are plotted against the upper limits of each TCB, while the greater-than
cumulative frequencies are plotted against the lower limits. In Example 4.3.9, the
third point of the less-than ogive means that there are six rats whose body weights
are less than or equal to 99.5 grams. Similarly, the fourth point of the greater-than
ogive means that 34 rats have body weights greater than or equal to 108.5 grams.
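[Figures for Examples 4.3.8 and 4.3.9: histogram and ogives of the rat body weights; vertical axes: frequency; horizontal axes: true class boundaries 81.5, 90.5, 99.5, 108.5, 117.5, 126.5, 135.5, 144.5, 153.5]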
Name:___________________________ Score:_______
Course and Year:__________________ Date:________
3. Population density refers to the number of people per unit area. The
following shows the population density per square kilometer of some
selected regions of Luzon.
CAR:87 II (Cagayan Valley):116 IV-A(CALABARZON):870
I (Ilocos):388 III (Central Luzon):512 IV-B (MIMAROPA):100
Total
C. Choose two items in A and construct the graph/chart. Briefly discuss the graphs
Lesson 4.4
DESCRIBING DATA SETS
Objectives
At the end of the lesson, students are expected to be able to
1. Identify and compute the appropriate measure of central tendency for a
given data set;
2. Identify and compute the appropriate measure of dispersion for a given
data set;
3. Describe the skewness of a data set;
4. Locate the different quantiles relative to a given data set.
Median. The median is the middlemost value in an ordered data set. This is
given by
$$Md = x_{(n+1)/2} \ \text{ if } n \text{ is odd}, \qquad Md = \frac{x_{n/2} + x_{(n/2)+1}}{2} \ \text{ if } n \text{ is even}.$$
Mode. Denoted by Mo, the mode is the most frequently occurring value or values
in a set of data. A data set may be unimodal, bimodal, or multimodal, or may have
no mode at all.
The mean is the only one of these measures on which further computations can
be carried out. However, unlike the median, it is sensitive to extreme values. The
mode, on the other hand, is unstable: its value can change when the data are
rounded off differently. It also fails to provide a single measure of central
tendency when its value is not unique.
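To make these properties concrete, the sketch below implements the median formula directly and shows how a single extreme value shifts the mean but leaves the median unchanged. The function name `median_by_formula` and the small data set are hypothetical and are not taken from the text.

```python
def median_by_formula(values):
    """Median via the odd/even position formula (positions are 1-based)."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]            # x_((n+1)/2)
    return (x[n // 2 - 1] + x[n // 2]) / 2    # (x_(n/2) + x_((n/2)+1)) / 2

scores = [10, 12, 13, 15, 20]                 # hypothetical data
with_extreme = scores[:-1] + [200]            # largest value replaced by an extreme one

print(sum(scores) / len(scores), median_by_formula(scores))
print(sum(with_extreme) / len(with_extreme), median_by_formula(with_extreme))
```

In this illustration the mean moves from 14 to 50 while the median stays at 13, which is why the median is often preferred when extreme values are present.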
Example 4.4.1
The following are the scores of 12 randomly selected students from a 40-point
test: 35, 30, 38, 28, 20, 37, 18, 26, 32, 36, 39, 21. Scores that are within the
range of “20-35” are considered “average.” Scores that are below this range are
“poor,” while those above it are “outstanding.” Identify the most appropriate
measure of central tendency and then compute its value.
Solution:
Since the set of data originates from a sample, and the values are numerical,
the appropriate measure is the sample mean.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{35 + 30 + 38 + \dots + 39 + 21}{12} = \frac{360}{12} = 30$$
The mean value of 30 suggests that students are “average” in terms of the
skills measured by the test.
Example 4.4.2
Example 4.4.3
The following shows the senior high school strand taken by 16 students who
are entering a certain university. The strands include STEM (Science, Technology,
Engineering and Mathematics), GA (General Academic), ABM (Accountancy,
Business and Management), and HUMMS (Humanities and Social Sciences).
Identify the most appropriate measure of central tendency and then compute it.
GA, STEM , GA, GA, STEM , GA, STEM , ABM ,
GA, ABM , GA, HUMMS , HUMMS , STEM , ABM , GA
Solution:
Since the variable “strand” is nominal, the mode is the appropriate measure of
central tendency. The mode is “GA” since it is the most frequent strand among
the sixteen observations. This means that, of the sixteen entering students, more
graduated from the GA strand than from any other strand.
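The tally behind this answer can be checked in code. The sketch below uses Python's `collections.Counter`; the helper name `modes` is an assumption, and it returns every value tied for the highest frequency so that bimodal or multimodal data are also handled.

```python
from collections import Counter

strands = ["GA", "STEM", "GA", "GA", "STEM", "GA", "STEM", "ABM",
           "GA", "ABM", "GA", "HUMMS", "HUMMS", "STEM", "ABM", "GA"]

def modes(values):
    """Return all values that share the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

print(Counter(strands))   # frequency of each strand
print(modes(strands))     # the most frequent strand(s)
```

Note that when every value occurs equally often, this helper returns all of them; whether to report that case as "no mode" is a convention left to the reader.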
Example 4.4.4
Solution:
The average height of the boys is 119.64 cm, which falls below the normal
range. In contrast, the average height of their female counterparts is within the
normal range. Based on these results, it appears that only the boys exhibit stunting.
Measures of Dispersion
While a measure of central tendency gives a value around which a set of data
tends to cluster or fluctuate, a measure of dispersion gives a value that tells whether
the set of data is compact or dispersed. That is, it tells whether the data are
relatively closer to each other, or are relatively farther apart from each other. Thus
two sets of data may have the same central values yet they are different if they have
different values of dispersion; that is, if one is more compact than the other. So the
measure of dispersion supplements the information given by a central value.
To illustrate, consider the following scores of two groups of students who took
the same examination:
Group Scores Mean
A 78 79 80 81 82 80
B 70 72 75 88 95 80
The mean scores of the two groups of students are equal. However, the
scores of students from group A are relatively closer to the mean than those from
group B. For example, the extreme scores from group A of “78” and “82” are closer
to the mean value of 80 than the extreme scores from group B of “70” and “95”. That
is why even if the mean scores of the two groups of students are the same, the two
sets of scores are different because one is more dispersed than the other.
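Variance is the average of the squared deviations of each data point from the
mean. This is given by
$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N} \qquad \text{and} \qquad s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$
where $\sigma^2$ is called the population variance while $s^2$ is the sample variance.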
Standard Deviation is simply the square root of the variance and is denoted
by $\sigma$ and $s$ for population and sample, respectively. The unit of $\sigma$ or $s$ is the
same as the unit of the set of data. Also, these values are non-negative.
Example 4.4.5
Compute the variance and standard deviation of the data provided in
Example 4.4.1.
Data: 35, 30, 38, 28, 20, 37, 18, 26, 32, 36, 39, 21
Solution:
$$\sum_{i=1}^{n} x_i = 35 + 30 + \dots + 39 + 21 = 360; \qquad \sum_{i=1}^{n} x_i^2 = 35^2 + 30^2 + \dots + 39^2 + 21^2 = 11404; \qquad n = 12$$
$$s^2 = \frac{\sum_{i=1}^{12} x_i^2 - \frac{\left(\sum_{i=1}^{12} x_i\right)^2}{n}}{n - 1} = \frac{11404 - \frac{360^2}{12}}{12 - 1} = 54.91$$
$$s = \sqrt{s^2} = \sqrt{54.91} = 7.41$$
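The computation above can be verified numerically. The sketch below assumes the twelve scores from Example 4.4.1 and uses the illustrative name `sample_variance` for the computational (sum of squares) formula; Python's `statistics.variance` is called only as a cross-check.

```python
import math
import statistics

scores = [35, 30, 38, 28, 20, 37, 18, 26, 32, 36, 39, 21]

def sample_variance(values):
    """Sample variance via the computational formula with divisor n - 1."""
    n = len(values)
    sum_x = sum(values)                    # sum of the observations
    sum_x2 = sum(x * x for x in values)    # sum of the squared observations
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

s2 = sample_variance(scores)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))
print(statistics.variance(scores))         # library cross-check
```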
Example 4.4.6
Given the heights of the children in Example 4.4.4, compute the standard
deviation of each group.
BOYS (X)
116 119 120 121 122 130 118 120
122 123 117 120 115 116 122 115
119 118 118 117 119 120 129 116
118 123 119 118
GIRLS (Y)
133 132 130 126 125 125 125 126
122 120 123 114 127 126 131 132
115 129 118 129 130 131 129 126
126 127 128 127 128 130
Solution:
For the boys:
$$\sum_{i=1}^{28} x_i = 116 + 119 + \dots + 119 + 118 = 3350; \qquad \sum_{i=1}^{28} x_i^2 = 116^2 + 119^2 + \dots + 119^2 + 118^2 = 401152; \qquad n_x = 28$$
$$s_x = \sqrt{s_x^2} = \sqrt{\frac{\sum_{i=1}^{28} x_i^2 - \frac{\left(\sum_{i=1}^{28} x_i\right)^2}{n_x}}{n_x - 1}} = \sqrt{\frac{401152 - \frac{3350^2}{28}}{28 - 1}} = 3.59$$
For the girls:
$$\sum_{i=1}^{30} y_i = 133 + 132 + \dots + 128 + 130 = 3790; \qquad \sum_{i=1}^{30} y_i^2 = 133^2 + 132^2 + \dots + 128^2 + 130^2 = 479450; \qquad n_y = 30$$
$$s_y = \sqrt{s_y^2} = \sqrt{\frac{\sum_{i=1}^{30} y_i^2 - \frac{\left(\sum_{i=1}^{30} y_i\right)^2}{n_y}}{n_y - 1}} = \sqrt{\frac{479450 - \frac{3790^2}{30}}{30 - 1}} = 4.72$$
The standard deviation of the girls' heights is higher than that of the boys'
heights, which means that the girls' heights are more dispersed than the boys'
heights.
Measure of Skewness
Skewness tells about the symmetry (or lack thereof) of a distribution about its
mean. Using the symmetric (normal) distribution as a baseline, skewness measures
the degree of distortion (long tails) of a distribution. A positively skewed distribution
has its long tail at the right which indicates that the mean is higher than the median.
Conversely, a negatively skewed distribution has its long tail at the left which
indicates that the mean is lower than the median. For a perfectly symmetric
distribution (or normal distribution), the mean and the median coincide.
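[Figure: three frequency curves illustrating the skewed and symmetric cases, each marking the positions of the mean and the median]
$$SK = \frac{3(\bar{x} - Md)}{s} \qquad \text{and} \qquad SK = \frac{3(\mu - Md)}{\sigma}.$$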
Interpreting SK
Example 4.4.7
Solution:
We first compute the mean, the median, and the standard deviation of the two
sets of data. Recall that the mean heights of the two groups (as computed in
Example 4.4.4) are $\bar{x} = 119.64$ cm and $\bar{y} = 126.33$ cm.
Ordered Data
BOYS (X)
115 115 116 116 116 117 117 118
118 118 118 118 119 119 119 119
120 120 120 120 121 122 122 122
123 123 129 130
GIRLS (Y)
114 115 118 120 122 123 125 125
125 126 126 126 126 126 127 127
127 128 128 129 129 129 130 130
130 131 131 132 132 133
Since there are 28 observations for the boys,
$$Md_x = \frac{x_{n/2} + x_{(n/2)+1}}{2} = \frac{x_{14} + x_{15}}{2} = \frac{119 + 119}{2} = 119.$$
Since there are 30 observations for the girls,
$$Md_y = \frac{y_{n/2} + y_{(n/2)+1}}{2} = \frac{y_{15} + y_{16}}{2} = \frac{127 + 127}{2} = 127.$$
The standard deviations of the two groups (as computed in Example 4.4.6) are
$s_x = 3.59$ and $s_y = 4.72$.
So that
$$SK_x = \frac{3(\bar{x} - Md_x)}{s_x} = \frac{3(119.64 - 119)}{3.59} = 0.53$$
and
$$SK_y = \frac{3(\bar{y} - Md_y)}{s_y} = \frac{3(126.33 - 127)}{4.72} = -0.43.$$
These coefficients of skewness reveal that the height distribution of the boys
is positively skewed, while that of the girls is negatively skewed. That is, there
are more boys whose heights are below the mean, and there are more girls whose
heights are above the mean.
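The whole computation can be reproduced with Python's `statistics` module. This is a sketch under stated assumptions: the function name `pearson_skewness` is illustrative, the data are the heights listed in Example 4.4.6, and the sample formula $SK = 3(\bar{x} - Md)/s$ is used. Because the code keeps unrounded intermediate values, its results can differ slightly in the second decimal place from the hand computation, which uses values rounded to two decimals.

```python
import statistics

boys = [116, 119, 120, 121, 122, 130, 118, 120,
        122, 123, 117, 120, 115, 116, 122, 115,
        119, 118, 118, 117, 119, 120, 129, 116,
        118, 123, 119, 118]
girls = [133, 132, 130, 126, 125, 125, 125, 126,
         122, 120, 123, 114, 127, 126, 131, 132,
         115, 129, 118, 129, 130, 131, 129, 126,
         126, 127, 128, 127, 128, 130]

def pearson_skewness(values):
    """Coefficient of skewness SK = 3 (mean - median) / s for a sample."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    s = statistics.stdev(values)       # sample standard deviation (divisor n - 1)
    return 3 * (mean - median) / s

print(pearson_skewness(boys))    # positive: long tail to the right
print(pearson_skewness(girls))   # negative: long tail to the left
```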
Exercise 4.4.1
Measures of Central Tendency, Dispersion, and Skewness
Name:___________________________ Score:_______
Course and Year:__________________ Date:________
A. Identify and compute the most appropriate measure of central tendency for
each case.
1. The scores of 10 randomly selected students in a 50-point Statistics
Examination are as follows: 49, 35, 38, 25, 34, 21, 47, 40, 35, 10.
3. The following are the year levels of students that are scholars of the Department
of Science and Technology who enrolled at BSU this year:
I, II, I, III, III, IV, II, IV, I, I.
4. Below are the medals won by team Philippines during the 2018 Asian Games.
Gold Silver Bronze
4 2 15
5. The data below are the general weighted averages (GWA) of 15 randomly
selected BSU varsity and non-varsity students for the second semester of SY
2019-2020. Compute the measure separately for each group, then compare the two
groups based on these values. Remember that in BSU, a lower numerical grade
indicates a better performance.
Varsity Players 1.75 1.42 2.45 1.81 2.45 1.51 1.61 1.43
Non-varsity Players 3.01 1.18 2.89 2.15 1.97 2.85 2.56
Varsity Players
Non-Varsity Players
Brief Comparison
B. Compute the standard deviations for the GWA of the two groups of students
given in item 5 of A, and then briefly compare in terms of these values.
Varsity Players Non-Varsity Players
Brief Comparison
C. Compute the coefficient of skewness for the GWA of the two groups of
students given in item 5 of A, and then briefly compare the groups in terms
of these values.
Varsity Players Non-Varsity Players
Brief Comparison
Measures of Relative Position (Quantiles)
Quantiles are values that divide an ordered distribution into equal parts, such
that a given proportion of the observations is less than or equal to each value.
These measures are applicable when we want to identify the “position” or “standing”
of a data point relative to the entire data set. The most used quantiles are
quartiles, deciles and percentiles.
Quartiles divide an ordered data set into four equal parts (quarters) so that
25% of the data are less than or equal to the first quartile ($Q_1$). Similarly, 50%
and 75% of the entire data set are less than or equal to the second quartile ($Q_2$)
and the third quartile ($Q_3$), respectively.
Example 4.4.8
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11
3 6 7 8 8 9 10 10 11 12 19
$Q_1 = x_3 = 7$, $Q_2 = x_6 = 9$, $Q_3 = x_9 = 11$
Deciles divide the distribution into 10 equal parts. The values separating each
part are called the first decile ( D1 ) , the second decile ( D2 ) , …, ninth decile ( D9 ) .
Example 4.4.9
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29
21 25 36 37 37 40 41 41 41 45 46 47 49 55 55 59 60 65 66 70 75 76 76 76 80 81 89 90 95
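$D_1 = x_3 = 36$, $D_2 = x_6 = 40$, $D_3 = x_9 = 41$, $D_4 = x_{12} = 47$, $D_5 = x_{15} = 55$,
$D_6 = x_{18} = 65$, $D_7 = x_{21} = 75$, $D_8 = x_{24} = 76$, $D_9 = x_{27} = 89$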
Percentiles divide an ordered distribution into 100 equal parts. The values
that separate the parts are called percentiles, namely
$P_1, P_2, P_3, \dots, P_{99}$.
The Median is also a quantile that divides a distribution into two equal parts,
the upper and lower 50%. The median value is the same as Q2 , D5 , and P50 .
Other Equal Quantiles
$Q_1$ and $P_{25}$ have the same value since both separate the lower 25%
of an ordered data set from the rest. Here are other pairs of equal quantiles:
$(Q_2, P_{50})$, $(Q_3, P_{75})$, $(D_1, P_{10})$, $(D_2, P_{20})$, $(D_3, P_{30})$, $(D_4, P_{40})$, $(D_6, P_{60})$, $(D_7, P_{70})$, $(D_8, P_{80})$, $(D_9, P_{90})$.
Before computing the values of the different quantiles, make sure that the data
set is arranged in ascending order. Then solve for the index $k$ of the quantile,
which gives the position of the quantile value in the ordered data set. The size of
the data set is denoted by $n$, and the specific quantile being located is denoted
by $p$. For example, for the quantiles $Q_3$, $D_6$, and $P_{59}$, the values of $p$ are 3, 6,
and 59, respectively. The values of $k$ for the different quantiles are computed as
follows:
for quartiles, $k = \dfrac{p(n+1)}{4}$, where $p = 1, 2, 3$;
for deciles, $k = \dfrac{p(n+1)}{10}$, where $p = 1, 2, 3, \dots, 9$;
for percentiles, $k = \dfrac{p(n+1)}{100}$, where $p = 1, 2, 3, \dots, 99$.
Example 4.4.10
The first quartile indicates that 25% of the students (or approximately 14
students) scored at most 49 points, while the third quartile indicates that 75% of the
students scored at most 72 points.
b. For $D_5$: $k = \dfrac{p(n+1)}{10} = \dfrac{5(55+1)}{10} = \dfrac{5 \cdot 56}{10} = 28$
For $D_8$: $k = \dfrac{p(n+1)}{10} = \dfrac{8(55+1)}{10} = \dfrac{8 \cdot 56}{10} = 44.8$
Based on the value of k, $D_5 = x_{28} = 57$, which is the same as the median. That
is, one-half (50%) of the examinees scored at most 57 points. Similarly, the value of
$D_8$ is located at $x_{44.8}$. This position does not correspond to a particular value
in the data set because it lies between two consecutive positions of the ordered
data, so linear interpolation is done. In so doing, assume that a value between
$x_{44} = 74$ and $x_{45} = 79$ exists, and then follow the steps below.
[Figure: number line showing positions L, C, and R with the corresponding values $x_L$, $x_C$, and $x_R$, where $L < C < R$]
Linear interpolation is done by assuming that values exist between two numbers,
say $x_L$ and $x_R$. In the above figure, $x_C$ is a number between $x_L$ and $x_R$ that is
computed by taking the fraction $(C - L)$ of the distance $(x_R - x_L)$ and then
adding it to $x_L$:
$$x_C = x_L + (C - L)(x_R - x_L).$$
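So from Example 4.4.10 (b), given that $x_{44} = 74$ and $x_{45} = 79$, $D_8 = x_{44.8}$ is
computed as follows:
$$D_8 = x_{44.8} = x_L + (C - L)(x_R - x_L) = 74 + (44.8 - 44)(79 - 74) = 74 + 0.8(5) = 78.$$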
c. For $P_{61}$: $k = \dfrac{p(n+1)}{100} = \dfrac{61(55+1)}{100} = \dfrac{61 \cdot 56}{100} = 34.16$
Again, there is no observation at position 34.16, so we apply linear
interpolation. Given that $C = 34.16$, $L = 34$, $x_L = x_{34} = 61$, and $x_R = x_{35} = 64$,
find $P_{61}$ as follows:
$$P_{61} = x_{34.16} = x_L + (C - L)(x_R - x_L) = 61 + (34.16 - 34)(64 - 61) = 61 + 0.16(3) = 61 + 0.48 = 61.48.$$
The value $P_{61} = 61.48$ indicates that 61% of the examinees scored at most
61.48.
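The index formula and the interpolation step can be combined into one small function. This is a minimal sketch, not a definitive implementation: the name `quantile` and its parameters are assumptions, the data must already be sorted in ascending order, and clamping k to the range 1 to n (for indices that fall outside the data) is a choice made here rather than a rule from the text.

```python
def quantile(sorted_data, p, parts):
    """Value of the p-th quantile using k = p(n + 1) / parts.

    parts is 4 for quartiles, 10 for deciles, and 100 for percentiles.
    A non-integer k is resolved by linear interpolation between neighbors.
    """
    n = len(sorted_data)
    k = p * (n + 1) / parts
    k = min(max(k, 1), n)              # assumption: clamp k to positions 1..n
    left = int(k)                      # L, the position at or just below k
    x_left = sorted_data[left - 1]     # positions are 1-based
    if left == n or k == left:
        return x_left                  # k is an exact position
    x_right = sorted_data[left]        # value at position L + 1
    return x_left + (k - left) * (x_right - x_left)

example_448 = [3, 6, 7, 8, 8, 9, 10, 10, 11, 12, 19]   # data of Example 4.4.8
print([quantile(example_448, p, 4) for p in (1, 2, 3)])
```

For the data of Example 4.4.8 the three indices are whole numbers (3, 6, and 9), so no interpolation is needed; interpolation only comes into play for cases such as k = 44.8 or k = 34.16 above.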
Exercise 4.4.2
Measures of Relative Position
Name:___________________________ Score:_______
Course and Year:__________________ Date:________
Answer
1. The score of Maria is the 87th percentile. This means that
87% of the number of examinees scored lower or equal to
Maria’s.
10. Ken’s score is the 6th decile. Therefore, 300 students scored
at most as high as Ken’s.
B. Given the scores of 31 students in a Mathematics Examination, compute the
following quantiles.
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31
22 23 32 34 37 37 39 39 40 42 43 43 43 44 46 46 46 47 47 48 48 50 54 54 55 55 56 58 61 62 65
a. Q1 b. Q3
Interpretation Interpretation
c. D3 d. D9
Interpretation Interpretation
e. P71 f. P87
Interpretation Interpretation