
MMW Chapter 4

Chapter 4 focuses on data management, introducing essential statistical concepts applicable across various fields. It covers definitions, areas of statistics, data collection methods, and data presentation techniques, aiming to equip students with vital statistical skills. The chapter emphasizes the importance of understanding statistics for effective data analysis and interpretation.

Uploaded by

Rojhon Sawac
Copyright
© All Rights Reserved

Chapter 4

DATA MANAGEMENT

With the advent of modern technology, data from many fields of expertise are being collected and stored for further analysis, in the hope of revealing hidden information that is relevant to each field. Forecasting the weather, reporting on the spread of an epidemic, and testing whether a new treatment for a disease is effective all require some knowledge of data management, or Statistics. Statistics also serves as a tool for teachers to draw conclusions about the performance of their students, and for quality control experts to gauge the level of satisfaction of their customers with certain products. On a wider scale, economists use statistical data to describe the progress of a country in terms of various indicators.

Statistics allows one to generate valid and reliable results, especially in the
area of research and in big data. For example, statistical methodologies are used to
construct and evaluate models from large and complex data. Moreover,
understanding and communicating research findings on quantitative information
demand some degree of statistical skills.

This chapter introduces students to the most basic statistical concepts that
are almost always applicable to any field of discipline. Acquiring the most vital
statistical skills is the main goal for this chapter.
Lesson 4.1
INTRODUCTION TO STATISTICS

This section provides working definitions of terms that are essential to Statistics. Examples are provided to clarify the definitions.

Objectives
At the end of the lesson, students are expected to be able to
1. Define Statistics;
2. Differentiate the areas of Statistics;
3. Identify the area of Statistics applied in a particular scenario;
4. Differentiate a sample from a population, and statistic from a parameter;
5. Distinguish the types of variables;
6. Determine the level of measurement of a given variable.

Definition of Statistics

Statistics refers to the methodology for the collection, tabulation or presentation, analysis and interpretation of numerical or quantitative data.

Areas of Statistics

Descriptive Statistics is concerned with collecting, organizing and presenting data for purposes of describing and yielding meaningful results. It includes the use of graphs, tables, and different summarizing techniques.

Inferential Statistics involves the analysis of a subset of data for purposes of drawing conclusions about an entire data set.

Population vs Sample

Population refers to the totality of observations under consideration, while a sample is a subset of a population.

Usually, observing an entire population is either impossible or impractical. That is why a sample is selected. A good sample is one that is representative or reflective of the characteristics of the population.

Example 4.1.1

Think of the rectangular figure below as a population containing objects of different characteristics from which we want to draw a sample. These objects are probably people with different behaviors, or insects from different areas reacting to a specific pesticide, or a variety of plants exposed to different conditions leading to different growth performances. From among the three samples selected, sample 3 is the best. Sample 1 contains objects with the same characteristics while sample 2 contains too few. Neither of these two samples is representative of the population.

[Figure: a rectangle labeled Population, with three selected subsets labeled Sample 1, Sample 2 and Sample 3]

Parameter vs Statistic

A value obtained from a population is called a parameter, while one obtained from a sample is called a statistic.

Example 4.1.2

PARAMETER                          STATISTIC
• population mean                  • sample mean
• population standard deviation    • sample standard deviation
• population total                 • sample total
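
The distinction can be illustrated with a short Python sketch. The scores below are hypothetical (they do not come from this chapter), and the sample is fixed by hand rather than drawn at random so that the result is reproducible.

```python
from statistics import mean

# Hypothetical population: test scores of all 10 students in a small class.
population = [78, 85, 92, 67, 88, 73, 95, 81, 70, 91]

# Hypothetical sample: 4 of those 10 students, chosen by hand for this sketch.
sample = [85, 67, 95, 70]

parameter = mean(population)  # population mean, a parameter
statistic = mean(sample)      # sample mean, a statistic

print("Population mean (parameter):", parameter)
print("Sample mean (statistic):", statistic)
```

The two values generally differ, which is why a statistic is treated only as an estimate of the corresponding parameter.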

Variable vs Attribute

A variable is a characteristic of the objects being observed, such as people, animals, ideas or behaviors. A variable assumes a range of values that are either numerical or categorical.

An attribute is a specific value of a variable.

Types of Variable

A quantitative variable gives numerical values, either discrete or continuous. A quantitative variable that assumes integral or whole number values is discrete. Such values may be counted. Here, fractional values have no meaning. A variable that assumes any value, be it integral or fractional, is continuous. It includes values that are a result of measurements.

Qualitative variables are those whose values are categorical or non-numerical, regardless of the use of numerical coding to identify them. Moreover, arithmetic operations such as addition and multiplication are not applicable.
Example 4.1.3

TYPE OF VARIABLE          EXAMPLE                        ATTRIBUTES
Qualitative               a. Sex                         Male, Female
                          b. Province                    Mountain Province, Benguet
                          c. Performance Rate            Poor, Average, Excellent
Quantitative-Discrete     a. Number of Students          51, 43
                          b. Number of COVID-19 Cases    45 322, 471
                          c. Number of Guns              0, 5, 103
Quantitative-Continuous   a. Hair Length                 2 cm, 10.3 cm
                          b. Body Weight                 50 kg, 75.6 kg
                          c. Volume of a Tank            4.5 m³, 2.7 m³

Levels of Measurement

Any variable may be categorized in one of the four levels of measurement.

QUALITATIVE (CATEGORICAL)                 QUANTITATIVE (NUMERICAL)
Nominal              Ordinal              Interval                Ratio
cannot be ordered    can be ordered       zero is not absolute    zero is absolute

Nominal level applies to categorical variables whose values cannot be ordered or ranked. Examples are names and labels.

Ordinal level applies to categorical variables whose values can be ordered or ranked. Examples include students’ year level (freshman, sophomore, junior, senior) and pain level (none, mild, severe). Here, the relative differences between values have no meaning.

Interval level applies to numerical data in which a zero value is not absolute. Examples of interval data include scores in a test and temperature readings in °C. A zero score in a test does not mean that one has no knowledge about the subject matter. Likewise, a temperature reading of 0 °C does not indicate the absence of temperature.

Ratio level applies to numerical data in which a zero value is absolute. That
is, a zero value indicates the absence of the characteristics being measured.
Examples include distance, floor area and height. Zero values for such data mean
the characteristics being observed do not exist.
Exercise 4.1
Introduction to Statistics

Name:___________________________ Score:_______
Course and Year:__________________ Date:________

A. Identify which area of Statistics is illustrated. Write D if descriptive and I if inferential.
Area
1. The Philippine Statistics Authority reports on the mean performance of students during the National Achievement Test.

2. To investigate the psychological effects of the COVID-19 pandemic on students in Luzon, a sample was randomly selected and became the respondents of a research study.

3. The number of COVID-19 patients per region was tallied.

4. The Grade Point Averages (GPAs) of college students were computed.

5. To measure the people’s overall level of satisfaction with the performance of the incumbent President of the Philippines, a sample randomly selected from each region of the country was surveyed.

B. Identify the type of variable. Write QnD and QnC if the variable is quantitative-
discrete and quantitative-continuous, respectively; while Ql if it is qualitative.
After which, identify the level of measurement.
Variable Type Level
1. IQ
2. Ethnicity
3. Electric Bill
4. Military Rank
5. Blood Type
6. Number of Coin Flips
7. Detergent Brand
8. Cabbage Yield in kg
9. Number of Votes
10. Level in a Mobile Game
Lesson 4.2
DATA COLLECTION METHODS AND DATA SOURCES

Objectives
At the end of the lesson, students are expected to be able to
1. Differentiate the data collection methods;
2. Identify the most appropriate data collection method for a given scenario;
3. Differentiate sources of data;
4. Identify the source of data from a given scenario.

Data Collection Methods

Data can be gathered using different methods depending on the nature and
objective of a research. Here are some of the most popular methods.

Direct Method (Interview) is normally a face-to-face conversation with the respondents. During interviews, the interviewer directly records not only the interviewee’s responses to a set of questions but also their behaviors, such as their body language and tone.

Indirect Method (Questionnaire) is commonly used for survey studies. A standardized set of questions is prepared, printed and administered to the people of interest, or respondents.

Observation is the best method for documenting the behaviors of a subject in an uncontrolled environment (or a controlled one, as in the case of laboratories) or for examining a phenomenon as it is occurring.

Experimentation is conducted to explore the causal relationships between or among variables while controlling the effects of other factors in the environment. For example, one can conduct an experiment to identify which fertilizer (organic, commercial, mix) would result in the best petchay yield, assuming that the effect of the natural fertility of the soil is controlled.

Data Sources

Data are either collected from a primary or a secondary source.

Primary Sources are sources from which a researcher directly and originally gathers the data. Examples include data obtained from an interview with people, from questionnaires, or from observations in the community. The people involved are the primary data sources.

Secondary Sources are documentary data sources such as research journals, books and the internet. They also include agencies that collect and store data, such as the Philippine Statistics Authority, the Department of Agriculture and the Department of Health.
Exercise 4.2
Data Collection Methods and Data Sources

Name:___________________________ Score:_______
Course and Year:__________________ Date:________

A. Identify the data collection method applied in each scenario.

Method
1. To trace possible COVID-19 carriers, people were asked to
answer a standardized form asking about their age, travel
histories, symptoms being experienced, and people they
came in contact with.

2. A researcher invited people to share orally their lived experiences as single parents. The researcher used guide questions.

3. A researcher records the cultural practices of a certain tribe as these are happening.

4. Three groups of lab rats were injected with different doses of a certain drug. Physical performances were then compared.

5. The head of the Department of Mathematics entered classes at random to evaluate the faculty members.

B. Identify the data source illustrated in each scenario.

Source
1. The data collected in item A1.

2. A TV field reporter presents information on the status of COVID-19 in the country based on data gathered from the records of the Department of Health.

3. The data collected by the researcher in item A2.

4. The data collected by the head of the Department of Mathematics in item A5.

5. By conducting a systematic literature review, a researcher draws conclusions about the cultural practices of a certain tribe.
Lesson 4.3
DATA PRESENTATION TECHNIQUES

Objectives
At the end of the lesson, students are expected to be able to
1. Identify and apply the most appropriate data presentation technique for a
given scenario;
2. Construct a Frequency Distribution Table (FDT) for a given set of data;
3. Use the most appropriate graph for a given set of data;
4. Give contexts or situations appropriate for each technique of presentation.

People are visual creatures. Much of the information entering people’s brains comes through their visual senses. This is why presenting data visually is preferred. Tables and graphs are used to present data in a clear and concise manner. In such ways, trends and patterns can be easily spotted, which makes understanding and communicating research results easier.

Generally, data are presented using texts, tables, and graphs. The textual
technique is most appropriate when data are in themselves texts as in the case of
qualitative studies. It is also best suited when presenting small data sets or when it is
more important to emphasize points. Tables and graphs are most applicable for
purely quantitative data.

The Frequency Distribution Table (FDT)

A frequency distribution table is a way of presenting a large number of data in their simplest form by systematically grouping them into smaller and equal class intervals. The frequency of a class interval is the number of observations falling within it. The steps in preparing an FDT are as follows:

1. Compute the value of the Range (R), which is the difference between the highest and lowest values.
2. Approximate the number of class intervals, k, by taking the square root of the total number of observations, n. Thus, $k = \sqrt{n}$.
3. Obtain the value of the class width, c, by dividing R by k. Thus, $c = R / k$; round this off to the nearest odd number.
4. Construct the class intervals (CI).
▪ the first lower limit LL1 is the lowest value in the data set
▪ the first upper limit UL1 = LL1 + c − d.
The number d refers to the decimal unit of the raw data.
✓ d = 1, if data are whole numbers
✓ d = 0.1, if data have decimal values in the tenths place
✓ d = 0.01, if data have decimal values in the hundredths place
▪ the lower and upper limits of the succeeding CIs are obtained by adding the value of c to the lower and upper limits of the preceding CIs until reaching the CI that contains the highest value in the data set.

CI
Lower Limits             Upper Limits
LL1 = lowest value       UL1 = LL1 + c − d
LL2 = LL1 + c            UL2 = UL1 + c
LL3 = LL2 + c            UL3 = UL2 + c
LL4 = LL3 + c            UL4 = UL3 + c
⋮                        ⋮
LLk = LLk−1 + c          ULk = ULk−1 + c



5. Identify the frequency of each CI by counting the number of data points that are within each CI.
6. Compute for other information that is needed to construct the graphs related to the FDT.
a. A relative frequency (RF) is the proportion of the frequency count relative to the size of the data. This is expressed in percent form.
\[ RF_i = \left( \frac{f_i}{n} \right) (100)\% \]
b. A class mark (Xi) is the midpoint of a CI.
c. The lower and upper limits of a true class boundary (TCB) are obtained by subtracting ½ of the unit of measure (d) of the data points from the lower limit of the CI and adding it to the upper limit of the CI, respectively. That is, we subtract and add 0.5 if d = 1, 0.05 if d = 0.1, and 0.005 if d = 0.01.
d. The less-than cumulative frequency (<CF) of a CI is obtained by adding the frequencies of the preceding CIs to the frequency of the current CI.
e. The greater-than cumulative frequency (>CF) of a CI is obtained by adding the frequencies of the succeeding CIs to the frequency of the current CI; that is, by taking the cumulative sums of the frequencies starting from the bottom CI.
f. The rank of a frequency is its relative position with respect to the other frequencies. That is, the highest frequency is ranked first and the lowest is ranked last. Tied frequencies share the average of the positions they occupy.
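
Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration rather than a standard routine: the function names `nearest_odd` and `class_intervals` are made up for this sketch, the data are hypothetical whole numbers, and a tie between two odd numbers is resolved toward the smaller one, which is an assumption the steps above do not settle.

```python
import math

def nearest_odd(x):
    """Round x to the nearest odd integer (ties go to the smaller odd number)."""
    lower = math.floor(x)
    if lower % 2 == 0:
        lower -= 1           # largest odd integer not exceeding x
    upper = lower + 2        # next odd integer
    return lower if (x - lower) <= (upper - x) else upper

def class_intervals(data, d=1):
    """Return R, k, c and the list of (lower limit, upper limit) pairs."""
    r = max(data) - min(data)        # Step 1: range
    k = math.sqrt(len(data))         # Step 2: approximate number of classes
    c = nearest_odd(r / k)           # Step 3: class width
    intervals = []                   # Step 4: build the class intervals
    lower = min(data)
    while True:
        upper = lower + c - d
        intervals.append((lower, upper))
        if upper >= max(data):       # stop at the CI holding the highest value
            break
        lower += c
    return r, k, c, intervals

# Hypothetical data set of 16 whole numbers (d = 1).
data = [12, 15, 21, 22, 25, 27, 30, 31, 33, 36, 38, 40, 41, 44, 47, 50]
r, k, c, intervals = class_intervals(data)
print(r, k, c)
print(intervals)
```

Because k is only an approximation, the number of class intervals actually produced may differ slightly from the square root of n. For data with decimal values, repeated addition of floating-point numbers may introduce small rounding errors, so rounding the limits to the decimal unit would be advisable.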

Example 4.3.1

The following set of data represents the initial body weights (in grams) of 50
rats used in a study to determine whether a new vitamin is effective in gaining
weight. These 40-day old rats have normal body weights ranging from 100 to 130
grams. Construct the frequency distribution table for this data set. Compute for the
values in the following columns: class mark, true class boundary, “less than” and
“greater than” cumulative frequencies and rank.

140 135 137 126 150 119 126 120 118 125
115 127 95 100 100 101 103 142 113 115
129 110 126 106 105 87 126 119 125 130
108 118 119 117 115 133 102 90 110 104
82 105 132 143 95 124 113 96 139 140
1. R = 150 − 82 = 68.
2. $k = \sqrt{50} \approx 7.07$.
3. $c = 68 / \sqrt{50} = 9.62 \rightarrow 9$ (nearest odd number).
4. Since the raw data are whole numbers, d = 1.
LL1 = lowest value = 82 and UL1 = 82 + 9 − 1 = 90.

Lower Limit   Upper Limit   CI        f    RF (%)   Xi    TCB           <CF   >CF   Rank
82            90            82-90     3    6        86    81.5-90.5     3     50    6.5
82+9=91       90+9=99       91-99     3    6        95    90.5-99.5     6     47    6.5
91+9=100      99+9=108      100-108   10   20       104   99.5-108.5    16    44    2
100+9=109     108+9=117     109-117   8    16       113   108.5-117.5   24    34    3
109+9=118     117+9=126     118-126   13   26       122   117.5-126.5   37    26    1
118+9=127     126+9=135     127-135   6    12       131   126.5-135.5   43    13    4.5
127+9=136     135+9=144     136-144   6    12       140   135.5-144.5   49    7     4.5
136+9=145     144+9=153     145-153   1    2        149   144.5-153.5   50    1     8
Total                                 50   100

Based on the table, it is observed that more body weights belong to the
normal weight range (middle class intervals) while few have extreme weights.
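
The columns of the FDT in Example 4.3.1 can be cross-checked with a short Python sketch. It takes the class width c = 9 and d = 1 from the worked solution as given; the helper `average_ranks` is a name made up for this sketch, and it assigns tied frequencies the average of the positions they occupy, which is how the ranks 6.5 and 4.5 in the table arise.

```python
# Body weights (grams) of the 50 rats in Example 4.3.1.
weights = [
    140, 135, 137, 126, 150, 119, 126, 120, 118, 125,
    115, 127, 95, 100, 100, 101, 103, 142, 113, 115,
    129, 110, 126, 106, 105, 87, 126, 119, 125, 130,
    108, 118, 119, 117, 115, 133, 102, 90, 110, 104,
    82, 105, 132, 143, 95, 124, 113, 96, 139, 140,
]

c, d = 9, 1            # class width and decimal unit from the worked solution
n = len(weights)

# Lower limits start at the lowest value and advance by c.
lower_limits = list(range(min(weights), max(weights) + 1, c))
intervals = [(ll, ll + c - d) for ll in lower_limits]

freqs = [sum(ll <= w <= ul for w in weights) for ll, ul in intervals]
rel_freqs = [100 * f / n for f in freqs]                        # RF in percent
class_marks = [(ll + ul) / 2 for ll, ul in intervals]           # midpoints
boundaries = [(ll - d / 2, ul + d / 2) for ll, ul in intervals] # TCB
less_cf = [sum(freqs[: i + 1]) for i in range(len(freqs))]      # from the top
greater_cf = [sum(freqs[i:]) for i in range(len(freqs))]        # from the bottom

def average_ranks(values):
    """Rank values from highest to lowest; ties share the average position."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1     # first position occupied by v
        count = ordered.count(v)         # how many values are tied with v
        ranks.append(first + (count - 1) / 2)
    return ranks

ranks = average_ranks(freqs)

for row in zip(intervals, freqs, rel_freqs, class_marks,
               boundaries, less_cf, greater_cf, ranks):
    print(row)
```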

The Graphical Methods

Bar Graphs are most appropriate to use when comparing groups or categories. This is done by constructing rectangular blocks with height (or length for horizontal bar graphs) corresponding to the frequency or magnitude of the category being represented. Example 4.3.2 shows the number of BSU enrollees in the school year 2020-2021. As shown, most students are freshmen while the least in number are senior students.

Example 4.3.2

[Bar graph: BSU Number of Enrollees in the School Year 2020-2021. Vertical axis: Frequency (0 to 2000); horizontal axis: Year Level (Freshmen, Sophomore, Junior, Senior)]

Line Graphs are used to show changes and trends over time. The vertical axis of a line graph is any measure while the horizontal axis is time. A value is plotted with respect to a time period, and then joined with the neighboring points by straight lines. Example 4.3.3 shows a year-long average temperature reading in Baguio City. As shown, Baguio City experiences its coldest temperatures between November and February, and its hottest temperatures between May and June. After that, the temperature cycle starts all over again.

Example 4.3.3

[Line graph: Average Monthly Temperature in Baguio City. Vertical axis: degrees Celsius; horizontal axis: months, Jan to Dec]

Pie Charts, like bar graphs, compare mutually exclusive categories which are parts of a whole. Consequently, pie charts are best used when presenting allocations of quantities such as budget and consumption. A pie chart is constructed by dividing a circle into slices whose sizes are directly proportional to the value of the categories they represent. The data in example 4.3.4 show where average Filipino college students spend their daily allowances. It shows that much of it is spent on food (lunch and snack).

Example 4.3.4

[Pie chart: Allocation of Students' Daily Allowance (on average, a Filipino college student's daily allowance is Php 100). Slices: Lunch, 40; Snack, 25; Fare, 20; Photocopies, 10; Others, 5]
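
Since the slices are directly proportional to the values they represent, the central angle of each slice is the category's share of the total multiplied by 360 degrees. The following Python sketch applies this to the amounts shown in example 4.3.4; the variable names are chosen for illustration only.

```python
# Daily allowance allocation (in Php) from Example 4.3.4.
allowance = {"Lunch": 40, "Snack": 25, "Fare": 20, "Photocopies": 10, "Others": 5}

total = sum(allowance.values())

# Central angle of each slice: (value / total) * 360 degrees.
angles = {item: 360 * amount / total for item, amount in allowance.items()}

for item, angle in angles.items():
    share = 100 * allowance[item] / total
    print(f"{item}: {share:.0f}% of the allowance, {angle:.0f} degrees")
```

For instance, lunch takes 40 of the 100 pesos, so its slice spans 144 of the 360 degrees.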

Pictographs use symbols or pictures to represent data. Each symbol represents a certain unit value. Pictographs may be used to handle large sets of data.

Example 4.3.5 displays the number of COVID-19 cases by country. Clearly, the USA has the highest number of cases, followed by Italy and Spain. Pictographs are difficult to construct for partial data because accurate proportional allocation of parts cannot be ascertained without knowledge of the whole set of data.

Statistical Maps are some of the best techniques when matching data values
to geographical locations. One can use colors, symbols, pictures, and numbers to
show the differences in values of each area on the map. Example 4.3.6 shows the
carbon dioxide (CO2) emission of each country in 2017. The darker colors indicate a
higher CO2 emission. Obviously, Canada, USA, Australia, Kazakhstan, Saudi
Arabia, Kuwait, Qatar and Bahrain are the countries that emitted the most CO2.
Example 4.3.5

[Pictograph: Number of COVID-19 Cases by Country. Each symbol equates to 10 000 cases. Countries shown: UK, Iran, France, Germ…, China, Spain, Italy, USA]

Example 4.3.6

[Statistical map: 2017 Average Carbon Dioxide Emissions (tons) per Capita]

Scatterplots are the easiest and quickest technique to show whether two
variables are related. The three scatterplots below show that variable X is positively
and negatively related to Variables Y and Z, respectively; however, it is not related to
variable W.

Example 4.3.7

[Three scatterplots, each with Variable X on the horizontal axis (0 to 20) and the other variable on the vertical axis (0 to 20): Positive Relationship (Variable Y), Negative Relationship (Variable Z), No Relationship (Variable W)]

Histograms are bar graphs based on an FDT. The class intervals, marked by their true class boundaries, serve as the categories and are plotted on the horizontal axis. Histograms show the shape and the spread of a distribution. The histogram in example 4.3.8 shows that the majority of the rats have body weights that are within the normal range, as indicated by the taller bars. Rats with extreme body weights are least in number, as represented by the bars at the opposite ends.

Ogives are used to estimate the number of observations that are less than or greater than a particular value. They are constructed by plotting the cumulative frequencies against the true class boundaries. Specifically, the less-than cumulative frequencies are plotted against the upper limits of each TCB, while the greater-than cumulative frequencies are plotted against the lower limits. In example 4.3.9, the third point of the less-than ogive means that there are six rats whose body weights are less than or equal to 99.5 grams. Similarly, the fourth point of the greater-than ogive means that 34 rats have body weights greater than or equal to 108.5 grams.
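
The points of the two ogives can be listed with a short Python sketch that uses the true class boundaries and frequencies of the FDT in Example 4.3.1. Starting the less-than ogive at the first boundary with a cumulative frequency of 0, and ending the greater-than ogive at the last boundary with 0, is an assumption of this sketch; it agrees with the point counts quoted for example 4.3.9.

```python
from itertools import accumulate

# True class boundaries and frequencies from the FDT in Example 4.3.1.
boundaries = [81.5, 90.5, 99.5, 108.5, 117.5, 126.5, 135.5, 144.5, 153.5]
freqs = [3, 3, 10, 8, 13, 6, 6, 1]

# Less-than ogive: cumulative frequencies against the upper boundaries.
less_cf = list(accumulate(freqs))
less_than_points = [(boundaries[0], 0)] + list(zip(boundaries[1:], less_cf))

# Greater-than ogive: cumulative frequencies against the lower boundaries.
greater_cf = [sum(freqs[i:]) for i in range(len(freqs))]
greater_than_points = list(zip(boundaries[:-1], greater_cf)) + [(boundaries[-1], 0)]

print("Third point of the less-than ogive:", less_than_points[2])
print("Fourth point of the greater-than ogive:", greater_than_points[3])
```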

Example 4.3.8

[Histogram: Initial Body Weight of 50 Mice (refer to data in e.g. 4.3.1). Vertical axis: frequency; horizontal axis: True Class Boundaries (81.5 to 153.5)]

Example 4.3.9

[Ogives: Initial Body Weight of 50 Mice (refer to FDT in e.g. 4.3.1). Vertical axis: Cumulative Frequency; horizontal axis: True Class Boundaries (81.5 to 153.5); legend: less-than ogive, greater-than ogive]



Exercise 4.3
Data Presentation Techniques

Name:___________________________ Score:_______
Course and Year:__________________ Date:________

A. Identify the most appropriate graph/chart to be used for each scenario.


Graph/Chart
1. In 2018, the total volume of crop production in CAR is 235,420.0
metric tons. Here is the distribution per province:
Abra: 14 820.8 (6%) Ifugao: 13 772.4 (6%)
Apayao: 11 815.0 (5%) Kalinga: 30 986.2 (13%)
Benguet: 153 313.1 (65%) Mountain Province: 10 712.6 (5%)

2. The following are the numbers of cancer patients in CAR since 1995:
1995: 50 2005: 125 2015: 150
2000: 75 2010: 100 2020: 175

3. Population density refers to the number of people per unit area. The
following shows the population density per square kilometer of some
selected regions of Luzon.
CAR:87 II (Cagayan Valley):116 IV-A(CALABARZON):870
I (Ilocos):388 III (Central Luzon):512 IV-B (MIMAROPA):100

4. The following shows the passing rate of four universities during a recently concluded board exam.
Benguet State University: 78% Saint Louis University: 82%
University of Baguio: 64% University of the Cordilleras: 70%

5. A researcher aims to explore the relationship between Mathematics Proficiency (X) and Problem-Solving Performance (Y) of students.
Here are their scores.
X: 80 95 88 79 85 84 80 81 92 88 82 89 75 77
Y: 79 96 87 76 89 85 83 82 91 91 87 93 73 79

B. The following are the scores of 30 students in a Mathematics Examination. Construct a frequency distribution table and then compute for other information, namely the class mark, true class boundaries, “less than” and “greater than” cumulative frequency, and rank. Describe the performance of the students using the FDT.
41 37 43 40 54 57
23 55 22 61 55 46
34 32 46 48 46 43
48 37 47 54 42 59
44 47 43 50 62 40
R = ________ k = ________ c = ________
CI    f    RF (%)    Xi    TCB    <CF    >CF    Rank

Total

Brief Description of the performance based on the FDT:

C. Choose two items in A and construct the graph/chart. Briefly discuss the graphs.
Lesson 4.4
DESCRIBING DATA SETS

This section introduces ways of describing a data set through measures of central tendency, dispersion, skewness, and location.

Objectives
At the end of the lesson, students are expected to be able to
1. Identify and compute the appropriate measure of central tendency for a
given data set;
2. Identify and compute the appropriate measure of dispersion for a given
data set;
3. Describe the skewness of a data set;
4. Locate the different quantiles relative to a given data set.

Measures of Central Tendency

These measures provide a single value that is somehow reflective of the entire data set. It is a value around which data points tend to cluster.

Mean. Otherwise called the average, this is given by

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]

where $\bar{x}$: sample mean, $\mu$: population mean, $n$: sample size, $N$: population size.

Median. The median is the middlemost value in an ordered data set. This is given by

\[ Md = x_{(n+1)/2} \ \text{ if } n \text{ is odd} \qquad Md = \frac{x_{n/2} + x_{(n/2)+1}}{2} \ \text{ if } n \text{ is even} \]

Mode. Denoted by Mo, this is the most frequent data point/s in a set of data. A data set may be unimodal, bimodal or multimodal, or may have no mode at all.
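
The three measures can be sketched in Python as follows. The data are hypothetical, and the function names are made up for this sketch. Treating a data set in which every value occurs equally often as having no mode is a common convention, but it is an assumption here, since the text does not spell it out.

```python
from collections import Counter

def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

def median(values):
    """Middlemost value of the ordered data, following the odd/even formulas."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]          # x_{(n+1)/2}, 0-based index
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def modes(values):
    """Most frequent value/s; an empty list means no mode (assumed convention)."""
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

# Hypothetical data set with an even number of observations.
data = [2, 4, 4, 5, 7, 8]
print(mean(data), median(data), modes(data))
```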

Selecting the Appropriate Measure of Central Tendency

The decision as to which measure of central tendency is appropriate to use to describe a set of data depends on the nature of the data at hand. For example, the mean is most appropriate for numerical data (interval or ratio), while the median or mode is appropriate for ordinal data. However, if the data are purely categorical and cannot be ranked, as in the case of nominal data, then only the mode is the appropriate measure.

Properties of the Different Measures of Central Tendency

The mean is the only measure in which further computations can be carried
out. However, unlike the median, it is sensitive to extreme values. On the other
hand, the mode becomes unstable in cases of data changes due to changing
method of rounding off decimals. It is problematic in providing a single measure of
central tendency when its value is not unique.

Example 4.4.1

The following are the scores of 12 randomly selected students from a 40-point
test: 35,30,38,28,20,37,18,26,32,36,39,21. Scores that are within the range of “20-
35” are considered “average.” Scores that are below this range are “poor,” while
those above it are “outstanding.” Identify the most appropriate measure of central
tendency and then compute its value.

Solution:

Since the set of data originates from a sample, and the values are numerical,
the appropriate measure is the sample mean.

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{35 + 30 + 38 + \cdots + 39 + 21}{12} = \frac{360}{12} = 30 \]
The mean value of 30 suggests that students are “average” in terms of the
skills measured by the test.

Example 4.4.2

The level of reading comprehension of a student is categorized into three: Poor (P), Satisfactory (S), and Outstanding (O). Given the reading comprehension levels of nine students below, identify the most appropriate measure of central tendency and then compute for its value.
P, S, O, S, P, S, O, S, S
Solution:

The variable “level of reading comprehension” is categorical and can be ranked (ordinal). Therefore, the median is appropriate.

Arranging the data from the lowest to the highest level, we have P, P, S, S, S, S, S, O, O. It follows that the median is the fifth observation, which is “satisfactory.” This means that, overall, students have a “satisfactory” level of reading comprehension.

Example 4.4.3

The following shows the senior high school strand taken by 16 students who
are entering a certain university. Strand includes STEM (Science, Technology,
Engineering and Mathematics), GA (General Academic), ABM (Accountancy,
Business and Management), and HUMMS (Humanities and Social Sciences).
Identify the most appropriate measure of central tendency and then compute for it.
GA, STEM, GA, GA, STEM, GA, STEM, ABM,
GA, ABM, GA, HUMMS, HUMMS, STEM, ABM, GA
Solution:

Since the variable “strand” is nominal, the mode is the appropriate measure of central tendency. The mode is “GA” since it is the most frequent strand among the sixteen observations. This means that, of the sixteen entering students, more graduated from the GA strand than from any other strand.
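
Examples 4.4.2 and 4.4.3 can be checked with a short Python sketch. For the ordinal data, the levels are sorted according to their rank (P, then S, then O) before the middle observation is taken; for the nominal data, the strands are simply counted. The variable names are chosen for illustration only.

```python
from collections import Counter

# Example 4.4.2: ordinal data, ranked from lowest to highest level.
level_order = ["P", "S", "O"]
levels = ["P", "S", "O", "S", "P", "S", "O", "S", "S"]

ordered = sorted(levels, key=level_order.index)
n = len(ordered)                          # n = 9, which is odd
median_level = ordered[(n + 1) // 2 - 1]  # fifth observation, 0-based index 4

# Example 4.4.3: nominal data, so only the mode applies.
strands = [
    "GA", "STEM", "GA", "GA", "STEM", "GA", "STEM", "ABM",
    "GA", "ABM", "GA", "HUMMS", "HUMMS", "STEM", "ABM", "GA",
]
mode_strand, mode_count = Counter(strands).most_common(1)[0]

print("Median level:", median_level)
print("Modal strand:", mode_strand, "with", mode_count, "students")
```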

Example 4.4.4

To assess whether stunting among 9-to-12-year-old children is prevalent, a researcher initially collected the height (cm) of randomly selected children as presented below. The standard normal height for boys of this age ranges from 122.69 cm to 135.90 cm, while that for girls ranges from 122.82 cm to 139.58 cm. For each group of children, identify the most appropriate measure of central tendency and then compute for its value.

BOYS (X)
116 119 120 121 122 130 118 120
122 123 117 120 115 116 122 115
119 118 118 117 119 120 129 116
118 123 119 118

GIRLS (Y)
133 132 130 126 125 125 125 126
122 120 123 114 127 126 131 132
115 129 118 129 130 131 129 126
126 127 128 127 128 130

Solution:

The data values are numerical so the appropriate measure of central tendency is the mean.

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum_{i=1}^{28} x_i}{28} = \frac{116 + 119 + 120 + \cdots + 119 + 118}{28} = \frac{3350}{28} = 119.64 \text{ cm} \]

\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{\sum_{i=1}^{30} y_i}{30} = \frac{133 + 132 + 130 + \cdots + 128 + 130}{30} = \frac{3790}{30} = 126.33 \text{ cm} \]

The average height of the boys is 119.64 cm, which falls below the normal range. In contrast, the average height of their female counterparts is within the normal range. Based on these results, it appears that only the boys exhibit stunting.
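
The two means in Example 4.4.4, and their comparison with the normal ranges, can be verified with the Python sketch below. The split of the data into 28 boys and 30 girls follows the sample sizes and totals used in the solution; the function name `describe` and the dictionary `NORMAL_RANGE` are made up for this sketch.

```python
boys = [
    116, 119, 120, 121, 122, 130, 118, 120,
    122, 123, 117, 120, 115, 116, 122, 115,
    119, 118, 118, 117, 119, 120, 129, 116,
    118, 123, 119, 118,
]
girls = [
    133, 132, 130, 126, 125, 125, 125, 126,
    122, 120, 123, 114, 127, 126, 131, 132,
    115, 129, 118, 129, 130, 131, 129, 126,
    126, 127, 128, 127, 128, 130,
]

# Normal height ranges (cm) quoted in Example 4.4.4.
NORMAL_RANGE = {"boys": (122.69, 135.90), "girls": (122.82, 139.58)}

def describe(mean_height, low, high):
    """Say where a mean height falls relative to a normal range."""
    if mean_height < low:
        return "below the normal range"
    if mean_height > high:
        return "above the normal range"
    return "within the normal range"

boys_mean = sum(boys) / len(boys)
girls_mean = sum(girls) / len(girls)

print(f"Boys: {boys_mean:.2f} cm,", describe(boys_mean, *NORMAL_RANGE["boys"]))
print(f"Girls: {girls_mean:.2f} cm,", describe(girls_mean, *NORMAL_RANGE["girls"]))
```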

Measures of Dispersion

While a measure of central tendency gives a value around which a set of data
tends to cluster or fluctuate, a measure of dispersion gives a value that tells whether
the set of data is compact or dispersed. That is, it tells whether the data are
relatively closer to each other, or are relatively farther apart from each other. Thus
two sets of data may have the same central values yet they are different if they have
different values of dispersion; that is, if one is more compact than the other. So the
measure of dispersion supplements the information given by a central value.

To illustrate, consider the following scores of two groups of students who took
the same examination:
Group Scores Mean
A 78 79 80 81 82 80
B 70 72 75 88 95 80

The mean scores of the two groups of students are equal. However, the
scores of students from group A are relatively closer to the mean than those from
group B. For example, the extreme scores from group A of “78” and “82” are closer
to the mean value of 80 than the extreme scores from group B of “70” and “95”. That
is why even if the mean scores of the two groups of students are the same, the two
sets of scores are different because one is more dispersed than the other.

Range (R) is a rough estimate of the spread of a set of data. It is obtained by getting the difference between the highest and the lowest value in a set of data. It is a quick way of knowing which set of data is more spread than the others.
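
As a quick illustration, the Python sketch below computes the mean and the range of the scores of groups A and B given earlier. The function name `data_range` is made up for this sketch.

```python
group_a = [78, 79, 80, 81, 82]
group_b = [70, 72, 75, 88, 95]

def data_range(values):
    """Difference between the highest and the lowest value."""
    return max(values) - min(values)

for name, scores in (("A", group_a), ("B", group_b)):
    mean_score = sum(scores) / len(scores)
    print(f"Group {name}: mean = {mean_score}, range = {data_range(scores)}")
```

Both groups have a mean of 80, but the range of group B (25) is much larger than that of group A (4), which confirms that group B is the more dispersed of the two.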

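Variance is the average of the squared deviations of each data point from the
mean. This is given by

$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - \dfrac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N} \qquad \text{and} \qquad s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$

where $\sigma^2$ is called the population variance while $s^2$ is the sample variance.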

Standard Deviation is simply the square root of the variance and is denoted
by $\sigma$ and $s$ for population and sample, respectively. The unit of $\sigma$ or $s$ is the same
as the unit of the set of data. Also, these values are non-negative.
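
The two variance formulas translate almost directly into code. The sketch below is one possible way to write them; the function names `population_variance` and `sample_variance` are assumptions made for this illustration, and the data are the scores of groups A and B from the earlier table.

```python
import math

def population_variance(data):
    """Sigma squared: (sum of x^2 - (sum of x)^2 / N) / N."""
    big_n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return (sum_x2 - sum_x ** 2 / big_n) / big_n

def sample_variance(data):
    """s squared: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

group_a = [78, 79, 80, 81, 82]
group_b = [70, 72, 75, 88, 95]

for name, scores in (("A", group_a), ("B", group_b)):
    s2 = sample_variance(scores)
    s = math.sqrt(s2)  # the standard deviation is the square root of the variance
    print(f"Group {name}: s^2 = {s2:.2f}, s = {s:.2f}")
```

The only difference between the two functions is the final divisor, N for a population and n - 1 for a sample.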

Example 4.4.5

Compute the variance and standard deviation of the data provided in
example 4.4.1.
Data: 35, 30, 38, 28, 20, 37, 18, 26, 32, 36, 39, 21

Solution:

$$\sum_{i=1}^{n} x_i = 35 + 30 + \cdots + 39 + 21 = 360; \qquad \sum_{i=1}^{n} x_i^2 = 35^2 + 30^2 + \cdots + 39^2 + 21^2 = 11\,404; \qquad n = 12$$

$$s^2 = \frac{\sum_{i=1}^{12} x_i^2 - \dfrac{\left(\sum_{i=1}^{12} x_i\right)^2}{n}}{n - 1} = \frac{11\,404 - \dfrac{360^2}{12}}{12 - 1} = 54.91$$

$$s = \sqrt{s^2} = \sqrt{54.91} = 7.41$$

Example 4.4.6

Given the heights of the children in example 4.4.4, compute the standard
deviation of each group.
BOYS (X)                            GIRLS (Y)
116 119 120 121 122 130 118 120     133 132 130 126 125 125 125 126
122 123 117 120 115 116 122 115     122 120 123 114 127 126 131 132
119 118 118 117 119 120 129 116     115 129 118 129 130 131 129 126
118 123 119 118                     126 127 128 127 128 130

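Solution:

$$\sum_{i=1}^{28} x_i = 116 + 119 + \cdots + 119 + 118 = 3350$$

$$\sum_{i=1}^{28} x_i^2 = 116^2 + 119^2 + \cdots + 119^2 + 118^2 = 401\,152$$

$$n_x = 28$$

$$s_x = \sqrt{s_x^2} = \sqrt{\frac{\sum_{i=1}^{28} x_i^2 - \dfrac{\left(\sum_{i=1}^{28} x_i\right)^2}{n_x}}{n_x - 1}} = \sqrt{\frac{401\,152 - \dfrac{3350^2}{28}}{28 - 1}} = 3.59$$

$$\sum_{i=1}^{30} y_i = 133 + 132 + \cdots + 128 + 130 = 3790$$

$$\sum_{i=1}^{30} y_i^2 = 133^2 + 132^2 + \cdots + 128^2 + 130^2 = 479\,450$$

$$n_y = 30$$

$$s_y = \sqrt{s_y^2} = \sqrt{\frac{\sum_{i=1}^{30} y_i^2 - \dfrac{\left(\sum_{i=1}^{30} y_i\right)^2}{n_y}}{n_y - 1}} = \sqrt{\frac{479\,450 - \dfrac{3790^2}{30}}{30 - 1}} = 4.72$$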

The standard deviation of the girls' heights is higher than that of the
boys' heights, which means that the heights of the girls are more dispersed than
those of the boys.
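
The two sample standard deviations can also be verified with the `stdev` function of Python's standard `statistics` module, which divides by n - 1 as in the sample formula. This is a sketch for checking the hand computation, not part of the original solution; the list names are illustrative.

```python
from statistics import stdev

# Heights in cm from the table of this example.
boys = [116, 119, 120, 121, 122, 130, 118, 120,
        122, 123, 117, 120, 115, 116, 122, 115,
        119, 118, 118, 117, 119, 120, 129, 116,
        118, 123, 119, 118]
girls = [133, 132, 130, 126, 125, 125, 125, 126,
         122, 120, 123, 114, 127, 126, 131, 132,
         115, 129, 118, 129, 130, 131, 129, 126,
         126, 127, 128, 127, 128, 130]

s_boys = stdev(boys)    # sample standard deviation of the boys' heights
s_girls = stdev(girls)  # sample standard deviation of the girls' heights

print(f"s_x = {s_boys:.2f} cm, s_y = {s_girls:.2f} cm")
```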

Measure of Skewness

Skewness tells about the symmetry (or lack thereof) of a distribution about its
mean. Using the symmetric (normal) distribution as a baseline, skewness measures
the degree of distortion (long tails) of a distribution. A positively skewed distribution
has its long tail at the right which indicates that the mean is higher than the median.
Conversely, a negatively skewed distribution has its long tail at the left which
indicates that the mean is lower than the median. For a perfectly symmetric
distribution (or normal distribution), the mean and the median coincide.

[Figure: three frequency curves plotted against scores. Positively Skewed Distribution (median to the left of the mean), Symmetric Distribution (mean and median coincide), Negatively Skewed Distribution (mean to the left of the median).]

Consider a hypothetical distribution of the numerical scores of students in an
examination. A symmetric distribution indicates that the majority of the scores are middle
scores, called “average” scores. These scores are found about the mean score
where the peak of the graph is located. A negatively skewed distribution indicates
that the majority of the scores are found above the mean. A positively skewed
distribution indicates that more scores are below the mean. In this particular context,
a teacher may aim for a negatively skewed score distribution.

The Pearsonian Coefficient of Skewness (SK)

Without preparing a histogram or other visual representations of a data set,
computing SK is a quick and easy technique to describe the direction and degree
of skewness of a data set. For samples and populations respectively, SK is given by

$$SK = \frac{3(\bar{x} - Md)}{s} \qquad \text{and} \qquad SK = \frac{3(\mu - Md)}{\sigma}.$$

Interpreting SK

A positive value of SK indicates that the distribution is positively skewed,
while a negative value indicates a negatively skewed distribution. Moreover, the
closer the value of SK gets to zero, the more symmetric a distribution becomes.

Example 4.4.7

Compute the SK of the data given in example 4.4.4.

Solution:
We first compute the mean, the median, and the standard deviation of the two
sets of data. Recall that the mean heights of the two groups (as computed in example
4.4.3) are $\bar{x} = 119.64$ cm and $\bar{y} = 126.33$ cm.

The median value of each group is identified by arranging the data.

Ordered Data
BOYS (X)                            GIRLS (Y)
115 115 116 116 116 117 117 118     114 115 118 120 122 123 125 125
118 118 118 118 119 119 119 119     125 126 126 126 126 126 127 127
120 120 120 120 121 122 122 122     127 128 128 129 129 129 130 130
123 123 129 130                     130 131 131 132 132 133

Since there are 28 observations in the boys' group,

$$Md = \frac{x_{n/2} + x_{(n/2)+1}}{2} = \frac{x_{14} + x_{15}}{2} = \frac{119 + 119}{2} = 119,$$

where $x_{14}$ and $x_{15}$ refer to the 14th and 15th observations in the ordered data set.

Since there are 30 observations in the girls' group,

$$Md = \frac{y_{n/2} + y_{(n/2)+1}}{2} = \frac{y_{15} + y_{16}}{2} = \frac{127 + 127}{2} = 127,$$

where $y_{15}$ and $y_{16}$ refer to the 15th and 16th observations in the ordered data set.

The standard deviations of the two groups (as computed in example 4.4.6) are
$s_x = 3.59$ and $s_y = 4.72$.

So that

$$SK_x = \frac{3(\bar{x} - Md)}{s_x} = \frac{3(119.64 - 119)}{3.59} = 0.53 \quad \text{and}$$

$$SK_y = \frac{3(\bar{y} - Md)}{s_y} = \frac{3(126.33 - 127)}{4.72} = -0.43.$$
These coefficients of skewness reveal that the height distribution of the boys
is positively skewed, while that of the girls is negatively skewed. Consistent with
the interpretation given earlier, more boys have heights below the mean and
more girls have heights above the mean.
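
The sample formula for SK can be sketched in Python as follows. The function name `pearson_skewness` is an assumption made for this illustration; `mean`, `median`, and `stdev` come from the standard `statistics` module. Because the code carries full precision instead of the rounded mean and standard deviation used in the hand computation, its results may differ from those above in the second decimal place.

```python
from statistics import mean, median, stdev

def pearson_skewness(sample):
    """Pearsonian coefficient of skewness for a sample: 3(mean - median) / s."""
    return 3 * (mean(sample) - median(sample)) / stdev(sample)

# Heights in cm of the two groups.
boys = [116, 119, 120, 121, 122, 130, 118, 120,
        122, 123, 117, 120, 115, 116, 122, 115,
        119, 118, 118, 117, 119, 120, 129, 116,
        118, 123, 119, 118]
girls = [133, 132, 130, 126, 125, 125, 125, 126,
         122, 120, 123, 114, 127, 126, 131, 132,
         115, 129, 118, 129, 130, 131, 129, 126,
         126, 127, 128, 127, 128, 130]

sk_boys = pearson_skewness(boys)
sk_girls = pearson_skewness(girls)

# A positive SK means positively skewed; a negative SK means negatively skewed.
print(f"SK boys: {sk_boys:.2f}, SK girls: {sk_girls:.2f}")
```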
Exercise 4.4.1
Measures of Central Tendency, Dispersion, and Skewness

Name:___________________________ Score:_______
Course and Year:__________________ Date:________

A. Identify and compute the most appropriate measure of central tendency for
each case.
1. The scores of 10 randomly selected students in a 50-point Statistics
Examination are as follows 49, 35, 38, 25, 34, 21, 47, 40, 35, 10.

2. Fifteen randomly selected customers were asked which brand (brand A, B, C) of
Android phone they prefer. Here are their responses:
A, C, B, C, B, C, B, A, B, B, C, A, B, C, A.

3. The following are the year levels of students that are scholars of the Department
of Science and Technology who enrolled at BSU this year:
I, II, I, III, III, IV, II, IV, I, I.

4. Below are the medals won by team Philippines during the 2018 Asian Games.
Gold Silver Bronze
4 2 15

5. The data below are the general weighted averages (GWA) of 15 randomly
selected BSU varsity and non-varsity students for the second semester of SY 2019-2020.
Compute separately, then compare the two groups based on these
values. Remember that in BSU, a lower numerical grade indicates a better
performance.
Varsity Players 1.75 1.42 2.45 1.81 2.45 1.51 1.61 1.43
Non-varsity Players 3.01 1.18 2.89 2.15 1.97 2.85 2.56
Varsity Players

Non-Varsity Players

Brief Comparison
B. Compute the standard deviations for the GWA of the two groups of students
given in item 5 of A, and then briefly compare in terms of these values.
Varsity Players Non-Varsity Players

Brief Comparison

C. Compute the coefficient of skewness for the GWA of the two groups of
students given in item 5 of A, and then briefly compare the groups in terms
of these values.
Varsity Players Non-Varsity Players

Brief Comparison
Measures of Relative Position (Quantiles)

Quantiles are values that divide an ordered distribution into parts such that
a given proportion of the observations is equal to or below each value. These
measures are applicable when we want to identify the “position” or “standing” of a
data point relative to the entire data set. The most used quantiles are quartiles,
deciles and percentiles.

Quartiles divide an ordered data set into four equal parts (quarters) so that
25% of the data is less than or equal to the first quartile ($Q_1$). Similarly, 50% and 75%
of the entire data set is less than or equal to the second ($Q_2$) and third quartile ($Q_3$),
respectively.

Example 4.4.8

Consider an ordered distribution consisting of eleven observations ($x_1$ to $x_{11}$).
The first quartile is 7, since about 25% of the data set (3 out of 11 observations) is less than or
equal to it. For the same reason, the second and the third quartiles are 9 and 11,
respectively.

x1   x2   x3   x4   x5   x6   x7   x8   x9   x10  x11
3    6    7    8    8    9    10   10   11   12   19
          Q1             Q2             Q3

Q1, Q2, and Q3 mark the upper ends of the lower 25%, lower 50%, and lower 75% of the data, respectively.

Deciles divide the distribution into 10 equal parts. The values separating each
part are called the first decile ($D_1$), the second decile ($D_2$), …, and the ninth decile ($D_9$).

Example 4.4.9

Consider an ordered distribution of 29 observations ($x_1$ to $x_{29}$). The first decile
($D_1$) is 36. This is the data point that separates the lower 10% from the upper 90%.
Likewise, $D_7 = 75$ means that 70% of the 29 observations are less than or equal to 75.
The other deciles are interpreted in the same manner.

x1  x2  x3  x4  x5  x6  x7  x8  x9  x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29
21  25  36  37  37  40  41  41  41  45  46  47  49  55  55  59  60  65  66  70  75  76  76  76  80  81  89  90  95
        D1          D2          D3          D4          D5          D6          D7          D8          D9

D1 marks the upper end of the lower 10% of the data, D2 the lower 20%, and so on up to D9, which marks the lower 90%.
Percentiles divide an ordered distribution into 100 equal parts. The
values that separate each part from the rest are called percentiles, namely
$P_1, P_2, P_3, \ldots, P_{99}$.

The Median is also a quantile that divides a distribution into two equal parts,
the upper and lower 50%. The median value is the same as $Q_2$, $D_5$, and $P_{50}$.
Other Equal Quantiles

$Q_1$ and $P_{25}$ are the same value since both separate the lower 25%
of an ordered data set from the rest. Here are other pairs of equal quantiles:
$(Q_2, P_{50})$, $(Q_3, P_{75})$, $(D_1, P_{10})$, $(D_2, P_{20})$, $(D_3, P_{30})$, $(D_4, P_{40})$, $(D_6, P_{60})$, $(D_7, P_{70})$, $(D_8, P_{80})$, $(D_9, P_{90})$.

Computing for the Different Quantiles

Before computing the values of the different quantiles, make sure that the set
of data is ordered or arranged. Then solve for the index $k$ of the quantile, which
gives the location number of the quantile value being sought. The size of a
data set is denoted by $n$, and the specific quantile being located is denoted by $p$.
For example, for the quantiles $Q_3$, $D_6$, $P_{59}$, the values of $p$ are 3, 6, and 59,
respectively. The following show how to compute the values of $k$ for the different
quantiles:

for quartiles, $k = \dfrac{p(n + 1)}{4}$ where $p = 1, 2, 3$;

for deciles, $k = \dfrac{p(n + 1)}{10}$ where $p = 1, 2, 3, \ldots, 9$;

for percentiles, $k = \dfrac{p(n + 1)}{100}$ where $p = 1, 2, 3, \ldots, 99$.
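
The three index formulas differ only in the divisor, so they can be sketched as a single helper. The names `PARTS` and `quantile_index` are assumptions made for this illustration, and `Fraction` from the standard library is used so that the index is kept exact instead of as a rounded decimal.

```python
from fractions import Fraction

# Number of equal parts for each kind of quantile.
PARTS = {"quartile": 4, "decile": 10, "percentile": 100}

def quantile_index(p, n, kind):
    """Return the index k = p(n + 1) / parts as an exact fraction."""
    parts = PARTS[kind]
    if not 1 <= p < parts:
        raise ValueError(f"p must be between 1 and {parts - 1} for a {kind}")
    return Fraction(p * (n + 1), parts)

n = 20  # size of a hypothetical data set
for p, kind in ((3, "quartile"), (6, "decile"), (59, "percentile")):
    k = quantile_index(p, n, kind)
    print(f"{kind} with p = {p}: k = {float(k)}")
```

A whole-number index points directly at an observation in the ordered data, while a fractional index falls between two consecutive observations.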
Example 4.4.10

The following set of data is the scores of 55 students in a 90-point
examination:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15
15 19 25 25 36 37 39 40 40 41 47 48 49 49 50
x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30
51 51 51 52 53 53 55 55 55 55 55 56 57 58 59
x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45
60 60 60 61 64 67 67 68 71 72 72 72 73 74 79
x46 x47 x48 x49 x50 x51 x52 x53 x54 x55
80 81 81 82 83 83 84 85 85 88

Solve for the following quantiles:
a. $Q_1$ and $Q_3$     b. $D_5$ and $D_8$     c. $P_{61}$
Solution:
a. For $Q_1$:

$$k = \frac{p(n + 1)}{4} = \frac{1(55 + 1)}{4} = \frac{56}{4} = 14$$

For $Q_3$:

$$k = \frac{p(n + 1)}{4} = \frac{3(55 + 1)}{4} = \frac{3 \times 56}{4} = 42$$

These mean that the first and the third quartiles are the 14th and 42nd
observations, respectively. Thus,
$Q_1 = x_{14} = 49$ and $Q_3 = x_{42} = 72$.

The first quartile indicates that 25% of the students (or approximately 14
students) scored at most 49 points, while the third quartile indicates that 75% of the
students scored at most 72 points.

b. For $D_5$:

$$k = \frac{p(n + 1)}{10} = \frac{5(55 + 1)}{10} = \frac{5 \times 56}{10} = 28$$

For $D_8$:

$$k = \frac{p(n + 1)}{10} = \frac{8(55 + 1)}{10} = \frac{8 \times 56}{10} = 44.8$$

Based on the value of $k$, $D_5 = x_{28} = 57$, which is the same as the median. That
is, one-half (50%) of the examinees scored at most 57 points. Similarly, the value of
$D_8$ is located at $x_{44.8}$. Since this number does not correspond to a particular value in
the set of data because it lies in between two consecutive locations of the ordered
set of data, a linear interpolation is done. In so doing, assume that a value in
between $x_{44} = 74$ and $x_{45} = 79$ exists, and then follow the steps below.

Performing Linear Interpolation

Linear interpolation is done by assuming that values between two numbers,
say $x_L$ and $x_R$, exist. For a location $C$ where $L \le C \le R$, the number $x_C$ lies between
$x_L$ and $x_R$ and is computed by taking the "$C - L$" part of the distance $(x_R - x_L)$
and then adding it to $x_L$:

$$x_C = x_L + (C - L)(x_R - x_L).$$

So from example 4.4.10 (b), given that $x_{44} = 74$ and $x_{45} = 79$, the value $x_C$ with $C = 44.8$ is
computed as follows:

$$D_8 = x_{44.8} = x_L + (C - L)(x_R - x_L) = 74 + (44.8 - 44)(79 - 74) = 74 + 0.8(5) = 74 + 4 = 78.$$

c. For $P_{61}$:

$$k = \frac{p(n + 1)}{100} = \frac{61(55 + 1)}{100} = \frac{61 \times 56}{100} = 34.16$$

Again, there is no 34.16th observation, so we apply linear interpolation. Given
that $C = 34.16$, $L = 34$, $x_L = x_{34} = 61$, and $x_R = x_{35} = 64$, find $P_{61}$ as follows:

$$P_{61} = x_{34.16} = x_L + (C - L)(x_R - x_L) = 61 + (34.16 - 34)(64 - 61) = 61 + 0.16(3) = 61 + 0.48 = 61.48.$$

The value of $P_{61} = 61.48$ indicates that 61% of the examinees scored at most
61.48.
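
The whole procedure of example 4.4.10 (order the data, compute the index, then interpolate when the index is not a whole number) can be sketched as one function. The name `quantile` and its parameters are assumptions made for this illustration. Returning the lowest or highest observation when the index falls outside the data is also an assumption, since the text does not discuss that case.

```python
import math
from fractions import Fraction

def quantile(data, p, parts):
    """Locate the p-th quantile for `parts` equal parts (4, 10, or 100)."""
    ordered = sorted(data)
    n = len(ordered)
    k = Fraction(p * (n + 1), parts)   # index k = p(n + 1) / parts, kept exact
    left = math.floor(k)               # L, the location just at or below k
    if left < 1:                       # assumption: clamp to the lowest value
        return float(ordered[0])
    if left >= n:                      # assumption: clamp to the highest value
        return float(ordered[-1])
    x_left = ordered[left - 1]         # x_L (Python lists start at index 0)
    x_right = ordered[left]            # x_R
    # x_C = x_L + (C - L)(x_R - x_L)
    return float(x_left + (k - left) * (x_right - x_left))

# Scores of the 55 students in example 4.4.10.
scores = [15, 19, 25, 25, 36, 37, 39, 40, 40, 41, 47, 48, 49, 49, 50,
          51, 51, 51, 52, 53, 53, 55, 55, 55, 55, 55, 56, 57, 58, 59,
          60, 60, 60, 61, 64, 67, 67, 68, 71, 72, 72, 72, 73, 74, 79,
          80, 81, 81, 82, 83, 83, 84, 85, 85, 88]

print("Q1  =", quantile(scores, 1, 4))
print("Q3  =", quantile(scores, 3, 4))
print("D5  =", quantile(scores, 5, 10))
print("D8  =", quantile(scores, 8, 10))
print("P61 =", quantile(scores, 61, 100))
```

Statistical software often uses other interpolation rules, so its quantiles may differ slightly from the values obtained with the p(n + 1) index used here.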
Exercise 4.4.2
Measures of Relative Position

Name:___________________________ Score:_______
Course and Year:__________________ Date:________

A. Five hundred students took an 80-point standardized examination. After the
examination, students were given their percentile scores, instead of the raw
scores. In this examination, the passing score is 60 points. Using this
information, identify if the statements below are true or false.

Answer
1. The score of Maria is the 87th percentile. This means that
87% of the number of examinees scored lower or equal to
Maria’s.

2. The percentile rank of Jose’s score is 60. This means that
Jose passed the examination.

3. Ruben’s score is the 72nd percentile. It follows that 28% of the
number of examinees outdid Ruben.

4. Val’s score is the median score. This means that 250
students scored higher than Val.

5. The score of Jane is the 80th percentile so she must have
perfected the examination.

6. Kyro’s score is the 49th percentile. This means that Kyro is 11
points behind passing.

7. Christine’s score is only the 10th percentile but it does not
mean that she failed the examination.

8. Timothy’s score is the first quartile. This means that 25% of
the number of examinees scored better than him.

9. Pedro’s score is the 99th percentile. This means that Pedro
garnered the highest score.

10. Ken’s score is the 6th decile. Therefore, 300 students scored
at most as high as Ken’s.
B. Given the scores of 31 students in a Mathematics Examination, compute for the
following quantiles.
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31
22 23 32 34 37 37 39 39 40 42 43 43 43 44 46 46 46 47 47 48 48 50 54 54 55 55 56 58 61 62 65
a. Q1 b. Q3

Interpretation Interpretation

c. D3 d. D9

Interpretation Interpretation

e. P71 f. P87

Interpretation Interpretation
