COMP ED 20 – INTRODUCTION TO ANALYTICS

UP Open University

Descriptive Analytics and Exploratory Data Analysis

Introduction

Descriptive analytics describes and summarizes a data set and answers the
question “What happened?” It provides information about past data and should be
performed before any advanced analytics. It is the simplest form of analytics
and is widely used by organizations today.

Objectives:

At the end of this module, you should be able to:

• Explain what descriptive analytics is and which statistics are used for
descriptive analysis;
• Explain what exploratory data analysis is, what it is for, and what its
methods are;
• Apply the different descriptive statistics methods; and
• Perform exploratory data analysis using MS Excel.

Topics:

• What is descriptive statistics?
• Measures of Central Tendency
• Measures of Dispersion or Variation
• Measures of Frequency
• Measures of Skewness and Kurtosis
• Measures of Position
• Exploratory Data Analysis (EDA)

What is Descriptive Statistics?

Statistics is the science of collecting, organizing, analyzing, and interpreting
data. It can be divided into two general types: descriptive statistics and
inferential statistics.

Descriptive statistics is used to provide an overall view of the data set,
determine the point around which the data converge (e.g. mean, median, mode),
and find out whether there are data anomalies and outliers. Descriptive
analysis is necessary before conducting further advanced analysis.

Inferential statistics is used to infer or draw conclusions about the
population based on the analysis done on the sample data. The samples are
selected from the population using techniques such as simple random sampling,
purposive sampling, stratified sampling, cluster sampling, and other sampling
techniques.
Calag, VB (2021). Introduction to Analytics. Los Baños: University of the Philippines Open University

The population refers to the whole or entire data set of the subject being
studied. For example, the population of enrolled students in the Open
University refers to all enrolled students.

The sample data are the data taken from the population and used as
representative of the population data set. For example, the students enrolled
in BS Education and in the Diploma in Computer Science are a sample of the
UPOU students.

Descriptive analytics uses descriptive statistics to summarize and understand
the characteristics of the data set and answer the question “What happened?”

Descriptive statistics can be grouped into the following:

a) Measures of Central Tendency
b) Measures of Dispersion or Variation
c) Measures of Frequency
d) Measures of Skewness and Kurtosis
e) Measures of Position

Measures of Central Tendency

A measure of central tendency identifies the value around which the data set
converges. This value is the central location of the distribution. The three
common measures of central tendency are the mean, median and mode.
Exploratory data analysis (EDA), discussed later in this module, can be done
using numerical (mathematical) methods or graphical methods. The following are
numerical methods for understanding the characteristics of the data set.

a. MEAN Value. The mean is the average value of a data set. It is obtained by
dividing the sum of the elements in the data set by the total number of
observations or elements in the data set.


Process:
1. Compute the total of all elements:
$\text{Sum} = x_1 + x_2 + \dots + x_n = \sum_{i=1}^{n} x_i$
Where: $x_1$ - refers to the first element/value in the data set
$x_2$ - refers to the 2nd element/value in the data set
$x_n$ - refers to the last element/value in the data set
$n$ - the total number of elements in the data set
2. Compute the mean value: $\text{mean} = \text{Sum}/n$.

b. MODE Value. The mode is the most common element, that is, the element with
the highest count in the data set.

Process:
1. Rearrange the elements in ascending order (this is not necessary, but it
makes the most frequently occurring value easier to find).
2. Count the occurrences of each value in the data set.
3. The value with the highest count is the mode of the data set.
4. Note that a data set may have one, two or more mode values.

c. The MEDIAN. The median is the middle value or element of the data set. It
is obtained by first sorting the elements in either descending or ascending
order. How the median is then found depends on whether the total number of
elements (n) in the set is odd or even. Positions below are counted from 1.

Process (if n is odd):

1. Sort the elements of the data set in ascending or descending order.
2. Determine the index of the middle element: index = (n+1)/2
3. The median is the element/value at position index.

Process (if n is even):

1. Sort the elements of the data set in ascending or descending order.
2. Determine the indices of the two middle values.
a. Index 1 = n/2
b. Index 2 = n/2 + 1
3. The median = (Data[index 1] + Data[index 2]) / 2
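
The odd and even procedures above can be sketched in Python as follows. This is only an illustrative sketch: the function name `median` and the two small test lists are made up for this example and are not part of the module's data.

```python
def median(values):
    """Return the median by following the odd/even procedure above."""
    data = sorted(values)              # step 1: sort in ascending order
    n = len(data)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        index = (n + 1) // 2           # 1-based position of the middle element
        return data[index - 1]         # Python lists are 0-based
    index1 = n // 2                    # 1-based positions of the two middle values
    index2 = n // 2 + 1
    return (data[index1 - 1] + data[index2 - 1]) / 2


# Hypothetical data, used only to exercise both branches.
print(median([7, 3, 9]))       # odd number of elements
print(median([7, 3, 9, 1]))    # even number of elements
```

The only adjustment from the written procedure is subtracting 1 from each index, because the steps count positions from 1 while Python counts from 0.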

Example 1: Determine the central values (mean, median, mode) of the given
data set.

Age of students = {19, 18, 25, 35, 21, 43, 20, 30, 30, 30}
Number of elements = 10
Age in sorted order = {18, 19, 20, 21, 25, 30, 30, 30, 35, 43}

Mean:
a) Get the sum of all elements: Sum = 19 + 18 + 25 + … + 30 = 271
b) Mean = Sum/n = 271/10 = 27.10

Median:
a) Sort the elements in ascending or descending order.
Age in sorted order = {18, 19, 20, 21, 25, 30, 30, 30, 35, 43}
b) The total number of elements = 10
c) Since 10 is even, the two middle elements are in the 5th and 6th
positions in the list = {25, 30}
d) Median = (25+30)/2 = 27.5

Mode:
a) Sort the elements in ascending or descending order.
Age in sorted order = {18, 19, 20, 21, 25, 30, 30, 30, 35, 43}
b) Count the number of times that each value occurs in the list

Age 18 19 20 21 25 30 35 43
Count 1 1 1 1 1 3 1 1

c) Mode = 30 (it occurs thrice)
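
As a cross-check of Example 1, the same three values can be obtained with Python's standard `statistics` module. This is a minimal sketch that assumes Python is available as an alternative to MS Excel; the variable names are chosen for this illustration.

```python
import statistics

# Ages of the ten students from Example 1.
ages = [19, 18, 25, 35, 21, 43, 20, 30, 30, 30]

mean_age = statistics.mean(ages)       # sum of the elements divided by n
median_age = statistics.median(ages)   # average of the 5th and 6th sorted values
mode_age = statistics.mode(ages)       # most frequently occurring value

print("Mean:", mean_age)
print("Median:", median_age)
print("Mode:", mode_age)
```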

Example 2: In this example, we will be dealing with nominal data such as
gender, which has the following values: M - Male, F - Female

Gender of students = {M, M, M, F, M, F, F, M, M, F}

Mean: Not applicable. Cannot be used for qualitative data.
Median: Not applicable. Cannot be used for qualitative data.
Mode: Count the frequency of each value

Gender M F
Count 6 4

Mode is M.
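
For nominal data, the counting step of Example 2 can be sketched with `collections.Counter`, and `statistics.multimode` returns every value tied for the highest count, which covers the case of a data set with more than one mode. The variable names below are assumptions made for this illustration.

```python
from collections import Counter
import statistics

# Gender of the ten students from Example 2.
genders = ["M", "M", "M", "F", "M", "F", "F", "M", "M", "F"]

counts = Counter(genders)                # frequency of each category
modes = statistics.multimode(genders)    # all values tied for the highest count

print(dict(counts))
print(modes)
```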

Example 3: Assume that the opinion of customers on a certain product is coded
with the following values (1 - very poor, 2 - poor, 3 - average, 4 - above
average, 5 - outstanding):

Mean = 3.4
Median = 4
Mode = 4


When to use Mean, Median and Mode?

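The mean, median and mode can all be used for quantitative data. For
qualitative nominal data, only the mode can be used. For ordinal data coded as
numbers (as in Example 3), the median and mode are meaningful; the mean is
also often reported, although it treats the gaps between categories as equal.

Data Type                    Mean   Median   Mode
Qualitative - Nominal        NA     NA       Yes
Qualitative - Ordinal        Yes    Yes      Yes
Quantitative - Discrete      Yes    Yes      Yes
Quantitative - Continuous    Yes    Yes      Yes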

Strengths and Weaknesses

Measure: Mean
Data type: Quantitative
Strengths: Includes all data in the computation.
Weaknesses: Sensitive to outlier values. Cannot be used for qualitative
nominal data (e.g. color).

Measure: Median
Data type: Quantitative
Strengths: Best for data that are skewed or when outlier values are present.
Its value is not affected by the presence of outlier values.
Weaknesses: Does not include all data in the computation.

Measure: Mode
Data type: Quantitative & Qualitative
Strengths: Best for qualitative data. When used for quantitative data, its
value is not affected by outlier values.
Weaknesses: If applied to quantitative data, it does not include all data in
the computation.

An outlier is a value that lies far from the rest of the data set. It can be at
the lower end or at the upper end of the data set.

Consider data sets 1 and 2 below and their computed mean, median and mode. The
results show that the median and mode are not affected by the outlier value,
but the mean changes from 26.40 to 28.10. This is because the mean includes
all of the elements in the data set.

Data Set 1: 18, 19, 20, 21, 25, 30, 30, 30, 35, 36
Data Set 2: 18, 19, 20, 21, 25, 30, 30, 30, 35, 53 (53 is an outlier)

              Mean     Median   Mode
Data set 1    26.40    27.50    30.00
Data set 2    28.10    27.50    30.00
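
The effect of an outlier on the three measures can be checked directly from the two data sets listed above. The sketch below is only an illustration; the helper name `central_values` is hypothetical, and the values are computed from the ten elements exactly as they are listed.

```python
import statistics

data_set_1 = [18, 19, 20, 21, 25, 30, 30, 30, 35, 36]
data_set_2 = [18, 19, 20, 21, 25, 30, 30, 30, 35, 53]  # last value is the outlier


def central_values(values):
    """Return the mean, median and mode of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


before = central_values(data_set_1)
after = central_values(data_set_2)

for name in ("mean", "median", "mode"):
    status = "changed" if before[name] != after[name] else "unchanged"
    print(name, before[name], after[name], status)
```

Only the mean is reported as changed, because it is the only one of the three that uses the magnitude of every element.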


Getting the Mean, Median and Mode values in MS Excel

To determine the mean value of a data set using MS Excel, click the cell where
you want the mean value to appear and then type =average(SRC:ERC). SRC means
the Starting Row-Column cell and ERC is the Ending Row-Column cell of the data.

To determine the median and mode, use the same approach but type
=median(SRC:ERC) and =mode(SRC:ERC), respectively. The figures below
illustrate how to perform these operations in MS Excel.

The Measures of Dispersion

A measure of dispersion tells us how far from or how close to the average or
mean value the data values are.

The following descriptive statistics are also used as numerical methods in
Exploratory Data Analysis for single variables.

The common measures of dispersion include the following:

a) Range. The range is the simplest measure and is easy to calculate. It uses
only two elements of the data set: the maximum and minimum values.

Process:
a. Determine the largest value (max) and smallest value (min)
b. Compute: Range = max - min

b) Standard Deviation. Unlike the range, which uses only two values in the
data set, the standard deviation uses all the elements in the data set. It
measures how far the elements typically lie from the mean value, in the same
units as the data. For normally distributed data, the standard deviation also
tells us what percentage of the elements fall within one, two and three
standard deviations of the mean.

Image Source: Free image from pixabay

In normally distributed data, approximately 68% of the elements are within the
mean ± 1 standard deviation (shaded portion), 95% are within the mean ± 2
standard deviations, and 99.7% are within the mean ± 3 standard deviations.

Process:
1. Compute the mean value: $\text{mean} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• where $n$ = total number of elements
2. Compute the sum of squares: $SS = \sum_{i=1}^{n} (x_i - \text{mean})^2$
• where $x_i$ is an individual element in the data set
3. Compute the standard deviation:
• Sample data: $SD = \sqrt{SS/(n-1)}$
• Population data: $SD = \sqrt{SS/n}$
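
The three steps above can be sketched in Python as follows. The function name `standard_deviation` and its `sample` flag are assumptions made for this illustration; the data are the ages from Example 1.

```python
import math


def standard_deviation(values, sample=True):
    """Follow the process above: mean, sum of squares, then square root."""
    n = len(values)
    mean = sum(values) / n                          # step 1
    ss = sum((x - mean) ** 2 for x in values)       # step 2: sum of squares
    divisor = n - 1 if sample else n                # step 3: sample vs population
    return math.sqrt(ss / divisor)


ages = [19, 18, 25, 35, 21, 43, 20, 30, 30, 30]
sample_sd = standard_deviation(ages, sample=True)
population_sd = standard_deviation(ages, sample=False)

print(round(sample_sd, 2), round(population_sd, 2))
```

Because the sample formula divides by n - 1 instead of n, the sample standard deviation is always slightly larger than the population standard deviation of the same data.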

c) Variance. It is the square of the standard deviation. It measures the
variability of the data as a single value: the average of the squared
differences between the individual elements and the mean value.

Process:
1. Compute the standard deviation.
2. Compute $\text{Variance} = SD^2$.

d) Coefficient of Variation (CV). It is a measure of relative variability that
is used when we compare the variability of two or more data sets. It is the
ratio of the standard deviation to the mean, times 100%. A high CV means that
the data have more variation relative to the mean.

Process: Compute $CV = (SD / \text{mean}) \times 100\%$


To help you understand the above concepts, let us consider the data set below
with six variables or attributes (Obs, X, Y, Z, Gender, Opinion).

The variable Obs (Observation) is just a sequential record number, while X, Y
and Z were randomly generated with values that range from 30 to 50, 20 to 60,
and 0 to 10, respectively.

Using MS Excel, we can determine the mean, range, standard deviation, variance
and coefficient of variation of X, Y and Z. In the cells where these values
will be displayed, enter the following formulas:

• Mean value, use =average(SC:EC)
• Range value, use =max(SC:EC)-min(SC:EC)
• Standard Deviation, use =stdev(SC:EC)
• Variance, use =var(SC:EC)
• Coefficient of Variation, use =stdev(SC:EC)/average(SC:EC)*100

Where: SC – Starting cell, EC – Ending cell. Note that stdev() and var()
compute the sample standard deviation and sample variance (dividing by n − 1).


Obs X Y Z Gender Opinion
1 33 32 6.386 M 1
2 46 28 2.364 M 1
3 37 54 8.119 F 1
4 39 41 5.122 F 2
5 45 60 7.68 F 3
6 49 30 2.622 F 4
7 48 51 4.036 M 5
8 32 24 0.999 F 3
9 40 44 3.193 M 2
10 35 23 5.909 F 5
11 42 34 4.568 M 2
12 50 30 6.237 F 4
13 46 31 7.216 F 5
14 44 29 1.947 F 2
15 30 52 8.267 M 1
16 42 41 8.614 M 3
17 34 46 1.662 M 1
18 42 45 6.44 F 3
19 36 53 2.941 M 4
20 44 20 3.755 F 5
21 30 46 7.294 F 3
22 50 28 9.914 F 4
23 31 29 6.561 M 2
24 39 58 0.748 F 1
25 44 30 7.471 M 2


26 46 22 6.368 F 3
27 49 56 8.502 M 1
28 49 57 5.203 F 3
29 45 27 4.406 M 4
30 32 60 4.071 F 3

This sample data set can be downloaded from https://bit.ly/3wpnmnl.

With reference to the result of the computation, X has a range value of 20, mean
value of 40.97, standard deviation of 6.55, variance of 42.86, and coefficient of
variation 15.98.

Using the standard deviation, and assuming the data are approximately normally
distributed, about 68% of the data are expected to fall within the range from
(40.97 – 6.55) to (40.97 + 6.55), or from 34.42 to 47.52.

The coefficient of variation of X means that its standard deviation is 15.98%
of its mean value of 40.97.

When Y is compared with X, Y's dispersion measures are higher because its
elements are more dispersed: its values range from 20 to 60. We can use the
range, standard deviation and variance to compare the variability of these two
variables because they have the same units or scale. Variable Z has a
different scale, so it is not appropriate to compare its range, standard
deviation and variance with those of X and Y.

In this case, let us instead use the CV to compare the variability of data sets
X, Y and Z. The CV is more appropriate for data sets with different units or
scales.

As shown by the descriptive statistics, Z has the smallest range, standard
deviation and variance compared to X and Y; however, it has the most
dispersed or diverse data based on its CV value.
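
The comparison of X, Y and Z can be reproduced outside MS Excel with a short Python sketch. The three lists are copied from the data table above; the helper name `dispersion` is hypothetical, and the sample formulas are used on the assumption that they match Excel's stdev() and var().

```python
import statistics

x = [33, 46, 37, 39, 45, 49, 48, 32, 40, 35, 42, 50, 46, 44, 30,
     42, 34, 42, 36, 44, 30, 50, 31, 39, 44, 46, 49, 49, 45, 32]
y = [32, 28, 54, 41, 60, 30, 51, 24, 44, 23, 34, 30, 31, 29, 52,
     41, 46, 45, 53, 20, 46, 28, 29, 58, 30, 22, 56, 57, 27, 60]
z = [6.386, 2.364, 8.119, 5.122, 7.68, 2.622, 4.036, 0.999, 3.193, 5.909,
     4.568, 6.237, 7.216, 1.947, 8.267, 8.614, 1.662, 6.44, 2.941, 3.755,
     7.294, 9.914, 6.561, 0.748, 7.471, 6.368, 8.502, 5.203, 4.406, 4.071]


def dispersion(values):
    """Return the range, mean, sample SD, sample variance and CV (in %)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)              # sample standard deviation
    return {
        "range": max(values) - min(values),
        "mean": mean,
        "sd": sd,
        "variance": statistics.variance(values),
        "cv": sd / mean * 100,
    }


summary = {name: dispersion(values) for name, values in (("X", x), ("Y", y), ("Z", z))}
for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Z has the smallest range, standard deviation and variance in absolute terms, but the largest CV, because its standard deviation is large relative to its small mean.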

Some examples of data sets with different units are income (Philippine peso)
and height (feet), or a swimmer’s speed (meters per second) and body weight
(kilograms).

Measures of Frequency

A frequency distribution is a presentation, as a table or a graph, of the
number of times that each element occurs in the data set. A frequency
distribution chart is an example of a graphical method in Exploratory Data
Analysis.

Using the same data set presented above, let us consider the variables Gender
and Opinion. Opinion is ordinal data coded with numeric values from 1 to 5.

Creating a Simple Frequency Distribution of Qualitative Data


1. Make a table and count the number of times (or frequency) that each
distinct element occurs in the list. Assuming that we will create a
frequency distribution for Gender, our frequency table will be:
Element Count
F 17
M 13
2. Using MS Excel, select the cell addresses of the data to graph
3. Click Insert and choose the appropriate Chart format. Your frequency
distribution may look as follows:
[Figure: bar chart of the Gender frequencies. Vertical axis: Frequency (0 to
18); horizontal axis: F and M.]

Creating a Simple Frequency Distribution (Discrete Data or Ordinal Data Coded
as Numbers)

a. Assume that the variable Opinion is discrete data or ordinal data coded as
numbers.
b. Following the same process above, the resulting frequency table will
look as follows:

Element Count
1 7
2 6
3 8
4 5
5 4

c. Using MS Excel, select the cell addresses of the data to graph,


d. Click Insert and choose the appropriate Chart format. Your frequency
distribution may look as follows:

[Figure: bar chart of the Opinion frequencies. Vertical axis: Frequency (0 to
9); horizontal axis: Opinion values 1 to 5.]
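
The counting step behind both frequency tables can be sketched in Python with `collections.Counter`. The Gender and Opinion values are copied from the data table, in observation order; the helper name `frequency_table` and the text bar chart are assumptions made for this illustration, not a replacement for the MS Excel chart.

```python
from collections import Counter

# Gender and Opinion columns of the 30 observations, in order.
gender = list("MMFFFFMFMF" "MFFFMMMFMF" "FFMFMFMFMF")
opinion = [1, 1, 1, 2, 3, 4, 5, 3, 2, 5,
           2, 4, 5, 2, 1, 3, 1, 3, 4, 5,
           3, 4, 2, 1, 2, 3, 1, 3, 4, 3]


def frequency_table(values):
    """Count how many times each distinct element occurs, sorted by element."""
    return dict(sorted(Counter(values).items()))


gender_table = frequency_table(gender)
opinion_table = frequency_table(opinion)

for title, table in (("Gender", gender_table), ("Opinion", opinion_table)):
    print(title)
    for element, count in table.items():
        print(f"  {element}  {'#' * count} {count}")   # simple text bar chart
```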

Creating Grouped Frequency Distribution

a. For data with many distinct values such as X, Y, and Z, create a grouped
frequency distribution.
b. Determine the range of the data set: maximum value – minimum value.
c. Determine the number of classes or groups (5 to 20) for your data. You
may choose your desired number of classes or use the $2^k \geq n$ rule, which
can be applied manually or mathematically. In this rule,
k = number of classes and
n = total number of elements.

Manual method of finding k by trial and error (here n = 30):

1. Assign k a trial value and compute $2^k$.
2. Compare the result with the value of n.
3. If $2^k \geq n$, use that value of k as the number of classes.

k    2^k    Is 2^k ≥ n?
3    8      No
4    16     No
5    32     Yes

Mathematical method of determining the value of k:


$2^k = n$
$k \log(2) = \log(n)$
$k = \log(n) / \log(2)$
$k = \log(30) / \log(2) = 1.477 / 0.301 = 4.91$

Since we can’t have 4.91 classes, we round this up to the next higher integer,
which is 5.
d. Determine the class width = range/k. Using the above data set, the class
widths are X = 4, Y = 8 and Z = 1.83 (round it to 2).
e. Start the first class at a value that is less than or equal to the minimum
value in the data set. For X, the lowest class can start at 30, so the lower
limit of the second class is 30 + class width = 34. For X, if the class width
is 4 and there are 5 classes, the highest value (50) in the data will not be
counted in the frequency table.
Class   Interval Range
1       30 - 33
2       34 - 37
3       38 - 41
4       42 - 45
5       46 - 49
We can address this problem with the following options:

Option 1: Adjust the class width to 5 to include 50 and keep the lower limit
of class 1 at 30 (first table), or adjust both the class width to 5 and the
lower limit of class 1 to 28 (second table).
Class   Interval Range      Class   Interval Range
1       30 - 34             1       28 - 32
2       35 - 39             2       33 - 37
3       40 - 44             3       38 - 42
4       45 - 49             4       43 - 47
5       50 - 54             5       48 - 52


Option 2: Maintain the class width at 4 and increase the number of classes to
6. In the second table below, the lower limit of class 1 is adjusted to 29.
Class   Interval Range      Class   Interval Range
1       30 - 33             1       29 - 32
2       34 - 37             2       33 - 36
3       38 - 41             3       37 - 40
4       42 - 45             4       41 - 44
5       46 - 49             5       45 - 48
6       50 - 53             6       49 - 52

f. Then, find the frequency of each class using the MS Excel COUNTIFS()
function. Below is the frequency distribution table of X (class width 5,
starting at 30).

Class   Interval Range   Frequency
1       30 - 34          7
2       35 - 39          5
3       40 - 44          7
4       45 - 49          9
5       50 - 54          2

The values in the Frequency column can be generated using the following
formulas in MS Excel:

Class   Interval Range   Frequency
1       30 - 34          =COUNTIFS(B$2:B$31,"<=34.5")
2       35 - 39          =COUNTIFS(B$2:B$31,">34.5",B$2:B$31,"<=39.5")
3       40 - 44          =COUNTIFS(B$2:B$31,">39.5",B$2:B$31,"<=44.5")
4       45 - 49          =COUNTIFS(B$2:B$31,">44.5",B$2:B$31,"<=49.5")
5       50 - 54          =COUNTIFS(B$2:B$31,">49.5")
Note: B$2:B$31 is the cell range that contains the data.
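
Steps b to f can be sketched in Python for variable X. This is an illustrative sketch only: the function name `grouped_frequency` is hypothetical, the inclusive upper limit `lower + width - 1` assumes integer data such as X, and the final table uses the Option 1 choice of a class width of 5 starting at 30.

```python
x = [33, 46, 37, 39, 45, 49, 48, 32, 40, 35, 42, 50, 46, 44, 30,
     42, 34, 42, 36, 44, 30, 50, 31, 39, 44, 46, 49, 49, 45, 32]

n = len(x)

# Step c: smallest k such that 2**k >= n (the trial-and-error method).
k = 1
while 2 ** k < n:
    k += 1

# Steps b and d: range and class width.
data_range = max(x) - min(x)
class_width = data_range / k


def grouped_frequency(values, start, width, classes):
    """Return (lower limit, upper limit, frequency) for each class."""
    table = []
    for i in range(classes):
        lower = start + i * width
        upper = lower + width - 1        # inclusive upper limit, integer data
        count = sum(1 for v in values if lower <= v <= upper)
        table.append((lower, upper, count))
    return table


# Steps e and f with Option 1: width 5, first class starting at 30.
table = grouped_frequency(x, start=30, width=5, classes=k)

print("k =", k, "range =", data_range, "class width =", class_width)
for lower, upper, count in table:
    print(f"{lower} - {upper}: {count}")
```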


Measures of Skewness and Kurtosis

Skewness is a measure of the asymmetry of the data’s distribution with respect
to the mean. With skewness, we can determine if the distribution is
approximately symmetric (as a normal distribution is), positively skewed or
negatively skewed.

The skewness of a distribution can be summarized by the following
characteristics:

Skewness: Normal (symmetric)
Shape of the distribution: Symmetric with respect to the mean
Value: -0.5 < Skew < 0.5
Mean, median and mode: mean = median = mode

Skewness: Positive skew
Shape of the distribution: Longer tail at the right side of the mean
Value: Skew ≥ 0.50
Mean, median and mode: typically mode < median < mean

Skewness: Negative skew
Shape of the distribution: Longer tail at the left side of the mean
Value: Skew ≤ -0.50
Mean, median and mode: typically mode > median > mean

Source: Dugar, D. (2018).

Kurtosis measures how peaked or flat the curve is compared with a normal
distribution. The kurtosis of a distribution can be summarized as follows:

Shape of the distribution: Normal or medium peak. Distribution is symmetric.
Kurtosis: Mesokurtic
Value of the kurtosis (k): k = 3
MS Excel function: KURT(array) = 0

Shape of the distribution: Higher/taller peak. Data are highly concentrated at
the center and the tails are heavier, so outliers are more likely.
Kurtosis: Leptokurtic (lepto = thin)
Value of the kurtosis (k): k > 3
MS Excel function: KURT(array) > 0

Shape of the distribution: Lower peak. The distribution is only slightly
higher at the center and is spread more evenly over the rest of the range.
Kurtosis: Platykurtic (platy = broad)
Value of the kurtosis (k): k < 3
MS Excel function: KURT(array) < 0


With MS Excel, we can determine the skewness and kurtosis of the data without
graphing their distribution, by using the SKEW.P(array) and KURT(array)
functions, respectively. KURT(array) returns the excess kurtosis, that is, the
kurtosis value minus 3. Therefore, if:
KURT(array) = 0, the distribution is mesokurtic
KURT(array) < 0, the distribution is platykurtic
KURT(array) > 0, the distribution is leptokurtic
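
For readers who want to see what these two functions compute, the sketch below implements the population skewness and the sample-adjusted excess kurtosis in Python. It assumes these are the formulas behind Excel's SKEW.P and KURT; the function names, the classification helpers and the ten-value test list are made up for this illustration.

```python
import math


def skew_p(values):
    """Population skewness: average cubed z-score (population SD)."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sigma) ** 3 for v in values) / n


def excess_kurtosis(values):
    """Sample-adjusted excess kurtosis (kurtosis minus 3); needs n >= 4."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    fourth = sum(((v - mean) / s) ** 4 for v in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


def describe_skewness(value):
    """Classify using the -0.5 / 0.5 thresholds from the table above."""
    if value >= 0.5:
        return "positively skewed"
    if value <= -0.5:
        return "negatively skewed"
    return "normal"


def describe_kurtosis(value):
    """Classify an excess kurtosis value (0 corresponds to k = 3)."""
    if value > 0:
        return "leptokurtic"
    if value < 0:
        return "platykurtic"
    return "mesokurtic"


# Hypothetical data, used only to exercise the functions.
sample = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]
print(round(skew_p(sample), 4), describe_skewness(skew_p(sample)))
print(round(excess_kurtosis(sample), 4), describe_kurtosis(excess_kurtosis(sample)))
```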

Measures of Position

A measure of position determines the position of a specific value within the
data set. This gives us an idea of where such a value falls in the
distribution: whether it is close to the mean value or at the extreme lower
end or higher end of the data set.

Box and Whiskers Plot or Box plot. This provides a visualization of the spread
and center of a data set. It shows the five-number summary (the minimum, 1st
quartile, median, 3rd quartile and maximum value); the MS Excel chart also
marks the mean value of the data set.

In MS Excel, this can be done by selecting the cells of the data set, then
clicking Insert from the main menu, selecting the Histogram icon and choosing
the Box and Whisker button.

With reference to X and Y in the previous data set, their Box and Whisker plots
are shown below:

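[Figure: Box and whisker plots of X and Y. Labeled values for X: minimum
30.00, 1st quartile 34.75, mean 40.97, median 42.00, 3rd quartile 46.00,
maximum 50.00. Labeled values for Y: minimum 20.00, 1st quartile 28.75, median
37.50, mean 39.37, 3rd quartile 52.25, maximum 60.00.]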

The box at the center represents the middle 50% of the data set, from the 1st
quartile to the 3rd quartile. Comparing the boxes gives us an idea of which
data set is more dispersed.

The graphical presentations, together with the values of the mean and median,
can give us an idea of the skewness of the distribution of the data sets. For
example, because the median value (42) of X is higher than its mean value
(40.97), the distribution is expected to be skewed to the left.

With reference to the second boxplot in the graph (Y), the minimum and maximum
values are 20 and 60, respectively. The second value (28.75) is the threshold
of the first quartile, while the middle line with a value of 37.5 represents
the median. The value at the center (39.37) is the average value, while 52.25
is the threshold value of the 3rd quartile. The difference between the 3rd and
1st quartiles, 52.25 − 28.75 = 23.5, is the interquartile range.
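
The values read from the box plot of Y can be reproduced with Python's `statistics.quantiles`. The sketch assumes that the chart uses the "exclusive" quartile method, which is Python's default and gives the same quartiles as those quoted above; the variable names are chosen for this illustration.

```python
import statistics

y = [32, 28, 54, 41, 60, 30, 51, 24, 44, 23, 34, 30, 31, 29, 52,
     41, 46, 45, 53, 20, 46, 28, 29, 58, 30, 22, 56, 57, 27, 60]

# Quartiles with the exclusive method (an assumption about the chart's setting).
q1, median, q3 = statistics.quantiles(y, n=4, method="exclusive")

five_numbers = (min(y), q1, median, q3, max(y))
iqr = q3 - q1                      # interquartile range: 3rd minus 1st quartile
mean = statistics.mean(y)

print("Five-number summary:", five_numbers)
print("Interquartile range:", iqr)
print("Mean:", round(mean, 2))
```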

Outliers and its Effects

Outlier values are values that appear at the extreme lower end or extreme
upper end of the data set. These outlier values can be due to clerical errors,
incorrect data entry, or other factors.

Outlier values can be detected by visual inspection of box plots or scatter
plots.

Assume that two elements of the variable Y are changed to 15 and 90. The box
and whisker plot then shows these values as follows:

[Figure: Box Plot of X and Y, with the values 15.00 and 90.00 labeled on the
plot of Y. Vertical axis from 0.00 to 100.00.]

The inclusion of outliers in the data set affects the central tendency
measures (specifically the mean value), the variability measures (e.g. range,
standard deviation, variance and coefficient of variation) and many other
statistics.

An outlier value can be handled by validating it against the raw data and
correcting it; if it cannot be corrected, it can be substituted with a new
value or removed from the data set.
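
Besides visual inspection, outliers can be flagged numerically. The sketch below uses the common 1.5 × IQR rule, which is an assumption added here and is not defined in this module: a value is flagged when it lies more than 1.5 interquartile ranges below the 1st quartile or above the 3rd quartile. The function name `iqr_outliers` is hypothetical, and the data are Data Set 2 from the earlier outlier example.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the values outside the fences Q1 - factor*IQR and Q3 + factor*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    upper_fence = q3 + factor * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


data_set_2 = [18, 19, 20, 21, 25, 30, 30, 30, 35, 53]

outliers = iqr_outliers(data_set_2)
cleaned = [v for v in data_set_2 if v not in outliers]

print("Outliers:", outliers)
print("Mean with outliers:", statistics.mean(data_set_2))
print("Mean without outliers:", round(statistics.mean(cleaned), 2))
```

Removing a flagged value is only one of the options described above; checking it against the raw data first is usually the safer choice.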

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach in data analytics that
determines the characteristics of a data set by summarizing and describing it.

There are two known methods for performing exploratory data analysis:
numerical methods and graphical methods.

The numerical method uses descriptive statistics to summarize the data, such
as the mean, standard deviation, variance and cross-tabulation, and inferential
statistics such as regression analysis, principal component analysis (PCA),
and correlation analysis.

The graphical method is done to visualize the data distribution, detect the
presence of extremely high/low values or outliers, test the assumptions of the
data, identify important variables, and detect relationships between
variables. Some of the graphical methods include bar plots, boxplots, pie
charts, histograms, and scatterplots.

Exploratory data analysis is needed to understand the data set under
investigation and to increase our confidence in the correctness of further
analysis that will be done on the data set.

EDA can be univariate (single variable), bivariate (two variables), or
multivariate (many variables).

Univariate EDA means that we summarize the characteristics of the data of one
variable without considering the effects of other variables on the values of
the variable under investigation.

Bivariate or multivariate EDA means that we summarize the data with
consideration of the relationship of the variables with each other. For
example, we may explore the pattern or possible relationship between the total
number of years of education (education) and income/salary. EDA can be
used to determine how income and education are related to each other.

Some of the descriptive statistics used in the exploratory data analysis of a
single variable (univariate EDA) include the following:

a) Measures of Central Tendency (Mean, Median, Mode)
b) Measures of Variability (Range, Variance, Standard Deviation)
c) Measures of Shape and Distribution (Skewness, Kurtosis)

Some of the methods used in two-variable exploratory data analysis (bivariate
EDA) include the following:

a) Scatterplots
b) Cross Tabulations
c) Correlation Analysis
d) Regression Analysis
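
As a small illustration of correlation and regression analysis in bivariate EDA, the sketch below computes the Pearson correlation coefficient and a least-squares line for the education and income example mentioned earlier. The eight data pairs are hypothetical values invented for this example, and the function names are assumptions; they do not come from the module's data sets.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept of the simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of education and monthly income (in thousands).
education = [10, 12, 12, 14, 16, 16, 18, 20]
income = [15, 18, 20, 24, 30, 28, 36, 40]

r = pearson_r(education, income)
slope, intercept = least_squares_line(education, income)

print("Correlation:", round(r, 3))
print("Income ≈", round(slope, 2), "× education +", round(intercept, 2))
```

A correlation close to +1 or −1 suggests a strong linear relationship, which a scatterplot of the same pairs would show visually.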

Reading Activity:

1. Dugar, D. (2018). Skew and Kurtosis: 2 Important Statistics terms you need to know
in Data Science. URL: https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa
2. NIST/SEMATECH e-Handbook of Statistical Methods. Accessed from:
https://doi.org/10.18434/M32189
3. Kallner, A. (2018). Formulas. Accessed from:
https://www.sciencedirect.com/topics/neuroscience/kurtosis#:~:text=A%20standard%20normal%20distribution%20has,recognized%20as%20leptokurtic%20and%20%3C3.
4. Statistics Canada. Constructing box and whisker plots.
https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/5214889-eng.htm#:~:text=A%20box%20and%20whisker%20plot%20is%20a%20way%20of%20summarizing,central%20value%2C%20and%20its%20variability.
5. Gomes, G. (2021). Descriptive Statistics: Expectations vs. Reality (Exploratory Data
Analysis). https://towardsdatascience.com/descriptive-statistics-expectations-vs-reality-exploratory-data-analysis-eda-8336b1d0c60b. Accessed: 21 January 2021.
6. National Institute of Standards and Technology. What is Exploratory Data Analysis.
https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm


Assignment: Descriptive Statistics & Exploratory Data Analysis

In this activity, you will perform a descriptive analysis on a data set
involving five variables with 737 records. You can use MS Excel to perform
this activity. You can download the file from: https://bit.ly/3sSZRRr

This data set is from the USDA's commissioned study of women’s nutrition in
1985. Nutrient intake was measured for a random sample of 737 women aged
25-50 years. The following variables were measured:

• Calcium(mg)
• Iron(mg)
• Protein(g)
• Vitamin A(μg)
• Vitamin C(mg)

1. 5 pts. Using MS Excel, determine the characteristics of each of the
variables by determining the range, mean, standard deviation, variance and
coefficient of variation.

Variable     Range   Mean   Standard Deviation   Variance   CV
Calcium
Iron
Protein
Vitamin A
Vitamin C

2. 5 pts. Based on the result of your descriptive analysis, which attribute is
the most dispersed? Support your answer.
3. 5 pts. Select one variable, then create a grouped frequency distribution
and a graph/chart of this variable.
4. 5 pts. Compute the skewness and kurtosis of all variables.
5. 5 pts. Based on the values that you got, what is your interpretation of the
skewness and kurtosis?
6. 5 pts. Create box and whisker plots of these five variables. Grab a
screenshot of the plot of each variable.

Variable     Skewness value   Skewness description*   Kurtosis value   Kurtosis description**
Calcium
Iron
Protein
Vitamin A
Vitamin C
Note: * - normal, positively skewed, negatively skewed
** - Mesokurtic, Leptokurtic, Platykurtic
