02 Exploratory Data Analysis
Examples:
sex or gender, religion, cellphone number, eye color, marital status, etc.
Advantages
o Easier to detect the smallest and largest value
o Easier to find the measure of position and frequency
Example 1:
The year levels of the 24 randomly selected students
from the previous example are coded as:
1 = 1st year
2 = 2nd year
3 = 3rd year
4 = 4th year
The criteria for 1st year
=COUNTIF(B2:B25,1)
=COUNTIF(B2:B25,2)
=COUNTIF(B2:B25,3)
=COUNTIF(B2:B25,4)
OPTION 2
=COUNTIF(B2:B25,"1st year")
=COUNTIF(B2:B25,"3rd year")
=COUNTIF(B2:B25,4)
=SUM(G5:G8)
Note: The ranges in the formulas depend on how you encoded your data. In my
case, the data were in B2:B25, the frequencies in G5:G8, and their total in G9.
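The COUNTIF tally can be cross-checked outside Excel. The following is a minimal Python sketch, assuming a made-up list of 24 coded year levels; the values and the names `year_levels`, `frequency`, and `total` are illustrative and are not the class data.

```python
from collections import Counter

# Hypothetical coded year levels for 24 students (1 = 1st year, ..., 4 = 4th year).
year_levels = [1, 2, 2, 3, 1, 4, 2, 3, 3, 1, 2, 4,
               1, 2, 3, 4, 2, 1, 3, 2, 4, 1, 2, 3]

# Counter plays the role of =COUNTIF(B2:B25, code) for every code at once.
frequency = Counter(year_levels)

for code in sorted(frequency):
    print(code, frequency[code])

# Same role as the SUM formula: the counts must add up to the number of students.
total = sum(frequency.values())
print("Total", total)
```

If the total does not equal the number of encoded rows, some cell was probably typed with a code outside 1 to 4.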
Note:
• The formula for finding the relative frequency (RF):
$RF = \dfrac{\text{frequency}}{\text{total}} \times 100$
• In Excel: / = division
* = multiplication
=(G5/G9)*100
=(G6/G9)*100
=(G7/G9)*100
=(G8/G9)*100
=SUM(H5:H8)
Note: The ranges in the formulas depend on how you encoded your data.
In my case, the frequencies were in G5:G8 with their total in G9, and the relative frequencies were in H5:H8 with their total in H9.
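The relative frequency arithmetic is the same in any tool. The following is a small Python sketch of the RF formula, assuming a hypothetical frequency table; the counts are made up for illustration.

```python
# Hypothetical frequency table (year-level code -> count); not the class data.
frequency = {1: 6, 2: 8, 3: 6, 4: 4}
total = sum(frequency.values())

# RF = (frequency / total) * 100, the same arithmetic as =(G5/G9)*100.
relative_frequency = {code: count / total * 100 for code, count in frequency.items()}

for code, rf in relative_frequency.items():
    print(code, round(rf, 2))

# The percentages should add up to 100 (allowing for floating-point rounding).
print(round(sum(relative_frequency.values()), 2))
```

A column of relative frequencies that does not sum to 100 usually means the divisor points at the wrong cell.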
Step 1: Highlight the cells that contain the data you want to use in the chart
Step 2: Click the Insert tab on the ribbon
Step 3: Select the type of chart you want to create
To see a chart's description, point the cursor at its icon.
Step 4: You'll see many options when you select this button, such as 2-D columns and 3-D
columns, as well as 2-D and 3-D bars. For these purposes, we're selecting 2-D columns.
To see a preview of the bar graph, point the cursor at the chart icons.
to edit the chart elements, styles, and filters
Examples:
age, allowance, number of classrooms, weight, height, etc.
Example: The following data shows the total sales per person
First, put the data into an array, that is, arrange it from the smallest to the highest value.
Highlight the data ⟶ Sort & Filter ⟶ Sort Smallest to Largest
Where
array is the sorted data from lowest to highest
k is the percentile in decimal being used
Where
array is the sorted data from lowest to highest
quart is the quartile being used
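To make the roles of `array`, `k`, and `quart` concrete, here is a hedged Python sketch. It assumes the inclusive interpolation rule (position = k × (n − 1) in the sorted array), which to my understanding matches Excel's PERCENTILE and QUARTILE; the helper `percentile_inc` and the sales figures are hypothetical.

```python
def percentile_inc(sorted_data, k):
    """Percentile by linear interpolation between neighbours, with 0 <= k <= 1."""
    position = k * (len(sorted_data) - 1)   # fractional index into the array
    lower = int(position)
    upper = min(lower + 1, len(sorted_data) - 1)
    fraction = position - lower
    return sorted_data[lower] + fraction * (sorted_data[upper] - sorted_data[lower])

# Hypothetical sales figures; sort first, as in the Excel steps.
sales = sorted([120, 340, 90, 560, 210, 430, 150, 610, 300])

p90 = percentile_inc(sales, 0.90)  # k = 0.90
q1 = percentile_inc(sales, 0.25)   # quart = 1
q2 = percentile_inc(sales, 0.50)   # quart = 2 (the median)
q3 = percentile_inc(sales, 0.75)   # quart = 3
print(p90, q1, q2, q3)
```

Quartiles are just the 25th, 50th, and 75th percentiles, so one function covers both.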
1. Mean
- arithmetic average obtained by adding up all the data values and dividing
by the total number of observations
Population mean:
$\mu = \dfrac{\sum x_i}{N}$
Sample mean:
$\bar{x} = \dfrac{\sum x_i}{n}$
where:
$x_i$ = value at ith observation
N = number of observations in population
n = number of observations in sample
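As a quick illustration of the sample mean formula, the sketch below uses a hypothetical sample of five observations; the data and the name `x_bar` are assumptions made for the example.

```python
from statistics import mean

# Hypothetical sample of n = 5 observations.
x = [4, 8, 15, 16, 23]

# Sample mean: add up all values, then divide by the number of observations.
x_bar = sum(x) / len(x)
print(x_bar)

# The standard library gives the same value.
print(mean(x))
```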
2. Median
- denoted by $\tilde{X}$ or Md
- value that divides an array of observations into two equal parts, so that
half of the cases are above it and half below it
- middle value, or average middle value in an array of observations
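✓ Check first that the data are arranged in an array
In symbols:
$\tilde{X} = \begin{cases} X_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\ \dfrac{X_{\frac{n}{2}} + X_{\frac{n+2}{2}}}{2} & \text{if } n \text{ is even} \end{cases}$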
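The odd/even rule can be sketched in Python as follows. The function name `median_from_array` and the two small data sets are hypothetical; the subscripts are shifted by one because Python lists start at index 0.

```python
def median_from_array(values):
    """Median by the odd/even rule, after sorting the values into an array."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        # X_{(n+1)/2}, shifted to 0-based indexing
        return x[(n + 1) // 2 - 1]
    # average of X_{n/2} and X_{(n+2)/2}
    return (x[n // 2 - 1] + x[(n + 2) // 2 - 1]) / 2

odd_case = median_from_array([7, 3, 9, 1, 5])        # n = 5
even_case = median_from_array([7, 3, 9, 1, 5, 11])   # n = 6
print(odd_case, even_case)
```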
3. Mode
- value(quantitative) or category(qualitative) with the largest frequency (or
percentage) in the distribution
- Denoted by $\hat{X}$ or Mo
- Locates the point where the observation values occur with the greatest density
- Generally a less popular measure than the mean or the median
- Determined by counting the frequency of each value and finding the value with
the highest frequency of occurrence
Example: 2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2
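Answer: To find the mode, find the frequency of each observation.
The value 1 occurs 2 times, 2 occurs 13 times, 3 occurs 2 times, 4 occurs once, and 5 occurs 2 times.
Therefore, the mode is the observation with the highest frequency, which is 2.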
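The counting step for this example can be reproduced with a short Python sketch; the data are the 20 values from the example, and the variable names are illustrative.

```python
from collections import Counter

data = [2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2]

# Frequency of each distinct value.
frequency = Counter(data)
for value in sorted(frequency):
    print(value, frequency[value])

# The mode is the value with the highest frequency.
mode, count = frequency.most_common(1)[0]
print("Mode:", mode, "with frequency", count)
```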
1. Range
– measures how far the highest value is from the lowest value
– a rough measure of dispersion
– difference between the highest value (HV) and the lowest value (LV) in the
population
– it uses only the extreme values
– it fails to communicate any information about the clustering or the lack of
clustering of the values between the extremes
– a weakness is that an outlier can greatly alter its value
R = HV - LV = maximum - minimum
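The sensitivity of the range to a single outlier is easy to see in a small sketch. The data below are hypothetical, and `value_range` is an illustrative helper name.

```python
# Hypothetical data set, then the same data with one outlier added.
values = [12, 15, 11, 14, 13]
with_outlier = values + [95]

def value_range(data):
    """Range = highest value minus lowest value."""
    return max(data) - min(data)

range_plain = value_range(values)          # 15 - 11
range_outlier = value_range(with_outlier)  # 95 - 11
print(range_plain, range_outlier)
```

One extra observation changes the range from 4 to 84, even though the other five values are unchanged.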
Population Variance:
=VAR.P(number1, [number2], ...)
Sample Variance:
=VAR.S(number1, [number2], ...)
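The difference between the two variance functions is only the divisor. As a hedged illustration, Python's `statistics.pvariance` divides by N (like VAR.P) and `statistics.variance` divides by n - 1 (like VAR.S); the data below are hypothetical.

```python
from statistics import pvariance, variance

# Hypothetical data: mean is 5, sum of squared deviations is 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = pvariance(data)  # 32 / 8, divides by N
sample_variance = variance(data)       # 32 / 7, divides by n - 1
print(population_variance, sample_variance)
```

The sample variance is always the larger of the two for the same data, because it divides by a smaller number.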
where: 43 is the sample size of the data
SQRT means square root
where: COUNT gives the number of entries in a data range
a range of cells where the tool will give you the output after its analysis
• Line Chart
• Histogram
• Boxplot
[Figure: bar chart with series 2008, 2009, and 2010 across the categories CHSI, CAS, Ced, Cag, and CF]
Step 5: The chart will appear. Customize the bar chart through "Chart Design", "Format", and "Quick Layout".
[Figure: "2020 Sales" chart; x-axis: PERIOD (Jan to Dec); y-axis: FREQUENCY (0 to 10)]
Range (R)
Number of classes (k): =SQRT(43) or =SQRT(COUNT(data range))
Class width: =R/k (=1870.03/7)
Lowest value: =MIN(data range)
Classes            Frequency
9.03-276.17        21
276.18-543.32      10
543.33-810.47      4
810.48-1077.62     3
1077.63-1344.77    3
1344.78-1611.92    0
1611.93-1879.07    2
[Figure: histogram of the classes above; x-axis: TOTAL SALES; y-axis: FREQUENCY (0 to 20)]
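The class limits in the table can be rebuilt from the sample size, the range, and the lowest value. The sketch below is one way to do it, under two assumptions that the slides imply but do not state: SQRT(43) is rounded up to 7 classes, and the class width R/k is rounded up to two decimal places. `Decimal` is used so the limits come out exactly to the cent.

```python
import math
from decimal import Decimal, ROUND_CEILING

n = 43                           # sample size from the example
data_range = Decimal("1870.03")  # R from the example
lowest = Decimal("9.03")         # lower limit of the first class in the table
unit = Decimal("0.01")           # data recorded to two decimal places

k = math.ceil(math.sqrt(n))      # assumption: round SQRT(43) up to a whole number
width = (data_range / k).quantize(unit, rounding=ROUND_CEILING)  # assumption: round R/k up

classes = []
for i in range(k):
    lower = lowest + i * width
    upper = lower + width - unit  # stop one unit short of the next lower limit
    classes.append((lower, upper))

for lower, upper in classes:
    print(f"{lower}-{upper}")
```

Rounding the width up rather than to the nearest value guarantees that the last class still covers the maximum.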
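Parts of a boxplot:
Minimum / Lower Fence: $Q_1 - 1.5 \times IQR$
$Q_1$ (25th Percentile)
Median
$Q_3$ (75th Percentile)
Maximum / Upper Fence: $Q_3 + 1.5 \times IQR$
Interquartile Range (IQR): the box, spanning $Q_1$ to $Q_3$
Whiskers: the lines extending from the box toward each fence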
1. Compute the Minimum value, 1st Quartile, 2nd Quartile, 3rd Quartile, and Maximum value.
Minimum value: =MIN(data range)
1st Quartile: =QUARTILE(Array,1)
2nd Quartile: =QUARTILE(Array,2)
3rd Quartile: =QUARTILE(Array,3)
Maximum value: =MAX(selected data range)
2. Create a scatter plot using the 5 computed values and the column of 1's.
Change the maximum bound of the Y axis.
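The five values behind the boxplot, together with the fences, can be sketched in Python. This assumes the inclusive quartile method (`method="inclusive"` in `statistics.quantiles`), which to my understanding agrees with Excel's QUARTILE; the data are hypothetical and include one deliberately large value.

```python
from statistics import quantiles

# Hypothetical values with one unusually large observation.
data = sorted([9, 12, 15, 18, 20, 22, 25, 28, 30, 95])

# Inclusive method: interpolate over the data as given, end points included.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
minimum, maximum = data[0], data[-1]

# Fences as defined for the boxplot.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond a fence are flagged as potential outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(minimum, q1, q2, q3, maximum)
print(lower_fence, upper_fence, outliers)
```

Here the maximum (95) lies above the upper fence, so it would be drawn as a separate point rather than as the end of the whisker.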
Contingency Tables
• Contingency tables (also called crosstabs) are useful as a rudimentary
tool to analyze the relationship between two variables.
• In a contingency table, one variable is presented in the columns and the
other in the rows.
• By looking at the distribution of one variable across categories of the
other, we are able to gain preliminary insight into the association among
variables.
• Contingency tables are most useful when variables have a limited
number of response categories.
[Figure: bar chart with series Female and Male across the categories Econ, Math, and Politics]
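A contingency table is a count of every row-category and column-category pair. The sketch below builds one in Python from hypothetical records; the category labels follow the figure, but the records themselves are made up.

```python
from collections import Counter

# Hypothetical survey records: (sex, course). Not taken from the slides' data.
records = [
    ("Female", "Econ"), ("Male", "Econ"), ("Female", "Math"),
    ("Female", "Politics"), ("Male", "Math"), ("Male", "Math"),
    ("Female", "Econ"), ("Male", "Politics"), ("Female", "Math"),
]

# Count each (row, column) combination.
counts = Counter(records)
rows = ["Female", "Male"]
columns = ["Econ", "Math", "Politics"]

# Print one variable in the rows and the other in the columns, with row totals.
print("Sex".ljust(8) + "".join(c.ljust(10) for c in columns) + "Total")
for r in rows:
    cells = [counts[(r, c)] for c in columns]
    print(r.ljust(8) + "".join(str(v).ljust(10) for v in cells) + str(sum(cells)))
```

Comparing the rows shows how one variable is distributed across the categories of the other, which is the preliminary insight the bullets describe.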