Exploratory Data Analysis
Benefits, Techniques, and Examples
Part 2
Recall
• Identifying the Right Data
• What is in the data?
Bivariate Analysis
• A statistical method used in data analysis to examine and understand
the relationship, association, or interaction between two different
variables.
• It involves the simultaneous analysis of two variables, which can be
numerical or categorical.
• It explores how changes in one variable are associated with changes
in another variable.
Types of Bivariate Analysis
Bivariate analysis involves examining the relationships or associations
between two different variables.
The type of bivariate analysis you choose depends on the nature of the
variables you're working with.
Here are the common types of bivariate analysis:
1. Numerical-Numerical Analysis
2. Categorical-Categorical Analysis
3. Categorical-Numerical Analysis
Categorical-Categorical Analysis
• Examines the relationship between two categorical variables.
• Used to determine association/dependency between categorical
variables.
• Common techniques include contingency tables and chi-square tests.
Categorical-Categorical Analysis
Contingency Table
• Choose variables
• Create a Contingency Table (also called a cross-tabulation or crosstab)
to display the distribution of one categorical variable in relation to the other.
Rows represent categories of one variable, and columns represent categories
of the other variable.
The values in the table are the counts or frequencies of observations falling
into each combination of categories.
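The construction above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `records` list (gender paired with a preferred subject) is invented for the example, and `contingency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical records: (gender, preferred_subject) pairs, invented for illustration.
records = [
    ("Female", "Math"), ("Female", "Art"), ("Male", "Math"),
    ("Male", "Math"), ("Female", "Art"), ("Male", "Art"),
    ("Female", "Math"), ("Male", "Math"),
]

def contingency_table(pairs):
    """Return (row_labels, col_labels, table), where table[i][j] is a count."""
    counts = Counter(pairs)                 # frequency of each (row, col) combination
    rows = sorted({r for r, _ in pairs})    # categories of the first variable
    cols = sorted({c for _, c in pairs})    # categories of the second variable
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = contingency_table(records)

# Print the crosstab: rows are genders, columns are subjects.
print(" " * 8 + "".join(f"{c:>6}" for c in cols))
for label, row in zip(rows, table):
    print(f"{label:<8}" + "".join(f"{n:>6}" for n in row))
```

Each cell holds the number of observations that fall into that combination of categories, which is the input the chi-square test needs.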
Categorical-Categorical Analysis
Chi-Square Test
• Create Contingency Table
• Calculate Expected Frequencies
• Compute an expected count for each cell under the assumption of
independence: (row total × column total) / grand total.
• Perform the Chi-Square Test
• The chi-square test for independence is used to determine whether there is a
significant association between the two categorical variables.
• It tests the null hypothesis that the variables are independent.
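As a rough sketch of these steps, the snippet below computes the expected frequencies and the Pearson chi-square statistic for a small table. The `observed` counts are hypothetical, the function names are illustrative, and no continuity correction is applied.

```python
# Hypothetical 2x2 table of observed counts (rows: group, columns: outcome).
observed = [[30, 10],
            [20, 40]]

def expected_frequencies(table):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    """Pearson chi-square: sum over cells of (observed - expected)^2 / expected."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

expected = expected_frequencies(observed)
chi2 = chi_square_statistic(observed)
# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected:", expected)
print("chi-square:", round(chi2, 4), "df:", dof)
```

The larger the gap between observed and expected counts, the larger the statistic and the stronger the evidence against independence.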
Steps for Categorical-Categorical Analysis
• Interpret the Results
Examine the chi-square statistic and its associated p-value.
If the p-value is less than a chosen significance level (e.g., 0.05), you can
reject the null hypothesis and conclude that there is a statistically significant
association between the two variables.
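The decision rule can be sketched for the simplest case, a 2x2 table with 1 degree of freedom, where the upper-tail p-value has a closed form in terms of the complementary error function. The statistic `chi2` is a hypothetical value, and `chi2_p_value_1df` is an illustrative helper that is only valid for 1 degree of freedom; larger tables need a general chi-square distribution function from a statistics library.

```python
import math

def chi2_p_value_1df(chi2):
    """Upper-tail p-value for a chi-square statistic with exactly 1 degree of freedom."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)), because X is a squared standard normal.
    return math.erfc(math.sqrt(chi2 / 2.0))

alpha = 0.05   # chosen significance level
chi2 = 16.67   # hypothetical statistic from a 2x2 contingency table

p_value = chi2_p_value_1df(chi2)
if p_value < alpha:
    decision = "reject H0: the variables appear to be associated"
else:
    decision = "fail to reject H0: no significant association detected"

print(f"p-value = {p_value:.6f} -> {decision}")
```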
Categorical-Numerical Analysis
ID  Education  Gender  Income
1   FSc        Female  30,000
2   BS         Male    50,000
3   B.Ed       Male    45,000
4   MS         Female  90,000
5   FSc        Male    32,000
6   Ph.D.      Male    150,000
• Explore the relationship/association between a categorical variable
and a numerical variable
• Determine, if there are
• Descriptive Analysis
• Summarize the numerical variable’s statistics within each category
• Inferential Analysis
• Use statistical tests
• Such as T-Test, ANOVA
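The descriptive step might look like the following sketch, which summarizes Income within each Gender category using the six rows from the table on this slide. The function name and the choice of summary statistics are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Rows from the slide's table: (ID, Education, Gender, Income).
records = [
    (1, "FSc", "Female", 30_000),
    (2, "BS", "Male", 50_000),
    (3, "B.Ed", "Male", 45_000),
    (4, "MS", "Female", 90_000),
    (5, "FSc", "Male", 32_000),
    (6, "Ph.D.", "Male", 150_000),
]

def summarize_income_by_gender(rows):
    """Group incomes by gender and return count, mean, median, min, max per group."""
    groups = defaultdict(list)
    for _id, _education, gender, income in rows:
        groups[gender].append(income)
    return {
        gender: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for gender, values in groups.items()
    }

summary = summarize_income_by_gender(records)
for gender, stats in sorted(summary.items()):
    print(gender, stats)
```

With only six rows, the single 150,000 income pulls the male mean well above the male median, which is a reminder to look at more than one summary statistic before running a test.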
Categorical-Numerical Analysis
ANOVA
• One-Way ANOVA
• 1 categorical (independent) variable with 3 or more groups
• Two-Way ANOVA
• 2 categorical (independent) variables
• Analyze individual and interactive effects on the dependent variable
Categorical-Numerical Analysis
ANOVA: F-Statistic
Step 1:
• Null Hypothesis (H0): The mean test scores of the three schools are
equal (μa = μb = μc).
• Alternative Hypothesis (Ha): At least one school's mean test score
differs from the others.
Categorical-Numerical Analysis
ANOVA: Example
Step 2:
• Test Scores for students from 3 schools
• School A: [85, 88, 90, 82, 89]
• School B: [78, 81, 83, 80, 85]
• School C: [92, 89, 94, 88, 91]
Categorical-Numerical Analysis
ANOVA: Example
Step 3:
• Mean for each school:
• μa: (85+88+90+82+89)/5 = 86.8
• μb: (78+81+83+80+85)/5 = 81.4
• μc: (92+89+94+88+91)/5 = 90.8
Categorical-Numerical Analysis
ANOVA: Example
Step 4:
• Calculate the overall mean (μ) for all test scores:
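• Overall Mean (μ) = (86.8 + 81.4 + 90.8) / 3 = 86.33
• Averaging the group means gives the overall mean here only because every
school has the same number of scores (5). With unequal group sizes, average
all N scores directly.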
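Steps 3 and 4 can be checked with a short sketch. It uses the test scores from Step 2; the variable names are illustrative. It also computes the overall mean directly from all 15 scores to confirm that it matches the mean of the group means for this balanced data.

```python
from statistics import mean

# Test scores from Step 2.
scores = {
    "A": [85, 88, 90, 82, 89],
    "B": [78, 81, 83, 80, 85],
    "C": [92, 89, 94, 88, 91],
}

# Step 3: mean of each school.
group_means = {school: mean(values) for school, values in scores.items()}

# Step 4, two ways: mean of the group means, and mean of all scores pooled.
mean_of_means = mean(list(group_means.values()))
grand_mean = mean([x for values in scores.values() for x in values])

for school, m in group_means.items():
    print(f"School {school}: mean = {m:.1f}")
print(f"mean of group means = {mean_of_means:.2f}")
print(f"mean of all scores  = {grand_mean:.2f}")
```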
Categorical-Numerical Analysis
ANOVA: Example
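Step 5:
• Calculate the Between-Group Variation (SSB)
• SSB = Σ nj × (μj − μ)², where nj is the number of scores in school j
• SSB = 5 × [(86.8 − 86.33)² + (81.4 − 86.33)² + (90.8 − 86.33)²] ≈ 222.53
Step 6:
• Calculate the Within-Group Variation (SSW)
• SSW = sum of squared deviations of each score from its own school's mean
• SSW = 42.8 + 29.2 + 22.8 = 94.8
Step 7:
• Calculate Degrees of Freedom
• dfB = k - 1 = 3 - 1 = 2
• dfW = N - k = 15 - 3 = 12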
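Steps 5 to 7 can be worked through in a short sketch using the scores from Step 2. The variable names are illustrative. The total sum of squares (`sst`) is not part of the slides; it is added here only as a sanity check, since SSB and SSW should add up to it.

```python
# Test scores from Step 2.
scores = {
    "A": [85, 88, 90, 82, 89],
    "B": [78, 81, 83, 80, 85],
    "C": [92, 89, 94, 88, 91],
}

k = len(scores)                                     # number of groups
n_total = sum(len(v) for v in scores.values())      # total number of scores (N)
all_scores = [x for v in scores.values() for x in v]
grand_mean = sum(all_scores) / n_total

# Step 5: between-group variation, weighted by group size.
ssb = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in scores.values())

# Step 6: within-group variation, each score against its own group mean.
ssw = sum((x - sum(v) / len(v)) ** 2 for v in scores.values() for x in v)

# Sanity check (not in the slides): total variation should equal SSB + SSW.
sst = sum((x - grand_mean) ** 2 for x in all_scores)

# Step 7: degrees of freedom.
df_between = k - 1
df_within = n_total - k

print(f"SSB = {ssb:.4f}, SSW = {ssw:.4f}, SST = {sst:.4f}")
print(f"dfB = {df_between}, dfW = {df_within}")
```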
Categorical-Numerical Analysis
ANOVA: Example
Step 8:
• Calculate F-Statistic:
• F = (SSB / dfB) / (SSW / dfW) = (222.53 / 2) / (94.8 / 12) ≈ 14.08
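The F-statistic for the three schools can be reproduced with a small helper. This is a sketch of the one-way ANOVA arithmetic described in the steps above; `one_way_anova_f` is a hypothetical function name, not a library call.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    msb = ssb / df_between   # mean square between groups
    msw = ssw / df_within    # mean square within groups
    return msb / msw, df_between, df_within

# Test scores from Step 2.
school_a = [85, 88, 90, 82, 89]
school_b = [78, 81, 83, 80, 85]
school_c = [92, 89, 94, 88, 91]

f_stat, df_b, df_w = one_way_anova_f([school_a, school_b, school_c])
print(f"F({df_b}, {df_w}) = {f_stat:.4f}")
```

A large F means the variation between school means is large relative to the variation of scores inside each school.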
Categorical-Numerical Analysis
ANOVA: Example
Step 9:
• Identify the critical F-value threshold for the chosen significance level
(e.g., α = 0.05)
• For α = 0.05, (numerator) dfB=2, (denominator) dfW=12, the critical F-value is
~3.8853
• Calculate the p-value using the F-distribution.
• Get the CDF of the F-distribution at the observed F
• p_value = 1 - cumulative probability
# Calculate the p-value (R)
p_value <- 1 - pf(F, dfB, dfW)
Step 10:
• Compare the p-value to the chosen significance level (α)
• If the p-value ≤ α
• Reject the null hypothesis (the data support Ha).
• If the p-value > α
• Fail to reject the null hypothesis (there is not enough evidence against H0;
this does not prove that H0 is true).
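Steps 9 and 10 can be sketched without a statistics library for this particular example, because an F-distribution with 2 numerator degrees of freedom has a closed-form upper tail: P(F > x) = (1 + 2x / dfW)^(-dfW / 2). The helper below is an assumption-laden shortcut that is valid only when dfB = 2; in general, use a library F-distribution function such as R's `pf`. The value of `f_value` is the F-statistic obtained from the Step 2 scores, rounded to two decimals.

```python
def f_upper_tail_df2(f_value, df_within):
    """P(F > f_value) for an F-distribution with 2 and df_within degrees of freedom."""
    # Closed form that holds only when the numerator degrees of freedom equal 2.
    return (1.0 + 2.0 * f_value / df_within) ** (-df_within / 2.0)

alpha = 0.05      # chosen significance level
df_within = 12    # dfW = N - k = 15 - 3
f_value = 14.08   # F computed from the three schools' scores (rounded)

p_value = f_upper_tail_df2(f_value, df_within)

if p_value <= alpha:
    decision = "reject H0"
else:
    decision = "fail to reject H0"

print(f"p-value = {p_value:.6f}, alpha = {alpha} -> {decision}")
```

As a consistency check, plugging the critical value 3.8853 into the same function gives a tail probability of about 0.05, matching the threshold quoted in Step 9.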
Categorical-Numerical Analysis
ANOVA: Example
Step 11:
• Interpret the Results:
• Since H0 is rejected, we conclude that there is a significant difference in the
test scores among at least one pair of schools.
Categorical-Numerical Analysis
T-Test