Exploratory Data Analysis_v4_part2

Uploaded by

ahmedpandit48

Exploratory Data Analysis
Benefits, Techniques, and Examples
Part 2
Recall
• Identifying the Right Data
• Clean the Data
• What is in the data?
Bivariate Analysis
• A statistical method used in data analysis to examine and understand
the relationship, association, or interaction between two different
variables.
• It involves the simultaneous analysis of two variables, which can be
numerical or categorical.
• It explores how changes in one variable are associated with changes
in another variable.
Types of Bivariate Analysis
Bivariate analysis involves examining the relationships or associations
between two different variables.
The type of bivariate analysis you choose depends on the nature of the
variables you're working with.
Here are the common types of bivariate analysis:
1. Numerical-Numerical Analysis
2. Categorical-Categorical Analysis
3. Categorical-Numerical Analysis
Categorical-Categorical Analysis
• Examines the relationship between two categorical variables.
• Used to determine association/dependency between categorical
variables.
• Common techniques include contingency tables and chi-square tests.
Categorical-Categorical Analysis
Contingency Table
• Choose variables
• Create a Contingency Table (also called a cross-tabulation or crosstab)
to display the distribution of one categorical variable in relation to the other.
Rows represent categories of one variable, and columns represent categories
of the other variable.
The values in the table are the counts or frequencies of observations falling
into each combination of categories.
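As a minimal sketch of the cross-tabulation described above, the following counts each combination of two categorical variables using only the Python standard library. The `records` data and the helper name `contingency_table` are hypothetical and serve illustration only.

```python
from collections import Counter

# Hypothetical survey records: (gender, preferred_product) pairs.
records = [
    ("Female", "A"), ("Male", "B"), ("Male", "A"), ("Female", "A"),
    ("Male", "B"), ("Female", "B"), ("Male", "B"), ("Female", "A"),
]

def contingency_table(pairs):
    """Return (row_labels, col_labels, counts) for two categorical variables."""
    counts = Counter(pairs)                  # frequency of each (row, col) combination
    rows = sorted({r for r, _ in pairs})     # categories of the first variable
    cols = sorted({c for _, c in pairs})     # categories of the second variable
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = contingency_table(records)

# Print the table with row and column labels.
print(" " * 8 + "  ".join(f"{c:>3}" for c in cols))
for label, row in zip(rows, table):
    print(f"{label:<8}" + "  ".join(f"{n:>3}" for n in row))
```

Each cell holds the number of observations that fall into that row category and column category at the same time.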
Categorical-Categorical Analysis
Chi-Square Test
• Create a Contingency Table
• Calculate Expected Frequencies for each cell under the assumption of independence.
• Perform the Chi-Square Test
• The chi-square test for independence is used to determine whether there is a
significant association between the two categorical variables.
• It tests the null hypothesis that the variables are independent.
Chi-Square Test
Steps for Categorical-Categorical Analysis
• Interpret the Results
Examine the chi-square statistic and its associated p-value.
If the p-value is less than a chosen significance level (e.g., 0.05), you can
reject the null hypothesis and conclude that there is a statistically significant
association between the two variables.
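The steps above (expected frequencies, test statistic, p-value, decision) can be sketched as follows. The 2x2 `observed` counts are hypothetical. The p-value line uses a closed form that holds only for 1 degree of freedom, i.e. a 2x2 table; larger tables need a general chi-square distribution function from a statistics library.

```python
import math

# Hypothetical 2x2 contingency table of observed counts.
observed = [[30, 10],
            [20, 40]]

def chi_square_independence(obs):
    """Return (chi2, degrees_of_freedom, expected) for a contingency table."""
    row_totals = [sum(r) for r in obs]
    col_totals = [sum(c) for c in zip(*obs)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for o_row, e_row in zip(obs, expected)
               for o, e in zip(o_row, e_row))
    dof = (len(obs) - 1) * (len(obs[0]) - 1)
    return chi2, dof, expected

chi2, dof, expected = chi_square_independence(observed)

# Assumption: dof == 1, where P(X > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

alpha = 0.05
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.6f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```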
Categorical-Numerical Analysis
• Explore the relationship/association between a categorical variable
and a numerical variable
• Determine if there are statistically significant differences in the
numerical variable across different categories

ID  Education  Gender  Income
1   FSc        Female   30,000
2   BS         Male     50,000
3   B.Ed       Male     45,000
4   MS         Female   90,000
5   FSc        Male     32,000
6   Ph.D.      Male    150,000
7   BS         Male     80,000
8   FSc        Male     37,000
9   Ph.D.      Female  220,000

• Is there a significant difference in Income based on education level?
• What is the relationship between Gender and Income?



Categorical-Numerical Analysis
Types

• Descriptive Analysis
• Summarize the numerical variable’s statistics within each category
• Inferential Analysis
• Use statistical tests
• Such as T-Test, ANOVA
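For the descriptive side, a minimal sketch that summarizes Income within each category, using the Education/Gender/Income rows from the earlier slide. The helper name `summarize` and the choice of summary statistics are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# (Education, Gender, Income) rows taken from the table in this deck.
rows = [
    ("FSc", "Female", 30_000), ("BS", "Male", 50_000), ("B.Ed", "Male", 45_000),
    ("MS", "Female", 90_000), ("FSc", "Male", 32_000), ("Ph.D.", "Male", 150_000),
    ("BS", "Male", 80_000), ("FSc", "Male", 37_000), ("Ph.D.", "Female", 220_000),
]

def summarize(rows, key_index):
    """Group incomes by the categorical column at key_index and summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {
        category: {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for category, values in groups.items()
    }

by_education = summarize(rows, 0)
by_gender = summarize(rows, 1)

for category, stats in by_education.items():
    print(category, stats)
for category, stats in by_gender.items():
    print(category, stats)
```

With so few rows per category (B.Ed and MS have one each), these summaries describe the sample only; an inferential test would need more data per group.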
Categorical-Numerical Analysis
ANOVA

• ANOVA: Analysis of Variance


• Statistical method to analyze differences among group means in a
sample.
• Tests for statistically significant differences between the means of three or
more independent groups.
• Assumes normally distributed data and independence of
observations
Categorical-Numerical Analysis
ANOVA: Purpose

• Test the null hypothesis (H0): the means of several groups are equal


• No significant difference between the groups
• Variation in the data comes either from differences between groups or from
random fluctuations within groups
Categorical-Numerical Analysis
ANOVA: Types

• One-Way ANOVA
• 1 categorical (independent) variable with 3 or more groups
• Two-Way ANOVA
• 2 categorical (independent) variables
• Analyze individual and interactive effects on the dependent variable
Categorical-Numerical Analysis
ANOVA: F-Statistic

• Variance between group means VS variance within groups


• A larger F-statistic with a smaller p-value indicates significant differences between groups
• Calculation of F-Statistic:
• Explained Variation – variation between groups
• Also known as “between-group Sum of Squares (SSB)” or “Treatment Sum of Squares
(SST)”
• Measure of difference between group means vs overall mean
Categorical-Numerical Analysis
ANOVA: Calculating F-Statistic

1. Explained Variation – variation between groups


• Also known as “between-group Sum of Squares (SSB)” or “Treatment Sum of
Squares (SST)”
• Measure of difference between group means vs overall mean
Categorical-Numerical Analysis
ANOVA: Calculating F-Statistic

2. Unexplained Variation – variation within groups


• Also known as “within-group Sum of Squares (SSW)” or “Error Sum of Squares
(SSE)”
• Measure of difference between individual data points within each group vs
group mean
Categorical-Numerical Analysis
ANOVA: Calculating F-Statistic

3. Degrees of Freedom (df)


• Degrees of Freedom for the between-group variation (dfB):
• Number of groups(k)
• dfB = k-1
• Degrees of Freedom for the within-group variation (dfW):
• Total Number of Observations(N)
• dfW = N-k
Categorical-Numerical Analysis
ANOVA: Calculating F-Statistic

4. F-Statistic = (SSB/dfB) / (SSW/dfW)

Next for ANOVA:


• Use a Cumulative Distribution Function (CDF) for the F-distribution to
find the p-value associated with F-Statistic.
• Compare p-value to the threshold.
• If p-value is less than the threshold, H0 is rejected.
Categorical-Numerical Analysis
ANOVA: Example

Step 1:
• Null Hypothesis (H0): There is no significant difference in the test
scores among the three schools (μa = μb = μc).
• Alternative Hypothesis (Ha): There is a significant difference in the
test scores among at least one pair of schools.
Categorical-Numerical Analysis
ANOVA: Example

Step 2:
• Test Scores for students from 3 schools
• School A:[85,88,90,82,89]
• School B:[78,81,83,80,85]
• School C:[92, 89, 94, 88, 91]
Categorical-Numerical Analysis
ANOVA: Example

Step 3:
• Mean for each school:
• μa: (85+88+90+82+89)/5 = 86.8
• μb : (78+81+83+80+85)/5 = 81.4
• μc: (92+89+94+88+91)/5 = 90.8
Categorical-Numerical Analysis
ANOVA: Example

Step 4:
• Calculate the overall mean (μ) for all test scores:
• Overall Mean (μ) = (86.8 + 81.4 + 90.8) / 3 = 86.33
Categorical-Numerical Analysis
ANOVA: Example

Step 5:
• Calculate the Between-Group Variation (SSB):

$$SSB = \sum_{j=1}^{k} n_j \left(\bar{y}_j - \bar{y}\right)^2$$

• Where: k = number of groups
• $n_j$ = sample size of group j
• $\bar{y}_j$ = mean of data items in group j
• $\bar{y}$ = mean of all data items in the dataset
• SSB = (5 * (86.8 - 86.33)²) + (5 * (81.4 - 86.33)²) + (5 * (90.8 - 86.33)²)
≈ 222.53
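As a quick check of the between-group sum of squares, here is a minimal sketch that recomputes it from the Step 2 scores. The function name `between_group_ss` is hypothetical. It computes the grand mean over all observations rather than as the mean of group means; the two agree here only because every school has the same number of students.

```python
from statistics import mean

# Test scores from Step 2 of the example.
schools = {
    "A": [85, 88, 90, 82, 89],
    "B": [78, 81, 83, 80, 85],
    "C": [92, 89, 94, 88, 91],
}

def between_group_ss(groups):
    """SSB: sum over groups of n_j * (group mean - grand mean)^2."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = mean(all_scores)
    return sum(len(scores) * (mean(scores) - grand_mean) ** 2
               for scores in groups.values())

ssb = between_group_ss(schools)
print(round(ssb, 2))  # 222.53
```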
Categorical-Numerical Analysis
ANOVA: Example

Step 6:
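• Calculate the Within-Group Variation (SSW):

$$SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(y_{ij} - \bar{y}_j\right)^2$$

• Where: k = number of groups
• $n_j$ = sample size of group j
• $\bar{y}_j$ = mean of data items in group j
• $y_{ij}$ = ith observation in group j

• SSW = (85 - 86.8)² + (88 - 86.8)² + … = ?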


Categorical-Numerical Analysis
ANOVA: Example

Step 7:
• Calculate Degrees of Freedom
• dfB = k-1 = ?
• dfW = N-k = ?
Categorical-Numerical Analysis
ANOVA: Example

Step 8:
• Calculate F-Statistic:
• F = (SSB / dfB) / (SSW / dfW) = ?
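One way to check your answers to Steps 6 to 8 is to run the whole calculation in code. This is a minimal sketch using the Step 2 scores; the function name `one_way_anova` and the dictionary keys are hypothetical, not part of any library.

```python
from statistics import mean

# Test scores from Step 2 of the example (Schools A, B, C).
schools = [
    [85, 88, 90, 82, 89],
    [78, 81, 83, 80, 85],
    [92, 89, 94, 88, 91],
]

def one_way_anova(groups):
    """Return SSB, SSW, degrees of freedom, and the F-statistic."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)             # total observations N
    grand_mean = mean([x for g in groups for x in g])
    # Between-group variation: group means vs. the overall mean.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: each observation vs. its own group mean.
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_b, df_w = k - 1, n_total - k
    f_stat = (ssb / df_b) / (ssw / df_w)
    return {"ssb": ssb, "ssw": ssw, "df_b": df_b, "df_w": df_w, "f": f_stat}

result = one_way_anova(schools)
for name, value in result.items():
    print(f"{name} = {value:.4f}" if isinstance(value, float) else f"{name} = {value}")
```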
Categorical-Numerical Analysis
ANOVA: Example

Step 9:
• Identify the critical F-value for the chosen threshold (e.g. α = 0.05)
• For α = 0.05, (numerator) dfB=2, (denominator) dfW=12, the critical F-value is
~3.8853
• Calculate the p-value using the F-distribution.
• Get the CDF of the F-distribution
• p_value = 1 - cumulative probability
• In R:
    # Calculate the p-value
    p_value <- 1 - pf(F, dfB, dfW)

• For F ≈ 14.08 (the value obtained from the Step 2 scores), dfB=2, dfW=12, the
calculated p-value is ≈ 0.0007, well below α = 0.05
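The p-value step can be sketched without a statistics library for this particular example. When the numerator degrees of freedom equal 2, the upper tail of the F-distribution has the closed form (1 + 2F/dfW)^(-dfW/2). The function name `f_p_value_df2` is hypothetical, and the shortcut is an assumption that holds only for dfB = 2; other cases need a general F-distribution CDF such as R's `pf`.

```python
def f_p_value_df2(f_stat, df_w):
    """Upper-tail probability of F(2, df_w); valid only when the numerator df is 2."""
    return (1 + 2 * f_stat / df_w) ** (-df_w / 2)

df_w = 12

# Sanity check: the critical value quoted for alpha = 0.05 gives p close to 0.05.
p_at_critical = f_p_value_df2(3.8853, df_w)

# F-statistic computed from the Step 2 scores: (222.5333 / 2) / (94.8 / 12).
f_stat = (222.5333 / 2) / (94.8 / 12)
p_value = f_p_value_df2(f_stat, df_w)

alpha = 0.05
print(f"p at critical F = {p_at_critical:.5f}")
print(f"F = {f_stat:.2f}, p = {p_value:.5f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```

A p-value of about 0.05 corresponds to the critical F-value itself, so an F-statistic far above 3.8853 must give a much smaller p-value.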


Categorical-Numerical Analysis
F-Table
Categorical-Numerical Analysis
ANOVA: Example

Step 10:
• Compare the p-value to the chosen significance level (α)
• If the p-value ≤ α
• Reject the null hypothesis (Ha is supported).
• If the p-value > α
• Fail to reject the null hypothesis (the data do not provide enough evidence against H0).
Categorical-Numerical Analysis
ANOVA: Example

Step 11:
• Interpret the Results:
• Since H0 is rejected, we conclude that there is a significant difference in the
test scores among at least one pair of schools.
Categorical-Numerical Analysis
T-Test

• Statistical hypothesis test


• Determines if there is a significant difference between means of two
groups.
• Useful to assess if the mean of a continuous variable in one group
differs from the mean in the other.
• Parametric Test: Assumes,
• Data is normally distributed
• Variances in the two groups are equal
Categorical-Numerical Analysis
T-Test: Types

• Independent Samples T-Test (Student’s T-Test)


• Two independent Groups
• Compare means of a variable between these groups
• Assess, if the means are significantly different from each other.
• Paired Samples T-Test
• One group of subjects
• Measure same variable for each subject twice, under different conditions or
different time points.
• Assess, if the means of the paired observations are significantly different.
Categorical-Numerical Analysis
T-Test: Independent Samples T-Test (Example)

• Independent Samples T-Test (Student’s T-Test)


• Determine if there is a significant difference in the average test scores of two
groups of students (Group A and Group B)
• Group A: [85, 88, 90, 82, 89]
• Group B: [78, 81, 83, 80, 85]

Score  Group
85     Group A
88     Group A
90     Group A
82     Group A
89     Group A
78     Group B
81     Group B
83     Group B
80     Group B
85     Group B
Categorical-Numerical Analysis
T-Test: Independent Samples T-Test (Example)

Step 1: Formulate the Hypotheses


• Null Hypothesis (H0)
• means of test scores in Group A and Group B are equal (μa = μb)
• Alternate Hypothesis (Ha)
• means of test scores in Group A and Group B are not equal (μa ≠ μb)
Categorical-Numerical Analysis
T-Test: Independent Samples T-Test (Example)

Step 2: Calculate the Means


Categorical-Numerical Analysis
T-Test: Independent Samples T-Test (Example)

Step 3: Calculate the Variance and Standard Error



• ?
Categorical-Numerical Analysis
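T-Test: Independent Samples T-Test (Example)

Step 4: Calculate the T-Statistic

Formula for t-Statistic is:

$$t = \frac{\bar{x} - \mu}{SE}$$

Where:
• $\bar{x}$ is the sample mean.
• μ is the hypothesized population mean (under the null hypothesis).
• SE is the standard error.

For two independent samples:

$$t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}$$

• t represents the t-statistic.
• $\bar{x}_A$ and $\bar{x}_B$ are the sample means for Group A and Group B, respectively.
• $s_A^2$ and $s_B^2$ are the sample variances for Group A and Group B, respectively.
• $n_A$ and $n_B$ are the sample sizes for Group A and Group B, respectively.
• With equal group sizes, as in this example, this gives the same value as the pooled-variance (Student's) form.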
Categorical-Numerical Analysis
T-Test: Independent Samples T-Test (Example)

Step 5: Determine Degrees of Freedom



Categorical-Numerical Analysis
T-Test: Independent Samples T-Test (Example)

Step 6: Find the Critical T-Value


• Based on chosen threshold (α) and degrees of freedom
• Use T-Table

Step 7: Calculate p-value


• Associated with the t-statistic using a t-distribution table
Categorical-Numerical Analysis
T-Test: Independent Samples T-Test (Example)

Step 8: Make a decision


• If |t-statistic| > critical t-value and p-value < α, then reject the null
hypothesis.
• If |t-statistic| ≤ critical t-value or p-value ≥ α, then fail to reject the
null hypothesis.
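Steps 2 to 8 can be traced with a minimal sketch using the Group A and Group B scores from the example. It uses the pooled-variance (Student's) form, which matches the equal-variance assumption stated earlier. The function names are hypothetical, and the p-value is obtained by numerically integrating the t density with Simpson's rule, which is an illustrative substitute for a t-table or a statistics library.

```python
import math
from statistics import mean, variance

group_a = [85, 88, 90, 82, 89]
group_b = [78, 81, 83, 80, 85]

def student_t_test(a, b):
    """Return (t, df) for an independent two-sample t-test with pooled variance."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))   # standard error of the difference
    return (mean(a) - mean(b)) / se, df

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value: Simpson integration of the t density from 0 to |t|."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)
    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = pdf(0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3                   # P(0 < T < |t|)
    return 2 * (0.5 - area)

t_stat, df = student_t_test(group_a, group_b)
p_value = t_two_sided_p(t_stat, df)

alpha = 0.05
print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```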
Categorical-Numerical Analysis
T-Test: Paired Samples T-Test (Example)

• Determine if there is a significant difference in the test scores of
students before and after a special tutoring program
• Pre-Tutor: [85, 88, 90, 82, 89]
• Post-Tutor: [78, 81, 83, 80, 85]

Score  When
85     Pre-Tutor
88     Pre-Tutor
90     Pre-Tutor
82     Pre-Tutor
89     Pre-Tutor
78     Post-Tutor
81     Post-Tutor
83     Post-Tutor
80     Post-Tutor
85     Post-Tutor
Categorical-Numerical Analysis
T-Test: Paired Samples T-Test (Example)

Step 1: Formulate the Hypotheses


• Null Hypothesis (H0)
• no significant difference between the mean test scores before tutoring and
after tutoring (μpre = μpost)
• Alternate Hypothesis (Ha)
• a significant difference between the mean test scores before tutoring and
after tutoring (μpre ≠ μpost)
Categorical-Numerical Analysis
T-Test: Paired Samples T-Test (Example)

Step 2: Calculate the differences


• Differences d = [85-78, 88-81, 90-83, 82-80, 89-85] = [7, 7, 7, 2, 4]
Step 3: Calculate the Sample Mean and Sample SD

Step 4: Calculate t-statistic


Categorical-Numerical Analysis
T-Test: Paired Samples T-Test (Example)

Step 5: Calculate the degrees of freedom

Step 6: Find the Critical T-Value ($t_{\alpha/2,\,df}$)

Step 7: Calculate the p-value ($p_{paired}$)

Step 8: Make a decision


• If $|t_{paired}| > t_{\alpha/2,\,df}$ and $p_{paired} < \alpha$
• Reject H0
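The paired steps can be sketched with the Pre-Tutor and Post-Tutor scores from the example. The decision below compares against the standard two-tailed t-table value for α = 0.05 and df = 4 (2.776) instead of computing a p-value; the variable names are hypothetical.

```python
import math
from statistics import mean, stdev

pre = [85, 88, 90, 82, 89]
post = [78, 81, 83, 80, 85]

# Step 2: per-student differences (pre minus post, as in the example).
diffs = [a - b for a, b in zip(pre, post)]

# Step 3: sample mean and sample standard deviation of the differences.
mean_d = mean(diffs)
sd_d = stdev(diffs)

# Step 4: t-statistic = mean difference / standard error of the mean difference.
n = len(diffs)
t_paired = mean_d / (sd_d / math.sqrt(n))

# Step 5: degrees of freedom.
df = n - 1

# Step 6: critical value from a t-table (two-tailed, alpha = 0.05, df = 4).
t_critical = 2.776

# Step 8: decision based on the critical value.
reject_h0 = abs(t_paired) > t_critical
print(f"d = {diffs}, mean = {mean_d}, sd = {sd_d:.3f}")
print(f"t = {t_paired:.3f}, df = {df}, reject H0: {reject_h0}")
```

Because the differences are taken as pre minus post, a positive mean difference means the scores in this sample were lower after tutoring.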
Multivariate Analysis
Next time!
