CIA 3 STATS
By
1MBA-L
PROF. SASEEKALA M.
MBA PROGRAMME
SCHOOL OF BUSINESS AND MANAGEMENT
CHRIST (DEEMED TO BE UNIVERSITY), BANGALORE
AUGUST 2024
TABLE OF CONTENTS
1 INTRODUCTION
2 DATA DICTIONARY
5 CORRELATION
1. INTRODUCTION OF THE REPORT
The purpose of this report is to use statistical methods to examine a business issue
identified in a provided dataset. Drawing on material from Units 3, 4, and 5 of our course,
the report performs hypothesis testing to derive meaningful insights and propose an
appropriate solution to the issue. It concentrates specifically on comparing frequencies
between groups, evaluating means, and investigating correlations between continuous
variables. The ultimate objective is to offer a data-driven analysis that supports
well-informed decision-making in a commercial setting.
The specified problem and dataset were carefully selected and approved in cooperation with
the faculty, making this study a joint effort. By means of thorough statistical examination, our
objectives are as follows:
1. Formulate and test hypotheses about the means and variances of the relevant variables.
2. Compare the means and frequencies among several groups to find trends or differences.
By achieving these goals, the report will offer thorough analysis and practical
recommendations that address the identified business issue.
2. DATA DICTIONARY FOR THE INSURANCE DATA SET
TYPE OF TESTS:
1. CHI-SQUARE TEST
2. ANOVA-SINGLE FACTOR
3. CORRELATION
4. REGRESSION ANALYSIS
1. CHI-SQUARE TEST-The chi-square test of independence is a statistical method for
determining whether two categorical variables are significantly associated. In this case, we
tested the association between age and the number of children. The p-value from the
chi-square test was approximately 0.343.
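NULL HYPOTHESIS: There is no association between an individual's age and the number of
children they have.
ALTERNATE HYPOTHESIS: There is an association between an individual's age and the number
of children they have.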
Interpretation:
The p-value measures the strength of the evidence against the null hypothesis. The null
hypothesis in this case states that there is no association between an individual's age and
the number of children they have. We would reject the null hypothesis if the p-value were
less than the chosen significance level, conventionally 0.05. This would imply that the two
variables are probably related.
In this case, however, the p-value is 0.343, well above the conventional significance level
of 0.05. This indicates no significant difference between the observed age distribution
across the different numbers of children and the distribution expected if the two variables
were unrelated. We therefore fail to reject the null hypothesis. In other words, this data
does not provide enough evidence to conclude that an individual's age is related to the
number of children they have.
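The mechanics of the test can be sketched in Python as follows. This is a minimal, standard-library illustration under stated assumptions, not the procedure that produced the reported p-value: the contingency table is hypothetical (the report's actual counts are not reproduced here), and the function names `chi_square_independence` and `chi2_sf` are introduced only for this example.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable (series for the incomplete gamma)."""
    if x <= 0:
        return 1.0
    a, half_x = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def chi_square_independence(observed):
    """Return (statistic, degrees_of_freedom, p_value) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof, chi2_sf(statistic, dof)


# Hypothetical counts: rows = age groups, columns = 0, 1, 2+ children.
table = [
    [30, 25, 20],
    [28, 27, 22],
    [32, 22, 24],
]
stat, dof, p = chi_square_independence(table)
alpha = 0.05
decision = "reject" if p < alpha else "fail to reject"
print(f"chi2={stat:.3f}, dof={dof}, p={p:.3f}: {decision} the null hypothesis")
```

The decision rule in the last lines is the one described above: the null hypothesis is rejected only when the p-value falls below the chosen significance level.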
Practical Implications:
This finding has various implications, especially for sociology, public health, and
demography, where knowledge of the association between family size and age might guide
intervention or policy decisions. If age were significantly associated with the number of
children, for example, it could indicate that particular age groups are more likely to have
more or fewer children, which could inform the distribution of resources or the design of
educational initiatives aimed at particular populations.
Nonetheless, the absence of a statistically significant association in this dataset suggests
that variables other than age may have a bigger impact on the size of a family. Socioeconomic
status, education, cultural norms, and personal preferences are a few examples of these
variables. Therefore, to find potential connections that our analysis might have overlooked,
more research may be required to examine these variables or to investigate a bigger or more
diverse dataset.
Limitations:
The chi-square test is useful, but it has limitations. The test assumes that the observations
are independent and that the sample is large enough for the chi-square approximation to be
valid (a common rule of thumb is an expected count of at least 5 in each cell). The results
may not be reliable if these assumptions are not met. Furthermore, the test only determines
whether an association exists; it offers no insight into the nature or direction of the
relationship.
In addition, the context affects how the results should be interpreted. A non-significant
finding could mean that the study lacked the statistical power to detect a small effect,
rather than that there is no association at all. The observed result could be influenced by
the sample size, the level of detail in the age data, or other unmeasured factors.
Conclusion:
In conclusion, the chi-square test shows no statistically significant association between
the age of the individuals in the dataset and their number of children. Although this result
suggests that age might not be a significant factor in determining family size, it also
emphasizes the importance of taking other factors into account and the possible need for
additional research using larger or alternative datasets. The findings serve as a reminder
that statistical analysis is only one part of a larger picture that helps explain complex
social phenomena.
2. ANOVA-SINGLE FACTOR-This ANOVA is performed to determine whether there is a
significant difference between the means of the four regional groups in this dataset:
northeast, northwest, southeast, and southwest.
Null Hypothesis: There is no difference among the group means.
Alternate Hypothesis: At least one group mean differs from the others.
Overall Conclusion:
According to the ANOVA results, at least two of the group means differ statistically
significantly. Because the p-value (0.020) is below the 0.05 significance level, we reject
the null hypothesis: a difference this large would be unlikely to arise by chance if the
group means were truly equal. This suggests that, in terms of the variable being assessed,
the groups represented by Columns 1 through 4 are not all the same.
Although ANOVA indicates that a difference exists, it does not identify which groups differ
from one another; a post hoc comparison such as Tukey's HSD would be needed for that.
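The F statistic behind a single-factor ANOVA can be sketched as below. This is an illustrative standard-library example, not the spreadsheet procedure used for the report: the regional values are hypothetical, and `one_way_anova` is a name introduced here. The p-value would be obtained by comparing F with an F distribution on the returned degrees of freedom, a step this sketch leaves out.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        # Variation of the group mean around the grand mean, weighted by group size.
        ss_between += len(g) * (mean - grand_mean) ** 2
        # Variation of individual values around their own group mean.
        ss_within += sum((v - mean) ** 2 for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical values for the four regions (not the report's data).
regions = {
    "northeast": [12.1, 13.4, 11.8, 12.9],
    "northwest": [11.5, 12.2, 12.8, 11.9],
    "southeast": [14.0, 15.1, 13.6, 14.7],
    "southwest": [12.0, 12.6, 11.7, 12.3],
}
f_stat, df_b, df_w = one_way_anova(list(regions.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the variation between group means is large relative to the variation within groups, which is what leads to a small p-value.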
Practical Implications:
Depending on the study's context, significant differences between the groups could have
substantial consequences. Here the groups are regions, so the result suggests that the
measured variable differs in at least one region, which may help guide further investigation
or decision-making. Knowing where these differences lie could help optimize strategies or
interventions in practice.
Final Thoughts:
ANOVA is an effective statistical technique for understanding the variation in group means.
This study's results indicate a statistically significant difference between the means of
the four groups. Statistical significance alone does not establish that the differences are
large enough to matter, so further analysis is needed to determine which groups differ, by
how much, and whether the differences are of practical importance.
3. CORRELATION
The image provided presents the correlation between two variables: Age and Charges, within
an insurance dataset. The correlation coefficient, denoted by the symbol ρ (rho), is given as
approximately 0.0097.
Understanding Correlation:
Correlation is a statistical measure of the direction and strength of the linear relationship
between two variables. The correlation coefficient (ρ) ranges from -1 to 1.
Key Findings:
• This value indicates that the age of the individuals and the charges they pay have
only a very weak linear relationship. In other words, charges show no regular or
predictable linear variation with age.
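For reference, the Pearson correlation coefficient can be computed as in the sketch below. The ages and charges are hypothetical values chosen for illustration, not rows from the insurance dataset, and `pearson_r` is a helper name introduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ages and charges, for illustration only.
ages = [19, 25, 33, 41, 52, 60]
charges = [2100.0, 3900.0, 3300.0, 6200.0, 5400.0, 8800.0]
r = pearson_r(ages, charges)
print(f"r = {r:.3f}")
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0, such as the 0.0097 reported here, indicate almost none.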
Implications for the Insurance Dataset:
For an insurance dataset, this outcome is somewhat unexpected. Age would normally be expected
to be an important factor in determining insurance costs. For example, it is reasonable to
assume that older people pay more because they have greater health risks or medical needs.
Nonetheless, the correlation coefficient suggests that there is no significant linear
association between age and charges in this dataset.
1. Non-linear Relationship:
• Age and charges may not have a linear relationship. Charges may rise with age up to a
certain point and then level off or even decline. In such situations, a correlation
coefficient may not accurately reflect the underlying nature of the relationship. To
investigate this, additional statistical techniques such as regression analysis might be
required.
2. Confounding Variables:
• The absence of correlation could imply that the charges are being determined more by
other factors. For instance, factors like geography, lifestyle choices, health status, and
type of insurance coverage may have a stronger correlation with costs than age.
3. Dataset Characteristics:
• It is also possible that the dataset employed in this analysis is skewed or
non-representative. For example, the expected association between age and charges may
not appear if the population included in the study is largely younger or healthier.
• This relationship may also be impacted by the structure of the insurance policy. Age
may not be a major element in determining charges if other criteria, such as coverage
limits, deductibles, or particular risk assessments, have a greater influence on
premiums and charges.
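The first explanation, a non-linear relationship, can be illustrated with a small sketch. When charges follow a symmetric inverted U-shape in age, the Pearson coefficient is essentially zero even though charges depend entirely on age. The data below are synthetic and constructed only to make this point; they say nothing about the shape of the actual insurance data, and `pearson_r` is a helper name introduced for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic data: charges peak at age 45 and fall off symmetrically on both sides.
ages = list(range(25, 66, 5))  # 25, 30, ..., 65
charges = [5000.0 - 4.0 * (age - 45) ** 2 for age in ages]

r = pearson_r(ages, charges)
print(f"r = {r:.6f}")  # essentially zero despite the exact dependence on age
```

A scatter plot or a model with a non-linear age term would reveal a pattern like this where the correlation coefficient alone does not.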
Practical Considerations:
Knowing how age affects costs is important for insurance firms because it helps them
assess risk, set rates, and design age-appropriate policies. The very weak correlation
observed here raises the possibility that age alone is not a good indicator of the charges
incurred in this sample.
Conclusion:
With a correlation coefficient of roughly 0.0097, the correlation analysis between age and
charges in the insurance dataset shows a very weak linear link. This implies that charges in
this dataset have almost no linear relationship with age. This finding, although
counterintuitive, emphasizes how important it is to examine more complex relationships and
take additional variables into account when studying insurance data. The results suggest
that age alone should not be used to forecast insurance costs; additional research is needed
to identify the underlying variables affecting costs in this dataset.
(FOR YULU BIKE DATA SET)
DATA DICTIONARY
4. REGRESSION ANALYSIS
The regression analysis for the Yulu bike dataset, with count as the dependent variable and
weather as the independent variable, provides insights into how weather conditions impact
the number of bike rentals.
• Multiple R: 0.159583
• R Square: 0.025467
• Adjusted R Square: 0.024815
• Standard Error: 70.77804
• Observations: 1498
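As a quick consistency check on these summary figures, R Square should equal the square of Multiple R, and Adjusted R Square follows from R Square, the number of observations, and the number of predictors. The sketch below uses only the values listed above; the variable names are introduced for illustration.

```python
# Values as reported in the regression summary above.
multiple_r = 0.159583
observations = 1498
predictors = 1  # weather is the only independent variable

# R Square is the square of Multiple R.
r_squared = multiple_r ** 2

# Adjusted R Square penalizes R Square for the number of predictors used.
adjusted_r_squared = 1 - (1 - r_squared) * (observations - 1) / (observations - predictors - 1)

print(f"R Square          = {r_squared:.6f}")
print(f"Adjusted R Square = {adjusted_r_squared:.6f}")
```

Both computed values agree with the reported figures to the precision shown.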
ANOVA:
• Regression df: 1
• Residual df: 1496
• F-value: 39.09375
• Significance F: 5.26E-10
Coefficients:
• Intercept: 101.846
• Weather: -17.9718
Interpretation:
1. R-Square (0.025): Only 2.5% of the variation in the number of bike rentals (count)
can be explained by the weather variable. This suggests that weather alone is not a
strong predictor of bike rental counts, indicating other factors play a significant role.
2. Significance (P-value: 5.26E-10): The low p-value indicates that the weather variable
is statistically significant, meaning changes in weather conditions do have a
measurable effect on the number of rentals.
3. Coefficient for Weather (-17.9718): The negative coefficient implies that as weather
conditions worsen (higher weather values typically correspond to poorer weather), the
number of bike rentals decreases. Specifically, for each unit increase in the weather
index, the predicted number of bike rentals decreases by approximately 18.
4. Intercept (101.846): The intercept is the predicted count when the weather value is 0,
which lies below the baseline weather value of 1. At the baseline (1), the predicted
count of bike rentals is around 84 (101.846 - 17.9718).
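The fitted line implied by the reported intercept and weather coefficient can be evaluated directly, as in the sketch below. The coefficients are taken from the regression output; the assumption that weather is coded from 1 (best) to 4 (worst) is ours, since the report only states that 1 is the baseline, and `predicted_count` is a name introduced for illustration. Note that the intercept itself corresponds to a weather value of 0, not to the baseline value of 1.

```python
# Coefficients as reported in the regression output.
INTERCEPT = 101.846
WEATHER_COEFFICIENT = -17.9718


def predicted_count(weather):
    """Predicted rentals from the fitted line: count = intercept + slope * weather."""
    return INTERCEPT + WEATHER_COEFFICIENT * weather


# Assumption: weather is coded 1 (best) to 4 (worst); only the baseline of 1 is stated in the report.
for weather in (1, 2, 3, 4):
    print(f"weather={weather}: predicted count = {predicted_count(weather):.1f}")
```

Each one-unit step in the weather value lowers the prediction by the same amount, about 18 rentals, which is what the coefficient expresses.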
Conclusion:
While the analysis confirms that weather significantly impacts bike rentals, the low R-square
value suggests that weather is not a substantial predictor by itself. Other variables should be
considered to create a more comprehensive model for predicting bike rentals.