
D3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS UNIT 3

UNIT III – INFERENTIAL STATISTICS


SYLLABUS:
Populations – samples – random sampling – Sampling distribution- standard
error of the mean - Hypothesis testing – z-test – z-test procedure –
decision rule – calculations – decisions – interpretations - one-tailed and
two-tailed tests – Estimation – point estimate – confidence interval – level
of confidence – effect of sample size.

PART A

1. List the z-test step-by-step procedure.


Step 1 - State the research problem.
Step 2 - Identify the statistical hypotheses.
Step 3 - Specify a decision rule.
Step 4 - Calculate the value of the observed z.
Step 5 - Make a decision.
Step 6 - Interpret the decision.

2. Define the term probability.


 Probability
 The proportion or fraction of times that a particular event is likely
to occur.
3. What is meant by Mutually Exclusive Events? State the Addition Rule for Mutually Exclusive Events.
Mutually Exclusive Events  Events that cannot occur together.
Addition Rule  Add together the separate probabilities of several mutually exclusive events to find the probability that any one of these events will occur:

Pr(A or B) = Pr(A) + Pr(B)

where Pr( ) refers to the probability of the event in parentheses and A and B are mutually exclusive events.
4. Narrate the symbols used for the mean and standard deviation of three types of distributions.

Distribution                          Mean     Standard Deviation
Population                            μ        σ
Sample                                X̄        s
Sampling distribution of the mean     μX̄       σX̄
5. Define Standard error of the mean

 STANDARD ERROR OF THE MEAN


 The distribution of sample means also has a standard deviation, referred to as the standard error of the mean.
 The standard error of the mean equals the standard deviation of the population divided by the square root of the sample size:

σX̄ = σ / √n
PART B

1. Give a detailed introduction about Population, Sample, and Probability.


 Population
 Any complete set of observations (or potential observations).
Types of Population
 Real Populations
o A real population is one in which all potential observations
are accessible at the time of sampling.
 Hypothetical Populations
o A hypothetical population is one in which all potential
observations are not accessible at the time of
sampling.

 Sample
 Any subset of observations from a population.
 The sample size is small relative to the population size.

Example 1
For each of the following pairs, indicate with a Yes or No
whether the relationship between the first and second
expressions could describe that between a sample and its
population, respectively.
(a) students in the last row; students in class
(b) citizens of Wyoming; citizens of New York
(c) 20 lab rats in an experiment; all lab rats, similar to
those used, that could undergo the same experiment
(d) all U.S. presidents; all registered Republicans
(e) two tosses of a coin; all possible tosses of a coin
Solution
(a) Yes
(b) No. Citizens of Wyoming aren’t a subset of citizens of New York.
(c) Yes
(d) No. All U.S. presidents aren’t a subset of all registered Republicans.
(e) Yes

Example 2
Identify all of the expressions from Example 1 that involve a hypothetical population.
Solution
Expressions in 1(c) and 1(e) involve hypothetical populations.

 Random Sampling
 A selection process that guarantees all potential observations in
the population have an equal chance of being selected.
 Inferential statistics requires that samples be random.

Example 3
Indicate whether each of the following statements is True or False.
A random selection of 10 playing cards from a deck of 52 cards implies that
(a) the random sample of 10 cards accurately represents the
important features of the whole deck.
(b) each card in the deck has an equal chance of being selected.
(c)it is impossible to get 10 cards from the same suit (for example,
10 hearts).
(d) any outcome, however unlikely, is possible.
Solution
a. False. Sometimes, just by chance, a random sample of 10 cards fails
to represent the important features of the whole deck.
b. True
c. False. Although unlikely, 10 hearts could appear in a random sample of
10 cards.
d. True

 Tables Of Random Numbers


 Tables of random numbers can be used to obtain a random sample.
 These tables are generated by a computer designed to equalize
the occurrence of any one of the 10 digits: 0, 1, 2, . . . , 8, 9.

Example 4
Describe how you would use the table of random numbers to take
a. a random sample of five statistics students in a classroom
where each of nine rows consists of nine seats.
b. a random sample of size 40 from a large directory consisting
of 3041 pages, with 480 lines per page.
Solution
a. There are many ways. For instance, consult the tables of random
numbers, using the first digit of each 5-digit random number to identify
the row (previously labeled 1, 2, 3, and so on), and the second digit of the
same random number to locate a particular student’s seat within that
row. Repeat this process until five students have been identified. (If the
classroom is larger, use additional digits so that every student can be
sampled.)
b. Once again, there are many ways. For instance, use the initial 4
digits of each random number (between 0001 and 3041) to identify
the page number of the telephone directory and the next 3 digits
(between 001 and 480) to identify the particular line on that
page. Repeat this process, using 7-digit numbers, until 40 telephone
numbers have been identified.
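
The table-of-random-numbers procedure can also be imitated in code. Below is a minimal Python sketch, using the hypothetical classroom and directory sizes from the example; the standard random module stands in for the printed tables:

import random

random.seed(1)  # for a reproducible illustration

# (a) Random sample of 5 students from a 9 x 9 classroom (81 seats).
seats = [(row, seat) for row in range(1, 10) for seat in range(1, 10)]
print(random.sample(seats, 5))  # sampling without replacement

# (b) Random sample of 40 entries from a directory of 3041 pages x 480 lines.
entries = [(random.randint(1, 3041), random.randint(1, 480)) for _ in range(40)]
print(entries[:3])  # duplicates are possible but unlikely; redraw if one occurs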

 Probability
 The proportion or fraction of times that a particular event is likely to
occur.

Mutually Exclusive Events


 Events that cannot occur together.
Addition Rule
 Add together the separate probabilities of several mutually exclusive events to find the probability that any one of these events will occur:

Pr(A or B) = Pr(A) + Pr(B)

where Pr( ) refers to the probability of the event in parentheses and A and B are mutually exclusive events.

Example 5
Assuming that people are equally likely to be born during any one of the 12 months, what is the probability of Jack being born during
(a) June?
(b) any month other than June?
(c) either May or June?
Solution
(a) 1/12
(b) 11/12
(c) 1/12 + 1/12 = 2/12 = 1/6 (addition rule for mutually exclusive events)

Independent Events
 The occurrence of one event has no effect on the probability that
the other event will occur.
Multiplication Rule
 Multiply together the separate probabilities of several independent events to find the probability that these events will occur together:

Pr(A and B) = Pr(A) × Pr(B)

where A and B are independent events.

Example 6
Assuming that people are equally likely to be born during any of the
months, and also assuming (possibly over the objections of
astrology fans) that the birthdays of married couples are
independent, what’s the probability of
(a) the husband being born during January and the wife being born
during February?
(b) both husband and wife being born during December?
(c) both husband and wife being born during the spring (April or
May)? (Hint: First, find the probability of just one person being born
during April or May.)
Solution
(a) Pr(January and February) = (1/12)(1/12) = 1/144
(b) Pr(December and December) = (1/12)(1/12) = 1/144
(c) Pr(one person born during April or May) = 1/12 + 1/12 = 1/6, so Pr(both) = (1/6)(1/6) = 1/36

Dependent Events
 When the occurrence of one event affects the probability of the
other event, these events are dependent.
 Although the heights of randomly selected pairs of men are
independent, the heights of brothers are dependent.

Conditional Probability
 The probability of one event, given the occurrence of another event.

Alternative Approach to Conditional Probabilities


 Conditional probabilities can be easily misinterpreted.
 Convert probabilities to frequencies (which, for example, total 100); solve the problem with frequencies; and then convert the answer back to a probability.
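
Both rules are easy to check by simulation, in the spirit of the frequency approach just described. A minimal Python sketch, using the hypothetical birth-month events from Examples 5 and 6:

import random

random.seed(0)
N = 100_000
months = range(1, 13)  # 1 = January, ..., 12 = December

# Addition rule (mutually exclusive): Pr(May or June) = 1/12 + 1/12 = 1/6
born = [random.choice(months) for _ in range(N)]
print(sum(m in (5, 6) for m in born) / N)  # close to 1/6 = 0.1667

# Multiplication rule (independent): Pr(husband born Jan and wife born Feb)
pairs = [(random.choice(months), random.choice(months)) for _ in range(N)]
print(sum(h == 1 and w == 2 for h, w in pairs) / N)  # close to 1/144 = 0.0069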

12. Explain in detail about Hypothesis Testing and its types.


Hypothesis Testing
 Hypothesis testing is a statistical method used to determine if there is
enough evidence in a sample data to draw conclusions about a
population.
 It is used to estimate the relationship between 2 statistical variables.
 It involves formulating two competing hypotheses, the null
hypothesis (H0) and the alternative hypothesis (Ha), and then
collecting data to assess the evidence.
 Hypothesis testing evaluates two mutually exclusive population
statements to determine which statement is most supported by
sample data.

Defining Hypotheses
 Null hypothesis (H0):
In statistics, the null hypothesis is a general statement or default position that there is no relationship between two measured cases or no difference among groups. In other words, it is a basic assumption made based on knowledge of the problem.
Example:
A company's mean production is 50 units per day, i.e. H0: μ = 50.
 Alternative hypothesis (H1):
The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis.
Example:
The company's mean production is not equal to 50 units per day, i.e. H1: μ ≠ 50.
Key Terms of Hypothesis Testing
 Level of significance:
o It refers to the degree of significance at which we accept or reject the null hypothesis. Since 100% accuracy is not possible when accepting a hypothesis, a level of significance is selected, usually 5%.
o This is normally denoted by α and is generally 0.05 or 5%, which means the output should be 95% confident to give a similar kind of result in each sample.
 P-value:
o The P-value, or calculated probability, is the probability of obtaining results at least as extreme as those observed, given that the null hypothesis (H0) is true.
o If the P-value is less than the chosen significance level, reject the null hypothesis, i.e. conclude that the sample data support the alternative hypothesis.
 Test Statistic:
o The test statistic is a numerical value calculated from sample data
during a hypothesis test, used to determine whether to reject the
null hypothesis.
o It is compared to a critical value or p-value to make decisions
about the statistical significance of the observed results.
 Critical value:
o The critical value in statistics is a threshold or cutoff point used to
determine whether to reject the null hypothesis in a hypothesis
test.
 Degrees of freedom:
o Degrees of freedom are associated with the variability or freedom one has in estimating a parameter.
o The degrees of freedom are related to the sample size and determine the shape of the sampling distribution (for example, of the t distribution).

Testing Null Hypothesis


The null hypothesis is tested by determining whether the one observed sample
mean qualifies as a common outcome or a rare outcome in the hypothesized
sampling distribution.

Figure 3.5 - Hypothesized sampling distribution of the mean, centered about a hypothesized population mean of 500.
 Common Outcomes
o An observed sample mean qualifies as a common outcome if the
difference between its value and that of the hypothesized population
mean is small enough to be viewed as a probable outcome under the
null hypothesis.
o Since there is no compelling reason for rejecting the null hypothesis, it is retained.
 Rare Outcomes
o An observed sample mean qualifies as a rare outcome if the difference
between its value and the hypothesized population mean is too large to
be reasonably viewed as a probable outcome under the null hypothesis.
Boundaries for Common and Rare Outcomes

Figure 3.6 - One possible set of common and rare outcomes (values of X).

Figure 3.6 shows one possible set of boundaries for common and rare
outcomes, expressed in values of X.
If the one observed sample mean is located between 478 and 522, it will
qualify as a common outcome, and the null hypothesis will be retained.
If, however, the one observed sample mean is greater than 522 or less than 478, it will qualify as a rare outcome, and the null hypothesis will be rejected.
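
This decision rule can be expressed directly in code. The following is a minimal Python sketch; the population standard deviation of 110 and sample size of 100 are assumptions chosen so that the standard error is 11, which reproduces boundaries of roughly 478 and 522 at the .05 level, and the observed sample mean is hypothetical:

import math

mu0 = 500                   # hypothesized population mean (Figure 3.5)
sigma, n = 110, 100         # assumed population sd and sample size
se = sigma / math.sqrt(n)   # standard error of the mean = 11

x_bar = 533                 # hypothetical observed sample mean
z = (x_bar - mu0) / se      # observed z

if abs(z) > 1.96:           # rare outcome at the .05 level (two-tailed)
    print(f"z = {z:.2f}: rare outcome, reject H0")
else:                       # common outcome
    print(f"z = {z:.2f}: common outcome, retain H0")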

UNIT-4
PART-A
1. State the null hypothesis in a one-sample t-test.
The null hypothesis states that the population mean is equal to a specified
value.
2. What is the sampling distribution of t?
The sampling distribution of t is the distribution of the t-statistic under the null
hypothesis.
3. Specify the purpose of a t-test for two independent samples.
It is used to compare the means of two independent groups to determine if they
are significantly different.
4. Define p-value.
The p-value is the probability of obtaining test results at least as extreme as the
observed results, under the assumption that the null hypothesis is true.
5. What is statistical significance?
Statistical significance indicates that the observed result is unlikely to have
occurred by chance under the null hypothesis.
6. What is a t-test for two related samples?
It compares the means of two related groups, such as paired observations or
repeated measures on the same subjects.
7. Define F-test.
The F-test is used to compare the variances of two populations or to test the
overall significance in ANOVA.
8. What is ANOVA?
ANOVA, or Analysis of Variance, is a statistical method used to compare the
means of three or more groups.
9. List the purpose of two-factor ANOVA.
Two-factor ANOVA analyzes the impact of two independent variables on a
dependent variable.
10. What is a chi-square test used for?
The chi-square test is used to determine if there is a significant association
between two categorical variables.
PART-B
1. Explain the procedure for conducting a one-sample t-test with an example.
The one-sample t-test is a statistical test used to determine whether the
mean of a single sample is significantly different from a known or
hypothesized population mean. This test is particularly useful when the
population standard deviation is unknown, and the sample size is relatively
small.

Detailed procedure:
1. **State the Hypotheses**:
- Null Hypothesis (H0): μ = μ0 (The sample mean is equal to the population
mean).
- Alternative Hypothesis (H1): μ ≠ μ0 (The sample mean is not equal to the
population mean).
This can also be a one-tailed test if a specific direction of difference is
hypothesized.
2. **Collect Sample Data**:
Gather a random sample from the population of interest and compute the
sample mean (x̄) and standard deviation (s).
3. **Calculate the Test Statistic**:
Use the formula: t = (x̄ - μ0) / (s / √n), where:
- x̄ is the sample mean,
- μ0 is the hypothesized population mean,
- s is the sample standard deviation,
- n is the sample size.
4. **Determine Degrees of Freedom (df)**:
Degrees of freedom for a one-sample t-test are calculated as df = n - 1.
5. **Find the Critical Value or P-Value**:
Using the t-distribution table or statistical software, find the critical value for
the chosen significance level (e.g., α = 0.05). Alternatively, calculate the p-
value corresponding to the computed t-statistic.
6. **Decision Rule**:
- If |t| > critical value, or if the p-value < α, reject the null hypothesis (H0).
- Otherwise, fail to reject H0.
7. **Interpret the Results**:
Clearly state whether the sample mean is significantly different from the
hypothesized population mean.

Example:
Suppose a nutritionist claims that the average weight of a type of apple is 150
grams. A sample of 20 apples has a mean weight of 155 grams and a
standard deviation of 10 grams. Using a one-sample t-test:
- Hypotheses: H0: μ = 150, H1: μ ≠ 150
- Test statistic: t = (155 - 150) / (10 / √20) ≈ 2.236
- Degrees of freedom: df = 20 - 1 = 19
Using a t-table at α = 0.05 (two-tailed), the critical value is approximately
±2.093.
Since |t| > 2.093, we reject H0 and conclude that the mean weight of the
apples is significantly different from 150 grams.
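
The apple example can be checked with a short Python calculation from the summary statistics alone; scipy (assumed available) supplies the p-value from the t distribution:

from math import sqrt
from scipy import stats

n, x_bar, s, mu0 = 20, 155, 10, 150       # summary statistics from the example

t = (x_bar - mu0) / (s / sqrt(n))         # test statistic, ~2.236
df = n - 1                                # 19
p = 2 * stats.t.sf(abs(t), df)            # two-tailed p-value, ~0.04

print(f"t = {t:.3f}, df = {df}, p = {p:.4f}")  # p < 0.05 -> reject H0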
2. Describe the t-test for two independent samples and its assumptions.

The t-test for two independent samples, also known as the independent t-test,
is used to compare the means of two unrelated groups to determine if they
are significantly different. This test is commonly applied in controlled
experiments, such as comparing the effect of two treatments on different
groups.

Key Assumptions:
1. **Independence**: Observations within each group must be independent.
2. **Normality**: The data in each group should approximately follow a
normal distribution.
3. **Homogeneity of Variances**: The variances of the two groups should be
equal. If this assumption is violated, a modified version of the test, such as
Welch’s t-test, can be used.

Procedure:
1. **State the Hypotheses**:
- Null Hypothesis (H0): μ1 = μ2 (The means of the two groups are equal).
- Alternative Hypothesis (H1): μ1 ≠ μ2 (The means of the two groups are not
equal).
2. **Collect Sample Data**:
Calculate the mean and standard deviation for each group.
3. **Compute the Test Statistic**:
The formula for the t-statistic depends on whether the variances are
assumed equal:
- For equal variances:
t = (x̄1 - x̄2) / √(s_p² * (1/n1 + 1/n2)),
where s_p² is the pooled variance:
s_p² = [(n1 - 1)s1² + (n2 - 1)s2²] / (n1 + n2 - 2).
- For unequal variances (Welch’s t-test):
t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2).
4. **Determine Degrees of Freedom (df)**:
- For equal variances: df = n1 + n2 - 2.
- For unequal variances, use an approximation formula.
5. **Compare with Critical Value or Compute P-Value**:
Use the t-distribution table or software to find the critical value or p-value.
6. **Decision Rule**:
- If |t| > critical value or p-value < α, reject H0.
- Otherwise, fail to reject H0.
Example:
A study compares the effectiveness of two teaching methods on students’ test
scores. Group 1 (n1 = 30) has a mean score of 75 (s1 = 8), and Group 2 (n2 =
30) has a mean score of 70 (s2 = 6). Assuming equal variances:
- s_p² = [(29 × 8²) + (29 × 6²)] / 58 = 50.
- t = (75 - 70) / √(50 × (1/30 + 1/30)) ≈ 2.739.
With df = 58, the critical value at α = 0.05 (two-tailed) is approximately ±2.002. Since |t| > 2.002, we reject H0 and conclude that the two teaching methods have significantly different effects on test scores.
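
If scipy is available, this test can be run directly from the summary statistics, as a quick check of the hand calculation above:

from scipy import stats

# Summary statistics from the example; equal variances assumed
result = stats.ttest_ind_from_stats(mean1=75, std1=8, nobs1=30,
                                    mean2=70, std2=6, nobs2=30,
                                    equal_var=True)
print(result)  # statistic ~2.74, p-value < 0.05 -> reject H0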

3. Discuss the importance of p-value and statistical significance in hypothesis testing.
1. Understanding the P-Value
The p-value, or probability value, quantifies the strength of evidence against
the null hypothesis (H₀). It represents the probability of obtaining a result as
extreme as, or more extreme than, the observed data, assuming that the null
hypothesis is true. Formally, the p-value is calculated using the test statistic
derived from the data under a given statistical test, such as a t-test, ANOVA,
or chi-square test.
 Low p-values: A low p-value (commonly < 0.05) suggests that the observed
data is unlikely under the null hypothesis, leading researchers to consider
rejecting it.
 High p-values: A high p-value indicates that the observed data is consistent
with the null hypothesis, and there is insufficient evidence to reject it.
The p-value is not the probability that the null hypothesis is true; rather, it
measures how well the data align with the assumptions of the null hypothesis.

2. Statistical Significance
Statistical significance provides a formal decision rule for hypothesis testing.
It is determined by comparing the p-value to a predefined threshold called the
significance level (α). Common choices for α include 0.05, 0.01, and 0.10,
although the choice depends on the context of the study and the potential
consequences of errors.
 Significance Level (α): This is the maximum probability of making a Type I
error—rejecting a true null hypothesis. If the p-value is less than α, the result
is deemed statistically significant.
For example, if α is set at 0.05 and the p-value is 0.03, the result is
statistically significant, suggesting sufficient evidence to reject the null
hypothesis. This significance implies that the observed effect or relationship is
unlikely to be due to random chance alone.

3. Why the P-Value Matters


a. Decision-Making Framework: The p-value provides a standardized
criterion for decision-making. By comparing the p-value to the significance
level, researchers can systematically evaluate hypotheses, reducing
subjectivity in interpretations.
b. Evidence Quantification: P-values offer a quantitative measure of
evidence against the null hypothesis, allowing researchers to assess the
strength of their findings rather than relying solely on binary outcomes (e.g.,
accept/reject).
c. Versatility Across Tests: P-values are applicable to various statistical
tests, making them a universal tool for hypothesis testing. Whether the study
involves comparing means, assessing correlations, or testing independence,
the p-value serves as a consistent metric.

4. Limitations and Misinterpretations of P-Values


Despite their utility, p-values are often misinterpreted or misused.
Understanding their limitations is critical to avoiding erroneous conclusions.
a. Not Proof of Truth: A significant p-value does not confirm the truth of the
alternative hypothesis or the falsity of the null hypothesis. It merely indicates
that the observed data are inconsistent with the null hypothesis under the
given assumptions.
b. Dependence on Sample Size: The p-value is influenced by sample size.
Large samples can produce significant p-values even for trivial effects, while
small samples may fail to detect meaningful differences.
c. Arbitrary Thresholds: The choice of significance level (α) is subjective,
and small variations in α can alter conclusions. This underscores the need for
context-specific interpretation.

5. Practical Applications
a. Scientific Research: P-values and statistical significance are integral to
research across disciplines, from medicine to economics. They help determine
whether observed effects are genuine or likely due to random variation.
b. Policy and Decision Making: In applied settings, such as public health or
business, statistically significant results guide actionable decisions. For
instance, a pharmaceutical trial might rely on p-values to evaluate the
efficacy of a new drug.
c. Replicability and Reliability: Consistent findings of statistical
significance across multiple studies enhance the credibility and
generalizability of results. Researchers often replicate experiments to confirm
findings, emphasizing the role of p-values in verifying reliability.

6. Statistical Significance in the Broader Context


While p-values and statistical significance are vital tools, they should not be
the sole criteria for decision-making. Effect sizes, confidence intervals, and
the broader context of the research question are equally important for a
comprehensive interpretation of results.
 Effect Sizes: Provide information about the magnitude of an observed effect,
complementing p-values.
 Confidence Intervals: Offer a range of plausible values for the parameter of
interest, enhancing understanding of uncertainty.

Conclusion
The p-value and statistical significance are indispensable in hypothesis
testing, providing a rigorous framework for evaluating evidence and making
decisions. However, they must be interpreted carefully, considering their
limitations and the broader research context. By combining these metrics with
other statistical tools, researchers can draw robust and meaningful
conclusions, advancing knowledge across disciplines.

4. Explain ANOVA and its application in two-factor experiments.

Introduction to ANOVA
Analysis of Variance (ANOVA) is a statistical technique used to determine whether there are
significant differences among the means of three or more groups. It is particularly useful when
comparing multiple datasets to identify variations attributable to different factors or
experimental conditions. Unlike a simple t-test that compares means between two groups,
ANOVA expands this capability to multiple groups, offering a robust methodology for complex
experimental designs.
Types of ANOVA
There are several types of ANOVA, tailored to different experimental scenarios:
 One-Way ANOVA: Used when one factor with multiple levels is under consideration.
 Two-Way ANOVA: Suitable when two factors are studied simultaneously.
 Multivariate ANOVA (MANOVA): Extends ANOVA to handle multiple dependent
variables.

Two-Way ANOVA
Two-way ANOVA is a statistical method employed to evaluate the effect of two independent
factors on a dependent variable. It also examines the interaction between these factors,
providing insights into whether the combined effect of the factors differs from their individual
effects.
Key Components
1. Factors and Levels:
o Each factor represents an independent variable.
o Each factor has multiple levels (e.g., different treatments, groups, or categories).
2. Main Effects:
o These are the effects of each factor independently on the dependent variable.
3. Interaction Effects:
o These occur when the effect of one factor depends on the level of the other
factor.
Model Structure
A two-way ANOVA model can be expressed as:

Y_ijk = μ + α_i + β_j + (αβ)_ij + ε_ijk

Where:
 Y_ijk: Observed value for the k-th observation in the i-th level of factor A and the j-th level of factor B.
 μ: Overall mean.
 α_i: Effect of the i-th level of factor A.
 β_j: Effect of the j-th level of factor B.
 (αβ)_ij: Interaction effect between factors A and B.
 ε_ijk: Random error.

Applications of Two-Way ANOVA in Experiments


Two-way ANOVA is widely used in experimental research where two independent variables are
manipulated. Common applications include:
1. Agricultural Studies:
Researchers might analyze the effects of fertilizer types (factor A) and irrigation levels (factor
B) on crop yield. Two-way ANOVA can determine whether these factors independently
influence yield and whether their interaction produces a combined effect.
2. Industrial Quality Control:
In manufacturing, a two-way ANOVA could assess the effects of different machine settings
(factor A) and raw material types (factor B) on product quality.
3. Medical Research:
Medical trials often investigate the impact of treatment types (factor A) and patient
demographics (factor B) on health outcomes. For instance, researchers might study how
different medications interact with age groups to influence recovery times.
4. Psychological Experiments:
Two-way ANOVA can analyze behavioral data, such as the effects of learning methods (factor
A) and environmental settings (factor B) on test performance.

Steps in Conducting Two-Way ANOVA


1. Formulating Hypotheses:
 Null hypotheses: No main effects or interaction effects.
 Alternate hypotheses: Presence of main effects and/or interaction effects.
2. Data Collection:
 Ensure random sampling and equal sample sizes for robustness.
3. Assumptions:
 Independence of observations.
 Normality of data distribution.
 Homogeneity of variance across groups.
4. Computations:
 Partition total variance into components for factors A, B, their interaction, and error.
 Compute F-statistics to test hypotheses.
5. Interpretation:
 Significant F-values indicate differences attributable to factors or their interaction.
 Post hoc tests (e.g., Tukey’s HSD) can be used for detailed pairwise comparisons.
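
These steps map directly onto statsmodels. The following is a minimal sketch using a small hypothetical dataset for the fertilizer/irrigation example mentioned earlier; pandas and statsmodels are assumed available, and the column and level names are illustrative:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical toy data: 2 fertilizers x 2 irrigation levels, 3 replicates each
data = pd.DataFrame({
    "fertilizer": ["A"] * 6 + ["B"] * 6,
    "irrigation": (["low"] * 3 + ["high"] * 3) * 2,
    "yield_":     [20, 22, 19, 30, 31, 29, 25, 24, 26, 28, 27, 29],
})

# 'A * B' in the formula expands to both main effects plus their interaction
model = ols("yield_ ~ C(fertilizer) * C(irrigation)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))  # F-statistic and p-value per effect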

Advantages and Limitations


Advantages:
1. Simultaneous analysis of two factors reduces experiment complexity.
2. Detects interaction effects, providing richer insights into data relationships.
3. Enhances statistical power by reducing unexplained variance.
Limitations:
1. Requires balanced data; unequal sample sizes can complicate analysis.
2. Sensitive to violations of assumptions (e.g., non-normality).
3. Interpretation of interaction effects can be challenging in complex designs.

5. Describe the chi-square test and its applications with an example.

Chi Square Test


The Chi-Square Test: Description and Applications with an Example
The Chi-Square test is a statistical tool used to determine if there is a
significant association between categorical variables. This non-parametric test
evaluates whether observed data significantly deviate from expected data
under a given hypothesis. Originating from Karl Pearson's work in 1900, the
test has since become a cornerstone in hypothesis testing for categorical
data.
Overview of the Chi-Square Test
The Chi-Square test measures the discrepancy between observed and
expected frequencies in categorical data. The test is based on the Chi-Square
(χ²) statistic, calculated using the formula:
χ² = Σ [(O_i - E_i)^2 / E_i]
Where:
 O_i = Observed frequency in category i
 E_i = Expected frequency in category i
 Σ = Summation over all categories
The resulting χ² value is compared to a critical value from the Chi-Square
distribution table, determined by the degrees of freedom and the chosen
significance level (α). If the calculated χ² exceeds the critical value, the null
hypothesis is rejected, indicating a statistically significant difference between
the observed and expected data.
Types of Chi-Square Tests
1. Chi-Square Test for Independence: This test evaluates whether two
categorical variables are independent. For example, it can determine if
gender and product preference are related.
2. Chi-Square Test for Goodness of Fit: This test assesses whether a sample
distribution matches a theoretical distribution. For example, it can test if a die
is fair by comparing the observed roll frequencies to the expected uniform
distribution.
Assumptions of the Chi-Square Test
1. The data must be categorical.
2. Observations should be independent.
3. Expected frequencies in each category should be at least 5 for the test to be
reliable.
Applications of the Chi-Square Test
The Chi-Square test has diverse applications across fields such as:
1. Social Sciences: To study relationships between demographic variables, like
age and voting preference.
2. Market Research: To evaluate consumer behaviour and product
preferences.
3. Biology: To analyse genetic inheritance patterns, such as Mendelian ratios.
4. Education: To determine if teaching methods influence student performance
based on categorical grades.
5. Healthcare: To investigate associations between lifestyle factors and health
conditions.
Example: Chi-Square Test for Independence
Research Question: Is there a relationship between exercise frequency and
diet preference?
Data Collection: A survey collects responses from 200 individuals,
categorizing them by exercise frequency (Low, Moderate, High) and diet
preference (Vegetarian, Non-Vegetarian).

Non T
Veg - o
etar Veg t
ian etar a
ian l

Lo
w
Ex
erc 8
ise 30 50 0
Mo
der
ate
Ex
erc 7
ise 40 30 0
Hig
h
Ex
erc 5
ise 20 30 0
2
Tot 0
al 90 110 0
Step 1: Hypotheses
 Null Hypothesis (H0): Exercise frequency and diet preference are
independent.
 Alternative Hypothesis (H1): Exercise frequency and diet preference are
dependent.
Step 2: Calculate Expected Frequencies The expected frequency (E) for
each cell is calculated as:
E = (Row Total × Column Total) / Grand Total
For example, the expected frequency for Low Exercise and Vegetarian is: E =
(80 × 90) / 200 = 36
Step 3: Compute the Chi-Square Statistic Using the formula, calculate χ²
for all cells:
χ² = Σ [(O_i - E_i)^2 / E_i]
Step 4: Compare with Critical Value Determine degrees of freedom (df):
df = (Number of Rows - 1) × (Number of Columns - 1) df = (3 - 1) × (2 - 1) = 2
At α = 0.05, the critical value for df = 2 is 5.991.
If the calculated χ² exceeds 5.991, reject the null hypothesis.
Step 5: Interpret Results If the null hypothesis is rejected, conclude that
exercise frequency and diet preference are significantly associated.
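
If scipy is available, the whole calculation can be delegated to chi2_contingency, using the observed table above as a check:

from scipy.stats import chi2_contingency

observed = [[30, 50],   # Low exercise:      Vegetarian, Non-Vegetarian
            [40, 30],   # Moderate exercise
            [20, 30]]   # High exercise

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
# chi2 ~ 6.49 > 5.991 (critical value at alpha = 0.05, df = 2) -> reject H0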
UNIT-5
PART-A
1.What Is Predictive Analytics?
 The term predictive analytics refers to the use of statistics and modeling techniques
to make predictions about future outcomes and performance.
 Predictive analytics looks at current and historical data patterns to determine if those
patterns are likely to emerge again.
 It allows businesses and investors to adjust where they use their resources to take
advantage of possible future events.

2. List the uses of Predictive model analysis

a. Weather forecasts
b. Creating video games
c. Translating voice to text for mobile phone messaging
d. Customer service

3. Define credit scoring.
Credit scoring makes extensive use of predictive analytics.
Example: When a consumer or business applies for credit, data on the applicant's credit history and the credit records of borrowers with similar characteristics are used to predict the risk that the applicant might fail to perform on any credit extended.

4. Specify the Decision Trees


Decision trees are the simplest models because they're easy to understand and
dissect. They're also very useful when you need to make a decision in a short
period of time.

5. Regression is the model that is used the most in statistical analysis. Explain.

Use regression when you want to determine patterns in large sets of data and when there is a linear relationship between the inputs. This method works by finding a formula that represents the relationship between all the inputs found in the dataset. For example, you can use regression to figure out how price and other key factors shape the performance of a security.

6. What are the various steps of Data Analysis?


 The first step is to determine the data requirements.
 The second step is to collect the data.
 Once the data is collected, it must be organized so it can be analyzed.
 The data is then cleaned up before analysis.

7. What is linear least squares?


In statistics and mathematics, linear least squares is an approach to fitting a mathematical or statistical model to data in cases where the idealized value provided by the model for any data point is expressed linearly in terms of the unknown parameters of the model.

8. What is weighted resampling?


A sample in which each sampling unit has been assigned a weight for use in subsequent analysis. Common uses include survey weights to adjust for intentional oversampling of some units relative to others.
9. What are the five types of regression model?
1. Linear Regression
2. Logistic Regression
3. Ridge Regression
4. Lasso Regression
5. Polynomial Regression

10. Define auto correlation in statistics.


Autocorrelation refers to the degree of correlation of the same variable between two successive time intervals. It measures how the lagged version of the value of a variable is related to the original version of it in a time series. Autocorrelation, as a statistical concept, is also known as serial correlation.
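
As a quick illustration (assuming pandas is available), the lag-1 autocorrelation of a series is the correlation between the series and a copy of itself shifted by one time step:

import pandas as pd

s = pd.Series([2, 4, 5, 7, 8, 10, 11, 13])  # hypothetical time series
print(s.autocorr(lag=1))  # correlation between the series and its lag-1 copy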

PART-B
11. Explain about the linear least squares method in detail.
Least squares method:
 Now that we have determined the loss function, the only thing left to do is minimize it.
 This is done by finding the partial derivatives of L, equating them to 0 and then finding expressions for m and c.
 After we do the math, we are left with these equations:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
c = ȳ − m·x̄

Here x̄ is the mean of all the values in the input X and ȳ is the mean of all the values in the desired output Y.
 This is the Least Squares method.
 Now we will implement this in python and make predictions.
Implementing the model:
This is an example of implementing the model in Python.
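
A minimal sketch, assuming NumPy and Matplotlib and a small hypothetical dataset; it computes m and c using the least squares formulas derived above:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 6], dtype=float)

# Least squares estimates of the slope m and intercept c
x_mean, y_mean = X.mean(), Y.mean()
m = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
c = y_mean - m * x_mean
print(f"m = {m:.3f}, c = {c:.3f}")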

Making Predictions:

# Making predictions
Y_pred = m*X + c

plt.scatter(X, Y)  # actual data points
plt.plot([min(X), max(X)], [min(Y_pred), max(Y_pred)], color='red')  # fitted line
plt.show()
 There won’t be much accuracy, because we are simply taking a straight line and forcing it to fit the given data in the best possible way.
 But you can use this to make simple predictions or get an idea about the magnitude/range of the real value.
 This is also a good first step for beginners in machine learning.

12. Explain in detail about Regression using statsmodels.


Introduction:

 statsmodels is a Python module that provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and statistical data exploration.
Linear Regression:
 statsmodels supports linear models with independently and identically distributed errors, as well as errors with heteroscedasticity or autocorrelation.
 The module allows estimation by ordinary least squares (OLS), weighted least squares (WLS), generalized least squares (GLS), and feasible generalized least squares with autocorrelated AR(p) errors.
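
A minimal OLS sketch with statsmodels, reusing the small hypothetical X and Y arrays from the previous answer:

import numpy as np
import statsmodels.api as sm

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 6], dtype=float)

X_design = sm.add_constant(X)      # adds the intercept column
model = sm.OLS(Y, X_design).fit()  # ordinary least squares fit
print(model.params)                # [intercept, slope]
print(model.summary())             # coefficients, standard errors, R-squared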
13. Explain in detail about Multiple regression.

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Multiple regression is an extension of simple linear regression, which uses just one explanatory variable.
 In our daily lives, we come across variables that are related to each other. To study the degree of relationship between these variables, we make use of correlation.
 To find the nature of the relationship between the variables, we have another measure, which is known as regression.

In this, we use correlation and regression to find equations with which we can estimate the value of one variable when the values of the other variables are given.

Multiple regression analysis is a statistical technique that analyzes the relationship between two or more variables and uses the information to estimate the value of the dependent variable.
 In multiple regression, the objective is to develop a model that relates a dependent variable y to more than one independent variable.

Multiple Regression formula:

In simple linear regression, only one independent variable and one dependent variable are involved. In the case of multiple regression, there is a set of independent variables that helps us to better explain or predict the dependent variable y.

The multiple regression equation is given by

y = a + b₁x₁ + b₂x₂ + …… + bₖxₖ

where x₁, x₂, …, xₖ are the k independent variables and y is the dependent variable.

Multiple Regression Analysis:


 Multiple regression analysis permits us to control explicitly for many other circumstances that concurrently influence the dependent variable.
 The objective of regression analysis is to model the relationship between a dependent variable and one or more independent variables.
 Let k represent the number of independent variables, denoted by x₁, x₂, x₃, ……, xₖ. Such an equation is useful for the prediction of the value of y when the values of x are known.

Advantages of Stepwise Multiple Regression:


 Only independent variables with nonzero regression coefficients are included in the regression equation.
 The changes in the multiple standard error of estimate and the coefficient of determination are shown.
 Stepwise multiple regression is efficient in finding the regression equation with only significant regression coefficients.
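
A minimal multiple-regression sketch using statsmodels' formula interface, with a small hypothetical dataset containing two explanatory variables x1 and x2:

import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical dataset: y modeled on two explanatory variables
data = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "y":  [3, 4, 8, 9, 13, 14],
})

model = ols("y ~ x1 + x2", data=data).fit()  # y = a + b1*x1 + b2*x2
print(model.params)                          # estimated a, b1, b2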
14. Explain in detail about Time series Analysis.

Time series analysis is indispensable in data science, statistics, and analytics.


 At its core, time series analysis focuses on studying and interpreting a sequence of
data
points recorded or collected at consistent time intervals.
 Unlike cross-sectional data, which captures a snapshot in time, time series data is
fundamentally dynamic, evolving over chronological sequences both short and
extremely
long.
 This type of analysis is pivotal in uncovering underlying structures within the data,
such as trends, cycles, and seasonal variations.
 Technically, time series analysis seeks to model the inherent structures within the
data, accounting for phenomena like autocorrelation, seasonal patterns, and trends.
 The order of data points is crucial; rearranging them could lose meaningful insights or
distort interpretations.
 Furthermore, time series analysis often requires a substantial dataset to maintain the
statistical significance of the findings.
 This enables analysts to filter out 'noise,' ensuring that observed patterns are not
mere
outliers but statistically significant trends or cycles.

Components of Time Series Data:


Time series data is generally comprised of different components that characterize the
patterns and behavior of the data over time. By analyzing these components, we can
better understand the dynamics of the time series and create more accurate models.
Four main elements make up a time series dataset:
 Trends
 Seasonality
 Cycles
 Noise
In summary, the key components of time series data are:
 Trends: Long-term increases, decreases, or stationary movement
 Seasonality: Predictable patterns at fixed intervals
 Cycles: Fluctuations without a consistent period
 Noise: Residual unexplained variability
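
These components can be separated in code. A minimal sketch using statsmodels' seasonal_decompose on a hypothetical monthly series built from a trend, yearly seasonality, and noise (pandas, NumPy, and statsmodels assumed available):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series: upward trend + yearly seasonality + noise
idx = pd.date_range("2015-01", periods=48, freq="MS")
t = np.arange(48)
rng = np.random.default_rng(0)
series = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12)
                   + rng.normal(0, 1, 48), index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())   # long-term movement
print(result.seasonal.head(12))       # repeating 12-month pattern
print(result.resid.dropna().head())   # residual noise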

Time Series analysis example:

You will import the file to R with the following commands:

> library_borrowing <- read.csv("C:/Table.1", header=T, dec=",", sep=";")

Note that paths use forward slashes "/" instead of backslashes.

> plot(library_borrowing[, 5], type="l", lwd=2, col="red", xlab="Years",
       ylab="Number of books", main="Number of books borrowed from the library")

The result of the above code is a line plot of the number of books borrowed from the library over the years.

15. Explain predictive analysis in detail.

Predictive analytics determines the likelihood of future outcomes using techniques like
data mining, statistics, data modeling, artificial intelligence, and machine learning.
Put simply, predictive analytics interprets an organization’s historical data to make
predictions about the future.
Today’s predictive analytics techniques can discover patterns in the data to identify
upcoming risks and opportunities for an organization.

Importance of Predictive Analysis

Predictive analytics allows organizations to be more proactive in the way they do business, detecting trends to guide informed decision-making.
With predictive models, organizations no longer have to rely on educated guesses, because forecasts provide additional insight.
The benefits of predictive analytics vary by industry, but here are some common reasons for forecasting:
 Improve profit margins.
 Optimize marketing campaigns.
 Reduce risk.

Steps to effectively implement Predictive analysis:


Problem Definition
It may seem obvious, but the very first step in introducing predictive analytics is to precisely define its scope. There can be various applications, which change according to their purpose and to the company's industry.
Some well-known examples come from forecasting models, anomaly detection algorithms, and churn analysis tools.
During this phase it is also important to understand which data are necessary and where they exist.
Data collection
In this step we take the necessary data (both structured and unstructured) from
different sources.
In the ideal scenario there is a Data Lake, designed and maintained for this purpose, or
at least a Data Warehouse with its staging area from which we can retrieve the data.
Data manipulation and descriptive analysis
During this phase, data are organized for their final purpose: being used by predictive analytics models to solve problems.

Statistical analysis
Once the final form of the data is obtained, it is possible to proceed with a statistical analysis of parameters, so that previous hypotheses are directly tested or insights are extracted through metrics visualization.

Modeling
Once the data are thoroughly set up, predictive models can be tested, and the necessary experiments can be carried out to obtain a model with satisfactory predictive performance.

Implementation
This is the stage of actual deployment. After performing all the required tests, evaluating the quality of the models, and validating the output data, it is possible to implement the predictive analytics tool in production, so that it provides predictions able to solve the problem stated in the first step.

Choosing predictive analytics tools:

Identify the business objective.
Before you do anything else, clearly define the question you want predictive analytics to answer. Generate a list of queries and prioritize the questions that mean the most to your organization.

Determine the datasets.


Once you outline a list of clear objectives, determine if you have the data available to
answer those queries. Make sure that the datasets are relevant, complete, and large
enough for predictive modeling.

Create processes for sharing and using insights.


Any opportunities or threats you uncover will be useless if there’s not a process in
place to act on those findings. Ensure proper communication channels are in place so
that valuable predictions end up in the right hands.

Choose the right software solutions.


Your organization needs a platform it can depend on and tools that empower people of all skill levels to ask deeper questions of their data. Tableau's advanced analytics tools support time series analysis, allowing you to run predictive analysis like forecasting within a visual analytics interface.
