FDS UNIT 3 QB
1. During their first swim through a water maze, 15 laboratory rats made the
following number of errors (blind alleyway entrances): 2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2,
12, 10, 4, 3. Find the mode and median for these data.
Mode: The mode of a data set is the number that occurs most frequently in the set.
Data Frequency
2 2
3 2
4 1
5 3
6 1
7 1
8 1
10 1
12 1
17 1
28 1
Here the value 5 occurs most frequently (3 times) compared with the other values. Hence the
mode of these data is 5.
Median: The median M is the midpoint of the distribution.
Ordered list of the given data set: 2, 2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 10, 12, 17, 28
Total number of observations in the data set: 15
Here n = 15, which is odd, so the median is located at the (15+1)/2 = 8th position in the
ordered list, which holds the value 5.
So, the median is 5.
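The hand calculation above can be cross-checked with a short sketch. It assumes only Python's standard `statistics` and `collections` modules; the variable names are illustrative and not part of the original answer.

```python
from collections import Counter
from statistics import median, multimode

# Errors made by the 15 rats (data from Question 1)
errors = [2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]

# Frequency of each distinct value, as in the Data/Frequency table
frequency = Counter(errors)

# multimode returns every value tied for the highest frequency
modes = multimode(errors)

# For an odd n, median returns the middle value of the sorted data
middle = median(errors)

print("frequency of 5:", frequency[5])
print("mode(s):", modes)
print("median:", middle)
```

`multimode` is used instead of `mode` so that a tie between two values would be reported rather than hidden.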
2. Mention the essential and optional guidelines to be followed for frequency
distributions, or state the “Guidelines for frequency distribution”.
Each observation should be included in one, and only one, class.
List all classes, even those with zero frequencies.
All classes should have equal intervals.
All classes should have both an upper boundary and a lower boundary.
Select the class interval from convenient numbers, particularly 5 and
10 or multiples of 5 and 10.
The lower boundary of each class should be a multiple of the class interval.
Aim for a total of approximately 10 classes.
3. GRE scores for a group of graduate school applicants are distributed as follows:
Convert to a relative frequency distribution. When calculating proportions, round
numbers to two digits to the right of the decimal point, using the rounding
procedure
(a) Convert the distribution of GRE scores shown in above table to a cumulative
frequency distribution.
(b) Convert the distribution of GRE scores obtained in above table to a cumulative
percent frequency distribution
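Solution: a)
Relative frequency = f/200 (total number of applicants = 200). Cumulative frequency is
accumulated from the lowest class upward.
GRE        Frequency (f)   Relative frequency   Cumulative frequency
725-749    1               1/200 = 0.005        200
700-724    3               3/200 = 0.015        199
675-699    14              14/200 = 0.07        196
650-674    30              30/200 = 0.15        182
625-649    34              34/200 = 0.17        152
600-624    42              42/200 = 0.21        118
575-599    30              30/200 = 0.15        76
550-574    27              27/200 = 0.135       46
525-549    13              13/200 = 0.065       19
500-524    4               4/200 = 0.02         6
475-499    2               2/200 = 0.01         2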
b)
Percent = 100 × f/200. Cumulative percent = 100 × cumulative frequency/200.
GRE        Frequency (f)   Relative frequency   Percent   Cumulative percent
725-749    1               1/200 = 0.005        0.5%      100%
700-724    3               3/200 = 0.015        1.5%      99.5%
675-699    14              14/200 = 0.07        7%        98%
650-674    30              30/200 = 0.15        15%       91%
625-649    34              34/200 = 0.17        17%       76%
600-624    42              42/200 = 0.21        21%       59%
575-599    30              30/200 = 0.15        15%       38%
550-574    27              27/200 = 0.135       13.5%     23%
525-549    13              13/200 = 0.065       6.5%      9.5%
500-524    4               4/200 = 0.02         2%        3%
475-499    2               2/200 = 0.01         1%        1%
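As a cross-check on the GRE tables, the following sketch derives relative, cumulative, and cumulative percent frequencies from the class frequencies. The frequencies are the numerators that appear in the solution; the variable names are illustrative assumptions.

```python
# Class intervals listed from highest to lowest, as in the table
classes = ["725-749", "700-724", "675-699", "650-674", "625-649", "600-624",
           "575-599", "550-574", "525-549", "500-524", "475-499"]
freqs = [1, 3, 14, 30, 34, 42, 30, 27, 13, 4, 2]

total = sum(freqs)

# Relative frequency: each class frequency divided by the total
relative = [f / total for f in freqs]

# Cumulative frequency: accumulate from the lowest class upward
cumulative = []
running = 0
for f in reversed(freqs):
    running += f
    cumulative.append(running)
cumulative.reverse()  # back to highest-class-first order

# Cumulative percent: cumulative frequency as a percentage of the total
cumulative_percent = [100 * c / total for c in cumulative]

for row in zip(classes, freqs, relative, cumulative, cumulative_percent):
    print(*row)
```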
4. Write a short note on stem-and-leaf display. Represent the following data in a
stem-and-leaf display: 67, 74, 63, 88, 82, 97, 65, 79
A stem-and-leaf display is used to present quantitative data in a graphical format,
similar to a histogram, to assist in visualizing the shape of a distribution.
A stem-and-leaf plot displays data by splitting up each value in a dataset into a
“stem” and a “leaf.”
Raw data: 67, 74, 63, 88, 82, 97, 65, 79
Stem   Leaf
6      7 3 5
7      4 9
8      8 2
9      7
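A minimal sketch of how the display can be built, assuming two-digit whole numbers so that the tens digit is the stem and the units digit is the leaf. The function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's units digit (leaf) under its tens digit (stem)."""
    stems = defaultdict(list)
    for v in values:
        stems[v // 10].append(v % 10)
    # Return stems in ascending order; leaves keep their order of appearance
    return dict(sorted(stems.items()))

scores = [67, 74, 63, 88, 82, 97, 65, 79]
display = stem_and_leaf(scores)

for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Leaves are left in their order of appearance to match the display above; sorting each leaf list would give the ordered variant of the plot.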
7. Define frequency distribution.
A frequency distribution is a collection of observations produced by sorting observations
into classes and showing their frequency (f) of occurrence in each class.
8. The IQ scores for a group of 35 high school dropouts are given in the table:
a) Construct a frequency distribution for grouped data.
(b) Specify the real limits for the lowest class interval in this frequency distribution.
Solution
(a) Calculating the class width: (123 – 69)/10 = 54/10 = 5.4
Round off to a convenient number, such as 5.
(b) 64.5–69.5
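The arithmetic behind this answer can be sketched as follows. The sketch assumes whole-number IQ scores with a lowest score of 69 and a highest score of 123 (the values used in the width calculation), about 10 classes, and a lowest class that starts at a multiple of the class width.

```python
highest, lowest = 123, 69   # extreme scores taken from the width calculation
target_classes = 10         # guideline: aim for about 10 classes

raw_width = (highest - lowest) / target_classes   # 5.4
width = 5                                         # rounded to a convenient number

# Lowest class starts at the largest multiple of the width not above the lowest score
lower_limit = (lowest // width) * width           # 65
upper_limit = lower_limit + width - 1             # 69

# Real limits extend half a unit beyond the stated limits
real_limits = (lower_limit - 0.5, upper_limit + 0.5)

print(raw_width, (lower_limit, upper_limit), real_limits)
```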
9. What are some possible poor features of the following frequency distribution?
Solution:
Not all observations can be assigned to one and only one class (because of
gap between 20–22 and 25–30 and overlap between 25–30 and 30–34).
All classes are not equal in width (25–30 versus 30–34).
All classes do not have both boundaries (35–above).
10. Define Outlier.
An outlier is an observation that lies an abnormal distance from other values in
a random sample from a population.
In other words, an outlier is a very extreme score that stands apart from the rest
of the data.
11. Identify any outliers in each of the following sets of data collected from nine
college students.
Solution:
Outliers are a summer income of $25,700; an age of 61; and a family size of 18. No outliers
for GPA.
12. List out the typical shapes of smoothed frequency distribution.
Normal
Bimodal
Positively skewed
Negatively skewed
13. Define mean.
The mean is the average of a set of observations.
i.e., the sum of the observations divided by the number of observations.
If the n observations are written as $x_1, x_2, \ldots, x_n$, their mean can be written
mathematically as:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
We read the symbol $\bar{x}$ as “x-bar.”
The bar notation is commonly used to represent the sample mean, i.e., the mean of
the sample.
14. Find the sample mean value for the best actress Oscar winner data set: 34 34 26 37
42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33.
Solution:
The sum of the 32 observations is 1233, so the sample mean is 1233/32 ≈ 38.53.
The mean and the median are the most common measures of center.
Each describes the center of a distribution of values in a different way.
The mean describes the center as an average value, in which the actual values of
the data points play an important role.
The median, on the other hand, locates the middle value as the center, and the
order of the data is the key.
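A short sketch, using only Python's standard `statistics` module, computes both measures of center for the Best Actress data in Question 14 so that they can be compared. Variable names are illustrative.

```python
from statistics import mean, median

# Best Actress Oscar winner data from Question 14
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

n = len(ages)
sample_mean = mean(ages)      # sum of the observations divided by n
sample_median = median(ages)  # average of the two middle values, since n is even

print(n, sum(ages), sample_mean, sample_median)
```

The mean exceeds the median here because a few large values (61, 74, 80) pull the average upward, while the median depends only on the order of the data.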
18. Define mode.
The mode of a data set is the number that occurs most frequently in the set.
• If no value appears more than once in the data set, the data set has no
mode.
• If two (or more) values are tied for the highest number of appearances in the
data set, they are all modes.
19. When to use mean/ median?
Use the sample mean as a measure of center for symmetric distributions with no
outliers.
Otherwise, the median will be a more appropriate measure of the center of our
data.
20. What do you mean by range?
The range measures the spread of the data between the limits of a data set.
It is calculated as a difference between the highest and lowest values in the data
set.
The larger the range, the greater the spread of the data.
The range covered by the data is the most intuitive measure of variability.
The range is exactly the distance between the smallest data point (min) and the
largest one (Max).
Range = Max – min
21. Define standard deviation.
The standard deviation quantifies the spread of a distribution by measuring
how far the observations are from their mean.
The standard deviation gives the average (or typical) distance between a
data point and the mean.
It is a measure of the overall spread (variability) of the values of a data set
around the mean.
The more spread out a data set is, the greater are the distances from the mean and
the standard deviation.
There are many notations for the standard deviation: SD, s, Sd, StDev.
22. Compute the standard deviation of the sample data: 3, 5, 7 with a sample mean
of 5.
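A minimal sketch of the computation asked for in Question 22, assuming the sample formula with n – 1 in the denominator. The manual steps are shown alongside `statistics.stdev`, which uses the same denominator.

```python
import math
from statistics import stdev

data = [3, 5, 7]
n = len(data)
sample_mean = sum(data) / n                       # 5.0, as given in the question

# Deviations from the mean: -2, 0, 2; squared deviations: 4, 0, 4
squared_deviations = [(x - sample_mean) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)          # 8.0

variance = sum_of_squares / (n - 1)               # 8 / 2 = 4.0
s = math.sqrt(variance)                           # 2.0

print(s, stdev(data))
```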
23. Define degrees of freedom.
Degrees of freedom (df) refers to the number of values that are free to vary,
given one or more mathematical restrictions, in a sample being used to estimate a
population characteristic.
For the sample standard deviation, df = n – 1.
24. Define Inter-Quartile Range (IQR).
The IQR measures the variability of a distribution by giving us the range covered by
the MIDDLE 50% of the data.
To find the interquartile range (IQR), first find the median (middle value) of the
lower and upper half of the data.
These values are quartile 1 (Q1) and quartile 3 (Q3).
The IQR is the difference between Q3 and Q1
IQR = Q3 – Q1
Q3 = 3rd Quartile = 75th Percentile
Q1 = 1st Quartile = 25th Percentile
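The steps above can be sketched in a few lines. This sketch assumes the median-of-halves convention described in the text and, for an odd number of observations, leaves the overall median out of both halves; other quartile conventions give slightly different values. The function name `iqr` is hypothetical, and the data are the Best Actress ages from Question 14.

```python
from statistics import median

def iqr(values):
    """Return (Q1, Q3, IQR) using the median of the lower and upper halves.

    Needs at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]        # lower half (excludes the median when n is odd)
    upper = data[-half:]       # upper half
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

q1, q3, spread = iqr(ages)
print(q1, q3, spread)
```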
25. How to measure/interpret the strength of a relationship based on the absolute value
of ‘r’?
Absolute value of r     Strength of relationship
0.5 < |r| < 0.7         Moderate
|r| > 0.7               Strong
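To make the table concrete, here is a sketch that computes Pearson's r from its definition and labels the result with the two thresholds listed above. The data are hypothetical, the function names are illustrative, and values of |r| at or below 0.5 are reported as unclassified because the table gives no label for them.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label |r| using only the thresholds from the table."""
    size = abs(r)
    if size > 0.7:
        return "strong"
    if size > 0.5:
        # Exactly 0.7 is not covered by the table; it is treated as moderate here
        return "moderate"
    return "not classified in the table"

# Hypothetical paired observations, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 3), strength(r))
```

The sign of r gives the direction of the relationship; only its absolute value is used for strength.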
28. What are the 4 things to describe the relationship between the variables?
Strength
• Strength of the relationship is given by the correlation coefficient
Direction
• It can be +ve or –ve based on the sign of the correlation coefficient
Shape
• It must always be linear to compute a Pearson correlation coefficient
Statistically significant
• It is based on p-value.
29. What does the correlation coefficient tell you?
It summarizes the data
It helps you to compare the results between studies.
30. State the guidelines for interpreting correlation strength.
2. In a survey, a question was asked: “During your lifetime, how often have you
changed your permanent residence?” A group of 18 college students replied as
follows: 1,3,4,1,0,2,5,8,0,2,3,4,7,11,0,2,3,3. Find the mode, median, and standard
deviation. [April/May 2023]
3. Consider an example. Tom who is the owner of a retail shop, found the price of
different T-shirts vs the number of T-shirts sold at his shop over a period of one
week. He tabulated this like shown below:
Price of T-Shirt Number of T-Shirt Sold
10
15
Explain the concept of least squares regression to find the line of best fit for the above data
4. The following frequency distribution shows the annual incomes in dollars for a
group of college graduates.
i. Construct a histogram.
ii. Construct a frequency polygon.
iii. Is this distribution balanced or lopsided?
5. Consider the best actress Oscar winners dataset given below, construct the stem plot
for the above dataset.
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39
34 26 25 35 33
6. Explain the multiple linear regression model with the prediction of sales through
various attributes such as the budget for TV advertisement, radio advertisement and
newspaper advertisement, using a statistical model.
7. Consider the following x and y set of values, create least square linear regression and check
the result of model fitting to know whether the model is satisfactory
8. Discuss in detail the various typical shapes of frequency distribution. Analyze its
characteristics with an example
9. The following are the number of customers who entered a video store in 8 consecutive
hours: 7,9,5,13,3,11,15,9. Find the standard deviation of the number of hourly customers.
Summarize about the aforementioned data with the help of standard deviation
10. Explain the steps to calculate IQR with an example of best actress Oscar winners
11. For each of the following pairs of distributions, first decide whether their standard
deviations are about the same or different. If their standard deviations are different, indicate
which distribution should have the larger standard deviation. Note that the distribution with
the more dissimilar set of scores or individuals should produce the larger standard deviation
regardless of whether, on average, scores or individuals in one distribution differ from
those in other distribution.
12. The IQ scores for a group of 35 high school dropouts are as follows:
UNIT III- INFERENTIAL STATISTICS
Part A
mean of every sample group chosen from the population and plotting the data points. The
graph shows a normal distribution where the center is the mean of the sampling distribution,
which represents the mean of the entire population.
46. What is meant by sampling distribution of proportion?
This sampling distribution focuses on proportions in a population. Samples are selected and
their proportions are calculated. The mean of the sample proportions from each group
represents the proportion of the entire population.
47. Define T-distribution
The t-distribution is a sampling distribution used when the sample is small or when
not much is known about the population. It is used to estimate the mean of the population
and other statistics such as confidence intervals, statistical differences and linear
regression. The t-distribution uses a t-score to evaluate data that wouldn't be appropriate
for a normal distribution.
The formula for the t-score is:
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
In the formula, $\bar{x}$ is the sample mean, $\mu$ is the population mean, $s$ signifies the
sample standard deviation, and $n$ is the sample size.
48. Define mean of all the sample means.
The mean of the sampling distribution of the mean always equals the mean of the
population.
49. Standard Error Of The Mean
The standard error of the mean equals the standard deviation of the population divided
by the square root of the sample size
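Both statements (the mean of all sample means equals the population mean, and the standard error equals σ/√n) can be checked by brute force on a tiny population. The sketch below uses the four-observation population 2, 3, 4, 5 from Part B and assumes sampling with replacement with a sample size of 2, so all 16 ordered samples can be listed.

```python
import math
from itertools import product
from statistics import mean, pstdev

population = [2, 3, 4, 5]
n = 2  # sample size (an assumption for this illustration)

# Every possible ordered sample of size n, drawn with replacement
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(population)               # population mean
mean_of_means = mean(sample_means)  # mean of the sampling distribution

sigma = pstdev(population)          # population standard deviation
standard_error = sigma / math.sqrt(n)
sd_of_means = pstdev(sample_means)  # spread of the sampling distribution

print(len(samples), mu, mean_of_means)
print(round(standard_error, 4), round(sd_of_means, 4))
```

The two printed spreads agree, which is the sense in which the standard error is a special type of standard deviation: it is the standard deviation of the sample means.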
50. What is the special type of standard deviation?
You might find it helpful to think of the standard error of the mean as a rough measure
of the average amount by which sample means deviate from the mean of the sampling
distribution or from the population mean
51. What is hypothesis testing?
Hypothesis testing is a form of statistical inference that uses data from a sample to draw
conclusions about a population parameter or a population probability distribution. First,
a tentative assumption is made about the parameter or distribution. This assumption is
called the null hypothesis and is denoted by H0.
52. Define Hypothesized Sampling Distribution
When you perform a hypothesis test of a single population mean μ using a normal
distribution (often called a z-test), you take a simple random sample from the
population. ... Then the binomial distribution of a sample (estimated) proportion can be
approximated by the normal distribution with $\mu = p$ and $\sigma = \sqrt{pq/n}$.
53. Define Decision Rule
A decision rule specifies precisely when H0 should be rejected (because the observed z
qualifies as a rare outcome). There are many possible decision rules, as will be seen in
Section 11.3. A very common one, already introduced in Figure 10.3, specifies that H0
should be rejected if the observed z equals or is more positive than 1.96 or if the
observed z equals or is more negative than –1.96. Conversely, H0 should be retained if
the observed z falls between ± 1.96.
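The decision rule can be written as a small function. This is a sketch that assumes a two-tailed z test with critical values of ±1.96; the function names are hypothetical, and the worked figures come from the Wechsler IQ question in Part B (sample mean 105, hypothesized mean 100, σ = 15, n = 25).

```python
import math

def z_statistic(sample_mean, mu_hyp, sigma, n):
    """Observed z: distance of the sample mean from mu_hyp in standard-error units."""
    standard_error = sigma / math.sqrt(n)
    return (sample_mean - mu_hyp) / standard_error

def decide(z, critical=1.96):
    """Reject H0 when z is at or beyond either critical value; otherwise retain it."""
    return "reject H0" if abs(z) >= critical else "retain H0"

z = z_statistic(sample_mean=105, mu_hyp=100, sigma=15, n=25)
print(round(z, 2), decide(z))
```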
54. Define null hypothesis
The null hypothesis is a statistical hypothesis which asserts that no statistical
relationship or significance exists in a single observed variable, or between
two sets of observed data or measured phenomena.
55. What is level of significance?
The level of significance is the total area that is identified with rare outcomes. Often
referred to as the level of significance of the statistical test, this proportion is
symbolized by the Greek letter α (alpha). In the present example, the level of
significance, α, equals .05.
56. Define One-Tailed And Two-Tailed Tests
A one-tailed test is a statistical test in which the critical area of a distribution is one-
sided so that it is either greater than or less than a certain value, but not both. If the
sample being tested falls into the one-sided critical area, the alternative hypothesis will
be accepted instead of the null hypothesis.
In statistics, a two-tailed test is a method in which the critical area of a distribution is
two-sided and tests whether a sample is greater or less than a range of values. It is used
in null-hypothesis testing and testing for statistical significance.
57. State Addition Rule and Multiplication Rule
Addition rule states that add together the separate probabilities of several mutually
exclusive events to find the probability that any one of these events will occur
Multiplication rule states that multiply together the separate probabilities of several
independent events to find the probability that these events will occur together.
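A brief sketch of both rules using exact fractions. The die and coin examples are hypothetical illustrations, and the function names are not from the original text.

```python
from fractions import Fraction

def addition_rule(*probabilities):
    """P(A or B or ...) for mutually exclusive events: add the probabilities."""
    return sum(probabilities, Fraction(0))

def multiplication_rule(*probabilities):
    """P(A and B and ...) for independent events: multiply the probabilities."""
    result = Fraction(1)
    for p in probabilities:
        result *= p
    return result

# Rolling a 1 or a 2 on one fair die (mutually exclusive outcomes)
p_one_or_two = addition_rule(Fraction(1, 6), Fraction(1, 6))

# Getting heads on two separate fair coin tosses (independent events)
p_two_heads = multiplication_rule(Fraction(1, 2), Fraction(1, 2))

print(p_one_or_two, p_two_heads)
```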
60. What are four possible outcomes for any hypothesis test?
If H0 really is true, it is a correct decision to retain the true H0.
If H0 really is true, it is a type I error to reject the true H0.
If H0 really is false, it is a type II error to retain the false H0.
If H0 really is false, it is a correct decision to reject the false H0.
61. Define Point Estimate
A point estimate for μ uses a single value to represent the unknown population mean.
62. What is meant by confidence interval (CI) for μ?
A confidence interval for μ uses a range of values that, with a known degree of certainty,
includes the unknown population mean.
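A minimal sketch of a 95 percent confidence interval for μ when the population standard deviation is known, using z = 1.96. The figures are taken from the reading-achievement question in Part B (sample mean 3.82, σ = 0.4, n = 64); the function name is hypothetical.

```python
import math

def confidence_interval(sample_mean, sigma, n, z=1.96):
    """Return (lower, upper) limits: sample mean plus or minus z standard errors."""
    standard_error = sigma / math.sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

lower, upper = confidence_interval(sample_mean=3.82, sigma=0.4, n=64)
print(round(lower, 3), round(upper, 3))
```

In this example the sample mean itself (3.82) is the point estimate, and the interval is the range of values around it.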
63. What do you mean by Hypothesis? Name at least 4 of its types.
A hypothesis is a statement about the nature of a population. It is often stated in terms of a
population parameter. Hypothesis testing is a form of statistical inference that uses data
from a sample to draw conclusions about a population parameter or a population
probability distribution. Some types of hypothesis statements are:
Directional Hypothesis,
Non-Directional Hypothesis,
Null hypothesis,
Alternative hypothesis,
Associative Hypothesis
64. State Central Limit Theorem (April/May 2023)
Central Limit Theorem states that regardless of the population shape, the shape of the
sampling distribution of the mean approximates a normal curve if the sample size is
sufficiently large.
According to this theorem, it doesn’t matter whether the shape of the parent population
is normal, positively skewed or negatively skewed, as long as the sample size is
sufficiently large.
If the shape of the parent population is normal, then any sample size will be sufficiently
large. Otherwise, depending on the degree of non-normality in the parent population, a
sample size between 25 and 100 is sufficiently large
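The theorem can be illustrated with a small simulation. This sketch assumes a positively skewed parent population (an exponential distribution with mean 1 and standard deviation 1), a sample size of 50, and 2000 repeated samples; all of these choices are illustrative, and results vary slightly with the random seed.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

n = 50              # sample size
num_samples = 2000  # number of repeated samples

# Draw repeated samples from a skewed population and record each sample mean
sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

grand_mean = sum(sample_means) / num_samples
spread = math.sqrt(sum((m - grand_mean) ** 2 for m in sample_means) / num_samples)

# Theory: mean of sample means = 1, standard error = sigma / sqrt(n) = 1 / sqrt(50)
standard_error = 1.0 / math.sqrt(n)

# For a roughly normal sampling distribution, about 95% of sample means
# should fall within 1.96 standard errors of the population mean
within = sum(abs(m - 1.0) <= 1.96 * standard_error for m in sample_means) / num_samples

print(round(grand_mean, 3), round(spread, 3), round(standard_error, 3), round(within, 3))
```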
65. Indicate whether the following statements are True or False with proper
justification. The mean of all sample means,
(a) always equals the value of a particular sample mean.
(b) equals 100 if, in fact, the population mean equals 100.
(c) usually equals the value of a particular sample mean.
(d) is interchangeable with the population mean.
(a) FALSE. The mean of all sample means need not equal any particular sample mean.
(b) TRUE. The mean of all sample means equals the population mean.
(c) FALSE. The mean of all sample means need not equal any particular sample mean.
(d) TRUE. The mean of all sample means equals the population mean.
66. Indicate what’s wrong with each of the following statistical hypothesis:
(c) A null hypothesis and its alternative hypothesis cannot have different
anchor point values. In the given scenario, the two hypotheses do not cover any values
between 155 and 160.
(d) A hypothesis statement describes a population parameter,
but the sample mean X̄ is used in the given scenario.
67. Define Effect Of Sample Size
The larger the sample size, the smaller the standard error and, hence, the more precise
(narrower) the confidence interval will be. Indeed, as the sample size grows larger, the
standard error will approach zero and the confidence interval will shrink to a point
estimate. Given this perspective, the sample size for a confidence interval, unlike that for
a hypothesis test, never can be too large.
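The shrinking interval can be seen numerically. This sketch assumes a known population standard deviation of 0.4 (borrowed from the reading-achievement question) and a 95 percent interval with z = 1.96; the sample sizes are arbitrary illustrations.

```python
import math

sigma = 0.4   # assumed population standard deviation
z = 1.96      # critical z for a 95 percent confidence interval

widths = {}
for n in (16, 64, 256, 1024):
    standard_error = sigma / math.sqrt(n)
    widths[n] = 2 * z * standard_error   # full width of the interval
    print(n, round(standard_error, 4), round(widths[n], 4))
```

Each fourfold increase in sample size halves the standard error, and therefore halves the width of the confidence interval.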
Part B
15. Explain population and samples, and the difference between them.
16. Describe random sampling
17. Explain sampling distribution and types
18. Describe null hypothesis test in detail
19. Explain in detail hypothesis testing and examples
20. Does the mean SAT math score for all local freshmen differ from the national
average of 500? (z test for population mean)
21. Explain one tailed and two tailed test
22. Define estimation. Explain in detail about point estimation.
23. Discuss about the following with suitable example:
i. Random Sampling vs Random Assignments
ii. Independent vs Dependent Events
iii. Independent vs Mutually Exclusive Events
iv. Conditional Probability
v. Sampling Distribution of the Mean
24. Imagine a very simple population consisting of only four observations:2 3 4 5
i. Explain the process of constructing a relative frequency table showing
the sampling distribution of the mean.
ii. Construct a relative frequency table showing the sampling distribution of the
mean for the above observations.
25. Define Hypothesis. Discuss in detail about at least 5 types of hypothesis
statement with an example.
26. Calculate the value of the z test for each of the following situations. Also, given
critical z score of +/- 1.96, calculate the critical confidence level.
i. X=12; σ=9; n=25; µhyp=15
ii. X=3600; σ=4000; n=100; µhyp=3500
iii. X=0.25; σ=0.10; n=36; µhyp=0.22
27. Reading achievement scores are obtained for a group of fourth graders. A score
of 4.0 indicates a level of achievement appropriate for fourth grade, a score
below 4.0 indicates underachievement, and a score above 4.0 indicates
overachievement. Assume that the population standard deviation equals 0.4. A
random sample of 64 fourth graders reveals a mean achievement score of 3.82.
Construct a 95% confidence interval for the unknown population mean.
(Remember to convert the standard deviation to a standard error.) Interpret this
confidence interval; that is, do you find any consistent evidence either of
overachievement or of underachievement?
28. Illustrate in detail about estimation method and confidence interval.
29. For the population at large, the Wechsler Adult Intelligence Scale is designed to
yield a normal distribution of test score with a mean of 100 and a standard
deviation of 15. School district officials wonder whether, on the average, an IQ
score different from 100 describes the intellectual aptitudes of all students in
their district. Wechsler IQ scores are obtained for random sample of 25 of their
students, and the mean IQ is found to equal 105. Using the step-by-step
procedure, test the null hypothesis at the .05 level of significance.
30. Imagine a simple population consisting of only 5 observations: 2 4 6 8 10. List
all possible samples of size two. Construct a relative frequency table showing the
sampling distribution of the mean.
31. According to the American Psychological Association, members with a
doctorate and a full-time teaching appointment earn, on the average, $82,500
per year, with a standard deviation of $6,000. An investigator wishes to
determine whether $82,500 is also the mean salary for all female members with
a doctorate and a full-time teaching appointment. Salaries are obtained for a
random sample of 100 women from this population, and the mean salary equals
$80,100.
i. Someone claims that the observed difference between $80,100 and $82,500 is large
enough by itself to support the conclusion that female members earn less than male
members. Explain why it is important to conduct a hypothesis test.
ii. The investigator wishes to conduct a hypothesis test for what population?
iii. What is the null hypothesis, H0?
iv. What is the alternative hypothesis, H1?
v. Specify the decision rule, using the .05 level of significance.
vi. Calculate the value of z. (Remember to convert the standard deviation to
a standard error.)
vii. What is your decision about H0?
viii. Using words, interpret this decision in terms of the original problem.
32. According to the California Educational Code
(http://www.cde.ca.gov/ls/fa/sf/peguidemidhi.asp), students in grades 7 through 12
should receive 400 minutes of physical education every 10 school days. A random
sample of 48 students has a mean of 385 minutes and a standard deviation of 53
minutes. Test the hypothesis at the .05 level of significance that the sampled
population satisfies the requirement.
33. According to a 2009 survey based on the United States census
(http://www.census.gov/prod/2011pubs/acs-15.pdf), the daily one-way commute time of U.S. workers
averages 25 minutes with, we’ll assume, a standard deviation of 13 minutes. An
investigator wishes to determine whether the national average describes the mean
commute time for all workers in the Chicago area. Commute times are obtained for a
random sample of 169 workers from this area, and the mean time is found to be 22.5
minutes. Test the null hypothesis at the .05 level of significance.
34. Each of the following statements could represent the point of departure for a
hypothesis test. Given only the information in each statement, would you use a two-
tailed (or nondirectional) test, a one-tailed (or directional) test with the lower tail
critical, or a one-tailed (or directional) test with the upper tail critical? Indicate your
decision by specifying the appropriate H0 and H1. Furthermore, whenever you
conclude that the test is one-tailed, indicate the precise word (or words) in the
statement that justifies the one-tailed test.
i. An investigator wishes to determine whether, for a sample of drug addicts, the mean
score on the depression scale of a personality test differs from a score of 60, which,
according to the test documentation, represents the mean score for the general
population.
ii. To increase rainfall, extensive cloud-seeding experiments are to be conducted, and
the results are to be compared with a baseline figure of 0.54 inch of rainfall (for
comparable periods when cloud seeding was not done).
iii. Public health statistics indicate, we will assume, that American males gain an
average of 23 lbs during the 20-year period after age 40. An ambitious weight-
reduction program, spanning 20 years, is being tested with a sample of 40-year-old
men.
iv. When untreated during their lifetimes, cancer-susceptible mice have an
average life span of 134 days. To determine the effects of a potentially life-
prolonging (and cancer-retarding) drug, the average life span is determined for a
group of mice that receives this drug.
36. Specify the decision rule for each of the following situations (referring to Table 11.1
to find critical z values):
38. For each of the following situations, indicate whether H0 should be retained or rejected.
Given a one-tailed test, lower tail critical with α = .01, and
(a) z = – 2.34 (b) z = – 5.13 (c) z = 4.04
Given a one-tailed test, upper tail critical with α = .05, and
(d) z = 2.00 (e) z = – 1.80 (f) z = 1.61
39. Reading achievement scores are obtained for a group of fourth graders. A score of
4.0 indicates a level of achievement appropriate for fourth grade, a score below 4.0
indicates underachievement, and a score above 4.0 indicates overachievement. Assume that
the population standard deviation equals 0.4. A random sample of 64 fourth graders reveals
a mean achievement score of 3.82.
iv. Construct a 95 percent confidence interval for the unknown population mean.
(Remember to convert the standard deviation to a standard error.)
v. Interpret this confidence interval; that is, do you find any consistent evidence either of
overachievement or of underachievement?