100 plus Statistics Interview Questions
Assignment
Submitted by : Abutalha D Maniyar
100+ Statistics Interview Questions
Ans: As the name suggests, quantitative data are numerical.
Example: age in years, number of students in a class, etc. Qualitative data,
by contrast, are categorical or non-numerical.
Example: size of shirt (S, M, L, XL) and taste of a dish (Good, Bad).
6. How Would You Approach a Dataset That’s Missing More Than 30 Percent of
Its Values?
Ans: The approach depends on the size of the dataset. If it is a large
dataset, the quickest method is to remove the rows containing the missing
values; with enough data left, this is unlikely to hurt the model's ability to
produce results. If the dataset is small, dropping rows is not practical. In that
case, it is better to calculate the mean or mode of the feature and impute that
value wherever entries are missing. Another approach is to use a machine
learning algorithm to predict the missing values. This can yield accurate
results unless some entries deviate very strongly from the rest of the dataset.
7. Give an example where the median is a better measure than the mean?
Ans: When data are skewed, the median is a better measure than the mean.
Example: in an exam, 3 students perform really well but the remaining 32
students do not perform as well, so the median of the marks describes the class
better than the mean. (The figure below shows that the median is affected far
less than the mean when the data are positively or negatively skewed.)
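The exam example can be checked with a minimal Python sketch. The marks below are made up for illustration (32 students at 40, 3 students at 95); only the pattern matters.

```python
from statistics import mean, median

# Hypothetical marks: 32 students score 40, 3 students score 95.
marks = [40] * 32 + [95] * 3

print("mean  :", round(mean(marks), 2))   # pulled upward by the 3 high scores
print("median:", median(marks))           # stays at the typical score
```

The mean moves away from the typical mark of 40, while the median does not.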
10. Can you state the method of dispersion of the data in statistics?
Ans: Yes, the range is very sensitive to outliers. Because it depends only on
the minimum and the maximum, a single extreme value can change it drastically.
14. What are the scenarios where outliers are kept in the data?
Ans: When outliers represent special cases, they need to be interpreted
rather than discarded. Suppose we have the blood sugar levels of 100
students: if we remove an extreme reading by calling it an outlier, we lose
that data point, and it could have been helpful in assessing the overall health
of that student. Similarly, in spam email detection we keep the outliers.
Outliers must therefore be interpreted before they are removed from the
dataset.
16. What do you understand about a spread out and concentrated curve?
Ans: A spread-out curve has large deviations from the center, whereas a
concentrated curve has a larger frequency of data around the center.
Ans: If I know the standard deviation (S.D.) and the mean of the data, I can
find the coefficient of variation as CV = (S.D. / Mean) x 100, which expresses
it as a percentage.
18. State the case where the median is a better measure when compared to
the mean.
Ans: Suppose a city has 100 people and the richest person in Asia is one of
them. If we calculate the mean wealth of the 100 people, the value is dragged
towards the outlier (the rich person’s wealth) and misrepresents the wealth of
the city. The median is not affected in this way, so in this case it is a better
measure than the mean.
Ans: How we handle missing data depends on the type of feature. If the
feature is categorical, we replace missing values with the mode of the
variable. If the data are multimodal, we may randomly assign one of the modes
to each missing value.
If the feature is numerical, we may replace missing values with the median,
mean, or mode. The choice depends on the variable, the distribution it follows,
and domain knowledge.
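A minimal sketch of this idea in Python, using only the standard library. The columns, the `None` marker for missing entries, and the helper `impute` are assumptions made for illustration, and the median is chosen for the numerical column because the made-up ages are skewed.

```python
from statistics import median, mode

def impute(values, fill):
    """Replace None entries with the given fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical columns; None marks a missing entry.
ages = [22, 25, None, 31, 90, None, 27]          # numerical, skewed by 90
sizes = ["M", "L", None, "M", "S", "M", None]    # categorical

age_fill = median(v for v in ages if v is not None)
size_fill = mode(v for v in sizes if v is not None)

ages_filled = impute(ages, age_fill)
sizes_filled = impute(sizes, size_fill)
print(ages_filled)
print(sizes_filled)
```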
20. What is meant by mean imputation for missing data? Why is it bad?
Ans: The box plot is suitable for comparing the range and distribution of
groups of numerical data. Advantages: it summarizes large amounts of data
compactly and makes outlier values visible.
Ans: Five Number summary describes given data in 5 numbers and they are:
1. Minimum
2. Q1
3. Q2 (or Median)
4. Q3
5. Maximum
Minimum : All the data lie at or above this point.
Q1 : 75% of the data lie above this value.
Q2 : 50% of the data lie above this point.
Q3 : 25% of the data lie above this point.
Maximum : All the data lie at or below this point.
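The five numbers can be computed with a short Python sketch. The sample is made up, the function name `five_number_summary` is hypothetical, and the "inclusive" quartile method is only one of several conventions, so other tools may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

# Hypothetical sample.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12]
print(five_number_summary(data))   # (2, 4.0, 7.0, 9.0, 12)
```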
23. What is the difference between the First quartile, the IInd quartile, and
the IIIrd quartile?
Ans: The quartiles differ in the position they mark in the sorted dataset.
Together they divide the sorted data into four equal parts: Q1 is the value
below which 25% of the data lie, Q2 (the median) is the value below which 50%
lie, and Q3 is the value below which 75% lie.
Ans: Outliers are extreme values that stand out greatly from the overall
pattern of values in a dataset or graph.
Ans: Following are some impacts of outliers in the data set: It may cause a
significant impact on the mean and the standard deviation. If the outliers are
non-randomly distributed, they can decrease normality. They can bias or
influence estimates that may be of substantive interest.
Ans: We may use a box plot to screen for outliers. In this case the lower and
upper limits are calculated from the IQR: the lower limit is generally taken as
Q1 - 1.5 x IQR and the upper limit as Q3 + 1.5 x IQR, and values outside these
limits are flagged.
Strongly skewed data often contain outliers. For roughly normal data we can
also flag values that lie more than three S.D. above or below the mean.
In this way we can screen outliers.
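Both screening rules can be sketched in Python as follows. The data are made up, the function names are hypothetical, and the thresholds (1.5 x IQR and 3 S.D.) are the conventional defaults rather than fixed requirements.

```python
from statistics import mean, stdev, quantiles

def iqr_outliers(data, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

def zscore_outliers(data, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) > threshold * s]

# Hypothetical data: 20 ordinary readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95]
print(iqr_outliers(data))      # [95]
print(zscore_outliers(data))   # [95]
```

Note that in very small samples the three-S.D. rule can miss an extreme value, because the outlier itself inflates the standard deviation.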
Ans : Depending on the specific characteristics of the data, there are several
ways to handle outliers in a dataset. Let’s review a few of the most common
approaches to handle outliers below:
Remove outliers:
In some cases, it may be appropriate to simply remove the observations that
contain outliers. This can be particularly useful if you have a large number of
observations and the outliers are not true representatives of the underlying
population.
Transform outliers:
The impact of outliers can be reduced or eliminated by transforming the
feature. For example, a log transformation of a feature can reduce the
skewness in the data, reducing the impact of outliers.
Impute outliers:
In this case, outliers are simply considered as missing values. You can employ
various imputation techniques for missing values, such as mean, median,
mode, nearest neighbor, etc., to impute the values for outliers.
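Ans: Kurtosis is based on the fourth standardized moment of a distribution and
measures the tailedness of the distribution.
There are three types of kurtosis:
1) Leptokurtic (heavier tails than the normal distribution)
2) Mesokurtic (tails similar to the normal distribution)
3) Platykurtic (lighter tails than the normal distribution)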
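Ans: The means of samples (each of size n) drawn at random from a
population are approximately normally distributed when n is sufficiently large,
whatever the shape of the population (provided its variance is finite).
Let's say we draw 100 samples, each of size n = 30; then the means of these
100 samples approximately follow a normal distribution centered on the
population mean with standard deviation σ/√n (not necessarily a standard
normal distribution).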
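A small simulation can illustrate this. The sketch below is an assumption-laden example, not part of the original answer: it draws from an exponential population (mean 1, S.D. 1), which is clearly non-normal, and uses a fixed seed so the run is repeatable.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, SD 1)."""
    return mean(rng.expovariate(1.0) for _ in range(n))

n = 30
sample_means = [sample_mean(n) for _ in range(1000)]

print("mean of sample means:", round(mean(sample_means), 3))   # close to 1
print("SD of sample means  :", round(stdev(sample_means), 3))  # close to 1/sqrt(30)
```

The sample means cluster around the population mean, and their spread is close to σ/√n ≈ 0.18 even though the population itself is skewed.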
36. Can you give an example to denote the working of the central limit
theorem?
Ans: Suppose we want to estimate the average milk produced by 1000 cows
in a city. We sample 30 cows at random and calculate the sample mean and
standard deviation. By the central limit theorem, the sample mean is
approximately normally distributed around the true mean, so the actual mean
should lie close to the calculated mean. We generally report a confidence
interval to account for chance error.
37. What general conditions must be satisfied for the central limit theorem to
hold?
equal than 30
manner. e.g. when you collect data by only surveying customers who
purchased your product and not another half, your dataset does not represent
40. What is the probability that the sum is 8 when throwing two fair dice?
Ans: There are 36 possible outcomes when rolling two dice, because there are
6 possible outcomes for the first die and 6 possible outcomes for the second
die (6 x 6 = 36).
The outcomes that give a sum of 8 are:
(2, 6)
(3, 5)
(4, 4)
(5, 3)
(6, 2)
So, the probability of getting a sum of 8 when throwing two fair dice is:
Probability = 5 / 36
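The count can be verified by brute-force enumeration; this is a minimal Python sketch, with exact fractions used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))       # all 36 ordered rolls
favorable = [roll for roll in outcomes if sum(roll) == 8]

probability = Fraction(len(favorable), len(outcomes))
print(favorable)     # [(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)]
print(probability)   # 5/36
```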
41. What are the different types of Probability Distribution used in Data
Science?
1. Symmetry: The curve is perfectly symmetric, with the mean, median, and
mode all coinciding at the center of the distribution.
3. Mean and Standard Deviation: The shape and spread of the curve are
determined by the mean (μ) and standard deviation (σ) of the data. The mean
represents the central value, while the standard deviation measures the
dispersion or spread of data points around the mean.
In this formula:
44. What type of data does not have a normal distribution or a Gaussian
distribution?
Ans: It is symmetric around the mean and it follows the empirical rule.
Empirical Rule: A significant property of the normal distribution is the
empirical rule (also known as the 68-95-99.7 rule), which states that:
● Approximately 68% of the data falls within one standard deviation of the
mean.
● Approximately 95% of the data falls within two standard deviations of
the mean.
● Approximately 99.7% of the data falls within three standard deviations
of the mean.
2. Symmetry: The data distribution is symmetric around its mean. This means
that the mean, median, and mode are all equal and located at the center of
the distribution.
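The percentages in the empirical rule can be reproduced from the normal cumulative distribution function. This is a minimal sketch using Python's standard library; the helper name `within` is hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def within(k):
    """Probability that a normal value lies within k SDs of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k) * 100, 2))   # about 68.27, 95.45, 99.73
```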
49. Can you tell me the range of the values in standard normal distribution?
Ans: The Pareto Principle, also known as the 80/20 Rule, is a concept that
suggests a significant imbalance between inputs and outputs, effort and
results, or causes and effects.
The Pareto Principle can be summarized as follows:
● Roughly 80% of the effects come from 20% of the causes.
● In various contexts, a small proportion (typically around 20%) of inputs,
efforts, or factors often generates a large proportion (typically around
80%) of outputs, results, or outcomes.
left-skewed (Negative Skew) distribution is one in which the left tail is longer
52. If a distribution is skewed to the right and has a median of 20, will the
mean be greater than or less than 20?
Ans: In this case, the mean is less than 60 and the mode is greater than 60.
54. Imagine that Jeremy took part in an examination. The test has a mean
score of 160, and it has a standard deviation of 15. If Jeremy’s z-score is
1.20, what would be his score on the test?
Ans: Using z = (x - mean) / standard deviation:
1.20 = (x - 160) / 15
Therefore, x = 160 + 1.20 x 15 = 178
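The same rearrangement as a short Python sketch; the function name `raw_score` is hypothetical.

```python
def raw_score(z, mean, sd):
    """Invert z = (x - mean) / sd to recover the raw score x."""
    return mean + z * sd

jeremy = raw_score(1.20, 160, 15)
print(round(jeremy, 2))   # 178.0
```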
55. The standard normal curve has a total area under the curve equal to one,
and it is symmetric around zero. True or False?
Ans: True
56. Briefly explain the procedure to measure the length of all sharks in the
world.
Ans: The shark population is large, so we randomly select a sample (say 30
sharks) and measure the length of each one. We then calculate the mean and
standard deviation of these lengths to estimate the population parameter.
Finally, we express the mean length of all sharks in the world as a confidence
interval at a chosen confidence level.
57. Can you tell me the difference between unimodal bimodal and bell-shaped
curves?
Ans: A unimodal curve has only one mode, whereas a bimodal curve has two
modes. A bell-shaped curve has only one mode and is also symmetric about it.
59. What are some examples of data sets with non-Gaussian distributions?
Ans. Non-Gaussian (non-normal) distributions are common in practice.
They often arise when data are naturally clustered on one side or the
other of a graph. For instance, bacterial growth follows an exponential
pattern, which is non-Gaussian.
Ans:
61. What are the criteria that Binomial distributions must meet?
1. Two Outcomes: Each trial must result in one of two possible outcomes,
often denoted as "success" and "failure." These outcomes are mutually
exclusive.
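With these criteria met, you can use the binomial probability formula to
calculate the probability of obtaining a specific number of successes (k) in the
fixed number of trials (n). The formula is:
P(X = k) = C(n, k) x p^k x (1 - p)^(n - k)
where p is the probability of success on a single trial and
C(n, k) = n! / (k! (n - k)!).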
63. How to find the mean length of all fishes in the sea?
Ans: Define the confidence level (most commonly 95%). Take a random sample
of fishes from the sea (for better results, more than 30 fishes). Calculate
the mean and standard deviation of the lengths. Calculate the t-statistic and
use it to get the confidence interval in which the mean length of all the
fishes should lie.
Ans: There are several types of sampling methods, each with its own
advantages and disadvantages. Here are some common types of sampling
methods:
Ans: The sample size depends on several factors, including the desired level
of confidence, the margin of error, the variability in the population, and the
specific research objectives. Here's a general approach to calculate the
needed sample size:
4. As a rule of thumb, the minimum sample size is 30. The standard error
shrinks as the sample size grows, in line with the law of large numbers, so the
sample size is chosen by plugging the above values into the respective formulae.
67. Can you give the difference between stratified sampling and clustering
sampling?
69. What are population and sample in Inferential Statistics, and how are
they different?
Ans: In inferential statistics, the population is the entire set of data we are
interested in, while a sample is a subset of data drawn from the population.
The sample should be representative of the population. Any measure computed
on the population (mean, SD, median, and so on) is called a parameter, whereas
the same measure computed on the sample is called a statistic. The two may
differ slightly from each other.
70. What is the relationship between the confidence level and the
significance level in statistics?
Ans: The confidence level and the significance level (alpha) are complements:
confidence level = 1 - alpha. For example, a 95% confidence level corresponds
to a significance level of 0.05.
71. What is the difference between Point Estimate and Confidence Interval
Estimate?
Ans: Point Estimate and Confidence Interval Estimate are two different ways
of summarizing and conveying information about population parameters
based on sample data in statistics. The differences between them are:
1. Point Estimate provides a single value, whereas Confidence Interval
Estimate provides a range of values.
2. Point estimates are specific single values but do not convey the uncertainty
associated with the estimate. Confidence intervals provide a measure of that
uncertainty.
3. Confidence intervals are often used when researchers want to
communicate the range of values that are plausible for the population
parameter, taking into account the variability in the data.
Biased:
● If an estimator or method is "biased," it means that, on average, it tends
to produce estimates that are systematically different from the true
population parameter.
● In other words, if you were to use the biased estimator or method many
times with different samples from the same population, the average of
those estimates would not equal the true population parameter.
● Biased estimators/methods introduce systematic errors and are less
desirable because their estimates are consistently too high or too
low.
Unbiased:
● If an estimator or method is "unbiased," it means that, on average, it
produces estimates that are equal to the true population parameter.
● In the long run, when using the unbiased estimator or method
repeatedly with different samples from the same population, the
average of those estimates will converge to the true population
parameter.
● Unbiased estimators/methods are preferred because they provide
accurate and reliable estimates that are not systematically too high or
too low.
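A classic illustration of the difference is the variance estimator: dividing by n is biased, dividing by n - 1 is unbiased. The simulation below is a minimal sketch under stated assumptions (normal population with true variance 1, samples of size 5, fixed seed); the variable names are hypothetical.

```python
import random
from statistics import mean, pvariance, variance

rng = random.Random(1)          # fixed seed for repeatability
n, trials = 5, 10000            # small samples make the bias visible

biased, unbiased = [], []
for _ in range(trials):
    sample = [rng.gauss(0, 1) for _ in range(n)]   # true variance is 1
    biased.append(pvariance(sample))     # divides by n
    unbiased.append(variance(sample))    # divides by n - 1

print("average of divide-by-n estimates    :", round(mean(biased), 3))
print("average of divide-by-(n-1) estimates:", round(mean(unbiased), 3))
```

Averaged over many samples, the divide-by-n estimates settle near (n - 1)/n = 0.8 of the true variance, while the divide-by-(n - 1) estimates settle near 1.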
It's important to note that an estimator can still produce an estimate that is
different from the true parameter for a specific sample (due to random
sampling variation), even if it is unbiased. However, the key characteristic of
73. How does the width of the confidence interval change with length?
The standard error is closely related to the standard deviation (SD), but it is
specific to sample statistics, whereas the standard deviation is used to
measure the dispersion of individual data points in a dataset.
Ans: Sampling error is a type of error that occurs when the characteristics of a
sample, which is a subset of a larger population, do not perfectly represent the
characteristics of the entire population. In other words, it is the discrepancy
between the sample statistic (e.g., sample mean or sample proportion) and
the true population parameter.
1. Increase the Sample Size: One of the most effective ways to reduce
sampling error is to increase the sample size. As the sample size grows, the
sample statistic becomes a more accurate estimate of the population
parameter, and the sampling error decreases. Larger samples tend to provide
more precise and reliable estimates.
2. Use Random Sampling: Ensure that the sampling process is truly random.
Random sampling helps minimize bias and ensures that every element or
individual in the population has an equal chance of being included in the
sample. Common methods for random sampling include simple random
sampling, stratified sampling, and cluster sampling.
76. How do the standard error and the margin of error relate?
Ans: The margin of error (MOE) is directly related to the standard error (SE). In
fact, the MOE is typically calculated as a function of the SE and the chosen
critical value from the probability distribution.
The margin of error defines the range within which the true population
parameter is likely to fall with a specified level of confidence. It provides the
"plus or minus" part of a confidence interval.
A wider confidence interval (larger MOE) implies lower precision and less
certainty about the true parameter value, while a narrower confidence interval
(smaller MOE) implies higher precision and greater certainty.
3. Critical Region: The critical region, where the rejection of the null
hypothesis occurs, is located entirely in one tail of the probability distribution of
the test statistic. This tail corresponds to the direction specified in the
alternative hypothesis.
observed or more extreme results when the null hypothesis (H0) is True.
P-values are used in hypothesis testing to help decide whether to reject the
null hypothesis or not. The smaller the p-value, the stronger the evidence that
84. If there is a 30 percent probability that you will see a supercar in any
20-minute time interval, what is the probability that you see at least one
supercar in the period of an hour (60 minutes)?
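A minimal sketch of the calculation, assuming the three 20-minute intervals within the hour are independent (the question does not state this explicitly): the chance of seeing no supercar in the hour is 0.7 x 0.7 x 0.7, and the complement gives the chance of seeing at least one.

```python
p_seen_20min = 0.30                        # given in the question
intervals = 60 // 20                       # three 20-minute intervals per hour

p_none = (1 - p_seen_20min) ** intervals   # no supercar in any interval
p_at_least_one = 1 - p_none
print(round(p_at_least_one, 3))            # 0.657
```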
Ans: Hypothesis testing is a type of statistical inference that uses data from a
Ans: Type I and Type II errors are two different types of errors that can occur
in hypothesis testing and statistical decision-making. They are associated with
the acceptance or rejection of null and alternative hypotheses. Here's the
difference between Type I and Type II errors:
Ans: The choice between using a t-test and a z-test depends on several
factors, including the characteristics of your data, the sample size, and your
knowledge of the population standard deviation. Here are guidelines on when
to use each test:
2. Small Sample Size: For small sample sizes (typically when n < 30), a t-test
is preferred. As the sample size increases, the t-distribution approaches the
normal distribution (z-distribution), so the distinction becomes less important
with larger samples.
2. Large Sample Size: For large sample sizes (typically when n >=30), the
t-distribution closely approximates the normal distribution. In such cases,
using a z-test is appropriate because the distinction between the t-distribution
and the normal distribution becomes negligible with larger samples.
88. What is the difference between the f test and anova test?
Ans: The F-test and ANOVA (Analysis of Variance) test are related statistical
tests, but they have different applications and purposes. Here are the key
differences between the two:
F-Test:
2. Two Variance Groups: In its simplest form, the F-test compares the
variances of two groups (two samples). This is known as a two-sample F-test.
It is often used to check the homogeneity of variances assumption before
conducting other statistical tests, such as t-tests or ANOVA.
3. Multiple Factors: ANOVA can be used to analyze the impact of one or more
factors (independent variables) on a dependent variable. For example,
one-way ANOVA analyzes the effect of a single factor, while two-way ANOVA
considers the effects of two factors.
89. What is Resampling and what are the common methods of resampling?
● K-fold cross-validation
● Bootstrapping
90. What is the proportion of confidence intervals that will not contain the
population parameter?
Ans: The proportion of confidence intervals that will not contain the population
parameter is equal to the chosen significance level (alpha). In other words, if
you construct many confidence intervals from different samples,
approximately 100 x alpha percent of those intervals will not contain the true
population parameter.
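This can be checked by simulation. The sketch below makes several assumptions for illustration: a normal population with known mean 50 and known standard deviation 10 (so a z-interval applies), samples of size 25, and a fixed seed.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(42)                        # fixed seed for repeatability
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96

mu, sigma, n, trials = 50.0, 10.0, 25, 4000    # hypothetical population
half_width = z_crit * sigma / n ** 0.5         # known-sigma z-interval

misses = 0
for _ in range(trials):
    xbar = mean(rng.gauss(mu, sigma) for _ in range(n))
    if abs(xbar - mu) > half_width:            # interval fails to cover mu
        misses += 1

miss_rate = misses / trials
print(round(miss_rate, 3))                     # close to alpha = 0.05
```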
For example, if we are studying the causes of weight gain, then lack of workout
will be the independent variable, and weight gain will be the dependent
variable. In this case, the amount of food consumption can be the confounding
variable as it will mask or distort the effect of other variables in the study. The
effect of weather can be another confounding variable that may alter the
experiment design.
4. Collect Data
93. What is the relationship between standard error and the margin of error?
Ans: The margin of error (MOE) is directly related to the standard error (SE).
In fact, the MOE is typically calculated as a function of the SE and the chosen
critical value from the probability distribution.
The margin of error defines the range within which the true population
parameter is likely to fall with a specified level of confidence. It provides the
"plus or minus" part of a confidence interval.
A wider confidence interval (larger MOE) implies lower precision and less
certainty about the true parameter value, while a narrower confidence interval
(smaller MOE) implies higher precision and greater certainty.
Ans: The best way to describe the p-value in simple terms is with an example.
In practice, if the p-value is less than the alpha, say of 0.05, then we’re saying
that, if the null hypothesis were true, a result at least this extreme would occur
less than 5% of the time by chance. Similarly, a p-value of 0.05 is the same as
saying “Only 5% of the time would we see a result this extreme by chance
alone if there were no real effect.”
Ans: Interpolation is a prediction made using inputs that lie within the set of
observed values. Extrapolation is when a prediction is made using an input
that’s outside the set of observed values.
Ans: An inlier is a data observation that lies within the rest of the dataset
but is unusual or an error. Because it lies within the dataset, it is typically
harder to identify than an outlier and requires external data to identify.
Once we identify inliers, we can simply remove them from the dataset to
address them.
97. You toss a biased coin (p(head)=0.8) five times. What’s the probability of
getting three or more heads?
Ans: We can use the binomial probability formula. In this case, we want to
find the probability of getting 3, 4, or 5 heads. The formula is:
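The binomial probability of exactly k successes in n trials is P(X = k) = C(n, k) x p^k x (1 - p)^(n - k). A minimal Python sketch that sums this over k = 3, 4, 5 for the coin in the question (the function name `binomial_pmf` is hypothetical):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.8
p_three_or_more = sum(binomial_pmf(k, n, p) for k in (3, 4, 5))
print(round(p_three_or_more, 5))   # 0.94208
```

The three terms are 0.2048, 0.4096 and 0.32768, so the probability of three or more heads is about 0.942.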
Ans: To find the p-value for the one-sided test of whether the hospital infection
rate is below the standard of 1 infection per 100 person-days at risk, you can
use the Poisson distribution. The Poisson distribution is appropriate for
modeling the number of rare events, such as infections in a hospital, over a
known interval of time.
So, the p-value for the one-sided test of whether the hospital is below the
standard infection rate of 1 infection per 100 person-days at risk is
approximately 0.033. This p-value indicates strong evidence that the hospital's
infection rate is below the standard, as it is smaller than a typical significance
level α such as 0.05.
Ans: To calculate a 95% Student's T confidence interval for the mean brain
volume in the new population, based on a sample of 9 men with a sample
average of 1,100cc and a standard deviation of 30cc, you can use the formula
for the confidence interval:
x̄ ± t x (s / √n)
Where:
● x̄ is the sample mean (1,100cc).
● s is the sample standard deviation (30cc).
● n is the sample size 9.
● t is the critical value from the Student's T-distribution for the desired
confidence level.
So, the 95% Student's T confidence interval for the mean brain volume in the
new population is approximately 1,076.94cc to 1,123.06cc. This means we are
95% confident that the true population mean brain volume falls within this
interval.
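The arithmetic can be reproduced with a short Python sketch. The critical value 2.306 is the usual t-table entry for a two-sided 95% interval with n - 1 = 8 degrees of freedom; it is hard-coded here as an assumption because the standard library has no t-distribution.

```python
from math import sqrt

x_bar, s, n = 1100.0, 30.0, 9      # values given in the question
t_crit = 2.306                     # t-table value for 95%, 8 degrees of freedom

margin = t_crit * s / sqrt(n)      # t x (s / sqrt(n))
lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 2), round(upper, 2))   # 1076.94 1123.06
```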
● Due to chance or
● Due to relationship
Ans: If p-value ≤ α: You reject the null hypothesis. This means that the
observed data provides strong evidence against the null hypothesis, and you
conclude that there is a statistically significant effect or difference in the data.
If p-value > α:
You fail to reject the null hypothesis. This means that the observed data does
not provide strong enough evidence to conclude that there is a statistically
significant effect or difference. It does not prove that the null hypothesis is
true; it simply means you don't have enough evidence to reject it.
Goal of A/B Testing: The primary goal of A/B testing (also known as split
testing) is to compare two or more versions of a web page, app, or marketing
campaign to determine which one performs better in terms of a specific
outcome or metric. Common objectives include improving conversion rates,
click-through rates, user engagement, or other key performance indicators
(KPIs).
Ans: Box plot summarizes the data in 5 numbers. Whereas histogram shows
the complete distribution of data when appropriate bin size is used. Box Plot is
more generally used to compare the distributions. Both the plots help in
identifying the outliers.
(Figure: Box Plot and Histogram)
105. A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick
a coin at random, and toss it 10 times. Given that you see 10 heads, what is
the probability that the next toss of that coin is also a head?
Ans: We use Bayes' theorem to find the answer. Let's split the problem into two
parts:
1) What is the probability you picked the double-headed coin (referred to as D)?
2) What is the probability of getting a head on the next toss?
Question 2 follows very naturally after question 1, so let's tackle question 1.
Tackling the numerator, the prior probability, P(D) = 1/1000. If we used the
double headed coin, the chance of getting 10 heads, P(10 H | D) = 1 (we
always flip heads). So the numerator = 1 / 1000 * 1 = 1 / 1000.
Since we have all the components of P(D | 10 H), compute and you'll find the
probability of having a double headed coin is .506. We have finished the first
question.
The second question is then easily answered: we just compute the two
individual possibilities and add.
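Both steps can be worked through with exact fractions. This is a minimal sketch; the variable names are hypothetical, and the denominator P(10 H) is computed as the sum over the two kinds of coin.

```python
from fractions import Fraction

prior_double = Fraction(1, 1000)            # P(D)
prior_fair = Fraction(999, 1000)            # P(fair coin)
like_double = Fraction(1)                   # P(10 H | D)
like_fair = Fraction(1, 2) ** 10            # P(10 H | fair)

numerator = prior_double * like_double
evidence = numerator + prior_fair * like_fair        # P(10 H)
posterior_double = numerator / evidence              # P(D | 10 H)

# Next toss: heads for sure with D, heads half the time with a fair coin.
p_next_head = posterior_double * 1 + (1 - posterior_double) * Fraction(1, 2)

print(float(posterior_double))   # about 0.506
print(float(p_next_head))        # about 0.753
```

The posterior is exactly 1024/2023, and the probability that the next toss is a head is 3047/4046, or roughly 0.753.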
The following graph shows three samples (of different size) all sampled from
the same population.
With the small sample on the left, the 95% confidence interval is similar to the
range of the data. But only a tiny fraction of the values in the large sample on
the right lie within the confidence interval. This makes sense. The 95%
confidence interval defines a range of values that you can be 95% certain
contains the population mean. With large samples, you know that mean with
much more precision than you do with a small sample, so the confidence
interval is quite narrow when computed from a large sample.
107. How do you stay up-to-date with the new and upcoming concepts in
statistics?
Ans: By following the experts and influencers who share their insights,
opinions, and best practices on various platforms. We can follow them on
social media, blogs, podcasts, newsletters, webinars, or online communities.
109. What types of variables are used for Pearson’s correlation coefficient?
Ans: To compute Pearson’s correlation, both variables must be numerical, and
outliers should be handled first because they can distort the coefficient.
The quantities used to evaluate the coefficient of correlation are:
1. Covariance 2. Standard Deviation
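The coefficient is the covariance divided by the product of the two standard deviations. A minimal Python sketch with made-up data (the function name `pearson_r` and the variables are hypothetical):

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the SDs."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical data with a perfectly linear relationship.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]
print(round(pearson_r(hours, score), 3))   # 1.0
```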
Ans: A high correlation between the time a person sleeps and the amount of
productive work they do suggests a strong linear association between these
two variables.
However, it's important to clarify the nature of the relationship and be cautious
about making causal inferences.
113. How will you determine the test for the continuous data?
Ans: Determining the appropriate statistical test for continuous data depends
on several factors, including the research question, the nature of the data, and
the specific hypothesis you want to test.
1. Two-Sample T-Test:
Use case: Use a two-sample t-test when you want to compare the means of
two independent groups or samples. It's suitable for testing whether there is a
statistically significant difference between the two groups in the mean of a
continuous variable.
2. Paired T-Test:
Use case: Use a paired t-test when you want to compare the means of two
related or paired groups. It's suitable when the same subjects or items are
measured before and after an intervention or treatment.
114. What can be the reason for non normality of the data?
Ans: Non-normality in data, where the distribution of data points deviates from
a normal (Gaussian) distribution, can be caused by various factors and
phenomena. Understanding the reasons for non-normality is important for
appropriate data analysis and modeling. Some reasons for non-normality in
data can be:
1. Skewness
2. Outliers
3. Bimodality
4. Heavy tails
5. Transformation
115. Why is there no such thing as a three-sample t-test? Why does the t-test
fail with three samples?
Resources :
1. Statistics in one shot :
Complete Statistics For Data Science In 6 hours By Krish Naik
2. Stats In English Detailed Playlist:
How to Learn Statistics for Data Science As A Self Starter- Follow My Way