Python Assignment
a) True
b) False
Q2 Which one of the following is not a classification of Data Analytics?
a) Diagnostic analytics
b) Deceptive analytics
c) Predictive analytics
d) Prescriptive analytics
Q3 State True or false:
Statement: Nominal scale is the lowest level of measurement and ratio scale is the
highest level of measurement.
a) True
b) False
Q4 Consider the following statements-
Statement A : With iloc, we can pass in a negative value.
Statement B : With loc, we can pass in a negative value.
a. A and B are correct
b. Both are false
c. A is correct B is false
d. B is correct A is false
Q5 To get the 3rd, 4th & 6th rows of a data frame “df” in Python programming, we can write:
a. df.loc[[2,3,5]]
b. df.loc[[3,4,5]]
c. df.iloc[3,4,6]
d. None of the above
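The difference between position-based and label-based selection in Q4 and Q5 can be checked with a short sketch. It assumes pandas is installed and uses a hypothetical six-row frame with the default integer index; the column name "value" is made up for illustration.

```python
import pandas as pd

# Hypothetical six-row frame with the default integer index 0..5.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]})

# iloc is position-based, so a negative position counts from the end.
last_row = df.iloc[-1]

# loc is label-based; with the default index, labels 2, 3 and 5 are the
# 3rd, 4th and 6th rows.
picked = df.loc[[2, 3, 5]]

# iloc needs a list to select several rows; three bare integers are read
# as indexers for three axes, which a 2-D frame does not have.
try:
    df.iloc[3, 4, 6]
    too_many_indexers = False
except Exception:
    too_many_indexers = True

# loc treats -1 as a label, and this index contains no such label.
try:
    df.loc[-1]
    label_found = True
except KeyError:
    label_found = False
```

With a non-default index (for example, string labels), the labels passed to loc would differ, so the result depends on how "df" is indexed.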
a. True
b. False
Answer Key
Ques 1 B
8x7x6!/(5!x 3!)
Ques 2 A
Ques 3 A
Ques 4 B
Ques 5 C
Ques 6 C
Ques 7 B
Ques 8 B
Ques 9 C
Ques 10 C
Assignment 3
Q2 If the true proportion of customers who are below 20 years is P = 0.35, what is the
probability that a sample of size 100 yields a sample proportion between 0.3 and
0.4?
a) 0.961
b) 0.827
c) 0.706
d) 0.53
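A minimal sketch of the calculation, assuming the usual normal approximation to the sampling distribution of a proportion (mean P, standard error sqrt(P(1-P)/n)). It uses only the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.35, 100

# Standard error of the sample proportion under the normal approximation.
se = sqrt(p * (1 - p) / n)

# Sampling distribution of the sample proportion.
sampling_dist = NormalDist(mu=p, sigma=se)

# Probability that the sample proportion falls between 0.30 and 0.40.
prob = sampling_dist.cdf(0.40) - sampling_dist.cdf(0.30)
print(round(prob, 3))
```

The exact normal probability is roughly 0.705; a z table with z rounded to 1.05 gives 0.706, which is the closest listed option.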
c. the population is first divided into strata, and then random samples are drawn
from each stratum
Q8 In question 7, find the probability that on a randomly selected day, they will sell no
cars:
a. 0.0668
b. 0.544
c. 0.082
d. 0.205
Q9 In question 7, find the probability that on a randomly selected day, they will sell at
most 2 cars:
a. 0.0668
b. 0.544
c. 0.082
d. 0.205
Q10 In question 7, find the probability that on a randomly selected day, they will sell
exactly one car:
a. 0.0668
b. 0.544
c. 0.082
d. 0.205
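Question 7 is not shown here, so the sketch below rests on an assumption: daily sales follow a Poisson distribution with a mean of 2.5 cars per day. That rate is not taken from the text; it is chosen only because it reproduces the listed options, and the helper name poisson_pmf is hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean count is lam."""
    return exp(-lam) * lam**k / factorial(k)

lam = 2.5  # ASSUMED mean sales per day; Question 7 itself is not shown

p_none = poisson_pmf(0, lam)                                  # no cars
p_at_most_two = sum(poisson_pmf(k, lam) for k in range(3))    # 0, 1 or 2 cars
p_exactly_one = poisson_pmf(1, lam)                           # exactly one car
```

Under this assumed rate the three values round to 0.082, 0.544 and 0.205.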
Answer Key
A1 B
The specific value of a random variable is called an estimate.
A2 C
A3 C
A4 A
A5 A
A6 B
A7 A
A8 C
A9 B
A10 D
Assignment 4
b. -0.5
c. +0.5
d. 1.00
b. a Type II error
c. is the same as b
Q7 The mean cost of a hotel room in a city is said to be $168 per night. A random
sample of 25 hotels resulted in a sample mean X-bar = $172.50 and a sample standard
deviation s = $15.40. Calculate the t statistic.
a. 2
b. -2
c. 1.46
d. -1.46
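The one-sample t statistic for Q7 is (X-bar - mu0) / (s / sqrt(n)). A short sketch of the arithmetic, with variable names chosen here for illustration:

```python
from math import sqrt

mu0 = 168.0     # hypothesized mean cost per night
x_bar = 172.50  # sample mean
s = 15.40       # sample standard deviation
n = 25          # sample size

standard_error = s / sqrt(n)          # 15.40 / 5 = 3.08
t_stat = (x_bar - mu0) / standard_error
print(round(t_stat, 2))
```

The test has n - 1 = 24 degrees of freedom.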
Q8 In hypothesis testing if the null hypothesis is rejected,
a. no conclusions can be drawn from the test
d. level of significance
Q10 If a hypothesis is rejected at the 5% level of significance, it
a. will always be rejected at the 1% level
ANSWER KEY
A1 B
A2 C
A3 A
A4 A
A5 C
A6 A
A7 C
A8 B
A9 D
A10 D
Assignment 5
Q1 In the analysis of variance procedure (ANOVA) the term "factor" refers to:
a. the dependent variable
b. the independent variable
c. different levels of a treatment
d. the critical value of F
Q2 In a problem of ANOVA, involving 3 treatments and 10 observations per treatment, SSE = 500.
The MSE for this situation is
a. 130.2
b. 48.8
c. 18.52
d. 30.0
Q4. An ANOVA procedure is applied to data obtained from 7 samples where each sample contains
10 observations. The degrees of freedom for the critical value of F are
a. 7 numerator and 20 denominator degrees of freedom
b. 5 numerator and 20 denominator degrees of freedom
c. 6 numerator and 63 denominator degrees of freedom
d. 7 numerator and 63 denominator degrees of freedom
Q5. In an ANOVA problem if SST = 200 and SSTR = 80, then SSE is
a. 280
b. 120
c. 80
d. 120
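The bookkeeping behind Q2, Q4 and Q5 can be sketched in a few lines. The helper anova_dof is a hypothetical name, and the sketch assumes a one-way ANOVA in which SST = SSTR + SSE.

```python
def anova_dof(num_treatments, total_observations):
    """Return (numerator, denominator) degrees of freedom for one-way ANOVA."""
    return num_treatments - 1, total_observations - num_treatments

# Q2: 3 treatments, 10 observations each, SSE = 500.
_, error_dof_q2 = anova_dof(3, 3 * 10)
mse = 500 / error_dof_q2            # 500 / 27

# Q4: 7 samples, 10 observations each.
dof_q4 = anova_dof(7, 7 * 10)       # (6, 63)

# Q5: SSE is what remains of SST after removing SSTR.
sse = 200 - 80
```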
Q6. The critical F value with 8 numerator and 29 denominator degrees of freedom at α = 0.01 is
a. 2.18
b. 3.20
c. 3.53
d. 3.94
Q7. Two independent simple random samples are taken to test the difference between the means of
two populations. The standard deviations are not known, but are assumed to be equal. The
sample sizes are n1 = 15 and n2 = 35. The correct distribution to use is the
a. t distribution with 51 degrees of freedom
b. z distribution with 50 degrees of freedom
c. z distribution with 49 degrees of freedom
d. t distribution with 48 degrees of freedom
Q8. State True or False:
Q9. Mean marks obtained by male and female students of school ABCD in the first unit test are shown
below.
                            Male   Female
Sample Size                 64     36
Sample Mean Marks           44     41
Population Variance (σ²)    128    72
The standard error for the difference between the two means is
a. 4
b. 7.46
c. 4.24
d. 2.0
Q10 If you are interested in testing whether or not the average marks of males is significantly
greater than that of females, the test statistic is
a. 2.0
b. 1.5
c. 1.96
d. 1.645
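For Q9 and Q10, with known population variances the standard error of the difference between two means is sqrt(var1/n1 + var2/n2), and the test statistic is the difference in sample means divided by that standard error. A minimal sketch, with variable names chosen here for illustration:

```python
from math import sqrt

n_male, n_female = 64, 36
mean_male, mean_female = 44, 41
var_male, var_female = 128, 72   # population variances

# Standard error of (mean_male - mean_female): sqrt(2 + 2).
std_error = sqrt(var_male / n_male + var_female / n_female)

# z statistic for testing whether the male mean exceeds the female mean.
z_stat = (mean_male - mean_female) / std_error
```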
ANSWER KEY
A1 B
A2 C
MSE = SSE/DOF =500/(30-3) = 18.52
A3 B
A4 C
NUMERATOR DOF = C-1 =6
DENOMINATOR DOF =N-C = 70 - 7 = 63
A5 B
SSE = SST-SSTR = 200 – 80 = 120
A6 B (USE F TABLE)
A7 D
DOF for two sample t test = n1+n2 -2 = 15 +35 -2 = 48
A8 A
Only z test is possible in case of two proportions.
A9 D
A10 B
Week 6: Two way ANOVA and Linear regression
Q1: The model developed from sample data having the form ŷ = b0 + b1x is known as
a. regression equation
b. correlation equation
d. regression model
ANS: C
Q2: In regression analysis, which of the following is not a required assumption about the error term ε?
b. The variance of the error term is the same for all values of X.
ANS: A
Q3: A regression analysis between sales (Y in $1000) and advertising (X in dollars) resulted in the following
equation
ŷ = 30,000 + 5X
ANS: D
Q4: In a regression and correlation analysis if r² = 1, then
a. SSE = SST
b. SSE = 1
c. SSR = SSE
d. SSR = SST
ANS: D
c. equal to 1
d. equal to zero
ANS: A
Q6:
a) 0.887
b) 0.956
c) 0.945
d) 0.932
ANS: B
ANS: B
Q8: In Question 6, determine a 95% confidence interval for b1 to test the hypotheses
ANS: D
Statement: The variance of the error term is the same for all values of the independent variable.
a) True
b) False
ANS: A
ANS: B
Week 7 - Linear and Multiple Regression
Q1. The interval estimate of the mean value of y for a given value of x is defined as?
a. Prediction interval estimate
b. Confidence interval estimate
c. Average regression
d. X vs Y correlation interval
Ans: B
Q2. If the coefficient of determination is a positive value, then the coefficient of correlation
a. must also be positive
b. must be zero
c. can be either negative or positive
d. must be larger than 1
ANS: C
Ans: C
Q5. Regression analysis is a statistical procedure for developing a mathematical equation that
describes how
a. one independent and one or more dependent variables are related
b. several independent and several dependent variables are related
c. one dependent and one or more independent variables are related
d. None of these alternatives is correct.
ANS: C
Q6. If the R² value is small for a model with a large number of independent variables, the
adjusted coefficient of determination _______________
a. Can be positive
b. Can be negative
c. Is zero
d. Can’t say
Ans: B
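Why the adjusted coefficient of determination can go negative is easiest to see from its formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The numbers in this sketch are hypothetical and chosen only to illustrate the effect.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical case: a weak fit (R-squared = 0.10) with 8 predictors
# and only 20 observations.
adj = adjusted_r_squared(0.10, n_obs=20, n_predictors=8)
print(round(adj, 3))  # negative: the penalty outweighs the small R-squared
```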
Q7. Which one of the statements is true regarding residuals in regression analysis?
a. Mean of residuals is always 0
b. Mean of residuals is always < 0
c. Mean of residuals is always > 0
d. There is no such rule for residuals
Ans: A
Q8. In a simple linear regression model (one independent variable), if we change the input
variable by 1 unit, how much will the output variable change?
a. By 1
b. No change
c. By its slope
d. None of these
Ans: C
Q9. If all the points of a scatter diagram lie on the least squares regression line, then the
coefficient of determination for these variables based on these data is
a. 0
b. 1
c. either 1 or -1, depending upon whether the relationship is positive or negative
d. could be any value between -1 and 1
ANS: B
Q10. In a regression analysis, the regression equation is given by y = 12 - 6x. If SSE = 510 and
SST = 1000, then the coefficient of correlation is
a. -0.7
b. +0.7
c. 0.49
d. -0.49
ANS: A
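The arithmetic in Q10 follows from r² = 1 - SSE/SST, with the correlation coefficient taking the sign of the slope. A short sketch:

```python
from math import copysign, sqrt

slope = -6          # from y = 12 - 6x
sse, sst = 510, 1000

r_squared = 1 - sse / sst            # coefficient of determination
r = copysign(sqrt(r_squared), slope) # correlation takes the slope's sign
print(round(r_squared, 2), round(r, 2))
```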
Q1. For categorical data with ‘n’ categories, the number of dummy variables will be________
a. n
b. n-1
c. n+1
d. 2n
Ans: b
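A small sketch of the n-1 rule, assuming pandas is installed. The "colour" series is hypothetical; drop_first=True drops one category so that it serves as the baseline.

```python
import pandas as pd

# Hypothetical categorical variable with n = 3 categories.
colour = pd.Series(["red", "green", "blue", "green"], name="colour")

# One dummy column per category, minus the dropped baseline category.
dummies = pd.get_dummies(colour, drop_first=True)

n_categories = colour.nunique()
n_dummies = dummies.shape[1]
print(n_categories, n_dummies)
```

Keeping all n columns alongside an intercept would make the columns perfectly collinear, which is why one category is left out.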
Q5. State true or false: G statistic is used to check the individual significance of the independent
variables
a. True
b. False
Ans: B.
Q7. State True or False: The Method of Least Squares can be applied to models with any
probability distribution.
a. True
b. False
Ans: b.
Q8. Suppose you have been given a fair coin and you want to find out the odds of getting
heads. Which of the following options is true for such a case?
a. Odds will be 0
b. Odds will be 0.5
c. Odds will be 1
d. None of these
Ans. C
Q10. The logit function (given as l(x)) is the log-of-odds function. What could be the range of the logit
function on the domain x = [0,1]?
a. (– ∞ , ∞)
b. (0,1)
c. (0 , ∞)
d. (- ∞, 0 )
Ans. a.
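Both the fair-coin odds in Q8 and the unbounded range of the logit in Q10 can be checked numerically. A minimal sketch using the standard library; the function names are chosen here for illustration:

```python
from math import log

def odds(p):
    """Odds in favour of an event with probability p (p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log of the odds; defined for 0 < p < 1."""
    return log(odds(p))

fair_coin_odds = odds(0.5)        # 0.5 / 0.5

# The logit is 0 at p = 0.5 and grows without bound toward either end.
near_zero = logit(1e-9)
near_one = logit(1 - 1e-9)
```

At the endpoints p = 0 and p = 1 the logit itself is undefined; it tends to -∞ and +∞ respectively.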
Week 9
Q4. Sensitivity in ROC analysis is defined as (TP = True Positive, FP = False Positive, TN =
True Negative, FN = False Negative):
a. FP / (FP+TN)
b. FN/(TP+FN)
c. TN / (TN+FP)
d. TP / (TP+FN)
Ans. d.
Predicted Positive 8 3
Predicted Negative 2 7
a. 0.73
b. 0.7
c. 0.78
d. 0.8
Ans: d
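The question text for the table above is not shown, so this sketch assumes the two columns are Actual Positive and Actual Negative, giving TP = 8, FP = 3, FN = 2 and TN = 7. Under that assumption the usual rates work out as follows:

```python
# ASSUMED layout: rows are predictions, columns are actual classes.
tp, fp = 8, 3   # predicted positive: actually positive / actually negative
fn, tn = 2, 7   # predicted negative: actually positive / actually negative

sensitivity = tp / (tp + fn)                 # true positive rate (recall)
specificity = tn / (tn + fp)                 # true negative rate
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)
```

With this layout sensitivity is 0.8, specificity 0.7, precision about 0.73 and accuracy 0.75.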
Q8. State True or False: Standardization of features is not required before training a Logistic
regression model
a. True
b. False
Ans: a.
Q10. Which of the following is true regarding the logistic function for any value “x”?
A. Logistic(x): is a logistic function of any number “x”
B. Logit(x): is a logit function of any number “x”
C. Logit_inv(x): is an inverse logit function of any number “x”
A) Logistic(x) = Logit(x)
B) Logistic(x) = Logit_inv(x)
C) Logit_inv(x) = Logit(x)
D) None of these
Ans: b.
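That the logistic function undoes the logit can be verified with a round trip. A minimal sketch; the function names are chosen here for illustration:

```python
from math import exp, log

def logistic(x):
    """Map any real number to a probability in (0, 1)."""
    return 1 / (1 + exp(-x))

def logit(p):
    """Log-odds of a probability p with 0 < p < 1."""
    return log(p / (1 - p))

# Applying one function after the other returns the starting value,
# up to floating-point error.
round_trip_p = [logistic(logit(p)) for p in (0.1, 0.5, 0.9)]
round_trip_x = [logit(logistic(x)) for x in (-3.0, 0.0, 2.0)]
```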
Week 10: Chi-square test & Clustering
Q3. State True or False: Statement: The null hypothesis for the chi-square test of independence
assumes that all the proportions are equal.
a. True
b. False
Ans. a.
Q4. Statistical test conducted to determine whether to reject or not reject a hypothesized
probability distribution for a population is known as a ________
a. contingency test
b. probability test
c. goodness of fit test
d. None of these alternatives is correct.
Ans. c.
Q6. The degrees of freedom for a contingency table with 12 rows and 12 columns is
a. 144
b. 121
c. 12
d. 120
ANS: B
Q7. The table below gives beverage preferences for random samples of teens and adults.
              Teens   Adults   Total
Coffee        50      200      250
Tea           100     150      250
Soft Drink    200     200      400
Other         50      50       100
Total         400     600      1,000
We are asked to test for independence between age (i.e., adult and teen) and drink preferences.
With a .05 level of significance, the critical value for the test is _______
a. 1.645
b. 7.815
c. 14.067
d. 15.507
ANS: B
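The degrees of freedom in Q6 and Q7 come from (rows - 1) x (columns - 1). The sketch below also computes the expected counts under independence for the beverage table; the helper name is hypothetical, and the critical value itself still has to be read from a chi-square table.

```python
def independence_dof(n_rows, n_cols):
    """Degrees of freedom for a chi-square test of independence."""
    return (n_rows - 1) * (n_cols - 1)

dof_q6 = independence_dof(12, 12)   # 12 x 12 contingency table

# Q7: observed counts, one row per drink, columns = (teens, adults).
observed = {
    "Coffee": (50, 200),
    "Tea": (100, 150),
    "Soft Drink": (200, 200),
    "Other": (50, 50),
}
dof_q7 = independence_dof(len(observed), 2)

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

# Expected count = row total * column total / grand total.
expected = {
    drink: tuple(sum(row) * col / grand_total for col in col_totals)
    for drink, row in observed.items()
}
```

With 3 degrees of freedom at the .05 level, a chi-square table gives 7.815.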
Q8. How can Clustering (Unsupervised Learning) be used to improve the accuracy of the
Linear Regression model (Supervised Learning):
1. Creating different models for different cluster groups.
2. Creating an input feature for cluster ids as an ordinal variable.
3. Creating an input feature for cluster centroids as a continuous variable.
4. Creating an input feature for cluster size as a continuous variable
a. 1. Only
b. 1 & 2
c. 1 & 4
d. 1,2,3 & 4
Ans. d.
Q9. Let x1 = (1,2) and x2 = (3,5) be the coordinates of two objects. The Euclidean and
Manhattan distances between these two objects are __________ respectively
a. 4.2 and 3
b. 3.15 and 2
c. 3.61 and 5
d. None of the above
Ans: c.
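Both distances can be computed with the standard library (math.dist requires Python 3.8 or later). A short sketch:

```python
from math import dist

x1 = (1, 2)
x2 = (3, 5)

# Euclidean distance: square root of the summed squared differences.
euclidean = dist(x1, x2)                            # sqrt(2**2 + 3**2)

# Manhattan distance: sum of the absolute differences.
manhattan = sum(abs(a - b) for a, b in zip(x1, x2))  # 2 + 3

print(round(euclidean, 2), manhattan)
```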
Q10. Last school year, the student body of a local university consisted of 30% freshmen, 24%
sophomores, 26% juniors, and 20% seniors. A sample of 300 students taken from this year's
student body showed the following number of students in each classification.
Freshmen 83
Sophomores 68
Juniors 85
Seniors 64
We are interested in determining whether or not there has been a significant change in the
classifications between the last school year and this school year. The expected number of
freshmen is ________
a. 83
b. 90
c. 30
d. 10
ANS: B
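In a goodness-of-fit test, each expected count is the hypothesized proportion times the sample size. The sketch below computes all four expected counts and, going one step beyond what Q10 asks, the chi-square statistic as well.

```python
sample_size = 300

# Last year's proportions, in percent, and this year's observed counts.
last_year_pct = {"Freshmen": 30, "Sophomores": 24, "Juniors": 26, "Seniors": 20}
observed = {"Freshmen": 83, "Sophomores": 68, "Juniors": 85, "Seniors": 64}

# Expected count under "no change" = proportion * sample size.
expected = {k: pct * sample_size / 100 for k, pct in last_year_pct.items()}

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[k] - expected[k]) ** 2 / expected[k] for k in observed
)
```

The statistic would be compared with a chi-square critical value on 4 - 1 = 3 degrees of freedom.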
Week 11 - Clustering Analysis, K-means, Hierarchical clustering
Q1. Which library is used for calculating distance measures in clustering using python?
A. distance_matrix
B. scipy.spatial
C. scipy_spatial
D. distance.matrix
Ans: B.
(Error in the portal will be rectified soon.)
Q2. Formula for dissimilarity computation between two objects for categorical variables is –
Here p is a categorical variable and m denotes the number of matches.
Q3. For a data set with 7 objects and an interval-scaled variable ‘f’, we have the following
measurements:
f = (1, 2, 3, 4, 5, 8, 50)
containing one outlying value. Select the correct option.
A. Std deviation (std_f) and mean absolute deviation (s_f) are equally affected by the
outlier.
B. Mean absolute deviation (s_f) is more affected by the outlier.
C. Std deviation (std_f) is less affected by the outlier.
D. Std deviation (std_f) is more affected by the outlier.
Ans. D
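The effect of the outlier can be checked directly by computing both measures with and without the value 50. This sketch assumes the population standard deviation (dividing by the number of objects); the helper name is chosen here for illustration.

```python
from statistics import fmean, pstdev

def mean_abs_deviation(values):
    """Average absolute distance from the mean."""
    m = fmean(values)
    return fmean(abs(v - m) for v in values)

f = (1, 2, 3, 4, 5, 8, 50)
f_without_outlier = f[:-1]

std_f = pstdev(f)
s_f = mean_abs_deviation(f)

# How much each measure grows when the outlier is included.
std_inflation = std_f / pstdev(f_without_outlier)
mad_inflation = s_f / mean_abs_deviation(f_without_outlier)
print(round(std_f, 2), round(s_f, 2))
```

Squaring the deviations gives the outlier more weight, so the standard deviation is inflated by a larger factor than the mean absolute deviation.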
Q4. Select the correct statement about the standardization in the following options –
A. Standardizing the data always gives inefficient results when forming clusters
B. Standardizing the data is always beneficial during clustering analysis
C. Variables whose absolute values are meaningful may be less effective after standardization
during clustering analysis
D. Outliers cannot be detected in standardized data
Ans. C
Q5. Which of the following can act as possible termination conditions in K-Means?
Q6. In the figure below, if you draw a horizontal line at y = 2 on the y-axis, how many
clusters will be formed?
a. 1
b. 2
c. 3
d. 4
Ans: b.
Q8. State True or False: Hierarchical clustering should primarily be used for exploration
a. True
b. False
Ans. a.
Q9. State True or False: For finding dissimilarity between two clusters in hierarchical clustering,
average-link is the only metric used
a. True
b. False
Ans. b.
Q10. Two variables, V1 and V2, are used for clustering. Which of the following are true for
K-means clustering with k = 3?
1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line
2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line
a. 1 only
b. 2 only
c. 1 and 2
d. None of the above
Ans: a.
Week 12 - CART 1 & 2
Q1. Which clustering algorithm works well when the shape of the clusters is hyper-spherical?
a. K means
b. Agglomerative Hierarchical clustering
c. Divisive Hierarchical clustering
d. All of the above
Ans: a.
Q5. State True or False: Gini Index enforces the resulting tree to have multiway splits
a. True
b. False
Ans. b
Q7. _______is the measure of uncertainty of a random variable, it characterizes the impurity of
an arbitrary collection of examples.
a. Information Gain
b. Gini Index
c. Entropy
d. None of the above
Ans: c
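Entropy and the Gini index from Q5 and Q7 are both impurity measures computed from class proportions. A minimal sketch with hypothetical labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Hypothetical collections: an even two-class mix and a pure node.
mixed = ["yes"] * 5 + ["no"] * 5
pure = ["yes"] * 10

mixed_entropy, mixed_gini = entropy(mixed), gini(mixed)
pure_entropy, pure_gini = entropy(pure), gini(pure)
```

An even two-class mix gives the maximum impurity (entropy 1 bit, Gini 0.5), while a pure node gives 0 for both.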
Q9. Decision tree learners may create biased trees if some classes dominate. What is the
solution?
a. Balance the dataset prior to fitting
b. Imbalance the dataset prior to fitting
c. Balance the dataset after fitting
d. None of the above
Ans: a.
Q10. Suppose your target variable is the price of a house and you are using a decision tree. What type of
tree do you need to predict the target variable?
a. Classification tree
b. Regression tree
c. Clustering tree
d. Dimensionality reduction tree
Ans. b.