Interview Questions
What is data?
A collection of facts from which conclusions may be drawn.
What is Statistics?
Statistics is the branch of mathematics which deals with the collection, representation
& interpretation of data in order to draw conclusions.
According to R.A.Fisher, Statistics is the branch of applied mathematics
which specializes in data.
Sometimes the Median is a better measure of central tendency than the Mean. Why?
Much real data contains outlier values & we know that the median is not affected by
outliers. So, in that case we prefer to use the Median.
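As a quick illustration in R (with made-up numbers), a single extreme value pulls the mean far away while the median stays put:

x <- c(25, 27, 30, 32, 35, 40, 1000)   # hypothetical values with one extreme outlier
mean(x)                                # pulled strongly towards the outlier
median(x)                              # unaffected by the outlier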
What is a population & a sample?
A population is the set of similar items or events which is of interest for some
experiment. A sample is a subset of the population chosen to represent the
population.
What is Secondary research?
Secondary research involves the summary & collation of existing
research. Secondary research is contrasted with primary research in that primary
research involves the generation of data, whereas secondary research uses
primary research sources as a source of data for analysis.
What is Normal distribution?
It refers to a distribution in which most of the data lies around a central value, without any
bias to the left or right. It has a symmetrical, bell-shaped curve.
What is p-value?
The p-value is the probability, assuming the null hypothesis is true, of obtaining a test
statistic at least as extreme as the one observed. It forms the basis for rejecting or failing to
reject the null hypothesis.
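As a minimal sketch in R (using simulated data), most test functions report the p-value directly, for example a one-sample t-test:

set.seed(1)
x <- rnorm(30, mean = 52, sd = 10)   # simulated sample
t.test(x, mu = 50)                   # tests H0: true mean = 50 and reports the p-value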
Types of errors in Hypothesis testing:
Type 1 Error: the probability of rejecting the null hypothesis when it is true. It is denoted by α.
Type 2 Error: the probability of accepting (failing to reject) the null hypothesis when it is false.
It is denoted by β.
Which type of error is more dangerous than the other?
Sometimes a Type 1 error is more dangerous than a Type 2 error & sometimes a Type 2 error
is more dangerous than a Type 1 error. It totally depends upon the situation, e.g.:
H0: The patient has the disease. H1: The patient does not have the disease.
This situation shows that a Type 1 error is more dangerous, because a sick patient
left without medicine may die, whereas a healthy person given medicine will usually not die.
H0: The person is not a criminal. H1: The person is a criminal.
In this situation, a Type 2 error is more dangerous than a Type 1 error, because if a criminal
goes free there is a chance that he will commit a crime again.
What is time series data?
A set of data recorded in time order is known as time series data, e.g.
weekly data, monthly data, yearly data, etc.
Types of forecasting methods:
1) Naïve Approach
2) Simple Average
3) Moving Average
4) Single Exponential Smoothing
5) Holt’s Linear trend method
6) Holt-Winters seasonal method
7) ARIMA method
Packages for an ARIMA model (a short R sketch follows this list):
1) The ts() function (from base R's stats package) is used to create a time series object.
2) The tseries package is used for stationarity testing; its adf.test() function performs the
augmented Dickey-Fuller test.
3) The forecast package is used for forecasting; auto.arima() selects and fits an ARIMA model
and forecast() produces the forecasts.
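A minimal sketch of this workflow in R; sales is a hypothetical vector of monthly values, and auto.arima() is used here to choose the model order:

library(tseries)                # adf.test()
library(forecast)               # auto.arima(), forecast()

y <- ts(sales, frequency = 12)  # create a monthly time series object
adf.test(y)                     # augmented Dickey-Fuller test for stationarity
fit <- auto.arima(y)            # fit an ARIMA model
forecast(fit, h = 12)           # forecast the next 12 periods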
What is meant by a stationary time series?
A time series whose mean & variance are constant over time is known as a
stationary time series.
Relation between the F statistic and R²:
F = [R² / (k − 1)] / [(1 − R²) / (n − k)] = [(n − k) / (k − 1)] · [R² / (1 − R²)],
where R² is the coefficient of determination, n is the number of observations and k is the
number of estimated parameters.
Here, R² & F are closely related. If R² is zero, then F will be 0, & if R² is 1, then F
will become ∞. That is why the F test under analysis of variance is termed a
measure of the overall significance of the estimated regression.
It is also a test of the significance of R². If F is highly significant, it implies that we
can reject H0, i.e., y is linearly related to the X’s.
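In R, both quantities are reported together by summary() of a fitted linear model (df, x1, x2 and x3 below are placeholders):

fit <- lm(y ~ x1 + x2 + x3, data = df)   # df is a hypothetical data frame
s <- summary(fit)
s$r.squared                              # R²
s$fstatistic                             # F statistic with its degrees of freedom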
Multicollinearity: The existence of a linear relationship between features (explanatory
variables) is known as multicollinearity. There are several approaches to measuring
multicollinearity in the data (an R sketch of these checks follows this list).
1) Inspection of correlation matrix
If corr(X1, X2) > 0.75 then we can say that they are highly correlated.
2) Determinant of correlation matrix
Let D be the determinant of the correlation matrix. If D = 0, this
indicates the existence of exact linear dependence among the explanatory
variables. If D = 1, the columns of the X matrix are orthonormal.
Thus a value close to 0 is an indication of a high degree of multicollinearity,
and any value of D between 0 and 1 gives an idea of the degree of
multicollinearity.
Limitation
It gives no information about the number of linear dependencies among
explanatory variables.
3) Variance Inflation Factor
If the Variance Inflation Factor (VIFj) > 5 then we should consider removing the jth variable
from our model, where the VIF for the jth variable is defined as
VIFj = 1 / (1 − Rj²),
where Rj² denotes the coefficient of determination obtained
when Xj is regressed on the remaining (k − 1) explanatory variables, excluding Xj.
Limitations
a) It sheds no light on the number of dependencies among the explanatory
variables.
b) The rule VIF > 5 or 10 is a rule of thumb which may differ from one
situation to another.
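A sketch of the three checks above in R; dat is a hypothetical data frame with the response y in its first column, and vif() is taken from the car package (one common implementation):

library(car)                  # provides vif()

cm <- cor(dat[ , -1])         # 1) correlation matrix of the explanatory variables
det(cm)                       # 2) determinant: a value near 0 signals strong multicollinearity
fit <- lm(y ~ ., data = dat)
vif(fit)                      # 3) VIF per variable; values above 5 are a warning sign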
Autocorrelation: When there is a linear relationship between the error terms, the
situation is known as autocorrelation. Autocorrelation can be detected with the
Durbin-Watson (DW) test; the “lmtest” package in R provides this test.
If 1.5 ≤ DW ≤ 2.5 then there is no autocorrelation.
If 0 ≤ DW < 1.5 then there is positive autocorrelation.
If 2.5 < DW ≤ 4 then there is negative autocorrelation.
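A minimal example with the lmtest package (dat, x1 and x2 are placeholders):

library(lmtest)                  # provides dwtest()

dwtest(y ~ x1 + x2, data = dat)  # Durbin-Watson test; a DW statistic near 2 suggests no autocorrelation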
True Positives.
True Negatives.
False Positives.
False Negatives.
3) Area under curve: The AUC of a classifier is equal to the probability that the
classifier will rank a randomly chosen positive example higher than a
randomly chosen negative example. Two important quantities for the
AUC (ROC) curve are sensitivity & specificity. As evident, AUC has a range of
[0, 1]. The greater the value, the better is the performance of our
model.
4) F1 Score: The F1 Score is the harmonic mean of precision and recall.
The range for the F1 Score is [0, 1]. The greater the F1 Score, the better is
the performance of our model. The F1 Score tries to find the balance
between precision and recall (an R sketch of these metrics follows this list). Mathematically,
F1 = 2 / (1/Precision + 1/Recall) = 2 · (Precision · Recall) / (Precision + Recall)
Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
5) Mean Absolute Error: MAE is the average of the absolute differences between the
original values and the predicted values.
6) Mean Squared Error: MSE is quite similar to Mean Absolute Error, the
only difference being that MSE takes the average of the square of the
difference between the original values and the predicted values.
Mathematically, MSE = (1/N) ∑ (y_j − y'_j)², where the sum runs over j = 1, …, N.
7) Logarithmic Loss: Log Loss penalizes confident but incorrect probabilistic predictions.
For binary classification, Log Loss = −(1/N) ∑ [y_j log(p_j) + (1 − y_j) log(1 − p_j)], where p_j
is the predicted probability that observation j is positive. The lower the Log Loss, the better is
the performance of our model.
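A short R sketch of these metrics on hypothetical predictions (base R only):

actual    <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)   # made-up true labels
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)   # made-up predicted labels

tp <- sum(predicted == 1 & actual == 1)        # true positives
fp <- sum(predicted == 1 & actual == 0)        # false positives
fn <- sum(predicted == 0 & actual == 1)        # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

mse <- mean((c(3.1, 2.8, 4.0) - c(3.0, 3.0, 3.9))^2)   # MSE on made-up numeric predictions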
Why is SVM so special?
ML generally relies on a large amount of data. Algorithms like
random forest, decision trees, etc. need more and more data to become more
accurate. On the other hand, SVM can be used in cases when the data set is not very
large; even with fewer than 1,000 observations SVM can work very well.
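A minimal sketch with the e1071 package (one of several SVM implementations in R), using the small built-in iris data set (150 rows) to illustrate training on little data:

library(e1071)                                           # provides svm()

fit <- svm(Species ~ ., data = iris, kernel = "radial")  # fit an SVM classifier
table(predicted = predict(fit, iris), actual = iris$Species)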
Why is Naïve Bayes called naïve?
The Naïve Bayes classifier assumes that the features are independent, meaning the presence
of a particular feature of a class is unrelated to the presence of any other feature.
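A minimal sketch with the e1071 package; the model treats each feature as independent given the class:

library(e1071)                               # provides naiveBayes()

fit <- naiveBayes(Species ~ ., data = iris)  # class-conditional independence assumed per feature
predict(fit, head(iris))                     # predicted classes for the first few rows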
Why is Logistic Regression called a linear classifier?
A linear classifier is a classifier which uses a linear combination of the variables
for classification. Logistic Regression does this with the help of the logit function: the
log-odds of the predicted probability is a linear combination of the inputs. It
should be remembered that we are talking about a linear classifier, not a linear
model; logistic regression is a special case of the generalized linear model.
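A small sketch in base R using glm() on the built-in mtcars data; the log-odds (logit) of the predicted probability is a linear combination of the predictors, which is what makes it a linear classifier:

fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
coef(fit)                              # coefficients of the linear combination
head(predict(fit, type = "link"))      # linear predictor, i.e. the log-odds
head(predict(fit, type = "response"))  # probabilities after the logistic transform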
What is Natural Language Processing?
It is the area of computer science concerned with the interaction between
human (natural) languages and computers, in particular how to program computers
to process & analyze large amounts of natural language data.
Usually, increasing the depth of a tree will cause overfitting, while keeping the trees
too shallow will cause underfitting; in a random forest, increasing the number of trees does
not by itself cause underfitting, it mainly increases computation.
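A sketch with the randomForest package (one common implementation), where ntree sets the number of trees and maxnodes limits how complex each tree can grow:

library(randomForest)

fit <- randomForest(Species ~ ., data = iris, ntree = 500, maxnodes = 5)
fit                                    # prints the out-of-bag error estimate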
What is Sentiment Analysis?
Sentiment Analysis is the process of determining whether a body of text is positive,
negative or neutral.
How can you get a sentiment score?
To obtain a sentiment score, we can use the get_sentiment() function (from the syuzhet
package) in R. After that, we can classify each score as positive (score > 0) or negative
(score < 0).
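A minimal sketch, assuming the syuzhet package and some made-up example texts:

library(syuzhet)                                        # provides get_sentiment()

texts  <- c("I love this product", "This is terrible")  # hypothetical texts
scores <- get_sentiment(texts, method = "syuzhet")
ifelse(scores > 0, "positive", ifelse(scores < 0, "negative", "neutral"))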
How can we extract data from Twitter?
To extract tweets from Twitter, first of all we have to create a Twitter API application (to
obtain the API keys), then we can use the twitteR package in R.
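A minimal sketch with the twitteR package; the keys below are placeholders obtained from the Twitter developer application, and the Twitter API has changed over time, so this workflow may need updating:

library(twitteR)

setup_twitter_oauth("consumer_key", "consumer_secret", "access_token", "access_secret")
tweets <- searchTwitter("#rstats", n = 100)   # fetch up to 100 recent tweets for a query
df <- twListToDF(tweets)                      # convert the result to a data frame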