Interview Questions

Q1. Tell me something about yourself.

My name is Prakash Chandra. I am from Patna. Currently, I am working at the
Indian Statistical Institute as a Statistical Trainee. I hold an M.Sc. in Statistics &
Computing from Banaras Hindu University and completed my graduation from
Patna University. I am proficient in R, Python, and Machine Learning, and have
working knowledge of Base SAS & SPSS. I have done three projects, on Time
Series, Logistic Regression, and Machine Learning, and a one-month internship.
My father is a businessman and my mother is a housewife. My hobbies are playing
cricket and listening to music.
Q2. What are your strengths?
I am a hard worker, I am adaptable, and I am willing to take risks.
Q3. What are your weaknesses?
I make decisions too quickly, I trust others too easily, and I can be a bit lazy at times.
Q4. Why should I hire you?
I have 3-4 months of experience. I have done projects on Machine Learning and
Time Series. I have good knowledge of Machine Learning and of R and Python
programming, which are widely used in the Data Science field. So, I think that I
am a suitable candidate for this job.
Q5. Why did you choose this field?
First of all, my skills and knowledge, my curiosity about this profession, and the
money inspired me to choose this field.
Q6. Tell me about your internship.
Sir, I have done two internships. I did a one-month internship at Sigma Research &
Consultancy Pvt. Ltd., where I worked as a Research Trainee in December 2016.
During the internship, I learned two things. First, I learned how to work on a
project, i.e. project management and how to design a questionnaire for a project.
Second, I conducted a survey on the Demonetization carried out by the RBI. After
the survey, I found that 60% of people supported Demonetization and 40% of
people were against it.
Q7. Tell me something about any projects which you have done.
Sir, I have done three projects. I am going to tell you about the Titanic Survival
Analysis. This was my first Machine Learning project, where my task was to
predict whether a passenger survived or died in the Titanic disaster. For this
project, I collected the dataset from Kaggle. There are two datasets, a train dataset
and a test dataset. There were many missing values in Age & in other variables as
well, so first of all I replaced the missing values with the corresponding median
values. Then I used Logistic Regression for model building & tested the model's
validity on the test dataset. Based on the confusion matrix & ROC curve, I got 77%
accuracy with my model.
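A minimal R sketch of this workflow (the file name train.csv and the choice of
predictors are illustrative; Survived, Pclass, Sex & Age are columns of the Kaggle
Titanic data):

train <- read.csv("train.csv")

# Impute missing Age values with the median, as described above
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)

# Logistic regression for survival
fit <- glm(Survived ~ Pclass + Sex + Age, data = train, family = binomial)

# Predicted probabilities, converted to class labels at a 0.5 threshold
pred <- ifelse(predict(fit, type = "response") > 0.5, 1, 0)

# Confusion matrix and accuracy
cm <- table(Predicted = pred, Actual = train$Survived)
sum(diag(cm)) / sum(cm)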
Q8. Tell me about the Time Series project.
In this project, I forecast the SBI share price for the next 10 days on the basis of 2
years of historical data. For the forecasting, I used an ARIMA model. ARIMA
stands for Auto Regressive Integrated Moving Average. First of all, I collected the
data from Yahoo Finance. I found that the data were not stationary, so to make
them stationary I took a first-order difference and then checked it by plotting.
Then, to find the orders p & q, I plotted the autocorrelation & partial
autocorrelation functions & checked how many lags fell outside the significance
bounds. After finding the values of p, d & q, I built the model.
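A minimal R sketch of these steps, assuming the closing prices are already in a
numeric vector called prices (downloading the data from Yahoo Finance is
omitted):

library(tseries)    # adf.test()
library(forecast)   # Arima(), forecast()

y <- ts(prices)                       # convert to a time series object
adf.test(y)                           # augmented Dickey-Fuller stationarity test

dy <- diff(y)                         # first-order difference, so d = 1
plot(dy)                              # visual check for stationarity

acf(dy)                               # significant lags suggest the MA order q
pacf(dy)                              # significant lags suggest the AR order p

fit <- Arima(y, order = c(1, 1, 1))   # p, d, q as read off the plots (illustrative)
forecast(fit, h = 10)                 # forecast the next 10 days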
Q9. Tell me about the Credit Risk Modeling project.
This project was a classification problem in which I classified applicants according
to whether they were eligible for a loan or not. I used Logistic Regression and
Random Forest for this purpose. There were 44 variables & 500000 observations
in my dataset. First of all, I did data cleaning & feature selection. For feature
selection, I used a correlation test for the numeric variables & a Chi-square test for
the categorical variables. Finally, I found that only 5 features were important for
the response variable, such as loan amount, income, occupation, etc. I built a
model &, for model validation, I used cross-validation. I found that on my dataset
Random Forest gives higher accuracy than Logistic Regression. On the basis of the
confusion matrix & ROC curve, I got 92% accuracy with my model.
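A minimal R sketch of the feature-selection step and the model comparison,
assuming a data frame loans with a binary factor response default (all column
names are illustrative):

library(randomForest)

# Chi-square test of association for a categorical feature
chisq.test(table(loans$occupation, loans$default))

# Correlation screen for a numeric feature against a numeric coding of the response
cor(loans$income, as.numeric(loans$default))

# Fit both models on the selected features
logit_fit <- glm(default ~ loan_amount + income + occupation,
                 data = loans, family = binomial)
rf_fit    <- randomForest(default ~ loan_amount + income + occupation,
                          data = loans)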
What is R & RStudio?
R is a statistical programming language and open-source software. R is mainly
used for Data Analysis. Many data scientists use R because its data visualization
capabilities are very good. Ross Ihaka & Robert Gentleman created the R
language. RStudio is an integrated development environment for R.
Tell me some library names in R on which you have worked.
1) caTools: sample.split() - used together with subset() to divide a dataset into
train & test sets (see the sketch below).
2) forecast: used for forecasting.
3) ggplot2: ggplot() - used for high-level graphics in R.
4) caret: confusionMatrix() - used for constructing a confusion matrix.
5) stats (base R): ts() - used for converting data into a time series object.
& so many other packages as well.
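A minimal sketch of the caTools split mentioned in (1), assuming a data frame df
with a response column y (names are illustrative):

library(caTools)

set.seed(123)                                  # for reproducibility
split <- sample.split(df$y, SplitRatio = 0.7)  # TRUE/FALSE vector
train <- subset(df, split == TRUE)             # 70% training set
test  <- subset(df, split == FALSE)            # 30% test set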

What is data?
A collection of facts from which conclusions may be drawn.
What is Statistics?
Statistics is the branch of Mathematics which deals with the collection,
representation & interpretation of data in order to draw conclusions.
According to R.A. Fisher, Statistics is the branch of applied mathematics
which specializes in data.
Sometimes the Median is a better measure of central tendency than the Mean. Why?
Much data contains outliers, & we know that the median is not affected by
outliers. So, in that case we prefer to use the Median.
What is a population & a sample?
A population is the set of similar items or events which is of interest for some
experiment. A sample is a subset of the population chosen to represent the
population.
What is Secondary research?
Secondary research involves the summary & collation of existing research.
Secondary research is contrasted with primary research: primary research involves
the generation of data, whereas secondary research uses primary research sources
as its source of data for analysis.
What is the Normal distribution?
It refers to data that are distributed around a central value without any bias to the
left or right. It has a symmetrical bell-shaped curve.

What is a p-value?
The p-value is the probability of obtaining a test statistic at least as extreme as the
one observed, assuming that the null hypothesis is true. It forms the basis for the
acceptance or rejection of the null hypothesis.
Types of errors in Hypothesis testing:
Type I Error: the probability of rejecting the null hypothesis when it is true. It is
denoted by α.
Type II Error: the probability of accepting the null hypothesis when it is false. It is
denoted by β.
Which type of error is more dangerous?
Sometimes a Type I error is more dangerous than a Type II error & sometimes a
Type II error is more dangerous than a Type I error. It depends entirely upon the
situation. For example:
H0: The patient has the disease. H1: The patient does not have the disease.
This situation shows that a Type I error is more dangerous, because a patient left
without medicine may die, whereas a healthy person given medicine will likely not.
H0: The person is not a criminal. H1: The person is a criminal.
In this situation, a Type II error is more dangerous, because if a criminal goes free
there is a chance that he will commit a crime again.
What is time series data?
A set of data ordered in time is known as time series data, e.g. weekly data,
monthly data, yearly data, etc.
Types of forecasting method:
1) Naïve Approach
2) Simple Average
3) Moving Average
4) Single Exponential Smoothing
5) Holt’s Linear trend method
6) Holt-Winters seasonal method
7) ARIMA method
Packages for the ARIMA Model:
1) stats (base R) for creating time series data. The function is ts().
2) tseries package for stationarity testing. The function is adf.test(), which is
known as the (augmented) Dickey-Fuller test.
3) forecast package for forecasting. Useful functions are auto.arima() and
forecast().
What is meant by a stationary time series?
A time series whose mean & variance are constant over time is known as a
stationary time series.

What is the difference between Time Series & Regression data?

Time series data depend on time, whereas regression data do not. Regression
models assume independence between the output values for different values of the
input variable, while a time series model does not.
What is Regression Analysis?
Regression Analysis deals with finding the mathematical relationship between two
or more variables.
To carry out regression modelling in a real-life situation, one needs to consider the
experimental conditions & the phenomenon before deciding how many variables
to use, and why and how to choose the dependent & independent variables.
For example, the Income & Education of a person are related, since one can
expect a higher level of education to provide a higher income. But
Income = β0 + β1*Education + ε is not the correct model. So add a variable Age:
Income = β0 + β1*Education + β2*Age + ε is also not correct.
Income = β0 + β1*Education + β2*Age + β3*Age^2 + ε is the correct regression model.
What are the assumptions of the Linear Regression Model?
The assumptions of Linear Regression are as follows:
 Linearity, i.e. the model should be linear in the parameters.
 Homoscedasticity, i.e. equal variance of the error terms.
 Lack of multicollinearity.
 Lack of autocorrelation.
 Multivariate Normality.
 The number of observations should be greater than the number of parameters
to be estimated.
 X values are fixed in repeated sampling.
 Zero covariance between ei & xi.
 The regression model should be correctly specified.
 All X values must not be the same (there should be finite, nonzero variance in X).
What are the assumptions of Logistic Regression?
The assumptions of Logistic Regression are as follows:
 The response variable should be categorical.
 It does not require a linear relationship between the dependent & independent
variables.
 Observations should be independent of each other.
 There should be no multicollinearity among the independent variables.
 Homoscedasticity is not required.
 The error terms do not need to follow a normal distribution.
What is R square?
R square is known as the coefficient of determination. It is the proportion of the
variation in the dependent variable that is explained by the independent variables.
Let SSreg be the regression sum of squares, SST the total sum of squares & SSres
the residual sum of squares, and let n be the number of observations in the data &
k the number of estimated parameters (including the intercept). Then
R² = 1 − SSres/SST = SSreg/SST,   Adjusted R² = 1 − ((n−1)/(n−k))·(1−R²)
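In R, these quantities can be read directly from a fitted linear model; a minimal
sketch, assuming a data frame df with response y and predictors x1, x2 (names are
illustrative):

fit <- lm(y ~ x1 + x2, data = df)
s <- summary(fit)
s$r.squared       # R square
s$adj.r.squared   # Adjusted R square
s$fstatistic      # F-statistic with its degrees of freedom (see next question)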

What is the F-statistic in a Regression model? What is its significance?

F = MSreg/MSres = (SSreg/(k−1)) / (SSres/(n−k)) = ((n−k)/(k−1)) · (R²/(1−R²)),
where R² is the coefficient of determination.

Here, R² & F are closely related: if R² is zero then F is 0, & as R² approaches 1, F
approaches ∞. That is why the F-test under analysis of variance is termed the
measure of the overall significance of the estimated regression.
It is also a test of the significance of R². If F is highly significant, it implies that
we can reject H0, i.e., y is linearly related to the X's.
 Multicollinearity: The existence of a linear relationship between features or
variables is known as multicollinearity. There are several approaches to measuring
multicollinearity in the data:
1) Inspection of the correlation matrix
If corr(X1, X2) > 0.75 then we can say that they are highly correlated.
2) Determinant of the correlation matrix
Let D be the determinant of the correlation matrix. If D = 0, this indicates
the existence of an exact linear dependence among the explanatory
variables. If D = 1, the columns of the X matrix are orthonormal.
Thus a value close to 0 is an indication of a high degree of multicollinearity.
Any value of D between 0 and 1 gives an idea of the degree of
multicollinearity.
Limitation
It gives no information about the number of linear dependencies among the
explanatory variables.
3) Variance Inflation Factor
If the Variance Inflation Factor VIFj > 5, then we should remove the jth
variable from the model, where the VIF for the jth variable is defined as
VIFj = 1/(1 − Rj²),
where Rj² denotes the coefficient of determination obtained when Xj is
regressed on the remaining (k − 1) explanatory variables.
Limitations
a) It sheds no light on the number of dependencies among the explanatory
variables.
b) The rule VIF > 5 or 10 is a rule of thumb which may differ from one
situation to another.
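A minimal sketch of computing VIFs in R with the car package (df and the
variable names are illustrative):

library(car)

fit <- lm(y ~ x1 + x2 + x3, data = df)
vif(fit)   # one VIF per explanatory variable; values above 5 flag multicollinearity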

 Autocorrelation: When there is a linear relationship between the error terms,
the situation is known as autocorrelation. Autocorrelation can be detected by the
Durbin-Watson (DW) test; the "lmtest" package in R provides this test.
If 1.5 ≤ DW ≤ 2.5, there is no autocorrelation.
If 0 ≤ DW < 1.5, there is positive autocorrelation.
If 2.5 < DW ≤ 4, there is negative autocorrelation.
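A minimal sketch of the Durbin-Watson test in R (fit is a model fitted with lm(),
as in the earlier sketches):

library(lmtest)

dwtest(fit)   # reports the DW statistic and a p-value for autocorrelation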

 Heteroscedasticity: When the error terms coming from different observations
have different variances, the situation is called heteroscedasticity.

How do you check the Normality of data, i.e. Normality tests?

There are several methods for assessing whether data are normally distributed;
they can be classified into two broad categories:
Graphical
 Q-Q probability plots
 Cumulative frequency (P-P) plots
Statistical
 W/S test
 Jarque-Bera (J-B) test
 Shapiro-Wilk test
 Kolmogorov-Smirnov test
 D’Agostino test
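A minimal R sketch of one graphical and one statistical check, assuming a numeric
data vector x:

# Graphical: Q-Q plot against the normal distribution
qqnorm(x)
qqline(x)

# Statistical: Shapiro-Wilk test (H0: the data are normally distributed)
shapiro.test(x)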

When is Non-Normality a problem?

 Normality can be a problem when the sample size is small (n < 50).
 Highly skewed data create problems.
 Highly leptokurtic data are problematic, but not as much as skewed data.
 Non-normality becomes a serious concern when there is "activity" in the tails
of the data set.
 If the data are not normal, use non-parametric tests; if the data are normal,
use parametric tests.
What happens to the shape of the Normal distribution if the standard deviation
increases?
If the range of the data is held constant but the standard deviation increases, the
curve changes from a tall, narrow shape to a flat, wide one, i.e. the tails of the
curve lengthen & the peak of the curve decreases.
What is the difference between Errors & Residuals?
An error is the difference between the actual (true) value & the observed value,
whereas a residual is the difference between the observed value and the predicted
value.
What is the difference between R squared & Adjusted R squared?
R square always increases as the number of features in the model increases, but
Adjusted R square increases only if the newly added variable improves the model;
otherwise it decreases.
Note: the value of Adjusted R² can also be negative.
The following methods are used for model building:
 All-in
 Backward Elimination
 Forward Selection
 Bidirectional Selection
 Score comparison
What is the difference between a Statistic & an Estimator?
A statistic is a function of the sample values; when it is used to estimate some
unknown parameter, it is known as an estimator.

What is the difference between an Estimator & an Estimate?

Any statistic which is used to estimate an unknown population parameter is known
as an estimator, whereas an estimate is the particular value taken by the estimator.
What is the difference between standard error & standard deviation?
Standard deviation measures the amount of variability of a set of data around its
mean, whereas standard error measures how far the sample mean is likely to be
from the true population mean. The standard error of the mean is always smaller
than the standard deviation.
What is the difference between Correlation & Regression?
Correlation measures the linear association between variables, whereas regression
measures how one variable affects another.
What is the difference between covariance & correlation?
Covariance measures how one variable varies with the variation in another
variable, whereas correlation measures the degree of the (scale-free) linear
relationship between two variables, i.e. how strongly they are related.
What is the difference between Mathematical modelling and Statistical
modelling?
Mathematical modelling is deterministic, while statistical modelling is stochastic.
So, there is allowance for random error in a statistical model, whereas there is no
random error in a mathematical model.
What is the difference between mean & expected value?
Both refer to the same quantity, but "mean" is generally used when we talk about
a sample or frequency distribution, whereas "expected value" is used in the
context of a random variable & its probability distribution.
What is the difference between a randomized & a non-randomized test?
A hypothesis test is said to be non-randomized if the decision to accept or reject
the hypothesis depends only on the observed value of the test statistic; in a
randomized test, the decision may additionally depend on the outcome of a
random experiment.
What is the Box-Cox Method?
It is a way to transform a non-normal dependent variable into a normal shape. It is
mostly used in Simple Linear Regression.
The Box-Cox transformation of a variable x is indexed by λ and is defined as
x'λ = (x^λ − 1)/λ   for λ ≠ 0   (and x'λ = log x for λ = 0).
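A minimal sketch using the boxcox() function from the MASS package, which
plots the profile log-likelihood over a grid of λ values for a fitted linear model (df,
y & x1 are illustrative names):

library(MASS)

fit <- lm(y ~ x1, data = df)
boxcox(fit)   # pick the λ with the highest log-likelihood for the transformation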

What is the difference between Location, Scale & Shape parameters?

 A Location parameter shifts the entire distribution left or right.
 A Scale parameter compresses or stretches the entire distribution.
 A Shape parameter changes the shape of the distribution in some other way.
What is Machine Learning & what are its uses?
Machine Learning is the field of study that gives computers the ability to learn
from experience without being explicitly programmed.
Uses of Machine Learning:
Spam filtering, image recognition, recommendation systems, self-driving cars, etc.
What is the difference between Machine Learning, Deep Learning & Artificial
Intelligence?
Machine Learning is a subset of Artificial Intelligence, and Deep Learning is a
subset of Machine Learning.
Artificial Intelligence is the field of computer science concerned with the design
of intelligence in an artificial device.
Machine Learning is the field of study that gives computers the ability to learn
from experience.
Deep Learning is the sub-field of ML concerned with algorithms inspired by the
structure & function of the brain.
 Classical AI systems do not necessarily require data, whereas ML & especially
Deep Learning require large amounts of data.
Depending on the available data, Machine Learning is divided into three
parts:
 Supervised Learning: When we have labelled data, i.e. information on both
the dependent & independent variables is available, it is known as Supervised
Learning. Techniques: Linear & Logistic Regression, KNN, SVM, etc.
 Unsupervised Learning: When we have unlabelled data, i.e. information on
the dependent variable is not available, it is known as Unsupervised Learning.
Techniques: cluster analysis, PCA, Factor Analysis, etc.
 Reinforcement Learning: A type of ML where an agent performs actions in
an environment so as to maximize its reward. E.g.: chess engines, robots, etc.
What is the ROC curve?
The AUC-ROC curve is a performance measurement for classification problems at
various threshold values. The ROC is a probability curve that plots the TPR
against the FPR, with TPR on the y-axis and FPR on the x-axis. The higher the
AUC, the better the model.
True Positive Rate (TPR) / Recall / Sensitivity = TP/(TP + FN)
False Positive Rate (FPR) = 1 − Specificity = FP/(TN + FP)
As the classification threshold is lowered, both TPR and FPR increase; as it is
raised, both decrease.
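A minimal sketch of plotting a ROC curve and computing the AUC in R with the
pROC package, assuming a vector actual of 0/1 labels and a vector prob of
predicted probabilities (as in the logistic regression sketch earlier):

library(pROC)

roc_obj <- roc(actual, prob)
plot(roc_obj)   # ROC curve: TPR (sensitivity) against FPR (1 - specificity)
auc(roc_obj)    # area under the curve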
How do you use the AUC-ROC curve for a multi-class model?
In a multi-class model, we can plot N AUC-ROC curves for N classes using the
one-vs-all methodology. For example, if we have three classes named X, Y and Z,
we will have one ROC for X classified against Y and Z, another ROC for Y
classified against X and Z, and a third one for Z classified against X and Y.
How do we evaluate our model?
The following measures are used to evaluate a model:
1) Model Accuracy: Accuracy = Number of Correct Predictions / Total Number
of Predictions Made. It works well only if there are equal numbers of samples
belonging to each class.
2) Confusion Matrix: It gives a matrix as output and describes the complete
performance of the model.
There are 4 important terms:
 True Positives
 True Negatives
 False Positives
 False Negatives

3) Area Under the Curve: The AUC of a classifier is equal to the probability that
the classifier will rank a randomly chosen positive example higher than a
randomly chosen negative example. Two important ingredients of the ROC
curve are sensitivity & specificity. The AUC has a range of [0, 1]; the greater
the value, the better the performance of the model.
4) F1 Score: The F1 Score is the harmonic mean of precision and recall. Its
range is [0, 1]; the greater the F1 Score, the better the performance of the
model. The F1 Score tries to find the balance between precision and recall
(see the sketch after this list). Mathematically,
F1 = 2 / (1/Precision + 1/Recall)
 Precision = True Positives / (True Positives + False Positives)
 Recall = True Positives / (True Positives + False Negatives)

5) Mean Absolute Error: The Mean Absolute Error is the average of the absolute
difference between the original values and the predicted values. It gives us a
measure of how far the predictions are from the actual output. Mathematically,
MAE = (1/N) Σ |yj − y'j|,  summing over j = 1, …, N.

6) Mean Squared Error: MSE is quite similar to Mean Absolute Error, the
only difference being that MSE takes the average of the square of the
difference between the original values and the predicted values.
Mathematically, MSE = (1/N) Σ (yj − y'j)²,  summing over j = 1, …, N.

7) Logarithmic Loss: Log loss penalizes confident wrong classifications. For a
binary problem it is −(1/N) Σ [yj·log(pj) + (1 − yj)·log(1 − pj)], where pj is the
predicted probability of class 1; the lower the log loss, the better the model.
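A minimal R sketch computing several of the metrics in this list, assuming vectors
pred and actual of 0/1 class labels and, for the error metrics, numeric vectors y
(original values) and y_hat (predicted values):

# Confusion-matrix-based metrics
tp <- sum(pred == 1 & actual == 1)   # true positives
fp <- sum(pred == 1 & actual == 0)   # false positives
fn <- sum(pred == 0 & actual == 1)   # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 / (1 / precision + 1 / recall)   # harmonic mean

# Error metrics for numeric predictions
mae <- mean(abs(y - y_hat))    # Mean Absolute Error
mse <- mean((y - y_hat)^2)     # Mean Squared Error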
Why is SVM so special?
ML concepts usually come with large amounts of data, and algorithms like
Random Forest and decision trees need more and more data to become more
accurate. SVM, on the other hand, can be used when the data are not very large;
even with fewer than 1k observations, SVM works very well.
Why is Naïve Bayes called naïve?
The Naïve Bayes classifier assumes independence among features: the presence of
a particular feature in a class is unrelated to the presence of any other feature.
Why is Logistic Regression called a linear classifier?
A linear classifier is a classifier which uses a linear combination of the variables
for classification. Logistic Regression does this with the help of the logit function.
It should be remembered that we are talking about a linear classifier, not a linear
model: logistic regression is a case of the generalized linear model.
What is Natural Language Processing?
It is the area of computer science concerned with the interaction between human
(natural) languages and computer languages; in particular, with how to program
computers to process & analyze large amounts of natural language data.
 Usually, increasing the depth of a tree causes overfitting, while restricting the
depth too much causes underfitting.
What is Sentiment Analysis?
Sentiment Analysis is the process of determining whether a body of text is
positive, negative or neutral.
How can you get a sentiment score?
To obtain sentiment scores, we can use the get_sentiment() function in R. After
that, we can classify a sentiment score as positive or negative according to whether
score > 0 or score < 0.
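A minimal sketch, assuming get_sentiment() comes from the syuzhet package (the
example texts are illustrative):

library(syuzhet)

texts  <- c("I love this product", "This is terrible")
scores <- get_sentiment(texts)   # one numeric score per text
labels <- ifelse(scores > 0, "positive",
                 ifelse(scores < 0, "negative", "neutral"))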
How can we extract data from Twitter?
To extract tweets from Twitter, first of all we have to create a Twitter API key;
then we can use the twitteR package in R.

What is Credit Risk?

When you lend money to someone, there is uncertainty about whether the person
will return the loan amount. That uncertainty is known as Credit Risk. It can be
measured in terms of probability.
