ML Interview Questions

Q1. What are the different types of Machine Learning?

Supervised Learning: It is a method in which the machine learns using labeled data.
 It is like learning under the guidance of a teacher
 Training dataset is like a teacher which is used to train the machine
 Model is trained on a pre-defined dataset before it starts making decisions when given new data

Unsupervised Learning: It is a method in which the machine is trained on unlabelled data or without any
guidance
 It is like learning without a teacher.
 Model learns through observation & finds structures in data.
 Model is given a dataset and is left to automatically find patterns and relationships in that
dataset by creating clusters.

Reinforcement Learning: It involves an agent that interacts with its environment by producing actions and
discovering errors or rewards.
 It is like being stranded on an isolated island, where you must explore the environment and learn
how to live and adapt to the living conditions on your own.
 Model learns through trial and error.
 It learns on the basis of the reward or penalty given for every action it performs.

Q2. Explain Classification and Regression.

Classification is the task of predicting a discrete class label (for example, spam vs. not spam), while Regression is the task of predicting a continuous numeric value (for example, the price of a house). Both are supervised learning problems; they differ only in the type of output variable being predicted.

Q3. What do you understand by selection bias?

 It is a statistical error that causes a bias in the sampling portion of an experiment.


 The error causes one sampling group to be selected more often than other groups included in
the experiment.
 If selection bias is not identified, it may produce an inaccurate conclusion.

Q4. How to ensure that your model is not Over-fitting?

 Keep the design of the model simple. Try to reduce the noise in the model by considering
fewer variables and parameters.
 Cross-validation techniques such as K-fold cross-validation help us keep over-fitting under
control.
 Regularization techniques such as LASSO help in avoiding over-fitting by penalizing certain
parameters if they are likely to cause over-fitting (see the sketch below).
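
A minimal sketch, assuming scikit-learn and made-up data, of the two controls mentioned above: scoring a model with K-fold cross-validation and letting a LASSO (L1) penalty shrink unneeded coefficients towards zero.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))             # 100 samples, 20 features
y = X[:, 0] * 3.0 + rng.normal(size=100)   # only the first feature matters

# K-fold cross-validation: average score over 5 held-out folds.
model = Lasso(alpha=0.1)                   # L1 penalty shrinks useless weights towards 0
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("Mean CV R^2:", scores.mean())

# After fitting, most of the irrelevant coefficients are driven to zero by the L1 penalty.
model.fit(X, y)
print("Non-zero coefficients:", np.sum(model.coef_ != 0))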

Q5. What is a Confusion Matrix?


A confusion matrix or an error matrix is a table which is used for summarizing the performance of a
classification algorithm.
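
A minimal sketch, assuming scikit-learn and toy labels, of how such a table is produced in practice:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classifier output
# Rows = actual class, columns = predicted class (0 then 1 by default).
print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [1 3]]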

Q6. Explain false negative, false positive, true negative and true positive with a simple
example.
Let’s consider a scenario of a fire emergency:

 True Positive: If the alarm goes off and there is a fire.
Fire is positive and the prediction made by the system is true.
 False Positive: If the alarm goes off and there is no fire.
The system predicted fire to be positive, which is a wrong prediction, hence the prediction is false.
 False Negative: If the alarm does not ring but there was a fire.
The system predicted fire to be negative, which was false since there was a fire.
 True Negative: If the alarm does not ring and there was no fire.
The fire is negative and this prediction was true.
Q7. What do you understand by Precision and Recall?

 Imagine that your girlfriend has given you a birthday surprise every year for the last 10 years. One
day, your girlfriend asks you: ‘Sweetie, do you remember all the birthday surprises from me?’
 To stay on good terms with your girlfriend, you need to recall all 10 events from your
memory. Therefore, recall is the ratio of the number of events you can correctly recall to the
total number of events.
 If you can recall all 10 events correctly, your recall ratio is 1.0 (100%), and if you can recall 7
events correctly, your recall ratio is 0.7 (70%).

However, you might be wrong in some answers.

 For example, let’s assume that you took 15 guesses out of which 10 were correct and 5 were
wrong. This means that you can recall all events but not so precisely
 Therefore, precision is the ratio of a number of events you can correctly recall, to the total
number of events you can recall (mix of correct and wrong recalls).
 From the above example (10 real events, 15 answers: 10 correct, 5 wrong), you get 100% recall
but your precision is only 66.67% (10 / 15)
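
A small sketch in plain Python recomputing the birthday example above (10 real events; 15 answers, of which 10 are correct):

true_events = 10
answers_given = 15
correct_answers = 10

recall = correct_answers / true_events        # 10 / 10 = 1.0  (100%)
precision = correct_answers / answers_given   # 10 / 15 ≈ 0.667 (66.67%)
print(f"recall={recall:.2f}, precision={precision:.2f}")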

Q8. What is the difference between inductive and deductive learning?

 Inductive learning is the process of using observations to draw conclusions


 Deductive learning is the process of using conclusions to form observations

Q9. What is ROC curve and what does it represent?


Receiver Operating Characteristic curve (or ROC curve) is a plot of the true positive rate (Sensitivity)
against the false positive rate (1-Specificity) for the different possible cut-off points of a diagnostic test.

 The closer the curve follows the left-hand border and then the top border of the ROC space, the
more accurate the test.
 The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the
test.
 The slope of the tangent line at a cut point gives the likelihood ratio (LR) for that value of the
test.
 The area under the curve is a measure of test accuracy.
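
A minimal sketch, assuming scikit-learn and made-up scores, of computing the ROC points and the area under the curve:

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]                    # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)     # one (FPR, TPR) point per cut-off
print("AUC:", roc_auc_score(y_true, y_scores))         # area under the curve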

Q10. What’s the difference between Type I and Type II error?

A Type I error is a false positive: rejecting a null hypothesis that is actually true. A Type II error is a false negative: failing to reject a null hypothesis that is actually false.

Q11. Is it better to have too many false positives or too many false negatives? Explain.

It depends on the question as well as on the domain for which we are trying to solve the problem. If
you’re using Machine Learning in the domain of medical testing, then a false negative is very risky, since
the report will not show any health problem when a person is actually unwell.
Similarly, in spam detection, a false positive is very risky because the algorithm may classify an
important email as spam.

Q12. Which is more important to you – model accuracy or model performance?


Well, you must know that model accuracy is only a subset of model performance. The accuracy of the
model and the performance of the model are directly related: the better the performance of the model,
the more accurate its predictions.

Q13. What is the difference between Gini Impurity and Entropy in a Decision Tree?
Gini Impurity and Entropy are the metrics used for deciding how to split a Decision Tree when the target
variable is categorical (i.e they are only used in classification, not regression).

Gini Impurity tells the probability of misclassifying an observation. It is the probability of a random
sample being classified incorrectly when you randomly pick a label according to the distribution in the
branch. The lower the value of Gini, the better the split; in other words, the lower the likelihood of
misclassification.

Entropy is a measure of the lack of information (impurity) in a node. You calculate the Information Gain
(the difference in entropies) produced by making a split. This measure helps to reduce the uncertainty about the
output label.
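
A minimal sketch of both measures for a categorical node, written from the standard formulas (Gini = 1 - sum(p_i^2), Entropy = -sum(p_i * log2(p_i))), with made-up class counts:

import numpy as np

def gini(class_counts):
    p = np.asarray(class_counts) / np.sum(class_counts)
    return 1.0 - np.sum(p ** 2)

def entropy(class_counts):
    p = np.asarray(class_counts) / np.sum(class_counts)
    p = p[p > 0]                            # avoid log2(0)
    return -np.sum(p * np.log2(p))

print(gini([5, 5]), entropy([5, 5]))        # 0.5, 1.0 -> most impure binary node
print(gini([10, 0]), entropy([10, 0]))      # both ~0  -> pure node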

Q14. What is the difference between Entropy and Information Gain?

The largest information gain is equivalent to the smallest entropy.

Entropy is the measure of disorder, meaning it tells you the uncertainty in your data (i.e. how messy your data
is). If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided
between two classes it has an entropy of one. It decreases as you get closer to the leaf node.

Information Gain is based on the decrease in entropy after a dataset is split on an attribute. It is
calculated by comparing the entropy of the dataset before and after a transformation. It keeps on
increasing as you reach closer to the leaf node. It is defined as the amount of information that's gained
by knowing the value of the attribute, which is the entropy of the distribution before the split minus the
entropy of the distribution after it.
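
A minimal sketch of Information Gain for a single split, using the same entropy formula and made-up class counts: IG = entropy(parent) - weighted entropy(children).

import numpy as np

def entropy(class_counts):
    p = np.asarray(class_counts) / np.sum(class_counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent = [10, 10]                 # 20 samples, perfectly mixed -> entropy 1.0
left, right = [9, 1], [1, 9]      # a split that separates the classes well

n = sum(parent)
weighted_child_entropy = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
info_gain = entropy(parent) - weighted_child_entropy
print(round(info_gain, 3))        # ~0.531 bits gained by this split
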
Q15. Explain Ensemble learning technique in Machine Learning.

Ensemble learning is a technique that is used to create multiple Machine Learning models, which are
then combined to produce more accurate results. A general Machine Learning model is built by using
the entire training data set. However, in Ensemble Learning the training data set is split into multiple
subsets, wherein each subset is used to build a separate model. After the models are trained, they are
then combined to predict an outcome in such a way that the variance in the output is reduced.
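
A minimal sketch, assuming scikit-learn and synthetic data, of one common ensemble approach (bagging): several decision trees are trained on bootstrap subsets of the training data and their votes are combined.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# The averaged ensemble usually scores a little higher because its variance is lower.
print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())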

Q16. How do you handle outliers?

 If your data set is huge and rich then you can risk dropping the outliers.
 However, if your data set is small then you can cap the outliers by setting a threshold
percentile. For example, data points above the 95th percentile can be capped at the
95th-percentile value.
 Lastly, based on the data exploration stage, you can narrow down some rules and impute the
outliers based on those business rules.
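
A minimal sketch, assuming NumPy and made-up data, of the capping approach described above (here capping at the 5th and 95th percentiles):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(50, 10, size=1000)
data[:5] = [500, -400, 390, 410, -350]         # inject a few extreme values

low, high = np.percentile(data, [5, 95])
capped = np.clip(data, low, high)              # values beyond the thresholds are capped
print(low, high, capped.min(), capped.max())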

Q17. What are collinearity and multicollinearity?

 Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression have
some correlation.
 Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-
correlated.
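
A minimal sketch, assuming NumPy and made-up predictors, of spotting a collinear pair through the correlation matrix (x2 is constructed almost entirely from x1):

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)                      # independent predictor

corr = np.corrcoef(np.vstack([x1, x2, x3]))
print(np.round(corr, 2))   # the (x1, x2) entry is close to 1 -> collinear pair
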
Q18. What do you understand by Eigenvectors and Eigenvalues?

 Eigenvectors are those vectors whose direction remains unchanged even when a linear
transformation is performed on them.
 Eigenvalue is the scalar that is used for the transformation of an Eigenvector.

The Eigenvector of a square matrix A is a nonzero vector x such that for some number λ, we
have the following: Ax = λx, where λ is an Eigenvalue.

For example, if λ = 3 and x = [1 1 2] satisfy Ax = λx for some matrix A, then 3 is an Eigenvalue and the
original vector in the multiplication problem is an Eigenvector.
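
A minimal sketch, assuming NumPy and an illustrative symmetric matrix (not the matrix from the original example), verifying that Ax = λx for every eigenpair returned:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of `eigenvectors` are the Eigenvectors

for lam, x in zip(eigenvalues, eigenvectors.T):
    print(lam, np.allclose(A @ x, lam * x))    # True: A only scales x, its direction is unchanged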

Q19. Running a binary classification tree algorithm is quite easy. But do you know how
the tree decides on which variable to split at the root node and its succeeding child
nodes?
Measures such as, Gini Index and Entropy can be used to decide which variable is best fitted for splitting
the Decision Tree at the root node.

Gini Impurity tells the probability of misclassifying an observation.


Calculate the Gini impurity for each sub-node as one minus the sum of the squared class probabilities (for success and failure).
Calculate the Gini for the split as the weighted average Gini score of each node of that split (see the sketch below).

Entropy is the measure of impurity or randomness in the data, (for binary class).
Entropy is zero when a node is homogeneous and is maximum when both the classes are present in a
node at 50% – 50%. To sum it up, the entropy must be as low as possible in order to decide whether or
not a variable is suitable as the root node.
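
A minimal sketch of the root-node decision: compute the weighted Gini impurity of each candidate split (made-up class counts) and pick the lowest one.

import numpy as np

def gini(counts):
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def weighted_gini(children):
    n = sum(sum(c) for c in children)
    return sum((sum(c) / n) * gini(c) for c in children)

# Two candidate variables, each splitting 20 samples into two child nodes.
split_A = [[9, 1], [1, 9]]     # separates the classes well
split_B = [[6, 4], [4, 6]]     # barely better than the parent
print(round(weighted_gini(split_A), 2), round(weighted_gini(split_B), 2))  # 0.18 vs 0.48 -> choose A
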
Q20. Which Algorithm to use When ----- Decision Tree vs. Random Forest ?
A decision tree is used over a random forest when explainability is prioritised over accuracy. Random
forests are usually more accurate than decision trees, with less risk of over-fitting.
A Decision Tree is better when the dataset has a feature that is really important for taking the decision:
Random Forest selects a random subset of features to build each tree, so even if a feature is important,
the Random Forest may not use that feature in the final decision.
Random Forest does not take all the features; it selects them randomly and we cannot control that
randomness. But, as the name suggests, it is like a forest: you can build and control it by specifying the
number of trees you want in the model, but you cannot control which features become part of each tree.

Decision trees are much easier to understand and faster to train. The advantage of a simple decision
tree model is that it is easy to interpret: we know which variable, and which values of that variable, are used to
split the data and predict the outcome. It does not require much data preprocessing and it does not
require any assumptions about the distribution of the data. This algorithm is very useful for identifying hidden
patterns in the dataset. Disadvantage: when using a decision tree model, the accuracy keeps improving with
more and more splits. You can easily overfit the data and not know when you have crossed the line
unless you are using cross-validation (on the training data set).

Random Forest is suitable for situations when we have a large dataset and interpretability is not a major
concern. Accuracy keeps increasing as you increase the number of trees, but plateaus beyond a certain
point. Unlike a single decision tree, it avoids building one highly over-fit model and reduces the variance.
Disadvantage: since a random forest combines multiple decision trees, it becomes more difficult to interpret.

When to use a decision tree:

 When you want your model to be simple and explainable.
 When you want a non-parametric model.
 When you don't want to worry about feature selection, regularization or multi-collinearity.
 When you can afford to over-fit the tree because you are sure the validation or test data set will be
a subset of the training data set, or will almost overlap with it, rather than being unexpected.
When to use a random forest:
 When you don't care much about interpreting the model but want better accuracy.
 When the model suffers from high variance: bagging and sampling techniques will reduce over-fitting.

A Random Forest has a higher training time than a single decision tree because, as we increase the
number of trees in a random forest, the time taken to train each of them also increases. That can
be crucial when we are working with a tight deadline on a machine learning project.

A random forest reduces the variance part of the error rather than the bias part, so on a given training data
set a decision tree may be more accurate than a random forest. But on unseen validation data, a random
forest generally wins in terms of accuracy, as the sketch below illustrates.
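
A minimal sketch, assuming scikit-learn and synthetic data, contrasting a single unconstrained decision tree with a random forest on held-out data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# The unconstrained tree typically fits the training data perfectly (over-fits),
# while the forest usually generalises better to the test set.
print("tree   train/test:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("forest train/test:", forest.score(X_train, y_train), forest.score(X_test, y_test))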

Q21. What is difference between Statistics and Arithmetic (or Mathematics)?


Mathematics deals with numbers, patterns and their relationships whereas statistics is concerned with
systematic representation and analysis of data. Mathematics creates an idealized model of reality where
everything is clear and deterministic; statistics accepts that all knowledge is uncertain and tries to make
sense of the data in spite of all the randomness.

Mathematics always follows a consistent definition, theorem & proof structure.


In statistics, it’s common to define things with intuition and examples, so “you know it when you see
it”; things are rarely so black-and-white like in mathematics. For another example, consider p-values.
Usually, when you get a p-value under 0.05, it can be considered statistically significant. But this value is
merely a guideline, not a law –– it’s not like 0.048 is definitely significant and 0.051 is not.

Statistics is the branch of mathematics that deals with probability, graphical representation of
mathematical data and interpretation of uncertain observations. It is mainly concerned with the collection,
analysis, explanation, and presentation of data. It also helps in forecasting and predicting results based
upon insufficient data. It can improve the quality of any data and make interpretations from it easy.

Mathematics deals in numbers and basic operations such as counting, addition, subtraction,
multiplication, division, algebra, geometry, calculus, and finally statistics. Mathematics is an academic
discipline that allows us to understand the concepts of quantity and structures. There are many who
feel that math is all about seeking patterns, whether found in numbers, science, space, computers,
designs, architecture, and so on.

In mathematics, measurement refers to understanding units and precision in problems with most
concrete measures such as length, area, and volume. But, in statistics, measurement can be a bit more
abstract. For example, when considering how you might measure intelligence or a city’s pace of life,
there is not a straightforward method. Instead, researchers and statisticians have to decide how to best
measure what is being studied and often do so in different ways.

Variability and the uncertainty of conclusions is another major difference between statistics and
mathematics. In mathematics, results are usually reached by means of deduction, logical proof or
mathematical induction, and typically there is one correct answer. Statistics, however, utilizes inductive
reasoning and conclusions are always uncertain. This is largely due to the interpretation of the context
and methods surrounding the data collection and analysis.

Q22. What is Cost Function and Gradient descent in Linear Regression?


Cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship
between X and y. This is typically expressed as a difference or distance between the predicted value
and the actual value. The cost function (also referred to as loss or error) can be estimated by iteratively
running the model to compare estimated predictions against the known values of y. The objective of a
ML model, therefore, is to find parameters, weights or a structure that minimises the cost function.
The MSE is the cost function in Linear Regression. It is simply the mean of the squared differences
between predicted y and actual y (i.e. the residuals).
Depending on the problem, the Cost Function can be formed in many different ways. The purpose of a Cost
Function is to be either:
Minimized - the returned value is then usually called cost, loss or error. The goal is to find the values of the
model parameters for which the Cost Function returns as small a number as possible.
Maximized - the value it yields is then called a reward. The goal is to find the values of the model parameters
for which the returned number is as large as possible.
When the predictions and the expected results overlap, the value of any reasonable Cost Function is
equal to zero.
Gradient descent enables a model to learn the gradient or direction that the model should take in order
to reduce errors (differences between actual y and predicted y). Direction in the simple linear
regression example refers to how the model parameters b0 and b1 (y=b0+b1x) should be tweaked or
corrected to further reduce the cost function. As the model iterates, it gradually converges towards a
minimum where further tweaks to the parameters produce little or zero changes in the loss — also
referred to as convergence. It is an efficient optimization algorithm that attempts to find local or global
minima of a function.

Using the MSE cost function in conjunction with gradient descent to find the best-fitting parameters is how a linear regression model is trained (see the sketch below).
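
A minimal sketch, assuming NumPy and made-up data, of gradient descent minimising the MSE cost function for a simple linear regression y = b0 + b1*x:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 4.0 + 3.0 * x + rng.normal(scale=1.0, size=100)    # true b0=4, b1=3, plus noise

b0, b1, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    y_pred = b0 + b1 * x
    error = y_pred - y
    cost = np.mean(error ** 2)          # MSE cost function
    # Gradients of the MSE with respect to b0 and b1; step opposite to the gradient.
    b0 -= lr * 2 * np.mean(error)
    b1 -= lr * 2 * np.mean(error * x)

print(round(b0, 2), round(b1, 2), round(cost, 3))      # parameters end up close to 4 and 3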

Q23.What is the difference between R and R2 (r-square)?


R square (Coefficient of Determination) is literally the square of the correlation between x and y. It shows
the percentage of variation in y which is explained by all the x variables together. The higher the value, the
better. It is always between 0 and 1; it can never be negative, since it is a squared value. R square, when
used in the context of a regression model, tells you about the amount of variability in y that is explained by
the model.

R (Coefficient of Correlation) tells the strength of linear association between x and y. It is the degree of
relationship between two variables say x and y. It can go between -1 and 1. 1 indicates that the two
variables are moving in unison: they rise and fall together and have perfect correlation. -1 means that
the two variables are perfect opposites: one goes up as the other goes down, in a perfectly negative way.
Any two variables in this universe can be argued to have a correlation value. If they are not correlated
then the correlation value can still be computed which would be 0.

Correlation can be rightfully explained for simple linear regression – because you only have one x and
one y variable. For multiple linear regression, if R is computed then it will be difficult to explain because
we have multiple variables involved here. That is why R square is a better term.
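
A minimal sketch, assuming NumPy and made-up data, showing that for simple linear regression R squared is literally the square of the correlation R:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]             # coefficient of correlation, between -1 and 1
print(round(r, 3), round(r ** 2, 3))    # R and R squared
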
Q24. What is the difference between MAE and MSE?
There are many more regression metrics that can be used as a Cost Function for measuring the
performance of regression models, but MAE and MSE are relatively simple and very popular.
MAE doesn't add any additional weight to the distance between points: the error grows linearly.
MSE errors grow quadratically with the distance: it is a metric that adds a massive penalty to
points which are far away and a minimal penalty to points which are close to the expected result. The
error curve has a parabolic shape.
MAE is a mean of absolute differences among predictions and expected results where all individual
deviations have even importance.
MSE is the average squared difference between the predictions and expected results. In other words,
it is an alteration of MAE where instead of taking the absolute value of differences, they are squared.
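
A minimal sketch, assuming NumPy, of how a single far-away point dominates MSE but only adds linearly to MAE:

import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 50.0])
y_pred = np.array([11.0, 11.0, 12.0, 20.0])   # the last prediction is far off

errors = y_pred - y_true
mae = np.mean(np.abs(errors))     # (1 + 1 + 1 + 30) / 4 = 8.25
mse = np.mean(errors ** 2)        # (1 + 1 + 1 + 900) / 4 = 225.75
print(mae, mse)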

Q25. What is the role of the Logit Function and the Sigmoid Function in Logistic Regression?
These are useful functions when we are working with probabilities or trying to classify data.
Given a probability p, the corresponding odds are calculated as p / (1 – p). For example if p=0.75, the
odds are 3 to 1: 0.75/0.25 = 3.
The logit function, also called the link function, is simply the logarithm of the odds: logit(x) = log(x / (1 –
x)). A plot of the logit function is shown in Fig. 1; the value of the logit function heads towards infinity
as p approaches 1 and towards negative infinity as p approaches 0.

[Fig. 1: plot of the logit function. Fig. 2: plot of the sigmoid function.]

The logit function is useful in analytics because it maps probabilities (in the range [0,1]) to the full
range of real numbers. In particular, if you are working with “yes-no” (binary) inputs it can be useful to
transform them into real-valued quantities prior to modeling.
The sigmoid function is the inverse of the logit function: if we have a probability p, then sigmoid(logit(p)) = p.
It maps arbitrary real values back to the range [0, 1]; the larger the value, the closer to 1 you'll
get. The formula for the sigmoid function is σ(x) = 1/(1 + exp(-x)). The term “sigmoid function” is used to
refer to a class of functions with S-shaped curves. The plot of the sigmoid function is shown in Fig. 2.
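
A minimal sketch in plain Python/NumPy of the two functions and of the fact that the sigmoid undoes the logit:

import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))      # maps (0, 1) to the whole real line

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # maps the real line back to (0, 1)

p = 0.75
print(logit(p))              # log(3) ≈ 1.0986, since the odds are 3 to 1
print(sigmoid(logit(p)))     # 0.75: the sigmoid undoes the logit
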
Q26. Why use Odds Ratios instead of probabilities in Logistic Regression?
The probability that an event will occur is the fraction of times you expect to see that event in many
trials. Probabilities always range between 0 and 1. The odds are defined as the probability that the event
will occur divided by the probability that the event will not occur. Probability and odds both measure
how likely it is that something will occur. But they have different properties that give odds some
advantages in statistics.
In logistic regression, the odds ratio represents the constant effect of a predictor X, on the likelihood
that one outcome will occur. The key phrase here is constant effect. In regression models, we often
want a measure of the unique effect of each X on Y.
If we try to express the effect of X on the likelihood of a categorical Y having a specific value through
probability, the effect is not constant. It means, in terms of probability there is no way to express in
single number how X affects Y. The effect of X on the probability of Y has different values depending on
the value of X. A good approach is to use odds ratios together with probabilities: the odds ratio is a single
summary score of the effect, while the probabilities are more intuitive. The odds ratio is constant across
values of X, but the probabilities aren’t.

It works exactly the same way as interest rates. For an annual interest rate of 8%, at the end of the year
you’ll earn $8 if you invested $100 or $40 if you invested $500. The rate stays constant, but the actual
amount earned differs based on the amount invested. Odds ratios work the same. An odds ratio of 1.08
will give you an 8% increase in the odds at any value of X.

So if you do decide to report the increase in probability at different values of X, you’ll have to do it at
low, medium, and high values of X. You can’t use a single number on the probability scale to convey
the relationship between the predictor and the probability of a response.

Example: Imagine you are putting your hand inside a black bag. Inside that bag are five red balls, three
blue balls and two yellow balls, then:
Probability: in our black bag there are three blue balls, but there are ten balls in total, so the probability
that you pull out a blue ball is three divided by ten, which is 30% or 0.3.
Odds: in our black bag there are three blue balls, but there are seven balls which are not blue, so the
odds of drawing a blue ball are 3:7. Odds are often expressed as “odds for” and “odds against”:
odds for, which in this case would be three divided by seven, which is about 43% or 0.43, or
odds against, which would be seven divided by three, which is about 233% or 2.33.

If the probability of something happening is P, then the odds of it happening are P / (1 - P).
If the odds of something happening are O, then the probability of it happening is O / (1 + O).
 Probability has a limited range from zero to one. Odds have an infinite range.
 The probability of something happening is always less than the odds of it happening (assuming
the probability is non-zero).
 The smaller the probability, the more similar probability and odds will be. For example, the
probability of winning the UK National Lottery is 0.0000000221938762. The odds are
0.0000000221938767.
 The larger the probability, the larger the difference with the odds. High probabilities have
astronomical odds. A probability of 90% equates to odds of 900%, 99% equates to 9,900% and
99.999% equates to 9,999,900%.
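
A minimal sketch in plain Python reproducing the black-bag example (3 blue balls out of 10) with the two conversion formulas above:

def odds_from_probability(p):
    return p / (1.0 - p)

def probability_from_odds(o):
    return o / (1.0 + o)

p_blue = 3 / 10
odds_for = odds_from_probability(p_blue)       # 3/7 ≈ 0.43
odds_against = 1.0 / odds_for                  # 7/3 ≈ 2.33
print(round(odds_for, 2), round(odds_against, 2), probability_from_odds(odds_for))  # back to 0.3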
