ML Interview Questions PDF
Steve Nouri
Q10 How Can You Choose a Classifier Based on a Training Set Data Size?
When the training set is small, a model that has high bias and low variance tends to work
better because it is less likely to overfit.
Q12 How much data should you allocate for your training, validation, and test
sets?
You have to find a balance, and there's no right answer for every problem.
If your test set is too small, you'll have an unreliable estimate of model performance
(the performance statistic will have high variance). If your training set is too small, your
fitted model parameters will have a high variance.
A good rule of thumb is to use an 80/20 train/test split. Then, your train set can be further split
into train/validation or into partitions for cross-validation.
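The rule of thumb above can be sketched in a few lines of plain Python. This is only an illustration: the function name `split_dataset`, the 25% validation share taken from the training portion, and the fixed seed are assumptions, not part of the original answer.

```python
import random

def split_dataset(rows, test_fraction=0.2, val_fraction=0.25, seed=0):
    """Shuffle rows, hold out a test set, then carve a validation set out of the rest."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # shuffle a copy so the input is not modified
    n_test = int(round(len(rows) * test_fraction))
    test, remainder = rows[:n_test], rows[n_test:]
    n_val = int(round(len(remainder) * val_fraction))
    val, train = remainder[:n_val], remainder[n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

With 100 rows this gives an 80/20 train/test split, and the 80 training rows are then divided 60/20 into train and validation.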
Q13 What Is a False Positive and False Negative and How Are They Significant?
False positives are those cases which wrongly get classified as True but are False.
False negatives are those cases which wrongly get classified as False but are True.
In the term ‘False Positive,’ the word ‘Positive’ refers to the ‘Yes’ row of the predicted value in
the confusion matrix. The complete term indicates that the system has predicted it as a positive,
but the actual value is negative.
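To make the two definitions concrete, here is a small sketch that counts the four confusion-matrix cells for binary labels. The label lists are made up for illustration, and the convention 1 = positive, 0 = negative is an assumption.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 = positive and 0 = negative."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1  # false positive: predicted positive, actually negative
        elif p == 0 and a == 1:
            fn += 1  # false negative: predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 1]
print(confusion_counts(actual, predicted))  # (3, 2, 1, 2)
```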
Q16 What is deep learning, and how does it contrast with other machine learning
algorithms?
Deep learning is a subset of machine learning that is concerned with neural networks: how to
use backpropagation and certain principles from neuroscience to more accurately model large
sets of unlabelled or semi-structured data. In that sense, deep learning represents an
Q27 How Will You Know Which Machine Learning Algorithm to Choose for Your
Classification Problem?
While there is no fixed rule to choose an algorithm for a classification problem, you can follow
these guidelines:
● If accuracy is a concern, test different algorithms and cross-validate them
● If the training dataset is small, use models that have low variance and high bias
● If the training dataset is large, use models that have high variance and little bias
Q29 What evaluation approaches would you use to gauge the effectiveness of a
machine learning model?
You would first split the dataset into training and test sets, or use cross-validation
techniques to partition the data repeatedly into training and test folds. You should then
choose a set of performance metrics. You could use measures such as the F1 score, the
accuracy, and the confusion matrix. What’s important here is to demonstrate that you understand
the nuances of how a model is measured and how to choose the right performance measures for the
right situations.
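One way to show why the choice of metric matters is to compare accuracy and F1 on an imbalanced dataset. The sketch below uses made-up labels (95 negatives, 5 positives) and a hypothetical classifier that always predicts the negative class.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def f1_score(actual, predicted):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as 0 to avoid dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced data: 95 negatives, 5 positives.
actual = [0] * 95 + [1] * 5
always_negative = [0] * 100
print(accuracy(actual, always_negative))  # 0.95
print(f1_score(actual, always_negative))  # 0.0
```

The accuracy looks strong even though the classifier never finds a positive case, which the F1 score exposes.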
Q30 How would you implement a recommendation system for our company’s
users?
A lot of machine learning interview questions of this type will involve the implementation of
machine learning models to a company’s problems. You’ll have to research the company and its
industry in-depth, especially the revenue drivers the company has, and the types of users the
company takes on in the context of the industry it’s in.
Q32 What is the ROC Curve and what is AUC (a.k.a. AUROC)?
The ROC (receiver operating characteristic) curve is the performance plot for binary classifiers
of True Positive Rate (y-axis) vs. False Positive Rate (x-axis).
AUC is the area under the ROC curve, and it's a common performance metric for evaluating
binary classification models.
It's equivalent to the expected probability that a uniformly drawn random positive is ranked
before a uniformly drawn random negative.
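The ranking interpretation above can be checked directly by comparing every positive with every negative. This is a minimal sketch with made-up scores; counting ties as half a win is a common convention assumed here, and the quadratic loop is meant for illustration rather than large datasets.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_by_ranking(labels, scores))  # 8 of the 9 pairs are ranked correctly
```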
Q41 What Are the Three Stages of Building a Model in Machine Learning?
The three stages of building a machine learning model are:
● Model Building: Choose a suitable algorithm for the model and train it according to the
requirement.
● Model Testing: Check the accuracy of the model through the test data.
● Applying the Model: Make the required changes after testing and use the final model for
real-time projects. Here, it’s important to remember that once in a while, the model
needs to be checked to make sure it’s working correctly. It should be modified to make
sure that it is up-to-date.
Q43 Mention the difference between Data Mining and Machine learning?
Machine learning relates to the study, design, and development of algorithms that give
computers the capability to learn without being explicitly programmed. Data mining, by contrast,
can be defined as the process of extracting knowledge or previously unknown interesting patterns
from unstructured data. Machine learning algorithms are used during this process.
Q45 You are given a data set. The data set has missing values that spread along 1
standard deviation from the median. What percentage of data would remain
unaffected? Why?
This question has enough hints for you to start thinking! Since the data is spread around the
median, let’s assume it’s a normal distribution. In a normal distribution, ~68% of the data lies
within 1 standard deviation of the mean (which coincides with the median and mode), so the missing
values fall inside that ~68% band. Therefore, ~32% of the data would remain unaffected by missing
values.
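The ~68% figure can be verified with the standard library. The sketch below relies on the normality assumption made in the answer; `fraction_within` is a hypothetical helper name.

```python
import math

def fraction_within(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

inside = fraction_within(1)
print(round(inside, 4), round(1 - inside, 4))  # 0.6827 0.3173
```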
Q51 You’ve built a random forest model with 10000 trees. You got delighted after
getting training error as 0.00. But, the validation error is 34.23. What is going on?
Haven’t you trained your model perfectly?
The model has overfitted. A training error of 0.00 means the classifier has memorized patterns in
the training data so closely that they do not carry over to unseen data. Hence, when this
classifier was run on an unseen sample, it couldn’t find those patterns and returned predictions
with higher error. In a random forest, this usually comes from trees that are grown too deep
rather than from the tree count alone; using more trees than necessary mainly adds computation.
Hence, to avoid this situation, we should tune the number of trees, together with settings such
as the maximum tree depth, using cross-validation.
Q58 We know that one hot encoding increases the dimensionality of a dataset,
but label encoding doesn’t. How?
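When we use one-hot encoding, there is an increase in the dimensionality of a dataset. The
reason for the increase in dimensionality is that, for every class in the categorical variable, it
forms a different variable. Label encoding, by contrast, replaces each class with an integer in
the same column, so the number of variables does not change.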
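A small sketch makes the difference in dimensionality visible. The helper names and the `colors` column are made up for illustration; label encoding here means mapping each category to one integer in the same column.

```python
def label_encode(values):
    """Map each category to one integer; the result is still a single column."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]

def one_hot_encode(values):
    """Map each category to its own 0/1 column; one column per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "blue", "green"]
print(label_encode(colors))    # [2, 1, 0, 1]
print(one_hot_encode(colors))  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

One column with three categories stays one column under label encoding but becomes three columns under one-hot encoding.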
Q62 When would you use random forests vs. SVM and why?
There are a couple of reasons why a random forest can be a better choice of model than a
support vector machine:
● Random forests allow you to determine the feature importance directly. SVMs, especially
with non-linear kernels, do not.
● Random forests are much quicker and simpler to build than an SVM.
● For multi-class classification problems, SVMs require a one-vs-rest method, which is
less scalable and more memory intensive.
Q67 What is the exploding gradient problem while using the backpropagation
technique?
When large error gradients accumulate and result in large changes in the neural network
weights during training, it is called the exploding gradient problem. The values of weights can
become so large as to overflow and result in NaN values. This makes the model unstable and can
cause learning to stall, just as the vanishing gradient problem does.
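The accumulation can be illustrated with a deliberately simplified model: a chain of scalar linear layers that all share the same weight, so the chain rule multiplies the gradient by that weight once per layer. The setup and the function name are assumptions for illustration, not a real network.

```python
import math

def backprop_gradient(weight, depth):
    """Gradient reaching the first layer of a chain of `depth` scalar linear layers."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight  # chain rule: each layer multiplies the gradient by its weight
    return grad

print(backprop_gradient(1.5, 50))  # very large: exploding gradient
print(backprop_gradient(0.5, 50))  # very close to zero: vanishing gradient

overflowed = backprop_gradient(1.5, 2000)
print(math.isinf(overflowed))               # True: the float has overflowed to inf
print(math.isnan(overflowed - overflowed))  # True: arithmetic on inf can produce NaN
```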
Q72 When does the linear regression line stop rotating or find an optimal spot where it
is fitted on data?
The line comes to rest at the place where the highest R-squared value is found.
R-squared represents the proportion of the total variance in the data that is captured by the
fitted linear regression line.
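As a rough check of this idea, the sketch below fits a line by ordinary least squares (which minimizes the residual sum of squares and therefore maximizes R-squared) and compares it with a slightly tilted line. The data points and the 0.3 slope offset are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
best = r_squared(xs, ys, slope, intercept)
tilted = r_squared(xs, ys, slope + 0.3, intercept)  # same intercept, rotated slope
print(best > tilted)  # True
```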
Q75 Name and define techniques used to find similarities in the recommendation system.
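Pearson correlation and cosine similarity are techniques used to find similarities in
recommendation systems. Pearson correlation measures the linear relationship between two
mean-centered rating vectors, while cosine similarity measures the cosine of the angle between
the two rating vectors.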
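Both measures fit in a few lines of plain Python. In this sketch the two rating vectors are hypothetical, and Pearson correlation is computed as the cosine similarity of the mean-centered vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Cosine similarity of the mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

# Hypothetical ratings from two users for the same four items.
alice = [5, 4, 1, 2]
bob = [4, 5, 2, 1]
print(round(cosine_similarity(alice, bob), 3))
print(round(pearson_correlation(alice, bob), 3))
```

Mean-centering removes each user's overall rating level, which is why the two scores differ for the same pair of users.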
Q80 What is the difference between inductive machine learning and deductive machine
learning?
The difference between inductive machine learning and deductive machine learning is as
follows: in inductive learning, the model learns by examples from a set of observed
instances to draw a generalized conclusion, whereas in deductive learning the model starts
from general rules or premises and applies them to derive the conclusion.
Q84 Steps Needed to Choose the Appropriate Machine Learning Algorithm for
your Classification problem.
Firstly, you need to have a clear picture of your data, your constraints, and your problems
before heading towards different machine learning algorithms. Secondly, you have to
understand which type and kind of data you have because it plays a primary role in deciding
which algorithm you have to use.
Following this step is the data categorization step, which is a two-step process: categorization
by input and categorization by output. The next step is to understand your constraints; that is,
what is your data storage capacity? How fast does the prediction have to be? And so on.
Finally, find the available machine learning algorithms and implement them wisely. Along with
that, also try to optimize the hyperparameters which can be done in three ways – grid search,
random search, and Bayesian optimization.
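Two of the three tuning strategies named above, grid search and random search, can be sketched with the standard library alone. Everything specific here is an assumption for illustration: `validation_error` is a hypothetical stand-in for a real cross-validated error, and the hyperparameter names and ranges are made up. Bayesian optimization is not shown because it needs a surrogate model.

```python
import itertools
import random

def validation_error(learning_rate, depth):
    """Hypothetical stand-in for a cross-validated error; lowest at learning_rate=0.1, depth=4."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 4) ** 2

def grid_search(grid):
    """Evaluate every combination in the grid and return the best (error, params)."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        err = validation_error(**params)
        if best is None or err < best[0]:
            best = (err, params)
    return best

def random_search(n_trials, seed=0):
    """Sample random configurations from assumed ranges and return the best (error, params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {"learning_rate": rng.uniform(0.001, 1.0), "depth": rng.randint(1, 10)}
        err = validation_error(**params)
        if best is None or err < best[0]:
            best = (err, params)
    return best

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
print(grid_search(grid))
print(random_search(n_trials=50))
```

Grid search only ever tries the listed values, while random search can land between them at the cost of less systematic coverage.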
Q87 What’s the Relationship between True Positive Rate and Recall?
The true positive rate in machine learning is the proportion of actual positives that have been
correctly identified, and recall is the proportion of relevant results that have been correctly
identified. Therefore, they are the same thing, just under different names. It is
also known as sensitivity.
Q91 Which are the two components of the Bayesian logic program?
A Bayesian logic program consists of two components:
● Logical: It contains a set of Bayesian Clauses, which capture the qualitative structure of
the domain.
● Quantitative: It is used to encode quantitative information about the domain.