Machine Learning
Machine learning has gained phenomenal recognition for its ability to perform tasks, sometimes even better than humans. As a result, the field is being widely studied and pursued by companies, academics and researchers.
Nowadays, companies are building products that nobody imagined a few years back. Their aim is to build ML products that improve customer experience (both online and offline). This has led to a rapid upsurge in demand for machine learning specialists.
Though machine learning is still evolving, as a job candidate you are expected to know at least the widely used industry algorithms (supervised and unsupervised).
Many a time, driven by the pleasure of coding, students fail to understand the theory behind ML algorithms and focus more on how to code them. This risky imbalance between theoretical and practical knowledge could be the primary reason for one's rejection.
No doubt, the field of analytics and machine learning looks lucrative and charming right now, but to survive here you must have detailed knowledge of ML algorithms. Are you ready?
Solved Questions
Q1. You are given a training data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)
1. Since we have limited RAM, we should close all other applications on our machine, including the web browser, so that most of the memory can be put to use.
2. We can randomly sample the data set. This means we can create a smaller data set, say with 1000 variables and 300,000 rows, and do the computations on it.
3. To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated ones. For numerical variables, we'll use correlation; for categorical variables, the chi-square test.
4. Also, we can use PCA and pick the components which explain the maximum variance in the data set (see the sketch after this list).
5. Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible option.
6. Building a linear model using stochastic gradient descent is also helpful.
7. We can also apply our business understanding to estimate which predictors can impact the response variable. But this is an intuitive approach; failing to identify useful predictors might result in a significant loss of information.
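Steps 2 to 4 can be strung together in a few lines. A minimal sketch on synthetic data (the sample size, the 0.9 correlation cutoff and the 0.95 variance target are illustrative assumptions, not part of the question):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the large training set in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100_000, 20)),
                  columns=[f"x{i}" for i in range(20)])

# Step 2: random row sampling to respect memory constraints.
sample = df.sample(n=30_000, random_state=42)

# Step 3: drop one variable from each highly correlated numerical pair.
corr = sample.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
reduced = sample.drop(columns=to_drop)

# Step 4: keep just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(reduced)
print(X_reduced.shape)
```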
Q2. Is rotation necessary in PCA? If yes, why? What will happen if you don't rotate the components?
Answer: Yes, rotation (orthogonal) is necessary because it maximizes the difference between the variance captured by the components, which makes the components easier to interpret. Rotation changes only the actual coordinates of the points; the relative positions of the components remain the same. If we don't rotate the components, the effect of PCA will diminish and we'll have to select a larger number of components to explain the variance in the data set.
Q3. You are given a data set. The data set has missing values which spread
along 1 standard deviation from the median. What percentage of data would
remain unaffected? Why?
Answer: This question has enough hints for you to start thinking! Since the data is spread around the median, let's assume it's a normal distribution. We know that in a normal distribution ~68% of the data lies within 1 standard deviation of the mean (which coincides with the median and mode), so ~68% of the data is affected by the missing values. Therefore, ~32% of the data would remain unaffected.
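You can verify the ~68% figure directly from the normal CDF; a quick check, assuming scipy is available:

```python
from scipy.stats import norm

# Fraction of a normal distribution within 1 standard deviation of the mean.
within_1sd = norm.cdf(1) - norm.cdf(-1)
print(f"affected:   {within_1sd:.4f}")      # ~0.6827
print(f"unaffected: {1 - within_1sd:.4f}")  # ~0.3173
```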
Q4. You are given a data set on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?
Answer: If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance, because 96% (as given) might correspond to predicting only the majority class correctly, while our class of interest is the minority class (4%): the people who were actually diagnosed with cancer. Hence, to evaluate model performance, we should use sensitivity (true positive rate), specificity (true negative rate) and the F measure to determine the class-wise performance of the classifier (see the sketch after the list below). If the minority class performance is found to be poor, we can undertake the following steps:
1. We can use undersampling or oversampling (e.g. SMOTE) to balance the data.
2. We can alter the prediction threshold after examining the ROC curve, instead of using the default 0.5.
3. We can assign class weights so that the minority class gets a larger weight.
4. We can try framings better suited to rare events, such as anomaly detection.
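A minimal sketch of the class-wise evaluation with scikit-learn (the label arrays are made-up placeholders; 1 marks the minority cancer class):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Placeholder labels: 1 = cancer (minority class), 0 = healthy.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate / recall
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity, f1_score(y_true, y_pred))
```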
Answer: Prior probability is nothing but the proportion of the dependent (binary) variable in the data set. It is the closest guess you can make about a class without any further information. For example: in a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and of 0 (not spam) is 30%. Hence, we can estimate that there is a 70% chance that any new email would be classified as spam.
Q7. You are working on a time series data set. Your manager has asked you to build a high-accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than the decision tree model. Can this happen? Why?
Answer: Time series data is known to possess linearity. A decision tree algorithm, on the other hand, is known to work best at detecting non-linear interactions. The decision tree failed to provide robust predictions because it couldn't capture the linear relationship as well as a regression model did. Therefore, we learned that a linear regression model can provide robust predictions provided the data set satisfies its linearity assumptions.
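A quick illustration on synthetic data (the key effect is that a tree cannot extrapolate a linear trend beyond its training range, while a linear model can):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
t = np.arange(100).reshape(-1, 1)            # time index
y = 2.0 * t.ravel() + rng.normal(0, 5, 100)  # linear trend + noise

lin = LinearRegression().fit(t, y)
tree = DecisionTreeRegressor().fit(t, y)

future = np.array([[120]])   # beyond the training range
print(lin.predict(future))   # ~240, follows the trend
print(tree.predict(future))  # stuck near the last training value (~200)
```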
Q8. You are assigned a new project which involves helping a food delivery company save more money. The problem is, the company's delivery team isn't able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, they end up delivering food for free. Which machine learning algorithm can save them?
Answer: You might have started hopping through the list of ML algorithms in your mind. But wait! Such questions are asked to test your machine learning fundamentals. This is not a machine learning problem; it is a route optimization problem. A machine learning problem consists of three things:
1. There exists a pattern.
2. You cannot pin the pattern down mathematically (by writing explicit equations).
3. You have data on it.
Always look for these three factors to decide if machine learning is the right tool to solve a particular problem.
Q9. You are given a data set. The data set contains many variables, some of
which are highly correlated and you know about it. Your manager has asked
you to run PCA. Would you remove correlated variables first? Why?
Answer: Chances are, you might be tempted to say no, but that would be incorrect. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.
For example: you have 3 variables in a data set, of which 2 are perfectly correlated. If you run PCA on this data set, the first principal component would exhibit roughly twice the variance it would exhibit with uncorrelated variables. Also, adding correlated variables makes PCA put more importance on those variables, which is misleading.
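A small demonstration of the inflation effect (synthetic data; the duplicated column stands in for a perfectly correlated variable):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a, b = rng.normal(size=1000), rng.normal(size=1000)

X_uncorr = np.column_stack([a, b, rng.normal(size=1000)])
X_corr = np.column_stack([a, b, a])  # third column duplicates the first

print(PCA().fit(X_uncorr).explained_variance_ratio_)  # ~[0.34, 0.33, 0.33]
print(PCA().fit(X_corr).explained_variance_ratio_)    # ~[0.67, 0.33, 0.00]
```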
Answer: Don't get misled by the 'k' in their names. You should know that the fundamental difference between these two algorithms is that k-means is unsupervised in nature while kNN is supervised. k-means is a clustering algorithm; kNN is a classification (or regression) algorithm.
The k-means algorithm partitions a data set into clusters such that each cluster formed is homogeneous and the points within a cluster are close to each other. The algorithm tries to maintain enough separability between the clusters. Due to its unsupervised nature, the clusters have no labels.
The kNN algorithm tries to classify an unlabeled observation based on its k (which can be any number) nearest neighbors. It is also known as a lazy learner because it involves minimal training of the model: it doesn't learn a generalized model from the training data in advance, but defers the work to prediction time.
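A side-by-side sketch with scikit-learn (the toy data and k values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels exist only for the supervised case

# k-means: unsupervised, invents its own cluster ids from X alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)

# kNN: supervised, needs y at fit time, classifies new points by neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0, 0], [12, 3]]))
```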
Q11. How are True Positive Rate and Recall related? Write the equation.
Answer: True Positive Rate = Recall. Yes, they are equal, with the formula TP / (TP + FN).
Know more: Evaluation Metrics
Q12. You have built a multiple regression model. Your model R² isn't as good as you wanted. For improvement, you remove the intercept term, and your model R² jumps from 0.3 to 0.8. Is this possible? How?
Answer: Yes, it is possible. When the intercept term is present, R² evaluates your model relative to the mean model: R² = 1 − ∑(y − ŷ)² / ∑(y − ȳ)². In the absence of the intercept term, no such comparison with the mean is made; the formula becomes R² = 1 − ∑(y − ŷ)² / ∑y², and because ∑y² is a larger denominator than ∑(y − ȳ)², the ratio becomes smaller and R² comes out higher.
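A short numerical check of the two formulas (synthetic data; the no-intercept fit is least squares through the origin):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 5 + 0.3 * x + rng.normal(0, 1, 50)  # weak slope, large intercept

# Fit with intercept, then without (regression through the origin).
b1, b0 = np.polyfit(x, y, 1)
yhat_with = b0 + b1 * x
b = (x @ y) / (x @ x)
yhat_wo = b * x

r2_with = 1 - np.sum((y - yhat_with) ** 2) / np.sum((y - y.mean()) ** 2)
r2_wo = 1 - np.sum((y - yhat_wo) ** 2) / np.sum(y ** 2)  # denominator is sum(y^2)
print(r2_with, r2_wo)  # the no-intercept R² comes out much higher here
```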
Q13. After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check whether he's right? Without losing any information, can you still build a better model?
Answer: To check, we can build a correlation matrix of the predictors and look for pairs with high correlation (say above 0.75). We can also compute the VIF (variance inflation factor) for each predictor; as a rule of thumb, VIF ≤ 4 suggests no serious multicollinearity, whereas VIF ≥ 10 implies it is severe.
But removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. Also, we can add some random noise to the correlated variables so that they become different from each other. But adding noise might affect prediction accuracy, so this approach should be used carefully.
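A minimal VIF check, assuming statsmodels is available (the variables here are synthetic, with x2 deliberately near-collinear with x1):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 + rng.normal(0, 0.1, 200),  # nearly collinear with x1
                  "x3": rng.normal(size=200)})

Xc = add_constant(X)  # VIF should be computed with an intercept present
vifs = {col: variance_inflation_factor(Xc.values, i)
        for i, col in enumerate(Xc.columns) if col != "const"}
print(vifs)  # x1 and x2 show very large VIFs
```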
Answer: After reading this question, you should have understood that this is a classic case of "causation and correlation". No, we can't conclude that the decrease in the number of pirates caused the climate change, because there might be other (lurking or confounding) variables influencing this phenomenon.
Q15. While working on a data set, how do you select important variables?
Explain your methods.
Answer: Following are the methods of variable selection you can use:
1. Remove the correlated variables prior to selecting the important ones.
2. Use linear regression and select variables based on their p-values.
3. Use forward selection, backward selection or stepwise selection.
4. Use random forest or xgboost and plot the variable importance chart (see the sketch after this list).
5. Use lasso regression.
6. Measure information gain for the available set of features and select the top n features accordingly.
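A minimal sketch of method 4 with scikit-learn (synthetic data; the number of trees is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only a few of the 10 features are truly informative.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
for i in ranking:
    print(f"feature {i}: importance {rf.feature_importances_[i]:.3f}")
```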
Q18. Running a binary classification tree algorithm is the easy part. But do you know how the tree splitting takes place, i.e. how does the tree decide which variable to split on at the root node and the succeeding nodes?
Answer: A classification tree makes decisions based on measures such as the Gini index and node entropy. In simple words, the tree algorithm finds the best possible feature which can divide the data set into the purest possible child nodes.
The Gini index says: if we select two items from a population at random, they should be of the same class, and the probability of this is 1 if the population is pure. We can calculate Gini as follows:
1. Calculate Gini for each sub-node, using the formula: sum of the squares of the probabilities of success and failure (p² + q²).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.
Here p and q are the probabilities of success and failure, respectively, in that node. Entropy is zero when a node is homogeneous and maximal when both classes are present in a node at 50%-50%. Lower entropy is desirable. A small worked example follows.
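A minimal worked example of the two measures above (binary classes; the node counts and probabilities are made up):

```python
import math

def gini(p):
    """Gini score p^2 + q^2 for a node; 1.0 means a pure node."""
    q = 1 - p
    return p ** 2 + q ** 2

def entropy(p):
    """Node entropy; 0 for a pure node, 1 bit at a 50-50 mix."""
    if p in (0, 1):
        return 0.0
    q = 1 - p
    return -(p * math.log2(p) + q * math.log2(q))

# A candidate split sends 30 samples left (90% success) and 70 right (40%).
left_n, left_p = 30, 0.9
right_n, right_p = 70, 0.4
n = left_n + right_n

split_gini = (left_n / n) * gini(left_p) + (right_n / n) * gini(right_p)
print(split_gini)                        # weighted Gini for the split
print(entropy(left_p), entropy(right_p))  # per-node entropies
```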
Q20. You are given a data set consisting of variables with more than 30% missing values. Let's say, out of 50 variables, 8 variables have more than 30% of their values missing. How will you deal with them?
1. Assign a unique category to the missing values; who knows, the missing values might decipher some trend.
2. We can remove them outright.
3. Or we can sensibly check their distribution against the target variable and, if we find any pattern, keep those missing values and assign them a new category, while removing the others.
Answer: The basic idea for this kind of recommendation engine comes from collaborative filtering.
A collaborative filtering algorithm considers "user behavior" for recommending items. It exploits the behavior of other users and items in terms of transaction history, ratings, selection and purchase information. Other users' behavior and preferences over the items are used to recommend items to new users. In this case, the features of the items are not known.
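A minimal sketch of the idea using item-item cosine similarity (the tiny ratings matrix is made up; rows are users, columns are items, 0 means unrated):

```python
import numpy as np

# Rows: users, columns: items; 0 means "not rated yet".
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# Item-item cosine similarity from the columns of the ratings matrix.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

# Predict user 0's score for item 2 as a similarity-weighted average
# over the items that user 0 has actually rated.
user, item = 0, 2
rated = R[user] > 0
pred = (sim[item, rated] @ R[user, rated]) / sim[item, rated].sum()
print(pred)
```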
Answer: A Type I error is committed when the null hypothesis is true and we reject it; it is also known as a 'false positive'. A Type II error is committed when the null hypothesis is false and we fail to reject it; it is also known as a 'false negative'.
In the context of a confusion matrix, we can say a Type I error occurs when we classify a value as positive (1) when it is actually negative (0). A Type II error occurs when we classify a value as negative (0) when it is actually positive (1).
Q24. You have been asked to evaluate a regression model based on R²,
adjusted R² and tolerance. What will be your criteria?
Answer: R² on its own is not a sufficient criterion, because adding any variable, useful or not, never decreases it. Adjusted R² penalizes additional variables, so between candidate models we prefer the one with the higher adjusted R². Tolerance (1/VIF) indicates multicollinearity among the predictors: a low tolerance is a warning that a variable is largely explained by the others.
Answer: It's simple. It's just like how babies learn to walk. Every time they fall down, they learn (unconsciously) and realize that their legs should be straight and not bent. The next time they fall down, they feel pain. They cry. But they learn 'not to stand like that again'. In order to avoid that pain, they try harder. To succeed, they even seek support from the door or wall or anything near them, which helps them stand firm. This is how a machine works and develops intuition from its environment.
Note: The interviewer is only trying to test whether you have the ability to explain complex concepts in simple terms.
Answer: The error emerging from any model can be broken down into three components mathematically. Following are these components:
1. Bias error
2. Variance error
3. Irreducible error
Bias error quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model which keeps missing important trends. Variance, on the other hand, quantifies how much the predictions made for the same observation differ from one another. A high-variance model will over-fit your training population and perform badly on any observation beyond it.
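Written out for squared-error loss (standard notation, assumed rather than taken from the text):

```latex
% Expected squared error at a point x, for an estimator \hat{f} of f:
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```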