
Machine Learning & Data Science Interview Q & A’s

www.imageconindia.com/academy

1. Explain the terms Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).

Artificial Intelligence (AI) is the domain of producing intelligent machines. ML refers to systems that can learn from experience (training data), and Deep Learning (DL) refers to systems that learn from experience on very large data sets. ML can be considered a subset of AI, and DL is ML applied to large data sets.

In summary, DL is a subset of ML, and both are subsets of AI.

2. What are the different types of Learning/ Training models in ML?

ML algorithms can be primarily classified depending on the presence/absence of target variables.

A. Supervised learning: [Target is present]


The machine learns using labelled data. The model is trained on an existing data set before it starts
making decisions with the new data.
If the target variable is continuous: Linear Regression, Polynomial Regression, Quadratic Regression.
If the target variable is categorical: Logistic Regression, Naive Bayes, KNN, SVM, Decision Tree, Gradient Boosting, AdaBoost, Bagging, Random Forest, etc.

B. Unsupervised learning: [Target is absent]


The machine is trained on unlabelled data, without any proper guidance. It automatically infers patterns and relationships in the data by creating clusters. The model learns through observations and deduces structures in the data.
Examples: Principal Component Analysis, Factor Analysis, Singular Value Decomposition, etc.

C. Reinforcement Learning:
The model learns through a trial and error method. This kind of learning involves an agent that will
interact with the environment to create actions and then discover errors or rewards of that action.

3. What is the key difference between supervised and unsupervised machine learning?
Supervised learning needs labelled data to train the model. For example, to solve a classification problem (a supervised learning task), you need labelled data to train the model and to classify the data into your labelled groups. Unsupervised learning does not need any labelled dataset. This is the key difference between supervised and unsupervised learning.

4. There are many machine learning algorithms available. Given a data set, how can one determine which algorithm to use?
The choice of machine learning algorithm depends purely on the type of data in the given dataset. If the data is linear, we use linear regression. If the data shows non-linearity, a bagging algorithm would do better. If the data has to be analyzed/interpreted for business purposes, we can use decision trees or SVM. If the dataset consists of images, videos, or audio, neural networks help obtain an accurate solution.
So, there is no single metric for deciding which algorithm to use for a given situation or data set. We need to explore the data using EDA (Exploratory Data Analysis) and understand the purpose of the dataset to come up with the best-fit algorithm. That is why it is important to study all the algorithms in detail.

5. How are covariance and correlation different from one another?


Covariance measures how two variables are related to each other and how one would vary with
respect to changes in the other variable. If the value is positive it means there is a direct relationship
between the variables and one would increase or decrease with an increase or decrease in the base
variable respectively, given that all other conditions remain constant.
Correlation quantifies the relationship between two random variables and takes values between -1 and 1.
A value of 1 denotes a perfect positive relationship, -1 denotes a perfect negative relationship, and 0 denotes that the two variables have no linear relationship with each other.

6. We see machine learning applied in software almost all the time. How do we apply Machine Learning to hardware?
We have to implement the ML algorithms in SystemVerilog, a hardware description language, and then program them onto an FPGA to apply Machine Learning to hardware.
7. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of the
given dataset?
One-hot encoding is the representation of categorical variables as binary vectors. Label encoding converts labels/words into numeric form. Using one-hot encoding increases the dimensionality of the data set because it creates a new binary variable for each level of the categorical variable. Label encoding doesn’t affect the dimensionality of the data set, because the levels of a variable are simply encoded as integers (0, 1, 2, ...) in a single column.
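A minimal sketch of both encodings, assuming pandas and scikit-learn are available (the column name "colour" is purely illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one new binary column per level, so dimensionality grows
one_hot = pd.get_dummies(df, columns=["colour"])
print(one_hot.shape)  # (4, 3) - three binary columns replace the single column

# Label encoding: levels mapped to integers 0, 1, 2, ... so dimensionality is unchanged
df["colour_encoded"] = LabelEncoder().fit_transform(df["colour"])
print(df[["colour", "colour_encoded"]])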
8. What is overfitting?
Overfitting is a type of modelling error which results in the failure to predict future observations
effectively or fit additional data in the existing model. It occurs when a function is too closely fit to a
limited set of data points and usually ends with more parameters than the data can accommodate. It
is common for huge data sets to have some anomalies, so when this data is used for any kind of
modelling, it can result in inaccuracies in the analysis.
9. When does regularization come into play in Machine Learning?
When the model begins to overfit, regularization becomes necessary. It is a technique that shrinks, or regularizes, the coefficient estimates towards zero. It reduces the flexibility of the model and discourages it from fitting noise, thereby avoiding the risk of overfitting. The model complexity is reduced and it becomes better at generalizing to new data.
10. How can we relate standard deviation and variance?
Standard deviation measures the spread of your data around the mean. Variance is the average squared deviation of each data point from the mean. The two are directly related: standard deviation is the square root of variance.
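A quick NumPy illustration (the sample values are arbitrary):

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
variance = np.var(data)  # average squared deviation from the mean
std_dev = np.std(data)   # spread of the data around the mean
print(variance, std_dev, np.isclose(std_dev, np.sqrt(variance)))  # 4.0 2.0 True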
11. A data set is given to you and it has missing values which spread along 1 standard deviation from the mean. How much of the data would remain untouched?
It is given that the missing values are spread within 1 standard deviation of the mean, so we can presume that the data follows a normal distribution. In a normal distribution, about 68% of the data lies within 1 standard deviation of the mean. That means about 32% of the data remains uninfluenced by the missing values.

12. Is a high variance in data good or bad?
Higher variance directly means that the data spread is big and the feature has a wide variety of values. Usually, high variance in a feature is not seen as a sign of good quality.
13. If your dataset is suffering from high variance, how would you handle it?
For datasets with high variance, we could use a bagging algorithm. Bagging splits the data into subgroups by sampling with replacement from the original data. Each random sample is then used to build a model with the training algorithm, and a voting (or averaging) technique is used to combine the predicted outcomes of all the models.
14. Explain the handling of missing or corrupted values in the given dataset.
An easy way to handle missing values or corrupted values is to drop the corresponding rows or
columns. If there are too many rows or columns to drop then we consider replacing the missing or
corrupted values with some new value.
Identifying missing values and dropping the rows or columns can be done using the isnull() and dropna() functions in Pandas. Also, the fillna() function in Pandas replaces the missing values with a placeholder value.
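A minimal Pandas sketch (the DataFrame below is purely illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 32], "salary": [50000, 60000, np.nan]})

print(df.isnull().sum())       # count missing values per column
dropped = df.dropna()          # drop rows that contain missing values
filled = df.fillna(df.mean())  # or replace them with a placeholder such as the column mean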
15. Can you mention some advantages and disadvantages of decision trees?
The advantages of decision trees are that they are easier to interpret, are nonparametric and hence
robust to outliers, and have relatively few parameters to tune.
On the other hand, the disadvantage is that they are prone to overfitting.
16. What is a confusion matrix and why do you need it?
Confusion matrix (also called the error matrix) is a table that is frequently used to illustrate the
performance of a classification model i.e. classifier on a set of test data for which the true values are
well-known.
It allows us to visualize the performance of an algorithm/model. It allows us to easily identify the
confusion between different classes. It is used as a performance measure of a model/algorithm.
A confusion matrix is a summary of the predictions of a classification model. The numbers of right and wrong predictions are summarized with count values and broken down by each class label. It gives us information about the errors made by the classifier and also the types of errors made.
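A short scikit-learn example (the labels are illustrative):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[3 1]   rows = actual classes (0, 1)
#  [1 3]]  columns = predicted classes (0, 1)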
17. Explain the phrase “Curse of Dimensionality”.
The Curse of Dimensionality refers to the situation when your data has too many features.
The phrase is used to express the difficulty of using brute force or grid search to optimize a function
with too many inputs.

It can also refer to several other issues like:
If we have more features than observations, we have a risk of overfitting the model.
When we have too many features, observations become harder to cluster. Too many dimensions
cause every observation in the dataset to appear equidistant from all others and no meaningful
clusters can be formed.
Dimensionality reduction techniques like PCA come to the rescue in such cases.
18. What is Principal Component Analysis?
The idea is to reduce the dimensionality of the data set by reducing the number of variables that are correlated with each other, while retaining as much of the variation as possible. The variables are transformed into a new set of variables known as Principal Components. These PCs are the eigenvectors of the covariance matrix and are therefore orthogonal.
19. Why is rotation of components so important in Principal Component Analysis (PCA)?
Rotation in PCA is very important as it maximizes the separation within the variance obtained by all
the components because of which interpretation of components would become easier. If the
components are not rotated, then we need extended components to describe variance of the
components.
20. What are outliers? Mention three methods to deal with outliers.
A data point that is considerably distant from the other similar data points is known as an outlier.
They may occur due to experimental errors or variability in measurement. They are problematic and
can mislead a training process, which eventually results in longer training time, inaccurate models,
and poor results.
The three methods to deal with outliers are:
Univariate method – looks for data points having extreme values on a single variable
Multivariate method – looks for unusual combinations on all the variables
Minkowski error – reduces the contribution of potential outliers in the training process

21. Explain the difference between Normalization and Standardization.


Normalization and Standardization are the two most popular methods used for feature scaling. Normalization refers to re-scaling the values to fit into a range of [0, 1]. Standardization refers to re-scaling the data to have a mean of 0 and a standard deviation of 1 (unit variance). Normalization is useful when all features need to be on the same positive scale, but information about outliers in the data set is lost. Hence, standardization is recommended for most applications.
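A minimal scikit-learn sketch of the two scalers (the toy feature values are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [50.0]])

normalized = MinMaxScaler().fit_transform(X)      # values rescaled into [0, 1]
standardized = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1
print(normalized.ravel())
print(standardized.mean(), standardized.std())    # approximately 0.0 and 1.0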

22. List the most popular distribution curves along with scenarios where you will use them in an
algorithm.
The most popular distribution curves are as follows- Bernoulli Distribution, Uniform Distribution,
Binomial Distribution, Normal Distribution, Poisson Distribution, and Exponential Distribution.
Each of these distribution curves is used in various scenarios.
Bernoulli Distribution can be used to check if a team will win a championship or not, a newborn child
is either male or female, you either pass an exam or not, etc.
Uniform distribution is a probability distribution in which every outcome has a constant probability. Rolling a single die is one example, because it has a fixed number of equally likely outcomes.
Binomial distribution is a probability distribution over the number of successes in a fixed number of trials, each with only two possible outcomes; the prefix ‘bi’ means two or twice. An example of this would be a series of coin tosses, where each toss is either heads or tails.
Normal distribution describes how the values of a variable are distributed. It is typically a symmetric
distribution where most of the observations cluster around the central peak. The values further away
from the mean taper off equally in both directions. An example would be the height of students in a
classroom.
Poisson distribution helps predict the probability of certain events happening when you know how
often that event has occurred. It can be used by businessmen to make forecasts about the number of
customers on certain days and allows them to adjust supply according to the demand.
Exponential distribution is concerned with the amount of time until a specific event occurs. For
example, how long a car battery would last, in months.
23. Differentiate between regression and classification.
Regression and classification are categorized under the same umbrella of supervised machine
learning. The main difference between them is that the output variable in the regression is numerical
(or continuous) while that for classification is categorical (or discrete).
Example: predicting the exact temperature of a place is a regression problem, whereas predicting whether the day will be sunny, cloudy, or rainy is a case of classification.
24. Why is logistic regression a type of classification technique and not a regression? Name the function it is derived from.
Since the target column is categorical, logistic regression takes a linear combination of the inputs (the log-odds) and wraps it in the logistic (sigmoid) function to produce a probability, which is then thresholded to assign a class. Hence, it is a type of classification technique and not a regression. It is derived from the logistic (sigmoid) function.
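A minimal sketch of that idea (the coefficient values are arbitrary, for illustration only):

import numpy as np

def sigmoid(z):
    # maps the linear combination of inputs (the log-odds) to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

b0, b1, x = -1.0, 2.0, 0.8                 # z = b0 + b1*x is the linear (regression) part
probability = sigmoid(b0 + b1 * x)
predicted_class = int(probability >= 0.5)  # thresholding turns the regression output into a class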

25. What does the term Variance Inflation Factor mean?
Variance Inflation Factor (VIF) is the ratio of the variance of the full model to the variance of a model with only one independent variable. VIF gives an estimate of the amount of multicollinearity in a set of regression variables.
VIF = Variance of the full model / Variance of the model with one independent variable
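A hedged sketch using statsmodels, which provides a variance_inflation_factor helper (the DataFrame and column names are illustrative):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                  "x2": [2, 4, 6, 8, 11],   # strongly correlated with x1
                  "x3": [5, 3, 6, 2, 7]})
X = add_constant(X)

# one VIF per predictor; values much larger than about 5-10 suggest multicollinearity
vifs = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
print(dict(zip(X.columns[1:], vifs)))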
26. Which machine learning algorithm is known as the lazy learner and why is it called so?
KNN is a Machine Learning algorithm known as a lazy learner. KNN is a lazy learner because it doesn’t learn any model parameters from the training data; instead, it memorises the training dataset and dynamically calculates distances every time it needs to classify a new point.
27. Differentiate between K-Means and KNN algorithms.
KNN is Supervised Learning whereas K-Means is Unsupervised Learning. With KNN, we predict the
label of the unidentified element based on its nearest neighbour and further extend this approach for
solving classification/regression-based problems.
K-Means is Unsupervised Learning, where we don’t have any Labels present, in other words, no
Target Variables and thus we try to cluster the data based upon their coordinates and try to establish
the nature of the cluster based on the elements filtered for that cluster.
28. What are Kernels in SVM? List popular kernels used in SVM along with a scenario of their
applications.
The function of kernel is to take data as input and transform it into the required form. A few popular
Kernels used in SVM are as follows: RBF, Linear, Sigmoid, Polynomial, Hyperbolic, Laplace, etc.
29. What is the Kernel Trick in an SVM algorithm?
Kernel Trick is a mathematical function which when applied on data points, can find the region of
classification between two different classes. Based on the choice of function, be it linear or radial,
which purely depends upon the distribution of data, one can build a classifier.
30. What are ensemble models? Explain how ensemble techniques yield better learning as
compared to traditional classification ML algorithms?
An ensemble is a group of models that are used together for prediction in both classification and regression tasks. Ensemble learning helps improve ML results because it combines several models. By
doing so, it allows a better predictive performance compared to a single model.
They are superior to individual models as they reduce variance, average out biases, and have lesser
chances of overfitting.

31. What are overfitting and underfitting? Why does the decision tree algorithm often suffer from overfitting?
Overfitting occurs when a statistical model or machine learning algorithm captures the noise in the data. Underfitting occurs when a model or algorithm does not fit the data well enough; it shows low variance but high bias.
In decision trees, overfitting occurs when the tree is designed to perfectly fit all samples in the
training data set. This results in branches with strict rules or sparse data and affects the accuracy
when predicting samples that aren’t part of the training set.
32. How do you handle outliers in the data?
Outlier is an observation in the data set that is far away from other observations in the data set. We
can discover outliers using tools and functions like box plot, scatter plot, Z-Score, IQR score etc. and
then handle them based on the visualization we have got. To handle outliers, we can cap at some
threshold, use transformations to reduce skewness of the data and remove outliers if they are
anomalies or errors.
33. List popular cross-validation techniques.
There are mainly six types of cross-validation techniques, listed below (a short code sketch follows the list):

• K fold
• Stratified k fold
• Leave one out
• Bootstrapping
• Random search cv
• Grid search cv
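A brief scikit-learn sketch of k-fold and stratified k-fold cross-validation (the model and dataset are placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
strat_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print(kfold_scores.mean(), strat_scores.mean())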
34. How can we use a dataset without a target variable in supervised learning algorithms?
Input the data set into a clustering algorithm, generate optimal clusters, label the cluster numbers as
the new target variable. Now, the dataset has independent and target variables present. This ensures
that the dataset is ready to be used in supervised learning algorithms.
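A minimal sketch of that idea with KMeans (the number of clusters and the data are placeholders):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # stands in for unlabelled data

# cluster labels become the new target variable
pseudo_labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

# the dataset now has independent and target variables, so a supervised learner can be trained
clf = DecisionTreeClassifier().fit(X, pseudo_labels)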
35. Explain the term instance-based learning.
Instance Based Learning is a set of procedures for regression and classification which produce a class
label prediction based on resemblance to its nearest neighbors in the training data set. These
algorithms simply store all the training data and compute an answer only when required or queried. In simple words, they are a set of procedures for solving new problems based on the solutions of already solved past problems that are similar to the current problem.
36. Explain how a Naive Bayes Classifier works.
Naive Bayes classifiers are a family of algorithms derived from Bayes’ theorem of probability. They work on the fundamental assumption that every pair of features being classified is independent of each other and that every feature makes an equal and independent contribution to the outcome.
37. What’s the difference between probability and likelihood?
Probability is the measure of how likely an event is to occur, i.e., what is the certainty that a specific event will occur? Whereas a likelihood function is a function of the parameters within the parameter space that describes the probability of obtaining the observed data.
So the fundamental difference is, Probability attaches to possible results; likelihood attaches to
hypotheses.
38. Why would you prune your tree?
In the context of data science or AIML, pruning refers to the process of reducing redundant branches
of a decision tree. Decision Trees are prone to overfitting; pruning the tree helps to reduce its size and minimizes the chances of overfitting. Pruning involves turning branches of a decision tree into leaf nodes and removing the leaf nodes from the original branch. It serves as a tool to manage the bias-variance tradeoff.
39. Model accuracy or Model performance? Which one will you prefer and why?
This is a trick question; one should first get a clear idea of what Model Performance means. If performance means speed, then it depends upon the nature of the application: any application related to a real-time scenario will need high speed as an important feature. Example: the best search results lose their virtue if the query results do not appear fast.
If performance hints at why accuracy is not the most important virtue: for any imbalanced data set, an F1 score will explain the business case better than accuracy, and when the data is imbalanced, precision and recall will be more important than the rest.
40. Differentiate between Statistical Modeling and Machine Learning?
Machine learning models are about making accurate predictions about situations, like footfall in restaurants, stock price, etc., whereas statistical models are designed for inference about the relationships between variables, such as what drives the sales in a restaurant, the food or the ambience.
41. What is the significance of Gamma and Regularization in SVM?
Gamma defines how far the influence of a single training example reaches: low values mean ‘far’ and high values mean ‘close’. If gamma is too large, the radius of the area of influence of the support vectors only includes the support vector itself, and no amount of regularization with C will be able to prevent overfitting. If gamma is very small, the model is too constrained and cannot capture the complexity of the data.
The regularization parameter (C, sometimes written as lambda) serves as the degree of importance given to misclassifications and can be used to control the tradeoff with overfitting.
42. What are hyperparameters and how are they different from parameters?

A parameter is a variable that is internal to the model and whose value is estimated from the training
data. They are often saved as part of the learned model. Examples include weights, biases etc.
A hyperparameter is a variable that is external to the model and whose value cannot be estimated from the data. Hyperparameters are often used to control how the model parameters are estimated, and the choice of hyperparameters is sensitive to the implementation. Examples include the learning rate, the number of hidden layers, etc.
43. What are the meshgrid() and contourf() methods? State some uses of both.
The meshgrid() function in NumPy takes two arguments as input: the range of x-values in the grid and the range of y-values in the grid. The meshgrid needs to be built before the contourf() function in Matplotlib is used, which takes in many inputs: x-values, y-values, the fitted curve (contour line) to be plotted on the grid, colours, etc.
meshgrid() is used to create a grid from 1-D arrays of x-axis inputs and y-axis inputs to represent the matrix indexing. contourf() is used to draw filled contours using the given x-axis inputs, y-axis inputs, contour line, colours, etc.
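A small NumPy/Matplotlib sketch (the function being contoured is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)       # 2-D grids of x- and y-coordinates

zz = np.sqrt(xx ** 2 + yy ** 2)  # any value defined on the grid, e.g. a fitted decision function

plt.contourf(xx, yy, zz, levels=20, cmap="viridis")  # filled contours over the grid
plt.colorbar()
plt.show()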
44. Describe a hash table.
Hashing is a technique for identifying unique objects from a group of similar objects. A hash function maps large keys to smaller keys, and the resulting values are stored in a data structure known as a hash table.
45. Explain Eigenvectors and Eigenvalues.
Eigenvectors are helpful for understanding linear transformations. They find their prime usage in the creation of covariance and correlation matrices in data science.
Simply put, eigenvectors are directional entities along which linear transformation features like compression, flip, etc. can be applied.
Eigenvalues are the magnitudes of the linear transformation along each eigenvector direction.
46. How would you define the number of clusters in a clustering algorithm?
The number of clusters can be determined by finding the silhouette score. Often we aim to get
some inferences from data using clustering techniques so that we can have a broader picture of a
number of classes being represented by the data. In this case, the silhouette score helps us determine
the number of cluster centres to cluster our data along.
Another technique that can be used is the elbow method.
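A short sketch of choosing the number of clusters with the silhouette score (the data is synthetic):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))  # pick the k with the highest score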
47. What are the performance metrics that can be used to estimate the efficiency of a linear
regression model?
The performance metrics that can be used in this case are:

• Mean Squared Error
• R2 score
• Adjusted R2 score
• Mean Absolute Error
48. What is the default method of splitting in decision trees?
The default method of splitting in decision trees is the Gini Index. Gini Index is the measure of
impurity of a particular node.
This can be changed by making changes to classifier parameters.
49. How is the p-value useful?
The p-value gives the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It gives us the statistical significance of our results: the smaller the p-value, the stronger the evidence against the null hypothesis.
50. Can logistic regression be used for more than 2 classes?
Plain logistic regression is a binary classifier, but it can be extended to more than 2 classes using one-vs-rest or multinomial (softmax) logistic regression. Alternatively, inherently multi-class algorithms like Decision Trees or Naive Bayes classifiers can be used.
51. What are the hyperparameters of a logistic regression model?
The classifier penalty, the solver, and C are the tunable hyperparameters of a Logistic Regression classifier. These can be specified with grids of values in Grid Search to tune a Logistic Regression classifier.
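A hedged scikit-learn sketch of tuning those hyperparameters with grid search (the parameter grid and dataset are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10], "solver": ["liblinear"]}
search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)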
52. Name a few hyperparameters of decision trees.
The most important hyperparameters that one can tune in decision trees are:

• Splitting criterion (e.g., Gini or entropy)
• Minimum samples per leaf (min_samples_leaf)
• Minimum samples required to split a node (min_samples_split)
• Maximum depth of the tree (max_depth)
53. What is the role of cross-validation?
Cross-validation is a technique used to assess how well a machine learning algorithm generalizes, by training and evaluating it several times on different samples drawn from the same data. The data is split into several equal-sized parts; in each round a random part is chosen as the test set, while all the other parts are used as the training set.
54. What is a voting model?
A voting model is an ensemble model that combines several classifiers. To produce the final result in a classification-based model, it takes into account the predictions of all the individual models for a given data point and picks the most voted option from all the given classes in the target column.

55. How do you deal with very few data samples? Is it possible to make a model out of them?
If only very few data samples are available, we can make use of oversampling to produce new data points, and then build the model on this augmented data.
56. What are the hyperparameters of an SVM?
The gamma value, c value and the type of kernel are the hyperparameters of an SVM model.
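For illustration, the same three hyperparameters as they appear in scikit-learn's SVC (the values are arbitrary):

from sklearn.svm import SVC

# C controls the penalty for misclassification, gamma the reach of a single
# training example, and kernel the shape of the decision surface
model = SVC(C=1.0, gamma=0.1, kernel="rbf")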
57. What is Pandas Profiling?
Pandas profiling is a step used to find the amount of effectively usable data. It gives us statistics about NULL values and usable values, and thus makes variable selection and data selection for building models in the preprocessing phase very effective.
58. What impact does correlation have on PCA?
PCA relies on correlation between variables: when variables are correlated, a few principal components can capture most of the variance, so PCA is most effective on correlated data. If the variables are largely uncorrelated, PCA provides little dimensionality reduction, because each component explains roughly the same amount of variance.
59. How is PCA different from LDA?
PCA is unsupervised, whereas LDA is supervised.
PCA takes into consideration the variance of the data, while LDA takes into account the distribution (and separation) of the classes.
60. What distance metrics can be used in KNN?
The following distance metrics can be used in KNN:

• Manhattan
• Minkowski
• Tanimoto
• Jaccard
• Mahalanobis
61. Which metrics can be used to measure correlation of categorical data?
The Chi-square test can be used for doing so. It gives a measure of the association between categorical predictors.
62. Which algorithm can be used for value imputation in both categorical and continuous categories of data?
KNN is an algorithm that can be used for the imputation of both categorical and continuous variables.
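A minimal sketch with scikit-learn's KNNImputer for the continuous case (categorical columns would first need to be encoded as numbers; the data is illustrative):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))  # each missing entry is replaced by the mean of its 2 nearest neighbours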
63. What ensemble technique is used by Random Forests?
Bagging is the technique used by Random Forests. Random Forests are a collection of trees that work on bootstrap samples drawn from the original dataset, with the final prediction being a vote (for classification) or an average (for regression) over all the trees.
64. What are the benefits of pruning?
Pruning helps in the following:

• Reduces overfitting
• Shortens the size of the tree
• Reduces complexity of the model
• Increases bias

65. What is the degree of freedom?


It is the number of independent values or quantities which can be assigned to a statistical
distribution. It is used in Hypothesis testing and chi-square test.
66. Which performance metric is better: R2 or adjusted R2?
Adjusted R2, because it accounts for the number of predictors. R2 never decreases when more predictors are added, even if they add no real value, whereas adjusted R2 penalizes predictors that do not improve the model.
67. What’s the difference between Type I and Type II errors?
Type I and Type II errors refer to the two ways a hypothesis test can go wrong. A Type I error is equivalent to a false positive, while a Type II error is equivalent to a false negative. In a Type I error, the null hypothesis is rejected even though it is true (a hypothesis which ought to be accepted doesn’t get accepted). In a Type II error, the null hypothesis is accepted even though it is false (a hypothesis which ought to be rejected doesn’t get rejected).
68. What do you understand by L1 and L2 regularization?
L2 regularization: It tries to spread the error among all the terms, shrinking the weights but rarely driving them exactly to zero. L2 corresponds to a Gaussian prior on the terms.
L1 regularization: It is more binary/sparse, with many weights being driven to exactly zero. L1 corresponds to setting a Laplace prior on the terms.
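A brief scikit-learn sketch contrasting the two penalties (the alpha values are illustrative):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

l2_model = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients, rarely exactly to zero
l1_model = Lasso(alpha=1.0).fit(X, y)  # L1: drives many coefficients exactly to zero (sparse)

print((l2_model.coef_ == 0).sum(), (l1_model.coef_ == 0).sum())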
69. Which one is better, Naive Bayes Algorithm or Decision Trees?
Although it depends on the problem you are solving, some general advantages are the following:
Naive Bayes:

• Works well with small datasets, compared to DT which needs more data
• Lesser overfitting
• Smaller in size and faster in processing
Decision Trees:

• Decision Trees are very flexible, easy to understand, and easy to debug
• No preprocessing or transformation of features required
• Prone to overfitting but you can use pruning or Random forests to avoid that.

70. What do you mean by the ROC curve?
Receiver operating characteristic (ROC) curve: the ROC curve illustrates the diagnostic ability of a binary classifier. It is created by plotting the true positive rate against the false positive rate at various threshold settings. The performance metric of the ROC curve is the AUC (area under the curve): the higher the area under the curve, the better the prediction power of the model.
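A short scikit-learn sketch of computing the ROC curve and its AUC (the dataset and model are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)  # points of the ROC curve
print(roc_auc_score(y_test, probs))              # area under that curve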
71. What do you mean by AUC?
AUC stands for area under the (ROC) curve. The higher the area under the curve, the better the prediction power of the model.
72. What is log-likelihood in logistic regression?
It is the sum of the likelihood contributions of the individual records. At record level, the natural log of the predicted probability of the observed outcome is calculated for each record, and those values are totalled. That total (the log-likelihood, ll) is then used as the basis for the deviance (-2 x ll) and the likelihood (exp(ll)).
The same calculation can be applied to a naive model that assumes absolutely no predictive power,
and a saturated model assuming perfect predictions.
The likelihood values are used to compare different models, while the deviances (test, naive, and saturated) can be used to determine the predictive power and accuracy. The accuracy of a logistic regression model will typically look better on the development data set than it does once the model is applied to another data set.
73. How would you evaluate a logistic regression model?
Model evaluation is a very important part of any analysis, answering questions such as: How well does the model fit the data? Which predictors are most important? Are the predictions accurate?
So the following are the criteria to assess the model performance:
1. Akaike Information Criteria (AIC): In simple terms, AIC estimates the relative amount of
information lost by a given model. So the less information lost the higher the quality of the model.
Therefore, we always prefer models with minimum AIC.
2. Receiver operating characteristics (ROC curve): ROC curve illustrates the diagnostic ability of a
binary classifier. It is created by plotting the true positive rate against the false positive rate at various
threshold settings. The performance metric of ROC curve is AUC (area under curve). Higher the area
under the curve, better the prediction power of the model.
3. Confusion Matrix: In order to find out how well the model does in predicting the target variable,
we use a confusion matrix / classification rate. It is nothing but a tabular representation of actual vs.
predicted values which helps us to find the accuracy of the model.

74. What are the advantages of SVM algorithms?
SVM algorithms have advantages mainly in terms of complexity. First, I would like to clarify that both Logistic Regression and SVM can form non-linear decision surfaces and can be coupled with the kernel trick. If Logistic Regression can be coupled with a kernel, then why use SVM?
● SVM is found to have better performance practically in most cases.
● SVM is computationally cheaper, O(N^2 * K) where K is the number of support vectors (support vectors are the points that lie on the class margin), whereas Logistic Regression is O(N^3).
● The classifier in SVM depends only on a subset of the points. Since we need to maximize the distance between the closest points of the two classes (the margin), we only need to care about a subset of the points, unlike Logistic Regression.
75. What are the stages of building a model in machine learning?
To build a model in machine learning, you need to follow a few steps (a compressed code sketch follows the list):

• Understand the business model


• Data acquisition
• Data cleaning
• Exploratory data analysis
• Use machine learning algorithms to make a model
• Use an unseen dataset to check the accuracy of the model
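A compressed sketch of those stages with scikit-learn (the dataset and model choice are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# data acquisition (a built-in dataset stands in for real business data)
X, y = load_breast_cancer(return_X_y=True)

# hold out an unseen set to check the accuracy at the end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# use a machine learning algorithm to build the model
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# check the accuracy on the unseen data
print(accuracy_score(y_test, model.predict(X_test)))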
76. What is the difference between Entropy and Information Gain?
The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches). The first step is therefore to calculate the entropy of the target variable.
77. What is the process of carrying out a linear regression?
Linear Regression Analysis consists of more than just fitting a linear line through a cloud of data
points. It consists of 3 stages–
(1) analyzing the correlation and directionality of the data,
(2) estimating the model, i.e., fitting the line,
(3) evaluating the validity and usefulness of the model.
78. How can you avoid overfitting your model?
Overfitting refers to a model that is tuned only to a very small amount of data and ignores the bigger picture. There are three main methods to avoid overfitting:

• Keep the model simple—take fewer variables into account, thereby removing some of the
noise in the training data
• Use cross-validation techniques, such as k-fold cross-validation
• Use regularization techniques, such as LASSO, that penalize certain model parameters if
they're likely to cause overfitting
79. Differentiate between univariate, bivariate, and multivariate analysis.
Univariate
Univariate data contains only one variable. The purpose of the univariate analysis is to describe the
data and find patterns that exist within it.
Example: height of students
Bivariate
Bivariate data involves two different variables. The analysis of this type of data deals with causes and
relationships and the analysis is done to determine the relationship between the two variables.
Example: temperature and ice cream sales in the summer season
Multivariate
When data involves three or more variables, it is categorized as multivariate. It is similar to bivariate analysis but contains more than one dependent variable.
Example: data for house price prediction
80. How should you maintain a deployed model?
The steps to maintain a deployed model are:
Monitor
Constant monitoring of all models is needed to determine their performance accuracy. When you
change something, you want to figure out how your changes are going to affect things. This needs to
be monitored to ensure it's doing what it's supposed to do.
Evaluate
Evaluation metrics of the current model are calculated to determine if a new algorithm is needed.
Compare
The new models are compared to each other to determine which model performs the best.
Rebuild
The best performing model is re-built on the current state of data.
