Complete ML Notes
Supervised Learning:
In supervised learning, the algorithm is trained on a labeled dataset, where each input example is
paired with its corresponding output label. The goal is for the model to learn the mapping between
input features and output labels so that it can make accurate predictions on new, unseen data.
Common Tasks:
Classification: Predicting categories or classes for new data points. For example, classifying emails as spam or not spam.
Regression: Predicting continuous numerical values. For instance, predicting house prices based on features such as size and location.
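A minimal sketch of the two task types with scikit-learn, using built-in synthetic data (the dataset sizes and model choices here are illustrative, not from the notes):

```python
# Classification predicts a discrete label; regression predicts a continuous value.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: predict a class (e.g. spam / not spam)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: predict a number (e.g. a house price)
Xr, yr = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, test_size=0.2, random_state=42)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))
```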
Unsupervised Learning:
Unsupervised learning involves training algorithms on datasets without labeled responses. Instead of
learning from explicit feedback, the algorithm tries to identify patterns, relationships, or structures within the data on its own.
Common Tasks:
Clustering: Grouping similar data points together based on some measure of similarity. For example, grouping customers into segments based on purchasing behavior.
Dimensionality Reduction: Reducing the number of input variables or features while preserving the
essential information. This is useful for visualizing high-dimensional data or speeding up subsequent
computations.
Reinforcement Learning:
Reinforcement learning is about training agents to interact with an environment and learn from the
consequences of their actions. The agent receives feedback in the form of rewards or penalties and learns a policy that maximizes the cumulative reward over time.
Common Tasks:
Markov Decision Processes (MDPs): Reinforcement learning problems are often formulated as
MDPs, where the agent takes actions in states and receives rewards accordingly.
Q2. What is Overfitting, and How Can You Avoid It?
Overfitting is a situation that occurs when a model learns the training set too well, picking up
random fluctuations in the training data as if they were real concepts. This harms the model's ability to generalize to new, unseen data.
When a model is given the training data, it may show close to 100 percent accuracy (technically, only a very small loss).
But when we use the test data, there may be a large error and low efficiency. This condition is known as
overfitting.
How to avoid overfitting:
Regularization: Add a cost term (penalty) on the model coefficients to the objective function.
Making a simpler model: With fewer variables and parameters, the variance of the model can be reduced.
Cross-validation: Methods like k-fold cross-validation can be used to check how well the model generalizes.
If some model parameters are likely to cause overfitting, regularization techniques like LASSO can be used to penalize these parameters.
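A minimal sketch of spotting and reducing overfitting, assuming synthetic data and an illustrative polynomial-features pipeline; the train/test gap shrinks once an L2 (Ridge) penalty is added:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)   # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Very flexible model with no penalty: likely to chase the noise
overfit = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(),
                        LinearRegression()).fit(X_tr, y_tr)
print("no penalty train/test R^2:", overfit.score(X_tr, y_tr), overfit.score(X_te, y_te))

# Same features with a Ridge penalty on the coefficients: smaller train/test gap
regularized = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(),
                            Ridge(alpha=1.0)).fit(X_tr, y_tr)
print("ridge      train/test R^2:", regularized.score(X_tr, y_tr), regularized.score(X_te, y_te))
```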
Q3. What is ‘training Set’ and ‘test Set’ in a Machine Learning Model? How Much Data Will You Allocate to Each?
Training Set:
The training set is a subset of the dataset used to train the machine learning model. It consists of
input data paired with the corresponding correct output labels (in supervised learning). The model
learns from this data by adjusting its parameters or weights through optimization algorithms (such as
gradient descent) to minimize the error between its predictions and the actual labels.
Validation Set:
The validation set is used to tune hyperparameters and evaluate the performance of the model
during training. It helps in preventing overfitting by providing an unbiased evaluation of the model's
performance on data that it hasn't seen during training. The validation set is also used for early
stopping, where training is halted when the model's performance on the validation set stops improving.
Test Set:
The test set is used to evaluate the final performance of the trained model after it has been trained
and validated. It provides an unbiased estimate of the model's performance on unseen data. The test
set should ideally reflect the same distribution as the training and validation sets to ensure that the evaluation is representative. How much data to allocate to each split depends on
the size of the dataset, the complexity of the problem, and the available computational resources.
Training Set: Typically, the majority of the data is allocated to the training set, often around 60% to
80% of the total dataset. A larger training set allows the model to learn more effectively and generalize better.
Validation Set: The validation set is usually smaller than the training set, typically around 10% to 20%
of the total dataset. This subset is used for tuning hyperparameters and monitoring the model's performance during training.
Test Set: The test set is also smaller compared to the training set and is generally around 10% to
20% of the total dataset. It is kept completely separate from the training and validation sets until the final evaluation.
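A minimal sketch of an 80/10/10 train/validation/test split with scikit-learn (the ratios and the iris dataset are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off 20% of the data, then split that part half-and-half
# into validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

print(len(X_train), len(X_val), len(X_test))   # roughly 120 / 15 / 15 rows
```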
Q4. How Do You Handle Missing or Corrupted Data in a Dataset?
Deletion:
If the missing values are very few and removing them won't significantly affect the dataset's integrity, you can simply drop the affected rows or columns.
Mean/Median/Mode Imputation:
Replace missing values with the mean (average), median (middle value), or mode (most frequent
value) of the respective feature. This method is straightforward and preserves the overall distribution
of the data.
Forward Fill/Backward Fill:
Fill missing values with the value from the previous or next non-missing observation along the same
feature. This is useful for time-series data where missing values are often consecutive.
Interpolation:
Use interpolation techniques such as linear interpolation or spline interpolation to estimate missing
values based on neighboring data points. Interpolation is particularly useful for ordered or time-series
data.
Imputation Models:
Train machine learning models (e.g., k-Nearest Neighbors, Random Forests, etc.) to predict missing
values based on other features in the dataset. The model learns patterns from the complete data to estimate the missing values.
Create an additional binary feature indicating whether a value is missing or not (flagging). Then, use
encoding techniques such as mean imputation or model-based imputation for the missing values in
that feature.
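A minimal sketch of these imputation strategies on a small made-up DataFrame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 35, np.nan],
                   "income": [50_000, 60_000, np.nan, 80_000, 75_000]})

# Mean imputation (median/mode work the same way)
df_mean = df.fillna(df.mean(numeric_only=True))

# Forward fill / backward fill (useful for ordered or time-series data)
df_fill = df.ffill().bfill()

# Linear interpolation between neighbouring observations
df_interp = df.interpolate(method="linear", limit_direction="both")

# Model-based imputation: k-Nearest Neighbours on the other features
df_knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# Flagging: keep an indicator of where values were missing, then impute
df_flag = df.assign(age_missing=df["age"].isna().astype(int))
df_flag["age"] = SimpleImputer(strategy="median").fit_transform(df_flag[["age"]]).ravel()
print(df_knn)
```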
Q5. What Are the Three Stages of Building a Model in Machine Learning?
Model Building: Choose a suitable algorithm for the model and train it according to the requirement.
Model Testing: Check the accuracy of the model using the test data.
Applying the Model: Make the required changes after testing and use the final model for real-time
projects
Q6. What Are the Differences Between Machine Learning and Deep Learning?
Machine Learning:
Works well on the low-end system, so you don't need large machines
Most features need to be identified in advance and manually coded
The problem is divided into two parts and solved individually and then combined
Deep Learning :
Enables machines to take decisions with the help of artificial neural networks
Q7. What Are the Applications of Supervised Machine Learning in Modern Businesses?
Email Spam Detection
Here we train the model using historical data that consists of emails categorized as spam or not spam.
Healthcare Diagnosis
By providing images regarding a disease, a model can be trained to detect whether a person is suffering from that disease.
Sentiment Analysis
This refers to the process of using algorithms to mine documents and determine whether they’re positive, negative, or neutral in sentiment.
Fraud Detection
By training the model to identify suspicious patterns, we can detect instances of possible fraud.
Q8. What is Semi-supervised Machine Learning?
Supervised learning uses data that is completely labeled, whereas unsupervised learning uses no
training data.
In the case of semi-supervised learning, the training data contains a small amount of labeled data and a large amount of unlabeled data.
Q9. What is the Difference Between Inductive Machine Learning and Deductive Machine Learning?
Inductive learning: the system observes a set of examples and draws a general conclusion or rule from them.
Deductive learning: the system starts from existing expert knowledge or rules and applies them to draw conclusions about specific cases.
Q10. Compare K-means and KNN Algorithms.
(A detailed comparison appears later in these notes, in the KNN vs. K-means section.)
Q11. Why is the Naive Bayes Classifier called ‘naive’?
The classifier is called ‘naive’ because it makes assumptions that may or may not turn out to be
correct.
The algorithm assumes that the presence of one feature of a class is not related to the presence of
any other feature (absolute independence of features), given the class variable.
For instance, a fruit may be considered to be a cherry if it is red in color and round in shape,
regardless of other features. This assumption may or may not be right (as an apple also matches the
description).
Q12. How Will You Know Which Machine Learning Algorithm to Choose for Your Classification
Problem?
While there is no fixed rule to choose an algorithm for a classification problem, you can follow these
guidelines:
If accuracy is a concern, test different algorithms and cross-validate them
If the training dataset is small, use models that have low variance and high bias
If the training dataset is large, use models that have high variance and little bias
Q13. What is the Difference Between Classification and Regression?
Classification is used when your target is categorical, while regression is used when your target
variable is continuous. Both classification and regression belong to the category of supervised learning.
Examples of classification problems:
Predicting yes or no
Estimating gender
Breed of an animal
Type of color
Q14. Considering a Long List of Machine Learning Algorithms, given a Data Set, How Do You Decide Which One to Use?
There is no master algorithm for all situations. Choosing an algorithm depends on the following
questions:
Bias
Bias in a machine learning model occurs when the predicted values are further from the actual
values. Low bias indicates a model where the prediction values are very close to the actual ones.
Underfitting: High bias can cause an algorithm to miss the relevant relations between features and target outputs.
Variance refers to the amount the target model will change when trained with different training data.
Overfitting: High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs.
High Bias-Low Variance: Models with high bias and low variance tend to be simple and less flexible.
They may not capture all the nuances in the data but are less affected by fluctuations in the training data.
Low Bias-High Variance: Models with low bias and high variance tend to be more complex and
flexible. They can capture intricate patterns in the data but are prone to overfitting, meaning they
may perform well on the training data but generalize poorly to unseen data. Examples include highly flexible models such as neural networks.
The bias-variance trade-off in machine learning is about finding the right balance between simplicity
and flexibility in a model. Too simple, and the model may miss important patterns (high bias); too
complex, and it may overfit to noise in the data (high variance). The goal is to strike a balance that
generalizes well to new data while capturing the underlying trends effectively.
Q17 . Achieving the balance between bias and variance involves several techniques:
Model Complexity: Start with a simple model and gradually increase complexity as needed. This
helps avoid overfitting initially.
Cross-Validation: Evaluate the model on different subsets of the data. This helps in understanding how well the model generalizes to unseen data.
Feature Selection: Select only the most relevant features to reduce the complexity of the model and
prevent overfitting.
Ensemble Methods: Combine predictions from multiple models to reduce variance. Techniques like bagging and boosting are common examples.
Hyperparameter Tuning: Tune model hyperparameters using techniques like grid search or random
search to find the optimal settings that balance bias and variance.
Bias-Variance Decomposition: Understand the sources of error in the model by decomposing the
overall error into bias and variance components. This helps in diagnosing model performance issues.
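A minimal sketch of the trade-off, assuming synthetic data: tree depth is used as the complexity knob, and the gap between training accuracy and cross-validated accuracy hints at where bias ends and variance begins:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

for depth in (1, 3, 6, None):      # shallow = high bias, unlimited depth = high variance
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_acc = cross_val_score(tree, X, y, cv=5).mean()   # generalization estimate
    train_acc = tree.fit(X, y).score(X, y)              # fit to the training data itself
    print(f"max_depth={depth}: train={train_acc:.2f}  cv={cv_acc:.2f}")
```

A large gap between the training and cross-validated scores points to variance (overfitting); low scores on both point to bias (underfitting).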
Q18. Explain the Difference Between Precision and Recall.
Precision
Precision is the ratio of the number of events you correctly recall (true positives) to the total number of events you recall (true positives plus false positives): Precision = TP / (TP + FP).
Recall
Recall is the ratio of the number of events you correctly recall (true positives) to the total number of actual events (true positives plus false negatives): Recall = TP / (TP + FN).
Q19. What do you understand by Type I vs Type II error?
Type I Error: Type I error occurs when the null hypothesis is true and we reject it.
Type II Error: Type II error occurs when the null hypothesis is false and we accept it.
Q21. What is Cross-Validation?
Cross-validation is a technique used to evaluate
the performance of a machine learning model by partitioning the dataset into subsets, training the
model on a portion of the data, and then evaluating it on the remaining unseen data. This process is
repeated multiple times, with different partitions of the data, and the performance metrics are averaged to give a more reliable estimate. The typical steps of k-fold cross-validation are:
Splitting the Data: The dataset is divided into k subsets of approximately equal size. One of these
subsets is held out as the validation set, while the remaining k-1 subsets are used for training.
Training and Validation: The model is trained on the k-1 subsets and evaluated on the validation set.
This process is repeated k times, each time using a different subset as the validation set.
Performance Evaluation: The performance metrics, such as accuracy, precision, recall, or mean
squared error, are calculated for each iteration of the cross-validation process. These metrics are then averaged across all iterations.
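A minimal sketch of 5-fold cross-validation with scikit-learn (the dataset and k=5 are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 iterations trains on 4 folds and validates on the held-out fold.
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=42))
print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())
```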
Q22. What are the assumptions you need to take before starting with linear regression?
Multivariate normality
No auto-correlation
Homoscedasticity
Linear relationship
No or little multicollinearity
Q23. What is the difference between Lasso and Ridge regression?
Ridge Regression:
Performs L2 regularization, i.e., adds a penalty equivalent to the square of the magnitude of
coefficients
(minimization objective = least squares objective + λ × sum of the squares of the coefficients)
Lasso Regression:
Performs L1 regularization, i.e., adds a penalty equivalent to the absolute value of the magnitude of
coefficients
(minimization objective = least squares objective + λ × sum of the absolute values of the coefficients)
Feature Selection: Lasso can set coefficients to exactly zero, effectively performing feature selection, while Ridge only shrinks coefficients toward zero without eliminating them.
Bias-Variance Tradeoff: Both methods introduce some bias into the estimates but reduce variance, potentially improving generalization.
Regularization Technique: Ridge uses L2 regularization (squares of coefficients), and Lasso uses L1 regularization (absolute values of coefficients).
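A minimal sketch of the difference in behaviour, assuming synthetic regression data and illustrative alpha values: Lasso zeroes out some coefficients, Ridge only shrinks them:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty: squared coefficients
lasso = Lasso(alpha=1.0).fit(X, y)     # L1 penalty: absolute coefficients

print("ridge coefficients:", ridge.coef_.round(2))
print("lasso coefficients:", lasso.coef_.round(2))
print("zeroed by lasso   :", (lasso.coef_ == 0).sum(), "of", X.shape[1])
```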
The F1 score is a metric commonly used to evaluate the performance of a classification model. It
considers both the precision and recall of the model to provide a single numerical value that balances the two: F1 = 2 × (Precision × Recall) / (Precision + Recall).
Precision: Precision measures the proportion of true positive predictions out of all positive predictions made by the model.
Recall (or Sensitivity): Recall measures the proportion of true positive predictions out of all actual positive instances in the dataset.
Q26 .What are the different types of machine learning?
Machine learning is broadly divided into supervised, unsupervised, and reinforcement learning.
Supervised learning:
Regression: Linear Regression, Decision Tree, Random Forest, KNN, XGBoost
Classification: Logistic Regression, Decision Tree, Random Forest, KNN, SVM
Unsupervised learning:
Clustering: K-means, Hierarchical clustering, DBSCAN
Dimensionality Reduction: PCA, ICA
Association: Apriori, FP-Growth
Q27 . What are the different types of Unsupervised machine learning?
Clustering: K-means, Hierarchical clustering, DBSCAN, OPTICS
Dimensionality Reduction: PCA, ICA, Factor Analysis
Association Rule Mining: Apriori, FP-Growth
Q28. How do you know which machine learning algorithm should be used?
Regression: Predicting a Number (Linear Regression, Decision Trees, Random Forests, XGboost,
Adaboost)
Dimension reduction: Reducing the number of variables in data (PCA, ICA, t-SNE, UMAP)
Association: Finding out which products sell together (Apriori, Eclat, FP-Growth)
Overfitting is a common problem in supervised machine learning where a model learns to capture the
noise and random fluctuations in the training data instead of the underlying pattern or trend. This
leads to poor generalization performance, meaning the model performs well on the training data but poorly on new, unseen data.
Issues:
Overfitting occurs when the model is too complex for the available data, capturing noise instead of
patterns.
It happens when the model fits the training data too closely, including noise and random fluctuations.
High variance accompanies overfitting, indicating the model's predictions fluctuate widely with data
changes.
Signs of overfitting include low training error but high validation error, and poor performance on new data.
Regularization: Introducing penalties on the model parameters to prevent them from becoming too
large.
Cross-validation: Evaluating the model's performance on multiple subsets of the data to ensure it
generalizes well.
Feature selection: Selecting only the most relevant features to reduce the complexity of the model.
Using simpler models: Choosing a simpler model architecture that
is less prone to overfitting.
Q31 . What evaluation metrics would you use to assess the performance of a classification model?
Accuracy: Accuracy measures the proportion of correctly classified instances out of all instances. It
is calculated as the ratio of the number of correct predictions to the total number of predictions made.
Precision: Precision measures the proportion of true positive predictions out of all positive predictions made by the model. It is calculated as the ratio of true positives to the sum of true positives and false positives: Precision = TP / (TP + FP).
Recall (Sensitivity): Recall measures the proportion of true positive predictions out of all actual positive instances in the dataset. It is calculated as the ratio of true positives to the sum of true positives and false negatives: Recall = TP / (TP + FN).
F1 Score: The F1 score is the harmonic mean of precision and recall, providing a balanced measure of the two: F1 = 2 × (Precision × Recall) / (Precision + Recall).
ROC Curve and AUC: The Receiver Operating Characteristic (ROC) curve plots the true positive rate
(TPR) against the false positive rate (FPR) at various threshold settings. The Area Under the ROC
Curve (AUC) provides a single scalar value representing the model's ability to discriminate between the classes.
Confusion Matrix: A confusion matrix provides a tabular summary of the model's predictions versus
the actual class labels, showing true positives, true negatives, false positives, and false negatives.
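A minimal sketch computing these metrics with scikit-learn on synthetic, slightly imbalanced data (all settings illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]          # probability scores for ROC AUC

print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("F1 score :", f1_score(y_te, y_pred))
print("ROC AUC  :", roc_auc_score(y_te, y_prob))
print("confusion matrix:\n", confusion_matrix(y_te, y_pred))
```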
Q33.What are some common techniques for feature selection in supervised learning?
Filter Methods: Evaluate feature relevance before model training based on statistical properties or correlation with the target variable. Techniques include:
Information Gain
Wrapper Methods: Use model performance as a criterion for feature selection. Techniques include:
Forward Selection
Backward Elimination
Embedded Methods: Integrate feature selection into the model training process.
Lasso Regression
Dimensionality Reduction Techniques: Transform the feature space into a lower-dimensional space (e.g., with PCA), indirectly reducing the number of features.
Q34. What is Regularization and why is it useful?
• By penalizing large parameter values, regularization discourages the model from fitting the noise or random fluctuations in the training data.
• Regularization helps control the complexity of the model, ensuring that it generalizes well to unseen data.
• Common regularization techniques include L1 (Lasso) and L2 (Ridge) regularization, which add
penalties to the absolute values and squared values of the model parameters, respectively.
Q35 .Can you describe the process of cross-validation and its importance in supervised learning?
Cross-Validation:
1. Cross-validation is a resampling technique used to evaluate how well a model generalizes to unseen data.
2. The process involves dividing the dataset into multiple subsets or folds, training the model on a portion of the folds, and validating it on the remaining fold.
3. This process is repeated multiple times, with different partitions of the data, and the performance
metrics are averaged across all iterations to provide a more robust estimate of the model's
performance.
4. Common types of cross-validation include k-fold cross-validation, leave-one-out cross-
validation, and stratified cross-validation, each with its own variations and applications.
5. Cross-validation is important in supervised learning because it helps assess how well the model
generalizes to new, unseen data and provides insights into its stability and reliability.
6. It helps detect issues like overfitting or underfitting by evaluating the model's performance on
multiple subsets of the data, enabling the selection of the best-performing model and
hyperparameters.
Ensemble learning combines multiple individual models; these techniques leverage the diversity among these models to make more accurate and robust predictions. Its advantages include:
Improved Accuracy: Ensemble methods often achieve higher predictive accuracy compared to
individual models by leveraging the wisdom of crowds. By combining multiple models, ensemble
learning reduces the risk of selecting a suboptimal model and can better capture the underlying patterns in the data.
Robustness: Ensemble methods are more robust to noisy data and outliers because they aggregate
predictions from multiple models. They tend to be less susceptible to overfitting, as errors made by individual models tend to cancel each other out.
Reduction of Bias and Variance: Ensemble learning can help strike a balance between bias and
variance, leading to more stable predictions. For example, combining models with high bias and low
variance (e.g., decision trees) with models with low bias and high variance (e.g., neural networks) can produce a better overall trade-off.
Model Interpretability: Ensemble methods can sometimes offer better interpretability than complex
individual models. For example, in bagging methods like Random Forests, feature importance can be computed across the trees.
Versatility: Ensemble learning is versatile and can be applied to various types of models and learning
tasks. It can be used with classification, regression, and even unsupervised learning tasks.
Q36 .Common ensemble learning techniques include:
Bagging (Bootstrap Aggregating): Constructs multiple models independently and combines their predictions by averaging or majority voting. Random Forest is a common example.
Boosting: Builds models sequentially, where each subsequent model focuses on correcting the
errors of the previous model. Examples include AdaBoost, Gradient Boosting Machines (GBM), and
XGBoost.
Stacking (Stacked Generalization): Combines predictions from multiple models using a meta-model, which learns how best to weight the base models' outputs.
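A minimal sketch of the three ensemble styles using scikit-learn defaults on synthetic data (the particular base learners and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, random_state=0)

# Bagging: independent models on bootstrap samples (default base learner is a decision tree)
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: models built sequentially, each correcting the previous one's errors
boosting = GradientBoostingClassifier(random_state=0)

# Stacking: a meta-model combines the base models' predictions
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression())

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```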
Q37. How do you choose between different supervised learning algorithms for a given problem?
For Linear Relationships: Linear regression, Logistic regression, Linear Support Vector Machines
(SVM).
For Non-linear Relationships: Decision Trees, Random Forests, Gradient Boosting Machines (GBM),
Neural Networks.
For High-dimensional Data: Regularized regression (e.g., Lasso, Ridge), Support Vector Machines (SVM).
For Text Data: Naive Bayes, Support Vector Machines (SVM), Recurrent Neural Networks (RNNs), etc.
Assumptions of linear regression:
Linear Relationship – As the input increases, the output should also increase or vice-versa. Linear regression assumes that the input and output are linearly related.
No Multicollinearity – Multicollinearity means that all the input columns you have in data should not
be highly related. For example, if we have X1, X2, and X3 as input columns in data and if by changing
X1 there are changes observed in X2, then it is the scenario of multicollinearity. For example, if by
changing X1, we get output keeping that X2 and X3 are constant, but when X1 is correlated to X2,
then on changing X1, X2 will also change, so we will not get proper output. That’s why
multicollinearity is a problem.
Normality of Residual – When we predict new data points and calculate error or residual (actual –
predicted), then on plotting, it should be normally distributed across the mean. To know this, you can
directly plot the KDE or QQ plot. First, you have to calculate the residual for every test point, and
using the seaborn library, you can plot a distribution plot. The second way is to directly plot a QQ plot.
Homoscedasticity – The residuals should have the same scatter (constant variance) across the range of predictions. According to this assumption, when you plot the residuals, the spread
should be equal. If it is not equal, then it is known as Heteroscedasticity. To check this, you keep the
prediction on the X-axis and the residual on the Y-axis. The scatter plot should be uniform.
No Autocorrelation of Errors – If you plot all the residual errors, there should not be any particular
pattern.
Q38 .What are the key differences between logistic regression and linear regression ?
Problem Type:
Linear Regression: Linear regression is used for predicting continuous numeric outcomes. It models
the relationship between the independent variables (features) and the continuous dependent variable with a linear equation.
Logistic Regression: Logistic regression is used for predicting binary outcomes or probabilities. It
models the probability that a given input belongs to a particular class using a logistic function, which maps any real-valued input to a value between 0 and 1.
Output:
Linear Regression: The output of linear regression is a continuous numeric value that represents the
predicted outcome.
Logistic Regression: The output of logistic regression is a probability value between 0 and 1, which
can be interpreted as the likelihood of the input belonging to a certain class. It is often used to
classify inputs into one of two classes based on a threshold (e.g., 0.5).
Model Representation:
Linear Regression: In linear regression, the relationship between the input features and the target variable is represented by a linear equation.
Logistic Regression: In logistic regression, the relationship between the input features and the log-
odds of the target variable is represented by a sigmoid (logistic) function, which transforms the linear combination of inputs into a probability.
Loss Function:
Linear Regression: The loss function used in linear regression is typically Mean Squared Error
(MSE) or Mean Absolute Error (MAE), which measures the difference between the predicted and
actual values.
Logistic Regression: The loss function used in logistic regression is the Binary Cross-Entropy (also
known as Log Loss), which measures the difference between the predicted probabilities and the true
binary labels.
Applications:
Linear Regression: Common applications of linear regression include predicting house prices, sales forecasting, and other continuous quantities.
Logistic Regression: Common applications of logistic regression include binary classification tasks such as spam detection and disease diagnosis.
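A minimal sketch of the contrast on made-up data: the linear model returns an unbounded numeric prediction, the logistic model returns a probability that is thresholded into a class:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Linear regression: continuous target (e.g. a price-like quantity)
y_cont = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
lin = LinearRegression().fit(X, y_cont)
print("continuous prediction:", lin.predict(X[:1]))          # any real value

# Logistic regression: binary target; output is a probability in (0, 1)
y_bin = (X[:, 0] + X[:, 1] > 0).astype(int)
log = LogisticRegression().fit(X, y_bin)
print("class probability    :", log.predict_proba(X[:1])[0, 1])
print("predicted class      :", log.predict(X[:1])[0])        # thresholded at 0.5
```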
Q40. How do support vector machines (SVMs) work in supervised learning? What is the kernel trick?
SVMs work by finding the optimal hyperplane that separates data points of different classes in a feature space.
The goal of SVMs is to maximize the margin, which is the distance between the hyperplane and the nearest data points of each class (the support vectors).
SVMs can handle both linearly separable and non-linearly separable data by mapping the input features into a higher-dimensional space.
The hyperplane is determined by solving an optimization problem that involves minimizing the norm
of the weight vector subject to the constraint that all data points are correctly classified.
For non-linearly separable data, SVMs use a technique called the kernel trick to map the input features into a higher-dimensional space where a linear separation becomes possible.
The kernel trick allows SVMs to implicitly compute the dot product between the mapped feature vectors without ever computing the mapping explicitly.
Kernel Trick:
The kernel trick is a mathematical technique that allows SVMs to operate in a high-dimensional feature space without explicitly computing the coordinates of the data in that space.
Instead of explicitly transforming the input features, the kernel function computes the dot product between pairs of data points as if they were in the higher-dimensional space (e.g., polynomial or RBF kernels).
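A minimal sketch of the kernel trick in practice, assuming scikit-learn's SVC and a synthetic ring-shaped dataset: the linear kernel fails while the RBF kernel separates the classes:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale")   # kernel trick (RBF kernel)

print("linear kernel accuracy:", cross_val_score(linear_svm, X, y, cv=5).mean())
print("RBF kernel accuracy   :", cross_val_score(rbf_svm, X, y, cv=5).mean())
```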
Q41. What are Bias and Variance?
Bias measures how well a model approximates the true relationship between features and target outputs. High bias leads to underfitting.
Variance measures how much the model's predictions vary across different training datasets. High variance leads to overfitting.
Q42. Regularization:
Regularization is a technique used in machine learning and statistics to prevent overfitting and
improve the generalization performance of models. It involves adding a penalty term to the model's
loss function, which discourages the model from fitting the training data too closely or from taking on overly large parameter values.
The basic idea behind regularization is to impose constraints on the model parameters during
training, preventing them from taking extreme or overly complex values. This helps to ensure that the
model captures the underlying patterns in the data without overfitting to noise or random
fluctuations.
L1 Regularization (Lasso):
L1 regularization adds a penalty term proportional to the absolute values of the model parameters to the loss function.
This penalty encourages sparsity in the model by forcing some of the parameters to be exactly zero, effectively removing the corresponding features.
L1 regularization is useful when the dataset contains many irrelevant or redundant features.
L2 Regularization (Ridge):
L2 regularization adds a penalty term proportional to the squared values of the model parameters to the loss function.
This penalty encourages smaller parameter values, effectively shrinking the coefficients towards zero without making them exactly zero.
L2 regularization is useful for reducing the impact of multicollinearity and stabilizing the model's
predictions.
Q43. Explain Cross-Validation and its types.
Cross-validation is a technique for evaluating the performance of a model on unseen data. It is commonly used to assess how well a predictive model generalizes to an independent dataset.
The basic idea behind cross-validation is to partition the available dataset into multiple subsets or
folds. The model is trained on a subset of the data, called the training set, and then evaluated on the
remaining subset, called the validation set. This process is repeated multiple times, with different
partitions of the data, and the performance metrics are averaged across all iterations to provide a more robust estimate.
K-Fold Cross-Validation:
The model is trained K times, each time using K-1 folds for training and one fold for validation.
The performance metrics are averaged across all K iterations to obtain the final evaluation.
Leave-One-Out Cross-Validation (LOOCV):
Each data point is held out once as the validation set, and the model is trained on the remaining
data.
This process is repeated for each data point in the dataset.
LOOCV is computationally expensive for large datasets but provides an unbiased estimate of the
model's performance.
Stratified Cross-Validation:
Ensures that each fold has a similar distribution of classes or target variable values as the original
dataset.
Particularly useful for imbalanced datasets where certain classes are underrepresented.
Q44. What is a Hyperparameter?
Hyperparameters are configuration settings that are external to the model and are not learned from
the data during the training process. These parameters control the behavior of the learning algorithm and the structure of the resulting model.
Definition:
Hyperparameters are parameters that are set prior to training and remain fixed throughout the
training process.
They are distinct from model parameters, which are learned from the training data during the
optimization process.
Examples: the learning rate in gradient descent, the number of neighbors K in KNN, the number of clusters K in K-means, the maximum depth of a decision tree, and the regularization strength in Ridge/Lasso.
Tuning:
Hyperparameter tuning, also known as hyperparameter optimization, involves searching for the hyperparameter values that give the best model performance, evaluating each candidate configuration using techniques like cross-validation.
Automated methods such as grid search, random search, and Bayesian optimization are commonly used for this purpose.
Impact on Model:
Hyperparameters can significantly impact the performance, complexity, and generalization ability of
the model.
Proper selection of hyperparameters is crucial for achieving the desired balance between bias and
variance and for building models that generalize well to unseen data.
Suboptimal hyperparameters can lead to issues such as underfitting, overfitting, or poor model
performance
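A minimal sketch of hyperparameter tuning with grid search and cross-validation (the parameter grid and dataset are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"n_estimators": [100, 300],
              "max_depth": [None, 5, 10],
              "min_samples_leaf": [1, 5]}

# Every combination in the grid is scored with 5-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
print("best CV accuracy    :", search.best_score_)
```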
Q44. Accuracy in a regression model:
Mean Absolute Error (MAE):
MAE measures the average absolute difference between the predicted and actual values.
Mean Squared Error (MSE):
MSE measures the average squared difference between the predicted and actual values.
It penalizes larger errors more heavily than smaller errors and is widely used in optimization
algorithms.
Root Mean Squared Error (RMSE):
RMSE is the square root of the MSE and provides a measure of the average magnitude of the errors.
It is in the same unit as the target variable and is easier to interpret than MSE.
R-squared (Coefficient of Determination):
R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables.
It ranges from 0 to 1, with higher values indicating a better fit of the model to the data.
Mean Absolute Percentage Error (MAPE):
MAPE measures the average percentage difference between the predicted and actual values relative to the actual values.
It provides a relative measure of accuracy and is useful for comparing models across different
datasets.
Adjusted R-squared:
Adjusted R-squared is a modified version of R-squared that penalizes the inclusion of unnecessary predictors.
It accounts for the number of predictors in the model and is more appropriate for comparing models with different numbers of features.
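A minimal sketch computing MAE, MSE, RMSE and R-squared with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

y_pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

mae = mean_absolute_error(y_te, y_pred)
mse = mean_squared_error(y_te, y_pred)
rmse = np.sqrt(mse)                      # same units as the target
r2 = r2_score(y_te, y_pred)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R^2={r2:.3f}")
```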
Q45. Could you explain the K-means clustering algorithm and how it works?
Initialization:
K-means begins by randomly initializing K cluster centroids. These centroids represent the centers of
the clusters. The number of clusters (K) is specified by the user beforehand.
Assignment Step:
Each data point in the dataset is assigned to the nearest cluster centroid based on some distance metric, typically Euclidean distance.
This step effectively groups data points into clusters by minimizing the distance between each point and its assigned centroid.
Update Step:
After all data points have been assigned to clusters, the centroids are recomputed based on the mean of the points in each cluster.
The centroids are moved to the average position of the points within their respective clusters.
This step iteratively updates the centroids to better represent the center of each cluster.
Convergence:
Steps 2 and 3 are repeated iteratively until either the centroids no longer change significantly or a maximum number of iterations is reached.
At convergence, the algorithm has effectively minimized the within-cluster sum of squares, meaning
that the data points within each cluster are as close to the centroid as possible.
Output:
The final output of the K-means algorithm is a set of K cluster centroids and the cluster assignments for each data point.
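A minimal sketch of K-means on synthetic blob data, assuming K=3 is known in advance:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)            # cluster assignment for each point

print("cluster centroids:\n", kmeans.cluster_centers_)
print("within-cluster sum of squares (inertia):", kmeans.inertia_)
print("first 10 assignments:", labels[:10])
```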
Q46. What are the key challenges in determining the optimal number of clusters in K-means?
Elbow Method Ambiguity: It involves plotting the within-cluster sum of squares (WCSS) against the
number of clusters and selecting the point where the rate of decrease in WCSS slows down.
However, this point may not always be clearly defined, making it challenging to determine the optimal K.
Silhouette Analysis Interpretation: Silhouette analysis measures how similar an object is to its
own cluster compared to other clusters. While silhouette scores provide insight into cluster quality,
own cluster compared to other clusters. While silhouette scores provide insight into cluster quality,
interpreting them can be subjective, and high average silhouette scores may not always lead to a
clear choice of K.
Subjectivity and Context Dependency: Determining the optimal K can be subjective and context-
dependent, as different stakeholders may have different perspectives on what constitutes meaningful
clusters. Moreover, the optimal K may vary based on the specific problem domain and objectives.
Overfitting vs. Underfitting: Selecting too few clusters (underfitting) may oversimplify the data, while
selecting too many clusters (overfitting) may lead to spurious or insignificant clusters. Balancing
between underfitting and overfitting requires careful consideration and validation techniques.
High-Dimensional Data: In high-dimensional spaces, the distance between data points becomes less meaningful, making it harder to form and evaluate clusters.
Hierarchical Clustering:
No initialization required.
Computational complexity: O(n^2 log n) for agglomerative, O(n^3) for divisive.
K-means Clustering:
Assumes clusters are spherical and isotropic; may struggle with non-convex shapes.
Q48 . Can you explain the concept of dimensionality reduction and its importance in unsupervised
learning?
Dimensionality reduction is the process of reducing the number of features or variables in a dataset while preserving as much of the important information as possible. It matters because:
Efficiency: High-dimensional data can be computationally intensive and may suffer from the curse of dimensionality; reducing the number of features speeds up training and inference.
Visualization: Dimensionality reduction techniques like PCA and t-SNE help visualize complex, high-dimensional data and reveal patterns.
Noise Reduction: By capturing the most relevant features and removing redundant or noisy ones,
dimensionality reduction can improve the signal-to-noise ratio in the data, leading to better clustering
or classification results.
Interpretability: Reduced-dimensional representations are often more interpretable and easier to analyze.
Q48.What are some popular dimensionality reduction techniques, and when would you use each
one?
Principal Component Analysis (PCA):
Use: PCA finds new orthogonal components that capture the maximum variance in the data and projects the data onto these components.
When to use: PCA is suitable when the data has linear relationships between variables and when the goal is to compress the data while retaining most of its variance.
t-SNE (t-Distributed Stochastic Neighbor Embedding):
Use: t-SNE is effective for visualizing high-dimensional data in low-dimensional space (usually 2D or 3D).
When to use: t-SNE is suitable for visual exploration of data and clustering analysis, especially when the structure of the data is nonlinear.
Autoencoders:
Use: Autoencoders are neural network architectures that learn efficient representations of data by
encoding it into a lower-dimensional latent space and then decoding it back to the original space.
When to use: Autoencoders are useful for nonlinear dimensionality reduction and feature learning, especially on large or complex datasets.
Linear Discriminant Analysis (LDA):
Use: LDA is a supervised dimensionality reduction technique that maximizes the separation between classes.
When to use: LDA is suitable when class information is available and when the goal is to reduce dimensionality while preserving class separability.
Kernel PCA:
Use: Kernel PCA extends PCA to nonlinear dimensionality reduction by using kernel methods to capture nonlinear structure in the data.
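A minimal sketch of PCA followed by a t-SNE embedding for visualization, using the scikit-learn digits dataset (component counts and perplexity are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)        # 64 features per sample

# PCA: keep the directions of maximum variance
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)
print("variance explained by 10 components:",
      pca.explained_variance_ratio_.sum().round(3))

# t-SNE: nonlinear 2-D embedding, mainly for visualization
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)
print("t-SNE embedding shape:", X_2d.shape)   # (n_samples, 2), ready to scatter-plot
```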
Evaluating clustering results:
Silhouette Score: Silhouette analysis measures how similar an object is to its own cluster compared to other clusters; higher values indicate better-defined clusters.
Davies-Bouldin Index: This index quantifies the average similarity between clusters, where lower values indicate better clustering.
Calinski-Harabasz Index (Variance Ratio Criterion): This index measures the ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better-separated clusters.
Internal Validation Metrics: Various internal metrics, such as inertia in K-means clustering or intra-
cluster distances, can be used to assess the compactness and separation of clusters within the
dataset.
External Validation Metrics: If some ground truth is available, external metrics like Adjusted Rand
Index or Fowlkes-Mallows Index can be used to compare clustering results against known labels.
The "curse of dimensionality" refers to the phenomena where the performance of certain algorithms
degrades as the number of features or dimensions in the dataset increases. This concept is
particularly relevant in high-dimensional spaces, where data points become increasingly sparse and distances between them less informative.
Implications:
Data Sparsity: As the number of dimensions increases, the amount of available data per dimension
decreases exponentially. This sparsity makes it challenging for unsupervised learning algorithms to find meaningful patterns.
Distance Computations: Many algorithms, such as clustering and dimensionality reduction techniques, rely on computing distances or similarities between data points; in high-dimensional spaces these distances become less discriminative.
Overfitting: In high-dimensional spaces, the risk of overfitting also increases. Unsupervised learning
algorithms may find spurious patterns or noise in the data, leading to overfitting and poor generalization.
Comparison of KNN and K-means:
Purpose:
KNN: KNN is a supervised learning algorithm used for classification and regression tasks. It predicts
the class or value of a new data point based on the majority class or average value of its nearest neighbors.
K-means: K-means is an unsupervised learning algorithm used for clustering tasks. It partitions the
data into a predefined number of clusters (K) based on the proximity of data points to cluster
centroids.
Learning Type:
KNN: KNN is a supervised learning algorithm because it relies on labeled data to make predictions. It requires a labeled training set.
K-means: K-means is an unsupervised learning algorithm because it does not require labeled data. It groups unlabeled data points based on their similarity.
Prediction:
KNN: KNN predicts the class or value of a new data point by finding the K nearest neighbors in the training data and taking a majority vote (classification) or average (regression).
K-means: K-means clusters data points into K clusters based on their proximity to cluster centroids.
It does not make predictions for new data points but rather assigns them to existing clusters based on their distance to the learned centroids.
Number of Parameters:
KNN: KNN has hyperparameters such as the number of neighbors (K) and the choice of distance
metric.
K-means: K-means has hyperparameters such as the number of clusters (K) and the initialization method for the centroids.
Algorithm Complexity:
KNN: KNN has a simple algorithmic complexity during inference, but it can be computationally
expensive for large datasets since it requires calculating distances to all training samples.
K-means: K-means has a higher algorithmic complexity during training, as it involves iteratively
updating cluster centroids until convergence. However, once trained, assigning new data points to clusters is fast.
Agglomerative Hierarchical Clustering: In agglomerative hierarchical clustering, each data point
initially forms its own cluster, and then pairs of clusters are successively merged based on their
proximity. This process continues until all data points belong to a single cluster. Agglomerative
clustering can be visualized as a bottom-up approach, where smaller clusters are progressively merged into larger ones.
We cut the dendrogram at the point where the horizontal line covers the maximum distance between
2 vertical lines.
Divisive Hierarchical Clustering: In divisive hierarchical clustering, all data points initially belong to a
single cluster, and then the algorithm recursively divides the cluster into smaller clusters based on
some criteria, such as maximizing inter-cluster dissimilarity. This process continues until each data
point forms its own cluster. Divisive clustering can be visualized as a top-down approach, where a
single cluster is repeatedly split into smaller clusters until a stopping criterion is met.
Elbow Method: Plot the within-cluster sum of squares (WCSS) against the number of clusters (K).
The "elbow" point on the plot, where the rate of decrease in WCSS slows down, can indicate the
optimal number of clusters. This method suggests choosing the number of clusters at the point
where adding more clusters does not significantly reduce the WCSS.
Silhouette Score: Compute the silhouette score for different values of K. The silhouette score
measures the cohesion within clusters and the separation between clusters. A higher silhouette
score suggests better-defined clusters. Select the value of K with the highest average silhouette
score.
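A minimal sketch that prints the quantities behind both methods, assuming synthetic blob data; in practice you would plot inertia against K for the elbow and pick the K with the best silhouette:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    print(f"K={k}: WCSS (inertia)={km.inertia_:.0f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```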
K-means steps:
1. Decide the number of groups (K) you want to form.
2. Randomly assign each data point to one of the K groups.
3. Find the “center” of all group-1 points and all group-2 points. These centers are known as
“centroids”
4. Calculate the distance of every point from each centroid.
5. Assign each point to the nearest centroid. Hence, the corresponding group.
6. Repeat steps 3-5 till there are no further changes in the positions of the centroids OR the maximum number of iterations is reached.
Linkage methods in hierarchical clustering:
Single Linkage: Also known as nearest-neighbor linkage, it measures the distance between the
closest points of two clusters when merging. It tends to create long, elongated clusters.
Complete Linkage: Also known as farthest-neighbor linkage, it measures the distance between the
farthest points of two clusters when merging. It tends to create compact, spherical clusters.
Average Linkage: It calculates the average distance between all pairs of points from two clusters
when merging. It balances between single and complete linkage, often resulting in balanced clusters.
Centroid Linkage: It calculates the distance between the centroids of two clusters when merging. It is
less affected by outliers but may not capture the cluster structure accurately.
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
Terminologies:
1. ε, epsilon (eps): the maximum allowed distance from one point to another for the two points to be considered neighbors.
2. MinPts: the minimum number of points which should be present close to each other (within eps) for that region to be considered dense.
3. Core Point: a point which has at least MinPts number of points near to it, within the distance
of ε (eps).
4. Border Point/Non-Core Point: a point in the data which has fewer than MinPts points within eps, but is itself within eps of a core point.
5. Noise: a point which has no point near to it within a distance of eps.
DBSCAN steps:
1. Start with one point randomly, find out if it is a core point by checking the minimum number of points (MinPts) within the eps distance.
2. If it is a core point, form a cluster from it together with all points within eps distance of it.
3. If the number of points within eps distance is fewer than MinPts, then mark it as a non-core point.
4. If the number of points within eps distance is zero, then mark that point as noise.
5. Combine all those clusters together whose points are within eps distance. Also known as density
connection or connected components. This starts a chain reaction. If cluster-1 and cluster-2 are
connected and cluster-2 and cluster-3 are connected then cluster-1 and cluster-3 are also
connected. Hence all these are combined to make one single cluster.
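A minimal sketch of DBSCAN with scikit-learn; eps and min_samples (MinPts) are illustrative values that normally need tuning for real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a density-based method handles this shape well
X, _ = make_moons(n_samples=400, noise=0.06, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                       # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points  :", np.sum(labels == -1))
```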
Factor Analysis tries to find the hidden groups of variables, which is like a common driving factor for
the type of values in all those variables, e.g. customer survey, if you are unhappy with the taste of the
coffee then all those questions about coffee taste in the survey will have low scores! Hence Bad
Taste will be a Factor and questions like Coffee sweetness rating, Coffee bitterness rating, Coffee
Freshness rating will represent the individual variables that will have low ratings.
PCA tries to find fewer new variables called Principal Components which are linear combinations of
the variables in the data, hence these fewer variables “represent” all the variables, while reducing the
dimensions. Here the goal is to create a new dataset that has the same number of rows but, lesser
number of columns(Principal Components) which explains the variance of all the original variables in
the data.
PC1=α1 * ( Coffee sweetness rating) + α2(Coffee bitterness rating) + α3(Coffee Freshness rating)
PC2=β1 * ( Coffee sweetness rating) + β2(Coffee bitterness rating) + β3(Coffee Freshness rating)
Linear Regression vs. Logistic Regression:
Linear Regression: Linear regression is used for predicting continuous numerical outcomes. It
models the relationship between one or more independent variables (predictors) and a continuous dependent variable by fitting a linear equation.
Logistic Regression: Logistic regression is used for predicting categorical outcomes, specifically
binary outcomes (two classes). It models the probability of the presence of a certain outcome or
event based on one or more independent variables, using a logistic (sigmoid) function to map predictions to probabilities between 0 and 1.
Output:
Linear Regression: The output of linear regression is a continuous numerical value. It predicts the
value of the dependent variable based on the values of the independent variables.
Logistic Regression: The output of logistic regression is a probability score between 0 and 1. It represents the likelihood that an observation belongs to the positive class.
Model Representation:
Linear Regression: The relationship between the independent variables and the dependent variable is represented by a linear equation.
Logistic Regression: The relationship between the independent variables and the log-odds of the outcome is represented by a sigmoid (logistic) function.
Loss Function:
Linear Regression: Linear regression typically uses the mean squared error (MSE) or the sum of
squared residuals as the loss function to minimize the difference between predicted and actual
values.
Logistic Regression: Logistic regression uses the logistic loss (or log loss) function, also known as
the cross-entropy loss, to minimize the difference between predicted probabilities and actual class
labels.
Interpretation:
Linear Regression: The coefficients (slope and intercept) in linear regression represent the change in the
outcome variable for a one-unit change in the independent variable.
Logistic Regression: The coefficients represent the change in the log-odds of the outcome for a one-unit change in the independent variable, and they are often interpreted in terms of odds ratios.
Logistic Regression is used for predicting a category, specially the Binary categories(Yes/No , 0/1).
When there are only two outcomes in Target Variable it is known as Binomial Logistic Regression.
If there are more than two outcomes in Target Variable it is known as Multinomial Logistic
Regression.
A decision tree makes predictions by recursively splitting the dataset into subsets based on the
values of input features. At each node, it selects the feature that best separates the data into purest
subsets (homogeneous with respect to the target variable). This process continues until a stopping
criterion is met, such as reaching a maximum depth or having a minimum number of samples in each
leaf node. When a new data point is presented, it traverses the tree from the root node to a leaf
node, where it predicts the majority class or average value of the training samples in that node.
Decision trees can capture nonlinear relationships between features and the target variable.
Common splitting criteria used in decision trees:
1. Gini impurity.
2. Entropy.
3. Information gain (gain ratio).
Pruning involves removing branches (subtrees) from the tree that do not provide significant
improvements in predictive accuracy. It prevents decision trees from overfitting by reducing the size and complexity of the tree.
Yes, decision trees can handle missing values in the dataset by using surrogate splits. Surrogate
splits are alternative splits used when missing values are encountered during prediction. The
decision tree algorithm automatically determines the best surrogate splits based on available data to mimic the primary split as closely as possible.
A random forest is an ensemble learning method that builds multiple decision trees during training
and outputs the mode (classification) or average prediction (regression) of the individual trees. Each
tree in the random forest is trained on a bootstrap sample of the original dataset, and at each node, a
random subset of features is considered for splitting. This randomness helps to decorrelate the trees and reduce overfitting.
Q64 .How does a random forest reduce overfitting compared to a single decision tree?:
Random forests reduce overfitting compared to a single decision tree by averaging predictions from
multiple trees trained on different subsets of data. By using bootstrap sampling and random feature
selection, each tree in the random forest learns different aspects of the data, leading to less reliance on any single feature or noise pattern and better generalization.
Bagging (Bootstrap Aggregating) is a machine learning technique that involves training multiple
models (often decision trees) independently on different subsets of the training data and then
combining their predictions through averaging or voting. Random forests are an extension of bagging
specifically designed for decision trees, where each tree is trained on a bootstrap sample of the data and a random subset of features is considered at each split.
Random forests handle categorical variables by splitting them into multiple binary (dummy) variables,
where each category becomes a separate feature. During training, the algorithm considers these
binary variables along with numerical variables for splitting nodes in the decision trees. This
approach effectively incorporates categorical variables into the random forest model.
Feature importance in random forests measures the contribution of each feature to the predictive
performance of the model. It is calculated based on the decrease in impurity (e.g., Gini impurity or
entropy) achieved by each feature when used for splitting nodes in the decision trees. Features that
result in larger impurity decreases are considered more important, as they provide more predictive power.
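A minimal sketch of a random forest with out-of-bag scoring and impurity-based feature importances (the dataset and settings are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

print("test accuracy:", rf.score(X_te, y_te))
print("OOB accuracy :", rf.oob_score_)          # out-of-bag estimate (see OOB error below)
top = sorted(zip(data.feature_names, rf.feature_importances_),
             key=lambda pair: pair[1], reverse=True)[:5]
print("top 5 features by importance:", top)
```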
Q68 .What are ensemble learning methods, and why are they used?:
Ensemble learning methods involve combining multiple individual models (learners) to improve
predictive performance compared to any single model. They are used to mitigate the limitations of
individual models, reduce variance, and enhance generalization by leveraging the wisdom of crowds.
Ensemble methods often outperform single models by capturing diverse perspectives or patterns in
the data.
Bagging (Bootstrap Aggregating): Bagging involves training multiple models (often of the same type)
independently on different subsets of the training data using bootstrap sampling. Predictions are
then averaged (for regression) or majority-voted (for classification) to produce the final output.
Bagging reduces variance and overfitting by averaging the predictions of multiple models.
Boosting: Boosting is an iterative ensemble technique where models are trained sequentially, and
each subsequent model focuses on correcting the errors made by the previous models. Boosting
assigns higher weights to misclassified data points, thereby prioritizing difficult-to-predict instances.
Q70 .What is the purpose of combining multiple weak learners in ensemble methods?:
The purpose of combining multiple weak learners (models that perform slightly better than random
chance) in ensemble methods is to create a strong learner with improved predictive performance. By
leveraging the diversity among weak learners and combining their predictions intelligently, ensemble
methods can reduce bias, variance, and overfitting, leading to more robust and accurate models.
Stacking (Stacked Generalization): Stacking involves training multiple heterogeneous (different types)
or homogeneous (same type) base models and combining their predictions using a meta-learner
(often a linear regression or neural network). Unlike bagging and boosting, which combine
predictions in a parallel or sequential manner, stacking learns to weigh the predictions of base models optimally.
Q72 .Can you provide examples of ensemble methods other than random forests and boosting
algorithms?:
Voting Classifiers: A voting classifier combines the predictions of multiple individual classifiers (e.g.,
decision trees, logistic regression, support vector machines) using a majority vote (for classification) or by averaging predicted values or probabilities (soft voting).
Stacking: As mentioned earlier, stacking combines predictions from multiple base models using a
meta-learner. It can involve various base models and meta-learner architectures, making it a versatile ensemble approach.
Q73.How do you evaluate the performance of decision trees and random forests?:
Performance of decision trees:
For classification: Metrics such as accuracy, precision, recall, F1-score, and ROC curve can be used.
For regression: Metrics like mean absolute error (MAE), mean squared error (MSE), and R-squared
can be utilized.
Performance of random forests:
Similar to decision trees, but typically random forests are evaluated using metrics like out-of-bag
(OOB) error rate, accuracy, and area under the ROC curve (AUC) for classification, and MSE or R-squared for regression.
Performance of ensemble models:
Similar to individual models, ensemble models can be evaluated using metrics such as accuracy,
precision, recall, F1-score, AUC, and mean squared error (MSE), depending on the type of problem (classification or regression).
Q75.Can you explain cross-validation and its role in evaluating ensemble models?:
Cross-validation involves splitting the dataset into multiple subsets (folds), training the model on
some folds, and evaluating its performance on the remaining fold(s). This process is repeated
multiple times, with different combinations of training and testing sets. Cross-validation helps to
assess the model's generalization performance, detect overfitting, and estimate how the model will
perform on unseen data. It is particularly useful for ensemble models as it provides more reliable estimates of their generalization performance.
Q77.What are some techniques for diagnosing and addressing overfitting in ensemble models?:
3. Tuning hyperparameters (e.g., maximum tree depth, minimum samples per leaf) using cross-
validation.
5. Ensemble-specific techniques such as early stopping, where training is halted when performance on a validation set stops improving.
In a random forest, each decision tree is trained on a bootstrap sample of the original dataset,
meaning that some data points are left out of each bootstrap sample. The out-of-bag (OOB) error is
the prediction error calculated on the data points that are not included in the bootstrap sample used
to train each individual tree. These out-of-bag data points serve as a validation set for the
corresponding tree.
Decision Tree:
A decision tree is a supervised learning algorithm used for classification and regression tasks. It
breaks down a dataset into smaller and smaller subsets while at the same time an associated
decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
Node:
A node in a decision tree represents a feature or attribute. There are two types of nodes:
Internal Node: Represents a decision point that splits the data into smaller subsets.
Leaf Node (or Terminal Node): Represents the final outcome or decision, usually a class label in classification or a numeric value in regression.
Root Node:
The root node is the topmost node in the decision tree, from which the tree starts to branch out. It represents the entire dataset before any splits are made.
Split:
A split is a decision point in a decision tree where the dataset is divided into two or more smaller subsets based on the value of a feature.
Branch:
A branch represents the outcome of a split and leads to subsequent nodes in the decision tree.
Pruning:
Pruning is a technique used to reduce the size of a decision tree by removing unnecessary branches.
This helps prevent overfitting and improves the tree's generalization ability.
Entropy:
As mentioned earlier, entropy is a measure of impurity or disorder in a set of data. It's used in
decision tree algorithms to determine the best attribute to split the data on at each node.
Information Gain:
Information gain measures the effectiveness of an attribute in classifying the data. It quantifies the reduction in entropy achieved by splitting the data on that attribute.
Gini Impurity:
Gini impurity is another measure of impurity used in decision tree algorithms. It measures the
probability of misclassifying an element in a dataset, where a lower Gini impurity indicates purer
nodes.
Decision tree pruning is a process of trimming down the branches of a decision tree to reduce its
size and complexity. This helps to improve the tree's generalization performance on unseen data by
reducing overfitting.
The clustering technique can be used in multiple domains of data science like image
classification, customer segmentation, and recommendation engines. One of the most common
uses is in market research and customer segmentation, which is then utilized to target a particular group of customers.
Q82 . What is feature engineering? How does it affect the model’s performance?
Feature engineering refers to developing some new features by using existing features.
Sometimes there is a very subtle mathematical relation between some features which if
explored properly then the new features can be developed using those mathematical
operations.
Also, there are times when multiple pieces of information are clubbed together and provided as a single
data column. At those times, developing new features and using them helps us gain deeper
insights into the data, and if the derived features are significant enough, they help to improve the model's performance.
To evaluate clustering, there are metrics like Inertia or Sum of Squared Errors (SSE), Silhouette Score, l1, and l2
scores. Out of all of these metrics, the Inertia or Sum of Squared Errors (SSE) and Silhouette Score are the most commonly used.
Smaller values of learning rate help the training process to converge more slowly and
gradually toward the global optimum instead of fluctuating around it. This is because a smaller
learning rate results in smaller updates to the model weights at each iteration, which can help avoid overshooting the minimum.
The main reason why we cannot use linear regression for a classification task is that the output of linear regression is continuous and unbounded, while classification requires discrete class labels (or probabilities bounded between 0 and 1).
If we use linear regression for a classification task, the error function graph will not be convex. A convex graph has only one minimum, known as the global minimum, but in the case of a non-convex graph there is a chance of our model getting stuck at some local minimum that is not the global one. To avoid this situation of getting stuck in a local minimum, we do not use the linear regression algorithm for classification tasks.
Normalization is the process of bringing the features to a certain scale or range of values. If we do not perform normalization, there is a chance that the gradient descent updates will not converge to the global or local minima and will end up oscillating back and forth.
Upsampling: Increases the number of samples in the minority class by randomly selecting points and
adding them to the dataset. It can lead to high training accuracy but might not generalize well to unseen data.
Downsampling: Decreases the number of samples in the majority class by randomly selecting a
subset of points equal to the number of points in the minority class. It might result in the loss of important information from the majority class.
Data leakage occurs when information about the target variable unintentionally leaks into the input features, for example through a feature that is highly correlated with the target because it is derived from it.
Training the model with such features can lead to the model inadvertently capturing
the target variable's information during training. Consequently, the model may achieve seemingly
high accuracy during training and validation, but it fails to perform well on new, unseen data.
Detecting this situation involves observing a significant drop in model performance when the model is applied to real-world predictions.
Q89. Is it always necessary to use an 80:20 ratio for the train test split?
No, there is no requirement that the data be split in an 80:20 ratio. The main purpose of the splitting is to have some data which the model has not seen previously, so that we can evaluate how well it generalizes.
It is applied when we do not have large datasets, and it is used to find the similarities and dissimilarities between two images.
Q91 . What is the difference between one hot encoding and ordinal encoding?
One-Hot Encoding: Creates a separate binary column for each category, with no order implied between the categories.
Ordinal Encoding: Assigns integer values to the categories according to a meaningful order or rank (e.g., low < medium < high), so a single column is enough.
Q92.How to identify whether the model has overfitted the training data or not?
This is the step where the splitting of the data into training and validation data proves to be a boon. If
the model’s performance on the training data is very high as compared to the performance on the
validation data, then we can say that the model has overfitted the training data by learning the noise and patterns specific to it rather than generalizable relationships.
Q93.What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?
In gradient descent (GD), the model parameters are updated using the gradient of the loss computed over the entire training dataset in each iteration.
The update step aims to move towards the global minimum of the loss function in a more consistent manner.
GD typically leads to smoother convergence of the training error towards the minimum.
In SGD, the model is trained using a randomly selected mini-batch of data in each iteration.
It calculates the gradient of the loss function with respect to only the examples in the mini-batch and updates the parameters based on that estimate.
The update steps are noisier and less consistent compared to GD because they are influenced by the randomness of the sampled mini-batch.
SGD often results in more oscillations in the training error, but it can converge faster per epoch and is far cheaper per update on large datasets.
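A minimal sketch of the two update rules on a linear model with squared-error loss (the toy data, batch size, and learning rate below are arbitrary choices for illustration):

    import numpy as np

    X = np.random.randn(1000, 5)                  # toy feature matrix
    y = X @ np.array([1.0, 2.0, 0.0, -1.0, 3.0])  # toy targets
    w, lr = np.zeros(5), 0.01

    # Gradient Descent: one update uses the gradient over the WHOLE dataset
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= lr * grad

    # Stochastic (mini-batch) Gradient Descent: one update uses a random mini-batch
    idx = np.random.choice(len(X), size=32, replace=False)
    Xb, yb = X[idx], y[idx]
    w -= lr * (2 * Xb.T @ (Xb @ w - yb) / len(Xb))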
Q94 . What is the difference between the k-means and k-means++ algorithms?
The only difference between the two is in the way the centroids are initialized. In the k-means algorithm, the centroids are initialized randomly from the given points. This has a drawback: sometimes the random initialization leads to non-optimized clusters, for example when the initial centroids happen to lie close to each other.
To overcome this problem, the k-means++ algorithm was introduced. In k-means++, the first centroid is selected randomly from the data points, and each subsequent centroid is chosen with probability proportional to its squared distance from the nearest centroid already selected, so that the initial centroids are spread out.
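A simplified sketch of the k-means++ seeding idea (real implementations, such as scikit-learn's KMeans with init='k-means++', add further refinements; the function below is illustrative only):

    import numpy as np

    def kmeans_pp_init(X, k, seed=0):
        rng = np.random.default_rng(seed)
        # First centroid: a uniformly random data point
        centroids = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Squared distance of every point to its nearest chosen centroid
            d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
            # Next centroid: sampled with probability proportional to that distance
            centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centroids)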
Q95. Which one, decision tree or random forest, is more robust to outliers?
Random Forest is generally more robust to outliers compared to a single Decision Tree.
Aggregation of Predictions: Random Forest aggregates predictions from multiple decision trees.
Outliers may have less influence on the final prediction since they are likely to be outliers in only a
subset of the trees. This averaging effect helps to reduce the impact of outliers on the overall
prediction.
Subsampling: Random Forest typically uses bootstrapping (sampling with replacement) to create
multiple subsets of the data for each tree. This means that outliers may not always be present in the sample used to train a given tree.
Feature Randomization: Random Forest also randomly selects a subset of features at each split. This
can help mitigate the effect of outliers in any single feature, as the model may rely on other features that are less affected by those outliers.
Q96. How to handle imbalanced data?
Resampling Techniques:
Oversampling: Increase the number of instances in the minority class by generating synthetic samples or duplicating existing ones.
Undersampling: Reduce the number of instances in the majority class by randomly selecting a
subset of samples.
SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic samples for the minority class by interpolating between existing minority-class samples.
The Synthetic Minority Oversampling Technique (SMOTE) is employed to address data imbalance in
datasets. This method synthesizes new data points for minority classes by linearly interpolating
existing ones. Its advantage lies in training the model on augmented data, diversifying its exposure.
However, SMOTE may introduce undesired noise, potentially harming model performance.
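A hedged usage sketch, assuming the third-party imbalanced-learn package is installed (X and y below are placeholders for an imbalanced feature matrix and label vector):

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
    print("before:", Counter(y), "after:", Counter(y_resampled))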
Q98. Is the accuracy score always a good metric to measure the performance of a classification model?
No. When we train our model on an imbalanced dataset, the accuracy score is not a good metric to measure the performance of the model. In such cases, we use precision and recall to measure the model's performance.
Q99.What is KNN Imputer?
KNN Imputer is a sophisticated method for filling null values in data. It utilizes a distance metric together with a k (number of neighbours) parameter, similar to the k-nearest-neighbours algorithm. Instead of using descriptive
statistical measures like mean, mode, or median, KNN Imputer imputes missing values based on the
neighborhood points of the missing values. This approach offers a more dynamic and context-aware imputation than a single global statistic.
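A small sketch with scikit-learn's KNNImputer (the array values are made up for illustration):

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [8.0, np.nan]])
    imputer = KNNImputer(n_neighbors=2)      # impute from the 2 nearest rows
    X_filled = imputer.fit_transform(X)      # NaNs replaced by neighbour averages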
Q100 .What is the purpose of splitting a given dataset into training and validation data?
Training Data: The training data is used to train the model's parameters or learn patterns from the
data. The model learns the relationship between input features and the target variable by minimizing
a loss function.
Validation Data: The validation data, also known as the holdout set, is used to evaluate the model's
performance during training. It serves as an independent dataset to assess how well the model
generalizes to unseen data. By evaluating the model on the validation set periodically during training,
we can monitor its performance and make adjustments to hyperparameters or detect issues like
overfitting.
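One possible way to carve out training, validation, and test sets with scikit-learn (the 70:15:15 ratio and the placeholder X, y are assumptions, not a fixed rule):

    from sklearn.model_selection import train_test_split

    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)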
One effective method is using the t-SNE algorithm, short for t-Distributed Stochastic Neighbor Embedding, which maps high-dimensional data to two or three dimensions for visualization. Alternatively, we can utilize PCA or LDA to convert n-dimensional data to 2D for visualization. However, while PCA aims to preserve dataset variance, t-SNE focuses on retaining local similarities in the dataset.
Q102 . Why do we face the curse of dimensionality?
The curse of dimensionality refers to the challenges and issues that arise when working with high-dimensional data. As the number of dimensions increases, the volume of the space grows exponentially, so the available data becomes sparse. This sparsity makes it harder to model high-dimensional data, impacting the performance and reliability of machine learning algorithms.
Identify and Remove Redundant Features: Use techniques such as correlation analysis or variance
inflation factor (VIF) to identify highly correlated features. Remove redundant features to reduce
multicollinearity.
Feature Selection: Select a subset of features that are most relevant to the target variable. Consider
techniques like recursive feature elimination, feature importance from tree-based models, or L1
regularization.
Dimensionality Reduction: Use techniques such as Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) to reduce the number of features while retaining most of the information.
Feature Engineering: Create new features that capture the essence of the correlated features or
transform existing features to reduce correlation. For example, you can create interaction terms or
polynomial features.
Q104 . What is multicollinearity and how will you handle it in your regression model?
Multicollinearity refers to the presence of high correlations between predictor variables (also known
as independent variables) in a regression model. It can cause issues such as unstable coefficient
estimates and inflated standard errors, making the model's interpretation difficult.
Identify Multicollinearity: Use techniques such as correlation analysis, variance inflation factor (VIF), or condition indices to detect highly correlated predictors.
Remove Redundant Variables: Remove one or more of the highly correlated variables from the
model. Choose which variables to remove based on domain knowledge, importance to the model, or
statistical significance.
Combine Variables: Create new composite variables by combining highly correlated variables. For
example, you can calculate averages or sums of variables or create interaction terms.
Use Regularization: Apply regularized regression such as ridge (L2) or lasso (L1) regression. These methods penalize large coefficients, which can help mitigate multicollinearity issues by shrinking coefficients towards zero.
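As an illustrative check (assuming statsmodels is available and df is a placeholder pandas DataFrame of predictor columns), VIF can be computed per feature; values above roughly 5 to 10 are commonly taken as a warning sign:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    vif = pd.Series(
        [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
        index=df.columns,
    )
    print(vif.sort_values(ascending=False))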
Q105 .What does the area under the ROC curve indicate?
ROC stands for Receiver Operating Characteristic. It measures the usefulness of a test where the
larger the area, the more useful the test. These areas are used to compare the effectiveness of the
tests. A higher AUC (area under the curve) generally indicates that the model is better at
distinguishing between the positive and negative classes. AUC values range from 0 to 1, with a value
of 0.5 indicating that the model is no better than random guessing, and a value of 1 indicating perfect
classification.
Sensitivity (Recall): It is calculated as the ratio of true positives (TP) to the sum of true positives and false negatives (FN). Sensitivity indicates how well the model can identify positive instances out of the total number of actual positive instances.
Specificity: It is calculated as the ratio of true negatives (TN) to the sum of true negatives and false positives (FP).
Specificity indicates how well the model can identify negative instances from the total number of
actual negative instances.
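In formula form:

\[
\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}
\]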
Outliers are errors or extreme values that differ markedly from the rest of the data in a set. They can impact the accuracy of your results if not detected. Some popular tools to discover outliers are box plots, Z-scores, scatter plots, and the interquartile range (IQR) method.
Kernel SVM stands for Kernel Support Vector Machine. In SVM, a kernel is a function that aids in problem-solving: kernels provide shortcuts that let you avoid doing the complicated explicit math of mapping points into a higher-dimensional space. The useful thing about a kernel is that it allows us to work in higher dimensions while keeping the computations efficient. Kernel SVMs can work with a variety of kernel functions, including linear, polynomial, and radial basis function (RBF) kernels.
Q109 . What is content-based filtering and collaborative filtering?
Content-based filtering recommends items to a user based on that user's own past behavior, preferences, and interests. For instance, if a user has shown interest in action movies, the content-based filtering algorithm will recommend more action movies to that user.
Collaborative filtering recommends items to a user based on the preferences of other users
who have similar tastes. For example, if a user has watched several action movies, the
collaborative filtering algorithm will recommend other action movies that were also watched by users with similar viewing histories.
Overfitting happens when a model captures noise or random fluctuations in the data instead of the underlying patterns. Common causes include:
Complex Models: Models with excessive complexity relative to the data can learn noise,
causing overfitting.
Small Datasets: Insufficient data can lead the model to memorize examples rather than learn generalizable patterns.
High Dimensionality: With many features, models may find spurious correlations that don't generalize, leading to overfitting.
Data Noise: Noise or outliers in the data can mislead the model into fitting irregularities.
Data Leakage: When test set information influences training, the model learns spurious patterns that do not hold on truly unseen data.
Parametric models are models with a fixed, limited number of parameters. For parametric models, only the model's parameters need to be known in order to make predictions on new data.
Non-parametric models do not place a restriction on the number of parameters, which makes predictions on new data more flexible. For non-parametric models, both the model's parameters and the state of the observed data need to be known to make predictions.
Q112. What is Entropy in Machine Learning?
Entropy in Machine Learning measures the randomness in the data that needs to be
processed. The more entropy in the given data, the more difficult it becomes to draw any
useful conclusion from the data. For example, let us take the flipping of a coin. The result of
this act is random as it does not favor heads or tails. Here, the result for any number of tosses
cannot be predicted easily, as there is no definite relationship between the action of flipping the coin and its outcome.
Q113 . When to use mean and when to use median to handle a missing numeric value?
We choose the mean to impute missing values when the data distribution is normal and there
are no significant outliers, as the mean is sensitive to both. In contrast, we use the median in
cases of skewed distributions or when outliers are present, because the median is more
robust to these factors and provides a better central tendency measure under these
conditions.
Q114 . In Machine Learning, for how many classes can Logistic Regression be used?
By default, logistic regression is a binary classifier, so it handles two classes. However, when a multi-class classification problem needs to be solved, it can be extended to more classes, for example via multinomial (softmax) logistic regression or one-vs-rest schemes.
Linear Kernel: Suitable for linearly separable data. It computes the dot product between the input feature vectors.
Polynomial Kernel: Maps the data into higher-dimensional space using polynomial functions. It
is effective for non-linear data and has parameters like degree and coefficient.
Radial Basis Function (RBF) Kernel: Measures the similarity between two data points based on
their distance. It is versatile and widely used for various types of data, but it requires tuning of the gamma parameter.
Sigmoid Kernel: Applies a hyperbolic tangent function to the dot product of the input features.
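In scikit-learn these kernels map directly onto the SVC constructor (a hedged sketch; the parameter values are arbitrary):

    from sklearn.svm import SVC

    svc_linear = SVC(kernel="linear")
    svc_poly = SVC(kernel="poly", degree=3, coef0=1.0)   # degree and coefficient of the polynomial
    svc_rbf = SVC(kernel="rbf", gamma="scale")           # gamma controls the kernel width
    svc_sigmoid = SVC(kernel="sigmoid")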
MinMaxScaler: Rescales each feature to a fixed range, typically [0, 1].
Useful for models sensitive to feature magnitudes, but can be influenced by outliers.
Preserves the original shape of the distribution, only changing the range of values.
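A minimal scikit-learn sketch (the sample array is made up):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
    X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)  # each column mapped to [0, 1]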
Q117. How does one-hot encoding increase the dimensionality of a dataset compared to label
encoding?
One-hot encoding creates a new binary variable (dummy variable) for each category in a categorical variable. For example, if a variable "Color" has three categories: "Red," "Green," and "Blue," one-hot encoding would create three new binary columns, one per category. In contrast, label encoding assigns a unique integer value to each category without creating separate binary variables. For example, "Red" might be encoded as 0, "Green" as 1, and "Blue" as 2. Since label encoding does not create additional binary variables, it does not increase the dimensionality of the dataset.
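A small pandas illustration of the dimensionality difference (the column name and values are made up):

    import pandas as pd

    df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})
    one_hot = pd.get_dummies(df["Color"], prefix="Color")   # 3 new binary columns
    labels = df["Color"].astype("category").cat.codes       # a single integer column
    print(one_hot.shape[1], "one-hot columns vs 1 label-encoded column")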
One-Hot Encoding: Creates binary variables for each category, with one variable set to 1 (hot)
and the others set to 0 (cold). Each category is represented by a separate binary variable.
Label Encoding: Assigns a unique integer label to each category. Each category is mapped to
Ordinal Encoding: Similar to label encoding, but assigns integer labels based on the order or
rank of the categories. Useful for ordinal variables where there is a natural ordering among
categories.
Binary Encoding: Converts integer labels into binary digits and represents them as separate binary columns, which needs fewer columns than one-hot encoding while still preserving the categorical information.
Frequency Encoding: Replaces categories with the frequency (count) of each category in the
dataset. It captures the distribution of categories but may not work well for rare categories.
Target Encoding (Mean Encoding): Replaces categories with the mean of the target variable
for each category. It captures the relationship between the categorical variable and the target
variable but may lead to overfitting if not used carefully.
Learning Rate: Controls the step size during optimization. It determines how much the model's weights are adjusted at each update.
Number of Trees (or Estimators): In ensemble methods like random forests and gradient
boosting, this hyperparameter specifies the number of individual models (trees) to include in
the ensemble.
Depth of Trees: Determines the maximum depth allowed for decision trees. Deeper trees can
capture more complex relationships in the data but may also lead to overfitting.
Regularization Strength: Hyperparameters like alpha (L1 regularization) and lambda (L2 regularization) control the strength of regularization in models like linear and logistic regression, penalizing large coefficient values.
Kernel Parameters: In kernel methods like Support Vector Machines (SVM) and kernelized
versions of algorithms, hyperparameters like gamma control the shape and width of the kernel
function.
Batch Size: Determines the number of training examples processed in each iteration of gradient-based training.
Activation Function: In neural networks, hyperparameters like the type of activation function
(e.g., ReLU, sigmoid, tanh) and its parameters (e.g., slope in Leaky ReLU) affect the non-linear transformations the network can represent.
Number of Hidden Units/Layers: In neural networks, the number of hidden units (neurons) and hidden layers determines the capacity of the model.
Dropout Rate: Controls the proportion of neurons randomly dropped out during training to reduce overfitting.
Thresholds: In binary classification models, hyperparameters like the decision threshold affect
the classification of instances into positive or negative classes.
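A hedged sketch of where several of these hyperparameters appear in practice, using scikit-learn estimators (the chosen values are arbitrary examples, not recommendations):

    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=300, max_depth=8)   # number of trees, tree depth
    gbc = GradientBoostingClassifier(learning_rate=0.05, n_estimators=200, max_depth=3)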
Q120. If your dataset is suffering from high variance, how would you handle it?
For datasets with high variance, we could use the bagging algorithm. Bagging splits the data into subgroups by sampling with replacement from the original data. After the data is split, each subsample is used to train a model, and the individual predictions are then combined (by voting or averaging) to produce the final output, which reduces variance.
The Box-Cox transformation is a power transform that converts non-normal dependent variables into approximately normal variables, as normality is the most common assumption made while using many statistical techniques. It has a lambda parameter which, when set to 0, implies that this transform is equivalent to the log transform.
Advantages:
Interpretability: Decision trees are easy to understand and interpret, making them suitable for explaining decisions to non-technical stakeholders.
Handle Non-linear Relationships: Decision trees can capture non-linear relationships between
features and the target variable without the need for feature engineering or transformation.
Disadvantages:
Overfitting: Decision trees are prone to overfitting, especially when the tree depth is not properly constrained or when dealing with noisy data. This can lead to poor generalization on unseen data.
Instability: Decision trees are sensitive to small variations in the training data, which can result
in different trees being generated for slightly different datasets. This instability makes decision
trees less robust compared to some other machine learning algorithms.
Shapiro-Wilk W Test: This test assesses whether a sample comes from a normally distributed
population. It is based on the correlation between the data and the corresponding normal
distribution.
Anderson-Darling Test: The Anderson-Darling test evaluates whether a sample comes from a
specified distribution, such as the normal distribution. It provides a single statistic that summarizes how far the sample deviates from that distribution.
Martinez-Iglewicz Test: The Martinez-Iglewicz test is used to detect outliers and assess normality based on the modified Z-score of the data points. It is particularly useful for data that may contain outliers.
Variance Inflation Factor (VIF): VIF is the ratio of the variance of the full regression model to the variance of a model that contains only one independent variable. VIF gives an estimate of the amount of multicollinearity in a set of regression variables.
Q125 .Which machine learning algorithm is known as the lazy learner, and why is it called so?
KNN (k-Nearest Neighbours) is the Machine Learning algorithm known as a lazy learner. K-NN is called lazy because it does not learn any parameters or a discriminative function from the training data; instead it simply memorizes the training dataset and dynamically calculates distances to it every time it needs to classify a new point.
In a random forest, each tree is trained on a bootstrap sample, so some observations are never seen by a given tree, i.e., they are out of the sample for that tree. This data is referred to as out-of-bag data. In order to get an unbiased measure of the model's accuracy on test-like data, the out-of-bag error is used: each out-of-bag point is passed through the trees that did not see it during training, and the resulting predictions are aggregated and compared with the true labels.
Q127 .What is Bayes’ Theorem? State at least 1 use case with respect to the machine learning
context?
Bayes’ Theorem describes the probability of an event, based on prior knowledge of conditions
that might be related to the event. For example, if cancer is related to age, then, using Bayes’
theorem, a person’s age can be used to more accurately assess the probability that they have
cancer than can be done without the knowledge of the person’s age.
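Formally:

\[
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
\]

where, in the example above, A is the event of having cancer and B is the person's observed age group.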
The chain rule of Bayesian probability can be used to predict the likelihood of the next word in a
sentence.
Exploratory Data Analysis (EDA) helps analysts to understand the data better and forms the foundation of any machine learning workflow. Typical steps include:
Visualization
Univariate visualization
Bivariate visualization
Multivariate visualization
Outlier Detection – Use a boxplot to identify the distribution and spot outliers, then apply the IQR rule to set the limits beyond which points are capped or removed.
Scaling the Dataset – Apply MinMax, Standard Scaler or Z Score Scaling mechanism to scale
the data.
Feature Engineering – Domain and SME knowledge helps the analyst derive new fields that reveal more information about the nature of the data.
Dimensionality reduction — Helps in reducing the volume of data without losing much
information
Generative Model: A generative model learns the joint probability distribution of the features
and labels in the data. It models how the data is generated and can generate new samples
similar to the training data. Generative models can be used for tasks such as data generation, density estimation, and missing-data imputation.
Discriminative Model: A discriminative model learns the conditional probability of the labels given the features. It focuses on learning the boundaries or decision boundaries
between different classes in the data. Discriminative models directly model the decision
boundary between classes and are typically used for tasks such as classification and
regression.
Naive Bayes:
1. Work well with small dataset compared to DT which need more data
2. Lesser overfitting
Decision Trees:
1. Decision Trees are very flexible, easy to understand, and easy to debug.
3. Prone to overfitting, but you can use pruning or random forests to avoid that.
The log likelihood is a measure used to estimate how well the model fits the observed data. It quantifies the agreement between the actual outcomes (labels) and the predictions made by the model; higher values indicate a better fit.
The first reason is that XGBoost is an ensemble method that uses many trees to make a decision, so it gains power by combining the outputs of many weak learners.
SVM is a linear separator; when the data is not linearly separable, SVM needs a kernel to project the data into a space where it can separate it. Therein lies its greatest strength and weakness: by being able to project data into a high-dimensional space, SVM can find a linear separation for almost any data, but at the same time it needs to use a kernel, and we can argue that there is no single kernel that is perfect for every dataset, so choosing and tuning it adds complexity.
Q133. What are the advantages of using naive Bayes for classification?
2. If the NB conditional independence assumption holds, then it will converge quicker than discriminative models like logistic regression, so it needs less training data.
5. Highly scalable. It scales linearly with the number of predictors and data points.