ML Viva and Oral Question and Answers
Q.3: Why did you choose Linear Regression and Random Forest models for
this prediction?
Answer: Linear Regression is a simple and interpretable model, ideal for understanding relationships between variables, while Random Forest handles non-linearity better and can capture complex interactions in the data.
Q.5: How did you handle feature selection and engineering to improve model
accuracy?
Answer: I engineered features from date-time data, like hour of day, day
of week, and month, which could influence pricing due to demand
patterns. Additionally, I explored the distance between pickup and
drop-off as a key feature, using Haversine distance for more precise
measurements, which helped improve model relevance.
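The date-time features and Haversine distance described above can be sketched as follows. This is a minimal illustration, not the assignment's actual notebook; the column names (pickup_datetime, pickup_latitude, etc.) follow the common Uber-fares dataset layout and are assumptions here.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two lat/lon points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# One illustrative ride (made-up values)
df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2015-05-07 19:52:06"]),
    "pickup_latitude": [40.7614], "pickup_longitude": [-73.9798],
    "dropoff_latitude": [40.6513], "dropoff_longitude": [-73.9496],
})

# Date-time features that capture demand patterns
df["hour"] = df["pickup_datetime"].dt.hour
df["day_of_week"] = df["pickup_datetime"].dt.dayofweek
df["month"] = df["pickup_datetime"].dt.month

# Trip distance as a key predictor of fare
df["distance_km"] = haversine(df["pickup_latitude"], df["pickup_longitude"],
                              df["dropoff_latitude"], df["dropoff_longitude"])
```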
Q.6: Why is Random Forest often preferred over Linear Regression in price
prediction tasks?
Answer: Random Forest handles non-linear relationships and
interactions between variables, which are common in real-world pricing
data. It’s also less sensitive to outliers and can better handle missing or
sparse data, making it more robust in varied conditions compared to
Linear Regression.
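This difference can be demonstrated with a small sketch on synthetic data (not the assignment's dataset): the target has a non-linear term, which Random Forest captures and Linear Regression cannot.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
# Linear in X[:, 0] but non-linear (sinusoidal) in X[:, 1]
y = 3 * X[:, 0] + 5 * np.sin(X[:, 1]) + rng.normal(0, 0.5, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, round(r2_score(y_te, model.predict(X_te)), 3))
```

On data like this, Random Forest's test R² is noticeably higher because the trees can follow the sinusoidal component.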
ASSIGNMENT 2: Email Spam Detection
Q.1: Why did you choose K-Nearest Neighbors (KNN) and Support Vector
Machine (SVM) for this classification?
Answer: KNN is a straightforward algorithm that performs well on smaller datasets, while SVM works well with high-dimensional data and can draw a clear boundary between classes, making it suitable for spam detection.
Q.2: How did you preprocess the email data for spam classification?
Answer: I tokenized the email text, removed stop words, converted the text to lowercase, and used techniques like TF-IDF or Bag of Words to convert the text data into numerical features suitable for the algorithms.
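A toy sketch of this pipeline: lowercasing and stop-word removal happen inside TfidfVectorizer, and the resulting vectors feed both KNN and SVM. The example messages and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

texts = ["Win a FREE prize now", "Meeting at 10am tomorrow",
         "Claim your FREE reward today", "Lunch with the project team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# lowercase=True and stop_words='english' cover two of the preprocessing steps
vec = TfidfVectorizer(lowercase=True, stop_words="english")
X = vec.fit_transform(texts)

knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
svm = LinearSVC().fit(X, labels)
print(svm.predict(vec.transform(["FREE prize waiting"])))
```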
Q.3: What metrics did you use to evaluate the model performance?
Answer: I used accuracy, precision, recall, and F1-score to evaluate performance. Precision and recall are particularly important for spam detection to balance false positives and false negatives.
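The four metrics can be illustrated on a small made-up set of labels and predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = ham
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # of predicted spam, how many really were
print("recall   :", recall_score(y_true, y_pred))     # of actual spam, how many were caught
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```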
Q.4: What challenges do you face with text data in machine learning, and
how did you address them?
Answer: Text data is high-dimensional and unstructured, so
preprocessing is essential. I used techniques like tokenization, stop
word removal, and TF-IDF transformation to convert text into numerical
vectors while retaining important information. This reduced data
complexity and improved classification accuracy.
Q.2: How did you determine the optimal value for K in KNN?
Answer: I used cross-validation, plotting mean accuracy against a range of K values (an elbow-style plot) and selecting the K with the best performance.
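A sketch of that selection loop on a synthetic dataset (in practice the assignment's own data would be used):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = {}
for k in range(1, 16, 2):  # odd K values avoid ties in binary voting
    scores[k] = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best K:", best_k)
```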
Q.3: What does the confusion matrix tell you about model performance?
Answer: The confusion matrix provides insights into the number of true positives, false positives, true negatives, and false negatives, helping calculate precision, recall, and error rates for a deeper understanding of model accuracy.
Q.5: What insights do you gain from the confusion matrix beyond overall
accuracy?
Answer: The confusion matrix allows us to see the distribution of true
positives, false positives, true negatives, and false negatives, giving a
clear picture of errors. From this, I calculated precision, recall, and the
error rate, which are critical for understanding the model’s reliability,
especially in a healthcare context where misclassifications can have
serious implications.
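Deriving precision, recall, and the error rate from the confusion matrix can be sketched like this (the labels and predictions are illustrative, not the assignment's results):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp) for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)          # how many predicted positives were correct
recall = tp / (tp + fn)             # how many actual positives were found
error_rate = (fp + fn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, precision, recall, error_rate)
```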
ASSIGNMENT 4: Sales Data Clustering
Q.3: Can you explain the difference between K-Means and hierarchical
clustering?
Answer: K-Means is a partition-based method that divides the dataset into clusters by minimizing the variance within clusters, while hierarchical clustering builds a tree of clusters by either agglomerating or dividing them in a nested manner.
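A minimal side-by-side of the two approaches on toy blob data (AgglomerativeClustering is scikit-learn's bottom-up hierarchical method):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Partition-based: iteratively minimizes within-cluster variance
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative): merges the closest clusters bottom-up
hc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
```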
Q.4: How does the elbow method assist in determining the optimal number of
clusters?
Answer: The elbow method involves plotting the Within-Cluster-Sum of
Squares (WCSS) against different cluster counts. The “elbow” point,
where WCSS reduction starts to plateau, suggests the optimal number
of clusters as it balances between model simplicity and cluster
separation effectiveness.
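The elbow curve described above can be sketched as follows; KMeans exposes WCSS as its `inertia_` attribute. The blob data stands in for the sales dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

# WCSS (inertia) for k = 1..8
wcss = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
        for k in range(1, 9)]

# Plot k against wcss (e.g. with matplotlib) and look for the bend:
# WCSS always decreases as k grows, but the drop flattens past the true k.
print(wcss)
```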
Q.2: What features did you use to build the survival prediction model?
Answer: Key features included passenger age, gender, class, fare, and embarked location. These features were chosen because they likely influenced survival chances based on historical accounts of the disaster.
Q.3: Which machine learning algorithms did you use, and why?
Answer: I used models like Logistic Regression for its interpretability and Random Forest for its robustness in handling mixed data types and complex interactions, allowing for a more nuanced survival prediction.
Q.4: How did you handle categorical features like gender and socio-
economic class in your Titanic dataset?
Answer: I used label encoding for gender, as it’s a binary variable, and
one-hot encoding for socio-economic class to ensure the model treats
each class independently without any ordinal assumption. This
representation enabled the model to understand categorical influences
on survival.
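A sketch of the two encodings; the column names (Sex, Pclass) follow the Kaggle Titanic dataset and the rows are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female", "male"],
                   "Pclass": [3, 1, 2, 3]})

# Label encoding: binary variable maps to 0/1
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# One-hot encoding: one indicator column per class, no ordinal assumption
df = pd.get_dummies(df, columns=["Pclass"], prefix="Pclass")
print(df.columns.tolist())
```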
Q.5: Which performance metric is most relevant for this classification task,
and why?
Answer: Accuracy is useful, but due to the importance of minimizing
false negatives (misclassifying survivors as non-survivors), I focused on
recall for the survival class. High recall ensures that most actual
survivors are identified, which is crucial in scenarios where it’s
essential to avoid missing positive cases.