
ML VIVA / ORAL QUESTIONS AND ANSWERS

ASSIGNMENT 1: Predict Uber Ride Prices

Q.1: How did you preprocess the Uber fare dataset?

Answer: I cleaned the data by handling missing values, converting categorical features into numerical values, scaling the data where necessary, and parsing the pickup date-time to extract relevant features such as day, month, and hour.
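For example, a minimal sketch of this step in Python with pandas; the column names (pickup_datetime, fare_amount, and the coordinate columns) are assumptions based on the public Uber fares dataset, not necessarily the exact assignment file:

import pandas as pd

df = pd.read_csv("uber.csv")

# Drop rows missing the target or the coordinates
df = df.dropna(subset=["fare_amount", "pickup_longitude", "pickup_latitude"])

# Parse the timestamp and derive demand-related features
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"], errors="coerce")
df["hour"] = df["pickup_datetime"].dt.hour
df["day_of_week"] = df["pickup_datetime"].dt.dayofweek
df["month"] = df["pickup_datetime"].dt.month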

Q.2: How did you identify outliers in the dataset?

Answer: I used statistical methods like the IQR (Interquartile Range) and Z-score methods to detect outliers in the continuous variables. Visualization tools like box plots also helped identify anomalies.
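A small IQR filter sketch (the 1.5 * IQR multiplier is the conventional choice; the helper name and the fare_amount column are illustrative):

import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    # Keep only rows whose value in `col` lies within [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[col] >= lower) & (df[col] <= upper)]

# Example usage: df = remove_iqr_outliers(df, "fare_amount")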

Q.3: Why did you choose Linear Regression and Random Forest models for this prediction?

Answer: Linear Regression is a simple and interpretable model, ideal for understanding relationships between variables, while Random Forest handles non-linearity better and can capture complex interactions in the data.

Q.4: How did you evaluate and compare the models?

Answer: I used metrics such as R-squared (R²), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) to assess and compare the accuracy and error of each model.
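A sketch of the comparison with scikit-learn; lin_reg, rf_reg, X_test and y_test are assumed to be the fitted models and hold-out split from earlier steps:

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

for name, model in [("Linear Regression", lin_reg), ("Random Forest", rf_reg)]:
    pred = model.predict(X_test)
    print(name,
          "R2:", r2_score(y_test, pred),
          "RMSE:", np.sqrt(mean_squared_error(y_test, pred)),
          "MAE:", mean_absolute_error(y_test, pred))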

Q.5: How did you handle feature selection and engineering to improve model accuracy?

Answer: I engineered features from the date-time data, like hour of day, day of week, and month, which could influence pricing due to demand patterns. Additionally, I explored the distance between pickup and drop-off as a key feature, using the Haversine distance for more precise measurements, which helped improve model relevance.
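A vectorized Haversine distance sketch (distance in kilometres; the coordinate column names are assumptions based on the public Uber fares dataset):

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points given in degrees
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

df["distance_km"] = haversine_km(df["pickup_latitude"], df["pickup_longitude"],
                                 df["dropoff_latitude"], df["dropoff_longitude"])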

Q.6: Why is Random Forest often preferred over Linear Regression in price prediction tasks?

Answer: Random Forest handles non-linear relationships and interactions between variables, which are common in real-world pricing data. It is also less sensitive to outliers and can better handle missing or sparse data, making it more robust in varied conditions compared to Linear Regression.
ASSIGNMENT 2: Email Spam Detection

Q.1: Why did you choose K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) for this classification?

Answer: KNN is a straightforward algorithm that gives effective results on smaller datasets, while SVM works well with high-dimensional data and can create a distinct boundary between classes, making it suitable for spam detection.

Q.2: How did you preprocess the email data for spam classification?

Answer: I tokenized the email text, removed stop words, converted text to lowercase, and used techniques like TF-IDF or Bag of Words to convert the text data into numerical features suitable for the algorithms.
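A minimal TF-IDF sketch with scikit-learn; the emails DataFrame and its text/label column names are assumptions, and lowercasing plus stop-word removal are handled by the vectorizer itself:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(emails["text"])   # sparse document-term matrix
y = emails["label"]                            # 1 = spam, 0 = not spam
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)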

Q.3: What metrics did you use to evaluate the model performance?

Answer: I used accuracy, precision, recall, and F1-score to evaluate performance. Precision and recall are particularly important for spam detection to balance false positives and false negatives.
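A sketch of computing these metrics; knn and svm are assumed to be fitted KNeighborsClassifier / SVC instances, and X_test / y_test come from the split above:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

for name, model in [("KNN", knn), ("SVM", svm)]:
    pred = model.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, pred),
          "precision:", precision_score(y_test, pred),
          "recall:", recall_score(y_test, pred),
          "F1:", f1_score(y_test, pred))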

Q.4: What challenges do you face with text data in machine learning, and how did you address them?

Answer: Text data is high-dimensional and unstructured, so preprocessing is essential. I used techniques like tokenization, stop word removal, and TF-IDF transformation to convert text into numerical vectors while retaining important information. This reduced data complexity and improved classification accuracy.

Q.5: Why is precision particularly important in spam detection models?

Answer: Precision is crucial because a high false positive rate (non-spam labeled as spam) can lead to important emails being marked as spam, which disrupts user experience. High precision ensures that flagged emails are likely to be actual spam, improving trust in the model.
ASSIGNMENT 3: Diabetes Prediction Using KNN

Q.1: What preprocessing steps were applied to the diabetes dataset?

Answer: The preprocessing involved handling missing values, scaling the data since KNN is sensitive to feature magnitudes, and converting any categorical features into numerical format if present.

Q.2: How did you determine the optimal value for K in KNN?

Answer: I used cross-validation and/or the elbow method by plotting accuracy against different K values to select the value with the best performance.
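A sketch of scanning odd K values with 5-fold cross-validation; X_scaled and y are assumed to be the scaled features and labels prepared earlier:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

scores = {}
for k in range(1, 30, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X_scaled, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "CV accuracy:", scores[best_k])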

Q.3: What does the confusion matrix tell you about model performance?

Answer: The confusion matrix provides insights into the number of true positives, false positives, true negatives, and false negatives, helping calculate precision, recall, and error rates for a deeper understanding of model accuracy.
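A small sketch of deriving those rates from the matrix; y_test and y_pred are assumed to come from the fitted KNN model:

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
error_rate = (fp + fn) / (tp + tn + fp + fn)
print(f"precision={precision:.3f} recall={recall:.3f} error_rate={error_rate:.3f}")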

Q.4: Why is feature scaling important in K-Nearest Neighbors (KNN), and how did you implement it?

Answer: KNN is distance-based, meaning feature scales can heavily influence distance calculations. Without scaling, features with larger ranges dominate the distance metric, skewing the results. I applied MinMax scaling to normalize feature values between 0 and 1, ensuring fair contribution from all features.
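A MinMax scaling sketch, fitted on the training split only so test-set statistics do not leak; X_train and X_test are assumed from an earlier split:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                        # maps each feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)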

Q.5: What insights do you gain from the confusion matrix beyond overall accuracy?

Answer: The confusion matrix allows us to see the distribution of true positives, false positives, true negatives, and false negatives, giving a clear picture of errors. From this, I calculated precision, recall, and the error rate, which are critical for understanding the model’s reliability, especially in a healthcare context where misclassifications can have serious implications.
ASSIGNMENT 4: Sales Data Clustering

Q.1: What’s the purpose of clustering in this context?

Answer: Clustering helps group sales data into segments based on similar patterns, which can assist in identifying customer types, product preferences, or regional trends, aiding in targeted marketing.

Q.2: How did you decide on the number of clusters?

Answer: I used the elbow method by plotting the Within-Cluster Sum of Squared Errors (WCSS) against the number of clusters and chose the point where the decrease in WCSS starts to diminish, indicating the optimal cluster count.
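An elbow-plot sketch using KMeans inertia_ as the WCSS; X is assumed to be the scaled numeric sales features:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()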

Q.3: Can you explain the difference between K-Means and hierarchical clustering?

Answer: K-Means is a partition-based method that divides the dataset into clusters by minimizing the variance within clusters, while hierarchical clustering builds a tree of clusters by either agglomerating or dividing them in a nested manner.
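A side-by-side sketch of the two approaches on the same feature matrix X (assumed scaled); the cluster count of 4 is illustrative:

from sklearn.cluster import KMeans, AgglomerativeClustering

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)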

Q.4: How does the elbow method assist in determining the optimal number of clusters?

Answer: The elbow method involves plotting the Within-Cluster Sum of Squares (WCSS) against different cluster counts. The “elbow” point, where the reduction in WCSS starts to plateau, suggests the optimal number of clusters, as it balances model simplicity against cluster separation.

Q.5: Can you explain a real-world application of K-Means clustering in sales data?

Answer: K-Means clustering can segment customers based on purchasing behavior, helping identify high-value customer groups or regional preferences. This enables businesses to tailor marketing strategies, optimize product recommendations, and improve customer targeting, thus boosting sales efficiency.
MINI PROJECT: Titanic Survival Prediction
Q.1: How did you handle missing data in the Titanic dataset?

Answer: For features like age, I used mean or median imputation, and for categorical variables like embarked, I used the mode. Missing values in fare were filled based on the median or grouped median of similar passenger classes.
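An imputation sketch; the standard Kaggle Titanic column names (Age, Embarked, Fare, Pclass) are assumed:

import pandas as pd

titanic = pd.read_csv("titanic.csv")
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
titanic["Embarked"] = titanic["Embarked"].fillna(titanic["Embarked"].mode()[0])
# Fill missing fares with the median fare of the passenger's class
titanic["Fare"] = titanic["Fare"].fillna(
    titanic.groupby("Pclass")["Fare"].transform("median"))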

Q.2: What features did you use to build the survival prediction model?

Answer: Key features included passenger age, gender, class, fare, and embarked location. These features were chosen because they likely influenced survival chances based on historical accounts of the disaster.

Q.3: Which machine learning algorithms did you use, and why?

Answer: I used models like Logistic Regression for its interpretability and Random Forest for its robustness in handling mixed data types and complex interactions, allowing for a more nuanced survival prediction.

Q.4: How did you handle categorical features like gender and socio-economic class in your Titanic dataset?

Answer: I used label encoding for gender, as it’s a binary variable, and one-hot encoding for socio-economic class to ensure the model treats each class independently without any ordinal assumption. This representation enabled the model to understand categorical influences on survival.
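An encoding sketch, continuing from the titanic DataFrame in the imputation example above (standard Kaggle column names assumed):

import pandas as pd

# Label-encode the binary Sex column, one-hot encode class and port of embarkation
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})
titanic = pd.get_dummies(titanic, columns=["Pclass", "Embarked"])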

Q.5: Which performance metric is most relevant for this classification task, and why?

Answer: Accuracy is useful, but due to the importance of minimizing false negatives (misclassifying survivors as non-survivors), I focused on recall for the survival class. High recall ensures that most actual survivors are identified, which is crucial in scenarios where it is essential to avoid missing positive cases.
