Lab_questionbank
*************************************************************************************
Group B
12. Visualize outliers in the 'GrLivArea' column of the 'housing.csv' dataset using a box plot.
Analysis Part: After plotting, discuss any visible outliers and their potential impact on model performance.
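As a starting point for question 12, the sketch below flags the same points a box plot would draw beyond its whiskers, using the usual 1.5 x IQR rule. The small DataFrame is an invented stand-in for 'housing.csv', and the helper names iqr_outliers and plot_box are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for pd.read_csv("housing.csv"); the values are invented.
df = pd.DataFrame(
    {"GrLivArea": [1000, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 4500, 5600]}
)


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


outliers = iqr_outliers(df["GrLivArea"])
print(outliers.tolist())


def plot_box(frame, column):
    """Draw a box plot; matplotlib is imported lazily so the rest runs without it."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.boxplot(frame[column].dropna().to_numpy())
    ax.set_ylabel(column)
    ax.set_title(f"Box plot of {column}")
    plt.show()
```

Calling plot_box(df, "GrLivArea") draws the plot. Extreme values like these can pull a linear model's fit toward a few unusual rows, which is the kind of impact the analysis part asks about.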
13. Split the 'diabetes.csv' dataset from Kaggle into train and test sets.
Analysis Part: Use train_test_split to create training and testing sets, and display their shapes.
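A minimal sketch for question 13, assuming the target column is called 'Outcome' (as in the common Kaggle version of the dataset). The ten rows below are invented so the example runs without the CSV file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for pd.read_csv("diabetes.csv"); names and values are assumptions.
df = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
        "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 32.0],
        "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
    }
)

X = df.drop(columns="Outcome")  # features
y = df["Outcome"]               # target

# Hold out 20% of the rows; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With ten rows and test_size=0.2, eight rows go to training and two to testing. On the real file the same call applies after replacing the synthetic frame with pd.read_csv.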
14. Normalize the 'BMI' and 'Glucose' columns of the 'diabetes.csv' dataset.
Analysis Part: Apply MinMaxScaler and visualize the normalized data with a scatter plot.
15. Handle missing values in the 'Pregnancies' column of the 'diabetes.csv' dataset.
Analysis Part: Use df['Pregnancies'].fillna(0) to fill missing values and visualize the distribution with a bar plot.
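Questions 14 and 15 can be sketched together. The frame below is an invented stand-in for 'diabetes.csv', and plot_scaled is a hypothetical helper. Note that fillna returns a new Series, so the result has to be assigned back to the column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for pd.read_csv("diabetes.csv"); the values are invented.
df = pd.DataFrame(
    {
        "Pregnancies": [6, np.nan, 8, 1, np.nan],
        "Glucose": [148, 85, 183, 89, 137],
        "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    }
)

# Question 15: fillna does not modify the column in place, so assign the result back.
df["Pregnancies"] = df["Pregnancies"].fillna(0)
pregnancy_counts = df["Pregnancies"].value_counts().sort_index()  # input for a bar plot

# Question 14: rescale each column to the [0, 1] range.
cols = ["BMI", "Glucose"]
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df[cols]), columns=cols, index=df.index
)
print(scaled)


def plot_scaled(frame):
    """Scatter plot of the two scaled columns (matplotlib imported lazily)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(frame["BMI"], frame["Glucose"])
    ax.set_xlabel("BMI (scaled)")
    ax.set_ylabel("Glucose (scaled)")
    plt.show()
```

pregnancy_counts.plot(kind="bar") would give the bar plot for question 15. Whether 0 is a sensible fill value depends on what a missing entry means in the data.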
16. Detect missing values in the 'abalone.csv' dataset from UCI Machine Learning Repository.
Analysis Part: Use df.isnull().sum() to check for missing values and create a heatmap visualization.
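For question 16, a small sketch with an invented frame in place of 'abalone.csv' (the column names and the injected gaps are assumptions). The heatmap uses plain matplotlib here; seaborn's heatmap would be an equally reasonable choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pd.read_csv("abalone.csv"); gaps are injected for illustration.
df = pd.DataFrame(
    {
        "Sex": ["M", "F", None, "I", "M"],
        "Length": [0.455, np.nan, 0.530, 0.440, np.nan],
        "Rings": [15, 7, 9, 10, 7],
    }
)

# Number of missing entries per column.
missing_counts = df.isnull().sum()
print(missing_counts)


def plot_missing_heatmap(frame):
    """Show the missing-value mask as an image: one cell per (row, column)."""
    import matplotlib.pyplot as plt

    mask = frame.isnull().to_numpy().astype(int)
    fig, ax = plt.subplots()
    ax.imshow(mask, aspect="auto", cmap="gray_r", interpolation="nearest")
    ax.set_xticks(range(frame.shape[1]))
    ax.set_xticklabels(frame.columns)
    ax.set_ylabel("Row index")
    ax.set_title("Missing values (dark = missing)")
    plt.show()
```

If every count is zero on the real file, the heatmap will be uniformly blank, which is itself a valid finding to report.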
17. Drop columns with missing values in the 'abalone.csv' dataset.
Analysis Part: Use df.dropna(axis='columns') and discuss any columns that were dropped.
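One way to see which columns question 17 removes is to compare the column lists before and after the call. The frame is again an invented stand-in for 'abalone.csv'.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pd.read_csv("abalone.csv"); gaps are injected for illustration.
df = pd.DataFrame(
    {
        "Sex": ["M", "F", None, "I", "M"],
        "Length": [0.455, np.nan, 0.530, 0.440, np.nan],
        "Rings": [15, 7, 9, 10, 7],
    }
)

# dropna returns a new frame; any column with at least one missing value is removed.
cleaned = df.dropna(axis="columns")

# Columns present before but not after the call.
dropped = [col for col in df.columns if col not in cleaned.columns]
print("Dropped:", dropped)
print("Remaining:", list(cleaned.columns))
```

Because a single missing entry is enough to drop a whole column, this approach can discard a lot of usable data, which is worth raising in the discussion.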
18. Use one-hot encoding on the 'Sex' column of the 'abalone.csv' dataset.
Analysis Part: Apply one-hot encoding and visualize the distribution of each sex category.
19. Perform label encoding on the 'diagnosis' column of the 'cancer.csv' dataset from UCI Machine Learning Repository.
Analysis Part: Use LabelEncoder to encode the 'diagnosis' column and visualize the class distribution.
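Questions 18 and 19 use two different encodings, sketched side by side below. Both frames are invented stand-ins, and the category values ('M', 'F', 'I' for Sex; 'M', 'B' for diagnosis) are assumptions about the real files.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Question 18: one-hot encoding creates one 0/1 column per category.
abalone = pd.DataFrame({"Sex": ["M", "F", "I", "M", "F", "M"]})  # invented rows
one_hot = pd.get_dummies(abalone["Sex"], prefix="Sex", dtype=int)
sex_counts = one_hot.sum()  # rows per category, suitable for a bar plot
print(sex_counts)

# Question 19: label encoding maps each class to a single integer code.
cancer = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})  # invented rows
encoder = LabelEncoder()
cancer["diagnosis_encoded"] = encoder.fit_transform(cancer["diagnosis"])

# classes_ lists the original labels in the order of their integer codes.
print(list(encoder.classes_))
class_counts = cancer["diagnosis_encoded"].value_counts().sort_index()
print(class_counts)
```

LabelEncoder assigns codes in sorted label order, so here 'B' becomes 0 and 'M' becomes 1. Either count Series can be drawn with .plot(kind="bar").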
20. Identify outliers in the 'area_mean' column of the 'cancer.csv' dataset using a box plot.
Analysis Part: Plot and identify any outliers in the 'area_mean' column.
21. Normalize the 'perimeter_mean' and 'concavity_mean' columns of the 'cancer.csv' dataset.
Analysis Part: Use StandardScaler (z-score standardization) and visualize the scaled data with a scatter plot.
22. Visualize the relationship between 'age' and 'cholesterol' in the 'heart.csv' dataset from Kaggle.
Analysis Part: Use a scatter plot and discuss any visible patterns or correlations.
23. Handle missing values in the 'thalach' column of the 'heart.csv' dataset.
Analysis Part: Use df['thalach'].fillna(df['thalach'].median()) to fill missing values and visualize the result with a histogram.
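A short sketch of the median fill in question 23, using invented 'thalach' values in place of 'heart.csv'. The helper plot_hist is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pd.read_csv("heart.csv"); the values are invented.
df = pd.DataFrame({"thalach": [150, 160, np.nan, 170, 140, np.nan, 180]})

# median() skips missing values by default.
median_value = df["thalach"].median()

# Assign back, because fillna returns a new Series.
df["thalach"] = df["thalach"].fillna(median_value)
print(median_value, df["thalach"].tolist())


def plot_hist(frame, column, bins=10):
    """Histogram of one column (matplotlib imported lazily)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(frame[column], bins=bins)
    ax.set_xlabel(column)
    ax.set_ylabel("Count")
    plt.show()
```

Filling many rows with one value produces a spike at the median in the histogram, so it helps to note how many values were filled.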
24. Encode the 'gender' column in the 'adult.csv' dataset from UCI Machine Learning Repository.
Analysis Part: Use label encoding and visualize the distribution of genders.
25. Normalize the 'hours-per-week' column of the 'adult.csv' dataset.
Analysis Part: Apply MinMaxScaler and create a box plot for the normalized data.
26. Drop rows with missing values in the 'LoanAmount' column of the 'loan.csv' dataset from Kaggle.
Analysis Part: Use df.dropna(subset=['LoanAmount']) and compare the dataset size before and after.
27. Perform a scatter plot analysis between 'ApplicantIncome' and 'LoanAmount' in the 'loan.csv' dataset.
Analysis Part: Plot and discuss any visible correlations or patterns.
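Questions 26 and 27 can be approached as below. The rows are invented in place of 'loan.csv', and the Pearson coefficient is added here only as one possible numeric companion to the scatter plot; the question itself does not require it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pd.read_csv("loan.csv"); the values are invented.
df = pd.DataFrame(
    {
        "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417],
        "LoanAmount": [np.nan, 128, 66, 120, 141, np.nan],
    }
)

# Question 26: drop only the rows where LoanAmount is missing.
rows_before = len(df)
df_clean = df.dropna(subset=["LoanAmount"])
rows_after = len(df_clean)
print(f"Rows before: {rows_before}, after: {rows_after}")

# Question 27: Pearson correlation between the two columns.
r = df_clean["ApplicantIncome"].corr(df_clean["LoanAmount"])
print(f"Pearson r: {r:.3f}")


def plot_scatter(frame, x, y):
    """Scatter plot of two columns (matplotlib imported lazily)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(frame[x], frame[y])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    plt.show()
```

Calling plot_scatter(df_clean, "ApplicantIncome", "LoanAmount") draws the plot. A coefficient computed on a handful of rows says little, so on the real file the plot and the value should be read together.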
28. Identify outliers in the 'Age' column of the 'credit.csv' dataset from UCI Machine Learning Repository.
Analysis Part: Use a box plot to identify any outliers and discuss their implications.
29. Use one-hot encoding for the 'Education' column in the 'credit.csv' dataset.
Analysis Part: Apply one-hot encoding and visualize the new column distribution.
30. Handle missing values in the 'CreditAmount' column of the 'credit.csv' dataset.
Analysis Part: Use df['CreditAmount'].fillna(df['CreditAmount'].mean()) and visualize using a histogram.
31. Visualize the correlation between 'Age' and 'CreditAmount' in the 'credit.csv' dataset using a scatter plot.
Analysis Part: Plot and analyze any potential relationships.
32. Encode the 'Smoker' column in the 'insurance.csv' dataset from Kaggle using label encoding.
Analysis Part: Use label encoding and visualize the distribution of smokers.
33. Normalize the 'BMI' and 'Charges' columns of the 'insurance.csv' dataset.
Analysis Part: Use StandardScaler (z-score standardization) and create a scatter plot to visualize the scaled data.
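A sketch for question 33 with invented rows in place of 'insurance.csv'. StandardScaler rescales each column to zero mean and unit variance rather than to a fixed [0, 1] range, which is how it differs from MinMaxScaler.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for pd.read_csv("insurance.csv"); the values are invented.
df = pd.DataFrame(
    {
        "BMI": [27.9, 33.8, 33.0, 22.7, 28.9, 25.7],
        "Charges": [16884.9, 1725.6, 4449.5, 21984.5, 3866.9, 3756.6],
    }
)

cols = ["BMI", "Charges"]

# Subtract each column's mean and divide by its (population) standard deviation.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[cols]), columns=cols, index=df.index
)

print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))  # ddof=0 matches the scaler's definition
```

The scaled columns can be passed to a scatter plot directly, for example scaled.plot.scatter(x="BMI", y="Charges"). The shape of the point cloud is unchanged; only the axis units differ.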
34. Drop columns with missing values in the 'cars.csv' dataset from UCI Machine Learning Repository.
Analysis Part: Use df.dropna(axis='columns') and discuss which columns were dropped.
35. Identify and handle outliers in the 'Horsepower' column of the 'cars.csv' dataset.
Analysis Part: Use a box plot to detect outliers and discuss strategies for handling them (e.g., capping, removing).
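The two handling strategies named in question 35 can be compared as follows. The values are invented in place of 'cars.csv', and the 1.5 x IQR bounds are an assumption chosen to match the default box plot whiskers.

```python
import pandas as pd

# Synthetic stand-in for pd.read_csv("cars.csv"); the values are invented.
df = pd.DataFrame({"Horsepower": [70, 85, 90, 95, 100, 105, 110, 120, 300]})

# Bounds that match the default whiskers of a box plot.
q1 = df["Horsepower"].quantile(0.25)
q3 = df["Horsepower"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (df["Horsepower"] < lower) | (df["Horsepower"] > upper)
print("Outliers:", df.loc[is_outlier, "Horsepower"].tolist())

# Strategy 1, capping: pull extreme values back to the nearest bound, keeping every row.
capped = df["Horsepower"].clip(lower=lower, upper=upper)

# Strategy 2, removing: keep only the rows inside the bounds.
removed = df.loc[~is_outlier]

print("Max after capping:", capped.max())
print("Rows after removing:", len(removed))
```

Capping keeps the sample size but distorts the extreme values, while removing keeps the remaining values exact at the cost of fewer rows. Which trade-off is acceptable depends on whether the outliers are errors or genuine observations.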