Assignment - CarsData - Descriptive - EDA - Munjal - Exercise - Ipynb - Colaboratory
Problem Objective :
2. Perform the necessary univariate and bivariate analysis of the features, highlighting the insights.
3. Preprocess the data: find and treat duplicates, missing values, outliers, and bad data.
Data Dictionary
Name : Name of the car which includes Brand name and Model name
Location : The location (city) in which the car is being sold or is available for purchase
Kilometers_driven : The total kilometers driven in the car by the previous owner(s) in KM.
Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
Mileage : The standard mileage offered by the car company in kmpl or km/kg
New_Price : The price of a new car of the same model in INR Lakhs.
2 marks
2 marks
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remoun
2 marks
https://colab.research.google.com/drive/1Y9QXM0We-C_J7v4Go_gGbFiDaG5eQcRW#printMode=true 1/6
21/11/2023, 00:01 Assignment_CarsData_Descriptive_EDA_Munjal_exercise.ipynb - Colaboratory
2 marks
2 marks
2 marks
4 marks
6 marks
2 marks
4 marks
2 marks
Feature Engineering
date.today().year gives the current year.
The age of a car is then the current year minus the value in the Year column of the dataset.
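The age calculation described above can be sketched as follows. The two-row frame and the column name Car_Age are hypothetical stand-ins for illustration; only the Year column and date.today().year come from the notebook text.

```python
from datetime import date

import pandas as pd

# Hypothetical sample rows; the real notebook loads the full cars dataset.
cars_dataset = pd.DataFrame({
    "Name": ["Maruti Swift VDI", "Honda City 1.5 S"],
    "Year": [2012, 2017],
})

current_year = date.today().year

# Age in years = current year minus year of manufacture.
# "Car_Age" is an assumed column name, not one defined by the assignment.
cars_dataset["Car_Age"] = current_year - cars_dataset["Year"]
print(cars_dataset)
```

Because the result depends on the day the cell is run, re-running the notebook in a later year shifts every age by the same amount.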
6 marks
4 marks
4 marks
Use the brand column that you just created with the command above.
4 marks
4 marks
Statistical summary
Give a summary (count, mean, std, min, 25%, 50%, 75% and max) of the numerical columns only, and transpose the result.
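A minimal sketch of the numerical summary, assuming a small made-up frame in place of the real data. By default describe() reports only the numeric columns when at least one is present, and .T transposes the result so that each column becomes a row.

```python
import pandas as pd

# Hypothetical sample rows standing in for the cars dataset.
cars_dataset = pd.DataFrame({
    "Year": [2012, 2017, 2015],
    "Kilometers_driven": [72000, 41000, 58000],
    "Fuel_Type": ["Diesel", "Petrol", "Petrol"],
})

# Numeric columns only; one row per column after transposing.
numeric_summary = cars_dataset.describe().T
print(numeric_summary)
```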
4 marks
Give a summary of all columns, both numerical and categorical, and transpose the result.
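For the summary over every column, one option is describe(include="all"). The sketch below uses made-up rows. Categorical columns gain unique, top and freq, while statistics that do not apply to a column are reported as NaN.

```python
import pandas as pd

# Hypothetical sample rows standing in for the cars dataset.
cars_dataset = pd.DataFrame({
    "Year": [2012, 2017, 2015],
    "Kilometers_driven": [72000, 41000, 58000],
    "Fuel_Type": ["Diesel", "Petrol", "Petrol"],
})

# include="all" covers numeric and non-numeric columns together.
full_summary = cars_dataset.describe(include="all").T
print(full_summary)
```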
4 marks
From the statistics summary, what useful insights can you derive?
#remove bhp
cars_dataset['Power'] = cars_dataset['Power'].replace(regex='bhp',value='')
cars_dataset.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-15-25600cc4d0fb> in <cell line: 3>()
1 #remove bhp
2
----> 3 cars_dataset['Power'] = cars_dataset['Power'].replace(regex='bhp',value='')
4
5 cars_dataset.head()
print("Mileage : ",cars_dataset['Mileage'].unique())
print("Engine : ",cars_dataset['Engine'].unique())
print("Power : ",cars_dataset['Power'].unique())
Remove whitespace from the Mileage, Engine and Power columns, either one column at a time or with a function that loops over them.
cols_to_remove_space =['Mileage','Engine','Power']
cars_dataset.head()
print("Mileage : ",cars_dataset['Mileage'].unique())
print("Engine : ",cars_dataset['Engine'].unique())
print("Power : ",cars_dataset['Power'].unique())
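One possible way to do the whitespace removal with a loop is sketched here. It assumes the three columns hold strings and that only leading and trailing whitespace needs to go; the sample values are invented.

```python
import pandas as pd

# Hypothetical string values with stray spaces, as might remain after
# unit suffixes have been removed.
cars_dataset = pd.DataFrame({
    "Mileage": [" 19.67 ", "13.0"],
    "Engine": ["1582 ", " 1199"],
    "Power": ["126.2 ", " null"],
})

cols_to_remove_space = ["Mileage", "Engine", "Power"]

for col in cols_to_remove_space:
    # .str.strip() trims leading/trailing whitespace and leaves NaN untouched.
    cars_dataset[col] = cars_dataset[col].str.strip()

print(cars_dataset)
```

If whitespace can also appear inside a value, .str.replace(" ", "", regex=False) would be the variant to reach for instead.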
Change 'null' to np.nan in the Power column and display the unique values of the Power column.
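A short sketch of this step on invented values, assuming the placeholder is exactly the string 'null' with no surrounding whitespace:

```python
import numpy as np
import pandas as pd

# Hypothetical Power values; 'null' marks a missing reading.
cars_dataset = pd.DataFrame({"Power": ["126.2", "null", "88.7"]})

# Swap the literal string 'null' for a real missing value.
cars_dataset["Power"] = cars_dataset["Power"].replace("null", np.nan)

print("Power : ", cars_dataset["Power"].unique())
```

With the placeholder gone, the column can be cast with astype('float'), since NaN is a valid float.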
# Cast Mileage to float (values must already be numeric strings or NaN)
cars_dataset['Mileage'] = cars_dataset['Mileage'].astype('float')
4 marks
Using the example below, fill the missing values at brand level for the Mileage and Power columns as well.
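As an illustration only, brand-level filling might look like the sketch below. It assumes a Brand column already exists, that Mileage and Power are already numeric, and that the median is the fill statistic; the assignment text does not say whether mean or median is intended, and the rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "Brand" is assumed to have been derived from Name earlier.
cars_dataset = pd.DataFrame({
    "Brand": ["Maruti", "Maruti", "Maruti", "Honda", "Honda"],
    "Mileage": [20.0, np.nan, 22.0, 17.0, np.nan],
    "Power": [74.0, 82.0, np.nan, np.nan, 117.0],
})

for col in ["Mileage", "Power"]:
    # transform("median") returns each row's brand median, aligned to the index.
    brand_median = cars_dataset.groupby("Brand")[col].transform("median")
    # Only the missing entries are overwritten.
    cars_dataset[col] = cars_dataset[col].fillna(brand_median)

print(cars_dataset)
```

A brand whose values are all missing keeps its NaN entries under this approach, so a second, dataset-wide fill may still be needed.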
cars_dataset.describe()
6 marks
Store the names of the numerical columns and of the categorical columns in two separate variables, and display both lists.
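One way to split the column names by type is select_dtypes. The variable names num_cols and cat_cols and the sample rows are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the cars dataset.
cars_dataset = pd.DataFrame({
    "Name": ["Maruti Swift VDI", "Honda City 1.5 S"],
    "Location": ["Mumbai", "Pune"],
    "Year": [2012, 2017],
    "Kilometers_driven": [72000, 41000],
    "Mileage": [19.67, 17.0],
})

# Numeric dtypes (int, float) go in one list, everything else in the other.
num_cols = cars_dataset.select_dtypes(include=np.number).columns.tolist()
cat_cols = cars_dataset.select_dtypes(exclude=np.number).columns.tolist()

print("Numerical columns  :", num_cols)
print("Categorical columns:", cat_cols)
```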
6 marks
Seaborn is a Python library built on top of Matplotlib that creates and styles statistical plots in a few lines of code.
Univariate analysis can be done for both Categorical and Numerical variables.
Categorical variables can be visualized using a Count plot, Bar Chart, Pie Plot, etc.
Skewness > 0: the distribution has a longer right tail, with most values concentrated on the left, i.e. it is right skewed
Skewness < 0: the distribution has a longer left tail, with most values concentrated on the right, i.e. it is left skewed
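The sign convention for skewness can be checked numerically. The sketch below uses an invented series with one large value on the right, then mirrors it to flip the tail.

```python
import pandas as pd

# Toy values with a single large observation, giving a long right tail.
long_right_tail = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

# Negating the values mirrors the distribution, so the long tail moves left.
long_left_tail = -long_right_tail

right_skew = long_right_tail.skew()  # positive: right skewed
left_skew = long_left_tail.skew()    # negative: left skewed

print("right-tailed skewness:", right_skew)
print("left-tailed skewness :", left_skew)
```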
Univariate analysis
6 marks
6 marks
From the univariate analysis of the numerical and categorical columns, what useful insights can you draw?
2 marks
Bivariate analysis
Bivariate analysis helps to understand how variables are related to each other, including the relationship between the dependent and independent variables.
For numerical variables, pair plots and scatter plots are widely used for bivariate analysis.
A stacked bar chart can be used for categorical variables if the output variable is categorical (a class label).
Draw a pairplot of the cars data using the seaborn library and derive insights from it.
6 marks
Bar plots can be used to show the relationship between Categorical variables and continuous variables
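The heights in such a bar plot are usually a per-category aggregate of the continuous variable, most often the mean. The sketch below computes that aggregate with pandas on invented rows; the choice of Fuel_Type and Kilometers_driven is only for illustration.

```python
import pandas as pd

# Hypothetical sample rows standing in for the cars dataset.
cars_dataset = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Petrol", "Diesel", "Petrol", "CNG"],
    "Kilometers_driven": [72000, 41000, 58000, 15000, 90000],
})

# Mean of the continuous variable within each category, largest first.
mean_km_by_fuel = (
    cars_dataset.groupby("Fuel_Type")["Kilometers_driven"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_km_by_fuel)
```

Passing the same frame to sns.barplot(data=cars_dataset, x="Fuel_Type", y="Kilometers_driven") draws these means as bars, since seaborn's default estimator for barplot is the mean.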
4 marks