Lec ExploratoryDataAnalysis1Unit5Part1
October 3, 2023
df = pd.DataFrame(data)
# Replacing Values
# Replace 'Female' with 'F' and 'Male' with 'M' in the 'Gender' column
df['Gender'] = df['Gender'].replace({'Female': 'F', 'Male': 'M'})
print("\nDataFrame after replacing values in 'Gender' column:")
print(df)
# Functions in pandas
# Calculate the mean salary
mean_salary = df['Salary'].mean()
print("\nMean Salary:", mean_salary)
Dataset
      Name   Age  Gender   Salary
0    Alice  25.0  Female  50000.0
1      Bob  30.0    Male  60000.0
2  Charlie   NaN    Male  45000.0
3    David  35.0    Male  70000.0
4      Eva  28.0  Female      NaN
5    Alice  25.0  Female  55000.0
Missing values in the DataFrame:
    Name    Age  Gender  Salary
0  False  False   False   False
1  False  False   False   False
2  False   True   False   False
3  False  False   False   False
4  False  False   False    True
5  False  False   False   False
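The cell that builds `data` is not shown above. As a minimal sketch, assuming the dictionary below (reconstructed from the printed Dataset table, so its exact form is a guess), the missing-value mask, the replacement, and the mean can be reproduced end to end:

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of `data` from the printed Dataset table
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva", "Alice"],
    "Age": [25, 30, np.nan, 35, 28, 25],
    "Gender": ["Female", "Male", "Male", "Male", "Female", "Female"],
    "Salary": [50000, 60000, 45000, 70000, np.nan, 55000],
}
df = pd.DataFrame(data)

# Boolean mask of missing cells, then a per-column count of missing values
missing_mask = df.isnull()
missing_counts = missing_mask.sum()

# Map the long gender labels to single letters
df["Gender"] = df["Gender"].replace({"Female": "F", "Male": "M"})

# mean() skips NaN by default, so the missing salary is ignored
mean_salary = df["Salary"].mean()

print(missing_counts)
print("Mean Salary:", mean_salary)
```

Because `mean()` ignores missing values, the average is taken over the five known salaries rather than all six rows.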
[6]: df=pd.read_csv("data.csv")
[7]: df.head()
                         Market Category  Vehicle Size  Vehicle Style  \
0  Factory Tuner,Luxury,High-Performance       Compact          Coupe
1                     Luxury,Performance       Compact    Convertible
2                Luxury,High-Performance       Compact          Coupe
3                     Luxury,Performance       Compact          Coupe
4                                 Luxury       Compact    Convertible
[9]: df.columns.tolist()
[9]: ['Make',
'Model',
'Year',
'Engine Fuel Type',
'Engine HP',
'Engine Cylinders',
'Transmission Type',
'Driven_Wheels',
'Number of Doors',
'Market Category',
'Vehicle Size',
'Vehicle Style',
'highway MPG',
'city mpg',
'Popularity',
'MSRP']
[10]: ['make',
'model',
'year',
'engine_fuel_type',
'engine_hp',
'engine_cylinders',
'transmission_type',
'driven_wheels',
'number_of_doors',
'market_category',
'vehicle_size',
'vehicle_style',
'highway_mpg',
'city_mpg',
'popularity',
'msrp']
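The input of the cell that produced the lowercase, underscore-separated names is not shown. One common way to get this result, offered here as an assumption rather than the notebook's actual code, is to chain pandas string methods on `df.columns`:

```python
import pandas as pd

# Empty toy frame carrying a few of the original column names
df = pd.DataFrame(columns=["Make", "Engine Fuel Type", "Driven_Wheels", "highway MPG"])

# Lowercase every name, then replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")

print(df.columns.tolist())
```

Consistent names make later column access less error-prone, since `df['city_mpg']` and `df['highway_mpg']` then follow one pattern.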
[11]: ['make',
'model',
'year',
'engine_fuel_type',
'engine_hp',
'engine_cylinders',
'transmission_type',
'driven_wheels',
'number_of_doors',
'market_category',
'vehicle_size',
'vehicle_style',
'highway_mpg',
'city_mpg',
'popularity',
'price']
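The only change between this list and the previous one is that `msrp` has become `price`. The renaming cell is not shown; a minimal sketch, assuming `DataFrame.rename` was used, looks like this:

```python
import pandas as pd

# Toy frame with the old column name
df = pd.DataFrame({"make": ["bmw"], "msrp": [46135]})

# rename returns a new frame; columns not listed in the mapping are untouched
df = df.rename(columns={"msrp": "price"})

print(df.columns.tolist())
```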
[12]: ['make',
'model',
'year',
'engine_fuel_type',
'engine_hp',
'engine_cylinders',
'transmission_type',
'driven_wheels',
'number_of_doors',
'market_category',
'vehicle_size',
'vehicle_style',
'highway_mpg',
'city_mpg',
'popularity',
'price']
[13]: df.head()
[18]: sns.distplot(df['price']);
C:\Users\agarw\anaconda3\lib\site-packages\seaborn\distributions.py:2619:
FutureWarning: `distplot` is a deprecated function and will be removed in a
future version. Please adapt your code to use either `displot` (a figure-level
function with similar flexibility) or `histplot` (an axes-level function for
histograms).
warnings.warn(msg, FutureWarning)
[19]: sns.distplot(df['price'] , fit=norm);
[25]: sns.distplot(df['price'], fit=norm);
# Fit a normal distribution to the prices and report its parameters
(mu, sigma) = norm.fit(df['price'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
# Check the skewness and kurtosis of the data
print("Skewness: %f" % df['price'].skew())
print("Kurtosis: %f" % df['price'].kurt())
plt.legend(['Normal dist. (mu = {:.2f}, sigma = {:.2f})'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('Price distribution')
Skewness: 11.771987
Kurtosis: 268.926276
Kurtosis
In probability theory and statistics, kurtosis is a measure of the “tailedness” of the probability
distribution of a real-valued random variable.
Like skewness, kurtosis describes a particular aspect of a probability distribution.
There are different ways to quantify kurtosis for a theoretical distribution, and there are
corresponding ways of estimating it using a sample from a population.
Different measures of kurtosis may have different interpretations.
For the price column, the skewness of 11.77 and kurtosis of 268.93 printed above indicate a strongly
right-skewed distribution with very heavy tails.
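To make the two measures concrete, here is a small sketch on hand-made samples (the numbers are illustrative, not from the car data). Note that pandas' `Series.kurt()` reports excess kurtosis under Fisher's definition, so a normal distribution scores 0 and light-tailed samples score negative.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 1, 10])

# skew() is 0 for a symmetric sample and positive when the right tail is long
print("skew (symmetric):   ", symmetric.skew())
print("skew (right-skewed):", right_skewed.skew())

# kurt() is excess kurtosis: negative here because the sample is flat
# and has no heavy tails
print("kurt (symmetric):   ", symmetric.kurt())
```

Both statistics are measured against a normal distribution, which has skewness 0 and excess kurtosis 0.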
[29]: # Set the variable and data for the scatter plot
engine_col = 'engine_hp'
engine_data = pd.concat([df['price'], df[engine_col]], axis=1)
engine_data.head()
[30]: # Create the scatter plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x=engine_data[engine_col], y=engine_data['price'])
ax.set_ylim([0, 800000])
ax.set_title("Scatter plot of engine horsepower and price")
ax.set_xlabel("Engine Horsepower (hp)")
ax.set_ylabel("Price ($)")
ax.set_xlabel('Number of Engine Cylinders')
ax.set_ylabel('Price ($)')
plt.xticks(rotation=45)
[43]: make_col = 'make'
# Find the five most frequent makes
top_makes = df[make_col].value_counts().nlargest(5).index.tolist()
print(top_makes)
# Create a new DataFrame that only includes the top makes
top_make_data = df[df[make_col].isin(top_makes)]
top_make_data.head()
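The top-N filter can be checked on a toy frame. The makes and prices below are made up for illustration, and N is reduced to 2 so the effect is visible:

```python
import pandas as pd

df = pd.DataFrame({
    "make": ["bmw", "bmw", "bmw", "audi", "audi", "fiat"],
    "price": [46135, 40650, 36350, 30000, 32000, 15000],
})

make_col = "make"

# value_counts tallies rows per make; nlargest(2) keeps the two most frequent
top_makes = df[make_col].value_counts().nlargest(2).index.tolist()

# isin builds a boolean mask; copy() gives an independent frame to modify later
top_make_data = df[df[make_col].isin(top_makes)].copy()

print(top_makes)
print(len(top_make_data))
```

Restricting a plot to the most frequent makes keeps categorical charts readable when the column has many distinct values.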
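[45]: # correlation matrix
[47]: plt.figure(figsize=(7,6))
# Correlate numeric columns only; text columns such as 'make' cannot be correlated
correlation = df.select_dtypes(include='number').corr()
sns.heatmap(correlation, annot=True)
correlation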
                  highway_mpg   city_mpg  popularity      price
year                 0.258240   0.198171    0.073049   0.227590
engine_hp           -0.406563  -0.439371    0.037501   0.662008
engine_cylinders    -0.621606  -0.600776    0.041145   0.531312
number_of_doors      0.118570   0.120881   -0.048272  -0.126635
highway_mpg          1.000000   0.886829   -0.020991  -0.160043
city_mpg             0.886829   1.000000   -0.003217  -0.157676
popularity          -0.020991  -0.003217    1.000000  -0.048476
price               -0.160043  -0.157676   -0.048476   1.000000
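In the matrix above, `engine_hp` has the strongest correlation with `price` (0.662), and `highway_mpg` and `city_mpg` are highly correlated with each other (0.887). A quick way to rank features against a target is to pull out one column of the matrix and sort it. The sketch below uses a made-up numeric frame, not the car data:

```python
import pandas as pd

# Toy frame; values are invented for illustration
df = pd.DataFrame({
    "engine_hp": [100, 150, 200, 250, 300],
    "city_mpg": [30, 27, 22, 19, 15],
    "price": [15000, 21000, 30000, 38000, 52000],
    "make": ["a", "b", "c", "d", "e"],  # non-numeric, excluded below
})

# Keep numeric columns only, then compute pairwise Pearson correlations
correlation = df.select_dtypes(include="number").corr()

# Correlation of each feature with price, strongest positive first
price_corr = correlation["price"].drop("price").sort_values(ascending=False)

print(price_corr)
```

Sorting one column of the matrix is often easier to read than the full heatmap when only a single target variable matters.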
pairplot
[48]: sns.set()
cols = ['year', 'engine_hp', 'engine_cylinders', 'number_of_doors', 'price',]
sns.pairplot(df[cols], height = 2.5)
plt.show();
2 DATA CLEANSING
[53]: # Check the percentage of missing values in each column
data_na = (df.isnull().sum() / len(df)) * 100
print(data_na)
# Next step: exclude columns with no missing values (keep only columns whose missing ratio is non-zero)
make 0.000000
model 0.000000
year 0.000000
engine_fuel_type 0.025180
engine_hp 0.579151
engine_cylinders 0.251805
transmission_type 0.000000
driven_wheels 0.000000
number_of_doors 0.050361
market_category 31.408427
vehicle_size 0.000000
vehicle_style 0.000000
highway_mpg 0.000000
city_mpg 0.000000
popularity 0.000000
price 0.000000
dtype: float64
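The comment in the cell mentions keeping only the columns that actually have missing values, but that step is not shown. A minimal sketch of it, on a toy frame with invented values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "make": ["bmw", "audi", "fiat", "kia"],
    "engine_hp": [335.0, np.nan, 160.0, 140.0],
    "market_category": ["luxury", np.nan, np.nan, "crossover"],
})

# Percentage of missing values per column
data_na = df.isnull().sum() / len(df) * 100

# Keep only columns with at least one missing value, largest share first
data_na = data_na[data_na > 0].sort_values(ascending=False)

print(data_na)
```

Applied to the output above, this filter would leave five columns, with `market_category` (about 31.4% missing) at the top.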
Drop Duplicates
[56]: print(df.shape)
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)
(11914, 16)
number of duplicate rows: (715, 16)
(11199, 16)
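The third shape, (11199, 16), equals 11914 minus the 715 duplicate rows, so the duplicates were dropped in a step whose code is not shown. A minimal sketch, assuming `drop_duplicates` with its default settings and using a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "make": ["bmw", "bmw", "audi", "bmw"],
    "model": ["1_series", "1_series", "a4", "1_series"],
    "price": [40650, 40650, 30000, 36350],
})

# duplicated() flags every repeat of an earlier, fully identical row
n_duplicates = int(df.duplicated().sum())

# drop_duplicates keeps the first occurrence and preserves the original index labels
df = df.drop_duplicates()

print(n_duplicates)
print(df.shape)
```

Because the index is preserved, gaps appear in the row labels after dropping; calling `reset_index(drop=True)` afterwards would renumber the rows if that is preferred.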
3 Deal with missing values
[60]: df.head().T
[60]:              0  \
make               bmw
model              1_series_m
year               2011
engine_fuel_type   premium_unleaded_(required)
engine_hp          335.0
engine_cylinders   6.0
transmission_type  manual
driven_wheels      rear_wheel_drive
number_of_doors    2.0
market_category    factory_tuner,luxury,high-performance
vehicle_size       compact
vehicle_style      coupe
highway_mpg        26
city_mpg           19
popularity         3916
price              46135
                   1                            2  \
make               bmw                          bmw
model              1_series                     1_series
year               2011                         2011
engine_fuel_type   premium_unleaded_(required)  premium_unleaded_(required)
engine_hp          300.0                        300.0
engine_cylinders   6.0                          6.0
transmission_type  manual                       manual
driven_wheels      rear_wheel_drive             rear_wheel_drive
number_of_doors    2.0                          2.0
market_category    luxury,performance           luxury,high-performance
vehicle_size       compact                      compact
vehicle_style      convertible                  coupe
highway_mpg        28                           28
city_mpg           19                           20
popularity         3916                         3916
price              40650                        36350
                   3                            4
make               bmw                          bmw
model              1_series                     1_series
year               2011                         2011
engine_fuel_type   premium_unleaded_(required)  premium_unleaded_(required)
engine_hp          230.0                        230.0
engine_cylinders   6.0                          6.0
transmission_type  manual                       manual
driven_wheels      rear_wheel_drive             rear_wheel_drive
number_of_doors    2.0                          2.0
market_category    luxury,performance           luxury
vehicle_size       compact                      compact
vehicle_style      coupe                        convertible
highway_mpg        28                           28
city_mpg           18                           18
popularity         3916                         3916
price              29450                        34500
[61]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11199 entries, 0 to 11913
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 make 11199 non-null object
1 model 11199 non-null object
2 year 11199 non-null int64
3 engine_fuel_type 11196 non-null object
4 engine_hp 11130 non-null float64
5 engine_cylinders 11169 non-null float64
6 transmission_type 11199 non-null object
7 driven_wheels 11199 non-null object
8 number_of_doors 11193 non-null float64
9 market_category 7823 non-null object
10 vehicle_size 11199 non-null object
11 vehicle_style 11199 non-null object
12 highway_mpg 11199 non-null int64
13 city_mpg 11199 non-null int64
14 popularity 11199 non-null int64
15 price 11199 non-null int64
dtypes: float64(3), int64(5), object(8)
memory usage: 1.5+ MB
,→None))
[64]: df['engine_fuel_type'].isnull().sum()
[64]: 0
df['number_of_doors'] = df.groupby('model')['number_of_doors'].transform(lambda x: x.fillna(x.mean()))
[67]: df['number_of_doors'].isnull().sum()
[67]: 0
df['engine_cylinders'] = df.groupby('model')['engine_cylinders'].transform(lambda x: x.fillna(x.mean()))
[71]: df['engine_cylinders'].isnull().sum()
[71]: 29
[73]: # Because the fill uses per-model group means, null values remain wherever every row of a model is missing the value.
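The reason nulls survive the group-wise fill is easy to see on a toy frame (values invented for illustration). When a model has no observed value at all, its group mean is itself NaN, so there is nothing to fill with:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "model": ["a", "a", "a", "b", "b"],
    "engine_cylinders": [4.0, np.nan, 6.0, np.nan, np.nan],
})

# Fill each gap with the mean of its own model group
df["engine_cylinders"] = df.groupby("model")["engine_cylinders"].transform(
    lambda x: x.fillna(x.mean())
)

# Model "a" is filled with (4 + 6) / 2; model "b" has no observed value,
# so its group mean is NaN and both gaps remain
remaining = int(df["engine_cylinders"].isnull().sum())

print(df)
print("Remaining nulls:", remaining)
```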
[75]: df['engine_hp'].isnull().sum()
[75]: 47
[76]: df.isnull().sum()
[76]: make 0
model 0
year 0
engine_fuel_type 0
engine_hp 47
engine_cylinders 29
transmission_type 0
driven_wheels 0
number_of_doors 0
market_category 3376
vehicle_size 0
vehicle_style 0
highway_mpg 0
city_mpg 0
popularity 0
price 0
dtype: int64
[77]: # Grouping by model cannot fill a model that has no observed value at all, so engine_hp and engine_cylinders still contain nulls.
To address this issue, we can use a rule-based method to impute the remaining missing values.
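The notes do not say which rules are meant. One possible sketch, purely as an assumption, is a chain of fallbacks from the narrowest group to the widest: the model mean first, then the median of the same make, then the global median. The frame and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "make": ["bmw", "bmw", "bmw", "tesla", "tesla"],
    "model": ["1_series", "1_series", "i3", "model_s", "model_s"],
    "engine_hp": [300.0, np.nan, np.nan, np.nan, np.nan],
})

# Rule 1: mean of the same model (the step already used above)
df["engine_hp"] = df.groupby("model")["engine_hp"].transform(
    lambda x: x.fillna(x.mean())
)

# Rule 2 (assumed): median of the same make, a broader group
df["engine_hp"] = df.groupby("make")["engine_hp"].transform(
    lambda x: x.fillna(x.median())
)

# Rule 3 (assumed): global median as the last resort
df["engine_hp"] = df["engine_hp"].fillna(df["engine_hp"].median())

print(df["engine_hp"].tolist())
print("Remaining nulls:", int(df["engine_hp"].isnull().sum()))
```

Each rule only touches values the previous rule could not fill, so well-supported estimates are never overwritten by coarser ones. Domain knowledge can replace any of these rules where a statistical fallback would be misleading.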