Untitled 0

The document details a data analysis process using a dataset of cars, including loading the data, inspecting its structure, and performing basic data cleaning such as removing duplicates and handling missing values. The dataset contains 53 entries with various attributes related to car specifications and efficiency ratings. Visualizations are created to analyze the distribution of car makes and the correlation between different numerical variables.

Uploaded by

kishanwali29
Copyright
© All Rights Reserved

In [2]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:

df = pd.read_csv("cars.csv")
df.head(5)

Out [3]:
   YEAR  Make        Model           Size        (kW)  Unnamed: 5  TYPE  CITY (kWh/100 km)  HWY (kWh/100 km)  COMB (kWh/100 km)  CITY (Le/100 km)  HWY (Le/100 km)  COMB (Le/100 km)  (g/km)  RA
0  2012  MITSUBISHI  i-MiEV          SUBCOMPACT  49    A1          B     16.9               21.4              18.7               1.9               2.4              2.1               0       Na
1  2012  NISSAN      LEAF            MID-SIZE    80    A1          B     19.3               23.0              21.1               2.2               2.6              2.4               0       Na
2  2013  FORD        FOCUS ELECTRIC  COMPACT     107   A1          B     19.0               21.1              20.0               2.1               2.4              2.2               0       Na
3  2013  MITSUBISHI  i-MiEV          SUBCOMPACT  49    A1          B     16.9               21.4              18.7               1.9               2.4              2.1               0       Na
4  2013  NISSAN      LEAF            MID-SIZE    80    A1          B     19.3               23.0              21.1               2.2               2.6              2.4               0       Na

In [5]:

df.info()

df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 53 non-null int64
1 Make 53 non-null object
2 Model 53 non-null object
3 Size 53 non-null object
4 (kW) 53 non-null int64
5 Unnamed: 5 53 non-null object
6 TYPE 53 non-null object
7 CITY (kWh/100 km) 53 non-null float64
8 HWY (kWh/100 km) 53 non-null float64
9 COMB (kWh/100 km) 53 non-null float64
10 CITY (Le/100 km) 53 non-null float64
11 HWY (Le/100 km) 53 non-null float64
12 COMB (Le/100 km) 53 non-null float64
13 (g/km) 53 non-null int64
14 RATING 19 non-null float64
15 (km) 53 non-null int64
16 TIME (h) 53 non-null int64
dtypes: float64(7), int64(5), object(5)
memory usage: 7.2+ KB

Out [5]:
       YEAR         (kW)        CITY (kWh/100 km)  HWY (kWh/100 km)  COMB (kWh/100 km)  CITY (Le/100 km)  HWY (Le/100 km)  COMB (Le/100 km)  (g/km)  RATING  (km)        TIME (h)
count  53.000000    53.000000   53.00000           53.000000         53.000000          53.000000         53.000000        53.000000         53.0    19.0    53.000000   53.000000
mean   2014.735849  190.622642  19.64717           21.633962         20.541509          2.207547          2.422642         2.301887          0.0     10.0    239.169811  8.471698
std    1.227113     155.526429  3.00100            1.245753          1.979455           0.344656          0.143636         0.212576          0.0     0.0     141.426352  2.991036
min    2012.000000  35.000000   15.20000           18.800000         16.800000          1.700000          2.100000         1.900000          0.0     10.0    100.000000  4.000000
25%    2014.000000  80.000000   17.00000           20.800000         18.700000          1.900000          2.300000         2.100000          0.0     10.0    117.000000  7.000000
50%    2015.000000  107.000000  19.00000           21.700000         20.000000          2.100000          2.400000         2.200000          0.0     10.0    135.000000  8.000000
75%    2016.000000  283.000000  22.40000           22.500000         22.100000          2.500000          2.500000         2.500000          0.0     10.0    402.000000  12.000000
max    2016.000000  568.000000  23.90000           23.300000         23.600000          2.700000          2.600000         2.600000          0.0     10.0    473.000000  12.000000
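
Several headers in the output above carry only a unit or a placeholder: "(kW)", "Unnamed: 5", "(g/km)" and "(km)". A minimal sketch of renaming them with DataFrame.rename follows. The new names are guesses based on the units and are not confirmed anywhere in the notebook; the small frame stands in for cars.csv and reuses values from the first two rows of df.head().

```python
import pandas as pd

# Stand-in for cars.csv: same awkward headers, values taken from the first two rows of df.head().
sample = pd.DataFrame({
    "YEAR": [2012, 2012],
    "Make": ["MITSUBISHI", "NISSAN"],
    "(kW)": [49, 80],
    "Unnamed: 5": ["A1", "A1"],
    "(g/km)": [0, 0],
})

# Hypothetical names: the meanings are assumed from the units, not stated by the source.
new_names = {
    "(kW)": "MOTOR (kW)",
    "Unnamed: 5": "TRANSMISSION",
    "(g/km)": "CO2 (g/km)",
}

# rename returns a new frame; the original headers stay untouched in `sample`.
renamed = sample.rename(columns=new_names)
print(renamed.columns.tolist())
```

If the real meanings are known from the data source, only the dictionary needs to change.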

In [7]:
df.shape

df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 53 non-null int64
1 Make 53 non-null object
2 Model 53 non-null object
3 Size 53 non-null object
4 (kW) 53 non-null int64
5 Unnamed: 5 53 non-null object
6 TYPE 53 non-null object
7 CITY (kWh/100 km) 53 non-null float64
8 HWY (kWh/100 km) 53 non-null float64
9 COMB (kWh/100 km) 53 non-null float64
10 CITY (Le/100 km) 53 non-null float64
11 HWY (Le/100 km) 53 non-null float64
12 COMB (Le/100 km) 53 non-null float64
13 (g/km) 53 non-null int64
14 RATING 19 non-null float64
15 (km) 53 non-null int64
16 TIME (h) 53 non-null int64
dtypes: float64(7), int64(5), object(5)
memory usage: 7.2+ KB
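
The frame still has 53 entries after drop_duplicates(), so no row was removed. One way to confirm this before dropping is to count duplicates explicitly. The sketch below assumes a small stand-in frame built from the rows shown in df.head(); it is not the full dataset. Rows 0 and 3 describe the same model but differ in YEAR, which is why they are not full-row duplicates.

```python
import pandas as pd

# Stand-in built from the five rows printed by df.head() (a subset of the columns).
sample = pd.DataFrame({
    "YEAR": [2012, 2012, 2013, 2013, 2013],
    "Make": ["MITSUBISHI", "NISSAN", "FORD", "MITSUBISHI", "NISSAN"],
    "Model": ["i-MiEV", "LEAF", "FOCUS ELECTRIC", "i-MiEV", "LEAF"],
    "COMB (Le/100 km)": [2.1, 2.4, 2.2, 2.1, 2.4],
})

# Rows identical in every column: this is what drop_duplicates() removes by default.
full_dupes = int(sample.duplicated().sum())

# Rows repeating an earlier Make/Model pair, regardless of the other columns.
model_repeats = int(sample.duplicated(subset=["Make", "Model"]).sum())

print(full_dupes, model_repeats)  # 0 2
```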

In [9]:
print(df.isna().sum())
df = df.dropna()

YEAR 0
Make 0
Model 0
Size 0
(kW) 0
Unnamed: 5 0
TYPE 0
CITY (kWh/100 km) 0
HWY (kWh/100 km) 0
COMB (kWh/100 km) 0
CITY (Le/100 km) 0
HWY (Le/100 km) 0
COMB (Le/100 km) 0
(g/km) 0
RATING 34
(km) 0
TIME (h) 0
dtype: int64

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19 entries, 34 to 52
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 19 non-null int64
1 Make 19 non-null object
2 Model 19 non-null object
3 Size 19 non-null object
4 (kW) 19 non-null int64
5 Unnamed: 5 19 non-null object
6 TYPE 19 non-null object
7 CITY (kWh/100 km) 19 non-null float64
8 HWY (kWh/100 km) 19 non-null float64
9 COMB (kWh/100 km) 19 non-null float64
10 CITY (Le/100 km) 19 non-null float64
11 HWY (Le/100 km) 19 non-null float64
12 COMB (Le/100 km) 19 non-null float64
13 (g/km) 19 non-null int64
14 RATING 19 non-null float64
15 (km) 19 non-null int64
16 TIME (h) 19 non-null int64
dtypes: float64(7), int64(5), object(5)
memory usage: 2.7+ KB
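
Because RATING is the only column with missing values (34 of 53), a bare dropna() cuts the frame from 53 rows to 19, and every later plot uses only those 19 rows. A hedged alternative, if RATING is not needed for the analysis, is to drop that column or to restrict dropna to the columns that matter. The frame below is a toy stand-in with made-up values, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in: RATING is missing for most rows, as in the real frame. Values are illustrative.
sample = pd.DataFrame({
    "Make": ["MITSUBISHI", "NISSAN", "FORD", "BMW"],
    "COMB (Le/100 km)": [2.1, 2.4, 2.2, 1.9],
    "RATING": [np.nan, np.nan, np.nan, 10.0],
})

# Share of missing values per column, useful before deciding what to drop.
missing_share = sample.isna().mean()

# What the notebook does: any row with a NaN anywhere is removed.
all_dropped = sample.dropna()

# Alternative 1: remove the sparse column and keep every row.
without_rating = sample.drop(columns="RATING")

# Alternative 2: only require the columns used in the analysis to be present.
kept = sample.dropna(subset=["COMB (Le/100 km)"])

print(len(all_dropped), len(without_rating), len(kept))  # 1 4 4
```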

In [15]:
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');

In [18]:
plt.figure(figsize=(10,5))
c = df.corr()
sns.heatmap(c,cmap="BrBG",annot=True)
c
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call las
1 plt.figure(figsize=(10,5))
----> 2 c = df.corr()
3 sns.heatmap(c,cmap="BrBG",annot=True)
4 c
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in corr(self, method, min_periods, numeric_only)
11047 cols = data.columns
11048 idx = cols.copy()
-> 11049 mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False)
11050
11051 if method == "pearson":
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in to_numpy(self, dtype, copy, na_value)
1991 if dtype is not None:
1992 dtype = np.dtype(dtype)
-> 1993 result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
1994 if result.dtype is not dtype:
1995 result = np.asarray(result, dtype=dtype)
/usr/local/lib/python3.10/dist-packages/pandas/core/internals/managers.py in as_array(self, dtype, copy, na_value)
1692 arr.flags.writeable = False
1693 else:
-> 1694 arr = self._interleave(dtype=dtype, na_value=na_value)
1695 # The underlying data was copied within _interleave, so no need
1696 # to further copy if copy=True or setting na_value
/usr/local/lib/python3.10/dist-packages/pandas/core/internals/managers.py in _interleave(self, dtype, na_value)
1751 else:
1752 arr = blk.get_values(dtype)
-> 1753 result[rl.indexer] = arr
1754 itemmask[rl.indexer] = 1
1755
ValueError: could not convert string to float: 'BMW'

<Figure size 1000x500 with 0 Axes>
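
The ValueError comes from df.corr() trying to convert string columns such as Make to float. The corr signature in the traceback includes numeric_only, so passing numeric_only=True is one way to restrict the calculation to numeric columns. The sketch below uses a toy frame with made-up values. It also drops constant columns first: in the describe() output, (g/km) and RATING have a standard deviation of 0, and a zero-variance column yields NaN correlations.

```python
import pandas as pd

# Toy stand-in with one string column and two constant columns. Values are illustrative.
sample = pd.DataFrame({
    "Make": ["BMW", "NISSAN", "TESLA", "KIA"],
    "(kW)": [125, 80, 283, 81],
    "(km)": [130, 135, 402, 149],
    "(g/km)": [0, 0, 0, 0],
    "RATING": [10.0, 10.0, 10.0, 10.0],
})

# Skip non-numeric columns instead of failing on 'BMW'.
c = sample.corr(numeric_only=True)

# Constant columns have zero variance, so their correlations are NaN; keep only varying ones.
numeric = sample.select_dtypes(include="number")
varying = numeric.loc[:, numeric.nunique() > 1]
c_clean = varying.corr()

print(c_clean.round(3))
# The heatmap call from the cell above can then be applied to c_clean:
# sns.heatmap(c_clean, cmap="BrBG", annot=True)
```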

In [20]:
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['Size'], df['COMB (Le/100 km)'])
ax.set_xlabel('Size')
ax.set_ylabel('COMB (Le/100 km)')
plt.show()
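
Size is a categorical column, so the scatter above stacks points in vertical strips, one per size class. A numeric summary per class can make the comparison easier to read. This is a minimal sketch using groupby on a stand-in frame built from the rows shown in df.head(); those rows are not necessarily among the 19 that remain after dropna().

```python
import pandas as pd

# Stand-in built from the Size and COMB (Le/100 km) values printed by df.head().
sample = pd.DataFrame({
    "Size": ["SUBCOMPACT", "MID-SIZE", "COMPACT", "SUBCOMPACT", "MID-SIZE"],
    "COMB (Le/100 km)": [2.1, 2.4, 2.2, 2.1, 2.4],
})

# Mean combined consumption per size class, sorted from lowest to highest.
mean_by_size = (
    sample.groupby("Size")["COMB (Le/100 km)"]
    .mean()
    .sort_values()
)
print(mean_by_size)
```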
