Data Analysis, Manipulation and Cleaning
import pandas as pd
df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")
print(df.head())
print(df.tail())
   Engine Cylinders Transmission Type     Driven_Wheels  Number of Doors
1               6.0            MANUAL  rear wheel drive              2.0
2               6.0            MANUAL  rear wheel drive              2.0
3               6.0            MANUAL  rear wheel drive              2.0
4               6.0            MANUAL  rear wheel drive              2.0
(output truncated in this export; row 0 and the remaining columns are not visible)
[2]: import pandas as pd
df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Make 11914 non-null object
1 Model 11914 non-null object
2 Year 11914 non-null int64
3 Engine Fuel Type 11911 non-null object
4 Engine HP 11845 non-null float64
5 Engine Cylinders 11884 non-null float64
6 Transmission Type 11914 non-null object
7 Driven_Wheels 11914 non-null object
8 Number of Doors 11908 non-null float64
9 Market Category 8172 non-null object
10 Vehicle Size 11914 non-null object
11 Vehicle Style 11914 non-null object
12 highway MPG 11914 non-null int64
13 city mpg 11914 non-null int64
14 Popularity 11914 non-null int64
15 MSRP 11914 non-null int64
dtypes: float64(3), int64(5), object(8)
memory usage: 1.5+ MB
df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")
df.isnull().sum()
[3]: Make 0
Model 0
Year 0
Engine Fuel Type 3
Engine HP 69
Engine Cylinders 30
Transmission Type 0
Driven_Wheels 0
Number of Doors 6
Market Category 3742
Vehicle Size 0
Vehicle Style 0
highway MPG 0
city mpg 0
Popularity 0
MSRP 0
dtype: int64
df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")
df.describe()
df = pd.read_csv(r"C:\Users\chido\Downloads\laptops.csv")
df.shape
df.count()
Storage type 2118
GPU 789
Screen 2156
Touch 2160
Final Price 2160
dtype: int64
df = pd.read_csv(r"C:\Users\chido\Downloads\data1.csv")
df.shape
df.count()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Make 11914 non-null object
1 Model 11914 non-null object
2 Year 11914 non-null int64
3 Engine Fuel Type 11911 non-null object
4 Engine HP 11845 non-null float64
5 Engine Cylinders 11884 non-null float64
6 Transmission Type 11914 non-null object
7 Driven_Wheels 11914 non-null object
8 Number of Doors 11908 non-null float64
9 Market Category 8172 non-null object
10 Vehicle Size 11914 non-null object
11 Vehicle Style 11914 non-null object
12 highway MPG 11914 non-null int64
13 city mpg 11914 non-null int64
14 Popularity 11914 non-null int64
15 MSRP 11914 non-null int64
dtypes: float64(3), int64(5), object(8)
memory usage: 1.5+ MB
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
- Empty cells
- Data in the wrong format
- Wrong data
- Duplicates
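The four kinds of bad data can be demonstrated on a small made-up frame (the column names mirror the workout data set used further below; the values are invented for illustration):

```python
import pandas as pd

# Made-up frame exhibiting each kind of bad data (values are invented)
df = pd.DataFrame({
    "Duration": [60, 45, 60, 60, 450],               # 450 looks like wrong data
    "Date": ["2020/12/01", None, "2020/12/03", "2020/12/03", "20201205"],  # empty cell + wrong format
    "Calories": [409.1, 282.0, None, None, 300.0],   # empty cells
})

print(df.isnull().sum())      # empty cells per column
print(df.duplicated().sum())  # rows 2 and 3 are identical -> 1 duplicate
```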
[30]:
          Make     Model  Year             Engine Fuel Type  Engine HP
1          BMW  1 Series  2011  premium unleaded (required)      300.0
2          BMW  1 Series  2011  premium unleaded (required)      300.0
3          BMW  1 Series  2011  premium unleaded (required)      230.0
4          BMW  1 Series  2011  premium unleaded (required)      230.0
…            …         …     …                            …          …
11909    Acura       ZDX  2012  premium unleaded (required)      300.0
11910    Acura       ZDX  2012  premium unleaded (required)      300.0
11911    Acura       ZDX  2012  premium unleaded (required)      300.0
11912    Acura       ZDX  2013  premium unleaded (recommended)   300.0
11913  Lincoln    Zephyr  2006             regular unleaded      221.0
(remaining columns truncated in this export; the last row continues: 26 17 61 28995)
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df)
new_df.shape
                   Market Category Vehicle Size  Vehicle Style
…
11911   Crossover,Hatchback,Luxury      Midsize  4dr Hatchback
11912   Crossover,Hatchback,Luxury      Midsize  4dr Hatchback
11913                       Luxury      Midsize          Sedan
(output truncated in this export)
[ ]: import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df)
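dropna() also takes parameters that give finer control over which rows are dropped; a minimal sketch with made-up values (subset and thresh are standard pandas arguments, the data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Date": ["2020/12/01", None, "2020/12/03"],
    "Calories": [409.1, None, None],
})

# Drop rows only when 'Date' is missing; rows missing 'Calories' survive
by_date = df.dropna(subset=["Date"])

# Keep rows that have at least two non-null values
enough = df.dropna(thresh=2)

print(by_date.shape, enough.shape)
```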
Example: Calculate the MEAN, and replace any empty values with it:
[36]: import pandas as pd
df = pd.read_csv('dataset.csv')
print(df.shape)
print(df.isnull().sum())
print(df.info())
(32, 5)
Duration 0
Date 1
Pulse 0
Maxpulse 0
Calories 2
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 32 non-null int64
1 Date 31 non-null object
2 Pulse 32 non-null int64
3 Maxpulse 32 non-null int64
4 Calories 30 non-null float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.4+ KB
None
[ ]: import pandas as pd
df = pd.read_csv('dataset.csv')
x = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(x)
[ ]: import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].median()
df["Calories"] = df["Calories"].fillna(x)
df = pd.read_csv('dataset.csv')
x = df["Calories"].mode()
print(x)
0 300.0
Name: Calories, dtype: float64
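Because mode() returns a Series (there can be ties), take its first element before using it to fill empty values. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({"Calories": [300.0, None, 300.0, 250.0, None]})

# mode() returns a Series (there can be ties), so take the first value
x = df["Calories"].mode()[0]
df["Calories"] = df["Calories"].fillna(x)

print(df["Calories"].tolist())
```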
[19]: import pandas as pd

df = pd.read_csv('dataset.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[19], line 5
1 import pandas as pd
3 df = pd.read_csv('dataset.csv')
----> 5 df['Date'] = pd.to_datetime(df['Date'])
7 print(df.to_string())
    …(internal pandas call frames elided)…

ValueError: time data "20201226" doesn't match format "'%Y/%m/%d'". You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
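One way to repair this column, sketched on invented values shaped like the ones above (quoted date strings plus one bare integer). Note that format='mixed' requires pandas 2.0 or newer:

```python
import pandas as pd

# Invented values shaped like the column above: quoted strings plus one bare integer
s = pd.Series(["'2020/12/01'", "'2020/12/02'", 20201226, None])

# Strip the stray quotes, then let pandas infer each element's format.
# format='mixed' needs pandas >= 2.0; errors='coerce' turns anything
# unparseable (including the missing value) into NaT instead of raising.
cleaned = s.astype(str).str.strip("'")
dates = pd.to_datetime(cleaned, format="mixed", errors="coerce")

print(dates)
```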
[20]: df['Date'].dropna()
[20]: 0 '2020/12/01'
1 '2020/12/02'
2 '2020/12/03'
3 '2020/12/04'
4 '2020/12/05'
5 '2020/12/06'
6 '2020/12/07'
7 '2020/12/08'
8 '2020/12/09'
9 '2020/12/10'
10 '2020/12/11'
11 '2020/12/12'
12 '2020/12/12'
13 '2020/12/13'
14 '2020/12/14'
15 '2020/12/15'
16 '2020/12/16'
17 '2020/12/17'
18 '2020/12/18'
19 '2020/12/19'
20 '2020/12/20'
21 '2020/12/21'
23 '2020/12/23'
24 '2020/12/24'
25 '2020/12/25'
26 20201226
27 '2020/12/27'
28 '2020/12/28'
29 '2020/12/29'
30 '2020/12/30'
31 '2020/12/31'
Name: Date, dtype: object
[21]: df
    Duration          Date  Pulse  Maxpulse  Calories
…
17        60  '2020/12/17'    100       120     300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
[25]: df
2156 Blade Intel Core i7 16 1000 SSD RTX 3070
2157 Blade Intel Core i7 32 1000 SSD RTX 3080
2158 Book Intel Evo Core i7 16 1000 SSD NaN
2159 Book Intel Evo Core i7 16 256 SSD NaN
Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong data.
This way you do not have to find out what to replace them with, and there is a good chance you do not need them for your analysis.
[ ]: for x in df.index:
if df.loc[x, "Duration"] > 120:
df.drop(x, inplace = True)
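The same filter can be written without an explicit loop, using boolean indexing (sketched here on a small invented frame):

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 45, 450, 60]})  # invented values; 450 is the bad row

# Boolean indexing keeps only the rows we want, with no explicit loop
df = df[df["Duration"] <= 120]

print(df.shape)
```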
By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row: True for every row that is a duplicate, otherwise False.
[26]: print(df.duplicated())
0 False
1 False
2 False
3 False
4 False
…
2155 False
2156 False
2157 False
2158 False
2159 False
Length: 2160, dtype: bool
Removing Duplicates To remove duplicates, use the drop_duplicates() method.
[ ]: df.drop_duplicates(inplace = True)
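drop_duplicates() also accepts subset and keep parameters; a small sketch on invented rows shaped like the duplicated dates above:

```python
import pandas as pd

# Invented rows shaped like the duplicated dates in the data set above
df = pd.DataFrame({
    "Date": ["2020/12/12", "2020/12/12", "2020/12/13"],
    "Pulse": [100, 100, 103],
    "Calories": [250.7, 250.7, 323.0],
})

# Drop rows that are exact duplicates, keeping the first occurrence (default)
deduped = df.drop_duplicates()

# Or judge duplicates by selected columns only, keeping the last occurrence
by_date = df.drop_duplicates(subset=["Date"], keep="last")

print(deduped.shape, by_date.shape)
```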
Finding Relationships
A great aspect of the Pandas module is the corr() method.
The corr() method calculates the pairwise correlation between the numeric columns in your data set.
The examples on this page use a CSV file called 'data.csv'.
[ ]: df.corr(numeric_only=True)
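Note that in pandas 2.0 and newer, corr() raises an error if the frame still contains text columns unless numeric_only=True is passed. A minimal sketch on invented values loosely shaped like the car data set:

```python
import pandas as pd

# Invented values loosely shaped like the car data set above
df = pd.DataFrame({
    "Make": ["BMW", "BMW", "Acura"],    # text column, excluded from corr
    "Engine HP": [300.0, 230.0, 300.0],
    "MSRP": [46135, 29450, 56670],
})

# numeric_only=True skips the text columns (required in pandas >= 2.0)
c = df.corr(numeric_only=True)

print(c)
```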