
analysis-manipulation-and-cleaning

November 17, 2024

[1]: import requests

# URL of the dataset
url = 'https://drive.usercontent.google.com/download?id=1wh4Kv-HejXrUcZlbbHfPwb3AyNZgO1Y8&export=download&authuser=4&confirm=t&uuid=b6f38ee4-87331730370220778'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Open a file in write-binary mode and save the content
    with open('dataset.csv', 'wb') as file:
        file.write(response.content)
    print("Dataset downloaded successfully!")
else:
    print(f"Failed to download the dataset. Status code: {response.status_code}")

Dataset downloaded successfully!
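For larger files, the response can be streamed to disk in chunks instead of being held in memory all at once. A minimal sketch, reusing the url variable from the cell above (the 8192-byte chunk size is an arbitrary choice):

[ ]: import requests

# Stream the download and write it to disk chunk by chunk
with requests.get(url, stream=True) as response:
    response.raise_for_status()  # raise an exception for non-200 responses
    with open('dataset.csv', 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
print("Dataset downloaded successfully!")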

[1]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")

print(df.head())
print(df.tail())

Make Model Year Engine Fuel Type Engine HP \


0 BMW 1 Series M 2011 premium unleaded (required) 335.0
1 BMW 1 Series 2011 premium unleaded (required) 300.0
2 BMW 1 Series 2011 premium unleaded (required) 300.0
3 BMW 1 Series 2011 premium unleaded (required) 230.0
4 BMW 1 Series 2011 premium unleaded (required) 230.0

Engine Cylinders Transmission Type Driven_Wheels Number of Doors \


0 6.0 MANUAL rear wheel drive 2.0

1 6.0 MANUAL rear wheel drive 2.0
2 6.0 MANUAL rear wheel drive 2.0
3 6.0 MANUAL rear wheel drive 2.0
4 6.0 MANUAL rear wheel drive 2.0

Market Category Vehicle Size Vehicle Style \


0 Factory Tuner,Luxury,High-Performance Compact Coupe
1 Luxury,Performance Compact Convertible
2 Luxury,High-Performance Compact Coupe
3 Luxury,Performance Compact Coupe
4 Luxury Compact Convertible

highway MPG city mpg Popularity MSRP


0 26 19 3916 46135
1 28 19 3916 40650
2 28 20 3916 36350
3 28 18 3916 29450
4 28 18 3916 34500
Make Model Year Engine Fuel Type Engine HP \
11909 Acura ZDX 2012 premium unleaded (required) 300.0
11910 Acura ZDX 2012 premium unleaded (required) 300.0
11911 Acura ZDX 2012 premium unleaded (required) 300.0
11912 Acura ZDX 2013 premium unleaded (recommended) 300.0
11913 Lincoln Zephyr 2006 regular unleaded 221.0

Engine Cylinders Transmission Type Driven_Wheels Number of Doors \


11909 6.0 AUTOMATIC all wheel drive 4.0
11910 6.0 AUTOMATIC all wheel drive 4.0
11911 6.0 AUTOMATIC all wheel drive 4.0
11912 6.0 AUTOMATIC all wheel drive 4.0
11913 6.0 AUTOMATIC front wheel drive 4.0

Market Category Vehicle Size Vehicle Style highway MPG \


11909 Crossover,Hatchback,Luxury Midsize 4dr Hatchback 23
11910 Crossover,Hatchback,Luxury Midsize 4dr Hatchback 23
11911 Crossover,Hatchback,Luxury Midsize 4dr Hatchback 23
11912 Crossover,Hatchback,Luxury Midsize 4dr Hatchback 23
11913 Luxury Midsize Sedan 26

city mpg Popularity MSRP


11909 16 204 46120
11910 16 204 56670
11911 16 204 50620
11912 16 204 50920
11913 17 61 28995

[2]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Make 11914 non-null object
1 Model 11914 non-null object
2 Year 11914 non-null int64
3 Engine Fuel Type 11911 non-null object
4 Engine HP 11845 non-null float64
5 Engine Cylinders 11884 non-null float64
6 Transmission Type 11914 non-null object
7 Driven_Wheels 11914 non-null object
8 Number of Doors 11908 non-null float64
9 Market Category 8172 non-null object
10 Vehicle Size 11914 non-null object
11 Vehicle Style 11914 non-null object
12 highway MPG 11914 non-null int64
13 city mpg 11914 non-null int64
14 Popularity 11914 non-null int64
15 MSRP 11914 non-null int64
dtypes: float64(3), int64(5), object(8)
memory usage: 1.5+ MB

[3]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")

df.isnull().sum()

[3]: Make 0
Model 0
Year 0
Engine Fuel Type 3
Engine HP 69
Engine Cylinders 30
Transmission Type 0
Driven_Wheels 0
Number of Doors 6
Market Category 3742
Vehicle Size 0

Vehicle Style 0
highway MPG 0
city mpg 0
Popularity 0
MSRP 0
dtype: int64
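The counts above are easier to judge as a share of the 11,914 rows. A small sketch that converts them to percentages:

[ ]: # Percentage of missing values per column (isnull().mean() gives the fraction of missing entries)
print((df.isnull().mean() * 100).round(2))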

[9]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")

df.describe()

[9]: Year Engine HP Engine Cylinders Number of Doors \


count 11914.000000 11845.00000 11884.000000 11908.000000
mean 2010.384338 249.38607 5.628829 3.436093
std 7.579740 109.19187 1.780559 0.881315
min 1990.000000 55.00000 0.000000 2.000000
25% 2007.000000 170.00000 4.000000 2.000000
50% 2015.000000 227.00000 6.000000 4.000000
75% 2016.000000 300.00000 6.000000 4.000000
max 2017.000000 1001.00000 16.000000 4.000000

highway MPG city mpg Popularity MSRP


count 11914.000000 11914.000000 11914.000000 1.191400e+04
mean 26.637485 19.733255 1554.911197 4.059474e+04
std 8.863001 8.987798 1441.855347 6.010910e+04
min 12.000000 7.000000 2.000000 2.000000e+03
25% 22.000000 16.000000 549.000000 2.100000e+04
50% 26.000000 18.000000 1385.000000 2.999500e+04
75% 30.000000 22.000000 2009.000000 4.223125e+04
max 354.000000 137.000000 5657.000000 2.065902e+06
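describe() summarises only the numeric columns by default. A sketch that also summarises the text columns (count, number of unique values, most frequent value):

[ ]: # Summary statistics for the object (text) columns
print(df.describe(include='object'))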

[27]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Downloads\laptops.csv")

df.shape
df.count()

[27]: Laptop 2160


Status 2160
Brand 2160
Model 2160
CPU 2160
RAM 2160
Storage 2160

Storage type 2118
GPU 789
Screen 2156
Touch 2160
Final Price 2160
dtype: int64

[29]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Downloads\data1.csv")

df.shape
df.count()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Make 11914 non-null object
1 Model 11914 non-null object
2 Year 11914 non-null int64
3 Engine Fuel Type 11911 non-null object
4 Engine HP 11845 non-null float64
5 Engine Cylinders 11884 non-null float64
6 Transmission Type 11914 non-null object
7 Driven_Wheels 11914 non-null object
8 Number of Doors 11908 non-null float64
9 Market Category 8172 non-null object
10 Vehicle Size 11914 non-null object
11 Vehicle Style 11914 non-null object
12 highway MPG 11914 non-null int64
13 city mpg 11914 non-null int64
14 Popularity 11914 non-null int64
15 MSRP 11914 non-null int64
dtypes: float64(3), int64(5), object(8)
memory usage: 1.5+ MB
Data Cleaning
Data cleaning means fixing bad data in your data set. Bad data could be: empty cells, data in the wrong format, wrong data, or duplicates.
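Before cleaning, it helps to see which of these problems the data set actually has. A quick inspection sketch, using the car DataFrame loaded above:

[ ]: # Empty cells, column types and duplicate rows at a glance
print(df.isnull().sum())       # empty cells per column
print(df.dtypes)               # columns stored in an unexpected format show up here
print(df.duplicated().sum())   # number of fully duplicated rows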
[30]: df.dropna()

Make Model Year Engine Fuel Type Engine HP \


0 BMW 1 Series M 2011 premium unleaded (required) 335.0

1 BMW 1 Series 2011 premium unleaded (required) 300.0
2 BMW 1 Series 2011 premium unleaded (required) 300.0
3 BMW 1 Series 2011 premium unleaded (required) 230.0
4 BMW 1 Series 2011 premium unleaded (required) 230.0
… … … … … …
11909 Acura ZDX 2012 premium unleaded (required) 300.0
11910 Acura ZDX 2012 premium unleaded (required) 300.0
11911 Acura ZDX 2012 premium unleaded (required) 300.0
11912 Acura ZDX 2013 premium unleaded (recommended) 300.0
11913 Lincoln Zephyr 2006 regular unleaded 221.0

Engine Cylinders Transmission Type Driven_Wheels Number of Doors \


0 6.0 MANUAL rear wheel drive 2.0
1 6.0 MANUAL rear wheel drive 2.0
2 6.0 MANUAL rear wheel drive 2.0
3 6.0 MANUAL rear wheel drive 2.0
4 6.0 MANUAL rear wheel drive 2.0
… … … … …
11909 6.0 AUTOMATIC all wheel drive 4.0
11910 6.0 AUTOMATIC all wheel drive 4.0
11911 6.0 AUTOMATIC all wheel drive 4.0
11912 6.0 AUTOMATIC all wheel drive 4.0
11913 6.0 AUTOMATIC front wheel drive 4.0

Market Category Vehicle Size Vehicle Style \


0 Factory Tuner,Luxury,High-Performance Compact Coupe
1 Luxury,Performance Compact Convertible
2 Luxury,High-Performance Compact Coupe
3 Luxury,Performance Compact Coupe
4 Luxury Compact Convertible
… … … …
11909 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11910 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11911 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11912 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11913 Luxury Midsize Sedan

highway MPG city mpg Popularity MSRP


0 26 19 3916 46135
1 28 19 3916 40650
2 28 20 3916 36350
3 28 18 3916 29450
4 28 18 3916 34500
… … … … …
11909 23 16 204 46120
11910 23 16 204 56670
11911 23 16 204 50620
11912 23 16 204 50920

11913 26 17 61 28995

[8084 rows x 16 columns]

[32]: import pandas as pd

df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df)
new_df.shape

Make Model Year Engine Fuel Type Engine HP \


0 BMW 1 Series M 2011 premium unleaded (required) 335.0
1 BMW 1 Series 2011 premium unleaded (required) 300.0
2 BMW 1 Series 2011 premium unleaded (required) 300.0
3 BMW 1 Series 2011 premium unleaded (required) 230.0
4 BMW 1 Series 2011 premium unleaded (required) 230.0
… … … … … …
11909 Acura ZDX 2012 premium unleaded (required) 300.0
11910 Acura ZDX 2012 premium unleaded (required) 300.0
11911 Acura ZDX 2012 premium unleaded (required) 300.0
11912 Acura ZDX 2013 premium unleaded (recommended) 300.0
11913 Lincoln Zephyr 2006 regular unleaded 221.0

Engine Cylinders Transmission Type Driven_Wheels Number of Doors \


0 6.0 MANUAL rear wheel drive 2.0
1 6.0 MANUAL rear wheel drive 2.0
2 6.0 MANUAL rear wheel drive 2.0
3 6.0 MANUAL rear wheel drive 2.0
4 6.0 MANUAL rear wheel drive 2.0
… … … … …
11909 6.0 AUTOMATIC all wheel drive 4.0
11910 6.0 AUTOMATIC all wheel drive 4.0
11911 6.0 AUTOMATIC all wheel drive 4.0
11912 6.0 AUTOMATIC all wheel drive 4.0
11913 6.0 AUTOMATIC front wheel drive 4.0

Market Category Vehicle Size Vehicle Style \


0 Factory Tuner,Luxury,High-Performance Compact Coupe
1 Luxury,Performance Compact Convertible
2 Luxury,High-Performance Compact Coupe
3 Luxury,Performance Compact Coupe
4 Luxury Compact Convertible
… … … …
11909 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11910 Crossover,Hatchback,Luxury Midsize 4dr Hatchback

11911 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11912 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11913 Luxury Midsize Sedan

highway MPG city mpg Popularity MSRP


0 26 19 3916 46135
1 28 19 3916 40650
2 28 20 3916 36350
3 28 18 3916 29450
4 28 18 3916 34500
… … … … …
11909 23 16 204 46120
11910 23 16 204 56670
11911 23 16 204 50620
11912 23 16 204 50920
11913 26 17 61 28995

[8084 rows x 16 columns]

[32]: (8084, 16)

[ ]: import pandas as pd

df = pd.read_csv('data.csv')

df.dropna(inplace = True)

print(df)

Replace NULL values with the number 130:


[17]: import pandas as pd

df = pd.read_csv('dataset.csv')

df.fillna(130, inplace = True)

Replace Only For Specified Columns


[18]: import pandas as pd

df = pd.read_csv('data.csv')

df["Engine HP"].fillna(130, inplace = True)

Example: Calculate the MEAN and replace any empty values with it:

[36]: import pandas as pd

df = pd.read_csv('dataset.csv')

print(df.shape)
print(df.isnull().sum())
print(df.info())

(32, 5)
Duration 0
Date 1
Pulse 0
Maxpulse 0
Calories 2
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 32 non-null int64
1 Date 31 non-null object
2 Pulse 32 non-null int64
3 Maxpulse 32 non-null int64
4 Calories 30 non-null float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.4+ KB
None

[ ]: import pandas as pd

df = pd.read_csv('dataset.csv')

x = df["Calories"].mean()

df["Calories"].fillna(x, inplace = True)

[ ]: import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].median()

df["Calories"].fillna(x, inplace = True)

[43]: import pandas as pd

df = pd.read_csv('dataset.csv')

x = df["Calories"].mode()

print(x)

df["Calories"].fillna(x, inplace = True)

0 300.0
Name: Calories, dtype: float64

[ ]:

[19]: import pandas as pd

df = pd.read_csv('dataset.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[19], line 5
      1 import pandas as pd
      3 df = pd.read_csv('dataset.csv')
----> 5 df['Date'] = pd.to_datetime(df['Date'])
      7 print(df.to_string())

File ~\anaconda3\Lib\site-packages\pandas\core\tools\datetimes.py:1112, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1110     result = arg.map(cache_array)
   1111 else:
-> 1112     values = convert_listlike(arg._values, format)
   1113     result = arg._constructor(values, index=arg.index, name=arg.name)
   1114 elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):

File ~\anaconda3\Lib\site-packages\pandas\core\tools\datetimes.py:488, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
    486 # `format` could be inferred, or user didn't ask for mixed-format parsing.
    487 if format is not None and format != "mixed":
--> 488     return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
    490 result, tz_parsed = objects_to_datetime64ns(
    491     arg,
    492     dayfirst=dayfirst,
    (…)
    496     allow_object=True,
    497 )
    499 if tz_parsed is not None:
    500     # We can take a shortcut since the datetime64 numpy array
    501     # is in UTC

File ~\anaconda3\Lib\site-packages\pandas\core\tools\datetimes.py:519, in _array_strptime_with_fallback(arg, name, utc, fmt, exact, errors)
    508 def _array_strptime_with_fallback(
    509     arg,
    510     name,
    (…)
    514     errors: str,
    515 ) -> Index:
    516     """
    517     Call array_strptime, with fallback behavior depending on 'errors'.
    518     """
--> 519     result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
    520 if any(tz is not None for tz in timezones):
    521     return _return_parsed_timezone_results(result, timezones, utc, name)

File strptime.pyx:534, in pandas._libs.tslibs.strptime.array_strptime()

File strptime.pyx:355, in pandas._libs.tslibs.strptime.array_strptime()

ValueError: time data "20201226" doesn't match format "'%Y/%m/%d'", at position 26. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
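As the error message suggests, the parse fails because most Date values carry literal quote characters while row 26 holds the unquoted value 20201226. A minimal sketch of one way past this, stripping the quotes and letting pandas infer each element's format (format='mixed' requires pandas 2.0 or newer):

[ ]: import pandas as pd

df = pd.read_csv('dataset.csv')

# Strip the literal single quotes, then let pandas infer the format of each element
df['Date'] = pd.to_datetime(df['Date'].str.strip("'"), format='mixed')

print(df.to_string())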

[ ]: df.dropna(subset=['Date'], inplace = True)

[20]: df['Date'].dropna()

[20]: 0 '2020/12/01'
1 '2020/12/02'
2 '2020/12/03'
3 '2020/12/04'
4 '2020/12/05'
5 '2020/12/06'

6 '2020/12/07'
7 '2020/12/08'
8 '2020/12/09'
9 '2020/12/10'
10 '2020/12/11'
11 '2020/12/12'
12 '2020/12/12'
13 '2020/12/13'
14 '2020/12/14'
15 '2020/12/15'
16 '2020/12/16'
17 '2020/12/17'
18 '2020/12/18'
19 '2020/12/19'
20 '2020/12/20'
21 '2020/12/21'
23 '2020/12/23'
24 '2020/12/24'
25 '2020/12/25'
26 20201226
27 '2020/12/27'
28 '2020/12/28'
29 '2020/12/29'
30 '2020/12/30'
31 '2020/12/31'
Name: Date, dtype: object

[21]: df

[21]: Duration Date Pulse Maxpulse Calories


0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2

17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0

Pandas - Fixing Wrong Data


Replacing Values One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be “45” instead of “450”, and we
could just insert “45” in row 7:
[24]: df.loc[7, 'Duration'] = 45

[25]: df

[25]: Laptop Status Brand \


0 ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core… New Asus
1 Alurin Go Start Intel Celeron N4020/8GB/256GB … New Alurin
2 ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core… New Asus
3 MSI Katana GF66 12UC-082XES Intel Core i7-1270… New MSI
4 HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB… New HP
… … … …
2155 Razer Blade 17 FHD 360Hz Intel Core i7-11800H/… Refurbished Razer
2156 Razer Blade 17 FHD 360Hz Intel Core i7-11800H/… Refurbished Razer
2157 Razer Blade 17 FHD 360Hz Intel Core i7-11800H/… Refurbished Razer
2158 Razer Book 13 Intel Evo Core i7-1165G7/16GB/1T… Refurbished Razer
2159 Razer Book FHD+ Intel Evo Core i7-1165G7/16GB/… Refurbished Razer

Model CPU RAM Storage Storage type GPU \


0 ExpertBook Intel Core i5 8 512 SSD NaN
1 Go Intel Celeron 8 256 SSD NaN
2 ExpertBook Intel Core i3 8 256 SSD NaN
3 Katana Intel Core i7 16 1000 SSD RTX 3050
4 15S Intel Core i5 16 512 SSD NaN
… … … … … … …
2155 Blade Intel Core i7 16 1000 SSD RTX 3060

2156 Blade Intel Core i7 16 1000 SSD RTX 3070
2157 Blade Intel Core i7 32 1000 SSD RTX 3080
2158 Book Intel Evo Core i7 16 1000 SSD NaN
2159 Book Intel Evo Core i7 16 256 SSD NaN

Screen Touch Final Price Duration


0 15.6 No 1009.00 NaN
1 15.6 No 299.00 NaN
2 15.6 No 789.00 NaN
3 15.6 No 1199.00 NaN
4 15.6 No 669.01 NaN
… … … … …
2155 17.3 No 2699.99 NaN
2156 17.3 No 2899.99 NaN
2157 17.3 No 3399.99 NaN
2158 13.4 Yes 1899.99 NaN
2159 13.4 Yes 1699.99 NaN

[2160 rows x 13 columns]

Loop through all values in the “Duration” column.


If the value is higher than 120, set it to 120:
[ ]: for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120

Removing Rows Another way of handling wrong data is to remove the rows that contain wrong data.
This way you do not have to find out what to replace them with, and there is a good chance you do not need them for your analysis.
[ ]: for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
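The same rows can be dropped in a single call (a sketch that mirrors the loop above):

[ ]: # Drop every row whose Duration exceeds 120 in one step
df.drop(df[df["Duration"] > 120].index, inplace = True)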

By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row: True for every row that is a duplicate, otherwise False:
[26]: print(df.duplicated())

0 False
1 False
2 False

3 False
4 False

2155 False
2156 False
2157 False
2158 False
2159 False
Length: 2160, dtype: bool
Removing Duplicates To remove duplicates, use the drop_duplicates() method.

[ ]: df.drop_duplicates(inplace = True)
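drop_duplicates() also accepts a subset of columns to compare and a keep argument that decides which occurrence survives. A minimal sketch, with column names taken from the laptops data shown above:

[ ]: # Keep only the first occurrence of each (Brand, Model, CPU, RAM) combination
df.drop_duplicates(subset=["Brand", "Model", "CPU", "RAM"], keep="first", inplace=True)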

Finding Relationships A great aspect of the Pandas module is the corr() method.
The corr() method calculates the pairwise correlation between the columns in your data set.
The examples on this page use a CSV file called 'data.csv'.
[ ]: df.corr()
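On a DataFrame that mixes text and numeric columns, corr() in pandas 2.x raises an error unless the non-numeric columns are excluded. A sketch that restricts the calculation to the numeric columns:

[ ]: # Pairwise correlations over the numeric columns only
print(df.corr(numeric_only=True))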

[ ]:

