
analysis-manipulation-and-cleaning

November 17, 2024

[1]: import requests

# URL of the dataset
url = 'https://drive.usercontent.google.com/download?id=1wh4Kv-HejXrUcZlbbHfPwb3AyNZgO1Y8&export=download&authuser=4&confirm=t&uuid=b6f38ee4-87331730370220778'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Open a file in write-binary mode and save the content
    with open('dataset.csv', 'wb') as file:
        file.write(response.content)
    print("Dataset downloaded successfully!")
else:
    print(f"Failed to download the dataset. Status code: {response.status_code}")

Dataset downloaded successfully!
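For larger files, the response can be streamed to disk in chunks instead of being held in memory all at once. A minimal sketch, reusing the url variable from the cell above (the 8192-byte chunk size is an arbitrary choice):

[ ]: import requests

# Stream the download and write it to disk chunk by chunk
with requests.get(url, stream=True) as response:
    response.raise_for_status()  # raise an exception for non-200 responses
    with open('dataset.csv', 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
print("Dataset downloaded successfully!")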

[1]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")

print(df.head())
print(df.tail())

Make Model Year Engine Fuel Type Engine HP \


0 BMW 1 Series M 2011 premium unleaded (required) 335.0
1 BMW 1 Series 2011 premium unleaded (required) 300.0
2 BMW 1 Series 2011 premium unleaded (required) 300.0
3 BMW 1 Series 2011 premium unleaded (required) 230.0
4 BMW 1 Series 2011 premium unleaded (required) 230.0

Engine Cylinders Transmission Type Driven_Wheels Number of Doors \


0 6.0 MANUAL rear wheel drive 2.0

1 6.0 MANUAL rear wheel drive 2.0
2 6.0 MANUAL rear wheel drive 2.0
3 6.0 MANUAL rear wheel drive 2.0
4 6.0 MANUAL rear wheel drive 2.0

Market Category Vehicle Size Vehicle Style \


0 Factory Tuner,Luxury,High-Performance Compact Coupe
1 Luxury,Performance Compact Convertible
2 Luxury,High-Performance Compact Coupe
3 Luxury,Performance Compact Coupe
4 Luxury Compact Convertible

highway MPG city mpg Popularity MSRP


0 26 19 3916 46135
1 28 19 3916 40650
2 28 20 3916 36350
3 28 18 3916 29450
4 28 18 3916 34500
Make Model Year Engine Fuel Type Engine HP \
11909 Acura ZDX 2012 premium unleaded (required) 300.0
11910 Acura ZDX 2012 premium unleaded (required) 300.0
11911 Acura ZDX 2012 premium unleaded (required) 300.0
11912 Acura ZDX 2013 premium unleaded (recommended) 300.0
11913 Lincoln Zephyr 2006 regular unleaded 221.0

Engine Cylinders Transmission Type Driven_Wheels Number of Doors \


11909 6.0 AUTOMATIC all wheel drive 4.0
11910 6.0 AUTOMATIC all wheel drive 4.0
11911 6.0 AUTOMATIC all wheel drive 4.0
11912 6.0 AUTOMATIC all wheel drive 4.0
11913 6.0 AUTOMATIC front wheel drive 4.0

Market Category Vehicle Size Vehicle Style highway MPG \


11909 Crossover,Hatchback,Luxury Midsize 4dr Hatchback 23
11910 Crossover,Hatchback,Luxury Midsize 4dr Hatchback 23
11911 Crossover,Hatchback,Luxury Midsize 4dr Hatchback 23
11912 Crossover,Hatchback,Luxury Midsize 4dr Hatchback 23
11913 Luxury Midsize Sedan 26

city mpg Popularity MSRP


11909 16 204 46120
11910 16 204 56670
11911 16 204 50620
11912 16 204 50920
11913 17 61 28995

[2]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Make 11914 non-null object
1 Model 11914 non-null object
2 Year 11914 non-null int64
3 Engine Fuel Type 11911 non-null object
4 Engine HP 11845 non-null float64
5 Engine Cylinders 11884 non-null float64
6 Transmission Type 11914 non-null object
7 Driven_Wheels 11914 non-null object
8 Number of Doors 11908 non-null float64
9 Market Category 8172 non-null object
10 Vehicle Size 11914 non-null object
11 Vehicle Style 11914 non-null object
12 highway MPG 11914 non-null int64
13 city mpg 11914 non-null int64
14 Popularity 11914 non-null int64
15 MSRP 11914 non-null int64
dtypes: float64(3), int64(5), object(8)
memory usage: 1.5+ MB

[3]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")

df.isnull().sum()

[3]: Make 0
Model 0
Year 0
Engine Fuel Type 3
Engine HP 69
Engine Cylinders 30
Transmission Type 0
Driven_Wheels 0
Number of Doors 6
Market Category 3742
Vehicle Size 0

Vehicle Style 0
highway MPG 0
city mpg 0
Popularity 0
MSRP 0
dtype: int64
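The counts above are easier to judge as a share of the 11,914 rows. A small sketch that converts them to percentages:

[ ]: # Percentage of missing values per column (isnull().mean() gives the fraction of missing entries)
print((df.isnull().mean() * 100).round(2))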

[9]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Desktop\data1.csv")

df.describe()

[9]: Year Engine HP Engine Cylinders Number of Doors \


count 11914.000000 11845.00000 11884.000000 11908.000000
mean 2010.384338 249.38607 5.628829 3.436093
std 7.579740 109.19187 1.780559 0.881315
min 1990.000000 55.00000 0.000000 2.000000
25% 2007.000000 170.00000 4.000000 2.000000
50% 2015.000000 227.00000 6.000000 4.000000
75% 2016.000000 300.00000 6.000000 4.000000
max 2017.000000 1001.00000 16.000000 4.000000

highway MPG city mpg Popularity MSRP


count 11914.000000 11914.000000 11914.000000 1.191400e+04
mean 26.637485 19.733255 1554.911197 4.059474e+04
std 8.863001 8.987798 1441.855347 6.010910e+04
min 12.000000 7.000000 2.000000 2.000000e+03
25% 22.000000 16.000000 549.000000 2.100000e+04
50% 26.000000 18.000000 1385.000000 2.999500e+04
75% 30.000000 22.000000 2009.000000 4.223125e+04
max 354.000000 137.000000 5657.000000 2.065902e+06
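describe() summarises only the numeric columns by default. A sketch that also summarises the text columns (count, number of unique values, most frequent value):

[ ]: # Summary statistics for the object (text) columns
print(df.describe(include='object'))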

[27]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Downloads\laptops.csv")

df.shape
df.count()

[27]: Laptop 2160


Status 2160
Brand 2160
Model 2160
CPU 2160
RAM 2160
Storage 2160

Storage type 2118
GPU 789
Screen 2156
Touch 2160
Final Price 2160
dtype: int64

[29]: import pandas as pd

df = pd.read_csv(r"C:\Users\chido\Downloads\data1.csv")

df.shape
df.count()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Make 11914 non-null object
1 Model 11914 non-null object
2 Year 11914 non-null int64
3 Engine Fuel Type 11911 non-null object
4 Engine HP 11845 non-null float64
5 Engine Cylinders 11884 non-null float64
6 Transmission Type 11914 non-null object
7 Driven_Wheels 11914 non-null object
8 Number of Doors 11908 non-null float64
9 Market Category 8172 non-null object
10 Vehicle Size 11914 non-null object
11 Vehicle Style 11914 non-null object
12 highway MPG 11914 non-null int64
13 city mpg 11914 non-null int64
14 Popularity 11914 non-null int64
15 MSRP 11914 non-null int64
dtypes: float64(3), int64(5), object(8)
memory usage: 1.5+ MB
Data Cleaning
Data cleaning means fixing bad data in your data set. Bad data could be: empty cells, data in the wrong format, wrong data, or duplicates.
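Before cleaning, it helps to see which of these problems the data set actually has. A quick inspection sketch, using the car DataFrame loaded above:

[ ]: # Empty cells, column types and duplicate rows at a glance
print(df.isnull().sum())       # empty cells per column
print(df.dtypes)               # columns stored in an unexpected format show up here
print(df.duplicated().sum())   # number of fully duplicated rows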
[30]: df.dropna()

Make Model Year Engine Fuel Type Engine HP \


0 BMW 1 Series M 2011 premium unleaded (required) 335.0

1 BMW 1 Series 2011 premium unleaded (required) 300.0
2 BMW 1 Series 2011 premium unleaded (required) 300.0
3 BMW 1 Series 2011 premium unleaded (required) 230.0
4 BMW 1 Series 2011 premium unleaded (required) 230.0
… … … … … …
11909 Acura ZDX 2012 premium unleaded (required) 300.0
11910 Acura ZDX 2012 premium unleaded (required) 300.0
11911 Acura ZDX 2012 premium unleaded (required) 300.0
11912 Acura ZDX 2013 premium unleaded (recommended) 300.0
11913 Lincoln Zephyr 2006 regular unleaded 221.0

Engine Cylinders Transmission Type Driven_Wheels Number of Doors \


0 6.0 MANUAL rear wheel drive 2.0
1 6.0 MANUAL rear wheel drive 2.0
2 6.0 MANUAL rear wheel drive 2.0
3 6.0 MANUAL rear wheel drive 2.0
4 6.0 MANUAL rear wheel drive 2.0
… … … … …
11909 6.0 AUTOMATIC all wheel drive 4.0
11910 6.0 AUTOMATIC all wheel drive 4.0
11911 6.0 AUTOMATIC all wheel drive 4.0
11912 6.0 AUTOMATIC all wheel drive 4.0
11913 6.0 AUTOMATIC front wheel drive 4.0

Market Category Vehicle Size Vehicle Style \


0 Factory Tuner,Luxury,High-Performance Compact Coupe
1 Luxury,Performance Compact Convertible
2 Luxury,High-Performance Compact Coupe
3 Luxury,Performance Compact Coupe
4 Luxury Compact Convertible
… … … …
11909 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11910 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11911 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11912 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11913 Luxury Midsize Sedan

highway MPG city mpg Popularity MSRP


0 26 19 3916 46135
1 28 19 3916 40650
2 28 20 3916 36350
3 28 18 3916 29450
4 28 18 3916 34500
… … … … …
11909 23 16 204 46120
11910 23 16 204 56670
11911 23 16 204 50620
11912 23 16 204 50920

11913 26 17 61 28995

[8084 rows x 16 columns]

[32]: import pandas as pd

df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df)
new_df.shape

Make Model Year Engine Fuel Type Engine HP \


0 BMW 1 Series M 2011 premium unleaded (required) 335.0
1 BMW 1 Series 2011 premium unleaded (required) 300.0
2 BMW 1 Series 2011 premium unleaded (required) 300.0
3 BMW 1 Series 2011 premium unleaded (required) 230.0
4 BMW 1 Series 2011 premium unleaded (required) 230.0
… … … … … …
11909 Acura ZDX 2012 premium unleaded (required) 300.0
11910 Acura ZDX 2012 premium unleaded (required) 300.0
11911 Acura ZDX 2012 premium unleaded (required) 300.0
11912 Acura ZDX 2013 premium unleaded (recommended) 300.0
11913 Lincoln Zephyr 2006 regular unleaded 221.0

Engine Cylinders Transmission Type Driven_Wheels Number of Doors \


0 6.0 MANUAL rear wheel drive 2.0
1 6.0 MANUAL rear wheel drive 2.0
2 6.0 MANUAL rear wheel drive 2.0
3 6.0 MANUAL rear wheel drive 2.0
4 6.0 MANUAL rear wheel drive 2.0
… … … … …
11909 6.0 AUTOMATIC all wheel drive 4.0
11910 6.0 AUTOMATIC all wheel drive 4.0
11911 6.0 AUTOMATIC all wheel drive 4.0
11912 6.0 AUTOMATIC all wheel drive 4.0
11913 6.0 AUTOMATIC front wheel drive 4.0

Market Category Vehicle Size Vehicle Style \


0 Factory Tuner,Luxury,High-Performance Compact Coupe
1 Luxury,Performance Compact Convertible
2 Luxury,High-Performance Compact Coupe
3 Luxury,Performance Compact Coupe
4 Luxury Compact Convertible
… … … …
11909 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11910 Crossover,Hatchback,Luxury Midsize 4dr Hatchback

11911 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11912 Crossover,Hatchback,Luxury Midsize 4dr Hatchback
11913 Luxury Midsize Sedan

highway MPG city mpg Popularity MSRP


0 26 19 3916 46135
1 28 19 3916 40650
2 28 20 3916 36350
3 28 18 3916 29450
4 28 18 3916 34500
… … … … …
11909 23 16 204 46120
11910 23 16 204 56670
11911 23 16 204 50620
11912 23 16 204 50920
11913 26 17 61 28995

[8084 rows x 16 columns]

[32]: (8084, 16)

[ ]: import pandas as pd

df = pd.read_csv('data.csv')

df.dropna(inplace = True)

print(df)

Replace NULL values with the number 130:


[17]: import pandas as pd

df = pd.read_csv('dataset.csv')

df.fillna(130, inplace = True)

Replace Only For Specified Columns


[18]: import pandas as pd

df = pd.read_csv('data.csv')

df["Engine HP"].fillna(130, inplace = True)

Example: Calculate the MEAN and replace any empty values with it:

[36]: import pandas as pd

df = pd.read_csv('dataset.csv')

print(df.shape)
print(df.isnull().sum())
print(df.info())

(32, 5)
Duration 0
Date 1
Pulse 0
Maxpulse 0
Calories 2
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 32 non-null int64
1 Date 31 non-null object
2 Pulse 32 non-null int64
3 Maxpulse 32 non-null int64
4 Calories 30 non-null float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.4+ KB
None

[ ]: import pandas as pd

df = pd.read_csv('dataset.csv')

x = df["Calories"].mean()

df["Calories"].fillna(x, inplace = True)

[ ]: import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].median()

df["Calories"].fillna(x, inplace = True)

[43]: import pandas as pd

df = pd.read_csv('dataset.csv')

x = df["Calories"].mode()

print(x)

df["Calories"].fillna(x, inplace = True)

0 300.0
Name: Calories, dtype: float64

[ ]:

[19]: import pandas as pd

df = pd.read_csv('dataset.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[19], line 5
      1 import pandas as pd
      3 df = pd.read_csv('dataset.csv')
----> 5 df['Date'] = pd.to_datetime(df['Date'])
      7 print(df.to_string())

File ~\anaconda3\Lib\site-packages\pandas\core\tools\datetimes.py:1112, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1110     result = arg.map(cache_array)
   1111 else:
-> 1112     values = convert_listlike(arg._values, format)
   1113     result = arg._constructor(values, index=arg.index, name=arg.name)
   1114 elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):

File ~\anaconda3\Lib\site-packages\pandas\core\tools\datetimes.py:488, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
    486 # `format` could be inferred, or user didn't ask for mixed-format parsing.
    487 if format is not None and format != "mixed":
--> 488     return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
    490 result, tz_parsed = objects_to_datetime64ns(
    491     arg,
    492     dayfirst=dayfirst,
    (…)
    496     allow_object=True,
    497 )
    499 if tz_parsed is not None:
    500     # We can take a shortcut since the datetime64 numpy array
    501     # is in UTC

File ~\anaconda3\Lib\site-packages\pandas\core\tools\datetimes.py:519, in _array_strptime_with_fallback(arg, name, utc, fmt, exact, errors)
    508 def _array_strptime_with_fallback(
    509     arg,
    510     name,
    (…)
    514     errors: str,
    515 ) -> Index:
    516     """
    517     Call array_strptime, with fallback behavior depending on 'errors'.
    518     """
--> 519     result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
    520 if any(tz is not None for tz in timezones):
    521     return _return_parsed_timezone_results(result, timezones, utc, name)

File strptime.pyx:534, in pandas._libs.tslibs.strptime.array_strptime()

File strptime.pyx:355, in pandas._libs.tslibs.strptime.array_strptime()

ValueError: time data "20201226" doesn't match format "'%Y/%m/%d'", at position 26. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
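As the error message suggests, the parse fails because most Date values carry literal quote characters while row 26 holds the unquoted value 20201226. A minimal sketch of one way past this, stripping the quotes and letting pandas infer each element's format (format='mixed' requires pandas 2.0 or newer):

[ ]: import pandas as pd

df = pd.read_csv('dataset.csv')

# Strip the literal single quotes, then let pandas infer the format of each element
df['Date'] = pd.to_datetime(df['Date'].str.strip("'"), format='mixed')

print(df.to_string())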

[ ]: df.dropna(subset=['Date'], inplace = True)

[20]: df['Date'].dropna()

[20]: 0 '2020/12/01'
1 '2020/12/02'
2 '2020/12/03'
3 '2020/12/04'
4 '2020/12/05'
5 '2020/12/06'

6 '2020/12/07'
7 '2020/12/08'
8 '2020/12/09'
9 '2020/12/10'
10 '2020/12/11'
11 '2020/12/12'
12 '2020/12/12'
13 '2020/12/13'
14 '2020/12/14'
15 '2020/12/15'
16 '2020/12/16'
17 '2020/12/17'
18 '2020/12/18'
19 '2020/12/19'
20 '2020/12/20'
21 '2020/12/21'
23 '2020/12/23'
24 '2020/12/24'
25 '2020/12/25'
26 20201226
27 '2020/12/27'
28 '2020/12/28'
29 '2020/12/29'
30 '2020/12/30'
31 '2020/12/31'
Name: Date, dtype: object

[21]: df

[21]: Duration Date Pulse Maxpulse Calories


0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2

17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0

Pandas - Fixing Wrong Data


Replacing Values One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be “45” instead of “450”, and we
could just insert “45” in row 7:
[24]: df.loc[7, 'Duration'] = 45

[25]: df

[25]: Laptop Status Brand \


0 ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core… New Asus
1 Alurin Go Start Intel Celeron N4020/8GB/256GB … New Alurin
2 ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core… New Asus
3 MSI Katana GF66 12UC-082XES Intel Core i7-1270… New MSI
4 HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB… New HP
… … … …
2155 Razer Blade 17 FHD 360Hz Intel Core i7-11800H/… Refurbished Razer
2156 Razer Blade 17 FHD 360Hz Intel Core i7-11800H/… Refurbished Razer
2157 Razer Blade 17 FHD 360Hz Intel Core i7-11800H/… Refurbished Razer
2158 Razer Book 13 Intel Evo Core i7-1165G7/16GB/1T… Refurbished Razer
2159 Razer Book FHD+ Intel Evo Core i7-1165G7/16GB/… Refurbished Razer

Model CPU RAM Storage Storage type GPU \


0 ExpertBook Intel Core i5 8 512 SSD NaN
1 Go Intel Celeron 8 256 SSD NaN
2 ExpertBook Intel Core i3 8 256 SSD NaN
3 Katana Intel Core i7 16 1000 SSD RTX 3050
4 15S Intel Core i5 16 512 SSD NaN
… … … … … … …
2155 Blade Intel Core i7 16 1000 SSD RTX 3060

2156 Blade Intel Core i7 16 1000 SSD RTX 3070
2157 Blade Intel Core i7 32 1000 SSD RTX 3080
2158 Book Intel Evo Core i7 16 1000 SSD NaN
2159 Book Intel Evo Core i7 16 256 SSD NaN

Screen Touch Final Price Duration


0 15.6 No 1009.00 NaN
1 15.6 No 299.00 NaN
2 15.6 No 789.00 NaN
3 15.6 No 1199.00 NaN
4 15.6 No 669.01 NaN
… … … … …
2155 17.3 No 2699.99 NaN
2156 17.3 No 2899.99 NaN
2157 17.3 No 3399.99 NaN
2158 13.4 Yes 1899.99 NaN
2159 13.4 Yes 1699.99 NaN

[2160 rows x 13 columns]

Loop through all values in the “Duration” column.


If the value is higher than 120, set it to 120:
[ ]: for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120

Removing Rows Another way of handling wrong data is to remove the rows that contain wrong data.
This way you do not have to find out what to replace them with, and there is a good chance you do not need them for your analysis.
[ ]: for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
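The same rows can be dropped in a single call (a sketch that mirrors the loop above):

[ ]: # Drop every row whose Duration exceeds 120 in one step
df.drop(df[df["Duration"] > 120].index, inplace = True)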

By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row: True for every row that is a duplicate, otherwise False:
[26]: print(df.duplicated())

0 False
1 False
2 False

3 False
4 False

2155 False
2156 False
2157 False
2158 False
2159 False
Length: 2160, dtype: bool
Removing Duplicates To remove duplicates, use the drop_duplicates() method.

[ ]: df.drop_duplicates(inplace = True)
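drop_duplicates() also accepts a subset of columns to compare and a keep argument that decides which occurrence survives. A minimal sketch, with column names taken from the laptops data shown above:

[ ]: # Keep only the first occurrence of each (Brand, Model, CPU, RAM) combination
df.drop_duplicates(subset=["Brand", "Model", "CPU", "RAM"], keep="first", inplace=True)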

Finding Relationships A great aspect of the Pandas module is the corr() method.
The corr() method calculates the pairwise correlation between the columns in your data set.
The examples on this page use a CSV file called 'data.csv'.
[ ]: df.corr()
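On a DataFrame that mixes text and numeric columns, corr() in pandas 2.x raises an error unless the non-numeric columns are excluded. A sketch that restricts the calculation to the numeric columns:

[ ]: # Pairwise correlations over the numeric columns only
print(df.corr(numeric_only=True))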

[ ]:

