Chapter 2

This document discusses cleaning text and categorical data in Python. It describes common problems with text data like names, phone numbers, emails, and passwords, such as data inconsistency, fixed length violations, and typos. An example shows a DataFrame with a 'Full name' and 'Phone number' column that contains phone numbers in various formats.

Membership constraints
CLEANING DATA IN PYTHON

Adel Nehme
Content Developer @DataCamp
Chapter 2 - Text and categorical data problems


Categories and membership constraints
Predefined finite set of categories

Type of data               Example values            Numeric representation
Marriage Status            unmarried, married        0, 1
Household Income Category  0-20K, 20-40K, ...        0, 1, ...
Loan Status                default, payed, no_loan   0, 1, 2

Marriage status can only be unmarried _or_ married


Why could we have these problems?


How do we treat these problems?


An example
# Read study data and print it
study_data = pd.read_csv('study.csv')
study_data

       name    birthday blood_type
1      Beth  2019-10-20         B-
2  Ignatius  2020-07-08         A-
3      Paul  2019-08-12         O+
4     Helen  2019-03-17         O-
5  Jennifer  2019-12-17         Z+   <--
6   Kennedy  2020-04-27         A+
7     Keith  2019-04-19        AB+

# Correct possible blood types
categories

  blood_type
1         O-
2         O+
3         A-
4         A+
5         B+
6         B-
7        AB+
8        AB-


A note on joins


A left anti join on blood types


An inner join on blood types
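
The two joins named on these slides can be sketched with pandas `merge`. This is a minimal illustration, assuming small hand-built `study_data` and `categories` frames that mirror the example tables. pandas has no dedicated anti join, so the `indicator=True` flag is used here as one common way to emulate it; the variable names `merged`, `anti_join` and `inner_join` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for the study data and the reference categories
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
    "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
})
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

# Left join with an indicator column recording where each row was found
merged = study_data.merge(categories, on="blood_type", how="left", indicator=True)

# Left anti join: rows of study_data whose blood type is NOT in categories
anti_join = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Inner join: rows of study_data whose blood type IS in categories
inner_join = study_data.merge(categories, on="blood_type", how="inner")

print(anti_join)
print(len(inner_join))
```

The anti join isolates the rows that violate the membership constraint, while the inner join keeps only the rows that satisfy it.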


Finding inconsistent categories
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)

{'Z+'}

# Get and print rows with inconsistent categories

inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]

name birthday blood_type

5 Jennifer 2019-12-17 Z+


Dropping inconsistent categories
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]
# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]

name birthday blood_type

1 Beth 2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul 2019-08-12 O+
4 Helen 2019-03-17 O-
... ... ... ...
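
As a compact variant of the steps above, the split into consistent and inconsistent rows can be wrapped in a small helper. This is only a sketch: `split_by_membership` is a hypothetical function name, not part of pandas, and the three-row frame is made-up sample data.

```python
import pandas as pd

def split_by_membership(df, column, allowed):
    """Return (consistent, inconsistent) rows of df, based on whether
    the values in `column` belong to the `allowed` collection."""
    is_allowed = df[column].isin(set(allowed))
    return df[is_allowed], df[~is_allowed]

# Made-up sample data mirroring the blood type example
study_data = pd.DataFrame({
    "name": ["Beth", "Jennifer", "Keith"],
    "blood_type": ["B-", "Z+", "AB+"],
})
allowed_types = ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]

consistent_data, inconsistent_data = split_by_membership(
    study_data, "blood_type", allowed_types
)
print(inconsistent_data)
```

Note that `isin` returns False for missing values, so rows with a missing blood type would land in the inconsistent group under this sketch.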


Let's practice!

Categorical variables

Adel Nehme
Content Developer @DataCamp
What type of errors could we have?
I) Value inconsistency

Inconsistent fields: 'married' , 'Maried' , 'UNMARRIED' , 'not married' ..

Trailing white spaces: 'married ' , ' married ' ..

II) Collapsing too many categories into a few

Creating new groups: 0-20K , 20-40K categories ... from continuous household income data

Mapping groups to new ones: Mapping household income categories to 2 'rich' , 'poor'

III) Making sure data is of type category (seen in Chapter 1)


Value consistency
Capitalization: 'married' , 'Married' , 'UNMARRIED' , 'unmarried' ..

# Get marriage status column

marriage_status = demographics['marriage_status']
marriage_status.value_counts()

unmarried 352
married 268
MARRIED 204
UNMARRIED 176
dtype: int64


Value consistency
# Get value counts on DataFrame
demographics.groupby('marriage_status').count()

household_income gender
marriage_status
MARRIED 204 204
UNMARRIED 176 176
married 268 268
unmarried 352 352


Value consistency
# Uppercase
demographics['marriage_status'] = demographics['marriage_status'].str.upper()
demographics['marriage_status'].value_counts()

UNMARRIED 528
MARRIED 472

# Lowercase
demographics['marriage_status'] = demographics['marriage_status'].str.lower()
demographics['marriage_status'].value_counts()

unmarried 528
married 472


Value consistency
Leading and trailing spaces: 'married ' , 'married' , 'unmarried' , ' unmarried' ..

# Get marriage status column

marriage_status = demographics['marriage_status']
marriage_status.value_counts()

unmarried 352
unmarried 268
married 204
married 176
dtype: int64


Value consistency
# Strip leading and trailing spaces
demographics['marriage_status'] = demographics['marriage_status'].str.strip()
demographics['marriage_status'].value_counts()

unmarried 528
married 472
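
The capitalization and whitespace fixes can be chained into a single step. The following is a minimal sketch on a made-up five-row `demographics` frame; the sample values are assumptions chosen to show both kinds of inconsistency at once.

```python
import pandas as pd

# Made-up sample with mixed capitalization and stray whitespace
demographics = pd.DataFrame({
    "marriage_status": ["married ", " UNMARRIED", "Married", "unmarried", " married "]
})

# Strip surrounding whitespace first, then lowercase everything
demographics["marriage_status"] = (
    demographics["marriage_status"].str.strip().str.lower()
)

counts = demographics["marriage_status"].value_counts()
print(counts)
```

After normalization only the two intended categories remain, which is a quick way to check that the cleanup worked.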


Collapsing data into categories
Create categories out of data: income_group column from income column.

# Using qcut()
import pandas as pd
group_names = ['0-200K', '200K-500K', '500K+']
demographics['income_group'] = pd.qcut(demographics['household_income'], q = 3,
labels = group_names)
# Print income_group column
demographics[['income_group', 'household_income']]

income_group household_income
0 200K-500K 189243
1 500K+ 778533
..


Collapsing data into categories
Create categories out of data: income_group column from income column.

# Using cut() - create category ranges and names
import numpy as np
ranges = [0, 200000, 500000, np.inf]
group_names = ['0-200K', '200K-500K', '500K+']
# Create income group column
demographics['income_group'] = pd.cut(demographics['household_income'], bins=ranges,
                                      labels=group_names)
demographics[['income_group', 'household_income']]

income_group household_income
0 0-200K 189243
1 500K+ 778533
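
The two outputs above label the same income, 189243, differently. A minimal side-by-side sketch shows why: `qcut()` splits on quantiles of the data, so its bin edges depend on the sample, while `cut()` uses the edges passed in. The six income values below are made up for illustration (two of them are taken from the slide output).

```python
import numpy as np
import pandas as pd

# Made-up incomes; 189243 and 778533 appear in the slide output
income = pd.Series([50_000, 150_000, 189_243, 300_000, 778_533, 900_000])
group_names = ["0-200K", "200K-500K", "500K+"]

# qcut: three equal-sized groups, edges derived from the data
by_quantile = pd.qcut(income, q=3, labels=group_names)

# cut: three groups with explicit edges that match the label names
by_range = pd.cut(income, bins=[0, 200_000, 500_000, np.inf], labels=group_names)

print(pd.DataFrame({"income": income, "qcut": by_quantile, "cut": by_range}))
```

With `qcut()` the label names are only meaningful if the quantile edges happen to match them, so `cut()` is usually the safer choice when the labels describe fixed ranges.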


Collapsing data into categories
Map categories to fewer ones: reducing categories in categorical column.

operating_system column is: 'Microsoft', 'MacOS', 'IOS', 'Android', 'Linux'

operating_system column should become: 'DesktopOS', 'MobileOS'

# Create mapping dictionary and replace

mapping = {'Microsoft':'DesktopOS', 'MacOS':'DesktopOS', 'Linux':'DesktopOS',
'IOS':'MobileOS', 'Android':'MobileOS'}
devices['operating_system'] = devices['operating_system'].replace(mapping)
devices['operating_system'].unique()

array(['DesktopOS', 'MobileOS'], dtype=object)
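
One detail worth checking when collapsing categories is what happens to values missing from the mapping. The sketch below compares `.replace()` with `.map()` on a made-up `devices` frame; the extra value 'ChromeOS' is a hypothetical addition that is deliberately left out of the mapping.

```python
import pandas as pd

# Made-up data; 'ChromeOS' is intentionally absent from the mapping
devices = pd.DataFrame({
    "operating_system": ["Microsoft", "MacOS", "IOS", "Android", "Linux", "ChromeOS"]
})
mapping = {"Microsoft": "DesktopOS", "MacOS": "DesktopOS", "Linux": "DesktopOS",
           "IOS": "MobileOS", "Android": "MobileOS"}

# replace() leaves unmapped values untouched
replaced = devices["operating_system"].replace(mapping)

# map() turns unmapped values into NaN
mapped = devices["operating_system"].map(mapping)

print(replaced.unique())
print(mapped.isna().sum())
```

Calling `.unique()` after the replacement, as the slide does, is a simple way to spot values the mapping did not cover.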


Let's practice!

Cleaning text data

Adel Nehme
Content Developer @ DataCamp
What is text data?
Type of data     Example values
Names            Alex , Sara ...
Phone numbers    +96171679912 ...
Emails           `[email protected]` ..
Passwords        ...

Common text data problems

1) Data inconsistency:
+96171679912 or 0096171679912 or ..?

2) Fixed length violations:
Passwords need to be at least 8 characters

3) Typos:
+961.71.679912


Example
phones = pd.read_csv('phones.csv')
print(phones)

Full name Phone number

0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin +1-297-996-4904 <-- Inconsistent data format
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138 <-- Length violation
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210


Example
phones = pd.read_csv('phones.csv')
print(phones)

Full name Phone number

0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews NaN
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210


Fixing the phone number column
# Replace "+" with "00" (regex=False treats "+" as a literal character)
phones["Phone number"] = phones["Phone number"].str.replace("+", "00", regex=False)
phones

Full name Phone number

0 Noelani A. Gray 001-702-397-5143
1 Myles Z. Gomez 001-329-485-0540
2 Gil B. Silva 001-195-492-2338
3 Prescott D. Hardin 001-297-996-4904
4 Benedict G. Valdez 001-969-820-3536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 001-536-175-8444
7 Hedley I. Logan 001-681-552-1823
8 Jack W. Carrillo 001-910-323-5265
9 Lionel M. Davis 001-143-119-9210


Fixing the phone number column
# Replace "-" with nothing
phones["Phone number"] = phones["Phone number"].str.replace("-", "", regex=False)
phones

Full name Phone number

0 Noelani A. Gray 0017023975143
1 Myles Z. Gomez 0013294850540
2 Gil B. Silva 0011954922338
3 Prescott D. Hardin 0012979964904
4 Benedict G. Valdez 0019698203536
5 Reece M. Andrews 4138
6 Hayfa E. Keith 0015361758444
7 Hedley I. Logan 0016815521823
8 Jack W. Carrillo 0019103235265
9 Lionel M. Davis 0011431199210


Fixing the phone number column
# Replace phone numbers with fewer than 10 digits with NaN
import numpy as np
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan
phones

Full name Phone number


Fixing the phone number column
# Find length of each row in Phone number column
sanity_check = phones['Phone number'].str.len()

# Assert minimum phone number length is 10

assert sanity_check.min() >= 10

# Assert all numbers do not have "+" or "-" ("+" is escaped because it is a regex metacharacter)

assert not phones['Phone number'].str.contains(r"\+|-", na=False).any()

Remember, assert returns nothing if the condition passes
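
The individual fixes and the sanity checks can be run end to end on a small sample. This is a sketch under stated assumptions: the three rows are copied from the example table, and `cleaned` is just an illustrative intermediate name.

```python
import numpy as np
import pandas as pd

# Three rows taken from the example table
phones = pd.DataFrame({
    "Full name": ["Noelani A. Gray", "Prescott D. Hardin", "Reece M. Andrews"],
    "Phone number": ["001-702-397-5143", "+1-297-996-4904", "4138"],
})

# Literal replacements: "+" becomes "00", "-" is removed
cleaned = (
    phones["Phone number"]
    .str.replace("+", "00", regex=False)
    .str.replace("-", "", regex=False)
)

# Keep numbers with at least 10 digits, set the rest to NaN
cleaned = cleaned.where(cleaned.str.len() >= 10, np.nan)
phones["Phone number"] = cleaned

# Sanity checks: both pass silently when the column is clean
assert phones["Phone number"].str.len().min() >= 10
assert not phones["Phone number"].str.contains(r"\+|-", na=False).any()

print(phones)
```

Missing values are skipped by `min()` and treated as non-matches through `na=False`, so the NaN introduced for the short number does not trip either assertion.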


But what about more complicated examples?
phones.head()

Full name Phone number

0 Olga Robinson +(01706)-25891
1 Justina Kim +0500-571437
2 Tamekah Henson +0800-1111
3 Miranda Solis +07058-879063
4 Caldwell Gilliam +(016977)-8424

Regular expressions: a supercharged Ctrl + F


Regular expressions in action
# Replace every non-digit character with nothing
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '', regex=True)
phones.head()

Full name Phone number

0 Olga Robinson 0170625891
1 Justina Kim 0500571437
2 Tamekah Henson 08001111
3 Miranda Solis 07058879063
4 Caldwell Gilliam 0169778424
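
To see what the pattern does in isolation, here is a minimal sketch that applies `\D+` first to a single string with Python's `re` module and then to a column. The three phone numbers are copied from the example above; passing `regex=True` explicitly is a precaution, since the default for `str.replace` differs between pandas versions.

```python
import re
import pandas as pd

# \D matches any character that is not a digit; + extends the match to a run of them
non_digits = r"\D+"

# Plain Python on a single string
single = re.sub(non_digits, "", "+(01706)-25891")

# The same pattern applied to a whole column
phones = pd.DataFrame({"Phone number": ["+(01706)-25891", "+0500-571437", "+0800-1111"]})
phones["Phone number"] = phones["Phone number"].str.replace(non_digits, "", regex=True)

print(single)
print(phones)
```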


Let's practice!