How to preprocess string data within a Pandas DataFrame?

Last Updated : 21 Mar, 2024

Sometimes the data we are working with is packed into a single column, but to analyze it we need the values spread across separate columns, each with an appropriate data type. When everything is combined in a single string, that string must be preprocessed first. This article covers preprocessing string data within a Pandas DataFrame.

Method 1: Using the Series.str.extract() function

Syntax:

Series.str.extract(pat, flags=0, expand=True)

Parameters:

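pat: regular expression pattern with capturing groups; each group becomes one output column.
flags: int, default 0 (no flags). Flags from the re module, such as re.IGNORECASE.
expand: bool, default True. If True, returns a DataFrame with one column per capture group; if False, returns a Series when there is exactly one capture group and a DataFrame otherwise.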

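Returns:

A DataFrame with one row per input string and one column per capture group, or a Series when expand=False and the pattern has a single capture group.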
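The effect of the expand and flags parameters is easiest to see on a tiny Series. The following is a minimal sketch; the sample values and variable names are made up for illustration and are not part of the article's dataset.

```python
import re
import pandas as pd

# Illustrative data, not from the article
s = pd.Series(['a1', 'B2', 'c3'])

# expand=True (the default): a DataFrame, one column per capture group.
# re.IGNORECASE lets [a-z] match the upper-case 'B' as well.
as_frame = s.str.extract(r'([a-z])(\d)', flags=re.IGNORECASE)

# expand=False with a single capture group: a Series
as_series = s.str.extract(r'(\d)', expand=False)

# Without the flag, 'B2' has no lower-case letter, so that row becomes NaN
no_flag = s.str.extract(r'([a-z])', expand=False)

print(as_frame)
print(as_series)
print(no_flag)
```

Rows that do not match the pattern are not dropped; they are filled with NaN, so the result always lines up with the original index.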

Step 1: Import packages

Pandas package is imported.

Python3

# import packages
import pandas as pd

Step 2: Create dataframe:

pd.DataFrame() creates a DataFrame from the given dictionary. This is the DataFrame we will preprocess: at the start, all the data sits in a single column as strings.

Python3

# creating data
data = {'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                      'Beijing 14.0 2020-01-22 17:00:00',
                      'Washington 1.0 2020-01-24 17:00:00',
                      'Victoria 3.0 2020-01-31 23:59:00',
                      'Macau 10.0 2020-02-06 14:23:04']}

# creating a pandas dataframe
dataset = pd.DataFrame(data)

str.extract() takes a regular expression and extracts the text matched by each capture group into columns. In the pattern (....-..-.. ..:..:..), each dot matches any single character, so the pattern picks out text laid out as yyyy-mm-dd hh:mm:ss. The extracted values are still strings, not datetime objects.

Python3

dataset['LastUpdated'] = dataset['CovidData'].str.extract(
    '(....-..-.. ..:..:..)', expand=True)
dataset['LastUpdated']

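Output:

The LastUpdated column holds one timestamp string per row, from 2020-01-22 17:00:00 for the first row to 2020-02-06 14:23:04 for the last.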
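Since the goal is columns of different data types, the extracted timestamp text usually needs one more step to become a real datetime column. The sketch below is one possible way to do that, assuming a stricter digits-only pattern and pd.to_datetime with an explicit format; neither is used in the article's own code, and only two of its rows are reproduced here.

```python
import pandas as pd

# Two rows from the article's data, repeated so the block runs on its own
dataset = pd.DataFrame({'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                                      'Macau 10.0 2020-02-06 14:23:04']})

# Assumed stricter pattern: digits only, laid out as yyyy-mm-dd hh:mm:ss
pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'
extracted = dataset['CovidData'].str.extract(pattern, expand=False)

# Convert the extracted strings into timestamps
dataset['LastUpdated'] = pd.to_datetime(extracted, format='%Y-%m-%d %H:%M:%S')

print(dataset['LastUpdated'].dtype)
```

With a datetime column, operations such as sorting by time or reading the day through the .dt accessor become available.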

The pattern ([A-Za-z]+) matches the first run of one or more letters, which here is the state name at the start of each string.

Python3

dataset['State'] = dataset['CovidData'].str.extract('([A-Za-z]+)', expand=True)
dataset['State']

Output:

The State column holds Anhui, Beijing, Washington, Victoria and Macau.

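'(\d+\.\d)' matches a decimal number: \d+ matches one or more digits before the decimal point, \. matches the point itself, and \d matches one digit after it, for example 12.1 or 3.5. The dot is escaped because an unescaped . matches any character, and the pattern is written as a raw string so that the backslashes reach the regex engine unchanged.

Python3

dataset['confirmed_cases'] = dataset['CovidData'].str.extract(
    r'(\d+\.\d)', expand=True)
dataset['confirmed_cases']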

Output:

The confirmed_cases column holds the strings 1.0, 14.0, 1.0, 3.0 and 10.0.
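The extracted case counts are text, so arithmetic on them will not behave as expected until they are converted. A minimal sketch, assuming pd.to_numeric is an acceptable conversion for this column (the article itself stops at extraction), on two of the article's rows:

```python
import pandas as pd

# Two rows from the article's data, repeated so the block runs on its own
dataset = pd.DataFrame({'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                                      'Beijing 14.0 2020-01-22 17:00:00']})

# Extract the decimal number as text, then convert it to a numeric dtype
cases = dataset['CovidData'].str.extract(r'(\d+\.\d)', expand=False)
dataset['confirmed_cases'] = pd.to_numeric(cases)

# Numeric operations now work on the column
print(dataset['confirmed_cases'].sum())
```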

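Dataframe before preprocessing: a single CovidData column holding the five raw strings.

Dataframe after preprocessing: the original CovidData column plus three extracted columns, LastUpdated, State and confirmed_cases.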
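The three separate extractions can also be combined into one call by using named capture groups, which str.extract() turns into column names. This is an alternative sketch rather than the article's approach; the combined pattern and the names parts and result are assumptions made for illustration.

```python
import pandas as pd

dataset = pd.DataFrame({'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                                      'Beijing 14.0 2020-01-22 17:00:00',
                                      'Washington 1.0 2020-01-24 17:00:00',
                                      'Victoria 3.0 2020-01-31 23:59:00',
                                      'Macau 10.0 2020-02-06 14:23:04']})

# One pattern with three named groups; each group name becomes a column name
pattern = (r'(?P<State>[A-Za-z]+) '
           r'(?P<confirmed_cases>\d+\.\d) '
           r'(?P<LastUpdated>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')

parts = dataset['CovidData'].str.extract(pattern)

# Attach the extracted columns to the original frame by index
result = dataset.join(parts)
print(result)
```

A single pattern keeps the fields in step with each other: a row either matches as a whole or yields NaN in all three columns, which makes malformed rows easier to spot.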

Method 2: Using apply() function

In this method, we preprocess a dataset of movie reviews, the Rotten Tomatoes dataset. The pandas, re and stop_words packages are imported, and the English stop words are stored in a variable called stop_words. The dataset is loaded with pd.read_csv(). We then use the apply() method to preprocess the string data in three steps: str.lower converts all text to lower case, re.sub(r'[^\w\s]', '', x) removes punctuation marks, and finally the stop words are filtered out. Because the CSV file is large, only a few rows are displayed to show the difference.

Python3

# import packages
import pandas as pd
from stop_words import get_stop_words
import re

# stop words
stop_words = get_stop_words('en')

# reading the csv file
data = pd.read_csv('test.csv')

print('Before string processing : ')
print(data[(data['PhraseId'] >= 157139) & (
    data['PhraseId'] <= 157141)]['Phrase'])

# converting all text to lower case in the Phrase column
data['Phrase'] = data['Phrase'].apply(str.lower)

# using regex to remove punctuation
data['Phrase'] = data['Phrase'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

# removing stop words
data['Phrase'] = data['Phrase'].apply(lambda x: ' '.join(
    w for w in x.split() if w not in stop_words))

print('After string processing : ')
print(data[(data['PhraseId'] >= 157139) & (
    data['PhraseId'] <= 157141)]['Phrase'])
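The code above depends on test.csv and the stop_words package, so it cannot be run without them. The following is a self-contained sketch of the same three steps, assuming a tiny hand-written stop-word set and two made-up phrases in place of the real dataset; STOP_WORDS, clean and the Clean column are hypothetical names.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the stop_words package and the CSV file
STOP_WORDS = {'the', 'is', 'a', 'of', 'and'}
data = pd.DataFrame({'Phrase': ['The plot is thin, and the pacing drags!',
                                'A triumph of style.']})


def clean(text):
    """Lowercase the text, strip punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)


# apply() calls clean() once per value in the column
data['Clean'] = data['Phrase'].apply(clean)
print(data['Clean'].tolist())
# prints ['plot thin pacing drags', 'triumph style']
```

Folding the three steps into one function means apply() walks the column once instead of three times. Note that str.lower raises an error on missing values, whereas the vectorized data['Phrase'].str.lower() passes them through as NaN.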