How to preprocess string data within a Pandas DataFrame?
Last Updated :
21 Mar, 2024
Sometimes the data we are working with is packed into a single column, but to analyze it we need the values spread across separate columns, each with an appropriate data type. When everything is combined in a single string, that string has to be preprocessed first. This article shows how to preprocess string data within a Pandas DataFrame.
Syntax:
Series.str.extract(pat, flags=0, expand=True)
Parameters:
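- pat: regular expression with capture groups; each capture group becomes a column.
- flags: int, default 0 (no flags). Accepts flags from the re module, such as re.IGNORECASE.
- expand: if True, returns a DataFrame with one column per capture group. If False, returns a Series when there is one capture group and a DataFrame when there are several.
Returns:
The method returns a DataFrame or a Series.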
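As a minimal sketch of how these parameters interact, the snippet below calls str.extract() on a small hypothetical Series (the sample values and variable names are assumptions for illustration, not part of the article's dataset). It compares the return type for expand=True and expand=False and passes re.IGNORECASE through flags.

```python
import re
import pandas as pd

# Hypothetical sample values, not the article's dataset
s = pd.Series(['Anhui 1.0', 'beijing 14.0'])

# expand=True (the default): a DataFrame, one column per capture group
as_frame = s.str.extract(r'([a-z]+)', flags=re.IGNORECASE, expand=True)

# expand=False with a single capture group: a Series
as_series = s.str.extract(r'([a-z]+)', flags=re.IGNORECASE, expand=False)

# Two capture groups: a DataFrame with columns 0 and 1
two_groups = s.str.extract(r'([a-z]+) (\d+\.\d+)', flags=re.IGNORECASE)

print(type(as_frame).__name__)
print(type(as_series).__name__)
print(two_groups)
```

Without re.IGNORECASE, the pattern [a-z]+ would skip the capital "A" in "Anhui" and capture only "nhui".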
Step 1: Import packages
The pandas package is imported.
Python3
# import packages
import pandas as pd
Step 2: Create dataframe
pd.DataFrame() method is used to create a dataframe of the dictionary given. We create a dataframe that needs to be preprocessed. All the data resides in a single column in string format at the start.
Python3
# creating data
data = {'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                      'Beijing 14.0 2020-01-22 17:00:00',
                      'Washington 1.0 2020-01-24 17:00:00',
                      'Victoria 3.0 2020-01-31 23:59:00',
                      'Macau 10.0 2020-02-06 14:23:04']}
# creating a pandas dataframe
dataset = pd.DataFrame(data)
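str.extract() takes a regular expression and other parameters and extracts the matching text into columns. The pattern '(....-..-.. ..:..:..)' extracts dates of the form yyyy-mm-dd hh:mm:ss. Each '.' matches any single character, so the pattern relies on the positions of the hyphens, the space, and the colons rather than checking for digits. The extracted values are still strings, not datetime objects.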
Python3
dataset['LastUpdated'] = dataset['CovidData'].str.extract(
    '(....-..-.. ..:..:..)', expand=True)
dataset['LastUpdated']
Output:
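Since str.extract() returns text, a conversion step is usually needed before doing date arithmetic. The sketch below is one possible way to do it, assuming a stricter digit-only pattern and pd.to_datetime() with an explicit format; the two-row frame is a shortened copy of the sample data.

```python
import pandas as pd

# Shortened copy of the article's sample data
dataset = pd.DataFrame({'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                                      'Macau 10.0 2020-02-06 14:23:04']})

# A stricter pattern than dots: \d only matches digits
pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'
extracted = dataset['CovidData'].str.extract(pattern, expand=False)

# Convert the extracted strings to real timestamps
dataset['LastUpdated'] = pd.to_datetime(extracted, format='%Y-%m-%d %H:%M:%S')

print(dataset['LastUpdated'].dt.year.tolist())
```

Once the column holds timestamps, the .dt accessor exposes components such as year, month, and hour.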
str.extract() is called with the pattern '([A-Za-z]+)', which captures the first run of one or more letters in each string, here the state name.
Python3
dataset['State'] = dataset['CovidData'].str.extract('([A-Za-z]+)', expand=True)
dataset['State']
Output:
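One limitation worth noting: [A-Za-z]+ stops at the first space, so a multi-word name would be cut short. The sketch below illustrates this with a hypothetical row ("New South Wales" is not in the article's sample data) and one possible alternative pattern that captures letters and spaces up to the first number.

```python
import pandas as pd

# Hypothetical rows; 'New South Wales' is not in the article's sample data
s = pd.Series(['Anhui 1.0 2020-01-22 17:00:00',
               'New South Wales 4.0 2020-01-26 23:00:00'])

# First run of letters only: stops at the first space
first_word = s.str.extract(r'([A-Za-z]+)', expand=False)

# Letters and spaces from the start of the string,
# ending at the whitespace that precedes the first digit
full_name = s.str.extract(r'^([A-Za-z ]+?)\s+\d', expand=False)

print(first_word.tolist())
print(full_name.tolist())
```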
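r'(\d+\.\d)' is used to match decimal numbers: \d+ matches one or more digits before the decimal point, \. matches the decimal point itself, and \d matches one digit after it, e.g. 12.1 or 3.5. The dot must be escaped because an unescaped '.' matches any character, and the pattern is written as a raw string (r'...') so that the backslashes reach the regex engine unchanged.
Python3
dataset['confirmed_cases'] = dataset['CovidData'].str.extract(
    r'(\d+\.\d)', expand=True)
dataset['confirmed_cases']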
Output:
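The extracted case counts are strings, so arithmetic on them will not work until they are converted. A minimal sketch, assuming pd.to_numeric() with errors='coerce' is acceptable for this data (the two-row frame is a shortened copy of the sample data):

```python
import pandas as pd

# Shortened copy of the article's sample data
dataset = pd.DataFrame({'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                                      'Beijing 14.0 2020-01-22 17:00:00']})

# Raw string with an escaped dot, so '.' is matched literally
cases = dataset['CovidData'].str.extract(r'(\d+\.\d+)', expand=False)

# errors='coerce' turns anything unparseable into NaN instead of raising
dataset['confirmed_cases'] = pd.to_numeric(cases, errors='coerce')

print(dataset['confirmed_cases'].sum())
```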
Dataframe before preprocessing:
Dataframe after preprocessing:
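The three extractions can also be combined into a single call. The sketch below is one possible variant, not the article's original code: it uses named capture groups, which str.extract() turns into column names, and joins the result back onto the original frame. The variable names pattern, parts, and processed are assumptions for illustration.

```python
import pandas as pd

data = {'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                      'Beijing 14.0 2020-01-22 17:00:00',
                      'Washington 1.0 2020-01-24 17:00:00',
                      'Victoria 3.0 2020-01-31 23:59:00',
                      'Macau 10.0 2020-02-06 14:23:04']}
dataset = pd.DataFrame(data)

# Named groups become the column names of the returned DataFrame
pattern = (r'(?P<State>[A-Za-z]+)\s+'
           r'(?P<confirmed_cases>\d+\.\d+)\s+'
           r'(?P<LastUpdated>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')
parts = dataset['CovidData'].str.extract(pattern)

# Attach the new columns to the original frame
processed = dataset.join(parts)

print(processed.columns.tolist())
```

All three new columns still hold strings at this point; type conversion is a separate step.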

Method 2: Using apply() function
In this method, we preprocess a dataset of movie reviews, the Rotten Tomatoes dataset. The pandas, re, and stop_words packages are imported, and the English stop words are stored in a variable called stop_words. The dataset is loaded with pd.read_csv(). We then use apply() to preprocess the string data in three steps: str.lower converts the text to lower case, re.sub(r'[^\w\s]', '', x) removes punctuation, and finally the stop words are removed from each phrase. Because the CSV file is large, only a few rows are displayed to show the difference.
Python3
# import packages
import pandas as pd
from stop_words import get_stop_words
import re
# stop words
stop_words = get_stop_words('en')
# reading the csv file
data = pd.read_csv('test.csv')
print('Before string processing : ')
print(data[(data['PhraseId'] >= 157139) & (
    data['PhraseId'] <= 157141)]['Phrase'])
# converting all text to lower case in the Phrase column
data['Phrase'] = data['Phrase'].apply(str.lower)
# using regex to remove punctuation
data['Phrase'] = data['Phrase'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
# removing stop words
data['Phrase'] = data['Phrase'].apply(lambda x: ' '.join(
    w for w in x.split() if w not in stop_words))
print('After string processing : ')
# print() is needed here so the result is shown when run as a script
print(data[(data['PhraseId'] >= 157139) & (
    data['PhraseId'] <= 157141)]['Phrase'])
Output:
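The same cleaning steps can be tried without the CSV file. The sketch below is a self-contained variant under stated assumptions: the two phrases and the small hand-written stop-word set are hypothetical stand-ins for test.csv and get_stop_words('en'), and the first two steps use the vectorized .str accessor in place of apply().

```python
import pandas as pd

# Hypothetical phrases and a tiny hand-written stop-word set,
# standing in for test.csv and get_stop_words('en')
data = pd.DataFrame({'Phrase': ['The movie is GREAT, really!',
                                'An intermittently pleasing film.']})
stop_words = {'the', 'is', 'an', 'a'}

# Vectorized equivalents of the lower-casing and punctuation steps
cleaned = (data['Phrase']
           .str.lower()
           .str.replace(r'[^\w\s]', '', regex=True))

# Stop-word removal still runs a Python function per row
data['Phrase'] = cleaned.apply(
    lambda x: ' '.join(w for w in x.split() if w not in stop_words))

print(data['Phrase'].tolist())
```

Storing the stop words in a set keeps each membership check fast, which can matter on a large file.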