Python | Pandas Working With Text Data
Last Updated : 13 Jun, 2024
Series and Index objects are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods skip missing/NA values automatically, leaving them as missing in the result. The methods are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods.
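As a minimal sketch of the missing-value behaviour described above, the snippet below applies str.upper() to a small Series that contains one NaN. The sample data and the variable names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data containing one missing value
names = pd.Series(["Jai", np.nan, "Gaurav"])

# .str methods work element by element; the missing entry stays missing
upper_names = names.str.upper()
print(upper_names)
```

No error is raised for the missing entry, so no separate null check is needed before calling the method.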
Lowercasing and Uppercasing Data
To lowercase data, use str.lower(), which converts all uppercase characters to lowercase; strings with no uppercase characters are returned unchanged. To uppercase data, use str.upper(), which converts all lowercase characters to uppercase; strings with no lowercase characters are returned unchanged.
Code #1:
Python
# Import pandas package
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# converting and overwriting values in column
df["Name"]= df["Name"].str.lower()
print(df)
Output :
As shown in the output image of the data frame, all values in the name column have been converted into lower case.

In this example, we are using the nba.csv file.
Code #2:
Python
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv")
# converting and overwriting values in column
data["Team"]= data["Team"].str.upper()
# display
data
Output :
As shown in the output image of data frame, all values in the Team column have been converted into upper case.

Splitting and Replacing Data
To split data, use str.split(). Python’s built-in str.split() returns a list of strings after breaking a string at the specified separator, but it applies only to an individual string. The Pandas str.split() method applies to a whole Series. It must be called through the .str accessor, because a Series has no split() method of its own and calling it directly raises an AttributeError. To replace text inside strings, use str.replace(), which works like Python’s .replace() method but operates on every element of a Series. It must also be called through .str: Series.replace() without the accessor is a different method that replaces whole values rather than substrings.
Code #1:
Python
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Knnuaj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# dropping rows with null values to avoid errors
df.dropna(inplace = True)
# splitting Address at the first "a" into two new columns
# (the names Address_1 and Address_2 are illustrative)
df[["Address_1", "Address_2"]] = df["Address"].str.split("a", n = 1, expand = True)
# df display
print(df)
Output :
As shown in the output image, each Address value was split at the first occurrence of “a” only, not at later occurrences, because the n parameter was set to 1 (at most one split per string).
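The effect of the n and expand parameters can be sketched on a smaller Series. The names used here (full_names, parts_list, parts_df) are hypothetical; the sketch only assumes the str.split() and str.get() methods listed in this article.

```python
import pandas as pd

# Hypothetical sample data for illustration
full_names = pd.Series(["Jai Kumar Sharma", "Princi Singh", "Gaurav"])

# Without expand, each element becomes a list of parts
parts_list = full_names.str.split(" ", n=1)

# str.get() picks one position out of each list
first_names = parts_list.str.get(0)

# With expand=True, the parts are spread over DataFrame columns;
# rows with fewer parts are padded with missing values
parts_df = full_names.str.split(" ", n=1, expand=True)

print(parts_list)
print(first_names)
print(parts_df)
```

Because n=1 allows at most one split, "Jai Kumar Sharma" yields two parts rather than three.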

Code #2:
Python
# importing pandas module
import pandas as pd
# reading csv file
data = pd.read_csv("nba.csv")
# overwriting column with replaced value of age
# (Series.replace matches whole values, so no .str prefix is used here)
data["Age"]= data["Age"].replace(25.0, "Twenty five")
# creating a boolean mask for the age column
# where age = "Twenty five"
mask = data["Age"]=="Twenty five"
# keeping only the matching rows; dropna() also drops
# matching rows that have a missing value in any other column
data.where(mask).dropna()
Output :
As shown in the output image, all the values in Age column having age=25.0 have been replaced by “Twenty five”.
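The example above replaces whole values with Series.replace(). For substring replacement through the .str accessor, a minimal sketch might look like the following; the address data is made up for illustration, and the regex argument is passed explicitly so the behaviour does not depend on the installed Pandas version's default.

```python
import pandas as pd

# Hypothetical address data for illustration
addresses = pd.Series(["Nagpur junction", "Kanpur junction", "Allahabad"])

# Literal substring replacement inside every string
shortened = addresses.str.replace("junction", "jn", regex=False)

# Regular-expression replacement: drop a trailing " junction"
cleaned = addresses.str.replace(r"\s+junction$", "", regex=True)

print(shortened)
print(cleaned)
```

Strings that do not contain the pattern, such as "Allahabad", are returned unchanged.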

Concatenation of Data
To concatenate a Series or Index, use str.cat(), which joins other strings onto each string of the calling Series. Values from a different Series can be passed and are matched to the caller by index; a plain list or array must have the same length as the caller. The method is called through the .str accessor.
Code #1:
Python
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# making copy of address column
new = df["Address"].copy()
# concatenating address with name column
# overwriting name column
df["Name"]= df["Name"].str.cat(new, sep =", ")
# display
print(df)
Output :
As shown in the output image, every string in the Address column has been concatenated to the string with the same index in the Name column, using the separator “, ”.

Code #2:
Python
# importing pandas module
import pandas as pd
# reading csv file
data = pd.read_csv("nba.csv")
# making copy of team column
new = data["Team"].copy()
# concatenating team with name column
# overwriting name column
data["Name"]= data["Name"].str.cat(new, sep =", ")
# display
data
Output:
As shown in the output image, every string in the Team column has been concatenated to the string with the same index in the Name column, using the separator “, ”.
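When one of the inputs contains a missing value, the concatenated result is also missing unless the na_rep parameter of str.cat() supplies a placeholder. The following is a small sketch with made-up data; the placeholder text "unknown" is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the second city is missing
names = pd.Series(["Jai", "Princi", "Gaurav"])
cities = pd.Series(["Delhi", np.nan, "Allahabad"])

# Without na_rep, a missing value on either side makes the result missing
joined = names.str.cat(cities, sep=", ")

# With na_rep, the missing value is replaced by the given text
joined_filled = names.str.cat(cities, sep=", ", na_rep="unknown")

print(joined)
print(joined_filled)
```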

Removing Whitespace from Data
To remove whitespace, use str.strip(), str.rstrip() and str.lstrip(). These methods handle whitespace (including newlines) in any text data. As the names suggest, str.lstrip() removes whitespace from the left side of each string, str.rstrip() removes it from the right side, and str.strip() removes it from both sides. Since these are Pandas methods with the same names as Python’s built-in string methods, they are called through the .str accessor so that they apply to every element of the Series.
Code #1:
Python
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Nagpur junction', 'Kanpur junction',
'Nagpur junction', 'Kannuaj junction'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# replacing address name and adding spaces in start and end
new = df["Address"].replace("Nagpur junction", " Nagpur junction ").copy()
# checking with custom string
print(new.str.strip()==" Nagpur junction")
print(new.str.strip()=="Nagpur junction ")
print(new.str.strip()==" Nagpur junction ")
Output :
As shown in the output image, the comparison returns False for all 3 conditions, which means the spaces were successfully removed from both sides and the strings no longer have leading or trailing spaces.

Code #2:
Python
# importing pandas module
import pandas as pd
# making data frame
data = pd.read_csv("nba.csv")
# replacing team name and adding spaces in start and end
new = data["Team"].replace("Boston Celtics", " Boston Celtics ").copy()
# checking with custom removed space string
new.str.lstrip()=="Boston Celtics "
Output :
As shown in the output image, the comparison is True for the Boston Celtics rows after removing the left-side spaces.
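The difference between the three methods is easier to see on strings whose whitespace is written out explicitly. This is a minimal sketch with made-up team strings that contain spaces, a newline and a tab.

```python
import pandas as pd

# Hypothetical data with visible leading/trailing whitespace
teams = pd.Series(["  Boston Celtics  ", "Brooklyn Nets\n", "\tNew York Knicks"])

both_sides = teams.str.strip()    # removes whitespace on both sides
left_only = teams.str.lstrip()    # removes whitespace on the left only
right_only = teams.str.rstrip()   # removes whitespace on the right only

# Comparing lengths before and after shows how much was removed
print(teams.str.len().tolist())
print(both_sides.str.len().tolist())
```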

Extracting Data
To extract data, use str.extract(), which accepts a regular expression with at least one capture group. Extracting a regular expression with more than one group returns a DataFrame with one column per group. Elements that do not match return a row filled with NaN.
Code #1:
Python
# importing pandas module
import pandas as pd
# creating a series
s = pd.Series(['a1', 'b2', 'c3'])
# Extracting a data
n= s.str.extract(r'([ab])(\d)')
print(n)
Output :
As shown in the output image, a pattern with two groups returns a DataFrame with two columns. Non-matching elements are NaN.

Code #2:
Python
# importing pandas module
import pandas as pd
# creating a series
s = pd.Series(['a1', 'b2', 'c3'])
# Extracting a data
n = s.str.extract(r'(?P<Geeks>[ab])(?P<For>\d)')
print(n)
Output :
As shown in the output image, named groups become the column names in the result.
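With a single capture group, the shape of the result depends on the expand parameter of str.extract(). The sketch below reuses the same Series as the examples above; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])

# One capture group with the default expand=True: a one-column DataFrame
digits_df = s.str.extract(r"[ab](\d)")

# One capture group with expand=False: a Series
digits = s.str.extract(r"[ab](\d)", expand=False)

print(digits_df)
print(digits)
```

In both cases the element "c3" does not match the pattern, so its result is NaN.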

Pandas str methods:
FUNCTION | DESCRIPTION |
---|---|
str.lower() | Method to convert a string’s characters to lowercase |
str.upper() | Method to convert a string’s characters to uppercase |
str.find() | Method is used to search a substring in each string present in a series |
str.rfind() | Method is used to search a substring in each string present in a series from the Right side |
str.findall() | Method finds all occurrences of a pattern or regular expression in each string in a series |
str.isalpha() | Method is used to check if all characters in each string in series are alphabetic(a-z/A-Z) |
str.isdecimal() | Method is used to check whether all characters in a string are decimal |
str.title() | Method to capitalize the first letter of every word in a string |
str.len() | Method returns a count of the number of characters in a string |
str.replace() | Method replaces a substring within a string with another value that the user provides |
str.contains() | Method tests if pattern or regex is contained within a string of a Series or Index |
str.extract() | Extract groups from the first match of regular expression pattern. |
str.startswith() | Method tests if the start of each string element matches a pattern |
str.endswith() | Method tests if the end of each string element matches a pattern |
str.isdigit() | Method is used to check if all characters in each string in series are digits |
str.lstrip() | Method removes whitespace from the left side (beginning) of a string |
str.rstrip() | Method removes whitespace from the right side (end) of a string |
str.strip() | Method to remove leading and trailing whitespace from string |
str.split() | Method splits a string value, based on an occurrence of a user-specified value |
str.join() | Method is used to join all elements in list present in a series with passed delimiter |
str.cat() | Method is used to concatenate strings to the passed caller series of string. |
str.repeat() | Method repeats each string in a series the given number of times |
str.get() | Method is used to get the element at the passed position from each string or list in a series |
str.partition() | Method splits each string at the first occurrence of the separator and returns three parts: the text before it, the separator itself, and the text after it |
str.rpartition() | Method works like str.partition() but splits at the last occurrence of the separator |
str.pad() | Method to add padding (whitespaces or other characters) to every string element in a series |
str.swapcase() | Method to swap case of each string in a series |
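Several of the methods in the table are not covered by the examples earlier in this article. The following sketch tries a few of them on a small, made-up Series; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample data for illustration
names = pd.Series(["jai kumar", "princi singh", "gaurav"])

titled = names.str.title()                     # capitalize each word
lengths = names.str.len()                      # characters per string
has_ku = names.str.contains("ku", regex=False) # literal substring test
starts_p = names.str.startswith("p")           # prefix test
first_a = names.str.find("a")                  # index of first "a", -1 if absent
parts = names.str.partition(" ")               # before, separator, after

print(titled)
print(lengths)
print(has_ku)
print(starts_p)
print(first_a)
print(parts)
```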