Handling Large Datasets in Pandas
Last Updated: 29 Mar, 2025
Pandas is an excellent tool for working with smaller datasets, typically up to two or three gigabytes. Once a dataset grows beyond that, though, Pandas becomes problematic: it loads the entire dataset into memory before processing it, which can exhaust the available RAM. Even with smaller datasets, memory problems can arise, because preprocessing and modifications often create duplicate copies of the DataFrame.
Despite these challenges, several techniques let you handle larger datasets efficiently with Pandas in Python. Let's explore the methods that allow you to work with millions of records while keeping memory usage low.
How to handle Large Datasets in Python?
- Use Efficient Datatypes: Utilize more memory-efficient data types (e.g., int32 instead of int64, float32 instead of float64) to reduce memory usage.
- Load Less Data: Use the usecols parameter in pd.read_csv() to load only the necessary columns, reducing memory consumption.
- Sampling: For exploratory data analysis or testing, consider working with a sample of the dataset instead of the entire dataset.
- Chunking: Use the chunksize parameter in pd.read_csv() to read the dataset in smaller chunks, processing each chunk iteratively.
- Optimizing Pandas dtypes: Use the astype method to convert columns to more memory-efficient types after loading the data, if appropriate.
- Parallelizing Pandas with Dask: Use Dask, a parallel computing library, to scale Pandas workflows to larger-than-memory datasets by leveraging parallel processing.
Using Efficient Data Types:
- Reducing memory utilization in Pandas starts with using efficient data types. For instance, if the required precision allows it, you can use float32 or even float16 instead of the default float64 dtype. Similarly, if the data range permits, integer columns can be downcast to smaller integer types such as int8, int16, or int32.
- Benefits: Significantly reduces the memory footprint, particularly for large datasets.
- Implementation: When reading data, pass the dtype parameter to functions like pd.read_csv() or pd.read_sql(). Existing columns can also be converted to more memory-efficient types with the astype() method.
Python
import pandas as pd
# Define the size of the dataset
num_rows = 1000000 # 1 million rows
# Example DataFrame with inefficient datatypes
data = {'A': [1, 2, 3, 4],
        'B': [5.0, 6.0, 7.0, 8.0]}
df = pd.DataFrame(data)
# Replicate the DataFrame to create a larger dataset
df_large = pd.concat([df] * (num_rows // len(df)), ignore_index=True)
# Check memory usage before conversion
print("Memory usage before conversion:")
print(df_large.memory_usage().sum())
# Convert to more memory-efficient datatypes
df_large['A'] = pd.to_numeric(df_large['A'], downcast='integer')
df_large['B'] = pd.to_numeric(df_large['B'], downcast='float')
# Check memory usage after conversion
print("Memory usage after conversion:")
print(df_large.memory_usage().sum())
Output:
Memory usage before conversion:
16000128
Memory usage after conversion:
5000128
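The same downcasting can be applied while loading the data. Below is a minimal sketch, assuming a hypothetical data.csv file whose columns A and B fit into int32 and float32 (the filename, column names, and target types are illustrative, not part of the example above):
Python
import pandas as pd

# Hypothetical file and column names, used purely for illustration.
# Passing dtype to read_csv avoids ever materializing the wider
# int64/float64 columns in memory.
dtypes = {'A': 'int32', 'B': 'float32'}
df = pd.read_csv('data.csv', dtype=dtypes)

print(df.dtypes)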
Load Less Data
- Overview: This technique loads only the columns that are actually needed from the dataset. It is especially helpful when a dataset has many columns or when the analysis only requires a subset of them.
- Benefits: Reduces memory consumption and speeds up processing.
- Implementation: Use the usecols parameter in functions such as pd.read_csv() to choose which columns to load. The in-memory example below illustrates the idea by selecting a subset of columns; a read_csv-based sketch follows the output.
Python
import pandas as pd
# Create sample DataFrame
data = {'A': range(1000),
        'B': range(1000),
        'C': range(1000),
        'D': range(1000)}
df = pd.DataFrame(data)

# Keep only the columns needed for the analysis
df_subset = df[['A', 'D']]
print('Specific Columns of the DataFrame')
print(df_subset.head())
Output:
Specific Columns of the DataFrame
A D
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
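When the data lives in a file, the same effect is achieved at load time with usecols, so the unwanted columns are never read into memory. A minimal sketch, assuming a hypothetical data.csv containing columns A through D:
Python
import pandas as pd

# Only columns A and D are parsed; B and C never enter memory
df_subset = pd.read_csv('data.csv', usecols=['A', 'D'])

print(df_subset.head())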
Sampling:
- Sampling means selecting a random subset of the dataset for analysis. It is useful for quickly exploring the data or building models on a representative sample instead of the full dataset.
- Benefits: Makes analysis and experimentation faster, especially on large datasets.
- Implementation: Use Pandas' sample() method to randomly select rows (or columns) from the DataFrame.
Python
import pandas as pd
# Create sample DataFrame
data = {'A': range(1000),
        'B': range(1000),
        'C': range(1000),
        'D': range(1000)}
df = pd.DataFrame(data)

# Sample 10% of the rows; random_state makes the sample reproducible
df_sample = df.sample(frac=0.1, random_state=42)
print(df_sample.head())
Output:
A B C D
521 521 521 521 521
737 737 737 737 737
740 740 740 740 740
660 660 660 660 660
411 411 411 411 411
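Sampling can also be done while reading the file, so the full dataset never enters memory. A minimal sketch, assuming a hypothetical data.csv and an approximate 10% sample, using the callable form of read_csv's skiprows parameter:
Python
import random

import pandas as pd

random.seed(42)

# Keep the header row (index 0) and roughly 10% of the remaining rows;
# skipped rows are never parsed, so memory usage stays low
df_sample = pd.read_csv(
    'data.csv',
    skiprows=lambda i: i > 0 and random.random() > 0.1
)

print(len(df_sample))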
Chunking:
- Rather than loading the complete dataset into memory at once, chunking processes the dataset in smaller, more manageable pieces. This is particularly helpful when a dataset is too large to fit in memory.
- Benefits: Allows huge datasets to be processed on machines with limited memory.
- Implementation: Use the chunksize argument in functions such as pd.read_csv() to specify the number of rows to read at a time. The in-memory example below simulates chunking by grouping on the index; a read_csv-based sketch follows the output.
Python
import pandas as pd
# Create sample DataFrame
data = {'A': range(10000),
        'B': range(10000)}
df = pd.DataFrame(data)

# Process the data in chunks of 1000 rows; grouping on the integer-divided
# index yields (chunk_number, chunk_DataFrame) pairs
chunk_size = 1000
for chunk in df.groupby(df.index // chunk_size):
    print(chunk)
Output:
(0, A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
.. ... ...
995 995 995
996 996 996
997 997 997
998 998 998
999 999 999
[1000 rows x 2 columns])
(1, A B
1000 1000 1000
1001 1001 1001
1002 1002 1002
1003 1003 1003
1004 1004 1004
... ... ...
1995 1995 1995
1996 1996 1996
1997 1997 1997
1998 1998 1998
1999 1999 1999
[1000 rows x 2 columns])
(2, A B
2000 2000 2000
2001 2001 2001
2002 2002 2002
2003 2003 2003
2004 2004 2004
... ... ...
2995 2995 2995
2996 2996 2996
2997 2997 2997
2998 2998 2998
2999 2999 2999
[1000 rows x 2 columns])
(3, A B
3000 3000 3000
3001 3001 3001
3002 3002 3002
3003 3003 3003
3004 3004 3004
... ... ...
3995 3995 3995
3996 3996 3996
3997 3997 3997
3998 3998 3998
3999 3999 3999
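When the data comes from a file, the chunksize parameter of pd.read_csv() provides the same pattern without ever holding the whole dataset in memory. A minimal sketch, assuming a hypothetical data.csv with a numeric column B, that accumulates a running sum chunk by chunk:
Python
import pandas as pd

chunk_size = 1000
total = 0

# read_csv with chunksize returns an iterator of DataFrames,
# each holding at most chunk_size rows
for chunk in pd.read_csv('data.csv', chunksize=chunk_size):
    # Process each chunk independently, e.g. aggregate one column
    total += chunk['B'].sum()

print(total)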
Optimizing Pandas dtypes:
- Overview: Identify columns stored with less efficient data types and convert them to more compact ones. This can substantially reduce memory usage and improve performance.
- Benefits: Minimizes the memory footprint and increases processing speed.
- Implementation: Use the astype() method to convert columns to more efficient data types, and functions such as pd.to_datetime() or pd.to_numeric() to convert columns to datetime or numeric types, respectively.
Python
import pandas as pd
# Create sample DataFrame
data = {'date_column': ['2022-01-01', '2022-01-02', '2022-01-03'],
        'numeric_column': [1.234, 2.345, 3.456]}
df = pd.DataFrame(data)
# Convert inefficient dtypes
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'], downcast='float')
print(df.dtypes)
Output:
date_column datetime64[ns]
numeric_column float32
dtype: object
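The astype() method mentioned above works the same way when the target type is known in advance. A minimal, self-contained sketch of downcasting an integer column whose values fit into 8 bits:
Python
import pandas as pd

df = pd.DataFrame({'counts': [10, 20, 30, 40]})
print(df['counts'].dtype)   # int64 on most platforms

# The values fit comfortably in 8 bits, so int8 is a safe target here
df['counts'] = df['counts'].astype('int8')
print(df['counts'].dtype)   # int8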
Parallelizing Pandas with Dask:
- Dask is a parallel computing library that integrates well with Pandas and provides parallelized operations for large datasets. It lets you scale Pandas workflows across multiple cores or even distributed clusters.
- Benefits: Executes Pandas-style operations in parallel, greatly reducing processing times for huge datasets.
- Implementation: Use Dask data structures such as the Dask DataFrame (dask.dataframe) and Dask Array (dask.array) to run parallelized operations on large datasets. Dask supports most of the familiar Pandas API, which makes moving existing codebases to parallel execution straightforward.
Python
import dask.dataframe as dd
import pandas as pd
# Create sample DataFrame
data = {'A': range(10000),
        'B': range(10000)}
df = pd.DataFrame(data)

# Convert the Pandas DataFrame into a Dask DataFrame with 4 partitions
ddf = dd.from_pandas(df, npartitions=4)
# Perform parallelized operations
result = ddf.groupby('A').mean().compute()
print(result)
Output:
B
A
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
... ...
9995 9995.0
9996 9996.0
9997 9997.0
9998 9998.0
9999 9999.0
[10000 rows x 1 columns]
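For datasets that never fit in memory at all, Dask can also read the data directly from files instead of converting an existing Pandas DataFrame. A minimal sketch, assuming a hypothetical data.csv with columns A and B (dd.read_csv also accepts glob patterns such as 'data-*.csv'):
Python
import dask.dataframe as dd

# The file is read lazily and split into partitions; nothing is
# loaded until .compute() is called
ddf = dd.read_csv('data.csv')

# Operations build a task graph that runs in parallel on compute()
result = ddf.groupby('A')['B'].mean().compute()

print(result)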