Parquet is a columnar storage format optimized for distributed processing of large datasets. It is widely used in big-data systems such as Hadoop and Apache Spark.
A partitioned Parquet dataset is one that is split into multiple smaller files based on the values of one or more columns. Partitioning can significantly improve query performance by allowing the processing engine to read only the files it needs.
Concepts
- Parquet format: A columnar storage format that is optimized for distributed processing of large datasets.
- Partitioning: Dividing a dataset into smaller parts based on the values of one or more columns.
- Pandas DataFrame: A two-dimensional labeled data structure with columns of potentially different types.
- pyarrow: A Python package that provides a Python interface to the Arrow C++ library for working with columnar data.
Dataset link: [https://www.kaggle.com/datasets/pawankumargunjan/weather]
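Before reading a partitioned dataset, it helps to see how one is created on disk. Below is a minimal sketch (the toy DataFrame and the weather_partitioned output directory are illustrative, not part of the Kaggle dataset above): pandas can write one subdirectory per distinct value of a column via the partition_cols argument, which uses the pyarrow engine.
Python3
import pandas as pd

# Toy DataFrame for illustration only (hypothetical values)
df = pd.DataFrame({
    'Country': ['SCOTLAND', 'ENGLAND', 'ENGLAND'],
    'WindSpeed': [8, 2, 6],
})

# Writes weather_partitioned/Country=ENGLAND/... and
# weather_partitioned/Country=SCOTLAND/... on disk
df.to_parquet('weather_partitioned', partition_cols=['Country'])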
Example 1
Install the pyarrow package:
The pyarrow package provides a Python interface to the Arrow C++ library for working with columnar data. Install it with the following command:
pip install pyarrow
Import the pyarrow.parquet module:
Python3
import pyarrow.parquet as pq
Reading parquet data using pyarrow.parquet
Python
table = pq.read_table('weather.2016.parquet')
table
Output:
pyarrow.Table
ForecastSiteCode: int64
ObservationTime: int64
ObservationDate: timestamp[ms]
WindDirection: int64
WindSpeed: int64
WindGust: double
Visibility: double
ScreenTemperature: double
Pressure: double
SignificantWeatherCode: int64
SiteName: string
Latitude: double
Longitude: double
Region: string
Country: string
----
ForecastSiteCode: [[3002,3005,3008,3017,3023,...,3882,3002,3005,3008,3017],[3023,3026,3031,3034,3037,...,3797,3866,3872,3876,3882]]
ObservationTime: [[0,0,0,0,0,...,12,13,13,13,13],[13,13,13,13,13,...,23,23,23,23,23]]
ObservationDate: [[2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,...,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000],[2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,...,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000]]
WindDirection: [[12,10,8,6,10,...,4,8,8,8,8],[9,9,10,8,8,...,1,0,1,1,1]]
WindSpeed: [[8,2,6,8,30,...,5,19,18,19,19],[25,26,24,23,23,...,5,10,2,3,2]]
WindGust: [[null,null,null,null,37,...,null,null,null,null,29],[36,41,37,34,37,...,null,null,null,null,null]]
Visibility: [[30000,35000,50000,40000,2600,...,4000,8000,3500,11000,28000],[4600,9000,30000,10000,2900,...,22000,null,50000,null,35000]]
ScreenTemperature: [[2.1,0.1,2.8,1.6,9.8,...,10,-99,7.4,8.1,9.2],[9.1,9.5,10.2,9.7,9.9,...,4.9,8.4,3.5,6.1,3.7]]
Pressure: [[997,997,997,996,991,...,1030,null,1019,1020,1019],[1019,1018,1020,1020,1021,...,1019,1018,1019,1019,1019]]
SignificantWeatherCode: [[8,7,-99,8,11,...,1,5,15,12,7],[15,12,12,12,15,...,0,-99,0,-99,0]]
...
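If you only need the schema or row count, pyarrow can read the file's metadata without loading the data itself. A short sketch, assuming the same weather.2016.parquet file:
Python3
import pyarrow.parquet as pq

# FileMetaData exposes row/column counts without reading the data pages
meta = pq.ParquetFile('weather.2016.parquet').metadata
print(meta.num_rows, meta.num_columns)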
Print the shape of the dataset.
Python3
print(table.shape)
Output:
(194697, 15)
Convert the pyarrow Table into a pandas DataFrame.
Python3
df = table.to_pandas()
# Take the transpose so the DataFrame is easier to print.
df.head().T
Output:
|   | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| ForecastSiteCode | 3002 | 3005 | 3008 | 3017 | 3023 |
| ObservationTime | 0 | 0 | 0 | 0 | 0 |
| ObservationDate | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 |
| WindDirection | 12 | 10 | 8 | 6 | 10 |
| WindSpeed | 8 | 2 | 6 | 8 | 30 |
| WindGust | NaN | NaN | NaN | NaN | 37.0 |
| Visibility | 30000.0 | 35000.0 | 50000.0 | 40000.0 | 2600.0 |
| ScreenTemperature | 2.1 | 0.1 | 2.8 | 1.6 | 9.8 |
| Pressure | 997.0 | 997.0 | 997.0 | 996.0 | 991.0 |
| SignificantWeatherCode | 8 | 7 | -99 | 8 | 11 |
| SiteName | BALTASOUND (3002) | LERWICK (S. SCREEN) (3005) | FAIR ISLE (3008) | KIRKWALL (3017) | SOUTH UIST RANGE (3023) |
| Latitude | 60.749 | 60.139 | 59.53 | 58.954 | 57.358 |
| Longitude | -0.854 | -1.183 | -1.63 | -2.9 | -7.397 |
| Region | Orkney & Shetland | Orkney & Shetland | Orkney & Shetland | Orkney & Shetland | Highland & Eilean Siar |
| Country | SCOTLAND | SCOTLAND | None | SCOTLAND | SCOTLAND |
Example 2
Reading parquet data using pandas.read_parquet
Python
import pandas as pd
df = pd.read_parquet('weather.2016.parquet')
df.head()
Output:
|   | ForecastSiteCode | ObservationTime | ObservationDate | WindDirection | WindSpeed | WindGust | Visibility | ScreenTemperature | Pressure | SignificantWeatherCode | SiteName | Latitude | Longitude | Region | Country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3002 | 0 | 2016-02-01 | 12 | 8 | NaN | 30000.0 | 2.1 | 997.0 | 8 | BALTASOUND (3002) | 60.749 | -0.854 | Orkney & Shetland | SCOTLAND |
| 1 | 3005 | 0 | 2016-02-01 | 10 | 2 | NaN | 35000.0 | 0.1 | 997.0 | 7 | LERWICK (S. SCREEN) (3005) | 60.139 | -1.183 | Orkney & Shetland | SCOTLAND |
| 2 | 3008 | 0 | 2016-02-01 | 8 | 6 | NaN | 50000.0 | 2.8 | 997.0 | -99 | FAIR ISLE (3008) | 59.530 | -1.630 | Orkney & Shetland | None |
| 3 | 3017 | 0 | 2016-02-01 | 6 | 8 | NaN | 40000.0 | 1.6 | 996.0 | 8 | KIRKWALL (3017) | 58.954 | -2.900 | Orkney & Shetland | SCOTLAND |
| 4 | 3023 | 0 | 2016-02-01 | 10 | 30 | 37.0 | 2600.0 | 9.8 | 991.0 | 11 | SOUTH UIST RANGE (3023) | 57.358 | -7.397 | Highland & Eilean Siar | SCOTLAND |
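pandas.read_parquet also accepts an engine argument to pick the backend and a columns argument to restrict the read, analogous to the pyarrow example above. A sketch with illustrative column choices:
Python3
import pandas as pd

# engine='pyarrow' is the default when pyarrow is installed;
# columns limits the read to the listed fields
df = pd.read_parquet('weather.2016.parquet',
                     engine='pyarrow',
                     columns=['SiteName', 'ScreenTemperature'])
df.head()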
Example 3
Filtering the parquet data
Python
import pandas as pd
df = pd.read_parquet('weather.2016.parquet', filters=[('Country', '=', 'ENGLAND')])
df.head()
Output:
|   | ForecastSiteCode | ObservationTime | ObservationDate | WindDirection | WindSpeed | WindGust | Visibility | ScreenTemperature | Pressure | SignificantWeatherCode | SiteName | Latitude | Longitude | Region | Country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3134 | 0 | 2016-02-01 | 1 | 1 | NaN | 2100.0 | 4.3 | 999.0 | 15 | GLASGOW/BISHOPTON (3134) | 55.907 | -4.533 | Strathclyde | ENGLAND |
| 1 | 3210 | 0 | 2016-02-01 | 8 | 11 | 33.0 | 3100.0 | 8.8 | 1005.0 | 7 | ST. BEES HEAD (3210) | 54.518 | -3.615 | North West England | ENGLAND |
| 2 | 3212 | 0 | 2016-02-01 | 11 | 16 | NaN | 4800.0 | 11.6 | 1004.0 | 12 | KESWICK (3212) | 54.614 | -3.157 | North West England | ENGLAND |
| 3 | 3214 | 0 | 2016-02-01 | 11 | 24 | 34.0 | 10000.0 | 10.0 | 1005.0 | 8 | WALNEY ISLAND (3214) | 54.125 | -3.257 | North West England | ENGLAND |
| 4 | 3220 | 0 | 2016-02-01 | 16 | -99 | NaN | 25000.0 | 11.1 | 1002.0 | 7 | CARLISLE (3220) | 54.933 | -2.963 | North West England | ENGLAND |
This will filter the parquet data by Country='ENGLAND' and print the first 5 rows of the DataFrame.
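The same predicate can also be applied at the pyarrow level: read_table accepts the same filters syntax, so the sketch below should behave like the pandas call above.
Python3
import pyarrow.parquet as pq

# Row groups/partitions that cannot match the predicate are skipped
table = pq.read_table('weather.2016.parquet',
                      filters=[('Country', '=', 'ENGLAND')])
print(table.num_rows)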
Aggregating the parquet data
Python
grouped = df.groupby(['Country']).mean(numeric_only=True)
# Take the transpose so the DataFrame is easier to print.
print(grouped.T)
Output:
Country ENGLAND
ForecastSiteCode 3560.622936
ObservationTime 11.517152
WindDirection 8.534412
WindSpeed 7.770786
WindGust 36.035424
Visibility 22431.530727
ScreenTemperature 5.336209
Pressure 1011.335307
SignificantWeatherCode -3.614757
Latitude 52.354470
Longitude -1.586393
This groups the data by Country, calculates the mean of each numeric column, and prints the transposed result.
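groupby is not limited to a single mean; agg computes several statistics in one pass. A small sketch over the England subset (the choice of ScreenTemperature and the mean/max statistics is illustrative):
Python3
import pandas as pd

df = pd.read_parquet('weather.2016.parquet',
                     filters=[('Country', '=', 'ENGLAND')])

# Mean and max screen temperature per region
stats = df.groupby('Region')['ScreenTemperature'].agg(['mean', 'max'])
print(stats.head())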
Example 4
Multiple filters
Python
import pandas as pd
df = pd.read_parquet('weather.2016.parquet',
                     filters=[('Country', '=', 'ENGLAND'),
                              ('WindSpeed', '<', 7)])
# Take the transpose so the DataFrame is easier to print.
df.head().T
Output:
|   | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| ForecastSiteCode | 3134 | 3220 | 3839 | 3220 | 3839 |
| ObservationTime | 0 | 0 | 0 | 1 | 1 |
| ObservationDate | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 |
| WindDirection | 1 | 16 | 16 | 16 | 16 |
| WindSpeed | 1 | -99 | -99 | -99 | -99 |
| WindGust | NaN | NaN | NaN | NaN | NaN |
| Visibility | 2100.0 | 25000.0 | NaN | 15000.0 | NaN |
| ScreenTemperature | 4.3 | 11.1 | 12.1 | 11.2 | 12.3 |
| Pressure | 999.0 | 1002.0 | NaN | 1003.0 | NaN |
| SignificantWeatherCode | 15 | 7 | -99 | 8 | -99 |
| SiteName | GLASGOW/BISHOPTON (3134) | CARLISLE (3220) | EXETER AIRPORT (3839) | CARLISLE (3220) | EXETER AIRPORT (3839) |
| Latitude | 55.907 | 54.933 | 50.737 | 54.933 | 50.737 |
| Longitude | -4.533 | -2.963 | -3.405 | -2.963 | -3.405 |
| Region | Strathclyde | North West England | South West England | North West England | South West England |
| Country | ENGLAND | ENGLAND | ENGLAND | ENGLAND | ENGLAND |
This filters the parquet data by Country='ENGLAND' and WindSpeed < 7 and prints the first 5 rows of the DataFrame. (Note that the dataset's -99 sentinel values also satisfy WindSpeed < 7, which is why they appear in the output.)
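The filters argument also supports OR logic: a flat list of tuples is AND-ed together, while a list of lists is treated as disjunctive normal form, where each inner list is AND-ed and the inner lists are OR-ed. A sketch, assuming you wanted rows from either country:
Python3
import pandas as pd

# [[cond], [cond]] means cond OR cond; tuples inside one inner list are AND-ed
df = pd.read_parquet('weather.2016.parquet',
                     filters=[[('Country', '=', 'ENGLAND')],
                              [('Country', '=', 'SCOTLAND')]])
print(df['Country'].unique())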
Conclusion
In conclusion, partitioning parquet files is a powerful way to optimize data storage and querying performance. By partitioning data based on one or more columns, you can easily filter, sort, and aggregate data within a subset of partitions, rather than having to scan the entire dataset.
In this article, we covered two methods for reading partitioned parquet files in Python: pandas' read_parquet() function and pyarrow.parquet's read_table() function. We also provided several examples of how to read and filter partitioned parquet files using these methods with real-world weather data.
Overall, partitioning parquet files is an effective technique for optimizing data storage and retrieval. Whether you're dealing with big data or just trying to improve query performance, partitioning can help you get the most out of your data.