Parquet is a columnar storage format optimized for distributed processing of large datasets. It is widely used in big-data systems such as Hadoop and Apache Spark.
A partitioned Parquet dataset is one that is split into multiple smaller files based on the values of one or more columns. Partitioning can significantly improve query performance by allowing the processing engine to read only the files it needs.
Concepts
- Parquet format: A columnar storage format that is optimized for distributed processing of large datasets.
- Partitioning: Dividing a dataset into smaller parts based on the values of one or more columns.
- Pandas DataFrame: A two-dimensional labeled data structure with columns of potentially different types.
- pyarrow: A Python package that provides a Python interface to the Arrow C++ library for working with columnar data.
Dataset link: [https://www.kaggle.com/datasets/pawankumargunjan/weather]
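Before reading a partitioned dataset, it helps to see how one is created on disk. Below is a minimal sketch (the toy DataFrame and the weather_partitioned output directory are illustrative, not part of the Kaggle dataset above): pandas can write one subdirectory per distinct value of a column via the partition_cols argument, which uses the pyarrow engine.
Python3
import pandas as pd

# Toy DataFrame for illustration only (hypothetical values)
df = pd.DataFrame({
    'Country': ['SCOTLAND', 'ENGLAND', 'ENGLAND'],
    'WindSpeed': [8, 2, 6],
})

# Writes weather_partitioned/Country=ENGLAND/... and
# weather_partitioned/Country=SCOTLAND/... on disk
df.to_parquet('weather_partitioned', partition_cols=['Country'])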
Example 1
Install the pyarrow package:
The pyarrow package provides a Python interface to the Arrow C++ library for working with columnar data. Install it with the following command:
pip install pyarrow
Import the pyarrow.parquet module:
Python3
import pyarrow.parquet as pq
Reading parquet data using pyarrow.parquet
Python
table = pq.read_table('weather.2016.parquet')
table
Output:
pyarrow.Table
ForecastSiteCode: int64
ObservationTime: int64
ObservationDate: timestamp[ms]
WindDirection: int64
WindSpeed: int64
WindGust: double
Visibility: double
ScreenTemperature: double
Pressure: double
SignificantWeatherCode: int64
SiteName: string
Latitude: double
Longitude: double
Region: string
Country: string
----
ForecastSiteCode: [[3002,3005,3008,3017,3023,...,3882,3002,3005,3008,3017],[3023,3026,3031,3034,3037,...,3797,3866,3872,3876,3882]]
ObservationTime: [[0,0,0,0,0,...,12,13,13,13,13],[13,13,13,13,13,...,23,23,23,23,23]]
ObservationDate: [[2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,2016-02-01 00:00:00.000,...,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000],[2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,2016-03-12 00:00:00.000,...,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000,2016-03-31 00:00:00.000]]
WindDirection: [[12,10,8,6,10,...,4,8,8,8,8],[9,9,10,8,8,...,1,0,1,1,1]]
WindSpeed: [[8,2,6,8,30,...,5,19,18,19,19],[25,26,24,23,23,...,5,10,2,3,2]]
WindGust: [[null,null,null,null,37,...,null,null,null,null,29],[36,41,37,34,37,...,null,null,null,null,null]]
Visibility: [[30000,35000,50000,40000,2600,...,4000,8000,3500,11000,28000],[4600,9000,30000,10000,2900,...,22000,null,50000,null,35000]]
ScreenTemperature: [[2.1,0.1,2.8,1.6,9.8,...,10,-99,7.4,8.1,9.2],[9.1,9.5,10.2,9.7,9.9,...,4.9,8.4,3.5,6.1,3.7]]
Pressure: [[997,997,997,996,991,...,1030,null,1019,1020,1019],[1019,1018,1020,1020,1021,...,1019,1018,1019,1019,1019]]
SignificantWeatherCode: [[8,7,-99,8,11,...,1,5,15,12,7],[15,12,12,12,15,...,0,-99,0,-99,0]]
...
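If you only need the schema or row count, pyarrow can read the file's metadata without loading the data itself. A short sketch, assuming the same weather.2016.parquet file:
Python3
import pyarrow.parquet as pq

# FileMetaData exposes row/column counts without reading the data pages
meta = pq.ParquetFile('weather.2016.parquet').metadata
print(meta.num_rows, meta.num_columns)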
Print the shape of the dataset.
Python3
print(table.shape)
Output:
(194697, 15)
Convert the pyarrow Table into a pandas DataFrame.
Python3
df = table.to_pandas()
# Take the transpose so the DataFrame is easier to print.
df.head().T
Output:
|   | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| ForecastSiteCode | 3002 | 3005 | 3008 | 3017 | 3023 |
| ObservationTime | 0 | 0 | 0 | 0 | 0 |
| ObservationDate | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 |
| WindDirection | 12 | 10 | 8 | 6 | 10 |
| WindSpeed | 8 | 2 | 6 | 8 | 30 |
| WindGust | NaN | NaN | NaN | NaN | 37.0 |
| Visibility | 30000.0 | 35000.0 | 50000.0 | 40000.0 | 2600.0 |
| ScreenTemperature | 2.1 | 0.1 | 2.8 | 1.6 | 9.8 |
| Pressure | 997.0 | 997.0 | 997.0 | 996.0 | 991.0 |
| SignificantWeatherCode | 8 | 7 | -99 | 8 | 11 |
| SiteName | BALTASOUND (3002) | LERWICK (S. SCREEN) (3005) | FAIR ISLE (3008) | KIRKWALL (3017) | SOUTH UIST RANGE (3023) |
| Latitude | 60.749 | 60.139 | 59.53 | 58.954 | 57.358 |
| Longitude | -0.854 | -1.183 | -1.63 | -2.9 | -7.397 |
| Region | Orkney & Shetland | Orkney & Shetland | Orkney & Shetland | Orkney & Shetland | Highland & Eilean Siar |
| Country | SCOTLAND | SCOTLAND | None | SCOTLAND | SCOTLAND |
Example 2
Reading parquet data using pandas.read_parquet
Python
import pandas as pd
df = pd.read_parquet('weather.2016.parquet')
df.head()
Output:
|   | ForecastSiteCode | ObservationTime | ObservationDate | WindDirection | WindSpeed | WindGust | Visibility | ScreenTemperature | Pressure | SignificantWeatherCode | SiteName | Latitude | Longitude | Region | Country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3002 | 0 | 2016-02-01 | 12 | 8 | NaN | 30000.0 | 2.1 | 997.0 | 8 | BALTASOUND (3002) | 60.749 | -0.854 | Orkney & Shetland | SCOTLAND |
| 1 | 3005 | 0 | 2016-02-01 | 10 | 2 | NaN | 35000.0 | 0.1 | 997.0 | 7 | LERWICK (S. SCREEN) (3005) | 60.139 | -1.183 | Orkney & Shetland | SCOTLAND |
| 2 | 3008 | 0 | 2016-02-01 | 8 | 6 | NaN | 50000.0 | 2.8 | 997.0 | -99 | FAIR ISLE (3008) | 59.530 | -1.630 | Orkney & Shetland | None |
| 3 | 3017 | 0 | 2016-02-01 | 6 | 8 | NaN | 40000.0 | 1.6 | 996.0 | 8 | KIRKWALL (3017) | 58.954 | -2.900 | Orkney & Shetland | SCOTLAND |
| 4 | 3023 | 0 | 2016-02-01 | 10 | 30 | 37.0 | 2600.0 | 9.8 | 991.0 | 11 | SOUTH UIST RANGE (3023) | 57.358 | -7.397 | Highland & Eilean Siar | SCOTLAND |
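pandas.read_parquet also accepts an engine argument to pick the backend and a columns argument to restrict the read, analogous to the pyarrow example above. A sketch with illustrative column choices:
Python3
import pandas as pd

# engine='pyarrow' is the default when pyarrow is installed;
# columns limits the read to the listed fields
df = pd.read_parquet('weather.2016.parquet',
                     engine='pyarrow',
                     columns=['SiteName', 'ScreenTemperature'])
df.head()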
Example 3
Filtering the parquet data
Python
import pandas as pd
df = pd.read_parquet('weather.2016.parquet', filters=[('Country', '=', 'ENGLAND')])
df.head()
Output:
|   | ForecastSiteCode | ObservationTime | ObservationDate | WindDirection | WindSpeed | WindGust | Visibility | ScreenTemperature | Pressure | SignificantWeatherCode | SiteName | Latitude | Longitude | Region | Country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3134 | 0 | 2016-02-01 | 1 | 1 | NaN | 2100.0 | 4.3 | 999.0 | 15 | GLASGOW/BISHOPTON (3134) | 55.907 | -4.533 | Strathclyde | ENGLAND |
| 1 | 3210 | 0 | 2016-02-01 | 8 | 11 | 33.0 | 3100.0 | 8.8 | 1005.0 | 7 | ST. BEES HEAD (3210) | 54.518 | -3.615 | North West England | ENGLAND |
| 2 | 3212 | 0 | 2016-02-01 | 11 | 16 | NaN | 4800.0 | 11.6 | 1004.0 | 12 | KESWICK (3212) | 54.614 | -3.157 | North West England | ENGLAND |
| 3 | 3214 | 0 | 2016-02-01 | 11 | 24 | 34.0 | 10000.0 | 10.0 | 1005.0 | 8 | WALNEY ISLAND (3214) | 54.125 | -3.257 | North West England | ENGLAND |
| 4 | 3220 | 0 | 2016-02-01 | 16 | -99 | NaN | 25000.0 | 11.1 | 1002.0 | 7 | CARLISLE (3220) | 54.933 | -2.963 | North West England | ENGLAND |
This will filter the parquet data by Country='ENGLAND' and print the first 5 rows of the DataFrame.
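The same predicate can also be applied at the pyarrow level: read_table accepts the same filters syntax, so the sketch below should behave like the pandas call above.
Python3
import pyarrow.parquet as pq

# Row groups/partitions that cannot match the predicate are skipped
table = pq.read_table('weather.2016.parquet',
                      filters=[('Country', '=', 'ENGLAND')])
print(table.num_rows)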
Aggregating the parquet data
Python
grouped = df.groupby(['Country']).mean(numeric_only=True)
# Take the transpose so the DataFrame is easier to print.
print(grouped.T)
Output:
Country ENGLAND
ForecastSiteCode 3560.622936
ObservationTime 11.517152
WindDirection 8.534412
WindSpeed 7.770786
WindGust 36.035424
Visibility 22431.530727
ScreenTemperature 5.336209
Pressure 1011.335307
SignificantWeatherCode -3.614757
Latitude 52.354470
Longitude -1.586393
This groups the data by Country, calculates the mean of each numeric column, and prints the transposed result.
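groupby is not limited to a single mean; agg computes several statistics in one pass. A small sketch over the England subset (the choice of ScreenTemperature and the mean/max statistics is illustrative):
Python3
import pandas as pd

df = pd.read_parquet('weather.2016.parquet',
                     filters=[('Country', '=', 'ENGLAND')])

# Mean and max screen temperature per region
stats = df.groupby('Region')['ScreenTemperature'].agg(['mean', 'max'])
print(stats.head())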
Example 4
Multiple filters
Python
import pandas as pd
df = pd.read_parquet('weather.2016.parquet',
                     filters=[('Country', '=', 'ENGLAND'),
                              ('WindSpeed', '<', 7)])
# Take the transpose so the DataFrame is easier to print.
df.head().T
Output:
|   | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| ForecastSiteCode | 3134 | 3220 | 3839 | 3220 | 3839 |
| ObservationTime | 0 | 0 | 0 | 1 | 1 |
| ObservationDate | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 | 2016-02-01 00:00:00 |
| WindDirection | 1 | 16 | 16 | 16 | 16 |
| WindSpeed | 1 | -99 | -99 | -99 | -99 |
| WindGust | NaN | NaN | NaN | NaN | NaN |
| Visibility | 2100.0 | 25000.0 | NaN | 15000.0 | NaN |
| ScreenTemperature | 4.3 | 11.1 | 12.1 | 11.2 | 12.3 |
| Pressure | 999.0 | 1002.0 | NaN | 1003.0 | NaN |
| SignificantWeatherCode | 15 | 7 | -99 | 8 | -99 |
| SiteName | GLASGOW/BISHOPTON (3134) | CARLISLE (3220) | EXETER AIRPORT (3839) | CARLISLE (3220) | EXETER AIRPORT (3839) |
| Latitude | 55.907 | 54.933 | 50.737 | 54.933 | 50.737 |
| Longitude | -4.533 | -2.963 | -3.405 | -2.963 | -3.405 |
| Region | Strathclyde | North West England | South West England | North West England | South West England |
| Country | ENGLAND | ENGLAND | ENGLAND | ENGLAND | ENGLAND |
This filters the parquet data by Country='ENGLAND' and WindSpeed < 7 and prints the first 5 rows of the DataFrame. (Note that the dataset's -99 sentinel values also satisfy WindSpeed < 7, which is why they appear in the output.)
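The filters argument also supports OR logic: a flat list of tuples is AND-ed together, while a list of lists is treated as disjunctive normal form, where each inner list is AND-ed and the inner lists are OR-ed. A sketch, assuming you wanted rows from either country:
Python3
import pandas as pd

# [[cond], [cond]] means cond OR cond; tuples inside one inner list are AND-ed
df = pd.read_parquet('weather.2016.parquet',
                     filters=[[('Country', '=', 'ENGLAND')],
                              [('Country', '=', 'SCOTLAND')]])
print(df['Country'].unique())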
Conclusion
In conclusion, partitioning parquet files is a powerful way to optimize data storage and querying performance. By partitioning data based on one or more columns, you can easily filter, sort, and aggregate data within a subset of partitions, rather than having to scan the entire dataset.
In this article, we covered two methods for reading partitioned parquet files in Python: pandas' read_parquet() function and pyarrow.parquet's read_table() function. We also provided several examples of how to read and filter partitioned parquet files using these methods with real-world weather data.
Overall, partitioning parquet files is an effective technique for optimizing data storage and retrieval. Whether you're dealing with big data or just trying to improve query performance, partitioning can help you get the most out of your data.