How to read parquet file in Scala?

Last Updated : 01 Jul, 2024

Scala has good support through Apache Spark for reading Parquet files, a columnar storage format. Below is a comprehensive guide to reading Parquet files in Scala:

Setting Up Your Environment

First, set up a development environment with the necessary libraries and frameworks. If you're using SBT, add the following dependencies to your 'build.sbt' file:

Scala
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2"

Initializing a SparkSession

In Spark, 'SparkSession' is the entry point for reading data. You can create a 'SparkSession' as shown below:

Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("ParquetExample")
  .master("local[*]")
  .getOrCreate()

Reading a Parquet File

Like any other Spark data source, you can read a Parquet file using the 'read' method of the 'SparkSession'. Here's how you can do it:

Scala
val df = spark.read.parquet("path/to/your/parquet/file")
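
A Parquet path can also point at a directory of files, and the reader accepts options. As a hedged sketch (the paths are placeholders), the 'mergeSchema' option tells Spark to reconcile schemas across files that have evolved over time:

Scala
// Read every Parquet file under a directory (path is a placeholder)
val dirDf = spark.read.parquet("path/to/your/parquet/dir")

// Merge schemas across files written with slightly different schemas
val mergedDf = spark.read.option("mergeSchema", "true").parquet("path/to/your/parquet/dir")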

Displaying the Data

Once the Parquet file is loaded into a DataFrame, many operations become possible on it. For instance, you can display the first few rows of the DataFrame:

Scala
df.show()
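
Because Parquet files store their schema in the file metadata, it is often worth inspecting what Spark loaded before going further. This snippet only uses the DataFrame created above:

Scala
// Print the schema Spark read from the Parquet metadata
df.printSchema()

// Count the rows in the file
println(df.count())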

Example: Reading and Displaying a Parquet File

Here’s a complete example that puts everything together:

Scala
import org.apache.spark.sql.SparkSession

object ParquetExample {
  def main(args: Array[String]): Unit = {
    // Initialize SparkSession
    val spark = SparkSession.builder
      .appName("ParquetExample")
      .master("local[*]")
      .getOrCreate()

    // Read Parquet file
    val df = spark.read.parquet("path/to/your/parquet/file")

    // Show the DataFrame content
    df.show()

    // Stop the SparkSession
    spark.stop()
  }
}

Additional Operations on DataFrame

Once your data is in a DataFrame, you can apply Spark DataFrame operations ranging from the very simple to the complex. Here are some examples:

Selecting Specific Columns

Scala
df.select("column1", "column2").show()

Filtering Data

Scala
df.filter(df("column") > 10).show()

Grouping and Aggregation

Scala
df.groupBy("column").count().show()
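
Beyond a simple count, you can compute other aggregates with functions from 'org.apache.spark.sql.functions'. In this sketch the column names are placeholders for whatever your data actually contains:

Scala
import org.apache.spark.sql.functions.{avg, max}

// Average and maximum of a numeric column per group (column names are placeholders)
df.groupBy("column")
  .agg(avg("numericColumn").alias("avgValue"), max("numericColumn").alias("maxValue"))
  .show()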

Writing Data Back to Parquet

You can also write the DataFrame back to a Parquet file:

Scala
df.write.parquet("path/to/output/parquet/file")
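
By default the write fails if the output path already exists; a save mode controls that behaviour, and 'partitionBy' splits the output into one directory per value of a column. The column name and path below are placeholders:

Scala
// Overwrite any existing output and partition the files by a column
df.write
  .mode("overwrite")
  .partitionBy("column")
  .parquet("path/to/output/parquet/file")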

Handling Nested Data

Parquet data can be deeply nested, and Spark handles such structures well. If your Parquet file contains nested data, you can access nested fields using dot notation or the 'select' method with expressions:

Scala
df.select("nestedField.innerField").show()

Performance Considerations

When working with Parquet files, consider the following best practices for performance (a short sketch after this list illustrates all three):

  • Column Pruning: Read only the columns you actually need.
  • Predicate Pushdown: Apply filters when reading so Spark can skip rows (and whole row groups) that cannot match.
  • Partitioning: If the data is partitioned on disk, include the partition columns in your filters so Spark can skip entire partitions.
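
The sketch below illustrates all three ideas together. The partitioned layout and the column names ('year', 'amount', 'name') are assumptions for illustration only:

Scala
// Assume the data is partitioned on disk by a 'year' column (an assumption for this sketch)
val sales = spark.read.parquet("path/to/partitioned/parquet")

sales
  .filter(sales("year") === 2023)   // partition pruning: only matching directories are read
  .filter(sales("amount") > 100)    // predicate pushdown: row groups that cannot match are skipped
  .select("name", "amount")         // column pruning: only these columns are decoded
  .show()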

Conclusion

Reading Parquet files in Scala with Apache Spark is straightforward and efficient, and it opens up rich opportunities for data processing and analysis. By applying the steps above, you can quickly load, manipulate, and store Parquet data in your Scala applications.

