How to Read a Parquet File in Scala?
Last Updated: 01 Jul, 2024
Through Apache Spark, Scala has good support for reading Parquet files, a columnar storage format. Below is a comprehensive guide to reading Parquet files in Scala:
Setting Up Your Environment
First, set up a development environment with the necessary libraries and frameworks. If you're using SBT, include the following dependencies in your build.sbt file:
Scala
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2"
Initializing a SparkSession
In Spark, SparkSession is the entry point for reading data. You can create a SparkSession as shown below:
Scala
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
  .appName("ParquetExample")
  .master("local[*]")
  .getOrCreate()
Reading a Parquet File
Like any other Spark data source, a Parquet file is read with the read method on the SparkSession. Here's how you can do it:
Scala
val df = spark.read.parquet("path/to/your/parquet/file")
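The path can point either to a single file or to a directory containing many Parquet files. The short sketch below (the path is a placeholder) also prints the inferred schema, which is a quick way to confirm the file was read correctly:
Scala
// The path may be a single .parquet file or a directory of Parquet files.
// "path/to/your/parquet/data" is a placeholder; substitute your own location.
val df = spark.read.parquet("path/to/your/parquet/data")

// Parquet stores the schema with the data, so Spark infers it automatically.
df.printSchema()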
Displaying the Data
Once the Parquet file has been loaded into a DataFrame, many operations become possible on it. For instance, you can display the first few rows of the DataFrame:
Scala
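// show() prints the first 20 rows by default
df.show()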
Example: Reading and Displaying a Parquet File
Here’s a complete example that puts everything together:
Scala
import org.apache.spark.sql.SparkSession

object ParquetExample {
  def main(args: Array[String]): Unit = {
    // Initialize SparkSession
    val spark = SparkSession.builder
      .appName("ParquetExample")
      .master("local[*]")
      .getOrCreate()

    // Read Parquet file
    val df = spark.read.parquet("path/to/your/parquet/file")

    // Show the DataFrame content
    df.show()

    // Stop the SparkSession
    spark.stop()
  }
}
Additional Operations on DataFrame
Once your data is in a DataFrame, you can apply operations that range from very simple to complex. Here are some examples:
Selecting Specific Columns
Scala
df.select("column1", "column2").show()
Filtering Data
Scala
df.filter(df("column") > 10).show()
Grouping and Aggregation
Scala
df.groupBy("column").count().show()
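For more involved aggregations, Spark's built-in functions can be combined with groupBy. The following is a small sketch in which category and amount are hypothetical column names used only for illustration:
Scala
import org.apache.spark.sql.functions.{avg, count, max}

// "category" and "amount" are placeholder column names for this sketch.
df.groupBy("category")
  .agg(
    count("*").as("rowCount"),
    avg("amount").as("avgAmount"),
    max("amount").as("maxAmount")
  )
  .show()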
Writing Data Back to Parquet
You can also write the DataFrame back to a Parquet file:
Scala
df.write.parquet("path/to/output/parquet/file")
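If needed, the write can be tuned with a save mode and, for data you will later filter by a column's values, with partitionBy. The sketch below assumes a hypothetical year column and a placeholder output path:
Scala
// Overwrite existing output and lay the files out by the values of "year".
// "year" is a hypothetical partition column used only for illustration.
df.write
  .mode("overwrite")
  .partitionBy("year")
  .parquet("path/to/output/parquet/partitioned")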
Handling Nested Data
Parquet data can be deeply nested, and Spark handles nested structures well. If your Parquet file contains nested data, you can access nested fields using dot notation or the select method with expressions:
Scala
df.select("nestedField.innerField").show()
Performance Considerations
When working with Parquet files, consider the following best practices for performance; a short sketch combining them follows the list:
- Column Pruning: select only the columns you actually need, so Spark reads just those from disk.
- Predicate Pushdown: apply filters early so Spark can skip row groups that cannot match.
- Partitioning: if the data is partitioned, filter on the partition columns so Spark scans only the relevant directories.
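The sketch below combines these ideas. It assumes the partitioned output written earlier and hypothetical column names (column1, column2, amount, year); Spark pushes the select and the filters down to the Parquet reader where possible:
Scala
import org.apache.spark.sql.functions.col

val pruned = spark.read.parquet("path/to/output/parquet/partitioned")
  .select("column1", "column2", "amount", "year") // column pruning
  .filter(col("amount") > 10)                     // predicate pushdown
  .filter(col("year") === 2024)                   // partition pruning

pruned.show()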
Conclusion
Reading Parquet files in Scala with Apache Spark is straightforward and efficient, and it opens up rich possibilities for data processing and analysis. By following the steps above, you can quickly load, manipulate, and write Parquet data in your Scala applications.