Big Data File Formats For Data Engineers
A file format in the context of big data refers to the structure and organization in which
data is stored within files. It determines how data is encoded, stored, and represented,
influencing factors like storage efficiency, data processing speed, and compatibility with
various tools and systems.
How you store data in your data lake is critical: you need to consider the file format, the compression codec, and especially how you partition your data. Storing big data is expensive on its own, and once you add the CPU, I/O, and network costs of processing it, the bill grows quickly. In short, larger datasets mean higher costs.
AVRO:
● Avro is an open-source, row-based, language-neutral, schema-based serialization system and object container file format. Avro stores the data definition (the schema) in JSON, making it easy to read and interpret, while the data itself is stored in a compact, space-efficient binary format.
● One of Avro's key features is its support for schema evolution. As data
requirements change over time, Avro permits the evolution of schemas without
breaking compatibility with existing data. New fields can be added, fields can be
renamed, and default values can be specified, all while maintaining the ability to
read older data.
● Avro's binary encoding, coupled with the ability to use various compression codecs, contributes to efficient storage and reduced data transfer times (see the sketch after this list).
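To make this concrete, here is a minimal sketch of writing and reading an Avro object container file in Python, assuming the third-party fastavro library is available. The schema, field names, and file name are illustrative only; the optional email field with a default shows how a field added later stays compatible with data written before it existed.

from fastavro import writer, reader, parse_schema

# The schema is a plain JSON-style definition. The "email" field is treated
# as one added later, with a default, so older files remain readable.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "namespace": "example.avro",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Grace", "email": None},
]

# Records are encoded in binary inside the container file; "deflate" is one
# of the compression codecs Avro supports.
with open("users.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

with open("users.avro", "rb") as src:
    for record in reader(src):
        print(record)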
Parquet:
● Parquet is a columnar storage file format widely utilized in the domain of big data
processing and analytics. Developed within the Apache Hadoop ecosystem,
Parquet optimizes data storage and query performance.
● Parquet organizes data by columns rather than rows, enabling efficient
compression and faster analytical queries.
● Parquet supports schema evolution, allowing changes to the data schema
without compromising backward compatibility. This flexibility is crucial as data
structures evolve over time, ensuring seamless data processing.
● While Parquet is excellent for analytical use cases, it is less well suited to transactional workloads or scenarios that require frequent updates and deletes, due to its append-only nature (a short write/read sketch follows this list).
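As a rough illustration of the columnar layout, the sketch below writes and reads a small Parquet file using pyarrow (an assumption; engines such as Spark or pandas work similarly). The table, column names, and file name are made up; the read call pulls only the columns a query needs, which is where the columnar format saves I/O.

import pyarrow as pa
import pyarrow.parquet as pq

# A tiny in-memory table; in practice this would come from a pipeline or query.
table = pa.table({
    "event_id": [1, 2, 3],
    "event_type": ["click", "view", "click"],
    "value": [0.5, 1.2, 0.7],
})

# Columns are stored and compressed independently; Snappy is a common codec.
pq.write_table(table, "events.parquet", compression="snappy")

# Reading back only the needed columns skips the rest of the file entirely.
subset = pq.read_table("events.parquet", columns=["event_type", "value"])
print(subset)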
ORC file format:
● ORC (Optimized Row Columnar) is a columnar storage file format designed for
high-performance data processing in big data environments. Developed within
the Apache Hive project, ORC improves storage efficiency and query speed by
storing data in columns rather than rows, enabling efficient compression and
faster data access.
● The ORC file format stores collections of rows in a single file, in a columnar
format within the file. This enables parallel processing of row collections across a
cluster. Due to the columnar layout, each file is well suited to compression, enabling the skipping of data and columns to reduce read and decompression loads.
● It significantly reduces I/O operations and boosts overall data processing efficiency; a short write/read sketch follows below.
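For completeness, here is a similar sketch for ORC, again assuming pyarrow (built with ORC support, which not every install includes); the file and column names are illustrative. Column pruning on read works the same way as with Parquet.

import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({
    "user_id": [101, 102, 103],
    "country": ["DE", "US", "IN"],
    "purchases": [3, 7, 1],
})

# Rows are grouped into stripes and stored column by column inside the file.
orc.write_table(table, "users.orc")

# Reading selected columns avoids decompressing the ones a query does not touch.
subset = orc.ORCFile("users.orc").read(columns=["country", "purchases"])
print(subset)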
Here’s a quick comparison of the three main big data file formats: