Big Data File Formats
In this blog, I will talk about what file formats actually are, go
through some common Hadoop file format features, and give a
little advice on which format you should be using.
When we process Big Data, the cost of storing it is high (Hadoop stores data redundantly to achieve fault tolerance). On top of the storage cost, processing the data incurs CPU, network, and IO costs. As the data grows, the costs of storage and processing grow with it.
The various Hadoop file formats have evolved as a way to ease
these issues across a number of use cases.
Some file formats are designed for general use, others are
designed for more specific use cases, and some are designed with
specific data characteristics in mind. So there really is quite a lot of
choice.
Avro
Avro stores the schema in JSON format, making it easy for any program to read and interpret.
The data itself is stored in binary format making it compact and
efficient.
A key feature of Avro is its robust support for data schemas that change over time, i.e. schema evolution. Avro handles schema changes such as missing fields, added fields, and changed fields.
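As a quick illustration, here is a minimal sketch of schema evolution, assuming the fastavro Python library (any Avro binding behaves similarly): data written with an old schema is read back through a newer reader schema that adds a field with a default value. The record and field names are hypothetical.

```python
from io import BytesIO
from fastavro import writer, reader, parse_schema  # assumes fastavro is installed

# Writer schema: the "old" shape of the records.
v1 = parse_schema({
    "type": "record", "name": "Event",
    "fields": [{"name": "id", "type": "long"}],
})

# Reader schema: adds a field with a default, so old files still resolve.
v2 = parse_schema({
    "type": "record", "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "source", "type": "string", "default": "unknown"},
    ],
})

buf = BytesIO()
writer(buf, v1, [{"id": 1}])
buf.seek(0)

# fastavro resolves the writer schema against the reader schema;
# the field missing from the old data is filled in from its default.
for event in reader(buf, reader_schema=v2):
    print(event)  # {'id': 1, 'source': 'unknown'}
```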
Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record.
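A minimal sketch of such a record, again with fastavro (the record and field names are hypothetical); the schema itself is plain Avro JSON, and it travels inside the binary file:

```python
from io import BytesIO
from fastavro import writer, reader, parse_schema  # assumes fastavro is installed

# A record holding an array, an enumerated type, and a nested sub-record.
schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
        {"name": "status", "type": {"type": "enum", "name": "Status",
                                    "symbols": ["ACTIVE", "INACTIVE"]}},
        {"name": "address", "type": {"type": "record", "name": "Address",
                                     "fields": [{"name": "city", "type": "string"},
                                                {"name": "zip", "type": "string"}]}},
    ],
})

buf = BytesIO()
writer(buf, schema, [{"name": "Ann", "emails": ["ann@example.com"],
                      "status": "ACTIVE",
                      "address": {"city": "Pune", "zip": "411001"}}])
buf.seek(0)
for record in reader(buf):  # the JSON schema is read back from the file itself
    print(record)
```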
This format is the ideal candidate for storing data in a data lake
landing zone, because:
1. Data from the landing zone is usually read as a whole for further
processing by downstream systems (the row-based format is more
efficient in
this case).
2. Downstream systems can easily retrieve table schemas from the files themselves (there is no need to store the schemas separately in an external metastore).
3. Any source schema change is easily handled (schema evolution).
Parquet
Consider a simple table with several columns, one of them NAME. In a row-wise storage format, the data is stored record by record: all the fields of the first row together, then all the fields of the second row, and so on.
For example, let’s say you want only the NAME column. In a row-oriented format, every record in the dataset has to be loaded and parsed into fields before the NAME value can be extracted. In a column-oriented format, the reader can go directly to the NAME column, because all the values for that column are stored together, and fetch just those values. There is no need to go through whole records.
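Here is a small sketch of that difference in practice, assuming the pyarrow library; the file name users.parquet and the ID/AGE columns are hypothetical. Passing columns=["NAME"] means only that column's bytes are read from disk.

```python
import pyarrow as pa
import pyarrow.parquet as pq  # assumes pyarrow is installed

# Hypothetical table with a NAME column among others.
table = pa.table({
    "ID":   [1, 2, 3],
    "NAME": ["Ann", "Bob", "Eve"],
    "AGE":  [34, 28, 45],
})
pq.write_table(table, "users.parquet")

# Columnar pruning: only the NAME column is loaded; the other
# columns are never parsed, let alone whole records.
names = pq.read_table("users.parquet", columns=["NAME"])
print(names.column("NAME").to_pylist())  # ['Ann', 'Bob', 'Eve']
```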
One of the unique features of Parquet is that it stores data with nested structures in columnar fashion as well, which means that even nested fields can be read individually, without reading the whole nested structure.
Parquet format uses the record shredding and assembly algorithm
for storing nested structures in columnar fashion.
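To see the shredding at work, the sketch below (pyarrow again, file name hypothetical) writes records with a nested struct and prints the resulting Parquet schema: each nested leaf (address.city, address.zip) appears as its own column, which is what makes individual reads of nested fields possible.

```python
import pyarrow as pa
import pyarrow.parquet as pq  # assumes pyarrow is installed

# Records with a nested struct field.
table = pa.table({
    "id": [1, 2],
    "address": [{"city": "Pune", "zip": "411001"},
                {"city": "Delhi", "zip": "110001"}],
})
pq.write_table(table, "nested.parquet")

# The schema lists the shredded leaves (address.city, address.zip),
# each stored as a separate column that can be read on its own.
print(pq.ParquetFile("nested.parquet").schema)
```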
To understand the Parquet file format in Hadoop, you should be aware of the following three terms:
1. Row group: a horizontal partitioning of the data into rows. A row group consists of a column chunk for each column in the dataset.
2. Column chunk: the data for one particular column within a row group, stored contiguously in the file.
3. Page: column chunks are divided up into pages; a page is the smallest unit for encoding and compression.
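All three pieces are visible in the file metadata. A short sketch with pyarrow (demo.parquet and its columns are hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq  # assumes pyarrow is installed

pq.write_table(pa.table({"id": [1, 2, 3], "name": ["Ann", "Bob", "Eve"]}),
               "demo.parquet")

meta = pq.ParquetFile("demo.parquet").metadata
print(meta.num_row_groups)           # row groups: horizontal slices of the rows
chunk = meta.row_group(0).column(1)  # column chunk: one column in one row group
print(chunk.path_in_schema,          # -> "name"
      chunk.total_compressed_size)   # the pages inside the chunk are what get compressed
```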
ORC
ORC stores collections of rows, called stripes, in a single file, and within each stripe the row data is stored in columnar format.
The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column’s data type. It also contains column-level aggregates: count, min, max, and sum.
Index data includes min and max values for each column and the
row positions within each column. ORC indexes are used only for
the selection of stripes and row groups and not for answering
queries.
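Those stripe- and row-group-level statistics are what make predicate pushdown cheap. A minimal sketch, assuming PySpark and a hypothetical /tmp/users_orc path: the filter below lets the ORC reader skip any stripe whose min/max for age rules the predicate out.

```python
from pyspark.sql import SparkSession  # assumes PySpark is available

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Ann", 34), (2, "Bob", 28), (3, "Eve", 45)],
    ["id", "name", "age"],
)
df.write.mode("overwrite").orc("/tmp/users_orc")

# The reader consults per-stripe and per-row-group min/max stats,
# so stripes that cannot match age > 40 are skipped entirely.
spark.read.orc("/tmp/users_orc").where("age > 40").show()
```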
AVRO vs PARQUET
ORC vs PARQUET