
Big Data File Formats

In this blog, I will talk about what file formats actually are, go
through some common Hadoop file format features, and give a
little advice on which format you should be using.

Why do we need different file formats?

A huge bottleneck for HDFS-enabled applications like MapReduce and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are compounded by the difficulties of managing large datasets, such as evolving schemas and storage constraints.

When we process big data, the cost of storing it is high (Hadoop stores data redundantly to achieve fault tolerance). Along with the storage cost, processing the data incurs CPU, network, and IO costs, and both storage and processing costs grow as the data grows.
The various Hadoop file formats have evolved as a way to ease these issues across a number of use cases.

Choosing an appropriate file format can have some significant benefits:

1. Faster read times
2. Faster write times
3. Splittable files
4. Schema evolution support
5. Advanced compression support

Some file formats are designed for general use, others are
designed for more specific use cases, and some are designed with
specific data characteristics in mind. So there really is quite a lot of
choice.

AVRO File Format

Avro is a row-based storage format for Hadoop which is widely used as a serialization platform.

Avro stores the schema in JSON format, making it easy to read and interpret by any program. The data itself is stored in binary format, making it compact and efficient.

Avro is a language-neutral data serialization system. It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).

A key feature of Avro is its robust support for data schemas that change over time, i.e. schema evolution. Avro handles schema changes such as missing fields, added fields, and changed fields.

Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record.
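
To make this concrete, here is a minimal sketch using the fastavro library (the schema, field names, values and file path are my own illustration, not anything from a specific dataset); it writes records whose schema contains an array, an enum and a sub-record:

```python
# A minimal sketch using the fastavro library; the schema, field names,
# sample values and file path below are illustrative assumptions.
from fastavro import writer, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "skills", "type": {"type": "array", "items": "string"}},       # array
        {"name": "dept", "type": {"type": "enum", "name": "Dept",
                                  "symbols": ["SALES", "HR", "ENG"]}},           # enumerated type
        {"name": "address", "type": {"type": "record", "name": "Address",       # sub-record
                                     "fields": [{"name": "city", "type": "string"}]}},
    ],
})

records = [
    {"id": 1, "skills": ["python", "sql"], "dept": "ENG", "address": {"city": "Pune"}},
    {"id": 2, "skills": ["excel"], "dept": "HR", "address": {"city": "Chennai"}},
]

# The schema travels with the data: it is embedded in the Avro file itself.
with open("employees.avro", "wb") as out:
    writer(out, schema, records)
```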

This format is the ideal candidate for storing data in a data lake landing zone, because:

1. Data from the landing zone is usually read as a whole for further processing by downstream systems (a row-based format is more efficient in this case).
2. Downstream systems can easily retrieve table schemas from the files (there is no need to store the schemas separately in an external metastore).
3. Any source schema change is easily handled (schema evolution), as sketched below.
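
Points 2 and 3 can be sketched with fastavro as well, reusing the hypothetical employees.avro file from the example above: the writer's schema is read straight out of the file, and an evolved reader schema with an extra defaulted field is applied to the old data:

```python
# Sketch only: 'employees.avro' and the added 'email' field are illustrative.
from fastavro import reader

# Point 2: the schema is embedded in the file, so no external metastore is needed.
with open("employees.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)           # schema recovered from the file itself

# Point 3: schema evolution - read old files with a newer reader schema that
# adds a defaulted field; missing values are filled in from the default.
evolved_schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

with open("employees.avro", "rb") as fo:
    for record in reader(fo, reader_schema=evolved_schema):
        print(record["id"], record["email"])   # email comes back as None
```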

PARQUET File Format

Parquet, an open source file format for Hadoop, stores nested data structures in a flat columnar format.

Compared to a traditional approach where data is stored in a row-oriented fashion, Parquet is more efficient in terms of storage and performance.
It is especially good for queries that read particular columns from a “wide” (many-column) table, since only the needed columns are read and IO is minimized.

What is a columnar storage format?

In order to understand the Parquet file format better, let’s first see what a columnar format is. In a column-oriented format, the values of each column (all of the same type) across the records are stored together.

For example, if there is a record that comprises ID, NAME and DEPARTMENT, then all the values for the ID column will be stored together, all the values for the NAME column together, and so on. Take the same record schema with three fields ID (int), NAME (varchar) and DEPARTMENT (varchar), and, say, two sample rows (1, John, Sales) and (2, Mary, HR).

In a row-wise storage format the data is laid out row after row-

1, John, Sales; 2, Mary, HR

Whereas in a column-oriented storage format the same data is laid out column after column-

1, 2; John, Mary; Sales, HR

If you need to query only a few columns from a table, the columnar storage format is more efficient: it reads just the required columns, and because their values are stored adjacently, IO is minimized.

For example, let’s say you want only the NAME column. In a row-oriented format, each record in the dataset has to be loaded, parsed into fields, and then the NAME value extracted. With a column-oriented format, the reader can go directly to the NAME column, since all the values for that column are stored together, and fetch them without going through the whole record.

So, a column-oriented format improves query performance: less seek time is needed to reach the required columns, and less IO is needed because only the columns whose data is required are read.
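
As a small illustration, here is a sketch using the pyarrow library (the file name, column names and values are made up): reading back only the NAME column touches just that column’s data.

```python
# Sketch using the pyarrow library; 'employees.parquet' and the columns
# are illustrative assumptions.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table with three columns.
table = pa.table({
    "ID": [1, 2, 3],
    "NAME": ["John", "Mary", "Asha"],
    "DEPARTMENT": ["Sales", "HR", "Engineering"],
})
pq.write_table(table, "employees.parquet")

# Read back only the NAME column: the ID and DEPARTMENT column chunks
# are never fetched from storage.
names = pq.read_table("employees.parquet", columns=["NAME"])
print(names.to_pydict())   # {'NAME': ['John', 'Mary', 'Asha']}
```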

One of the unique features of Parquet is that it can store data with nested structures in columnar fashion as well, which means that even nested fields can be read individually, without reading all the fields of the nested structure. Parquet uses the record shredding and assembly algorithm to store nested structures in columnar form.
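
Here is a hedged PySpark sketch of that idea (the path and the nested address schema are assumptions for illustration): selecting a single nested leaf lets the Parquet reader skip the rest of the struct.

```python
# Sketch only: the path and the nested schema are assumptions for illustration.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("parquet-nested").getOrCreate()

df = spark.createDataFrame([
    Row(id=1, address=Row(city="Pune", zip="411001")),
    Row(id=2, address=Row(city="Chennai", zip="600001")),
])
df.write.mode("overwrite").parquet("/tmp/employees_nested")

# Only the address.city leaf column needs to be read; with nested schema
# pruning (on by default in recent Spark versions) address.zip is skipped.
spark.read.parquet("/tmp/employees_nested").select("address.city").show()
```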
To understand the Parquet file format in Hadoop you should be
aware of the following three terms-

 Row group: A logical horizontal partitioning of the data into rows. A row group consists of a column chunk for each column in the dataset.

 Column chunk: A chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.

 Page: Column chunks are divided into pages written back to back. The pages share a common header, and readers can skip over pages they are not interested in.
The file header just contains a 4-byte magic number “PAR1” that identifies the file as a Parquet format file.

The footer contains the following (see the inspection sketch after this list)-

 File metadata- The file metadata contains the start locations of all the column metadata. Readers are expected to read the file metadata first to find all the column chunks they are interested in; the column chunks should then be read sequentially. It also includes the format version, the schema, and any extra key-value pairs.

 Length of the file metadata (4-byte)

 Magic number “PAR1” (4-byte)
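
The footer can be inspected programmatically. The following pyarrow sketch (reusing the hypothetical employees.parquet file from earlier) reads the footer metadata and drills down into a row group and one of its column chunks:

```python
# Sketch: inspects the footer metadata of the hypothetical 'employees.parquet'.
import pyarrow.parquet as pq

pf = pq.ParquetFile("employees.parquet")
meta = pf.metadata                      # parsed from the file footer

print(meta.num_row_groups, meta.num_columns, meta.format_version)

row_group = meta.row_group(0)           # first row group
chunk = row_group.column(0)             # first column chunk in that row group
print(chunk.path_in_schema,             # which column this chunk belongs to
      chunk.num_values,
      chunk.statistics)                 # min/max/null counts, if written
```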

ORC File Format

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data. It was designed to overcome limitations of the other file formats: it stores data compactly and enables skipping over irrelevant parts without the need for large, complex, or manually maintained indices.

ORC file format has many advantages, such as:

 A single file as the output of each task, which reduces the NameNode’s load

 Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)

 Concurrent reads of the same file using separate RecordReaders

 Ability to split files without scanning for markers

 A bound on the amount of memory needed for reading or writing

 Metadata stored using Protocol Buffers, which allows the addition and removal of fields

ORC stores collections of rows in one file, and within each collection the row data is stored in a columnar format.

An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file, a postscript holds the compression parameters and the size of the compressed footer.
The default stripe size is 250 MB; large stripe sizes enable large, efficient reads from HDFS.

The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column’s data type. It also contains column-level aggregates: count, min, max, and sum.

Each stripe holds index data, row data, and a stripe footer.

The stripe footer contains a directory of stream locations.

Row data is used in table scans.

Index data includes min and max values for each column and the row positions within each column. ORC indexes are used only for the selection of stripes and row groups, not for answering queries.
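
For a quick feel of the format, here is a sketch using pyarrow’s ORC module (the file name, columns and values are invented); it writes a small table, checks the stripe count and reads a single column back:

```python
# Sketch using pyarrow.orc; 'people.orc' and the columns are illustrative.
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({
    "id": [1, 2, 3],
    "name": ["John", "Mary", "Asha"],
})
orc.write_table(table, "people.orc")

f = orc.ORCFile("people.orc")
print(f.nstripes, f.nrows)              # stripe count and total row count
print(f.read(columns=["name"]))         # columnar layout: read just one column
```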

COMPARISONS BETWEEN DIFFERENT FILE FORMATS

AVRO vs PARQUET

1. AVRO is a row-based storage format, whereas PARQUET is a columnar storage format.
2. PARQUET is much better for analytical querying, i.e. reads and queries are much more efficient than writes.
3. Write operations in AVRO are better than in PARQUET.
4. AVRO is more mature than PARQUET when it comes to schema evolution. PARQUET only supports schema append, whereas AVRO supports fully featured schema evolution, i.e. adding or modifying columns.
5. PARQUET is ideal for querying a subset of columns in a multi-column table. AVRO is ideal for ETL operations where we need to query all the columns.

ORC vs PARQUET

1. PARQUET is more capable of storing nested data.
2. ORC is more capable of predicate pushdown.
3. ORC supports ACID properties.
4. ORC is more compression efficient.
