Lecture 3: File Types Suitable for Storing Big Data

With the ever-increasing amount of data being generated daily, storage methods need to evolve to handle this data efficiently. For example, it is estimated that in 2022 roughly 2.5 quintillion bytes of data were created every day, and this figure is expected to keep growing.

The Evolution of File Formats

While JavaScript Object Notation (JSON) and Comma-Separated Values (CSV) files are still common for storing data, they were never designed for the massive scale of big data and tend to consume resources unnecessarily (JSON files with nested data can be very CPU-intensive, for example). They are text formats and therefore human-readable, but they lack the efficiencies offered by binary alternatives.

So as data has grown, file formats have evolved. File format impacts speed and performance,
and can be a key factor in determining whether you must wait an hour for an answer – or
milliseconds. Matching your file format to your needs is crucial for minimizing the time it
takes to find the relevant data and also to glean meaningful insights from it.

File types suitable for storing big data include:


 Comma-Separated Values (CSV): a good option for compatibility, spreadsheet processing, and human-readable data.
 JavaScript Object Notation (JSON): heavily used in APIs.
 Extensible Markup Language (XML): rich, hierarchical, document-oriented markup.
 Avro: great for storing row data; very efficient.
 Protocol Buffers: great for APIs, especially gRPC.
 Parquet: columnar storage.
 ORC: similar to Parquet; offers better compression.

CSV: Simple and Universal

CSV stands for comma-separated values, and it is a plain text format that stores data in rows
and columns, separated by commas or other delimiters. CSV is simple, universal, and easy to
read and write by humans and machines. It is widely supported by many software and
programming languages, such as Excel, Python, and R. CSV is ideal for data analysis when
you have tabular data that is not too complex or nested, and when you want to import or export
data quickly and efficiently.
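
As a minimal sketch (assuming the pandas library is installed; "people.csv" is a hypothetical file name), CSV data can be written and read back in a few lines:

```python
# Minimal sketch: round-tripping tabular data through CSV with pandas.
# Assumes pandas is installed; "people.csv" is a hypothetical file name.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "first_name": ["Michael", "Preston"],
    "city": ["Dallas", "Boston"],
    "age": [32, 25],
})

df.to_csv("people.csv", index=False)   # plain text, one row per line, comma-delimited
people = pd.read_csv("people.csv")     # pandas infers column types from the text
print(people)
```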

JSON: Flexible and Structured


JSON stands for JavaScript Object Notation, and it is a text format that stores data as objects,
consisting of key-value pairs. JSON is flexible, structured, and easy to parse and manipulate
by machines. It is widely used by web applications and APIs, as it can handle complex and
nested data structures, such as arrays, lists, and dictionaries. JSON is ideal for data analysis
when you have data that is not easily represented in a table, and when you want to work with
data from web sources or JSON-based tools, such as MongoDB or D3.js.
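
A minimal sketch using Python's standard json module (the record shown is a made-up example) illustrates how nested structures are serialized and parsed:

```python
# Minimal sketch: serializing and parsing nested data with the standard json module.
import json

record = {
    "id": 1,
    "name": {"first": "Michael", "last": "Jones"},
    "cities_lived": ["Dallas", "Austin"],   # nested list that would not fit a flat table
}

text = json.dumps(record)       # serialize to a JSON string
parsed = json.loads(text)       # parse back into Python dicts and lists
print(parsed["name"]["first"])  # -> Michael
```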

BSON is binary JSON (a superset of JSON with some additional data types, most importantly a binary byte array). It is the serialization format used in MongoDB. BSON is a format specialized for efficiently storing JSON-like documents; in addition to the standard JSON data types it natively supports dates and binary data. Because it is binary-encoded, it is not human-readable like JSON.

An equivalent BSON record is not always smaller than its JSON counterpart, but BSON allows a reader to efficiently skip the fields it is not interested in, whereas with JSON every byte must be parsed. This is the main reason it is used inside MongoDB. If you are not working with MongoDB and do not need any of these features, you are probably better off using JSON. A single MongoDB document can store up to 16 MB of binary data; for larger binary files, MongoDB provides its own file system, GridFS, which stores files larger than 16 MB in chunks.
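
As a rough sketch, assuming the bson module that ships with the PyMongo driver, a document with date and binary fields can be encoded to BSON and decoded again:

```python
# Minimal sketch: encoding a document to BSON and back.
# Assumes the `bson` module distributed with PyMongo; the document is a made-up example.
import datetime
import bson

doc = {
    "name": "Michael",
    "joined": datetime.datetime(2022, 1, 15),  # native date type, unlike plain JSON
    "avatar": b"\x89PNG\r\n",                  # native binary type
}

data = bson.encode(doc)        # raw bytes, not human-readable
restored = bson.decode(data)   # back to a Python dict
print(len(data), "bytes;", restored["joined"])
```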

Difference Between JSON and BSON

JSON is a standard, text-based file format; BSON is a binary file format.

JSON is a language-independent format used for asynchronous server-browser communication; BSON consists of an ordered list of elements, each containing a field name, a type, and a value (field names are typically strings).

JSON broadly consists of objects and arrays, where an object is a collection of key-value pairs and an array is an ordered list of values; BSON's binary encoding carries extra information such as string lengths and object subtypes, and it supports data types such as BinData and Date that JSON does not.

JSON stands for JavaScript Object Notation; BSON stands for Binary JSON (Binary JavaScript Object Notation).

JSON keeps its data in basic JSON form; BSON provides additional data types on top of JSON.

Databases such as AnyDB and Redis store data in JSON format; MongoDB stores data in BSON format.

JSON uses less space than BSON; BSON uses more space than JSON.

JSON is slower to scan than BSON; BSON is faster.

JSON is typically used for the transmission of data; BSON is used for the storage of data.

JSON has no separate encoding and decoding step; BSON requires encoding and decoding.

XML: Rich and Extensible


XML stands for Extensible Markup Language, and it is a text format that stores data as
elements, attributes, and text, enclosed by tags. XML is rich, extensible, and easy to validate and transform by machines. It is widely used by document formats and standards, such as
HTML, RSS, and SOAP, as it can handle metadata, schemas, and namespaces. XML is ideal
for data analysis when you have data that is highly structured and hierarchical, and when you
want to work with data from XML-based sources or tools, such as XML databases or XSLT.
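
A minimal sketch with Python's standard xml.etree.ElementTree module shows how hierarchical XML elements and attributes are read:

```python
# Minimal sketch: parsing hierarchical XML with the standard library.
import xml.etree.ElementTree as ET

xml_text = """
<people>
  <person id="1"><first>Michael</first><city>Dallas</city></person>
  <person id="2"><first>Preston</first><city>Boston</city></person>
</people>
"""

root = ET.fromstring(xml_text)
for person in root.findall("person"):
    # Both attributes and nested child elements are available on each node.
    print(person.get("id"), person.findtext("first"), person.findtext("city"))
```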

How to Choose the Best Format between CSV, JSON and XML
When deciding which data format is best for data analysis, there is no definitive answer as it
depends on your data characteristics, analysis objectives, and available tools. Generally
speaking, CSV should be chosen if the data is simple, flat, and tabular, and you are working
with common software and languages. JSON should be chosen if the data is complex, nested,
and object-oriented, and you are working with web applications and APIs. Finally, XML
should be chosen if the data is rich, hierarchical, and document-based, and you are working
with XML standards and formats. Converting between different data formats may result in
some loss or distortion of information so it is better to choose the most suitable format from
the start. Tools such as pandas, json or xml libraries in Python can be used to convert between
formats if needed.
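
For instance, a rough sketch of such a conversion with pandas ("people.csv", "people.json", and "people.xml" are hypothetical file names; to_xml additionally requires pandas 1.3+ with an XML backend such as lxml):

```python
# Minimal sketch: converting a flat CSV table to JSON and XML with pandas.
# "people.csv", "people.json", and "people.xml" are hypothetical file names.
import pandas as pd

df = pd.read_csv("people.csv")

# Flat, tabular data converts cleanly; deeply nested structures may lose shape,
# which is why choosing the most suitable format from the start is preferable.
df.to_json("people.json", orient="records")
df.to_xml("people.xml", index=False)   # requires pandas >= 1.3 and an XML backend such as lxml
```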

Big data file formats can also be grouped into row and column formats

Row vs. Columnar Big Data File Formats

There are two main ways in which you can organize your data: rows and columns. Which one
you choose largely controls how efficiently you store and query your data.

1. Row – the data is organized by record. Think of this as a more “traditional” way of
organizing and managing data. All data associated with a specific record is stored
adjacently. In other words, the data of each row is arranged such that the last column of
a row is stored next to the first column entry of the succeeding data row.

2. Columnar – the values of each table column (field) are stored next to each other. This
means like items are grouped and stored next to one another. Within fields the read-in
order is maintained; this preserves the ability to link data to records.

Row format: Traditionally you can think of row storage this way:

1, Michael, Jones, Dallas, 32

2, Preston, James, Boston, 25

But you can also represent row data visually in the order in which it would be stored in memory,
like this:

1, Michael, Jones, Dallas, 32, 2, Preston, James, Boston, 25

Columnar format: Traditionally you can think of columnar storage this way:

ID: 1, 2

First Name: Michael, Preston

Last Name: Jones, James

City: Dallas, Boston

Age: 32, 25

But you can also represent columnar data visually in the order in which it would be stored in
memory

1, 2, Michael, Preston, Jones, James, Dallas, Boston, 32, 25
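
The two physical orderings can be sketched conceptually with plain Python lists (this is an illustration of the layout, not a real storage engine):

```python
# Conceptual sketch of the two layouts above, using plain Python lists
# to show the physical ordering of values on disk or in memory.

# Row layout: all fields of a record sit next to each other.
row_layout = [
    1, "Michael", "Jones", "Dallas", 32,
    2, "Preston", "James", "Boston", 25,
]

# Columnar layout: all values of a column sit next to each other.
columnar_layout = [
    1, 2,                     # ID
    "Michael", "Preston",     # First Name
    "Jones", "James",         # Last Name
    "Dallas", "Boston",       # City
    32, 25,                   # Age
]
```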

Choosing a format is all about ensuring your format matches your downstream intended use
for the data. Below we highlight the key reasons why you might use row vs. columnar
formatting. Optimize your formatting to match your storage method and data usage, and
you will save valuable engineering time and resources.

The Particulars of Row Formatting


With data in row storage, items are laid out on disk in the same order as the in-memory representation shown above, one record after another.

Adding more to this dataset is trivial – you just append any newly acquired data to the end of the current dataset.

As writing to the dataset is relatively cheap and easy in this format, row formatting is
recommended for any case in which you have write-heavy operations. However, read
operations can be very inefficient.

Here’s an example: Obtain the sum of ages for individuals in the data. This simple task can be
surprisingly compute-intensive with a row-oriented database. Since ages are not stored in a
common location within memory, you must load all 15 data points into memory, then extract
the relevant data to perform the required operation:

Load: all 15 stored values are read into memory (the example dataset now holds three records of five fields each).

Extract: only the three age values (32, 25, 37) are retained.

Sum: 32 + 25 + 37 = 94
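
A conceptual sketch of this row-oriented read (the third record, with age 37, is implied by the sum above; its other fields are made up here):

```python
# Conceptual sketch: in a row layout, every field of every record must be loaded
# before the ages can be extracted and summed.
rows = [
    [1, "Michael", "Jones", "Dallas", 32],
    [2, "Preston", "James", "Boston", 25],
    [3, "Amy",     "Clark", "Denver", 37],   # hypothetical third record
]

loaded = [value for record in rows for value in record]   # Load: all 15 values
ages = [record[4] for record in rows]                     # Extract: the age field only
print(sum(ages))                                          # Sum: 32 + 25 + 37 = 94
```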

Now imagine millions or billions of data points, stored across numerous disks because of their
scale.

Using the same example:

Suppose the data lives on several disks, and each disk can hold only 5 data points. The sample dataset is then split across 3 storage disks.

You must load all the data on all the disks to obtain the information necessary for your query.
If each disk is filled to capacity with data, this can easily require extra memory utilization and
quickly become burdensome.

That said, row formatting does offer advantages when schema changes – we’ll cover this later.
In general, if the data is wide – that is, it has many columns – and is write-heavy, a row-based
format may be best.

The Particulars of Columnar Formatting


Again, referring to the example dataset in columnar format (represented above in the order in which it would be stored in memory), the data is grouped in terms of like columns:

 ID

 First Name

 Last Name

 City

 Age

Writing data is somewhat more time-intensive for columnar-formatted data than for row-formatted data; instead of just appending to the end as in a row-based format, you must read in the entire dataset, navigate to the appropriate positions, and make the proper insertions.

Navigating to the appropriate positions for insertions is wasteful, but the way the data is partitioned across multiple disks can alleviate this. Using the same framework as with row-formatted data, suppose you had a separate storage location (disk) for each attribute (column), so in this sample case there are 5 disks.

It still takes a large amount of memory and computational time writing to 5 separate locations
to add data for a 3rd individual. However, you are simply appending to the end of the files
stored in each of the locations.

So the columnar format doesn’t compare favorably to row formatting with regard to write
operations. But it is superior for read operations such as querying data. Here’s the same read
example from before, applied to column-based storage partitioned by column:

Task: Obtain the sum of ages for individuals in the data.

To accomplish this, go only to the storage location that contains information on ages (disk 5)
and read the necessary data. This saves a large amount of memory and time by skipping over
non-relevant data very quickly.

Read: only the age column (32, 25, 37) is loaded, from disk 5.

Sum: 32 + 25 + 37 = 94

In this case, all reads came from sequential data stored on a single disk.
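
A conceptual sketch of the same query against a columnar layout (the third record is the hypothetical one from earlier):

```python
# Conceptual sketch: in a columnar layout, only the "age" column is read;
# every other column (and disk) is skipped entirely.
columns = {
    "id":         [1, 2, 3],
    "first_name": ["Michael", "Preston", "Amy"],
    "last_name":  ["Jones", "James", "Clark"],
    "city":       ["Dallas", "Boston", "Denver"],
    "age":        [32, 25, 37],
}

ages = columns["age"]    # Read: one contiguous column from a single location
print(sum(ages))         # Sum: 32 + 25 + 37 = 94
```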

But efficient querying isn’t the only reason columnar-formatted data is popular. Columnar-
formatted data also allows for efficient compression. By storing each attribute together (ID,
ages, and so on) you can benefit from commonalities between attributes, such as a shared or common data type or a common length (number of bits) per entry. For example, if you know
that age is an integer data type that won’t exceed a value of 200, you can compress the storage
location and reduce memory usage/allocation, as you don’t need standard amounts of allocated
memory for such values. (A standard INT is typically stored as 4 bytes, for instance, whereas a SMALLINT can be stored as 2 bytes.)

Further, columnar-formatted files usually support a number of flexible compression options (Snappy, gzip, and LZO, for example) and provide efficient encoding schemes. For example, you can use different encodings for compressing integer and string data; because all the data in a column is very similar, it can be compressed more quickly for storage and decompressed for analysis or other processing. In a row-based storage structure, on the other hand, you must compress many types of data together – and those rows can get very long in schema-on-read systems – and decompress pretty much the entire table when it's time for analysis or other processing.
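
A small, contrived illustration of this effect using Python's standard zlib module: a column of similar values compresses to a much smaller fraction of its original size than the same values interleaved with other row data.

```python
# Contrived illustration: like values grouped in a column compress better than
# the same values interleaved with other fields, row by row.
import zlib

ages_column = ",".join(str(30 + i % 5) for i in range(1000)).encode()
mixed_rows = ",".join(f"{i},Name{i},City{i},{30 + i % 5}" for i in range(1000)).encode()

print(len(zlib.compress(ages_column)) / len(ages_column))  # small ratio: compresses very well
print(len(zlib.compress(mixed_rows)) / len(mixed_rows))    # larger ratio: compresses less well
```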

Three popular file formats for big data are Avro, ORC, and Parquet.

The Avro Row-Based File Format Explained

Apache Avro was a project initially released late in 2009 as a row-based, language-neutral,
schema-based serialization technique and object container file format. It is an open-source
format that is often recommended for Apache Kafka. It’s preferred when serializing data in
Hadoop. The Avro format stores data definitions (the schema) in JSON and is easily read and
interpreted. The data within the file is stored in binary format, making it compact and space-
efficient.

What makes Avro stand out as a file format is that it is self-describing. Avro bundles serialized
data with the data’s schema in the same file – the message header contains the schema used to
serialize the message. This enables software to efficiently deserialize messages.

The Avro file format supports schema evolution. It supports dynamic data schemas that can
change over time; it can easily handle schema changes such as missing fields, added fields, or
edited/changed fields. In addition to schema flexibility, the Avro format supports complex data
structures such as arrays, enums, maps, and unions.

Avro-formatted files are splittable and compressible (though they do not compress as well as some columnar formats). Avro files work well for storage in a Hadoop ecosystem and for running processes in parallel (because they are faster to load).

One caveat: If every message includes the schema in its header, it doesn’t scale well. Avro
requires the reading and re-reading of the repeated header schema across multiple files, so at
scale Avro can cause inefficiencies in bandwidth and storage space, slowing compute
processes.


Avro Format Features

Can be shared by programs using different languages

Self-describing; bundles serialized data with data’s schema

Supports schema evolution and flexibility

Splittable

Compression options including uncompressed, snappy, deflate, bzip2, and xz

Ideal Avro format use cases

 Write-heavy operations (such as ingestion into a data lake) due to serialized row-based
storage.

 When writing speed with schema evolution (adaptability to change in metadata) is critical.
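
As a rough sketch, assuming the third-party fastavro library ("people.avro" is a hypothetical file name), writing and reading an Avro object container file looks like this:

```python
# Minimal sketch: writing and reading an Avro file with fastavro.
# Assumes fastavro is installed; "people.avro" is a hypothetical file name.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "name": "Person",
    "type": "record",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "first_name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [
    {"id": 1, "first_name": "Michael", "age": 32},
    {"id": 2, "first_name": "Preston", "age": 25},
]

# The schema travels with the binary row data inside the same file.
with open("people.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

with open("people.avro", "rb") as fo:
    for rec in reader(fo):   # the reader recovers the schema from the file header
        print(rec)
```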

The Optimized Row Columnar (ORC) Columnar File Format Explained

Optimized Row Columnar (ORC) is an open-source columnar storage file format originally
released in early 2013 for Hadoop workloads. ORC provides a highly-efficient way to store
Apache Hive data, though it can store other data as well. It’s the successor to the traditional
Record Columnar File (RCFile) format. ORC was designed and optimized specifically with
Hive data in mind, improving the overall performance when Hive reads, writes, and processes
data. As a result, ORC supports ACID transactions when working with Hive.

The ORC file format stores collections of rows in a single file, in a columnar format within the
file. This enables parallel processing of row collections across a cluster. Due to the columnar
layout, each file is optimal for compression, enabling skipping of data and columns to reduce
read and decompression loads.

ORC files are organized into independent stripes of data. Each stripe consists of an index, row
data, and a footer. The footer holds key statistics for each column within a stripe (count, min,
max, sum, and so on), enabling easy skipping as needed. The footer also contains metadata
about the ORC file, making it easy to combine information across stripes.

[Figure: ORC file structure]

[Figure: ORC compression chunk]

ORC Format Features

Name Node load reduction – single file output per task

Complex type support including DateTime, decimal, struct, list, map, and union

Separate RecordReaders for concurrent reads of a single file

Split files without scanning for markers

Predicate pushdown

Compression options including Snappy, Zlib

Supports ACID when used with Hive

By default, the stripe size is 250 MB; the large stripe size is what enables efficient reads. The ORC format offers superior compression characteristics (ORC is often chosen over Parquet when compression is the sole criterion), with compression done using Snappy or Zlib. Another notable ORC feature is predicate pushdown: the system checks a query or condition against file metadata to see whether rows must be read at all, which increases the potential for skipping data.

Ideal ORC format use cases

 When reads constitute a significantly higher volume than writes.

 When you rely on Hive.

 When compression flexibility/options are key.
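
As a rough sketch, assuming a recent pyarrow build with ORC support ("people.orc" is a hypothetical file name), an ORC file can be written and then read back with only the columns a query needs:

```python
# Minimal sketch: writing an ORC file and reading back a single column.
# Assumes a recent pyarrow build with ORC support; "people.orc" is hypothetical.
import pyarrow as pa
from pyarrow import orc

table = pa.table({
    "id": [1, 2, 3],
    "first_name": ["Michael", "Preston", "Amy"],
    "age": [32, 25, 37],
})

orc.write_table(table, "people.orc")

# Column pruning: only the "age" column is read; stripe statistics in the file
# footer let the reader skip data it does not need.
ages = orc.read_table("people.orc", columns=["age"])
print(ages["age"].to_pylist())   # [32, 25, 37]
```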

The Parquet Columnar File Format Explained

The Apache Parquet file format was first introduced in 2013 as an open-source storage format
that boasted substantial advances in efficiencies for analytical querying. According to https://parquet.apache.org:

“Apache Parquet is a … file format designed for efficient data storage and retrieval. It provides
efficient data compression and encoding schemes with enhanced performance to handle
complex data in bulk…”

Parquet files support complex nested data structures in a flat format and offer multiple
compression options.

Parquet is broadly accessible. It supports multiple coding languages, including Java, C++, and
Python, to reach a broad audience. This makes it usable in nearly any big data setting. As it’s
open source, it avoids vendor lock-in.

Parquet is also self-describing. It contains metadata that includes file schema and structure.
You can use this to separate different services for writing, storing, and reading Parquet files.

Parquet files are composed of row groups, a header, and a footer. Each row group contains data from the same columns; the same columns are stored together in each row group.

Most importantly, at its core Parquet formatting is designed to support fast data processing for
complex nested data structures such as log files and event streams at scale. It saves on cloud
storage space by using highly efficient columnar compression, and provides flexible encoding
schemes to handle columns with different data types; you can specify compression schemes on
a per-column basis. It is extensible to future encoding mechanisms as well, making Parquet
“future-proof” in this regard. Parquet supports many query engines, including Amazon Athena, Amazon Redshift Spectrum, Qubole, Google BigQuery, Microsoft Azure Data Explorer, and Apache Drill. Compared with text formats, Parquet has been reported to unload up to 2x faster and consume as little as ⅙ of the storage in Amazon S3.
Parquet files are splittable as they store file footer metadata containing information on block
boundaries for the file. Systems access this block boundary information to determine whether
to skip or read only specific parts (blocks) of the file – allowing for more efficient reads – or
to more easily submit different blocks for parallel processing. Parquet supports automatic
schema merging for schema evolution, so you can start with a simple schema and gradually
add more columns as needed.

Parquet Format Features

Language agnostic

Supports complex data types

Multiple flexible compression options

Supports schema evolution

Enables data skipping, reduced I/O

Parquet files are often most appropriate for analytics (OLAP) use cases, typically when
traditional OLTP databases are the source. They offer highly-efficient data compression and decompression. They also feature increased data throughput and performance using techniques
such as data skipping (in which queries return specific column values and do not read an entire
row of data, greatly minimizing I/O).
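
A minimal sketch of this pattern with pandas (assuming a pyarrow backend; "events.parquet" is a hypothetical file name):

```python
# Minimal sketch: columnar write with compression, then a column-pruned read.
# Assumes pandas with a pyarrow backend; "events.parquet" is a hypothetical file name.
import pandas as pd

df = pd.DataFrame({
    "id": range(1000),
    "city": ["Dallas", "Boston"] * 500,
    "age": [32, 25] * 500,
})

df.to_parquet("events.parquet", compression="snappy")

# Read only the column the query needs; the other columns are never loaded.
ages = pd.read_parquet("events.parquet", columns=["age"])
print(int(ages["age"].sum()))
```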

Ideal Parquet format use cases


 Storing big data of any kind (structured data tables, images, videos, documents).
 Ideal for services such as AWS Athena and Amazon Redshift Spectrum, which are
serverless, interactive technologies.
 A good fit for Snowflake as it supports extremely efficient compression and encoding
schemes.
 When your full dataset has many columns, but you only need to access a subset.
 When you want multiple services to consume the same data from object storage.
 When you’re largely or wholly dependent on Spark.

ORC vs Parquet: Key Differences in a Nutshell


ORC (Optimized Row Columnar) and Parquet are two popular big data file formats.
Parquet is generally better for write-once, read-many analytics, while ORC is more
suitable for read-heavy operations. ORC is optimized for Hive data, while Parquet is
considerably more efficient for querying. Both support complex data structures, multiple
compression options, schema evolution, and data skipping.

There is overlap between ORC and Parquet. But Parquet is ideal for write-once, read-many
analytics, and in fact has become the de facto standard for OLAP on big data. It also works
best with Spark, which is widely used throughout the big data ecosystem.

Parquet is really the best option when speed and efficiency of queries are most important. It’s
optimized to work with complex data in bulk, including nested data. It’s also highly-effective
at minimizing table scans and, like ORC, can compress data to small sizes. It provides the
widest range of options for squeezing greater efficiency from your queries regardless of vendor.
And its extensibility with regard to future encoding mechanisms also makes it attractive to
organizations concerned with keeping their data infrastructure current and optimized.

It’s worth noting that new table formats are also emerging to support the substantial increases
in the volume and velocity (that is, streaming) of data. These formats include Apache Iceberg,
Apache Hudi, and Databricks Delta Lake. We will explore these in a later lecture.

Advantages Of Using Appropriate File Formats:

1. Faster reads
2. Faster writes
3. Support for splittable files
4. Support for schema evolution
5. Advanced compression
