Lecture 2: File Types Suitable for Storing Big Data
With the ever-increasing amount of data being generated daily, storage methods need to evolve so that this data can be stored efficiently. For example, it is estimated that in 2022 around 2.5 quintillion bytes of data were created every day, and this figure is expected to keep growing.
While JavaScript Object Notation (JSON) and Comma-Separated Values (CSV) files are still common for storing data, they were never designed for the massive scale of big data and tend to consume resources unnecessarily (parsing JSON files with nested data can be very CPU-intensive, for example). Because they are text formats they are human-readable, but they lack the efficiencies offered by binary alternatives.
So as data has grown, file formats have evolved. File format impacts speed and performance,
and can be a key factor in determining whether you must wait an hour for an answer – or
milliseconds. Matching your file format to your needs is crucial for minimizing the time it
takes to find the relevant data and also to glean meaningful insights from it.
CSV stands for comma-separated values, and it is a plain text format that stores data in rows
and columns, separated by commas or other delimiters. CSV is simple, universal, and easy to
read and write by humans and machines. It is widely supported by many software and
programming languages, such as Excel, Python, and R. CSV is ideal for data analysis when
you have tabular data that is not too complex or nested, and when you want to import or export
data quickly and efficiently.
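As a minimal illustration of how easily tabular CSV data can be loaded for analysis, here is a short Python sketch using pandas; the inline data and column names are illustrative:

```python
import io
import pandas as pd

# A tiny CSV dataset held inline; in practice you would pass a file path
csv_text = "id,age\n1,32\n2,25\n3,37\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df)               # the parsed table
print(df["age"].sum())  # simple aggregation: 94
```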
BSON is binary JSON (a superset of JSON with some additional data types, most importantly a binary byte array). It is the serialization format used in MongoDB. BSON is a format designed for efficiently storing JSON-like documents: alongside the conventional JSON data types it also supports dates and binary data natively. Because it is binary-encoded, it is not human-readable like JSON.
An equivalent BSON document is not always smaller than its JSON counterpart, but BSON allows you to efficiently skip over the fields you are not interested in when reading it, whereas with JSON you would have to parse every byte. This is the main reason it is used inside MongoDB. If you are not working with MongoDB and do not require any of these features, you are probably better off using JSON. In a single MongoDB document you can store up to 16 MB of data. However, MongoDB has its own file system, GridFS, which stores binary files larger than 16 MB in chunks.
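The sketch below shows BSON encoding and decoding in Python using the bson module that ships with the pymongo package (an assumption; there is also an unrelated standalone bson package with a different API). The field names and values are purely illustrative:

```python
import datetime
import bson  # the bson module bundled with pymongo

doc = {
    "name": "sensor-1",                                         # illustrative field
    "recorded": datetime.datetime.now(datetime.timezone.utc),   # native date type
    "payload": b"\x00\x01\x02",                                 # native binary data
}

data = bson.encode(doc)      # a bytes blob, not human-readable
print(len(data), "bytes")

decoded = bson.decode(data)  # back to a Python dict
print(decoded["name"])
```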
JSON vs BSON:
- JSON stands for JavaScript Object Notation; BSON stands for Binary JavaScript Object Notation.
- JSON stores its data in plain, basic JSON format; BSON provides extra data types on top of JSON.
- JSON is used mainly for the transmission of data; BSON is used mainly for the storage of data.
XML (eXtensible Markup Language) is a markup format that stores data as a hierarchy of nested, tagged elements. It is human-readable and easy to parse and transform by machines. It is widely used by document formats and standards, such as HTML, RSS, and SOAP, as it can handle metadata, schemas, and namespaces. XML is ideal for data analysis when you have data that is highly structured and hierarchical, and when you want to work with data from XML-based sources or tools, such as XML databases or XSLT.
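As a small illustration of working with hierarchical XML data, here is a Python sketch using the standard-library xml.etree.ElementTree module; the element and attribute names are illustrative:

```python
import xml.etree.ElementTree as ET

# A tiny inline XML document with nested, tagged elements
xml_text = """
<people>
    <person id="1"><age>32</age></person>
    <person id="2"><age>25</age></person>
    <person id="3"><age>37</age></person>
</people>
"""

root = ET.fromstring(xml_text)
ages = [int(p.find("age").text) for p in root.findall("person")]
print(sum(ages))  # 94
```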
How to Choose the Best Format between CSV, JSON and XML
When deciding which data format is best for data analysis, there is no definitive answer as it
depends on your data characteristics, analysis objectives, and available tools. Generally
speaking, CSV should be chosen if the data is simple, flat, and tabular, and you are working
with common software and languages. JSON should be chosen if the data is complex, nested,
and object-oriented, and you are working with web applications and APIs. Finally, XML
should be chosen if the data is rich, hierarchical, and document-based, and you are working
with XML standards and formats. Converting between different data formats may result in
some loss or distortion of information, so it is better to choose the most suitable format from the start. Tools such as the pandas, json, or xml libraries in Python can be used to convert between formats if needed.
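For example, a round trip from CSV to JSON and back can be sketched with pandas as follows (inline data for illustration; this is one of several possible approaches):

```python
import io
import pandas as pd

csv_text = "id,age\n1,32\n2,25\n3,37\n"
df = pd.read_csv(io.StringIO(csv_text))

# CSV -> JSON: one JSON object per record
json_text = df.to_json(orient="records")
print(json_text)

# JSON -> DataFrame again; check that nothing was lost in the round trip
df_again = pd.read_json(io.StringIO(json_text))
print(df_again.equals(df))
```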
Big data file formats can also be grouped into row and column formats
There are two main ways in which you can organize your data: rows and columns. Which one
you choose largely controls how efficiently you store and query your data.
1. Row – the data is organized by record. Think of this as a more “traditional” way of
organizing and managing data. All data associated with a specific record is stored
adjacently. In other words, the data of each row is arranged such that the last column of
a row is stored next to the first column entry of the succeeding data row.
2. Columnar – the values of each table column (field) are stored next to each other. This
means like items are grouped and stored next to one another. Within fields the read-in
order is maintained; this preserves the ability to link data to records.
Row format: Traditionally you can think of row storage this way (showing only the ID and Age fields of the sample data):

ID  Age
1   32
2   25

But you can also represent row data visually in the order in which it would be stored in memory, like this:

1, 32, 2, 25
Columnar format: Traditionally you can think of columnar storage this way:

ID: 1, 2
Age: 32, 25

But you can also represent columnar data visually in the order in which it would be stored in memory:

1, 2, 32, 25
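To make the two layouts concrete, here is a minimal Python sketch (using only the ID and Age fields of the sample data) of the same records arranged row-wise and column-wise, together with the sum-of-ages query discussed below:

```python
# Row-oriented: all values of a record are stored together
rows = [
    {"ID": 1, "Age": 32},
    {"ID": 2, "Age": 25},
]

# Column-oriented: all values of a field are stored together
columns = {
    "ID":  [1, 2],
    "Age": [32, 25],
}

# Summing ages from row storage touches every record in full
print(sum(r["Age"] for r in rows))  # 57

# Summing ages from columnar storage touches only the Age column
print(sum(columns["Age"]))          # 57
```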
Choosing a format is all about ensuring your format matches your downstream intended use
for the data. Below we highlight the key reasons why you might use row vs. columnar
formatting. Optimize your formatting to match your storage method and data usage, and
you’ve optimized valuable engineering time and resources.
Adding more to this dataset is trivial – you just append any newly acquired data to the end of the current dataset (here, a 3rd record with ID 3 and age 37):

1, 32, 2, 25, 3, 37
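A rough sketch of how cheap such an append is in practice: adding a record to a CSV file (a row-based text format) only touches the end of the file. The file name and values are illustrative:

```python
import csv

# Append one new record; nothing already stored has to be read or moved
with open("people.csv", "a", newline="") as f:
    csv.writer(f).writerow([3, 37])  # ID, Age
```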
As writing to the dataset is relatively cheap and easy in this format, row formatting is
recommended for any case in which you have write-heavy operations. However, read
operations can be very inefficient.
Here’s an example: Obtain the sum of ages for individuals in the data. This simple task can be
surprisingly compute-intensive with a row-oriented database. Since ages are not stored in a
common location within memory, you must load all 15 data points into memory, then extract
the relevant data to perform the required operation:
Load: all 15 data points (every field of every record)
Extract: 32, 25, 37
Sum: 32 + 25 + 37 = 94
Now imagine millions or billions of data points, stored across numerous disks because of their
scale.
Suppose the data lives on several disks, and each disk can hold only 5 data points; the sample dataset is then split across 3 storage disks. You must load all the data from all of the disks to obtain the information necessary for your query. If each disk is filled to capacity with data, this can easily require extra memory utilization and quickly become burdensome.
That said, row formatting does offer advantages when schema changes – we’ll cover this later.
In general, if the data is wide – that is, it has many columns – and is write-heavy, a row-based
format may be best.
Columnar format: the sample data's five fields (ID, First Name, Last Name, City, Age) are each stored together, column by column.
Writing data is somewhat more time-intensive in columnar-formatted data than it is in row-formatted data: instead of just appending to the end as in a row-based format, you must read in the entire dataset, navigate to the appropriate positions, and make the proper insertions.
Navigating to the appropriate positions for insertions is wasteful. But the way the data is
partitioned across multiple disks can alleviate this. Using the same framework as with the row-formatted data, suppose you had a separate storage location (disk) for each attribute (column), so in this sample data case there are 5 disks.
It still takes a large amount of memory and computational time writing to 5 separate locations
to add data for a 3rd individual. However, you are simply appending to the end of the files
stored in each of the locations.
So the columnar format doesn’t compare favorably to row formatting with regard to write
operations. But it is superior for read operations such as querying data. Here’s the same read
example from before, applied to column-based storage partitioned by column:
To accomplish this, go only to the storage location that contains information on ages (disk 5)
and read the necessary data. This saves a large amount of memory and time by skipping over
non-relevant data very quickly.
Read: 32, 25, 37 (the Age column only)
Sum: 32 + 25 + 37 = 94
In this case, all reads came from sequential data stored on a single disk.
But efficient querying isn’t the only reason columnar-formatted data is popular. Columnar-
formatted data also allows for efficient compression. By storing each attribute together (ID,
ages, and so on) you can benefit from commonalities between attributes, such as a shared or
common data type or a common length (number of bits) per entry. For example, if you know
that age is an integer data type that won’t exceed a value of 200, you can compress the storage
location and reduce memory usage/allocation, as you don’t need standard amounts of allocated
memory for such values. (A standard INT is typically stored as 4 bytes, for instance, whereas a SMALLINT can be stored as 2 bytes, and an age that never exceeds 200 would even fit in a single byte.)
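A minimal sketch of this effect using NumPy arrays (an analogy rather than any particular file format's compression scheme): storing the sample ages in a narrower integer type immediately reduces the memory footprint:

```python
import numpy as np

ages_wide = np.array([32, 25, 37], dtype=np.int64)     # 8 bytes per value
ages_narrow = np.array([32, 25, 37], dtype=np.uint8)   # 1 byte per value (ages < 200 fit)

print(ages_wide.nbytes)    # 24 bytes
print(ages_narrow.nbytes)  # 3 bytes
```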
Three popular file formats for big data include: Avro, ORC, and Parquet.
Apache Avro was a project initially released late in 2009 as a row-based, language-neutral,
schema-based serialization technique and object container file format. It is an open-source
format that is often recommended for Apache Kafka. It’s preferred when serializing data in
Hadoop. The Avro format stores data definitions (the schema) in JSON and is easily read and
interpreted. The data within the file is stored in binary format, making it compact and space-
efficient.
What makes Avro stand out as a file format is that it is self-describing. Avro bundles serialized
data with the data’s schema in the same file – the message header contains the schema used to
serialize the message. This enables software to efficiently deserialize messages.
The Avro file format supports schema evolution. It supports dynamic data schemas that can
change over time; it can easily handle schema changes such as missing fields, added fields, or
edited/changed fields. In addition to schema flexibility, the Avro format supports complex data
structures such as arrays, enums, maps, and unions.
Avro-formatted files are splittable and compressible (though they do not compress as well as some columnar formats). Avro files work well for storage in a Hadoop ecosystem and for running processes in parallel (because they are faster to load).
One caveat: If every message includes the schema in its header, it doesn’t scale well. Avro
requires the reading and re-reading of the repeated header schema across multiple files, so at
scale Avro can cause inefficiencies in bandwidth and storage space, slowing compute
processes.
Avro strengths:
- Splittable files
- Write-heavy operations (such as ingestion into a data lake), due to serialized row-based storage
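A short sketch of writing and reading an Avro object container file in Python, assuming the third-party fastavro package is installed (the schema, file name, and record values are illustrative):

```python
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"id": 1, "age": 32}, {"id": 2, "age": 25}, {"id": 3, "age": 37}]

# The schema is written into the file header alongside the binary row data
with open("people.avro", "wb") as out:
    writer(out, schema, records)

# The reader uses the embedded schema to deserialize each record
with open("people.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```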
Optimized Row Columnar (ORC) is an open-source columnar storage file format originally
released in early 2013 for Hadoop workloads. ORC provides a highly-efficient way to store
Apache Hive data, though it can store other data as well. It’s the successor to the traditional
Record Columnar File (RCFile) format. ORC was designed and optimized specifically with
Hive data in mind, improving the overall performance when Hive reads, writes, and processes
data. As a result, ORC supports ACID transactions when working with Hive.
The ORC file format stores collections of rows in a single file, in a columnar format within the
file. This enables parallel processing of row collections across a cluster. Due to the columnar
layout, each file is optimal for compression, enabling skipping of data and columns to reduce
read and decompression loads.
ORC files are organized into independent stripes of data. Each stripe consists of an index, row
data, and a footer. The footer holds key statistics for each column within a stripe (count, min,
max, sum, and so on), enabling easy skipping as needed. The footer also contains metadata
about the ORC file, making it easy to combine information across stripes.
[Figure: ORC file structure]
ORC Format Features
- Complex type support, including DateTime, decimal, struct, list, map, and union
- Predicate pushdown
By default, a stripe size is 250 MB; the large stripe size is what enables efficient reads. ORC
file formats offer superior compression characteristics (ORC is often chosen over Parquet when
compression is the sole criterion), including compression done with Snappy or Zlib. An
additional feature unique to ORC is predicate pushdown. In predicate pushdown, the system
checks a query or condition against file metadata to see whether rows must be read. This
increases the potential for skipping.
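A brief sketch of writing and reading an ORC file in Python, assuming a pyarrow build with ORC support is installed (column names, values, and the file name are illustrative):

```python
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({"id": [1, 2, 3], "age": [32, 25, 37]})
orc.write_table(table, "people.orc")  # columnar, compressed storage

# Read back only the column we need; other columns are skipped entirely
ages = orc.ORCFile("people.orc").read(columns=["age"])
print(ages.column("age").to_pylist())  # [32, 25, 37]
```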
The Apache Parquet file format was first introduced in 2013 as an open-source storage format
that boasted substantial advances in efficiencies for analytical querying. According
to https://parquet.apache.org:
“Apache Parquet is a … file format designed for efficient data storage and retrieval. It provides
efficient data compression and encoding schemes with enhanced performance to handle
complex data in bulk…”
Parquet files support complex nested data structures in a flat format and offer multiple
compression options.
Parquet is broadly accessible. It supports multiple coding languages, including Java, C++, and
Python, to reach a broad audience. This makes it usable in nearly any big data setting. As it’s
open source, it avoids vendor lock-in.
Parquet is also self-describing. It contains metadata that includes file schema and structure.
You can use this to separate different services for writing, storing, and reading Parquet files.
Parquet files are composed of row groups, a header, and a footer. Within each row group, the values of the same column are stored together.
Most importantly, at its core Parquet formatting is designed to support fast data processing for
complex nested data structures such as log files and event streams at scale. It saves on cloud
storage space by using highly efficient columnar compression, and provides flexible encoding
schemes to handle columns with different data types; you can specify compression schemes on
a per-column basis. It is extensible to future encoding mechanisms as well, making Parquet
“future-proof” in this regard. Parquet is supported by many query engines, including Amazon Athena, Amazon Redshift Spectrum, Qubole, Google BigQuery, Microsoft Azure Data Explorer, and Apache Drill. Compared with text formats, it is reported to unload up to 2x faster and to consume as little as one sixth of the storage in Amazon S3.
Parquet files are splittable as they store file footer metadata containing information on block
boundaries for the file. Systems access this block boundary information to determine whether
to skip or read only specific parts (blocks) of the file – allowing for more efficient reads – or
to more easily submit different blocks for parallel processing. Parquet supports automatic
schema merging for schema evolution, so you can start with a simple schema and gradually
add more columns as needed.
As noted above, Parquet is language agnostic.
Parquet files are often most appropriate for analytics (OLAP) use cases, typically when
traditional OLTP databases are the source. They offer highly-efficient data compression and
decompression. They also feature increased data throughput and performance using techniques
such as data skipping (in which queries return specific column values and do not read an entire
row of data, greatly minimizing I/O).
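A minimal Python sketch of this kind of column pruning with Parquet, assuming pandas plus a Parquet engine such as pyarrow is installed (file name, column names, and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "age": [32, 25, 37]})
df.to_parquet("people.parquet", compression="snappy")  # columnar, compressed

# Read back only the age column; the footer metadata lets the reader
# skip the other columns instead of scanning whole rows
ages = pd.read_parquet("people.parquet", columns=["age"])
print(ages["age"].sum())  # 94
```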
There is overlap between ORC and Parquet. But Parquet is ideal for write-once, read-many
analytics, and in fact has become the de facto standard for OLAP on big data. It also works
best with Spark, which is widely used throughout the big data ecosystem.
Parquet is really the best option when speed and efficiency of queries are most important. It’s
optimized to work with complex data in bulk, including nested data. It’s also highly-effective
at minimizing table scans and, like ORC, can compress data to small sizes. It provides the
widest range of options for squeezing greater efficiency from your queries regardless of vendor.
And its extensibility with regard to future encoding mechanisms also makes it attractive to
organizations concerned with keeping their data infrastructure current and optimized.
It’s worth noting that new table formats are also emerging to support the substantial increases
in the volume and velocity (that is, streaming) of data. These formats include Apache Iceberg,
Apache Hudi, and Databricks Delta Lake. We will explore these in a later lecture.
In summary, compared with plain-text formats such as CSV and JSON, these big data file formats offer:
1. Faster reads
2. Faster writes
3. Splittable file support
4. Schema evolution support
5. Advanced compression