2.1.1 Data Formats
Pravin Y Pawar
• Understanding where data comes from helps you access and manipulate it more efficiently
Data Sources(2)
User Generated data
• One common source is user input data: data explicitly entered by users, which is often the input on
which ML models make predictions
o can be text, images, videos, uploaded files, etc.
• If there is a wrong way for humans to input data, humans will find it, and as a result, user
input data is easily malformed
• If user input is supposed to be
o text, it might be too long or too short
o numerical values, users might accidentally enter text
o a file upload, users might upload files in the wrong format
• Challenges
o User input data requires more heavy-duty checking and processing
o Users also have little patience. In most cases, when we input data, we expect to get results back
immediately
o Therefore, user input data tends to require fast processing.
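The checks listed above (text length, numerical values, file formats) can be sketched as a few small validators. This is a minimal illustration, not a prescribed implementation: the function names, the length limits, and the set of allowed extensions are all assumptions invented for the example.

```python
from pathlib import Path

# Hypothetical limits and formats, chosen for illustration only.
MIN_TEXT_LEN = 1
MAX_TEXT_LEN = 280
ALLOWED_EXTENSIONS = {".csv", ".json", ".png"}


def validate_text(value: str) -> str:
    """Reject text that is too short or too long after trimming whitespace."""
    text = value.strip()
    if not MIN_TEXT_LEN <= len(text) <= MAX_TEXT_LEN:
        raise ValueError(
            f"text length must be between {MIN_TEXT_LEN} and {MAX_TEXT_LEN}"
        )
    return text


def validate_number(value: str) -> float:
    """Reject input that cannot be parsed as a numerical value."""
    try:
        return float(value)
    except ValueError:
        raise ValueError(f"expected a number, got {value!r}") from None


def validate_upload(filename: str) -> str:
    """Reject uploads whose extension is not in the allowed set."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file format: {ext or 'none'}")
    return filename
```

Each validator either returns a cleaned value or raises `ValueError` straight away, which suits the fast-feedback requirement: bad input is rejected before any expensive processing starts.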
Data Sources(3)
System-generated data
• Data generated by different components of systems, which include various types of logs and system outputs such as model
predictions
• Logs
o Can record the state of the system and significant events in the system, such as memory usage, number of instances, services called, packages
used, etc. Can record the results of different jobs, including large batch jobs for data processing and model training
o Provides visibility into how the system is doing, and the main purpose of this visibility is for debugging and possibly improving the application
o Much less likely to be malformed, since they are system-generated
o Don’t need to be processed as soon as they arrive
o It is acceptable to process logs periodically, such as hourly or even daily
o You might still want to process logs quickly, to detect and be notified whenever something interesting happens
• Internal databases
o generated by various services and enterprise applications in a company
o used to manage assets such as inventory, customer relationships, users, and more
o can be used by ML models directly or by various components of an ML system
o For example,
when users enter a search query on Amazon, one or more ML models will process that query to detect the
intention of that query — what products users are actually looking for? — then Amazon will need to check
their internal databases for the availability of these products before ranking them and showing them to
users.
Data Sources(5)
Third-party data - wonderfully weird world
• Types of data
o First-party data is the data that your company already collects about its users or customers
o Second-party data is the data collected by another company on their own customers that they make
available to you
o Third-party data companies collect data on the public who aren’t their customers
• The rise of the Internet and smartphones has made it much easier for all types of data to be
collected
o Riddled with privacy concerns
o Data from apps, websites, check-in services, etc. are collected and (hopefully) anonymized to generate
activity history for each person
o Third-party data is usually sold as structured data after being cleaned and processed by vendors.
Data Formats
Need
• Once you have data, you need to store it (or “persist” it, in technical terms)
• Since data comes from multiple sources with different access patterns
o storing your data isn’t always straightforward and can be costly
o important to think about how the data will be used in the future so that the format will make sense
• Data serialization
o The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later
o Many data serialization formats
• When considering a format to work with, need to consider different characteristics such as
o human readability
o access patterns
o and whether it’s based on text or binary, which influences the size of its files
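As a rough illustration of serialization, the sketch below converts one in-memory object to a text format (JSON) and to a binary format (Python's `pickle`), then reconstructs it from each. The `record` dictionary is invented for the example, and `pickle` stands in for binary formats in general; it is Python-specific and is not one of the formats discussed in these slides.

```python
import json
import pickle

# Hypothetical record, invented for illustration.
record = {"user_id": 42, "scores": [0.5, 0.25, 1.0], "active": True}

# Text-based serialization: human-readable and language-independent.
as_json = json.dumps(record)
restored_from_json = json.loads(as_json)

# Binary serialization: raw bytes that only a matching reader can interpret.
as_pickle = pickle.dumps(record)
restored_from_pickle = pickle.loads(as_pickle)

print(as_json)
print(type(as_pickle).__name__, len(as_pickle), "bytes")
```

Both round trips give back an object equal to the original. The JSON string can be read by a person or by another language, while the pickled bytes cannot.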
Data Formats(2)
Common data formats
• Parquet is a column-oriented data storage format designed for the Apache Hadoop ecosystem (backed by Cloudera, in
collaboration with Twitter)
o Stores data by column rather than by row
o This layout enables efficient compression, better query performance, and loading only the columns required for processing
• Avro is a row-based data serialization system that provides a compact binary format for data storage
o Includes a schema specification, so data files are self-describing and can be read from multiple programming languages
o Better known for its schema support and compact row-based encoding than for indexing
• Both are binary formats used for storing and processing structured data in the Apache Hadoop ecosystem
o "Binary Primary" is not standard terminology for either format
• In summary, Parquet is optimized for efficient querying and processing of structured data, while Avro offers compact,
schema-based row storage; each has its own advantages and use cases
Data Formats(3)
JSON
• JSON, JavaScript Object Notation, is everywhere
o Though it was derived from JavaScript, it’s language-independent — most modern programming
languages can generate and parse JSON
o Human-readable
o Key-value pair paradigm is simple but powerful, capable of handling data of different levels of
structuredness
• For example, a person's data can be stored in a structured format
• The same data can also be stored as an unstructured blob of text
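To make the two levels of structure concrete, here is a small sketch using Python's `json` module. The person record and its field names are hypothetical, invented for this example; they are not the data from the original slide.

```python
import json

# Hypothetical person record with explicit, nested fields.
structured = {
    "firstName": "Ada",
    "lastName": "Lovelace",
    "age": 36,
    "address": {"city": "London", "country": "UK"},
}

# The same information stored as one free-text field.
unstructured = {"text": "Ada Lovelace, aged 36, lives in London, UK."}

structured_json = json.dumps(structured, indent=2)
unstructured_json = json.dumps(unstructured)

# Structured fields can be read directly by key...
city = json.loads(structured_json)["address"]["city"]
# ...whereas the blob needs further parsing to recover any single field.
blob = json.loads(unstructured_json)["text"]

print(structured_json)
print(unstructured_json)
```

Both documents are valid JSON. The difference is how much of the schema is explicit in the keys and how much is left buried in the text.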
Data Formats(4)
Row-major vs. Column-major Format
• Two common formats that represent distinct paradigms are CSV and Parquet
o CSV is row-major, which means consecutive elements in a row are stored next to each other
o Parquet is column-major, which means consecutive elements in a column are stored next to each other
• Modern computers process sequential data more efficiently than non-sequential data
o For row-major formats, accessing data by rows is expected to be faster than accessing data by columns
o For column-major formats, accessing data by columns is expected to be faster than accessing data by rows
• Column-major formats allow flexible column-based reads, especially if the data is large, with thousands,
if not millions, of features
• Row-major formats allow faster data writes
• Overall,
o row-major formats are better when you have to do a lot of writes
o column-major ones are better when you have to do a lot of column-based reads.
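The trade-off can be sketched with plain Python lists standing in for the bytes of a file. The table, the column names, and the flat-list layout are assumptions made for illustration; real CSV and Parquet files add delimiters, metadata, and compression on top of this idea.

```python
# Hypothetical table with three rows and three columns.
header = ["id", "price", "qty"]
rows = [
    [1, 9.5, 3],
    [2, 4.0, 10],
    [3, 7.25, 1],
]
n_rows, n_cols = len(rows), len(header)

# Row-major layout: all values of a row sit next to each other.
row_major = [value for row in rows for value in row]

# Column-major layout: all values of a column sit next to each other.
column_major = [row[j] for j in range(n_cols) for row in rows]

price = header.index("price")

# Reading one column from the row-major layout needs a strided scan...
prices_from_rows = row_major[price::n_cols]
# ...while in the column-major layout it is one contiguous slice.
prices_from_cols = column_major[price * n_rows:(price + 1) * n_rows]

# Appending a new record is a single contiguous write in the row-major layout;
# the column-major layout would need one insertion per column.
row_major.extend([4, 2.0, 5])
```

Both reads return the same prices. What differs is whether the values being read, or the record being written, are adjacent in storage.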
Data Formats(6)
Text vs. Binary Format
• CSV and JSON are text files, whereas Parquet files are binary files
o Text files are files in plain text, which usually means they are human-readable
o Binary files are nontext files: they store raw bytes (0s and 1s) meant to be read or used by programs that know how
to interpret them
o A program has to know exactly how the data inside a binary file is laid out to make use of the file
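A small sketch of the difference, using Python's `struct` module. The value and the 32-bit little-endian layout are choices made for this example only.

```python
import struct

value = 1_000_000

# Text encoding: one byte per character, so "1000000" takes seven bytes.
as_text = str(value).encode("ascii")

# Binary encoding: a little-endian signed 32-bit integer always takes four bytes.
as_binary = struct.pack("<i", value)

# The reader must know the layout ("<i") to recover the value.
(decoded,) = struct.unpack("<i", as_binary)

# Reading the same four bytes with the wrong layout (a 32-bit float)
# succeeds but gives a different, meaningless value.
(misread,) = struct.unpack("<f", as_binary)

print(len(as_text), len(as_binary), decoded)
```

The binary form is smaller here, but only a program that knows the layout can use it. The text form is larger and can be read directly by a person.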