2.1.1 Data Formats

Uploaded by

Mukesh Nalekar

Data Formats

Pravin Y Pawar

Adapted from Designing Machine Learning Systems by Chip Huyen

Data Sources

• An ML system can work with data from many different sources, which
o have different characteristics and access patterns
o can be used for different purposes
o require different processing methods

• Understanding the sources of data can help access and manipulate data more efficiently
Data Sources(2)
User Generated data
• One common source is user input data: data explicitly entered by users, which is often the input on
which ML models make predictions
o can be text, images, videos, uploaded files, etc.

• If there is a wrong way for humans to input data, humans are going to do it, and as a result, user
input data can easily be malformatted
• If user input is supposed to be
o text, it might be too long or too short
o numerical values, users might accidentally enter text
o a file upload, users might upload files in the wrong format

• Challenges
o User input data requires more heavy-duty checking and processing
o Users also have little patience: in most cases, when we input data, we expect to get results back
immediately
o Therefore, user input data tends to require fast processing
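
The three failure modes listed above (text of the wrong length, non-numeric values, wrong file format) can be sketched as lightweight checks. This is only an illustrative sketch: the length limit, the allowed file suffixes, and the function names are assumptions made for the example, not something taken from the source.

```python
from pathlib import Path

# Hypothetical limits, chosen only for illustration.
MAX_TEXT_LEN = 280
ALLOWED_UPLOAD_SUFFIXES = {".png", ".jpg", ".pdf"}


def validate_text(text: str, min_len: int = 1, max_len: int = MAX_TEXT_LEN) -> str:
    """Reject text that is too short or too long after trimming whitespace."""
    cleaned = text.strip()
    if not (min_len <= len(cleaned) <= max_len):
        raise ValueError(f"text length must be between {min_len} and {max_len}")
    return cleaned


def validate_number(raw: str) -> float:
    """Reject input that is not a number, e.g. a user typing 'ten'."""
    try:
        return float(raw)
    except ValueError:
        raise ValueError(f"expected a number, got {raw!r}") from None


def validate_upload(filename: str) -> str:
    """Reject uploads whose file suffix is not in the allowed set."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_UPLOAD_SUFFIXES:
        raise ValueError(f"unsupported file format: {suffix or 'none'}")
    return filename
```

Checks like these are cheap, which matters because user input usually has to be processed quickly.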
Data Sources(3)
System-generated data
• Data generated by different components of systems, which include various types of logs and system outputs such as model
predictions

• Logs
o Can record the state of the system and significant events in it, such as memory usage, number of instances, services called, packages
used, etc.
o Can record the results of different jobs, including large batch jobs for data processing and model training
o Provide visibility into how the system is doing; the main purpose of this visibility is debugging and possibly improving the application
o Much less likely to be malformatted, as they are system generated
o Don’t need to be processed as soon as they arrive
o Acceptable to process logs periodically, such as hourly or even daily
o Might still want to process logs fast to detect and be notified whenever something interesting happens

• Because debugging ML systems is hard,
o it is common practice to log everything you can
o which means the volume of logs can grow very quickly

• This leads to two problems
o The first is that it can be hard to know where to look because signals are lost in the noise; many services exist to process and analyze logs, such as Logstash, Datadog, Logz.io, etc.
o The second is how to store a rapidly growing volume of logs; in most cases, logs only need to be stored for as long as they are useful and can be discarded when they are no longer relevant
o Rarely accessed logs can also be kept in low-access storage, which costs much less than higher-frequency-access storage
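
Periodic log processing can be sketched as a small hourly aggregation that flags hours with unusually many errors. This is a minimal sketch under stated assumptions: the log line layout (`<ISO timestamp> <LEVEL> <message>`), the sample lines, and the alert threshold are all hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines in the assumed "<ISO timestamp> <LEVEL> <message>" layout.
LOG_LINES = [
    "2024-01-01T10:05:00 INFO model loaded",
    "2024-01-01T10:17:30 ERROR prediction service timeout",
    "2024-01-01T10:45:10 INFO batch job finished",
    "2024-01-01T11:02:00 ERROR out of memory",
    "2024-01-01T11:30:00 ERROR out of memory",
]


def errors_per_hour(lines):
    """Count ERROR lines per hour bucket."""
    counts = Counter()
    for line in lines:
        timestamp, level, _message = line.split(" ", 2)
        if level == "ERROR":
            # Truncate the timestamp to the start of its hour.
            hour = datetime.fromisoformat(timestamp).replace(minute=0, second=0)
            counts[hour.isoformat()] += 1
    return counts


def hours_to_alert(counts, threshold=2):
    """Return the hour buckets whose error count reaches the threshold."""
    return sorted(hour for hour, n in counts.items() if n >= threshold)


alerts = hours_to_alert(errors_per_hour(LOG_LINES))
```

With the sample lines, only the 11:00 bucket reaches the threshold of two errors.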
Data Sources(4)
System-generated user data
• Users’ behaviors
o such as clicking, choosing a suggestion, scrolling, zooming, ignoring a popup, or spending an unusual
amount of time on certain pages
o Even though this is system-generated data, it’s still considered part of user data
o Might be subject to privacy regulations
o Can be used by ML systems to make predictions and to train their future versions

• Internal databases
o generated by various services and enterprise applications in a company
o manage the company’s assets such as inventory, customer relationships, users, and more
o can be used by ML models directly or by various components of an ML system
o For example, when users enter a search query on Amazon, one or more ML models process that query to detect
its intention (what products are users actually looking for?), then Amazon needs to check
its internal databases for the availability of these products before ranking them and showing them to
users
Data Sources(5)
Third-party data - wonderfully weird world
• Types of data
o First-party data is the data that your company already collects about its users or customers
o Second-party data is the data collected by another company on its own customers that it makes
available to you
o Third-party data is collected by companies on members of the public who aren’t their customers

• The rise of the Internet and smartphones has made it much easier for all types of data to be
collected
o Riddled with privacy concerns
o Data from apps, websites, check-in services, etc. are collected and (hopefully) anonymized to generate
activity history for each person
o Third-party data is usually sold as structured data after being cleaned and processed by vendors.
Data Formats
Need
• Once you have data, you need to store it (or “persist” it, in technical terms)
• Since data comes from multiple sources with different access patterns
o storing your data isn’t always straightforward and can be costly
o it is important to think about how the data will be used in the future so that the format you choose makes sense

• Questions that need to be considered
o How do I store multimodal data, e.g., when each sample might contain both images and texts?
o Where do I store my data so that it’s cheap and still fast to access?
o How do I store complex models so that they can be loaded and run correctly on different hardware?

• Data serialization
o The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later
o There are many data serialization formats

• When choosing a format to work with, consider characteristics such as
o human readability
o access patterns
o whether it’s based on text or binary, which influences the size of its files
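
Serialization into a text format and into a binary format can be sketched with Python's standard `json` and `pickle` modules. The record below is a made-up sample, and `pickle` is used here only as a convenient stand-in for "a binary format"; it is not one of the formats this deck recommends.

```python
import json
import pickle

# Hypothetical sample record, used only for illustration.
record = {"user_id": 42, "clicks": [3, 1, 4], "active": True}

# Text-based serialization: the bytes decode to a human-readable string.
text_bytes = json.dumps(record).encode("utf-8")

# Binary serialization: the bytes are only meaningful to a program
# that knows the pickle protocol.
binary_bytes = pickle.dumps(record)

# Both can be reconstructed into an equal object later.
restored_from_text = json.loads(text_bytes.decode("utf-8"))
restored_from_binary = pickle.loads(binary_bytes)

print(text_bytes.decode("utf-8"))
print(len(text_bytes), len(binary_bytes))
```

Which encoding is smaller depends on the data and the format, so the sizes are printed rather than assumed.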
Data Formats(2)
Common data formats
• Parquet is a column-oriented data storage format designed for the Apache Hadoop ecosystem (backed by Cloudera, in
collaboration with Twitter).
• Avro is a row-based storage format where data is indexed to improve query performance. It is listed as a
“Binary primary” format.
• Parquet and Avro are both binary file formats used for storing and processing structured data in the Apache Hadoop
ecosystem. However, “Binary primary” is not standard terminology for either format and has no specific meaning in
relation to them.
• Parquet is a columnar storage format, meaning it organizes and stores data by column rather than by row. This columnar
organization offers advantages such as efficient compression, better query performance, and the ability to load only the
required columns for processing, which can improve overall system performance.
• Avro, on the other hand, is a row-based data serialization system that provides a compact binary format for data storage.
Avro stores the schema with the data, which makes its files self-describing and easy to read from multiple programming
languages. Avro is not primarily known for indexing capabilities.
• In summary, Parquet is a columnar storage format optimized for efficient querying and processing of structured data, while
Avro is a row-based data serialization system with compact binary storage. Both formats have their own advantages and use
cases within the Apache Hadoop ecosystem, but the term "Binary Primary" is not typically used to describe them.
Data Formats(3)
JSON
• JSON, JavaScript Object Notation, is everywhere
o Though it was derived from JavaScript, it’s language-independent — most modern programming
languages can generate and parse JSON
o Human-readable
o Key-value pair paradigm is simple but powerful, capable of handling data of different levels of
structuredness
• For example, a person’s data can be stored in a structured format, with one key per field

• The same data can also be stored as an unstructured blob of text
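
As a hypothetical illustration of the two bullets above, the same person can be written once with a key per field and once as a single free-text value. The field names and values are invented for this sketch.

```python
import json

# Structured: every attribute has its own key, and nesting is allowed.
structured = {
    "firstName": "Ada",
    "lastName": "Example",
    "age": 30,
    "address": {"city": "Pune", "postalCode": "411001"},
}

# Unstructured: the same facts live inside one blob of text.
unstructured = {
    "text": "Ada Example, aged 30, lives in Pune, postal code 411001."
}

structured_json = json.dumps(structured, indent=2)
unstructured_json = json.dumps(unstructured)

# With the structured form, a field is read directly by key.
age = json.loads(structured_json)["age"]

# With the unstructured form, there is no "age" key to read;
# the value would have to be extracted from the text itself.
has_age_key = "age" in json.loads(unstructured_json)
```

Both documents are valid JSON; the difference is how much of the structure the keys expose.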
Data Formats(4)
Row-major vs. Column-major Format
• Two common formats that represent two distinct paradigms are CSV and Parquet
o CSV is row-major, which means consecutive elements in a row are stored next to each other in memory
o Parquet is column-major, which means consecutive elements in a column are stored next to each other

• Modern computers process sequential data more efficiently than non-sequential data
o For row-major formats, accessing data by rows is expected to be faster than accessing data by columns
o For column-major formats, accessing data by columns is expected to be faster than accessing data by rows

• Column-major formats allow flexible column-based reads, especially if data is large with thousands,
if not millions, of features
• Row-major formats allow faster data writes

• Overall,
o row-major formats are better when you have to do a lot of writes
o column-major ones are better when you have to do a lot of column-based reads.
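
The difference between the two layouts can be sketched by flattening a small table into one buffer in each order. This only illustrates the layout: Python lists store references, so it is not a performance benchmark, and the table values and helper names are assumptions for the example.

```python
# A hypothetical table with 2 rows and 3 columns.
table = [
    [1, 2, 3],
    [4, 5, 6],
]
n_rows, n_cols = 2, 3

# Row-major: elements of the same row sit next to each other.
row_major = [value for row in table for value in row]

# Column-major: elements of the same column sit next to each other.
col_major = [table[r][c] for c in range(n_cols) for r in range(n_rows)]


def row_from_row_major(buf, r, n_cols):
    """A row is one contiguous slice of a row-major buffer."""
    return buf[r * n_cols:(r + 1) * n_cols]


def column_from_row_major(buf, c, n_cols):
    """A column needs a strided read that jumps across a row-major buffer."""
    return buf[c::n_cols]


def column_from_col_major(buf, c, n_rows):
    """A column is one contiguous slice of a column-major buffer."""
    return buf[c * n_rows:(c + 1) * n_rows]
```

Appending a new row adds one contiguous chunk to the row-major buffer, whereas the column-major buffer needs one insertion per column, which is the intuition behind the write-versus-read trade-off.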
Data Formats(5)
Row-major vs. Column-major Format
Data Formats(6)
Text vs. Binary Format
• CSV and JSON are text files whereas Parquet files are binary files
o Text files are files in plain text, which usually means they are human-readable
o Binary files are the catchall for all nontext files; they are meant to be read or used by programs that know how
to interpret the raw bytes
o A program has to know exactly how the data inside a binary file is laid out to make use of the file

• Binary files are more compact
o AWS recommends using the Parquet format because “the Parquet format is up to 2x faster to unload and consumes up to
6x less storage in Amazon S3, compared to text formats.”
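
Why binary encodings tend to be more compact can be sketched with a single number stored both ways. The choice of a little-endian 32-bit integer is an assumption made for this example.

```python
import struct

number = 1000000

# Text: one byte per character, so 7 characters take 7 bytes.
as_text = str(number).encode("ascii")

# Binary: a 32-bit integer always takes 4 bytes ("<i" = little-endian int32).
as_binary = struct.pack("<i", number)

# Reading the bytes back with the correct layout recovers the number.
decoded = struct.unpack("<i", as_binary)[0]

# Reading the same bytes with the wrong layout (as a 32-bit float)
# yields a different value, which is why the reader must know the layout.
misread = struct.unpack("<f", as_binary)[0]

print(len(as_text), len(as_binary))
```

Here the text form takes 7 bytes and the binary form takes 4 bytes.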
Thank You!
In our next session:
