2.1.1 Data Formats

Uploaded by

Mukesh Nalekar

Data Formats

Pravin Y Pawar

Adapted from Designing Machine Learning Systems

by Chip Huyen
Data Sources

• An ML system can work with data from many different sources. These sources

o have different characteristics and access patterns
o can be used for different purposes
o require different processing methods

• Understanding the sources of data can help access and manipulate data more efficiently
Data Sources(2)
User Generated data
• One common source is user input data, data explicitly input by users, which is often the input on
which ML models can make predictions
o can be texts, images, videos, uploaded files, etc.

• If there is a wrong way for humans to input data, humans are going to find it, and as a result, user
input data can easily be malformed
• If user input is supposed to be
o text, it might be too long or too short
o numerical values, users might accidentally enter text
o a file upload, users might upload files in the wrong format

• Challenges
o User input data requires more heavy-duty checking and processing
o Users also have little patience. In most cases, when we input data, we expect to get results back
immediately
o Therefore, user input data tends to require fast processing.
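The checks described above can be sketched in a few lines of Python. This is a minimal illustration, not a production validator: the length limit, the extension whitelist, and the function names are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical limits chosen only for this example.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg"}
MAX_TEXT_LENGTH = 280


def validate_text(text, min_len=1, max_len=MAX_TEXT_LENGTH):
    """Return the stripped text, or raise ValueError if it is too short or too long."""
    cleaned = text.strip()
    if not (min_len <= len(cleaned) <= max_len):
        raise ValueError(f"text length {len(cleaned)} outside [{min_len}, {max_len}]")
    return cleaned


def validate_number(raw):
    """Parse user input as a float; raise ValueError if it is not numeric."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"expected a number, got {raw!r}") from None


def validate_upload(filename):
    """Accept only file names whose extension is in the whitelist."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type {ext!r}")
    return filename
```

Each check is cheap and fails early, which fits the requirement that user input be processed quickly.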
Data Sources(3)
System-generated data
• Data generated by different components of systems, which include various types of logs and system outputs such as model
predictions

• Logs
o Can record the state of the system and significant events in the system, such as memory usage, number of instances, services called, packages
used, etc. Can record the results of different jobs, including large batch jobs for data processing and model training
o Provides visibility into how the system is doing, and the main purpose of this visibility is for debugging and possibly improving the application
o Much less likely to be malformed because they are system-generated
o Don’t need to be processed as soon as they arrive
o acceptable to process logs periodically, such as hourly or even daily
o might still want to process logs fast to be able to detect and be notified whenever something interesting happens

• Because debugging ML systems is hard,

o Common practice to log everything you can
o Means volume of logs can grow very, very quickly

• Leads to two problems

o The first is that it can be hard to know where to look because signals are lost in the noise
o Many services process and analyze logs to help with this, such as Logstash, Datadog, and Logz.io
o The second problem is how to store a rapidly growing volume of logs
o In most cases, you only need to store logs for as long as they are useful and can discard them when they are no longer relevant
o Logs that are rarely accessed can also be kept in low-access storage, which costs much less than higher-frequency-access storage
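Periodic log processing of the kind described above might look like the following sketch. The log line layout ("timestamp LEVEL message"), the sample entries, and the alert threshold are assumptions made for illustration; real systems would typically rely on one of the log services mentioned above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines in "<ISO timestamp> <LEVEL> <message>" form.
SAMPLE_LOGS = [
    "2024-01-01T10:05:00 INFO model loaded",
    "2024-01-01T10:17:42 ERROR prediction failed",
    "2024-01-01T10:48:03 ERROR prediction failed",
    "2024-01-01T11:02:10 INFO batch job finished",
    "2024-01-01T11:30:00 ERROR out of memory",
]


def errors_per_hour(lines):
    """Count ERROR entries per hourly bucket."""
    counts = Counter()
    for line in lines:
        timestamp, level, _message = line.split(" ", 2)
        if level == "ERROR":
            hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
            counts[hour] += 1
    return counts


def hours_to_alert(counts, threshold=2):
    """Return the hourly buckets whose error count reaches the threshold."""
    return sorted(hour for hour, n in counts.items() if n >= threshold)
```

Aggregating by hour reduces the noise to a handful of counts, and the threshold check is one simple way to be notified when something interesting happens.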
Data Sources(4)
System-generated users data
• Users’ behaviors
o such as clicking, choosing a suggestion, scrolling, zooming, ignoring a popup, or spending an unusual
amount of time on certain pages
o even though this is system-generated data, it’s still considered part of user data
o might be subject to privacy regulations
o Can be used for ML systems to make predictions and to train their future versions

• Internal databases
o generated by various services and enterprise applications in a company
o manage their assets such as inventory, customer relationship, users, and more
o can be used by ML models directly or by various components of an ML system
o For example,
when users enter a search query on Amazon, one or more ML models will process that query to detect the
intention of that query — what products users are actually looking for? — then Amazon will need to check
their internal databases for the availability of these products before ranking them and showing them to
users.
Data Sources(5)
Third-party data - wonderfully weird world
• Types of data
o First-party data is the data that your company already collects about your users or customers
o Second-party data is the data collected by another company on their own customers that they make
available to you
o Third-party data is collected by companies on members of the public who aren’t their direct customers

• The rise of the Internet and smartphones has made it much easier for all types of data to be
collected
o Riddled with privacy concerns
o Data from apps, websites, check-in services, etc. are collected and (hopefully) anonymized to generate
activity history for each person
o Third-party data is usually sold as structured data after being cleaned and processed by vendors.
Data Formats
Need
• Once you have data, you need to store it (or “persist” it, in technical terms)
• Since data comes from multiple sources with different access patterns
o storing your data isn’t always straightforward and can be costly
o important to think about how the data will be used in the future so that the format will make sense

• Questions to consider

o How do I store multimodal data, for example when each sample might contain both images and text?
o Where to store your data so that it’s cheap and still fast to access?
o How to store complex models so that they can be loaded and run correctly on different hardware?

• Data serialization
o The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later
o Many data serialization formats

• When considering a format to work with, need to consider different characteristics such as
o human readability
o access patterns
o and whether it’s based on text or binary, which influences the size of its files
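As a minimal sketch of serialization, the snippet below converts one Python object into a text format (JSON) and a binary format (pickle) and reconstructs it from each. The sample record and its field names are hypothetical; both modules are part of the Python standard library.

```python
import json
import pickle

# Hypothetical sample that mixes text with a reference to an image file.
sample = {"id": 1, "caption": "a cat on a sofa", "image_path": "images/0001.png", "label": 1}

# Text serialization: human-readable and language-independent.
as_json = json.dumps(sample)

# Binary serialization: Python-specific and not human-readable.
# Only unpickle data from sources you trust.
as_pickle = pickle.dumps(sample)

# Both formats can be reconstructed into an equal object later.
restored_from_json = json.loads(as_json)
restored_from_pickle = pickle.loads(as_pickle)
```

The JSON string can be read by a person or by another language, while the pickle bytes are only meaningful to a program that knows the format.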
Data Formats(2)
Common data formats
• Parquet is a column-oriented data storage format designed for the Apache Hadoop ecosystem (backed by Cloudera, in
collaboration with Twitter).
• Avro is a row-based storage format with a compact binary encoding.
• Parquet and Avro are both binary file formats used for storing and processing structured data in the Apache Hadoop
ecosystem. They are sometimes labeled "Binary Primary", but that is not standard terminology and has no specific meaning
for these formats.
• Parquet is a columnar storage format: it organizes and stores data by column rather than by row. This layout offers
efficient compression, better query performance, and the ability to load only the columns required for processing,
which can improve overall system performance.
• Avro is a row-based data serialization system that provides a compact binary format for data storage. Avro stores a
schema alongside the data, so its files are self-describing and can be read from multiple programming languages.
Avro is not primarily known for indexing capabilities.
• In summary, Parquet is a columnar storage format optimized for efficient querying and processing of structured data, while
Avro is a row-based data serialization system with compact binary storage. Each has its own advantages and use cases
within the Apache Hadoop ecosystem.
Data Formats(3)
JSON
• JSON, JavaScript Object Notation, is everywhere
o Though it was derived from JavaScript, it’s language-independent — most modern programming
languages can generate and parse JSON
o Human-readable
o Key-value pair paradigm is simple but powerful, capable of handling data of different levels of
structuredness
• For example, a person’s data can be stored in a structured format like the following

• The same data can also be stored in an unstructured blob of text like the following
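As an illustration of the two levels of structuredness, the sketch below builds both forms with Python's standard json module. The record is a made-up example (names, age, and address are hypothetical), not the one used in the original slides.

```python
import json

# Structured: every field has its own key, and nesting groups related fields.
structured = {
    "firstName": "Ada",
    "lastName": "Example",
    "age": 36,
    "address": {"city": "Pune", "postalCode": "411001"},
}

# Unstructured: the same facts packed into one free-text value.
unstructured = {
    "text": "Ada Example, 36 years old, lives in Pune, postal code 411001"
}

structured_json = json.dumps(structured, indent=2)
unstructured_json = json.dumps(unstructured)

# With the structured form, a field can be read directly by key.
city = json.loads(structured_json)["address"]["city"]
```

Both are valid JSON, but only the structured form lets a program retrieve the city without parsing free text.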
Data Formats(4)
Row-major vs. Column-major Format
• Two common formats that represent two distinct paradigms are CSV and Parquet
o CSV is row-major, which means consecutive elements in a row are stored next to each other in memory
o Parquet is column-major, which means consecutive elements in a column are stored next to each other

• Modern computers process sequential data more efficiently than non-sequential data
o If a table is row-major, accessing its rows is expected to be faster than accessing its columns
o If a table is column-major, accessing its columns is expected to be faster than accessing its rows

• Column-major formats allow flexible column-based reads, especially if your data is large with thousands,
if not millions, of features
• Row-major formats allow faster data writes
• Overall,
o row-major formats are better when you have to do a lot of writes
o column-major ones are better when you have to do a lot of column-based reads.
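The trade-off can be seen in a small pure-Python sketch. The table, its column names, and the helper functions are hypothetical; real formats such as CSV and Parquet lay bytes out on disk, but the access pattern is analogous.

```python
# Hypothetical table with three columns: id, price, clicks.

# Row-major: each record is stored together.
rows = [
    (1, 9.99, 12),
    (2, 4.50, 7),
    (3, 20.00, 31),
]

# Column-major: each column is stored together.
columns = {
    "id": [1, 2, 3],
    "price": [9.99, 4.50, 20.00],
    "clicks": [12, 7, 31],
}


def append_row_major(table, record):
    """One append: the whole record lands in a single place."""
    table.append(record)


def append_column_major(table, record):
    """One append per column: the record is scattered across all columns."""
    for name, value in zip(table, record):
        table[name].append(value)


def column_from_rows(table, index):
    """Reading a column from row-major data touches every record."""
    return [record[index] for record in table]


append_row_major(rows, (4, 1.25, 3))
append_column_major(columns, (4, 1.25, 3))

clicks_from_rows = column_from_rows(rows, 2)
clicks_from_columns = columns["clicks"]  # the column is already stored together
```

Writing a record is a single operation in the row-major layout, while reading one column is a single lookup in the column-major layout.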
Data Formats(5)
Row-major vs. Column-major Format
[Figure: row-major vs. column-major layout]
Data Formats(6)
Text vs. Binary Format
• CSV and JSON are text files whereas Parquet files are binary files
o Text files are files that are in plain text, which usually means they are human-readable
o Binary files are, loosely, all non-text files; they are meant to be read or used by programs that know how
to interpret the raw bytes
o A program has to know exactly how the data inside the binary file is laid out to make use of the file

• Binary files are more compact.

o AWS recommends using the Parquet format because
“the Parquet format is up to 2x faster to unload and consumes up to
6x less storage in Amazon S3, compared to text formats.”
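A small worked example shows why binary encodings tend to be more compact. The value and the choice of a 32-bit integer layout are assumptions made for illustration; struct is part of the Python standard library.

```python
import struct

value = 1000000

# Text encoding: one byte per digit.
as_text = str(value).encode("ascii")

# Binary encoding: a 32-bit little-endian signed integer.
as_binary = struct.pack("<i", value)

text_size = len(as_text)      # 7 bytes, one per character of "1000000"
binary_size = len(as_binary)  # 4 bytes, fixed by the "<i" layout

# A reader must know the layout ("<i") to interpret the raw bytes.
(decoded,) = struct.unpack("<i", as_binary)
```

The text form grows with the number of digits, while the binary form stays at four bytes for any value that fits in 32 bits; the cost is that the bytes are unreadable without knowing the layout.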
Thank You!
In our next session:
