Build Data Pipelines with Delta Live Tables
Module 04
[Diagram: the multi-hop architecture — streaming sources (e.g. Kinesis) and files (CSV, JSON, TXT) from the data lake land in the Bronze layer (raw ingestion), are refined in Silver (filtered, cleaned), and aggregated in Gold (business-level), which feeds streaming analytics, BI & reporting, and data science & ML.]
Multi-Hop in the Lakehouse
Bronze Layer
[Diagram: the multi-hop flow with the Bronze layer highlighted — CSV, JSON, and TXT sources are ingested raw; later hops add data quality and serve streaming analytics, AI and reporting.]
• Difficult to switch between batch and stream processing
• Impossible to trace data lineage
• Error handling and recovery is laborious
• Any append-only Delta table can be read as a stream (i.e. from the live schema, from the catalog, or just from a path).
• Pitfall: my_table must be an append-only source, e.g. it may not:
  • be the target of APPLY CHANGES INTO
  • define an aggregate function
  • be a table on which you've executed DML to delete/update a row (see GDPR section)
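A minimal sketch of reading one live table as a stream into another (table and column names are hypothetical):

-- events_raw is assumed to be an append-only live table
CREATE STREAMING LIVE TABLE events_cleaned AS
SELECT * FROM STREAM(LIVE.events_raw) WHERE event_id IS NOT NULL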
• Table definitions are written (but not run) in notebooks, as shown in the sketch below. Databricks Repos allow you to version control your table definitions.
• A Pipeline picks one or more notebooks of table definitions, as well as any configuration required.
• DLT will create or update all the tables in the pipeline.
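A minimal sketch of a table definition as it might appear in one of those notebooks (table name and path are hypothetical):

-- Bronze-layer ingestion from cloud storage with Auto Loader
CREATE STREAMING LIVE TABLE orders_bronze
COMMENT "Raw orders ingested incrementally from cloud storage"
AS SELECT * FROM cloud_files("/mnt/raw/orders", "json")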
Development vs Production
Fast iteration or enterprise-grade reliability
• Time and current status, for all operations
• Table schemas, definitions, and declared properties
• Expectation pass / failure / drop statistics
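A minimal sketch (table and column names are hypothetical) of an expectation whose pass / failure / drop statistics would surface here:

-- Rows failing the constraint are dropped and counted in the pipeline metrics
CREATE STREAMING LIVE TABLE orders_valid (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.orders_bronze)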
A pipeline's configuration is a map of key-value pairs that can be used to parameterize your code:
• Improve code readability/maintainability

CREATE STREAMING LIVE TABLE data AS
SELECT * FROM cloud_files("${my_etl.input_path}", "json")
APPLY CHANGES INTO maintains an up-to-date snapshot of a changing source. It needs:
• A target for the changes to be applied to.
• A source of changes; currently this has to be a stream.
• A unique key that can be used to identify a given row.
[Diagram: a change keyed on id in the cities source (e.g. correcting "Bekerly, CA" to "Berkeley, CA") is applied to the replicated_table target.]
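A minimal sketch of this pattern in DLT SQL (the source table, key, and sequencing column names are hypothetical):

-- Target table the changes are applied to
CREATE OR REFRESH STREAMING LIVE TABLE replicated_table;

-- Apply the change stream, keyed by id and ordered by a sequencing column
APPLY CHANGES INTO LIVE.replicated_table
FROM STREAM(LIVE.cities_changes)
KEYS (id)
SEQUENCE BY change_time;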
• DLT encodes Delta best practices automatically when creating DLT tables.
  How: DLT sets the following table properties:
  • optimizeWrite
  • autoCompact
  • tuneFileSizesForRewrites
• DLT automatically manages your physical data to minimize cost and optimize performance.
  How:
  • runs vacuum daily
  • runs optimize daily
  You can still tell us how you want it organized (i.e. ZORDER) — see the sketch below.
• Schema evolution is handled for you.
  How: modifying a live table transformation to add/remove/rename a column will automatically do the right thing. When removing a column in a streaming live table, old values are preserved.
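A minimal sketch of specifying the layout yourself, assuming the pipelines.autoOptimize.zOrderCols table property (table and column names are hypothetical):

-- Ask DLT to Z-order the optimized files by these columns
CREATE STREAMING LIVE TABLE orders_silver
TBLPROPERTIES ("pipelines.autoOptimize.zOrderCols" = "order_date,customer_id")
AS SELECT * FROM STREAM(LIVE.orders_bronze)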