DE Module 4: Build Data Pipelines with Delta Live Tables

The document discusses Delta Live Tables (DLT) in Databricks. It provides an overview of the key concepts, including:
• DLT allows users to easily build batch and streaming data pipelines on Delta Lake to reliably move data between the bronze, silver, and gold layers.
• Live tables are materialized views defined by SQL queries that are kept up to date by pipelines. This simplifies ETL operations.
• Streaming live tables ensure exactly-once processing of input rows and only read data once to reduce costs and latency.
• The document demonstrates how to create a basic live table pipeline with three steps: defining the table, creating a pipeline, and starting the pipeline.


Build Data Pipelines with Delta Live Tables
Module 04

©2023 Databricks Inc. — All rights reserved 1


Agenda
Build Data Pipelines with Delta Live Tables

The Medallion Architecture


Introduction to Delta Live Tables
DE 4.1 - DLT UI Walkthrough
DE 4.1A - SQL Pipelines
DE 4.1B - Python Pipelines
DE 4.2 - Python vs SQL
DE 4.3 - Pipeline Results
DE 4.4 - Pipeline Event Logs
©2023 Databricks Inc. — All rights reserved 2
The Medallion Architecture

©2023 Databricks Inc. — All rights reserved 3


Medallion Architecture in the Lakehouse

Diagram: sources (Kinesis; CSV, JSON, TXT…; data lake) → BRONZE (raw ingestion and history) → SILVER (filtered, cleaned, augmented) → GOLD (business-level aggregates) → consumers (streaming analytics, BI & reporting, data science & ML), with data quality & governance and data sharing across all layers.

©2023 Databricks Inc. — All rights reserved 4
Multi-Hop in the Lakehouse
Bronze Layer

• Typically just a raw copy of ingested data
• Replaces the traditional data lake
• Provides efficient storage and querying of the full, unprocessed history of data

©2023 Databricks Inc. — All rights reserved 5


Multi-Hop in the Lakehouse
Silver Layer

• Reduces data storage complexity, latency, and redundancy
• Optimizes ETL throughput and analytic query performance
• Preserves grain of original data (without aggregations)
• Eliminates duplicate records
• Production schema enforced
• Data quality checks; corrupt data quarantined

©2023 Databricks Inc. — All rights reserved 6


Multi-Hop in the Lakehouse
Gold Layer

• Powers ML applications, reporting, dashboards, ad hoc analytics
• Refined views of data, typically with aggregations
• Reduces strain on production systems
• Optimizes query performance for business-critical data

©2023 Databricks Inc. — All rights reserved 7
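To make the three layers concrete, here is a minimal DLT SQL sketch of a bronze/silver/gold flow. The /data/orders landing path and the table and column names are hypothetical; the syntax used here is introduced over the rest of this module.

-- Bronze: raw copy of the ingested files, full history preserved
CREATE OR REFRESH STREAMING LIVE TABLE orders_bronze
AS SELECT *, current_timestamp() AS ingest_time
FROM cloud_files("/data/orders", "json");

-- Silver: cleaned and validated, original grain preserved
CREATE OR REFRESH STREAMING LIVE TABLE orders_silver (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT order_id, customer_id, CAST(amount AS DOUBLE) AS amount, order_date
FROM STREAM(LIVE.orders_bronze);

-- Gold: business-level aggregate for reporting
CREATE OR REFRESH LIVE TABLE daily_revenue_gold
AS SELECT order_date, sum(amount) AS revenue
FROM LIVE.orders_silver
GROUP BY order_date;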


Introduction to Delta Live Tables

©2023 Databricks Inc. — All rights reserved 8


Multi-Hop in the Lakehouse

Diagram: Databricks Auto Loader ingests CSV, JSON, TXT… → Bronze (raw ingestion and history) → Silver (filtered, cleaned, augmented) → Gold (business-level aggregates) → streaming analytics, AI and reporting, with data quality throughout.

©2023 Databricks Inc. — All rights reserved


The Reality is Not so Simple

Diagram: a realistic pipeline across the Bronze, Silver, and Gold layers, made up of many interdependent tables.

©2023 Databricks Inc. — All rights reserved


Large-scale ETL is complex and brittle

Complex pipeline development
• Hard to build and maintain table dependencies
• Difficult to switch between batch and stream processing

Data quality and governance
• Difficult to monitor and enforce data quality
• Impossible to trace data lineage

Difficult pipeline operations
• Poor observability at the granular, data level
• Error handling and recovery is laborious

©2023 Databricks Inc. — All rights reserved 11


Introducing Delta Live Tables
Make reliable ETL easy on Delta Lake

Operate with agility
• Declarative tools to build batch and streaming data pipelines

Trust your data
• DLT has built-in declarative quality controls
• Declare quality expectations and actions to take

Scale with reliability
• Easily scale infrastructure alongside your data

©2023 Databricks Inc. — All rights reserved 12


What is a LIVE TABLE?

©2023 Databricks Inc. — All rights reserved 13


What is a Live Table?
Live Tables are materialized views for the lakehouse.

A live table is:
• Defined by a SQL query
• Created and kept up-to-date by a pipeline

Live tables provide tools to:
• Manage dependencies
• Control quality
• Automate operations
• Simplify collaboration
• Save costs
• Reduce latency

CREATE OR REFRESH LIVE TABLE report
AS SELECT sum(profit)
FROM prod.sales

©2023 Databricks Inc. — All rights reserved 14


What is a Streaming Live Table?
Based on Spark™ Structured Streaming

A streaming live table is “stateful”:
• It ensures exactly-once processing of input rows
• Inputs are only read once

Streaming live tables compute results over append-only streams such as Kafka, Kinesis, or Auto Loader (files on cloud storage). They allow you to reduce costs and latency by avoiding reprocessing of old data.

CREATE STREAMING LIVE TABLE report
AS SELECT sum(profit)
FROM cloud_files(prod.sales)

©2023 Databricks Inc. — All rights reserved 15


When should I use streaming?

©2022 Databricks Inc. — All rights reserved 16


Using Spark Structured Streaming for ingestion
Easily ingest files from cloud storage as they are uploaded

This example creates a table with all the JSON data stored in "/data":

CREATE STREAMING LIVE TABLE raw_data
AS SELECT *
FROM cloud_files("/data", "json")

• cloud_files keeps track of which files have been read to avoid duplication and wasted work
• Supports both listing and notifications for arbitrary scale
• Configurable schema inference and schema evolution

©2022 Databricks Inc. — All rights reserved 17
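As a sketch of the configurable schema handling mentioned above, Auto Loader options can be passed through the optional map() argument of cloud_files. The /data path is the hypothetical one from the slide; the option keys are standard cloudFiles settings:

CREATE OR REFRESH STREAMING LIVE TABLE raw_data
AS SELECT *
FROM cloud_files(
  "/data", "json",
  map("cloudFiles.inferColumnTypes", "true",           -- infer typed columns instead of reading everything as strings
      "cloudFiles.schemaEvolutionMode", "addNewColumns")  -- add new columns to the schema as they appear in the data
);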


Using the SQL STREAM() function
Stream data from any Delta table

CREATE STREAMING LIVE TABLE mystream
AS SELECT *
FROM STREAM(my_table)

• STREAM(my_table) reads a stream of new records, instead of a snapshot
• The source of a streaming table must be append-only
• Any append-only Delta table can be read as a stream (i.e. from the LIVE schema, from the catalog, or just from a path)

Pitfall: my_table must be an append-only source. For example, it may not:
• be the target of APPLY CHANGES INTO
• define an aggregate function
• be a table on which you’ve executed DML to delete or update a row (see the GDPR section)

©2023 Databricks Inc. — All rights reserved 18
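STREAM() also composes with the LIVE virtual schema (covered under "Declare LIVE Dependencies" below), so one table in a pipeline can incrementally consume another. A minimal sketch with hypothetical orders_bronze/orders_clean tables:

-- Read only the rows appended to orders_bronze since the last update
CREATE OR REFRESH STREAMING LIVE TABLE orders_clean
AS SELECT order_id, customer_id, amount
FROM STREAM(LIVE.orders_bronze)
WHERE amount IS NOT NULL;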


How do I use DLT?

©2023 Databricks Inc. — All rights reserved 19


Creating Your First Live Table Pipeline
SQL to DLT in three easy steps…

1. Write CREATE LIVE TABLE definitions
• Table definitions are written (but not run) in notebooks
• Databricks Repos allow you to version control your table definitions

2. Create a pipeline
• A pipeline picks one or more notebooks of table definitions, as well as any configuration required

3. Click start
• DLT will create or update all of the tables in the pipeline

©2023 Databricks Inc. — All rights reserved 20


BEST PRACTICE

Development vs Production
Fast iteration or enterprise-grade reliability

Development Mode
• Reuses a long-running cluster for fast iteration
• No retries on errors, enabling faster debugging

Production Mode
• Cuts costs by turning off clusters as soon as they are done (within 5 minutes)
• Escalating retries, including cluster restarts, ensure reliability in the face of transient issues

Toggle between the modes in the Pipelines UI.

©2023 Databricks Inc. — All rights reserved 21


What if I have dependent tables?

©2023 Databricks Inc. — All rights reserved 22


Declare LIVE Dependencies
Using the LIVE virtual schema

CREATE LIVE TABLE events
AS SELECT … FROM prod.raw_data

CREATE LIVE TABLE report
AS SELECT … FROM LIVE.events

• Dependencies owned by other producers are just read from the catalog or a Spark data source as normal
• LIVE dependencies, from the same pipeline, are read from the LIVE schema
• DLT detects LIVE dependencies (events → report) and executes all operations in the correct order
• DLT handles parallelism and captures the lineage of the data

©2023 Databricks Inc. — All rights reserved 23
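Intermediate results that do not need to be published can be declared as temporary live views rather than tables; they take part in the same LIVE dependency graph but are not persisted as tables. A minimal sketch, reusing the hypothetical events table above (the event_type column is illustrative):

-- A temporary live view: resolved as a dependency, but not written out as a table
CREATE TEMPORARY LIVE VIEW events_cleaned
AS SELECT * FROM LIVE.events
WHERE event_type IS NOT NULL;

CREATE LIVE TABLE events_by_type
AS SELECT event_type, count(*) AS event_count
FROM LIVE.events_cleaned
GROUP BY event_type;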


How do I ensure Data Quality?

©2023 Databricks Inc. — All rights reserved 24


BEST PRACTICE

Ensure correctness with Expectations
Expectations are tests that ensure data quality in production

Expectations are true/false expressions that are used to validate each row during processing.

CONSTRAINT valid_timestamp
EXPECT (timestamp > '2012-01-01')
ON VIOLATION DROP ROW

@dlt.expect_or_drop(
  "valid_timestamp",
  col("timestamp") > '2012-01-01')

DLT offers flexible policies on how to handle records that violate expectations:
• Track the number of bad records
• Drop bad records
• Abort processing for a single bad record

©2023 Databricks Inc. — All rights reserved 25
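A minimal SQL sketch showing the three policies side by side on one hypothetical table; when no ON VIOLATION clause is given, violations are only tracked in the pipeline's data quality metrics:

CREATE OR REFRESH STREAMING LIVE TABLE orders_validated (
  -- Track only: violating rows are kept, and the violation count is recorded
  CONSTRAINT non_negative_amount EXPECT (amount >= 0),
  -- Drop: violating rows are removed from the target table
  CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01') ON VIOLATION DROP ROW,
  -- Fail: a single violating row aborts the update
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.orders_bronze);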


What about operations?

©2023 Databricks Inc. — All rights reserved 26


Pipelines UI
A one-stop shop for ETL debugging and operations

• Visualize data flows between tables
• Discover metadata and quality of each table
• Access to historical updates
• Control operations
• Dive deep into events

©2023 Databricks Inc. — All rights reserved 31


The Event Log
The event log automatically records all pipeline operations.

Operational Statistics
• Time and current status for all operations
• Pipeline and cluster configurations
• Row counts

Provenance
• Table schemas, definitions, and declared properties
• Table-level lineage
• Query plans used to update tables

Data Quality
• Expectation pass / failure / drop statistics
• Input/output rows that caused expectation failures

©2023 Databricks Inc. — All rights reserved 32
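The event log is itself stored as a Delta table, so it can be queried with SQL. A minimal sketch, assuming a pipeline whose storage location is the hypothetical path /pipelines/my_pipeline (Unity Catalog pipelines expose the log differently):

-- Most recent pipeline events, newest first
SELECT timestamp, event_type, level, message, details
FROM delta.`/pipelines/my_pipeline/system/events`
ORDER BY timestamp DESC
LIMIT 100;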


How can I use parameters?

©2022 Databricks Inc. — All rights reserved 33


Modularize your code with configuration
Avoid hard-coding paths, topic names, and other constants in your code.

A pipeline’s configuration is a map of key-value pairs that can be used to parameterize your code:
• Improve code readability and maintainability
• Reuse code in multiple pipelines for different data

CREATE STREAMING LIVE TABLE data AS
SELECT * FROM cloud_files("${my_etl.input_path}", "json")

import dlt

@dlt.table
def data():
    input_path = spark.conf.get("my_etl.input_path")
    return spark.readStream.format("cloud_files").load(input_path)

©2023 Databricks Inc. — All rights reserved 34


How can I do change data capture (CDC)?

©2023 Databricks Inc. — All rights reserved 35


APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere

APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

Diagram: a stream of {UPDATE}, {DELETE}, and {INSERT} change events is applied to maintain an up-to-date snapshot.

©2023 Databricks Inc. — All rights reserved 36


APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere

APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

• APPLY CHANGES INTO LIVE.cities: a target for the changes to be applied to
• FROM STREAM(LIVE.city_updates): a source of changes; currently this has to be a stream
• KEYS (id): a unique key that can be used to identify a given row
• SEQUENCE BY ts: a sequence that can be used to order changes, such as a log sequence number (LSN), a timestamp, or the ingestion time

Example: city_updates receives {"id": 1, "ts": 100, "city": "Bekerly, CA"} and later {"id": 1, "ts": 200, "city": "Berkeley, CA"}; the cities table ends up with a single row, id = 1, city = "Berkeley, CA".

©2023 Databricks Inc. — All rights reserved 41
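In a pipeline, the target of APPLY CHANGES INTO must itself be declared as a streaming live table before changes can be applied to it. A minimal sketch of the cities example, assuming city_updates is defined elsewhere in the same pipeline:

-- Declare the target that will hold the up-to-date snapshot
CREATE OR REFRESH STREAMING LIVE TABLE cities;

-- Apply inserts, updates, and deletes keyed by id, ordered by ts
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts;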


REFERENCE ARCHITECTURE

Change Data Capture (CDC) from RDBMS
A variety of 3rd-party tools can provide a streaming change feed

• Amazon RDS → DMS to S3 → cloud_files → APPLY CHANGES INTO → replicated_table
• MySQL or Postgres → Debezium → APPLY CHANGES INTO → replicated_table
• Oracle → Golden Gate → APPLY CHANGES INTO → replicated_table

©2023 Databricks Inc. — All rights reserved
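A minimal sketch of the first row of this diagram, assuming AWS DMS lands change records as CSV files under a hypothetical S3 prefix, that each record carries the usual DMS Op flag (I/U/D), and that a ts column is available for ordering; all names here are illustrative:

-- Bronze: land the raw DMS change files with Auto Loader
CREATE OR REFRESH STREAMING LIVE TABLE customer_changes_bronze
AS SELECT * FROM cloud_files("s3://my-bucket/dms/customers/", "csv");

-- Replicated table kept in sync from the change feed
CREATE OR REFRESH STREAMING LIVE TABLE customers_replicated;

APPLY CHANGES INTO LIVE.customers_replicated
FROM STREAM(LIVE.customer_changes_bronze)
KEYS (id)
APPLY AS DELETE WHEN Op = 'D'   -- rows flagged as deletes in the source are deleted in the replica
SEQUENCE BY ts;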


What do I no longer need to manage with DLT?

©2023 Databricks Inc. — All rights reserved 43


Automated Data Management
DLT automatically optimizes data for performance & ease-of-use

Best Practices
• What: DLT encodes Delta best practices automatically when creating DLT tables.
• How: DLT sets the following properties: optimizeWrite, autoCompact, tuneFileSizesForRewrites.

Physical Data
• What: DLT automatically manages your physical data to minimize cost and optimize performance.
• How: runs vacuum daily; runs optimize daily. You can still tell DLT how you want the data organized (i.e. ZORDER).

Schema Evolution
• What: Schema evolution is handled for you.
• How: Modifying a live table transformation to add/remove/rename a column will automatically do the right thing. When removing a column in a streaming live table, old values are preserved.

©2023 Databricks Inc. — All rights reserved 44
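Layout hints such as Z-ordering are declared as table properties on the live table itself. A minimal sketch with a hypothetical gold table; pipelines.autoOptimize.zOrderCols is the DLT table property that controls Z-order columns:

CREATE OR REFRESH LIVE TABLE sales_gold
TBLPROPERTIES ("pipelines.autoOptimize.zOrderCols" = "order_date,region")
AS SELECT order_date, region, sum(amount) AS revenue
FROM LIVE.sales_silver
GROUP BY order_date, region;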


DE 4.1 - Using the Delta Live Tables UI
Deploy a DLT pipeline
Explore the resultant DAG
Execute an update of the pipeline

©2023 Databricks Inc. — All rights reserved 45


DE 4.1.1 - Fundamentals of DLT Syntax
Declaring Delta Live Tables
Ingesting data with Auto Loader
Using parameters in DLT Pipelines
Enforcing data quality with constraints
Adding comments to tables
Describing differences in syntax and execution of live tables and streaming live tables
©2023 Databricks Inc. — All rights reserved 46
DE 4.1.2 - More DLT SQL Syntax
Processing CDC data with APPLY CHANGES INTO
Declaring live views
Joining live tables
Describing how DLT library notebooks work together in a pipeline
Scheduling multiple notebooks in a DLT pipeline

©2023 Databricks Inc. — All rights reserved 47


DE 4.2 - Delta Live Tables: Python vs SQL
Identify key differences between the Python and SQL implementations of Delta Live Tables

©2023 Databricks Inc. — All rights reserved 48


DE 4.3 - Exploring the Results of a DLT Pipeline

©2023 Databricks Inc. — All rights reserved 49


DE 4.4 - Exploring the Pipeline Event Logs

©2023 Databricks Inc. — All rights reserved 50


DE 4.1.3 - Troubleshooting DLT Syntax Lab
Identifying and troubleshooting DLT syntax
Iteratively developing DLT pipelines with notebooks

©2023 Databricks Inc. — All rights reserved 51


©2023 Databricks Inc. — All rights reserved 52
