
Data Engineering with Databricks

©2022 Databricks Inc. — All rights reserved 1


Course Objectives

• Leverage the Databricks Lakehouse Platform to perform core responsibilities for data pipeline development
• Use SQL and Python to write production data pipelines to extract, transform, and load data into tables and views in the lakehouse
• Simplify data ingestion and incremental change propagation using Databricks-native features and syntax
• Orchestrate production pipelines to deliver fresh results for ad hoc analytics and dashboarding



Course Agenda
• Module 1: Databricks Workspace and Services
• Module 2: Delta Lake
• Module 3: Relational Entities on Databricks
• Module 4: ETL With Spark SQL
• Module 5 (Optional): Python for Spark SQL
• Module 6: Incremental Data Processing
• Module 7: Multi-Hop Architecture
• Module 8: Delta Live Tables
• Module 9: Task Orchestration with Jobs
• Module 10: Running a DBSQL Query
• Module 11: Managing Permissions
• Module 12: Productionalizing Dashboards and Queries in DBSQL
The Databricks Lakehouse Platform


Using the Databricks Lakehouse Platform
Module 1 Learning Objectives

• Describe the components of the Databricks Lakehouse
• Complete basic code development tasks using services of the Databricks Data Science and Engineering Workspace
• Perform common table operations using Delta Lake in the Lakehouse



Using the Databricks Lakehouse Platform
Module 1 Agenda

• Introduction to the Databricks Lakehouse Platform
• Introduction to the Databricks Workspace and Services
  • Using clusters, files, notebooks, and repos
• Introduction to Delta Lake
  • Manipulating and optimizing data in Delta tables
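The table manipulations covered in this module can be previewed with a short Spark SQL sketch. The table and column names below are hypothetical, chosen only for illustration:

```sql
-- Assumes an existing Delta table `students` with columns `id`, `name`, `value`.
UPDATE students SET value = value + 1 WHERE name LIKE 'T%';  -- modify rows in place
DELETE FROM students WHERE value > 6;                        -- delete rows in place
OPTIMIZE students ZORDER BY (id);                            -- compact small files, co-locate by id
DESCRIBE HISTORY students;                                   -- inspect the transaction log
VACUUM students;                                             -- remove data files no longer referenced
```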



Lakehouse
One simple platform to unify all of your data, analytics, and AI workloads

7,000+ customers across the globe

Original creators of:


Supporting enterprises in every industry
• Healthcare & Life Sciences
• Manufacturing & Automotive
• Media & Entertainment
• Financial Services
• Public Sector
• Retail & CPG
• Energy & Utilities
• Digital Native


Most enterprises struggle with data
Data Warehousing | Data Engineering | Streaming | Data Science and ML

Siloed stacks increase data architecture complexity

[Diagram: four siloed stacks — Data Warehousing (analytics and BI, data marts, data warehouse, structured data); Data Engineering (extract, transform, load into a data lake; structured, semi-structured, and unstructured data); Streaming (real-time database, streaming data engine, streaming data sources); Data Science and ML (data prep, data lake, data science and machine learning; structured, semi-structured, and unstructured data)]
Most enterprises struggle with data
Data Warehousing | Data Engineering | Streaming | Data Science and ML

Disconnected systems and proprietary data formats make integration difficult

Siloed stacks increase data architecture complexity

[Diagram: the same four siloed stacks, each backed by its own tooling — Amazon Redshift, Teradata, Azure Synapse, Google BigQuery, Snowflake, IBM Db2, SAP, Oracle Autonomous Data Warehouse, and Tibco Spotfire for warehousing and BI; Hadoop, Apache Airflow, Amazon EMR, Apache Spark, Google Dataproc, and Cloudera for data engineering; Apache Kafka, Apache Flink, Amazon Kinesis, Azure Stream Analytics, Google Dataflow, and Confluent for streaming; Apache Spark, Jupyter, Amazon SageMaker, Azure ML Studio, MATLAB, Domino Data Labs, SAS, TensorFlow, and PyTorch for data science and ML]
Most enterprises struggle with data
Data Warehousing | Data Engineering | Streaming | Data Science and ML

Siloed data teams decrease productivity

Disconnected systems and proprietary data formats make integration difficult

Siloed stacks increase data architecture complexity

[Diagram: the same four siloed stacks and tools as the previous slide, now labeled with the teams that own them — Data Analysts, Data Engineers, Data Engineers, Data Scientists]
Lakehouse
One platform to unify all of your data, analytics, and AI workloads

Data Lake + Data Warehouse


An open approach to bringing data management and governance to data lakes

• Better reliability with transactions
• 48x faster data processing with indexing
• Data governance at scale with fine-grained access control lists

Data Lake + Data Warehouse


The Databricks Lakehouse Platform

✓ Simple  ✓ Open  ✓ Collaborative

Databricks Lakehouse Platform
• Data Engineering | BI and SQL Analytics | Data Science and ML | Real-Time Data Applications
• Data Management and Governance
• Open Data Lake
• Platform Security & Administration
• Unstructured, semi-structured, structured, and streaming data


The Databricks Lakehouse Platform

✓ Simple
Unify your data, analytics, and AI on one common platform for all data use cases

[Diagram: the Databricks Lakehouse Platform stack from the previous slide]


The Databricks Lakehouse Platform

✓ Open
Unify your data ecosystem with open source standards and formats.

Built on the innovation of some of the most successful open source data projects in the world (30 million+ monthly downloads).


The Databricks Lakehouse Platform

✓ Open
Unify your data ecosystem with open source standards and formats.

450+ partners across the data landscape:
• Visual ETL & Data Ingestion (e.g. Azure Data Factory)
• Business Intelligence (e.g. Azure Synapse, Google BigQuery, Amazon Redshift)
• Machine Learning (e.g. Amazon SageMaker, Azure Machine Learning, Google AI Platform)
• Centralized Governance (e.g. AWS Glue)
• Data Providers
• Top Consulting & SI Partners


The Databricks Lakehouse Platform

✓ Collaborative
Unify your data teams to collaborate across the entire data and AI workflow

[Diagram: Data Analysts, Data Engineers, and Data Scientists collaborating on shared datasets, notebooks, dashboards, and models]


Databricks Architecture and Services


Databricks Architecture

Control Plane (Databricks cloud account)
• Web Application
• Repos / Notebooks
• Job Scheduling
• Cluster Management

Data Plane (customer cloud account)
• Data processing with Apache Spark clusters
• Databricks File System (DBFS)
• Connections to data sources


Databricks Services

Control Plane in Databricks
Manages customer accounts, datasets, and clusters

Databricks Web Application | Repos / Notebooks | Jobs | Cluster Management


Clusters

[Diagram: the Databricks architecture from the previous section, highlighting the Data Plane, where data processing runs on Apache Spark clusters in the customer cloud account]


Clusters
Overview

• Clusters are made up of one or more virtual machine (VM) instances
• The driver coordinates the activities of the executors
• Executors run the tasks composing a Spark job

[Diagram: a cluster with one driver and multiple executors, each executor holding cores, memory, and local storage]


Clusters
Types

All-purpose clusters
• Analyze data collaboratively using interactive notebooks
• Create clusters from the Workspace or API
• Retains up to 70 clusters for up to 30 days

Job clusters
• Run automated jobs
• The Databricks job scheduler creates job clusters when running jobs
• Retains up to 30 clusters


Git Versioning with Databricks Repos


Databricks Repos
Overview

Git Versioning
• Native integration with GitHub, GitLab, Bitbucket, and Azure DevOps
• UI-based workflows

CI/CD Integration
• API surface to integrate with automation
• Simplifies the dev/staging/prod multi-workspace story

Enterprise ready
• Allow lists to avoid exfiltration
• Secret detection to avoid leaking keys


Databricks Repos
CI/CD Integration

[Diagram: the Repos service sits in the Databricks Control Plane alongside the Web Application, Jobs, and Cluster Management, and connects to external Git and CI/CD systems for version, review, and test workflows]


Databricks Repos
Best practices for CI/CD workflows

Admin workflow (in Databricks)
1. Set up top-level Repos folders (example: Production)
2. Set up Git automation to update Repos on merge

User workflow (in Databricks)
1. Clone the remote repository to a user folder
2. Create a new branch based on the main branch
3. Create and edit code
4. Commit and push to a feature branch

Merge workflow (in the Git provider)
1. Pull request and review process
2. Merge into the main branch
3. Git automation calls the Databricks Repos API

Production job workflow (in Databricks)
1. An API call brings the Repo in the Production folder to the latest version
2. Run a Databricks job based on the Repo in the Production folder


What is Delta Lake?


Delta Lake is an open-source project that enables building a data lakehouse on top of existing storage systems


Delta Lake Is Not…

• Proprietary technology
• Storage format
• Storage medium
• Database service or data warehouse



Delta Lake Is…

• Open source
• Builds upon standard data formats
• Optimized for cloud object storage
• Built for scalable metadata handling



Delta Lake brings ACID to object storage

▪ Atomicity
▪ Consistency
▪ Isolation
▪ Durability



Problems solved by ACID
1. Appending data is hard
2. Modifying existing data is difficult
3. Jobs fail midway
4. Real-time operations are hard
5. Keeping historical data versions is costly
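With ACID transactions on a Delta table, these operations become single atomic statements. A brief Spark SQL sketch (the `events` and `updates` table names are hypothetical):

```sql
-- Upsert changes atomically (problem 2): readers never observe a partial write.
MERGE INTO events AS t
USING updates AS s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Query an earlier version (problem 5): time travel via the transaction log.
SELECT * FROM events VERSION AS OF 3;
```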



Delta Lake is the default for all
tables created in Databricks
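Because Delta Lake is the default, no special syntax is required; a plain CREATE TABLE produces a Delta table. A sketch with illustrative names:

```sql
-- Creates a Delta table by default on Databricks.
CREATE TABLE students (id INT, name STRING, value DOUBLE);

-- Equivalent explicit form:
CREATE TABLE students_explicit (id INT, name STRING, value DOUBLE) USING DELTA;
```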



ETL with Spark SQL and Python


ETL With Spark SQL and Python
Module 2 Learning Objectives

• Leverage Spark SQL DDL to create and manipulate relational entities on Databricks
• Use Spark SQL to extract, transform, and load data to support production workloads and analytics in the Lakehouse
• Leverage Python for advanced code functionality needed in production applications


ETL With Spark SQL and Python
Module 2 Agenda

• Working with Relational Entities on Databricks
  • Managing databases, tables, and views
• ETL with Spark SQL
  • Extracting data from external sources, loading and updating data in the lakehouse, and common transformations
• Just Enough Python for Spark SQL
  • Building extensible functions with Python-wrapped SQL
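As a preview of the extract-and-load pattern, external files can be queried directly and then materialized as Delta tables with Spark SQL. The path and table names below are hypothetical:

```sql
-- Query raw JSON files in place, without registering a table first.
SELECT * FROM json.`/mnt/raw/events/`;

-- Materialize a Delta table from the external source, stamping ingestion time.
CREATE OR REPLACE TABLE events AS
SELECT *, current_timestamp() AS ingest_time
FROM json.`/mnt/raw/events/`;
```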



Incremental Data and Delta Live Tables


Incremental Data and Delta Live Tables
Module 3 Learning Objectives

• Incrementally process data to power analytic insights with Spark Structured Streaming and Auto Loader
• Propagate new data through multiple tables in the data lakehouse
• Leverage Delta Live Tables to simplify productionalizing SQL data pipelines with Databricks


Incremental Data and Delta Live Tables
Module 3 Agenda

• Incremental Data Processing with Structured Streaming and Auto Loader
  • Processing and aggregating data incrementally in near real time
• Multi-hop in the Lakehouse
  • Propagating changes through a series of tables to drive production systems
• Using Delta Live Tables
  • Simplifying deployment of production pipelines and infrastructure using SQL
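In the Delta Live Tables SQL dialect, Auto Loader ingestion reduces to a single declarative statement. A sketch assuming a hypothetical source path and table name:

```sql
-- Incrementally ingest newly arriving JSON files with Auto Loader (cloud_files).
CREATE OR REFRESH STREAMING LIVE TABLE raw_events
AS SELECT * FROM cloud_files("/mnt/raw/events/", "json");
```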



Multi-hop Architecture


Multi-Hop in the Lakehouse

CSV / JSON / TXT → Databricks Auto Loader → Bronze → Silver → Gold → streaming analytics, AI, and reporting

Data quality increases as data moves from Bronze to Gold


Multi-Hop in the Lakehouse
Bronze Layer

• Typically just a raw copy of ingested data
• Replaces the traditional data lake
• Provides efficient storage and querying of the full, unprocessed history of data


Multi-Hop in the Lakehouse
Silver Layer

• Reduces data storage complexity, latency, and redundancy
• Optimizes ETL throughput and analytic query performance
• Preserves the grain of the original data (without aggregations)
• Eliminates duplicate records
• Production schema enforced
• Data quality checks; corrupt data quarantined


Multi-Hop in the Lakehouse
Gold Layer

• Powers ML applications, reporting, dashboards, and ad hoc analytics
• Refined views of data, typically with aggregations
• Reduces strain on production systems
• Optimizes query performance for business-critical data


Introducing Delta Live Tables


Multi-Hop in the Lakehouse

CSV / JSON / TXT → Databricks Auto Loader → Bronze (raw ingestion and history) → Silver (filtered, cleaned, augmented) → Gold (business-level aggregates) → streaming analytics, AI, and reporting

Data quality increases as data moves from Bronze to Gold


The Reality is Not so Simple

[Diagram: in practice, many interdependent Bronze, Silver, and Gold tables form a complex web of pipelines]


Large scale ETL is complex and brittle

Complex pipeline development
• Hard to build and maintain table dependencies
• Difficult to switch between batch and stream processing

Data quality and governance
• Difficult to monitor and enforce data quality
• Impossible to trace data lineage

Difficult pipeline operations
• Poor observability at the granular, data level
• Error handling and recovery is laborious


Introducing Delta Live Tables
Make reliable ETL easy on Delta Lake

Operate with agility
• Declarative tools to build batch and streaming data pipelines

Trust your data
• DLT has built-in declarative quality controls
• Declare quality expectations and actions to take

Scale with reliability
• Easily scale infrastructure alongside your data


Managing Data Access and Production Pipelines


Managing Data Access and Production Pipelines
Module 4 Learning Objectives

• Orchestrate tasks with Databricks Jobs
• Use Databricks SQL for on-demand queries
• Configure Databricks Access Control Lists to provide groups with secure access to production and development databases
• Configure and schedule dashboards and alerts to reflect updates to production data pipelines


Managing Data Access and Production Pipelines
Module 4 Agenda

• Task Orchestration with Databricks Jobs
  • Scheduling notebooks and DLT pipelines with dependencies
• Running Your First Databricks SQL Query
  • Navigating, configuring, and executing queries in Databricks SQL
• Managing Permissions in the Lakehouse
  • Configuring permissions for databases, tables, and views in the data lakehouse
• Productionalizing Dashboards and Queries in DBSQL
  • Scheduling queries, dashboards, and alerts for end-to-end analytic pipelines


Introducing Unity Catalog


Data Governance Overview
Four key functional areas

Data Access Control
• Control who has access to which data

Data Access Audit
• Capture and record all access to data

Data Lineage
• Capture upstream sources and downstream consumers

Data Discovery
• Ability to search for and discover authorized assets
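Data access control in the lakehouse is expressed with standard SQL grants. A sketch assuming a hypothetical database and group name:

```sql
-- Give the analysts group read access to a database, then verify it.
GRANT USAGE, SELECT ON DATABASE sales_db TO `analysts`;
SHOW GRANTS ON DATABASE sales_db;

-- Revoke when no longer needed.
REVOKE SELECT ON DATABASE sales_db FROM `analysts`;
```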



Data Governance Overview
Challenges

[Diagram: structured, semi-structured, unstructured, and streaming data spread across multiple clouds, consumed by data analysts, data scientists, data engineers, and machine learning workloads]


Databricks Unity Catalog
Overview

Unify governance across clouds
• Fine-grained governance for data lakes across clouds, based on open standard ANSI SQL

Unify data and AI assets
• Centrally share, audit, secure, and manage all data types with one simple interface

Unify existing catalogs
• Works in concert with existing data, storage, and catalogs; no hard migration required


Databricks Unity Catalog
Three-layer namespace

Traditional two-layer namespace:
SELECT * FROM schema.table

Three-layer namespace with Unity Catalog:
SELECT * FROM catalog.schema.table
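Concretely, existing two-level queries keep working inside a default catalog, while the third level lets one workspace address many catalogs. A sketch with hypothetical catalog and schema names:

```sql
USE CATALOG main;                   -- set the default catalog
SELECT * FROM sales.transactions;   -- resolves to main.sales.transactions

-- Fully qualified references can cross catalogs in a single query:
SELECT * FROM dev_catalog.sales.transactions
UNION ALL
SELECT * FROM prod_catalog.sales.transactions;
```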



Databricks Unity Catalog
Security Model

Traditional query lifecycle (workspace-bound, Hive metastore and table ACLs):
1. User submits a query (SELECT * FROM table) to a cluster or SQL endpoint
2. The cluster checks grants against the table ACL
3. The cluster looks up the table location in the Hive metastore
4. The metastore returns the path to the table
5. The cluster reads cloud storage using cloud-specific credentials
6. The cluster filters unauthorized data before returning results


Databricks Unity Catalog
Security Model

Query lifecycle with Unity Catalog:
1. User submits a query (SELECT * FROM table) to a cluster or SQL endpoint
2. Unity Catalog checks the namespace and Delta metadata
3. Unity Catalog writes the access to the audit log
4. Unity Catalog checks grants
5. Unity Catalog acquires cloud-specific credentials
6. Unity Catalog returns URLs and short-lived tokens to the cluster
7. The cluster ingests data from the array of URLs
8. The cluster filters unauthorized data before returning results


Course Recap


Course Objectives

• Leverage the Databricks Lakehouse Platform to perform core responsibilities for data pipeline development
• Use SQL and Python to write production data pipelines to extract, transform, and load data into tables and views in the lakehouse
• Simplify data ingestion and incremental change propagation using Databricks-native features and syntax
• Orchestrate production pipelines to deliver fresh results for ad hoc analytics and dashboarding

