Data Engineering With Databricks
7000+
Lakehouse
across the globe
Public Sector, Retail & CPG, Energy & Utilities, Digital Native
Data sources: structured data; structured, semi-structured, and unstructured data; streaming data sources
©2021 Databricks Inc. — All rights reserved 9
Most enterprises struggle with data
Data Warehousing, Data Engineering, Streaming, Data Science and ML
Siloed data teams decrease productivity
Lakehouse
One platform to unify all of your data, analytics, and AI workloads
Data Lake, Data Warehouse
✓ Collaborative
✓ Simple
Data Engineering
BI and SQL Analytics
Data Science and ML
Real-Time Data Applications
✓ Open
Unify your data ecosystem with open source standards and formats.
30 Million+ monthly downloads
✓ Open
Azure Data Factory, Synapse
Google BigQuery
Amazon Redshift
Data Providers
450+
Centralized Governance
AWS Glue
✓ Collaborative
Repos / Notebooks
Job Scheduling
Cluster Management
Databricks File System (DBFS)
Data Sources
instances
Driver coordinates activities of executors
Executors run tasks composing a Spark job
Diagram labels: Driver, Executor, Core, Memory, Local Storage
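The driver/executor split described above can be mimicked in plain Python. This is only a toy analogy, not the Spark API: a thread pool stands in for executor cores, and the names `run_task` and `run_job` are hypothetical, chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """Executor role: process one partition (here, count its words)."""
    return sum(len(line.split()) for line in partition)


def run_job(lines, num_partitions=4, executor_cores=2):
    """Driver role: split the job into tasks, schedule them, combine results."""
    # One task per partition; partitions may be empty for small inputs.
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    # The pool plays the part of the executor cores (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=executor_cores) as pool:
        partial_counts = list(pool.map(run_task, partitions))
    return sum(partial_counts)
```

Calling `run_job(["a b c", "d e", "f"])` splits three lines into four partitions, runs one task per partition, and sums the partial word counts on the "driver" side.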
CI/CD
Repos Service
Commit and push to feature branch
Legend: Steps in Databricks; Steps in your Git provider
Delta Lake is not:
• Proprietary technology
• Storage format
• Storage medium
• Database service or data warehouse
Delta Lake is:
• Open source
• Builds upon standard data formats
• Optimized for cloud object storage
• Built for scalable metadata handling
▪ Atomicity
▪ Consistency
▪ Isolation
▪ Durability
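The four properties listed above are the ACID transaction guarantees. As a minimal sketch of atomicity only, the example below uses Python's built-in `sqlite3` module, chosen because it ships with the standard library and not because the slides use it. The `accounts` table and the `transfer` and `balances` helpers are hypothetical.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def balances(conn):
    """Return {account name: balance} for inspection."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()
```

A transfer that would overdraw the source account raises inside the transaction, so the debit that already ran is rolled back and the table is left exactly as it was.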
Diagram labels: CSV, JSON, TXT; Data quality; Streaming analytics; AI and reporting
Difficult to switch between batch and stream processing
Impossible to trace data lineage
Error handling and recovery is laborious
Control who has access to which data
Capture and record all access to data
Capture upstream sources and downstream consumers
Ability to search for and discover authorized assets
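The first two capabilities above, access control and an access record, can be sketched together in a few lines. This is a toy in-memory model under stated assumptions, not any Databricks API: the `Catalog` class, its methods, and the `"SELECT"` privilege string are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    # (principal, table) -> set of granted privileges
    grants: dict = field(default_factory=dict)
    # one entry per access attempt: (principal, privilege, table, allowed)
    audit_log: list = field(default_factory=list)

    def grant(self, principal, privilege, table):
        """Control who has access to which data."""
        self.grants.setdefault((principal, table), set()).add(privilege)

    def read(self, principal, table):
        """Check the grant, and record the attempt whether or not it succeeds."""
        allowed = "SELECT" in self.grants.get((principal, table), set())
        self.audit_log.append((principal, "SELECT", table, allowed))
        if not allowed:
            raise PermissionError(f"{principal} may not SELECT from {table}")
        return f"rows of {table}"
```

Logging before the permission check raises means denied attempts also land in the audit trail, which is usually what an auditor wants to see.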
Cloud 1
Cloud 3
Streaming Machine Learning
Unify governance across clouds
Fine-grained governance for data lakes across clouds, based on open standard ANSI SQL.
Unify data and AI assets
Centrally share, audit, secure and manage all data types with one simple interface.
Unify existing catalogs
Works in concert with existing data, storage, and catalogs; no hard migration required.
5. Cloud-specific credentials
Cloud Storage