Databricks Overview
A brief introduction to Databricks
Contents
1. Overview of Databricks
5. Conclusion (TODO)
What is Databricks?
Databricks, Inc.
A global data, analytics, and artificial intelligence
company founded by the original
creators of Apache Spark.
Databricks
Databricks is a unified big data intelligence
platform that integrates with cloud storage and
security and is built on a lakehouse architecture.
Unless otherwise specified, "Databricks" here refers to the Databricks big data platform, not the company.
What is ADB?
ADB is short for Azure Databricks, the Databricks platform running on Azure.
Architecture of Azure Databricks
Databricks is not a single piece of software or a standalone platform; it covers several stages of the data pipeline:
Data Ingestion
Data Storage
ETL (Data Cleaning)
Data Analysis
Data Visualization
Azure Databricks Components
Azure Databricks is a unified big data intelligence platform that integrates with, and builds on,
other tools.
Detail of Each Component
Data Storage
Data can be converted into a Delta table, which uses
columnar storage to reduce space and improve query efficiency, and
supports features such as ACID transactions and time travel.
Seamless Integration
Enhanced Collaboration
Supports multiple languages:
Python, R, SQL, Scala, and Markdown
DAG
Directed Acyclic Graph
Tasks:
• interactive code (py)
• non-interactive code (.py)
• Delta Live Tables
NOTE: A DAG allows tasks at the same level (tasks whose dependencies have all finished) to run concurrently.
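The level-by-level execution described in the note can be sketched in plain Python. This is only an illustration of the idea, not how Databricks schedules jobs: the task names, the `WORKFLOW` dictionary, and the `run_task` stand-in are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "clean_orders": {"ingest_orders"},
    "clean_customers": {"ingest_customers"},
    "build_report": {"clean_orders", "clean_customers"},
}


def run_task(name):
    """Stand-in for real work (a notebook, a script, or a pipeline)."""
    return f"{name}: done"


def run_workflow(dag):
    """Run the DAG level by level; tasks within one level run concurrently."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    levels, results = [], []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            # Every task whose dependencies have all completed.
            ready = sorted(sorter.get_ready())
            levels.append(ready)
            # Tasks of the same level are submitted to the pool together.
            results.extend(pool.map(run_task, ready))
            sorter.done(*ready)
    return levels, results


levels, results = run_workflow(WORKFLOW)
```

Here `levels` groups the tasks into three waves: the two ingestion tasks, then the two cleaning tasks, then the report. A task never starts before everything it depends on has finished.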
Detail of Each Component
Delta Live Tables
Streaming Tables
A Delta table with one or more streams writing to it.
Used for:
• Ingestion
• Low-latency transformations
• Very large scale
Materialized View
The result of a query, stored in a Delta table.
Used for:
• Transforming data
• Building aggregate tables
• Speeding up BI queries and reports
NOTE: A Delta Live Tables pipeline can also run as a task orchestrated in a workflow.
Columnar Storage
Processing
• Data Reading: read the data and load it into memory;
• Column Parsing: parse the columns and identify each column's data type;
• Columnar Organization: rearrange and store the data in a columnar layout;
• Compression: reduce storage space and improve read efficiency;
• Metadata: generate metadata for query optimization and data management.
delta table
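The processing steps above can be illustrated with a small, self-contained Python sketch. It is a deliberate simplification: real columnar formats use more elaborate encodings and file layouts, and the sample rows, function names, and the choice of run-length encoding are assumptions made for illustration only.

```python
from itertools import groupby

# Hypothetical input rows, as they might arrive from a CSV file (all strings).
ROWS = [
    {"country": "DE", "amount": "10"},
    {"country": "DE", "amount": "25"},
    {"country": "FR", "amount": "7"},
]


def parse_value(text):
    """Column parsing: treat digit-only strings as int, everything else as str."""
    return int(text) if text.isdigit() else text


def to_columns(rows):
    """Columnar organization: one list of values per column."""
    return {name: [parse_value(row[name]) for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Compression: collapse runs of equal values into [value, count] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]


def column_stats(values):
    """Metadata: per-column statistics a reader could use to skip data."""
    return {"min": min(values), "max": max(values), "count": len(values)}


columns = to_columns(ROWS)
encoded = {name: run_length_encode(values) for name, values in columns.items()}
metadata = {name: column_stats(values) for name, values in columns.items()}
```

Because values of the same column sit next to each other, repeated values such as the country code compress well, and a query that only needs `amount` never has to touch `country`.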
Catalyst
• Parser: parses the textual query and converts it into an AST (Abstract Syntax Tree);
• Analyzer: checks the semantics of the parsed query, for example by resolving table and column names, to ensure its correctness;
• Logical Optimizer: improves the structure of the logical plan with rules such as predicate pushdown and column pruning;
• Physical Planner: transforms the logical plan into a physical plan;
• Code Generation: compiles parts of the query to bytecode for better performance.
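To make the logical optimization step more concrete, here is a toy sketch of predicate pushdown and column pruning on a tiny hand-written plan. It is not Catalyst's actual API or implementation: the `Scan`, `Filter`, and `Project` classes and the two rule functions are hypothetical, and the plan corresponds to a query like `SELECT name FROM sales WHERE amount > 100`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    columns: tuple           # columns the scan returns
    pushed_filter: str = ""  # predicate evaluated by the data source, if any


@dataclass(frozen=True)
class Filter:
    predicate: str
    child: object


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


def push_down_filter(plan):
    """Predicate pushdown: fold a Filter sitting directly on a Scan into the Scan."""
    if isinstance(plan, Project):
        return Project(plan.columns, push_down_filter(plan.child))
    if (isinstance(plan, Filter) and isinstance(plan.child, Scan)
            and not plan.child.pushed_filter):
        scan = plan.child
        return Scan(scan.table, scan.columns, plan.predicate)
    return plan


def prune_columns(plan):
    """Column pruning: a Project directly on a Scan narrows what the Scan returns."""
    if isinstance(plan, Project) and isinstance(plan.child, Scan):
        scan = plan.child
        return Project(plan.columns, Scan(scan.table, plan.columns, scan.pushed_filter))
    return plan


def optimize(plan):
    """Apply the rewrite rules in a fixed order."""
    for rule in (push_down_filter, prune_columns):
        plan = rule(plan)
    return plan


# Unoptimized plan: read every column, filter afterwards, then project.
plan = Project(
    ("name",),
    Filter("amount > 100", Scan("sales", ("id", "name", "amount", "region"))),
)
optimized = optimize(plan)
```

After optimization the scan carries the predicate itself and returns only `name`, so less data leaves the storage layer. A real optimizer applies many more rules and repeats them until the plan stops changing.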
To be continued...