DataBricks Overview
databricks overview by zack
Uploaded by zackary0226

Enter the World of Databricks
A brief introduction to Databricks...
Contents
1. Overview of Databricks

2. Architecture and Components

3. Use Cases with Databricks (todo)

4. Conclusion (todo)
What is Databricks?

 Databricks, Inc.
A global data, analytics, and artificial intelligence company founded by the original creators of Apache Spark.

 Databricks
Databricks is a unified big data intelligence platform that integrates cloud storage and security with a lakehouse architecture.

Unless specified otherwise, "Databricks" in this deck refers to the Databricks big data platform.
What is ADB?
ADB is short for the Azure Databricks platform.
Architecture of Azure Databricks
Databricks is not a single piece of software or an independent platform; it covers the full data pipeline:

 Data Ingestion

 Data Storage

 ETL (Data Cleaning)

 Data Analysis

 Data Visualization
Azure Databricks Components
Azure Databricks is a unified big data intelligence platform that integrates with and builds on the power of other tools.
Detail of Each Component

Data Storage
Data can be converted into a Delta table, taking advantage of columnar storage to reduce space, improve query efficiency, and support features such as ACID transactions and time travel.

Workspace
A centralized environment for organizing and accessing all components and data, including files such as Python scripts.
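As a sketch of how a Delta table and its time-travel feature might look in PySpark (this assumes an ambient `spark` session as on a Databricks cluster with Delta Lake configured; the paths, table, and column names are hypothetical):

```python
# Sketch only: requires a Databricks/Spark environment with Delta Lake.
df = spark.read.json("/data/raw/events/")       # load raw data
df.write.format("delta").saveAsTable("events")  # store as a Delta table

# ACID update, then time travel back to the original version:
spark.sql("UPDATE events SET status = 'done' WHERE id = 1")
v0 = spark.read.option("versionAsOf", 0).table("events")  # read version 0
spark.sql("DESCRIBE HISTORY events").show()     # audit the table's history
```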
Detail of Each Component

Notebooks
 Seamless integration
 Enhanced collaboration
 Support multiple languages: Python, R, SQL, Scala, Markdown
 Work closely with Delta tables
 Spark SQL and Spark DataFrames for data cleaning/ETL
 Written in Jupyter-style notebooks
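A minimal cleaning/ETL sketch as it might appear in a notebook cell, assuming an ambient `spark` session (as on Databricks); the table and column names are made up for illustration:

```python
# Sketch only: requires a Databricks/Spark environment.
from pyspark.sql import functions as F

raw = spark.read.table("raw_orders")                    # hypothetical source
clean = (raw
    .dropDuplicates(["order_id"])                       # remove duplicate rows
    .filter(F.col("amount") > 0)                        # drop invalid amounts
    .withColumn("order_date", F.to_date("order_ts")))   # normalize types
clean.write.format("delta").mode("overwrite").saveAsTable("orders_clean")

# The same cleaning can also be expressed in Spark SQL:
spark.sql("SELECT DISTINCT order_id, amount FROM raw_orders WHERE amount > 0")
```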
Detail of Each Component

DAG (Directed Acyclic Graph)
Executes tasks in dependency order, never revisiting a task (no cycles).

Tasks can be:
• Interactive code (notebooks)
• Non-interactive code (.py files)
• Delta Live Tables

NOTE: A DAG enables concurrent execution of tasks at the same level.
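The note above can be illustrated with a small scheduler sketch (plain Python; the task names are hypothetical): each task's level is one more than the deepest of its dependencies, and all tasks in one level can run concurrently.

```python
from collections import defaultdict

def dag_levels(deps):
    """Group tasks into levels: a task's level is one more than the
    deepest of its dependencies; same-level tasks can run concurrently."""
    memo = {}
    def level(task):
        if task not in memo:
            memo[task] = 1 + max((level(d) for d in deps[task]), default=0)
        return memo[task]
    groups = defaultdict(list)
    for task in deps:
        groups[level(task)].append(task)
    return [sorted(groups[k]) for k in sorted(groups)]

# ingest has no deps; clean/validate depend on ingest; report on both
deps = {"ingest": [], "clean": ["ingest"], "validate": ["ingest"],
        "report": ["clean", "validate"]}
print(dag_levels(deps))  # [['ingest'], ['clean', 'validate'], ['report']]
```

Here `clean` and `validate` land in the same level, so a workflow engine could run them in parallel.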
Detail of Each Component

Delta Live Tables

Streaming Tables
A Delta table with one or more streams writing to it.
 Used for:
• Ingestion
• Low-latency transformations
• Huge scale

Materialized Views
The result of a query, stored in a Delta table.
 Used for:
• Transforming data
• Building aggregate tables
• Speeding up BI queries and reports

NOTE: Delta Live Tables can also be treated as tasks orchestrated in a workflow.
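A sketch of how the two table types above are declared with the Delta Live Tables Python API (this runs only inside a DLT pipeline on Databricks; the paths and names are hypothetical):

```python
# Sketch only: the dlt module exists only inside a DLT pipeline.
import dlt
from pyspark.sql import functions as F

@dlt.table  # streaming table: incrementally ingests new files
def raw_events():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/data/events/"))

@dlt.table  # materialized view: the stored result of a query
def daily_counts():
    return (dlt.read("raw_events")
            .groupBy("event_date")
            .agg(F.count("*").alias("n")))
```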
Columnar Storage

Processing
 Data Reading: read the data and load it into memory;
 Column Parsing: parse the data and identify each column's type;
 Columnar Organization: re-arrange and store the data in a columnar layout;
 Compression: reduce storage space and improve read efficiency;
 Metadata: generate metadata for query optimization and data management.

(Diagram: the resulting delta table)
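The organization and compression steps can be sketched in plain Python: rows are pivoted into per-column arrays, and a repetitive column compresses far better once stored contiguously (a toy illustration only, not the actual Delta/Parquet format):

```python
import zlib

rows = [{"id": i, "country": "US"} for i in range(1000)]  # toy row data

# Columnar Organization: pivot the rows into one array per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Compression: a contiguous, repetitive column compresses very well.
raw = "".join(columns["country"]).encode()   # 2000 bytes of "USUS..."
packed = zlib.compress(raw)
print(len(raw), len(packed))  # the compressed form is much smaller
```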
Catalyst
Parser
 Parses the textual query and converts it into an AST (Abstract Syntax Tree);
Analyzer
 Checks the syntax and semantics of the query to ensure correctness;
Logical Optimizer
 Improves the logical plan with rewrites such as predicate pushdown and column pruning;
Physical Planner
 Transforms the logical plan into a physical plan;
Code Generation
 Compiles parts of the query to bytecode for performance.
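The logical-optimizer step can be illustrated with a toy plan rewrite (plain Python; this is not actual Catalyst code, and the table/column names are made up): a filter sitting above a projection is pushed below it so fewer rows reach the projection.

```python
# Toy logical-plan rewrite illustrating predicate pushdown.
# Plans are nested tuples:
#   ("scan", table), ("filter", col, child), ("project", cols, child)

def push_down_filters(plan):
    """Move a filter below a projection when the filtered column survives."""
    if plan[0] == "filter":
        _, col, child = plan
        child = push_down_filters(child)
        if child[0] == "project" and col in child[1]:
            # filter(project(x)) -> project(filter(x))
            return ("project", child[1], ("filter", col, child[2]))
        return ("filter", col, child)
    if plan[0] == "project":
        return ("project", plan[1], push_down_filters(plan[2]))
    return plan

plan = ("filter", "amount", ("project", ["id", "amount"], ("scan", "orders")))
print(push_down_filters(plan))
# ('project', ['id', 'amount'], ('filter', 'amount', ('scan', 'orders')))
```

Real Catalyst applies many such rule-based rewrites over a tree of plan nodes; this sketch shows only the shape of one rule.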
To be continued...
