Azure Databricks Brief Introduction
A Brief Introduction
Bryan Cafferky
Data Solutions Enabler
https://github.com/bcafferky/shared
About Bryan Cafferky…
• Decades of IT Experience
• Past Microsoft Data Platform MVP and Cloud and Data Center MVP
• Founded and leads the PASS chapter The RI Microsoft BI User Group and The Boston Data Science Group
Wrapping Up
Why Azure?
Enablement and Ease of Use
Integration
Fluid Scaling
Enterprise Solutions
Secure
Support
Why So Many Options?
Data Analysis Platform
Client Tools
Streaming
On-Prem
Cloudera
• PostgreSQL and MySQL offered as Azure services (in preview)
via Marketplace
Scale
Azure Supported Machine Learning Approaches
Custom Build Your Own Pre Built
Non Scalable Services
Scalable Scalable
• Poor Scalability
• Language Dependent
• Model Performance Varies
• Very High Scalability
• General API
• High Scalability
• Support
• Pre Trained
• Easy to Use
• Fitted to a Purpose
Examples: Cognitive Services
Examples
https://docs.microsoft.com/en-us/python/api/overview/azure/ml/intro?view=azure-ml-py
Azure ML Services
https://docs.microsoft.com/en-us/azure/machine-learning/
What Was the Python Programming Language Named After?
Monty Python’s Flying Circus
Azure ML Services from a Jupyter Notebook
https://github.com/Microsoft/AMLDataPrepDocs/blob/master/Scenarios/GettingStarted/getting-sta
SQL Server/Machine Learning Server Integration
Clarifying Bad Terms
Same Word, Opposite Meanings
Clarifying Bad Terms
Analytics: Visualizations and/or Machine Learning
Challenging
Big Data: Non-traditional (movies, images, massive, streaming)
Scale Out
Moving Up the Scale
Scale Up vs. Scale Out
[Figure: data-size scale with tiers labeled Medium and Large; maximums shown are Megabytes, Terabytes, and Petabytes+]
Real Time
Hadoop
Batch
Move Data to/from SQL Databases
Queuing Service
Streaming
APACHE SPARK
A unified, open-source, parallel data processing framework for Big Data analytics
Azure Databricks
• Notebooks
• Integrated Blob File System
• Secure Collaboration
• Click Cluster Creation
• Language Extensions
• Job Scheduler
• Spark Optimizations
HDInsight vs. Azure Databricks
Easy Scheduling
Cluster Creation
Big Data
Notebooks
Security Collaboration
DATABRICKS - COMPANY OVERVIEW
• Open-source data processing engine built around speed, ease of use, and sophisticated analytics
• Highly extensible, with support for Scala, Java, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib)
AZURE DATABRICKS
Azure Databricks
Collaborative Workspace
Hadoop storage
DATABRICKS I/O
APACHE SPARK
SERVERLESS
REST APIs
Data warehouses
Enhance Productivity
Build on secure & trusted cloud
Scale without limits
GENERAL SPARK CLUSTER ARCHITECTURE
[Diagram: Driver Program (SparkContext), Cluster Manager, and three Worker Nodes, each with a Cache and a Task]
The ‘Driver’ runs the user’s ‘main’ function and executes the various parallel operations on the worker nodes.
The results of the operations are collected by the driver.
The worker nodes read and write data from/to data sources, including HDFS.
Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
Worker nodes and the driver node execute as VMs in public clouds (AWS, Google, and Azure).
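The driver/worker flow described on this slide can be sketched in plain Python. This is only an analogy and not Spark itself: a local thread pool stands in for the worker nodes, and the names `split_into_partitions`, `worker_task`, and `driver_main` are hypothetical, chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous, roughly equal partitions."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def worker_task(partition):
    """Task run by one 'worker': square each value and return a partial sum."""
    return sum(x * x for x in partition)


def driver_main(data, num_workers=3):
    """'Driver': partition the data, hand one task to each worker,
    then collect and combine the partial results."""
    partitions = split_into_partitions(data, num_workers)
    # Assumption: threads stand in for worker nodes; a real cluster
    # would ship tasks to separate machines via a cluster manager.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_task, partitions))
    return partial_sums, sum(partial_sums)


partial_sums, total = driver_main(list(range(1, 10)))
print(partial_sums, total)
```

The sketch leaves out caching, fault tolerance, and data-source access. It only shows the split, parallel task, and collect pattern that the driver coordinates.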
Access Control can be defined for Workspaces, Clusters, Jobs and REST APIs
Workspace Access Control: Defines who can view, edit, and run notebooks in their workspace
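As a rough illustration of the idea, workspace access control can be thought of as a lookup from notebook and user to a set of allowed actions. The sketch below is a toy model and not the Azure Databricks API: the `acl` dictionary, the notebook path, the user names, and the `is_allowed` helper are all hypothetical. Only the three actions (view, edit, run) come from the slide.

```python
# Hypothetical access-control list: notebook path -> user -> allowed actions.
acl = {
    "/Shared/etl_notebook": {
        "alice": {"view", "edit", "run"},
        "bob": {"view"},
    }
}


def is_allowed(acl, notebook, user, action):
    """Return True only if the user was explicitly granted the action."""
    return action in acl.get(notebook, {}).get(user, set())


print(is_allowed(acl, "/Shared/etl_notebook", "alice", "edit"))
print(is_allowed(acl, "/Shared/etl_notebook", "bob", "run"))
```

Denying by default (an unknown user or notebook gets no access) is a design choice made for this sketch. The actual service defines its own permission levels and defaults.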
[Diagram: Azure Databricks at the center, surrounded by Clusters, Libraries, Workspaces, Jobs, and Notebooks]
Advanced Analytics on Big Data
Streaming
On-Prem
Cloudera
• PostgreSQL and MySQL offered as Azure services (in preview)
via Marketplace
Scale
Wrapping Up