
Azure Databricks Brief Introduction

The document provides an overview of Azure Databricks, highlighting its integration with Azure services, ease of use, and scalability for big data and machine learning. It emphasizes the collaborative features, secure access control, and the advantages of using Apache Spark within the Azure ecosystem. Additionally, it outlines the capabilities of Azure Databricks in processing and analyzing large datasets efficiently.

Uploaded by

etest2272

Azure Databricks

A Brief Introduction

Bryan Cafferky
Data Solutions Enabler

https://github.com/bcafferky/shared
About Bryan Cafferky…

• Microsoft Data and AI Solutions Enabler for Healthcare

• Decades of IT Experience

• Past Microsoft Data Platform MVP and Cloud and Data Center MVP

• Author of Pro PowerShell for Database Developers

• Experienced in health care, insurance, banking, and ecommerce

• Founded and leads the PASS chapter The RI Microsoft BI User Group and The Boston Data Science Group

https://www.linkedin.com/in/bryancafferky

Subscribe to my YouTube Channel
Where Are We Heading?
Azure Data and Machine Learning Services Overview

Big Data and Machine Learning at Scale

Azure Databricks Overview

Wrapping Up
Why Azure?
Enablement and Ease of Use

Integration

Fluid Scaling

Enterprise Solutions

Secure

Support
Why So Many Options?
Data Analysis Platform (diagram, services arranged by scale)
• Client Tools
• Streaming
• Azure Data Factory (Data Ingestion)
• Azure Cognitive Services
• On-Prem
• Analysis Services
• SQL DB / Managed Instance
• SQL Data Warehouse (Massively Parallel Processing)
• Cosmos DB
• HDInsight
• Databricks
• Cloudera via Marketplace
• PostgreSQL and MySQL offered as Azure Services (in preview)

DEMO
Azure Supported Machine Learning Approaches

Custom, Non Scalable
• Poor Scalability
• Language Dependent
• Model Performance Varies

Custom, Scalable
• High Scalability
• Support

Build Your Own, Scalable
• Very High Scalability
• General API

Pre Built Services
• Pre Trained
• Easy to Use
• Fitted to a Purpose

Examples shown on the slide: Open Source, Commercial, Cognitive Services

© Copyright Microsoft Corporation. All rights reserved.
Cognitive Services



Machine Learning Solutions
Partner Solutions

• Pre Built Solutions


• Customize to Your Needs

Examples



Moving Up the Scale
Scale Up vs. Scale Out
• Small (Max = Megabytes): Open Source
• Medium (Max = Terabytes): SQL Server/ML Server, Scale Up
• Large (Max = Petabytes+): HDInsight, Databricks, DW, CosmosDB, Scale Out


Machine Learning Services



Using Python and Azure ML Services

https://docs.microsoft.com/en-us/python/api/overview/azure/ml/intro?view=azure-ml-py
Azure ML Services

https://docs.microsoft.com/en-us/azure/machine-learning/
What was the Python Programming Language Named After?
Monty Python’s Flying Circus
Azure ML Services from a Jupyter Notebook

https://github.com/Microsoft/AMLDataPrepDocs/blob/master/Scenarios/GettingStarted/getting-sta
SQL Server/Machine Learning Server Integration
Clarifying Bad Terms

Same Word, Opposite Meanings
Clarifying Bad Terms (Challenging)
• Analytics: Visualizations and/or Machine Learning
• Big Data: Non traditional: movies, images, massive, streaming
• Data Lake: No such thing. Just Blob storage.
• Machine Learning: Includes AI, Deep Learning, Predictive Modeling
Scale Up vs. Scale Out

Scale Up Scale Out


Scale Up vs. Scale Out

Scale Out

My Kitchen Drawer
The Apache Hadoop Ecosystem
Platforms: Databricks, HDInsight

Capabilities shown in the diagram:
• In Memory
• Streaming
• NoSQL Storage
• Distributed System Coordination
• Full Text Search
• MPP SQL
• Machine Learning
• ETL
• Real Time
• Hadoop (Batch)
• Move Data to/From SQL Databases
• Queuing Service
APACHE SPARK
A unified, open source, parallel data processing framework for Big Data Analytics

Spark Unifies:
• Batch Processing
• Interactive SQL
• Real-time processing
• Machine Learning
• Deep Learning
• Graph Processing

Libraries on the Spark Core Engine:
• Spark SQL: Interactive Queries
• Spark MLlib: Machine Learning
• Spark Structured Streaming: Stream processing
• Spark GraphX: Graph Computation

Cluster managers: Yarn, Mesos, Standalone Scheduler

It all runs on Spark

Azure Databricks

• Notebooks
• Integrated Blob File System
• Secure Collaboration
• Click Cluster Creation
• Language Extensions
• Job Scheduler
• Spark Optimizations
HDInsight vs. Azure Databricks

HDInsight and Azure Databricks on Azure (diagram labels):
• Easy Cluster Creation
• Scheduling
• Big Data
• Notebooks
• Security
• Collaboration
DATABRICKS - COMPANY OVERVIEW

• Founded in late 2013
• By the creators of Apache Spark, original team from UC Berkeley AMPLab
• Largest code contributor to Apache Spark
• Level 2/3 support partnership with
  • Hortonworks
  • MapR
  • DataStax
• Provides certifications such as Databricks Certified Application, Databricks Certified Distribution and Databricks Certified Developer
• Main Product: The Unified Analytics Platform
• In Oct 2017, introduced Databricks Delta (currently in private preview).
AZURE DATABRICKS

• Azure Databricks is a first party service on Azure.
  • Unlike with other clouds, it is not an Azure Marketplace or a 3rd party hosted service.
• Azure Databricks is integrated seamlessly with Azure services:
  • Azure Portal: Service can be launched directly from Azure Portal
  • Azure Storage Services: Directly access data in Azure Blob Storage and Azure Data Lake Store
  • Azure Active Directory: For user authentication, eliminating the need to maintain two separate sets of users in Databricks and Azure.
  • Azure SQL DW and Azure Cosmos DB: Enables you to combine structured and unstructured data for analytics
  • Apache Kafka for HDInsight: Enables you to use Kafka as a streaming data source or sink
  • Azure Billing: You get a single bill from Azure
  • Azure Power BI: For rich data visualization
• Eliminates need to create a separate account with Databricks.
Why Spark?

• Open-source data processing engine built around speed, ease of use, and sophisticated analytics

• In-memory engine that can be up to 100 times faster than Hadoop MapReduce

• Largest open-source data project with 1000+ contributors

• Highly extensible with support for Scala, Java and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (MLlib)
AZURE DATABRICKS

Azure Databricks (platform diagram)
• Collaborative Workspace: Data Engineer, Data Scientist, Business Analyst
• Deploy Production Jobs & Workflows: Multi-Stage Pipelines, Job Scheduler, Notification & Logs
• Optimized Databricks Runtime Engine: Databricks I/O, Apache Spark, Serverless, REST APIs

Data in: IoT / streaming data, Cloud storage, Data warehouses, Hadoop storage
Data out: Machine learning models, BI tools, Data exports, Data warehouses

Enhance Productivity | Build on secure & trusted cloud | Scale without limits
GENERAL SPARK CLUSTER ARCHITECTURE

• The ‘Driver’ runs the user’s ‘main’ function and executes the various parallel operations on the worker nodes.
• The results of the operations are collected by the driver.
• The worker nodes read and write data from/to Data Sources including HDFS.
• Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
• Worker nodes and the Driver Node execute as VMs in public clouds (AWS, Google and Azure).

Diagram: Driver Program (SparkContext), Cluster Manager, and three Worker Nodes, each with a Cache and Tasks.

Data Sources (HDFS, SQL, NoSQL, …)
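
The driver/worker split described on this slide can be sketched in plain Python. This is only an analogy under stated assumptions, not Spark code: a thread pool stands in for the worker nodes, and the helper names (`split_into_partitions`, `worker_task`, `driver`) are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous, roughly equal partitions."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions get one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def worker_task(partition):
    """Task run on a 'worker': square each value, return a partial sum."""
    return sum(x * x for x in partition)


def driver(data, num_workers=3):
    """'Driver': hand one partition to each worker, then collect results."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_task, partitions))
    # The driver combines the partial results returned by the workers.
    return sum(partial_sums)


total = driver(list(range(10)))
print(total)
```

In Spark itself the cluster manager allocates the workers and the data usually lives in external storage; the sketch only shows the "partition, run tasks in parallel, collect at the driver" shape.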


Demonstration
binu_diabetes_demo
SECURE COLLABORATION
Azure Databricks enables secure collaboration between colleagues

• With Azure Databricks, colleagues can securely share key artifacts such as Clusters, Notebooks, Jobs and Workspaces
• Secure collaboration is enabled through a combination of:
  • Fine grained permissions: Define who can do what on which artifacts (access control)
  • AAD-based authentication: Ensures that users are actually who they claim to be
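
The idea of "who can do what on which artifacts" can be illustrated with a minimal sketch. It is a toy model, not the Databricks permission system: the user names, artifact names, action names, and the `is_allowed` helper are all hypothetical.

```python
# Hypothetical access-control list: (user, artifact) -> allowed actions.
# All names below are illustrative only.
ACL = {
    ("alice", "notebook:churn-model"): {"view", "edit", "run"},
    ("bob", "notebook:churn-model"): {"view"},
    ("alice", "cluster:shared"): {"attach", "restart"},
}


def is_allowed(user, action, artifact, acl=ACL):
    """Return True if `user` may perform `action` on `artifact`."""
    return action in acl.get((user, artifact), set())


print(is_allowed("alice", "edit", "notebook:churn-model"))
print(is_allowed("bob", "edit", "notebook:churn-model"))
```

Authentication (proving who the user is) happens before a check like this; the check only answers what an already authenticated user is permitted to do.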
AZURE DATABRICKS INTEGRATION WITH AAD
Azure Databricks is integrated with AAD, so Azure Databricks users are just regular AAD users

• There is no need to define users, and their access control, separately in Databricks.
• AAD users can be used directly in Azure Databricks for all user-based access control (Clusters, Jobs, Notebooks etc.).
• Databricks has delegated user authentication to AAD, enabling single sign-on (SSO) and unified authentication.
• Notebooks, and their outputs, are stored in the Databricks account. However, AAD-based access control ensures that only authorized users can access them.

Diagram labels: Access Control, Authentication, Azure Databricks
DATABRICKS ACCESS CONTROL
Access control can be defined at the user level via the Admin Console

Access control can be defined for Workspaces, Clusters, Jobs and REST APIs

• Workspace Access Control: Defines who can view, edit, and run notebooks in their workspace
• Cluster Access Control: Allows users to control who can attach to, restart, and manage (resize/delete) clusters. Allows Admins to specify which users have permissions to create clusters
• Jobs Access Control: Allows owners of a job to control who can view job results or manage runs of a job (run now/cancel)
• REST API Tokens: Allows users to use personal access tokens instead of passwords to access the Databricks REST API
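
As a rough sketch of token-based REST access, the snippet below builds (but does not send) a request that passes a personal access token as a bearer token. The workspace URL and token value are placeholders, and `build_clusters_list_request` is a helper invented for this example; the `/api/2.0/clusters/list` path is used here on the assumption that the workspace exposes the Clusters API at that location.

```python
import urllib.request


def build_clusters_list_request(workspace_url, token):
    """Build a GET request for the clusters list, authenticated by token."""
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    # The personal access token is sent as a bearer token, not a password.
    headers = {"Authorization": f"Bearer {token}"}
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder values: substitute a real workspace URL and token to use this.
request = build_clusters_list_request(
    "https://example-workspace.azuredatabricks.net",
    "dapi-EXAMPLE-TOKEN",
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access or a real token.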
AZURE DATABRICKS CORE ARTIFACTS

• Clusters
• Libraries
• Workspaces
• Jobs
• Notebooks
Advanced Analytics on Big Data

Ingest | Store | Prep & Train | Model & Serve | Intelligence

• Sources: Logs, files and media (unstructured); Business / custom apps (structured)
• Ingest: Data Factory
• Store: Azure storage
• Prep & Train: Azure Databricks (Spark MLlib, SparkR, SparklyR)
• Model & Serve: Azure Cosmos DB; Azure SQL Data Warehouse (Polybase)
• Intelligence: Web & mobile apps; Analytical dashboards
Wrapping Up
