Azure Data Lake Store and Azure Data Lake Analytics
A technical overview and introduction to U-SQL
Gary Hope
Cloud Data Solution Architect, Microsoft South Africa
[email protected]
Azure Data Lake
[Diagram] Analytics: HDInsight ("managed clusters") and Azure Data Lake Analytics, built on Azure Data Lake Storage.
Azure Data Lake as part of Cortana Intelligence Suite
[Diagram] Data Management and Analytics: sources (apps, sensors and devices) flow from data to intelligence to action (people, web and mobile apps, bots, automated systems).
- Information Management: Data Factory, Data Catalog, Event Hubs
- Big Data Stores: SQL Data Warehouse, Data Lake Store
- Machine Learning and Analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
- Intelligence: Cognitive Services, Bot Framework, Cortana
- Dashboards & Visualizations: Power BI
Azure Data Lake
[Diagram] Analytics layer: Azure Data Lake Analytics with U-SQL (extensible by C#, R and Python), alongside HDInsight, both running on YARN. Storage layer: Azure Data Lake Store, exposed through WebHDFS.
Demo – Let's Create The Services
Why data lakes?
Traditional business analytics process
1. Start with end-user requirements to identify desired reports and analysis
2. Define the corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data (curation) and transform it to the target schema ('schema-on-write')
5. Create reports and analyze the data
[Diagram] LOB applications feed an ETL pipeline built with dedicated ETL tools (e.g. SSIS) into a relational store with a defined schema, which serves queries and results.
All data not immediately required is discarded or archived.
New big data thinking: All data has value
All data has potential value
Data hoarding
No defined schema—stored in native format
Schema is imposed and transformations are done at query time (schema-on-read).
Apps and users interpret the data as they see fit
Gather data from all sources → store indefinitely → analyze → see results → iterate.
Data Lake Store: Technical Requirements
Secure: must be highly secure to prevent unauthorized access (especially as all data is in one place).
Scalable: must be highly scalable; when storing all data indefinitely, data volumes quickly add up.
Reliable: must be highly available and reliable (no permanent loss of data).
Throughput: must provide high throughput for massively parallel processing via frameworks such as Hadoop and Spark.
Details: must be able to store data with all details; aggregation may lead to loss of detail.
Native format: must permit data to be stored in its native format to track lineage and preserve data provenance.
All sources: must be able to ingest data from a variety of sources: LOB/ERP systems, logs, devices, social networks, etc.
Multiple analytic frameworks: must support multiple analytic frameworks (batch, real-time, streaming, ML, etc.); no one analytic framework can work for all data and all types of analysis.
Big data analytics workloads
A highly scalable, distributed, parallel file system in the cloud, specifically designed to work with a variety of big data analytics workloads.
[Diagram] Sources (devices, LOB applications, web, relational, social, video, sensors, clickstream) land in Azure Data Lake Store, which feeds Azure Data Lake Analytics (batch: U-SQL) and HDInsight (batch: MapReduce; NoSQL: HBase; script: Pig; in-memory: Spark; SQL: Hive; predictive: R Server).
Azure Data Lake Store
Scale, Performance, Reliability
Azure Data Lake Store: no scale limits
No limits on:
- Amount of data stored
- How long data can be stored
- Number of files
- Size of the individual files
- Ingestion throughput
Seamlessly scales from a few KBs to several PBs.
Azure Data Lake Store: how it works
- Each file in ADL Store is sliced into blocks.
- Blocks are distributed across multiple data nodes in the backend storage system.
- With a sufficient number of backend storage data nodes, files of any size can be stored.
- Backend storage runs in the Azure cloud, which has virtually unlimited resources.
- Metadata is stored about each file; there is no limit to metadata either.
[Diagram] An Azure Data Lake Store file is split into Block 1, Block 2, …, with the blocks spread across the data nodes of the backend storage.
Azure Data Lake Store: massive throughput
- Through read parallelism, ADL Store provides massive throughput.
- Each read operation on an ADL Store file results in multiple read operations executed in parallel against the backend storage data nodes.
[Diagram] A single read on an Azure Data Lake Store file fans out into parallel block reads across the backend data nodes.
ADL Store: high availability and reliability
- Azure maintains three replicas of each data object per region, across three fault and upgrade domains.
- Each create or append operation on a replica is replicated to the other two.
- Writes are committed to the application only after all replicas are successfully updated.
- Read operations can go against any replica.
- Data is never lost or unavailable, even under failures.
[Diagram] A write commits to Replica 1 and is replicated to Replicas 2 and 3, each in a different fault/upgrade domain.
Azure Data Lake Store
Security
Azure Data Lake Store security: AAD integration
- Multi-factor authentication based on OAuth 2.0
- Integration with on-premises AD for federated authentication
- Role-based access control
- Privileged account management
- Application usage monitoring and rich auditing
- Security monitoring and alerting
- Fine-grained ACLs for AD identities
Azure Data Lake Store security: role-based access
- Each file and directory is associated with an owner and a group.
- Files and directories carry separate permissions (read (r), write (w), execute (x)) for the owner, members of the group, and all other users.
- Fine-grained access control list (ACL) rules can be specified for specific named users or named groups.
Granular control of file and folder access
- POSIX-style ACLs, fully compatible with HDFS/WebHDFS.
- Default ACLs can be generated for files and folders, then customized for fine-tuned control.
- Access ACLs control how a user can access a file or folder.
- Default ACLs are used to construct the Access ACLs of new children.
- Default ACLs are copied to the Default ACLs of new child folders.
[Diagram] A folder carries both Access and Default ACLs; a new child file receives only an Access ACL, while a new child folder receives both.
- IP address ACLs: IP range whitelist (e.g. 64.34.55.130–64.34.55.135)
- Encryption of data at rest
- Audit logs for data access (e.g. [T1] Alice, Write; [T2] Bob, Read)
Demo – Let's Upload Some Data
Azure Data Lake Store
Hadoop Integration and Data Movement
Azure Data Lake Store is HDFS-compatible
With a WebHDFS endpoint, Azure Data Lake Store is a Hadoop-compatible file system that integrates seamlessly with Azure HDInsight.
[Diagram] MapReduce jobs, Hive queries, Spark queries, HBase transactions, and any HDFS application in Azure HDInsight go through the Hadoop WebHDFS client to the WebHDFS-compatible REST API of Azure Data Lake Store.
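To make the compatibility concrete (the account and file names below are hypothetical): Hadoop clients address the store with adl:// URIs, while REST clients read the same path through a WebHDFS-style call.

adl://myadlaccount.azuredatalakestore.net/clusters/weblogs.txt

https://myadlaccount.azuredatalakestore.net/webhdfs/v1/clusters/weblogs.txt?op=OPEN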
Azure Data Lake Store: ingress
Data can be ingested into Azure Data Lake Store from a variety of sources: Azure SQL DB, Azure SQL DW, Azure Tables (Table Storage), Azure Storage Blobs, on-premises databases, and Azure Event Hubs.
Ingestion channels include Azure Data Factory, the ADL built-in copy service, Hadoop DistCp, Azure Stream Analytics, the .NET SDK, CLI, Azure Portal, Azure PowerShell, and custom programs.
Azure Data Lake Store: egress
Data can be exported from Azure Data Lake Store into numerous targets/sinks: Azure SQL DB, Azure SQL DW, Azure Tables (Table Storage), Azure Storage Blobs, and on-premises databases.
Export channels include Azure Data Factory, Hadoop DistCp, Apache Sqoop, the .NET SDK, CLI, Azure Portal, Azure PowerShell, and custom programs.
Lambda architecture
[Diagram] On premises, relational databases, on-prem HDFS, and event data (e.g. Kafka) are moved to the cloud through Azure Data Factory (with the Data Management Gateway) and Event Hubs. Incoming data lands in Data Lake Store for cleansing and analysis with ADL Analytics, while streaming event data is processed as it arrives. Consumption targets include Power BI, SQL DB/DW, web portals, and alerts.
Azure Data Lake Store
Costs
Costs break down by stage: you get all the advantages of ADL Store with cost concepts you are already familiar with.
Azure Data Lake
Analytics
The Azure Data Lake Analytics service: a new distributed analytics service.
- Built on Apache YARN
- Scales dynamically with the turn of a dial
- Pay by the query
- Supports Azure AD for access control, roles, and integration with on-premises identity systems
- Built with U-SQL to unify the benefits of SQL with the power of C#
- Processes data across Azure
Azure Data Lake Analytics: all data, productivity from day one, easy and powerful data preparation, limitless scale, enterprise-grade.
Developing big data apps
- Author, debug, and optimize big data apps in Visual Studio
- Multiple languages: U-SQL, Hive, and Pig
- Seamlessly integrate .NET
- Work across all cloud data with Azure Data Lake Analytics: Azure SQL DW, Azure SQL DB, Azure SQL DB in an Azure VM, Data Lake Store, and Storage Blobs
Azure Data Lake
U-SQL
What is U-SQL?
A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data.
- Allows users to focus on the what, not the how, of business problems
- Built on familiar languages (SQL and C#) and supported by a fully integrated development environment
- Built for data developers and scientists
The origins of U-SQL
U-SQL is a next-generation large-scale data processing language that combines:
- The declarativeness, optimizability and parallelizability of SQL
- The extensibility, expressiveness and familiarity of C#
It draws on SCOPE, Hive and T-SQL, and is designed to be high-performance, scalable, affordable, easy to program, and secure.
Usage scenarios
- Achieve the same programming experience in batch or interactive
- Schematize unstructured data (Load-Extract-Transform-Store) for analysis
- Cook data for other users (LETS & share), either as unstructured data or as structured data
- Large-scale custom processing with custom code
- Augment big data with high-value data from where it lives
Expression-flow programming style
- Automatic "in-lining" of U-SQL expressions: the whole script leads to a single execution model
- Execution plan that is optimized out of the box, without user intervention
- Per-job and user-driven parallelization
- Detailed visibility into execution steps, for debugging
- Heat-map functionality to identify performance bottlenecks
A minimal sketch of this style follows.
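For illustration (the file paths and column names here are hypothetical, not from the deck), each statement below names a rowset that feeds the next; U-SQL compiles and optimizes the whole script as one job rather than running it statement by statement:

@log =
    EXTRACT user string,
            duration int
    FROM "/logs/visits.tsv"
    USING Extractors.Tsv();    // built-in TSV extractor

@totals =
    SELECT user,
           SUM(duration) AS total
    FROM @log
    GROUP BY user;             // in-lined into the same plan as @log

@top =
    SELECT user, total
    FROM @totals
    ORDER BY total DESC
    FETCH 10;                  // top 10 users by total duration

OUTPUT @top
TO "/output/top_users.tsv"
USING Outputters.Tsv();        // the OUTPUT is what triggers execution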
U-SQL queries: general pattern
Read → Process → Store
[Diagram] EXTRACT reads data from Azure Storage Blobs, Azure SQL DB or Azure Data Lake into a rowset; SELECT … FROM … WHERE … processes rowsets; OUTPUT (or INSERT) stores the results back to Azure Storage Blobs or Azure Data Lake. The annotated script below shows the pattern end to end.
Anatomy of a U-SQL query

REFERENCE ASSEMBLY WebLogExtASM;

// U-SQL types are the same as C# types. The structure (schema) is first
// imposed when the data is extracted/read from the file (schema-on-read).
@rs =
    EXTRACT UserID       string,
            Start        DateTime,
            End          DateTime,
            Region       string,
            SitesVisited string,
            PagesVisited string
    FROM "swebhdfs://Logs/WebLogRecords.txt"   // input is read from this file in ADL
    USING WebLogExtractor();                   // custom function to read the input file

// A rowset such as @rs is conceptually like an intermediate table;
// it is how U-SQL passes data between statements.
@result =
    SELECT UserID,
           (End.Subtract(Start)).TotalSeconds AS Duration   // C# expression
    FROM @rs
    ORDER BY Duration DESC
    FETCH 10;

OUTPUT @result
TO "swebhdfs://Logs/Results/top10.txt"   // output is stored in this file in ADL
USING Outputters.Tsv();                  // built-in function that writes the output in TSV format
U-SQL data types

Category   Types
Numeric    byte, byte?; sbyte, sbyte?; short, short?; ushort, ushort?; int, int?; uint, uint?; long, long?; ulong, ulong?; float, float?; double, double?; decimal, decimal?
Text       char, char?; string
Complex    MAP<K,V>; ARRAY<T>
Temporal   DateTime, DateTime?
Other      bool, bool?; Guid, Guid?; byte[]
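A small fragment showing these types in use (the column and file names are hypothetical); U-SQL exposes the complex types as SQL.MAP and SQL.ARRAY:

@rs =
    EXTRACT user     string,
            duration int?,     // nullable C# type: missing values become null
            sites    string
    FROM "/logs/visits.tsv"
    USING Extractors.Tsv();

@parsed =
    SELECT user,
           duration ?? 0 AS seconds,                           // C# null-coalescing operator
           new SQL.ARRAY<string>(sites.Split(';')) AS visited  // delimited string to ARRAY
    FROM @rs;

Note that the built-in outputters only write scalar columns, so complex values like @parsed.visited are typically consumed by later expressions or custom code rather than output directly.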
Federated queries: query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store.
Benefits:
- Avoid moving large amounts of data across the network between stores
- Single view of data irrespective of physical location
- Minimize data proliferation issues caused by maintaining multiple copies
- Single query language for all data
- Each data store maintains its own sovereignty
- Design choices based on the need
[Diagram] Azure Data Lake Analytics sends the U-SQL query out to Azure Storage Blobs, Azure SQL DB, and SQL in Azure VMs, and only the query result comes back.
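A sketch of what a federated query can look like in U-SQL, assuming an Azure SQL DB source (the data source, credential, and table names below are hypothetical, and the exact options vary by source type):

// Register an external data source pointing at an Azure SQL DB
// (assumes a credential object was created beforehand).
CREATE DATA SOURCE IF NOT EXISTS MySqlDbSource
FROM AZURESQLDB
WITH (
    PROVIDER_STRING = "Database=CustomerDb;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = CustomerDb.MyCredential,
    REMOTABLE_TYPES = (bool, byte, short, int, long, decimal, float, double, string, DateTime)
);

// Query the remote table in place; only the result set moves to ADLA.
@customers =
    SELECT *
    FROM EXTERNAL MySqlDbSource LOCATION "dbo.Customers";

OUTPUT @customers TO "/output/customers.tsv" USING Outputters.Tsv();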
Demo – Let's Run Some Queries
Azure Data Lake Analytics
Billing
Azure Data Lake Analytics billing
- Accounts are free!
- Pay for the compute resources you want for your queries
- Pay for storage separately
Job cost = (query_minutes × parallelism × parallelism_cost_per_minute) + per_job_charge
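For illustration only (these prices are made up, not actual rates): a job that runs for 10 minutes at parallelism 5, with a hypothetical charge of $0.03 per parallelism-minute and a $0.10 per-job charge, would cost (10 × 5 × $0.03) + $0.10 = $1.60.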
Get started today!
For more information visit:
http://azure.com/datalake
Where to learn more…
Email: [email protected]
Twitter: @GaryHope
Mobile: +27 82 7778886