Azure Data Lake and U-SQL
Gary Hope
Cloud Data Solution Architect
Microsoft South Africa
[email protected]
Azure Data Lake
(Overview diagram: sensors and devices, automated systems, and other sources feed the lake; U-SQL (extensible by C#, R and Python) runs on YARN over the WebHDFS-compatible Store; results surface in dashboards and visualizations through Power BI.)
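As a first taste of the language, here is a minimal U-SQL sketch. The input path and columns are illustrative, and the inline call to the .NET method ToUpperInvariant() shows the C# extensibility mentioned above.

    @searchlog =
        EXTRACT UserId int,
                Start  DateTime,
                Region string,
                Query  string
        FROM "/input/SearchLog.tsv"
        USING Extractors.Tsv();

    // Any C# expression can be used inline in a SELECT.
    @normalized =
        SELECT UserId,
               Region.ToUpperInvariant() AS Region,
               Query
        FROM @searchlog;

    OUTPUT @normalized
        TO "/output/NormalizedLog.tsv"
        USING Outputters.Tsv();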
Demo – Let's Create The Services
Traditional business analytics process
1. Start with end-user requirements to identify the desired reports and analysis
2. Define the corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data (curation) and transform it to the target schema ('schema-on-write')
5. Create reports and analyze the data
Iterate
The data lake approach (diagram): gather data from all sources, store it indefinitely, analyze, and see the results.
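The lake defers the schema to query time instead ('schema-on-read'). A hedged U-SQL sketch under an illustrative raw log layout: the file sits in the store in its native TSV form, and types are applied only when the query runs.

    // The raw file keeps its native format; the schema lives in the query.
    @raw =
        EXTRACT ClientIp  string,
                TimeStamp string,   // still plain text on disk
                Uri       string
        FROM "/raw/weblogs/2017-03-01.tsv"
        USING Extractors.Tsv();

    // Types are imposed at read time, not at load time.
    @typed =
        SELECT ClientIp,
               DateTime.Parse(TimeStamp) AS EventTime,
               Uri
        FROM @raw;

    OUTPUT @typed
        TO "/curated/weblogs/2017-03-01.tsv"
        USING Outputters.Tsv();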
Data Lake Store: Technical Requirements
Secure: Must be highly secure to prevent unauthorized access (especially as all data is in one place).
Scalable: Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up.
Reliable: Must be highly available and reliable (no permanent loss of data).
Throughput: Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark.
Details: Must be able to store data with all details; aggregation may lead to loss of details.
Native format: Must permit data to be stored in its 'native format' to track lineage and for data provenance.
All sources: Must be able to ingest data from a variety of sources: LOB/ERP, logs, devices, social networks, etc.
Multiple analytic frameworks: Must support multiple analytic frameworks (batch, real-time, streaming, ML, etc.). No one analytic framework can work for all data and all types of analysis.
Big Data analytics workloads
A highly scalable, distributed, parallel file system in the cloud, specifically designed to work with a variety of big data analytics workloads.
(Diagram: sources such as LOB applications, web, relational stores, social feeds, sensors, and clickstream data land in Azure Data Lake Store; on top run batch engines (U-SQL, HDInsight MapReduce), NoSQL (HBase), SQL (Hive), and predictive analytics (R Server).)
Azure Data Lake Store
Scale, Performance, Reliability
Azure Data Lake Store: No Scale Limits
No limits on:
- Amount of data stored
- How long data can be stored
- Number of files
- Size of the individual files
- Ingestion throughput
Seamlessly scales from a few KBs to several PBs.
Azure Data Lake Store: How it works
- Each file in ADL Store is sliced into blocks.
- Blocks are distributed across multiple data nodes in the backend storage system.
- Metadata is stored about each file.
(Diagram: an ADL Store file split into blocks, each placed on a different data node.)
Azure Data Lake Store: Massive throughput
Through read parallelism, ADL Store provides massive throughput: a single read operation fans out across all the data nodes that hold the file's blocks.
(Diagram: the blocks of one file read in parallel from multiple data nodes in the backend storage.)
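In U-SQL this parallelism is easy to exploit with file sets: one EXTRACT reads a whole partitioned directory tree at once. A hedged sketch; the path pattern and columns are illustrative.

    // The {date:...} parts of the pattern match many files and become a
    // virtual column, so one EXTRACT fans out over all of them in parallel.
    @clicks =
        EXTRACT UserId string,
                Uri    string,
                date   DateTime   // virtual column derived from the path
        FROM "/raw/clicks/{date:yyyy}/{date:MM}/{date:dd}.tsv"
        USING Extractors.Tsv();

    // Predicates on the virtual column prune whole files from the read.
    @march =
        SELECT UserId, Uri
        FROM @clicks
        WHERE date >= new DateTime(2017, 3, 1) AND date < new DateTime(2017, 4, 1);

    OUTPUT @march
        TO "/curated/march_clicks.tsv"
        USING Outputters.Tsv();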
ADL Store: High Availability and Reliability
- Azure maintains 3 replicas of each data object per region, across three fault and upgrade domains.
- Each create or append operation on a replica is replicated to the other two.
- Writes are committed to the application only after all replicas are successfully updated.
- Read operations can go against any replica.
- Data is never lost or unavailable, even under failures.
(Diagram: a write commits against Replica 1 and is replicated to Replicas 2 and 3 across fault/upgrade domains.)
Azure Data Lake Store
Security
Azure Data Lake Store Security: AAD integration
Multi-factor authentication based on OAuth 2.0
Integration with on-premises AD for federated authentication
Role-based access control
Privileged account management
Application usage monitoring and rich auditing
Security monitoring and alerting
Fine-grained ACLs for AD identities
Azure Data Lake Store Security: Role-based access
- Each file and directory is associated with an owner and a group.
- Files and directories carry separate permissions (read (r), write (w), execute (x)) for the owner, members of the group, and all other users.
- Fine-grained access control list (ACL) rules can be specified for specific named users or named groups.
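For illustration, a hypothetical set of entries in the HDFS ACL spec format that these rules map onto; the named user and group are made up:

    user::rwx                    owner
    user:alice:r-x               named user
    group::r-x                   owning group
    group:analysts:rwx           named group
    other::---                   everyone else
    default:group:analysts:r-x   default ACL inherited by new children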
Granular control of file and folder access
- POSIX-style ACLs, fully compatible with HDFS/WebHDFS
- Default ACLs
- IP range whitelisting (e.g. 64.34.55.130 – 64.34.55.135)
- Encryption of data at rest
- Audit logs for data access
(Diagram: the ecosystem around ADL Store. HDInsight processing: MapReduce, Hive queries, Spark queries, HBase transactions, and any HDFS application. Relational movement via Apache Sqoop: Azure SQL DB and Azure SQL DW. Tooling: .NET SDK, CLI, Azure Portal, Azure PowerShell. Other sources and sinks: Azure Tables/Table Storage, streaming event data from EventHubs, web portals, GitHub, relational DBs, alerts, and Power BI.)
Developing big data apps
Author, debug, and optimize big data apps in Visual Studio.
Multiple languages: U-SQL, Hive, and Pig.
(Diagram: a single U-SQL job EXTRACTs from and OUTPUTs to Azure Storage Blobs, Azure SQL DB, and ADL Store, wherever the data lives.)
Benefits:
- Avoid moving large amounts of data across the network between stores.
- A single view of the data irrespective of its physical location.
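A hedged sketch of such a federated read, assuming a blob storage account the Data Lake Analytics account has been granted access to (the account, container, and paths are hypothetical):

    // Read directly from Azure Blob Storage via a wasb:// path;
    // no copy into ADL Store is needed first.
    @blobRows =
        EXTRACT Id   int,
                Name string
        FROM "wasb://mycontainer@myaccount.blob.core.windows.net/data/input.csv"
        USING Extractors.Csv();

    // Land the curated result in ADL Store.
    OUTPUT @blobRows
        TO "/curated/input_copy.csv"
        USING Outputters.Csv();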
Where to learn more…
Email: [email protected]
Twitter: @GaryHope
Mobile: +27 82 7778886