Azure Data Lake
This course introduces you to the fundamental concepts of data lake store and analytics,
creating data lake store instances, ingesting data, applying analytics and securing data.
Happy Learning!
Azure data lake store is a highly scalable, distributed, parallel file system that is
designed to work with various analytic frameworks, and has the capability of storing
varied data from several sources.
Azure Data Lake Store
Microsoft describes Azure Data Lake Store (ADLS) as a hyperscale repository for big data analytics
workloads that stores data in its native format. ADLS is a Hadoop File System compatible with
Hadoop Distributed File System (HDFS) that works with the Hadoop ecosystem.
ADLS:
Allows storing of relational and non-relational data, and the data schema need not be defined
before data is loaded.
Keeps three copies of the data to ensure high availability.
Analytics Service
HDInsight
Diversified Storage
Analytics Service- to build various data analytics jobs and execute them in parallel.
HDInsight- for managing clusters that ingest and process large volumes of data, using open-source frameworks such as Hadoop, Spark, Pig, Hive, and so on.
Diversified Storage- to store diversified data such as structured, unstructured, and semi-structured data from diverse data sources.
The image above illustrates the ingestion of raw data, the preparation and modeling of
data, and the processing of data analytics jobs.
Azure Data Lake Storage Gen 1
Data Lake Storage Gen 1 is an Apache Hadoop file system compatible with HDFS
that works with the Hadoop ecosystem. Existing HDInsight applications or services
that use the WebHDFS API can easily integrate with Data Lake Storage Gen 1.
The working of a Gen 1 data lake store is illustrated in the above image.
Azure Data Lake Storage Gen 2
Hadoop-compatible access: The new Azure Blob File System (ABFS) driver is enabled within all
Apache Hadoop environments, including Azure Databricks, Azure HDInsight, and SQL Data
Warehouse, to access data stored in ADLS Gen2.
A superset of POSIX permissions: The security model for ADLS Gen2 supports ACL and POSIX
permissions, along with extra granularity specific to ADLS Gen2.
Cost effective: ADLS Gen2 offers low-cost storage capacity and transactions as data transitions
through its entire life cycle.
Comparing Data Lake Store and Blob Storage
The above image illustrates the data sources for ADLS, and how data is streamed into
usage.
The image in the previous card illustrates how streamed data is managed by using ADLS in three
different layers:
Data generation
Storage
Data processing
Data is generated from various sources such as cloud, local machines, or logs.
Data is stored in the data lake store, where analytic frameworks such as Hadoop, Spark, or Pig are used to analyze the data.
Note: The Azure portal interface changes continuously with updates. The videos may differ from the
actual interface, but the core functionality remains the same.
ADLS Gen 2 is a highly productive storage option, and its pricing depends on the file structure and the redundancy option you choose. Refer to the link for exact pricing details based on your requirements.
Provisioning Data Lake Store Gen 1
The following video shows how to provision Gen 1 data lake storage in the Azure portal:
If you have trouble playing this video, please click here for help.
The above CLI command creates a Data Lake Store Gen 1 account and a folder named mynewfolder at the root of the Data Lake Storage Gen 1 account.
Note: The --folder parameter ensures that the command creates a folder; if it is omitted, the command creates an empty file named mynewfolder at the root by default.
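A hedged sketch of such a command, using the Azure CLI az dls commands (the account, resource group, and location names are placeholders):

# Create a Data Lake Storage Gen 1 account (names and location are placeholders)
az dls account create --account mydatalakestore --resource-group myresourcegroup --location eastus2

# Create a folder named mynewfolder at the root; without --folder, an empty file is created instead
az dls fs create --account mydatalakestore --path /mynewfolder --folder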
Provisioning Data Lake Store Gen 2
1. Log in to the Azure portal, and navigate to the storage account resource.
2. Add storage account and provide the resource group name, storage account
name, location, and performance (standard or premium).
3. Select the account type as Storage V2, which is a Gen 2 type of account.
4. Select the replication based on your requirement, such as LRS, GRS, or RA-
GRS, and proceed to next for advanced options.
5. Enable the hierarchical namespace to organize the objects or files for efficient
data access.
6. Proceed to next, validate, and create the storage account.
To create the data lake store Gen 2 account, select the account kind as shown in the
above image.
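The portal steps above can also be scripted. The following is a hedged sketch using the Azure CLI; the names are placeholders, and the flag for the hierarchical namespace may be spelled --hns or --enable-hierarchical-namespace depending on the CLI version:

# Create a StorageV2 (Gen 2) account with the hierarchical namespace enabled
az storage account create --name mystorageacct --resource-group myresourcegroup --location eastus2 --kind StorageV2 --sku Standard_LRS --hns true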
You can get your data into your data lake store account in two ways:
Through the direct upload method or the AdlCopy tool.
By setting up a pipeline using Data Factory and processing data from various sources.
Note: Setting up a pipeline by using data factory, and copying data into a data lake store from
various sources is explained in the Azure Data Factory course.
The other upload or copying methods are explained in the following cards.
Provide the storage account name, source path, and destination path in the above CLI
command.
Example:
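A hypothetical direct-upload command, assuming the az dls fs upload command and placeholder account and path names, might look like this:

# Upload a local file to a folder in the Data Lake Storage Gen 1 account
az dls fs upload --account mydatalakestore --source-path ./sales.csv --destination-path /mynewfolder/sales.csv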
AdlCopy Tool
ADLS Gen 1 provides a command-line tool, AdlCopy, to copy data from the following sources:
From Azure Storage blobs to a Data Lake Storage Gen 1 account.
Between two Data Lake Storage Gen 1 accounts.
Note: You cannot use AdlCopy to copy data from ADLS Gen 1 to a blob.
The AdlCopy tool must be installed on your machine. To install it, use the link.
AdlCopy syntax:
AdlCopy /Source <Blob or Data Lake Storage Gen1 source> /Dest <Data Lake Storage
Gen1 destination> /SourceKey <Key for Blob account> /Account <Data Lake Analytics
account> /Units <Number of Analytics units> /Pattern
Specify the source, destination instances URL, and the data file that needs to be
copied.
Note: To get the instances URL, navigate to the instance dashboard.
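Based on the syntax above, a hedged example of copying a blob into a Data Lake Storage Gen 1 folder might look like the following; the storage account, container, and file names are placeholders, and the /Account and /Units options are needed only when the copy runs as a Data Lake Analytics job:

AdlCopy /Source https://mystorageacct.blob.core.windows.net/mycontainer/sales.csv /Dest swebhdfs://mydatalakestore.azuredatalakestore.net/mynewfolder/ /SourceKey <storage-account-key>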
Azure Data Lake Analytics
Azure Data Lake Analytics is an on-demand analytics job service that lets you write queries, extract valuable insights from data of any scale, and simplify big data processing.
It can handle jobs of any scale in a cost-effective manner, because you pay for a job only while it is running.
Data Lake Analytics works with ADLS for high performance, throughput, and parallelization. It also works with Azure Storage blobs, Azure SQL Database, and Azure SQL Data Warehouse.
If you have trouble playing this video, please click here for help.
Data explorer is used to browse the above data sources, and to perform basic file
management operations.
1. Log in to the Azure portal, and navigate to the Data Lake Analytics page.
2. Click data sources, and then click add data source.
To add a Data Lake Store account, you need the account name and access to the account in order to query it.
To add Azure Blob storage, you need the storage account name and the account key.
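As an assumed alternative to the portal, the az dla commands can add the same data sources; the account names here are placeholders:

# Add an additional Data Lake Store account as a data source
az dla account data-lake-store add --account mydatalakeanalytics --data-lake-store-account-name myotherdatalakestore

# Add an Azure Blob storage account as a data source (requires the account key)
az dla account blob-storage add --account mydatalakeanalytics --storage-account-name mystorageacct --access-key <storage-account-key>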
Setting Up Firewall Rule
You can restrict access to your Data Lake Analytics account at the network level by setting up firewall rules, allowing access only for trusted clients by specifying a single IP address or defining a range of IP addresses.
Log in to the Azure portal and navigate to your Data Lake Analytics account.
Click OK.
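A hedged CLI equivalent of the portal steps, assuming the az dla account firewall command and a placeholder documentation IP range, might look like this:

# Allow only clients in the 203.0.113.0-203.0.113.255 range (placeholder addresses)
az dla account firewall create --account mydatalakeanalytics --firewall-rule-name AllowTrustedClients --start-ip-address 203.0.113.0 --end-ip-address 203.0.113.255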
U-SQL Overview
The Data Lake Analytics service runs jobs that query data to generate output for analysis. These jobs consist of scripts written in a language called U-SQL.
U-SQL is a query language that extends the familiar, simple, declarative nature of SQL with the expressive power of C#, and it uses the same distributed runtime that powers Microsoft's internal exabyte-scale data lake.
U-SQL Sample Job
The following video explains how to run a job by writing a sample U-SQL script that extracts data from a log file for analysis:
If you have trouble playing this video, please click here for help.
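A minimal U-SQL sketch of such a job, assuming a tab-separated log file at /input/SearchLog.tsv with the columns shown (all names are illustrative):

// Extract rows from a log file stored in the data lake
@searchlog =
    EXTRACT UserId   int,
            Start    DateTime,
            Region   string,
            Query    string,
            Duration int?
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();

// Aggregate the extracted rows
@result =
    SELECT Region, COUNT(*) AS QueryCount
    FROM @searchlog
    GROUP BY Region;

// Write the result set back to the store for analysis
OUTPUT @result
    TO "/output/QueriesByRegion.csv"
    USING Outputters.Csv();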
Extractor Parameters
The built-in extractors support several parameters that you can use to control how data is read. The following are some of the commonly used parameters (a usage sketch follows this list):
delimiter- A char type parameter that specifies the column separator character; the default value is a comma (','). It is used only in Extractors.Text().
rowDelimiter- A string type parameter (maximum length 1) that specifies the row separator in a file; the default values are "\r\n" (carriage return, line feed).
skipFirstNRows- An int type parameter with a default value of 0 that specifies the number of rows to skip in a file.
silent- A boolean type parameter with a default value of false; when set to true, the extractor ignores and skips rows that have a different number of columns than the requested number.
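A hedged usage sketch of these parameters, assuming a semicolon-delimited text file with a header row at /input/products.txt (names and columns are illustrative):

// Skip the header row, split columns on ';', and silently drop malformed rows
@products =
    EXTRACT ProductId int,
            Name      string,
            Price     decimal
    FROM "/input/products.txt"
    USING Extractors.Text(delimiter: ';', skipFirstNRows: 1, silent: true);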
If you have trouble playing this video, please click here for help.
Table: represents a data set that you want to create, such as a table holding certain data.
Views: encapsulate queries that abstract the tables in your database, such as a view consisting of SELECT statements that retrieve data from the underlying tables.
Table-valued functions: encapsulate custom logic that retrieves the desired data set for queries.
Procedures: encapsulate code that performs certain tasks regularly, such as inserting data into tables or other routine operations that are executed repeatedly.
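A hedged sketch of a table and a view in a U-SQL database; the database, schema, and column names are illustrative only:

// Create a database and a managed table with a clustered index and hash distribution
CREATE DATABASE IF NOT EXISTS SalesDb;
USE DATABASE SalesDb;

CREATE TABLE IF NOT EXISTS dbo.Orders
(
    OrderId int,
    Region  string,
    Amount  decimal,
    INDEX idx_Orders CLUSTERED (OrderId ASC) DISTRIBUTED BY HASH (Region)
);

// Create a view that abstracts a query over the table
CREATE VIEW IF NOT EXISTS dbo.HighValueOrders
AS SELECT OrderId, Region, Amount FROM dbo.Orders WHERE Amount > 1000;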
If you have trouble playing this video, please click here for help.
External Tables
Along with managed tables, U-SQL catalogs can also include external tables that reference tables in Azure services such as SQL Data Warehouse, SQL Database, or SQL Server running in an Azure virtual machine.
This is useful when you have to use U-SQL to process data that is stored in an
existing database in Azure.
To create an external table, use the CREATE DATA SOURCE statement to
create a reference to an external database, and then use the CREATE
EXTERNAL TABLE statement to create a reference to a table in that data
source.
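A hedged syntax sketch, assuming an Azure SQL Database source and a credential named MySqlDbCredential that has already been registered in the catalog; all object names are illustrative:

USE DATABASE SalesDb;

// Reference an external Azure SQL Database
CREATE DATA SOURCE IF NOT EXISTS MySqlDbSource
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=CustomerDb;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = SalesDb.MySqlDbCredential,
    REMOTABLE_TYPES = (bool, short, int, long, decimal, float, double, string, DateTime)
);

// Reference a table in that data source
CREATE EXTERNAL TABLE IF NOT EXISTS dbo.Customers
(
    CustomerId int,
    Name       string
)
FROM MySqlDbSource LOCATION "dbo.Customers";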
If you have trouble playing this video, please click here for help.
Procedures: Demo
The following video explains how to create a procedure object in a database that retrieves the required data set:
If you have trouble playing this video, please click here for help.
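A hedged sketch of such a procedure, reusing the illustrative SalesDb objects from the earlier sketch (all names are assumptions):

USE DATABASE SalesDb;

// A procedure that retrieves a data set for a given region and writes it to the store
CREATE PROCEDURE IF NOT EXISTS dbo.GetOrdersByRegion(@Region string = "US")
AS
BEGIN
    OUTPUT
    (
        SELECT OrderId, Region, Amount
        FROM dbo.Orders
        WHERE Region == @Region
    )
    TO "/output/OrdersByRegion.csv"
    USING Outputters.Csv();
END;

It could then be invoked from a job with a statement such as SalesDb.dbo.GetOrdersByRegion("GB");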
Authentication
Authorization
You can implement database roles, permissions, and granular row-level security to keep your databases secure by assigning roles and permissions to users.
Network Isolation
ADLS Gen1 lets you control access to your data store at the network level: you allow access by establishing firewalls and defining an IP address range for trusted clients. With an IP address range, only clients that have an IP address within the defined range can connect to Data Lake Storage Gen1.
Data Protection
Data Lake Storage Gen1 protects data throughout its life cycle. For data in transit, the
industry-standard Transport Layer Security (TLS 1.2) protocol is used to secure data
over the network.
Read- File: can read the contents of a file. Folder: requires Read and Execute permissions to list the contents of the folder.
Write- File: can write or append to a file. Folder: requires Write and Execute permissions to create child items in the folder.
Note: You can set permissions or roles to a file system based on the requirement.
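A hedged example of setting such a permission from the CLI, assuming the az dls fs access set-entry command and a placeholder Azure AD object ID:

# Grant read and execute on a folder to a specific user or group
az dls fs access set-entry --account mydatalakestore --path /mynewfolder --acl-spec user:<azure-ad-object-id>:r-x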
Tuning Data Lake Store
Some of the prominent tuning tasks that enhance the performance of data lake store are:
Data Ingestion
While ingesting data from the source to the data lake store Gen 1, it is important to consider factors
(bottlenecks) such as source hardware, and network connectivity.
Performance Criteria
Database size, concurrency, and response time are important metrics that should guide how the data lake store is fine-tuned.
Once the source hardware and network connectivity bottlenecks are addressed, you can configure the ingestion tools accordingly.
Structuring your dataset- When data is stored in ADLS Gen 1, the file size, folder structure, and number of files have an impact on performance. For better performance, it is recommended to organize data into larger files rather than many small files.
Organizing time series data in folders- For Azure Data Lake Analytics workloads, partition pruning of time-series data (for example, folders organized as /logs/YYYY/MM/DD/) enables some queries to read only a subset of the data, which improves performance.
Log Analytics
Log analytics in Azure plays a prominent role in creating service alerts and controlling the cost of Azure data lake implementations.
Log analytics collects telemetry and other data, which enables its automated alerting capability.
Implementing log analytics in Azure does not require any configuration, since it
is already integrated with other Azure services.
To enable log analytics, create a workspace and collect all the metrics and data
that are being emitted from various activities.
While implementing log analytics, ensure that the agents are installed on
virtual machines.
Log analytics query language is a simple and interactive query provided by Microsoft to facilitate log
searches.
It is used to identify valuable insights from the data by querying, combining, aggregating, joining, and
performing other tasks on your data in log analytics.
The Log Analytics query language enables you to specify conditions, implement joins, and facilitate smart analytics.
Example: Building a single collaborative log analytic visualization that provides various analytical
outcomes in a single dashboard, to help administrators monitor and define strategy.
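As an illustrative sketch only (the table and column names are assumptions about the diagnostics schema), a Log Analytics query over data lake diagnostics might look like this:

// Hourly request counts per operation for Data Lake Store resources
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DATALAKESTORE"
| summarize RequestCount = count() by OperationName, bin(TimeGenerated, 1h)
| render timechart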
In this course, you have explored:
Data lake store (Gen 1 and Gen 2) provisioning, and ingesting data into the store.
Data analysis by using U-SQL, and retrieving the desired data sets.
Hands-on scenario
You are a Cloud Engineer who has recently joined a Big Data project. Your team is looking for a cloud environment that can simplify the big data analysis process and extract valuable insights. You need to give your Team Lead a demo of the efficiency and features of Azure Data Lake Analytics.
i) Create an Azure Data Lake Storage Gen 1 account: Location: East US 2. Upload the data files to be analyzed.
ii) Create Data Lake Analytics: Location: East US 2; Azure Data Lake Storage Gen 1 account: the Azure Data Lake Storage Gen 1 account created in the previous step.
iii) Create a new job, write a U-SQL query to extract the data for analysis, and submit the query to save the output file.
iv) Create a new job to view the database created in the previous step.
Notes: Use the credentials given in the hands-on to log in to the Azure portal. Create a new resource group and use the same resource group for all resources. The username, password, and service names can be as per your choice. After completing the hands-on, delete all the resources created.