
Azure Cloud Services

Welcome to the Azure Data Lake Storage course.


We hope you have a basic understanding of Azure Cloud Services. If not, go through
the Azure Data Factory and Azure Storage courses before proceeding.

This course introduces you to the fundamental concepts of data lake store and analytics,
creating data lake store instances, ingesting data, applying analytics and securing data.

Happy Learning!

Azure Data Lake Store

Azure Data Lake Store is a highly scalable, distributed, parallel file system that is
designed to work with various analytic frameworks and is capable of storing varied
data from several sources.
Azure Data Lake Store

Microsoft describes Azure Data Lake Store (ADLS) as a hyperscale repository for big data analytics
workloads that stores data in its native format. ADLS is a Hadoop File System compatible with
Hadoop Distributed File System (HDFS) that works with the Hadoop ecosystem.

ADLS:

Provides unlimited data storage in different forms.

Is built for running large scale analytic workloads optimally.

Allows storing of relational and non-relational data, and the data schema need not be defined
before data is loaded.

Keeps three copies of each data asset to provide high availability.

Azure Data Lake - Architecture Components


The ADLS architecture constitutes three components:

- Analytics Service
- HDInsight
- Diversified Storage

Azure Data Lake - Architecture Components

Analytics Service - to build various data analytic job services and execute them in parallel.

HDInsight - to manage clusters after large volumes of data are ingested, by extending open-source
frameworks such as Hadoop, Spark, Pig, Hive, and so on.

Diversified Storage - to store diversified data such as structured, unstructured, and semi-structured
data from diverse data sources.

Azure Data Lake Storage Working

The image above illustrates the ingestion of raw data, the preparation and modeling of
data, and the processing of data analytics jobs.
Azure Data Lake Storage Gen 1

Data Lake Storage Gen 1 is an Apache Hadoop file system compatible with HDFS
that works with the Hadoop ecosystem. Existing HDInsight applications or services
that use the WebHDFS API can easily integrate with Data Lake Storage Gen 1.

The working of a Gen 1 data lake store is illustrated in the above image.

Key Features of Data Lake Store Gen 1


- Built for Hadoop: Data stored in ADLS Gen 1 can be easily analyzed by using
Hadoop analytic frameworks such as Pig, Hive, and MapReduce. Azure
HDInsight clusters can be provisioned and configured to directly access the
data stored in ADLS Gen 1.
- Unlimited storage: ADLS Gen 1 does not impose any limit on file sizes or on the
amount of data that can be stored in a data lake. Files can range from kilobytes to
petabytes in size, making it a preferred choice for storing any amount of data.
- Performance-tuned for big data analytics: A data lake spreads parts of a file
over multiple individual storage servers. Reading files in parallel therefore
improves read throughput when performing data analytics.

Key Features of Data Lake Store Gen 1


- Enterprise-ready, highly available, and secure: Data assets are stored durably
by making extra copies to guard against unexpected failures. Enterprises
can use ADLS Gen 1 in their solutions as an important part of their existing
data platform.
- All data: ADLS Gen 1 can store any type of data in its native format without
requiring prior transformations, and it does not perform any special handling of
data based on its type.

Azure Data Lake Store Gen 2


ADLS Gen 2 is built on top of Azure Blob storage and is dedicated to big data analytics;
it is the result of converging the capabilities of Azure Blob storage and ADLS Gen 1.
ADLS Gen 2 is specifically designed for enterprise big data analytics and enables
managing massive amounts of data. A fundamental feature of ADLS Gen 2 is that
it adds a hierarchical namespace to Blob storage and organizes objects/files
into a hierarchy of directories for efficient data access.
ADLS Gen 2 addresses drawbacks in areas such as performance, management,
security, and cost effectiveness that were compromised in earlier cloud-based
analytics stores.
Key Features of ADLS Gen 2

Hadoop-compatible access: The new Azure Blob File System (ABFS) driver is enabled within all
Apache Hadoop environments, including Azure Databricks, Azure HDInsight, and SQL Data
Warehouse, to access data stored in ADLS Gen2.

A superset of POSIX permissions: The security model for ADLS Gen2 supports ACL and POSIX
permissions, along with extra granularity specific to ADLS Gen2.

Cost effective: ADLS Gen2 offers low-cost storage capacity and transactions as data transitions
through its entire life cycle.
Comparing Data Lake Store and Blob Storage

Purpose
Data Lake Store: Optimized, dedicated storage for big data analytics workloads.
Blob Storage: General-purpose object store for a variety of storage scenarios.

Use cases
Data Lake Store: Streaming analytics and machine learning data, such as log files, IoT data, massive datasets, and click streams.
Blob Storage: Any type of text or binary data, such as application backends, backup data, media storage for streaming, and general-purpose data.

File system
Data Lake Store: ADLS accounts contain folders, which in turn contain data stored as files.
Blob Storage: Storage accounts have containers, which in turn hold data in the form of blobs.

Data operations - authentication
Data Lake Store: Based on Azure Active Directory identities.
Blob Storage: Based on shared secrets - account access keys and shared access signature keys.

Size limits
Data Lake Store: No limits on account sizes, file sizes, or number of files.
Blob Storage: Has certain limits on the number of accounts and on storage capacity. Refer to the Azure documentation.
Streamed Data Management

The above image illustrates the data sources for ADLS, and how streamed data flows
into use.

Streamed Data Management

The image in the previous card illustrates how streamed data is managed by using ADLS in three
different layers:

Data generation
Storage

Data processing

Data is generated from various sources such as the cloud, local machines, or logs.

Data is stored in the data lake store, where analytic frameworks such as Hadoop, Spark, or Pig are
used to analyze it.

After data analysis, the data is processed for use.

Schema

This section describes the following:

Pricing the data lake store

Provisioning the data lake store

Deploying the data lake store by using various management tools

Ingesting data into the store

Moving data between ADLS and other sources

Moving data by using AdlCopy

Note: The Azure portal interface changes continuously with updates. The videos may differ from the
actual interface, but the core functionality remains the same.

Data Lake Store Pricing


The cost of ADLS Gen 1 depends on how much you store and on the size and volume
of transactions and outbound data transfers.
Azure offers Gen 1 storage pricing in two ways: pay-as-you-go and monthly
commitment packages. Refer to the documented Azure prices for Gen 1 storage based on
your requirement.

ADLS Gen 2 is the most cost-effective storage option, and its pricing depends on the file
structure and the redundancy you choose. Refer to the Azure pricing documentation for exact
pricing details based on your requirement.
Provisioning Data Lake Store Gen 1
The following video shows how to provision Gen 1 data lake storage in the Azure portal:

If you have trouble playing this video, please click here for help.

No transcript is available for this video.

Creating Azure Data Lake Store Gen 1 by using PowerShell
A data lake storage instance can be created by using Azure PowerShell.
The PowerShell command to create a data lake store account is:

New-AzureRmDataLakeStoreAccount -ResourceGroupName $resourcegroupname -Name $dlsname -Location "East US" -DisableEncryption

To access PowerShell in the portal,

Select the Cloud Shell icon (>_) from the top menu on the Azure portal home page.
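
The command above references variables that must be defined first. A minimal PowerShell sketch, assuming the AzureRM.DataLakeStore module is installed and the resource group already exists (all names are placeholders):

# Sign in and define placeholder names (hypothetical values).
Connect-AzureRmAccount
$resourcegroupname = "myResourceGroup"
$dlsname = "mydatalakestoregen1"

# Create the Data Lake Store Gen 1 account; omit -DisableEncryption to keep encryption enabled.
New-AzureRmDataLakeStoreAccount -ResourceGroupName $resourcegroupname -Name $dlsname -Location "East US"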

Creating ADLS Gen 1 by using CLI


Azure CLI is one of the options with which you can manage a data lake store.
The following sample CLI command creates a folder in a data lake storage Gen 1
account:

az dls fs create --account $account_name --path /mynewfolder --folder

The above CLI command creates a folder named mynewfolder at the root of the data
lake storage Gen 1 account. A sketch for creating the account itself follows at the end of this card.
Note: The --folder parameter ensures that the command creates a folder; without it, the
command creates an empty file named mynewfolder at the root by default.
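
To create the Gen 1 account itself from the CLI, a minimal sketch (the resource group name is a placeholder, and exact parameter names may vary by CLI version):

# Create a Data Lake Storage Gen 1 account in an existing resource group (hypothetical names).
az dls account create --account $account_name --resource-group myResourceGroup --location eastus2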

Data Lake Store Gen 2 Creation


To create a Data Lake Storage Gen 2 (Storage V2) account:

1. Log in to the Azure portal, and navigate to the storage account resource.
2. Add a storage account and provide the resource group name, storage account
name, location, and performance tier (standard or premium).
3. Select the account type as Storage V2, which is a Gen 2 type of account.
4. Select the replication based on your requirement, such as LRS, GRS, or RA-
GRS, and proceed to Next for advanced options.
5. Enable the hierarchical namespace to organize the objects or files for efficient
data access.
6. Proceed to Next, validate, and create the storage account. A CLI sketch of the same steps follows.
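
The same account can also be created from the Azure CLI. A minimal sketch, assuming the resource group already exists (names are placeholders; --hns enables the hierarchical namespace and may appear as --enable-hierarchical-namespace in newer CLI versions):

# Create a StorageV2 account with the hierarchical namespace enabled (hypothetical names).
az storage account create --name mydatalakegen2 --resource-group myResourceGroup --location eastus2 --sku Standard_LRS --kind StorageV2 --hns true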

Azure Data Lake Storage Gen 2

To create the data lake store Gen 2 account, select the account kind as shown in the
above image.

Getting Data into Data Lake Store

You can get data into your data lake store account in two ways:

Through the direct upload method or AdlCopy.

By setting up a pipeline using Data Factory, and processing data from various sources.

Note: Setting up a pipeline by using Data Factory, and copying data into a data lake store from
various sources, is explained in the Azure Data Factory course.

The other upload and copying methods are explained in the following cards.

Copying Data into ADLS


After successful creation of the data lake storage account (Gen 1 or Gen 2), navigate
to the resource page.
To ingest data into the storage account through an offline copy (directly uploading data
from the portal):

1. Navigate to the storage account instance page.
2. Select Data explorer. The storage account file explorer page appears.
3. Choose the Upload option from the menu, and select the source files
you want to store in your account.
4. Upload the files by selecting Add selected files.

Data Upload by using Azure CLI


The CLI command to upload data into the data lake storage account is:

az dls fs upload --account $account_name --source-path "/path" --destination-path "/path"

Provide the storage account name, source path, and destination path in the above CLI
command.
Example:

az dls fs upload --account mydatalakestoragegen1 --source-path "C:\SampleData\AmbulanceData\vehicle1_09142014.csv" --destination-path "/mynewfolder/vehicle1_09142014.csv"

AdlCopy Tool
ADLS Gen 1 provides a command-line tool, AdlCopy, to copy data from the following
sources:

- From Azure Storage blobs to a Data Lake Storage Gen 1 account.
Note: You cannot use AdlCopy to copy data from ADLS Gen 1 to blobs.
- Between two Data Lake Storage Gen 1 accounts.

You must have the AdlCopy tool installed on your machine; refer to the Azure documentation for the download link.
AdlCopy syntax:

AdlCopy /Source <Blob or Data Lake Storage Gen1 source> /Dest <Data Lake Storage
Gen1 destination> /SourceKey <Key for Blob account> /Account <Data Lake Analytics
account> /Units <Number of Analytics units> /Pattern

Moving Data by using AdlCopy


Let's assume that data is being copied between two data lake stores named Adls1 and
Adls2, where the source is Adls1 and destination is Adls2.
The following example command performs the copy activity:

AdlCopy /Source adl://adls1.azuredatalakestore.net/testfolder/sampledata.csv /Dest adl://adls2.azuredatalakestore.net/testfolder

Specify the source and destination instance URLs, and the data file that needs to be
copied.
Note: To get the instances URL, navigate to the instance dashboard.
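
To copy from Azure Storage blobs into a Gen 1 account, a hedged example following the syntax above (the storage account, container, file, and key are placeholders):

AdlCopy /Source https://mystorageaccount.blob.core.windows.net/mycontainer/sampledata.csv /Dest adl://adls1.azuredatalakestore.net/testfolder/ /SourceKey <storage-account-key>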
Azure Data Lake Analytics

Azure Data Lake Analytics is an analytics job service that lets you write queries to extract valuable
insights from data of any scale, simplifying big data processing.

It can handle jobs of any scale in a cost-effective manner, because you pay for a job only while it is
running.

Data Lake Analytics works with ADLS for high performance, throughput, and parallelization. It also
works with Azure Storage blobs, Azure SQL Database, and Azure SQL Data Warehouse.

Provisioning Azure Data Lake Analytics


The following video explains how to create a data lake analytics instance, along with a
data lake store in the Azure portal:

If you have trouble playing this video, please click here for help.

No transcript is available for this video.

Manage Data Sources in Data Lake Analytics


Data Lake Analytics supports two data sources:

- Data Lake Store
- Azure Storage

Data explorer is used to browse the above data sources, and to perform basic file
management operations.

To add either of the above data sources:

1. Log in to the Azure portal, and navigate to the Data Lake Analytics page.
2. Click Data sources, and then click Add data source.

- To add a Data Lake Store account, you need the account name and access to
the account in order to query it.
- To add Azure Blob storage, you need the storage account name and the account key. CLI equivalents are sketched below.
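
A minimal CLI sketch for linking both kinds of data sources to an existing Data Lake Analytics account (all names and keys are placeholders; the command names are assumptions based on the az dla group and may vary by CLI version):

# Link an additional Data Lake Store account (hypothetical names).
az dla account data-lake-store add --account myadlaaccount --data-lake-store-account-name myotheradls

# Link an Azure Blob storage account by name and key.
az dla account blob-storage add --account myadlaaccount --storage-account-name mystorageaccount --access-key <storage-account-key>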
Setting Up Firewall Rule

You can restrict access to trusted clients only by setting up firewall rules that cut off access to your
Data Lake Analytics account at the network level, specifying a single IP address or a range of IP
addresses.

To set up a firewall rule:

Log in to the Azure portal and navigate to your Data Lake Analytics account.

On the left menu, choose Firewall.

Provide the values for the fields by specifying the IP addresses.

Click OK. A CLI sketch follows.
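
The same rule can be created from the CLI. A minimal sketch (account name, rule name, and IP addresses are placeholders; the command name is an assumption and may vary by CLI version):

# Allow a trusted IP range to reach the Data Lake Analytics account (hypothetical values).
az dla account firewall create --account myadlaaccount --firewall-rule-name AllowTrustedClients --start-ip-address 203.0.113.0 --end-ip-address 203.0.113.255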

U-SQL Overview

Data lake analytics service runs jobs that query the data to generate an output for analysis, where
these jobs consist of scripts written in a language called U-SQL.

U-SQL is a query language that extends the familiar, simple, declarative nature of SQL with the
expressive power of C#, and uses the same distributed runtime that powers Microsoft's
internal exabyte-scale data lake.
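
To give a flavor of the language, here is a minimal sketch of a U-SQL script (the file paths and schema are hypothetical) that reads a tab-separated log file, aggregates it, and writes the result back to the store:

// Read a TSV log file into a rowset (hypothetical path and schema).
@searchlog =
    EXTRACT UserId int,
            Region string,
            Duration int
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Aggregate total duration per region.
@result =
    SELECT Region, SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region;

// Write the aggregated rowset to a CSV file.
OUTPUT @result
    TO "/output/SearchLogSummary.csv"
    USING Outputters.Csv();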
U-SQL Sample Job
The following video explains how to run a job that extracts data from a log file, by writing a
sample U-SQL script for analysis:


If you have trouble playing this video, please click here for help.

No transcript is available for this video.

Built-in Extractors in U-SQL


Extractors are used to extract data from common types of data sources. U-SQL
includes the following built-in extractors:

- Extractors.Text - an extractor for generic text file data sources.
- Extractors.Csv - a special version of the Extractors.Text extractor specifically for comma-delimited data.
- Extractors.Tsv - a special version of the Extractors.Text extractor specifically for tab-delimited data.

Extractor Parameters
The built-in extractors support several parameters that you can use to control how data is
read. The following are some of the commonly used parameters:

- delimiter - a char parameter that specifies the column separator character; its default value is a comma (','). It is only used in Extractors.Text().
- rowDelimiter - a string parameter with a maximum length of 1 that specifies the row separator in a file; its default values are "\r\n" (carriage return, line feed).
- skipFirstNRows - an int parameter with a default value of 0 that specifies the number of rows to skip in a file.
- silent - a boolean parameter with a default value of false that, when true, makes the extractor ignore and skip rows that have a different number of columns than the requested number.

The sketch below shows these parameters in use.
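
A minimal sketch (hypothetical file path and schema) that applies two of these parameters:

// Skip the header row and silently drop rows with the wrong number of columns.
@vehicles =
    EXTRACT VehicleId int,
            Speed int
    FROM "/mynewfolder/vehicle1_09142014.csv"
    USING Extractors.Csv(skipFirstNRows: 1, silent: true);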

Built-in Outputters in U-SQL


U-SQL provides a built-in outputter class called Outputters. It provides the following
built-in outputters to transform a rowset into a file or set of files:

- Outputters.Text() - outputs a rowset into a variety of delimited text formats.
- Outputters.Csv() - outputs a rowset into a comma-separated value (CSV) file of different encodings.
- Outputters.Tsv() - outputs a rowset into a tab-separated value (TSV) file of different encodings.
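
A minimal sketch that writes a rowset produced earlier in the script (the @result rowset and the outputHeader option are assumptions):

// Write @result as CSV with a header row.
OUTPUT @result
    TO "/output/summary.csv"
    USING Outputters.Csv(outputHeader: true);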

Aggregating Data With U-SQL: Demo


The following video explains how to execute a job that aggregates data (columns) from
multiple log files:

If you have trouble playing this video, please click here for help.

No transcript is available for this video.

U-SQL Catalog Overview


Azure Data Lake Analytics allows you to create a catalog of U-SQL objects that are
stored in databases within the data lake store.
The following are some of the objects you can create in any database (a sketch of each follows this list):

- Table: represents a data set that you want to create, such as a table holding certain data.
- View: encapsulates queries that abstract tables in your database, such as a view consisting of SELECT statements that retrieve data from the underlying tables.
- Table-valued function: contains custom logic to retrieve the desired data set for queries.
- Procedure: encapsulates code that performs tasks that are executed repeatedly, such as inserting data into tables or other regular operations.
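
Minimal U-SQL sketches of these catalog objects (the database name, schema, file path, and column names are all hypothetical):

CREATE DATABASE IF NOT EXISTS SampleDB;

// A managed table with the required clustered index and distribution.
CREATE TABLE IF NOT EXISTS SampleDB.dbo.VehicleTable
(
    VehicleId int,
    Speed int,
    INDEX idx1 CLUSTERED(VehicleId ASC) DISTRIBUTED BY HASH(VehicleId)
);

// A view that abstracts a file stored in the data lake.
CREATE VIEW IF NOT EXISTS SampleDB.dbo.VehicleView AS
    EXTRACT VehicleId int, Speed int
    FROM "/mynewfolder/vehicle1_09142014.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// A table-valued function that filters the view.
CREATE FUNCTION IF NOT EXISTS SampleDB.dbo.FastVehicles()
RETURNS @result TABLE(VehicleId int, Speed int)
AS BEGIN
    @result = SELECT VehicleId, Speed FROM SampleDB.dbo.VehicleView WHERE Speed > 80;
END;

// A procedure that loads the function's output into the table.
CREATE PROCEDURE IF NOT EXISTS SampleDB.dbo.LoadFastVehicles()
AS BEGIN
    INSERT INTO SampleDB.dbo.VehicleTable
    SELECT VehicleId, Speed FROM SampleDB.dbo.FastVehicles() AS f;
END;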

Creating a View: Demo


The following video explains how to create a sample view, and retrieve a data set by
writing a query:

If you have trouble playing this video, please click here for help.

No transcript is available for this video.

External Tables
- Along with managed tables, U-SQL catalogs can also include external tables that reference tables in Azure instances such as SQL Data Warehouse, SQL Database, or SQL Server in Azure virtual machines.
- This is useful when you have to use U-SQL to process data that is stored in an existing database in Azure.
- To create an external table, use the CREATE DATA SOURCE statement to create a reference to an external database, and then use the CREATE EXTERNAL TABLE statement to create a reference to a table in that data source, as sketched below.
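
A minimal sketch of this pattern (the data source, credential, connection string, and remote table names are hypothetical, and the exact options may differ; consult the U-SQL language reference):

USE DATABASE SampleDB;

// Reference an external Azure SQL Database (assumes a catalog credential SampleDB.MyCredential already exists).
CREATE DATA SOURCE IF NOT EXISTS MyAzureSqlDb
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=CustomerDb;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = SampleDB.MyCredential
);

// Reference a table in that database as an external table.
CREATE EXTERNAL TABLE IF NOT EXISTS dbo.ExternalCustomers
(
    CustomerId int,
    Name string
)
FROM MyAzureSqlDb LOCATION "dbo.Customers";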

Table Value Function: Demo


The following video explains how to create a table-value function type of object for the
database:

If you have trouble playing this video, please click here for help.

No transcript is available for this video.

Procedures: Demo
The following video explains how to create a procedure-type of object for a database that
retrieves the data set required:

If you have trouble playing this video, please click here for help.

No transcript is available for this video.

Security Mechanism for Data Lake Store


Security mechanisms that you can implement to secure a data lake store include:

- Authentication

Azure provides multi-factor authentication, which ensures an additional layer of security
for sign-ins and transactions, and allows authentication to be set up with sign-in codes
received through SMS.
It also facilitates authentication from any client by using standard open protocols
such as OAuth or OpenID.

- Authorization

Authorization is implemented through Role-Based Access Control (RBAC), a built-in
feature of Microsoft Azure that facilitates account management.
POSIX ACLs are implemented to control access to data in the data lake store.
You can apply granular levels of authorization for data lake resources by assigning
user roles and security groups by using Access Control (IAM).

Security Mechanism for Data Lake Store


- Database roles and permissions

You can implement database roles, permissions, and granular row-level security to
ensure that databases are secure, by adding roles and permissions for users.

- Network isolation

ADLS Gen 1 lets you control access to your data store at the network level by
establishing firewalls and defining an IP address range for trusted clients. With an IP
address range, only clients that have an IP address within the defined range can
connect to Data Lake Storage Gen 1.

- Data protection

Data Lake Storage Gen 1 protects data throughout its life cycle. For data in transit, the
industry-standard Transport Layer Security (TLS 1.2) protocol is used to secure data
over the network.

Securing Data Lake Store


ADLS Gen 1 implements a default access control model that derives its permissions
from POSIX-style ACLs applied to the files and folders stored in the data lake store.
The permissions that can be used on files and folders are Read, Write, and Execute:

Read
File: Can read the contents of a file.
Folder: Requires Read and Execute permissions to list the contents of the folder.

Write
File: Can write or append to a file.
Folder: Requires Write and Execute permissions to create child items in a folder.

Execute
File: Does not mean anything in the context of Data Lake Storage Gen 1.
Folder: Required to traverse the child items of a folder.

Note: You can set permissions or roles on files and folders based on the requirement; a CLI sketch follows.
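
A minimal CLI sketch for granting a user read and execute access on a folder (the account, path, and Azure AD object ID are placeholders; the command is an assumption based on the az dls fs access group and may vary by CLI version):

# Add an ACL entry for a user on a folder (hypothetical values).
az dls fs access set-entry --account mydatalakestoragegen1 --path /mynewfolder --acl-spec user:aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:r-x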
Tuning Data Lake Store

Some of the prominent tuning tasks that enhance the performance of data lake store are:

Data Ingestion

While ingesting data from the source to the data lake store Gen 1, it is important to consider factors
(bottlenecks) such as source hardware and network connectivity.

Performance Criteria

Database size, concurrency, and response time are important metrics on which the fine-tuning of the
data lake store depends.

Tuning Data Lake Store

Configure data ingestion tools for maximum parallelization:

Once the source hardware and network connectivity bottlenecks are addressed, configure your
ingestion tools for maximum parallelization. Further considerations include:

Structuring your dataset - When data is stored in ADLS Gen 1, the file size, folder structure, and
number of files have an impact on performance. For better performance, it is recommended to
organize data into larger files rather than many small files.

Organizing time-series data in folders - For Azure Data Lake Analytics workloads, partition pruning of
time-series data enables some queries to read only a subset of the data, which improves performance.

Log Analytics
Log Analytics in Azure plays a prominent role in creating service alerts and controlling
the cost of Azure data lake implementations.

- Log Analytics collects telemetry and other data, which enables its automated alerting capability.
- Implementing Log Analytics in Azure does not require any additional configuration, since it is already integrated with other Azure services.
- To enable Log Analytics, create a workspace and collect all the metrics and data that are emitted from various activities.
- While implementing Log Analytics, ensure that the agents are installed on the virtual machines.

Log Analytics Query Language

The Log Analytics query language is a simple and interactive query language provided by Microsoft to
facilitate log searches.

It is used to identify valuable insights from the data by querying, combining, aggregating, joining, and
performing other tasks on your data in Log Analytics.

The Log Analytics query language enables you to specify conditions, implement joins, and facilitate
smart analytics.

Example: Building a single collaborative log analytics visualization that provides various analytical
outcomes in a single dashboard, to help administrators monitor and define strategy. A sketch of a simple query follows.
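
A minimal sketch of a query in this language (the table and column names are assumptions that depend on the diagnostics you have configured):

// Count Data Lake Store diagnostic operations per hour over the last day (hypothetical table/columns).
AzureDiagnostics
| where TimeGenerated > ago(1d)
| where ResourceProvider == "MICROSOFT.DATALAKESTORE"
| summarize OperationCount = count() by OperationName, bin(TimeGenerated, 1h)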

Implementing Log Analytics


After signing in to the Azure portal, create a Log Analytics workspace and use it to
analyze the logs.
To create the Log Analytics workspace:
1. In the Azure portal, click All services and search for Log Analytics.
2. Select Log Analytics Workspaces, and click Add.
3. Enter values in the name, subscription, resource group, location, and pricing
tier fields.
4. Click OK.
You can then run log searches to analyze data, or configure the collection of monitoring
telemetry data. An equivalent CLI sketch follows.
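
An equivalent CLI sketch (resource group and workspace names are placeholders; the command is an assumption based on the az monitor log-analytics group and may vary by CLI version):

# Create a Log Analytics workspace (hypothetical names).
az monitor log-analytics workspace create --resource-group myResourceGroup --workspace-name myLogWorkspace --location eastus2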
Summary

The concepts covered in this course were:


Data lake store and its capabilities.

Data lake store (Gen 1 and Gen 2) provisioning, and ingesting data into the store.

Data lake analytics and managing it.

Data analysis by using U-SQL, and retrieving the desired data sets.

Securing and monitoring data in a store.

Hands-on scenario
You are a Cloud Engineer who has recently joined a big data project. Your team is
looking for a cloud environment that can simplify the big data analysis process and
extract valuable insights. You need to give your Team Lead a demo of the efficiency
and features of Azure Data Lake Analytics.

i) Create an Azure Data Lake Storage Gen 1 account: Location: East US 2. Upload the data files to be analyzed.
ii) Create Data Lake Analytics: Location: East US 2, Azure Data Lake Storage Gen 1 account: the account created in the previous step.
iii) Create a new job, write a U-SQL query to extract the data for analysis, and submit the query to save the output file.
iv) Create a new job to view the database created in the previous step.

Notes: Use the credentials given in the hands-on to log in to the Azure portal. Create a
new resource group and use the same resource group for all resources. The
username/password/service names can be of your choice. After completing the
hands-on, delete all the resources created.
