Data Lake Azure

This document provides information and instructions for getting started with Azure Data Lake Analytics using the Azure portal. It describes how to create an Azure Data Lake Analytics account, define a sample U-SQL script, and submit a job to process the script. The sample script defines a small dataset and writes it to the default storage as a CSV file. The steps outlined allow a user to get up and running with Azure Data Lake Analytics through the Azure portal in just a few minutes.

Contents

Data Lake Analytics documentation


Overview
What is Data Lake Analytics?
What's new in Data Lake Analytics?
Get started
Azure portal
Visual Studio
Visual Studio Code
Azure PowerShell
Azure CLI
Concepts
Security
Security controls by Azure Policy
Security baseline for Azure Security Benchmark
How-to guides
Manage Data Lake Analytics
Azure portal
Command line
Azure CLI
Azure PowerShell
SDKs
.NET
Python
Java
Node.js
Add users
Account policies
Secure job folders
Access diagnostic logs
Adjust quota limits
Plan disaster recovery
Develop U-SQL programs
U-SQL language
Basics
Language reference
Catalog
User-defined operators
Python extensions
R extensions
Cognitive extensions
Visual Studio
Install
Local run
Local debug
Develop U-SQL databases
Browse and view jobs
Debug custom C# code
Troubleshoot recurring jobs
Vertex execution details
Export U-SQL database
Analyze website logs
Resolve data-skew
Monitor and troubleshoot jobs
Visual Studio Code
Authoring
Custom code
Accessing resources
Local run & debug
Programmability guide
Programmability guide overview
Programmability guide - UDT and UDAGG
Programmability guide - UDO
User-defined objects overview
Use user-defined extractor
Use user-defined outputter
Use user-defined processor
Use user-defined applier
Use user-defined combiner
Use user-defined reducer
Schedule U-SQL jobs
Schedule jobs using SSIS
Continuous integration and continuous deployment
Overview
Assembly registration
Set up tests
U-SQL SDK
Understand Apache Spark for U-SQL developers
Understand Apache Spark for U-SQL developers
Understand Apache Spark data formats
Understand Apache Spark code concepts
Troubleshoot U-SQL jobs
Troubleshoot U-SQL runtime failures and known issues
Troubleshoot a .NET upgrade
Migrate to Spark on Azure Synapse Analytics
Reference
Azure PowerShell
.NET
Node.js
Python
REST
Resource Manager template
Azure CLI
Azure Policy built-ins
Resources
Azure Data Lake Blog
Azure Roadmap
Request changes
Microsoft Q&A question page
Pricing
Pricing calculator
Service updates
Stack Overflow
Videos
Code samples
What is Azure Data Lake Analytics?
12/10/2021 • 2 minutes to read

Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data. Instead of deploying,
configuring, and tuning hardware, you write queries to transform your data and extract valuable insights. The
analytics service can handle jobs of any scale instantly by setting the dial for how much power you need. You
only pay for your job when it is running, making it cost-effective.

Azure Data Lake Analytics recent update information


The Azure Data Lake Analytics service is updated aperiodically for specific purposes. We continue to provide
support for this service with component updates, component beta previews, and so on.
For general information about recent updates, see What's new in Data Lake Analytics?.
For details about each update, see the Azure Data Lake Analytics release notes.

Dynamic scaling
Data Lake Analytics dynamically provisions resources and lets you do analytics on terabytes to petabytes of
data. You pay only for the processing power used. As you increase or decrease the size of data stored or the
amount of compute resources used, you don’t have to rewrite code.

Develop faster, debug, and optimize smarter using familiar tools


Data Lake Analytics integrates deeply with Visual Studio. You can use familiar tools to run, debug, and tune your
code. Visualizations of your U-SQL jobs let you see how your code runs at scale, so you can easily identify
performance bottlenecks and optimize costs.

U-SQL: simple and familiar, powerful, and extensible


Data Lake Analytics includes U-SQL, a query language that extends the familiar, simple, declarative nature of
SQL with the expressive power of C#. The U-SQL language uses the same distributed runtime that powers
Microsoft's internal exabyte-scale data lake. SQL and .NET developers can now process and analyze their data
with the skills they already have.
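
As a flavor of that combination, the following sketch (not one of this article's samples) uses SQL-style rowset
operations together with ordinary C# expressions on the columns; the rowset, column names, and output path are
illustrative assumptions.

// Illustrative only: a SQL-like SELECT whose expressions are plain C#.
@orders =
    SELECT * FROM
        (VALUES
            ("Contoso", 1500.0),
            ("Woodgrove", 2700.0)
        ) AS D(customer, amount);

@enriched =
    SELECT customer.ToUpper() AS customerUpper,  // C# string method on a string column
           amount * 1.1 AS amountWithTax         // C# arithmetic expression
    FROM @orders;

OUTPUT @enriched
    TO "/output/enriched.csv"
    USING Outputters.Csv();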

Integrates seamlessly with your IT investments


Data Lake Analytics uses your existing IT investments for identity, management, and security. This approach
simplifies data governance and makes it easy to extend your current data applications. Data Lake Analytics is
integrated with Active Directory for user management and permissions and comes with built-in monitoring and
auditing.

Affordable and cost effective


Data Lake Analytics is a cost-effective solution for running big data workloads. You pay on a per-job basis when
data is processed. No hardware, licenses, or service-specific support agreements are required. The system
automatically scales up or down as the job starts and completes, so you never pay for more than what you need.
Learn more about controlling costs and saving money.
Works with all your Azure data
Data Lake Analytics works with Azure Data Lake Storage Gen1 for the highest performance, throughput, and
parallelization, and it also works with Azure Storage blobs, Azure SQL Database, and Azure Synapse Analytics.

NOTE
Data Lake Analytics doesn't work with Azure Data Lake Storage Gen2 until further notice.

In-region data residency


Data Lake Analytics does not move or store customer data out of the region in which it is deployed.

Next steps
For recent updates, see What's new in Azure Data Lake Analytics?
Get Started with Data Lake Analytics using Azure portal | Azure PowerShell | CLI
Manage Azure Data Lake Analytics using Azure portal | Azure PowerShell | CLI | Azure .NET SDK | Node.js
How to control costs and save money with Data Lake Analytics
What's new in Data Lake Analytics?
12/10/2021 • 2 minutes to read

Azure Data Lake Analytics is updated aperiodically for certain components. To keep you up to date with the
most recent updates, this article provides information about:
The notification of key component beta previews
Important component version information, for example, the list of available component versions and the
current default version.

Notification of key component beta preview


No key component beta version available for preview.

U-SQL runtime
The Azure Data Lake U-SQL runtime, including the compiler, optimizer, and job manager, is what processes your
U-SQL code.
When you submit an Azure Data Lake Analytics job from any tool, your job uses the currently available
default runtime in the production environment.
The runtime version is updated aperiodically, and the previous runtime is kept available for some
time. When a new beta version is ready for preview, it will also be available there.
Caution

Choosing a runtime that is different from the default has the potential to break your U-SQL jobs. It is highly
recommended that you use non-default versions for testing only, not for production.
A non-default runtime version has a fixed lifecycle and expires automatically.
The following version is the current default runtime version.
release_20200707_scope_2b8d563_usql
To learn how to troubleshoot U-SQL runtime failures, see Troubleshoot U-SQL runtime failures.

.NET Framework
Azure Data Lake Analytics now uses the .NET Framework v4.7.2.
If your Azure Data Lake Analytics U-SQL script code uses custom assemblies, and those custom assemblies use
.NET libraries, validate your code to check for any breaking changes.
To learn how to troubleshoot a .NET upgrade, see Troubleshoot a .NET upgrade.

Release note
For recent update details, refer to the Azure Data Lake Analytics release note.

Next steps
Get Started with Data Lake Analytics using Azure portal | Azure PowerShell | CLI
Get started with Azure Data Lake Analytics using
the Azure portal
12/10/2021 • 2 minutes to read

This article describes how to use the Azure portal to create Azure Data Lake Analytics accounts, define jobs in U-
SQL, and submit jobs to the Data Lake Analytics service.

Prerequisites
Before you begin this tutorial, you must have an Azure subscription . See Get Azure free trial.

Create a Data Lake Analytics account


Now, you will create a Data Lake Analytics account and an Azure Data Lake Storage Gen1 account at the same time. This
step is simple and only takes about 60 seconds to finish.
1. Sign on to the Azure portal.
2. Click Create a resource > Data + Analytics > Data Lake Analytics .
3. Select values for the following items:
Name : Name your Data Lake Analytics account (Only lower case letters and numbers allowed).
Subscription : Choose the Azure subscription used for the Analytics account.
Resource Group . Select an existing Azure Resource Group or create a new one.
Location . Select an Azure data center for the Data Lake Analytics account.
Data Lake Storage Gen1 : Follow the instruction to create a new Data Lake Storage Gen1 account, or
select an existing one.
4. Optionally, select a pricing tier for your Data Lake Analytics account.
5. Click Create .

Your first U-SQL script


The following text is a very simple U-SQL script. All it does is define a small dataset within the script and then
write that dataset out to the default Data Lake Storage Gen1 account as a file called /data.csv .

@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();

Submit a U-SQL job


1. From the Data Lake Analytics account, select New Job .
2. Paste in the text of the preceding U-SQL script. Name the job.
3. Select Submit button to start the job.
4. Monitor the Status of the job, and wait until the job status changes to Succeeded .
5. Select the Data tab, then select the Outputs tab. Select the output file named data.csv and view the output
data.

See also
To get started developing U-SQL applications, see Develop U-SQL scripts using Data Lake Tools for Visual
Studio.
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
Develop U-SQL scripts by using Data Lake Tools for
Visual Studio
12/10/2021 • 3 minutes to read

Azure Data Lake and Stream Analytics Tools include functionality related to two Azure services, Azure Data Lake
Analytics and Azure Stream Analytics. For more information about the Azure Stream Analytics scenarios, see
Azure Stream Analytics tools for Visual Studio.
This article describes how to use Visual Studio to create Azure Data Lake Analytics accounts. You can define jobs
in U-SQL, and submit jobs to the Data Lake Analytics service. For more information about Data Lake Analytics,
see Azure Data Lake Analytics overview.

IMPORTANT
We recommend you upgrade to Azure Data Lake Tools for Visual Studio version 2.3.3000.4 or later. The previous versions
are no longer available for download and are now deprecated.
1. Check whether you are using a version of Azure Data Lake Tools for Visual Studio earlier than 2.3.3000.4.

2. If your version is earlier than 2.3.3000.4, update your Azure Data Lake Tools for Visual Studio by visiting
the download center:
For Visual Studio 2017 and 2019
For Visual Studio 2013 and 2015

Prerequisites
Visual Studio : All editions except Express are supported.
Visual Studio 2019
Visual Studio 2017
Visual Studio 2015
Visual Studio 2013
Microsoft Azure SDK for .NET version 2.7.1 or later. Install it by using the Web platform installer.
A Data Lake Analytics account. To create an account, see Get Started with Azure Data Lake Analytics
using Azure portal.

Install Azure Data Lake Tools for Visual Studio


This tutorial requires that Data Lake Tools for Visual Studio is installed. For more information, see Install Data
Lake Tools for Visual Studio.

Connect to an Azure Data Lake Analytics account


1. Open Visual Studio.
2. Open Data Lake Analytics Explorer by selecting View > Data Lake Analytics Explorer .
3. Right-click Azure , then select Connect to Microsoft Azure Subscription . In Sign in to your
account , follow the instructions.
4. In Server Explorer , select Azure > Data Lake Analytics . You see a list of your Data Lake Analytics
accounts.

Write your first U-SQL script


The following text is a simple U-SQL script. It defines a small dataset and writes that dataset to the default Data
Lake Store as a file called /data.csv .

USE DATABASE master;


USE SCHEMA dbo;
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();

Submit a Data Lake Analytics job


1. In Visual Studio, select File > New > Project .
2. Select the U-SQL Project type, and then select Next . In Configure your new project , select Create .
Visual Studio creates a solution that contains a Script.usql file.
3. Paste the script from Write your first U-SQL script into the Script.usql window.
4. In Solution Explorer , right-click Script.usql , and select Submit Script .
5. In Submit Job , choose your Data Lake Analytics account and select Submit .
After the job submission, the Job view tab opens to show the job progress.
Job Summary shows the summary of the job.
Job Graph visualizes the progress of the job.
MetaData Operations shows all the actions that were taken on the U-SQL catalog.
Data shows all the inputs and outputs.
State History shows the timeline and state details.
AU Analysis shows how many AUs were used in the job and lets you explore simulations of different AU allocation
strategies.
Diagnostics provides an advanced analysis for job execution and performance optimization.

To see the latest job status and refresh the screen, select Refresh .
Check job status
1. In Server Explorer , select Azure > Data Lake Analytics .
2. Expand the Data Lake Analytics account name.
3. Double-click Jobs .
4. Select the job that you previously submitted.

See the job output


1. In Server Explorer , browse to the job you submitted.
2. Click the Data tab.
3. In the Job Outputs tab, select the "/data.csv" file.

Next steps
Run U-SQL scripts on your own workstation for testing and debugging
Debug C# code in U-SQL jobs using Azure Data Lake Tools for Visual Studio Code
Use the Azure Data Lake Tools for Visual Studio Code
Use Azure Data Lake Tools for Visual Studio Code
12/10/2021 • 8 minutes to read

In this article, learn how you can use Azure Data Lake Tools for Visual Studio Code (VS Code) to create, test, and
run U-SQL scripts.

Prerequisites
Azure Data Lake Tools for VS Code supports Windows, Linux, and macOS. U-SQL local run and local debug
works only in Windows.
Visual Studio Code
For macOS and Linux:
.NET 5.0 SDK
Mono 5.2.x

Install Azure Data Lake Tools


After you install the prerequisites, you can install Azure Data Lake Tools for VS Code.
To install Azure Data Lake Tools
1. Open Visual Studio Code.
2. Select Extensions in the left pane. Enter Azure Data Lake Tools in the search box.
3. Select Install next to Azure Data Lake Tools .
After a few seconds, the Install button changes to Reload .
4. Select Reload to activate the Azure Data Lake Tools extension.
5. Select Reload Window to confirm. You can see Azure Data Lake Tools in the Extensions pane.

Activate Azure Data Lake Tools


Create a .usql file or open an existing .usql file to activate the extension.

Work with U-SQL


To work with U-SQL, you need to open either a U-SQL file or a folder.
To open the sample script
Open the command palette (Ctrl+Shift+P) and enter ADL: Open Sample Script . It opens another instance of
this sample. You can also edit, configure, and submit a script on this instance.
To open a folder for your U -SQL project
1. From Visual Studio Code, select the File menu, and then select Open Folder .
2. Specify a folder, and then select Select Folder .
3. Select the File menu, and then select New . An Untitled-1 file is added to the project.
4. Enter the following code in the Untitled-1 file:

@departments =
SELECT * FROM
(VALUES
(31, "Sales"),
(33, "Engineering"),
(34, "Clerical"),
(35, "Marketing")
) AS
D( DepID, DepName );

OUTPUT @departments TO "/Output/departments.csv" USING Outputters.Csv();


The script creates a departments.csv file with some data included in the /Output folder.
5. Save the file as myUSQL.usql in the opened folder.
To compile a U -SQL script
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Compile Script . The compile results appear in the Output window. You can also right-click a
script file, and then select ADL: Compile Script to compile a U-SQL job. The compilation result appears in
the Output pane.
To submit a U -SQL script
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Submit Job . You can also right-click a script file, and then select ADL: Submit Job .
After you submit a U-SQL job, the submission logs appear in the Output window in VS Code. The job view
appears in the right pane. If the submission is successful, the job URL appears too. You can open the job URL in a
web browser to track the real-time job status.
On the job view's SUMMARY tab, you can see the job details. Main functions include resubmitting a script,
duplicating a script, and opening the job in the portal. On the job view's DATA tab, you can refer to the input files, output
files, and resource files. Files can be downloaded to the local computer.

To set the default context


You can set the default context to apply this setting to all script files if you have not set parameters for files
individually.
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Set Default Context . Or right-click the script editor and select ADL: Set Default Context .
3. Choose the account, database, and schema that you want. The setting is saved to the xxx_settings.json
configuration file.

To set script parameters


1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Set Script Parameters .
3. The xxx_settings.json file is opened with the following properties:
account : An Azure Data Lake Analytics account under your Azure subscription that's needed to
compile and run U-SQL jobs. You need to configure the account before you compile and run
U-SQL jobs.
database : A database under your account. The default is master .
schema : A schema under your database. The default is dbo .
optionalSettings :
priority : The priority range is from 1 to 1000, with 1 as the highest priority. The default value is
1000 .
degreeOfParallelism : The parallelism range is from 1 to 150. The default value is the
maximum parallelism allowed in your Azure Data Lake Analytics account.
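
Based on the properties listed above, a saved settings file looks roughly like the following sketch; the account
name is a placeholder, the values shown are the documented defaults, and the exact layout can differ between
tool versions.

{
    "account": "<Data Lake Analytics account name>",
    "database": "master",
    "schema": "dbo",
    "optionalSettings": {
        "priority": 1000,
        "degreeOfParallelism": 1
    }
}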
NOTE
After you save the configuration, the account, database, and schema information appear on the status bar at the lower-
left corner of the corresponding .usql file if you don’t have a default context set up.

To set Git ignore


1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Set Git Ignore .
If you don’t have a .gitIgnore file in your VS Code working folder, a file named .gitIgnore is created
in your folder. Four items (usqlCodeBehindReference , usqlCodeBehindGenerated , .cache , obj )
are added in the file by default. You can make more updates if needed.
If you already have a .gitIgnore file in your VS Code working folder, the tool adds four items
(usqlCodeBehindReference , usqlCodeBehindGenerated , .cache , obj ) in your .gitIgnore file if
the four items were not included in the file.
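
For reference, the four default entries added by the tool appear in the .gitIgnore file roughly as follows (a
sketch; any patterns of your own can be kept alongside them):

usqlCodeBehindReference
usqlCodeBehindGenerated
.cache
obj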

Work with code-behind files: C Sharp, Python, and R


Azure Data Lake Tools supports multiple custom codes. For instructions, see Develop U-SQL with Python, R, and
C Sharp for Azure Data Lake Analytics in VS Code.

Work with assemblies


For information on developing assemblies, see Develop U-SQL assemblies for Azure Data Lake Analytics jobs.
You can use Data Lake Tools to register custom code assemblies in the Data Lake Analytics catalog.
To register an assembly
You can register the assembly through the ADL: Register Assembly or ADL: Register Assembly
(Advanced) command.
To register through the ADL: Register Assembly command
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Register Assembly .
3. Specify the local assembly path.
4. Select a Data Lake Analytics account.
5. Select a database.
The portal is opened in a browser and displays the assembly registration process.
A more convenient way to trigger the ADL: Register Assembly command is to right-click the .dll file in File
Explorer.
To register through the ADL: Register Assembly (Advanced) command
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Register Assembly (Advanced) .
3. Specify the local assembly path.
4. The JSON file is displayed. Review and edit the assembly dependencies and resource parameters, if
needed. Instructions are displayed in the Output window. To proceed to the assembly registration, save
(Ctrl+S) the JSON file.

NOTE
Azure Data Lake Tools autodetects whether the DLL has any assembly dependencies. The dependencies are displayed
in the JSON file after they're detected.
You can upload your DLL resources (for example, .txt, .png, and .csv) as part of the assembly registration.

Another way to trigger the ADL: Register Assembly (Advanced) command is to right-click the .dll file in File
Explorer.
The following U-SQL code demonstrates how to call an assembly. In the sample, the assembly name is test.
REFERENCE ASSEMBLY [test];
@a =
EXTRACT
Iid int,
Starts DateTime,
Region string,
Query string,
DwellTime int,
Results string,
ClickedUrls string
FROM @"Sample/SearchLog.txt"
USING Extractors.Tsv();
@d =
SELECT DISTINCT Region
FROM @a;
@d1 =
PROCESS @d
PRODUCE
Region string,
Mkt string
USING new USQLApplication_codebehind.MyProcessor();
OUTPUT @d1
TO @"Sample/SearchLogtest.txt"
USING Outputters.Tsv();

Use U-SQL local run and local debug for Windows users
U-SQL local run tests your local data and validates your script locally before your code is published to Data Lake
Analytics. You can use the local debug feature to complete the following tasks before your code is submitted to
Data Lake Analytics:
Debug your C# code-behind.
Step through the code.
Validate your script locally.
The local run and local debug feature only works in Windows environments, and is not supported on macOS
and Linux-based operating systems.
For instructions on local run and local debug, see U-SQL local run and local debug with Visual Studio Code.

Connect to Azure
Before you can compile and run U-SQL scripts in Data Lake Analytics, you must connect to your Azure account.

To connect to Azure by using a command


1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Login . The login information appears on the lower right.
3. Select Copy & Open to open the login webpage. Paste the code into the box, and then select Continue .

4. Follow the instructions to sign in from the webpage. When you're connected, your Azure account name
appears on the status bar in the lower-left corner of the VS Code window.

NOTE
Data Lake Tools automatically signs you in the next time if you don't sign out.
If your account has two-factor authentication enabled, we recommend that you use phone authentication rather than a PIN.

To sign out, enter the command ADL: Logout .


To connect to Azure from the explorer
Expand AZURE DATA LAKE , select Sign in to Azure , and then follow step 3 and step 4 of To connect to Azure
by using a command.

You can't sign out from the explorer. To sign out, see To connect to Azure by using a command.

Create an extraction script


You can create an extraction script for .csv, .tsv, and .txt files by using the command ADL: Create EXTRACT
Script or from the Azure Data Lake explorer.
To create an extraction script by using a command
1. Select Ctrl+Shift+P to open the command palette, and enter ADL: Create EXTRACT Script .
2. Specify the full path for an Azure Storage file, and select the Enter key.
3. Select one account.
4. For a .txt file, select a delimiter to extract the file.
The extraction script is generated based on your entries. If the script cannot detect the columns automatically,
choose one of the two offered options. Otherwise, only one script is generated.
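
The generated script has roughly the following shape; the column names, types, and paths in this sketch are
assumptions, and the OUTPUT statement is included here only to make the sketch runnable.

@input =
    EXTRACT UserId string,
            Region string,
            Duration string
    FROM "/Samples/Data/sample.csv"
    USING Extractors.Csv();

OUTPUT @input
    TO "/Output/sample_extract.csv"
    USING Outputters.Csv();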

To create an extraction script from the explorer


Another way to create the extraction script is through the right-click (shortcut) menu on the .csv, .tsv, or .txt file in
Azure Data Lake Store or Azure Blob storage.
Next steps
Develop U-SQL with Python, R, and C Sharp for Azure Data Lake Analytics in VS Code
U-SQL local run and local debug with Visual Studio Code
Tutorial: Get started with Azure Data Lake Analytics
Tutorial: Develop U-SQL scripts by using Data Lake Tools for Visual Studio
Get started with Azure Data Lake Analytics using
Azure PowerShell
12/10/2021 • 2 minutes to read

Learn how to use Azure PowerShell to create Azure Data Lake Analytics accounts and then submit and run U-
SQL jobs. For more information about Data Lake Analytics, see Azure Data Lake Analytics overview.

Prerequisites
NOTE
This article uses the Azure Az PowerShell module, which is the recommended PowerShell module for interacting with
Azure. To get started with the Az PowerShell module, see Install Azure PowerShell. To learn how to migrate to the Az
PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Before you begin this tutorial, you must have the following information:
An Azure Data Lake Analytics account . See Get started with Data Lake Analytics.
A workstation with Azure PowerShell . See How to install and configure Azure PowerShell.

Log in to Azure
This tutorial assumes you are already familiar with using Azure PowerShell. In particular, you need to know how
to log in to Azure. See the Get started with Azure PowerShell if you need help.
To log in with a subscription name:

Connect-AzAccount -SubscriptionName "ContosoSubscription"

Instead of the subscription name, you can also use a subscription id to log in:

Connect-AzAccount -SubscriptionId "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

If successful, the output of this command looks like the following text:

Environment : AzureCloud
Account : [email protected]
TenantId : "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
SubscriptionId : "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
SubscriptionName : ContosoSubscription
CurrentStorageAccount :

Preparing for the tutorial


The PowerShell snippets in this tutorial use these variables to store this information:
$rg = "<ResourceGroupName>"
$adls = "<DataLakeStoreAccountName>"
$adla = "<DataLakeAnalyticsAccountName>"
$location = "East US 2"

Get information about a Data Lake Analytics account


Get-AdlAnalyticsAccount -ResourceGroupName $rg -Name $adla

Submit a U-SQL job


Create a PowerShell variable to hold the U-SQL script.

$script = @"
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();

"@

Submit the script text with the Submit-AdlJob cmdlet and the -Script parameter.

$job = Submit-AdlJob -Account $adla -Name "My Job" -Script $script

As an alternative, you can submit a script file using the -ScriptPath parameter:

$filename = "d:\test.usql"
$script | out-File $filename
$job = Submit-AdlJob -Account $adla -Name "My Job" -ScriptPath $filename

Get the status of a job with Get-AdlJob .

$job = Get-AdlJob -Account $adla -JobId $job.JobId

Instead of calling Get-AdlJob over and over until a job finishes, use the Wait-AdlJob cmdlet.

Wait-AdlJob -Account $adla -JobId $job.JobId

Download the output file using Export-AdlStoreItem .

Export-AdlStoreItem -Account $adls -Path "/data.csv" -Destination "C:\data.csv"

See also
To see the same tutorial using other tools, click the tab selectors on the top of the page.
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
Get started with Azure Data Lake Analytics using
Azure CLI
12/10/2021 • 4 minutes to read

This article describes how to use the Azure CLI to create Azure Data Lake Analytics
accounts, submit U-SQL jobs, and manage catalogs.

Prerequisites
Before you begin, you need the following items:
An Azure subscription . See Get Azure free trial.
This article requires that you are running the Azure CLI version 2.0 or later. If you need to install or upgrade,
see Install Azure CLI.

Sign in to Azure
To sign in to your Azure subscription:

az login

You're prompted to browse to a URL and enter an authentication code, and then to follow the instructions to
enter your credentials.
Once you have logged in, the login command lists your subscriptions.
To use a specific subscription:

az account set --subscription <subscription id>

Create Data Lake Analytics account


You need a Data Lake Analytics account before you can run any jobs. To create a Data Lake Analytics account,
you must specify the following items:
Azure Resource Group . A Data Lake Analytics account must be created within an Azure Resource group.
Azure Resource Manager enables you to work with the resources in your application as a group. You can
deploy, update, or delete all of the resources for your application in a single, coordinated operation.
To list the existing resource groups under your subscription:

az group list

To create a new resource group:

az group create --name "<Resource Group Name>" --location "<Azure Location>"


Data Lake Analytics account name . Each Data Lake Analytics account has a name.
Location . Use one of the Azure data centers that supports Data Lake Analytics.
Default Data Lake Store account : Each Data Lake Analytics account has a default Data Lake Store account.
To list the existing Data Lake Store account:

az dls account list

To create a new Data Lake Store account:

az dls account create --account "<Data Lake Store Account Name>" --resource-group "<Resource Group Name>"

Use the following syntax to create a Data Lake Analytics account:

az dla account create --account "<Data Lake Analytics Account Name>" --resource-group "<Resource Group
Name>" --location "<Azure location>" --default-data-lake-store "<Default Data Lake Store Account Name>"

After creating an account, you can use the following commands to list the accounts and show account details:

az dla account list


az dla account show --account "<Data Lake Analytics Account Name>"

Upload data to Data Lake Store


In this tutorial, you process some search logs. The search log can be stored in either Data Lake store or Azure
Blob storage.
The Azure portal provides a user interface for copying some sample data files to the default Data Lake Store
account, which include a search log file. See Prepare source data to upload the data to the default Data Lake
Store account.
To upload files using Azure CLI, use the following commands:

az dls fs upload --account "<Data Lake Store Account Name>" --source-path "<Source File Path>" --
destination-path "<Destination File Path>"
az dls fs list --account "<Data Lake Store Account Name>" --path "<Path>"

Data Lake Analytics can also access Azure Blob storage. For uploading data to Azure Blob storage, see Using the
Azure CLI with Azure Storage.

Submit Data Lake Analytics jobs


The Data Lake Analytics jobs are written in the U-SQL language. To learn more about U-SQL, see Get started
with U-SQL language and U-SQL language reference.
To create a Data Lake Analytics job script
Create a text file with following U-SQL script, and save the text file to your workstation:
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();

This example U-SQL script defines a small dataset inline and then writes it to /data.csv using
Outputters.Csv() .
To read an existing source file instead, use an extractor such as Extractors.Tsv() with the file's path. Data
Lake Analytics creates the output folder if it doesn't exist.
It is simpler to use relative paths for files stored in default Data Lake Store accounts. You can also use absolute
paths. For example:

adl://<Data LakeStorageAccountName>.azuredatalakestore.net:443/Samples/Data/SearchLog.tsv

You must use absolute paths to access files in linked Storage accounts. The syntax for files stored in linked Azure
Storage account is:

wasb://<BlobContainerName>@<StorageAccountName>.blob.core.windows.net/Samples/Data/SearchLog.tsv

NOTE
Azure Blob storage containers with public blobs are not supported. Azure Blob storage containers with public container access are not supported.

To submit jobs
Use the following syntax to submit a job.

az dla job submit --account "<Data Lake Analytics Account Name>" --job-name "<Job Name>" --script "<Script
Path and Name>"

For example:

az dla job submit --account "myadlaaccount" --job-name "myadlajob" --script @"C:\DLA\myscript.txt"

To list jobs and show job details

az dla job list --account "<Data Lake Analytics Account Name>"


az dla job show --account "<Data Lake Analytics Account Name>" --job-identity "<Job Id>"

To cancel jobs

az dla job cancel --account "<Data Lake Analytics Account Name>" --job-identity "<Job Id>"

Retrieve job results


After a job is completed, you can use the following commands to list the output files, and download the files:

az dls fs list --account "<Data Lake Store Account Name>" --path "/Output"
az dls fs preview --account "<Data Lake Store Account Name>" --path "/Output/SearchLog-from-Data-Lake.csv"
az dls fs preview --account "<Data Lake Store Account Name>" --path "/Output/SearchLog-from-Data-Lake.csv" -
-length 128 --offset 0
az dls fs download --account "<Data Lake Store Account Name>" --source-path "/Output/SearchLog-from-Data-
Lake.csv" --destination-path "<Destination Path and File Name>"

For example:

az dls fs download --account "myadlsaccount" --source-path "/Output/SearchLog-from-Data-Lake.csv" --


destination-path "C:\DLA\myfile.csv"

Next steps
To see the Data Lake Analytics Azure CLI reference document, see Data Lake Analytics.
To see the Data Lake Store Azure CLI reference document, see Data Lake Store.
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
Azure Policy Regulatory Compliance controls for
Azure Data Lake Analytics
12/10/2021 • 5 minutes to read

Regulatory Compliance in Azure Policy provides Microsoft created and managed initiative definitions, known as
built-ins, for the compliance domains and security controls related to different compliance standards. This
page lists the compliance domains and security controls for Azure Data Lake Analytics. You can assign the
built-ins for a security control individually to help make your Azure resources compliant with the specific
standard.
The title of each built-in policy definition links to the policy definition in the Azure portal. Use the link in the
Policy Version column to view the source on the Azure Policy GitHub repo.

IMPORTANT
Each control below is associated with one or more Azure Policy definitions. These policies may help you assess compliance
with the control; however, there often is not a one-to-one or complete match between a control and one or more policies.
As such, Compliant in Azure Policy refers only to the policies themselves; this doesn't ensure you're fully compliant with
all requirements of a control. In addition, the compliance standard includes controls that aren't addressed by any Azure
Policy definitions at this time. Therefore, compliance in Azure Policy is only a partial view of your overall compliance status.
The associations between controls and Azure Policy Regulatory Compliance definitions for these compliance standards
may change over time.

Azure Security Benchmark


The Azure Security Benchmark provides recommendations on how you can secure your cloud solutions on
Azure. To see how this service completely maps to the Azure Security Benchmark, see the Azure Security
Benchmark mapping files.
To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - Azure Security Benchmark.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Logging and Threat Detection | LT-4 | Enable logging for Azure resources | Resource logs in Data Lake Analytics should be enabled | 5.0.0

Azure Security Benchmark v1


The Azure Security Benchmark provides recommendations on how you can secure your cloud solutions on
Azure. To see how this service completely maps to the Azure Security Benchmark, see the Azure Security
Benchmark mapping files.
To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - Azure Security Benchmark.
Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Logging and Monitoring | 2.3 | Enable audit logging for Azure resources | Resource logs in Data Lake Analytics should be enabled | 5.0.0

CIS Microsoft Azure Foundations Benchmark 1.3.0


To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - CIS Microsoft Azure Foundations Benchmark 1.3.0. For more information
about this compliance standard, see CIS Microsoft Azure Foundations Benchmark.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Logging and Monitoring | 5.3 | Ensure that Diagnostic Logs are enabled for all services which support it | Resource logs in Data Lake Analytics should be enabled | 5.0.0

FedRAMP High
To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - FedRAMP High. For more information about this compliance standard,
see FedRAMP High.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Audit and Accountability | AU-6 (4) | Central Review and Analysis | Resource logs in Data Lake Analytics should be enabled | 5.0.0
Audit and Accountability | AU-6 (5) | Integration / Scanning and Monitoring Capabilities | Resource logs in Data Lake Analytics should be enabled | 5.0.0
Audit and Accountability | AU-12 | Audit Generation | Resource logs in Data Lake Analytics should be enabled | 5.0.0
Audit and Accountability | AU-12 (1) | System-wide / Time-correlated Audit Trail | Resource logs in Data Lake Analytics should be enabled | 5.0.0

FedRAMP Moderate
To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - FedRAMP Moderate. For more information about this compliance
standard, see FedRAMP Moderate.
Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Audit and Accountability | AU-12 | Audit Generation | Resource logs in Data Lake Analytics should be enabled | 5.0.0

HIPAA HITRUST 9.2


To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - HIPAA HITRUST 9.2. For more information about this compliance
standard, see HIPAA HITRUST 9.2.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Audit Logging | 1210.09aa3System.3 - 09.aa | All disclosures of covered information within or outside of the organization are logged including type of disclosure, date/time of the event, recipient, and sender | Resource logs in Data Lake Analytics should be enabled | 5.0.0

New Zealand ISM Restricted


To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - New Zealand ISM Restricted. For more information about this compliance
standard, see New Zealand ISM Restricted.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Access Control and Passwords | AC-17 | 16.6.9 Events to be logged | Resource logs in Data Lake Analytics should be enabled | 5.0.0

NIST SP 800-53 Rev. 4


To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - NIST SP 800-53 Rev. 4. For more information about this compliance
standard, see NIST SP 800-53 Rev. 4.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Audit and Accountability | AU-6 (4) | Central Review and Analysis | Resource logs in Data Lake Analytics should be enabled | 5.0.0
Audit and Accountability | AU-6 (5) | Integration / Scanning and Monitoring Capabilities | Resource logs in Data Lake Analytics should be enabled | 5.0.0
Audit and Accountability | AU-12 | Audit Generation | Resource logs in Data Lake Analytics should be enabled | 5.0.0
Audit and Accountability | AU-12 (1) | System-wide / Time-correlated Audit Trail | Resource logs in Data Lake Analytics should be enabled | 5.0.0

NIST SP 800-53 Rev. 5


To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - NIST SP 800-53 Rev. 5. For more information about this compliance
standard, see NIST SP 800-53 Rev. 5.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Audit and Accountability | AU-6 (4) | Central Review and Analysis | Resource logs in Data Lake Analytics should be enabled | 5.0.0
Audit and Accountability | AU-6 (5) | Integrated Analysis of Audit Records | Resource logs in Data Lake Analytics should be enabled | 5.0.0
Audit and Accountability | AU-12 | Audit Record Generation | Resource logs in Data Lake Analytics should be enabled | 5.0.0
Audit and Accountability | AU-12 (1) | System-wide and Time-correlated Audit Trail | Resource logs in Data Lake Analytics should be enabled | 5.0.0

Next steps
Learn more about Azure Policy Regulatory Compliance.
See the built-ins on the Azure Policy GitHub repo.
Manage Azure Data Lake Analytics using the Azure
portal
12/10/2021 • 5 minutes to read

This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs by using
the Azure portal.

Manage Data Lake Analytics accounts


Create an account
1. Sign in to the Azure portal.
2. Click Create a resource > Intelligence + analytics > Data Lake Analytics .
3. Select values for the following items:
a. Name : The name of the Data Lake Analytics account.
b. Subscription : The Azure subscription used for the account.
c. Resource Group : The Azure resource group in which to create the account.
d. Location : The Azure datacenter for the Data Lake Analytics account.
e. Data Lake Store : The default store to be used for the Data Lake Analytics account. The Azure Data
Lake Store account and the Data Lake Analytics account must be in the same location.
4. Click Create .
Delete a Data Lake Analytics account
Before you delete a Data Lake Analytics account, delete its default Data Lake Store account.
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click Delete .
3. Type the account name.
4. Click Delete .

Manage data sources


Data Lake Analytics supports the following data sources:
Data Lake Store
Azure Storage
You can use Data Explorer to browse data sources and perform basic file management operations.
Add a data source
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click Data Sources .
3. Click Add Data Source .
To add a Data Lake Store account, you need the account name and access to the account to be able to
query it.
To add Azure Blob storage, you need the storage account and the account key. To find them, go to the
storage account in the portal.
Set up firewall rules
You can use Data Lake Analytics to further lock down access to your Data Lake Analytics account at the network
level. You can enable a firewall, specify an IP address, or define an IP address range for your trusted clients. After
you enable these measures, only clients that have the IP addresses within the defined range can connect to the
store.
If other Azure services, like Azure Data Factory or VMs, connect to the Data Lake Analytics account, make sure
that Allow Azure Services is turned On .
Set up a firewall rule
1. In the Azure portal, go to your Data Lake Analytics account.
2. On the menu on the left, click Firewall .
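
If you prefer to script the firewall configuration instead of using the portal, the Azure CLI offers equivalent
commands; the following is a hedged sketch in which the rule name and IP range are placeholders (confirm the
exact flags with az dla account firewall create -h):

az dla account update --account "<Data Lake Analytics account name>" --firewall-state "Enabled"
az dla account firewall create --account "<Data Lake Analytics account name>" --firewall-rule-name "TrustedClients" --start-ip-address "203.0.113.1" --end-ip-address "203.0.113.254"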

Add a new user


You can use the Add User Wizard to easily provision new Data Lake users.
1. In the Azure portal, go to your Data Lake Analytics account.
2. On the left, under Getting Started , click Add User Wizard .
3. Select a user, and then click Select .
4. Select a role, and then click Select . To set up a new developer to use Azure Data Lake, select the Data Lake
Analytics Developer role.
5. Select the access control lists (ACLs) for the U-SQL databases. When you're satisfied with your choices, click
Select .
6. Select the ACLs for files. For the default store, don't change the ACLs for the root folder "/" and for the
/system folder. Click Select .
7. Review all your selected changes, and then click Run .
8. When the wizard is finished, click Done .

Manage Azure role-based access control


Like other Azure services, you can use Azure role-based access control (Azure RBAC) to control how users
interact with the service.
The standard Azure roles have the following capabilities:
Owner : Can submit jobs, monitor jobs, cancel jobs from any user, and configure the account.
Contributor : Can submit jobs, monitor jobs, cancel jobs from any user, and configure the account.
Reader : Can monitor jobs.
Use the Data Lake Analytics Developer role to enable U-SQL developers to use the Data Lake Analytics service.
You can use the Data Lake Analytics Developer role to:
Submit jobs.
Monitor job status and the progress of jobs submitted by any user.
See the U-SQL scripts from jobs submitted by any user.
Cancel only your own jobs.
Add users or security groups to a Data Lake Analytics account
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click Access control (IAM) > Add role assignment .
3. Select a role.
4. Add a user.
5. Click OK .
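
The same role assignment can also be scripted with the Azure CLI; a minimal sketch, where the user, subscription,
resource group, and account name are placeholders and the scope is assumed to be the Data Lake Analytics
account's resource ID:

az role assignment create --assignee "user@contoso.com" --role "Data Lake Analytics Developer" --scope "/subscriptions/<subscription id>/resourceGroups/<resource group>/providers/Microsoft.DataLakeAnalytics/accounts/<account name>"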

NOTE
If a user or a security group needs to submit jobs, they also need permission on the store account. For more information,
see Secure data stored in Data Lake Store.

Manage jobs
Submit a job
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click New Job . For each job, configure:
a. Job Name : The name of the job.
b. Priority : Lower numbers have higher priority. If two jobs are queued, the one with lower priority
value runs first.
c. Parallelism : The maximum number of compute processes to reserve for this job.
3. Click Submit Job .
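
To script the same submission instead of using the portal, the PowerShell cmdlet used elsewhere in this guide
accepts equivalent settings; a minimal sketch, assuming $adla holds your account name and the script file exists
locally:

Submit-AdlJob -Account $adla -Name "My job" -ScriptPath "d:\test.usql" -Priority 100 -DegreeOfParallelism 2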
Monitor jobs
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click View All Jobs . A list of all the active and recently finished jobs in the account is shown.
3. Optionally, click Filter to help you find the jobs by Time Range , Job Name , and Author values.
Monitoring pipeline jobs
Jobs that are part of a pipeline work together, usually sequentially, to accomplish a specific scenario. For
example, you can have a pipeline that cleans, extracts, transforms, and aggregates usage data for customer insights.
Pipeline jobs are identified using the "Pipeline" property when the job was submitted. Jobs scheduled using ADF
V2 will automatically have this property populated.
To view a list of U-SQL jobs that are part of pipelines:
1. In the Azure portal, go to your Data Lake Analytics accounts.
2. Click Job Insights . The "All Jobs" tab will be defaulted, showing a list of running, queued, and ended jobs.
3. Click the Pipeline Jobs tab. A list of pipeline jobs will be shown along with aggregated statistics for each
pipeline.
Monitoring recurring jobs
A recurring job is one that has the same business logic but uses different input data every time it runs. Ideally,
recurring jobs should always succeed, and have relatively stable execution time; monitoring these behaviors will
help ensure the job is healthy. Recurring jobs are identified using the "Recurrence" property. Jobs scheduled
using ADF V2 will automatically have this property populated.
To view a list of U-SQL jobs that are recurring:
1. In the Azure portal, go to your Data Lake Analytics accounts.
2. Click Job Insights . The "All Jobs" tab will be defaulted, showing a list of running, queued, and ended jobs.
3. Click the Recurring Jobs tab. A list of recurring jobs will be shown along with aggregated statistics for each
recurring job.

Next steps
Overview of Azure Data Lake Analytics
Manage Azure Data Lake Analytics by using Azure PowerShell
Manage Azure Data Lake Analytics using policies
Manage Azure Data Lake Analytics using the Azure
CLI
12/10/2021 • 4 minutes to read

Learn how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using the Azure CLI. To
see management topics using other tools, click the tab selectors at the top of the page.

Prerequisites
Before you begin this tutorial, you must have the following resources:
An Azure subscription. See Get Azure free trial.
Azure CLI. See Install and configure Azure CLI.
Download and install the pre-release Azure CLI tools in order to complete this demo.
Authenticate by using the az login command and select the subscription that you want to use. For more
information on authenticating using a work or school account, see Connect to an Azure subscription from
the Azure CLI.

az login
az account set --subscription <subscription id>

You can now access the Data Lake Analytics and Data Lake Store commands. Run the following command
to list the Data Lake Store and Data Lake Analytics commands:

az dls -h
az dla -h

Manage accounts
Before running any Data Lake Analytics jobs, you must have a Data Lake Analytics account. Unlike Azure
HDInsight, you don't pay for an Analytics account when it is not running a job. You only pay for the time when it
is running a job. For more information, see Azure Data Lake Analytics Overview.
Create accounts
Run the following command to create a Data Lake account,

az dla account create --account "<Data Lake Analytics account name>" --location "<Location Name>" --
resource-group "<Resource Group Name>" --default-data-lake-store "<Data Lake Store account name>"

Update accounts
The following command updates the properties of an existing Data Lake Analytics Account

az dla account update --account "<Data Lake Analytics Account Name>" --firewall-state "Enabled" --query-
store-retention 7

List accounts
List Data Lake Analytics accounts within a specific resource group

az dla account list --resource-group "<Resource group name>"

Get details of an account


az dla account show --account "<Data Lake Analytics account name>" --resource-group "<Resource group name>"

Delete an account

az dla account delete --account "<Data Lake Analytics account name>" --resource-group "<Resource group
name>"

Manage data sources


Data Lake Analytics currently supports the following two data sources:
Azure Data Lake Store
Azure Storage
When you create an Analytics account, you must designate an Azure Data Lake Storage account to be the default
storage account. The default Data Lake storage account is used to store job metadata and job audit logs. After
you have created an Analytics account, you can add additional Data Lake Storage accounts and/or Azure Storage
accounts.
Find the default Data Lake Store account
You can view the default Data Lake Store account in use by running the az dla account show command. The default
account name is listed under the defaultDataLakeStoreAccount property.

az dla account show --account "<Data Lake Analytics account name>"
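
To print only that property, you can combine the same command with the CLI's global --query option (a JMESPath
expression) and a plain output format; a small sketch:

az dla account show --account "<Data Lake Analytics account name>" --query "defaultDataLakeStoreAccount" --output tsv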

Add additional Blob storage accounts

az dla account blob-storage add --access-key "<Azure Storage Account Key>" --account "<Data Lake Analytics
account name>" --storage-account-name "<Storage account name>"

NOTE
Only Blob storage short names are supported. Don't use FQDN, for example "myblob.blob.core.windows.net".

Add additional Data Lake Store accounts


The following command updates the specified Data Lake Analytics account with an additional Data Lake Store
account:

az dla account data-lake-store add --account "<Data Lake Analytics account name>" --data-lake-store-account-
name "<Data Lake Store account name>"

Update existing data source


To update an existing Blob storage account key:
az dla account blob-storage update --access-key "<New Blob Storage Account Key>" --account "<Data Lake
Analytics account name>" --storage-account-name "<Data Lake Store account name>"

List data sources


To list the Data Lake Store accounts:

az dla account data-lake-store list --account "<Data Lake Analytics account name>"

To list the Blob storage account:

az dla account blob-storage list --account "<Data Lake Analytics account name>"

Delete data sources


To delete a Data Lake Store account:

az dla account data-lake-store delete --account "<Data Lake Analytics account name>" --data-lake-store-
account-name "<Azure Data Lake Store account name>"

To delete a Blob storage account:

az dla account blob-storage delete --account "<Data Lake Analytics account name>" --storage-account-name "
<Data Lake Store account name>"

Manage jobs
You must have a Data Lake Analytics account before you can create a job. For more information, see Manage
Data Lake Analytics accounts.
List jobs

az dla job list --account "<Data Lake Analytics account name>"


Get job details

az dla job show --account "<Data Lake Analytics account name>" --job-identity "<Job Id>"

Submit jobs

NOTE
The default priority of a job is 1000, and the default degree of parallelism for a job is 1.

az dla job submit --account "<Data Lake Analytics account name>" --job-name "<Name of your job>" --
script "<Script to submit>"

Cancel jobs
Use the list command to find the job ID, and then use cancel to cancel the job.

az dla job cancel --account "<Data Lake Analytics account name>" --job-identity "<Job Id>"

Pipelines and recurrences


Get information about pipelines and recurrences
Use the az dla job pipeline commands to see the pipeline information for previously submitted jobs.

az dla job pipeline list --account "<Data Lake Analytics Account Name>"

az dla job pipeline show --account "<Data Lake Analytics Account Name>" --pipeline-identity "<Pipeline ID>"

Use the az dla job recurrence commands to see the recurrence information for previously submitted jobs.

az dla job recurrence list --account "<Data Lake Analytics Account Name>"

az dla job recurrence show --account "<Data Lake Analytics Account Name>" --recurrence-identity "<Recurrence
ID>"

Next steps
Overview of Microsoft Azure Data Lake Analytics
Get started with Data Lake Analytics using Azure portal
Manage Azure Data Lake Analytics using Azure portal
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Manage Azure Data Lake Analytics using Azure
PowerShell
12/10/2021 • 8 minutes to read

This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs by using
Azure PowerShell.

Prerequisites
NOTE
This article uses the Azure Az PowerShell module, which is the recommended PowerShell module for interacting with
Azure. To get started with the Az PowerShell module, see Install Azure PowerShell. To learn how to migrate to the Az
PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

To use PowerShell with Data Lake Analytics, collect the following pieces of information:
Subscription ID : The ID of the Azure subscription that contains your Data Lake Analytics account.
Resource group : The name of the Azure resource group that contains your Data Lake Analytics account.
Data Lake Analytics account name : The name of your Data Lake Analytics account.
Default Data Lake Store account name : Each Data Lake Analytics account has a default Data Lake Store
account.
Location : The location of your Data Lake Analytics account, such as "East US 2" or other supported locations.
The PowerShell snippets in this tutorial use these variables to store this information:

$subId = "<SubscriptionId>"
$rg = "<ResourceGroupName>"
$adla = "<DataLakeAnalyticsAccountName>"
$adls = "<DataLakeStoreAccountName>"
$location = "<Location>"

Log in to Azure
Log in using interactive user authentication
Log in using a subscription ID or by subscription name

# Using subscription id
Connect-AzAccount -SubscriptionId $subId

# Using subscription name


Connect-AzAccount -SubscriptionName $subname

Saving authentication context


The Connect-AzAccount cmdlet always prompts for credentials. You can avoid being prompted by using the
following cmdlets:

# Save login session information
Save-AzContext -Path D:\profile.json

# Load login session information
Import-AzContext -Path D:\profile.json

Log in using a Service Principal Identity (SPI )

$tenantid = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
$spi_appname = "appname"
$spi_appid = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
$spi_secret = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

$pscredential = New-Object System.Management.Automation.PSCredential ($spi_appid, (ConvertTo-SecureString $spi_secret -AsPlainText -Force))
Login-AzAccount -ServicePrincipal -TenantId $tenantid -Credential $pscredential -Subscription $subid

Manage accounts
List accounts

# List Data Lake Analytics accounts within the current subscription.


Get-AdlAnalyticsAccount

# List Data Lake Analytics accounts within a specific resource group.


Get-AdlAnalyticsAccount -ResourceGroupName $rg

Create an account
Every Data Lake Analytics account requires a default Data Lake Store account that it uses for storing logs. You
can reuse an existing account or create an account.

# Create a data lake store if needed, or you can re-use an existing one
New-AdlStore -ResourceGroupName $rg -Name $adls -Location $location
New-AdlAnalyticsAccount -ResourceGroupName $rg -Name $adla -Location $location -DefaultDataLake $adls

Get account information


Get details about an account.

Get-AdlAnalyticsAccount -Name $adla

Check if an account exists

Test-AdlAnalyticsAccount -Name $adla
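
Test-AdlAnalyticsAccount returns a Boolean, so it combines naturally with the creation cmdlet shown earlier. A small sketch that creates the account only when it doesn't already exist:

if (-not (Test-AdlAnalyticsAccount -Name $adla)) {
    New-AdlAnalyticsAccount -ResourceGroupName $rg -Name $adla -Location $location -DefaultDataLake $adls
}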

Manage data sources


Azure Data Lake Analytics currently supports the following data sources:
Azure Data Lake Store
Azure Storage
Every Data Lake Analytics account has a default Data Lake Store account. The default Data Lake Store account is
used to store job metadata and job audit logs.
Find the default Data Lake Store account

$adla_acct = Get-AdlAnalyticsAccount -Name $adla


$dataLakeStoreName = $adla_acct.DefaultDataLakeAccount

You can find the default Data Lake Store account by filtering the list of datasources by the IsDefault property:

Get-AdlAnalyticsDataSource -Account $adla | ? { $_.IsDefault }

Add a data source

# Add an additional Storage (Blob) account.


$AzureStorageAccountName = "<AzureStorageAccountName>"
$AzureStorageAccountKey = "<AzureStorageAccountKey>"
Add-AdlAnalyticsDataSource -Account $adla -Blob $AzureStorageAccountName -AccessKey $AzureStorageAccountKey

# Add an additional Data Lake Store account.


$AzureDataLakeStoreName = "<AzureDataLakeStoreAccountName>"
Add-AdlAnalyticsDataSource -Account $adla -DataLakeStore $AzureDataLakeStoreName

List data sources

# List all the data sources


Get-AdlAnalyticsDataSource -Account $adla

# List attached Data Lake Store accounts


Get-AdlAnalyticsDataSource -Account $adla | where -Property Type -EQ "DataLakeStore"

# List attached Storage accounts


Get-AdlAnalyticsDataSource -Account $adla | where -Property Type -EQ "Blob"

Submit U-SQL jobs


Submit a string as a U -SQL job

$script = @"
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();
"@

$scriptpath = "d:\test.usql"
$script | Out-File $scriptpath

Submit-AdlJob -AccountName $adla -Script $script -Name "Demo"

Submit a file as a U -SQL job


$scriptpath = "d:\test.usql"
$script | Out-File $scriptpath
Submit-AdlJob -AccountName $adla –ScriptPath $scriptpath -Name "Demo"

List jobs
The output includes the currently running jobs and those jobs that have recently completed.

Get-AdlJob -Account $adla

List the top N jobs


By default, the list of jobs is sorted by submit time, so the most recently submitted jobs appear first. The Data
Lake Analytics account remembers jobs for 180 days, but the Get-AdlJob cmdlet returns only the first 500 by
default. Use the -Top parameter to list a specific number of jobs.

$jobs = Get-AdlJob -Account $adla -Top 10

List jobs by job state


Use the -State parameter. You can combine any of these values:
Accepted
Compiling
Ended
New
Paused
Queued
Running
Scheduling
Start

# List the running jobs


Get-AdlJob -Account $adla -State Running

# List the jobs that have completed


Get-AdlJob -Account $adla -State Ended

# List the jobs that have not started yet


Get-AdlJob -Account $adla -State Accepted,Compiling,New,Paused,Scheduling,Start

List jobs by job result


Use the -Result parameter to detect whether ended jobs completed successfully. It has these values:
Cancelled
Failed
None
Succeeded
# List Successful jobs.
Get-AdlJob -Account $adla -State Ended -Result Succeeded

# List Failed jobs.


Get-AdlJob -Account $adla -State Ended -Result Failed

List jobs by job submitter


The -Submitter parameter helps you identify who submitted a job.

Get-AdlJob -Account $adla -Submitter "[email protected]"

List jobs by submission time


The -SubmittedAfter parameter is useful for filtering to a time range.

# List jobs submitted in the last day.


$d = [DateTime]::Now.AddDays(-1)
Get-AdlJob -Account $adla -SubmittedAfter $d

# List jobs submitted in the last seven days.


$d = [DateTime]::Now.AddDays(-7)
Get-AdlJob -Account $adla -SubmittedAfter $d

Get job status


Get the status of a specific job.

Get-AdlJob -AccountName $adla -JobId $job.JobId

Cancel a job

Stop-AdlJob -Account $adla -JobID $jobID

Wait for a job to finish


Instead of repeating Get-AdlJob until a job finishes, you can use the Wait-AdlJob cmdlet to wait for
the job to end.

Wait-AdlJob -Account $adla -JobId $job.JobId
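
For example, the cmdlets shown above can be chained to submit a script, block until it ends, and then inspect the outcome. A minimal sketch using the variables defined earlier:

$job = Submit-AdlJob -AccountName $adla -ScriptPath $scriptpath -Name "Demo"
Wait-AdlJob -Account $adla -JobId $job.JobId
(Get-AdlJob -AccountName $adla -JobId $job.JobId).Result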

List job pipelines and recurrences


Use the Get-AdlJobPipeline cmdlet to see the pipeline information for previously submitted jobs.

$pipelines = Get-AdlJobPipeline -Account $adla


$pipeline = Get-AdlJobPipeline -Account $adla -PipelineId "<pipeline ID>"

Use the Get-AdlJobRecurrence cmdlet to see the recurrence information for previously submitted jobs.

$recurrences = Get-AdlJobRecurrence -Account $adla

$recurrence = Get-AdlJobRecurrence -Account $adla -RecurrenceId "<recurrence ID>"


Manage compute policies
List existing compute policies
The Get-AdlAnalyticsComputePolicy cmdlet retrieves info about compute policies for a Data Lake Analytics
account.

$policies = Get-AdlAnalyticsComputePolicy -Account $adla

Create a compute policy


The New-AdlAnalyticsComputePolicy cmdlet creates a new compute policy for a Data Lake Analytics account. This
example sets the maximum AUs available to the specified user to 50, and the minimum job priority to 250.

$userObjectId = (Get-AzAdUser -SearchString "[email protected]").Id

New-AdlAnalyticsComputePolicy -Account $adla -Name "GaryMcDaniel" -ObjectId $userObjectId -ObjectType User -MaxDegreeOfParallelismPerJob 50 -MinPriorityPerJob 250

Manage files
Check for the existence of a file

Test-AdlStoreItem -Account $adls -Path "/data.csv"

Uploading and downloading


Upload a file.

Import-AdlStoreItem -AccountName $adls -Path "c:\data.tsv" -Destination "/data_copy.csv"

Upload an entire folder recursively.

Import-AdlStoreItem -AccountName $adls -Path "c:\myData\" -Destination "/myData/" -Recurse

Download a file.

Export-AdlStoreItem -AccountName $adls -Path "/data.csv" -Destination "c:\data.csv"

Download an entire folder recursively.

Export-AdlStoreItem -AccountName $adls -Path "/" -Destination "c:\myData\" -Recurse

NOTE
If the upload or download process is interrupted, you can attempt to resume the process by running the cmdlet again
with the -Resume flag.
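
For example, to resume an interrupted recursive upload (a sketch that reuses the source and destination shown above):

Import-AdlStoreItem -AccountName $adls -Path "c:\myData\" -Destination "/myData/" -Recurse -Resume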

Manage the U-SQL catalog


The U-SQL catalog is used to structure data and code so they can be shared by U-SQL scripts. The catalog
enables the highest performance possible with data in Azure Data Lake. For more information, see Use U-SQL
catalog.
List items in the U -SQL catalog

# List U-SQL databases


Get-AdlCatalogItem -Account $adla -ItemType Database

# List tables within a database


Get-AdlCatalogItem -Account $adla -ItemType Table -Path "database"

# List tables within a schema.


Get-AdlCatalogItem -Account $adla -ItemType Table -Path "database.schema"

List all the assemblies the U -SQL catalog

$dbs = Get-AdlCatalogItem -Account $adla -ItemType Database

foreach ($db in $dbs)


{
$asms = Get-AdlCatalogItem -Account $adla -ItemType Assembly -Path $db.Name

foreach ($asm in $asms)


{
$asmname = "[" + $db.Name + "].[" + $asm.Name + "]"
Write-Host $asmname
}
}

Get details about a catalog item

# Get details of a table


Get-AdlCatalogItem -Account $adla -ItemType Table -Path "master.dbo.mytable"

# Test existence of a U-SQL database.


Test-AdlCatalogItem -Account $adla -ItemType Database -Path "master"

Store credentials in the catalog


Within a U-SQL database, create a credential object for a database hosted in Azure. Currently, U-SQL credentials
are the only type of catalog item that you can create through PowerShell.

$dbName = "master"
$credentialName = "ContosoDbCreds"
$dbUri = "https://contoso.database.windows.net:8080"

New-AdlCatalogCredential -AccountName $adla `


-DatabaseName $db `
-CredentialName $credentialName `
-Credential (Get-Credential) `
-Uri $dbUri

Manage firewall rules


List firewall rules

Get-AdlAnalyticsFirewallRule -Account $adla


Add a firewall rule

$ruleName = "Allow access from on-prem server"


$startIpAddress = "<start IP address>"
$endIpAddress = "<end IP address>"

Add-AdlAnalyticsFirewallRule -Account $adla -Name $ruleName -StartIpAddress $startIpAddress -EndIpAddress


$endIpAddress

Modify a firewall rule

Set-AdlAnalyticsFirewallRule -Account $adla -Name $ruleName -StartIpAddress $startIpAddress -EndIpAddress $endIpAddress

Remove a firewall rule

Remove-AdlAnalyticsFirewallRule -Account $adla -Name $ruleName

Allow Azure IP addresses

# Allow connections from Azure services and resources
Set-AdlAnalyticsAccount -Name $adla -AllowAzureIpState Enabled

# Turn the firewall on
Set-AdlAnalyticsAccount -Name $adla -FirewallState Enabled

# Turn the firewall off
Set-AdlAnalyticsAccount -Name $adla -FirewallState Disabled

Working with Azure


Get error details

Resolve-AzError -Last

Verify if you are running as an Administrator on your Windows machine

function Test-Administrator
{
$user = [Security.Principal.WindowsIdentity]::GetCurrent();
$p = New-Object Security.Principal.WindowsPrincipal $user
$p.IsInRole([Security.Principal.WindowsBuiltinRole]::Administrator)
}

Find a TenantID
From a subscription name:

function Get-TenantIdFromSubscriptionName( [string] $subname )


{
$sub = (Get-AzSubscription -SubscriptionName $subname)
$sub.TenantId
}

Get-TenantIdFromSubscriptionName "ADLTrainingMS"

From a subscription ID:


function Get-TenantIdFromSubscriptionId( [string] $subid )
{
$sub = (Get-AzSubscription -SubscriptionId $subid)
$sub.TenantId
}

$subid = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
Get-TenantIdFromSubscriptionId $subid

From a domain address such as "contoso.com"

function Get-TenantIdFromDomain( $domain )


{
$url = "https://login.windows.net/" + $domain + "/.well-known/openid-configuration"
return (Invoke-WebRequest $url|ConvertFrom-Json).token_endpoint.Split('/')[3]
}

$domain = "contoso.com"
Get-TenantIdFromDomain $domain

List all your subscriptions and tenant IDs

$subs = Get-AzSubscription
foreach ($sub in $subs)
{
Write-Host $sub.Name "(" $sub.Id ")"
Write-Host "`tTenant Id" $sub.TenantId
}

Create a Data Lake Analytics account using a template


You can also use an Azure Resource Group template using the following sample: Create a Data Lake Analytics
account using a template

Next steps
Overview of Microsoft Azure Data Lake Analytics
Get started with Data Lake Analytics using the Azure portal | Azure PowerShell | Azure CLI
Manage Azure Data Lake Analytics using Azure portal | Azure PowerShell | CLI
Manage Azure Data Lake Analytics using a .NET app
12/10/2021 • 7 minutes to read • Edit Online

This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using an
app written using the Azure .NET SDK.

Prerequisites
Visual Studio 2015, Visual Studio 2013 update 4, or Visual Studio 2012 with Visual C++
Installed .
Microsoft Azure SDK for .NET version 2.5 or above . Install it using the Web platform installer.
Required NuGet Packages
Install NuGet packages
PA C K A GE VERSIO N

Microsoft.Rest.ClientRuntime.Azure.Authentication 2.3.1

Microsoft.Azure.Management.DataLake.Analytics 3.0.0

Microsoft.Azure.Management.DataLake.Store 2.2.0

Microsoft.Azure.Management.ResourceManager 1.6.0-preview

Microsoft.Azure.Graph.RBAC 3.4.0-preview

You can install these packages via the NuGet command line with the following commands:

Install-Package -Id Microsoft.Rest.ClientRuntime.Azure.Authentication -Version 2.3.1


Install-Package -Id Microsoft.Azure.Management.DataLake.Analytics -Version 3.0.0
Install-Package -Id Microsoft.Azure.Management.DataLake.Store -Version 2.2.0
Install-Package -Id Microsoft.Azure.Management.ResourceManager -Version 1.6.0-preview
Install-Package -Id Microsoft.Azure.Graph.RBAC -Version 3.4.0-preview

Common variables
string subid = "<Subscription ID>"; // Subscription ID (a GUID)
string tenantid = "<Tenant ID>"; // AAD tenant ID or domain. For example, "contoso.onmicrosoft.com"
string rg = "<value>"; // Resource group name
string clientid = "1950a258-227b-4e31-a9cf-717495945fc2"; // Sample client ID (this will work, but you
should pick your own)

Authentication
You have multiple options for logging on to Azure Data Lake Analytics. The following snippet shows an example
of authentication with interactive user authentication with a pop-up.
using System;
using System.IO;
using System.Threading;
using System.Security.Cryptography.X509Certificates;

using Microsoft.Rest;
using Microsoft.Rest.Azure.Authentication;
using Microsoft.Azure.Management.DataLake.Analytics;
using Microsoft.Azure.Management.DataLake.Analytics.Models;
using Microsoft.Azure.Management.DataLake.Store;
using Microsoft.Azure.Management.DataLake.Store.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Azure.Graph.RBAC;

public static class Program


{
public static string TENANT = "microsoft.onmicrosoft.com";
public static string CLIENTID = "1950a258-227b-4e31-a9cf-717495945fc2";
public static System.Uri ARM_TOKEN_AUDIENCE = new System.Uri( @"https://management.core.windows.net/");
public static System.Uri ADL_TOKEN_AUDIENCE = new System.Uri( @"https://datalake.azure.net/" );
public static System.Uri GRAPH_TOKEN_AUDIENCE = new System.Uri( @"https://graph.windows.net/" );

static void Main(string[] args)


{
string MY_DOCUMENTS= System.Environment.GetFolderPath( System.Environment.SpecialFolder.MyDocuments);
string TOKEN_CACHE_PATH = System.IO.Path.Combine(MY_DOCUMENTS, "my.tokencache");

var tokenCache = GetTokenCache(TOKEN_CACHE_PATH);


var armCreds = GetCreds_User_Popup(TENANT, ARM_TOKEN_AUDIENCE, CLIENTID, tokenCache);
var adlCreds = GetCreds_User_Popup(TENANT, ADL_TOKEN_AUDIENCE, CLIENTID, tokenCache);
var graphCreds = GetCreds_User_Popup(TENANT, GRAPH_TOKEN_AUDIENCE, CLIENTID, tokenCache);
}
}

The source code for GetCreds_User_Popup and the code for other options for authentication are covered in
Data Lake Analytics .NET authentication options.

Create the client management objects


var resourceManagementClient = new ResourceManagementClient(armCreds) { SubscriptionId = subid };

var adlaAccountClient = new DataLakeAnalyticsAccountManagementClient(armCreds);


adlaAccountClient.SubscriptionId = subid;

var adlsAccountClient = new DataLakeStoreAccountManagementClient(armCreds);


adlsAccountClient.SubscriptionId = subid;

var adlaCatalogClient = new DataLakeAnalyticsCatalogManagementClient(adlCreds);


var adlaJobClient = new DataLakeAnalyticsJobManagementClient(adlCreds);

var adlsFileSystemClient = new DataLakeStoreFileSystemManagementClient(adlCreds);

var graphClient = new GraphRbacManagementClient(graphCreds);


graphClient.TenantID = tenantid;

Manage accounts
Create an Azure Resource Group
If you haven't already created one, you must have an Azure Resource Group to create your Data Lake Analytics
components. You need your authentication credentials, subscription ID, and a location. The following code shows
how to create a resource group:

var resourceGroup = new ResourceGroup { Location = location };


resourceManagementClient.ResourceGroups.CreateOrUpdate(rg, resourceGroup);

For more information, see Azure Resource Groups and Data Lake Analytics.
Create a Data Lake Store account
Every ADLA account requires an ADLS account. If you don't already have one to use, you can create one with the
following code:

var new_adls_params = new DataLakeStoreAccount(location: _location);


adlsAccountClient.Account.Create(rg, adls, new_adls_params);

Create a Data Lake Analytics account


The following code creates an ADLA account:

var new_adla_params = new DataLakeAnalyticsAccount()


{
DefaultDataLakeStoreAccount = adls,
Location = location
};

adlaClient.Account.Create(rg, adla, new_adla_params);

List Data Lake Store accounts

var adlsAccounts = adlsAccountClient.Account.List().ToList();


foreach (var adls in adlsAccounts)
{
Console.WriteLine($"ADLS: {0}", adls.Name);
}

List Data Lake Analytics accounts

var adlaAccounts = adlaClient.Account.List().ToList();

foreach (var adla in adlaAccounts)
{
Console.WriteLine($"ADLA: {adla.Name}");
}

Checking if an account exists

bool exists = adlaClient.Account.Exists(rg, adla);

Get information about an account

bool exists = adlaClient.Account.Exists(rg, adla);


if (exists)
{
var adla_accnt = adlaClient.Account.Get(rg, adla);
}

Delete an account
if (adlaClient.Account.Exists(rg, adla))
{
adlaClient.Account.Delete(rg, adla);
}

Get the default Data Lake Store account


Every Data Lake Analytics account requires a default Data Lake Store account. Use this code to determine the
default Store account for an Analytics account.

if (adlaClient.Account.Exists(rg, adla))
{
var adla_accnt = adlaClient.Account.Get(rg, adla);
string def_adls_account = adla_accnt.DefaultDataLakeStoreAccount;
}

Manage data sources


Data Lake Analytics currently supports the following data sources:
Azure Data Lake Store
Azure Storage Account
Link to an Azure Storage account
You can create links to Azure Storage accounts.

string storage_key = "xxxxxxxxxxxxxxxxxxxx";


string storage_account = "mystorageaccount";
var addParams = new AddStorageAccountParameters(storage_key);
adlaClient.StorageAccounts.Add(rg, adla, storage_account, addParams);

List Azure Storage data sources

var stg_accounts = adlaAccountClient.StorageAccounts.ListByAccount(rg, adla);

if (stg_accounts != null)
{
foreach (var stg_account in stg_accounts)
{
Console.WriteLine($"Storage account: {0}", stg_account.Name);
}
}

List Data Lake Store data sources

var adls_accounts = adlsClient.Account.List();

if (adls_accounts != null)
{
foreach (var adls_accnt in adls_accounts)
{
Console.WriteLine($"ADLS account: {0}", adls_accnt.Name);
}
}

Upload and download folders and files


You can use the Data Lake Store file system client management object to upload and download individual files
or folders from Azure to your local computer, using the following methods:
UploadFolder
UploadFile
DownloadFolder
DownloadFile
The first parameter for these methods is the name of the Data Lake Store Account, followed by parameters for
the source path and the destination path.
The following example shows how to download a folder in the Data Lake Store.

adlsFileSystemClient.FileSystem.DownloadFolder(adls, sourcePath, destinationPath);
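
Uploads work the same way. The following sketch (the local path is illustrative) uploads a local folder to the store:

adlsFileSystemClient.FileSystem.UploadFolder(adls, @"C:\myData", "/myData");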

Create a file in a Data Lake Store account

using (var memstream = new MemoryStream())


{
using (var sw = new StreamWriter(memstream, UTF8Encoding.UTF8))
{
sw.WriteLine("Hello World");
sw.Flush();

memstream.Position = 0;

adlsFileSystemClient.FileSystem.Create(adls, "/Samples/Output/randombytes.csv", memstream);


}
}

Verify Azure Storage account paths


The following code checks if an Azure Storage account (storage_account) is linked to a Data Lake Analytics
account (adla), and if a container (storage_container) exists in that Azure Storage account.

string storage_account = "mystorageaccount";


string storage_container = "mycontainer";
bool accountExists = adlaClient.Account.StorageAccountExists(rg, adla, storage_account);
bool containerExists = adlaClient.Account.StorageContainerExists(rg, adla, storage_account, storage_container);

Manage catalog and jobs


The DataLakeAnalyticsCatalogManagementClient object provides methods for managing the SQL database
provided for each Azure Data Lake Analytics account. The DataLakeAnalyticsJobManagementClient provides
methods to submit and manage jobs run on the database with U-SQL scripts.
List databases and schemas
Among the several things you can list, the most common are databases and their schema. The following code
obtains a collection of databases, and then enumerates the schema for each database.
var databases = adlaCatalogClient.Catalog.ListDatabases(adla);
foreach (var db in databases)
{
Console.WriteLine($"Database: {db.Name}");
Console.WriteLine(" - Schemas:");
var schemas = adlaCatalogClient.Catalog.ListSchemas(adla, db.Name);
foreach (var schm in schemas)
{
Console.WriteLine($"\t{schm.Name}");
}
}

List table columns


The following code shows how to access the database with a Data Lake Analytics Catalog management client to
list the columns in a specified table.

var tbl = adlaCatalogClient.Catalog.GetTable(adla, "master", "dbo", "MyTableName");


IEnumerable<USqlTableColumn> columns = tbl.ColumnList;

foreach (USqlTableColumn utc in columns)


{
Console.WriteLine($"\t{utc.Name}");
}

Submit a U -SQL job


The following code shows how to use a Data Lake Analytics Job management client to submit a job.

string scriptPath = "/Samples/Scripts/SearchResults_Wikipedia_Script.txt";


Stream scriptStrm = adlsFileSystemClient.FileSystem.Open(_adlsAccountName, scriptPath);
string scriptTxt = string.Empty;
using (StreamReader sr = new StreamReader(scriptStrm))
{
scriptTxt = sr.ReadToEnd();
}

var jobName = "SR_Wikipedia";


var jobId = Guid.NewGuid();
var properties = new USqlJobProperties(scriptTxt);
var parameters = new JobInformation(jobName, JobType.USql, properties, priority: 1, degreeOfParallelism: 1,
jobId: jobId);
var jobInfo = adlaJobClient.Job.Create(adla, jobId, parameters);
Console.WriteLine($"Job {jobName} submitted.");

List failed jobs


The following code lists information about jobs that failed.

var odq = new ODataQuery<JobInformation> { Filter = "result eq 'Failed'" };


var jobs = adlaJobClient.Job.List(adla, odq);
foreach (var j in jobs)
{
Console.WriteLine($"{j.Name}\t{j.JobId}\t{j.Type}\t{j.StartTime}\t{j.EndTime}");
}

List pipelines
The following code lists information about each pipeline of jobs submitted to the account.
var pipelines = adlaJobClient.Pipeline.List(adla);
foreach (var p in pipelines)
{
Console.WriteLine($"Pipeline: {p.Name}\t{p.PipelineId}\t{p.LastSubmitTime}");
}

List recurrences
The following code lists information about each recurrence of jobs submitted to the account.

var recurrences = adlaJobClient.Recurrence.List(adla);


foreach (var r in recurrences)
{
Console.WriteLine($"Recurrence: {r.Name}\t{r.RecurrenceId}\t{r.LastSubmitTime}");
}

Common graph scenarios


Look up user in the AAD directory

var userinfo = graphClient.Users.Get( "[email protected]" );

Get the ObjectId of a user in the AAD directory

var userinfo = graphClient.Users.Get( "[email protected]" );


Console.WriteLine( userinfo.ObjectId );

Manage compute policies


The DataLakeAnalyticsAccountManagementClient object provides methods for managing the compute policies
for a Data Lake Analytics account.
List compute policies
The following code retrieves a list of compute policies for a Data Lake Analytics account.

var policies = adlaAccountClient.ComputePolicies.ListByAccount(rg, adla);


foreach (var p in policies)
{
Console.WriteLine($"Name: {p.Name}\tType: {p.ObjectType}\tMax AUs / job:
{p.MaxDegreeOfParallelismPerJob}\tMin priority / job: {p.MinPriorityPerJob}");
}

Create a new compute policy


The following code creates a new compute policy for a Data Lake Analytics account, setting the maximum AUs
available to the specified user to 50, and the minimum job priority to 250.

var userAadObjectId = "3b097601-4912-4d41-b9d2-78672fc2acde";


var newPolicyParams = new ComputePolicyCreateOrUpdateParameters(userAadObjectId, "User", 50, 250);
adlaAccountClient.ComputePolicies.CreateOrUpdate(rg, adla, "GaryMcDaniel", newPolicyParams);

Next steps
Overview of Microsoft Azure Data Lake Analytics
Manage Azure Data Lake Analytics using Azure portal
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Manage Azure Data Lake Analytics using Python
12/10/2021 • 4 minutes to read • Edit Online

This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs by using
Python.

Supported Python versions


Use a 64-bit version of Python.
You can use the standard Python distribution found at Python.org downloads .
Many developers find it convenient to use the Anaconda Python distribution .
This article was written using Python version 3.6 from the standard Python distribution

Install Azure Python SDK


Install the following modules:
The azure-mgmt-resource module includes other Azure modules for Active Directory, etc.
The azure-datalake-store module includes the Azure Data Lake Store filesystem operations.
The azure-mgmt-datalake-store module includes the Azure Data Lake Store account management
operations.
The azure-mgmt-datalake-analytics module includes the Azure Data Lake Analytics operations.
First, ensure you have the latest pip by running the following command:

python -m pip install --upgrade pip

This document was written using pip version 9.0.1 .


Use the following pip commands to install the modules from the commandline:

pip install azure-mgmt-resource


pip install azure-datalake-store
pip install azure-mgmt-datalake-store
pip install azure-mgmt-datalake-analytics

Create a new Python script


Paste the following code into the script:
# Use this only for Azure AD service-to-service authentication
#from azure.common.credentials import ServicePrincipalCredentials

# Use this only for Azure AD end-user authentication


#from azure.common.credentials import UserPassCredentials

# Required for Azure Resource Manager


from azure.mgmt.resource.resources import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup

# Required for Azure Data Lake Store account management


from azure.mgmt.datalake.store import DataLakeStoreAccountManagementClient
from azure.mgmt.datalake.store.models import DataLakeStoreAccount

# Required for Azure Data Lake Store filesystem management


from azure.datalake.store import core, lib, multithread

# Required for Azure Data Lake Analytics account management


from azure.mgmt.datalake.analytics.account import DataLakeAnalyticsAccountManagementClient
from azure.mgmt.datalake.analytics.account.models import DataLakeAnalyticsAccount,
DataLakeStoreAccountInformation

# Required for Azure Data Lake Analytics job management


from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, JobState, USqlJobProperties

# Required for Azure Data Lake Analytics catalog management


from azure.mgmt.datalake.analytics.catalog import DataLakeAnalyticsCatalogManagementClient

# Use these as needed for your application


import logging
import getpass
import pprint
import uuid
import time

Run this script to verify that the modules can be imported.

Authentication
Interactive user authentication with a pop-up
This method is not supported.
Interactive user authentication with a device code

user = input(
'Enter the user to authenticate with that has permission to subscription: ')
password = getpass.getpass()
credentials = UserPassCredentials(user, password)

Noninteractive authentication with SPI and a secret

credentials = ServicePrincipalCredentials(
client_id='FILL-IN-HERE', secret='FILL-IN-HERE', tenant='FILL-IN-HERE')

Noninteractive authentication with SPI and a certificate


This method is not supported.

Common script variables


These variables are used in the samples.

subid = '<Azure Subscription ID>'


rg = '<Azure Resource Group Name>'
location = '<Location>' # i.e. 'eastus2'
adls = '<Azure Data Lake Store Account Name>'
adla = '<Azure Data Lake Analytics Account Name>'

Create the clients


resourceClient = ResourceManagementClient(credentials, subid)
adlsAcctClient = DataLakeStoreAccountManagementClient(credentials, subid)
adlaAcctClient = DataLakeAnalyticsAccountManagementClient(credentials, subid)
adlaJobClient = DataLakeAnalyticsJobManagementClient(
    credentials, 'azuredatalakeanalytics.net')

Create an Azure Resource Group


armGroupResult = resourceClient.resource_groups.create_or_update(
rg, ResourceGroup(location=location))

Create Data Lake Analytics account


First create a store account.

adlsAcctResult = adlsAcctClient.account.create(
    rg,
    adls,
    DataLakeStoreAccount(
        location=location)
).wait()

Then create an ADLA account that uses that store.

adlaAcctResult = adlaAcctClient.account.create(
rg,
adla,
DataLakeAnalyticsAccount(
location=location,
default_data_lake_store_account=adls,
data_lake_store_accounts=[DataLakeStoreAccountInformation(name=adls)]
)
).wait()

Submit a job
script = """
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();
"""

jobId = str(uuid.uuid4())
jobResult = adlaJobClient.job.create(
adla,
jobId,
JobInformation(
name='Sample Job',
type='USql',
properties=USqlJobProperties(script=script)
)
)

Wait for a job to end


jobResult = adlaJobClient.job.get(adla, jobId)
while(jobResult.state != JobState.ended):
print('Job is not yet done, waiting for 3 seconds. Current state: ' +
jobResult.state.value)
time.sleep(3)
jobResult = adlaJobClient.job.get(adla, jobId)

print('Job finished with result: ' + jobResult.result.value)

List pipelines and recurrences


Depending whether your jobs have pipeline or recurrence metadata attached, you can list pipelines and
recurrences.

pipelines = adlaJobClient.pipeline.list(adla)
for p in pipelines:
    print('Pipeline: ' + p.name + ' ' + str(p.pipeline_id))

recurrences = adlaJobClient.recurrence.list(adla)
for r in recurrences:
    print('Recurrence: ' + r.name + ' ' + str(r.recurrence_id))

Manage compute policies


The DataLakeAnalyticsAccountManagementClient object provides methods for managing the compute policies
for a Data Lake Analytics account.
List compute policies
The following code retrieves a list of compute policies for a Data Lake Analytics account.
policies = adlaAcctClient.compute_policies.list_by_account(rg, adla)
for p in policies:
    print('Name: ' + p.name + ' Type: ' + p.object_type + ' Max AUs / job: ' +
          str(p.max_degree_of_parallelism_per_job) + ' Min priority / job: ' + str(p.min_priority_per_job))

Create a new compute policy


The following code creates a new compute policy for a Data Lake Analytics account, setting the maximum AUs
available to the specified user to 50, and the minimum job priority to 250.

userAadObjectId = "3b097601-4912-4d41-b9d2-78672fc2acde"
newPolicyParams = ComputePolicyCreateOrUpdateParameters(
userAadObjectId, "User", 50, 250)
adlaAcctClient.compute_policies.create_or_update(
    rg, adla, "GaryMcDaniel", newPolicyParams)

Next steps
To see the same tutorial using other tools, click the tab selectors on the top of the page.
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
Manage Azure Data Lake Analytics using a Java app
12/10/2021 • 5 minutes to read • Edit Online

This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using an
app written using the Azure Java SDK.

Prerequisites
Java Development Kit (JDK) 8 (using Java version 1.8).
IntelliJ or another suitable Java development environment. The instructions in this document use IntelliJ.
Create an Azure Active Directory (AAD) application and retrieve its Client ID , Tenant ID , and Key . For more
information about AAD applications and instructions on how to get a client ID, see Create Active Directory
application and service principal using portal. The Reply URI and Key is available from the portal once you
have the application created and key generated.

Authenticating using Azure Active Directory


The following code snippet provides an example of non-interactive authentication, where the application provides
its own credentials.

Create a Java application


1. Open IntelliJ and create a Java project using the Command-Line App template.
2. Right-click on the project on the left-hand side of your screen and click Add Framework Support . Choose
Maven and click OK .
3. Open the newly created "pom.xml" file and add the following snippet of text between the </version> tag
and the </project> tag:
<dependencies>
<dependency>
<groupId>com.microsoft.azure</groupId>
<artifactId>azure-client-authentication</artifactId>
<version>1.6.12</version>
</dependency>
<dependency>
<groupId>com.microsoft.azure</groupId>
<artifactId>azure-client-runtime</artifactId>
<version>1.6.12</version>
</dependency>
<dependency>
<groupId>com.microsoft.rest</groupId>
<artifactId>client-runtime</artifactId>
<version>1.6.12</version>
</dependency>
<dependency>
<groupId>com.microsoft.azure</groupId>
<artifactId>azure-mgmt-datalake-store</artifactId>
<version>1.22.0</version>
</dependency>
<dependency>
<groupId>com.microsoft.azure</groupId>
<artifactId>azure-mgmt-datalake-analytics</artifactId>
<version>1.22.0</version>
</dependency>
<dependency>
<groupId>com.microsoft.azure</groupId>
<artifactId>azure-data-lake-store-sdk</artifactId>
<version>2.3.6</version>
<exclusions>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
</exclusion>
</exclusions>
</dependency>
</dependencies>

Go to File > Settings > Build > Execution > Deployment . Select Build Tools > Maven > Importing .
Then check Import Maven projects automatically .
Open Main.java and replace the existing code block with the following code:

import com.microsoft.azure.CloudException;
import com.microsoft.azure.credentials.ApplicationTokenCredentials;
import com.microsoft.azure.datalake.store.*;
import com.microsoft.azure.datalake.store.oauth2.*;
import com.microsoft.azure.management.datalake.analytics.implementation.*;
import com.microsoft.azure.management.datalake.store.*;
import com.microsoft.azure.management.datalake.store.implementation.*;
import com.microsoft.azure.management.datalake.store.models.*;
import com.microsoft.azure.management.datalake.analytics.*;
import com.microsoft.azure.management.datalake.analytics.models.*;
import com.microsoft.rest.credentials.ServiceClientCredentials;

import java.io.*;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.UUID;
import java.util.List;

public class Main {


private static String _adlsAccountName;
private static String _adlaAccountName;
private static String _resourceGroupName;
private static String _location;

private static String _tenantId;


private static String _subscriptionId;
private static String _clientId;
private static String _clientSecret;

private static DataLakeStoreAccountManagementClient _adlsClient;


private static ADLStoreClient _adlsStoreClient;
private static DataLakeAnalyticsAccountManagementClient _adlaClient;
private static DataLakeAnalyticsJobManagementClient _adlaJobClient;

public static void main(String[] args) throws Exception {


_adlsAccountName = "<DATA-LAKE-STORE-NAME>";
_adlaAccountName = "<DATA-LAKE-ANALYTICS-NAME>";
_resourceGroupName = "<RESOURCE-GROUP-NAME>";
_location = "East US 2";

_tenantId = "<TENANT-ID>";
_subscriptionId = "<SUBSCRIPTION-ID>";
_clientId = "<CLIENT-ID>";

_clientSecret = "<CLIENT-SECRET>";

String localFolderPath = "C:\\local_path\\";

// ----------------------------------------
// Authenticate
// ----------------------------------------
ApplicationTokenCredentials creds = new ApplicationTokenCredentials(_clientId, _tenantId,
_clientSecret, null);
SetupClients(creds);

// ----------------------------------------
// List Data Lake Store and Analytics accounts that this app can access
// ----------------------------------------
System.out.println(String.format("All ADL Store accounts that this app can access in subscription
%s:", _subscriptionId));
List<DataLakeStoreAccountBasic> adlsListResult = _adlsClient.accounts().list();
for (DataLakeStoreAccountBasic acct : adlsListResult) {
System.out.println(acct.name());
}

System.out.println(String.format("All ADL Analytics accounts that this app can access in


subscription %s:", _subscriptionId));
List<DataLakeAnalyticsAccountBasic> adlaListResult = _adlaClient.accounts().list();
for (DataLakeAnalyticsAccountBasic acct : adlaListResult) {
System.out.println(acct.name());
}
WaitForNewline("Accounts displayed.", "Creating files.");

// ----------------------------------------
// Create a file in Data Lake Store: input1.csv
// ----------------------------------------
CreateFile("/input1.csv", "123,abc", true);
WaitForNewline("File created.", "Submitting a job.");

// ----------------------------------------
// Submit a job to Data Lake Analytics
// ----------------------------------------
String script = "@input = EXTRACT Row1 string, Row2 string FROM \"/input1.csv\" USING
Extractors.Csv(); OUTPUT @input TO @\"/output1.csv\" USING Outputters.Csv();";
UUID jobId = SubmitJobByScript(script, "testJob");
WaitForNewline("Job submitted.", "Getting job status.");

// ----------------------------------------
// Wait for job completion and output job status
// ----------------------------------------
System.out.println(String.format("Job status: %s", GetJobStatus(jobId)));
System.out.println(String.format("Job status: %s", GetJobStatus(jobId)));
System.out.println("Waiting for job completion.");
WaitForJob(jobId);
System.out.println(String.format("Job status: %s", GetJobStatus(jobId)));
WaitForNewline("Job completed.", "Downloading job output.");

// ----------------------------------------
// Download job output from Data Lake Store
// ----------------------------------------
DownloadFile("/output1.csv", localFolderPath + "output1.csv");
WaitForNewline("Job output downloaded.", "Deleting file.");

DeleteFile("/output1.csv");
WaitForNewline("File deleted.", "Done.");
}

public static void SetupClients(ServiceClientCredentials creds) {


_adlsClient = new DataLakeStoreAccountManagementClientImpl(creds);
_adlaClient = new DataLakeAnalyticsAccountManagementClientImpl(creds);
_adlaJobClient = new DataLakeAnalyticsJobManagementClientImpl(creds);
_adlsClient.withSubscriptionId(_subscriptionId);
_adlaClient.withSubscriptionId(_subscriptionId);

String authEndpoint = "https://login.microsoftonline.com/" + _tenantId + "/oauth2/token";


AccessTokenProvider provider = new ClientCredsTokenProvider(authEndpoint, _clientId, _clientSecret);
_adlsStoreClient = ADLStoreClient.createClient(_adlsAccountName + ".azuredatalakestore.net",
provider);
}

public static void WaitForNewline(String reason, String nextAction) {


if (nextAction == null)
nextAction = "";

System.out.println(reason + "\r\nPress ENTER to continue...");


try {
System.in.read();
} catch (Exception e) {
}

if (!nextAction.isEmpty()) {
System.out.println(nextAction);
}
}

// Create accounts
public static void CreateAccounts() throws InterruptedException, CloudException, IOException {
// Create ADLS account
CreateDataLakeStoreAccountParameters adlsParameters = new CreateDataLakeStoreAccountParameters();
adlsParameters.withLocation(_location);

_adlsClient.accounts().create(_resourceGroupName, _adlsAccountName, adlsParameters);

// Create ADLA account


AddDataLakeStoreWithAccountParameters adlsInfo = new AddDataLakeStoreWithAccountParameters();
adlsInfo.withName(_adlsAccountName);

List<AddDataLakeStoreWithAccountParameters> adlsInfoList = new ArrayList<>();


adlsInfoList.add(adlsInfo);

CreateDataLakeAnalyticsAccountParameters adlaParameters = new


CreateDataLakeAnalyticsAccountParameters();
adlaParameters.withLocation(_location);
adlaParameters.withDefaultDataLakeStoreAccount(_adlsAccountName);
adlaParameters.withDataLakeStoreAccounts(adlsInfoList);

_adlaClient.accounts().create(_resourceGroupName, _adlaAccountName, adlaParameters);


}

// Create a file
public static void CreateFile(String path, String contents, boolean force) throws IOException,
CloudException {
byte[] bytesContents = contents.getBytes();

ADLFileOutputStream stream = _adlsStoreClient.createFile(path, IfExists.OVERWRITE, "777", force);


stream.write(bytesContents);
stream.close();
}

// Delete a file
public static void DeleteFile(String filePath) throws IOException, CloudException {
_adlsStoreClient.delete(filePath);
}

// Download a file
private static void DownloadFile(String srcPath, String destPath) throws IOException, CloudException {
ADLFileInputStream stream = _adlsStoreClient.getReadStream(srcPath);

PrintWriter pWriter = new PrintWriter(destPath, Charset.defaultCharset().name());

String fileContents = "";


if (stream != null) {
Writer writer = new StringWriter();

char[] buffer = new char[1024];


try {
Reader reader = new BufferedReader(
new InputStreamReader(stream, "UTF-8"));
int n;
while ((n = reader.read(buffer)) != -1) {
writer.write(buffer, 0, n);
}
} finally {
stream.close();
}
fileContents = writer.toString();
}

pWriter.println(fileContents);
pWriter.close();
}

// Submit a U-SQL job


private static UUID SubmitJobByScript(String script, String jobName) throws IOException, CloudException
{
UUID jobId = java.util.UUID.randomUUID();
CreateJobProperties properties = new CreateUSqlJobProperties();
properties.withScript(script);
CreateJobParameters parameters = new CreateJobParameters();
parameters.withName(jobName);
parameters.withType(JobType.USQL);
parameters.withProperties(properties);

JobInformation jobInfo = _adlaJobClient.jobs().create(_adlaAccountName, jobId, parameters);

return jobInfo.jobId();
}

// Wait for job completion


private static JobResult WaitForJob(UUID jobId) throws IOException, CloudException {
JobInformation jobInfo = _adlaJobClient.jobs().get(_adlaAccountName, jobId);
while (jobInfo.state() != JobState.ENDED) {
jobInfo = _adlaJobClient.jobs().get(_adlaAccountName, jobId);
}
return jobInfo.result();
}

// Retrieve job status


private static String GetJobStatus(UUID jobId) throws IOException, CloudException {
JobInformation jobInfo = _adlaJobClient.jobs().get(_adlaAccountName, jobId);
return jobInfo.state().toString();
}
}

Provide the values for parameters called out in the code snippet:
localFolderPath
_adlaAccountName
_adlsAccountName
_resourceGroupName
_tenantId
_subscriptionId
_clientId
_clientSecret

Next steps
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language, and U-SQL language
reference.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
To get an overview of Data Lake Analytics, see Azure Data Lake Analytics overview.
Manage Azure Data Lake Analytics using Azure
SDK for Node.js
12/10/2021 • 2 minutes to read • Edit Online

This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using an
app written using the Azure SDK for Node.js.
The following versions are supported:
Node.js version: 0.10.0 or higher
REST API version for Account: 2015-10-01-preview
REST API version for Catalog: 2015-10-01-preview
REST API version for Job: 2016-03-20-preview

Features
Account management: create, get, list, update, and delete.
Job management: submit, get, list, and cancel.
Catalog management: get and list.

How to Install
npm install azure-arm-datalake-analytics

Authenticate using Azure Active Directory


var msRestAzure = require('ms-rest-azure');
//user authentication
var credentials = new msRestAzure.UserTokenCredentials('your-client-id', 'your-domain', 'your-username',
'your-password', 'your-redirect-uri');
//service principal authentication
var credentials = new msRestAzure.ApplicationTokenCredentials('your-client-id', 'your-domain', 'your-
secret');

Create the Data Lake Analytics client


var adlaManagement = require("azure-arm-datalake-analytics");
var accountClient = new adlaManagement.DataLakeAnalyticsAccountClient(credentials, 'your-subscription-id');
var jobClient = new adlaManagement.DataLakeAnalyticsJobClient(credentials, 'azuredatalakeanalytics.net');
var catalogClient = new adlaManagement.DataLakeAnalyticsCatalogClient(credentials,
'azuredatalakeanalytics.net');

Create a Data Lake Analytics account


var util = require('util');
var resourceGroupName = 'testrg';
var accountName = 'testadlaacct';
var location = 'eastus2';

// A Data Lake Store account must already have been created to create
// a Data Lake Analytics account. See the Data Lake Store readme for
// information on doing so. For now, we assume one exists already.
var datalakeStoreAccountName = 'existingadlsaccount';

// account object to create


var accountToCreate = {
tags: {
testtag1: 'testvalue1',
testtag2: 'testvalue2'
},
name: accountName,
location: location,
properties: {
defaultDataLakeStoreAccount: datalakeStoreAccountName,
dataLakeStoreAccounts: [
{
name: datalakeStoreAccountName
}
]
}
};

accountClient.account.create(resourceGroupName, accountName, accountToCreate, function (err, result, request,


response) {
if (err) {
console.log(err);
/*err has reference to the actual request and response, so you can see what was sent and received on the
wire.
The structure of err looks like this:
err: {
code: 'Error Code',
message: 'Error Message',
body: 'The response body if any',
request: reference to a stripped version of http request
response: reference to a stripped version of the response
}
*/
} else {
console.log('result is: ' + util.inspect(result, {depth: null}));
}
});

Get a list of jobs


var util = require('util');
var accountName = 'testadlaacct';
jobClient.job.list(accountName, function (err, result, request, response) {
if (err) {
console.log(err);
} else {
console.log('result is: ' + util.inspect(result, {depth: null}));
}
});

Get a list of databases in the Data Lake Analytics Catalog


var util = require('util');
var accountName = 'testadlaacct';
catalogClient.catalog.listDatabases(accountName, function (err, result, request, response) {
if (err) {
console.log(err);
} else {
console.log('result is: ' + util.inspect(result, {depth: null}));
}
});

See also
Microsoft Azure SDK for Node.js
Adding a user in the Azure portal
12/10/2021 • 2 minutes to read • Edit Online

Start the Add User Wizard


1. Open your Azure Data Lake Analytics account via https://portal.azure.com.
2. Click Add User Wizard .
3. In the Select user step, find the user you want to add. Click Select .
4. In the Select role step, pick Data Lake Analytics Developer . This role has the minimum set of permissions
required to submit/monitor/manage U-SQL jobs. Assign to this role if the group is not intended for
managing Azure services.
5. In the Select catalog permissions step, select any additional databases that the user will need access to. Read
and Write Access to the default static database called "master" is required to submit jobs. When you are
done, click OK .
6. In the final step called Assign selected permissions review the changes the wizard will make. Click OK .

Configure ACLs for data folders


Grant "R-X" or "RWX", as needed, on folders containing input data and output data.

Optionally, add the user to the Azure Data Lake Storage Gen1 Reader role
1. Find your Azure Data Lake Storage Gen1 account.
2. Click on Users .
3. Click Add .
4. Select an Azure role to assign this group.
5. Assign to Reader role. This role has the minimum set of permissions required to browse/manage data stored
in ADLSGen1. Assign to this role if the Group is not intended for managing Azure services.
6. Type in the name of the Group.
7. Click OK .

Adding a user using PowerShell


1. Follow the instructions in this guide: How to install and configure Azure PowerShell.
2. Download the Add-AdlaJobUser.ps1 PowerShell script.
3. Run the PowerShell script.
The sample command to give user access to submit jobs, view new job metadata, and view old metadata is:
Add-AdlaJobUser.ps1 -Account myadlsaccount -EntityToAdd 546e153e-0ecf-417b-ab7f-aa01ce4a7bff -EntityType User
-FullReplication

Next steps
Overview of Azure Data Lake Analytics
Get started with Data Lake Analytics by using the Azure portal
Manage Azure Data Lake Analytics by using Azure PowerShell
Manage Azure Data Lake Analytics using Account
Policies
12/10/2021 • 4 minutes to read • Edit Online

Account policies help you control how the resources in an Azure Data Lake Analytics account are used. These
policies allow you to control the cost of using Azure Data Lake Analytics. For example, with these policies you can
prevent unexpected cost spikes by limiting how many AUs the account can simultaneously use.

Account-level policies
These policies apply to all jobs in a Data Lake Analytics account.

Maximum number of AUs in a Data Lake Analytics account


A policy controls the total number of Analytics Units (AUs) your Data Lake Analytics account can use. By default,
the value is set to 250. For example, if this value is set to 250 AUs, you can have one job running with 250 AUs
assigned to it, or 10 jobs running with 25 AUs each. Additional jobs that are submitted are queued until the
running jobs are finished. When running jobs are finished, AUs are freed up for the queued jobs to run.
To change the number of AUs for your Data Lake Analytics account:
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click Limits and policies .
3. Under Maximum AUs , move the slider to select a value, or enter the value in the text box.
4. Click Save .

NOTE
If you need more than the default (250) AUs, in the portal, click Help+Support to submit a support request. The
number of AUs available in your Data Lake Analytics account can be increased.

Maximum number of jobs that can run simultaneously


This policy limits how many jobs can run simultaneously. By default, this value is set to 20. If your Data Lake
Analytics has AUs available, new jobs are scheduled to run immediately until the total number of running jobs
reaches the value of this policy. When you reach the maximum number of jobs that can run simultaneously,
subsequent jobs are queued in priority order until one or more running jobs complete (depending on available
AUs).
To change the number of jobs that can run simultaneously:
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click Limits and policies .
3. Under Maximum Number of Running Jobs , move the slider to select a value, or enter the value in the
text box.
4. Click Save .
NOTE
If you need to run more than the default (20) number of jobs, in the portal, click Help+Support to submit a
support request. The number of jobs that can run simultaneously in your Data Lake Analytics account can be
increased.

How long to keep job metadata and resources


When your users run U-SQL jobs, the Data Lake Analytics service keeps all related files. These files include the
U-SQL script, the DLL files referenced in the U-SQL script, compiled resources, and statistics. The files are in the
/system/ folder of the default Azure Data Lake Storage account. This policy controls how long these resources
are stored before they are automatically deleted (the default is 30 days). You can use these files for debugging,
and for performance-tuning of jobs that you'll rerun in the future.
To change how long to keep job metadata and resources:
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click Limits and policies .
3. Under Days to Retain Job Queries , move the slider to select a value, or enter the value in the text box.
4. Click Save .

Job-level policies
Job-level policies allow you to control the maximum AUs and the maximum priority that individual users (or
members of specific security groups) can set on jobs that they submit. This policy lets you control the costs
incurred by users. It also lets you control the effect that scheduled jobs might have on high-priority production
jobs that are running in the same Data Lake Analytics account.
Data Lake Analytics has two policies that you can set at the job level:
AU limit per job : Users can only submit jobs that have up to this number of AUs. By default, this limit is
the same as the maximum AU limit for the account.
Priority : Users can only submit jobs that have a priority lower than or equal to this value. A higher
number indicates a lower priority. By default, this limit is set to 1, which is the highest possible priority.
There is a default policy set on every account. The default policy applies to all users of the account. You can
create additional policies for specific users and groups.

NOTE
Account-level policies and job-level policies apply simultaneously.

Add a policy for a specific user or group


1. In the Azure portal, go to your Data Lake Analytics account.
2. Click Limits and policies .
3. Under Job Submission Limits , click the Add Policy button. Then, select or enter the following settings:
a. Compute Policy Name : Enter a policy name, to remind you of the purpose of the policy.
b. Select User or Group : Select the user or group this policy applies to.
c. Set the Job AU Limit : Set the AU limit that applies to the selected user or group.
d. Set the Priority Limit : Set the priority limit that applies to the selected user or group.
4. Click Ok .
5. The new policy is listed in the Default policy table, under Job Submission Limits .

Delete or edit an existing policy


1. In the Azure portal, go to your Data Lake Analytics account.
2. Click Limits and policies .
3. Under Job Submission Limits , find the policy you want to edit.
4. To see the Delete and Edit options, in the rightmost column of the table, click ... .

Additional resources for job policies
Policy overview blog post
Account-level policies blog post
Job-level policies blog post

Next steps
Overview of Azure Data Lake Analytics
Get started with Data Lake Analytics by using the Azure portal
Manage Azure Data Lake Analytics by using Azure PowerShell
Configure user access to job information in Azure
Data Lake Analytics
12/10/2021 • 2 minutes to read • Edit Online

In Azure Data Lake Analytics, you can use multiple user accounts or service principals to run jobs.
In order for those same users to see the detailed job information, the users need to be able to read the contents
of the job folders. The job folders are located in /system/ directory.
If the necessary permissions are not configured, the user may see an error:
Graph data not available - You don't have permissions to access the graph data.

Configure user access to job information


You can use the Add User Wizard to configure the ACLs on the folders. For more information, see Add a new
user.
If you need more granular control, or need to script the permissions, then secure the folders as follows:
1. Grant execute permissions (via an access ACL) on the root folder:
/
2. Grant execute and read permissions (via an access ACL and a default ACL) on the folders that contain
the job folders. For example, for a specific job that ran on May 25, 2018, these folders need to be
accessed:
/system
/system/jobservice
/system/jobservice/jobs
/system/jobservice/jobs/Usql
/system/jobservice/jobs/Usql/2018
/system/jobservice/jobs/Usql/2018/05
/system/jobservice/jobs/Usql/2018/05/25
/system/jobservice/jobs/Usql/2018/05/25/11
/system/jobservice/jobs/Usql/2018/05/25/11/01
/system/jobservice/jobs/Usql/2018/05/25/11/01/b074bd7a-1448-d879-9d75-f562b101bd3d

Next steps
Add a new user
Accessing diagnostic logs for Azure Data Lake
Analytics
12/10/2021 • 5 minutes to read • Edit Online

Diagnostic logging allows you to collect data access audit trails. These logs provide information such as:
A list of users that accessed the data.
How frequently the data is accessed.
How much data is stored in the account.

Enable logging
1. Sign on to the Azure portal.
2. Open your Data Lake Analytics account and select Diagnostic logs from the Monitor section. Next,
select Turn on diagnostics .

3. From Diagnostics settings , enter a Name for this logging configuration and then select logging
options.

You can choose to store/process the data in three different ways.


Select Archive to a storage account to store logs in an Azure storage account. Use this
option if you want to archive the data. If you select this option, you must provide an Azure
storage account to save the logs to.
Select Stream to an Event Hub to stream log data to an Azure Event Hub. Use this option
if you have a downstream processing pipeline that is analyzing incoming logs in real time. If
you select this option, you must provide the details for the Azure Event Hub you want to
use.
Select Send to Log Analytics to send the data to the Azure Monitor service. Use this
option if you want to use Azure Monitor logs to gather and analyze logs.
Specify whether you want to get audit logs or request logs or both. A request log captures every
API request. An audit log records all operations that are triggered by that API request.
For Archive to a storage account , specify the number of days to retain the data.
Click Save .

NOTE
You must select either Archive to a storage account , Stream to an Event Hub or Send to Log
Analytics before clicking the Save button.

Use the Azure Storage account that contains log data


1. To display the blob containers that hold logging data, open the Azure Storage account used for Data Lake
Analytics for logging, and then click Blobs .
The container insights-logs-audit contains the audit logs.
The container insights-logs-requests contains the request logs.
2. Within the containers, the logs are stored under the following file structure:

resourceId=/
SUBSCRIPTIONS/
<<SUBSCRIPTION_ID>>/
RESOURCEGROUPS/
<<RESOURCE_GRP_NAME>>/
PROVIDERS/
MICROSOFT.DATALAKEANALYTICS/
ACCOUNTS/
<<DATA_LAKE_ANALYTICS_NAME>>/
y=####/
m=##/
d=##/
h=##/
m=00/
PT1H.json

NOTE
The ## entries in the path contain the year, month, day, and hour in which the log was created. Data Lake
Analytics creates one file every hour, so m= always contains a value of 00 .

As an example, the complete path to an audit log could be:


https://adllogs.blob.core.windows.net/insights-logs-audit/resourceId=/SUBSCRIPTIONS/<sub-id>/RESOURCEGROUPS/myresourcegroup/PROVIDERS/MICROSOFT.DATALAKEANALYTICS/ACCOUNTS/mydatalakeanalytics/y=2016/m=07/d=18/h=04/m=00/PT1H.json

Similarly, the complete path to a request log could be:


https://adllogs.blob.core.windows.net/insights-logs-requests/resourceId=/SUBSCRIPTIONS/<sub-id>/RESOURCEGROUPS/myresourcegroup/PROVIDERS/MICROSOFT.DATALAKEANALYTICS/ACCOUNTS/mydatalakeanalytics/y=2016/m=07/d=18/h=14/m=00/PT1H.json

Log structure
The audit and request logs are in a structured JSON format.
Request logs
Here's a sample entry in the JSON-formatted request log. Each blob has one root object called records that
contains an array of log objects.
{
"records":
[
. . . .
,
{
"time": "2016-07-07T21:02:53.456Z",
"resourceId":
"/SUBSCRIPTIONS/<subscription_id>/RESOURCEGROUPS/<resource_group_name>/PROVIDERS/MICROSOFT.DATALAKEANALYTICS
/ACCOUNTS/<data_lake_analytics_account_name>",
"category": "Requests",
"operationName": "GetAggregatedJobHistory",
"resultType": "200",
"callerIpAddress": "::ffff:1.1.1.1",
"correlationId": "4a11c709-05f5-417c-a98d-6e81b3e29c58",
"identity": "1808bd5f-62af-45f4-89d8-03c5e81bac30",
"properties": {
"HttpMethod":"POST",
"Path":"/JobAggregatedHistory",
"RequestContentLength":122,
"ClientRequestId":"3b7adbd9-3519-4f28-a61c-bd89506163b8",
"StartTime":"2016-07-07T21:02:52.472Z",
"EndTime":"2016-07-07T21:02:53.456Z"
}
}
,
. . . .
]
}

Request log schema

| Name | Type | Description |
| --- | --- | --- |
| time | String | The timestamp (in UTC) of the log |
| resourceId | String | The identifier of the resource that the operation took place on |
| category | String | The log category. For example, Requests . |
| operationName | String | Name of the operation that is logged. For example, GetAggregatedJobHistory. |
| resultType | String | The status of the operation. For example, 200. |
| callerIpAddress | String | The IP address of the client making the request |
| correlationId | String | The identifier of the log. This value can be used to group a set of related log entries. |
| identity | Object | The identity that generated the log |
| properties | JSON | See the next section (Request log properties schema) for details |

Request log properties schema

| Name | Type | Description |
| --- | --- | --- |
| HttpMethod | String | The HTTP method used for the operation. For example, GET. |
| Path | String | The path the operation was performed on |
| RequestContentLength | int | The content length of the HTTP request |
| ClientRequestId | String | The identifier that uniquely identifies this request |
| StartTime | String | The time at which the server received the request |
| EndTime | String | The time at which the server sent a response |

Audit logs
Here's a sample entry in the JSON-formatted audit log. Each blob has one root object called records that
contains an array of log objects.

{
"records":
[
{
"time": "2016-07-28T19:15:16.245Z",
"resourceId":
"/SUBSCRIPTIONS/<subscription_id>/RESOURCEGROUPS/<resource_group_name>/PROVIDERS/MICROSOFT.DATALAKEANALYTICS
/ACCOUNTS/<data_lake_ANALYTICS_account_name>",
"category": "Audit",
"operationName": "JobSubmitted",
"identity": "[email protected]",
"properties": {
"JobId":"D74B928F-5194-4E6C-971F-C27026C290E6",
"JobName": "New Job",
"JobRuntimeName": "default",
"SubmitTime": "7/28/2016 7:14:57 PM"
}
}
]
}

Audit log schema

| Name | Type | Description |
| --- | --- | --- |
| time | String | The timestamp (in UTC) of the log |
| resourceId | String | The identifier of the resource that the operation took place on |
| category | String | The log category. For example, Audit . |
| operationName | String | Name of the operation that is logged. For example, JobSubmitted. |
| resultType | String | A substatus for the job status (operationName). |
| resultSignature | String | Additional details on the job status (operationName). |
| identity | String | The user that requested the operation. For example, [email protected]. |
| properties | JSON | See the next section (Audit log properties schema) for details |
NOTE
resultType and resultSignature provide information on the result of an operation, and only contain a value if an
operation has completed. For example, they only contain a value when operationName contains a value of JobStarted
or JobEnded .

Audit log properties schema

| Name | Type | Description |
| --- | --- | --- |
| JobId | String | The ID assigned to the job |
| JobName | String | The name that was provided for the job |
| JobRunTime | String | The runtime used to process the job |
| SubmitTime | String | The time (in UTC) that the job was submitted |
| StartTime | String | The time the job started running after submission (in UTC) |
| EndTime | String | The time the job ended |
| Parallelism | String | The number of Data Lake Analytics units requested for this job during submission |

NOTE
SubmitTime , StartTime , EndTime , and Parallelism provide information on an operation. These entries only contain a
value if that operation has started or completed. For example, SubmitTime only contains a value after operationName
has the value JobSubmitted .

Process the log data


Azure Data Lake Analytics provides a sample on how to process and analyze the log data. You can find the
sample at https://github.com/Azure/AzureDataLake/tree/master/Samples/AzureDiagnosticsSample.
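
If you would rather query the logs directly with U-SQL, the following is a minimal sketch of one possible approach. It assumes that an hourly PT1H.json audit log blob has been copied into your default Data Lake Store (the path below is hypothetical), and that you have registered the Newtonsoft.Json and Microsoft.Analytics.Samples.Formats assemblies from the U-SQL samples library, whose JsonExtractor accepts a JSON path such as records[*]:

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

// Hypothetical path to one hour's audit log copied into the default store.
DECLARE @auditLog string = "/logs/insights-logs-audit/2016/07/18/PT1H.json";

// Read one row per entry in the "records" array.
@audit =
    EXTRACT time string,
            operationName string,
            identity string
    FROM @auditLog
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("records[*]");

// Count job submissions per user.
@submissions =
    SELECT identity,
           COUNT(*) AS JobsSubmitted
    FROM @audit
    WHERE operationName == "JobSubmitted"
    GROUP BY identity;

OUTPUT @submissions
    TO "/output/job-submissions-per-user.csv"
    USING Outputters.Csv();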

Next steps
Overview of Azure Data Lake Analytics
Adjust quotas and limits in Azure Data Lake
Analytics
12/10/2021 • 2 minutes to read • Edit Online

Learn how to adjust and increase the quota and limits in Azure Data Lake Analytics (ADLA) accounts. Knowing
these limits will help you understand your U-SQL job behavior. All quota limits are soft, so you can increase the
maximum limits by contacting Azure support.

Azure subscriptions limits


Maximum number of ADLA accounts per subscription per region: 5
If you try to create a sixth ADLA account, you will get an error "You have reached the maximum number of Data
Lake Analytics accounts allowed (5) in region under subscription name".
If you want to go beyond this limit, you can try these options:
choose another region if suitable
contact Azure support by opening a support ticket to request a quota increase.

Default ADLA account limits


Maximum number of Analytics Units (AUs) per account: 250, default 32
This is the maximum number of AUs that can run concurrently in your account. If your total number of running
AUs across all jobs exceeds this limit, newer jobs are queued automatically. For example:
If you have only one job running with 32 AUs, when you submit a second job it will wait in the job queue
until the first job completes.
If you already have four jobs running and each is using 8 AUs, when you submit a fifth job that needs 8
AUs it waits in the job queue until there are 8 AUs available.
Maximum number of Analytics Units (AUs) per job: 250, default 32
This is the maximum number of AUs that each individual job can be assigned in your account. Jobs that are
assigned more than this limit will be rejected, unless the submitter is affected by a compute policy ( job
submission limit) that gives them more AUs per job. The upper bound of this value is the AU limit for the
account.
Maximum number of concurrent U-SQL jobs per account: 20
This is the maximum number of jobs that can run concurrently in your account. Above this value, newer jobs are
queued automatically.

Adjust ADLA account limits


1. Sign on to the Azure portal.
2. Choose an existing ADLA account.
3. Click Properties .
4. Adjust the values for Maximum AUs , Maximum number of running jobs , and Job submission limits
to suit your needs.

Increase maximum quota limits


You can find more information about Azure limits in the Azure service-specific limits documentation.
1. Open a support request in Azure portal.
2. Select the issue type Quota .
3. Select your Subscription (make sure it is not a "trial" subscription).
4. Select quota type Data Lake Analytics .

5. On the problem page, explain your requested limit increase and provide details of why you need this extra
capacity.
6. Verify your contact information and create the support request.
Microsoft reviews your request and tries to accommodate your business needs as soon as possible.

Next steps
Overview of Microsoft Azure Data Lake Analytics
Manage Azure Data Lake Analytics using Azure PowerShell
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Disaster recovery guidance for Azure Data Lake
Analytics
12/10/2021 • 2 minutes to read • Edit Online

Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data. Instead of deploying,
configuring, and tuning hardware, you write queries to transform your data and extract valuable insights. The
analytics service can handle jobs of any scale instantly by setting the dial for how much power you need. You
only pay for your job when it is running, making it cost-effective. This article provides guidance on how to
protect your jobs from rare region-wide outages or accidental deletions.

Disaster recovery guidance


When using Azure Data Lake Analytics, it's critical for you to prepare your own disaster recovery plan. This
article helps guide you to build a disaster recovery plan. There are additional resources that can help you create
your own plan:
Failure and disaster recovery for Azure applications
Azure resiliency technical guidance

Best practices and scenario guidance


You can run a recurring U-SQL job in an ADLA account in a region that reads and writes U-SQL tables as well as
unstructured data. Prepare for a disaster by taking these steps:
1. Create ADLA and ADLS accounts in the secondary region that will be used during an outage.

NOTE
Since account names are globally unique, use a consistent naming scheme that indicates which account is
secondary.

2. For unstructured data, reference Disaster recovery guidance for data in Azure Data Lake Storage Gen1
3. For structured data stored in ADLA tables and databases, create copies of the metadata artifacts such as
databases, tables, table-valued functions, and assemblies. You need to periodically resync these artifacts
when changes happen in production. For example, newly inserted data has to be replicated to the
secondary region by copying the data and inserting into the secondary table.

NOTE
These object names are scoped to the secondary account and are not globally unique, so they can have the same
names as in the primary production account.

During an outage, you need to update your scripts so the input paths point to the secondary endpoint. Then the
users submit their jobs to the ADLA account in the secondary region. The output of the job will then be written
to the ADLA and ADLS account in the secondary region.
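
For example, if a recurring job declares its input and output paths as scalar variables, failing over can be as simple as switching the declared endpoints to the secondary accounts. The following is a minimal sketch; the account names, paths, and schema are hypothetical:

// Normal operation (primary store):
// DECLARE @in string = "adl://contosoprimarystore.azuredatalakestore.net/input/sales.tsv";
// During an outage (secondary store):
DECLARE @in string = "adl://contososecondarystore.azuredatalakestore.net/input/sales.tsv";
DECLARE @out string = "adl://contososecondarystore.azuredatalakestore.net/output/sales-summary.csv";

@sales =
    EXTRACT Region string,
            Amount decimal
    FROM @in
    USING Extractors.Tsv();

@summary =
    SELECT Region,
           SUM(Amount) AS TotalAmount
    FROM @sales
    GROUP BY Region;

OUTPUT @summary
    TO @out
    USING Outputters.Csv();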

Next steps
Disaster recovery guidance for data in Azure Data Lake Storage Gen1
Get started with U-SQL in Azure Data Lake
Analytics
12/10/2021 • 4 minutes to read • Edit Online

U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale.
Through the scalable, distributed-query capability of U-SQL, you can efficiently analyze data across relational
stores such as Azure SQL Database. With U-SQL, you can process unstructured data by applying schema on
read and inserting custom logic and UDFs. Additionally, U-SQL includes extensibility that gives you fine-grained
control over how to execute at scale.

Learning resources
The U-SQL Tutorial provides a guided walkthrough of most of the U-SQL language. This document is
recommended reading for all developers wanting to learn U-SQL.
For detailed information about the U-SQL language syntax , see the U-SQL Language Reference.
To understand the U-SQL design philosophy , see the Visual Studio blog post Introducing U-SQL – A
Language that makes Big Data Processing Easy.

Prerequisites
Before you go through the U-SQL samples in this document, read and complete Tutorial: Develop U-SQL scripts
using Data Lake Tools for Visual Studio. That tutorial explains the mechanics of using U-SQL with Azure Data
Lake Tools for Visual Studio.

Your first U-SQL script


The following U-SQL script is simple and lets us explore many aspects of the U-SQL language.

@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();

OUTPUT @searchlog
TO "/output/SearchLog-first-u-sql.csv"
USING Outputters.Csv();

This script doesn't have any transformation steps. It reads from the source file called SearchLog.tsv ,
schematizes it, and writes the rowset back into a file called SearchLog-first-u-sql.csv.
Notice the question mark next to the data type in the Duration field. It means that the Duration field could be
null.
Key concepts
Rowset variables : Each query expression that produces a rowset can be assigned to a variable. U-SQL
follows the T-SQL variable naming pattern ( @searchlog , for example) in the script.
The EXTRACT keyword reads data from a file and defines the schema on read. Extractors.Tsv is a built-in
U-SQL extractor for tab-separated-value files. You can develop custom extractors.
The OUTPUT writes data from a rowset to a file. Outputters.Csv() is a built-in U-SQL outputter to create a
comma-separated-value file. You can develop custom outputters.
File paths
The EXTRACT and OUTPUT statements use file paths. File paths can be absolute or relative:
The following absolute file path refers to a file in a Data Lake Store named mystore :

adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv

The following file path starts with "/" . It refers to a file in the default Data Lake Store account:

/output/SearchLog-first-u-sql.csv
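
As a quick illustration, the first script could reference the absolute path instead of the relative one. This is a minimal sketch; mystore is only an example account name:

@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int?,
            Urls string,
            ClickedUrls string
    FROM "adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog
    TO "/output/SearchLog-absolute-path.csv"
    USING Outputters.Csv();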

Use scalar variables


You can use scalar variables as well to make your script maintenance easier. The previous U-SQL script can also
be written as:

DECLARE @in string = "/Samples/Data/SearchLog.tsv";


DECLARE @out string = "/output/SearchLog-scalar-variables.csv";
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Tsv();
OUTPUT @searchlog
TO @out
USING Outputters.Csv();

Transform rowsets
Use SELECT to transform rowsets:
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";
OUTPUT @rs1
TO "/output/SearchLog-transform-rowsets.csv"
USING Outputters.Csv();

The WHERE clause uses a C# Boolean expression. You can use the C# expression language to do your own
expressions and functions. You can even perform more complex filtering by combining them with logical
conjunctions (ANDs) and disjunctions (ORs).
The following script uses the DateTime.Parse() method and a conjunction.

@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";
@rs1 =
SELECT Start, Region, Duration
FROM @rs1
WHERE Start >= DateTime.Parse("2012/02/16") AND Start <= DateTime.Parse("2012/02/17");
OUTPUT @rs1
TO "/output/SearchLog-transform-datetime.csv"
USING Outputters.Csv();

NOTE
The second query is operating on the result of the first rowset, which creates a composite of the two filters. You can also
reuse a variable name, and the names are scoped lexically.

Aggregate rowsets
U-SQL gives you the familiar ORDER BY, GROUP BY, and aggregations.
The following query finds the total duration per region, and then displays the top five durations in order.
U-SQL rowsets do not preserve their order for the next query. Thus, to order an output, you need to add ORDER
BY to the OUTPUT statement:
DECLARE @outpref string = "/output/Searchlog-aggregation";
DECLARE @out1 string = @outpref+"_agg.csv";
DECLARE @out2 string = @outpref+"_top5agg.csv";
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM @searchlog
GROUP BY Region;
@res =
SELECT *
FROM @rs1
ORDER BY TotalDuration DESC
FETCH 5 ROWS;
OUTPUT @rs1
TO @out1
ORDER BY TotalDuration DESC
USING Outputters.Csv();
OUTPUT @res
TO @out2
ORDER BY TotalDuration DESC
USING Outputters.Csv();

The U-SQL ORDER BY clause requires using the FETCH clause in a SELECT expression.
The U-SQL HAVING clause can be used to restrict the output to groups that satisfy the HAVING condition:

@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@res =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM @searchlog
GROUP BY Region
HAVING SUM(Duration) > 200;
OUTPUT @res
TO "/output/Searchlog-having.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();

For advanced aggregation scenarios, see the U-SQL reference documentation for aggregate, analytic, and
reference functions

Next steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts by using Data Lake Tools for Visual Studio
Get started with the U-SQL Catalog in Azure Data
Lake Analytics
12/10/2021 • 2 minutes to read • Edit Online

Create a TVF
In the previous U-SQL script, you repeated the use of EXTRACT to read from the same source file. With the U-
SQL table-valued function (TVF), you can encapsulate the data for future reuse.
The following script creates a TVF called Searchlog() in the default database and schema:

DROP FUNCTION IF EXISTS Searchlog;

CREATE FUNCTION Searchlog()


RETURNS @searchlog TABLE
(
UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
)
AS BEGIN
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
RETURN;
END;

The following script shows you how to use the TVF that was defined in the previous script:

@res =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM Searchlog() AS S
GROUP BY Region
HAVING SUM(Duration) > 200;

OUTPUT @res
TO "/output/SearchLog-use-tvf.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();

Create views
If you have a single query expression, instead of a TVF you can use a U-SQL VIEW to encapsulate that
expression.
The following script creates a view called SearchlogView in the default database and schema:

DROP VIEW IF EXISTS SearchlogView;

CREATE VIEW SearchlogView AS


EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();

The following script demonstrates the use of the defined view:

@res =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM SearchlogView
GROUP BY Region
HAVING SUM(Duration) > 200;

OUTPUT @res
TO "/output/Searchlog-use-view.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();

Create tables
As with relational database tables, with U-SQL you can create a table with a predefined schema or create a table
that infers the schema from the query that populates the table (also known as CREATE TABLE AS SELECT or
CTAS).
Create a database and two tables by using the following script:
DROP DATABASE IF EXISTS SearchLogDb;
CREATE DATABASE SearchLogDb;
USE DATABASE SearchLogDb;

DROP TABLE IF EXISTS SearchLog1;


DROP TABLE IF EXISTS SearchLog2;

CREATE TABLE SearchLog1 (


UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string,

INDEX sl_idx CLUSTERED (UserId ASC)


DISTRIBUTED BY HASH (UserId)
);

INSERT INTO SearchLog1 SELECT * FROM master.dbo.Searchlog() AS s;

CREATE TABLE SearchLog2(


INDEX sl_idx CLUSTERED (UserId ASC)
DISTRIBUTED BY HASH (UserId)
) AS SELECT * FROM master.dbo.Searchlog() AS S; // You can use EXTRACT or SELECT here

Query tables
You can query tables, such as those created in the previous script, in the same way that you query the data files.
Instead of creating a rowset by using EXTRACT, you now can refer to the table name.
To read from the tables, modify the transform script that you used previously:

@rs1 =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM SearchLogDb.dbo.SearchLog2
GROUP BY Region;

@res =
SELECT *
FROM @rs1
ORDER BY TotalDuration DESC
FETCH 5 ROWS;

OUTPUT @res
TO "/output/Searchlog-query-table.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();

NOTE
Currently, you cannot run a SELECT on a table in the same script as the one where you created the table.

Next Steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Develop U-SQL user-defined operators (UDOs)
12/10/2021 • 2 minutes to read • Edit Online

This article describes how to develop user-defined operators to process data in a U-SQL job.

Define and use a user-defined operator in U-SQL


To create and submit a U -SQL job
1. From Visual Studio, select File > New > Project > U-SQL Project .
2. Click OK . Visual Studio creates a solution with a Script.usql file.
3. From Solution Explorer , expand Script.usql, and then double-click Script.usql.cs .
4. Paste the following code into the file:
using Microsoft.Analytics.Interfaces;
using System.Collections.Generic;
namespace USQL_UDO
{
public class CountryName : IProcessor
{
private static IDictionary<string, string> CountryTranslation = new Dictionary<string,
string>
{
{
"Deutschland", "Germany"
},
{
"Suisse", "Switzerland"
},
{
"UK", "United Kingdom"
},
{
"USA", "United States of America"
},
{
"中国", "PR China"
}
};
public override IRow Process(IRow input, IUpdatableRow output)
{
string UserID = input.Get<string>("UserID");
string Name = input.Get<string>("Name");
string Address = input.Get<string>("Address");
string City = input.Get<string>("City");
string State = input.Get<string>("State");
string PostalCode = input.Get<string>("PostalCode");
string Country = input.Get<string>("Country");
string Phone = input.Get<string>("Phone");
if (CountryTranslation.Keys.Contains(Country))
{
Country = CountryTranslation[Country];
}
output.Set<string>(0, UserID);
output.Set<string>(1, Name);
output.Set<string>(2, Address);
output.Set<string>(3, City);
output.Set<string>(4, State);
output.Set<string>(5, PostalCode);
output.Set<string>(6, Country);
output.Set<string>(7, Phone);
return output.AsReadOnly();
}
}
}

5. Open Script.usql , and paste the following U-SQL script:


@drivers =
EXTRACT UserID string,
Name string,
Address string,
City string,
State string,
PostalCode string,
Country string,
Phone string
FROM "/Samples/Data/AmbulanceData/Drivers.txt"
USING Extractors.Tsv(Encoding.Unicode);

@drivers_CountryName =
PROCESS @drivers
PRODUCE UserID string,
Name string,
Address string,
City string,
State string,
PostalCode string,
Country string,
Phone string
USING new USQL_UDO.CountryName();

OUTPUT @drivers_CountryName
TO "/Samples/Outputs/Drivers.csv"
USING Outputters.Csv(Encoding.Unicode);

6. Specify the Data Lake Analytics account, Database, and Schema.


7. From Solution Explorer , right-click Script.usql , and then click Build Script .
8. From Solution Explorer , right-click Script.usql , and then click Submit Script .
9. If you haven't connected to your Azure subscription, you will be prompted to enter your Azure account
credentials.
10. Click Submit . Submission results and job link are available in the Results window when the submission is
completed.
11. Click the Refresh button to see the latest job status and refresh the screen.
To see the output
1. From Server Explorer , expand Azure , expand Data Lake Analytics , expand your Data Lake Analytics
account, expand Storage Accounts , right-click the Default Storage, and then click Explorer .
2. Expand Samples, expand Outputs, and then double-click Drivers.csv .

Next steps
Extending U-SQL Expressions with User-Code
Use Data Lake Tools for Visual Studio for developing U-SQL applications
Extend U-SQL scripts with Python code in Azure
Data Lake Analytics
12/10/2021 • 2 minutes to read • Edit Online

Prerequisites
Before you begin, ensure the Python extensions are installed in your Azure Data Lake Analytics account.
Navigate to your Data Lake Analytics account in the Azure portal
In the left menu, under GETTING STARTED , click Sample Scripts
Click Install U-SQL Extensions , then OK

Overview
Python Extensions for U-SQL enable developers to perform massively parallel execution of Python code. The
following example illustrates the basic steps:
Use the REFERENCE ASSEMBLY statement to enable Python extensions for the U-SQL Script
Using the REDUCE operation to partition the input data on a key
The Python extensions for U-SQL include a built-in reducer ( Extension.Python.Reducer ) that runs Python
code on each vertex assigned to the reducer
The U-SQL script contains the embedded Python code that has a function called usqlml_main that accepts a
pandas DataFrame as input and returns a pandas DataFrame as output.

REFERENCE ASSEMBLY [ExtPython];


DECLARE @myScript = @"
def get_mentions(tweet):
return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )
def usqlml_main(df):
del df['time']
del df['author']
df['mentions'] = df.tweet.apply(get_mentions)
del df['tweet']
return df
";
@t =
SELECT * FROM
(VALUES
("D1","T1","A1","@foo Hello World @bar"),
("D2","T2","A2","@baz Hello World @beer")
) AS D( date, time, author, tweet );
@m =
REDUCE @t ON date
PRODUCE date string, mentions string
USING new Extension.Python.Reducer(pyScript:@myScript);
OUTPUT @m
TO "/tweetmentions.csv"
USING Outputters.Csv();

How Python Integrates with U-SQL


Datatypes
String and numeric columns from U-SQL are converted as-is between Pandas and U-SQL
U-SQL Nulls are converted to and from Pandas NA values
Schemas
Index vectors in Pandas are not supported in U-SQL. All input data frames in the Python function always have
a 64-bit numerical index from 0 through the number of rows minus 1.
U-SQL datasets cannot have duplicate column names
U-SQL dataset column names must be strings.
Python Versions
Only Python 3.5.1 (compiled for Windows) is supported.
Standard Python modules
All the standard Python modules are included.
Additional Python modules
Besides the standard Python libraries, several commonly used python libraries are included:
pandas
numpy
numexpr
Exception Messages
Currently, an exception in Python code shows up as generic vertex failure. In the future, the U-SQL Job error
messages will display the Python exception message.
Input and Output size limitations
Every vertex has a limited amount of memory assigned to it. Currently, that limit is 6 GB for an AU. Because the
input and output DataFrames must exist in memory in the Python code, the total size for the input and output
cannot exceed 6 GB.

Next steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Using U-SQL window functions for Azure Data Lake Analytics jobs
Use Azure Data Lake Tools for Visual Studio Code
Extend U-SQL scripts with R code in Azure Data
Lake Analytics
12/10/2021 • 4 minutes to read • Edit Online

The following example illustrates the basic steps for deploying R code:
Use the REFERENCE ASSEMBLY statement to enable R extensions for the U-SQL Script.
Use the REDUCE operation to partition the input data on a key.
The R extensions for U-SQL include a built-in reducer ( Extension.R.Reducer ) that runs R code on each vertex
assigned to the reducer.
Use the dedicated, named data frames inputFromUSQL and outputToUSQL to pass data
between U-SQL and R. The input and output DataFrame identifier names are fixed (that is, users cannot change
these predefined names of the input and output DataFrame identifiers).

Embedding R code in the U-SQL script


You can inline the R code in your U-SQL script by using the command parameter of the Extension.R.Reducer . For
example, you can declare the R script as a string variable and pass it as a parameter to the Reducer.

REFERENCE ASSEMBLY [ExtR];

DECLARE @myRScript = @"


inputFromUSQL$Species = as.factor(inputFromUSQL$Species)
lm.fit=lm(unclass(Species)~.-Par, data=inputFromUSQL)
# do not return readonly columns and make sure that the column names are the same in the U-SQL and R scripts
outputToUSQL=data.frame(summary(lm.fit)$coefficients)
colnames(outputToUSQL) <- c(""Estimate"", ""StdError"", ""tValue"", ""Pr"")
outputToUSQL
";

@RScriptOutput = REDUCE … USING new Extension.R.Reducer(command:@myRScript, ReturnType:"dataframe");

Keep the R code in a separate file and reference it in the U-SQL script
The following example illustrates a more complex usage. In this case, the R code is deployed as a RESOURCE that
the U-SQL script references.
Save this R code as a separate file.

load("my_model_LM_Iris.rda")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval="confidence"))

Use a U-SQL script to deploy that R script with the DEPLOY RESOURCE statement.
REFERENCE ASSEMBLY [ExtR];
DEPLOY RESOURCE @"/usqlext/samples/R/RinUSQL_PredictUsingLinearModelasDF.R";
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda";
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/LMPredictionsIris.txt";
DECLARE @PartitionCount int = 10;
@InputData =
EXTRACT
SepalLength double,
SepalWidth double,
PetalLength double,
PetalWidth double,
Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData =
SELECT
Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par,
SepalLength,
SepalWidth,
PetalLength,
PetalWidth
FROM @InputData;
// Predict Species
@RScriptOutput = REDUCE @ExtendedData ON Par
PRODUCE Par, fit double, lwr double, upr double
READONLY Par
USING new Extension.R.Reducer(scriptFile:"RinUSQL_PredictUsingLinearModelasDF.R",
rReturnType:"dataframe", stringsAsFactors:false);
OUTPUT @RScriptOutput TO @OutputFilePredictions USING Outputters.Tsv();

How R Integrates with U-SQL


Datatypes
String and numeric columns from U-SQL are converted as-is between R DataFrame and U-SQL [supported
types: double , string , bool , integer , byte ].
The Factor datatype is not supported in U-SQL.
byte[] must be serialized as a base64-encoded string .
U-SQL strings can be converted to factors in the R code after U-SQL creates the R input data frame, or by setting
the reducer parameter stringsAsFactors: true .
Schemas
U-SQL datasets cannot have duplicate column names.
U-SQL datasets column names must be strings.
Column names must be the same in U-SQL and R scripts.
A read-only column cannot be part of the output data frame, because read-only columns are automatically
injected back into the U-SQL rowset if they are part of the output schema of the UDO.
Functional limitations
The R Engine can't be instantiated twice in the same process.
Currently, U-SQL does not support Combiner UDOs for prediction using partitioned models generated using
Reducer UDOs. Users can declare the partitioned models as resource and use them in their R Script (see
sample code ExtR_PredictUsingLMRawStringReducer.usql )
R Versions
Only R 3.2.2 is supported.
Standard R modules
base
boot
Class
Cluster
codetools
compiler
datasets
doParallel
doRSR
foreach
foreign
Graphics
grDevices
grid
iterators
KernSmooth
lattice
MASS
Matrix
Methods
mgcv
nlme
Nnet
Parallel
pkgXMLBuilder
RevoIOQ
revoIpe
RevoMods
RevoPemaR
RevoRpeConnector
RevoRsrConnector
RevoScaleR
RevoTreeView
RevoUtils
RevoUtilsMath
Rpart
RUnit
spatial
splines
Stats
stats4
survival
Tcltk
Tools
translations
utils
XML

Input and Output size limitations


Every vertex has a limited amount of memory assigned to it. Because the input and output DataFrames must
exist in memory in the R code, the total size for the input and output cannot exceed 500 MB.
Sample code
More sample code is available in your Data Lake Store account after you install the U-SQL Advanced Analytics
extensions. The path for more sample code is: <your_account_address>/usqlext/samples/R .

Deploying Custom R modules with U-SQL


First, create a custom R module, zip it, and then upload the zipped R custom module file to your ADL store.
In this example, we upload magrittr_1.5.zip to the root of the default ADLS account for the ADLA account we
are using. Once you upload the module to the ADL store, declare it with DEPLOY RESOURCE to make it available
in your U-SQL script, and call install.packages to install it.
REFERENCE ASSEMBLY [ExtR];
DEPLOY RESOURCE @"/magrittr_1.5.zip";
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFileModelSummary string = @"/R/Output/CustomPackages.txt";
// R script to run
DECLARE @myRScript = @"
# install the magrittr package
install.packages('magrittr_1.5.zip', repos = NULL)
# load the magrittr package
require(magrittr)
# demonstrate use of the magrittr package
2 %>% sqrt
";
@InputData =
EXTRACT SepalLength double,
SepalWidth double,
PetalLength double,
PetalWidth double,
Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData =
SELECT 0 AS Par,
*
FROM @InputData;
@RScriptOutput = REDUCE @ExtendedData ON Par
PRODUCE Par, RowId int, ROutput string
READONLY Par
USING new Extension.R.Reducer(command:@myRScript, rReturnType:"charactermatrix");
OUTPUT @RScriptOutput TO @OutputFileModelSummary USING Outputters.Tsv();

Next steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Using U-SQL window functions for Azure Data Lake Analytics jobs
Get started with the Cognitive capabilities of U-SQL
12/10/2021 • 2 minutes to read • Edit Online

Overview
Cognitive capabilities for U-SQL enable developers to put intelligence into their big data programs.
The following samples using cognitive capabilities are available:
Imaging: Detect faces
Imaging: Detect emotion
Imaging: Detect objects (tagging)
Imaging: OCR (optical character recognition)
Text: Key Phrase Extraction & Sentiment Analysis

Registering Cognitive Extensions in U-SQL


Before you begin, follow the steps in this article to register Cognitive Extensions in U-SQL: Registering Cognitive
Extensions in U-SQL.
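
Once the extensions are registered, a U-SQL job can reference the imaging assemblies and call the built-in cognitive operators. The following is a minimal sketch modeled on the published U-SQL/Cognitive samples; the assembly names, operator names, produced columns, and input path are assumptions taken from those samples and may differ in your account:

REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

// Load images as byte arrays by using the sample image extractor.
@images =
    EXTRACT FileName string,
            ImgData byte[]
    FROM "/Samples/Images/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

// Tag the objects that are detected in each image.
@tags =
    PROCESS @images
    PRODUCE FileName,
            NumObjects int,
            Tags string
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();

OUTPUT @tags
    TO "/output/image-tags.csv"
    USING Outputters.Csv();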

Next steps
U-SQL/Cognitive Samples
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Using U-SQL window functions for Azure Data Lake Analytics jobs
Install Data Lake Tools for Visual Studio
12/10/2021 • 2 minutes to read • Edit Online

Learn how to use Visual Studio to create Azure Data Lake Analytics accounts. You can define jobs in U-SQL and
submit jobs to the Data Lake Analytics service. For more information about Data Lake Analytics, see Azure Data
Lake Analytics overview.

Prerequisites
Visual Studio : All editions except Express are supported.
Visual Studio 2019
Visual Studio 2017
Visual Studio 2015
Visual Studio 2013
Microsoft Azure SDK for .NET version 2.7.1 or later. Install it by using the Web platform installer.
A Data Lake Analytics account. To create an account, see Get Started with Azure Data Lake Analytics
using Azure portal.

Install Azure Data Lake Tools for Visual Studio 2017 or Visual Studio
2019
Azure Data Lake Tools for Visual Studio is supported in Visual Studio 2017 15.3 or later. The tool is part of the
Data storage and processing and Azure development workloads. Enable either one of these two
workloads as part of your Visual Studio installation.
Enable the Data storage and processing workload as shown:

Enable the Azure development workload as shown:


Install Azure Data Lake Tools for Visual Studio 2013 and 2015
Download and install Microsoft Azure Data Lake and Stream Analytics Tools for Visual Studio . After installation,
Visual Studio has the following changes:
The Server Explorer > Azure node contains a Data Lake Analytics node.
The Tools menu has a Data Lake item.

Next steps
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics.
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
To use the vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.
Run U-SQL scripts on your local machine
12/10/2021 • 6 minutes to read • Edit Online

When you develop U-SQL scripts, you can save time and expense by running the scripts locally. Azure Data Lake
Tools for Visual Studio supports running U-SQL scripts on your local machine.

Basic concepts for local runs


The following chart shows the components for local run and how these components map to cloud run.

| Component | Local run | Cloud run |
| --- | --- | --- |
| Storage | Local data root folder | Default Azure Data Lake Store account |
| Compute | U-SQL local run engine | Azure Data Lake Analytics service |
| Run environment | Working directory on local machine | Azure Data Lake Analytics cluster |

The sections that follow provide more information about local run components.
Local data root folders
A local data root folder is a local store for the local compute account. Any folder in the local file system on your
local machine can be a local data root folder. It's the same as the default Azure Data Lake Store account of a Data
Lake Analytics account. Switching to a different data root folder is just like switching to a different default store
account.
The data root folder is used as follows:
Store metadata. Examples are databases, tables, table-valued functions, and assemblies.
Look up the input and output paths that are defined as relative paths in U-SQL scripts. By using relative
paths, it's easier to deploy your U-SQL scripts to Azure.
U -SQL local run engines
A U-SQL local run engine is a local compute account for U-SQL jobs. Users can run U-SQL jobs locally
through Azure Data Lake Tools for Visual Studio. Local runs are also supported through the Azure Data Lake U-
SQL SDK command-line and programming interfaces. Learn more about the Azure Data Lake U-SQL SDK.
Working directories
When you run a U-SQL script, a working directory folder is needed to cache compilation results, run logs, and
perform other functions. In Azure Data Lake Tools for Visual Studio, the working directory is the U-SQL project’s
working directory. It's located under <U-SQL project root path>/bin/debug . The working directory is cleaned
every time a new run is triggered.

Local runs in Microsoft Visual Studio


Azure Data Lake Tools for Visual Studio have a built-in local run engine. The tools surface the engine as a local
compute account. To run a U-SQL script locally, select the Local-machine or Local-project account in the
script’s editor margin drop-down menu. Then select Submit .
Local runs with a Local-machine account
A Local-machine account is a shared local compute account with a single local data root folder as the local
store account. By default, the data root folder is located at
C:\Users\<username>\AppData\Local\USQLDataRoot . It's also configurable through Tools > Data Lake
> Options and Settings .

A U-SQL project is required for a local run. The U-SQL project’s working directory is used for the U-SQL local
run working directory. Compilation results, run logs, and other job run-related files are generated and stored
under the working directory folder during the local run. Every time you rerun the script, all the files in the
working directory are cleaned and regenerated.

Local runs with a Local-project account


A Local-project account is a project-isolated local compute account for each project with an isolated local data
root folder. Every active U-SQL project that opens in Solution Explorer in Visual Studio has a corresponding
(Local-project: <project name>) account. The accounts are listed in both Server Explorer in Visual Studio and
the U-SQL script editor margin.
The Local-project account provides a clean and isolated development environment. A Local-machine account
has a shared local data root folder that stores metadata and input and output data for all local jobs. But a Local-
project account creates a temporary local data root folder under a U-SQL project working directory every time
a U-SQL script is run. This temporary data root folder is cleaned when a rebuild or rerun happens.
A U-SQL project manages the isolated local run environment through a project reference and property. You can
configure the input data sources for U-SQL scripts in both the project and the referenced database
environments.
Manage the input data source for a Local-project account
A U-SQL project creates a local data root folder and sets up data for a Local-project account. A temporary data
root folder is cleaned and recreated under the U-SQL project working directory every time a rebuild and local
run happens. All data sources that are configured by the U-SQL project are copied to this temporary local data
root folder before the local job runs.
You can configure the root folder of your data sources. Right-click U-SQL project > Property > Test Data
Source . When you run a U-SQL script on a Local-project account, all files and subfolders in the Test Data
Source folder are copied to the temporary local data root folder. Files under subfolders are included. After a
local job runs, output results can also be found under the temporary local data root folder in the project working
directory. All this output is deleted and cleaned when the project gets rebuilt and cleaned.

Manage a referenced database environment for a Local-project account


If a U-SQL query uses or queries with U-SQL database objects, you must make the database environments
ready locally before you run the U-SQL script locally. For a Local-project account, U-SQL database
dependencies can be managed by U-SQL project references. You can add U-SQL database project references to
your U-SQL project. Before running U-SQL scripts on a Local-project account, all referenced databases are
deployed to the temporary local data root folder. And for each run, the temporary data root folder is cleaned as
a fresh isolated environment.
See this related article:
Learn how to manage U-SQL database definitions and references in U-SQL database projects.

The difference between Local-machine and Local-project accounts


A Local-machine account simulates an Azure Data Lake Analytics account on users’ local machines. It shares
the same experience with an Azure Data Lake Analytics account. A Local-project account provides a user-
friendly local development environment. This environment helps users deploy database references and input
data before they run scripts locally. A Local-machine account provides a shared permanent environment that
can be accessed through all projects. A Local-project account provides an isolated development environment
for each project. It's refreshed for each run. A Local-project account offers a faster development experience by
quickly applying new changes.
More differences between Local-machine and Local-project accounts are shown in the following table:
| Difference angle | Local-machine | Local-project |
| --- | --- | --- |
| Local access | Can be accessed by all projects. | Only the corresponding project can access this account. |
| Local data root folder | A permanent local folder. Configured through Tools > Data Lake > Options and Settings . | A temporary folder created for each local run under the U-SQL project working directory. The folder gets cleaned when a rebuild or rerun happens. |
| Input data for a U-SQL script | The relative path under the permanent local data root folder. | Set through U-SQL project property > Test Data Source . All files and subfolders are copied to the temporary data root folder before a local run. |
| Output data for a U-SQL script | Relative path under the permanent local data root folder. | Output to the temporary data root folder. The results are cleaned when a rebuild or rerun happens. |
| Referenced database deployment | Referenced databases aren't deployed automatically when running against a Local-machine account. It's the same for submitting to an Azure Data Lake Analytics account. | Referenced databases are deployed to the Local-project account automatically before a local run. All database environments are cleaned and redeployed when a rebuild or rerun happens. |

A local run with the U-SQL SDK


You can run U-SQL scripts locally in Visual Studio and also use the Azure Data Lake U-SQL SDK to run U-SQL
scripts locally with command-line and programming interfaces. Through these interfaces, you can automate U-
SQL local runs and tests.
Learn more about the Azure Data Lake U-SQL SDK.

Next steps
How to set up a CI/CD pipeline for Azure Data Lake Analytics.
How to test your Azure Data Lake Analytics code.
Debug Azure Data Lake Analytics code locally
12/10/2021 • 2 minutes to read • Edit Online

You can use Azure Data Lake Tools for Visual Studio to run and debug Azure Data Lake Analytics code on your
local workstation, just as you can in the Azure Data Lake Analytics service.
Learn how to run U-SQL script on your local machine.

Debug scripts and C# assemblies locally


You can debug C# assemblies without submitting and registering them to the Azure Data Lake Analytics service.
You can set breakpoints in both the code-behind file and in a referenced C# project.
Debug local code in a code -behind file
1. Set breakpoints in the code-behind file.
2. Select F5 to debug the script locally.

NOTE
The following procedure works only in Visual Studio 2015. In older Visual Studio versions, you might need to manually
add the PDB files.

Debug local code in a referenced C# project


1. Create a C# assembly project, and build it to generate the output DLL file.
2. Register the DLL file by using a U-SQL statement:

CREATE ASSEMBLY assemblyname FROM @"..\..\path\to\output\.dll";

3. Set breakpoints in the C# code.


4. Select F5 to debug the script by referencing the C# DLL file locally.

Next steps
For an example of a more complex query, see Analyze website logs using Azure Data Lake Analytics.
To view job details, see Use Job Browser and Job View for Azure Data Lake Analytics jobs.
To use the vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.
Use a U-SQL database project to develop a U-SQL
database for Azure Data Lake
12/10/2021 • 4 minutes to read • Edit Online

U-SQL database provides structured views over unstructured data and managed structured data in tables. It also
provides a general metadata catalog system for organizing your structured data and custom code. The database
is the concept that groups these related objects together.
Learn more about U-SQL database and Data Definition Language (DDL).
The U-SQL database project is a project type in Visual Studio that helps developers develop, manage, and deploy
their U-SQL databases quickly and easily.

Create a U-SQL database project


Azure Data Lake Tools for Visual Studio added a new project template called U-SQL database project after
version 2.3.3000.0. To create a U-SQL project, select File > New > Project . The U-SQL Database Project can be
found under Azure Data Lake > U-SQL node .

Develop U-SQL database objects by using a database project


Right-click the U-SQL database project. Then select Add > New Item . You can find all new supported object
types in the Add New Item Wizard.
For a non-assembly object (for example, a table-valued function), a new U-SQL script is created after you add a
new item. You can start to develop the DDL statement for that object in the editor.
For an assembly object, the tool provides a user-friendly UI editor that helps you register the assembly and
deploy DLL files and other additional files. The following steps show you how to add an assembly object
definition to the U-SQL database project:
1. Add references to the C# project that include the UDO/UDAG/UDF for the U-SQL database project.

2. In the assembly design view, choose the referenced assembly from Create assembly from reference
drop-down menu.
3. Add Managed Dependencies and Additional Files if there are any. When you add additional files, the
tool uses the relative path to make sure it can find the assemblies both on your local machine and on the
build machine later.
@_DeployTempDirectory is a predefined variable that points the tool to the build output folder. Under the build
output folder, every assembly has a subfolder named with the assembly name. All DLLs and additional files are
in that subfolder.

Build a U-SQL database project


The build output for a U-SQL database project is a U-SQL database deployment package, named with the suffix
.usqldbpack . The .usqldbpack package is a .zip file that includes all DDL statements in a single U-SQL script in
the DDL folder, and all DLLs and additional files for assemblies in the Temp folder.
Learn more about how to build a U-SQL database project with the MSBuild command line and an Azure DevOps
Services build task.

Deploy a U-SQL database


The .usqldbpack package can be deployed to either a local account or an Azure Data Lake Analytics account by
using Visual Studio or the deployment SDK.
Deploy a U -SQL database in Visual Studio
You can deploy a U-SQL database through a U-SQL database project or a .usqldbpack package in Visual Studio.
Deploy through a U-SQL database project
1. Right-click the U-SQL database project, and then select Deploy .
2. In the Deploy U-SQL Database Wizard , select the ADLA account to which you want to deploy the
database. Both local accounts and ADLA accounts are supported.
3. Database Source is filled in automatically, and points to the .usqldbpack package in the project's build
output folder.
4. Enter a name in Database Name to create a database. If a database with that same name already exists
in the target Azure Data Lake Analytics account, all objects that are defined in the database project are
created without recreating the database.
5. To deploy the U-SQL database, select Submit . All resources (assemblies and additional files) are
uploaded, and a U-SQL job that includes all DDL statements is submitted.

Deploy through a U-SQL database deployment package


1. Open Server Explorer . Then expand the Azure Data Lake Analytics account to which you want to
deploy the database.
2. Right-click U-SQL Databases , and then choose Deploy Database .
3. Set Database Source to the U-SQL database deployment package (.usqldbpack file) path.
4. Enter the Database Name to create a database. If there is a database with the same name that already
exists in the target Azure Data Lake Analytics account, all objects that are defined in the database project
are created without recreating the database.

Deploy U -SQL database by using the SDK


PackageDeploymentTool.exe provides the programming and command-line interfaces that help to deploy U-SQL
databases. The SDK is included in the U-SQL SDK Nuget package, located at
build/runtime/PackageDeploymentTool.exe .
Learn more about the SDK and how to set up CI/CD pipeline for U-SQL database deployment.

Reference a U-SQL database project


A U-SQL project can reference a U-SQL database project. The reference affects two workloads:
Project build: Set up the referenced database environments before building the U-SQL scripts.
Local run against (a local-project) account: The referenced database environments are deployed to (a local-
project) account before U-SQL script execution. Learn more about local runs and the difference between (the
local-machine) and (a local-project) account here.
How to add a U -SQL database reference
1. Right-click the U-SQL project in Solution Explorer , and then choose Add U-SQL Database
Reference....

2. Configure a database reference from a U-SQL database project in the current solution or in a U-SQL
database package file.
3. Provide the name for the database.

Next steps
How to set up a CI/CD pipeline for Azure Data Lake Analytics
How to test your Azure Data Lake Analytics code
Run U-SQL script on your local machine
Use Job Browser and Job View for Azure Data Lake
Analytics
12/10/2021 • 10 minutes to read • Edit Online

The Azure Data Lake Analytics service archives submitted jobs in a query store. In this article, you learn how to
use Job Browser and Job View in Azure Data Lake Tools for Visual Studio to find the historical job information.
By default, the Data Lake Analytics service archives the jobs for 30 days. The expiration period can be configured
from the Azure portal by configuring the customized expiration policy. You will not be able to access the job
information after expiration.

Prerequisites
See Data Lake Tools for Visual Studio prerequisites.

Open the Job Browser


Access the Job Browser via Server Explorer>Azure>Data Lake Analytics>Jobs in Visual Studio. Using the
Job Browser, you can access the query store of a Data Lake Analytics account. Job Browser displays Query Store
on the left, showing basic job information, and Job View on the right showing detailed job information.

Job View
Job View shows the detailed information of a job. To open a job, you can double-click a job in the Job Browser, or
open it from the Data Lake menu by clicking Job View. You should see a dialog populated with the job URL.

Job View contains:


Job Summary
Refresh the Job View to see the more recent information of running jobs.
Job Status (graph):
Job Status outlines the job phases:
Preparing: Your script is uploaded to the cloud, then compiled and optimized by the
compile service.
Queued: Jobs are queued when they are waiting for enough resources, or when they exceed
the maximum concurrent jobs per account limit. The priority setting determines the
sequence of queued jobs - the lower the number, the higher the priority.
Running: The job is actually running in your Data Lake Analytics account.
Finalizing: The job is completing (for example, finalizing the output file).
The job can fail in any phase. For example, compilation errors in the Preparing phase,
timeout errors in the Queued phase, and execution errors in the Running phase.
Basic Information
The basic job information shows in the lower part of the Job Summary panel.

Job Result: Succeeded or failed. The job may fail in any phase.
Total Duration: Wall clock time (duration) between submission time and end time.
Total Compute Time: The sum of all vertex execution times; you can think of it as the time
the job would take if it were executed in only one vertex. Refer to Total Vertices for more
information about vertices.
Submit/Start/End Time: The time when the Data Lake Analytics service receives the job
submission, starts to run the job, and ends the job (successfully or not).
Compilation/Queued/Running: Wall clock time spent during the Preparing/Queued/Running
phase.
Account: The Data Lake Analytics account used for running the job.
Author: The user who submitted the job; it can be a real person's account or a system account.
Priority: The priority of the job. The lower the number, the higher the priority. It only affects the
sequence of the jobs in the queue. Setting a higher priority does not preempt running jobs.
Parallelism: The requested maximum number of concurrent Azure Data Lake Analytics Units
(ADLAUs), also known as vertices. Currently, one vertex is equal to one VM with two virtual
cores and six GB of RAM, though this could be upgraded in future Data Lake Analytics updates.
Bytes Left: Bytes that need to be processed until the job completes.
Bytes read/written: Bytes that have been read/written since the job started running.
Total vertices: The job is broken up into many pieces of work; each piece of work is called a
vertex. This value describes how many pieces of work the job consists of. You can think of a
vertex as a basic processing unit, also known as an Azure Data Lake Analytics Unit (ADLAU);
vertices can run in parallel.
Completed/Running/Failed: The count of completed/running/failed vertices. Vertices can fail
due to both user code and system failures, but the system automatically retries failed vertices a
few times. If a vertex still fails after retrying, the whole job fails.
Job Graph
A U-SQL script represents the logic of transforming input data to output data. The script is compiled and
optimized into a physical execution plan during the Preparing phase. The Job Graph shows this physical
execution plan. The following diagram illustrates the process:

A job is broken up into many pieces of work. Each piece of work is called a vertex. The vertices are
grouped into super vertices (also known as stages) and visualized as the Job Graph. The green stage placards in
the job graph show the stages.
Every vertex in a stage does the same kind of work on different pieces of the same data. For
example, if you have a file with one TB of data and there are hundreds of vertices reading from it, each of
them reads a chunk. Those vertices are grouped in the same stage and do the same work on different
pieces of the same input file.
Stage information
In a particular stage, some numbers are shown in the placard.

SV1 Extract: The name of a stage, composed of a number and the operation method.
84 vertices: The total count of vertices in this stage. The figure indicates how many pieces of
work this stage is divided into.
12.90 s/vertex: The average vertex execution time for this stage. This figure is calculated as
SUM (every vertex execution time) / (total vertex count). If you could run all the vertices in
parallel, the whole stage would complete in 12.90 s. If all the work in this stage were done
serially, the cost would be #vertices * AVG time (in this example, 84 * 12.90 s, roughly 18 minutes).
850,895 rows written: Total row count written in this stage.
R/W: Amount of data read/written in this stage, in bytes.
Colors: Colors are used in the stage to indicate different vertex statuses.
Green indicates the vertex succeeded.
Orange indicates the vertex was retried. The retried vertex failed but was retried
automatically and successfully by the system, and the overall stage completed
successfully. If the vertex was retried but still failed, the color turns red and the whole job
fails.
Red indicates failed, which means a certain vertex was retried a few times by the
system but still failed. This scenario causes the whole job to fail.
Blue means a certain vertex is running.
White indicates the vertex is waiting. The vertex may be waiting to be scheduled once an
ADLAU becomes available, or it may be waiting for input because its input data might not
be ready.
You can find more details for the stage by hovering your mouse cursor over a stage:

Vertices: Describes the vertex details, for example, how many vertices there are in total, how many
have completed, and whether they failed or are still running/waiting.
Data read cross/intra pod: Files and data are stored in multiple pods in the distributed file system.
The value here describes how much data has been read within the same pod or across pods.
Total compute time: The sum of all vertex execution times in the stage; you can think of it as the
time it would take if all work in the stage were executed in only one vertex.
Data and rows written/read: Indicates how much data or how many rows have been read/written,
or still need to be read.
Vertex read failures: Describes how many vertices failed while reading data.
Vertex duplicate discards: If a vertex runs too slowly, the system may schedule multiple vertices to
run the same piece of work. Redundant vertices are discarded once one of the vertices completes
successfully. Vertex duplicate discards records the number of vertices that are discarded as
duplicates in the stage.
Vertex revocations: The vertex succeeded but was rerun later for some reason. For
example, if a downstream vertex loses intermediate input data, it asks the upstream vertex to
rerun.
Vertex schedule executions: The total time that the vertices have been scheduled.
Min/Average/Max Vertex data read: The minimum/average/maximum amount of data read by each vertex.
Duration: The wall clock time a stage takes. You need to load the profile to see this value.
Job Playback
Data Lake Analytics runs jobs and archives the vertex running information of the jobs, such as
when the vertices started, stopped, or failed and how they were retried. All of the information is
automatically logged in the query store and stored in its Job Profile. You can download the Job
Profile through "Load Profile" in Job View, and you can view the Job Playback after downloading
the Job Profile.
Job Playback is a condensed visualization of what happened in the cluster. It helps you watch job
execution progress and visually detect performance anomalies and bottlenecks in a very short
time (usually less than 30 seconds).
Job Heat Map Display
Job Heat Map can be selected through the Display dropdown in Job Graph.

It shows the I/O, time, and throughput heat map of a job, through which you can find where the job
spends most of its time, or whether your job is an I/O-bound job, and so on.
Progress: The job execution progress. See Information in stage information.
Data read/written: The heat map of total data read/written in each stage.
Compute time: The heat map of SUM (every vertex execution time); you can think of this as
how long it would take if all work in the stage were executed with only one vertex.
Average execution time per node: The heat map of SUM (every vertex execution time) / (vertex
count). If you could run all the vertices in parallel, the whole stage would be done in this
time frame.
Input/Output throughput: The heat map of input/output throughput of each stage. You can use
this to confirm whether your job is an I/O-bound job.
Metadata Operations
You can perform metadata operations in your U-SQL script, such as creating a database or dropping a table.
These operations are shown in Metadata Operation after compilation. You may find assertions, created
entities, and dropped entities here.
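The following minimal U-SQL sketch shows the kind of statements that surface as metadata operations; the database and table names are hypothetical.

// Hypothetical metadata operations that would appear under Metadata Operation after compilation.
CREATE DATABASE IF NOT EXISTS SampleMetadataDB;

DROP TABLE IF EXISTS SampleMetadataDB.dbo.SampleTable;

CREATE TABLE SampleMetadataDB.dbo.SampleTable
(
    Id int,
    Name string,
    INDEX idx CLUSTERED(Id ASC) DISTRIBUTED BY HASH(Id)
);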

State History
The State History is also visualized in Job Summary, but you can get more details here. You can find
detailed information such as when the job was prepared, queued, started running, and ended. You can also find
how many times the job was compiled (the CcsAttempts: 1), when the job was actually dispatched to the
cluster (the Detail: Dispatching job to cluster), and so on.

Diagnostics
The tool diagnoses job execution automatically. You will receive alerts when there are errors or
performance issues in your jobs. Note that you need to download the Profile to get full information
here.
Warnings: An alert shows up here with a compiler warning. You can click the "x issue(s)" link for more
details once the alert appears.
Vertex run too long: If any vertex runs out of time (say, 5 hours), issues will be found here.
Resource usage: If you allocated more parallelism than needed, or not enough, issues will be found here.
You can also click Resource usage to see more details and perform what-if scenarios to find a better
resource allocation (for more details, see this guide).
Memory check: If any vertex uses more than 5 GB of memory, issues will be found here. Job execution
may be killed by the system if it uses more memory than the system limit.

Job Detail
Job Detail shows the detailed information of the job, including Script, Resources and Vertex Execution View.

Script
The U-SQL script of the job is stored in the query store. You can view the original U-SQL script and re-
submit it if needed.
Resources
You can find the job compilation outputs stored in the query store through Resources. For instance, here you
can find "algebra.xml", which is used to show the Job Graph, the assemblies you registered, and so on.
Vertex execution view
It shows vertices execution details. The Job Profile archives every vertex execution log, such as total data
read/written, runtime, state, etc. Through this view, you can get more details on how a job ran. For more
information, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.

Next Steps
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
To use vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio
Debug user-defined C# code for failed U-SQL jobs
12/10/2021 • 3 minutes to read

U-SQL provides an extensibility model using C#. In U-SQL scripts, it is easy to call C# functions and perform
analytic functions that the SQL-like declarative language does not support. To learn more about U-SQL extensibility,
see the U-SQL programmability guide.
In practice, any code may need debugging, but it is hard to debug a distributed job with custom code in the
cloud with limited log files. Azure Data Lake Tools for Visual Studio provides a feature called Failed Vertex
Debug, which helps you more easily debug the failures that occur in your custom code. When a U-SQL job fails,
the service keeps the failure state, and the tool helps you download the cloud failure environment to the local
machine for debugging. The local download captures the entire cloud environment, including any input data and
user code.
The following video demonstrates Failed Vertex Debug in Azure Data Lake Tools for Visual Studio.

IMPORTANT
Visual Studio requires the following two updates for using this feature: Microsoft Visual C++ 2015 Redistributable Update
3 and the Universal C Runtime for Windows.

Download failed vertex to local machine


When you open a failed job in Azure Data Lake Tools for Visual Studio, you see a yellow alert bar with detailed
error messages in the error tab.
1. Click Download to download all the required resources and input streams. If the download doesn't
complete, click Retry.
2. Click Open after the download completes to generate a local debugging environment. A new debugging
solution is opened. If you have an existing solution open in Visual Studio, make sure to
save and close it before debugging.
Configure the debugging environment
NOTE
Before debugging, be sure to check Common Language Runtime Exceptions in the Exception Settings window (Ctrl
+ Alt + E ).

In the newly launched Visual Studio instance, you may or may not find the user-defined C# source code:
1. I can find my source code in the solution
2. I cannot find my source code in the solution
Source code is included in debugging solution
The C# source code is captured in two cases:
1. The user code is defined in code-behind file (typically named Script.usql.cs in a U-SQL project).
2. The user code is defined in C# class library project for U-SQL application, and registered as assembly
with debug info .
If the source code is imported to the solution, you can use the Visual Studio debugging tools (watch, variables,
etc.) to troubleshoot the problem:
1. Press F5 to start debugging. The code runs until it is stopped by an exception.
2. Open the source code file and set breakpoints, then press F5 to debug the code step by step.

Source code is not included in debugging solution


If the user code is not included in code-behind file, or you did not register the assembly with debug info , then
the source code is not included automatically in the debugging solution. In this case, you need extra steps to add
your source code:
1. Right-click Solution 'VertexDebug' > Add > Existing Project... to find the assembly source code and
add the project to the debugging solution.

2. Get the project folder path for the FailedVertexDebugHost project.


3. Right-click the added assembly source code project > Properties, select the Build tab at left, and
paste the copied path ending with \bin\debug as Output > Output path. The final output path looks like
<DataLakeTemp path>\fd91dd21-776e-4729-a78b-
81ad85a4fba6\loiu0t1y.mfo\FailedVertexDebug\FailedVertexDebugHost\bin\Debug\
.

After these settings, start debugging with F5 and breakpoints. You can also use the Visual Studio debugging
tools (watch, variables, etc.) to troubleshoot the problem.

NOTE
Rebuild the assembly source code project each time after you modify the code to generate updated .pdb files.

Resubmit the job


After debugging, if the project completes successfully the output window shows the following message:
The Program 'LocalVertexHost.exe' has exited with code 0 (0x0).

To resubmit the failed job:


1. For jobs with code-behind solutions, copy the C# code into the code-behind source file (typically
Script.usql.cs ).

2. For jobs with assemblies, right-click the assembly source code project in debugging solution and register
the updated .dll assemblies into your Azure Data Lake catalog.
3. Resubmit the U-SQL job.
Next steps
U-SQL programmability guide
Develop U-SQL User-defined operators for Azure Data Lake Analytics jobs
Test and debug U-SQL jobs by using local run and the Azure Data Lake U-SQL SDK
How to troubleshoot an abnormal recurring job
Troubleshoot an abnormal recurring job
12/10/2021 • 2 minutes to read

This article shows how to use Azure Data Lake Tools for Visual Studio to troubleshoot problems with recurring
jobs. Learn more about pipeline and recurring jobs from the Azure Data Lake and Azure HDInsight blog.
Recurring jobs usually share the same query logic and similar input data. For example, imagine that you have a
recurring job running every Monday morning at 8 A.M. to count last week’s weekly active users. The scripts for
these jobs share one script template that contains the query logic. The inputs for these jobs are the usage data
for last week. Sharing the same query logic and similar input usually means that performance of these jobs is
similar and stable. If one of your recurring jobs suddenly performs abnormally, fails, or slows down a lot, you
might want to:
See the statistics reports for the previous runs of the recurring job to see what happened.
Compare the abnormal job with a normal one to figure out what has been changed.
Related Job View in Azure Data Lake Tools for Visual Studio helps you accelerate the troubleshooting process
in both cases.

Step 1: Find recurring jobs and open Related Job View


To use Related Job View to troubleshoot a recurring job problem, you need to first find the recurring job in
Visual Studio and then open Related Job View.
Case 1: You have the URL for the recurring job
Through Tools > Data Lake > Job View , you can paste the job URL to open Job View in Visual Studio. Select
View Related Jobs to open Related Job View.

Case 2: You have the pipeline for the recurring job, but not the URL
In Visual Studio, you can open Pipeline Browser through Server Explorer > your Azure Data Lake Analytics
account > Pipelines . (If you can't find this node in Server Explorer, download the latest plug-in.)

In Pipeline Browser, all pipelines for the Data Lake Analytics account are listed at left. You can expand the
pipelines to find all recurring jobs, and then select the one that has problems. Related Job View opens at right.
Step 2: Analyze a statistics report
A summary and a statistics report are shown at top of Related Job View. There, you can find the potential root
cause of the problem.
1. In the report, the X-axis shows the job submission time. Use it to find the abnormal job.
2. Use the process in the following diagram to check statistics and get insights about the problem and the
possible solutions.

Step 3: Compare the abnormal job to a normal job


You can find all submitted recurring jobs through the job list at the bottom of Related Job View. To find more
insights and potential solutions, right-click the abnormal job. Use the Job Diff view to compare the abnormal job
with a previous normal one.
Pay attention to the big differences between these two jobs. Those differences are probably causing the
performance problems. To check further, use the steps in the following diagram:

Next steps
Resolve data-skew problems
Debug user-defined C# code for failed U-SQL jobs
Use the Vertex Execution View in Data Lake Tools
for Visual Studio
12/10/2021 • 2 minutes to read

Learn how to use the Vertex Execution View to examine Data Lake Analytics jobs.

Open the Vertex Execution View


Open a U-SQL job in Data Lake Tools for Visual Studio. Click Vertex Execution View in the bottom left corner.
You may be prompted to load profiles first, which can take some time depending on your network connectivity.

Understand Vertex Execution View


The Vertex Execution View has three parts:
The Vertex selector on the left lets you select vertices by features (such as top 10 data read, or choose by
stage). One of the most commonly used filters is to see the vertices on the critical path. The critical path is the
longest chain of vertices of a U-SQL job. Understanding the critical path is useful for optimizing your jobs by
checking which vertex takes the longest time.

The top center pane shows the running status of all the vertices.

The bottom center pane shows information about each vertex:


Process Name: The name of the vertex instance. It is composed of different parts in
StageName|VertexName|VertexRunInstance. For example, the SV7_Split[62].v1 vertex stands for the second
running instance (.v1, index starting from 0) of Vertex number 62 in Stage SV7_Split.
Total Data Read/Written: The amount of data read/written by this vertex.
State/Exit Status: The final status when the vertex ended.
Exit Code/Failure Type: The error when the vertex failed.
Creation Reason: Why the vertex was created.
Resource Latency/Process Latency/PN Queue Latency: The time the vertex spent waiting for resources,
processing data, and staying in the queue.
Process/Creator GUID: GUID for the currently running vertex or its creator.
Version: The N-th instance of the running vertex (the system might schedule new instances of a vertex for
many reasons, for example failover, compute redundancy, etc.)
Version Created Time.
Process Create Start Time/Process Queued Time/Process Start Time/Process Complete Time: When the vertex
process starts being created; when it starts to queue; when it starts running; and
when it completes.

Next steps
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
To view job details, see Use Job Browser and Job View for Azure Data Lake Analytics jobs
Export a U-SQL database
12/10/2021 • 3 minutes to read

In this article, learn how to use Azure Data Lake Tools for Visual Studio to export a U-SQL database as a single
U-SQL script and downloaded resources. You can import the exported database to a local account in the same
process.
Customers usually maintain multiple environments for development, test, and production. These environments
are hosted both on a local account on a developer's local computer and in an Azure Data Lake Analytics account
in Azure.
When you develop and tune U-SQL queries in development and test environments, developers often need to re-
create their work in a production database. The Database Export Wizard helps accelerate this process. By using
the wizard, developers can clone the existing database environment and sample data to other Data Lake
Analytics accounts.

Export steps
Step 1: Export the database in Server Explorer
All Data Lake Analytics accounts that you have permissions for are listed in Server Explorer. To export the
database:
1. In Server Explorer, expand the account that contains the database that you want to export.
2. Right-click the database, and then select Export.

If the Export menu option isn't available, you need to update the tool to the latest release.
Step 2: Configure the objects that you want to export
If you need only a small part of a large database, you can configure a subset of objects that you want to export
in the export wizard.
The export action is completed by running a U-SQL job. Therefore, exporting from an Azure account incurs
some cost.
Step 3: Check the objects list and other configurations
In this step, you can verify the selected objects in the Export object list box. If there are any errors, select
Previous to go back and correctly configure the objects that you want to export.
You can also configure other settings for the export target. Configuration descriptions are listed in the following
table:

Destination Name: This name indicates where you want to save the exported database resources.
Examples are assemblies, additional files, and sample data. A folder with this name is created under
your local data root folder.

Project Directory: This path defines where you want to save the exported U-SQL script. All database
object definitions are saved at this location.

Schema Only: If you select this option, only database definitions and resources (like assemblies and
additional files) are exported.

Schema and Data: If you select this option, database definitions, resources, and data are exported.
The top N rows of tables are exported.

Import to Local Database Automatically: If you select this option, the exported database is
automatically imported to your local database when exporting is finished.
Step 4: Check the export results
When exporting is finished, you can view the exported results in the log window in the wizard. The following
example shows how to find exported U-SQL script and database resources, including assemblies, additional
files, and sample data:
Import the exported database to a local account
The most convenient way to import the exported database is to select the Import to Local Database
Automatically check box during the exporting process in Step 3. If you didn't check this box, first find the
exported U-SQL script in the export log. Then, run the U-SQL script locally to import the database to your local
account.

Import the exported database to a Data Lake Analytics account


To import the database to a different Data Lake Analytics account:
1. Upload the exported resources, including assemblies, additional files, and sample data, to the default Azure
Data Lake Store account of the Data Lake Analytics account that you want to import to. You can find the
exported resource folder under the local data root folder. Upload the entire folder to the root of the default
Data Lake Store account.
2. When uploading is finished, submit the exported U-SQL script to the Data Lake Analytics account that you
want to import the database to.

Known limitations
Currently, if you select the Schema and Data option in Step 3, the tool runs a U-SQL job to export the data
stored in tables. Because of this, the data exporting process might be slow and you might incur costs.

Next steps
Learn about U-SQL databases
Test and debug U-SQL jobs by using local run and the Azure Data Lake U-SQL SDK
Analyze Website logs using Azure Data Lake
Analytics
12/10/2021 • 4 minutes to read

Learn how to analyze website logs using Data Lake Analytics, especially how to find out which referrers ran into
errors when they tried to visit the website.

Prerequisites
Visual Studio 2015 or Visual Studio 2013 .
Data Lake Tools for Visual Studio .
Once Data Lake Tools for Visual Studio is installed, you will see a Data Lake item in the Tools menu in
Visual Studio:

Basic knowledge of Data Lake Analytics and the Data Lake Tools for Visual Studio . To get
started, see:
Develop U-SQL script using Data Lake tools for Visual Studio.
A Data Lake Analytics account. See Create an Azure Data Lake Analytics account.
Install the sample data. In the Azure portal, open your Data Lake Analytics account and click Sample
Scripts on the left menu, then click Copy Sample Data.

Connect to Azure
Before you can build and test any U-SQL scripts, you must first connect to Azure.
To connect to Data Lake Analytics
1. Open Visual Studio.
2. Click Data Lake > Options and Settings .
3. Click Sign In , or Change User if someone has signed in, and follow the instructions.
4. Click OK to close the Options and Settings dialog.
To browse your Data Lake Analytics accounts
1. From Visual Studio, open Server Explorer by pressing CTRL+ALT+S.
2. From Server Explorer, expand Azure, and then expand Data Lake Analytics. You'll see a list of your
Data Lake Analytics accounts if there are any. You cannot create Data Lake Analytics accounts from Visual Studio.
To create an account, see Get Started with Azure Data Lake Analytics using Azure Portal or Get Started with
Azure Data Lake Analytics using Azure PowerShell.

Develop U-SQL application


A U-SQL application is mostly a U-SQL script. To learn more about U-SQL, see Get started with U-SQL.
You can add additional user-defined operators to the application. For more information, see Develop U-SQL user-
defined operators for Data Lake Analytics jobs.
To create and submit a Data Lake Analytics job
1. Click the File > New > Project .
2. Select the U-SQL Project type.

3. Click OK . Visual Studio creates a solution with a Script.usql file.


4. Enter the following script into the Script.usql file:

// Create a database for easy reuse, so you don't need to read from a file every time.
CREATE DATABASE IF NOT EXISTS SampleDBTutorials;

// Create a table-valued function. A TVF ensures that your jobs fetch data from the weblog file with the
correct schema.
DROP FUNCTION IF EXISTS SampleDBTutorials.dbo.WeblogsView;
CREATE FUNCTION SampleDBTutorials.dbo.WeblogsView()
RETURNS @result TABLE
(
s_date DateTime,
s_time string,
s_sitename string,
cs_method string,
cs_uristem string,
cs_uriquery string,
s_port int,
cs_username string,
c_ip string,
cs_useragent string,
cs_cookie string,
cs_referer string,
cs_host string,
sc_status int,
sc_substatus int,
sc_win32status int,
sc_bytes int,
cs_bytes int,
s_timetaken int
)
AS
BEGIN

@result = EXTRACT
s_date DateTime,
s_time string,
s_sitename string,
cs_method string,
cs_uristem string,
cs_uriquery string,
s_port int,
cs_username string,
c_ip string,
cs_useragent string,
cs_cookie string,
cs_referer string,
cs_host string,
sc_status int,
sc_substatus int,
sc_win32status int,
sc_bytes int,
cs_bytes int,
s_timetaken int
FROM @"/Samples/Data/WebLog.log"
USING Extractors.Text(delimiter:' ');
RETURN;
END;

// Create a table for storing referrers and status


DROP TABLE IF EXISTS SampleDBTutorials.dbo.ReferrersPerDay;
@weblog = SampleDBTutorials.dbo.WeblogsView();
CREATE TABLE SampleDBTutorials.dbo.ReferrersPerDay
(
INDEX idx1
CLUSTERED(Year ASC)
DISTRIBUTED BY HASH(Year)
) AS

SELECT s_date.Year AS Year,


s_date.Month AS Month,
s_date.Day AS Day,
cs_referer,
sc_status,
COUNT(DISTINCT c_ip) AS cnt
FROM @weblog
GROUP BY s_date,
cs_referer,
sc_status;

To understand the U-SQL, see Get started with Data Lake Analytics U-SQL language.
5. Add a new U-SQL script to your project and enter the following:

// Query the referrers that ran into errors


@content =
SELECT *
FROM SampleDBTutorials.dbo.ReferrersPerDay
WHERE sc_status >=400 AND sc_status < 500;

OUTPUT @content
TO @"/Samples/Outputs/UnsuccessfulResponses.log"
USING Outputters.Tsv();
6. Switch back to the first U-SQL script and next to the Submit button, specify your Analytics account.
7. From Solution Explorer , right click Script.usql , and then click Build Script . Verify the results in the
Output pane.
8. From Solution Explorer , right click Script.usql , and then click Submit Script .
9. Verify the Analytics Account is the one where you want to run the job, and then click Submit .
Submission results and job link are available in the Data Lake Tools for Visual Studio Results window
when the submission is completed.
10. Wait until the job is completed successfully. If the job failed, it is most likely missing the source file. See
the Prerequisites section of this tutorial. For additional troubleshooting information, see Monitor and
troubleshoot Azure Data Lake Analytics jobs.
When the job is completed, you'll see the following screen:

11. Now repeat steps 7-10 for Script1.usql.


To see the job output
1. From Ser ver Explorer , expand Azure , expand Data Lake Analytics , expand your Data Lake Analytics
account, expand Storage Accounts , right-click the default Data Lake Storage account, and then click
Explorer .
2. Double-click Samples to open the folder, and then double-click Outputs .
3. Double-click UnsuccessfulResponses.log .
4. You can also double-click the output file inside the graph view of the job in order to navigate directly to the
output.

Next steps
To get started with Data Lake Analytics using different tools, see:
Get started with Data Lake Analytics using Azure Portal
Get started with Data Lake Analytics using Azure PowerShell
Get started with Data Lake Analytics using .NET SDK
Resolve data-skew problems by using Azure Data
Lake Tools for Visual Studio
12/10/2021 • 7 minutes to read

What is data skew?


Briefly stated, data skew is an over-represented value. Imagine that you have assigned 50 tax examiners to audit
tax returns, one examiner for each US state. The Wyoming examiner, because the population there is small, has
little to do. In California, however, the examiner is kept very busy because of the state's large population.

In our scenario, the data is unevenly distributed across all tax examiners, which means that some examiners
must work more than others. In your own job, you frequently experience situations like the tax-examiner
example here. In more technical terms, one vertex gets much more data than its peers, a situation that makes
the vertex work more than the others and that eventually slows down an entire job. What's worse, the job might
fail, because vertices might have, for example, a 5-hour runtime limitation and a 6-GB memory limitation.

Resolving data-skew problems


Azure Data Lake Tools for Visual Studio can help detect whether your job has a data-skew problem. If a problem
exists, you can resolve it by trying the solutions in this section.

Solution 1: Improve table partitioning


Option 1: Filter the skewed key value in advance
If it does not affect your business logic, you can filter the higher-frequency values in advance. For example, if
there are a lot of rows with 000-000-000 in the GUID column, you might not want to aggregate that value. Before you
aggregate, you can write WHERE GUID != "000-000-000" to filter out the high-frequency value, as in the sketch below.
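A minimal sketch, assuming a hypothetical @input rowset with GUID and Amount columns:

// Filter out the over-represented GUID value before aggregating.
@filtered =
    SELECT GUID, Amount
    FROM @input
    WHERE GUID != "000-000-000";

@result =
    SELECT GUID, SUM(Amount) AS Total
    FROM @filtered
    GROUP BY GUID;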
Option 2: Pick a different partition or distribution key
In the preceding example, if you want only to check the tax-audit workload all over the country/region, you can
improve the data distribution by selecting the ID number as your key. Picking a different partition or distribution
key can sometimes distribute the data more evenly, but you need to make sure that this choice doesn’t affect
your business logic. For instance, to calculate the tax sum for each state, you might want to designate State as
the partition key. If you continue to experience this problem, try using Option 3.
Option 3: Add more partition or distribution keys
Instead of using only State as a partition key, you can use more than one key for partitioning. For example,
consider adding ZIP Code as an additional partition key to reduce data-partition sizes and distribute the data
more evenly.
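For illustration only, a table distributed on both State and ZipCode might look like the following sketch; the table and column names are hypothetical.

// Distributing on both State and ZipCode produces smaller, more even buckets than State alone.
CREATE TABLE IF NOT EXISTS SampleDB.dbo.TaxReturns
(
    State string,
    ZipCode string,
    TaxAmount double,
    INDEX idx CLUSTERED(State ASC, ZipCode ASC) DISTRIBUTED BY HASH(State, ZipCode)
);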
Option 4: Use round-robin distribution
If you cannot find an appropriate key for partition and distribution, you can try to use round-robin distribution.
Round-robin distribution treats all rows equally and randomly puts them into corresponding buckets. The data
gets evenly distributed, but it loses locality information, a drawback that can also reduce job performance for
some operations. Additionally, if you are doing aggregation for the skewed key anyway, the data-skew problem
will persist. To learn more about round-robin distribution, see the U-SQL Table Distributions section in CREATE
TABLE (U-SQL): Creating a Table with Schema.
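The following sketch declares the same hypothetical table with round-robin distribution instead of a hash key.

// Round-robin distribution spreads rows evenly but loses locality information.
CREATE TABLE IF NOT EXISTS SampleDB.dbo.TaxReturnsRoundRobin
(
    State string,
    ZipCode string,
    TaxAmount double,
    INDEX idx CLUSTERED(State ASC) DISTRIBUTED BY ROUND ROBIN
);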

Solution 2: Improve the query plan


Option 1: Use the CREATE STATISTICS statement
U-SQL provides the CREATE STATISTICS statement on tables. This statement gives more information to the
query optimizer about the data characteristics, such as value distribution, that are stored in a table. For most
queries, the query optimizer already generates the necessary statistics for a high-quality query plan.
Occasionally, you might need to improve query performance by creating additional statistics with CREATE
STATISTICS or by modifying the query design. For more information, see the CREATE STATISTICS (U-SQL) page.
Code example:

CREATE STATISTICS IF NOT EXISTS stats_SampleTable_date ON SampleDB.dbo.SampleTable(date) WITH FULLSCAN;

NOTE
Statistics information is not updated automatically. If you update the data in a table without re-creating the statistics, the
query performance might decline.

Option 2: Use SKEWFACTOR


If you want to sum the tax for each state, you must use GROUP BY state, an approach that doesn't avoid the
data-skew problem. However, you can provide a data hint in your query to identify data skew in keys so that the
optimizer can prepare an execution plan for you.
Usually, you can set the parameter between 0.5 and 1, with 0.5 meaning not much skew and 1 meaning heavy skew.
Because the hint affects execution-plan optimization for the current statement and all downstream statements,
be sure to add the hint before the potentially skewed key-wise aggregation.

SKEWFACTOR (columns) = x

Provides a hint that the given columns have a skew factor x from 0 (no skew) through 1 (very heavy skew).
Code example:
//Add a SKEWFACTOR hint.
@Impressions =
SELECT * FROM
searchDM.SML.PageView(@start, @end) AS PageView
OPTION(SKEWFACTOR(Query)=0.5)
;
//Query 1 for key: Query, ClientId
@Sessions =
SELECT
ClientId,
Query,
SUM(PageClicks) AS Clicks
FROM
@Impressions
GROUP BY
Query, ClientId
;
//Query 2 for Key: Query
@Display =
SELECT * FROM @Sessions
INNER JOIN @Campaigns
ON @Sessions.Query == @Campaigns.Query
;

Option 3: Use ROWCOUNT


In addition to SKEWFACTOR, for specific skewed-key join cases, if you know that the other joined row set is
small, you can tell the optimizer by adding a ROWCOUNT hint in the U-SQL statement before JOIN. This way,
the optimizer can choose a broadcast join strategy to help improve performance. Be aware that ROWCOUNT does
not resolve the data-skew problem, but it can offer some additional help.

OPTION(ROWCOUNT = n)

Identify a small row set before JOIN by providing an estimated integer row count.
Code example:

//Unstructured (24-hour daily log impressions)


@Huge = EXTRACT ClientId int, ...
FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif"
;
//Small subset (that is, ForgetMe opt out)
@Small = SELECT * FROM @Huge
WHERE Bing.ForgetMe(x,y,z)
OPTION(ROWCOUNT=500)
;
//Result (not enough information to determine simple broadcast JOIN)
@Remove = SELECT * FROM Bing.Sessions
INNER JOIN @Small ON Sessions.Client == @Small.Client
;

Solution 3: Improve the user-defined reducer and combiner


You can sometimes write a user-defined operator to deal with complicated process logic, and a well-written
reducer and combiner might mitigate a data-skew problem in some cases.
Option 1: Use a recursive reducer, if possible
By default, a user-defined reducer runs in non-recursive mode, which means that reduce work for a key is
distributed into a single vertex. But if your data is skewed, the huge data sets might be processed in a single
vertex and run for a long time.
To improve performance, you can add an attribute in your code to define the reducer to run in recursive mode. Then,
the huge data sets can be distributed to multiple vertices and run in parallel, which speeds up your job.
To change a non-recursive reducer to recursive, you need to make sure that your algorithm is associative. For
example, the sum is associative, and the median is not. You also need to make sure that the input and output for
reducer keep the same schema.
Attribute of recursive reducer:

[SqlUserDefinedReducer(IsRecursive = true)]

Code example:

[SqlUserDefinedReducer(IsRecursive = true)]
public class TopNReducer : IReducer
{
public override IEnumerable<IRow>
Reduce(IRowset input, IUpdatableRow output)
{
//Your reducer code goes here.
}
}

Option 2: Use row-level combiner mode, if possible


Similar to the ROWCOUNT hint for specific skewed-key join cases, combiner mode tries to distribute huge
skewed-key value sets to multiple vertices so that the work can be executed concurrently. Combiner mode can’t
resolve data-skew issues, but it can offer some additional help for huge skewed-key value sets.
By default, the combiner mode is Full, which means that the left row set and right row set cannot be separated.
Setting the mode as Left/Right/Inner enables row-level join. The system separates the corresponding row sets
and distributes them into multiple vertices that run in parallel. However, before you configure the combiner
mode, be careful to ensure that the corresponding row sets can be separated.
The example that follows shows a separated left row set. Each output row depends on a single input row from
the left, and it potentially depends on all rows from the right with the same key value. If you set the combiner
mode as left, the system separates the huge left-row set into small ones and assigns them to multiple vertices.
NOTE
If you set the wrong combiner mode, the combination is less efficient, and the results might be wrong.

Attributes of combiner mode:


SqlUserDefinedCombiner(Mode=CombinerMode.Full): Every output row potentially depends on all the
input rows from left and right with the same key value.
SqlUserDefinedCombiner(Mode=CombinerMode.Left): Every output row depends on a single input row
from the left (and potentially all rows from the right with the same key value).
SqlUserDefinedCombiner(Mode=CombinerMode.Right): Every output row depends on a single input row
from the right (and potentially all rows from the left with the same key value).
SqlUserDefinedCombiner(Mode=CombinerMode.Inner): Every output row depends on a single input row
from the left and the right with the same key value.
Code example:

[SqlUserDefinedCombiner(Mode = CombinerMode.Right)]
public class WatsonDedupCombiner : ICombiner
{
public override IEnumerable<IRow>
Combine(IRowset left, IRowset right, IUpdatableRow output)
{
//Your combiner code goes here.
}
}
Monitor jobs in Azure Data Lake Analytics using the
Azure Portal
12/10/2021 • 2 minutes to read

To see all the jobs


1. From the Azure portal, click Microsoft Azure in the upper left corner.
2. Click the tile with your Data Lake Analytics account name. The job summary is shown on the Job
Management tile.

The Job Management tile gives you a glance at the job status. Notice there is a failed job.
3. Click the Job Management tile to see the jobs. The jobs are categorized as Running, Queued, and
Ended. You'll see your failed job in the Ended section. It should be the first one in the list. When you have a
lot of jobs, you can click Filter to help you locate jobs.

4. Click the failed job from the list to open the job details:
Notice the Resubmit button. After you fix the problem, you can resubmit the job.
5. Click the highlighted part from the previous screenshot to open the error details. You'll see something
like:

It tells you the source folder is not found.


6. Click Duplicate Script .
7. Update the FROM path to: /Samples/Data/SearchLog.tsv

8. Click Submit Job .

Next steps
Azure Data Lake Analytics overview
Get started with Azure Data Lake Analytics using Azure PowerShell
Manage Azure Data Lake Analytics using Azure portal
Use Azure Data Lake Tools for Visual Studio Code
12/10/2021 • 8 minutes to read

In this article, learn how you can use Azure Data Lake Tools for Visual Studio Code (VS Code) to create, test, and
run U-SQL scripts. The information is also covered in the following video:

Prerequisites
Azure Data Lake Tools for VS Code supports Windows, Linux, and macOS. U-SQL local run and local debug
work only on Windows.
Visual Studio Code
For macOS and Linux:
.NET 5.0 SDK
Mono 5.2.x

Install Azure Data Lake Tools


After you install the prerequisites, you can install Azure Data Lake Tools for VS Code.
To install Azure Data Lake Tools
1. Open Visual Studio Code.
2. Select Extensions in the left pane. Enter Azure Data Lake Tools in the search box.
3. Select Install next to Azure Data Lake Tools .
After a few seconds, the Install button changes to Reload .
4. Select Reload to activate the Azure Data Lake Tools extension.
5. Select Reload Window to confirm. You can see Azure Data Lake Tools in the Extensions pane.

Activate Azure Data Lake Tools


Create a .usql file or open an existing .usql file to activate the extension.

Work with U-SQL


To work with U-SQL, you need to open either a U-SQL file or a folder.
To open the sample script
Open the command palette (Ctrl+Shift+P) and enter ADL: Open Sample Script . It opens another instance of
this sample. You can also edit, configure, and submit a script on this instance.
To open a folder for your U -SQL project
1. From Visual Studio Code, select the File menu, and then select Open Folder .
2. Specify a folder, and then select Select Folder .
3. Select the File menu, and then select New . An Untitled-1 file is added to the project.
4. Enter the following code in the Untitled-1 file:

@departments =
SELECT * FROM
(VALUES
(31, "Sales"),
(33, "Engineering"),
(34, "Clerical"),
(35, "Marketing")
) AS
D( DepID, DepName );

OUTPUT @departments TO "/Output/departments.csv" USING Outputters.Csv();


The script creates a departments.csv file with some data included in the /output folder.
5. Save the file as myUSQL.usql in the opened folder.
To compile a U -SQL script
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Compile Script . The compile results appear in the Output window. You can also right-click a
script file, and then select ADL: Compile Script to compile a U-SQL job. The compilation result appears in
the Output pane.
To submit a U -SQL script
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Submit Job . You can also right-click a script file, and then select ADL: Submit Job .
After you submit a U-SQL job, the submission logs appear in the Output window in VS Code. The job view
appears in the right pane. If the submission is successful, the job URL appears too. You can open the job URL in a
web browser to track the real-time job status.
On the job view's SUMMARY tab, you can see the job details. Main functions include resubmit a script,
duplicate a script, and open in the portal. On the job view's DATA tab, you can refer to the input files, output
files, and resource files. Files can be downloaded to the local computer.

To set the default context


You can set the default context to apply this setting to all script files if you have not set parameters for files
individually.
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Set Default Context . Or right-click the script editor and select ADL: Set Default Context .
3. Choose the account, database, and schema that you want. The setting is saved to the xxx_settings.json
configuration file.

To set script parameters


1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Set Script Parameters .
3. The xxx_settings.json file is opened with the following properties:
account : An Azure Data Lake Analytics account under your Azure subscription that's needed to
compile and run U-SQL jobs. You need to configure the account before you compile and run
U-SQL jobs.
database : A database under your account. The default is master .
schema : A schema under your database. The default is dbo .
optionalSettings :
priority : The priority range is from 1 to 1000, with 1 as the highest priority. The default value is
1000 .
degreeOfParallelism : The parallelism range is from 1 to 150. The default value is the
maximum parallelism allowed in your Azure Data Lake Analytics account.
NOTE
After you save the configuration, the account, database, and schema information appear on the status bar at the lower-
left corner of the corresponding .usql file if you don’t have a default context set up.

To set Git ignore


1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Set Git Ignore .
If you don’t have a .gitIgnore file in your VS Code working folder, a file named .gitIgnore is created
in your folder. Four items (usqlCodeBehindReference , usqlCodeBehindGenerated , .cache , obj )
are added in the file by default. You can make more updates if needed.
If you already have a .gitIgnore file in your VS Code working folder, the tool adds four items
(usqlCodeBehindReference , usqlCodeBehindGenerated , .cache , obj ) in your .gitIgnore file if
the four items were not included in the file.

Work with code-behind files: C Sharp, Python, and R


Azure Data Lake Tools supports multiple custom codes. For instructions, see Develop U-SQL with Python, R, and
C Sharp for Azure Data Lake Analytics in VS Code.

Work with assemblies


For information on developing assemblies, see Develop U-SQL assemblies for Azure Data Lake Analytics jobs.
You can use Data Lake Tools to register custom code assemblies in the Data Lake Analytics catalog.
To register an assembly
You can register the assembly through the ADL: Register Assembly or ADL: Register Assembly
(Advanced) command.
To register through the ADL: Register Assembly command
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Register Assembly .
3. Specify the local assembly path.
4. Select a Data Lake Analytics account.
5. Select a database.
The portal is opened in a browser and displays the assembly registration process.
A more convenient way to trigger the ADL: Register Assembly command is to right-click the .dll file in File
Explorer.
To register through the ADL: Register Assembly (Advanced) command
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Register Assembly (Advanced) .
3. Specify the local assembly path.
4. The JSON file is displayed. Review and edit the assembly dependencies and resource parameters, if
needed. Instructions are displayed in the Output window. To proceed to the assembly registration, save
(Ctrl+S) the JSON file.

NOTE
Azure Data Lake Tools autodetects whether the DLL has any assembly dependencies. The dependencies are displayed
in the JSON file after they're detected.
You can upload your DLL resources (for example, .txt, .png, and .csv) as part of the assembly registration.

Another way to trigger the ADL: Register Assembly (Advanced) command is to right-click the .dll file in File
Explorer.
The following U-SQL code demonstrates how to call an assembly. In the sample, the assembly name is test.
REFERENCE ASSEMBLY [test];
@a =
EXTRACT
Iid int,
Starts DateTime,
Region string,
Query string,
DwellTime int,
Results string,
ClickedUrls string
FROM @"Sample/SearchLog.txt"
USING Extractors.Tsv();
@d =
SELECT DISTINCT Region
FROM @a;
@d1 =
PROCESS @d
PRODUCE
Region string,
Mkt string
USING new USQLApplication_codebehind.MyProcessor();
OUTPUT @d1
TO @"Sample/SearchLogtest.txt"
USING Outputters.Tsv();

Use U-SQL local run and local debug for Windows users
U-SQL local run tests your local data and validates your script locally before your code is published to Data Lake
Analytics. You can use the local debug feature to complete the following tasks before your code is submitted to
Data Lake Analytics:
Debug your C# code-behind.
Step through the code.
Validate your script locally.
The local run and local debug feature only works in Windows environments, and is not supported on macOS
and Linux-based operating systems.
For instructions on local run and local debug, see U-SQL local run and local debug with Visual Studio Code.

Connect to Azure
Before you can compile and run U-SQL scripts in Data Lake Analytics, you must connect to your Azure account.

To connect to Azure by using a command


1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Login . The login information appears on the lower right.
3. Select Copy & Open to open the login webpage. Paste the code into the box, and then select Continue .

4. Follow the instructions to sign in from the webpage. When you're connected, your Azure account name
appears on the status bar in the lower-left corner of the VS Code window.

NOTE
Data Lake Tools automatically signs you in the next time if you don't sign out.
If your account has two-factor authentication enabled, we recommend that you use phone authentication rather than a PIN.

To sign out, enter the command ADL: Logout .


To connect to Azure from the explorer
Expand AZURE DATALAKE, select Sign in to Azure, and then follow step 3 and step 4 of To connect to Azure
by using a command.

You can't sign out from the explorer. To sign out, see To connect to Azure by using a command.

Create an extraction script


You can create an extraction script for .csv, .tsv, and .txt files by using the command ADL: Create EXTRACT
Script or from the Azure Data Lake explorer.
To create an extraction script by using a command
1. Select Ctrl+Shift+P to open the command palette, and enter ADL: Create EXTRACT Script .
2. Specify the full path for an Azure Storage file, and select the Enter key.
3. Select one account.
4. For a .txt file, select a delimiter to extract the file.
The extraction script is generated based on your entries. For a file whose columns cannot be detected, you're asked to
choose one of two options. Otherwise, only one script is generated.
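As a rough sketch, a generated extraction script for a .csv file typically resembles the following; the column names and paths are placeholders that you adjust to your own data.

@input =
    EXTRACT Column1 string,
            Column2 string,
            Column3 string
    FROM "/path/to/input.csv"
    USING Extractors.Csv();

OUTPUT @input
TO "/output/extracted.csv"
USING Outputters.Csv();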

To create an extraction script from the explorer


Another way to create the extraction script is through the right-click (shortcut) menu on the .csv, .tsv, or .txt file in
Azure Data Lake Store or Azure Blob storage.
Next steps
Develop U-SQL with Python, R, and C Sharp for Azure Data Lake Analytics in VS Code
U-SQL local run and local debug with Visual Studio Code
Tutorial: Get started with Azure Data Lake Analytics
Tutorial: Develop U-SQL scripts by using Data Lake Tools for Visual Studio
Develop U-SQL with Python, R, and C# for Azure
Data Lake Analytics in Visual Studio Code
12/10/2021 • 3 minutes to read

Learn how to use Visual Studio Code (VSCode) to write Python, R, and C# code-behind files with U-SQL and submit
jobs to the Azure Data Lake Analytics service. For more information about Azure Data Lake Tools for VSCode, see Use the
Azure Data Lake Tools for Visual Studio Code.
Before writing code-behind custom code, you need to open a folder or a workspace in VSCode.

Prerequisites for Python and R


Register the Python and R extension assemblies for your ADL account.
1. Open your account in portal.
Select Over view .
Click Sample Script .
2. Click More .
3. Select Install U-SQL Extensions .
4. A confirmation message is displayed after the U-SQL extensions are installed.

NOTE
For the best experience with the Python and R language services, install the VSCode Python and R extensions.

Develop Python file


1. Click the New File in your workspace.
2. Write your code in U-SQL. The following is a code sample.
REFERENCE ASSEMBLY [ExtPython];
@t =
SELECT * FROM
(VALUES
("D1","T1","A1","@foo Hello World @bar"),
("D2","T2","A2","@baz Hello World @beer")
) AS
D( date, time, author, tweet );

@m =
REDUCE @t ON date
PRODUCE date string, mentions string
USING new Extension.Python.Reducer("pythonSample.usql.py", pyVersion : "3.5.1");

OUTPUT @m
TO "/tweetmentions.csv"
USING Outputters.Csv();

3. Right-click a script file, and then select ADL: Generate Python Code Behind File .
4. The xxx.usql.py file is generated in your working folder. Write your code in Python file. The following is a
code sample.

def get_mentions(tweet):
return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )

def usqlml_main(df):
del df['time']
del df['author']
df['mentions'] = df.tweet.apply(get_mentions)
del df['tweet']
return df

5. Right-click in the U-SQL file; you can then click Compile Script or Submit Job to run the job.

Develop R file
1. Click the New File in your workspace.
2. Write your code in U-SQL file. The following is a code sample.
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda";
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/LMPredictionsIris.txt";
DECLARE @PartitionCount int = 10;

@InputData =
EXTRACT SepalLength double,
SepalWidth double,
PetalLength double,
PetalWidth double,
Species string
FROM @IrisData
USING Extractors.Csv();

@ExtendedData =
SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par,
SepalLength,
SepalWidth,
PetalLength,
PetalWidth
FROM @InputData;

// Predict Species

@RScriptOutput =
REDUCE @ExtendedData
ON Par
PRODUCE Par,
fit double,
lwr double,
upr double
READONLY Par
USING new Extension.R.Reducer(scriptFile : "RClusterRun.usql.R", rReturnType : "dataframe",
stringsAsFactors : false);
OUTPUT @RScriptOutput
TO @OutputFilePredictions
USING Outputters.Tsv();

3. Right-click in USQL file, and then select ADL: Generate R Code Behind File .
4. The xxx.usql.r file is generated in your working folder. Write your code in R file. The following is a code
sample.

load("my_model_LM_Iris.rda")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval="confidence"))

5. Right-click in the U-SQL file; you can then click Compile Script or Submit Job to run the job.

Develop C# file
A code-behind file is a C# file associated with a single U-SQL script. You can define a script dedicated to UDO,
UDA, UDT, and UDF in the code-behind file. The UDO, UDA, UDT, and UDF can be used directly in the script
without registering the assembly first. The code-behind file is put in the same folder as its peering U-SQL script
file. If the script is named xxx.usql, the code-behind is named as xxx.usql.cs. If you manually delete the code-
behind file, the code-behind feature is disabled for its associated U-SQL script. For more information about
writing custom code for a U-SQL script, see Writing and Using Custom Code in U-SQL: User-Defined Functions.
1. Click the New File in your workspace.
2. Write your code in U-SQL file. The following is a code sample.
@a =
EXTRACT
Iid int,
Starts DateTime,
Region string,
Query string,
DwellTime int,
Results string,
ClickedUrls string
FROM @"/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();

@d =
SELECT DISTINCT Region
FROM @a;

@d1 =
PROCESS @d
PRODUCE
Region string,
Mkt string
USING new USQLApplication_codebehind.MyProcessor();

OUTPUT @d1
TO @"/output/SearchLogtest.txt"
USING Outputters.Tsv();

3. Right-click the U-SQL file, and then select ADL: Generate CS Code Behind File .
4. The xxx.usql.cs file is generated in your working folder. Write your code in the C# file. The following is a
code sample.

namespace USQLApplication_codebehind
{
[SqlUserDefinedProcessor]

public class MyProcessor : IProcessor


{
public override IRow Process(IRow input, IUpdatableRow output)
{
output.Set(0, input.Get<string>(0));
output.Set(1, input.Get<string>(0));
return output.AsReadOnly();
}
}
}

5. Right-click the U-SQL file, and then select Compile Script or Submit Job to run the job.

Next steps
Use the Azure Data Lake Tools for Visual Studio Code
U-SQL local run and local debug with Visual Studio Code
Get started with Data Lake Analytics using PowerShell
Get started with Data Lake Analytics using the Azure portal
Use Data Lake Tools for Visual Studio for developing U-SQL applications
Use Data Lake Analytics(U-SQL) catalog
Accessing resources with Azure Data Lake Tools

You can easily access Azure Data Lake Analytics resources with Azure Data Lake Tools commands or actions in VS Code.

Integrate with Azure Data Lake Analytics through a command


You can access Azure Data Lake Analytics resources to list accounts, access metadata, and view analytics jobs.
To list the Azure Data Lake Analytics accounts under your Azure subscription
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: List Accounts . The accounts appear in the Output pane.
To access Azure Data Lake Analytics metadata
1. Select Ctrl+Shift+P, and then enter ADL: List Tables .
2. Select one of the Data Lake Analytics accounts.
3. Select one of the Data Lake Analytics databases.
4. Select one of the schemas. You can see the list of tables.
To view Azure Data Lake Analytics jobs
1. Open the command palette (Ctrl+Shift+P) and select ADL: Show Jobs .
2. Select a Data Lake Analytics or local account.
3. Wait for the job list to appear for the account.
4. Select a job from the job list. Data Lake Tools opens the job view in the right pane and displays some
information in the VS Code output.

Integrate with Azure Data Lake Store through a command


You can use Azure Data Lake Store-related commands to:
Browse through the Azure Data Lake Store resources
Preview the Azure Data Lake Store file
Upload the file directly to Azure Data Lake Store in VS Code
Download the file directly from Azure Data Lake Store in VS Code
List the storage path
To list the storage path through the command palette
1. Right-click the script editor and select ADL: List Path .
2. Choose the folder in the list, or select Enter a path or Browse from root path . (We're using Enter a path
as an example.)
3. Select your Data Lake Analytics account.
4. Browse to or enter the storage folder path (for example, /output/).
The command palette lists the path information based on your entries.

A more convenient way to list the relative path is through the shortcut menu.
To list the storage path through the shortcut menu
Right-click the path string and select List Path .

Preview the storage file


1. Right-click the script editor and select ADL: Preview File .
2. Select your Data Lake Analytics account.
3. Enter an Azure Storage file path (for example, /output/SearchLog.txt).
The file opens in VS Code.
Another way to preview the file is through the shortcut menu on the file's full path or the file's relative path in
the script editor.
Upload a file or folder
1. Right-click the script editor and select Upload File or Upload Folder .
2. Choose one file or multiple files if you selected Upload File , or choose the whole folder if you selected
Upload Folder . Then select Upload .
3. Choose the storage folder in the list, or select Enter a path or Browse from root path . (We're using Enter
a path as an example.)
4. Select your Data Lake Analytics account.
5. Browse to or enter the storage folder path (for example, /output/).
6. Select Choose Current Folder to specify your upload destination.

Another way to upload files to storage is through the shortcut menu on the file's full path or the file's relative
path in the script editor.
You can monitor the upload status.
Download a file
You can download a file by using the command ADL: Download File or ADL: Download File (Advanced) .
To download a file through the ADL: Download File (Advanced) command
1. Right-click the script editor, and then select Download File (Advanced) .
2. VS Code displays a JSON file. You can enter file paths and download multiple files at the same time.
Instructions are displayed in the Output window. To proceed to download the file or files, save (Ctrl+S)
the JSON file.

The Output window displays the file download status.

You can monitor the download status.


To download a file through the ADL: Download File command
1. Right-click the script editor, select Download File , and then select the destination folder from the Select
Folder dialog box.
2. Choose the folder in the list, or select Enter a path or Browse from root path . (We're using Enter a
path as an example.)
3. Select your Data Lake Analytics account.
4. Browse to or enter the storage folder path (for example, /output/), and then choose a file to download.

Another way to download storage files is through the shortcut menu on the file's full path or the file's relative
path in the script editor.
You can monitor the download status.
Check storage tasks' status
The upload and download status appears on the status bar. Select the status bar, and then the status appears on
the OUTPUT tab.

Integrate with Azure Data Lake Analytics from the explorer


After you log in, all the subscriptions for your Azure account are listed in the left pane, under AZURE DATALAKE .

Data Lake Analytics metadata navigation


Expand your Azure subscription. Under the U-SQL Databases node, you can browse through your U-SQL
database and view folders like Schemas , Credentials , Assemblies , Tables , and Index .
Data Lake Analytics metadata entity management
Expand U-SQL Databases . You can create a database, schema, table, table type, index, or statistic by right-
clicking the corresponding node, and then selecting Script to Create on the shortcut menu. On the opened
script page, edit the script according to your needs. Then submit the job by right-clicking it and selecting ADL:
Submit Job .
After you finish creating the item, right-click the node and then select Refresh to show the item. You can also
delete the item by right-clicking it and then selecting Delete .
Data Lake Analytics assembly registration
You can register an assembly in the corresponding database by right-clicking the Assemblies node, and then
selecting Register assembly .

Integrate with Azure Data Lake Store from the explorer


Browse to Data Lake Store :
You can right-click the folder node and then use the Refresh , Delete , Upload , Upload Folder , Copy
Relative Path , and Copy Full Path commands on the shortcut menu.

You can right-click the file node and then use the Preview , Download , Delete , Create EXTRACT Script
(available only for CSV, TSV, and TXT files), Copy Relative Path , and Copy Full Path commands on the
shortcut menu.
Integrate with Azure Blob storage from the explorer
Browse to Blob storage:
You can right-click the blob container node and then use the Refresh , Delete Blob Container , and
Upload Blob commands on the shortcut menu.

You can right-click the folder node and then use the Refresh and Upload Blob commands on the
shortcut menu.

You can right-click the file node and then use the Preview/Edit , Download , Delete , Create EXTRACT
Script (available only for CSV, TSV, and TXT files), Copy Relative Path , and Copy Full Path commands
on the shortcut menu.
Open the Data Lake explorer in the portal
1. Select Ctrl+Shift+P to open the command palette.
2. Enter Open Web Azure Storage Explorer or right-click a relative path or the full path in the script editor,
and then select Open Web Azure Storage Explorer .
3. Select a Data Lake Analytics account.
Data Lake Tools opens the Azure Storage path in the Azure portal. You can find the path and preview the file
from the web.

Additional features
Data Lake Tools for VS Code supports the following features:
IntelliSense autocomplete : Suggestions appear in pop-up windows around items like keywords,
methods, and variables. Different icons represent different types of objects:
Scalar data type
Complex data type
Built-in UDTs
.NET collection and classes
C# expressions
Built-in C# UDFs, UDOs, and UDAGGs
U-SQL functions
U-SQL windowing functions

IntelliSense autocomplete on Data Lake Analytics metadata : Data Lake Tools downloads the Data
Lake Analytics metadata information locally. The IntelliSense feature automatically populates objects from
the Data Lake Analytics metadata. These objects include the database, schema, table, view, table-valued
function, procedures, and C# assemblies.

IntelliSense error marker : Data Lake Tools underlines editing errors for U-SQL and C#.
Syntax highlights : Data Lake Tools uses colors to differentiate items like variables, keywords, data types,
and functions.

NOTE
We recommend that you upgrade to Azure Data Lake Tools for Visual Studio version 2.3.3000.4 or later. The previous
versions are no longer available for download and are now deprecated.

Next steps
Develop U-SQL with Python, R, and C Sharp for Azure Data Lake Analytics in VS Code
U-SQL local run and local debug with Visual Studio Code
Tutorial: Get started with Azure Data Lake Analytics
Tutorial: Develop U-SQL scripts by using Data Lake Tools for Visual Studio
Run U-SQL and debug locally in Visual Studio Code

This article describes how to run U-SQL jobs on a local development machine to speed up early coding phases
or to debug code locally in Visual Studio Code. For instructions on Azure Data Lake Tool for Visual Studio Code,
see Use Azure Data Lake Tools for Visual Studio Code.
Only Windows installations of the Azure Data Lake Tools for Visual Studio support the action to run U-SQL
locally and debug U-SQL locally. Installations on macOS and Linux-based operating systems do not support this
feature.

Set up the U-SQL local run environment


1. Select Ctrl+Shift+P to open the command palette, and then enter ADL: Download Local Run Package
to download the packages.

2. Locate the dependency packages from the path shown in the Output pane, and then install BuildTools
and Win10SDK 10240. Here is an example path:
C:\Users\xxx\AppData\Roaming\LocalRunDependency

2.1 To install BuildTools , click visualcppbuildtools_full.exe in the LocalRunDependency folder, then follow
the wizard instructions.
2.2 To install Win10SDK 10240 , click sdksetup.exe in the
LocalRunDependency/Win10SDK_10.0.10240_2 folder, then follow the wizard instructions.

3. Set up the environment variable. Set the SCOPE_CPP_SDK environment variable to:
C:\Users\XXX\AppData\Roaming\LocalRunDependency\CppSDK_3rdparty

Start the local run service and submit the U-SQL job to a local
account
For the first-time user, use ADL: Download Local Run Package to download local run packages, if you have
not set up U-SQL local run environment.
1. Select Ctrl+Shift+P to open the command palette, and then enter ADL: Start Local Run Service .
2. Select Accept to accept the Microsoft Software License Terms for the first time.

3. The cmd console opens. For first-time users, you need to enter 3 , and then locate the local folder path for
your data input and output. If you are unsuccessful defining the path with backslashes, try forward
slashes. For other options, you can use the default values.

4. Select Ctrl+Shift+P to open the command palette, enter ADL: Submit Job , and then select Local to
submit the job to your local account.

5. After you submit the job, you can view the submission details. To view the submission details, select
jobUrl in the Output window. You can also view the job submission status from the cmd console. Enter
7 in the cmd console if you want to know more job details.
Start a local debug for the U-SQL job
For the first-time user:
1. Use ADL: Download Local Run Package to download local run packages, if you have not set up U-SQL
local run environment.
2. Install .NET Core SDK 2.0 as suggested in the message box, if not installed.

3. Install C# for Visual Studio Code as suggested in the message box if not installed. Click Install to continue,
and then restart VSCode.

Follow steps below to perform local debug:


1. Select Ctrl+Shift+P to open the command palette, and then enter ADL: Start Local Run Service . The
cmd console opens. Make sure that the DataRoot is set.
2. Set a breakpoint in your C# code-behind.
3. Back to script editor, right-click and select ADL: Local Debug .

Next steps
Use the Azure Data Lake Tools for Visual Studio Code
Develop U-SQL with Python, R, and CSharp for Azure Data Lake Analytics in VSCode
Get started with Data Lake Analytics using PowerShell
Get started with Data Lake Analytics using the Azure portal
Use Data Lake Tools for Visual Studio for developing U-SQL applications
Use Data Lake Analytics(U-SQL) catalog
U-SQL programmability guide overview

U-SQL is a query language that's designed for big data workloads. One of the unique features of U-SQL is the combination of a SQL-like declarative language with the extensibility and programmability provided by C#. In this guide, we concentrate on the extensibility and programmability of the U-SQL language that's enabled by C#.

Requirements
Download and install Azure Data Lake Tools for Visual Studio.

Get started with U-SQL


Look at the following U-SQL script:

@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0, "2017-03-39"),
("Woodgrove", 2700.0, "2017-04-10")
) AS D( customer, amount, date );

@results =
SELECT
customer,
amount,
date
FROM @a;

This script defines two RowSets: @a and @results . RowSet @results is defined from @a .
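A U-SQL script typically ends with an OUTPUT statement that writes a rowset to a file; without one, the rowsets above are never materialized. The following minimal sketch completes the example (the output path is an arbitrary placeholder, not part of the original sample):

OUTPUT @results
TO "/output/example.csv"
USING Outputters.Csv();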

C# types and expressions in U-SQL script


A U-SQL expression is a C# expression combined with U-SQL logical operations such as AND , OR , and NOT . U-SQL expressions can be used with SELECT, EXTRACT, WHERE, HAVING, GROUP BY and DECLARE. For example,
the following script parses a string as a DateTime value.

@results =
SELECT
customer,
amount,
DateTime.Parse(date) AS date
FROM @a;

The following snippet parses a string as DateTime value in a DECLARE statement.

DECLARE @d = DateTime.Parse("2016/01/01");
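C# expressions can also filter rows in a WHERE clause. The following sketch builds on the @a rowset defined earlier (the cutoff date is an arbitrary example):

@filtered =
SELECT
customer,
amount
FROM @a
WHERE DateTime.Parse(date) > DateTime.Parse("2017-04-01");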

Use C# expressions for data type conversions


The following example demonstrates how you can do a datetime data conversion by using C# expressions. In
this particular scenario, string datetime data is converted to standard datetime with midnight 00:00:00 time
notation.

DECLARE @dt = "2016-07-06 10:23:15";

@rs1 =
SELECT
Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")) AS dt,
dt AS olddt
FROM @rs0;

OUTPUT @rs1
TO @output_file
USING Outputters.Text();

Use C# expressions for today’s date


To pull today's date, we can use the following C# expression: DateTime.Now.ToString("M/d/yyyy")

Here's an example of how to use this expression in a script:

@rs1 =
SELECT
MAX(guid) AS start_id,
MIN(dt) AS start_time,
MIN(Convert.ToDateTime(Convert.ToDateTime(dt<@default_dt?@default_dt:dt).ToString("yyyy-MM-dd"))) AS
start_zero_time,
MIN(USQL_Programmability.CustomFunctions.GetFiscalPeriod(dt)) AS start_fiscalperiod,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
user,
des
FROM @rs0
GROUP BY user, des;

Using .NET assemblies


U-SQL’s extensibility model relies heavily on the ability to add custom code from .NET assemblies.
Register a .NET assembly
Use the CREATE ASSEMBLY statement to place a .NET assembly into a U-SQL Database. Afterwards, U-SQL scripts
can use those assemblies by using the REFERENCE ASSEMBLY statement.
The following code shows how to register an assembly:

CREATE ASSEMBLY MyDB.[MyAssembly]


FROM "/myassembly.dll";

The following code shows how to reference an assembly:

REFERENCE ASSEMBLY MyDB.[MyAssembly];

Consult the assembly registration instructions that cover this topic in greater detail.
Use assembly versioning
Currently, U-SQL uses the .NET Framework version 4.7.2. So ensure that your own assemblies are compatible
with that version of the runtime.
As mentioned earlier, U-SQL runs code in a 64-bit (x64) format. So make sure that your code is compiled to run
on x64. Otherwise you get the incorrect format error shown earlier.
Each uploaded assembly DLL and resource file, such as a different runtime, a native assembly, or a config file,
can be at most 400 MB. The total size of deployed resources, either via DEPLOY RESOURCE or via references to
assemblies and their additional files, cannot exceed 3 GB.
Finally, note that each U-SQL database can only contain one version of any given assembly. For example, if you
need both version 7 and version 8 of the NewtonSoft Json.NET library, you need to register them in two
different databases. Furthermore, each script can only refer to one version of a given assembly DLL. In this
respect, U-SQL follows the C# assembly management and versioning semantics.
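As an illustration of the one-version-per-database rule, the following sketch registers two versions of the same library in two separate U-SQL databases (the database names and DLL paths are assumptions for the example):

CREATE DATABASE IF NOT EXISTS JsonV7Db;
CREATE ASSEMBLY JsonV7Db.[Newtonsoft.Json] FROM "/assemblies/v7/Newtonsoft.Json.dll";

CREATE DATABASE IF NOT EXISTS JsonV8Db;
CREATE ASSEMBLY JsonV8Db.[Newtonsoft.Json] FROM "/assemblies/v8/Newtonsoft.Json.dll";

A later script then references exactly one of the two versions:

REFERENCE ASSEMBLY JsonV8Db.[Newtonsoft.Json];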

Use user-defined functions: UDF


U-SQL user-defined functions, or UDF, are programming routines that accept parameters, perform an action
(such as a complex calculation), and return the result of that action as a value. The return value of a UDF can only
be a single scalar. A U-SQL UDF can be called in a U-SQL base script like any other C# scalar function.
We recommend that you initialize U-SQL user-defined functions as public and static .

public static string MyFunction(string param1)


{
return "my result";
}

First let’s look at the simple example of creating a UDF.


In this use-case scenario, we need to determine the fiscal period, including the fiscal quarter and fiscal month of
the first sign-in for the specific user. The first fiscal month of the year in our scenario is July.
To calculate fiscal period, we introduce the following C# function:

public static string GetFiscalPeriod(DateTime dt)


{
int FiscalMonth=0;
if (dt.Month < 7)
{
FiscalMonth = dt.Month + 6;
}
else
{
FiscalMonth = dt.Month - 6;
}

int FiscalQuarter=0;
if (FiscalMonth >=1 && FiscalMonth<=3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}

return "Q" + FiscalQuarter.ToString() + ":P" + FiscalMonth.ToString();


}
It simply calculates the fiscal month and quarter and returns a string value. For July, the first month of the first
fiscal quarter, the function returns "Q1:P1". For August, it returns "Q1:P2", and so on.
This is a regular C# function that we are going to use in our U-SQL project.
Here is how the code-behind section looks in this scenario:

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace USQL_Programmability
{
public class CustomFunctions
{
public static string GetFiscalPeriod(DateTime dt)
{
int FiscalMonth=0;
if (dt.Month < 7)
{
FiscalMonth = dt.Month + 6;
}
else
{
FiscalMonth = dt.Month - 6;
}

int FiscalQuarter=0;
if (FiscalMonth >=1 && FiscalMonth<=3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}

return "Q" + FiscalQuarter.ToString() + ":" + FiscalMonth.ToString();


}
}
}

Now we are going to call this function from the base U-SQL script. To do this, we have to provide a fully
qualified name for the function, including the namespace, which in this case is
NameSpace.Class.Function(parameter).

USQL_Programmability.CustomFunctions.GetFiscalPeriod(dt)

Following is the actual U-SQL base script:


DECLARE @input_file string = @"\usql-programmability\input_file.tsv";
DECLARE @output_file string = @"\usql-programmability\output_file.tsv";

@rs0 =
EXTRACT
guid Guid,
dt DateTime,
user String,
des String
FROM @input_file USING Extractors.Tsv();

DECLARE @default_dt DateTime = Convert.ToDateTime("06/01/2016");

@rs1 =
SELECT
MAX(guid) AS start_id,
MIN(dt) AS start_time,
MIN(Convert.ToDateTime(Convert.ToDateTime(dt<@default_dt?@default_dt:dt).ToString("yyyy-MM-dd"))) AS
start_zero_time,
MIN(USQL_Programmability.CustomFunctions.GetFiscalPeriod(dt)) AS start_fiscalperiod,
user,
des
FROM @rs0
GROUP BY user, des;

OUTPUT @rs1
TO @output_file
USING Outputters.Text();

Following is the output file of the script execution:

0d8b9630-d5ca-11e5-8329-251efa3a2941,2016-02-11T07:04:17.2630000-08:00,2016-06-
01T00:00:00.0000000,"Q3:8","User1",""

20843640-d771-11e5-b87b-8b7265c75a44,2016-02-11T07:04:17.2630000-08:00,2016-06-
01T00:00:00.0000000,"Q3:8","User2",""

301f23d2-d690-11e5-9a98-4b4f60a1836f,2016-02-11T09:01:33.9720000-08:00,2016-06-
01T00:00:00.0000000,"Q3:8","User3",""

This example demonstrates a simple usage of inline UDF in U-SQL.


Keep state between UDF invocations
U-SQL C# programmability objects can be more sophisticated, utilizing interactivity through the code-behind
global variables. Let’s look at the following business use-case scenario.
In large organizations, users can switch between varieties of internal applications. These can include Microsoft
Dynamics CRM, Power BI, and so on. Customers might want to apply a telemetry analysis of how users switch
between different applications, what the usage trends are, and so on. The goal for the business is to optimize
application usage. They also might want to combine different applications or specific sign-on routines.
To achieve this goal, we have to determine session IDs and lag time between the last session that occurred.
We need to find a previous sign-in and then assign this sign-in to all sessions that are being generated to the
same application. The first challenge is that U-SQL base script doesn't allow us to apply calculations over
already-calculated columns with LAG function. The second challenge is that we have to keep the specific session
for all sessions within the same time period.
To solve this problem, we use a global variable inside a code-behind section:
static public string globalSession;
This global variable is applied to the entire rowset during our script execution.
Here is the code-behind section of our U-SQL program:

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace USQLApplication21
{
public class UserSession
{
static public string globalSession;
static public string StampUserSession(string eventTime, string PreviousRow, string Session)
{

if (!string.IsNullOrEmpty(PreviousRow))
{
double timeGap =
Convert.ToDateTime(eventTime).Subtract(Convert.ToDateTime(PreviousRow)).TotalMinutes;
if (timeGap <= 60) {return Session;}
else {return Guid.NewGuid().ToString();}
}
else {return Guid.NewGuid().ToString();}
}

static public string getStampUserSession(string Session)


{
if (Session != globalSession && !string.IsNullOrEmpty(Session)) { globalSession = Session; }
return globalSession;
}

}
}

This example shows the global variable static public string globalSession; used inside the
getStampUserSession function and getting reinitialized each time the Session parameter is changed.

The U-SQL base script is as follows:


DECLARE @in string = @"\UserSession\test1.tsv";
DECLARE @out1 string = @"\UserSession\Out1.csv";
DECLARE @out2 string = @"\UserSession\Out2.csv";
DECLARE @out3 string = @"\UserSession\Out3.csv";

@records =
EXTRACT DataId string,
EventDateTime string,
UserName string,
UserSessionTimestamp string

FROM @in
USING Extractors.Tsv();

@rs1 =
SELECT
EventDateTime,
UserName,
LAG(EventDateTime, 1)
OVER(PARTITION BY UserName ORDER BY EventDateTime ASC) AS prevDateTime,
string.IsNullOrEmpty(LAG(EventDateTime, 1)
OVER(PARTITION BY UserName ORDER BY EventDateTime ASC)) AS Flag,
USQLApplication21.UserSession.StampUserSession
(
EventDateTime,
LAG(EventDateTime, 1) OVER(PARTITION BY UserName ORDER BY EventDateTime ASC),
LAG(UserSessionTimestamp, 1) OVER(PARTITION BY UserName ORDER BY EventDateTime ASC)
) AS UserSessionTimestamp
FROM @records;

@rs2 =
SELECT
EventDateTime,
UserName,
LAG(EventDateTime, 1)
OVER(PARTITION BY UserName ORDER BY EventDateTime ASC) AS prevDateTime,
string.IsNullOrEmpty( LAG(EventDateTime, 1) OVER(PARTITION BY UserName ORDER BY EventDateTime ASC))
AS Flag,
USQLApplication21.UserSession.getStampUserSession(UserSessionTimestamp) AS UserSessionTimestamp
FROM @rs1
WHERE UserName != "UserName";

OUTPUT @rs2
TO @out2
ORDER BY UserName, EventDateTime ASC
USING Outputters.Csv();

The function USQLApplication21.UserSession.getStampUserSession(UserSessionTimestamp) is called here during the calculation of the second in-memory rowset. It receives the UserSessionTimestamp column and returns the same value until UserSessionTimestamp changes.

The output file is as follows:


"2016-02-19T07:32:36.8420000-08:00","User1",,True,"72a0660e-22df-428e-b672-e0977007177f"
"2016-02-17T11:52:43.6350000-08:00","User2",,True,"4a0cd19a-6e67-4d95-a119-4eda590226ba"
"2016-02-17T11:59:08.8320000-08:00","User2","2016-02-17T11:52:43.6350000-08:00",False,"4a0cd19a-6e67-4d95-
a119-4eda590226ba"
"2016-02-11T07:04:17.2630000-08:00","User3",,True,"51860a7a-1610-4f74-a9ea-69d5eef7cd9c"
"2016-02-11T07:10:33.9720000-08:00","User3","2016-02-11T07:04:17.2630000-08:00",False,"51860a7a-1610-4f74-
a9ea-69d5eef7cd9c"
"2016-02-15T21:27:41.8210000-08:00","User3","2016-02-11T07:10:33.9720000-08:00",False,"4d2bc48d-bdf3-4591-
a9c1-7b15ceb8e074"
"2016-02-16T05:48:49.6360000-08:00","User3","2016-02-15T21:27:41.8210000-08:00",False,"dd3006d0-2dcd-42d0-
b3a2-bc03dd77c8b9"
"2016-02-16T06:22:43.6390000-08:00","User3","2016-02-16T05:48:49.6360000-08:00",False,"dd3006d0-2dcd-42d0-
b3a2-bc03dd77c8b9"
"2016-02-17T16:29:53.2280000-08:00","User3","2016-02-16T06:22:43.6390000-08:00",False,"2fa899c7-eecf-4b1b-
a8cd-30c5357b4f3a"
"2016-02-17T16:39:07.2430000-08:00","User3","2016-02-17T16:29:53.2280000-08:00",False,"2fa899c7-eecf-4b1b-
a8cd-30c5357b4f3a"
"2016-02-17T17:20:39.3220000-08:00","User3","2016-02-17T16:39:07.2430000-08:00",False,"2fa899c7-eecf-4b1b-
a8cd-30c5357b4f3a"
"2016-02-19T05:23:54.5710000-08:00","User3","2016-02-17T17:20:39.3220000-08:00",False,"6ca7ed80-c149-4c22-
b24b-94ff5b0d824d"
"2016-02-19T05:48:37.7510000-08:00","User3","2016-02-19T05:23:54.5710000-08:00",False,"6ca7ed80-c149-4c22-
b24b-94ff5b0d824d"
"2016-02-19T06:40:27.4830000-08:00","User3","2016-02-19T05:48:37.7510000-08:00",False,"6ca7ed80-c149-4c22-
b24b-94ff5b0d824d"
"2016-02-19T07:27:37.7550000-08:00","User3","2016-02-19T06:40:27.4830000-08:00",False,"6ca7ed80-c149-4c22-
b24b-94ff5b0d824d"
"2016-02-19T19:35:40.9450000-08:00","User3","2016-02-19T07:27:37.7550000-08:00",False,"3f385f0b-3e68-4456-
ac74-ff6cef093674"
"2016-02-20T00:07:37.8250000-08:00","User3","2016-02-19T19:35:40.9450000-08:00",False,"685f76d5-ca48-4c58-
b77d-bd3a9ddb33da"
"2016-02-11T09:01:33.9720000-08:00","User4",,True,"9f0cf696-c8ba-449a-8d5f-1ca6ed8f2ee8"
"2016-02-17T06:30:38.6210000-08:00","User4","2016-02-11T09:01:33.9720000-08:00",False,"8b11fd2a-01bf-4a5e-
a9af-3c92c4e4382a"
"2016-02-17T22:15:26.4020000-08:00","User4","2016-02-17T06:30:38.6210000-08:00",False,"4e1cb707-3b5f-49c1-
90c7-9b33b86ca1f4"
"2016-02-18T14:37:27.6560000-08:00","User4","2016-02-17T22:15:26.4020000-08:00",False,"f4e44400-e837-40ed-
8dfd-2ea264d4e338"
"2016-02-19T01:20:31.4800000-08:00","User4","2016-02-18T14:37:27.6560000-08:00",False,"2136f4cf-7c7d-43c1-
8ae2-08f4ad6a6e08"

This example demonstrates a more complicated use-case scenario in which we use a global variable inside a
code-behind section that's applied to the entire memory rowset.

Next steps
U-SQL programmability guide - UDT and UDAGG
U-SQL programmability guide - UDO
U-SQL programmability guide - UDT and UDAGG

Use user-defined types: UDT


User-defined types, or UDTs, are another programmability feature of U-SQL. A U-SQL UDT acts like a regular C# user-defined type. C# is a strongly typed language that allows the use of built-in and custom user-defined types.
U-SQL cannot implicitly serialize or de-serialize arbitrary UDTs when the UDT is passed between vertices in
rowsets. This means that the user has to provide an explicit formatter by using the IFormatter interface. This
provides U-SQL with the serialize and de-serialize methods for the UDT.

NOTE
U-SQL’s built-in extractors and outputters currently cannot serialize or de-serialize UDT data to or from files even with the
IFormatter set. So when you're writing UDT data to a file with the OUTPUT statement, or reading it with an extractor, you
have to pass it as a string or byte array. Then you call the serialization and deserialization code (that is, the UDT’s ToString()
method) explicitly. User-defined extractors and outputters, on the other hand, can read and write UDTs.

If we try to use a UDT in an EXTRACTOR or OUTPUTTER (passed from a previous SELECT), as shown here:

@rs1 =
SELECT
MyNameSpace.Myfunction_Returning_UDT(filed1) AS myfield
FROM @rs0;

OUTPUT @rs1
TO @output_file
USING Outputters.Text();

We receive the following error:

Error 1 E_CSC_USER_INVALIDTYPEINOUTPUTTER: Outputters.Text was used to output column myfield of type


MyNameSpace.Myfunction_Returning_UDT.

Description:

Outputters.Text only supports built-in types.

Resolution:

Implement a custom outputter that knows how to serialize this type, or call a serialization method on the
type in
the preceding SELECT. C:\Users\sergeypu\Documents\Visual Studio 2013\Projects\USQL-Programmability\
USQL-Programmability\Types.usql 52 1 USQL-Programmability

To work with a UDT in an outputter, we either have to serialize it to a string with the ToString() method or create a
custom outputter.
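For example, the failing script above can be rewritten as the following sketch, which reuses the hypothetical Myfunction_Returning_UDT from the error example and serializes the UDT before it reaches the outputter:

@rs1 =
SELECT
MyNameSpace.Myfunction_Returning_UDT(filed1).ToString() AS myfield
FROM @rs0;

OUTPUT @rs1
TO @output_file
USING Outputters.Text();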
UDTs currently cannot be used in GROUP BY. If UDT is used in GROUP BY, the following error is thrown:
Error 1 E_CSC_USER_INVALIDTYPEINCLAUSE: GROUP BY doesn't support type MyNameSpace.Myfunction_Returning_UDT
for column myfield

Description:

GROUP BY doesn't support UDT or Complex types.

Resolution:

Add a SELECT statement where you can project a scalar column that you want to use with GROUP BY.
C:\Users\sergeypu\Documents\Visual Studio 2013\Projects\USQL-Programmability\USQL-Programmability\Types.usql
62 5 USQL-Programmability
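A sketch of that resolution, again using the hypothetical function from the error example, first projects the UDT as a scalar string and then groups on that column:

@rs1 =
SELECT
MyNameSpace.Myfunction_Returning_UDT(filed1).ToString() AS myfield
FROM @rs0;

@rs2 =
SELECT
myfield,
COUNT(*) AS cnt
FROM @rs1
GROUP BY myfield;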

To define a UDT, we have to:


1. Add the following namespaces:

using Microsoft.Analytics.Interfaces;
using System.IO;

2. Add Microsoft.Analytics.Interfaces , which is required for the UDT interfaces. In addition, System.IO
might be needed to define the IFormatter interface.
3. Define a user-defined type with the SqlUserDefinedType attribute.
SqlUserDefinedType is used to mark a type definition in an assembly as a user-defined type (UDT) in U-SQL.
The properties on the attribute reflect the physical characteristics of the UDT. This class cannot be inherited.
SqlUserDefinedType is a required attribute for UDT definition.
The constructor of the class:
SqlUserDefinedTypeAttribute (type formatter)
Type formatter: Required parameter to define an UDT formatter--specifically, the type of the IFormatter
interface must be passed here.

[SqlUserDefinedType(typeof(MyTypeFormatter))]
public class MyType
{ … }

Typical UDT also requires definition of the IFormatter interface, as shown in the following example:

public class MyTypeFormatter : IFormatter<MyType>


{
public void Serialize(MyType instance, IColumnWriter writer, ISerializationContext context)
{ … }

public MyType Deserialize(IColumnReader reader, ISerializationContext context)


{ … }
}

The IFormatter interface serializes and de-serializes an object graph with the root type T, where T is the root type for the object graph to serialize and de-serialize.
Deserialize : De-serializes the data on the provided stream and reconstitutes the graph of objects.
Serialize : Serializes an object, or graph of objects, with the given root to the provided stream.
MyType instance: Instance of the type.
IColumnWriter writer / IColumnReader reader: The underlying column stream.
ISerializationContext context: Enum that defines a set of flags that specifies the source or destination context
for the stream during serialization.
Intermediate : Specifies that the source or destination context is not a persisted store.
Persistence : Specifies that the source or destination context is a persisted store.
As a regular C# type, a U-SQL UDT definition can include overrides for operators such as +/==/!=. It can also
include static methods. For example, if we are going to use this UDT as a parameter to a U-SQL MIN aggregate
function, we have to define < operator override.
Earlier in this guide, we demonstrated an example for fiscal period identification from the specific date in the
format Qn:Pn (Q1:P10) . The following example shows how to define a custom type for fiscal period values.
Following is an example of a code-behind section with custom UDT and IFormatter interface:

[SqlUserDefinedType(typeof(FiscalPeriodFormatter))]
public struct FiscalPeriod
{
public int Quarter { get; private set; }

public int Month { get; private set; }

public FiscalPeriod(int quarter, int month):this()


{
this.Quarter = quarter;
this.Month = month;
}

public override bool Equals(object obj)


{
if (ReferenceEquals(null, obj))
{
return false;
}

return obj is FiscalPeriod && Equals((FiscalPeriod)obj);


}

public bool Equals(FiscalPeriod other)


{
return this.Quarter.Equals(other.Quarter) && this.Month.Equals(other.Month);
}

public bool GreaterThan(FiscalPeriod other)


{
return this.Quarter.CompareTo(other.Quarter) > 0 || this.Month.CompareTo(other.Month) > 0;
}

public bool LessThan(FiscalPeriod other)


{
return this.Quarter.CompareTo(other.Quarter) < 0 || this.Month.CompareTo(other.Month) < 0;
}

public override int GetHashCode()


{
unchecked
{
return (this.Quarter.GetHashCode() * 397) ^ this.Month.GetHashCode();
}
}
public static FiscalPeriod operator +(FiscalPeriod c1, FiscalPeriod c2)
{
return new FiscalPeriod((c1.Quarter + c2.Quarter) > 4 ? (c1.Quarter + c2.Quarter)-4 : (c1.Quarter +
c2.Quarter), (c1.Month + c2.Month) > 12 ? (c1.Month + c2.Month) - 12 : (c1.Month + c2.Month));
}

public static bool operator ==(FiscalPeriod c1, FiscalPeriod c2)


{
return c1.Equals(c2);
}

public static bool operator !=(FiscalPeriod c1, FiscalPeriod c2)


{
return !c1.Equals(c2);
}
public static bool operator >(FiscalPeriod c1, FiscalPeriod c2)
{
return c1.GreaterThan(c2);
}
public static bool operator <(FiscalPeriod c1, FiscalPeriod c2)
{
return c1.LessThan(c2);
}
public override string ToString()
{
return (String.Format("Q{0}:P{1}", this.Quarter, this.Month));
}

public class FiscalPeriodFormatter : IFormatter<FiscalPeriod>


{
public void Serialize(FiscalPeriod instance, IColumnWriter writer, ISerializationContext context)
{
using (var binaryWriter = new BinaryWriter(writer.BaseStream))
{
binaryWriter.Write(instance.Quarter);
binaryWriter.Write(instance.Month);
binaryWriter.Flush();
}
}

public FiscalPeriod Deserialize(IColumnReader reader, ISerializationContext context)


{
using (var binaryReader = new BinaryReader(reader.BaseStream))
{
var result = new FiscalPeriod(binaryReader.ReadInt16(), binaryReader.ReadInt16());
return result;
}
}
}
}

The defined type includes two numbers: quarter and month. The operators == , != , > , < and the ToString()
override are defined here.

As mentioned earlier, UDT can be used in SELECT expressions, but cannot be used in OUTPUTTER/EXTRACTOR
without custom serialization. It either has to be serialized as a string with ToString() or used with a custom
OUTPUTTER/EXTRACTOR.
Now let’s discuss usage of UDT. In a code-behind section, we changed our GetFiscalPeriod function to the
following:
public static FiscalPeriod GetFiscalPeriodWithCustomType(DateTime dt)
{
int FiscalMonth = 0;
if (dt.Month < 7)
{
FiscalMonth = dt.Month + 6;
}
else
{
FiscalMonth = dt.Month - 6;
}

int FiscalQuarter = 0;
if (FiscalMonth >= 1 && FiscalMonth <= 3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}

return new FiscalPeriod(FiscalQuarter, FiscalMonth);


}

As you can see, it returns the value of our FiscalPeriod type.


Here we provide an example of how to use it further in U-SQL base script. This example demonstrates different
forms of UDT invocation from U-SQL script.
DECLARE @input_file string = @"c:\work\cosmos\usql-programmability\input_file.tsv";
DECLARE @output_file string = @"c:\work\cosmos\usql-programmability\output_file.tsv";

@rs0 =
EXTRACT
guid string,
dt DateTime,
user String,
des String
FROM @input_file USING Extractors.Tsv();

@rs1 =
SELECT
guid AS start_id,
dt,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).Quarter AS fiscalquarter,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).Month AS fiscalmonth,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt) + new
USQL_Programmability.CustomFunctions.FiscalPeriod(1,7) AS fiscalperiod_adjusted,
user,
des
FROM @rs0;

@rs2 =
SELECT
start_id,
dt,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
fiscalquarter,
fiscalmonth,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).ToString() AS fiscalperiod,

// This user-defined type was created in the prior SELECT. Passing the UDT to this subsequent SELECT
would have failed if the UDT was not annotated with an IFormatter.
fiscalperiod_adjusted.ToString() AS fiscalperiod_adjusted,
user,
des
FROM @rs1;

OUTPUT @rs2
TO @output_file
USING Outputters.Text();

Here's an example of a full code-behind section:

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace USQL_Programmability
{
public class CustomFunctions
{
static public DateTime? ToDateTime(string dt)
{
    DateTime dtValue;

    // Return the parsed value, or null if the string is not a valid DateTime.
    if (DateTime.TryParse(dt, out dtValue))
        return dtValue;
    else
        return null;
}

public static FiscalPeriod GetFiscalPeriodWithCustomType(DateTime dt)


{
int FiscalMonth = 0;
if (dt.Month < 7)
{
FiscalMonth = dt.Month + 6;
}
else
{
FiscalMonth = dt.Month - 6;
}

int FiscalQuarter = 0;
if (FiscalMonth >= 1 && FiscalMonth <= 3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}

return new FiscalPeriod(FiscalQuarter, FiscalMonth);


}

[SqlUserDefinedType(typeof(FiscalPeriodFormatter))]
public struct FiscalPeriod
{
public int Quarter { get; private set; }

public int Month { get; private set; }

public FiscalPeriod(int quarter, int month):this()


{
this.Quarter = quarter;
this.Month = month;
}

public override bool Equals(object obj)


{
if (ReferenceEquals(null, obj))
{
return false;
}

return obj is FiscalPeriod && Equals((FiscalPeriod)obj);


}

public bool Equals(FiscalPeriod other)


{
return this.Quarter.Equals(other.Quarter) && this.Month.Equals(other.Month);
}

public bool GreaterThan(FiscalPeriod other)


{
return this.Quarter.CompareTo(other.Quarter) > 0 || this.Month.CompareTo(other.Month) > 0;
}

public bool LessThan(FiscalPeriod other)


{
return this.Quarter.CompareTo(other.Quarter) < 0 || this.Month.CompareTo(other.Month) < 0;
}

public override int GetHashCode()


{
unchecked
{
return (this.Quarter.GetHashCode() * 397) ^ this.Month.GetHashCode();
}
}

public static FiscalPeriod operator +(FiscalPeriod c1, FiscalPeriod c2)


{
return new FiscalPeriod((c1.Quarter + c2.Quarter) > 4 ? (c1.Quarter + c2.Quarter)-4 : (c1.Quarter +
c2.Quarter), (c1.Month + c2.Month) > 12 ? (c1.Month + c2.Month) - 12 : (c1.Month + c2.Month));
}

public static bool operator ==(FiscalPeriod c1, FiscalPeriod c2)


{
return c1.Equals(c2);
}

public static bool operator !=(FiscalPeriod c1, FiscalPeriod c2)


{
return !c1.Equals(c2);
}
public static bool operator >(FiscalPeriod c1, FiscalPeriod c2)
{
return c1.GreaterThan(c2);
}
public static bool operator <(FiscalPeriod c1, FiscalPeriod c2)
{
return c1.LessThan(c2);
}
public override string ToString()
{
return (String.Format("Q{0}:P{1}", this.Quarter, this.Month));
}

public class FiscalPeriodFormatter : IFormatter<FiscalPeriod>


{
public void Serialize(FiscalPeriod instance, IColumnWriter writer, ISerializationContext context)
{
using (var binaryWriter = new BinaryWriter(writer.BaseStream))
{
binaryWriter.Write(instance.Quarter);
binaryWriter.Write(instance.Month);
binaryWriter.Flush();
}
}

public FiscalPeriod Deserialize(IColumnReader reader, ISerializationContext context)


{
using (var binaryReader = new BinaryReader(reader.BaseStream))
{
var result = new FiscalPeriod(binaryReader.ReadInt16(), binaryReader.ReadInt16());
return result;
}
}
}
}
}
}

Use user-defined aggregates: UDAGG


User-defined aggregates are any aggregation-related functions that are not shipped out-of-the-box with U-SQL.
The example can be an aggregate to perform custom math calculations, string concatenations, manipulations
with strings, and so on.
The user-defined aggregate base class definition is as follows:

[SqlUserDefinedAggregate]
public abstract class IAggregate<T1, T2, TResult> : IAggregate
{
protected IAggregate();

public abstract void Accumulate(T1 t1, T2 t2);


public abstract void Init();
public abstract TResult Terminate();
}

SqlUserDefinedAggregate indicates that the type should be registered as a user-defined aggregate. This class
cannot be inherited.
The SqlUserDefinedAggregate attribute is optional for a UDAGG definition.
The base class allows you to pass three abstract parameters: two as input parameters and one as the result. The
data types are variable and should be defined during class inheritance.

public class GuidAggregate : IAggregate<string, string, string>


{
string guid_agg;

public override void Init()


{ … }

public override void Accumulate(string guid, string user)


{ … }

public override string Terminate()


{ … }
}

Init invokes once for each group during computation. It provides an initialization routine for each
aggregation group.
Accumulate is executed once for each value. It provides the main functionality for the aggregation
algorithm. It can be used to aggregate values with various data types that are defined during class
inheritance. It can accept two parameters of variable data types.
Terminate is executed once per aggregation group at the end of processing to output the result for each
group.
To declare correct input and output data types, use the class definition as follows:

public abstract class IAggregate<T1, T2, TResult> : IAggregate

T1: First parameter to accumulate


T2: Second parameter to accumulate
TResult: Return type of terminate
For example:
public class GuidAggregate : IAggregate<string, int, int>

or

public class GuidAggregate : IAggregate<string, string, string>

Use UDAGG in U -SQL


To use a UDAGG, first define it in code-behind or reference it from an existing programmability DLL as discussed
earlier.
Then use the following syntax:

AGG<UDAGG_functionname>(param1,param2)

Here is an example of UDAGG:

public class GuidAggregate : IAggregate<string, string, string>


{
string guid_agg;

public override void Init()


{
guid_agg = "";
}

public override void Accumulate(string guid, string user)


{
if (user.ToUpper()== "USER1")
{
guid_agg += "{" + guid + "}";
}
}

public override string Terminate()


{
return guid_agg;
}
}

And base U-SQL script:


DECLARE @input_file string = @"\usql-programmability\input_file.tsv";
DECLARE @output_file string = @" \usql-programmability\output_file.tsv";

@rs0 =
EXTRACT
guid string,
dt DateTime,
user String,
des String
FROM @input_file
USING Extractors.Tsv();

@rs1 =
SELECT
user,
AGG<USQL_Programmability.GuidAggregate>(guid,user) AS guid_list
FROM @rs0
GROUP BY user;

OUTPUT @rs1 TO @output_file USING Outputters.Text();

In this use-case scenario, we concatenate the GUIDs for the specific users.

Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDO
U-SQL user-defined objects overview

U-SQL: user-defined objects: UDO


U-SQL enables you to define custom programmability objects, which are called user-defined objects or UDO.
The following is a list of UDO in U-SQL:
User-defined extractors
Extract row by row
Used to implement data extraction from custom structured files
User-defined outputters
Output row by row
Used to output custom data types or custom file formats
User-defined processors
Take one row and produce one row
Used to reduce the number of columns or produce new columns with values that are derived from an
existing column set
User-defined appliers
Take one row and produce 0 to n rows
Used with OUTER/CROSS APPLY
User-defined combiners
Combines rowsets--user-defined JOINs
User-defined reducers
Take n rows and produce one row
Used to reduce the number of rows
UDO is typically called explicitly in U-SQL script as part of the following U-SQL statements:
EXTRACT
OUTPUT
PROCESS
COMBINE
REDUCE
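As a minimal illustration of this pattern, a UDO is instantiated with the new keyword inside one of these statements. The sketch below assumes a hypothetical MyNamespace.MyProcessor processor; the articles that follow show complete implementations for each UDO type.

@result =
PROCESS @input
PRODUCE
Region string,
Mkt string
USING new MyNamespace.MyProcessor();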

NOTE
UDOs are limited to 0.5 GB of memory. This memory limitation does not apply to local execution.

Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined extractor

U-SQL UDO: user-defined extractor


U-SQL allows you to import external data by using an EXTRACT statement. An EXTRACT statement can use built-
in UDO extractors:
Extractors.Text(): Provides extraction from delimited text files of different encodings.
Extractors.Csv(): Provides extraction from comma-separated value (CSV) files of different encodings.
Extractors.Tsv(): Provides extraction from tab-separated value (TSV) files of different encodings.
It can be useful to develop a custom extractor. This can be helpful during data import if we want to do any of the
following tasks:
Modify input data by splitting columns and modifying individual values. The PROCESSOR functionality is
better for combining columns.
Parse unstructured data such as Web pages and emails, or semi-unstructured data such as XML/JSON.
Parse data in unsupported encoding.

How to define and use user-defined extractor


To define a user-defined extractor, or UDE, we need to create an IExtractor interface. All input parameters to
the extractor, such as column/row delimiters, and encoding, need to be defined in the constructor of the class.
The IExtractor interface should also contain a definition for the IEnumerable<IRow> override as follows:

[SqlUserDefinedExtractor]
public class SampleExtractor : IExtractor
{
public SampleExtractor(string row_delimiter, char col_delimiter)
{ … }

public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)


{ … }
}

The SqlUserDefinedExtractor attribute indicates that the type should be registered as a user-defined extractor.
This class cannot be inherited.
SqlUserDefinedExtractor is an optional attribute for a UDE definition. It's used to define the AtomicFileProcessing
property for the UDE object.
bool AtomicFileProcessing
true = Indicates that this extractor requires atomic input files (JSON, XML, ...)
false = Indicates that this extractor can deal with split / distributed files (CSV, SEQ, ...)
The main UDE programmability objects are input and output . The input object is used to enumerate input data
as IUnstructuredReader . The output object is used to set output data as a result of the extractor activity.
The input data is accessed through System.IO.Stream and System.IO.StreamReader .
For input columns enumeration, we first split the input stream by using a row delimiter.

foreach (Stream current in input.Split(my_row_delimiter))


{

}

Then, further split input row into column parts.

foreach (Stream current in input.Split(my_row_delimiter))


{

string[] parts = line.Split(my_column_delimiter);
foreach (string part in parts)
{ … }
}

To set output data, we use the output.Set method.


It's important to understand that the custom extractor only outputs columns and values that are set with the
output.Set method call.

output.Set<string>(count, part);

The actual extractor output is triggered by calling yield return output.AsReadOnly(); .


Following is the extractor example:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class FullDescriptionExtractor : IExtractor
{
private Encoding _encoding;
private byte[] _row_delim;
private char _col_delim;

public FullDescriptionExtractor(Encoding encoding, string row_delim = "\r\n", char col_delim = '\t')


{
this._encoding = ((encoding == null) ? Encoding.UTF8 : encoding);
this._row_delim = this._encoding.GetBytes(row_delim);
this._col_delim = col_delim;
}

public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)


{
string line;
//Read the input line by line
foreach (Stream current in input.Split(_encoding.GetBytes("\r\n")))
{
using (System.IO.StreamReader streamReader = new StreamReader(current, this._encoding))
{
line = streamReader.ReadToEnd().Trim();
//Split the input by the column delimiter
string[] parts = line.Split(this._col_delim);
int count = 0; // start with first column
foreach (string part in parts)
{
if (count == 0)
{ // for column “guid”, re-generated guid
Guid new_guid = Guid.NewGuid();
output.Set<Guid>(count, new_guid);
}
else if (count == 2)
{
// for column “user”, convert to UPPER case
output.Set<string>(count, part.ToUpper());

}
else
{
// keep the rest of the columns as-is
output.Set<string>(count, part);
}
count += 1;
}

}
yield return output.AsReadOnly();
}
yield break;
}
}

In this use-case scenario, the extractor regenerates the GUID for “guid” column and converts the values of “user”
column to upper case. Custom extractors can produce more complicated results by parsing input data and
manipulating it.
Following is base U-SQL script that uses a custom extractor:
DECLARE @input_file string = @"\usql-programmability\input_file.tsv";
DECLARE @output_file string = @"\usql-programmability\output_file.tsv";

@rs0 =
EXTRACT
guid Guid,
dt String,
user String,
des String
FROM @input_file
USING new USQL_Programmability.FullDescriptionExtractor(Encoding.UTF8);

OUTPUT @rs0 TO @output_file USING Outputters.Text();

Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined outputter

U-SQL UDO: user-defined outputter


User-defined outputter is another U-SQL UDO that allows you to extend built-in U-SQL functionality. Similar to
the extractor, there are several built-in outputters.
Outputters.Text(): Writes data to delimited text files of different encodings.
Outputters.Csv(): Writes data to comma-separated value (CSV) files of different encodings.
Outputters.Tsv(): Writes data to tab-separated value (TSV) files of different encodings.
A custom outputter allows you to write data in a custom-defined format. This can be useful for the following tasks:
Writing data to semi-structured or unstructured files.
Writing data in encodings that aren't supported.
Modifying output data or adding custom attributes.

How to define and use user-defined outputter


To define user-defined outputter, we need to create the IOutputter interface.
Following is the base IOutputter class implementation:

public abstract class IOutputter : IUserDefinedOperator


{
protected IOutputter();

public virtual void Close();


public abstract void Output(IRow input, IUnstructuredWriter output);
}

All input parameters to the outputter, such as column/row delimiters, encoding, and so on, need to be defined in
the constructor of the class. The IOutputter interface should also contain a definition for void Output override.
The attribute [SqlUserDefinedOutputter(AtomicFileProcessing = true)] can optionally be set for atomic file
processing. For more information, see the following details.
[SqlUserDefinedOutputter(AtomicFileProcessing = true)]
public class MyOutputter : IOutputter
{

public MyOutputter(myparam1, myparam2)


{

}

public override void Close()


{

}

public override void Output(IRow row, IUnstructuredWriter output)


{

}
}

Output is called for each input row. It writes the row to the IUnstructuredWriter output stream.
The constructor of the class is used to pass parameters to the user-defined outputter.
Close can optionally be overridden to release expensive state or to determine when the last row was written.

SqlUserDefinedOutputter attribute indicates that the type should be registered as a user-defined outputter.
This class cannot be inherited.
SqlUserDefinedOutputter is an optional attribute for a user-defined outputter definition. It's used to define the
AtomicFileProcessing property.
bool AtomicFileProcessing
true = Indicates that this outputter requires atomic output files (JSON, XML, ...)
false = Indicates that this outputter can deal with split / distributed files (CSV, SEQ, ...)
The main programmability objects are row and output . The row object is used to enumerate output data as
IRow interface. Output is used to set output data to the target file.

The output data is accessed through the IRow interface. Output data is passed a row at a time.
The individual values are enumerated by calling the Get method of the IRow interface:

row.Get<string>("column_name")

Individual column names can be determined by calling row.Schema :

ISchema schema = row.Schema;


var col = schema[i];
string val = row.Get<string>(col.Name);

This approach enables you to build a flexible outputter for any metadata schema.
The output data is written to file by using System.IO.StreamWriter . The stream parameter is set to
output.BaseStream as part of IUnstructuredWriter output .

Note that it's important to flush the data buffer to the file after each row iteration. In addition, the StreamWriter
object must be used with the Disposable attribute enabled (default) and with the using keyword:
using (StreamWriter streamWriter = new StreamWriter(output.BaseStream, this._encoding))
{

}

Otherwise, call Flush() method explicitly after each iteration. We show this in the following example.
Set headers and footers for user-defined outputter
To set a header, write it only on the first iteration of the execution flow.

public override void Output(IRow row, IUnstructuredWriter output)


{

if (isHeaderRow)
{

}


if (isHeaderRow)
{
isHeaderRow = false;
}

}

The code in the first if (isHeaderRow) block is executed only once.


For the footer, use the reference to the instance of System.IO.Stream object ( output.BaseStream ). Write the
footer in the Close() method of the IOutputter interface. (For more information, see the following example.)
Following is an example of a user-defined outputter:

[SqlUserDefinedOutputter(AtomicFileProcessing = true)]
public class HTMLOutputter : IOutputter
{
// Local variables initialization
private string row_delimiter;
private char col_delimiter;
private bool isHeaderRow;
private Encoding encoding;
private bool IsTableHeader = true;
private Stream g_writer;

// Parameters definition
public HTMLOutputter(bool isHeader = false, Encoding encoding = null)
{
this.isHeaderRow = isHeader;
this.encoding = ((encoding == null) ? Encoding.UTF8 : encoding);
}

// The Close method is used to write the footer to the file. It's executed only once, after all rows
public override void Close()
{
//Reference to IO.Stream object - g_writer
StreamWriter streamWriter = new StreamWriter(g_writer, this.encoding);
streamWriter.Write("</table>");
streamWriter.Flush();
streamWriter.Close();
}

public override void Output(IRow row, IUnstructuredWriter output)


{
System.IO.StreamWriter streamWriter = new StreamWriter(output.BaseStream, this.encoding);

// Metadata schema initialization to enumerate column names


ISchema schema = row.Schema;

// This is a data-independent header--HTML table definition


if (IsTableHeader)
{
streamWriter.Write("<table border=1>");
IsTableHeader = false;
}

// HTML table attributes


string header_wrapper_on = "<th>";
string header_wrapper_off = "</th>";
string data_wrapper_on = "<td>";
string data_wrapper_off = "</td>";

// Header row output--runs only once


if (isHeaderRow)
{
streamWriter.Write("<tr>");
for (int i = 0; i < schema.Count(); i++)
{
var col = schema[i];
streamWriter.Write(header_wrapper_on + col.Name + header_wrapper_off);
}
streamWriter.Write("</tr>");
}

// Data row output


streamWriter.Write("<tr>");
for (int i = 0; i < schema.Count(); i++)
{
var col = schema[i];
string val = "";
try
{
// Data type enumeration--required to match the distinct list of types from OUTPUT statement
switch (col.Type.Name.ToString().ToLower())
{
case "string": val = row.Get<string>(col.Name).ToString(); break;
case "guid": val = row.Get<Guid>(col.Name).ToString(); break;
default: break;
}
}
// Handling NULL values--keeping them empty
catch (System.NullReferenceException)
{
}
streamWriter.Write(data_wrapper_on + val + data_wrapper_off);
}
streamWriter.Write("</tr>");

if (isHeaderRow)
{
isHeaderRow = false;
}
// Reference to the instance of the IO.Stream object for footer generation
g_writer = output.BaseStream;
streamWriter.Flush();
}
}

// Define the factory classes


public static class Factory
{
public static HTMLOutputter HTMLOutputter(bool isHeader = false, Encoding encoding = null)
{
return new HTMLOutputter(isHeader, encoding);
}
}

And U-SQL base script:

DECLARE @input_file string = @"\usql-programmability\input_file.tsv";


DECLARE @output_file string = @"\usql-programmability\output_file.html";

@rs0 =
EXTRACT
guid Guid,
dt String,
user String,
des String
FROM @input_file
USING new USQL_Programmability.FullDescriptionExtractor(Encoding.UTF8);

OUTPUT @rs0
TO @output_file
USING new USQL_Programmability.HTMLOutputter(isHeader: true);

This is an HTML outputter, which creates an HTML file with table data.
Call outputter from U-SQL base script
To call a custom outputter from the base U-SQL script, a new instance of the outputter object has to be
created.

OUTPUT @rs0 TO @output_file USING new USQL_Programmability.HTMLOutputter(isHeader: true);

To avoid creating an instance of the object in base script, we can create a function wrapper, as shown in our
earlier example:

// Define the factory classes


public static class Factory
{
public static HTMLOutputter HTMLOutputter(bool isHeader = false, Encoding encoding = null)
{
return new HTMLOutputter(isHeader, encoding);
}
}

In this case, the original call looks like the following:

OUTPUT @rs0
TO @output_file
USING USQL_Programmability.Factory.HTMLOutputter(isHeader: true);

Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined processor

U-SQL UDO: user-defined processor


A user-defined processor, or UDP, is a type of U-SQL UDO that enables you to process incoming rows by
applying programmability features. A UDP enables you to combine columns, modify values, and add new columns
if necessary. In short, it helps you process a rowset to produce the required data elements.

How to define and use user-defined processor


To define a UDP, we need to create a class that implements the IProcessor interface and, optionally, mark it with
the SqlUserDefinedProcessor attribute.
This class should contain the definition for the IRow rowset override, as shown in the following
example:

[SqlUserDefinedProcessor]
public class MyProcessor: IProcessor
{
public override IRow Process(IRow input, IUpdatableRow output)
{

}
}

SqlUserDefinedProcessor indicates that the type should be registered as a user-defined processor. This class
cannot be inherited.
The SqlUserDefinedProcessor attribute is optional for UDP definition.
The main programmability objects are input and output . The input object is used to enumerate input columns;
the output object is used to set output data as a result of the processor activity.
To enumerate input columns, we use the input.Get method.

string column_name = input.Get<string>("column_name");

The parameter for the input.Get method is a column that's passed as part of the PRODUCE clause of the PROCESS
statement of the base U-SQL script. We need to use the correct data type here.
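For example, using the column names and types from the sample later in this article:

Guid guid = input.Get<Guid>("guid");
string user = input.Get<string>("user");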
For output, use the output.Set method.
It's important to note that a custom processor only outputs columns and values that are defined with the
output.Set method call.

output.Set<string>("mycolumn", mycolumn);

The actual processor output is triggered by calling return output.AsReadOnly(); .


Following is a processor example:
[SqlUserDefinedProcessor]
public class FullDescriptionProcessor : IProcessor
{
public override IRow Process(IRow input, IUpdatableRow output)
{
string user = input.Get<string>("user");
string des = input.Get<string>("des");
string full_description = user.ToUpper() + "=>" + des;
output.Set<string>("dt", input.Get<string>("dt"));
output.Set<string>("full_description", full_description);
output.Set<Guid>("new_guid", Guid.NewGuid());
output.Set<Guid>("guid", input.Get<Guid>("guid"));
return output.AsReadOnly();
}
}

In this use-case scenario, the processor generates a new column called "full_description" by combining the
existing columns--in this case, "user" in uppercase and "des". It also generates a new GUID and returns both the
original and new GUID values.
As you can see from the previous example, you can call C# methods during the output.Set method call.
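For instance, a sketch along these lines is possible; the output column names here are hypothetical:

// Sketch only: any C# expression can produce the value passed to output.Set.
output.Set<string>("user_upper", input.Get<string>("user").ToUpper());
output.Set<string>("today", DateTime.Now.ToString("M/d/yyyy"));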
Following is an example of base U-SQL script that uses a custom processor:

DECLARE @input_file string = @"\usql-programmability\input_file.tsv";


DECLARE @output_file string = @"\usql-programmability\output_file.tsv";

@rs0 =
EXTRACT
guid Guid,
dt String,
user String,
des String
FROM @input_file USING Extractors.Tsv();

@rs1 =
PROCESS @rs0
PRODUCE dt String,
full_description String,
guid Guid,
new_guid Guid
USING new USQL_Programmability.FullDescriptionProcessor();

OUTPUT @rs1 TO @output_file USING Outputters.Text();

Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined applier

U-SQL UDO: user-defined applier


A U-SQL user-defined applier enables you to invoke a custom C# function for each row that's returned by the
outer table expression of a query. The right input is evaluated for each row from the left input, and the rows that
are produced are combined for the final output. The list of columns that are produced by the APPLY operator are
the combination of the set of columns in the left and the right input.

How to define and use user-defined applier


A user-defined applier is invoked as part of the U-SQL SELECT expression.
The typical call to the user-defined applier looks like the following:

SELECT …
FROM …
CROSS APPLY
new MyScript.MyApplier(param1, param2) AS alias(output_param1 string, …);

For more information about using appliers in a SELECT expression, see U-SQL SELECT Selecting from CROSS
APPLY and OUTER APPLY.
The user-defined applier base class definition is as follows:

public abstract class IApplier : IUserDefinedOperator
{
    protected IApplier();

    public abstract IEnumerable<IRow> Apply(IRow input, IUpdatableRow output);
}

To define a user-defined applier, we need to create a class that inherits from the IApplier base class and,
optionally, mark it with the [SqlUserDefinedApplier] attribute.

[SqlUserDefinedApplier]
public class ParserApplier : IApplier
{
public ParserApplier()
{

}

public override IEnumerable<IRow> Apply(IRow input, IUpdatableRow output)
{
    // Applier logic goes here.
}
}

Apply is called for each row of the outer table and returns the IUpdatableRow output rowset.
The constructor is used to pass parameters to the user-defined applier.
SqlUserDefinedApplier indicates that the type should be registered as a user-defined applier. This class
cannot be inherited.
SqlUserDefinedApplier is optional for a user-defined applier definition.
The main programmability objects are as follows:

public override IEnumerable<IRow> Apply(IRow input, IUpdatableRow output)

The input row is passed as IRow input. The output rows are generated through the IUpdatableRow output interface.
Individual column names can be determined by reading the IRow Schema property.

ISchema schema = row.Schema;


var col = schema[i];
string val = row.Get<string>(col.Name)

To get the actual data values from the incoming IRow , we use the Get() method of IRow interface.

mycolumn = row.Get<int>("mycolumn")

Or we use the schema column name:

row.Get<int>(row.Schema[0].Name)

The output values must be set with IUpdatableRow output:

output.Set<int>("mycolumn", mycolumn)

It is important to understand that custom appliers only output columns and values that are defined with
output.Set method call.

The actual output is triggered by calling yield return output.AsReadOnly(); .


The user-defined applier parameters can be passed to the constructor. An applier can return a variable number of
columns, which need to be defined during the applier call in the base U-SQL script.

new USQL_Programmability.ParserApplier ("all") AS properties(make string, model string, year string, type
string, millage int);

Here is the user-defined applier example:


[SqlUserDefinedApplier]
public class ParserApplier : IApplier
{
private string parsingPart;

public ParserApplier(string parsingPart)
{
if (parsingPart.ToUpper().Contains("ALL")
|| parsingPart.ToUpper().Contains("MAKE")
|| parsingPart.ToUpper().Contains("MODEL")
|| parsingPart.ToUpper().Contains("YEAR")
|| parsingPart.ToUpper().Contains("TYPE")
|| parsingPart.ToUpper().Contains("MILLAGE")
)
{
this.parsingPart = parsingPart;
}
else
{
throw new ArgumentException("Incorrect parameter. Please use: 'ALL|MAKE|MODEL|YEAR|TYPE|MILLAGE'");
}
}

public override IEnumerable<IRow> Apply(IRow input, IUpdatableRow output)
{

string[] properties = input.Get<string>("properties").Split(',');

// only process with correct number of properties


if (properties.Count() == 5)
{

string make = properties[0];


string model = properties[1];
string year = properties[2];
string type = properties[3];
int millage = -1;

// Only return millage if it is number, otherwise, -1


if (!int.TryParse(properties[4], out millage))
{
millage = -1;
}

if (parsingPart.ToUpper().Contains("MAKE") || parsingPart.ToUpper().Contains("ALL")) output.Set<string>


("make", make);
if (parsingPart.ToUpper().Contains("MODEL") || parsingPart.ToUpper().Contains("ALL")) output.Set<string>
("model", model);
if (parsingPart.ToUpper().Contains("YEAR") || parsingPart.ToUpper().Contains("ALL")) output.Set<string>
("year", year);
if (parsingPart.ToUpper().Contains("TYPE") || parsingPart.ToUpper().Contains("ALL")) output.Set<string>
("type", type);
if (parsingPart.ToUpper().Contains("MILLAGE") || parsingPart.ToUpper().Contains("ALL")) output.Set<int>
("millage", millage);
}
yield return output.AsReadOnly();
}
}

Following is the base U-SQL script for this user-defined applier:


DECLARE @input_file string = @"c:\usql-programmability\car_fleet.tsv";
DECLARE @output_file string = @"c:\usql-programmability\output_file.tsv";

@rs0 =
EXTRACT
stocknumber int,
vin String,
properties String
FROM @input_file USING Extractors.Tsv();

@rs1 =
SELECT
r.stocknumber,
r.vin,
properties.make,
properties.model,
properties.year,
properties.type,
properties.millage
FROM @rs0 AS r
CROSS APPLY
new USQL_Programmability.ParserApplier ("all") AS properties(make string, model string, year string,
type string, millage int);

OUTPUT @rs1 TO @output_file USING Outputters.Text();

In this use-case scenario, the user-defined applier acts as a comma-delimited value parser for the car fleet
properties. The input file rows look like the following:

103 Z1AB2CD123XY45889 Ford,Explorer,2005,SUV,152345


303 Y0AB2CD34XY458890 Chevrolet,Cruise,2010,4Dr,32455
210 X5AB2CD45XY458893 Nissan,Altima,2011,4Dr,74000

It is a typical tab-delimited TSV file with a properties column that contains car properties such as make and
model. Those properties must be parsed to the table columns. The applier that's provided also enables you to
generate a dynamic number of properties in the result rowset, based on the parameter that's passed. You can
generate either all properties or a specific set of properties only.

...USQL_Programmability.ParserApplier ("all")
...USQL_Programmability.ParserApplier ("make")
...USQL_Programmability.ParserApplier ("make&model")

The user-defined applier can be called as a new instance of applier object:

CROSS APPLY new MyNameSpace.MyApplier (parameter: "value") AS alias([columns types]…);

Or with the invocation of a wrapper factory method:

CROSS APPLY MyNameSpace.MyApplier (parameter: "value") AS alias([columns types]…);

Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined combiner

U-SQL UDO: user-defined combiner


A user-defined combiner, or UDC, enables you to combine rows from left and right rowsets based on custom
logic. A user-defined combiner is used with the COMBINE expression.

How to define and use user-defined combiner


A combiner is invoked with the COMBINE expression, which provides the necessary information about both
input rowsets, the grouping columns, the expected result schema, and additional information.
To call a combiner in a base U-SQL script, we use the following syntax:

Combine_Expression :=
'COMBINE' Combine_Input
'WITH' Combine_Input
Join_On_Clause
Produce_Clause
[Readonly_Clause]
[Required_Clause]
USING_Clause.

For more information, see COMBINE Expression (U-SQL).


To define a user-defined combiner, we need to create a class that inherits from the ICombiner base class and,
optionally, mark it with the [SqlUserDefinedCombiner] attribute.
Base ICombiner class definition:

public abstract class ICombiner : IUserDefinedOperator
{
    protected ICombiner();

    public virtual IEnumerable<IRow> Combine(List<IRowset> inputs, IUpdatableRow output);

    public abstract IEnumerable<IRow> Combine(IRowset left, IRowset right, IUpdatableRow output);
}

The custom implementation of an ICombiner interface should contain the definition for an IEnumerable<IRow>
Combine override.

[SqlUserDefinedCombiner]
public class MyCombiner : ICombiner
{

public override IEnumerable<IRow> Combine(IRowset left, IRowset right, IUpdatableRow output)
{
    // Combiner logic goes here.
}
}
The SqlUserDefinedCombiner attribute indicates that the type should be registered as a user-defined
combiner. This class cannot be inherited.
SqlUserDefinedCombiner is used to define the Combiner mode property. It is an optional attribute for a user-
defined combiner definition.
CombinerMode Mode
The CombinerMode enum can take the following values:
Full (0): Every output row potentially depends on all the input rows from left and right with the same key value.
Left (1): Every output row depends on a single input row from the left (and potentially all rows from the right with the same key value).
Right (2): Every output row depends on a single input row from the right (and potentially all rows from the left with the same key value).
Inner (3): Every output row depends on a single input row from the left and the right with the same key value.
Example: [SqlUserDefinedCombiner(Mode=CombinerMode.Left)]
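Applied to a combiner class, the attribute looks like the following sketch (the class name is illustrative and the body is a placeholder):

[SqlUserDefinedCombiner(Mode = CombinerMode.Left)]
public class LeftCombiner : ICombiner
{
    public override IEnumerable<IRow> Combine(IRowset left, IRowset right, IUpdatableRow output)
    {
        // Combiner logic goes here; yield one output row per produced result.
        yield break;
    }
}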
The main programmability objects are:

public override IEnumerable<IRow> Combine(IRowset left, IRowset right, IUpdatableRow output)

Input rowsets are passed as the left and right IRowset interfaces. Both rowsets must be enumerated for
processing. You can only enumerate each interface once, so enumerate and cache it if necessary.
For caching purposes, we can create a List<T> in-memory structure as a result of a LINQ query execution,
specifically List<IRow>. An anonymous data type can be used during enumeration as well.
See Introduction to LINQ Queries (C#) for more information about LINQ queries, and IEnumerable<T> Interface
for more information about IEnumerable<T> interface.
To get the actual data values from the incoming IRowset , we use the Get() method of IRow interface.

mycolumn = row.Get<int>("mycolumn")

Individual column names can be determined by calling the IRow Schema method.

ISchema schema = row.Schema;


var col = schema[i];
string val = row.Get<string>(col.Name)

Or by using the schema column name:

row.Get<int>(row.Schema[0].Name)

The general enumeration with LINQ looks like the following:


var myRowset =
(from row in left.Rows
select new
{
Mycolumn = row.Get<int>("mycolumn"),
}).ToList();

After enumerating both rowsets, we are going to loop through all rows. For each row in the left rowset, we are
going to find all rows that satisfy the condition of our combiner.
The output values must be set with IUpdatableRow output.

output.Set<int>("mycolumn", mycolumn)

The actual output is triggered by calling to yield return output.AsReadOnly(); .


Following is a combiner example:
[SqlUserDefinedCombiner]
public class CombineSales : ICombiner
{

public override IEnumerable<IRow> Combine(IRowset left, IRowset right, IUpdatableRow output)
{
var internetSales =
(from row in left.Rows
select new
{
ProductKey = row.Get<int>("ProductKey"),
OrderDateKey = row.Get<int>("OrderDateKey"),
SalesAmount = row.Get<decimal>("SalesAmount"),
TaxAmt = row.Get<decimal>("TaxAmt")
}).ToList();

var resellerSales =
(from row in right.Rows
select new
{
ProductKey = row.Get<int>("ProductKey"),
OrderDateKey = row.Get<int>("OrderDateKey"),
SalesAmount = row.Get<decimal>("SalesAmount"),
TaxAmt = row.Get<decimal>("TaxAmt")
}).ToList();

foreach (var row_i in internetSales)
{
foreach (var row_r in resellerSales)
{

if (
row_i.OrderDateKey > 0
&& row_i.OrderDateKey < row_r.OrderDateKey
&& row_i.OrderDateKey == 20010701
&& (row_r.SalesAmount + row_r.TaxAmt) > 20000)
{
output.Set<int>("OrderDateKey", row_i.OrderDateKey);
output.Set<int>("ProductKey", row_i.ProductKey);
output.Set<decimal>("Internet_Sales_Amount", row_i.SalesAmount + row_i.TaxAmt);
output.Set<decimal>("Reseller_Sales_Amount", row_r.SalesAmount + row_r.TaxAmt);
}

}
}
yield return output.AsReadOnly();
}
}

In this use-case scenario, we are building an analytics report for the retailer. The goal is to find all products that
cost more than $20,000 and that sell through the website faster than through the regular retailer within a
certain time frame.
Here is the base U-SQL script. You can compare the logic between a regular JOIN and a combiner:

DECLARE @LocalURI string = @"\usql-programmability\";

DECLARE @input_file_internet_sales string = @LocalURI+"FactInternetSales.txt";


DECLARE @input_file_reseller_sales string = @LocalURI+"FactResellerSales.txt";
DECLARE @output_file1 string = @LocalURI+"output_file1.tsv";
DECLARE @output_file2 string = @LocalURI+"output_file2.tsv";

@fact_internet_sales =
EXTRACT
ProductKey int ,
OrderDateKey int ,
DueDateKey int ,
ShipDateKey int ,
CustomerKey int ,
PromotionKey int ,
CurrencyKey int ,
SalesTerritoryKey int ,
SalesOrderNumber String ,
SalesOrderLineNumber int ,
RevisionNumber int ,
OrderQuantity int ,
UnitPrice decimal ,
ExtendedAmount decimal,
UnitPriceDiscountPct float ,
DiscountAmount float ,
ProductStandardCost decimal ,
TotalProductCost decimal ,
SalesAmount decimal ,
TaxAmt decimal ,
Freight decimal ,
CarrierTrackingNumber String,
CustomerPONumber String
FROM @input_file_internet_sales
USING Extractors.Text(delimiter:'|', encoding: Encoding.Unicode);

@fact_reseller_sales =
EXTRACT
ProductKey int ,
OrderDateKey int ,
DueDateKey int ,
ShipDateKey int ,
ResellerKey int ,
EmployeeKey int ,
PromotionKey int ,
CurrencyKey int ,
SalesTerritoryKey int ,
SalesOrderNumber String ,
SalesOrderLineNumber int ,
RevisionNumber int ,
OrderQuantity int ,
UnitPrice decimal ,
ExtendedAmount decimal,
UnitPriceDiscountPct float ,
DiscountAmount float ,
ProductStandardCost decimal ,
TotalProductCost decimal ,
SalesAmount decimal ,
TaxAmt decimal ,
Freight decimal ,
CarrierTrackingNumber String,
CustomerPONumber String
FROM @input_file_reseller_sales
USING Extractors.Text(delimiter:'|', encoding: Encoding.Unicode);

@rs1 =
SELECT
fis.OrderDateKey,
fis.ProductKey,
fis.SalesAmount+fis.TaxAmt AS Internet_Sales_Amount,
frs.SalesAmount+frs.TaxAmt AS Reseller_Sales_Amount
FROM @fact_internet_sales AS fis
INNER JOIN @fact_reseller_sales AS frs
ON fis.ProductKey == frs.ProductKey
WHERE
fis.OrderDateKey < frs.OrderDateKey
AND fis.OrderDateKey == 20010701
AND frs.SalesAmount+frs.TaxAmt > 20000;

@rs2 =
COMBINE @fact_internet_sales AS fis
WITH @fact_reseller_sales AS frs
ON fis.ProductKey == frs.ProductKey
PRODUCE OrderDateKey int,
ProductKey int,
Internet_Sales_Amount decimal,
Reseller_Sales_Amount decimal
USING new USQL_Programmability.CombineSales();

OUTPUT @rs1 TO @output_file1 USING Outputters.Tsv();


OUTPUT @rs2 TO @output_file2 USING Outputters.Tsv();

A user-defined combiner can be called as a new instance of the combiner object:

USING new MyNameSpace.MyCombiner();

Or with the invocation of a wrapper factory method:

USING MyNameSpace.MyCombiner();

Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined reducer

U-SQL UDO: user-defined reducer


U-SQL enables you to write custom rowset reducers in C# by using the user-defined operator extensibility
framework and implementing an IReducer interface.
A user-defined reducer, or UDR, can be used to eliminate unnecessary rows during data extraction (import). It can
also be used to manipulate and evaluate rows and columns. Based on programmability logic, it can also define
which rows need to be extracted.

How to define and use user-defined reducer


To define a UDR class, we need to create a class that inherits from the IReducer base class, with an optional
SqlUserDefinedReducer attribute.
This class should contain a definition for the IEnumerable<IRow> rowset override.

[SqlUserDefinedReducer]
public class EmptyUserReducer : IReducer
{
    public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
    {
        // Reducer logic goes here.
    }
}

The SqlUserDefinedReducer attribute indicates that the type should be registered as a user-defined reducer.
This class cannot be inherited. SqlUserDefinedReducer is an optional attribute for a user-defined reducer
definition. It's used to define the IsRecursive property.
bool IsRecursive
true = Indicates that this reducer is associative and commutative
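For example, marking a reducer as recursive looks like the following sketch (the class name is illustrative and the body is a placeholder):

[SqlUserDefinedReducer(IsRecursive = true)]
public class MyRecursiveReducer : IReducer
{
    public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
    {
        // Reducer logic goes here.
        yield break;
    }
}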
The main programmability objects are input and output . The input object is used to enumerate input rows.
The output object is used to set output rows as a result of the reducing activity.
To enumerate input rows, iterate over input.Rows and call the Get method on each row.

foreach (IRow row in input.Rows)
{
row.Get<string>("mycolumn");
}

The parameter for the Row.Get method is a column that's passed as part of the PRODUCE clause of the REDUCE
statement of the base U-SQL script. We need to use the correct data type here as well.
For output, use the output.Set method.
It's important to understand that a custom reducer only outputs values that are defined with the output.Set
method call.

output.Set<string>("mycolumn", guid);

The actual reducer output is triggered by calling yield return output.AsReadOnly(); .


Following is a reducer example:

[SqlUserDefinedReducer]
public class EmptyUserReducer : IReducer
{

public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
{
string guid;
DateTime dt;
string user;
string des;

foreach (IRow row in input.Rows)
{
guid = row.Get<string>("guid");
dt = row.Get<DateTime>("dt");
user = row.Get<string>("user");
des = row.Get<string>("des");

if (user.Length > 0)
{
output.Set<string>("guid", guid);
output.Set<DateTime>("dt", dt);
output.Set<string>("user", user);
output.Set<string>("des", des);

yield return output.AsReadOnly();


}
}
}
}

In this use-case scenario, the reducer skips rows with an empty user name. For each row in the rowset, it reads
each required column and then evaluates the length of the user name. It outputs the row only if the user name
length is greater than 0.
Following is a base U-SQL script that uses a custom reducer:
DECLARE @input_file string = @"\usql-programmability\input_file_reducer.tsv";
DECLARE @output_file string = @"\usql-programmability\output_file.tsv";

@rs0 =
EXTRACT
guid string,
dt DateTime,
user String,
des String
FROM @input_file
USING Extractors.Tsv();

@rs1 =
REDUCE @rs0 PRESORT guid
ON guid
PRODUCE guid string, dt DateTime, user String, des String
USING new USQL_Programmability.EmptyUserReducer();

@rs2 =
SELECT guid AS start_id,
dt AS start_time,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).ToString() AS
start_fiscalperiod,
user,
des
FROM @rs1;

OUTPUT @rs2
TO @output_file
USING Outputters.Text();

Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Schedule U-SQL jobs using SQL Server Integration
Services (SSIS)

In this document, you learn how to orchestrate and create U-SQL jobs by using SQL Server Integration Services
(SSIS).

Prerequisites
Azure Feature Pack for Integration Services provides the Azure Data Lake Analytics task and the Azure Data Lake
Analytics Connection Manager, which help you connect to the Azure Data Lake Analytics service. To use this task, make
sure you install:
Download and install SQL Server Data Tools (SSDT) for Visual Studio
Install Azure Feature Pack for Integration Services (SSIS)

Azure Data Lake Analytics task


The Azure Data Lake Analytics task lets users submit U-SQL jobs to the Azure Data Lake Analytics account.
Learn how to configure Azure Data Lake Analytics task.

You can get the U-SQL script from different places by using SSIS built-in functions and tasks. The following
scenarios show how you can configure the U-SQL scripts for different use cases.

Scenario 1-Use inline scripts to call TVFs and stored procedures


In Azure Data Lake Analytics Task Editor, configure SourceType as DirectInput , and put the U-SQL statements
into USQLStatement .
For easy maintenance and code management, only put short U-SQL scripts inline. For example, you can
call existing table-valued functions and stored procedures in your U-SQL databases.

Related article: How to pass parameter to stored procedures

Scenario 2-Use U-SQL files in Azure Data Lake Store


You can also use U-SQL files in Azure Data Lake Store by using the Azure Data Lake Store File System Task
in the Azure Feature Pack. This approach enables you to use scripts stored in the cloud.
Follow the steps below to set up the connection between the Azure Data Lake Store File System Task and the Azure Data
Lake Analytics Task.
Set task control flow
In SSIS package design view, add an Azure Data Lake Store File System Task , a Foreach Loop Container
and an Azure Data Lake Analytics Task in the Foreach Loop Container. The Azure Data Lake Store File System
Task helps to download U-SQL files in your ADLS account to a temporary folder. The Foreach Loop Container
and the Azure Data Lake Analytics Task help to submit every U-SQL file under the temporary folder to the Azure
Data Lake Analytics account as a U-SQL job.
Configure Azure Data Lake Store File System Task
1. Set Operation to CopyFromADLS .
2. Set up AzureDataLakeConnection , learn more about Azure Data Lake Store Connection Manager.
3. Set AzureDataLakeDirectory . Point to the folder storing your U-SQL scripts. Use a path that's
relative to the Azure Data Lake Store account root folder.
4. Set Destination to a folder that caches the downloaded U-SQL scripts. This folder path will be used in
Foreach Loop Container for U-SQL job submission.

Learn more about Azure Data Lake Store File System Task.
Configure Foreach Loop Container
1. In Collection page, set Enumerator to Foreach File Enumerator .
2. Set Folder under Enumerator configuration group to the temporary folder that includes the
downloaded U-SQL scripts.
3. Set Files under Enumerator configuration to *.usql so that the loop container only catches the files
ending with .usql .

4. On the Variable Mappings page, add a user-defined variable to get the file name for each U-SQL file. Set the
Index to 0 to get the file name. In this example, define a variable called User::FileName . This variable is
used to dynamically get the U-SQL script file connection and to set the U-SQL job name in the Azure Data Lake
Analytics Task.
Configure Azure Data Lake Analytics Task
1. Set SourceType to FileConnection .
2. Set FileConnection to the file connection that points to the file objects returned from Foreach Loop
Container.
To create this file connection:
a. Choose <New Connection...> in FileConnection setting.
b. Set Usage type to Existing file , and set the File to any existing file's file path.

c. In the Connection Managers view, right-click the file connection created just now, and choose
Properties .
d. In the Properties window, expand Expressions , and set ConnectionString to the variable
defined in Foreach Loop Container, for example, @[User::FileName] .
3. Set AzureDataLakeAnalyticsConnection to the Azure Data Lake Analytics account that you want to
submit jobs to. Learn more about Azure Data Lake Analytics Connection Manager.
4. Set other job configurations. Learn More.
5. Use Expressions to dynamically set U-SQL job name:
a. In Expressions page, add a new expression key-value pair for JobName .
b. Set the value for JobName to the variable defined in Foreach Loop Container, for example,
@[User::FileName] .
Scenario 3-Use U-SQL files in Azure Blob Storage
You can use U-SQL files in Azure Blob Storage by using Azure Blob Download Task in Azure Feature Pack.
This approach enables you to use scripts stored in the cloud.
The steps are similar to Scenario 2: Use U-SQL files in Azure Data Lake Store. Change the Azure Data Lake
Store File System Task to the Azure Blob Download Task. Learn more about Azure Blob Download Task.
The control flow looks like the following.

Scenario 4-Use U-SQL files on the local machine


Besides using U-SQL files stored in the cloud, you can also use files on your local machine or files deployed with
your SSIS packages.
1. Right-click Connection Managers in SSIS project and choose New Connection Manager .
2. Select File type and click Add....
3. Set Usage type to Existing file , and set the File to the file on the local machine.

4. Add Azure Data Lake Analytics Task and:


a. Set SourceType to FileConnection .
b. Set FileConnection to the File Connection created just now.
5. Finish other configurations for Azure Data Lake Analytics Task.

Scenario 5-Use U-SQL statement in SSIS variable


In some cases, you may need to dynamically generate the U-SQL statements. You can use SSIS Variable with
SSIS Expression and other SSIS tasks, like Script Task, to help you generate the U-SQL statement dynamically.
1. Open Variables tool window through SSIS > Variables top-level menu.
2. Add an SSIS Variable and set the value directly or use Expression to generate the value.
3. Add Azure Data Lake Analytics Task and:
a. Set SourceType to Variable .
b. Set SourceVariable to the SSIS Variable created just now.
4. Finish other configurations for Azure Data Lake Analytics Task.

Scenario 6-Pass parameters to U-SQL script


In some cases, you may want to dynamically set the U-SQL variable value in the U-SQL script. The Parameter
Mapping feature in the Azure Data Lake Analytics Task helps with this scenario. There are two typical use
cases:
Set the input and output file path variables dynamically based on current date and time.
Set the parameter for stored procedures.
Learn more about how to set parameters for the U-SQL script.

Next steps
Run SSIS packages in Azure
Azure Feature Pack for Integration Services (SSIS)
Schedule U-SQL jobs using Azure Data Factory
How to set up a CI/CD pipeline for Azure Data
Lake Analytics

In this article, you learn how to set up a continuous integration and deployment (CI/CD) pipeline for U-SQL jobs
and U-SQL databases.

NOTE
This article uses the Azure Az PowerShell module, which is the recommended PowerShell module for interacting with
Azure. To get started with the Az PowerShell module, see Install Azure PowerShell. To learn how to migrate to the Az
PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Use CI/CD for U-SQL jobs


Azure Data Lake Tools for Visual Studio provides the U-SQL project type that helps you organize U-SQL scripts.
Using the U-SQL project to manage your U-SQL code makes further CI/CD scenarios easy.

Build a U-SQL project


A U-SQL project can be built with the Microsoft Build Engine (MSBuild) by passing corresponding parameters.
Follow the steps in this article to set up a build process for a U-SQL project.
Project migration
Before you set up a build task for a U-SQL project, make sure you have the latest version of the U-SQL project.
Open the U-SQL project file in your editor and verify that you have these import items:

<!-- check for SDK Build target in current path then in USQLSDKPath-->
<Import Project="UsqlSDKBuild.targets" Condition="Exists('UsqlSDKBuild.targets')" />
<Import Project="$(USQLSDKPath)\UsqlSDKBuild.targets" Condition="!Exists('UsqlSDKBuild.targets') And
'$(USQLSDKPath)' != '' And Exists('$(USQLSDKPath)\UsqlSDKBuild.targets')" />

If not, you have two options to migrate the project:


Option 1: Change the old import item to the preceding one.
Option 2: Open the old project in the Azure Data Lake Tools for Visual Studio. Use a version newer than
2.3.3000.0. The old project template will be upgraded automatically to the newest version. New projects
created with versions newer than 2.3.3000.0 use the new template.
Get NuGet
MSBuild doesn't provide built-in support for U-SQL projects. To get this support, you need to add a reference for
your solution to the Microsoft.Azure.DataLake.USQL.SDK NuGet package that adds the required language
service.
To add the NuGet package reference, right-click the solution in Visual Studio Solution Explorer and choose
Manage NuGet Packages . Or you can add a file called packages.config in the solution folder and put the
following contents into it:
<?xml version="1.0" encoding="utf-8"?>
<packages>
<package id="Microsoft.Azure.DataLake.USQL.SDK" version="1.3.180620" targetFramework="net452" />
</packages>

Manage U -SQL database references


U-SQL scripts in a U-SQL project might have query statements for U-SQL database objects. In that case, you
need to reference the corresponding U-SQL database project that includes the objects' definition before you
build the U-SQL project. An example is when you query a U-SQL table or reference an assembly.
Learn more about U-SQL database project.

NOTE
The DROP statement may cause an accidental deletion. To enable DROP statement, you need to explicitly specify the
MSBuild arguments. AllowDropStatement will enable non-data related DROP operation, like drop assembly and drop
table valued function. AllowDataDropStatement will enable data related DROP operation, like drop table and drop
schema. You have to enable AllowDropStatement before using AllowDataDropStatement.

Build a U -SQL project with the MSBuild command line


First migrate the project and get the NuGet package. Then call the standard MSBuild command line with the
following additional arguments to build your U-SQL project:

msbuild USQLBuild.usqlproj /p:USQLSDKPath=packages\Microsoft.Azure.DataLake.USQL.SDK.1.3.180615\build\runtime;USQLTargetType=SyntaxCheck;DataRoot=datarootfolder;/p:EnableDeployment=true

The arguments definition and values are as follows:


USQLSDKPath=<U-SQL Nuget package>\build\runtime . This parameter refers to the installation
path of the NuGet package for the U-SQL language service.
USQLTargetType=Merge or SyntaxCheck :
Merge . Merge mode compiles code-behind files. Examples are .cs , .py , and .r files. It inlines the
resulting user-defined code library into the U-SQL script. Examples are a dll binary, Python, or R
code.
SyntaxCheck . SyntaxCheck mode first merges code-behind files into the U-SQL script. Then it
compiles the U-SQL script to validate your code.
DataRoot=<DataRoot path> . DataRoot is needed only for SyntaxCheck mode. When it builds the
script with SyntaxCheck mode, MSBuild checks the references to database objects in the script. Before
building, set up a matching local environment that contains the referenced objects from the U-SQL
database in the build machine's DataRoot folder. You can also manage these database dependencies by
referencing a U-SQL database project. MSBuild only checks database object references, not files.
EnableDeployment=true or false . EnableDeployment indicates if it's allowed to deploy referenced U-
SQL databases during the build process. If you reference a U-SQL database project and consume the
database objects in your U-SQL script, set this parameter to true .
Continuous integration through Azure Pipelines
In addition to the command line, you can also use the Visual Studio Build or an MSBuild task to build U-SQL
projects in Azure Pipelines. To set up a build pipeline, make sure to add two tasks in the build pipeline: a NuGet
restore task and an MSBuild task.
1. Add a NuGet restore task to get the solution-referenced NuGet package that includes
Azure.DataLake.USQL.SDK , so that MSBuild can find the U-SQL language targets. Set Advanced >
Destination directory to $(Build.SourcesDirectory)/packages if you want to use the MSBuild
arguments sample directly in step 2.
2. Set MSBuild arguments in Visual Studio build tools or in an MSBuild task as shown in the following
example. Or you can define variables for these arguments in the Azure Pipelines build pipeline.

/p:USQLSDKPath=$(Build.SourcesDirectory)/packages/Microsoft.Azure.DataLake.USQL.SDK.1.3.180615/build/runtime /p:USQLTargetType=SyntaxCheck /p:DataRoot=$(Build.SourcesDirectory) /p:EnableDeployment=true

U -SQL project build output


After you run a build, all scripts in the U-SQL project are built and output to a zip file called
USQLProjectName.usqlpack . The folder structure in your project is kept in the zipped build output.

NOTE
The code-behind files for each U-SQL script will be merged as an inline statement to the script build output.

Test U-SQL scripts


Azure Data Lake provides test projects for U-SQL scripts and C# UDO/UDAG/UDF:
Learn how to add test cases for U-SQL scripts and extended C# code
Learn how to run test cases in Azure Pipelines.

Deploy a U-SQL job


After you verify code through the build and test process, you can submit U-SQL jobs directly from Azure
Pipelines through an Azure PowerShell task. You can also deploy the script to Azure Data Lake Store or Azure
Blob storage and run the scheduled jobs through Azure Data Factory.
Submit U -SQL jobs through Azure Pipelines
The build output of the U-SQL project is a zip file called USQLProjectName.usqlpack . The zip file includes all
U-SQL scripts in the project. You can use the Azure PowerShell task in Pipelines with the following sample
PowerShell script to submit U-SQL jobs directly from Azure Pipelines.

<#
This script can be used to submit U-SQL Jobs with given U-SQL project build output(.usqlpack file).
This will unzip the U-SQL project build output, and submit all scripts one-by-one.

Note: the code behind file for each U-SQL script will be merged into the built U-SQL script in build
output.

Example :
USQLJobSubmission.ps1 -ADLAAccountName "myadlaaccount" -ArtifactsRoot "C:\USQLProject\bin\debug\" -
DegreeOfParallelism 2
#>

param(
[Parameter(Mandatory=$true)][string]$ADLAAccountName, # ADLA account name to submit U-SQL jobs
[Parameter(Mandatory=$true)][string]$ArtifactsRoot, # Root folder of U-SQL project build output
[Parameter(Mandatory=$false)][string]$DegreeOfParallelism = 1
)

function Unzip($USQLPackfile, $UnzipOutput)
{
$USQLPackfileZip = Rename-Item -Path $USQLPackfile -NewName
$([System.IO.Path]::ChangeExtension($USQLPackfile, ".zip")) -Force -PassThru
Expand-Archive -Path $USQLPackfileZip -DestinationPath $UnzipOutput -Force
Rename-Item -Path $USQLPackfileZip -NewName $([System.IO.Path]::ChangeExtension($USQLPackfileZip,
".usqlpack")) -Force
}

## Get U-SQL scripts in U-SQL project build output(.usqlpack file)


Function GetUsqlFiles()
{

$USQLPackfiles = Get-ChildItem -Path $ArtifactsRoot -Include *.usqlpack -File -Recurse -ErrorAction SilentlyContinue

$UnzipOutput = Join-Path $ArtifactsRoot -ChildPath "UnzipUSQLScripts"


foreach ($USQLPackfile in $USQLPackfiles)
{
Unzip $USQLPackfile $UnzipOutput
}

$USQLFiles = Get-ChildItem -Path $UnzipOutput -Include *.usql -File -Recurse -ErrorAction SilentlyContinue

return $USQLFiles
}

## Submit U-SQL scripts to ADLA account one-by-one


Function SubmitAnalyticsJob()
{
$usqlFiles = GetUsqlFiles

Write-Output "$($usqlFiles.Count) jobs to be submitted..."

# Submit each usql script and wait for completion before moving ahead.
foreach ($usqlFile in $usqlFiles)
{
$scriptName = "[Release].[$([System.IO.Path]::GetFileNameWithoutExtension($usqlFile.fullname))]"

Write-Output "Submitting job for '{$usqlFile}'"

$jobToSubmit = Submit-AzDataLakeAnalyticsJob -Account $ADLAAccountName -Name $scriptName -ScriptPath $usqlFile -DegreeOfParallelism $DegreeOfParallelism

LogJobInformation $jobToSubmit

Write-Output "Waiting for job to complete. Job ID:'{$($jobToSubmit.JobId)}', Name:


'$($jobToSubmit.Name)' "
$jobResult = Wait-AzDataLakeAnalyticsJob -Account $ADLAAccountName -JobId $jobToSubmit.JobId
LogJobInformation $jobResult
}
}

Function LogJobInformation($jobInfo)
{
Write-Output "************************************************************************"
Write-Output ([string]::Format("Job Id: {0}", $(DefaultIfNull $jobInfo.JobId)))
Write-Output ([string]::Format("Job Name: {0}", $(DefaultIfNull $jobInfo.Name)))
Write-Output ([string]::Format("Job State: {0}", $(DefaultIfNull $jobInfo.State)))
Write-Output ([string]::Format("Job Started at: {0}", $(DefaultIfNull $jobInfo.StartTime)))
Write-Output ([string]::Format("Job Ended at: {0}", $(DefaultIfNull $jobInfo.EndTime)))
Write-Output ([string]::Format("Job Result: {0}", $(DefaultIfNull $jobInfo.Result)))
Write-Output "************************************************************************"
}

Function DefaultIfNull($item)
{
if ($item -ne $null)
{
return $item
}
return ""
}

Function Main()
{
Write-Output ([string]::Format("ADLA account: {0}", $ADLAAccountName))
Write-Output ([string]::Format("Root folde for usqlpack: {0}", $ArtifactsRoot))
Write-Output ([string]::Format("AU count: {0}", $DegreeOfParallelism))

Write-Output "Starting USQL script deployment..."

SubmitAnalyticsJob

Write-Output "Finished deployment..."


Write-Output "Finished deployment..."
}

Main

NOTE
The commands Submit-AzDataLakeAnalyticsJob and Wait-AzDataLakeAnalyticsJob are both Azure PowerShell
cmdlets for Azure Data Lake Analytics in the Azure Resource Manager framework. You'll need a workstation with Azure
PowerShell installed. You can refer to the command list for more commands and examples.

Deploy U -SQL jobs through Azure Data Factory


You can submit U-SQL jobs directly from Azure Pipelines. Or you can upload the built scripts to Azure Data Lake
Store or Azure Blob storage and run the scheduled jobs through Azure Data Factory.
Use the Azure PowerShell task in Azure Pipelines with the following sample PowerShell script to upload the U-
SQL scripts to an Azure Data Lake Store account:
<#
This script can be used to upload U-SQL files to ADLS with given U-SQL project build output(.usqlpack
file).
This will unzip the U-SQL project build output, and upload all scripts to ADLS one-by-one.

Example :
FileUpload.ps1 -ADLSName "myadlsaccount" -ArtifactsRoot "C:\USQLProject\bin\debug\"
#>

param(
[Parameter(Mandatory=$true)][string]$ADLSName, # ADLS account name to upload U-SQL scripts
[Parameter(Mandatory=$true)][string]$ArtifactsRoot, # Root folder of U-SQL project build output
[Parameter(Mandatory=$false)][string]$DestinationFolder = "USQLScriptSource" # Destination folder in ADLS
)

Function UploadResources()
{
Write-Host "************************************************************************"
Write-Host "Uploading files to $ADLSName"
Write-Host "***********************************************************************"

$usqlScripts = GetUsqlFiles

$files = @(get-childitem $usqlScripts -recurse)


foreach($file in $files)
{
Write-Host "Uploading file: $($file.Name)"
Import-AzDataLakeStoreItem -AccountName $ADLSName -Path $file.FullName -Destination "/$(Join-Path
$DestinationFolder $file)" -Force
}
}

function Unzip($USQLPackfile, $UnzipOutput)
{
$USQLPackfileZip = Rename-Item -Path $USQLPackfile -NewName
$([System.IO.Path]::ChangeExtension($USQLPackfile, ".zip")) -Force -PassThru
Expand-Archive -Path $USQLPackfileZip -DestinationPath $UnzipOutput -Force
Rename-Item -Path $USQLPackfileZip -NewName $([System.IO.Path]::ChangeExtension($USQLPackfileZip,
".usqlpack")) -Force
}

Function GetUsqlFiles()
{

$USQLPackfiles = Get-ChildItem -Path $ArtifactsRoot -Include *.usqlpack -File -Recurse -ErrorAction SilentlyContinue

$UnzipOutput = Join-Path $ArtifactsRoot -ChildPath "UnzipUSQLScripts"

foreach ($USQLPackfile in $USQLPackfiles)
{
Unzip $USQLPackfile $UnzipOutput
}

return Get-ChildItem -Path $UnzipOutput -Include *.usql -File -Recurse -ErrorAction SilentlyContinue
}

UploadResources

CI/CD for a U-SQL database


Azure Data Lake Tools for Visual Studio provides U-SQL database project templates that help you develop,
manage, and deploy U-SQL databases. Learn more about a U-SQL database project.
Build U-SQL database project
Get the NuGet package
MSBuild doesn't provide built-in support for U-SQL database projects. To get this ability, you need to add a
reference for your solution to the Microsoft.Azure.DataLake.USQL.SDK NuGet package that adds the required
language service.
To add the NuGet package reference, right-click the solution in Visual Studio Solution Explorer. Choose Manage
NuGet Packages . Then search for and install the NuGet package. Or you can add a file called packages.config
in the solution folder and put the following contents into it:

<?xml version="1.0" encoding="utf-8"?>


<packages>
<package id="Microsoft.Azure.DataLake.USQL.SDK" version="1.3.180615" targetFramework="net452" />
</packages>

Build U -SQL a database project with the MSBuild command line


To build your U-SQL database project, call the standard MSBuild command line and pass the U-SQL SDK NuGet
package reference as an additional argument. See the following example:

msbuild DatabaseProject.usqldbproj /p:USQLSDKPath=packages\Microsoft.Azure.DataLake.USQL.SDK.1.3.180615\build\runtime

The argument USQLSDKPath=<U-SQL Nuget package>\build\runtime refers to the install path of the NuGet package
for the U-SQL language service.
Continuous integration with Azure Pipelines
In addition to the command line, you can use Visual Studio Build or an MSBuild task to build U-SQL database
projects in Azure Pipelines. To set up a build task, make sure to add two tasks in the build pipeline: a NuGet
restore task and an MSBuild task.
1. Add a NuGet restore task to get the solution-referenced NuGet package, which includes
Azure.DataLake.USQL.SDK , so that MSBuild can find the U-SQL language targets. Set Advanced >
Destination directory to $(Build.SourcesDirectory)/packages if you want to use the MSBuild
arguments sample directly in step 2.
2. Set MSBuild arguments in Visual Studio build tools or in an MSBuild task as shown in the following
example. Or you can define variables for these arguments in the Azure Pipelines build pipeline.

/p:USQLSDKPath=$(Build.SourcesDirectory)/packages/Microsoft.Azure.DataLake.USQL.SDK.1.3.180615/build/runtime

U -SQL database project build output


The build output for a U-SQL database project is a U-SQL database deployment package, named with the suffix
.usqldbpack . The .usqldbpack package is a zip file that includes all DDL statements in a single U-SQL script in a
DDL folder. It includes all .dlls and additional files for assembly in a temp folder.

Test table-valued functions and stored procedures


Adding test cases for table-valued functions and stored procedures directly isn't currently supported. As a
workaround, you can create a U-SQL project that has U-SQL scripts that call those functions and write test cases
for them. Take the following steps to set up test cases for table-valued functions and stored procedures defined
in the U-SQL database project:
1. Create a U-SQL project for test purposes and write U-SQL scripts calling the table-valued functions and
stored procedures.
2. Add a database reference to the U-SQL project. To get the table-valued function and stored procedure
definition, you need to reference the database project that contains the DDL statement. Learn more about
database references.
3. Add test cases for U-SQL scripts that call table-valued functions and stored procedures. Learn how to add
test cases for U-SQL scripts.

Deploy U-SQL database through Azure Pipelines


PackageDeploymentTool.exe provides the programming and command-line interfaces that help deploy U-SQL
database deployment packages, .usqldbpack . The SDK is included in the U-SQL SDK NuGet package, located at
build/runtime/PackageDeploymentTool.exe . By using PackageDeploymentTool.exe , you can deploy U-SQL
databases to both Azure Data Lake Analytics and local accounts.

NOTE
PowerShell command-line support and Azure Pipelines release task support for U-SQL database deployment are currently
pending.

Take the following steps to set up a database deployment task in Azure Pipelines:
1. Add a PowerShell Script task in a build or release pipeline and execute the following PowerShell script.
This task helps to get Azure SDK dependencies for PackageDeploymentTool.exe and
PackageDeploymentTool.exe . You can set the -AzureSDK and -DBDeploymentTool parameters to load
the dependencies and deployment tool to specific folders. Pass the -AzureSDK path to
PackageDeploymentTool.exe as the -AzureSDKPath parameter in step 2.

<#
This script is used for getting dependencies and SDKs for U-SQL database deployment.
PowerShell command line support for deploying U-SQL database package(.usqldbpack file) will come
soon.

Example :
GetUSQLDBDeploymentSDK.ps1 -AzureSDK "AzureSDKFolderPath" -DBDeploymentTool
"DBDeploymentToolFolderPath"
#>

param (
[string]$AzureSDK = "AzureSDK", # Folder to cache Azure SDK dependencies
[string]$DBDeploymentTool = "DBDeploymentTool", # Folder to cache U-SQL database deployment tool
[string]$workingfolder = "" # Folder to execute these command lines
)

if ([string]::IsNullOrEmpty($workingfolder))
{
$scriptpath = $MyInvocation.MyCommand.Path
$workingfolder = Split-Path $scriptpath
}
cd $workingfolder

echo "workingfolder=$workingfolder, outputfolder=$outputfolder"


echo "Downloading required packages..."

iwr https://www.nuget.org/api/v2/package/Microsoft.Azure.Management.DataLake.Analytics/3.5.1-preview
-outf Microsoft.Azure.Management.DataLake.Analytics.3.5.1-preview.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Azure.Management.DataLake.Store/2.4.1-preview -
outf Microsoft.Azure.Management.DataLake.Store.2.4.1-preview.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.IdentityModel.Clients.ActiveDirectory/2.28.3 -outf
Microsoft.IdentityModel.Clients.ActiveDirectory.2.28.3.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Rest.ClientRuntime/2.3.11 -outf
Microsoft.Rest.ClientRuntime.2.3.11.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Rest.ClientRuntime.Azure/3.3.7 -outf
Microsoft.Rest.ClientRuntime.Azure.3.3.7.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Rest.ClientRuntime.Azure.Authentication/2.3.3 -
outf Microsoft.Rest.ClientRuntime.Azure.Authentication.2.3.3.zip
iwr https://www.nuget.org/api/v2/package/Newtonsoft.Json/6.0.8 -outf Newtonsoft.Json.6.0.8.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Azure.DataLake.USQL.SDK/ -outf USQLSDK.zip

echo "Extracting packages..."

Expand-Archive Microsoft.Azure.Management.DataLake.Analytics.3.5.1-preview.zip -DestinationPath Microsoft.Azure.Management.DataLake.Analytics.3.5.1-preview -Force
Expand-Archive Microsoft.Azure.Management.DataLake.Store.2.4.1-preview.zip -DestinationPath
Microsoft.Azure.Management.DataLake.Store.2.4.1-preview -Force
Expand-Archive Microsoft.IdentityModel.Clients.ActiveDirectory.2.28.3.zip -DestinationPath
Microsoft.IdentityModel.Clients.ActiveDirectory.2.28.3 -Force
Expand-Archive Microsoft.Rest.ClientRuntime.2.3.11.zip -DestinationPath
Microsoft.Rest.ClientRuntime.2.3.11 -Force
Expand-Archive Microsoft.Rest.ClientRuntime.Azure.3.3.7.zip -DestinationPath
Microsoft.Rest.ClientRuntime.Azure.3.3.7 -Force
Expand-Archive Microsoft.Rest.ClientRuntime.Azure.Authentication.2.3.3.zip -DestinationPath
Microsoft.Rest.ClientRuntime.Azure.Authentication.2.3.3 -Force
Expand-Archive Newtonsoft.Json.6.0.8.zip -DestinationPath Newtonsoft.Json.6.0.8 -Force
Expand-Archive USQLSDK.zip -DestinationPath USQLSDK -Force

echo "Copy required DLLs to output folder..."

mkdir $AzureSDK -Force


mkdir $DBDeploymentTool -Force
copy Microsoft.Azure.Management.DataLake.Analytics.3.5.1-preview\lib\net452\*.dll $AzureSDK
copy Microsoft.Azure.Management.DataLake.Store.2.4.1-preview\lib\net452\*.dll $AzureSDK
copy Microsoft.IdentityModel.Clients.ActiveDirectory.2.28.3\lib\net45\*.dll $AzureSDK
copy Microsoft.Rest.ClientRuntime.2.3.11\lib\net452\*.dll $AzureSDK
copy Microsoft.Rest.ClientRuntime.Azure.3.3.7\lib\net452\*.dll $AzureSDK
copy Microsoft.Rest.ClientRuntime.Azure.Authentication.2.3.3\lib\net452\*.dll $AzureSDK
copy Newtonsoft.Json.6.0.8\lib\net45\*.dll $AzureSDK
copy USQLSDK\build\runtime\*.* $DBDeploymentTool

2. Add a Command-Line task in a build or release pipeline and fill in the script by calling
PackageDeploymentTool.exe . PackageDeploymentTool.exe is located under the defined
$DBDeploymentTool folder. The sample script is as follows:
Deploy a U-SQL database locally:

PackageDeploymentTool.exe deploylocal -Package <package path> -Database <database name> -DataRoot <data root path>

Use interactive authentication mode to deploy a U-SQL database to an Azure Data Lake Analytics
account:

PackageDeploymentTool.exe deploycluster -Package <package path> -Database <database name> -Account <account name> -ResourceGroup <resource group name> -SubscriptionId <subscript id> -Tenant <tenant name> -AzureSDKPath <azure sdk path> -Interactive

Use secret authentication to deploy a U-SQL database to an Azure Data Lake Analytics account:

PackageDeploymentTool.exe deploycluster -Package <package path> -Database <database name> -Account <account name> -ResourceGroup <resource group name> -SubscriptionId <subscript id> -Tenant <tenant name> -ClientId <client id> -Secrete <secrete>

Use certFile authentication to deploy a U-SQL database to an Azure Data Lake Analytics account:

PackageDeploymentTool.exe deploycluster -Package <package path> -Database <database name> -Account <account name> -ResourceGroup <resource group name> -SubscriptionId <subscript id> -Tenant <tenant name> -ClientId <client id> -Secrete <secrete> -CertFile <certFile>

PackageDeploymentTool.exe parameter descriptions


Common parameters

Package: The path of the U-SQL database deployment package to be deployed. Default value: null. Required: true.
Database: The database name to be deployed to or created. Default value: master. Required: false.
LogFile: The path of the file for logging. Defaults to standard out (console). Default value: null. Required: false.
LogLevel: Log level: Verbose, Normal, Warning, or Error. Default value: LogLevel.Normal. Required: false.

Parameter for local deployment

DataRoot: The path of the local data root folder. Default value: null. Required: true.

Parameters for Azure Data Lake Analytics deployment

Account: Specifies which Azure Data Lake Analytics account to deploy to, by account name. Default value: null. Required: true.
ResourceGroup: The Azure resource group name for the Azure Data Lake Analytics account. Default value: null. Required: true.
SubscriptionId: The Azure subscription ID for the Azure Data Lake Analytics account. Default value: null. Required: true.
Tenant: The tenant name is the Azure Active Directory (Azure AD) domain name. Find it in the subscription management page in the Azure portal. Default value: null. Required: true.
AzureSDKPath: The path to search for dependent assemblies in the Azure SDK. Default value: null. Required: true.
Interactive: Whether or not to use interactive mode for authentication. Default value: false. Required: false.
ClientId: The Azure AD application ID required for non-interactive authentication. Default value: null. Required for non-interactive authentication.
Secrete: The secret or password for non-interactive authentication. It should be used only in a trusted and secure environment. Default value: null. Required for non-interactive authentication, or else use SecreteFile.
SecreteFile: The file that saves the secret or password for non-interactive authentication. Make sure to keep it readable only by the current user. Default value: null. Required for non-interactive authentication, or else use Secrete.
CertFile: The file that saves the X.509 certification for non-interactive authentication. The default is to use client secret authentication. Default value: null. Required: false.
JobPrefix: The prefix for database deployment of a U-SQL DDL job. Default value: Deploy_ + DateTime.Now. Required: false.
Next steps
How to test your Azure Data Lake Analytics code.
Run U-SQL script on your local machine.
Use U-SQL database project to develop U-SQL database.
Best practices for managing U-SQL assemblies in a
CI/CD pipeline

In this article, you learn how to manage U-SQL assembly source code with the newly introduced U-SQL
database project. You also learn how to set up a continuous integration and deployment (CI/CD) pipeline for
assembly registration by using Azure DevOps.

Use the U-SQL database project to manage assembly source code


The U-SQL database project is a project type in Visual Studio that helps developers develop, manage, and deploy
their U-SQL databases quickly and easily. You can manage all U-SQL database objects (except for credentials)
with the U-SQL database project.
To manage the C# assembly source code and assembly registration DDL U-SQL scripts, use the:
U-SQL database project to manage assembly registration U-SQL scripts.
Class Library (For U-SQL Application) to manage the C# source code and dependencies for user-defined
operators, functions, and aggregators (UDOs, UDFs, and UDAGs).
U-SQL database project to reference the Class Library project.
A U-SQL database project can reference a Class Library (For U-SQL Application) project. You can create
assemblies registered in the U-SQL database by using referenced C# source code from this Class Library (For U-
SQL Application) project.
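For illustration, the user-defined code in the Class Library (For U-SQL Application) project is usually a plain public C# class whose methods you later call from U-SQL. The following is a minimal, hypothetical sketch; the namespace, class, and method names are illustrative and not part of any generated template:

namespace MyUsqlHelpers
{
    // A simple scalar helper that a U-SQL script can call as
    // MyUsqlHelpers.StringHelpers.NormalizeCountry(...) once the assembly built
    // from this project is registered in the U-SQL database.
    public static class StringHelpers
    {
        public static string NormalizeCountry(string country)
        {
            return string.IsNullOrWhiteSpace(country)
                ? "UNKNOWN"
                : country.Trim().ToUpperInvariant();
        }
    }
}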
Follow these steps to create projects and add references.
1. Create a Class Library (For U-SQL Application) project by selecting File > New > Project . The project is
under the Azure Data Lake > U-SQL node.
2. Add your user-defined C# code in the Class Library (For U-SQL Application) project.
3. Create a U-SQL project by selecting File > New > Project . The project is under the Azure Data Lake >
U-SQL node.

4. Add a reference to the C# class library project for the U-SQL database project.
5. Create an assembly script in the U-SQL database project by right-clicking the project and selecting Add
New Item .

6. Open the assembly script in the assembly design view. Select the referenced assembly from the Create
assembly from reference drop-down menu.
7. Add Managed Dependencies and Additional Files , if there are any. When you add additional files, the
tool uses the relative path to make sure it can find the assemblies on your local machine and on the build
machine later.
@_DeployTempDirectory in the editor window at the bottom is a predefined variable that points the tool to
the build output folder. Under the build output folder, every assembly has a subfolder named with the assembly
name. All DLLs and additional files are in that subfolder.

Build a U-SQL database project


The build output for a U-SQL database project is a U-SQL database deployment package. It's named with the
suffix .usqldbpack . The .usqldbpack package is a .zip file that includes all DDL statements in a single U-SQL
script in the DDL folder. All built .dll files and additional files for assemblies are in the Temp folder.

Deploy a U-SQL database


The .usqldbpack package can be deployed to either a local account or an Azure Data Lake Analytics account.
Use Visual Studio or the deployment SDK.
Deploy a U-SQL database in Visual Studio
You can deploy a U-SQL database by using a U-SQL database project or a .usqldbpack package in Visual Studio.
Deploy by using a U-SQL database project
1. Right-click the U-SQL database project, and then select Deploy .
2. In the Deploy U-SQL Database wizard, select the ADLA account to which you want to deploy the
database. Both local accounts and ADLA accounts are supported.
3. Database Source is filled in automatically. It points to the .usqldbpack package in the project's build
output folder.
4. Enter a name in Database Name to create a database. If a database with that same name already exists
in the target Azure Data Lake Analytics account, all objects that are defined in the database project are
created without re-creating the database.
5. To deploy the U-SQL database, select Submit . All resources, such as assemblies and additional files, are
uploaded. A U-SQL job that includes all DDL statements is submitted.

Deploy a U-SQL database in Azure DevOps


PackageDeploymentTool.exe provides the programming and command-line interfaces that help to deploy U-SQL
databases. The SDK is included in the U-SQL SDK Nuget package, located at
build/runtime/PackageDeploymentTool.exe .

In Azure DevOps, you can use a command-line task and this SDK to set up an automation pipeline for the U-SQL
database refresh. Learn more about the SDK and how to set up a CI/CD pipeline for U-SQL database
deployment.

Next steps
Set up a CI/CD pipeline for Azure Data Lake Analytics
Test your Azure Data Lake Analytics code
Run U-SQL script on your local machine
Test your Azure Data Lake Analytics code

Azure Data Lake provides the U-SQL language. U-SQL combines declarative SQL with imperative C# to process
data at any scale. In this document, you learn how to create test cases for U-SQL and extended C# user-defined
operator (UDO) code.

Test U-SQL scripts


The U-SQL script is compiled and optimized for executable code to run in Azure or on your local computer. The
compilation and optimization process treats the entire U-SQL script as a whole. You can't do a traditional unit
test for every statement. However, by using the U-SQL test SDK and the local run SDK, you can do script-level
tests.
Create test cases for U-SQL script
Azure Data Lake Tools for Visual Studio enables you to create U-SQL script test cases.
1. Right-click a U-SQL script in Solution Explorer, and then select Create Unit Test .
2. Create a new test project or insert the test case into an existing test project.

Manage the test data source


When you test U-SQL scripts, you need test input files. To manage the test data, in Solution Explorer, right-click the U-SQL project, and select Properties. You can enter a source in Test Data Source.

When you call the Initialize() interface in the U-SQL test SDK, a temporary local data root folder is created
under the working directory of the test project. All files and folders in the test data source folder are copied to
the temporary local data root folder before you run the U-SQL script test cases. You can add more test data
source folders by splitting the test data folder path with a semicolon.
Manage the database environment for testing
If your U-SQL scripts use or query with U-SQL database objects, you need to initialize the database environment
before you run U-SQL test cases. This approach can be necessary when calling stored procedures. The
Initialize() interface in the U-SQL test SDK helps you deploy all databases that are referenced by the U-SQL
project to the temporary local data root folder in the working directory of the test project.
For more information about how to manage U-SQL database project references for a U-SQL project, see
Reference a U-SQL database project.
Verify test results
The Run() interface returns a job execution result. 0 means success, and 1 means failure. You can also use C#
assert functions to verify the outputs.
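For illustration, a script-level test case typically follows this flow. This is a minimal sketch only: the placement of the Initialize() and Run() calls on the runner object and the paths shown here are assumptions for illustration, and the code that Azure Data Lake Tools generates for you may differ in its exact signatures.

using Microsoft.VisualStudio.TestTools.UnitTesting;
// Also add the using directive for the U-SQL test SDK namespace shipped with
// your version of Azure Data Lake Tools (it provides USqlScriptTestRunner).

[TestClass]
public class USqlScriptTests
{
    [TestMethod]
    public void SampleScriptProducesExpectedOutput()
    {
        // Assumption for illustration: the runner object exposes the
        // Initialize() and Run() interfaces described above.
        var runner = new USqlScriptTestRunner(cppSdkFolderFullPath: @"<path to CppSDK>");

        // Initialize() copies the test data source and deploys referenced
        // databases into a temporary local data root under the test project's
        // working directory.
        runner.Initialize();

        // Run() returns 0 for success and 1 for failure.
        Assert.AreEqual(0, runner.Run(@"Script/SampleScript.usql"));

        // Additional C# asserts can compare produced output files with
        // expected output files.
    }
}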
Run test cases in Visual Studio
A U-SQL script test project is built on top of a C# unit test framework. After you build the project, select Test >
Windows > Test Explorer . You can run test cases from Test Explorer . Alternatively, right-click the .cs file in
your unit test and select Run Tests .

Test C# UDOs
Create test cases for C# UDOs
You can use a C# unit test framework to test your C# user-defined operators (UDOs). When testing UDOs, you
need to prepare corresponding IRowset objects as inputs.
There are two ways to create an IRowset object:
Load data from a file to create IRowset :

//Schema: "a:int, b:int"
USqlColumn<int> col1 = new USqlColumn<int>("a");
USqlColumn<int> col2 = new USqlColumn<int>("b");
List<IColumn> columns = new List<IColumn> { col1, col2 };
USqlSchema schema = new USqlSchema(columns);

//Generate one row with default values
IUpdatableRow output = new USqlRow(schema, null).AsUpdatable();

//Get data from file
IRowset rowset = UnitTestHelper.GetRowsetFromFile(@"processor.txt", schema, output.AsReadOnly(),
    discardAdditionalColumns: true, rowDelimiter: null, columnSeparator: '\t');

Use data from a data collection to create IRowset :

//Schema: "a:int, b:int"
USqlSchema schema = new USqlSchema(
    new USqlColumn<int>("a"),
    new USqlColumn<int>("b")
);

IUpdatableRow output = new USqlRow(schema, null).AsUpdatable();

//Generate Rowset with specified values
List<object[]> values = new List<object[]>{
    new object[2] { 2, 3 },
    new object[2] { 10, 20 }
};

IEnumerable<IRow> rows = UnitTestHelper.CreateRowsFromValues(schema, values);

IRowset rowset = UnitTestHelper.GetRowsetFromCollection(rows, output.AsReadOnly());

Verify test results


After you call UDO functions, you can verify the results through the schema and Rowset value verification by
using C# assert functions. You can add a U-SQL C# UDO Unit Test Project to your solution. To do so, select
File > New > Project in Visual Studio.
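For example, a verification sketch for a processor-style UDO might look like the following. MyProcessor is a hypothetical UDO from your own class library, rowset and output are the objects created with UnitTestHelper in the snippets above, and the expected values in the assert depend entirely on what your UDO computes (here it is assumed to write the sum of columns a and b into column b):

var processor = new MyProcessor();

foreach (IRow inputRow in rowset.Rows)
{
    // A processor fills the updatable row and returns a read-only view of it.
    IRow resultRow = processor.Process(inputRow, output);

    // Verify the produced values (and, if needed, the schema) with C# asserts.
    Assert.AreEqual(inputRow.Get<int>("a") + inputRow.Get<int>("b"),
                    resultRow.Get<int>("b"));
}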
Run test cases in Visual Studio
After you build the project, select Test > Windows > Test Explorer . You can run test cases from Test Explorer .
Alternatively, right-click the .cs file in your unit test and select Run Tests .

Run test cases in Azure Pipelines


Both U-SQL script test projects and C# UDO test projects inherit C# unit test projects. The Visual Studio
test task in Azure Pipelines can run these test cases.
Run U-SQL test cases in Azure Pipelines
For a U-SQL test, make sure you load CPPSDK on your build computer, and then pass the CPPSDK path to
USqlScriptTestRunner(cppSdkFolderFullPath: @"") .

What is CPPSDK?
CPPSDK is a package that includes Microsoft Visual C++ 14 and Windows SDK 10.0.10240.0. This package
includes the environment that's needed by the U-SQL runtime. You can get this package under the Azure Data
Lake Tools for Visual Studio installation folder:
For Visual Studio 2015, it is under
C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\Extensions\Microsoft\Microsoft Azure Data
Lake Tools for Visual Studio 2015\X.X.XXXX.X\CppSDK
For Visual Studio 2017, it is under
C:\Program Files (x86)\Microsoft Visual Studio\2017\<Visual Studio Edition>\SDK\ScopeCppSDK
For Visual Studio 2019, it is under
C:\Program Files (x86)\Microsoft Visual Studio\2019\<Visual Studio Edition>\SDK\ScopeCppSDK

Prepare CPPSDK in the Azure Pipelines build agent


The most common way to prepare the CPPSDK dependency in Azure Pipelines is as follows:
1. Zip the folder that includes the CPPSDK libraries.
2. Check in the .zip file to your source control system. The .zip file ensures that you check in all libraries
under the CPPSDK folder so that files aren't ignored because of a .gitignore file.
3. Unzip the .zip file in the build pipeline.
4. Point USqlScriptTestRunner to this unzipped folder on the build computer.
Run C# UDO test cases in Azure Pipelines
For a C# UDO test, make sure to reference the following assemblies, which are needed for UDOs.
Microsoft.Analytics.Interfaces
Microsoft.Analytics.Types
Microsoft.Analytics.UnitTest
If you reference them through the Nuget package Microsoft.Azure.DataLake.USQL.Interfaces, make sure you add
a NuGet Restore task in your build pipeline.

Next steps
How to set up CI/CD pipeline for Azure Data Lake Analytics
Run U-SQL script on your local machine
Use U-SQL database project to develop U-SQL database
Run and test U-SQL with Azure Data Lake U-SQL
SDK

When developing U-SQL scripts, it is common to run and test them locally before submitting them to the cloud.
Azure Data Lake provides a NuGet package called Azure Data Lake U-SQL SDK for this scenario, through which
you can easily scale out U-SQL runs and tests. It is also possible to integrate this U-SQL testing with a continuous
integration (CI) system to automate compilation and testing.
If you want to manually run and debug U-SQL scripts locally with GUI tooling, you can use Azure Data Lake Tools
for Visual Studio. You can learn more here.

Install Azure Data Lake U-SQL SDK


You can get the Azure Data Lake U-SQL SDK here on NuGet.org. Before using it, make sure you have the dependencies described below.
Dependencies
The Data Lake U-SQL SDK requires the following dependencies:
Microsoft .NET Framework 4.6 or newer.
Microsoft Visual C++ 14 and Windows SDK 10.0.10240.0 or newer (which is called CppSDK in this
article). There are two ways to get CppSDK:
Install Visual Studio Community Edition. You'll have a \Windows Kits\10 folder under the Program
Files folder--for example, C:\Program Files (x86)\Windows Kits\10. You'll also find the Windows 10
SDK version under \Windows Kits\10\Lib. If you don’t see these folders, reinstall Visual Studio and
be sure to select the Windows 10 SDK during the installation. If you have this installed with Visual
Studio, the U-SQL local compiler will find it automatically.
Install Data Lake Tools for Visual Studio. You can find the prepackaged Visual C++ and Windows
SDK files at
C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\Extensions\Microsoft\ADL
Tools\X.X.XXXX.X\CppSDK.

In this case, the U-SQL local compiler cannot find the dependencies automatically. You need to
specify the CppSDK path for it. You can either copy the files to another location or use it as is.

Understand basic concepts


Data root
The data-root folder is a "local store" for the local compute account. It's equivalent to the Azure Data Lake Store
account of a Data Lake Analytics account. Switching to a different data-root folder is just like switching to a
different store account. If you want to access commonly shared data with different data-root folders, you must
use absolute paths in your scripts. Or, create file system symbolic links (for example, mklink on NTFS) under the
data-root folder to point to the shared data.
The data-root folder is used to:
Store local metadata, including databases, tables, table-valued functions (TVFs), and assemblies.
Look up the input and output paths that are defined as relative paths in U-SQL. Using relative paths makes it
easier to deploy your U-SQL projects to Azure.
File path in U-SQL
You can use both a relative path and a local absolute path in U-SQL scripts. The relative path is relative to the
specified data-root folder path. We recommend that you use "/" as the path separator to make your scripts
compatible with the server side. Here are some examples of relative paths and their equivalent absolute paths.
In these examples, C:\LocalRunDataRoot is the data-root folder.

| Relative path | Absolute path |
|---------------|---------------|
| /abc/def/input.csv | C:\LocalRunDataRoot\abc\def\input.csv |
| abc/def/input.csv | C:\LocalRunDataRoot\abc\def\input.csv |
| D:/abc/def/input.csv | D:\abc\def\input.csv |

Working directory
When running the U-SQL script locally, a working directory is created during compilation under the current running directory. In addition to the compilation outputs, the needed runtime files for local execution are shadow copied to this working directory. The working directory root folder is called "ScopeWorkDir", and the files under the working directory are as follows:

| Directory/file | Definition | Description |
|----------------|------------|-------------|
| C6A101DDCB470506 | Hash string of runtime version | Shadow copy of runtime files needed for local execution |
| Script_66AE4909AA0ED06C | Script name + hash string of script path | Compilation outputs and execution step logging |
| __script__.abr | Compiler output | Algebra file |
| __ScopeCodeGen__.* | Compiler output | Generated managed code |
| __ScopeCodeGenEngine__.* | Compiler output | Generated native code |
| referenced assemblies | Assembly reference | Referenced assembly files |
| deployed_resources | Resource deployment | Resource deployment files |
| xxxxxxxx.xxx[1..n]_*.* | Execution log | Log of execution steps |

Use the SDK from the command line


Command-line interface of the helper application
Under SDK directory\build\runtime, LocalRunHelper.exe is the command-line helper application that provides
interfaces to most of the commonly used local-run functions. Note that both the command and the argument
switches are case-sensitive. To invoke it:

LocalRunHelper.exe <command> <Required-Command-Arguments> [Optional-Command-Arguments]

Run LocalRunHelper.exe without arguments or with the help switch to show the help information:
> LocalRunHelper.exe help
Command 'help' : Show usage information
Command 'compile' : Compile the script
Required Arguments :
-Script param
Script File Path
Optional Arguments :
-Shallow [default value 'False']
Shallow compile

In the help information:


Command gives the command’s name.
Required Argument lists arguments that must be supplied.
Optional Argument lists arguments that are optional, with default values. Optional Boolean arguments
don't take a parameter; specifying one means the opposite of its default value.
Return value and logging
The helper application returns 0 for success and -1 for failure. By default, the helper sends all messages to the
current console. However, most of the commands support the -MessageOut path_to_log_file optional
argument that redirects the outputs to a log file.
Configuring environment variables
U-SQL local run needs a specified data root as the local storage account, as well as a specified CppSDK path for
dependencies. You can either set these as command-line arguments or set environment variables for them.
Set the SCOPE_CPP_SDK environment variable.
If you get Microsoft Visual C++ and the Windows SDK by installing Data Lake Tools for Visual Studio,
verify that you have the following folder:
C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\Extensions\Microsoft\Microsoft Azure
Data Lake Tools for Visual Studio 2015\X.X.XXXX.X\CppSDK

Define a new environment variable called SCOPE_CPP_SDK to point to this directory. Or copy the folder
to another location and point SCOPE_CPP_SDK to it.
In addition to setting the environment variable, you can specify the -CppSDK argument when you're
using the command line. This argument overwrites your default CppSDK environment variable.
Set the LOCALRUN_DATAROOT environment variable.
Define a new environment variable called LOCALRUN_DATAROOT that points to the data root.
In addition to setting the environment variable, you can specify the -DataRoot argument with the data-
root path when you're using a command line. This argument overwrites your default data-root
environment variable. You need to add this argument to every command line you're running so that you
can overwrite the default data-root environment variable for all operations.
SDK command line usage samples
Compile and run
The run command is used to compile the script and then execute compiled results. Its command-line arguments
are a combination of those from compile and execute .

LocalRunHelper run -Script path_to_usql_script.usql [optional_arguments]

The following are optional arguments for run:

| Argument | Default value | Description |
|----------|---------------|-------------|
| -CodeBehind | False | The script has .cs code behind |
| -CppSDK | | CppSDK directory |
| -DataRoot | DataRoot environment variable | DataRoot for local run; defaults to the 'LOCALRUN_DATAROOT' environment variable |
| -MessageOut | | Dump messages on console to a file |
| -Parallel | 1 | Run the plan with the specified parallelism |
| -References | | List of paths to extra reference assemblies or data files of code behind, separated by ';' |
| -UdoRedirect | False | Generate Udo assembly redirect config |
| -UseDatabase | master | Database to use for code behind temporary assembly registration |
| -Verbose | False | Show detailed outputs from runtime |
| -WorkDir | Current Directory | Directory for compiler usage and outputs |
| -RunScopeCEP | 0 | ScopeCEP mode to use |
| -ScopeCEPTempPath | temp | Temp path to use for streaming data |
| -OptFlags | | Comma-separated list of optimizer flags |

Here's an example:
LocalRunHelper run -Script d:\test\test1.usql -WorkDir d:\test\bin -CodeBehind -References "d:\asm\ref1.dll;d:\asm\ref2.dll" -UseDatabase testDB -Parallel 5 -Verbose

Besides combining compile and execute , you can compile and execute the compiled executables separately.
Compile a U-SQL script
The compile command is used to compile a U-SQL script to executables.

LocalRunHelper compile -Script path_to_usql_script.usql [optional_arguments]

The following are optional arguments for compile:

| Argument | Description |
|----------|-------------|
| -CodeBehind [default value 'False'] | The script has .cs code behind |
| -CppSDK [default value ''] | CppSDK directory |
| -DataRoot [default value 'DataRoot environment variable'] | DataRoot for local run; defaults to the 'LOCALRUN_DATAROOT' environment variable |
| -MessageOut [default value ''] | Dump messages on console to a file |
| -References [default value ''] | List of paths to extra reference assemblies or data files of code behind, separated by ';' |
| -Shallow [default value 'False'] | Shallow compile |
| -UdoRedirect [default value 'False'] | Generate Udo assembly redirect config |
| -UseDatabase [default value 'master'] | Database to use for code behind temporary assembly registration |
| -WorkDir [default value 'Current Directory'] | Directory for compiler usage and outputs |
| -RunScopeCEP [default value '0'] | ScopeCEP mode to use |
| -ScopeCEPTempPath [default value 'temp'] | Temp path to use for streaming data |
| -OptFlags [default value ''] | Comma-separated list of optimizer flags |

Here are some usage examples.


Compile a U-SQL script:

LocalRunHelper compile -Script d:\test\test1.usql

Compile a U-SQL script and set the data-root folder. Note that this will overwrite the set environment variable.

LocalRunHelper compile -Script d:\test\test1.usql -DataRoot c:\DataRoot

Compile a U-SQL script and set a working directory, reference assembly, and database:

LocalRunHelper compile -Script d:\test\test1.usql -WorkDir d:\test\bin -References "d:\asm\ref1.dll;d:\asm\ref2.dll" -UseDatabase testDB

Execute compiled results


The execute command is used to execute compiled results.

LocalRunHelper execute -Algebra path_to_compiled_algebra_file [optional_arguments]

The following are optional arguments for execute:

| Argument | Default value | Description |
|----------|---------------|-------------|
| -DataRoot | '' | Data root for metadata execution. It defaults to the LOCALRUN_DATAROOT environment variable. |
| -MessageOut | '' | Dump messages on the console to a file. |
| -Parallel | '1' | Indicator to run the generated local-run steps with the specified parallelism level. |
| -Verbose | 'False' | Indicator to show detailed outputs from runtime. |

Here's a usage example:

LocalRunHelper execute -Algebra d:\test\workdir\C6A101DDCB470506\Script_66AE4909AA0ED06C\__script__.abr -DataRoot c:\DataRoot -Parallel 5

Use the SDK with programming interfaces


The programming interfaces are all located in LocalRunHelper.exe. You can use them to integrate the
functionality of the U-SQL SDK with the C# test framework to scale out your local U-SQL script testing. This
article uses a standard C# unit test project to show how to use these interfaces to test your U-SQL script.
Step 1: Create C# unit test project and configuration
Create a C# unit test project through File > New > Project > Visual C# > Test > Unit Test Project.
Add LocalRunHelper.exe as a reference for the project. The LocalRunHelper.exe is located at
\build\runtime\LocalRunHelper.exe in Nuget package.

The U-SQL SDK only supports the x64 environment, so make sure to set the build platform target to x64. You can set
that through Project Property > Build > Platform target.
Make sure to set your test environment as x64. In Visual Studio, you can set it through Test > Test Settings
> Default Processor Architecture > x64.

Make sure to copy all dependency files under NugetPackage\build\runtime\ to the project working directory,
which is usually under ProjectFolder\bin\x64\Debug.
Step 2: Create a U-SQL script test case
Below is sample code for a U-SQL script test. For testing, you need to prepare scripts, input files, and expected
output files.
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.IO;
using System.Text;
using System.Security.Cryptography;
using Microsoft.Analytics.LocalRun;

namespace UnitTestProject1
{
    [TestClass]
    public class USQLUnitTest
    {
        [TestMethod]
        public void TestUSQLScript()
        {
            //Specify the local run message output path
            StreamWriter MessageOutput = new StreamWriter("../../../log.txt");
            LocalRunHelper localrun = new LocalRunHelper(MessageOutput);

            //Configure the DataRoot path, script path, and CppSDK path
            localrun.DataRoot = "../../../";
            localrun.ScriptPath = "../../../Script/Script.usql";
            localrun.CppSdkDir = "../../../CppSDK";

            //Run U-SQL script
            localrun.DoRun();

            //Script output
            string Result = Path.Combine(localrun.DataRoot, "Output/result.csv");

            //Expected script output
            string ExpectedResult = "../../../ExpectedOutput/result.csv";

            Test.Helpers.FileAssert.AreEqual(Result, ExpectedResult);

            //Don't forget to close MessageOutput to get logs into the file
            MessageOutput.Close();
        }
    }
}

namespace Test.Helpers
{
    public static class FileAssert
    {
        static string GetFileHash(string filename)
        {
            Assert.IsTrue(File.Exists(filename));
            using (var hash = new SHA1Managed())
            {
                var clearBytes = File.ReadAllBytes(filename);
                var hashedBytes = hash.ComputeHash(clearBytes);
                return ConvertBytesToHex(hashedBytes);
            }
        }

        static string ConvertBytesToHex(byte[] bytes)
        {
            var sb = new StringBuilder();
            for (var i = 0; i < bytes.Length; i++)
            {
                //Use "x2" so every byte maps to exactly two hex characters
                sb.Append(bytes[i].ToString("x2"));
            }
            return sb.ToString();
        }

        public static void AreEqual(string filename1, string filename2)
        {
            string hash1 = GetFileHash(filename1);
            string hash2 = GetFileHash(filename2);
            Assert.AreEqual(hash1, hash2);
        }
    }
}

Programming interfaces in LocalRunHelper.exe


LocalRunHelper.exe provides the programming interfaces for U-SQL local compile, run, etc. The interfaces are
listed as follows.
Constructor
public LocalRunHelper([System.IO.TextWriter messageOutput = null])

| Parameter | Type | Description |
|-----------|------|-------------|
| messageOutput | System.IO.TextWriter | For output messages; set to null to use the console |

Properties

| Property | Type | Description |
|----------|------|-------------|
| AlgebraPath | string | The path to the algebra file (the algebra file is one of the compilation results) |
| CodeBehindReferences | string | If the script has additional code behind references, specify the paths separated with ';' |
| CppSdkDir | string | CppSDK directory |
| CurrentDir | string | Current directory |
| DataRoot | string | Data root path |
| DebuggerMailPath | string | The path to the debugger mailslot |
| GenerateUdoRedirect | bool | If we want to generate assembly loading redirection override config |
| HasCodeBehind | bool | If the script has code behind |
| InputDir | string | Directory for input data |
| MessagePath | string | Message dump file path |
| OutputDir | string | Directory for output data |
| Parallelism | int | Parallelism to run the algebra |
| ParentPid | int | PID of the parent on which the service monitors to exit; set to 0 or negative to ignore |
| ResultPath | string | Result dump file path |
| RuntimeDir | string | Runtime directory |
| ScriptPath | string | Where to find the script |
| Shallow | bool | Shallow compile or not |
| TempDir | string | Temp directory |
| UseDataBase | string | Specify the database to use for code behind temporary assembly registration, master by default |
| WorkDir | string | Preferred working directory |

Method

| Method | Description | Return | Parameter |
|--------|-------------|--------|-----------|
| public bool DoCompile() | Compile the U-SQL script | True on success | |
| public bool DoExec() | Execute the compiled result | True on success | |
| public bool DoRun() | Run the U-SQL script (compile + execute) | True on success | |
| public bool IsValidRuntimeDir(string path) | Check if the given path is a valid runtime path | True for valid | The path of the runtime directory |
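For example, a minimal sketch of compiling and then executing separately through these interfaces could look like the following. The paths are placeholders, and the sketch assumes that, as with DoRun(), calling DoExec() on the same instance executes the output produced by the preceding DoCompile():

using System;
using Microsoft.Analytics.LocalRun;

class CompileThenExecute
{
    static void Main()
    {
        var localrun = new LocalRunHelper(Console.Out)
        {
            DataRoot = @"C:\LocalRunDataRoot",
            ScriptPath = @"C:\Scripts\Script.usql",
            CppSdkDir = @"C:\CppSDK",
            WorkDir = @"C:\test\bin"
        };

        // Compile first; on success, execute the compiled result.
        if (localrun.DoCompile())
        {
            bool succeeded = localrun.DoExec();
            Console.WriteLine(succeeded ? "Execution succeeded" : "Execution failed");
        }
        else
        {
            Console.WriteLine("Compilation failed. Check the message output for details.");
        }
    }
}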

FAQ about common issues


Error 1
E_CSC_SYSTEM_INTERNAL: Internal error! Could not load file or assembly 'ScopeEngineManaged.dll' or one of
its dependencies. The specified module could not be found.
Please check the following:
Make sure you have an x64 environment. The build target platform and the test environment should be x64;
refer to Step 1: Create C# unit test project and configuration above.
Make sure you have copied all dependency files under NugetPackage\build\runtime\ to the project working
directory.

Next steps
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics.
To see a more complex query, see Analyze website logs using Azure Data Lake Analytics.
To view job details, see Use Job Browser and Job View for Azure Data Lake Analytics jobs.
To use the vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.
Understand Apache Spark for U-SQL developers

Microsoft supports several Analytics services such as Azure Databricks and Azure HDInsight as well as Azure
Data Lake Analytics. We hear from developers that they have a clear preference for open-source-solutions as
they build analytics pipelines. To help U-SQL developers understand Apache Spark, and how you might
transform your U-SQL scripts to Apache Spark, we've created this guidance.
It includes a number of steps you can take, and several alternatives.

Steps to transform U-SQL to Apache Spark


1. Transform your job orchestration pipelines.
If you use Azure Data Factory to orchestrate your Azure Data Lake Analytics scripts, you'll have to adjust
them to orchestrate the new Spark programs.
2. Understand the differences between how U-SQL and Spark manage data
If you want to move your data from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2, you
will have to copy both the file data and the catalog maintained data. Note that Azure Data Lake Analytics
only supports Azure Data Lake Storage Gen1. See Understand Spark data formats
3. Transform your U-SQL scripts to Spark
Before transforming your U-SQL scripts, you will have to choose an analytics service. Some of the
available compute services available are:
Azure Data Factory DataFlow Mapping data flows are visually designed data transformations that
allow data engineers to develop a graphical data transformation logic without writing code. While not
suited to execute complex user code, they can easily represent traditional SQL-like dataflow
transformations
Azure HDInsight Hive Apache Hive on HDInsight is suited to Extract, Transform, and Load (ETL)
operations. This means you are going to translate your U-SQL scripts to Apache Hive.
Apache Spark Engines such as Azure HDInsight Spark or Azure Databricks This means you are going
to translate your U-SQL scripts to Spark. For more information, see Understand Spark data formats
Caution

Both Azure Databricks and Azure HDInsight Spark are cluster services and not serverless jobs like Azure Data
Lake Analytics. You will have to consider how to provision the clusters to get the appropriate cost/performance
ratio and how to manage their lifetime to minimize your costs. These services have different performance
characteristics with user code written in .NET, so you will have to either write wrappers or rewrite your code in a
supported language. For more information, see Understand Spark data formats, Understand Apache Spark code
concepts for U-SQL developers, .Net for Apache Spark

Next steps
Understand Spark data formats for U-SQL developers
Understand Spark code concepts for U-SQL developers
Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage
Gen2
.NET for Apache Spark
Transform data using Hadoop Hive activity in Azure Data Factory
Transform data using Spark activity in Azure Data Factory
What is Apache Spark in Azure HDInsight
Understand differences between U-SQL and Spark
data formats

If you want to use either Azure Databricks or Azure HDInsight Spark, we recommend that you migrate your data
from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2.
In addition to moving your files, you'll also want to make your data, stored in U-SQL tables, accessible to Spark.

Move data stored in Azure Data Lake Storage Gen1 files


Data stored in files can be moved in various ways:
Write an Azure Data Factory pipeline to copy the data from Azure Data Lake Storage Gen1 account to the
Azure Data Lake Storage Gen2 account.
Write a Spark job that reads the data from the Azure Data Lake Storage Gen1 account and writes it to the
Azure Data Lake Storage Gen2 account. Based on your use case, you may want to write it in a different
format such as Parquet if you do not need to preserve the original file format.
We recommend that you review the article Upgrade your big data analytics solutions from Azure Data Lake
Storage Gen1 to Azure Data Lake Storage Gen2

Move data stored in U-SQL tables


U-SQL tables are not understood by Spark. If you have data stored in U-SQL tables, you'll run a U-SQL job that
extracts the table data and saves it in a format that Spark understands. The most appropriate format is to create
a set of Parquet files following the Hive metastore's folder layout.
The output can be achieved in U-SQL with the built-in Parquet outputter and using the dynamic output
partitioning with file sets to create the partition folders. Process more files than ever and use Parquet provides
an example of how to create such Spark consumable data.
After this transformation, you copy the data as outlined in the chapter Move data stored in Azure Data Lake
Storage Gen1 files.

Caveats
Data semantics When copying files, the copy will occur at the byte level. So the same data should be
appearing in the Azure Data Lake Storage Gen2 account. Note however, Spark may interpret some
characters differently. For example, it may use a different default for a row-delimiter in a CSV file.
Furthermore, if you're copying typed data (from tables), then Parquet and Spark may have different
precision and scale for some of the typed values (for example, a float) and may treat null values
differently. For example, U-SQL has the C# semantics for null values, while Spark has a three-valued logic
for null values.
Data organization (partitioning) U-SQL tables provide two level partitioning. The outer level (
PARTITIONED BY ) is by value and maps mostly into the Hive/Spark partitioning scheme using folder
hierarchies. You will need to ensure that the null values are mapped to the right folder. The inner level (
DISTRIBUTED BY ) in U-SQL offers 4 distribution schemes: round robin, range, hash, and direct hash.
Hive/Spark tables only support value partitioning or hash partitioning, using a different hash function
than U-SQL. When you output your U-SQL table data, you will probably only be able to map into the
value partitioning for Spark and may need to do further tuning of your data layout depending on your
final Spark queries.

Next steps
Understand Spark code concepts for U-SQL developers
Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage
Gen2
.NET for Apache Spark
Transform data using Spark activity in Azure Data Factory
Transform data using Hadoop Hive activity in Azure Data Factory
What is Apache Spark in Azure HDInsight
Understand Apache Spark code for U-SQL
developers

This section provides high-level guidance on transforming U-SQL Scripts to Apache Spark.
It starts with a comparison of the two languages' processing paradigms, and then provides tips on how to transform:
Scripts, including U-SQL's rowset expressions
.NET code
Data types
Catalog objects

Understand the U-SQL and Spark language and processing paradigms


Before you start migrating Azure Data Lake Analytics' U-SQL scripts to Spark, it is useful to understand the
general language and processing philosophies of the two systems.
U-SQL is a SQL-like declarative query language that uses a data-flow paradigm and allows you to easily embed
and scale out user-code written in .NET (for example C#), Python, and R. The user-extensions can implement
simple expressions or user-defined functions, but can also provide the user the ability to implement so called
user-defined operators that implement custom operators to perform rowset level transformations, extractions
and writing output.
Spark is a scale-out framework offering several language bindings in Scala, Java, Python, .NET etc. where you
primarily write your code in one of these languages, create data abstractions called resilient distributed datasets
(RDD), dataframes, and datasets and then use a LINQ-like domain-specific language (DSL) to transform them. It
also provides SparkSQL as a declarative sublanguage on the dataframe and dataset abstractions. The DSL
provides two categories of operations, transformations and actions. Applying transformations to the data
abstractions will not execute the transformation but instead build-up the execution plan that will be submitted
for evaluation with an action (for example, writing the result into a temporary table or file, or printing the result).
Thus when translating a U-SQL script to a Spark program, you will have to decide which language you want to
use to at least generate the data frame abstraction (which is currently the most frequently used data abstraction)
and whether you want to write the declarative dataflow transformations using the DSL or SparkSQL. In some
more complex cases, you may need to split your U-SQL script into a sequence of Spark and other steps
implemented with Azure Batch or Azure Functions.
Furthermore, Azure Data Lake Analytics offers U-SQL in a serverless job service environment, while both Azure
Databricks and Azure HDInsight offer Spark in form of a cluster service. When transforming your application,
you will have to take into account the implications of now creating, sizing, scaling, and decommissioning the
clusters.

Transform U-SQL scripts


U-SQL scripts follow the following processing pattern:
1. Data gets read from either unstructured files, using the EXTRACT statement, a location or file set specification,
and the built-in or user-defined extractor and desired schema, or from U-SQL tables (managed or external
tables). It is represented as a rowset.
2. The rowsets get transformed in multiple U-SQL statements that apply U-SQL expressions to the rowsets and
produce new rowsets.
3. Finally, the resulting rowsets are output into either files using the OUTPUT statement that specifies the
location(s) and a built-in or user-defined outputter, or into a U-SQL table.
The script is evaluated lazily, meaning that each extraction and transformation step is composed into an
expression tree and globally evaluated (the dataflow).
Spark programs are similar in that you would use Spark connectors to read the data and create the dataframes,
then apply the transformations on the dataframes using either the LINQ-like DSL or SparkSQL, and then write
the result into files, temporary Spark tables, some programming language types, or the console.

Transform .NET code


U-SQL's expression language is C# and it offers a variety of ways to scale out custom .NET code.
Since Spark currently does not natively support executing .NET code, you will have to either rewrite your
expressions into an equivalent Spark, Scala, Java, or Python expression or find a way to call into your .NET code.
If your script uses .NET libraries, you have the following options:
Translate your .NET code into Scala or Python.
Split your U-SQL script into several steps, where you use Azure Batch processes to apply the .NET
transformations (if you can get acceptable scale)
Use a .NET language binding available in Open Source called Moebius. This project is not in a supported
state.
In any case, if you have a large amount of .NET logic in your U-SQL scripts, please contact us through your
Microsoft Account representative for further guidance.
The following details are for the different cases of .NET and C# usages in U-SQL scripts.
Transform scalar inline U-SQL C# expressions
U-SQL's expression language is C#. Many of the scalar inline U-SQL expressions are implemented natively for
improved performance, while more complex expressions may be executed through calling into the .NET
framework.
Spark has its own scalar expression language (either as part of the DSL or in SparkSQL) and allows calling into
user-defined functions written in its hosting language.
If you have scalar expressions in U-SQL, you should first find the most appropriate natively understood Spark
scalar expression to get the most performance, and then map the other expressions into a user-defined function
of the Spark hosting language of your choice.
Be aware that .NET and C# have different type semantics than the Spark hosting languages and Spark's DSL. See
below for more details on the type system differences.
Transform user-defined scalar .NET functions and user-defined aggregators
U-SQL provides ways to call arbitrary scalar .NET functions and to call user-defined aggregators written in .NET.
Spark also offers support for user-defined functions and user-defined aggregators written in most of its hosting
languages that can be called from Spark's DSL and SparkSQL.
Transform user-defined operators (UDOs)
U-SQL provides several categories of user-defined operators (UDOs) such as extractors, outputters, reducers,
processors, appliers, and combiners that can be written in .NET (and - to some extent - in Python and R).
Spark does not offer the same extensibility model for operators, but has equivalent capabilities for some.
The Spark equivalent to extractors and outputters is Spark connectors. For many U-SQL extractors, you may find
an equivalent connector in the Spark community. For others, you will have to write a custom connector. If the U-
SQL extractor is complex and makes use of several .NET libraries, it may be preferable to build a connector in
Scala that uses interop to call into the .NET library that does the actual processing of the data. In that case, you
will have to deploy the .NET Core runtime to the Spark cluster and make sure that the referenced .NET libraries
are .NET Standard 2.0 compliant.
The other types of U-SQL UDOs will need to be rewritten using user-defined functions and aggregators and the
semantically appropriate Spark DSL or SparkSQL expression. For example, a processor can be mapped to a
SELECT of a variety of UDF invocations, packaged as a function that takes a dataframe as an argument and
returns a dataframe.
Transform U-SQL's optional libraries
U-SQL provides a set of optional and demo libraries that offer Python, R, JSON, XML, AVRO support, and some
cognitive services capabilities.
Spark offers its own Python and R integration, pySpark and SparkR respectively, and provides connectors to
read and write JSON, XML, and AVRO.
If you need to transform a script referencing the cognitive services libraries, we recommend contacting us via
your Microsoft Account representative.

Transform typed values


Because U-SQL's type system is based on the .NET type system and Spark has its own type system, that is
impacted by the host language binding, you will have to make sure that the types you are operating on are close
and for certain types, the type ranges, precision and/or scale may be slightly different. Furthermore, U-SQL and
Spark treat null values differently.
Data types
The following table gives the equivalent types in Spark, Scala, and PySpark for the given U-SQL types.

| U-SQL | Spark | Scala | PySpark |
|-------|-------|-------|---------|
| byte | | | |
| sbyte | ByteType | Byte | ByteType |
| int | IntegerType | Int | IntegerType |
| uint | | | |
| long | LongType | Long | LongType |
| ulong | | | |
| float | FloatType | Float | FloatType |
| double | DoubleType | Double | DoubleType |
| decimal | DecimalType | java.math.BigDecimal | DecimalType |
| short | ShortType | Short | ShortType |
| ushort | | | |
| char | | Char | |
| string | StringType | String | StringType |
| DateTime | DateType, TimestampType | java.sql.Date, java.sql.Timestamp | DateType, TimestampType |
| bool | BooleanType | Boolean | BooleanType |
| Guid | | | |
| byte[] | BinaryType | Array[Byte] | BinaryType |
| SQL.MAP<K,V> | MapType(keyType, valueType, valueContainsNull) | scala.collection.Map | MapType(keyType, valueType, valueContainsNull=True) |
| SQL.ARRAY<T> | ArrayType(elementType, containsNull) | scala.collection.Seq | ArrayType(elementType, containsNull=True) |

For more information, see:


org.apache.spark.sql.types
Spark SQL and DataFrames Types
Scala value types
pyspark.sql.types
Treatment of NULL
In Spark, types per default allow NULL values while in U-SQL, you explicitly mark scalar, non-object as nullable.
While Spark allows you to define a column as not nullable, it will not enforce the constraint and may lead to
wrong result.
In Spark, NULL indicates that the value is unknown. A Spark NULL value is different from any value, including
itself. Comparisons between two Spark NULL values, or between a NULL value and any other value, return
unknown because the value of each NULL is unknown.
This behavior is different from U-SQL, which follows C# semantics where null is different from any value but
equal to itself.
Thus a SparkSQL SELECT statement that uses WHERE column_name = NULL returns zero rows even if there are
NULL values in column_name , while in U-SQL, it would return the rows where column_name is set to null .
Similarly, a Spark SELECT statement that uses WHERE column_name != NULL returns zero rows even if there are
non-null values in column_name , while in U-SQL, it would return the rows that have non-null. Thus, if you want
the U-SQL null-check semantics, you should use isnull and isnotnull respectively (or their DSL equivalent).
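As a small illustration of the two-valued C# null semantics that U-SQL follows (in contrast to Spark's three-valued logic), consider the following standalone C# snippet:

using System;

class NullSemanticsDemo
{
    static void Main()
    {
        // In C#, null compares equal to itself, which is why a U-SQL predicate
        // such as column_name == null matches the rows that hold null.
        int? a = null;
        int? b = null;
        Console.WriteLine(a == b);   // True  - null equals null
        Console.WriteLine(a == 5);   // False - null differs from any non-null value

        // In Spark, NULL = NULL evaluates to unknown instead, so use
        // isnull/isnotnull (or their DSL equivalents) for null checks.
    }
}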

Transform U-SQL catalog objects


One major difference is that U-SQL Scripts can make use of its catalog objects, many of which have no direct
Spark equivalent.
Spark does provide support for the Hive Meta store concepts, mainly databases, and tables, so you can map U-
SQL databases and schemas to Hive databases, and U-SQL tables to Spark tables (see Moving data stored in U-
SQL tables), but it has no support for views, table-valued functions (TVFs), stored procedures, U-SQL assemblies,
external data sources etc.
The U-SQL code objects such as views, TVFs, stored procedures, and assemblies can be modeled through code
functions and libraries in Spark and referenced using the host language's function and procedural abstraction
mechanisms (for example, through importing Python modules or referencing Scala functions).
If the U-SQL catalog has been used to share data and code objects across projects and teams, then equivalent
mechanisms for sharing have to be used (for example, Maven for sharing code objects).

Transform U-SQL rowset expressions and SQL-based scalar expressions
U-SQL's core language is transforming rowsets and is based on SQL. The following is a non-exhaustive list of
the most common rowset expressions offered in U-SQL:
SELECT / FROM / WHERE / GROUP BY +Aggregates+ HAVING / ORDER BY + FETCH
INNER / OUTER / CROSS / SEMI JOIN expressions
CROSS / OUTER APPLY expressions
PIVOT / UNPIVOT expressions
VALUES rowset constructor
Set expressions UNION / OUTER UNION / INTERSECT / EXCEPT

In addition, U-SQL provides a variety of SQL-based scalar expressions such as


OVER windowing expressions
a variety of built-in aggregators and ranking functions ( SUM , FIRST etc.)
Some of the most familiar SQL scalar expressions: CASE , LIKE , ( NOT ) IN , AND , OR etc.

Spark offers equivalent expressions in both its DSL and SparkSQL form for most of these expressions. Some of
the expressions not supported natively in Spark will have to be rewritten using a combination of the native
Spark expressions and semantically equivalent patterns. For example, OUTER UNION will have to be translated
into the equivalent combination of projections and unions.
Due to the different handling of NULL values, a U-SQL join will always match a row if both of the columns being
compared contain a NULL value, while a join in Spark will not match such columns unless explicit null checks are
added.

Transform other U-SQL concepts


U-SQL also offers a variety of other features and concepts, such as federated queries against SQL Server
databases, parameters, scalar, and lambda expression variables, system variables, OPTION hints.
Federated Queries against SQL Server databases/external tables
U-SQL provides data source and external tables as well as direct queries against Azure SQL Database. While
Spark does not offer the same object abstractions, it provides Spark connector for Azure SQL Database that can
be used to query SQL databases.
U-SQL parameters and variables
Parameters and user variables have equivalent concepts in Spark and their hosting languages.
For example in Scala, you can define a variable with the var keyword:

var x = 2 * 3;
println(x)

U-SQL's system variables (variables starting with @@ ) can be split into two categories:
Settable system variables that can be set to specific values to impact the scripts behavior
Informational system variables that inquire system and job level information
Most of the settable system variables have no direct equivalent in Spark. Some of the informational system
variables can be modeled by passing the information as arguments during job execution, others may have an
equivalent function in Spark's hosting language.
U-SQL hints
U-SQL offers several syntactic ways to provide hints to the query optimizer and execution engine:
Setting a U-SQL system variable
an OPTION clause associated with the rowset expression to provide a data or plan hint
a join hint in the syntax of the join expression (for example, BROADCASTLEFT )

Spark's cost-based query optimizer has its own capabilities to provide hints and tune the query performance.
Please refer to the corresponding documentation.

Next steps
Understand Spark data formats for U-SQL developers
.NET for Apache Spark
Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage
Gen2
Transform data using Spark activity in Azure Data Factory
Transform data using Hadoop Hive activity in Azure Data Factory
What is Apache Spark in Azure HDInsight
Learn how to troubleshoot U-SQL runtime failures
due to runtime changes

The Azure Data Lake U-SQL runtime, including the compiler, optimizer, and job manager, is what processes your
U-SQL code.

Choosing your U-SQL runtime version


When you submit U-SQL jobs from either Visual Studio, the ADL SDK or the Azure Data Lake Analytics portal,
your job will use the currently available default runtime. New versions of the U-SQL runtime are released on a
regular basis and include both minor updates and security fixes.
You can also choose a custom runtime version; either because you want to try out a new update, need to stay on
an older version of a runtime, or were provided with a hotfix for a reported problem where you cannot wait for
the regular new update.
Caution

Choosing a runtime that is different from the default has the potential to break your U-SQL jobs. Use these
other versions for testing only.
In rare cases, Microsoft Support may pin a different version of a runtime as the default for your account. Please
ensure that you revert this pin as soon as possible. If you remain pinned to that version, it will expire at some
later date.
Monitoring your jobs' U-SQL runtime version
You can see the history of which runtime version your past jobs have used in your account's job history via
Visual Studio's job browser or the Azure portal's job history.
1. In the Azure portal, go to your Data Lake Analytics account.
2. Select View All Jobs . A list of all the active and recently finished jobs in the account appears.
3. Optionally, click Filter to help you find the jobs by Time Range , Job Name , and Author values.
4. You can see the runtime used in the completed jobs.
The available runtime versions change over time. The default runtime is always called "default" and we keep at
least the previous runtime available for some time as well as make special runtimes available for a variety of
reasons. Explicitly named runtimes generally follow the following format (italics are used for variable parts and
[] indicates optional parts):
release_YYYYMMDD_adl_buildno[_modifier]
For example, release_20190318_adl_3394512_2 means the second version of the build 3394512 of the runtime
release of March 18 2019 and release_20190318_adl_3394512_private means a private build of the same
release. Note: The date is related to when the last check-in has been taken for that release and not necessarily
the official release date.

Troubleshooting U-SQL runtime version issues


There are two possible runtime version issues that you may encounter:
1. A script or some user-code is changing behavior from one release to the next. Such breaking changes are
normally communicated ahead of time with the publication of release notes. If you encounter such a
breaking change, contact Microsoft Support to report this breaking behavior (in case it has not been
documented yet) and submit your jobs against the older runtime version.
2. You have been using a non-default runtime either explicitly or implicitly when it has been pinned to your
account, and that runtime has been removed after some time. If you encounter missing runtimes,
upgrade your scripts to run with the current default runtime. If you need additional time, contact
Microsoft Support

Known issues
1. Referencing Newtonsoft.Json file version 12.0.3 or onwards in a USQL script will cause the following
compilation failure:
"We are sorry; jobs running in your Data Lake Analytics account will likely run more slowly or fail to
complete. An unexpected problem is preventing us from automatically restoring this functionality to your
Azure Data Lake Analytics account. Azure Data Lake engineers have been contacted to investigate."
Where the call stack will contain:
System.IndexOutOfRangeException: Index was outside the bounds of the array.
at Roslyn.Compilers.MetadataReader.PEFile.CustomAttributeTableReader.get_Item(UInt32 rowId)
...

Solution : Please use Newtonsoft.Json file v12.0.2 or lower.


2. Customers might see temporary files and folders on their store. Those are produced as part of the
normal job execution, but are usually deleted before the customers see them. Under certain
circumstances, which are rare and random, they might remain visible for a period of time. They are
eventually deleted, and are never counted as part of user storage, or generate any form of charges
whatsoever. Depending on the customers' job logic they might cause issues. For instance, if the job
enumerates all files in the folder and then compares file lists, it might fail because of the unexpected
temporary files being present. Similarly, if a downstream job enumerates all files from a given folder for
further processing, it might also enumerate the temp files.
Solution: A fix is identified in the runtime where the temp files will be stored in an account-level temp folder
rather than the current output folder. The temp files will be written to this new temp folder and will be deleted
at the end of the job execution.
Because this fix handles customer data, it is extremely important that the fix is well validated within Microsoft
before it is released. The fix is expected to be available as a beta runtime in the middle of 2021 and as the
default runtime in the second half of 2021.

See also
Azure Data Lake Analytics overview
Manage Azure Data Lake Analytics using Azure portal
Monitor jobs in Azure Data Lake Analytics using the Azure portal
Azure Data Lake Analytics is upgrading to the .NET
Framework v4.7.2
12/10/2021 • 6 minutes to read

The Azure Data Lake Analytics default runtime is upgrading from .NET Framework v4.5.2 to .NET Framework
v4.7.2. This change introduces a small risk of breaking changes if your U-SQL code uses custom assemblies, and
those custom assemblies use .NET libraries.
This upgrade from .NET Framework 4.5.2 to version 4.7.2 means that the .NET Framework deployed in a U-SQL
runtime (the default runtime) will now always be 4.7.2. There isn't a side-by-side option for .NET Framework
versions.
After this upgrade to .NET Framework 4.7.2 is complete, the system’s managed code will run as version 4.7.2, while user-provided libraries such as U-SQL custom assemblies will run in the backward-compatible mode appropriate for the version that the assembly was built against.
If your assembly DLLs are generated for version 4.5.2, the deployed framework will treat them as 4.5.2
libraries, providing (with a few exceptions) 4.5.2 semantics.
You can now use U-SQL custom assemblies that make use of version 4.7.2 features, if you target the .NET
Framework 4.7.2.
Because of this upgrade to .NET Framework 4.7.2, there's a potential to introduce breaking changes to your U-
SQL jobs that use .NET custom assemblies. We suggest you check for backwards-compatibility issues using the
procedure below.

How to check for backwards-compatibility issues


Check for the potential of backwards-compatibility breaking issues by running the .NET compatibility checks on
your .NET code in your U-SQL custom assemblies.

NOTE
The tool doesn't detect actual breaking changes; it only identifies called .NET APIs that may (for certain inputs) cause issues. If the tool flags an issue, your code may still be fine, but you should check it in more detail.

1. Run the backwards-compatibility checker on your .NET DLLs either by:
a. Using the Visual Studio extension at .NET Portability Analyzer Visual Studio Extension
b. Downloading and using the standalone tool from GitHub dotnetapiport. Instructions for running the standalone tool are at GitHub dotnetapiport breaking changes
c. For 4.7.2 compatibility, results where isRetargeting == True identify possible issues.
2. If the tool indicates that your code may be impacted by any of the possible backwards incompatibilities (some common examples are listed below), you can check further by:
a. Analyzing your code and identifying whether it passes values to the impacted APIs.
b. Performing a runtime check. The runtime deployment isn't done side-by-side in ADLA, so perform a runtime check before the upgrade by using Visual Studio's local run with a local .NET Framework 4.7.2 against a representative data set (a minimal smoke-test sketch follows this list).
3. If you are indeed impacted by a backwards incompatibility, take the necessary steps to fix it (such as fixing your data or code logic).
In most cases, you should not be impacted by backwards-incompatibility.
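As a minimal sketch of the runtime check in step 2b, the console program below can be compiled against .NET Framework 4.7.2 and run locally against representative inputs; the NormalizeText method is a hypothetical stand-in for your own custom-assembly code.

using System;

class CompatSmokeTest
{
    // Hypothetical placeholder: reference your custom assembly instead and call
    // the methods that your U-SQL scripts actually use.
    static string NormalizeText(string input)
    {
        return input.Trim().ToUpperInvariant();
    }

    static void Main()
    {
        string[] representativeInputs =
        {
            "plain ascii",
            "non-ASCII: déjà vu",
            "html fragment <br/> with markup"
        };

        foreach (string input in representativeInputs)
        {
            // Compare this output with the output produced when targeting 4.5.2.
            Console.WriteLine(NormalizeText(input));
        }
    }
}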

Timeline
You can check for the deployment of the new runtime on the Runtime troubleshoot page, and by looking at any prior successful job.
What if I can't get my code reviewed in time
You can submit your job against the old runtime version (which is built targeting 4.5.2), however due to the lack
of .NET Framework side-by-side capabilities, it will still only run in 4.5.2 compatibility mode. You may still
encounter some of the backwards-compatibility issues because of this behavior.
What are the most common backwards-compatibility issues you may encounter
The list below covers the most common backwards incompatibilities that the checker is likely to identify (we generated this list by running the checker on our own internal ADLA jobs), which libraries are impacted (note that you may call the libraries only indirectly, so it is important to take required action #1 above to check whether your jobs are impacted), and possible actions to remedy them. Note: In almost all cases for our own jobs, the warnings turned out to be false positives due to the narrow nature of most breaking changes.
IAsyncResult.CompletedSynchronously property must be correct for the resulting task to complete
When calling TaskFactory.FromAsync, the implementation of the
IAsyncResult.CompletedSynchronously property must be correct for the resulting task to complete.
That is, the property must return true if, and only if, the implementation completed synchronously.
Previously, the property was not checked.
Impacted Libraries: mscorlib, System.Threading.Tasks
Suggested Action: Ensure that the IAsyncResult.CompletedSynchronously implementation passed to TaskFactory.FromAsync returns true only when the operation completed synchronously.
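The following sketch (an assumption about how your code might use the API, not taken from the original article) shows the property that must be implemented correctly when a hand-rolled IAsyncResult is passed to TaskFactory.FromAsync:

using System;
using System.Threading;
using System.Threading.Tasks;

// A hand-rolled IAsyncResult for an operation that completed synchronously.
sealed class CompletedResult : IAsyncResult
{
    public object AsyncState => null;
    public WaitHandle AsyncWaitHandle { get; } = new ManualResetEvent(true);
    public bool IsCompleted => true;

    // Must return true if, and only if, the operation really completed synchronously;
    // an incorrect value can keep the task returned by FromAsync from completing.
    public bool CompletedSynchronously => true;
}

class FromAsyncCheck
{
    static void Main()
    {
        IAsyncResult asyncResult = new CompletedResult();
        Task<int> task = Task.Factory.FromAsync(asyncResult, _ => 42);
        Console.WriteLine(task.Result); // 42
    }
}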
DataObject.GetData now retrieves data as UTF-8
For apps that target the .NET Framework 4 or that run on the .NET Framework 4.5.1 or earlier versions,
DataObject.GetData retrieves HTML-formatted data as an ASCII string. As a result, non-ASCII
characters (characters whose ASCII codes are greater than 0x7F) are represented by two random
characters. For apps that target the .NET Framework 4.5 or later and run on the .NET
Framework 4.5.2, DataObject.GetData retrieves HTML-formatted data as UTF-8, which represents
characters greater than 0x7F correctly.
Impacted Libraries: Glo
Suggested Action: Ensure data retrieved is the format you want
XmlWriter throws on invalid surrogate pairs
For apps that target the .NET Framework 4.5.2 or previous versions, writing an invalid surrogate pair
using exception fallback handling does not always throw an exception. For apps that target the .NET
Framework 4.6, attempting to write an invalid surrogate pair throws an ArgumentException .
Impacted Libraries: System.Xml, System.Xml.ReaderWriter
Suggested Action: Ensure you are not writing an invalid surrogate pair that will cause an ArgumentException.
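A minimal sketch of the new behavior, assuming default XmlWriter settings; the input string is an arbitrary example of a lone surrogate:

using System;
using System.IO;
using System.Xml;

class SurrogateCheck
{
    static void Main()
    {
        try
        {
            using (XmlWriter writer = XmlWriter.Create(new StringWriter()))
            {
                writer.WriteStartElement("root");
                writer.WriteString("\uD800"); // high surrogate without a matching low surrogate
                writer.WriteEndElement();
            }
        }
        catch (Exception ex)
        {
            // On .NET Framework 4.6 and later this surfaces as an ArgumentException;
            // earlier versions with exception fallback handling did not always throw.
            Console.WriteLine(ex.GetType().Name + ": " + ex.Message);
        }
    }
}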
HtmlTextWriter does not render <br/> element correctly
Beginning in the .NET Framework 4.6, calling HtmlTextWriter.RenderBeginTag() and
HtmlTextWriter.RenderEndTag() with a <BR /> element will correctly insert only one <BR /> (instead
of two)
Impacted Libraries: System.Web
Suggested Action: Ensure you insert the number of <BR /> elements you expect to see so that no unexpected behavior appears in production jobs.
Calling CreateDefaultAuthorizationContext with a null argument has changed
The implementation of the AuthorizationContext returned by a call to the
CreateDefaultAuthorizationContext(IList<IAuthorizationPolicy>) with a null authorizationPolicies
argument has changed its implementation in the .NET Framework 4.6.
Impacted Libraries: System.IdentityModel
Suggested Action: Ensure you are handling the new expected behavior when there is null
authorization policy
RSACng now correctly loads RSA keys of non-standard key size
In .NET Framework versions prior to 4.6.2, customers with non-standard key sizes for RSA certificates
are unable to access those keys via the GetRSAPublicKey() and GetRSAPrivateKey() extension
methods. A CryptographicException with the message "The requested key size is not supported" is
thrown. With the .NET Framework 4.6.2 this issue has been fixed. Similarly, RSA.ImportParameters()
and RSACng.ImportParameters() now work with non-standard key sizes without throwing
CryptographicException 's.
Impacted Libraries: mscorlib, System.Core
Suggested Action: Ensure RSA keys are working as expected
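The sketch below, with a hypothetical certificate path and password, shows the kind of check you might run to confirm that a key loads; on .NET Framework 4.6.2 and later it should succeed even for non-standard key sizes:

using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

class RsaKeyCheck
{
    static void Main()
    {
        // Hypothetical path and password; substitute a certificate you actually use.
        var cert = new X509Certificate2(@"C:\certs\nonstandard-keysize.pfx", "placeholder-password");
        using (RSA rsa = cert.GetRSAPrivateKey())
        {
            // Versions before 4.6.2 threw CryptographicException
            // ("The requested key size is not supported") for non-standard key sizes.
            Console.WriteLine(rsa.KeySize);
        }
    }
}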
Path colon checks are stricter
In .NET Framework 4.6.2, a number of changes were made to support previously unsupported paths
(both in length and format). Checks for proper drive separator (colon) syntax were made more correct,
which had the side effect of blocking some URI paths in a few select Path APIs where they used to be
tolerated.
Impacted Libraries: mscorlib, System.Runtime.Extensions
Suggested Action: Ensure that any code passing URI-style or otherwise unusual paths to the Path APIs still behaves as expected.
Calls to ClaimsIdentity constructors
Starting with the .NET Framework 4.6.2, there is a change in how the System.Security.Claims.ClaimsIdentity constructors that take a System.Security.Principal.IIdentity parameter set the ClaimsIdentity.Actor property. If the IIdentity argument is a ClaimsIdentity object, and the Actor property of that ClaimsIdentity object is not null, the Actor property is attached by using the ClaimsIdentity.Clone method. In the .NET Framework 4.6.1 and earlier versions, the Actor property is attached as an existing reference. Because of this change, starting with the .NET Framework 4.6.2, the Actor property of the new ClaimsIdentity object is not equal to the Actor property of the constructor's IIdentity argument. In the .NET Framework 4.6.1 and earlier versions, it is equal.
Impacted Libraries: mscorlib
Suggested Action: Ensure ClaimsIdentity is working as expected on new runtime
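A minimal sketch, not taken from the original article, that shows whether code depends on the old reference-equality behavior of the Actor property:

using System;
using System.Security.Claims;

class ActorReferenceCheck
{
    static void Main()
    {
        var inner = new ClaimsIdentity("CustomAuth");
        inner.Actor = new ClaimsIdentity("DelegatedAuth");

        var outer = new ClaimsIdentity(inner);

        // .NET Framework 4.6.1 and earlier: True (same reference).
        // .NET Framework 4.6.2 and later: False (the Actor is cloned).
        Console.WriteLine(ReferenceEquals(outer.Actor, inner.Actor));
    }
}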
Serialization of control characters with DataContractJsonSerializer is now compatible with ECMAScript V6
and V8
In the .NET Framework 4.6.2 and earlier versions, the DataContractJsonSerializer did not serialize some
special control characters, such as \b, \f, and \t, in a way that was compatible with the ECMAScript V6
and V8 standards. Starting with the .NET Framework 4.7, serialization of these control characters is
compatible with ECMAScript V6 and V8.
Impacted Libraries: System.Runtime.Serialization.Json
Suggested Action: Ensure same behavior with DataContractJsonSerializer
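The following sketch, assuming nothing beyond the standard DataContractJsonSerializer API, serializes a string containing control characters so the output can be compared before and after the upgrade:

using System;
using System.IO;
using System.Runtime.Serialization.Json;
using System.Text;

class ControlCharSerializationCheck
{
    static void Main()
    {
        var serializer = new DataContractJsonSerializer(typeof(string));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, "a\bb\fc\td");

            // .NET Framework 4.7 and later emit \b, \f, and \t escapes that are compatible
            // with ECMAScript V6 and V8; earlier versions serialized these characters in a
            // way that was not compatible with those standards.
            Console.WriteLine(Encoding.UTF8.GetString(stream.ToArray()));
        }
    }
}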
Migrate Azure Data Lake Analytics to Azure
Synapse Analytics
12/10/2021 • 3 minutes to read

Microsoft launched Azure Synapse Analytics, which brings data lakes and data warehousing together for a unified big data analytics experience. It helps customers gather and analyze all their varied data, solve data inefficiencies, and collaborate. In addition, Synapse’s integration with Azure Machine Learning and Power BI improves organizations' ability to get insights from their data and to apply machine learning across their intelligent apps.
This document shows you how to migrate from Azure Data Lake Analytics to Azure Synapse Analytics.

Recommended approach
Step 1: Assess readiness
Step 2: Prepare to migrate
Step 3: Migrate data and application workloads
Step 4: Cutover from Azure Data Lake Analytics to Azure Synapse Analytics
Step 1: Assess readiness
1. Look at Apache Spark on Azure Synapse Analytics, and understand the key differences between Azure Data Lake Analytics and Spark on Azure Synapse Analytics.

Item                         | Azure Data Lake Analytics | Spark on Synapse
Pricing                      | Per Analytic Unit-hour    | Per vCore-hour
Engine                       | Azure Data Lake Analytics | Apache Spark
Default Programming Language | U-SQL                     | T-SQL, Python, Scala, Spark SQL, and .NET
Data Sources                 | Azure Data Lake Storage   | Azure Blob Storage, Azure Data Lake Storage
2. Review the Questionnaire for Migration Assessment and list the possible risks to consider.
Step 2: Prepare to migrate
1. Identify jobs and data that you'll migrate.
Take this opportunity to clean up those jobs that you no longer use. Unless you plan to migrate all
your jobs at one time, take this time to identify logical groups of jobs that you can migrate in phases.
Evaluate the size of the data and understand the Apache Spark data formats. Review your U-SQL scripts, evaluate the script-rewriting effort, and understand the Apache Spark code concepts.
2. Determine the impact that a migration will have on your business. For example, whether you can afford
any downtime while migration takes place.
3. Create a migration plan.
Step 3: Migrate data and application workload
1. Migrate your data from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2.
Azure Data Lake Storage Gen1 will be retired in February 2024; see the official announcement. We suggest migrating the data to Gen2 first. See Understand Apache Spark data formats for Azure Data Lake Analytics U-SQL developers and move both the files and the data stored in U-SQL tables to make them accessible to Azure Synapse Analytics. More details of the migration guide can be found here.
2. Transform your U-SQL scripts to Spark. Refer to Understand Apache Spark code concepts for Azure Data Lake Analytics U-SQL developers to transform your U-SQL scripts to Spark (a minimal illustrative sketch follows this list).
3. Transform or re-create your job orchestration pipelines to target the new Spark programs.
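As a rough illustration of step 2, the sketch below uses .NET for Apache Spark (one of the languages listed for Spark on Synapse above) to express the read/transform/write pattern that a simple U-SQL EXTRACT/SELECT/OUTPUT script follows; the storage paths, filter expression, and column names are hypothetical placeholders.

using Microsoft.Spark.Sql;

class SearchLogMigration
{
    static void Main()
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("usql-migration-sample")
            .GetOrCreate();

        // Roughly: EXTRACT ... FROM "/input/SearchLog.tsv" USING Extractors.Tsv();
        DataFrame searchLog = spark
            .Read()
            .Option("sep", "\t")
            .Option("header", "false")
            .Csv("abfss://data@contosoadls.dfs.core.windows.net/input/SearchLog.tsv");

        // Roughly: SELECT ... WHERE ...; OUTPUT ... USING Outputters.Csv();
        searchLog
            .Filter("_c0 IS NOT NULL")
            .Write()
            .Mode("overwrite")
            .Csv("abfss://data@contosoadls.dfs.core.windows.net/output/SearchLogClean");

        spark.Stop();
    }
}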
Step 4: Cut over from Azure Data Lake Analytics to Azure Synapse Analytics
After you're confident that your applications and workloads are stable, you can begin using Azure Synapse
Analytics to satisfy your business scenarios. Turn off any remaining pipelines that are running on Azure Data
Lake Analytics and decommission your Azure Data Lake Analytics accounts.

Questionnaire for Migration Assessment


Category: Evaluate the size of the migration
Questions: How many Azure Data Lake Analytics accounts do you have? How many pipelines are in use? How many U-SQL scripts are in use?
Reference: The more data and scripts to be migrated, and the more UDO/UDF used in the scripts, the more difficult the migration is. The time and resources required for migration need to be well planned according to the scale of the project.

Category: Data source
Questions: What's the size of the data source? What kinds of data formats are processed?
Reference: Understand Apache Spark data formats for Azure Data Lake Analytics U-SQL developers

Category: Data output
Questions: Will you keep the output data for later use? If the output data is saved in U-SQL tables, how will you handle it?
Reference: If the output data will be used often and is saved in U-SQL tables, you need to change the scripts and write the output data in a Spark-supported data format.

Category: Data migration
Questions: Have you made the storage migration plan?
Reference: Migrate Azure Data Lake Storage from Gen1 to Gen2

Category: U-SQL scripts transform
Questions: Do you use UDO/UDF (.NET, Python, etc.)? If so, which language do you use in your UDO/UDF, and are there any problems during the transform? Is the federated query being used in U-SQL?
Reference: Understand Apache Spark code concepts for Azure Data Lake Analytics U-SQL developers

Next steps
Azure Synapse Analytics
Azure Policy built-in definitions for Azure Data Lake
Analytics
12/10/2021 • 2 minutes to read

This page is an index of Azure Policy built-in policy definitions for Azure Data Lake Analytics. For additional
Azure Policy built-ins for other services, see Azure Policy built-in definitions.
The name of each built-in policy definition links to the policy definition in the Azure portal. Use the link in the
Version column to view the source on the Azure Policy GitHub repo.

Azure Data Lake Analytics


Name (Azure portal): Deploy Diagnostic Settings for Data Lake Analytics to Event Hub
Description: Deploys the diagnostic settings for Data Lake Analytics to stream to a regional Event Hub when any Data Lake Analytics account that is missing these diagnostic settings is created or updated.
Effect(s): DeployIfNotExists, Disabled
Version (GitHub): 2.0.0

Name (Azure portal): Deploy Diagnostic Settings for Data Lake Analytics to Log Analytics workspace
Description: Deploys the diagnostic settings for Data Lake Analytics to stream to a regional Log Analytics workspace when any Data Lake Analytics account that is missing these diagnostic settings is created or updated.
Effect(s): DeployIfNotExists, Disabled
Version (GitHub): 1.0.0

Name (Azure portal): Resource logs in Data Lake Analytics should be enabled
Description: Audit enabling of resource logs. This enables you to recreate activity trails for investigation purposes when a security incident occurs or when your network is compromised.
Effect(s): AuditIfNotExists, Disabled
Version (GitHub): 5.0.0

Next steps
See the built-ins on the Azure Policy GitHub repo.
Review the Azure Policy definition structure.
Review Understanding policy effects.
