Azure Data Lake Analytics
Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data. Instead of deploying,
configuring, and tuning hardware, you write queries to transform your data and extract valuable insights. The
analytics service can handle jobs of any scale instantly by setting the dial for how much power you need. You
only pay for your job when it is running, making it cost-effective.
Dynamic scaling
Data Lake Analytics dynamically provisions resources and lets you do analytics on terabytes to petabytes of
data. You pay only for the processing power used. As you increase or decrease the size of data stored or the
amount of compute resources used, you don’t have to rewrite code.
NOTE
Azure Data Lake Analytics does not yet support Azure Data Lake Storage Gen2.
Next steps
For recent updates to Azure Data Lake Analytics, see What's new in Azure Data Lake Analytics?
Get Started with Data Lake Analytics using Azure portal | Azure PowerShell | CLI
Manage Azure Data Lake Analytics using Azure portal | Azure PowerShell | CLI | Azure .NET SDK | Node.js
How to control costs and save money with Data Lake Analytics
What's new in Data Lake Analytics?
Certain components of Azure Data Lake Analytics are updated from time to time. To help you stay current, this article provides information about:
Notifications of key component beta previews
Important component version information, such as the list of available versions and the current default version
U-SQL runtime
The Azure Data Lake U-SQL runtime, including the compiler, optimizer, and job manager, is what processes your
U-SQL code.
When you submit an Azure Data Lake Analytics job from any tool, the job uses the default runtime that is currently available in the production environment.
The runtime version is updated from time to time, and the previous runtime remains available for a while. When a new beta version is ready for preview, it is also made available there.
Caution
Choosing a runtime that is different from the default can break your U-SQL jobs. Use these non-default versions for testing only, not for production.
A non-default runtime version has a fixed lifecycle and expires automatically.
The following version is the current default runtime version.
release_20200707_scope_2b8d563_usql
To learn how to troubleshoot U-SQL runtime failures, see Troubleshoot U-SQL runtime failures.
.NET Framework
Azure Data Lake Analytics now uses .NET Framework v4.7.2.
If your Azure Data Lake Analytics U-SQL script code uses custom assemblies, and those custom assemblies use
.NET libraries, validate your code to check for any breaking changes.
To learn how to troubleshoot a .NET upgrade, see Troubleshoot a .NET upgrade.
Release note
For recent update details, refer to the Azure Data Lake Analytics release note.
Next steps
Get Started with Data Lake Analytics using Azure portal | Azure PowerShell | CLI
Get started with Azure Data Lake Analytics using
the Azure portal
This article describes how to use the Azure portal to create Azure Data Lake Analytics accounts, define jobs in U-
SQL, and submit jobs to the Data Lake Analytics service.
Prerequisites
Before you begin this tutorial, you must have an Azure subscription. See Get Azure free trial.
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();
See also
To get started developing U-SQL applications, see Develop U-SQL scripts using Data Lake Tools for Visual
Studio.
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
Develop U-SQL scripts by using Data Lake Tools for
Visual Studio
Azure Data Lake and Stream Analytics Tools include functionality related to two Azure services, Azure Data Lake
Analytics and Azure Stream Analytics. For more information about the Azure Stream Analytics scenarios, see
Azure Stream Analytics tools for Visual Studio.
This article describes how to use Visual Studio to create Azure Data Lake Analytics accounts. You can define jobs
in U-SQL, and submit jobs to the Data Lake Analytics service. For more information about Data Lake Analytics,
see Azure Data Lake Analytics overview.
IMPORTANT
We recommend you upgrade to Azure Data Lake Tools for Visual Studio version 2.3.3000.4 or later. The previous versions
are no longer available for download and are now deprecated.
1. Check whether you are using a version of Azure Data Lake Tools for Visual Studio earlier than 2.3.3000.4.
2. If your version is earlier than 2.3.3000.4, update your Azure Data Lake Tools for Visual Studio by visiting
the download center:
For Visual Studio 2017 and 2019
For Visual Studio 2013 and 2015
Prerequisites
Visual Studio : All editions except Express are supported.
Visual Studio 2019
Visual Studio 2017
Visual Studio 2015
Visual Studio 2013
Microsoft Azure SDK for .NET version 2.7.1 or later. Install it by using the Web platform installer.
A Data Lake Analytics account. To create an account, see Get Started with Azure Data Lake Analytics
using Azure portal.
To see the latest job status and refresh the screen, select Refresh .
Check job status
1. In Server Explorer, select Azure > Data Lake Analytics.
2. Expand the Data Lake Analytics account name.
3. Double-click Jobs .
4. Select the job that you previously submitted.
Next steps
Run U-SQL scripts on your own workstation for testing and debugging
Debug C# code in U-SQL jobs using Azure Data Lake Tools for Visual Studio Code
Use the Azure Data Lake Tools for Visual Studio Code
Use Azure Data Lake Tools for Visual Studio Code
In this article, learn how you can use Azure Data Lake Tools for Visual Studio Code (VS Code) to create, test, and
run U-SQL scripts. The information is also covered in the following video:
Prerequisites
Azure Data Lake Tools for VS Code supports Windows, Linux, and macOS. U-SQL local run and local debug
works only in Windows.
Visual Studio Code
For macOS and Linux:
.NET 5.0 SDK
Mono 5.2.x
@departments =
SELECT * FROM
(VALUES
(31, "Sales"),
(33, "Engineering"),
(34, "Clerical"),
(35, "Marketing")
) AS
D( DepID, DepName );
NOTE
Azure Data Lake Tools autodetects whether the DLL has any assembly dependencies. The dependencies are displayed
in the JSON file after they're detected.
You can upload your DLL resources (for example, .txt, .png, and .csv) as part of the assembly registration.
Another way to trigger the ADL: Register Assembly (Advanced) command is to right-click the .dll file in File
Explorer.
The following U-SQL code demonstrates how to call an assembly. In the sample, the assembly name is test.
REFERENCE ASSEMBLY [test];
@a =
EXTRACT
Iid int,
Starts DateTime,
Region string,
Query string,
DwellTime int,
Results string,
ClickedUrls string
FROM @"Sample/SearchLog.txt"
USING Extractors.Tsv();
@d =
SELECT DISTINCT Region
FROM @a;
@d1 =
PROCESS @d
PRODUCE
Region string,
Mkt string
USING new USQLApplication_codebehind.MyProcessor();
OUTPUT @d1
TO @"Sample/SearchLogtest.txt"
USING Outputters.Tsv();
Use U-SQL local run and local debug for Windows users
U-SQL local run tests your local data and validates your script locally before your code is published to Data Lake
Analytics. You can use the local debug feature to complete the following tasks before your code is submitted to
Data Lake Analytics:
Debug your C# code-behind.
Step through the code.
Validate your script locally.
The local run and local debug feature only works in Windows environments, and is not supported on macOS
and Linux-based operating systems.
For instructions on local run and local debug, see U-SQL local run and local debug with Visual Studio Code.
Connect to Azure
Before you can compile and run U-SQL scripts in Data Lake Analytics, you must connect to your Azure account.
4. Follow the instructions to sign in from the webpage. When you're connected, your Azure account name
appears on the status bar in the lower-left corner of the VS Code window.
NOTE
If you don't sign out, Data Lake Tools automatically signs you in the next time.
If your account has two-factor authentication enabled, we recommend that you use phone authentication rather than a PIN.
You can't sign out from the explorer. To sign out, see To connect to Azure by using a command.
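Get started with Azure Data Lake Analytics using Azure PowerShell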
Learn how to use Azure PowerShell to create Azure Data Lake Analytics accounts and then submit and run U-
SQL jobs. For more information about Data Lake Analytics, see Azure Data Lake Analytics overview.
Prerequisites
NOTE
This article uses the Azure Az PowerShell module, which is the recommended PowerShell module for interacting with
Azure. To get started with the Az PowerShell module, see Install Azure PowerShell. To learn how to migrate to the Az
PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Before you begin this tutorial, you must have the following information:
An Azure Data Lake Analytics account . See Get started with Data Lake Analytics.
A workstation with Azure PowerShell . See How to install and configure Azure PowerShell.
Log in to Azure
This tutorial assumes you are already familiar with using Azure PowerShell. In particular, you need to know how
to log in to Azure. See the Get started with Azure PowerShell if you need help.
To log in with a subscription name:
Instead of the subscription name, you can also use a subscription id to log in:
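For example, with the Az module (the subscription name and ID below are placeholders):

# Log in and select the subscription by name
Connect-AzAccount -Subscription "ContosoSubscription"

# Log in and select the subscription by ID
Connect-AzAccount -SubscriptionId "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"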
If successful, the output of this command looks like the following text:
Environment : AzureCloud
Account : [email protected]
TenantId : "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
SubscriptionId : "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
SubscriptionName : ContosoSubscription
CurrentStorageAccount :
$script = @"
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();
"@
Submit the script text with the Submit-AdlJob cmdlet and the -Script parameter.
As an alternative, you can submit a script file using the -ScriptPath parameter:
$filename = "d:\test.usql"
$script | out-File $filename
$job = Submit-AdlJob -Account $adla -Name "My Job" -ScriptPath $filename
Instead of calling Get-AdlJob over and over until a job finishes, use the Wait-AdlJob cmdlet.
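For example, a minimal sketch using the $adla and $job variables from earlier in this article:

Wait-AdlJob -Account $adla -JobId $job.JobId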
See also
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
Get started with Azure Data Lake Analytics using
Azure CLI
This article describes how to use the Azure CLI to create Azure Data Lake Analytics accounts, submit U-SQL jobs, and manage catalogs. The job reads a tab-separated values (TSV) file and converts it into a
comma-separated values (CSV) file.
Prerequisites
Before you begin, you need the following items:
An Azure subscription . See Get Azure free trial.
This article requires that you are running the Azure CLI version 2.0 or later. If you need to install or upgrade,
see Install Azure CLI.
Sign in to Azure
To sign in to your Azure subscription:
az login
You are prompted to browse to a URL and enter an authentication code, and then to follow the instructions to enter your credentials.
Once you have logged in, the login command lists your subscriptions.
To use a specific subscription:
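A typical way to select it, with the subscription ID as a placeholder, is:

az account set --subscription <subscription id>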
az group list
az dls account create --account "<Data Lake Store Account Name>" --resource-group "<Resource Group Name>"
az dla account create --account "<Data Lake Analytics Account Name>" --resource-group "<Resource Group Name>" --location "<Azure location>" --default-data-lake-store "<Default Data Lake Store Account Name>"
After creating an account, you can use the following commands to list the accounts and show account details:
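A sketch using the standard az dla account subcommands:

az dla account list
az dla account show --account "<Data Lake Analytics Account Name>"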
az dls fs upload --account "<Data Lake Store Account Name>" --source-path "<Source File Path>" --destination-path "<Destination File Path>"
az dls fs list --account "<Data Lake Store Account Name>" --path "<Path>"
Data Lake Analytics can also access Azure Blob storage. For uploading data to Azure Blob storage, see Using the
Azure CLI with Azure Storage.
This U-SQL script reads the source data file using Extractors.Tsv() , and then creates a csv file using
Outputters.Csv() .
Don't modify the two paths unless you copy the source file into a different location. Data Lake Analytics creates
the output folder if it doesn't exist.
It is simpler to use relative paths for files stored in default Data Lake Store accounts. You can also use absolute
paths. For example:
adl://<DataLakeStoreAccountName>.azuredatalakestore.net:443/Samples/Data/SearchLog.tsv
You must use absolute paths to access files in linked Storage accounts. The syntax for files stored in linked Azure
Storage account is:
wasb://<BlobContainerName>@<StorageAccountName>.blob.core.windows.net/Samples/Data/SearchLog.tsv
NOTE
Azure Blob storage containers with public blobs or public container access are not supported.
To submit jobs
Use the following syntax to submit a job.
az dla job submit --account "<Data Lake Analytics Account Name>" --job-name "<Job Name>" --script "<Script Path and Name>"
For example:
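A hypothetical invocation (the account name, job name, and script file below are made-up placeholders; the @ prefix loads the script text from a file):

az dla job submit --account "myadlaaccount" --job-name "myfirstjob" --script @"SampleUSQLScript.usql"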
To cancel jobs
az dla job cancel --account "<Data Lake Analytics Account Name>" --job-identity "<Job Id>"
az dls fs list --account "<Data Lake Store Account Name>" --source-path "/Output" --destination-path "<Destination>"
az dls fs preview --account "<Data Lake Store Account Name>" --path "/Output/SearchLog-from-Data-Lake.csv"
az dls fs preview --account "<Data Lake Store Account Name>" --path "/Output/SearchLog-from-Data-Lake.csv" --length 128 --offset 0
az dls fs download --account "<Data Lake Store Account Name>" --source-path "/Output/SearchLog-from-Data-Lake.csv" --destination-path "<Destination Path and File Name>"
For example:
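A hypothetical example (the account name and destination path are placeholders):

az dls fs download --account "mydatalakestore" --source-path "/Output/SearchLog-from-Data-Lake.csv" --destination-path "C:\SearchLog-from-Data-Lake.csv"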
Next steps
To see the Data Lake Analytics Azure CLI reference document, see Data Lake Analytics.
To see the Data Lake Store Azure CLI reference document, see Data Lake Store.
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
Azure Policy Regulatory Compliance controls for
Azure Data Lake Analytics
Regulatory Compliance in Azure Policy provides Microsoft created and managed initiative definitions, known as
built-ins, for the compliance domains and security controls related to different compliance standards. This
page lists the compliance domains and security controls for Azure Data Lake Analytics. You can assign the
built-ins for a security control individually to help make your Azure resources compliant with the specific
standard.
The title of each built-in policy definition links to the policy definition in the Azure portal. Use the link in the
Policy Version column to view the source on the Azure Policy GitHub repo.
IMPORTANT
Each control below is associated with one or more Azure Policy definitions. These policies may help you assess compliance
with the control; however, there often is not a one-to-one or complete match between a control and one or more policies.
As such, Compliant in Azure Policy refers only to the policies themselves; this doesn't ensure you're fully compliant with
all requirements of a control. In addition, the compliance standard includes controls that aren't addressed by any Azure
Policy definitions at this time. Therefore, compliance in Azure Policy is only a partial view of your overall compliance status.
The associations between controls and Azure Policy Regulatory Compliance definitions for these compliance standards
may change over time.
| Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub) |
| Logging and Threat Detection | LT-4 | Enable logging for Azure resources | Resource logs in Data Lake Analytics should be enabled | 5.0.0 |
| Logging and Monitoring | 2.3 | Enable audit logging for Azure resources | Resource logs in Data Lake Analytics should be enabled | 5.0.0 |
FedRAMP High
To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - FedRAMP High. For more information about this compliance standard,
see FedRAMP High.
| Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub) |
| Audit and Accountability | AU-6 (4) | Central Review and Analysis | Resource logs in Data Lake Analytics should be enabled | 5.0.0 |
| Audit and Accountability | AU-12 (1) | System-wide / Time-correlated Audit Trail | Resource logs in Data Lake Analytics should be enabled | 5.0.0 |
FedRAMP Moderate
To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see
Azure Policy Regulatory Compliance - FedRAMP Moderate. For more information about this compliance
standard, see FedRAMP Moderate.
| Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub) |
| Access Control and Passwords | AC-17 | 16.6.9 Events to be logged | Resource logs in Data Lake Analytics should be enabled | 5.0.0 |

| Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub) |
| Audit and Accountability | AU-6 (4) | Central Review and Analysis | Resource logs in Data Lake Analytics should be enabled | 5.0.0 |
| Audit and Accountability | AU-12 (1) | System-wide / Time-correlated Audit Trail | Resource logs in Data Lake Analytics should be enabled | 5.0.0 |

| Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub) |
| Audit and Accountability | AU-6 (4) | Central Review and Analysis | Resource logs in Data Lake Analytics should be enabled | 5.0.0 |
| Audit and Accountability | AU-6 (5) | Integrated Analysis of Audit Records | Resource logs in Data Lake Analytics should be enabled | 5.0.0 |
| Audit and Accountability | AU-12 (1) | System-wide and Time-correlated Audit Trail | Resource logs in Data Lake Analytics should be enabled | 5.0.0 |
Next steps
Learn more about Azure Policy Regulatory Compliance.
See the built-ins on the Azure Policy GitHub repo.
Manage Azure Data Lake Analytics using the Azure
portal
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs by using
the Azure portal.
NOTE
If a user or a security group needs to submit jobs, they also need permission on the store account. For more information,
see Secure data stored in Data Lake Store.
Manage jobs
Submit a job
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click New Job . For each job, configure:
a. Job Name : The name of the job.
b. Priority : Lower numbers have higher priority. If two jobs are queued, the one with lower priority
value runs first.
c. Parallelism : The maximum number of compute processes to reserve for this job.
3. Click Submit Job .
Monitor jobs
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click View All Jobs . A list of all the active and recently finished jobs in the account is shown.
3. Optionally, click Filter to help you find the jobs by Time Range , Job Name , and Author values.
Monitoring pipeline jobs
Jobs that are part of a pipeline work together, usually sequentially, to accomplish a specific scenario. For
example, you can have a pipeline that cleans, extracts, transforms, and aggregates usage data for customer insights.
Pipeline jobs are identified using the "Pipeline" property when the job was submitted. Jobs scheduled using ADF
V2 will automatically have this property populated.
To view a list of U-SQL jobs that are part of pipelines:
1. In the Azure portal, go to your Data Lake Analytics accounts.
2. Click Job Insights . The All Jobs tab is selected by default, showing a list of running, queued, and ended jobs.
3. Click the Pipeline Jobs tab. A list of pipeline jobs will be shown along with aggregated statistics for each
pipeline.
Monitoring recurring jobs
A recurring job is one that has the same business logic but uses different input data every time it runs. Ideally,
recurring jobs should always succeed, and have relatively stable execution time; monitoring these behaviors will
help ensure the job is healthy. Recurring jobs are identified using the "Recurrence" property. Jobs scheduled
using ADF V2 will automatically have this property populated.
To view a list of U-SQL jobs that are recurring:
1. In the Azure portal, go to your Data Lake Analytics accounts.
2. Click Job Insights . The All Jobs tab is selected by default, showing a list of running, queued, and ended jobs.
3. Click the Recurring Jobs tab. A list of recurring jobs will be shown along with aggregated statistics for each
recurring job.
Next steps
Overview of Azure Data Lake Analytics
Manage Azure Data Lake Analytics by using Azure PowerShell
Manage Azure Data Lake Analytics using policies
Manage Azure Data Lake Analytics using the Azure
CLI
Learn how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using the Azure CLI.
Prerequisites
Before you begin this tutorial, you must have the following resources:
An Azure subscription. See Get Azure free trial.
Azure CLI. See Install and configure Azure CLI.
Download and install the pre-release Azure CLI tools in order to complete this demo.
Authenticate by using the az login command and select the subscription that you want to use. For more
information on authenticating using a work or school account, see Connect to an Azure subscription from
the Azure CLI.
az login
az account set --subscription <subscription id>
You can now access the Data Lake Analytics and Data Lake Store commands. Run the following command
to list the Data Lake Store and Data Lake Analytics commands:
az dls -h
az dla -h
Manage accounts
Before running any Data Lake Analytics jobs, you must have a Data Lake Analytics account. Unlike Azure
HDInsight, you don't pay for an Analytics account when it is not running a job. You only pay for the time when it
is running a job. For more information, see Azure Data Lake Analytics Overview.
Create accounts
Run the following command to create a Data Lake Analytics account:
az dla account create --account "<Data Lake Analytics account name>" --location "<Location Name>" --resource-group "<Resource Group Name>" --default-data-lake-store "<Data Lake Store account name>"
Update accounts
The following command updates the properties of an existing Data Lake Analytics Account
az dla account update --account "<Data Lake Analytics Account Name>" --firewall-state "Enabled" --query-store-retention 7
List accounts
List Data Lake Analytics accounts within a specific resource group
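A sketch of the command:

az dla account list --resource-group "<Resource group name>"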
Delete an account
az dla account delete --account "<Data Lake Analytics account name>" --resource-group "<Resource group name>"
az dla account blob-storage add --access-key "<Azure Storage Account Key>" --account "<Data Lake Analytics account name>" --storage-account-name "<Storage account name>"
NOTE
Only Blob storage short names are supported. Don't use FQDN, for example "myblob.blob.core.windows.net".
az dla account data-lake-store add --account "<Data Lake Analytics account name>" --data-lake-store-account-name "<Data Lake Store account name>"
az dla account data-lake-store list --account "<Data Lake Analytics account name>"
az dla account blob-storage list --account "<Data Lake Analytics account name>"
az dla account data-lake-store delete --account "<Data Lake Analytics account name>" --data-lake-store-account-name "<Azure Data Lake Store account name>"
az dla account blob-storage delete --account "<Data Lake Analytics account name>" --storage-account-name "<Data Lake Store account name>"
Manage jobs
You must have a Data Lake Analytics account before you can create a job. For more information, see Manage
Data Lake Analytics accounts.
List jobs
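A typical listing command takes only the account name; the az dla job show command that follows returns the details of a single job:

az dla job list --account "<Data Lake Analytics account name>"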
az dla job show --account "<Data Lake Analytics account name>" --job-identity "<Job Id>"
Submit jobs
NOTE
The default priority of a job is 1000, and the default degree of parallelism for a job is 1.
az dla job submit --account "<Data Lake Analytics account name>" --job-name "<Name of your job>" --script "<Script to submit>"
Cancel jobs
Use the list command to find the job ID, and then use cancel to cancel the job.
az dla job cancel --account "<Data Lake Analytics account name>" --job-identity "<Job Id>"
az dla job pipeline list --account "<Data Lake Analytics Account Name>"
az dla job pipeline show --account "<Data Lake Analytics Account Name>" --pipeline-identity "<Pipeline ID>"
Use the az dla job recurrence commands to see the recurrence information for previously submitted jobs.
az dla job recurrence list --account "<Data Lake Analytics Account Name>"
az dla job recurrence show --account "<Data Lake Analytics Account Name>" --recurrence-identity "<Recurrence ID>"
Next steps
Overview of Microsoft Azure Data Lake Analytics
Get started with Data Lake Analytics using Azure portal
Manage Azure Data Lake Analytics using Azure portal
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Manage Azure Data Lake Analytics using Azure
PowerShell
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs by using
Azure PowerShell.
Prerequisites
NOTE
This article uses the Azure Az PowerShell module, which is the recommended PowerShell module for interacting with
Azure. To get started with the Az PowerShell module, see Install Azure PowerShell. To learn how to migrate to the Az
PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
To use PowerShell with Data Lake Analytics, collect the following pieces of information:
Subscription ID : The ID of the Azure subscription that contains your Data Lake Analytics account.
Resource group : The name of the Azure resource group that contains your Data Lake Analytics account.
Data Lake Analytics account name : The name of your Data Lake Analytics account.
Default Data Lake Store account name : Each Data Lake Analytics account has a default Data Lake Store
account.
Location : The location of your Data Lake Analytics account, such as "East US 2" or other supported locations.
The PowerShell snippets in this tutorial use these variables to store this information
$subId = "<SubscriptionId>"
$rg = "<ResourceGroupName>"
$adla = "<DataLakeAnalyticsAccountName>"
$adls = "<DataLakeStoreAccountName>"
$location = "<Location>"
Log in to Azure
Log in using interactive user authentication
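A minimal interactive login:

Connect-AzAccount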
Log in using a subscription ID or by subscription name
# Using subscription id
Connect-AzAccount -SubscriptionId $subId
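# Using subscription name (the -Subscription parameter accepts a name or an ID)
Connect-AzAccount -Subscription "ContosoSubscription"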
$tenantid = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
$spi_appname = "appname"
$spi_appid = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
$spi_secret = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
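One way to log in with these service principal values is:

$pscredential = New-Object System.Management.Automation.PSCredential($spi_appid, (ConvertTo-SecureString $spi_secret -AsPlainText -Force))
Connect-AzAccount -ServicePrincipal -TenantId $tenantid -Credential $pscredential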
Manage accounts
List accounts
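A typical listing, using the variables defined earlier:

# List all Data Lake Analytics accounts in the subscription
Get-AdlAnalyticsAccount

# Show the details of a specific account
Get-AdlAnalyticsAccount -Name $adla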
Create an account
Every Data Lake Analytics account requires a default Data Lake Store account that it uses for storing logs. You
can reuse an existing account or create an account.
# Create a data lake store if needed, or you can re-use an existing one
New-AdlStore -ResourceGroupName $rg -Name $adls -Location $location
New-AdlAnalyticsAccount -ResourceGroupName $rg -Name $adla -Location $location -DefaultDataLake $adls
You can find the default Data Lake Store account by filtering the list of datasources by the IsDefault property:
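A sketch of that filter (assuming the data source objects returned by Get-AdlAnalyticsDataSource expose an IsDefault property):

Get-AdlAnalyticsDataSource -Account $adla | Where-Object { $_.IsDefault }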
$script = @"
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();
"@
$scriptpath = "d:\test.usql"
$script | Out-File $scriptpath
List jobs
The output includes the currently running jobs and those jobs that have recently completed.
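A minimal listing:

Get-AdlJob -Account $adla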
Cancel a job
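A sketch, assuming $job holds the job object returned when you submitted it:

Stop-AdlJob -Account $adla -JobId $job.JobId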
Use the Get-AdlJobRecurrence cmdlet to see the recurrence information for previously submitted jobs.
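A minimal example:

Get-AdlJobRecurrence -Account $adla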
Manage files
Check for the existence of a file
Download a file.
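Minimal sketches using the Data Lake Store cmdlets ($adls is the store account variable defined earlier; the paths are placeholders):

# Check whether a file exists
Test-AdlStoreItem -Account $adls -Path "/data.csv"

# Download a file to the local machine
Export-AdlStoreItem -Account $adls -Path "/data.csv" -Destination "D:\data.csv"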
NOTE
If the upload or download process is interrupted, you can attempt to resume the process by running the cmdlet again
with the -Resume flag.
$dbName = "master"
$credentialName = "ContosoDbCreds"
$dbUri = "https://contoso.database.windows.net:8080"
Resolve-AzError -Last
function Test-Administrator
{
$user = [Security.Principal.WindowsIdentity]::GetCurrent();
$p = New-Object Security.Principal.WindowsPrincipal $user
$p.IsInRole([Security.Principal.WindowsBuiltinRole]::Administrator)
}
Find a TenantID
From a subscription name:
Get-TenantIdFromSubscriptionName "ADLTrainingMS"
$subid = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
Get-TenantIdFromSubscriptionId $subid
$domain = "contoso.com"
Get-TenantIdFromDomain $domain
$subs = Get-AzSubscription
foreach ($sub in $subs)
{
Write-Host $sub.Name "(" $sub.Id ")"
Write-Host "`tTenant Id" $sub.TenantId
}
Next steps
Overview of Microsoft Azure Data Lake Analytics
Get started with Data Lake Analytics using the Azure portal | Azure PowerShell | Azure CLI
Manage Azure Data Lake Analytics using Azure portal | Azure PowerShell | CLI
Manage Azure Data Lake Analytics using a .NET app
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using an
app written using the Azure .NET SDK.
Prerequisites
Visual Studio 2015, Visual Studio 2013 Update 4, or Visual Studio 2012 with Visual C++ installed.
Microsoft Azure SDK for .NET version 2.5 or above. Install it using the Web platform installer.
Required NuGet Packages
Install NuGet packages
| Package | Version |
| Microsoft.Rest.ClientRuntime.Azure.Authentication | 2.3.1 |
| Microsoft.Azure.Management.DataLake.Analytics | 3.0.0 |
| Microsoft.Azure.Management.DataLake.Store | 2.2.0 |
| Microsoft.Azure.Management.ResourceManager | 1.6.0-preview |
| Microsoft.Azure.Graph.RBAC | 3.4.0-preview |
You can install these packages via the NuGet command line with the following commands:
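One way to install them is from the NuGet Package Manager Console, using the versions listed above:

Install-Package Microsoft.Rest.ClientRuntime.Azure.Authentication -Version 2.3.1
Install-Package Microsoft.Azure.Management.DataLake.Analytics -Version 3.0.0
Install-Package Microsoft.Azure.Management.DataLake.Store -Version 2.2.0
Install-Package Microsoft.Azure.Management.ResourceManager -Version 1.6.0-preview
Install-Package Microsoft.Azure.Graph.RBAC -Version 3.4.0-preview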
Common variables
string subid = "<Subscription ID>"; // Subscription ID (a GUID)
string tenantid = "<Tenant ID>"; // AAD tenant ID or domain. For example, "contoso.onmicrosoft.com"
string rg = "<value>"; // Resource group name
string clientid = "1950a258-227b-4e31-a9cf-717495945fc2"; // Sample client ID (this will work, but you should pick your own)
Authentication
You have multiple options for logging on to Azure Data Lake Analytics. The following snippet shows an example
of authentication with interactive user authentication with a pop-up.
using System;
using System.IO;
using System.Threading;
using System.Security.Cryptography.X509Certificates;
using Microsoft.Rest;
using Microsoft.Rest.Azure.Authentication;
using Microsoft.Azure.Management.DataLake.Analytics;
using Microsoft.Azure.Management.DataLake.Analytics.Models;
using Microsoft.Azure.Management.DataLake.Store;
using Microsoft.Azure.Management.DataLake.Store.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Azure.Graph.RBAC;
The source code for GetCreds_User_Popup and the code for other options for authentication are covered in
Data Lake Analytics .NET authentication options
Manage accounts
Create an Azure Resource Group
If you haven't already created one, you must have an Azure Resource Group to create your Data Lake Analytics
components. You need your authentication credentials, subscription ID, and a location. The following code shows
how to create a resource group:
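A minimal sketch using the Microsoft.Azure.Management.ResourceManager package listed above (creds is assumed to be the ServiceClientCredentials object returned by the authentication step, and location is assumed to hold a region name such as "eastus2"):

// Assumes: using Microsoft.Azure.Management.ResourceManager; and Microsoft.Azure.Management.ResourceManager.Models;
var armClient = new ResourceManagementClient(creds) { SubscriptionId = subid };
var resourceGroup = new ResourceGroup { Location = location };
armClient.ResourceGroups.CreateOrUpdate(rg, resourceGroup);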
For more information, see Azure Resource Groups and Data Lake Analytics.
Create a Data Lake Store account
Every ADLA account requires an ADLS account. If you don't already have one to use, you can create one with the
following code:
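A sketch, assuming the Microsoft.Azure.Management.DataLake.Store package version listed above (adls holds the new store account name and creds the credentials from the authentication step):

// Create the client and the Data Lake Store account (Location is assumed settable in this SDK version).
var adlsClient = new DataLakeStoreAccountManagementClient(creds) { SubscriptionId = subid };
adlsClient.Account.Create(rg, adls, new DataLakeStoreAccount { Location = location });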
Delete an account
if (adlaClient.Account.Exists(rg, adla))
{
adlaClient.Account.Delete(rg, adla);
}
if (adlaClient.Account.Exists(rg, adla))
{
var adla_accnt = adlaClient.Account.Get(rg, adla);
string def_adls_account = adla_accnt.DefaultDataLakeStoreAccount;
}
if (stg_accounts != null)
{
foreach (var stg_account in stg_accounts)
{
Console.WriteLine($"Storage account: {0}", stg_account.Name);
}
}
if (adls_accounts != null)
{
foreach (var adls_accnt in adls_accounts)
{
Console.WriteLine($"ADLS account: {0}", adls_accnt.Name);
}
}
memstream.Position = 0;
List pipelines
The following code lists information about each pipeline of jobs submitted to the account.
var pipelines = adlaJobClient.Pipeline.List(adla);
foreach (var p in pipelines)
{
Console.WriteLine($"Pipeline: {p.Name}\t{p.PipelineId}\t{p.LastSubmitTime}");
}
List recurrences
The following code lists information about each recurrence of jobs submitted to the account.
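A sketch that mirrors the pipeline listing above (the RecurrenceName and RecurrenceId property names are assumed from the SDK models):

var recurrences = adlaJobClient.Recurrence.List(adla);
foreach (var r in recurrences)
{
    Console.WriteLine($"Recurrence: {r.RecurrenceName}\t{r.RecurrenceId}");
}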
Next steps
Overview of Microsoft Azure Data Lake Analytics
Manage Azure Data Lake Analytics using Azure portal
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Manage Azure Data Lake Analytics using Python
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs by using
Python.
Authentication
Interactive user authentication with a pop-up
This method is not supported.
Interactive user authentication with a device code
user = input(
'Enter the user to authenticate with that has permission to subscription: ')
password = getpass.getpass()
credentials = UserPassCredentials(user, password)
credentials = ServicePrincipalCredentials(
client_id='FILL-IN-HERE', secret='FILL-IN-HERE', tenant='FILL-IN-HERE')
adlsAcctResult = adlsAcctClient.account.create(
rg,
adls,
DataLakeStoreAccount(
location=location)
).wait()
adlaAcctResult = adlaAcctClient.account.create(
rg,
adla,
DataLakeAnalyticsAccount(
location=location,
default_data_lake_store_account=adls,
data_lake_store_accounts=[DataLakeStoreAccountInformation(name=adls)]
)
).wait()
Submit a job
script = """
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();
"""
jobId = str(uuid.uuid4())
jobResult = adlaJobClient.job.create(
adla,
jobId,
JobInformation(
name='Sample Job',
type='USql',
properties=USqlJobProperties(script=script)
)
)
pipelines = adlaJobClient.pipeline.list(adla)
for p in pipelines:
print('Pipeline: ' + p.name + ' ' + p.pipelineId)
recurrences = adlaJobClient.recurrence.list(adla)
for r in recurrences:
print('Recurrence: ' + r.name + ' ' + r.recurrenceId)
userAadObjectId = "3b097601-4912-4d41-b9d2-78672fc2acde"
newPolicyParams = ComputePolicyCreateOrUpdateParameters(
userAadObjectId, "User", 50, 250)
adlaAccountClient.compute_policies.create_or_update(
    rg, adla, "GaryMcDaniel", newPolicyParams)
Next steps
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
Manage Azure Data Lake Analytics using a Java app
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using an
app written using the Azure Java SDK.
Prerequisites
Java Development Kit (JDK) 8 (using Java version 1.8).
IntelliJ or another suitable Java development environment. The instructions in this document use IntelliJ.
Create an Azure Active Directory (AAD) application and retrieve its Client ID , Tenant ID , and Key . For more
information about AAD applications and instructions on how to get a client ID, see Create Active Directory
application and service principal using portal. The Reply URI and Key are available from the portal once you
have created the application and generated the key.
Go to File > Settings > Build > Execution > Deployment. Select Build Tools > Maven > Importing.
Then check Import Maven projects automatically.
Open Main.java and replace the existing code block with the following code:
import com.microsoft.azure.CloudException;
import com.microsoft.azure.credentials.ApplicationTokenCredentials;
import com.microsoft.azure.datalake.store.*;
import com.microsoft.azure.datalake.store.oauth2.*;
import com.microsoft.azure.management.datalake.analytics.implementation.*;
import com.microsoft.azure.management.datalake.store.*;
import com.microsoft.azure.management.datalake.store.implementation.*;
import com.microsoft.azure.management.datalake.store.models.*;
import com.microsoft.azure.management.datalake.analytics.*;
import com.microsoft.azure.management.datalake.analytics.models.*;
import com.microsoft.rest.credentials.ServiceClientCredentials;
import java.io.*;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.UUID;
import java.util.List;
_tenantId = "<TENANT-ID>";
_subscriptionId = "<SUBSCRIPTION-ID>";
_clientId = "<CLIENT-ID>";
_clientSecret = "<CLIENT-SECRET>";
// ----------------------------------------
// Authenticate
// ----------------------------------------
ApplicationTokenCredentials creds = new ApplicationTokenCredentials(_clientId, _tenantId,
_clientSecret, null);
SetupClients(creds);
// ----------------------------------------
// List Data Lake Store and Analytics accounts that this app can access
// ----------------------------------------
System.out.println(String.format("All ADL Store accounts that this app can access in subscription
%s:", _subscriptionId));
List<DataLakeStoreAccountBasic> adlsListResult = _adlsClient.accounts().list();
for (DataLakeStoreAccountBasic acct : adlsListResult) {
System.out.println(acct.name());
}
// ----------------------------------------
// Create a file in Data Lake Store: input1.csv
// ----------------------------------------
CreateFile("/input1.csv", "123,abc", true);
WaitForNewline("File created.", "Submitting a job.");
// ----------------------------------------
// Submit a job to Data Lake Analytics
// ----------------------------------------
String script = "@input = EXTRACT Row1 string, Row2 string FROM \"/input1.csv\" USING
Extractors.Csv(); OUTPUT @input TO @\"/output1.csv\" USING Outputters.Csv();";
UUID jobId = SubmitJobByScript(script, "testJob");
WaitForNewline("Job submitted.", "Getting job status.");
// ----------------------------------------
// Wait for job completion and output job status
// ----------------------------------------
System.out.println(String.format("Job status: %s", GetJobStatus(jobId)));
System.out.println(String.format("Job status: %s", GetJobStatus(jobId)));
System.out.println("Waiting for job completion.");
WaitForJob(jobId);
System.out.println(String.format("Job status: %s", GetJobStatus(jobId)));
WaitForNewline("Job completed.", "Downloading job output.");
// ----------------------------------------
// Download job output from Data Lake Store
// ----------------------------------------
DownloadFile("/output1.csv", localFolderPath + "output1.csv");
WaitForNewline("Job output downloaded.", "Deleting file.");
DeleteFile("/output1.csv");
WaitForNewline("File deleted.", "Done.");
}
if (!nextAction.isEmpty()) {
System.out.println(nextAction);
}
}
// Create accounts
public static void CreateAccounts() throws InterruptedException, CloudException, IOException {
// Create ADLS account
CreateDataLakeStoreAccountParameters adlsParameters = new CreateDataLakeStoreAccountParameters();
adlsParameters.withLocation(_location);
// Create a file
public static void CreateFile(String path, String contents, boolean force) throws IOException, CloudException {
byte[] bytesContents = contents.getBytes();
// Delete a file
public static void DeleteFile(String filePath) throws IOException, CloudException {
_adlsStoreClient.delete(filePath);
}
// Download a file
private static void DownloadFile(String srcPath, String destPath) throws IOException, CloudException {
ADLFileInputStream stream = _adlsStoreClient.getReadStream(srcPath);
pWriter.println(fileContents);
pWriter.close();
}
return jobInfo.jobId();
}
Provide the values for parameters called out in the code snippet:
localFolderPath
_adlaAccountName
_adlsAccountName
_resourceGroupName
_tenantId
_subId
_clientId
_clientSecret
Next steps
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language, and U-SQL language
reference.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
To get an overview of Data Lake Analytics, see Azure Data Lake Analytics overview.
Manage Azure Data Lake Analytics using Azure
SDK for Node.js
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using an
app written using the Azure SDK for Node.js.
The following versions are supported:
Node.js version: 0.10.0 or higher
REST API version for Account: 2015-10-01-preview
REST API version for Catalog: 2015-10-01-preview
REST API version for Job: 2016-03-20-preview
Features
Account management: create, get, list, update, and delete.
Job management: submit, get, list, and cancel.
Catalog management: get and list.
How to Install
npm install azure-arm-datalake-analytics
// A Data Lake Store account must already have been created to create
// a Data Lake Analytics account. See the Data Lake Store readme for
// information on doing so. For now, we assume one exists already.
var datalakeStoreAccountName = 'existingadlsaccount';
See also
Microsoft Azure SDK for Node.js
Adding a user in the Azure portal
Optionally, add the user to the Azure Data Lake Storage Gen1 Reader role.
1. Find your Azure Data Lake Storage Gen1 account.
2. Click on Users .
3. Click Add .
4. Select an Azure role to assign this group.
5. Assign the Reader role. This role has the minimum set of permissions required to browse and manage data stored
in ADLS Gen1. Assign this role if the group is not intended for managing Azure services.
6. Type in the name of the Group.
7. Click OK .
Next steps
Overview of Azure Data Lake Analytics
Get started with Data Lake Analytics by using the Azure portal
Manage Azure Data Lake Analytics by using Azure PowerShell
Manage Azure Data Lake Analytics using Account
Policies
Account policies help you control how resources in an Azure Data Lake Analytics account are used. These policies
allow you to control the cost of using Azure Data Lake Analytics. For example, with these policies you can
prevent unexpected cost spikes by limiting how many AUs the account can use simultaneously.
Account-level policies
These policies apply to all jobs in a Data Lake Analytics account.
NOTE
If you need more than the default (250) AUs, in the portal, click Help + Support to submit a support request. The
number of AUs available in your Data Lake Analytics account can be increased.
Job-level policies
Job-level policies allow you to control the maximum AUs and the maximum priority that individual users (or
members of specific security groups) can set on jobs that they submit. This policy lets you control the costs
incurred by users. It also lets you control the effect that scheduled jobs might have on high-priority production
jobs that are running in the same Data Lake Analytics account.
Data Lake Analytics has two policies that you can set at the job level:
AU limit per job : Users can only submit jobs that have up to this number of AUs. By default, this limit is
the same as the maximum AU limit for the account.
Priority : Users can only submit jobs that have a priority lower than or equal to this value. A higher
number indicates a lower priority. By default, this limit is set to 1, which is the highest possible priority.
There is a default policy set on every account. The default policy applies to all users of the account. You can
create additional policies for specific users and groups.
NOTE
Account-level policies and job-level policies apply simultaneously.
Next steps
Overview of Azure Data Lake Analytics
Get started with Data Lake Analytics by using the Azure portal
Manage Azure Data Lake Analytics by using Azure PowerShell
Configure user access to job information in Azure Data Lake Analytics
In Azure Data Lake Analytics, you can use multiple user accounts or service principals to run jobs.
In order for those same users to see the detailed job information, they need to be able to read the contents
of the job folders. The job folders are located in the /system/ directory.
If the necessary permissions are not configured, the user may see an error:
Graph data not available - You don't have permissions to access the graph data.
Next steps
Add a new user
Accessing diagnostic logs for Azure Data Lake
Analytics
Diagnostic logging allows you to collect data access audit trails. These logs provide information such as:
A list of users that accessed the data.
How frequently the data is accessed.
How much data is stored in the account.
Enable logging
1. Sign in to the Azure portal.
2. Open your Data Lake Analytics account and select Diagnostic logs from the Monitor section. Next,
select Turn on diagnostics .
3. From Diagnostics settings , enter a Name for this logging configuration and then select logging
options.
NOTE
You must select either Archive to a storage account , Stream to an Event Hub or Send to Log
Analytics before clicking the Save button.
resourceId=/
SUBSCRIPTIONS/
<<SUBSCRIPTION_ID>>/
RESOURCEGROUPS/
<<RESOURCE_GRP_NAME>>/
PROVIDERS/
MICROSOFT.DATALAKEANALYTICS/
ACCOUNTS/
<DATA_LAKE_ANALYTICS_NAME>>/
y=####/
m=##/
d=##/
h=##/
m=00/
PT1H.json
NOTE
The ## entries in the path contain the year, month, day, and hour in which the log was created. Data Lake
Analytics creates one file every hour, so m= always contains a value of 00 .
Log structure
The audit and request logs are in a structured JSON format.
Request logs
Here's a sample entry in the JSON-formatted request log. Each blob has one root object called records that
contains an array of log objects.
{
"records":
[
. . . .
,
{
"time": "2016-07-07T21:02:53.456Z",
"resourceId":
"/SUBSCRIPTIONS/<subscription_id>/RESOURCEGROUPS/<resource_group_name>/PROVIDERS/MICROSOFT.DATALAKEANALYTICS
/ACCOUNTS/<data_lake_analytics_account_name>",
"category": "Requests",
"operationName": "GetAggregatedJobHistory",
"resultType": "200",
"callerIpAddress": "::ffff:1.1.1.1",
"correlationId": "4a11c709-05f5-417c-a98d-6e81b3e29c58",
"identity": "1808bd5f-62af-45f4-89d8-03c5e81bac30",
"properties": {
"HttpMethod":"POST",
"Path":"/JobAggregatedHistory",
"RequestContentLength":122,
"ClientRequestId":"3b7adbd9-3519-4f28-a61c-bd89506163b8",
"StartTime":"2016-07-07T21:02:52.472Z",
"EndTime":"2016-07-07T21:02:53.456Z"
}
}
,
. . . .
]
}
Audit logs
Here's a sample entry in the JSON-formatted audit log. Each blob has one root object called records that
contains an array of log objects.
{
"records":
[
{
"time": "2016-07-28T19:15:16.245Z",
"resourceId":
"/SUBSCRIPTIONS/<subscription_id>/RESOURCEGROUPS/<resource_group_name>/PROVIDERS/MICROSOFT.DATALAKEANALYTICS
/ACCOUNTS/<data_lake_ANALYTICS_account_name>",
"category": "Audit",
"operationName": "JobSubmitted",
"identity": "[email protected]",
"properties": {
"JobId":"D74B928F-5194-4E6C-971F-C27026C290E6",
"JobName": "New Job",
"JobRuntimeName": "default",
"SubmitTime": "7/28/2016 7:14:57 PM"
}
}
]
}
SubmitTime (String): The time (in UTC) that the job was submitted.
NOTE
SubmitTime, StartTime, EndTime, and Parallelism provide information on an operation. These entries only contain a
value if that operation has started or completed. For example, SubmitTime only contains a value after operationName
has the value JobSubmitted .
Next steps
Overview of Azure Data Lake Analytics
Adjust quotas and limits in Azure Data Lake
Analytics
Learn how to adjust and increase the quota and limits in Azure Data Lake Analytics (ADLA) accounts. Knowing
these limits will help you understand your U-SQL job behavior. All quota limits are soft, so you can increase the
maximum limits by contacting Azure support.
5. On the problem page, explain your requested limit increase and provide details about why you need this extra
capacity.
6. Verify your contact information and create the support request.
Microsoft reviews your request and tries to accommodate your business needs as soon as possible.
Next steps
Overview of Microsoft Azure Data Lake Analytics
Manage Azure Data Lake Analytics using Azure PowerShell
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Disaster recovery guidance for Azure Data Lake
Analytics
Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data. Instead of deploying,
configuring, and tuning hardware, you write queries to transform your data and extract valuable insights. The
analytics service can handle jobs of any scale instantly by setting the dial for how much power you need. You
only pay for your job when it is running, making it cost-effective. This article provides guidance on how to
protect your jobs from rare region-wide outages or accidental deletions.
NOTE
Since account names are globally unique, use a consistent naming scheme that indicates which account is
secondary.
2. For unstructured data, reference Disaster recovery guidance for data in Azure Data Lake Storage Gen1
3. For structured data stored in ADLA tables and databases, create copies of the metadata artifacts such as
databases, tables, table-valued functions, and assemblies. You need to periodically resync these artifacts
when changes happen in production. For example, newly inserted data has to be replicated to the
secondary region by copying the data and inserting into the secondary table.
NOTE
These object names are scoped to the secondary account and are not globally unique, so they can have the same
names as in the primary production account.
During an outage, you need to update your scripts so the input paths point to the secondary endpoint. Then the
users submit their jobs to the ADLA account in the secondary region. The output of the job will then be written
to the ADLA and ADLS account in the secondary region.
Next steps
Disaster recovery guidance for data in Azure Data Lake Storage Gen1
Get started with U-SQL in Azure Data Lake
Analytics
U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale.
Through the scalable, distributed-query capability of U-SQL, you can efficiently analyze data across relational
stores such as Azure SQL Database. With U-SQL, you can process unstructured data by applying schema on
read and inserting custom logic and UDFs. Additionally, U-SQL includes extensibility that gives you fine-grained
control over how to execute at scale.
Learning resources
The U-SQL Tutorial provides a guided walkthrough of most of the U-SQL language. This document is
recommended reading for all developers wanting to learn U-SQL.
For detailed information about the U-SQL language syntax , see the U-SQL Language Reference.
To understand the U-SQL design philosophy , see the Visual Studio blog post Introducing U-SQL – A
Language that makes Big Data Processing Easy.
Prerequisites
Before you go through the U-SQL samples in this document, read and complete Tutorial: Develop U-SQL scripts
using Data Lake Tools for Visual Studio. That tutorial explains the mechanics of using U-SQL with Azure Data
Lake Tools for Visual Studio.
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
OUTPUT @searchlog
TO "/output/SearchLog-first-u-sql.csv"
USING Outputters.Csv();
This script doesn't have any transformation steps. It reads from the source file called SearchLog.tsv ,
schematizes it, and writes the rowset back into a file called SearchLog-first-u-sql.csv.
Notice the question mark next to the data type in the Duration field. It means that the Duration field could be
null.
Key concepts
Rowset variables : Each query expression that produces a rowset can be assigned to a variable. U-SQL
follows the T-SQL variable naming pattern ( @searchlog , for example) in the script.
The EXTRACT keyword reads data from a file and defines the schema on read. Extractors.Tsv is a built-in
U-SQL extractor for tab-separated-value files. You can develop custom extractors.
The OUTPUT writes data from a rowset to a file. Outputters.Csv() is a built-in U-SQL outputter to create a
comma-separated-value file. You can develop custom outputters.
File paths
The EXTRACT and OUTPUT statements use file paths. File paths can be absolute or relative:
The following absolute file path refers to a file in a Data Lake Store named mystore:
adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv
The following file path starts with "/". It refers to a file in the default Data Lake Store account:
/output/SearchLog-first-u-sql.csv
Transform rowsets
Use SELECT to transform rowsets:
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";
OUTPUT @rs1
TO "/output/SearchLog-transform-rowsets.csv"
USING Outputters.Csv();
The WHERE clause uses a C# Boolean expression. You can use the C# expression language to do your own
expressions and functions. You can even perform more complex filtering by combining them with logical
conjunctions (ANDs) and disjunctions (ORs).
The following script uses the DateTime.Parse() method and a conjunction.
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";
@rs1 =
SELECT Start, Region, Duration
FROM @rs1
WHERE Start >= DateTime.Parse("2012/02/16") AND Start <= DateTime.Parse("2012/02/17");
OUTPUT @rs1
TO "/output/SearchLog-transform-datetime.csv"
USING Outputters.Csv();
NOTE
The second query is operating on the result of the first rowset, which creates a composite of the two filters. You can also
reuse a variable name, and the names are scoped lexically.
Aggregate rowsets
U-SQL gives you the familiar ORDER BY, GROUP BY, and aggregations.
The following query finds the total duration per region, and then displays the top five durations in order.
U-SQL rowsets do not preserve their order for the next query. Thus, to order an output, you need to add ORDER
BY to the OUTPUT statement:
DECLARE @outpref string = "/output/Searchlog-aggregation";
DECLARE @out1 string = @outpref+"_agg.csv";
DECLARE @out2 string = @outpref+"_top5agg.csv";
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM @searchlog
GROUP BY Region;
@res =
SELECT *
FROM @rs1
ORDER BY TotalDuration DESC
FETCH 5 ROWS;
OUTPUT @rs1
TO @out1
ORDER BY TotalDuration DESC
USING Outputters.Csv();
OUTPUT @res
TO @out2
ORDER BY TotalDuration DESC
USING Outputters.Csv();
The U-SQL ORDER BY clause requires using the FETCH clause in a SELECT expression.
The U-SQL HAVING clause can be used to restrict the output to groups that satisfy the HAVING condition:
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@res =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM @searchlog
GROUP BY Region
HAVING SUM(Duration) > 200;
OUTPUT @res
TO "/output/Searchlog-having.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();
For advanced aggregation scenarios, see the U-SQL reference documentation for aggregate, analytic, and reference functions.
Next steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts by using Data Lake Tools for Visual Studio
Get started with the U-SQL Catalog in Azure Data
Lake Analytics
12/10/2021 • 2 minutes to read • Edit Online
Create a TVF
In the previous U-SQL script, you repeated the use of EXTRACT to read from the same source file. With the U-
SQL table-valued function (TVF), you can encapsulate the data for future reuse.
The following script creates a TVF called Searchlog() in the default database and schema:
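A minimal sketch of what that creation script looks like, assuming the same SearchLog schema and sample file used in the earlier scripts:
DROP FUNCTION IF EXISTS Searchlog;

CREATE FUNCTION Searchlog()
RETURNS @searchlog TABLE
(
    UserId int,
    Start DateTime,
    Region string,
    Query string,
    Duration int?,
    Urls string,
    ClickedUrls string
)
AS
BEGIN
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();
RETURN;
END;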
The following script shows you how to use the TVF that was defined in the previous script:
@res =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM Searchlog() AS S
GROUP BY Region
HAVING SUM(Duration) > 200;
OUTPUT @res
TO "/output/SearchLog-use-tvf.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();
Create views
If you have a single query expression, instead of a TVF you can use a U-SQL VIEW to encapsulate that
expression.
The following script creates a view called SearchlogView in the default database and schema:
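A minimal sketch of what that creation script looks like, assuming the same SearchLog schema and sample file:
DROP VIEW IF EXISTS SearchlogView;

CREATE VIEW SearchlogView AS
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int?,
            Urls string,
            ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
The following script shows you how to use the view that was defined in the previous script: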
@res =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM SearchlogView
GROUP BY Region
HAVING SUM(Duration) > 200;
OUTPUT @res
TO "/output/Searchlog-use-view.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();
Create tables
As with relational database tables, with U-SQL you can create a table with a predefined schema or create a table
that infers the schema from the query that populates the table (also known as CREATE TABLE AS SELECT or
CTAS).
Create a database and two tables by using the following script:
DROP DATABASE IF EXISTS SearchLogDb;
CREATE DATABASE SearchLogDb;
USE DATABASE SearchLogDb;
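The table definitions follow. This is a minimal sketch: one table with a predefined schema and one created with CTAS from the Searchlog() TVF defined earlier, which is assumed to live in the default master database. The SearchLog1 name and the index and distribution choices are illustrative; SearchLog2 is the table queried in the next section.
// Table with a predefined schema.
CREATE TABLE SearchLog1
(
    UserId int,
    Start DateTime,
    Region string,
    Query string,
    Duration int?,
    Urls string,
    ClickedUrls string,

    INDEX idx1
    CLUSTERED (Region ASC)
    DISTRIBUTED BY HASH (Region)
);

// Table whose schema is inferred from the query that populates it (CTAS).
CREATE TABLE SearchLog2
(
    INDEX idx1
    CLUSTERED (Region ASC)
    DISTRIBUTED BY HASH (Region)
) AS SELECT * FROM master.dbo.Searchlog() AS S;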
Query tables
You can query tables, such as those created in the previous script, in the same way that you query the data files.
Instead of creating a rowset by using EXTRACT, you now can refer to the table name.
To read from the tables, modify the transform script that you used previously:
@rs1 =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM SearchLogDb.dbo.SearchLog2
GROUP BY Region;
@res =
SELECT *
FROM @rs1
ORDER BY TotalDuration DESC
FETCH 5 ROWS;
OUTPUT @res
TO "/output/Searchlog-query-table.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();
NOTE
Currently, you cannot run a SELECT on a table in the same script as the one where you created the table.
Next Steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Develop U-SQL user-defined operators (UDOs)
12/10/2021 • 2 minutes to read • Edit Online
This article describes how to develop user-defined operators to process data in a U-SQL job.
@drivers_CountryName =
PROCESS @drivers
PRODUCE UserID string,
Name string,
Address string,
City string,
State string,
PostalCode string,
Country string,
Phone string
USING new USQL_UDO.CountryName();
OUTPUT @drivers_CountryName
TO "/Samples/Outputs/Drivers.csv"
USING Outputters.Csv(Encoding.Unicode);
Next steps
Extending U-SQL Expressions with User-Code
Use Data Lake Tools for Visual Studio for developing U-SQL applications
Extend U-SQL scripts with Python code in Azure
Data Lake Analytics
12/10/2021 • 2 minutes to read • Edit Online
Prerequisites
Before you begin, ensure the Python extensions are installed in your Azure Data Lake Analytics account.
Navigate to your Data Lake Analytics account in the Azure portal.
In the left menu, under GETTING STARTED, click Sample Scripts.
Click Install U-SQL Extensions, and then click OK.
Overview
Python Extensions for U-SQL enable developers to perform massively parallel execution of Python code. The
following example illustrates the basic steps:
Use the REFERENCE ASSEMBLY statement to enable Python extensions for the U-SQL script.
Use the REDUCE operation to partition the input data on a key.
The Python extensions for U-SQL include a built-in reducer ( Extension.Python.Reducer ) that runs Python
code on each vertex assigned to the reducer
The U-SQL script contains the embedded Python code that has a function called usqlml_main that accepts a
pandas DataFrame as input and returns a pandas DataFrame as output.
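Putting those steps together, here is a minimal sketch based on the U-SQL Python extension samples. The ExtPython assembly name and the pyScript parameter of Extension.Python.Reducer follow those samples, and the inline data is illustrative.
REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript string = @"
def get_mentions(tweet):
    return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )

def usqlml_main(df):
    del df['time']
    del df['author']
    df['mentions'] = df.tweet.apply(get_mentions)
    del df['tweet']
    return df
";

@t =
    SELECT * FROM
        (VALUES
            ("D1", "T1", "A1", "@foo Hello World @bar"),
            ("D2", "T2", "A2", "@baz Hello World @beer")
        ) AS D( date, time, author, tweet );

// Partition the rows on the date key; the Python reducer runs once per partition
// and receives the rows of that partition as a pandas DataFrame through usqlml_main.
@m =
    REDUCE @t ON date
    PRODUCE date string, mentions string
    USING new Extension.Python.Reducer(pyScript:@myScript);

OUTPUT @m
    TO "/output/tweetmentions.csv"
    USING Outputters.Csv();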
Next steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Using U-SQL window functions for Azure Data Lake Analytics jobs
Use Azure Data Lake Tools for Visual Studio Code
Extend U-SQL scripts with R code in Azure Data
Lake Analytics
12/10/2021 • 4 minutes to read • Edit Online
The following example illustrates the basic steps for deploying R code:
Use the REFERENCE ASSEMBLY statement to enable R extensions for the U-SQL Script.
Use the REDUCE operation to partition the input data on a key.
The R extensions for U-SQL include a built-in reducer ( Extension.R.Reducer ) that runs R code on each vertex
assigned to the reducer.
Use the dedicated, named data frames inputFromUSQL and outputToUSQL to pass data between U-SQL and R. The input and output data frame identifier names are fixed; users cannot change these predefined names.
Keep the R code in a separate file and reference it in the U-SQL script.
The following example illustrates a more complex usage. In this case, the R code is deployed as a RESOURCE that is referenced by the U-SQL script.
Save this R code as a separate file.
load("my_model_LM_Iris.rda")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval="confidence"))
Use a U-SQL script to deploy that R script with the DEPLOY RESOURCE statement.
REFERENCE ASSEMBLY [ExtR];
DEPLOY RESOURCE @"/usqlext/samples/R/RinUSQL_PredictUsingLinearModelasDF.R";
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda";
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/LMPredictionsIris.txt";
DECLARE @PartitionCount int = 10;
@InputData =
EXTRACT
SepalLength double,
SepalWidth double,
PetalLength double,
PetalWidth double,
Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData =
SELECT
Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par,
SepalLength,
SepalWidth,
PetalLength,
PetalWidth
FROM @InputData;
// Predict Species
@RScriptOutput = REDUCE @ExtendedData ON Par
PRODUCE Par, fit double, lwr double, upr double
READONLY Par
USING new Extension.R.Reducer(scriptFile:"RinUSQL_PredictUsingLinearModelasDF.R",
rReturnType:"dataframe", stringsAsFactors:false);
OUTPUT @RScriptOutput TO @OutputFilePredictions USING Outputters.Tsv();
Next steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Using U-SQL window functions for Azure Data Lake Analytics jobs
Get started with the Cognitive capabilities of U-SQL
12/10/2021 • 2 minutes to read • Edit Online
Overview
Cognitive capabilities for U-SQL enable developers to put intelligence into their big data programs.
The following samples using cognitive capabilities are available (a sketch of a typical script follows the list):
Imaging: Detect faces
Imaging: Detect emotion
Imaging: Detect objects (tagging)
Imaging: OCR (optical character recognition)
Text: Key Phrase Extraction & Sentiment Analysis
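As an illustration, the sketch below runs image tagging over a folder of images. It is a minimal sketch based on the U-SQL cognitive samples; the referenced assembly names, the Cognition.Vision.ImageExtractor and Cognition.Vision.ImageTagger operators, the produced column types, and the file paths are assumptions that may differ in the extension version installed in your account.
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

// Read the raw bytes of each .jpg file; {FileName} is a virtual column from the file set path.
@images =
    EXTRACT FileName string, ImgData byte[]
    FROM @"/Samples/Data/Images/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

// Tag the objects detected in each image.
@tags =
    PROCESS @images
    PRODUCE FileName,
            NumObjects int,
            Tags string
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();

OUTPUT @tags
    TO "/output/image-tags.tsv"
    USING Outputters.Tsv();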
Next steps
U-SQL/Cognitive Samples
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Using U-SQL window functions for Azure Data Lake Analytics jobs
Install Data Lake Tools for Visual Studio
12/10/2021 • 2 minutes to read • Edit Online
Learn how to use Visual Studio to create Azure Data Lake Analytics accounts. You can define jobs in U-SQL and
submit jobs to the Data Lake Analytics service. For more information about Data Lake Analytics, see Azure Data
Lake Analytics overview.
Prerequisites
Visual Studio : All editions except Express are supported.
Visual Studio 2019
Visual Studio 2017
Visual Studio 2015
Visual Studio 2013
Microsoft Azure SDK for .NET version 2.7.1 or later. Install it by using the Web platform installer.
A Data Lake Analytics account. To create an account, see Get Started with Azure Data Lake Analytics
using Azure portal.
Install Azure Data Lake Tools for Visual Studio 2017 or Visual Studio
2019
Azure Data Lake Tools for Visual Studio is supported in Visual Studio 2017 15.3 or later. The tool is part of the
Data storage and processing and Azure development workloads. Enable either one of these two
workloads as part of your Visual Studio installation.
Enable the Data storage and processing workload as shown:
Next steps
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics.
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
To use the vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.
Run U-SQL scripts on your local machine
12/10/2021 • 6 minutes to read • Edit Online
When you develop U-SQL scripts, you can save time and expense by running the scripts locally. Azure Data Lake
Tools for Visual Studio supports running U-SQL scripts on your local machine.
COMPONENT | LOCAL RUN | CLOUD RUN
Storage | Local data root folder | Default Azure Data Lake Store account
Compute | U-SQL local run engine | Azure Data Lake Analytics service
Run environment | Working directory on local machine | Azure Data Lake Analytics cluster
The sections that follow provide more information about local run components.
Local data root folders
A local data root folder is a local store for the local compute account. Any folder in the local file system on your
local machine can be a local data root folder. It's the same as the default Azure Data Lake Store account of a Data
Lake Analytics account. Switching to a different data root folder is just like switching to a different default store
account.
The data root folder is used as follows:
Store metadata. Examples are databases, tables, table-valued functions, and assemblies.
Look up the input and output paths that are defined as relative paths in U-SQL scripts. By using relative
paths, it's easier to deploy your U-SQL scripts to Azure.
U -SQL local run engines
A U-SQL local run engine is a local compute account for U-SQL jobs. Users can run U-SQL jobs locally
through Azure Data Lake Tools for Visual Studio. Local runs are also supported through the Azure Data Lake U-
SQL SDK command-line and programming interfaces. Learn more about the Azure Data Lake U-SQL SDK.
Working directories
When you run a U-SQL script, a working directory folder is needed to cache compilation results, run logs, and
perform other functions. In Azure Data Lake Tools for Visual Studio, the working directory is the U-SQL project’s
working directory. It's located under <U-SQL project root path>/bin/debug. The working directory is cleaned
every time a new run is triggered.
A U-SQL project is required for a local run. The U-SQL project’s working directory is used for the U-SQL local
run working directory. Compilation results, run logs, and other job run-related files are generated and stored
under the working directory folder during the local run. Every time you rerun the script, all the files in the
working directory are cleaned and regenerated.
Aspect | Local-machine account | Local-project account
Local access | Can be accessed by all projects. | Only the corresponding project can access this account.
Local data root folder | A permanent local folder. Configured through Tools > Data Lake > Options and Settings. | A temporary folder created for each local run under the U-SQL project working directory. The folder gets cleaned when a rebuild or rerun happens.
Input data for a U-SQL script | The relative path under the permanent local data root folder. | Set through U-SQL project property > Test Data Source. All files and subfolders are copied to the temporary data root folder before a local run.
Output data for a U-SQL script | Relative path under the permanent local data root folder. | Output to the temporary data root folder. The results are cleaned when a rebuild or rerun happens.
Referenced database deployment | Referenced databases aren't deployed automatically when running against a Local-machine account. It's the same for submitting to an Azure Data Lake Analytics account. | Referenced databases are deployed to the Local-project account automatically before a local run. All database environments are cleaned and redeployed when a rebuild or rerun happens.
Next steps
How to set up a CI/CD pipeline for Azure Data Lake Analytics.
How to test your Azure Data Lake Analytics code.
Debug Azure Data Lake Analytics code locally
12/10/2021 • 2 minutes to read • Edit Online
You can use Azure Data Lake Tools for Visual Studio to run and debug Azure Data Lake Analytics code on your
local workstation, just as you can in the Azure Data Lake Analytics service.
Learn how to run U-SQL script on your local machine.
NOTE
The following procedure works only in Visual Studio 2015. In older Visual Studio versions, you might need to manually
add the PDB files.
Next steps
For an example of a more complex query, see Analyze website logs using Azure Data Lake Analytics.
To view job details, see Use Job Browser and Job View for Azure Data Lake Analytics jobs.
To use the vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.
Use a U-SQL database project to develop a U-SQL
database for Azure Data Lake
12/10/2021 • 4 minutes to read • Edit Online
U-SQL database provides structured views over unstructured data and managed structured data in tables. It also
provides a general metadata catalog system for organizing your structured data and custom code. The database
is the concept that groups these related objects together.
Learn more about U-SQL database and Data Definition Language (DDL).
The U-SQL database project is a project type in Visual Studio that helps developers develop, manage, and deploy
their U-SQL databases quickly and easily.
2. In the assembly design view, choose the referenced assembly from Create assembly from reference
drop-down menu.
3. Add Managed Dependencies and Additional Files if there are any. When you add additional files, the
tool uses the relative path to make sure it can find the assemblies both on your local machine and on the
build machine later.
@_DeployTempDirectory is a predefined variable that points the tool to the build output folder. Under the build
output folder, every assembly has a subfolder named with the assembly name. All DLLs and additional files are
in that subfolder.
2. Configure a database reference from a U-SQL database project in the current solution or in a U-SQL
database package file.
3. Provide the name for the database.
Next steps
How to set up a CI/CD pipeline for Azure Data Lake Analytics
How to test your Azure Data Lake Analytics code
Run U-SQL script on your local machine
Use Job Browser and Job View for Azure Data Lake
Analytics
12/10/2021 • 10 minutes to read • Edit Online
The Azure Data Lake Analytics service archives submitted jobs in a query store. In this article, you learn how to
use Job Browser and Job View in Azure Data Lake Tools for Visual Studio to find the historical job information.
By default, the Data Lake Analytics service archives jobs for 30 days. The expiration period can be changed in the Azure portal by configuring a customized expiration policy. You can't access the job information after it expires.
Prerequisites
See Data Lake Tools for Visual Studio prerequisites.
Job View
Job View shows the detailed information of a job. To open a job, you can double-click a job in the Job Browser, or
open it from the Data Lake menu by clicking Job View. You should see a dialog populated with the job URL.
Job Result: Succeeded or failed. The job may fail in any phase.
Total Duration: Wall clock time (duration) between the submitting time and the ending time.
Total Compute Time: The sum of every vertex execution time. You can consider it the time it would take if the job were executed in only one vertex. Refer to Total Vertices to find more information about vertices.
Submit/Start/End Time: The time when the Data Lake Analytics service receives job
submission/starts to run the job/ends the job successfully or not.
Compilation/Queued/Running: Wall clock time spent during the Preparing/Queued/Running
phase.
Account: The Data Lake Analytics account used for running the job.
Author: The user who submitted the job. It can be a real person's account or a system account.
Priority: The priority of the job. The lower the number, the higher the priority. It only affects the
sequence of the jobs in the queue. Setting a higher priority does not preempt running jobs.
Parallelism: The requested maximum number of concurrent Azure Data Lake Analytics Units
(ADLAUs), also known as vertices. Currently, one vertex is equal to one VM with two virtual cores and 6 GB of RAM, though this could be upgraded in future Data Lake Analytics updates.
Bytes Left: Bytes that need to be processed until the job completes.
Bytes read/written: Bytes that have been read/written since the job started running.
Total vertices: The job is broken up into many pieces of work; each piece of work is called a vertex. This value describes how many pieces of work the job consists of. You can consider a vertex a basic processing unit, also known as an Azure Data Lake Analytics Unit (ADLAU); vertices can run in parallel.
Completed/Running/Failed: The count of completed/running/failed vertices. Vertices can fail due to both user code and system failures, but the system retries failed vertices automatically a few times. If a vertex still fails after retrying, the whole job fails.
Job Graph
A U-SQL script represents the logic of transforming input data to output data. The script is compiled and
optimized into a physical execution plan during the Preparing phase. Job Graph shows that physical execution plan. The following diagram illustrates the process:
A job is broken up into many pieces of work. Each piece of work is called a Vertex. The vertices are
grouped as Super Vertex (also known as stage), and visualized as Job Graph. The green stage placards in
the job graph show the stages.
Every vertex in a stage is doing the same kind of work with different pieces of the same data. For
example, if you have a file with 1 TB of data and hundreds of vertices reading from it, each of them reads a chunk. Those vertices are grouped in the same stage and do the same work on different pieces of the same input file.
Stage information
In a particular stage, some numbers are shown in the placard.
SV1 Extract: The name of a stage, named by a number and the operation method.
84 vertices: The total count of vertices in this stage. The figure indicates how many pieces of work this stage is divided into.
12.90 s/vertex: The average vertex execution time for this stage, calculated as SUM (every vertex execution time) / (total vertex count). If all the vertices could be executed in parallel, the whole stage would complete in 12.90 s. If all the work in this stage were done serially, the cost would be #vertices * AVG time.
850,895 rows written: Total row count written in this stage.
R/W: Amount of data read/Written in this stage in bytes.
Colors: Colors are used in the stage to indicate different vertex status.
Green indicates the vertex is succeeded.
Orange indicates the vertex was retried. The retried vertex failed but was retried automatically and successfully by the system, and the overall stage completed successfully. If the vertex was retried but still failed, the color turns red and the whole job fails.
Red indicates failed, which means a certain vertex had been retried a few times by the
system but still failed. This scenario causes the whole job to fail.
Blue means a certain vertex is running.
White indicates the vertex is Waiting. The vertex may be waiting to be scheduled once an
ADLAU becomes available, or it may be waiting for input since its input data might not
be ready.
You can find more details for the stage by hovering your mouse cursor over a stage:
Vertices: Describes the vertex details, for example, how many vertices there are in total, how many have completed, and whether they failed or are still running/waiting.
Data read cross/intra pod: Files and data are stored in multiple pods in the distributed file system. The value here describes how much data has been read in the same pod or across pods.
Total compute time: The sum of every vertex execution time in the stage. You can consider it the time it would take if all the work in the stage were executed in only one vertex.
Data and rows written/read: Indicates how much data or how many rows have been read/written, or need to be read.
Vertex read failures: Describes how many vertices failed while reading data.
Vertex duplicate discards: If a vertex runs too slowly, the system may schedule multiple vertices to run the same piece of work. Redundant vertices are discarded once one of the vertices completes successfully. Vertex duplicate discards records the number of vertices that are discarded as duplicates in the stage.
Vertex revocations: The vertex succeeded but was rerun later for some reason. For example, if a downstream vertex loses intermediate input data, it asks the upstream vertex to rerun.
Vertex schedule executions: The total time that the vertices have been scheduled.
Min/Average/Max Vertex data read: The minimum/average/maximum amount of data read by each vertex.
Duration: The wall clock time a stage takes. You need to load the profile to see this value.
Job Playback
Data Lake Analytics runs jobs and archives the vertices running information of the jobs, such as
when the vertices are started, stopped, failed and how they are retried, etc. All of the information is
automatically logged in the query store and stored in its Job Profile. You can download the Job
Profile through “Load Profile” in Job View, and you can view the Job Playback after downloading
the Job Profile.
Job Playback is a condensed visualization of what happened in the cluster. It helps you watch job execution progress and visually detect performance anomalies and bottlenecks in a very short time (usually less than 30 seconds).
Job Heat Map Display
Job Heat Map can be selected through the Display dropdown in Job Graph.
It shows the I/O, time and throughput heat map of a job, through which you can find where the job
spends most of the time, or whether your job is an I/O boundary job, and so on.
Progress: The job execution progress, see Information in stage information.
Data read/written: The heat map of total data read/written in each stage.
Compute time: The heat map of SUM (every vertex execution time). You can consider this how long it would take if all the work in the stage were executed with only one vertex.
Average execution time per node: The heat map of SUM (every vertex execution time) / (vertex number). If all the vertices could be executed in parallel, the whole stage would be done in this time frame.
Input/Output throughput: The heat map of the input/output throughput of each stage. You can use it to confirm whether your job is an I/O-bound job.
Metadata Operations
You can perform some metadata operations in your U-SQL script, such as creating a database or dropping a table. These operations are shown in Metadata Operation after compilation. You may find assertions, create entities, and drop entities here.
State History
The State History is also visualized in Job Summary, but you can get more details here. You can find detailed information such as when the job was prepared, queued, started running, and ended. You can also find how many times the job was compiled (the CcsAttempts: 1), when the job was actually dispatched to the cluster (the Detail: Dispatching job to cluster), and so on.
Diagnostics
The tool diagnoses job execution automatically. You will receive alerts when there are errors or performance issues in your jobs. Note that you need to download the profile to get full information here.
Warnings: An alert shows up here with a compiler warning. You can click the "x issue(s)" link for more details once the alert appears.
Vertex run too long: If any vertex runs out of time (say, 5 hours), issues are reported here.
Resource usage: If you allocated more parallelism than needed, or not enough, issues are reported here. You can also click Resource usage to see more details and perform what-if scenarios to find a better resource allocation (for more details, see this guide).
Memory check: If any vertex uses more than 5 GB of memory, issues are reported here. The system may kill the job if it uses more memory than the system limit.
Job Detail
Job Detail shows the detailed information of the job, including Script, Resources and Vertex Execution View.
Script
The U-SQL script of the job is stored in the query store. You can view the original U-SQL script and re-
submit it if needed.
Resources
You can find the job compilation outputs stored in the query store through Resources. For instance, here you can find algebra.xml, which is used to show the Job Graph, the assemblies you registered, and so on.
Vertex execution view
It shows vertices execution details. The Job Profile archives every vertex execution log, such as total data
read/written, runtime, state, etc. Through this view, you can get more details on how a job ran. For more
information, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.
Next Steps
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
To use vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio
Debug user-defined C# code for failed U-SQL jobs
12/10/2021 • 3 minutes to read • Edit Online
U-SQL provides an extensibility model using C#. In U-SQL scripts, it is easy to call C# functions and perform
analytic functions that SQL-like declarative language does not support. To learn more for U-SQL extensibility,
see U-SQL programmability guide.
In practice, any code may need debugging, but it is hard to debug a distributed job with custom code on the
cloud with limited log files. Azure Data Lake Tools for Visual Studio provides a feature called Failed Vertex Debug, which helps you more easily debug the failures that occur in your custom code. When a U-SQL job fails,
the service keeps the failure state and the tool helps you to download the cloud failure environment to the local
machine for debugging. The local download captures the entire cloud environment, including any input data and
user code.
The following video demonstrates Failed Vertex Debug in Azure Data Lake Tools for Visual Studio.
IMPORTANT
Visual Studio requires the following two updates for using this feature: Microsoft Visual C++ 2015 Redistributable Update
3 and the Universal C Runtime for Windows.
In the newly launched Visual Studio instance, you may or may not find the user-defined C# source code:
1. I can find my source code in the solution
2. I cannot find my source code in the solution
Source code is included in debugging solution
There are two cases in which the C# source code is captured:
1. The user code is defined in a code-behind file (typically named Script.usql.cs in a U-SQL project).
2. The user code is defined in a C# class library project for a U-SQL application, and registered as an assembly with debug info.
If the source code is imported to the solution, you can use the Visual Studio debugging tools (watch, variables,
etc.) to troubleshoot the problem:
1. Press F5 to start debugging. The code runs until it is stopped by an exception.
2. Open the source code file and set breakpoints, then press F5 to debug the code step by step.
After these settings, start debugging with F5 and breakpoints. You can also use the Visual Studio debugging
tools (watch, variables, etc.) to troubleshoot the problem.
NOTE
Rebuild the assembly source code project each time after you modify the code to generate updated .pdb files.
2. For jobs with assemblies, right-click the assembly source code project in debugging solution and register
the updated .dll assemblies into your Azure Data Lake catalog.
3. Resubmit the U-SQL job.
Next steps
U-SQL programmability guide
Develop U-SQL User-defined operators for Azure Data Lake Analytics jobs
Test and debug U-SQL jobs by using local run and the Azure Data Lake U-SQL SDK
How to troubleshoot an abnormal recurring job
Troubleshoot an abnormal recurring job
12/10/2021 • 2 minutes to read • Edit Online
This article shows how to use Azure Data Lake Tools for Visual Studio to troubleshoot problems with recurring
jobs. Learn more about pipeline and recurring jobs from the Azure Data Lake and Azure HDInsight blog.
Recurring jobs usually share the same query logic and similar input data. For example, imagine that you have a
recurring job running every Monday morning at 8 A.M. to count last week's weekly active users. The scripts for
these jobs share one script template that contains the query logic. The inputs for these jobs are the usage data
for last week. Sharing the same query logic and similar input usually means that performance of these jobs is
similar and stable. If one of your recurring jobs suddenly performs abnormally, fails, or slows down a lot, you
might want to:
See the statistics reports for the previous runs of the recurring job to see what happened.
Compare the abnormal job with a normal one to figure out what has been changed.
Related Job View in Azure Data Lake Tools for Visual Studio helps you accelerate the troubleshooting process in both cases.
Case 2: You have the pipeline for the recurring job, but not the URL
In Visual Studio, you can open Pipeline Browser through Server Explorer > your Azure Data Lake Analytics
account > Pipelines . (If you can't find this node in Server Explorer, download the latest plug-in.)
In Pipeline Browser, all pipelines for the Data Lake Analytics account are listed at left. You can expand the
pipelines to find all recurring jobs, and then select the one that has problems. Related Job View opens at right.
Step 2: Analyze a statistics report
A summary and a statistics report are shown at top of Related Job View. There, you can find the potential root
cause of the problem.
1. In the report, the X-axis shows the job submission time. Use it to find the abnormal job.
2. Use the process in the following diagram to check statistics and get insights about the problem and the
possible solutions.
Next steps
Resolve data-skew problems
Debug user-defined C# code for failed U-SQL jobs
Use the Vertex Execution View in Data Lake Tools
for Visual Studio
12/10/2021 • 2 minutes to read • Edit Online
Learn how to use the Vertex Execution View to examine Data Lake Analytics jobs.
The top center pane shows the running status of all the vertices.
Next steps
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
To view job details, see Use Job Browser and Job View for Azure Data lake Analytics jobs
Export a U-SQL database
12/10/2021 • 3 minutes to read • Edit Online
In this article, learn how to use Azure Data Lake Tools for Visual Studio to export a U-SQL database as a single
U-SQL script and downloaded resources. You can import the exported database to a local account in the same
process.
Customers usually maintain multiple environments for development, test, and production. These environments
are hosted on both a local account, on a developer's local computer, and in an Azure Data Lake Analytics account
in Azure.
When you develop and tune U-SQL queries in development and test environments, developers often need to re-
create their work in a production database. The Database Export Wizard helps accelerate this process. By using
the wizard, developers can clone the existing database environment and sample data to other Data Lake
Analytics accounts.
Export steps
Step 1: Export the database in Server Explorer
All Data Lake Analytics accounts that you have permissions for are listed in Server Explorer. To export the
database:
1. In Server Explorer, expand the account that contains the database that you want to export.
2. Right-click the database, and then select Export.
If the Export menu option isn't available, you need to update the tool to the latest release.
Step 2: Configure the objects that you want to export
If you need only a small part of a large database, you can configure a subset of objects that you want to export
in the export wizard.
The export action is completed by running a U-SQL job. Therefore, exporting from an Azure account incurs
some cost.
Step 3: Check the objects list and other configurations
In this step, you can verify the selected objects in the Export object list box. If there are any errors, select
Previous to go back and correctly configure the objects that you want to export.
You can also configure other settings for the export target. Configuration descriptions are listed in the following
table:
Configuration | Description
Destination Name | This name indicates where you want to save the exported database resources. Examples are assemblies, additional files, and sample data. A folder with this name is created under your local data root folder.
Project Directory | This path defines where you want to save the exported U-SQL script. All database object definitions are saved at this location.
Schema Only | If you select this option, only database definitions and resources (like assemblies and additional files) are exported.
Schema and Data | If you select this option, database definitions, resources, and data are exported. The top N rows of tables are exported.
Import to Local Database Automatically | If you select this option, the exported database is automatically imported to your local database when exporting is finished.
Step 4: Check the export results
When exporting is finished, you can view the exported results in the log window in the wizard. The following
example shows how to find exported U-SQL script and database resources, including assemblies, additional
files, and sample data:
Import the exported database to a local account
The most convenient way to import the exported database is to select the Import to Local Database
Automatically check box during the exporting process in Step 3. If you didn't check this box, first, find the
exported U-SQL script in the export log. Then, run the U-SQL script locally to import the database to your local
account.
Known limitations
Currently, if you select the Schema and Data option in Step 3, the tool runs a U-SQL job to export the data
stored in tables. Because of this, the data exporting process might be slow and you might incur costs.
Next steps
Learn about U-SQL databases
Test and debug U-SQL jobs by using local run and the Azure Data Lake U-SQL SDK
Analyze Website logs using Azure Data Lake
Analytics
12/10/2021 • 4 minutes to read • Edit Online
Learn how to analyze website logs using Data Lake Analytics, especially on finding out which referrers ran into
errors when they tried to visit the website.
Prerequisites
Visual Studio 2015 or Visual Studio 2013 .
Data Lake Tools for Visual Studio .
Once Data Lake Tools for Visual Studio is installed, you will see a Data Lake item in the Tools menu in
Visual Studio:
Basic knowledge of Data Lake Analytics and the Data Lake Tools for Visual Studio . To get
started, see:
Develop U-SQL script using Data Lake tools for Visual Studio.
A Data Lake Analytics account. See Create an Azure Data Lake Analytics account.
Install the sample data. In the Azure Portal, open your Data Lake Analytics account and click Sample
Scripts on the left menu, then click Copy Sample Data .
Connect to Azure
Before you can build and test any U-SQL scripts, you must first connect to Azure.
To connect to Data Lake Analytics
1. Open Visual Studio.
2. Click Data Lake > Options and Settings .
3. Click Sign In , or Change User if someone has signed in, and follow the instructions.
4. Click OK to close the Options and Settings dialog.
To browse your Data Lake Analytics accounts
1. From Visual Studio, open Server Explorer by pressing Ctrl+Alt+S.
2. From Server Explorer, expand Azure, and then expand Data Lake Analytics. You'll see a list of your Data Lake Analytics accounts if there are any. You cannot create Data Lake Analytics accounts from Visual Studio. To create an account, see Get Started with Azure Data Lake Analytics using Azure Portal or Get Started with Azure Data Lake Analytics using Azure PowerShell.
// Create a database for easy reuse, so you don't need to read from a file every time.
CREATE DATABASE IF NOT EXISTS SampleDBTutorials;
// Create a table-valued function (TVF). The TVF ensures that your jobs fetch data from the weblog file with the correct schema.
DROP FUNCTION IF EXISTS SampleDBTutorials.dbo.WeblogsView;
CREATE FUNCTION SampleDBTutorials.dbo.WeblogsView()
RETURNS @result TABLE
(
s_date DateTime,
s_time string,
s_sitename string,
cs_method string,
cs_uristem string,
cs_uriquery string,
s_port int,
cs_username string,
c_ip string,
cs_useragent string,
cs_cookie string,
cs_referer string,
cs_host string,
sc_status int,
sc_substatus int,
sc_win32status int,
sc_bytes int,
cs_bytes int,
s_timetaken int
)
AS
BEGIN
@result = EXTRACT
s_date DateTime,
s_time string,
s_sitename string,
cs_method string,
cs_uristem string,
cs_uriquery string,
s_port int,
cs_username string,
c_ip string,
cs_useragent string,
cs_cookie string,
cs_referer string,
cs_host string,
sc_status int,
sc_substatus int,
sc_win32status int,
sc_bytes int,
cs_bytes int,
s_timetaken int
FROM @"/Samples/Data/WebLog.log"
USING Extractors.Text(delimiter:' ');
RETURN;
END;
To understand the U-SQL, see Get started with Data Lake Analytics U-SQL language.
5. Add a new U-SQL script to your project and enter the following:
OUTPUT @content
TO @"/Samples/Outputs/UnsuccessfulResponses.log"
USING Outputters.Tsv();
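The OUTPUT statement above refers to a rowset named @content. A minimal sketch of that rowset, assuming it selects the requests that returned an error status from the WeblogsView TVF created in the previous step (the status filter is illustrative):
@content =
    SELECT *
    FROM SampleDBTutorials.dbo.WeblogsView() AS weblog
    WHERE sc_status >= 400;
Place the @content definition before the OUTPUT statement when you assemble the script.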
6. Switch back to the first U-SQL script and next to the Submit button, specify your Analytics account.
7. From Solution Explorer , right click Script.usql , and then click Build Script . Verify the results in the
Output pane.
8. From Solution Explorer , right click Script.usql , and then click Submit Script .
9. Verify the Analytics Account is the one where you want to run the job, and then click Submit .
Submission results and job link are available in the Data Lake Tools for Visual Studio Results window
when the submission is completed.
10. Wait until the job is completed successfully. If the job failed, it is most likely missing the source file. Please
see the Prerequisite section of this tutorial. For additional troubleshooting information, see Monitor and
troubleshoot Azure Data Lake Analytics jobs.
When the job is completed, you should see the following screen:
Next steps
To get started with Data Lake Analytics using different tools, see:
Get started with Data Lake Analytics using Azure Portal
Get started with Data Lake Analytics using Azure PowerShell
Get started with Data Lake Analytics using .NET SDK
Resolve data-skew problems by using Azure Data
Lake Tools for Visual Studio
12/10/2021 • 7 minutes to read • Edit Online
In our scenario, the data is unevenly distributed across all tax examiners, which means that some examiners
must work more than others. In your own job, you frequently experience situations like the tax-examiner
example here. In more technical terms, one vertex gets much more data than its peers, a situation that makes
the vertex work more than the others and that eventually slows down an entire job. What's worse, the job might
fail, because vertices might have, for example, a 5-hour runtime limitation and a 6-GB memory limitation.
NOTE
Statistics information is not updated automatically. If you update the data in a table without re-creating the statistics, the
query performance might decline.
SKEWFACTOR (columns) = x
Provides a hint that the given columns have a skew factor x from 0 (no skew) through 1 (very heavy skew).
Code example:
//Add a SKEWFACTOR hint.
@Impressions =
SELECT * FROM
searchDM.SML.PageView(@start, @end) AS PageView
OPTION(SKEWFACTOR(Query)=0.5)
;
//Query 1 for key: Query, ClientId
@Sessions =
SELECT
ClientId,
Query,
SUM(PageClicks) AS Clicks
FROM
@Impressions
GROUP BY
Query, ClientId
;
//Query 2 for Key: Query
@Display =
SELECT * FROM @Sessions
INNER JOIN @Campaigns
ON @Sessions.Query == @Campaigns.Query
;
OPTION(ROWCOUNT = n)
Identify a small row set before JOIN by providing an estimated integer row count.
Code example:
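A minimal sketch of the hint, assuming @Campaigns from the example above is the small rowset being joined; the row count and rowset names are illustrative:
// Tell the optimizer that @Campaigns is small (about 500 rows), so it can
// pick a better join strategy for the skewed key.
@SmallCampaigns =
    SELECT *
    FROM @Campaigns
    OPTION(ROWCOUNT = 500);

@Display =
    SELECT *
    FROM @Sessions
    INNER JOIN @SmallCampaigns
    ON @Sessions.Query == @SmallCampaigns.Query;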
[SqlUserDefinedReducer(IsRecursive = true)]
Code example:
[SqlUserDefinedReducer(IsRecursive = true)]
public class TopNReducer : IReducer
{
public override IEnumerable<IRow>
Reduce(IRowset input, IUpdatableRow output)
{
//Your reducer code goes here.
}
}
[SqlUserDefinedCombiner(Mode = CombinerMode.Right)]
public class WatsonDedupCombiner : ICombiner
{
public override IEnumerable<IRow>
Combine(IRowset left, IRowset right, IUpdatableRow output)
{
//Your combiner code goes here.
}
}
Monitor jobs in Azure Data Lake Analytics using the
Azure Portal
12/10/2021 • 2 minutes to read • Edit Online
Job Management gives you a glance at the job status. Notice there is a failed job.
3. Click the Job Management tile to see the jobs. The jobs are categorized in Running , Queued , and
Ended . You should see your failed job in the Ended section. It should be the first one in the list. When you have a lot of jobs, you can click Filter to help you locate jobs.
4. Click the failed job from the list to open the job details:
Notice the Resubmit button. After you fix the problem, you can resubmit the job.
5. Click the highlighted part of the previous screenshot to open the error details. You should see something like:
Next steps
Azure Data Lake Analytics overview
Get started with Azure Data Lake Analytics using Azure PowerShell
Manage Azure Data Lake Analytics using Azure portal
Use Azure Data Lake Tools for Visual Studio Code
12/10/2021 • 8 minutes to read • Edit Online
In this article, learn how you can use Azure Data Lake Tools for Visual Studio Code (VS Code) to create, test, and
run U-SQL scripts. The information is also covered in the following video:
Prerequisites
Azure Data Lake Tools for VS Code supports Windows, Linux, and macOS. U-SQL local run and local debug
works only in Windows.
Visual Studio Code
For macOS and Linux:
.NET 5.0 SDK
Mono 5.2.x
@departments =
SELECT * FROM
(VALUES
(31, "Sales"),
(33, "Engineering"),
(34, "Clerical"),
(35, "Marketing")
) AS
D( DepID, DepName );
NOTE
Azure Data Lake Tools autodetects whether the DLL has any assembly dependencies. The dependencies are displayed
in the JSON file after they're detected.
You can upload your DLL resources (for example, .txt, .png, and .csv) as part of the assembly registration.
Another way to trigger the ADL: Register Assembly (Advanced) command is to right-click the .dll file in File
Explorer.
The following U-SQL code demonstrates how to call an assembly. In the sample, the assembly name is test.
REFERENCE ASSEMBLY [test];
@a =
EXTRACT
Iid int,
Starts DateTime,
Region string,
Query string,
DwellTime int,
Results string,
ClickedUrls string
FROM @"Sample/SearchLog.txt"
USING Extractors.Tsv();
@d =
SELECT DISTINCT Region
FROM @a;
@d1 =
PROCESS @d
PRODUCE
Region string,
Mkt string
USING new USQLApplication_codebehind.MyProcessor();
OUTPUT @d1
TO @"Sample/SearchLogtest.txt"
USING Outputters.Tsv();
Use U-SQL local run and local debug for Windows users
U-SQL local run tests your local data and validates your script locally before your code is published to Data Lake
Analytics. You can use the local debug feature to complete the following tasks before your code is submitted to
Data Lake Analytics:
Debug your C# code-behind.
Step through the code.
Validate your script locally.
The local run and local debug feature only works in Windows environments, and is not supported on macOS
and Linux-based operating systems.
For instructions on local run and local debug, see U-SQL local run and local debug with Visual Studio Code.
Connect to Azure
Before you can compile and run U-SQL scripts in Data Lake Analytics, you must connect to your Azure account.
4. Follow the instructions to sign in from the webpage. When you're connected, your Azure account name
appears on the status bar in the lower-left corner of the VS Code window.
NOTE
Data Lake Tools automatically signs you in the next time if you don't sign out.
If your account has two-factor authentication enabled, we recommend that you use phone authentication rather than a PIN.
You can't sign out from the explorer. To sign out, see To connect to Azure by using a command.
Learn how to use Visual Studio Code (VSCode) to write Python, R and C# code behind with U-SQL and submit
jobs to Azure Data Lake service. For more information about Azure Data Lake Tools for VSCode, see Use the
Azure Data Lake Tools for Visual Studio Code.
Before writing code-behind custom code, you need to open a folder or a workspace in VSCode.
NOTE
For best experiences on Python and R language service, please install VSCode Python and R extension.
@m =
REDUCE @t ON date
PRODUCE date string, mentions string
USING new Extension.Python.Reducer("pythonSample.usql.py", pyVersion : "3.5.1");
OUTPUT @m
TO "/tweetmentions.csv"
USING Outputters.Csv();
3. Right-click a script file, and then select ADL: Generate Python Code Behind File .
4. The xxx.usql.py file is generated in your working folder. Write your code in Python file. The following is a
code sample.
def get_mentions(tweet):
return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )
def usqlml_main(df):
del df['time']
del df['author']
df['mentions'] = df.tweet.apply(get_mentions)
del df['tweet']
return df
5. Right-click in the U-SQL file; you can then click Compile Script or Submit Job to run the job.
Develop R file
1. Click the New File in your workspace.
2. Write your code in U-SQL file. The following is a code sample.
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda";
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/LMPredictionsIris.txt";
DECLARE @PartitionCount int = 10;
@InputData =
EXTRACT SepalLength double,
SepalWidth double,
PetalLength double,
PetalWidth double,
Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData =
SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par,
SepalLength,
SepalWidth,
PetalLength,
PetalWidth
FROM @InputData;
// Predict Species
@RScriptOutput =
REDUCE @ExtendedData
ON Par
PRODUCE Par,
fit double,
lwr double,
upr double
READONLY Par
USING new Extension.R.Reducer(scriptFile : "RClusterRun.usql.R", rReturnType : "dataframe",
stringsAsFactors : false);
OUTPUT @RScriptOutput
TO @OutputFilePredictions
USING Outputters.Tsv();
3. Right-click in USQL file, and then select ADL: Generate R Code Behind File .
4. The xxx.usql.r file is generated in your working folder. Write your code in R file. The following is a code
sample.
load("my_model_LM_Iris.rda")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval="confidence"))
5. Right-click in the U-SQL file; you can then click Compile Script or Submit Job to run the job.
Develop C# file
A code-behind file is a C# file associated with a single U-SQL script. In the code-behind file, you can define UDOs, UDAs, UDTs, and UDFs that are dedicated to the script. The UDO, UDA, UDT, and UDF can be used directly in the script
without registering the assembly first. The code-behind file is put in the same folder as its peering U-SQL script
file. If the script is named xxx.usql, the code-behind is named as xxx.usql.cs. If you manually delete the code-
behind file, the code-behind feature is disabled for its associated U-SQL script. For more information about
writing customer code for U-SQL script, see Writing and Using Custom Code in U-SQL: User-Defined Functions.
1. Click the New File in your workspace.
2. Write your code in U-SQL file. The following is a code sample.
@a =
EXTRACT
Iid int,
Starts DateTime,
Region string,
Query string,
DwellTime int,
Results string,
ClickedUrls string
FROM @"/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@d =
SELECT DISTINCT Region
FROM @a;
@d1 =
PROCESS @d
PRODUCE
Region string,
Mkt string
USING new USQLApplication_codebehind.MyProcessor();
OUTPUT @d1
TO @"/output/SearchLogtest.txt"
USING Outputters.Tsv();
3. Right-click in USQL file, and then select ADL: Generate CS Code Behind File .
4. The xxx.usql.cs file is generated in your working folder. Write your code in CS file. The following is a
code sample.
namespace USQLApplication_codebehind
{
[SqlUserDefinedProcessor]
5. Right-click in the U-SQL file; you can then click Compile Script or Submit Job to run the job.
Next steps
Use the Azure Data Lake Tools for Visual Studio Code
U-SQL local run and local debug with Visual Studio Code
Get started with Data Lake Analytics using PowerShell
Get started with Data Lake Analytics using the Azure portal
Use Data Lake Tools for Visual Studio for developing U-SQL applications
Use Data Lake Analytics(U-SQL) catalog
Accessing resources with Azure Data Lake Tools
12/10/2021 • 6 minutes to read • Edit Online
You can easily access Azure Data Lake Analytics resources with Azure Data Lake Tools commands or actions in VS Code.
A more convenient way to list the relative path is through the shortcut menu.
To list the storage path through the shortcut menu
Right-click the path string and select List Path .
Another way to upload files to storage is through the shortcut menu on the file's full path or the file's relative
path in the script editor.
You can monitor the upload status.
Download a file
You can download a file by using the command ADL: Download File or ADL: Download File (Advanced) .
To download a file through the ADL: Download File (Advanced) command
1. Right-click the script editor, and then select Download File (Advanced) .
2. VS Code displays a JSON file. You can enter file paths and download multiple files at the same time.
Instructions are displayed in the Output window. To proceed to download the file or files, save (Ctrl+S)
the JSON file.
Another way to download storage files is through the shortcut menu on the file's full path or the file's relative
path in the script editor.
You can monitor the download status.
Check storage tasks' status
The upload and download status appears on the status bar. Select the status bar, and then the status appears on
the OUTPUT tab.
You can right-click the file node and then use the Preview , Download , Delete , Create EXTRACT Script
(available only for CSV, TSV, and TXT files), Copy Relative Path , and Copy Full Path commands on the
shortcut menu.
Integrate with Azure Blob storage from the explorer
Browse to Blob storage:
You can right-click the blob container node and then use the Refresh , Delete Blob Container , and
Upload Blob commands on the shortcut menu.
You can right-click the folder node and then use the Refresh and Upload Blob commands on the
shortcut menu.
You can right-click the file node and then use the Preview/Edit , Download , Delete , Create EXTRACT
Script (available only for CSV, TSV, and TXT files), Copy Relative Path , and Copy Full Path commands
on the shortcut menu.
Open the Data Lake explorer in the portal
1. Select Ctrl+Shift+P to open the command palette.
2. Enter Open Web Azure Storage Explorer or right-click a relative path or the full path in the script editor,
and then select Open Web Azure Storage Explorer .
3. Select a Data Lake Analytics account.
Data Lake Tools opens the Azure Storage path in the Azure portal. You can find the path and preview the file
from the web.
Additional features
Data Lake Tools for VS Code supports the following features:
IntelliSense autocomplete : Suggestions appear in pop-up windows around items like keywords,
methods, and variables. Different icons represent different types of objects:
Scalar data type
Complex data type
Built-in UDTs
.NET collection and classes
C# expressions
Built-in C# UDFs, UDOs, and UDAGGs
U-SQL functions
U-SQL windowing functions
IntelliSense autocomplete on Data Lake Analytics metadata : Data Lake Tools downloads the Data
Lake Analytics metadata information locally. The IntelliSense feature automatically populates objects from
the Data Lake Analytics metadata. These objects include the database, schema, table, view, table-valued
function, procedures, and C# assemblies.
IntelliSense error marker : Data Lake Tools underlines editing errors for U-SQL and C#.
Syntax highlighting : Data Lake Tools uses colors to differentiate items like variables, keywords, data types, and functions.
NOTE
We recommend that you upgrade to Azure Data Lake Tools for Visual Studio version 2.3.3000.4 or later. The previous
versions are no longer available for download and are now deprecated.
Next steps
Develop U-SQL with Python, R, and C Sharp for Azure Data Lake Analytics in VS Code
U-SQL local run and local debug with Visual Studio Code
Tutorial: Get started with Azure Data Lake Analytics
Tutorial: Develop U-SQL scripts by using Data Lake Tools for Visual Studio
Run U-SQL and debug locally in Visual Studio Code
12/10/2021 • 2 minutes to read • Edit Online
This article describes how to run U-SQL jobs on a local development machine to speed up early coding phases
or to debug code locally in Visual Studio Code. For instructions on Azure Data Lake Tool for Visual Studio Code,
see Use Azure Data Lake Tools for Visual Studio Code.
Only Windows installations of Azure Data Lake Tools for Visual Studio Code support the actions to run U-SQL locally and debug U-SQL locally. Installations on macOS and Linux-based operating systems do not support this feature.
2. Locate the dependency packages from the path shown in the Output pane, and then install BuildTools
and Win10SDK 10240. Here is an example path:
C:\Users\xxx\AppData\Roaming\LocalRunDependency
2.1 To install BuildTools , click visualcppbuildtools_full.exe in the LocalRunDependency folder, then follow
the wizard instructions.
2.2 To install Win10SDK 10240 , click sdksetup.exe in the
LocalRunDependency/Win10SDK_10.0.10240_2 folder, then follow the wizard instructions.
3. Set up the environment variable. Set the SCOPE_CPP_SDK environment variable to:
C:\Users\XXX\AppData\Roaming\LocalRunDependency\CppSDK_3rdparty
Start the local run service and submit the U-SQL job to a local
account
For the first-time user, use ADL: Download Local Run Package to download local run packages, if you have
not set up U-SQL local run environment.
1. Select Ctrl+Shift+P to open the command palette, and then enter ADL: Start Local Run Service.
2. Select Accept to accept the Microsoft Software License Terms for the first time.
3. The cmd console opens. For first-time users, you need to enter 3 , and then locate the local folder path for
your data input and output. If you are unsuccessful defining the path with backslashes, try forward
slashes. For other options, you can use the default values.
4. Select Ctrl+Shift+P to open the command palette, enter ADL: Submit Job , and then select Local to
submit the job to your local account.
5. After you submit the job, you can view the submission details. To view the submission details, select
jobUrl in the Output window. You can also view the job submission status from the cmd console. Enter
7 in the cmd console if you want to know more job details.
Start a local debug for the U-SQL job
For the first-time user:
1. Use ADL: Download Local Run Package to download local run packages, if you have not set up U-SQL
local run environment.
2. Install .NET Core SDK 2.0 as suggested in the message box, if not installed.
3. Install C# for Visual Studio Code as suggested in the message box if not installed. Click Install to continue,
and then restart VSCode.
Next steps
Use the Azure Data Lake Tools for Visual Studio Code
Develop U-SQL with Python, R, and CSharp for Azure Data Lake Analytics in VSCode
Get started with Data Lake Analytics using PowerShell
Get started with Data Lake Analytics using the Azure portal
Use Data Lake Tools for Visual Studio for developing U-SQL applications
Use Data Lake Analytics(U-SQL) catalog
U-SQL programmability guide overview
12/10/2021 • 8 minutes to read • Edit Online
U-SQL is a query language that's designed for big data workloads. One of the unique features of U-SQL
is the combination of the SQL-like declarative language with the extensibility and programmability that's
provided by C#. In this guide, we concentrate on the extensibility and programmability of the U-SQL language
that's enabled by C#.
Requirements
Download and install Azure Data Lake Tools for Visual Studio.
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0, "2017-03-39"),
("Woodgrove", 2700.0, "2017-04-10")
) AS D( customer, amount, date );
@results =
SELECT
customer,
amount,
date
FROM @a;
This script defines two RowSets: @a and @results . RowSet @results is defined from @a .
@results =
SELECT
customer,
amount,
DateTime.Parse(date) AS date
FROM @a;
DECLARE @dt = DateTime.Parse("2016/01/01");
@rs1 =
SELECT
Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")) AS dt,
dt AS olddt
FROM @rs0;
OUTPUT @rs1
TO @output_file
USING Outputters.Text();
@rs1 =
SELECT
MAX(guid) AS start_id,
MIN(dt) AS start_time,
MIN(Convert.ToDateTime(Convert.ToDateTime(dt<@default_dt?@default_dt:dt).ToString("yyyy-MM-dd"))) AS
start_zero_time,
MIN(USQL_Programmability.CustomFunctions.GetFiscalPeriod(dt)) AS start_fiscalperiod,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
user,
des
FROM @rs0
GROUP BY user, des;
Consult the assembly registration instructions that cover this topic in greater detail.
Use assembly versioning
Currently, U-SQL uses the .NET Framework version 4.7.2. So ensure that your own assemblies are compatible
with that version of the runtime.
As mentioned earlier, U-SQL runs code in a 64-bit (x64) format. So make sure that your code is compiled to run
on x64. Otherwise you get the incorrect format error shown earlier.
Each uploaded assembly DLL and resource file, such as a different runtime, a native assembly, or a config file,
can be at most 400 MB. The total size of deployed resources, either via DEPLOY RESOURCE or via references to
assemblies and their additional files, cannot exceed 3 GB.
Finally, note that each U-SQL database can only contain one version of any given assembly. For example, if you
need both version 7 and version 8 of the Newtonsoft Json.NET library, you need to register them in two
different databases. Furthermore, each script can only refer to one version of a given assembly DLL. In this
respect, U-SQL follows the C# assembly management and versioning semantics.
int FiscalQuarter=0;
if (FiscalMonth >=1 && FiscalMonth<=3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}
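The chain of if statements above maps a fiscal month (1 through 12) to its fiscal quarter. As a sketch, the same mapping can also be written more compactly with integer arithmetic:
// Equivalent mapping: months 1-3 -> quarter 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4.
int FiscalQuarter = (FiscalMonth + 2) / 3;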
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace USQL_Programmability
{
public class CustomFunctions
{
public static string GetFiscalPeriod(DateTime dt)
{
int FiscalMonth=0;
if (dt.Month < 7)
{
FiscalMonth = dt.Month + 6;
}
else
{
FiscalMonth = dt.Month - 6;
}
int FiscalQuarter=0;
if (FiscalMonth >=1 && FiscalMonth<=3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}
// Based on the sample output shown later in this article (for example "Q3:8"),
// the function ends by returning a string of the form "Q<quarter>:<month>".
return "Q" + FiscalQuarter.ToString() + ":" + FiscalMonth.ToString();
}
}
}
Now we are going to call this function from the base U-SQL script. To do this, we have to provide a fully
qualified name for the function, including the namespace, which in this case is
NameSpace.Class.Function(parameter).
USQL_Programmability.CustomFunctions.GetFiscalPeriod(dt)
@rs0 =
EXTRACT
guid Guid,
dt DateTime,
user String,
des String
FROM @input_file USING Extractors.Tsv();
@rs1 =
SELECT
MAX(guid) AS start_id,
MIN(dt) AS start_time,
MIN(Convert.ToDateTime(Convert.ToDateTime(dt<@default_dt?@default_dt:dt).ToString("yyyy-MM-dd"))) AS
start_zero_time,
MIN(USQL_Programmability.CustomFunctions.GetFiscalPeriod(dt)) AS start_fiscalperiod,
user,
des
FROM @rs0
GROUP BY user, des;
OUTPUT @rs1
TO @output_file
USING Outputters.Text();
0d8b9630-d5ca-11e5-8329-251efa3a2941,2016-02-11T07:04:17.2630000-08:00,2016-06-
01T00:00:00.0000000,"Q3:8","User1",""
20843640-d771-11e5-b87b-8b7265c75a44,2016-02-11T07:04:17.2630000-08:00,2016-06-
01T00:00:00.0000000,"Q3:8","User2",""
301f23d2-d690-11e5-9a98-4b4f60a1836f,2016-02-11T09:01:33.9720000-08:00,2016-06-
01T00:00:00.0000000,"Q3:8","User3",""
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace USQLApplication21
{
public class UserSession
{
static public string globalSession;
static public string StampUserSession(string eventTime, string PreviousRow, string Session)
{
if (!string.IsNullOrEmpty(PreviousRow))
{
double timeGap =
Convert.ToDateTime(eventTime).Subtract(Convert.ToDateTime(PreviousRow)).TotalMinutes;
if (timeGap <= 60) {return Session;}
else {return Guid.NewGuid().ToString();}
}
else {return Guid.NewGuid().ToString();}
}
static public string getStampUserSession(string Session)
{
    // Sketch of the getStampUserSession method described below: it keeps the global
    // session value and reinitializes it whenever the Session parameter changes.
    if (Session != globalSession && !string.IsNullOrEmpty(Session))
    {
        globalSession = Session;
    }
    return globalSession;
}
}
}
This example shows the global variable static public string globalSession; used inside the getStampUserSession function; it is reinitialized each time the Session parameter changes.
@records =
EXTRACT DataId string,
EventDateTime string,
UserName string,
UserSessionTimestamp string
FROM @in
USING Extractors.Tsv();
@rs1 =
SELECT
EventDateTime,
UserName,
LAG(EventDateTime, 1)
OVER(PARTITION BY UserName ORDER BY EventDateTime ASC) AS prevDateTime,
string.IsNullOrEmpty(LAG(EventDateTime, 1)
OVER(PARTITION BY UserName ORDER BY EventDateTime ASC)) AS Flag,
USQLApplication21.UserSession.StampUserSession
(
EventDateTime,
LAG(EventDateTime, 1) OVER(PARTITION BY UserName ORDER BY EventDateTime ASC),
LAG(UserSessionTimestamp, 1) OVER(PARTITION BY UserName ORDER BY EventDateTime ASC)
) AS UserSessionTimestamp
FROM @records;
@rs2 =
SELECT
EventDateTime,
UserName,
LAG(EventDateTime, 1)
OVER(PARTITION BY UserName ORDER BY EventDateTime ASC) AS prevDateTime,
string.IsNullOrEmpty( LAG(EventDateTime, 1) OVER(PARTITION BY UserName ORDER BY EventDateTime ASC))
AS Flag,
USQLApplication21.UserSession.getStampUserSession(UserSessionTimestamp) AS UserSessionTimestamp
FROM @rs1
WHERE UserName != "UserName";
OUTPUT @rs2
TO @out2
ORDER BY UserName, EventDateTime ASC
USING Outputters.Csv();
This example demonstrates a more complicated use-case scenario in which we use a global variable inside a
code-behind section that's applied to the entire memory rowset.
Next steps
U-SQL programmability guide - UDT and UDAGG
U-SQL programmability guide - UDO
U-SQL programmability guide - UDT and UDAGG
12/10/2021 • 10 minutes to read • Edit Online
NOTE
U-SQL’s built-in extractors and outputters currently cannot serialize or de-serialize UDT data to or from files even with the
IFormatter set. So when you're writing UDT data to a file with the OUTPUT statement, or reading it with an extractor, you
have to pass it as a string or byte array. Then you call the serialization and deserialization code (that is, the UDT’s ToString()
method) explicitly. User-defined extractors and outputters, on the other hand, can read and write UDTs.
If we try to use a UDT in an EXTRACTOR or OUTPUTTER (outside of the preceding SELECT), as shown here:
@rs1 =
SELECT
MyNameSpace.Myfunction_Returning_UDT(field1) AS myfield
FROM @rs0;
OUTPUT @rs1
TO @output_file
USING Outputters.Text();
Description:
Resolution:
Implement a custom outputter that knows how to serialize this type, or call a serialization method on the
type in
the preceding SELECT. C:\Users\sergeypu\Documents\Visual Studio 2013\Projects\USQL-Programmability\
USQL-Programmability\Types.usql 52 1 USQL-Programmability
To work with a UDT in an outputter, we either have to serialize it to a string with the ToString() method or create a custom outputter.
UDTs currently cannot be used in GROUP BY. If a UDT is used in GROUP BY, the following error is thrown:
Error 1 E_CSC_USER_INVALIDTYPEINCLAUSE: GROUP BY doesn't support type MyNameSpace.Myfunction_Returning_UDT
for column myfield
Description:
Resolution:
Add a SELECT statement where you can project a scalar column that you want to use with GROUP BY.
C:\Users\sergeypu\Documents\Visual Studio 2013\Projects\USQL-Programmability\USQL-Programmability\Types.usql
62 5 USQL-Programmability
using Microsoft.Analytics.Interfaces;
using System.IO;
2. Add Microsoft.Analytics.Interfaces , which is required for the UDT interfaces. In addition, System.IO
might be needed to define the IFormatter interface.
3. Define a user-defined type with the SqlUserDefinedType attribute.
SqlUserDefinedType is used to mark a type definition in an assembly as a user-defined type (UDT) in U-SQL.
The properties on the attribute reflect the physical characteristics of the UDT. This class cannot be inherited.
SqlUserDefinedType is a required attribute for UDT definition.
The constructor of the class:
SqlUserDefinedTypeAttribute (type formatter)
Type formatter: Required parameter to define a UDT formatter--specifically, the type of the IFormatter interface must be passed here.
[SqlUserDefinedType(typeof(MyTypeFormatter))]
public class MyType
{ … }
Typical UDT also requires definition of the IFormatter interface, as shown in the following example:
The IFormatter interface serializes and de-serializes an object graph with the root type T, where T is the root type for the object graph to serialize and de-serialize.
Deserialize : De-serializes the data on the provided stream and reconstitutes the graph of objects.
Serialize : Serializes an object, or graph of objects, with the given root to the provided stream.
MyType instance: Instance of the type.
IColumnWriter writer / IColumnReader reader: The underlying column stream.
ISerializationContext context: Enum that defines a set of flags that specifies the source or destination context
for the stream during serialization.
Intermediate : Specifies that the source or destination context is not a persisted store.
Persistence : Specifies that the source or destination context is a persisted store.
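As a sketch of how these members fit together for the MyType example above (assuming, for illustration only, that MyType exposes a single int property named Value, and that the column reader and writer expose the underlying stream through a BaseStream property):
using System.IO;
using Microsoft.Analytics.Interfaces;

public class MyTypeFormatter : IFormatter<MyType>
{
    public void Serialize(MyType instance, IColumnWriter writer, ISerializationContext context)
    {
        // Write the UDT's state to the underlying column stream.
        using (var binaryWriter = new BinaryWriter(writer.BaseStream))
        {
            binaryWriter.Write(instance.Value);
            binaryWriter.Flush();
        }
    }

    public MyType Deserialize(IColumnReader reader, ISerializationContext context)
    {
        // Reconstitute the UDT from the underlying column stream.
        using (var binaryReader = new BinaryReader(reader.BaseStream))
        {
            return new MyType { Value = binaryReader.ReadInt32() };
        }
    }
}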
As a regular C# type, a U-SQL UDT definition can include overrides for operators such as +/==/!=. It can also
include static methods. For example, if we are going to use this UDT as a parameter to a U-SQL MIN aggregate
function, we have to define the < operator override.
Earlier in this guide, we demonstrated an example for fiscal period identification from the specific date in the
format Qn:Pn (Q1:P10) . The following example shows how to define a custom type for fiscal period values.
Following is an example of a code-behind section with custom UDT and IFormatter interface:
[SqlUserDefinedType(typeof(FiscalPeriodFormatter))]
public struct FiscalPeriod
{
public int Quarter { get; private set; }

public int Month { get; private set; }

// The constructor, comparison operators, and ToString() follow in the full listing.
The defined type includes two numbers: quarter and month. Operators ==/!=/>/< and static method
ToString() are defined here.
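Because the listing above is abbreviated, the following is a sketch of what the operator and ToString() definitions inside the FiscalPeriod struct can look like; the exact implementation in the full example may differ:
// These members go inside the FiscalPeriod struct shown above.
public static bool operator ==(FiscalPeriod c1, FiscalPeriod c2)
{
    return c1.Quarter == c2.Quarter && c1.Month == c2.Month;
}

public static bool operator !=(FiscalPeriod c1, FiscalPeriod c2)
{
    return !(c1 == c2);
}

public static bool operator <(FiscalPeriod c1, FiscalPeriod c2)
{
    // Compare by quarter first, then by month, so the UDT can be used with MIN/MAX-style aggregates.
    return c1.Quarter < c2.Quarter || (c1.Quarter == c2.Quarter && c1.Month < c2.Month);
}

public static bool operator >(FiscalPeriod c1, FiscalPeriod c2)
{
    return c2 < c1;
}

public override bool Equals(object obj)
{
    return obj is FiscalPeriod && this == (FiscalPeriod)obj;
}

public override int GetHashCode()
{
    return this.Quarter * 100 + this.Month;
}

public override string ToString()
{
    // Matches the Qn:Pn format described earlier, for example "Q1:P10".
    return string.Format("Q{0}:P{1}", this.Quarter, this.Month);
}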
As mentioned earlier, UDT can be used in SELECT expressions, but cannot be used in OUTPUTTER/EXTRACTOR
without custom serialization. It either has to be serialized as a string with ToString() or used with a custom
OUTPUTTER/EXTRACTOR.
Now let’s discuss usage of UDT. In a code-behind section, we changed our GetFiscalPeriod function to the
following:
public static FiscalPeriod GetFiscalPeriodWithCustomType(DateTime dt)
{
int FiscalMonth = 0;
if (dt.Month < 7)
{
FiscalMonth = dt.Month + 6;
}
else
{
FiscalMonth = dt.Month - 6;
}
int FiscalQuarter = 0;
if (FiscalMonth >= 1 && FiscalMonth <= 3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}
// A sketch of the ending: return the custom type built from the computed quarter and month.
return new FiscalPeriod(FiscalQuarter, FiscalMonth);
}
@rs0 =
EXTRACT
guid string,
dt DateTime,
user String,
des String
FROM @input_file USING Extractors.Tsv();
@rs1 =
SELECT
guid AS start_id,
dt,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).Quarter AS fiscalquarter,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).Month AS fiscalmonth,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt) + new
USQL_Programmability.CustomFunctions.FiscalPeriod(1,7) AS fiscalperiod_adjusted,
user,
des
FROM @rs0;
@rs2 =
SELECT
start_id,
dt,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
fiscalquarter,
fiscalmonth,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).ToString() AS fiscalperiod,
// This user-defined type was created in the prior SELECT. Passing the UDT to this subsequent SELECT
// would have failed if the UDT was not annotated with an IFormatter.
fiscalperiod_adjusted.ToString() AS fiscalperiod_adjusted,
user,
des
FROM @rs1;
OUTPUT @rs2
TO @output_file
USING Outputters.Text();
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace USQL_Programmability
{
public class CustomFunctions
{
static public DateTime? ToDateTime(string dt)
{
    // Minimal sketch of a completion: return the parsed value,
    // or null when the input string is not a valid date/time.
    DateTime dtValue;
    if (DateTime.TryParse(dt, out dtValue))
    {
        return dtValue;
    }
    return null;
}

// The fragment below repeats the fiscal-quarter mapping and assumes FiscalMonth
// has already been computed, as in the GetFiscalPeriod function shown earlier.
int FiscalQuarter = 0;
if (FiscalMonth >= 1 && FiscalMonth <= 3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}
[SqlUserDefinedAggregate]
public abstract class IAggregate<T1, T2, TResult> : IAggregate
{
protected IAggregate();
SqlUserDefinedAggregate indicates that the type should be registered as a user-defined aggregate. This class
cannot be inherited.
The SqlUserDefinedAggregate attribute is optional for a UDAGG definition.
The base class allows you to pass three abstract parameters: two as input parameters and one as the result. The
data types are variable and should be defined during class inheritance.
Init invokes once for each group during computation. It provides an initialization routine for each
aggregation group.
Accumulate is executed once for each value. It provides the main functionality for the aggregation
algorithm. It can be used to aggregate values with various data types that are defined during class
inheritance. It can accept two parameters of variable data types.
Terminate is executed once per aggregation group at the end of processing to output the result for each
group.
To declare the correct input and output data types, define your aggregator class by inheriting from this base class; a sketch follows the script below. To call the user-defined aggregator in a U-SQL base script, use the AGG function:
AGG<UDAGG_functionname>(param1,param2)
@rs0 =
EXTRACT
guid string,
dt DateTime,
user String,
des String
FROM @input_file
USING Extractors.Tsv();
@rs1 =
SELECT
user,
AGG<USQL_Programmability.GuidAggregate>(guid,user) AS guid_list
FROM @rs0
GROUP BY user;
In this use-case scenario, we concatenate class GUIDs for the specific users.
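The GuidAggregate class itself is not shown above; a minimal sketch consistent with the IAggregate<T1, T2, TResult> contract described earlier could look like the following (the exact implementation in the full guide may differ):
using Microsoft.Analytics.Interfaces;

namespace USQL_Programmability
{
    public class GuidAggregate : IAggregate<string, string, string>
    {
        private string guidList;

        public override void Init()
        {
            // Runs once for each aggregation group.
            guidList = string.Empty;
        }

        public override void Accumulate(string guid, string user)
        {
            // Runs once for each input value; concatenate the GUIDs seen in this group.
            guidList += guid + "&";
        }

        public override string Terminate()
        {
            // Runs once per group at the end of processing and returns the aggregated result.
            return guidList;
        }
    }
}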
Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDO
U-SQL user-defined objects overview
12/10/2021 • 2 minutes to read • Edit Online
NOTE
UDOs are limited to 0.5 GB of memory. This memory limitation does not apply to local executions.
Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined extractor
12/10/2021 • 3 minutes to read • Edit Online
[SqlUserDefinedExtractor]
public class SampleExtractor : IExtractor
{
public SampleExtractor(string row_delimiter, char col_delimiter)
{ … }
The SqlUserDefinedExtractor attribute indicates that the type should be registered as a user-defined extractor.
This class cannot be inherited.
SqlUserDefinedExtractor is an optional attribute for UDE definition. It's used to define the AtomicFileProcessing property for the UDE object.
bool AtomicFileProcessing
true = Indicates that this extractor requires atomic input files (JSON, XML, ...)
false = Indicates that this extractor can deal with split / distributed files (CSV, SEQ, ...)
The main UDE programmability objects are input and output . The input object is used to enumerate input data
as IUnstructuredReader . The output object is used to set output data as a result of the extractor activity.
The input data is accessed through System.IO.Stream and System.IO.StreamReader .
For input columns enumeration, we first split the input stream by using a row delimiter.
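For orientation, here is a minimal, self-contained sketch of an Extract override that uses these objects. All names are illustrative, and it assumes the EXTRACT schema consists of string columns; the code fragment that follows is the tail of a more elaborate extractor.
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class SimpleTsvExtractor : IExtractor
{
    private readonly Encoding encoding;
    private readonly char columnDelimiter = '\t';

    public SimpleTsvExtractor(Encoding encoding = null)
    {
        this.encoding = encoding ?? Encoding.UTF8;
    }

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        // Enumerate the raw input stream line by line (newline is the row delimiter).
        using (var reader = new StreamReader(input.BaseStream, this.encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Split each line into columns and copy them into the output row.
                // Assumes the EXTRACT schema declares at least this many string columns.
                string[] parts = line.Split(this.columnDelimiter);
                for (int i = 0; i < parts.Length; i++)
                {
                    output.Set<string>(i, parts[i]);
                }
                yield return output.AsReadOnly();
            }
        }
    }
}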
output.Set<string>(count, part);
}
else
{
// keep the rest of the columns as-is
output.Set<string>(count, part);
}
count += 1;
}
}
yield return output.AsReadOnly();
}
yield break;
}
}
In this use-case scenario, the extractor regenerates the GUID for the “guid” column and converts the values of the “user” column to uppercase. Custom extractors can produce more complicated results by parsing input data and manipulating it.
Following is a base U-SQL script that uses a custom extractor:
DECLARE @input_file string = @"\usql-programmability\input_file.tsv";
DECLARE @output_file string = @"\usql-programmability\output_file.tsv";
@rs0 =
EXTRACT
guid Guid,
dt String,
user String,
des String
FROM @input_file
USING new USQL_Programmability.FullDescriptionExtractor(Encoding.UTF8);
Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined outputter
12/10/2021 • 5 minutes to read • Edit Online
All input parameters to the outputter, such as column/row delimiters, encoding, and so on, need to be defined in
the constructor of the class. The IOutputter interface should also contain a definition for void Output override.
The attribute [SqlUserDefinedOutputter(AtomicFileProcessing = true) can optionally be set for atomic file
processing. For more information, see the following details.
[SqlUserDefinedOutputter(AtomicFileProcessing = true)]
public class MyOutputter : IOutputter
{
Output is called for each input row. It returns the IUnstructuredWriter output rowset.
The Constructor class is used to pass parameters to the user-defined outputter.
Close can optionally be overridden to release expensive state or to determine when the last row was written.
SqlUserDefinedOutputter attribute indicates that the type should be registered as a user-defined outputter.
This class cannot be inherited.
SqlUserDefinedOutputter is an optional attribute for a user-defined outputter definition. It's used to define the
AtomicFileProcessing property.
bool AtomicFileProcessing
true = Indicates that this outputter requires atomic output files (JSON, XML, ...)
false = Indicates that this outputter can deal with split / distributed files (CSV, SEQ, ...)
The main programmability objects are row and output . The row object is used to enumerate output data as
IRow interface. Output is used to set output data to the target file.
The output data is accessed through the IRow interface. Output data is passed a row at a time.
The individual values are enumerated by calling the Get method of the IRow interface:
row.Get<string>("column_name")
This approach enables you to build a flexible outputter for any metadata schema.
The output data is written to file by using System.IO.StreamWriter . The stream parameter is set to
output.BaseStream as part of IUnstructuredWriter output .
Note that it's important to flush the data buffer to the file after each row iteration. In addition, the StreamWriter
object must be used with the Disposable attribute enabled (default) and with the using keyword:
using (StreamWriter streamWriter = new StreamWriter(output.BaseStream, this._encoding))
{
…
}
Otherwise, call Flush() method explicitly after each iteration. We show this in the following example.
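As a minimal sketch of an Output override that follows this pattern (illustrative names; it assumes every output column is a string and that the schema is enumerable as described above):
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedOutputter]
public class SimpleTextOutputter : IOutputter
{
    private readonly Encoding encoding = Encoding.UTF8;
    private readonly char columnDelimiter = '\t';

    public override void Output(IRow row, IUnstructuredWriter output)
    {
        // Write one delimited line per input row to the target stream.
        var streamWriter = new StreamWriter(output.BaseStream, this.encoding);

        var values = new List<string>();
        foreach (var column in row.Schema)
        {
            // Read each value through the IRow interface; assumes string columns.
            values.Add(row.Get<string>(column.Name) ?? string.Empty);
        }

        streamWriter.WriteLine(string.Join(this.columnDelimiter.ToString(), values));

        // Flush after each row, as described above.
        streamWriter.Flush();
    }
}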
Set headers and footers for user-defined outputter
To set a header, use single iteration execution flow.
…
if (isHeaderRow)
{
isHeaderRow = false;
}
…
}
}
[SqlUserDefinedOutputter(AtomicFileProcessing = true)]
public class HTMLOutputter : IOutputter
{
// Local variables initialization
private string row_delimiter;
private char col_delimiter;
private bool isHeaderRow;
private Encoding encoding;
private bool IsTableHeader = true;
private Stream g_writer;
// Parameters definition
public HTMLOutputter(bool isHeader = false, Encoding encoding = null)
{
this.isHeaderRow = isHeader;
this.encoding = ((encoding == null) ? Encoding.UTF8 : encoding);
}
// The Close method is used to write the footer to the file. It's executed only once, after all rows
public override void Close()
{
//Reference to IO.Stream object - g_writer
StreamWriter streamWriter = new StreamWriter(g_writer, this.encoding);
streamWriter.Write("</table>");
streamWriter.Flush();
streamWriter.Close();
}
if (isHeaderRow)
{
isHeaderRow = false;
}
// Reference to the instance of the IO.Stream object for footer generation
g_writer = output.BaseStream;
streamWriter.Flush();
}
}
@rs0 =
EXTRACT
guid Guid,
dt String,
user String,
des String
FROM @input_file
USING new USQL_Programmability.FullDescriptionExtractor(Encoding.UTF8);
OUTPUT @rs0
TO @output_file
USING new USQL_Programmability.HTMLOutputter(isHeader: true);
This is an HTML outputter, which creates an HTML file with table data.
Call outputter from U-SQL base script
To call a custom outputter from the base U-SQL script, the new instance of the outputter object has to be
created.
To avoid creating an instance of the object in base script, we can create a function wrapper, as shown in our
earlier example:
OUTPUT @rs0
TO @output_file
USING USQL_Programmability.Factory.HTMLOutputter(isHeader: true);
Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined processor
12/10/2021 • 2 minutes to read • Edit Online
[SqlUserDefinedProcessor]
public class MyProcessor: IProcessor
{
public override IRow Process(IRow input, IUpdatableRow output)
{
…
}
}
SqlUserDefinedProcessor indicates that the type should be registered as a user-defined processor. This class
cannot be inherited.
The SqlUserDefinedProcessor attribute is optional for UDP definition.
The main programmability objects are input and output . The input object is used to enumerate input columns
and output, and to set output data as a result of the processor activity.
For input columns enumeration, we use the input.Get method.
The parameter for input.Get method is a column that's passed as part of the PRODUCE clause of the PROCESS
statement of the U-SQL base script. We need to use the correct data type here.
For output, use the output.Set method.
It's important to note that a custom processor only outputs columns and values that are defined with the output.Set method call.
output.Set<string>("mycolumn", mycolumn);
In this use-case scenario, the processor is generating a new column called “full_description” by combining the
existing columns--in this case, “user” in upper case, and “des”. It also regenerates a GUID and returns the original
and new GUID values.
As you can see from the previous example, you can call C# methods during the output.Set method call.
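A sketch of a processor consistent with this description might look like the following; the actual FullDescriptionProcessor used in the script below may differ in detail:
using System;
using Microsoft.Analytics.Interfaces;

namespace USQL_Programmability
{
    [SqlUserDefinedProcessor]
    public class FullDescriptionProcessor : IProcessor
    {
        public override IRow Process(IRow input, IUpdatableRow output)
        {
            string user = input.Get<string>("user") ?? string.Empty;
            string des = input.Get<string>("des") ?? string.Empty;

            // Only the columns set here appear in the PRODUCE output.
            output.Set<string>("dt", input.Get<string>("dt"));
            output.Set<string>("full_description", user.ToUpper() + " : " + des);
            output.Set<Guid>("guid", input.Get<Guid>("guid"));
            output.Set<Guid>("new_guid", Guid.NewGuid());

            return output.AsReadOnly();
        }
    }
}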
Following is an example of base U-SQL script that uses a custom processor:
@rs0 =
EXTRACT
guid Guid,
dt String,
user String,
des String
FROM @input_file USING Extractors.Tsv();
@rs1 =
PROCESS @rs0
PRODUCE dt String,
full_description String,
guid Guid,
new_guid Guid
USING new USQL_Programmability.FullDescriptionProcessor();
Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined applier
12/10/2021 • 4 minutes to read • Edit Online
SELECT …
FROM …
CROSS APPLY
new MyScript.MyApplier(param1, param2) AS alias(output_param1 string, …);
The CROSS APPLY clause is used to pass parameters to the applier through its constructor.
For more information about using appliers in a SELECT expression, see U-SQL SELECT Selecting from CROSS
APPLY and OUTER APPLY.
The user-defined applier base class definition is as follows:
To define a user-defined applier, we need to implement the IApplier interface with the [ SqlUserDefinedApplier ] attribute, which is optional for a user-defined applier definition.
[SqlUserDefinedApplier]
public class ParserApplier : IApplier
{
public ParserApplier()
{
…
}
Apply is called for each row of the outer table. It returns the IUpdatableRow output rowset.
The Constructor class is used to pass parameters to the user-defined applier.
SqlUserDefinedApplier indicates that the type should be registered as a user-defined applier. This class
cannot be inherited.
SqlUserDefinedApplier is optional for a user-defined applier definition.
The main programmability objects are as follows:
Input rowsets are passed as IRow input. The output rows are generated as IUpdatableRow output interface.
Individual column names can be determined by calling the IRow Schema method.
To get the actual data values from the incoming IRow , we use the Get() method of IRow interface.
mycolumn = row.Get<int>("mycolumn")
row.Get<int>(row.Schema[0].Name)
output.Set<int>("mycolumn", mycolumn)
It is important to understand that custom appliers only output columns and values that are defined with
output.Set method call.
new USQL_Programmability.ParserApplier ("all") AS properties(make string, model string, year string, type
string, millage int);
@rs0 =
EXTRACT
stocknumber int,
vin String,
properties String
FROM @input_file USING Extractors.Tsv();
@rs1 =
SELECT
r.stocknumber,
r.vin,
properties.make,
properties.model,
properties.year,
properties.type,
properties.millage
FROM @rs0 AS r
CROSS APPLY
new USQL_Programmability.ParserApplier ("all") AS properties(make string, model string, year string,
type string, millage int);
In this use-case scenario, the user-defined applier acts as a comma-delimited value parser for the car fleet properties. The input file rows look like the following:
It is a typical tab-delimited TSV file with a properties column that contains car properties such as make and
model. Those properties must be parsed to the table columns. The applier that's provided also enables you to
generate a dynamic number of properties in the result rowset, based on the parameter that's passed. You can
generate either all properties or a specific set of properties only.
...USQL_Programmability.ParserApplier ("all")
...USQL_Programmability.ParserApplier ("make")
...USQL_Programmability.ParserApplier ("make&model")
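As a simplified sketch of such an applier (the property-string format assumed here is illustrative, and the handling of the selection parameter is omitted, so the real ParserApplier is more complete):
using System;
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

namespace USQL_Programmability
{
    [SqlUserDefinedApplier]
    public class ParserApplier : IApplier
    {
        private readonly string propertySelection;

        public ParserApplier(string propertySelection)
        {
            // "all", or a selection such as "make" or "make&model" (not applied in this sketch).
            this.propertySelection = propertySelection;
        }

        public override IEnumerable<IRow> Apply(IRow input, IUpdatableRow output)
        {
            // Assumed input format: "make,model,year,type,millage",
            // for example "Ford,Explorer,2005,SUV,20345".
            string[] properties = input.Get<string>("properties").Split(',');

            // Only the columns set here appear in the output rowset.
            output.Set<string>("make", properties[0]);
            output.Set<string>("model", properties[1]);
            output.Set<string>("year", properties[2]);
            output.Set<string>("type", properties[3]);
            output.Set<int>("millage", int.Parse(properties[4]));

            yield return output.AsReadOnly();
        }
    }
}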
Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined combiner
12/10/2021 • 4 minutes to read • Edit Online
Combine_Expression :=
'COMBINE' Combine_Input
'WITH' Combine_Input
Join_On_Clause
Produce_Clause
[Readonly_Clause]
[Required_Clause]
USING_Clause.
The custom implementation of an ICombiner interface should contain the definition for an IEnumerable<IRow>
Combine override.
[SqlUserDefinedCombiner]
public class MyCombiner : ICombiner
{
Input rowsets are passed as the left and right IRowset interfaces. Both rowsets must be enumerated for processing. You can only enumerate each interface once, so we have to enumerate and cache it if necessary.
For caching purposes, we can create a List<T> type of memory structure as a result of a LINQ query execution,
specifically List< IRow >. The anonymous data type can be used during enumeration as well.
See Introduction to LINQ Queries (C#) for more information about LINQ queries, and IEnumerable<T> Interface
for more information about IEnumerable<T> interface.
To get the actual data values from the incoming IRowset , we use the Get() method of IRow interface.
mycolumn = row.Get<int>("mycolumn")
Individual column names can be determined by calling the IRow Schema method.
c# row.Get<int>(row.Schema[0].Name)
After enumerating both rowsets, we are going to loop through all rows. For each row in the left rowset, we are
going to find all rows that satisfy the condition of our combiner.
The output values must be set with IUpdatableRow output.
output.Set<int>("mycolumn", mycolumn)
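Putting these pieces together, a Combine override has roughly the following shape. This is a simplified sketch that emits one output row per match; the fragment below shows part of the fuller CombineSales logic used in the base script.
using System.Collections.Generic;
using System.Linq;
using Microsoft.Analytics.Interfaces;

namespace USQL_Programmability
{
    [SqlUserDefinedCombiner]
    public class CombineSales : ICombiner
    {
        public override IEnumerable<IRow> Combine(IRowset left, IRowset right, IUpdatableRow output)
        {
            // Cache the right rowset first, because each IRowset can be enumerated only once.
            var resellerSales =
                (from row in right.Rows
                 select new
                 {
                     ProductKey = row.Get<int>("ProductKey"),
                     SalesAmount = row.Get<decimal>("SalesAmount"),
                     TaxAmt = row.Get<decimal>("TaxAmt")
                 }).ToList();

            // Stream the left rowset and emit one output row for every matching pair.
            foreach (var row_i in left.Rows)
            {
                int productKey = row_i.Get<int>("ProductKey");
                decimal internetAmount = row_i.Get<decimal>("SalesAmount") + row_i.Get<decimal>("TaxAmt");

                foreach (var row_r in resellerSales.Where(r => r.ProductKey == productKey))
                {
                    output.Set<int>("OrderDateKey", row_i.Get<int>("OrderDateKey"));
                    output.Set<int>("ProductKey", productKey);
                    output.Set<decimal>("Internet_Sales_Amount", internetAmount);
                    output.Set<decimal>("Reseller_Sales_Amount", row_r.SalesAmount + row_r.TaxAmt);

                    yield return output.AsReadOnly();
                }
            }
        }
    }
}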
var resellerSales =
(from row in right.Rows
select new
{
ProductKey = row.Get<int>("ProductKey"),
OrderDateKey = row.Get<int>("OrderDateKey"),
SalesAmount = row.Get<decimal>("SalesAmount"),
TaxAmt = row.Get<decimal>("TaxAmt")
}).ToList();
if (
row_i.OrderDateKey > 0
&& row_i.OrderDateKey < row_r.OrderDateKey
&& row_i.OrderDateKey == 20010701
&& (row_r.SalesAmount + row_r.TaxAmt) > 20000)
{
output.Set<int>("OrderDateKey", row_i.OrderDateKey);
output.Set<int>("ProductKey", row_i.ProductKey);
output.Set<decimal>("Internet_Sales_Amount", row_i.SalesAmount + row_i.TaxAmt);
output.Set<decimal>("Reseller_Sales_Amount", row_r.SalesAmount + row_r.TaxAmt);
}
}
}
yield return output.AsReadOnly();
}
}
In this use-case scenario, we are building an analytics report for the retailer. The goal is to find all products that
cost more than $20,000 and that sell through the website faster than through the regular retailer within a
certain time frame.
Here is the base U-SQL script. You can compare the logic between a regular JOIN and a combiner:
@fact_internet_sales =
EXTRACT
ProductKey int ,
OrderDateKey int ,
DueDateKey int ,
ShipDateKey int ,
CustomerKey int ,
PromotionKey int ,
CurrencyKey int ,
SalesTerritoryKey int ,
SalesOrderNumber String ,
SalesOrderLineNumber int ,
RevisionNumber int ,
OrderQuantity int ,
UnitPrice decimal ,
ExtendedAmount decimal,
UnitPriceDiscountPct float ,
DiscountAmount float ,
ProductStandardCost decimal ,
TotalProductCost decimal ,
SalesAmount decimal ,
TaxAmt decimal ,
Freight decimal ,
CarrierTrackingNumber String,
CustomerPONumber String
FROM @input_file_internet_sales
USING Extractors.Text(delimiter:'|', encoding: Encoding.Unicode);
@fact_reseller_sales =
EXTRACT
ProductKey int ,
OrderDateKey int ,
DueDateKey int ,
ShipDateKey int ,
ResellerKey int ,
EmployeeKey int ,
PromotionKey int ,
CurrencyKey int ,
SalesTerritoryKey int ,
SalesOrderNumber String ,
SalesOrderLineNumber int ,
RevisionNumber int ,
OrderQuantity int ,
UnitPrice decimal ,
ExtendedAmount decimal,
UnitPriceDiscountPct float ,
DiscountAmount float ,
ProductStandardCost decimal ,
TotalProductCost decimal ,
SalesAmount decimal ,
TaxAmt decimal ,
Freight decimal ,
CarrierTrackingNumber String,
CustomerPONumber String
FROM @input_file_reseller_sales
USING Extractors.Text(delimiter:'|', encoding: Encoding.Unicode);
@rs1 =
SELECT
fis.OrderDateKey,
fis.ProductKey,
fis.SalesAmount+fis.TaxAmt AS Internet_Sales_Amount,
frs.SalesAmount+frs.TaxAmt AS Reseller_Sales_Amount
FROM @fact_internet_sales AS fis
INNER JOIN @fact_reseller_sales AS frs
ON fis.ProductKey == frs.ProductKey
WHERE
fis.OrderDateKey < frs.OrderDateKey
AND fis.OrderDateKey == 20010701
AND frs.SalesAmount+frs.TaxAmt > 20000;
@rs2 =
COMBINE @fact_internet_sales AS fis
WITH @fact_reseller_sales AS frs
ON fis.ProductKey == frs.ProductKey
PRODUCE OrderDateKey int,
ProductKey int,
Internet_Sales_Amount decimal,
Reseller_Sales_Amount decimal
USING new USQL_Programmability.CombineSales();
USING MyNameSpace.MyCombiner();
Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Use user-defined reducer
12/10/2021 • 2 minutes to read • Edit Online
[SqlUserDefinedReducer]
public class EmptyUserReducer : IReducer
{
The SqlUserDefinedReducer attribute indicates that the type should be registered as a user-defined reducer.
This class cannot be inherited. SqlUserDefinedReducer is an optional attribute for a user-defined reducer
definition. It's used to define IsRecursive property.
bool IsRecursive
true = Indicates whether this Reducer is associative and commutative
The main programmability objects are input and output . The input object is used to enumerate input rows.
Output is used to set output rows as a result of reducing activity.
For input rows enumeration, we use the Row.Get method.
The parameter for the Row.Get method is a column that's passed as part of the PRODUCE clause of the REDUCE statement of the U-SQL base script. We need to use the correct data type here as well.
For output, use the output.Set method.
It is important to understand that a custom reducer only outputs values that are defined with the output.Set method call.
output.Set<string>("mycolumn", guid);
[SqlUserDefinedReducer]
public class EmptyUserReducer : IReducer
{
if (user.Length > 0)
{
output.Set<string>("guid", guid);
output.Set<DateTime>("dt", dt);
output.Set<string>("user", user);
output.Set<string>("des", des);
In this use-case scenario, the reducer skips rows with an empty user name. For each row in the rowset, it reads each required column and evaluates the length of the user name. It outputs the row only if the user name length is greater than 0.
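A sketch of the complete reducer, consistent with the fragment above, could look like the following (the exact implementation may differ):
using System;
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

namespace USQL_Programmability
{
    [SqlUserDefinedReducer]
    public class EmptyUserReducer : IReducer
    {
        public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
        {
            foreach (IRow row in input.Rows)
            {
                // Read the columns named in the PRODUCE clause of the REDUCE statement.
                string guid = row.Get<string>("guid");
                DateTime dt = row.Get<DateTime>("dt");
                string user = row.Get<string>("user");
                string des = row.Get<string>("des");

                // Skip rows with an empty user name; only rows set here are output.
                if (user.Length > 0)
                {
                    output.Set<string>("guid", guid);
                    output.Set<DateTime>("dt", dt);
                    output.Set<string>("user", user);
                    output.Set<string>("des", des);

                    yield return output.AsReadOnly();
                }
            }
        }
    }
}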
Following is a base U-SQL script that uses a custom reducer:
DECLARE @input_file string = @"\usql-programmability\input_file_reducer.tsv";
DECLARE @output_file string = @"\usql-programmability\output_file.tsv";
@rs0 =
EXTRACT
guid string,
dt DateTime,
user String,
des String
FROM @input_file
USING Extractors.Tsv();
@rs1 =
REDUCE @rs0 PRESORT guid
ON guid
PRODUCE guid string, dt DateTime, user String, des String
USING new USQL_Programmability.EmptyUserReducer();
@rs2 =
SELECT guid AS start_id,
dt AS start_time,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).ToString() AS
start_fiscalperiod,
user,
des
FROM @rs1;
OUTPUT @rs2
TO @output_file
USING Outputters.Text();
Next steps
U-SQL programmability guide - overview
U-SQL programmability guide - UDT and UDAGG
Schedule U-SQL jobs using SQL Server Integration
Services (SSIS)
12/10/2021 • 5 minutes to read • Edit Online
In this document, you learn how to orchestrate and create U-SQL jobs using SQL Server Integration Services (SSIS).
Prerequisites
Azure Feature Pack for Integration Services provides the Azure Data Lake Analytics task and the Azure Data Lake
Analytics Connection Manager that helps connect to Azure Data Lake Analytics service. To use this task, make
sure you install:
Download and install SQL Server Data Tools (SSDT) for Visual Studio
Install Azure Feature Pack for Integration Services (SSIS)
You can get the U-SQL script from different places by using SSIS built-in functions and tasks. The following scenarios show how you can configure the U-SQL scripts for different use cases.
Learn more about Azure Data Lake Store File System Task.
Configure Foreach Loop Container
1. In Collection page, set Enumerator to Foreach File Enumerator .
2. Set Folder under Enumerator configuration group to the temporary folder that includes the
downloaded U-SQL scripts.
3. Set Files under Enumerator configuration to *.usql so that the loop container only catches the files
ending with .usql .
4. In Variable Mappings page, add a user defined variable to get the file name for each U-SQL file. Set the
Index to 0 to get the file name. In this example, define a variable called User::FileName . This variable will
be used to dynamically get U-SQL script file connection and set U-SQL job name in Azure Data Lake
Analytics Task.
Configure Azure Data Lake Analytics Task
1. Set SourceType to FileConnection .
2. Set FileConnection to the file connection that points to the file objects returned from Foreach Loop
Container.
To create this file connection:
a. Choose <New Connection...> in FileConnection setting.
b. Set Usage type to Existing file , and set the File to any existing file's file path.
c. In Connection Managers view, right-click the file connection created just now, and choose
Proper ties .
d. In the Proper ties window, expand Expressions , and set ConnectionString to the variable
defined in Foreach Loop Container, for example, @[User::FileName] .
3. Set AzureDataLakeAnalyticsConnection to the Azure Data Lake Analytics account that you want to
submit jobs to. Learn more about Azure Data Lake Analytics Connection Manager.
4. Set other job configurations. Learn More.
5. Use Expressions to dynamically set U-SQL job name:
a. In Expressions page, add a new expression key-value pair for JobName .
b. Set the value for JobName to the variable defined in Foreach Loop Container, for example,
@[User::FileName] .
Scenario 3-Use U-SQL files in Azure Blob Storage
You can use U-SQL files in Azure Blob Storage by using Azure Blob Download Task in Azure Feature Pack.
This approach enables you to use scripts stored in the cloud.
The steps are similar to Scenario 2: Use U-SQL files in Azure Data Lake Store. Change the Azure Data Lake Store File System Task to the Azure Blob Download Task. Learn more about Azure Blob Download Task.
The control flow is like the following.
Next steps
Run SSIS packages in Azure
Azure Feature Pack for Integration Services (SSIS)
Schedule U-SQL jobs using Azure Data Factory
How to set up a CI/CD pipeline for Azure Data
Lake Analytics
12/10/2021 • 15 minutes to read • Edit Online
In this article, you learn how to set up a continuous integration and deployment (CI/CD) pipeline for U-SQL jobs
and U-SQL databases.
NOTE
This article uses the Azure Az PowerShell module, which is the recommended PowerShell module for interacting with
Azure. To get started with the Az PowerShell module, see Install Azure PowerShell. To learn how to migrate to the Az
PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
<!-- check for SDK Build target in current path then in USQLSDKPath-->
<Import Project="UsqlSDKBuild.targets" Condition="Exists('UsqlSDKBuild.targets')" />
<Import Project="$(USQLSDKPath)\UsqlSDKBuild.targets" Condition="!Exists('UsqlSDKBuild.targets') And
'$(USQLSDKPath)' != '' And Exists('$(USQLSDKPath)\UsqlSDKBuild.targets')" />
NOTE
The DROP statement may cause an accidental deletion. To enable DROP statements, you need to explicitly specify the MSBuild arguments. AllowDropStatement enables non-data-related DROP operations, like dropping an assembly or a table-valued function. AllowDataDropStatement enables data-related DROP operations, like dropping a table or a schema. You have to enable AllowDropStatement before using AllowDataDropStatement.
msbuild USQLBuild.usqlproj /p:USQLSDKPath=packages\Microsoft.Azure.DataLake.USQL.SDK.1.3.180615\build\runtime;USQLTargetType=SyntaxCheck;DataRoot=datarootfolder /p:EnableDeployment=true
/p:USQLSDKPath=$(Build.SourcesDirectory)/packages/Microsoft.Azure.DataLake.USQL.SDK.1.3.180615/build/
runtime /p:USQLTargetType=SyntaxCheck /p:DataRoot=$(Build.SourcesDirectory) /p:EnableDeployment=true
NOTE
The code-behind files for each U-SQL script will be merged as an inline statement to the script build output.
<#
This script can be used to submit U-SQL Jobs with given U-SQL project build output(.usqlpack file).
This will unzip the U-SQL project build output, and submit all scripts one-by-one.
Note: the code behind file for each U-SQL script will be merged into the built U-SQL script in build
output.
Example :
USQLJobSubmission.ps1 -ADLAAccountName "myadlaaccount" -ArtifactsRoot "C:\USQLProject\bin\debug\" -
DegreeOfParallelism 2
#>
param(
[Parameter(Mandatory=$true)][string]$ADLAAccountName, # ADLA account name to submit U-SQL jobs
[Parameter(Mandatory=$true)][string]$ArtifactsRoot, # Root folder of U-SQL project build output
[Parameter(Mandatory=$false)][string]$DegreeOfParallelism = 1
)
return $USQLFiles
}
# Submit each usql script and wait for completion before moving ahead.
foreach ($usqlFile in $usqlFiles)
{
$scriptName = "[Release].[$([System.IO.Path]::GetFileNameWithoutExtension($usqlFile.fullname))]"
LogJobInformation $jobToSubmit
Function LogJobInformation($jobInfo)
{
Write-Output "************************************************************************"
Write-Output ([string]::Format("Job Id: {0}", $(DefaultIfNull $jobInfo.JobId)))
Write-Output ([string]::Format("Job Name: {0}", $(DefaultIfNull $jobInfo.Name)))
Write-Output ([string]::Format("Job State: {0}", $(DefaultIfNull $jobInfo.State)))
Write-Output ([string]::Format("Job Started at: {0}", $(DefaultIfNull $jobInfo.StartTime)))
Write-Output ([string]::Format("Job Ended at: {0}", $(DefaultIfNull $jobInfo.EndTime)))
Write-Output ([string]::Format("Job Result: {0}", $(DefaultIfNull $jobInfo.Result)))
Write-Output "************************************************************************"
}
Function DefaultIfNull($item)
{
if ($item -ne $null)
{
return $item
}
return ""
}
Function Main()
{
Write-Output ([string]::Format("ADLA account: {0}", $ADLAAccountName))
Write-Output ([string]::Format("Root folde for usqlpack: {0}", $ArtifactsRoot))
Write-Output ([string]::Format("AU count: {0}", $DegreeOfParallelism))
SubmitAnalyticsJob
Main
NOTE
The commands: Submit-AzDataLakeAnalyticsJob and Wait-AzDataLakeAnalyticsJob are both Azure PowerShell
cmdlets for Azure Data Lake Analytics in the Azure Resource Manager framework. You'll need a workstation with Azure
PowerShell installed. You can refer to the command list for more commands and examples.
Example :
FileUpload.ps1 -ADLSName "myadlsaccount" -ArtifactsRoot "C:\USQLProject\bin\debug\"
#>
param(
[Parameter(Mandatory=$true)][string]$ADLSName, # ADLS account name to upload U-SQL scripts
[Parameter(Mandatory=$true)][string]$ArtifactsRoot, # Root folder of U-SQL project build output
[Parameter(Mandatory=$false)][string]$DestinationFolder = "USQLScriptSource" # Destination folder in ADLS
)
Function UploadResources()
{
Write-Host "************************************************************************"
Write-Host "Uploading files to $ADLSName"
Write-Host "***********************************************************************"
$usqlScripts = GetUsqlFiles
Function GetUsqlFiles()
{
return Get-ChildItem -Path $UnzipOutput -Include *.usql -File -Recurse -ErrorAction SilentlyContinue
}
UploadResources
msbuild DatabaseProject.usqldbproj
/p:USQLSDKPath=packages\Microsoft.Azure.DataLake.USQL.SDK.1.3.180615\build\runtime
The argument USQLSDKPath=<U-SQL Nuget package>\build\runtime refers to the install path of the NuGet package
for the U-SQL language service.
Continuous integration with Azure Pipelines
In addition to the command line, you can use Visual Studio Build or an MSBuild task to build U-SQL database
projects in Azure Pipelines. To set up a build task, make sure to add two tasks in the build pipeline: a NuGet
restore task and an MSBuild task.
1. Add a NuGet restore task to get the solution-referenced NuGet package, which includes
Azure.DataLake.USQL.SDK , so that MSBuild can find the U-SQL language targets. Set Advanced >
Destination director y to $(Build.SourcesDirectory)/packages if you want to use the MSBuild
arguments sample directly in step 2.
2. Set MSBuild arguments in Visual Studio build tools or in an MSBuild task as shown in the following
example. Or you can define variables for these arguments in the Azure Pipelines build pipeline.
/p:USQLSDKPath=$(Build.SourcesDirectory)/packages/Microsoft.Azure.DataLake.USQL.SDK.1.3.180615/build/
runtime
NOTE
PowerShell command-line support and Azure Pipelines release task support for U-SQL database deployment is currently
pending.
Take the following steps to set up a database deployment task in Azure Pipelines:
1. Add a PowerShell Script task in a build or release pipeline and execute the following PowerShell script.
This task helps to get the Azure SDK dependencies for PackageDeploymentTool.exe and PackageDeploymentTool.exe itself. You can set the -AzureSDK and -DBDeploymentTool parameters to load the dependencies and the deployment tool into specific folders. Pass the -AzureSDK path to PackageDeploymentTool.exe as the -AzureSDKPath parameter in step 2.
<#
This script is used for getting dependencies and SDKs for U-SQL database deployment.
PowerShell command line support for deploying U-SQL database package(.usqldbpack file) will come
soon.
Example :
GetUSQLDBDeploymentSDK.ps1 -AzureSDK "AzureSDKFolderPath" -DBDeploymentTool
"DBDeploymentToolFolderPath"
#>
param (
[string]$AzureSDK = "AzureSDK", # Folder to cache Azure SDK dependencies
[string]$DBDeploymentTool = "DBDeploymentTool", # Folder to cache U-SQL database deployment tool
[string]$workingfolder = "" # Folder to execute these command lines
)
if ([string]::IsNullOrEmpty($workingfolder))
if ([string]::IsNullOrEmpty($workingfolder))
{
$scriptpath = $MyInvocation.MyCommand.Path
$workingfolder = Split-Path $scriptpath
}
cd $workingfolder
iwr https://www.nuget.org/api/v2/package/Microsoft.Azure.Management.DataLake.Analytics/3.5.1-preview
-outf Microsoft.Azure.Management.DataLake.Analytics.3.5.1-preview.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Azure.Management.DataLake.Store/2.4.1-preview -
outf Microsoft.Azure.Management.DataLake.Store.2.4.1-preview.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.IdentityModel.Clients.ActiveDirectory/2.28.3 -outf
Microsoft.IdentityModel.Clients.ActiveDirectory.2.28.3.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Rest.ClientRuntime/2.3.11 -outf
Microsoft.Rest.ClientRuntime.2.3.11.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Rest.ClientRuntime.Azure/3.3.7 -outf
Microsoft.Rest.ClientRuntime.Azure.3.3.7.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Rest.ClientRuntime.Azure.Authentication/2.3.3 -
outf Microsoft.Rest.ClientRuntime.Azure.Authentication.2.3.3.zip
iwr https://www.nuget.org/api/v2/package/Newtonsoft.Json/6.0.8 -outf Newtonsoft.Json.6.0.8.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Azure.DataLake.USQL.SDK/ -outf USQLSDK.zip
2. Add a Command-Line task in a build or release pipeline and fill in the script by calling
PackageDeploymentTool.exe . PackageDeploymentTool.exe is located under the defined
$DBDeploymentTool folder. The sample script is as follows:
Deploy a U-SQL database locally:
Use interactive authentication mode to deploy a U-SQL database to an Azure Data Lake Analytics
account:
Use secrete authentication to deploy a U-SQL database to an Azure Data Lake Analytics account:
Use certFile authentication to deploy a U-SQL database to an Azure Data Lake Analytics account:
SecreteFile    The file that saves the secrete (password) for non-interactive authentication. Make sure to keep it readable only by the current user.    null    Required for non-interactive authentication, or else use Secrete.
Next steps
How to test your Azure Data Lake Analytics code.
Run U-SQL script on your local machine.
Use U-SQL database project to develop U-SQL database.
Best practices for managing U-SQL assemblies in a
CI/CD pipeline
12/10/2021 • 3 minutes to read • Edit Online
In this article, you learn how to manage U-SQL assembly source code with the newly introduced U-SQL
database project. You also learn how to set up a continuous integration and deployment (CI/CD) pipeline for
assembly registration by using Azure DevOps.
4. Add a reference to the C# class library project for the U-SQL database project.
5. Create an assembly script in the U-SQL database project by right-clicking the project and selecting Add
New Item .
6. Open the assembly script in the assembly design view. Select the referenced assembly from the Create
assembly from reference drop-down menu.
7. Add Managed Dependencies and Additional Files , if there are any. When you add additional files, the
tool uses the relative path to make sure it can find the assemblies on your local machine and on the build
machine later.
@_DeployTempDirectory in the editor window at the bottom is a predefined variable that points the tool to
the build output folder. Under the build output folder, every assembly has a subfolder named with the assembly
name. All DLLs and additional files are in that subfolder.
In Azure DevOps, you can use a command-line task and this SDK to set up an automation pipeline for the U-SQL
database refresh. Learn more about the SDK and how to set up a CI/CD pipeline for U-SQL database
deployment.
Next steps
Set up a CI/CD pipeline for Azure Data Lake Analytics
Test your Azure Data Lake Analytics code
Run U-SQL script on your local machine
Test your Azure Data Lake Analytics code
12/10/2021 • 5 minutes to read • Edit Online
Azure Data Lake provides the U-SQL language. U-SQL combines declarative SQL with imperative C# to process
data at any scale. In this document, you learn how to create test cases for U-SQL and extended C# user-defined
operator (UDO) code.
When you call the Initialize() interface in the U-SQL test SDK, a temporary local data root folder is created
under the working directory of the test project. All files and folders in the test data source folder are copied to
the temporary local data root folder before you run the U-SQL script test cases. You can add more test data
source folders by splitting the test data folder path with a semicolon.
Manage the database environment for testing
If your U-SQL scripts use or query with U-SQL database objects, you need to initialize the database environment
before you run U-SQL test cases. This approach can be necessary when calling stored procedures. The
Initialize() interface in the U-SQL test SDK helps you deploy all databases that are referenced by the U-SQL
project to the temporary local data root folder in the working directory of the test project.
For more information about how to manage U-SQL database project references for a U-SQL project, see
Reference a U-SQL database project.
Verify test results
The Run() interface returns a job execution result. 0 means success, and 1 means failure. You can also use C#
assert functions to verify the outputs.
Run test cases in Visual Studio
A U-SQL script test project is built on top of a C# unit test framework. After you build the project, select Test >
Windows > Test Explorer . You can run test cases from Test Explorer . Alternatively, right-click the .cs file in
your unit test and select Run Tests .
Test C# UDOs
Create test cases for C# UDOs
You can use a C# unit test framework to test your C# user-defined operators (UDOs). When testing UDOs, you
need to prepare corresponding IRowset objects as inputs.
There are two ways to create an IRowset object:
Load data from a file to create IRowset :
What is CPPSDK?
CPPSDK is a package that includes Microsoft Visual C++ 14 and Windows SDK 10.0.10240.0. This package
includes the environment that's needed by the U-SQL runtime. You can get this package under the Azure Data
Lake Tools for Visual Studio installation folder:
For Visual Studio 2015, it is under
C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\Extensions\Microsoft\Microsoft Azure Data
Lake Tools for Visual Studio 2015\X.X.XXXX.X\CppSDK
For Visual Studio 2017, it is under
C:\Program Files (x86)\Microsoft Visual Studio\2017\<Visual Studio Edition>\SDK\ScopeCppSDK
For Visual Studio 2019, it is under
C:\Program Files (x86)\Microsoft Visual Studio\2019\<Visual Studio Edition>\SDK\ScopeCppSDK
Next steps
How to set up CI/CD pipeline for Azure Data Lake Analytics
Run U-SQL script on your local machine
Use U-SQL database project to develop U-SQL database
Run and test U-SQL with Azure Data Lake U-SQL
SDK
12/10/2021 • 11 minutes to read • Edit Online
When developing U-SQL scripts, it is common to run and test them locally before submitting them to the cloud. Azure Data Lake provides a NuGet package called Azure Data Lake U-SQL SDK for this scenario, through which you can easily scale U-SQL runs and tests. It is also possible to integrate this U-SQL testing with a CI (continuous integration) system to automate compilation and testing.
If you want to manually run and debug U-SQL scripts locally with GUI tooling, you can use Azure Data Lake Tools for Visual Studio for that. You can learn more from here.
In this case, the U-SQL local compiler cannot find the dependencies automatically. You need to
specify the CppSDK path for it. You can either copy the files to another location or use it as is.
/abc/def/input.csv C:\LocalRunDataRoot\abc\def\input.csv
abc/def/input.csv C:\LocalRunDataRoot\abc\def\input.csv
D:/abc/def/input.csv D:\abc\def\input.csv
Working directory
When running the U-SQL script locally, a working directory is created during compilation under the current running directory. In addition to the compilation outputs, the needed runtime files for local execution will be shadow copied to this working directory. The working directory root folder is called "ScopeWorkDir" and the files under the working directory are as follows:
Run LocalRunHelper.exe without arguments or with the help switch to show the help information:
> LocalRunHelper.exe help
Command 'help' : Show usage information
Command 'compile' : Compile the script
Required Arguments :
-Script param
Script File Path
Optional Arguments :
-Shallow [default value 'False']
Shallow compile
Define a new environment variable called SCOPE_CPP_SDK to point to this directory, or copy the folder to another location and point SCOPE_CPP_SDK to that location.
In addition to setting the environment variable, you can specify the -CppSDK argument when you're
using the command line. This argument overwrites your default CppSDK environment variable.
Set the LOCALRUN_DATAROOT environment variable.
Define a new environment variable called LOCALRUN_DATAROOT that points to the data root.
In addition to setting the environment variable, you can specify the -DataRoot argument with the data-
root path when you're using a command line. This argument overwrites your default data-root
environment variable. You need to add this argument to every command line you're running so that you
can overwrite the default data-root environment variable for all operations.
SDK command line usage samples
Compile and run
The run command is used to compile the script and then execute compiled results. Its command-line arguments
are a combination of those from compile and execute .
Here's an example:
LocalRunHelper run -Script d:\test\test1.usql -WorkDir d:\test\bin -CodeBehind -References "d:\asm\ref1.dll;d:\asm\ref2.dll" -UseDatabase testDB -Parallel 5 -Verbose
Besides combining compile and execute , you can compile and execute the compiled executables separately.
Compile a U-SQL script
The compile command is used to compile a U-SQL script to executables.
-CodeBehind [default value 'False'] The script has .cs code behind
-DataRoot [default value 'DataRoot environment variable'] DataRoot for local run, default to 'LOCALRUN_DATAROOT'
environment variable
-References [default value ''] List of paths to extra reference assemblies or data files of
code behind, separated by ';'
-UseDatabase [default value 'master'] Database to use for code behind temporary assembly
registration
-WorkDir [default value 'Current Directory'] Directory for compiler usage and outputs
-ScopeCEPTempPath [default value 'temp'] Temp path to use for streaming data
Compile a U-SQL script and set the data-root folder. Note that this will overwrite the set environment variable.
Compile a U-SQL script and set a working directory, reference assembly, and database:
The U-SQL SDK only supports the x64 environment, so make sure to set the build platform target to x64. You can set that through Project Property > Build > Platform target.
Make sure to set your test environment to x64. In Visual Studio, you can set it through Test > Test Settings > Default Processor Architecture > x64.
Make sure to copy all dependency files under NugetPackage\build\runtime\ to the project working directory, which is usually under ProjectFolder\bin\x64\Debug.
Step 2: Create U -SQL script test case
Below is the sample code for a U-SQL script test. For testing, you need to prepare scripts, input files, and expected output files.
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.IO;
using System.Text;
using System.Security.Cryptography;
using Microsoft.Analytics.LocalRun;
namespace UnitTestProject1
{
[TestClass]
public class USQLUnitTest
{
[TestMethod]
public void TestUSQLScript()
{
//Specify the local run message output path
StreamWriter MessageOutput = new StreamWriter("../../../log.txt");
LocalRunHelper localrun = new LocalRunHelper(MessageOutput);
//Configure the DataRoot path, script path, and CPPSDK path
localrun.DataRoot = "../../../";
localrun.ScriptPath = "../../../Script/Script.usql";
localrun.CppSdkDir = "../../../CppSDK";
//Run U-SQL script
localrun.DoRun();
//Script output
string Result = Path.Combine(localrun.DataRoot, "Output/result.csv");
//Expected script output
string ExpectedResult = "../../../ExpectedOutput/result.csv";
Test.Helpers.FileAssert.AreEqual(Result, ExpectedResult);
//Don't forget to close MessageOutput to get logs into file
MessageOutput.Close();
}
}
}
namespace Test.Helpers
{
public static class FileAssert
{
static string GetFileHash(string filename)
{
Assert.IsTrue(File.Exists(filename));
using (var hash = new SHA1Managed())
{
var clearBytes = File.ReadAllBytes(filename);
var hashedBytes = hash.ComputeHash(clearBytes);
return ConvertBytesToHex(hashedBytes);
}
}
static string ConvertBytesToHex(byte[] bytes)
{
var sb = new StringBuilder();
for (var i = 0; i < bytes.Length; i++)
{
sb.Append(bytes[i].ToString("x"));
}
return sb.ToString();
}
public static void AreEqual(string filename1, string filename2)
{
string hash1 = GetFileHash(filename1);
string hash2 = GetFileHash(filename2);
Assert.AreEqual(hash1, hash2);
}
}
}
Properties
PROPERTY    TYPE    DESCRIPTION
Method
METHOD    DESCRIPTION    RETURN    PARAMETER
public bool IsValidRuntimeDir(string path)    Check if the given path is a valid runtime path    True for valid    The path of the runtime directory
Next steps
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics.
To see a more complex query, see Analyze website logs using Azure Data Lake Analytics.
To view job details, see Use Job Browser and Job View for Azure Data Lake Analytics jobs.
To use the vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.
Understand Apache Spark for U-SQL developers
12/10/2021 • 2 minutes to read • Edit Online
Microsoft supports several Analytics services such as Azure Databricks and Azure HDInsight as well as Azure
Data Lake Analytics. We hear from developers that they have a clear preference for open-source solutions as
they build analytics pipelines. To help U-SQL developers understand Apache Spark, and how you might
transform your U-SQL scripts to Apache Spark, we've created this guidance.
It includes a number of steps you can take, and several alternatives.
Both Azure Databricks and Azure HDInsight Spark are cluster services and not serverless jobs like Azure Data
Lake Analytics. You will have to consider how to provision the clusters to get the appropriate cost/performance
ratio and how to manage their lifetime to minimize your costs. These services have different performance
characteristics with user code written in .NET, so you will have to either write wrappers or rewrite your code in a
supported language. For more information, see Understand Spark data formats, Understand Apache Spark code
concepts for U-SQL developers, and .NET for Apache Spark.
Next steps
Understand Spark data formats for U-SQL developers
Understand Spark code concepts for U-SQL developers
Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage
Gen2
.NET for Apache Spark
Transform data using Hadoop Hive activity in Azure Data Factory
Transform data using Spark activity in Azure Data Factory
What is Apache Spark in Azure HDInsight
Understand differences between U-SQL and Spark
data formats
12/10/2021 • 2 minutes to read • Edit Online
If you want to use either Azure Databricks or Azure HDInsight Spark, we recommend that you migrate your data
from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2.
In addition to moving your files, you'll also want to make your data, stored in U-SQL tables, accessible to Spark.
Caveats
Data semantics When copying files, the copy occurs at the byte level, so the same data should appear in
the Azure Data Lake Storage Gen2 account. Note, however, that Spark may interpret some characters
differently. For example, it may use a different default row delimiter for a CSV file (see the sketch after
this list). Furthermore, if you're copying typed data (from tables), then Parquet and Spark may have
different precision and scale for some of the typed values (for example, a float) and may treat null values
differently. For example, U-SQL has the C# semantics for null values, while Spark has a three-valued logic
for null values.
Data organization (partitioning) U-SQL tables provide two-level partitioning. The outer level (
PARTITIONED BY ) is by value and maps mostly into the Hive/Spark partitioning scheme using folder
hierarchies. You will need to ensure that the null values are mapped to the right folder. The inner level (
DISTRIBUTED BY ) in U-SQL offers four distribution schemes: round robin, range, hash, and direct hash.
Hive/Spark tables only support value partitioning or hash partitioning, using a different hash function
than U-SQL. When you output your U-SQL table data, you will probably only be able to map into the
value partitioning for Spark and may need to do further tuning of your data layout depending on your
final Spark queries.
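As a concrete illustration of pinning these semantics explicitly rather than relying on Spark's CSV defaults, the following minimal sketch uses .NET for Apache Spark. The file path, row delimiter, and null marker are assumptions chosen for illustration, not values from this article:
using Microsoft.Spark.Sql;

class ReadMigratedCsv
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("read-migrated-csv").GetOrCreate();

        // Pin the row delimiter and null marker explicitly so Spark does not
        // silently apply defaults that differ from what the U-SQL job produced.
        DataFrame df = spark.Read()
            .Option("header", "false")
            .Option("lineSep", "\r\n")   // assumed row delimiter of the source files
            .Option("nullValue", "")     // assumed representation of null fields
            .Csv("abfss://data@youraccount.dfs.core.windows.net/output/result.csv");

        df.PrintSchema();
        df.Show();
    }
}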
Next steps
Understand Spark code concepts for U-SQL developers
Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage
Gen2
.NET for Apache Spark
Transform data using Spark activity in Azure Data Factory
Transform data using Hadoop Hive activity in Azure Data Factory
What is Apache Spark in Azure HDInsight
Understand Apache Spark code for U-SQL
developers
12/10/2021 • 11 minutes to read • Edit Online
This section provides high-level guidance on transforming U-SQL scripts to Apache Spark.
It starts with a comparison of the two languages' processing paradigms, and then provides tips on how to
transform:
Scripts, including U-SQL's rowset expressions
.NET code
Data types
Catalog objects
U-SQL          SPARK SCALA          PYSPARK
byte
uint
ulong
ushort
char Char
Guid
Spark offers equivalent expressions in both its DSL and SparkSQL form for most of these expressions. Some of
the expressions not supported natively in Spark will have to be rewritten using a combination of the native
Spark expressions and semantically equivalent patterns. For example, OUTER UNION will have to be translated
into the equivalent combination of projections and unions.
Due to the different handling of NULL values, a U-SQL join will always match a row if both of the columns being
compared contain a NULL value, while a join in Spark will not match such columns unless explicit null checks are
added.
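The sketch below shows one way to approximate U-SQL's NULL-matching join behavior, using Spark SQL's null-safe equality operator <=> through .NET for Apache Spark. The view names, join column, and file paths are illustrative assumptions:
using Microsoft.Spark.Sql;

class NullSafeJoin
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("null-safe-join").GetOrCreate();

        // Register two illustrative inputs as temp views.
        spark.Read().Option("header", "true").Csv("/data/left.csv").CreateOrReplaceTempView("left_rows");
        spark.Read().Option("header", "true").Csv("/data/right.csv").CreateOrReplaceTempView("right_rows");

        // <=> is Spark SQL's null-safe equality: NULL <=> NULL evaluates to true,
        // which mimics U-SQL's behavior of matching rows whose join keys are both NULL.
        DataFrame joined = spark.Sql(
            "SELECT * FROM left_rows l JOIN right_rows r ON l.k <=> r.k");
        joined.Show();
    }
}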
A scalar expression assigned to a variable and then printed looks like this in Scala:
var x = 2 * 3;
println(x)
U-SQL's system variables (variables starting with @@ ) can be split into two categories:
Settable system variables that can be set to specific values to impact the script's behavior
Informational system variables that expose system- and job-level information
Most of the settable system variables have no direct equivalent in Spark. Some of the informational system
variables can be modeled by passing the information as arguments during job execution, others may have an
equivalent function in Spark's hosting language.
U -SQL hints
U-SQL offers several syntactic ways to provide hints to the query optimizer and execution engine:
Setting a U-SQL system variable
An OPTION clause associated with the rowset expression to provide a data or plan hint
A join hint in the syntax of the join expression (for example, BROADCASTLEFT )
Spark's cost-based query optimizer has its own capabilities to provide hints and tune the query performance.
Please refer to the corresponding documentation.
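As one small illustration, Spark SQL accepts join hints directly in the query text. This sketch, again via .NET for Apache Spark with assumed view and column names, requests that the relation aliased as r be broadcast, comparable in spirit to U-SQL's join hints:
using Microsoft.Spark.Sql;

class BroadcastHint
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("broadcast-hint").GetOrCreate();

        // Assumes left_rows and right_rows temp views were registered as in the earlier sketch.
        // The /*+ BROADCAST(r) */ hint asks Spark to broadcast the relation aliased as r.
        DataFrame joined = spark.Sql(
            "SELECT /*+ BROADCAST(r) */ * FROM left_rows l JOIN right_rows r ON l.k = r.k");
        joined.Explain();
    }
}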
Next steps
Understand Spark data formats for U-SQL developers
.NET for Apache Spark
Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage
Gen2
Transform data using Spark activity in Azure Data Factory
Transform data using Hadoop Hive activity in Azure Data Factory
What is Apache Spark in Azure HDInsight
Learn how to troubleshoot U-SQL runtime failures
due to runtime changes
12/10/2021 • 4 minutes to read • Edit Online
The Azure Data Lake U-SQL runtime, including the compiler, optimizer, and job manager, is what processes your
U-SQL code.
Choosing a runtime that is different from the default has the potential to break your U-SQL jobs. Use these
other versions for testing only.
In rare cases, Microsoft Support may pin a different version of a runtime as the default for your account. Please
ensure that you revert this pin as soon as possible. If you remain pinned to that version, it will expire at some
later date.
Monitoring your jobs' U-SQL runtime version
You can see the history of which runtime version your past jobs have used in your account's job history via the
Visual Studio's job browser or the Azure portal's job history.
1. In the Azure portal, go to your Data Lake Analytics account.
2. Select View All Jobs . A list of all the active and recently finished jobs in the account appears.
3. Optionally, click Filter to help you find the jobs by Time Range , Job Name , and Author values.
4. You can see the runtime used in the completed jobs.
The available runtime versions change over time. The default runtime is always called "default", and we keep at
least the previous runtime available for some time, as well as making special runtimes available for a variety of
reasons. Explicitly named runtimes generally follow this format (YYYYMMDD and buildno are variable parts and
[] indicates optional parts):
release_YYYYMMDD_adl_buildno[_modifier]
For example, release_20190318_adl_3394512_2 means the second version of build 3394512 of the runtime
release of March 18, 2019, and release_20190318_adl_3394512_private means a private build of the same
release. Note: The date reflects when the last check-in was taken for that release, not necessarily
the official release date.
Known issues
1. Referencing Newtonsoft.Json version 12.0.3 or later in a U-SQL script causes the following
compilation failure:
"We are sorry; jobs running in your Data Lake Analytics account will likely run more slowly or fail to
complete. An unexpected problem is preventing us from automatically restoring this functionality to your
Azure Data Lake Analytics account. Azure Data Lake engineers have been contacted to investigate."
Where the call stack will contain:
System.IndexOutOfRangeException: Index was outside the bounds of the array.
at Roslyn.Compilers.MetadataReader.PEFile.CustomAttributeTableReader.get_Item(UInt32 rowId)
...
See also
Azure Data Lake Analytics overview
Manage Azure Data Lake Analytics using Azure portal
Monitor jobs in Azure Data Lake Analytics using the Azure portal
Azure Data Lake Analytics is upgrading to the .NET
Framework v4.7.2
12/10/2021 • 6 minutes to read • Edit Online
The Azure Data Lake Analytics default runtime is upgrading from .NET Framework v4.5.2 to .NET Framework
v4.7.2. This change introduces a small risk of breaking changes if your U-SQL code uses custom assemblies, and
those custom assemblies use .NET libraries.
This upgrade from .NET Framework 4.5.2 to version 4.7.2 means that the .NET Framework deployed in a U-SQL
runtime (the default runtime) will now always be 4.7.2. There isn't a side-by-side option for .NET Framework
versions.
After this upgrade to .NET Framework 4.7.2 is complete, the system's managed code will run as version 4.7.2, while
user-provided libraries such as U-SQL custom assemblies will run in the backwards-compatible mode
appropriate for the version that the assembly was generated for.
If your assembly DLLs are generated for version 4.5.2, the deployed framework will treat them as 4.5.2
libraries, providing (with a few exceptions) 4.5.2 semantics.
You can now use U-SQL custom assemblies that make use of version 4.7.2 features, if you target the .NET
Framework 4.7.2.
Because of this upgrade to .NET Framework 4.7.2, there's a potential to introduce breaking changes to your U-
SQL jobs that use .NET custom assemblies. We suggest you check for backwards-compatibility issues using the
procedure below.
NOTE
The tool doesn't detect actual breaking changes. It only identifies called .NET APIs that may (for certain inputs) cause
issues. If you get notified of an issue, your code may still be fine; however, you should check in more detail.
Timeline
You can check for the deployment of the new runtime on the Runtime troubleshoot page, and by looking at any
prior successful job.
What if I can't get my code reviewed in time
You can submit your job against the old runtime version (which is built targeting 4.5.2); however, due to the lack
of .NET Framework side-by-side capabilities, it will still only run in 4.5.2 compatibility mode. You may still
encounter some of the backwards-compatibility issues because of this behavior.
What are the most common backwards-compatibility issues you may encounter
The following are the most common backwards-incompatibilities that the checker is likely to identify (we
generated this list by running the checker on our own internal ADLA jobs), the libraries that are impacted (note
that you may call the libraries only indirectly, so it is important to check whether your jobs are impacted), and
possible actions to remedy them. In almost all cases for our own jobs, the warnings turned out to be false
positives due to the narrow nature of most breaking changes.
IAsyncResult.CompletedSynchronously property must be correct for the resulting task to complete
When calling TaskFactory.FromAsync, the implementation of the
IAsyncResult.CompletedSynchronously property must be correct for the resulting task to complete.
That is, the property must return true if, and only if, the implementation completed synchronously.
Previously, the property was not checked.
Impacted Libraries: mscorlib, System.Threading.Tasks
Suggested Action: Ensure the IAsyncResult.CompletedSynchronously implementation passed to TaskFactory.FromAsync returns true only when it completes synchronously
DataObject.GetData now retrieves data as UTF-8
For apps that target the .NET Framework 4 or that run on the .NET Framework 4.5.1 or earlier versions,
DataObject.GetData retrieves HTML-formatted data as an ASCII string. As a result, non-ASCII
characters (characters whose ASCII codes are greater than 0x7F) are represented by two random
characters. For apps that target the .NET Framework 4.5 or later and run on the .NET
Framework 4.5.2, DataObject.GetData retrieves HTML-formatted data as UTF-8, which represents
characters greater than 0x7F correctly.
Impacted Libraries: Glo
Suggested Action: Ensure data retrieved is the format you want
XmlWriter throws on invalid surrogate pairs
For apps that target the .NET Framework 4.5.2 or previous versions, writing an invalid surrogate pair
using exception fallback handling does not always throw an exception. For apps that target the .NET
Framework 4.6, attempting to write an invalid surrogate pair throws an ArgumentException .
Impacted Libraries: System.Xml, System.Xml.ReaderWriter
Suggested Action: Ensure you are not writing an invalid surrogate pair that would cause an
ArgumentException (see the sketch at the end of this list)
HtmlTextWriter does not render <br/> element correctly
Beginning in the .NET Framework 4.6, calling HtmlTextWriter.RenderBeginTag() and
HtmlTextWriter.RenderEndTag() with a <BR /> element will correctly insert only one <BR /> (instead
of two)
Impacted Libraries: System.Web
Suggested Action: Ensure you are inserting the number of <BR /> elements you expect to see, so no unexpected
behavior is seen in production jobs
Calling CreateDefaultAuthorizationContext with a null argument has changed
The implementation of the AuthorizationContext returned by a call to
CreateDefaultAuthorizationContext(IList<IAuthorizationPolicy>) with a null authorizationPolicies
argument has changed in the .NET Framework 4.6.
Impacted Libraries: System.IdentityModel
Suggested Action: Ensure you are handling the new expected behavior when there is null
authorization policy
RSACng now correctly loads RSA keys of non-standard key size
In .NET Framework versions prior to 4.6.2, customers with non-standard key sizes for RSA certificates
are unable to access those keys via the GetRSAPublicKey() and GetRSAPrivateKey() extension
methods. A CryptographicException with the message "The requested key size is not supported" is
thrown. With the .NET Framework 4.6.2 this issue has been fixed. Similarly, RSA.ImportParameters()
and RSACng.ImportParameters() now work with non-standard key sizes without throwing a
CryptographicException.
Impacted Libraries: mscorlib, System.Core
Suggested Action: Ensure RSA keys are working as expected
Path colon checks are stricter
In .NET Framework 4.6.2, a number of changes were made to support previously unsupported paths
(both in length and format). Checks for proper drive separator (colon) syntax were made more correct,
which had the side effect of blocking some URI paths in a few select Path APIs where they used to be
tolerated.
Impacted Libraries: mscorlib, System.Runtime.Extensions
Suggested Action:
Calls to ClaimsIdentity constructors
Starting with the .NET Framework 4.6.2, there is a change in how ClaimsIdentity constructors with an
IIdentity parameter set the ClaimsIdentity.Actor property. If the IIdentity argument is a
ClaimsIdentity object, and the Actor property of that ClaimsIdentity object is not null, the Actor
property is attached by using the ClaimsIdentity.Clone method. In the .NET Framework 4.6.1 and earlier
versions, the Actor property is attached as an existing reference. Because of this change, starting with
the .NET Framework 4.6.2, the Actor property of the new ClaimsIdentity object is not equal to the
Actor property of the constructor's IIdentity argument. In the .NET Framework 4.6.1 and earlier versions, it
is equal.
Impacted Libraries: mscorlib
Suggested Action: Ensure ClaimsIdentity is working as expected on new runtime
Serialization of control characters with DataContractJsonSerializer is now compatible with ECMAScript V6
and V8
In the .NET Framework 4.6.2 and earlier versions, the DataContractJsonSerializer did not serialize some
special control characters, such as \b, \f, and \t, in a way that was compatible with the ECMAScript V6
and V8 standards. Starting with the .NET Framework 4.7, serialization of these control characters is
compatible with ECMAScript V6 and V8.
Impacted Libraries: System.Runtime.Serialization.Json
Suggested Action: Ensure same behavior with DataContractJsonSerializer
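To make this kind of validation concrete, here is a minimal sketch that exercises the XmlWriter surrogate-pair behavior called out above. It is an illustration written for this article rather than output of Microsoft's checker, and the lone high surrogate value is chosen purely for demonstration:
using System;
using System.IO;
using System.Text;
using System.Xml;

class SurrogatePairCheck
{
    static void Main()
    {
        var settings = new XmlWriterSettings
        {
            // Use exception fallback so invalid characters are reported rather than replaced.
            Encoding = Encoding.GetEncoding("utf-8",
                new EncoderExceptionFallback(), new DecoderExceptionFallback())
        };

        using (var stream = new MemoryStream())
        using (var writer = XmlWriter.Create(stream, settings))
        {
            writer.WriteStartElement("root");
            try
            {
                // "\ud800" is a lone high surrogate (an invalid surrogate pair).
                // Per the breaking change above, .NET Framework 4.6 and later reject it with an ArgumentException.
                writer.WriteString("\ud800");
                writer.WriteEndElement();
                writer.Flush();
                Console.WriteLine("No exception thrown (pre-4.6 behavior).");
            }
            catch (ArgumentException ex)
            {
                Console.WriteLine($"Invalid surrogate pair rejected: {ex.Message}");
            }
        }
    }
}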
Migrate Azure Data Lake Analytics to Azure
Synapse Analytics
12/10/2021 • 3 minutes to read • Edit Online
Microsoft launched Azure Synapse Analytics, which aims to bring data lakes and data warehouses together
for a unified big data analytics experience. It helps customers gather and analyze all their varied data, solve
data inefficiencies, and work together. Moreover, Synapse's integration with Azure Machine Learning and
Power BI improves organizations' ability to get insights from their data and to apply machine learning across
their intelligent apps.
This document shows you how to migrate from Azure Data Lake Analytics to Azure Synapse Analytics.
Recommended approach
Step 1: Assess readiness
Step 2: Prepare to migrate
Step 3: Migrate data and application workloads
Step 4: Cutover from Azure Data Lake Analytics to Azure Synapse Analytics
Step 1: Assess readiness
1. Look at Apache Spark on Azure Synapse Analytics, and understand the key differences between Azure Data
Lake Analytics and Spark on Azure Synapse Analytics:
Default programming language: Azure Data Lake Analytics uses U-SQL; Spark on Azure Synapse Analytics
supports T-SQL, Python, Scala, Spark SQL, and .NET.
Data sources: Azure Data Lake Analytics reads from Azure Data Lake Storage; Spark on Azure Synapse
Analytics reads from Azure Blob Storage and Azure Data Lake Storage.
2. Review the Questionnaire for Migration Assessment (at the end of this article) and list the possible risks to consider.
Step 2: Prepare to migrate
1. Identify the jobs and data that you'll migrate.
Take this opportunity to clean up the jobs that you no longer use. Unless you plan to migrate all
your jobs at one time, take this time to identify logical groups of jobs that you can migrate in phases.
Evaluate the size of the data and understand the Apache Spark data format. Review your U-SQL scripts,
evaluate the script-rewriting effort, and understand the Apache Spark code concepts.
2. Determine the impact that a migration will have on your business. For example, whether you can afford
any downtime while migration takes place.
3. Create a migration plan.
Step 3: Migrate data and application workloads
1. Migrate your data from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2.
Azure Data Lake Storage Gen1 retirement will be in February 2024; see the official announcement. We
suggest migrating the data to Gen2 first. See Understand Apache Spark data formats for
Azure Data Lake Analytics U-SQL developers and move both the files and the data stored in U-SQL tables
to make them accessible to Azure Synapse Analytics. More details of the migration guide can be found
here.
2. Transform your U-SQL scripts to Spark. Refer to Understand Apache Spark code concepts for Azure Data
Lake Analytics U-SQL developers to transform your U-SQL scripts to Spark.
3. Transform or re-create your job orchestration pipelines to target the new Spark programs.
Step 4: Cut over from Azure Data Lake Analytics to Azure Synapse Analytics
After you're confident that your applications and workloads are stable, you can begin using Azure Synapse
Analytics to satisfy your business scenarios. Turn off any remaining pipelines that are running on Azure Data
Lake Analytics and decommission your Azure Data Lake Analytics accounts.
Questionnaire for Migration Assessment
Evaluate the size of the migration
Questions: How many Azure Data Lake Analytics accounts do you have? How many pipelines are in use? How
many U-SQL scripts are in use?
Notes: The more data and scripts to be migrated, and the more UDOs/UDFs used in the scripts, the more
difficult the migration. The time and resources required for the migration need to be planned according to the
scale of the project.
Data source
Questions: What's the size of the data source? What kinds of data formats are processed?
Notes: See Understand Apache Spark data formats for Azure Data Lake Analytics U-SQL developers.
Data output
Questions: Will you keep the output data for later use? If the output data is saved in U-SQL tables, how will
you handle it?
Notes: If the output data will be used often and is saved in U-SQL tables, you need to change the scripts and
write the output data in a Spark-supported data format.
Data migration
Questions: Have you made a storage migration plan?
Notes: See Migrate Azure Data Lake Storage from Gen1 to Gen2.
U-SQL scripts transform
Questions: Do you use UDOs/UDFs (.NET, Python, etc.)? If so, which language do you use in your UDOs/UDFs,
and were there any problems during the transform? Is federated query used in U-SQL?
Notes: See Understand Apache Spark code concepts for Azure Data Lake Analytics U-SQL developers.
Next steps
Azure Synapse Analytics
Azure Policy built-in definitions for Azure Data Lake
Analytics
12/10/2021 • 2 minutes to read • Edit Online
This page is an index of Azure Policy built-in policy definitions for Azure Data Lake Analytics. For additional
Azure Policy built-ins for other services, see Azure Policy built-in definitions.
The name of each built-in policy definition links to the policy definition in the Azure portal. Use the link in the
Version column to view the source on the Azure Policy GitHub repo.
Name: Resource logs in Data Lake Analytics should be enabled
Description: Audit enabling of resource logs. This enables you to recreate activity trails to use for
investigation purposes when a security incident occurs or when your network is compromised.
Effect(s): AuditIfNotExists, Disabled
Version: 5.0.0
Next steps
See the built-ins on the Azure Policy GitHub repo.
Review the Azure Policy definition structure.
Review Understanding policy effects.