Data Factory
Overview
Introduction to Azure Data Factory
Concepts
Pipelines and activities
Datasets
Scheduling and execution
Get Started
Tutorial: Create a pipeline to copy data
Copy Wizard
Azure portal
Visual Studio
PowerShell
Azure Resource Manager template
REST API
.NET API
Tutorial: Create a pipeline to transform data
Azure portal
Visual Studio
PowerShell
Azure Resource Manager template
REST API
Tutorial: Move data between on-premises and cloud
FAQ
How To
Move Data
Copy Activity Overview
Data Factory Copy Wizard
Performance and tuning guide
Fault tolerance
Security considerations
Connectors
Data Management Gateway
Transform Data
HDInsight Hive Activity
HDInsight Pig Activity
HDInsight MapReduce Activity
HDInsight Streaming Activity
HDInsight Spark Activity
Machine Learning Batch Execution Activity
Machine Learning Update Resource Activity
Stored Procedure Activity
Data Lake Analytics U-SQL Activity
.NET custom activity
Invoke R scripts
Reprocess models in Azure Analysis Services
Compute Linked Services
Develop
Azure Resource Manager template
Samples
Functions and system variables
Naming rules
.NET API change log
Monitor and Manage
Monitoring and Management app
Azure Data Factory pipelines
Using .NET SDK
Troubleshoot Data Factory issues
Troubleshoot issues with using Data Management Gateway
Reference
Code samples
PowerShell
.NET
REST
JSON
Resources
Azure Roadmap
Case Studies
Learning path
MSDN Forum
Pricing
Pricing calculator
Release notes for Data Management Gateway
Request a feature
Service updates
Stack Overflow
Videos
Customer Profiling
Process large-scale datasets using Data Factory and Batch
Product Recommendations
Introduction to Azure Data Factory
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud to orchestrate and automate data movement and data transformation. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores, process/transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning, and publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI) applications to consume.
It is more of an Extract-and-Load (EL) and then Transform-and-Load (TL) platform than a traditional Extract-Transform-and-Load (ETL) platform. The transformations process data by using compute services rather than performing row-level transformations such as adding derived columns, counting rows, or sorting data.
Currently, in Azure Data Factory, the data that workflows consume and produce is time-sliced data (hourly, daily, weekly, and so on). For example, a pipeline might read input data, process the data, and produce output data once a day. You can also run a workflow just one time.
Key components
An Azure subscription may have one or more Azure Data Factory instances (or data factories). Azure Data
Factory is composed of four key components that work together to provide the platform on which you can
compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory may have one or more pipelines. A pipeline is a group of activities. Together, the activities in a pipeline perform a task. For example, a pipeline could contain a group of activities that ingest data from an Azure blob and then run a Hive query on an HDInsight cluster to partition the data. The benefit is that the pipeline allows you to manage the activities as a set instead of managing each one individually. For example, you can deploy and schedule the pipeline instead of scheduling the activities independently.
Activity
A pipeline may have one or more activities. Activities define the actions to perform on your data. For example,
you may use a Copy activity to copy data from one data store to another data store. Similarly, you may use a
Hive activity, which runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data
Factory supports two types of activities: data movement activities and data transformation activities.
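For example, a Copy activity defined inside a pipeline's activities section looks roughly like the following sketch; the dataset names are placeholders, and the BlobSource/SqlSink pairing is just one possible combination:
{
    "name": "CopyFromBlobToSql",
    "type": "Copy",
    "inputs": [ { "name": "<input dataset name>" } ],
    "outputs": [ { "name": "<output dataset name>" } ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "SqlSink" }
    }
}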
Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the
following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy
data to and from that store.
Azure: Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Table storage
Databases: DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*, SQL Server*, Sybase*, Teradata*
NoSQL: Cassandra*, MongoDB*
File: Amazon S3, File System*, FTP, HDFS*, SFTP
Others: Generic OData, Generic ODBC*, Salesforce, GE Historian*
Data stores marked with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-premises/Azure IaaS machine.
Data transformation activities
Data Factory also supports data transformation activities such as the HDInsight Hive, Pig, MapReduce, Streaming, and Spark activities, the Machine Learning Batch Execution and Update Resource activities, the Stored Procedure activity (which runs on Azure SQL Database, Azure SQL Data Warehouse, or SQL Server), the Data Lake Analytics U-SQL activity, and the .NET custom activity.
NOTE
You can use MapReduce activity to run Spark programs on your HDInsight Spark cluster. See Invoke Spark programs from
Azure Data Factory for details. You can create a custom activity to run R scripts on your HDInsight cluster with R installed.
See Run R Script using Azure Data Factory.
Supported regions
Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data
factory can access data stores and compute services in other Azure regions to move data between data stores or
process data using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate
movement of data between supported data stores and processing of data using compute services in other
regions or in an on-premises environment. It also allows you to monitor and manage workflows using both
programmatic and UI mechanisms.
Even though Data Factory is available in only West US, East US, and North Europe regions, the service
powering the data movement in Data Factory is available globally in several regions. If a data store is behind a
firewall, then a Data Management Gateway installed in your on-premises environment moves the data instead.
For example, assume that your compute environments, such as an Azure HDInsight cluster and Azure Machine Learning, are running in the West Europe region. You can create and use an Azure Data Factory instance in North Europe and use it to schedule jobs on your compute environments in West Europe. It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job on your compute environment does not change.
Move data between two cloud data stores: In this tutorial, you create a data factory with a pipeline that moves data from Blob storage to SQL database.
Transform data using Hadoop cluster: In this tutorial, you build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster.
Move data between an on-premises data store and a cloud data store using Data Management Gateway: In this tutorial, you build a data factory with a pipeline that moves data from an on-premises SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine.
Pipelines and Activities in Azure Data Factory
This article helps you understand pipelines and activities in Azure Data Factory and use them to
construct end-to-end data-driven workflows for your data movement and data processing scenarios.
NOTE
This article assumes that you have gone through Introduction to Azure Data Factory. If you do not have hands-on experience with creating data factories, going through the data transformation tutorial and/or the data movement tutorial will help you understand this article better.
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. The activities in a pipeline define actions to perform on your data. For
example, you may use a copy activity to copy data from an on-premises SQL Server database to Azure Blob storage. Then, use a Hive activity that runs a Hive script on an Azure HDInsight cluster to
process/transform data from the blob storage to produce output data. Finally, use a second copy
activity to copy the output data to an Azure SQL Data Warehouse on top of which business intelligence
(BI) reporting solutions are built.
An activity can take zero or more input datasets and produce one or more output datasets. The
following diagram shows the relationship between pipeline, activity, and dataset in Data Factory:
A pipeline allows you to manage activities as a set instead of each one individually. For example, you
can deploy, schedule, suspend, and resume a pipeline, instead of dealing with activities in the pipeline
independently.
Data Factory supports two types of activities: data movement activities and data transformation
activities. Each activity can have zero or more input datasets and produce one or more output datasets.
An input dataset represents the input for an activity in the pipeline and an output dataset represents
the output for the activity. Datasets identify data within different data stores, such as tables, files,
folders, and documents. After you create a dataset, you can use it with activities in a pipeline. For
example, a dataset can be an input/output dataset of a Copy Activity or an HDInsight Hive Activity. For more information about datasets, see the Datasets in Azure Data Factory article.
Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory
supports the following data stores. Data from any source can be written to any sink. Click a data store
to learn how to copy data to and from that store.
Azure: Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Table storage
Databases: DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*, SQL Server*, Sybase*, Teradata*
NoSQL: Cassandra*, MongoDB*
File: Amazon S3, File System*, FTP, HDFS*, SFTP
Others: Generic OData, Generic ODBC*, Salesforce, GE Historian*
NOTE
Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway
on an on-premises/Azure IaaS machine.
Data transformation activities
Data Factory also supports data transformation activities such as the HDInsight Hive, Pig, MapReduce, Streaming, and Spark activities, the Machine Learning Batch Execution and Update Resource activities, the Stored Procedure activity (which runs on Azure SQL Database, Azure SQL Data Warehouse, or SQL Server), the Data Lake Analytics U-SQL activity, and the .NET custom activity.
NOTE
You can use MapReduce activity to run Spark programs on your HDInsight Spark cluster. See Invoke Spark
programs from Azure Data Factory for details. You can create a custom activity to run R scripts on your
HDInsight cluster with R installed. See Run R Script using Azure Data Factory.
Schedule pipelines
A pipeline is active only between its start time and end time. It is not executed before the start time or
after the end time. If the pipeline is paused, it does not get executed irrespective of its start and end
time. For a pipeline to run, it should not be paused. See Scheduling and Execution to understand how
scheduling and execution works in Azure Data Factory.
Pipeline JSON
Let us take a closer look at how a pipeline is defined in JSON format. The generic structure for a pipeline looks as follows:
{
"name": "PipelineName",
"properties":
{
"description" : "pipeline description",
"activities":
[
],
"start": "<start date-time>",
"end": "<end date-time>",
"isPaused": true/false,
"pipelineMode": "scheduled/onetime",
"expirationTime": "15.00:00:00",
"datasets":
[
]
}
}
Activity JSON
The activities section can have one or more activities defined within it. Each activity has the following
top-level structure:
{
"name": "ActivityName",
"description": "description",
"type": "<ActivityType>",
"inputs": "[]",
"outputs": "[]",
"linkedServiceName": "MyLinkedService",
"typeProperties":
{
},
"policy":
{
},
"scheduler":
{
}
}
linkedServiceName: Name of the linked service used by the activity. An activity may require that you specify the linked service that links to the required compute environment. Required: Yes for the HDInsight activities and the Azure Machine Learning Batch Scoring Activity; No for all others.
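For example, an HDInsight Hive activity references the compute environment through linkedServiceName; a minimal sketch (the linked service, script path, and dataset names here are illustrative):
{
    "name": "RunSampleHiveScript",
    "type": "HDInsightHive",
    "linkedServiceName": "MyHDInsightLinkedService",
    "typeProperties": {
        "scriptPath": "adfscripts/samplescript.hql",
        "scriptLinkedService": "MyStorageLinkedService"
    },
    "inputs": [ { "name": "HiveInputDataset" } ],
    "outputs": [ { "name": "HiveOutputDataset" } ]
}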
Policies
Policies affect the run-time behavior of an activity, specifically when the slice of a table is processed.
The following table provides the details.
For example, a delay value of 00:10:00 implies a delay of 10 minutes before processing of the slice starts.
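For example, a policy section on an activity might look like the following sketch; the values shown for concurrency, executionPriorityOrder, retry, timeout, and delay are illustrative settings:
"policy":
{
    "concurrency": 1,
    "executionPriorityOrder": "OldestFirst",
    "retry": 3,
    "timeout": "01:00:00",
    "delay": "00:10:00"
}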
The typeProperties section is different for each transformation activity. To learn about type properties
supported for a transformation activity, click the transformation activity in the Data transformation
activities table.
For a complete walkthrough of creating this pipeline, see Tutorial: Build your first pipeline to process
data using Hadoop cluster.
In this sample, the pipeline has two activities: Activity1 and Activity2. Activity1 takes Dataset1 as an input and produces Dataset2 as an output. Activity2 takes Dataset2 as an input and produces Dataset3 as an output. Since the output of Activity1 (Dataset2) is the input of Activity2, Activity2 runs only after Activity1 completes successfully and produces the Dataset2 slice. If Activity1 fails for some reason and does not produce the Dataset2 slice, Activity2 does not run for that slice (for example: 9 AM to 10 AM).
You can also chain activities that are in different pipelines.
In this sample, Pipeline1 has only one activity that takes Dataset1 as an input and produces Dataset2 as an output. Pipeline2 also has only one activity that takes Dataset2 as an input and produces Dataset3 as an output.
For more information, see scheduling and execution.
The following sample shows a one-time pipeline (pipelineMode set to OneTime) with a single copy activity:
{
"name": "CopyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"name": "CopyActivity-0"
}
],
"pipelineMode": "OneTime"
}
}
Next Steps
For more information about datasets, see Create datasets article.
For more information about how pipelines are scheduled and executed, see Scheduling and
execution in Azure Data Factory article.
Datasets in Azure Data Factory
This article describes what datasets are, how they are defined in JSON format, and how they are
used in Azure Data Factory pipelines. It provides details about each section (for example, structure,
availability, and policy) in the dataset JSON definition. The article also provides examples for using
the offset, anchorDateTime, and style properties in a dataset JSON definition.
NOTE
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview. If you do not have
hands-on experience with creating data factories, you can gain a better understanding by reading the data
transformation tutorial and the data movement tutorial.
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. The activities in a pipeline define actions to perform on your data. For
example, you might use a copy activity to copy data from an on-premises SQL Server to Azure Blob
storage. Then, you might use a Hive activity that runs a Hive script on an Azure HDInsight cluster to
process data from Blob storage to produce output data. Finally, you might use a second copy
activity to copy the output data to Azure SQL Data Warehouse, on top of which business
intelligence (BI) reporting solutions are built. For more information about pipelines and activities,
see Pipelines and activities in Azure Data Factory.
An activity can take zero or more input datasets, and produce one or more output datasets. An
input dataset represents the input for an activity in the pipeline, and an output dataset represents
the output for the activity. Datasets identify data within different data stores, such as tables, files,
folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder
in Blob storage from which the pipeline should read the data.
Before you create a dataset, create a linked service to link your data store to the data factory.
Linked services are much like connection strings, which define the connection information needed
for Data Factory to connect to external resources. Datasets identify data within the linked data
stores, such as SQL tables, files, folders, and documents. For example, an Azure Storage linked
service links a storage account to the data factory. An Azure Blob dataset represents the blob
container and the folder that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked
services: Azure Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset
(which refers to the Azure Storage linked service) and Azure SQL Table dataset (which refers to the
Azure SQL Database linked service). The Azure Storage and Azure SQL Database linked services
contain connection strings that Data Factory uses at runtime to connect to your Azure Storage and
Azure SQL Database, respectively. The Azure Blob dataset specifies the blob container and blob
folder that contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies the
SQL table in your SQL database to which the data is to be copied.
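For example, the Azure Storage linked service in this scenario is defined roughly as follows (replace the placeholders with your storage account name and key):
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}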
The following diagram shows the relationships among pipeline, activity, dataset, and linked service
in Data Factory:
Dataset JSON
A dataset in Data Factory is defined in JSON format as follows:
{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"external": <boolean flag to indicate external data. only for input datasets>,
"linkedServiceName": "<Name of the linked service that refers to a data store.>",
"structure": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
},
"availability": {
"frequency": "<Specifies the time unit for data slice production. Supported
frequency: Minute, Hour, Day, Week, Month>",
"interval": "<Specifies the interval within the defined frequency. For example,
frequency set to 'Hour' and interval set to 1 indicates that new data slices should be produced
hourly>"
},
"policy":
{
}
}
}
Dataset example
In the following example, the dataset represents a table named MyTable in a SQL database.
{
"name": "DatasetSample",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties":
{
"tableName": "MyTable"
},
"availability":
{
"frequency": "Day",
"interval": 1
}
}
}
The AzureSqlLinkedService linked service that this dataset refers to is defined as follows:
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"description": "",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial
Catalog=<databasename>;User ID=<username>@<servername>;Password=<password>;Integrated
Security=False;Encrypt=True;Connect Timeout=30"
}
}
}
IMPORTANT
Unless a dataset is being produced by the pipeline, it should be marked as external. This setting generally applies to the inputs of the first activity in a pipeline.
Dataset type
The type of the dataset depends on the data store you use. The following list shows the data stores supported by Data Factory. Click a data store to learn how to create a linked service and a dataset for that data store.
Azure: Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Table storage
Databases: DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*, SQL Server*, Sybase*, Teradata*
NoSQL: Cassandra*, MongoDB*
File: Amazon S3, File System*, FTP, HDFS*, SFTP
Others: Generic OData, Generic ODBC*, Salesforce, GE Historian*
NOTE
Data stores with * can be on-premises or on Azure infrastructure as a service (IaaS). These data stores
require you to install Data Management Gateway.
In the example in the previous section, the type of the dataset is set to AzureSqlTable. Similarly,
for an Azure Blob dataset, the type of the dataset is set to AzureBlob, as shown in the following
JSON:
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}
Dataset structure
The structure section is optional. It defines the schema of the dataset by containing a collection of
names and data types of columns. You use the structure section to provide type information that is
used to convert types and map columns from the source to the destination. In the following
example, the dataset has three columns: slicetimestamp, projectname, and pageviews. They are of type String, String, and Decimal, respectively.
structure:
[
{ "name": "slicetimestamp", "type": "String"},
{ "name": "projectname", "type": "String"},
{ "name": "pageviews", "type": "Decimal"}
]
The following guidelines help you determine when to include structure information, and what to
include in the structure section.
For structured data sources, specify the structure section only if you want to map source columns to sink columns, and their names are not the same. This kind of structured data
source stores data schema and type information along with the data itself. Examples of
structured data sources include SQL Server, Oracle, and Azure table.
As type information is already available for structured data sources, you should not include
type information when you do include the structure section.
For schema on read data sources (specifically Blob storage), you can choose to store
data without storing any schema or type information with the data. For these types of data
sources, include structure when you want to map source columns to sink columns. Also
include structure when the dataset is an input for a copy activity, and data types of source
dataset should be converted to native types for the sink.
Data Factory supports the following values for providing type information in structure:
Int16, Int32, Int64, Single, Double, Decimal, Byte[], Boolean, String, Guid, Datetime,
Datetimeoffset, and Timespan. These values are Common Language Specification (CLS)-
compliant, .NET-based type values.
Data Factory automatically performs type conversions when moving data from a source data store
to a sink data store.
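When the source and sink column names differ, one way to perform the mapping in a copy activity is with a translator in the activity's typeProperties; a minimal sketch (the TabularTranslator settings and column names shown here are illustrative):
"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlSink" },
    "translator": {
        "type": "TabularTranslator",
        "columnMappings": "slicetimestamp: SliceTimestamp, projectname: ProjectName, pageviews: PageViews"
    }
}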
Dataset availability
The availability section in a dataset defines the processing window (for example, hourly, daily, or
weekly) for the dataset. For more information about activity windows, see Scheduling and
execution.
The following availability section specifies that the output dataset is either produced hourly, or the
input dataset is available hourly:
"availability":
{
"frequency": "Hour",
"interval": 1
}
"start": "2016-08-25T00:00:00Z",
"end": "2016-08-25T05:00:00Z",
The output dataset is produced hourly within the pipeline start and end times. Therefore, there are
five dataset slices produced by this pipeline, one for each activity window (12 AM - 1 AM, 1 AM - 2
AM, 2 AM - 3 AM, 3 AM - 4 AM, 4 AM - 5 AM).
The following properties can be used in the availability section:
frequency: Specifies the time unit for dataset slice production. Supported frequency: Minute, Hour, Day, Week, Month.
interval: Specifies a multiplier for frequency. "Frequency x interval" determines how often the slice is produced. For example, if you need the dataset to be sliced on an hourly basis, you set frequency to Hour, and interval to 1.
style: Specifies whether the slice should be produced at the start or end of the interval (StartOfInterval or EndOfInterval). If frequency is set to Day, and style is set to EndOfInterval, the slice is produced in the last hour of the day. If frequency is set to Hour, and style is set to EndOfInterval, the slice is produced at the end of the hour. For example, for a slice for the 1 PM - 2 PM period, the slice is produced at 2 PM.
anchorDateTime: Defines the absolute position in time used by the scheduler to compute dataset slice boundaries.
offset: Timespan by which the start of all dataset slices is shifted.
offset example
By default, daily ( "frequency": "Day", "interval": 1 ) slices start at 12 AM (midnight) Coordinated
Universal Time (UTC). If you want the start time to be 6 AM UTC time instead, set the offset as
shown in the following snippet:
"availability":
{
"frequency": "Day",
"interval": 1,
"offset": "06:00:00"
}
anchorDateTime example
In the following example, the dataset is produced once every 23 hours. The first slice starts at the
time specified by anchorDateTime, which is set to 2017-04-19T08:00:00 (UTC).
"availability":
{
"frequency": "Hour",
"interval": 23,
"anchorDateTime":"2017-04-19T08:00:00"
}
offset/style example
The following dataset is monthly, and is produced on the 3rd of every month at 8:00 AM (
3.08:00:00 ):
"availability": {
"frequency": "Month",
"interval": 1,
"offset": "3.08:00:00",
"style": "StartOfInterval"
}
Dataset policy
The policy section in the dataset definition defines the criteria or the condition that the dataset
slices must fulfill.
Validation policies
The validation policies you can specify are minimumSizeMB, which validates that the data in Azure Blob storage meets the minimum size requirement (in megabytes), and minimumRows, which validates that the data in an Azure SQL database or an Azure table contains the minimum number of rows.
Examples
minimumSizeMB:
"policy":
{
"validation":
{
"minimumSizeMB": 10.0
}
}
minimumRows:
"policy":
{
"validation":
{
"minimumRows": 100
}
}
External datasets
External datasets are the ones that are not produced by a running pipeline in the data factory. If the
dataset is marked as external, the ExternalData policy may be defined to influence the behavior
of the dataset slice availability.
Unless a dataset is being produced by Data Factory, it should be marked as external. This setting
generally applies to the inputs of first activity in a pipeline, unless activity or pipeline chaining is
being used.
For example, with a retry interval of one minute: if it is 1:00 PM right now, the first try begins. If the first validation check takes 1 minute to complete and the operation failed, the next retry is at 1:00 + 1 min (duration) + 1 min (retry interval) = 1:02 PM.
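A minimal sketch of how such a policy might look in the dataset definition; the property values shown are illustrative:
"external": true,
"policy":
{
    "externalData":
    {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
    }
}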
Create datasets
You can create datasets by using one of these tools or SDKs:
Copy Wizard
Azure portal
Visual Studio
PowerShell
Azure Resource Manager template
REST API
.NET API
See the following tutorials for step-by-step instructions for creating pipelines and datasets by
using one of these tools or SDKs:
Build a pipeline with a data transformation activity
Build a pipeline with a data movement activity
After a pipeline is created and deployed, you can manage and monitor your pipelines by using the
Azure portal blades, or the Monitoring and Management app. See the following topics for step-by-
step instructions:
Monitor and manage pipelines by using Azure portal blades
Monitor and manage pipelines by using the Monitoring and Management app
Scoped datasets
You can create datasets that are scoped to a pipeline by using the datasets property. These
datasets can only be used by activities within this pipeline, not by activities in other pipelines. The
following example defines a pipeline with two datasets (InputDataset-rdc and OutputDataset-rdc)
to be used within the pipeline.
IMPORTANT
Scoped datasets are supported only with one-time pipelines (where pipelineMode is set to OneTime).
See Onetime pipeline for details.
{
"name": "CopyPipeline-rdc",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset-rdc"
}
],
"outputs": [
{
"name": "OutputDataset-rdc"
}
],
"scheduler": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval"
},
"name": "CopyActivity-0"
"name": "CopyActivity-0"
}
],
"start": "2016-02-28T00:00:00Z",
"end": "2016-02-28T00:00:00Z",
"isPaused": false,
"pipelineMode": "OneTime",
"expirationTime": "15.00:00:00",
"datasets": [
{
"name": "InputDataset-rdc",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "InputLinkedService-rdc",
"typeProperties": {
"fileName": "emp.txt",
"folderPath": "adftutorial/input",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
},
{
"name": "OutputDataset-rdc",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "OutputLinkedService-rdc",
"typeProperties": {
"fileName": "emp.txt",
"folderPath": "adftutorial/output",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": false,
"policy": {}
}
}
]
}
}
Next steps
For more information about pipelines, see Create pipelines.
For more information about how pipelines are scheduled and executed, see Scheduling and
execution in Azure Data Factory.
Data Factory scheduling and execution
This article explains the scheduling and execution aspects of the Azure Data Factory application model. This
article assumes that you understand the basics of Data Factory application model concepts, including activities, pipelines, linked services, and datasets. For basic concepts of Azure Data Factory, see the following articles:
Introduction to Data Factory
Pipelines
Datasets
"start": "2017-04-01T08:00:00Z",
"end": "2017-04-01T11:00:00Z"
"isPaused": false
"scheduler": {
"frequency": "Hour",
"interval": 1
},
As shown in the following diagram, specifying a schedule for an activity creates a series of tumbling windows within the pipeline start and end times. Tumbling windows are a series of fixed-size, non-overlapping, contiguous time intervals. These logical tumbling windows for an activity are called activity windows.
The scheduler property for an activity is optional. If you do specify this property, it must match the cadence
you specify in the definition of output dataset for the activity. Currently, output dataset is what drives the
schedule. Therefore, you must create an output dataset even if the activity does not produce any output.
Input dataset:
{
"name": "AzureSqlInput",
"properties": {
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
}
Output dataset
{
"name": "AzureBlobOutput",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mypath/{Year}/{Month}/{Day}/{Hour}",
"format": {
"type": "TextFormat"
},
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" }
},
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" }}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Currently, output dataset drives the schedule. In other words, the schedule specified for the output dataset
is used to run an activity at runtime. Therefore, you must create an output dataset even if the activity does not
produce any output. If the activity doesn't take any input, you can skip creating the input dataset.
In the following pipeline definition, the scheduler property is used to specify schedule for the activity. This
property is optional. Currently, the schedule for the activity must match the schedule specified for the output
dataset.
{
"name": "SamplePipeline",
"properties": {
"description": "copy activity",
"activities": [
{
"type": "Copy",
"name": "AzureSQLtoBlob",
"description": "copy activity",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >=
\\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 100000,
"writeBatchTimeout": "00:05:00"
}
},
"inputs": [
{
"name": "AzureSQLInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"scheduler": {
"frequency": "Hour",
"interval": 1
}
}
],
"start": "2017-04-01T08:00:00Z",
"end": "2017-04-01T11:00:00Z"
}
}
In this example, the activity runs hourly between the start and end times of the pipeline. The output data is produced hourly, for three one-hour windows (8 AM - 9 AM, 9 AM - 10 AM, and 10 AM - 11 AM).
Each unit of data consumed or produced by an activity run is called a data slice. The following diagram shows
an example of an activity with one input dataset and one output dataset:
The diagram shows the hourly data slices for the input and output dataset. The diagram shows three input
slices that are ready for processing. The 10-11 AM activity is in progress, producing the 10-11 AM output slice.
You can access the time interval associated with the current slice in the dataset JSON by using variables:
SliceStart and SliceEnd. Similarly, you can access the time interval associated with an activity window by using
the WindowStart and WindowEnd. The schedule of an activity must match the schedule of the output dataset
for the activity. Therefore, the SliceStart and SliceEnd values are the same as WindowStart and WindowEnd
values respectively. For more information on these variables, see Data Factory functions and system variables
articles.
You can use these variables for different purposes in your activity JSON. For example, you can use them to
select data from input and output datasets representing time series data (for example: 8 AM to 9 AM). This
example also uses WindowStart and WindowEnd to select relevant data for an activity run and copy it to a
blob with the appropriate folderPath. The folderPath is parameterized to have a separate folder for every
hour.
In the preceding example, the schedule specified for input and output datasets is the same (hourly). If the input
dataset for the activity is available at a different frequency, say every 15 minutes, the activity that produces this
output dataset still runs once an hour as the output dataset is what drives the activity schedule. For more
information, see Model datasets with different frequencies.
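For example, such a 15-minute input dataset would declare the following availability, while the hourly output dataset keeps "frequency": "Hour" and continues to drive the activity schedule:
"availability":
{
    "frequency": "Minute",
    "interval": 15
}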
The following properties can be used in the availability section of a dataset:
frequency: Specifies the time unit for slice production. Supported frequency: Minute, Hour, Day, Week, Month.
interval: Specifies a multiplier for frequency. "Frequency x interval" determines how often the slice is produced.
style: Specifies whether the slice is produced at the start or end of the interval. If frequency is set to Month and style is set to EndOfInterval, the slice is produced on the last day of the month. If style is set to StartOfInterval, the slice is produced on the first day of the month.
anchorDateTime: Defines the absolute position in time used by the scheduler to compute slice boundaries. Note: If the anchorDateTime has date parts that are more granular than the frequency, the more granular parts are ignored.
offset: Timespan by which the start of all slices is shifted. Note: If both anchorDateTime and offset are specified, the result is the combined shift.
offset example
By default, daily ( "frequency": "Day", "interval": 1 ) slices start at 12 AM UTC time (midnight). If you want the
start time to be 6 AM UTC time instead, set the offset as shown in the following snippet:
"availability":
{
"frequency": "Day",
"interval": 1,
"offset": "06:00:00"
}
anchorDateTime example
In the following example, the dataset is produced once every 23 hours. The first slice starts at the time specified
by the anchorDateTime, which is set to 2017-04-19T08:00:00 (UTC time).
"availability":
{
"frequency": "Hour",
"interval": 23,
"anchorDateTime":"2017-04-19T08:00:00"
}
offset/style example
The following dataset is a monthly dataset and is produced on the 3rd of every month at 8:00 AM ( 3.08:00:00 ):
"availability": {
"frequency": "Month",
"interval": 1,
"offset": "3.08:00:00",
"style": "StartOfInterval"
}
Dataset policy
A dataset can have a validation policy defined that specifies how the data generated by a slice execution can be
validated before it is ready for consumption. In such cases, after the slice has finished execution, the output slice
status is changed to Waiting with a substatus of Validation. After the slices are validated, the slice status
changes to Ready. If a data slice has been produced but did not pass the validation, activity runs for
downstream slices that depend on this slice are not processed. Monitor and manage pipelines covers the
various states of data slices in Data Factory.
The policy section in the dataset definition defines the criteria or the conditions that the dataset slices must fulfill. The validation properties you can use in the policy section are minimumSizeMB and minimumRows.
Examples
minimumSizeMB:
"policy":
{
"validation":
{
"minimumSizeMB": 10.0
}
}
minimumRows:
"policy":
{
"validation":
{
"minimumRows": 100
}
}
For more information about these properties and examples, see Create datasets article.
Activity policies
Policies affect the run-time behavior of an activity, specifically when the slice of a table is processed. The
following table provides the details.
The diagram shows that out of three recent slices, there was a failure producing the 9-10 AM slice for Dataset2.
Data Factory automatically tracks dependency for the time series dataset. As a result, it does not start the
activity run for the 9-10 AM downstream slice.
Data Factory monitoring and management tools allow you to drill into the diagnostic logs for the failed slice to
easily find the root cause for the issue and fix it. After you have fixed the issue, you can easily start the activity
run to produce the failed slice. For more information on how to rerun and understand state transitions for data
slices, see Monitoring and managing pipelines using Azure portal blades or Monitoring and Management app.
After you rerun the 9-10 AM slice for Dataset2, Data Factory starts the run for the 9-10 AM dependent slice on
the final dataset.
Multiple activities in a pipeline
You can have more than one activity in a pipeline. If you have multiple activities in a pipeline and the output of
an activity is not an input of another activity, the activities may run in parallel if input data slices for the
activities are ready.
You can chain two activities (run one activity after another) by setting the output dataset of one activity as the
input dataset of the other activity. The activities can be in the same pipeline or in different pipelines. The second
activity executes only when the first one finishes successfully.
For example, consider the following case where a pipeline has two activities:
1. Activity A1 that requires external input dataset D1, and produces output dataset D2.
2. Activity A2 that requires input from dataset D2, and produces output dataset D3.
In this scenario, activities A1 and A2 are in the same pipeline. The activity A1 runs when the external data is
available and the scheduled availability frequency is reached. The activity A2 runs when the scheduled slices
from D2 become available and the scheduled availability frequency is reached. If there is an error in one of the
slices in dataset D2, A2 does not run for that slice until it becomes available.
The Diagram view with both activities in the same pipeline would look like the following diagram:
As mentioned earlier, the activities could be in different pipelines. In such a scenario, the diagram view would
look like the following diagram:
Output dataset
One output file is created every day in the day's folder. Availability of output is set at Day (frequency: Day and
interval: 1).
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/",
"partitionedBy": [
{ "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}},
{ "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}},
{ "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}}
],
"format": {
"type": "TextFormat"
}
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
The following diagram shows the scenario from a data-dependency point of view.
The output slice for every day depends on 24 hourly slices from an input dataset. Data Factory computes these
dependencies automatically by figuring out the input data slices that fall in the same time period as the output
slice to be produced. If any of the 24 input slices is not available, Data Factory waits for the input slice to be
ready before starting the daily activity run.
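For reference, the hourly input dataset that the daily output depends on could be defined as follows; this is a sketch, and the dataset name, container, folder path, and linked service name are assumptions:
{
    "name": "AzureBlobInputHourly",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/{Hour}/",
            "partitionedBy": [
                { "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}},
                { "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}},
                { "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}},
                { "name": "Hour","value": {"type": "DateTime","date": "SliceStart","format": "HH"}}
            ],
            "format": {
                "type": "TextFormat"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}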
Sample 2: Specify dependency with expressions and Data Factory functions
Let's consider another scenario. Suppose you have a Hive activity that processes two input datasets. One of them gets new data daily, and the other gets new data every week. Suppose you want to do a join across the two inputs and produce an output every day.
The simple approach in which Data Factory automatically figures out the right input slices to process by
aligning to the output data slices time period does not work.
You must specify that for every activity run, Data Factory should use last week's data slice for the weekly input dataset. You use Azure Data Factory functions as shown in the following snippet to implement this behavior.
Input1: Azure blob
The first input is the Azure blob being updated daily.
{
"name": "AzureBlobInputDaily",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/",
"partitionedBy": [
{ "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}},
{ "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}},
{ "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}}
],
"format": {
"type": "TextFormat"
}
},
"external": true,
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
Input2: Azure blob
The second input is the Azure blob being updated weekly.
{
"name": "AzureBlobInputWeekly",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/",
"partitionedBy": [
{ "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}},
{ "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}},
{ "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}}
],
"format": {
"type": "TextFormat"
}
},
"external": true,
"availability": {
"frequency": "Day",
"interval": 7
}
}
}
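The activity in the pipeline can then take both datasets as inputs and use Data Factory functions to point each daily run at the most recent weekly slice; a hedged sketch of the inputs section (the exact expressions depend on how your weekly data is produced):
"inputs": [
    {
        "name": "AzureBlobInputDaily"
    },
    {
        "name": "AzureBlobInputWeekly",
        "startTime": "Date.AddDays(SliceStart, - Date.DayOfWeek(SliceStart))",
        "endTime": "Date.AddDays(SliceEnd, - Date.DayOfWeek(SliceEnd))"
    }
]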
See Data Factory functions and system variables for a list of functions and system variables that Data Factory
supports.
Appendix
Example: copy sequentially
It is possible to run multiple copy operations one after another in a sequential/ordered manner. For example, you might have two copy activities in a pipeline (CopyActivity1 and CopyActivity2) with the following input and output datasets:
CopyActivity1
Input: Dataset1. Output: Dataset2.
CopyActivity2
Input: Dataset2. Output: Dataset3.
CopyActivity2 would run only if the CopyActivity1 has run successfully and Dataset2 is available.
Here is the sample pipeline JSON:
{
"name": "ChainActivities",
"properties": {
"description": "Run activities in sequence",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "Dataset1"
}
],
"outputs": [
{
"name": "Dataset2"
}
],
"policy": {
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlob1ToBlob2",
"description": "Copy data from a blob to another"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "Dataset2"
}
],
"outputs": [
{
"name": "Dataset3"
}
],
"policy": {
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlob2ToBlob3",
"description": "Copy data from a blob to another"
}
],
"start": "2016-08-25T01:00:00Z",
"end": "2016-08-25T01:00:00Z",
"isPaused": false
}
}
Notice that in the example, the output dataset of the first copy activity (Dataset2) is specified as input for the
second activity. Therefore, the second activity runs only when the output dataset from the first activity is ready.
In the example, CopyActivity2 can have a different input, such as Dataset3, but you specify Dataset2 as an input
to CopyActivity2, so the activity does not run until CopyActivity1 finishes. For example:
CopyActivity1
Input: Dataset1. Output: Dataset2.
CopyActivity2
Inputs: Dataset3, Dataset2. Output: Dataset4.
{
"name": "ChainActivities",
"properties": {
"description": "Run activities in sequence",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "Dataset1"
}
],
"outputs": [
{
"name": "Dataset2"
}
],
"policy": {
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlobToBlob",
"description": "Copy data from a blob to another"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "Dataset3"
},
{
"name": "Dataset2"
}
],
"outputs": [
{
"name": "Dataset4"
}
],
"policy": {
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlob3ToBlob4",
"description": "Copy data from a blob to another"
}
],
"start": "2017-04-25T01:00:00Z",
"end": "2017-04-25T01:00:00Z",
"isPaused": false
}
}
Notice that in the example, two input datasets are specified for the second copy activity. When multiple inputs
are specified, only the first input dataset is used for copying data, but other datasets are used as dependencies.
CopyActivity2 would start only after the following conditions are met:
CopyActivity1 has successfully completed and Dataset2 is available. This dataset is not used when copying
data to Dataset4. It only acts as a scheduling dependency for CopyActivity2.
Dataset3 is available. This dataset represents the data that is copied to the destination.
Tutorial: Copy data from Blob Storage to SQL
Database using Data Factory
In this tutorial, you create a data factory with a pipeline to copy data from Blob storage to SQL
database.
The Copy Activity performs the data movement in Azure Data Factory. It is powered by a globally
available service that can copy data between various data stores in a secure, reliable, and scalable way.
See Data Movement Activities article for details about the Copy Activity.
NOTE
For a detailed overview of the Data Factory service, see the Introduction to Azure Data Factory article.
1. Create a text file named emp.txt on your computer with the following content:
John, Doe
Jane, Doe
2. Use tools such as Azure Storage Explorer to create the adftutorial container and to upload the
emp.txt file to the container.
3. Use the following SQL script to create the emp table in your Azure SQL Database.
If you have SQL Server 2012/2014 installed on your computer: follow instructions from
Managing Azure SQL Database using SQL Server Management Studio to connect to your Azure
SQL server and run the SQL script. This article uses the classic Azure portal, not the new Azure
portal, to configure firewall for an Azure SQL server.
If your client is not allowed to access the Azure SQL server, you need to configure firewall for
your Azure SQL server to allow access from your machine (IP Address). See this article for steps
to configure the firewall for your Azure SQL server.
Create a data factory
You have completed the prerequisites. You can create a data factory using one of the following ways.
Click one of the options in the drop-down list at the top or the following links to perform the tutorial.
Copy Wizard
Azure portal
Visual Studio
PowerShell
Azure Resource Manager template
REST API
.NET API
NOTE
The data pipeline in this tutorial copies data from a source data store to a destination data store. It does not
transform input data to produce output data. For a tutorial on how to transform data using Azure Data
Factory, see Tutorial: Build your first pipeline to transform data using Hadoop cluster.
You can chain two activities (run one activity after another) by setting the output dataset of one activity as the
input dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.
Tutorial: Create a pipeline with Copy Activity using
Data Factory Copy Wizard
This tutorial shows you how to use the Copy Wizard to copy data from an Azure blob storage to an Azure
SQL database.
The Azure Data Factory Copy Wizard allows you to quickly create a data pipeline that copies data from a
supported source data store to a supported destination data store. Therefore, we recommend that you use the
wizard as a first step to create a sample pipeline for your data movement scenario. For a list of data stores
supported as sources and as destinations, see supported data stores.
This tutorial shows you how to create an Azure data factory, launch the Copy Wizard, and go through a series of steps to provide details about your data ingestion/movement scenario. When you finish the steps in the wizard,
the wizard automatically creates a pipeline with a Copy Activity to copy data from an Azure blob storage to an
Azure SQL database. For more information about Copy Activity, see data movement activities.
Prerequisites
Complete prerequisites listed in the Tutorial Overview article before performing this tutorial.
4. After the creation is complete, you see the Data Factory blade as shown in the following image:
3. On the Source data store page, click Azure Blob Storage tile. You use this page to specify the source
data store for the copy task.
4. On the Specify the Azure Blob storage account page:
a. Enter AzureStorageLinkedService for Linked service name.
b. Confirm that From Azure subscriptions option is selected for Account selection method.
c. Select your Azure subscription.
d. Select an Azure storage account from the list of Azure storage accounts available in the
selected subscription. You can also choose to enter storage account settings manually by
selecting Enter manually option for the Account selection method, and then click Next.
6. On the Choose the input file or folder page, click Next. Do not select Binary copy.
7. On the File format settings page, you see the delimiters and the schema that the wizard auto-detected by parsing the file. You can also enter the delimiters manually to override the auto-detected values. Click Next after you review the delimiters and preview the data.
8. On the Destination data store page, select Azure SQL Database, and click Next.
10. On the Table mapping page, select emp for the Destination field from the drop-down list, and click the down arrow (optional) to see the schema and to preview the data.
11. On the Schema mapping page, click Next.
3. To see the latest status of the slices, click the Refresh button in the ACTIVITY WINDOWS list at the bottom. You see five activity windows for the five days between the start and end times of the pipeline. The list is not automatically refreshed, so you may need to click Refresh a couple of times before you see all the activity windows in the Ready state.
4. Select an activity window in the list. See the details about it in the Activity Window Explorer on the
right.
Notice that the dates 11, 12, 13, 14, and 15 are in green color, which means that the daily output slices
for these dates have already been produced. You also see this color coding on the pipeline and the
output dataset in the diagram view. In the previous step, notice that two slices have already been
produced, one slice is currently being processed, and the other two are waiting to be processed (based
on the color coding).
For more information on using this application, see Monitor and manage pipeline using Monitoring
App article.
Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a
destination data store in a copy operation. The copy activity supports the following data stores as sources and destinations:
Azure: Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Table storage
Databases: DB2*, MySQL*, Oracle*, PostgreSQL*, SAP HANA*, SQL Server*, Sybase*, Teradata*
NoSQL: Cassandra*, MongoDB*
File: Amazon S3, File System*, FTP, HDFS*, SFTP
Others: Generic OData, Generic ODBC*, Salesforce, GE Historian*
Data stores marked with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway.
For details about the fields/properties that you see in the copy wizard for a data store, click the data store in the list.
Tutorial: Use Azure portal to create a Data Factory
pipeline to copy data
In this article, you learn how to use Azure portal to create a data factory with a pipeline that copies data from
an Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the
Introduction to Azure Data Factory article before doing this tutorial.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks,
see supported data stores. The activity is powered by a globally available service that can copy data between
various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see
Data Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.
NOTE
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how
to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.
Prerequisites
Complete prerequisites listed in the tutorial prerequisites article before performing this tutorial.
Steps
Here are the steps you perform as part of this tutorial:
1. Create an Azure data factory. In this step, you create a data factory named ADFTutorialDataFactory.
2. Create linked services in the data factory. In this step, you create two linked services of types: Azure
Storage and Azure SQL Database.
The AzureStorageLinkedService links your Azure storage account to the data factory. You created a
container and uploaded data to this storage account as part of prerequisites.
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from
the blob storage is stored in this database. You created a SQL table in this database as part of
prerequisites.
3. Create input and output datasets in the data factory.
The Azure storage linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure storage account. And, the input blob dataset specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory
service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset
specifies the table in the database to which the data from the blob storage is copied.
4. Create a pipeline in the data factory. In this step, you create a pipeline with a copy activity.
The copy activity copies data from a blob in the Azure blob storage to a table in the Azure SQL
database. You can use a copy activity in a pipeline to copy data from any supported source to any
supported destination. For a list of supported data stores, see data movement activities article.
5. Monitor the pipeline. In this step, you monitor the slices of input and output datasets by using Azure
portal.
A data factory can have one or more pipelines. A pipeline can have one or more activities in it: for example, a Copy Activity to copy data from a source to a destination data store, and an HDInsight Hive activity to run a Hive script that transforms input data to produce output data. Let's start with creating the data factory in this step.
1. After logging in to the Azure portal, click New on the left menu, click Data + Analytics, and click Data
Factory.
b. Select your Azure subscription in which you want to create the data factory.
c. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
Some of the steps in this tutorial assume that you use the name:
ADFTutorialResourceGroup for the resource group. To learn about resource groups, see
Using resource groups to manage your Azure resources.
d. Select the location for the data factory. Only regions supported by the Data Factory service are
shown in the drop-down list.
e. Select Pin to dashboard.
f. Click Create.
IMPORTANT
To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.
The name of the data factory may be registered as a DNS name in the future and hence become
publicly visible.
3. On the dashboard, you see the following tile with status: Deploying data factory.
4. After the creation is complete, you see the Data Factory blade as shown in the image.
Create a linked service for the Azure Storage account
1. On the Data Factory blade for your data factory, click Author and deploy to launch the Data Factory Editor.
2. You see the Data Factory Editor as shown in the following image:
3. In the editor, click New data store button on the toolbar and select Azure storage from the drop-
down menu. You should see the JSON template for creating an Azure storage linked service in the right
pane.
4. Replace <accountname> and <accountkey> with the account name and account key values for your
Azure storage account.
5. Click Deploy on the toolbar. You should see the deployed AzureStorageLinkedService in the tree
view now.
For more information about JSON properties in the linked service definition, see Azure Blob Storage
connector article.
Create a linked service for the Azure SQL Database
In this step, you link your Azure SQL database to your data factory. You specify the Azure SQL server name,
database name, user name, and user password in this section.
1. In the Data Factory Editor, click New data store button on the toolbar and select Azure SQL Database
from the drop-down menu. You should see the JSON template for creating the Azure SQL linked service in
the right pane.
2. Replace <servername> , <databasename> , <username>@<servername> , and <password> with names of your
Azure SQL server, database, user account, and password.
3. Click Deploy on the toolbar to create and deploy the AzureSqlLinkedService.
4. Confirm that you see AzureSqlLinkedService in the tree view under Linked services.
For more information about these JSON properties, see Azure SQL Database connector.
Create datasets
In the previous step, you created linked services to link your Azure Storage account and Azure SQL database to
your data factory. In this step, you define two datasets named InputDataset and OutputDataset that represent
input and output data that is stored in the data stores referred by AzureStorageLinkedService and
AzureSqlLinkedService respectively.
The Azure storage linked service specifies the connection string that Data Factory service uses at run time to
connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses
at run time to connect to your Azure SQL database. And, the output SQL table dataset (OutputDataset) specifies
the table in the database to which the data from the blob storage is copied.
Create input dataset
In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a
blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service.
If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the
destination. In this tutorial, you specify a value for the fileName.
1. In the Editor for the Data Factory, click ... More, click New dataset, and click Azure Blob storage from
the drop-down menu.
2. Replace JSON in the right pane with the following JSON snippet:
{
"name": "InputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adftutorial/",
"fileName": "emp.txt",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
format -> type The input file is in the text format, so we use
TextFormat.
For more information about these JSON properties, see Azure Blob connector article.
3. Click Deploy on the toolbar to create and deploy the InputDataset dataset. Confirm that you see the
InputDataset in the tree view.
Create output dataset
The Azure SQL Database linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure SQL database. The output SQL table dataset (OutputDataset) you create in this
step specifies the table in the database to which the data from the blob storage is copied.
1. In the Editor for the Data Factory, click ... More, click New dataset, and click Azure SQL from the drop-
down menu.
2. Replace JSON in the right pane with the following JSON snippet:
{
"name": "OutputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "emp"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
There are three columns ID, FirstName, and LastName in the emp table in the database. ID is an
identity column, so you need to specify only FirstName and LastName here.
For more information about these JSON properties, see Azure SQL connector article.
3. Click Deploy on the toolbar to create and deploy the OutputDataset dataset. Confirm that you see the
OutputDataset in the tree view under Datasets.
Create pipeline
In this step, you create a pipeline with a copy activity that uses InputDataset as an input and
OutputDataset as an output.
Currently, output dataset is what drives the schedule. In this tutorial, output dataset is configured to produce a
slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours.
Therefore, 24 slices of output dataset are produced by the pipeline.
1. In the Editor for the Data Factory, click ... More, and click New pipeline. Alternatively, you can right-click
Pipelines in the tree view and click New pipeline.
2. Replace JSON in the right pane with the following JSON snippet:
{
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z"
}
}
Monitor pipeline
In this step, you use the Azure portal to monitor what's going on in an Azure data factory.
Monitor pipeline using Monitor & Manage App
The following steps show you how to monitor pipelines in your data factory by using the Monitor & Manage
application:
1. Click Monitor & Manage tile on the home page for your data factory.
NOTE
If you see that the web browser is stuck at "Authorizing...", do one of the following: clear the Block third-party
cookies and site data check box (or) create an exception for login.microsoftonline.com, and then try to
open the app again.
3. Change the Start time and End time to include start (2017-05-11) and end times (2017-05-12) of your
pipeline, and click Apply.
4. You see the activity windows associated with each hour between pipeline start and end times in the list in
the middle pane.
5. To see details about an activity window, select the activity window in the Activity Windows list.
In Activity Window Explorer on the right, you see that the slices up to the current UTC time (8:12 PM)
are all processed (in green color). The 8-9 PM, 9 - 10 PM, 10 - 11 PM, 11 PM - 12 AM slices are not
processed yet.
The Attempts section in the right pane provides information about the activity run for the data slice. If
there was an error, it provides details about the error. For example, if the input folder or container does
not exist and the slice processing fails, you see an error message stating that the container or folder
does not exist.
6. Launch SQL Server Management Studio, connect to the Azure SQL Database, and verify that the rows
are inserted into the emp table in the database.
For detailed information about using this application, see Monitor and manage Azure Data Factory pipelines
using Monitoring and Management App.
Monitor pipeline using Diagram View
You can also monitor data pipelines by using the diagram view.
1. In the Data Factory blade, click Diagram.
2. You should see the diagram similar to the following image:
3. In the diagram view, double-click InputDataset to see slices for the dataset.
4. Click See more link to see all the data slices. You see 24 hourly slices between pipeline start and end
times.
Notice that all the data slices up to the current UTC time are Ready because the emp.txt file exists all
the time in the adftutorial blob container. The slices for future times are not in the Ready state
yet. Confirm that no slices show up in the Recently failed slices section at the bottom.
5. Close the blades until you see the diagram view (or) scroll left to see the diagram view. Then, double-click
OutputDataset.
6. Click See more link on the Table blade for OutputDataset to see all the slices.
7. Notice that all the slices up to the current UTC time move from the Pending execution state => In progress
=> Ready state. The slices from the past (before the current time) are processed from latest to oldest by
default. For example, if the current time is 8:12 PM UTC, the slice for 7 PM - 8 PM is processed ahead of the
6 PM - 7 PM slice. The 8 PM - 9 PM slice is processed at the end of the time interval by default, that is after
9 PM.
8. Click any data slice from the list and you should see the Data slice blade. A piece of data associated
with an activity window is called a slice. A slice can be one file or multiple files.
If the slice is not in the Ready state, you can see the upstream slices that are not Ready and are
blocking the current slice from executing in the Upstream slices that are not ready list.
9. In the DATA SLICE blade, you should see all activity runs in the list at the bottom. Click an activity run
to see the Activity run details blade.
In this blade, you see how long the copy operation took, what throughput is, how many bytes of data
were read and written, run start time, run end time etc.
10. Click X to close all the blades until you get back to the home blade for the ADFTutorialDataFactory.
11. (Optional) Click the Datasets tile or Pipelines tile to get the blades you have seen in the preceding steps.
12. Launch SQL Server Management Studio, connect to the Azure SQL Database, and verify that the rows
are inserted into the emp table in the database.
Summary
In this tutorial, you created an Azure data factory to copy data from an Azure blob to an Azure SQL database.
You used the Azure portal to create the data factory, linked services, datasets, and a pipeline. Here are the
high-level steps you performed in this tutorial:
1. Created an Azure data factory.
2. Created linked services:
a. An Azure Storage linked service to link your Azure Storage account that holds input data.
b. An Azure SQL linked service to link your Azure SQL database that holds the output data.
3. Created datasets that describe input data and output data for pipelines.
4. Created a pipeline with a Copy Activity with BlobSource as source and SqlSink as sink.
Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:
Azure Cosmos DB
(DocumentDB API)
DB2*
MySQL*
Oracle*
PostgreSQL*
SAP HANA*
SQL Server*
Sybase*
Teradata*
NoSQL Cassandra*
MongoDB*
File Amazon S3
File System*
FTP
HDFS*
SFTP
Generic OData
Generic ODBC*
Salesforce
GE Historian*
To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Create a pipeline with Copy Activity using
Visual Studio
7/10/2017 19 min to read Edit Online
In this article, you learn how to use Microsoft Visual Studio to create a data factory with a pipeline that
copies data from an Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read
through the Introduction to Azure Data Factory article before doing this tutorial.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks,
see supported data stores. The activity is powered by a globally available service that can copy data between
various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see
Data Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.
NOTE
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how
to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.
Prerequisites
1. Read through Tutorial Overview article and complete the prerequisite steps.
2. To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.
3. You must have the following installed on your computer:
Visual Studio 2013 or Visual Studio 2015
Download Azure SDK for Visual Studio 2013 or Visual Studio 2015. Navigate to Azure Download
Page and click VS 2013 or VS 2015 in the .NET section.
Download the latest Azure Data Factory plugin for Visual Studio: VS 2013 or VS 2015. You can also
update the plugin by doing the following steps: On the menu, click Tools -> Extensions and
Updates -> Online -> Visual Studio Gallery -> Microsoft Azure Data Factory Tools for Visual
Studio -> Update.
Steps
Here are the steps you perform as part of this tutorial:
1. Create linked services in the data factory. In this step, you create two linked services of types: Azure
Storage and Azure SQL Database.
The AzureStorageLinkedService links your Azure storage account to the data factory. You created a
container and uploaded data to this storage account as part of prerequisites.
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from
the blob storage is stored in this database. You created a SQL table in this database as part of
prerequisites.
2. Create input and output datasets in the data factory.
The Azure storage linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure storage account. And, the input blob dataset specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory
service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset
specifies the table in the database to which the data from the blob storage is copied.
3. Create a pipeline in the data factory. In this step, you create a pipeline with a copy activity.
The copy activity copies data from a blob in the Azure blob storage to a table in the Azure SQL
database. You can use a copy activity in a pipeline to copy data from any supported source to any
supported destination. For a list of supported data stores, see data movement activities article.
4. Create an Azure data factory when deploying Data Factory entities (linked services, datasets/tables, and
pipelines).
Create Visual Studio project
3. Specify the name of the project, location for the solution, and name of the solution, and then click OK.
Create linked services
You create linked services in a data factory to link your data stores and compute services to the data factory. In
this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use
two data stores of type Azure Storage (source) and Azure SQL Database (destination).
Therefore, you create two linked services of types: AzureStorage and AzureSqlDatabase.
The Azure Storage linked service links your Azure storage account to the data factory. This storage account is
the one in which you created a container and uploaded the data as part of prerequisites.
Azure SQL linked service links your Azure SQL database to the data factory. The data that is copied from the
blob storage is stored in this database. You created the emp table in this database as part of prerequisites.
Linked services link data stores or compute services to an Azure data factory. See supported data stores for all
the sources and sinks supported by the Copy Activity. See compute linked services for the list of compute
services supported by Data Factory. In this tutorial, you do not use any compute service.
Create the Azure Storage linked service
1. In Solution Explorer, right-click Linked Services, point to Add, and click New Item.
2. In the Add New Item dialog box, select Azure Storage Linked Service from the list, and click Add.
3. Replace <accountname> and <accountkey> with the name of your Azure storage account and its key.
Create datasets
In the previous step, you created linked services to link your Azure Storage account and Azure SQL database
to your data factory. In this step, you define two datasets named InputDataset and OutputDataset that
represent input and output data that is stored in the data stores referred by AzureStorageLinkedService1 and
AzureSqlLinkedService1 respectively.
The Azure storage linked service specifies the connection string that Data Factory service uses at run time to
connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses
at run time to connect to your Azure SQL database. And, the output SQL table dataset (OutputDataset) specifies
the table in the database to which the data from the blob storage is copied.
Create input dataset
In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a
blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService1 linked
service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are
copied to the destination. In this tutorial, you specify a value for the fileName.
Here, you use the term "tables" rather than "datasets". A table is a rectangular dataset and is the only type of
dataset supported right now.
1. Right-click Tables in the Solution Explorer, point to Add, and click New Item.
2. In the Add New Item dialog box, select Azure Blob, and click Add.
3. Replace the JSON text with the following text and save the AzureBlobLocation1.json file.
{
"name": "InputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService1",
"typeProperties": {
"folderPath": "adftutorial/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
format -> type The input file is in the text format, so we use
TextFormat.
For more information about these JSON properties, see Azure Blob connector article.
Create output dataset
In this step, you create an output dataset named OutputDataset. This dataset points to a SQL table in the
Azure SQL database represented by AzureSqlLinkedService1.
1. Right-click Tables in the Solution Explorer again, point to Add, and click New Item.
2. In the Add New Item dialog box, select Azure SQL, and click Add.
3. Replace the JSON text with the following JSON and save the AzureSqlTableLocation1.json file.
{
"name": "OutputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService1",
"typeProperties": {
"tableName": "emp"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
There are three columns ID, FirstName, and LastName in the emp table in the database. ID is an
identity column, so you need to specify only FirstName and LastName here.
For more information about these JSON properties, see Azure SQL connector article.
Create pipeline
In this step, you create a pipeline with a copy activity that uses InputDataset as an input and
OutputDataset as an output.
Currently, output dataset is what drives the schedule. In this tutorial, output dataset is configured to produce a
slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours.
Therefore, 24 slices of output dataset are produced by the pipeline.
1. Right-click Pipelines in the Solution Explorer, point to Add, and click New Item.
2. Select Copy Data Pipeline in the Add New Item dialog box and click Add.
3. Replace the JSON with the following JSON and save the CopyActivity1.json file.
{
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z",
"isPaused": false
}
}
In the activities section, there is only one activity whose type is set to Copy. For more information
about the copy activity, see data movement activities. In Data Factory solutions, you can also use
data transformation activities.
Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is
specified as the sink type. For a complete list of data stores supported by the copy activity as
sources and sinks, see supported data stores. To learn how to use a specific supported data store
as a source/sink, click the link in the table.
Replace the value of the start property with the current day and end value with the next day.
You can specify only the date part and skip the time part of the date time. For example, "2016-
02-03", which is equivalent to "2016-02-03T00:00:00Z"
Both start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The
end time is optional, but we use it in this tutorial.
If you do not specify a value for the end property, it is calculated as "start + 48 hours". To run the
pipeline indefinitely, specify 9999-09-09 as the value for the end property.
In the preceding example, there are 24 data slices as each data slice is produced hourly.
For descriptions of JSON properties in a pipeline definition, see create pipelines article. For
descriptions of JSON properties in a copy activity definition, see data movement activities. For
descriptions of JSON properties supported by BlobSource, see Azure Blob connector article. For
descriptions of JSON properties supported by SqlSink, see Azure SQL Database connector
article.
IMPORTANT
The name of the Azure data factory must be globally unique. If you receive an error about the name of
data factory when publishing, change the name of the data factory (for example,
yournameVSTutorialFactory) and try publishing again. See Data Factory - Naming Rules topic for naming
rules for Data Factory artifacts.
5. In the Publish Items page, ensure that all the Data Factories entities are selected, and click Next to
switch to the Summary page.
6. Review the summary and click Next to start the deployment process and view the Deployment
Status.
7. In the Deployment Status page, you should see the status of the deployment process. Click Finish
after the deployment is done.
You can run the following command to confirm that the Data Factory provider is registered.
Get-AzureRmResourceProvider
Sign in to the Azure portal using your Azure subscription and navigate to a Data Factory blade, or
create a data factory in the Azure portal. This action automatically registers the provider for you.
The name of the data factory may be registered as a DNS name in the future and hence become publicly
visible.
IMPORTANT
To create Data Factory instances, you need to be an admin/co-admin of the Azure subscription.
Monitor pipeline
Navigate to the home page for your data factory:
1. Log in to Azure portal.
2. Click More services on the left menu, and click Data factories.
3. Start typing the name of your data factory.
4. Click your data factory in the results list to see the home page for your data factory.
5. Follow instructions from Monitor datasets and pipeline to monitor the pipeline and datasets you have
created in this tutorial. Currently, Visual Studio does not support monitoring Data Factory pipelines.
Summary
In this tutorial, you created an Azure data factory to copy data from an Azure blob to an Azure SQL database.
You used Visual Studio to create the data factory, linked services, datasets, and a pipeline. Here are the high-
level steps you performed in this tutorial:
1. Created an Azure data factory.
2. Created linked services:
a. An Azure Storage linked service to link your Azure Storage account that holds input data.
b. An Azure SQL linked service to link your Azure SQL database that holds the output data.
3. Created datasets, which describe input data and output data for pipelines.
4. Created a pipeline with a Copy Activity with BlobSource as source and SqlSink as sink.
To see how to use a HDInsight Hive Activity to transform data by using Azure HDInsight cluster, see Tutorial:
Build your first pipeline to transform data using Hadoop cluster.
You can chain two activities (run one activity after another) by setting the output dataset of one activity as the
input dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"description": "",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"$schema":
"http://datafactories.schema.management.azure.com/vsschemas/V1/Microsoft.DataFactory.Config.json",
"AzureStorageLinkedService1": [
{
"name": "$.properties.typeProperties.connectionString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
],
"AzureSqlLinkedService1": [
{
"name": "$.properties.typeProperties.connectionString",
"value": "Server=tcp:spsqlserver.database.windows.net,1433;Database=spsqldb;User
ID=spelluru;Password=Sowmya123;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
]
}
This example configures the connectionString property of an Azure Storage linked service and an Azure
SQL linked service. Notice that the syntax for specifying the name is a JsonPath expression.
If JSON has a property that has an array of values as shown in the following code:
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
Configure properties as shown in the following configuration file (use zero-based indexing):
{
"name": "$.properties.structure[0].name",
"value": "FirstName"
}
{
"name": "$.properties.structure[0].type",
"value": "String"
}
{
"name": "$.properties.structure[1].name",
"value": "LastName"
}
{
"name": "$.properties.structure[1].type",
"value": "String"
}
{
"name": "$.properties.activities[1].typeProperties.webServiceParameters.['Database server name']",
"value": "MyAsqlServer.database.windows.net"
}
4. Select the configuration file that you would like to use and click Next.
5. Confirm that you see the name of JSON file in the Summary page and click Next.
6. Click Finish after the deployment operation is finished.
When you deploy, the values from the configuration file are used to set values for properties in the JSON files
before the entities are deployed to Azure Data Factory service.
Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:
Azure Cosmos DB
(DocumentDB API)
DB2*
MySQL*
Oracle*
PostgreSQL*
SAP HANA*
SQL Server*
Sybase*
Teradata*
NoSQL Cassandra*
MongoDB*
File Amazon S3
File System*
FTP
HDFS*
SFTP
Generic OData
Generic ODBC*
Salesforce
GE Historian*
To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Create a Data Factory pipeline that moves
data by using Azure PowerShell
7/10/2017 17 min to read Edit Online
In this article, you learn how to use PowerShell to create a data factory with a pipeline that copies data from an
Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the
Introduction to Azure Data Factory article before doing this tutorial.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks,
see supported data stores. The activity is powered by a globally available service that can copy data between
various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see
Data Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.
NOTE
This article does not cover all the Data Factory cmdlets. See Data Factory Cmdlet Reference for comprehensive
documentation on these cmdlets.
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how
to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.
Prerequisites
Complete prerequisites listed in the tutorial prerequisites article.
Install Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Steps
Here are the steps you perform as part of this tutorial:
1. Create an Azure data factory. In this step, you create a data factory named ADFTutorialDataFactoryPSH.
2. Create linked services in the data factory. In this step, you create two linked services of types: Azure
Storage and Azure SQL Database.
The AzureStorageLinkedService links your Azure storage account to the data factory. You created a
container and uploaded data to this storage account as part of prerequisites.
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from
the blob storage is stored in this database. You created a SQL table in this database as part of
prerequisites.
3. Create input and output datasets in the data factory.
The Azure storage linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure storage account. And, the input blob dataset specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory
service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset
specifies the table in the database to which the data from the blob storage is copied.
4. Create a pipeline in the data factory. In this step, you create a pipeline with a copy activity.
The copy activity copies data from a blob in the Azure blob storage to a table in the Azure SQL
database. You can use a copy activity in a pipeline to copy data from any supported source to any
supported destination. For a list of supported data stores, see data movement activities article.
5. Monitor the pipeline. In this step, you monitor the slices of input and output datasets by using PowerShell.
A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a
Copy Activity to copy data from a source to a destination data store and an HDInsight Hive activity to run a Hive
script to transform input data to produce output data. Let's start by creating the data factory in this step.
1. Launch PowerShell. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen,
you need to run the commands again.
Run the following command, and enter the user name and password that you use to sign in to the
Azure portal:
Login-AzureRmAccount
Run the following command to view all the subscriptions for this account:
Get-AzureRmSubscription
Run the following command to select the subscription that you want to work with. Replace
<NameOfAzureSubscription> with the name of your Azure subscription:
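The selection command itself is not reproduced in this copy of the article; with the AzureRM module used throughout this tutorial, a typical invocation looks like the following (you supply the subscription name):
Select-AzureRmSubscription -SubscriptionName <NameOfAzureSubscription>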
Some of the steps in this tutorial assume that you use the resource group named
ADFTutorialResourceGroup. If you use a different resource group, you need to use it in place of
ADFTutorialResourceGroup in this tutorial.
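If the ADFTutorialResourceGroup resource group does not exist yet, create it before creating the data factory. The original command is not shown in this copy of the article; with the AzureRM module it would typically be:
New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"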
3. Run the New-AzureRmDataFactory cmdlet to create a data factory named
ADFTutorialDataFactoryPSH:
$df=New-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name
ADFTutorialDataFactoryPSH -Location "West US"
This name may already have been taken. Therefore, make the name of the data factory unique by
adding a prefix or suffix (for example: ADFTutorialDataFactoryPSH05152017) and run the command
again.
Note the following points:
The name of the Azure data factory must be globally unique. If you receive an error about the name of the
data factory, change the name (for example, yournameADFTutorialDataFactoryPSH). Use this name in place of
ADFTutorialDataFactoryPSH while performing steps in this tutorial. See Data Factory - Naming Rules for
Data Factory artifacts.
To create Data Factory instances, you must be a contributor or administrator of the Azure subscription.
The name of the data factory may be registered as a DNS name in the future, and hence become publicly
visible.
You may receive the following error: "This subscription is not registered to use namespace
Microsoft.DataFactory." Do one of the following, and try publishing again:
In Azure PowerShell, run the following command to register the Data Factory provider:
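The registration command is not included in this copy of the article; with the AzureRM module it is typically:
Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory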
Run the following command to confirm that the Data Factory provider is registered:
Get-AzureRmResourceProvider
Sign in by using the Azure subscription to the Azure portal. Go to a Data Factory blade, or create a
data factory in the Azure portal. This action automatically registers the provider for you.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=
<accountname>;AccountKey=<accountkey>"
}
}
}
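The deployment command is not shown in this copy of the article. Assuming the preceding JSON is saved as AzureStorageLinkedService.json (an assumed file name) in the C:\ADFGetStartedPSH folder used later in this tutorial, and that $df holds the data factory object returned by New-AzureRmDataFactory, a typical call is:
New-AzureRmDataFactoryLinkedService $df -File .\AzureStorageLinkedService.json
The output that follows confirms that the linked service was created.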
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
Properties : Microsoft.Azure.Management.DataFactories.Models.LinkedServiceProperties
ProvisioningState : Succeeded
Another way to create this linked service is to specify the resource group name and data factory name
instead of specifying the DataFactory object.
IMPORTANT
Replace <servername>, <databasename>, <username@servername>, and <password> with names of
your Azure SQL server, database, user account, and password.
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=
<databasename>;User ID=<user>@<server>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
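As with the storage linked service, the deployment command is not reproduced here. Assuming the preceding JSON is saved as AzureSqlLinkedService.json (an assumed file name) in the same folder, a typical call is:
New-AzureRmDataFactoryLinkedService $df -File .\AzureSqlLinkedService.json
The output that follows confirms that the linked service was created.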
LinkedServiceName : AzureSqlLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
Properties : Microsoft.Azure.Management.DataFactories.Models.LinkedServiceProperties
ProvisioningState : Succeeded
Confirm that Allow access to Azure services setting is turned on for your SQL database server. To
verify and turn it on, do the following steps:
a. Log in to the Azure portal
b. Click More services > on the left, and click SQL servers in the DATABASES category.
c. Select your server in the list of SQL servers.
d. On the SQL server blade, click Show firewall settings link.
e. In the Firewall settings blade, click ON for Allow access to Azure services.
f. Click Save on the toolbar.
Create datasets
In the previous step, you created linked services to link your Azure Storage account and Azure SQL database
to your data factory. In this step, you define two datasets named InputDataset and OutputDataset that
represent input and output data that is stored in the data stores referred by AzureStorageLinkedService and
AzureSqlLinkedService respectively.
The Azure storage linked service specifies the connection string that Data Factory service uses at run time to
connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses
at run time to connect to your Azure SQL database. And, the output SQL table dataset (OutputDataset) specifies
the table in the database to which the data from the blob storage is copied.
Create an input dataset
In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a
blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service.
If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the
destination. In this tutorial, you specify a value for the fileName.
1. Create a JSON file named InputDataset.json in the C:\ADFGetStartedPSH folder, with the following
content:
{
"name": "InputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "emp.txt",
"folderPath": "adftutorial/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
format -> type The input file is in the text format, so we use
TextFormat.
For more information about these JSON properties, see Azure Blob connector article.
2. Run the following command to create the Data Factory dataset.
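The command is not shown in this copy of the article; assuming the InputDataset.json file was saved in C:\ADFGetStartedPSH as described above and $df still holds the data factory object, a typical call is:
New-AzureRmDataFactoryDataset $df -File .\InputDataset.json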
DatasetName : InputDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
Availability : Microsoft.Azure.Management.DataFactories.Common.Models.Availability
Location : Microsoft.Azure.Management.DataFactories.Models.AzureBlobDataset
Policy : Microsoft.Azure.Management.DataFactories.Common.Models.Policy
Structure : {FirstName, LastName}
Properties : Microsoft.Azure.Management.DataFactories.Models.DatasetProperties
ProvisioningState : Succeeded
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
There are three columns ID, FirstName, and LastName in the emp table in the database. ID is an
identity column, so you need to specify only FirstName and LastName here.
For more information about these JSON properties, see Azure SQL connector article.
2. Run the following command to create the data factory dataset.
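The output dataset JSON and the command are not reproduced in this copy of the article. Assuming the output dataset definition (matching the OutputDataset JSON shown in the portal tutorial earlier) is saved as OutputDataset.json, an assumed file name, a typical call is:
New-AzureRmDataFactoryDataset $df -File .\OutputDataset.json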
Create a pipeline
In this step, you create a pipeline with a copy activity that uses InputDataset as an input and
OutputDataset as an output.
Currently, output dataset is what drives the schedule. In this tutorial, output dataset is configured to produce a
slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours.
Therefore, 24 slices of output dataset are produced by the pipeline.
1. Create a JSON file named ADFTutorialPipeline.json in the C:\ADFGetStartedPSH folder, with the
following content:
{
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z"
}
}
Note the following points:
In the activities section, there is only one activity whose type is set to Copy. For more information
about the copy activity, see data movement activities. In Data Factory solutions, you can also use
data transformation activities.
Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is
specified as the sink type. For a complete list of data stores supported by the copy activity as
sources and sinks, see supported data stores. To learn how to use a specific supported data store
as a source/sink, click the link in the table.
Replace the value of the start property with the current day and end value with the next day.
You can specify only the date part and skip the time part of the date time. For example, "2016-
02-03", which is equivalent to "2016-02-03T00:00:00Z"
Both start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The
end time is optional, but we use it in this tutorial.
If you do not specify a value for the end property, it is calculated as "start + 48 hours". To run the
pipeline indefinitely, specify 9999-09-09 as the value for the end property.
In the preceding example, there are 24 data slices as each data slice is produced hourly.
For descriptions of JSON properties in a pipeline definition, see create pipelines article. For
descriptions of JSON properties in a copy activity definition, see data movement activities. For
descriptions of JSON properties supported by BlobSource, see Azure Blob connector article. For
descriptions of JSON properties supported by SqlSink, see Azure SQL Database connector
article.
2. Run the following command to create the data factory pipeline.
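The command is not shown in this copy of the article; assuming the pipeline definition was saved as ADFTutorialPipeline.json as described above, a typical call is:
New-AzureRmDataFactoryPipeline $df -File .\ADFTutorialPipeline.json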
PipelineName : ADFTutorialPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
Properties : Microsoft.Azure.Management.DataFactories.Models.PipelineProperties
ProvisioningState : Succeeded
Congratulations! You have successfully created an Azure data factory with a pipeline to copy data from an
Azure blob storage to an Azure SQL database.
Monitor the pipeline
1. Run Get-AzureRmDataFactory and assign the output to a $df variable. For example:
$df=Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name
ADFTutorialDataFactoryPSH0516
Then, print the contents of $df to see the following output:
PS C:\ADFGetStartedPSH> $df
DataFactoryName : ADFTutorialDataFactoryPSH0516
DataFactoryId : 6f194b34-03b3-49ab-8f03-9f8a7b9d3e30
ResourceGroupName : ADFTutorialResourceGroup
Location : West US
Tags : {}
Properties : Microsoft.Azure.Management.DataFactories.Models.DataFactoryProperties
ProvisioningState : Succeeded
2. Run Get-AzureRmDataFactorySlice to get details about all slices of the OutputDataset, which is the
output dataset of the pipeline.
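The command itself is not reproduced in this copy of the article; a typical call, assuming $df still holds the data factory object, is:
Get-AzureRmDataFactorySlice $df -DatasetName OutputDataset -StartDateTime 2017-05-11T00:00:00Z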
The StartDateTime value you specify should match the start value in the pipeline JSON. You should see 24 slices, one for each
hour from 12 AM of the current day to 12 AM of the next day.
Here are three sample slices from the output:
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
DatasetName : OutputDataset
Start : 5/11/2017 11:00:00 PM
End : 5/12/2017 12:00:00 AM
RetryCount : 0
State : Ready
SubState :
LatencyStatus :
LongRetryCount : 0
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
DatasetName : OutputDataset
Start : 5/11/2017 9:00:00 PM
End : 5/11/2017 10:00:00 PM
RetryCount : 0
State : InProgress
SubState :
LatencyStatus :
LongRetryCount : 0
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
DatasetName : OutputDataset
Start : 5/11/2017 8:00:00 PM
End : 5/11/2017 9:00:00 PM
RetryCount : 0
State : Waiting
SubState : ConcurrencyLimit
LatencyStatus :
LongRetryCount : 0
3. Run Get-AzureRmDataFactoryRun to get the details of activity runs for a specific slice. Copy the
date-time value from the output of the previous command to specify the value for the StartDateTime
parameter.
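The command is not shown in this copy of the article; for the 9 PM - 10 PM slice shown above, a typical call is:
Get-AzureRmDataFactoryRun $df -DatasetName OutputDataset -StartDateTime "5/11/2017 09:00:00 PM"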
Id : c0ddbd75-d0c7-4816-a775-
704bbd7c7eab_636301332000000000_636301368000000000_OutputDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
DatasetName : OutputDataset
ProcessingStartTime : 5/16/2017 8:00:33 PM
ProcessingEndTime : 5/16/2017 8:01:36 PM
PercentComplete : 100
DataSliceStart : 5/11/2017 9:00:00 PM
DataSliceEnd : 5/11/2017 10:00:00 PM
Status : Succeeded
Timestamp : 5/16/2017 8:00:33 PM
RetryAttempt : 0
Properties : {}
ErrorMessage :
ActivityName : CopyFromBlobToSQL
PipelineName : ADFTutorialPipeline
Type : Copy
For comprehensive documentation on Data Factory cmdlets, see Data Factory Cmdlet Reference.
Summary
In this tutorial, you created an Azure data factory to copy data from an Azure blob to an Azure SQL database.
You used PowerShell to create the data factory, linked services, datasets, and a pipeline. Here are the high-
level steps you performed in this tutorial:
1. Created an Azure data factory.
2. Created linked services:
a. An Azure Storage linked service to link your Azure storage account that holds input data.
b. An Azure SQL linked service to link your SQL database that holds the output data.
3. Created datasets that describe input data and output data for pipelines.
4. Created a pipeline with Copy Activity, with BlobSource as the source and SqlSink as the sink.
Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:
Azure Cosmos DB
(DocumentDB API)
DB2*
MySQL*
Oracle*
PostgreSQL*
SAP HANA*
SQL Server*
Sybase*
Teradata*
NoSQL Cassandra*
MongoDB*
File Amazon S3
File System*
FTP
HDFS*
SFTP
Generic OData
Generic ODBC*
Salesforce
GE Historian*
To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Use Azure Resource Manager template to
create a Data Factory pipeline to copy data
7/10/2017 13 min to read Edit Online
This tutorial shows you how to use an Azure Resource Manager template to create an Azure data factory. The
data pipeline in this tutorial copies data from a source data store to a destination data store. It does not transform
input data to produce output data. For a tutorial on how to transform data using Azure Data Factory, see Tutorial:
Build a pipeline to transform data using Hadoop cluster.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see
supported data stores. The activity is powered by a globally available service that can copy data between various
data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data
Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.
NOTE
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to
transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.
Prerequisites
Go through Tutorial Overview and Prerequisites and complete the prerequisite steps.
Follow instructions in How to install and configure Azure PowerShell article to install latest version of Azure
PowerShell on your computer. In this tutorial, you use PowerShell to deploy Data Factory entities.
(optional) See Authoring Azure Resource Manager Templates to learn about Azure Resource Manager
templates.
In this tutorial
In this tutorial, you create a data factory with the following Data Factory entities:
ENTITY DESCRIPTION
Azure Storage linked service Links your Azure Storage account to the data factory. Azure
Storage is the source data store and Azure SQL database is
the sink data store for the copy activity in the tutorial. It
specifies the storage account that contains the input data for
the copy activity.
Azure SQL Database linked service Links your Azure SQL database to the data factory. It
specifies the Azure SQL database that holds the output data
for the copy activity.
ENTITY DESCRIPTION
Azure Blob input dataset Refers to the Azure Storage linked service. The linked service
refers to an Azure Storage account and the Azure Blob
dataset specifies the container, folder, and file name in the
storage that holds the input data.
Azure SQL output dataset Refers to the Azure SQL linked service. The Azure SQL linked
service refers to an Azure SQL server and the Azure SQL
dataset specifies the name of the table that holds the output
data.
Data pipeline The pipeline has one activity of type Copy that takes the
Azure blob dataset as an input and the Azure SQL dataset as
an output. The copy activity copies data from an Azure blob
to a table in the Azure SQL database.
A data factory can have one or more pipelines. A pipeline can have one or more activities in it. There are two
types of activities: data movement activities and data transformation activities. In this tutorial, you create a
pipeline with one activity (copy activity).
The following section provides the complete Resource Manager template for defining Data Factory entities so
that you can quickly run through the tutorial and test the template. To understand how each Data Factory entity
is defined, see Data Factory entities in the template section.
Create a JSON file named ADFCopyTutorialARM.json in C:\ADFGetStarted folder with the following content:
{
"contentVersion": "1.0.0.0",
"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {
"storageAccountName": { "type": "string", "metadata": { "description": "Name of the Azure storage
account that contains the data to be copied." } },
"storageAccountKey": { "type": "securestring", "metadata": { "description": "Key for the Azure storage
account." } },
"sourceBlobContainer": { "type": "string", "metadata": { "description": "Name of the blob container in
the Azure Storage account." } },
"sourceBlobName": { "type": "string", "metadata": { "description": "Name of the blob in the container
that has the data to be copied to Azure SQL Database table" } },
"sqlServerName": { "type": "string", "metadata": { "description": "Name of the Azure SQL Server that
will hold the output/copied data." } },
"databaseName": { "type": "string", "metadata": { "description": "Name of the Azure SQL Database in
the Azure SQL server." } },
"sqlServerUserName": { "type": "string", "metadata": { "description": "Name of the user that has
access to the Azure SQL server." } },
"sqlServerPassword": { "type": "securestring", "metadata": { "description": "Password for the user." }
},
"targetSQLTable": { "type": "string", "metadata": { "description": "Table in the Azure SQL Database
that will hold the copied data." }
}
},
"variables": {
"dataFactoryName": "[concat('AzureBlobToAzureSQLDatabaseDF', uniqueString(resourceGroup().id))]",
"azureSqlLinkedServiceName": "AzureSqlLinkedService",
"azureStorageLinkedServiceName": "AzureStorageLinkedService",
"blobInputDatasetName": "BlobInputDataset",
"sqlOutputDatasetName": "SQLOutputDataset",
"pipelineName": "Blob2SQLPipeline"
},
"resources": [
{
"name": "[variables('dataFactoryName')]",
"apiVersion": "2015-10-01",
"type": "Microsoft.DataFactory/datafactories",
"location": "West US",
"resources": [
{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parame
ters('storageAccountKey'))]"
}
}
},
{
"type": "linkedservices",
"name": "[variables('azureSqlLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureSqlDatabase",
"description": "Azure SQL linked service",
"typeProperties": {
"connectionString": "
[concat('Server=tcp:',parameters('sqlServerName'),'.database.windows.net,1433;Database=',
parameters('databaseName'), ';User
ID=',parameters('sqlServerUserName'),';Password=',parameters('sqlServerPassword'),';Trusted_Connection=False
;Encrypt=True;Connection Timeout=30')]"
}
}
},
{
"type": "datasets",
"name": "[variables('blobInputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
"structure": [
{
"name": "Column0",
"type": "String"
},
{
"name": "Column1",
"type": "String"
}
],
"typeProperties": {
"folderPath": "[concat(parameters('sourceBlobContainer'), '/')]",
"fileName": "[parameters('sourceBlobName')]",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
},
{
"type": "datasets",
"name": "[variables('sqlOutputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureSqlLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "[variables('azureSqlLinkedServiceName')]",
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"typeProperties": {
"tableName": "[parameters('targetSQLTable')]"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
},
{
"type": "datapipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('azureSqlLinkedServiceName')]",
"[variables('blobInputDatasetName')]",
"[variables('sqlOutputDatasetName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"activities": [
{
"name": "CopyFromAzureBlobToAzureSQL",
"description": "Copy data frm Azure blob to Azure SQL",
"type": "Copy",
"inputs": [
{
"name": "[variables('blobInputDatasetName')]"
}
],
"outputs": [
{
"name": "[variables('sqlOutputDatasetName')]"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'emp')"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "Column0:FirstName,Column1:LastName"
"columnMappings": "Column0:FirstName,Column1:LastName"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 3,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z"
}
}
]
}
]
}
Parameters JSON
Create a JSON file named ADFCopyTutorialARM-Parameters.json that contains parameters for the Azure
Resource Manager template.
IMPORTANT
Specify name and key of your Azure Storage account for storageAccountName and storageAccountKey parameters.
Specify Azure SQL server, database, user, and password for sqlServerName, databaseName, sqlServerUserName, and
sqlServerPassword parameters.
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"storageAccountName": { "value": "<Name of the Azure storage account>" },
"storageAccountKey": {
"value": "<Key for the Azure storage account>"
},
"sourceBlobContainer": { "value": "adftutorial" },
"sourceBlobName": { "value": "emp.txt" },
"sqlServerName": { "value": "<Name of the Azure SQL server>" },
"databaseName": { "value": "<Name of the Azure SQL database>" },
"sqlServerUserName": { "value": "<Name of the user who has access to the Azure SQL database>" },
"sqlServerPassword": { "value": "<password for the user>" },
"targetSQLTable": { "value": "emp" }
}
}
IMPORTANT
You may have separate parameter JSON files for development, testing, and production environments that you can use
with the same Data Factory JSON template. By using a PowerShell script, you can automate deploying Data Factory
entities in these environments.
1. Launch Azure PowerShell and run the following command, entering the user name and password that you use
to sign in to the Azure portal:
Login-AzureRmAccount
Run the following command to view all the subscriptions for this account:
Get-AzureRmSubscription
Then run the following command to select the subscription that you want to work with.
2. Run the following command to deploy Data Factory entities using the Resource Manager template you
created in Step 1.
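The deployment command itself is not reproduced here. A minimal PowerShell sketch of what it might look like, assuming the template was saved as ADFCopyTutorialARM.json (a hypothetical file name) next to the parameter file and that you deploy into a resource group named ADFTutorialResourceGroup:

# Select the subscription to deploy into.
Select-AzureRmSubscription -SubscriptionName "<NameOfAzureSubscription>"

# Deploy the Data Factory entities defined in the template, using the parameter file created earlier.
New-AzureRmResourceGroupDeployment -Name MyARMDeployment `
    -ResourceGroupName ADFTutorialResourceGroup `
    -TemplateFile .\ADFCopyTutorialARM.json `
    -TemplateParameterFile .\ADFCopyTutorialARM-Parameters.json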
Monitor pipeline
1. Log in to the Azure portal using your Azure account.
2. Click Data factories on the left menu, or click More services and then click Data factories under the
INTELLIGENCE + ANALYTICS category.
3. In the Data factories page, search for and find your data factory (AzureBlobToAzureSQLDatabaseDF).
4. Click your Azure data factory. You see the home page for the data factory.
5. Follow instructions from Monitor datasets and pipeline to monitor the pipeline and datasets you have created
in this tutorial. Currently, Visual Studio does not support monitoring Data Factory pipelines.
6. When a slice is in the Ready state, verify that the data is copied to the emp table in the Azure SQL database.
For more information on how to use Azure portal blades to monitor pipeline and datasets you have created in
this tutorial, see Monitor datasets and pipeline .
For more information on how to use the Monitor & Manage application to monitor your data pipelines, see
Monitor and manage Azure Data Factory pipelines using Monitoring App.
"resources": [
{
"name": "[variables('dataFactoryName')]",
"apiVersion": "2015-10-01",
"type": "Microsoft.DataFactory/datafactories",
"location": "West US"
}
Azure Storage linked service
{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parame
ters('storageAccountKey'))]"
}
}
}
The connectionString uses the storageAccountName and storageAccountKey parameters. The values for these
parameters are passed in by using a parameter file. The definition also uses the azureStorageLinkedServiceName
and dataFactoryName variables defined in the template.
Azure SQL Database linked service
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob
storage is stored in this database. You created the emp table in this database as part of prerequisites. You specify
the Azure SQL server name, database name, user name, and user password in this section. See Azure SQL linked
service for details about JSON properties used to define an Azure SQL linked service.
{
"type": "linkedservices",
"name": "[variables('azureSqlLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureSqlDatabase",
"description": "Azure SQL linked service",
"typeProperties": {
"connectionString": "
[concat('Server=tcp:',parameters('sqlServerName'),'.database.windows.net,1433;Database=',
parameters('databaseName'), ';User
ID=',parameters('sqlServerUserName'),';Password=',parameters('sqlServerPassword'),';Trusted_Connection=False
;Encrypt=True;Connection Timeout=30')]"
}
}
}
Data pipeline
You define a pipeline that copies data from the Azure blob dataset to the Azure SQL dataset. See Pipeline JSON
for descriptions of JSON elements used to define a pipeline in this example.
{
"type": "datapipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('azureSqlLinkedServiceName')]",
"[variables('blobInputDatasetName')]",
"[variables('sqlOutputDatasetName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"activities": [
{
"name": "CopyFromAzureBlobToAzureSQL",
"description": "Copy data frm Azure blob to Azure SQL",
"type": "Copy",
"inputs": [
{
"name": "[variables('blobInputDatasetName')]"
}
],
"outputs": [
{
"name": "[variables('sqlOutputDatasetName')]"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'emp')"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "Column0:FirstName,Column1:LastName"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 3,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z"
}
}
Notice that the first command uses the parameter file for the development environment, the second one for the
test environment, and the third one for the production environment.
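The three deployment commands referred to here are not reproduced in this article. A sketch of what they might look like, assuming hypothetical per-environment parameter file names and a shared resource group:

# Deploy the same template once per environment, each time with its own parameter file.
New-AzureRmResourceGroupDeployment -Name DevDeployment -ResourceGroupName ADFTutorialResourceGroup `
    -TemplateFile .\ADFCopyTutorialARM.json -TemplateParameterFile .\ADFCopyTutorialARM-Parameters-Dev.json
New-AzureRmResourceGroupDeployment -Name TestDeployment -ResourceGroupName ADFTutorialResourceGroup `
    -TemplateFile .\ADFCopyTutorialARM.json -TemplateParameterFile .\ADFCopyTutorialARM-Parameters-Test.json
New-AzureRmResourceGroupDeployment -Name ProdDeployment -ResourceGroupName ADFTutorialResourceGroup `
    -TemplateFile .\ADFCopyTutorialARM.json -TemplateParameterFile .\ADFCopyTutorialARM-Parameters-Prod.json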
You can also reuse the template to perform repeated tasks. For example, you may need to create many data
factories with one or more pipelines that implement the same logic, but where each data factory uses different
Azure Storage and Azure SQL Database accounts. In this scenario, you use the same template in the same
environment (dev, test, or production) with different parameter files to create the data factories.
Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:
Azure: Azure Cosmos DB (DocumentDB API)
Databases: DB2*, MySQL*, Oracle*, PostgreSQL*, SAP HANA*, SQL Server*, Sybase*, Teradata*
NoSQL: Cassandra*, MongoDB*
File: Amazon S3, File System*, FTP, HDFS*, SFTP
Others: Generic OData, Generic ODBC*, Salesforce, GE Historian*
To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Use REST API to create an Azure Data
Factory pipeline to copy data
8/21/2017 17 min to read Edit Online
In this article, you learn how to use REST API to create a data factory with a pipeline that copies data from an
Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the
Introduction to Azure Data Factory article before doing this tutorial.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see
supported data stores. The activity is powered by a globally available service that can copy data between various
data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data
Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.
NOTE
This article does not cover the entire Data Factory REST API. See the Data Factory REST API Reference for
comprehensive documentation.
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to
transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.
Prerequisites
Go through Tutorial Overview and complete the prerequisite steps.
Install Curl on your machine. You use the Curl tool with REST commands to create a data factory.
Follow instructions from this article to:
1. Create a Web application named ADFCopyTutorialApp in Azure Active Directory.
2. Get client ID and secret key.
3. Get tenant ID.
4. Assign the ADFCopyTutorialApp application to the Data Factory Contributor role.
Install Azure PowerShell.
Launch PowerShell and do the following steps. Keep Azure PowerShell open until the end of this tutorial.
If you close and reopen, you need to run the commands again.
1. Run the following command and enter the user name and password that you use to sign in to the
Azure portal:
Login-AzureRmAccount
2. Run the following command to view all the subscriptions for this account:
Get-AzureRmSubscription
3. Run the following command to select the subscription that you want to work with. Replace
<NameOfAzureSubscription> with the name of your Azure subscription.
If the resource group already exists, you specify whether to update it (Y) or keep it as is (N).
Some of the steps in this tutorial assume that you use the resource group named
ADFTutorialResourceGroup. If you use a different resource group, you need to use the name of
your resource group in place of ADFTutorialResourceGroup in this tutorial.
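The subscription-selection and resource group commands themselves are not shown above. A minimal sketch, assuming the West US region for the resource group:

# Select the subscription to work with, then create (or update) the tutorial resource group.
Select-AzureRmSubscription -SubscriptionName "<NameOfAzureSubscription>"
New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"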
IMPORTANT
The data factory name must be globally unique, so you may want to add a prefix or suffix to ADFCopyTutorialDF to make it unique.
{
"name": "ADFCopyTutorialDF",
"location": "WestUS"
}
azurestoragelinkedservice.json
IMPORTANT
Replace accountname and accountkey with name and key of your Azure storage account. To learn how to get your
storage access key, see View, copy and regenerate storage access keys.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
For details about JSON properties, see Azure Storage linked service.
azuresqllinkedservice.json
IMPORTANT
Replace servername, databasename, username, and password with name of your Azure SQL server, name of SQL
database, user account, and password for the account.
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"description": "",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;User ID=<username>;Password=<password>;Integrated Security=False;Encrypt=True;Connect
Timeout=30"
}
}
}
For details about JSON properties, see Azure SQL linked service.
inputdataset.json
{
"name": "AzureBlobInput",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adftutorial/",
"fileName": "emp.txt",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
folderPath Specifies the blob container and the folder that contains the input blobs. In this tutorial, adftutorial is the blob container and the folder is the root folder.
fileName This property is optional. If you omit this property, all files from the folderPath are picked up. In this tutorial, emp.txt is specified for fileName, so only that file is picked up for processing.
format -> type The input file is in the text format, so we use TextFormat.
For more information about these JSON properties, see Azure Blob connector article.
outputdataset.json
{
"name": "AzureSqlOutput",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "emp"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The tableName property specifies the table in the Azure SQL database to which the data from the blob storage is
copied. There are three columns (ID, FirstName, and LastName) in the emp table in the database. ID is an identity
column, so you need to specify only FirstName and LastName here.
For more information about these JSON properties, see Azure SQL connector article.
pipeline.json
{
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"description": "Push Regional Effectiveness Campaign data to Azure SQL database",
"type": "Copy",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureSqlOutput"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z"
}
}
Note the following points:
In the activities section, there is only one activity whose type is set to Copy. For more information about the
copy activity, see data movement activities. In Data Factory solutions, you can also use data transformation
activities.
Input for the activity is set to AzureBlobInput and output for the activity is set to AzureSqlOutput.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the sink
type. For a complete list of data stores supported by the copy activity as sources and sinks, see supported data
stores. To learn how to use a specific supported data store as a source/sink, click the link in the table.
Replace the value of the start property with the current day and the end value with the next day. You can specify
only the date part and skip the time part of the date time. For example, "2017-02-03" is equivalent to
"2017-02-03T00:00:00Z".
Both start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The end time is
optional, but we use it in this tutorial.
If you do not specify value for the end property, it is calculated as "start + 48 hours". To run the pipeline
indefinitely, specify 9999-09-09 as the value for the end property.
In the preceding example, there are 24 data slices as each data slice is produced hourly.
For descriptions of JSON properties in a pipeline definition, see create pipelines article. For descriptions of JSON
properties in a copy activity definition, see data movement activities. For descriptions of JSON properties
supported by BlobSource, see Azure Blob connector article. For descriptions of JSON properties supported by
SqlSink, see Azure SQL Database connector article.
IMPORTANT
See Prerequisites section for instructions on getting client ID, client secret, tenant ID, and subscription ID.
$rg = "ADFTutorialResourceGroup"
Run the following command after updating the name of the data factory you are using:
$adf = "ADFCopyTutorialDF"
(ConvertFrom-Json $responseToken)
IMPORTANT
Confirm that the name of the data factory you specify here (ADFCopyTutorialDF) matches the name specified in the
datafactory.json.
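The curl commands that acquire the Azure AD token and create the data factory are not reproduced in this article. An equivalent sketch that uses Invoke-RestMethod instead of the curl approach this tutorial describes, assuming the Microsoft.DataFactory/datafactories resource path and the 2015-10-01 API version used by the Resource Manager template earlier in this document:

# Acquire an Azure AD token for the ADFCopyTutorialApp service principal.
$tokenBody = @{
    grant_type    = "client_credentials"
    resource      = "https://management.core.windows.net/"
    client_id     = "<client ID>"
    client_secret = "<client secret>"
}
$token = (Invoke-RestMethod -Method Post `
    -Uri "https://login.microsoftonline.com/<tenant ID>/oauth2/token" `
    -Body $tokenBody).access_token

# Create (or update) the data factory from the datafactory.json file.
$dfUri = "https://management.azure.com/subscriptions/<subscription ID>/resourcegroups/$rg" +
    "/providers/Microsoft.DataFactory/datafactories/$adf" + "?api-version=2015-10-01"
$results = Invoke-RestMethod -Method Put -Uri $dfUri `
    -Headers @{ Authorization = "Bearer $token" } `
    -ContentType "application/json" `
    -Body (Get-Content .\datafactory.json -Raw)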
3. View the results. If the data factory has been successfully created, you see the JSON for the data factory in
the results; otherwise, you see an error message.
Write-Host $results
You can run the following command to confirm that the Data Factory provider is registered.
Get-AzureRmResourceProvider
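A small sketch that narrows the check to the Data Factory provider and registers it if needed; these are standard AzureRM cmdlets rather than commands from this tutorial:

# Check the registration state of the Data Factory resource provider.
Get-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

# Register the provider if its RegistrationState is not "Registered".
Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory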
Alternatively, log in to the Azure portal with your Azure subscription and navigate to a Data Factory blade, or
create a data factory in the Azure portal. This action automatically registers the provider for you.
Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to
link the source and destination data stores to your data factory. Then, you define input and output datasets to
represent data in the linked data stores. Finally, you create the pipeline with an activity that uses these datasets.
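The curl calls that deploy these entities are not reproduced here. A sketch of the Azure Storage linked service deployment using Invoke-RestMethod, assuming the same token and variables as in the data factory sketch above; the other linked service, the datasets, and the pipeline are deployed the same way against their own resource paths (linkedservices, datasets, datapipelines):

# Deploy the linked service defined in azurestoragelinkedservice.json.
$lsUri = "https://management.azure.com/subscriptions/<subscription ID>/resourcegroups/$rg" +
    "/providers/Microsoft.DataFactory/datafactories/$adf" +
    "/linkedservices/AzureStorageLinkedService?api-version=2015-10-01"
$results = Invoke-RestMethod -Method Put -Uri $lsUri `
    -Headers @{ Authorization = "Bearer $token" } `
    -ContentType "application/json" `
    -Body (Get-Content .\azurestoragelinkedservice.json -Raw)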
3. View the results. If the linked service has been successfully created, you see the JSON for the linked service
in the results; otherwise, you see an error message.
Write-Host $results
3. View the results. If the linked service has been successfully created, you see the JSON for the linked service
in the results; otherwise, you see an error message.
Write-Host $results
Create datasets
In the previous step, you created linked services to link your Azure Storage account and Azure SQL database to
your data factory. In this step, you define two datasets named AzureBlobInput and AzureSqlOutput that represent
input and output data that is stored in the data stores referred by AzureStorageLinkedService and
AzureSqlLinkedService respectively.
The Azure storage linked service specifies the connection string that Data Factory service uses at run time to
connect to your Azure storage account. And, the input blob dataset (AzureBlobInput) specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses at
run time to connect to your Azure SQL database. And, the output SQL table dataset (AzureSqlOutput) specifies the
table in the database to which the data from the blob storage is copied.
Create input dataset
In this step, you create a dataset named AzureBlobInput that points to a blob file (emp.txt) in the root folder of a
blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service. If
you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the
destination. In this tutorial, you specify a value for the fileName.
1. Assign the command to variable named cmd.
3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the
results; otherwise, you see an error message.
Write-Host $results
3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the
results; otherwise, you see an error message.
Write-Host $results
Create pipeline
In this step, you create a pipeline with a copy activity that uses AzureBlobInput as an input and
AzureSqlOutput as an output.
Currently, the output dataset is what drives the schedule. In this tutorial, the output dataset is configured to
produce a slice once an hour. The pipeline has a start time and an end time that are one day apart, which is 24
hours. Therefore, 24 slices of the output dataset are produced by the pipeline.
1. Assign the command to variable named cmd.
3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the
results; otherwise, you see an error message.
Write-Host $results
Congratulations! You have successfully created an Azure data factory, with a pipeline that copies data from
Azure Blob Storage to Azure SQL database.
Monitor pipeline
In this step, you use Data Factory REST API to monitor slices being produced by the pipeline.
$ds ="AzureSqlOutput"
IMPORTANT
Make sure that the start and end times specified in the following command match the start and end times of the pipeline.
Run the Invoke-Command and the next one until you see a slice in Ready state or Failed state. When the slice is
in Ready state, check the emp table in your Azure SQL database for the output data.
For each slice, two rows of data from the source file are copied to the emp table in the Azure SQL database.
Therefore, you see 24 new records in the emp table when all the slices are successfully processed (in Ready
state).
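If you prefer not to hand-craft the monitoring REST call, the AzureRM.DataFactories PowerShell cmdlet returns the same slice status information. A sketch, assuming the resource group and data factory names used earlier in this tutorial (the -DatasetName parameter may be named -TableName in older module versions):

# List slices of the output dataset for the pipeline's active period.
Get-AzureRmDataFactorySlice -ResourceGroupName "ADFTutorialResourceGroup" `
    -DataFactoryName "ADFCopyTutorialDF" `
    -DatasetName $ds `
    -StartDateTime "2017-05-11T00:00:00Z"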
Summary
In this tutorial, you used REST API to create an Azure data factory to copy data from an Azure blob to an Azure
SQL database. Here are the high-level steps you performed in this tutorial:
1. Created an Azure data factory.
2. Created linked services:
a. An Azure Storage linked service to link your Azure Storage account that holds input data.
b. An Azure SQL linked service to link your Azure SQL database that holds the output data.
3. Created datasets, which describe input data and output data for pipelines.
4. Created a pipeline with a Copy Activity with BlobSource as source and SqlSink as sink.
Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:
Azure: Azure Cosmos DB (DocumentDB API)
Databases: DB2*, MySQL*, Oracle*, PostgreSQL*, SAP HANA*, SQL Server*, Sybase*, Teradata*
NoSQL: Cassandra*, MongoDB*
File: Amazon S3, File System*, FTP, HDFS*, SFTP
Others: Generic OData, Generic ODBC*, Salesforce, GE Historian*
To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Create a pipeline with Copy Activity using
.NET API
7/11/2017 14 min to read Edit Online
In this article, you learn how to use .NET API to create a data factory with a pipeline that copies data from an
Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the
Introduction to Azure Data Factory article before doing this tutorial.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see
supported data stores. The activity is powered by a globally available service that can copy data between various
data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data
Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.
NOTE
For complete documentation on .NET API for Data Factory, see Data Factory .NET API Reference.
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to
transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.
Prerequisites
Go through Tutorial Overview and Pre-requisites to get an overview of the tutorial and complete the
prerequisite steps.
Visual Studio 2012, 2013, or 2015
Download and install Azure .NET SDK
Azure PowerShell. Follow instructions in How to install and configure Azure PowerShell article to install Azure
PowerShell on your computer. You use Azure PowerShell to create an Azure Active Directory application.
Create an application in Azure Active Directory
Create an Azure Active Directory application, create a service principal for the application, and assign it to the
Data Factory Contributor role.
1. Launch PowerShell.
2. Run the following command and enter the user name and password that you use to sign in to the Azure
portal.
Login-AzureRmAccount
3. Run the following command to view all the subscriptions for this account.
Get-AzureRmSubscription
4. Run the following command to select the subscription that you want to work with. Replace
<NameOfAzureSubscription> with the name of your Azure subscription.
IMPORTANT
Note down SubscriptionId and TenantId from the output of this command.
5. Create an Azure resource group named ADFTutorialResourceGroup by running the following command
in the PowerShell.
If the resource group already exists, you specify whether to update it (Y) or keep it as is (N).
If you use a different resource group, you need to use the name of your resource group in place of
ADFTutorialResourceGroup in this tutorial.
6. Create an Azure Active Directory application.
If you get the following error, specify a different URL and run the command again.
Another object with the same value for property identifierUris already exists.
$azureAdApplication
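The commands that create the application, its service principal, and the role assignment are not reproduced in this article. A minimal sketch, assuming a hypothetical display name and contoso.org URLs (use any URI that is unique in your tenant):

# Create the Azure AD application. Depending on your AzureRM module version,
# -Password may need to be a SecureString rather than plain text.
$azureAdApplication = New-AzureRmADApplication `
    -DisplayName "<app display name>" `
    -HomePage "https://www.contoso.org/ADFDotNetTutorialApp" `
    -IdentifierUris "https://www.contoso.org/ADFDotNetTutorialApp" `
    -Password "<choose a strong password>"

# Create a service principal for the application and grant it the Data Factory Contributor role.
New-AzureRmADServicePrincipal -ApplicationId $azureAdApplication.ApplicationId
New-AzureRmRoleAssignment -RoleDefinitionName "Data Factory Contributor" `
    -ServicePrincipalName $azureAdApplication.ApplicationId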
5. Add the following using statements to the source file (Program.cs) in the project.
using System.Configuration;
using System.Collections.ObjectModel;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure;
using Microsoft.Azure.Management.DataFactories;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Common.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
6. Add the following code that creates an instance of DataPipelineManagementClient class to the Main
method. You use this object to create a data factory, a linked service, input and output datasets, and a
pipeline. You also use this object to monitor slices of a dataset at runtime.
// create data factory management client
string resourceGroupName = "ADFTutorialResourceGroup";
string dataFactoryName = "APITutorialFactory";
IMPORTANT
Replace the value of resourceGroupName with the name of your Azure resource group.
Update name of the data factory (dataFactoryName) to be unique. Name of the data factory must be globally
unique. See Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
7. Add the following code that creates a data factory to the Main method.
A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For
example, a Copy Activity to copy data from a source to a destination data store and a HDInsight Hive
activity to run a Hive script to transform input data to product output data. Let's start with creating the
data factory in this step.
8. Add the following code that creates an Azure Storage linked service to the Main method.
IMPORTANT
Replace storageaccountname and accountkey with name and key of your Azure Storage account.
// create a linked service for input data store: Azure Storage
Console.WriteLine("Creating Azure Storage linked service");
client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName,
new LinkedServiceCreateOrUpdateParameters()
{
LinkedService = new LinkedService()
{
Name = "AzureStorageLinkedService",
Properties = new LinkedServiceProperties
(
new AzureStorageLinkedService("DefaultEndpointsProtocol=https;AccountName=<storageaccountname>;AccountKey=<accountkey>")
)
}
}
);
You create linked services in a data factory to link your data stores and compute services to the data
factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake
Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination).
Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService
of types: AzureStorage and AzureSqlDatabase.
The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account
is the one in which you created a container and uploaded the data as part of prerequisites.
9. Add the following code that creates an Azure SQL linked service to the Main method.
IMPORTANT
Replace servername, databasename, username, and password with names of your Azure SQL server, database,
user, and password.
// create a linked service for output data store: Azure SQL Database
Console.WriteLine("Creating Azure SQL Database linked service");
client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName,
new LinkedServiceCreateOrUpdateParameters()
{
LinkedService = new LinkedService()
{
Name = "AzureSqlLinkedService",
Properties = new LinkedServiceProperties
(
new AzureSqlDatabaseLinkedService("Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30")
)
}
}
);
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the
blob storage is stored in this database. You created the emp table in this database as part of prerequisites.
10. Add the following code that creates input and output datasets to the Main method.
new DatasetCreateOrUpdateParameters()
{
Dataset = new Dataset()
{
Name = Dataset_Source,
Properties = new DatasetProperties()
{
Structure = new List<DataElement>()
{
new DataElement() { Name = "FirstName", Type = "String" },
new DataElement() { Name = "LastName", Type = "String" }
},
LinkedServiceName = "AzureStorageLinkedService",
TypeProperties = new AzureBlobDataset()
{
FolderPath = "adftutorial/",
FileName = "emp.txt"
},
External = true,
Availability = new Availability()
{
Frequency = SchedulePeriod.Hour,
Interval = 1,
},
client.Pipelines.CreateOrUpdate(resourceGroupName, dataFactoryName,
new PipelineCreateOrUpdateParameters()
{
Pipeline = new Pipeline()
{
Name = PipelineName,
Properties = new PipelineProperties()
{
Description = "Demo Pipeline for data transfer between blobs",
// Initial value for pipeline's active period. With this, you won't need to set slice status.
Start = PipelineActivePeriodStartTime,
End = PipelineActivePeriodEndTime,
13. Add the following code to get run details for a data slice to the Main method.
Console.WriteLine("Getting run details of a data slice");
14. Add the following helper method used by the Main method to the Program class.
NOTE
When you copy and paste the following code, make sure that the copied code is at the same level as the Main
method.
if (result != null)
return result.AccessToken;
15. In the Solution Explorer, expand the project (DataFactoryAPITestApp), right-click References, and click
Add Reference. Select the check box for the System.Configuration assembly, and then click OK.
16. Build the console application. Click Build on the menu and click Build Solution.
17. Confirm that there is at least one file in the adftutorial container in your Azure blob storage. If not, create
Emp.txt file in Notepad with the following content and upload it to the adftutorial container.
John, Doe
Jane, Doe
18. Run the sample by clicking Debug -> Start Debugging on the menu. When you see the Getting run
details of a data slice, wait for a few minutes, and press ENTER.
19. Use the Azure portal to verify that the data factory APITutorialFactory is created with the following artifacts:
Linked service: LinkedService_AzureStorage
Dataset: InputDataset and OutputDataset.
Pipeline: PipelineBlobSample
20. Verify that the two employee records are created in the emp table in the specified Azure SQL database.
Next steps
For complete documentation on .NET API for Data Factory, see Data Factory .NET API Reference.
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:
Azure: Azure Cosmos DB (DocumentDB API)
Databases: DB2*, MySQL*, Oracle*, PostgreSQL*, SAP HANA*, SQL Server*, Sybase*, Teradata*
NoSQL: Cassandra*, MongoDB*
File: Amazon S3, File System*, FTP, HDFS*, SFTP
Others: Generic OData, Generic ODBC*, Salesforce, GE Historian*
To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Build your first pipeline to transform data
using Hadoop cluster
8/24/2017 4 min to read Edit Online
In this tutorial, you build your first Azure data factory with a data pipeline. The pipeline transforms input
data by running Hive script on an Azure HDInsight (Hadoop) cluster to produce output data.
This article provides overview and prerequisites for the tutorial. After you complete the prerequisites, you
can do the tutorial using one of the following tools/SDKs: Azure portal, Visual Studio, PowerShell, Resource
Manager template, REST API. Select one of the options in the drop-down list at the beginning (or) links at
the end of this article to do the tutorial using one of these options.
Tutorial overview
In this tutorial, you perform the following steps:
1. Create a data factory. A data factory can contain one or more data pipelines that move and
transform data.
In this tutorial, you create one pipeline in the data factory.
2. Create a pipeline. A pipeline can have one or more activities (Examples: Copy Activity, HDInsight
Hive Activity). This sample uses the HDInsight Hive activity that runs a Hive script on a HDInsight
Hadoop cluster. The script first creates a table that references the raw web log data stored in Azure
blob storage and then partitions the raw data by year and month.
In this tutorial, the pipeline uses the Hive Activity to transform data by running a Hive query on an
Azure HDInsight Hadoop cluster.
3. Create linked services. You create a linked service to link a data store or a compute service to the
data factory. A data store such as Azure Storage holds input/output data of activities in the pipeline. A
compute service such as HDInsight Hadoop cluster processes/transforms data.
In this tutorial, you create two linked services: Azure Storage and Azure HDInsight. The Azure
Storage linked service links an Azure Storage Account that holds the input/output data to the data
factory. Azure HDInsight linked service links an Azure HDInsight cluster that is used to transform data
to the data factory.
4. Create input and output datasets. An input dataset represents the input for an activity in the pipeline
and an output dataset represents the output for the activity.
In this tutorial, the input and output datasets specify locations of input and output data in the Azure
Blob Storage. The Azure Storage linked service specifies what Azure Storage Account is used. An
input dataset specifies where the input files are located and an output dataset specifies where the
output files are placed.
See Introduction to Azure Data Factory article for a detailed overview of Azure Data Factory.
Here is the diagram view of the sample data factory you build in this tutorial. MyFirstPipeline has one
activity of type Hive that consumes AzureBlobInput dataset as an input and produces AzureBlobOutput
dataset as an output.
In this tutorial, inputdata folder of the adfgetstarted Azure blob container contains one file named
input.log. This log file has entries from three months: January, February, and March of 2016. Here are the
sample rows for each month in the input file.
2016-01-01,02:01:09,SAMPLEWEBSITE,GET,/blogposts/mvc4/step2.png,X-ARR-LOG-ID=2ec4b8ad-3cf0-4442-93ab-
837317ece6a1,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+
(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,-
,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-
post-scenarios.aspx,\N,200,0,0,53175,871
2016-02-01,02:01:10,SAMPLEWEBSITE,GET,/blogposts/mvc4/step7.png,X-ARR-LOG-ID=d7472a26-431a-4a4d-99eb-
c7b4fda2cf4c,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+
(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,-
,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-
post-scenarios.aspx,\N,200,0,0,30184,871
2016-03-01,02:01:10,SAMPLEWEBSITE,GET,/blogposts/mvc4/step7.png,X-ARR-LOG-ID=d7472a26-431a-4a4d-99eb-
c7b4fda2cf4c,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+
(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,-
,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-
post-scenarios.aspx,\N,200,0,0,30184,871
When the file is processed by the pipeline with HDInsight Hive Activity, the activity runs a Hive script on the
HDInsight cluster that partitions input data by year and month. The script creates three output folders that
contain a file with entries from each month.
adfgetstarted/partitioneddata/year=2016/month=1/000000_0
adfgetstarted/partitioneddata/year=2016/month=2/000000_0
adfgetstarted/partitioneddata/year=2016/month=3/000000_0
From the sample lines shown above, the first one (with 2016-01-01) is written to the 000000_0 file in the
month=1 folder. Similarly, the second one is written to the file in the month=2 folder and the third one is
written to the file in the month=3 folder.
Prerequisites
Before you begin this tutorial, you must have the following prerequisites:
1. Azure subscription - If you don't have an Azure subscription, you can create a free trial account in just a
couple of minutes. See the Free Trial article on how you can obtain a free trial account.
2. Azure Storage. You use a general-purpose standard Azure storage account for storing the data in this
tutorial. If you don't have a general-purpose standard Azure storage account, see the Create a storage
account article. After you have created the storage account, note down the account name and access
key. See View, copy and regenerate storage access keys.
3. Download and review the Hive query file (HQL) located at:
https://adftutorialfiles.blob.core.windows.net/hivetutorial/partitionweblogs.hql. This query transforms
input data to produce output data.
4. Download and review the sample input file (input.log) located at:
https://adftutorialfiles.blob.core.windows.net/hivetutorial/input.log
5. Create a blob container named adfgetstarted in your Azure Blob Storage.
6. Upload the partitionweblogs.hql file to the script folder in the adfgetstarted container. Use a tool such as
Microsoft Azure Storage Explorer, or the PowerShell sketch shown after this list.
7. Upload the input.log file to the inputdata folder in the adfgetstarted container.
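If you prefer to script the uploads instead of using Storage Explorer, here is a sketch that uses the Azure PowerShell storage cmdlets; replace the account name and key with your own values:

# Build a storage context and create the adfgetstarted container.
$ctx = New-AzureStorageContext -StorageAccountName "<storageaccountname>" `
    -StorageAccountKey "<storageaccountkey>"
New-AzureStorageContainer -Name "adfgetstarted" -Context $ctx

# Upload the Hive script and the sample input file to the folders the tutorial expects.
Set-AzureStorageBlobContent -File ".\partitionweblogs.hql" -Container "adfgetstarted" `
    -Blob "script/partitionweblogs.hql" -Context $ctx
Set-AzureStorageBlobContent -File ".\input.log" -Container "adfgetstarted" `
    -Blob "inputdata/input.log" -Context $ctx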
After you complete the prerequisites, select one of the following tools/SDKs to do the tutorial:
Azure portal
Visual Studio
PowerShell
Resource Manager template
REST API
The Azure portal and Visual Studio provide a GUI way of building your data factories, whereas the PowerShell,
Resource Manager template, and REST API options provide a scripting/programming way of building your
data factories.
NOTE
The data pipeline in this tutorial transforms input data to produce output data. It does not copy data from a source
data store to a destination data store. For a tutorial on how to copy data using Azure Data Factory, see Tutorial:
Copy data from Blob Storage to SQL Database.
You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input
dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.
Tutorial: Build your first Azure data factory using
Azure portal
8/21/2017 14 min to read Edit Online
In this article, you learn how to use Azure portal to create your first Azure data factory. To do the tutorial using
other tools/SDKs, select one of the options from the drop-down list.
The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure
HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a
month between the specified start and end times.
NOTE
The data pipeline in this tutorial transforms input data to produce output data. For a tutorial on how to copy data using
Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the
output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution
in Data Factory.
Prerequisites
1. Read through Tutorial Overview article and complete the prerequisite steps.
2. This article does not provide a conceptual overview of the Azure Data Factory service. We recommend that you
go through Introduction to Azure Data Factory article for a detailed overview of the service.
IMPORTANT
The name of the Azure data factory must be globally unique. If you receive the error: Data factory name
GetStartedDF is not available. Change the name of the data factory (for example, yournameGetStartedDF) and
try creating again. See Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
The name of the data factory may be registered as a DNS name in the future and hence become publically visible.
4. Select the Azure subscription where you want the data factory to be created.
5. Select existing resource group or create a resource group. For the tutorial, create a resource group named:
ADFGetStartedRG.
6. Select the location for the data factory. Only regions supported by the Data Factory service are shown in the
drop-down list.
7. Select Pin to dashboard.
8. Click Create on the New data factory blade.
IMPORTANT
To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.
9. On the dashboard, you see the following tile with status: Deploying data factory.
10. Congratulations! You have successfully created your first data factory. After the data factory has been
created successfully, you see the data factory page, which shows you the contents of the data factory.
Before creating a pipeline in the data factory, you need to create a few Data Factory entities first. You first create
linked services to link data stores/computes to your data store, define input and output datasets to represent
input/output data in linked data stores, and then create the pipeline with an activity that uses these datasets.
3. You should see the JSON script for creating an Azure Storage linked service in the editor.
4. Replace account name with the name of your Azure storage account and account key with the access key of
the Azure storage account. To learn how to get your storage access key, see the information about how to
view, copy, and regenerate storage access keys in Manage your storage account.
5. Click Deploy on the command bar to deploy the linked service.
After the linked service is deployed successfully, the Draft-1 window should disappear and you see
AzureStorageLinkedService in the tree view on the left.
Create Azure HDInsight linked service
In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically
created at runtime and deleted after it is done processing and idle for the specified amount of time.
1. In the Data Factory Editor, click ... More, click New compute, and select On-demand HDInsight
cluster.
2. Copy and paste the following snippet to the Draft-1 window. The JSON snippet describes the properties
that are used to create the HDInsight cluster on-demand.
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "AzureStorageLinkedService"
}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
timeToLive Specifies the idle time for the HDInsight cluster before it is deleted.
Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to
the AzureStorageLinkedService you have created earlier in this tutorial. The linked service points to an Azure
Storage account and datasets specify container, folder, file name in the storage that holds input and output data.
Create input dataset
1. In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob
storage.
2. Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you are creating a
dataset called AzureBlobInput that represents input data for an activity in the pipeline. In addition, you
specify that the input data is located in the blob container called adfgetstarted and the folder called
inputdata.
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
folderPath Specifies the blob container and the folder that contains the input blobs.
fileName This property is optional. If you omit this property, all the files from the folderPath are picked up. In this tutorial, only input.log is processed.
For more information about these JSON properties, see Azure Blob connector article.
3. Click Deploy on the command bar to deploy the newly created dataset. You should see the dataset in the tree
view on the left.
Create output dataset
Now, you create the output dataset to represent the output data stored in the Azure Blob storage.
1. In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob
storage.
2. Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you are creating a
dataset called AzureBlobOutput, and specifying the structure of the data that is produced by the Hive
script. In addition, you specify that the results are stored in the blob container called adfgetstarted and the
folder called partitioneddata. The availability section specifies that the output dataset is produced on a
monthly basis.
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adfgetstarted/partitioneddata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
}
See Create the input dataset section for descriptions of these properties. You do not set the external
property on an output dataset as the dataset is produced by the Data Factory service.
3. Click Deploy on the command bar to deploy the newly created dataset.
4. Verify that the dataset is created successfully.
Create pipeline
In this step, you create your first pipeline with a HDInsightHive activity. Input slice is available monthly
(frequency: Month, interval: 1), output slice is produced monthly, and the scheduler property for the activity is also
set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, output
dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any
output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the
following JSON are explained at the end of this section.
1. In the Data Factory Editor, click Ellipsis (...) More commands and then click New pipeline.
IMPORTANT
Replace storageaccountname with the name of your storage account in the JSON.
{
"name": "MyFirstPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
],
"start": "2017-07-01T00:00:00Z",
"end": "2017-07-02T00:00:00Z",
"isPaused": false
}
}
In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process
Data on an HDInsight cluster.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the
scriptLinkedService, called AzureStorageLinkedService), and in script folder in the container
adfgetstarted.
The defines section is used to specify the runtime settings that are passed to the Hive script as Hive
configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).
The start and end properties of the pipeline specify its active period.
In the activity JSON, you specify that the Hive script runs on the compute specified by the
linkedServiceName HDInsightOnDemandLinkedService.
NOTE
See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties used in the
example.
Monitor pipeline
Monitor pipeline using Diagram View
1. Click X to close Data Factory Editor blades and to navigate back to the Data Factory blade, and click
Diagram.
2. In the Diagram View, you see an overview of the pipelines, and datasets used in this tutorial.
3. To view all activities in the pipeline, right-click pipeline in the diagram and click Open Pipeline.
To navigate back to the previous view, click Data factory in the breadcrumb menu at the top.
5. In the Diagram View, double-click the dataset AzureBlobInput. Confirm that the slice is in Ready state. It
may take a couple of minutes for the slice to show up in Ready state. If that does not happen after you wait
for some time, check whether the input file (input.log) is placed in the right container (adfgetstarted) and
folder (inputdata).
6. Click X to close AzureBlobInput blade.
7. In the Diagram View, double-click the dataset AzureBlobOutput. You see the slice that is currently
being processed.
9. When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted container in your
blob storage for the output data.
10. Click the slice to see details about it in a Data slice blade.
11. Click an activity run in the Activity runs list to see details about an activity run (Hive activity in our
scenario) in an Activity run details window.
From the log files, you can see the Hive query that was executed and status information. These logs are
useful for troubleshooting any issues. See Monitor and manage pipelines using Azure portal blades article
for more details.
IMPORTANT
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial
again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.
3. Select an activity window in the Activity Windows list to see details about it.
Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop
cluster. You used the Data Factory Editor in the Azure portal to do the following steps:
1. Created an Azure data factory.
2. Created two linked services:
a. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data
factory.
b. Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the
data factory. Azure Data Factory creates a HDInsight Hadoop cluster just-in-time to process input data
and produce output data.
3. Created two datasets, which describe input and output data for HDInsight Hive activity in the pipeline.
4. Created a pipeline with a HDInsight Hive activity.
Next Steps
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script
on an on-demand HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure
SQL, see Tutorial: Copy data from an Azure blob to Azure SQL.
See Also
TOPIC DESCRIPTION
Scheduling and execution This article explains the scheduling and execution aspects of
Azure Data Factory application model.
Monitor and manage pipelines using Monitoring App This article describes how to monitor, manage, and debug
pipelines using the Monitoring & Management App.
Tutorial: Create a data factory by using Visual Studio
8/21/2017 22 min to read Edit Online
This tutorial shows you how to create an Azure data factory by using Visual Studio. You create a Visual Studio
project using the Data Factory project template, define Data Factory entities (linked services, datasets, and
pipeline) in JSON format, and then publish/deploy these entities to the cloud.
The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure
HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a
month between the specified start and end times.
NOTE
This tutorial does not show how to copy data by using Azure Data Factory. For a tutorial on how to copy data using Azure
Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the
output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution
in Data Factory.
3. Enter a name for the project, location, and a name for the solution, and click OK.
Create linked services
In this step, you create two linked services: Azure Storage and HDInsight on-demand.
The Azure Storage linked service links your Azure Storage account to the data factory by providing the connection
information. Data Factory service uses the connection string from the linked service setting to connect to the
Azure storage at runtime. This storage holds input and output data for the pipeline, and the hive script file used by
the hive activity.
With the on-demand HDInsight linked service, the HDInsight cluster is automatically created at runtime when the
input data is ready to be processed. The cluster is deleted after it is done processing and has been idle for the specified
amount of time.
amount of time.
NOTE
You create a data factory by specifying its name and settings at the time of publishing your Data Factory solution.
3. Replace <accountname> and <accountkey> with the name of your Azure storage account and its key. To learn
how to get your storage access key, see the information about how to view, copy, and regenerate storage
access keys in Manage your storage account.
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "AzureStorageLinkedService1"
}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
timeToLive Specifies the idle time for the HDInsight cluster before it is deleted.
IMPORTANT
The HDInsight cluster creates a default container in the blob storage you specified in the JSON
(linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by
design. With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice is processed
unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done.
As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for
troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these
containers follow a pattern: adf<yourdatafactoryname>-<linkedservicename>-datetimestamp . Use tools such as
Microsoft Storage Explorer to delete containers in your Azure blob storage.
For more information about JSON properties, see Compute linked services article.
4. Save the HDInsightOnDemandLinkedService1.json file.
Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to
the AzureStorageLinkedService1 linked service you created earlier in this tutorial. The linked service points to an Azure
Storage account, and the datasets specify the container, folder, and file name in that storage account that hold the input and output data.
Create input dataset
1. In the Solution Explorer, right-click Tables, point to Add, and click New Item.
2. Select Azure Blob from the list, change the name of the file to InputDataSet.json, and click Add.
3. Replace the JSON in the editor with the following JSON snippet:
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService1",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}
This JSON snippet defines a dataset called AzureBlobInput that represents input data for the hive activity
in the pipeline. You specify that the input data is located in the blob container called adfgetstarted and the
folder called inputdata .
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
fileName This property is optional. If you omit this property, all the
files from the folderPath are picked. In this case, only the
input.log is processed.
external This property is set to true if the input data for the
activity is not generated by the pipeline. This property is
only specified on input datasets. For the input dataset of
the first activity, always set it to true.
Create output dataset
Now create the output dataset. In the Solution Explorer, right-click Tables, point to Add, and click New Item.
Select Azure Blob from the list, change the name of the file to OutputDataset.json, click Add, and replace the
JSON in the editor with the following snippet:
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService1",
"typeProperties": {
"folderPath": "adfgetstarted/partitioneddata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
}
The JSON snippet defines a dataset called AzureBlobOutput that represents output data produced by the
hive activity in the pipeline. You specify that the output data produced by the hive activity is placed in the
blob container called adfgetstarted and the folder called partitioneddata .
The availability section specifies that the output dataset is produced on a monthly basis. The output
dataset drives the schedule of the pipeline. The pipeline runs monthly between its start and end times.
See Create the input dataset section for descriptions of these properties. You do not set the external
property on an output dataset as the dataset is produced by the pipeline.
4. Save the OutputDataset.json file.
Create pipeline
You have created the linked services and the input and output datasets so far. Now, you create a
pipeline with a HDInsightHive activity. The input for the hive activity is set to AzureBlobInput and the output is
set to AzureBlobOutput. A slice of the input dataset is available monthly (frequency: Month, interval: 1), and the
output slice is produced monthly too.
1. In the Solution Explorer, right-click Pipelines, point to Add, and click New Item.
2. Select Hive Transformation Pipeline from the list, and click Add.
3. Replace the JSON with the following snippet:
IMPORTANT
Replace <storageaccountname> with the name of your storage account.
{
"name": "MyFirstPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService1",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
],
"start": "2016-04-01T00:00:00Z",
"end": "2016-04-02T00:00:00Z",
"isPaused": false
}
}
The JSON snippet defines a pipeline that consists of a single activity (Hive Activity). This activity runs a Hive
script to process input data on an on-demand HDInsight cluster to produce output data. In the activities
section of the pipeline JSON, you see only one activity in the array with type set to HDInsightHive.
In the type properties that are specific to the HDInsight Hive activity, you specify which Azure Storage linked
service holds the hive script file, the path to the script file, and the parameters passed to the script file.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the
scriptLinkedService), in the script folder of the container adfgetstarted .
The defines section is used to specify the runtime settings that are passed to the hive script as Hive
configuration values (for example, ${hiveconf:inputtable} , ${hiveconf:partitionedtable} ).
The start and end properties of the pipeline specify the active period of the pipeline. You configured the
dataset to be produced monthly; therefore, only one slice is produced by the pipeline (because the start and
end dates fall in the same month).
In the activity JSON, you specify that the Hive script runs on the compute specified by the
linkedServiceName HDInsightOnDemandLinkedService.
4. Save the HiveActivity1.json file.
Add partitionweblogs.hql and input.log as a dependency
1. Right-click Dependencies in the Solution Explorer window, point to Add, and click Existing Item.
2. Navigate to the C:\ADFGettingStarted folder, select the partitionweblogs.hql and input.log files, and click Add. You
created these two files as part of the prerequisites in the Tutorial Overview.
When you publish the solution in the next step, the partitionweblogs.hql file is uploaded to the script folder in
the adfgetstarted blob container.
Publish/deploy Data Factory entities
In this step, you publish the Data Factory entities (linked services, datasets, and pipeline) in your project to the
Azure Data Factory service. In the process of publishing, you specify the name for your data factory.
1. Right-click project in the Solution Explorer, and click Publish.
2. If you see Sign in to your Microsoft account dialog box, enter your credentials for the account that has
Azure subscription, and click sign in.
3. You should see the publish dialog box.
4. In the Configure data factory page, specify a name for the new data factory and the subscription, resource
group, and region to use, and then click Next.
IMPORTANT
If you receive the error Data factory name DataFactoryUsingVS is not available when publishing,
change the name (for example, yournameDataFactoryUsingVS). See Data Factory - Naming Rules topic for
naming rules for Data Factory artifacts.
5. In the Publish Items page, ensure that all the Data Factories entities are selected, and click Next to switch
to the Summary page.
6. Review the summary and click Next to start the deployment process and view the Deployment Status.
7. In the Deployment Status page, you should see the status of the deployment process. Click Finish after the
deployment is done.
Important points to note:
If you receive the error: This subscription is not registered to use namespace Microsoft.DataFactory,
do one of the following and try publishing again:
In Azure PowerShell, run the following command to register the Data Factory provider.
Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory
You can run the following command to confirm that the Data Factory provider is registered.
Get-AzureRmResourceProvider
Log in to the Azure portal by using the Azure subscription, and navigate to a Data Factory blade (or)
create a data factory in the Azure portal. This action automatically registers the provider for you.
The name of the data factory may be registered as a DNS name in the future and hence become publicly
visible.
To create Data Factory instances, you need to be an admin or co-admin of the Azure subscription.
Monitor pipeline
In this step, you monitor the pipeline using Diagram View of the data factory.
Monitor pipeline using Diagram View
1. Log in to the Azure portal, and do the following steps:
a. Click More services and click Data factories.
b. Select the name of your data factory (for example: DataFactoryUsingVS09152016) from the list of
data factories.
4. To view all activities in the pipeline, right-click pipeline in the diagram and click Open Pipeline.
To navigate back to the previous view, click Data factory in the breadcrumb menu at the top.
6. In the Diagram View, double-click the dataset AzureBlobInput. Confirm that the slice is in Ready state. It
may take a couple of minutes for the slice to show up in Ready state. If it does not happen after you wait
for some time, check whether the input file (input.log) is placed in the right container ( adfgetstarted ) and
folder ( inputdata ), and make sure that the external property on the input dataset is set to true.
IMPORTANT
Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect
the pipeline to take approximately 30 minutes to process the slice.
10. When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted container in your
blob storage for the output data.
11. Click the slice to see details about it in a Data slice blade.
12. Click an activity run in the Activity runs list to see details about an activity run (Hive activity in our
scenario) in an Activity run details window.
From the log files, you can see the Hive query that was executed and status information. These logs are
useful for troubleshooting any issues.
See Monitor datasets and pipeline for instructions on how to use the Azure portal to monitor the pipeline and
datasets you have created in this tutorial.
Monitor pipeline using Monitor & Manage App
You can also use Monitor & Manage application to monitor your pipelines. For detailed information about using
this application, see Monitor and manage Azure Data Factory pipelines using Monitoring and Management App.
1. Click the Monitor & Manage tile.
2. You should see Monitor & Manage application. Change the Start time and End time to match start (04-
01-2016 12:00 AM) and end times (04-02-2016 12:00 AM) of your pipeline, and click Apply.
3. To see details about an activity window, select it in the Activity Windows list.
IMPORTANT
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the
tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.
Additional notes
A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a
Copy Activity to copy data from a source to a destination data store and a HDInsight Hive activity to run a Hive
script to transform input data. See supported data stores for all the sources and sinks supported by the Copy
Activity. See compute linked services for the list of compute services supported by Data Factory.
Linked services link data stores or compute services to an Azure data factory. See supported data stores for all
the sources and sinks supported by the Copy Activity. See compute linked services for the list of compute
services supported by Data Factory and transformation activities that can run on them.
See Move data from/to Azure Blob for details about JSON properties used in the Azure Storage linked service
definition.
You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See Compute
Linked Services for details.
The Data Factory creates a Linux-based HDInsight cluster for you with the preceding JSON. See On-demand
HDInsight Linked Service for details.
The HDInsight cluster creates a default container in the blob storage you specified in the JSON
(linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is
by design. With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice is
processed unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the
processing is done.
As more slices are processed, you see many containers in your Azure blob storage. If you do not need
them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names
of these containers follow a pattern: adf<yourdatafactoryname>-<linkedservicename>-datetimestamp . Use
tools such as Microsoft Storage Explorer to delete containers in your Azure blob storage.
Currently, output dataset is what drives the schedule, so you must create an output dataset even if the activity
does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset.
This tutorial does not show how to copy data by using Azure Data Factory. For a tutorial on how to copy data
using Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.
You can also right-click a data factory and select Export Data Factory to New Project to create a Visual
Studio project based on an existing data factory.
Update Data Factory tools for Visual Studio
To update Azure Data Factory tools for Visual Studio, do the following steps:
1. Click Tools on the menu and select Extensions and Updates.
2. Select Updates in the left pane and then select Visual Studio Gallery.
3. Select Azure Data Factory tools for Visual Studio and click Update. If you do not see this entry, you
already have the latest version of the tools.
Use configuration files
You can use configuration files in Visual Studio to configure properties for linked services, datasets, and pipelines
differently for each environment. Consider the following JSON definition for an Azure Storage linked service:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"description": "",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"$schema":
"http://datafactories.schema.management.azure.com/vsschemas/V1/Microsoft.DataFactory.Config.json",
"AzureStorageLinkedService1": [
{
"name": "$.properties.typeProperties.connectionString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
],
"AzureSqlLinkedService1": [
{
"name": "$.properties.typeProperties.connectionString",
"value": "Server=tcp:spsqlserver.database.windows.net,1433;Database=spsqldb;User
ID=spelluru;Password=Sowmya123;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
]
}
This example configures the connectionString property of an Azure Storage linked service and an Azure SQL
linked service. Notice that the syntax for specifying name is JsonPath.
If the JSON has a property that holds an array of values, as shown in the following code:
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
Configure properties as shown in the following configuration file (use zero-based indexing):
{
"name": "$.properties.structure[0].name",
"value": "FirstName"
}
{
"name": "$.properties.structure[0].type",
"value": "String"
}
{
"name": "$.properties.structure[1].name",
"value": "LastName"
}
{
"name": "$.properties.structure[1].type",
"value": "String"
}
{
"name": "$.properties.activities[1].typeProperties.webServiceParameters.['Database server name']",
"value": "MyAsqlServer.database.windows.net"
}
4. Select the configuration file that you would like to use and click Next.
5. Confirm that you see the name of JSON file in the Summary page and click Next.
6. Click Finish after the deployment operation is finished.
When you deploy, the values from the configuration file are used to set values for properties in the JSON files
before the entities are deployed to Azure Data Factory service.
Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop
cluster. You used Visual Studio to do the following steps:
1. Created an Azure data factory.
2. Created two linked services:
a. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data
factory.
b. Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the
data factory. Azure Data Factory creates a HDInsight Hadoop cluster just-in-time to process input data
and produce output data.
3. Created two datasets, which describe input and output data for HDInsight Hive activity in the pipeline.
4. Created a pipeline with a HDInsight Hive activity.
Next Steps
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script
on an on-demand HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure
SQL, see Tutorial: Copy data from an Azure blob to Azure SQL.
You can chain two activities (run one activity after another) by setting the output dataset of one activity as the
input dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.
See Also
TOPIC DESCRIPTION
Data Transformation Activities This article provides a list of data transformation activities
(such as HDInsight Hive transformation you used in this
tutorial) supported by Azure Data Factory.
TOPIC DESCRIPTION
Scheduling and execution This article explains the scheduling and execution aspects of
Azure Data Factory application model.
Monitor and manage pipelines using Monitoring App This article describes how to monitor, manage, and debug
pipelines using the Monitoring & Management App.
Tutorial: Build your first Azure data factory using
Azure PowerShell
8/21/2017 14 min to read Edit Online
In this article, you use Azure PowerShell to create your first Azure data factory. To do the tutorial using other
tools/SDKs, select one of the options from the drop-down list.
The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure
HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a
month between the specified start and end times.
NOTE
The data pipeline in this tutorial transforms input data to produce output data. It does not copy data from a source data
store to a destination data store. For a tutorial on how to copy data using Azure Data Factory, see Tutorial: Copy data from
Blob Storage to SQL Database.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the
output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution
in Data Factory.
Prerequisites
Read through Tutorial Overview article and complete the prerequisite steps.
Follow the instructions in the How to install and configure Azure PowerShell article to install the latest version of Azure
PowerShell on your computer.
(optional) This article does not cover all the Data Factory cmdlets. See Data Factory Cmdlet Reference for
comprehensive documentation on Data Factory cmdlets.
2. Create an Azure resource group named ADFTutorialResourceGroup by running the following command:
New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"
Some of the steps in this tutorial assume that you use the resource group named
ADFTutorialResourceGroup. If you use a different resource group, you need to use it in place of
ADFTutorialResourceGroup in this tutorial.
3. Run the New-AzureRmDataFactory cmdlet that creates a data factory named FirstDataFactoryPSH.
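The cmdlet invocation is not shown in this copy of the article. A minimal sketch, assuming the resource group created in the previous step and the West US region:
# Creates the data factory; the name must be globally unique, so change it if it is already taken.
New-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH -Location "West US"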
If you receive an error that the subscription is not registered to use namespace Microsoft.DataFactory,
register the provider by running Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory.
You can run the following command to confirm that the Data Factory provider is registered:
Get-AzureRmResourceProvider
Log in to the Azure portal by using the Azure subscription, and navigate to a Data Factory blade (or) create
a data factory in the Azure portal. This action automatically registers the provider for you.
Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to
link data stores/computes to your data factory, define input and output datasets to represent input/output data in
the linked data stores, and then create the pipeline with an activity that uses these datasets.
Replace account name with the name of your Azure storage account and account key with the access
key of the Azure storage account. To learn how to get your storage access key, see the information about
how to view, copy, and regenerate storage access keys in Manage your storage account.
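The StorageLinkedService.json definition itself is not reproduced in this copy. A minimal sketch, assuming the file is named StorageLinkedService.json and is saved in the C:\ADFGetStarted folder (<accountname> and <accountkey> are placeholders):
{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}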
2. In Azure PowerShell, switch to the ADFGetStarted folder.
3. You can use the New-AzureRmDataFactoryLinkedService cmdlet that creates a linked service. This
cmdlet and other Data Factory cmdlets you use in this tutorial requires you to pass values for the
ResourceGroupName and DataFactoryName parameters. Alternatively, you can use Get-
AzureRmDataFactory to get a DataFactory object and pass the object without typing
ResourceGroupName and DataFactoryName each time you run a cmdlet. Run the following command to
assign the output of the Get-AzureRmDataFactory cmdlet to a $df variable.
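The commands themselves are not shown in this copy. A sketch, assuming the resource group and data factory names used earlier in this tutorial:
# Get the data factory object once and reuse it with later cmdlets.
$df = Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH
# Deploy the Azure Storage linked service from the JSON file.
New-AzureRmDataFactoryLinkedService $df -File .\StorageLinkedService.json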
If you hadn't run the Get-AzureRmDataFactory cmdlet and assigned the output to the $df variable, you
would have to specify values for the ResourceGroupName and DataFactoryName parameters as follows.
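A sketch of the equivalent call with explicit parameters (same assumed names as above):
New-AzureRmDataFactoryLinkedService -ResourceGroupName ADFTutorialResourceGroup -DataFactoryName FirstDataFactoryPSH -File .\StorageLinkedService.json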
If you close Azure PowerShell in the middle of the tutorial, you have to run the Get-AzureRmDataFactory
cmdlet next time you start Azure PowerShell to complete the tutorial.
Create Azure HDInsight linked service
In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically
created at runtime and deleted after it is done processing and idle for the specified amount of time. You could use
your own HDInsight cluster instead of using an on-demand HDInsight cluster. See Compute Linked Services for
details.
1. Create a JSON file named HDInsightOnDemandLinkedService.json in the C:\ADFGetStarted folder
with the following content.
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "StorageLinkedService"
}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
TimeToLive Specifies the idle time for the HDInsight cluster,
before it is deleted.
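The deployment command for this linked service is not shown in this copy. A sketch, assuming the $df variable obtained earlier:
# Deploy the on-demand HDInsight linked service from the JSON file.
New-AzureRmDataFactoryLinkedService $df -File .\HDInsightOnDemandLinkedService.json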
Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to
the StorageLinkedService you have created earlier in this tutorial. The linked service points to an Azure Storage
account, and the datasets specify the container, folder, and file name in that storage account that hold the input and output data.
Create input dataset
1. Create a JSON file named InputTable.json in the C:\ADFGetStarted folder with the following content:
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}
The JSON defines a dataset named AzureBlobInput, which represents input data for an activity in the
pipeline. In addition, it specifies that the input data is located in the blob container called adfgetstarted
and the folder called inputdata.
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
fileName This property is optional. If you omit this property, all the
files from the folderPath are picked. In this case, only the
input.log is processed.
2. Run the following command in Azure PowerShell to create the Data Factory dataset:
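The command is not shown in this copy. A sketch, assuming the $df variable and the InputTable.json file created above:
# Create the input dataset in the data factory.
New-AzureRmDataFactoryDataset $df -File .\InputTable.json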
Create output dataset
1. Create a JSON file (for example, OutputTable.json) in the C:\ADFGetStarted folder with the following content:
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "adfgetstarted/partitioneddata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
}
The JSON defines a dataset named AzureBlobOutput, which represents output data for an activity in the
pipeline. In addition, it specifies that the results are stored in the blob container called adfgetstarted and
the folder called partitioneddata. The availability section specifies that the output dataset is produced
on a monthly basis.
2. Run the following command in Azure PowerShell to create the Data Factory dataset:
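The command is not shown in this copy either. A sketch, assuming the output dataset JSON was saved as OutputTable.json:
# Create the output dataset in the data factory.
New-AzureRmDataFactoryDataset $df -File .\OutputTable.json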
Create pipeline
In this step, you create your first pipeline with a HDInsightHive activity. Input slice is available monthly
(frequency: Month, interval: 1), output slice is produced monthly, and the scheduler property for the activity is also
set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, output
dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any
output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the
following JSON are explained at the end of this section.
1. Create a JSON file named MyFirstPipelinePSH.json in the C:\ADFGetStarted folder with the following
content:
IMPORTANT
Replace storageaccountname with the name of your storage account in the JSON.
{
"name": "MyFirstPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "StorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
],
"start": "2017-07-01T00:00:00Z",
"end": "2017-07-02T00:00:00Z",
"isPaused": false
}
}
In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process
data on an HDInsight cluster.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the
scriptLinkedService, called StorageLinkedService), in the script folder of the container adfgetstarted.
The defines section is used to specify the runtime settings that are passed to the hive script as Hive
configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).
The start and end properties of the pipeline specify the active period of the pipeline.
In the activity JSON, you specify that the Hive script runs on the compute specified by the
linkedServiceName HDInsightOnDemandLinkedService.
NOTE
See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties that are used
in the example.
2. Confirm that you see the input.log file in the adfgetstarted/inputdata folder in the Azure blob storage,
and run the following command to deploy the pipeline. Since the start and end times are set in the past
and isPaused is set to false, the pipeline (activity in the pipeline) runs immediately after you deploy.
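The deployment command is not shown in this copy. A sketch, assuming the $df variable and the MyFirstPipelinePSH.json file created above:
# Deploy the pipeline; it starts running immediately because the active period is in the past and isPaused is false.
New-AzureRmDataFactoryPipeline $df -File .\MyFirstPipelinePSH.json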
3. Congratulations, you have successfully created your first pipeline using Azure PowerShell!
Monitor pipeline
In this step, you use Azure PowerShell to monitor what's going on in an Azure data factory.
1. Run Get-AzureRmDataFactory and assign the output to a $df variable.
2. Run Get-AzureRmDataFactorySlice to get details about all slices of AzureBlobOutput, which is the
output dataset of the pipeline.
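These two commands are not shown in this copy. A sketch, assuming the names used in this tutorial and the pipeline start time of 2017-07-01:
# Get the data factory object, then list the slices of the output dataset starting at the pipeline start time.
$df = Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH
Get-AzureRmDataFactorySlice $df -DatasetName AzureBlobOutput -StartDateTime 2017-07-01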
Notice that the StartDateTime you specify here is the same start time specified in the pipeline JSON. Here is
the sample output:
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : FirstDataFactoryPSH
DatasetName : AzureBlobOutput
Start : 7/1/2017 12:00:00 AM
End : 7/2/2017 12:00:00 AM
RetryCount : 0
State : InProgress
SubState :
LatencyStatus :
LongRetryCount : 0
3. Run Get-AzureRmDataFactoryRun to get the details of activity runs for a specific slice.
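A sketch of this call, under the same assumptions as above:
# List the activity runs for the slice that starts at the pipeline start time.
Get-AzureRmDataFactoryRun $df -DatasetName AzureBlobOutput -StartDateTime 2017-07-01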
You can keep running this cmdlet until you see the slice in Ready state or Failed state. When the slice is in
Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the
output data. Creation of an on-demand HDInsight cluster usually takes some time.
IMPORTANT
Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the
pipeline to take approximately 30 minutes to process the slice.
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial
again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.
Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop
cluster. You used Azure PowerShell to do the following steps:
1. Created an Azure data factory.
2. Created two linked services:
a. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data
factory.
b. Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the
data factory. Azure Data Factory creates a HDInsight Hadoop cluster just-in-time to process input data
and produce output data.
3. Created two datasets, which describe input and output data for HDInsight Hive activity in the pipeline.
4. Created a pipeline with a HDInsight Hive activity.
Next steps
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script
on an on-demand Azure HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to
Azure SQL, see Tutorial: Copy data from an Azure Blob to Azure SQL.
See Also
TOPIC DESCRIPTION
Data Factory Cmdlet Reference See comprehensive documentation on Data Factory cmdlets
Scheduling and Execution This article explains the scheduling and execution aspects of
Azure Data Factory application model.
Monitor and manage pipelines using Monitoring App This article describes how to monitor, manage, and debug
pipelines using the Monitoring & Management App.
Tutorial: Build your first Azure data factory using
Azure Resource Manager template
7/21/2017 12 min to read Edit Online
In this article, you use an Azure Resource Manager template to create your first Azure data factory. To do the
tutorial using other tools/SDKs, select one of the options from the drop-down list.
The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure
HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a
month between the specified start and end times.
NOTE
The data pipeline in this tutorial transforms input data to produce output data. For a tutorial on how to copy data using
Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.
The pipeline in this tutorial has only one activity of type: HDInsightHive. A pipeline can have more than one activity. And,
you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input
dataset of the other activity. For more information, see scheduling and execution in Data Factory.
Prerequisites
Read through Tutorial Overview article and complete the prerequisite steps.
Follow the instructions in the How to install and configure Azure PowerShell article to install the latest version of Azure
PowerShell on your computer.
See Authoring Azure Resource Manager Templates to learn about Azure Resource Manager templates.
In this tutorial
ENTITY DESCRIPTION
Azure Storage linked service Links your Azure Storage account to the data factory. The
Azure Storage account holds the input and output data for
the pipeline in this sample.
HDInsight on-demand linked service Links an on-demand HDInsight cluster to the data factory.
The cluster is automatically created for you to process data
and is deleted after the processing is done.
Azure Blob input dataset Refers to the Azure Storage linked service. The linked service
refers to an Azure Storage account and the Azure Blob
dataset specifies the container, folder, and file name in the
storage that holds the input data.
Azure Blob output dataset Refers to the Azure Storage linked service. The linked service
refers to an Azure Storage account and the Azure Blob
dataset specifies the container, folder, and file name in the
storage that holds the output data.
ENTITY DESCRIPTION
Data pipeline The pipeline has one activity of type HDInsightHive, which
consumes the input dataset and produces the output
dataset.
A data factory can have one or more pipelines. A pipeline can have one or more activities in it. There are two types
of activities: data movement activities and data transformation activities. In this tutorial, you create a pipeline with
one activity (Hive activity).
The following section provides the complete Resource Manager template for defining Data Factory entities so that
you can quickly run through the tutorial and test the template. To understand how each Data Factory entity is
defined, see Data Factory entities in the template section.
{
"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": { ...
},
"variables": { ...
},
"resources": [
{
"name": "[parameters('dataFactoryName')]",
"apiVersion": "[variables('apiVersion')]",
"type": "Microsoft.DataFactory/datafactories",
"location": "westus",
"resources": [
{ ... },
{ ... },
{ ... },
{ ... }
]
}
]
}
Create a JSON file named ADFTutorialARM.json in the C:\ADFGetStarted folder with the following content:
{
"contentVersion": "1.0.0.0",
"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {
"storageAccountName": { "type": "string", "metadata": { "description": "Name of the Azure storage
account that contains the input/output data." } },
"storageAccountKey": { "type": "securestring", "metadata": { "description": "Key for the Azure
storage account." } },
"blobContainer": { "type": "string", "metadata": { "description": "Name of the blob container in
the Azure Storage account." } },
"inputBlobFolder": { "type": "string", "metadata": { "description": "The folder in the blob
container that has the input file." } },
"inputBlobName": { "type": "string", "metadata": { "description": "Name of the input file/blob." }
},
"outputBlobFolder": { "type": "string", "metadata": { "description": "The folder in the blob
container that will hold the transformed data." } },
"hiveScriptFolder": { "type": "string", "metadata": { "description": "The folder in the blob
container that contains the Hive query file." } },
"hiveScriptFile": { "type": "string", "metadata": { "description": "Name of the hive query (HQL)
file." } }
},
"variables": {
"dataFactoryName": "[concat('HiveTransformDF', uniqueString(resourceGroup().id))]",
"azureStorageLinkedServiceName": "AzureStorageLinkedService",
"hdInsightOnDemandLinkedServiceName": "HDInsightOnDemandLinkedService",
"blobInputDatasetName": "AzureBlobInput",
"blobOutputDatasetName": "AzureBlobOutput",
"pipelineName": "HiveTransformPipeline"
},
"resources": [
{
"name": "[variables('dataFactoryName')]",
"apiVersion": "2015-10-01",
"type": "Microsoft.DataFactory/datafactories",
"location": "West US",
"resources": [
{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',paramet
ers('storageAccountKey'))]"
}
}
},
{
"type": "linkedservices",
"name": "[variables('hdInsightOnDemandLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]"
}
}
},
{
"type": "datasets",
"name": "[variables('blobInputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
"typeProperties": {
"fileName": "[parameters('inputBlobName')]",
"folderPath": "[concat(parameters('blobContainer'), '/',
parameters('inputBlobFolder'))]",
"format": {
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true
}
},
{
"type": "datasets",
"name": "[variables('blobOutputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/',
parameters('outputBlobFolder'))]",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
},
{
"type": "datapipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('hdInsightOnDemandLinkedServiceName')]",
"[variables('blobInputDatasetName')]",
"[variables('blobOutputDatasetName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"description": "Pipeline that transforms data using Hive script.",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "[concat(parameters('blobContainer'), '/',
parameters('hiveScriptFolder'), '/', parameters('hiveScriptFile'))]",
"scriptLinkedService": "[variables('azureStorageLinkedServiceName')]",
"defines": {
"inputtable": "[concat('wasb://', parameters('blobContainer'), '@',
parameters('storageAccountName'), '.blob.core.windows.net/', parameters('inputBlobFolder'))]",
"partitionedtable": "[concat('wasb://', parameters('blobContainer'), '@',
parameters('storageAccountName'), '.blob.core.windows.net/', parameters('outputBlobFolder'))]"
}
},
"inputs": [
{
"name": "[variables('blobInputDatasetName')]"
}
],
"outputs": [
"outputs": [
{
"name": "[variables('blobOutputDatasetName')]"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "[variables('hdInsightOnDemandLinkedServiceName')]"
}
],
"start": "2017-07-01T00:00:00Z",
"end": "2017-07-02T00:00:00Z",
"isPaused": false
}
}
]
}
]
}
NOTE
You can find another example of Resource Manager template for creating an Azure data factory on Tutorial: Create a
pipeline with Copy Activity using an Azure Resource Manager template.
Parameters JSON
Create a JSON file named ADFTutorialARM-Parameters.json that contains parameters for the Azure Resource
Manager template.
IMPORTANT
Specify the name and key of your Azure Storage account for the storageAccountName and storageAccountKey
parameters in this parameter file.
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"storageAccountName": {
"value": "<Name of your Azure Storage account>"
},
"storageAccountKey": {
"value": "<Key of your Azure Storage account>"
},
"blobContainer": {
"value": "adfgetstarted"
},
"inputBlobFolder": {
"value": "inputdata"
},
"inputBlobName": {
"value": "input.log"
},
"outputBlobFolder": {
"value": "partitioneddata"
},
"hiveScriptFolder": {
"value": "script"
},
"hiveScriptFile": {
"value": "partitionweblogs.hql"
}
}
}
IMPORTANT
You may have separate parameter JSON files for development, testing, and production environments that you can use with
the same Data Factory JSON template. By using a PowerShell script, you can automate deploying Data Factory entities in
these environments.
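The deployment command itself does not appear in this copy of the article. A sketch of how the template and parameter file can be deployed with Azure PowerShell, assuming a resource group named ADFTutorialResourceGroup and the file paths used above:
# Deploy the Data Factory entities defined in the template, using values from the parameter file.
New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFGetStarted\ADFTutorialARM.json -TemplateParameterFile C:\ADFGetStarted\ADFTutorialARM-Parameters.json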
Monitor pipeline
1. After logging in to the Azure portal, click Browse and select Data factories.
2. In the Data Factories blade, click the data factory created by the template deployment (its name starts with
HiveTransformDF, as set by the dataFactoryName variable in the template).
3. In the Data Factory blade for your data factory, click Diagram.
4. In the Diagram View, you see an overview of the pipelines, and datasets used in this tutorial.
5. In the Diagram View, double-click the dataset AzureBlobOutput. You see the slice that is currently
being processed.
6. When processing is done, you see the slice in Ready state. Creation of an on-demand HDInsight cluster
usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take approximately
30 minutes to process the slice.
7. When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted container in your
blob storage for the output data.
See Monitor datasets and pipeline for instructions on how to use the Azure portal blades to monitor the pipeline
and datasets you have created in this tutorial.
You can also use Monitor and Manage App to monitor your data pipelines. See Monitor and manage Azure Data
Factory pipelines using Monitoring App for details about using the application.
IMPORTANT
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the
tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.
"resources": [
{
"name": "[variables('dataFactoryName')]",
"apiVersion": "2015-10-01",
"type": "Microsoft.DataFactory/datafactories",
"location": "West US"
}
{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',paramet
ers('storageAccountKey'))]"
}
}
}
The connectionString uses the storageAccountName and storageAccountKey parameters. The values for these
parameters are passed by using a parameter file. The definition also uses the azureStorageLinkedServiceName
and dataFactoryName variables defined in the template.
HDInsight on-demand linked service
See Compute linked services article for details about JSON properties used to define an HDInsight on-demand
linked service.
{
"type": "linkedservices",
"name": "[variables('hdInsightOnDemandLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]"
}
}
}
Azure Blob input dataset
You specify the names of the blob container, the folder, and the file that hold the input data. See Azure Blob dataset
properties for details about JSON properties used to define an Azure Blob dataset. This definition uses the following
parameters defined in the parameter template: blobContainer, inputBlobFolder, and inputBlobName.
Azure Blob output dataset
You specify the names of the blob container and folder that hold the output data. See Azure Blob dataset properties
for details about JSON properties used to define an Azure Blob dataset.
{
"type": "datasets",
"name": "[variables('blobOutputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'))]",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
}
This definition uses the following parameters defined in the parameter template: blobContainer and
outputBlobFolder.
Data pipeline
You define a pipeline that transform data by running Hive script on an on-demand Azure HDInsight cluster. See
Pipeline JSON for descriptions of JSON elements used to define a pipeline in this example.
{
"type": "datapipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('hdInsightOnDemandLinkedServiceName')]",
"[variables('blobInputDatasetName')]",
"[variables('blobOutputDatasetName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"description": "Pipeline that transforms data using Hive script.",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "[concat(parameters('blobContainer'), '/', parameters('hiveScriptFolder'), '/',
parameters('hiveScriptFile'))]",
"scriptLinkedService": "[variables('azureStorageLinkedServiceName')]",
"defines": {
"inputtable": "[concat('wasb://', parameters('blobContainer'), '@',
parameters('storageAccountName'), '.blob.core.windows.net/', parameters('inputBlobFolder'))]",
"partitionedtable": "[concat('wasb://', parameters('blobContainer'), '@',
parameters('storageAccountName'), '.blob.core.windows.net/', parameters('outputBlobFolder'))]"
}
},
"inputs": [
{
"name": "[variables('blobInputDatasetName')]"
}
],
"outputs": [
{
"name": "[variables('blobOutputDatasetName')]"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "[variables('hdInsightOnDemandLinkedServiceName')]"
}
],
"start": "2017-07-01T00:00:00Z",
"end": "2017-07-02T00:00:00Z",
"isPaused": false
}
}
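The deployment commands referenced in the next paragraph are not included in this copy. A sketch of what they might look like, assuming one parameter file per environment (the deployment names and parameter file names are assumptions):
# Deploy the same template with a different parameter file for each environment.
New-AzureRmResourceGroupDeployment -Name DeployDev -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFGetStarted\ADFTutorialARM.json -TemplateParameterFile C:\ADFGetStarted\ADFTutorialARM-Parameters-Dev.json
New-AzureRmResourceGroupDeployment -Name DeployTest -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFGetStarted\ADFTutorialARM.json -TemplateParameterFile C:\ADFGetStarted\ADFTutorialARM-Parameters-Test.json
New-AzureRmResourceGroupDeployment -Name DeployProd -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFGetStarted\ADFTutorialARM.json -TemplateParameterFile C:\ADFGetStarted\ADFTutorialARM-Parameters-Prod.json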
Notice that the first command uses the parameter file for the development environment, the second one for the test
environment, and the third one for the production environment.
You can also reuse the template to perform repeated tasks. For example, suppose you need to create many data factories
with one or more pipelines that implement the same logic, but each data factory uses different Azure Storage and
Azure SQL Database accounts. In this scenario, you use the same template in the same environment (dev, test, or
production) with different parameter files to create the data factories.
{
"contentVersion": "1.0.0.0",
"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {
},
"variables": {
"dataFactoryName": "GatewayUsingArmDF",
"apiVersion": "2015-10-01",
"singleQuote": "'"
},
"resources": [
{
"name": "[variables('dataFactoryName')]",
"apiVersion": "[variables('apiVersion')]",
"type": "Microsoft.DataFactory/datafactories",
"location": "eastus",
"resources": [
{
"dependsOn": [ "[concat('Microsoft.DataFactory/dataFactories/',
variables('dataFactoryName'))]" ],
"type": "gateways",
"apiVersion": "[variables('apiVersion')]",
"name": "GatewayUsingARM",
"properties": {
"description": "my gateway"
}
}
]
}
]
}
This template creates a data factory named GatewayUsingArmDF with a gateway named: GatewayUsingARM.
See Also
TOPIC DESCRIPTION
Scheduling and execution This article explains the scheduling and execution aspects of
Azure Data Factory application model.
Monitor and manage pipelines using Monitoring App This article describes how to monitor, manage, and debug
pipelines using the Monitoring & Management App.
Tutorial: Build your first Azure data factory using
Data Factory REST API
8/21/2017 14 min to read Edit Online
In this article, you use Data Factory REST API to create your first Azure data factory. To do the tutorial using other
tools/SDKs, select one of the options from the drop-down list.
The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure
HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a
month between the specified start and end times.
NOTE
This article does not cover all the REST API. For comprehensive documentation on REST API, see Data Factory REST API
Reference.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the
output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution
in Data Factory.
Prerequisites
Read through Tutorial Overview article and complete the prerequisite steps.
Install curl on your machine. You use the curl tool with REST commands to create a data factory.
Follow instructions from this article to:
1. Create a Web application named ADFGetStartedApp in Azure Active Directory.
2. Get client ID and secret key.
3. Get tenant ID.
4. Assign the ADFGetStartedApp application to the Data Factory Contributor role.
Install Azure PowerShell.
Launch PowerShell and run the following command. Keep Azure PowerShell open until the end of this
tutorial. If you close and reopen, you need to run the commands again.
1. Run Login-AzureRmAccount and enter the user name and password that you use to sign in to the
Azure portal.
2. Run Get-AzureRmSubscription to view all the subscriptions for this account.
3. Run Get-AzureRmSubscription -SubscriptionName NameOfAzureSubscription | Set-
AzureRmContext to select the subscription that you want to work with. Replace
NameOfAzureSubscription with the name of your Azure subscription.
Create an Azure resource group named ADFTutorialResourceGroup by running the following command
in PowerShell:
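The command is the same one used in the PowerShell tutorial earlier in this document:
New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"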
Some of the steps in this tutorial assume that you use the resource group named
ADFTutorialResourceGroup. If you use a different resource group, you need to use the name of your
resource group in place of ADFTutorialResourceGroup in this tutorial.
Create JSON definitions
Create the following JSON files in the folder where curl.exe is located.
datafactory.json
IMPORTANT
The data factory name must be globally unique, so you may want to prefix/suffix FirstDataFactoryREST to make it a unique name.
{
"name": "FirstDataFactoryREST",
"location": "WestUS"
}
azurestoragelinkedservice.json
IMPORTANT
Replace accountname and accountkey with name and key of your Azure storage account. To learn how to get your
storage access key, see the information about how to view, copy, and regenerate storage access keys in Manage your
storage account.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
hdinsightondemandlinkedservice.json
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "AzureStorageLinkedService"
}
}
}
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
TimeToLive Specifies the idle time for the HDInsight cluster before it
is deleted.
linkedServiceName Specifies the storage account that is used to store the logs
that are generated by HDInsight.
inputdataset.json
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}
The JSON defines a dataset named AzureBlobInput, which represents input data for an activity in the pipeline. In
addition, it specifies that the input data is located in the blob container called adfgetstarted and the folder called
inputdata.
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION
fileName This property is optional. If you omit this property, all the files
from the folderPath are picked. In this case, only the input.log
is processed.
columnDelimiter Columns in the log files are delimited by a comma character (,).
external This property is set to true if the input data is not generated
by the Data Factory service.
outputdataset.json
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adfgetstarted/partitioneddata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
}
The JSON defines a dataset named AzureBlobOutput, which represents output data for an activity in the
pipeline. In addition, it specifies that the results are stored in the blob container called adfgetstarted and the
folder called partitioneddata. The availability section specifies that the output dataset is produced on a
monthly basis.
pipeline.json
IMPORTANT
Replace <storageaccountname> with the name of your Azure storage account.
{
"name": "MyFirstPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<stroageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<stroageaccountname>t.blob.core.windows.net/partitioneddata"
}
},
"inputs": [{
"name": "AzureBlobInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}],
"start": "2017-07-10T00:00:00Z",
"end": "2017-07-11T00:00:00Z",
"isPaused": false
}
}
In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process data on
an HDInsight cluster.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the
scriptLinkedService, called AzureStorageLinkedService), in the script folder of the container adfgetstarted.
The defines section specifies runtime settings that are passed to the hive script as Hive configuration values (for
example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).
The start and end properties of the pipeline specify the active period of the pipeline.
In the activity JSON, you specify that the Hive script runs on the compute specified by the linkedServiceName
HDInsightOnDemandLinkedService.
NOTE
See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties used in the preceding
example.
$rg = "ADFTutorialResourceGroup"
$adf = "FirstDataFactoryREST"
(ConvertFrom-Json $responseToken)
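The intermediate steps (composing the curl command and invoking it) are abbreviated in this copy. A rough sketch of what the data factory creation call could look like, following the curl-plus-PowerShell pattern of this tutorial; the <subscription-id> placeholder, the $accessToken variable holding the AAD bearer token, and the api-version value are assumptions to verify against the Data Factory REST API Reference:
# Send a PUT request with datafactory.json as the payload to create the data factory.
$cmd = { .\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@datafactory.json" "https://management.azure.com/subscriptions/<subscription-id>/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf?api-version=2015-10-01" }
$results = Invoke-Command -ScriptBlock $cmd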
3. View the results. If the data factory has been successfully created, you see the JSON for the data factory in
the results; otherwise, you see an error message.
Write-Host $results
If you receive an error that the subscription is not registered to use namespace Microsoft.DataFactory,
register the provider by running Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory.
You can run the following command to confirm that the Data Factory provider is registered:
Get-AzureRmResourceProvider
Log in to the Azure portal by using the Azure subscription, and navigate to a Data Factory blade (or)
create a data factory in the Azure portal. This action automatically registers the provider for you.
Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to
link data stores/computes to your data factory, and define input and output datasets to represent data in the linked data
stores.
3. View the results. If the linked service has been successfully created, you see the JSON for the linked service
in the results; otherwise, you see an error message.
Write-Host $results
Create Azure HDInsight linked service
In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically
created at runtime and deleted after it is done processing and idle for the specified amount of time. You could use
your own HDInsight cluster instead of using an on-demand HDInsight cluster. See Compute Linked Services for
details.
1. Assign the command to a variable named cmd.
3. View the results. If the linked service has been successfully created, you see the JSON for the linked service
in the results; otherwise, you see an error message.
Write-Host $results
Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to
the AzureStorageLinkedService you have created earlier in this tutorial. The linked service points to an Azure Storage
account, and the datasets specify the container, folder, and file name in that storage account that hold the input and output data.
Create input dataset
In this step, you create the input dataset to represent input data stored in the Azure Blob storage.
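For reference, the input dataset created in this step typically looks like the following sketch (the adfgetstarted/inputdata folder and the monthly availability follow this walkthrough; treat the exact values as illustrative):
{
    "name": "AzureBlobInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "fileName": "input.log",
            "folderPath": "adfgetstarted/inputdata",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "availability": {
            "frequency": "Month",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}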
1. Assign the command to a variable named cmd.
3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the
results; otherwise, you see an error message.
Write-Host $results
3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the
results; otherwise, you see an error message.
Write-Host $results
Create pipeline
In this step, you create your first pipeline with an HDInsightHive activity. The input slice is available monthly
(frequency: Month, interval: 1), the output slice is produced monthly, and the scheduler property for the activity is
also set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, the output
dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce
any output. If the activity doesn't take any input, you can skip creating the input dataset.
Confirm that you see the input.log file in the adfgetstarted/inputdata folder in the Azure blob storage, and
run the following command to deploy the pipeline. Since the start and end times are set in the past and
isPaused is set to false, the pipeline (activity in the pipeline) runs immediately after you deploy.
1. Assign the command to a variable named cmd.
3. View the results. If the pipeline has been successfully created, you see the JSON for the pipeline in the
results; otherwise, you see an error message.
Write-Host $results
4. Congratulations, you have successfully created your first pipeline using Azure PowerShell!
Monitor pipeline
In this step, you use Data Factory REST API to monitor slices being produced by the pipeline.
$ds ="AzureBlobOutput"
IMPORTANT
Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the
pipeline to take approximately 30 minutes to process the slice.
Run the Invoke-Command and the next one until you see the slice in Ready state or Failed state. When the slice
is in Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the
output data. The creation of an on-demand HDInsight cluster usually takes some time.
IMPORTANT
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the
tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.
You can also use the Azure portal to monitor slices and troubleshoot any issues. See Monitor pipelines using Azure
portal for details.
Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop
cluster. You used the Data Factory Editor in the Azure portal to do the following steps:
1. Created an Azure data factory.
2. Created two linked services:
a. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data
factory.
b. Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the
data factory. Azure Data Factory creates an HDInsight Hadoop cluster just-in-time to process input data
and produce output data.
3. Created two datasets, which describe the input and output data for the HDInsight Hive activity in the pipeline.
4. Created a pipeline with an HDInsight Hive activity.
Next steps
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script
on an on-demand Azure HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to
Azure SQL, see Tutorial: Copy data from an Azure Blob to Azure SQL.
See Also
Data Factory REST API Reference: See comprehensive documentation on Data Factory cmdlets.
Scheduling and Execution: This article explains the scheduling and execution aspects of the Azure Data Factory application model.
Monitor and manage pipelines using Monitoring App: This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App.
Move data between on-premises sources and the
cloud with Data Management Gateway
8/21/2017 15 min to read Edit Online
This article provides an overview of data integration between on-premises data stores and cloud data stores using
Data Factory. It builds on the Data Movement Activities article and other data factory core concepts articles: datasets
and pipelines.
IMPORTANT
See Data Management Gateway article for details about Data Management Gateway.
The following walkthrough shows you how to create a data factory with a pipeline that moves data from an on-
premises SQL Server database to an Azure blob storage. As part of the walkthrough, you install and configure the
Data Management Gateway on your machine.
4. Select the Azure subscription where you want the data factory to be created.
5. Select an existing resource group or create a resource group. For the tutorial, create a resource group named
ADFTutorialResourceGroup.
6. Click Create on the New data factory page.
IMPORTANT
To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.
7. After creation is complete, you see the Data Factory page as shown in the following image:
Create gateway
1. In the Data Factory page, click Author and deploy tile to launch the Editor for the data factory.
2. In the Data Factory Editor, click ... More on the toolbar and then click New data gateway. Alternatively, you
can right-click Data Gateways in the tree view, and click New data gateway.
3. In the Create page, enter adftutorialgateway for the name, and click OK.
NOTE
In this walkthrough, you create the logical gateway with only one node (on-premises Windows machine). You can
scale out a data management gateway by associating multiple on-premises machines with the gateway. You can scale
up by increasing the number of data movement jobs that can run concurrently on a node. This feature is also available for
a logical gateway with a single node. See Scaling data management gateway in Azure Data Factory article for details.
4. In the Configure page, click Install directly on this computer. This action downloads the installation
package for the gateway, installs, configures, and registers the gateway on the computer.
NOTE
Use Internet Explorer or a Microsoft ClickOnce compatible web browser.
If you are using Chrome, go to the Chrome web store, search with "ClickOnce" keyword, choose one of the ClickOnce
extensions, and install it.
Do the same for Firefox (install add-in). Click Open Menu button on the toolbar (three horizontal lines in the top-
right corner), click Add-ons, search with "ClickOnce" keyword, choose one of the ClickOnce extensions, and install it.
This is the easiest way (one click) to download, install, configure, and register the gateway in a single
step. You can see that the Microsoft Data Management Gateway Configuration Manager application is
installed on your computer. You can also find the executable ConfigManager.exe in the folder: C:\Program
Files\Microsoft Data Management Gateway\2.0\Shared.
You can also download and install the gateway manually by using the links on this page and register it by using the
key shown in the NEW KEY text box.
See Data Management Gateway article for all the details about the gateway.
NOTE
You must be an administrator on the local computer to install and configure the Data Management Gateway
successfully. You can add additional users to the Data Management Gateway Users local Windows group. The
members of this group can use the Data Management Gateway Configuration Manager tool to configure the
gateway.
5. Wait for a couple of minutes or wait until you see the following notification message:
6. Launch Data Management Gateway Configuration Manager application on your computer. In the
Search window, type Data Management Gateway to access this utility. You can also find the executable
ConfigManager.exe in the folder: C:\Program Files\Microsoft Data Management
Gateway\2.0\Shared
7. Confirm that you see the adftutorialgateway is connected to the cloud service message. The status bar at the
bottom displays Connected to the cloud service along with a green check mark.
On the Home tab, you can also do the following operations:
Register a gateway with a key from the Azure portal by using the Register button.
Stop the Data Management Gateway Host Service running on your gateway machine.
Schedule updates to be installed at a specific time of the day.
View when the gateway was last updated.
Specify time at which an update to the gateway can be installed.
8. Switch to the Settings tab. The certificate specified in the Certificate section is used to encrypt/decrypt
credentials for the on-premises data store that you specify on the portal. (optional) Click Change to use your
own certificate instead. By default, the gateway uses the certificate that is auto-generated by the Data Factory
service.
d. In the Setting Credentials dialog box, specify authentication type, user name, and password, and
click OK. If the connection is successful, the encrypted credentials are stored in the JSON and the
dialog box closes.
e. Close the empty browser tab that launched the dialog box if it is not automatically closed and
get back to the tab with the Azure portal.
On the gateway machine, these credentials are encrypted by using a certificate that the Data
Factory service owns. If you want to use the certificate that is associated with the Data
Management Gateway instead, see Set credentials securely.
c. Click Deploy on the command bar to deploy the SQL Server linked service. You should see the linked
service in the tree view.
Create datasets
In this step, you create input and output datasets that represent the input and output data for the copy operation (on-
premises SQL Server database => Azure blob storage). Before creating datasets, do the following steps (detailed
steps follow the list):
Create a table named emp in the SQL Server Database you added as a linked service to the data factory and
insert a couple of sample entries into the table.
Create a blob container named adftutorial in the Azure blob storage account you added as a linked service to
the data factory.
Prepare On-premises SQL Server for the tutorial
1. In the database you specified for the on-premises SQL Server linked service (SqlServerLinkedService), use
the following SQL script to create the emp table in the database.
{
"name": "OutputBlobTable",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adftutorial/outfromonpremdf",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
See Move data to/from Azure Blob Storage for details about JSON properties.
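For reference, the corresponding input dataset that points to the emp table typically looks like the following sketch (the SqlServerTable type, the SqlServerLinkedService name, and the hourly availability are assumed from this walkthrough; treat the exact values as illustrative):
{
    "name": "EmpOnPremSQLTable",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": "SqlServerLinkedService",
        "typeProperties": {
            "tableName": "emp"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}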
3. Click Deploy on the command bar to deploy the dataset. Confirm that you see both the datasets in the tree
view.
Create pipeline
In this step, you create a pipeline with one Copy Activity that uses EmpOnPremSQLTable as input and
OutputBlobTable as output.
1. In Data Factory Editor, click ... More, and click New pipeline.
2. Replace the JSON in the right pane with the following text:
{
"name": "ADFTutorialPipelineOnPrem",
"properties": {
"description": "This pipeline has one Copy activity that copies data from an on-prem SQL to Azure
blob",
"activities": [
{
"name": "CopyFromSQLtoBlob",
"description": "Copy data from on-prem SQL server to blob",
"type": "Copy",
"inputs": [
{
"name": "EmpOnPremSQLTable"
}
],
"outputs": [
{
"name": "OutputBlobTable"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from emp"
},
"sink": {
"type": "BlobSink"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2016-07-05T00:00:00Z",
"end": "2016-07-06T00:00:00Z",
"isPaused": false
}
}
IMPORTANT
Replace the value of the start property with the current day and end value with the next day.
You can zoom in, zoom out, zoom to 100%, zoom to fit, automatically position pipelines and datasets, and
show lineage information (highlights upstream and downstream items of selected items). You can double-
click an object (input/output dataset or pipeline) to see properties for it.
Monitor pipeline
In this step, you use the Azure portal to monitor what's going on in an Azure data factory. You can also use
PowerShell cmdlets to monitor datasets and pipelines. For details about monitoring, see Monitor and Manage
Pipelines.
1. In the diagram, double-click EmpOnPremSQLTable.
2. Notice that all the data slices are in the Ready state because the pipeline duration (start time to end time) is in
the past. It is also because you have inserted the data in the SQL Server database, and it is there all the time.
Confirm that no slices show up in the Problem slices section at the bottom. To view all the slices, click See
More at the bottom of the list of slices.
3. Now, in the Datasets page, click OutputBlobTable.
4. Click any data slice from the list and you should see the Data Slice page. You see activity runs for the slice.
Usually, you see only one activity run.
If the slice is not in the Ready state, you can see the upstream slices that are not Ready and are blocking the
current slice from executing in the Upstream slices that are not ready list.
5. Click the activity run from the list at the bottom to see activity run details.
You would see information such as throughput, duration, and the gateway used to transfer the data.
6. Click X to close all the pages until you get back to the home page for the ADFTutorialOnPremDF.
7. (optional) Click Pipelines, click ADFTutorialOnPremDF, and drill through input tables (Consumed) or output
datasets (Produced).
8. Use tools such as Microsoft Storage Explorer to verify that a blob/file is created for each hour.
Next steps
See Data Management Gateway article for all the details about the Data Management Gateway.
See Copy data from Azure Blob to Azure SQL to learn about how to use Copy Activity to move data from a
source data store to a sink data store.
Azure Data Factory - Frequently Asked Questions
8/15/2017 22 min to read Edit Online
General questions
What is Azure Data Factory?
Data Factory is a cloud-based data integration service that automates the movement and transformation of
data. Just like a factory that runs equipment to take raw materials and transform them into finished goods, Data
Factory orchestrates existing services that collect raw data and transform it into ready-to-use information.
Data Factory allows you to create data-driven workflows to move data between both on-premises and cloud data
stores as well as process/transform data using compute services such as Azure HDInsight and Azure Data Lake
Analytics. After you create a pipeline that performs the action that you need, you can schedule it to run periodically
(hourly, daily, weekly etc.).
For more information, see Overview & Key Concepts.
Where can I find pricing details for Azure Data Factory?
See the Data Factory Pricing Details page for pricing details for Azure Data Factory.
How do I get started with Azure Data Factory?
For an overview of Azure Data Factory, see Introduction to Azure Data Factory.
For a tutorial on how to copy/move data using Copy Activity, see Copy data from Azure Blob Storage to Azure
SQL Database.
For a tutorial on how to transform data using HDInsight Hive Activity, see Process data by running a Hive script
on a Hadoop cluster.
What is Data Factory's region availability?
Data Factory is available in US West and North Europe. The compute and storage services used by data factories
can be in other regions. See Supported regions.
What are the limits on number of data factories/pipelines/activities/datasets?
See Azure Data Factory Limits section of the Azure Subscription and Service Limits, Quotas, and Constraints
article.
What is the authoring/developer experience with Azure Data Factory service?
You can author/create data factories using one of the following tools/SDKs:
Azure portal The Data Factory blades in the Azure portal provide a rich user interface for you to create data
factories and linked services. The Data Factory Editor, which is also part of the portal, allows you to easily create
linked services, tables, data sets, and pipelines by specifying JSON definitions for these artifacts. See Build your
first data pipeline using Azure portal for an example of using the portal/editor to create and deploy a data
factory.
Visual Studio You can use Visual Studio to create an Azure data factory. See Build your first data pipeline using
Visual Studio for details.
Azure PowerShell See Create and monitor Azure Data Factory using Azure PowerShell for a
tutorial/walkthrough for creating a data factory using PowerShell. See Data Factory Cmdlet Reference content
on MSDN Library for a comprehensive documentation of Data Factory cmdlets.
.NET Class Library You can programmatically create data factories by using Data Factory .NET SDK. See Create,
monitor, and manage data factories using .NET SDK for a walkthrough of creating a data factory using .NET SDK.
See Data Factory Class Library Reference for a comprehensive documentation of Data Factory .NET SDK.
REST API You can also use the REST API exposed by the Azure Data Factory service to create and deploy data
factories. See Data Factory REST API Reference for a comprehensive documentation of Data Factory REST API.
Azure Resource Manager Template See Tutorial: Build your first Azure data factory using Azure Resource
Manager template for details.
Can I rename a data factory?
No. Like other Azure resources, the name of an Azure data factory cannot be changed.
Can I move a data factory from one Azure subscription to another?
Yes. Use the Move button on your data factory blade as shown in the following diagram:
COMPUTE ENVIRONMENT: ACTIVITIES
On-demand HDInsight cluster or your own HDInsight cluster: DotNet, Hive, Pig, MapReduce, Hadoop Streaming
Azure Machine Learning: Machine Learning activities (Batch Execution and Update Resource)
Azure SQL, Azure SQL Data Warehouse, SQL Server: Stored Procedure
How does Azure Data Factory compare with SQL Server Integration Services (SSIS)?
See the Azure Data Factory vs. SSIS presentation from one of our MVPs (Most Valued Professionals): Reza Rad.
Some of the recent changes in Data Factory may not be listed in the slide deck. We are continuously adding more
capabilities to Azure Data Factory. We will incorporate these updates into the comparison of data integration
technologies from Microsoft sometime later this year.
Activities - FAQ
What are the different types of activities you can use in a Data Factory pipeline?
Data Movement Activities to move data.
Data Transformation Activities to process/transform data.
When does an activity run?
The availability configuration setting in the output dataset determines when the activity runs. If input datasets
are specified, the activity checks whether all the input data dependencies are satisfied (that is, in the Ready state)
before it starts running.
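For example, an output dataset with the following availability section (a minimal illustration) causes the activity to run once an hour:
"availability": {
    "frequency": "Hour",
    "interval": 1
}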
Azure: Azure Cosmos DB (DocumentDB API)
Databases: DB2*, MySQL*, Oracle*, PostgreSQL*, SAP HANA*, SQL Server*, Sybase*, Teradata*
NoSQL: Cassandra*, MongoDB*
File: Amazon S3, File System*, FTP, HDFS*, SFTP
Others: Generic OData, Generic ODBC*, Salesforce, GE Historian*
NOTE
Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-
premises/Azure IaaS machine.
TextFormat example
The following sample shows some of the format properties for TextFormat.
"typeProperties":
{
"folderPath": "mycontainer/myfolder",
"fileName": "myblobname",
"format":
{
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": ";",
"quoteChar": "\"",
"NullValue": "NaN",
"firstRowAsHeader": true,
"skipLineCount": 0,
"treatEmptyAsNull": true
}
},
To use an escapeChar instead of quoteChar, replace the quoteChar line with the following escapeChar:
"escapeChar": "$",
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}
{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":"56
7834760","switch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":"78
9037573","switch1":"US","switch2":"UK"}
{"time":"2015-04-
29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":"34
5626404","switch1":"Germany","switch2":"UK"}
[
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
},
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
},
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}
]
JsonFormat example
Case 1: Copying data from JSON files
The following two samples show how to copy data from JSON files, along with the generic points to note:
Sample 1: extract data from object and array
In this sample, you expect one root JSON object to map to a single record in the tabular result. If you have a JSON file with
the following content:
{
"id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
"context": {
"device": {
"type": "PC"
},
"custom": {
"dimensions": [
{
"TargetResourceType": "Microsoft.Compute/virtualMachines"
},
{
"ResourceManagmentProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3"
},
{
"OccurrenceTime": "1/13/2017 11:24:37 AM"
}
]
}
}
}
and you want to copy it into an Azure SQL table in the following format, by extracting data from both objects and
array:
ID    DEVICETYPE    TARGETRESOURCETYPE    RESOURCEMANAGMENTPROCESSRUNID    OCCURRENCETIME
The input dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts). More
specifically:
The structure section defines the customized column names and the corresponding data types while converting to
tabular data. This section is optional unless you need to do column mapping. See the Specifying structure
definition for rectangular datasets section for more details.
jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. To copy
data from an array, you can use array[x].property to extract the value of the given property from the xth object, or
you can use array[*].property to find the value from any object containing such a property.
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "deviceType",
"type": "String"
},
{
"name": "targetResourceType",
"type": "String"
},
{
"name": "resourceManagmentProcessRunId",
"type": "String"
},
{
"name": "occurrenceTime",
"type": "DateTime"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type", "targetResourceType":
"$.context.custom.dimensions[0].TargetResourceType", "resourceManagmentProcessRunId":
"$.context.custom.dimensions[1].ResourceManagmentProcessRunId", "occurrenceTime": "
$.context.custom.dimensions[2].OccurrenceTime"}
}
}
}
Sample 2: cross apply multiple objects with the same pattern from array
In this sample, you expect to transform one root JSON object into multiple records in the tabular result. If you have a
JSON file with the following content:
{
"ordernumber": "01",
"orderdate": "20170122",
"orderlines": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "sanmateo": "No 1" } ]
}
and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array and
cross joining it with the common root info:
ORDERNUMBER ORDERDATE ORDER_PD ORDER_PRICE CITY
The input dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts). More
specifically:
structure section defines the customized column names and the corresponding data type while converting to
tabular data. This section is optional unless you need to do column mapping. See Specifying structure
definition for rectangular datasets section for more details.
jsonNodeReference indicates to iterate and extract data from the objects with the same pattern under array
orderlines.
jsonPathDefinition specifies the JSON path for each column indicating where to extract the data from. In this
example, "ordernumber", "orderdate" and "city" are under root object with JSON path starting with "$.", while
"order_pd" and "order_price" are defined with path derived from the array element without "$.".
"properties": {
"structure": [
{
"name": "ordernumber",
"type": "String"
},
{
"name": "orderdate",
"type": "String"
},
{
"name": "order_pd",
"type": "String"
},
{
"name": "order_price",
"type": "Int64"
},
{
"name": "city",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonNodeReference": "$.orderlines",
"jsonPathDefinition": {"ordernumber": "$.ordernumber", "orderdate": "$.orderdate", "order_pd":
"prod", "order_price": "price", "city": " $.city"}
}
}
}
Case 2: Copying data to JSON file
For each record in the source, you expect to write a JSON object in the following format:
{
"id": "1",
"order": {
"date": "20170119",
"price": 2000,
"customer": "David"
}
}
The output dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts).
More specifically, the structure section defines the customized property names in the destination file, and
nestingSeparator (default is ".") is used to identify the nesting layer from the name. This section is optional unless
you want to change the property names compared with the source column names, or nest some of the properties.
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "order.date",
"type": "String"
},
{
"name": "order.price",
"type": "Int64"
},
{
"name": "order.customer",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat"
}
}
}
Specifying AvroFormat
If you want to parse the Avro files or write the data in Avro format, set the format type property to AvroFormat.
You do not need to specify any properties in the Format section within the typeProperties section. Example:
"format":
{
"type": "AvroFormat",
}
To use Avro format in a Hive table, you can refer to Apache Hive's tutorial.
Note the following points:
Complex data types are not supported (records, enums, arrays, maps, unions, and fixed).
Specifying OrcFormat
If you want to parse the ORC files or write the data in ORC format, set the format type property to OrcFormat.
You do not need to specify any properties in the Format section within the typeProperties section. Example:
"format":
{
"type": "OrcFormat"
}
IMPORTANT
If you are not copying ORC files as-is between on-premises and cloud data stores, you need to install the JRE 8 (Java
Runtime Environment) on your gateway machine. A 64-bit gateway requires 64-bit JRE and 32-bit gateway requires 32-bit
JRE. You can find both versions from here. Choose the appropriate one.
"format":
{
"type": "ParquetFormat"
}
IMPORTANT
If you are not copying Parquet files as-is between on-premises and cloud data stores, you need to install the JRE 8 (Java
Runtime Environment) on your gateway machine. A 64-bit gateway requires 64-bit JRE and 32-bit gateway requires 32-bit
JRE. You can find both versions from here. Choose the appropriate one.
{
"name": "MyHDInsightOnDemandLinkedService",
"properties":
{
"type": "HDInsightOnDemandLinkedService",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "LinkedService-SampleData",
"additionalLinkedServiceNames": [ "otherLinkedServiceName1", "otherLinkedServiceName2" ]
}
}
}
In the example above, otherLinkedServiceName1 and otherLinkedServiceName2 represent linked services whose
definitions contain credentials that the HDInsight cluster needs to access alternate storage accounts.
Slices - FAQ
Why are my input slices not in Ready state?
A common mistake is not setting the external property to true on the input dataset when the input data is external to
the data factory (that is, not produced by the data factory).
In the following example, you only need to set external to true on dataset1.
DataFactory1
Pipeline 1: dataset1 -> activity1 -> dataset2 -> activity2 -> dataset3
Pipeline 2: dataset3 -> activity3 -> dataset4
If you have another data factory with a pipeline that takes dataset4 (produced by pipeline 2 in data factory 1), mark
dataset4 as an external dataset because the dataset is produced by a different data factory (DataFactory1, not
DataFactory2).
DataFactory2
Pipeline 1: dataset4 -> activity4 -> dataset5
If the external property is properly set, verify whether the input data exists in the location specified in the input
dataset definition.
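As a minimal sketch (the dataset name, store type, and folder path below are placeholders), marking an input dataset as external looks like this:
{
    "name": "dataset1",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/inputfolder"
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}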
How do I run a slice at a time other than midnight when the slice is produced daily?
Use the offset property to specify the time at which you want the slice to be produced. See the Dataset availability
section for details about this property. Here is a quick example:
"availability":
{
"frequency": "Day",
"interval": 1,
"offset": "06:00:00"
}
Overview
In Azure Data Factory, you can use Copy Activity to copy data between on-premises and cloud data
stores. After the data is copied, it can be further transformed and analyzed. You can also use Copy
Activity to publish transformation and analysis results for business intelligence (BI) and application
consumption.
Copy Activity is powered by a secure, reliable, scalable, and globally available service. This article
provides details on data movement in Data Factory and Copy Activity.
First, let's see how data migration occurs between two cloud data stores, and between an on-
premises data store and a cloud data store.
NOTE
To learn about activities in general, see Understanding pipelines and activities.
Copy data between an on-premises data store and a cloud data store
To securely move data between an on-premises data store and a cloud data store, install Data
Management Gateway on your on-premises machine. Data Management Gateway is an agent that
enables hybrid data movement and processing. You can install it on the same machine as the data
store itself, or on a separate machine that has access to the data store.
In this scenario, Data Management Gateway performs the serialization/deserialization,
compression/decompression, column mapping, and type conversion. Data does not flow through the
Azure Data Factory service. Instead, Data Management Gateway directly writes the data to the
destination store.
See Move data between on-premises and cloud data stores for an introduction and walkthrough. See
Data Management Gateway for detailed information about this agent.
You can also move data from/to supported data stores that are hosted on Azure IaaS virtual
machines (VMs) by using Data Management Gateway. In this case, you can install Data Management
Gateway on the same VM as the data store itself, or on a separate VM that has access to the data
store.
NOTE
If you need to move data to/from a data store that Copy Activity doesn't support, use a custom activity in
Data Factory with your own logic for copying/moving data. For details on creating and using a custom
activity, see Use custom activities in an Azure Data Factory pipeline.
Azure: Azure Cosmos DB (DocumentDB API)
Databases: DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*, SQL Server*, Sybase*, Teradata*
NoSQL: Cassandra*, MongoDB*
File: Amazon S3, File System*, FTP, HDFS*, SFTP
Others: Generic OData, Generic ODBC*, Salesforce, GE Historian*
NOTE
Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management
Gateway on an on-premises/Azure IaaS machine.
REGION OF THE DESTINATION DATA STORE => REGION USED FOR DATA MOVEMENT
East US 2 => East US 2
Central US => Central US
West US => West US
West US 2 => West US
UK South => UK South
Alternatively, you can explicitly indicate the region of the Data Factory service to be used to perform the
copy by specifying the executionLocation property under the Copy Activity typeProperties. Supported
values for this property are listed in the Region used for data movement column above. Note that your
data goes through that region over the wire during the copy. For example, to copy between Azure stores
in Korea, you can specify "executionLocation": "Japan East" to route through the Japan region (see the
sample JSON as reference).
NOTE
If the region of the destination data store is not in the preceding list or is undetectable, Copy Activity by default
fails instead of going through an alternative region, unless executionLocation is specified. The supported
region list will be expanded over time.
Copy data between an on-premises data store and a cloud data store
When data is being copied between on-premises (or Azure virtual machines/IaaS) and cloud stores,
Data Management Gateway performs data movement on an on-premises machine or virtual
machine. The data does not flow through the service in the cloud, unless you use the staged copy
capability. In this case, data flows through the staging Azure Blob storage before it is written into the
sink data store.
{
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from Azure blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputBlobTable"
}
],
"outputs": [
{
"name": "OutputSQLTable"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink"
},
"executionLocation": "Japan East"
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2016-07-12T00:00:00Z",
"end": "2016-07-13T00:00:00Z"
}
}
The schedule that is defined in the output dataset determines when the activity runs (for example:
daily, frequency as day, and interval as 1). The activity copies data from an input dataset (source) to
an output dataset (sink).
You can specify more than one input dataset to Copy Activity. They are used to verify the
dependencies before the activity is run. However, only the data from the first dataset is copied to the
destination dataset. For more information, see Scheduling and execution.
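As a minimal sketch (the dataset names are placeholders), a Copy Activity with two inputs copies data only from the first one; the second is used only as a scheduling dependency:
"inputs": [
    { "name": "PrimaryInputDataset" },
    { "name": "DependencyOnlyDataset" }
],
"outputs": [
    { "name": "OutputDataset" }
]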
Fault tolerance
By default, Copy Activity stops copying data and returns a failure when it encounters data that is incompatible
between the source and the sink. You can explicitly configure it to skip and log the incompatible rows and copy
only the compatible data so that the copy succeeds. See Copy Activity fault tolerance for more details.
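A minimal sketch of that configuration, assuming the enableSkipIncompatibleRow and redirectIncompatibleRowSettings properties described in the fault tolerance article (the linked service name and path below are placeholders):
"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlSink" },
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": "AzureStorageLinkedService",
        "path": "redirectcontainer/erroroutput"
    }
}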
Security considerations
See Security considerations, which describes the security infrastructure that data movement services
in Azure Data Factory use to secure your data.
Type conversions
Different data stores have different native type systems. Copy Activity performs automatic type
conversions from source types to sink types with the following two-step approach:
1. Convert from native source types to a .NET type.
2. Convert from a .NET type to a native sink type.
The mapping from a native type system to a .NET type for a data store is in the respective data store
article. (Click the specific link in the Supported data stores table). You can use these mappings to
determine appropriate types while creating your tables, so that Copy Activity performs the right
conversions.
Next steps
To learn more about Copy Activity, see Copy data from Azure Blob storage to Azure SQL
Database.
To learn about moving data from an on-premises data store to a cloud data store, see Move data
from on-premises to cloud data stores.
Azure Data Factory Copy Wizard
8/15/2017 4 min to read Edit Online
The Azure Data Factory Copy Wizard eases the process of ingesting data, which is usually a first step in an end-to-
end data integration scenario. When going through the Azure Data Factory Copy Wizard, you do not need to
understand any JSON definitions for linked services, data sets, and pipelines. The wizard automatically creates a
pipeline to copy data from the selected data source to the selected destination. In addition, the Copy Wizard helps
you to validate the data being ingested at the time of authoring. This saves time, especially when you are ingesting
data for the first time from the data source. To start the Copy Wizard, click the Copy data tile on the home page of
your data factory.
The wizard is designed with big data in mind from the start, with support for diverse data and object types. You can
author Data Factory pipelines that move hundreds of folders, files, or tables. The wizard supports automatic data
preview, schema capture and mapping, and data filtering.
TIP
When copying data from SQL Server or Azure SQL Database into Azure SQL Data Warehouse, if the table does not exist in
the destination store, Data Factory supports automatic table creation by using the source's schema. Learn more from Move
data to and from Azure SQL Data Warehouse using Azure Data Factory.
Use a drop-down list to select a column from the source schema to map to a column in the destination schema. The
Copy Wizard tries to understand your pattern for column mapping. It applies the same pattern to the rest of the
columns, so that you do not need to select each of the columns individually to complete the schema mapping. If
you prefer, you can override these mappings by using the drop-down lists to map the columns one by one. The
pattern becomes more accurate as you map more columns. The Copy Wizard constantly updates the pattern, and
ultimately reaches the right pattern for the column mapping you want to achieve.
Filtering data
You can filter source data to select only the data that needs to be copied to the sink data store. Filtering reduces the
volume of the data to be copied to the sink data store and therefore enhances the throughput of the copy
operation. It provides a flexible way to filter data in a relational database by using the SQL query language, or files
in an Azure blob folder by using Data Factory functions and variables.
Filtering of data in a database
The following screenshot shows a SQL query that uses the Text.Format function and the WindowStart variable.
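As a rough illustration of that pattern (the table and column names are placeholders, not from the wizard), the source query that the wizard generates typically uses Text.Format with the WindowStart and WindowEnd variables:
"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
}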
Filtering of data in an Azure blob folder
An Azure blob folder can contain a time-partitioned folder structure, for example:
2016/03/01/01
2016/03/01/02
2016/03/01/03
...
Click the Browse button for File or folder, browse to one of these folders (for example, 2016->03->01->02), and
click Choose. You should see 2016/03/01/02 in the text box. Now, replace 2016 with {year}, 03 with {month}, 01
with {day}, and 02 with {hour}, and press the Tab key. You should see drop-down lists to select the format for
these four variables:
As shown in the following screenshot, you can also use a custom variable and any supported format strings. To
select a folder with that structure, use the Browse button first. Then replace a value with {custom}, and press the
Tab key to see the text box where you can type the format string.
Scheduling options
You can run the copy operation once or on a schedule (hourly, daily, and so on). Both of these options can be used
for the breadth of the connectors across environments, including on-premises, cloud, and local desktop copy.
A one-time copy operation enables data movement from a source to a destination only once. It applies to data of
any size and any supported format. The scheduled copy allows you to copy data on a prescribed recurrence. You
can use rich settings (like retry, timeout, and alerts) to configure the scheduled copy.
Next steps
For a quick walkthrough of using the Data Factory Copy Wizard to create a pipeline with Copy Activity, see Tutorial:
Create a pipeline using the Copy Wizard.
Load 1 TB into Azure SQL Data Warehouse under 15
minutes with Data Factory
8/22/2017 7 min to read Edit Online
Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data,
both relational and non-relational. Built on massively parallel processing (MPP) architecture, SQL Data Warehouse
is optimized for enterprise data warehouse workloads. It offers cloud elasticity with the flexibility to scale storage
and compute independently.
Getting started with Azure SQL Data Warehouse is now easier than ever using Azure Data Factory. Azure Data
Factory is a fully managed cloud-based data integration service that can be used to populate a SQL Data
Warehouse with the data from your existing system, saving you valuable time while evaluating SQL Data
Warehouse and building your analytics solutions. Here are the key benefits of loading data into Azure SQL Data
Warehouse using Azure Data Factory:
Easy to set up: 5-step intuitive wizard with no scripting required.
Rich data store support: built-in support for a rich set of on-premises and cloud-based data stores.
Secure and compliant: data is transferred over HTTPS or ExpressRoute, and global service presence ensures
your data never leaves the geographical boundary.
Unparalleled performance by using PolyBase: Using PolyBase is the most efficient way to move data into
Azure SQL Data Warehouse. With the staging blob feature, you can achieve high load speeds from all types of
data stores besides Azure Blob storage, which PolyBase supports by default.
This article shows you how to use Data Factory Copy Wizard to load 1-TB data from Azure Blob Storage into Azure
SQL Data Warehouse in under 15 minutes, at over 1.2 GBps throughput.
This article provides step-by-step instructions for moving data into Azure SQL Data Warehouse by using the Copy
Wizard.
NOTE
For general information about capabilities of Data Factory in moving data to/from Azure SQL Data Warehouse, see Move
data to and from Azure SQL Data Warehouse using Azure Data Factory article.
You can also build pipelines using Azure portal, Visual Studio, PowerShell, etc. See Tutorial: Copy data from Azure Blob to
Azure SQL Database for a quick walkthrough with step-by-step instructions for using the Copy Activity in Azure Data
Factory.
Prerequisites
Azure Blob Storage: this experiment uses Azure Blob Storage (GRS) for storing TPC-H testing dataset. If you do
not have an Azure storage account, learn how to create a storage account.
TPC-H data: we are going to use TPC-H as the testing dataset. To do that, you need to use dbgen from TPC-
H toolkit, which helps you generate the dataset. You can either download source code for dbgen from TPC
Tools and compile it yourself, or download the compiled binary from GitHub. Run dbgen.exe with the
following commands to generate 1 TB flat file for lineitem table spread across 10 files:
Dbgen -s 1000 -S 1 -C 10 -T L -v
Dbgen -s 1000 -S 2 -C 10 -T L -v
...
Dbgen -s 1000 -S 10 -C 10 -T L -v
Now copy the generated files to Azure Blob. Refer to Move data to and from an on-premises file
system by using Azure Data Factory for how to do that using ADF Copy.
Azure SQL Data Warehouse: this experiment loads data into Azure SQL Data Warehouse created with 6,000
DWUs
Refer to Create an Azure SQL Data Warehouse for detailed instructions on how to create a SQL Data
Warehouse database. To get the best possible load performance into SQL Data Warehouse using Polybase,
we choose maximum number of Data Warehouse Units (DWUs) allowed in the Performance setting, which
is 6,000 DWUs.
NOTE
When loading from Azure Blob, the data loading performance is directly proportional to the number of DWUs you
configure on the SQL Data Warehouse:
Loading 1 TB into a 1,000 DWU SQL Data Warehouse takes 87 minutes (~200 MBps throughput).
Loading 1 TB into a 2,000 DWU SQL Data Warehouse takes 46 minutes (~380 MBps throughput).
Loading 1 TB into a 6,000 DWU SQL Data Warehouse takes 14 minutes (~1.2 GBps throughput).
To create a SQL Data Warehouse with 6,000 DWUs, move the Performance slider all the way to the right:
For an existing database that is not configured with 6,000 DWUs, you can scale it up using Azure portal.
Navigate to the database in Azure portal, and there is a Scale button in the Overview panel shown in the
following image:
Click the Scale button to open the following panel, move the slider to the maximum value, and click Save
button.
This experiment loads data into Azure SQL Data Warehouse using the xlargerc resource class.
To achieve the best possible throughput, the copy needs to be performed by using a SQL Data Warehouse user
belonging to the xlargerc resource class. Learn how to do that by following the Change a user resource class
example.
Create the destination table schema (the TPC-H lineitem table) in the Azure SQL Data Warehouse database by
running the corresponding DDL statement.
With the prerequisite steps completed, we are now ready to configure the copy activity using the Copy
Wizard.
5. On the Data Factory home page, click the Copy data tile to launch Copy Wizard.
NOTE
If you see that the web browser is stuck at "Authorizing...", disable/uncheck Block third party cookies and site
data setting (or) keep it enabled and create an exception for login.microsoftonline.com and then try launching the
wizard again.
2. Fill in the connection information for the Azure Blob storage account, and click Next.
3. Choose the folder containing the TPC-H line item files and click Next.
4. Upon clicking Next, the file format settings are detected automatically. Check to make sure that column
delimiter is | instead of the default comma ,. Click Next after you have previewed the data.
Step 3: Configure destination
This section shows you how to configure the destination: lineitem table in the Azure SQL Data Warehouse
database.
1. Choose Azure SQL Data Warehouse as the destination store and click Next.
2. Fill in the connection information for Azure SQL Data Warehouse. Make sure you specify the user that is a
member of the role xlargerc (see the prerequisites section for detailed instructions), and click Next.
3. Choose the destination table and click Next.
4. In Schema mapping page, leave "Apply column mapping" option unchecked and click Next.
You can view the copy run details in the Activity Window Explorer in the right panel, including the data
volume read from source and written into destination, duration, and the average throughput for the run.
As you can see from the following screen shot, copying 1 TB from Azure Blob Storage into SQL Data
Warehouse took 14 minutes, effectively achieving 1.22 GBps throughput!
Best practices
Here are a few best practices for running your Azure SQL Data Warehouse database:
Use a larger resource class when loading into a CLUSTERED COLUMNSTORE INDEX.
For more efficient joins, consider using hash distribution by a select column instead of default round robin
distribution.
For faster load speeds, consider using heap for transient data.
Create statistics after you finish loading Azure SQL Data Warehouse.
See Best practices for Azure SQL Data Warehouse for details.
Next steps
Data Factory Copy Wizard - This article provides details about the Copy Wizard.
Copy Activity performance and tuning guide - This article contains the reference performance measurements
and tuning guide.
Copy Activity performance and tuning guide
8/21/2017 28 min to read Edit Online
Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading
solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-
premises data stores. Blazing-fast data loading performance is key to ensure you can focus on the core big
data problem: building advanced analytics solutions and getting deep insights from all that data.
Azure provides a set of enterprise-grade data storage and data warehouse solutions, and Copy Activity offers a
highly optimized data loading experience that is easy to configure and set up. With just a single copy activity,
you can achieve:
Loading data into Azure SQL Data Warehouse at 1.2 GBps. For a walkthrough with a use case, see Load 1
TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
Loading data into Azure Blob storage at 1.0 GBps
Loading data into Azure Data Lake Store at 1.0 GBps
This article describes:
Performance reference numbers for supported source and sink data stores to help you plan your project;
Features that can boost the copy throughput in different scenarios, including cloud data movement units,
parallel copy, and staged Copy;
Performance tuning guidance on how to tune the performance and the key factors that can impact copy
performance.
NOTE
If you are not familiar with Copy Activity in general, see Move data by using Copy Activity before reading this article.
Performance reference
As a reference, the table below shows the copy throughput numbers in MBps for given source and sink pairs,
based on in-house testing. For comparison, it also demonstrates how different settings of cloud data
movement units or Data Management Gateway scalability (multiple gateway nodes) can help copy
performance.
Points to note:
Throughput is calculated by using the following formula: [size of data read from source]/[Copy Activity run
duration].
The performance reference numbers in the table were measured using TPC-H data set in a single copy
activity run.
In Azure data stores, the source and sink are in the same Azure region.
For hybrid copy between on-premises and cloud data stores, each gateway node was running on a machine
that was separate from the on-premises data store, with the specification below. When a single activity was
running on the gateway, the copy operation consumed only a small portion of the test machine's CPU, memory,
or network bandwidth. Learn more from considerations for Data Management Gateway.
Memory: 128 GB
TIP
You can achieve higher throughput by leveraging more data movement units (DMUs) than the default maximum DMUs,
which is 32 for a cloud-to-cloud copy activity run. For example, with 100 DMUs, you can achieve copying data from
Azure Blob into Azure Data Lake Store at 1.0GBps. See the Cloud data movement units section for details about this
feature and the supported scenario. Contact Azure support to request more DMUs.
Parallel copy
You can read data from the source or write data to the destination in parallel within a Copy Activity run.
This feature enhances the throughput of a copy operation and reduces the time it takes to move data.
This setting is different from the concurrency property in the activity definition. The concurrency property
determines the number of concurrent Copy Activity runs to process data from different activity windows (1
AM to 2 AM, 2 AM to 3 AM, 3 AM to 4 AM, and so on). This capability is helpful when you perform a historical
load. The parallel copy capability applies to a single activity run.
Let's look at a sample scenario. In the following example, multiple slices from the past need to be processed.
Data Factory runs an instance of Copy Activity (an activity run) for each slice:
The data slice from the first activity window (1 AM to 2 AM) ==> Activity run 1
The data slice from the second activity window (2 AM to 3 AM) ==> Activity run 2
The data slice from the third activity window (3 AM to 4 AM) ==> Activity run 3
And so on.
In this example, when the concurrency value is set to 2, Activity run 1 and Activity run 2 copy data from two
activity windows concurrently to improve data movement performance. However, if multiple files are
associated with Activity run 1, the data movement service copies files from the source to the destination one
file at a time.
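For reference, the concurrency value lives in the activity's policy section (the values below are illustrative):
"policy": {
    "concurrency": 2,
    "executionPriorityOrder": "OldestFirst",
    "retry": 3,
    "timeout": "01:00:00"
}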
Cloud data movement units
A cloud data movement unit (DMU) is a measure that represents the power (a combination of CPU,
memory, and network resource allocation) of a single unit in Data Factory. A DMU might be used in a cloud-to-
cloud copy operation, but not in a hybrid copy.
By default, Data Factory uses a single cloud DMU to perform a single Copy Activity run. To override this default,
specify a value for the cloudDataMovementUnits property as follows. For information about the level of
performance gain you might get when you configure more units for a specific copy source and sink, see the
performance reference.
"activities":[
{
"name": "Sample copy activity",
"description": "",
"type": "Copy",
"inputs": [{ "name": "InputDataset" }],
"outputs": [{ "name": "OutputDataset" }],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"cloudDataMovementUnits": 32
}
}
]
The allowed values for the cloudDataMovementUnits property are 1 (default), 2, 4, 8, 16, 32. The actual
number of cloud DMUs that the copy operation uses at run time is equal to or less than the configured value,
depending on your data pattern.
NOTE
If you need more cloud DMUs for a higher throughput, contact Azure support. Setting of 8 and above currently works
only when you copy multiple files from Blob storage/Data Lake Store/Amazon S3/cloud FTP/cloud SFTP to
Blob storage/Data Lake Store/Azure SQL Database.
parallelCopies
You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. You
can think of this property as the maximum number of threads within Copy Activity that can read from your
source or write to your sink data stores in parallel.
For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the
source data store and to the destination data store. The default number of parallel copies that it uses depends
on the type of source and sink that you are using.
Copy data between file-based stores (Blob storage; Data Lake Store; Amazon S3; an on-premises file system; an
on-premises HDFS): Between 1 and 32. Depends on the size of the files and the number of cloud data movement
units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the Gateway
machine used for a hybrid copy (to copy data to or from an on-premises data store).
Usually, the default behavior should give you the best throughput. However, to control the load on machines
that host your data stores, or to tune copy performance, you may choose to override the default value and
specify a value for the parallelCopies property. The value must be between 1 and 32 (both inclusive). At run
time, for the best performance, Copy Activity uses a value that is less than or equal to the value that you set.
"activities":[
{
"name": "Sample copy activity",
"description": "",
"type": "Copy",
"inputs": [{ "name": "InputDataset" }],
"outputs": [{ "name": "OutputDataset" }],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"parallelCopies": 8
}
}
]
Points to note:
When you copy data between file-based stores, the parallelCopies setting determines the parallelism at the file
level. The chunking within a single file happens underneath automatically and transparently, and it's designed to
use the most suitable chunk size for a given source data store type to load data in parallel, orthogonal to
parallelCopies. The actual number of parallel copies the data movement service uses for the copy operation at
run time is no more than the number of files you have. If the copy behavior is mergeFile, Copy Activity cannot
take advantage of file-level parallelism.
When you specify a value for the parallelCopies property, consider the load increase on your source and
sink data stores, and on the gateway if it is a hybrid copy. This happens especially when you have multiple
activities or concurrent runs of the same activities that run against the same data store. If you notice that
either the data store or Gateway is overwhelmed with the load, decrease the parallelCopies value to relieve
the load.
When you copy data from stores that are not file-based to stores that are file-based, the data movement
service ignores the parallelCopies property. Even if parallelism is specified, it's not applied in this case.
NOTE
You must use Data Management Gateway version 1.11 or later to use the parallelCopies feature when you do a hybrid
copy.
To better use these two properties, and to enhance your data movement throughput, see the sample use cases.
You don't need to configure parallelCopies to take advantage of the default behavior. If you do configure it and
parallelCopies is too small, multiple cloud DMUs might not be fully utilized.
Billing impact
It's important to remember that you are charged based on the total time of the copy operation. If a copy job
used to take one hour with one cloud unit and now it takes 15 minutes with four cloud units, the overall bill
remains almost the same. For example, you use four cloud units. The first cloud unit spends 10 minutes, the
second one, 10 minutes, the third one, 5 minutes, and the fourth one, 5 minutes, all in one Copy Activity run.
You are charged for the total copy (data movement) time, which is 10 + 10 + 5 + 5 = 30 minutes. Using
parallelCopies does not affect billing.
Staged copy
When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an
interim staging store. Staging is especially useful in the following cases:
1. You want to ingest data from various data stores into SQL Data Warehouse via PolyBase. SQL Data
Warehouse uses PolyBase as a high-throughput mechanism to load a large amount of data into SQL Data
Warehouse. However, the source data must be in Blob storage, and it must meet additional criteria. When
you load data from a data store other than Blob storage, you can activate data copying via interim staging
Blob storage. In that case, Data Factory performs the required data transformations to ensure that it meets
the requirements of PolyBase. Then it uses PolyBase to load data into SQL Data Warehouse. For more
details, see Use PolyBase to load data into Azure SQL Data Warehouse. For a walkthrough with a use case,
see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
2. Sometimes it takes a while to perform a hybrid data movement (that is, to copy between an on-
premises data store and a cloud data store) over a slow network connection. To improve
performance, you can compress the data on-premises so that it takes less time to move data to the staging
data store in the cloud. Then you can decompress the data in the staging store before you load it into the
destination data store.
3. You don't want to open ports other than port 80 and port 443 in your firewall, because of
corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL
Database sink or an Azure SQL Data Warehouse sink, you need to activate outbound TCP communication
on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, take advantage of
the gateway to first copy data to a Blob storage staging instance over HTTP or HTTPS on port 443. Then,
load the data into SQL Database or SQL Data Warehouse from Blob storage staging. In this flow, you don't
need to enable port 1433.
How staged copy works
When you activate the staging feature, first the data is copied from the source data store to the staging data
store (bring your own). Next, the data is copied from the staging data store to the sink data store. Data Factory
automatically manages the two-stage flow for you. Data Factory also cleans up temporary data from the
staging storage after the data movement is complete.
In the cloud copy scenario (both source and sink data stores are in the cloud), gateway is not used. The Data
Factory service performs the copy operations.
In the hybrid copy scenario (source is on-premises and sink is in the cloud), the gateway moves data from the
source data store to a staging data store. Data Factory service moves data from the staging data store to the
sink data store. Copying data from a cloud data store to an on-premises data store via staging also is
supported with the reversed flow.
When you activate data movement by using a staging store, you can specify whether you want the data to be
compressed before moving data from the source data store to an interim or staging data store, and then
decompressed before moving data from an interim or staging data store to the sink data store.
Currently, you can't copy data between two on-premises data stores by using a staging store. We expect this
option to be available soon.
Configuration
Configure the enableStaging setting in Copy Activity to specify whether you want the data to be staged in Blob storage before you load it into a destination data store. When you set enableStaging to TRUE, specify the additional staging properties shown in the following example. You also need to create an Azure Storage or Azure Storage SAS linked service for staging if you don't have one.
Here's a sample definition of Copy Activity with the staging properties:
"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [{ "name": "OnpremisesSQLServerInput" }],
"outputs": [{ "name": "AzureSQLDBOutput" }],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "MyStagingBlob",
"path": "stagingcontainer/path",
"enableCompression": true
}
}
}
]
Billing impact
You are charged based on two steps: copy duration and copy type.
When you use staging during a cloud copy (copying data from a cloud data store to another cloud data
store), you are charged the [sum of copy duration for step 1 and step 2] x [cloud copy unit price].
When you use staging during a hybrid copy (copying data from an on-premises data store to a cloud data
store), you are charged for [hybrid copy duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud
copy unit price].
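For example (the durations and unit prices here are hypothetical and only illustrate the formula): if the gateway spends 20 minutes copying from the on-premises source to staging Blob storage (step 1) and the service spends 10 minutes copying from staging to the cloud sink (step 2), and the unit prices are H per hour for hybrid data movement and C per hour for cloud data movement, the charge for that run is (20/60) x H + (10/60) x C.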
Later in the article, you can compare the performance and configuration of your scenario to Copy Activity's performance reference from our tests.
2. Diagnose and optimize performance. If the performance you observe doesn't meet your
expectations, you need to identify performance bottlenecks. Then, optimize performance to remove or
reduce the effect of bottlenecks. A full description of performance diagnosis is beyond the scope of this
article, but here are some common considerations:
Performance features:
Parallel copy
Cloud data movement units
Staged copy
Data Management Gateway scalability
Data Management Gateway
Source
Sink
Serialization and deserialization
Compression
Column mapping
Other considerations
3. Expand the configuration to your entire data set. When you're satisfied with the execution results and
performance, you can expand the definition and pipeline active period to cover your entire data set.
Other considerations
If the size of data you want to copy is large, you can adjust your business logic to further partition the data
using the slicing mechanism in Data Factory. Then, schedule Copy Activity to run more frequently to reduce the
data size for each Copy Activity run.
Be cautious about the number of datasets and copy activities that require Data Factory to connect to the same data store at the same time. Many concurrent copy jobs might throttle a data store and lead to degraded performance, copy job internal retries, and in some cases, execution failures.
Scenario II: Copy 20 blobs of 500 MB each from Blob storage to Data Lake Store, and then tune performance.
Analysis and performance tuning: In this scenario, Data Factory copies the data from Blob storage to Data
Lake Store by using single-copy (parallelCopies set to 1) and single-cloud data movement units. The
throughput you observe will be close to that described in the performance reference section.
Scenario III: Individual file size is greater than dozens of MBs and total volume is large.
Analysis and performance tuning: Increasing parallelCopies doesn't result in better copy performance because of the resource limitations of a single-cloud DMU. Instead, you should specify more cloud DMUs to get more resources to perform the data movement. Do not specify a value for the parallelCopies property. Data Factory handles the parallelism for you. In this case, if you set cloudDataMovementUnits to 4, the throughput is about four times greater.
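As a minimal sketch of this configuration (the activity and dataset names here are hypothetical placeholders), the following Copy Activity sets cloudDataMovementUnits to 4 and intentionally omits parallelCopies so that Data Factory manages the parallelism:
"activities":[
    {
        "name": "LargeFilesCopyActivity",
        "type": "Copy",
        "inputs": [{ "name": "LargeFilesInputDataset" }],
        "outputs": [{ "name": "LargeFilesOutputDataset" }],
        "typeProperties": {
            "source": {
                "type": "BlobSource"
            },
            "sink": {
                "type": "AzureDataLakeStoreSink"
            },
            "cloudDataMovementUnits": 4
        }
    }
]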
Reference
Here are performance monitoring and tuning references for some of the supported data stores:
Azure Storage (including Blob storage and Table storage): Azure Storage scalability targets and Azure
Storage performance and scalability checklist
Azure SQL Database: You can monitor the performance and check the database transaction unit (DTU)
percentage
Azure SQL Data Warehouse: Its capability is measured in data warehouse units (DWUs); see Manage
compute power in Azure SQL Data Warehouse (Overview)
Azure Cosmos DB: Performance levels in Azure Cosmos DB
On-premises SQL Server: Monitor and tune for performance
On-premises file server: Performance tuning for file servers
Add fault tolerance in Copy Activity by skipping
incompatible rows
8/21/2017 3 min to read Edit Online
Azure Data Factory Copy Activity offers you two ways to handle incompatible rows when copying data between
source and sink data stores:
You can abort and fail the copy activity when incompatible data is encountered (default behavior).
You can continue to copy all of the data by adding fault tolerance and skipping incompatible data rows. In
addition, you can log the incompatible rows in Azure Blob storage. You can then examine the log to learn the
cause for the failure, fix the data on the data source, and retry the copy activity.
Supported scenarios
Copy Activity supports three scenarios for detecting, skipping, and logging incompatible data:
Incompatibility between the source data type and the sink native type
For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that
contains three INT type columns. The CSV file rows that contain numeric data, such as 123,456,789 are
copied successfully to the sink store. However, the rows that contain non-numeric values, such as
123,456,abc are detected as incompatible and are skipped.
Mismatch in the number of columns between the source and the sink
For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that
contains six columns. The CSV file rows that contain six columns are copied successfully to the sink store.
The CSV file rows that contain more or fewer than six columns are detected as incompatible and are
skipped.
Primary key violation when writing to a relational database
For example: Copy data from a SQL server to a SQL database. A primary key is defined in the sink SQL
database, but no such primary key is defined in the source SQL server. The duplicated rows that exist in the
source cannot be copied to the sink. Copy Activity copies only the first row of the source data into the sink.
The subsequent source rows that contain the duplicated primary key value are detected as incompatible and
are skipped.
Configuration
The following example provides a JSON definition to configure skipping the incompatible rows in Copy Activity:
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
},
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": "BlobStorage",
"path": "redirectcontainer/erroroutput"
}
}
path: The path of the log file that contains the skipped rows. Specify the Blob storage path that you want to use to log the incompatible data. If you do not provide a path, the service creates a container for you. This property is not required.
Monitoring
After the copy activity run completes, you can see the number of skipped rows in the monitoring section:
If you configure Copy Activity to log the incompatible rows, you can find the log file at this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/[copy-activity-run-id]/[auto-generated-
GUID].csv
In the log file, you can see the rows that were skipped and the root cause of the incompatibility.
Both the original data and the corresponding error are logged in the file. An example of the log file content is as
follows:
data1, data2, data3, UserErrorInvalidDataValue,Column 'Prop_2' contains an invalid value 'data3'. Cannot
convert 'data3' to type 'DateTime'.,
data4, data5, data6, Violation of PRIMARY KEY constraint 'PK_tblintstrdatetimewithpk'. Cannot insert duplicate
key in object 'dbo.tblintstrdatetimewithpk'. The duplicate key value is (data4).
Next steps
To learn more about Azure Data Factory Copy Activity, see Move data by using Copy Activity.
Azure Data Factory - Security considerations for data
movement
8/21/2017 10 min to read Edit Online
Introduction
This article describes basic security infrastructure that data movement services in Azure Data Factory use to secure
your data. Azure Data Factory management resources are built on Azure security infrastructure and use all possible
security measures offered by Azure.
In a Data Factory solution, you create one or more data pipelines. A pipeline is a logical grouping of activities that
together perform a task. These pipelines reside in the region where the data factory was created.
Even though Data Factory is available only in the West US, East US, and North Europe regions, the data movement service is available globally in several regions. The Data Factory service ensures that data does not leave a geographical area/region unless you explicitly instruct the service to use an alternate region if the data movement service is not yet deployed to that region.
Azure Data Factory itself does not store any data except for linked service credentials for cloud data stores, which
are encrypted using certificates. It lets you create data-driven workflows to orchestrate movement of data between
supported data stores and processing of data using compute services in other regions or in an on-premises
environment. It also allows you to monitor and manage workflows using both programmatic and UI mechanisms.
Data movement using Azure Data Factory has been certified for:
HIPAA/HITECH
ISO/IEC 27001
ISO/IEC 27018
CSA STAR
If you are interested in Azure compliance and how Azure secures its own infrastructure, visit the Microsoft Trust
Center.
In this article, we review security considerations in the following two data movement scenarios:
Cloud scenario: In this scenario, both your source and destination are publicly accessible through the internet. These include managed cloud storage services like Azure Storage, Azure SQL Data Warehouse, Azure SQL Database, Azure Data Lake Store, Amazon S3, Amazon Redshift, SaaS services such as Salesforce, and web protocols such as FTP and OData. You can find a complete list of supported data sources here.
Hybrid scenario: In this scenario, either your source or destination is behind a firewall or inside an on-premises corporate network, or the data store is in a private network/virtual network (most often the source) and is not publicly accessible. Database servers hosted on virtual machines also fall under this scenario.
Cloud scenarios
Securing data store credentials
Azure Data Factory protects your data store credentials by encrypting them by using certificates managed by
Microsoft. These certificates are rotated every two years (which includes renewal of certificate and migration of
credentials). These encrypted credentials are securely stored in Azure Storage managed by the Azure Data Factory management services. For more information about Azure Storage security, refer to the Azure Storage Security Overview.
Data encryption in transit
If the cloud data store supports HTTPS or TLS, all data transfers between data movement services in Data Factory
and a cloud data store are via secure channel HTTPS or TLS.
NOTE
All connections to Azure SQL Database and Azure SQL Data Warehouse always require encryption (SSL/TLS) while data is
in transit to and from the database. While authoring a pipeline using a JSON editor, add the encryption property and set it
to true in the connection string. When you use the Copy Wizard, the wizard sets this property by default. For Azure
Storage, you can use HTTPS in the connection string.
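For illustration, an Azure SQL Database linked service with encryption enabled in the connection string might look like the following sketch (server, database, and credential values are placeholders):
{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    }
}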
NOTE
Older gateways that were installed before November 2016, or that are of version 2.3.xxxx.x, continue to use credentials encrypted and stored in the cloud. Even if you upgrade the gateway to the latest version, the credentials are not migrated to an on-premises machine.
Encryption in transit
All data transfers are via secure channel HTTPS and TLS over TCP to prevent man-in-the-middle attacks during
communication with Azure services.
You can also use IPSec VPN or Express Route to further secure the communication channel between your on-
premises network and Azure.
Virtual network is a logical representation of your network in the cloud. You can connect an on-premises network to your Azure virtual network (VNet) by setting up an IPSec VPN (site-to-site) or Express Route (Private Peering).
The following table summarizes the network and gateway configuration recommendations based on different
combinations of source and destination locations for hybrid data movement.
Source: On-premises. Destination: Virtual machines and cloud services deployed in virtual networks. Network configuration: IPSec VPN (point-to-site or site-to-site). The gateway can be installed either on-premises or on an Azure virtual machine (VM) in the VNet.
Source: On-premises. Destination: Virtual machines and cloud services deployed in virtual networks. Network configuration: ExpressRoute (Private Peering). The gateway can be installed either on-premises or on an Azure VM in the VNet.
The following images show the usage of Data Management Gateway for moving data between an on-premises
database and Azure services using Express route and IPSec VPN (with Virtual Network):
Express Route:
IPSec VPN:
Firewall configurations and whitelisting IP address of gateway
Firewall requirements for on-premises/private network
In an enterprise, a corporate firewall runs on the central router of the organization. The Windows firewall runs as a daemon on the local machine on which the gateway is installed.
The following table provides outbound port and domain requirements for the corporate firewall.
NOTE
You may have to manage ports or whitelist domains at the corporate firewall level, as required by the respective data sources. This table only uses Azure SQL Database, Azure SQL Data Warehouse, and Azure Data Lake Store as examples.
The following table provides inbound port requirements for the windows firewall.
INBOUND PORTS DESCRIPTION
Next steps
For information about performance of copy activity, see Copy activity performance and tuning guide.
Move data From Amazon Redshift using Azure
Data Factory
7/27/2017 7 min to read Edit Online
This article explains how to use the Copy Activity in Azure Data Factory to move data from Amazon Redshift.
The article builds on the Data Movement Activities article, which presents a general overview of data movement
with the copy activity.
You can copy data from Amazon Redshift to any supported sink data store. For a list of data stores supported as
sinks by the copy activity, see supported data stores. Data Factory currently supports moving data from Amazon Redshift to other data stores, but not moving data from other data stores to Amazon Redshift.
Prerequisites
If you are moving data to an on-premises data store, install Data Management Gateway on an on-premises machine. Then, grant the Data Management Gateway machine (use the IP address of the machine) access to the Amazon Redshift cluster. See Authorize access to the cluster for instructions.
If you are moving data to an Azure data store, see Azure Data Center IP Ranges for the Compute IP address
and SQL ranges used by the Azure data centers.
Getting started
You can create a pipeline with a copy activity that moves data from an Amazon Redshift source by using
different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an Amazon Redshift data store, see JSON example: Copy data from Amazon Redshift to
Azure Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Amazon Redshift:
port: The number of the TCP port that the Amazon Redshift server uses to listen for client connections. This property is not required; the default value is 5439.
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy are similar for all dataset types (Azure SQL, Azure blob, Azure table,
etc.).
The typeProperties section is different for each type of dataset. It provides information about the location of the data in the data store. The typeProperties section for a dataset of type RelationalTable (which includes the Amazon Redshift dataset) has the following properties:
query: Use the custom query to read data. Allowed value: a SQL query string, for example, select * from MyTable. This property is not required if the tableName of the dataset is specified.
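For example, a RelationalTable dataset that reads an Amazon Redshift table might look like the following sketch (the dataset and table names are placeholders; linkedServiceName refers to the Amazon Redshift linked service shown in the JSON that follows):
{
    "name": "AmazonRedshiftInputDataset",
    "properties": {
        "type": "RelationalTable",
        "linkedServiceName": "AmazonRedshiftLinkedService",
        "typeProperties": {
            "tableName": "MyTable"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}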
{
"name": "AmazonRedshiftLinkedService",
"properties":
{
"type": "AmazonRedshift",
"typeProperties":
{
"server": "< The IP address or host name of the Amazon Redshift server >",
"port": <The number of the TCP port that the Amazon Redshift server uses to listen for client
connections.>,
"database": "<The database name of the Amazon Redshift database>",
"username": "<username>",
"password": "<password>"
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
Copy activity in a pipeline with Amazon Redshift source (RelationalSource) and Blob sink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to
run every hour. In the pipeline JSON definition, the source type is set to RelationalSource and sink type is set
to BlobSink. The SQL query specified for the query property selects the data in the past hour to copy.
{
"name": "CopyAmazonRedshiftToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-
ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "AmazonRedshiftInputDataset"
}
],
"outputs": [
{
"name": "AzureBlobOutputDataSet"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "AmazonRedshiftToBlob"
}
],
"start": "2014-06-01T18:00:00Z",
"end": "2014-06-01T19:00:00Z"
}
}
AMAZON REDSHIFT TYPE    .NET BASED TYPE
SMALLINT                Int16
INTEGER                 Int32
BIGINT                  Int64
DECIMAL                 Decimal
REAL                    Single
BOOLEAN                 String
CHAR                    String
VARCHAR                 String
DATE                    DateTime
TIMESTAMP               DateTime
TEXT                    String
Next Steps
See the following articles:
Copy Activity tutorial for step-by-step instructions for creating a pipeline with a Copy Activity.
Move data from Amazon Simple Storage Service
by using Azure Data Factory
6/27/2017 8 min to read Edit Online
This article explains how to use the copy activity in Azure Data Factory to move data from Amazon Simple
Storage Service (S3). It builds on the Data movement activities article, which presents a general overview of
data movement with the copy activity.
You can copy data from Amazon S3 to any supported sink data store. For a list of data stores supported as sinks
by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data
from Amazon S3 to other data stores, but not moving data from other data stores to Amazon S3.
Required permissions
To copy data from Amazon S3, make sure you have been granted the following permissions:
s3:GetObject and s3:GetObjectVersion for Amazon S3 Object Operations.
s3:ListBucket for Amazon S3 Bucket Operations. If you are using the Data Factory Copy Wizard,
s3:ListAllMyBuckets is also required.
For details about the full list of Amazon S3 permissions, see Specifying Permissions in a Policy.
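As a rough sketch (the bucket name is a placeholder, and the exact policy depends on your environment), an Amazon IAM policy that grants these permissions might look like this:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
            "Resource": "arn:aws:s3:::examplebucket/*"
        },
        {
            "Effect": "Allow",
            "Action": [ "s3:ListBucket" ],
            "Resource": "arn:aws:s3:::examplebucket"
        }
    ]
}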
Getting started
You can create a pipeline with a copy activity that moves data from an Amazon S3 source by using different
tools or APIs.
The easiest way to create a pipeline is to use the Copy Wizard. For a quick walkthrough, see Tutorial: Create a
pipeline using Copy Wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. For step-by-step instructions to create a
pipeline with a copy activity, see the Copy activity tutorial.
Whether you use tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools or APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an Amazon S3 data store, see the JSON example: Copy data from Amazon S3 to Azure
Blob section of this article.
NOTE
For details about supported file and compression formats for a copy activity, see File and compression formats in Azure
Data Factory.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Amazon S3.
secretAccessKey: The secret access key itself. Allowed value: an encrypted secret string. This property is required.
Here is an example:
{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AwsAccessKey",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": "<secret access key>"
}
}
}
Dataset properties
To specify a dataset to represent input data in Amazon S3, set the type property of the dataset to AmazonS3. Set the linkedServiceName property of the dataset to the name of the Amazon S3 linked service. For a full list of sections and properties available for defining datasets, see Creating datasets.
Sections such as structure, availability, and policy are similar for all dataset types (such as SQL database, Azure blob, and Azure table). The typeProperties section is different for each type of dataset, and provides information about the location of the data in the data store. The typeProperties section for a dataset of type AmazonS3 has the following properties:
NOTE
bucketName + key specifies the location of the S3 object, where bucket is the root container for S3 objects, and key is
the full path to the S3 object.
{
"name": "dataset-s3",
"properties": {
"type": "AmazonS3",
"linkedServiceName": "link- testS3",
"typeProperties": {
"key": "testFolder/test.orc",
"bucketName": "testbucket",
"version": "XXXXXXXXXczm0CJajYkHf0_k6LhBmkcL",
"format": {
"type": "OrcFormat"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
"key": "testFolder/test.orc",
"bucketName": "testbucket",
You can have Data Factory calculate these properties dynamically at runtime, by using system variables such as
SliceStart.
You can do the same for the prefix property of an Amazon S3 dataset. For a list of supported functions and
variables, see Data Factory functions and system variables.
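For example (a sketch; the folder layout under the bucket is hypothetical), the key could be computed from the slice start time:
"typeProperties": {
    "bucketName": "testbucket",
    "key": "$$Text.Format('testFolder/{0:yyyy}/{0:MM}/{0:dd}/test.orc', SliceStart)",
    "format": {
        "type": "OrcFormat"
    }
}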
{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AwsAccessKey",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": "<secret access key>"
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"name": "AmazonS3InputDataset",
"properties": {
"type": "AmazonS3",
"linkedServiceName": "AmazonS3LinkedService",
"typeProperties": {
"key": "testFolder/test.orc",
"bucketName": "testbucket",
"format": {
"type": "OrcFormat"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
NOTE
To map columns from a source dataset to columns from a sink dataset, see Mapping dataset columns in Azure Data
Factory.
Next steps
See the following articles:
To learn about key factors that impact performance of data movement (copy activity) in Data Factory,
and various ways to optimize it, see the Copy activity performance and tuning guide.
For step-by-step instructions for creating a pipeline with a copy activity, see the Copy activity tutorial.
Copy data to or from Azure Blob Storage using
Azure Data Factory
8/21/2017 31 min to read Edit Online
This article explains how to use the Copy Activity in Azure Data Factory to copy data to and from Azure Blob
Storage. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.
Overview
You can copy data from any supported source data store to Azure Blob Storage or from Azure Blob Storage
to any supported sink data store. The following table provides a list of data stores supported as sources or
sinks by the copy activity. For example, you can move data from a SQL Server database or an Azure SQL
database to an Azure blob storage. And, you can copy data from Azure blob storage to an Azure SQL Data
Warehouse or an Azure Cosmos DB collection.
Supported scenarios
You can copy data from Azure Blob Storage to the following data stores:
You can copy data from the following data stores to Azure Blob Storage:
NoSQL: Cassandra, MongoDB
File: Amazon S3, File System, FTP, HDFS, SFTP
IMPORTANT
Copy Activity supports copying data from/to both general-purpose Azure Storage accounts and Hot/Cool Blob
storage. The activity supports reading from block, append, or page blobs, but supports writing to only block
blobs. Azure Premium Storage is not supported as a sink because it is backed by page blobs.
Copy Activity does not delete data from the source after the data is successfully copied to the destination. If you need
to delete source data after a successful copy, create a custom activity to delete the data and use the activity in the
pipeline. For an example, see the Delete blob or folder sample on GitHub.
Get started
You can create a pipeline with a copy activity that moves data to/from an Azure Blob Storage by using
different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. This article has a walkthrough for creating a
pipeline to copy data from an Azure Blob Storage location to another Azure Blob Storage location. For a
tutorial on creating a pipeline to copy data from an Azure Blob Storage to Azure SQL Database, see Tutorial:
Create a pipeline using Copy Wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from
a source data store to a sink data store:
1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are
copying data from an Azure blob storage to an Azure SQL database, you create two linked services to link
your Azure storage account and Azure SQL database to your data factory. For linked service properties
that are specific to Azure Blob Storage, see linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in
the last step, you create a dataset to specify the blob container and folder that contains the input data.
And, you create another dataset to specify the SQL table in the Azure SQL database that holds the data
copied from the blob storage. For dataset properties that are specific to Azure Blob Storage, see dataset
properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the
example mentioned earlier, you use BlobSource as a source and SqlSink as a sink for the copy activity.
Similarly, if you are copying from Azure SQL Database to Azure Blob Storage, you use SqlSource and
BlobSink in the copy activity. For copy activity properties that are specific to Azure Blob Storage, see copy
activity properties section. For details on how to use a data store as a source or a sink, click the link in the
previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that
are used to copy data to/from an Azure Blob Storage, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Azure Blob Storage.
See the following article for steps to view/copy the account key for an Azure Storage: View, copy, and
regenerate storage access keys.
Example:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
IMPORTANT
Azure Data Factory now supports only Service SAS but not Account SAS. See Types of Shared Access Signatures for details about these two types and how to construct them. Note that the SAS URL that you can generate from the Azure portal or Storage Explorer is an Account SAS, which is not supported.
The Azure Storage SAS linked service allows you to link an Azure Storage Account to an Azure data factory
by using a Shared Access Signature (SAS). It provides the data factory with restricted/time-bound access to
all/specific resources (blob/container) in the storage. The following table provides description for JSON
elements specific to Azure Storage SAS linked service.
Example:
{
"name": "StorageSasLinkedService",
"properties": {
"type": "AzureStorageSas",
"typeProperties": {
"sasUri": "<Specify SAS URI of the Azure Storage resource>"
}
}
}
Dataset properties
To specify a dataset to represent input or output data in an Azure Blob Storage, you set the type property of
the dataset to: AzureBlob. Set the linkedServiceName property of the dataset to the name of the Azure
Storage or Azure Storage SAS linked service. The type properties of the dataset specify the blob container
and the folder in the blob storage.
For a full list of JSON sections & properties available for defining datasets, see the Creating datasets article.
Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure
SQL, Azure blob, Azure table, etc.).
Data factory supports the following CLS-compliant .NET based type values for providing type information in
structure for schema-on-read data sources like Azure blob: Int16, Int32, Int64, Single, Double, Decimal,
Byte[], Bool, String, Guid, Datetime, Datetimeoffset, Timespan. Data Factory automatically performs type
conversions when moving data from a source data store to a sink data store.
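For instance, a structure section that declares typed columns for a blob dataset could look like the following sketch (the column names are hypothetical):
"structure": [
    { "name": "userid", "type": "Int64" },
    { "name": "name", "type": "String" },
    { "name": "lastlogindate", "type": "Datetime" }
],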
The typeProperties section is different for each type of dataset and provides information about the location,
format etc., of the data in the data store. The typeProperties section for dataset of type AzureBlob dataset
has the following properties:
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
{ "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],
In this example, {Slice} is replaced with the value of Data Factory system variable SliceStart in the format
(YYYYMMDDHH) specified. The SliceStart refers to start time of the slice. The folderPath is different for each
slice. For example: wikidatagateway/wikisampledataout/2014100103 or
wikidatagateway/wikisampledataout/2014100104
Sample 2
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by
folderPath and fileName properties.
The following examples describe the resulting behavior of the copy operation for different combinations of the recursive and copyBehavior values, given a source Folder1 that contains File1 and File2 plus a subfolder Subfolder1 with File3, File4, and File5:

recursive = true, copyBehavior = preserveHierarchy: the target Folder1 is created with the same structure as the source: File1, File2, and Subfolder1 containing File3, File4, and File5.

recursive = true, copyBehavior = flattenHierarchy: the target Folder1 is created with auto-generated names for File1, File2, File3, File4, and File5.

recursive = true, copyBehavior = mergeFiles: the contents of File1, File2, File3, File4, and File5 are merged into one file with an auto-generated name in the target Folder1.

recursive = false, copyBehavior = preserveHierarchy: the target Folder1 is created with File1 and File2 only; Subfolder1 with File3, File4, and File5 is not picked up.

recursive = false, copyBehavior = flattenHierarchy: the target Folder1 is created with auto-generated names for File1 and File2 only; Subfolder1 with File3, File4, and File5 is not picked up.

recursive = false, copyBehavior = mergeFiles: the contents of File1 and File2 are merged into one file with an auto-generated name in the target Folder1; Subfolder1 with File3, File4, and File5 is not picked up.
NOTE
If you see that the web browser is stuck at "Authorizing...", disable/uncheck Block third-party cookies and
site data setting (or) keep it enabled and create an exception for login.microsoftonline.com and then try
launching the wizard again.
7. On the File format settings page, you see the delimiters and the schema that is auto-detected by the
wizard by parsing the file.
a. Confirm the following options:
a. The file format is set to Text format. You can see all the supported formats in the drop-down list, for example: JSON, Avro, ORC, Parquet.
b. The column delimiter is set to Comma (,). You can see the other column delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter.
c. The row delimiter is set to Carriage Return + Line feed (\r\n). You can see the other row delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter.
d. The skip line count is set to 0. If you want a few lines to be skipped at the top of the file, enter the number here.
e. The first data row contains column names is not set. If the source files contain column names in the first row, select this option.
f. The treat empty column value as null option is set.
b. Expand Advanced settings to see the advanced options available.
c. At the bottom of the page, see the preview of data from the emp.txt file.
d. Click SCHEMA tab at the bottom to see the schema that the copy wizard inferred by looking at the
data in the source file.
e. Click Next after you review the delimiters and preview data.
8. On the Destination data store page, select Azure Blob Storage, and click Next. You are using the
Azure Blob Storage as both the source and destination data stores in this walkthrough.
11. On the File format settings page, review the settings, and click Next. One of the additional options here
is to add a header to the output file. If you select that option, a header row is added with names of the
columns from the schema of the source. You can rename the default column names when viewing the
schema for the source. For example, you could change the first column to First Name and second column
to Last Name. Then, the output file is generated with a header with these names as column names.
12. On the Performance settings page, confirm that cloud units and parallel copies are set to Auto, and
click Next. For details about these settings, see Copy activity performance and tuning guide.
13. On the Summary page, review all settings (task properties, settings for source and destination, and copy
settings), and click Next.
14. Review information in the Summary page, and click Finish. The wizard creates two linked services, two
datasets (input and output), and one pipeline in the data factory (from where you launched the Copy
Wizard).
{
"name": "Source-BlobStorage-z4y",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString":
"DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********"
}
}
}
Destination Blob storage linked service
{
"name": "Destination-BlobStorage-z4y",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString":
"DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********"
}
}
}
For more information about Azure Storage linked service, see Linked service properties section.
Datasets
There are two datasets: an input dataset and an output dataset. The type of the dataset is set to AzureBlob
for both.
The input dataset points to the input folder of the adfblobconnector blob container. The external property
is set to true for this dataset as the data is not produced by the pipeline with the copy activity that takes this
dataset as an input.
The output dataset points to the output folder of the same blob container. The output dataset also uses the
year, month, and day of the SliceStart system variable to dynamically evaluate the path for the output file.
For a list of functions and system variables supported by Data Factory, see Data Factory functions and
system variables. The external property is set to false (default value) because this dataset is produced by
the pipeline.
For more information about properties supported by Azure Blob dataset, see Dataset properties section.
Input dataset
{
"name": "InputDataset-z4y",
"properties": {
"structure": [
{ "name": "Prop_0", "type": "String" },
{ "name": "Prop_1", "type": "String" }
],
"type": "AzureBlob",
"linkedServiceName": "Source-BlobStorage-z4y",
"typeProperties": {
"folderPath": "adfblobconnector/input/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
Output dataset
{
"name": "OutputDataset-z4y",
"properties": {
"structure": [
{ "name": "Prop_0", "type": "String" },
{ "name": "Prop_1", "type": "String" }
],
"type": "AzureBlob",
"linkedServiceName": "Destination-BlobStorage-z4y",
"typeProperties": {
"folderPath": "adfblobconnector/output/{year}/{month}/{day}",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
},
"partitionedBy": [
{ "name": "year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy"
} },
{ "name": "month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" }
},
{ "name": "day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
]
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": false,
"policy": {}
}
}
Pipeline
The pipeline has just one activity. The type of the activity is set to Copy. In the type properties for the activity, there are two sections, one for source and the other for sink. The source type is set to BlobSource because the activity is copying data from blob storage. The sink type is set to BlobSink because the activity is copying data to blob storage. The copy activity takes InputDataset-z4y as the input and OutputDataset-z4y as the output.
For more information about properties supported by BlobSource and BlobSink, see Copy activity properties
section.
{
"name": "CopyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "BlobSink",
"copyBehavior": "MergeFiles",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset-z4y"
}
],
"outputs": [
{
"name": "OutputDataset-z4y"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-Blob path_ adfblobconnector_input_->OutputDataset-z4y"
}
],
"start": "2017-04-21T22:34:00Z",
"end": "2017-04-25T05:00:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the first one, you specify the connection string that includes the account key, and for the latter one, you specify the Shared Access Signature (SAS) URI. See the Linked Services section for details.
Azure Blob input dataset:
Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for
the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder
path uses year, month, and day part of the start time and file name uses the hour part of the start time.
external: true setting informs Data Factory that the table is external to the data factory and is not
produced by an activity in the data factory.
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/",
"fileName": "{Hour}.csv",
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
],
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
{
"name": "AzureSqlOutput",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the first one, you specify the connection string that includes the account key, and for the latter one, you specify the Shared Access Signature (SAS) URI. See the Linked Services section for details.
Azure SQL input dataset:
The sample assumes you have created a table MyTable in Azure SQL and it contains a column called
timestampcolumn for time series data.
Setting external: true informs Data Factory service that the table is external to the data factory and is not
produced by an activity in the data factory.
{
"name": "AzureSqlInput",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/",
"partitionedBy": [
{
"name": "Year",
"value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
],
"format": {
"type": "TextFormat",
"columnDelimiter": "\t",
"rowDelimiter": "\n"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
NOTE
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data
Factory.
This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure Cosmos
DB (DocumentDB API). It builds on the Data Movement Activities article, which presents a general overview of
data movement with the copy activity.
You can copy data from any supported source data store to Azure Cosmos DB or from Azure Cosmos DB to
any supported sink data store. For a list of data stores supported as sources or sinks by the copy activity, see
the Supported data stores table.
IMPORTANT
The Azure Cosmos DB connector only supports the DocumentDB API.
To copy data as-is to/from JSON files or another Cosmos DB collection, see Import/Export JSON documents.
Getting started
You can create a pipeline with a copy activity that moves data to/from Azure Cosmos DB by using different
tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are
used to copy data to/from Cosmos DB, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Cosmos DB:
Example:
{
"name": "CosmosDbLinkedService",
"properties": {
"type": "DocumentDb",
"typeProperties": {
"connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
}
}
}
Dataset properties
For a full list of sections & properties available for defining datasets please refer to the Creating datasets
article. Sections like structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure
SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for the dataset of type DocumentDbCollection has the
following properties.
Example:
{
"name": "PersonCosmosDbTable",
"properties": {
"type": "DocumentDbCollection",
"linkedServiceName": "CosmosDbLinkedService",
"typeProperties": {
"collectionName": "Person"
},
"external": true,
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
NOTE
The Copy Activity takes only one input and produces only one output.
Properties available in the typeProperties section of the activity on the other hand vary with each activity type
and in case of Copy activity they vary depending on the types of sources and sinks.
In case of Copy activity when source is of type DocumentDbCollectionSource the following properties are
available in typeProperties section:
nestingSeparator: A special character in the source column name to indicate that a nested document is needed. The allowed value is a character that is used to separate nesting levels. The default value is . (dot). For example, with the default separator, a column named Name.First produces the following nested structure in the document:
"Name": {
    "First": "John"
},
Throttling is decided by a number of factors, including the size of documents, the number of terms in documents, and the indexing policy of the target collection. For copy operations, you can use a better collection (for example, S3) to have the most throughput available (2,500 request units/second).
JSON examples
The following examples provide sample JSON definitions that you can use to create a pipeline by using Azure
portal or Visual Studio or Azure PowerShell. They show how to copy data to and from Azure Cosmos DB and
Azure Blob Storage. However, data can be copied directly from any of the sources to any of the sinks stated
here using the Copy Activity in Azure Data Factory.
{
"name": "CosmosDbLinkedService",
"properties": {
"type": "DocumentDb",
"typeProperties": {
"connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
}
}
}
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"name": "PersonBlobTableOut",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "docdb",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"nullValue": "NULL"
}
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
{
"PersonId": 2,
"Name": {
"First": "Jane",
"Middle": "",
"Last": "Doe"
}
}
Cosmos DB supports querying documents using a SQL-like syntax over hierarchical JSON documents.
Example:
The following pipeline copies data from the Person collection in the Azure Cosmos DB database to an Azure
blob. As part of the copy activity the input and output datasets have been specified.
{
"name": "DocDbToBlobPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "SELECT Person.Id, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName,
Person.Name.Last AS LastName FROM Person",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink",
"blobWriterAddHeader": true,
"writeBatchSize": 1000,
"writeBatchTimeout": "00:00:59"
}
},
"inputs": [
{
"name": "PersonCosmosDbTable"
}
],
"outputs": [
{
"name": "PersonBlobTableOut"
}
],
"policy": {
"concurrency": 1
},
"name": "CopyFromDocDbToBlob"
}
],
"start": "2015-04-01T00:00:00Z",
"end": "2015-04-02T00:00:00Z"
}
}
{
"name": "CosmosDbLinkedService",
"properties": {
"type": "DocumentDb",
"typeProperties": {
"connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
}
}
}
The following pipeline copies data from an Azure blob to the Person collection in Azure Cosmos DB. The input and output datasets are specified as part of the copy activity.
{
"name": "BlobToDocDbPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "DocumentDbCollectionSink",
"nestingSeparator": ".",
"writeBatchSize": 2,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"ColumnMappings": "FirstName: Name.First, MiddleName: Name.Middle, LastName: Name.Last,
BusinessEntityID: BusinessEntityID, PersonType: PersonType, NameStyle: NameStyle, Title: Title, Suffix:
Suffix, EmailPromotion: EmailPromotion, rowguid: rowguid, ModifiedDate: ModifiedDate"
}
},
"inputs": [
{
"name": "PersonBlobTableIn"
}
],
"outputs": [
{
"name": "PersonCosmosDbTableOut"
}
],
"policy": {
"concurrency": 1
},
"name": "CopyFromBlobToDocDb"
}
],
"start": "2015-04-14T00:00:00Z",
"end": "2015-04-15T00:00:00Z"
}
}
1,John,,Doe
{
"Id": 1,
"Name": {
"First": "John",
"Middle": null,
"Last": "Doe"
},
"id": "a5e8595c-62ec-4554-a118-3940f4ff70b6"
}
Azure Cosmos DB is a NoSQL store for JSON documents, where nested structures are allowed. Azure Data Factory enables users to denote the hierarchy via nestingSeparator, which is . (dot) in this example. With the separator, the copy activity generates the Name object with three child elements, First, Middle, and Last, according to Name.First, Name.Middle, and Name.Last in the table definition.
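For illustration, the structure section of such a table definition could declare the dotted column names as follows; the types shown are assumptions:
"structure": [
  { "name": "Id", "type": "Int64" },
  { "name": "Name.First", "type": "String" },
  { "name": "Name.Middle", "type": "String" },
  { "name": "Name.Last", "type": "String" }
]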
Appendix
1. Question: Does the Copy Activity support update of existing records?
Answer: No.
2. Question: How does a retry of a copy to Azure Cosmos DB deal with already copied records?
Answer: If records have an "ID" field and the copy operation tries to insert a record with the same ID,
the copy operation throws an error.
3. Question: Does Data Factory support range or hash-based data partitioning?
Answer: No.
4. Question: Can I specify more than one Azure Cosmos DB collection for a table?
Answer: No. Only one collection can be specified at this time.
Copy data to and from Azure Data Lake Store using Azure Data Factory
This article explains how to use Copy Activity in Azure Data Factory to move data to and from Azure Data Lake Store. It builds on the Data movement activities article, an overview of data movement with Copy Activity.
Supported scenarios
You can copy data from Azure Data Lake Store to the following data stores:
You can copy data from the following data stores to Azure Data Lake Store:
CATEGORY DATA STORE
NoSQL Cassandra, MongoDB
File Amazon S3, File System, FTP, HDFS, SFTP
NOTE
Create a Data Lake Store account before creating a pipeline with Copy Activity. For more information, see Get started
with Azure Data Lake Store.
Get started
You can create a pipeline with a copy activity that moves data to/from an Azure Data Lake Store by using
different tools/APIs.
The easiest way to create a pipeline to copy data is to use the Copy Wizard. For a tutorial on creating a
pipeline by using the Copy Wizard, see Tutorial: Create a pipeline using Copy Wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from
a source data store to a sink data store:
1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are
copying data from an Azure blob storage to an Azure Data Lake Store, you create two linked services to
link your Azure storage account and Azure Data Lake store to your data factory. For linked service
properties that are specific to Azure Data Lake Store, see linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in
the last step, you create a dataset to specify the blob container and folder that contains the input data. And,
you create another dataset to specify the folder and file path in the Data Lake store that holds the data
copied from the blob storage. For dataset properties that are specific to Azure Data Lake Store, see dataset
properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the
example mentioned earlier, you use BlobSource as a source and AzureDataLakeStoreSink as a sink for the
copy activity. Similarly, if you are copying from Azure Data Lake Store to Azure Blob Storage, you use
AzureDataLakeStoreSource and BlobSink in the copy activity. For copy activity properties that are specific
to Azure Data Lake Store, see copy activity properties section. For details on how to use a data store as a
source or a sink, click the link in the previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are
used to copy data to/from an Azure Data Lake Store, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Data Lake Store.
NOTE
After you create or update a service principal in Azure AD, it can take a few minutes for the changes to take effect. Check the service principal and Data Lake Store access control list (ACL) configurations. If you still see the message "The credentials provided are invalid," wait a while and try again.
The following examples show a Data Lake Store linked service that uses service principal authentication, followed by one that uses user credential (OAuth) authentication with the sessionId and authorization properties.
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": "<service principal key>",
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
}
}
}
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"sessionId": "<session ID>",
"authorization": "<authorization URL>",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
}
}
}
Token expiration
The authorization code that you generate by using the Authorize button expires after a certain amount of
time. The following message means that the authentication token has expired:
Credential operation error: invalid_grant - AADSTS70002: Error validating credentials. AADSTS70008: The
provided access grant is expired or revoked. Trace ID: d18629e8-af88-43c5-88e3-d8419eb1fca1 Correlation
ID: fac30a0c-6be6-4e02-8d69-a776d2ffefd7 Timestamp: 2015-12-15 21-09-31Z.
The following table shows the expiration times of different types of user accounts:
USER ACCOUNT TYPE TOKEN EXPIRATION
User accounts managed by Azure Active Directory 14 days after the last slice run
If you change your password before the token expires, the token expires immediately, and you see the message mentioned earlier in this section.
When the token expires, you can reauthorize the account by using the Authorize button and then redeploy the linked service. You can also generate values for the sessionId and authorization properties programmatically by using the following code:
if (linkedService.Properties.TypeProperties is AzureDataLakeStoreLinkedService ||
    linkedService.Properties.TypeProperties is AzureDataLakeAnalyticsLinkedService)
{
    // Start an OAuth authorization session for this linked service type.
    AuthorizationSessionGetResponse authorizationSession = this.Client.OAuth.Get(this.ResourceGroupName, this.DataFactoryName, linkedService.Properties.Type);

    // Prompt the user to sign in and capture the resulting authorization code.
    WindowsFormsWebAuthenticationDialog authenticationDialog = new WindowsFormsWebAuthenticationDialog(null);
    string authorization = authenticationDialog.AuthenticateAAD(authorizationSession.AuthorizationSession.Endpoint, new Uri("urn:ietf:wg:oauth:2.0:oob"));

    // Assign the session ID and authorization code to the linked service properties.
    AzureDataLakeAnalyticsLinkedService azureDataLakeAnalyticsProperties = linkedService.Properties.TypeProperties as AzureDataLakeAnalyticsLinkedService;
    if (azureDataLakeAnalyticsProperties != null)
    {
        azureDataLakeAnalyticsProperties.SessionId = authorizationSession.AuthorizationSession.SessionId;
        azureDataLakeAnalyticsProperties.Authorization = authorization;
    }
}
For details about the Data Factory classes used in the code, see the AzureDataLakeStoreLinkedService Class,
AzureDataLakeAnalyticsLinkedService Class, and AuthorizationSessionGetResponse Class topics. Add a
reference to version 2.9.10826.1824 of Microsoft.IdentityModel.Clients.ActiveDirectory.WindowsForms.dll for
the WindowsFormsWebAuthenticationDialog class used in the code.
Dataset properties
To specify a dataset to represent input data in a Data Lake Store, you set the type property of the dataset to
AzureDataLakeStore. Set the linkedServiceName property of the dataset to the name of the Data Lake
Store linked service. For a full list of JSON sections and properties available for defining datasets, see the
Creating datasets article. Sections of a dataset in JSON, such as structure, availability, and policy, are
similar for all dataset types (Azure SQL database, Azure blob, and Azure table, for example). The
typeProperties section is different for each type of dataset and provides information such as location and
format of the data in the data store.
The typeProperties section for a dataset of type AzureDataLakeStore contains the following properties:
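A minimal sketch of such a dataset, with placeholder folder path, file name, and format values, is shown below. The partitionedBy property can then build dynamic, time-based folder paths and file names, as the two snippets that follow show:
{
  "name": "AzureDataLakeStoreDataset",
  "properties": {
    "type": "AzureDataLakeStore",
    "linkedServiceName": "AzureDataLakeStoreLinkedService",
    "typeProperties": {
      "folderPath": "datalake/output/",
      "fileName": "data.csv",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}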
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
{ "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],
In the following example, the year, month, day, and time of SliceStart are extracted into separate variables
that are used by the folderPath and fileName properties:
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
For more details on time-series datasets, scheduling, and slices, see the Datasets in Azure Data Factory and
Data Factory scheduling and execution articles.
The following table shows the resulting behavior of the copy operation for different combinations of the recursive and copyBehavior values, given a source folder Folder1 that contains File1 and File2 plus a subfolder Subfolder1 that contains File3, File4, and File5.
RECURSIVE: true, COPYBEHAVIOR: preserveHierarchy
The target Folder1 is created with the same structure as the source:
Folder1
File1
File2
Subfolder1
File3
File4
File5
RECURSIVE: true, COPYBEHAVIOR: flattenHierarchy
The target Folder1 is created with the following structure:
Folder1
auto-generated name for File1
auto-generated name for File2
auto-generated name for File3
auto-generated name for File4
auto-generated name for File5
RECURSIVE: true, COPYBEHAVIOR: mergeFiles
The target Folder1 is created with the following structure:
Folder1
File1 + File2 + File3 + File4 + File5 contents are merged into one file with an auto-generated file name
RECURSIVE: false, COPYBEHAVIOR: preserveHierarchy
The target Folder1 is created with the following structure:
Folder1
File1
File2
Subfolder1 with File3, File4, and File5 is not picked up.
RECURSIVE: false, COPYBEHAVIOR: flattenHierarchy
The target Folder1 is created with the following structure:
Folder1
auto-generated name for File1
auto-generated name for File2
Subfolder1 with File3, File4, and File5 is not picked up.
RECURSIVE: false, COPYBEHAVIOR: mergeFiles
The target Folder1 is created with the following structure:
Folder1
File1 + File2 contents are merged into one file with an auto-generated file name
Subfolder1 with File3, File4, and File5 is not picked up.
JSON examples for copying data to and from Data Lake Store
The following examples provide sample JSON definitions. You can use these sample definitions to create a
pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. The examples show how to copy data
to and from Data Lake Store and Azure Blob storage. However, data can be copied directly from any of the
sources to any of the supported sinks. For more information, see the section "Supported data stores and
formats" in the Move data by using Copy Activity article.
Example: Copy data from Azure Blob Storage to Azure Data Lake Store
The example code in this section shows:
A linked service of type AzureStorage.
A linked service of type AzureDataLakeStore.
An input dataset of type AzureBlob.
An output dataset of type AzureDataLakeStore.
A pipeline with a copy activity that uses BlobSource and AzureDataLakeStoreSink.
The examples show how time-series data from Azure Blob Storage is copied to Data Lake Store every hour.
Azure Storage linked service
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": "<service principal key>",
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
}
}
}
NOTE
For configuration details, see the Linked service properties section.
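The pipeline that follows references datasets named AzureBlobInput and AzureDataLakeStoreOutput, whose full definitions are not shown here. A minimal sketch of the Data Lake Store output dataset, assuming an hourly slice and a placeholder folder path, looks like the following:
{
  "name": "AzureDataLakeStoreOutput",
  "properties": {
    "type": "AzureDataLakeStore",
    "linkedServiceName": "AzureDataLakeStoreLinkedService",
    "typeProperties": {
      "folderPath": "datalake/output/"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}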
Copy activity in a pipeline with a blob source and a Data Lake Store sink
In the following example, the pipeline contains a copy activity that is configured to use the input and output
datasets. The copy activity is scheduled to run every hour. In the pipeline JSON definition, the source type is
set to BlobSource , and the sink type is set to AzureDataLakeStoreSink .
{
"name":"SamplePipeline",
"properties":
{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline with copy activity",
"activities":
[
{
"name": "AzureBlobtoDataLake",
"description": "Copy Activity",
"type": "Copy",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureDataLakeStoreOutput"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "AzureDataLakeStoreSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
Example: Copy data from Azure Data Lake Store to an Azure blob
The example code in this section shows:
A linked service of type AzureDataLakeStore.
A linked service of type AzureStorage.
An input dataset of type AzureDataLakeStore.
An output dataset of type AzureBlob.
A pipeline with a copy activity that uses AzureDataLakeStoreSource and BlobSink.
The code copies time-series data from Data Lake Store to an Azure blob every hour.
Azure Data Lake Store linked service
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": "<service principal key>",
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
}
}
}
NOTE
For configuration details, see the Linked service properties section.
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
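The pipeline that follows references datasets named AzureDataLakeStoreInput and AzureBlobOutput, whose full definitions are not shown here. A minimal sketch of the Data Lake Store input dataset, assuming an hourly slice and a placeholder folder path (external: true marks data that is not produced by this data factory):
{
  "name": "AzureDataLakeStoreInput",
  "properties": {
    "type": "AzureDataLakeStore",
    "linkedServiceName": "AzureDataLakeStoreLinkedService",
    "typeProperties": {
      "folderPath": "datalake/input/"
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}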
A copy activity in a pipeline with an Azure Data Lake Store source and a blob sink
In the following example, the pipeline contains a copy activity that is configured to use the input and output
datasets. The copy activity is scheduled to run every hour. In the pipeline JSON definition, the source type is
set to AzureDataLakeStoreSource , and the sink type is set to BlobSink .
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline for copy activity",
"activities":[
{
"name": "AzureDakeLaketoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [
{
"name": "AzureDataLakeStoreInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
In the copy activity definition, you can also map columns from the source dataset to columns in the sink
dataset. For details, see Mapping dataset columns in Azure Data Factory.
Push data to an Azure Search index by using Azure Data Factory
This article describes how to use the Copy Activity in Azure Data Factory to push data from a supported source data store to an Azure Search index. Supported source data stores are listed in the Source column of the supported sources and sinks table. This article builds on the Data movement activities article, which presents a general overview of data movement with the Copy Activity and supported data store combinations.
Enabling connectivity
To allow the Data Factory service to connect to an on-premises data store, you install Data Management Gateway in your on-premises environment. You can install the gateway on the same machine that hosts the source data store, or on a separate machine to avoid competing for resources with the data store.
Data Management Gateway connects on-premises data sources to cloud services in a secure and managed way.
See Move data between on-premises and cloud article for details about Data Management Gateway.
Getting started
You can create a pipeline with a copy activity that pushes data from a source data store to Azure Search index
by using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data to Azure Search index, see JSON example: Copy data from on-premises SQL Server to Azure
Search index section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Azure Search Index:
Dataset properties
For a full list of sections and properties that are available for defining datasets, see the Creating datasets article.
Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. The
typeProperties section is different for each type of dataset. The typeProperties section for a dataset of the type
AzureSearchIndex has the following properties:
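For example, a minimal typeProperties section for an Azure Search index dataset, assuming an index named products (as in the sample later in this article):
"typeProperties": {
  "indexName": "products"
}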
WriteBehavior property
AzureSearchSink upserts when writing data. In other words, when writing a document, if the document key
already exists in the Azure Search index, Azure Search updates the existing document rather than throwing a
conflict exception.
The AzureSearchSink provides the following two upsert behaviors (by using AzureSearch SDK):
Merge: combine all the columns in the new document with the existing one. For columns with null value in
the new document, the value in the existing one is preserved.
Upload: The new document replaces the existing one. For columns not specified in the new document, the
value is set to null whether there is a non-null value in the existing document or not.
The default behavior is Merge.
WriteBatchSize Property
Azure Search service supports writing documents as a batch. A batch can contain 1 to 1,000 Actions. An action
handles one document to perform the upload/merge operation.
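As a minimal sketch, a sink that overrides the default write behavior and batch size (the values here are illustrative) could look like this:
"sink": {
  "type": "AzureSearchIndexSink",
  "writeBehavior": "Upload",
  "writeBatchSize": 1000
}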
Data type support
The following table specifies whether an Azure Search data type is supported or not.
AZURE SEARCH DATA TYPE SUPPORTED
String Y
Int32 Y
Int64 Y
Double Y
Boolean Y
DateTimeOffset Y
String Array N
GeographyPoint N
{
"Name": "SqlServerLinkedService",
"properties": {
"type": "OnPremisesSqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated
Security=False;User ID=<username>;Password=<password>;",
"gatewayName": "<gatewayname>"
}
}
}
{
"name": "SqlServerDataset",
"properties": {
"type": "SqlServerTable",
"linkedServiceName": "SqlServerLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
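The AzureSearchIndexDataset that follows references a linked service named AzureSearchLinkedService, which is not defined above. A minimal sketch, assuming an Azure Search linked service with placeholder url and key values:
{
  "name": "AzureSearchLinkedService",
  "properties": {
    "type": "AzureSearch",
    "typeProperties": {
      "url": "https://<servicename>.search.windows.net",
      "key": "<AdminKey>"
    }
  }
}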
{
"name": "AzureSearchIndexDataset",
"properties": {
"type": "AzureSearchIndex",
"linkedServiceName": "AzureSearchLinkedService",
"typeProperties" : {
"indexName": "products",
},
"availability": {
"frequency": "Minute",
"interval": 15
}
}
}
Copy activity in a pipeline with SQL source and Azure Search Index sink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to
run every hour. In the pipeline JSON definition, the source type is set to SqlSource and sink type is set to
AzureSearchIndexSink. The SQL query specified for the SqlReaderQuery property selects the data in the
past hour to copy.
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline for copy activity",
"activities":[
{
"name": "SqlServertoAzureSearchIndex",
"description": "copy activity",
"type": "Copy",
"inputs": [
{
"name": " SqlServerInput"
}
],
"outputs": [
{
"name": "AzureSearchIndexDataset"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-
dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "AzureSearchIndexSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
If you are copying data from a cloud data store into Azure Search, the executionLocation property is required. The following JSON snippet shows the change needed under the Copy Activity typeProperties as an example. See the Copy data between cloud data stores section for supported values and more details.
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "AzureSearchIndexSink"
},
"executionLocation": "West US"
}
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "AzureSearchIndexSink"
},
"executionLocation": "West US"
}
You can also map columns from source dataset to columns from sink dataset in the copy activity definition. For
details, see Mapping dataset columns in Azure Data Factory.
Next steps
See the following articles:
Copy Activity tutorial for step-by-step instructions for creating a pipeline with a Copy Activity.
Copy data to and from Azure SQL Database using
Azure Data Factory
6/27/2017 17 min to read Edit Online
This article explains how to use the Copy Activity in Azure Data Factory to move data to and from Azure SQL
Database. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.
Supported scenarios
You can copy data from Azure SQL Database to the following data stores:
You can copy data from the following data stores to Azure SQL Database:
CATEGORY DATA STORE
NoSQL Cassandra, MongoDB
File Amazon S3, File System, FTP, HDFS, SFTP
Getting started
You can create a pipeline with a copy activity that moves data to/from an Azure SQL Database by using
different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from
a source data store to a sink data store:
1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are
copying data from an Azure blob storage to an Azure SQL database, you create two linked services to link
your Azure storage account and Azure SQL database to your data factory. For linked service properties
that are specific to Azure SQL Database, see linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in
the last step, you create a dataset to specify the blob container and folder that contains the input data. And,
you create another dataset to specify the SQL table in the Azure SQL database that holds the data copied
from the blob storage. For dataset properties that are specific to Azure SQL Database, see the dataset properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the
example mentioned earlier, you use BlobSource as a source and SqlSink as a sink for the copy activity.
Similarly, if you are copying from Azure SQL Database to Azure Blob Storage, you use SqlSource and
BlobSink in the copy activity. For copy activity properties that are specific to Azure SQL Database, see copy
activity properties section. For details on how to use a data store as a source or a sink, click the link in the
previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that
are used to copy data to/from an Azure SQL Database, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Azure SQL Database:
IMPORTANT
Configure Azure SQL Database Firewall and the database server to allow Azure Services to access the server. Additionally, if you are copying data to Azure SQL Database from outside Azure, including from on-premises data sources with the Data Factory gateway, configure an appropriate IP address range for the machine that is sending data to Azure SQL Database.
Dataset properties
To specify a dataset to represent input or output data in an Azure SQL database, you set the type property of
the dataset to: AzureSqlTable. Set the linkedServiceName property of the dataset to the name of the Azure
SQL linked service.
For a full list of sections & properties available for defining datasets, see the Creating datasets article.
Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure
SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for the dataset of type AzureSqlTable has the
following properties:
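For example, a minimal typeProperties section, assuming a table named MyTable (as in the samples later in this article):
"typeProperties": {
  "tableName": "MyTable"
}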
NOTE
The Copy Activity takes only one input and produces only one output.
By contrast, the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.
If you are moving data from an Azure SQL database, you set the source type in the copy activity to
SqlSource. Similarly, if you are moving data to an Azure SQL database, you set the sink type in the copy
activity to SqlSink. This section provides a list of properties supported by SqlSource and SqlSink.
SqlSource
In copy activity, when the source is of type SqlSource, the following properties are available in
typeProperties section:
If the sqlReaderQuery is specified for the SqlSource, the Copy Activity runs this query against the Azure SQL
Database source to get the data. Alternatively, you can specify a stored procedure by specifying the
sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes
parameters).
If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the
structure section of the dataset JSON are used to build a query ( select column1, column2 from mytable ) to run
against the Azure SQL Database. If the dataset definition does not have the structure, all columns are selected
from the table.
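For example, a SqlSource that reads only the rows for the current slice window with sqlReaderQuery (the query mirrors the pipeline samples later in this article):
"source": {
  "type": "SqlSource",
  "sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
}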
NOTE
When you use sqlReaderStoredProcedureName, you still need to specify a value for the tableName property in
the dataset JSON. There are no validations performed against this table though.
SqlSource example
"source": {
"type": "SqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', SliceStart)", "type": "Int"}
}
}
SqlSink
SqlSink supports the following properties:
PROPERTY: writeBatchSize
DESCRIPTION: Inserts data into the SQL table when the buffer size reaches writeBatchSize.
ALLOWED VALUES: Integer (number of rows)
REQUIRED: No (default: 10000)
SqlSink example
"sink": {
"type": "SqlSink",
"writeBatchSize": 1000000,
"writeBatchTimeout": "00:05:00",
"sqlWriterStoredProcedureName": "CopyTestStoredProcedureWithParameters",
"sqlWriterTableType": "CopyTestTableType",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" },
"decimalData": { "value": "1", "type": "Decimal" }
}
}
See the Azure SQL Linked Service section for the list of properties supported by this linked service.
Azure Blob storage linked service:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
See the Azure Blob article for the list of properties supported by this linked service.
Azure SQL input dataset:
The sample assumes you have created a table MyTable in Azure SQL and it contains a column called
timestampcolumn for time series data.
Setting external: true informs the Azure Data Factory service that the dataset is external to the data factory
and is not produced by an activity in the data factory.
{
"name": "AzureSqlInput",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
See the Azure SQL dataset type properties section for the list of properties supported by this dataset type.
Azure Blob output dataset:
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is
dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year,
month, day, and hours parts of the start time.
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
],
"format": {
"type": "TextFormat",
"columnDelimiter": "\t",
"rowDelimiter": "\n"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
See the Azure Blob dataset type properties section for the list of properties supported by this dataset type.
A copy activity in a pipeline with SQL source and Blob sink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled
to run every hour. In the pipeline JSON definition, the source type is set to SqlSource and sink type is set to
BlobSink. The SQL query specified for the SqlReaderQuery property selects the data in the past hour to
copy.
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline for copy activity",
"activities":[
{
"name": "AzureSQLtoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [
{
"name": "AzureSQLInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-
MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
In the example, sqlReaderQuery is specified for the SqlSource. The Copy Activity runs this query against the
Azure SQL Database source to get the data. Alternatively, you can specify a stored procedure by specifying
the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes
parameters).
If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the
structure section of the dataset JSON are used to build a query to run against the Azure SQL Database. For
example: select column1, column2 from mytable . If the dataset definition does not have the structure, all
columns are selected from the table.
See the Sql Source section and BlobSink for the list of properties supported by SqlSource and BlobSink.
Example: Copy data from Azure Blob to Azure SQL Database
The sample defines the following Data Factory entities:
1. A linked service of type AzureSqlDatabase.
2. A linked service of type AzureStorage.
3. An input dataset of type AzureBlob.
4. An output dataset of type AzureSqlTable.
5. A pipeline with Copy activity that uses BlobSource and SqlSink.
The sample copies time-series data (hourly, daily, etc.) from Azure blob to a table in Azure SQL database
every hour. The JSON properties used in these samples are described in sections following the samples.
Azure SQL linked service:
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection
Timeout=30"
}
}
}
See the Azure SQL Linked Service section for the list of properties supported by this linked service.
Azure Blob storage linked service:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
See the Azure Blob article for the list of properties supported by this linked service.
Azure Blob input dataset:
Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for
the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder
path uses year, month, and day part of the start time and file name uses the hour part of the start time.
The external: true setting informs the Data Factory service that this table is external to the data factory and is not produced by an activity in the data factory.
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/",
"fileName": "{Hour}.csv",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
],
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
See the Azure Blob dataset type properties section for the list of properties supported by this dataset type.
Azure SQL Database output dataset:
The sample copies data to a table named MyOutputTable in the Azure SQL database. Create the table in Azure SQL with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour.
{
"name": "AzureSqlOutput",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
See the Azure SQL dataset type properties section for the list of properties supported by this dataset type.
A copy activity in a pipeline with Blob source and SQL sink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled
to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and sink type is set
to SqlSink.
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline with copy activity",
"activities":[
{
"name": "AzureBlobtoSQL",
"description": "Copy Activity",
"type": "Copy",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureSqlOutput"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
"blobColumnSeparators": ","
},
"sink": {
"type": "SqlSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
See the Sql Sink section and BlobSource for the list of properties supported by SqlSink and BlobSource.
Destination table:
create table dbo.TargetTbl
(
identifier int identity(1,1),
name varchar(100),
age int
)
{
"name": "SampleSource",
"properties": {
"type": " SqlServerTable",
"linkedServiceName": "TestIdentitySQL",
"typeProperties": {
"tableName": "SourceTbl"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
}
{
"name": "SampleTarget",
"properties": {
"structure": [
{ "name": "name" },
{ "name": "age" }
],
"type": "AzureSqlTable",
"linkedServiceName": "TestIdentitySQLSource",
"typeProperties": {
"tableName": "TargetTbl"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}
Notice that your source and target tables have different schemas (the target has an additional identity column). In this scenario, you need to specify the structure property in the target dataset definition, which doesn't include the identity column.
The following mappings are used between SQL Server database engine types and .NET Framework types:
SQL SERVER DATABASE ENGINE TYPE .NET FRAMEWORK TYPE
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
sql_variant Object *
time TimeSpan
timestamp Byte[]
tinyint Byte
uniqueidentifier Guid
varbinary Byte[]
xml Xml
Repeatable copy
When copying data to a SQL Server or Azure SQL database, the copy activity appends data to the sink table by default. To perform an UPSERT instead, see the Repeatable write to SqlSink article.
When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In
Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a
slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same
data is read no matter how many times a slice is run. See Repeatable read from relational sources.
Copy data to and from Azure SQL Data Warehouse using Azure Data Factory
This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure SQL Data Warehouse. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.
TIP
To achieve best performance, use PolyBase to load data into Azure SQL Data Warehouse. The Use PolyBase to load
data into Azure SQL Data Warehouse section has details. For a walkthrough with a use case, see Load 1 TB into Azure
SQL Data Warehouse under 15 minutes with Azure Data Factory.
Supported scenarios
You can copy data from Azure SQL Data Warehouse to the following data stores:
You can copy data from the following data stores to Azure SQL Data Warehouse:
CATEGORY DATA STORE
NoSQL Cassandra, MongoDB
File Amazon S3, File System, FTP, HDFS, SFTP
TIP
When copying data from SQL Server or Azure SQL Database to Azure SQL Data Warehouse, if the table does not exist
in the destination store, Data Factory can automatically create the table in SQL Data Warehouse by using the schema
of the table in the source data store. See Auto table creation for details.
Getting started
You can create a pipeline with a copy activity that moves data to/from an Azure SQL Data Warehouse by
using different tools/APIs.
The easiest way to create a pipeline that copies data to/from Azure SQL Data Warehouse is to use the Copy
data wizard. See Tutorial: Load data into SQL Data Warehouse with Data Factory for a quick walkthrough on
creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from
a source data store to a sink data store:
1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are
copying data from an Azure blob storage to an Azure SQL data warehouse, you create two linked services
to link your Azure storage account and Azure SQL data warehouse to your data factory. For linked service
properties that are specific to Azure SQL Data Warehouse, see linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in
the last step, you create a dataset to specify the blob container and folder that contains the input data. And,
you create another dataset to specify the table in the Azure SQL data warehouse that holds the data copied
from the blob storage. For dataset properties that are specific to Azure SQL Data Warehouse, see dataset
properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the
example mentioned earlier, you use BlobSource as a source and SqlDWSink as a sink for the copy activity.
Similarly, if you are copying from Azure SQL Data Warehouse to Azure Blob Storage, you use
SqlDWSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure SQL
Data Warehouse, see copy activity properties section. For details on how to use a data store as a source or
a sink, click the link in the previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that
are used to copy data to/from an Azure SQL Data Warehouse, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Azure SQL Data Warehouse:
IMPORTANT
Configure Azure SQL Database Firewall and the database server to allow Azure Services to access the server. Additionally, if you are copying data to Azure SQL Data Warehouse from outside Azure, including from on-premises data sources with the Data Factory gateway, configure an appropriate IP address range for the machine that is sending data to Azure SQL Data Warehouse.
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for the dataset of type AzureSqlDWTable has the
following properties:
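For example, a minimal typeProperties section, assuming a table named MyTable (as in the samples later in this article):
"typeProperties": {
  "tableName": "MyTable"
}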
NOTE
The Copy Activity takes only one input and produces only one output.
By contrast, the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.
SqlDWSource
When source is of type SqlDWSource, the following properties are available in typeProperties section:
If the sqlReaderQuery is specified for the SqlDWSource, the Copy Activity runs this query against the Azure
SQL Data Warehouse source to get the data.
Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and
storedProcedureParameters (if the stored procedure takes parameters).
If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the
structure section of the dataset JSON are used to build a query to run against the Azure SQL Data Warehouse.
Example: select column1, column2 from mytable . If the dataset definition does not have the structure, all
columns are selected from the table.
SqlDWSource example
"source": {
"type": "SqlDWSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', SliceStart)", "type": "Int"}
}
}
SqlDWSink
SqlDWSink supports the following properties:
PROPERTY: writeBatchSize
DESCRIPTION: Inserts data into the SQL table when the buffer size reaches writeBatchSize.
ALLOWED VALUES: Integer (number of rows)
REQUIRED: No (default: 10000)
SqlDWSink example
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings":
{
"rejectType": "percentage",
"rejectValue": 10.0,
"rejectSampleValue": 100,
"useTypeDefault": true
}
}
TIP
To copy data from Data Lake Store to SQL Data Warehouse efficiently, see the blog post Azure Data Factory makes it even easier and convenient to uncover insights from data when using Data Lake Store with SQL Data Warehouse.
Azure Data Factory checks the settings and automatically falls back to the BULKINSERT mechanism for the data movement if the following requirements are not met:
1. Source linked service is of type: AzureStorage or AzureDataLakeStore with service principal
authentication.
2. The input dataset is of type: AzureBlob or AzureDataLakeStore, and the format type under type
properties is OrcFormat, or TextFormat with the following configurations:
a. rowDelimiter must be \n.
b. nullValue is set to empty string (""), or treatEmptyAsNull is set to true.
c. encodingName is set to utf-8, which is default value.
d. escapeChar , quoteChar , firstRowAsHeader , and skipLineCount are not specified.
e. compression can be no compression, GZip, or Deflate.
"typeProperties": {
"folderPath": "<blobpath>",
"format": {
"type": "TextFormat",
"columnDelimiter": "<any delimiter>",
"rowDelimiter": "\n",
"nullValue": "",
"encodingName": "utf-8"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
},
NOTE
When copying data from an on-premises data store into Azure SQL Data Warehouse by using PolyBase and staging, if your Data Management Gateway version is below 2.4, JRE (Java Runtime Environment) is required on the gateway machine, which is used to transform your source data into the proper format. We suggest that you upgrade your gateway to the latest version to avoid this dependency.
To use this feature, create an Azure Storage linked service that refers to the Azure Storage Account that has
the interim blob storage, then specify the enableStaging and stagingSettings properties for the Copy
Activity as shown in the following code:
"activities":[
{
"name": "Sample copy activity from SQL Server to SQL Data Warehouse via PolyBase",
"type": "Copy",
"inputs": [{ "name": "OnpremisesSQLServerInput" }],
"outputs": [{ "name": "AzureSQLDWOutput" }],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlDwSink",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "MyStagingBlob"
}
}
}
]
If you see the following error, it could be an issue with the value you specified for the tableName property:
All columns of the table must be specified in the INSERT BULK statement.
See the table for the correct way to specify values for the tableName JSON property.
A NULL value is a special form of a default value. If the column is nullable, the input data in the blob for that column can be empty (but it cannot be missing from the input dataset). PolyBase inserts NULL for those values in Azure SQL Data Warehouse.
The following mappings are used from source SQL Database column types to destination SQL Data Warehouse column types:
SOURCE SQL DATABASE COLUMN TYPE DESTINATION SQL DW COLUMN TYPE (SIZE LIMITATION)
Int Int
BigInt BigInt
SmallInt SmallInt
TinyInt TinyInt
Bit Bit
Decimal Decimal
Numeric Decimal
Float Float
Money Money
Real Real
SmallMoney SmallMoney
Binary Binary
Date Date
DateTime DateTime
DateTime2 DateTime2
Time Time
DateTimeOffset DateTimeOffset
SmallDateTime SmallDateTime
UniqueIdentifier UniqueIdentifier
Char Char
NChar NChar
Suppose you found errors in the source file and updated the quantity of Down Tube from 2 to 4 in the source file. If you re-run the data slice for that period, you'll find two new records appended to the Azure SQL/SQL Server database. The following assumes that none of the columns in the table has a primary key constraint.
To avoid this, you need to specify UPSERT semantics by using one of the following two mechanisms.
NOTE
A slice can be re-run automatically in Azure Data Factory as per the retry policy specified.
Mechanism 1
You can leverage sqlWriterCleanupScript property to first perform cleanup action when a slice is run.
"sink":
{
"type": "SqlSink",
"sqlWriterCleanupScript": "$$Text.Format('DELETE FROM table WHERE ModifiedDate >= \\'{0:yyyy-MM-dd
HH:mm}\\' AND ModifiedDate < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
}
The cleanup script would be executed first during copy for a given slice which would delete the data from the
SQL Table corresponding to that slice. The activity will subsequently insert the data into the SQL Table.
If the slice is now re-run, you will find that the quantity is updated as desired.
Suppose the Flat Washer record is removed from the original CSV. Re-running the slice would then produce the following result:
Nothing new had to be done. The copy activity ran the cleanup script to delete the corresponding data for that slice. Then it read the input from the CSV (which then contained only one record) and inserted it into the table.
Mechanism 2
You can specify the sliceIdentifierColumnName property for the sink. Azure Data Factory uses the specified column in the destination table to identify the rows produced by each slice so that they can be cleaned up on a rerun.
IMPORTANT
sliceIdentifierColumnName is not supported for Azure SQL Data Warehouse at this time.
"sink":
{
"type": "SqlSink",
"sliceIdentifierColumnName": "ColumnForADFuseOnly"
}
Azure Data Factory will populate this column as per its need to ensure the source and destination stay
synchronized. The values of this column should not be used outside of this context by the user.
Similar to mechanism 1, Copy Activity will automatically first clean up the data for the given slice from the
destination SQL Table and then run the copy activity normally to insert the data from source to destination
for that slice.
The following mappings are used between SQL Server database engine types and .NET Framework types:
SQL SERVER DATABASE ENGINE TYPE .NET FRAMEWORK TYPE
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
sql_variant Object *
time TimeSpan
timestamp Byte[]
tinyint Byte
uniqueidentifier Guid
varbinary Byte[]
xml Xml
You can also map columns from source dataset to columns from sink dataset in the copy activity definition.
For details, see Mapping dataset columns in Azure Data Factory.
JSON examples for copying data to and from SQL Data Warehouse
The following examples provide sample JSON definitions that you can use to create a pipeline by using Azure
portal or Visual Studio or Azure PowerShell. They show how to copy data to and from Azure SQL Data
Warehouse and Azure Blob Storage. However, data can be copied directly from any of sources to any of the
sinks stated here using the Copy Activity in Azure Data Factory.
Example: Copy data from Azure SQL Data Warehouse to Azure Blob
The sample defines the following Data Factory entities:
1. A linked service of type AzureSqlDW.
2. A linked service of type AzureStorage.
3. An input dataset of type AzureSqlDWTable.
4. An output dataset of type AzureBlob.
5. A pipeline with Copy Activity that uses SqlDWSource and BlobSink.
The sample copies time-series (hourly, daily, etc.) data from a table in Azure SQL Data Warehouse database to
a blob every hour. The JSON properties used in these samples are described in sections following the
samples.
Azure SQL Data Warehouse linked service:
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection
Timeout=30"
}
}
}
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"name": "AzureSqlDWInput",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
NOTE
In the example, sqlReaderQuery is specified for the SqlDWSource. The Copy Activity runs this query against the Azure
SQL Data Warehouse source to get the data.
Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and
storedProcedureParameters (if the stored procedure takes parameters).
If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure
section of the dataset JSON are used to build a query (select column1, column2 from mytable) to run against the
Azure SQL Data Warehouse. If the dataset definition does not have the structure, all columns are selected from the
table.
Example: Copy data from Azure Blob to Azure SQL Data Warehouse
The sample defines the following Data Factory entities:
1. A linked service of type AzureSqlDW.
2. A linked service of type AzureStorage.
3. An input dataset of type AzureBlob.
4. An output dataset of type AzureSqlDWTable.
5. A pipeline with Copy activity that uses BlobSource and SqlDWSink.
The sample copies time-series data (hourly, daily, etc.) from an Azure blob to a table in an Azure SQL Data
Warehouse database every hour. The JSON properties used in these samples are described in sections
following the samples.
Azure SQL Data Warehouse linked service:
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection
Timeout=30"
}
}
}
Azure Blob storage linked service:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
Azure SQL Data Warehouse output dataset:
{
"name": "AzureSqlDWOutput",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
For a walkthrough, see the Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data
Factory and Load data with Azure Data Factory articles in the Azure SQL Data Warehouse documentation.
Move data to and from Azure Table using Azure Data Factory
This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure Table
Storage. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.
You can copy data from any supported source data store to Azure Table Storage or from Azure Table Storage
to any supported sink data store. For a list of data stores supported as sources or sinks by the copy activity,
see the Supported data stores table.
Getting started
You can create a pipeline with a copy activity that moves data to/from an Azure Table Storage by using
different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from
a source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are
used to copy data to/from an Azure Table Storage, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Azure Table Storage:
See the following article for steps to view or copy the account key for an Azure Storage account: View, copy, and
regenerate storage access keys.
Example:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
IMPORTANT
Azure Data Factory now supports only Service SAS, not Account SAS. See Types of Shared Access Signatures for
details about these two types and how to construct them. Note that the SAS URL generated by the Azure portal or
Storage Explorer is an Account SAS, which is not supported.
The Azure Storage SAS linked service allows you to link an Azure Storage Account to an Azure data factory by
using a Shared Access Signature (SAS). It provides the data factory with restricted/time-bound access to
all/specific resources (blob/container) in the storage. The following table provides description for JSON
elements specific to Azure Storage SAS linked service.
{
"name": "StorageSasLinkedService",
"properties": {
"type": "AzureStorageSas",
"typeProperties": {
"sasUri": "<Specify SAS URI of the Azure Storage resource>"
}
}
}
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for the dataset of type AzureTable has the following
properties.
tableName - Name of the table in the Azure Table Database instance that the linked service refers to. Required: Yes. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfy the query are copied to the destination.
azureTableSourceQuery - Custom query used to read data; an Azure Table query string. See examples in the next section. Required: No. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfy the query are copied to the destination.
azureTableSourceQuery examples
If the Azure Table column is of string type:
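A query that filters on a string PartitionKey derived from the slice start time might look like the following sketch; the partition-key value format shown is an assumption, not taken from the original sample:

"azureTableSourceQuery": "$$Text.Format('PartitionKey ge \\'{0:yyyyMMddHH00_0000000}\\' and PartitionKey le \\'{0:yyyyMMddHH00_9999999}\\'', SliceStart)"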
writeBatchSize - Inserts data into the Azure table when writeBatchSize or writeBatchTimeout is hit. Allowed values: integer (number of rows). Required: No (default: 10000).
azureTablePartitionKeyName
You must map a source column to a destination column by using the translator JSON property before you can use the
destination column as the azureTablePartitionKeyName.
In the following example, source column DivisionID is mapped to the destination column: DivisionID.
"translator": {
"type": "TabularTranslator",
"columnMappings": "DivisionID: DivisionID, FirstName: FirstName, LastName: LastName"
}
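Putting these pieces together, a copy activity sink and translator that write to Azure Table with DivisionID as the partition key might look like the following sketch (the writeBatchSize value shown is simply the documented default):

"sink": {
    "type": "AzureTableSink",
    "azureTablePartitionKeyName": "DivisionID",
    "writeBatchSize": 10000
},
"translator": {
    "type": "TabularTranslator",
    "columnMappings": "DivisionID: DivisionID, FirstName: FirstName, LastName: LastName"
}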
JSON examples
The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure
portal, Visual Studio, or Azure PowerShell. They show how to copy data to and from Azure Table Storage
and Azure Blob storage. However, data can be copied directly from any of the supported sources to any of the
supported sinks. For more information, see the section "Supported data stores and formats" in Move data by
using Copy Activity.
Example: Copy data from Azure Table to Azure Blob
Azure Storage linked service:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and
AzureStorageSas. For the first one, you specify the connection string that includes the account key; for
the latter, you specify the Shared Access Signature (SAS) URI. See the Linked services section for details.
Azure Table input dataset:
The sample assumes you have created a table MyTable in Azure Table.
Setting external: true informs the Data Factory service that the dataset is external to the data factory and is
not produced by an activity in the data factory.
{
"name": "AzureTableInput",
"properties": {
"type": "AzureTable",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
Example: Copy data from Azure Blob to Azure Table
Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and
AzureStorageSas. For the first one, you specify the connection string that includes the account key; for
the latter, you specify the Shared Access Signature (SAS) URI. See the Linked services section for details.
Azure Blob input dataset:
Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for
the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path
uses year, month, and day part of the start time and file name uses the hour part of the start time. external:
true setting informs the Data Factory service that the dataset is external to the data factory and is not
produced by an activity in the data factory.
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
],
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
Azure Table output dataset:
{
"name": "AzureTableOutput",
"properties": {
"type": "AzureTable",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Given the type mapping from Azure Table OData type to .NET type, you would define the table in Azure Table
with the following schema.
Azure Table schema:
userid Edm.Int64
name Edm.String
lastlogindate Edm.DateTime
Next, define the Azure Table dataset as follows. You do not need to specify structure section with the type
information since the type information is already specified in the underlying data store.
{
"name": "AzureTableOutput",
"properties": {
"type": "AzureTable",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
In this case, Data Factory automatically performs the type conversions, including converting the Datetime field by
using the custom datetime format with the "fr-fr" culture, when moving data from Blob to Azure Table.
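The blob-side dataset drives this conversion through its structure section. A sketch of what that structure might look like, assuming the column names from the Azure Table schema above and an illustrative datetime format, is:

"structure": [
    { "name": "userid", "type": "Int64" },
    { "name": "name", "type": "String" },
    { "name": "lastlogindate", "type": "Datetime", "format": "ddd-MM-YYYY", "culture": "fr-fr" }
]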
NOTE
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data
Factory.
Move data from Cassandra using Azure Data Factory
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises
Cassandra database. It builds on the Data Movement Activities article, which presents a general overview of
data movement with the copy activity.
You can copy data from an on-premises Cassandra data store to any supported sink data store. For a list of
data stores supported as sinks by the copy activity, see the Supported data stores table. Data factory currently
supports only moving data from a Cassandra data store to other data stores, but not for moving data from
other data stores to a Cassandra data store.
Supported versions
The Cassandra connector supports the following versions of Cassandra: 2.X.
Prerequisites
For the Azure Data Factory service to be able to connect to your on-premises Cassandra database, you must
install a Data Management Gateway on the same machine that hosts the database or on a separate machine to
avoid competing for resources with the database. Data Management Gateway is a component that connects
on-premises data sources to cloud services in a secure and managed way. See Data Management Gateway
article for details about Data Management Gateway. See the Move data from on-premises to cloud article for step-
by-step instructions on setting up the gateway and a data pipeline to move data.
You must use the gateway to connect to a Cassandra database even if the database is hosted in the cloud, for
example, on an Azure IaaS VM. You can have the gateway on the same VM that hosts the database or on a
separate VM, as long as the gateway can connect to the database.
When you install the gateway, it automatically installs a Microsoft Cassandra ODBC driver used to connect to
Cassandra database. Therefore, you don't need to manually install any driver on the gateway machine when
copying data from the Cassandra database.
NOTE
See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.
Getting started
You can create a pipeline with a copy activity that moves data from an on-premises Cassandra data store by
using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an on-premises Cassandra data store, see JSON example: Copy data from Cassandra to
Azure Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to a Cassandra data store:
port - The TCP port that the Cassandra server uses to listen for client connections. Required: No (default value: 9042).
username - Specify the user name for the user account. Required: Yes, if authenticationType is set to Basic.
password - Specify the password for the user account. Required: Yes, if authenticationType is set to Basic.
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for a dataset of type CassandraTable has the following
properties:
keyspace - Name of the keyspace or schema in the Cassandra database. Required: Yes (if query for CassandraSource is not defined).
tableName - Name of the table in the Cassandra database. Required: Yes (if query for CassandraSource is not defined).
The copy activity source of type CassandraSource supports the following properties:
query - Custom query used to read data; a SQL-92 query or CQL query. See CQL reference. When using a SQL query, specify keyspace name.table name to represent the table you want to query. Required: No (if tableName and keyspace on the dataset are defined).
consistencyLevel - The consistency level specifies how many replicas must respond to a read request before returning data to the client application. Cassandra checks the specified number of replicas for data to satisfy the read request. Allowed values: ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, LOCAL_ONE. See Configuring data consistency for details. Required: No. Default value is ONE.
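A copy activity source that uses these properties might look like the following sketch (the keyspace, table, and consistency values are placeholders):

"source": {
    "type": "CassandraSource",
    "query": "select * from mykeyspace.mytable",
    "consistencyLevel": "ONE"
}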
IMPORTANT
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See
moving data between on-premises locations and cloud article for step-by-step instructions.
Cassandra linked service:
{
"name": "CassandraLinkedService",
"properties":
{
"type": "OnPremisesCassandra",
"typeProperties":
{
"authenticationType": "Basic",
"host": "mycassandraserver",
"port": 9042,
"username": "user",
"password": "password",
"gatewayName": "mygateway"
}
}
}
Azure Storage linked service:
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
Setting external to true informs the Data Factory service that the dataset is external to the data factory and is
not produced by an activity in the data factory.
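Cassandra input dataset: the following is a minimal sketch, assuming placeholder keyspace and table names:

{
    "name": "CassandraInput",
    "properties": {
        "type": "CassandraTable",
        "linkedServiceName": "CassandraLinkedService",
        "typeProperties": {
            "keyspace": "mykeyspace",
            "tableName": "mytable"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}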
Azure Blob output dataset:
Data is written to a new blob every hour (frequency: hour, interval: 1).
{
"name": "AzureBlobOutput",
"properties":
{
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties":
{
"folderPath": "adfgetstarted/fromcassandra"
},
"availability":
{
"frequency": "Hour",
"interval": 1
}
}
}
A copy activity in a pipeline with Cassandra source and Blob sink:
{
    "name": "SamplePipeline",
    "properties": {
        "start": "2016-06-01T18:00:00",
        "end": "2016-06-01T19:00:00",
        "description": "pipeline with copy activity",
        "activities": [
            {
                "name": "CassandraToAzureBlob",
                "description": "Copy from Cassandra to an Azure blob",
                "type": "Copy",
                "inputs": [ { "name": "CassandraInput" } ],
                "outputs": [ { "name": "AzureBlobOutput" } ],
                "typeProperties": {
                    "source": {
                        "type": "CassandraSource",
                        "query": "select * from mykeyspace.mytable"
                    },
                    "sink": {
                        "type": "BlobSink"
                    }
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}
CASSANDRA TYPE    .NET BASED TYPE
ASCII             String
BIGINT            Int64
BLOB              Byte[]
BOOLEAN           Boolean
DECIMAL           Decimal
DOUBLE            Double
FLOAT             Single
INET              String
INT               Int32
TEXT              String
TIMESTAMP         DateTime
TIMEUUID          Guid
UUID              Guid
VARCHAR           String
VARINT            Decimal
NOTE
For collection types (map, set, list, etc.), refer to the Work with Cassandra collection types using virtual table section.
User-defined types are not supported.
The length of Binary and String columns cannot be greater than 4000.
1 "sample value 1" ["1", "2", "3"] {"S1": "a", "S2": "b"} {"A", "B", "C"}
3 "sample value 3" ["100", "101", "102", {"S1": "t"} {"A", "E"}
"105"]
The driver would generate multiple virtual tables to represent this single table. The foreign key columns in the
virtual tables reference the primary key columns in the real table, and indicate which real table row the virtual
table row corresponds to.
The first virtual table is the base table, named ExampleTable, shown in the following table. The base table
contains the same data as the original database table except for the collections, which are omitted from this
table and expanded in other virtual tables.

PK_INT    VALUE
1         "sample value 1"
3         "sample value 3"
The following tables show the virtual tables that renormalize the data from the List, Map, and StringSet
columns. The columns with names that end with _index or _key indicate the position of the data within the
original list or map. The columns with names that end with _value contain the expanded data from the
collection.
Table ExampleTable_vt_List:
PK_INT    LIST_INDEX    LIST_VALUE
1         0             1
1         1             2
1         2             3
3         0             100
3         1             101
3         2             102
3         3             103
Table ExampleTable_vt_Map:
PK_INT    MAP_KEY    MAP_VALUE
1         S1         A
1         S2         b
3         S1         t
Table ExampleTable_vt_StringSet:
PK_INT    STRINGSET_VALUE
1         A
1         B
1         C
3         A
3         E
Move data from DB2 by using Azure Data Factory Copy Activity
This article describes how you can use Copy Activity in Azure Data Factory to copy data from an on-premises
DB2 database to a data store. You can copy data to any store that is listed as a supported sink in the Data
Factory data movement activities article. This topic builds on the Data Factory article, which presents an
overview of data movement by using Copy Activity and lists the supported data store combinations.
Data Factory currently supports only moving data from a DB2 database to a supported sink data store. Moving
data from other data stores to a DB2 database is not supported.
Prerequisites
Data Factory supports connecting to an on-premises DB2 database by using the data management gateway.
For step-by-step instructions to set up the gateway and a data pipeline to move your data, see the Move data from
on-premises to cloud article.
A gateway is required even if the DB2 database is hosted on an Azure IaaS VM. You can install the gateway on the
same IaaS VM as the data store, or on a different VM, as long as the gateway can connect to the database.
The data management gateway provides a built-in DB2 driver, so you don't need to manually install a driver to
copy data from DB2.
NOTE
For tips on troubleshooting connection and gateway issues, see the Troubleshoot gateway issues article.
Supported versions
The Data Factory DB2 connector supports the following IBM DB2 platforms and versions with Distributed
Relational Database Architecture (DRDA) SQL Access Manager versions 9, 10, and 11:
IBM DB2 for z/OS version 11.1
IBM DB2 for z/OS version 10.1
IBM DB2 for i (AS400) version 7.2
IBM DB2 for i (AS400) version 7.1
IBM DB2 for Linux, UNIX, and Windows (LUW) version 11
IBM DB2 for LUW version 10.5
IBM DB2 for LUW version 10.1
TIP
If you receive the error message "The package corresponding to an SQL statement execution request was not found.
SQLSTATE=51002 SQLCODE=-805," the reason is a necessary package is not created for the normal user on the OS. To
resolve this issue, follow these instructions for your DB2 server type:
DB2 for i (AS400): Let a power user create the collection for the normal user before running Copy Activity. To create
the collection, use the command: create collection <username>
DB2 for z/OS or LUW: Use a high privilege account--a power user or admin that has package authorities and BIND,
BINDADD, GRANT EXECUTE TO PUBLIC permissions--to run the copy once. The necessary package is automatically
created during the copy. Afterward, you can switch back to the normal user for your subsequent copy runs.
Getting started
You can create a pipeline with a copy activity to move data from an on-premises DB2 data store by using
different tools and APIs:
The easiest way to create a pipeline is to use the Azure Data Factory Copy Wizard. For a quick walkthrough
on creating a pipeline by using the Copy Wizard, see the Tutorial: Create a pipeline by using the Copy
Wizard.
You can also use tools to create a pipeline, including the Azure portal, Visual Studio, Azure PowerShell, an
Azure Resource Manager template, the .NET API, and the REST API. For step-by-step instructions to create a
pipeline with a copy activity, see the Copy Activity tutorial.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the Copy Wizard, JSON definitions for the Data Factory linked services, datasets, and pipeline
entities are automatically created for you. When you use tools or APIs (except the .NET API), you define the Data
Factory entities by using the JSON format. The JSON example: Copy data from DB2 to Azure Blob storage
shows the JSON definitions for the Data Factory entities that are used to copy data from an on-premises DB2
data store.
The following sections provide details about the JSON properties that are used to define the Data Factory
entities that are specific to a DB2 data store.
Dataset properties
For a list of the sections and properties that are available for defining datasets, see the Creating datasets article.
Sections, such as structure, availability, and the policy for a dataset JSON, are similar for all dataset types
(Azure SQL, Azure Blob storage, Azure Table storage, and so on).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for a dataset of type RelationalTable, which includes
the DB2 dataset, has the following property:
tableName - The name of the table in the DB2 database instance that the linked service refers to. This property is case-sensitive. Required: No (if the query property of a copy activity of type RelationalSource is specified).
query - Use the custom query to read the data; a SQL query string. For example: "query": "select * from "MySchema"."MyTable"". Required: No (if the tableName property of a dataset is specified).
NOTE
Schema and table names are case-sensitive. In the query statement, enclose them in "" (double quotes), for example:
select * from "MySchema"."MyTable".
DB2 linked service:
{
"name": "OnPremDb2LinkedService",
"properties": {
"type": "OnPremisesDb2",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"schema": "<schema>",
"authenticationType": "<authentication type>",
"username": "<username>",
"password": "<password>",
"gatewayName": "<gatewayName>"
}
}
}
DB2 dataset:
{
"name": "Db2DataSet",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "OnPremDb2LinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
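The pipeline for this sample would use a RelationalSource with the query described above; a minimal sketch of the copy activity source and sink, assuming placeholder schema and table names, might be:

"source": {
    "type": "RelationalSource",
    "query": "select * from \"MySchema\".\"MyTable\""
},
"sink": {
    "type": "BlobSink"
}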
DB2 DATABASE TYPE    .NET FRAMEWORK TYPE
SmallInt             Int16
Integer              Int32
BigInt               Int64
Real                 Single
Double               Double
Float                Double
Decimal              Decimal
DecimalFloat         Decimal
Numeric              Decimal
Date                 DateTime
Time                 TimeSpan
Timestamp            DateTime
Xml                  Byte[]
Char                 String
VarChar              String
LongVarChar          String
DB2DynArray          String
Binary               Byte[]
VarBinary            Byte[]
LongVarBinary        Byte[]
Graphic              String
VarGraphic           String
LongVarGraphic       String
Clob                 String
Blob                 Byte[]
DbClob               String
Copy data to and from an on-premises file system by using Azure Data Factory
This article explains how to use the Copy Activity in Azure Data Factory to copy data to/from an on-premises
file system. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.
Supported scenarios
You can copy data from an on-premises file system to the following data stores:
You can copy data from the following data stores to an on-premises file system:
CATEGORY    DATA STORE
NoSQL       Cassandra
            MongoDB
File        Amazon S3
            File System
            FTP
            HDFS
            SFTP
NOTE
Copy Activity does not delete the source file after it is successfully copied to the destination. If you need to delete the
source file after a successful copy, create a custom activity to delete the file and use the activity in the pipeline.
Enabling connectivity
Data Factory supports connecting to and from an on-premises file system via Data Management Gateway.
You must install the Data Management Gateway in your on-premises environment for the Data Factory
service to connect to any supported on-premises data store including file system. To learn about Data
Management Gateway and for step-by-step instructions on setting up the gateway, see Move data between
on-premises sources and the cloud with Data Management Gateway. Apart from Data Management Gateway,
no other binary files need to be installed to communicate to and from an on-premises file system. You must
install and use the Data Management Gateway even if the file system is in Azure IaaS VM. For detailed
information about the gateway, see Data Management Gateway.
To use a Linux file share, install Samba on your Linux server, and install Data Management Gateway on a
Windows server. Installing Data Management Gateway on a Linux server is not supported.
Getting started
You can create a pipeline with a copy activity that moves data to/from a file system by using different
tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from
a source data store to a sink data store:
1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are
copying data from an Azure blob storage to an on-premises file system, you create two linked services to
link your on-premises file system and Azure storage account to your data factory. For linked service
properties that are specific to an on-premises file system, see linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in
the last step, you create a dataset to specify the blob container and folder that contains the input data. And,
you create another dataset to specify the folder and file name (optional) in your file system. For dataset
properties that are specific to on-premises file system, see dataset properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the
example mentioned earlier, you use BlobSource as a source and FileSystemSink as a sink for the copy
activity. Similarly, if you are copying from on-premises file system to Azure Blob Storage, you use
FileSystemSource and BlobSink in the copy activity. For copy activity properties that are specific to on-
premises file system, see copy activity properties section. For details on how to use a data store as a
source or a sink, click the link in the previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that
are used to copy data to/from a file system, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to file system:
userid - Specify the ID of the user who has access to the server. Required: No (if you choose encryptedCredential).
password - Specify the password for the user (userid). Required: No (if you choose encryptedCredential).
encryptedCredential - Specify the encrypted credentials that you can get by running the New-AzureRmDataFactoryEncryptValue cmdlet. Required: No (if you choose to specify userid and password in plain text).

Sample host and folderPath values for different scenarios:

Local folder on the Data Management Gateway machine (examples: D:\* or D:\folder\subfolder\*):
host in the linked service definition: D:\\ (for Data Management Gateway 2.0 and later versions), or localhost (for versions earlier than Data Management Gateway 2.0).
folderPath in the dataset definition: .\\ or folder\\subfolder (for Data Management Gateway 2.0 and later versions), or D:\\ or D:\\folder\\subfolder (for gateway versions below 2.0).

Remote shared folder (examples: \\myserver\share\* or \\myserver\share\folder\subfolder\*).
{
"Name": "OnPremisesFileServerLinkedService",
"properties": {
"type": "OnPremisesFileServer",
"typeProperties": {
"host": "\\\\Contosogame-Asia",
"userid": "Admin",
"password": "123456",
"gatewayName": "mygateway"
}
}
}
{
"Name": " OnPremisesFileServerLinkedService ",
"properties": {
"type": "OnPremisesFileServer",
"typeProperties": {
"host": "D:\\",
"encryptedCredential": "WFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5xxxxxxxxxxxxxxxxx",
"gatewayName": "mygateway"
}
}
}
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Creating datasets. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types.
The typeProperties section is different for each type of dataset. It provides information such as the location
and format of the data in the data store. The typeProperties section for the dataset of type FileShare has the
following properties:
If fileName is not specified for an output dataset, the name of the generated file is in the following format:
Data.<Guid>.txt (Example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt)
NOTE
You cannot use fileName and fileFilter simultaneously.
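As a quick sketch, the typeProperties of a FileShare dataset that selects a subset of files with fileFilter might look like this (the folder and filter values are placeholders):

"typeProperties": {
    "folderPath": "folder\\subfolder",
    "fileFilter": "*.csv"
}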
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
{ "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],
In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the format
(YYYYMMDDHH). SliceStart refers to start time of the slice. The folderPath is different for each slice. For
example: wikidatagateway/wikisampledataout/2014100103 or
wikidatagateway/wikisampledataout/2014100104.
Sample 2:
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
In this example, year, month, day, and time of SliceStart are extracted into separate variables that the
folderPath and fileName properties use.
The following examples show the resulting behavior of the copy operation for different combinations of the
recursive and copyBehavior values, given a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

recursive = true, copyBehavior = preserveHierarchy: the target folder Folder1 is created with the same structure as the source:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

recursive = true, copyBehavior = flattenHierarchy: the target Folder1 is created with the following structure:

Folder1
    auto-generated name for File1
    auto-generated name for File2
    auto-generated name for File3
    auto-generated name for File4
    auto-generated name for File5

recursive = true, copyBehavior = mergeFiles: the target Folder1 is created with the following structure:

Folder1
    File1 + File2 + File3 + File4 + File5 contents are merged into one file with an auto-generated file name.

recursive = false, copyBehavior = preserveHierarchy: the target folder Folder1 is created with the following structure (Subfolder1 with File3, File4, and File5 is not picked up):

Folder1
    File1
    File2

recursive = false, copyBehavior = flattenHierarchy: the target folder Folder1 is created with the following structure (Subfolder1 with File3, File4, and File5 is not picked up):

Folder1
    auto-generated name for File1
    auto-generated name for File2

recursive = false, copyBehavior = mergeFiles: the target folder Folder1 is created with the following structure (Subfolder1 with File3, File4, and File5 is not picked up):

Folder1
    File1 + File2 contents are merged into one file with an auto-generated file name. Auto-generated name for File1
Example: Copy data from an on-premises file system to Azure Blob storage
On-Premises File Server linked service:
{
"Name": "OnPremisesFileServerLinkedService",
"properties": {
"type": "OnPremisesFileServer",
"typeProperties": {
"host": "\\\\Contosogame-Asia.<region>.corp.<company>.com",
"userid": "Admin",
"password": "123456",
"gatewayName": "mygateway"
}
}
}
We recommend using the encryptedCredential property instead of the userid and password properties. See
the File Server linked service section for details about this linked service.
Azure Storage linked service:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
A copy activity in a pipeline with File System source and Blob sink:
The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled
to run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource, and sink type
is set to BlobSink.
{
"name":"SamplePipeline",
"properties":{
"start":"2015-06-01T18:00:00",
"end":"2015-06-01T19:00:00",
"description":"Pipeline for copy activity",
"activities":[
{
"name": "OnpremisesFileSystemtoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [
{
"name": "OnpremisesFileSystemInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
Example: Copy data from Azure SQL Database to an on-premises file system
The following sample shows:
A linked service of type AzureSqlDatabase.
A linked service of type OnPremisesFileServer.
An input dataset of type AzureSqlTable.
An output dataset of type FileShare.
A pipeline with a copy activity that uses SqlSource and FileSystemSink.
The sample copies time-series data from an Azure SQL table to an on-premises file system every hour. The
JSON properties that are used in these samples are described in sections after the samples.
Azure SQL Database linked service:
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection
Timeout=30"
}
}
}
On-Premises File Server linked service:
{
"Name": "OnPremisesFileServerLinkedService",
"properties": {
"type": "OnPremisesFileServer",
"typeProperties": {
"host": "\\\\Contosogame-Asia.<region>.corp.<company>.com",
"userid": "Admin",
"password": "123456",
"gatewayName": "mygateway"
}
}
}
We recommend using the encryptedCredential property instead of using the userid and password
properties. See File System linked service for details about this linked service.
Azure SQL input dataset:
The sample assumes that you've created a table MyTable in Azure SQL, and it contains a column called
timestampcolumn for time-series data.
Setting external: true informs Data Factory that the dataset is external to the data factory and is not
produced by an activity in the data factory.
{
"name": "AzureSqlInput",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
On-premises file system output dataset:
{
"name": "OnpremisesFileSystemOutput",
"properties": {
"type": "FileShare",
"linkedServiceName": " OnPremisesFileServerLinkedService ",
"typeProperties": {
"folderPath": "mysharedfolder/yearno={Year}/monthno={Month}/dayno={Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
A copy activity in a pipeline with SQL source and File System sink:
The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled
to run every hour. In the pipeline JSON definition, the source type is set to SqlSource, and the sink type is
set to FileSystemSink. The SQL query that is specified for the SqlReaderQuery property selects the data in
the past hour to copy.
{
"name":"SamplePipeline",
"properties":{
"start":"2015-06-01T18:00:00",
"end":"2015-06-01T20:00:00",
"description":"pipeline for copy activity",
"activities":[
{
"name": "AzureSQLtoOnPremisesFile",
"description": "copy activity",
"type": "Copy",
"inputs": [
{
"name": "AzureSQLInput"
}
],
"outputs": [
{
"name": "OnpremisesFileSystemOutput"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-
MM-dd}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "FileSystemSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 3,
"timeout": "01:00:00"
}
}
]
}
}
You can also map columns from source dataset to columns from sink dataset in the copy activity definition.
For details, see Mapping dataset columns in Azure Data Factory.
Move data from an FTP server by using Azure Data Factory
This article explains how to use the copy activity in Azure Data Factory to move data from an FTP server. It
builds on the Data movement activities article, which presents a general overview of data movement with the
copy activity.
You can copy data from an FTP server to any supported sink data store. For a list of data stores supported as
sinks by the copy activity, see the supported data stores table. Data Factory currently supports only moving
data from an FTP server to other data stores, but not moving data from other data stores to an FTP server. It
supports both on-premises and cloud FTP servers.
NOTE
The copy activity does not delete the source file after it is successfully copied to the destination. If you need to delete the
source file after a successful copy, create a custom activity to delete the file, and use the activity in the pipeline.
Enable connectivity
If you are moving data from an on-premises FTP server to a cloud data store (for example, to Azure Blob
storage), install and use Data Management Gateway. The Data Management Gateway is a client agent that is
installed on your on-premises machine, and it allows cloud services to connect to an on-premises resource. For
details, see Data Management Gateway. For step-by-step instructions on setting up the gateway and using it,
see Moving data between on-premises locations and cloud. You use the gateway to connect to an FTP server,
even if the server is on an Azure infrastructure as a service (IaaS) virtual machine (VM).
It is possible to install the gateway on the same on-premises machine or IaaS VM as the FTP server. However,
we recommend that you install the gateway on a separate machine or IaaS VM to avoid resource contention,
and for better performance. When you install the gateway on a separate machine, the machine should be able
to access the FTP server.
Get started
You can create a pipeline with a copy activity that moves data from an FTP source by using different tools or
APIs.
The easiest way to create a pipeline is to use the Data Factory Copy Wizard. See Tutorial: Create a pipeline
using Copy Wizard for a quick walkthrough.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, PowerShell, Azure
Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools or APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an FTP data store, see the JSON example: Copy data from FTP server to Azure blob
section of this article.
NOTE
For details about supported file and compression formats to use, see File and compression formats in Azure Data
Factory.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to FTP.
Using Anonymous authentication
{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"authenticationType": "Anonymous",
"host": "myftpserver.com"
}
}
}
Using username and password in plain text for Basic authentication
{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "myftpserver.com",
"authenticationType": "Basic",
"username": "Admin",
"password": "123456"
}
}
}
Using port, enableSsl, enableServerCertificateValidation
{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "myftpserver.com",
"authenticationType": "Basic",
"username": "Admin",
"password": "123456",
"port": "21",
"enableSsl": true,
"enableServerCertificateValidation": true
}
}
}
Using encryptedCredential for authentication and gateway
{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "myftpserver.com",
"authenticationType": "Basic",
"encryptedCredential": "xxxxxxxxxxxxxxxxx",
"gatewayName": "mygateway"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see Creating datasets. Sections such as
structure, availability, and policy of a dataset JSON are similar for all dataset types.
The typeProperties section is different for each type of dataset. It provides information that is specific to the
dataset type. The typeProperties section for a dataset of type FileShare has the following properties:
NOTE
fileName and fileFilter cannot be used simultaneously.
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
{ "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],
In this example, {Slice} is replaced with the value of Data Factory system variable SliceStart, in the format
specified (YYYYMMDDHH). The SliceStart refers to start time of the slice. The folder path is different for each
slice. (For example, wikidatagateway/wikisampledataout/2014100103 or
wikidatagateway/wikisampledataout/2014100104.)
Sample 2
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
In this example, the year, month, day, and time of SliceStart are extracted into separate variables that are used
by the folderPath and fileName properties.
JSON example: Copy data from FTP server to Azure Blob
FTP linked service:
{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "myftpserver.com",
"authenticationType": "Basic",
"username": "Admin",
"password": "123456"
}
}
}
Azure Storage linked service:
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
FTP input dataset:
{
"name": "FTPFileInput",
"properties": {
"type": "FileShare",
"linkedServiceName": "FTPLinkedService",
"typeProperties": {
"folderPath": "mysharedfolder",
"fileName": "test.csv",
"useBinaryTransfer": true
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
A copy activity in a pipeline with file system source and blob sink
The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled to
run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource, and the sink type
is set to BlobSink.
{
"name": "pipeline",
"properties": {
"activities": [{
"name": "FTPToBlobCopy",
"inputs": [{
"name": "FtpFileInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "00:05:00"
}
}],
"start": "2016-08-24T18:00:00Z",
"end": "2016-08-24T19:00:00Z"
}
}
NOTE
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data Factory.
Next steps
See the following articles:
To learn about key factors that impact performance of data movement (copy activity) in Data Factory,
and various ways to optimize it, see the Copy activity performance and tuning guide.
For step-by-step instructions for creating a pipeline with a copy activity, see the Copy activity tutorial.
Move data from on-premises HDFS using Azure
Data Factory
7/31/2017 13 min to read Edit Online
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises
HDFS. It builds on the Data Movement Activities article, which presents a general overview of data movement
with the copy activity.
You can copy data from HDFS to any supported sink data store. For a list of data stores supported as sinks by
the copy activity, see the Supported data stores table. Data factory currently supports only moving data from
an on-premises HDFS to other data stores, but not for moving data from other data stores to an on-premises
HDFS.
NOTE
Copy Activity does not delete the source file after it is successfully copied to the destination. If you need to delete the
source file after a successful copy, create a custom activity to delete the file and use the activity in the pipeline.
Enabling connectivity
Data Factory service supports connecting to on-premises HDFS using the Data Management Gateway. See
moving data between on-premises locations and cloud article to learn about Data Management Gateway and
step-by-step instructions on setting up the gateway. Use the gateway to connect to HDFS even if it is hosted in
an Azure IaaS VM.
NOTE
Make sure the Data Management Gateway can access all of the [name node server]:[name node port] and [data node
servers]:[data node port] of the Hadoop cluster. The default [name node port] is 50070, and the default [data node port] is
50075.
While you can install the gateway on the same on-premises machine or Azure IaaS VM as the HDFS, we
recommend that you install the gateway on a separate machine or Azure IaaS VM. Having the gateway on a separate
machine reduces resource contention and improves performance. When you install the gateway on a separate
machine, the machine should be able to access the machine with the HDFS.
Getting started
You can create a pipeline with a copy activity that moves data from a HDFS source by using different
tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an HDFS data store, see the JSON example: Copy data from on-premises HDFS to Azure
Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to HDFS:
encryptedCredential - New-AzureRmDataFactoryEncryptValue output of the access credential. Required: No.
{
"name": "hdfs",
"properties":
{
"type": "Hdfs",
"typeProperties":
{
"authenticationType": "Windows",
"userName": "Administrator",
"password": "password",
"url" : "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "mygateway"
}
}
}
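If you use an encrypted credential instead of a plain-text password, the linked service might look like the following sketch (the combination of properties is an assumption based on the encryptedCredential property described above; the credential value is a placeholder):

{
    "name": "hdfs",
    "properties":
    {
        "type": "Hdfs",
        "typeProperties":
        {
            "authenticationType": "Windows",
            "userName": "Administrator",
            "encryptedCredential": "<encrypted credential returned by New-AzureRmDataFactoryEncryptValue>",
            "url" : "http://<machine>:50070/webhdfs/v1/",
            "gatewayName": "mygateway"
        }
    }
}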
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for a dataset of type FileShare (which includes the HDFS
dataset) has the following properties:
NOTE
filename and fileFilter cannot be used simultaneously.
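Sample 1:
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
    { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],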
In this example {Slice} is replaced with the value of Data Factory system variable SliceStart in the format
(YYYYMMDDHH) specified. The SliceStart refers to start time of the slice. The folderPath is different for each
slice. For example: wikidatagateway/wikisampledataout/2014100103 or
wikidatagateway/wikisampledataout/2014100104.
Sample 2:
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by
folderPath and fileName properties.
JSON example: Copy data from on-premises HDFS to Azure Blob
HDFS linked service:
{
"name": "HDFSLinkedService",
"properties":
{
"type": "Hdfs",
"typeProperties":
{
"authenticationType": "Windows",
"userName": "Administrator",
"password": "password",
"url" : "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "mygateway"
}
}
}
Azure Storage linked service:
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
HDFS input dataset: This dataset refers to the HDFS folder DataTransfer/UnitTest/. The pipeline copies all the
files in this folder to the destination.
Setting external: true informs the Data Factory service that the dataset is external to the data factory and is
not produced by an activity in the data factory.
{
"name": "InputDataset",
"properties": {
"type": "FileShare",
"linkedServiceName": "HDFSLinkedService",
"typeProperties": {
"folderPath": "DataTransfer/UnitTest/"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
A copy activity in a pipeline with File System source and Blob sink:
The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled
to run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource and sink type is
set to BlobSink.
{
"name": "pipeline",
"properties":
{
"activities":
[
{
"name": "HdfsToBlobCopy",
"inputs": [ {"name": "InputDataset"} ],
"outputs": [ {"name": "OutputDataset"} ],
"type": "Copy",
"typeProperties":
{
"source":
{
"type": "FileSystemSource"
},
"sink":
{
"type": "BlobSink"
}
},
"policy":
{
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "00:05:00"
}
}
],
"start": "2014-06-01T18:00:00Z",
"end": "2014-06-01T19:00:00Z"
}
}
C:> Ksetup
default realm = REALM.COM (external)
REALM.com:
kdc = <your_kdc_server_address>
NOTE
Replace REALM.COM and AD.COM in the following tutorial with your own respective realm and domain controller as
needed.
On KDC server:
1. Edit the KDC configuration in krb5.conf file to let KDC trust Windows Domain referring to the following
configuration template. By default, the configuration is located at /etc/krb5.conf.
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = REALM.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
[realms]
REALM.COM = {
kdc = node.REALM.COM
admin_server = node.REALM.COM
}
AD.COM = {
kdc = windc.ad.com
admin_server = windc.ad.com
}
[domain_realm]
.REALM.COM = REALM.COM
REALM.COM = REALM.COM
.ad.com = AD.COM
ad.com = AD.COM
[capaths]
AD.COM = {
REALM.COM = .
}
On domain controller:
1. Run the following Ksetup commands to add a realm entry:
2. Establish trust from the Windows domain to the Kerberos realm. [password] is the password for the principal
krbtgt/REALM.COM@AD.COM.
d. Use Ksetup command to specify the encryption algorithm to be used on the specific REALM.
4. Create the mapping between the domain account and Kerberos principal, in order to use Kerberos
principal in Windows Domain.
a. Start the Administrative tools > Active Directory Users and Computers.
b. Configure advanced features by clicking View > Advanced Features.
c. Locate the account to which you want to create mappings, and right-click to view Name
Mappings > click Kerberos Names tab.
d. Add a principal from the realm.
On gateway machine:
Run the following Ksetup commands to add a realm entry.
C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM
NOTE
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data
Factory.
This article outlines how to use the Copy Activity in Azure Data Factory to move data from an on-
premises/cloud HTTP endpoint to a supported sink data store. This article builds on the data movement
activities article that presents a general overview of data movement with copy activity and the list of data
stores supported as sources/sinks.
Data factory currently supports only moving data from an HTTP source to other data stores, but not moving
data from other data stores to an HTTP destination.
Getting started
You can create a pipeline with a copy activity that moves data from an HTTP source by using different
tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using
Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure
PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial
for step-by-step instructions to create a pipeline with a copy activity. For JSON samples to copy data
from an HTTP source to Azure Blob Storage, see the JSON examples section of this article.
PROPERTY | DESCRIPTION | REQUIRED
gatewayName | Name of the Data Management Gateway to connect to an on-premises HTTP source. | Yes, if copying data from an on-premises HTTP source.
encryptedCredential | Encrypted credential to access the HTTP endpoint. Auto-generated when you configure the authentication information in the Copy Wizard or the ClickOnce popup dialog. | No. Applies only when copying data from an on-premises HTTP server.
See Move data between on-premises sources and the cloud with Data Management Gateway for details about
setting credentials for on-premises HTTP connector data source.
Using Basic, Digest, or Windows authentication
Set authenticationType as Basic , Digest , or Windows , and specify the following properties besides the HTTP
connector generic ones introduced above:
{
"name": "HttpLinkedService",
"properties":
{
"type": "Http",
"typeProperties":
{
"authenticationType": "basic",
"url" : "https://en.wikipedia.org/wiki/",
"userName": "user name",
"password": "password"
}
}
}
certThumbprint | The thumbprint of the certificate installed in the cert store of your gateway machine. Applies only when copying data from an on-premises HTTP source. | Specify either embeddedCertData or certThumbprint.
If you use certThumbprint for authentication and the certificate is installed in the personal store of the local
computer, you need to grant the read permission to the gateway service:
1. Launch Microsoft Management Console (MMC). Add the Certificates snap-in that targets the Local
Computer.
2. Expand Certificates, Personal, and click Certificates.
3. Right-click the certificate from the personal store, and select All Tasks->Manage Private Keys...
4. On the Security tab, add the user account under which Data Management Gateway Host Service is running
with the read access to the certificate.
Example: using client certificate
This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate that is
installed on the machine with Data Management Gateway installed.
{
"name": "HttpLinkedService",
"properties":
{
"type": "Http",
"typeProperties":
{
"authenticationType": "ClientCertificate",
"url": "https://en.wikipedia.org/wiki/",
"certThumbprint": "thumbprint of certificate",
"gatewayName": "gateway name"
}
}
}
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for dataset of type Http has the following properties
requestMethod | Http method. Allowed values are GET or POST. | No. Default is GET.
{
"name": "HttpSourceDataInput",
"properties": {
"type": "Http",
"linkedServiceName": "HttpLinkedService",
"typeProperties": {
"relativeUrl": "XXX/test.xml",
"additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
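If requestMethod is set to POST, you can also supply a request body. The following dataset is a sketch only; it assumes a requestBody type property alongside the properties shown above, so verify the property name against your Data Factory version:
{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "Http",
        "linkedServiceName": "HttpLinkedService",
        "typeProperties": {
            "relativeUrl": "XXX/test.xml",
            "requestMethod": "Post",
            "requestBody": "body for POST HTTP request"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}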
httpRequestTimeout | The timeout (TimeSpan) for the HTTP request to get a response. It is the timeout to get a response, not the timeout to read response data. | No. Default value: 00:01:40
JSON examples
The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure
portal, Visual Studio, or Azure PowerShell. They show how to copy data from an HTTP source to Azure Blob
Storage. However, data can be copied directly from any of the supported sources to any of the supported sinks by
using the Copy Activity in Azure Data Factory.
Example: Copy data from HTTP source to Azure Blob Storage
The Data Factory solution for this sample contains the following Data Factory entities:
1. A linked service of type HTTP.
2. A linked service of type AzureStorage.
3. An input dataset of type Http.
4. An output dataset of type AzureBlob.
5. A pipeline with Copy Activity that uses HttpSource and BlobSink.
The sample copies data from an HTTP source to an Azure blob every hour. The JSON properties used in these
samples are described in sections following the samples.
HTTP linked service
This example uses the HTTP linked service with anonymous authentication. See HTTP linked service section for
different types of authentication you can use.
{
"name": "HttpLinkedService",
"properties":
{
"type": "Http",
"typeProperties":
{
"authenticationType": "Anonymous",
"url" : "https://en.wikipedia.org/wiki/"
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"name": "HttpSourceDataInput",
"properties": {
"type": "Http",
"linkedServiceName": "HttpLinkedService",
"typeProperties": {
"relativeUrl": "$$Text.Format('/my/report?month={0:yyyy}-{0:MM}&fmt=csv', SliceStart)",
"additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
NOTE
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data
Factory.
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises
MongoDB database. It builds on the Data Movement Activities article, which presents a general overview of
data movement with the copy activity.
You can copy data from an on-premises MongoDB data store to any supported sink data store. For a list of data
stores supported as sinks by the copy activity, see the Supported data stores table. Data factory currently
supports only moving data from a MongoDB data store to other data stores, but not moving data from
other data stores to a MongoDB data store.
Prerequisites
For the Azure Data Factory service to be able to connect to your on-premises MongoDB database, you must
install the following components:
Supported MongoDB versions are: 2.4, 2.6, 3.0, and 3.2.
Data Management Gateway on the same machine that hosts the database, or on a separate machine to
avoid competing for resources with the database. Data Management Gateway is software that
connects on-premises data sources to cloud services in a secure and managed way. See the Data
Management Gateway article for details about Data Management Gateway, and the Move data from on-
premises to cloud article for step-by-step instructions on setting up the gateway and a data pipeline to move
data.
When you install the gateway, it automatically installs a Microsoft MongoDB ODBC driver used to
connect to MongoDB.
NOTE
You need to use the gateway to connect to MongoDB even if it is hosted in an Azure IaaS VM. If you are trying to
connect to an instance of MongoDB hosted in the cloud, you can also install the gateway instance in the same IaaS VM.
Getting started
You can create a pipeline with a copy activity that moves data from an on-premises MongoDB data store by
using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an on-premises MongoDB data store, see JSON example: Copy data from MongoDB to
Azure Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to MongoDB source:
port | TCP port that the MongoDB server uses to listen for client connections. | Optional, default value: 27017
username | User account to access MongoDB. | Yes (if basic authentication is used).
password | Password for the user. | Yes (if basic authentication is used).
authSource | Name of the MongoDB database that you want to use to check your credentials for authentication. | Optional (if basic authentication is used). Default: uses the admin account and the database specified using the databaseName property.
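The MongoDB input dataset later in this article references a linked service named OnPremisesMongoDbLinkedService. A minimal sketch of such a linked service is shown below; it assumes the OnPremisesMongoDb type and the property names listed above, plus server and databaseName, so verify them against your Data Factory version:
{
    "name": "OnPremisesMongoDbLinkedService",
    "properties": {
        "type": "OnPremisesMongoDb",
        "typeProperties": {
            "server": "<server name>",
            "port": 27017,
            "authenticationType": "Basic",
            "username": "<username>",
            "password": "<password>",
            "databaseName": "<database name>",
            "gatewayName": "<gateway name>"
        }
    }
}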
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for dataset of type MongoDbCollection has the
following properties:
PROPERTY | DESCRIPTION | REQUIRED
query | Use the custom query to read data. SQL-92 query string. For example: select * from MyTable. | No (if collectionName of dataset is specified)
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
MongoDB input dataset: Setting external: true informs the Data Factory service that the table is external
to the data factory and is not produced by an activity in the data factory.
{
"name": "MongoDbInputDataset",
"properties": {
"type": "MongoDbCollection",
"linkedServiceName": "OnPremisesMongoDbLinkedService",
"typeProperties": {
"collectionName": "<Collection name>"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
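A copy activity that reads from this dataset uses a MongoDB source. The pipeline below is a sketch only: it assumes the MongoDbSource source type and a hypothetical blob output dataset named MongoDbOutputBlob:
{
    "name": "CopyMongoDBToBlob",
    "properties": {
        "description": "pipeline for copy activity",
        "activities": [
            {
                "name": "MongoDBToAzureBlob",
                "type": "Copy",
                "inputs": [ { "name": "MongoDbInputDataset" } ],
                "outputs": [ { "name": "MongoDbOutputBlob" } ],
                "typeProperties": {
                    "source": {
                        "type": "MongoDbSource",
                        "query": "select * from <Collection name>"
                    },
                    "sink": {
                        "type": "BlobSink"
                    }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 },
                "policy": { "concurrency": 1, "timeout": "01:00:00" }
            }
        ],
        "start": "2016-06-01T18:00:00Z",
        "end": "2016-06-01T19:00:00Z"
    }
}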
MONGODB DATA TYPE .NET FRAMEWORK TYPE
Binary Byte[]
Boolean Boolean
Date DateTime
NumberDouble Double
NumberInt Int32
NumberLong Int64
ObjectID String
String String
UUID Guid
NOTE
To learn about support for arrays using virtual tables, refer to Support for complex types using virtual tables section
below.
Currently, the following MongoDB data types are not supported: DBPointer, JavaScript, Max/Min key, Regular
Expression, Symbol, Timestamp, Undefined
The driver would generate multiple virtual tables to represent this single table. The first virtual table is the base
table named ExampleTable, shown below. The base table contains all the data of the original table, but the
data from the arrays has been omitted and is expanded in the virtual tables.
The following tables show the virtual tables that represent the original arrays in the example. These tables
contain the following:
A reference back to the original primary key column corresponding to the row of the original array (via the
_id column)
An indication of the position of the data within the original array
The expanded data for each element within the array
Table ExampleTable_Invoices:
_ID | EXAMPLETABLE_INVOICES_DIM1_IDX | INVOICE_ID | ITEM | PRICE | DISCOUNT
Table ExampleTable_Ratings:
_ID | EXAMPLETABLE_RATINGS_DIM1_IDX | EXAMPLETABLE_RATINGS
1111 | 0 | 5
1111 | 1 | 6
2222 | 0 | 1
2222 | 1 | 2
Next Steps
See Move data between on-premises and cloud article for step-by-step instructions for creating a data pipeline
that moves data from an on-premises data store to an Azure data store.
Move data From MySQL using Azure Data Factory
6/27/2017 8 min to read Edit Online
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises
MySQL database. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.
You can copy data from an on-premises MySQL data store to any supported sink data store. For a list of data
stores supported as sinks by the copy activity, see the Supported data stores table. Data factory currently
supports only moving data from a MySQL data store to other data stores, but not for moving data from other
data stores to a MySQL data store.
Prerequisites
Data Factory service supports connecting to on-premises MySQL sources using the Data Management
Gateway. See moving data between on-premises locations and cloud article to learn about Data Management
Gateway and step-by-step instructions on setting up the gateway.
Gateway is required even if the MySQL database is hosted in an Azure IaaS virtual machine (VM). You can
install the gateway on the same VM as the data store or on a different VM as long as the gateway can connect
to the database.
NOTE
See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.
TIP
If you hit the error "Authentication failed because the remote party has closed the transport stream.", consider
upgrading MySQL Connector/Net to a higher version.
Getting started
You can create a pipeline with a copy activity that moves data from an on-premises MySQL data store by
using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an on-premises MySQL data store, see JSON example: Copy data from MySQL to Azure
Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to a MySQL data store:
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for dataset of type RelationalTable (which includes
MySQL dataset) has the following properties
PROPERTY | DESCRIPTION | REQUIRED
query | Use the custom query to read data. SQL query string. For example: select * from MyTable. | No (if tableName of dataset is specified)
IMPORTANT
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See
moving data between on-premises locations and cloud article for step-by-step instructions.
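The MySqlDataSet shown below references an on-premises MySQL linked service named OnPremMySqlLinkedService. A minimal sketch of that linked service follows; it assumes the OnPremisesMySql type and property names analogous to the other on-premises relational connectors in this documentation, so verify them against your Data Factory version:
{
    "name": "OnPremMySqlLinkedService",
    "properties": {
        "type": "OnPremisesMySql",
        "typeProperties": {
            "server": "<server name>",
            "database": "<database name>",
            "schema": "<schema name>",
            "authenticationType": "Basic",
            "username": "<username>",
            "password": "<password>",
            "gatewayName": "<gateway name>"
        }
    }
}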
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"name": "MySqlDataSet",
"properties": {
"published": false,
"type": "RelationalTable",
"linkedServiceName": "OnPremMySqlLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
{
"name": "AzureBlobMySqlDataSet",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/mysql/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t"
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
MYSQL DATABASE TYPE .NET FRAMEWORK TYPE
bigint Int64
bit Decimal
blob Byte[]
bool Boolean
char String
date Datetime
datetime Datetime
decimal Decimal
double Double
enum String
float Single
int Int32
integer Int32
longblob Byte[]
longtext String
mediumblob Byte[]
mediumint Int32
mediumtext String
numeric Decimal
real Double
set String
smallint Int16
text String
time TimeSpan
timestamp Datetime
tinyblob Byte[]
tinyint Int16
tinytext String
varchar String
year Int
This article explains how to use the Copy Activity in Azure Data Factory to move data from an OData source. It
builds on the Data Movement Activities article, which presents a general overview of data movement with the
copy activity.
You can copy data from an OData source to any supported sink data store. For a list of data stores supported as
sinks by the copy activity, see the Supported data stores table. Data factory currently supports only moving
data from an OData source to other data stores, but not for moving data from other data stores to an OData
source.
Getting started
You can create a pipeline with a copy activity that moves data from an OData source by using different
tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an OData source, see JSON example: Copy data from OData source to Azure Blob
section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to OData source:
username | Specify user name if you are using Basic authentication. | Yes (only if you are using Basic authentication)
password | Specify password for the user account you specified for the username. | Yes (only if you are using Basic authentication)
authorizedCredential | If you are using OAuth, click the Authorize button in the Data Factory Copy Wizard or Editor and enter your credential, then the value of this property will be auto-generated. | Yes (only if you are using OAuth authentication)
{
"name": "ODataLinkedService",
"properties":
{
"type": "OData",
"typeProperties":
{
"url": "http://services.odata.org/OData/OData.svc",
"authenticationType": "Anonymous"
}
}
}
{
"name": "inputLinkedService",
"properties":
{
"type": "OData",
"typeProperties":
{
"url": "<endpoint of on-premises OData source e.g. Dynamics CRM>",
"authenticationType": "Windows",
"username": "domain\\user",
"password": "password",
"gatewayName": "mygateway"
}
}
}
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for dataset of type ODataResource (which includes
OData dataset) has the following properties
ODATA DATA TYPE .NET TYPE
Edm.Binary Byte[]
Edm.Boolean Bool
Edm.Byte Byte[]
Edm.DateTime DateTime
Edm.Decimal Decimal
Edm.Double Double
Edm.Single Single
Edm.Guid Guid
Edm.Int16 Int16
Edm.Int32 Int32
Edm.Int64 Int64
Edm.SByte Int16
Edm.String String
Edm.Time TimeSpan
Edm.DateTimeOffset DateTimeOffset
NOTE
OData complex data types (for example, Object) are not supported.
{
"name": "ODataLinkedService",
"properties":
{
"type": "OData",
"typeProperties":
{
"url": "http://services.odata.org/OData/OData.svc",
"authenticationType": "Anonymous"
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"name": "ODataDataset",
"properties":
{
"type": "ODataResource",
"typeProperties":
{
"path": "Products"
},
"linkedServiceName": "ODataLinkedService",
"structure": [],
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
{
"name": "AzureBlobODataDataSet",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/odata/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t"
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Specifying query in the pipeline definition is optional. The URL that the Data Factory service uses to retrieve
data is: URL specified in the linked service (required) + path specified in the dataset (optional) + query in the
pipeline (optional).
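For example, with the anonymous linked service and the Products path used in the ODataDataset above, a copy activity source could append an OData query string. The snippet below is a sketch only; it assumes the RelationalSource source type used for OData copies, and the query value is illustrative:
"source": {
    "type": "RelationalSource",
    "query": "?$select=Name,Description&$top=5"
}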
Type mapping for OData
As mentioned in the data movement activities article, Copy activity performs automatic type conversions from
source types to sink types with the following 2-step approach:
1. Convert from native source types to .NET type
2. Convert from .NET type to native sink type
When moving data from OData data stores, OData data types are mapped to .NET types.
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises
ODBC data store. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.
You can copy data from an ODBC data store to any supported sink data store. For a list of data stores
supported as sinks by the copy activity, see the Supported data stores table. Data factory currently supports
only moving data from an ODBC data store to other data stores, but not for moving data from other data
stores to an ODBC data store.
Enabling connectivity
Data Factory service supports connecting to on-premises ODBC sources using the Data Management Gateway.
See moving data between on-premises locations and cloud article to learn about Data Management Gateway
and step-by-step instructions on setting up the gateway. Use the gateway to connect to an ODBC data store
even if it is hosted in an Azure IaaS VM.
You can install the gateway on the same on-premises machine or the Azure VM as the ODBC data store.
However, we recommend that you install the gateway on a separate machine/Azure IaaS VM to avoid resource
contention and for better performance. When you install the gateway on a separate machine, the machine
should be able to access the machine with the ODBC data store.
Apart from the Data Management Gateway, you also need to install the ODBC driver for the data store on the
gateway machine.
NOTE
See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.
Getting started
You can create a pipeline with a copy activity that moves data from an ODBC data store by using different
tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an ODBC data store, see JSON example: Copy data from ODBC data store to Azure Blob
section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to ODBC data store:
{
"name": "odbc",
"properties":
{
"type": "OnPremisesOdbc",
"typeProperties":
{
"authenticationType": "Basic",
"connectionString": "Driver={SQL Server};Server=myserver.database.windows.net;
Database=TestDatabase;;EncryptedCredential=eyJDb25uZWN0...........................",
"gatewayName": "mygateway"
}
}
}
{
"name": "odbc",
"properties":
{
"type": "OnPremisesOdbc",
"typeProperties":
{
"authenticationType": "Anonymous",
"connectionString": "Driver={SQL Server};Server={servername}.database.windows.net;
Database=TestDatabase;",
"credential": "UID={uid};PWD={pwd}",
"gatewayName": "mygateway"
}
}
}
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for dataset of type RelationalTable (which includes
ODBC dataset) has the following properties
query | Use the custom query to read data. SQL query string. For example: select * from MyTable. | Yes
JSON example: Copy data from ODBC data store to Azure Blob
This example provides JSON definitions that you can use to create a pipeline by using the Azure portal, Visual
Studio, or Azure PowerShell. It shows how to copy data from an ODBC source to Azure Blob Storage.
However, data can be copied to any of the sinks stated here using the Copy Activity in Azure Data Factory.
The sample has the following data factory entities:
1. A linked service of type OnPremisesOdbc.
2. A linked service of type AzureStorage.
3. An input dataset of type RelationalTable.
4. An output dataset of type AzureBlob.
5. A pipeline with Copy Activity that uses RelationalSource and BlobSink.
The sample copies data from a query result in an ODBC data store to a blob every hour. The JSON properties
used in these samples are described in sections following the samples.
As a first step, set up the data management gateway. The instructions are in the moving data between on-
premises locations and cloud article.
ODBC linked service This example uses the Basic authentication. See ODBC linked service section for different
types of authentication you can use.
{
"name": "OnPremOdbcLinkedService",
"properties":
{
"type": "OnPremisesOdbc",
"typeProperties":
{
"authenticationType": "Basic",
"connectionString": "Driver={SQL Server};Server=Server.database.windows.net;
Database=TestDatabase;",
"userName": "username",
"password": "password",
"gatewayName": "mygateway"
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"name": "ODBCDataSet",
"properties": {
"published": false,
"type": "RelationalTable",
"linkedServiceName": "OnPremOdbcLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
{
"name": "AzureBlobOdbcDataSet",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/odbc/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t"
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Copy activity in a pipeline with ODBC source (RelationalSource) and Blob sink (BlobSink)
The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled
to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource and sink type is
set to BlobSink. The SQL query specified for the query property selects the data in the past hour to copy.
{
"name": "CopyODBCToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-
ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "OdbcDataSet"
}
],
"outputs": [
{
"name": "AzureBlobOdbcDataSet"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "OdbcToBlob"
}
],
"start": "2016-06-01T18:00:00Z",
"end": "2016-06-01T19:00:00Z"
}
}
GE Historian store
You create an ODBC linked service to link a GE Proficy Historian (now GE Historian) data store to an Azure data
factory as shown in the following example:
{
"name": "HistorianLinkedService",
"properties":
{
"type": "OnPremisesOdbc",
"typeProperties":
{
"connectionString": "DSN=<name of the GE Historian store>",
"gatewayName": "<gateway name>",
"authenticationType": "Basic",
"userName": "<user name>",
"password": "<password>"
}
}
}
Install Data Management Gateway on an on-premises machine and register the gateway with the portal. The
gateway installed on your on-premises computer uses the ODBC driver for GE Historian to connect to the GE
Historian data store. Therefore, install the driver if it is not already installed on the gateway machine. See
Enabling connectivity section for details.
Before you use the GE Historian store in a Data Factory solution, verify whether the gateway can connect to the
data store using instructions in the next section.
Read the article from the beginning for a detailed overview of using ODBC data stores as source data stores in
a copy operation.
This article explains how to use the Copy Activity in Azure Data Factory to move data to/from an on-premises
Oracle database. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.
Supported scenarios
You can copy data from an Oracle database to the following data stores:
You can copy data from the following data stores to an Oracle database:
CATEGORY | DATA STORE
NoSQL | Cassandra, MongoDB
File | Amazon S3, File System, FTP, HDFS, SFTP
Prerequisites
Data Factory supports connecting to on-premises Oracle sources using the Data Management Gateway. See the
Data Management Gateway article to learn about Data Management Gateway, and the Move data from on-
premises to cloud article for step-by-step instructions on setting up the gateway and a data pipeline to move data.
The gateway is required even if the Oracle database is hosted in an Azure IaaS VM. You can install the gateway on the
same IaaS VM as the data store or on a different VM, as long as the gateway can connect to the database.
NOTE
See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.
IMPORTANT
Currently, the Microsoft driver for Oracle supports only copying data from Oracle, not writing to Oracle. Also note that the
test connection capability in the Data Management Gateway Diagnostics tab does not support this driver. Alternatively,
you can use the Copy Wizard to validate the connectivity.
Oracle Data Provider for .NET: you can also choose to use Oracle Data Provider to copy data from/to
Oracle. This component is included in Oracle Data Access Components for Windows. Install the
appropriate version (32/64 bit) on the machine where the gateway is installed. Oracle Data Provider for
.NET 12.1 can access Oracle Database 10g Release 2 or later.
If you choose XCopy Installation, follow steps in the readme.htm. We recommend you choose the
installer with UI (non-XCopy one).
After installing the provider, restart the Data Management Gateway host service on your machine
using Services applet (or) Data Management Gateway Configuration Manager.
If you use the Copy Wizard to author the copy pipeline, the driver type is determined automatically. The Microsoft
driver is used by default, unless your gateway version is lower than 2.7 or you choose Oracle as the sink.
Getting started
You can create a pipeline with a copy activity that moves data to/from an on-premises Oracle database by
using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from
a source data store to a sink data store:
1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are
copying data from an Oracle database to an Azure blob storage, you create two linked services to link your
Oracle database and Azure storage account to your data factory. For linked service properties that are
specific to Oracle, see linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in
the last step, you create a dataset to specify the table in your Oracle database that contains the input data.
And, you create another dataset to specify the blob container and the folder that holds the data copied
from the Oracle database. For dataset properties that are specific to Oracle, see dataset properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the
example mentioned earlier, you use OracleSource as a source and BlobSink as a sink for the copy activity.
Similarly, if you are copying from Azure Blob Storage to Oracle Database, you use BlobSource and
OracleSink in the copy activity. For copy activity properties that are specific to Oracle database, see copy
activity properties section. For details on how to use a data store as a source or a sink, click the link in the
previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that
are used to copy data to/from an on-premises Oracle database, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities:
{
"name": "OnPremisesOracleLinkedService",
"properties": {
"type": "OnPremisesOracle",
"typeProperties": {
"driverType": "Microsoft",
"connectionString":"Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=
<password>;",
"gatewayName": "<gateway name>"
}
}
}
{
"name": "OnPremisesOracleLinkedService",
"properties": {
"type": "OnPremisesOracle",
"typeProperties": {
"connectionString": "Data Source=(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<hostname>)(PORT=
<port number>))(CONNECT_DATA=(SERVICE_NAME=<SID>)));
User Id=<username>;Password=<password>;",
"gatewayName": "<gateway name>"
}
}
}
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Oracle, Azure blob,
Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for the dataset of type OracleTable has the following
properties:
PROPERTY | DESCRIPTION | REQUIRED
tableName | Name of the table in the Oracle database that the dataset refers to. | No (if oracleReaderQuery of OracleSource is specified)
NOTE
The Copy Activity takes only one input and produces only one output.
Whereas, properties available in the typeProperties section of the activity vary with each activity type. For
Copy activity, they vary depending on the types of sources and sinks.
OracleSource
In Copy activity, when the source is of type OracleSource the following properties are available in
typeProperties section:
oracleReaderQuery | Use the custom query to read data. SQL query string. For example: select * from MyTable | No (if tableName of dataset is specified)
OracleSink
OracleSink supports the following properties:
writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values: Integer (number of rows). | No (default: 100)
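For example, a sink that inserts 1000 rows per batch could be configured as follows. This is a sketch only; writeBatchTimeout is assumed here to be the companion timeout property for the batch insert:
"sink": {
    "type": "OracleSink",
    "writeBatchSize": 1000,
    "writeBatchTimeout": "00:05:00"
}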
{
"name": "OnPremisesOracleLinkedService",
"properties": {
"type": "OnPremisesOracle",
"typeProperties": {
"driverType": "Microsoft",
"connectionString":"Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=
<password>;",
"gatewayName": "<gateway name>"
}
}
}
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=
<Account key>"
}
}
}
Oracle input dataset:
The sample assumes you have created a table MyTable in Oracle and it contains a column called
timestampcolumn for time series data.
Setting external: true informs the Data Factory service that the dataset is external to the data factory and is
not produced by an activity in the data factory.
{
"name": "OracleInput",
"properties": {
"type": "OracleTable",
"linkedServiceName": "OnPremisesOracleLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"offset": "01:00:00",
"interval": "1",
"anchorDateTime": "2014-02-27T12:00:00",
"frequency": "Hour"
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
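An hourly copy pipeline that reads from this Oracle input dataset could look like the following sketch. The blob output dataset name AzureBlobOutput is hypothetical, and the oracleReaderQuery, which selects the rows for the current slice based on timestampcolumn, is illustrative:
{
    "name": "SamplePipeline",
    "properties": {
        "start": "2014-06-01T18:00:00",
        "end": "2014-06-05T19:00:00",
        "description": "pipeline for copy activity",
        "activities": [
            {
                "name": "OracleToBlob",
                "description": "Copy Activity",
                "type": "Copy",
                "inputs": [ { "name": "OracleInput" } ],
                "outputs": [ { "name": "AzureBlobOutput" } ],
                "typeProperties": {
                    "source": {
                        "type": "OracleSource",
                        "oracleReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
                    },
                    "sink": {
                        "type": "BlobSink"
                    }
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}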
{
"name": "OnPremisesOracleLinkedService",
"properties": {
"type": "OnPremisesOracle",
"typeProperties": {
"connectionString": "Data Source=(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<hostname>)(PORT=
<port number>))(CONNECT_DATA=(SERVICE_NAME=<SID>)));
User Id=<username>;Password=<password>;",
"gatewayName": "<gateway name>"
}
}
}
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=
<Account key>"
}
}
}
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-05T19:00:00",
"description":"pipeline with copy activity",
"activities":[
{
"name": "AzureBlobtoOracle",
"description": "Copy Activity",
"type": "Copy",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "OracleOutput"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "OracleSink"
}
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
Troubleshooting tips
Problem 1: .NET Framework Data Provider
You see the following error message:
Copy activity met invalid parameters: 'UnknownParameterName', Detailed message: Unable to find the
requested .Net Framework Data Provider. It may not be installed.
Possible causes:
1. The .NET Framework Data Provider for Oracle was not installed.
2. The .NET Framework Data Provider for Oracle was installed to .NET Framework 2.0 and is not found in the
.NET Framework 4.0 folders.
Resolution/Workaround:
1. If you haven't installed the .NET Provider for Oracle, install it and retry the scenario.
2. If you get the error message even after installing the provider, do the following steps:
a. Open machine config of .NET 2.0 from the folder:
:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config.
b. Search for Oracle Data Provider for .NET, and you should be able to find an entry as shown in the
following sample under system.data -> DbProviderFactories:
3. Copy this entry to the machine.config file in the following v4.0 folder:
:\Windows\Microsoft.NET\Framework64\v4.0.30319\Config\machine.config, and change the version to
4.xxx.x.x.
4. Install \11.2.0\client_1\odp.net\bin\4\Oracle.DataAccess.dll into the global assembly cache (GAC) by
running gacutil /i [provider path].
Problem 2: datetime formatting
You see the following error message:
Message=Operation failed in Oracle Database with the following error: 'ORA-01861: literal does not match
format string'.,Source=,''Type=Oracle.DataAccess.Client.OracleException,Message=ORA-01861: literal does
not match format string,Source=Oracle Data Provider for .NET,'.
Resolution/Workaround:
You may need to adjust the query string in your copy activity based on how dates are configured in your
Oracle database, as shown in the following sample (using the to_date function):
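A sketch of such an adjusted reader query is shown below; the column name timestampcolumn and the date format are assumptions, so match them to your own table and NLS settings:
"oracleReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= to_date(\\'{0:MM-dd-yyyy HH:mm}\\',\\'MM/DD/YYYY HH24:MI\\') AND timestampcolumn < to_date(\\'{1:MM-dd-yyyy HH:mm}\\',\\'MM/DD/YYYY HH24:MI\\')', WindowStart, WindowEnd)"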
ORACLE DATA TYPE .NET FRAMEWORK TYPE
BFILE Byte[]
BLOB Byte[]
CHAR String
CLOB String
DATE DateTime
LONG String
NCHAR String
NCLOB String
NVARCHAR2 String
RAW Byte[]
ROWID String
TIMESTAMP DateTime
VARCHAR2 String
XML String
NOTE
Data type INTERVAL YEAR TO MONTH and INTERVAL DAY TO SECOND are not supported when using Microsoft
driver.
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises
PostgreSQL database. It builds on the Data Movement Activities article, which presents a general overview of
data movement with the copy activity.
You can copy data from an on-premises PostgreSQL data store to any supported sink data store. For a list of
data stores supported as sinks by the copy activity, see supported data stores. Data factory currently supports
moving data from a PostgreSQL database to other data stores, but not moving data from other data stores
to a PostgreSQL database.
Prerequisites
Data Factory service supports connecting to on-premises PostgreSQL sources using the Data Management
Gateway. See moving data between on-premises locations and cloud article to learn about Data Management
Gateway and step-by-step instructions on setting up the gateway.
The gateway is required even if the PostgreSQL database is hosted in an Azure IaaS VM. You can install the gateway on
the same IaaS VM as the data store or on a different VM as long as the gateway can connect to the database.
NOTE
See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.
Getting started
You can create a pipeline with a copy activity that moves data from an on-premises PostgreSQL data store by
using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline:
Azure portal
Visual Studio
Azure PowerShell
Azure Resource Manager template
.NET API
REST API
See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an on-premises PostgreSQL data store, see JSON example: Copy data from PostgreSQL
to Azure Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to a PostgreSQL data store:
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types.
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for dataset of type RelationalTable (which includes
PostgreSQL dataset) has the following properties:
PROPERTY | DESCRIPTION | REQUIRED
query | Use the custom query to read data. SQL query string. For example: "query": "select * from \"MySchema\".\"MyTable\"". | No (if tableName of dataset is specified)
NOTE
Schema and table names are case-sensitive. Enclose them in "" (double quotes) in the query.
Example:
"query": "select * from \"MySchema\".\"MyTable\""
IMPORTANT
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See
moving data between on-premises locations and cloud article for step-by-step instructions.
{
"name": "OnPremPostgreSqlLinkedService",
"properties": {
"type": "OnPremisesPostgreSql",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"schema": "<schema>",
"authenticationType": "<authentication type>",
"username": "<username>",
"password": "<password>",
"gatewayName": "<gatewayName>"
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey=
<AccountKey>"
}
}
}
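A PostgreSQL input dataset that pairs with this linked service uses the RelationalTable type described above. The dataset name below is hypothetical:
{
    "name": "PostgreSqlDataSet",
    "properties": {
        "type": "RelationalTable",
        "linkedServiceName": "OnPremPostgreSqlLinkedService",
        "typeProperties": {},
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "policy": {
            "externalData": {
                "retryInterval": "00:01:00",
                "retryTimeout": "00:10:00",
                "maximumRetry": 3
            }
        }
    }
}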
POSTGRESQL DATABASE TYPE .NET FRAMEWORK TYPE
abstime Datetime
cid String
cidr String
date Datetime
daterange String
intarry String
int4range String
int8range String
json String
jsonb Byte[]
money Decimal
numrange String
oid Int32
pg_lsn Int64
text String
This article outlines how you can use Copy Activity in an Azure data factory to copy data from Salesforce to any
data store that is listed under the Sink column in the supported sources and sinks table. This article builds on
the data movement activities article, which presents a general overview of data movement with Copy Activity
and supported data store combinations.
Azure Data Factory currently supports only moving data from Salesforce to supported sink data stores, but
does not support moving data from other data stores to Salesforce.
Supported versions
This connector supports the following editions of Salesforce: Developer Edition, Professional Edition, Enterprise
Edition, and Unlimited Edition. It supports copying from Salesforce production, sandbox, and custom
domains.
Prerequisites
API permission must be enabled. See How do I enable API access in Salesforce by permission set?
To copy data from Salesforce to on-premises data stores, you must have at least Data Management
Gateway 2.0 installed in your on-premises environment.
Getting started
You can create a pipeline with a copy activity that moves data from Salesforce by using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from Salesforce, see JSON example: Copy data from Salesforce to Azure Blob section of this
article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Salesforce:
environmentUrl | Specify the URL of the Salesforce instance.
- Default is "https://login.salesforce.com".
- To copy data from sandbox, specify "https://test.salesforce.com".
- To copy data from custom domain, specify, for example, "https://[domain].my.salesforce.com".
Dataset properties
For a full list of sections and properties that are available for defining datasets, see the Creating datasets article.
Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL,
Azure blob, Azure table, and so on).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for a dataset of the type RelationalTable has the
following properties:
query | Use the custom query to read data. A SQL-92 query or Salesforce Object Query Language (SOQL) query. For example: select * from MyTable__c. | No (if the tableName of the dataset is specified)
IMPORTANT
The "__c" part of the API Name is needed for any custom object.
Query tips
Retrieving data using where clause on DateTime column
When specifying the SOQL or SQL query, pay attention to the DateTime format difference. For example:
SOQL sample:
$$Text.Format('SELECT Id, Name, BillingCity FROM Account WHERE LastModifiedDate >= {0:yyyy-MM-
ddTHH:mm:ssZ} AND LastModifiedDate < {1:yyyy-MM-ddTHH:mm:ssZ}', WindowStart, WindowEnd)
SQL sample:
Using copy wizard to specify the query:
$$Text.Format('SELECT * FROM Account WHERE LastModifiedDate >= {{ts\'{0:yyyy-MM-dd HH:mm:ss}\'}}
AND LastModifiedDate < {{ts\'{1:yyyy-MM-dd HH:mm:ss}\'}}', WindowStart, WindowEnd)
Using JSON editing to specify the query (escape char properly):
$$Text.Format('SELECT * FROM Account WHERE LastModifiedDate >= {{ts\\'{0:yyyy-MM-dd HH:mm:ss}\\'}}
AND LastModifiedDate < {{ts\\'{1:yyyy-MM-dd HH:mm:ss}\\'}}', WindowStart, WindowEnd)
{
"name": "SalesforceLinkedService",
"properties":
{
"type": "Salesforce",
"typeProperties":
{
"username": "<user name>",
"password": "<password>",
"securityToken": "<security token>"
}
}
}
{
"name": "SalesforceInput",
"properties": {
"linkedServiceName": "SalesforceLinkedService",
"type": "RelationalTable",
"typeProperties": {
"tableName": "AllDataType__c"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
Setting external to true informs the Data Factory service that the dataset is external to the data factory and is
not produced by an activity in the data factory.
IMPORTANT
The "__c" part of the API Name is needed for any custom object.
SALESFORCE TYPE .NET-BASED TYPE
Checkbox Boolean
Currency Double
Date DateTime
Date/Time DateTime
Email String
Id String
Number Double
Percent Double
Phone String
Picklist String
Text String
URL String
NOTE
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data
Factory.
The following sample shows the structure section JSON for a table that has three columns userid, name, and
lastlogindate.
"structure":
[
{ "name": "userid"},
{ "name": "name"},
{ "name": "lastlogindate"}
],
Use the following guidelines for when to include structure information and what to include in the
structure section.
For structured data sources that store data schema and type information along with the data itself
(sources like SQL Server, Oracle, Azure table etc.), you should specify the structure section only if you
want to do column mapping of specific source columns to specific columns in sink and their names are not
the same (see details in column mapping section below).
As mentioned above, the type information is optional in structure section. For structured sources, type
information is already available as part of dataset definition in the data store, so you should not include
type information when you do include the structure section.
For schema on read data sources (specifically Azure blob) you can choose to store data without
storing any schema or type information with the data. For these types of data sources you should include
structure in the following 2 cases:
You want to do column mapping.
When the dataset is a source in a Copy activity, you can provide type information in structure and
data factory will use this type information for conversion to native types for the sink. See Move data
to and from Azure Blob article for more information.
Supported .NET-based types
Data factory supports the following CLS-compliant .NET-based type values for providing type information in
structure for schema on read data sources like Azure blob.
Int16
Int32
Int64
Single
Double
Decimal
Byte[]
Bool
String
Guid
Datetime
Datetimeoffset
Timespan
For Datetime & Datetimeoffset you can also optionally specify culture & format string to facilitate parsing
of your custom Datetime string. See sample for type conversion below.
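The referenced type conversion sample is not included in this excerpt. As an illustrative sketch only, a structure section with type, culture, and format hints might look like the following; the column names match the earlier example, and the culture and format values are placeholders:
"structure":
[
    { "name": "userid", "type": "Int64" },
    { "name": "name", "type": "String" },
    { "name": "lastlogindate", "type": "Datetime", "culture": "fr-fr", "format": "ddd-MM-YYYY" }
],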
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises SAP
Business Warehouse (BW). It builds on the Data Movement Activities article, which presents a general overview
of data movement with the copy activity.
You can copy data from an on-premises SAP Business Warehouse data store to any supported sink data store.
For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data
factory currently supports only moving data from an SAP Business Warehouse to other data stores, but not for
moving data from other data stores to an SAP Business Warehouse.
TIP
Put the DLLs extracted from the NetWeaver RFC SDK into the system32 folder.
Getting started
You can create a pipeline with a copy activity that moves data from an on-premises SAP Business Warehouse data
store by using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an on-premises SAP Business Warehouse, see JSON example: Copy data from SAP
Business Warehouse to Azure Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to an SAP BW data store:
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. There are no type-specific properties supported for the SAP BW dataset of type
RelationalTable.
Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article.
Properties such as name, description, input and output tables, and policies are available for all types of activities.
Whereas, properties available in the typeProperties section of the activity vary with each activity type. For
Copy activity, they vary depending on the types of sources and sinks.
When source in copy activity is of type RelationalSource (which includes SAP BW), the following properties
are available in typeProperties section:
IMPORTANT
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See
moving data between on-premises locations and cloud article for step-by-step instructions.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
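The SAP BW linked service referenced by the dataset below (SapBwLinkedService) is not shown in this excerpt. The following is only a sketch of what a linked service of type SapBw typically looks like; treat the property names as assumptions to verify against the SAP BW connector reference, and note that all values are placeholders:
{
    "name": "SapBwLinkedService",
    "properties": {
        "type": "SapBw",
        "typeProperties": {
            "server": "<server name>",
            "systemNumber": "<system number>",
            "clientId": "<client id>",
            "username": "<SAP user>",
            "password": "<password for SAP user>",
            "gatewayName": "<gateway name>"
        }
    }
}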
{
"name": "SapBwDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "SapBwLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
{
"name": "AzureBlobDataSet",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/sapbw/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t"
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
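The pipeline JSON for this SAP BW sample is not included in this excerpt. A hedged sketch follows, assuming a RelationalSource whose query is an MDX query for SAP BW and a BlobSink writing to the AzureBlobDataSet defined above; the pipeline and activity names are placeholders:
{
    "name": "CopySapBwToBlobPipeline",
    "properties": {
        "activities": [
            {
                "name": "SapBwToBlob",
                "type": "Copy",
                "inputs": [ { "name": "SapBwDataset" } ],
                "outputs": [ { "name": "AzureBlobDataSet" } ],
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "<MDX query for SAP BW>"
                    },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 }
            }
        ],
        "start": "2017-06-01T18:00:00Z",
        "end": "2017-06-01T19:00:00Z"
    }
}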
DATA TYPE IN THE ABAP DICTIONARY .NET DATA TYPE
ACCP Int
CHAR String
CLNT String
CURR Decimal
CUKY String
DEC Decimal
FLTP Double
INT1 Byte
INT2 Int16
INT4 Int
LANG String
LCHR String
LRAW Byte[]
PREC Int16
QUAN Decimal
RAW Byte[]
RAWSTRING Byte[]
STRING String
UNIT String
DATS String
NUMC String
TIMS String
NOTE
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data
Factory.
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises SAP
HANA database. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.
You can copy data from an on-premises SAP HANA data store to any supported sink data store. For a list of
data stores supported as sinks by the copy activity, see the Supported data stores table. Data factory currently
supports only moving data from an SAP HANA data store to other data stores, but not moving data from other
data stores to an SAP HANA data store.
Getting started
You can create a pipeline with a copy activity that moves data from an on-premises SAP HANA data store by
using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an on-premises SAP HANA, see JSON example: Copy data from SAP HANA to Azure
Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to an SAP HANA data store:
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. There are no type-specific properties supported for the SAP HANA dataset of type
RelationalTable.
IMPORTANT
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See
moving data between on-premises locations and cloud article for step-by-step instructions.
{
"name": "SapHanaLinkedService",
"properties":
{
"type": "SapHana",
"typeProperties":
{
"server": "<server name>",
"authenticationType": "<Basic, or Windows>",
"username": "<SAP user>",
"password": "<Password for SAP user>",
"gatewayName": "<gateway name>"
}
}
}
{
"name": "SapHanaDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "SapHanaLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
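The Azure Storage linked service, blob output dataset, and pipeline for this SAP HANA sample are not included in this excerpt. As a sketch only, a pipeline could pair a RelationalSource (with a SQL query for SAP HANA) with a BlobSink; the output dataset name AzureBlobDataSet and the other names are placeholders:
{
    "name": "CopySapHanaToBlobPipeline",
    "properties": {
        "activities": [
            {
                "name": "SapHanaToBlob",
                "type": "Copy",
                "inputs": [ { "name": "SapHanaDataset" } ],
                "outputs": [ { "name": "AzureBlobDataSet" } ],
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "<SQL query for SAP HANA>"
                    },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 }
            }
        ],
        "start": "2017-06-01T18:00:00Z",
        "end": "2017-06-01T19:00:00Z"
    }
}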
SAP HANA TYPE .NET BASED TYPE
TINYINT Byte
SMALLINT Int16
INT Int32
BIGINT Int64
REAL Single
DOUBLE Single
DECIMAL Decimal
BOOLEAN Byte
VARCHAR String
NVARCHAR String
CLOB Byte[]
ALPHANUM String
BLOB Byte[]
DATE DateTime
TIME TimeSpan
TIMESTAMP DateTime
SECONDDATE DateTime
Known limitations
There are a few known limitations when copying data from SAP HANA:
NVARCHAR strings are truncated to a maximum length of 4000 Unicode characters
SMALLDECIMAL is not supported
VARBINARY is not supported
Valid Dates are between 1899/12/30 and 9999/12/31
This article outlines how to use the Copy Activity in Azure Data Factory to move data from an on-
premises/cloud SFTP server to a supported sink data store. This article builds on the data movement activities
article that presents a general overview of data movement with copy activity and the list of data stores
supported as sources/sinks.
Data factory currently supports only moving data from an SFTP server to other data stores, but not for moving
data from other data stores to an SFTP server. It supports both on-premises and cloud SFTP servers.
NOTE
Copy Activity does not delete the source file after it is successfully copied to the destination. If you need to delete the
source file after a successful copy, create a custom activity to delete the file and use the activity in the pipeline.
Getting started
You can create a pipeline with a copy activity that moves data from an SFTP source by using different
tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using
Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure
PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial
for step-by-step instructions to create a pipeline with a copy activity. For JSON samples to copy data
from SFTP server to Azure Blob Storage, see JSON Example: Copy data from SFTP server to Azure blob
section of this article.
skipHostKeyValidation: Specify whether to skip host key validation. Required: No. The default value is false.
hostKeyFingerprint: Specify the fingerprint of the host key. Required: Yes if skipHostKeyValidation is set to false.
gatewayName: Name of the Data Management Gateway to connect to an on-premises SFTP server. Required: Yes if copying data from an on-premises SFTP server.
encryptedCredential: Encrypted credential to access the SFTP server. Auto-generated when you specify basic authentication (username + password) or SshPublicKey authentication (username + private key path or content) in the Copy Wizard or the ClickOnce popup dialog. Required: No. Applies only when copying data from an on-premises SFTP server.
{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "mysftpserver",
"port": 22,
"authenticationType": "Basic",
"username": "xxx",
"encryptedCredential": "xxxxxxxxxxxxxxxxx",
"skipHostKeyValidation": false,
"hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
"gatewayName": "mygateway"
}
}
}
privateKeyPath: Specify absolute path to the private key file that gateway can access. Required: Specify either the privateKeyPath or privateKeyContent.
privateKeyContent: A serialized string of the private key content. The Copy Wizard can read the private key file and extract the private key content automatically. If you are using any other tool/SDK, use the privateKeyPath property instead. Required: Specify either the privateKeyPath or privateKeyContent.
passPhrase: Specify the pass phrase/password to decrypt the private key if the key file is protected by a pass phrase. Required: Yes if the private key file is protected by a pass phrase.
NOTE
The SFTP connector supports only OpenSSH format keys. Make sure your key file is in the proper format. You can use
the PuTTY tool to convert from .ppk to OpenSSH format.
{
"name": "SftpLinkedServiceWithPrivateKeyPath",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "mysftpserver",
"port": 22,
"authenticationType": "SshPublicKey",
"username": "xxx",
"privateKeyPath": "D:\\privatekey_openssh",
"passPhrase": "xxx",
"skipHostKeyValidation": true,
"gatewayName": "mygateway"
}
}
}
{
"name": "SftpLinkedServiceWithPrivateKeyContent",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "mysftpserver.westus.cloudapp.azure.com",
"port": 22,
"authenticationType": "SshPublicKey",
"username": "xxx",
"privateKeyContent": "<base64 string of the private key content>",
"passPhrase": "xxx",
"skipHostKeyValidation": true
}
}
}
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types.
The typeProperties section is different for each type of dataset. It provides information that is specific to the
dataset type. The typeProperties section for a dataset of type FileShare has the following properties:
NOTE
filename and fileFilter cannot be used simultaneously.
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
{ "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],
In this example {Slice} is replaced with the value of Data Factory system variable SliceStart in the format
(YYYYMMDDHH) specified. The SliceStart refers to start time of the slice. The folderPath is different for each
slice. Example: wikidatagateway/wikisampledataout/2014100103 or
wikidatagateway/wikisampledataout/2014100104.
Sample 2:
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by
folderPath and fileName properties.
Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article.
Properties such as name, description, input and output tables, and policies are available for all types of
activities.
Whereas, the properties available in the typeProperties section of the activity vary with each activity type. For
Copy activity, the type properties vary depending on the types of sources and sinks.
In Copy Activity, when source is of type FileSystemSource, the following properties are available in
typeProperties section:
IMPORTANT
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See
moving data between on-premises locations and cloud article for step-by-step instructions.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"name": "SFTPFileInput",
"properties": {
"type": "FileShare",
"linkedServiceName": "SftpLinkedService",
"typeProperties": {
"folderPath": "mysharedfolder",
"fileName": "test.csv"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
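The pipeline JSON for this SFTP sample is not included in this excerpt. A minimal sketch, assuming an output blob dataset named AzureBlobOutput (hypothetical), pairs a FileSystemSource with a BlobSink:
{
    "name": "CopySftpToBlobPipeline",
    "properties": {
        "activities": [
            {
                "name": "SftpToBlob",
                "type": "Copy",
                "inputs": [ { "name": "SFTPFileInput" } ],
                "outputs": [ { "name": "AzureBlobOutput" } ],
                "typeProperties": {
                    "source": { "type": "FileSystemSource" },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 }
            }
        ],
        "start": "2017-06-01T18:00:00Z",
        "end": "2017-06-01T19:00:00Z"
    }
}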
Next Steps
See the following articles:
Copy Activity tutorial for step-by-step instructions for creating a pipeline with a Copy Activity.
Move data to and from SQL Server on-premises
or on IaaS (Azure VM) using Azure Data Factory
6/27/2017 18 min to read Edit Online
This article explains how to use the Copy Activity in Azure Data Factory to move data to/from an on-premises
SQL Server database. It builds on the Data Movement Activities article, which presents a general overview of
data movement with the copy activity.
Supported scenarios
You can copy data from a SQL Server database to the following data stores:
You can copy data from the following data stores to a SQL Server database:
NoSQL: Cassandra, MongoDB
File: Amazon S3, File System, FTP, HDFS, SFTP
Enabling connectivity
The concepts and steps needed for connecting with SQL Server hosted on-premises or in Azure IaaS
(Infrastructure-as-a-Service) VMs are the same. In both cases, you need to use Data Management Gateway
for connectivity.
See moving data between on-premises locations and cloud article to learn about Data Management Gateway
and step-by-step instructions on setting up the gateway. Setting up a gateway instance is a pre-requisite for
connecting with SQL Server.
While you can install the gateway on the same on-premises machine or cloud VM instance as SQL Server, we
recommend that you install it on a separate machine for better performance. Having the gateway and
SQL Server on separate machines reduces resource contention.
Getting started
You can create a pipeline with a copy activity that moves data to/from an on-premises SQL Server database
by using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from
a source data store to a sink data store:
1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are
copying data from a SQL Server database to an Azure blob storage, you create two linked services to link
your SQL Server database and Azure storage account to your data factory. For linked service properties
that are specific to SQL Server database, see linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in
the last step, you create a dataset to specify the SQL table in your SQL Server database that contains the
input data. And, you create another dataset to specify the blob container and the folder that holds the data
copied from the SQL Server database. For dataset properties that are specific to SQL Server database, see
dataset properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the
example mentioned earlier, you use SqlSource as a source and BlobSink as a sink for the copy activity.
Similarly, if you are copying from Azure Blob Storage to SQL Server Database, you use BlobSource and
SqlSink in the copy activity. For copy activity properties that are specific to SQL Server Database, see copy
activity properties section. For details on how to use a data store as a source or a sink, click the link in the
previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that
are used to copy data to/from an on-premises SQL Server database, see JSON examples section of this
article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to SQL Server:
You can encrypt credentials using the New-AzureRmDataFactoryEncryptValue cmdlet and use them in
the connection string as shown in the following example (EncryptedCredential property):
"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated
Security=True;EncryptedCredential=<encrypted credential>",
Samples
JSON for using SQL Authentication
{
"name": "MyOnPremisesSQLDB",
"properties":
{
"type": "OnPremisesSqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated
Security=False;User ID=<username>;Password=<password>;",
"gatewayName": "<gateway name>"
}
}
}
JSON for using Windows Authentication
{
"Name": "MyOnPremisesSQLDB",
"Properties":
{
"type": "OnPremisesSqlServer",
"typeProperties": {
"ConnectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated
Security=True;",
"username": "<domain\\username>",
"password": "<password>",
"gatewayName": "<gateway name>"
}
}
}
Dataset properties
In the samples, you have used a dataset of type SqlServerTable to represent a table in a SQL Server
database.
For a full list of sections & properties available for defining datasets, see the Creating datasets article.
Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (SQL
Server, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for the dataset of type SqlServerTable has the
following properties:
NOTE
The Copy Activity takes only one input and produces only one output.
Whereas, properties available in the typeProperties section of the activity vary with each activity type. For
Copy activity, they vary depending on the types of sources and sinks.
SqlSource
When source in a copy activity is of type SqlSource, the following properties are available in typeProperties
section:
If the sqlReaderQuery is specified for the SqlSource, the Copy Activity runs this query against the SQL
Server Database source to get the data.
Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and
storedProcedureParameters (if the stored procedure takes parameters).
If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the
structure section are used to build a select query to run against the SQL Server Database. If the dataset
definition does not have the structure, all columns are selected from the table.
NOTE
When you use sqlReaderStoredProcedureName, you still need to specify a value for the tableName property in
the dataset JSON. There are no validations performed against this table though.
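As an illustrative sketch (not from the original article), a SqlSource that calls a stored procedure instead of a query could look like the following; the procedure and parameter names are hypothetical:
"source": {
    "type": "SqlSource",
    "sqlReaderStoredProcedureName": "usp_GetChangedRows",
    "storedProcedureParameters": {
        "sliceStart": { "value": "$$Text.Format('{0:yyyy-MM-dd HH:mm}', WindowStart)" },
        "sliceEnd": { "value": "$$Text.Format('{0:yyyy-MM-dd HH:mm}', WindowEnd)" }
    }
}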
SqlSink
SqlSink supports the following properties:
writeBatchSize: Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed value: Integer (number of rows). Required: No (default: 10000).
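To show where this sink property fits (again a sketch, not part of the original samples), a copy activity that loads blob data into SQL Server could carry the following typeProperties; the writeBatchSize value is illustrative:
"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": {
        "type": "SqlSink",
        "writeBatchSize": 10000
    }
}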
{
"Name": "SqlServerLinkedService",
"properties": {
"type": "OnPremisesSqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated
Security=False;User ID=<username>;Password=<password>;",
"gatewayName": "<gatewayname>"
}
}
}
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
In this example, sqlReaderQuery is specified for the SqlSource. The Copy Activity runs this query against the
SQL Server Database source to get the data. Alternatively, you can specify a stored procedure by specifying
the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes
parameters). The sqlReaderQuery can reference multiple tables within the database referenced by the input
dataset. It is not limited to only the table set as the dataset's tableName typeProperty.
If you do not specify sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the
structure section are used to build a select query to run against the SQL Server Database. If the dataset
definition does not have the structure, all columns are selected from the table.
See the Sql Source section and BlobSink for the list of properties supported by SqlSource and BlobSink.
{
"Name": "SqlServerLinkedService",
"properties": {
"type": "OnPremisesSqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated
Security=False;User ID=<username>;Password=<password>;",
"gatewayName": "<gatewayname>"
}
}
}
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
{
"name": "SqlServerOutput",
"properties": {
"type": "SqlServerTable",
"linkedServiceName": "SqlServerLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
See Enable or Disable a Server Network Protocol for details and alternate ways of enabling TCP/IP
protocol.
3. In the same window, double-click TCP/IP to launch TCP/IP Properties window.
4. Switch to the IP Addresses tab. Scroll down to see IPAll section. Note down the TCP Port (default is
1433).
5. Create a rule for the Windows Firewall on the machine to allow incoming traffic through this port.
6. Verify connection: To connect to the SQL Server using fully qualified name, use SQL Server
Management Studio from a different machine. For example: "<machine>.<domain>.corp.<company>.com,1433".
IMPORTANT
See Move data between on-premises sources and the cloud with Data Management Gateway for detailed
information.
See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.
Destination table:
{
"name": "SampleSource",
"properties": {
"published": false,
"type": " SqlServerTable",
"linkedServiceName": "TestIdentitySQL",
"typeProperties": {
"tableName": "SourceTbl"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
}
Notice that your source and target tables have different schemas (the target has an additional identity
column). In this scenario, you need to specify the structure property in the target dataset definition, which
doesn't include the identity column.
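A minimal sketch of such a target dataset follows, assuming hypothetical column names name and age and a target table named TargetTbl; note that the identity column is deliberately left out of the structure section:
{
    "name": "SampleTarget",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": "TestIdentitySQL",
        "structure": [
            { "name": "name" },
            { "name": "age" }
        ],
        "typeProperties": {
            "tableName": "TargetTbl"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}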
SQL SERVER DATABASE ENGINE TYPE .NET FRAMEWORK TYPE
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
sql_variant Object *
time TimeSpan
timestamp Byte[]
tinyint Byte
uniqueidentifier Guid
varbinary Byte[]
xml Xml
Repeatable copy
When copying data to SQL Server Database, the copy activity appends data to the sink table by default. To
perform an UPSERT instead, see the Repeatable write to SqlSink article.
When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In
Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a
slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same
data is read no matter how many times a slice is run. See Repeatable read from relational sources.
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises
Sybase database. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.
You can copy data from an on-premises Sybase data store to any supported sink data store. For a list of data
stores supported as sinks by the copy activity, see the Supported data stores table. Data factory currently
supports only moving data from a Sybase data store to other data stores, but not for moving data from other
data stores to a Sybase data store.
Prerequisites
Data Factory service supports connecting to on-premises Sybase sources using the Data Management
Gateway. See moving data between on-premises locations and cloud article to learn about Data Management
Gateway and step-by-step instructions on setting up the gateway.
Gateway is required even if the Sybase database is hosted in an Azure IaaS VM. You can install the gateway on
the same IaaS VM as the data store or on a different VM as long as the gateway can connect to the database.
NOTE
See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.
Getting started
You can create a pipeline with a copy activity that moves data from an on-premises Sybase data store by
using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an on-premises Sybase data store, see JSON example: Copy data from Sybase to Azure
Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to a Sybase data store:
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for dataset of type RelationalTable (which includes
Sybase dataset) has the following properties:
query: Use the custom query to read data. SQL query string, for example: select * from MyTable. Required: No (if tableName of the dataset is specified).
{
"name": "OnPremSybaseLinkedService",
"properties": {
"type": "OnPremisesSybase",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"schema": "<schema>",
"authenticationType": "<authentication type>",
"username": "<username>",
"password": "<password>",
"gatewayName": "<gatewayName>"
}
}
}
{
"name": "SybaseDataSet",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "OnPremSybaseLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
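The Azure Storage linked service, blob output dataset, and pipeline for this Sybase sample are not included in this excerpt. As a sketch, assuming an output blob dataset named AzureBlobSybaseDataSet (hypothetical), the pipeline would use a RelationalSource with a query and a BlobSink:
{
    "name": "CopySybaseToBlobPipeline",
    "properties": {
        "activities": [
            {
                "name": "SybaseToBlob",
                "type": "Copy",
                "inputs": [ { "name": "SybaseDataSet" } ],
                "outputs": [ { "name": "AzureBlobSybaseDataSet" } ],
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "select * from MyTable"
                    },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 }
            }
        ],
        "start": "2017-06-01T18:00:00Z",
        "end": "2017-06-01T19:00:00Z"
    }
}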
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises
Teradata database. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.
You can copy data from an on-premises Teradata data store to any supported sink data store. For a list of data
stores supported as sinks by the copy activity, see the Supported data stores table. Data factory currently
supports only moving data from a Teradata data store to other data stores, but not for moving data from other
data stores to a Teradata data store.
Prerequisites
Data factory supports connecting to on-premises Teradata sources via the Data Management Gateway. See
moving data between on-premises locations and cloud article to learn about Data Management Gateway and
step-by-step instructions on setting up the gateway.
Gateway is required even if the Teradata database is hosted in an Azure IaaS VM. You can install the gateway on the
same IaaS VM as the data store or on a different VM as long as the gateway can connect to the database.
NOTE
See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.
Getting started
You can create a pipeline with a copy activity that moves data from an on-premises Teradata data store by
using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an on-premises Teradata data store, see JSON example: Copy data from Teradata to
Azure Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to a Teradata data store:
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. Currently, there are no type properties supported for the Teradata dataset.
query: Use the custom query to read data. SQL query string, for example: select * from MyTable. Required: Yes.
{
"name": "OnPremTeradataLinkedService",
"properties": {
"type": "OnPremisesTeradata",
"typeProperties": {
"server": "<server>",
"authenticationType": "<authentication type>",
"username": "<username>",
"password": "<password>",
"gatewayName": "<gatewayName>"
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorageLinkedService",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey=
<AccountKey>"
}
}
}
{
"name": "TeradataDataSet",
"properties": {
"published": false,
"type": "RelationalTable",
"linkedServiceName": "OnPremTeradataLinkedService",
"typeProperties": {
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
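The blob output dataset and pipeline for this Teradata sample are not included in this excerpt. As a sketch, assuming an output blob dataset named AzureBlobTeradataDataSet (hypothetical), the pipeline would use a RelationalSource with a query and a BlobSink:
{
    "name": "CopyTeradataToBlobPipeline",
    "properties": {
        "activities": [
            {
                "name": "TeradataToBlob",
                "type": "Copy",
                "inputs": [ { "name": "TeradataDataSet" } ],
                "outputs": [ { "name": "AzureBlobTeradataDataSet" } ],
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "select * from MyTable"
                    },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 }
            }
        ],
        "start": "2017-06-01T18:00:00Z",
        "end": "2017-06-01T19:00:00Z"
    }
}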
TERADATA DATABASE TYPE .NET FRAMEWORK TYPE
Char String
Clob String
Graphic String
VarChar String
VarGraphic String
Blob Byte[]
Byte Byte[]
VarByte Byte[]
BigInt Int64
ByteInt Int16
Decimal Decimal
Double Double
Integer Int32
Number Double
SmallInt Int16
Date DateTime
Time TimeSpan
Timestamp DateTime
Period(Date) String
Period(Time) String
Period(Timestamp) String
Xml String
This article outlines how to use the Copy Activity in Azure Data Factory to move data from a table in a Web
page to a supported sink data store. This article builds on the data movement activities article that presents a
general overview of data movement with copy activity and the list of data stores supported as sources/sinks.
Data factory currently supports only moving data from a Web table to other data stores, but not moving data
from other data stores to a Web table destination.
IMPORTANT
This Web connector currently supports only extracting table content from an HTML page. To retrieve data from an
HTTP/HTTPS endpoint, use the HTTP connector instead.
Getting started
You can create a pipeline with a copy activity that moves data from a Web table by
using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from a web table, see JSON example: Copy data from Web table to Azure Blob section of this
article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to a Web table:
{
"name": "web",
"properties":
{
"type": "Web",
"typeProperties":
{
"authenticationType": "Anonymous",
"url" : "https://en.wikipedia.org/wiki/"
}
}
}
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure
blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location
of the data in the data store. The typeProperties section for a dataset of type WebTable has the following
properties:
index: The index of the table in the resource. Required: Yes.
path: A relative URL to the resource that contains the table. Required: No. When path is not specified, only the URL specified in the linked service definition is used.
Example:
{
"name": "WebTableInput",
"properties": {
"type": "WebTable",
"linkedServiceName": "WebLinkedService",
"typeProperties": {
"index": 1,
"path": "AFI's_100_Years...100_Movies"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
WebTable input dataset
Setting external to true informs the Data Factory service that the dataset is external to the data factory and is
not produced by an activity in the data factory.
NOTE
See Get index of a table in an HTML page section for steps to getting index of a table in an HTML page.
{
"name": "WebTableInput",
"properties": {
"type": "WebTable",
"linkedServiceName": "WebLinkedService",
"typeProperties": {
"index": 1,
"path": "AFI's_100_Years...100_Movies"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
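The pipeline JSON for this Web table sample is not included in this excerpt. As a sketch, assuming a source type of WebSource (to be verified against the Web table connector reference) and a hypothetical output blob dataset named WebTableOutput:
{
    "name": "CopyWebTableToBlobPipeline",
    "properties": {
        "activities": [
            {
                "name": "WebTableToBlob",
                "type": "Copy",
                "inputs": [ { "name": "WebTableInput" } ],
                "outputs": [ { "name": "WebTableOutput" } ],
                "typeProperties": {
                    "source": { "type": "WebSource" },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 }
            }
        ],
        "start": "2017-06-01T18:00:00Z",
        "end": "2017-06-01T19:00:00Z"
    }
}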
6. In the Query Editor window, click Advanced Editor button on the toolbar.
7. In the Advanced Editor dialog box, the number next to "Source" is the index.
If you are using Excel 2013, use Microsoft Power Query for Excel to get the index. See Connect to a web page
article for details. The steps are similar if you are using Microsoft Power BI for Desktop.
NOTE
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data
Factory.
Performance and Tuning
See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data
movement (Copy Activity) in Azure Data Factory and various ways to optimize it.
Data Management Gateway
8/21/2017 25 min to read Edit Online
The Data management gateway is a client agent that you must install in your on-premises environment to
copy data between cloud and on-premises data stores. The on-premises data stores supported by Data
Factory are listed in the Supported data sources section.
This article complements the walkthrough in the Move data between on-premises and cloud data stores
article. In the walkthrough, you create a pipeline that uses the gateway to move data from an on-premises
SQL Server database to an Azure blob. This article provides detailed in-depth information about the data
management gateway.
You can scale out a data management gateway by associating multiple on-premises machines with the
gateway. You can scale up by increasing number of data movement jobs that can run concurrently on a node.
This feature is also available for a logical gateway with a single node. See Scaling data management gateway
in Azure Data Factory article for details.
NOTE
Currently, gateway supports only the copy activity and stored procedure activity in Data Factory. It is not possible to
use the gateway from a custom activity to access on-premises data sources.
Overview
Capabilities of data management gateway
Data management gateway provides the following capabilities:
Model on-premises data sources and cloud data sources within the same data factory and move data.
Have a single pane of glass for monitoring and management with visibility into gateway status from the
Data Factory page.
Manage access to on-premises data sources securely.
No changes required to corporate firewall. Gateway only makes outbound HTTP-based connections
to open internet.
Encrypt credentials for your on-premises data stores with your certificate.
Move data efficiently - data is transferred in parallel, and transfers are resilient to intermittent network issues
with auto-retry logic.
Command flow and data flow
When you use a copy activity to copy data between on-premises and cloud, the activity uses a gateway to
transfer data from on-premises data source to cloud and vice versa.
Here is the high-level data flow and summary of steps for copy with the data management gateway:
1. Data developer creates a gateway for an Azure Data Factory using either the Azure portal or PowerShell
Cmdlet.
2. Data developer creates a linked service for an on-premises data store by specifying the gateway. As part
of setting up the linked service, data developer uses the Setting Credentials application to specify
authentication types and credentials. The Setting Credentials application dialog communicates with the
data store to test connection and the gateway to save credentials.
3. Gateway encrypts the credentials with the certificate associated with the gateway (supplied by data
developer), before saving the credentials in the cloud.
4. Data Factory service communicates with the gateway for scheduling & management of jobs via a control
channel that uses a shared Azure service bus queue. When a copy activity job needs to be kicked off, Data
Factory queues the request along with credential information. Gateway kicks off the job after polling the
queue.
5. The gateway decrypts the credentials with the same certificate and then connects to the on-premises data
store with proper authentication type and credentials.
6. The gateway copies data from an on-premises store to a cloud storage, or vice versa depending on how
the Copy Activity is configured in the data pipeline. For this step, the gateway directly communicates with
cloud-based storage services such as Azure Blob Storage over a secure (HTTPS) channel.
Considerations for using gateway
A single instance of data management gateway can be used for multiple on-premises data sources.
However, a single gateway instance is tied to only one Azure data factory and cannot be shared
with another data factory.
You can have only one instance of data management gateway installed on a single machine.
Suppose you have two data factories that need to access on-premises data sources; you need to install
gateways on two on-premises computers. In other words, a gateway is tied to a specific data factory.
The gateway does not need to be on the same machine as the data source. However, having
gateway closer to the data source reduces the time for the gateway to connect to the data source. We
recommend that you install the gateway on a machine that is different from the one that hosts on-
premises data source. When the gateway and data source are on different machines, the gateway does
not compete for resources with data source.
You can have multiple gateways on different machines connecting to the same on-premises data
source. For example, you may have two gateways serving two data factories but the same on-premises
data source is registered with both the data factories.
If you already have a gateway installed on your computer serving a Power BI scenario, install a separate
gateway for Azure Data Factory on another machine.
Gateway must be used even when you use ExpressRoute.
Treat your data source as an on-premises data source (that is behind a firewall) even when you use
ExpressRoute. Use the gateway to establish connectivity between the service and the data source.
You must use the gateway even if the data store is in the cloud on an Azure IaaS VM.
Installation
Prerequisites
The supported Operating System versions are Windows 7, Windows 8/8.1, Windows 10, Windows
Server 2008 R2, Windows Server 2012, Windows Server 2012 R2. Installation of the data management
gateway on a domain controller is currently not supported.
.NET Framework 4.5.1 or above is required. If you are installing gateway on a Windows 7 machine, install
.NET Framework 4.5 or later. See .NET Framework System Requirements for details.
The recommended configuration for the gateway machine is at least 2 GHz, 4 cores, 8-GB RAM, and 80-
GB disk.
If the host machine hibernates, the gateway does not respond to data requests. Therefore, configure an
appropriate power plan on the computer before installing the gateway. If the machine is configured to
hibernate, the gateway installation prompts a message.
You must be an administrator on the machine to install and configure the data management gateway
successfully. You can add additional users to the data management gateway Users local Windows
group. The members of this group are able to use the Data Management Gateway Configuration
Manager tool to configure the gateway.
As copy activity runs happen on a specific frequency, the resource usage (CPU, memory) on the machine also
follows the same pattern with peak and idle times. Resource utilization also depends heavily on the amount
of data being moved. When multiple copy jobs are in progress, you see resource usage go up during peak
times.
Installation options
Data management gateway can be installed in the following ways:
By downloading an MSI setup package from the Microsoft Download Center. The MSI can also be used to
upgrade existing data management gateway to the latest version, with all settings preserved.
By clicking Download and install data gateway link under MANUAL SETUP or Install directly on this
computer under EXPRESS SETUP. See Move data between on-premises and cloud article for step-by-step
instructions on using express setup. The manual step takes you to the download center. The instructions
for downloading and installing the gateway from download center are in the next section.
Installation best practices:
1. Configure power plan on the host machine for the gateway so that the machine does not hibernate. If the
host machine hibernates, the gateway does not respond to data requests.
2. Back up the certificate associated with the gateway.
Install the gateway from download center
1. Navigate to Microsoft Data Management Gateway download page.
2. Click Download, select the appropriate version (32-bit vs. 64-bit), and click Next.
3. Run the MSI directly or save it to your hard disk and run.
4. On the Welcome page, select a language and click Next.
5. Accept the End-User License Agreement and click Next.
6. Select folder to install the gateway and click Next.
7. On the Ready to install page, click Install.
8. Click Finish to complete installation.
9. Get the key from the Azure portal. See the next section for step-by-step instructions.
10. On the Register gateway page of Data Management Gateway Configuration Manager running on
your machine, do the following steps:
a. Paste the key in the text box.
b. Optionally, click Show gateway key to see the key text.
c. Click Register.
Register gateway using key
If you haven't already created a logical gateway in the portal
To create a gateway in the portal and get the key from the Configure page, follow the steps from the walkthrough
in the Move data between on-premises and cloud article.
If you have already created the logical gateway in the portal
1. In Azure portal, navigate to the Data Factory page, and click Linked Services tile.
2. In the Linked Services page, select the logical gateway you created in the portal.
3. In the Data Gateway page, click Download and install data gateway.
4. In the Configure page, click Recreate key. Click Yes on the warning message after reading it
carefully.
5. Click Copy button next to the key. The key is copied to the clipboard.
If you move the cursor over the system tray icon/notification message, you see details about the state of the
gateway/update operation in a popup window.
Ports and firewall
There are two firewalls you need to consider: corporate firewall running on the central router of the
organization, and Windows firewall configured as a daemon on the local machine where the gateway is
installed.
At the corporate firewall level, you need to configure the following domains and outbound ports:
At the Windows firewall level, these outbound ports are normally enabled. If not, you can configure the domains
and ports accordingly on the gateway machine.
NOTE
1. Based on your sources/sinks, you may have to whitelist additional domains and outbound ports in your
corporate/Windows firewall.
2. For some cloud databases (for example: Azure SQL Database, Azure Data Lake, etc.), you may need to whitelist the IP
address of the gateway machine on their firewall configuration.
NOTE
If your firewall does not allow outbound port 1433, Gateway can't access Azure SQL directly. In this case, you may use
Staged Copy to SQL Azure Database/ SQL Azure DW. In this scenario, you would only require HTTPS (port 443) for
the data movement.
Gateway uses the proxy server to connect to the cloud service. Click Change link during initial setup. You see
the proxy setting dialog.
NOTE
If you set up a proxy server with NTLM authentication, Gateway Host Service runs under the domain account. If you
change the password for the domain account later, remember to update configuration settings for the service and
restart it accordingly. Due to this requirement, we suggest you use a dedicated domain account to access the proxy
server that does not require you to update the password frequently.
<system.net>
<defaultProxy useDefaultCredentials="true" />
</system.net>
You can then add proxy server details as shown in the following example:
<system.net>
<defaultProxy enabled="true">
<proxy bypassonlocal="true" proxyaddress="http://proxy.domain.org:8888/" />
</defaultProxy>
</system.net>
Additional properties are allowed inside the proxy tag to specify the required settings like
scriptLocation. Refer to proxy Element (Network Settings) for syntax.
3. Save the configuration file into the original location, then restart the Data Management Gateway Host
service, which picks up the changes. To restart the service: use services applet from the control panel, or
from the Data Management Gateway Configuration Manager > click the Stop Service button, then
click the Start Service. If the service does not start, it is likely that an incorrect XML tag syntax has been
added into the application configuration file that was edited.
IMPORTANT
Do not forget to update both diahost.exe.config and diawp.exe.config.
In addition to these points, you also need to make sure Microsoft Azure is in your company's whitelist. The
list of valid Microsoft Azure IP addresses can be downloaded from the Microsoft Download Center.
Possible symptoms for firewall and proxy server-related issues
If you encounter errors similar to the following ones, it is likely due to improper configuration of the firewall
or proxy server, which blocks gateway from connecting to Data Factory to authenticate itself. Refer to
previous section to ensure your firewall and proxy server are properly configured.
1. When you try to register the gateway, you receive the following error: "Failed to register the gateway key.
Before trying to register the gateway key again, confirm that the data management gateway is in a
connected state and the Data Management Gateway Host Service is Started."
2. When you open Configuration Manager, you see status as Disconnected or Connecting. When viewing
Windows event logs, under Event Viewer > Application and Services Logs > Data Management
Gateway, you see error messages such as the following error: Unable to connect to the remote server
A component of Data Management Gateway has become unresponsive and restarts automatically. Component
name: Gateway.
If you choose not to open the port 8050 on the gateway machine, use mechanisms other than using the
Setting Credentials application to configure data store credentials. For example, you could use New-
AzureRmDataFactoryEncryptValue PowerShell cmdlet. See Setting Credentials and Security section on how
data store credentials can be set.
Update
By default, the data management gateway is automatically updated when a newer version of the gateway is
available. The gateway is not updated until all the scheduled tasks are done, and no further tasks are processed
by the gateway until the update operation is completed. If the update fails, the gateway is rolled back to the old
version.
You see the scheduled update time in the following places:
The gateway properties page in the Azure portal.
Home page of the Data Management Gateway Configuration Manager
System tray notification message.
The Home tab of the Data Management Gateway Configuration Manager displays the update schedule and
the last time the gateway was installed/updated.
You can install the update right away or wait for the gateway to be automatically updated at the scheduled
time. For example, the following image shows you the notification message shown in the Gateway
Configuration Manager along with the Update button that you can click to install it immediately.
The notification message in the system tray would look as shown in the following image:
You see the status of update operation (manual or automatic) in the system tray. When you launch Gateway
Configuration Manager next time, you see a message on the notification bar that the gateway has been
updated along with a link to what's new topic.
To disable/enable the auto-update feature
You can disable/enable the auto-update feature by doing the following steps:
[For single node gateway]
1. Launch Windows PowerShell on the gateway machine.
2. Switch to the C:\Program Files\Microsoft Data Management Gateway\2.0\PowerShellScript folder.
3. Run the following command to turn the auto-update feature OFF (disable):
.\GatewayAutoUpdateToggle.ps1 -off
To turn the feature back ON (enable), run the same script with the -on switch:
.\GatewayAutoUpdateToggle.ps1 -on
Configuration Manager
Once you install the gateway, you can launch Data Management Gateway Configuration Manager in one of
the following ways:
1. In the Search window, type Data Management Gateway to access this utility.
2. Run the executable ConfigManager.exe in the folder: C:\Program Files\Microsoft Data Management
Gateway\2.0\Shared
Home page
The Home page allows you to do the following actions:
View status of the gateway (connected to the cloud service etc.).
Register using a key from the portal.
Stop and start the Data Management Gateway Host service on the gateway machine.
Schedule updates at a specific time of the day.
View the date when the gateway was last updated.
Settings page
The Settings page allows you to do the following actions:
View, change, and export the certificate used by the gateway. This certificate is used to encrypt data source
credentials.
Change the HTTPS port for the endpoint. The gateway opens a port for setting the data source credentials.
View the status of the endpoint.
View the SSL certificate used for SSL communication between the portal and the gateway to set credentials
for data sources.
Diagnostics page
The Diagnostics page allows you to do the following actions:
Enable verbose logging, view logs in event viewer, and send logs to Microsoft if there was a failure.
Test connection to a data source.
Help page
The Help page displays the following information:
Brief description of the gateway
Version number
Links to online help, privacy statement, and license agreement.
3. In the Gateway page, you can see the memory and CPU usage of the gateway.
4. Enable Advanced settings to see more details such as network usage.
The following table provides descriptions of columns in the Gateway Nodes list:
Name: Name of the logical gateway and the nodes associated with the gateway. A node is an on-premises Windows machine that has the gateway installed on it. For information on having more than one node (up to four nodes) in a single logical gateway, see Data Management Gateway - high availability and scalability.
Version: The version of the logical gateway and of each gateway node. The version of the logical gateway is determined by the version of the majority of nodes in the group. If there are nodes with different versions in the logical gateway setup, only the nodes with the same version number as the logical gateway function properly. The others are in limited mode and need to be manually updated (only in case auto-update fails).
CPU utilization: CPU utilization of a gateway node. This value is a near real-time snapshot.
Concurrent Jobs (Running/ Limit): Number of jobs or tasks running on each node. This value is a near real-time snapshot. Limit signifies the maximum concurrent jobs for each node. This value is defined based on the machine size. You can increase the limit to scale up concurrent job execution in advanced scenarios where CPU, memory, or network is under-utilized but activities are timing out. This capability is also available with a single-node gateway (even when the scalability and availability feature is not enabled).
In this page, you see some settings that make more sense when there are two or more nodes (scale out
scenario) in the gateway. See Data Management Gateway - high availability and scalability for details about
setting up a multi-node gateway.
Gateway status
The following table provides possible statuses of a gateway node:
STATUS COMMENTS/SCENARIOS
The following table provides possible statuses of a logical gateway. The gateway status depends on
statuses of the gateway nodes.
Limited: Not all nodes in this gateway are in a healthy state. This status is a warning that some nodes might be down.
Scale up gateway
You can configure the number of concurrent data movement jobs that can run on a node to scale up the
capability of moving data between on-premises and cloud data stores.
When the available memory and CPU are not utilized well, but the idle capacity is 0, you should scale up by
increasing the number of concurrent jobs that can run on a node. You may also want to scale up when
activities are timing out because the gateway is overloaded. In the advanced settings of a gateway node, you
can increase the maximum capacity for a node.
3. In the Data gateway page, click Download and install data gateway.
4. In the Configure page, click Download and install data gateway, and follow instructions to install
the data gateway on the machine.
5. Keep the Microsoft Data Management Gateway Configuration Manager open.
6. In the Configure page in the portal, click Recreate key on the command bar, and click Yes for the
warning message. Click the copy button next to the key text to copy the key to the clipboard. The gateway
on the old machine stops functioning as soon as you recreate the key.
7. Paste the key into the text box in the Register Gateway page of the Data Management Gateway
Configuration Manager on your machine. (Optional) Select the Show gateway key check box to see the
key text.
8. Click Register to register the gateway with the cloud service.
9. On the Settings tab, click Change to select the same certificate that was used with the old gateway,
enter the password, and click Finish.
You can export a certificate from the old gateway by doing the following steps: launch Data
Management Gateway Configuration Manager on the old machine, switch to the Certificate tab, click
Export button and follow the instructions.
10. After successful registration of the gateway, you should see the Registration set to Registered and
Status set to Started on the Home page of the Gateway Configuration Manager.
Encrypting credentials
To encrypt credentials in the Data Factory Editor, do the following steps:
1. Launch web browser on the gateway machine, navigate to Azure portal. Search for your data factory if
needed, open data factory in the DATA FACTORY page and then click Author & Deploy to launch Data
Factory Editor.
2. Click an existing linked service in the tree view to see its JSON definition or create a linked service that
requires a data management gateway (for example: SQL Server or Oracle).
3. In the JSON editor, for the gatewayName property, enter the name of the gateway.
4. Enter server name for the Data Source property in the connectionString.
5. Enter database name for the Initial Catalog property in the connectionString.
6. Click the Encrypt button on the command bar to launch the ClickOnce Credential Manager
application. You should see the Setting Credentials dialog box.
{
"name": "SqlServerLinkedService",
"properties": {
"type": "OnPremisesSqlServer",
"description": "",
"typeProperties": {
"connectionString": "data source=myserver;initial catalog=mydatabase;Integrated
Security=False;EncryptedCredential=eyJDb25uZWN0aW9uU3R",
"gatewayName": "adftutorialgateway"
}
}
}
If you access the portal from a machine that is different from the gateway machine, you must make
sure that the Credentials Manager application can connect to the gateway machine. If the application
cannot reach the gateway machine, it does not allow you to set credentials for the data source and to
test connection to the data source.
When you use the Setting Credentials application, the portal encrypts the credentials with the certificate
specified in the Certificate tab of the Gateway Configuration Manager on the gateway machine.
If you are looking for an API-based approach for encrypting the credentials, you can use the New-
AzureRmDataFactoryEncryptValue PowerShell cmdlet to encrypt credentials. The cmdlet uses the certificate
that gateway is configured to use to encrypt the credentials. You add encrypted credentials to the
EncryptedCredential element of the connectionString in the JSON. You use the JSON with the New-
AzureRmDataFactoryLinkedService cmdlet or in the Data Factory Editor.
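If you take that API-based route, deploying the JSON (with the EncryptedCredential already embedded, as in the example above) is a single cmdlet call. Here is a minimal sketch; the file path, resource group, and data factory names are illustrative.
# Deploy a linked service definition whose connectionString already contains the
# EncryptedCredential produced by New-AzureRmDataFactoryEncryptValue.
# File path, resource group, and data factory names are illustrative.
New-AzureRmDataFactoryLinkedService `
    -ResourceGroupName "ADFTutorialResourceGroup" `
    -DataFactoryName "ADFTutorialOnPremDF" `
    -File "C:\adf\SqlServerLinkedService.json"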
There is one more approach for setting credentials using Data Factory Editor. If you create a SQL Server
linked service by using the editor and you enter credentials in plain text, the credentials are encrypted using a
certificate that the Data Factory service owns. It does NOT use the certificate that gateway is configured to
use. While this approach might be a little faster in some cases, it is less secure. Therefore, we recommend that
you follow this approach only for development/testing purposes.
PowerShell cmdlets
This section describes how to create and register a gateway using Azure PowerShell cmdlets.
1. Launch Azure PowerShell in administrator mode.
2. Log in to your Azure account by running the following command and entering your Azure credentials.
Login-AzureRmAccount
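The gateway creation step is not reproduced in this extract. A minimal sketch of the New-AzureRmDataFactoryGateway call that produces output like the listing that follows; the resource group and data factory names are illustrative.
# Create a logical gateway in the data factory. The cmdlet returns the gateway object,
# including the registration key, similar to the output shown below.
# Resource group and data factory names are illustrative.
New-AzureRmDataFactoryGateway `
    -ResourceGroupName "ADFTutorialResourceGroup" `
    -DataFactoryName "MyDataFactory" `
    -Name "MyGateway" `
    -Description "gateway for walkthrough"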
Name : MyGateway
Description : gateway for walkthrough
Version :
Status : NeedRegistration
VersionStatus : None
CreateTime : 9/28/2014 10:58:22
RegisterTime :
LastConnectTime :
ExpiryTime :
ProvisioningState : Succeeded
Key : ADF#00000000-0000-4fb8-a867-947877aef6cb@fda06d87-f446-43b1-9485-
78af26b8bab0@4707262b-dc25-4fe5-881c-c8a7c3c569fe@wu#nfU4aBlq/heRyYFZ2Xt/CD+7i73PEO521Sj2AFOCmiI
You can register the gateway on a remote machine by using the IsRegisterOnRemoteMachine
parameter. Example:
5. You can use the Get-AzureRmDataFactoryGateway cmdlet to get the list of gateways in your data
factory. When the Status is Online, your gateway is ready to use.
You can remove a gateway by using the Remove-AzureRmDataFactoryGateway cmdlet and update the
description of a gateway by using the Set-AzureRmDataFactoryGateway cmdlet. For syntax and
other details about these cmdlets, see the Data Factory Cmdlet Reference.
List gateways using PowerShell
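Listing the gateways is a single cmdlet call; a minimal sketch, with illustrative resource group and data factory names:
# List the gateways in the data factory and check the Status property.
# Resource group and data factory names are illustrative.
Get-AzureRmDataFactoryGateway `
    -ResourceGroupName "ADFTutorialResourceGroup" `
    -DataFactoryName "MyDataFactory"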
Next steps
See Move data between on-premises and cloud data stores article. In the walkthrough, you create a
pipeline that uses the gateway to move data from an on-premises SQL Server database to an Azure blob.
Data Management Gateway - high availability and
scalability (Preview)
8/31/2017 13 min to read Edit Online
This article helps you configure high availability and scalability solution with Data Management Gateway.
NOTE
This article assumes that you are already familiar with basics of Data Management Gateway. If you are not, see Data
Management Gateway.
This preview feature is officially supported on Data Management Gateway version 2.12.xxxx.x and above. Please
make sure you are using version 2.12.xxxx.x or above. Download the latest version of Data Management Gateway here.
Overview
You can associate data management gateways that are installed on multiple on-premises machines with a single
logical gateway from the portal. These machines are called nodes. You can have up to four nodes associated
with a logical gateway. The benefits of having multiple nodes (on-premises machines with gateway installed) for a
logical gateway are:
Improve performance of data movement between on-premises and cloud data stores.
If one of the nodes goes down for some reason, other nodes are still available for moving the data.
If one of the nodes needs to be taken offline for maintenance, the other nodes are still available for moving the
data.
You can also configure the number of concurrent data movement jobs that can run on a node to scale up the
capability of moving data between on-premises and cloud data stores.
Using the Azure portal, you can monitor the status of these nodes, which helps you decide whether to add or
remove a node from the logical gateway.
Architecture
The following diagram provides the architecture overview of scalability and availability feature of the Data
Management Gateway:
A logical gateway is the gateway you add to a data factory in the Azure portal. Earlier, you could associate only
one on-premises Windows machine (with Data Management Gateway installed) with a logical gateway. This
on-premises gateway machine is called a node. Now, you can associate up to four physical nodes with a logical
gateway. A logical gateway with multiple nodes is called a multi-node gateway.
All these nodes are active, and they can all process data movement jobs to move data between on-premises and
cloud data stores. One of the nodes acts as both dispatcher and worker; the other nodes in the group are worker
nodes. The dispatcher node pulls data movement tasks/jobs from the cloud service and dispatches them to the worker
nodes (including itself). A worker node executes data movement jobs to move data between on-premises and
cloud data stores. All nodes are workers, but only one node can be both dispatcher and worker.
You may typically start with one node and scale out to add more nodes as the existing node(s) are overwhelmed
with the data movement load. You can also scale up the data movement capability of a gateway node by
increasing the number of concurrent jobs that are allowed to run on the node. This capability is also available with
a single-node gateway (even when the scalability and availability feature is not enabled).
A gateway with multiple nodes keeps the data store credentials in sync across all nodes. If there is a node-to-node
connectivity issue, the credentials may be out of sync. When you set credentials for an on-premises data store
that uses a gateway, it saves credentials on the dispatcher/worker node. The dispatcher node syncs with other
worker nodes. This process is known as credentials sync. The communication channel between nodes can be
encrypted by a public SSL/TLS certificate.
NOTE
Before you install a data management gateway on an on-premises Windows machine, see prerequisites listed in the main
article.
1. In the walkthrough, while creating a logical gateway, enable the High Availability & Scalability feature.
2. In the Configure page, use either Express Setup or Manual Setup link to install a gateway on the first
node (an on-premises Windows machine).
NOTE
If you use the express setup option, the node-to-node communication is done without encryption, and the node name
is the same as the machine name. Use manual setup if the node-to-node communication needs to be encrypted or if you
want to specify a node name of your choice. Node names cannot be edited later.
b. Launch the Data Management Gateway Configuration Manager for the gateway by following these instructions.
You see the gateway name, node name, status, etc.
4. If you choose manual setup:
a. Download the installation package from the Microsoft Download Center, run it to install gateway on
your machine.
b. Use the authentication key from the Configure page to register the gateway.
c. In the New gateway node page, you can provide a custom name for the gateway node. By default,
the node name is the same as the machine name.
d. In the next page, you can choose whether to enable encryption for node-to-node
communication. Click Skip to disable encryption (the default).
NOTE
Changing of encryption mode is only supported when you have a single gateway node in the logical
gateway. To change the encryption mode when a gateway has multiple nodes, do the following steps: delete
all the nodes except one node, change the encryption mode, and then add the nodes again.
See the TLS/SSL certificate requirements section for a list of requirements for using a TLS/SSL certificate.
f. You see the Data Management Gateway Configuration Manager on the node (the on-premises Windows
machine), which shows the connectivity status, gateway name, and node name.
NOTE
If you are provisioning the gateway on an Azure VM, you can use this Azure Resource Manager template.
This script creates a logical gateway, sets up VMs with Data Management Gateway software installed, and
registers them with the logical gateway.
6. Click Add Node on the toolbar to add a node to the logical gateway. If you are planning to use express
setup, do this step from the on-premises machine that will be added as a node to the gateway.
7. Steps are similar to setting up the first node. The Configuration Manager UI lets you set the node name if
you choose the manual installation option:
8. After the gateway is installed successfully on the node, the Configuration Manager tool displays the
following screen:
9. If you open the Gateway page in the portal, you see two gateway nodes now:
10. To delete a gateway node, click Delete Node on the toolbar, select the node you want to delete, and then click
Delete from the toolbar. This action deletes the selected node from the group. It does not
uninstall the data management gateway software from the node (the on-premises Windows machine). Use Add
or remove programs in Control Panel on the on-premises machine to uninstall the gateway. When you uninstall the
gateway from the node, the node is automatically deleted in the portal.
3. Once the preview feature is enabled in the portal, close all pages. Reopen the gateway page to see the
new preview user interface (UI).
NOTE
During the upgrade, name of the first node is the name of the machine.
You can enable Advanced Settings in the Gateway page to see advanced metrics like Network (in/out), Role, and
Credential Status, which are helpful for debugging gateway issues, and Concurrent Jobs (Running/ Limit), which
can be changed during performance tuning. The following table provides descriptions of the
columns in the Gateway Nodes list:
Name: Name of the logical gateway and the nodes associated with the gateway.
Version: The version of the logical gateway and of each gateway node. The version of the logical gateway is determined by the version of the majority of nodes in the group. If there are nodes with different versions in the logical gateway setup, only the nodes with the same version number as the logical gateway function properly. The others are in limited mode and need to be manually updated (only in case auto-update fails).
CPU utilization: CPU utilization of a gateway node. This value is a near real-time snapshot.
Concurrent Jobs (Running/ Limit): Number of jobs or tasks running on each node. This value is a near real-time snapshot. Limit signifies the maximum concurrent jobs for each node. This value is defined based on the machine size. You can increase the limit to scale up concurrent job execution in advanced scenarios where CPU, memory, or network is under-utilized but activities are timing out. This capability is also available with a single-node gateway (even when the scalability and availability feature is not enabled). For more information, see the scale considerations section.
Role: There are two types of roles: dispatcher and worker. All nodes are workers, which means they can all be used to execute jobs. There is only one dispatcher node, which is used to pull tasks/jobs from the cloud service and dispatch them to the worker nodes (including itself).
Gateway status
The following table provides possible statuses of a gateway node:
STATUS COMMENTS/SCENARIOS
The following table provides possible statuses of a logical gateway. The gateway status depends on statuses of
the gateway nodes.
Limited: Not all nodes in this gateway are in a healthy state. This status is a warning that some nodes might be down.
Scale considerations
Scale out
When the available memory is low and the CPU usage is high, adding a new node helps scale out the load
across machines. If activities are failing because they time out or because the gateway node is offline, adding a node
to the gateway also helps.
Scale up
When the available memory and CPU are not utilized well, but the idle capacity is 0, you should scale up by
increasing the number of concurrent jobs that can run on a node. You may also want to scale up when activities
are timing out because the gateway is overloaded. As shown in the following image, you can increase the
maximum capacity for a node. We suggest doubling it to start with.
Next steps
Review the following articles:
Data Management Gateway - provides a detailed overview of the gateway.
Move data between on-premises and cloud data stores - contains a walkthrough with step-by-step
instructions for using a gateway with a single node.
Move data between on-premises sources and
the cloud with Data Management Gateway
8/21/2017 15 min to read Edit Online
This article provides an overview of data integration between on-premises data stores and cloud data
stores using Data Factory. It builds on the Data Movement Activities article and other data factory
core concepts articles: datasets and pipelines.
IMPORTANT
See Data Management Gateway article for details about Data Management Gateway.
The following walkthrough shows you how to create a data factory with a pipeline that moves data
from an on-premises SQL Server database to an Azure blob storage. As part of the walkthrough, you
install and configure the Data Management Gateway on your machine.
3. In the New data factory page, enter ADFTutorialOnPremDF for the Name.
IMPORTANT
The name of the Azure data factory must be globally unique. If you receive the error: Data factory
name ADFTutorialOnPremDF is not available, change the name of the data factory (for example,
yournameADFTutorialOnPremDF) and try creating it again. Use this name in place of
ADFTutorialOnPremDF while performing the remaining steps in this tutorial.
The name of the data factory may be registered as a DNS name in the future and hence become
publicly visible.
4. Select the Azure subscription where you want the data factory to be created.
5. Select existing resource group or create a resource group. For the tutorial, create a resource
group named: ADFTutorialResourceGroup.
6. Click Create on the New data factory page.
IMPORTANT
To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.
7. After creation is complete, you see the Data Factory page as shown in the following image:
Create gateway
1. In the Data Factory page, click Author and deploy tile to launch the Editor for the data
factory.
2. In the Data Factory Editor, click ... More on the toolbar and then click New data gateway.
Alternatively, you can right-click Data Gateways in the tree view, and click New data
gateway.
3. In the Create page, enter adftutorialgateway for the name, and click OK.
NOTE
In this walkthrough, you create the logical gateway with only one node (on-premises Windows
machine). You can scale out a data management gateway by associating multiple on-premises
machines with the gateway. You can scale up by increasing number of data movement jobs that can
run concurrently on a node. This feature is also available for a logical gateway with a single node. See
Scaling data management gateway in Azure Data Factory article for details.
4. In the Configure page, click Install directly on this computer. This action downloads the
installation package for the gateway, installs, configures, and registers the gateway on the
computer.
NOTE
Use Internet Explorer or a Microsoft ClickOnce compatible web browser.
If you are using Chrome, go to the Chrome web store, search with "ClickOnce" keyword, choose one
of the ClickOnce extensions, and install it.
Do the same for Firefox (install add-in). Click Open Menu button on the toolbar (three horizontal
lines in the top-right corner), click Add-ons, search with "ClickOnce" keyword, choose one of the
ClickOnce extensions, and install it.
This is the easiest way (one click) to download, install, configure, and register the gateway
in a single step. You can see that the Microsoft Data Management Gateway Configuration
Manager application is installed on your computer. You can also find the executable
ConfigManager.exe in the folder: C:\Program Files\Microsoft Data Management
Gateway\2.0\Shared.
You can also download and install gateway manually by using the links in this page and
register it using the key shown in the NEW KEY text box.
See Data Management Gateway article for all the details about the gateway.
NOTE
You must be an administrator on the local computer to install and configure the Data Management
Gateway successfully. You can add additional users to the Data Management Gateway Users local
Windows group. The members of this group can use the Data Management Gateway Configuration
Manager tool to configure the gateway.
5. Wait for a couple of minutes or wait until you see the following notification message:
6. Launch Data Management Gateway Configuration Manager application on your
computer. In the Search window, type Data Management Gateway to access this utility. You
can also find the executable ConfigManager.exe in the folder: C:\Program Files\Microsoft
Data Management Gateway\2.0\Shared
7. Confirm that you see the adftutorialgateway is connected to the cloud service message. The
status bar at the bottom displays Connected to the cloud service along with a green check
mark.
On the Home tab, you can also do the following operations:
Register a gateway with a key from the Azure portal by using the Register button.
Stop the Data Management Gateway Host Service running on your gateway machine.
Schedule updates to be installed at a specific time of the day.
View when the gateway was last updated.
Specify time at which an update to the gateway can be installed.
8. Switch to the Settings tab. The certificate specified in the Certificate section is used to
encrypt/decrypt credentials for the on-premises data store that you specify on the portal.
(optional) Click Change to use your own certificate instead. By default, the gateway uses the
certificate that is auto-generated by the Data Factory service.
Create datasets
In this step, you create input and output datasets that represent input and output data for the copy
operation (on-premises SQL Server database => Azure Blob storage). Before creating the datasets, do
the following steps (detailed steps follow the list):
Create a table named emp in the SQL Server Database you added as a linked service to the data
factory and insert a couple of sample entries into the table.
Create a blob container named adftutorial in the Azure blob storage account you added as a
linked service to the data factory.
Prepare On-premises SQL Server for the tutorial
1. In the database you specified for the on-premises SQL Server linked service
(SqlServerLinkedService), use the following SQL script to create the emp table in the
database.
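The SQL script itself is not reproduced in this extract; the walkthrough only requires a small emp table with a couple of rows. Here is a minimal sketch run from PowerShell with the SqlServer module; the server and database names match the connectionString used earlier, and the column layout is an illustrative assumption.
# Create a small emp table with two sample rows. The server/database names and the
# column layout are illustrative assumptions; any small table works for the copy.
$query = @'
CREATE TABLE dbo.emp (ID int IDENTITY(1,1) PRIMARY KEY, FirstName nvarchar(50), LastName nvarchar(50));
INSERT INTO dbo.emp (FirstName, LastName) VALUES ('John', 'Doe'), ('Jane', 'Doe');
'@
Invoke-Sqlcmd -ServerInstance "myserver" -Database "mydatabase" -Query $query
The JSON definitions that follow declare the input dataset (EmpOnPremSQLTable) over this table and the output dataset (OutputBlobTable) over the adftutorial blob container.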
{
"name": "EmpOnPremSQLTable",
"properties": {
"type": "SqlServerTable",
"linkedServiceName": "SqlServerLinkedService",
"typeProperties": {
"tableName": "emp"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
{
"name": "OutputBlobTable",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adftutorial/outfromonpremdf",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
See Move data to/from Azure Blob Storage for details about JSON properties.
3. Click Deploy on the command bar to deploy the dataset. Confirm that you see both the datasets
in the tree view.
Create pipeline
In this step, you create a pipeline with one Copy Activity that uses EmpOnPremSQLTable as input
and OutputBlobTable as output.
1. In Data Factory Editor, click ... More, and click New pipeline.
2. Replace the JSON in the right pane with the following text:
{
"name": "ADFTutorialPipelineOnPrem",
"properties": {
"description": "This pipeline has one Copy activity that copies data from an on-prem
SQL to Azure blob",
"activities": [
{
"name": "CopyFromSQLtoBlob",
"description": "Copy data from on-prem SQL server to blob",
"type": "Copy",
"inputs": [
{
"name": "EmpOnPremSQLTable"
}
],
"outputs": [
{
"name": "OutputBlobTable"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from emp"
},
"sink": {
"type": "BlobSink"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2016-07-05T00:00:00Z",
"end": "2016-07-06T00:00:00Z",
"isPaused": false
}
}
IMPORTANT
Replace the value of the start property with the current day and end value with the next day.
You can zoom in, zoom out, zoom to 100%, zoom to fit, automatically position pipelines and
datasets, and show lineage information (highlights upstream and downstream items of
selected items). You can double-click an object (input/output dataset or pipeline) to see
properties for it.
Monitor pipeline
In this step, you use the Azure portal to monitor what's going on in an Azure data factory. You can
also use PowerShell cmdlets to monitor datasets and pipelines. For details about monitoring, see
Monitor and Manage Pipelines.
1. In the diagram, double-click EmpOnPremSQLTable.
2. Notice that all the data slices are in the Ready state because the pipeline duration (start time to
end time) is in the past. It is also because you have inserted the data in the SQL Server database
and it is there all the time. Confirm that no slices show up in the Problem slices section at the
bottom. To view all the slices, click See More at the bottom of the list of slices.
3. Now, in the Datasets page, click OutputBlobTable.
4. Click any data slice from the list and you should see the Data Slice page. You see activity runs
for the slice. You see only one activity run usually.
If the slice is not in the Ready state, you can see the upstream slices that are not Ready and are
blocking the current slice from executing in the Upstream slices that are not ready list.
5. Click the activity run from the list at the bottom to see activity run details.
You would see information such as throughput, duration, and the gateway used to transfer the
data.
6. Click X to close all the pages until you get back to the home page for ADFTutorialOnPremDF.
7. (Optional) Click Pipelines, click ADFTutorialOnPremDF, and drill through input tables
(Consumed) or output datasets (Produced).
8. Use tools such as Microsoft Storage Explorer to verify that a blob/file is created for each hour.
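As an alternative to Storage Explorer, you could list the output blobs with the Azure Storage PowerShell cmdlets; a minimal sketch, with the storage account name and key as placeholders:
# List the blobs produced under adftutorial/outfromonpremdf.
# The storage account name and key are placeholders.
$context = New-AzureStorageContext -StorageAccountName "<account name>" -StorageAccountKey "<account key>"
Get-AzureStorageBlob -Container "adftutorial" -Prefix "outfromonpremdf/" -Context $context |
    Select-Object Name, Length, LastModified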
Next steps
See Data Management Gateway article for all the details about the Data Management Gateway.
See Copy data from Azure Blob to Azure SQL to learn about how to use Copy Activity to move
data from a source data store to a sink data store.
Transform data in Azure Data Factory
6/27/2017 3 min to read Edit Online
Overview
This article explains data transformation activities in Azure Data Factory that you can use to transform and
process your raw data into predictions and insights. A transformation activity executes in a computing
environment such as an Azure HDInsight cluster or Azure Batch. This article provides links to articles with detailed
information on each transformation activity.
Data Factory supports the following data transformation activities that can be added to pipelines either
individually or chained with another activity.
NOTE
For a walkthrough with step-by-step instructions, see Create a pipeline with Hive transformation article.
Compute environments
You create a linked service for the compute environment and then use the linked service when defining a
transformation activity. There are two types of compute environments supported by Data Factory.
1. On-Demand: In this case, the computing environment is fully managed by Data Factory. It is automatically
created by the Data Factory service before a job is submitted to process data and removed when the job is
completed. You can configure and control granular settings of the on-demand compute environment for job
execution, cluster management, and bootstrapping actions.
2. Bring Your Own: In this case, you can register your own computing environment (for example HDInsight
cluster) as a linked service in Data Factory. The computing environment is managed by you and the Data
Factory service uses it to execute the activities.
See Compute Linked Services article to learn about compute services supported by Data Factory.
Summary
Azure Data Factory supports the following data transformation activities and the compute environments for the
activities. The transformation activities can be added to pipelines either individually or chained with another
activity.
Stored Procedure: Azure SQL, Azure SQL Data Warehouse, or SQL Server
Transform data using Hive Activity in Azure Data Factory
The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand
Windows/Linux-based HDInsight cluster. This article builds on the data transformation activities article, which
presents a general overview of data transformation and the supported transformation activities.
NOTE
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial: Build your
first data pipeline before reading this article.
Syntax
{
"name": "Hive Activity",
"description": "description",
"type": "HDInsightHive",
"inputs": [
{
"name": "input tables"
}
],
"outputs": [
{
"name": "output tables"
}
],
"linkedServiceName": "MyHDInsightLinkedService",
"typeProperties": {
"script": "Hive script",
"scriptPath": "<pathtotheHivescriptfileinAzureblobstorage>",
"defines": {
"param1": "param1Value"
}
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
Syntax details
PROPERTY DESCRIPTION REQUIRED
Example
Let's consider an example of game log analytics where you want to identify the time spent by users playing
games launched by your company.
The following log is a sample game log. It is comma (,) separated and contains the following fields:
ProfileID, SessionStart, Duration, SrcIPAddress, and GameType.
1809,2014-05-04 12:04:25.3470000,14,221.117.223.75,CaptureFlag
1703,2014-05-04 06:05:06.0090000,16,12.49.178.247,KingHill
1703,2014-05-04 10:21:57.3290000,10,199.118.18.179,CaptureFlag
1809,2014-05-04 05:24:22.2100000,23,192.84.66.141,KingHill
.....
To execute this Hive script in a Data Factory pipeline, you need to do the following:
1. Create a linked service to register your own HDInsight compute cluster or to configure an on-demand HDInsight
compute cluster. Let's call this linked service HDInsightLinkedService.
2. Create a linked service to configure the connection to the Azure Blob storage hosting the data. Let's call this
linked service StorageLinkedService.
3. Create datasets pointing to the input and the output data. Let's call the input dataset HiveSampleIn and
the output dataset HiveSampleOut.
4. Copy the Hive query as a file to the Azure Blob storage configured in step 2. If the storage account that hosts the
data is different from the one that hosts this query file, create a separate Azure Storage linked service and
refer to it in the activity. Use scriptPath to specify the path to the Hive query file and
scriptLinkedService to specify the Azure storage that contains the script file.
NOTE
You can also provide the Hive script inline in the activity definition by using the script property. We do not
recommend this approach, because all special characters in the script within the JSON document need to be escaped,
which may cause debugging issues. The best practice is to follow step 4.
5. Create a pipeline with the HDInsightHive activity. The activity processes/transforms the data.
{
"name": "HiveActivitySamplePipeline",
"properties": {
"activities": [
{
"name": "HiveActivitySample",
"type": "HDInsightHive",
"inputs": [
{
"name": "HiveSampleIn"
}
],
"outputs": [
{
"name": "HiveSampleOut"
}
],
"linkedServiceName": "HDInsightLinkedService",
"typeproperties": {
"scriptPath": "adfwalkthrough\\scripts\\samplehive.hql",
"scriptLinkedService": "StorageLinkedService"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
}
}
]
}
}
See Also
Pig Activity
MapReduce Activity
Hadoop Streaming Activity
Invoke Spark programs
Invoke R scripts
Transform data using Pig Activity in Azure Data
Factory
6/27/2017 4 min to read Edit Online
The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand
Windows/Linux-based HDInsight cluster. This article builds on the data transformation activities article, which
presents a general overview of data transformation and the supported transformation activities.
NOTE
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial: Build your
first data pipeline before reading this article.
Syntax
{
"name": "HiveActivitySamplePipeline",
"properties": {
"activities": [
{
"name": "Pig Activity",
"description": "description",
"type": "HDInsightPig",
"inputs": [
{
"name": "input tables"
}
],
"outputs": [
{
"name": "output tables"
}
],
"linkedServiceName": "MyHDInsightLinkedService",
"typeProperties": {
"script": "Pig script",
"scriptPath": "<pathtothePigscriptfileinAzureblobstorage>",
"defines": {
"param1": "param1Value"
}
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
]
}
}
Syntax details
PROPERTY DESCRIPTION REQUIRED
Example
Let's consider an example of game log analytics where you want to identify the time spent by players playing
games launched by your company.
The following sample game log is a comma (,) separated file. It contains the following fields:
ProfileID, SessionStart, Duration, SrcIPAddress, and GameType.
1809,2014-05-04 12:04:25.3470000,14,221.117.223.75,CaptureFlag
1703,2014-05-04 06:05:06.0090000,16,12.49.178.247,KingHill
1703,2014-05-04 10:21:57.3290000,10,199.118.18.179,CaptureFlag
1809,2014-05-04 05:24:22.2100000,23,192.84.66.141,KingHill
.....
NOTE
You can also provide the Pig script inline in the activity definition by using the script property. However, we do
not recommend this approach, because all special characters in the script need to be escaped, which may cause
debugging issues. The best practice is to follow step 4.
5. Create the pipeline with the HDInsightPig activity. This activity processes the input data by running the Pig
script on the HDInsight cluster.
{
"name": "PigActivitySamplePipeline",
"properties": {
"activities": [
{
"name": "PigActivitySample",
"type": "HDInsightPig",
"inputs": [
{
"name": "PigSampleIn"
}
],
"outputs": [
{
"name": "PigSampleOut"
}
],
"linkedServiceName": "HDInsightLinkedService",
"typeproperties": {
"scriptPath": "adfwalkthrough\\scripts\\enrichlogs.pig",
"scriptLinkedService": "StorageLinkedService"
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
]
}
}
{
"name": "PigActivitySamplePipeline",
"properties": {
"activities": [
{
"name": "PigActivitySample",
"type": "HDInsightPig",
"inputs": [
{
"name": "PigSampleIn"
}
],
"outputs": [
{
"name": "PigSampleOut"
}
],
"linkedServiceName": "HDInsightLinkedService",
"typeproperties": {
"scriptPath": "adfwalkthrough\\scripts\\samplepig.hql",
"scriptLinkedService": "StorageLinkedService",
"defines": {
"Input": "$$Text.Format('wasb:
//adfwalkthrough@<storageaccountname>.blob.core.windows.net/samplein/yearno={0: yyyy}/monthno=
{0:MM}/dayno={0: dd}/',SliceStart)",
"Output":
"$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/sampleout/yearno=
{0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)"
}
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
]
}
}
In the Pig script, refer to the parameters by using '$parameterName' (for example, '$Input' and '$Output').
See Also
Hive Activity
MapReduce Activity
Hadoop Streaming Activity
Invoke Spark programs
Invoke R scripts
Invoke MapReduce Programs from Data Factory
8/15/2017 4 min to read Edit Online
The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or
on-demand Windows/Linux-based HDInsight cluster. This article builds on the data transformation activities
article, which presents a general overview of data transformation and the supported transformation activities.
NOTE
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial: Build your
first data pipeline before reading this article.
Introduction
A pipeline in an Azure data factory processes data in linked storage services by using linked compute services.
It contains a sequence of activities where each activity performs a specific processing operation. This article
describes using the HDInsight MapReduce Activity.
See Pig and Hive for details about running Pig/Hive scripts on a Windows/Linux-based HDInsight cluster from
a pipeline by using HDInsight Pig and Hive activities.
You can use the HDInsight MapReduce Activity to run any MapReduce jar file on an HDInsight cluster. In
the following sample JSON definition of a pipeline, the HDInsight Activity is configured to run a Mahout
JAR file.
Sample on GitHub
You can download a sample for using the HDInsight MapReduce Activity from: Data Factory Samples on
GitHub.
Running the Word Count program
The pipeline in this example runs the Word Count Map/Reduce program on your Azure HDInsight cluster.
Linked Services
First, you create a linked service to link the Azure Storage that is used by the Azure HDInsight cluster to the
Azure data factory. If you copy/paste the following code, do not forget to replace account name and account
key with the name and key of your Azure Storage.
Azure Storage linked service
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=
<account key>"
}
}
}
{
"name": "HDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": "https://<HDInsight cluster name>.azurehdinsight.net",
"userName": "admin",
"password": "**********",
"linkedServiceName": "StorageLinkedService"
}
}
}
Datasets
Output dataset
The pipeline in this example does not take any inputs. You specify an output dataset for the HDInsight
MapReduce Activity. This dataset is just a dummy dataset that is required to drive the pipeline schedule.
{
"name": "MROutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"fileName": "WordCountOutput1.txt",
"folderPath": "example/data/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
Pipeline
The pipeline in this example has only one activity, of type HDInsightMapReduce. Some of the important
properties in the JSON are:
jarFilePath: Path to the jar file containing the class. If you copy/paste the following code, don't forget to change the name of the cluster.
jarLinkedService: Azure Storage linked service that contains the jar file. This linked service refers to the storage that is associated with the HDInsight cluster.
frequency/interval: The values for these properties match the output dataset.
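The sample pipeline JSON itself is not reproduced in this extract. Here is a minimal sketch of what a pipeline with an HDInsightMapReduce activity could look like, saved to a file and deployed with the v1 PowerShell cmdlet. The class name, jar path, and arguments are assumptions rather than the exact sample; the linked service and output dataset names match the definitions shown earlier in this article, and the file path and factory names are illustrative.
# A sketch of a pipeline with one HDInsightMapReduce activity that runs the word-count
# JAR. The class name, jar path, and arguments are assumptions; the linked service and
# output dataset names match the definitions shown earlier in this article.
$pipelineJson = @'
{
  "name": "MRSamplePipeline",
  "properties": {
    "activities": [
      {
        "type": "HDInsightMapReduce",
        "name": "MRActivity",
        "linkedServiceName": "HDInsightLinkedService",
        "outputs": [ { "name": "MROutput" } ],
        "typeProperties": {
          "className": "wordcount",
          "jarFilePath": "<container>/example/jars/hadoop-examples.jar",
          "jarLinkedService": "StorageLinkedService",
          "arguments": [ "/example/data/gutenberg/davinci.txt", "/example/data/WordCountOutput1" ]
        },
        "scheduler": { "frequency": "Day", "interval": 1 }
      }
    ],
    "start": "2017-01-03T00:00:00Z",
    "end": "2017-01-04T00:00:00Z"
  }
}
'@
Set-Content -Path "C:\adf\MRSamplePipeline.json" -Value $pipelineJson
New-AzureRmDataFactoryPipeline -ResourceGroupName "<resource group>" -DataFactoryName "<data factory>" -File "C:\adf\MRSamplePipeline.json"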
See Also
Hive Activity
Pig Activity
Hadoop Streaming Activity
Invoke Spark programs
Invoke R scripts
Transform data using Hadoop Streaming Activity in
Azure Data Factory
8/15/2017 4 min to read Edit Online
You can use the HDInsightStreaming activity to invoke a Hadoop Streaming job from an Azure Data
Factory pipeline. The following JSON snippet shows the syntax for using the HDInsightStreaming activity in a
pipeline JSON file.
The HDInsight Streaming Activity in a Data Factory pipeline executes Hadoop Streaming programs on your
own or on-demand Windows/Linux-based HDInsight cluster. This article builds on the data transformation
activities article, which presents a general overview of data transformation and the supported transformation
activities.
NOTE
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial: Build your
first data pipeline before reading this article.
JSON sample
The HDInsight cluster is automatically populated with example programs (wc.exe and cat.exe) and data
(davinci.txt). By default, the name of the container that is used by the HDInsight cluster is the name of the cluster
itself. For example, if your cluster name is myhdicluster, the name of the associated blob container would be
myhdicluster.
{
"name": "HadoopStreamingPipeline",
"properties": {
"description": "Hadoop Streaming Demo",
"activities": [
{
"type": "HDInsightStreaming",
"typeProperties": {
"mapper": "cat.exe",
"reducer": "wc.exe",
"input":
"wasb://<nameofthecluster>@spestore.blob.core.windows.net/example/data/gutenberg/davinci.txt",
"output":
"wasb://<nameofthecluster>@spestore.blob.core.windows.net/example/data/StreamingOutput/wc.txt",
"filePaths": [
"<nameofthecluster>/example/apps/wc.exe",
"<nameofthecluster>/example/apps/cat.exe"
],
"fileLinkedService": "AzureStorageLinkedService",
"getDebugInfo": "Failure"
},
"outputs": [
{
"name": "StreamingOutputDataset"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "RunHadoopStreamingJob",
"description": "Run a Hadoop streaming job",
"linkedServiceName": "HDInsightLinkedService"
}
],
"start": "2014-01-04T00:00:00Z",
"end": "2014-01-05T00:00:00Z"
}
}
NOTE
As shown in the example, you specify an output dataset for the Hadoop Streaming Activity for the outputs property.
This dataset is just a dummy dataset that is required to drive the pipeline schedule. You do not need to specify any input
dataset for the activity for the inputs property.
Example
The pipeline in this walkthrough runs the Word Count streaming Map/Reduce program on your Azure
HDInsight cluster.
Linked services
Azure Storage linked service
First, you create a linked service to link the Azure Storage that is used by the Azure HDInsight cluster to the
Azure data factory. If you copy/paste the following code, do not forget to replace account name and account
key with the name and key of your Azure Storage.
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=
<account key>"
}
}
}
{
"name": "HDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": "https://<HDInsight cluster name>.azurehdinsight.net",
"userName": "admin",
"password": "**********",
"linkedServiceName": "StorageLinkedService"
}
}
}
Datasets
Output dataset
The pipeline in this example does not take any inputs. You specify an output dataset for the HDInsight
Streaming Activity. This dataset is just a dummy dataset that is required to drive the pipeline schedule.
{
"name": "StreamingOutputDataset",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "adftutorial/streamingdata/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
Pipeline
The pipeline in this example has only one activity that is of type: HDInsightStreaming.
The HDInsight cluster is automatically populated with example programs (wc.exe and cat.exe) and data
(davinci.txt). By default, the name of the container that is used by the HDInsight cluster is the name of the cluster
itself. For example, if your cluster name is myhdicluster, the name of the associated blob container would be
myhdicluster.
{
"name": "HadoopStreamingPipeline",
"properties": {
"description": "Hadoop Streaming Demo",
"activities": [
{
"type": "HDInsightStreaming",
"typeProperties": {
"mapper": "cat.exe",
"reducer": "wc.exe",
"input":
"wasb://<blobcontainer>@spestore.blob.core.windows.net/example/data/gutenberg/davinci.txt",
"output":
"wasb://<blobcontainer>@spestore.blob.core.windows.net/example/data/StreamingOutput/wc.txt",
"filePaths": [
"<blobcontainer>/example/apps/wc.exe",
"<blobcontainer>/example/apps/cat.exe"
],
"fileLinkedService": "StorageLinkedService"
},
"outputs": [
{
"name": "StreamingOutputDataset"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "RunHadoopStreamingJob",
"description": "Run a Hadoop streaming job",
"linkedServiceName": "HDInsightLinkedService"
}
],
"start": "2017-01-03T00:00:00Z",
"end": "2017-01-04T00:00:00Z"
}
}
See Also
Hive Activity
Pig Activity
MapReduce Activity
Invoke Spark programs
Invoke R scripts
Invoke Spark programs from Azure Data Factory
pipelines
8/21/2017 11 min to read Edit Online
Introduction
Spark Activity is one of the data transformation activities supported by Azure Data Factory. This activity runs
the specified Spark program on your Apache Spark cluster in Azure HDInsight.
IMPORTANT
Spark Activity does not support HDInsight Spark clusters that use an Azure Data Lake Store as primary storage.
Spark Activity supports only existing (your own) HDInsight Spark clusters. It does not support an on-demand
HDInsight linked service.
4. Select the Azure subscription where you want the data factory to be created.
5. Select an existing resource group or create an Azure resource group.
6. Select Pin to dashboard option.
7. Click Create on the New data factory blade.
IMPORTANT
To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.
8. You see the data factory being created in the dashboard of the Azure portal as follows:
9. After the data factory has been created successfully, you see the data factory page, which shows you the
contents of the data factory. If you do not see the data factory page, click the tile for your data factory on
the dashboard.
3. You should see the JSON script for creating an Azure Storage linked service in the editor.
4. Replace account name and account key with the name and access key of your Azure storage account. To
learn how to get your storage access key, see the information about how to view, copy, and regenerate
storage access keys in Manage your storage account.
5. To deploy the linked service, click Deploy on the command bar. After the linked service is deployed
successfully, the Draft-1 window should disappear and you see AzureStorageLinkedService in the tree
view on the left.
Create HDInsight linked service
In this step, you create Azure HDInsight linked service to link your HDInsight Spark cluster to the data factory.
The HDInsight cluster is used to run the Spark program specified in the Spark activity of the pipeline in this
sample.
1. Click ... More on the toolbar, click New compute, and then click HDInsight cluster.
2. Copy and paste the following snippet to the Draft-1 window. In the JSON editor, do the following steps:
a. Specify the URI for the HDInsight Spark cluster. For example:
https://<sparkclustername>.azurehdinsight.net/ .
b. Specify the name of the user who has access to the Spark cluster.
c. Specify the password for user.
d. Specify the Azure Storage linked service that is associated with the HDInsight Spark cluster. In
this example, it is: AzureStorageLinkedService.
{
"name": "HDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": "https://<sparkclustername>.azurehdinsight.net/",
"userName": "admin",
"password": "**********",
"linkedServiceName": "AzureStorageLinkedService"
}
}
}
IMPORTANT
Spark Activity does not support HDInsight Spark clusters that use an Azure Data Lake Store as
primary storage.
Spark Activity supports only existing (your own) HDInsight Spark cluster. It does not support an on-
demand HDInsight linked service.
See HDInsight Linked Service for details about the HDInsight linked service.
3. To deploy the linked service, click Deploy on the command bar.
Create output dataset
The output dataset is what drives the schedule (hourly, daily, etc.). Therefore, you must specify an output
dataset for the spark activity in the pipeline even though the activity does not really produce any output.
Specifying an input dataset for the activity is optional.
1. In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob
storage.
2. Copy and paste the following snippet to the Draft-1 window. The JSON snippet defines a dataset called
OutputDataset. In addition, you specify that the results are stored in the blob container called
adfspark and the folder called pyFiles/output. As mentioned earlier, this dataset is a dummy dataset.
The Spark program in this example does not produce any output. The availability section specifies that
the output dataset is produced daily.
{
"name": "OutputDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "sparkoutput.txt",
"folderPath": "adfspark/pyFiles/output",
"format": {
"type": "TextFormat",
"columnDelimiter": "\t"
}
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
{
"name": "SparkPipeline",
"properties": {
"activities": [
{
"type": "HDInsightSpark",
"typeProperties": {
"rootPath": "adfspark\\pyFiles",
"entryFilePath": "test.py",
"getDebugInfo": "Always"
},
"outputs": [
{
"name": "OutputDataset"
}
],
"name": "MySparkActivity",
"linkedServiceName": "HDInsightLinkedService"
}
],
"start": "2017-02-05T00:00:00Z",
"end": "2017-02-06T00:00:00Z"
}
}
IMPORTANT
We recommend that you do not set this property to Always in a production environment unless you
are troubleshooting an issue.
The outputs section has one output dataset. You must specify an output dataset even if the spark
program does not produce any output. The output dataset drives the schedule for the pipeline
(hourly, daily, etc.).
For details about the properties supported by Spark activity, see Spark activity properties section.
3. To deploy the pipeline, click Deploy on the command bar.
Monitor pipeline
1. Click X to close Data Factory Editor blades and to navigate back to the Data Factory home page. Click
Monitor and Manage to launch the monitoring application in another tab.
2. Change the Start time filter at the top to 2/5/2017, and click Apply.
3. You should see only one activity window, as there is only one day between the start (2017-02-05) and
end (2017-02-06) times of the pipeline. Confirm that the data slice is in the Ready state.
4. Select the activity window to see details about the activity run. If there is an error, you see details about it
in the right pane.
Verify the results
1. Launch Jupyter notebook for your HDInsight Spark cluster by navigating to:
https://CLUSTERNAME.azurehdinsight.net/jupyter. You can also launch cluster dashboard for your
HDInsight Spark cluster, and then launch Jupyter Notebook.
2. Click New -> PySpark to start a new notebook.
3. Run the following command by copy/pasting the text and pressing SHIFT + ENTER at the end of the
second statement.
%%sql
SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date FROM hvac WHERE date = \"6/1/13\"
4. Confirm that you see the data from the hvac table:
See Run a Spark SQL query section for detailed instructions.
Troubleshooting
Since you set getDebugInfo to Always, you see a log subfolder in the pyFiles folder in your Azure Blob
container. The log file in the log folder provides additional details. This log file is especially useful when there is
an error. In a production environment, you may want to set it to Failure.
For further troubleshooting, do the following steps:
1. Navigate to https://<CLUSTERNAME>.azurehdinsight.net/yarnui/hn/cluster .
The following sections provide information about Data Factory entities to use Apache Spark cluster and Spark
Activity in your data factory.
The following table describes the JSON properties used in the JSON definition:
Folder structure
The Spark activity does not support an in-line script the way the Pig and Hive activities do. Spark jobs are also more
extensible than Pig/Hive jobs. For Spark jobs, you can provide multiple dependencies such as jar packages
(placed in the Java CLASSPATH), Python files (placed on the PYTHONPATH), and any other files.
Create the following folder structure in the Azure Blob storage referenced by the HDInsight linked service.
Then, upload dependent files to the appropriate subfolders in the root folder represented by entryFilePath.
For example, upload Python files to the pyFiles subfolder and jar files to the jars subfolder of the root folder. At
runtime, the Data Factory service expects this folder structure in the Azure Blob storage.
Here is an example of a storage account containing two Spark job files in the Azure Blob storage referenced by the
HDInsight linked service:
SparkJob1
    main.jar
    files
        input1.txt
        input2.txt
    jars
        package1.jar
        package2.jar
    logs
SparkJob2
    main.py
    pyFiles
        script1.py
        script2.py
    logs
Create predictive pipelines using Azure Machine
Learning and Azure Data Factory
6/27/2017 17 min to read Edit Online
Introduction
Azure Machine Learning
Azure Machine Learning enables you to build, test, and deploy predictive analytics solutions. From a high-level
point of view, it is done in three steps:
1. Create a training experiment. You do this step by using the Azure ML Studio. The ML studio is a
collaborative visual development environment that you use to train and test a predictive analytics model
using training data.
2. Convert it to a predictive experiment. Once your model has been trained with existing data and you are
ready to use it to score new data, you prepare and streamline your experiment for scoring.
3. Deploy it as a web service. You can publish your scoring experiment as an Azure web service. You can
send data to your model via this web service endpoint and receive result predictions from the model.
Azure Data Factory
Data Factory is a cloud-based data integration service that orchestrates and automates the movement and
transformation of data. You can create data integration solutions using Azure Data Factory that can ingest
data from various data stores, transform/process the data, and publish the result data to the data stores.
Data Factory service allows you to create data pipelines that move and transform data, and then run the
pipelines on a specified schedule (hourly, daily, weekly, etc.). It also provides rich visualizations to display the
lineage and dependencies between your data pipelines, and monitor all your data pipelines from a single
unified view to easily pinpoint issues and set up monitoring alerts.
See Introduction to Azure Data Factory and Build your first pipeline articles to quickly get started with the Azure
Data Factory service.
Data Factory and Machine Learning together
Azure Data Factory enables you to easily create pipelines that use a published Azure Machine Learning web
service for predictive analytics. Using the Batch Execution Activity in an Azure Data Factory pipeline, you can
invoke an Azure ML web service to make predictions on the data in batch. See Invoking an Azure ML web
service using the Batch Execution Activity section for details.
Over time, the predictive models in the Azure ML scoring experiments need to be retrained using new input
datasets. You can retrain an Azure ML model from a Data Factory pipeline by doing the following steps:
1. Publish the training experiment (not the predictive experiment) as a web service. You do this step in Azure
ML Studio just as you did to expose the predictive experiment as a web service in the previous scenario.
2. Use the Azure ML Batch Execution Activity to invoke the web service for the training experiment. Basically,
you can use the Azure ML Batch Execution activity to invoke both training web service and scoring web
service.
After you are done with retraining, update the scoring web service (predictive experiment exposed as a web
service) with the newly trained model by using the Azure ML Update Resource Activity. See Updating
models using Update Resource Activity article for details.
Invoking a web service using Batch Execution Activity
You use Azure Data Factory to orchestrate data movement and processing, and then perform batch execution
using Azure Machine Learning. Here are the top-level steps:
1. Create an Azure Machine Learning linked service. You need the following values:
a. Request URI for the Batch Execution API. You can find the Request URI by clicking the BATCH
EXECUTION link in the web services page.
b. API key for the published Azure Machine Learning web service. You can find the API key by clicking
the web service that you have published.
2. Create an activity: Use the AzureMLBatchExecution activity.
Scenario: Experiments using Web service inputs/outputs that refer to data in Azure Blob Storage
In this scenario, the Azure Machine Learning Web service makes predictions using data from a file in an Azure
blob storage and stores the prediction results in the blob storage. The following JSON defines a Data Factory
pipeline with an AzureMLBatchExecution activity. The activity has the dataset DecisionTreeInputBlob as input
and DecisionTreeResultBlob as the output. The DecisionTreeInputBlob is passed as an input to the web
service by using the webServiceInput JSON property. The DecisionTreeResultBlob is passed as an output to
the Web service by using the webServiceOutputs JSON property.
IMPORTANT
If the web service takes multiple inputs, use the webServiceInputs property instead of using webServiceInput. See the
Web service requires multiple inputs section for an example of using the webServiceInputs property.
Datasets that are referenced by the webServiceInput/webServiceInputs and webServiceOutputs properties (in
typeProperties) must also be included in the Activity inputs and outputs.
In your Azure ML experiment, web service input and output ports and global parameters have default names ("input1",
"input2") that you can customize. The names you use for webServiceInputs, webServiceOutputs, and globalParameters
settings must exactly match the names in the experiments. You can view the sample request payload on the Batch
Execution Help page for your Azure ML endpoint to verify the expected mapping.
{
"name": "PredictivePipeline",
"properties": {
"description": "use AzureML model",
"activities": [
{
"name": "MLActivity",
"type": "AzureMLBatchExecution",
"description": "prediction analysis on batch input",
"inputs": [
{
"name": "DecisionTreeInputBlob"
}
],
"outputs": [
{
"name": "DecisionTreeResultBlob"
}
],
"linkedServiceName": "MyAzureMLLinkedService",
"typeProperties":
{
"webServiceInput": "DecisionTreeInputBlob",
"webServiceOutputs": {
"output1": "DecisionTreeResultBlob"
}
},
"policy": {
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "02:00:00"
}
}
],
"start": "2016-02-13T00:00:00Z",
"end": "2016-02-14T00:00:00Z"
}
}
NOTE
Only inputs and outputs of the AzureMLBatchExecution activity can be passed as parameters to the Web service. For
example, in the above JSON snippet, DecisionTreeInputBlob is an input to the AzureMLBatchExecution activity, which is
passed as an input to the Web service via webServiceInput parameter.
Example
This example uses Azure Storage to hold both the input and output data.
We recommend that you go through the Build your first pipeline with Data Factory tutorial before going
through this example. Use the Data Factory Editor to create Data Factory artifacts (linked services, datasets,
pipeline) in this example.
1. Create a linked service for your Azure Storage. If the input and output files are in different storage
accounts, you need two linked services. Here is a JSON example:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=[acctName];AccountKey=
[acctKey]"
}
}
}
2. Create the input Azure Data Factory dataset. Unlike some other Data Factory datasets, these datasets
must contain both folderPath and fileName values. You can use partitioning to cause each batch
execution (each data slice) to process or produce unique input and output files. You may need to include
some upstream activity to transform the input into the CSV file format and place it in the storage
account for each slice. In that case, you would not include the external and externalData settings
shown in the following example, and your DecisionTreeInputBlob would be the output dataset of a
different Activity.
{
"name": "DecisionTreeInputBlob",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "azuremltesting/input",
"fileName": "in.csv",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"external": true,
"availability": {
"frequency": "Day",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
Your input csv file must have the column header row. If you are using the Copy Activity to create/move
the csv into the blob storage, you should set the sink property blobWriterAddHeader to true. For
example:
sink:
{
"type": "BlobSink",
"blobWriterAddHeader": true
}
If the csv file does not have the header row, you may see the following error: Error in Activity: Error
reading string. Unexpected token: StartObject. Path '', line 1, position 1.
3. Create the output Azure Data Factory dataset. This example uses partitioning to create a unique output
path for each slice execution. Without the partitioning, the activity would overwrite the file.
{
"name": "DecisionTreeResultBlob",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "azuremltesting/scored/{folderpart}/",
"fileName": "{filepart}result.csv",
"partitionedBy": [
{
"name": "folderpart",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyyMMdd"
}
},
{
"name": "filepart",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HHmmss"
}
}
],
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 15
}
}
}
4. Create a linked service of type: AzureMLLinkedService, providing the API key and model batch
execution URL.
{
"name": "MyAzureMLLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://[batch execution endpoint]/jobs",
"apiKey": "[apikey]"
}
}
}
NOTE
AzureMLBatchExecution activity can have zero or more inputs and one or more outputs.
{
"name": "PredictivePipeline",
"properties": {
"description": "use AzureML model",
"activities": [
{
"name": "MLActivity",
"type": "AzureMLBatchExecution",
"description": "prediction analysis on batch input",
"inputs": [
{
"name": "DecisionTreeInputBlob"
}
],
"outputs": [
{
"name": "DecisionTreeResultBlob"
}
],
"linkedServiceName": "MyAzureMLLinkedService",
"typeProperties":
{
"webServiceInput": "DecisionTreeInputBlob",
"webServiceOutputs": {
"output1": "DecisionTreeResultBlob"
}
},
"policy": {
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "02:00:00"
}
}
],
"start": "2016-02-13T00:00:00Z",
"end": "2016-02-14T00:00:00Z"
}
}
Both start and end datetimes must be in ISO format. For example: 2014-10-14T16:32:41Z. The
end time is optional. If you do not specify a value for the end property, it is calculated as "start +
48 hours." To run the pipeline indefinitely, specify 9999-09-09 as the value for the end property (see the
fragment that follows).
See JSON Scripting Reference for details about JSON properties.
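For example, to keep a pipeline running indefinitely, the scheduling properties might end with the following sketch (the start value is taken from the preceding example; only the end value changes):
"start": "2016-02-13T00:00:00Z",
"end": "9999-09-09"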
NOTE
Specifying input for the AzureMLBatchExecution activity is optional.
NOTE
Web service input and output are different from Web service parameters. In the first scenario, you have seen how an
input and output can be specified for an Azure ML Web service. In this scenario, you pass parameters for a Web service
that correspond to properties of reader/writer modules.
Let's look at a scenario for using Web service parameters. You have a deployed Azure Machine Learning web
service that uses a reader module to read data from one of the data sources supported by Azure Machine
Learning (for example: Azure SQL Database). After the batch execution is performed, the results are written
using a Writer module (Azure SQL Database). No web service inputs and outputs are defined in the
experiments. In this case, we recommend that you configure relevant web service parameters for the reader
and writer modules. This configuration allows the reader/writer modules to be configured when using the
AzureMLBatchExecution activity. You specify Web service parameters in the globalParameters section in the
activity JSON as follows.
"typeProperties": {
"globalParameters": {
"Param 1": "Value 1",
"Param 2": "Value 2"
}
}
You can also use Data Factory functions to pass values for the Web service parameters, as shown in the
following example:
"typeProperties": {
"globalParameters": {
"Database query": "$$Text.Format('SELECT * FROM myTable WHERE timeColumn = \\'{0:yyyy-MM-dd
HH:mm:ss}\\'', Time.AddHours(WindowStart, 0))"
}
}
NOTE
The Web service parameters are case-sensitive, so ensure that the names you specify in the activity JSON match the
ones exposed by the Web service.
Using a Reader module to read data from multiple files in Azure Blob
Big data pipelines with activities such as Pig and Hive can produce one or more output files with no extensions.
For example, when you specify an external Hive table, the data for the external Hive table can be stored in Azure
blob storage with the following name 000000_0. You can use the reader module in an experiment to read
multiple files, and use them for predictions.
When using the reader module in an Azure Machine Learning experiment, you can specify Azure Blob as an
input. The files in the Azure blob storage can be the output files (for example, 000000_0) that are produced by a
Pig or Hive script running on HDInsight. The reader module allows you to read files (with no extensions) by
configuring the Path to container, directory/blob. The Path to container points to the container, and
directory/blob points to the folder that contains the files. The asterisk (that is, *) specifies that all the files in
the container/folder (that is, data/aggregateddata/year=2014/month=6/*) are read as part of the experiment.
Example
Pipeline with AzureMLBatchExecution activity with Web Service Parameters
{
"name": "MLWithSqlReaderSqlWriter",
"properties": {
"description": "Azure ML model with sql azure reader/writer",
"activities": [
{
"name": "MLSqlReaderSqlWriterActivity",
"type": "AzureMLBatchExecution",
"description": "test",
"inputs": [
{
"name": "MLSqlInput"
}
],
"outputs": [
{
"name": "MLSqlOutput"
}
],
"linkedServiceName": "MLSqlReaderSqlWriterDecisionTreeModel",
"typeProperties":
{
"webServiceInput": "MLSqlInput",
"webServiceOutputs": {
"output1": "MLSqlOutput"
},
"globalParameters": {
"Database server name": "<myserver>.database.windows.net",
"Database name": "<database>",
"Server user account name": "<user name>",
"Server user account password": "<password>"
}
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "02:00:00"
}
}
],
"start": "2016-02-13T00:00:00Z",
"end": "2016-02-14T00:00:00Z"
}
}
{
"name": "PredictivePipeline",
"properties": {
"description": "use AzureML model",
"activities": [{
"name": "MLActivity",
"type": "AzureMLBatchExecution",
"description": "prediction analysis on batch input",
"inputs": [{
"name": "inputDataset1"
}, {
"name": "inputDataset2"
}],
"outputs": [{
"name": "outputDataset"
}],
"linkedServiceName": "MyAzureMLLinkedService",
"typeProperties": {
"webServiceInputs": {
"input1": "inputDataset1",
"input2": "inputDataset2"
},
"webServiceOutputs": {
"output1": "outputDataset"
}
},
"policy": {
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "02:00:00"
}
}],
"start": "2016-02-13T00:00:00Z",
"end": "2016-02-14T00:00:00Z"
}
}
If the web service does not require any input, the AzureMLBatchExecution activity needs only an output dataset. The output dataset drives the schedule, and a dummy/placeholder dataset can be used, as in the following activity snippet:
{
"name": "retraining",
"type": "AzureMLBatchExecution",
"outputs": [
{
"name": "placeholderOutputDataset"
}
],
"typeProperties": {
},
"linkedServiceName": "mlEndpoint",
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "02:00:00"
}
},
Web Service uses readers and writers, and the activity runs only when other activities have succeeded
The Azure ML web service reader and writer modules might be configured to run with or without any
GlobalParameters. However, you may want to embed service calls in a pipeline that uses dataset dependencies
to invoke the service only when some upstream processing has completed. You can also trigger some other
action after the batch execution has completed using this approach. In that case, you can express the
dependencies using activity inputs and outputs, without naming any of them as Web Service inputs or outputs.
{
"name": "retraining",
"type": "AzureMLBatchExecution",
"inputs": [
{
"name": "upstreamData1"
},
{
"name": "upstreamData2"
}
],
"outputs": [
{
"name": "downstreamData"
}
],
"typeProperties": {
},
"linkedServiceName": "mlEndpoint",
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "02:00:00"
}
},
The following pipeline uses the AzureMLBatchScoring activity (an earlier Azure ML activity type) instead of the AzureMLBatchExecution activity:
{
"name": "PredictivePipeline",
"properties": {
"description": "use AzureML model",
"activities": [
{
"name": "MLActivity",
"type": "AzureMLBatchScoring",
"description": "prediction analysis on batch input",
"inputs": [
{
"name": "ScoringInputBlob"
}
],
"outputs": [
{
"name": "ScoringResultBlob"
}
],
"linkedServiceName": "MyAzureMLLinkedService",
"policy": {
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "02:00:00"
}
}
],
"start": "2016-02-13T00:00:00Z",
"end": "2016-02-14T00:00:00Z"
}
}
You can also use Data Factory functions to pass values for the Web service parameters, as shown in the
following example:
"typeProperties": {
"webServiceParameters": {
"Database query": "$$Text.Format('SELECT * FROM myTable WHERE timeColumn = \\'{0:yyyy-MM-dd
HH:mm:ss}\\'', Time.AddHours(WindowStart, 0))"
}
}
NOTE
The Web service parameters are case-sensitive, so ensure that the names you specify in the activity JSON match the
ones exposed by the Web service.
See Also
Azure blog post: Getting started with Azure Data Factory and Azure Machine Learning
Updating Azure Machine Learning models using
Update Resource Activity
6/27/2017 7 min to read Edit Online
This article complements the main Azure Data Factory - Azure Machine Learning integration article: Create
predictive pipelines using Azure Machine Learning and Azure Data Factory. If you haven't already done so,
review the main article before reading through this article.
Overview
Over time, the predictive models in the Azure ML scoring experiments need to be retrained using new input
datasets. After you are done with retraining, you want to update the scoring web service with the retrained ML
model. The typical steps to enable retraining and updating Azure ML models via web services are:
1. Create an experiment in Azure ML Studio.
2. When you are satisfied with the model, use Azure ML Studio to publish web services for both the training
experiment and scoring/predictive experiment.
The following list describes the web services used in this example. See Retrain Machine Learning models
programmatically for details.
Training web service - Receives training data and produces trained models. The output of the retraining is
an .ilearner file in an Azure Blob storage. The default endpoint is automatically created for you when you
publish the training experiment as a web service. You can create more endpoints but the example uses only
the default endpoint.
Scoring web service - Receives unlabeled data examples and makes predictions. The output of prediction
could have various forms, such as a .csv file or rows in an Azure SQL database, depending on the
configuration of the experiment. The default endpoint is automatically created for you when you publish the
predictive experiment as a web service.
The following picture depicts the relationship between training and scoring endpoints in Azure ML.
You can invoke the training web service by using the Azure ML Batch Execution Activity. Invoking a
training web service is the same as invoking an Azure ML web service (a scoring web service) for scoring data. The
preceding sections cover how to invoke an Azure ML web service from an Azure Data Factory pipeline in detail.
You use the Azure ML Update Resource Activity to update the scoring web service with the newly trained
model. The following examples provide linked service definitions:
{
"name": "updatableScoringEndpoint2",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/xxx/services/--scoring
experiment--/jobs",
"apiKey": "endpoint2Key",
"updateResourceEndpoint": "https://management.azureml.net/workspaces/xxx/webservices/--scoring
experiment--/endpoints/endpoint2"
}
}
}
If the scoring web service is a new-type web service that exposes an Azure Resource Manager endpoint, the
updateResourceEndpoint is in the following format:
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resource-group-name}/providers/Microsoft.MachineLearning/webServices/{web-service-name}?api-version=2016-05-01-preview
You can get values for the placeholders in the URL when querying the web service on the Azure Machine Learning
Web Services Portal. The new type of update resource endpoint requires an Azure Active Directory (AAD) token.
Specify servicePrincipalId and servicePrincipalKey in the AzureML linked service. See how to create a service
principal and assign permissions to manage an Azure resource. Here is a sample AzureML linked service definition:
{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"description": "The linked service for AML web service.",
"typeProperties": {
"mlEndpoint":
"https://ussouthcentral.services.azureml.net/workspaces/0000000000000000000000000000000000000/services/00000
00000000000000000000000000000000/jobs?api-version=2.0",
"apiKey": "xxxxxxxxxxxx",
"updateResourceEndpoint": "https://management.azure.com/subscriptions/00000000-0000-0000-0000-
000000000000/resourceGroups/myRG/providers/Microsoft.MachineLearning/webServices/myWebService?api-
version=2016-05-01-preview",
"servicePrincipalId": "000000000-0000-0000-0000-0000000000000",
"servicePrincipalKey": "xxxxx",
"tenant": "mycompany.com"
}
}
}
The following scenario provides more details. It has an example for retraining and updating Azure ML models
from an Azure Data Factory pipeline.
{
"name": "trainingData",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "labeledexamples",
"fileName": "labeledexamples.arff",
"format": {
"type": "TextFormat"
}
},
"availability": {
"frequency": "Week",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
{
"name": "trainingEndpoint",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/xxx/services/--training
experiment--/jobs",
"apiKey": "myKey"
}
}
}
In Azure ML Studio, do the following to get the values for mlEndpoint and apiKey:
1. Click WEB SERVICES on the left menu.
2. Click the training web service in the list of web services.
3. Click copy next to the API key text box. Paste the key from the clipboard into the Data Factory JSON editor.
4. In Azure ML Studio, click the BATCH EXECUTION link.
5. Copy the Request URI from the Request section and paste it into the Data Factory JSON editor.
Linked Service for Azure ML updatable scoring endpoint:
The following JSON snippet defines an Azure Machine Learning linked service that points to the non-default
updatable endpoint of the scoring web service.
{
"name": "updatableScoringEndpoint2",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint":
"https://ussouthcentral.services.azureml.net/workspaces/00000000eb0abe4d6bbb1d7886062747d7/services/00000000
026734a5889e02fbb1f65cefd/jobs?api-version=2.0",
"apiKey":
"sooooooooooh3WvG1hBfKS2BNNcfwSO7hhY6dY98noLfOdqQydYDIXyf2KoIaN3JpALu/AKtflHWMOCuicm/Q==",
"updateResourceEndpoint": "https://management.azure.com/subscriptions/00000000-0000-0000-0000-
000000000000/resourceGroups/Default-MachineLearning-
SouthCentralUS/providers/Microsoft.MachineLearning/webServices/myWebService?api-version=2016-05-01-preview",
"servicePrincipalId": "fe200044-c008-4008-a005-94000000731",
"servicePrincipalKey": "zWa0000000000Tp6FjtZOspK/WMA2tQ08c8U+gZRBlw=",
"tenant": "mycompany.com"
}
}
}
{
"name": "placeholderBlob",
"properties": {
"availability": {
"frequency": "Week",
"interval": 1
},
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "any",
"format": {
"type": "TextFormat"
}
}
}
}
Pipeline
The pipeline has two activities: AzureMLBatchExecution and AzureMLUpdateResource. The Azure ML Batch
Execution activity takes the training data as input and produces an iLearner file as an output. The activity invokes
the training web service (the training experiment exposed as a web service) with the input training data and receives
the iLearner file from the web service. The placeholderBlob is just a dummy output dataset that is required by the
Azure Data Factory service to run the pipeline.
{
"name": "pipeline",
"properties": {
"activities": [
{
"name": "retraining",
"type": "AzureMLBatchExecution",
"inputs": [
{
"name": "trainingData"
}
],
"outputs": [
{
"name": "trainedModelBlob"
}
],
"typeProperties": {
"webServiceInput": "trainingData",
"webServiceOutputs": {
"output1": "trainedModelBlob"
}
},
"linkedServiceName": "trainingEndpoint",
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "02:00:00"
}
},
{
"type": "AzureMLUpdateResource",
"typeProperties": {
"trainedModelName": "Training Exp for ADF ML [trained model]",
"trainedModelDatasetName" : "trainedModelBlob"
},
"inputs": [
{
"name": "trainedModelBlob"
}
],
"outputs": [
{
"name": "placeholderBlob"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"name": "AzureML Update Resource",
"linkedServiceName": "updatableScoringEndpoint2"
}
],
"start": "2016-02-13T00:00:00Z",
"end": "2016-02-14T00:00:00Z"
}
}
SQL Server Stored Procedure Activity
7/10/2017 12 min to read Edit Online
Overview
You use data transformation activities in a Data Factory pipeline to transform and process raw data into
predictions and insights. The Stored Procedure Activity is one of the transformation activities that Data Factory
supports. This article builds on the data transformation activities article, which presents a general overview of
data transformation and the supported transformation activities in Data Factory.
You can use the Stored Procedure Activity to invoke a stored procedure in one of the following data stores in
your enterprise or on an Azure virtual machine (VM):
Azure SQL Database
Azure SQL Data Warehouse
SQL Server Database. If you are using SQL Server, install Data Management Gateway on the same machine
that hosts the database or on a separate machine that has access to the database. Data Management
Gateway is a component that connects data sources on-premises/on Azure VM with cloud services in a
secure and managed way. See Data Management Gateway article for details.
IMPORTANT
When copying data into Azure SQL Database or SQL Server, you can configure the SqlSink in copy activity to invoke a
stored procedure by using the sqlWriterStoredProcedureName property. For more information, see Invoke stored
procedure from copy activity. For details about the property, see following connector articles: Azure SQL Database, SQL
Server. Invoking a stored procedure while copying data into an Azure SQL Data Warehouse by using a copy activity is
not supported. But, you can use the stored procedure activity to invoke a stored procedure in a SQL Data Warehouse.
When copying data from Azure SQL Database or SQL Server or Azure SQL Data Warehouse, you can configure
SqlSource in copy activity to invoke a stored procedure to read data from the source database by using the
sqlReaderStoredProcedureName property. For more information, see the following connector articles: Azure SQL
Database, SQL Server, Azure SQL Data Warehouse
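For example, a copy activity sink that invokes a stored procedure while writing to Azure SQL Database might look like the following sketch. The stored procedure name spWriteData is hypothetical and not part of this article's walkthrough:
"sink": {
    "type": "SqlSink",
    "sqlWriterStoredProcedureName": "spWriteData"
}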
The following walkthrough uses the Stored Procedure Activity in a pipeline to invoke a stored procedure in an
Azure SQL database.
Walkthrough
Sample table and stored procedure
1. Create the following table in your Azure SQL Database using SQL Server Management Studio or any
other tool you are comfortable with. The datetimestamp column is the date and time when the
corresponding ID is generated.
CREATE TABLE dbo.sampletable
(
Id uniqueidentifier,
datetimestamp nvarchar(127)
)
GO
Id is the unique identifier, and the datetimestamp column is the date and time when the corresponding
ID is generated.
In this sample, the stored procedure is in an Azure SQL database. If the stored procedure is in an Azure
SQL Data Warehouse or a SQL Server database, the approach is similar. For a SQL Server database, you
must install a Data Management Gateway.
2. Create the following stored procedure that inserts data into the sampletable.
CREATE PROCEDURE sp_sample @DateTime nvarchar(127)
AS
BEGIN
    INSERT INTO [sampletable]
    VALUES (newid(), @DateTime)
END
IMPORTANT
Name and casing of the parameter (DateTime in this example) must match that of the parameter specified in the
pipeline/activity JSON. In the stored procedure definition, ensure that @ is used as a prefix for the parameter.
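3. Create a linked service for your Azure SQL database in the Data Factory Editor. A minimal sketch of such an AzureSqlLinkedService definition, with placeholder server, database, and credential values, might look like the following:
{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    }
}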
4. To deploy the linked service, click Deploy on the command bar. Confirm that you see the
AzureSqlLinkedService in the tree view on the left.
{
"name": "sprocsampleout",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "sampletable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
3. To deploy the dataset, click Deploy on the command bar. Confirm that you see the dataset in the tree
view.
Create a pipeline with SqlServerStoredProcedure activity
Now, let's create a pipeline with a stored procedure activity.
Notice the following properties:
The type property is set to SqlServerStoredProcedure.
The storedProcedureName in type properties is set to sp_sample (name of the stored procedure).
The storedProcedureParameters section contains one parameter named DateTime. Name and casing of
the parameter in JSON must match the name and casing of the parameter in the stored procedure
definition. If you need to pass null for a parameter, use the syntax: "param1": null (all lowercase).
1. Click ... More on the command bar and click New pipeline.
2. Copy/paste the following JSON snippet:
{
"name": "SprocActivitySamplePipeline",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "sp_sample",
"storedProcedureParameters": {
"DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
}
},
"outputs": [
{
"name": "sprocsampleout"
}
],
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "SprocActivitySample"
}
],
"start": "2017-04-02T00:00:00Z",
"end": "2017-04-02T05:00:00Z",
"isPaused": false
}
}
3. In the Diagram View, double-click the dataset sprocsampleout . You see the slices in Ready state. There
should be five slices because a slice is produced for each hour between the start time and end time
from the JSON.
4. When a slice is in the Ready state, run a select * from sampletable query against the Azure SQL database
to verify that the data was inserted into the table by the stored procedure.
See Monitor the pipeline for detailed information about monitoring Azure Data Factory pipelines.
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from a blob to blob",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [ { "name": "InputDataset" } ],
"outputs": [ { "name": "OutputDataset" } ],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst"
},
"name": "CopyFromBlobToSQL"
},
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "SPSproc"
},
"inputs": [ { "name": "OutputDataset" } ],
"outputs": [ { "name": "SQLOutputDataset" } ],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"name": "RunStoredProcedure"
}
],
"start": "2017-04-12T00:00:00Z",
"end": "2017-04-13T00:00:00Z",
"isPaused": false,
}
}
Similarly, to link the stored procedure activity with downstream activities (the activities that run after the
stored procedure activity completes), specify the output dataset of the stored procedure activity as an input of
the downstream activity in the pipeline.
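For example, a hypothetical downstream copy activity that should run only after the stored procedure activity completes could take SQLOutputDataset as its input. DownstreamCopyActivity and FinalOutputDataset are illustrative names, not part of the walkthrough:
{
    "type": "Copy",
    "name": "DownstreamCopyActivity",
    "typeProperties": {
        "source": { "type": "SqlSource" },
        "sink": { "type": "BlobSink" }
    },
    "inputs": [ { "name": "SQLOutputDataset" } ],
    "outputs": [ { "name": "FinalOutputDataset" } ]
}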
IMPORTANT
When copying data into Azure SQL Database or SQL Server, you can configure the SqlSink in copy activity to invoke a
stored procedure by using the sqlWriterStoredProcedureName property. For more information, see Invoke stored
procedure from copy activity. For details about the property, see the following connector articles: Azure SQL Database,
SQL Server.
When copying data from Azure SQL Database or SQL Server or Azure SQL Data Warehouse, you can configure
SqlSource in copy activity to invoke a stored procedure to read data from the source database by using the
sqlReaderStoredProcedureName property. For more information, see the following connector articles: Azure SQL
Database, SQL Server, Azure SQL Data Warehouse
JSON format
Here is the JSON format for defining a Stored Procedure Activity:
{
"name": "SQLSPROCActivity",
"description": "description",
"type": "SqlServerStoredProcedure",
"inputs": [ { "name": "inputtable" } ],
"outputs": [ { "name": "outputtable" } ],
"typeProperties":
{
"storedProcedureName": "<name of the stored procedure>",
"storedProcedureParameters":
{
"param1": "param1Value"
}
}
}
To pass more than one parameter, extend the table and the stored procedure as follows.
Table:
CREATE TABLE dbo.sampletable2
(
Id uniqueidentifier,
datetimestamp nvarchar(127),
scenario nvarchar(127)
)
GO
Stored procedure:
ALTER PROCEDURE sp_sample @DateTime nvarchar(127), @Scenario nvarchar(127)
AS
BEGIN
    INSERT INTO [sampletable2]
    VALUES (newid(), @DateTime, @Scenario)
END
Now, pass the Scenario parameter and the value from the stored procedure activity. The typeProperties
section in the preceding sample looks like the following snippet:
"typeProperties":
{
"storedProcedureName": "sp_sample",
"storedProcedureParameters":
{
"DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)",
"Scenario": "Document sample"
}
}
{
"name": "sprocsampleout2",
"properties": {
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "sampletable2"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Data Lake Analytics U-SQL Activity
A pipeline in an Azure data factory processes data in linked storage services by using linked compute services.
It contains a sequence of activities where each activity performs a specific processing operation. This article
describes the Data Lake Analytics U-SQL Activity that runs a U-SQL script on an Azure Data Lake
Analytics compute linked service.
NOTE
Create an Azure Data Lake Analytics account before creating a pipeline with a Data Lake Analytics U-SQL Activity. To
learn about Azure Data Lake Analytics, see Get started with Azure Data Lake Analytics.
Review the Build your first pipeline tutorial for detailed steps to create a data factory, linked services, datasets, and a
pipeline. Use JSON snippets with Data Factory Editor or Visual Studio or Azure PowerShell to create Data Factory
entities.
resourceGroupName - Azure resource group name. Required: No (if not specified, the resource group of the data factory is used).
The following Azure Data Lake Analytics linked service uses service principal authentication:
{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "adftestaccount",
"dataLakeAnalyticsUri": "azuredatalakeanalytics.net",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": "<service principal key>",
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<optional, subscription id of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
}
}
}
The following Azure Data Lake Analytics linked service uses user credential (OAuth) authentication; the sessionId and authorization values are generated when you use the Authorize button, as described in the Token expiration section:
{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "adftestaccount",
"dataLakeAnalyticsUri": "azuredatalakeanalytics.net",
"authorization": "<authcode>",
"sessionId": "<session ID>",
"subscriptionId": "<optional, subscription id of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
}
}
}
Token expiration
The authorization code that you generated by using the Authorize button expires after some time. The expiration
time depends on the type of user account. You may see the following error message when the authentication
token expires: Credential operation error: invalid_grant - AADSTS70002: Error validating credentials.
AADSTS70008: The provided access grant is expired or revoked. Trace ID: d18629e8-af88-43c5-88e3-d8419eb1fca1
Correlation ID: fac30a0c-6be6-4e02-8d69-a776d2ffefd7 Timestamp: 2015-12-15 21:09:31Z
For user accounts managed by Azure Active Directory (AAD), the token expires 14 days after the last slice run.
To avoid or resolve this error, reauthorize by using the Authorize button when the token expires, and then
redeploy the linked service. You can also generate values for the sessionId and authorization properties
programmatically by using code, as follows:
if (linkedService.Properties.TypeProperties is AzureDataLakeStoreLinkedService ||
    linkedService.Properties.TypeProperties is AzureDataLakeAnalyticsLinkedService)
{
    AuthorizationSessionGetResponse authorizationSession = this.Client.OAuth.Get(
        this.ResourceGroupName, this.DataFactoryName, linkedService.Properties.Type);

    AzureDataLakeAnalyticsLinkedService azureDataLakeAnalyticsProperties =
        linkedService.Properties.TypeProperties as AzureDataLakeAnalyticsLinkedService;
    if (azureDataLakeAnalyticsProperties != null)
    {
        // 'authorization' holds the OAuth authorization code obtained through an
        // interactive Azure AD sign-in; that step is not shown in this snippet.
        azureDataLakeAnalyticsProperties.SessionId = authorizationSession.AuthorizationSession.SessionId;
        azureDataLakeAnalyticsProperties.Authorization = authorization;
    }
}
The following properties are specific to this activity:
scriptPath - Path to the folder that contains the U-SQL script. The file name is case-sensitive. Required: No (if you use script).
scriptLinkedService - Linked service that links the storage that contains the script to the data factory. Required: No (if you use script).
script - Specify inline script instead of specifying scriptPath and scriptLinkedService (for example: "script": "CREATE DATABASE test"). Required: No (if you use scriptPath and scriptLinkedService).
Output dataset
In this example, the output data produced by the U-SQL script is stored in an Azure Data Lake Store
(datalake/output folder).
{
"name": "EventsByRegionTable",
"properties": {
"type": "AzureDataLakeStore",
"linkedServiceName": "AzureDataLakeStoreLinkedService",
"typeProperties": {
"folderPath": "datalake/output/"
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": "<service principal key>",
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
}
}
}
See Move data to and from Azure Data Lake Store article for descriptions of JSON properties.
The pipeline runs a U-SQL script similar to the following fragment. (In the full sample script, @searchlog is first populated by an EXTRACT statement that reads the search log file referenced by the @in parameter.)
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";
@rs1 =
SELECT Start, Region, Duration
FROM @rs1
WHERE Start <= DateTime.Parse("2012/02/19");
OUTPUT @rs1
TO @out
USING Outputters.Tsv(quoting:false, dateTimeFormat:null);
The values for the @in and @out parameters in the U-SQL script are passed dynamically by Data Factory by using
the parameters section of the pipeline definition (see the sketch that follows). You can also specify other
properties, such as degreeOfParallelism and priority, in your pipeline definition for the jobs that run on the
Azure Data Lake Analytics service.
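A minimal sketch of the U-SQL activity in such a pipeline might look like the following. The scriptPath value is a placeholder, scriptLinkedService here names a hypothetical Azure Storage linked service that holds the script file, and EventsByRegionTable is the output dataset defined earlier:
{
    "type": "DataLakeAnalyticsU-SQL",
    "name": "EventsByRegionUSQLActivity",
    "linkedServiceName": "AzureDataLakeAnalyticsLinkedService",
    "typeProperties": {
        "scriptPath": "scripts\\SearchLogProcessing.txt",
        "scriptLinkedService": "StorageLinkedService",
        "degreeOfParallelism": 3,
        "priority": 100,
        "parameters": {
            "in": "/datalake/input/SearchLog.tsv",
            "out": "/datalake/output/Result.tsv"
        }
    },
    "outputs": [ { "name": "EventsByRegionTable" } ],
    "scheduler": {
        "frequency": "Day",
        "interval": 1
    }
}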
Dynamic parameters
In the sample pipeline definition, the in and out parameters are assigned hard-coded values:
"parameters": {
"in": "/datalake/input/SearchLog.tsv",
"out": "/datalake/output/Result.tsv"
}
"parameters": {
"in": "$$Text.Format('/datalake/input/{0:yyyy-MM-dd HH:mm:ss}.tsv', SliceStart)",
"out": "$$Text.Format('/datalake/output/{0:yyyy-MM-dd HH:mm:ss}.tsv', SliceStart)"
}
In this case, input files are still picked up from the /datalake/input folder and output files are generated in the
/datalake/output folder. The file names are dynamic based on the slice start time.
Use custom activities in an Azure Data Factory
pipeline
8/21/2017 35 min to read Edit Online
There are two types of activities that you can use in an Azure Data Factory pipeline.
Data Movement Activities to move data between supported source and sink data stores.
Data Transformation Activities to transform data using compute services such as Azure HDInsight, Azure
Batch, and Azure Machine Learning.
To move data to/from a data store that Data Factory does not support, create a custom activity with your
own data movement logic and use the activity in a pipeline. Similarly, to transform/process data in a way that
isn't supported by Data Factory, create a custom activity with your own data transformation logic and use the
activity in a pipeline.
You can configure a custom activity to run on an Azure Batch pool of virtual machines or on a Windows-based
Azure HDInsight cluster. When using Azure Batch, you can use only an existing Azure Batch pool. When using
HDInsight, you can use an existing HDInsight cluster or a cluster that is automatically created for you
on-demand at runtime.
The following walkthrough provides step-by-step instructions for creating a custom .NET activity and using
the custom activity in a pipeline. The walkthrough uses an Azure Batch linked service. To use an Azure
HDInsight linked service instead, you create a linked service of type HDInsight (your own HDInsight cluster)
or HDInsightOnDemand (Data Factory creates an HDInsight cluster on-demand). Then, configure custom
activity to use the HDInsight linked service. See Use Azure HDInsight linked services section for details on
using Azure HDInsight to run the custom activity.
IMPORTANT
The custom .NET activities run only on Windows-based HDInsight clusters. A workaround for this limitation is to
use the Map Reduce Activity to run custom Java code on a Linux-based HDInsight cluster. Another option is to use
an Azure Batch pool of VMs to run custom activities instead of using an HDInsight cluster.
It is not possible to use a Data Management Gateway from a custom activity to access on-premises data sources.
Currently, Data Management Gateway supports only the copy activity and stored procedure activity in Data
Factory.
Install the Microsoft.Azure.Management.DataFactories NuGet package in the project by running the following
command in the NuGet Package Manager Console:
Install-Package Microsoft.Azure.Management.DataFactories
IMPORTANT
Data Factory service launcher requires the 4.3 version of WindowsAzure.Storage. If you add a reference to a
later version of Azure Storage assembly in your custom activity project, you see an error when the activity
executes. To resolve the error, see Appdomain isolation section.
5. Add the following using statements to the source file in the project.
// Comment these lines if using VS 2017
using System.IO;
using System.Globalization;
using System.Diagnostics;
using System.Linq;
// --------------------
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
6. Change the name of the namespace to MyDotNetActivityNS, as shown in the following line:
namespace MyDotNetActivityNS
7. Change the name of the class to MyDotNetActivity and derive it from the IDotNetActivity interface, as in:
public class MyDotNetActivity : IDotNetActivity.
8. Implement (Add) the Execute method of the IDotNetActivity interface to the MyDotNetActivity
class and copy the following sample code to the method.
The following sample counts the number of occurrences of the search term (Microsoft) in each blob
associated with a data slice.
/// <summary>
/// Execute method is the only method of IDotNetActivity interface you must implement.
/// In this sample, the method invokes the Calculate method to perform the core logic.
/// </summary>
// get the first Azure Storage linked service from the linkedServices object
// using First method instead of Single since we are using the same
// Azure Storage linked service for input and output.
inputLinkedService = linkedServices.First(
linkedService =>
linkedService.Name ==
inputDataset.Properties.LinkedServiceName).Properties.TypeProperties
as AzureStorageLinkedService;
// get the output dataset using the name of the dataset matched to a name in the Activity
output collection.
Dataset outputDataset = datasets.Single(dataset => dataset.Name ==
activity.Outputs.Single().Name);
// The dictionary can be used to chain custom activities together in the future.
// This feature is not implemented yet, so just return an empty dictionary.
/// <summary>
/// Gets the folderPath value from the input/output dataset.
/// </summary>
/// <summary>
/// Gets the fileName value from the input/output dataset.
/// </summary>
/// <summary>
/// Iterates through each blob (file) in the folder, counts the number of instances of search term
in the file,
/// and prepares the output text that is written to the output blob.
/// </summary>
The GetFolderPath method returns the path to the folder that the dataset points to, and the
GetFileName method returns the name of the blob/file that the dataset points to. If you have folderPath
defined by using variables such as {Year}, {Month}, {Day}, and so on, the method returns the string as it is without
replacing them with runtime values. See the Access extended properties section for details on accessing
SliceStart, SliceEnd, etc.
"name": "InputDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "file.txt",
"folderPath": "adftutorial/inputfolder/",
The Calculate method calculates the number of instances of keyword Microsoft in the input files (blobs
in the folder). The search term (Microsoft) is hard-coded in the code.
10. Compile the project. Click Build from the menu and click Build Solution.
IMPORTANT
Set 4.5.2 version of .NET Framework as the target framework for your project: right-click the project, and click
Properties to set the target framework. Data Factory does not support custom activities compiled against
.NET Framework versions later than 4.5.2.
11. Launch Windows Explorer, and navigate to bin\debug or bin\release folder depending on the type
of build.
12. Create a zip file MyDotNetActivity.zip that contains all the binaries in the \bin\Debug folder. Include
the MyDotNetActivity.pdb file so that you get additional details such as line number in the source
code that caused the issue if there was a failure.
IMPORTANT
All the files in the zip file for the custom activity must be at the top level with no sub folders.
13. Create a blob container named customactivitycontainer if it does not already exist.
14. Upload MyDotNetActivity.zip as a blob to the customactivitycontainer in a general-purpose Azure blob
storage (not hot/cool Blob storage) that is referred by AzureStorageLinkedService.
IMPORTANT
If you add this .NET activity project to a solution in Visual Studio that contains a Data Factory project, and add a
reference to .NET activity project from the Data Factory application project, you do not need to perform the last two
steps of manually creating the zip file and uploading it to the general-purpose Azure blob storage. When you publish
Data Factory entities using Visual Studio, these steps are automatically done by the publishing process. For more
information, see Data Factory project in Visual Studio section.
The input folder corresponds to a slice in Azure Data Factory even if the folder has two or more files. When
each slice is processed by the pipeline, the custom activity iterates through all the blobs in the input folder for
that slice.
You see one output file in the adftutorial\customactivityoutput folder with one or more lines (the same as the
number of blobs in the input folder):
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2016-11-16-
00/file.txt.
NOTE
Create the file.txt and upload it to a blob container if you haven't already done so. See instructions in the preceding
section.
3. Click RESOURCE GROUP NAME, and select an existing resource group or create a resource group.
4. Verify that you are using the correct subscription and region where you want the data factory to be
created.
5. Click Create on the New data factory blade.
6. You see the data factory being created in the Dashboard of the Azure portal.
7. After the data factory has been created successfully, you see the Data Factory blade, which shows you
the contents of the data factory.
Step 2: Create linked services
Linked services link data stores or compute services to an Azure data factory. In this step, you link your Azure
Storage account and Azure Batch account to your data factory.
Create Azure Storage linked service
1. Click the Author and deploy tile on the DATA FACTORY blade for CustomActivityFactory. You see the
Data Factory Editor.
2. Click New data store on the command bar and choose Azure storage. You should see the JSON
script for creating an Azure Storage linked service in the editor.
3. Replace <accountname> with name of your Azure storage account and <accountkey> with access key of
the Azure storage account. To learn how to get your storage access key, see View, copy and regenerate
storage access keys.
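Next, create an Azure Batch linked service, which the pipeline later in this walkthrough references as AzureBatchLinkedService. A minimal sketch of such a linked service definition, assuming placeholder values for the Batch account name, access key, Batch URI, and pool name, might look like the following:
{
    "name": "AzureBatchLinkedService",
    "properties": {
        "type": "AzureBatch",
        "typeProperties": {
            "accountName": "<Azure Batch account name>",
            "accessKey": "<Azure Batch account key>",
            "poolName": "<Azure Batch pool name>",
            "batchUri": "https://<region>.batch.azure.com",
            "linkedServiceName": "AzureStorageLinkedService"
        }
    }
}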
For the poolName property, you can also specify the ID of the pool instead of the name of the
pool.
IMPORTANT
The Data Factory service does not support an on-demand option for Azure Batch as it does for
HDInsight. You can only use your own Azure Batch pool in an Azure data factory.
{
"name": "InputDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adftutorial/customactivityinput/",
"format": {
"type": "TextFormat"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
}
You create a pipeline later in this walkthrough with start time: 2016-11-16T00:00:00Z and end time:
2016-11-16T05:00:00Z. It is scheduled to produce data hourly, so there are five input/output slices
(between 00:00:00 -> 05:00:00).
The frequency and interval for the input dataset are set to Hour and 1, which means that the input
slice is available hourly. In this sample, it is the same file (file.txt) in the inputfolder.
Here are the start times for each slice, represented by the SliceStart system variable in the above JSON
snippet: 2016-11-16T00:00:00, 2016-11-16T01:00:00, 2016-11-16T02:00:00, 2016-11-16T03:00:00, and 2016-11-16T04:00:00.
3. Click Deploy on the toolbar to create and deploy the InputDataset. Confirm that you see the TABLE
CREATED SUCCESSFULLY message on the title bar of the Editor.
Create an output dataset
1. In the Data Factory editor, click ... More on the command bar, click New dataset, and then select Azure
Blob storage.
2. Replace the JSON script in the right pane with the following JSON script:
{
"name": "OutputDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "{slice}.txt",
"folderPath": "adftutorial/customactivityoutput/",
"partitionedBy": [
{
"name": "slice",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy-MM-dd-HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
SLICE    START TIME             OUTPUT FILE NAME
1        2016-11-16T00:00:00    2016-11-16-00.txt
2        2016-11-16T01:00:00    2016-11-16-01.txt
3        2016-11-16T02:00:00    2016-11-16-02.txt
4        2016-11-16T03:00:00    2016-11-16-03.txt
5        2016-11-16T04:00:00    2016-11-16-04.txt
Remember that all the files in an input folder are part of a slice with the start times mentioned above.
When a slice is processed, the custom activity scans through each file and produces a line in the
output file with the number of occurrences of the search term (Microsoft). If there are three files in the
input folder, there are three lines in the output file for each hourly slice: 2016-11-16-00.txt,
2016-11-16-01.txt, and so on.
3. To deploy the OutputDataset, click Deploy on the command bar.
Create and run a pipeline that uses the custom activity
1. In the Data Factory Editor, click ... More, and then select New pipeline on the command bar.
2. Replace the JSON in the right pane with the following JSON script:
{
"name": "ADFTutorialPipelineCustom",
"properties": {
"description": "Use custom activity",
"activities": [
{
"Name": "MyDotNetActivity",
"Type": "DotNetActivity",
"Inputs": [
{
"Name": "InputDataset"
}
],
"Outputs": [
{
"Name": "OutputDataset"
}
],
"LinkedServiceName": "AzureBatchLinkedService",
"typeProperties": {
"AssemblyName": "MyDotNetActivity.dll",
"EntryPoint": "MyDotNetActivityNS.MyDotNetActivity",
"PackageLinkedService": "AzureStorageLinkedService",
"PackageFile": "customactivitycontainer/MyDotNetActivity.zip",
"extendedProperties": {
"SliceStart": "$$Text.Format('{0:yyyyMMddHH-mm}', Time.AddMinutes(SliceStart, 0))"
}
},
"Policy": {
"Concurrency": 2,
"ExecutionPriorityOrder": "OldestFirst",
"Retry": 3,
"Timeout": "00:30:00",
"Delay": "00:00:00"
}
}
],
"start": "2016-11-16T00:00:00Z",
"end": "2016-11-16T05:00:00Z",
"isPaused": false
}
}
3. You should see that the five output slices are in the Ready state. If they are not in the Ready state, they
haven't been produced yet.
4. Verify that the output files are generated in the blob storage in the adftutorial container.
5. If you open the output file, you should see the output similar to the following output:
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2016-11-16-
00/file.txt.
6. Use the Azure portal or Azure PowerShell cmdlets to monitor your data factory, pipelines, and data
sets. You can see messages from the ActivityLogger in the code for the custom activity in the logs
(specifically user-0.log) that you can download from the portal or using cmdlets.
See Monitor and Manage Pipelines for detailed steps for monitoring datasets and pipelines.
Data Factory project in Visual Studio
You can create and publish Data Factory entities by using Visual Studio instead of using Azure portal. For
detailed information about creating and publishing Data Factory entities by using Visual Studio, See Build
your first pipeline using Visual Studio and Copy data from Azure Blob to Azure SQL articles.
Do the following additional steps if you are creating Data Factory project in Visual Studio:
1. Add the Data Factory project to the Visual Studio solution that contains the custom activity project.
2. Add a reference to the .NET activity project from the Data Factory project. Right-click Data Factory project,
point to Add, and then click Reference.
3. In the Add Reference dialog box, select the MyDotNetActivity project, and click OK.
4. Build and publish the solution.
IMPORTANT
When you publish Data Factory entities, a zip file is automatically created for you and is uploaded to the blob
container: customactivitycontainer. If the blob container does not exist, it is automatically created too.
A task is created for each activity run of a slice. If there are five slices ready to be processed, five tasks are
created in this job. If there are multiple compute nodes in the Batch pool, two or more slices can run in
parallel. If the maximum tasks per compute node is set to > 1, more than one slice can run on the same
compute node.
The following diagram illustrates the relationship between Azure Data Factory and Batch tasks.
Troubleshoot failures
Troubleshooting consists of a few basic techniques:
1. If you see the following error, you may be using a Hot/Cool blob storage instead of using a general-
purpose Azure blob storage. Upload the zip file to a general-purpose Azure Storage Account.
2. If you see the following error, confirm that the name of the class in the CS file matches the name you
specified for the EntryPoint property in the pipeline JSON. In the walkthrough, name of the class is:
MyDotNetActivity, and the EntryPoint in the JSON is: MyDotNetActivityNS.MyDotNetActivity.
If the names do match, confirm that all the binaries are in the root folder of the zip file. That is, when
you open the zip file, you should see all the files in the root folder, not in any sub folders.
3. If the input slice is not set to Ready, confirm that the input folder structure is correct and file.txt exists in
the input folders.
4. In the Execute method of your custom activity, use the IActivityLogger object to log information that
helps you troubleshoot issues. The logged messages show up in the user log files (one or more files
named: user-0.log, user-1.log, user-2.log, etc.).
In the OutputDataset blade, click the slice to see the DATA SLICE blade for that slice. You see
activity runs for that slice. You should see one activity run for the slice. If you click Run in the
command bar, you can start another activity run for the same slice.
When you click the activity run, you see the ACTIVITY RUN DETAILS blade with a list of log files. You
see logged messages in the user-0.log file. When an error occurs, you see three activity runs because
the retry count is set to 3 in the pipeline/activity JSON. When you click the activity run, you see the log
files that you can review to troubleshoot the error.
In the list of log files, click user-0.log. In the right panel are the results of using the
IActivityLogger.Write method. If you don't see all messages, check whether you have more log files named
user-1.log, user-2.log, etc. Otherwise, the code may have failed after the last logged message.
In addition, check system-0.log for any system error messages and exceptions.
5. Include the PDB file in the zip file so that the error details have information such as call stack when an
error occurs.
6. All the files in the zip file for the custom activity must be at the top level with no sub folders.
7. Ensure that the assemblyName (MyDotNetActivity.dll), entryPoint (MyDotNetActivityNS.MyDotNetActivity),
packageFile (customactivitycontainer/MyDotNetActivity.zip), and packageLinkedService (should point to the
general-purpose Azure blob storage that contains the zip file) are set to correct values.
8. If you fixed an error and want to reprocess the slice, right-click the slice in the OutputDataset blade and
click Run.
9. If you see the following error, you are using an Azure Storage package of version > 4.3.0. The Data
Factory service launcher requires the 4.3 version of WindowsAzure.Storage. See the Appdomain isolation
section for a workaround if you must use a later version of the Azure Storage assembly.
Error in Activity: Unknown error in module: System.Reflection.TargetInvocationException: Exception
has been thrown by the target of an invocation. ---> System.TypeLoadException: Could not load type
'Microsoft.WindowsAzure.Storage.Blob.CloudBlob' from assembly 'Microsoft.WindowsAzure.Storage,
Version=4.3.0.0, Culture=neutral,
If you can use the 4.3.0 version of the Azure Storage package, remove the existing reference to the Azure
Storage package of version > 4.3.0. Then, run the following command from the NuGet Package Manager
Console.
Build the project. Delete the Azure Storage assembly of version > 4.3.0 from the bin\Debug folder. Create
a zip file with the binaries and the PDB file. Replace the old zip file with this one in the blob container
(customactivitycontainer). Rerun the slices that failed (right-click the slice, and click Run).
10. The custom activity does not use the app.config file from your package. Therefore, if your code reads
any connection strings from the configuration file, it does not work at runtime. The best practice when
using Azure Batch is to hold any secrets in Azure Key Vault, use a certificate-based service
principal to protect the key vault, and distribute the certificate to the Azure Batch pool. The .NET custom
activity can then access secrets from the key vault at runtime. This solution is generic and
can scale to any type of secret, not just connection strings.
There is an easier workaround (but not a best practice): you can create an Azure SQL linked service
with connection string settings, create a dataset that uses the linked service, and chain the dataset as a
dummy input dataset to the custom .NET activity. You can then access the linked service's connection
string in the custom activity code (see the sketch after this list).
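The following is a minimal sketch of that workaround inside the custom activity's Execute method. The dataset name InputSqlDataset and the helper method are illustrative assumptions; only the SDK types come from this walkthrough.
// Requires: System.Linq, System.Collections.Generic, and Microsoft.Azure.Management.DataFactories.Models
private static string GetSqlConnectionString(IEnumerable<Dataset> datasets, IEnumerable<LinkedService> linkedServices)
{
    // Find the dummy input dataset that was chained to the custom activity (hypothetical name).
    Dataset sqlDataset = datasets.Single(ds => ds.Name == "InputSqlDataset");

    // Resolve the linked service that the dataset references.
    LinkedService sqlLinkedService = linkedServices.Single(ls => ls.Name == sqlDataset.Properties.LinkedServiceName);

    // Read the connection string from the Azure SQL linked service definition.
    var sqlProperties = (AzureSqlDatabaseLinkedService)sqlLinkedService.Properties.TypeProperties;
    return sqlProperties.ConnectionString;
}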
Appdomain isolation
See Cross AppDomain Sample that shows you how to create a custom activity that is not constrained to
assembly versions used by the Data Factory launcher (example: WindowsAzure.Storage v4.3.0,
Newtonsoft.Json v6.0.x, etc.).
"typeProperties": {
"AssemblyName": "MyDotNetActivity.dll",
"EntryPoint": "MyDotNetActivityNS.MyDotNetActivity",
"PackageLinkedService": "AzureStorageLinkedService",
"PackageFile": "customactivitycontainer/MyDotNetActivity.zip",
"extendedProperties": {
"SliceStart": "$$Text.Format('{0:yyyyMMddHH-mm}', Time.AddMinutes(SliceStart, 0))",
"DataFactoryName": "CustomActivityFactory"
}
},
In the example, there are two extended properties: SliceStart and DataFactoryName. The value for
SliceStart is based on the SliceStart system variable. See System Variables for a list of supported system
variables. The value for DataFactoryName is hard-coded to CustomActivityFactory.
To access these extended properties in the Execute method, use code similar to the following code:
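A minimal sketch of such an Execute method, assuming the custom activity pattern used earlier in this walkthrough, could look like the following; the logging call is included only to show where the values end up (user-0.log):
public IDictionary<string, string> Execute(
    IEnumerable<LinkedService> linkedServices,
    IEnumerable<Dataset> datasets,
    Activity activity,
    IActivityLogger logger)
{
    // The activity's typeProperties deserialize into a DotNetActivity object.
    DotNetActivity dotNetActivity = (DotNetActivity)activity.TypeProperties;

    // Read the extended properties defined in the pipeline JSON.
    string sliceStart = dotNetActivity.ExtendedProperties["SliceStart"];
    string dataFactoryName = dotNetActivity.ExtendedProperties["DataFactoryName"];

    // Log the values so that they show up in user-0.log.
    logger.Write("SliceStart: {0}, DataFactoryName: {1}", sliceStart, dataFactoryName);

    return new Dictionary<string, string>();
}
The autoscale formula that follows is for an Azure Batch pool; it sets the pool's target number of dedicated VMs based on the number of pending tasks, so that slices are not left waiting for compute capacity.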
startingNumberOfVMs = 1;
maxNumberofVMs = 25;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs :
avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicated=min(maxNumberofVMs,pendingTaskSamples);
See Automatically scale compute nodes in an Azure Batch pool for details.
If the pool is using the default autoScaleEvaluationInterval, the Batch service could take 15-30 minutes to
prepare the VM before running the custom activity. If the pool is using a different
autoScaleEvaluationInterval, the Batch service could take autoScaleEvaluationInterval + 10 minutes.
IMPORTANT
The custom .NET activities run only on Windows-based HDInsight clusters. A workaround for this limitation is to use
the Map Reduce Activity to run custom Java code on a Linux-based HDInsight cluster. Another option is to use an
Azure Batch pool of VMs to run custom activities instead of using a HDInsight cluster.
1. Create an Azure HDInsight linked service.
2. Use HDInsight linked service in place of AzureBatchLinkedService in the pipeline JSON.
If you want to test it with the walkthrough, change start and end times for the pipeline so that you can test
the scenario with the Azure HDInsight service.
Create Azure HDInsight linked service
The Azure Data Factory service supports creating an on-demand cluster and using it to process input and
produce output data. You can also use your own cluster to do the same. When you use an on-demand
HDInsight cluster, a cluster is created for each slice. When you use your own HDInsight cluster, the
cluster is ready to process the slice immediately. Therefore, when you use an on-demand cluster, you may not
see the output data as quickly as when you use your own cluster.
NOTE
At runtime, an instance of a .NET activity runs only on one worker node in the HDInsight cluster; it cannot be scaled to
run on multiple nodes. Multiple instances of .NET activity can run in parallel on different nodes of the HDInsight
cluster.
To use an on-demand HDInsight cluster
1. In the Azure portal, click Author and Deploy in the Data Factory home page.
2. In the Data Factory Editor, click New compute from the command bar and select On-demand
HDInsight cluster from the menu.
3. Make the following changes to the JSON script:
a. For the clusterSize property, specify the size of the HDInsight cluster.
b. For the timeToLive property, specify how long the cluster can be idle before it is deleted.
c. For the version property, specify the HDInsight version you want to use. If you exclude this
property, the latest version is used.
d. For the linkedServiceName, specify AzureStorageLinkedService.
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterSize": 4,
"timeToLive": "00:05:00",
"osType": "Windows",
"linkedServiceName": "AzureStorageLinkedService",
}
}
}
To use your own HDInsight cluster
1. In the Azure portal, click Author and Deploy in the Data Factory home page.
2. In the Data Factory Editor, click New compute from the command bar and select HDInsight cluster
from the menu.
3. Make the following changes to the JSON script:
a. For the clusterUri property, enter the URL for your HDInsight cluster. For example:
https://<clustername>.azurehdinsight.net/
b. For the UserName property, enter the user name who has access to the HDInsight cluster.
c. For the Password property, enter the password for the user.
d. For the LinkedServiceName property, enter AzureStorageLinkedService.
4. Click Deploy on the command bar to deploy the linked service.
See Compute linked services for details.
In the pipeline JSON, use HDInsight (on-demand or your own) linked service:
{
"name": "ADFTutorialPipelineCustom",
"properties": {
"description": "Use custom activity",
"activities": [
{
"Name": "MyDotNetActivity",
"Type": "DotNetActivity",
"Inputs": [
{
"Name": "InputDataset"
}
],
"Outputs": [
{
"Name": "OutputDataset"
}
],
"LinkedServiceName": "HDInsightOnDemandLinkedService",
"typeProperties": {
"AssemblyName": "MyDotNetActivity.dll",
"EntryPoint": "MyDotNetActivityNS.MyDotNetActivity",
"PackageLinkedService": "AzureStorageLinkedService",
"PackageFile": "customactivitycontainer/MyDotNetActivity.zip",
"extendedProperties": {
"SliceStart": "$$Text.Format('{0:yyyyMMddHH-mm}', Time.AddMinutes(SliceStart, 0))"
}
},
"Policy": {
"Concurrency": 2,
"ExecutionPriorityOrder": "OldestFirst",
"Retry": 3,
"Timeout": "00:30:00",
"Delay": "00:00:00"
}
}
],
"start": "2016-11-16T00:00:00Z",
"end": "2016-11-16T05:00:00Z",
"isPaused": false
}
}
using System;
using System.Configuration;
using System.Collections.ObjectModel;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure;
using Microsoft.Azure.Management.DataFactories;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Common.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using System.Collections.Generic;
namespace DataFactoryAPITestApp
{
class Program
{
static void Main(string[] args)
{
// create data factory management client
// TODO: replace APITutorialFactory with a name that is globally unique. For example: APITutorialFactory04212017
string dataFactoryName = "APITutorialFactory";
// create an Azure Batch linked service
Console.WriteLine("Creating Azure Batch linked service");
client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName,
new LinkedServiceCreateOrUpdateParameters()
{
LinkedService = new LinkedService()
{
Name = "AzureBatchLinkedService",
Properties = new LinkedServiceProperties
(
// TODO: replace <batchaccountname> and <yourbatchaccountkey> with name and key of your Azure Batch account
new AzureBatchLinkedService("<batchaccountname>", "https://westus.batch.azure.com", "<yourbatchaccountkey>", "myazurebatchpool", "AzureStorageLinkedService")
)
}
}
);
// create the input dataset (this object is passed to a client.Datasets.CreateOrUpdate call; surrounding code is not shown in this excerpt)
new DatasetCreateOrUpdateParameters()
{
Dataset = new Dataset()
{
Name = Dataset_Source,
Properties = new DatasetProperties()
{
LinkedServiceName = "AzureStorageLinkedService",
TypeProperties = new AzureBlobDataset()
{
FolderPath = "adftutorial/customactivityinput/",
Format = new TextFormat()
},
External = true,
Availability = new Availability()
{
Frequency = SchedulePeriod.Hour,
Interval = 1,
},
client.Pipelines.CreateOrUpdate(resourceGroupName, dataFactoryName,
new PipelineCreateOrUpdateParameters()
{
Pipeline = new Pipeline()
{
Name = PipelineName,
Properties = new PipelineProperties()
{
Description = "Use custom activity",
// Initial value for pipeline's active period. With this, you won't need to set slice status
Start = PipelineActivePeriodStartTime,
End = PipelineActivePeriodEndTime,
IsPaused = false,
// from a helper method that acquires an Azure AD token; the returned access token authenticates the DataFactoryManagementClient
if (result != null)
    return result.AccessToken;
SAMPLE WHAT CUSTOM ACTIVITY DOES
HTTP Data Downloader Downloads data from an HTTP endpoint to Azure Blob Storage by using a custom C# activity in Data Factory.
Twitter Sentiment Analysis sample Invokes an Azure ML model and does sentiment analysis, scoring, prediction, etc.
Cross AppDomain .NET Activity Uses assembly versions that differ from the ones used by the Data Factory launcher.
Reprocess a model in Azure Analysis Services Reprocesses a model in Azure Analysis Services.
Compute environments supported by Azure Data
Factory
8/24/2017 17 min to read Edit Online
This article explains different compute environments that you can use to process or transform data. It also
provides details about different configurations (on-demand vs. bring your own) supported by Data Factory when
configuring linked services linking these compute environments to an Azure data factory.
The following table provides a list of compute environments supported by Data Factory and the activities that
can run on them.
COMPUTE ENVIRONMENT ACTIVITIES
On-demand HDInsight cluster or your own HDInsight cluster DotNet, Hive, Pig, MapReduce, Hadoop Streaming
Azure Machine Learning Machine Learning activities: Batch Execution and Update Resource
Azure SQL, Azure SQL Data Warehouse, SQL Server Stored Procedure
After 07/15/2017, if left blank, default values apply for the version and osType properties.
Recommendations:
Before 07/15/2017, test the compatibility of the activities that reference this linked service with the
latest supported HDInsight version, using the information documented in Hadoop components available
with different HDInsight versions and the Hortonworks release notes associated with HDInsight versions.
After 07/15/2017, explicitly specify the osType and version values if you want to override the default settings.
NOTE
Currently Azure Data Factory does not support HDInsight clusters using Azure Data Lake Store as primary store. Use
Azure Storage as primary store for HDInsight clusters.
NOTE
The on-demand configuration is currently supported only for Azure HDInsight clusters.
IMPORTANT
It typically takes 20 minutes or more to provision an Azure HDInsight cluster on demand.
Example
The following JSON defines a Linux-based on-demand HDInsight linked service. The Data Factory service
automatically creates a Linux-based HDInsight cluster when processing a data slice.
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "AzureStorageLinkedService"
}
}
}
To use a Windows-based HDInsight cluster, set osType to windows, or omit the property (the default value is
windows).
IMPORTANT
The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName).
HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With the on-demand
HDInsight linked service, an HDInsight cluster is created every time a slice needs to be processed (unless an existing
cluster is still alive within its timeToLive window), and the cluster is deleted when the processing is done.
As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for
troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these containers
follow a pattern: adfyourdatafactoryname-linkedservicename-datetimestamp. Use tools such as Microsoft Azure
Storage Explorer to delete containers in your Azure blob storage.
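For example, a small console program along the following lines can list and delete those leftover containers with the WindowsAzure.Storage client library; the connection string and the container-name prefix are placeholder assumptions, so review the matched containers before deleting anything.
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class CleanUpOnDemandHdiContainers
{
    static void Main()
    {
        // Storage account referenced by the on-demand HDInsight linked service (placeholder connection string).
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>");
        CloudBlobClient blobClient = account.CreateCloudBlobClient();

        // Containers created for on-demand clusters start with adf<datafactoryname>-<linkedservicename>-.
        foreach (CloudBlobContainer container in blobClient.ListContainers("adfmydatafactoryname-"))
        {
            Console.WriteLine("Deleting container: {0}", container.Name);
            container.DeleteIfExists();
        }
    }
}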
Properties
PROPERTY DESCRIPTION REQUIRED
"additionalLinkedServiceNames": [
"otherLinkedServiceName1",
"otherLinkedServiceName2"
]
Advanced Properties
You can also specify the following properties for the granular configuration of the on-demand HDInsight cluster.
{
"name": " HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterSize": 16,
"timeToLive": "01:30:00",
"linkedServiceName": "adfods1",
"coreConfiguration": {
"templeton.mapper.memory.mb": "5000"
},
"hiveConfiguration": {
"templeton.mapper.memory.mb": "5000"
},
"mapReduceConfiguration": {
"mapreduce.reduce.java.opts": "-Xmx4000m",
"mapreduce.map.java.opts": "-Xmx4000m",
"mapreduce.map.memory.mb": "5000",
"mapreduce.reduce.memory.mb": "5000",
"mapreduce.job.reduce.slowstart.completedmaps": "0.8"
},
"yarnConfiguration": {
"yarn.app.mapreduce.am.resource.mb": "5000",
"mapreduce.map.memory.mb": "5000"
},
"additionalLinkedServiceNames": [
"datafeeds",
"adobedatafeed"
]
}
}
}
Node sizes
You can specify the sizes of head, data, and zookeeper nodes using the following properties:
"headNodeSize": "Standard_D4",
"dataNodeSize": "Standard_D4",
If you specify a wrong value for these properties, you may receive the following error: Failed to create cluster.
Exception: Unable to complete the cluster create operation. Operation failed with code '400'. Cluster left behind
state: 'Error'. Message: 'PreClusterCreationValidationFailure'. When you receive this error, ensure that you are
using the name listed in the CMDLET & APIS column of the table in the Sizes of Virtual Machines article.
{
"name": "HDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": " https://<hdinsightclustername>.azurehdinsight.net/",
"userName": "admin",
"password": "<password>",
"linkedServiceName": "MyHDInsightStoragelinkedService"
}
}
}
Properties
PROPERTY DESCRIPTION REQUIRED
{
"name": "AzureBatchLinkedService",
"properties": {
"type": "AzureBatch",
"typeProperties": {
"accountName": "<Azure Batch account name>",
"accessKey": "<Azure Batch account key>",
"poolName": "<Azure Batch pool name>",
"linkedServiceName": "<Specify associated storage linked service reference here>"
}
}
}
Append ".<region name>" to the name of your batch account for the accountName property. Example:
"accountName": "mybatchaccount.eastus"
Another option is to provide the batchUri endpoint as shown in the following sample:
"accountName": "adfteam",
"batchUri": "https://eastus.batch.azure.com",
Properties
PROPERTY DESCRIPTION REQUIRED
{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://[batch scoring endpoint]/jobs",
"apiKey": "<apikey>"
}
}
}
Properties
PROPERTY DESCRIPTION REQUIRED
resourceGroupName Azure resource group name No (If not specified, resource group of
the data factory is used).
{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "adftestaccount",
"dataLakeAnalyticsUri": "datalakeanalyticscompute.net",
"authorization": "<authcode>",
"sessionId": "<session ID>",
"subscriptionId": "<optional, subscription id of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
}
}
}
Token expiration
The authorization code you generated by using the Authorize button expires after sometime. See the following
table for the expiration times for different types of user accounts. You may see the following error message
when the authentication token expires: Credential operation error: invalid_grant - AADSTS70002: Error
validating credentials. AADSTS70008: The provided access grant is expired or revoked. Trace ID: d18629e8-af88-
43c5-88e3-d8419eb1fca1 Correlation ID: fac30a0c-6be6-4e02-8d69-a776d2ffefd7 Timestamp: 2015-12-15
21:09:31Z
USER TYPE EXPIRES AFTER
User accounts managed by Azure Active Directory (AAD) 14 days after the last slice run.
To avoid/resolve this error, reauthorize using the Authorize button when the token expires and redeploy the
linked service. You can also generate values for sessionId and authorization properties programmatically
using code as follows:
if (linkedService.Properties.TypeProperties is AzureDataLakeStoreLinkedService ||
linkedService.Properties.TypeProperties is AzureDataLakeAnalyticsLinkedService)
{
AuthorizationSessionGetResponse authorizationSession = this.Client.OAuth.Get(this.ResourceGroupName,
this.DataFactoryName, linkedService.Properties.Type);
AzureDataLakeAnalyticsLinkedService azureDataLakeAnalyticsProperties =
linkedService.Properties.TypeProperties as AzureDataLakeAnalyticsLinkedService;
if (azureDataLakeAnalyticsProperties != null)
{
azureDataLakeAnalyticsProperties.SessionId = authorizationSession.AuthorizationSession.SessionId;
// 'authorization' holds the authorization code obtained by signing the user in interactively (that sign-in code is not shown in this excerpt)
azureDataLakeAnalyticsProperties.Authorization = authorization;
}
}
Overview
While using Azure Data Factory for your data integration needs, you may find yourself reusing the same pattern
across different environments or implementing the same task repetitively within the same solution. Templates help
you implement and manage these scenarios in an easy manner. Templates in Azure Data Factory are ideal for
scenarios that involve reusability and repetition.
Consider the situation where an organization has 10 manufacturing plants across the world. The logs from each
plant are stored in a separate on-premises SQL Server database. The company wants to build a single data
warehouse in the cloud for ad-hoc analytics. It also wants to have the same logic but different configurations for
development, test, and production environments.
In this case, a task needs to be repeated within the same environment, but with different values across the 10 data
factories for each manufacturing plant. In effect, repetition is present. Templating allows the abstraction of this
generic flow (that is, pipelines having the same activities in each data factory), but uses a separate parameter file for
each manufacturing plant.
Furthermore, as the organization wants to deploy these 10 data factories multiple times across different
environments, templates can use this reusability by utilizing separate parameter files for development, test, and
production environments.
Tutorials
See the following tutorials for step-by-step instructions to create Data Factory entities by using Resource Manager
templates:
Tutorial: Create a pipeline to copy data by using Azure Resource Manager template
Tutorial: Create a pipeline to process data by using Azure Resource Manager template
"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": { ...
},
"variables": { ...
},
"resources": [
{
"name": "[parameters('dataFactoryName')]",
"apiVersion": "[variables('apiVersion')]",
"type": "Microsoft.DataFactory/datafactories",
"location": "westus",
"resources": [
{ "type": "linkedservices",
...
},
{"type": "datasets",
...
},
{"type": "dataPipelines",
...
}
]
}
]
}
"resources": [
{
"name": "[variables('<mydataFactoryName>')]",
"apiVersion": "2015-10-01",
"type": "Microsoft.DataFactory/datafactories",
"location": "East US"
}
]
"type": "linkedservices",
"name": "[variables('<LinkedServiceName>')]",
"apiVersion": "2015-10-01",
"dependsOn": [ "[variables('<dataFactoryName>')]" ],
"properties": {
...
}
See Storage Linked Service or Compute Linked Services for details about the JSON properties for the specific linked
service you wish to deploy. The dependsOn parameter specifies the name of the corresponding data factory. A
linked service for Azure Storage, for example, is defined by following this same pattern.
Define datasets
"type": "datasets",
"name": "[variables('<myDatasetName>')]",
"dependsOn": [
"[variables('<dataFactoryName>')]",
"[variables('<myDatasetLinkedServiceName>')]"
],
"apiVersion": "2015-10-01",
"properties": {
...
}
Refer to Supported data stores for details about the JSON properties for the specific dataset type you wish to
deploy. Note that the dependsOn parameter specifies the names of the corresponding data factory and storage linked
service. An example of defining a dataset of type Azure Blob storage is shown in the following JSON definition:
"type": "datasets",
"name": "[variables('storageDataset')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('storageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('storageLinkedServiceName')]",
"typeProperties": {
"folderPath": "[concat(parameters('sourceBlobContainer'), '/')]",
"fileName": "[parameters('sourceBlobName')]",
"format": {
"type": "TextFormat"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
Define pipelines
"type": "dataPipelines",
"name": "[variables('<mypipelineName>')]",
"dependsOn": [
"[variables('<dataFactoryName>')]",
"[variables('<inputDatasetLinkedServiceName>')]",
"[variables('<outputDatasetLinkedServiceName>')]",
"[variables('<inputDataset>')]",
"[variables('<outputDataset>')]"
],
"apiVersion": "2015-10-01",
"properties": {
"activities": [
...
]
}
Refer to defining pipelines for details about the JSON properties for defining the specific pipeline and activities you
wish to deploy. Note the dependsOn parameter specifies name of the data factory, and any corresponding linked
services or datasets. An example of a pipeline that copies data from Azure Blob Storage to Azure SQL Database is
shown in the following JSON snippet:
"type": "datapipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('azureSqlLinkedServiceName')]",
"[variables('blobInputDatasetName')]",
"[variables('sqlOutputDatasetName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"activities": [
{
"name": "CopyFromAzureBlobToAzureSQL",
"description": "Copy data frm Azure blob to Azure SQL",
"type": "Copy",
"inputs": [
{
"name": "[variables('blobInputDatasetName')]"
}
],
"outputs": [
{
"name": "[variables('sqlOutputDatasetName')]"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'emp')"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "Column0:FirstName,Column1:LastName"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 3,
"timeout": "01:00:00"
}
}
],
"start": "2016-10-03T00:00:00Z",
"end": "2016-10-04T00:00:00Z"
}
"id":"/subscriptions/<subscriptionID>/resourceGroups/<resourceGroupName>/providers/Microsoft.KeyVault/vaults/<k
eyVaultName>",
},
"secretName": "<secretName>"
},
},
...
}
NOTE
Exporting templates for existing data factories is not currently supported; it is in the works.
Azure Data Factory - Samples
6/27/2017 6 min to read Edit Online
Samples on GitHub
The GitHub Azure-DataFactory repository contains several samples that help you quickly ramp up with the Azure Data
Factory service, or modify the scripts and use them in your own application. The Samples\JSON folder contains JSON
snippets for common scenarios.
SAMPLE DESCRIPTION
JSON samples This sample provides JSON examples for common scenarios.
Http Data Downloader Sample This sample showcases downloading of data from an HTTP
endpoint to Azure Blob Storage using custom .NET activity.
Cross AppDomain Dot Net Activity Sample This sample allows you to author a custom .NET activity that is
not constrained to assembly versions used by the ADF
launcher (For example, WindowsAzure.Storage v4.3.0,
Newtonsoft.Json v6.0.x, etc.).
Run R script This sample includes the Data Factory custom activity that can
be used to invoke RScript.exe. This sample works only with
your own (not on-demand) HDInsight cluster that already has
R Installed on it.
Invoke Spark jobs on HDInsight Hadoop cluster This sample shows how to use MapReduce activity to invoke a
Spark program. The spark program just copies data from one
Azure Blob container to another.
Twitter Analysis using Azure Machine Learning Batch Scoring Activity This sample shows how to use
AzureMLBatchScoringActivity to invoke an Azure Machine Learning model that performs twitter sentiment analysis,
scoring, prediction, etc.
Twitter Analysis using custom activity This sample shows how to use a custom .NET activity to
invoke an Azure Machine Learning model that performs
twitter sentiment analysis, scoring, prediction etc.
Parameterized Pipelines for Azure Machine Learning The sample provides an end-to-end C# code to deploy N
pipelines for scoring and retraining each with a different region
parameter where the list of regions is coming from a
parameters.txt file, which is included with this sample.
Reference Data Refresh for Azure Stream Analytics jobs This sample shows how to use Azure Data Factory and Azure
Stream Analytics together to run the queries with reference
data and setup the refresh for reference data on a schedule.
Hybrid Pipeline with On-premises Hortonworks Hadoop The sample uses an on-premises Hadoop cluster as a compute
target for running jobs in Data Factory just like you would add
other compute targets like an HDInsight based Hadoop
cluster in cloud.
JSON Conversion Tool This tool allows you to convert JSONs from version prior to
2015-07-01-preview to latest or 2015-07-01-preview
(default).
U-SQL sample input file This file is a sample file used by an U-SQL activity.
Delete blob file This sample showcases a C# file which can be used as part of
ADF custom .net activity to delete files from the source Azure
Blob location once the files have been copied.
TEMPLATE DESCRIPTION
Copy from Azure Blob Storage to Azure SQL Database Deploying this template creates an Azure data factory with a
pipeline that copies data from the specified Azure blob storage
to the Azure SQL database
Copy from Salesforce to Azure Blob Storage Deploying this template creates an Azure data factory with a
pipeline that copies data from the specified Salesforce account
to the Azure blob storage.
Transform data by running Hive script on an Azure HDInsight cluster Deploying this template creates an Azure data
factory with a pipeline that transforms data by running the sample Hive script on an Azure HDInsight Hadoop
cluster.
4. Specify configuration settings for the sample. For example, your Azure storage account name and account
key, Azure SQL server name, database, User ID, and password, etc.
5. After you are done with specifying the configuration settings, click Create to create/deploy the sample pipelines
and linked services/tables used by the pipelines.
6. You see the status of deployment on the sample tile you clicked earlier on the Sample pipelines blade.
7. When you see the Deployment succeeded message on the tile for the sample, close the Sample pipelines
blade.
8. On DATA FACTORY blade, you see that linked services, data sets, and pipelines are added to your data
factory.
Samples in Visual Studio
Prerequisites
You must have the following installed on your computer:
Visual Studio 2013 or Visual Studio 2015
Download Azure SDK for Visual Studio 2013 or Visual Studio 2015. Navigate to Azure Download Page and click
VS 2013 or VS 2015 in the .NET section.
Download the latest Azure Data Factory plugin for Visual Studio: VS 2013 or VS 2015. If you are using Visual
Studio 2013, you can also update the plugin by doing the following steps: On the menu, click Tools ->
Extensions and Updates -> Online -> Visual Studio Gallery -> Microsoft Azure Data Factory Tools for
Visual Studio -> Update.
Use Data Factory Templates
1. Click File on the menu, point to New, and click Project.
2. In the New Project dialog box, do the following steps:
a. Select DataFactory under Templates.
b. Select Data Factory Templates in the right pane.
c. Enter a name for the project.
d. Select a location for the project.
e. Click OK.
3. In the Data Factory Templates dialog box, select the sample template from the Use-Case Templates
section, and click Next. The following steps walk you through using the Customer Profiling template. Steps
are similar for the other samples.
4. In the Data Factory Configuration dialog, click Next on the Data Factory Basics page.
5. On the Configure data factory page, do the following steps:
a. Select Create New Data Factory. You can also select Use existing data factory.
b. Enter a name for the data factory.
c. Select the Azure subscription in which you want the data factory to be created.
d. Select the resource group for the data factory.
e. Select the West US, East US, or North Europe for the region.
f. Click Next.
6. In the Configure data stores page, specify an existing Azure SQL database and Azure storage account (or)
create database/storage, and click Next.
7. In the Configure compute page, select defaults, and click Next.
8. In the Summary page, review all settings, and click Next.
9. In the Deployment Status page, wait until the deployment is finished, and click Finish.
10. Right-click project in the Solution Explorer, and click Publish.
11. If you see Sign in to your Microsoft account dialog box, enter your credentials for the account that has Azure
subscription, and click sign in.
12. You should see the following dialog box:
Azure Data Factory - Functions and system variables
This article provides information about functions and variables supported by Azure Data Factory.
NOTE
Currently data factory requires that the schedule specified in the activity exactly matches the schedule specified in
availability of the output dataset. Therefore, WindowStart, WindowEnd, and SliceStart and SliceEnd always map to the
same time period and a single output slice.
{
"Type": "SqlSource",
"sqlReaderQuery": "$$Text.Format('SELECT * FROM MyTable WHERE StartTime = \\'{0:yyyyMMdd-HH}\\'',
WindowStart)"
}
See the Custom Date and Time Format Strings topic, which describes the different formatting options you can use (for
example: yy vs. yyyy).
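As a quick illustration of those format strings (not part of the original reference), the same timestamp rendered with a few .NET custom formats:
using System;
using System.Globalization;

class FormatStringDemo
{
    static void Main()
    {
        var sliceStart = new DateTime(2016, 11, 16, 1, 5, 0);

        // The format used by the custom activity examples earlier in this document.
        Console.WriteLine(sliceStart.ToString("yyyyMMddHH-mm", CultureInfo.InvariantCulture)); // 2016111601-05

        // Two-digit versus four-digit year.
        Console.WriteLine(sliceStart.ToString("yy", CultureInfo.InvariantCulture));   // 16
        Console.WriteLine(sliceStart.ToString("yyyy", CultureInfo.InvariantCulture)); // 2016
    }
}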
Functions
The following tables list the functions in Azure Data Factory. Examples for the date and time functions include:
Adding minutes: 9/15/2013 12:00:00 PM + 15 minutes = 9/15/2013 12:15:00 PM
StartOfHour: StartOfHour of 9/15/2013 05:10:23 PM is 9/15/2013 05:00:00 PM
Subtracting days: 9/15/2013 12:00:00 PM - 2 days = 9/13/2013 12:00:00 PM
Subtracting months: 9/15/2013 12:00:00 PM - 1 month = 8/15/2013 12:00:00 PM
Adding quarters: 9/15/2013 12:00:00 PM + 1 quarter = 12/15/2013 12:00:00 PM
Subtracting weeks: 9/15/2013 12:00:00 PM - 1 week = 9/8/2013 12:00:00 PM
Subtracting years: 9/15/2013 12:00:00 PM - 1 year = 9/15/2012 12:00:00 PM
Day: Day of 9/15/2013 12:00:00 PM is 9
DayOfWeek: DayOfWeek of 9/15/2013 12:00:00 PM is Sunday
DayOfYear: 12/1/2015 is day 335 of 2015; 12/31/2015 is day 365 of 2015; 12/31/2016 is day 366 of 2016 (leap year)
DaysInMonth: DaysInMonth of 9/15/2013 is 30, since there are 30 days in September
EndOfDay: EndOfDay of 9/15/2013 05:10:23 PM is 9/15/2013 11:59:59 PM
EndOfMonth: EndOfMonth of 9/15/2013 05:10:23 PM is 9/30/2013 11:59:59 PM (the date time that represents the end of September)
StartOfDay: StartOfDay of 9/15/2013 05:10:23 PM is 9/15/2013 12:00:00 AM
Example
In the following example, input and output parameters for the Hive activity are determined by using the
Text.Format function and SliceStart system variable.
{
"name": "HiveActivitySamplePipeline",
"properties": {
"activities": [
{
"name": "HiveActivitySample",
"type": "HDInsightHive",
"inputs": [
{
"name": "HiveSampleIn"
}
],
"outputs": [
{
"name": "HiveSampleOut"
}
],
"linkedServiceName": "HDInsightLinkedService",
"typeproperties": {
"scriptPath": "adfwalkthrough\\scripts\\samplehive.hql",
"scriptLinkedService": "StorageLinkedService",
"defines": {
"Input":
"$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/samplein/yearno=
{0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)",
"Output":
"$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/sampleout/yearno=
{0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
}
}
}
]
}
}
Example 2
In the following example, the DateTime parameter for the Stored Procedure Activity is determined by using the
Text.Format function and the SliceStart variable.
{
"name": "SprocActivitySamplePipeline",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "sp_sample",
"storedProcedureParameters": {
"DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
}
},
"outputs": [
{
"name": "sprocsampleout"
}
],
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "SprocActivitySample"
}
],
"start": "2016-08-02T00:00:00Z",
"end": "2016-08-02T05:00:00Z",
"isPaused": false
}
}
Example 3
To read data from the previous day instead of the day represented by SliceStart, use the AddDays function as
shown in the following example:
{
"name": "SamplePipeline",
"properties": {
"start": "2016-01-01T08:00:00",
"end": "2017-01-01T11:00:00",
"description": "hive activity",
"activities": [
{
"name": "SampleHiveActivity",
"inputs": [
{
"name": "MyAzureBlobInput",
"startTime": "Date.AddDays(SliceStart, -1)",
"endTime": "Date.AddDays(SliceEnd, -1)"
}
],
"outputs": [
{
"name": "MyAzureBlobOutput"
}
],
"linkedServiceName": "HDInsightLinkedService",
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adftutorial\\hivequery.hql",
"scriptLinkedService": "StorageLinkedService",
"defines": {
"Year": "$$Text.Format('{0:yyyy}',WindowsStart)",
"Month": "$$Text.Format('{0:MM}',WindowStart)",
"Day": "$$Text.Format('{0:dd}',WindowStart)"
}
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 2,
"timeout": "01:00:00"
}
}
]
}
}
See the Custom Date and Time Format Strings topic, which describes the different formatting options you can use (for
example: yy vs. yyyy).
Azure Data Factory - naming rules
8/15/2017 1 min to read Edit Online
The following table provides naming rules for Data Factory artifacts.
Data Factory: Unique across Microsoft Azure; names are case-insensitive (that is, MyDF and mydf refer to the same
data factory). Each data factory is tied to exactly one Azure subscription. Object names must start with a letter or a
number, and can contain only letters, numbers, and the dash (-) character. Every dash (-) character must be
immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted in container
names. Names can be 3-63 characters long.
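A quick way to sanity-check a candidate name against these rules is a small helper like the one below; it is an illustrative sketch, not part of any official tooling.
using System;
using System.Text.RegularExpressions;

class DataFactoryNameCheck
{
    // Letters, numbers, and dashes only; every dash surrounded by a letter or number; no leading, trailing, or consecutive dashes.
    static readonly Regex NamePattern = new Regex("^[A-Za-z0-9]+(-[A-Za-z0-9]+)*$");

    static bool IsValidName(string name) =>
        name != null && name.Length >= 3 && name.Length <= 63 && NamePattern.IsMatch(name);

    static void Main()
    {
        Console.WriteLine(IsValidName("MyDF-2017"));   // True
        Console.WriteLine(IsValidName("my--factory")); // False: consecutive dashes
        Console.WriteLine(IsValidName("-mydf"));       // False: starts with a dash
    }
}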
Azure Data Factory - .NET API change log
This article provides information about changes to the Azure Data Factory SDK in a specific version. You can find the
latest NuGet package for Azure Data Factory here.
Version 4.11.0
Feature Additions:
The following linked service types have been added:
OnPremisesMongoDbLinkedService
AmazonRedshiftLinkedService
AwsAccessKeyLinkedService
The following dataset types have been added:
MongoDbCollectionDataset
AmazonS3Dataset
The following copy source types have been added:
MongoDbSource
Version 4.10.0
The following optional properties have been added to TextFormat:
SkipLineCount
FirstRowAsHeader
TreatEmptyAsNull
The following linked service types have been added:
OnPremisesCassandraLinkedService
SalesforceLinkedService
The following dataset types have been added:
OnPremisesCassandraTableDataset
The following copy source types have been added:
CassandraSource
Add WebServiceInputs property to AzureMLBatchExecutionActivity
Enable passing multiple web service inputs to an Azure Machine Learning experiment
Version 4.9.1
Bug fix
Deprecate WebApi-based authentication for WebLinkedService.
Version 4.9.0
Feature Additions
Add EnableStaging and StagingSettings properties to CopyActivity. See Staged copy for details on the feature.
Bug fix
Introduce an overload of ActivityWindowOperationExtensions.List method, which takes an
ActivityWindowsByActivityListParameters instance.
Mark WriteBatchSize and WriteBatchTimeout as optional in CopySink.
Version 4.8.0
Feature Additions
The following optional properties have been added to Copy activity type to enable tuning of copy performance:
ParallelCopies
CloudDataMovementUnits
Version 4.7.0
Feature Additions
Added new StorageFormat type OrcFormat type to copy files in optimized row columnar (ORC) format.
Add AllowPolyBase and PolyBaseSettings properties to SqlDWSink.
Enables the use of PolyBase to copy data into SQL Data Warehouse.
Version 4.6.1
Bug Fixes
Fixes HTTP request for listing activity windows.
Removes the resource group name and the data factory name from the request payload.
Version 4.6.0
Feature Additions
The following properties have been added to PipelineProperties:
PipelineMode
ExpirationTime
Datasets
The following properties have been added to PipelineRuntimeInfo:
PipelineState
Added new StorageFormat type JsonFormat type to define datasets whose data is in JSON format.
Version 4.5.0
Feature Additions
Added list operations for activity window.
Added methods to retrieve activity windows with filters based on the entity types (that is, data factories,
datasets, pipelines, and activities).
The following linked service types have been added:
ODataLinkedService, WebLinkedService
The following dataset types have been added:
ODataResourceDataset, WebTableDataset
The following copy source types have been added:
WebSource
Version 4.4.0
Feature additions
The following linked service type has been added as data sources and sinks for copy activities:
AzureStorageSasLinkedService. See Azure Storage SAS Linked Service for conceptual information and
examples.
Version 4.3.0
Feature additions
The following linked service types have been added as data sources for copy activities:
HdfsLinkedService. See Move data from HDFS using Data Factory for conceptual information and
examples.
OnPremisesOdbcLinkedService. See Move data From ODBC data stores using Azure Data Factory for
conceptual information and examples.
Version 4.2.0
Feature additions
The following new activity type has been added: AzureMLUpdateResourceActivity. For details about the activity,
see Updating Azure ML models using the Update Resource Activity.
A new optional property updateResourceEndpoint has been added to the AzureMLLinkedService class.
LongRunningOperationInitialTimeout and LongRunningOperationRetryTimeout properties have been added to
the DataFactoryManagementClient class.
Allow configuration of the timeouts for client calls to the Data Factory service.
Version 4.1.0
Feature additions
The following linked service types have been added:
AzureDataLakeStoreLinkedService
AzureDataLakeAnalyticsLinkedService
The following activity types have been added:
DataLakeAnalyticsUSQLActivity
The following dataset types have been added:
AzureDataLakeStoreDataset
The following source and sink types for Copy Activity have been added:
AzureDataLakeStoreSource
AzureDataLakeStoreSink
Version 4.0.1
Breaking changes
The following classes have been renamed. The new names were the original names of classes before 4.0.0 release.
NAME IN 4.0.0 NAME IN 4.0.1
AzureSqlDataWarehouseDataset AzureSqlDataWarehouseTableDataset
AzureSqlDataset AzureSqlTableDataset
AzureDataset AzureTableDataset
OracleDataset OracleTableDataset
RelationalDataset RelationalTableDataset
SqlServerDataset SqlServerTableDataset
Version 4.0.0
Breaking changes
The following classes/interfaces have been renamed.
ITableOperations IDatasetOperations
Table Dataset
TableProperties DatasetProperties
TableTypeProprerties DatasetTypeProperties
TableCreateOrUpdateParameters DatasetCreateOrUpdateParameters
TableCreateOrUpdateResponse DatasetCreateOrUpdateResponse
TableGetResponse DatasetGetResponse
TableListResponse DatasetListResponse
CreateOrUpdateWithRawJsonContentParameters DatasetCreateOrUpdateWithRawJsonContentParameters
The List methods return paged results now. If the response contains a non-empty NextLink property, the
client application needs to continue fetching the next page until all pages are returned. Here is an example:
nextLink = nextResponse.NextLink;
}
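The snippet above shows only the tail of that loop. A fuller sketch of the paging pattern, assuming an authenticated client and the Pipelines list operations of this SDK, might look like the following:
// Assumes an authenticated DataFactoryManagementClient named 'client'.
var pipelines = new List<Pipeline>();

PipelineListResponse response = client.Pipelines.List(resourceGroupName, dataFactoryName);
pipelines.AddRange(response.Pipelines);

string nextLink = response.NextLink;
while (!string.IsNullOrEmpty(nextLink))
{
    // Keep fetching pages until NextLink comes back empty.
    PipelineListResponse nextResponse = client.Pipelines.ListNext(nextLink);
    pipelines.AddRange(nextResponse.Pipelines);
    nextLink = nextResponse.NextLink;
}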
List pipeline API returns only the summary of a pipeline instead of full details. For instance, activities in a
pipeline summary only contain name and type.
Feature additions
The SqlDWSink class supports two new properties, SliceIdentifierColumnName and
SqlWriterCleanupScript, to support idempotent copy to Azure SQL Data Warehouse. See the Azure SQL Data
Warehouse article for details about these properties.
We now support running stored procedure against Azure SQL Database and Azure SQL Data Warehouse
sources as part of the Copy Activity. The SqlSource and SqlDWSource classes have the following properties:
SqlReaderStoredProcedureName and StoredProcedureParameters. See the Azure SQL Database and
Azure SQL Data Warehouse articles on Azure.com for details about these properties.
Monitor and manage Azure Data Factory pipelines
by using the Monitoring and Management app
6/27/2017 10 min to read Edit Online
This article describes how to use the Monitoring and Management app to monitor, manage, and debug your
Data Factory pipelines. It also provides information on how to create alerts to get notified on failures. You can
get started with using the application by watching the following video:
NOTE
The user interface shown in the video may not exactly match what you see in the portal. It's slightly older, but concepts
remain the same.
NOTE
If you see that the web browser is stuck at "Authorizing...", clear the Block third-party cookies and site data check
box--or keep it selected, create an exception for login.microsoftonline.com, and then try to open the app again.
In the Activity Windows list in the middle pane, you see an activity window for each run of an activity. For
example, if you have the activity scheduled to run hourly for five hours, you see five activity windows
associated with five data slices. If you don't see activity windows in the list at the bottom, do the following:
Update the start time and end time filters at the top to match the start and end times of your pipeline,
and then click the Apply button.
The Activity Windows list is not automatically refreshed. Click the Refresh button on the toolbar in the
Activity Windows list.
If you don't have a Data Factory application to test these steps with, do the tutorial: copy data from Blob
Storage to SQL Database using Data Factory.
See the Scheduling and Execution article for detailed conceptual information about activity windows.
Diagram View
The Diagram View of a data factory provides a single pane of glass to monitor and manage a data factory and
its assets. When you select a Data Factory entity (dataset/pipeline) in the Diagram View:
The data factory entity is selected in the tree view.
The associated activity windows are highlighted in the Activity Windows list.
The properties of the selected object are shown in the Properties window.
When the pipeline is enabled (not in a paused state), it's shown with a green line:
You can pause, resume, or terminate a pipeline by selecting it in the diagram view and using the buttons on
the command bar.
There are three command bar buttons for the pipeline in the Diagram View. You can use the second button to
pause the pipeline. Pausing doesn't terminate the currently running activities and lets them proceed to
completion. The third button pauses the pipeline and terminates its existing executing activities. The first
button resumes the pipeline. When your pipeline is paused, the color of the pipeline changes. For example, a
paused pipeline looks like in the following image:
You can multi-select two or more pipelines by using the Ctrl key. You can use the command bar buttons to
pause/resume multiple pipelines at a time.
You can also right-click a pipeline and select options to suspend, resume, or terminate a pipeline.
Click the Open pipeline option to see all the activities in the pipeline.
In the opened pipeline view, you see all activities in the pipeline. In this example, there is only one activity:
Copy Activity.
To go back to the previous view, click the data factory name in the breadcrumb menu at the top.
In the pipeline view, when you select an output dataset or when you move your mouse over the output
dataset, you see the Activity Windows pop-up window for that dataset.
You can click an activity window to see details for it in the Properties window in the right pane.
In the right pane, switch to the Activity Window Explorer tab to see more details.
You also see resolved variables for each run attempt for an activity in the Attempts section.
Switch to the Script tab to see the JSON script definition for the selected object.
You can see activity windows in three places:
The Activity Windows pop-up in the Diagram View (middle pane).
The Activity Window Explorer in the right pane.
The Activity Windows list in the bottom pane.
In the Activity Windows pop-up and Activity Window Explorer, you can scroll to the previous week and the
next week by using the left and right arrows.
At the bottom of the Diagram View, you see these buttons: Zoom In, Zoom Out, Zoom to Fit, Zoom 100%,
Lock layout. The Lock layout button prevents you from accidentally moving tables and pipelines in the
Diagram View. It's on by default. You can turn it off and move entities around in the diagram. When you turn
it off, you can use the last button to automatically position tables and pipelines. You can also zoom in or out
by using the mouse wheel.
This list doesn't refresh automatically, so use the refresh button on the toolbar to manually refresh it.
Activity windows can be in one of the following statuses:
When you click an activity window in the list, you see details about it in the Activity Windows Explorer or
the Properties window on the right.
It displays properties for the item that you selected in the Resource Explorer (tree view), Diagram View, or
Activity Windows list.
Activity Window Explorer
The Activity Window Explorer window is in the right-most pane of the Monitoring and Management app. It
displays details about the activity window that you selected in the Activity Windows pop-up window or the
Activity Windows list.
You can switch to another activity window by clicking it in the calendar view at the top. You can also use the
left arrow/right arrow buttons at the top to see activity windows from the previous week or the next week.
You can use the toolbar buttons in the bottom pane to rerun the activity window or refresh the details in the
pane.
Script
You can use the Script tab to view the JSON definition of the selected Data Factory entity (linked service,
dataset, or pipeline).
Use system views
The Monitoring and Management app includes pre-built system views (Recent activity windows, Failed
activity windows, In-Progress activity windows) that allow you to view recent/failed/in-progress activity
windows for your data factory.
Switch to the Monitoring Views tab on the left by clicking it.
Currently, there are three system views that are supported. Select an option to see recent activity windows,
failed activity windows, or in-progress activity windows in the Activity Windows list (at the bottom of the
middle pane).
When you select the Recent activity windows option, you see all recent activity windows in descending
order of the last attempt time.
You can use the Failed activity windows view to see all failed activity windows in the list. Select a failed
activity window in the list to see details about it in the Properties window or the Activity Window Explorer.
You can also download any logs for a failed activity window.
NOTE
Currently, all times are in UTC format in the Monitoring and Management app.
In the Activity Windows list, click the name of a column (for example: Status).
You can use the same pop-up window to clear filters. To clear all filters for the Activity Windows list, click the
clear filter button on the command bar.
Perform batch actions
Rerun selected activity windows
Select an activity window, click the down arrow for the first command bar button, and select Rerun / Rerun
with upstream in pipeline. When you select the Rerun with upstream in pipeline option, it reruns all
upstream activity windows as well.
You can also select multiple activity windows in the list and rerun them at the same time. You might want to
filter activity windows based on the status (for example: Failed)--and then rerun the failed activity windows
after correcting the issue that causes the activity windows to fail. See the following section for details about
filtering activity windows in the list.
Pause/resume multiple pipelines
You can multiselect two or more pipelines by using the Ctrl key. You can use the command bar buttons
(which are highlighted in the red rectangle in the following image) to pause/resume them.
Create alerts
The Alerts page lets you create an alert and view/edit/delete existing alerts. You can also disable/enable an
alert. To see the Alerts page, click the Alerts tab.
To create an alert
1. Click Add Alert to add an alert. You see the Details page.
2. Specify the Name and Description for the alert, and click Next. You should see the Filters page.
3. Select the event, status, and substatus (optional) that you want to create a Data Factory service alert
for, and click Next. You should see the Recipients page.
4. Select the Email subscription admins option and/or enter an additional administrator email, and
click Finish. You should see the alert in the list.
In the Alerts list, use the buttons that are associated with the alert to edit/delete/disable/enable an alert.
Event/status/substatus
The following table provides the list of available events and statuses (and substatuses).
Failed Execution
Timed Out
Failed Validation
Abandoned
IMPORTANT
The monitoring & management application provides better support for monitoring and managing your data
pipelines, and for troubleshooting any issues. For details about using the application, see monitor and manage Data
Factory pipelines by using the Monitoring and Management app.
This article describes how to monitor, manage, and debug your pipelines by using Azure portal and
PowerShell. The article also provides information on how to create alerts and get notified about failures.
You should see the home page for the data factory.
Diagram view of your data factory
The Diagram view of a data factory provides a single pane of glass to monitor and manage the data factory
and its assets. To see the Diagram view of your data factory, click Diagram on the home page for the data
factory.
You can zoom in, zoom out, zoom to fit, zoom to 100%, lock the layout of the diagram, and automatically
position pipelines and datasets. You can also see the data lineage information (that is, show upstream and
downstream items of selected items).
Activities inside a pipeline
1. Right-click the pipeline, and then click Open pipeline to see all activities in the pipeline, along with
input and output datasets for the activities. This feature is useful when your pipeline includes more than
one activity and you want to understand the operational lineage of a single pipeline.
2. In the following example, you see a copy activity in the pipeline with an input and an output.
3. You can navigate back to the home page of the data factory by clicking the Data factory link in the
breadcrumb at the top-left corner.
You can view the details about a slice by clicking a slice entry on the Recently Updated Slices blade.
If the slice has been executed multiple times, you see multiple rows in the Activity runs list. You can view
details about an activity run by clicking the run entry in the Activity runs list. The list shows all the log files,
along with an error message if there is one. This feature is useful to view and debug logs without having to
leave your data factory.
If the slice isn't in the Ready state, you can see the upstream slices that aren't ready and are blocking the
current slice from executing in the Upstream slices that are not ready list. This feature is useful when your
slice is in Waiting state and you want to understand the upstream dependencies that the slice is waiting on.
Dataset state diagram
After you deploy a data factory and the pipelines have a valid active period, the dataset slices transition from
one state to another. Currently, the slice status follows the following state diagram:
The dataset state transition flow in data factory is the following: Waiting -> In-Progress/In-Progress
(Validating) -> Ready/Failed.
The slice starts in a Waiting state, waiting for preconditions to be met before it executes. Then, the activity
starts executing, and the slice goes into an In-Progress state. The activity execution might succeed or fail. The
slice is marked as Ready or Failed, based on the result of the execution.
You can reset the slice to go back from the Ready or Failed state to the Waiting state. You can also set the
slice state to Skip, which prevents the activity from executing and the slice from being processed.
NOTE
The diagram view does not support pausing and resuming pipelines. If you want to use a user interface, use the
monitoring and managing application. For details about using the application, see the monitor and manage Data Factory
pipelines by using the Monitoring and Management app article.
Debug pipelines
Azure Data Factory provides rich capabilities for you to debug and troubleshoot pipelines by using the Azure
portal and Azure PowerShell.
NOTE
It is much easier to troubleshoot errors by using the Monitoring & Management app. For details
about using the application, see the monitor and manage Data Factory pipelines by using the Monitoring and
Management app article.
To get details about a failed slice, run the Get-AzureRmDataFactoryRun cmdlet. For example:
Get-AzureRmDataFactoryRun -ResourceGroupName ADF -DataFactoryName LogProcessingFactory -DatasetName
EnrichedGameEventsTable -StartDateTime "5/5/2014 12:00:00 AM"
The value of StartDateTime is the start time for the error/problem slice that you noted from the
previous step. The date-time should be enclosed in double quotes.
4. You should see output with details about the error that is similar to the following:
Id : 841b77c9-d56c-48d1-99a3-8c16c3e77d39
ResourceGroupName : ADF
DataFactoryName : LogProcessingFactory3
DatasetName : EnrichedGameEventsTable
ProcessingStartTime : 10/10/2014 3:04:52 AM
ProcessingEndTime : 10/10/2014 3:06:49 AM
PercentComplete : 0
DataSliceStart : 5/5/2014 12:00:00 AM
DataSliceEnd : 5/6/2014 12:00:00 AM
Status : FailedExecution
Timestamp : 10/10/2014 3:04:52 AM
RetryAttempt : 0
Properties : {}
ErrorMessage : Pig script failed with exit code '5'. See wasb://
[email protected]/PigQuery
Jobs/841b77c9-d56c-48d1-99a3-
8c16c3e77d39/10_10_2014_03_04_53_277/Status/stderr' for
more details.
ActivityName : PigEnrichLogs
PipelineName : EnrichGameLogsPipeline
Type :
5. You can run the Save-AzureRmDataFactoryLog cmdlet with the Id value that you see in the
output, and download the log files by using the -DownloadLogs option for the cmdlet.
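For example, a minimal sketch that downloads the logs for the failed run shown above (the -Output folder path is a placeholder you choose, and the parameter set is assumed from the AzureRM.DataFactories module):
# Download all log files for the activity run with the given Id to a local folder.
Save-AzureRmDataFactoryLog -ResourceGroupName "ADF" -DataFactoryName "LogProcessingFactory" -Id "841b77c9-d56c-48d1-99a3-8c16c3e77d39" -DownloadLogs -Output "C:\adflogs"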
Create alerts
Azure logs user events when an Azure resource (for example, a data factory) is created, updated, or deleted.
You can create alerts on these events. You can use Data Factory to capture various metrics and create alerts on
metrics. We recommend that you use events for real-time monitoring and use metrics for historical purposes.
Alerts on events
Azure events provide useful insights into what is happening in your Azure resources. When you're using
Azure Data Factory, events are generated when:
A data factory is created, updated, or deleted.
Data processing (as "runs") has started or completed.
An on-demand HDInsight cluster is created or removed.
You can create alerts on these user events and configure them to send email notifications to the administrator
and coadministrators of the subscription. In addition, you can specify additional email addresses of users who
need to receive email notifications when the conditions are met. This feature is useful when you want to be
notified about failures and don't want to continuously monitor your data factory.
NOTE
Currently, the portal doesn't show alerts on events. Use the Monitoring and Management app to see all alerts.
You can remove subStatus from the JSON definition if you don't want to be alerted on a specific failure.
This example sets up the alert for all data factories in your subscription. If you want the alert to be set up for a
particular data factory, you can specify data factory resourceUri in the dataSource:
"resourceUri" :
"/SUBSCRIPTIONS/<subscriptionId>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICROSOFT.DATAFACTORY/DATAFA
CTORIES/<dataFactoryName>"
The following list shows the available operations and statuses (and substatuses):
Succeeded
FailedExecution
TimedOut
FailedValidation
Abandoned
OnDemandClusterCreateStarted - Started
OnDemandClusterCreateSuccessful - Succeeded
OnDemandClusterDeleted - Succeeded
See Create Alert Rule for details about the JSON elements that are used in the example.
Deploy the alert
To deploy the alert, use the Azure PowerShell cmdlet New-AzureRmResourceGroupDeployment, as shown
in the following example:
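A minimal sketch (the deployment name and resource group match the output shown below; the template file name ADFAlertFailedSlice.json is a placeholder for wherever you saved the alert JSON definition):
New-AzureRmResourceGroupDeployment -Name ADFAlertFailedSlice -ResourceGroupName adf -TemplateFile .\ADFAlertFailedSlice.json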
After the resource group deployment has finished successfully, you see the following messages:
DeploymentName : ADFAlertFailedSlice
ResourceGroupName : adf
ProvisioningState : Succeeded
Timestamp : 10/11/2014 2:01:00 AM
Mode : Incremental
TemplateLink :
Parameters :
Outputs :
NOTE
You can use the Create Alert Rule REST API to create an alert rule. The JSON payload is similar to the JSON example.
See Azure Insight cmdlets for PowerShell cmdlets that you can use to add, get, or remove alerts. Here are a
few examples of using the Get-AlertRule cmdlet:
Properties : Microsoft.Azure.Management.Insights.Models.Rule
Tags : {[$type, Microsoft.WindowsAzure.Management.Common.Storage.CasePreservedDictionary,
Microsoft.WindowsAzure.Management.Common.Storage]}
Id : /subscriptions/<subscription id>/resourceGroups/<resource group
name>/providers/microsoft.insights/alertrules/FailedExecutionRunsWest0
Location : West US
Name : FailedExecutionRunsWest0
Properties : Microsoft.Azure.Management.Insights.Models.Rule
Tags : {[$type, Microsoft.WindowsAzure.Management.Common.Storage.CasePreservedDictionary,
Microsoft.WindowsAzure.Management.Common.Storage]}
Id : /subscriptions/<subscription id>/resourceGroups/<resource group
name>/providers/microsoft.insights/alertrules/FailedExecutionRunsWest3
Location : West US
Name : FailedExecutionRunsWest3
Properties : Microsoft.Azure.Management.Insights.Models.Rule
Tags : {[$type, Microsoft.WindowsAzure.Management.Common.Storage.CasePreservedDictionary,
Microsoft.WindowsAzure.Management.Common.Storage]}
Id : /subscriptions/<subscription id>/resourceGroups/<resource group
name>/providers/microsoft.insights/alertrules/FailedExecutionRunsWest0
Location : West US
Name : FailedExecutionRunsWest0
Run the following get-help commands to see details and examples for the Get-AlertRule cmdlet.
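For example (standard PowerShell help switches):
Get-Help Get-AlertRule -Detailed
Get-Help Get-AlertRule -Examples
Get-Help Get-AlertRule -Full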
If you see the alert generation events on the portal blade but you don't receive email notifications, check
whether the email address that is specified is set to receive emails from external senders. The alert emails
might have been blocked by your email settings.
Alerts on metrics
In Data Factory, you can capture various metrics and create alerts on metrics. You can monitor and create
alerts on the following metrics for the slices in your data factory:
Failed Runs
Successful Runs
These metrics help you get an overview of the failed and successful runs in the data factory. Metrics are
emitted every time there is a slice run. At the beginning of the hour, these metrics are aggregated and pushed
to your storage account. To enable metrics, set up a storage account.
Enable metrics
To enable metrics, click the following from the Data Factory blade:
Monitoring > Metric > Diagnostic settings > Diagnostics
On the Diagnostics blade, click On, select the storage account, and click Save.
It might take up to one hour for the metrics to be visible on the Monitoring blade because metrics
aggregation happens hourly.
Set up an alert on metrics
Click the Data Factory metrics tile:
On the Alerts rules blade, you see any existing alerts. To add an alert, click Add alert on the toolbar.
Alert notifications
After the alert rule's condition is met, you should get an email that says the alert is activated. After the
issue is resolved and the alert condition is no longer met, you get an email that says the alert is resolved.
This behavior is different from events, where a notification is sent for every failure that qualifies for an
alert rule.
Deploy alerts by using PowerShell
You can deploy alerts for metrics the same way that you do for events.
Alert definition
{
"contentVersion" : "1.0.0.0",
"$schema" : "http://schema.management.azure.com/schemas/2014-04-01-preview/deploymentTemplate.json#",
"parameters" : {},
"resources" : [
{
"name" : "FailedRunsGreaterThan5",
"type" : "microsoft.insights/alertrules",
"apiVersion" : "2014-04-01",
"location" : "East US",
"properties" : {
"name" : "FailedRunsGreaterThan5",
"description" : "Failed Runs greater than 5",
"isEnabled" : true,
"condition" : {
"$type" :
"Microsoft.WindowsAzure.Management.Monitoring.Alerts.Models.ThresholdRuleCondition,
Microsoft.WindowsAzure.Management.Mon.Client",
"odata.type" : "Microsoft.Azure.Management.Insights.Models.ThresholdRuleCondition",
"dataSource" : {
"$type" :
"Microsoft.WindowsAzure.Management.Monitoring.Alerts.Models.RuleMetricDataSource,
Microsoft.WindowsAzure.Management.Mon.Client",
"odata.type" : "Microsoft.Azure.Management.Insights.Models.RuleMetricDataSource",
"resourceUri" : "/SUBSCRIPTIONS/<subscriptionId>/RESOURCEGROUPS/<resourceGroupName
>/PROVIDERS/MICROSOFT.DATAFACTORY/DATAFACTORIES/<dataFactoryName>",
"metricName" : "FailedRuns"
},
"threshold" : 5.0,
"windowSize" : "PT3H",
"timeAggregation" : "Total"
},
"action" : {
"$type" : "Microsoft.WindowsAzure.Management.Monitoring.Alerts.Models.RuleEmailAction,
Microsoft.WindowsAzure.Management.Mon.Client",
"odata.type" : "Microsoft.Azure.Management.Insights.Models.RuleEmailAction",
"customEmails" : ["[email protected]"]
}
}
}
]
}
Replace subscriptionId, resourceGroupName, and dataFactoryName in the sample with appropriate values.
metricName currently supports two values:
FailedRuns
SuccessfulRuns
Deploy the alert
To deploy the alert, use the Azure PowerShell cmdlet New-AzureRmResourceGroupDeployment, as shown
in the following example:
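A minimal sketch (the deployment name and resource group match the output shown below; the template file name FailedRunsGreaterThan5.json is a placeholder for wherever you saved the alert JSON definition):
New-AzureRmResourceGroupDeployment -Name FailedRunsGreaterThan5 -ResourceGroupName adf -TemplateFile .\FailedRunsGreaterThan5.json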
DeploymentName : FailedRunsGreaterThan5
ResourceGroupName : adf
ProvisioningState : Succeeded
Timestamp : 7/27/2015 7:52:56 PM
Mode : Incremental
TemplateLink :
Parameters :
Outputs
You can also use the Add-AlertRule cmdlet to deploy an alert rule. See the Add-AlertRule topic for details and
examples.
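You can also view the cmdlet help from PowerShell itself, for example:
Get-Help Add-AlertRule -Examples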
You can also move any related resources (such as alerts that are associated with the data factory), along with
the data factory.
Create, monitor, and manage Azure data factories
using Azure Data Factory .NET SDK
8/4/2017 9 min to read Edit Online
Overview
You can create, monitor, and manage Azure data factories programmatically using Data Factory .NET SDK. This
article contains a walkthrough that you can follow to create a sample .NET console application that creates and
monitors a data factory.
NOTE
This article does not cover all the Data Factory .NET API. See Data Factory .NET API Reference for comprehensive
documentation on .NET API for Data Factory.
Prerequisites
Visual Studio 2012, 2013, or 2015
Download and install Azure .NET SDK.
Azure PowerShell. Follow the instructions in the How to install and configure Azure PowerShell article to install Azure
PowerShell on your computer. You use Azure PowerShell to create an Azure Active Directory application.
Create an application in Azure Active Directory
Create an Azure Active Directory application, create a service principal for the application, and assign it to the Data
Factory Contributor role.
1. Launch PowerShell.
2. Run the following command and enter the user name and password that you use to sign in to the Azure
portal.
Login-AzureRmAccount
3. Run the following command to view all the subscriptions for this account.
Get-AzureRmSubscription
4. Run the following command to select the subscription that you want to work with. Replace
<NameOfAzureSubscription> with the name of your Azure subscription.
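A minimal sketch, assuming the Select-AzureRmSubscription cmdlet from the AzureRM module:
Select-AzureRmSubscription -SubscriptionName "<NameOfAzureSubscription>"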
IMPORTANT
Note down SubscriptionId and TenantId from the output of this command.
5. Create an Azure resource group named ADFTutorialResourceGroup by running the following command
in the PowerShell.
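A minimal sketch (the West US location is an assumption; use any region that works for you):
New-AzureRmResourceGroup -Name "ADFTutorialResourceGroup" -Location "West US"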
If the resource group already exists, you specify whether to update it (Y) or keep it as it is (N).
If you use a different resource group, you need to use the name of your resource group in place of
ADFTutorialResourceGroup in this tutorial.
6. Create an Azure Active Directory application.
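A minimal sketch, assuming the New-AzureRmADApplication cmdlet from the AzureRM module; the display name, URLs, and password below are placeholder values that you can change:
$azureAdApplication = New-AzureRmADApplication -DisplayName "ADFDotNetWalkthroughApp" -HomePage "https://www.contoso.org/exampleapp" -IdentifierUris "https://www.adfdotnetwalkthroughapp.org/example" -Password "<your password>"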
If you get the following error, specify a different URL and run the command again.
Another object with the same value for property identifierUris already exists.
To view the details of the new application (including the ApplicationId value that you later use for <Application ID> in App.config), run:
$azureAdApplication
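To finish the setup that this section describes (create a service principal for the application and assign it to the Data Factory Contributor role), a sketch along the following lines should work; the cmdlets are from the AzureRM module, and the values come from the $azureAdApplication object created above:
# Create a service principal for the Azure AD application.
New-AzureRmADServicePrincipal -ApplicationId $azureAdApplication.ApplicationId

# Assign the service principal to the Data Factory Contributor role.
New-AzureRmRoleAssignment -RoleDefinitionName "Data Factory Contributor" -ServicePrincipalName $azureAdApplication.ApplicationId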
Walkthrough
In the walkthrough, you create a data factory with a pipeline that contains a copy activity. The copy activity copies
data from a folder in your Azure blob storage to another folder in the same blob storage.
The Copy Activity performs the data movement in Azure Data Factory. The activity is powered by a globally
available service that can copy data between various data stores in a secure, reliable, and scalable way. See Data
Movement Activities article for details about the Copy Activity.
1. Using Visual Studio 2012/2013/2015, create a C# .NET console application.
a. Launch Visual Studio 2012/2013/2015.
b. Click File, point to New, and click Project.
c. Expand Templates, and select Visual C#. In this walkthrough, you use C#, but you can use any .NET
language.
d. Select Console Application from the list of project types on the right.
e. Enter DataFactoryAPITestApp for the Name.
f. Select C:\ADFGetStarted for the Location.
g. Click OK to create the project.
2. Click Tools, point to NuGet Package Manager, and click Package Manager Console.
3. In the Package Manager Console, do the following steps:
a. Run the following command to install Data Factory package:
Install-Package Microsoft.Azure.Management.DataFactories
b. Run the following command to install Azure Active Directory package (you use Active Directory API in the
code):
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory -Version 2.19.208020213
4. Replace the contents of App.config file in the project with the following content:
5. In the App.Config file, update values for <Application ID>, <Password>, <Subscription ID>, and <tenant
ID> with your own values.
6. Add the following using statements to the Program.cs file in the project.
using System.Configuration;
using System.Collections.ObjectModel;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure;
using Microsoft.Azure.Management.DataFactories;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Common.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
7. Add the following code that creates an instance of DataPipelineManagementClient class to the Main
method. You use this object to create a data factory, a linked service, input and output datasets, and a
pipeline. You also use this object to monitor slices of a dataset at runtime.
// create data factory management client
IMPORTANT
Replace the value of resourceGroupName with the name of your Azure resource group. You can create a resource
group using the New-AzureResourceGroup cmdlet.
Update the name of the data factory (dataFactoryName) to be unique. The name of the data factory must be globally unique.
See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
8. Add the following code that creates a data factory to the Main method.
9. Add the following code that creates an Azure Storage linked service to the Main method.
IMPORTANT
Replace storageaccountname and accountkey with name and key of your Azure Storage account.
// create a linked service for input data store: Azure Storage
Console.WriteLine("Creating Azure Storage linked service");
client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName,
new LinkedServiceCreateOrUpdateParameters()
{
LinkedService = new LinkedService()
{
Name = "AzureStorageLinkedService",
Properties = new LinkedServiceProperties
(
new AzureStorageLinkedService("DefaultEndpointsProtocol=https;AccountName=
<storageaccountname>;AccountKey=<accountkey>")
)
}
}
);
10. Add the following code that creates input and output datasets to the Main method.
The FolderPath for the input blob is set to adftutorial/ where adftutorial is the name of the container in
your blob storage. If this container does not exist in your Azure blob storage, create a container with this
name: adftutorial and upload a text file to the container.
The FolderPath for the output blob is set to: adftutorial/apifactoryoutput/{Slice} where Slice is
dynamically calculated based on the value of SliceStart (start date-time of each slice.)
client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName,
new DatasetCreateOrUpdateParameters()
{
Dataset = new Dataset()
{
Name = Dataset_Source,
Properties = new DatasetProperties()
{
LinkedServiceName = "AzureStorageLinkedService",
TypeProperties = new AzureBlobDataset()
{
FolderPath = "adftutorial/",
FileName = "emp.txt"
},
External = true,
Availability = new Availability()
{
Frequency = SchedulePeriod.Hour,
Interval = 1,
},
client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName,
new DatasetCreateOrUpdateParameters()
{
Dataset = new Dataset()
{
Name = Dataset_Destination,
Properties = new DatasetProperties()
{
LinkedServiceName = "AzureStorageLinkedService",
TypeProperties = new AzureBlobDataset()
{
FolderPath = "adftutorial/apifactoryoutput/{Slice}",
PartitionedBy = new Collection<Partition>()
{
new Partition()
{
Name = "Slice",
Value = new DateTimePartitionValue()
{
Date = "SliceStart",
Format = "yyyyMMdd-HH"
}
}
}
},
11. Add the following code that creates and activates a pipeline to the Main method. This pipeline has a
CopyActivity that takes BlobSource as a source and BlobSink as a sink.
The Copy Activity performs the data movement in Azure Data Factory. The activity is powered by a globally
available service that can copy data between various data stores in a secure, reliable, and scalable way. See
Data Movement Activities article for details about the Copy Activity.
// create a pipeline
Console.WriteLine("Creating a pipeline");
DateTime PipelineActivePeriodStartTime = new DateTime(2014, 8, 9, 0, 0, 0, 0, DateTimeKind.Utc);
DateTime PipelineActivePeriodEndTime = PipelineActivePeriodStartTime.AddMinutes(60);
string PipelineName = "PipelineBlobSample";
client.Pipelines.CreateOrUpdate(resourceGroupName, dataFactoryName,
new PipelineCreateOrUpdateParameters()
{
Pipeline = new Pipeline()
{
Name = PipelineName,
Properties = new PipelineProperties()
{
Description = "Demo Pipeline for data transfer between blobs",
// Initial value for pipeline's active period. With this, you won't need to set slice status
Start = PipelineActivePeriodStartTime,
End = PipelineActivePeriodEndTime,
},
}
}
});
12. Add the following code to the Main method to get the status of a data slice of the output dataset. There is
only one slice expected in this sample.
// Pulling status within a timeout threshold
DateTime start = DateTime.Now;
bool done = false;
13. (optional) Add the following code to get run details for a data slice to the Main method.
14. Add the following helper method used by the Main method to the Program class. This method pops up a
dialog box that lets you provide the user name and password that you use to log in to the Azure portal.
if (result != null)
return result.AccessToken;
15. In the Solution Explorer, expand the project: DataFactoryAPITestApp, right-click References, and click
Add Reference. Select check box for System.Configuration assembly and click OK.
16. Build the console application. Click Build on the menu and click Build Solution.
17. Confirm that there is at least one file in the adftutorial container in your Azure blob storage. If not, create
Emp.txt file in Notepad with the following content and upload it to the adftutorial container.
John, Doe
Jane, Doe
18. Run the sample by clicking Debug -> Start Debugging on the menu. When you see the Getting run details
of a data slice message, wait for a few minutes, and then press ENTER.
19. Use the Azure portal to verify that the data factory APITutorialFactory is created with the following artifacts:
Linked service: AzureStorageLinkedService
Dataset: DatasetBlobSource and DatasetBlobDestination.
Pipeline: PipelineBlobSample
20. Verify that an output file is created in the apifactoryoutput folder in the adftutorial container.
if (response.NextLink != null)
{
response = dataFactoryManagementClient.ActivityWindows.ListNext(response.NextLink, parameters);
}
else
{
response = null;
}
}
while (response != null);
Next steps
See the following example for creating a pipeline using .NET SDK that copies data from an Azure blob storage to an
Azure SQL database:
Create a pipeline to copy data from Blob Storage to SQL Database
Troubleshoot Data Factory issues
8/15/2017 4 min to read Edit Online
This article provides troubleshooting tips for issues that you might encounter when you use Azure Data Factory. It does not
list all the possible issues, but it covers some common issues and general troubleshooting tips.
Troubleshooting tips
Error: The subscription is not registered to use namespace 'Microsoft.DataFactory'
If you receive this error, the Azure Data Factory resource provider has not been registered for your Azure subscription. Do the
following:
1. Launch Azure PowerShell.
2. Log in to your Azure account using the following command.
Login-AzureRmAccount
3. Run the following command to register the Azure Data Factory provider.
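For example, the following command registers the provider for the current subscription:
Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory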
Problem: An input dataset is not marked as external
The following example shows an input dataset JSON definition that includes the external property:
{
"name": "CustomerTable",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "MyLinkedService",
"typeProperties": {
"folderPath": "MyContainer/MySubFolder/",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": ";"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
}
}
}
}
To resolve the error, add the external property and the optional externalData section to the JSON definition of
the input table and recreate the table.
Problem: Hybrid copy operation fails
See Troubleshoot gateway issues for steps to troubleshoot issues with copying to/from an on-premises data store
using the Data Management Gateway.
Problem: On-demand HDInsight provisioning fails
When you use a linked service of type HDInsightOnDemand, you need to specify a linkedServiceName that points to
an Azure Blob storage account. The Data Factory service uses this storage to store logs and supporting files for your
on-demand HDInsight cluster. Sometimes provisioning of an on-demand HDInsight cluster fails with the following error:
Failed to create cluster. Exception: Unable to complete the cluster create operation. Operation failed with
code '400'. Cluster left behind state: 'Error'. Message: 'StorageAccountNotColocated'.
This error usually indicates that the location of the storage account specified in the linkedServiceName is not in the
same data center location where the HDInsight provisioning is happening. Example: if your data factory is in West
US and the Azure storage is in East US, the on-demand provisioning fails in West US.
Additionally, there is a second JSON property additionalLinkedServiceNames where additional storage accounts
may be specified in on-demand HDInsight. Those additional linked storage accounts should be in the same location
as the HDInsight cluster, or it fails with the same error.
Problem: Custom .NET activity fails
See Debug a pipeline with custom activity for detailed steps.
This article provides information on troubleshooting issues with using Data Management Gateway.
NOTE
See the Data Management Gateway article for detailed information about the gateway. See the Move data between on-premises
and cloud article for a walkthrough of moving data from an on-premises SQL Server database to Microsoft Azure Blob storage by
using the gateway.
Cause
The machine on which you are trying to install the gateway has failed to download the latest gateway installation file
from the download center due to a network issue.
Resolution
Check your firewall proxy server settings to see whether the settings block the network connection from the computer to
the download center, and update the settings accordingly.
Alternatively, you can download the installation file for the latest gateway from the download center on other machines
that can access the download center. You can then copy the installer file to the gateway host computer and run it
manually to install and update the gateway.
2. Problem
You see this error when you're attempting to install a gateway by clicking install directly on this computer in the
Azure portal.
Error: Abort installing a new gateway on this computer because this computer has an existing installed gateway and a
computer without any installed gateway is required for installing a new gateway.
Cause
A gateway is already installed on the machine.
Resolution
Uninstall the existing gateway on the machine and click the install directly on this computer link again.
3. Problem
You might see this error when registering a new gateway.
Error: The gateway has encountered an error during registration.
Cause
You might see this message for one of the following reasons:
The format of the gateway key is invalid.
The gateway key has been invalidated.
The gateway key has been regenerated from the portal.
Resolution
Verify whether you are using the right gateway key from the portal. If needed, regenerate a key and use the key to
register the gateway.
4. Problem
You might see the following error message when you're registering a gateway.
Error: The content or format of the gateway key "{gatewayKey}" is invalid, please go to azure portal to create one
new gateway or regenerate the gateway key.
Cause
The content or format of the input gateway key is incorrect. One of the reasons can be that you copied only a portion of
the key from the portal or you're using an invalid key.
Resolution
Generate a gateway key in the portal, and use the copy button to copy the whole key. Then paste it in this window to
register the gateway.
5. Problem
You might see the following error message when you're registering a gateway.
Error: The gateway key is invalid or empty. Specify a valid gateway key from the portal.
Cause
The gateway key has been regenerated or the gateway has been deleted in the Azure portal. It can also happen if the Data
Management Gateway setup is not the latest version.
Resolution
Check whether the Data Management Gateway setup is the latest version; you can find the latest version on the Microsoft
download center.
If the setup is current and the gateway still exists in the portal, regenerate the gateway key in the Azure portal, use the
copy button to copy the whole key, and then paste it in this window to register the gateway. Otherwise, re-create the
gateway and start over.
6. Problem
You might see the following error message when you're registering a gateway.
Error: The gateway has been online for a while, and then it shows Gateway is not registered with the status Gateway key is
invalid.
Cause
This error might happen because either the gateway has been deleted or the associated gateway key has been
regenerated.
Resolution
If the gateway has been deleted, re-create the gateway from the portal, click Register, copy the key from the portal, paste
it, and try to register the gateway.
If the gateway still exists but its key has been regenerated, use the new key to register the gateway. If you don't have the
key, regenerate the key again from the portal.
7. Problem
When you're registering a gateway, you might need to enter path and password for a certificate.
Cause
The gateway has been registered on other machines before. During the initial registration of a gateway, an encryption
certificate has been associated with the gateway. The certificate can either be self-generated by the gateway or provided
by the user. This certificate is used to encrypt credentials of the data store (linked service).
When restoring the gateway on a different host machine, the registration wizard asks for this certificate to decrypt
credentials previously encrypted with this certificate. Without this certificate, the credentials cannot be decrypted by the
new gateway and subsequent copy activity executions associated with this new gateway will fail.
Resolution
If you have exported the credential certificate from the original gateway machine by using the Export button on the
Settings tab in Data Management Gateway Configuration Manager, use the certificate here.
You cannot skip this stage when recovering a gateway. If the certificate is missing, you need to delete the gateway from
the portal and create a new gateway. In addition, update all linked services that are related to the gateway by
reentering their credentials.
8. Problem
You might see the following error message.
Error: The remote server returned an error: (407) Proxy Authentication Required.
Cause
This error happens when your gateway is in an environment that requires an HTTP proxy to access Internet resources, or
your proxy's authentication password is changed but it's not updated accordingly in your gateway.
Resolution
Follow the instructions in the Proxy server considerations section of this article, and configure proxy settings with Data
Management Gateway Configuration Manager.
Cause
Gateway cannot connect to the cloud service through Service Bus.
Resolution
Follow these steps to get the gateway back online:
1. Allow IP address outbound rules on the gateway machine and the corporate firewall. You can find IP addresses from
the Windows Event Log (ID == 401): An attempt was made to access a socket in a way forbidden by its access
permissions XX.XX.XX.XX:9350.
2. Configure proxy settings on the gateway. See the Proxy server considerations section for details.
3. Enable outbound ports 5671 and 9350-9354 on both the Windows Firewall on the gateway machine and the
corporate firewall. See the Ports and firewall section for details. This step is optional, but we recommend it for
performance consideration.
3. Problem
You see the following error.
Error: Cloud service cannot connect to gateway through service bus.
Cause
A transient error in network connectivity.
Resolution
Follow these steps to get the gateway back online:
1. Wait for a couple of minutes; connectivity is recovered automatically when the error is gone.
2. If the error persists, restart the gateway service.
Failed to author linked service
Problem
You might see this error when you try to use Credential Manager in the portal to input credentials for a new linked
service, or update credentials for an existing linked service.
Error: The data store '<Server>/<Database>' cannot be reached. Check connection settings for the data source.
When you see this error, the settings page of Data Management Gateway Configuration Manager might look like the
following screenshot.
Cause
The SSL certificate might have been lost on the gateway machine. The gateway computer cannot load the certificate that is
currently used for SSL encryption. You might also see an error message in the event log that is similar to the
following message.
Unable to get the gateway settings from cloud service. Check the gateway key and the network connection. (Certificate
with thumbprint cannot be loaded.)
Resolution
Follow these steps to solve the problem:
1. Start Data Management Gateway Configuration Manager.
2. Switch to the Settings tab.
3. Click the Change button to change the SSL certificate.
4. Select a new certificate as the SSL certificate. You can use any SSL certificate that is generated by you or any
organization.
Cause
This can happen for different reasons, and mitigation varies accordingly.
Resolution
Allow outbound TCP connections over port TCP/1433 on the Data Management Gateway client side before connecting to
an SQL database.
If the target database is an Azure SQL database, check SQL Server firewall settings for Azure as well.
See the following section to test the connection to the on-premises data store.
Gateway logs
Send gateway logs to Microsoft
When you contact Microsoft Support to get help with troubleshooting gateway issues, you might be asked to share your
gateway logs. With the current release of the gateway, you can share the required gateway logs with two button clicks in Data
Management Gateway Configuration Manager.
1. Switch to the Diagnostics tab in Data Management Gateway Configuration Manager.
2. Click Send Logs to see the following dialog box.
7. Save the Report ID and share it with Microsoft Support. The report ID is used to locate the gateway logs that you
uploaded for troubleshooting. The report ID is also saved in the event viewer. You can find it by looking at the
event ID 25, and check the date and time.
Archive gateway logs on gateway host machine
There are some scenarios where you have gateway issues and you cannot share gateway logs directly:
You manually install the gateway and register the gateway.
You try to register the gateway with a regenerated key in Data Management Gateway Configuration Manager.
You try to send logs and the gateway host service cannot be connected.
For these scenarios, you can save gateway logs as a zip file and share it when you contact Microsoft support. For
example, you might receive an error while you register the gateway, as shown in the following screenshot.
Click the Archive gateway logs link to archive and save logs, and then share the zip file with Microsoft support.
Locate gateway logs
You can find detailed gateway log information in the Windows event logs.
1. Start Windows Event Viewer.
2. Locate logs in the Application and Services Logs > Data Management Gateway folder.
When you're troubleshooting gateway-related issues, look for error level events in the event viewer.
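If you prefer PowerShell, a minimal sketch along these lines lists recent error-level events (assuming the event log name matches the Data Management Gateway folder name shown in Event Viewer):
# Show the 20 most recent error events from the Data Management Gateway log.
Get-WinEvent -LogName "Data Management Gateway" |
    Where-Object { $_.LevelDisplayName -eq "Error" } |
    Select-Object -First 20 TimeCreated, Id, Message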
Azure Data Factory - JSON Scripting Reference
7/21/2017 131 min to read Edit Online
This article provides JSON schemas and examples for defining Azure Data Factory entities (pipeline, activity,
dataset, and linked service).
Pipeline
The high-level structure for a pipeline definition is as follows:
{
"name": "SamplePipeline",
"properties": {
"description": "Describe what pipeline does",
"activities": [
],
"start": "2016-07-12T00:00:00",
"end": "2016-07-13T00:00:00"
}
}
The following table describes the properties within the pipeline JSON definition:
Activity
The high-level structure for an activity within a pipeline definition (activities element) is as follows:
{
"name": "ActivityName",
"description": "description",
"type": "<ActivityType>",
"inputs": "[]",
"outputs": "[]",
"linkedServiceName": "MyLinkedService",
"typeProperties":
{
},
"policy":
{
}
"scheduler":
{
}
}
The following table describes the properties within the activity JSON definition:
linkedServiceName - Name of the linked service used by the activity. An activity may require that you specify the linked
service that links to the required compute environment. Required: Yes for HDInsight activities, Azure Machine Learning
activities, and the Stored Procedure Activity; No for all others.
Policies
Policies affect the run-time behavior of an activity, specifically when the slice of a table is processed. The following
table provides the details.
typeProperties section
The typeProperties section is different for each activity. Transformation activities have just the type properties. See
DATA TRANSFORMATION ACTIVITIES section in this article for JSON samples that define transformation activities
in a pipeline.
Copy activity has two subsections in the typeProperties section: source and sink. See DATA STORES section in
this article for JSON samples that show how to use a data store as a source and/or sink.
Sample copy pipeline
In the following sample pipeline, there is one activity of type Copy in the activities section. In this sample, the
Copy activity copies data from an Azure Blob storage to an Azure SQL database.
{
"name": "CopyPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2016-07-12T00:00:00",
"end": "2016-07-13T00:00:00"
}
}
{
"name": "TransformPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
],
"start": "2016-04-01T00:00:00",
"end": "2016-04-02T00:00:00",
"isPaused": false
}
}
See DATA TRANSFORMATION ACTIVITIES section in this article for JSON samples that define transformation
activities in a pipeline.
For a complete walkthrough of creating this pipeline, see Tutorial: Build your first pipeline to process data using
Hadoop cluster.
Linked service
The high-level structure for a linked service definition is as follows:
{
"name": "<name of the linked service>",
"properties": {
"type": "<type of the linked service>",
"typeProperties": {
}
}
}
The following table describes the properties within the linked service JSON definition:
Dataset
A dataset in Azure Data Factory is defined as follows:
{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"external": <boolean flag to indicate external data. only for input datasets>,
"linkedServiceName": "<Name of the linked service that refers to a data store.>",
"structure": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
},
"availability": {
"frequency": "<Specifies the time unit for data slice production. Supported frequency: Minute,
Hour, Day, Week, Month>",
"interval": "<Specifies the interval within the defined frequency. For example, frequency set to
'Hour' and interval set to 1 indicates that new data slices should be produced hourly>"
},
"policy":
{
}
}
}
In the following example, the dataset has three columns: slicetimestamp, projectname, and pageviews. They are
of type String, String, and Decimal, respectively.
structure:
[
{ "name": "slicetimestamp", "type": "String"},
{ "name": "projectname", "type": "String"},
{ "name": "pageviews", "type": "Decimal"}
]
The following table describes properties you can use in the availability section:
frequency - Specifies the time unit for data slice production. Supported frequency values: Minute, Hour, Day, Week, Month.
interval - Specifies the interval within the defined frequency. Frequency x interval determines how often the slice is produced.
anchorDateTime - Note: If the anchorDateTime has date parts that are more granular than the frequency, the more granular parts are ignored.
offset - Note: If both anchorDateTime and offset are specified, the result is the combined shift.
The following availability section specifies that the output dataset is either produced hourly (or) input dataset is
available hourly:
"availability":
{
"frequency": "Hour",
"interval": 1
}
The policy section in dataset definition defines the criteria or the condition that the dataset slices must fulfill.
Example:
"policy":
{
"validation":
{
"minimumSizeMB": 10.0
}
}
Unless a dataset is being produced by Azure Data Factory, it should be marked as external. This setting generally
applies to the inputs of the first activity in a pipeline, unless activity or pipeline chaining is being used.
DATA STORES
The Linked service section provided descriptions for JSON elements that are common to all types of linked services.
This section provides details about JSON elements that are specific to each data store.
The Dataset section provided descriptions for JSON elements that are common to all types of datasets. This section
provides details about JSON elements that are specific to each data store.
The Activity section provided descriptions for JSON elements that are common to all types of activities. This section
provides details about JSON elements that are specific to each data store when it is used as a source/sink in a copy
activity.
Click the link for the store you are interested in to see the JSON schemas for linked service, dataset, and the
source/sink for the copy activity.
Azure Cosmos DB
Azure Search
IBM DB2
MySQL
Oracle
PostgreSQL
SAP HANA
SQL Server
Sybase
Teradata
NoSQL: Cassandra, MongoDB
File: Amazon S3, File System, FTP, HDFS, SFTP
Others: HTTP, OData, ODBC, Salesforce, Web Table
Example
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
Example
{
"name": "StorageSasLinkedService",
"properties": {
"type": "AzureStorageSas",
"typeProperties": {
"sasUri": "<storageUri>?<sasToken>"
}
}
}
For more information about these linked services, see Azure Blob Storage connector article.
Dataset
To define an Azure Blob dataset, set the type of the dataset to AzureBlob. Then, specify the following Azure Blob
specific properties in the typeProperties section:
Example
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}
Example: BlobSource
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline with copy activity",
"activities": [{
"name": "AzureBlobtoSQL",
"description": "Copy Activity",
"type": "Copy",
"inputs": [{
"name": "AzureBlobInput"
}],
"outputs": [{
"name": "AzureSqlOutput"
}],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink"
}
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
MergeFiles (default): Merges all files from the source folder into one file. If the file or blob name is specified, the
merged file name is the specified name; otherwise, the file name is auto-generated.
Example: BlobSink
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline for copy activity",
"activities": [{
"name": "AzureSQLtoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [{
"name": "AzureSQLInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >=
\\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink"
}
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
servicePrincipalId - Specify the application's client ID. Required: Yes (for service principal authentication).
servicePrincipalKey - Specify the application's key. Required: Yes (for service principal authentication).
tenant - Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the top-right corner of the Azure portal. Required: Yes (for service principal authentication).
authorization - Click the Authorize button in the Data Factory Editor and enter your credential that assigns the auto-generated authorization URL to this property. Required: Yes (for user credential authentication).
sessionId - OAuth session ID from the OAuth authorization session. Each session ID is unique and may only be used once. This setting is automatically generated when you use the Data Factory Editor. Required: Yes (for user credential authentication).
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": "<service principal key>",
"tenant": "<tenant info. Example: microsoft.onmicrosoft.com>"
}
}
}
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"sessionId": "<session ID>",
"authorization": "<authorization URL>",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
}
}
}
For more information, see Azure Data Lake Store connector article.
Dataset
To define an Azure Data Lake Store dataset, set the type of the dataset to AzureDataLakeStore, and specify the
following properties in the typeProperties section:
Example
{
"name": "AzureDataLakeStoreInput",
"properties": {
"type": "AzureDataLakeStore",
"linkedServiceName": "AzureDataLakeStoreLinkedService",
"typeProperties": {
"folderPath": "datalake/input/",
"fileName": "SearchLog.tsv",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
For more information, see Azure Data Lake Store connector article.
Azure Data Lake Store Source in Copy Activity
If you are copying data from an Azure Data Lake Store, set the source type of the copy activity to
AzureDataLakeStoreSource, and specify the following properties in the source section.
AzureDataLakeStoreSource supports the following properties in the typeProperties section:
Example: AzureDataLakeStoreSource
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline for copy activity",
"activities": [{
"name": "AzureDakeLaketoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [{
"name": "AzureDataLakeStoreInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource"
},
"sink": {
"type": "BlobSink"
}
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
For more information, see Azure Data Lake Store connector article.
Azure Data Lake Store Sink in Copy Activity
If you are copying data to an Azure Data Lake Store, set the sink type of the copy activity to
AzureDataLakeStoreSink, and specify following properties in the sink section:
Example: AzureDataLakeStoreSink
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline with copy activity",
"activities": [{
"name": "AzureBlobtoDataLake",
"description": "Copy Activity",
"type": "Copy",
"inputs": [{
"name": "AzureBlobInput"
}],
"outputs": [{
"name": "AzureDataLakeStoreOutput"
}],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "AzureDataLakeStoreSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
For more information, see Azure Data Lake Store connector article.
Azure Cosmos DB
Linked service
To define an Azure Cosmos DB linked service, set the type of the linked service to DocumentDb, and specify
following properties in the typeProperties section:
Example
{
"name": "CosmosDBLinkedService",
"properties": {
"type": "DocumentDb",
"typeProperties": {
"connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
}
}
}
Example
{
"name": "PersonCosmosDBTable",
"properties": {
"type": "DocumentDbCollection",
"linkedServiceName": "CosmosDBLinkedService",
"typeProperties": {
"collectionName": "Person"
},
"external": true,
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
Example
{
"name": "DocDbToBlobPipeline",
"properties": {
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "SELECT Person.Id, Person.Name.First AS FirstName, Person.Name.Middle as
MiddleName, Person.Name.Last AS LastName FROM Person",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink",
"blobWriterAddHeader": true,
"writeBatchSize": 1000,
"writeBatchTimeout": "00:00:59"
}
},
"inputs": [{
"name": "PersonCosmosDBTable"
}],
"outputs": [{
"name": "PersonBlobTableOut"
}],
"policy": {
"concurrency": 1
},
"name": "CopyFromCosmosDbToBlob"
}],
"start": "2016-04-01T00:00:00",
"end": "2016-04-02T00:00:00"
}
}
nestingSeparator - A special character in the source column name to indicate that a nested document is needed; it is the
character that is used to separate nesting levels. The default value is . (dot). For example, with the default separator,
Name.First refers to the First property inside the following nested structure:
"Name": {
"First": "John"
},
Throttling is decided by a number of factors, including the size of documents, the number of terms in documents, the
indexing policy of the target collection, etc. For copy operations, you can use a better collection (for example, S3) to
have the most throughput available (2,500 request units/second).
Example
{
"name": "BlobToDocDbPipeline",
"properties": {
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "DocumentDbCollectionSink",
"nestingSeparator": ".",
"writeBatchSize": 2,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"ColumnMappings": "FirstName: Name.First, MiddleName: Name.Middle, LastName: Name.Last,
BusinessEntityID: BusinessEntityID, PersonType: PersonType, NameStyle: NameStyle, Title: Title, Suffix: Suffix"
}
},
"inputs": [{
"name": "PersonBlobTableIn"
}],
"outputs": [{
"name": "PersonCosmosDbTableOut"
}],
"policy": {
"concurrency": 1
},
"name": "CopyFromBlobToCosmosDb"
}],
"start": "2016-04-14T00:00:00",
"end": "2016-04-15T00:00:00"
}
}
Example
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
Example
{
"name": "AzureSqlInput",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline for copy activity",
"activities": [{
"name": "AzureSQLtoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [{
"name": "AzureSQLInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >=
\\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
writeBatchSize - Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values: Integer
(number of rows). Required: No (default: 10000).
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline with copy activity",
"activities": [{
"name": "AzureBlobtoSQL",
"description": "Copy Activity",
"type": "Copy",
"inputs": [{
"name": "AzureBlobInput"
}],
"outputs": [{
"name": "AzureSqlOutput"
}],
"typeProperties": {
"source": {
"type": "BlobSource",
"blobColumnSeparators": ","
},
"sink": {
"type": "SqlSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
Example
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
For more information, see Azure SQL Data Warehouse connector article.
Dataset
To define an Azure SQL Data Warehouse dataset, set the type of the dataset to AzureSqlDWTable, and specify the
following properties in the typeProperties section:
Example
{
"name": "AzureSqlDWInput",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
For more information, see Azure SQL Data Warehouse connector article.
SQL DW Source in Copy Activity
If you are copying data from Azure SQL Data Warehouse, set the source type of the copy activity to
SqlDWSource, and specify following properties in the source section:
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline for copy activity",
"activities": [{
"name": "AzureSQLDWtoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [{
"name": "AzureSqlDWInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >=
\\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
For more information, see Azure SQL Data Warehouse connector article.
SQL DW Sink in Copy Activity
If you are copying data to Azure SQL Data Warehouse, set the sink type of the copy activity to SqlDWSink, and
specify following properties in the sink section:
writeBatchSize - Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values: Integer
(number of rows). Required: No (default: 10000).
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline with copy activity",
"activities": [{
"name": "AzureBlobtoSQLDW",
"description": "Copy Activity",
"type": "Copy",
"inputs": [{
"name": "AzureBlobInput"
}],
"outputs": [{
"name": "AzureSqlDWOutput"
}],
"typeProperties": {
"source": {
"type": "BlobSource",
"blobColumnSeparators": ","
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
For more information, see Azure SQL Data Warehouse connector article.
Azure Search
Linked service
To define an Azure Search linked service, set the type of the linked service to AzureSearch, and specify following
properties in the typeProperties section:
Example
{
"name": "AzureSearchLinkedService",
"properties": {
"type": "AzureSearch",
"typeProperties": {
"url": "https://<service>.search.windows.net",
"key": "<AdminKey>"
}
}
}
Example
{
"name": "AzureSearchIndexDataset",
"properties": {
"type": "AzureSearchIndex",
"linkedServiceName": "AzureSearchLinkedService",
"typeProperties": {
"indexName": "products"
},
"availability": {
"frequency": "Minute",
"interval": 15
}
}
}
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline for copy activity",
"activities": [{
"name": "SqlServertoAzureSearchIndex",
"description": "copy activity",
"type": "Copy",
"inputs": [{
"name": " SqlServerInput"
}],
"outputs": [{
"name": "AzureSearchIndexDataset"
}],
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >=
\\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "AzureSearchIndexSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
Example:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}
Example:
{
"name": "StorageSasLinkedService",
"properties": {
"type": "AzureStorageSas",
"typeProperties": {
"sasUri": "<storageUri>?<sasToken>"
}
}
}
For more information about these linked services, see Azure Table Storage connector article.
Dataset
To define an Azure Table dataset, set the type of the dataset to AzureTable, and specify the following properties in
the typeProperties section:
PROPERTY | DESCRIPTION | REQUIRED
tableName | Name of the table in the Azure Table Database instance that the linked service refers to. | Yes. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfy the query are copied to the destination.
Example
{
"name": "AzureTableInput",
"properties": {
"type": "AzureTable",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
For more information about these linked services, see Azure Table Storage connector article.
Azure Table Source in Copy Activity
If you are copying data from Azure Table Storage, set the source type of the copy activity to AzureTableSource,
and specify following properties in the source section:
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
azureTableSourceQuery | Use the custom query to read data. | Azure table query string. See examples in the next section. | No. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfy the query are copied to the destination.
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline for copy activity",
"activities": [{
"name": "AzureTabletoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [{
"name": "AzureTableInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "AzureTableSource",
"AzureTableSourceQuery": "PartitionKey eq 'DefaultPartitionKey'"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
For more information about these linked services, see Azure Table Storage connector article.
Azure Table Sink in Copy Activity
If you are copying data to Azure Table Storage, set the sink type of the copy activity to AzureTableSink, and
specify following properties in the sink section:
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
writeBatchSize | Inserts data into the Azure table when the writeBatchSize or writeBatchTimeout is hit. | Integer (number of rows) | No (default: 10000)
writeBatchTimeout | Inserts data into the Azure table when the writeBatchSize or writeBatchTimeout is hit. | timespan. Example: 00:20:00 (20 minutes) | No (default: the storage client default timeout value of 90 seconds)
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline with copy activity",
"activities": [{
"name": "AzureBlobtoTable",
"description": "Copy Activity",
"type": "Copy",
"inputs": [{
"name": "AzureBlobInput"
}],
"outputs": [{
"name": "AzureTableOutput"
}],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "AzureTableSink",
"writeBatchSize": 100,
"writeBatchTimeout": "01:00:00"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
For more information about these linked services, see Azure Table Storage connector article.
Amazon RedShift
Linked service
To define an Amazon Redshift linked service, set the type of the linked service to AmazonRedshift, and specify
following properties in the typeProperties section:
PROPERTY | DESCRIPTION | REQUIRED
port | The number of the TCP port that the Amazon Redshift server uses to listen for client connections. | No, default value: 5439
Example
{
"name": "AmazonRedshiftLinkedService",
"properties": {
"type": "AmazonRedshift",
"typeProperties": {
"server": "<Amazon Redshift host name or IP address>",
"port": 5439,
"database": "<database name>",
"username": "user",
"password": "password"
}
}
}
Example
{
"name": "AmazonRedshiftInputDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "AmazonRedshiftLinkedService",
"typeProperties": {
"tableName": "<Table name>"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | No (if tableName of dataset is specified)
Example
{
"name": "CopyAmazonRedshiftToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-
ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [{
"name": "AmazonRedshiftInputDataset"
}],
"outputs": [{
"name": "AzureBlobOutputDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "AmazonRedshiftToBlob"
}],
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00"
}
}
IBM DB2
Linked service
To define an IBM DB2 linked service, set the type of the linked service to OnPremisesDB2, and specify following
properties in the typeProperties section:
Example
{
"name": "OnPremDb2LinkedService",
"properties": {
"type": "OnPremisesDb2",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"schema": "<schema>",
"authenticationType": "<authentication type>",
"username": "<username>",
"password": "<password>",
"gatewayName": "<gatewayName>"
}
}
}
PROPERTY | DESCRIPTION | REQUIRED
tableName | Name of the table in the DB2 Database instance that the linked service refers to. The tableName is case-sensitive. | No (if query of RelationalSource is specified)
Example
{
"name": "Db2DataSet",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "OnPremDb2LinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | SQL query string. For example: "query": "select * from \"MySchema\".\"MyTable\"" | No (if tableName of dataset is specified)
Example
{
"name": "CopyDb2ToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "select * from \"Orders\""
},
"sink": {
"type": "BlobSink"
}
},
"inputs": [{
"name": "Db2DataSet"
}],
"outputs": [{
"name": "AzureBlobDb2DataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "Db2ToBlob"
}],
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00"
}
}
MySQL
Linked service
To define a MySQL linked service, set the type of the linked service to OnPremisesMySql, and specify following
properties in the typeProperties section:
Example
{
"name": "OnPremMySqlLinkedService",
"properties": {
"type": "OnPremisesMySql",
"typeProperties": {
"server": "<server name>",
"database": "<database name>",
"schema": "<schema name>",
"authenticationType": "<authentication type>",
"userName": "<user name>",
"password": "<password>",
"gatewayName": "<gateway>"
}
}
}
Example
{
"name": "MySqlDataSet",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "OnPremMySqlLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
For more information, see MySQL connector article.
Relational Source in Copy Activity
If you are copying data from a MySQL database, set the source type of the copy activity to RelationalSource, and
specify following properties in the source section:
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | No (if tableName of dataset is specified)
Example
{
"name": "CopyMySqlToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-
ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [{
"name": "MySqlDataSet"
}],
"outputs": [{
"name": "AzureBlobMySqlDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "MySqlToBlob"
}],
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00"
}
}
Oracle
Linked service
To define an Oracle linked service, set the type of the linked service to OnPremisesOracle, and specify following
properties in the typeProperties section:
Example
{
"name": "OnPremisesOracleLinkedService",
"properties": {
"type": "OnPremisesOracle",
"typeProperties": {
"driverType": "Microsoft",
"connectionString": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;",
"gatewayName": "<gateway name>"
}
}
}
Example
{
"name": "OracleInput",
"properties": {
"type": "OracleTable",
"linkedServiceName": "OnPremisesOracleLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"offset": "01:00:00",
"interval": "1",
"anchorDateTime": "2016-02-27T12:00:00",
"frequency": "Hour"
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
oracleReaderQuery | Use the custom query to read data. | SQL query string. For example: select * from MyTable | No (if tableName of dataset is specified)
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline for copy activity",
"activities": [{
"name": "OracletoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [{
"name": " OracleInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "OracleSource",
"oracleReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >=
\\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. | Integer (number of rows) | No (default: 100)
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-05T19:00:00",
"description": "pipeline with copy activity",
"activities": [{
"name": "AzureBlobtoOracle",
"description": "Copy Activity",
"type": "Copy",
"inputs": [{
"name": "AzureBlobInput"
}],
"outputs": [{
"name": "OracleOutput"
}],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "OracleSink"
}
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
PostgreSQL
Linked service
To define a PostgreSQL linked service, set the type of the linked service to OnPremisesPostgreSql, and specify
following properties in the typeProperties section:
Example
{
"name": "OnPremPostgreSqlLinkedService",
"properties": {
"type": "OnPremisesPostgreSql",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"schema": "<schema>",
"authenticationType": "<authentication type>",
"username": "<username>",
"password": "<password>",
"gatewayName": "<gatewayName>"
}
}
}
Example
{
"name": "PostgreSqlDataSet",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "OnPremPostgreSqlLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | SQL query string. For example: "query": "select * from \"MySchema\".\"MyTable\"" | No (if tableName of dataset is specified)
Example
{
"name": "CopyPostgreSqlToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "select * from \"public\".\"usstates\""
},
"sink": {
"type": "BlobSink"
}
},
"inputs": [{
"name": "PostgreSqlDataSet"
}],
"outputs": [{
"name": "AzureBlobPostgreSqlDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "PostgreSqlToBlob"
}],
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00"
}
}
SAP Business Warehouse
Linked service
To define a SAP Business Warehouse (BW) linked service, set the type of the linked service to SapBw, and specify following properties in the typeProperties section:
Example
{
"name": "SapBwLinkedService",
"properties": {
"type": "SapBw",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"username": "<SAP user>",
"password": "<Password for SAP user>",
"gatewayName": "<gateway name>"
}
}
}
{
"name": "SapBwDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "SapBwLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
Example
{
"name": "CopySapBwToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "<MDX query for SAP BW>"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [{
"name": "SapBwDataset"
}],
"outputs": [{
"name": "AzureBlobDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "SapBwToBlob"
}],
"start": "2017-03-01T18:00:00",
"end": "2017-03-01T19:00:00"
}
}
SAP HANA
Linked service
To define a SAP HANA linked service, set the type of the linked service to SapHana, and specify following
properties in the typeProperties section:
Example
{
"name": "SapHanaLinkedService",
"properties": {
"type": "SapHana",
"typeProperties": {
"server": "<server name>",
"authenticationType": "<Basic, or Windows>",
"username": "<SAP user>",
"password": "<Password for SAP user>",
"gatewayName": "<gateway name>"
}
}
}
{
"name": "SapHanaDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "SapHanaLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
Example
{
"name": "CopySapHanaToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "<SQL Query for HANA>"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [{
"name": "SapHanaDataset"
}],
"outputs": [{
"name": "AzureBlobDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "SapHanaToBlob"
}],
"start": "2017-03-01T18:00:00",
"end": "2017-03-01T19:00:00"
}
}
SQL Server
Linked service
You create a linked service of type OnPremisesSqlServer to link an on-premises SQL Server database to a data factory. For descriptions of the JSON elements that are specific to the on-premises SQL Server linked service, see SQL Server connector article.
You can encrypt credentials by using the New-AzureRmDataFactoryEncryptValue cmdlet and use them in the connection string (EncryptedCredential property). The following example uses SQL authentication with the user name and password in plain text; a sketch of the encrypted-credential variant follows it.
{
"name": "MyOnPremisesSQLDB",
"properties": {
"type": "OnPremisesSqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated
Security=False;User ID=<username>;Password=<password>;",
"gatewayName": "<gateway name>"
}
}
}
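The following is a sketch of the same linked service with encrypted credentials. It assumes that you have run the New-AzureRmDataFactoryEncryptValue cmdlet and pasted its output into the connection string as the EncryptedCredential value (the value shown is a placeholder), following the same pattern as the ODBC example later in this article:
{
    "name": "MyOnPremisesSQLDB",
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated Security=False;EncryptedCredential=<output of New-AzureRmDataFactoryEncryptValue>",
            "gatewayName": "<gateway name>"
        }
    }
}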
Example
{
"name": "SqlServerInput",
"properties": {
"type": "SqlServerTable",
"linkedServiceName": "SqlServerLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
If the sqlReaderQuery is specified for the SqlSource, the Copy Activity runs this query against the SQL Server
Database source to get the data.
Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and
storedProcedureParameters (if the stored procedure takes parameters).
If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the
structure section are used to build a select query to run against the SQL Server Database. If the dataset definition
does not have the structure, all columns are selected from the table.
NOTE
When you use sqlReaderStoredProcedureName, you still need to specify a value for the tableName property in the
dataset JSON. There are no validations performed against this table though.
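As a minimal sketch of the stored procedure option (the procedure name and parameters here are hypothetical), the source section of the copy activity could look like the following:
"source": {
    "type": "SqlSource",
    "sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
    "storedProcedureParameters": {
        "stringData": { "value": "str3" },
        "identifier": { "value": "$$Text.Format('{0:yyyy}', SliceStart)", "type": "Int" }
    }
}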
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline for copy activity",
"activities": [{
"name": "SqlServertoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [{
"name": " SqlServerInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >=
\\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
In this example, sqlReaderQuery is specified for the SqlSource. The Copy Activity runs this query against the SQL
Server Database source to get the data. Alternatively, you can specify a stored procedure by specifying the
sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters).
The sqlReaderQuery can reference multiple tables within the database referenced by the input dataset. It is not
limited to only the table set as the dataset's tableName typeProperty.
If you do not specify sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure
section are used to build a select query to run against the SQL Server Database. If the dataset definition does not
have the structure, all columns are selected from the table.
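For example, if the dataset includes a structure section like the following sketch (the column names are hypothetical), the generated query selects only those columns from the table identified by tableName:
"structure": [
    { "name": "userid", "type": "Int64" },
    { "name": "name", "type": "String" },
    { "name": "lastlogindate", "type": "Datetime" }
]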
For more information, see SQL Server connector article.
Sql Sink in Copy Activity
If you are copying data to a SQL Server database, set the sink type of the copy activity to SqlSink, and specify
following properties in the sink section:
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. | Integer (number of rows) | No (default: 10000)
Example
The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to
run every hour. In the pipeline JSON definition, the source type is set to BlobSource and sink type is set to
SqlSink.
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline with copy activity",
"activities": [{
"name": "AzureBlobtoSQL",
"description": "Copy Activity",
"type": "Copy",
"inputs": [{
"name": "AzureBlobInput"
}],
"outputs": [{
"name": " SqlServerOutput "
}],
"typeProperties": {
"source": {
"type": "BlobSource",
"blobColumnSeparators": ","
},
"sink": {
"type": "SqlSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
Sybase
Linked service
To define a Sybase linked service, set the type of the linked service to OnPremisesSybase, and specify following
properties in the typeProperties section:
Example
{
"name": "OnPremSybaseLinkedService",
"properties": {
"type": "OnPremisesSybase",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"schema": "<schema>",
"authenticationType": "<authentication type>",
"username": "<username>",
"password": "<password>",
"gatewayName": "<gatewayName>"
}
}
}
Example
{
"name": "SybaseDataSet",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "OnPremSybaseLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | No (if tableName of dataset is specified)
Example
{
"name": "CopySybaseToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "select * from DBA.Orders"
},
"sink": {
"type": "BlobSink"
}
},
"inputs": [{
"name": "SybaseDataSet"
}],
"outputs": [{
"name": "AzureBlobSybaseDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "SybaseToBlob"
}],
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00"
}
}
Teradata
Linked service
To define a Teradata linked service, set the type of the linked service to OnPremisesTeradata, and specify
following properties in the typeProperties section:
Example
{
"name": "OnPremTeradataLinkedService",
"properties": {
"type": "OnPremisesTeradata",
"typeProperties": {
"server": "<server>",
"authenticationType": "<authentication type>",
"username": "<username>",
"password": "<password>",
"gatewayName": "<gatewayName>"
}
}
}
{
"name": "TeradataDataSet",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "OnPremTeradataLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | Yes
Example
{
"name": "CopyTeradataToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-
ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart, SliceEnd)"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [{
"name": "TeradataDataSet"
}],
"outputs": [{
"name": "AzureBlobTeradataDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "TeradataToBlob"
}],
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"isPaused": false
}
}
Cassandra
Linked service
To define a Cassandra linked service, set the type of the linked service to OnPremisesCassandra, and specify
following properties in the typeProperties section:
PROPERTY | DESCRIPTION | REQUIRED
port | The TCP port that the Cassandra server uses to listen for client connections. | No, default value: 9042
username | Specify user name for the user account. | Yes, if authenticationType is set to Basic.
password | Specify password for the user account. | Yes, if authenticationType is set to Basic.
Example
{
"name": "CassandraLinkedService",
"properties": {
"type": "OnPremisesCassandra",
"typeProperties": {
"authenticationType": "Basic",
"host": "<cassandra server name or IP address>",
"port": 9042,
"username": "user",
"password": "password",
"gatewayName": "<onpremgateway>"
}
}
}
PROPERTY | DESCRIPTION | REQUIRED
keyspace | Name of the keyspace or schema in the Cassandra database. | Yes (if query for CassandraSource is not defined)
tableName | Name of the table in the Cassandra database. | Yes (if query for CassandraSource is not defined)
Example
{
"name": "CassandraInput",
"properties": {
"linkedServiceName": "CassandraLinkedService",
"type": "CassandraTable",
"typeProperties": {
"tableName": "mytable",
"keySpace": "<key space>"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. When using a SQL query, specify keyspace name.table name to represent the table you want to query. | SQL-92 query or CQL query. See CQL reference. | No (if tableName and keyspace on dataset are defined)
consistencyLevel | The consistency level specifies how many replicas must respond to a read request before returning data to the client application. Cassandra checks the specified number of replicas for data to satisfy the read request. | ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, LOCAL_ONE. See Configuring data consistency for details. | No. Default value is ONE.
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline with copy activity",
"activities": [{
"name": "CassandraToAzureBlob",
"description": "Copy from Cassandra to an Azure blob",
"type": "Copy",
"inputs": [{
"name": "CassandraInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "CassandraSource",
"query": "select id, firstname, lastname from mykeyspace.mytable"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
MongoDB
Linked service
To define a MongoDB linked service, set the type of the linked service to OnPremisesMongoDB, and specify
following properties in the typeProperties section:
PROPERTY | DESCRIPTION | REQUIRED
port | TCP port that the MongoDB server uses to listen for client connections. | Optional, default value: 27017
username | User account to access MongoDB. | Yes (if basic authentication is used)
password | Password for the user. | Yes (if basic authentication is used)
authSource | Name of the MongoDB database that you want to use to check your credentials for authentication. | Optional (if basic authentication is used). Default: uses the admin account and the database specified using the databaseName property.
Example
{
"name": "OnPremisesMongoDbLinkedService",
"properties": {
"type": "OnPremisesMongoDb",
"typeProperties": {
"authenticationType": "<Basic or Anonymous>",
"server": "< The IP address or host name of the MongoDB server >",
"port": "<The number of the TCP port that the MongoDB server uses to listen for client
connections.>",
"username": "<username>",
"password": "<password>",
"authSource": "< The database that you want to use to check your credentials for authentication.
>",
"databaseName": "<database name>",
"gatewayName": "<onpremgateway>"
}
}
}
Example
{
"name": "MongoDbInputDataset",
"properties": {
"type": "MongoDbCollection",
"linkedServiceName": "OnPremisesMongoDbLinkedService",
"typeProperties": {
"collectionName": "<Collection name>"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | SQL-92 query string. For example: select * from MyTable. | No (if collectionName of dataset is specified)
Example
{
"name": "CopyMongoDBToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "MongoDbSource",
"query": "select * from MyTable"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [{
"name": "MongoDbInputDataset"
}],
"outputs": [{
"name": "AzureBlobOutputDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "MongoDBToAzureBlob"
}],
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00"
}
}
Amazon S3
Linked service
To define an Amazon S3 linked service, set the type of the linked service to AwsAccessKey, and specify following
properties in the typeProperties section:
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
secretAccessKey | The secret access key itself. | Encrypted secret string | Yes
Example
{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AwsAccessKey",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": "<secret access key>"
}
}
}
NOTE
bucketName + key specifies the location of the S3 object where bucket is the root container for S3 objects and key is the full
path to S3 object.
{
"name": "dataset-s3",
"properties": {
"type": "AmazonS3",
"linkedServiceName": "link- testS3",
"typeProperties": {
"prefix": "testFolder/test",
"bucketName": "<S3 bucket name>",
"format": {
"type": "OrcFormat"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
"key": "testFolder/test.orc",
"bucketName": "<S3 bucket name>",
You can have Data Factory calculate the key and bucketName dynamically at runtime by using system variables
such as SliceStart.
You can do the same for the prefix property of an Amazon S3 dataset. See Data Factory functions and system
variables for a list of supported functions and variables.
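As a sketch of that approach (the folder layout here is hypothetical), the key can be composed from slice-time variables with partitionedBy, the same mechanism shown for folderPath in the File System section later in this article:
"typeProperties": {
    "bucketName": "<S3 bucket name>",
    "key": "testFolder/yearno={Year}/monthno={Month}/test.orc",
    "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }
    ],
    "format": { "type": "OrcFormat" }
}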
For more information, see Amazon S3 connector article.
File System Source in Copy Activity
If you are copying data from Amazon S3, set the source type of the copy activity to FileSystemSource, and
specify following properties in the source section:
Example
{
"name": "CopyAmazonS3ToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [{
"name": "AmazonS3InputDataset"
}],
"outputs": [{
"name": "AzureBlobOutputDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "AmazonS3ToBlob"
}],
"start": "2016-08-08T18:00:00",
"end": "2016-08-08T19:00:00"
}
}
File System
Linked service
You can link an on-premises file system to an Azure data factory with the On-Premises File Server linked service.
The following table provides descriptions for JSON elements that are specific to the On-Premises File Server linked
service.
PROPERTY | DESCRIPTION | REQUIRED
userid | Specify the ID of the user who has access to the server. | No (if you choose encryptedCredential)
password | Specify the password for the user (userid). | No (if you choose encryptedCredential)
encryptedCredential | Specify the encrypted credentials that you can get by running the New-AzureRmDataFactoryEncryptValue cmdlet. | No (if you choose to specify userid and password in plain text)
Sample host and folderPath values:
SCENARIO | HOST IN LINKED SERVICE DEFINITION | FOLDERPATH IN DATASET DEFINITION
Local folder on the Data Management Gateway machine (examples: D:\* or D:\folder\subfolder\*) | D:\\ (for Data Management Gateway 2.0 and later versions); localhost (for versions earlier than Data Management Gateway 2.0) | .\\ or folder\\subfolder (for Data Management Gateway 2.0 and later versions); D:\\ or D:\\folder\\subfolder (for gateway versions below 2.0)
Remote shared folder (examples: \\myserver\share\* or \\myserver\share\folder\subfolder\*) | \\\\myserver\\share | .\\ or folder\\subfolder
{
"Name": "OnPremisesFileServerLinkedService",
"properties": {
"type": "OnPremisesFileServer",
"typeProperties": {
"host": "\\\\Contosogame-Asia",
"userid": "Admin",
"password": "123456",
"gatewayName": "<onpremgateway>"
}
}
}
{
"Name": " OnPremisesFileServerLinkedService ",
"properties": {
"type": "OnPremisesFileServer",
"typeProperties": {
"host": "D:\\",
"encryptedCredential": "WFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5xxxxxxxxxxxxxxxxx",
"gatewayName": "<onpremgateway>"
}
}
}
When fileName is not specified for an output dataset, the name of the generated file is in the following format: Data.<Guid>.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt).
NOTE
You cannot use fileName and fileFilter simultaneously.
Example
{
"name": "OnpremisesFileSystemInput",
"properties": {
"type": " FileShare",
"linkedServiceName": " OnPremisesFileServerLinkedService ",
"typeProperties": {
"folderPath": "mysharedfolder/yearno={Year}/monthno={Month}/dayno={Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
}, {
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
}, {
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
}, {
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}]
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
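In the preceding dataset, folderPath and fileName are resolved per slice from the SliceStart system variable. For the slice starting at 2015-06-01T18:00 (the first slice of the pipeline below), they resolve to approximately:
folderPath: mysharedfolder/yearno=2015/monthno=06/dayno=01
fileName: 18.csv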
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2015-06-01T18:00:00",
"end": "2015-06-01T19:00:00",
"description": "Pipeline for copy activity",
"activities": [{
"name": "OnpremisesFileSystemtoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [{
"name": "OnpremisesFileSystemInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "FileSystemSource"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2015-06-01T18:00:00",
"end": "2015-06-01T20:00:00",
"description": "pipeline for copy activity",
"activities": [{
"name": "AzureSQLtoOnPremisesFile",
"description": "copy activity",
"type": "Copy",
"inputs": [{
"name": "AzureSQLInput"
}],
"outputs": [{
"name": "OnpremisesFileSystemOutput"
}],
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >=
\\'{0:yyyy-MM-dd}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "FileSystemSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 3,
"timeout": "01:00:00"
}
}]
}
}
FTP
Linked service
To define an FTP linked service, set the type of the linked service to FtpServer, and specify following properties in
the typeProperties section:
Example: Using Anonymous authentication
{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"authenticationType": "Anonymous",
"host": "myftpserver.com"
}
}
}
Example: Using username and password in plain text for basic authentication
{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "myftpserver.com",
"authenticationType": "Basic",
"username": "Admin",
"password": "123456"
}
}
}
Example: Using encryptedCredential for authentication
{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "myftpserver.com",
"authenticationType": "Basic",
"encryptedCredential": "xxxxxxxxxxxxxxxxx",
"gatewayName": "<onpremgateway>"
}
}
}
NOTE
filename and fileFilter cannot be used simultaneously.
Example
{
"name": "FTPFileInput",
"properties": {
"type": "FileShare",
"linkedServiceName": "FTPLinkedService",
"typeProperties": {
"folderPath": "<path to shared folder>",
"fileName": "test.csv",
"useBinaryTransfer": true
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Example
{
"name": "pipeline",
"properties": {
"activities": [{
"name": "FTPToBlobCopy",
"inputs": [{
"name": "FtpFileInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "00:05:00"
}
}],
"start": "2016-08-24T18:00:00",
"end": "2016-08-24T19:00:00"
}
}
HDFS
Linked service
To define a HDFS linked service, set the type of the linked service to Hdfs, and specify following properties in the
typeProperties section:
PROPERTY | DESCRIPTION | REQUIRED
encryptedCredential | New-AzureRMDataFactoryEncryptValue output of the access credential. | No
Example: Using Anonymous authentication
{
"name": "HDFSLinkedService",
"properties": {
"type": "Hdfs",
"typeProperties": {
"authenticationType": "Anonymous",
"userName": "hadoop",
"url": "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "<onpremgateway>"
}
}
}
Example: Using Windows authentication
{
"name": "HDFSLinkedService",
"properties": {
"type": "Hdfs",
"typeProperties": {
"authenticationType": "Windows",
"userName": "Administrator",
"password": "password",
"url": "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "<onpremgateway>"
}
}
}
NOTE
filename and fileFilter cannot be used simultaneously.
Example
{
"name": "InputDataset",
"properties": {
"type": "FileShare",
"linkedServiceName": "HDFSLinkedService",
"typeProperties": {
"folderPath": "DataTransfer/UnitTest/"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Example
{
"name": "pipeline",
"properties": {
"activities": [{
"name": "HdfsToBlobCopy",
"inputs": [{
"name": "InputDataset"
}],
"outputs": [{
"name": "OutputDataset"
}],
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource"
},
"sink": {
"type": "BlobSink"
}
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "00:05:00"
}
}],
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00"
}
}
SFTP
Linked service
To define an SFTP linked service, set the type of the linked service to Sftp, and specify following properties in the
typeProperties section:
PROPERTY | DESCRIPTION | REQUIRED
skipHostKeyValidation | Specify whether to skip host key validation. | No. The default value: false
hostKeyFingerprint | Specify the finger print of the host key. | Yes if the skipHostKeyValidation is set to false.
gatewayName | Name of the Data Management Gateway to connect to an on-premises SFTP server. | Yes if copying data from an on-premises SFTP server.
encryptedCredential | Encrypted credential to access the SFTP server. Auto-generated when you specify basic authentication (username + password) or SshPublicKey authentication (username + private key path or content) in the copy wizard or the ClickOnce popup dialog. | No. Apply only when copying data from an on-premises SFTP server.
Example: Using basic authentication
{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<SFTP server name or IP address>",
"port": 22,
"authenticationType": "Basic",
"username": "xxx",
"password": "xxx",
"skipHostKeyValidation": false,
"hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
"gatewayName": "<onpremgateway>"
}
}
}
Example: Using basic authentication with encrypted credential
{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<FTP server name or IP address>",
"port": 22,
"authenticationType": "Basic",
"username": "xxx",
"encryptedCredential": "xxxxxxxxxxxxxxxxx",
"skipHostKeyValidation": false,
"hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
"gatewayName": "<onpremgateway>"
}
}
}
Using SSH public key authentication
To use SSH public key authentication, set authenticationType to SshPublicKey, and specify the following properties besides the SFTP connector generic ones introduced in the last section:
PROPERTY | DESCRIPTION | REQUIRED
privateKeyPath | Specify the absolute path to the private key file that the gateway can access. | Specify either the privateKeyPath or the privateKeyContent.
privateKeyContent | A serialized string of the private key content. The Copy Wizard can read the private key file and extract the private key content automatically. If you are using any other tool/SDK, use the privateKeyPath property instead. | Specify either the privateKeyPath or the privateKeyContent.
passPhrase | Specify the pass phrase/password to decrypt the private key if the key file is protected by a pass phrase. | Yes if the private key file is protected by a pass phrase.
Example: Using SSH public key authentication with privateKeyPath
{
"name": "SftpLinkedServiceWithPrivateKeyPath",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<FTP server name or IP address>",
"port": 22,
"authenticationType": "SshPublicKey",
"username": "xxx",
"privateKeyPath": "D:\\privatekey_openssh",
"passPhrase": "xxx",
"skipHostKeyValidation": true,
"gatewayName": "<onpremgateway>"
}
}
}
Example: Using SSH public key authentication with privateKeyContent
{
"name": "SftpLinkedServiceWithPrivateKeyContent",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "mysftpserver.westus.cloudapp.azure.com",
"port": 22,
"authenticationType": "SshPublicKey",
"username": "xxx",
"privateKeyContent": "<base64 string of the private key content>",
"passPhrase": "xxx",
"skipHostKeyValidation": true
}
}
}
NOTE
filename and fileFilter cannot be used simultaneously.
Example
{
"name": "SFTPFileInput",
"properties": {
"type": "FileShare",
"linkedServiceName": "SftpLinkedService",
"typeProperties": {
"folderPath": "<path to shared folder>",
"fileName": "test.csv"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Example
{
"name": "pipeline",
"properties": {
"activities": [{
"name": "SFTPToBlobCopy",
"inputs": [{
"name": "SFTPFileInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "00:05:00"
}
}],
"start": "2017-02-20T18:00:00",
"end": "2017-02-20T19:00:00"
}
}
HTTP
Linked service
To define a HTTP linked service, set the type of the linked service to Http, and specify following properties in the
typeProperties section:
PROPERTY | DESCRIPTION | REQUIRED
gatewayName | Name of the Data Management Gateway to connect to an on-premises HTTP source. | Yes if copying data from an on-premises HTTP source.
encryptedCredential | Encrypted credential to access the HTTP endpoint. Auto-generated when you configure the authentication information in the copy wizard or the ClickOnce popup dialog. | No. Apply only when copying data from an on-premises HTTP server.
Example: Using basic authentication
{
"name": "HttpLinkedService",
"properties": {
"type": "Http",
"typeProperties": {
"authenticationType": "basic",
"url": "https://en.wikipedia.org/wiki/",
"userName": "user name",
"password": "password"
}
}
}
PROPERTY | DESCRIPTION | REQUIRED
certThumbprint | The thumbprint of the certificate that was installed on your gateway machine's cert store. Apply only when copying data from an on-premises HTTP source. | Specify either the embeddedCertData or the certThumbprint.
If you use certThumbprint for authentication and the certificate is installed in the personal store of the local
computer, you need to grant the read permission to the gateway service:
1. Launch Microsoft Management Console (MMC). Add the Certificates snap-in that targets the Local Computer.
2. Expand Certificates, Personal, and click Certificates.
3. Right-click the certificate from the personal store, and select All Tasks->Manage Private Keys...
4. On the Security tab, add the user account under which Data Management Gateway Host Service is running with
the read access to the certificate.
Example: using client certificate: This linked service links your data factory to an on-premises HTTP web server.
It uses a client certificate that is installed on the machine with Data Management Gateway installed.
{
"name": "HttpLinkedService",
"properties": {
"type": "Http",
"typeProperties": {
"authenticationType": "ClientCertificate",
"url": "https://en.wikipedia.org/wiki/",
"certThumbprint": "thumbprint of certificate",
"gatewayName": "gateway name"
}
}
}
Example: Using a client certificate with embedded certificate data
{
"name": "HttpLinkedService",
"properties": {
"type": "Http",
"typeProperties": {
"authenticationType": "ClientCertificate",
"url": "https://en.wikipedia.org/wiki/",
"embeddedCertData": "base64 encoded cert data",
"password": "password of cert"
}
}
}
PROPERTY | DESCRIPTION | REQUIRED
requestMethod | HTTP method. Allowed values are GET or POST. | No. Default is GET.
{
"name": "HttpSourceDataInput",
"properties": {
"type": "Http",
"linkedServiceName": "HttpLinkedService",
"typeProperties": {
"relativeUrl": "/XXX/test.xml",
"requestMethod": "Post",
"requestBody": "body for POST HTTP request"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
PROPERTY | DESCRIPTION | REQUIRED
httpRequestTimeout | The timeout (TimeSpan) for the HTTP request to get a response. It is the timeout to get a response, not the timeout to read response data. | No. Default value: 00:01:40
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline with copy activity",
"activities": [{
"name": "HttpSourceToAzureBlob",
"description": "Copy from an HTTP source to an Azure blob",
"type": "Copy",
"inputs": [{
"name": "HttpSourceDataInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "HttpSource"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
OData
Linked service
To define an OData linked service, set the type of the linked service to OData, and specify following properties in
the typeProperties section:
PROPERTY | DESCRIPTION | REQUIRED
username | Specify the user name if you are using Basic authentication. | Yes (only if you are using Basic authentication)
password | Specify the password for the user account you specified for the username. | Yes (only if you are using Basic authentication)
authorizedCredential | If you are using OAuth, click the Authorize button in the Data Factory Copy Wizard or Editor and enter your credential; the value of this property is then auto-generated. | Yes (only if you are using OAuth authentication)
Example: Using Basic authentication
{
"name": "inputLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "http://services.odata.org/OData/OData.svc",
"authenticationType": "Basic",
"username": "username",
"password": "password"
}
}
}
Example: Using Anonymous authentication
{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "http://services.odata.org/OData/OData.svc",
"authenticationType": "Anonymous"
}
}
}
Example: Using OAuth authentication
{
"name": "inputLinkedService",
"properties":
{
"type": "OData",
"typeProperties":
{
"url": "<endpoint of cloud OData source, for example,
https://<tenant>.crm.dynamics.com/XRMServices/2011/OrganizationData.svc>",
"authenticationType": "OAuth",
"authorizedCredential": "<auto generated by clicking the Authorize button on UI>"
}
}
}
Example
{
"name": "ODataDataset",
"properties": {
"type": "ODataResource",
"typeProperties": {
"path": "Products"
},
"linkedServiceName": "ODataLinkedService",
"structure": [],
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
Example
{
"name": "CopyODataToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "?$select=Name, Description&$top=5"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [{
"name": "ODataDataSet"
}],
"outputs": [{
"name": "AzureBlobODataDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "ODataToBlob"
}],
"start": "2017-02-01T18:00:00",
"end": "2017-02-03T19:00:00"
}
}
ODBC
Linked service
To define an ODBC linked service, set the type of the linked service to OnPremisesOdbc, and specify following
properties in the typeProperties section:
Example: Using basic authentication
{
"name": "ODBCLinkedService",
"properties": {
"type": "OnPremisesOdbc",
"typeProperties": {
"authenticationType": "Basic",
"connectionString": "Driver={SQL Server};Server=Server.database.windows.net;
Database=TestDatabase;",
"userName": "username",
"password": "password",
"gatewayName": "<onpremgateway>"
}
}
}
Example: Using basic authentication with encrypted credentials
{
"name": "ODBCLinkedService",
"properties": {
"type": "OnPremisesOdbc",
"typeProperties": {
"authenticationType": "Basic",
"connectionString": "Driver={SQL Server};Server=myserver.database.windows.net;
Database=TestDatabase;;EncryptedCredential=eyJDb25uZWN0...........................",
"gatewayName": "<onpremgateway>"
}
}
}
Example
{
"name": "ODBCDataSet",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "ODBCLinkedService",
"typeProperties": {},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | Yes
Example
{
"name": "CopyODBCToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-
ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [{
"name": "OdbcDataSet"
}],
"outputs": [{
"name": "AzureBlobOdbcDataSet"
}],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "OdbcToBlob"
}],
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00"
}
}
Salesforce
Linked service
To define a Salesforce linked service, set the type of the linked service to Salesforce, and specify following
properties in the typeProperties section:
PROPERTY | DESCRIPTION | REQUIRED
environmentUrl | Specify the URL of the Salesforce instance. The default is "https://login.salesforce.com". To copy data from sandbox, specify "https://test.salesforce.com". To copy data from a custom domain, specify, for example, "https://[domain].my.salesforce.com". | No
Example
{
"name": "SalesforceLinkedService",
"properties": {
"type": "Salesforce",
"typeProperties": {
"username": "<user name>",
"password": "<password>",
"securityToken": "<security token>"
}
}
}
Example
{
"name": "SalesforceInput",
"properties": {
"linkedServiceName": "SalesforceLinkedService",
"type": "RelationalTable",
"typeProperties": {
"tableName": "AllDataType__c"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | A SQL-92 query or Salesforce Object Query Language (SOQL) query. For example: select * from MyTable__c. | No (if the tableName of the dataset is specified)
Example
{
"name": "SamplePipeline",
"properties": {
"start": "2016-06-01T18:00:00",
"end": "2016-06-01T19:00:00",
"description": "pipeline with copy activity",
"activities": [{
"name": "SalesforceToAzureBlob",
"description": "Copy from Salesforce to an Azure blob",
"type": "Copy",
"inputs": [{
"name": "SalesforceInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "SELECT Id, Col_AutoNumber__c, Col_Checkbox__c, Col_Currency__c, Col_Date__c,
Col_DateTime__c, Col_Email__c, Col_Number__c, Col_Percent__c, Col_Phone__c, Col_Picklist__c,
Col_Picklist_MultiSelect__c, Col_Text__c, Col_Text_Area__c, Col_Text_AreaLong__c, Col_Text_AreaRich__c,
Col_URL__c, Col_Text_Encrypt__c, Col_Lookup__c FROM AllDataType__c"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}]
}
}
IMPORTANT
The "__c" part of the API Name is needed for any custom object.
Web Data
Linked service
To define a Web linked service, set the type of the linked service to Web, and specify following properties in the
typeProperties section:
Example
{
"name": "web",
"properties": {
"type": "Web",
"typeProperties": {
"authenticationType": "Anonymous",
"url": "https://en.wikipedia.org/wiki/"
}
}
}
PROPERTY | DESCRIPTION | REQUIRED
path | A relative URL to the resource that contains the table. | No. When path is not specified, only the URL specified in the linked service definition is used.
Example
{
"name": "WebTableInput",
"properties": {
"type": "WebTable",
"linkedServiceName": "WebLinkedService",
"typeProperties": {
"index": 1,
"path": "AFI's_100_Years...100_Movies"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
COMPUTE ENVIRONMENTS
The following table lists the compute environments supported by Data Factory and the transformation activities
that can run on them. Click the link for the compute you are interested in to see the JSON schemas for linked
service to link it to a data factory.
COMPUTE ENVIRONMENT | ACTIVITIES
On-demand HDInsight cluster or your own HDInsight cluster | .NET custom activity, Hive activity, Pig activity, MapReduce activity, Hadoop streaming activity, Spark activity
Azure Machine Learning | Machine Learning Batch Execution Activity, Machine Learning Update Resource Activity
Azure Data Lake Analytics | Data Lake Analytics U-SQL activity
Azure SQL Database, Azure SQL Data Warehouse, SQL Server | Stored Procedure activity
On-demand Azure HDInsight cluster
The Azure Data Factory service can automatically create a Windows/Linux-based on-demand HDInsight cluster to process data. The cluster is created in the same region as the storage account (the linkedServiceName property in the JSON) associated with the cluster. You can run the following transformation activities on this linked service: .NET custom activity, Hive activity, Pig activity, MapReduce activity, Hadoop streaming activity, and Spark activity.
Linked service
The following table provides descriptions for the properties used in the Azure JSON definition of an on-demand
HDInsight linked service.
JSON example
The following JSON defines a Linux-based on-demand HDInsight linked service. The Data Factory service
automatically creates a Linux-based HDInsight cluster when processing a data slice.
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "StorageLinkedService"
}
}
}
The following JSON defines a linked service for your own (existing) HDInsight cluster:
{
"name": "HDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": " https://<hdinsightclustername>.azurehdinsight.net/",
"userName": "admin",
"password": "<password>",
"linkedServiceName": "MyHDInsightStoragelinkedService"
}
}
}
Azure Batch
You can create an Azure Batch linked service to register a Batch pool of virtual machines (VMs) with a data factory. You can run .NET custom activities using either Azure Batch or Azure HDInsight; use this linked service to run a .NET custom activity on an Azure Batch pool.
Linked service
The following table provides descriptions for the properties used in the Azure JSON definition of an Azure Batch
linked service.
{
"name": "AzureBatchLinkedService",
"properties": {
"type": "AzureBatch",
"typeProperties": {
"accountName": "<Azure Batch account name>",
"accessKey": "<Azure Batch account key>",
"poolName": "<Azure Batch pool name>",
"linkedServiceName": "<Specify associated storage linked service reference here>"
}
}
}
Azure Machine Learning
JSON example
{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://[batch scoring endpoint]/jobs",
"apiKey": "<apikey>"
}
}
}
Azure Data Lake Analytics
PROPERTY | DESCRIPTION | REQUIRED
resourceGroupName | Azure resource group name | No (if not specified, the resource group of the data factory is used)
JSON example
The following example provides JSON definition for an Azure Data Lake Analytics linked service.
{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "<account name>",
"dataLakeAnalyticsUri": "datalakeanalyticscompute.net",
"authorization": "<authcode>",
"sessionId": "<session ID>",
"subscriptionId": "<subscription id>",
"resourceGroupName": "<resource group name>"
}
}
}
Azure SQL Database
JSON example
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
See the Azure SQL connector article for details about this linked service.
Azure SQL Data Warehouse
JSON example
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User
ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
For more information, see the Azure SQL Data Warehouse connector article.
SQL Server
You create a SQL Server linked service and use it with the Stored Procedure Activity to invoke a stored procedure
from a Data Factory pipeline.
Linked service
You create a linked service of type OnPremisesSqlServer to link an on-premises SQL Server database to a data
factory. The following table provides descriptions for the JSON elements that are specific to an on-premises SQL
Server linked service.
You can encrypt credentials by using the New-AzureRmDataFactoryEncryptValue cmdlet and use them in the
connection string (EncryptedCredential property).
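A minimal sketch of such a connection string, where <encrypted credential> stands for the value returned by the
cmdlet (the full linked service definition that follows uses clear-text credentials instead):

"connectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated Security=False;EncryptedCredential=<encrypted credential>",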
{
"name": "MyOnPremisesSQLDB",
"properties": {
"type": "OnPremisesSqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated
Security=False;User ID=<username>;Password=<password>;",
"gatewayName": "<gateway name>"
}
}
}
Transformation activities
The following table describes the transformation activities that Data Factory supports.

| ACTIVITY | DESCRIPTION |
| --- | --- |
| HDInsight Hive activity | The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand Windows/Linux-based HDInsight cluster. |
| HDInsight Pig activity | The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand Windows/Linux-based HDInsight cluster. |
| HDInsight MapReduce Activity | The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or on-demand Windows/Linux-based HDInsight cluster. |
| HDInsight Streaming Activity | The HDInsight Streaming Activity in a Data Factory pipeline executes Hadoop Streaming programs on your own or on-demand Windows/Linux-based HDInsight cluster. |
| HDInsight Spark Activity | The HDInsight Spark activity in a Data Factory pipeline executes Spark programs on your own HDInsight cluster. |
| Machine Learning Batch Execution Activity | Azure Data Factory enables you to easily create pipelines that use a published Azure Machine Learning web service for predictive analytics. Using the Batch Execution Activity in an Azure Data Factory pipeline, you can invoke a Machine Learning web service to make predictions on the data in batch. |
| Machine Learning Update Resource Activity | Over time, the predictive models in the Machine Learning scoring experiments need to be retrained using new input datasets. After you are done with retraining, you want to update the scoring web service with the retrained Machine Learning model. You can use the Update Resource Activity to update the web service with the newly trained model. |
| Stored Procedure Activity | You can use the Stored Procedure activity in a Data Factory pipeline to invoke a stored procedure in one of the following data stores: Azure SQL Database, Azure SQL Data Warehouse, or a SQL Server database in your enterprise or on an Azure VM. |
| Data Lake Analytics U-SQL activity | The Data Lake Analytics U-SQL activity runs a U-SQL script on an Azure Data Lake Analytics cluster. |
| .NET custom activity | If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity with your own data processing logic and use the activity in the pipeline. You can configure the custom .NET activity to run using either an Azure Batch service or an Azure HDInsight cluster. |
These type properties are specific to the Hive Activity. Other properties (outside the typeProperties section) are
supported for all activities.
JSON example
The following JSON defines a HDInsight Hive activity in a pipeline.
{
"name": "Hive Activity",
"description": "description",
"type": "HDInsightHive",
"inputs": [
{
"name": "input tables"
}
],
"outputs": [
{
"name": "output tables"
}
],
"linkedServiceName": "MyHDInsightLinkedService",
"typeProperties": {
"script": "Hive script",
"scriptPath": "<pathtotheHivescriptfileinAzureblobstorage>",
"defines": {
"param1": "param1Value"
}
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
These type properties are specific to the Pig Activity. Other properties (outside the typeProperties section) are
supported for all activities.
JSON example
{
"name": "HiveActivitySamplePipeline",
"properties": {
"activities": [
{
"name": "Pig Activity",
"description": "description",
"type": "HDInsightPig",
"inputs": [
{
"name": "input tables"
}
],
"outputs": [
{
"name": "output tables"
}
],
"linkedServiceName": "MyHDInsightLinkedService",
"typeProperties": {
"script": "Pig script",
"scriptPath": "<pathtothePigscriptfileinAzureblobstorage>",
"defines": {
"param1": "param1Value"
}
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
]
}
}
JSON example
{
"name": "MahoutMapReduceSamplePipeline",
"properties": {
"description": "Sample Pipeline to Run a Mahout Custom Map Reduce Jar. This job calculates an Item
Similarity Matrix to determine the similarity between two items",
"activities": [
{
"type": "HDInsightMapReduce",
"typeProperties": {
"className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
"jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar",
"jarLinkedService": "StorageLinkedService",
"arguments": ["-s", "SIMILARITY_LOGLIKELIHOOD", "--input",
"wasb://[email protected]/Mahout/input", "--output",
"wasb://[email protected]/Mahout/output/", "--maxSimilaritiesPerItem", "500", "--
tempDir", "wasb://[email protected]/Mahout/temp/mahout"]
},
"inputs": [
{
"name": "MahoutInput"
}
],
"outputs": [
{
"name": "MahoutOutput"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "MahoutActivity",
"description": "Custom Map Reduce to generate Mahout result",
"linkedServiceName": "HDInsightLinkedService"
}
],
"start": "2017-01-03T00:00:00",
"end": "2017-01-04T00:00:00"
}
}
| PROPERTY | DESCRIPTION |
| --- | --- |
| input | Input file (including location) for the mapper. In the example ("wasb://adfsample@<storageaccountname>.blob.core.windows.net/example/data/gutenberg/davinci.txt"): adfsample is the blob container, example/data/gutenberg is the folder, and davinci.txt is the blob. |
| output | Output file (including location) for the reducer. The output of the Hadoop Streaming job is written to the location specified for this property. |
| filePaths | Paths for the mapper and reducer executables. In the example ("adfsample/example/apps/wc.exe"): adfsample is the blob container, example/apps is the folder, and wc.exe is the executable. |
| fileLinkedService | Azure Storage linked service that represents the Azure storage that contains the files specified in the filePaths section. |
NOTE
You must specify an output dataset for the Hadoop Streaming Activity for the outputs property. This dataset can be just a
dummy dataset that is required to drive the pipeline schedule (hourly, daily, etc.). If the activity doesn't take an input, you can
skip specifying an input dataset for the activity for the inputs property.
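For example, a minimal sketch of such a dummy output dataset, using the name StreamingOutputDataset from the
pipeline below and the StorageLinkedService Azure Storage linked service it references (the folder path is only
illustrative):

{
  "name": "StreamingOutputDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "<containername>/streamingoutput",
      "format": {
        "type": "TextFormat"
      }
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}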
JSON example
{
"name": "HadoopStreamingPipeline",
"properties": {
"description": "Hadoop Streaming Demo",
"activities": [
{
"type": "HDInsightStreaming",
"typeProperties": {
"mapper": "cat.exe",
"reducer": "wc.exe",
"input":
"wasb://<nameofthecluster>@spestore.blob.core.windows.net/example/data/gutenberg/davinci.txt",
"output":
"wasb://<nameofthecluster>@spestore.blob.core.windows.net/example/data/StreamingOutput/wc.txt",
"filePaths": ["<nameofthecluster>/example/apps/wc.exe","
<nameofthecluster>/example/apps/cat.exe"],
"fileLinkedService": "StorageLinkedService",
"getDebugInfo": "Failure"
},
"outputs": [
{
"name": "StreamingOutputDataset"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "RunHadoopStreamingJob",
"description": "Run a Hadoop streaming job",
"linkedServiceName": "HDInsightLinkedService"
}
],
"start": "2014-01-04T00:00:00",
"end": "2014-01-05T00:00:00"
}
}
JSON example
{
"name": "SparkPipeline",
"properties": {
"activities": [
{
"type": "HDInsightSpark",
"typeProperties": {
"rootPath": "adfspark\\pyFiles",
"entryFilePath": "test.py",
"getDebugInfo": "Always"
},
"outputs": [
{
"name": "OutputDataset"
}
],
"name": "MySparkActivity",
"linkedServiceName": "HDInsightLinkedService"
}
],
"start": "2017-02-05T00:00:00",
"end": "2017-02-06T00:00:00"
}
}
IMPORTANT
We recommend that you do not set this property to Always in a production environment unless you are
troubleshooting an issue.
The outputs section has one output dataset. You must specify an output dataset even if the Spark program does
not produce any output. The output dataset drives the schedule for the pipeline (hourly, daily, etc.).
For more information about the activity, see Spark Activity article.
JSON example
In this example, the activity has the dataset MLSqlInput as input and MLSqlOutput as the output. The
MLSqlInput is passed as an input to the web service by using the webServiceInput JSON property. The
MLSqlOutput is passed as an output to the Web service by using the webServiceOutputs JSON property.
{
"name": "MLWithSqlReaderSqlWriter",
"properties": {
"description": "Azure ML model with sql azure reader/writer",
"activities": [{
"name": "MLSqlReaderSqlWriterActivity",
"type": "AzureMLBatchExecution",
"description": "test",
"inputs": [ { "name": "MLSqlInput" }],
"outputs": [ { "name": "MLSqlOutput" } ],
"linkedServiceName": "MLSqlReaderSqlWriterDecisionTreeModel",
"typeProperties":
{
"webServiceInput": "MLSqlInput",
"webServiceOutputs": {
"output1": "MLSqlOutput"
},
"globalParameters": {
"Database server name": "<myserver>.database.windows.net",
"Database name": "<database>",
"Server user account name": "<user name>",
"Server user account password": "<password>"
}
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "02:00:00"
}
}],
"start": "2016-02-13T00:00:00",
"end": "2016-02-14T00:00:00"
}
}
In the JSON example, the deployed Azure Machine Learning Web service uses a reader and a writer module to
read/write data from/to an Azure SQL Database. This Web service exposes the following four parameters: Database
server name, Database name, Server user account name, and Server user account password.
NOTE
Only inputs and outputs of the AzureMLBatchExecution activity can be passed as parameters to the Web service. For
example, in the above JSON snippet, MLSqlInput is an input to the AzureMLBatchExecution activity, which is passed as an
input to the Web service via webServiceInput parameter.
The following example defines a pipeline with an AzureMLBatchExecution activity that retrains a model and an
AzureMLUpdateResource activity that updates the scoring web service with the newly trained model.
{
"name": "pipeline",
"properties": {
"activities": [
{
"name": "retraining",
"type": "AzureMLBatchExecution",
"inputs": [
{
"name": "trainingData"
}
],
"outputs": [
{
"name": "trainedModelBlob"
}
],
"typeProperties": {
"webServiceInput": "trainingData",
"webServiceOutputs": {
"output1": "trainedModelBlob"
}
},
"linkedServiceName": "trainingEndpoint",
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "02:00:00"
}
},
{
"type": "AzureMLUpdateResource",
"typeProperties": {
"trainedModelName": "trained model",
"trainedModelDatasetName" : "trainedModelBlob"
},
"inputs": [{ "name": "trainedModelBlob" }],
"outputs": [{ "name": "placeholderBlob" }],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"name": "AzureML Update Resource",
"linkedServiceName": "updatableScoringEndpoint2"
}
],
"start": "2016-02-13T00:00:00",
"end": "2016-02-14T00:00:00"
}
}
| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| scriptPath | Path to folder that contains the U-SQL script. Name of the file is case-sensitive. | No (if you use script) |
| scriptLinkedService | Linked service that links the storage that contains the script to the data factory. | No (if you use script) |
| script | Specify inline script instead of specifying scriptPath and scriptLinkedService. For example: "script": "CREATE DATABASE test". | No (if you use scriptPath and scriptLinkedService) |
JSON example
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This pipeline computes events for en-gb locale and date less than Feb 19, 2012.",
"activities":
[
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "scripts\\kona\\SearchLogProcessing.txt",
"scriptLinkedService": "StorageLinkedService",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/datalake/input/SearchLog.tsv",
"out": "/datalake/output/Result.tsv"
}
},
"inputs": [
{
"name": "DataLakeTable"
}
],
"outputs":
[
{
"name": "EventsByRegionTable"
}
],
"policy": {
"timeout": "06:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "EventsByRegion",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2015-08-08T00:00:00",
"end": "2015-08-08T01:00:00",
"isPaused": false
}
}
If you do specify an input dataset, it must be available (in Ready status) for the stored procedure activity to run.
The input dataset cannot be consumed in the stored procedure as a parameter. It is only used to check the
dependency before starting the stored procedure activity. You must specify an output dataset for a stored
procedure activity.
The output dataset specifies the schedule for the stored procedure activity (hourly, weekly, monthly, etc.). The output
dataset must use a linked service that refers to an Azure SQL Database or an Azure SQL Data Warehouse or a SQL
Server Database in which you want the stored procedure to run. The output dataset can serve as a way to pass the
result of the stored procedure for subsequent processing by another activity (chaining activities) in the pipeline.
However, Data Factory does not automatically write the output of a stored procedure to this dataset. It is the stored
procedure that writes to a SQL table that the output dataset points to. In some cases, the output dataset can be a
dummy dataset, which is used only to specify the schedule for running the stored procedure activity.
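For example, a minimal sketch of the output dataset used by the pipeline below (the name sprocsampleout matches
the pipeline's outputs; the table name sampleout and the use of the AzureSqlLinkedService defined earlier are
assumptions, since the dataset mainly drives the hourly schedule):

{
  "name": "sprocsampleout",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "sampleout"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}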
JSON example
{
"name": "SprocActivitySamplePipeline",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "sp_sample",
"storedProcedureParameters": {
"DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
}
},
"outputs": [{ "name": "sprocsampleout" }],
"name": "SprocActivitySample"
}
],
"start": "2016-08-02T00:00:00",
"end": "2016-08-02T05:00:00",
"isPaused": false
}
}
JSON example
{
"name": "ADFTutorialPipelineCustom",
"properties": {
"description": "Use custom activity",
"activities": [
{
"Name": "MyDotNetActivity",
"Type": "DotNetActivity",
"Inputs": [
{
"Name": "InputDataset"
}
],
"Outputs": [
{
"Name": "OutputDataset"
}
],
"LinkedServiceName": "AzureBatchLinkedService",
"typeProperties": {
"AssemblyName": "MyDotNetActivity.dll",
"EntryPoint": "MyDotNetActivityNS.MyDotNetActivity",
"PackageLinkedService": "AzureStorageLinkedService",
"PackageFile": "customactivitycontainer/MyDotNetActivity.zip",
"extendedProperties": {
"SliceStart": "$$Text.Format('{0:yyyyMMddHH-mm}', Time.AddMinutes(SliceStart, 0))"
}
},
"Policy": {
"Concurrency": 2,
"ExecutionPriorityOrder": "OldestFirst",
"Retry": 3,
"Timeout": "00:30:00",
"Delay": "00:00:00"
}
}
],
"start": "2016-11-16T00:00:00",
"end": "2016-11-16T05:00:00",
"isPaused": false
}
}
For detailed information, see the Use custom activities in Data Factory article.
Next Steps
See the following tutorials:
Tutorial: create a pipeline with a copy activity
Tutorial: create a pipeline with a Hive activity
Azure Data Factory - Customer case studies
8/15/2017 1 min to read Edit Online
Data Factory is a cloud-based information management service that automates the movement and transformation
of data. Customers across many industries use Data Factory and other Azure services to build their analytics
pipelines and solve their business problems. Learn directly from our customers how and why they are using Data
Factory.
Milliman
Top Actuarial firm transforms the insurance industry
Rockwell Automation
Industrial Automation Firm Cuts Costs up to 90 Percent with big data Solutions
Ziosk
What game do you want to go with that burger? Ziosk may already know.
Alaska Airlines
Airline Uses Tablets, Cloud Services to Offer More Engaging In-Flight Entertainment
Real Madrid FC
Real Madrid brings the stadium closer to 450 million fans around the globe, with the Microsoft Cloud
Pier 1 Imports
Finding a Better Connection with Customers through Cloud Machine Learning
Microsoft Studio
Delivering epic Xbox experiences by analyzing hundreds of billions of game events each day
Release notes for Data Management Gateway
7/10/2017 6 min to read Edit Online
One of the challenges of modern data integration is moving data between on-premises environments and the cloud.
Data Factory enables this integration with Data Management Gateway, an agent that you can install on-premises to
enable hybrid data movement.
See the following articles for detailed information about Data Management Gateway and how to use it:
Data Management Gateway
Move data between on-premises and cloud using Azure Data Factory
Earlier versions
2.9.6313.2
Enhancements
You can add DNS entries to whitelist Service Bus rather than whitelisting all Azure IP addresses from your
firewall (if needed). More details here.
You can now copy data to/from a single block blob up to 4.75 TB, which is the maximum supported size of a block
blob (the earlier limit was 195 GB).
Fixed: Out of memory issue while unzipping several small files during copy activity.
Fixed: Index out of range issue while copying from Document DB to an on-premises SQL Server with
idempotency feature.
Fixed: SQL cleanup script doesn't work with on-premises SQL Server from Copy Wizard.
Fixed: Column name with space at the end does not work in copy activity.
2.8.66283.3
Enhancements
Fixed: Issue with missing credentials on gateway machine reboot.
Fixed: Issue with registration during gateway restore using a backup file.
2.7.6240.1
Enhancements
Fixed: Incorrect read of Decimal null value from Oracle as source.
2.6.6192.2
What's new
Customers can provide feedback on the gateway registration experience.
Support a new compression format: ZIP (Deflate)
Enhancements
Performance improvement for Oracle Sink, HDFS source.
Bug fix for gateway auto update, gateway parallel processing capacity.
2.5.6164.1
Enhancements
Improved and more robust gateway registration experience: you can now track progress status during the
gateway registration process, which makes the registration experience more responsive.
Improvement in the gateway restore process: with this update, you can recover a gateway even if you do not have
the gateway backup file. This requires you to reset linked service credentials in the portal.
Bug fix.
2.4.6151.1
What's new
You can now store data source credentials locally. The credentials are encrypted. They can be recovered and
restored on-premises by using the backup file that you can export from the existing gateway.
Enhancements
Improved and more robust Gateway registration experience.
Support auto detection of QuoteChar configuration for Text format in copy wizard, and improve the overall
format detection accuracy.
2.3.6100.2
Support firstRowAsHeader and SkipLineCount auto detection in copy wizard for text files in on-premises File
system and HDFS.
Enhance the stability of network connection between gateway and Service Bus
A few bug fixes
2.2.6072.1
Supports setting HTTP proxy for the gateway using the Gateway Configuration Manager. If configured, Azure
Blob, Azure Table, Azure Data Lake, and Document DB are accessed through HTTP proxy.
Supports header handling for TextFormat when copying data from/to Azure Blob, Azure Data Lake Store, on-
premises File System, and on-premises HDFS.
Supports copying data from Append Blob and Page Blob along with the already supported Block Blob.
Introduces a new gateway status Online (Limited), which indicates that the main functionality of the gateway
works except the interactive operation support for Copy Wizard.
Enhances the robustness of gateway registration using registration key.
2.1.6040.
DB2 driver is included in the gateway installation package now. You do not need to install it separately.
DB2 driver now supports z/OS and DB2 for i (AS/400) along with the platforms already supported (Linux, Unix,
and Windows).
Supports using Azure Cosmos DB as a source or destination for on-premises data stores
Supports copying data from/to cold/hot blob storage along with the already supported general-purpose
storage account.
Allows you to connect to on-premises SQL Server via gateway with remote login privileges.
2.0.6013.1
You can select the language/culture to be used by a gateway during manual installation.
When gateway does not work as expected, you can choose to send gateway logs of last seven days to
Microsoft to facilitate troubleshooting of the issue. If gateway is not connected to the cloud service, you can
choose to save and archive gateway logs.
User interface improvements for gateway configuration manager:
Make gateway status more visible on the Home tab.
Reorganized and simplified controls.
You can copy data from a storage account by using the code-free copy preview tool. See Staged Copy for details
about this feature in general.
You can use Data Management Gateway to ingress data directly from an on-premises SQL Server database
into Azure Machine Learning.
Performance improvements
Improve performance on viewing Schema/Preview against SQL Server in code-free copy preview tool.
1.12.5953.1
Bug fixes
1.11.5918.1
Maximum size of the gateway event log has been increased from 1 MB to 40 MB.
A warning dialog is displayed in case a restart is needed during gateway auto-update. You can choose to
restart right then or later.
In case auto-update fails, gateway installer retries auto-updating three times at maximum.
Performance improvements
Improve performance for loading large tables from on-premises server in code-free copy scenario.
Bug fixes
1.10.5892.1
Performance improvements
Bug fixes
1.9.5865.2
Zero touch auto update capability
New tray icon with gateway status indicators
Ability to Update now from the client
Ability to set update schedule time
PowerShell script for toggling auto-update on/off
Support for JSON format
Performance improvements
Bug fixes
1.8.5822.1
Improve troubleshooting experience
Performance improvements
Bug fixes
1.7.5795.1
Performance improvements
Bug fixes
1.7.5764.1
Performance improvements
Bug fixes
1.6.5735.1
Support on-premises HDFS Source/Sink
Performance improvements
Bug fixes
1.6.5696.1
Performance improvements
Bug fixes
1.6.5676.1
Support diagnostic tools on Configuration Manager
Support table columns for tabular data sources for Azure Data Factory
Support SQL DW for Azure Data Factory
Support Recursive in BlobSource and FileSource for Azure Data Factory
Support CopyBehavior MergeFiles, PreserveHierarchy, and FlattenHierarchy in BlobSink and FileSink with
Binary Copy for Azure Data Factory
Support Copy Activity reporting progress for Azure Data Factory
Support Data Source Connectivity Validation for Azure Data Factory
Bug fixes
1.6.5672.1
Support table name for ODBC data source for Azure Data Factory
Performance improvements
Bug fixes
1.6.5658.1
Support File Sink for Azure Data Factory
Support preserving hierarchy in binary copy for Azure Data Factory
Support Copy Activity Idempotency for Azure Data Factory
Bug fixes
1.6.5640.1
Support 3 more data sources for Azure Data Factory (ODBC, OData, HDFS)
Support quote character in csv parser for Azure Data Factory
Compression support (BZip2)
Bug fixes
1.5.5612.1
Support five relational databases for Azure Data Factory (MySQL, PostgreSQL, DB2, Teradata, and Sybase)
Compression support (Gzip and Deflate)
Performance improvements
Bug fixes
1.4.5549.1
Add Oracle data source support for Azure Data Factory
Performance improvements
Bug fixes
1.4.5492.1
Unified binary that supports both Microsoft Azure Data Factory and Office 365 Power BI services
Refine the Configuration UI and registration process
Azure Data Factory Azure Ingress and Egress support for SQL Server data source
1.2.5303.1
Fix timeout issue to support more time-consuming data source connections.
1.1.5526.8
Requires .NET Framework 4.5.1 as a prerequisite during setup.
1.0.5144.2
No changes that affect Azure Data Factory scenarios.
Use Case - Customer Profiling
8/21/2017 3 min to read Edit Online
Azure Data Factory is one of many services used to implement the Cortana Intelligence Suite of solution
accelerators. For more information about Cortana Intelligence, visit Cortana Intelligence Suite. In this document, we
describe a simple use case to help you get started with understanding how Azure Data Factory can solve common
analytics problems.
Scenario
Contoso is a gaming company that creates games for multiple platforms: game consoles, handheld devices, and
personal computers (PCs). As players play these games, large volumes of log data are produced that track the usage
patterns, gaming style, and preferences of each user. When combined with demographic, regional, and product data,
Contoso can perform analytics that guide it in enhancing the players' experience and targeting them for
upgrades and in-game purchases.
Contoso's goal is to identify up-sell/cross-sell opportunities based on the gaming history of its players and add
compelling features to drive business growth and provide a better experience to customers. For this use case, we
use a gaming company as an example of a business. The company wants to optimize its games based on players'
behavior. These principles apply to any business that wants to engage its customers around its goods and services
and enhance its customers' experience.
In this solution, Contoso wants to evaluate the effectiveness of a marketing campaign it has recently launched. We
start with the raw gaming logs, process and enrich them with geolocation data, join them with advertising reference
data, and finally copy them into an Azure SQL Database to analyze the campaign's impact.
Deploy Solution
All you need to access and try out this simple use case is an Azure subscription, an Azure Blob storage account, and
an Azure SQL Database. You deploy the customer profiling pipeline from the Sample pipelines tile on the home
page of your data factory.
1. Create a data factory or open an existing data factory. See Copy data from Blob Storage to SQL Database using
Data Factory for steps to create a data factory.
2. In the DATA FACTORY blade for the data factory, click the Sample pipelines tile.
3. In the Sample pipelines blade, click the Customer Profiling sample that you want to deploy.
4. Specify configuration settings for the sample. For example, your Azure storage account name and key, Azure
SQL server name, database, User ID, and password.
5. After you are done with specifying the configuration settings, click Create to create/deploy the sample pipelines
and linked services/tables used by the pipelines.
6. You see the status of deployment on the sample tile you clicked earlier on the Sample pipelines blade.
7. When you see the Deployment succeeded message on the tile for the sample, close the Sample pipelines
blade.
8. On DATA FACTORY blade, you see that linked services, data sets, and pipelines are added to your data
factory.
Solution Overview
This simple use case can be used as an example of how you can use Azure Data Factory to ingest, prepare,
transform, analyze, and publish data.
This figure depicts how the data pipelines appear in the Azure portal after they have been deployed.
1. The PartitionGameLogsPipeline reads the raw game events from blob storage and creates partitions based
on year, month, and day.
2. The EnrichGameLogsPipeline joins partitioned game events with geo code reference data and enriches the
data by mapping IP addresses to the corresponding geo-locations.
3. The AnalyzeMarketingCampaignPipeline pipeline uses the enriched data and processes it with the
advertising data to create the final output that contains marketing campaign effectiveness.
In this example, Data Factory is used to orchestrate activities that copy input data, transform, and process the data,
and output the final data to an Azure SQL Database. You can also visualize the network of data pipelines, manage
them, and monitor their status from the UI.
Benefits
By optimizing its user profile analytics and aligning them with business goals, the gaming company is able to
quickly collect usage patterns and analyze the effectiveness of its marketing campaigns.
Process large-scale datasets using Data Factory and
Batch
8/21/2017 34 min to read Edit Online
This article describes an architecture of a sample solution that moves and processes large-scale datasets in an
automatic and scheduled manner. It also provides an end-to-end walkthrough to implement the solution using
Azure Data Factory and Azure Batch.
This article is longer than our typical article because it contains a walkthrough of an entire sample solution. If you
are new to Batch and Data Factory, you can learn about these services and how they work together. If you know
something about the services and are designing/architecting a solution, you may want to focus just on the
architecture section of the article. If you are developing a prototype or a solution, you may also want to try out the
step-by-step instructions in the walkthrough. We invite your comments about this content and how you use it.
First, let's look at how Data Factory and Batch services can help with processing large datasets in the cloud.
The following list provides the basic steps of the process. The solution includes code and explanations to build the
end-to-end solution.
1. Configure Azure Batch with a pool of compute nodes (VMs). You can specify the number of nodes and size
of each node.
2. Create an Azure Data Factory instance that is configured with entities that represent Azure blob storage,
Azure Batch compute service, input/output data, and a workflow/pipeline with activities that move and
transform data.
3. Create a custom .NET activity in the Data Factory pipeline. The activity is your user code that runs on the
Azure Batch pool.
4. Store large amounts of input data as blobs in Azure storage. Data is divided into logical slices (usually by
time).
5. Data Factory copies data that is processed in parallel to the secondary location.
6. Data Factory runs the custom activity using the pool allocated by Batch. Data Factory can run activities
concurrently. Each activity processes a slice of data. The results are stored in Azure storage.
7. Data Factory moves the final results to a third location, either for distribution via an app, or for further
processing by other tools.
If you are using Azure Storage Explorer, upload the file file.txt to mycontainer. Click Copy on the toolbar
to create a copy of the blob. In the Copy Blob dialog box, change the destination blob name to
inputfolder/2015-11-16-00/file.txt . Repeat this step to create inputfolder/2015-11-16-01/file.txt ,
inputfolder/2015-11-16-02/file.txt , inputfolder/2015-11-16-03/file.txt ,
inputfolder/2015-11-16-04/file.txt and so on. This action automatically creates the folders.
5. Create another container named: customactivitycontainer . You upload the custom activity zip file to this
container.
Visual Studio
Install Microsoft Visual Studio 2012 or later to create the custom Batch activity to be used in the Data Factory
solution.
High-level steps to create the solution
1. Create a custom activity that contains the data processing logic.
2. Create an Azure data factory that uses the custom activity: create the data factory, create the linked services,
create the input and output datasets, and create and run a pipeline that uses the custom activity (Steps 1 through 4
of the walkthrough later in this article).
Create the custom activity
The Data Factory custom activity is the heart of this sample solution. The sample solution uses Azure Batch to run
the custom activity. See Use custom activities in an Azure Data Factory pipeline for the basic information to develop
custom activities and use them in Azure Data Factory pipelines.
To create a .NET custom activity that you can use in an Azure Data Factory pipeline, you need to create a .NET Class
Library project with a class that implements the IDotNetActivity interface. This interface has only one method,
Execute; its signature appears in the sample code later in this procedure.
The method has a few key components that you need to understand.
The method takes four parameters:
1. linkedServices. An enumerable list of linked services that link input/output data sources (for example:
Azure Blob Storage) to the data factory. In this sample, there is only one linked service of type Azure
Storage used for both input and output.
2. datasets. This is an enumerable list of datasets. You can use this parameter to get the locations and
schemas defined by input and output datasets.
3. activity. This parameter represents the current compute entity - in this case, an Azure Batch service.
4. logger. The logger lets you write debug comments that surface as the User log for the pipeline.
The method returns a dictionary that can be used to chain custom activities together in the future. This feature is
not implemented yet, so return an empty dictionary from the method.
Procedure: Create the custom activity
1. Create a .NET Class Library project in Visual Studio.
a. Launch Visual Studio 2012/2013/2015.
b. Click File, point to New, and click Project.
c. Expand Templates, and select Visual C#. In this walkthrough, you use C#, but you can use any .NET
language to develop the custom activity.
d. Select Class Library from the list of project types on the right.
e. Enter MyDotNetActivity for the Name.
f. Select C:\ADF for the Location. Create the folder ADF if it does not exist.
g. Click OK to create the project.
2. Click Tools, point to NuGet Package Manager, and click Package Manager Console.
3. In the Package Manager Console, execute the following command to import
Microsoft.Azure.Management.DataFactories.
Install-Package Microsoft.Azure.Management.DataFactories
4. Import the Azure Storage NuGet package in to the project. You need this package because you use the Blob
storage API in this sample.
Install-Package WindowsAzure.Storage
5. Add the following using directives to the source file in the project.
using System.IO;
using System.Text;
using System.Globalization;
using System.Diagnostics;
using System.Linq;
using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
6. Change the namespace to MyDotNetActivityNS as shown in the following line.
namespace MyDotNetActivityNS
7. Change the name of the class to MyDotNetActivity and derive it from the IDotNetActivity interface as
shown below.
public class MyDotNetActivity : IDotNetActivity
8. Implement (Add) the Execute method of the IDotNetActivity interface to the MyDotNetActivity class
and copy the following sample code to the method. See the Execute Method section for explanation for the
logic used in this method.
/// <summary>
/// Execute method is the only method of IDotNetActivity interface you must implement.
/// In this sample, the method invokes the Calculate method to perform the core logic.
/// </summary>
public IDictionary<string, string> Execute(
IEnumerable<LinkedService> linkedServices,
IEnumerable<Dataset> datasets,
Activity activity,
IActivityLogger logger)
{
    // get the input dataset using the name of the dataset matched to a name in the Activity input collection.
    Dataset inputDataset = datasets.Single(dataset => dataset.Name == activity.Inputs.Single().Name);

    // using First method instead of Single since we are using the same
    // Azure Storage linked service for input and output.
    AzureStorageLinkedService inputLinkedService = linkedServices.First(
        linkedService =>
            linkedService.Name ==
            inputDataset.Properties.LinkedServiceName).Properties.TypeProperties
        as AzureStorageLinkedService;

    // get the output dataset using the name of the dataset matched to a name in the Activity output collection.
    Dataset outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name);

    string folderPath = GetFolderPath(outputDataset);
// The dictionary can be used to chain custom activities together in the future.
// This feature is not implemented yet, so just return an empty dictionary.
return new Dictionary<string, string>();
}
9. Add the following helper methods to the class. These methods are invoked by the Execute method. Most
importantly, the Calculate method isolates the code that iterates through each blob.
/// <summary>
/// Gets the folderPath value from the input/output dataset.
/// </summary>
private static string GetFolderPath(Dataset dataArtifact)
{
if (dataArtifact == null || dataArtifact.Properties == null)
{
return null;
}
    // cast the dataset's type properties to AzureBlobDataset, which exposes FolderPath
    AzureBlobDataset blobDataset = dataArtifact.Properties.TypeProperties as AzureBlobDataset;
    return blobDataset.FolderPath;
}
/// <summary>
/// Gets the fileName value from the input/output dataset.
/// </summary>
private static string GetFileName(Dataset dataArtifact)
{
    if (dataArtifact == null || dataArtifact.Properties == null)
    {
        return null;
    }
    AzureBlobDataset blobDataset = dataArtifact.Properties.TypeProperties as AzureBlobDataset;
    return blobDataset.FileName;
}
/// <summary>
/// Iterates through each blob (file) in the folder, counts the number of instances of search term in
the file,
/// and prepares the output text that is written to the output blob.
/// </summary>
public static string Calculate(BlobResultSegment Bresult, IActivityLogger logger, string folderPath, ref
BlobContinuationToken token, string searchTerm)
{
string output = string.Empty;
logger.Write("number of blobs found: {0}", Bresult.Results.Count<IListBlobItem>());
foreach (IListBlobItem listBlobItem in Bresult.Results)
{
CloudBlockBlob inputBlob = listBlobItem as CloudBlockBlob;
if ((inputBlob != null) && (inputBlob.Name.IndexOf("$$$.$$$") == -1))
{
string blobText = inputBlob.DownloadText(Encoding.ASCII, null, null, null);
logger.Write("input blob text: {0}", blobText);
string[] source = blobText.Split(new char[] { '.', '?', '!', ' ', ';', ':', ',' },
StringSplitOptions.RemoveEmptyEntries);
var matchQuery = from word in source
where word.ToLowerInvariant() == searchTerm.ToLowerInvariant()
select word;
int wordCount = matchQuery.Count();
output += string.Format("{0} occurrences(s) of the search term \"{1}\" were found in the file
{2}.\r\n", wordCount, searchTerm, inputBlob.Name);
}
}
return output;
}
The GetFolderPath method returns the path to the folder that the dataset points to and the GetFileName
method returns the name of the blob/file that the dataset points to.
"name": "InputDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"fileName": "file.txt",
"folderPath": "mycontainer/inputfolder/{Year}-{Month}-{Day}-{Hour}",
The Calculate method calculates the number of instances of keyword Microsoft in the input files (blobs in
the folder). The search term (Microsoft) is hard-coded in the code.
10. Compile the project. Click Build from the menu and click Build Solution.
11. Launch Windows Explorer, and navigate to bin\debug or bin\release folder depending on the type of build.
12. Create a zip file MyDotNetActivity.zip that contains all the binaries in the \bin\Debug folder. You may
want to include the MyDotNetActivity.pdb file so that you get additional details such as line number in the
source code that caused the issue when a failure occurs.
13. Upload MyDotNetActivity.zip as a blob to the blob container: customactivitycontainer in the Azure blob
storage that the StorageLinkedService linked service in the ADFTutorialDataFactory uses. Create the blob
container customactivitycontainer if it does not already exist.
Execute method
This section provides more details and notes about the code in the Execute method.
1. The members for iterating through the input collection are found in the
Microsoft.WindowsAzure.Storage.Blob namespace. Iterating through the blob collection requires using the
BlobContinuationToken class. In essence, you must use a do-while loop with the token as the mechanism
for exiting the loop. For more information, see How to use Blob storage from .NET. A basic loop is shown
here:
BlobContinuationToken continuationToken = null;
do
{
    // inputClient is a CloudBlobClient for the input storage account; the prefix is the input folder path
    BlobResultSegment result = inputClient.ListBlobsSegmented(folderPath,
        true, BlobListingDetails.Metadata, null, continuationToken, null, null);
    // process result.Results (for example, with the Calculate method), then advance the token
    continuationToken = result.ContinuationToken;
} while (continuationToken != null);
// Return a string derived from parsing each blob.
output += string.Format("{0} occurrences of the search term \"{1}\" were found in the file {2}.\r\n",
wordCount, searchTerm, inputBlob.Name);
3. Once the Calculate method has done the work, it must be written to a new blob. So for every set of blobs
processed, a new blob can be written with the results. To write to a new blob, first find the output dataset.
// Get the output dataset using the name of the dataset matched to a name in the Activity output
collection.
Dataset outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name);
4. The code also calls a helper method: GetFolderPath to retrieve the folder path (the storage container name).
folderPath = GetFolderPath(outputDataset);
The GetFolderPath casts the DataSet object to an AzureBlobDataSet, which has a property named
FolderPath.
return blobDataset.FolderPath;
5. The code calls the GetFileName method to retrieve the file name (blob name). The code is similar to the
above code to get the folder path.
return blobDataset.FileName;
6. The output blob URI is constructed by creating a URI object: the storage account's BlobEndpoint property provides
the base address, and the folder path (which includes the container name) and the file name are appended to it to
form the output blob URI.
7. The name of the file has been written and now you can write the output string from the Calculate method
to a new blob:
For example, drop one file (file.txt) into each of the folders, with content in which the search term "Microsoft"
appears twice (for example: test custom activity Microsoft test custom activity Microsoft).
Each input folder corresponds to a slice in Azure Data Factory even if the folder has 2 or more files. When each slice
is processed by the pipeline, the custom activity iterates through all the blobs in the input folder for that slice.
You see five output files with the same content. For example, the output file from processing the file in the 2015-
11-16-00 folder has the following content:
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-00/file.txt.
If you drop multiple files (file.txt, file2.txt, file3.txt) with the same content to the input folder, you see the following
content in the output file. Each folder (2015-11-16-00, etc.) corresponds to a slice in this sample even though the
folder has multiple input files.
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-00/file.txt.
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-00/file2.txt.
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-00/file3.txt.
The output file has three lines now, one for each input file (blob) in the folder associated with the slice (2015-11-
16-00).
A task is created for each activity run. In this sample, there is only one activity in the pipeline. When a slice is
processed by the pipeline, the custom activity runs on Azure Batch to process the slice. Since there are five slices
(each slice can have multiple blobs or files), there are five tasks created in Azure Batch. When a task runs on Batch, it
is actually the custom activity that is running.
The following walkthrough provides additional details.
Step 1: Create the data factory
1. After logging in to the Azure portal, do the following steps:
a. Click NEW on the left menu.
b. Click Data + Analytics in the New blade.
c. Click Data Factory on the Data analytics blade.
2. In the New data factory blade, enter CustomActivityFactory for the Name. The name of the Azure data
factory must be globally unique. If you receive the error: Data factory name CustomActivityFactory is not
available, change the name of the data factory (for example, yournameCustomActivityFactory) and try
creating again.
3. Click RESOURCE GROUP NAME, and select an existing resource group or create a resource group.
4. Verify that you are using the correct subscription and region where you want the data factory to be created.
5. Click Create on the New data factory blade.
6. You see the data factory being created in the Dashboard of the Azure portal.
7. After the data factory has been created successfully, you see the data factory page, which shows you the
contents of the data factory.
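A minimal sketch of the Azure Storage linked service that the next steps deploy (it uses the name
AzureStorageLinkedService that the datasets and pipeline later refer to; the account name and key are placeholders
that you replace as described in the next step):

{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}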
3. Replace account name with the name of your Azure storage account and account key with the access key
of the Azure storage account. To learn how to get your storage access key, see View, copy and regenerate
storage access keys.
4. Click Deploy on the command bar to deploy the linked service.
IMPORTANT
The URL from the Azure Batch account blade is in the following format: <accountname>.
<region>.batch.azure.com. For the batchUri property in the JSON, you need to remove "accountname."
from the URL. Example: "batchUri": "https://eastus.batch.azure.com" .
For the poolName property, you can also specify the ID of the pool instead of the name of the pool.
NOTE
The Data Factory service does not support an on-demand option for Azure Batch as it does for HDInsight.
You can only use your own Azure Batch pool in an Azure data factory.
e. Specify StorageLinkedService for the linkedServiceName property. You created this linked service in
the previous step. This storage is used as a staging area for files and logs.
3. Click Deploy on the command bar to deploy the linked service.
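Similarly, a minimal sketch of the Azure Batch linked service deployed in this step (the property names follow the
Azure Batch linked service example earlier in this article, the batchUri format follows the note above, and all values
are placeholders):

{
  "name": "AzureBatchLinkedService",
  "properties": {
    "type": "AzureBatch",
    "typeProperties": {
      "accountName": "<Azure Batch account name>",
      "accessKey": "<Azure Batch account key>",
      "poolName": "<Azure Batch pool name>",
      "batchUri": "https://<region>.batch.azure.com",
      "linkedServiceName": "<name of the Azure Storage linked service from the previous step>"
    }
  }
}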
Step 3: Create datasets
In this step, you create datasets to represent input and output data.
Create input dataset
1. In the Editor for the Data Factory, click New dataset button on the toolbar and click Azure Blob storage from
the drop-down menu.
2. Replace the JSON in the right pane with the following JSON snippet:
{
"name": "InputDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/inputfolder/{Year}-{Month}-{Day}-{Hour}",
"format": {
"type": "TextFormat"
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
}
You create a pipeline later in this walkthrough with start time: 2015-11-16T00:00:00Z and end time: 2015-
11-16T05:00:00Z. It is scheduled to produce data hourly, so there are 5 input/output slices (between
00:00:00 -> 05:00:00).
The frequency and interval for the input dataset is set to Hour and 1, which means that the input slice is
available hourly.
Here are the start times for each slice, which is represented by SliceStart system variable in the above JSON
snippet.
1 2015-11-16T00:00:00
2 2015-11-16T01:00:00
3 2015-11-16T02:00:00
4 2015-11-16T03:00:00
5 2015-11-16T04:00:00
The folderPath is calculated by using the year, month, day, and hour part of the slice start time (SliceStart).
Therefore, here is how an input folder is mapped to a slice.
1 2015-11-16T00:00:00 2015-11-16-00
2 2015-11-16T01:00:00 2015-11-16-01
3 2015-11-16T02:00:00 2015-11-16-02
4 2015-11-16T03:00:00 2015-11-16-03
5 2015-11-16T04:00:00 2015-11-16-04
3. Click Deploy on the toolbar to create and deploy the InputDataset table.
Create output dataset
In this step, you create another dataset of type AzureBlob to represent the output data.
1. In the Editor for the Data Factory, click New dataset button on the toolbar and click Azure Blob storage from
the drop-down menu.
2. Replace the JSON in the right pane with the following JSON snippet:
{
"name": "OutputDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "{slice}.txt",
"folderPath": "mycontainer/outputfolder",
"partitionedBy": [
{
"name": "slice",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy-MM-dd-HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
An output blob/file is generated for each input slice. Here is how an output file is named for each slice. All
the output files are generated in one output folder: mycontainer\\outputfolder .
1 2015-11-16T00:00:00 2015-11-16-00.txt
2 2015-11-16T01:00:00 2015-11-16-01.txt
3 2015-11-16T02:00:00 2015-11-16-02.txt
4 2015-11-16T03:00:00 2015-11-16-03.txt
5 2015-11-16T04:00:00 2015-11-16-04.txt
Remember that all the files in an input folder (for example: 2015-11-16-00) are part of a slice with the start
time: 2015-11-16-00. When this slice is processed, the custom activity scans through each file and produces
a line in the output file with the number of occurrences of search term (Microsoft). If there are three files in
the folder 2015-11-16-00, there are three lines in the output file: 2015-11-16-00.txt.
3. Click Deploy on the toolbar to create and deploy the OutputDataset.
Step 4: Create and run the pipeline with custom activity
In this step, you create a pipeline with one activity, the custom activity you created earlier.
IMPORTANT
If you haven't uploaded the file.txt to input folders in the blob container, do so before creating the pipeline. The isPaused
property is set to false in the pipeline JSON, so the pipeline runs immediately as the start date is in the past.
1. In the Data Factory Editor, click New pipeline on the command bar. If you do not see the command, click ...
(Ellipsis) to see it.
2. Replace the JSON in the right pane with the following JSON script:
{
"name": "PipelineCustom",
"properties": {
"description": "Use custom activity",
"activities": [
{
"type": "DotNetActivity",
"typeProperties": {
"assemblyName": "MyDotNetActivity.dll",
"entryPoint": "MyDotNetActivityNS.MyDotNetActivity",
"packageLinkedService": "AzureStorageLinkedService",
"packageFile": "customactivitycontainer/MyDotNetActivity.zip"
},
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"policy": {
"timeout": "00:30:00",
"concurrency": 5,
"retry": 3
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "MyDotNetActivity",
"linkedServiceName": "AzureBatchLinkedService"
}
],
"start": "2015-11-16T00:00:00Z",
"end": "2015-11-16T05:00:00Z",
"isPaused": false
}
}
6. Use the Azure portal to view the tasks associated with the slices and see what VM each slice ran on. See the Data
Factory and Batch integration section for details.
7. You should see the output files in the outputfolder of mycontainer in your Azure blob storage.
You should see five output files, one for each input slice. Each of the output file should have content similar
to the following output:
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-
00/file.txt.
The following diagram illustrates how the Data Factory slices map to tasks in Azure Batch. In this example, a
slice has only one run.
8. Now, let's try with multiple files in a folder. Create files file2.txt, file3.txt, file4.txt, and file5.txt with the same
content as file.txt in the folder 2015-11-16-01.
9. In the output folder, delete the output file: 2015-11-16-01.txt.
10. Now, in the OutputDataset blade, right-click the slice with SLICE START TIME set to 11/16/2015 01:00:00
AM, and click Run to rerun/re-process the slice. Now, the slice has five files instead of one file.
11. After the slice runs and its status is Ready, verify the content in the output file for this slice (2015-11-16-
01.txt) in the outputfolder of mycontainer in your blob storage. There should be a line for each file of the
slice.
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-
01/file.txt.
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-
01/file2.txt.
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-
01/file3.txt.
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-
01/file4.txt.
2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-
01/file5.txt.
NOTE
If you did not delete the output file 2015-11-16-01.txt before trying with five input files, you see one line from the previous
slice run and five lines from the current slice run. By default, the content is appended to output file if it already exists.
A task in the job is created for each activity run of a slice. If there are 10 slices ready to be processed, 10 tasks are
created in the job. You can have more than one slice running in parallel if you have multiple compute nodes in the
pool. If the maximum tasks per compute node is set to > 1, more than one slice can run on the same compute node.
In this example, there are five slices, so five tasks in Azure Batch. With the concurrency set to 5 in the pipeline
JSON in Azure Data Factory and Maximum tasks per VM set to 2 in an Azure Batch pool with 2 VMs, the tasks run
quickly (check the start and end times for the tasks).
Use the portal to view the Batch job and its tasks that are associated with the slices and see what VM each slice ran
on.
2. In the Execute method of your custom activity, use the IActivityLogger object to log information that helps
you troubleshoot issues. The logged messages show up in the user_0.log file.
In the OutputDataset blade, click the slice to see the DATA SLICE blade for that slice. You see activity runs
for that slice. You should see one activity run for the slice. If you click Run in the command bar, you can start
another activity run for the same slice.
When you click the activity run, you see the ACTIVITY RUN DETAILS blade with a list of log files. You see
logged messages in the user_0.log file. When an error occurs, you see three activity runs because the retry
count is set to 3 in the pipeline/activity JSON. When you click the activity run, you see the log files that you
can review to troubleshoot the error.
In the list of log files, click user_0.log. In the right panel are the results of using the
IActivityLogger.Write method.
3. Include the PDB file in the zip file so that the error details have information such as call stack when an error
occurs.
4. All the files in the zip file for the custom activity must be at the top level with no subfolders.
7. The custom activity does not use the app.config file from your package. Therefore, if your code reads any
connection strings from the configuration file, it does not work at runtime. The best practice when using
Azure Batch is to hold any secrets in an Azure KeyVault, use a certificate-based service principal to protect
the keyvault, and distribute the certificate to Azure Batch pool. The .NET custom activity then can access
secrets from the KeyVault at runtime. This solution is a generic one and can scale to any type of secret, not
just connection strings.
There is an easier workaround (but not a best practice): you can create an Azure SQL linked service with
connection string settings, create a dataset that uses the linked service, and chain the dataset as a dummy
input dataset to the custom .NET activity. You can then access the linked service's connection string in the
custom activity code and it should work fine at runtime.
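A minimal sketch of that workaround (the dataset and linked service names here are assumptions): define a dataset
on an Azure SQL linked service and mark it external, for example:

{
  "name": "ConfigConnectionDataset",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "ConfigAzureSqlLinkedService",
    "typeProperties": {
      "tableName": "dummy"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "external": true,
    "policy": {}
  }
}

You would then add { "name": "ConfigConnectionDataset" } to the inputs of the custom activity in the pipeline JSON,
and in the Execute method look up ConfigAzureSqlLinkedService in the linkedServices parameter to read its
connection string.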
Extend the sample
You can extend this sample to learn more about Azure Data Factory and Azure Batch features. For example, to
process slices in a different time range, do the following steps:
1. Add the following subfolders in the inputfolder: 2015-11-16-05, 2015-11-16-06, 2015-11-16-07, 2015-11-16-
08, and 2015-11-16-09 and place input files in those folders. Change the end time for the pipeline from
2015-11-16T05:00:00Z to 2015-11-16T10:00:00Z . In the Diagram View, double-click the InputDataset, and
confirm that the input slices are ready. Double-click OutputDataset to see the state of output slices. If they are
in Ready state, check the output folder for the output files.
2. Increase or decrease the concurrency setting to understand how it affects the performance of your solution,
especially the processing that occurs on Azure Batch. (See Step 4: Create and run the pipeline for more on the
concurrency setting.)
3. Create a pool with higher/lower Maximum tasks per VM. To use the new pool you created, update the Azure
Batch linked service in the Data Factory solution. (See Step 4: Create and run the pipeline for more on the
Maximum tasks per VM setting.)
4. Create an Azure Batch pool with the autoscale feature. Automatically scaling compute nodes in an Azure Batch
pool is the dynamic adjustment of the processing power used by your application.
The sample formula here achieves the following behavior: when the pool is initially created, it starts with 1
VM. The $PendingTasks metric defines the number of tasks in the running + active (queued) state. The
formula finds the average number of pending tasks over the last 180 seconds and sets TargetDedicated
accordingly, while ensuring that TargetDedicated never exceeds 25 VMs. So, as new tasks are submitted, the
pool automatically grows; as tasks complete, VMs become free one by one and autoscaling shrinks the pool.
You can adjust startingNumberOfVMs and maxNumberofVMs to your needs.
Autoscale formula:
startingNumberOfVMs = 1;
maxNumberofVMs = 25;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicated = min(maxNumberofVMs, pendingTaskSamples);
See Automatically scale compute nodes in an Azure Batch pool for details. (A sketch of applying this formula
with the Batch .NET client follows this list.)
If the pool uses the default autoScaleEvaluationInterval, the Batch service could take 15-30 minutes to
prepare the VM before running the custom activity. If the pool uses a different autoScaleEvaluationInterval,
the Batch service could take the autoScaleEvaluationInterval plus 10 minutes.
5. In the sample solution, the Execute method invokes the Calculate method, which processes an input data slice
to produce an output data slice. You can write your own method to process the input data and replace the
Calculate method call in the Execute method with a call to your method (a short sketch of such a substitution
also follows this list).
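As a rough sketch (not part of the sample), the autoscale formula above can be applied to an existing pool with
the Azure Batch .NET client library. The account URL, account key, pool ID, and evaluation interval below are
placeholders that you would replace with your own values:

using System;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

class EnableAutoScaleSample
{
    static void Main()
    {
        // Placeholder Batch account details.
        var credentials = new BatchSharedKeyCredentials(
            "https://<batchaccount>.<region>.batch.azure.com",
            "<batchaccount>",
            "<accountkey>");

        string formula = @"
            startingNumberOfVMs = 1;
            maxNumberofVMs = 25;
            pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
            pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
            $TargetDedicated = min(maxNumberofVMs, pendingTaskSamples);";

        using (BatchClient batchClient = BatchClient.Open(credentials))
        {
            // Enable autoscaling on the pool referenced by the Azure Batch
            // linked service and evaluate the formula every 30 minutes.
            batchClient.PoolOperations.EnableAutoScale(
                "<poolname>",
                autoscaleFormula: formula,
                autoscaleEvaluationInterval: TimeSpan.FromMinutes(30));
        }
    }
}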
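And as a minimal illustration of item 5, here is a custom processing method that could replace the Calculate
call in the Execute method shown earlier. MyTransform and its line-by-line logic are purely illustrative and not
part of the sample:

// Add this method to the custom activity class and call it from Execute
// in place of the Calculate call.
private string MyTransform(string inputText, IActivityLogger logger)
{
    var outputLines = new List<string>();
    foreach (string line in inputText.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries))
    {
        // Replace this with your own per-record processing.
        outputLines.Add(line.Trim().ToUpperInvariant());
    }

    logger.Write("Processed {0} lines.", outputLines.Count);
    return string.Join(Environment.NewLine, outputLines);
}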
Next steps: Consume the data
After you process data, you can consume it with online tools like Microsoft Power BI. Here are links to help you
understand Power BI and how to use it in Azure:
Explore a dataset in Power BI
Getting started with the Power BI Desktop
Refresh data in Power BI
Azure and Power BI - basic overview
References
Azure Data Factory
Introduction to Azure Data Factory service
Get started with Azure Data Factory
Use custom activities in an Azure Data Factory pipeline
Azure Batch
Basics of Azure Batch
Overview of Azure Batch features
Create and manage Azure Batch account in the Azure portal
Get started with Azure Batch Library .NET
Use Case - Product Recommendations
8/15/2017 4 min to read Edit Online
Azure Data Factory is one of many services used to implement the Cortana Intelligence Suite of solution
accelerators. See the Cortana Intelligence Suite page for details about this suite. In this document, we describe a
common use case that Azure users have already solved and implemented by using Azure Data Factory and other
Cortana Intelligence component services.
Scenario
Online retailers commonly want to entice their customers to purchase products by presenting them with products
they are most likely to be interested in, and therefore most likely to buy. To accomplish this, online retailers need to
customize each user's online experience with personalized product recommendations for that specific user.
These personalized recommendations are based on the user's current and historical shopping behavior data,
product information, newly introduced brands, and product and customer segmentation data. Additionally, retailers
can provide product recommendations based on analysis of overall usage behavior from all their users combined.
The goal of these retailers is to optimize for user click-to-sale conversions and earn higher sales revenue. They
achieve this conversion by delivering contextual, behavior-based product recommendations based on customer
interests and actions. For this use case, we use online retailers as an example of businesses that want to optimize
for their customers. However, these principles apply to any business that wants to engage its customers around its
goods and services and enhance its customers' buying experience with personalized product recommendations.
Challenges
There are many challenges that online retailers face when trying to implement this type of use case.
First, data of different sizes and shapes must be ingested from multiple data sources, both on-premises and in the
cloud. This data includes product data, historical customer behavior data, and user data as the user browses the
online retail site.
Second, personalized product recommendations must be calculated and predicted reasonably and accurately. In
addition to product, brand, customer behavior, and browser data, online retailers also need to factor in customer
feedback on past purchases when determining the best product recommendations for the user.
Third, the recommendations must be immediately deliverable to the user to provide a seamless browsing and
purchasing experience, and provide the most recent and relevant recommendations.
Finally, retailers need to measure the effectiveness of their approach by tracking overall up-sell and cross-sell click-
to-conversion sales successes, and adjust their future recommendations accordingly.
Solution Overview
This example use case has been solved and implemented by real Azure users by using Azure Data Factory and
other Cortana Intelligence component services, including HDInsight and Power BI.
The online retailer uses Azure Blob storage, an on-premises SQL Server database, Azure SQL Database, and a
relational data mart as the data stores throughout the workflow. The blob store contains customer information,
customer behavior data, and product information data. The product information data includes product brand
information and a product catalog that is stored on-premises in a SQL data warehouse.
All the data is combined and fed into a product recommendation system to deliver personalized recommendations
based on customer interests and actions, while the user browses products in the catalog on the website. The
customers also see products that are related to the product they are looking at based on overall website usage
patterns that are not related to any one user.
Gigabytes of raw web log files are generated daily from the online retailer's website as semi-structured files. The
raw web log files and the customer and product catalog information are ingested regularly into Azure Blob
storage by using Data Factory's globally deployed data movement as a service. The raw log files for the day are
partitioned (by year and month) in blob storage for long-term storage. Azure HDInsight is used to partition the raw
log files in the blob store and process the ingested logs at scale by using both Hive and Pig scripts. The partitioned
web log data is then processed to extract the inputs needed for a machine learning recommendation system to
generate the personalized product recommendations.
The recommendation system used for the machine learning in this example is an open source machine learning
recommendation platform from Apache Mahout. Any Azure Machine Learning or custom model can be applied to
the scenario. The Mahout model is used to predict the similarity between items on the website based on overall
usage patterns, and to generate the personalized recommendations based on the individual user.
Finally, the result set of personalized product recommendations is moved to a relational data mart for consumption
by the retailer website. The result set could also be accessed directly from blob storage by another application, or
moved to additional stores for other consumers and use cases.
Benefits
By optimizing the product recommendation strategy and aligning it with business goals, the solution met the
online retailer's merchandising and marketing objectives. Additionally, the retailer was able to operationalize and
manage the product recommendation workflow in an efficient, reliable, and cost-effective manner. The approach
made it easy to update the model and fine-tune its effectiveness based on the measures of sales click-to-conversion
successes. By using Azure Data Factory, the retailer was able to abandon time-consuming and expensive manual
cloud resource management and move to on-demand cloud resource management, saving time and money and
reducing the time to solution deployment. Data lineage views and operational service health became easy to
visualize and troubleshoot with the intuitive Data Factory monitoring and management UI available from the Azure
portal. The solution can now be scheduled and managed so that finished data is reliably produced and delivered to
users, and data and processing dependencies are automatically managed without human intervention.
By providing this personalized shopping experience, the online retailer created a more competitive, engaging
customer experience and therefore increased sales and overall customer satisfaction.