Azure Data Factory For Beginners
2) Permit Azure services to connect to SQL Server. Make sure that Allow access to
Azure services is enabled for your SQL Server so that Data Factory can write data
to it. To check and enable this setting, navigate to the logical SQL server >
Overview > Set server firewall > and toggle the Allow access to Azure services
option to ON.
B) Configure sink
1) To build a sink dataset, go to the Sink tab and select + New.
2) To filter the connectors in the New Dataset dialogue box, type “SQL” in the
search field, pick Azure SQL Database, and then click Continue. You copy data to
a SQL database in this demo.
3) Enter OutputSqlDataset as the Name in the Set Properties dialogue box.
Select + New from the Linked service dropdown list. A linked service must be paired
with a dataset. The connection string that Data Factory uses to connect to SQL
Database at runtime is stored in the linked service. The dataset specifies the table
to which the data is copied.
4) Take the following steps in the New Linked Service (Azure SQL
Database) dialogue box:
a. Type AzureSqlDatabaseLinkedService in the Name field.
b. Select your SQL Server instance under Server name.
c. Select your database under the Database name.
d. Under User name, type the user’s name.
e. Under Password, type the user’s password.
f. To test the connection, select Test connection.
g. To deploy the associated service, select Create.
5) It will take you straight to the Set Properties dialogue box. Select [dbo].
[emp] from the Table drop-down menu. Then press OK.
6) Go to the pipeline tab and make sure OutputSqlDataset is selected in Sink
Dataset.
5) Validate the Azure Data Factory Pipeline
1) Select Validate from the toolbar to validate the pipeline.
2) By clicking Code on the upper right, you can see the JSON code associated with the
pipeline.
6) Debug and publish the Azure Data Factory Pipeline
Before publishing artifacts (connected services, datasets, and pipelines) to Data
Factory or your own Azure Repos Git repository, you can debug your pipeline.
1) Select Debug from the toolbar to debug the pipeline. The Output tab at the
bottom of the window displays the status of the pipeline run.
2) Select Publish all from the top toolbar once the pipeline run has completed
successfully. This action publishes your newly built entities (datasets and pipelines) to
Data Factory.
3) Wait until you see the message “Successfully published.” To view notification
messages, go to the top-right corner and select Show Notifications (bell button).
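Once the debug run (or a triggered run) has copied the data, a quick sanity check is to
query the sink table directly; a minimal check, assuming the [dbo].[emp] sink table
selected above (its FirstName and LastName columns are shown later in this guide):

SELECT COUNT(*) AS RowsCopied FROM dbo.emp;
SELECT TOP (10) * FROM dbo.emp;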
4) ETL tools
Azure Data Factory provides approximately 100 enterprise connectors and
robust resources for both code-based and code-free users to accomplish their
data transformation and movement needs.
Also read: How Azure Event Hub & Event Grid Works?
What Is Meant By Orchestration?
Sometimes ADF will instruct another service to execute the actual work
required on its behalf, such as Azure Databricks performing a transformation query.
ADF merely orchestrates the execution of the query and then prepares the
pipeline to move the data on to the destination or the next step.
8) Select Go to resource, and then select Author & Monitor to launch the Data
Factory UI in a separate tab.
Frequently Asked Questions
Q: What is Azure Data Factory?
A: Azure Data Factory is a cloud-based data integration service provided by
Microsoft. It allows you to create, schedule, and manage data pipelines that can
move and transform data from various sources to different destinations.
Q: What are the key features of Azure Data Factory?
A: Azure Data Factory offers several key features, including data movement and
transformation activities, data flow transformations, integration with other Azure
services, data monitoring and management, and support for hybrid data integration.
Q: What are the benefits of using Azure Data Factory?
A: Some benefits of using Azure Data Factory include the ability to automate data
pipelines, seamless integration with other Azure services, scalability to handle large
data volumes, support for on-premises and cloud data sources, and comprehensive
monitoring and logging capabilities.
Q: How does Azure Data Factory handle data movement?
A: Azure Data Factory uses data movement activities to efficiently and securely
move data between various data sources and destinations. It supports a wide range
of data sources, such as Azure Blob Storage, Azure Data Lake Storage, SQL
Server, Oracle, and many others.
Q: What is the difference between Azure Data Factory and Azure Databricks?
A: While both Azure Data Factory and Azure Databricks are data integration and
processing services, they serve different purposes. Azure Data Factory focuses on
orchestrating and managing data pipelines, while Azure Databricks is a big data
analytics and machine learning platform.
Q: Can Azure Data Factory be used for real-time data processing?
A: Yes, Azure Data Factory can be used for real-time data processing. It provides
integration with Azure Event Hubs, which enables you to ingest and process
streaming data in real time.
Q: How can I monitor and manage data pipelines in Azure Data Factory?
A: Azure Data Factory offers built-in monitoring and management capabilities. You
can use Azure Monitor to track pipeline performance, set up alerts for failures or
delays, and view detailed logs. Additionally, Azure Data Factory integrates with
Azure Data Factory Analytics, which provides advanced monitoring and diagnostic
features.
Q: Does Azure Data Factory support hybrid data integration?
A: Yes, Azure Data Factory supports hybrid data integration. It can connect to on-
premises data sources using the self-hosted integration runtime, which provides a secure and
efficient way to transfer data between on-premises and cloud environments.
Q: How can I schedule and automate data pipelines in Azure Data Factory?
A: Azure Data Factory allows you to create schedules for data pipelines using
triggers. You can define time-based or event-based triggers to automatically start
and stop data pipeline runs.
Q: What security features are available in Azure Data Factory?
A: Azure Data Factory provides several security features, including integration with
Azure Active Directory for authentication and authorization, encryption of data at rest
and in transit, and role-based access control (RBAC) to manage access to data and
pipelines.
Please note that these FAQs are intended to provide general information about Azure
Data Factory; for more specific details, it is recommended to refer to the official
Microsoft documentation or consult with Azure experts.
How To Copy Pipeline In Azure Data Factory
Azure Data Factory is a cloud-based data integration service that allows you to
create data-driven workflows in the cloud for orchestrating and automating data
movement and data transformation.
Azure Data Factory does not store any data itself. It allows you to create data-driven
workflows to orchestrate the movement of data between supported data stores and
the processing of data using compute services in other regions or in an on-premises
environment. It also allows you to monitor and manage workflows using
both programmatic and UI mechanisms.
You can check out our related blog here: Azure Data Factory for Beginners
How Does Data Factory work?
1) Extract: In this extraction process, data engineers define the data and its source.
Data source: Identify source details such as the subscription, resource group, and
identity information such as a secret or a key.
Data: Define the data by using a set of files, a database query, or an Azure Blob
storage name for blob storage.
2) Transform: Data transformation operations can include combining, splitting,
adding, deriving, removing, or pivoting columns. Map fields between the data
destination and the data source.
3) Load: During a load, many Azure destinations can take data formatted as a file,
JavaScript Object Notation (JSON), or blob. Test the ETL job in a test environment.
Then shift the job to a production environment to load the production system.
4) Publish: Deliver transformed data from the cloud to on-premises sources like SQL
Server or keep it in your cloud storage sources for consumption by BI and analytics
tools and other applications.
Read: Difference between Structured Vs Unstructured Data
What are Pipelines?
A pipeline is a logical grouping of activities that together perform a task. For
example, a pipeline could contain a set of activities that ingest and clean log data,
and then kick off a mapping data flow to analyze the log data. The pipeline allows
you to manage the activities as a set instead of each one individually. You deploy
and schedule the pipeline instead of the activities independently.
Copy Activity In Azure Data Factory
In ADF, we can use the Copy activity to copy data between data stores located on-
premises and in the cloud. After we copy the data, we can use other activities to
further transform and analyze it. We can also use the ADF Copy activity to publish
transformation and analysis results for business intelligence (BI) and application
consumption.
Monitor Copy Activity: We can monitor all of our pipeline’s runs natively in
the ADF user experience.
Delete Activity In Azure Data Factory: Back up your files before deleting them
with the Delete activity, in case you wish to restore them in
the future.
Copy Pipeline In Azure Data Factory
1.) Create Data Factory
1. Go to portal.azure.com and click the Create Resource menu item from the top
left menu. Create a new Data Factory.
2. Fill in the fields similar to below.
3. Once your data factory is set up open it in Azure. Click the Author and Monitor
button.
4. Click the Connections menu item at the bottom left, pick the Database
category, and then click SQL Server.
5. Create the new linked service and make sure to test the connection before you
proceed.
2.) Create SQL Database
1. Go to portal.azure.com and click the Create Resource menu item from the top left
menu. Create an Azure SQL Database.
2. Fill in fields for the first screen similar to below. For the new server (it’s actually not
a server but a way to group databases) give an ID and Password that you will
remember.
3. Now click the Query editor and log in with your SQL credentials which are
the admin ID and password.
4. You have two choices for getting the SQL script that creates the destination table:
either open "Create Person Table.SQL" in GitHub and copy and paste it into the Query
editor, or copy the file locally to your laptop.
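If you'd rather type it by hand, a minimal sketch of what such a destination table script
might contain is shown below; the "Create Person Table.SQL" file in GitHub is the
authoritative version, and the column names here are purely illustrative:

CREATE TABLE dbo.Person
(
    PersonID INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- illustrative column names only
    FirstName VARCHAR(50) NULL,
    LastName VARCHAR(50) NULL
);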
3. Click the + icon to the right of the “Filter resources by name” input box and pick
the Copy Data option.
4. When working in a wizard like the Copy Wizard or creating pipelines from scratch
make sure to give a good name to each pipeline, linked service, data set, and other
components so it will be easier to work with later.
7. Then pick the person table as the destination and leave the default column
mapping and click next a few times until you come to the screen that says
Deployment Complete.
4.) Monitoring
1. Now click on the Monitor button to see your pipeline job running. You should see
a screen similar to below. If you don’t see your job pipeline check your filters on the
top right.
Finally, you now know how to create pipelines to copy data from a SQL
Server on a VM into Azure SQL Database (Platform as a Service).
ADF Copy Data: Copy Data From Azure Blob Storage To A SQL Database Using
Azure Data Factory
The following diagram shows the logical components such as the Storage account
(data source), SQL database (sink), and Azure data factory that fit into a copy
activity.
Topics we’ll cover:
Overview of Azure Data Factory
Overview of Azure Blob Storage
Overview of Azure SQL Database
How to perform Copy Activity with Azure Data Factory
Before performing the copy activity in the Azure data factory, we should understand
the basic concept of the Azure data factory, Azure blob storage, and Azure SQL
database.
Overview Of Azure Data Factory
Azure Data Factory is defined as a cloud-based ETL and data integration
service.
The aim of Azure Data Factory is to fetch data from one or more data sources
and load it into a format that we can process.
The data sources might contain noise that we need to filter out. Azure Data
Factory enables us to pull out the interesting data and remove the rest.
Azure Data Factory can ingest data from a variety of sources and load it into
a variety of destinations, e.g. an Azure data lake.
It can create data-driven pipelines for orchestrating data movement and
transforming data at scale.
Overview Of Azure Blob Storage
Azure Blob storage is Microsoft’s Azure object storage solution for the cloud. It
is optimized for storing massive amounts of unstructured data.
It is used for streaming video and audio, writing to log files, and storing data
for backup and restore, disaster recovery, and archiving.
Azure Blob storage offers three types of resources:
The storage account
A container in the storage account
A blob in a container
Objects in Azure Blob storage are accessible via the Azure PowerShell,
Azure Storage REST API, Azure CLI, or an Azure Storage client library.
Overview Of Azure SQL Database
It is a fully managed platform as a service. Here the platform manages aspects
such as database software upgrades, patching, backups, and monitoring.
Using Azure SQL Database, we can provide a highly available and performant
storage layer for our applications.
Types of Deployment Options for the SQL Database:
Single Database
Elastic Pool
Managed Instance
Azure SQL Database offers three service tiers:
General Purpose or Standard
Business Critical or Premium
Hyperscale
Note: If you want to learn more about it, then check our blog on Azure SQL
Database
ADF Copy Data From Blob Storage To SQL Database
1. Create a blob and a SQL table
2. Create an Azure data factory
3. Use the Copy Data tool to create a pipeline and Monitor the pipeline
STEP 1: Create a blob and a SQL table
1) To create a source blob, launch Notepad on your desktop. Copy the following text
and save it in a file named emp.txt on your disk.
FirstName|LastName
John|Doe
Jane|Doe
2) Create a container named adftutorial in your Blob storage.
Read: Reading and Writing Data In DataBricks
3) Upload the emp.txt file to the adftutorial container.
4) To create a sink SQL table, use the following SQL script to create a table
named dbo.emp in your SQL Database.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);
Note: Ensure that Allow access to Azure services is turned ON for your SQL Server
so that Data Factory can write data to your SQL Server. To verify and turn on this
setting, go to the logical SQL server > Overview > Set server firewall > and set the
Allow access to Azure services option to ON.
STEP 2: Create a data factory
1) Sign in to the Azure portal. Select Analytics > Data Factory.
2) On the New Data Factory page, select Create.
3) On the Basics Details page, enter the following details. Then select Git
Configuration.
4) On the Git configuration page, select the check box, and then go to
Networking. Then select Review + Create.
5) After the creation is finished, the Data Factory home page is displayed. Select
the Author & Monitor tile.
Read: Azure Data Engineer Interview Questions September 2022
STEP 3: Use the ADF Copy Data tool to create a pipeline
1) Select the + (plus) button, and then select Pipeline.
Search for "data factory" in the marketplace and choose the result from Microsoft.
On the next page, you'll get an overview of the product. Click on create to get started
with configuring your ADF environment. You need to select a subscription. You can
either create a new resource group (which is a logical container for your resources)
or select an existing one. You need to select a region (take one close by your
location to minimize latency) and choose a name. Finally, you need to select a
version. It's highly recommended you choose V2. Version 1 of ADF is almost never
used and practically all documentation you'll find online is about V2.
Click on Review + create at the bottom. It's possible you might get a validation error
about the Git configuration. Integration with Git and Azure DevOps is out of scope for
this tutorial.
If you get the error, go to the Git configuration tab and select Configure Git later.
When the validation passes, click on Create to have Azure create the ADF
environment for you. This might take a couple of minutes. When the resource is
deployed, you can check it out in the portal.
Typically, you don't spend a lot of time here. You can configure access control to
give people permission to develop in ADF, or you can set up monitoring and alerting.
The actual development itself is done in Azure Data Factory Studio, which is a
separate environment. Click on the Studio icon to go to the development
environment, which should open in a new browser tab.
Setup Storage Account
Before we can start creating pipelines in ADF, we need to set up our source and
destination (called sink in ADF). We begin by creating a storage account in the Azure
Portal. Search for the "storage account" resource in the marketplace and click
on Create.
In the Basics tab, choose your subscription and the same resource group as the
ADF environment. Specify a name for the storage account and choose the same
region as your ADF.
For the redundancy, choose "Locally-redundant storage (LRS)", which is the
cheapest option. Go to the Advanced tab and switch the access tier to Cool. This is
a cheaper option than the default Hot access tier.
Click on Review + Create and then Create to provision your storage account. When
it has been deployed, go to the resource and then to Containers in the Data
Storage section.
Specify "data-input" as the new container name and then click on Create.
The downside is we can have only 2GB for our database, but that should be plenty
for this tutorial. Just one more setting before we can create our database. In
the Additional settings tab, choose Sample as the data source. This will install the
AdventureWorksLT sample database.
Click on Review + create and then on create to create the SQL Server and the
Azure SQL database. This might take a couple of minutes. Once the deployment is
done, go to the SQL Server and then to Firewalls and virtual networks, which can be
found in the Security section.
To make sure we can access our database from our machine, we need to add our
current IP address to the firewall. At the top, click on Add client IP. This will add a
new rule to the firewall. Don't forget to click Save at the top!
While we're in the firewall config, let's set the property "Allow Azure services and
resources to access this server" to Yes. This will make our lives a lot easier when we
try to connect to the server from ADF.
In the Overview pane, you can find the name of the server. Hover over it with your
mouse and click the copy icon to copy the name to your clipboard. Start SQL Server
Management Studio (SSMS) or Azure Data Studio to connect to the server. For the
remainder of the tutorial, SSMS is used. In SSMS, create a new connection. Paste
the server name and choose the authentication method you configured earlier. If
you're using Azure AD, don't choose Windows Authentication but rather one of the
Azure AD authentication methods listed: Universal with MFA, Password or
Integrated. The correct one depends on your environment.
Don't click on Connect just yet! First, go to options and enter the database name in
the upper text box.
If you don't do this, SSMS will automatically try to connect to the master database,
which might or might not work, depending on your permissions. You can now click
on Connect. Once you're connected, you can view the tables that were
automatically created for us because we chose the sample database:
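A quick way to list those tables from a query window, instead of browsing them in
Object Explorer, is a query along these lines (in the AdventureWorksLT sample, most
tables sit in the SalesLT schema):

SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA, TABLE_NAME;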
Additional Information
Build your first Azure Data Factory Pipeline
Overview
We're going to build a pipeline using the Copy Data tool. This tool makes it easier for people
starting out with ADF to create their first pipelines. Before we start, we need to make sure
some prerequisites are met.
Prerequisites
If you haven't already, follow the steps of the previous part of the tutorial to set up ADF, a
storage account with a blob container and an Azure SQL DB.
In the Azure Portal, go to your storage account and then to the "data-input" container we
created. Click on the Upload link.
A pane will open where you can select a local file. Upload the Customers.csv file, which you
can download here.
You can also choose how the resulting pipeline needs to be scheduled. For now, we're going
with "run once now". Schedules and triggers are also discussed later in the tutorial.
In step 2, we need to choose the type of our source data. This will be our csv file in the blob
container. Azure Blob Storage is the first option in the dropdown:
We also need to define the connection to the blob container. Since we don't have any
connections yet in ADF, we need to create a new one by clicking on "New connection". In
the new pane, give the new connection a name and leave the default for the integration
runtime (also covered later in the tutorial). As authentication type, choose account key. Since
the blob storage is in the same Azure tenant as ADF, we can simply choose it from the
dropdowns. Select the correct subscription and the correct storage account.
Finally, you can test your connection. If it is successful, click on Create to create the new
connection. The screen for step 2 should look like this:
We now need to select a file from the connection we just created. Click on Browse to open a
new pane to select the file. Choose the Customers.csv file we uploaded in the prerequisites
section.
ADF will automatically detect it's a csv file and will populate most of the configuration fields
for you.
Make sure the first row is selected as a header. You can do a preview of the data to check if
everything is OK:
Now we need to configure our destination in step 3. Search for "sql" and select Azure SQL
Database from the dropdown list.
Like with the source, we will also need to define a new connection. Give it a name and select
the correct subscription, server and database from the dropdowns. If everything is in the same
Azure tenant, this should be straightforward.
Choose the authentication type that you configured during the setup of the SQL Server. In the
following screenshot, I chose SQL authentication, so I need to supply a username and a
password.
You can test the connection to see if everything works. Make sure you gave Azure Services
access to the SQL server – as shown in the previous part of the tutorial – or you will get a
firewall error. Once the connection is created, we need to choose the destination table. You
can either choose an existing table or let ADF create one for you. Fill in dbo as the schema
and Tutorial_StagingCustomer as the table name.
Next, we need to define the mapping. A mapping defines how each column of the source is
mapped against the columns of the destination. Since ADF is creating the table, everything
should be mapped automatically.
If you want, you can supply a pre-copy script. This is a SQL statement that will be executed
right before the data is loaded. In a recurring pipeline, you can for example issue a
TRUNCATE TABLE statement to empty the table. Here it would fail, since ADF first needs
to create the table; truncating a table that doesn't exist yet results in an error.
Now we're in step 4 of the tool and we can define general settings for the pipeline. You can
change the name of the pipeline. Leave everything else to the defaults.
In the final step, you can review all the configurations we made in the previous steps.
Click Next. ADF will create the pipeline and will run it once.
We can verify a pipeline has been created when we check the factory resources by clicking
the pencil icon in the left menu.
We can also check in the database itself that a new table has been created and has been
populated with the data from the CSV file:
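If you want to double-check from SSMS, a simple query against the destination table
configured above could be:

SELECT COUNT(*) AS RowsLoaded FROM dbo.Tutorial_StagingCustomer;
SELECT TOP (10) * FROM dbo.Tutorial_StagingCustomer;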
Additional Information
The Copy Data Tool has many more settings you can play with. Why don't you try
them out?
Check out Getting Started with Azure Data Factory - Part 1 for another example of
how to create a pipeline.
Azure Data Factory Linked Services
Overview
Now that we've created our first pipelines, it is time to delve a bit deeper into the
inner workings of ADF. Let's start with linked services.
The Purpose of Linked Services
In the previous step of the tutorial, every time we created a new connection in the
Copy Data tool, we were creating a new linked service. A linked service is a
connection to a specific service or data store that can either be a source of data, or a
destination (also called target or sink). People who have worked with Integration
Services (SSIS) before will recognize this concept; a linked service can be compared
with a project connection manager in SSIS.
A linked service will store the connection string, as well as the method used to
authenticate with the service. Once a linked service is created, you can reuse it
everywhere. For example, if you have a data warehouse in Azure SQL database,
you will only need to define this connection once.
Linked services can be found in the Manage section of ADF Studio (lowest icon in
the left menu bar).
There we can find the two linked services we created in the previous part:
Keep in mind that for on-premises data sources (and some online data sources) we
need a special integration runtime, which will be covered later in the tutorial.
Creating a Linked Service Manually
In the Manage section, go to Linked Services and click on New. Search for Azure
SQL Database.
Give a name to the new linked service and use the default integration runtime.
Instead of choosing SQL authentication or Azure AD authentication, this time we're
going to use System Assigned Managed Identity. This means we're going to log
into Azure SQL DB using the user credentials of ADF itself. The advantage here is
we don't need to specify users or passwords in the linked service.
However, to make this work, we need to add ADF as a user into our database. When
logged into the database using an Azure AD user with the necessary permissions,
open a new query window and execute the following query:
CREATE USER [mssqltips-adf-tutorial] FOR EXTERNAL PROVIDER;
Next, we need to assign permissions to this user. Typically, ADF will need to read
and write data to the database. So we will add this user to
the db_datareader and db_datawriter roles. If ADF needs to be able to truncate
tables or to automatically create new tables, you can add the user to
the db_ddladmin role as well.
ALTER ROLE db_datareader ADD MEMBER [mssqltips-adf-tutorial];
ALTER ROLE db_datawriter ADD MEMBER [mssqltips-adf-tutorial];
ALTER ROLE db_ddladmin ADD MEMBER [mssqltips-adf-tutorial];
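To confirm the ADF identity exists and has the expected role memberships, you can run
a check such as the following (the user name comes from the statements above):

SELECT dp.name AS user_name, r.name AS role_name
FROM sys.database_role_members AS drm
JOIN sys.database_principals AS dp ON drm.member_principal_id = dp.principal_id
JOIN sys.database_principals AS r ON drm.role_principal_id = r.principal_id
WHERE dp.name = 'mssqltips-adf-tutorial';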
Now we can test our connection in ADF and create it:
Click on Publish to persist the new linked service to the ADF environment.
Linked Services Best Practices
A couple of best practices (or guidelines if you want) for creating linked services in
ADF:
Use a naming convention. For example, prefix connections to SQL Server with
SQL_ and connections to Azure Blob Storage with BLOB_. This will make it
easier to tell the different types of linked services apart.
If you have multiple environments (for example a development and a
production environment), use the same name for a connection in all
environments. For example, don't call a connection to your development data
warehouse "dev_dwh", but rather "SQL_dwh". Having the same name will
make it easier when you automate deployments between environments.
If you cannot use managed identities and you need to specify usernames and
passwords, store them in Azure Key Vault instead of directly embedding them
in the Linked Service. Key Vault is a secure storage for secrets. It has the
advantage of centralizing your secrets. If for example a password or
username changes, you only need to update it at one location. You can find
an introduction to Azure Key Vault in the tip Microsoft Azure Key Vault for
Password Management for SQL Server Applications.
Additional Information
The tip Create Azure Data Lake Linked Service Using Azure Data
Factory explains how to create a linked service for Azure Data Lake Analytics.
Recently a linked service for Snowflake was introduced. You can check it out
in the tip Copy Data from and to Snowflake with Azure Data Factory.
If you want to create a connection to a file, such as Excel or CSV, you need to
create a linked service to the data store where the file can be found. For
example: Azure Blob Storage or your local file system.
For the moment, only SharePoint Lists are supported for SharePoint Online.
Reading documents inside a SharePoint library is currently not supported by
ADF.
Azure Data Factory Datasets
Overview
Once you've defined a linked service, ADF knows how to connect and authenticate
with a specific data store, but it still doesn't know what the data looks like. In this
section we explore what datasets are and how they are used.
The Purpose of Datasets
Datasets are created for that purpose: they specify what the data looks like. In the
case of a flat file, for example, they will specify which delimiters are used, whether
text qualifiers or escape symbols are used, whether the first row is a header, and so on. In the
case of a JSON file, a dataset can specify the location of the file and which
compression or encoding is used. Or if the dataset is used for a SQL Server table, it
will just specify the schema and the name of the table. What all types of datasets
have in common is that they can specify a schema (not to be confused with a
database schema like "dbo"), which is the columns and their data types that are
included in the dataset.
Datasets are found in the Author section of ADF Studio (the pencil icon). There you
can find the two datasets that were created in a previous part of the tutorial with the
Copy Data tool.
We're also going to create a logging table in a schema called "etl". First execute this
script:
CREATE SCHEMA etl;
Then execute the following script for the log table:
CREATE TABLE etl.logging(
ID INT IDENTITY(1,1) NOT NULL
,LogMessage VARCHAR(500) NOT NULL
,InsertDate DATE NOT NULL DEFAULT SYSDATETIME()
);
Since we have a new destination table, we also need a new dataset. In
the Author section, go to the SQL dataset that was created as part of the Copy Data
tool (this should be "DestinationDataset_eqx"). Click on the ellipsis and
choose Clone.
This will make an exact copy of the dataset, but with a different name. Change the
name to "SQL_ExcelCustomers" and select the newly created table from the
dropdown:
In the Schema tab, we can import the mapping of the table.
Next, add a Script activity to the canvas and name it "Log Start".
In the General tab, set the timeout to 10 minutes (the default is 7 days!). You can
also set the number of retries to 1. This means if the Script activity fails, it will wait for
30 seconds and then try again. If it fails again, then the activity will actually fail. If it
succeeds on the second attempt, the activity will be marked as succeeded.
In the Settings tab, choose the linked service for the Azure SQL DB and set the
script type to NonQuery. The Query option means the executed SQL script will
return one or more result sets. The NonQuery option means no result set is returned
and is typically used to execute DDL statements (such as CREATE TABLE, ALTER
INDEX, TRUNCATE TABLE …) or DML statements that modify data (INSERT,
UPDATE, DELETE). In the Script textbox, enter the following SQL statement:
INSERT INTO etl.logging(LogMessage)
VALUES('Start reading Excel');
The settings should now look like this:
Next, drag a Copy Data activity to the canvas. Connect the Script activity with the
new activity. Name it "Copy Excel to SQL".
In this example we're reading from one single Excel file. However, if you have
multiple Excel files of the same format, you can read them all at the same time by
changing the file path type to a wildcard, for example "*.xlsx".
In the Sink tab, choose the SQL dataset we created in the prerequisites section.
Leave the defaults for the properties and add the following SQL statement to the pre-
copy script:
TRUNCATE TABLE dbo.Tutorial_Excel_Customer;
The Sink tab should now look like this:
In the Mapping tab, we can explicitly map the source columns with the sink columns.
Hit the Import Schemas button to let ADF do the mapping automatically.
In this example, doing the mapping isn't necessary since the columns from the
source map 1-to-1 to the sink columns. They have the same names and data types.
If we would leave the mapping blank, ADF will do the mapping automatically when
the pipeline is running. Specifying an explicit mapping is more important when the
column names don't match, or when the source data is more complex, for example a
hierarchical JSON file.
In the Settings tab we can specify some additional properties.
An important property is the number of data integration units (DIU), which are a
measure of the power of the compute executing the copy. As you can see in the
informational message, this directly influences the cost of the Copy data activity. The
price is calculated as $0.25 (this might vary depending on your subscription and currency) * the
copy duration (remember this is always at least one minute and rounded up to the
next full minute!) * # used DIUs. The default value for DIU is set to Auto, meaning
ADF will scale the number of DIUs for you automatically. Possible values are
between 2 and 256. For small data loads ADF will start with a minimum of 4 DIUs. But
for a small Excel file like ours this is already overkill. If you know your dataset is
going to be small, change the property from Auto to 2. This will reduce the price of
your copy data activities by half!
As a final step, copy/paste the Script activity. Change the name to "Log End" and
connect the Copy Data activity with this new activity.
After a while the pipeline will finish. You can see in the Output pane how long each
activity has been running:
If you hover with your mouse over a line in the output, you will get icons for the input
& output, and in the case of the Copy Data activity you will get an extra "glasses"
icon for more details.
When we click on the output for the "Log End" activity, we get the following:
We can see 1 row was inserted. When we go to the details of the Copy Data, we get
the following information:
A lot of information has been kept, such as the number of rows read, how many
connections were used, how many KB were written to the database and so on. Back
in the Output pane, there's link to the debug run consumption.
This will tell us exactly how many resources the debug run of the pipeline consumed:
0.0333 corresponds with two minutes (1 minute of execution rounded up * 2 DIU).
Since our debug run was successful, we can publish everything.
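You can also confirm the run from the database side; assuming the etl.logging and
dbo.Tutorial_Excel_Customer tables used in this pipeline, a quick check could be:

SELECT TOP (10) ID, LogMessage, InsertDate
FROM etl.logging
ORDER BY ID DESC;

SELECT COUNT(*) AS CustomersLoaded FROM dbo.Tutorial_Excel_Customer;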
Why do we need to Publish?
When you create new objects such as linked services, datasets and pipelines, or
when you modify existing ones, those changes are not automatically persisted on the
server. You can first debug your pipelines to make sure your changes are working.
Once everything works fine and validation succeeds, you can publish your changes
to the server. If you do not publish your changes and you close your browser
session, your changes will be lost.
Building Flexible and Dynamic Azure Data Factory Pipelines
Overview
In the previous part we built a pipeline manually, along with the needed datasets and
linked services. But what if you need to load 20 Excel files? Or 100 tables from a
source database? Are you going to create 100 datasets? And 100 different
pipelines? That would be too much (repetitive) work! Luckily, we can have flexible
and dynamic pipelines where we just need two datasets (one for the source, one for
the sink) and one pipeline. Everything else is done through metadata and some
parameters.
Prerequisites
Previously we uploaded an Excel file from Azure Blob Storage to a table in Azure
SQL Database. A new requirement came in and now we must upload another Excel
file to a different table. Instead of creating a new dataset and a new pipeline (or add
another Copy Data activity to the existing pipeline), we're going to reuse our existing
resources.
The new Excel file contains product data, and it has the following structure:
As you can see from the screenshot, the worksheet name is the default "Sheet1".
You can download the sample workbook here. Upload the Excel workbook to the
blob container we used earlier in the tutorial.
Since we want to store the data in our database, we need to create a new staging
table:
CREATE TABLE dbo.Tutorial_StagingProduct
(
[Name] NVARCHAR(50)
,[ProductNumber] NVARCHAR(25)
,[Color] NVARCHAR(15)
,[StandardCost] NUMERIC(10,2)
,[ListPrice] NUMERIC(10,2)
,[Size] NVARCHAR(5)
,[Weight] NUMERIC(8,2)
);
Implement Parameters
Instead of creating two new datasets and another Copy Data activity, we're going to
use parameters in the existing ones. This will allow us to use one single dataset for
both our Excel files. Open the Excel_Customers dataset, go to properties and
rename it to Excel_Generic.
Then go to the Parameters tab, and create the following two parameters:
The schema is different for each Excel file, so we cannot have any column
information here. It will be fetched on the fly when the Copy Data activity runs.
We're going to do the exact same process for our SQL dataset. First, we rename it
to SQL_Generic and then we add two parameters: SchemaName and TableName.
We're going to map these in the connection tab. If you enable the "Edit" checkbox,
two text fields appear (one for the schema and one for the table) which you can
parameterize:
Don't forget to clear the schema! Go to the StageExcelCustomers pipeline and
rename it to "StageExcel". If we open the Copy Data activity, we can see ADF asks
us now to provide values for the parameters we just added.
You can enter them manually, but that would defeat the purpose of our metadata-
driven pipeline.
Creating and Mapping Metadata
We're going to store the metadata we need for our parameters in a table. We're
going to read this metadata and use it to drive a ForEach loop. For each iteration of
the loop, we're going to copy the data from one Excel file to a table in Azure SQL
DB. Create the metadata table with the following script:
CREATE TABLE etl.ExcelMetadata(
ID INT IDENTITY(1,1) NOT NULL
,ExcelFileName VARCHAR(100) NOT NULL
,ExcelSheetName VARCHAR(100) NOT NULL
,SchemaName VARCHAR(100) NOT NULL
,TableName VARCHAR(100) NOT NULL
);
Insert the following two rows of data:
INSERT INTO etl.ExcelMetadata
(
ExcelFileName,
ExcelSheetName,
SchemaName,
TableName
)
VALUES
('Customers.xlsx','Customers','dbo','Tutorial_Excel_Customer')
,
('Products.xlsx' ,'Sheet1' ,'dbo','Tutorial_StagingProduct')
;
In the pipeline, add a Lookup activity to the canvas after the first Script activity. Give
the activity a decent name, set the timeout to 10 minutes and set the retry to 1.
In the Settings, choose the generic SQL dataset. Disable the checkbox for "First row
only" and choose the Query type. Enter the following query:
SELECT
ExcelFileName
,ExcelSheetName
,SchemaName
,TableName
FROM etl.ExcelMetadata;
Since we're specifying a query, we don't actually need to provide (real) values for the
dataset parameters; we're just using the dataset for its connection to the Azure SQL
database.
Preview the data to make sure everything has been configured correctly.
Next, we're going to add a ForEach to the canvas. Add it after the Lookup and
before the second Script activity.
Select the Copy Data activity, cut it (using ctrl-x), click the pencil icon inside the
ForEach activity. This will open a pipeline canvas inside the ForEach loop. Paste the
Copy Data activity there. At the top left corner of the canvas, you can see that we're
inside the loop, which is in the StageExcel pipeline. It seems like there's a "mini
pipeline" inside the ForEach. However, functionality is limited. You can't for example
put another ForEach loop inside the existing ForEach. If you need to nest loops,
you'll need to put the second ForEach in a separate pipeline and call this pipeline
from the first ForEach using the Execute Pipeline activity. Go back to the pipeline
by clicking on its name.
We can access the values of the current item of the ForEach loop by using
the item() function.
We just need to specify which column we exactly want:
We also need to change the Pre-copy script, to make sure we're truncating the
correct table. Like most properties, we can do this through an expression as well.
We're going to use the @concat() function to create a SQL statement along with the
values for the schema and table name.
@concat('TRUNCATE TABLE
',item().SchemaName,'.',item().TableName,';')
Finally, we need to remove the schema mapping in the Mapping pane. Since both
the source and the sink are dynamic, we can't specify any mapping here unless it is
the same for all Excel files (which isn't the case). If the mapping is empty, the Copy
Data activity will do it for us on-the-fly. For this to work, the column names in the
Excel file and the corresponding table need to match!
We've now successfully loaded two Excel files to an Azure SQL database by using
one single pipeline driven by metadata. This is an important pattern for ADF, as it
greatly reduces the amount of work you need to do for repetitive tasks. Keep in mind
though, that each iteration of the ForEach loop results in at least one minute of
billing. Even though our debugging pipeline was running for a mere 24 seconds,
we're being billed for 5 minutes (2 Script activities + 1 Lookup + 2 iterations of the
loop).
Azure Data Factory Integration Runtimes
Overview
In this tutorial we have been executing pipelines to get data from a certain source
and write it to another destination. The Copy Data activity, for example, provides us
with an auto-scalable source of compute that will execute this data transfer for us. But
what is this compute exactly? Where does it reside? The answer is: integration
runtimes. These runtimes provide us with the necessary computing power to execute
all the different kinds of activities in a pipeline. There are 3 types of integration
runtimes (IR), which we'll discuss in the following sections.
The Azure-IR
The most important integration runtime is the one we've been using all this time:
the Azure-IR. Every installation of ADF has a default IR:
the AutoResolveIntegrationRuntime. You can find it when you go to
the Manage section of ADF and then click on Integration Runtimes.
It's called auto resolve because it will try to automatically resolve the geographic
region in which the compute needs to run. This is determined, for example, by the data store
of the sink in a Copy Data activity. If the sink is located in West Europe, it will try to
run the compute in the West Europe region as well.
The Azure-IR is a fully managed, serverless compute service. You don't have to
manage anything; you only pay for the duration it has been running compute. You
can always use the default Azure-IR, but you can also create a new one. Click
on New to create one.
In the following screen, enter a name for the new IR. Also choose your closest
region.
You can also configure the IR to use a Virtual Network, but this is an advanced
setting that is not covered in the tutorial. Keep in mind that billing for pipeline
durations is several magnitudes higher when you're using a virtual network. In the
third pane, we can configure the compute power for data flows. Data flows are
discussed in the next section of the tutorial.
The Self-hosted IR
Suppose you have data on-premises that you need to access from ADF. How can
ADF reach this data store when it is in the Azure cloud? The self-hosted IR provides
us with a solution. You install the self-hosted IR on one of your local machines. This
IR will then act as a gateway through which ADF can reach the on-premises data.
Another use case for the self-hosted IR is when you want to run compute on your
own machines instead of in the Azure cloud. This might be an option if you want to
save costs (the billing for pipeline durations is lower on the self-hosted IR than on
the Azure-IR) or if you want to control everything yourself. ADF will then act as an
orchestrator, while all of the compute is running on your own local servers.
It's possible to install multiple self-hosted IRs on your local network to scale out
resources. You can also share a self-hosted IR between multiple ADF environments.
This can be useful if you want only one self-hosted IR for both development and
production.
The following tips give more detail about this type of IR:
Connect to On-premises Data in Azure Data Factory with the Self-hosted
Integration Runtime - Part 1 and Part 2.
Transfer Data to the Cloud Using Azure Data Factory
Build Azure Data Factory Pipelines with On-Premises Data Sources
The Azure-SSIS IR
ADF provides us with the opportunity to run Integration Services packages inside the
ADF environment. This can be useful if you want to quickly migrate SSIS projects to
the Azure cloud, without a complete rewrite of your projects. The Azure-SSIS IR
provides us with a scale-out cluster of virtual machines that can run SSIS packages.
You create an SSIS catalog in either Azure SQL Database or Azure SQL
Managed Instance.
As usual, Azure deals with the infrastructure. You only need to specify how powerful
the Azure-SSIS IR is by configuring the size of a compute node and how many
nodes there need to be. You are billed for the duration the IR is running. You can
pause the IR to save on costs.
If you choose "Trigger Now", you will create a run-once trigger. The pipeline will run
and that's it. If you choose "New/Edit", you can either create a trigger or modify an
existing one. In the Add triggers pane, open the dropdown and choose New.
The default trigger type is Schedule. In the example below, we've scheduled our
pipeline to run every day, for the hours 6, 10, 14 and 18.
Once the trigger is created, it will start running and execute the pipeline according to
schedule. Make sure to publish the trigger after you've created it. You can view
existing triggers in the Manage section of ADF.
You can pause an existing trigger, or you can delete it or edit it. For more information
about triggers, check out the following tips:
Create Event Based Trigger in Azure Data Factory
Create Schedule Trigger in Azure Data Factory ADF
Create Tumbling Window Trigger in Azure Data Factory ADF
ADF has a REST API which you can also use to start pipelines. You can for example
start a pipeline from an Azure Function or an Azure Logic App.
Monitoring
ADF has a monitoring section where you can view all executed pipelines, both
triggered runs and debug runs.
You can also view the state of the integration runtimes or view more info about the
data flows debugging sessions. For each pipeline run, you can view the exact output
and the resource consumption of each activity and child pipeline.
It's also possible to configure Log analytics for ADF in the Azure Portal. It's out of
scope for this tutorial, but you can find more info in the tip Setting up Azure Log
Analytics to Monitor Performance of an Azure Resource. You can check out the
Monitoring section for the ADF resource in the Azure Portal:
You can choose the type of events that are being logged:
Azure Data Factory Pipeline Logging Error Details
By: Ron L'Esteve | Updated: 2021-01-20 | Related: Azure Data Factory
Problem
In my previous article, Logging Azure Data Factory Pipeline Audit Data, I discussed a
variety of methods for capturing Azure Data Factory pipeline logs and persisting the data to
either a SQL Server table or within Azure Data Lake Storage Gen2. While this process of
capturing pipeline log data is valuable when the pipeline activities succeed, how can we also
capture and persist error details related to Azure Data Factory pipelines when activities
within the pipeline fail?
Solution
In this article, I will cover how to capture and persist Azure Data Factory pipeline errors to an
Azure SQL Database table. Additionally, we will re-cap the pipeline parameter process that I
had discussed in my previous articles to demonstrate how the pipeline_errors, pipeline_log,
and pipeline_parameter relate to each other.
Explore and Understand the Meta-Data driven ETL Approach
Prior to continuing with the demonstration, read my previous articles as a prerequisite
to gain background knowledge of the end-to-end metadata-driven ETL process.
Azure Data Factory Pipeline to fully Load all SQL Server Objects to ADLS Gen2
Load Data Lake files into Azure Synapse Analytics Using Azure Data Factory
Loading Azure SQL Data Warehouse Dynamically using Azure Data Factory
Logging Azure Data Factory Pipeline Audit Data
To re-cap the tables needed for this process, I have included the diagram below which
illustrates how the pipeline_parameter, pipeline_log, and pipeline_error tables are
interconnected with each other.
Create a Parameter Table
The following script will create the pipeline_parameter table with column parameter_id as the
primary key. Note that this table drives the meta-data ETL approach.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- Note: the table-creation part of this script is not shown here. The parameterized
-- INSERT below presumably sits inside a stored procedure; the target table name on the
-- next line is an assumption based on the pipeline_errors table referenced earlier.
INSERT INTO [dbo].[pipeline_errors]
(
[DataFactory_Name],
[Pipeline_Name],
[RunId],
[Source],
[Destination],
[TriggerType],
[TriggerId],
[TriggerName],
[TriggerTime],
[No_ParallelCopies],
[copyDuration_in_secs],
[effectiveIntegrationRuntime],
[Source_Type],
[Sink_Type],
[Execution_Status],
[ErrorDescription],
[ErrorCode],
[ErrorLoggedTime],
[FailureType]
)
VALUES
(
@DataFactory_Name,
@Pipeline_Name,
@RunId,
@Source,
@Destination,
@TriggerType,
@TriggerId,
@TriggerName,
@TriggerTime,
@No_ParallelCopies,
@copyDuration_in_secs,
@effectiveIntegrationRuntime,
@Source_Type,
@Sink_Type,
@Execution_Status,
@ErrorDescription,
@ErrorCode,
@ErrorLoggedTime,
@FailureType
)
GO
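For reference, the column list in the statement above implies a pipeline_errors table
along the following lines. This is only a sketch: the data types are assumptions, not the
original article's definitions.

CREATE TABLE [dbo].[pipeline_errors](
    [ID] INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- assumed surrogate key
    [DataFactory_Name] NVARCHAR(250) NULL,
    [Pipeline_Name] NVARCHAR(250) NULL,
    [RunId] NVARCHAR(250) NULL,
    [Source] NVARCHAR(300) NULL,
    [Destination] NVARCHAR(300) NULL,
    [TriggerType] NVARCHAR(100) NULL,
    [TriggerId] NVARCHAR(100) NULL,
    [TriggerName] NVARCHAR(100) NULL,
    [TriggerTime] NVARCHAR(100) NULL,
    [No_ParallelCopies] INT NULL,
    [copyDuration_in_secs] NVARCHAR(50) NULL,
    [effectiveIntegrationRuntime] NVARCHAR(100) NULL,
    [Source_Type] NVARCHAR(100) NULL,
    [Sink_Type] NVARCHAR(100) NULL,
    [Execution_Status] NVARCHAR(100) NULL,
    [ErrorDescription] NVARCHAR(MAX) NULL,
    [ErrorCode] NVARCHAR(100) NULL,
    [ErrorLoggedTime] NVARCHAR(100) NULL,
    [FailureType] NVARCHAR(100) NULL
);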
Create a Source Error SQL Table
Recall from my previous article, Azure Data Factory Pipeline to fully Load all SQL Server
Objects to ADLS Gen2, that we used a source SQL Server Table that we then moved to the
Data Lake Storage Gen2 and ultimately into Synapse DW. Based on this process, we will
need to test a known error within the Data Factory pipeline and process. It is known that
generally a varchar(max) datatype containing at least 8000+ characters will fail when being
loaded into Synapse DW since varchar(max) is an unsupported data type. This seems like a
good use case for an error test.
The following table dbo.MyErrorTable contains two columns with col1 being the
varchar(max) datatype.
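The table definition itself isn't shown; a minimal sketch of a two-column table matching
that description (the name and type of the column other than col1 are assumptions) might
be:

CREATE TABLE dbo.MyErrorTable
(
    ID INT IDENTITY(1,1) NOT NULL, -- assumed name for the second column
    col1 VARCHAR(MAX) NULL -- holds the large block of text used to force the failure
);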
Within dbo.MyErrorTable I have added a large block of text and decided to randomly
choose Sample text for Roma : the novel of ancient Rome by Steven Saylor. After doing
some editing of the text, I confirmed that col1 contains 8001 words, which is sure to fail my
Azure Data Factory pipeline and trigger a record to be created in the pipeline_errors table.
Verify the Azure Data Lake Storage Gen2 Folders and Files
After running the pipeline to load my SQL tables to Azure Data Lake Storage Gen2, we can
see that the destination ADLS2 container now has both of the tables in snappy compressed
parquet format.
As an additional verification step, we can see that the folder contains the expected parquet
file.
As a final check, when I navigate to the Synapse DW, I can see that both tables have been
auto-created, despite the fact that one failed and one succeeded.
However, data was only loaded into MyTable; MyErrorTable contains no data because its load failed.
Logging Azure Data Factory Pipeline Audit Data
By: Ron L'Esteve | Related: Azure Data Factory
Problem
In my last article, Azure Data Factory Pipeline to fully Load all SQL Server Objects to ADLS
Gen2, I discussed how to create a pipeline parameter table in Azure SQL DB and drive the
creation of snappy parquet files consisting of On-Premises SQL Server tables into Azure
Data Lake Store Gen2. Now that I have a process for generating files in the lake, I would also
like to implement a process to track the log activity for my pipelines that run and persist the
data. What options do I have for creating and storing this log data?
Solution
Azure Data Factory is a robust cloud-based E-L-T tool that is capable of accommodating
multiple scenarios for logging pipeline audit data.
In this article, I will discuss three of these possible options, which include:
1. Updating Pipeline Status and Datetime columns in a static pipeline parameter table
using an ADF Stored Procedure activity
2. Generating a metadata CSV file for every parquet file that is created and storing the
logs in hierarchical folders in ADLS2
3. Creating a pipeline log table in Azure SQL Database and storing the pipeline activity
as records in the table
Prerequisites
Ensure that you have read and implemented Azure Data Factory Pipeline to fully Load all
SQL Server Objects to ADLS Gen2, as this demo will be building a pipeline logging process
on the pipeline copy activity that was created in the article.
Option 1: Create a Stored Procedure Activity
The Stored Procedure Activity is one of the transformation activities that Data Factory
supports. We will use the Stored Procedure Activity to invoke a stored procedure in Azure
SQL Database. For more information on ADF Stored Procedure Activity, see Transform data
by using the SQL Server Stored Procedure activity in Azure Data Factory.
For this scenario, I would like to maintain my Pipeline Execution Status and Pipeline Date
detail as columns in my Pipeline Parameter table rather than having a separate log table. The
downside to this method is that it will not retain historical log data, but will simply update the
values based on a lookup of the incoming files to records in the pipeline parameter table. This
gives a quick, yet not necessarily robust, method of viewing the status and load date across all
items in the pipeline parameter table.
I’ll begin by adding a Stored Procedure activity after my Copy-Table activity, so that the
stored procedure runs as the process iterates on a table-level basis.
Next, I will add the following stored procedure to my Azure SQL Database where my
pipeline parameter table resides. This procedure simply looks up the destination table name in
the pipeline parameter table and updates the status and datetime for each table once the Copy-
Table activity is successful.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- The procedure body was truncated in the source; the procedure name, column names and
-- data types below are assumptions inferred from the surrounding description.
CREATE PROCEDURE dbo.usp_UpdatePipelineStatus @dst_name VARCHAR(500)
AS
BEGIN TRY
    UPDATE dbo.pipeline_parameter
    SET pipeline_status = 'Success', pipeline_datetime = GETDATE()
    WHERE dst_name = @dst_name;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK;
END CATCH
GO
I will then return to my data factory pipeline and configure the stored procedure activity. In
the Stored Procedure tab, I will select the stored procedure that I just created. I will also add a
new stored procedure parameter that references my destination name, which I had configured
in the copy activity.
After saving, publishing and running the pipeline, I can see that my pipeline_datetime and
pipeline_status columns have been updated as a result of the ADF Stored Procedure Activity.
To configure the source dataset, I will select my source on-premises SQL Server.
Next, I will add the following query as my source query. As we can see, this query will
contain a combination of pipeline activities, copy table activities, and user-defined
parameters.
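The query itself is not reproduced here; a sketch of what such a source query could look
like, using ADF system variables together with item() values from the pipeline parameter
table (the exact columns in the original query may differ), is:

SELECT
     '@{pipeline().DataFactory}' AS DataFactory_Name
    ,'@{pipeline().Pipeline}'    AS Pipeline_Name
    ,'@{pipeline().RunId}'       AS RunId
    ,'@{item().src_schema}'      AS Source_Schema
    ,'@{item().dst_name}'        AS Destination_Table
    ,'@{utcnow()}'               AS Pipeline_DateTime;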
Below is the connection configuration that I will use for my csv dataset.
The following parameterized path will ensure that the file is generate in the correct folder
structure.
@{item().server_name}/@{item().src_db}/@{item().src_schema}/@{item().dst_name}/metadata/@{formatDateTime(utcnow(),'yyyy-MM-dd')}/@{item().dst_name}.csv
After I save, publish, and run my pipeline, I can see that a metadata folder has been created in
my Server>database>schema>Destination_table location.
When I open the metadata folder, I can see that there will be a csv file per day that the pipeline
runs.
Finally, I can see that a metadata .csv file with the name of my table has been created.
When I download and open the file, I can see that all of the query results have been populated
in my .csv file.
SET QUOTED_IDENTIFIER ON
GO
Below are the connection details for the Azure SQL DB pipeline log table.
When I save, publish and run my pipeline, I can see that the pipeline copy activity records
have been captured in my dbo.pipeline_log table.
Azure Data Factory Pipeline Scheduling, Error Handling and Monitoring - Part
2
By: Koen Verbeeck | Updated: 2021-06-17 | Comments | Related: > Azure Data
Factory
Problem
Azure Data Factory is a managed serverless data integration service for the Microsoft Azure
Data Platform used by data engineers during business intelligence and cloud data related
projects. In part 1 of this tutorial series, we introduced you to Azure Data Factory (ADF) by
creating a pipeline. We continue by showing you other use cases for which you can use ADF,
as well as how you can handle errors and how to use the built-in monitoring.
Solution
It's recommended to read part 1 before you continue with this tip. It shows you how to install
ADF and how to create a pipeline that will copy data from Azure Blob Storage to an Azure
SQL database as a sample ETL/ELT process.
Azure Data Factory as an Orchestration Service
Like SQL Server Integration Services, ADF is responsible for data movement (copy data or
datasets) from a source to a destination as a workflow. But it can do so much more. There are
a variety of activities that don't do anything in ADF itself, but rather perform some tasks on
an external system. For example, there are activities specific for handling Azure
Databricks scenarios:
You can for example trigger Azure Databricks Notebooks from ADF. The following tips can
get you started on this topic:
Orchestrating Azure Databricks Notebooks with Azure Data Factory
Create Azure Data Factory inventory using Databricks
Getting Started with Delta Lake Using Azure Data Factory
Snowflake Data Warehouse Loading with Azure Data Factory and Databricks
Azure Data Factory Mapping Data Flows for Big Data Lake Aggregations and
Transformations
ADF has its own form of Azure Databricks integration: Data Flows (previously called
Mapping Data Flows) and Power Query flows (formerly known as Wrangling Data Flows), which are
both out of scope for this tip, but will be explained in a subsequent tip.
ADF also supports other technologies, such as HDInsight:
You can call Logic Apps and Azure Functions from Azure Data Factory, which is often
necessary because some functionality is still missing from ADF. For example, you
cannot send an email directly from ADF, nor can ADF easily download a file from SharePoint
Online (or OneDrive for Business).
With ADF pipelines, you can create complex data pipelines where you integrate multiple data
services with each other. But it's not all cloud. You can also access on-premises data sources
when you install the self-hosted integration runtime. This runtime also allows you to shift
workloads to on-premises machines should the need arise.
Lastly, you can also integrate existing SSIS solutions into ADF. You can create an Azure-
SSIS Integration Runtime, which is basically a cluster of virtual machines that will execute
your SSIS packages. The SSIS catalog itself is created in either an Azure SQL DB or an
Azure SQL Managed Instance. You can find more info in the following tips:
Configure an Azure SQL Server Integration Services Integration Runtime
Executing Integration Services Packages in the Azure-SSIS Integration Runtime
Customized Setup for the Azure-SSIS Integration Runtime
SSIS Catalog Maintenance in the Azure Cloud
Scheduling ADF Pipelines
To schedule an ADF pipeline, you add a trigger from within the pipeline itself:
You can either trigger a one-off execution, or you can create/edit a permanent trigger.
Currently, there are 4 types:
Schedule is very similar to what is used in SQL Server Agent jobs. You define a
frequency (for example every 10 minutes or once every day at 3AM), a start date and
an optional end date.
Tumbling window is a more specialized form of schedule. With tumbling windows,
each window's execution is parameterized: when one window is executed, the start and the
end time of the window are passed to the pipeline. The advantage of a tumbling
window is that you can execute past periods as well. Suppose you have a tumbling
window on the daily level, and the start date is at the start of this month. This will
trigger an execution for every day of the month right until the current day. This makes
tumbling windows great for doing an initial load where you want each period
executed separately. You can find more info about this trigger in the tip Create
Tumbling Window Trigger in Azure Data Factory ADF.
Storage events will trigger a pipeline whenever a blob is created or deleted from a
specific blob container.
Custom events are a new trigger type which are in preview at the time of writing.
These allow you to trigger a pipeline based on custom events from Event Grid. You
can find more info in the documentation.
Pipelines can also be triggered from an external tool, such as an Azure Logic App or an
Azure Function. ADF even has a REST API available which you can use, but you could also
use PowerShell, the Azure CLI, .NET or even Python.
Error Handling and Monitoring
Like in SSIS, you can configure constraints on the execution paths between two activities:
This allows you to create a more robust pipeline that can handle multiple scenarios. Keep in
mind, though, that ADF doesn't have an "OR constraint" like SSIS does. Let's illustrate why that
matters. In the following scenario, the Web Activity will never be executed:
For the Web Activity to be executed, the Copy Activity must fail AND the Azure Function
must fail. However, the Azure Function will only start if the Copy Data activity has finished
successfully. If you want to re-use some error handling functionality, you can create a
separate pipeline and call this pipeline from every activity in the main pipeline:
To capture and log any errors, you can create a stored procedure to log them into a table, as
demonstrated in the tip Azure Data Factory Pipeline Logging Error Details.
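As a rough illustration only (this is not the exact code from that tip; the table, procedure and parameter names here are made up), such an error-logging setup could look like the following, with ADF passing in the pipeline name, run ID and error message as stored procedure parameters:

-- Hypothetical error log table and procedure for ADF failure paths
CREATE TABLE dbo.pipeline_errors (
    error_id      INT IDENTITY(1,1) PRIMARY KEY,
    pipeline_name NVARCHAR(200) NULL,
    run_id        NVARCHAR(100) NULL,
    error_message NVARCHAR(MAX) NULL,
    logged_at     DATETIME2 DEFAULT SYSUTCDATETIME()
);
GO
CREATE PROCEDURE dbo.usp_LogPipelineError
    @PipelineName NVARCHAR(200),
    @RunId        NVARCHAR(100),
    @ErrorMessage NVARCHAR(MAX)
AS
BEGIN
    -- Called from a Stored Procedure activity linked to another activity's Failure output
    INSERT INTO dbo.pipeline_errors (pipeline_name, run_id, error_message)
    VALUES (@PipelineName, @RunId, @ErrorMessage);
END
GO

In ADF, the parameters would typically be filled with expressions such as @pipeline().Pipeline and @pipeline().RunId, plus the error output of the failed activity.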
In the ADF Monitor environment, you can view all ongoing and past pipeline runs. There are
pre-defined filters you can use, such as date, pipeline name and status.
You can view the error if a pipeline has failed, but you can also go into the specific run and
restart an activity if needed.
For more advanced alerting and monitoring, you can use Azure Monitor.
Query Audit data in Azure SQL Database using Kusto Query Language (KQL)
By: Rajendra Gupta | Updated: 2021-03-16 | Comments | Related: > Azure SQL
Database
Problem
In the previous tip, Auditing for Azure SQL Database, we explored the process to audit an
Azure SQL Database using the Azure Portal and Azure PowerShell cmdlets. In this article we
look at how you can leverage Kusto Query Language (KQL) for querying the audit data.
Solution
Kusto Query Language (KQL) is a read-only query language for processing real-time data
from Azure Log Analytics, Azure Application Insights, and Azure Security Center logs. SQL
Server database professionals familiar with Transact-SQL will see that KQL is similar to T-
SQL with slight differences.
For example, in T-SQL we use the WHERE clause to filter records from a table as follows.
SELECT *
FROM Employees
WHERE firstname='John'
We can write the same query in KQL with the following syntax. Like PowerShell, it uses a
pipe (|) to pass values to the next command.
Employees
| where firstname == 'John'
Similarly, in T-SQL we use the ORDER BY clause to sort data in ascending or descending
order as follows.
SELECT *
FROM Employees
WHERE firstname='John'
ORDER BY empid
The equivalent KQL code is as follows.
Employees
| where firstname == 'John'
| order by empid
The KQL query syntax looks familiar, right?
Enable Audit for Azure SQL Database
In the previous tip, we configured audit logs for Azure SQL Database using Azure Storage. If
you have the bulk of the audit data in Azure Storage, it can be cumbersome to fetch the
required data. You can use the sys.fn_get_audit_file() function to read it, but that also
takes longer for a large data set. Therefore, for critical databases you should store audits in
Azure Log Analytics.
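For reference, this is roughly what reading the blob-stored audit files directly looks like; the storage URL below is a placeholder that you would replace with your own audit log path:

-- Read audit records straight from the .xel files in Azure Storage
-- (placeholder path; adjust to your storage account, server and database)
SELECT event_time, server_principal_name, database_name, statement
FROM sys.fn_get_audit_file(
    'https://<storageaccount>.blob.core.windows.net/sqldbauditlogs/<servername>/<databasename>/',
    DEFAULT, DEFAULT);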
To configure the Azure SQL Database audit logs in Azure Log Analytics, log in to the Azure
portal with your credentials and navigate to the logical SQL server.
As shown below, server-level auditing is disabled. It is also disabled for all databases in the
Azure server.
Enable the server-level auditing and put a tick on Log Analytics (Preview) as the audit log
destination.
This enables the configuration option for Log Analytics. Click Configure and it opens the Log
Analytics Workspaces pane.
Click Create New Workspace and, in the new workspace, enter the following values:
A name for the Log Analytics workspace
Select your Azure subscription
Resource group
Azure region
Pricing tier
As shown below, auditing is now configured for the Azure SQL Database.
Save the audit configuration; this enables server-level auditing for the Azure SQL Database.
Database-level auditing can remain disabled, because server-level auditing applies to all
databases on the server.
Click Get Started and it opens the query editor for KQL queries. On the left-hand side, it
shows the AzureDiagnostics table for the SQL database.
Suppose someone executed INSERT and SELECT statements against the Azure database. As a
database administrator, you may want to retrieve those SQL statements. You can use KQL to
filter AzureDiagnostics records whose statement_s column contains INSERT statements.
AzureDiagnostics
| where statement_s contains "insert"
As shown below, we get the complete INSERT statement from the audit logs.
Similarly, you can fetch data from other diagnostic tables with the help of KQL. You can
also monitor performance data such as CPU and memory using the diagnostics configuration.
Create an Alert in Microsoft Azure Log Analytics
By: Joe Gavin | Comments | Related: > Azure
Problem
You want to create an alert in Log Analytics to monitor Performance Monitor counters and /
or Event Logs and need a quick way to jump in and get familiar with it.
Solution
Log Analytics is a service in Operations Management Suite (OMS) that monitors your cloud
and on-premises environments to maintain their availability and performance. It collects data
generated by resources in your cloud and on-premises environments and from other
monitoring tools to provide analysis across multiple sources.
(Source: https://docs.microsoft.com/en-us/azure/log-analytics/log-analytics-overview)
Digging deeply into this service is out of scope for this tip. However, diving in and creating a
simple alert is a great place to get started.
We’ll walk through the following:
Creating a Workspace - A workspace is the basic organizational unit for Log
Analytics.
Installing and configuring the Microsoft Monitoring Agent - The agent is the conduit
from Windows and / or Linux monitored machines back to Log Analytics.
Creating an alert - We can create alerts based on Windows Event Logs, Windows
Performance Counters, Linux Performance Counters, IIS Logs, Custom Fields,
Custom Logs and Syslog. In our example, we’ll keep it simple and get started with an
alert based on the ‘% Processor Time’ Windows Performance Counter.
We’ll have a functioning Log Analytics alert when we’re done.
Creating a Workspace
Let’s get started.
Log in to the Microsoft Azure Portal at http://portal.azure.com.
Start typing Log Analytics in the search box (as shown below) and click on Log Analytics
when it comes up in the results.
Click on the Settings icon in the upper right hand section of the OMS Portal.
Go to the desktop of the Windows machine you want to install the agent on and run
MMASetup-AMD64.exe from the location where you saved it.
Click through until you get to the Agent Setup Options screen and check ‘Connect the Agent
to Azure Log Analytics (OMS)’.
(2) Click on the Search button on the right to see if there are any records. In this case we have
no values over 90%, so there are no records returned in the results section.
(3) To turn this query into an alert, click on the Alert icon in the upper left as shown above
and the window below will open.
Enter values for:
1. Name
2. Description
3. Severity
4. Time window
5. Alert frequency
6. Number of results
7. Subject
8. Recipients
9. Then click Save to save the alert.
After saving the Alert, you will get this window.
When we look at the alerts that were set up, we can see them as shown below.
And we’re done.
Next Steps
Azure Data Factory Lookup Activity Example
By: Fikrat Azizov | Comments (7) | Related: > Azure Data Factory
Problem
One of the frequently used SQL Server Integration Services (SSIS) controls is the lookup
transform, which allows performing lookup matches against existing database records. In this
post, we will be exploring Azure Data Factory's Lookup activity, which has similar
functionality.
Solution
Azure Data Factory Lookup Activity
The Lookup activity can read data stored in a database or file system and pass it to
subsequent copy or transformation activities. Unlike SSIS's Lookup transformation, which
allows performing a lookup search at the row level, data obtained from ADF's Lookup
activity can only be used on an object level. In other words, you can use ADF's Lookup
activity's data to determine object names (table, file names, etc.) within the same pipeline
dynamically.
The Lookup activity can read from a variety of database and file-based sources; you can find
the list of all supported data sources here.
Lookup activity can work in two modes:
Singleton mode - Produces the first row of the related dataset
Array mode - Produces the entire dataset
We will look into both modes of Lookup activity in this post.
Azure Data Factory Lookup Activity Singleton Mode
My first example will create a Lookup activity that reads the first row of a SQL query against
the SrcDb database and uses it in a subsequent Stored Procedure activity, which stores the
value in a log table inside the DstDb database.
For the purpose of this exercise, I have created a pipeline ControlFlow1_PL and a view in the
SrcDb database to extract all table names, using the below query:
CREATE VIEW [dbo].[VW_TableList]
AS
SELECT TABLE_SCHEMA+'.'+TABLE_NAME AS Name FROM
INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE='BASE TABLE'
GO
I have also created a log table and a stored procedure to write into it; this procedure will be
used by the Stored Procedure activity. Here are the required scripts to be executed inside the
DstDb database:
CREATE TABLE [dbo].[TableLogs](
[TableName] [varchar](max) NULL
)
GO
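The stored procedure script itself is not shown above. A minimal sketch of what the usp_LogTableNames procedure (referenced later in this post) might look like, assuming it simply writes the passed-in table name into the log table, is:

-- Hypothetical sketch; the original article's exact procedure is not reproduced here
CREATE PROCEDURE [dbo].[usp_LogTableNames]
    @TableName varchar(max)
AS
BEGIN
    INSERT INTO [dbo].[TableLogs] ([TableName])
    VALUES (@TableName);
END
GO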
I've named the new dataset TableList_DS; see the below properties:
The below screenshot shows the properties of the Lookup activity, with the new dataset
configured. Please note that the 'First row only' checkbox is checked, which ensures that this
activity produces only the first row from its data source:
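In other words, the Lookup activity will effectively issue a query like the one below against the view created earlier in SrcDb, keeping only the top row (shown here only to make the singleton behavior concrete):

-- What the Lookup activity reads; with 'First row only' checked, only the first row is used
SELECT Name FROM [dbo].[VW_TableList];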
Next, let's add a Stored Procedure activity pointing to the usp_LogTableNames procedure we
created earlier, and link it to the Lookup_AC activity on Success criteria:
Finally, let's publish the changes, trigger it manually, switch to the Monitor page and open
the Activity Runs window to examine the detailed execution logs:
Using the Output button, we can examine the output of the lookup activity and see the value
it produced:
Now that we know how Lookup activity works in singleton mode, let's explore the array
mode.
Azure Data Factory Lookup Activity Array Mode
To explore the Lookup activity's array mode, I am going to create a copy of the pipeline created
earlier and customize it as follows:
Clone the pipeline ControlFlow1_PL and name it ControlFlow2_PL.
Select the Lookup_AC activity in the ControlFlow2_PL pipeline, switch to the Settings tab and
clear the First row only checkbox:
Because we're expecting multiple rows from Lookup activity, we can no longer use
LogTableName_AC activity with a string parameter, so let's remove it and drag-drop a Set
Variable activity, located under the General category (I've named it as Set_Variable_AC):
Problem
Data integration flows often involve execution of the same tasks on many similar objects. A
typical example could be - copying multiple files from one folder into another or copying
multiple tables from one database into another. Azure Data Factory's (ADF) ForEach and
Until activities are designed to handle iterative processing logic. We are going to discuss the
ForEach activity in this article.
Solution
Azure Data Factory ForEach Activity
The ForEach activity defines a repeating control flow in your pipeline. This activity could be
used to iterate over a collection of items and execute specified activities in a loop. This
functionality is similar to SSIS's Foreach Loop Container.
ForEach activity's item collection can include outputs of other activities, pipeline parameters
or variables of array type. This activity is a compound activity; in other words, it can include
more than one activity.
Creating ForEach Activity in Azure Data Factory
In the previous two posts (here and here), we started developing the
pipeline ControlFlow2_PL, which reads the list of tables from the SrcDb database, filters the
list down to tables whose names start with the character 'P' and assigns the results to the
pipeline variable FilteredTableNames. Here is the list of tables we get in this variable (a SQL
sketch of this filter follows the list):
SalesLT.Product
SalesLT.ProductCategory
SalesLT.ProductDescription
SalesLT.ProductModel
SalesLT.ProductModelProductDescription
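For context, this filtered list is roughly equivalent to running the following query against the view created in the singleton-mode example; in the pipeline itself the filtering is done with ADF activities rather than SQL:

-- Illustration only: the SQL equivalent of the FilteredTableNames variable contents
SELECT Name
FROM [dbo].[VW_TableList]
WHERE Name LIKE 'SalesLT.P%';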
In this exercise, we will add a ForEach activity to this pipeline, which will copy the tables
listed in this variable into the DstDb database.
Before we proceed, let's prepare the target tables. First, let's remove the foreign key
relationships between these tables in the destination database using the below script, to prevent
the ForEach activity from failing:
ALTER TABLE [SalesLT].[Product] DROP CONSTRAINT [FK_Product_ProductCategory_ProductCategoryID]
GO
ALTER TABLE [SalesLT].[Product] DROP CONSTRAINT [FK_Product_ProductModel_ProductModelID]
GO
ALTER TABLE [SalesLT].[ProductCategory] DROP CONSTRAINT [FK_ProductCategory_ProductCategory_ParentProductCategoryID_ProductCategoryID]
GO
ALTER TABLE [SalesLT].[ProductModelProductDescription] DROP CONSTRAINT [FK_ProductModelProductDescription_ProductDescription_ProductDescriptionID]
GO
ALTER TABLE [SalesLT].[ProductModelProductDescription] DROP CONSTRAINT [FK_ProductModelProductDescription_ProductModel_ProductModelID]
GO
ALTER TABLE [SalesLT].[SalesOrderDetail] DROP CONSTRAINT [FK_SalesOrderDetail_Product_ProductID]
GO
Next, let's create a stored procedure to purge the target tables, using the below script. We'll
need to call this procedure before the copy activities run, to avoid primary key errors:
CREATE PROCEDURE Usp_PurgeTargetTables
AS
BEGIN
delete from [SalesLT].[Product]
delete from [SalesLT].[ProductModelProductDescription]
delete from [SalesLT].[ProductDescription]
delete from [SalesLT].[ProductModel]
delete from [SalesLT].[ProductCategory]
END
Let's follow the below steps to add a ForEach activity to the ControlFlow2_PL pipeline:
Select pipeline ControlFlow2_PL, expand Iterations & Conditionals group on the Activities
panel, drag-drop ForEach activity into the central panel and assign a name (I've named it as
ForEach_AC):
Switch to the Settings tab and enter an expression @variables('FilteredTableNames') into
Items text box:
Next, create an Azure SQL DB dataset pointing to the SrcDb database (I've named it
ASQLSrc_DS) and add a dataset parameter TableName of string type:
Switch to the Connection tab and enter the expression @dataset().TableName in the Table text
box, which ensures that the table name for this dataset is assigned dynamically, using the
dataset parameter:
Now that the source dataset has been created, let's return to the parent pipeline's design surface
and enter the expression @item().name in the TableName text box. This expression ensures
that items from the ForEach activity's input list are mapped to its copy activity's source
dataset:
Next, let's create a parameterized sink dataset for the CopyFiltered_AC activity, using a similar
method. Here is how your screen should look:
Now that we've completed the configuration of the CopyFiltered_AC activity, let's switch to the
parent pipeline's design surface, using the navigation link at the top of the screen:
Next, let's add a Stored Procedure activity (I've named it SP_Purge_AC) pointing to the
Usp_PurgeTargetTables procedure we created earlier, and link it to the Set_Variable_AC activity
on Success criteria:
As the last configuration step, let's link the SP_Purge_AC and ForEach_AC activities on Success
criteria. This ensures that the target tables are purged before the copy activities begin:
Finally, let's start the pipeline in Debug mode and examine execution logs in the Output
window to ensure that five copy activities (one per each item from FilteredTableNames
variable list) have finished successfully:
We can also examine the input of the ForEach activity, using the Input button and confirm
that it received five items:
Since the pipeline works as expected, we can publish all the changes now.
I have attached JSON scripts for this pipeline here.
Optional attributes of ForEach activity in Azure Data Factory
The ForEach activity has a few optional attributes that control the degree of parallelism of its
child activities. Here are those attributes:
Sequential - This setting instructs the ForEach activity to run its child activities in
sequential order, one at a time
Batch Count - This setting specifies how many of the ForEach activity's child activities can
run in parallel
Here is the screenshot with these attributes:
Next Steps
Import Data from Excel to Azure SQL Database using Azure Data Factory
By: Ron L'Esteve | Updated: 2021-07-06 | Comments (2) | Related: > Azure Data
Factory
Problem
The need to load data from Excel spreadsheets into SQL databases has been a long-standing
requirement for many organizations. Previously, tools such as VBA, SSIS, C#
and more have been used to perform this data ingestion and orchestration process. Recently,
Microsoft introduced an Excel connector for Azure Data Factory. Based on this new Excel
connector, how can we go about loading Excel files containing multiple tabs into Azure SQL
Database Tables?
Solution
With the new addition of the Excel connector in Azure Data Factory, we now have the
capability of leveraging dynamic and parameterized pipelines to load Excel spreadsheets into
Azure SQL Database tables. In this article, we will explore how to dynamically load an Excel
spreadsheet that resides in ADLS Gen2 and contains multiple sheets into a single Azure SQL
table, and also into a separate table for every sheet.
Pre-Requisites
Create an Excel Spreadsheet
The image below shows a sample Excel spreadsheet containing four sheets with the same
headers and schema, which we will use in our ADF pipelines to load data into Azure SQL
tables.
Upload to Azure Data Lake Storage Gen2
This same Excel spreadsheet has been loaded to ADLS gen2.
Within Data Factory, we can add an ADLS gen2 linked service for the location of the Excel
spreadsheet.
Create Linked Services and Datasets
We'll need to ensure that the ADLS gen2 linked service credentials are configured accurately.
When creating a new dataset, notice that we have Excel format as an option which we can
select.
The connection configuration properties for the Excel dataset can be found below. Note that
we will need to configure the Sheet Name property with the dynamic parameterized
@dataset().SheetName value. Also, since we have headers in the file, we will need to check
'First row as header'.
Within the parameters tab, we'll need to add SheetName.
Next, a sink dataset to the target Azure SQL Table will also need to be created with a
connection to the appropriate linked service.
Create a Pipeline to Load Multiple Excel Sheets in a Spreadsheet into a Single
Azure SQL Table
In the following section, we'll create a pipeline to load multiple Excel sheets from a single
spreadsheet file into a single Azure SQL Table.
Within the ADF pane, we can next create a new pipeline and then add a ForEach loop activity
to the pipeline canvas. Next, click on the white space of the canvas within the pipeline to add
a new Array variable called SheetName containing default values of all the sheets in the
spreadsheet from Sheet1 through Sheet4, as depicted in the image below.
When we navigate to the Azure SQL Table and query it, we can see that the data from all the
Excel Sheets were loaded into the single Azure SQL Table.
Create a Pipeline to Load Multiple Excel Sheets in a Spreadsheet into Multiple Azure
SQL Tables
In this next example, we will test loading multiple Excel sheets from a spreadsheet into
multiple Azure SQL Tables. To begin, we will need a new SQL lookup table that contains the
SheetName and TableName values, which will be used by the dynamic ADF pipeline
parameters.
The following script can be used to create this lookup table.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
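The CREATE TABLE statement itself did not survive in this copy of the article. A minimal sketch, assuming a hypothetical table name of dbo.ExcelTableLookUp and the two columns described above, would be:

-- Hypothetical sketch of the lookup table mapping Excel sheet names to target SQL tables
CREATE TABLE [dbo].[ExcelTableLookUp](
    [SheetName] [nvarchar](100) NOT NULL,
    [TableName] [nvarchar](100) NOT NULL
)
GO
-- Sample mapping values, for illustration only
INSERT INTO [dbo].[ExcelTableLookUp] (SheetName, TableName)
VALUES ('Sheet1','Table1'), ('Sheet2','Table2'), ('Sheet3','Table3'), ('Sheet4','Table4');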
Next, we will also need to create a new dataset with a connection to this lookup table.
The connection properties of the Excel Spreadsheet will be similar to the previous pipeline
where we parameterized SheetName as follows.
In this scenario, we will also need to add a parameter for the TableName in the Azure SQL
Database dataset connection as follows.
In the Azure SQL DB connection section, we'll leave the schema hardcoded and add the
parameter for the TableName as follows.
In this pipeline, we will also need a Lookup activity, which serves the purpose of retrieving
the values from the SQL lookup table through a select * query on the table.
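Assuming the hypothetical lookup table sketched earlier, the Lookup activity's source query would simply be:

-- Retrieves one row per sheet/table pair for the ForEach activity to iterate over
SELECT [SheetName], [TableName] FROM [dbo].[ExcelTableLookUp];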
The values from the lookup can be passed to the ForEach loop activity's items property of the
settings tab, as follows:
Next, within the ForEachLoop activity, we'll need a Copy Data activity with the source
dataset properties containing the parameterized SheetName value, as follows.
Next, the sink dataset properties will also need to contain the parameterized TableName
value, as follows. Note that the table option is once again set to 'Auto Create Table'.
After we run this pipeline, we can see that the pipeline succeeded and four tables were
created in the Azure SQL Database.
Upon navigating to the Azure SQL Database, we can see that all four tables were created with
the appropriate names based on the TableName values we defined in the SQL lookup table.
As a final check, when we query all four tables, we can see that they all contain the data from
the Excel sheets, which confirms that the pipeline executed successfully and with the correct
sheet-to-table mappings defined in the lookup table.
Logging Azure Data Factory Pipeline Audit Data
By: Ron L'Esteve | Comments (7) | Related: > Azure Data Factory
Problem
In my last article, Azure Data Factory Pipeline to fully Load all SQL Server Objects to ADLS
Gen2, I discussed how to create a pipeline parameter table in Azure SQL DB and drive the
creation of snappy-compressed Parquet files from on-premises SQL Server tables in Azure
Data Lake Storage Gen2. Now that I have a process for generating files in the lake, I would also
like to implement a process to track the log activity for my pipelines that run and persist the
data. What options do I have for creating and storing this log data?
Solution
Azure Data Factory is a robust cloud-based ELT tool that is capable of accommodating
multiple scenarios for logging pipeline audit data.
In this article, I will discuss three of these possible options, which include:
1. Updating Pipeline Status and Datetime columns in a static pipeline parameter table
using an ADF Stored Procedure activity
2. Generating a metadata CSV file for every parquet file that is created and storing the
logs in hierarchical folders in ADLS2
3. Creating a pipeline log table in Azure SQL Database and storing the pipeline activity
as records in the table
Prerequisites
Ensure that you have read and implemented Azure Data Factory Pipeline to fully Load all
SQL Server Objects to ADLS Gen2, as this demo will be building a pipeline logging process
on the pipeline copy activity that was created in the article.
Option 1: Create a Stored Procedure Activity
The Stored Procedure Activity is one of the transformation activities that Data Factory
supports. We will use the Stored Procedure Activity to invoke a stored procedure in Azure
SQL Database. For more information on ADF Stored Procedure Activity, see Transform data
by using the SQL Server Stored Procedure activity in Azure Data Factory.
For this scenario, I would like to maintain my Pipeline Execution Status and Pipeline Date
detail as columns in my Pipeline Parameter table rather than having a separate log table. The
downside to this method is that it will not retain historical log data, but will simply update the
values based on a lookup of the incoming files to records in the pipeline parameter table. This
gives a quick, yet not necessarily robust, method of viewing the status and load date across all
items in the pipeline parameter table.
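Assuming the pipeline parameter table is named dbo.pipeline_parameter (the exact name is not shown in this excerpt), the two tracking columns could be added with something like:

-- Hypothetical: add the status and load-date tracking columns to the parameter table
ALTER TABLE [dbo].[pipeline_parameter]
ADD pipeline_status VARCHAR(20) NULL,
    pipeline_datetime DATETIME2 NULL;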
I’ll begin by adding a Stored Procedure activity to the success output of my Copy-Table activity,
so that the stored procedure runs each time the process iterates at the table level.
Next, I will add the following stored procedure to my Azure SQL Database where my
pipeline parameter table resides. This procedure simply looks up the destination table name in
the pipeline parameter table and updates the status and datetime for each table once the Copy-
Table activity is successful.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- Illustrative reconstruction: the procedure, table and parameter names below are assumptions,
-- since the full script did not survive in this copy of the article
CREATE PROCEDURE [dbo].[usp_UpdatePipelineStatus] @dst_name NVARCHAR(500)
AS
BEGIN TRY
    UPDATE [dbo].[pipeline_parameter]
    SET pipeline_status = 'Success', pipeline_datetime = GETDATE()
    WHERE dst_name = @dst_name;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK;
END CATCH
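Since the procedure body above is a hedged reconstruction, it is worth sanity-checking it directly in the database before wiring it into ADF, for example:

-- Hypothetical manual test of the reconstructed procedure ('dbo.MyTable' is a placeholder)
EXEC [dbo].[usp_UpdatePipelineStatus] @dst_name = 'dbo.MyTable';
SELECT dst_name, pipeline_status, pipeline_datetime
FROM [dbo].[pipeline_parameter]
WHERE dst_name = 'dbo.MyTable';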
I will then return to my data factory pipeline and configure the stored procedure activity. In
the Stored Procedure tab, I will select the stored procedure that I just created. I will also add a
new stored procedure parameter that references my destination name, which I had configured
in the copy activity.
After saving, publishing and running the pipeline, I can see that my pipeline_datetime and
pipeline_status columns have been updated as a result of the ADF Stored Procedure Activity.
To configure the source dataset for the metadata file, I will select my on-premises SQL Server
as the source. Next, I will add a metadata query as my source query; as we can see, this query
contains a combination of pipeline activities, copy-table activities, and user-defined parameters
(a hedged sketch is shown below). Below is the connection configuration that I will use for my
csv dataset.
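The metadata source query itself is not reproduced in this excerpt. A hedged sketch of what it might look like, combining ADF system variables with the item() properties used in the folder path below (the column aliases are illustrative), is:

-- Hypothetical metadata query; the @{...} expressions are resolved by ADF at runtime
SELECT
    '@{pipeline().DataFactory}' AS data_factory_name,
    '@{pipeline().Pipeline}'    AS pipeline_name,
    '@{pipeline().RunId}'       AS run_id,
    '@{item().src_db}'          AS source_database,
    '@{item().src_schema}'      AS source_schema,
    '@{item().dst_name}'        AS destination_table,
    '@{utcnow()}'               AS load_datetime;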
The following parameterized path will ensure that the file is generated in the correct folder
structure.
@{item().server_name}/@{item().src_db}/@{item().src_schema}/
@{item().dst_name}/metadata/@{formatDateTime(utcnow(),'yyyy-
MM-dd')}/@{item().dst_name}.csv
After I save, publish, and run my pipeline, I can see that a metadata folder has been created in
my Server>database>schema>Destination_table location.
When I open the metadata folder, I can see that there is one csv file per day that the pipeline
runs.
Finally, I can see that a metadata .csv file with the name of my table has been created.
When I download and open the file, I can see that all of the query results have been populated
in my .csv file.
Next, I will create the following table in my Azure SQL Database. This table will store and
capture the pipeline and copy activity details.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
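The CREATE TABLE script for the log table is not preserved in this excerpt. A minimal sketch, assuming a hypothetical dbo.pipeline_log table that captures the copy-activity details described above, could be:

-- Hypothetical sketch of the pipeline log table; column names are assumptions
CREATE TABLE [dbo].[pipeline_log](
    [log_id]             [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [data_factory_name]  [nvarchar](200) NULL,
    [pipeline_name]      [nvarchar](200) NULL,
    [run_id]             [nvarchar](100) NULL,
    [source_table]       [nvarchar](200) NULL,
    [destination_table]  [nvarchar](200) NULL,
    [rows_copied]        [bigint] NULL,
    [copy_duration_secs] [int] NULL,
    [execution_status]   [nvarchar](50) NULL,
    [pipeline_datetime]  [datetime2] NULL
)
GO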
My sink will be a connection to the Azure SQL DB pipeline log table that I created above.
Below are the connection details for the Azure SQL DB pipeline log table.
When I save, publish and run my pipeline, I can see that the pipeline copy activity records
have been captured in my dbo.pipeline_log table.
Problem
In this series of posts, I am going to explore Azure Data Factory (ADF), compare its
features against SQL Server Integration Services (SSIS) and show how to use it for real-life
data integration problems. In Control flow activities, I provided an overview of control flow
activities and explored a few simple activity types. In this post, we will be exploring the
If Condition activity.
Solution
Azure Data Factory If Condition Activity
The If Condition activity is similar to SSIS's Conditional Split control, described here. It allows
directing a pipeline's execution one way or another, based on some internal or external
condition.
Unlike the simple activities we have considered so far, the If Condition activity is a compound
activity: it contains a logical evaluation condition and two activity groups, one that runs when
the condition evaluates to true and another that runs when it evaluates to false.
The If Condition activity's condition is based on a logical expression, which can reference
pipeline and trigger properties as well as system variables and functions.
Creating Azure Data Factory If Condition Activity
In one of the earlier posts (see Automating pipeline executions, Part 3), we created the
pipeline Blob_SQL_PL, which kicks off in response to file-arrival events in a blob
storage container. This pipeline had a single activity, designed to transfer data from CSV files
into the FactInternetSales table in Azure SQL DB.
We will customize this pipeline and make it more intelligent: it will check the input file's name
and, based on that, transfer files into either the FactInternetSales or the DimCurrency table by
initiating different activities.
To prepare the destination for the second activity, I have created table DimCurrency inside
DstDb, using the below script:
CREATE TABLE [dbo].[DimCurrency](
    [CurrencyKey] [int] IDENTITY(1,1) NOT NULL,
    [CurrencyAlternateKey] [nchar](3) NOT NULL,
    [CurrencyName] [nvarchar](50) NOT NULL,
    CONSTRAINT [PK_DimCurrency_CurrencyKey] PRIMARY KEY CLUSTERED ([CurrencyKey] ASC)
)
GO
Let's follow the below steps to add an If Condition activity:
Select pipeline Blob_SQL_PL, expand 'Iterations and Conditionals' group on Activities
panel, drag-drop an If Condition activity into the central panel and assign the name (I've
named it If_Condition_AC):
Switch to the Settings tab, place the cursor in the Expression text box and click the 'Add
dynamic content' link under that text box, to start building an evaluation expression:
Expand Functions/Conversion Functions group and select the bool function:
Place the cursor inside the bool function brackets, expand Functions/String Functions group
and select the startswith function:
Place the cursor inside the startswith function brackets and select the SourceFile pipeline
parameter we created earlier, followed by a comma and the 'FactIntSales' string, and then confirm
to close the Add Dynamic Content window. Here's the final expression:
@bool(startswith(pipeline().parameters.SourceFile,'FactIntSales')) , which evaluates whether
or not the input file's name starts with 'FactIntSales' string. Here's a screenshot for the activity
with the evaluation condition:
Next, let's cut the FactInternetSales_AC activity to the clipboard, using the right-click Cut
command:
Now, we need to add activities to True and False evaluation groups. Select If_Condition_AC
activity, switch to the Activities tab and click Add If True Activity button:
Right-click on the design surface and select the Paste command to paste the activity we cut
earlier, and assign a name (I have named it FactInternetSales_AC):
The FactInternetSales_AC activity was originally created with explicit field mapping
(see Transfer On-Premises Files to Azure SQL Database for more details). However, because
this pipeline is going to transfer files with different structures, we no longer need explicit
mapping, so let's switch to the Mapping tab and click the Clear button to remove the
mapping:
Please note the pipeline hierarchy link at the top of the design surface, which allows you to
navigate to the parent pipeline's design screen. We could add more activities to the True
Activities group; however, that's not required for this exercise, so let's click the
Blob_SQL_PL navigation link to return to the parent pipeline's design screen:
As for the sink dataset, we will need to create an Azure SQL DB dataset pointing to the
DimCurrency table:
Now that we are done with the configuration of the DimCurrency_AC activity, we can return to
the parent screen using the parent navigation link and publish the changes. Here is how your
final screen should look at this point:
For those who want to see the JSON script for the pipeline we just created, I have attached
the script here.
Validating Azure Data Factory Pipeline Execution
Because this pipeline has an event-based trigger associated with it, all we need to initiate it is
to drop files into the source container. We can use Azure Portal to manage files in the blob
storage, so let's open the Blob Storage screen and remove existing files from the csvfiles
container:
Now, use the Upload button to select DimCurrency.csv file from the local folder:
Let's wait a few minutes for this pipeline to finish, then switch to the Monitor screen to examine
the execution results. As expected, MyEventTrigger started the pipeline in response to the
DimCurrency.csv file's upload event:
Upon further examination of execution details, we can see that DimCurrency_AC activity ran
after conditional validation:
Now, let's upload FactIntSales2012.csv file and see the execution results:
Activity Runs screen confirms that conditional activity worked as expected:
Conclusion
The If Condition activity is a great feature that allows you to add conditional logic to your
pipelines. You can build complex evaluation expressions interactively using the Add Dynamic
Content window, and you can nest multiple activities within an If Condition activity.
Although the If Condition activity's functionality in ADF is similar to that of SSIS's Conditional
Split control, there are a few important differences:
The If Condition activity's evaluation conditions operate at the object level (for example,
a dataset's source file name, pipeline name, trigger time, etc.), whereas SSIS's
Conditional Split evaluates row-level conditions.
SSIS's Conditional Split has a default output, to which rows not matching the specified
criteria can be directed, whereas ADF's If Condition only has True and False outputs.