
Azure Data Factory For Beginners

June 3, 2023 by Izhar Alam

Azure Data Factory is a cloud-based ETL and data integration service that allows
us to create data-driven pipelines for orchestrating data movement and transforming
data at scale.
In this blog, we’ll learn about the Microsoft Azure Data Factory (ADF) service. This
service permits us to combine data from multiple sources, reshape it into analytical
models, and save those models for subsequent querying, visualization, and reporting.

Also read: our blog on Azure Data Lake Overview for Beginners


What Is ADF?
 ADF is defined as a data integration service.
 The aim of ADF is to fetch data from one or more data sources and convert it
into a format that we can process.
 The data sources might contain noise that we need to filter out. ADF
connectors enable us to pull the data we are interested in and remove the rest.
 ADF can ingest data and load it from a variety of sources into Azure Data
Lake Storage.
 It is a cloud-based ETL service that allows us to create data-driven pipelines
for orchestrating data movement and transforming data at scale.
Create Azure Data Factory Pipeline

January 23, 2022 by Akshay Tondak

In this blog, we’ll build our Azure Data Factory pipeline, which will simply copy data
from Azure Blob storage to an Azure SQL Database.
A pipeline is a logical collection of activities that work together to complete a task.
A pipeline, for example, could include a set of activities that ingest and clean log data
before launching a mapping data flow to analyze the log data. The pipeline enables
you to manage the activities as a group rather than individually. Instead of deploying
and scheduling the activities separately, you deploy and schedule the pipeline.
In this blog, you perform the following steps:
 Create a data factory.
 Create a pipeline with a copy activity.
 Test run the pipeline.
 Trigger the pipeline manually.
Create Azure Data Factory Pipeline
1) Prerequisites
1) Azure subscription. If you don’t already have an Azure subscription, sign up for
a free Azure account before you start.
2) Azure Storage account. Blob storage is used as the source data store. If you
don’t already have an Azure storage account, see Create an Azure storage
account for instructions.
3) Azure SQL Database. The database is used as the sink data store. If you don’t
already have a database in Azure SQL Database, see Create a database in Azure
SQL Database for instructions.
2) Create a blob and a SQL table
Now, prepare your SQL database and Blob storage for the blog by following the
steps below.
a) Create a source blob
1) Start Notepad. Copy the following text and save it to your disk as an emp.txt file:
FirstName,LastName
John,Doe
Jane,Doe
2) In your Blob storage, create a container called adfdemo. In this container, make a
folder called input. Then, copy the emp.txt file into the input folder. To complete
these tasks, use the Azure portal or tools such as Azure Storage Explorer.

b) Create a sink SQL table


1) To create the dbo.emp table in your database, run the SQL script below:
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);

2) Permit Azure services to connect to SQL Server. Make sure that Allow access to
Azure services is enabled for your SQL Server so that Data Factory can write data
to it. To check and enable this setting, navigate to the logical SQL server >
Overview > Set server firewall, and toggle the Allow access to Azure services
option to ON.

3) Create an Azure Data Factory


In this step, you create a data factory and launch the Data Factory UI to begin
building a pipeline in the data factory.
1) Open Microsoft Edge or Google Chrome, whichever you want. Data Factory UI
is currently only available in the Microsoft Edge and Google Chrome web browsers.
2) Select Create a resource > Integration > Data Factory from the left menu.
3) On the Create Data Factory page, under the Basics tab, select the Azure
subscription in which you wish to create the data factory.
4) Take one of the following steps to choose a resource group:
 From the drop-down list, choose an existing resource group.
 Select Create new and give the resource group a name.
5) Select a location for the data factory under Region. Only supported regions are
shown in the drop-down list. The data stores (such as Azure Storage and SQL
Database) and compute services (such as Azure HDInsight) used by the data factory
can be in different regions.
6) Enter ADFdemoDataFactory in the Name field.
The Azure data factory must have a globally unique name. Enter a different name for
the data factory if you get an error notice concerning the name value. (e.g.,
yournameADFdemoDataFactory).
7) Select V2 under Version.
8) At the top, select the Git configuration tab, and then select the Configure Git
later check box. Select Review + create, and then select Create after validation passes.
9) When the creation is complete, the notification appears in the Notifications center.
To get to the Data factory page, select Go to resource.
10) To open the Azure Data Factory UI in a new tab, select Open on the Open
Azure Data Factory Studio tile.
4) Create an Azure Data Factory Pipeline
In this step, you build a pipeline with a copy activity in the data factory. The
copy activity transfers data from the blob storage to the SQL database. You
construct the pipeline through these steps:
1. Create the linked service.
2. Create input and output datasets.
3. Create a pipeline.
1) Select Orchestrate from the main page.
2) In the General panel under Properties, specify CopyPipeline for Name.
Then, in the top-right corner, click the Properties icon to collapse the panel.
3) Expand the Move and Transform category in the Activities toolbox, then drag
and drop the Copy Data activity from the toolbox to the pipeline designer surface.
Name it CopyFromBlobToSql.
A) Configure a source
1) Select the Source tab. To add a new source dataset, select Add New.
2) Select Azure Blob Storage in the New Dataset dialogue box, then
click Continue. Because the source data is stored in a blob, you choose Azure Blob
Storage as the source dataset.
3) Choose the data format type in the Select Format dialogue box, then
click Continue.
4) Enter SourceBlobDataset as the Name in the Set Properties dialogue box.
Check the box labeled “First row as a header.” Select + New from the Linked
service text box.
5) Enter AzureStorageLinkedService as the name in the New Linked
Service (Azure Blob Storage) dialogue box, and then choose your storage account
from the Storage account name list. To deploy the associated service,
select Create after testing the connection.
6) After the linked service is created, you are returned to the Set properties page.
Choose Browse next to File path.
7) Select the emp.txt file from the adfdemo/input folder, then select OK.
8) Choose OK. It takes you straight to the pipeline page. Confirm
that SourceBlobDataset is chosen on the Source tab. Select preview data to see a
preview of the data on this page.

B) Configure sink
1) To build a sink dataset, go to the Sink tab and select + New.
2) To filter the connectors in the New Dataset dialogue box, type “SQL” in the
search field, pick Azure SQL Database, and then click Continue. You copy data to
a SQL database in this demo.
3) Enter OutputSqlDataset as the Name in the Set Properties dialogue box.
Select + New from the Linked service dropdown list. A linked service must be paired
with a dataset. The connection string that Data Factory uses to connect to SQL
Database at runtime is stored in the linked service. The container, folder, and file
(optional) to which the data is copied are all specified in the dataset.
4) Take the following steps in the New Linked Service (Azure SQL
Database) dialogue box:
a. Type AzureSqlDatabaseLinkedService in the Name field.
b. Select your SQL Server instance under Server name.
c. Select your database under the Database name.
d. Under User name, type the user’s name.
e. Under Password, type the user’s password.
f. To test the connection, select Test connection.
g. To deploy the associated service, select Create.
5) It will take you straight to the Set Properties dialogue box. Select [dbo].[emp]
from the Table drop-down menu. Then press OK.
6) Go to the pipeline tab and make sure OutputSqlDataset is selected in Sink
Dataset.
5) Validate the Azure Data Factory Pipeline
1) Select Validate from the tool bar to validate the pipeline.

2) By clicking Code on the upper right, you can see the JSON code linked with the
pipeline.
6) Debug and publish the Azure Data Factory Pipeline
Before publishing artifacts (linked services, datasets, and pipelines) to Data
Factory or your own Azure Repos Git repository, you can debug your pipeline.
1) Select Debug from the toolbar to debug the pipeline. The Output tab at the
bottom of the window displays the status of the pipeline run.
2) Select Publish all from the top toolbar once the pipeline run completes
successfully. This action publishes your newly built entities (datasets and pipelines) to
Data Factory.
3) Wait until you see the message “Successfully published.” To view notification
messages, go to the top-right corner and select Show Notifications (bell button).

7) Trigger the Azure Data Factory pipeline manually


In this step, you manually trigger the pipeline you published in the previous stage.
1) On the toolbar, select Trigger, and then Trigger Now. Select OK on the Pipeline
Run page.
2) On the left, click the Monitor tab. You see a pipeline run triggered by the
manual trigger. You can use the links under the PIPELINE NAME column to view
activity details and to rerun the pipeline.
3) Select the CopyPipeline link under the PIPELINE NAME column to see activity
runs linked with the pipeline run. There is only one activity in this case, so there is
only one entry in the list. Select the Details link (eyeglasses icon) under
the ACTIVITY NAME column for more information about the copy process. To return
to the Pipeline Runs view, select All pipeline runs at the top. Select Refresh to
refresh the view.
4) Check that the emp table in the database has two more rows.
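To confirm the copy on the database side, you can run a quick query against the sink table (a minimal check using the table created by the script above):
-- The two rows from emp.txt should now be in the sink table
SELECT ID, FirstName, LastName
FROM dbo.emp;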

What Is a Data Integration Service?


 Data integration involves the collection of data from one or more sources.
 It then includes a process where the data may be transformed, cleansed, or
augmented with additional data, and prepared.
 Finally, the combined data is stored in a data platform service that deals with
the type of analytics that we want to perform.
 This process can be automated by ADF in an arrangement known as Extract,
Transform, and Load (ETL).
What Is ETL?
1) Extract
 In the extraction process, data engineers define the data and its source.
 Data source: Identify source details such as the subscription, resource group,
and identity information such as a secret or a key.
 Data: Define the data by using a set of files, a database query, or a blob name
for Azure Blob storage.
2) Transform
 Data transformation operations can include combining, splitting, adding,
deriving, removing, or pivoting columns (see the T-SQL sketch after this list for an example).
 Map fields between the data destination and the data source.
3) Load
 During a load, many Azure destinations can take data formatted as a file,
JavaScript Object Notation (JSON), or blob.
 Test the ETL job in a test environment. Then shift the job to a production
environment to load the production system.

Go through this Microsoft Azure Blog to get a clear understanding of Azure SQL
4) ETL tools
 Azure Data Factory provides approximately 100 enterprise connectors and
robust resources for both code-based and code-free users to accomplish their
data transformation and movement needs.
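To make the Transform step above concrete, here is a minimal T-SQL sketch of the kind of column operations listed (combining and deriving columns). The table and column names are illustrative only and are not part of any pipeline in this blog:
-- Illustrative only: combine two columns and derive a new one
SELECT
    FirstName,
    LastName,
    FirstName + ' ' + LastName AS FullName,                                 -- combine columns
    SUBSTRING(Email, CHARINDEX('@', Email) + 1, LEN(Email)) AS EmailDomain  -- derive a column
FROM dbo.SourceCustomers;                                                   -- hypothetical source table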
Also read: How Azure Event Hub & Event Grid Works?
What Is Meant By Orchestration?
 Sometimes ADF will instruct another service to execute the actual work
required on its behalf, such as Databricks, to perform a transformation query.
 ADF merely orchestrates the execution of the query and then prepares the
pipeline to move the data on to the destination or the next step.

Copy Activity In ADF


 In ADF, we can use the Copy activity to copy data between data stores located
on-premises and in the cloud.
 After we copy the data, we can use other activities to further transform and
analyze it.
 We can also use the ADF Copy activity to publish transformation and analysis
results for business intelligence (BI) and application consumption.

1) Monitor Copy Activity


 Once we’ve created and published a pipeline in ADF, we can associate it with
a trigger.
 We can monitor all of our pipeline runs natively in the ADF user experience.
 To monitor the Copy activity run, go to your Data Factory Author & Monitor UI.
 On the Monitor tab, we see a list of the pipeline runs; click a pipeline
name link to access the list of activity runs in that pipeline run.
2) Delete Activity In ADF
 Back up your files before deleting them with the Delete activity, in case
you wish to restore them in the future.
 Make sure that Data Factory has write permissions on the storage store to
delete files or folders.
To Know More About Azure Databricks click here
How Does ADF Work?
1) Connect and Collect
 Enterprises have data of various types such as structured, unstructured, and
semi-structured.
 The first step collects all the data from different sources and then moves the
data to a centralized location for subsequent processing.
 We can use the Copy Activity in a data pipeline to move data from both cloud
sources and on-premises data stores to a centralized data store in the cloud.
2) Transform and Enrich
 After data is available in a centralized data store in the cloud, transform, or
process the collected data by using ADF mapping data flows.
 ADF supports external activities for executing our transformations on compute
services such as Spark, HDInsight Hadoop, Azure Machine Learning, and Azure
Data Lake Analytics.
3) CI/CD and Publish
 ADF offers full support for CI/CD of our data pipelines using GitHub and Azure
DevOps.
 After the raw data has been refined, load the data into Azure SQL Database,
Azure Synapse Analytics (formerly Azure SQL Data Warehouse), or Azure Cosmos DB.
4) Monitor
 ADF has built-in support for pipeline monitoring via Azure Monitor, PowerShell,
API, Azure Monitor logs, and health panels on the Azure portal.
5) Pipeline
 A pipeline is a logical grouping of activities that execute a unit of work.
Together, the activities in a pipeline execute a task.

Also check: Overview of Azure Stream Analytics


How To Create An ADF
1) Go to the Azure portal.
2) From the portal menu, Click on Create a resource.

Also Check: Our previous blog post on Convolutional Neural Network (CNN). Click here.
3) Select Analytics, and then select see all.
4) Select Data Factory, and then select Create

Check Out: How to create an Azure load balancer: step-by-step instructions for beginners.
5) On the Basics details page, enter the required details. Then select Git
configuration.
6) On the Git configuration page, select the Configure Git later check box, and
then go to Networking.
Also Check: Data Science VS Data Engineering, to know the major differences
between them.
7) On the Networking page, leave the default settings, go to Tags, and then
select Create.

8) Select Go to resource, and then Select Author & Monitor to launch the Data
Factory UI in a separate tab.
Frequently Asked Questions
Q: What is Azure Data Factory?
A: Azure Data Factory is a cloud-based data integration service provided by
Microsoft. It allows you to create, schedule, and manage data pipelines that can
move and transform data from various sources to different destinations.
Q: What are the key features of Azure Data Factory?
A: Azure Data Factory offers several key features, including data movement and
transformation activities, data flow transformations, integration with other Azure
services, data monitoring and management, and support for hybrid data integration.
Q: What are the benefits of using Azure Data Factory?
A: Some benefits of using Azure Data Factory include the ability to automate data
pipelines, seamless integration with other Azure services, scalability to handle large
data volumes, support for on-premises and cloud data sources, and comprehensive
monitoring and logging capabilities.
Q: How does Azure Data Factory handle data movement?
A: Azure Data Factory uses data movement activities to efficiently and securely
move data between various data sources and destinations. It supports a wide range
of data sources, such as Azure Blob Storage, Azure Data Lake Storage, SQL
Server, Oracle, and many others.
Q: What is the difference between Azure Data Factory and Azure Databricks?
A: While both Azure Data Factory and Azure Databricks are data integration and
processing services, they serve different purposes. Azure Data Factory focuses on
orchestrating and managing data pipelines, while Azure Databricks is a big data
analytics and machine learning platform.
Q: Can Azure Data Factory be used for real-time data processing?
A: Yes, Azure Data Factory can be used for real-time data processing. It provides
integration with Azure Event Hubs, which enables you to ingest and process
streaming data in real time.
Q: How can I monitor and manage data pipelines in Azure Data Factory?
A: Azure Data Factory offers built-in monitoring and management capabilities. You
can use Azure Monitor to track pipeline performance, set up alerts for failures or
delays, and view detailed logs. Additionally, Azure Data Factory integrates with
Azure Data Factory Analytics, which provides advanced monitoring and diagnostic
features.
Q: Does Azure Data Factory support hybrid data integration?
A: Yes, Azure Data Factory supports hybrid data integration. It can connect to on-
premises data sources using a self-hosted integration runtime, which provides a secure
and efficient way to transfer data between on-premises and cloud environments.
Q: How can I schedule and automate data pipelines in Azure Data Factory?
A: Azure Data Factory allows you to create schedules for data pipelines using
triggers. You can define time-based or event-based triggers to automatically start
and stop data pipeline runs.
Q: What security features are available in Azure Data Factory?
A: Azure Data Factory provides several security features, including integration with
Azure Active Directory for authentication and authorization, encryption of data at rest
and in transit, and role-based access control (RBAC) to manage access to data and
pipelines. Please note that these FAQs are intended to provide general information
about Azure Data Factory, and for more specific details, it is recommended to refer
to the official Microsoft documentation or consult with Azure experts.
How To Copy Pipeline In Azure Data Factory

July 12, 2022 by Pradhumn Sharma

In this blog, we are going to cover what Azure Data Factory is, how Data
Factory works, and the Copy pipeline in Azure Data Factory.
Topics we’ll cover:
 Azure Data Factory
 How Azure Data Factory Works 
 What are Pipelines
 Copy Activity In Azure Data Factory
 Copy Pipeline In Azure Data Factory
What Is Azure Data Factory?

Azure Data Factory is a cloud-based data integration service that allows you to
create data-driven workflows in the cloud for orchestrating and automating data
movement and data transformation.

Source: Microsoft
Azure Data Factory does not store any data itself. It allows you to create data-driven
workflows to orchestrate the movement of data between supported data stores and
the processing of data using compute services in other regions or in an on-
premises environment. It also allows you to monitor and manage workflows using
both programmatic and UI mechanisms.
You can check out our related blog here: Azure Data Factory for Beginners
How Does Data Factory work?

1) Extract: In the extraction process, data engineers define the data and its source.
Data source: Identify source details such as the subscription, resource group, and
identity information such as a secret or a key. Data: Define the data by using a set of
files, a database query, or a blob name for Azure Blob storage.
2) Transform: Data transformation operations can include combining, splitting,
adding, deriving, removing, or pivoting columns. Map fields between the data
destination and the data source.
3) Load: During a load, many Azure destinations can take data formatted as a file,
JavaScript Object Notation (JSON), or blob. Test the ETL job in a test environment.
Then shift the job to a production environment to load the production system.
4) Publish: Deliver transformed data from the cloud to on-premises sources like SQL
Server, or keep it in your cloud storage sources for consumption by BI and analytics
tools and other applications.
Read: Difference between Structured Vs Unstructured Data
What are Pipelines?
A pipeline is a logical grouping of activities that together perform a task. For
example, a pipeline could contain a set of activities that ingest and clean log data,
and then kick off a mapping data flow to analyze the log data. The pipeline allows
you to manage the activities as a set instead of each one individually. You deploy
and schedule the pipeline instead of the activities independently.
Copy Activity In Azure Data Factory
In ADF, we can use the Copy activity to copy data between data stores located on-
premises and in the cloud. After we copy the data, we can use other activities to
further transform and analyze it. We can also use the ADF Copy activity to publish
transformation and analysis results for business intelligence (BI) and application
consumption.
 Monitor Copy Activity: We can monitor all of our pipeline runs natively in
the ADF user experience.
 Delete Activity In Azure Data Factory: Back up your files before deleting
them with the Delete activity, in case you wish to restore them in the future.
Copy Pipeline In Azure Data Factory
1.) Create Data Factory
1. Go to portal.azure.com and click the Create Resource menu item from the top
left menu. Create a new Data Factory.
2. Fill in the fields similar to below.

3. Once your data factory is set up open it in Azure. Click the Author and Monitor
button.

4. Click the Connections menu item at the bottom left, pick the Database
category, and then click SQL Server.
5. Create the new linked service and make sure to test the connection before you
proceed.
2.) Create SQL Database
1. Go to portal.azure.com and click the Create Resource menu item from the top left
menu. Create an Azure SQL Database.
2. Fill in fields for the first screen similar to below. For the new server (it’s actually not
a server but a way to group databases) give an ID and Password that you will
remember.

3. Now click the Query editor and log in with your SQL credentials which are
the admin ID and password.
4. You have a choice of how to get the SQL script that creates the destination table. Either
open “Create Person Table.SQL” in GitHub and copy and paste it into the Query
editor, or copy the file locally to your laptop.

5. Run the query and your Person table should be created.
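The actual script lives in the GitHub repository mentioned above; purely as a hypothetical sketch (not the real file), a destination table script for this kind of copy could look something like this:
-- Hypothetical sketch only; use the real "Create Person Table.SQL" from the repository
CREATE TABLE dbo.Person
(
    BusinessEntityID int NOT NULL,
    FirstName nvarchar(50) NOT NULL,
    MiddleName nvarchar(50) NULL,
    LastName nvarchar(50) NOT NULL
);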

3.) Create a Linked Service


1. Go back to Data Factory, click the Author item, and then click the bottom-left
Connections menu. Create a new linked service for the Azure SQL DB you
created earlier. Use the Azure SQL DB admin ID and password you used in the
earlier lab when you set up the Azure SQL DB.
2. Now in Azure Data Factory click the ellipses next to Pipelines and create a new
folder to keep things organized.

3. Click the + icon to the right of the “Filter resources by name” input box and pick
the Copy Data option.

4. When working in a wizard like the Copy Wizard or creating pipelines from scratch
make sure to give a good name to each pipeline, linked service, data set, and other
components so it will be easier to work with later.

5. Click next and pick the Person.Person table.


6. Click next twice and then for your destination pick the Azure SQL DB connection
you created earlier and click Next.

7. Then pick the person table as the destination and leave the default column
mapping and click next a few times until you come to the screen that says
Deployment Complete.
4.) Monitoring
1. Now click on the Monitor button to see your pipeline job running. You should see
a screen similar to below. If you don’t see your job pipeline check your filters on the
top right.

Finally, you now know how to create pipelines to copy data from a SQL
Server on a VM into Azure SQL Database (Platform as a Service).
ADF Copy Data: Copy Data From Azure Blob Storage To A SQL Database Using
Azure Data Factory

March 16, 2023 by Izhar Alam

In this blog, we are going to cover the case study of ADF copy data from Blob
storage to a SQL Database with Azure Data Factory (ETL service) which we will
be discussing in detail in our Microsoft Azure Data Engineer Certification [DP-
203] FREE CLASS.

The following diagram shows the logical components such as the Storage account
(data source), SQL database (sink), and Azure data factory that fit into a copy
activity.
Topics we’ll cover:
 Overview of Azure Data Factory
 Overview of Azure Blob Storage
 Overview of Azure SQL Database
 How to perform Copy Activity with Azure Data Factory
Before performing the copy activity in the Azure data factory, we should understand
the basic concept of the Azure data factory, Azure blob storage, and Azure SQL
database.
Overview Of Azure Data Factory
 Azure Data Factory is defined as a cloud-based ETL and data integration
service.
 The aim of Azure Data Factory is to fetch data from one or more data sources
and load it in a format that we can process.
 The data sources might contain noise that we need to filter out. Azure Data
Factory enables us to pull the data we are interested in and remove the rest.
 Azure Data Factory can ingest data and load it from a variety of sources
into a variety of destinations, e.g., Azure Data Lake.
 It can create data-driven pipelines for orchestrating data movement and
transforming data at scale.
To download the complete DP-203 Azure Data Engineer Associate Exam
Questions guide click here.
Overview Of Azure Blob Storage
 Azure Blob storage is Microsoft’s object storage solution for the cloud. It
is optimized for storing massive amounts of unstructured data.
 It is used for streaming video and audio, writing to log files, and storing data
for backup and restore, disaster recovery, and archiving.
 Azure Blob storage offers three types of resources:
 The storage account
 A container in the storage account
 A blob in a container
 Objects in Azure Blob storage are accessible via the Azure PowerShell,
Azure Storage REST API, Azure CLI, or an Azure Storage client library.
Overview Of Azure SQL Database
 It is a fully managed platform as a service. Here the platform manages aspects
such as database software upgrades, patching, backups, and monitoring.
 Using Azure SQL Database, we can provide a highly available and performant
storage layer for our applications.
 Types of deployment options for SQL Database:
 Single Database
 Elastic Pool
 Managed Instance
 Azure SQL Database offers three service tiers:
 General Purpose or Standard
 Business Critical or Premium
 Hyperscale
Note: If you want to learn more about it, then check our blog on Azure SQL
Database
ADF Copy Data From Blob Storage To SQL Database
1. Create a blob and a SQL table
2. Create an Azure data factory
3. Use the Copy Data tool to create a pipeline and Monitor the pipeline
STEP 1: Create a blob and a SQL table
1) Create a source blob, launch Notepad on your desktop. Copy the following text
and save it in a file named input Emp.txt on your disk.
FirstName|LastName
John|Doe
Jane|Doe
2) Create a container in your Blob storage. Container named adftutorial.
Read: Reading and Writing Data In DataBricks
3) Upload the emp.txt file to the adfcontainer folder.
4) Create a sink SQL table, Use the following SQL script to create a table
named dbo.emp in your SQL Database.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);
Note: Ensure that Allow access to Azure services is turned ON for your SQL Server
so that Data Factory can write data to your SQL Server. To verify and turn on this
setting, go to logical SQL server > Overview > Set server firewall > set the
Allow access to Azure services option to ON.
Also read: Azure Stream Analytics is the perfect solution when you require a fully
managed service with no infrastructure setup hassle.
STEP 2: Create a data factory
1) Sign in to the Azure portal. Select Analytics > Data Factory.
2) On the New Data Factory page, select Create.
3) On the Basics details page, enter the required details. Then select Git
configuration.

4) On the Git configuration page, select the Configure Git later check box, and then
go to Networking. Then select Review + create.
5) After the creation is finished, the Data Factory home page is displayed. Select
the Author & Monitor tile.
Read: Azure Data Engineer Interview Questions September 2022
STEP 3: Use the ADF Copy Data tool to create a pipeline
1) Select the + (plus) button, and then select Pipeline.

2) In the General panel under Properties, specify CopyPipeline for Name. Then
collapse the panel by clicking the Properties icon in the top-right corner.
 
3) In the Activities toolbox, expand Move & Transform. Drag the Copy Data
activity from the Activities toolbox to the pipeline designer surface. You can also
search for activities in the Activities toolbox. Specify CopyFromBlobToSql for
Name.
4)  Go to the Source tab. Select + New to create a source dataset.
5) In the New Dataset dialog box, select Azure Blob Storage to copy data from
azure blob storage, and then select Continue.
6) In the Select Format dialog box, choose the format type of your data, and
then select Continue.
 
Read: DP 203 Exam: Azure Data Engineer Study Guide
7) In the Set Properties dialog box, enter SourceBlobDataset for Name. Select
the checkbox for the first row as a header. Under the Linked service text box,
select + New.
8) In the New Linked Service (Azure Blob Storage) dialog box, enter
AzureStorageLinkedService as name, select your storage account from
the Storage account name list. Test connection, select Create to deploy the linked
service.
9) After the linked service is created, you’re navigated back to the Set properties
page. Next to File path, select Browse. Navigate to the adftutorial/input
folder, select the emp.txt file, and then select OK.
10) Select OK. It automatically navigates to the pipeline page. In the Source
tab, confirm that SourceBlobDataset is selected. To preview data on this page,
select Preview data.
11) Go to the Sink tab, and select + New to create a sink dataset. In the New
Dataset dialog box, input “SQL” in the search box to filter the connectors,
select Azure SQL Database, and then select Continue.
12) In the Set Properties dialog box, enter OutputSqlDataset for Name. From
the Linked service dropdown list, select + New.
Read: Microsoft Azure Data Engineer Associate [DP-203] Exam Questions
13) In the New Linked Service (Azure SQL Database) dialog box, fill the following
details.
14) The test connection may fail. If it does, go to your Azure SQL database and select your
database. Go to the Set server firewall settings page. On the Firewall settings page,
select Yes for Allow Azure services and resources to access this server.
Then save the settings.
15) On the New Linked Service (Azure SQL Database) Page, Select Test
connection to test the connection. Then Select Create to deploy the linked service.
16) It automatically navigates to the Set Properties dialog box. In Table, select
[dbo].[emp]. Then select OK.
17) To validate the pipeline, select Validate from the toolbar.
18) Once the pipeline runs successfully, select Publish all in the top toolbar.
This publishes the entities (datasets and pipelines) you created to Data Factory.
Select Publish.

19) Select Trigger on the toolbar, and then select Trigger Now. On
the Pipeline Run page, select OK.
20) Go to the Monitor tab on the left. You see a pipeline run that is triggered by a
manual trigger. You can use links under the PIPELINE NAME column to view activity
details and to rerun the pipeline.
21) To see activity runs associated with the pipeline run, select
the CopyPipeline link under the PIPELINE NAME column.
22) Select All pipeline runs at the top to go back to the Pipeline Runs view. To
refresh the view, select Refresh.
23) Verify that the Copy data from Azure Blob storage to a database in Azure SQL
Database by using Azure Data Factory pipeline run has succeeded.
Congratulations! You just used the Copy Data tool to create a pipeline and monitored the
pipeline and activity run successfully.
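If you want one more confirmation from the database side, a quick row count against the sink table shows the copied data:
SELECT COUNT(*) AS CopiedRows
FROM dbo.emp;   -- should include the two rows from emp.txt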
Overview
If you want to follow along with the examples in this tutorial, you'll need Azure Data
Factory in your Azure tenant. Either make sure you have the appropriate permissions to
create it, or opt for a free trial. We cover how to set up Azure Data Factory, a
Storage Account and an Azure SQL Database.
Setup Azure Data Factory
When logged in to the Azure Portal, click on "Create a resource" at the top of your
screen.

Search for "data factory" in the marketplace and choose the result from Microsoft.
On the next page, you'll get an overview of the product. Click on create to get started
with configuring your ADF environment. You need to select a subscription. You can
either create a new resource group (which is a logical container for your resources)
or select an existing one. You need to select a region (take one close by your
location to minimize latency) and choose a name. Finally, you need to select a
version. It's highly recommended you choose V2. Version 1 of ADF is almost never
used and practically all documentation you'll find online is about V2.
Click on Review + create at the bottom. It's possible you might get a validation error
about the Git configuration. Integration with Git and Azure DevOps is out of scope for
this tutorial.

If you get the error, go to the Git configuration tab and select Configure Git later.
When the validation passes, click on Create to have Azure create the ADF
environment for you. This might take a couple of minutes. When the resource is
deployed, you can check it out in the portal.

Typically, you don't spend a lot of time here. You can configure access control to
give people permission to develop in ADF, or you can set up monitoring and alerting.
The actual development itself is done in Azure Data Factory Studio, which is a
separate environment. Click on the Studio icon to go to the development
environment, which should open in a new browser tab.
Setup Storage Account
Before we can start creating pipelines in ADF, we need to set up our source and
destination (called sink in ADF). We begin by creating a storage account in the Azure
Portal. Search for the "storage account" resource in the marketplace and click
on Create.

In the Basics tab, choose your subscription and the same resource group as the
ADF environment. Specify a name for the storage account and choose the same
region as your ADF.
For the redundancy, choose "Locally-redundant storage (LRS)", which is the
cheapest option. Go to the Advanced tab and switch the access tier to Cool. This is
a cheaper option than the default Hot access tier.
Click on Review + Create and then Create to provision your storage account. When
it has been deployed, go to the resource and then to Containers in the Data
Storage section.
Specify "data-input" as the new container name and then click on Create.

Setup Azure SQL Database


Next, we need our destination, which is going to be an Azure SQL database. Search
for "SQL database" in the marketplace and click on Create.
In the Basics tab, choose your subscription and the same resource group as before.
Give the database the name "tutorial".
Before we can create the database however, we need to assign it to a "SQL Server".
This is not an actual SQL Server, but rather a logical container for our databases.
Some configurations are applied to the server level. Since we do not have a server
yet, we need to create it first. Click on Create new to create one.
Specify a name for the server. This will be the server name you'll enter in a database
tool like SQL Server Management Studio or Azure Data Studio to connect to your
database. I choose "adf-tutorial-sql", but if you want to add other databases later on
that have nothing to do with this tutorial, you might want to choose another name.
If you choose SQL authentication, you need to specify a login name for the server
admin and a strong password. If you choose Azure Active Directory authentication,
you'll need to specify an Azure AD admin. You can choose either one of the
authentication methods, or both. If you use Azure AD, click on Set admin. Search for
an AD user you want to grant admin rights.
Once you have selected your admin, click on OK at the bottom to finish the
configuration of the server. Back in the configuration of the database, set the
redundancy to "locally-redundant backup storage".
The default configuration of the database is a bit too pricy for our tutorial, so let's set
this to a cheaper option. Click on Configure database to see the various options.
Switch to DTU-based purchasing and choose the Basic workload. You can see the
price has considerably dropped!

The downside is we can have only 2GB for our database, but that should be plenty
for this tutorial. Just one more setting before we can create our database. In
the Additional settings tab, choose Sample as the data source. This will install the
AdventureWorksLT sample database.

Click on Review + create and then on Create to create the SQL Server and the
Azure SQL database. This might take a couple of minutes. Once the deployment is
done, go to the SQL Server and then to Firewalls and virtual networks, which can be
found in the Security section.
To make sure we can access our database from our machine, we need to add our
current IP address to the firewall. At the top, click on Add client IP. This will add a
new rule to the firewall. Don't forget to click Save at the top!
While we're in the firewall config, let's set the property "Allow Azure services and
resources to access this server" to Yes. This will make our lives a lot easier when we
try to connect to the server from ADF.

In the Overview pane, you can find the name of the server. Hover over it with your
mouse and click the copy icon to copy the name to your clipboard. Start SQL Server
Management Studio (SSMS) or Azure Data Studio to connect to the server. For the
remainder of the tutorial, SSMS is used. In SSMS, create a new connection. Paste
the server name and choose the authentication method you configured earlier. If
you're using Azure AD, don't choose Windows Authentication but rather one of the
Azure AD authentication methods listed: Universal with MFA, Password or
Integrated. The correct one depends on your environment.
Don't click on Connect just yet! First, go to options and enter the database name in
the upper text box.

If you don't do this, SSMS will automatically try to connect to the master database,
which might or might not work, depending on your permissions. You can now click
on Connect. Once you're connected, you can view the tables that were
automatically created for us because we chose the sample database:
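To confirm the sample is in place, you can run a quick query; SalesLT.Customer is one of the tables that ships with the AdventureWorksLT sample chosen earlier:
-- Quick smoke test against the AdventureWorksLT sample database
SELECT TOP (5) CustomerID, FirstName, LastName, CompanyName
FROM SalesLT.Customer
ORDER BY CustomerID;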

Build your first Azure Data Factory Pipeline
Overview
We're going to build a pipeline using the Copy Data tool. This tool makes it easier for people
starting out with ADF to create their first pipelines. Before we start, we need to make sure
some prerequisites are met.
Prerequisites
If you haven't already, follow the steps of the previous part of the tutorial to set up ADF, a
storage account with a blob container and an Azure SQL DB.
In the Azure Portal, go to your storage account and then to the "data-input" container we
created. Click on the Upload link.
A pane will open where you can select a local file. Upload the Customers.csv file, which you
can download here.

Click Upload to put the file in the blob container.


Create the Pipeline
Go to ADF Studio and click on the Ingest tile. This will open the Copy Data tool.
In the first step, we can choose to simply copy data from one location to another, or to create
a more dynamic, metadata-driven task. We'll choose the first option. Parameters and metadata
are covered later in this tutorial.

You can also choose how the resulting pipeline needs to be scheduled. For now, we're going
with "run once now". Schedules and triggers are also discussed later in the tutorial.
In step 2, we need to choose the type of our source data. This will be our csv file in the blob
container. Azure Blob Storage is the first option in the dropdown:
We also need to define the connection to the blob container. Since we don't have any
connections yet in ADF, we need to create a new one by clicking on "New connection". In
the new pane, give the new connection a name and leave the default for the integration
runtime (also covered later in the tutorial). As authentication type, choose account key. Since
the blob storage is in the same Azure tenant as ADF, we can simply choose it from the
dropdowns. Select the correct subscription and the correct storage account.
Finally, you can test your connection. If it is successful, click on Create to create the new
connection. The screen for step 2 should look like this:
We now need to select a file from the connection we just created. Click on Browse to open a
new pane to select the file. Choose the Customers.csv file we uploaded in the prerequisites
section.

ADF will automatically detect it's a csv file and will populate most of the configuration fields
for you.
Make sure the first row is selected as a header. You can do a preview of the data to check if
everything is OK:
Now we need to configure our destination in step 3. Search for "sql" and select Azure SQL
Database from the dropdown list.
Like with the source, we will also need to define a new connection. Give it a name and select
the correct subscription, server, and database from the dropdowns. If everything is in the same
Azure tenant, this should be straightforward.
Choose the authentication type that you configured during the setup of the SQL Server. In the
following screenshot, I chose SQL authentication, so I need to supply a username and a
password.
You can test the connection to see if everything works. Make sure you gave Azure Services
access to the SQL server – as shown in the previous part of the tutorial – or you will get a
firewall error. Once the connection is created, we need to choose the destination table. You
can either choose an existing table or let ADF create one for you. Fill in dbo as the schema
and Tutorial_StagingCustomer as the table name.

Next, we need to define the mapping. A mapping defines how each column of the source is
mapped against the columns of the destination. Since ADF is creating the table, everything
should be mapped automatically.
If you want, you can supply a pre-copy script. This is a SQL statement that will be executed
right before the data is loaded. In a recurring pipeline, you can for example issue a
TRUNCATE TABLE statement to empty the table. Here it would fail, since ADF first needs
to create the table. If you try to truncate it, it will fail since the table doesn't exist yet.
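If you do want a pre-copy script on a first run where the table may not exist yet, one option (a sketch, not something the wizard does for you) is to guard the truncate with an existence check:
-- Only truncate when the destination table already exists
IF OBJECT_ID('dbo.Tutorial_StagingCustomer', 'U') IS NOT NULL
    TRUNCATE TABLE dbo.Tutorial_StagingCustomer;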
Now we're in step 4 of the tool and we can define general settings for the pipeline. You can
change the name of the pipeline. Leave everything else to the defaults.

In the final step, you can review all the configurations we made in the previous steps.
Click Next. ADF will create the pipeline and will run it once.
We can verify a pipeline has been created when we check the factory resources by clicking
the pencil icon in the left menu.

We can also check in the database itself that a new table has been created and has been
populated with the data from the CSV file:
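A quick way to do that check from SSMS or the query editor (using the schema and table name entered in step 3):
SELECT COUNT(*) AS LoadedRows
FROM dbo.Tutorial_StagingCustomer;   -- should match the number of rows in Customers.csv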
Additional Information
 The Copy Data Tool has many more settings you can play with. Why don't you try
them out?
 Check out Getting Started with Azure Data Factory - Part 1 for another example of
how to create a pipeline.

Azure Data Factory Linked Services
Overview
Now that we've created our first pipelines, it is time to delve a bit deeper into the
inner workings of ADF. Let's start with linked services.
The Purpose of Linked Services
In the previous step of the tutorial, every time we created a new connection in the
Copy Data tool, we were creating a new linked service. A linked service is a
connection to a specific service or data store that can either be a source of data, or a
destination (also called target or sink). People who have worked with Integration
Services (SSIS) before will recognize this concept; a linked service can be compared
with a project connection manager in SSIS.
A linked service stores the connection string, as well as the method used to
authenticate with the service. Once a linked service is created, you can reuse it
everywhere. For example, if you have a data warehouse in Azure SQL Database,
you will only need to define this connection once.
Linked services can be found in the Manage section of ADF Studio (lowest icon in
the left menu bar).
There we can find the two linked services we created in the previous part:

Which Linked Services are there?


There are many different types of linked services. There are a couple of categories
available:
 Azure. Services like Blob Storage, Cosmos DB (with each different API), Data
Explorer, Data Lake Storage, Key Vault, Databricks Delta Lake, Table
Storage, Synapse Analytics and so on.
 Database. All the Azure databases (Azure SQL DB, Azure SQL DB Managed Instance,
Azure Database for MySQL/PostgreSQL/MariaDB …), but also on-premises
SQL Server and other vendors like Amazon RDS, Amazon Redshift, Apache
Impala, DB2, Google BigQuery, Hive, PostgreSQL, Oracle, SAP BW, SAP
HANA, Spark, Sybase, Teradata and many others.
 File. Amazon S3, FTP, File System (on-premises), Google Cloud Storage,
HDFS, HTTP, Oracle Cloud Storage and SFTP.
 NoSQL. Cassandra, Couchbase and MongoDB.
 Services and apps. Dataverse, Dynamics, GitHub, Jira, Office 365, PayPal,
REST, Salesforce, Snowflake and many others.
 Generic protocol. When all else fails: ODBC, OData, REST and SharePoint
Online List.
This list is not exhaustive and is continuously updated. Keep an eye on the official
documentation for updates.

Keep in mind that for on-premises data sources (and some online data sources) we
need a special integration runtime, which will be covered later in the tutorial.
Creating a Linked Service Manually
In the Manage section, go to Linked Services and click on New. Search for Azure
SQL Database.
Give a name to the new linked service and use the default integration runtime.
Instead of choosing SQL authentication or Azure AD authentication, this time we're
going to use System Assigned Managed Identity. This means we're going to log
into Azure SQL DB using the user credentials of ADF itself. The advantage here is
we don't need to specify users or passwords in the linked service.
However, to make this work, we need to add ADF as a user into our database. When
logged into the database using an Azure AD user with the necessary permissions,
open a new query window and execute the following query:
CREATE USER [mssqltips-adf-tutorial] FOR EXTERNAL PROVIDER;
Next, we need to assign permissions to this user. Typically, ADF will need to read
and write data to the database, so we will add this user to
the db_datareader and db_datawriter roles. If ADF needs to be able to truncate
tables or to automatically create new tables, you can add the user to
the db_ddladmin role as well.
ALTER ROLE db_datareader ADD MEMBER [mssqltips-adf-tutorial];
ALTER ROLE db_datawriter ADD MEMBER [mssqltips-adf-tutorial];
ALTER ROLE db_ddladmin ADD MEMBER [mssqltips-adf-tutorial];
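To confirm the user and its role memberships were created as expected, you can query the database metadata (a quick check; the user name is the one created above):
-- List the roles the ADF managed identity user belongs to
SELECT dp.name AS UserName, r.name AS RoleName
FROM sys.database_principals AS dp
JOIN sys.database_role_members AS drm ON drm.member_principal_id = dp.principal_id
JOIN sys.database_principals AS r ON r.principal_id = drm.role_principal_id
WHERE dp.name = 'mssqltips-adf-tutorial';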
Now we can test our connection in ADF and create it:
Click on Publish to persist the new linked service to the ADF environment.
Linked Services Best Practices
A couple of best practices (or guidelines if you want) for creating linked services in
ADF:
 Use a naming convention. For example, prefix connections to SQL Server with
SQL_ and connections to Azure Blob Storage with BLOB_. This will make it
easier for you to keep the different types of linked services apart.
 If you have multiple environments (for example a development and a
production environment), use the same name for a connection in all
environments. For example, don't call a connection to your development data
warehouse "dev_dwh", but rather "SQL_dwh". Having the same name will
make it easier when you automate deployments between environments.
 If you cannot use managed identities and you need to specify usernames and
passwords, store them in Azure Key Vault instead of directly embedding them
in the Linked Service. Key Vault is a secure storage for secrets. It has the
advantage of centralizing your secrets. If for example a password or
username changes, you only need to update it at one location. You can find
an introduction to Azure Key Vault in the tip Microsoft Azure Key Vault for
Password Management for SQL Server Applications.
Additional Information
 The tip Create Azure Data Lake Linked Service Using Azure Data
Factory explains how to create a linked service for Azure Data Lake Analytics.
 Recently a linked service for Snowflake was introduced. You can check it out
in the tip Copy Data from and to Snowflake with Azure Data Factory.
 If you want to create a connection to a file, such as Excel or CSV, you need to
create a linked service to the data store where the file can be found. For
example: Azure Blob Storage or your local file system.
 For the moment, only SharePoint Lists are supported for SharePoint Online.
Reading documents inside a SharePoint library is currently not supported by
ADF.
Azure Data Factory Datasets
Overview
Once you've defined a linked service, ADF knows how to connect and authenticate
with a specific data store, but it still doesn't know what the data looks like. In this
section we explore what datasets are and how they are used.
The Purpose of Datasets
Datasets are created for that purpose: they specify what the data looks like. In the
case of a flat file, for example, they specify which delimiters are used, whether there are
text qualifiers or escape symbols, and whether the first row is a header. In the
case of a JSON file, a dataset can specify the location of the file and which
compression or encoding is used. If the dataset is used for a SQL Server table, it
will just specify the schema and the name of the table. What all types of datasets
have in common is that they can specify a schema (not to be mistaken with a
database schema like "dbo"), which is the columns and their data types that are
included in the dataset.
Datasets are found in the Author section of ADF Studio (the pencil icon). There you
can find the two datasets that were created in a previous part of the tutorial with the
Copy Data tool.

Which Types of Datasets are there?


For most linked services, a dataset type typically maps 1-to-1 to the linked service
(see the previous part for a list of linked service types). For example, if the linked
service is SQL Server (or any other relational database), the dataset will be a table.
If you choose some type of file storage (e.g. Azure Blob Storage, Azure Data Lake,
Amazon S3 etc.), you can choose between a list of supported file types:
Keep in mind that some dataset types can be supported as a source, but not as a sink. An
example of this scenario is the Excel format. Some types of datasets can also be used
in certain activities in a pipeline, but not in other activities. For a good overview of
what is possible for which type of dataset, check out the official documentation.
Creating a Dataset Manually
We're going to create a dataset that reads in an Excel file (with the exact same
customer data as in the previous parts). You can download the Excel file here.
Upload it to the same blob container we used before.
In the Author section, expand the datasets section, hover with your mouse over the
ellipsis and choose New dataset in the popup.
For the data store, choose Azure Blob Storage.

In the next screen, choose Excel as the file format.


In the properties window, give the dataset a name, choose the blob linked service we
created in the previous parts and browse to the Excel file in the blob container.
Choose the Customers worksheet, set the first row as header and choose to import
the schema from the connection.
Click OK to create the dataset. In the Connection tab, we can see the properties we
just configured, but now we can also preview the data.
If all went well, the preview should look like this:
In the Schema tab, we can view the different columns of the Excel worksheet and
their datatypes.
Click on Publish to persist the dataset to the ADF environment.
Dataset Best Practices
Like linked services, there are a couple of best practices that you can use:
 Use a naming convention. For example, prefix datasets that describe a table
in SQL Server with SQL_. For file types you can use, for example, CSV_,
EXCEL_, AVRO_ and so on. This will make it easier for you to keep the
different types of datasets apart.
 Like with linked services, if you have multiple environments use the same
name for a dataset in all environments.
 You can create folders to organize your datasets. This is helpful if you have a
large environment with dozens of datasets. You can organize them by source
or by file type for example. You can also create subfolders.
Building an Azure Data Factory Pipeline Manually
Overview
In the previous parts of the tutorial, we've covered all the building blocks for a
pipeline: linked services, datasets and activities. Now let's create a pipeline from
scratch.
Prerequisites
We'll be using objects that were created in the previous steps of the tutorial. If you
haven't created these yet, it is best you do so if you want to follow along.
We will be reading the Excel file from the Azure blob container and store the data in
a table in the Azure SQL database. We're also going to log some messages into a
log table.
When logged in to the database, execute the following script to create the
destination table:
DROP TABLE IF EXISTS dbo.Tutorial_Excel_Customer;

CREATE TABLE dbo.Tutorial_Excel_Customer(
[Title] [NVARCHAR](10) NULL,
[FirstName] [NVARCHAR](50) NULL,
[MiddleName] [NVARCHAR](50) NULL,
[LastName] [NVARCHAR](50) NULL,
[Suffix] [NVARCHAR](10) NULL,
[CompanyName] [NVARCHAR](100) NULL,
[EmailAddress] [NVARCHAR](250) NULL,
[Phone] [NVARCHAR](25) NULL
);
We're explicitly creating the table ourselves, because when ADF reads data from a
semi-structured file like Excel or CSV, it cannot determine the correct data types
and will set all columns to NVARCHAR(MAX). For example, this is the table that
was created with the Copy Data tool:

We're also going to create a logging table in a schema called "etl". First execute this
script:
CREATE SCHEMA etl;
Then execute the following script for the log table:
CREATE TABLE etl.logging(
ID INT IDENTITY(1,1) NOT NULL
,LogMessage VARCHAR(500) NOT NULL
,InsertDate DATE NOT NULL DEFAULT SYSDATETIME()
);
Since we have a new destination table, we also need a new dataset. In
the Author section, go to the SQL dataset that was created as part of the Copy Data
tool (this should be "DestinationDataset_eqx"). Click on the ellipsis and
choose Clone.

This will make an exact copy of the dataset, but with a different name. Change the
name to "SQL_ExcelCustomers" and select the newly created table from the
dropdown:
In the Schema tab, we can import the mapping of the table.

Publish the new dataset.


Building the Pipeline
Go to the Author section of ADF Studio and click on the blue "+"-icon. Go to pipeline
> pipeline to create a new pipeline.
Start by giving the new pipeline a decent name.

Next, add a Script activity to the canvas and name it "Log Start".
In the General tab, set the timeout to 10 minutes (the default is 7 days!). You can
also set the number of retries to 1. This means if the Script activity fails, it will wait for
30 seconds and then try again. If it fails again, then the activity will actually fail. If it
succeeds on the second attempt, the activity will be marked as succeeded.
In the Settings tab, choose the linked service for the Azure SQL DB and set the
script type to NonQuery. The Query option means the executed SQL script will
return one or more result sets. The NonQuery option means no result set is returned
and is typically used to execute DDL statements (such as CREATE TABLE, ALTER
INDEX, TRUNCATE TABLE …) or DML statements that modify data (INSERT,
UPDATE, DELETE). In the Script textbox, enter the following SQL statement:
INSERT INTO etl.logging(LogMessage)
VALUES('Start reading Excel');
The settings should now look like this:

Next, drag a Copy Data activity to the canvas. Connect the Script activity with the
new activity. Name it "Copy Excel to SQL".

In the General tab, change the timeout and the number of retries:


In the Source tab, choose the Excel dataset we created earlier. Disable
the Recursively checkbox.

In this example we're reading from one single Excel file. However, if you have
multiple Excel files of the same format, you can read them all at the same time by
changing the file path type to a wildcard, for example "*.xlsx".
In the Sink tab, choose the SQL dataset we created in the prerequisites section.
Leave the defaults for the properties and add the following SQL statement to the pre-
copy script:
TRUNCATE TABLE dbo.Tutorial_Excel_Customer;
The Sink tab should now look like this:
In the Mapping tab, we can explicitly map the source columns with the sink columns.
Hit the Import Schemas button to let ADF do the mapping automatically.
In this example, doing the mapping isn't necessary since the columns from the
source map 1-to-1 to the sink columns. They have the same names and data types.
If we would leave the mapping blank, ADF will do the mapping automatically when
the pipeline is running. Specifying an explicit mapping is more important when the
column names don't match, or when the source data is more complex, for example a
hierarchical JSON file.
In the Settings tab we can specify some additional properties.
An important property is the number of data integration units (DIU), which are a
measure of the power of the compute executing the copy. As you can see in the
informational message, this directly influences the cost of the Copy data activity. The
price is calculated as $0.25 (this might vary depending on your subscription and currency) * the copy duration (remember this is always at least one minute and rounded up to the next full minute!) * the number of DIUs used. The default value for DIU is Auto, meaning ADF will scale the number of DIUs for you automatically. Possible values are between 2 and 256. For small data loads ADF will start with a minimum of 4 DIUs, but for a small Excel file like ours this is already overkill. If you know your dataset is
going to be small, change the property from Auto to 2. This will reduce the price of
your copy data activities by half!
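As a rough worked example (a sketch, assuming the $0.25 rate quoted above is billed per DIU-hour, which may differ per subscription), a copy that finishes within one rounded-up minute costs roughly:

\[ \$0.25 \times \tfrac{1}{60}\,\text{h} \times 2\,\text{DIU} \approx \$0.008 \qquad \text{versus} \qquad \$0.25 \times \tfrac{1}{60}\,\text{h} \times 4\,\text{DIU} \approx \$0.017 \]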

As a final step, copy/paste the Script activity. Change the name to "Log End" and
connect the Copy Data activity with this new activity.

In the Settings tab, change the SQL script to the following statement:


INSERT INTO etl.logging(LogMessage)
VALUES('Finish copying Excel');
The pipeline is now finished. Hit the debug button to start executing the pipeline in
debug mode.

After a while the pipeline will finish. You can see in the Output pane how long each
activity has been running:
If you hover with your mouse over a line in the output, you will get icons for the input
& output, and in the case of the Copy Data activity you will get an extra "glasses"
icon for more details.

When we click on the output for the "Log End" activity, we get the following:

We can see 1 row was inserted. When we go to the details of the Copy Data, we get
the following information:
A lot of information is captured, such as the number of rows read, how many connections were used, how many KB were written to the database and so on. Back in the Output pane, there's a link to the debug run consumption.

This will tell us exactly how many resources the debug run of the pipeline consumed: 0.0333 DIU-hours corresponds to two DIU-minutes (1 minute of execution, rounded up, multiplied by 2 DIUs).
Since our debug run was successful, we can publish everything.
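If you want to double-check the results in the database first, a couple of quick queries against the tables we created earlier will do (a minimal optional check):
SELECT COUNT(*) AS CustomerRows FROM dbo.Tutorial_Excel_Customer; -- should match the number of rows in the Excel worksheet
SELECT ID, LogMessage, InsertDate FROM etl.logging ORDER BY ID DESC; -- shows the start/finish messages written by the Script activities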
Why do we need to Publish?
When you create new objects such as linked services, datasets and pipelines, or
when you modify existing ones, those changes are not automatically persisted on the
server. You can first debug your pipelines to make sure your changes are working.
Once everything works fine and validation succeeds, you can publish your changes to the server. If you do not publish your changes and you close your browser session, your changes will be lost.
Building Flexible and Dynamic Azure Data Factory Pipelines
Overview
In the previous part we built a pipeline manually, along with the needed datasets and
linked services. But what if you need to load 20 Excel files? Or 100 tables from a
source database? Are you going to create 100 datasets? And 100 different
pipelines? That would be too much (repetitive) work! Luckily, we can have flexible
and dynamic pipelines where we just need two datasets (one for the source, one for
the sink) and one pipeline. Everything else is done through metadata and some
parameters.
Prerequisites
Previously we uploaded an Excel file from Azure Blob Storage to a table in Azure
SQL Database. A new requirement came in and now we must upload another Excel
file to a different table. Instead of creating a new dataset and a new pipeline (or add
another Copy Data activity to the existing pipeline), we're going to reuse our existing
resources.
The new Excel file contains product data, and it has the following structure:
As you can see from the screenshot, the worksheet name is the default "Sheet1".
You can download the sample workbook here. Upload the Excel workbook to the
blob container we used earlier in the tutorial.
Since we want to store the data in our database, we need to create a new staging
table:
CREATE TABLE dbo.Tutorial_StagingProduct
(
[Name] NVARCHAR(50)
,[ProductNumber] NVARCHAR(25)
,[Color] NVARCHAR(15)
,[StandardCost] NUMERIC(10,2)
,[ListPrice] NUMERIC(10,2)
,[Size] NVARCHAR(5)
,[Weight] NUMERIC(8,2)
);
Implement Parameters
Instead of creating two new datasets and another Copy Data activity, we're going to
use parameters in the existing ones. This will allow us to use one single dataset for
both our Excel files. Open the Excel_Customers dataset, go to properties and
rename it to Excel_Generic.
Then go to the Parameters tab, and create the following two parameters:

Back in the Connection tab, click on Customers.xlsx and then on "Add dynamic content".

This will take us to the expression builder of ADF. Choose the parameter WorkbookName from the list below.
The file path should now look like this:

Repeat the same process for the sheet name:


Both Excel files have the first row as a header, so the checkbox can remain checked,
but this is something that can be parameterized as well. Finally, go to
the Schema tab and click the Clear button to remove all metadata information from
the dataset:

The schema is different for each Excel file, so we cannot have any column
information here. It will be fetched on the fly when the Copy Data activity runs.
We're going to do the exact same process for our SQL dataset. First, we rename it
to SQL_Generic and then we add two parameters: SchemaName and TableName.
We're going to map these in the connection tab. If you enable the "Edit" checkbox,
two text fields appear (one for the schema and one for the table) which you can
parameterize:
Don't forget to clear the schema! Go to the StageExcelCustomers pipeline and
rename it to "StageExcel". If we open the Copy Data activity, we can see ADF asks
us now to provide values for the parameters we just added.

You can enter them manually, but that would defeat the purpose of our metadata-
driven pipeline.
Creating and Mapping Metadata
We're going to store the metadata we need for our parameters in a table. We're
going to read this metadata and use it to drive a ForEach loop. For each iteration of
the loop, we're going to copy the data from one Excel file to a table in Azure SQL
DB. Create the metadata table with the following script:
CREATE TABLE etl.ExcelMetadata(
ID INT IDENTITY(1,1) NOT NULL
,ExcelFileName VARCHAR(100) NOT NULL
,ExcelSheetName VARCHAR(100) NOT NULL
,SchemaName VARCHAR(100) NOT NULL
,TableName VARCHAR(100) NOT NULL
);
Insert the following two rows of data:
INSERT INTO etl.ExcelMetadata
(
ExcelFileName,
ExcelSheetName,
SchemaName,
TableName
)
VALUES
('Customers.xlsx','Customers','dbo','Tutorial_Excel_Customer')
,
('Products.xlsx' ,'Sheet1' ,'dbo','Tutorial_StagingProduct')
;
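Onboarding yet another Excel file later would then only require a new metadata row (plus a matching staging table). As a sketch, a hypothetical Orders workbook could be registered like this (the file, sheet and table names below are placeholders, not part of this tutorial):
-- Hypothetical example row for a third Excel file
INSERT INTO etl.ExcelMetadata (ExcelFileName, ExcelSheetName, SchemaName, TableName)
VALUES ('Orders.xlsx', 'Sheet1', 'dbo', 'Tutorial_StagingOrders');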
In the pipeline, add a Lookup activity to the canvas after the first Script activity. Give
the activity a decent name, set the timeout to 10 minutes and set the retry to 1.

In the Settings, choose the generic SQL dataset. Disable the checkbox for "First row
only" and choose the Query type. Enter the following query:
SELECT
ExcelFileName
,ExcelSheetName
,SchemaName
,TableName
FROM etl.ExcelMetadata;
Since we're specifying a query, we don't actually need to provide (real) values for the
dataset parameters; we're just using the dataset for its connection to the Azure SQL
database.

Preview the data to make sure everything has been configured correctly.

Next, we're going to add a ForEach to the canvas. Add it after the Lookup and
before the second Script activity.
Select the Copy Data activity, cut it (using Ctrl+X), and click the pencil icon inside the ForEach activity. This will open a pipeline canvas inside the ForEach loop. Paste the Copy Data activity there. At the top left corner of the canvas, you can see that we're inside the loop, which is in the StageExcel pipeline. It seems like there's a "mini pipeline" inside the ForEach, but its functionality is limited. You can't, for example, put another ForEach loop inside the existing ForEach. If you need to nest loops,
you'll need to put the second ForEach in a separate pipeline and call this pipeline
from the first ForEach using the Execute Pipeline activity. Go back to the pipeline
by clicking on its name.

Go to the Settings pane of the ForEach. Here we need to configure over which items we're going to iterate. This can be an array variable, or a result set such as the one from our Lookup activity.
Click on "Add dynamic content" for the Items. In the "Activity outputs" node, click on
the Lookup activity.

This will add the following expression:


@activity('Get Metadata').output
However, to make this actually work, we need to append .value at the end:
@activity('Get Metadata').output.value
In the settings, we can also choose if the ForEach executes in parallel, or if it will
read the Excel files sequentially. If you don't want parallelism, you need to select
the Sequential checkbox.
Now go back into the ForEach loop canvas and into the Copy Data activity. Now we
can map the metadata we retrieve from the Lookup to the dataset parameters. In
the Source pane, click on the text box for the WorkbookName parameter and go to
the dynamic content.

We can access the values of the current item of the ForEach loop by using the item() function. We just need to specify exactly which column we want; for the workbook name this is @item().ExcelFileName:

We can repeat the same process for the sheet name, using @item().ExcelSheetName:


And of course, we do the same for the SQL dataset in the Sink tab, mapping @item().SchemaName and @item().TableName to the dataset parameters:

We also need to change the Pre-copy script, to make sure we're truncating the
correct table. Like most properties, we can do this through an expression as well.
We're going to use the @concat() function to create a SQL statement along with the
values for the schema and table name.
@concat('TRUNCATE TABLE
',item().SchemaName,'.',item().TableName,';')
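For the two metadata rows we inserted earlier, this expression resolves at run time to statements like:
TRUNCATE TABLE dbo.Tutorial_Excel_Customer;
TRUNCATE TABLE dbo.Tutorial_StagingProduct;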

Finally, we need to remove the schema mapping in the Mapping pane. Since both
the source and the sink are dynamic, we can't specify any mapping here unless it is
the same for all Excel files (which isn't the case). If the mapping is empty, the Copy
Data activity will do it for us on-the-fly. For this to work, the column names in the Excel file and in the corresponding table need to match!

The pipeline is now ready to run.


Debugging the Pipeline
Start debugging of the pipeline. In the output pane, you'll see the Copy Data activity
has been run twice, in parallel.

We've now successfully loaded two Excel files to an Azure SQL database by using
one single pipeline driven by metadata. This is an important pattern for ADF, as it
greatly reduces the amount of work you need to do for repetitive tasks. Keep in mind
though, that each iteration of the ForEach loop results in at least one minute of
billing. Even though our debugging pipeline was running for a mere 24 seconds,
we're being billed for 5 minutes (2 Script activities + 1 Lookup + 2 iterations of the
loop).

Additional Information
Azure Data Factory Integration Runtimes
Overview
In this tutorial we have been executing pipelines to get data from a certain source
and write it to another destination. The Copy Data activity for example provides us
with an auto-scalable source of compute that will execute this data transfer for us. But what is this compute exactly? Where does it reside? The answer is: integration runtimes. These runtimes provide us with the necessary computing power to execute all the different kinds of activities in a pipeline. There are 3 types of integration runtimes (IR), which we'll discuss in the following sections.
The Azure-IR
The most important integration runtime is the one we've been using all this time:
the Azure-IR. Every installation of ADF has a default IR:
the AutoResolveIntegrationRuntime. You can find it when you go to
the Manage section of ADF and then click on Integration Runtimes.

It's called auto resolve because it will try to automatically resolve the geographic region in which the compute needs to run. This is determined for example by the data store
of the sink in a Copy Data activity. If the sink is located in West Europe, it will try to
run the compute in the West Europe region as well.
The Azure-IR is a fully managed, serverless compute service. You don't have to manage anything; you only pay for the duration it runs compute. You
can always use the default Azure-IR, but you can also create a new one. Click
on New to create one.

In the new window, choose the option with "Azure, Self-Hosted".


In the next step, choose Azure again.

In the following screen, enter a name for the new IR. Also choose your closest
region.
You can also configure the IR to use a Virtual Network, but this is an advanced
setting that is not covered in the tutorial. Keep in mind that billing for pipeline durations is several orders of magnitude higher when you're using a virtual network. In the
third pane, we can configure the compute power for data flows. Data flows are
discussed in the next section of the tutorial.

There are two main reasons to create your own Azure-IR:


 You want to specify a specific region for your compute. For example, if
regulations specify your data can never leave a certain region, you need to
create your own Azure-IR located in that region.
 You want to specify a data flow runtime with different settings than the default
one. Especially the Time To Live setting is something that is worth changing
(shorter if you want to save on costs, longer if you don't want to restart your
cluster too often during development/debugging).
Click on Create to finish the setup of the new Azure-IR. But how do we use this IR?
If we go for example to the linked service connecting to our Azure SQL database, we
can specify a different IR:

The Self-hosted IR
Suppose you have data on-premises that you need to access from ADF. How can
ADF, which runs in the Azure cloud, reach this on-premises data store? The self-hosted IR provides
us with a solution. You install the self-hosted IR on one of your local machines. This
IR will then act as a gateway through which ADF can reach the on-premises data.
Another use case for the self-hosted IR is when you want to run compute on your
own machines instead of in the Azure cloud. This might be an option if you want to
save costs (the billing for pipeline durations is lower on the self-hosted IR than on
the Azure-IR) or if you want to control everything yourself. ADF will then act as an
orchestrator, while all of the compute is running on your own local servers.
It's possible to install multiple self-hosted IRs on your local network to scale out
resources. You can also share a self-hosted IR between multiple ADF environments.
This can be useful if you want only one self-hosted IR for both development and
production.
The following tips give more detail about this type of IR:
 Connect to On-premises Data in Azure Data Factory with the Self-hosted
Integration Runtime - Part 1 and Part 2.
 Transfer Data to the Cloud Using Azure Data Factory
 Build Azure Data Factory Pipelines with On-Premises Data Sources
The Azure-SSIS IR
ADF provides us with the opportunity to run Integration Services packages inside the
ADF environment. This can be useful if you want to quickly migrate SSIS projects to
the Azure cloud, without a complete rewrite of your projects. The Azure-SSIS IR
provides us with a scale-out cluster of virtual machines that can run SSIS packages.
You create an SSIS catalog in either Azure SQL Database or Azure SQL Managed Instance.
As usual, Azure deals with the infrastructure. You only need to specify how powerful
the Azure-SSIS IR is by configuring the size of a compute node and how many
nodes there need to be. You are billed for the duration the IR is running. You can
pause the IR to save on costs.

Azure Data Factory Data Flows


Overview
During the tutorial we've mentioned data flows a couple of times. The activities in a
pipeline don't really support data transformation scenarios. The Copy Data activity
can transform data from one format to another (for example, from a hierarchical
JSON file to a table in a database), but that's about it. Typically, you load data from
one or more sources into a destination and you do the transformations over there.
E.g., you can use SQL in a database, or notebooks in Azure Databricks when the
data is stored in a data lake. This makes ADF a great ELT tool (Extract -> Load ->
Transform), but not so great for ETL. Data flows were introduced to remedy this.
They are an abstraction layer on top of Azure Databricks. They intuitively provide
you with an option to create ETL flows in ADF, without having to write any code (like
you would need to do if you worked directly in Azure Databricks). There are two
types of data flows:
 The data flow (which was previously called the "mapping data flow").
 Power Query (which was previously called the "wrangling data flow").
Data Flow
A data flow in ADF uses the Azure-IR integration runtime to spin up a cluster of
compute behind the scenes (see the previous part about runtimes on how to
configure your own). This cluster needs to be running if you want to debug or run
your data flow.

Data flows in ADF use a visual representation of the different sources, transformations, and sinks; all connected with precedence constraints. They
resemble data flows in Integration Services. Here's an example from the tip What are
Data Flows in Azure Data Factory?. This tip gives a step-by-step example of how to
create a data flow and how to integrate it into a pipeline.
Because you need a cluster to run a data flow, data flows are not well-suited for processing small data sets, since there's the overhead of the cluster start-up time.
Power Query
The Power Query data flow is an implementation of the Power Query engine in ADF.
When you run a Power Query in ADF, the Power Query mash-up will be translated
into a data flow script, which will then be run on the Azure Databricks cluster. The
advantage of Power Query is that you can see the data and the results of your
transformations as you're applying them. Users who have been working with Excel,
Power BI Desktop or Power BI Data Flows are also already familiar with the editor.
You can find an example of a Power Query mash-up in the tip What are Data Flows
in Azure Data Factory? as well.
The disadvantage of Power Query is that not all functionality of the regular Power
Query (as you would have in Power BI Desktop for example) is available in ADF.
You can find a list of the limitations in the documentation.
Azure Data Factory Scheduling and Monitoring
Overview
When you've created your pipelines, you're not going to run them in debug mode
every time you need to transfer some data. Rather, you want to schedule your
pipelines so that they run at pre-defined points in time or when a certain event
happens. When using Integration Services projects, you would use for example SQL
Server Agent to schedule the execution of your packages.
Scheduling
In ADF, a "schedule" is called a trigger, and there are a couple of different types:
 Run-once trigger. In this case, you are manually triggering your pipeline so
that it runs once. The difference between the manual trigger and debugging
the pipeline, is that with a trigger you're using the pipeline configuration that is
saved to the server. With debugging, you're running the pipeline as it is in the
visual editor.
 Scheduled trigger. The pipeline is being run on schedule, much like SQL
Server Agent has schedules. You can for example schedule a pipeline to run
daily, weekly, every hour and so on.
 Tumbling window trigger. This type of trigger fires at a periodic interval. A
tumbling window is a series of fixed-sized, non-overlapping time intervals. For
example, you can have a tumbling window for each day. You can set it to start
at the first of this month, and then it will execute for each day of the month.
Tumbling triggers are great for loading historical data (e.g. initial loads) in a
"sliced" manner instead of loading all data at once.
 Event-based trigger. You can trigger a pipeline to execute every time a
specific event happens. You can start a pipeline if a new file arrives in a Blob
container (storage event), or you can define your own custom events in Azure
Event Grid.
Let's create a trigger for the pipeline we created earlier. In the pipeline, click on Add
Trigger.

If you choose "Trigger Now", you will create a run-once trigger. The pipeline will run
and that's it. If you choose "New/Edit", you can either create a trigger or modify an
existing one. In the Add triggers pane, open the dropdown and choose New.
The default trigger type is Schedule. In the example below, we've scheduled our
pipeline to run every day, for the hours 6, 10, 14 and 18.
Once the trigger is created, it will start running and execute the pipeline according to
schedule. Make sure to publish the trigger after you've created it. You can view
existing triggers in the Manage section of ADF.
You can pause an existing trigger, or you can delete it or edit it. For more information
about triggers, check out the following tips:
 Create Event Based Trigger in Azure Data Factory
 Create Schedule Trigger in Azure Data Factory ADF
 Create Tumbling Window Trigger in Azure Data Factory ADF
ADF has a REST API which you can also use to start pipelines. You can for example
start a pipeline from an Azure Function or an Azure Logic App.
Monitoring
ADF has a monitoring section where you can view all executed pipelines, whether they were triggered or run in debug mode.

You can also view the state of the integration runtimes or view more info about the
data flows debugging sessions. For each pipeline run, you can view the exact output
and the resource consumption of each activity and child pipeline.
It's also possible to configure Log analytics for ADF in the Azure Portal. It's out of
scope for this tutorial, but you can find more info in the tip Setting up Azure Log
Analytics to Monitor Performance of an Azure Resource. You can check out the
Monitoring section for the ADF resource in the Azure Portal:
 
You can choose the type of events that are being logged:
Azure Data Factory Pipeline Logging Error Details
By: Ron L'Esteve   |   Updated: 2021-01-20
Problem
In my previous article, Logging Azure Data Factory Pipeline Audit Data, I discussed a
variety of methods for capturing Azure Data Factory pipeline logs and persisting the data to
either a SQL Server table or within Azure Data Lake Storage Gen2. While this process of
capturing pipeline log data is valuable when the pipeline activities succeed, how can we also
capture and persist error details related to Azure Data Factory pipelines when activities
within the pipeline fail?
Solution
In this article, I will cover how to capture and persist Azure Data Factory pipeline errors to an
Azure SQL Database table. Additionally, we will re-cap the pipeline parameter process that I
had discussed in my previous articles to demonstrate how the pipeline_errors, pipeline_log,
and pipeline_parameter relate to each other.
Explore and Understand the Meta-Data driven ETL Approach
Prior to continuing with the demonstration, read my previous articles as a prerequisite to gain background knowledge on the end-to-end meta-data driven ETL process.
 Azure Data Factory Pipeline to fully Load all SQL Server Objects to ADLS Gen2
 Load Data Lake files into Azure Synapse Analytics Using Azure Data Factory
 Loading Azure SQL Data Warehouse Dynamically using Azure Data Factory
 Logging Azure Data Factory Pipeline Audit Data
To re-cap the tables needed for this process, I have included the diagram below which
illustrates how the pipeline_parameter, pipeline_log, and pipeline_error tables are
interconnected with each other.
Create a Parameter Table
The following script will create the pipeline_parameter table with column parameter_id as the
primary key. Note that this table drives the meta-data ETL approach.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[pipeline_parameter](


[PARAMETER_ID] [int] IDENTITY(1,1) NOT NULL,
[server_name] [nvarchar](500) NULL,
[src_type] [nvarchar](500) NULL,
[src_schema] [nvarchar](500) NULL,
[src_db] [nvarchar](500) NULL,
[src_name] [nvarchar](500) NULL,
[dst_type] [nvarchar](500) NULL,
[dst_name] [nvarchar](500) NULL,
[include_pipeline_flag] [nvarchar](500) NULL,
[partition_field] [nvarchar](500) NULL,
[process_type] [nvarchar](500) NULL,
[priority_lane] [nvarchar](500) NULL,
[pipeline_date] [nvarchar](500) NULL,
[pipeline_status] [nvarchar](500) NULL,
[load_synapse] [nvarchar](500) NULL,
[load_frequency] [nvarchar](500) NULL,
[dst_folder] [nvarchar](500) NULL,
[file_type] [nvarchar](500) NULL,
[lake_dst_folder] [nvarchar](500) NULL,
[spark_flag] [nvarchar](500) NULL,
[dst_schema] [nvarchar](500) NULL,
[distribution_type] [nvarchar](500) NULL,
[load_sqldw_etl_pipeline_date] [datetime] NULL,
[load_sqldw_etl_pipeline_status] [nvarchar](500) NULL,
[load_sqldw_curated_pipeline_date] [datetime] NULL,
[load_sqldw_curated_pipeline_status] [nvarchar](500) NULL,
[load_delta_pipeline_date] [datetime] NULL,
[load_delta_pipeline_status] [nvarchar](500) NULL,
PRIMARY KEY CLUSTERED
(
[PARAMETER_ID] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON
[PRIMARY]
) ON [PRIMARY]
GO
Create a Log Table
This next script will create the pipeline_log table for capturing the Data Factory success logs.
In this table, column log_id is the primary key and column parameter_id is a foreign key with
a reference to column parameter_id from the pipeline_parameter table.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[pipeline_log](


[LOG_ID] [int] IDENTITY(1,1) NOT NULL,
[PARAMETER_ID] [int] NULL,
[DataFactory_Name] [nvarchar](500) NULL,
[Pipeline_Name] [nvarchar](500) NULL,
[RunId] [nvarchar](500) NULL,
[Source] [nvarchar](500) NULL,
[Destination] [nvarchar](500) NULL,
[TriggerType] [nvarchar](500) NULL,
[TriggerId] [nvarchar](500) NULL,
[TriggerName] [nvarchar](500) NULL,
[TriggerTime] [nvarchar](500) NULL,
[rowsCopied] [nvarchar](500) NULL,
[DataRead] [int] NULL,
[No_ParallelCopies] [int] NULL,
[copyDuration_in_secs] [nvarchar](500) NULL,
[effectiveIntegrationRuntime] [nvarchar](500) NULL,
[Source_Type] [nvarchar](500) NULL,
[Sink_Type] [nvarchar](500) NULL,
[Execution_Status] [nvarchar](500) NULL,
[CopyActivity_Start_Time] [nvarchar](500) NULL,
[CopyActivity_End_Time] [nvarchar](500) NULL,
[CopyActivity_queuingDuration_in_secs] [nvarchar](500)
NULL,
[CopyActivity_transferDuration_in_secs] [nvarchar](500)
NULL,
CONSTRAINT [PK_pipeline_log] PRIMARY KEY CLUSTERED
(
[LOG_ID] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON
[PRIMARY]
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[pipeline_log] WITH CHECK ADD FOREIGN KEY([PARAMETER_ID])
REFERENCES [dbo].[pipeline_parameter] ([PARAMETER_ID])
ON UPDATE CASCADE
GO
Create an Error Table
This next script will create a pipeline_errors table which will be used to capture the Data
Factory error details from failed pipeline activities. In this table, column error_id is the
primary key and column parameter_id is a foreign key with a reference to column
parameter_id from the pipeline_parameter table.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[pipeline_errors](


[error_id] [int] IDENTITY(1,1) NOT NULL,
[parameter_id] [int] NULL,
[DataFactory_Name] [nvarchar](500) NULL,
[Pipeline_Name] [nvarchar](500) NULL,
[RunId] [nvarchar](500) NULL,
[Source] [nvarchar](500) NULL,
[Destination] [nvarchar](500) NULL,
[TriggerType] [nvarchar](500) NULL,
[TriggerId] [nvarchar](500) NULL,
[TriggerName] [nvarchar](500) NULL,
[TriggerTime] [nvarchar](500) NULL,
[No_ParallelCopies] [int] NULL,
[copyDuration_in_secs] [nvarchar](500) NULL,
[effectiveIntegrationRuntime] [nvarchar](500) NULL,
[Source_Type] [nvarchar](500) NULL,
[Sink_Type] [nvarchar](500) NULL,
[Execution_Status] [nvarchar](500) NULL,
[ErrorDescription] [nvarchar](max) NULL,
[ErrorCode] [nvarchar](500) NULL,
[ErrorLoggedTime] [nvarchar](500) NULL,
[FailureType] [nvarchar](500) NULL,
CONSTRAINT [PK_pipeline_error] PRIMARY KEY CLUSTERED
(
[error_id] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON
[PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[pipeline_errors] WITH CHECK ADD FOREIGN KEY([parameter_id])
REFERENCES [dbo].[pipeline_parameter] ([PARAMETER_ID])
ON UPDATE CASCADE
GO
Create a Stored Procedure to Update the Log Table
Now that we have all the necessary SQL Tables in place, we can begin creating a few
necessary stored procedures. Let’s begin with the following script which will create a stored
procedure to update the pipeline_log table with data from the successful pipeline run. Note
that this stored procedure will be called from the Data Factory pipeline at run-time.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[sp_UpdateLogTable]


@DataFactory_Name VARCHAR(250),
@Pipeline_Name VARCHAR(250),
@RunID VARCHAR(250),
@Source VARCHAR(300),
@Destination VARCHAR(300),
@TriggerType VARCHAR(300),
@TriggerId VARCHAR(300),
@TriggerName VARCHAR(300),
@TriggerTime VARCHAR(500),
@rowsCopied VARCHAR(300),
@DataRead INT,
@No_ParallelCopies INT,
@copyDuration_in_secs VARCHAR(300),
@effectiveIntegrationRuntime VARCHAR(300),
@Source_Type VARCHAR(300),
@Sink_Type VARCHAR(300),
@Execution_Status VARCHAR(300),
@CopyActivity_Start_Time VARCHAR(500),
@CopyActivity_End_Time VARCHAR(500),
@CopyActivity_queuingDuration_in_secs VARCHAR(500),
@CopyActivity_transferDuration_in_secs VARCHAR(500)
AS
INSERT INTO [pipeline_log]
(
[DataFactory_Name]
,[Pipeline_Name]
,[RunId]
,[Source]
,[Destination]
,[TriggerType]
,[TriggerId]
,[TriggerName]
,[TriggerTime]
,[rowsCopied]
,[DataRead]
,[No_ParallelCopies]
,[copyDuration_in_secs]
,[effectiveIntegrationRuntime]
,[Source_Type]
,[Sink_Type]
,[Execution_Status]
,[CopyActivity_Start_Time]
,[CopyActivity_End_Time]
,[CopyActivity_queuingDuration_in_secs]
,[CopyActivity_transferDuration_in_secs]
)
VALUES
(
@DataFactory_Name
,@Pipeline_Name
,@RunId
,@Source
,@Destination
,@TriggerType
,@TriggerId
,@TriggerName
,@TriggerTime
,@rowsCopied
,@DataRead
,@No_ParallelCopies
,@copyDuration_in_secs
,@effectiveIntegrationRuntime
,@Source_Type
,@Sink_Type
,@Execution_Status
,@CopyActivity_Start_Time
,@CopyActivity_End_Time
,@CopyActivity_queuingDuration_in_secs
,@CopyActivity_transferDuration_in_secs
)
GO
Create a Stored Procedure to Update the Errors Table
Next, let's run the following script, which will create a stored procedure to update the
pipeline_errors table with detailed error data from the failed pipeline run. Note that this
stored procedure will be called from the Data Factory pipeline at run-time.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[sp_UpdateErrorTable]


@DataFactory_Name [nvarchar](500) NULL,
@Pipeline_Name [nvarchar](500) NULL,
@RunId [nvarchar](500) NULL,
@Source [nvarchar](500) NULL,
@Destination [nvarchar](500) NULL,
@TriggerType [nvarchar](500) NULL,
@TriggerId [nvarchar](500) NULL,
@TriggerName [nvarchar](500) NULL,
@TriggerTime [nvarchar](500) NULL,
@No_ParallelCopies [int] NULL,
@copyDuration_in_secs [nvarchar](500) NULL,
@effectiveIntegrationRuntime [nvarchar](500) NULL,
@Source_Type [nvarchar](500) NULL,
@Sink_Type [nvarchar](500) NULL,
@Execution_Status [nvarchar](500) NULL,
@ErrorDescription [nvarchar](max) NULL,
@ErrorCode [nvarchar](500) NULL,
@ErrorLoggedTime [nvarchar](500) NULL,
@FailureType [nvarchar](500) NULL
AS
INSERT INTO [pipeline_errors]

(
[DataFactory_Name],
[Pipeline_Name],
[RunId],
[Source],
[Destination],
[TriggerType],
[TriggerId],
[TriggerName],
[TriggerTime],
[No_ParallelCopies],
[copyDuration_in_secs],
[effectiveIntegrationRuntime],
[Source_Type],
[Sink_Type],
[Execution_Status],
[ErrorDescription],
[ErrorCode],
[ErrorLoggedTime],
[FailureType]
)
VALUES
(
@DataFactory_Name,
@Pipeline_Name,
@RunId,
@Source,
@Destination,
@TriggerType,
@TriggerId,
@TriggerName,
@TriggerTime,
@No_ParallelCopies,
@copyDuration_in_secs,
@effectiveIntegrationRuntime,
@Source_Type,
@Sink_Type,
@Execution_Status,
@ErrorDescription,
@ErrorCode,
@ErrorLoggedTime,
@FailureType
)
GO
Create a Source Error SQL Table
Recall from my previous article, Azure Data Factory Pipeline to fully Load all SQL Server
Objects to ADLS Gen2, that we used a source SQL Server Table that we then moved to the
Data Lake Storage Gen2 and ultimately into Synapse DW. Based on this process, we will
need to test a known error within the Data Factory pipeline. It is known that a varchar(max) value containing 8000+ characters will generally fail when being loaded into Synapse DW, since varchar(max) is an unsupported data type there. This seems like a good use case for an error test.
The following table dbo.MyErrorTable contains two columns with col1 being the
varchar(max) datatype.
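The article only specifies col1 and its varchar(max) type; as a minimal sketch (the name and type of the second column are assumptions), the table could be created like this:
CREATE TABLE dbo.MyErrorTable(
    col1 VARCHAR(MAX) NULL, -- holds the 8000+ character sample text
    col2 VARCHAR(100) NULL  -- hypothetical second column
);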

Within dbo.MyErrorTable I have added a large block of text, randomly chosen from Roma: the novel of ancient Rome by Steven Saylor. After some editing of the text, I confirmed that col1 contains 8001 words, which is sure to fail my Azure Data Factory pipeline and trigger a record to be created in the pipeline_errors table.

Add Records to Parameter Table


Now that we’ve identified the source SQL tables to run through the process, I’ll add them to
the pipeline_parameter table. For this demonstration I have added the error table that we created in the previous step, along with a regular table that we expect to succeed, so we can demonstrate both the success and failure logging processes end to end.
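As a sketch, the two rows could look something like the insert below; the server, database and type values are assumptions and should be replaced with the values from your own environment:
-- Hypothetical values for illustration only
INSERT INTO dbo.pipeline_parameter (server_name, src_type, src_schema, src_db, src_name, dst_type, dst_name, include_pipeline_flag)
VALUES
 ('OnPremSqlServer01', 'on_prem_sql', 'dbo', 'SourceDB', 'MyTable',      'adls2_parquet', 'MyTable',      'Y')
,('OnPremSqlServer01', 'on_prem_sql', 'dbo', 'SourceDB', 'MyErrorTable', 'adls2_parquet', 'MyErrorTable', 'Y');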

Verify the Azure Data Lake Storage Gen2 Folders and Files
After running the pipeline to load my SQL tables to Azure Data Lake Storage Gen2, we can
see that the destination ADLS2 container now has both of the tables in snappy compressed
parquet format.
As an additional verification step, we can see that the folder contains the expected parquet
file.

Configure the Pipeline Lookup Activity


It’s now time to build and configure the ADF pipeline. My previous article, Load Data Lake
files into Azure Synapse Analytics Using Azure Data Factory, covers the details on how to
build this pipeline. To recap the process, the select query within the lookup gets the list of parquet files that need to be loaded to Synapse DW and then passes them on to the ForEach loop, which loads the parquet files into Synapse DW.
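The exact lookup query comes from the article referenced above; conceptually it is a select against the parameter table along these lines (the column list and filter below are assumptions):
SELECT src_schema, src_name, dst_name, dst_folder, file_type
FROM dbo.pipeline_parameter
WHERE load_synapse = 'Y';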
Configure the Pipeline Foreach Loop Activity
The Foreach loop contains the Copy-Table activity, which takes the parquet files and loads
them to Synapse DW while auto-creating the tables. If the Copy-Table activity succeeds, it
will log the pipeline run data to the pipeline_log table. However, if the Copy-Table activity
fails, it will log the pipeline error details to the pipeline_errors table.
Configure Stored Procedure to Update the Log Table
Notice that the UpdateLogTable Stored procedure that we created earlier will be called by the
success stored procedure activity.
Below are the stored procedure parameters that will Update the pipeline_log table and can be
imported directly from the Stored Procedure.
The following values will need to be entered into the stored procedure parameter values.
Name	Values
DataFactory_Name	@{pipeline().DataFactory}
Pipeline_Name	@{pipeline().Pipeline}
RunId	@{pipeline().RunId}
Source	@{item().src_name}
Destination	@{item().dst_name}
TriggerType	@{pipeline().TriggerType}
TriggerId	@{pipeline().TriggerId}
TriggerName	@{pipeline().TriggerName}
TriggerTime	@{pipeline().TriggerTime}
rowsCopied	@{activity('Copy-Table').output.rowsCopied}
RowsRead	@{activity('Copy-Table').output.rowsRead}
No_ParallelCopies	@{activity('Copy-Table').output.usedParallelCopies}
copyDuration_in_secs	@{activity('Copy-Table').output.copyDuration}
effectiveIntegrationRuntime	@{activity('Copy-Table').output.effectiveIntegrationRuntime}
Source_Type	@{activity('Copy-Table').output.executionDetails[0].source.type}
Sink_Type	@{activity('Copy-Table').output.executionDetails[0].sink.type}
Execution_Status	@{activity('Copy-Table').output.executionDetails[0].status}
CopyActivity_Start_Time	@{activity('Copy-Table').output.executionDetails[0].start}
CopyActivity_End_Time	@{utcnow()}
CopyActivity_queuingDuration_in_secs	@{activity('Copy-Table').output.executionDetails[0].detailedDurations.queuingDuration}
CopyActivity_transferDuration_in_secs	@{activity('Copy-Table').output.executionDetails[0].detailedDurations.transferDuration}
Configure Stored Procedure to Update the Error Table
The last stored procedure within the Foreach loop activity is the UpdateErrorTable Stored
procedure that we created earlier and will be called by the failure stored procedure activity.
Below are the stored procedure parameters that will Update the pipeline_errors table and can
be imported directly from the Stored Procedure.
The following values will need to be entered into the stored procedure parameter values.
Name	Values
DataFactory_Name	@{pipeline().DataFactory}
Pipeline_Name	@{pipeline().Pipeline}
RunId	@{pipeline().RunId}
Source	@{item().src_name}
Destination	@{item().dst_name}
TriggerType	@{pipeline().TriggerType}
TriggerId	@{pipeline().TriggerId}
TriggerName	@{pipeline().TriggerName}
TriggerTime	@{pipeline().TriggerTime}
No_ParallelCopies	@{activity('Copy-Table').output.usedParallelCopies}
copyDuration_in_secs	@{activity('Copy-Table').output.copyDuration}
effectiveIntegrationRuntime	@{activity('Copy-Table').output.effectiveIntegrationRuntime}
Source_Type	@{activity('Copy-Table').output.executionDetails[0].source.type}
Sink_Type	@{activity('Copy-Table').output.executionDetails[0].sink.type}
Execution_Status	@{activity('Copy-Table').output.executionDetails[0].status}
ErrorCode	@{activity('Copy-Table').error.errorCode}
ErrorDescription	@{activity('Copy-Table').error.message}
ErrorLoggedTime	@utcnow()
FailureType	@concat(activity('Copy-Table').error.message,'failureType:',activity('Copy-Table').error.failureType)
Run the Pipeline
Now that we have configured the pipeline, it is time to run the pipeline. As we can see from
the debug mode Output log, one table succeeded and the other failed, as expected.

Verify the Results


Finally, let's verify the results in the pipeline_log table. As we can see, the pipeline_log table
has captured one log containing the source, MyTable.
And the pipeline_errors table now has one record for MyErrorTable, along with detailed error
codes, descriptions, messages and more.
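To inspect the captured records yourself, two simple queries against the log and error tables are enough (a quick optional check):
SELECT TOP (10) Pipeline_Name, Source, Destination, Execution_Status FROM dbo.pipeline_log ORDER BY LOG_ID DESC;
SELECT TOP (10) Pipeline_Name, Source, ErrorCode, ErrorDescription, FailureType FROM dbo.pipeline_errors ORDER BY error_id DESC;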

As a final check, when I navigate to the Synapse DW, I can see that both tables have been
auto-created, despite the fact that one failed and one succeeded.

However, data was only loaded into MyTable; MyErrorTable contains no data.

Logging Azure Data Factory Pipeline Audit Data
By: Ron L'Esteve

Problem
In my last article, Azure Data Factory Pipeline to fully Load all SQL Server Objects to ADLS
Gen2, I discussed how to create a pipeline parameter table in Azure SQL DB and drive the
creation of snappy parquet files, containing data from on-premises SQL Server tables, in Azure
Data Lake Store Gen2. Now that I have a process for generating files in the lake, I would also
like to implement a process to track the log activity for my pipelines that run and persist the
data. What options do I have for creating and storing this log data?
Solution
Azure Data Factory is a robust cloud-based E-L-T tool that is capable of accommodating
multiple scenarios for logging pipeline audit data.
In this article, I will discuss three of these possible options, which include:
1. Updating Pipeline Status and Datetime columns in a static pipeline parameter table
using an ADF Stored Procedure activity
2. Generating a metadata CSV file for every parquet file that is created and storing the
logs in hierarchical folders in ADLS2
3. Creating a pipeline log table in Azure SQL Database and storing the pipeline activity
as records in the table

Prerequisites
Ensure that you have read and implemented Azure Data Factory Pipeline to fully Load all
SQL Server Objects to ADLS Gen2, as this demo will be building a pipeline logging process
on the pipeline copy activity that was created in the article.
Option 1: Create a Stored Procedure Activity
The Stored Procedure Activity is one of the transformation activities that Data Factory
supports. We will use the Stored Procedure Activity to invoke a stored procedure in Azure
SQL Database. For more information on ADF Stored Procedure Activity, see Transform data
by using the SQL Server Stored Procedure activity in Azure Data Factory.
For this scenario, I would like to maintain my Pipeline Execution Status and Pipeline Date
detail as columns in my Pipeline Parameter table rather than having a separate log table. The
downside to this method is that it will not retain historical log data, but will simply update the
values based on a lookup of the incoming files to records in the pipeline parameter table. This
gives a quick, yet not necessarily robust, method of viewing the status and load date across all
items in the pipeline parameter table.
I'll begin by adding a Stored Procedure activity connected to my Copy-Table activity, so that the stored procedure is executed for each table as the process iterates on a table-level basis.
Next, I will add the following stored procedure to my Azure SQL Database where my
pipeline parameter table resides. This procedure simply looks up the destination table name in
the pipeline parameter table and updates the status and datetime for each table once the Copy-
Table activity is successful.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[sql2adls_data_files_loaded] @dst_name NVARCHAR(500)
AS

SET NOCOUNT ON -- turns off messages sent back to client after DML is run, keep this here
DECLARE @Currentday DATETIME = GETDATE();

BEGIN TRY
    BEGIN TRANSACTION -- BEGIN TRAN statement will increment transaction count from 0 to 1
        UPDATE [dbo].[pipeline_parameter]
        SET pipeline_status = 'success', pipeline_datetime = @Currentday
        WHERE dst_name = @dst_name;
    COMMIT TRANSACTION -- COMMIT will decrement transaction count from 1 to 0 if DML worked
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK

    -- Return error information.
    DECLARE @ErrorMessage NVARCHAR(4000), @ErrorSeverity INT;
    SELECT @ErrorMessage = ERROR_MESSAGE(), @ErrorSeverity = ERROR_SEVERITY();
    RAISERROR(@ErrorMessage, @ErrorSeverity, 1);
END CATCH;
GO
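Before wiring the procedure into the pipeline, you can test it manually against one of the destination names in your pipeline_parameter table ('MyTable' below is just an example value):
EXEC dbo.sql2adls_data_files_loaded @dst_name = N'MyTable';

SELECT dst_name, pipeline_status, pipeline_datetime
FROM dbo.pipeline_parameter
WHERE dst_name = N'MyTable';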
After creating my stored procedure, I can confirm that it has been created in my Azure SQL
Database.

I will then return to my data factory pipeline and configure the stored procedure activity. In
the Stored Procedure tab, I will select the stored procedure that I just created. I will also add a
new stored procedure parameter that references my destination name, which I had configured
in the copy activity.

After saving, publishing and running the pipeline, I can see that my pipeline_datetime and
pipeline_status columns have been updated as a result of the ADF Stored Procedure Activity.

Option 2: Create a CSV Log file in Azure Data Lake Store2


Since my Copy-Table activity is generating snappy parquet files into hierarchical ADLS2
folders, I also want to create a metadata .csv file which contains the pipeline activity. For this
scenario, I have set up an Azure Data Factory Event Grid trigger to listen for metadata files and then kick off a process to transform my table and load it into a curated zone.
I will start by adding a Copy activity for creating my log files and connecting it to the Copy-Table activity. Similar to the previous process, this will generate a .csv metadata file
in a metadata folder per table.

To configure the source dataset, I will select my source on-premises SQL Server.
Next, I will add the following query as my source query. As we can see, this query will contain a combination of pipeline system variables, Copy-Table activity output values, and user-defined parameters.

SELECT '@{pipeline().DataFactory}' as DataFactory_Name,
'@{pipeline().Pipeline}' as Pipeline_Name,
'@{pipeline().RunId}' as RunId,
'@{item().src_name}' as Source,
'@{item().dst_name}' as Destination,
'@{pipeline().TriggerType}' as TriggerType,
'@{pipeline().TriggerId}' as TriggerId,
'@{pipeline().TriggerName}' as TriggerName,
'@{pipeline().TriggerTime}' as TriggerTime,
'@{activity('Copy-Table').output.rowsCopied}' as rowsCopied,
'@{activity('Copy-Table').output.rowsRead}' as RowsRead,
'@{activity('Copy-Table').output.usedParallelCopies}' as No_ParallelCopies,
'@{activity('Copy-Table').output.copyDuration}' as copyDuration_in_secs,
'@{activity('Copy-Table').output.effectiveIntegrationRuntime}' as effectiveIntegrationRuntime,
'@{activity('Copy-Table').output.executionDetails[0].source.type}' as Source_Type,
'@{activity('Copy-Table').output.executionDetails[0].sink.type}' as Sink_Type,
'@{activity('Copy-Table').output.executionDetails[0].status}' as Execution_Status,
'@{activity('Copy-Table').output.executionDetails[0].start}' as CopyActivity_Start_Time,
'@{utcnow()}' as CopyActivity_End_Time,
'@{activity('Copy-Table').output.executionDetails[0].detailedDurations.queuingDuration}' as CopyActivity_queuingDuration_in_secs,
'@{activity('Copy-Table').output.executionDetails[0].detailedDurations.timeToFirstByte}' as CopyActivity_timeToFirstByte_in_secs,
'@{activity('Copy-Table').output.executionDetails[0].detailedDurations.transferDuration}' as CopyActivity_transferDuration_in_secs
My sink will be a csv dataset with a .csv extension.

Below is the connection configuration that I will use for my csv dataset.
The following parameterized path will ensure that the file is generated in the correct folder structure.
@{item().server_name}/@{item().src_db}/@{item().src_schema}/@{item().dst_name}/metadata/@{formatDateTime(utcnow(),'yyyy-MM-dd')}/@{item().dst_name}.csv
After I save, publish, and run my pipeline, I can see that a metadata folder has been created in
my Server>database>schema>Destination_table location.
When I open the metadata folder, I can see that there is one .csv file per day that the pipeline runs.

Finally, I can see that a metadata .csv file with the name of my table has been created.

When I download and open the file, I can see that all of the query results have been populated
in my .csv file.

Option 3: Create a log table in Azure SQL Database


My last scenario involves creating a log table in the Azure SQL Database where my parameter table resides, and then writing the pipeline activity data as records in that table.
Again, for this option, I will start by adding a copy data activity connected to my Copy-Table
activity.
Next, I will create the following table in my Azure SQL Database. This table will store and
capture the pipeline and copy activity details.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[pipeline_log](


[DataFactory_Name] [nvarchar](500) NULL,
[Pipeline_Name] [nvarchar](500) NULL,
[RunId] [nvarchar](500) NULL,
[Source] [nvarchar](500) NULL,
[Destination] [nvarchar](500) NULL,
[TriggerType] [nvarchar](500) NULL,
[TriggerId] [nvarchar](500) NULL,
[TriggerName] [nvarchar](500) NULL,
[TriggerTime] [nvarchar](500) NULL,
[rowsCopied] [nvarchar](500) NULL,
[RowsRead] [int] NULL,
[No_ParallelCopies] [int] NULL,
[copyDuration_in_secs] [nvarchar](500) NULL,
[effectiveIntegrationRuntime] [nvarchar](500) NULL,
[Source_Type] [nvarchar](500) NULL,
[Sink_Type] [nvarchar](500) NULL,
[Execution_Status] [nvarchar](500) NULL,
[CopyActivity_Start_Time] [datetime] NULL,
[CopyActivity_End_Time] [datetime] NULL,
[CopyActivity_queuingDuration_in_secs] [nvarchar](500)
NULL,
[CopyActivity_timeToFirstByte_in_secs] [nvarchar](500)
NULL,
[CopyActivity_transferDuration_in_secs] [nvarchar](500)
NULL
) ON [PRIMARY]
GO
Similar to my last pipeline option, I will configure my on-premises SQL Server as the source and use the query provided in Option 2 as the source query. My sink will be a connection to the Azure SQL DB pipeline log table that I created earlier.

Below are the connection details for the Azure SQL DB pipeline log table.
When I save, publish and run my pipeline, I can see that the pipeline copy activity records
have been captured in my dbo.pipeline_log table.

Azure Data Factory Pipeline Scheduling, Error Handling and Monitoring - Part
2
By: Koen Verbeeck   |   Updated: 2021-06-17

Problem
Azure Data Factory is a managed serverless data integration service for the Microsoft Azure
Data Platform used by data engineers during business intelligence and cloud data related
projects. In part 1 of this tutorial series, we introduced you to Azure Data Factory (ADF) by
creating a pipeline. We continue by showing you other use cases for which you can use ADF,
as well as how you can handle errors and how to use the built-in monitoring.
Solution
It's recommended to read part 1 before you continue with this tip. It shows you how to install
ADF and how to create a pipeline that will copy data from Azure Blob Storage to an Azure
SQL database as a sample ETL/ELT process.
Azure Data Factory as an Orchestration Service
Like SQL Server Integration Services, ADF is responsible for data movement (copy data or
datasets) from a source to a destination as a workflow. But it can do so much more. There are
a variety of activities that don't do anything in ADF itself, but rather perform some tasks on
an external system. For example, there are activities specific for handling Azure
Databricks scenarios:

You can for example trigger Azure Databricks Notebooks from ADF. The following tips can
get you started on this topic:
 Orchestrating Azure Databricks Notebooks with Azure Data Factory
 Create Azure Data Factory inventory using Databricks
 Getting Started with Delta Lake Using Azure Data Factory
 Snowflake Data Warehouse Loading with Azure Data Factory and Databricks
 Azure Data Factory Mapping Data Flows for Big Data Lake Aggregations and
Transformations
ADF has its own form of Azure Databricks integration: Data Flows (previously called
Mapping Data Flows) and Power Query flows (previously called Wrangling Data Flows), which are
both out of scope of this tip, but will be explained in a subsequent tip.
ADF also supports other technologies, such as HDInsight:

But also Azure Machine Learning:

You can call Logic Apps and Azure Functions from Azure Data Factory, which is often
necessary because there's still some functionality missing from ADF. For example, you cannot send an email directly from ADF, nor can ADF easily download a file from SharePoint Online (or OneDrive for Business).
With ADF pipelines, you can create complex data pipelines where you integrate multiple data
services with each other. But it's not all cloud. You can also access on-premises data sources
when you install the self-hosted integration runtime. This runtime also allows you to shift
workloads to on-premises machines should the need arise.

Lastly, you can also integrate existing SSIS solutions into ADF. You can create an Azure-
SSIS Integration Runtime, which is basically a cluster of virtual machines that will execute
your SSIS packages. The SSIS catalog itself is created in either an Azure SQL DB or an
Azure SQL Managed Instance. You can find more info in the following tips:
 Configure an Azure SQL Server Integration Services Integration Runtime
 Executing Integration Services Packages in the Azure-SSIS Integration Runtime
 Customized Setup for the Azure-SSIS Integration Runtime
 SSIS Catalog Maintenance in the Azure Cloud
Scheduling ADF Pipelines
To schedule an ADF pipeline, you add a trigger from within the pipeline itself:
You can either trigger a one-off execution, or you can create/edit a permanent trigger.
Currently, there are 4 types:

 Schedule is very similar to what is used in SQL Server Agent jobs. You define a
frequency (for example every 10 minutes or once every day at 3AM), a start date and
an optional end date.
 Tumbling window is a more specialized form of schedule. With tumbling windows,
you have a parameterized pipeline. When one window is executed, the start and the
end time of the window is passed to the pipeline. The advantage of a tumbling
window is that you can execute past periods as well. Suppose you have a tumbling
window on the daily level, and the start date is at the start of this month. This will
trigger an execution for every day of the month right until the current day. This makes
tumbling windows great for doing an initial load where you want each period
executed separately. You can find more info about this trigger in the tip Create
Tumbling Window Trigger in Azure Data Factory ADF.
 Storage events will trigger a pipeline whenever a blob is created or deleted from a
specific blob container.
 Custom events are a new trigger type which are in preview at the time of writing.
These allow you to trigger a pipeline based on custom events from Event Grid. You
can find more info in the documentation.
Pipelines can also be triggered from an external tool, such as from an Azure Logic App or an
Azure Function. ADF has even a REST API available which you can use, but you could also
use PowerShell, the Azure CLI, .NET or even Python.
Error Handling and Monitoring
Like in SSIS, you can configure constraints on the execution paths between two activities:

This allows you to create a more robust pipeline that can handle multiple scenarios. Keep in
mind though ADF doesn't have an "OR constraint" like in SSIS. Let's illustrate why that
matters. In the following scenario, the Web Activity will never be executed:

For the Web Activity to be executed, the Copy Activity must fail AND the Azure Function
must fail. However, the Azure Function will only start if the Copy Data activity has finished
successfully. If you want to re-use some error handling functionality, you can create a
separate pipeline and call this pipeline from every activity in the main pipeline:
To capture and log any errors, you can create a stored procedure to log them into a table, as
demonstrated in the tip Azure Data Factory Pipeline Logging Error Details.
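As a minimal sketch of that idea, assuming a simple log table of your own design (the table and procedure names below are illustrative, not the ones from the referenced tip), the stored procedure could look like this:
CREATE TABLE dbo.PipelineErrorLog (
   PipelineName nvarchar(200) NULL,
   ActivityName nvarchar(200) NULL,
   ErrorMessage nvarchar(max) NULL,
   LoggedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME() -- when the error was recorded
);
GO
CREATE PROCEDURE dbo.usp_LogPipelineError
   @PipelineName nvarchar(200),
   @ActivityName nvarchar(200),
   @ErrorMessage nvarchar(max)
AS
BEGIN
   -- called from a failure path in the pipeline, passing pipeline/activity details as parameters
   INSERT INTO dbo.PipelineErrorLog (PipelineName, ActivityName, ErrorMessage)
   VALUES (@PipelineName, @ActivityName, @ErrorMessage);
END
GO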
In the ADF environment, you can monitor ongoing and past pipeline runs.

There, you can view all pipeline runs. There are pre-defined filters you can use, such as date,
pipeline names and status.

You can view the error if a pipeline has failed, but you can also go into the specific run and
restart an activity if needed.
For more advanced alerting and monitoring, you can use Azure Monitor.
Query Audit data in Azure SQL Database using Kusto Query Language (KQL)
By: Rajendra Gupta   |   Updated: 2021-03-16   |   Comments   |   Related: > Azure SQL
Database

Problem
In the previous tip, Auditing for Azure SQL Database, we explored the process to audit an
Azure SQL Database using the Azure Portal and Azure PowerShell cmdlets. In this article we
look at how you can leverage Kusto Query Language (KQL) for querying the audit data.
Solution
Kusto Query Language (KQL) is a read-only query language for processing real-time data
from Azure Log Analytics, Azure Application Insights, and Azure Security Center logs. SQL
Server database professionals familiar with Transact-SQL will see that KQL is similar to T-
SQL with slight differences.
For example, in T-SQL we use the WHERE clause to filter records from a table as follows.
SELECT *
FROM Employees
WHERE firstname='John'
We can write the same query in KQL with the following syntax. Like PowerShell, it uses a
pipe (|) to pass values to the next command.
Employees
| where firstname == 'John'
Similarly, in T-SQL we use the ORDER BY clause to sort data in ascending or descending
order as follows.
SELECT *
FROM Employees
WHERE firstname='John'
ORDER BY empid
The equivalent KQL code is as follows.
Employees
| where firstname == 'John'
| order by empid
The KQL query syntax looks familiar, right?
Enable Audit for Azure SQL Database
In the previous tip, we configured audit logs for Azure SQL Database using Azure Storage. If the bulk of the audit data sits in Azure Storage, it can be cumbersome to fetch the required data. You can use the sys.fn_get_audit_file() function to fetch the data, but it also takes longer for a large data set. Therefore, for critical databases you should store the audit data in Azure Log Analytics.
To configure the Azure SQL Database audit logs in Azure Log Analytics, log in to the Azure portal using your credentials and navigate to the Azure SQL server.
As shown below, server-level auditing is disabled. It is also disabled for all databases in the
Azure server.
Enable the server-level auditing and put a tick on Log Analytics (Preview) as the audit log
destination.
This enables the configuration option for Log Analytics. Click on Configure and it opens the Log Analytics Workspaces pane.

Click on the Create New Workspace and in the new workspace, enter the following values:
 Enter a name for log analytics workspace
 Select your Azure subscription
 Resource group
 Azure region
 Pricing tier
As shown below, the auditing is configured for Azure SQL Database.
Save the audit configuration for the Azure SQL Database. This enables server-level auditing. Database-level auditing remains disabled, because server-level auditing already applies to all databases on the server.

Configure the diagnostic telemetry


We need to configure the diagnostic settings for SQL Database for gathering the metrics
using the Azure portal. You can configure the data for errors, blocking, deadlocks, query
store, and automatic tuning in the diagnostics.
 SQL Insights: It captures Intelligent Insights performance.
 AutomaticTuning: It contains automatic tuning recommendations for your Azure
SQL Database.
QueryStoreRuntimeStatistics: It captures query runtime statistics such as CPU usage and query duration.
 QueryStoreWaitStatistics: It captures query wait statistics such as locking, CPU,
memory stats.
 Errors: It contains information for errors on a database.
 DatabaseWaitStatistics: It captures information for database wait statistics.
 Timeouts: It captures timeouts on a database.
 Blocks: It captures database blocking information.
 Deadlocks: It captures deadlocks events for Azure databases.
 Basic: It contains information for DTU\CPU, failed or successful connections, storage
usage etc.
InstanceAndAppAdvanced: It captures TempDB data and log usage.
You can refer to this Microsoft doc for detailed information.
Click on Add diagnostic setting. Let us enable diagnostics for Errors and InstanceAndAppAdvanced. Send this data to the Log Analytics workspace in your subscription and click Save to apply the configuration.
Use KQL for Azure SQL database log analysis
Navigate to the Azure database and click on Logs. You get the welcome page for Log
Analytics.

Click on Get Started and it opens the query editor for KQL queries. On the left-hand side, it shows the AzureDiagnostics table for the SQL database.
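A quick way to check which diagnostic categories are actually flowing into the workspace is to group the rows of that table by category; a minimal sketch:
AzureDiagnostics
| summarize count() by Category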

Kusto Query Language (KQL) to summarize the client IP Connections


Suppose we want to identify the client IP addresses and the number of connections for the Azure SQL Database. The below KQL query uses the following:
 The summarize operator to generate an aggregated output table from the input table.
 The count() function to return the number of records.
 The client_ip_s column as the grouping key.
AzureDiagnostics
|summarize count() by client_ip_s
In the KQL query output, we get each client IP address and its aggregate connection count.

KQL query for finding out login failures count by IP address


In a traditional SQL Server, we get login failure messages in the error logs. Similarly, in
Azure SQL Database, we can use KQL to determine the IP address from where these
connections are originating.
The below KQL query uses the following arguments:
 It filters the Category column for SQLSecurityAuditEvents.
 It filters records on the LogicalServerName_s column for my Azure SQL server (azuredemoinstance).
 The action_id_s value DBAF stands for Database Authentication Failure.
 It uses summarize with count() and returns the login failure count for each server principal. In the below query output, it shows the [sqladmin] login with a failure count of 7.
AzureDiagnostics
| where Category == 'SQLSecurityAuditEvents' and
LogicalServerName_s == "azuredemoinstance"
| where action_id_s == 'DBAF'
| summarize count() by client_ip_s, OperationName,
server_principal_name_s

KQL query for listing events for Azure SQL Database


In the below KQL query, we filter records for [labazuresql], my Azure SQL database.
AzureDiagnostics
| where DatabaseName_s == "labazuresql"
By default, it returns output for the last hour. You can click on the time range and select the appropriate period, such as 12 hours or 24 hours. You can expand a result row for detailed information such as the TenantId, the TimeGenerated datetime, the ResourceId, and so on, each with its own data type.
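If you prefer to pin the time window in the query itself instead of using the time picker, you can filter on the TimeGenerated column; a minimal sketch against the same database:
AzureDiagnostics
| where DatabaseName_s == "labazuresql"
| where TimeGenerated > ago(24h)
| order by TimeGenerated desc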

Suppose someone executed INSERT and SELECT statements against the Azure database. As a database administrator, you may want to retrieve those SQL statements. You can use KQL to filter records where the statement_s column contains INSERT statements.
AzureDiagnostics
| where statement_s contains "insert"
As shown below, we get the complete INSERT statement from the audit logs.

Similarly, you can fetch data from the other diagnostic categories with the help of KQL. You can also monitor performance data, such as CPU and memory, using the diagnostics configuration.
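For example, if the Basic metrics category is enabled in the diagnostic settings, the metric values typically land in the AzureMetrics table (worth verifying in your own workspace); a hedged sketch that summarizes CPU over time:
AzureMetrics
| where MetricName == "cpu_percent"
| summarize avg(Average), max(Maximum) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc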
Create an Alert in Microsoft Azure Log Analytics
By: Joe Gavin   |   Comments   |   Related: > Azure

Problem
You want to create an alert in Log Analytics to monitor Performance Monitor counters and /
or Event Logs and need a quick way to jump in and get familiar with it.
Solution
Log Analytics is a service in Operations Management Suite (OMS) that monitors your cloud
and on-premises environments to maintain their availability and performance. It collects data
generated by resources in your cloud and on-premises environments and from other
monitoring tools to provide analysis across multiple sources.
(Source: https://docs.microsoft.com/en-us/azure/log-analytics/log-analytics-overview)
Digging deeply into this service is out of scope for this tip. However, diving in and creating a
simple alert is a great place to get started. 
We’ll walk through the following:
 Creating a Workspace - A workspace is the basic organizational unit for Log
Analytics.
 Installing and configuring the Microsoft Monitoring Agent - The agent is the conduit
from Windows and / or Linux monitored machines back to Log Analytics.
 Creating an alert - We can create alerts based on Windows Event Logs, Windows
Performance Counters, Linux Performance Counters, IIS Logs, Custom Fields,
Custom Logs and Syslog. In our example, we’ll keep it simple and get started with an
alert based on the ‘% Processor Time’ Windows Performance Counter. 
We’ll have a functioning Log Analytics alert when we’re done.
Creating a Workspace
Let’s get started.
Log in to the Microsoft Azure Portal at https://portal.azure.com.
Start typing Log Analytics in the search box (as shown below) and click on Log Analytics
when it comes up in the results.

Then click on Add.

Give your new OMS Workspace a unique name


 Choose your Subscription
 Create a new or use an existing Resource Group
 Choose Location
 Choose Pricing Tier

Then click OK.


Wait for the deployment to complete and click Refresh.
You will now see the new Workspace we just created. Go ahead and click on it.
Click on OMS Portal (it will open in another tab).

Click on the Settings icon in the upper right hand section of the OMS Portal.

Installing the Microsoft Monitoring Agent


At this point we are not monitoring any machines and need to install the Microsoft
Monitoring Agent on any machines we want to collect data from.
Choose Connected Sources > Windows Servers.
Click on ‘Download Windows Agent (64 bit)’ (presuming you’re installing on a 64 bit
machine) to download the installer to your machine.

Go to the desktop of the Windows machine you want to install the agent on and run
MMASetup-AMD64.exe from the location you saved it.
Click through until you get to the Agent Setup Options screen and check ‘Connect the Agent
to Azure Log Analytics (OMS)’.

Then click Next.


On the Azure Log Analytics (OMS) tab, click Add.
Copy and paste the Workspace ID and Key from the Windows Servers window in the OMS Portal, then click Next.
Then click Install and then Finish.
The agent is installed. Repeat for other machines. This process can be automated to install the agent on multiple machines, but that's a topic for another tip and day.
Creating an Alert
Now we can go back to the OMS Portal.
Let’s create an alert to tell us when CPU goes over a threshold of 90% on a machine we are
monitoring.
On the left side of the screen, click on the Log Search icon and this opens the Log Search
window.
(1) Paste the following in the search window (Note: this is based on the new Log Analytics
Query Language):
Perf
| where ObjectName == "Processor"
| where CounterName == "% Processor Time"
| where InstanceName == "_Total"
| where CounterValue > 90

(2) Click on the Search button on the right to see if there are any records. In this case we have
no values over 90%, so there are no records returned in the results section.
(3) To turn this query into an alert, click on the Alert icon in the upper left as shown above
and the window below will open.
Enter values for:
1. Name
2. Description
3. Severity
4. Time window
5. Alert frequency
6. Number of results
7. Subject
8. Recipients
9. Then click Save to save the alert.
After saving the Alert, you will get this window.

When we look at the alerts that were set up, we can see them as shown below.
And we’re done.
Next Steps
Azure Data Factory Lookup Activity Example
By: Fikrat Azizov   |   Comments (7)   |   Related: > Azure Data Factory

Problem
One of the frequently used SQL Server Integration Services (SSIS) controls is the lookup
transform, which allows performing lookup matches against existing database records. In this
post, we will be exploring Azure Data Factory's Lookup activity, which has similar
functionality.
Solution
Azure Data Factory Lookup Activity
The Lookup activity can read data stored in a database or file system and pass it to
subsequent copy or transformation activities. Unlike SSIS's Lookup transformation, which
allows performing a lookup search at the row level, data obtained from ADF's Lookup
activity can only be used on an object level. In other words, you can use ADF's Lookup
activity's data to determine object names (table, file names, etc.) within the same pipeline
dynamically.
The Lookup activity can read from a variety of database and file-based sources; you can find the list of all possible data sources here.
Lookup activity can work in two modes:
 Singleton mode - Produces first row of the related dataset
 Array mode - Produces the entire dataset
We will look into both modes of Lookup activity in this post.
Azure Data Factory Lookup Activity Singleton Mode
My first example creates a Lookup activity that reads the first row of a SQL query against the SrcDb database and uses it in a subsequent Stored Procedure activity, which stores the value in a log table inside the DstDb database.
For the purpose of this exercise, I have created a pipeline ControlFlow1_PL and a view in the SrcDb database to extract all table names, using the below query:
CREATE VIEW [dbo].[VW_TableList]
AS
SELECT TABLE_SCHEMA+'.'+TABLE_NAME AS Name FROM
INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE='BASE TABLE'
GO
I have also created a log table and a stored procedure to write into it. I am going to use this procedure in the Stored Procedure activity. Here are the required scripts to be executed inside the DstDb database:
CREATE TABLE [dbo].[TableLogs](
   [TableName] [varchar](max) NULL
)
GO

CREATE PROCEDURE [dbo].[usp_LogTableNames]
   @TableName varchar(max)
AS
BEGIN
   INSERT INTO [TableLogs] VALUES(@TableName)
END
GO
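As a quick sanity check before wiring up the pipeline, you can run the procedure manually and confirm a row lands in the log table (the table name passed in below is just an illustrative value):
EXEC [dbo].[usp_LogTableNames] @TableName = 'SalesLT.Product'; -- illustrative value
SELECT [TableName] FROM [dbo].[TableLogs];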
Let's follow the below steps to add the Lookup and Stored Procedure activities to the ControlFlow1_PL pipeline:
Select pipeline ControlFlow1_PL, expand the General group on the Activities panel, drag and drop the Lookup activity into the central panel and assign a name (I've named it Lookup_AC):
Switch to the Settings tab and click the '+New' button to create a dataset linked to the VW_TableList view in the SrcDb database:

I've named the new dataset TableList_DS; see the below properties.
The below screenshot shows the properties of the Lookup activity, with the new dataset configured. Please note that the 'First row only' checkbox is checked, which ensures that this activity produces only the first row from its data source:
Next, let's add a Stored Procedure activity, point it to the usp_LogTableNames procedure we created earlier, and link it to the Lookup_AC activity on the Success criteria:

Next, switch to the Stored Procedure tab, enter [dbo].[usp_LogTableNames] as the procedure's name, fetch the procedure's parameter using the Import parameter button and enter the dynamic expression @activity('Lookup_AC').output.firstRow.name as its value. This expression reflects the data output from the Lookup activity:

Finally, let's publish the changes, trigger it manually, switch to the Monitor page and open
the Activity Runs window to examine the detailed execution logs:
Using the Output button, we can examine the output of the lookup activity and see the value
it produced:

Now that we know how Lookup activity works in singleton mode, let's explore the array
mode.
Azure Data Factory Lookup Activity Array Mode
To explore the Lookup activity's array mode, I am going to create a copy of the pipeline created earlier and customize it as follows:
Clone the pipeline ControlFlow1_PL and name it ControlFlow2_PL.

Select the Lookup_AC activity in the ControlFlow2_PL pipeline, switch to the Settings tab and clear the First row only checkbox:
Because we're expecting multiple rows from the Lookup activity, we can no longer use the LogTableName_AC activity with a string parameter, so let's remove it and drag and drop a Set Variable activity, located under the General category (I've named it Set_Variable_AC):

Add array type variable TableNames to the ControlFlow2_PL pipeline:


Link the two activities on the Success criteria, select the Set_Variable_AC activity, choose TableNames from the Name drop-down list as the variable name and enter the expression @activity('Lookup_AC').output.value as the value. If you compare this expression to the previous one (@activity('Lookup_AC').output.firstRow.name), you can notice that we've replaced the firstRow property with the value property, because the Lookup activity in array mode doesn't support the firstRow property. Here's how your screen should look:
Azure Data Factory ForEach Activity Example
By: Fikrat Azizov   |   Comments (5)   |   Related: > Azure Data Factory

Problem
Data integration flows often involve executing the same tasks on many similar objects. A typical example could be copying multiple files from one folder into another, or copying multiple tables from one database into another. Azure Data Factory's (ADF) ForEach and
Until activities are designed to handle iterative processing logic. We are going to discuss the
ForEach activity in this article.
Solution
Azure Data Factory ForEach Activity
The ForEach activity defines a repeating control flow in your pipeline. This activity could be
used to iterate over a collection of items and execute specified activities in a loop. This
functionality is similar to SSIS's Foreach Loop Container.
ForEach activity's item collection can include outputs of other activities, pipeline parameters
or variables of array type. This activity is a compound activity; in other words, it can include more than one activity.
Creating ForEach Activity in Azure Data Factory
In the previous two posts (here and here), we started developing pipeline ControlFlow2_PL, which reads the list of tables from the SrcDb database, filters the tables whose names start with the character 'P' and assigns the results to the pipeline variable FilteredTableNames. Here is the list of tables that we get in this variable:
 SalesLT.Product
 SalesLT.ProductCategory
 SalesLT.ProductDescription
 SalesLT.ProductModel
 SalesLT.ProductModelProductDescription
In this exercise, we will add a ForEach activity to this pipeline, which will copy the tables listed in this variable into the DstDb database.
Before we proceed further, let's prepare the target tables. First, let's remove the foreign key relationships between these tables in the destination database using the below script, to prevent the ForEach activity from failing:
ALTER TABLE [SalesLT].[Product] DROP CONSTRAINT [FK_Product_ProductCategory_ProductCategoryID]
GO
ALTER TABLE [SalesLT].[Product] DROP CONSTRAINT [FK_Product_ProductModel_ProductModelID]
GO
ALTER TABLE [SalesLT].[ProductCategory] DROP CONSTRAINT [FK_ProductCategory_ProductCategory_ParentProductCategoryID_ProductCategoryID]
GO
ALTER TABLE [SalesLT].[ProductModelProductDescription] DROP CONSTRAINT [FK_ProductModelProductDescription_ProductDescription_ProductDescriptionID]
GO
ALTER TABLE [SalesLT].[ProductModelProductDescription] DROP CONSTRAINT [FK_ProductModelProductDescription_ProductModel_ProductModelID]
GO
ALTER TABLE [SalesLT].[SalesOrderDetail] DROP CONSTRAINT [FK_SalesOrderDetail_Product_ProductID]
GO
Next, let's create a stored procedure to purge the target tables, using the below script. We'll need to call this procedure before each copy, to avoid primary key errors:
CREATE PROCEDURE Usp_PurgeTargetTables
AS
BEGIN
delete from [SalesLT].[Product]
delete from [SalesLT].[ProductModelProductDescription]
delete from [SalesLT].[ProductDescription]
delete from [SalesLT].[ProductModel]
delete from [SalesLT].[ProductCategory]
END
Let's follow the below steps to add a ForEach activity to the ControlFlow2_PL pipeline:
Select pipeline ControlFlow2_PL, expand the Iterations & Conditionals group on the Activities panel, drag and drop a ForEach activity into the central panel and assign a name (I've named it ForEach_AC):
Switch to the Settings tab and enter the expression @variables('FilteredTableNames') into the Items text box:

Switch to Activities tab and click Add activity button:


Drag and drop a Copy activity into the central panel (I've named it CopyFiltered_AC), switch to the Source tab and click the '+New' button to start creating the source dataset:

Next, create an Azure SQL DB dataset pointing to the SrcDb database (I've named it ASQLSrc_DS) and add a dataset parameter TableName of string type:
Switch to the Connection tab and enter the expression @dataset().TableName in the Table text box, which ensures that the table name for this dataset is assigned dynamically, using the dataset parameter:

Now that the source dataset has been created, let's return to the parent pipeline's design surface and enter the expression @item().name in the TableName text box. This expression ensures that items from the ForEach activity's input list are mapped to its copy activity's source dataset:
Next, let's create a parameterized sink dataset for the CopyFiltered_AC activity, using a similar method. Here is how your screen should look:

Now that we've completed the configuration of the CopyFiltered_AC activity, let's switch to the parent pipeline's design surface, using the navigation link at the top of the screen:
Next, let's add a Stored Procedure activity (I've named it SP_Purge_AC), point it to the Usp_PurgeTargetTables procedure we created earlier and link it to the Set_Variable_AC activity on the Success criteria:

As the last configuration step, let's link the SP_Purge_AC and ForEach_AC activities on the Success criteria. This ensures that the target tables are purged before the copy activities begin:
Finally, let's start the pipeline in Debug mode and examine the execution logs in the Output window to ensure that five copy activities (one per item in the FilteredTableNames variable list) have finished successfully:

We can also examine the input of the ForEach activity, using the Input button and confirm
that it received five items:

Since the pipeline works as expected, we can publish all the changes now.
I have attached the JSON scripts for this pipeline here.
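As one more check on the DstDb side, a quick row-count query across the five target tables confirms that the copies landed; a minimal sketch:
SELECT 'SalesLT.Product' AS TableName, COUNT(*) AS RowCnt FROM [SalesLT].[Product]
UNION ALL
SELECT 'SalesLT.ProductCategory', COUNT(*) FROM [SalesLT].[ProductCategory]
UNION ALL
SELECT 'SalesLT.ProductDescription', COUNT(*) FROM [SalesLT].[ProductDescription]
UNION ALL
SELECT 'SalesLT.ProductModel', COUNT(*) FROM [SalesLT].[ProductModel]
UNION ALL
SELECT 'SalesLT.ProductModelProductDescription', COUNT(*) FROM [SalesLT].[ProductModelProductDescription];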
Optional attributes of ForEach activity in Azure Data Factory
The ForEach activity has a few optional attributes that control the degree of parallelism of its child activities. Here are those attributes:
 Sequential - This setting instructs the ForEach activity to run its child activities in sequential order, one at a time.
 Batch Count - This setting specifies the degree of parallelism of the ForEach activity's child activities.
Here is the screenshot with these attributes:

Next Steps
Import Data from Excel to Azure SQL Database using Azure Data Factory
By: Ron L'Esteve   |   Updated: 2021-07-06   |   Comments (2)   |   Related: > Azure Data
Factory

Problem
The need to load data from Excel spreadsheets into SQL databases has been a long-standing requirement for many organizations. Previously, tools such as VBA, SSIS, C#
and more have been used to perform this data ingestion orchestration process. Recently,
Microsoft introduced an Excel connector for Azure Data Factory. Based on this new Excel
connector, how can we go about loading Excel files containing multiple tabs into Azure SQL
Database Tables?
Solution
With the new addition of the Excel connector in Azure Data Factory, we now have the
capability of leveraging dynamic and parameterized pipelines to load Excel spreadsheets into
Azure SQL Database tables. In this article, we will explore how to dynamically load an Excel
spreadsheet residing in ADLS gen2 containing multiple Sheets into a single Azure SQL
Table and also into multiple tables for every sheet.
Pre-Requisites
Create an Excel Spreadsheet
The image below shows a sample Excel spreadsheet containing four sheets with the same headers and schema, which we will use in our ADF pipelines to load data into Azure SQL tables.
Upload to Azure Data Lake Storage Gen2
This same Excel spreadsheet has been loaded to ADLS gen2.

Within Data Factory, we can add an ADLS gen2 linked service for the location of the Excel
spreadsheet.
Create Linked Services and Datasets

We'll need to ensure that the ADLS gen2 linked service credentials are configured accurately.

When creating a new dataset, notice that we have Excel format as an option which we can
select.
The connection configuration properties for the Excel dataset can be found below. Note that
we will need to configure the Sheet Name property with the dynamic parameterized
@dataset().SheetName value. Also, since we have headers in the file, we will need to check
'First row as header'.
Within the parameters tab, we'll need to add SheetName.

Next, a sink dataset to the target Azure SQL Table will also need to be created with a
connection to the appropriate linked service.
Create a Pipeline to Load Multiple Excel Sheets in a Spreadsheet into a Single
Azure SQL Table
In the following section, we'll create a pipeline to load multiple Excel sheets from a single
spreadsheet file into a single Azure SQL Table.
Within the ADF pane, we can next create a new pipeline and then add a ForEach loop activity
to the pipeline canvas. Next, click on the white space of the canvas within the pipeline to add
a new Array variable called SheetName containing default values of all the sheets in the
spreadsheet from Sheet1 through Sheet4, as depicted in the image below.

Next, add @variables('SheetName') to the items property of the ForEach Settings.


Next, navigate into the ForEach activity and add a CopyActivity with source configurations
as follows.
Within the sink configurations, we'll need to set the table option property to 'Auto Create
Table' since we currently do not have a table created.
After executing the pipeline, we can see that the four Sheets have been loaded into the Azure
SQL Table.

When we navigate to the Azure SQL Table and query it, we can see that the data from all the
Excel Sheets were loaded into the single Azure SQL Table.
Create a Pipeline to Load Multiple Excel Sheets in a Spreadsheet into Multiple Azure
SQL Tables
In this next example, we will test loading multiple Excel sheets from a spreadsheet into multiple Azure SQL tables. To begin, we will need a new lookup table that contains the SheetName and TableName values, which will be used by the dynamic ADF pipeline parameters.
The following script can be used to create this lookup table.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[ExcelTableLookUp](
   [SheetName] [nvarchar](max) NULL,
   [TableName] [nvarchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
Once the table is created, we can insert the SheetNames and corresponding TableNames into
the table:
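A minimal sketch of those inserts is shown below; the target table names here are placeholders, so substitute whatever Azure SQL table names you want each sheet to land in:
INSERT INTO [dbo].[ExcelTableLookUp] ([SheetName], [TableName])
VALUES ('Sheet1', 'Sheet1Table'), -- placeholder table names
       ('Sheet2', 'Sheet2Table'),
       ('Sheet3', 'Sheet3Table'),
       ('Sheet4', 'Sheet4Table');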

Next, we will also need to create a new dataset with a connection to the ExcelTableLookUp table.
The connection properties of the Excel Spreadsheet will be similar to the previous pipeline
where we parameterized SheetName as follows.

In this scenario, we will also need to add a parameter for the TableName in the Azure SQL
Database dataset connection as follows.
In the Azure SQL DB connection section, we'll leave the schema hardcoded and add the parameter for the TableName as follows.

In this pipeline, we will also need a Lookup activity, which serves the purpose of looking up the values in the SQL lookup table through a SELECT * query on the table.

The values from the lookup can be passed to the ForEach loop activity's items property of the
settings tab, as follows:
Next, within the ForEach loop activity, we'll need a Copy Data activity with the source dataset properties containing the parameterized SheetName value, as follows.

Next, the sink dataset properties will also need to contain the parameterized TableName
value, as follows. Note that the table option is once again set to 'Auto Create Table'.
After we run this pipeline, we can see that the pipeline succeeded and four tables were
created in the Azure SQL Database.

Upon navigating to the Azure SQL Database, we can see that all four tables were created with the appropriate names, based on the TableName values we defined in the SQL lookup table.
As a final check, when we query all four tables, we can see that they all contain the data from the Excel sheets, which confirms that the pipeline executed successfully and with the correct mappings of sheets to tables as defined in the lookup table.
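A hedged example of that final check, reusing the placeholder table names from the lookup-table inserts above:
SELECT 'Sheet1Table' AS TableName, COUNT(*) AS RowCnt FROM [dbo].[Sheet1Table]
UNION ALL
SELECT 'Sheet2Table', COUNT(*) FROM [dbo].[Sheet2Table]
UNION ALL
SELECT 'Sheet3Table', COUNT(*) FROM [dbo].[Sheet3Table]
UNION ALL
SELECT 'Sheet4Table', COUNT(*) FROM [dbo].[Sheet4Table];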
Logging Azure Data Factory Pipeline Audit Data
By: Ron L'Esteve   |   Comments (7)   |   Related: > Azure Data Factory
Problem
In my last article, Azure Data Factory Pipeline to fully Load all SQL Server Objects to ADLS
Gen2, I discussed how to create a pipeline parameter table in Azure SQL DB and drive the
creation of snappy parquet files consisting of On-Premises SQL Server tables into Azure
Data Lake Store Gen2. Now that I have a process for generating files in the lake, I would also
like to implement a process to track the log activity for my pipelines that run and persist the
data. What options do I have for creating and storing this log data?
Solution
Azure Data Factory is a robust cloud-based E-L-T tool that is capable of accommodating
multiple scenarios for logging pipeline audit data.
In this article, I will discuss three of these possible options, which include:
1. Updating Pipeline Status and Datetime columns in a static pipeline parameter table
using an ADF Stored Procedure activity
2. Generating a metadata CSV file for every parquet file that is created and storing the
logs in hierarchical folders in ADLS2
3. Creating a pipeline log table in Azure SQL Database and storing the pipeline activity
as records in the table

Prerequisites
Ensure that you have read and implemented Azure Data Factory Pipeline to fully Load all
SQL Server Objects to ADLS Gen2, as this demo will be building a pipeline logging process
on the pipeline copy activity that was created in the article.
Option 1: Create a Stored Procedure Activity
The Stored Procedure Activity is one of the transformation activities that Data Factory
supports. We will use the Stored Procedure Activity to invoke a stored procedure in Azure
SQL Database. For more information on ADF Stored Procedure Activity, see Transform data
by using the SQL Server Stored Procedure activity in Azure Data Factory.
For this scenario, I would like to maintain my Pipeline Execution Status and Pipeline Date
detail as columns in my Pipeline Parameter table rather than having a separate log table. The
downside to this method is that it will not retain historical log data, but will simply update the
values based on a lookup of the incoming files to records in the pipeline parameter table. This
gives a quick, yet not necessarily robust, method of viewing the status and load date across all
items in the pipeline parameter table.
I'll begin by adding a Stored Procedure activity to my Copy-Table activity, so that the stored procedure runs for each table as the process iterates at the table level.
Next, I will add the following stored procedure to my Azure SQL Database where my
pipeline parameter table resides. This procedure simply looks up the destination table name in
the pipeline parameter table and updates the status and datetime for each table once the Copy-
Table activity is successful.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[sql2adls_data_files_loaded] @dst_name NVARCHAR(500)
AS

SET NOCOUNT ON -- turns off messages sent back to client after DML is run, keep this here
DECLARE @Currentday DATETIME = GETDATE();

BEGIN TRY
   BEGIN TRANSACTION -- BEGIN TRAN statement will increment transaction count from 0 to 1
      UPDATE [dbo].[pipeline_parameter] SET pipeline_status = 'success', pipeline_datetime = @Currentday WHERE dst_name = @dst_name;
   COMMIT TRANSACTION -- COMMIT will decrement transaction count from 1 to 0 if dml worked
END TRY
BEGIN CATCH
   IF @@TRANCOUNT > 0
      ROLLBACK

   -- Return error information.
   DECLARE @ErrorMessage nvarchar(4000), @ErrorSeverity int;
   SELECT @ErrorMessage = ERROR_MESSAGE(), @ErrorSeverity = ERROR_SEVERITY();
   RAISERROR(@ErrorMessage, @ErrorSeverity, 1);
END CATCH;
GO
After creating my stored procedure, I can confirm that it has been created in my Azure SQL
Database.

I will then return to my data factory pipeline and configure the stored procedure activity. In
the Stored Procedure tab, I will select the stored procedure that I just created. I will also add a
new stored procedure parameter that references my destination name, which I had configured
in the copy activity.

After saving, publishing and running the pipeline, I can see that my pipeline_datetime and
pipeline_status columns have been updated as a result of the ADF Stored Procedure Activity.
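To double-check outside of ADF, a quick query against the parameter table (using the column names referenced in the stored procedure above) shows the refreshed values:
SELECT dst_name, pipeline_status, pipeline_datetime
FROM [dbo].[pipeline_parameter]
ORDER BY pipeline_datetime DESC;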

Option 2: Create a CSV Log file in Azure Data Lake Store2


Since my Copy-Table activity is generating snappy parquet files into hierarchical ADLS2 folders, I also want to create a metadata .csv file which contains the pipeline activity. For this scenario, I have set up an Azure Data Factory Event Grid trigger to listen for metadata files and then kick off a process to transform my table and load it into a curated zone.
I will start by adding a Copy activity for creating my log files and connecting it to the Copy-Table activity. Similar to the previous process, this process will generate a .csv metadata file in a metadata folder per table.
To configure the source dataset, I will select my source on-premises SQL Server.
Next, I will add the following query as my source query. As we can see, this query contains a combination of pipeline system variables, Copy-Table activity outputs, and user-defined parameters.

SELECT '@{pipeline().DataFactory}' as DataFactory_Name,
'@{pipeline().Pipeline}' as Pipeline_Name,
'@{pipeline().RunId}' as RunId,
'@{item().src_name}' as Source,
'@{item().dst_name}' as Destination,
'@{pipeline().TriggerType}' as TriggerType,
'@{pipeline().TriggerId}' as TriggerId,
'@{pipeline().TriggerName}' as TriggerName,
'@{pipeline().TriggerTime}' as TriggerTime,
'@{activity('Copy-Table').output.rowsCopied}' as rowsCopied,
'@{activity('Copy-Table').output.rowsRead}' as RowsRead,
'@{activity('Copy-Table').output.usedParallelCopies}' as No_ParallelCopies,
'@{activity('Copy-Table').output.copyDuration}' as copyDuration_in_secs,
'@{activity('Copy-Table').output.effectiveIntegrationRuntime}' as effectiveIntegrationRuntime,
'@{activity('Copy-Table').output.executionDetails[0].source.type}' as Source_Type,
'@{activity('Copy-Table').output.executionDetails[0].sink.type}' as Sink_Type,
'@{activity('Copy-Table').output.executionDetails[0].status}' as Execution_Status,
'@{activity('Copy-Table').output.executionDetails[0].start}' as CopyActivity_Start_Time,
'@{utcnow()}' as CopyActivity_End_Time,
'@{activity('Copy-Table').output.executionDetails[0].detailedDurations.queuingDuration}' as CopyActivity_queuingDuration_in_secs,
'@{activity('Copy-Table').output.executionDetails[0].detailedDurations.timeToFirstByte}' as CopyActivity_timeToFirstByte_in_secs,
'@{activity('Copy-Table').output.executionDetails[0].detailedDurations.transferDuration}' as CopyActivity_transferDuration_in_secs
My sink will be a csv dataset with a .csv extension.

Below is the connection configuration that I will use for my csv dataset.
The following parameterized path will ensure that the file is generated in the correct folder structure.
@{item().server_name}/@{item().src_db}/@{item().src_schema}/@{item().dst_name}/metadata/@{formatDateTime(utcnow(),'yyyy-MM-dd')}/@{item().dst_name}.csv
After I save, publish, and run my pipeline, I can see that a metadata folder has been created in
my Server>database>schema>Destination_table location.

When I open the metadata folder, I can see that there is a .csv file per day that the pipeline runs.

Finally, I can see that a metadata .csv file with the name of my table has been created.
When I download and open the file, I can see that all of the query results have been populated
in my .csv file.

Option 3: Create a log table in Azure SQL Database


My last scenario involves creating a log table in the Azure SQL Database where my parameter table resides, and then writing the pipeline data as records to that Azure SQL table.
Again, for this option, I will start by adding a copy data activity connected to my Copy-Table
activity.

Next, I will create the following table in my Azure SQL Database. This table will store and
capture the pipeline and copy activity details.
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[pipeline_log](
   [DataFactory_Name] [nvarchar](500) NULL,
   [Pipeline_Name] [nvarchar](500) NULL,
   [RunId] [nvarchar](500) NULL,
   [Source] [nvarchar](500) NULL,
   [Destination] [nvarchar](500) NULL,
   [TriggerType] [nvarchar](500) NULL,
   [TriggerId] [nvarchar](500) NULL,
   [TriggerName] [nvarchar](500) NULL,
   [TriggerTime] [nvarchar](500) NULL,
   [rowsCopied] [nvarchar](500) NULL,
   [RowsRead] [int] NULL,
   [No_ParallelCopies] [int] NULL,
   [copyDuration_in_secs] [nvarchar](500) NULL,
   [effectiveIntegrationRuntime] [nvarchar](500) NULL,
   [Source_Type] [nvarchar](500) NULL,
   [Sink_Type] [nvarchar](500) NULL,
   [Execution_Status] [nvarchar](500) NULL,
   [CopyActivity_Start_Time] [datetime] NULL,
   [CopyActivity_End_Time] [datetime] NULL,
   [CopyActivity_queuingDuration_in_secs] [nvarchar](500) NULL,
   [CopyActivity_timeToFirstByte_in_secs] [nvarchar](500) NULL,
   [CopyActivity_transferDuration_in_secs] [nvarchar](500) NULL
) ON [PRIMARY]
GO
Similar to my last pipeline option, I will configure my on-premises SQL Server as the source and use the query provided in Option 2 as my source query.

My sink will be a connection to the Azure SQL DB pipeline log table that I created earlier.

Below are the connection details for the Azure SQL DB pipeline log table.
When I save, publish and run my pipeline, I can see that the pipeline copy activity records
have been captured in my dbo.pipeline_log table.
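A quick way to inspect the most recent runs from SQL is a simple query over the log table created above; a minimal sketch:
SELECT TOP (10) Pipeline_Name, Source, Destination, rowsCopied, Execution_Status, CopyActivity_Start_Time
FROM [dbo].[pipeline_log]
ORDER BY CopyActivity_Start_Time DESC;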

Azure Data Factory If Condition Activity


By: Fikrat Azizov   |   Comments (1)   |   Related: > Azure Data Factory

Problem
In this series of posts, I am going to explore Azure Data Factory (ADF), compare its features against SQL Server Integration Services (SSIS) and show how to use it for real-life data integration problems. In Control flow activities, I provided an overview of control flow activities and explored a few simple activity types. In this post, we will be exploring the If Condition activity.
Solution
Azure Data Factory If Condition Activity
The If Condition activity is similar to SSIS's Conditional Split control, described here. It allows directing a pipeline's execution one way or another, based on some internal or external condition.
Unlike the simple activities we have considered so far, the If Condition activity is a compound activity: it contains a logical evaluation condition and two activity groups, one matching a true evaluation result and another matching a false evaluation result.
The If Condition activity's condition is based on a logical expression, which can include properties of the pipeline and trigger, as well as some system variables and functions.
Creating Azure Data Factory If Condition Activity
In one of the earlier posts (see Automating pipeline executions, Part 3), we created pipeline Blob_SQL_PL, which kicks off in response to file arrival events in a blob storage container. This pipeline had a single activity, designed to transfer data from CSV files into the FactInternetSales table in Azure SQL DB.
We will customize this pipeline and make it more intelligent: it will check the input file's name and, based on that, transfer files into either the FactInternetSales or the DimCurrency table, by initiating different activities.
To prepare the destination for the second activity, I have created table DimCurrency inside
DstDb, using the below script:
CREATE TABLE [dbo].[DimCurrency](
   [CurrencyKey] [int] IDENTITY(1,1) NOT NULL,
   [CurrencyAlternateKey] [nchar](3) NOT NULL,
   [CurrencyName] [nvarchar](50) NOT NULL,
   CONSTRAINT [PK_DimCurrency_CurrencyKey] PRIMARY KEY CLUSTERED ([CurrencyKey] ASC)
)
GO
Let's follow the below steps to add an If Condition activity:
Select pipeline Blob_SQL_PL, expand the 'Iterations and Conditionals' group on the Activities panel, drag and drop an If Condition activity into the central panel and assign a name (I've named it If_Condition_AC):

Switch to the Settings tab, place the cursor in the Expression text box and click the 'Add
dynamic content' link under that text box, to start building an evaluation expression:
Expand Functions/Conversion Functions group and select the bool function:

Place the cursor inside the bool function brackets, expand Functions/String Functions group
and select the startswith function:
Place the cursor inside the startswith function brackets, select the SourceFile pipeline parameter we created earlier, followed by a comma and the 'FactIntSales' string, and then confirm to close the Add Dynamic Content window. Here's the final expression: @bool(startswith(pipeline().parameters.SourceFile,'FactIntSales')), which evaluates whether or not the input file's name starts with the 'FactIntSales' string. Here's a screenshot of the activity with the evaluation condition:
Next, let's copy FactInternetSales_AC activity into the buffer, using right click and Cut
command:

Now, we need to add activities to the True and False evaluation groups. Select the If_Condition_AC activity, switch to the Activities tab and click the Add If True Activity button:
Right-click on the design surface and select the Paste command to paste the activity we copied into the buffer earlier, and assign a name (I have named it FactInternetSales_AC):

The activity FactInternetSales_AC was originally created with explicit field mapping (see Transfer On-Premises Files to Azure SQL Database for more details). However, because this pipeline is going to transfer files with different structures, we no longer need explicit mapping, so let's switch to the Mapping tab and click the Clear button to remove the mapping:
Please note the pipeline hierarchy link at the top of the design surface, which allows you to navigate to the parent pipeline's design screen. We could add more activities to the True Activities group; however, that's not required for the purpose of this exercise, so let's click the Blob_SQL_PL navigation link to return to the parent pipeline's design screen:

We'll follow similar steps to add activity into False group:


Let's add a Copy activity to copy files from the blob storage container into the DimCurrency
table in Azure SQL DB (I've named it DimCurrency_AC). This activity's source dataset
screen will be identical to the FactInternetSales_AC activity's source screen:

As for the sink dataset, we will need to create an Azure SQL DB dataset pointing to the DimCurrency table:
Now that we are done with the configuration of the DimCurrency_AC activity, we can return to the parent screen using the parent navigation link and publish the changes. Here is how your final screen should look at this point:
For those who want to see the JSON script for the pipeline we just created, I have attached the script here.
Validating Azure Data Factory Pipeline Execution
Because this pipeline has an event-based trigger associated with it, all we need to initiate it is
to drop files into the source container. We can use Azure Portal to manage files in the blob
storage, so let's open the Blob Storage screen and remove existing files from the csvfiles
container:

Now, use the Upload button to select DimCurrency.csv file from the local folder:

Let's wait a few minutes for this pipeline to finish and switch to the Monitor screen to examine the execution results. As expected, MyEventTrigger started the pipeline in response to the DimCurrency.csv file's upload event:

Upon further examination of execution details, we can see that DimCurrency_AC activity ran
after conditional validation:

Now, let's upload FactIntSales2012.csv file and see the execution results:
Activity Runs screen confirms that conditional activity worked as expected:

Conclusion
The If Condition activity is a great feature that allows you to add conditional logic to your data flow. You can build complex evaluation expressions interactively, using the Add Dynamic Content window, and you can nest multiple activities within an If Condition activity.
Although the If Condition activity's functionality in ADF is similar to that of SSIS's Conditional Split control, there are a few important differences:
 If Condition activity's evaluation conditions are based on object level (for example,
dataset source file name, pipeline name, trigger time, etc.), whereas SSIS's
Conditional Split's evaluation is based on row level conditions.
 SSIS's Conditional Split has default output, where rows not matching specified
criteria can be directed, whereas ADF only has True and False condition outputs.
