Build Data Pipeline Using Azure Medallion Architecture Approach
CookBook
In this training, we will walk through and execute a pipeline designed to analyze a complex dataset: water sensor data. The pipeline leverages Azure services, which are seamlessly integrated to enable efficient data movement and processing. Each component works together to ensure smooth data flow and accurate analysis, showcasing the power of Azure's advanced data integration capabilities. The use cases for the proposed pipeline are as follows:
● Authoritative and government organizations monitor the concentration levels in water bodies, collected using sensors. By analyzing this data, they aim to assess the safety and potential hazards of these water sources.
● A unified system for all government-related organizations and concerned authorities to address
water-related issues effectively and implement immediate safety measures. This kind of
system would enable streamlined collaboration and decision-making.
● The real-time recording of water quality data through sensors emphasizes the necessity of a
robust and constructive framework. The ultimate goal is to reduce harmful contaminants in
water, ensuring it remains safe and sustainable within the ecosystem.
This use case focuses on "Aggregated Sensor Water Data", which is collected across various time zones, countries, water bodies, and vegetation types. This becomes crucial when climatic and habitat conditions vary: such historical data is vital in providing a comprehensive overview to address and resolve complex or uncertain issues. Every single record is crucial, as it contributes to the measurement of determinands and informs proactive measures to ensure water safety.
This project leverages "Aggregated Sensor Water Data" as the main data source to implement a comprehensive Azure data pipeline. The pipeline is designed to ingest sensor data, process it, and move it to the destination, utilizing various Azure services such as Azure SQL Database, Azure Logic Apps, Azure Storage Accounts, Azure Data Factory, and Azure Databricks. The final transformed data is visualized using Power BI, one of the most in-demand Business Intelligence tools. The pipeline is structured to be modular and customizable, which allows additional Azure services to be seamlessly integrated to expand its functionality.
The Azure services chosen for this project are scalable, robust, user-friendly, and cost-effective.
The project uses the "Aggregated Sensor Water Data" dataset, which includes the following columns:
1. countryCode
2. monitoringSiteIdentifier
3. monitoringSiteIdentifierScheme
4. parameterWaterBodyCategory
5. observedPropertyDeterminandCode
6. observedPropertyDeterminandLabel
7. procedureAnalysedMatrix
8. resultUom
9. phenomenonTimeReferenceYear
10. parameterSamplingPeriod
11. procedureLOQValue
12. resultNumberOfSamples
13. resultQualityNumberOfSamplesBelowLOQ
14. resultQualityMinimumBelowLOQ
15. resultMinimumValue
16. resultQualityMeanBelowLOQ
17. resultMeanValue
18. resultQualityMaximumBelowLOQ
19. resultMaximumValue
20. resultQualityMedianBelowLOQ
21. resultMedianValue
22. resultStandardDeviationValue
23. procedureAnalyticalMethod
24. parameterSampleDepth
25. resultObservationStatus
26. remarks
27. metadata_versionId
These columns represent the level of detail, or drill-down of parameters, at which water readings were taken by the sensors.
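For reference, a minimal PySpark schema sketch covering a handful of these columns is shown below; the column names come from the list above, while the data types are assumptions that should be adjusted to the actual dataset.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Partial schema for the aggregated sensor water data (data types are assumed)
water_quality_schema = StructType([
    StructField("countryCode", StringType(), True),
    StructField("monitoringSiteIdentifier", StringType(), True),
    StructField("parameterWaterBodyCategory", StringType(), True),
    StructField("observedPropertyDeterminandLabel", StringType(), True),
    StructField("phenomenonTimeReferenceYear", IntegerType(), True),
    StructField("resultNumberOfSamples", IntegerType(), True),
    StructField("resultMinimumValue", DoubleType(), True),
    StructField("resultMeanValue", DoubleType(), True),
    StructField("resultMaximumValue", DoubleType(), True),
])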
For demonstration purposes, the data is initially loaded from a local source into an Azure Managed SQL Database using SQL Server Management Studio. This application allows a seamless connection to the Azure cloud database and enables the import of data from an Excel file (the sensor water data file) into Azure. The Azure pipeline is designed to handle data of any size, minimizing latency and ensuring that the processed data is quickly made available for analysis or any defined purpose.
The goal of this project is to address the water quality problem using an Azure pipeline built from these components:
1. Raw Data: The raw data, which is available in an Excel sheet, is first loaded into the Azure Managed SQL Database using SQL Server Management Studio (SSMS). This makes the data available to the Azure pipeline and ensures it is ready for further processing.
2. Data Movement: After the raw data is loaded into the SQL database, it is moved to Azure Blob Storage using Logic Apps. This component automates the data transfer from the Azure SQL Database to Blob Storage in JSON format. Logic Apps provide a wide variety of actions to accomplish such tasks.
3. Orchestration: Once the data is in Blob Storage, it is transferred to Azure Data Lake Gen2 for integration with Azure Databricks. This is achieved through Azure Data Factory, which includes a copy pipeline to move data into the lake.
4. Processing Layer: The data then undergoes transformation in Azure Databricks, following the Medallion Architecture with three layers:
● Bronze Layer: Raw data is ingested from Data Lake Gen2 into a DataFrame using Spark jobs.
● Silver Layer: The data is processed and refined for quality.
● Gold Layer: Final processing happens, and the transformed data is stored in a Databricks table.
5. Data Visualization: The processed data from the Gold Layer is then used as a source for Power BI, which enables dashboarding and detailed analysis.
The processed data from this pipeline will be visualized using a business intelligence tool, Power BI. The architecture of our project is structured to handle large volumes of historical sensor data.
The Azure Managed SQL Database service is fully managed by Microsoft, which means it handles maintenance, backups, and updates, enabling users to focus on their applications rather than database management. It is cost-sensitive, so it is important to monitor its cost analysis regularly. In the context of this project, an Azure SQL Server was created first, followed by setting up an Azure Managed SQL Database within that server.
SQL Server Management Studio (SSMS) was used to establish a connection to the Azure SQL Server. This connection was set up using credentials like the Microsoft login, password, and SQL Server authentication. Once connected, the large Excel file was imported into the Azure Managed SQL Database.
● It supports and processes both relational data and non-relational structures, such as graphs,
JSON, spatial, and XML.
● It is managed by Microsoft, with data spread across data centers and 24/7 support teams. It also creates a high-performance data storage layer for applications and solutions, with high-speed connectivity.
● Azure SQL Managed Database takes care of database infrastructure, patches, and maintenance automatically. This significantly reduces the overhead of managing hardware and software updates.
● We can scale the database's compute and storage resources dynamically based on workload
demands.
● With data stored across different data centers managed by Microsoft, there is no loss of data even in the case of outages or disasters.
● While Azure SQL Managed Database offers flexibility in pricing, it can become complex for users who are not familiar with the pricing structure. If you select the wrong geo region by mistake, you will incur higher-than-standard charges.
● Since Azure SQL Managed Database is a fully managed service, you have limited control over the underlying operating system and database settings compared to an on-premises SQL Server.
● For workloads with very high transaction rates or extreme performance demands, Azure SQL
Managed Database might face limitations.
● Azure SQL Managed Database is tightly integrated with Azure, which could lead to vendor
lock-in.
● Azure SQL Managed Database is designed for relational data and SQL workloads.
Alternatives:
● Amazon RDS (Relational Database Service)
● Google Cloud SQL
● MySQL
Whether it's managing data, coordinating processes, or connecting disparate systems, Azure Logic Apps offers a robust, user-friendly solution to streamline every data operation. Its low-code, visual designer interface makes it easy to build and manage complex workflows without needing extensive programming skills. It has become an indispensable tool for enterprises to fetch a variety of data from different sources. Azure Logic Apps has more than 200 connectors, which makes it a strong choice for data integration and retrieval.
In the context of this project, Azure Logic Apps was utilized to automate the process of fetching data
from an Azure Managed SQL Database and storing it in Azure Blob Storage. The workflow was set up
with an HTTP request trigger, allowing the Logic App to be invoked via a URL for versatile integration.
● Easy to Use: Azure Logic Apps provides a visual design interface that allows non-developers to create workflows easily.
Alternatives:
● AWS Step Functions
● Google Cloud Workflows
● Apache Airflow
Azure Storage Account is a fundamental service in Azure that provides highly scalable, durable, and secure cloud storage for a variety of data types. It serves as the container for storing data in different formats such as blobs, files, queues, and tables. There are several types of storage accounts in Azure, each tailored to specific needs, such as Blob Storage and Azure Data Lake Storage Gen2.
● With the ability to scale horizontally across both Blob Storage and Data Lake Storage Gen2, the system can handle increasing loads of unstructured raw data.
● Using Blob Storage for frequently accessed raw data and Data Lake Storage Gen2 for large-scale data processing ensures you pay only for the storage you need while maximizing cost efficiency.
● Both storage types provide a robust set of security features, but Data Lake Storage Gen2 offers
more granular control over who can access specific parts of data.
● By using both storage accounts, it is ensured that raw data in Blob Storage is highly available and resilient to failures.
● Raw JSON data stored in Blob Storage can be easily moved to Data Lake Storage Gen2, where it can be processed using Spark jobs and machine learning models.
● Using Blob Storage for raw data storage and then moving data to Data Lake Gen2 can introduce performance overheads, especially with frequent file writes and reads.
● Both Blob Storage and Data Lake Storage Gen2 require external tools to perform data
transformations or processing tasks.
● Neither Blob Storage nor Data Lake Storage Gen2 provides native querying capabilities sufficient for advanced analytics and querying directly from the storage.
● Depending on the volume and frequency of data transfers, extra costs can add up.
● Both Blob Storage and Data Lake Storage Gen2 require external tools to handle advanced data transformations, processing, and analytics.
● There is no snapshot mechanism or automated backup for Azure Files.
Alternatives:
● Amazon S3 (Simple Storage Service)
● Google Cloud Storage
● IBM Cloud Object Storage
Azure Data Factory (ADF) is a cloud-based data integration service provided by Microsoft Azure. It allows you to create, schedule, and orchestrate data workflows. ADF helps move and transform data from one location to another, using a visual interface or a code-based approach. In this project, Azure Data Factory (ADF) will be used to orchestrate the movement of raw data stored in Azure Blob Storage to Azure Data Lake Storage Gen2 (ADLS). ADF's Copy Pipeline will be used, and the JSON data in Blob Storage will be transferred to ADLS.
● Scalability: ADF is highly scalable, making it suitable for both small and large-scale data integration projects.
● Integration with Azure Services: ADF integrates seamlessly with other Azure services, such as Azure Synapse Analytics, Azure Data Lake Storage, and Azure Machine Learning.
● Limited Transformation Options: While ADF is great for data movement, its data transformation capabilities are less flexible compared to other services like Azure Databricks or SQL-based solutions.
● Complex Error Handling: While ADF has error-handling mechanisms, they can be cumbersome to configure.
● Data Movement Restrictions: Although ADF supports various data sources and destinations, it may not always offer the same level of integration as other services.
● User Interface Limitations: The ADF user interface can sometimes be unintuitive, especially for users unfamiliar with the platform.
● Limited Customization for Transformations: ADF's built-in transformation capabilities are somewhat limited, especially for complex data transformations.
Alternatives:
● AWS Glue
● dbt
● Apache NiFi
Azure Databricks, developed in collaboration with Microsoft, is a managed version of Databricks that allows Azure customers to set up with a single click, streamline workflows, and access shared collaborative interactive workspaces. It facilitates rapid collaboration among data scientists, data engineers, and business analysts through the Databricks platform. Azure Databricks is closely integrated with Azure storage and compute resources, such as Azure Blob Storage, Data Lake Store, SQL Data Warehouse, and HDInsight.
● Despite its detailed documentation and intention to simplify data processing, many users find Databricks' lakehouse platform daunting to master.
● Commands issued in non-JVM languages need extra transformations to run on a JVM process.
● While it offers a secure, collaborative environment with different services and integrations,
these enhancements come at a cost.
● Due to its cloud-native architecture, certain workloads might experience performance
overhead.
● While Databricks integrates well with Azure storage options, its native storage capabilities are not as extensive.
Alternatives:
● Amazon EMR (Elastic MapReduce)
● Google Cloud Dataproc
● Apache Spark on Kubernetes
Power BI:
Microsoft Power BI is a business intelligence tool that helps users analyze and visualize data to make informed decisions. It allows users to connect to a variety of data sources, transform the data into meaningful visualizations, and create dashboards to help make data-driven decisions. Power BI is widely used for its ease of use, interactivity, and ability to handle large datasets from multiple sources in real time.
● Ease of Use: Power BI is user-friendly with a drag-and-drop interface, making it accessible to both technical and non-technical users.
● Cost-Effective: Power BI offers a free version with core features and affordable pricing for the Pro and Premium versions, making it accessible for businesses of all sizes.
● Integration with Multiple Data Sources: Power BI can connect to a wide variety of data sources including databases, cloud storage, spreadsheets, and even live data streams.
● Real-Time Data Access: It supports real-time data refreshes, enabling users to work with the latest data and make timely decisions.
● Customizable Dashboards: Users can create personalized dashboards and interactive reports, giving flexibility in visualizing data the way they want.
● Limited Data Capacity (Free Version): Restrictions on data storage and sharing in the free version.
● Performance with Large Datasets: Can slow down with very large datasets.
● Steep Learning Curve: Advanced features and customizations can be difficult to learn.
● Limited Customization for Visualizations: Less flexibility compared to other BI tools.
● Data Refresh Limits: Limited refresh frequency, especially in the free version.
Alternatives:
● Tableau
● Qlik Sense
● Google Data Studio
A Resource Group is created in Azure to manage all Azure services effectively in one place. It makes it easier to monitor, update, or delete all Azure resources together as a single unit.
a) In the search bar, enter "Resource Groups" and select it from the suggestions.
d) Under Tags, in the Name field, enter something like developer. In the Value field, enter your name.
a) Click on the Resource Group you just created, then select the "Create" option within the resource group.
g) On the Server Details page, under the Server Name field, enter the desired name for your server.
j) Select "Set admin" to grant the Azure authorized user administrative access to the SQL Server.
l) In the Username field, enter the username you created for your SQL Server, and in the Password field, enter the corresponding password.
n) After specifying your SQL Server username and password, click on "OK".
p) Now, set the Workload to Development, as the project does not include production-level workloads.
v) After that, validation may take some time. Once complete, your Azure SQL Server and database will be successfully created.
x) The Azure SQL Server and Azure SQL Database have now been successfully created.
a) To import data from an Excel file into an Azure SQL Database, you need the SQL Server Management Studio (SSMS) application.
b) Navigate to your browser, type "Download SQL Server Management Studio" in the search bar, and click on the first website that appears.
f) In the Login and Password fields, specify the username and password of your Azure SQL Server that you saved earlier. After that, click on Connect.
h) Expand Databases by clicking on the + icon, and then your Azure SQL Database will appear there.
j) Under the Choose a Data Source panel, select Microsoft Excel from the dropdown list as the data source.
o) In the Authentication Method, choose Use SQL Server Authentication, and in the Username and Password fields, specify the login and password you created for your SQL Server. Then click on "Next".
q) Click on the Destination and change the name of the destination table in the Azure SQL Database. You can name it according to your preference.
s) Change the mappings of the last columns from 255 to 1200. This allows for a larger character length, accommodating data that exceeds the default limit of 255 characters.
u) Then, click on Next twice to proceed through the two configuration steps, and finally click on Finish to complete the import process.
x) To check whether the data has been uploaded to the table in the Azure SQL Database, a simple SQL query can be executed on top of the table.
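If you prefer to verify programmatically rather than in SSMS, the following is a minimal Python sketch using pyodbc; the server, database, and credential values are placeholders, and the table name follows the [dbo].[Waterqualitydata] example used later in this guide.

import pyodbc

# Placeholder connection details for the Azure SQL Database
conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-sql-server>.database.windows.net;"
    "DATABASE=<your-database>;"
    "UID=<your-username>;PWD=<your-password>"
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    # Count the rows imported from the Excel file
    cursor.execute("SELECT COUNT(*) FROM [dbo].[Waterqualitydata]")
    print("Rows loaded:", cursor.fetchone()[0])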
c) In the Create a Storage Account panel, under the Resource Group section, click on the dropdown and select the resource group you just created.
i) Now, on the Blob Storage Account interface, in the left-side panel, click on Data storage. Under Data storage, click on Containers to create a container for Azure Blob Storage.
k) In the right-side pop-up panel, specify the name of the container as per your need and then click on "Create". Make sure the name is unique within the storage account.
a) In the search bar, type Logic Apps and select Logic Apps from the suggestions. The purpose of Logic Apps is to create a workflow that fetches data from the Azure SQL Database and loads it into Azure Blob Storage.
c) Under the Standard plan, click on Workflow service plan to create the Logic App with lower charges, as you're handling a lighter workload. Then click on "Select".
e) In the Logic App Name field, specify the name of the Logic App as desired. Ensure the name is unique. Then click on Next: Storage.
By specifying this account, you are ensuring that all logs and artifacts generated by the Logic App will be stored in the selected storage account.
i) Validation might take some time. Once complete, click on Go to resource.
This will allow you to create the workflow to load data from the Azure SQL Database to Azure Blob Storage.
k) Then, click on + Add, and then click Add again to create the workflow.
The reason for selecting Stateful is that it provides better efficiency when uploading large data, as it retains the state across workflow runs.
n) In the workflow panel, on the left-hand side, click on Designer. This will allow you to create the workflow using drag-and-drop actions from the available options.
To fetch data from the Azure SQL Database and load it into Azure Blob Storage, Azure Logic Apps needs access to the Azure SQL Database. A connection must be established between the Logic App and the Azure SQL Database. The mechanism involves associating a user with the Logic App that has the necessary permissions to access the database. To achieve this, a user is created within the Azure SQL Database using a SQL query. Then, the Logic App user is granted access to the Azure SQL Database to allow the workflow to execute.
p) Open a new tab in your browser and navigate to the Azure portal. Then, access your Azure SQL Database by searching for it in the portal.
r) Now, click on Query Editor. You will have two options to access your Azure SQL Database:
● Use the username and password you created for the SQL Server.
● Log in directly using your Azure email ID.
t) First Query
First, run the SQL query to create a user for the Logic App and enable it within your Azure environment, i.e. a statement of the form CREATE USER [waterQualityLA1] FROM EXTERNAL PROVIDER;
CREATE USER: This command is used to create a new user in the Azure SQL Database.
waterQualityLA1: Name of the Logic App (replace it with the name of the Logic App you created).
FROM EXTERNAL PROVIDER: This indicates that the user is being authenticated by an external provider (such as Azure Active Directory), rather than using traditional SQL authentication (username/password). In this case, it's the Logic App user being created through Azure AD.
Breakdown of Query: This query allows the Logic App to read data from the Azure SQL Database. Since the Logic App will fetch data and move it to Azure Blob Storage, it only requires read permissions, which were assigned through the db_datareader role (a statement of the form ALTER ROLE db_datareader ADD MEMBER [WaterQualityLA1];). The Logic App user [WaterQualityLA1] was added as a member so it can read the data from the Azure SQL Database.
db_datareader Role: A built-in role granting read-only access to all tables and views in the database.
[WaterQualityLA1]: The Logic App's Azure AD user, added to the db_datareader role for read-only access to the database.
In this segment, a workflow will be created in the Logic App. Actions and triggers will be dragged and
dropped to design a comprehensive workflow. This workflow will fetch data from the Azure SQL
Database and store it in the Azure Blob Storage container (in JSON format) you created earlier.
a) Return to the tab where the Logic App workflow was created, or navigate to the Logic App in the Azure portal, and click on the workflow you previously created.
c) The first step is to create an HTTP trigger in the workflow. For that, click on the "trigger" icon.
This will generate an API-like URL that can be used to call the workflow on demand, reducing the need for manual intervention.
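As an illustration, once the trigger URL is copied from the workflow's trigger settings, the workflow can be invoked from any HTTP client; below is a minimal Python sketch using the requests library, with a placeholder standing in for the URL generated by your Logic App.

import requests

# Placeholder for the HTTP trigger URL generated by the Logic App workflow
trigger_url = "https://<your-logic-app>.azurewebsites.net/api/<workflow-name>/triggers/..."

# Posting to the URL starts a workflow run on demand
response = requests.post(trigger_url)
print(response.status_code)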
e) Then click on the + icon to add another action, "Execute query", to the workflow.
The Execute query action is added to the workflow to query the entire dataset from the Azure SQL Database.
g) Once the Execute query action is added, you will need to specify the configuration of the Azure SQL database in the Execute query fields.
1. Connection name: Give a name to the connection as per your preference.
2. Authentication type: From the dropdown, choose "Managed identity".
The names of the Azure SQL Server and Azure SQL Database can also be copied by navigating to the Azure SQL Database service and then clicking on the Azure SQL Server you created.
i) After specifying the connection details, the SQL query prompt page opens. Here, a SQL query will be executed that retrieves all the data from the Azure SQL Database.
→ Breakdown of Query: SELECT * FROM [dbo].[Waterqualitydata]
● SELECT *: Retrieves all columns from the table.
● [dbo]: dbo (Database Owner) is the default schema in SQL Server and Azure SQL Database.
● [Waterqualitydata]: The table name from which data is fetched.
k) Then scroll down until you find the action "Upload blob to storage container". Select this action.
The "Upload Blob to Storage" action in Azure Logic Apps is used to take the data retrieved from the Azure SQL Database by the Execute Query action. Once the data is fetched, it is converted into JSON format. This JSON data (blob) is then uploaded to an Azure Blob Storage container.
Click on "Change Connection" in the bottom right corner and add a new connection.
1. Connection name: Give a name as per your preference.
2. Authentication type: From the dropdown, select Storage account connection string.
(To establish a connection with Azure Blob Storage and upload the data into the Blob container, use Azure Blob Storage Access Keys or a Connection String for authentication.)
→ Here are the steps to get the Azure Blob Storage account access key:
1. Open another tab in the browser and, in the Azure portal, go to your Storage Account.
2. In the left-hand panel, under Security + networking, click on Access keys.
3. Click on Show for the connection string and copy it.
Then click on "Create New" and it will create a connection with the Azure Blob Storage.
1. Container Name: In this field, specify the name of the blob container you created earlier.
2. Blob Name: In this field, specify the name of the blob/file that will be loaded to the Blob Container. This is simply the name you give to your file, which will be loaded to the blob container (in JSON format).
Then click on the Result item (as the blob action stores the output of the Execute query action and uploads the results to the blob container in JSON form).
q) Click on the Run option, then click on Run again. This will trigger the workflow and start its execution.
7. Creating Azure Data Lake Storage Account (ADLS Gen2) and its container
a) Navigate to the Azure home portal, type Azure Storage Accounts in the search bar, and select Storage Accounts from the suggestions.
d) In the Primary service field, from the dropdown, choose the "Azure Blob Storage or Azure Data Lake Storage" option.
f) Then click on "Review + Create" and then click on "Create" to create the ADLS Gen2 account, and wait until the validation is done.
g) Once the deployment of the ADLS account is done, click on "Go to Resource".
i) Click on + Container. In the right-side panel, specify the name of the ADLS container as desired, then click on Create.
Once you find the Data Protection option, in the right-hand side panel, uncheck the two options. This will disable soft delete for both blobs and containers in the ADLS Storage Account.
The reason behind disabling this option is to get more control over when to delete the data present in the Storage Account and container.
The copy pipeline will be created in Azure Data Factory (ADF) to orchestrate the movement of the JSON file
from the Azure Blob Storage container to the ADLS container. ADF will be responsible for initiating and
managing the data movement in this project.
a) Navigate back to the Azure portal and, in the search bar, type Azure Data Factory. Select Data Factories from the suggestions.
c) In the Resource Group field, select the resource group you created from the dropdown. Also, specify the name of the Azure Data Factory as you want.
e) Click Next two times, then click Review + Create. The validation of the deployment might take some time. Once the deployment is successful, click Go to Resource.
i) Click on the Copy Pipeline on the white canvas, then in the General section, specify the name of the pipeline as desired.
In this project, the source is the Blob Storage container containing the JSON data file.
k) In the right-side panel, under the Select a data source field, type Blob, select the Azure Blob Storage option, and click on Continue. Since the data file in the Azure Blob Storage container is in JSON format, choose JSON as the format type.
m) Give a name to the source and then click on + New to create a linked service for the Azure Blob Storage container.
o) Now, set the properties to define the exact location where the JSON file sits in the Azure Blob Storage container. Click on the file icon to browse to the location of the Azure Blob container which has the JSON file.
q) Now, click on the file that is present in the Azure Blob Storage Container.
t) In the right-side panel, under the Select a data source field, type Azure Data Lake Storage Gen2, select the ADLS Gen2 option, and click on Continue.
v) In the linked service, give a name to the linked service as you want and then click on + New to create a new Linked Service.
x) Then from this panel, specify the details, click on the file icon, and browse to the ADLS container that you created earlier.
z) Then click on the Validate option. Validation in ADF ensures that the pipeline configurations are correct and that all connections are properly established before execution.
The successful pipeline execution moves the JSON file from Azure Blob Storage to the ADLS container.
To verify, navigate to the ADLS container, and you will see the JSON file there.
a) Navigate to the Azure portal, type "Azure Databricks" in the search bar, and select "Azure Databricks" from the suggestions.
) In the"Workspace Name"field, specify a uniquename for your Databricks workspace. Keep all
d
other settings as default. From the pricing tier, select“standard one”as premium is costly.
f) Click on Launch Workspace to open the Databricks interface. Here, a cluster will be created and code will be executed in a notebook to process the data.
● Workspace
● Compute
● Job Runs
● Dashboards
● Catalog, and more
h) First, a cluster will be created inside the Databricks Workspace. In the left-side panel, click on Compute.
A cluster is necessary for Databricks to run code and process data during job runs, as it provides the computational resources.
j) In the Policy field, from the dropdown, select "Unrestricted", or go with "Personal Compute" if you face any error.
The cluster may take some time to create. Once it is ready, a ready indicator will appear next to the cluster name.
Since the raw JSON data file is in the ADLS container, it will be processed and structured using the medallion architecture, flowing through bronze, silver, and gold layers. Each layer will have a
dedicated Databricks notebook in the same workspace for processing, analyzing, and cleaning data
with distinct purposes. After the final transformation is done in the gold layer, the data will be
uploaded to the Hive metastore in Databricks. This data will then be connected to Power BI to create
visualizations.
Why is Medallion Architecture considered the best standard in data engineering for processing raw data and delivering high-quality data?
A medallion architecture is a data design pattern used to logically organize data, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables).
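Conceptually, each layer reads from the previous one and writes a more refined Delta output. A minimal sketch of the path convention used by the three notebooks in this project is shown below; the bronze and silver paths match the ones used later in this guide, and gold_table is the final table registered in the Hive metastore.

# Delta locations for each medallion layer (as used by the notebooks in this project)
BRONZE_PATH = "/mnt/datalake/bronze/water_quality"          # raw ingested data
SILVER_PATH = "/mnt/datalake/silver/water_quality_cleaned"  # cleaned, refined data
GOLD_TABLE = "gold_table"                                   # final reporting table in the Hive metastore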
b) In the top-right corner, click on Create, then select Notebook. Repeat this process three times to create separate notebooks for each layer: Bronze, Silver, and Gold. All notebooks will be created in this same workspace.
c) Alternatively, instead of creating the notebooks, you can import all three notebooks into the same workspace. This can be done by clicking the (⋮) icon, selecting Import, and uploading the notebooks from your local system to the Databricks workspace.
d) Click on browse to upload all three notebooks from your local system to the Databricks workspace.
The Bronze layer is where we land all the data from external source systems. In this project, the Bronze Layer Notebook will be used to ingest the raw JSON data from the ADLS container and load it into a DataFrame for further processing.
Code Breakdown:
1. This code imports essential libraries to set up a Spark session and enables the use of DataFrame operations. SparkSession is the entry point for Spark, and col helps reference columns. The execution of this cell sets up the environment for processing data using Spark.
4. It configures the Spark session to use the ADLS storage account by setting the account key. It
allows Spark to authenticate and access the ADLS storage.
5. It reads a JSON file from the Azure Blob Storage into a DataFrame.
6. The code sets up writing the DataFrame bronze_df (which contains the raw data from the JSON file) into a Delta Lake table in the Bronze Layer of the medallion architecture.
● bronze_df: The DataFrame contains the raw data loaded from Azure Blob Storage.
● .write: This is an operation to write the DataFrame to a storage location.
● .format("delta"): Specifies that the data will be written in the Delta format, which is optimized for large-scale data processing.
● .mode("overwrite"): This means that if the data already exists at the specified location, it will be overwritten.
● .save("/mnt/datalake/bronze/water_quality"): This specifies the path where the data will be saved, i.e., the Bronze Layer location in the /mnt/datalake/bronze/water_quality directory.
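Putting the pieces described above together, a minimal sketch of the Bronze notebook might look like the following; the storage account name, container name, access key, and file name are placeholders, while the Delta path is the one used in this project.

from pyspark.sql import SparkSession

# In a Databricks notebook a SparkSession already exists as `spark`; this keeps the sketch self-contained
spark = SparkSession.builder.appName("BronzeLayer").getOrCreate()

# Authenticate Spark against the ADLS Gen2 storage account using the account key (placeholders)
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<access-key>",
)

# Ingest the raw JSON file landed in the ADLS container into a DataFrame
bronze_df = spark.read.json(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/<sensor-data-file>.json"
)

# Persist the raw data in Delta format at the Bronze layer location
bronze_df.write.format("delta").mode("overwrite").save("/mnt/datalake/bronze/water_quality")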
In this Silver layer, data from the Bronze layer is cleaned, merged, and processed to provide a more
refined and consistent view of key business data. This layer transforms raw data from the Bronze layer
into usable form for analysis in the gold layer.
2. It loads the data from the Bronze layer, stored in Delta format at the specified location (/mnt/datalake/bronze/water_quality), into a DataFrame called bronze_df. The execution of this cell provides access to the raw data that was ingested into the Bronze layer.
7. This code maps country abbreviations to their full names in the countryCode column for better clarity.
The data is written in "overwrite" mode, so it will replace any existing data at the specified path (/mnt/datalake/silver/water_quality_cleaned) in DBFS (Databricks File System).
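A minimal sketch of the Silver notebook following the steps described above; the duplicate-drop step and the specific country codes in the mapping are illustrative assumptions, and spark refers to the session provided by the Databricks notebook.

from pyspark.sql import functions as F

# Load the raw Delta data written by the Bronze notebook
bronze_df = spark.read.format("delta").load("/mnt/datalake/bronze/water_quality")

# Basic cleaning: drop fully duplicated records (adjust to your own quality rules)
silver_df = bronze_df.dropDuplicates()

# Map country abbreviations to full names for clarity (illustrative subset of codes)
silver_df = silver_df.withColumn(
    "countryCode",
    F.when(F.col("countryCode") == "DE", "Germany")
     .when(F.col("countryCode") == "FR", "France")
     .when(F.col("countryCode") == "IT", "Italy")
     .otherwise(F.col("countryCode")),
)

# Write the cleaned data to the Silver layer location in DBFS, replacing any previous output
silver_df.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/water_quality_cleaned")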
→ After cleaning and filtering in the Silver layer, the data will be further processed in the Gold layer to be refined and ready for use in the use case.
The final data transformations and data quality rules are applied here. The Gold layer is for reporting and uses more de-normalized and read-optimized data.
Data from the Gold layer is typically used for dashboards, reporting, and model building, as it represents the final transformed and high-quality data.
2. It loads the data from the Silver layer, stored in Delta format at the specified location (/mnt/datalake/silver/water_quality_cleaned), into a DataFrame called silver_df. The execution of this cell provides access to the cleaned data that was processed in the Silver layer.
Breakdown:
A)
mean_val = gold_df.select(F.mean("Minimum_Value")).first()[0]
stddev_val = gold_df.select(F.stddev("Minimum_Value")).first()[0]
B)
gold_df_with_zscore = gold_df.withColumn(
    "z_score", (F.col("Minimum_Value") - F.lit(mean_val)) / F.lit(stddev_val)
)
C)
gold_df_with_outliers = gold_df_with_zscore.withColumn(
    "MinimumValue_outlier", F.when(F.abs(F.col("z_score")) > 3, 1).otherwise(0)
)
—> It checks if the Z-score is greater than 3 or less than -3 (which means the value is far from the average). If it is, the value is marked as an outlier (1); otherwise it is marked as normal (0). Values with Z-scores beyond 3 are considered outliers because they are far away from the average.
D)
outlier_rows = gold_df_with_outliers.filter(gold_df_with_outliers.MinimumValue_outlier == 1)
print("Rows with MinimumValue_outlier = 1:")
outlier_rows.display()
→ It is simply filtering and displaying the rows that are considered outliers based on the condition MinimumValue_outlier == 1.
Follow the same process used for the Minimum_Value column, but replace the column name with Maximum_Value in the code.
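Rather than duplicating the cells, the same z-score logic can be wrapped in a small helper and applied to both columns. The sketch below follows the breakdown above; the helper name is an assumption rather than part of the original notebooks, and gold_df is the DataFrame loaded earlier in the Gold notebook.

from pyspark.sql import functions as F

def flag_outliers(df, value_col, flag_col):
    """Add a z-score based outlier flag (1 = outlier, 0 = normal) for value_col."""
    mean_val = df.select(F.mean(value_col)).first()[0]
    stddev_val = df.select(F.stddev(value_col)).first()[0]
    df = df.withColumn("z_score", (F.col(value_col) - F.lit(mean_val)) / F.lit(stddev_val))
    return df.withColumn(flag_col, F.when(F.abs(F.col("z_score")) > 3, 1).otherwise(0))

gold_df_with_outliers = flag_outliers(gold_df, "Minimum_Value", "MinimumValue_outlier")
gold_df_with_outliers = flag_outliers(gold_df_with_outliers, "Maximum_Value", "MaximumValue_outlier")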
a) Create Database: In the notebook, create a database that will be visible in the Hive metastore.
Code:
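A minimal sketch of this cell; the database name water_quality_db is an assumption, and any valid name can be used as long as it is reused in the next step.

# Create a database that will appear under the Hive metastore in the Catalog
spark.sql("CREATE DATABASE IF NOT EXISTS water_quality_db")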
b) Run the following code snippet to load the data from the gold_df DataFrame into a table (gold_table) within the database you created.
Code:
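A minimal sketch of this cell, assuming the database name from the previous step and the gold_table name referenced later when connecting Power BI.

# Register the final Gold-layer DataFrame as a managed Delta table in the Hive metastore
gold_df.write.format("delta").mode("overwrite").saveAsTable("water_quality_db.gold_table")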
c) On the left side panel, click on Catalog, then under Hive Metastore, select the database you created.
a) Open Power BI Desktop, click on "Get Data", and from the dropdown, click on "More".
c) Specify details such as the server hostname and HTTP path of your Databricks environment. Copy these values to specify them in your connection settings in Power BI.
g) In the Azure Databricks connection window, click on Azure Active Directory for authentication.
i) Click on Connect. This will establish a connection between Databricks and Power BI.
k) Now, in the right panel under the Data section, your gold_table will be listed. This indicates that the data from gold_table has been successfully loaded into Power BI.
● In this training, we created a comprehensive data pipeline using Azure Cloud, enabling organizations and concerned authorities to take measurable actions quickly.
● The flexibility of the pipeline allows it to cover a vast number of use cases.
● Initially, we discussed the water sensor data (our main data source) and emphasized the importance of treating every single record generated by the sensors, highlighting the criticality of handling data for accurate analysis.
● A batch-processing approach was employed to extract the entire dataset from the Azure-based
SQL database.
● For this project, we used aggregated water sensor data collected across different terrains,
water bodies, vegetation, and countries.
● The architecture of our project is quite comprehensive and can accommodate additional
functionalities. The architecture encompasses the flow in this way:
○ Initially, we loaded the data from local storage into the cloud-based SQL database
using pipelines, as the entire database was utilized for a historical approach.
○ In the storage layer, we used ADLS storage containers and Blob Storage to store the raw
data files.
○ In the data movement layer, we leveraged Azure Logic Apps and Azure Data Factory to
facilitate the movement of data between services, ensuring it reached its final
destination.
○ In the transformation layer, we adhered to the Medallion Architecture, an
industry-standard framework, to transform the data rigorously and deliver a
high-quality dataset.
○ Finally, we connected the processed dataset in the Gold-layer Hive metastore table to Power BI and explained how to develop visualizations on this data to create comprehensive dashboards, enabling government and private organizations to make timely decisions and prevent hazardous activities.
● Various Azure services were used to implement the project, and cost configurations were also taken care of to avoid unwanted charges.