
Build Data Pipeline Using Azure Medallion Architecture Approach

CookBook


Table of Contents:

1. Introduction
2. Use-case depicted in this project
3. Architecture
4. Prerequisites
5. Services
6. Execution
   i) Creating Resource Groups
   ii) Creating Azure SQL Database
   iii) Importing Data in Azure SQL Database
   iv) Creating Azure Blob Storage
   v) Creating Logic App
   vi) Creating Ingestion Workflow in Azure Logic App
   vii) Creating Azure Data Lake Storage Account (ADLS Gen2)
   viii) Creating Azure Data Factory
   ix) Creating Azure Databricks
   x) Azure Databricks Medallion Architecture
      ● Bronze Layer
      ● Silver Layer
      ● Gold Layer
   xi) Connection to Power BI and Visualizations
   xii) Summary



‭Introduction‬

In this training, we will build and execute a pipeline designed to analyze a complex dataset: water sensor data. The pipeline leverages Azure services that are seamlessly integrated to enable efficient data movement and processing. Each component works together to ensure smooth data flow and accurate analysis, showcasing Azure's advanced data integration capabilities. The use cases for the proposed pipeline are as follows:

● Authorities and government organizations monitor the concentration levels of contaminants in water bodies, collected using sensors. By analyzing this data, they aim to assess the safety and potential hazards of these water sources.
● A unified system for all government-related organizations and concerned authorities to address water-related issues effectively and implement immediate safety measures; such a system would enable streamlined collaboration and decision-making.
● The real-time recording of water quality data through sensors calls for a robust and constructive framework. The ultimate goal is to reduce harmful contaminants in water, ensuring it remains safe and sustainable within the ecosystem.

This use case focuses on "Aggregated Sensor Water Data" collected across various time zones, countries, water bodies, and vegetation types. This becomes crucial when climatic and habitat conditions vary: such historical data provides a comprehensive overview for addressing complex or uncertain issues. Every single record matters, as it contributes to the measurement of determinands and informs proactive measures to ensure water safety.

‭Use-case depicted in this project‬

T‭ his project‬‭leverages‬‭"Aggregated Sensor Water Data"‬‭as the main data source to implement a‬
‭comprehensive Azure data pipeline. The pipeline is designed to ingest sensor data, process it, and‬
‭move it to the destination, utilizing various Azure services such as‬‭Azure SQL Database, Azure Logic‬
‭Apps, Azure Storage Accounts, Azure Data Factory, and Azure Databricks‬‭. The final transformed‬
data is visualized using Power BI, one of the most widely used Business Intelligence tools. The pipeline is structured to be modular and customizable, which allows additional Azure services to be seamlessly integrated to expand its functionality.

‭The‬‭Azure services chosen for this project are scalable,‬‭robust, user-friendly, and cost-effective.‬

The project uses the "Aggregated Sensor Water Data" dataset, which includes the following columns:

1. countryCode
‭2.‬ ‭monitoringSiteIdentifier‬
‭3.‬ ‭monitoringSiteIdentifierScheme‬
‭4.‬ ‭parameterWaterBodyCategory‬
‭5.‬ ‭observedPropertyDeterminandCode‬
‭6.‬ ‭observedPropertyDeterminandLabel‬
‭7.‬ ‭procedureAnalysedMatrix‬
‭8.‬ ‭resultUom‬
‭9.‬ ‭phenomenonTimeReferenceYear‬
‭10.‬‭parameterSamplingPeriod‬
‭11.‬‭procedureLOQValue‬
‭12.‬‭resultNumberOfSamples‬
‭13.‬‭resultQualityNumberOfSamplesBelowLOQ‬
‭14.‬‭resultQualityMinimumBelowLOQ‬
‭15.‬‭resultMinimumValue‬
‭16.‬‭resultQualityMeanBelowLOQ‬
‭17.‬‭resultMeanValue‬
‭18.‬‭resultQualityMaximumBelowLOQ‬
‭19.‬‭resultMaximumValue‬
‭20.‬‭resultQualityMedianBelowLOQ‬
‭21.‬‭resultMedianValue‬
‭22.‬‭resultStandardDeviationValue‬
‭23.‬‭procedureAnalyticalMethod‬
‭24.‬‭parameterSampleDepth‬
‭25.‬‭resultObservationStatus‬
‭26.‬‭remarks‬
‭27.‬‭metadata_versionId‬



28. metadata_beginLifeSpanVersion
‭29.‬‭metadata_statusCode‬
‭30.‬‭metadata_observationStatus‬
‭31.‬‭UID‬

The columns above represent the level of detail, or drill-down, of the parameters at which water readings were taken by the sensors.

F‭ or demonstration purposes, the data is initially loaded from a local source to an‬‭Azure Managed SQL‬
‭Database‬‭using‬‭SQL Server Management Studio‬‭. This‬‭application allows for seamless connection to‬
‭the Azure Cloud Database and enables the import of data from an Excel file (the sensor water data‬
‭file) into Azure. The Azure pipeline is designed to handle data of any size, minimizing latency and‬
‭ensuring that the processed data is quickly made available for analysis or any defined purpose.‬

The goal of this project is to address the water quality problem using an Azure pipeline made up of the following stages:

1. Raw Data: The raw data, which is available in an Excel sheet, is first loaded into the Azure Managed SQL Database using SQL Server Management Studio (SSMS). This makes the data available to the Azure pipeline and ensures it is ready for further processing.
2. Data Movement: After the raw data is loaded into the SQL database, it is moved to Azure Blob Storage using Logic Apps. This component automates the data transfer from the Azure SQL Database to Blob Storage in JSON format. Logic Apps provide a wide variety of actions to accomplish such tasks.
‭3.‬ ‭Orchestration:‬ ‭Once the data is in Blob Storage,‬‭it is transferred to Azure Data Lake Gen2 for‬
‭integration with Azure Databricks. This is achieved through Azure Data Factory, which includes‬
‭a copy pipeline to move data into the lake.‬
4. Processing Layer: The data then undergoes transformation in Azure Databricks, following the Medallion Architecture with three layers (a minimal code sketch follows this list):
● Bronze Layer: Raw data is ingested from Data Lake Gen2 into a DataFrame using Spark jobs.
● Silver Layer: The data is cleaned and refined for quality.
● Gold Layer: Final processing happens, and the transformed data is stored in a Databricks table.
‭5.‬ ‭Data Visualization‬‭: The processed data from the‬‭Gold‬‭Layer‬‭is then used as a source for‬
‭Power BI‬‭, which enables dashboarding and detailed‬‭analysis.‬
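
To make the three layers concrete, here is a minimal PySpark sketch of the Bronze/Silver/Gold flow described in step 4 above. It is illustrative only: the storage path, table names, and the specific cleaning and aggregation rules are placeholders, not the exact ones built later in this guide.

# Minimal Medallion sketch (illustrative names; run inside a Databricks notebook)
from pyspark.sql import functions as F

# Bronze: ingest the raw JSON exactly as it arrives in ADLS Gen2
bronze_df = spark.read.json(
    "abfss://<adls-container>@<storage-account>.dfs.core.windows.net/waterqualitydata.json"
)
bronze_df.write.mode("overwrite").saveAsTable("bronze_water_quality")

# Silver: clean and refine for quality (deduplicate, fix types, drop nulls)
silver_df = (
    spark.table("bronze_water_quality")
    .dropDuplicates(["UID"])
    .withColumn("resultMeanValue", F.col("resultMeanValue").cast("double"))
    .filter(F.col("resultMeanValue").isNotNull())
)
silver_df.write.mode("overwrite").saveAsTable("silver_water_quality")

# Gold: aggregate into an analysis-ready table that Power BI can query
gold_df = (
    silver_df.groupBy("countryCode", "observedPropertyDeterminandLabel")
    .agg(F.avg("resultMeanValue").alias("avgMeanValue"))
)
gold_df.write.mode("overwrite").saveAsTable("gold_water_quality")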



‭Architecture‬

The processed data from this pipeline is visualized using a business intelligence tool, Power BI.

The architecture of the project is structured to handle large volumes of historical sensor data.



‭Prerequisites‬
‭Breakdown:‬

‭●‬ ‭Setup required:‬


‭○‬ ‭Create an Azure account‬
‭○‬ ‭Download the dataset‬
‭○‬ ‭Download the Code‬
‭○‬ ‭This is an OS-independent project, as the whole pipeline is set up on Azure‬
‭Cloud‬
‭○‬ ‭SQL Server Management Studio and Access Database Engine‬‭for integration with‬
‭Azure Managed SQL database‬
‭●‬ ‭Ways to interact with the Azure services:‬
‭○‬ ‭Console- Using the Web-based UI‬
‭○‬ ‭CLI- Using the Azure CLI tool in Terminal/CMD‬
‭○‬ ‭For this project, we will be using the UI Console of Azure to work‬
‭with different Azure Services‬
‭●‬ ‭Best Practices for Azure Account‬
‭○‬ ‭Enable Multi-Factor Authentication (MFA)‬
‭○‬ ‭Protect the account by adding only authorized users/ groups to Resource Groups‬
○ Choose a region where the required services are available
‭○‬ ‭Create a Budget and set up an alert notification‬
‭○‬ ‭Keep track of Unused ongoing Services‬
‭○‬ ‭Delete Databricks Cluster after execution to cut down charges‬
‭○‬ ‭Rotate all keys and passwords periodically‬
‭○‬ ‭Follow the least privilege principle‬
○ Follow a naming convention for resources, ideally reflecting the project's purpose, for better clarity.



‭Services‬

‭Services used in the project:‬

‭Azure SQL Database:‬

The Azure SQL Database service is fully managed by Microsoft, which means it handles maintenance, backups, and updates, enabling users to focus on their applications rather than database management. Costs depend on usage, so it is important to monitor cost analysis regularly. In the context of this project, an Azure SQL Server was created first, followed by an Azure Managed SQL Database set up within that server.

S‭ QL Server Management Studio (SSMS) was used to establish a connection to the Azure SQL Server.‬
‭This connection was set up using credentials like Microsoft login, password, and SQL Server‬
‭authentication. Once connected, the large Excel file was imported into the Azure Managed SQL‬
‭Database.‬
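
The import in this guide is done through the SSMS Import Wizard, as shown in the Execution section. For readers who prefer a scripted route, a roughly equivalent load can be done from Python; the server, database, credentials, file name, and table name below are placeholders, not the exact values used later.

# Illustrative alternative to the SSMS Import Wizard (placeholder names and credentials)
import pandas as pd
from sqlalchemy import create_engine

# Read the local Excel file into a DataFrame
df = pd.read_excel("Waterbase_aggregated_sensor_data.xlsx")

# Connect to the Azure SQL Database (requires the Microsoft ODBC Driver 18 locally)
engine = create_engine(
    "mssql+pyodbc://<sql-user>:<password>@<server-name>.database.windows.net:1433/"
    "<database-name>?driver=ODBC+Driver+18+for+SQL+Server&Encrypt=yes"
)

# Write the rows to a table; this creates the table if it does not already exist
df.to_sql("WaterQualityData", engine, if_exists="replace", index=False, chunksize=1000)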

‭Advantages of Azure SQL Database:‬

‭●‬ I‭t supports and processes both relational data and non-relational structures, such as graphs,‬
‭JSON, spatial, and XML.‬
● It is managed by Microsoft, data is replicated across data centers, and 24x7 support teams are available. It also provides a high-performance data storage layer for applications and solutions, with high-speed connectivity.
‭●‬ ‭Azure SQL Managed Database takes care of database infrastructure, patches, and‬
‭maintenance automatically. This significantly reduces the overhead of managing hardware and‬
‭s oftware updates.‬
‭●‬ ‭We can scale the database's compute and storage resources dynamically based on workload‬
‭demands.‬
‭●‬ ‭With data being stored across different data centers managed by Microsoft, there is no loss of‬
‭data even in the case of outages or disasters‬



‭Disadvantages of Azure SQL Database:‬

● While Azure SQL Managed Database offers flexibility in pricing, it can become complex for users who are not familiar with the pricing structure. If you select the wrong geo region by mistake, it can incur more than standard charges.
‭●‬ ‭Since Azure SQL Managed Database is a fully managed service, you have limited control over‬
‭the underlying operating system and database settings compared to an on-premise SQL Server‬
‭●‬ ‭For workloads with very high transaction rates or extreme performance demands, Azure SQL‬
‭Managed Database might face limitations.‬
‭●‬ ‭Azure SQL Managed Database is tightly integrated with Azure, which could lead to vendor‬
‭lock-in.‬
● Azure SQL Managed Database is designed for relational data and SQL workloads, so it is not well suited to purely non-relational use cases.

‭Alternatives:‬

● Amazon RDS (Relational Database Service)
‭●‬ ‭Google Cloud SQL‬
‭●‬ ‭MySQL‬

‭Azure Logic App:‬

Whether it's managing data, coordinating processes, or connecting disparate systems, Azure Logic Apps offers a robust, user-friendly solution to streamline data operations. Its low-code, visual designer interface makes it easy to build and manage complex workflows without needing extensive programming skills. It has become an indispensable tool for enterprises to fetch a variety of data from different sources. Azure Logic Apps has more than 200 connectors, which makes it a strong choice for data integration and retrieval.

I‭n the context of this project, Azure Logic Apps was utilized to automate the process of fetching data‬
‭from an Azure Managed SQL Database and storing it in Azure Blob Storage. The workflow was set up‬
‭with an HTTP request trigger, allowing the Logic App to be invoked via a URL for versatile integration.‬

‭Advantages of Azure Logic App:‬

‭●‬ E‭ asy to Use:‬‭Azure Logic Apps provides a visual design‬‭interface that allows non-developers to‬
‭create workflows easily.‬



‭●‬ A ‭ PI and Custom Connectors:‬‭Logic Apps can also integrate with custom APIs, enabling you to‬
‭extend the system to meet unique integration needs.‬
● Elastic Scalability: Logic Apps scale automatically based on workflow demand; if your workflow requires higher processing power, additional resources are provisioned automatically.
‭●‬ ‭Environment Support‬‭: Logic Apps supports different‬‭environments (development, staging,‬
‭and production).‬
‭●‬ ‭Automate Repetitive Tasks‬‭: Logic Apps are ideal for‬‭automating business processes such as‬
‭approvals, data synchronization, notifications, and more.‬

‭Disadvantages of Azure Logic App:‬

● Difficulty in Managing Complex Workflows: As workflows grow in complexity, they can become harder to manage and maintain.
‭●‬ ‭Cost Predictability‬‭: Although Logic Apps follows a‬‭consumption-based pricing model, it can be‬
‭challenging to predict costs‬
‭●‬ ‭Limited Debugging Capabilities‬‭: While Logic Apps provides‬‭logging and monitoring,‬
‭debugging complex workflows or issues in real time can be challenging‬
● Complexity: As a project inevitably gets bigger, it becomes increasingly difficult to keep track of which branch of the workflow you are working on and where you left off, as the design can start to resemble a spider web.
‭●‬ ‭Limited Fine-Grained Control‬‭: While Logic Apps provides‬‭a wide range of triggers and‬
‭actions, it may not have complete control over execution sequences or the internal state of a‬
‭workflow.‬

‭Alternatives:‬

● AWS Step Functions
‭●‬ ‭Google Cloud Workflows‬
‭●‬ ‭Apache Airflow‬



‭Azure Storage Account:‬

Azure Storage Account is a fundamental service in Azure that provides highly scalable, durable, and secure cloud storage for a variety of data types. It serves as the container for storing data in different formats such as blobs, files, queues, and tables. There are several types of storage accounts in Azure, each tailored to specific needs, such as Blob Storage and Azure Data Lake Storage Gen2.

‭a)‬ ‭Blob Storage Account‬‭:‬


● Purpose: This account type, part of the Azure Storage Account service, is designed for storing large amounts of unstructured data, such as text or binary data. It is typically used for storing files like images, videos, documents, backups, and logs. In our project, the container is created inside the Blob Storage account to store the JSON data coming from the execution of Logic Apps.

b) Azure Data Lake Storage Gen2 (ADLS):


● Purpose: Built on top of Azure Blob Storage, Data Lake Storage Gen2 is optimized for big data analytics workloads and provides more advanced features for storing and managing data at scale. In our project, ADLS acts as the data source for Azure Databricks during the execution of Spark jobs.

‭Advantages of Azure Storage Account:‬

● With the ability to scale horizontally across both Blob Storage and Data Lake Storage Gen2, the system can handle increasing loads of unstructured raw data.
● Using Blob Storage for frequently accessed raw data and Data Lake Storage Gen2 for large-scale data processing ensures you pay only for the storage you need while maximizing cost efficiency.
‭●‬ ‭Both storage types provide a robust set of security features, but Data Lake Storage Gen2 offers‬
‭more granular control over who can access specific parts of data.‬
‭●‬ ‭By using both storage accounts, it is ensured that raw data in Blob Storage is highly available‬
‭and resilient to failures‬
‭●‬ ‭Raw JSON data stored in Blob Storage can be easily moved to Data Lake Storage Gen2, where‬
‭it can be processed using Spark jobs, machine learning models‬



‭Disadvantages of Azure Storage Account:‬

‭●‬ U ‭ sing Blob Storage for raw data storage and then moving data to Data Lake Gen2 can‬
‭introduce performance overheads, especially with frequent file writes and reads‬
‭●‬ ‭Both Blob Storage and Data Lake Storage Gen2 require external tools to perform data‬
‭transformations or processing tasks.‬
‭●‬ ‭Neither Blob Storage nor Data Lake Storage Gen2 provides native querying capabilities‬
‭s ufficient for advanced analytics and querying directly from the storage.‬
‭●‬ ‭Depending on the volume and frequency of data transfers, extra costs can add up.‬
‭●‬ ‭Both‬‭Blob Storage‬‭and‬‭Data Lake Storage Gen2‬‭require‬‭external tools to handle advanced‬
‭data transformations, processing, and analytics‬
‭●‬ ‭There is no snapshot mechanism or automated backup for Azure Files.‬

‭Alternatives:‬

● Amazon S3 (Simple Storage Service)
‭●‬ ‭Google Cloud Storage‬
‭●‬ ‭IBM Cloud Object Storage‬

‭Azure Data Factory(ADF):‬

Azure Data Factory (ADF) is a cloud-based data integration service provided by Microsoft Azure. It allows you to create, schedule, and orchestrate data workflows. ADF helps move and transform data from one location to another, using a visual interface or a code-based approach. In this project, Azure Data Factory (ADF) is used to orchestrate the movement of raw data stored in Azure Blob Storage to Azure Data Lake Storage Gen2 (ADLS). ADF's Copy pipeline is used to transfer the JSON data in Blob Storage to ADLS.

‭Advantages of Azure Data Factory:‬

‭●‬ S‭ calability:‬‭ADF is highly scalable, making it suitable‬‭for both small and large-scale data‬
‭integration projects.‬
‭●‬ ‭Integration with Azure Services‬‭: ADF integrates seamlessly‬‭with other Azure services, such‬
‭as Azure Synapse Analytics, Azure Data Lake Storage, and Azure Machine Learning.‬



‭●‬ C ‭ ost-Effective‬‭: ADF offers a pay-as-you-go pricing‬‭model, which means you only pay for the‬
‭resources you use.‬
‭●‬ ‭Security‬‭: ADF provides robust security features, including‬‭data encryption, network isolation‬
‭●‬ ‭Ease of Use‬‭: The intuitive visual interface of ADF‬‭allows users to create and manage data‬
‭pipelines without extensive coding knowledge.‬

‭Disadvantages of Azure Data Factory:‬

‭●‬ L‭ imited Transformation Options:‬‭While ADF is great‬‭for data movement, its data‬
‭transformation capabilities are less flexible compared to other services like Azure Databricks‬
‭or SQL-based solutions.‬
‭●‬ ‭Complex Error Handling‬‭: While ADF has error-handling‬‭mechanisms, they can be cumbersome‬
‭to configure.‬
‭●‬ ‭Data Movement Restrictions‬‭: Although ADF supports‬‭various data sources and destinations,‬
‭it may not always offer the same level of integration as other services‬
‭●‬ ‭User Interface Limitations‬‭: The ADF user interface‬‭can sometimes be unintuitive, especially‬
‭for users unfamiliar with the platform.‬
‭●‬ ‭Limited Customization for Transformations‬‭: ADF’s built-in‬‭transformation capabilities are‬
‭s omewhat limited, especially for complex data transformations.‬

‭Alternatives‬‭:‬

● AWS Glue
‭●‬ ‭dbt‬
‭●‬ ‭Apache NiFi‬



‭Azure Databricks:‬

Azure Databricks, developed in collaboration with Microsoft, is a managed version of Databricks that allows Azure customers to set up with a single click, streamline workflows, and access shared collaborative interactive workspaces. It facilitates rapid collaboration among data scientists, data engineers, and business analysts through the Databricks platform. Azure Databricks is closely integrated with Azure storage and compute resources, such as Azure Blob Storage, Data Lake Store, SQL Data Warehouse, and HDInsight.

‭Advantages of Azure Databricks:‬

● While Azure Databricks is Spark-based, it allows commonly used programming languages like Python, R, and SQL to be used. These languages are converted in the backend through APIs to interact with Spark.
‭●‬ ‭Deploying‬‭work‬‭from‬‭Notebooks‬‭into‬‭production‬‭can‬‭be‬‭done‬‭almost‬‭instantly‬‭by‬‭just‬‭tweaking‬
‭the data sources and output directories.‬
‭●‬ ‭Aside‬ ‭from‬ ‭those‬ ‭Azure-based‬ ‭s ources‬ ‭mentioned,‬ ‭Databricks‬ ‭easily‬ ‭connects‬ ‭to‬ ‭s ources‬
‭including on-premise SQL servers, CSVs, and JSONs.‬
‭●‬ ‭While‬ ‭Azure‬ ‭Databricks‬ ‭is‬ ‭ideal‬ ‭for‬ ‭massive‬ ‭jobs,‬ ‭it‬ ‭can‬ ‭also‬ ‭be‬‭used‬‭for‬‭s maller-scale‬‭jobs,‬
‭deployment, and development/ testing work.‬
‭●‬ ‭Databricks‬‭notebooks‬‭allow‬‭for‬‭real-time‬‭collaboration‬‭among‬‭data‬‭engineers,‬‭data‬‭s cientists,‬
‭and analysts using version control.‬

‭Disadvantages of Azure Databricks:‬

‭●‬ D ‭ espite its detailed documentation and intention to simplify data processing, many users find‬
‭Databricks’ lakehouse platform daunting to master.‬
‭●‬ ‭Commands issued in non-JVM languages need extra transformations to run on a JVM process.‬
‭●‬ ‭While it offers a secure, collaborative environment with different services and integrations,‬
‭these enhancements come at a cost.‬
‭●‬ ‭Due to its cloud-native architecture, certain workloads might experience performance‬
‭overhead.‬
‭●‬ ‭While Databricks integrates well with Azure storage options, its native storage capabilities are‬
‭not as extensive‬



‭Alternatives:‬

● Amazon EMR (Elastic MapReduce)
‭●‬ ‭Google Cloud Dataproc‬
‭●‬ ‭Apache Spark on Kubernetes‬

‭Power BI:‬

Microsoft Power BI is a business intelligence tool that helps users analyze and visualize data to make informed decisions. It allows users to connect to a variety of data sources, transform the data into meaningful visualizations, and create dashboards that support data-driven decisions. Power BI is widely used for its ease of use, interactivity, and ability to handle large datasets from multiple sources in real time.

‭Advantages of Power BI:‬

‭●‬ E‭ ase of Use‬‭: Power BI is user-friendly with a drag-and-drop‬‭interface, making it accessible to‬
‭both technical and non-technical users.‬
‭●‬ ‭Cost-Effective‬‭: Power BI offers a free version with‬‭core features and affordable pricing for the‬
‭Pro and Premium versions, making it accessible for businesses of all sizes.‬
‭●‬ ‭Integration with Multiple Data Sources‬‭: Power BI can‬‭connect to a wide variety of data‬
‭s ources including databases, cloud storage, spreadsheets, and even live data streams.‬
‭●‬ ‭Real-Time Data Access‬‭: It supports real-time data‬‭refreshes, enabling users to work with the‬
‭latest data and make timely decisions.‬
‭●‬ ‭Customizable Dashboards‬‭: Users can create personalized‬‭dashboards and interactive reports,‬
‭giving flexibility in visualizing data the way they want.‬



‭Disadvantages of Power BI:‬

‭●‬ L‭ imited Data Capacity (Free Version)‬‭: Restrictions‬‭on data storage and sharing in the free‬
‭version.‬
‭●‬ ‭Performance with Large Datasets‬‭: Can slow down with‬‭very large datasets.‬
‭●‬ ‭Steep Learning Curve‬‭: Advanced features and customizations‬‭can be difficult to learn.‬
‭●‬ ‭Limited Customization for Visualizations‬‭: Less flexibility‬‭compared to other BI tools.‬
‭●‬ ‭Data Refresh Limits‬‭: Limited refresh frequency, especially‬‭in the free version.‬

‭Alternatives:‬

● Tableau
‭●‬ ‭Qlik Sense‬
‭●‬ ‭Google Data Studio‬



‭Execution:‬

‭1.‬ ‭Creating Azure Resource Group‬

A Resource Group is created in Azure to manage all Azure services effectively in one place. It is easier to monitor, update, or delete all Azure resources together as a single unit.
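
The steps below use the Azure portal. For reference only, the same resource group could be created with the Azure SDK for Python; the subscription ID, group name, region, and tag values here are placeholders.

# Illustrative equivalent of steps a) to e) below, using the Azure SDK for Python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

credential = DefaultAzureCredential()      # picks up your Azure CLI / environment login
client = ResourceManagementClient(credential, "<your-subscription-id>")

# Create (or update) the resource group with a developer tag, mirroring the Tags step
client.resource_groups.create_or_update(
    "water-quality-rg",
    {"location": "eastus", "tags": {"developer": "<your-name>"}},
)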

‭a)‬ ‭In the search bar, enter‬‭"Resource Groups"‬‭and select‬‭it from the suggestions.‬

‭b)‬ ‭In the top-left corner, click on‬‭Create‬‭.‬



c) In the Resource Group Name field, enter the desired name for your resource group. Then, click on Next: Tags.

‭d)‬ U‭ nder‬ ‭Tags‬‭,‬ ‭in‬ ‭the‬ ‭Name‬ ‭field,‬ ‭enter‬ ‭s omething‬ ‭like‬‭developer.‬‭In‬‭the‬‭Value‬‭field,‬‭enter‬‭your‬
‭name.‬



‭e)‬ ‭Then Click on the Create option and a resource group will be created.‬



f) Now, the resource group will be used as a central location to create the Azure services in this project.

‭2.‬ ‭Creating Azure SQL Database‬

a‭ ‬‭)‬ ‭Click‬ ‭on‬ ‭the‬ ‭Resource‬ ‭Group‬ ‭you‬ ‭just‬ ‭created,‬ ‭then‬ ‭s elect‬ ‭the‬ ‭“Create”‬ ‭option‬ ‭within‬ ‭the‬
‭resource group.‬



‭b) In the search bar, enter‬‭"Azure SQL Database"‬‭and‬‭s elect it from the suggestions.‬

‭c) In the top-left corner, click on‬‭Create‬‭.‬



d) In the Resource Group field, click on the drop-down menu and select the resource group you just created.

‭e) In the‬‭Database‬‭field, enter the name of the database‬‭as you want.‬



‭f) Now, click on‬‭New‬‭in the‬‭Server‬‭field to create‬‭a new server for your Azure SQL database.‬

‭g) In the‬‭Server Details‬‭page, under the‬‭Server Name‬‭field, enter the desired name for your server.‬



h) Select an available location in the Location field; if you encounter errors, choose a different location from the dropdown.



i‭)‬‭In‬‭the‬‭Authentication‬‭Method‬‭field,‬‭s elect‬‭“‬‭Use‬‭both‬‭SQL‬‭and‬‭Microsoft‬‭Entra‬‭Authentication”.‬
‭The reason behind this is to give access to both SQL authentication and MS-authorized users.‬

‭j) Select‬‭"Set admin"‬‭to grant the Azure authorized‬‭user administrative access to the SQL Server.‬



‭k) In the‬‭Microsoft Entra ID‬‭tab, select your user in Azure, then click‬‭Select‬‭.‬

l‭)‬ ‭In‬ ‭the‬ ‭Username‬ ‭field,‬ ‭enter‬ ‭the‬ ‭username‬ ‭you‬ ‭created‬ ‭for‬ ‭your‬‭SQL‬‭Server,‬‭and‬‭in‬‭the‬‭Password‬
‭field, enter the corresponding password.‬



m) Save the username and password for your SQL Server in a secure location, such as a Notepad file.

‭n) After specifying your SQL Server username and password, click on‬‭"OK"‬‭.‬



‭o) Now, the SQL Server you created will be automatically specified in the server field.‬

p) Now, set the Workload to Development, as the project does not include production-level workloads.



q) From Backup Storage Redundancy, choose Locally Redundant Backup Storage, since it does not incur high costs.

‭r) Then, click on‬‭Next: Networking‬



s‭ )‬ ‭Under‬ ‭the‬ ‭Connectivity‬ ‭Method‬‭,‬ ‭choose‬ ‭P ublic‬ ‭Endpoint‬‭,‬ ‭as‬ ‭it‬ ‭will‬ ‭allow‬ ‭your‬ ‭Azure‬ ‭SQL‬
‭Database to connect to your local environment.‬

t) In the Firewall Rules, select both options as "Yes" to allow access to your Azure SQL Server from both Azure services and specific IP addresses.



u) Keep the connection policy and encrypted connections at their defaults, and then click on "Review + Create".

v‭ )‬‭After‬‭that,‬‭validation‬‭may‬‭take‬‭s ome‬‭time.‬‭Once‬‭complete,‬‭your‬‭Azure‬‭SQL‬‭Server‬‭and‬‭database‬‭will‬
‭be successfully created.‬



w) After deployment of the Azure SQL Database and Azure SQL Server is complete, click on "Go to Resource".

‭x) The Azure SQL Server and Azure SQL Database have now been successfully created.‬



‭3. Importing Data in Azure SQL Database‬

‭a)‬‭To‬‭import‬‭data‬‭from‬‭an‬‭Excel‬‭file‬‭into‬‭an‬‭Azure‬‭SQL‬‭Database,‬‭you‬‭need‬‭the‬‭SQL‬‭Server‬‭Management‬
‭Studio (SSMS) application.‬

b) Navigate to your browser, type "Download SQL Server Management Studio" in the search bar, and click on the first website that appears.



c) Once the application is downloaded, install and set it up on your local system. Also make sure the Access Database Engine is downloaded on your local system; it can be downloaded from the browser.



d) Open the SSMS application, and under the Server Name field, specify the name of your Azure SQL Server that you created.



e) From the dropdown of the Authentication Method, choose SQL Server Authentication. It helps to access the Azure SQL Database, especially when using non-Azure-based tools like SSMS.

f‭ )‬‭In‬‭the‬‭Login‬‭and‬‭Password‬‭fields,‬‭s pecify‬‭the‬‭username‬‭and‬‭password‬‭of‬‭your‬‭Azure‬‭SQL‬‭Server‬‭that‬
‭you saved earlier. After that, Click on‬‭Connect.‬



g) In the left-side panel, you will see the Azure SQL Database and the server listed there.

‭h) Expand‬‭Databases‬‭by clicking on the‬‭+‬‭icon, and‬‭then your Azure SQL Database will appear there.‬



i‭)‬ ‭To‬ ‭import‬ ‭data‬ ‭from‬ ‭a‬ ‭local‬ ‭Excel‬‭file‬‭to‬‭Azure‬‭SQL‬‭Database,‬‭right-click‬‭on‬‭the‬‭database,‬‭click‬‭on‬
‭Tasks‬‭, and then select‬‭Import Data‬‭.‬

j‭)‬‭Under‬‭the‬‭Choose‬‭a‬‭Data‬‭Source‬‭panel,‬‭s elect‬‭Microsoft‬‭Excel‬‭from‬‭the‬‭dropdown‬‭list‬‭as‬‭the‬‭data‬
‭s ource.‬



‭k) Then, click on‬‭Browse‬‭to select the Excel file available on your local system, and choose your file.‬

After selecting the file, click on Open.



‭l) Click on next after selecting the data source‬

m) From the Choose a Destination panel, scroll down and select Microsoft OLE DB Provider for SQL Server under Destination, as the destination is your Azure SQL Database where you need to load the data.



‭n) In the‬‭Server Name‬‭field, specify the name of your‬‭Azure SQL Server.‬

o) In the Authentication Method, choose Use SQL Server Authentication, and in the Username and Password fields, specify the login and password you created for your SQL Server. Then click on "Next".



‭p) Choose the first option “Copy Data from one or more tables” and then click on Next.‬

q) Click on the Destination and change the name of the destination, which is the table in the Azure SQL Database. You can name it according to your preference.



r) Once the name of the table is changed, click on "Edit Mappings" to change the character length of the columns in the dataset.

s‭ )‬ ‭Change‬ ‭the‬ ‭mappings‬ ‭of‬ ‭the‬ ‭last‬ ‭columns‬ ‭from‬ ‭255‬ ‭to‬ ‭1200‬‭.‬ ‭This‬ ‭allows‬ ‭for‬ ‭a‬ ‭larger‬ ‭character‬
‭length, accommodating data that exceeds the default limit of 255 characters.‬



‭t) Click on 255 and then type 1200. Once the mapping is changed, then click on “OK”‬

u) Then, click on Next twice to proceed through the two configuration steps, and finally click on Finish to complete the import process.



v‭ )‬ ‭Once‬ ‭you‬ ‭click‬ ‭on‬ ‭“Finish”,‬ ‭the‬ ‭process‬ ‭of‬ ‭loading‬ ‭data‬ ‭from‬ ‭the‬ ‭Excel‬ ‭file‬ ‭to‬ ‭the‬ ‭Azure‬ ‭SQL‬
‭database gets started.‬



w) Following this, all rows have been successfully uploaded to the Azure SQL Database, and the result will show as Succeeded.

x) To check whether the data has been uploaded to the table in the Azure SQL Database, a simple SQL query can be executed on top of the table.
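
If you prefer to verify from code rather than the SSMS query window, a quick row-count check might look like the sketch below; the server, database, and credentials are placeholders for the ones you created, and the table name matches the one used later in this guide.

# Illustrative row-count check against the imported table (placeholder connection details)
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=tcp:<server-name>.database.windows.net,1433;"
    "DATABASE=<database-name>;UID=<sql-user>;PWD=<password>;Encrypt=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM [dbo].[Waterqualitydata]")
print("Rows imported:", cursor.fetchone()[0])
conn.close()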



‭4.‬‭Creating Azure Blob Storage‬

a) In the search bar, type Azure Storage Accounts and select the Storage Accounts option from the suggestions.



‭b) Click on‬‭Create‬‭to begin the process of creating‬‭a new storage account.‬

c‭ )‬‭In‬‭the‬‭Create‬‭a‬‭Storage‬‭Account‬‭panel,‬‭under‬‭the‬‭Resource‬‭Group‬‭s ection,‬‭click‬‭on‬‭the‬‭dropdown‬
‭and select the resource group you just created.‬



d) Specify the name of your storage account as desired. Ensure the name is unique, as storage account names must be globally unique.

e) Under Performance, specify Standard, as the workload is relatively low and does not require premium performance.



‭f) Click on the‬‭“Next”‬‭two times for the next two‬‭pages and then Click on‬‭“Review + Create”‬

‭g) Then click on‬‭“Create”‬‭to create the Azure Blob‬‭Storage account.‬



‭h) Validation might take some time. After the validation is complete, click on‬‭Go to the resource‬‭.‬

i‭) Now, on the‬‭Blob Storage Account‬‭interface, in‬‭the left-side panel, click on‬‭Data storage‬‭. Under‬
‭Data storage‬‭, click on‬‭Containers‬‭to create a container‬‭for Azure Blob Storage.‬



j) Click on + Container to create a container. The purpose of creating a container is to store the data files (JSON) and log files inside the Blob container.

k‭ ) In the right-side pop-up panel, specify the name of the container as per your need and then click on‬
‭“Create”‬‭. Make sure the name is unique within the‬‭s torage account.‬



‭l) Click on the container that you created to open it and manage its contents.‬

‭5.‬ ‭Creating Azure Logic App‬

a‭ )‬ ‭In‬ ‭the‬ ‭s earch‬ ‭bar,‬ ‭type‬ ‭Logic‬ ‭Apps‬ ‭and‬ ‭s elect‬ ‭Logic‬ ‭Apps‬ ‭from‬ ‭the‬ ‭s uggestions.‬‭The‬‭purpose‬‭of‬
‭Logic‬ ‭Apps‬ ‭is‬ ‭to‬ ‭create‬ ‭a‬‭workflow‬‭that‬‭fetches‬‭data‬‭from‬‭the‬‭Azure‬‭SQL‬‭Database‬‭and‬‭loads‬‭it‬‭into‬
‭Azure Blob Storage.‬



‭b) Click on‬‭Add‬‭to create a new Logic App.‬

c‭ ) Under the‬‭Standard‬‭plan, click on‬‭Workflow service‬‭plan‬‭to create the Logic App with lower‬
‭charges, as you're handling a lighter workload. Then click on “select”‬



d) In the Creating Logic App interface, under the Resource Group section, click on the dropdown and select the resource group you created.

e‭ ) In the‬‭Logic App Name‬‭field, specify the name‬‭of the Logic App as desired. Ensure the name is‬
‭unique. Then click on NEXT: Storage‬



‭f) In the‬‭Storage Account‬‭s ection, select the storage‬‭account you created from the dropdown.‬

By specifying this account, you are ensuring that all logs and artifacts generated by the Logic App will be stored in the selected storage account.

‭g) Then click on‬‭Next: networking‬



‭h)‬‭Enable public access for the Logic App to allow‬‭external access. Then click on‬‭“Review+ Create”‬

‭i) Validation might take some time. Once complete, click on‬‭Go to resource.‬



j‭)‬‭In the Logic App interface, in the left-side panel,‬‭click on‬‭Workflows‬‭. Then, under Workflows, click‬
‭on the‬‭workflow.‬

T‭ his will allow you to create the workflow to load data from the Azure SQL Database to Azure Blob‬
‭Storage.‬

‭k)‬‭Then, click on + Add, and then click Add again‬‭to create the workflow.‬



l‭)‬‭Now, in the right-side panel, specify the name‬‭of the‬‭Logic App‬‭. Also, set the workflow to‬‭Stateful‬‭.‬
‭Then click on “Create” and this will create the workflow of Logic App.‬

T‭ he reason for selecting Stateful is that it provides better efficiency when uploading large data, as it‬
‭retains the state across workflow runs.‬



‭m) Then click on the Workflow you just created.‬

n) In the workflow panel, on the left-hand side, click on Designer. This will allow you to create the workflow using drag-and-drop actions from the available options.



o) This is the interface of the workflow using the Designer, where actions and triggers will be dragged and dropped for the execution.

‭→ Creating Logic App user in Azure SQL Database‬

T‭ o fetch data from Azure SQL Database and load it into Azure Blob Storage, Azure Logic Apps needs‬
‭access to the Azure SQL Database. A connection must be established between the Logic App and the‬
‭Azure SQL Database. The mechanism involves‬‭associating‬‭a user with the Logic App‬‭that has the‬
‭necessary permissions to access the database. To achieve this,‬‭a user‬‭is created within the Azure SQL‬
‭Database using a SQL query. Then, the Logic App user is granted access to the Azure SQL Database to‬
‭allow the workflow to execute.‬

p) Open a new tab in your browser and navigate to the Azure portal. Then, access your Azure SQL Database by searching for it in the portal.



q) Click on the Azure SQL Database you created.

‭r)‬‭Now, click on‬‭Query Editor.‬‭You will have two options‬‭to access your Azure SQL Database:‬

‭‬ U
● ‭ se the username and password you created for the SQL Server.‬
‭●‬ ‭Log in directly using your Azure email ID.‬



s‭ )‬‭Click on‬‭Tables‬‭, and you will see the table that‬‭was populated using SSMS (SQL Server Management‬
‭Studio) displayed here.‬

t‭ ) First Query‬
‭First,‬ ‭Run the SQL query to create a user for the‬‭Logic App and enable it within your Azure‬
‭environment.‬

Breakdown of query: The query CREATE USER [waterQualityLA1] FROM EXTERNAL PROVIDER is used to create a user in Azure SQL Database that is linked to an external authentication provider for the Logic App.

‭→ CREATE USER [waterQualityLA1] FROM EXTERNAL PROVIDER‬

CREATE USER: This command is used to create a new user in the Azure SQL Database.
waterQualityLA1: Name of the Logic App (replace with the name of the Logic App you created).
FROM EXTERNAL PROVIDER: This indicates that the user is being authenticated by an external provider (such as Azure Active Directory), rather than using traditional SQL authentication (username/password). In this case, it's the Logic App user being created through Azure AD.



‭Second Query:‬

‭Run this query‬ ‭in the editor:‬

‭→ ALTER ROLE db_datareader ADD MEMBER [WaterQualityLA1]‬

Breakdown of query: This query allows the Logic App to read data from the Azure SQL Database. Since the Logic App only fetches data and moves it to Azure Blob Storage, it requires read permissions only, which are granted through the db_datareader role. The Logic App user [WaterQualityLA1] was added as a member of this role to read the data from the Azure SQL Database.

‭ALTER ROLE‬‭: Modifies the membership of a database‬‭role.‬

‭db_datareader Role‬‭: A built-in role granting read-only‬‭access to all tables and views in the database.‬

‭ADD MEMBER‬‭: Adds a user or group to the specified‬‭role.‬

[WaterQualityLA1]: The Logic App's Azure AD user, added to the db_datareader role for read-only access to the database.



‭6. Creating Ingestion Workflow in Azure Logic App‬

I‭n this segment, a workflow will be created in the Logic App. Actions and triggers will be dragged and‬
‭dropped to design a comprehensive workflow. This workflow will fetch data from the Azure SQL‬
‭Database and store it in the Azure Blob Storage container (in JSON format) you created earlier.‬

a‭ ) Return to the tab where the Logic App workflow was created or navigate to the Logic App in the‬
‭Azure portal, and click on the workflow you previously created.‬



‭b) Then click on‬‭“Designer”‬‭from the left side panel.‬

‭c) The first step is to create an HTTP trigger in the workflow. For that Click on the “trigger” icon‬

T‭ his will generate an API-like URL that can be used to call the workflow on demand, reducing the need‬
‭for manual intervention.‬
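
Once the workflow is saved, the trigger exposes a callback URL in its settings. For illustration only, invoking that URL from Python could look like the following; the URL shown is a made-up placeholder, and the real one must be copied from the trigger.

# Illustrative on-demand invocation of the Logic App HTTP trigger (placeholder URL)
import requests

trigger_url = "https://<your-logic-app>.azurewebsites.net/api/<workflow-name>/triggers/<trigger-name>/invoke?..."

response = requests.post(trigger_url, json={})  # this project's workflow does not read the body
print(response.status_code)                     # 200/202 means the run was accepted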



d) In the search bar of the Logic App Designer, type "When an HTTP request is received" and select it from the options.

‭e) Then click on + icon to add another action “Execute query” in the workflow.‬



‭f) Type‬‭“Execute Query action”‬‭and then click on action‬‭“Execute Query”‬‭under SQL server.‬

‭The Execute query action is added in the workflow to query the entire data from Azure SQL Database.‬

g‭ ) Once the‬‭Execute query‬‭action is added, you will‬‭need to specify the configurations of Azure SQL‬
‭database in execute query field.‬

‭For that, click on‬‭change connection.‬



h) Now, in the connection details, the following configurations need to be specified:

1. Connection name: Give a name to the connection as per your preference.
2. Authentication type: From the dropdown, choose "Managed identity".
3. Managed Identity Type: From the dropdown, choose "System assigned".



4. Server name: In this field, specify the name of the Azure SQL Server you created.
5. Database name: In this field, specify the name of the Azure SQL Database you created.

The names of the Azure SQL Server and Azure SQL Database can also be copied by navigating to the Azure SQL Database service and then clicking on the Azure SQL Server you created.



6. Once the connection details are specified, click on Create New, and this will successfully establish the connection to the Azure SQL Database using this action.

i‭) After specifying the connection details, the‬‭SQL‬‭query‬‭prompt page opens. Here, a‬‭SQL query will‬
‭be executed‬‭that retrieves all the data from the Azure‬‭SQL Database.‬

‭Query:‬‭Select * from [dbo].[Waterqualitydata]‬

‭→ Breakdown of Query:‬

‭SELECT *‬‭:‬

‭●‬ ‭Retrieves all columns from the table.‬

‭FROM [dbo].[Waterqualitydata]‬‭:‬

● [dbo]: dbo (Database Owner) is the default schema in SQL Server and Azure SQL Database.
‭●‬ ‭[Waterqualitydata]‬‭: The table name from which data‬‭is fetched.‬



After writing the query here, just click on the + icon to add another action after the "Execute Query" action.



j‭) In the search bar, type‬‭“blob”‬‭and click on the‬‭“see more”‬‭option corresponding to‬‭the Azure Blob‬
‭Storage‬‭action.‬

‭k) Then scroll down, until you find this action‬‭“Upload‬‭blob to storage container”. Select this action.‬

T‭ he‬‭"Upload Blob to Storage"‬‭action in Azure Logic‬‭Apps is used to take data retrieved from an Azure‬
‭SQL Database using the Execute Query action. Once the data is fetched, it is converted into JSON‬
‭format. This JSON data‬‭(blob)‬‭is then uploaded to‬‭an Azure Blob Storage container.‬



l) After that, click on the "Upload blob to storage container" action. The right panel asks for details, but first we need to establish the connection with the blob storage container that we created.

Click on "Change Connection" in the bottom-right corner and add a new connection.

‭m) Then, specify the details as mentioned on the connection page.‬

1. Connection name: Give a name as per your preference.
2. Authentication type: From the dropdown, select Storage account connection string.

(‭ To establish a connection with Azure Blob Storage and upload the data into the Blob container, use‬
‭Azure Blob Storage Access Keys or a Connection String for authentication)‬



‭3.‬ ‭Storage Account Connection string:‬‭Here, specify your‬‭Blob Storage account access key.‬

‭→ Here are the steps to get the Azure Blob Storage account access key:‬

1. Open another tab in the browser and, in the Azure portal, go to your Storage Account.
2. In the left-hand panel, under Security + networking, click on Access keys.
3. Click on Show for the connection string and copy it.



4. Navigate back to the tab where the Azure Logic App is open and paste the connection string into the Storage Account Connection String field.

‭Then click on‬‭“Create New”‬‭and it will create a connection‬‭with the Azure Blob Storage.‬

‭n) Now, once the connection is created. specify the following:‬

1. Container Name: In this field, specify the name of the blob container you created earlier.
2. Blob Name: In this field, specify the name of the blob/file that will be loaded to the Blob Container. This is simply the name you give to the file that is going to be loaded into the blob container (in JSON format).



3. In the Content field, choose the dynamic content icon; it indicates that this upload blob to container action will store the results of the "Execute Query" action in the Blob container in JSON format.

T‭ hen click on the‬‭Result item‬‭(as blob action stores‬‭the output of the execute query action and‬
‭uploads the results to the blob container in JSON form).‬



Note: As soon as the result item is selected, a "For each" loop action gets added around the "Upload blob to storage container" action. It means the workflow will loop through each result item and upload it to the blob container in JSON format.

‭o) Now, click on‬‭Save‬‭option to save the workflow‬



p) After the workflow is saved successfully, click on the Overview option in the left-side panel to get back to the main interface of your workflow.

q) Click on the Run option, then click on Run again. This will trigger the workflow and start its execution.



r‭ ) Wait for a few seconds, then click on‬‭trigger id‬‭to check the results. If the action is marked‬‭with‬‭a‬
‭green check‬‭, it means the workflow executed successfully‬‭without errors.‬



s‭ ) Then navigate to the‬‭Blob container‬‭you created‬‭and check whether the JSON file is uploaded to‬
‭the Blob Storage container or not.‬



S‭ ince the JSON file is in the Blob Storage container, it will now be copied to the ADLS container using‬
‭the Copy Pipeline in Azure Data Factory (ADF). To do this, first create an ADLS storage account and an‬
‭ADLS container.‬

‭7. Creating Azure Data Lake Storage Account (ADLS Gen2) and its container‬

a‭ ) Navigate to the Azure home portal, type‬‭Azure Storage‬‭Accounts‬‭in the search bar, and select‬
‭Storage Accounts‬‭from the suggestions.‬

‭b) Click on Create‬



c) In the Resource Group field, select the resource group you created from the dropdown. Also specify the name of your ADLS storage account as you want; it must be unique.

d) In the Primary Service field, from the dropdown, choose the "Azure Blob Storage or Azure Data Lake Storage" option.



e‭ ) Click on next and in the advanced tab, enable‬‭hierarchical‬‭namespace‬‭to create Azure Data Lake‬
‭Storage Gen2.‬

f‭ ) Then click on “‬‭Review+Create” and‬‭then click on‬‭“create” to create ADLS Gen2 Account‬‭and then‬
‭wait until the validation is done.‬

‭g) Once the deployment of ADLS account is done, click on‬‭“Go to Resource”.‬



‭h) In the left-side panel, click on‬‭Data Storage‬‭,‬‭and under‬‭Storage‬‭, click on‬‭Containers‬‭.‬

i‭) Click on‬‭+Container‬‭. In the right-side panel, specify‬‭the name of the ADLS container as desired,‬
‭then click on‬‭Create‬‭.‬



j‭) Navigate to the‬‭ADLS Storage Account (not storage‬‭container)‬‭and in the left-side panel of the‬
‭ADLS Storage Account, scroll down and click on‬‭Data‬‭Management‬‭. Under the‬‭Data Management‬
‭option, click on‬‭Data Protection‬‭.‬

‭Once you find the‬‭Data Protection‬‭option, in the right-hand‬‭s ide panel, uncheck the two options:‬

1. Enable soft delete for blobs
2. Enable soft delete for containers

‭This will disable soft delete for both blobs and containers in the ADLS Storage Account.‬

The reason for disabling these options is to get more control over when we want to delete the data present in the Storage Account and container.



‭8. Creating Azure Data Factory‬

T‭ he copy pipeline will be created in Azure Data Factory (ADF) to orchestrate the movement of the JSON file‬
‭from the Azure Blob Storage container to the ADLS container. ADF will be responsible for initiating and‬
‭managing the data movement in this project.‬

a‭ ) Navigate back to the Azure portal and in the search bar, type‬‭Azure Data Factory‬‭. Select‬‭Data‬
‭Factories‬‭from the suggestions.‬



‭b) Click on‬‭+Create‬

‭c) In the‬‭Resource Group‬‭field, select the resource‬‭group you created from the dropdown.‬

‭Also, specify the name of the Azure Data Factory as you want.‬



‭d) Then click on‬‭Next and in the Networking tab,‬‭make‬‭s ure the public endpoint is selected.‬

e‭ ) Click‬‭Next‬‭two times, then click‬‭Review + Create‬‭.‬‭The validation of deployment might take some‬
‭time. Once the deployment is successful, click‬‭Go‬‭to Resource‬‭.‬



‭f) Click on‬‭Launch Studio‬‭to open the Azure Data Factory‬‭interface.‬

g) Click on New and then click on the + Pipeline option.



h) In the Activities section, click on the Move & Transform option, then click on the Copy Data activity and drag it to the white canvas on the right side.

i‭) Click on the‬‭Copy Pipeline‬‭on the white canvas,‬‭then in the‬‭General‬‭s ection, specify the name of‬
‭the pipeline as desired.‬



‭j) Click on the‬‭Source‬‭s ection, then click on‬‭New‬‭to set the new source.‬

‭In this project, the source is the‬‭Blob Storage container‬‭containing the JSON data file.‬

k‭ )In the right-side panel, under the‬‭Select a data‬‭source‬‭field, type‬‭Blob‬‭and select the‬‭Azure Blob‬
‭Storage‬‭option and click on‬‭continue‬



‭l) From the select format option, click on‬‭JSON Format‬‭and click on‬‭Continue.‬

‭Since the data file in the Azure Blob Storage container is in JSON format.‬

m) Give a name to the source and then click on + New to create a linked service for the Azure Blob Storage container.



n) In the New Linked Service window, give any name to the linked service and keep the default settings. Then, from the Storage Accounts dropdown, select the Azure Blob Storage account you created.

‭Then click on‬‭“Create”‬



With this, the linked service between Azure Blob Storage and ADF has been created successfully.

o) Now, set the properties to define the exact location of the JSON file in the Azure Blob Storage container.

‭Click on‬‭file icon‬‭to browse the location of the Azure‬‭Blob container which has‬‭a JSON file.‬



‭p) Then click on the‬‭Azure blob container‬‭you created‬‭and then click on‬‭OK‬

‭q) Now, click on the file that is present in the‬‭Azure‬‭Blob Storage Container.‬



r) Then click OK. This will set up the source for Azure Data Factory.



s) Click on Sink to set up the ADLS Gen2 container as the sink that will store the JSON file currently in the Azure Blob Storage container.

t) In the right-side panel, under the Select a data source field, type Azure Data Lake Storage Gen2, select the ADLS Gen2 option, and click on Continue.



u) From the select format options, select JSON and then click on Continue.

v) Give the linked service a name as you want, and then click on +New to create a new linked service.



w) Specify a name for the linked service, select the ADLS storage account you created from the dropdown, and then click on Create.

x‭ ) Then from this panel, specify the details, click on the file icon, and browse to the ADLS container‬
‭that you created earlier.‬



y) Select the ADLS container that you created and then click on OK.

z‭ ) Then click on the‬‭Validate‬‭option. Validation in‬‭ADF ensures that the pipeline configurations are‬
‭correct and that all connections are properly established before execution.‬



After that, click on the Debug option to trigger the copy pipeline. As a result, the pipeline will be executed successfully.

T‭ he successful pipeline execution moves the JSON file from Azure Blob Storage to the ADLS container.‬
‭To verify, navigate to the ADLS container, and you will see the JSON file there.‬



9. Creating Azure Databricks

a‭ ) Navigate to the Azure portal, type "‬‭Azure Databricks"‬‭in the search bar, and select‬‭"Azure‬
‭Databricks"‬‭from the suggestions.‬

‭b) Click on "‬‭+Create‬‭" to create an Azure Databricks‬‭workspace.‬



‭c) In the‬‭"Resource Group"‬‭field, select the resource‬‭group you created from the dropdown.‬

d) In the "Workspace Name" field, specify a unique name for your Databricks workspace. Keep all other settings as default. From the Pricing Tier dropdown, select Standard, as the Premium tier is more expensive.

‭Then, click on‬‭"Review + Create"‬‭and finally click‬‭"Create"‬‭to deploy the workspace.‬



e‭ ) Wait for the deployment of Databricks to complete. Once finished, click on‬‭"Go to resource"‬‭to‬
‭access your Databricks workspace.‬

f) Click on Launch Workspace to open the Databricks interface. Here, you will create a cluster and execute code in notebooks to process the data.



‭g) In the Databricks interface, the left panel includes the following sections:‬

‭‬
● Workspace
● Compute
● Job Runs
● Dashboards
● Catalog, and more

h) First, a cluster will be created inside the Databricks workspace. In the left-side panel, click on Compute.



‭i)‬‭Click on‬‭Create Compute.‬

A cluster is necessary for Databricks to run code and process data during the running of jobs, as it provides the computational resources.

j) In the Policy dropdown, select "Unrestricted", or go with "Personal Compute" if you face any error.



k) Keep all other configurations as shown in the screenshot; this will reduce costs and save compute resources.

Click on "Create Compute" to create the cluster.

The cluster may take some time to create. Once it is ready, a green check mark will appear next to the cluster name, indicating that it is running.



10. Azure Databricks Medallion Architecture

S‭ ince the raw JSON Data file is in the ADLS Container, it will be processed and structured using the‬
‭medallion architecture, flowing through‬‭bronze, silver,‬‭and gold layers‬‭. Each layer will have a‬
‭dedicated Databricks notebook in the same workspace for processing, analyzing, and cleaning data‬
‭with distinct purposes. After the final transformation is done in the gold layer, the data will be‬
‭uploaded to the Hive metastore in Databricks. This data will then be connected to Power BI to create‬
‭visualizations.‬

Why is Medallion Architecture considered the best standard in data engineering for processing raw data and delivering high-quality data?

A medallion architecture is a data design pattern used to logically organize data, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables).



‭a) In the left-side panel, click on‬‭Workspace‬‭.‬

b) In the top-right corner, click on Create, then select Notebook. Repeat this process three times to create separate notebooks for each layer: Bronze, Silver, and Gold. All notebooks will be created in this same workspace.



‭ ll three notebooks (Bronze, Silver, and Gold Layer) will be in the same workspace as shown above.‬
A
‭The reason is to ensure seamless integration, easier collaboration, centralized management, and‬
‭s treamlined data flow across the Bronze, Silver, and Gold layers.‬

c) Alternatively, instead of creating the notebooks manually, you can import all three notebooks into the same workspace.

This can be done by clicking the kebab (⋮) icon, selecting Import, and uploading the notebooks from your local system to the Databricks workspace.

‭d) Click on browse to upload all three notebooks from your local system to the Databricks workspace.‬



‭The execution of the notebooks will follow a sequential order:‬

‭Bronze Layer Notebook‬‭→‬‭Silver Layer Notebook‬‭→‬‭Gold‬‭Layer Notebook‬‭.‬

‭e)‬‭Bronze layer (raw data): Bronze Notebook‬

T‭ he Bronze layer is where we land all the data from external source systems. In this project, the‬
‭Bronze Layer Notebook‬‭will be used to ingest the raw‬‭JSON data from the ADLS container and load it‬
‭into a DataFrame for further processing.‬

‭Code Breakdown:‬

‭The code will be explained step by step, cell by cell.‬

‭1.‬ T‭ his code imports essential libraries to set up a Spark session and enables the use of‬
‭DataFrame operations.‬‭SparkSession‬‭is the entry point‬‭for Spark, and‬‭ col‬‭helps reference‬
‭columns. The execution of this cell sets up the environment for processing data using Spark.‬

2. It creates a Spark session named "WaterQualityPipeline" (give a name as per your preference). The execution of this cell will initialize the Spark environment in the Databricks notebook.
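A minimal sketch of these first two cells (the session name follows this project, but any name works; on Databricks, getOrCreate() simply reuses the already-running session):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Entry point for Spark; in a Databricks notebook this reuses the existing session
spark = SparkSession.builder.appName("WaterQualityPipeline").getOrCreate()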



‭3.‬ I‭t sets two variables:‬‭adls_storage_name‬‭holds the‬‭name of the ADLS (Azure Data Lake‬
‭Storage) account, and‬‭adls_storage_access_key‬‭s tores‬‭the access key for authenticating and‬
‭accessing the ADLS account. Replace‬‭adls_storage_name‬‭with the ADLS Storage account‬
‭name and‬‭<storage_access_key>‬‭with the actual access‬‭key to access the data.‬

To get the <storage_access_key> of the ADLS storage account:


‭Navigate to the Azure portal and go to the ADLS storage account you created.‬

a‭ )‬ I‭n the left-hand panel, click on Security + Networking.‬


‭b)‬ ‭Under Security + Networking, select Access keys.‬



c) Click on Show, then copy the key and paste it into the cell as the value of the adls_storage_access_key variable.

‭4.‬ I‭t configures the Spark session to use the ADLS storage account by setting the account key. It‬
‭allows Spark to authenticate and access the ADLS storage.‬
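A sketch of these two cells; the placeholder values are assumptions to be replaced with your own account name and key, and the key suffix blob.core.windows.net matches the wasbs:// path used in the next cell (for the abfss:// endpoint it would be dfs.core.windows.net instead):

adls_storage_name = "<adls-storage-account>"        # your ADLS account name (placeholder)
adls_storage_access_key = "<storage_access_key>"    # the access key copied above (placeholder)

# Register the account key so Spark can authenticate against the storage account
spark.conf.set(
    f"fs.azure.account.key.{adls_storage_name}.blob.core.windows.net",
    adls_storage_access_key,
)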

‭5.‬ ‭It reads a JSON file from the Azure Blob Storage into a DataFrame.‬

● file_location = "wasbs://<adls-container-name>@<adls-storage-account>.blob.core.windows.net": This specifies the path to the JSON file. Replace <adls-container-name> with your Azure Data Lake Storage container name and <adls-storage-account> with your Azure Data Lake Storage account name.
● bronze_df = spark.read.json(file_location): This reads the JSON file from the specified location and loads it into a DataFrame (bronze_df).
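A minimal sketch of this cell, keeping the placeholders above as assumptions to be replaced:

file_location = "wasbs://<adls-container-name>@<adls-storage-account>.blob.core.windows.net"

# Read the raw JSON data at this location into the Bronze DataFrame
bronze_df = spark.read.json(file_location)
display(bronze_df)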

6. The code sets up writing the DataFrame bronze_df (which contains the raw data from the JSON file) into a Delta Lake table in the Bronze Layer of the medallion architecture.

● bronze_df: The DataFrame contains the raw data loaded from Azure Blob Storage.
● .write: This is an operation to write the DataFrame to a storage location.
● .format("delta"): Specifies that the data will be written in Delta format, which is optimized for large-scale data processing.
● .mode("overwrite"): This means that if the data already exists at the specified location, it will be overwritten.
● .save("/mnt/datalake/bronze/water_quality"): This specifies the path where the data will be saved, i.e., the Bronze Layer location in the /mnt/datalake/bronze/water_quality directory.

7. This lists the files in the "/mnt/datalake" directory of DBFS (the Databricks File System).
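A sketch covering these two cells, using the Bronze-layer path described above:

# Persist the raw data as a Delta table in the Bronze layer
bronze_df.write.format("delta").mode("overwrite").save("/mnt/datalake/bronze/water_quality")

# List the contents of the /mnt/datalake directory in DBFS to verify the write
display(dbutils.fs.ls("/mnt/datalake"))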

‭f)‬‭Silver layer (cleansed and conformed data) - Silver‬‭Notebook‬

I‭n this Silver layer, data from the Bronze layer is cleaned, merged, and processed to provide a more‬
‭refined and consistent view of key business data. This layer transforms raw data from the Bronze layer‬
‭into usable form for analysis in the gold layer.‬



1. This code snippet is the same as explained at the beginning of the Bronze layer, as it is necessary for working with Spark in the Databricks notebook.

2. It loads the data from the Bronze layer, stored in Delta format at the specified location (/mnt/datalake/bronze/water_quality), into a DataFrame called bronze_df. The execution of this cell provides access to the raw data that was ingested into the Bronze layer.
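A one-line sketch of this cell, assuming the Bronze path used earlier:

# Load the Delta data written by the Bronze notebook
bronze_df = spark.read.format("delta").load("/mnt/datalake/bronze/water_quality")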

3. bronze_df.count() counts the total number of rows in the DataFrame bronze_df.

4. The function count_missings(spark_df, sort=True), where bronze_df is the input Spark DataFrame, counts the number of missing (null or NaN) values in each column of a given Spark DataFrame (spark_df). It does this by using Spark functions to identify and count null or NaN values in each column. The function checks for NULL or NaN values in each column (excluding timestamp, string, and date columns). For each column, it counts the occurrences of NaN or NULL values using F.count(F.when(F.isnan(c) | F.isnull(c), c)).

‭If‬‭s ort‬‭is‬‭True‬‭, it returns the sorted DataFrame.‬

The function is then called with bronze_df as its argument to count the missing values in each column of bronze_df.
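The exact implementation may differ; a minimal sketch that matches the description above could look like this:

from pyspark.sql import functions as F

def count_missings(spark_df, sort=True):
    # Count NULL/NaN values per column, skipping timestamp, string and date columns
    checked_cols = [c for c, t in spark_df.dtypes if t not in ("timestamp", "string", "date")]
    missing_df = spark_df.select(
        [F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c) for c in checked_cols]
    )
    if sort:
        # Reorder the columns of the one-row result by missing count, descending
        counts = missing_df.collect()[0].asDict()
        return missing_df.select(sorted(counts, key=counts.get, reverse=True))
    return missing_df

count_missings(bronze_df).display()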



5. This code snippet removes columns from the DataFrame that do not contribute to the use case and stores the result from bronze_df in silver_df (the DataFrame of the Silver layer). It ensures that only the relevant columns are kept for further processing.



‭6.‬ I‭t renames the necessary columns to make them more meaningful for easier processing and‬
‭analysis.‬

‭7.‬ T‭ his code maps country abbreviations to their full names in the Countrycode column for better‬
‭clarity.‬



‭8.‬ T‭ his code adds a new column‬‭Country_Name‬‭to the‬‭silver_df‬‭DataFrame by mapping each‬
‭Country_Code to its corresponding country name using when-otherwise conditions.‬
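A sketch of the when-otherwise pattern; the country codes shown here (DE, FR, ES) are only illustrative, while the actual notebook maps every code present in the dataset:

from pyspark.sql import functions as F

silver_df = silver_df.withColumn(
    "Country_Name",
    F.when(F.col("Country_Code") == "DE", "Germany")
     .when(F.col("Country_Code") == "FR", "France")
     .when(F.col("Country_Code") == "ES", "Spain")
     .otherwise(F.col("Country_Code")),  # keep the original code if it is not mapped
)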



‭9.‬ I‭t maps the values in the‬‭water_body‬‭column of the‬‭s ilver_df DataFrame to full water body‬
‭names (e.g.,‬‭"GW" to "Ground Water", "RW" to "River‬‭Water")‬‭. It then adds a new column,‬
‭Water_Body_Name‬‭, with these full names‬



‭10.‬ ‭It maps the values in the‬‭Analyzed_Matrix‬‭column‬‭of the silver_df DataFrame to full names‬
‭(e.g.,‬‭"W" to "Water", "W-DIS" to "Water - Dissolved"‬‭).‬‭It then adds a new column,‬
‭Analyzed_Matrix_Name‬‭, with these full names.‬



11. It writes the cleaned silver_df DataFrame to a Delta Lake table in the Silver layer.

It enables schema evolution (mergeSchema = true), which allows the schema to automatically adapt to any new columns or changes in the data.

The data is written in "overwrite" mode, so it will replace any existing data at the specified path (/mnt/datalake/silver/water_quality_cleaned) in DBFS (the Databricks file system).
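A sketch of this write, using the path and options described above:

(
    silver_df.write.format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")  # allow the table schema to evolve with new columns
    .save("/mnt/datalake/silver/water_quality_cleaned")
)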

‭ After cleaning and filtering in the Silver layer, the data will be further processed in the Gold layer to‬

‭be refined and ready for use in the use case.‬

‭g)‬‭Gold layer:‬‭(curated for use-case) - Gold Layer‬‭Notebook‬

T‭ he final layer of data transformations and data quality rules are applied here. The Gold layer is for‬
‭reporting and uses more de-normalized and read-optimized data.‬

‭ ata from the Gold layer is typically used for dashboards, reporting, and model building, as it‬
D
‭represents the final transformed and high-quality data.‬



1. This code snippet is the same as explained at the beginning of the Bronze layer, as it is necessary for working with Spark in the Databricks notebook.

2. It loads the data from the Silver layer, stored in Delta format at the specified location (/mnt/datalake/silver/water_quality_cleaned), into a DataFrame called silver_df. The execution of this cell provides access to the cleaned data that was produced by the Silver layer.

3. The line of code gold_df = silver_df.dropDuplicates() removes any duplicate rows from the silver_df DataFrame, creating the final, clean dataset for the Gold layer.

‭4.‬ ‭Finding outliers in the‬‭Minimum_Value‬‭column‬

‭Breakdown:‬

A) mean_val = gold_df.select(F.mean("Minimum_Value")).first()[0]
   stddev_val = gold_df.select(F.stddev("Minimum_Value")).first()[0]

It calculates the average (mean) and the spread (standard deviation) of the values in the Minimum_Value column; these values are necessary to calculate how far each value in the column is from the average.

B) gold_df_with_zscore = gold_df.withColumn(
       "z_score", (F.col("Minimum_Value") - mean_val) / stddev_val
   )

It calculates the Z-score for each value in the Minimum_Value column. The Z-score tells us how far away each value is from the average in terms of standard deviations. A high Z-score means the value is far from the average, which might make it an outlier.

Note: Z-score = (value − mean) / standard deviation. Outliers are typically defined as data points with Z-scores beyond a certain threshold, often ±2 or ±3.

C) gold_df_with_outliers = gold_df_with_zscore.withColumn(
       "MinimumValue_outlier", F.when(F.abs(F.col("z_score")) > 3, 1).otherwise(0)
   )

‭>‬ ‭It checks if the Z-score is greater than 3 or‬‭less than -3 (which means the value is‬

‭far from the average). If it is, the value is marked as an outlier (‬‭1‬‭), otherwise it's‬
‭marked as normal (‬‭0‬‭). Values with Z-scores beyond‬‭3 are considered outliers because‬
‭they are far away from the average.‬

D) outlier_rows = gold_df_with_outliers.filter(gold_df_with_outliers.MinimumValue_outlier == 1)
   print("Rows with MinimumValue_outlier = 1:")
   outlier_rows.display()
= 1:") outlier_rows.display()‬

‭ It is simply filtering and displaying the rows that are considered outliers based on‬

‭the condition‬‭MinimumValue_outlier == 1‬‭.‬



After filtering out, the line gold_df = gold_df_with_outliers assigns the DataFrame gold_df_with_outliers to gold_df, essentially making the data from gold_df_with_outliers the final version for the Gold layer.



‭5.‬ ‭Finding outliers in the‬‭Maximum_Value‬‭column.‬

F‭ ollow the same process used for the‬‭Minimum_Value‬‭column, but replace the column name‬
‭with‬‭Maximum_Value‬‭in the code.‬

After filtering out, the line gold_df = gold_df_with_outliers assigns the DataFrame gold_df_with_outliers to gold_df, essentially making the data from gold_df_with_outliers the final version for the Gold layer.



‭6.‬ ‭Finding outliers in the‬‭Mean_Value‬‭column.‬

F‭ ollow the same process used for the‬‭Minimum_Value‬‭and Maximum_Value‬‭columns, but‬


‭replace the column name with‬‭Mean_Value‬‭in the code.‬

‭ fter filtering out, the line‬‭gold_df = gold_df_with_outliers‬‭assigns the DataFrame‬


A
‭gold_df_with_outliers to the gold_df, essentially making the data from gold_df_with_outliers the final‬
‭version for the Gold layer.‬



7. This line drops (deletes) the z_score column from the DataFrame. The outliers have already been calculated, so the z_score column is no longer needed.

‭8.‬ ‭T‬‭his code splits the‬‭Sampling_Period‬‭column into two‬‭new columns:‬


‭a)‬ ‭Start_Date‬‭: It Extracts the first 10 characters (assumed‬‭to be the start date).‬
‭b)‬ ‭End_Date‬‭: It Extracts the last 10 characters (assumed‬‭to be the end date).‬
‭c)‬ ‭It uses the‬‭s ubstr()‬‭function to extract these portions‬‭from the‬‭Sampling_Period‬‭and‬
‭creates the new columns‬‭Start_Date‬‭and‬‭End_Date‬‭.‬
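A sketch of this cell; it assumes Sampling_Period values whose first and last 10 characters are dates (e.g. "2019-01-01--2019-12-31"):

from pyspark.sql import functions as F

gold_df = (
    gold_df
    .withColumn("Start_Date", F.substring("Sampling_Period", 1, 10))   # first 10 characters
    .withColumn("End_Date", F.substring("Sampling_Period", -10, 10))   # last 10 characters
)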

9. This code writes the final gold_df data to the Gold layer in Delta format:

a) format("delta"): Saves the data in Delta format (for data processing).
b) mode("overwrite"): It replaces or overwrites any existing data at the specified location.
c) save("/mnt/datalake/gold/water_quality_aggregated"): Saves the data to the Gold layer path in the data lake.

10. To write the final data frame (gold_df) to the Hive metastore, follow these steps:

a) Create Database: In the notebook, create a database that will be visible in the Hive metastore.

‭Code:‬
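A minimal sketch of this cell, assuming the database name waterdb used in this project:

# Create the database that will appear under the Hive metastore
spark.sql("CREATE DATABASE IF NOT EXISTS waterdb")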



‭Give a name to the database as you want (waterdb is the name of the database in this case).‬

b) Run the following code snippet to load the data from the gold_df data frame into a table (gold_table) within the database you created.

‭Code:‬
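A minimal sketch, assuming the database and table names used in this project:

# Save the curated Gold DataFrame as a managed table in the waterdb database
gold_df.write.format("delta").mode("overwrite").saveAsTable("waterdb.gold_table")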

c) On the left side panel, click on Catalog, then under Hive Metastore, select the database you created.



d) You'll see the table (gold_table) that was created in the Hive metastore, containing the curated Gold-layer data.

11. Connection to Power BI and Visualizations

a) Open Power BI Desktop, click on "Get Data", and from the dropdown, click on "More".



‭b)‬ ‭In the search bar, type‬‭Azure Databricks‬‭and click‬‭on‬‭Azure Databricks‬‭from the suggestions.‬

‭c)‬ ‭Specify the details such as‬‭server hostname‬‭and‬‭HTTP‬‭path‬‭of your Databricks environment.‬



‭To get the‬‭server hostname‬‭and‬‭HTTP path‬‭of your Databricks‬‭environment, follow these steps:‬

a‭ )‬ I‭n the Azure portal, navigate to your‬‭Azure Databricks‬‭workspace‬‭.‬


‭b)‬ ‭Once inside the Databricks workspace, click on the‬‭Compute‬‭icon on the left panel.‬
‭c)‬ ‭Select the cluster you want to use.‬
‭d)‬ ‭In the‬‭Cluster details page‬‭, under the‬‭Advanced Options‬‭s ection, you will see the‬‭JDBC/ODBC‬
‭tab.‬

‭e)‬ ‭In this tab, you'll find the‬‭Server Hostname‬‭and‬‭HTTP‬‭Path‬‭.‬

‭Copy these values to specify them in your connection settings in Power BI‬



‭f)‬ ‭Once you specify both credentials, click on‬‭“OK”‬

‭g)‬ I‭n the‬‭Azure Databricks connection window‬‭, click on‬‭Azure Active Directory‬‭for‬
‭authentication.‬

‭Click on‬‭Sign In‬‭.‬



h) Click on the email ID with which you created your Databricks workspace.

‭i)‬ ‭Click on‬‭Connect.‬‭This will establish a connection‬‭between Databricks and Power BI‬



‭j)‬ ‭In the‬‭Navigator‬‭window, click on‬‭Hive Metastore‬‭and‬‭then expand the‬‭Databases‬‭s ection.‬
‭●‬ ‭Select the‬‭database‬‭you created earlier.‬
‭●‬ ‭Find and select the‬‭gold_table‬‭you created to store‬‭data in the Hive Metastore.‬
‭●‬ ‭Click‬‭Load‬‭to load the data from the‬‭gold_table‬‭into‬‭Power BI for visualization.‬

‭k)‬ ‭Now, in the right panel under the‬‭Data‬‭s ection, your‬‭gold_table‬‭will be listed.‬

‭This indicates that the data from‬‭gold_table‬‭has‬‭been successfully loaded into Power BI.‬



l) Now, you can create visualizations on top of the cleaned, high-quality data in Power BI to extract valuable insights and information, such as:
● The level of determinands in water over the years.
● The concentration levels of determinands in different water bodies over time.
● Other key parameters for assessing water quality, including trends, anomalies, and comparisons across various locations and timeframes.




‭Summary‬

‭●‬ I‭n this training, we created a comprehensive data pipeline using Azure Cloud, enabling‬
‭organizations and concerned authorities to take measurable actions in no time.‬
‭●‬ ‭The flexibility of the pipeline allows it to cover a vast number of use cases.‬
● Initially, we discussed the water sensor data (our main data source) and emphasized the importance of treating every single record generated by the sensors, highlighting the criticality of handling data correctly for accurate analysis.
‭●‬ ‭A batch-processing approach was employed to extract the entire dataset from the Azure-based‬
‭SQL database.‬
‭●‬ ‭For this project, we used aggregated water sensor data collected across different terrains,‬
‭water bodies, vegetation, and countries.‬
‭●‬ ‭The architecture of our project is quite comprehensive and can accommodate additional‬
‭functionalities. The architecture encompasses the flow in this way:‬
‭○‬ ‭Initially, we loaded the data from local storage into the cloud-based SQL database‬
‭using pipelines, as the entire database was utilized for a historical approach.‬
‭○‬ ‭In the storage layer, we used ADLS storage containers and Blob Storage to store the raw‬
‭data files.‬
‭○‬ ‭In the data movement layer, we leveraged Azure Logic Apps and Azure Data Factory to‬
‭facilitate the movement of data between services, ensuring it reached its final‬
‭destination.‬
‭○‬ ‭In the transformation layer, we adhered to the Medallion Architecture, an‬
‭industry-standard framework, to transform the data rigorously and deliver a‬
‭high-quality dataset.‬
○ Finally, we connected the processed dataset from the Gold-layer table in the Hive metastore to Power BI, and visualization development on this data was explained to create comprehensive dashboards, enabling government and private organizations to make timely decisions and prevent hazardous activities.
● Various Azure services were used to implement the project, and cost configurations were also taken care of to avoid unwanted charges.

