
AZURE DATA ENGINEERING – CRACK THE INTERVIEW
By Karthik J

https://www.udemy.com/course/azure-data-engineering-interview-questions/?referralCode=41FC8A6544E97A2F4AB2
[email protected]
BigDataAzure.com
WHAT IS THIS COURSE ABOUT?
This course will give you:
• High confidence to face interviews
• Guidance on how to articulate the architecture/design of your project
• Tips on keeping your answers to the point
• Quizzes to test your knowledge
• Reference material links for self-study

What you should NOT expect:
• A question-and-answer bank
• How to impress the interviewer
• Communication and other soft skills
• Qualities such as attentiveness and punctuality

This course will help you to:
• Focus on the relevant Azure services, not everything you could learn
• Self-learn and practice

Before you buy this course:
• Azure Data Services knowledge is required.
• The primary audience of this course is developers and architects.
• This course is available only in English, with no caption text.

STRUCTURE OF THE COURSE

• Discussion point
• Demo
• Hints
• Reference material to gain knowledge
• Quiz per topic

PRE-REQUISITES
Experience in working on Azure Data Services – Azure Data Factory, Storage services, Databricks, Azure SQL.

If you do not have this experience, I recommend the course below on Udemy to gain project development experience.

Grand opening

TELL ME ABOUT YOUR EXPERIENCE ON AZURE

Candidate A Candidate B Candidate C

TELL ME ABOUT YOUR EXPERIENCE ON AZURE

Candidate A:
"I have total xx number of years experience. I know Azure data factory, data lake, SQL server, Spark, Hadoop, Databricks. Also little bit of Scala."

Candidate B:
"I have total xx years experience. I worked on Azure data lake, data factory, Hive, Blob storage and ADLS. I worked on 2 projects for clients xx and xx. We copy data from multiple sources using ADF, then store on storage, transform data in SQL …… Long story"

Candidate C:
"I have total xx years experience in Microsoft technologies, out of that I worked on Azure for xx years. I worked on Azure data engineering projects for big clients. I ingested data using batch transfer mode having size xx GB/TBs. We used ADF, Databricks, Storage services, Azure SQL. I pretty much liked it."

TELL ME ABOUT YOUR EXPERIENCE ON AZURE

Focus on:
• What was your role? – developer, senior developer, architect …
• How many projects have you worked on?
• Were you part of designing the solution?
• What was the data size and volume?
• Batch mode and/or real-time data ingestion?
• Structured / semi-structured / unstructured data
• Mention whether you faced challenges or not; don't describe them at this stage.
• List the Azure data services you have experience with or are confident about.

Finish in max 2 mins


EXPLAIN YOUR PROJECT ARCHITECTURE

[Reference architecture diagram – stages include "clean and transform" and "aggregate"]
EXPLAIN YOUR PROJECT ARCHITECTURE

Candidate A:
"Azure data factory copies the data from files on FTP server to blob as sink. Another copy activity reads data from blob and inserts into SQL and Databricks. Then we call stored proc to transform the data."

Candidate B:
"There are multiple sources like database, files and web APIs. ADF connects to these sources, pulls the data and stores it to the data lake. The data is stored inside date folders. HDInsight reads and transforms the data and stores it in Hive tables. Jobs are scheduled on a daily basis."

Candidate C:
"The project has a layered architecture. Data is pulled from various source systems like SQL DB, files, Cosmos DB etc. and kept in the ingest layer in a partitioned manner. Databricks is used to process the data, which is kept in the curated layer; then aggregated data is pushed to SQL DB for consumption. Most of the data is structured. One-time history load and daily incremental load."
EXPLAIN YOUR PROJECT ARCHITECTURE

Focus on:
• Your understanding of the architecture
• How effectively the Azure data services are used and what their purpose was – e.g. in the reference architecture Databricks is used only for compute
• What data is loaded – e.g. history data load, data load from heterogeneous sources etc.
• Know the consumers of your data – visualization tools, data analysts, downstream applications

Finish in max 2-3 mins


EXPLAIN YOUR PROJECT ARCHITECTURE

Articulate the benefits of the architecture. Example:

• Decoupling of storage from compute and data processing
• No compute power required to read data
• Customized security roles can be defined and assigned
• Any tool can analyze and process the data
• Centralized data architecture
• No need to keep multiple copies for multiple purposes (viz. governance, EDW, auditing, EDM etc.)
• Cost effective
• Scalable
• No limit on data size
• Data from more sources can be added easily
• Real-time data ingestion can be plugged in
• REST APIs can be leveraged to manage services and data

OTHER STANDARD QUESTIONS

• What was the team size?
• What was your role?
• What was the project duration?
• Were there any difficulties in learning Azure? Where did you learn Azure?
• Are you aware of current trends in the market?
• Difference between a data warehouse and a data lake
etc. etc. …

AZURE DATA FACTORY

Topics in this section: best practices, error handling, interview questions, the Monitor window, and all about the Copy activity.
AZURE DATA FACTORY

What is Azure Data Factory?


• Do not start by talking about pipelines, datasets and triggers.

Use these hints:

• It is an orchestration tool; calling ADF an ETL or ELT tool is not appropriate.
• Connects to various services like Databricks, Web, Function, Custom activity etc.
• High number of connectors as source as well as sink; not all source connectors can be used as a sink.
• Capable of connecting to on-prem data via a self-hosted IR
• Building blocks: pipelines, datasets, triggers, IR, Monitor, control flow
• Inbuilt monitor with rich UI
• Integration Runtime: cost is associated with the inbuilt IR.
• CI/CD supported through Git
• REST API and SDK support
• Role-based security support
AZURE DATA FACTORY

What is Integration Runtime?


• Compute power to execute ADF code and to process the data
• Cost is associated (as per the Data Integration Units used)
• Power (DIU) can be specified in the ADF Copy activity
• Your own machines (nodes) can be attached to a Self-Hosted Integration Runtime (SHIR)
• A SHIR can be shared between multiple ADFs
• Supports running legacy SSIS packages
• Advantages of SHIR:
  • Nodes are part of the company network
  • Flexibility to upgrade drivers to connect to on-prem systems, e.g. the .NET driver for SAP OpenHub
  • Easily monitored and controlled

AZURE DATA FACTORY

What is difference between Parameters and Variables?


• Focus on the advantages of parameters and variables instead of just stating what they are
• Understand what to use where
• Parameters should mostly hold environment-dependent values (e.g. SQL server name) or dynamic values (data landing paths) which are supplied from outside the pipeline
• Variables should be used for internal functioning (e.g. building a dynamic path that includes the date, i.e. yyyy/mm/dd) or decision making; they should not be managed by external parties (see the sketch below)

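To make the external-vs-internal distinction concrete, here is a minimal sketch (not part of the course material) of a caller supplying parameter values at run time with the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, pipeline and parameter names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Parameters are supplied from outside the pipeline at run time;
# variables remain internal to the pipeline definition.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf_client.pipelines.create_run(
    resource_group_name="rg-data-platform",   # placeholder
    factory_name="adf-demo",                  # placeholder
    pipeline_name="pl_ingest_sql",            # placeholder
    parameters={
        "sqlServerName": "myserver.database.windows.net",  # env-dependent value
        "landingPath": "raw/sales/2024/01/15",              # dynamic value
    },
)
print("Started run:", run.run_id)
```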
AZURE DATA FACTORY

Can you tell me about Copy Data activity?

Candidate A:
"It copies data from source to target. We have to specify from where the data has to be copied, e.g. SFTP, and its format (e.g. CSV). On the target we also have to specify the type and format of data with additional settings like header = true…"

Candidate B:
"It copies data from source to target using ADF connectors. Supports dynamic values as paths or table names etc. Configurable DIU and parallel connections can be specified for optimum performance."

AZURE DATA FACTORY

What is special in ADF Monitor?


• Supports monitoring pipeline executions and integration runtime health
• Annotations and Gantt chart views
• REST API and SDKs are available (see the sketch below)
• Set up alert notifications
• Integration with Log Analytics

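As a hedged illustration of the REST API/SDK point above, the run history shown in the Monitor UI can also be queried from Python with azure-mgmt-datafactory; the resource names are placeholders.

```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Pipeline runs updated in the last 24 hours
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)
runs = adf_client.pipeline_runs.query_by_factory("rg-data-platform", "adf-demo", filters)
for run in runs.value:
    print(run.pipeline_name, run.status, run.run_start)
```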
AZURE DATA FACTORY

What are the best practices in ADF?


• Reusable (generic) datasets
• Naming and object arrangement conventions
• Description fields
• User properties and annotations
• Security via Key Vault
• Notification on error (red path)
• Create templates for pipelines

AZURE DATA FACTORY

Do you know how to ingest data using metadata?


See the video from the other course – Copy data from Azure SQL with a control table. The same pattern applies to any source, with enhancements to the control table(s); a minimal sketch follows.

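The course demo implements this with an ADF Lookup + ForEach over a control table; the sketch below only illustrates the same idea from Python. The control table name and columns, the connection string and the generic pipeline name are all assumptions.

```python
import pyodbc

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Read the control table that describes what to ingest (hypothetical schema)
conn = pyodbc.connect("Driver={ODBC Driver 18 for SQL Server};Server=...;Database=...;...")
rows = conn.execute(
    "SELECT source_schema, source_table, landing_path "
    "FROM etl.control_table WHERE is_active = 1"
).fetchall()

# One parameterized run of a generic copy pipeline per control-table entry
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
for schema, table, path in rows:
    adf_client.pipelines.create_run(
        "rg-data-platform", "adf-demo", "pl_copy_generic",   # placeholders
        parameters={"schemaName": schema, "tableName": table, "sinkPath": path},
    )
```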
AZURE DATA FACTORY

How do you retrieve error details?


• Most activities provide the error as "output" or as "error"
• PreviousActivity.output.errors – collection of errors
• PreviousActivity.error – direct error details
• Not all activities provide the error in the form of output or error, e.g. the Databricks Python activity (see the workaround sketch below)
• The error message can be captured and used for logging and sending notifications

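For the Databricks case, one common workaround (sketched here under assumed table and path names; `spark` and `dbutils` are the objects Databricks provides in a notebook) is to catch the failure inside the notebook and hand it back through dbutils.notebook.exit(), which the ADF Databricks Notebook activity exposes as its run output.

```python
import json

try:
    df = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/")  # placeholder path
    df.write.mode("overwrite").saveAsTable("curated.sales")                        # placeholder table
    dbutils.notebook.exit(json.dumps({"status": "succeeded", "rows": df.count()}))
except Exception as exc:
    # Returning (instead of re-raising) lets the calling pipeline read the message;
    # re-raise if the ADF activity itself should be marked as failed.
    dbutils.notebook.exit(json.dumps({"status": "failed", "error": str(exc)}))
```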
AZURE STORAGE SERVICES

STORAGE SERVICES
What is the difference between Blob and ADLS?
                      Blob Storage                                          ADLS Gen 2
Purpose               Store objects like media files, unstructured data,   Store large files (terabytes/petabytes),
                      archival of data, log files, streaming data          typically for analytics
Structure             Flat                                                  Hierarchical
SAS URI               Yes, at account, container and blob level            No (may add support in future)
Security              RBAC, container/blob level                           RBAC, ACL, POSIX, any level
Size limit            Yes                                                   No
SDK support           Yes – matured                                         Yes – limited
REST API              Yes – matured                                         Yes – limited
Driver                wasb                                                  abfs
Data stored as        Block blob, page blob                                 Files
Append data in blob   Supported                                             Not supported
Tiers                 Hot, Cool, Archive                                     Not supported
End points            blob.core.windows.net                                 dfs.core.windows.net
STORAGE SERVICES

More detailed learning on storage with hands-on practice

STORAGE SERVICES
What is the difference between Blob and ADLS?
https://docs.microsoft.com/en-us/azure/storage/blobs/

STORAGE SERVICES

Tricky questions
• Can you not store media files on ADLS?
  • Yes, you can.
• Then why not?
  • Cost of ingress and egress is higher on ADLS; the performance benefit of ADLS is for analytics.

• A third-party team (outside of the Azure environment) wants to send data files to Azure. How?
  • Create a container on Blob storage and share a SAS URL. The team can POST data using HTTPS from any technology / programming language.

• Once I create Blob storage, can I change it to ADLS later by changing properties?
  • No

STORAGE SERVICES

Why is ADLS Gen2 faster than Blob storage?


Blob storage uses the WASB (Windows Azure Storage Blob) driver, which maps blobs (files) each time; this adds load to the code to maintain the mapping of objects. ABFS (Azure Blob File System), on the other hand, uses directories, which reduces the large number of operations required to retrieve data.

Tip: instead of a large number of small files, keep the data in a small number of large files – analytics performance improves many fold (see the sketch below).

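A small PySpark sketch of the tip above (container, account and folder names are made up): read through the abfs driver and rewrite many small files as a few large parquet files.

```python
# Read raw JSON events from ADLS Gen2 via the abfs driver
raw = spark.read.json("abfss://raw@mydatalake.dfs.core.windows.net/events/2024/01/")

# Rewrite as a small number of large parquet files for the analytics layer
(raw.coalesce(8)
    .write.mode("overwrite")
    .parquet("abfss://curated@mydatalake.dfs.core.windows.net/events/2024/01/"))
```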
STORAGE SERVICES

How do you archive the data?

• ADF Copy activity – compress the data, delete the original
• Use lifecycle management in Azure Blob Storage
• Simply keep the data as-is for a longer duration – storage is very cheap

AZURE DATABRICKS

AZURE DATABRICKS

Tell me about Azure Databricks


• Customized Spark engine with Apache Delta engine
• Enhanced for auto scaling, Machine learning, data management and streaming
• Offers rich UI to manage clusters, security and notebooks for code development
• Offers Delta tables – supports MERGE (ACID transactions), create views on top of tables,
compressed data store
• Integrated with Azure storage, Key Vault services and Azure Active Directory
• Notebooks can be callable from ADF
• Notebook supports magic commands, version history, can be shared, returns value
• Good for data analysts - better visualization - graphs, charts in notebook
• Offer Rest API
• Job scheduler
• SQL Server like Security implementation
• Can be integrated with Git.
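A minimal sketch of the MERGE support mentioned above, run from a notebook; the table and column names are invented for illustration.

```python
# Upsert a staging snapshot into a curated Delta table (ACID transaction)
spark.sql("""
    MERGE INTO curated.customers AS tgt
    USING staging.customers_updates AS src
      ON tgt.customer_id = src.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```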
AZURE DATABRICKS

What did you not like about Azure Databricks?


• PaaS, not IaaS – for customers who want to manage the infrastructure themselves, Databricks is not the right choice
• Not every third-party Spark library works on Databricks, or it does not work efficiently
• Cannot migrate legacy Spark code as-is (security restrictions, e.g. accessing the inbuilt temp storage)
• Cannot define keys and constraints on a Delta Lake table
• Does not support Kafka storage or Storm streaming

AZURE DATABRICKS

Azure Databricks – Best practices (1/2)


• Development guidelines
  • Organize notebooks in a proper folder structure
  • Keep development and executable notebook copies in separate folders
  • Use widgets in the development phase
  • Use only one language for all notebooks
  • Error handling using try/catch
  • Define standards for calling Spark code from ADF (notebooks, Python, JAR)
  • Mount storage instead of referring to the complete absolute storage path
  • Use a secret scope to access secured info, e.g. a DB server password/connection string (see the sketch below)
  • Define a strategy to interact with external systems (e.g. use pyodbc to interact with SQL DB)
  • Create libraries for common functions

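A short sketch covering the secret-scope and pyodbc bullets above; the scope, secret, server and table names are placeholders.

```python
import pyodbc

# Fetch the SQL password from a (Key Vault-backed) Databricks secret scope
password = dbutils.secrets.get(scope="kv-backed-scope", key="sql-db-password")

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;Uid=etl_user;Pwd=" + password + ";Encrypt=yes;"
)
for row in conn.execute("SELECT TOP 5 * FROM dbo.watermark_table"):
    print(row)
```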
AZURE DATABRICKS

Azure Databricks – Best practices (2/2)


• Performance optimization (Delta Lake features; commands sketched below)

  • Use Delta tables with proper partitions
    • Why – Delta tables support ACID transactions, scalable metadata handling and compression
  • Compaction – improves analytics performance by combining small files into large files (bin packing)
  • Optimization – the Z-Ordering technique is a kind of indexing (though not a true index); it improves the performance of select queries
  • Vacuum – deletes unreferenced files (generally files become unreferenced after the compaction process)
  • Analyzing tables – analyze the tables periodically to collect statistics, which are used by the query optimizer for better performance

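The maintenance commands above, issued from a notebook as Spark SQL; the table and column names are placeholders.

```python
spark.sql("OPTIMIZE curated.sales ZORDER BY (customer_id)")  # compaction + Z-Ordering
spark.sql("VACUUM curated.sales RETAIN 168 HOURS")           # drop unreferenced files (default retention is 7 days)
spark.sql("ANALYZE TABLE curated.sales COMPUTE STATISTICS")  # refresh statistics for the query optimizer
```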
AZURE HDINSIGHT

AZURE HDINSIGHT

Tell me about Azure HDInsight


• Fully IaaS – full control over the servers
• Offers Hadoop, Spark, Kafka, Storm, Interactive Query and HBase clusters
• Easy to migrate on-prem code
• Offers autoscaling
• Rich UI (Ambari) to manage and monitor the cluster and nodes
• RDBMS-like security support using Ranger and Azure Active Directory

AZURE HDINSIGHT

What did you not like about Azure HDInsight?


• Costlier, as it is IaaS – billed even when not used
• Cannot upgrade the runtime; the cluster must be recreated
• No integration with Azure Key Vault
• More maintenance effort in infra management
• Ranger is costly
• Support tickets are routed to Hortonworks

AZURE SQL DB
AZURE SYNAPSE ANALYTICS
AZURE SQL DB/DW (SYNAPSE ANALYTICS)

Why do you want to use SQL DB or DW?


• Supports all features of on-prem SQL Server, either through Azure SQL DB or Managed Instance
• Security at row and column level using AAD and SQL logins
• Easy to scale up and down
• Easy to manage using the Azure portal
• Offers threat detection, encryption, data masking and vulnerability detection
• Azure SQL DB supports large data using Hyperscale
• Auto backup policy configuration
• Redundancy support across regions
• SQL DW's MPP (Massively Parallel Processing) architecture is specially built for big data processing and separates storage from compute (distributed processing similar to Azure Databricks Spark)
• Excellent UI (Azure portal) for monitoring utilization and load on the server, fine-tuning queries, performance optimization suggestions, auto index creation etc.

AZURE SQL DB

What are the Azure SQL products you know?

Azure SQL DB:
• Virtual server
• Only DB management
• Fully PaaS
• Default max size limit 4 TB
• Hyperscale up to 100 TB
• Supports elastic pools
• Auto backup

SQL Managed Instance:
• Can be added to the company network (VNET)
• Lift-and-shift database migration
• Fully PaaS
• Auto backup
• SSIS, SSAS and SSRS – no

SQL on Azure VM:
• SQL installation on a VM
• Fully IaaS
• Easy lift and shift
• As good as on-prem
• Good for large databases, minimum time to migrate, third-party software like SQL Sentry installation supported

AZURE SQL DB

What are the database migration tools you know?


Azure Database Migration Service:
• Cloud-based service
• Internally uses Data Migration Assistant and SQL Server Migration Assistant
• Useful for large data migrations (in terms of the number of databases or size of databases)

Data Migration Assistant:
• SQL to Azure SQL migration
• Also analyses jobs and SSIS packages
• Desktop software

Database Experimentation Assistant:
• Used to upgrade the SQL version

SQL Server Migration Assistant:
• Non-SQL to SQL/Azure SQL migration
• Sources – Access, DB2, MySQL, Oracle, SAP ASE
AZURE SQL

More on securing data


• Transparent Data Encryption – data, logs and backups are encrypted and decrypted in real time at rest. Bring-your-own-key is supported. This is enabled by default for SQL DBs and must be switched on manually for Synapse Analytics.

• Always Encrypted – encryption/decryption on the client side; secures data in transit.

• Data protection –
  • Column-level security
  • Row-level security
  • Data masking
  • Manual encryption

AZURE SQL DB/DW (SYNAPSE ANALYTICS)
All about SQL DW (Synapse Analytics)
• Sharding data via hash-distributed, round-robin and replicated tables
• Limitless storage
• Costly (you can pause it to save cost)
• The upcoming Synapse Studio bundles ADF, ADLS, Spark and DW
• Best for SQL developers who are already familiar with the T-SQL language

Any limitations or drawbacks?

• Very costly
• Various limitations, e.g. PolyBase supports only 1 MB per row, no support for serialized transactions, no support for UPDATE FROM etc.
• No integration with Azure Key Vault and other services
• Cannot link other DB servers (use an external data source instead)
• Need to rely only on T-SQL
AZURE SQL DB/DW (SYNAPSE ANALYTICS)

Azure Synapse Analytics Best Practices


• Sharding data: hash-distributed – facts; round-robin – stage layer; replicated – dimensions
• Copy data from an external source
  • For the fastest load, use a stage layer with round-robin distribution
  • PolyBase limitation of max 1 MB per row – prepare data with smaller rows
• Scale up the DW for large data loads, scale down once the data is loaded
• Create statistics periodically
• Minimize transaction size (avoid long-running DML operations)
• Use the smallest possible column size
• Source files: prefer parquet over flat files; multiple small files over a few large files
• Use minimal logging to generate a small amount of log and to increase I/O efficiency
• Rename tables instead of delete/update

MORE ON AZURE DATA

• Azure Delta Lake – adoption of the open-source Delta storage layer (similar to Databricks Delta)

• Azure Data Explorer – to query and analyze real-time data being ingested through Event Hubs, IoT Hub, Azure Queue etc.

• Azure Data Studio – SQL client plus the capability of running notebooks using Python and R on a Spark cluster

MISCELLANEOUS QUESTIONS

MISCELLANEOUS QUESTIONS
1
How do you improve performance of analytics? How do you partition data?

Storage (Blob/ADLS):
• Partition using folder structure – subject area / org units, region, date (yyyy/mm/dd) – see the sketch after this answer
• Small number of large files
• Hadoop formats (e.g. parquet)
• Archive old data

Databricks / HDInsight:
• Table partitioning
• Prefer static partitions over dynamic
• Bucketing (cluster by)
• Optimize
• Vacuum periodically

SQL (DB & Synapse Analytics):
• Table partitioning
• Aggregate / summarize data
• Minimal use of external data sources
• Indexing
• Shard fact data using hash distribution
• Create statistics

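An illustrative PySpark sketch of the yyyy/mm/dd folder partitioning above, assuming a DataFrame df that already has year, month and day columns and a made-up ADLS path.

```python
(df.write
   .partitionBy("year", "month", "day")   # produces .../year=2024/month=01/day=15/ folders
   .mode("append")
   .parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/"))
```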
MISCELLANEOUS QUESTIONS
2
Why is the curated layer ADLS Gen2? You could place a SQL DB there….

• The ADLS Gen2 + Databricks combination is meant for big data processing
• Create Delta tables with their location pointing to ADLS Gen2. The Delta table file format is parquet, which can be read by any external system (e.g. ADF, data analysts, reporting tools) – see the sketch below
• To read the data, no Databricks engine is needed – saves compute cost
• Data is compressed, so it requires less storage space
• ADLS storage is cheaper than SQL DB
• ADLS supports security at any level (POSIX-like access control lists, ACLs)

[Diagram: the curated ADLS Gen2 layer is consumed by ADF (Copy activity), SQL DW (PolyBase), Spark and any other tool]

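A sketch of the second bullet (paths and names assumed): an unmanaged Delta table whose parquet files live in the ADLS Gen2 curated layer, so other tools can read the same files without a Databricks cluster.

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS curated.sales
    USING DELTA
    LOCATION 'abfss://curated@mydatalake.dfs.core.windows.net/sales/'
""")
```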
MISCELLANEOUS QUESTIONS
3
• Apart from ADF, how can you execute pipelines? – PowerShell, REST API, SDK, Logic Apps

• What are the various ways to copy data from Databricks to Azure SQL/DW? – From Spark via pyodbc/JDBC, the ADF Copy activity, SQL DW PolyBase if we have created external tables in Databricks, or the Azure SQL library for Python

• Any challenges faced at any point of time? – Yes (two of the workarounds are sketched after this list)

 Databricks Excel file loading --> the third-party library was not performing well, so we used pandas and converted the pandas DataFrame to a Spark DataFrame (spark.createDataFrame)
 Data copy from Azure Databricks to SQL Server was slow --> scaled up the DB, added a batch size in JDBC, used the SQL library and pushed data using bulk copy
 Grouping ADF activities to run in parallel --> used a ForEach loop and supplied only one value to the ForEach
 Library installation on the HDInsight cluster --> no built-in option; used "Script actions" from the UI
 ADF trigger was not considering default parameters --> passed the params manually from the trigger
 ADF Databricks activity was executing before the library installation on Databricks --> added a wait in code as a workaround and communicated the issue to the platform team
 dbutils was not recognized inside the library --> accepted dbutils from the caller as an input parameter

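Two of the workarounds above, sketched together with assumed file, server and table names: load the Excel file with pandas, convert it to a Spark DataFrame, then push it to Azure SQL over JDBC with an explicit batch size.

```python
import pandas as pd

# pandas handles the Excel file; Spark takes over for the write
pdf = pd.read_excel("/dbfs/mnt/raw/finance/budget.xlsx", sheet_name="FY24")  # placeholder path
sdf = spark.createDataFrame(pdf)

(sdf.write.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.budget")
    .option("user", "etl_user")
    .option("password", dbutils.secrets.get("kv-backed-scope", "sql-db-password"))
    .option("batchsize", 10000)   # larger batches reduce round trips
    .mode("append")
    .save())
```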
MISCELLANEOUS QUESTIONS
4
• Feed files arrive on a monthly basis – which trigger do you suggest, scheduled or event based? --> If the feed arrival schedule is defined, then a scheduled trigger. If the requirement is near real time, or there is no schedule for data arrival, use an event-based trigger.

• Do you know cost-saving techniques?
  -- Auto shutdown / scale the DBX cluster up or down
  -- Pause SQL DW; create data marts
  -- Set DIU in ADF to Auto
  -- Avoid unnecessary movement of large data files
  -- Prefer Azure Functions over Logic Apps; Logic Apps is costlier than Azure Functions.

• How do you read a file from Blob storage and write it back, treating it as a file I/O operation, from Azure Databricks? – Using the Storage SDK for Python (see the sketch below)

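A minimal sketch with the azure-storage-blob SDK; the connection string, container and blob names are placeholders.

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("raw")

# Read a blob into memory as bytes
data = container.get_blob_client("inbound/customers.csv").download_blob().readall()

# ... transform in plain Python if needed, then write the result back as a new blob
container.get_blob_client("processed/customers.csv").upload_blob(data, overwrite=True)
```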
Keep a note of each challenge faced during the execution of the project.


