Azure - Implementation Notes
Azure Storage:
Azure Storage is a Microsoft-managed cloud storage service that is highly available, secure,
durable, scalable, and redundant. Azure Storage includes Azure Blobs (objects), Azure Data Lake Storage
Gen2, Azure Files, Azure Queues, and Azure Tables.
An Azure account refers to the Azure billing account ---> mapped to the email ID that you used
to sign up for Azure ---> an account can contain multiple subscriptions; each of these
subscriptions can have multiple resource groups, and the resource groups, in turn, can have
multiple resources.
---> Billing is done at the level of subscriptions.
Basics:
1.) Subscription (the number of storage accounts you can create per region per subscription is
limited; 250 by default)
2.) Resource group (A Resource group is a container that holds related resources for an Azure
solution)
3.) Storage account name (Globally Unique)
4.) Region (proximity to users, compliance requirements, redundancy and disaster recovery,
pricing, service availability in the region, and network performance between your applications and
the chosen region. Review the SLAs for Azure Storage services in different regions)
Azure Data Engineer
Advanced:
1.) Require secure transfer for REST API operations ---> when enabled, REST API requests must use
HTTPS (requests over plain HTTP are rejected), so traffic is secured with SSL/TLS encryption
2.) Allow enabling public access on individual containers-----> By default, containers within a storage
account are private. Enabling this option allows you to grant public access to specific containers if
needed.
3.) Enable storage account key access----> allows you to access the storage account using the
account keys
4.) Default to Azure Active Directory authorization in the Azure portal---> allows you to use Azure
Active Directory (AD) for authentication and authorization instead of storage account keys. It
provides more secure and granular access control to your storage account resources.
5.) Minimum TLS version ---> the lowest Transport Layer Security version the account will accept.
Choosing a higher version ensures stronger encryption and better security.
6.) Enable hierarchical namespace ---> turns the account into an Azure Data Lake Storage Gen2
account, enabling directory operations and POSIX-style access control lists
7.) ACCESS PROTOCOLS - Enable SFTP and network file system v3----> Enabling these protocols
allows you to access your storage account using SFTP (Secure File Transfer Protocol) and NFS
(Network File System) v3.
8.) BLOB STORAGE - Allow cross-tenant replication, and Access tier
9.) AZURE FILES - Enable Large File Shares
Networking
Network routing
Microsoft network routing keeps traffic between your clients and Azure Storage on the
Microsoft global network for as long as possible, while Internet routing directs traffic
over the public internet.
Data Protection
Tracking:
Enable versioning for blobs---> Use versioning to automatically maintain previous versions of your
blobs.
Enable blob change feed ---> Keep track of create, modification, and delete changes to blobs in your
account.
Access control:
- Enable version-level immutability support ---> allows you to set a time-based retention policy at the
account level that will apply to all blob versions. Enable this feature to set a default policy at the
account level. Without enabling it, you can still set a default policy at the container level or set
policies for specific blob versions. Versioning is required for this property to be enabled.
Encryption:
Customer-managed key (CMK) support can be limited to blob service and file service only,
or to all service types. After the storage account is created, this support cannot be
changed.
Partitioning:
1. Choose a partition key: Determine a partition key based on the characteristics of your data, such
as customer ID, date, or geographical location. This key will be used to distribute your data across
different partitions.
2. Select a partitioning scheme: Azure provides two partitioning schemes: partition by range and
partition by hash. Partition by range is suitable when you have sequential or time-based data.
Partition by hash is useful when you want to distribute data uniformly across partitions.
3. Define the partitioning strategy: Implement the chosen partitioning scheme by creating a
partition map. This map specifies the partition key, the partition boundaries (in the case of range
partitioning), and the number of partitions (in the case of hash partitioning).
4. Distribute the data: When writing data to Azure, include the partition key in the data. Azure will
use this key to determine the appropriate partition for storing the data.
When designing a partition strategy for files, the partition key and the partition logic are dependent on
one another. For example, if we take Create Date as the partition key, then the partition logic needs to
adhere to this partition key in order to store the files in the correct partition.
def get_partition_key(date):
    if "2020-01-01" <= date <= "2020-06-30":
        return "Partition A"
    elif "2020-07-01" <= date <= "2020-12-31":
        return "Partition B"
    else:
        return "Invalid Date Range"

# Example usage
file_date = "2020-05-15"
partition_key = get_partition_key(file_date)
print(partition_key)  # Output: Partition A
import hashlib

def get_partition_key(file_name, num_partitions=4):
    # Generate a hash value for the file name
    hash_value = hashlib.md5(file_name.encode()).hexdigest()
    # Map the hash to one of the partitions (num_partitions is illustrative)
    partition_key = f"Partition {int(hash_value, 16) % num_partitions}"
    return partition_key

def access_file(file_name):
    partition_key = get_partition_key(file_name)
    # read_from_partition is a placeholder for the actual storage lookup
    file_content = read_from_partition(partition_key, file_name)
    return file_content
There are three main types of partition strategies for analytical workloads. These are listed here:
Horizontal partitioning
In a horizontal partition, we divide the table data horizontally, and subsets of rows are stored in
different data stores. Each of these subsets of rows (with the same schema as the parent table)
are called shards. Essentially, each of these shards is stored in different database instances.
NOTE
Don't try to balance the data to be evenly distributed across partitions unless specifically
required by your use case because usually, the most recent data will get accessed more
than older data. Thus, the partitions with recent data will end up becoming bottlenecks
due to high data access.
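The idea can be sketched in a few lines of Python (the shard names and the customer_id routing rule below are illustrative assumptions, not a specific Azure API):

```python
# Horizontal partitioning (sharding): rows with the same schema are split
# across separate stores, here simulated as in-memory lists.
shards = {"shard_0": [], "shard_1": [], "shard_2": []}

def route_row(row, num_shards=3):
    # Route each row to a shard based on its customer_id (the shard key)
    shard_name = f"shard_{row['customer_id'] % num_shards}"
    shards[shard_name].append(row)
    return shard_name

rows = [
    {"customer_id": 1, "amount": 100},
    {"customer_id": 2, "amount": 250},
    {"customer_id": 4, "amount": 75},
]
for r in rows:
    route_row(r)

print({name: len(contents) for name, contents in shards.items()})
```

Note that the modulo routing rule here distributes by key, not by recency, so the hot-partition caveat in the note above still applies when the key correlates with time.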
Vertical partitioning
In a vertical partition, we divide the data vertically, and each subset of the columns is stored
separately in a different data store. This is ideal for column-oriented data stores such as HBase,
Cosmos DB, and so on.
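A minimal vertical-partitioning sketch (the store names and the column split are assumptions for illustration) shows each subset of columns stored separately and rejoined by row id:

```python
# Vertical partitioning: each subset of columns lives in a separate store,
# here simulated as two dictionaries keyed by row id.
row = {"id": 7, "name": "Asha", "email": "asha@example.com",
       "page_views": 1024, "last_login": "2020-05-15"}

# Frequently read profile columns go to one store,
# high-churn metrics columns to another; both keep the row id as the key.
profile_store = {row["id"]: {"name": row["name"], "email": row["email"]}}
metrics_store = {row["id"]: {"page_views": row["page_views"],
                             "last_login": row["last_login"]}}

# Reassembling the full row requires reading both stores by id
full_row = {"id": 7, **profile_store[7], **metrics_store[7]}
print(full_row)
```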
Functional partitioning
Functional partitions are similar to vertical partitions, except that here, we store entire tables or
entities in different data stores. They can be used to segregate data belonging to different
organizations, frequently used tables from infrequently used ones, read-write tables from read-
only ones, sensitive data from general data, and so on.
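As a sketch of functional partitioning (the table names and store assignments are hypothetical), entire tables are mapped to different stores:

```python
# Functional partitioning: whole tables/entities are assigned to different
# stores based on how they are used. Assignments below are illustrative.
store_for_table = {
    "invoices": "sensitive_store",   # sensitive financial data
    "audit_log": "readonly_store",   # read-only data
    "sessions": "hot_store",         # frequently used table
}

def store_for(table_name):
    # Tables without a dedicated assignment fall back to a general store
    return store_for_table.get(table_name, "general_store")

print(store_for("invoices"))
print(store_for("products"))
```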
- Design effective folder structures to improve the efficiency of data reads and writes.
- Partition data such that a significant amount of data can be pruned while running queries.
- File sizes in the range of 256 megabytes (MB) to 100 gigabytes (GB) perform really well
with analytical engines such as HDInsight and Azure Synapse. So, aggregate files into this
range before running the analytical engines on them.
- For I/O-intensive jobs, try to keep the I/O buffer sizes in the optimal range of 4 to 16
MB; anything too big or too small becomes inefficient.
- Run more containers or executors per virtual machine (VM) (such as Apache Spark
executors or Apache Yet Another Resource Negotiator (YARN) containers).
1. List business-critical queries, the most frequently run queries, and the slowest queries.
2. Check the query plans for each of these queries using the EXPLAIN keyword and see the
amount of data being used at each stage (we will be learning about how to view query
plans in the later chapters).
3. Identify the joins or filters that are taking the most time. Identify the corresponding data
partitions.
4. Try to split the corresponding input data partitions into smaller partitions, or change the
application logic to perform isolated processing on top of each partition and later merge
only the filtered data.
5. You could also try to see if other partitioning keys would work better and if you need to
repartition the data to get better job performance for each partition.
6. If any particular partitioning technology doesn't work, you can explore having more than
one piece of partitioning logic—for example, you could apply horizontal partitioning
within functional partitioning, and so on.
7. Monitor the partitioning regularly to check if the application access patterns are balanced
and well distributed. Try to identify hot spots early on.
8. Iterate this process until you hit the preferred query execution time.
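Step 4 above can be sketched as follows (the partition contents and the split factor are illustrative assumptions):

```python
# Split one oversized ("hot") input partition into smaller sub-partitions so
# that queries touching it can process less data per unit of work.
hot_partition = list(range(100))  # stand-in for a skewed partition's rows

def split_partition(rows, num_subpartitions):
    # Distribute rows round-robin into smaller sub-partitions
    subs = [[] for _ in range(num_subpartitions)]
    for i, row in enumerate(rows):
        subs[i % num_subpartitions].append(row)
    return subs

sub_partitions = split_partition(hot_partition, 4)
print([len(p) for p in sub_partitions])  # [25, 25, 25, 25]
```

After the split, each sub-partition can be processed in isolation and only the filtered results merged, as described in step 4.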
A dedicated SQL pool is a massively parallel processing (MPP) system that splits the queries
into 60 parallel queries and executes them in parallel. Each of these smaller queries runs on
something called a distribution. A distribution is a basic unit of processing and storage for a
dedicated SQL pool. There are three different ways to distribute (shard) data among
distributions, as listed here:
Round-robin tables
Hash tables
Replicated tables
Partitioning is supported on all the distribution types in the preceding list. Apart from the
distribution types, a dedicated SQL pool also supports three types of tables: clustered
columnstore, clustered index, and heap tables. Partitioning is supported in all of these types of
tables, too.
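A plain-Python simulation (not the dedicated SQL pool's internal algorithm) illustrates how round-robin and hash distribution differ:

```python
import hashlib

NUM_DISTRIBUTIONS = 60  # a dedicated SQL pool always uses 60 distributions

def round_robin_distribution(row_index):
    # Round-robin: rows are spread evenly in arrival order,
    # regardless of their content
    return row_index % NUM_DISTRIBUTIONS

def hash_distribution(key):
    # Hash: the same key always lands in the same distribution, which keeps
    # matching rows together for joins and aggregations on that key
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_DISTRIBUTIONS

print(round_robin_distribution(0), round_robin_distribution(61))
print(hash_distribution("customer_42") == hash_distribution("customer_42"))
```

Replicated tables (the third option) simply keep a full copy of the table on every compute node, so no routing function is needed.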
In a dedicated SQL pool, data is already distributed across its 60 distributions, so we need to be
careful in deciding if we need to further partition the data. The clustered columnstore tables work
optimally when the number of rows per table in a distribution is around 1 million.
For example, if we plan to partition the data further by the months of a year, we are talking about
12 partitions x 60 distributions = 720 sub-divisions. Each of these divisions needs to have at least
1 million rows; in other words, the table (usually a fact table) will need to have more than 720
million rows. So, we will have to be careful to not over-partition the data when it comes to
dedicated SQL pools.
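The sizing arithmetic above can be checked directly; the 1 million rows per division is the clustered columnstore guideline quoted earlier:

```python
# Minimum table size before monthly partitioning makes sense
# in a dedicated SQL pool, per the rule of thumb above.
distributions = 60                  # fixed for a dedicated SQL pool
partitions = 12                     # one partition per month
min_rows_per_division = 1_000_000   # clustered columnstore guideline

divisions = partitions * distributions
min_table_rows = divisions * min_rows_per_division
print(divisions)       # 720
print(min_table_rows)  # 720000000
```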
As we have learned in the previous chapter, we can partition data according to our requirements
—such as performance, scalability, security, operational overhead, and so on—but there is
another reason why we might end up partitioning our data, and that is the various I/O bandwidth
limits that are imposed at subscription levels by Azure. These limits apply to both Blob storage
and ADLS Gen2.
The rate at which we ingest data into an Azure Storage system is called the ingress rate, and
the rate at which we move the data out of the Azure Storage system is called the egress rate.
Resource limits:
- Maximum number of storage accounts with standard endpoints per region per subscription,
including standard and premium storage accounts: 250 by default, 500 by request
- Maximum number of storage accounts with Azure DNS zone endpoints (preview) per region per
subscription, including standard and premium storage accounts: 5000 (preview)
- Default maximum storage account capacity: 5 PiB
- Maximum number of blob containers, blobs, file shares, tables, queues, entities, or messages
per storage account: No limit
- Default maximum request rate per storage account: 20,000 requests per second
- Default maximum ingress per general-purpose v2 and Blob storage account in the following
regions (Australia East, Central US, East Asia, East US 2, Japan East, Korea Central, North
Europe, South Central US, Southeast Asia, UK South, West Europe, West US): 60 Gbps
- Default maximum ingress per general-purpose v2 and Blob storage account in the following
regions (Australia East, Central US, East US, East US 2, Japan East, North Europe, South
Central US, Southeast Asia, UK South, West Europe, West US 2): 60 Gbps
- Default maximum ingress per general-purpose v2 and Blob storage account in all other
regions: 25 Gbps
- Default maximum egress for general-purpose v2 and Blob storage accounts in the following
regions (Australia East, Central US, East Asia, East US 2, Japan East, Korea Central, North
Europe, South Central US, Southeast Asia, UK South, West Europe, West US): 120 Gbps
- Default maximum egress for general-purpose v2 and Blob storage accounts: 120 Gbps
RDDs (Resilient Distributed Datasets) are immutable, fault-tolerant collections of data objects
that can be operated on in parallel by Spark.