Azure Data Fundamentals
Over the last decade, the amount of data that systems and devices generate has increased significantly. Because of this
increase, new technologies, roles, and approaches to working with data are affecting data professionals. In many industries,
data professionals want to understand better how these changes affect both their careers and their daily working lives.
To generate value, anyone working with data needs to understand how the data landscape has changed and how roles
and technologies are evolving. You should be able to explain this shift to any stakeholder. Learn how to clearly describe
the key factors that are driving the changes and how an organization can benefit from embracing the changes.
Learning objectives
In this module you will:
Learn the key factors that are driving changes in data generation, roles, and technologies.
Compare the differences between on-premises data technologies and cloud data technologies.
Outline how the role of the data professional is changing in organizations.
Identify use cases that involve these changes.
Data abundance
Over the last 30 years, we've seen an exponential increase in the number of devices and software that generate data to
meet current business and user needs. Businesses store, interpret, manage, transform, process, aggregate, and report this
data to interested parties. These parties include internal management, investors, business partners, regulators, and
consumers.
Data consumers view data on PCs, tablets, and mobile devices that are either connected or disconnected. Consumers both
generate and use data. They do this in the workplace and during leisure time with social media applications. Business
stakeholders use data to make business decisions, and consumers use data to make decisions such as what to buy. Thanks to AI, services such as Azure Machine Learning can now both consume data and make decisions in much the same way that humans do.
Data forms include text, stream, audio, video, and metadata. Data can be structured, unstructured, or aggregated. For
structured databases, data architects define the structure (schema) as they create the data storage in platform
technologies such as Azure SQL Database and Azure SQL Data Warehouse. For unstructured (NoSQL) databases, each data
element can have its own schema at query time. Data can be stored as a file in Azure Blob storage or as NoSQL data in
Azure Cosmos DB or Azure HDInsight.
Data engineers must maintain data systems that are accurate, highly secure, and constantly available. The systems must
comply with applicable regulations such as GDPR (General Data Protection Regulation) and industry standards such as PCI
DSS (Payment Card Industry Data Security Standard). International companies might also have special data requirements
that conform to regional norms such as the local language and date format. Data in these systems can be located
anywhere. It can be on-premises or in the cloud, and it can be processed either in real time or in a batch.
Azure provides a comprehensive and rich set of data technologies that can store, transform, process, analyze, and visualize
a variety of data formats in a secure way. As data formats evolve, Microsoft continually releases new technologies to the
Azure platform. Azure customers can explore these new technologies in preview mode. Using the on-demand Azure
subscription model, customers can minimize costs, paying only for what they consume and only when they need it.
On-premises data technologies versus cloud data technologies
When traditional hardware and infrastructure components near the end of their life cycle, many organizations consider
digital transformation projects. Here we'll consider options for those transformations. We'll look at features of both on-
premises and cloud environments. We'll also cover the factors that businesses must consider as they explore each option.
On-premises environments
Computing environment
On-premises environments require physical equipment to execute applications and services. This equipment includes
physical servers, network infrastructure, and storage. The equipment must have power, cooling, and periodic maintenance
by qualified personnel. A server needs at least one operating system (OS) installed. It might need more than one OS if the
organization uses virtualization technology.
Licensing
Each OS that's installed on a server might have its own licensing cost. OS and software licenses are typically sold per server
or per CAL (Client Access License). As companies grow, licensing arrangements become more restrictive.
Maintenance
On-premises systems require maintenance for the hardware, firmware, drivers, BIOS, operating system, software, and
antivirus software. Organizations try to reduce the cost of this maintenance where it makes sense.
Scalability
When administrators can no longer scale up a server, they can instead scale out their operations. To scale an on-premises
server horizontally, server administrators add another server node to a cluster. Clustering uses either a hardware load
balancer or a software load balancer to distribute incoming network requests to a node of the cluster.
A limitation of server clustering is that the hardware for each server in the cluster must be identical. So when the server
cluster reaches maximum capacity, a server administrator must replace or upgrade each node in the cluster.
Availability
High-availability systems must be available most of the time. Service-level agreements (SLAs) specify your organization's
availability expectations.
System uptime can be expressed as three nines, four nines, or five nines. These expressions indicate system uptimes of 99.9
percent, 99.99 percent, or 99.999 percent. To calculate system uptime in terms of hours, multiply these percentages by the
number of hours in a year (8,760).
Uptime level | Uptime hours per year | Downtime hours per year
Three nines (99.9%) | 8,751.24 | 8.76
Four nines (99.99%) | 8,759.12 | 0.88
Five nines (99.999%) | 8,759.91 | 0.09
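The downtime figures follow directly from the stated arithmetic. If you want to reproduce them, the following Transact-SQL sketch (any engine with standard arithmetic would work equally well) multiplies each uptime percentage by the 8,760 hours in a year:
SQL
-- Derive uptime and downtime hours per year for each "nines" level.
SELECT UptimePercent,
       8760 * UptimePercent / 100.0            AS UptimeHoursPerYear,
       8760 * (100.0 - UptimePercent) / 100.0  AS DowntimeHoursPerYear
FROM (VALUES (99.9), (99.99), (99.999)) AS Levels(UptimePercent);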
For on-premises servers, the more uptime the SLA requires, the higher the cost.
Support
Hundreds of vendors sell physical server hardware. This variety means server administrators might need to know how to
use many different platforms. Because of the diverse skills required to administer, maintain, and support on-premises
systems, organizations sometimes have a hard time finding server administrators to hire.
Multilingual support
In on-premises SQL Server systems, multilingual support is difficult and expensive. One issue with multiple languages is the
sorting order of text data. Different languages can sort text data differently. To address this issue, the SQL Server database
administrator must install and configure the data's collation settings. But these settings can work only if the SQL database
developers considered multilingual functionality when they were designing the system. Systems like this are complex to
manage and maintain.
The term total cost of ownership (TCO) describes the final cost of owning a given technology. In on-premises systems, TCO
includes the following costs:
Hardware
Software licensing
Labor (installation, upgrades, maintenance)
Datacenter overhead (power, telecommunications, building, heating and cooling)
It's difficult to align on-premises expenses with actual usage. Organizations buy servers that have extra capacity so they
can accommodate future growth. A newly purchased server will always have excess capacity that isn't used. When an on-
premises server is at maximum capacity, even an incremental increase in resource demand will require the purchase of
more hardware.
Because on-premises server systems are very expensive, costs are often capitalized. This means that on financial
statements, costs are spread out across the expected lifetime of the server equipment. Capitalization restricts an IT
manager's ability to buy upgraded server equipment during the expected lifetime of a server. This restriction limits the
server system's ability to accommodate increased demand.
In cloud solutions, expenses are recorded on the financial statements each month. They're monthly expenses instead of
capital expenses. Because subscriptions are a different kind of expense, the expected server lifetime doesn't limit the IT
manager's ability to upgrade to meet an increase in demand.
Cloud environments
Computing environment
Cloud computing environments provide the physical and logical infrastructure to host services, virtual servers, intelligent
applications, and containers for their subscribers. Unlike on-premises physical servers, cloud environments require no capital investment. Instead, an organization provisions services in the cloud and pays only for what it uses. Moving
servers and services to the cloud also reduces operational costs.
Within minutes, an organization can provision anything from virtual servers to clusters of containerized apps by using
Azure services. Azure automatically creates and handles all of the physical and logical infrastructure in the background. In
this way, Azure reduces the complexity and cost of creating the services.
On-premises servers store data on physical and virtual disks. On a cloud platform, storage is more generic. Diverse storage
types include Azure Blob storage, Azure Files storage, and Azure Disk Storage. Complex systems often use each type of
storage as part of their technical architecture. With Azure Disk Storage, customers can choose to have Microsoft manage
their disk storage or to pay a premium for greater control over disk allocation.
Maintenance
In the cloud, Microsoft manages many operations to create a stable computing environment. This service is part of the
Azure product benefit. Microsoft manages key infrastructure services such as physical hardware, computer networking,
firewalls and network security, datacenter fault tolerance, compliance, and physical security of the buildings. Microsoft also
invests heavily to battle cybersecurity threats, and it updates operating systems and firmware for the customer. These services allow data engineers to focus more on data engineering and on eliminating system complexity.
Scalability
Scalability in on-premises systems is complicated and time-consuming. But scalability in the cloud can be as simple as a
mouse click. Typically, scalability in the cloud is measured in compute units. Compute units might be defined differently for
each Azure product.
Availability
Azure duplicates customer content for redundancy and high availability. Many services and platforms use SLAs to ensure
that customers know the capabilities of the platform they're using.
Support
Cloud systems are easy to support because the environments are standardized. When Microsoft updates a product, the
update applies to all consumers of the product.
Multilingual support
Cloud systems often store data as a JSON file that includes the language code identifier (LCID). The LCID identifies the
language that the data uses. Apps that process the data can use translation services such as the Bing Translator API to
convert the data into an expected language when the data is consumed or as part of a process to prepare the data.
Total cost of ownership
Cloud systems like Azure track costs by subscriptions. A subscription can be based on usage that's measured in compute
units, hours, or transactions. The cost includes hardware, software, disk storage, and labor. Because of economies of scale, an on-premises system can rarely compete with the cloud on the cost of measured service usage.
The cost of operating an on-premises server system rarely aligns with the actual usage of the system. In cloud systems, the cost usually aligns more closely with the actual usage.
In some cases, however, those costs don't align. For example, an organization will be charged for a service that a cloud
administrator provisions but doesn't use. This scenario is called underutilization. Organizations can reduce the costs of
underutilization by adopting a best practice to provision production instances only after their developers are ready to
deploy an application to production. Developers can use tools like the Azure Cosmos DB emulator or Azurite to develop and test cloud applications without incurring production costs.
When moving to the cloud, many customers migrate from physical or virtualized on-premises servers to Azure Virtual
Machines. This strategy is known as lift and shift. Server administrators lift and shift an application from a physical
environment to Azure Virtual Machines without rearchitecting the application.
The lift-and-shift strategy provides immediate benefits. These benefits include higher availability, lower operational costs,
and the ability to transfer workloads from one datacenter to another. The disadvantage is that the application can't take
advantage of the many features available within Azure.
Consider using the migration as an opportunity to transform your business practices by creating new versions of your
applications and databases. Your rearchitected application can take advantage of Azure offerings such as Cognitive
Services, Bot Service, and machine learning capabilities.
Understand job responsibilities
Your skills need to evolve from managing on-premises database server systems, such as SQL Server, to managing cloud-
based data systems. If you're a SQL Server professional, over time you'll focus less on SQL Server and more on data in
general. You'll be a data engineer.
SQL Server professionals generally work only with relational database systems. Data engineers also work with unstructured
data and a wide variety of new data types, such as streaming data.
To master data engineering, you'll need to learn a new set of tools, architectures, and platforms. As a SQL Server
professional, your primary data manipulation tool might be T-SQL. As a data engineer you might use additional
technologies like Azure HDInsight and Azure Cosmos DB. To manipulate the data in big-data systems, you might use
languages such as HiveQL or Python.
One significant change is in how data is processed. Traditionally, data has been extracted from source systems, transformed, and then loaded into a destination store, an approach known as extract, transform, and load (ETL). A disadvantage of the ETL approach is that the transformation stage can take a long time and can potentially tie up source system resources.
An alternative approach is extract, load, and transform (ELT). In ELT, the data is immediately extracted and loaded into a
large data repository such as Azure Cosmos DB or Azure Data Lake Storage. This change in process reduces the resource
contention on source systems. Data engineers can begin transforming the data as soon as the load is complete.
ELT also has more architectural flexibility to support multiple transformations. For example, how the marketing department
needs to transform the data can be different than how the operations department needs to transform that same data.
Azure reduces the complexity of building and deploying servers. As a data engineer, you'll use a web user interface for
simple deployments. For more complex deployments, you can create and automate powerful scripts. In less time than it
takes you to read this module, you can set up a database that's globally distributed, sophisticated, and highly available.
You spend less time setting up services, and you focus more on security and on deriving business value from your data.
Use cases for the cloud
Azure can work for a range of industries, including the web, healthcare, and Internet of Things (IoT). Let's explore how
Azure can make a difference in these industries.
Web
As a data engineer, use the Azure Cosmos DB multimaster replication model to create a data architecture that supports
web and mobile applications. Thanks to Microsoft performance commitments, these applications can achieve a response
time of less than 10 ms anywhere in the world. By reducing the processing time of their websites, global organizations can
increase customer satisfaction.
Healthcare
In the healthcare industry, use Azure Databricks to accelerate big-data analytics and AI solutions. Apply these technologies
to genome studies or pharmacy sales forecasting at a petabyte scale. Using Databricks features, you can set up your Spark
environment in minutes and autoscale quickly and easily.
Using Azure, you can collaborate with data scientists on shared projects and workspaces in a wide range of languages,
including SQL, R, Scala, and Python. Because of native integration with Azure Active Directory and other Azure services, you
can build diverse solution types. For example, build a modern data warehouse or machine learning and real-time analytics
solutions.
IoT solutions
Over the last couple of years, hundreds of thousands of devices have been produced to generate sensor data. These are
known as IoT devices.
Using technologies like Azure IoT Hub, you can design a data solution architecture that captures information from IoT
devices so that the information can be analyzed.
Summary
In this module, we looked at how the world of data is evolving. We explored how these changes affect data professionals. We also
discussed the differences between on-premises and cloud data solutions, and we provided a few use cases that apply cloud solutions.
Module 2: Explore core data concepts
Introduction
Over the last few decades, the amount of data generated by systems, applications, and devices has increased significantly.
Data is everywhere, in a multitude of structures and formats.
Data is now easier to collect and cheaper to store, making it accessible to nearly every business. Data solutions include
software technologies and platforms that can help facilitate the collection, analysis, and storage of valuable information.
Every business would like to grow their revenues and make larger profits. In this competitive market, data is a valuable
asset. When analyzed properly, data provides a wealth of useful information and informs critical business decisions.
The capability to capture, store, and analyze data is a core requirement for every organization in the world. In this module,
you'll learn about options for representing and storing data, and about typical data workloads. By completing this module,
you'll build the foundation for learning about the techniques and services used to work with data.
Learning objectives
In this module you will learn how to:
Identify common formats for representing data.
Describe options for storing data in files and in databases.
Describe characteristics of transactional and analytical data workloads.
Identify data formats
Data is a collection of facts such as numbers, descriptions, and observations used to record information. Data structures in which this data is organized often represent entities that are important to an organization (such as customers, products,
sales orders, and so on). Each entity typically has one or more attributes, or characteristics (for example, a customer might
have a name, an address, a phone number, and so on).
Structured data
Structured data is data that adheres to a fixed schema, so all of the data has the same fields or properties. Most commonly,
the schema for structured data entities is tabular - in other words, the data is represented in one or more tables that
consist of rows to represent each instance of a data entity, and columns to represent attributes of the entity. For example, a Customer table and a Product table might each hold one row per entity instance, with columns for attributes such as the customer's name and address or the product's name and price.
Structured data is often stored in a database in which multiple tables can reference one another by using key values in
a relational model, which we'll explore in more depth later.
Semi-structured data
Semi-structured data is information that has some structure, but which allows for some variation between entity instances.
For example, while most customers may have an email address, some might have multiple email addresses, and some
might have none at all.
One common format for semi-structured data is JavaScript Object Notation (JSON). The example below shows a pair of
JSON documents that represent customer information. Each customer document includes address and contact
information, but the specific fields vary between customers.
JSON
// Customer 1
{
"firstName": "Joe",
"lastName": "Jones",
"address":
{
"streetAddress": "1 Main St.",
"city": "New York",
"state": "NY",
"postalCode": "10099"
},
"contact":
[
{
"type": "home",
"number": "555 123-1234"
},
{
"type": "email",
"address": "[email protected]"
}
]
}
// Customer 2
{
"firstName": "Samir",
"lastName": "Nadoy",
"address":
{
"streetAddress": "123 Elm Pl.",
"unit": "500",
"city": "Seattle",
"state": "WA",
"postalCode": "98999"
},
"contact":
[
{
"type": "email",
"address": "[email protected]"
}
]
}
Note
JSON is just one of many ways in which semi-structured data can be represented. The point here is not to provide a
detailed examination of JSON syntax, but rather to illustrate the flexible nature of semi-structured data representations.
Unstructured data
Not all data is structured or even semi-structured. For example, documents, images, audio and video data, and binary files
might not have a specific structure. This kind of data is referred to as unstructured data.
Data stores
Organizations typically store data in structured, semi-structured, or unstructured format to record details of entities (for
example, customers and products), specific events (such as sales transactions), or other information in documents, images,
and other formats. The stored data can then be retrieved for analysis and reporting later.
There are two broad categories of data store in common use: file stores and databases.
File stores
The ability to store data in files is a core element of any computing system. Files can be stored in local file systems on the
hard disk of your personal computer, and on removable media such as USB drives; but in most organizations, important
data files are stored centrally in some kind of shared file storage system. Increasingly, that central storage location is
hosted in the cloud, enabling cost-effective, secure, and reliable storage for large volumes of data.
The specific file format used to store data depends on a number of factors, including the type of data being stored (structured, semi-structured, or unstructured), the applications and services that need to read, write, and process it, and whether the files need to be readable by humans or optimized for efficient storage and processing.
Delimited text files
Data is often stored in plain text format with specific field delimiters and row terminators. The most common format is comma-separated values (CSV), in which fields are separated by commas and rows are terminated by a carriage return / new line. For example, the following CSV data defines two customers:
CSV
FirstName,LastName,Email
Joe,Jones,[email protected]
Samir,Nadoy,[email protected]
JavaScript Object Notation (JSON)
JSON is a flexible format in which a hierarchical document schema is used to define data entities (objects) that have multiple attributes. The following example shows a JSON document containing a collection of customers. Each customer has three attributes
(firstName, lastName, and contact), and the contact attribute contains a collection of objects that represent one or more
contact methods (email or phone). Note that objects are enclosed in braces ({..}) and collections are enclosed in square
brackets ([..]). Attributes are represented by name : value pairs and separated by commas (,).
JSON
{
"customers":
[
{
"firstName": "Joe",
"lastName": "Jones",
"contact":
[
{
"type": "home",
"number": "555 123-1234"
},
{
"type": "email",
"address": "[email protected]"
}
]
},
{
"firstName": "Samir",
"lastName": "Nadoy",
"contact":
[
{
"type": "email",
"address": "[email protected]"
}
]
}
]
}
Extensible Markup Language (XML)
XML is a human-readable data format that was popular in the 1990s and 2000s. It's largely been superseded by the less
verbose JSON format, but there are still some systems that use XML to represent data. XML uses tags enclosed in angle-
brackets (<../>) to define elements and attributes, as shown in this example:
XML
<Customers>
<Customer name="Joe" lastName="Jones">
<ContactDetails>
<Contact type="home" number="555 123-1234"/>
<Contact type="email" address="[email protected]"/>
</ContactDetails>
</Customer>
<Customer name="Samir" lastName="Nadoy">
<ContactDetails>
<Contact type="email" address="[email protected]"/>
</ContactDetails>
</Customer>
</Customers>
Binary Large Object (BLOB)
Ultimately, all files are stored as binary data, but in the human-readable formats described above, the bytes map to printable characters. Some file formats store data as raw binary that must be interpreted by applications and rendered, such as images, video, audio, and application-specific documents. When working with data like this, data professionals often refer to the data files as BLOBs (Binary Large Objects).
Optimized file formats
While human-readable formats for structured and semi-structured data can be useful, they're typically not optimized for
storage space or processing. Over time, some specialized file formats that enable compression, indexing, and efficient
storage and processing have been developed.
Some common optimized file formats you might see include Avro, ORC, and Parquet:
Avro is a row-based format. It was created by Apache. Each record contains a header that describes the structure of
the data in the record. This header is stored as JSON. The data is stored as binary information. An application uses the
information in the header to parse the binary data and extract the fields it contains. Avro is a good format for
compressing data and minimizing storage and network bandwidth requirements.
ORC (Optimized Row Columnar format) organizes data into columns rather than rows. It was developed by Hortonworks to optimize read and write operations in Apache Hive (Hive is a data warehouse system that
supports fast data summarization and querying over large datasets). An ORC file contains stripes of data. Each stripe
holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each
row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file contains row groups.
Data for each column is stored together in the same row group. Each row group contains one or more chunks of
data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this
metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns
for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient
compression and encoding schemes.
Explore databases
A database is a central system in which data can be stored and queried. In a simplistic sense, the file system
on which files are stored is a kind of database; but when we use the term in a professional data context, we usually mean a
dedicated system for managing data records rather than files.
Relational databases
Relational databases are commonly used to store and query structured data. The data is stored in tables that represent
entities, such as customers, products, or sales orders. Each instance of an entity is assigned a primary key that uniquely
identifies it; and these keys are used to reference the entity instance in other tables. For example, a customer's primary key
can be referenced in a sales order record to indicate which customer placed the order. This use of keys to reference data
entities enables a relational database to be normalized; which in part means the elimination of duplicate data values so
that, for example, the details of an individual customer are stored only once; not for each sales order the customer places.
The tables are managed and queried using Structured Query Language (SQL), which is based on an ANSI standard, so it's
similar across multiple database systems.
Non-relational databases
Non-relational databases are data management systems that don't apply a relational schema to the data. Non-relational databases are often referred to as NoSQL databases, even though some support a variant of the SQL language. There are four common types of non-relational database:
Key-value databases in which each record consists of a unique key and an associated value, which can be in any
format.
Document databases, which are a specific form of key-value database in which the value is a JSON document (which
the system is optimized to parse and query)
Column family databases, which store tabular data comprising rows and columns, but you can divide the columns
into groups known as column-families. Each column family holds a set of columns that are logically related together.
Graph databases, which store entities as nodes with links to define relationships between them.
Explore transactional data processing
A transactional data processing system is what most people consider the primary function of business computing. A
transactional system records transactions that encapsulate specific events that the organization wants to track. A
transaction could be financial, such as the movement of money between accounts in a banking system, or it might be part
of a retail system, tracking payments for goods and services from customers. Think of a transaction as a small, discrete unit of work.
Transactional systems are often high-volume, sometimes handling many millions of transactions in a single day. The data
being processed has to be accessible very quickly. The work performed by transactional systems is often referred to as
Online Transactional Processing (OLTP).
OLTP solutions rely on a database system in which data storage is optimized for both read and write operations in order to
support transactional workloads in which data records are created, retrieved, updated, and deleted (often referred to
as CRUD operations). These operations are applied transactionally, in a way that ensures the integrity of the data stored in
the database. To accomplish this, OLTP systems enforce transactions that support so-called ACID semantics:
Atomicity – each transaction is treated as a single unit, which succeeds completely or fails completely. For example, a
transaction that involves debiting funds from one account and crediting the same amount to another account must
complete both actions. If either action can't be completed, then the other action must fail.
Consistency – transactions can only take the data in the database from one valid state to another. To continue the
debit and credit example above, the completed state of the transaction must reflect the transfer of funds from one
account to the other.
Isolation – concurrent transactions cannot interfere with one another, and must result in a consistent database state.
For example, while the transaction to transfer funds from one account to another is in-process, another transaction
that checks the balance of these accounts must return consistent results - the balance-checking transaction can't
retrieve a value for one account that reflects the balance before the transfer, and a value for the other account that
reflects the balance after the transfer.
Durability – when a transaction has been committed, it will remain committed. After the account transfer transaction
has completed, the revised account balances are persisted so that even if the database system were to be switched
off, the committed transaction would be reflected when it is switched on again.
OLTP systems are typically used to support live applications that process business data - often referred to as line of
business (LOB) applications.
Explore analytical data processing
Analytical data processing typically uses read-only (or read-mostly) systems that store vast volumes of historical data or
business metrics. Analytics can be based on a snapshot of the data at a given point in time, or a series of snapshots.
The specific details for an analytical processing system can vary between solutions, but a common architecture for analytical processing includes the following steps:
1. Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis.
2. Data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular abstractions over files
in the data lake, or a data warehouse with a fully relational SQL engine.
3. Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model,
or cube. Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from
dimension tables. For example, sales revenue might be totaled by date, customer, and product.
4. The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations,
and dashboards.
Data lakes are common in large-scale data analytical processing scenarios, where a large volume of file-based data must
be collected and analyzed.
Data warehouses are an established way to store data in a relational schema that is optimized for read operations –
primarily queries to support reporting and data visualization. Data Lakehouses are a more recent innovation that combine
the flexible and scalable storage of a data lake with the relational querying semantics of a data warehouse. The table
schema may require some denormalization of data in an OLTP data source (introducing some duplication to make queries
perform faster).
An OLAP model is an aggregated type of data storage that is optimized for analytical workloads. Data aggregations are
across dimensions at different levels, enabling you to drill up/down to view aggregations at multiple hierarchical levels; for
example to find total sales by region, by city, or for an individual address. Because OLAP data is pre-aggregated, queries to
return the summaries it contains can be run quickly.
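To make the idea of pre-aggregation concrete, the following SQL sketch shows the kind of summary an OLAP model computes in advance. The FactSales and DimCustomer tables and their columns are hypothetical names used only for illustration:
SQL
-- Total sales revenue by region and city: the sort of aggregate an OLAP
-- model pre-computes so that drill-up/drill-down queries return quickly.
SELECT d.Region,
       d.City,
       SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
JOIN DimCustomer AS d
    ON f.CustomerKey = d.CustomerKey
GROUP BY d.Region, d.City;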
Different types of user might perform data analytical work at different stages of the overall architecture. For example:
Data scientists might work directly with data files in a data lake to explore and model data.
Data Analysts might query tables directly in the data warehouse to produce complex reports and visualizations.
Business users might consume pre-aggregated data in an analytical model in the form of reports or dashboards.
Summary
Data is at the core of most software applications and solutions. It can be represented in many formats, stored in files and
databases, and used to record transactions or to support analysis and reporting.
Next steps
Now that you've learned about some core data concepts, consider learning more about data-related workloads on
Microsoft Azure by pursuing a Microsoft certification in Azure Data Fundamentals.
Module 3: Explore data roles and services
Introduction
Over the last decade, the amount of data that systems and devices generate has increased significantly. Because of this
increase, new technologies, roles, and approaches to working with data are affecting data professionals. Data professionals
typically fulfill different roles when managing, using, and controlling data. In this module, you'll learn about the various
roles that organizations often apply to data professionals, the tasks and responsibilities associated with these roles, and
the Microsoft Azure services used to perform them.
Learning objectives
In this module you will learn how to:
Identify common data professional roles.
Identify common cloud services used by data professionals.
Explore job roles in the world of data
There's a wide variety of roles involved in managing, controlling, and using data. Some roles are business-oriented, some
involve more engineering, some focus on research, and some are hybrid roles that combine different aspects of data
management. Your organization may define roles differently, or give them different names, but the roles described in this
unit encapsulate the most common division of tasks and responsibilities.
The three key job roles that deal with data in most organizations are:
Database administrators manage databases, assigning permissions to users, storing backup copies of data, and restoring data in the event of a failure.
Data engineers manage infrastructure and processes for data integration across the organization, applying data
cleaning routines, identifying data governance rules, and implementing pipelines to transfer and transform data
between systems.
Data analysts explore and analyze data to create visualizations and charts that enable organizations to make
informed decisions.
Note
The job roles define differentiated tasks and responsibilities. In some organizations, the same person might perform
multiple roles; so in their role as database administrator they might provision a transactional database, and then in their
role as a data engineer they might create a pipeline to transfer data from the database to a data warehouse for analysis.
Database Administrator
A database administrator is responsible for the design, implementation, maintenance, and operational
aspects of on-premises and cloud-based database systems. They're responsible for the overall availability and consistent
performance and optimizations of databases. They work with stakeholders to implement policies, tools, and processes for
backup and recovery plans to recover following a natural disaster or human-made error.
The database administrator is also responsible for managing the security of the data in the database, granting privileges
over the data, granting or denying access to users as appropriate.
Data Engineer
A data engineer collaborates with stakeholders to design and implement data-related workloads,
including data ingestion pipelines, cleansing and transformation activities, and data stores for analytical workloads. They
use a wide range of data platform technologies, including relational and non-relational databases, file stores, and data
streams.
They're also responsible for ensuring that data privacy is maintained, both in the cloud and in data stores that span on-premises and cloud environments. They own the management and monitoring of data pipelines to ensure that data loads
perform as expected.
Data Analyst
A data analyst enables businesses to maximize the value of their data assets. They're responsible for
exploring data to identify trends and relationships, designing and building analytical models, and enabling advanced
analytics capabilities through reports and visualizations.
A data analyst processes raw data into meaningful insights based on identified business requirements.
Note
The roles described here represent the key data-related roles found in most medium to large organizations. There are
additional data-related roles not mentioned here, such as data scientist and data architect; and there are other technical
professionals that work with data, including application developers and software engineers.
Identify data services
Microsoft Azure is a cloud platform that powers the applications and IT infrastructure for some of the world's largest
organizations. It includes many services to support cloud solutions, including transactional and analytical data workloads.
Some of the most commonly used cloud services for data are described below.
Note
This topic covers only some of the most commonly used data services for modern transactional and analytical solutions.
Additional services are also available.
Azure SQL
Azure SQL is the collective name for a family of relational database solutions based on the Microsoft SQL Server
database engine. Specific Azure SQL services include:
Azure SQL Database – a fully managed platform-as-a-service (PaaS) database hosted in Azure
Azure SQL Managed Instance – a hosted instance of SQL Server with automated maintenance, which allows more
flexible configuration than Azure SQL DB but with more administrative responsibility for the owner.
Azure SQL VM – a virtual machine with an installation of SQL Server, allowing maximum configurability with full
management responsibility.
Database administrators typically provision and manage Azure SQL database systems to support line of business (LOB)
applications that need to store transactional data.
Data engineers may use Azure SQL database systems as sources for data pipelines that perform extract, transform,
and load (ETL) operations to ingest the transactional data into an analytical system.
Data analysts may query Azure SQL databases directly to create reports, though in large organizations the data is generally
combined with data from other sources in an analytical data store to support enterprise analytics.
Azure Database for open-source relational databases
Azure includes managed services for popular open-source relational database systems, including:
Azure Database for MySQL - a simple-to-use open-source database management system that is commonly used
in Linux, Apache, MySQL, and PHP (LAMP) stack apps.
Azure Database for MariaDB - a newer database management system, created by the original developers of
MySQL. The database engine has since been rewritten and optimized to improve performance. MariaDB offers
compatibility with Oracle Database (another popular commercial database management system).
Azure Database for PostgreSQL - a hybrid relational-object database. You can store data in relational tables, but a
PostgreSQL database also enables you to store custom data types, with their own non-relational properties.
As with Azure SQL database systems, open-source relational databases are managed by database administrators to
support transactional applications, and provide a data source for data engineers building pipelines for analytical solutions
and data analysts creating reports.
Azure Cosmos DB
Azure Cosmos DB is a global-scale non-relational (NoSQL) database system that supports multiple application
programming interfaces (APIs), enabling you to store and manage data as JSON documents, key-value pairs, column-
families, and graphs.
In some organizations, Cosmos DB instances may be provisioned and managed by a database administrator; though often
software developers manage NoSQL data storage as part of the overall application architecture. Data engineers often need
to integrate Cosmos DB data sources into enterprise analytical solutions that support modeling and reporting by data
analysts.
Azure Storage
Azure Storage is a core Azure service that enables you to store data in:
Blob containers – scalable, cost-effective storage for binary files.
File shares – network file shares such as you typically find in corporate networks.
Tables – key-value storage for applications that need to read and write data values quickly.
Data engineers use Azure Storage to host data lakes – blob storage with a hierarchical namespace that enables files to be organized in folders in a distributed file system.
Azure Data Factory
Azure Data Factory is an Azure service that enables you to define and schedule data pipelines to transfer and
transform data. You can integrate your pipelines with other Azure services, enabling you to ingest data from cloud data
stores, process the data using cloud-based compute, and persist the results in another data store.
Azure Data Factory is used by data engineers to build extract, transform, and load (ETL) solutions that populate analytical
data stores with data from transactional systems across the organization.
Azure Synapse Analytics
Azure Synapse Analytics is a comprehensive, unified Platform-as-a-Service (PaaS) solution for data analytics that provides a single service interface for multiple analytical capabilities, including:
Pipelines – based on the same technology as Azure Data Factory.
SQL – a highly scalable SQL database engine, optimized for data warehouse workloads.
Apache Spark – an open-source distributed data processing system that supports multiple programming languages and APIs.
Azure Synapse Data Explorer – a high-performance analytics solution that is optimized for real-time querying of log and telemetry data.
Data engineers can use Azure Synapse Analytics to create a unified data analytics solution that combines data ingestion
pipelines, data warehouse storage, and data lake storage through a single service.
Data analysts can use SQL and Spark pools through interactive notebooks to explore and analyze data, and take advantage
of integration with services such as Azure Machine Learning and Microsoft Power BI to create data models and extract
insights from the data.
Azure Databricks
Azure Databricks is an Azure-integrated version of the popular Databricks platform, which combines the
Apache Spark data processing platform with SQL database semantics and an integrated management interface to enable
large-scale data analytics.
Data engineers can use existing Databricks and Spark skills to create analytical data stores in Azure Databricks.
Data analysts can use the native notebook support in Azure Databricks to query and visualize data in an easy-to-use web-based interface.
Azure HDInsight
Azure HDInsight is an Azure service that provides Azure-hosted clusters for popular Apache open-source big
data processing technologies, including:
Apache Spark - a distributed data processing system that supports multiple programming languages and APIs,
including Java, Scala, Python, and SQL.
Apache Hadoop - a distributed system that uses MapReduce jobs to process large volumes of data efficiently across
multiple cluster nodes. MapReduce jobs can be written in Java or abstracted by interfaces such as Apache Hive - a
SQL-based API that runs on Hadoop.
Apache HBase - an open-source system for large-scale NoSQL data storage and querying.
Apache Kafka - a message broker for data stream processing.
Data engineers can use Azure HDInsight to support big data analytics workloads that depend on multiple open-source
technologies.
Azure Stream Analytics
Azure Stream Analytics is a real-time stream processing engine that captures a stream of data from an input,
applies a query to extract and manipulate data from the input stream, and writes the results to an output for analysis or
further processing.
Data engineers can incorporate Azure Stream Analytics into data analytics architectures that capture streaming data for
ingestion into an analytical data store or for real-time visualization.
Azure Data Explorer
Azure Data Explorer is a standalone service that offers the same high-performance querying of log and telemetry data
as the Azure Synapse Data Explorer runtime in Azure Synapse Analytics.
Data analysts can use Azure Data Explorer to query and analyze data that includes a timestamp attribute, such as is
typically found in log files and Internet-of-things (IoT) telemetry data.
Microsoft Purview
Microsoft Purview provides a solution for enterprise-wide data governance and discoverability. You can use Microsoft
Purview to create a map of your data and track data lineage across multiple data sources and systems, enabling you to find
trustworthy data for analysis and reporting.
Data engineers can use Microsoft Purview to enforce data governance across the enterprise and ensure the integrity of
data used to support analytical workloads.
Microsoft Fabric
Microsoft Fabric is a unified Software-as-a-Service (SaaS) analytics platform built on an open, governed lakehouse that includes functionality to support:
Data ingestion and ETL
Data lakehouse analytics
Data warehouse analytics
Data science and machine learning
Real-time analytics
Data visualization
Data governance and management
In the early years of computing systems, every application stored data in its own unique structure. When developers
wanted to build applications to use that data, they had to know a lot about the particular data structure to find the data
they needed. These data structures were inefficient, hard to maintain, and hard to optimize for good application
performance. The relational database model was designed to solve the problem of multiple arbitrary data structures. The
relational model provides a standard way of representing and querying data that can be used by any application. One of
the key advantages of the relational database model is its use of tables, which are an intuitive, efficient, and flexible way to
store and access structured information.
The simple yet powerful relational model is used by organizations of all types and sizes for a broad variety of information
management needs. Relational databases are used to track inventories, process ecommerce transactions, manage huge
amounts of mission-critical customer information, and much more. A relational database is useful for storing any
information containing related data elements that must be organized in a rules-based, consistent structure.
In this module, you'll learn about the key characteristics of relational databases, and explore relational data structures.
Learning objectives
In this module you will learn how to:
Identify characteristics of relational data.
Define normalization.
Identify types of SQL statement.
Identify common relational database objects.
Understand relational data
In a relational database, you model collections of entities from the real world as tables. An entity can be anything for which
you want to record information; typically important objects and events. For example, in a retail system example, you might
create tables for customers, products, orders, and line items within an order. A table contains rows, and each row
represents a single instance of an entity. In the retail scenario, each row in the customer table contains the data for a single
customer, each row in the product table defines a single product, each row in the order table represents an order made by
a customer, and each row in the line item table represents a product that was included in an order.
Relational tables are a format for structured data, and each row in a table has the same columns; though in some cases, not all columns need to have a value. For example, a customer table might include a MiddleName column, which can be empty (or NULL) for rows that represent customers with no middle name or whose middle name is unknown.
Each column stores data of a specific datatype. For example, an Email column in a Customer table would likely be defined
to store character-based (text) data (which might be fixed or variable in length), a Price column in a Product table might
be defined to store decimal numeric data, while a Quantity column in an Order table might be constrained to integer
numeric values; and an OrderDate column in the same Order table would be defined to store date/time values. The
available datatypes that you can use when defining a table depend on the database system you are using; though there
are standard datatypes defined by the American National Standards Institute (ANSI) that are supported by most database
systems.
Understand normalization
Normalization is a term used by database professionals for a schema design process that minimizes data duplication and
enforces data integrity.
While there are many complex rules that define the process of refactoring data into various levels (or forms) of
normalization, a simple definition for practical purposes is:
1. Separate each entity into its own table.
2. Separate each discrete attribute into its own column.
3. Uniquely identify each entity instance (row) using a primary key.
4. Use foreign key columns to link related entities.
To understand the core principles of normalization, suppose a company tracks its sales in a single spreadsheet, with one row for each individual item sold. In that spreadsheet, the customer and product details are duplicated for every item sold, and the customer name and postal address, and the product name and price, are combined in the same cells.
Now let's look at how normalization changes the way the data is stored.
Each entity that is represented in the data (customer, product, sales order, and line item) is stored in its own table, and
each discrete attribute of those entities is in its own column.
Recording each instance of an entity as a row in an entity-specific table removes duplication of data. For example, to
change a customer's address, you need only modify the value in a single row.
The decomposition of attributes into individual columns ensures that each value is constrained to an appropriate data type
- for example, product prices must be decimal values, while line item quantities must be integer numbers. Additionally, the
creation of individual columns provides a useful level of granularity in the data for querying - for example, you can easily
filter customers to those who live in a specific city.
Instances of each entity are uniquely identified by an ID or other key value, known as a primary key; and when one entity
references another (for example, an order has an associated customer), the primary key of the related entity is stored as
a foreign key. You can look up the address of the customer (which is stored only once) for each record in the Order table
by referencing the corresponding record in the Customer table. Typically, a relational database management system
(RDBMS) can enforce referential integrity to ensure that a value entered into a foreign key field has an existing
corresponding primary key in the related table – for example, preventing orders for non-existent customers.
In some cases, a key (primary or foreign) can be defined as a composite key based on a unique combination of multiple
columns. For example, the LineItem table in the example above uses a unique combination of OrderNo and ItemNo to
identify a line item from an individual order.
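The following Transact-SQL sketch shows one way the normalized design described above might be declared. The data types and lengths are illustrative assumptions rather than a prescribed schema:
SQL
-- Each entity gets its own table, and each discrete attribute its own column.
CREATE TABLE Customer
(
    ID INT PRIMARY KEY,
    FirstName VARCHAR(50) NOT NULL,
    LastName VARCHAR(50) NOT NULL,
    Address VARCHAR(100) NOT NULL,
    City VARCHAR(50) NOT NULL
);

CREATE TABLE Product
(
    ID INT PRIMARY KEY,
    Name VARCHAR(50) NOT NULL,
    Price DECIMAL(8,2) NOT NULL              -- prices constrained to decimal values
);

CREATE TABLE [Order]
(
    OrderNo INT PRIMARY KEY,
    OrderDate DATETIME NOT NULL,
    Customer INT NOT NULL
        REFERENCES Customer(ID)              -- foreign key: every order must belong to an existing customer
);

CREATE TABLE LineItem
(
    OrderNo INT NOT NULL REFERENCES [Order](OrderNo),
    ItemNo INT NOT NULL,
    Product INT NOT NULL REFERENCES Product(ID),
    Quantity INT NOT NULL,                   -- quantities constrained to integers
    PRIMARY KEY (OrderNo, ItemNo)            -- composite key uniquely identifies each line item
);
Because the database engine enforces the primary and foreign keys, attempts to insert a line item for a non-existent order or product are rejected, which is the referential integrity described above.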
Explore SQL
SQL stands for Structured Query Language, and is used to communicate with a relational database. It's the standard
language for relational database management systems. SQL statements are used to perform tasks such as update data in a
database, or retrieve data from a database. Some common relational database management systems that use SQL include
Microsoft SQL Server, MySQL, PostgreSQL, MariaDB, and Oracle.
Note
SQL was originally standardized by the American National Standards Institute (ANSI) in 1986, and by the International
Organization for Standardization (ISO) in 1987. Since then, the standard has been extended several times as relational
database vendors have added new features to their systems. Additionally, most database vendors include their own
proprietary extensions that are not part of the standard, which has resulted in a variety of dialects of SQL.
You can use SQL statements such as SELECT, INSERT, UPDATE, DELETE, CREATE, and DROP to accomplish almost
everything that you need to do with a database. Although these SQL statements are part of the SQL standard, many
database management systems also have their own additional proprietary extensions to handle the specifics of that
database management system. These extensions provide functionality not covered by the SQL standard, and include areas
such as security management and programmability. For example, Microsoft SQL Server, and Azure database services that
are based on the SQL Server database engine, use Transact-SQL. This implementation includes proprietary extensions for
writing stored procedures and triggers (application code that can be stored in the database), and managing user accounts.
PostgreSQL and MySQL also have their own versions of these features.
Oracle uses its own dialect, PL/SQL, which stands for Procedural Language/SQL.
Users who plan to work specifically with a single database system should learn the intricacies of their preferred SQL dialect
and platform.
Note
The SQL code examples in this module are based on the Transact-SQL dialect, unless otherwise indicated. The syntax for
other dialects is generally similar, but may vary in some details.
DDL statements
You use DDL statements to create, modify, and remove tables and other objects in a database (tables, stored procedures, views, and so on).
The most common DDL statements are:
Statement | Description
CREATE | Create a new object in the database, such as a table or a view.
ALTER | Modify the structure of an object. For instance, altering a table to add a new column.
DROP | Remove an object from the database.
RENAME | Rename an existing object.
Warning
The DROP statement is very powerful. When you drop a table, all the rows in that table are lost. Unless you have a backup,
you won't be able to retrieve this data.
The following example creates a new database table. The items between the parentheses specify the details of each
column, including the name, the data type, whether the column must always contain a value (NOT NULL), and whether the
data in the column is used to uniquely identify a row (PRIMARY KEY). Each table should have a primary key, although SQL
doesn't enforce this rule.
Note
Columns marked as NOT NULL are referred to as mandatory columns. If you omit the NOT NULL clause, you can create
rows that don't contain a value in the column. An empty column in a row is said to have a NULL value.
SQL
CREATE TABLE Product
(
ID INT PRIMARY KEY,
Name VARCHAR(20) NOT NULL,
Price DECIMAL NULL
);
The datatypes available for columns in a table will vary between database management systems. However, most database
management systems support numeric types such as INT (an integer, or whole number), DECIMAL (a decimal number), and
string types such as VARCHAR (VARCHAR stands for variable length character data). For more information, see the
documentation for your selected database management system.
DCL statements
Database administrators generally use DCL statements to manage access to objects in a database by granting, denying, or
revoking permissions to specific users or groups.
Statement | Description
GRANT | Grant permission to perform specific actions.
DENY | Deny permission to perform specific actions.
REVOKE | Remove a previously granted permission.
For example, the following GRANT statement permits a user named user1 to read, insert, and modify data in
the Product table.
SQL
GRANT SELECT, INSERT, UPDATE
ON Product
TO user1;
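For completeness, here's a sketch of the reverse operation, removing permissions previously granted to the same hypothetical user:
SQL
-- Remove the INSERT and UPDATE permissions from user1, leaving SELECT in place.
REVOKE INSERT, UPDATE
ON Product
FROM user1;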
DML statements
You use DML statements to manipulate the rows in tables. These statements enable you to retrieve (query) data, insert new
rows, or modify existing rows. You can also delete rows if you don't need them anymore.
Statement | Description
SELECT | Read rows from a table.
INSERT | Insert new rows into a table.
UPDATE | Modify data in existing rows.
DELETE | Delete existing rows.
Warning
SQL doesn't provide "are you sure?" prompts, so be careful when using DELETE or UPDATE without a WHERE clause because
you can lose or modify a lot of data.
The following code is an example of a SQL statement that selects all columns (indicated by *) from the Customer table
where the City column value is "Seattle":
SQL
SELECT *
FROM Customer
WHERE City = 'Seattle';
To retrieve only a specific subset of columns from the table, you list them in the SELECT clause, like this:
SQL
SELECT FirstName, LastName, Address, City
FROM Customer
WHERE City = 'Seattle';
If a query returns many rows, they don't necessarily appear in any specific sequence. If you want to sort the data, you can
add an ORDER BY clause. The data will be sorted by the specified column:
SQL
SELECT FirstName, LastName, Address, City
FROM Customer
WHERE City = 'Seattle'
ORDER BY LastName;
You can also run SELECT statements that retrieve data from multiple tables using a JOIN clause. Joins indicate how the
rows in one table are connected with rows in the other to determine what data to return. A typical join condition matches
a foreign key from one table and its associated primary key in the other table.
The following query shows an example that joins Customer and Order tables. The query makes use of table aliases to
abbreviate the table names when specifying which columns to retrieve in the SELECT clause and which columns to match
in the JOIN clause.
SQL
SELECT o.OrderNo, o.OrderDate, c.Address, c.City
FROM Order AS o
JOIN Customer AS c
ON o.Customer = c.ID
The next example shows how to modify an existing row using SQL. It changes the value of the Address column in
the Customer table for rows that have the value 1 in the ID column. All other rows are left unchanged:
SQL
UPDATE Customer
SET Address = '123 High St.'
WHERE ID = 1;
Warning
If you omit the WHERE clause, an UPDATE statement will modify every row in the table.
Use the DELETE statement to remove rows. You specify the table to delete from, and a WHERE clause that identifies the
rows to be deleted:
SQL
DELETE FROM Product
WHERE ID = 162;
Warning
If you omit the WHERE clause, a DELETE statement will remove every row from the table.
The INSERT statement takes a slightly different form. You specify a table and columns in an INTO clause, and a list of
values to be stored in these columns. Standard SQL only supports inserting one row at a time, as shown in the following
example. Some dialects allow you to specify multiple VALUES clauses to add several rows at a time:
SQL
INSERT INTO Product(ID, Name, Price)
VALUES (99, 'Drill', 4.99);
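For example, in dialects that support multiple value lists (including Transact-SQL), a single INSERT statement might add several rows to the Product table, as sketched below:
SQL
INSERT INTO Product(ID, Name, Price)
VALUES
    (100, 'Saw', 9.99),
    (101, 'Wrench', 6.49);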
Note
This topic describes some basic SQL statements and syntax in order to help you understand how SQL is used to work with
objects in a database. If you want to learn more about querying data with SQL, review the Get Started Querying with
Transact-SQL learning path on Microsoft Learn.
Describe database objects
In addition to tables, a relational database can contain other structures that help to optimize data organization,
encapsulate programmatic actions, and improve the speed of access. In this unit, you'll learn about three of these
structures in more detail: views, stored procedures, and indexes.
What is a view?
A view is a virtual table based on the results of a SELECT query. You can think of a view as a window on specified rows in
one or more underlying tables. For example, you could create a view on the Order and Customer tables that retrieves
order and customer data to provide a single object that makes it easy to determine delivery addresses for orders:
SQL
CREATE VIEW Deliveries
AS
SELECT o.OrderNo, o.OrderDate,
c.FirstName, c.LastName, c.Address, c.City
FROM Order AS o JOIN Customer AS c
ON o.Customer = c.ID;
You can query the view and filter the data in much the same way as a table. The following query finds details of orders for
customers who live in Seattle:
SQL
SELECT OrderNo, OrderDate, LastName, Address
FROM Deliveries
WHERE City = 'Seattle';
You can define a stored procedure with parameters to create a flexible solution for common actions that might need to be
applied to data based on a specific key or criteria. For example, the following stored procedure could be defined to change
the name of a product based on the specified product ID.
SQL
CREATE PROCEDURE RenameProduct
@ProductID INT,
@NewName VARCHAR(20)
AS
UPDATE Product
SET Name = @NewName
WHERE ID = @ProductID;
When a product must be renamed, you can execute the stored procedure, passing the ID of the product and the new
name to be assigned:
SQL
EXEC RenameProduct 201, 'Spanner';
What is an index?
An index helps you search for data in a table. Think of an index over a table like an index at the back of a book. A book
index contains a sorted set of references, with the pages on which each reference occurs. When you want to find a
reference to an item in the book, you look it up through the index. You can use the page numbers in the index to go
directly to the correct pages in the book. Without an index, you might have to read through the entire book to find the
references you're looking for.
When you create an index in a database, you specify a column from the table, and the index contains a copy of this data in
a sorted order, with pointers to the corresponding rows in the table. When the user runs a query that specifies this column
in the WHERE clause, the database management system can use this index to fetch the data more quickly than if it had to
scan through the entire table row by row.
For example, you could use the following code to create an index on the Name column of the Product table:
SQL
CREATE INDEX idx_ProductName
ON Product(Name);
The index creates a tree-based structure that the database system's query optimizer can use to quickly find rows in
the Product table based on a specified Name.
For a table containing few rows, using the index is probably not any more efficient than simply reading the entire table and
finding the rows requested by the query (in which case the query optimizer will ignore the index). However, when a table
has many rows, indexes can dramatically improve the performance of queries.
You can create many indexes on a table. So, if you also wanted to find products based on price, creating another index on
the Price column in the Product table might be useful. However, indexes aren't free. An index consumes storage space,
and each time you insert, update, or delete data in a table, the indexes for that table must be maintained. This additional
work can slow down insert, update, and delete operations. You must strike a balance between having indexes that speed
up your queries versus the cost of performing other operations.
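For example, if your queries frequently filter or sort on price, you might create (and later remove) an index on the Price column. The statements below are a sketch based on the Product table used throughout this unit:
SQL
-- Create an additional index to support queries that filter or sort by price
CREATE INDEX idx_ProductPrice
ON Product(Price);

-- Drop the index if its maintenance cost outweighs the query benefit
DROP INDEX idx_ProductPrice ON Product;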
Knowledge check
Choose the best response for each of the questions below. Then select Check your answers.
1.
Which SQL statement is used to query tables and return data?
QUERY
READ
SELECT
That's correct. Use the SELECT statement to query one or more tables and return data.
3.
What is an index?
That's correct. Indexes improve query performance by locating rows with indexed column values.
Introduction
Azure supports multiple database services, enabling you to run popular relational database management systems, such as
SQL Server, PostgreSQL, and MySQL, in the cloud.
Most Azure database services are fully managed, freeing up valuable time you’d otherwise spend managing your database.
Enterprise-grade performance with built-in high availability means you can scale quickly and reach global distribution
without worrying about costly downtime. Developers can take advantage of industry-leading innovations such as built-in
security with automatic monitoring and threat detection, and automatic tuning for improved performance. On top of all of
these features, you have guaranteed availability.
In this module, you'll explore the options available for relational database services in Azure.
Learning objectives
In this module, you'll learn how to:
Identify options for Azure SQL services
Identify options for open-source databases in Azure
Provision a database service on Azure
Azure SQL is a collective term for a family of Microsoft SQL Server based database services in Azure. Specific Azure SQL
services include:
SQL Server on Azure Virtual Machines (VMs) - A virtual machine running in Azure with an installation of SQL Server. The use
of a VM makes this option an infrastructure-as-a-service (IaaS) solution that virtualizes hardware infrastructure for compute,
storage, and networking in Azure; making it a great option for "lift and shift" migration of existing on-premises SQL Server
installations to the cloud.
Azure SQL Managed Instance - A platform-as-a-service (PaaS) option that provides near-100% compatibility with on-premises
SQL Server instances while abstracting the underlying hardware and operating system. The service includes automated software
update management, backups, and other maintenance tasks, reducing the administrative burden of supporting a database
server instance.
Azure SQL Database - A fully managed, highly scalable PaaS database service that is designed for the cloud. This service
includes the core database-level capabilities of on-premises SQL Server, and is a good option when you need to create a new
application in the cloud.
Azure SQL Edge - A SQL engine that is optimized for Internet-of-things (IoT) scenarios that need to work with streaming time-
series data.
Note
Azure SQL Edge is included in this list for completeness. We'll focus on the other options for more general relational
database scenarios in this module.
Compare Azure SQL services
SQL Server compatibility
SQL Server on Azure VMs: Fully compatible with on-premises physical and virtualized installations. Applications and databases can easily be "lift and shift" migrated without change.
Azure SQL Managed Instance: Near-100% compatibility with SQL Server. Most on-premises databases can be migrated with minimal code changes by using the Azure Database Migration service.
Azure SQL Database: Supports most core database-level capabilities of SQL Server. Some features depended on by an on-premises application may not be available.
Architecture
SQL Server on Azure VMs: SQL Server instances are installed in a virtual machine. Each instance can support multiple databases.
Azure SQL Managed Instance: Each managed instance can support multiple databases. Additionally, instance pools can be used to share resources efficiently across smaller instances.
Azure SQL Database: You can provision a single database in a dedicated, managed (logical) server; or you can use an elastic pool to share resources across multiple databases and take advantage of on-demand scalability.
Management
SQL Server on Azure VMs: You must manage all aspects of the server, including operating system and SQL Server updates, configuration, backups, and other maintenance tasks.
Azure SQL Managed Instance: Fully automated updates, backups, and recovery.
Azure SQL Database: Fully automated updates, backups, and recovery.
Use cases
SQL Server on Azure VMs: Use this option when you need to migrate or extend an on-premises SQL Server solution and retain full control over all aspects of server and database configuration.
Azure SQL Managed Instance: Use this option for most cloud migration scenarios, particularly when you need minimal changes to existing applications.
Azure SQL Database: Use this option for new cloud solutions, or to migrate applications that have minimal instance-level dependencies.
SQL Server on Azure Virtual Machines
SQL Server running on an Azure virtual machine effectively replicates the database running on real on-premises hardware.
Migrating from the system running on-premises to an Azure virtual machine is no different than moving the databases
from one on-premises server to another.
This approach is suitable for migrations and applications requiring access to operating system features that might be
unsupported at the PaaS level. SQL virtual machines are lift-and-shift ready for existing applications that require fast
migration to the cloud with minimal changes. You can also use SQL Server on Azure VMs to extend existing on-premises
applications to the cloud in hybrid deployments.
Note
A hybrid deployment is a system where part of the operation runs on-premises, and part in the cloud. Your database might
be part of a larger system that runs on-premises, although the database elements might be hosted in the cloud.
You can use SQL Server in a virtual machine to develop and test traditional SQL Server applications. With a virtual machine,
you have the full administrative rights over the DBMS and operating system. It's a perfect choice when an organization
already has IT resources available to maintain the virtual machines. These capabilities enable you to:
Create rapid development and test scenarios when you don't want to buy on-premises non-production SQL Server hardware.
Become lift-and-shift ready for existing applications that require fast migration to the cloud with minimal changes or no
changes.
Scale up the platform on which SQL Server is running, by allocating more memory, CPU power, and disk space to the virtual
machine. You can quickly resize an Azure virtual machine without the requirement that you reinstall the software that is running
on it.
Business benefits
Running SQL Server on virtual machines allows you to meet unique and diverse business needs through a combination of
on-premises and cloud-hosted deployments, while using the same set of server products, development tools, and
expertise across these environments.
It's not always easy for businesses to switch their DBMS to a fully managed service. There may be specific requirements
that must be satisfied before migrating to a managed service, and meeting them can require changes to the database and the
applications that use it. For this reason, using virtual machines can offer a solution, but using them doesn't eliminate the
need to administer your DBMS as carefully as you would on-premises.
Azure SQL Managed Instance
Azure SQL Managed Instance effectively runs a fully controllable instance of SQL Server in the cloud. You can install
multiple databases on the same instance. You have complete control over this instance, much as you would for an on-
premises server. SQL Managed Instance automates backups, software patching, database monitoring, and other general
tasks, but you have full control over security and resource allocation for your databases. You can find detailed information
at What is Azure SQL Managed Instance?.
Managed instances depend on other Azure services such as Azure Storage for backups, Azure Event Hubs for telemetry,
Microsoft Entra ID for authentication, Azure Key Vault for Transparent Data Encryption (TDE) and a couple of Azure
platform services that provide security and supportability features. The managed instances make connections to these
services.
All communications are encrypted and signed using certificates. To check the trustworthiness of communicating parties,
managed instances constantly verify these certificates through certificate revocation lists. If the certificates are revoked, the
managed instance closes the connections to protect the data.
Use cases
Consider Azure SQL Managed Instance if you want to lift-and-shift an on-premises SQL Server instance and all its
databases to the cloud, without incurring the management overhead of running SQL Server on a virtual machine.
Azure SQL Managed Instance provides features not available in Azure SQL Database (discussed below). If your system uses
features such as linked servers, Service Broker (a message processing system that can be used to distribute work across
servers), or Database Mail (which enables your database to send email messages to users), then you should use managed
instance. To check compatibility with an existing on-premises system, you can install Data Migration Assistant (DMA). This
tool analyzes your databases on SQL Server and reports any issues that could block migration to a managed instance.
Business benefits
Azure SQL Managed Instance enables a system administrator to spend less time on administrative tasks because the
service either performs them for you or greatly simplifies those tasks. Automated tasks include operating system and
database management system software installation and patching, dynamic instance resizing and configuration, backups,
database replication (including system databases), high availability configuration, and configuration of health and
performance monitoring data streams.
Azure SQL Managed Instance has near 100% compatibility with SQL Server Enterprise Edition, running on-premises.
Azure SQL Managed Instance supports SQL Server Database engine logins and logins integrated with Microsoft Entra ID.
SQL Server Database engine logins include a username and a password. You must enter your credentials each time you
connect to the server. Microsoft Entra logins use the credentials associated with your current computer sign-in, and you
don't need to provide them each time you connect to the server.
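As an illustration of the two authentication models, the following Transact-SQL sketch creates one login of each kind; the login names and password are placeholders, not values from this module:
SQL
-- SQL Server authentication: the database engine stores and checks the password
CREATE LOGIN sql_app_login WITH PASSWORD = 'Pl@ceholder-Passw0rd';

-- Microsoft Entra authentication: the login maps to an existing directory identity
CREATE LOGIN [[email protected]] FROM EXTERNAL PROVIDER;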
Note
A SQL Database server is a logical construct that acts as a central administrative point for multiple single or pooled
databases, logins, firewall rules, auditing rules, threat detection policies, and failover groups.
Single Database
This option enables you to quickly set up and run a single SQL Server database. You create and run a database server in
the cloud, and you access your database through this server. Microsoft manages the server, so all you have to do is
configure the database, create your tables, and populate them with your data. You can scale the database if you need
more storage space, memory, or processing power. By default, resources are pre-allocated, and you're charged per hour
for the resources you've requested. You can also specify a serverless configuration. In this configuration, Microsoft creates
its own server, which might be shared by databases belonging to other Azure subscribers. Microsoft ensures the privacy of
your database. Your database automatically scales and resources are allocated or deallocated as required.
Elastic Pool
This option is similar to Single Database, except that by default multiple databases can share the same resources, such as
memory, data storage space, and processing power through multiple-tenancy. The resources are referred to as a pool. You
create the pool, and only your databases can use the pool. This model is useful if you have databases with resource
requirements that vary over time, and can help you to reduce costs. For example, your payroll database might require
plenty of CPU power at the end of each month as you handle payroll processing, but at other times the database might
become much less active. You might have another database that is used for running reports. This database might become
active for several days in the middle of the month as management reports are generated, but with a lighter load at other
times. Elastic Pool enables you to use the resources available in the pool, and then release the resources once processing
has completed.
Use cases
Azure SQL Database gives you the best option for low cost with minimal administration. It isn't fully compatible with on-
premises SQL Server installations. It's often used in new cloud projects where the application design can accommodate any
required changes to your applications.
Note
You can use the Data Migration Assistant to detect compatibility issues with your databases that can impact database
functionality in Azure SQL Database. For more information, see Overview of Data Migration Assistant.
Azure SQL Database is often a good option for:
Modern cloud applications that need to use the latest stable SQL Server features.
Applications that require high availability.
Systems with a variable load that need the database server to scale up and down quickly.
Business benefits
Azure SQL Database automatically updates and patches the SQL Server software to ensure that you're always running the
latest and most secure version of the service.
The scalability features of Azure SQL Database ensure that you can increase the resources available to store and process
data without having to perform a costly manual upgrade.
The service provides high availability guarantees, to ensure that your databases are available at least 99.995% of the time.
Azure SQL Database supports point-in-time restore, enabling you to recover a database to the state it was in at any point
in the past. Databases can be replicated to different regions to provide more resiliency and disaster recovery.
Advanced threat protection provides advanced security capabilities, such as vulnerability assessments, to help detect and
remediate potential security problems with your databases. Threat protection also detects anomalous activities that
indicate unusual and potentially harmful attempts to access or exploit your database. It continuously monitors your
database for suspicious activities, and provides immediate security alerts on potential vulnerabilities, SQL injection attacks,
and anomalous database access patterns. Threat detection alerts provide details of the suspicious activity, and recommend
action on how to investigate and mitigate the threat.
Auditing tracks database events and writes them to an audit log in your Azure storage account. Auditing can help you
maintain regulatory compliance, understand database activity, and gain insight into discrepancies and anomalies that
might indicate business concerns or suspected security violations.
SQL Database helps secure your data by providing encryption that protects data that is stored in the database (at rest) and
while it is being transferred across the network (in motion).
Describe Azure services for open-source databases
In addition to Azure SQL services, Azure data services are available for other popular relational database systems, including
MySQL, MariaDB, and PostgreSQL. The primary reason for these services is to enable organizations that use them in on-
premises apps to move to Azure quickly, without making significant changes to their applications.
MySQL started life as a simple-to-use open-source database management system. It's the leading open-source relational
database for LAMP (Linux, Apache, MySQL, and PHP) stack apps. It's available in several editions: Community, Standard, and
Enterprise. The Community edition is available free-of-charge, and has historically been popular as a database
management system for web applications, running under Linux. Versions are also available for Windows. Standard edition
offers higher performance, and uses a different technology for storing data. Enterprise edition provides a comprehensive
set of tools and features, including enhanced security, availability, and scalability. The Standard and Enterprise editions are
the versions most frequently used by commercial organizations, although these versions of the software aren't free.
MariaDB is a newer database management system, created by the original developers of MySQL. The database engine has
since been rewritten and optimized to improve performance. MariaDB offers compatibility with Oracle Database (another
popular commercial database management system). One notable feature of MariaDB is its built-in support for temporal
data. A table can hold several versions of data, enabling an application to query the data as it appeared at some point in
the past.
PostgreSQL is a hybrid relational-object database. You can store data in relational tables, but a PostgreSQL database also
enables you to store custom data types, with their own non-relational properties. The database management system is
extensible; you can add code modules to the database, which can be run by queries. Another key feature is the ability to
store and manipulate geometric data, such as lines, circles, and polygons.
PostgreSQL has its own query language called pgsql. This language is a variant of the standard relational query language,
SQL, with features that enable you to write stored procedures that run inside the database.
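For example, PostgreSQL includes built-in geometric types that can be used directly in SQL. The following sketch (the landmark table and coordinate values are assumptions for illustration) stores point data and uses the distance operator to compare locations:
SQL
CREATE TABLE landmark
(
    name VARCHAR(50) NOT NULL,
    location POINT NOT NULL
);

INSERT INTO landmark (name, location)
VALUES ('Space Needle', POINT(47.6205, -122.3493));

-- The <-> operator returns the distance between two geometric values
SELECT name, location <-> POINT(47.6097, -122.3331) AS distance_from_downtown
FROM landmark;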
Azure Database for MySQL is a PaaS implementation of MySQL in the Azure cloud, based on the MySQL Community
Edition.
The Azure Database for MySQL service includes high availability at no additional cost, and scalability as required. You only
pay for what you use. Automatic backups are provided, with point-in-time restore.
The server provides connection security to enforce firewall rules and, optionally, require SSL connections. Many server
parameters enable you to configure server settings such as lock modes, maximum number of connections, and timeouts.
Azure Database for MySQL provides a global database system that scales up to large databases without the need to
manage hardware, network components, virtual servers, software patches, and other underlying components.
Certain operations aren't available with Azure Database for MySQL. These functions are primarily concerned with security
and administration. Azure manages these aspects of the database server itself.
Benefits of Azure Database for MySQL
You get the following features with Azure Database for MySQL:
The system uses pay-as-you-go pricing so you only pay for what you use.
Azure Database for MySQL servers provides monitoring functionality to add alerts, and to view metrics and logs.
Azure Database for MariaDB is an implementation of the MariaDB database management system adapted to run in
Azure. It's based on the MariaDB Community Edition.
The database is fully managed and controlled by Azure. Once you've provisioned the service and transferred your data, the
system requires almost no additional administration.
Azure Database for PostgreSQL
Azure Database for PostgreSQL is a PaaS implementation of PostgreSQL in the Azure cloud.
Some features of on-premises PostgreSQL databases aren't available in Azure Database for PostgreSQL. These features are
mostly concerned with the extensions that users can add to a database to perform specialized tasks, such as writing stored
procedures in various programming languages (other than pgsql, which is available), and interacting directly with the
operating system. A core set of the most frequently used extensions is supported, and the list of available extensions is
under continuous review.
The flexible-server deployment option for PostgreSQL is a fully managed database service. It provides a high level of
control and server configuration customizations, and provides cost optimization controls.
Benefits of Azure Database for PostgreSQL
Azure Database for PostgreSQL is a highly available service. It contains built-in failure detection and failover mechanisms.
Users of PostgreSQL will be familiar with the pgAdmin tool, which you can use to manage and monitor a PostgreSQL
database. You can continue to use this tool to connect to Azure Database for PostgreSQL. However, some server-focused
functionality, such as performing server backup and restore, aren't available because the server is managed and
maintained by Microsoft.
Azure Database for PostgreSQL records information about queries run against databases on the server, and saves them in
a database named azure_sys. You query the query_store.qs_view view to see this information, and use it to monitor the
queries that users are running. This information can prove invaluable if you need to fine-tune the queries performed by
your applications.
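For example, after connecting to the azure_sys database you could run a query like the following minimal sketch to inspect the captured query statistics:
SQL
-- Review recently captured query statistics (connect to the azure_sys database first)
SELECT *
FROM query_store.qs_view
LIMIT 10;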
Exercise: Explore Azure relational database services
Choose your database: Azure SQL Database, Azure Database for PostgreSQL, or Azure Database for MySQL.
Note
To complete this lab, you will need an Azure subscription in which you have administrative access.
Launch the exercise and follow the instructions to explore Azure SQL Database.
Knowledge check
Choose the best response for each of the questions below. Then select Check your answers.
1.
Which deployment option offers the best compatibility when migrating an existing SQL Server on-premises
solution?
Correct. Azure SQL Managed Instance offers near 100% compatibility with SQL Server.
2.
Which database service is the simplest option for migrating a LAMP application to Azure?
6 Module Explore Azure Storage for non-relational data
Introduction
Most software applications need to store data. Often this takes the form of a relational database, in which the data is organized in
related tables and managed by using Structured Query Language (SQL). However, many applications don't need the rigid structure of a
relational database and rely on non-relational (often referred to as NoSQL) storage.
Azure Storage is one of the core services in Microsoft Azure, and offers a range of options for storing data in the cloud. In this module,
you explore the fundamental capabilities of Azure storage and learn how it's used to support applications that require non-relational
data stores.
Learning objectives
In this module, you learn how to:
Describe features and capabilities of Azure Blob storage
Describe features and capabilities of Azure Data Lake Storage Gen2
Describe features and capabilities of Azure Files
Describe features and capabilities of Azure Table storage
Azure Blob Storage is a service that enables you to store massive amounts of unstructured data as binary large objects,
or blobs, in the cloud. Blobs are an efficient way to store data files in a format that is optimized for cloud-based storage,
and applications can read and write them by using the Azure blob storage API.
In an Azure storage account, you store blobs in containers. A container provides a convenient way of grouping related
blobs together. You control who can read and write blobs inside a container at the container level.
Within a container, you can organize blobs in a hierarchy of virtual folders, similar to files in a file system on disk. However,
by default, these folders are simply a way of using a "/" character in a blob name to organize the blobs into namespaces.
The folders are purely virtual, and you can't perform folder-level operations to control access or perform bulk operations.
Blob storage provides three access tiers, which help to balance access latency and storage cost:
The Hot tier is the default. You use this tier for blobs that are accessed frequently. The blob data is stored on high-
performance media.
The Cool tier has lower performance and incurs reduced storage charges compared to the Hot tier. Use the Cool tier
for data that is accessed infrequently. It's common for newly created blobs to be accessed frequently initially, but less
so as time passes. In these situations, you can create the blob in the Hot tier, but migrate it to the Cool tier later. You
can migrate a blob from the Cool tier back to the Hot tier.
The Archive tier provides the lowest storage cost, but with increased latency. The Archive tier is intended for historical
data that mustn't be lost, but is required only rarely. Blobs in the Archive tier are effectively stored in an offline state.
Typical reading latency for the Hot and Cool tiers is a few milliseconds, but for the Archive tier, it can take hours for
the data to become available. To retrieve a blob from the Archive tier, you must change the access tier to Hot or Cool.
The blob will then be rehydrated. You can read the blob only when the rehydration process is complete.
You can create lifecycle management policies for blobs in a storage account. A lifecycle management policy can
automatically move a blob from Hot to Cool, and then to the Archive tier, as it ages and is used less frequently (policy is
based on the number of days since modification). A lifecycle management policy can also arrange to delete outdated
blobs.
Azure Data Lake Store (Gen1) is a separate service for hierarchical data storage for analytical data lakes, often used by so-
called big data analytical solutions that work with structured, semi-structured, and unstructured data stored in files. Azure
Data Lake Storage Gen2 is a newer version of this service that is integrated into Azure Storage; enabling you to take
advantage of the scalability of blob storage and the cost-control of storage tiers, combined with the hierarchical file
system capabilities and compatibility with major analytics systems of Azure Data Lake Store.
Systems like Hadoop in Azure HDInsight, Azure Databricks, and Azure Synapse Analytics can mount a distributed file
system hosted in Azure Data Lake Store Gen2 and use it to process huge volumes of data.
To create an Azure Data Lake Storage Gen2 file system, you must enable the Hierarchical Namespace option of an Azure
Storage account. You can do this when initially creating the storage account, or you can upgrade an existing Azure Storage
account to support Data Lake Gen2. Be aware however that upgrading is a one-way process – after upgrading a storage
account to support a hierarchical namespace for blob storage, you can’t revert it to a flat namespace.
Many on-premises systems comprising a network of in-house computers make use of file shares. A file share enables you
to store a file on one computer, and grant access to that file to users and applications running on other computers. This
strategy can work well for computers in the same local area network, but doesn't scale well as the number of users
increases, or if users are located at different sites.
Azure Files is essentially a way to create cloud-based network shares, such as you typically find in on-premises
organizations to make documents and other files available to multiple users. By hosting file shares in Azure, organizations
can eliminate hardware costs and maintenance overhead, and benefit from high availability and scalable cloud storage for
files.
You create Azure File storage in a storage account. Azure Files enables you to share up to 100 TB of data in a single
storage account. This data can be distributed across any number of file shares in the account. The maximum size of a
single file is 1 TB, but you can set quotas to limit the size of each share below this figure. Currently, Azure File Storage
supports up to 2000 concurrent connections per shared file.
After you've created a storage account, you can upload files to Azure File Storage using the Azure portal, or tools such as
the AzCopy utility. You can also use the Azure File Sync service to synchronize locally cached copies of shared files with the
data in Azure File Storage.
Azure File Storage offers two performance tiers. The Standard tier uses hard disk-based hardware in a datacenter, and
the Premium tier uses solid-state disks. The Premium tier offers greater throughput, but is charged at a higher rate.
Azure Files supports two common network file-sharing protocols:
Server Message Block (SMB) file sharing is commonly used across multiple operating systems (Windows, Linux,
macOS).
Network File System (NFS) shares are used by some Linux and macOS versions. To create an NFS share, you must use
a premium tier storage account and create and configure a virtual network through which access to the share can be
controlled.
Azure Table Storage is a NoSQL storage solution that makes use of tables containing key/value data items. Each item is
represented by a row that contains columns for the data fields that need to be stored.
However, don't be misled into thinking that an Azure Table Storage table is like a table in a relational database. An Azure
Table enables you to store semi-structured data. All rows in a table must have a unique key (composed of a partition key
and a row key), and when you modify data in a table, a timestamp column records the date and time the modification was
made; but other than that, the columns in each row can vary. Azure Table Storage tables have no concept of foreign keys,
relationships, stored procedures, views, or other objects you might find in a relational database. Data in Azure Table
storage is usually denormalized, with each row holding the entire data for a logical entity. For example, a table holding
customer information might store the first name, last name, one or more telephone numbers, and one or more addresses
for each customer. The number of fields in each row can be different, depending on the number of telephone numbers
and addresses for each customer, and the details recorded for each address. In a relational database, this information
would be split across multiple rows in several tables.
To help ensure fast access, Azure Table Storage splits a table into partitions. Partitioning is a mechanism for grouping
related rows, based on a common property or partition key. Rows that share the same partition key will be stored together.
Partitioning not only helps to organize data, it can also improve scalability and performance in the following ways:
Partitions are independent from each other, and can grow or shrink as rows are added to, or removed from, a
partition. A table can contain any number of partitions.
When you search for data, you can include the partition key in the search criteria. This helps to narrow down the
volume of data to be examined, and improves performance by reducing the amount of I/O (input and output
operations, or reads and writes) needed to locate the data.
The key in an Azure Table Storage table comprises two elements; the partition key that identifies the partition containing
the row, and a row key that is unique to each row in the same partition. Items in the same partition are stored in row key
order. If an application adds a new row to a table, Azure ensures that the row is placed in the correct position in the table.
This scheme enables an application to quickly perform point queries that identify a single row, and range queries that fetch
a contiguous block of rows in a partition.
Exercise: Explore Azure Storage
Note
To complete this lab, you will need an Azure subscription in which you have administrative access.
Knowledge check
1.
What are the elements of an Azure Table storage key?
Partition key and row key
That's correct. The partition key identifies the partition in which a row is located, and the rows in each partition are stored
in row key order.
Row number
2.
What should you do to an existing Azure Storage account in order to support a data lake for Azure Synapse
Analytics?
Add an Azure Files share
Create Azure Storage tables for the data you want to analyze
Upgrade the account to enable hierarchical namespace and create a blob container
That's correct. Enabling a hierarchical namespace adds support for Azure Data Lake Storage Gen 2, which can be used by
Synapse Analytics.
3.
To share files that are stored on-premises with users located at other sites.
That's correct. You can create a file share in Azure File storage, upload files to this file share, and grant access to the file
share to remote users.
To store large binary data files containing images or other unstructured data.
7 Module Explore fundamentals of Azure Cosmos DB
Introduction
Relational databases store data in relational tables, but sometimes the structure imposed by this model can be too rigid,
and often leads to poor performance unless you spend time implementing detailed tuning. Other models, collectively
known as NoSQL databases, exist. These models store data in other structures, such as documents, graphs, key-value
stores, and column family stores.
Azure Cosmos DB is a highly scalable cloud database service for NoSQL data.
Learning objectives
In this module, you'll learn how to:
Describe key features and capabilities of Azure Cosmos DB
Identify the APIs supported in Azure Cosmos DB
Provision and use an Azure Cosmos DB instance
Azure Cosmos DB supports multiple application programming interfaces (APIs) that enable developers to use the
programming semantics of many common kinds of data store to work with data in a Cosmos DB database. The internal
data structure is abstracted, enabling developers to use Cosmos DB to store and query data using APIs with which they're
already familiar.
Note
An API is an Application Programming Interface. Database management systems (and other software frameworks) provide
a set of APIs that developers can use to write programs that need to access data. The APIs vary for different database
management systems.
Cosmos DB uses indexes and partitioning to provide fast read and write performance and can scale to massive volumes of
data. You can enable multi-region writes, adding the Azure regions of your choice to your Cosmos DB account so that
globally distributed users can each work with data in their local replica.
Cosmos DB is a foundational service in Azure. Cosmos DB has been used by many of Microsoft's products for mission
critical applications at global scale, including Skype, Xbox, Microsoft 365, Azure, and many others. Cosmos DB is highly
suitable for the following scenarios:
IoT and telematics. These systems typically ingest large amounts of data in frequent bursts of activity. Cosmos DB can
accept and store this information quickly. The data can then be used by analytics services, such as Azure Machine
Learning, Azure HDInsight, and Power BI. Additionally, you can process the data in real-time using Azure Functions
that are triggered as data arrives in the database.
Retail and marketing. Microsoft uses Cosmos DB for its own e-commerce platforms that run as part of Windows Store
and Xbox Live. It's also used in the retail industry for storing catalog data and for event sourcing in order processing
pipelines.
Gaming. The database tier is a crucial component of gaming applications. Modern games perform graphical
processing on mobile/console clients, but rely on the cloud to deliver customized and personalized content like in-
game stats, social media integration, and high-score leaderboards. Games often require single-millisecond latencies
for reads and write to provide an engaging in-game experience. A game database needs to be fast and be able to
handle massive spikes in request rates during new game launches and feature updates.
Web and mobile applications. Azure Cosmos DB is commonly used within web and mobile applications, and is well
suited for modeling social interactions, integrating with third-party services, and for building rich personalized
experiences. The Cosmos DB SDKs can be used to build rich iOS and Android applications using the popular Xamarin
framework.
For additional information about uses for Cosmos DB, read Common Azure Cosmos DB use cases.
Identify Azure Cosmos DB APIs
Azure Cosmos DB is Microsoft's fully managed and serverless distributed database for applications of any size or scale, with
support for both relational and non-relational workloads. Developers can build and migrate applications fast using their
preferred open source database engines, including PostgreSQL, MongoDB, and Apache Cassandra. When you provision a
new Cosmos DB instance, you select the database engine that you want to use. The choice of engine depends on many
factors including the type of data to be stored, the need to support existing applications, and the skills of the developers
who will work with the data store.
Azure Cosmos DB for NoSQL
Azure Cosmos DB for NoSQL is Microsoft's native non-relational service for working with the document data model, managing data as JSON documents.
A SQL query for an Azure Cosmos DB database containing customer data might look similar to this:
SQL
SELECT *
FROM customers c
WHERE c.id = "[email protected]"
The result of this query consists of one or more JSON documents, as shown here:
JSON
{
"id": "[email protected]",
"name": "Joe Jones",
"address": {
"street": "1 Main St.",
"city": "Seattle"
}
}
Azure Cosmos DB for MongoDB
Azure Cosmos DB for MongoDB enables developers to use MongoDB client libraries and code to work with data in Azure Cosmos DB.
MongoDB Query Language (MQL) uses a compact, object-oriented syntax in which developers use objects to call methods.
For example, the following query uses the find method to query the products collection in the db object:
JavaScript
db.products.find({id: 123})
The result of this query is a document, similar to this:
JSON
{
"id": 123,
"name": "Hammer",
"price": 2.99
}
Azure Cosmos DB for PostgreSQL
Azure Cosmos DB for PostgreSQL is a native PostgreSQL, globally distributed relational database that automatically shards
data to help you build highly scalable apps. You can start building apps on a single node server group, the same way you
would with PostgreSQL anywhere else. As your app's scalability and performance requirements grow, you can seamlessly
scale to multiple nodes by transparently distributing your tables. PostgreSQL is a relational database management system
(RDBMS) in which you define relational tables of data, for example you might define a table of products like this:
ProductID ProductName Price
123 Hammer 2.99
162 Screwdriver 3.49
You could then query this table to retrieve the name and price of a specific product using SQL like this:
SQL
SELECT ProductName, Price
FROM Products
WHERE ProductID = 123;
The results of this query would contain a row for product 123, like this:
ProductName Price
Hammer 2.99
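When scalability requirements grow, tables can be distributed (sharded) across the nodes of the cluster. The following sketch uses the distribution function provided by the Citus extension that underpins Azure Cosmos DB for PostgreSQL; the table and column names follow the example above:
SQL
-- Shard the products table across worker nodes, using the product ID as the distribution column
SELECT create_distributed_table('products', 'productid');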
Azure Cosmos DB for Table
Azure Cosmos DB for Table is used to work with data in key-value tables, similar to Azure Table Storage. It offers greater
scalability and performance than Azure Table Storage. For example, you might define a table named Customers like this:
PartitionKey RowKey Name Email
1 123 Joe Jones [email protected]
1 124 Samir Nadoy [email protected]
You can then use the Table API through one of the language-specific SDKs to make calls to your service endpoint to
retrieve data from the table. For example, the following request returns the row containing the record for Samir Nadoy in
the table above:
text
https://endpoint/Customers(PartitionKey='1',RowKey='124')
Azure Cosmos DB for Apache Cassandra
Azure Cosmos DB for Apache Cassandra is compatible with Apache Cassandra, a popular open-source database that uses a column-family storage structure. For example, an Employees table might look like this:
ID Name Manager
1 Sue Smith
2 Ben Chan Sue Smith
Cassandra supports a syntax based on SQL, so a client application could retrieve the record for Ben Chan like this:
SQL
SELECT * FROM Employees WHERE ID = 2
Azure Cosmos DB for Apache Gremlin
Azure Cosmos DB for Apache Gremlin is used with data in a graph structure, in which entities are defined as vertices that form a network connected by edges.
Gremlin syntax includes functions to operate on vertices and edges, enabling you to insert, update, delete, and query data
in the graph. For example, you could use the following code to add a new employee named Alice that reports to the
employee with ID 1 (Sue)
Gremlin
g.addV('employee').property('id', '3').property('firstName', 'Alice')
g.V('3').addE('reports to').to(g.V('1'))
The following query returns all of the employee vertices, in order of ID.
Gremlin
g.V().hasLabel('employee').order().by('id')
Exercise: Explore Azure Cosmos DB
Note
To complete this lab, you will need an Azure subscription in which you have administrative access.
Knowledge check
1.
Which API should you use to store and query JSON documents in Azure Cosmos DB?
That's correct. The API for NoSQL is designed to store and query JSON documents.
2.
Which Azure Cosmos DB API should you use to work with data in which entities and their relationships to one
another are represented in a graph using vertices and edges?
Azure Cosmos DB for MongoDB
That's correct. The API for Gremlin is used to manage a network of nodes (vertices) and the relationships between them
(edges).
3.
How can you enable globally distributed users to work with their own local replica of a Cosmos DB database?
Create an Azure Cosmos DB account in each region where you have users.
Use the API for Table to copy data to Azure Table Storage in each region where you have users.
Enable multi-region writes and add the regions where you have users.
That's correct. You can enable multi-region writes in the regions where you want users to work with the data.
8 Explore fundamentals of large-scale analytics
Introduction
Large-scale data analytics solutions combine conventional data warehousing used to support business intelligence (BI)
with data lakehouse techniques that are used to integrate data from files and external sources. A conventional data
warehousing solution typically involves copying data from transactional data stores into a relational database with a
schema that's optimized for querying and building multidimensional models. Data lakehouse solutions, on the other hand,
are used with large volumes of data in multiple formats, which are batch loaded or captured in real-time streams and stored
in a data lake, where distributed processing engines like Apache Spark are used to process them.
Learning objectives
In this module, you will learn how to:
Large-scale data analytics architecture can vary, as can the specific technologies used to implement it; but in general, the
following elements are included:
1. Data ingestion and processing – data from one or more transactional data stores, files, real-time streams, or other
sources is loaded into a data lake or a relational data warehouse. The load operation usually involves an extract,
transform, and load (ETL) or extract, load, and transform (ELT) process in which the data is cleaned, filtered, and
restructured for analysis. In ETL processes, the data is transformed before being loaded into an analytical store, while
in an ELT process the data is copied to the store and then transformed. Either way, the resulting data structure is
optimized for analytical queries. The data processing is often performed by distributed systems that can process high
volumes of data in parallel using multi-node clusters. Data ingestion includes both batch processing of static data
and real-time processing of streaming data.
2. Analytical data store – data stores for large scale analytics include relational data warehouses, file-system
based data lakes, and hybrid architectures that combine features of data warehouses and data lakes (sometimes
called data lakehouses or lake databases). We'll discuss these in more depth later.
3. Analytical data model – while data analysts and data scientists can work with the data directly in the analytical data
store, it’s common to create one or more data models that pre-aggregate the data to make it easier to produce
reports, dashboards, and interactive visualizations. Often these data models are described as cubes, in which numeric
data values are aggregated across one or more dimensions (for example, to determine total sales by product and
region). The model encapsulates the relationships between data values and dimensional entities to support "drill-
up/drill-down" analysis.
4. Data visualization – data analysts consume data from analytical models, and directly from analytical stores to create
reports, dashboards, and other visualizations. Additionally, users in an organization who may not be technology
professionals might perform self-service data analysis and reporting. The visualizations from the data show trends,
comparisons, and key performance indicators (KPIs) for a business or other organization, and can take the form of
printed reports, graphs and charts in documents or PowerPoint presentations, web-based dashboards, and interactive
environments in which users can explore data visually.
Now that you understand a little about the architecture of a large-scale data warehousing solution, and some of the
distributed processing technologies that can be used to handle large volumes of data, it's time to explore how data is
ingested into an analytical data store from one or more sources.
On Azure, large-scale data ingestion is best implemented by creating pipelines that orchestrate ETL processes. You can
create and run pipelines using Azure Data Factory, or you can use a similar pipeline engine in Azure Synapse
Analytics or Microsoft Fabric if you want to manage all of the components of your data analytics solution in a unified
workspace.
In either case, pipelines consist of one or more activities that operate on data. An input dataset provides the source data,
and activities can be defined as a data flow that incrementally manipulates the data until an output dataset is produced.
Pipelines can connect to external data sources to integrate with a wide variety of data services.
Data warehouses
A data warehouse is a relational database in which the data is stored in a schema that is optimized for data analytics rather
than transactional workloads. Commonly, the data from a transactional store is transformed into a schema in which
numeric values are stored in central fact tables, which are related to one or more dimension tables that represent entities
by which the data can be aggregated. For example a fact table might contain sales order data, which can be aggregated by
customer, product, store, and time dimensions (enabling you, for example, to easily find monthly total sales revenue by
product for each store). This kind of fact and dimension table schema is called a star schema; though it's often extended
into a snowflake schema by adding additional tables related to the dimension tables to represent dimensional hierarchies
(for example, product might be related to product categories). A data warehouse is a great choice when you have
transactional data that can be organized into a structured schema of tables, and you want to use SQL to query them.
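As a sketch of how a star schema is typically queried, the following SQL aggregates sales revenue by product and month; the FactSales, DimProduct, and DimDate tables and their columns are hypothetical names used only for illustration:
SQL
SELECT p.ProductName,
       d.CalendarMonth,
       SUM(f.SalesAmount) AS TotalRevenue
FROM FactSales AS f
JOIN DimProduct AS p ON f.ProductKey = p.ProductKey
JOIN DimDate AS d ON f.DateKey = d.DateKey
GROUP BY p.ProductName, d.CalendarMonth
ORDER BY p.ProductName, d.CalendarMonth;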
Data lakehouses
A data lake is a file store, usually on a distributed file system for high performance data access. Technologies like Spark or
Hadoop are often used to process queries on the stored files and return data for reporting and analytics. These systems
often apply a schema-on-read approach to define tabular schemas on semi-structured data files at the point where the
data is read for analysis, without applying constraints when it's stored. Data lakes are great for supporting a mix of
structured, semi-structured, and even unstructured data that you want to analyze without the need for schema
enforcement when the data is written to the store.
You can use a hybrid approach that combines features of data lakes and data warehouses in a lake database or data
lakehouse. The raw data is stored as files in a data lake, and a relational storage layer abstracts the underlying files and
exposes them as tables, which can be queried using SQL. SQL pools in Azure Synapse Analytics include PolyBase, which
enables you to define external tables based on files in a data lake (and other sources) and query them using SQL. Synapse
Analytics also supports a Lake Database approach in which you can use database templates to define the relational schema
of your data warehouse, while storing the underlying data in data lake storage – separating the storage and compute for
your data warehousing solution. Data lakehouses are a relatively new approach in Spark-based systems, and are enabled
through technologies like Delta Lake; which adds relational storage capabilities to Spark, so you can define tables that
enforce schemas and transactional consistency, support batch-loaded and streaming data sources, and provide a SQL API
for querying.
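For example, a serverless SQL pool in Azure Synapse Analytics can query files in a data lake directly. The following sketch reads Parquet files with OPENROWSET; the storage account, container, and folder names are placeholders:
SQL
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/files/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;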
On Azure, there are three main platform-as-a-service (PaaS) services that you can use to implement a large-scale analytical
store: Azure Synapse Analytics, Azure Databricks, and Azure HDInsight.
Azure Synapse Analytics is a unified, end-to-end solution for large scale data analytics. It brings together
multiple technologies and capabilities, enabling you to combine the data integrity and reliability of a scalable, high-
performance SQL Server based relational data warehouse with the flexibility of a data lake and open-source Apache Spark.
It also includes native support for log and telemetry analytics with Azure Synapse Data Explorer pools, as well as built in
data pipelines for data ingestion and transformation. All Azure Synapse Analytics services can be managed through a
single, interactive user interface called Azure Synapse Studio, which includes the ability to create interactive notebooks in
which Spark code and markdown content can be combined. Synapse Analytics is a great choice when you want to create a
single, unified analytics solution on Azure.
Note
Each of these services can be thought of as an analytical data store, in the sense that they provide a schema and interface
through which the data can be queried. In many cases however, the data is actually stored in a data lake and the service is
used to process the data and run queries. Some solutions might even combine the use of these services. An extract, load,
and transform (ELT) ingestion process might copy data into the data lake, and then use one of these services to transform
the data, and another to query it. For example, a pipeline might use a MapReduce job running in HDInsight or a notebook
running in Azure Databricks to process a large volume of data in the data lake, and then load it into tables in a SQL pool in
Azure Synapse Analytics.
Exercise: Explore Azure Synapse Analytics
In this exercise, you'll create an Azure Synapse Analytics workspace and use it to ingest and analyze some data.
The exercise is designed to familiarize you with some key elements of a large-scale data warehousing solution, not as a
comprehensive guide to performing advanced data analysis with Azure Synapse Analytics. The exercise should take around
30 minutes to complete.
Note
To complete this lab, you will need an Azure subscription in which you have administrative access.
Scalable analytics with PaaS services can be complex, fragmented, and expensive. With Microsoft Fabric, you don't have to
spend all of your time combining various services and implementing interfaces through which business users can access
them. Instead, you can use a single product that is easy to understand, set up, create, and manage. Fabric is a unified
software-as-a-service (SaaS) offering, with all your data stored in a single open format in OneLake.
OneLake is Fabric's lake-centric architecture that provides a single, integrated environment for data professionals and the
business to collaborate on data projects. Think of it like OneDrive for data; OneLake combines storage locations across
different regions and clouds into a single logical lake, without moving or duplicating data. Data can be stored in any file
format in OneLake and can be structured or unstructured. For tabular data, the analytical engines in Fabric will write data in
delta format when writing to OneLake. All engines will know how to read this format and treat delta files as tables no
matter which engine writes it.
Exercise: Explore Microsoft Fabric
In this exercise, you'll create a Microsoft Fabric workspace and use it to ingest and analyze some data.
The exercise is designed to familiarize you with some key elements of a large-scale data analytics solution, not as a
comprehensive guide to performing advanced data analysis with Microsoft Fabric. The exercise should take around 25
minutes to complete.
Note
You need a Microsoft Fabric trial license with the Fabric preview enabled in your tenant. See Getting started with
Fabric to enable your Fabric trial license.
Choose the best response for each of the questions below. Then select Check your answers.
Which Azure PaaS services can you use to create a pipeline for data ingestion and processing?
That's correct. Both Azure Synapse Analytics and Azure Data Factory include the capability to create pipelines.
What must you define to implement a pipeline that reads data from Azure Blob Storage?
A linked service for your Azure Blob Storage account
That's correct. You need to create linked services for external services you want to use in the pipeline.
That's incorrect. A dedicated SQL pool is required to support a relational data warehouse.
Which open-source distributed processing engine does Azure Synapse Analytics include?
Apache Hadoop
Apache Spark
1 minute
Large-scale data analytics is a complex workload that can involve many different technologies. This module has provided a
high-level overview of the key features of an analytics solution, and explored some of the Microsoft services that you can
use to implement one.
Increased use of technology by individuals, companies, and other organizations, together with the proliferation of smart
devices and Internet access has led to a massive growth in the volume of data that can be generated, captured, and
analyzed. Much of this data can be processed in real-time (or at least, near real-time) as a perpetual stream of data,
enabling the creation of systems that reveal instant insights and trends, or take immediate responsive action to events as
they occur.
Learning objectives
In this module, you'll learn about the basics of stream processing and real-time analytics, and the services in Microsoft Azure that you can use to implement real-time data processing solutions.
Data processing is simply the conversion of raw data to meaningful information through a process. There are two general
ways to process data:
Batch processing, in which multiple data records are collected and stored before being processed together in a single
operation.
Stream processing, in which a source of data is constantly monitored and processed in real time as new data events
occur.
For example, suppose you want to analyze road traffic by counting the number of cars on a stretch of road. A batch
processing approach to this would require that you collect the cars in a parking lot, and then count them in a single
operation while they're at rest.
If the road is busy, with a large number of cars driving along at frequent intervals, this approach may be impractical; and
note that you don't get any results until you have parked a batch of cars and counted them.
A real world example of batch processing is the way that credit card companies handle billing. The customer doesn't
receive a bill for each separate credit card purchase but one monthly bill for all of that month's purchases.
Disadvantages of batch processing include:
The time delay between ingesting the data and getting the results.
All of a batch job's input data must be ready before the batch can be processed, which means the data must be carefully checked. Problems with data, errors, and program crashes that occur during batch jobs bring the whole process to a halt, and the input data must be carefully checked before the job can be rerun. Even minor data errors can prevent a batch job from running.
In a stream processing approach, you don't need to wait until all of the cars have parked to start processing them, and you can aggregate the data over time intervals; for example, by counting the number of cars that pass each minute.
Real-world examples of stream processing include:
A financial institution tracks changes in the stock market in real time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.
An online gaming company collects real-time data about player-game interactions and feeds the data into its gaming platform. It then analyzes the data in real time, offering incentives and dynamic experiences to engage its players.
A real-estate website tracks a subset of data from mobile devices and makes real-time recommendations of properties to visit based on visitors' geo-location.
Stream processing is ideal for time-critical operations that require an instant real-time response. For example, a system
that monitors a building for smoke and heat needs to trigger alarms and unlock doors to allow residents to escape
immediately in the event of a fire.
Even when real-time analysis or visualization of data is not required, streaming technologies are often used to capture real-
time data and store it in a data store for subsequent batch processing (this is the equivalent of redirecting all of the cars
that travel along a road into a parking lot before counting them).
The following diagram shows some ways in which batch and stream processing can be combined in a large-scale data
analytics architecture.
1. Data events from a streaming data source are captured in real-time.
2. Data from other sources is ingested into a data store (often a data lake) for batch processing.
3. If real-time analytics is not required, the captured streaming data is written to the data store for subsequent batch
processing.
4. When real-time analytics is required, a stream processing technology is used to prepare the streaming data for real-
time analysis or visualization; often by filtering or aggregating the data over temporal windows.
5. The non-streaming data is periodically batch processed to prepare it for analysis, and the results are persisted in an
analytical data store (often referred to as a data warehouse) for historical analysis.
6. The results of stream processing may also be persisted in the analytical data store to support historical analysis.
7. Analytical and visualization tools are used to present and explore the real-time and historical data.
Note
Commonly used solution architectures for combined batch and stream data processing
include lambda and delta architectures. Details of these architectures are beyond the scope of this course, but they
incorporate technologies for both large-scale batch data processing and real-time stream processing to create an end-to-
end analytical solution.
Explore common elements of stream processing
architecture
Completed100 XP
4 minutes
There are many technologies that you can use to implement a stream processing solution, but while specific
implementation details may vary, there are common elements to most streaming architectures.
1. An event generates some data. This might be a signal being emitted by a sensor, a social media message being posted, a log
file entry being written, or any other occurrence that results in some digital data.
2. The generated data is captured in a streaming source for processing. In simple cases, the source may be a folder in a cloud data
store or a table in a database. In more robust streaming solutions, the source may be a "queue" that encapsulates logic to
ensure that event data is processed in order and that each event is processed only once.
3. The event data is processed, often by a perpetual query that operates on the event data to select data for specific types of
events, project data values, or aggregate data values over temporal (time-based) periods (or windows) - for example, by
counting the number of sensor emissions per minute.
4. The results of the stream processing operation are written to an output (or sink), which may be a file, a database table, a real-
time visual dashboard, or another queue for further processing by a subsequent downstream query.
The following services are commonly used to process streaming data on Azure:
Azure Stream Analytics: A platform-as-a-service (PaaS) solution that you can use to define streaming jobs that ingest data from a streaming source, apply a perpetual query, and write the results to an output.
Spark Structured Streaming: An open-source library that enables you to develop complex streaming solutions on Apache
Spark based services, including Azure Synapse Analytics, Azure Databricks, and Azure HDInsight.
Azure Data Explorer: A high-performance database and analytics service that is optimized for ingesting and querying batch or
streaming data with a time-series element, and which can be used as a standalone Azure service or as an Azure Synapse Data
Explorer runtime in an Azure Synapse Analytics workspace.
The following services are commonly used to ingest data for stream processing on Azure:
Azure Event Hubs: A data ingestion service that you can use to manage queues of event data, ensuring that each event is
processed in order, exactly once.
Azure IoT Hub: A data ingestion service that is similar to Azure Event Hubs, but which is optimized for managing event data
from Internet-of-things (IoT) devices.
Azure Data Lake Storage Gen2: A highly scalable storage service that is often used in batch processing scenarios, but which can also be used as a source of streaming data.
Apache Kafka: An open-source data ingestion solution that is commonly used together with Apache Spark. You can use Azure
HDInsight to create a Kafka cluster.
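For example, a Spark Structured Streaming job can read directly from a Kafka topic. The following sketch assumes an existing Spark session with the Kafka connector package available; the broker address and topic name are hypothetical.

```python
# Minimal sketch: reading a Kafka topic as a streaming source in Spark.
# Assumes an existing SparkSession ("spark") with the Kafka connector package;
# the broker address and topic name are hypothetical.
raw_stream = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "my-kafka-broker:9092")
         .option("subscribe", "device-telemetry")
         .load()
)

# Kafka delivers each message payload as bytes; cast it to a string before parsing.
messages = raw_stream.selectExpr("CAST(value AS STRING) AS json_payload")
```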
Sinks for stream processing
The output from stream processing is often sent to the following services:
Azure Event Hubs: Used to queue the processed data for further downstream processing.
Azure Data Lake Storage Gen2 or Azure Blob Storage: Used to persist the processed results as a file.
Azure SQL Database, Azure Synapse Analytics, or Azure Databricks: Used to persist the processed results in a database table for querying and analysis.
Microsoft Power BI: Used to generate real-time data visualizations in reports and dashboards.
Explore Azure Stream Analytics
Completed100 XP
2 minutes
Azure Stream Analytics is a service for complex event processing and analysis of streaming data. Stream Analytics is used
to:
Ingest data from an input, such as an Azure event hub, Azure IoT Hub, or Azure Storage blob container.
Process the data by using a query to select, project, and aggregate data values.
Write the results to an output, such as Azure Data Lake Storage Gen2, Azure SQL Database, Azure Synapse Analytics, Azure Functions, Azure Event Hubs, Microsoft Power BI, or others.
Once started, a Stream Analytics query will run perpetually, processing new data as it arrives in the input and storing
results in the output.
Azure Stream Analytics is a great technology choice when you need to continually capture data from a streaming source,
filter or aggregate it, and send the results to a data store or downstream process for analysis and reporting.
Azure Stream Analytics jobs and clusters
The easiest way to use Azure Stream Analytics is to create a Stream Analytics job in an Azure subscription, configure its
input(s) and output(s), and define the query that the job will use to process the data. The query is expressed using
structured query language (SQL) syntax, and can incorporate static reference data from multiple data sources to supply
lookup values that can be combined with the streaming data ingested from an input.
If your stream processing requirements are complex or resource-intensive, you can create a Stream Analytics cluster, which
uses the same underlying processing engine as a Stream Analytics job, but in a dedicated tenant (so your processing is not
affected by other customers) and with configurable scalability that enables you to define the right balance of throughput
and cost for your specific scenario.
15 minutes
Now it's your opportunity to explore Azure Stream Analytics in a sample solution that aggregates streaming data from a
simulated IoT device.
Note
To complete this lab, you will need an Azure subscription in which you have administrative access.
3 minutes
Apache Spark is a distributed processing framework for large-scale data analytics. You can use Spark on Microsoft Azure in services such as Azure Synapse Analytics, Azure Databricks, and Azure HDInsight.
Spark can be used to run code (usually written in Python, Scala, or Java) in parallel across multiple cluster nodes, enabling
it to process very large volumes of data efficiently. Spark can be used for both batch processing and stream processing.
Spark Structured Streaming is built on a ubiquitous structure in Spark called a dataframe, which encapsulates a table of
data. You use the Spark Structured Streaming API to read data from a real-time data source, such as a Kafka hub, a file
store, or a network port, into a "boundless" dataframe that is continually populated with new data from the stream. You
then define a query on the dataframe that selects, projects, or aggregates the data - often in temporal windows. The
results of the query generate another dataframe, which can be persisted for analysis or further processing.
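The following is a minimal sketch of that pattern using the PySpark Structured Streaming API. The source folder, schema, and console sink are assumptions chosen for illustration rather than a production design.

```python
# Minimal Spark Structured Streaming sketch: read a stream, aggregate over
# temporal windows, and write the results to a sink. Paths and schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

# A "boundless" DataFrame that is continually populated as new JSON files arrive.
events = (
    spark.readStream
         .schema(schema)
         .json("abfss://stream@mydatalake.dfs.core.windows.net/devices/")
)

# A perpetual query that counts events per device over one-minute windows.
event_counts = (
    events.groupBy(window(events.event_time, "1 minute"), events.device_id)
          .agg(count("device_id").alias("event_count"))
)

# Write the query results to a sink; the console sink is used here just for illustration.
query = (
    event_counts.writeStream
                .outputMode("complete")
                .format("console")
                .start()
)
```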
Spark Structured Streaming is a great choice for real-time analytics when you need to incorporate streaming data into a
Spark based data lake or analytical data store.
Note
For more information about Spark Structured Streaming, see the Spark Structured Streaming programming guide.
Delta Lake
Delta Lake is an open-source storage layer that adds support for transactional consistency, schema enforcement, and other
common data warehousing features to data lake storage. It also unifies storage for streaming and batch data, and can be
used in Spark to define relational tables for both batch and stream processing. When used for stream processing, a Delta
Lake table can be used as a streaming source for queries against real-time data, or as a sink to which a stream of data is
written.
The Spark runtimes in Azure Synapse Analytics and Azure Databricks include support for Delta Lake.
Delta Lake combined with Spark Structured Streaming is a good solution when you need to abstract batch and stream
processed data in a data lake behind a relational schema for SQL-based querying and analysis.
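As a brief sketch of that combination (building on the streaming DataFrame from the earlier example, with hypothetical table and checkpoint paths), a delta table can serve as both the sink and the source of a stream:

```python
# Minimal sketch: using a Delta Lake table as a streaming sink and as a streaming source.
# "events" is assumed to be a streaming DataFrame such as the one defined in the
# earlier sketch; the table and checkpoint paths are hypothetical.

# Sink: continuously append incoming events to a delta table in the data lake.
sink_query = (
    events.writeStream
          .format("delta")
          .outputMode("append")
          .option("checkpointLocation", "/delta/checkpoints/events")
          .start("/delta/tables/events")
)

# Source: treat the same delta table as a stream, so downstream queries can
# process new rows as they arrive.
events_stream = spark.readStream.format("delta").load("/delta/tables/events")
```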
Note
For more information about Delta Lake, see What is Delta Lake?
15 minutes
In this exercise, you'll use Spark Structured Streaming and delta tables in Azure Synapse Analytics to process streaming
data.
Note
To complete this lab, you will need an Azure subscription in which you have administrative access.
Microsoft Fabric includes native support for real-time data analytics, including real-time data ingestion from multiple
streaming sources.
In Microsoft Fabric, you can use an eventstream to capture real-time event data from a streaming source and persist it in a
destination such as a table in a Lakehouse or a KQL database.
When writing eventstream data to a Lakehouse table, you can apply aggregations and filters to summarize the captured
data. A KQL database supports tables based on the Data Explorer engine, enabling you to perform real-time analytics on
the data in tables by running KQL queries. After capturing real-time data in a table, you can use Power BI in Microsoft
Fabric to create real-time data visualizations.
25 minutes
In this exercise, you'll use real-time analytics in Microsoft Fabric to process streaming data.
Note
You need a Microsoft Fabric trial license with the Fabric preview enabled in your tenant. See Getting started with
Fabric to enable your Fabric trial license.
Data is collected in a temporary store, and all records are processed together as a batch.
Which service would you use to continually capture data from an IoT Hub, aggregate it over temporal periods, and
store results in Azure SQL Database?
Azure Stream Analytics
Azure Cosmos DB
Azure Storage
Summary
Completed100 XP
1 minute
Real-time processing is a common element of enterprise data analytics solutions. Microsoft Azure offers a variety of
services that you can use to implement stream processing and real-time analysis.
Introduction
Completed100 XP
1 minute
Data modeling and visualization is at the heart of business intelligence (BI) workloads that are supported by large-scale
data analytics solutions. Essentially, data visualization powers reporting and decision making that helps organizations
succeed.
In this module, you'll learn about fundamental principles of analytical data modeling and data visualization, using
Microsoft Power BI as a platform to explore these principles in action.
Learning objectives
After completing this module, you'll be able to:
Describe a high-level process for creating reporting solutions with Microsoft Power BI
Describe core principles of analytical data modeling
Identify common types of data visualization and their uses
Create an interactive report with Power BI Desktop
Describe Power BI tools and workflow
Completed100 XP
3 minutes
There are many data visualization tools that data analysts can use to explore data and summarize insights visually, including chart support in productivity tools like Microsoft Excel and built-in data visualization widgets in notebooks used
to explore data in services such as Azure Synapse Analytics and Azure Databricks. However, for enterprise-scale business
analytics, an integrated solution that can support complex data modeling, interactive reporting, and secure sharing is often
required.
Microsoft Power BI
Microsoft Power BI is a suite of tools and services that data analysts can use to build interactive data visualizations for
business users to consume.
A typical workflow for creating a data visualization solution starts with Power BI Desktop, a Microsoft Windows
application in which you can import data from a wide range of data sources, combine and organize the data from these
sources in an analytics data model, and create reports that contain interactive visualizations of the data.
After you've created data models and reports, you can publish them to the Power BI service: a cloud service in which business users can view and interact with reports. You can also do some basic data modeling and report
editing directly in the service using a web browser, but the functionality for this is limited compared to the Power BI
Desktop tool. You can use the service to schedule refreshes of the data sources on which your reports are based, and to
share reports with other users. You can also define dashboards and apps that combine related reports in a single, easy to
consume location.
Users can consume reports, dashboards, and apps in the Power BI service through a web browser, or on mobile devices by
using the Power BI phone app.
Describe core concepts of data modeling
Completed100 XP
5 minutes
Analytical models enable you to structure data to support analysis. Models are based on related tables of data and define
the numeric values that you want to analyze or report (known as measures) and the entities by which you want to
aggregate them (known as dimensions). For example, a model might include a table containing numeric measures for sales
(such as revenue or quantity) and dimensions for products, customers, and time. This would enable you to aggregate sales measures across one or more dimensions (for example, to identify total revenue by customer, or total items sold by product per month). Conceptually, the model forms a multidimensional structure, which is commonly referred to as a cube, in which any point where the dimensions intersect represents an aggregated measure for those dimensions.
Note
Although we commonly refer to an analytical model as a cube, there can be more (or fewer) than three dimensions – it’s
just not easy for us to visualize more than three!
The numeric measures that will be aggregated by the various dimensions in the model are stored in Fact tables. Each row
in a fact table represents a recorded event that has numeric measures associated with it. For example, the Sales table in
the schema below represents sales transactions for individual items, and includes numeric values for quantity sold and
revenue.
This type of schema, where a fact table is related to one or more dimension tables, is referred to as a star schema (imagine there are five dimensions related to a single fact table – the schema would form a five-pointed star!). You can also define a more complex schema in which dimension tables are related to additional tables containing more details (for example, you could represent attributes of product categories in a separate Category table that is related to the Product table), in which case the design is referred to as a snowflake schema. The schema of fact and dimension tables is used to create an analytical model, in which measure aggregations across all dimensions are pre-calculated, making the performance of analysis and reporting activities much faster than calculating the aggregations each time.
Attribute hierarchies
One final thing worth considering about analytical models is the creation of attribute hierarchies that enable you to
quickly drill-up or drill-down to find aggregated values at different levels in a hierarchical dimension. For example, consider
the attributes in the dimension tables we’ve discussed so far. In the Product table, you can form a hierarchy in which each
category might include multiple named products. Similarly, in the Customer table, a hierarchy could be formed to
represent multiple named customers in each city. Finally, in the Time table, you can form a hierarchy of year, month, and
day. The model can be built with pre-aggregated values for each level of a hierarchy, enabling you to quickly change the
scope of your analysis – for example, by viewing total sales by year, and then drilling down to see a more detailed
breakdown of total sales by month.
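To illustrate the idea outside of Power BI, purely as a sketch with made-up rows, drilling up and down a Time hierarchy amounts to aggregating the same fact data at different levels:

```python
# Minimal pandas sketch of drilling down a Time hierarchy (the sample rows are made up).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024],
    "month":   ["Jan", "Jan", "Feb", "Jan"],
    "product": ["Bike", "Helmet", "Bike", "Bike"],
    "revenue": [300.0, 25.0, 450.0, 320.0],
})

# Top of the hierarchy: total revenue by year.
revenue_by_year = sales.groupby("year")["revenue"].sum()

# Drill down one level: total revenue by year and month.
revenue_by_month = sales.groupby(["year", "month"])["revenue"].sum()

print(revenue_by_year)
print(revenue_by_month)
```

In an analytical model these aggregations are pre-calculated rather than computed on demand, which is what makes changing the scope of analysis so fast.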
Analytical modeling in Microsoft Power BI
You can use Power BI to define an analytical model from tables of data, which can be imported from one or more data sources. You can then use the data modeling interface on the Model tab of Power BI Desktop to define your analytical
model by creating relationships between fact and dimension tables, defining hierarchies, setting data types and display
formats for fields in the tables, and managing other properties of your data that help define a rich model for analysis.
Describe considerations for data visualization
Completed100 XP
5 minutes
After you've created a model, you can use it to generate data visualizations that can be included in a report.
There are many kinds of data visualization, some commonly used and some more specialized. Power BI includes an
extensive set of built-in visualizations, which can be extended with custom and third-party visualizations. The rest of this
unit discusses some common data visualizations but is by no means a complete list.
Bar and column charts
Bar and column charts are a good way to visually compare numeric values for discrete categories.
Line charts
Line charts can also be used to compare categorized values and are useful when you need to examine trends, often over
time.
Pie charts
Pie charts are often used in business reports to visually compare categorized values as proportions of a total.
Scatter plots
Scatter plots are useful when you want to compare two numeric measures and identify a relationship or correlation
between them.
Maps
Maps are a great way to visually compare values for different geographic areas or locations.
Interactive reports in Power BI
In Power BI, the visual elements for related data in a report are automatically linked to one another and provide
interactivity. For example, selecting an individual category in one visualization will automatically filter and highlight that
category in other related visualizations in the report. In the image above, the city Seattle has been selected in the Sales by
City and Category column chart, and the other visualizations are filtered to reflect values for Seattle only.
Now it's your chance to explore data modeling and visualization with Microsoft Power BI.
Note
To complete this exercise, you will need a computer running Microsoft Windows.
Choose the best response for each of the questions below. Then select Check your answers.
Which tool should you use to import data from multiple data sources and create a report?
Power BI Desktop
That's correct. Use Power BI Desktop to create reports from a wide range of data sources.
What should you define in your data model to enable drill-up/down analysis?
A measure
A hierarchy
A relationship
3.
Which kind of visualization should you use to analyze pass rates for multiple exams over time?
A pie chart
A scatter plot
A line chart
That's correct. A line chart is ideal for visualizing values over time.
Summary
Completed100 XP
1 minute
Data modeling and visualization enable organizations to extract insights from data. In this module, you learned how to:
Describe a high-level process for creating reporting solutions with Microsoft Power BI
Describe core principles of analytical data modeling
Identify common types of data visualization and their uses
Create an interactive report with Power BI Desktop
Next steps
Now that you've learned about data modeling and visualization, consider learning more about data-related workloads on
Azure by pursuing a Microsoft certification in Azure Data Fundamentals.
Master the basics of Azure: Fundamentals
https://learn.microsoft.com/en-us/collections/n6ga8m0jkgrwk
Microsoft Azure is a cloud computing platform with an ever-expanding set of services to help you build solutions to meet
your business goals. Azure services support everything from simple to complex. Azure has simple web services for hosting
your business presence in the cloud. Azure also supports running fully virtualized computers managing your custom
software solutions. Azure provides a wealth of cloud-based services like remote storage, database hosting, and centralized
account management. Azure also offers new capabilities like artificial intelligence (AI) and Internet of Things (IoT) focused
services.
In this series, you’ll cover cloud computing basics, be introduced to some of the core services provided by Microsoft Azure,
and will learn more about the governance and compliance services that you can use.
Whether you're interested in compute, networking, or storage services; learning about cloud security best practices; or
exploring governance and management options, think of Azure Fundamentals as your curated guide to Azure.
Azure Fundamentals includes interactive exercises that give you hands-on experience with Azure. Many exercises provide a
temporary Azure portal environment called the sandbox, which allows you to practice creating cloud resources for free at
your own pace.
Technical IT experience isn't required; however, having general IT knowledge will help you get the most from your learning
experience.
No matter your goals, Azure Fundamentals has something for you. You should take this course if you:
The Azure Fundamentals learning path series can help you prepare for Exam AZ-900: Microsoft Azure Fundamentals. This
exam includes three knowledge domain areas:
AZ-900 Domain Area Weight
Describe cloud concepts 25-30%
Describe Azure architecture and services 35-40%
Describe Azure management and governance 30-35%
Each domain area maps to a learning path in Azure Fundamentals. The percentages shown indicate the relative weight of
each area on the exam. The higher the percentage, the more questions that part of the exam will contain. Be sure to read
the exam page for specifics about what skills are covered in each area.
In this module, you’ll be introduced to general cloud concepts. You’ll start with an introduction to the cloud in general.
Then you'll dive into concepts like shared responsibility, different cloud models, and explore the unique pricing method for
the cloud.
If you’re already familiar with cloud computing, this module may be largely review for you.
Learning objectives
After completing this module, you’ll be able to:
Cloud computing is the delivery of computing services over the internet. Computing services include common IT
infrastructure such as virtual machines, storage, databases, and networking. Cloud services also expand the traditional IT
offerings to include things like Internet of Things (IoT), machine learning (ML), and artificial intelligence (AI).
Because cloud computing uses the internet to deliver these services, it doesn’t have to be constrained by physical
infrastructure the same way that a traditional datacenter is. That means if you need to increase your IT infrastructure
rapidly, you don’t have to wait to build a new datacenter—you can use the cloud to rapidly expand your IT footprint.
You may have heard of the shared responsibility model, but you may not understand what it means or how it impacts
cloud computing.
Start with a traditional corporate datacenter. The company is responsible for maintaining the physical space, ensuring
security, and maintaining or replacing the servers if anything happens. The IT department is responsible for maintaining all
the infrastructure and software needed to keep the datacenter up and running. They’re also likely to be responsible for
keeping all systems patched and on the correct version.
With the shared responsibility model, these responsibilities get shared between the cloud provider and the consumer.
Physical security, power, cooling, and network connectivity are the responsibility of the cloud provider. The consumer isn’t
collocated with the datacenter, so it wouldn’t make sense for the consumer to have any of those responsibilities.
At the same time, the consumer is responsible for the data and information stored in the cloud. (You wouldn’t want the
cloud provider to be able to read your information.) The consumer is also responsible for access security, meaning you
only give access to those who need it.
Then, for some things, the responsibility depends on the situation. If you’re using a cloud SQL database, the cloud provider
would be responsible for maintaining the actual database. However, you’re still responsible for the data that gets ingested
into the database. If you deployed a virtual machine and installed an SQL database on it, you’d be responsible for database
patches and updates, as well as maintaining the data and information stored in the database.
With an on-premises datacenter, you’re responsible for everything. With cloud computing, those responsibilities shift. The
shared responsibility model is heavily tied into the cloud service types (covered later in this learning path): infrastructure as
a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). IaaS places the most responsibility on the
consumer, with the cloud provider being responsible for the basics of physical security, power, and connectivity. On the
other end of the spectrum, SaaS places most of the responsibility with the cloud provider. PaaS, being a middle ground
between IaaS and SaaS, rests somewhere in the middle and evenly distributes responsibility between the cloud provider
and the consumer.
The following diagram highlights how the Shared Responsibility Model informs who is responsible for what, depending on
the cloud service type.
Depending on the cloud service type, responsibility for areas such as the following is shared between you and the cloud provider:
Operating systems
Network controls
Applications
Identity and infrastructure
Define cloud models
Completed100 XP
4 minutes
What are cloud models? The cloud models define the deployment type of cloud resources. The three main cloud models
are: private, public, and hybrid.
Private cloud
Let’s start with a private cloud. A private cloud is, in some ways, the natural evolution from a corporate datacenter. It’s a
cloud (delivering IT services over the internet) that’s used by a single entity. Private cloud provides much greater control
for the company and its IT department. However, it also comes with greater cost and fewer of the benefits of a public
cloud deployment. Finally, a private cloud may be hosted in your on-site datacenter. It may also be hosted in a
dedicated datacenter offsite, potentially even by a third party that has dedicated that datacenter to your company.
Public cloud
A public cloud is built, controlled, and maintained by a third-party cloud provider. With a public cloud, anyone that wants
to purchase cloud services can access and use resources. The general public availability is a key difference between public
and private clouds.
Hybrid cloud
A hybrid cloud is a computing environment that uses both public and private clouds in an inter-connected environment. A
hybrid cloud environment can be used to allow a private cloud to surge for increased, temporary demand by deploying
public cloud resources. Hybrid cloud can be used to provide an extra layer of security. For example, users can flexibly
choose which services to keep in public cloud and which to deploy to their private cloud infrastructure.
The following table highlights a few key comparative aspects between the cloud models.
Public cloud: No capital expenditures to scale up. Applications can be quickly provisioned and deprovisioned. Organizations pay only for what they use. Organizations don't have complete control over resources and security.
Private cloud: Organizations have complete control over resources and security. Data is not collocated with other organizations' data. Hardware must be purchased for startup and maintenance. Organizations are responsible for hardware maintenance and updates.
Hybrid cloud: Provides the most flexibility. Organizations determine where to run their applications. Organizations control security, compliance, or legal requirements.
Multi-cloud
A fourth, and increasingly likely scenario is a multi-cloud scenario. In a multi-cloud scenario, you use multiple public cloud
providers. Maybe you use different features from different cloud providers. Or maybe you started your cloud journey with
one provider and are in the process of migrating to a different provider. Regardless, in a multi-cloud environment you deal
with two (or more) public cloud providers and manage resources and security in both environments.
Azure Arc
Azure Arc is a set of technologies that helps you manage your cloud environment, whether it's a public cloud solely on Azure, a private cloud in your datacenter, a hybrid configuration, or a multi-cloud environment running on multiple cloud providers at once.
Which cloud model uses some datacenters focused on providing cloud services to anyone that wants them, and
some data centers that are focused on a single customer?
Public cloud
Hybrid cloud
The hybrid cloud model is a combination of public cloud and private cloud, using both datacenters dedicated solely to one
customer and datacenters that are shared with the public.
Multi-cloud
The multi-cloud model leverages multiple public cloud providers to satisfy cloud needs.
3.
According to the shared responsibility model, which cloud service type places the most responsibility on the
customer?
IaaS places the most responsibility on the consumer, with the cloud provider being responsible for the basics of physical
security, power, and connectivity.
2 minutes
In this module, you learned about general cloud concepts. You started with things like just understanding what cloud
computing is. You also learned about the shared responsibility model and how you and your cloud provider share the
responsibility of keeping your information in the cloud secure. You briefly covered the differences between the cloud
models (public, private, hybrid, and multi-cloud). Then, you wrapped up with a unit on how the cloud shifts IT spend from a
capital expense to an operational expense.
Learning objectives
You should now be able to:
Additional resources
The following resources provide more information on topics in this module or related to this module.
Shared responsibility model - The shared responsibility model is the sharing of responsibilities for the cloud between you and
your cloud provider.
Introduction to Azure VMware Solution is a Microsoft Learn course that dives deeper into Azure VMware Solution.
Introduction to Azure hybrid cloud services is a Microsoft Learn course that explains hybrid cloud in greater detail.
Describe the consumption-based model
Completed100 XP
3 minutes
When comparing IT infrastructure models, there are two types of expenses to consider: capital expenditure (CapEx) and operational expenditure (OpEx).
CapEx is typically a one-time, up-front expenditure to purchase or secure tangible resources. A new building, repaving the
parking lot, building a datacenter, or buying a company vehicle are examples of CapEx.
In contrast, OpEx is spending money on services or products over time. Renting a convention center, leasing a company
vehicle, or signing up for cloud services are all examples of OpEx.
Cloud computing falls under OpEx because cloud computing operates on a consumption-based model. With cloud
computing, you don’t pay for the physical infrastructure, the electricity, the security, or anything else associated with
maintaining a datacenter. Instead, you pay for the IT resources you use. If you don’t use any IT resources this month, you
don’t pay for any IT resources.
This consumption-based model has many benefits, including:
No upfront costs.
No need to purchase and manage costly infrastructure that users might not use to its fullest potential.
The ability to pay for more resources when they're needed.
The ability to stop paying for resources that are no longer needed.
With a traditional datacenter, you try to estimate the future resource needs. If you overestimate, you spend more on your
datacenter than you need to and potentially waste money. If you underestimate, your datacenter will quickly reach capacity
and your applications and services may suffer from decreased performance. Fixing an under-provisioned datacenter can
take a long time. You may need to order, receive, and install more hardware. You'll also need to add power, cooling, and
networking for the extra hardware.
In a cloud-based model, you don’t have to worry about getting the resource needs just right. If you find that you need
more virtual machines, you add more. If the demand drops and you don’t need as many virtual machines, you remove
machines as needed. Either way, you’re only paying for the virtual machines that you use, not the “extra capacity” that the
cloud provider has on hand.
To put it another way, cloud computing is a way to rent compute power and storage from someone else’s datacenter. You
can treat cloud resources like you would resources in your own datacenter. However, unlike in your own datacenter, when
you're done using cloud resources, you give them back. You’re billed only for what you use.
Instead of maintaining CPUs and storage in your datacenter, you rent them for the time that you need them. The cloud
provider takes care of maintaining the underlying infrastructure for you. The cloud enables you to quickly solve your
toughest business challenges and bring cutting-edge solutions to your users.
Module 2: Describe the benefits of using cloud services
Introduction
Completed100 XP
1 minute
In this module, you’ll be introduced to some of the benefits that cloud computing offers. You’ll learn how cloud computing
can help you meet variable demand while providing a good experience for your customer. You’ll also learn about security,
governance, and overall manageability in the cloud.
Learning objectives
After completing this module, you’ll be able to:
5 minutes
When building or deploying a cloud application, two of the biggest considerations are uptime (or availability) and the
ability to handle demand (or scale).
High availability
When you’re deploying an application, a service, or any IT resources, it’s important the resources are available when
needed. High availability focuses on ensuring maximum availability, regardless of disruptions or events that may occur.
When you’re architecting your solution, you’ll need to account for service availability guarantees. Azure is a highly available
cloud environment with uptime guarantees depending on the service. These guarantees are part of the service-level
agreements (SLAs).
Scalability
Another major benefit of cloud computing is the scalability of cloud resources. Scalability refers to the ability to adjust
resources to meet demand. If you suddenly experience peak traffic and your systems are overwhelmed, the ability to scale
means you can add more resources to better handle the increased demand.
The other benefit of scalability is that you aren't overpaying for services. Because the cloud is a consumption-based model,
you only pay for what you use. If demand drops off, you can reduce your resources and thereby reduce your costs.
Scaling generally comes in two varieties: vertical and horizontal. Vertical scaling is focused on increasing or decreasing the
capabilities of resources. Horizontal scaling is adding or subtracting the number of resources.
Vertical scaling
With vertical scaling, if you were developing an app and you needed more processing power, you could vertically scale up
to add more CPUs or RAM to the virtual machine. Conversely, if you realized you had over-specified the needs, you could
vertically scale down by lowering the CPU or RAM specifications.
Horizontal scaling
With horizontal scaling, if you suddenly experienced a steep jump in demand, your deployed resources could be scaled out, either automatically or manually. For example, you could add additional virtual machines or containers. In the same manner, if there was a significant drop in demand, deployed resources could be scaled in, either automatically or manually.
Describe the benefits of reliability and predictability in
the cloud
Completed100 XP
2 minutes
Reliability and predictability are two crucial cloud benefits that help you develop solutions with confidence.
Reliability
Reliability is the ability of a system to recover from failures and continue to function. It's also one of the pillars of the
Microsoft Azure Well-Architected Framework.
The cloud, by virtue of its decentralized design, naturally supports a reliable and resilient infrastructure. With a
decentralized design, the cloud enables you to have resources deployed in regions around the world. With this global
scale, even if one region has a catastrophic event, other regions are still up and running. You can design your applications
to automatically take advantage of this increased reliability. In some cases, your cloud environment itself will automatically
shift to a different region for you, with no action needed on your part. You’ll learn more about how Azure leverages global
scale to provide reliability later in this series.
Predictability
Predictability in the cloud lets you move forward with confidence. Predictability can be focused on performance
predictability or cost predictability. Both performance and cost predictability are heavily influenced by the Microsoft Azure
Well-Architected Framework. Deploy a solution that’s built around this framework and you have a solution whose cost and
performance are predictable.
Performance
Performance predictability focuses on predicting the resources needed to deliver a positive experience for your customers.
Autoscaling, load balancing, and high availability are just some of the cloud concepts that support performance
predictability. If you suddenly need more resources, autoscaling can deploy additional resources to meet the demand, and
then scale back when the demand drops. Or if the traffic is heavily focused on one area, load balancing will help redirect
some of the overload to less stressed areas.
Cost
Cost predictability is focused on predicting or forecasting the cost of the cloud spend. With the cloud, you can track your
resource use in real time, monitor resources to ensure that you’re using them in the most efficient way, and apply data
analytics to find patterns and trends that help better plan resource deployments. By operating in the cloud and using cloud
analytics and information, you can predict future costs and adjust your resources as needed. You can even use tools like
the Total Cost of Ownership (TCO) or Pricing Calculator to get an estimate of potential cloud spend.
Describe the benefits of security and governance in the
cloud
Completed100 XP
2 minutes
Whether you’re deploying infrastructure as a service or software as a service, cloud features support governance and
compliance. Things like set templates help ensure that all your deployed resources meet corporate standards and
government regulatory requirements. Plus, you can update all your deployed resources to new standards as standards
change. Cloud-based auditing helps flag any resource that’s out of compliance with your corporate standards and provides
mitigation strategies. Depending on your operating model, software patches and updates may also automatically be
applied, which helps with both governance and security.
On the security side, you can find a cloud solution that matches your security needs. If you want maximum control of
security, infrastructure as a service provides you with physical resources but lets you manage the operating systems and
installed software, including patches and maintenance. If you want patches and maintenance taken care of automatically,
platform as a service or software as a service deployments may be the best cloud strategies for you.
And because the cloud is intended as an over-the-internet delivery of IT resources, cloud providers are typically well suited
to handle things like distributed denial of service (DDoS) attacks, making your network more robust and secure.
By establishing a good governance footprint early, you can keep your cloud footprint updated, secure, and well managed.
Describe the benefits of manageability in the cloud
Completed100 XP
2 minutes
A major benefit of cloud computing is the manageability options. There are two types of manageability for cloud
computing that you’ll learn about in this series, and both are excellent benefits.
Choose the best response for each question. Then select Check your answers.
Which type of scaling involves adding or removing resources (such as virtual machines or containers) to meet
demand?
Vertical scaling
Horizontal scaling
Direct scaling
2.
What is characterized as the ability of a system to recover from failures and continue to function?
Reliability
Reliability is the ability of a system to recover from failures and continue to function, and it is one of the pillars of the
Microsoft Azure Well-Architected Framework.
Predictability
Scalability
Summary
Completed100 XP
2 minutes
In this module, you learned about some of the benefits of operating in the cloud. You learned about high availability and
reliability, and how those work to keep your applications running. You also learned about how the cloud can provide a
more secure environment. Finally, you learned that the cloud provides a highly manageable environment for your
resources.
Learning objectives
You should now be able to:
Additional resources
The following resources provide more information on topics in this module or related to this module.
Build great solutions with the Microsoft Azure Well-Architected Framework is a Microsoft Learn course that introduces you to
the Microsoft Azure Well-Architected Framework.
Module 3: Describe cloud service types
Introduction
Completed100 XP
1 minute
In this module, you’ll be introduced to cloud service types. You’ll learn how each cloud service type determines the
flexibility you’ll have with managing and configuring resources. You'll understand how the shared responsibility model
applies to each cloud service type, and about various use cases for each cloud service type.
Learning objectives
After completing this module, you’ll be able to:
Infrastructure as a service (IaaS) is the most flexible category of cloud services, as it provides you the maximum amount of
control for your cloud resources. In an IaaS model, the cloud provider is responsible for maintaining the hardware, network
connectivity (to the internet), and physical security. You’re responsible for everything else: operating system installation,
configuration, and maintenance; network configuration; database and storage configuration; and so on. With IaaS, you’re
essentially renting the hardware in a cloud datacenter, but what you do with that hardware is up to you.
Some common scenarios where IaaS might make sense include:
Lift-and-shift migration: You're standing up cloud resources similar to your on-premises datacenter, and then moving your on-premises workloads to run on the IaaS infrastructure.
Testing and development: You have established configurations for development and test environments that you need
to rapidly replicate. You can stand up or shut down the different environments rapidly with an IaaS structure, while
maintaining complete control.
Platform as a service (PaaS) is a middle ground between renting space in a datacenter (infrastructure as a service) and
paying for a complete and deployed solution (software as a service). In a PaaS environment, the cloud provider maintains
the physical infrastructure, physical security, and connection to the internet. They also maintain the operating systems,
middleware, development tools, and business intelligence services that make up a cloud solution. In a PaaS scenario, you
don't have to worry about the licensing or patching for operating systems and databases.
PaaS is well suited to provide a complete development environment without the headache of maintaining all the
development infrastructure.
Shared responsibility model
The shared responsibility model applies to all the cloud service types. PaaS splits the responsibility between you and the
cloud provider. The cloud provider is responsible for maintaining the physical infrastructure and its access to the internet,
just like in IaaS. In the PaaS model, the cloud provider will also maintain the operating systems, databases, and
development tools. Think of PaaS like using a domain joined machine: IT maintains the device with regular updates,
patches, and refreshes.
Depending on the configuration, you or the cloud provider may be responsible for networking settings and connectivity
within your cloud environment, network and application security, and the directory infrastructure.
Scenarios
Some common scenarios where PaaS might make sense include:
Development framework: PaaS provides a framework that developers can build upon to develop or customize cloud-
based applications. Similar to the way you create an Excel macro, PaaS lets developers create applications using built-
in software components. Cloud features such as scalability, high-availability, and multi-tenant capability are included,
reducing the amount of coding that developers must do.
Analytics or business intelligence: Tools provided as a service with PaaS allow organizations to analyze and mine their
data, finding insights and patterns and predicting outcomes to improve forecasting, product design decisions,
investment returns, and other business decisions.
Describe Software as a Service
Completed100 XP
2 minutes
Software as a service (SaaS) is the most complete cloud service model from a product perspective. With SaaS, you’re
essentially renting or using a fully developed application. Email, financial software, messaging applications, and
connectivity software are all common examples of a SaaS implementation.
While the SaaS model may be the least flexible, it’s also the easiest to get up and running. It requires the least amount of
technical knowledge or expertise to fully employ.
Knowledge check
Completed200 XP
3 minutes
Choose the best response for each question. Then select Check your answers.
1.
Which cloud service type is most suited to a lift and shift migration from an on-premises datacenter to a cloud
deployment?
With an IaaS service type, you can approximate your on-premises environment, making a lift-and-shift transition to the
cloud relatively straightforward.
What type of cloud service type would a Finance and Expense tracking solution typically be in?
SaaS provides access to software solutions, such as finance and expense tracking, email, or ticketing systems.
Summary
Completed100 XP
2 minutes
In this module, you learned about the cloud service types and some common scenarios for each type. You also reinforced
how the shared responsibility model determines your responsibilities with different cloud service types.
Learning objectives
You should now be able to:
In this module, you’ll be introduced to factors that impact costs in Azure and tools to help you both predict potential costs
and monitor and control costs.
Learning objectives
After completing this module, you’ll be able to:
When you move to the cloud, much of your IT spend becomes an operational expenditure (OpEx), and that OpEx cost can be impacted by many factors. Some of the impacting factors are:
Resource type
Consumption
Maintenance
Geography
Subscription type
Azure Marketplace
Resource type
A number of factors influence the cost of Azure resources. The type of resources, the settings for the resource, and the
Azure region will all have an impact on how much a resource costs. When you provision an Azure resource, Azure creates
metered instances for that resource. The meters track the resources' usage and generate a usage record that is used to
calculate your bill.
Examples
With a storage account, you specify a type such as blob, a performance tier, an access tier, redundancy settings, and a
region. Creating the same storage account in different regions may show different costs and changing any of the settings
may also impact the price.
With a virtual machine (VM), you may have to consider licensing for the operating system or other software, the processor
and number of cores for the VM, the attached storage, and the network interface. Just like with storage, provisioning the
same virtual machine in different regions may result in different costs.
Consumption
Pay-as-you-go has been a consistent theme throughout, and that’s the cloud payment model where you pay for the
resources that you use during a billing cycle. If you use more compute this cycle, you pay more. If you use less in the
current cycle, you pay less. It's a straightforward pricing mechanism that allows for maximum flexibility.
However, Azure also offers the ability to commit to using a set amount of cloud resources in advance and receive discounts on those “reserved” resources. Many services, including databases, compute, and storage, provide the option to commit to a level of use and receive a discount, in some cases up to 72 percent.
When you reserve capacity, you’re committing to using and paying for a certain amount of Azure resources during a given
period (typically one or three years). With the back-up of pay-as-you-go, if you see a sudden surge in demand that
eclipses what you’ve pre-reserved, you just pay for the additional resources in excess of your reservation. This model allows
you to recognize significant savings on reliable, consistent workloads while also having the flexibility to rapidly increase
your cloud footprint as the need arises.
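As a purely hypothetical illustration of how that trade-off works (the rates and discount below are made up and are not Azure prices):

```python
# Hypothetical comparison of reserved capacity vs. pay-as-you-go.
# All numbers are illustrative only; they are not Azure prices.
payg_rate = 0.10          # on-demand cost per compute hour
reserved_discount = 0.60  # discount applied to reserved hours
reserved_hours = 700      # hours covered by the reservation this month
actual_hours = 900        # hours actually consumed this month

reserved_cost = reserved_hours * payg_rate * (1 - reserved_discount)
overflow_cost = max(actual_hours - reserved_hours, 0) * payg_rate

print(f"Reservation + pay-as-you-go overflow: ${reserved_cost + overflow_cost:.2f}")
print(f"Pay-as-you-go only:                   ${actual_hours * payg_rate:.2f}")
```

In this made-up example, the reservation covers the predictable baseline at a discount, and only the hours above the reservation are billed at the on-demand rate.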
Maintenance
The flexibility of the cloud makes it possible to rapidly adjust resources based on demand. Using resource groups can help
keep all of your resources organized. In order to control costs, it’s important to maintain your cloud environment. For
example, every time you provision a VM, additional resources such as storage and networking are also provisioned. If you
deprovision the VM, those additional resources may not deprovision at the same time, either intentionally or
unintentionally. By keeping an eye on your resources and making sure you’re not keeping around resources that are no
longer needed, you can help control cloud costs.
Geography
When you provision most resources in Azure, you need to define a region where the resource deploys. Azure infrastructure
is distributed globally, which enables you to deploy your services centrally or closest to your customers, or something in
between. With this global deployment comes global pricing differences. The cost of power, labor, taxes, and fees vary
depending on the location. Due to these variations, Azure resources can differ in costs to deploy depending on the region.
Network traffic is also impacted based on geography. For example, it’s less expensive to move information within Europe
than to move information from Europe to Asia or South America.
Network Traffic
Billing zones are a factor in determining the cost of some Azure services.
Bandwidth refers to data moving in and out of Azure datacenters. Some inbound data transfers (data going into Azure
datacenters) are free. For outbound data transfers (data leaving Azure datacenters), data transfer pricing is based on zones.
A zone is a geographical grouping of Azure regions for billing purposes. The bandwidth pricing page has additional
information on pricing for data ingress, egress, and transfer.
Subscription type
Some Azure subscription types also include usage allowances, which affect costs.
For example, an Azure free trial subscription provides access to a number of Azure products that are free for 12 months. It
also includes credit to spend within your first 30 days of sign-up. You'll get access to more than 25 products that are
always free (based on resource and region availability).
Azure Marketplace
Azure Marketplace lets you purchase Azure-based solutions and services from third-party vendors. This could be a server with software preinstalled and configured, a managed network firewall appliance, or connectors to third-party backup services. When you purchase products through Azure Marketplace, you may pay for not only the Azure services that you're
using, but also the services or expertise of the third-party vendor. Billing structures are set by the vendor.
All solutions available in Azure Marketplace are certified and compliant with Azure policies and standards. The certification
policies may vary based on the service or solution type and Azure service involved. Commercial marketplace certification
policies has additional information on Azure Marketplace certifications.