DevOps Notes by Microsoft

Contents
About DevOps
What is DevOps?
What is Agile?
What is Scrum?
What is Kanban?
What is monitoring?
Security in DevOps (DevSecOps)

What is DevOps?
DevOps combines development (Dev) and operations (Ops) to unite people, process,
and technology in application planning, development, delivery, and operations. DevOps
enables coordination and collaboration between formerly siloed roles like development,
IT operations, quality engineering, and security.
Teams adopt DevOps culture, practices, and tools to increase confidence in the
applications they build, respond better to customer needs, and achieve business goals
faster. DevOps helps teams continually provide value to customers by producing better,
more reliable products.
The DevOps application lifecycle comprises four phases: plan, develop, deliver, and operate.

Continuous delivery
Continuous Delivery (CD) is a process by which code is built, tested, and deployed to
one or more test and production environments. Deploying and testing in multiple
environments increases quality. CD systems produce deployable artifacts, including
infrastructure and apps. Automated release processes consume these artifacts to release
new versions and fixes to existing systems. Systems that monitor and send alerts run
continually to drive visibility into the entire CD process.
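As a rough sketch of this flow, assuming placeholder build and deployment scripts rather than any specific CD product, an automated release process consumes a build artifact and promotes it through environments:

    ./build.sh                             # build and test the code, producing a deployable artifact
    ./deploy.sh artifacts/app.zip test     # release the artifact to a test environment
    ./run-smoke-tests.sh test              # deploying and testing in multiple environments increases quality
    ./deploy.sh artifacts/app.zip prod     # release the new version or fix to production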
Version Control
Version control is the practice of managing code in versions—tracking revisions and
change history to make code easy to review and recover. This practice is usually
implemented using version control systems such as Git, which allow multiple developers
to collaborate in authoring code. These systems provide a clear process to merge code
changes that happen in the same files, handle conflicts, and roll back changes to earlier
states.
Infrastructure as code
Infrastructure as code defines system resources and topologies in a descriptive manner
that allows teams to manage those resources as they would code. Those definitions can
also be stored and versioned in version control systems, where they can be reviewed
and reverted—again like code.
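For example, an infrastructure definition file can be stored, reviewed, and reverted with the same Git commands used for application code (the file name below is only illustrative):

    git add infra/network.tf                      # a descriptive definition of system resources
    git commit -m "Add virtual network definition"
    git log -- infra/network.tf                   # review the change history of the definition
    git revert HEAD                               # roll the definition back to an earlier state, like code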
Configuration management
Configuration management refers to managing the state of resources in a system
including servers, virtual machines, and databases. Using configuration management
tools, teams can roll out changes in a controlled, systematic way, reducing the risks of
modifying system configuration. Teams use configuration management tools to track
system state and help avoid configuration drift, which is how a system resource’s
configuration deviates over time from the desired state defined for it.
Combined with infrastructure as code, configuration management makes it easy to templatize and automate system definition and configuration, which helps teams operate complex environments at scale.
Continuous monitoring
Continuous monitoring means having full, real-time visibility into the performance and
health of the entire application stack. This visibility ranges from the underlying
infrastructure running the application to higher-level software components. Visibility is
accomplished through the collection of telemetry and metadata and setting of alerts for
predefined conditions that warrant attention from an operator. Telemetry comprises
event data and logs collected from various parts of the system, which are stored where
they can be analyzed and queried.
High-performing DevOps teams ensure they set actionable, meaningful alerts and
collect rich telemetry so they can draw insights from vast amounts of data. These
insights help the team mitigate issues in real time and see how to improve the
application in future development cycles.
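As a minimal illustration, a script along these lines (the endpoint URL is a placeholder) collects one piece of telemetry, the HTTP status of a health endpoint, and raises an alert when a predefined condition is met:

    status=$(curl -s -o /dev/null -w "%{http_code}" https://contoso.example.com/health)
    if [ "$status" -ne 200 ]; then
      echo "ALERT: health endpoint returned HTTP $status"   # a real system would notify an on-call operator
    fi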
Planning
In the planning phase, DevOps teams ideate, define, and describe the features and
capabilities of the applications and systems they plan to build. Teams track task progress
at low and high levels of granularity, from single products to multiple product portfolios.
Teams use the following DevOps practices to plan with agility and visibility:
Create backlogs.
Track bugs.
Manage Agile software development with Scrum.
Use Kanban boards.
Visualize progress with dashboards.
For an overview of the several lessons learned and practices Microsoft adopted to
support DevOps planning across the company's software teams, see How Microsoft
plans with DevOps.
Development
The development phase includes all aspects of developing software code: writing, testing, reviewing, and integrating code, and building it into artifacts that can be deployed into various environments. To innovate rapidly without sacrificing quality, stability, and productivity, DevOps teams use highly productive tools, automate mundane and manual steps, and iterate in small increments through automated testing and continuous integration.
For an overview of the development practices Microsoft adopted to support their shift
to DevOps, see How Microsoft develops with DevOps.
Deliver
Delivery is the process of consistently and reliably deploying applications into
production environments, ideally via continuous delivery (CD).
Safe deployment practices can identify issues before they affect the customer
experience. These practices help DevOps teams deliver frequently with ease, confidence,
and peace of mind.
Core DevOps principles and processes Microsoft evolved to provide efficient delivery
systems are described in How Microsoft delivers software with DevOps.
Operations
The operations phase involves maintaining, monitoring, and troubleshooting
applications in production environments, including hybrid or public clouds like Azure.
DevOps teams aim for system reliability, high availability, strong security, and zero
downtime.
Automated delivery and safe deployment practices help teams identify and mitigate
issues quickly when they occur. Maintaining vigilance requires rich telemetry, actionable
alerting, and full visibility into applications and underlying systems.
Practices Microsoft uses to operate complex online platforms are described in How
Microsoft operates reliable systems with DevOps.
Next steps
Plan efficient workloads with DevOps
Develop modern software with DevOps
Deliver quality services with DevOps
Operate reliable systems with DevOps
Other resources
DevOps solutions on Azure
The DevOps journey at Microsoft
Start doing DevOps with Azure
Security in DevOps (DevSecOps)
What is platform engineering?
The planning phase of DevOps is often seen as the first stage of DevOps, which isn't
quite accurate. In practice, modern software teams work in tight cycles where each
phase continuously informs the others through lessons that are learned.
Sometimes those lessons are positive. Sometimes they're negative. And sometimes
they're neutral information that the team needs so that it can make strategic decisions
for the future. The industry has coalesced around a single adjective to describe the
ability to quickly adapt to the changing circumstances that these lessons create: Agile.
The term has become so ubiquitous that it's now a synonym for most forms of DevOps
planning.
What is Agile?
Agile describes a pragmatic approach to software development that emphasizes
incremental delivery, team collaboration, continual planning, and continual learning. It's
not a specific set of tools or practices, but rather a planning mindset that's always open
to change and compromise.
Teams that employ Agile development practices shorten their development life cycle in
order to produce usable software on a consistent schedule. The continuous focus on
delivering quality to end users makes it possible for the overall project to rapidly adapt
to evolving needs. To start seeing these kinds of returns, teams need to establish some
procedures along the way.
Next steps
Microsoft was one of the first major companies to adopt DevOps for planning large-
scale software projects. Learn about how Microsoft plans in DevOps.
Looking for a hands-on DevOps experience? Check out the Evolve your DevOps
practices learning path. It primarily features Azure DevOps, but the concepts and
experience apply equally to planning in other DevOps platforms, such as GitHub.
Learn more about platform engineering, where you can use building blocks from
Microsoft and other vendors to create deeply personalized, optimized, and secure
developer experiences.
What is Agile?
Article • 11/28/2022
The Agile Manifesto values:
Individuals and interactions over processes and tools.
Working software over comprehensive documentation.
Customer collaboration over contract negotiation.
Responding to change over following a plan.
The manifesto doesn't imply that the items on the right side of these statements aren't important or needed. Rather, items on the left are simply more valued.
Scrum is the most common Agile framework, and the one that most people start with.
Agile practices, on the other hand, are techniques that are applied during phases of the
software development lifecycle.
These practices, like all Agile practices, carry the Agile label, because they're consistent
with the principles in the Agile manifesto.
Agile isn't cowboy coding. Agile shouldn't be confused with a "we'll figure it out
as we go" approach to software development. Such an idea couldn't be further
from the truth. Agile requires both a definition of done and explicit value that's
delivered to customers in every sprint. While Agile values autonomy for individuals
and teams, it emphasizes aligned autonomy to ensure that the increased
autonomy produces increased value.
Agile isn't without rigor and planning. On the contrary, Agile methodologies and
practices typically emphasize discipline in planning. The key is continual planning
throughout the project, not just planning up front. Continual planning ensures that
the team can learn from the work that they execute. Through this approach, they
maximize the return on investment (ROI) of planning.
Agile isn't an excuse for the lack of a roadmap. This misconception has probably
done the most harm to the Agile movement overall. Organizations and teams that
follow an Agile approach absolutely know where they're going and the results that
they want to achieve. Recognizing change as part of the process is different from
pivoting in a new direction every week, sprint, or month.
Agile isn't development without specifications. It's necessary in any project to
keep your team aligned on why and how work happens. An Agile approach to
specs includes ensuring that specs are right-sized, and that they reflect
appropriately how the team sequences and delivers work.
Agile isn't incapable of accommodating unplanned work and other interruptions.
It's important to complete sprints on schedule. But just because an issue comes up
that sidetracks development doesn't mean that a sprint has to fail. Teams can plan
for interruptions by designating resources ahead of time for problems and
unexpected issues. Then they can address those issues but stay on track with
development.
Agile isn't inappropriate for large organizations. A common complaint is that
collaboration, a key component of Agile methodologies, is difficult in large teams.
Another gripe is that scalable approaches to Agile introduce structure and
methods that compromise flexibility. In spite of these misconceptions, it's possible
to scale Agile principles successfully. For information about overcoming these
difficulties, see Scaling Agile to large teams.
Agile isn't inefficient. To adapt to customers' changing needs, developers invest
time each iteration to demonstrate a working product and collect feedback. It's
true that these efforts reduce the time that they spend on development. But
incorporating customer requests early on saves significant time later. When
features stay aligned with the customer's vision, developers avoid major overhauls
down the line.
Agile isn't a poor fit for today's applications, which often center on data streaming.
Such projects typically involve more data modeling and extract-transform-load
(ETL) workloads than user interfaces. This fact makes it hard to demonstrate usable
software on a consistent, tight schedule. But by adjusting goals, developers can still
use an Agile approach. Instead of working to accomplish tasks each iteration,
developers can focus on running data experiments. Instead of presenting a
working product every few weeks, they can aim to better understand the data.
Why Agile?
So why would anyone consider an Agile approach? It's clear that the rules of
engagement around building software have fundamentally changed in the last 10-15
years. Many of the activities look similar, but the landscape and environments where we
apply them are noticeably different.
Compare what it's like to purchase software today with the early 2000s. How often
do people drive to the store to buy business software?
Consider how feedback is collected from customers about products. How did a
team understand what people thought about their software before social media?
Consider how often a team desires to update and improve the software that they
deliver. Annual updates are no longer feasible against modern competition.
Forrester's Diego Lo Guidice says it best in his blog, Transforming Application Delivery
(October, 2020).
The rules have changed, and organizations around the world now adapt their approach
to software development accordingly. Agile methods and practices don't promise to
solve every problem. But they do promise to establish a culture and environment where
solutions emerge through collaboration, continual planning and learning, and a desire
to ship high-quality software more often.
Next steps
Deciding to take the Agile route to software development can introduce some
interesting opportunities for enhancing your DevOps process. One key set of
considerations focuses on how Agile development compares and contrasts with an
organization's current approach.
What is Agile development?
Article • 11/28/2022
Delivering production quality code every sprint requires the Agile development team to
account for an accelerated pace. All coding, testing, and quality verification must be
done each and every sprint. Unless a team is properly set up, the results can fall short of
expectations. While these disappointments offer great learning opportunities, it's helpful
to learn some key lessons before getting started.
This article lays out a few key success factors for Agile development teams:
Diligent backlog refinement.
Integrating early and often through CI/CD.
Minimizing technical debt.
The product owner's job is to ensure that every sprint, the engineers have clearly
defined user stories to work with. The user stories at the top of the backlog should
always be ready for the team to start on. This notion is called backlog refinement.
Keeping a backlog ready for an Agile development team requires effort and discipline.
Fortunately, it's well worth the investment.
1. Refining user stories is often a long-lead activity. Elegant user interfaces, beautiful
screen designs, and customer-delighting solutions all take time and energy to
create. Diligent product owners refine user stories two to three sprints in advance.
They account for design iterations and customer reviews. They work to ensure
every user story is something the Agile team is proud to deliver to the customer.
2. A user story isn't refined unless the team says it is. The team needs to review the
user story and agree it's ready to work on. If a team doesn't see the user story until day one of a sprint, problems are likely to result.
3. User stories further down the backlog can remain ambiguous. Don't waste time
refining lower-priority items. Focus on the top of the backlog.
With automation, the team avoids slow, error-prone, and time-intensive manual
deployment processes. Since teams release every sprint, there isn't time to do these
tasks manually.
CI/CD also influences your software architecture. It ensures you deliver buildable and
deployable software. When teams implement a difficult-to-deploy feature, they become
aware immediately if the build and deployments fail. CI/CD forces a team to fix
deployment issues as they occur. The product is then always ready to ship.
There are some key CI/CD activities that are critically important for effective Agile
development.
1. Unit testing. Unit tests are the first defense against human error. Consider unit
tests a part of coding. Check tests in with the code. Make unit testing a part of
every build. Failed unit tests mean a failed build.
2. Build automation. The build system should automatically pull code and tests
directly from source control when builds run.
3. Branch and build policies. Configure branch and build policies to build
automatically as the team checks code in to a specific branch.
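As a simple illustration of these activities, a build script along the following lines (the script names are placeholders for whatever build and test tools the project uses) pulls code from source control, compiles it, and treats any failed unit test as a failed build:

    set -e                                            # stop immediately if any step fails
    git clone https://example.com/team/project.git    # pull code and tests directly from source control
    cd project
    ./build.sh                                        # compile the application
    ./run-unit-tests.sh                               # failed unit tests mean a failed build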
Keeping on top of technical debt requires courage. There are many pressures to delay
reworking code. It feels good to work on features and ignore debt. Unfortunately,
somebody must pay off the technical debt sooner or later. Just like financial debt,
technical debt becomes harder to pay off the longer it exists. A smart product owner
works with their team to ensure there's time to pay off technical debt every sprint.
Balancing technical debt reduction with feature development is a difficult task.
Fortunately, there are some straightforward techniques for creating productive,
customer-focused teams.
Always be Agile
Being Agile means learning from experience and continually improving. Agile
development provides more learning cycles than traditional project planning due to the
tighter process loops. Each sprint provides something new for the team to learn.
For example:
A team delivers value to the customer, gets feedback, and then modifies their
backlog based on that feedback.
They learn that their automated builds are missing key tests. They include work in
their next sprint to address this issue.
They find that certain features perform poorly in production, so they make plans to
improve performance.
Someone on the team hears of a new practice. The team decides to try it out for a
few sprints.
Teams that are just starting with Agile development should expect more learning
opportunities. They're an invaluable part of the process because they lead to growth
and improvement.
Next steps
There are many ways to settle on an Agile development process that's right for a team.
Azure DevOps provides various process templates. Teams that are looking for different
baseline structures to their planning can use these templates as starting points. For
information about selecting a process template that best fits a team's culture and goals,
see Choose a process flow or process template to work in Azure Boards.
What is Scrum?
Product owner
The product owner is responsible for what the team builds, and why they build it. The
product owner is responsible for keeping the backlog of work up to date and in priority
order.
Scrum master
The Scrum master ensures that the Scrum process is followed by the team. Scrum
masters are continually on the lookout for how the team can improve, while also
resolving impediments and other blocking issues that arise during the sprint. Scrum
masters are part coach, part team member, and part cheerleader.
Development team
The members of the development team actually build the product. The team owns the
engineering of the product, and the quality that goes with it.
Product backlog
The product backlog is a prioritized list of work the team can deliver. The product owner
is responsible for adding, changing, and reprioritizing the backlog as needed. The items
at the top of the backlog should always be ready for the team to execute on.
Daily Scrum
Scrum defines a practice called a daily Scrum, often called the daily standup. The daily
Scrum is a daily meeting limited to fifteen minutes. Team members often stand during
the meeting to ensure it stays brief. Each team member briefly reports their progress
since yesterday, the plans for today, and anything impeding their progress.
Task board
The task board lists each backlog item the team is working on, broken down into the
tasks required to complete it. Tasks are placed in To do, In progress, and Done columns
based on their status. The board provides a visual way to track the progress of each
backlog item.
Learn more about Kanban task boards.
Sprint review
The team demonstrates what they've accomplished to stakeholders. They demo the
software and show its value.
Sprint retrospective
The team takes time to reflect on what went well and which areas need improvement.
The outcomes of the retrospective are actions for the next sprint.
Increment
The product of a sprint is called the increment or potentially shippable increment.
Regardless of the term, a sprint's output should be of shippable quality, even if it's part
of something bigger and can't ship by itself. It should meet all the quality criteria set by
the team and product owner.
This shorter, iterative cycle provides the team with lots of opportunities to learn and
improve. A traditional project often has a long lifecycle, say 6-12 months. While a team
can learn from a traditional project, the opportunities are far fewer than for a team that executes in two-week sprints, for example.
Scrum is very popular because it provides just enough framework to guide teams while
giving them flexibility in how they execute. Its concepts are simple and easy to learn.
Teams can get started quickly and learn as they go. All of this makes Scrum a great
choice for teams just starting to implement Agile principles.
Next steps
Find more information about Scrum resources, training, and certification:
Scrum.org
ScrumAlliance.org
Larger, more complex organizations may find that Scrum doesn't quite fit their needs.
For those cases, check out Scaled Agile Framework.
What is Kanban?
Article • 11/28/2022
Although Kanban was created for manufacturing, software development shares many of
the same goals, such as increasing flow and throughput. Software development teams
can improve their efficiency and deliver value to users faster by using Kanban guiding
principles and methods.
Kanban principles
Adopting Kanban requires adherence to some fundamental practices that might vary
from teams' previous methods.
Visualize work
Understanding development team status and work progress can be challenging. Work
progress and current state is easier to understand when presented visually rather than
as a list of work items or a document.
Visualization of work is a key principle that Kanban addresses primarily through Kanban
boards. These boards use cards organized by progress to communicate overall status.
Visualizing work as cards in different states on a board helps to easily see the big picture
of where a project currently stands, as well as identify potential bottlenecks that could
affect productivity.
Use a pull model
Historically, stakeholders requested functionality by pushing work onto development
teams, often with tight deadlines. Quality suffered if teams had to take shortcuts to
deliver the functionality within the timeframe.
Kanban focuses on maintaining an agreed-upon level of quality that must be met before
considering work done. To support this model, stakeholders don't push work on teams
that are already working at capacity. Instead, stakeholders add requests to a backlog
that a team pulls into their workflow as capacity becomes available.
Limit work in progress
Teams decide on a WIP limit, or maximum number of items they can work on at one
time. A well-disciplined team makes sure not to exceed their WIP limit. If teams exceed
their WIP limits, they investigate the reason and work to address the root cause.
Kanban boards
The Kanban board is one of the tools teams use to implement Kanban practices. A
Kanban board can be a physical board or a software application that shows cards
arranged into columns. Typical column names are To-do, Doing, and Done, but teams
can customize the names to match their workflow states. For example, a team might
prefer to use New, Development, Testing, UAT, and Done.
On a Kanban board, the WIP limit applies to all in-progress columns. WIP limits don't
apply to the first and last columns, because those columns represent work that hasn't
started or is completed. Kanban boards help teams stay within WIP limits by drawing
attention to columns that exceed the limits. Teams can then determine a course of
action to remove the bottleneck.
Kanban and Scrum
Kanban and Scrum are both Agile methods, but they differ in a few important ways:
Scrum focuses on fixed-length sprints, while Kanban is a continuous flow model.
Scrum has defined roles, while Kanban doesn't define any team roles.
Scrum uses velocity as a key metric, while Kanban uses cycle time.
Teams commonly adopt aspects of both Scrum and Kanban to help them work most
effectively. Regardless of which characteristics they choose, teams can always review and
adapt until they find the best fit. Teams should start simple and not lose sight of the
importance of delivering value regularly to users.
Next steps
For more information, see Reasons to use Azure Boards to plan and track your
work.
The Learn module Choose an Agile approach to software development provides
hands-on Kanban experience in Azure Boards.
Adopt an Agile culture
Article • 11/28/2022
If there's one lesson to be learned from the last decade of "Agile transformations," it's
that there's no one-size-fits-all solution to adopting or implementing an Agile approach.
Every organization has different needs, constraints, and requirements. Blindly following a
prescription won't lead to success.
The Agile movement is about continually finding ways to improve the practice of
building software. It's not about a perfect daily standup or retrospective. Instead, it's
about creating a culture where the right thing happens more often than not. Activities
like standups and retrospectives have their place, but they won't change an
organization's culture.
This article details foundational elements that every organization needs to create an
Agile mindset and culture. The recommendations shouldn't be followed blindly. Each
organization should apply what makes sense in a given environment.
Select a sprint length that works for your organization's culture, product, and desire to
provide updates. For example, the Developer Tools division at Microsoft (roughly 6,000
people) works in three-week sprints. The leadership team didn't choose this sprint
length; it came from direct feedback from the engineering teams. The entire division
operates on this three-week sprint schedule. The sprints have since become the
heartbeat of the organization. Now every team marches to the beat of the same drum.
It's important to pick a sprint length and stick with it. If there are multiple Agile teams,
they should all use the same sprint length. If feedback drives a change, then be
receptive. It will become clear when the right sprint length is in place.
A culture of shipping
Peter Provost, Principal Group Program Manager at Microsoft, said "You can't cheat
shipping." The simplicity and truth of that statement is a cornerstone of Agile culture.
What Peter means is that shipping your software will teach you things that you can't and
won't understand unless you're actually shipping your software.
Human nature is to delay or avoid doing things until absolutely necessary. This couldn't
be more true when it comes to software development. Teams punt bugs to the end of
the cycle, don't think about setup or upgrade until they're forced to, and typically avoid
things like localization and accessibility wherever possible. When this pattern emerges,
teams build up technical debt that will need to be paid at a later time. Shipping
demands all debt be paid. You can't cheat shipping. To establish an Agile culture, start
by trying to ship the product at the end of every sprint. It won't be easy at first, but
when a team attempts it, they quickly discover all the things that should be happening,
but aren't.
Healthy teams
There's no recipe for the perfect Agile team. However, a few key characteristics make
success much easier to achieve.
Load balancing work instead of people allows a team that's already established to step
in and help out. It becomes a conversation about priorities, not a conversation about
people.
When horizontal teams own layers of architecture, no single team is responsible for the
end-to-end experience. Adding a feature requires multiple teams to coordinate and
requires a higher level of dependency management. Resolving bugs requires multiple
teams to investigate whether they own the code required to fix the bug. Bugs are batted
around as teams determine it's not their bug and assign it to another team.
Feature teams don't have these issues. Ownership and accountability are clear. There
may be a place for some architectural-based teams. However, vertically focused teams
are more effective.
Next steps
As teams embark on their own Agile transformation, keep these foundational principles
in mind. Remember, there's no single recipe that will work for every organization. Agile
transformations are a journey. Make changes and learn from them. Over time the
organization will develop the Agile culture it needs.
Microsoft is one of the world's largest Agile companies. Learn more about how
Microsoft adopted an Agile culture for DevOps planning.
Learn about how Azure DevOps enables teams to adopt and scale an Agile culture.
Building productive teams
Article • 11/28/2022
Engineers thrive in environments where they can focus and get in the zone. Teams often
face distractions and competing priorities that force engineers to shift context and
divide their attention. They struggle to balance heads down time with heads up time.
Adding new features requires team members to be heads down and focused.
Responding to customer issues and addressing live site issues requires the team to be
heads up and aware of what's going on.
To mitigate distractions, a team can divide itself into two crews: one for features and one
for live site health.
Feature crew
The feature crew, or F-crew, focuses on the future. They work as an effective unit with a
clear mission and goal: to build and ship high-quality features.
The F-crew is shielded from the day-to-day chaos of the live service to ensure they have
time to design, build, and test their work. They can rely on minimal distractions and
freedom from having to fix issues that arise at random. They're encouraged to seldom
check their email and avoid getting pulled into other issues unless they're critical.
When an F-crew member joins a conversation or occasionally gets sucked into an email
thread, other team members should chide them: "You're on the F-crew, what are you
doing?" If an F-crew member needs to address a critical issue, they're encouraged to
delegate it to the customer crew and return to feature work.
The F-crew operates as a tight-knit team that swarms on a small set of features. A good
work-in-progress (WIP) limit is two features in flight for 4-6 people. By working closely
together, they build deep shared context and find critical bugs or design issues that a
cursory code review would miss. A dedicated crew allows for a more predictable
throughput rate and lead time. Team members often refer to the F-crew as serene and
focused. They find it peaceful and rejuvenating to focus deeply on a feature, to dedicate
full attention to it. People leave their time on the F-crew feeling refreshed and
accomplished.
Customer crew
The customer crew, or C-crew, focuses on the now and provides frontline support for
customer and live-site issues, bugs, telemetry, and monitoring. The C-crew often
huddles around a computer, debugging a critical live-site issue. Their number one
priority is live-site health. Laser-focused on this environment, they build expert
debugging and analysis skills. The customer crew is often referred to as the shield team,
because it shields the rest of the team from distractions. Rather than work on upcoming
features, the C-crew is the bridge between customers and the current product. Crew
members are active on email, Twitter, and other feedback channels. Customers want to
know they're heard, and the C-crew's job is to hear them. The C-crew triages customer-
reported issues immediately and quickly engages and assists blocked customers.
C-crews allow the team to address issues without pulling team members off other
priorities, and ensure customers and partners are heard. Responsiveness to questions
and issues becomes a point of pride for C-crews. However, this pace can be draining,
necessitating a frequent rotation between crews.
Crew rotation
A well-defined rotation process makes the two-crew system work. You could simply
swap the crews (F-crew becomes C-crew and vice versa), but this limits knowledge
sharing between and within the crews. Instead, opt for a weekly rotation.
At the end of each week, conduct a short swap meet where the team decides who swaps
between crews. You can use a whiteboard chart to track who is currently on each crew
and when they were swapped. The longest tenured people on each crew should
typically swap with each other. However, in any given week, someone may want to
remain to complete work on a live-site investigation or feature. While there's flexibility,
the longer someone is on a crew, the more likely they should be swapped.
Weekly rotations help prevent silos of knowledge in the team and ensure a constant
flow of information and perspective between crews. The frequent movement of
engineers creates shared knowledge of the team's work, which helps the C-crew to
resolve issues without the help of others. Often, new F-crew members will quickly find a
previously overlooked design or code flaw.
Crew size
Crew size varies to maintain the health of the team. If a team has a high incoming rate
of live-site issues or has a lot of technical debt, the C-crew gets larger, and vice versa.
Adjusting sizes weekly increases predictability in the team's deliverables and
dependencies. In some weeks, a team may move everyone to the C-crew to address the
feedback from a big release.
A dedicated F-crew leads to predictable throughput and lead time. Splitting resources
between crews increases accountability within the team and with management about
what the team can accomplish each week and each sprint.
Next steps
The two-crew system can help teams understand where engineers should spend their
time and to make progress on many competing priorities.
Microsoft is one of the world's largest Agile companies. Learn how Microsoft organizes
teams in DevOps planning.
Scaling Agile to large teams
Article • 11/28/2022
The words big and Agile aren't often used in the same sentence. Large organizations
have earned the reputation of being slow moving. However, that's changing. Many large
software organizations are successfully making the transformation to Agile. They're
learning to scale Agile principles with or without popular frameworks such as SAFe,
LeSS, or Nexus.
At Microsoft, one such organization uses Agile to build products and services shipped
under the Azure DevOps brand. This group has 35 feature teams that release to
production every three weeks.
Every team within Azure DevOps owns features from start to finish and beyond. They
own customer relationships. They manage their own product backlog. They write and
check code into the production branch. Every three weeks, the production branch is
deployed and the release becomes public. The teams then monitor system health and
fix live-site issues.
To scale Agile, you must enable autonomy for the team while ensuring alignment with
the organization.
To manage the delicate balance between alignment and autonomy, DevOps leaders
need to define a taxonomy, define a planning process, and use feature chats.
Define a taxonomy
An Agile team, and the larger Agile organization it belongs to, need a clearly defined
backlog to be successful. Teams will struggle to succeed if organizational goals are
unclear.
In order to set clear goals and state how each team contributes to them, the
organization needs to define a taxonomy. A clearly defined taxonomy creates the
nomenclature for an organization.
A common taxonomy is epics, features, stories, and tasks.
Epics
Epics describe initiatives important to the organization's success. Epics may take several
teams and several sprints to accomplish, but they aren't without an end. Epics have a
clearly defined goal. Once attained, the epic is closed. The number of epics in progress
should be manageable in order to keep the organization focused. Epics are broken
down into features.
Features
Features define new functionality required to realize an epic's goal. Features are the
release-unit; they represent what is released to the customer. Published release notes
can be built based on the list of features recently completed. Features can take multiple
sprints to complete, but should be sized to ensure a consistent flow of value to the
customer. Features are broken down into stories.
Stories
Stories define incremental value the team must deliver to create a feature. The team
breaks the feature down into incremental pieces. A single completed story may not
provide meaningful value to the customer. However, a completed story represents
production-quality software. Stories are the team's work unit. The team defines the
stories required to complete a feature. Stories optionally break down into tasks.
Tasks
Tasks define the work required to complete a story.
Initiatives
This taxonomy isn't a one-size-fits-all system. Many organizations introduce a level
above epics called initiatives.
The names of each level can be tailored to your organization. However, the names
defined above (epics, features, stories) are widely used in the industry.
Line of autonomy
Once a taxonomy is set, the organization needs to draw a line of autonomy. The line of
autonomy is the point at which ownership of the level passes from management to the
team. Management doesn't interfere with levels that are owned by the team.
The following example shows the line of autonomy drawn below features. Management
owns epics and features, which provide alignment. Teams own stories and tasks, and
have autonomy over how they execute.
In this example, management doesn't tell the team how to decompose stories, plan
sprints, or execute work.
The team, however, must ensure their execution aligns with management's goals. While
a team owns their backlog of stories, they must align their backlog with the features
assigned to them.
Planning
To scale Agile planning, a team needs a plan for each level of the taxonomy. However,
it's important to iterate and update the plan. This process is called rolling wave planning.
The plan provides direction for a fixed period of time with expected calibration at
regular intervals. For example, an 18-month plan could be calibrated every six months.
Here's an example of planning methods for each level of a taxonomy: epics, features,
stories, tasks.
Vision
The vision is expressed through epics and sets the long-term direction of the
organization. Epics define what the organization wants to complete in the next 18
months. Management owns the plan and calibrates it every six months.
Season
A season is described through features and sets the strategy for the next six months.
Features determine what the organization wants to light up for its customers.
Management owns the seasonal plan and presents the vision and seasonal plans at an
all-hands meeting. All team plans must align with management's seasonal plan. Expect
to accomplish about 80% of the seasonal plan.
3-sprint plan
The 3-sprint plan defines the stories and features the team will finish over the next three
sprints. The team owns the plan and calibrates it every sprint. Each team presents their
plan to management via the feature chat (see below). The plan specifies how the team's
execution aligns with the 6-month seasonal plan. Expect to accomplish about 90% of
the 3-sprint plan.
Sprint plan
The sprint plan defines the stories and features the team will finish in the next sprint.
The team owns the sprint plan and emails it to the entire organization for full
transparency. The plan includes what the team accomplished in the past sprint and their
focus for the next sprint. Expect to accomplish about 95% of the sprint plan.
Line of autonomy
In this example, the line of autonomy is drawn to show where teams have planning
autonomy.
As stated above, management doesn't extend ownership below the line of autonomy.
Management provides guidance using the vision and season plans, and then gives the
teams autonomy to create 3-sprint and sprint plans.
Feature chats
A feature chat meeting allocates 15 minutes to each team. With 12 feature teams, these
meetings can be scheduled to last about three hours. Each team prepares a 3-slide deck,
with the following slides:
Features
The first slide outlines the features that the team will light up in the next three sprints.
Debt
The next slide describes how the team manages technical debt. Debt is anything that
doesn't meet management's quality bars. The director of engineering sets the quality
bars, which are the same for all teams (alignment). Example quality bars include number
of bugs per engineer, percentage of unit tests passing, and performance goals.
Issues and dependencies
The issues and dependencies listed on the final slide include anything that impacts team
progress, such as issues the team can't resolve or dependencies on other teams that
need escalation.
Each team presents their slides directly to management. The team presents how their 3-
sprint plan aligns with the 6-month seasonal plan. Leadership asks clarifying questions
and suggests changes in direction. They can also request follow-up meetings to resolve
deeper issues.
Management must trust teams to do the right thing. If management doesn't trust
the teams, they won't give teams autonomy.
Management must provide clear plans for teams to align with and then trust their teams
to execute. Teams must align their plans with the organization and execute in a
trustworthy manner.
As organizations look to scale Agile to larger scenarios, the key is to give teams
autonomy while ensuring they're aligned with organizational goals. The critical building
blocks are clearly defined ownership and a culture of trust. Once an organization has
this foundation in place, they'll find that Agile can scale very well.
Next steps
There are many ways for a team of any size to start seeing benefits today. Check out
some of these practices that scale.
Learn about Azure DevOps features for portfolio management and visibility across
teams.
Microsoft is one of the world's largest Agile companies. Learn more about how
Microsoft scales DevOps planning.
How Microsoft plans with DevOps
Article • 11/28/2022
Microsoft is one of the largest companies in the world to use Agile methodologies. Over
years of experience, Microsoft has developed a DevOps planning process that scales
from the smallest projects up through massive efforts like Windows. This article
describes many of the lessons learned and practices Microsoft implements when
planning software projects across the company.
Instrumental changes
The following key changes help make development and shipping cycles healthier and
more efficient:
Alignment comes from the top down, to ensure that individuals and teams
understand how their responsibilities align with broader business goals.
Autonomy happens from the bottom up, to ensure that individuals and teams have
an impact on day-to-day activities and decisions.
There is a delicate balance between alignment and autonomy. Too much alignment can
create a negative culture where people perform only as they're told. Too much
autonomy can cause a lack of structure or direction, inefficient decision-making, and
poor planning.
Microsoft teams share the following characteristics:
Cross-disciplinary
10-12 people
Self-managing
Clear charter and goals for 12-18 months
Physical team rooms
Own feature deployment
Own features in production
Microsoft teams used to be horizontal, covering all UI, all data, or all APIs. Now,
Microsoft strives for vertical teams. Teams own their areas of the product end-to-end.
Strict guidelines in certain tiers ensure uniformity among teams across the product.
The following diagram conceptualizes the difference between horizontal and vertical
teams:
About every 18 months, Microsoft runs a "yellow sticky exercise," where developers can
choose which areas of the product they want to work on for the next couple of planning
periods. This exercise provides autonomy, as teams can choose what to work on, and
organizational alignment, as it promotes balance among the teams. About 80% of the
people in this exercise remain on their current teams, but they feel empowered because
they had a choice.
Microsoft teams plan across four cadences:
Sprints (3 weeks)
Plans (3 sprints)
Seasons (6 months)
Strategies (12 months)
Engineers and teams are mostly responsible for sprints and plans. Leadership is primarily
responsible for seasons and strategies.
This planning structure also helps maximize learning while doing planning. Teams are
able to get feedback, find out what customers want, and implement customer requests
quickly and efficiently.
Teams now implement a bug cap, calculated by the formula: number of engineers x 5 = bug cap. For example, a team of 10 engineers has a bug cap of 50 bugs. If a team's bug count exceeds the bug cap at the end of a sprint, they must stop working on new features and fix bugs until they are under their cap. Teams now pay down bug debt as they go.
Objectives and key results (OKRs)
Objectives define the goals to achieve. Objectives are significant, concrete, action
oriented, and ideally inspirational statements of intent. Objectives represent big
ideas, not actual numbers.
Key results define steps to achieve the objectives. Key results are quantifiable
outcomes that evaluate progress and indicate success against objectives in a
specific time period.
OKRs reflect the best possible results, not just the most probable results. Leaders try to
be ambitious and not cautious. Pushing teams to pursue challenging key results drives
acceleration against objectives and prioritizes work that moves towards larger goals.
Adopting an OKR framework can help teams perform better by keeping everyone focused on measurable outcomes rather than on activity for its own sake.
OKRs might exist at different levels of a product. For example, there can be top-level
product OKRs, component-level OKRs, and team-level OKRs. Keeping OKRs aligned is
relatively easy, especially if objectives are set top-down. Any conflicts that arise are
valuable early indicators of organizational misalignment.
OKR example
Objective: Grow a strong and happy customer base.
Key results: quantifiable targets, such as the number of engaged customers or a customer satisfaction score, measured over a specific time period.
For more information about OKRs, see Measure business outcomes using objectives and
key results.
Teams avoid metrics that don't accrue value toward objectives. Measures of raw activity, such as lines of code written or hours worked, may have certain uses, but they aren't helpful for tracking progress toward objectives.
Key takeaways
Take Agile science seriously, but don't be overly prescriptive. Agile can become too
strict. Let the Agile mindset and culture grow.
Celebrate results, not activity. Deploying functionality outweighs lines of code.
Ship at every sprint to establish a rhythm and cadence and find all the work that
needs to be done.
Build the culture you want to get the behavior you're looking for.
Develop modern software with DevOps
Article • 11/28/2022
The development phase of DevOps is where all the core software development work
happens. As input, it takes in plans for the current iteration, usually in the form of task
assignments. Then it produces software artifacts that express the updated functionality.
Development requires not only the tools that are used to write code, such as Visual
Studio, but also supporting services like version control, issue management, and
automated testing.
Automate processes
The real value of the development stage comes from the implementation of features.
Unfortunately, there are many other tasks that sap time from the development team.
Compiling code, running tests, and preparing output for deployment are a few
examples. To minimize the impact, DevOps emphasizes automating these types of tasks
through the practice of continuous integration.
Another time-consuming task in the development lifecycle is fixing bugs. While bugs are
often seen as an inevitable part of software development, there are valuable steps any
team can take to reduce them. Learn how to shift left to make testing faster and more
reliable.
Next steps
Microsoft has been one of the world's largest software development companies for
decades. Learn about how Microsoft develops in DevOps.
For a hands-on DevOps experience with continuous integration, see the related learning paths on Microsoft Learn.
Visual Studio historically offers DevOps productivity and integration benefits. Visual Studio natively integrates with GitHub and Azure DevOps, and has a robust ecosystem of extensions from DevOps providers across the industry.
Both Visual Studio and Visual Studio Code have native features and first-party
extensions that simplify working with DevOps processes in Azure, GitHub, and Azure
DevOps.
Next steps
Learn to prepare Visual Studio, Visual Studio Code, Eclipse for Java, and IntelliJ IDEA for
Azure development in the hands-on learning module Prepare your development
environment for Azure development.
What is version control?
Article • 11/28/2022
Version control systems are software that help track changes made to code over time. As
a developer edits code, the version control system takes a snapshot of the files. It then
saves that snapshot permanently so it can be recalled later if needed.
Without version control, developers are tempted to keep multiple copies of code on
their computer. This is dangerous because it's easy to change or delete a file in the
wrong copy of code, potentially losing work. Version control systems solve this problem
by managing all versions of the code, but presenting the team with a single version at a
time.
Create workflows
Version control workflows prevent the chaos of everyone using their own development
process with different and incompatible tools. Version control systems provide process
enforcement and permissions so everyone stays on the same page.
Code together
Version control synchronizes versions and makes sure that changes don't conflict with
changes from others. The team relies on version control to help resolve and prevent
conflicts, even when people make changes at the same time.
Keep a history
Version control keeps a history of changes as the team saves new versions of code.
Team members can review history to find out who, why, and when changes were made.
History gives teams the confidence to experiment since it's easy to roll back to a
previous good version at any time. History lets anyone base work from any version of
code, such as to fix a bug in a previous release.
Automate tasks
Version control automation features save time and generate consistent results.
Automated testing, code analysis, and deployment when new versions are saved to
version control are three examples.
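One simple, local form of this automation is a Git hook that runs the project's tests before each commit; the test script name here is a placeholder:

    #!/bin/sh
    # .git/hooks/pre-commit: runs before every commit; a non-zero exit code blocks the commit
    ./run-tests.sh || {
      echo "Tests failed; commit aborted."
      exit 1
    }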
Next steps
Learn more about the worldwide standard in version control, Git.
What is Git?
Article • 11/28/2022
Git has become the worldwide standard for version control. So what exactly is it?
Git is a distributed version control system, which means that a local clone of the
project is a complete version control repository. These fully functional local repositories
make it easy to work offline or remotely. Developers commit their work locally, and then
sync their copy of the repository with the copy on the server. This paradigm differs from
centralized version control where clients must synchronize code with a server before
creating new versions of code.
Git's flexibility and popularity make it a great choice for any team. Many developers and
college graduates already know how to use Git. Git's user community has created
resources to train developers, and Git's popularity makes it easy to get help when
needed. Nearly every development environment has Git support, and Git command-line
tools are available on every major operating system.
Git basics
Every time work is saved, Git creates a commit. A commit is a snapshot of all files at a
point in time. If a file hasn't changed from one commit to the next, Git uses the
previously stored file. This design differs from other systems that store an initial version
of a file and keep a record of deltas over time.
Commits create links to other commits, forming a graph of the development history. It's
possible to revert code to a previous commit, inspect how files changed from one
commit to the next, and review information such as where and when changes were
made. Commits are identified in Git by a unique cryptographic hash of the contents of
the commit. Because everything is hashed, it's impossible to make changes, lose
information, or corrupt files without Git detecting it.
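For example, a few everyday Git commands expose this history:

    git log                   # review who made each commit, when, and why
    git diff HEAD~1 HEAD      # inspect how files changed from one commit to the next
    git revert HEAD           # create a new commit that undoes the most recent one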
Branches
Each developer saves changes to their own local code repository. As a result, there can
be many different changes based off the same commit. Git provides tools for
isolating changes and later merging them back together. Branches, which are
lightweight pointers to work in progress, manage this separation. Once work created in
a branch is finished, it can be merged back into the team's main (or trunk) branch.
Staging lets developers pick which file changes to save in a commit in order to break
down large changes into a series of smaller commits. By reducing the scope of commits,
it's easier to review the commit history to find specific file changes.
Benefits of Git
The benefits of Git are many.
Simultaneous development
Everyone has their own local copy of code and can work simultaneously on their own
branches. Git works offline since almost every operation is local.
Faster releases
Branches allow for flexible and simultaneous development. The main branch contains
stable, high-quality code from which you release. Feature branches contain work in
progress, which are merged into the main branch upon completion. By separating the
release branch from development in progress, it's easier to manage stable code and
ship updates more quickly.
Built-in integration
Due to its popularity, Git integrates into most tools and products. Every major IDE has
built-in Git support, and many tools support continuous integration, continuous
deployment, automated testing, work item tracking, metrics, and reporting feature
integration with Git. This integration simplifies the day-to-day workflow.
Pull requests
Use pull requests to discuss code changes with the team before merging them into the
main branch. The discussions in pull requests are invaluable to ensuring code quality
and increase knowledge across your team. Platforms like GitHub and Azure DevOps
offer a rich pull request experience where developers can browse file changes, leave
comments, inspect commits, view builds, and vote to approve the code.
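For example, with the GitHub CLI a pull request can be opened from the command line after pushing a branch (the branch name, title, and body are illustrative):

    git push -u origin feature/login-page
    gh pr create --title "Add login page" --body "Implements the new login flow; ready for review"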
Branch policies
Teams can configure GitHub and Azure DevOps to enforce consistent workflows and
process across the team. They can set up branch policies to ensure that pull requests
meet requirements before completion. Branch policies protect important branches by
preventing direct pushes, requiring reviewers, and ensuring clean builds.
Next steps
Install and set up Git
Install and set up Git
Article • 11/28/2022
Git isn't yet a default option on computers, so it must be manually installed and
configured. And like other software, it's important to keep Git up to date. Updates
protect from security vulnerabilities, fix bugs, and provide access to new features.
The following sections describe how to install and maintain Git for the three major
platforms.
Windows
To install Git on Windows, download and run the installer from the official Git website.
Git for Windows doesn't automatically update. To update Git for Windows, download
the new version of the installer, which updates Git for Windows in place and retains all
settings.
macOS
On a Mac, it's recommended that you install Git through Homebrew and that you use
Homebrew tools to keep Git up to date. Homebrew is a great way to install and manage
open source development tools on a Mac from the command line.
Install Homebrew and run the following to install the latest version of Git on a Mac:
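    brew install git

To keep Git current later, update it through Homebrew as well:

    brew upgrade git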
A graphical installer for Git on macOS is also available from the official Git website.
Git associates your name and email address with every commit you make. Run the following commands from the command prompt after installing Git to configure this information:
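    git config --global user.name "Your Name"
    git config --global user.email "you@example.com"

Replace the placeholder name and email address with your own; Git records them with every commit you make.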
Visual Studio offers a great out-of-the-box Git experience without any extra tooling.
Learn more in this Visual Studio Git tutorial.
Set up a Git repository
Article • 11/28/2022
A Git repository, or repo, is a folder that Git tracks changes in. There can be any number
of repos on a computer, each stored in their own folder. Each Git repo on a system is
independent, so changes saved in one Git repo don't affect the contents of another.
A Git repo contains every version of every file saved in the repo. This is different than
other version control systems that store only the differences between files. Git stores the
file versions in a hidden .git folder alongside other information it needs to manage code.
Git saves these files very efficiently, so having a large number of versions doesn't mean
that it uses a lot of disk space. Storing each version of a file helps Git merge code better
and makes working with multiple versions of code quick and easy.
Developers work with Git through commands issued while working in a local repo on
the computer. Even when sharing code or getting updates from the team, it's done from
commands that update the local repo. This local-focused design is what makes Git a
distributed version control system. Every repo is self-contained, and the owner of the
repo is responsible for keeping it up to date with the changes from others.
Most teams use a central repo hosted on a server that everyone can access to
coordinate their changes. The central repo is usually hosted in a source control
management solution, like GitHub or Azure DevOps. A source control management
solution adds features and makes working together easier.
Create a new Git repo
You have two options to create a Git repo. You can create one from the code in a folder
on a computer, or clone one from an existing repo. If working with code that's just on
the local computer, create a local repo using the code in that folder. But most of the
time the code is already shared in a Git repo, so cloning the existing repo to the local
computer is the recommended way to go.
Use the git init command in the folder to create the repo. Next, add any files in the
folder to the first commit using the following commands:
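A minimal sketch of those commands (the commit message is illustrative):
> git add --all
> git commit -m "First commit"
To clone an existing repo instead, pass its clone URL to git clone. The URL below is a placeholder:
> git clone https://dev.azure.com/fabrikam/FabrikamProject/_git/FabrikamRepo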
Be sure to use the actual URL to the existing repo instead of the placeholder URL shown
in this example. This URL, called the clone URL, points to a server where the team
coordinates changes. Get this URL from the team, or from the clone button on the site
where the repo is hosted.
It's not necessary to add files or create an initial commit when the repo is cloned since it
was all copied, along with history, from the existing repo during the clone operation.
Next steps
GitHub and Azure Repos provide unlimited free public and private Git repos.
Visual Studio user? Learn more about how to create and clone repos from Visual Studio
in this Git tutorial.
Save and share code with Git
Article • 11/28/2022
Saving and sharing versions of code with a team are the most common tasks in version
control. Git has an easy three-step workflow for these tasks: create a branch, commit
changes, and share the branch with the team.
Git makes it easy to manage work using branches. Every bugfix, new feature, added test,
and updated configuration starts with a new branch. Branches are lightweight and local
to the development machine, so you don't have to worry about using resources or
coordinating changes with others until it's time to push the branch.
Branches enable you to code in isolation from other changes in development. Once
everything's working, the branch and its changes are shared with your team. Others can
experiment with the code in their own copy of the branch without it affecting work in
progress in their own branches.
Create a branch
Create a branch based off the code in a current branch, such as main , when starting new
work. It's a good practice to check which branch is selected using git status before
creating a new branch.
Git has a shorthand command to both create the branch and switch to it at the same
time:
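For example (the branch name is illustrative; newer Git versions also support git switch -c):
> git checkout -b cool-new-feature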
Learn more about working with Git branches in GitHub or Azure DevOps.
Save changes
Git doesn't automatically snapshot code as edits are made. Git must be told exactly
which changes to add to the next snapshot. This is called staging. After staging your
changes, create a commit to save the snapshot permanently.
Stage changes
Git tracks file changes made in the repo as they happen. It separates these changes into
three categories:
Commit changes
Save changes in Git by creating a commit. Each commit stores the full file contents of
the repo, not just the individual file changes. This behavior is different from other
version control systems that store the file-level differences from the last version of
the code. Full file histories let Git make better decisions when merging changes and
make switching between branches of code lightning fast.
Stage changes with git add to add changed files, git rm to remove files, and git mv to
move files. Then, use the git commit command to create the commit.
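For example (the file names and commit message are placeholders):
> git add NewFeature.cs             # stage a new or changed file
> git rm ObsoleteHelper.cs          # stage a file removal
> git mv Util.cs Utilities.cs       # stage a file rename
> git commit -m "Add new feature"   # commit the staged changes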
Every commit has a message that describes its changes. A good commit message helps
the developer remember the changes they made in a commit. Good commit messages
also make it easier for others to review the commit.
Learn more about staging files and committing changes in Visual Studio or Visual Studio
Code .
Share changes
Whether working on a team or just wanting to back up their own code, developers need
to share commits with a repo on another computer. Use the git push command to take
commits from the local repo and write them into a remote repo. Git is set up in cloned
repos to connect to the source of the clone, also known as origin . Run the following
command to write the local commits on your current branch to a branch named
branchname on the origin repository. Git creates branchname on the remote repo if it
doesn't exist.
> git push origin branchname
If working in a repo created on the local system with git init , you'll need to set up a
connection to the team's Git server before changes can be pushed. Learn more about
setting up remotes and pushing changes in Visual Studio or Visual Studio Code .
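A minimal sketch of that one-time setup (the remote URL is a placeholder):
> git remote add origin https://dev.azure.com/fabrikam/FabrikamProject/_git/FabrikamRepo
> git push -u origin main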
Share branches
Pushing a local branch to the team's shared repo makes its changes accessible to the
rest of the team. The first time git push is run, adding the -u option tells Git to start
tracking the local branch against branchname on the origin repo. After this one-time
setup of tracking information, team members can run git push directly to share
updates quickly and easily.
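For example, the first push of a topic branch might look like this (branch name illustrative):
> git push -u origin cool-new-feature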
Next steps
Learn more about branches in GitHub or Azure DevOps.
Learn more about pushing commits and branches in Visual Studio or Visual Studio
Code .
Understand Git history
Article • 11/28/2022
Git represents history in a fundamentally different way than centralized version control
systems (CVCS) such as Team Foundation Version Control, Perforce, or Subversion.
Centralized systems store a separate history for each file in a repository. Git stores
history as a graph of snapshots of the entire repository. These snapshots, called commits
in Git, can have multiple parents, creating a history that looks like a graph instead of a
straight line. This difference in history is incredibly important and is the main reason
users familiar with CVCS find Git confusing.
A key difference in Git compared to CVCS is that the developer has their own full copy
of the repo. They need to keep their local repository in sync with the remote repository
by getting the latest commits from the remote repository. To do this, they pull the main
branch with the following command:
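A sketch of that command:
> git pull origin main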
This command merges all changes from the main branch in the remote repository, which Git
names origin by default. In this example, the pull brings down one new commit, and the
main branch in the local repo moves to that commit.
Understand branch history
Now it's time to make a change to the code. It's common to have multiple active
branches when working on different features in parallel. This is in stark contrast to CVCS
where new branches are heavyweight and rarely created. The first step is to check out a new
branch using the following command:
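For example:
> git checkout -b cool-new-feature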
Two branches now point to the same commit. Suppose there are a few changes on the
cool-new-feature branch in two new commits, E and F.
The commits are reachable by the cool-new-feature branch since they were committed
to that branch. Now that the feature is done, it needs to be merged into the main
branch. To do that, use the following command:
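A sketch of that command, run with main checked out:
> git merge cool-new-feature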
The graph structure of history becomes visible when there's a merge. Git creates a new
commit when one branch is merged into another. This is a merge commit. There
aren't any changes included in this merge commit since there were no conflicts. If there
were conflicts, the merge commit would include the changes needed to resolve them.
Next steps
Learn more about working with Git history in GitHub and Azure Repos or Git log
history simplification.
Get feedback with pull requests
Article • 11/28/2022
Pull requests support reviewing and merging code into a single collaborative process.
Once a developer adds a feature or a bug fix, they create a pull request to begin the
process of merging the changes into the upstream branch. Other team members are
then given a chance to review and approve the code before it's finalized. Use pull
requests to review works in progress and get early feedback on changes. But there's no
commitment to merge the changes. An owner can abandon a pull request at any time.
Code reviews help protect the team from bad merges and broken builds that sap the
team's productivity. Reviews catch problems before the merge, protecting important
branches from unwanted changes.
When you assign reviewers to a pull request, be sure to select the right set of reviewers.
Reviewers should know how the code works, but the list should also include developers
working in other areas so they can share their ideas.
Provide a clear description of the changes and provide a build of the code that has the
fix or feature working in it. Reviewers should make an effort to provide feedback on
changes they don't agree with. Identify the issue and give specific suggestions on what
could be done differently. This feedback has clear intent and is easy for the owner of the
pull request to understand.
The pull request owner should reply to comments, accept suggestions, or explain why
they decline to apply them. Some suggestions are good, but might be outside the scope
of the pull request. Take these suggestions and create new work items and feature
branches separate from the pull request to make those changes.
Add additional conditions to pull requests to enforce a higher level of code quality in
key branches. A clean build of the merged code and approval from multiple reviewers
are some extra requirements often employed to protect key branches.
Learn more
GitHub has extensive documentation on how to propose changes to your work with pull
requests .
Read more about giving great feedback in code reviews and using pull request
templates to provide guidance to your reviewers. Azure DevOps also offers a rich pull
request experience that's easy to use and scales as needed.
Hosting Git repositories
Article • 11/28/2022
Git has quickly become the worldwide standard for version control. Millions of projects
rely on Git for everyday collaboration needs. While the decentralized nature of Git
provides substantial benefits, it's still necessary for teams to push their changes to a
centralized Git repo in order to merge branches and provide a hub for other DevOps
activities.
GitHub
By far, the world's leading host for Git projects is GitHub . GitHub provides much more
than just Git hosting. GitHub has features that span the whole DevOps process,
including a marketplace of partner products and services .
Self-hosting GitHub
Some organizations might have regulatory or other requirements that prevent them
from hosting their source code and other assets outside of their own infrastructure. For
these users, GitHub Enterprise Server is available. GitHub Enterprise Server includes
the familiar features and user experience, but can be entirely hosted within a company's
own infrastructure.
Azure Repos
Users already on Azure DevOps or earlier versions of Team Foundation Server have a
first-class option in migrating to Azure Repos. Azure Repos provides all the benefits of
Git, combined with a familiar user experience and integration points.
Migrating a team to Git from centralized version control requires more than just
learning new commands. To support distributed development, Git stores file history and
branch information differently than a centralized version control system. Planning and
implementing a successful migration to Git from a centralized version control system
requires understanding these fundamental differences.
Microsoft has helped migrate many internal teams and customers from centralized
version control systems to Git. This experience has produced the following guidance
based on practices that consistently succeed.
Teams should consider adopting the following practices as they migrate to the new
system:
Continuous integration (CI), where every check-in triggers a build and test pass. CI
helps identify defects early and provides a strong safety net for projects.
Required code reviews before checking in code. In the Git branching model, pull
request code review is part of the development process. Code reviews complement
the CI workflow.
In Git, short-lived topic branches allow developers to work close to the main branch and
integrate quickly, avoiding merge problems. Two common topic branch strategies are
GitFlow and a simpler variation, GitHub Flow .
Git discourages long-lived, isolated feature branches, which tend to delay merges until
integration becomes difficult. By using modern CD techniques like feature flags, teams
can integrate code into the main branch quickly, but still keep in-progress features
hidden from users until they're complete.
Teams that currently use a long-lived feature branch strategy can adopt feature flags
before migrating to Git. Using feature flags simplifies migration by minimizing the
number of branches to migrate. Whether they use feature branches or feature flags,
teams should document the mapping between legacy branches and new Git branches,
so everyone understands where to commit their new work.
In some version control systems, a tag or label is a collection that can contain
various files in the tree, even files at different versions. In Git, a tag is a snapshot of
the entire repository at a specific point in time. A tag can't represent a subset of
the repository or combine files at different versions.
Most version control systems store details about the way files change between
versions, including fine-grained change types like rename, undelete, and rollback.
Git stores versions as snapshots of the entire repository, and metadata about the
way files changed isn't available.
These differences mean that a full history migration will be lossy at best, and possibly
misleading. Given the lossiness, the effort involved, and the relative rarity of using
history, it's recommended that most teams avoid importing history. Instead, teams
should do a tip migration, bringing only a snapshot of the most recent branch version
into Git. For most teams, time is best spent on areas of the migration that have a higher
return on investment, such as improving processes.
Especially for teams that do only a tip migration, it's highly recommended to maintain
the previous system indefinitely. Set the old version control system to read-only after
you migrate.
Large development teams and regulated environments can place breadcrumbs in Git
that point back to the old version control system. A simple example is a text file added
as the first commit at the root of a Git repository, before the tip migration, that points to
the URL of the old version control server. If many branches migrate, a text file in each
branch should explain how the branches migrated from the old system. Breadcrumbs
are also helpful for developers who start working on a project after it's been migrated
and aren't familiar with the old version control system.
It's also recommended to exclude libraries, tools, and build output from repositories.
Instead, use package management systems like NuGet to manage dependencies.
Assets like icons and artwork might need to align with a specific version of source code.
Small, infrequently-changed assets like icons won't bloat history, and you can include
them directly in a repository. To store large or frequently-changing assets, use the Git
Large File Storage (LFS) extension. For more information about managing large files in
GitHub, see Managing large files . For Azure Repos, see Manage and store large files in
Git.
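A minimal sketch of enabling LFS for a file type (the file pattern is illustrative):
> git lfs install            # one-time setup per machine
> git lfs track "*.psd"      # have LFS manage large Photoshop assets
> git add .gitattributes     # commit the tracking rule with the repo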
Provide training
One of the biggest challenges in migrating to Git is helping developers understand how
Git stores changes and how commits form development history. It's not enough to
prepare a cheat sheet that maps old commands to Git commands. Developers need to
stop thinking about version control history in terms of a centralized, linear model, and
understand Git's history model and the commit graph.
People learn in different ways, so you should provide several types of training materials.
Live, lab-based training with an expert instructor works well for some people. The Pro
Git book is an excellent starting point that is available free online.
Organizations should work to identify Git experts on teams, empower them to help
others, and encourage other team members to ask them questions.
Plan for a firm cutover from the old version control system to Git. Trying to operate
multiple systems in parallel means developers might not know where or how to check
in. Set the old version control system to read-only to help avoid confusion. Without this
safeguard, a second migration that includes interim changes might be necessary.
The actual migration process varies depending on the system you're migrating from. For
information about migrating from Team Foundation Version Control, see Migrate from
TFVC to Git.
Migration checklist
Team workflows:
Branching strategy:
History:
" Identify which binaries and undiffable files to remove from the repo.
" Decide on an approach for large files, such as Git-LFS.
" Decide on an approach for delivering tools and libraries, such as NuGet.
Training:
Code migration:
Next steps
Migrate to Azure DevOps from Team Foundation Server
How TFVC commands and workflow map to Git
Import and migrate repositories from
TFVC to Git
Article • 03/25/2024
Azure DevOps Services | Azure DevOps Server 2022 - Azure DevOps Server 2019
You can migrate code from an existing TFVC repository to a new Git repository within
the same organization. Migrating to Git is an involved process for large TFVC
repositories and teams. Centralized version control systems, like TFVC, behave
differently from Git in fundamental ways. The switch involves a lot more than learning
new commands. It is a disruptive change that requires careful planning. You need to
think about:
We strongly recommend reading Centralized version control to Git and the following
Migrate from TFVC to Git section before starting the migration.
The import experience is great for small simple TFVC repositories. It's also good for
repositories that have already been "cleaned up" as outlined in Centralized version
control to Git and the following Migrate from TFVC to Git section. These sections also
recommend other tools for more advanced TFVC repository configurations.
Important
Due to the differences in how TFVC and Git store version control history, we
recommend that you don't migrate your history. This is the approach that Microsoft
took when it migrated Windows and other products from centralized version
control to Git.
4. Type the path to the repository / branch / folder that you want to import to the Git
repository. For example, $/Fabrikam/FabrikamWebsite
5. If you want to migrate history from the TFVC repository, click Migrate history and
select the number of days. You can migrate up to 180 days of history starting from
the most recent changeset. A link to the TFVC repository is added in the commit
message of the first changeset that is migrated to Git. This makes it easy to find
older history when needed.
6. Give a name to the new Git repository and click Import. Depending on the size of
the import, your Git repository will be ready in a few minutes.
Troubleshooting
This experience is optimized for small, simple TFVC repositories or repositories that have
been prepared for a migration. This means it has a few limitations.
1. It only migrates the contents of root or a branch. For example, if you have a TFVC
project at $/Fabrikam which has 1 branch and 1 folder under it, a path to import
$/Fabrikam would import the folder while $/Fabrikam/<branch> would only import
the branch.
2. The imported repository and associated history (if imported) cannot exceed 1GB in
size.
3. You can import up to 180 days of history.
If any of the above is a blocker for your import, we recommend that you try external tools
like Git-TFS for the import and read Centralized version control to Git and the following
Migrate from TFVC to Git section.
Important
The usage of external tools like Git-TFS with Microsoft products, services, or
platforms is entirely the responsibility of the user. Microsoft does not endorse,
support, or guarantee the functionality, reliability, or security of such third-party
extensions.
Requirements
Steps to migrate
Check out the latest version
Remove binaries and build tools
Convert version control-specific configuration
Check in changes and perform the migration
Advanced migrations
Update the workflow
Requirements
To make migrations easier, there are a number of requirements to meet before following
the repository import procedure in the previous section of this article.
Migrate only a single branch. When planning the migration, choose a new
branching strategy for Git. Migrating only the main branch supports a topic-branch
based workflow like GitFlow or GitHub Flow .
Do a tip migration, that is, import only the latest version of the source code. If TFVC
history is simple, there's an option to migrate some history, up to 180 days, so that
the team can work only out of Git. For more information, see Plan your migration
to Git.
Exclude binary assets like images, scientific data sets, or game models from the
repository. These assets should use the Git LFS (Large File Storage) extension,
which the import tool doesn't configure.
Keep the imported repository below 1GB in size.
If the repository doesn't meet these requirements, use the Git-TFS tool to do your
migration instead.
Important
The usage of external tools like Git-TFS with Microsoft products, services, or
platforms is entirely the responsibility of the user. Microsoft does not endorse,
support, or guarantee the functionality, reliability, or security of such third-party
extensions.
Steps to migrate
The process to migrate from TFVC is generally straightforward:
1. Check out the latest version of the branch from TFVC on your local disk.
2. Remove binaries and build tools from the repository and set up a package
management system like NuGet.
3. Convert version control-specific configuration directives. For example, convert
.tfignore files to .gitignore , and convert .tpattributes files to .gitattributes .
4. Check in changes and perform the migration to Git.
Steps 1-3 are optional. If there aren't binaries in the repository and there's no need to
set up a .gitignore or a .gitattributes , you can proceed directly to the Check in
changes and perform the migration step.
Because Git gives every developer a copy of every file version in the repository's history,
checking binary files directly into the repository causes the repo to grow quickly and can
cause performance issues.
For build tools and dependencies like libraries, adopt a packaging solution with
versioning support, such as NuGet. Many open source tools and libraries are already
available on the NuGet Gallery , but for proprietary dependencies, create new NuGet
packages.
Once dependencies are moved into NuGet, be sure that they aren't included in the Git
repository by adding them to .gitignore.
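For example, a NuGet packages folder can be excluded with an entry like the following (folder name illustrative):
> echo packages/ >> .gitignore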
If the project relies on a .tfignore file to exclude files from version control, convert the .tfignore file to a .gitignore file.
Cross-platform TFVC clients also provide support for a .tpattributes file that controls
how files are placed on the local disk or checked into the repository. If a .tpattributes
file is in use, convert it to a .gitattributes file.
Advanced migrations
The Git-TFS tool is a two-way bridge between Team Foundation Version Control and
Git, and you can use it to perform a migration. Git-TFS is appropriate for a migration
with full history, more than the 180 days that the Import tool supports. Or you can use
Git-TFS to attempt a migration that includes multiple branches and merge relationships.
Before attempting a migration with Git-TFS, note that there are fundamental differences
between the way TFVC and Git store history:
Git stores history as a snapshot of the repository in time, while TFVC records the
discrete operations that occurred on a file. Change types in TFVC like rename,
undelete, and rollback can't be expressed in Git. Instead of seeing that file A was
renamed to file B , it only tracks that file A was deleted and file B was added in the
same commit.
Git doesn't have a direct analog of a TFVC label. Labels can contain any number of
files at any specific version and can reflect files at different versions. Although
conceptually similar, the Git tags point to a snapshot of the whole repository at a
point in time. If the project relies on TFVC labels to know what was delivered, Git
tags might not provide this information.
Merges in TFVC occur at the file level, not at the entire repository. Only a subset of
changed files can be merged from one branch to another. Remaining changed files
might then be merged in a subsequent changeset. In Git, a merge affects the
entire repository, and both sets of individual changes can't be seen as a merge.
Because of these differences, it's recommended that you do a tip migration and keep
your TFVC repository online, but read-only, in order to view history.
To attempt an advanced migration with Git-TFS, see clone a single branch with history
or clone all branches with merge history .
Important
The usage of external tools like Git-TFS with Microsoft products, services, or
platforms is entirely the responsibility of the user. Microsoft does not endorse,
support, or guarantee the functionality, reliability, or security of such third-party
extensions.
Learn more about how to migrate from centralized version control to Git.
Continuous integration (CI) is the process of automatically building and testing code
every time a team member commits code changes to version control. A code commit to
the main or trunk branch of a shared repository triggers the automated build system to
build, test, and validate the full branch. CI encourages developers to share their code
and unit tests by merging their changes into the shared version control repository every
time they complete a task.
Software developers often work in isolation, and then need to integrate their changes
with the rest of a team's code base. Waiting days or weeks to integrate code can create
many merge conflicts, hard-to-fix bugs, diverging code strategies, and duplicated
effort. CI avoids these problems because it requires the development team's code to
continuously merge into the shared version control branch.
CI keeps the main branch up-to-date. Developers can use modern version control
systems like Git to isolate their work in short-lived feature branches. When the feature is
complete, the developer submits a pull request from the feature branch to the main
branch. On approval of the pull request, the changes merge into the main branch, and
the feature branch can be deleted.
Development teams repeat this process for each work item. Teams can establish branch
policies to ensure the main branch maintains desired quality criteria.
Build definitions specify that every commit to the main branch triggers the automated
build and testing process. Automated tests verify that every build maintains consistent
quality. CI catches bugs earlier in the development cycle, making them less expensive to
fix.
Testing helps ensure that code performs as expected, but building tests takes time and
effort away from other tasks such as feature development. With this cost, it's
important to extract maximum value from testing. This article discusses DevOps test
principles, focusing on the value of unit testing and a shift-left test strategy.
Dedicated testers used to write most tests, and many product developers didn't learn to
write unit tests. Writing tests can seem too difficult or like too much work. There can be
skepticism about whether a unit test strategy works, bad experiences with poorly-
written unit tests, or fear that unit tests will replace functional tests.
L0 and L1 tests are unit tests, or tests that depend on code in the assembly under
test and nothing else. L0 is a broad class of fast, in-memory unit tests.
L2 are functional tests that might require the assembly plus other dependencies,
like SQL or the file system.
L3 functional tests run against testable service deployments. This test category
requires a service deployment, but might use stubs for key service dependencies.
L4 tests are a restricted class of integration tests that run against production. L4
tests require a full product deployment.
While it would be ideal for all tests to run at all times, it's not feasible. Teams can select
where in the DevOps process to run each test, and use shift-left or shift-right strategies
to move different test types earlier or later in the process.
For example, the expectation might be that developers always run through L2 tests
before committing, a pull request automatically fails if the L3 test run fails, and the
deployment might be blocked if L4 tests fail. The specific rules may vary from
organization to organization, but enforcing the expectations for all teams within an
organization moves everyone toward the same quality vision goals.
One Microsoft team runs over 60,000 unit tests in parallel in less than six minutes. Their
goal is to reduce this time to less than a minute. The team tracks unit test execution
time and files bugs against tests that exceed the allowed time.
Functional test guidelines
Functional tests must be independent. The key concept for L2 tests is isolation. Properly
isolated tests can run reliably in any sequence, because they have complete control over
the environment they run in. The state must be known at the beginning of the test. If
one test created data and left it in the database, it could corrupt the run of another test
that relies on a different database state.
Legacy tests that need a user identity might have called external authentication
providers to get the identity. This practice introduces several challenges. The external
dependency could be unreliable or unavailable momentarily, breaking the test. This
practice also violates the test isolation principle, because a test could change the state
of an identity, such as permission, resulting in an unexpected default state for other
tests. Consider preventing these issues by investing in identity support within the test
framework.
Long-running tests might also produce failures that are time-consuming to investigate.
Teams can build a tolerance for failures, especially early in sprints. This tolerance
undermines the value of testing as insight into codebase quality. Long-running, last-
minute tests also add unpredictability to end-of-sprint expectations, because an
unknown amount of technical debt must be paid to get the code shippable.
The goal for shifting testing left is to move quality upstream by performing testing tasks
earlier in the pipeline. Through a combination of test and process improvements,
shifting left reduces both the time it takes for tests to run, and the impact of failures
later in the cycle. Shifting left ensures that most testing is completed before a change
merges into the main branch.
In addition to shifting certain testing responsibilities left to improve code quality, teams
can shift other test aspects right, or later in the DevOps cycle, to improve the final
product. For more information, see Shift right to test in production.
The team started with 27,000 legacy tests in sprint 78 and reached zero legacy tests in
sprint 120. A set of L0 and L1 unit tests replaced most of the old functional tests. New L2
tests replaced some of the tests, and many of the old tests were deleted.
In a software journey that takes over two years to complete, there's a lot to learn from
the process itself. Overall, the effort to completely redo the test system over two years
was a massive investment. Not every feature team did the work at the same time. Many
teams across the organization invested time in every sprint, and in some sprints it was
most of what the team did. Although it's difficult to measure the cost of the shift, it was
a non-negotiable requirement for the team's quality and performance goals.
Getting started
At the beginning, the team left the old functional tests, called TRA tests, alone. The team
wanted developers to buy into the idea of writing unit tests, particularly for new
features. The focus was on making it as easy as possible to author L0 and L1 tests. The
team needed to develop that capability first, and build momentum.
The unit test count started to increase early, as the team saw the benefit of authoring
unit tests. Unit tests were easier to maintain, faster to run, and had fewer failures. It was
easy to gain support for running all unit tests in the pull request flow.
The team didn't focus on writing new L2 tests until sprint 101. In the meantime, the TRA
test count went down from 27,000 to 14,000 from Sprint 78 to Sprint 101. New unit tests
replaced some of the TRA tests, but many were simply deleted, based on team analysis
of their usefulness.
The TRA test count jumped from 2,100 to 3,800 in sprint 110 because more tests were
discovered in the source tree and added to the count. It turned out that the tests had
always been running, but weren't being tracked properly. This wasn't a crisis, but it was
important to be honest and reassess as needed.
Getting faster
Once the team had a continuous integration (CI) signal that was extremely fast and
reliable, it became a trusted indicator for product quality. The pull request and CI
pipeline timings below show how long it takes to go through the various phases.
It takes around 30 minutes to go from pull request to merge, which includes running
60,000 unit tests. From code merge to CI build is about 22 minutes. The first quality
signal from CI, SelfTest, comes after about an hour. Then, most of the product is tested
with the proposed change. Within two hours from Merge to SelfHost, the entire product
is tested and the change is ready to go into production.
Using metrics
The team tracks a scorecard. At a high level, the scorecard tracks two types of metrics:
health or debt, and velocity.
For live site health metrics, the team tracks the time to detect, time to mitigate, and how
many repair items a team is carrying. A repair item is work the team identifies in a live
site retrospective to prevent similar incidents from recurring. The scorecard also tracks
whether teams are closing the repair items within a reasonable timeframe.
For engineering health metrics, the team tracks active bugs per developer. If a team has
more than five bugs per developer, the team must prioritize fixing those bugs before
new feature development. The team also tracks aging bugs in special categories like
security.
Next steps
Learning path: Build applications with Azure DevOps
Use continuous integration
Shift right to test in production
Mocks Aren't Stubs
How Microsoft develops with DevOps
Article • 11/28/2022
Microsoft strives to use One Engineering System to build and deploy all Microsoft
products with a solid DevOps process centered on a Git branching and release flow. This
article highlights practical implementation, how the system scales from small services to
massive platform development needs, and lessons learned from using the system across
various Microsoft teams.
Microsoft also uses platform engineering principles as part of its One Engineering
System.
Branch
To fix a bug or implement a feature, a developer creates a new branch off the main
integration branch. The Git lightweight branching model creates these short-lived topic
branches for every code contribution. Developers commit early and avoid long-running
feature branches by using feature flags.
Push
When the developer is ready to integrate and ship changes to the rest of the team, they
push their local branch to a branch on the server, and open a pull request. Repositories
with several hundred developers working in many branches use a naming convention
for server branches to alleviate confusion and branch proliferation. Developers usually
create branches named users/<username>/feature , where <username> is their account
name.
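For example (the user and topic names are illustrative):
> git checkout -b users/frances/new-feature
> git push -u origin users/frances/new-feature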
Pull request
Pull requests control topic branch merges into the main branch and ensure that branch
policies are satisfied. The pull request process builds the proposed changes and runs a
quick test pass. The first- and second-level test suites run around 60,000 tests in less
than five minutes. This isn't the complete Microsoft test matrix, but is enough to quickly
give confidence in pull requests.
Next, other members of the team review the code and approve the changes. Code
review picks up where the automated tests left off, and is particularly useful for spotting
architectural problems. Manual code reviews ensure that other engineers on the team
have visibility into the changes and that code quality remains high.
Merge
Once the pull request satisfies all build policies and reviewers have signed off, the topic
branch merges into the main integration branch, and the pull request is complete.
After merge, other acceptance tests run that take more time to complete. These
traditional post-checkin tests do a more thorough validation. This testing process
provides a good balance between having fast tests during pull request review and
having complete test coverage before release.
For example, an often overlooked part of GitHub Flow is that pull requests must deploy
to production for testing before they can merge to the main branch. This process means
that all pull requests wait in the deployment queue for merge.
Some teams have several hundred developers working constantly in a single repository,
who can complete over 200 pull requests into the main branch per day. If each pull
request requires a deployment to multiple Azure data centers across the globe for
testing, developers spend time waiting for branches to merge, instead of writing
software.
Instead, Microsoft teams continue developing in the main branch and batch up
deployments into timed releases, usually aligned with a three-week sprint cadence.
Implementation details
Here are some key implementation details of the Microsoft release flow:
Adjunct repositories
Some teams also manage adjunct repositories. For instance, build and release agents
and tasks , the VS Code extension , and open-source projects are developed on
GitHub. Configuration changes check in to a separate repository. Other packages that
the team depends on come from other places and are consumed via NuGet.
Mono repo or multi-repo
While some teams elect to have a single monolithic repository, the mono-repo, other
Microsoft products use a multi-repo approach. Skype, for instance, has hundreds of
small repositories that stitch together in various combinations to create many different
clients, services, and tools. Especially for teams that embrace microservices, multi-repo
can be the right approach. Usually, older products that began as monoliths find a
mono-repo approach to be the easiest transition to Git, and their code organization
reflects that.
Release branches
The Microsoft release flow keeps the main branch buildable at all times. Developers
work in short-lived topic branches that merge to main . When a team is ready to ship,
whether at the end of a sprint or for a major update, they start a new release branch off
the main branch. Release branches never merge back to the main branch, so they might
require cherry-picking important changes.
In the release flow diagram, short-lived topic branches appear in blue and release branches
in black. One branch with a commit that needs cherry-picking appears in red.
To keep branch hierarchy tidy, teams use permissions to block branch creation at the
root level of the hierarchy. In the following example, everyone can create branches in
folders like users/, features/, and teams/. Only release managers have permission to
create branches under releases/, and some automation tools have permission to the
integrations/ folder.
Branch policies and checks can require a successful build including passed tests, signoff
by the owners of any code touched, and several external checks to verify corporate
policies before a pull request can be completed.
The branch merges into main , and the new code deploys in the next sprint or major
release. That doesn't mean the new feature will show up right away. Microsoft
decouples the deployment and exposure of new features by using feature flags.
Even if the feature needs a little more work before it's ready to show off, it's safe to go
to main if the product builds and deploys. Once in main , the code becomes part of an
official build, where it's again tested, confirmed to meet policy, and digitally signed.
Currently, a product with 200+ pull requests might produce 300+ continuous
integration builds per day, amounting to 500+ test runs every 24 hours. This level of
testing would be impossible without the trunk-based branching and release workflow.
Release at sprint milestones
At the end of each sprint, the team creates a release branch from the main branch. For
example, at the end of sprint 129, the team creates a new release branch releases/M129 .
The team then puts the sprint 129 branch into production.
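A sketch of cutting that release branch from the command line (the sprint number is illustrative):
> git checkout main
> git pull origin main
> git checkout -b releases/M129
> git push -u origin releases/M129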
After the release branch is created, the main branch remains open for developers to
merge changes. Those changes deploy three weeks later in the next sprint's
deployment.
Release hotfixes
Sometimes changes need to go to production quickly. Microsoft won't usually add new
features in the middle of a sprint, but sometimes wants to bring in a bug fix quickly to
unblock users. Issues might be minor, such as typos, or large enough to cause an
availability issue or live site incident.
Rectifying these issues starts with the normal workflow. A developer creates a branch
from main , gets it code reviewed, and completes the pull request to merge it. The
process always starts by making the change in main first. This allows creating the fix
quickly and validating it locally without having to switch to the release branch.
Following this process also guarantees that the change gets into main , which is critical.
Fixing a bug in the release branch without bringing the change back to main would
mean the bug would recur during the next deployment, when the sprint 130 release
branches from main . It's easy to forget to update main during the confusion and stress
that can arise during an outage. Bringing changes to main first means always having the
changes in both the main branch and the release branch.
Git functionality enables this workflow. To bring changes immediately into production,
once a developer merges a pull request into main , they can use the pull request page to
cherry-pick changes into the release branch. This process creates a new pull request that
targets the release branch, backporting the contents that just merged into main .
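The same backport can be sketched from the command line; the hotfix branch name and commit SHA are placeholders:
> git fetch origin
> git checkout -b users/frances/hotfix origin/releases/M129   # start from the release branch
> git cherry-pick 1a2b3c4d                                    # the fix commit already merged to main
> git push -u origin users/frances/hotfix                     # then open a pull request targeting releases/M129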
Using the cherry-pick functionality opens a pull request quickly, providing the
traceability and reliability of branch policies. Cherry-picking can happen on the server,
without having to download the release branch to a local computer. Making changes,
fixing merge conflicts, or making minor changes due to differences between the two
branches can all happen on the server. Teams can edit changes directly from the
browser-based text editor or via the Pull Request Merge Conflict Extension for a more
advanced experience.
Once a pull request targets the release branch, the team reviews the code again, evaluates
branch policies, tests the pull request, and merges it. After merge, the fix deploys to the
first ring of servers in minutes. From there, the team progressively deploys the fix to
more accounts by using deployment rings. As the changes deploy to more users, the
team monitors success and verifies that the change fixes the bug while not introducing
any deficiencies or slowdowns. The fix eventually deploys to all Azure data centers.
At this point, there are actually two branches in production. With a ring-based
deployment to bring changes to production safely, the fast ring gets the sprint 130
changes, and the slow ring servers stay on sprint 129 while the new changes are
validated in production.
Hotfixing a change in the middle of a deployment might require hotfixing two different
releases, the sprint 129 release and the sprint 130 release. The team ports and deploys
the hotfix to both release branches. The 130 branch redeploys with the hotfix to the
rings that have already been upgraded. The 129 branch redeploys with the hotfix to the
outer rings that haven't upgraded to the next sprint's version yet.
Once all the rings are deployed, the old sprint 129 branch is abandoned, because any
changes brought into the sprint 129 branch as a hotfix have also been made in main . So,
those changes will also be in the releases/M130 branch.
Summary
The release flow model is at the heart of how Microsoft develops with DevOps to deliver
online services. This model uses a simple, trunk-based branching strategy. But instead of
keeping developers stuck in a deployment queue, waiting to merge their changes, the
Microsoft release flow lets developers keep working.
This release model also allows deploying new features across Azure data centers at a
regular cadence, despite the size of the Microsoft codebases and the number of
developers working in them. The model also allows bringing hotfixes into production
quickly and efficiently.
Introduction to delivering quality
services with DevOps
Article • 11/28/2022
In the delivery phase of DevOps, the code moves through the release pipeline to the
production environment. Code delivery typically comes after the continuous integration
build and is run through several test environments before reaching end users. Along the
way, its quality is tested across many different measures that include functionality, scale,
and security.
Next steps
Microsoft has been one of the world's largest software development companies for
decades. Learn about how Microsoft delivers in DevOps.
Looking for a hands-on DevOps experience with continuous delivery? Learn to set up
release pipelines using GitHub Actions or Azure Pipelines.
What is continuous delivery?
Article • 11/28/2022
Continuous delivery (CD) is the process of automating build, test, configuration, and
deployment from a build to a production environment. A release pipeline can create
multiple testing or staging environments to automate infrastructure creation and deploy
new builds. Successive environments support progressively longer-running integration,
load, and user acceptance testing activities.
Before CD, software release cycles were a bottleneck for application and operations
teams. These teams often relied on manual handoffs that resulted in issues during
release cycles. Manual processes led to unreliable releases that produced delays and
errors.
CD is a lean practice, with the goal to keep production fresh with the fastest path from
new code or component availability to deployment. Automation minimizes the time to
deploy and time to mitigate (TTM) or time to remediate (TTR) production incidents. In
lean terms, CD optimizes process time and eliminates idle time.
Continuous integration (CI) starts the CD process. The release pipeline promotes the
build from each environment to the next after tests complete successfully. The
automated CD release pipeline allows a fail-fast approach to validation, where the tests
most likely to fail run first, and longer-running tests happen only after the faster
ones complete successfully.
The complementary practices of infrastructure as code (IaC) and monitoring facilitate
CD.
CD can sequence multiple deployment rings for progressive exposure. A ring tries a
deployment on a user group, and monitors their experience. The first deployment
ring can be a canary to test new versions in production before a broader rollout.
CD automates deployment from one ring to the next.
Deployment to the next ring can optionally depend on a manual approval step,
where a decision maker signs off on the changes electronically. CD can create an
auditable record of the approval to satisfy regulatory procedures or other control
objectives.
Blue/green deployment relies on keeping an existing blue version live while a new
green version deploys. This practice typically uses load balancing to direct
increasing amounts of traffic to the green deployment. If monitoring discovers an
incident, traffic can be rerouted to the blue deployment still running.
Feature flags or feature toggles are another technique for experimentation and
dark launches. Feature flags turn features on or off for different user groups based
on identity and group membership.
Modern release pipelines allow development teams to deploy new features fast and
safely. CD can quickly remediate issues found in production by rolling forward with a
new deployment. In this way, CD creates a continuous stream of customer value.
Next steps
GitHub Actions
Azure Pipelines
Azure Pipelines documentation
What is infrastructure as code (IaC)?
Article • 11/28/2022
Infrastructure as code (IaC) uses DevOps methodology and versioning with a descriptive
model to define and deploy infrastructure, such as networks, virtual machines, load
balancers, and connection topologies. Just as the same source code always generates
the same binary, an IaC model generates the same environment every time it deploys.
IaC is a key DevOps practice and a component of continuous delivery. With IaC, DevOps
teams can work together with a unified set of practices and tools to deliver applications
and their supporting infrastructure rapidly and reliably at scale.
Idempotence, the ability of a given operation to always produce the same result, is an
important IaC principle. A deployment command always sets the target environment
into the same configuration, regardless of the environment's starting state. Idempotency
is achieved by either automatically configuring the existing target, or by discarding the
existing target and recreating a fresh environment.
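For example, rerunning the same declarative deployment converges the target resource group to the same state each time; the resource group and template names are placeholders:
> az deployment group create --resource-group rg-fabrikam --template-file main.bicep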
Helpful tools
Discover misconfiguration in IaC with Microsoft Defender for Cloud
There's no standard syntax for declarative IaC. The syntax for describing IaC usually
depends on the requirements of the target platform. Different platforms support file
formats such as YAML, JSON, and XML.
Third-party platforms like Terraform, Ansible, Chef, and Pulumi also support IaC to
manage automated infrastructure.
Deploy to Azure infrastructure with
GitHub Actions
Article • 11/30/2022
In this guide, we'll cover how to utilize CI/CD and Infrastructure as Code (IaC) to deploy
to Azure with GitHub Actions in an automated and repeatable fashion.
This article is an architecture overview and presents a structured solution for designing
an application on Azure that's scalable, secure, resilient, and highly available. To see
more real world examples of cloud architectures and solution ideas, browse Azure
architectures.
Declarative: When you define your infrastructure and deployment process in code,
it can be versioned and reviewed using the standard software development
lifecycle. IaC also helps prevent any drift in your configuration.
Consistency: Following an IaC process ensures that the whole organization follows
a standard, well-established method to deploy infrastructure that incorporates best
practices and is hardened to meet your security needs. Any improvements made to
the central templates can easily be applied across the organization.
Dataflow
1. Create a new branch and check in the needed IaC code modifications.
2. Create a Pull Request (PR) in GitHub once you're ready to merge your changes into
your environment.
3. A GitHub Actions workflow will trigger to ensure your code is well formatted,
internally consistent, and produces secure infrastructure. In addition, a Terraform
Plan or Bicep what-if analysis will run to generate a preview of the changes that
will happen in your Azure environment.
4. Once appropriately reviewed, the PR can be merged into your main branch.
5. Another GitHub Actions workflow will trigger from the main branch and execute
the changes using your IaC provider.
6. (exclusive to Terraform) A regularly scheduled GitHub Action workflow should also
run to look for any configuration drift in your environment and create a new issue
if changes are detected.
Prerequisites
Use Bicep
1. Create GitHub Environments
The workflows utilize GitHub environments and secrets to store the Azure identity
information and set up an approval process for deployments. Create an
environment named production by following these instructions . On the
production environment, set up a protection rule and add any required approvers who
need to sign off on production deployments. You can also limit the environment to your
main branch. Detailed instructions can be found here.
When you create the federated credentials for the Azure identity:
Set Entity Type to Environment and use the production environment name.
Set Entity Type to Pull Request.
Set Entity Type to Branch and use the main branch name.
Note
While none of the data about the Azure identities contain any secrets or
credentials, we still utilize GitHub secrets as a convenient means to
parameterize the identity information per environment.
Create the following secrets on the repository using the Azure identity:
AZURE_CLIENT_ID : The application (client) ID of the app registration.
AZURE_TENANT_ID : The tenant ID where the app registration is defined.
AZURE_SUBSCRIPTION_ID : The subscription ID where the app registration is defined.
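A sketch of creating the identity secrets with the GitHub CLI; the secret names follow the common Azure OIDC convention and the GUID values are placeholders:
> gh secret set AZURE_CLIENT_ID --body "00000000-0000-0000-0000-000000000000"
> gh secret set AZURE_TENANT_ID --body "00000000-0000-0000-0000-000000000000"
> gh secret set AZURE_SUBSCRIPTION_ID --body "00000000-0000-0000-0000-000000000000"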
Use Terraform
1. Configure Terraform State Location
Terraform utilizes a state file to store information about the current state of your
managed infrastructure and associated configuration. This file will need to be
persisted between different runs of the workflow. The recommended approach is
to store this file within an Azure Storage Account or other similar remote backend.
Normally, this storage would be provisioned manually or via a separate workflow.
The Terraform backend block will need to be updated with your selected storage
location (see here for documentation).
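A sketch of provisioning that backend storage with the Azure CLI; all names are placeholders:
> az group create --name rg-terraform-state --location eastus
> az storage account create --name sttfstatefabrikam --resource-group rg-terraform-state --sku Standard_LRS
> az storage container create --name tfstate --account-name sttfstatefabrikam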
The workflows utilize GitHub environments and secrets to store the Azure identity
information and set up an approval process for deployments. Create an
environment named production by following these instructions . On the
production environment, set up a protection rule and add any required approvers who
need to sign off on production deployments. You can also limit the environment to your
main branch. Detailed instructions can be found here.
When you create the federated credentials for the read-only identity, set Entity Type to Environment and use the production environment name.
Note
While none of the data about the Azure identities contain any secrets or
credentials, we still utilize GitHub secrets as a convenient means to
parameterize the identity information per environment.
Create the following secrets on the repository using the read-only identity:
AZURE_CLIENT_ID : The application (client) ID of the app registration.
AZURE_TENANT_ID : The tenant ID where the app registration is defined.
AZURE_SUBSCRIPTION_ID : The subscription ID where the app registration is defined.
Instructions to add the secrets to the environment can be found here. The environment
secret overrides the repository secret during the deploy step to the production
environment, where elevated read/write permissions are required.
Use Bicep
There are two main workflows included in the reference architecture :
This workflow runs on every commit and is composed of a set of unit tests on the
infrastructure code. It runs bicep build to compile the bicep to an ARM template.
This ensures there are no formatting errors. Next it performs a validate to ensure
the template is deployable. Lastly, checkov , an open source static code analysis
tool for IaC, will run to detect security and compliance issues. If the repository is
utilizing GitHub Advanced Security (GHAS), the results will be uploaded to GitHub.
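The same checks can be run locally as a sketch; the file and resource group names are placeholders:
> az bicep build --file main.bicep                 # compile Bicep to an ARM template
> az deployment group validate --resource-group rg-fabrikam --template-file main.bicep   # confirm the template is deployable
> checkov -d .                                     # static analysis for security and compliance issues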
This workflow runs on every pull request and on each commit to the main branch.
The what-if stage of the workflow is used to understand the impact of the IaC
changes on the Azure environment by running what-if. This report is then attached
to the PR for easy review. The deploy stage runs after the what-if analysis when the
workflow is triggered by a push to the main branch. This stage will deploy the
template to Azure after a manual review has signed off.
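A local sketch of the what-if and deploy steps, with placeholder names:
> az deployment group what-if --resource-group rg-fabrikam --template-file main.bicep   # preview the changes
> az deployment group create --resource-group rg-fabrikam --template-file main.bicep    # deploy after review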
Use Terraform
There are three main workflows included in the reference architecture :
This workflow runs on every commit and is composed of a set of unit tests on the
infrastructure code. It runs terraform fmt to ensure the code is properly linted
and follows terraform best practices. Next it performs terraform validate to
check that the code is syntactically correct and internally consistent. Lastly,
checkov , an open source static code analysis tool for IaC, will run to detect
security and compliance issues. If the repository is utilizing GitHub Advanced
Security (GHAS), the results will be uploaded to GitHub.
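In simplified form, those checks might be expressed as the following steps; the paths and action versions are illustrative, not the exact reference workflow.
YAML
# Simplified sketch of the Terraform unit-test workflow described above.
- uses: actions/checkout@v3
- uses: hashicorp/setup-terraform@v2
# Fail the build if any file isn't formatted according to terraform fmt.
- name: Check formatting
  run: terraform fmt -check -recursive
# Validate syntax and internal consistency without touching real infrastructure.
- name: Validate configuration
  run: |
    terraform init -backend=false
    terraform validate
# Static analysis for security and compliance issues.
- name: Run Checkov
  uses: bridgecrewio/checkov-action@master
  with:
    directory: .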
This workflow runs on every pull request and on each commit to the main branch.
The plan stage of the workflow is used to understand the impact of the IaC
changes on the Azure environment by running terraform plan . This report is then
attached to the PR for easy review. The apply stage runs after the plan when the workflow is triggered by a push to the main branch. If there are pending changes to the environment, this stage takes the plan document and applies the changes after a manual review has signed off.
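A minimal sketch of the two stages is shown below; it assumes terraform init and the Azure sign-in have already run, and the manual review comes from the production environment protection rule rather than from the steps themselves.
YAML
# Simplified sketch: plan on every run, apply only on pushes to main.
# Assumes terraform init and Azure sign-in have already happened.
- name: Terraform plan
  run: terraform plan -input=false -out=tfplan
- name: Terraform apply
  if: github.ref == 'refs/heads/main'
  run: terraform apply -input=false tfplan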
This workflow runs on a periodic basis to scan your environment for any
configuration drift or changes made outside of Terraform. If any drift is detected, a
GitHub Issue is raised to alert the maintainers of the project.
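One way to implement such a check is a scheduled workflow that runs terraform plan with the -detailed-exitcode flag, which exits with code 2 when changes are detected, and then opens an issue. The schedule and issue text below are only examples.
YAML
# Illustrative drift-detection workflow: runs nightly and opens an issue if
# terraform plan reports differences between the code and the real environment.
on:
  schedule:
    - cron: '0 6 * * *'
permissions:
  contents: read
  issues: write
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      # Assumes the backend settings and Azure credentials are configured as above.
      - name: Detect drift
        id: plan
        run: |
          set +e
          terraform init
          terraform plan -input=false -detailed-exitcode
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"
      - name: Open an issue when drift is found
        if: steps.plan.outputs.exitcode == '2'
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh issue create \
            --title "Configuration drift detected" \
            --body "terraform plan found changes made outside of Terraform."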
Related Resources
What is Infrastructure as Code
Repeatable infrastructure
Comparing Terraform and Bicep
Checkov and source code
GitHub Advanced Security
What are Microservices?
Article • 11/28/2022
They can remove single points of failure (SPOFs) by ensuring issues in one service
don't crash or affect other parts of an application.
Individual microservices can be scaled out independently to provide extra
availability and capacity.
DevOps teams can extend functionality by adding new microservices without
unnecessarily affecting other parts of the application.
Using microservices can increase team velocity. DevOps practices, such as Continuous
Integration and Continuous Delivery, are used to drive microservice deployments.
Microservices nicely complement cloud-based application architectures by allowing
software development teams to take advantage of scenarios such as event-driven
programming and autoscale. The microservice components expose APIs (application
programming interfaces), typically over REST protocols, for communicating with other
services.
Next steps
Learn more about microservices on Azure .
Shift right to test in production
Article • 11/28/2022
Shift right is the practice of moving some testing later in the DevOps process to test in
production. Testing in production uses real deployments to validate and measure an
application's behavior and performance in the production environment.
One way DevOps teams can improve velocity is with a shift-left test strategy. Shift left
pushes most testing earlier in the DevOps pipeline, to reduce the amount of time for
new code to reach production and operate reliably.
But while many kinds of tests, such as unit tests, can easily shift left, some classes of
tests can't run without deploying part or all of a solution. Deploying to a QA or staging
service can simulate a comparable environment, but there's no full substitute for the
production environment. Teams find that certain types of testing need to happen in
production.
The production environment keeps changing. Even if an app doesn't change, the
infrastructure it relies on changes constantly. Testing in production validates the health
and quality of a given production deployment and of the constantly changing
production environment.
Shifting right to test in production is especially important for the following scenarios:
Microservices deployments
Microservices-based solutions can have a large number of microservices that are
developed, deployed, and managed independently. Shifting testing right is especially
important for these projects, because different versions and configurations can reach
production in many ways. Regardless of pre-production test coverage, it's necessary to
test compatibility in production.
Test data from production reflects the real customer workload, so it provides the most realistic test results.
Testing in production includes monitoring, failover testing, and fault injection. This
testing tracks failures, exceptions, performance metrics, and security events. The test
telemetry also helps detect anomalies.
Deployment rings
To safeguard the production environment, teams can roll out changes in a progressive
and controlled way by using ring-based deployments and feature flags. For example, it's
better to catch a bug that prevents a shopper from completing their purchase when less
than 1% of customers are on that deployment ring, than after switching all customers at
once. The value a feature delivers must exceed the net losses from any failures it causes, measured in a way that's meaningful for the given business.
The first ring should be the smallest size necessary to run the standard integration suite.
The tests might be similar to those already run earlier in the pipeline against other
environments, but testing validates that the behavior is the same in the production
environment. This ring identifies obvious errors, such as misconfigurations, before they
impact any customers.
After the initial ring validates, the next ring can broaden to include a subset of real users
for the test run. If everything looks good, the deployment can progress through further
rings and tests until everyone is using it. Full deployment doesn't mean that testing is
over. Tracking telemetry is critically important for testing in production.
Fault injection
Teams often employ fault injection and chaos engineering to see how a system behaves under failure conditions. These practices help validate that resilience measures, monitoring, and recovery procedures work as intended.
It's a good practice to automate fault injection experiments, because they're expensive tests that must run on ever-changing systems.
Chaos engineering can be an effective tool, but should be limited to canary
environments that have little or no customer impact.
Failover testing
One form of fault injection is failover testing to support business continuity and disaster recovery (BCDR). Teams should have failover plans for all services and subsystems. Circuit breakers are a common focus of this testing; the important behaviors to verify include:
Whether a fallback works when the circuit breaker opens. The fallback might work
with unit tests, but the only way to know if it will behave as expected in production
is to inject a fault to trigger it.
Whether the circuit breaker has the right sensitivity threshold to open when it
needs to. Fault injection can force latency or disconnect dependencies to observe
breaker responsiveness. It's important to verify not only that the correct behavior
occurs, but that it happens quickly enough.
The following diagram shows tests for the Redis circuit breaker fallback behavior. The
goal is to make sure that when the breaker opens, calls ultimately go to SQL.
The preceding diagram shows three application tiers (ATs), with the breakers in front of the calls to Redis.
One test forces the circuit breaker to open through a configuration change, and then
observes whether the calls go to SQL. Another test then checks the opposite
configuration change, by closing the circuit breaker to confirm that calls return back to
Redis.
This test validates that the fallback behavior works when the breaker opens, but it
doesn't validate that the circuit breaker configuration opens the breaker when it should.
Testing that behavior requires simulating actual failures.
A fault agent can introduce faults in calls going to Redis. The following diagram shows
testing with fault injection.
1. The fault injector blocks Redis requests.
2. The circuit breaker opens, and the test can observe whether fallback works.
3. The fault is removed, and the circuit breaker sends a test request to Redis.
4. If the request succeeds, calls revert back to Redis.
Further steps could test the sensitivity of the breaker, whether the threshold is too high
or too low, and whether other system timeouts interfere with the circuit breaker
behavior.
In this example, if the breaker doesn't open or close as expected, it could cause a live
site incident (LSI). Without the fault injection testing, the issue might go undetected, as
it's hard to do this type of testing in a lab environment.
Next steps
Shift testing left with unit tests
What are microservices?
Run a test failover (disaster recovery drill) to Azure
Safe deployment practices
What is monitoring?
How Microsoft delivers software with
DevOps
Article • 11/28/2022
Focus on delivery
Shipping faster is an obvious benefit that organizations and teams can easily measure
and appreciate. The typical DevOps cadence involves short sprint cycles with regular
deployments to production.
Fearing a lack of product stability with short sprints, some teams had compensated with
stabilization periods at the end of their sprint cycles. Engineers wanted to ship as many
features as possible during the sprint, so they incurred test debt that they had to pay
down during stabilization. Teams that managed their debt during the sprint then had to
support the teams that built up debt. The extra costs played out through the delivery
pipelines and into production.
Removing the stabilization period quickly improved the way teams managed their debt.
Instead of pushing off key maintenance work to the stabilization period, teams that built
up debt had to spend the next sprint catching up to their debt targets. Teams quickly
learned to manage their test debt during sprints. Features deliver when they're proven
and worth the cost of deployment.
It's easier to work in smaller chunks by deploying frequently. This idea might seem obvious in hindsight, but at the time it seemed counterintuitive. Frequent deployments
also motivate teams to prioritize creating more efficient and reliable deployment tools
and pipelines.
It's important to communicate directly with teams to track progress. Tools should
facilitate communication, but conversation is the most transparent way to communicate.
Prioritize features
An important goal is to focus on delivering features. Schedules can assess how much
teams and individuals can reasonably complete over a given period of time, but some
features will deliver earlier and some will come later. Teams can prioritize work so the
most important features make it to production.
Use microservices
Microservices offer various technical benefits that improve and simplify delivery.
Microservices also provide natural boundaries for team ownership. When a team has
autonomy over investment in a microservice, they can prioritize how to implement
features and manage debt. Teams can focus on plans for factors like versioning,
independent of the overall services that depend on the microservice.
Work in main
Engineers used to work in separate branches. Merge debt on each branch grew until the
engineer tried to integrate their branch into the main branch. The more teams and
engineers there were, the bigger the integration became.
For integration to happen faster, more continuously, and in smaller chunks, engineers
now work in the main branch. One big reason for moving to Git was the lightweight
branching Git offers. The benefit to internal engineering was eliminating the deep
branch hierarchy and its waste. All the time that used to be spent integrating is now
poured into delivery.
Testing in production
Shifting right to test in production helps ensure that pre-production tests are valid, and
that ever-changing production environments are ready to handle deployments.
Beyond the basics of a metric, teams consider what they need the metric to measure. For
example, the velocity or acceleration of user gains might be a more useful metric than
total number of users. Metrics vary from project to project, but the most helpful are
those with the potential to drive business decisions.
Teams throughout the organization examine engaged user metrics to determine the
meaning for their features. Teams don't just ship features, but look to see whether and
how people are using them. Teams use these metrics to adjust backlogs and determine
whether features need more work to meet goals.
Delivery guidelines
It's never a straight line to get from A to B, nor is B the end.
There will always be setbacks and mistakes.
View setbacks as learning opportunities to change tactics for completing a given
part of the process.
Over time, every team evolves its DevOps practices by building on experience and
adjusting to meet changing needs.
The key is to focus on delivering value, both to end users and to the delivery
process itself.
Introduction to operating reliable
systems with DevOps
Article • 11/28/2022
The operations phase of DevOps comes after a successful delivery and encompasses
everything that teams must consider to maintain, monitor, and troubleshoot the
application. The build gets exposed to real customers in the production environment,
where reliability becomes a critical factor.
Next steps
Learn how effective monitoring helps to ensure high system availability and allows
DevOps teams to deliver results quickly.
What is monitoring?
Article • 11/28/2022
Goals of monitoring
One goal of monitoring is to achieve high availability by minimizing key metrics that are
measured in terms of time:
Time to detect (TTD): When performance or other issues arise, rich diagnostic data
about the issues are fed back to development teams via automated monitoring.
Time to mitigate (TTM): DevOps teams act on the information to mitigate issues as
quickly as possible so that users are no longer affected.
Time to remediate (TTR): Resolution times are measured, and teams work to
improve over time. After mitigation, teams work on how to remediate problems at
root cause so that they don't recur.
A second goal of monitoring is to enable validated learning by tracking usage. The core
concept of validated learning is that every deployment is an opportunity to track
experimental results that support or diminish the hypotheses that led to the
deployment. Tracking usage and differences between versions allows teams to measure
the impact of change and drive business decisions. If a hypothesis is diminished, the
team can fail fast or pivot. If the hypothesis is supported, then the team can double
down or persevere. These data-informed decisions lead to new hypotheses and
prioritization of the backlog.
Key concepts
Telemetry is the mechanism for collecting data from monitoring. Telemetry can use
agents that are installed in deployment environments, an SDK that relies on markers
inserted into source code, server logging, or a combination of these. Typically, telemetry
will distinguish between the data pipeline optimized for real-time alerting and
dashboards and higher-volume data needed for troubleshooting or usage analytics.
Next steps
Read more about the monitoring capabilities of Azure Monitor .
Safe deployment practices
Article • 11/28/2022
Sometimes a release doesn't live up to expectations. Despite using best practices and
passing all quality gates, there are occasionally issues that result in a production
deployment causing unforeseen problems for users. To minimize and mitigate the
impact of these issues, DevOps teams are encouraged to adopt a progressive exposure
strategy that balances the exposure of a given release with its proven performance. As a
release proves itself in production, it becomes available to tiers of broader audiences
until everyone is using it. Teams can use safe deployment practices in order to maximize
the quality and speed of releases in production.
Feature flags
Certain functionality sometimes needs to be deployed as part of a release, but not
initially exposed to users. In those cases, feature flags provide a solution where
functionality may be enabled via configuration changes based on environment, ring, or
any other specific deployment.
User opt-in
Similar to feature flags, user opt-in provides a way to limit exposure. In this model, a
given feature is enabled in the release, but not activated for a user unless they
specifically want it. The risk tolerance decision is offloaded to users so they can decide
how quickly they want to adopt certain updates.
Multiple practices are commonly employed simultaneously. For example, a team may
have an experimental feature intended for a very specific use case. Since it's risky, they'll
deploy it to the first ring for internal users to try out. However, even though the features
are in the code, someone will need to set the feature flag for a specific deployment
within the ring so that the feature is exposed via the user interface. Even then, the
feature flag may only expose the option for a user to opt in to using the new feature.
Anyone who isn't in the ring, on that deployment, or hasn't opted in won't be exposed
to the feature. While this is a fairly contrived example, it serves to illustrate the flexibility
and practicality of progressive exposure.
No service isolation
Monolithic systems are traditionally scaled by leveling up the hardware on which they're
deployed. However, when something goes wrong with the instance, it leads to problems
for everyone. One simple solution is to add multiple instances so that you can load
balance users. However, this can require significant architectural considerations as many
legacy systems aren't built to be multi-instance. Plus, significant duplicate resources may
need to be allocated for functionality that may be better consolidated elsewhere.
As new features are added, explore whether a microservices architecture can help you
operate and scale thanks to better service isolation.
Teams can also make use of infrastructure as code to have better control over
deployment environments. This removes the need for requests to the operations team
to make manual changes as new features or dependencies are introduced to various
deployment environments.
Core principles
Teams looking to adopt safe deployment practices should set some core principles to
underpin the effort.
Be consistent
The same tools used to deploy in production should be used in development and test
environments. If there are issues, such as the ones that often arise from new versions of
dependencies or tools, they should be caught well before the code is close to being
released to production.
Ring-based deployment
Teams with mature DevOps release practices are in a position to take on ring-based
deployment. In this model, new features are first rolled out to customers willing to
accept the most risk. As the deployment is proven, the audience expands to include
more users until everyone is using it.
Ring 0. Purpose: finds most of the user-impacting bugs introduced by the deployment. Customers: internal only, with a high tolerance for risk and bugs. Data center: US West Central.
Ring 1. Purpose: areas the team doesn't test extensively. Customers: those using a breadth of the product. Data center: a small data center.
Ring 3. Purpose: scale issues in internal accounts and international-related issues. Customers: large internal accounts and European customers. Data centers: an internal data center and a European data center.
In general, a 24-hour period should be enough time for most scenarios to expose latent bugs. However, for services that peak during business hours, the period should include peak usage, which requires waiting a full business day.
Expedite hotfixes
A live site incident (LSI) occurs when a bug has a serious impact in production. LSIs
necessitate the creation of a hotfix, which is an out-of-band update designed to address
a high-priority issue.
If a bug is Sev 0, the most severe type of bug, the hotfix may be deployed directly to the
impacted scale unit as quickly as responsibly possible. While it's critical that the fix not
make things worse, bugs of this severity are considered so disruptive that they must be
addressed immediately.
Bugs rated Sev 1 must be deployed through ring 0, but can then be deployed out to the
affected scale units as soon as approved.
Hotfixes for bugs with lower severity must be deployed through all rings as planned.
Key takeaways
Every team wants to deliver updates quickly and at the highest possible quality. With the
right practices, delivery can be a productive and painless part of the DevOps cycle.
Deploy often.
Stay green throughout the sprint.
Use consistent deployment tooling in development, test, and production.
Use a continuous delivery platform that allows automation and authorization.
Follow safe deployment practices.
Next steps
Learn how feature flags help control the exposure of new features to users.
Progressive experimentation with
feature flags
Article • 11/28/2022
The scope of a feature flag will vary based on the nature of the feature and the
audience. In some cases, a feature flag will automatically enable the functionality for
everyone. In other cases, a feature will be enabled on a user-by-user basis. Teams can
also use feature flags to allow users to opt in to enable a feature, if they so desire.
There's really no limit to the way the feature flags are implemented.
Standard stages
Microsoft uses a standard rollout process to turn on feature flags. There are two
separate concepts: rings are for deployments, and stages are for feature flags. Learn
more about rings and stages .
Stages are all about disclosure or exposure. For example, the first stage could be for a
team's account and the personal accounts of members. Most users wouldn't see
anything new because the only place flags are turned on is for this first stage. This
allows a team to fully use and experiment with it. Once the team signs off, select
customers would be able to opt into it via the second stage of feature flags.
Opt in
It's a good practice to allow users to opt in to feature flags when feasible. For example,
the team may expose a preview panel associated with the user's preferences or settings.
A common server framework encourages reuse and economies of scale across the whole
team. Ideally, the project will have infrastructure in place so that a developer can simply
define a flag in a central store and have the rest of the infrastructure handled for them.
TypeScript
private renderRevertButton(): JSX.Element {
    // Render the revert button only when the feature flag is enabled for this user.
    // The feature-check service and flag names are illustrative; they depend on the
    // project's own flag infrastructure.
    if (FeatureAvailability.isFeatureEnabled(Flags.SourceControlRevert)) {
        return (
            <button
                onClick={ () => this.revertPullRequest(
                    this.props.pullRequest.branchStatusContract().sourceBranchStatus,
                    this.props.pullRequest.branchStatusContract().targetBranchStatus)
                }
            >
                {VCResources.PullRequest_Revert_Button}
            </button>
        );
    }
    return null;
}
The example above illustrates usage in TypeScript, but it could just as easily be accessed
using C#. The code checks to see if the feature is enabled and, if so, renders a button to
provide the functionality. If the flag isn't enabled, then the button is skipped.
The nature of the feature flag will drive the way in which the features are exposed. In
some cases, the exposure will follow a ring and stage model. In others, users may opt in
through configuration UI, or even by emailing the team for access.
At the same time, there may be a set of feature flags that persist for various reasons. For
example, the team may want to keep a feature flag that branches something
infrastructural for a period of time after the production service has fully switched over.
However, keep in mind that this potential codepath could be reactivated in the future
during an explicit clearing of the feature flag, so it needs to be tested and maintained
until the option is removed.
With feature flags in place, developers can quickly merge features upstream and push them through the test gauntlet. Quality code can quickly get published for testing in production. After a few sprints, developers will recognize the benefits of feature flags and use them proactively.
Next steps
Learn more about using feature flags in an ASP.NET Core app.
Eliminate downtime through versioned
service updates
Article • 11/28/2022
Historically, administrators needed to take a server offline to update and upgrade on-
premises software. However, downtime is a complete nonstarter for global 24×7
services. Many modern cloud services are a critical dependency for users to run their
businesses. There's never a good time to take a system down, so how can a team
provide continuous service while installing important security and feature updates?
By using versioned updates, these critical services can be transitioned seamlessly from
one version to another while customers are actively using them. Not all updates are hard.
Updating front-end layouts or styles is easy. Changes to features can be tricky, but there
are well-known practices to mitigate migration risks. However, changes that emanate
from the data tier introduce a new class of challenges that require special consideration.
Often, versioning is easier to handle in the application code. Larger systems usually have
quite a bit of legacy code, such as SQL that lives inside its databases. Rather than further
complicating this SQL, the application code should handle the complexity. Specifically,
you can create a set of factory classes that understand SQL versioning.
During every sprint, create a new interface with that version so there's always code that
matches each database version. You can easily roll back any binaries during deployment.
If something goes wrong after deploying the new binaries, revert to the previous code.
If the binary deployment succeeds, then start the database servicing.
So how does this actually work? For example, assume that your team is currently
deploying Sprint 123. The binaries understand Sprint 123 database schema and they
understand Sprint 122 schema. The general pattern is to work with both versions/sprints
N and N-1 of the SQL schema. The binaries query the database, determine which
schema version they're talking to, and then load the appropriate binding. Then, the
application code handles the case when the new data schema isn't yet available. Once
the new version is available, the application code can start making use of the new
functionality that's enabled by the latest database version.
Deployment sequence
Consider a scenario where you need to add a set of columns to a database and
transform some data. This transition needs to be invisible to users, which means
avoiding table locks as much as possible and then holding locks for the shortest time
possible so that they aren't perceptible.
The first thing we do is manipulate the data, possibly in parallel tables using a SQL
trigger to keep data in sync. Large data migrations and transformations sometimes have
to be multi-step over several deployments across multiple sprints.
Once the extra data or new schema has been created in parallel, the team goes into
deployment mode for the application code. In deployment mode, when the code makes
a call to the database, it first grabs a lock on the schema and then releases it after
running the stored procedure. The database can't change between the time the call to
the database is issued and when the stored procedure runs.
The upgrade code acts as a schema writer and requests a writer lock on the schema. The
application code takes priority in taking a reader lock, and the upgrade code sits in the
background trying to acquire the writer lock. Under the writer lock, only a small number
of very fast operations are allowed on the tables. Then the lock is released, the application records that the new version of the database is in use, and it uses the interface that matches the new database version.
The database upgrades are all performed using a migration pattern. A set of code and
scripts look at the version of the database and then make incremental changes to
migrate the schema from the old to the new version. All migrations are automated and
rolled out via release management service.
The web UI must also be updated without disrupting users. When upgrading JavaScript
files, style sheets, or images, avoid mixing old and new versions being loaded by the
client. That can lead to errors that could lose work in progress, such as a field being
edited by a user. Therefore, you should version all JavaScript, CSS, and image files by
putting all files associated with a deployment into a separate, versioned folder. When
the web UI makes calls back to the application tier, assets with a specified version are
loaded. Only when a user action results in a full page refresh does the new web UI get
loaded into the browser. The user's experience isn't disrupted by the upgrade.
Next steps
Microsoft has been one of the world's largest software development companies for
decades. Learn how Microsoft operates reliable systems with DevOps.
How Microsoft operates reliable systems
with DevOps
Article • 11/28/2022
Microsoft has been operating complex online platforms since the earliest days of the
commercial internet. Along the way, we've evolved a substantial set of practices to keep
systems available, healthy, and secure. These practices are part of a larger initiative to
maintain and improve a live site culture.
There are various factors that contribute to a successful live site culture.
Usually, health monitoring and telemetry alert us when something isn't right. A
developer can create a branch off main , make a fix, and pull request it into main .
Keeping the same general workflow means that developers don't have to context-switch
or learn a different process for a different code change.
To address a hotfix deployment, one more step is required, which is to cherry-pick the
change into the release branch. We run a hotfix deployment out of the current release
branch each weekday morning, though we can also do this on demand for urgent fixes.
The fix actually hits production out of the release branch first. But because we develop
in main first, we know it won't regress the next sprint when a new release branch is
created from main .
Releases of on-premises products are largely the same, though without the deployment
rings and stages. Also, because we do more manual testing on different configurations
and data shapes, there's a longer tail between cutting the release branch and putting
the product in the hands of customers.
We can't plan for every attack vector, but what we can do is assume that there's going
to be a breach, and plan how fast we can react to that breach. A lot of the security work
has been around that for our teams.
Finally, humans make mistakes. They sometimes get lazy and do things like store
passwords on file shares. We can tell them not to and we can send them to security
training and we can do all sorts of other things. Most people learn, but it only takes one
person to break the system. You can have all sorts of best-practice checklists, but unless you make them real, you have to assume that people are going to make mistakes.
This requires a certain level of oversight to ensure critical processes are being followed.
As we evolved this responsibility, "Live site is the most important thing that we do" became the whole team's mantra. It's the experience customers have right now and it's not
just a tax. It's actually something people count on from us and we take pride in it. It
needs to be a differentiating feature of our product.
As the engineering team zeroed in on actionable alerts, they noticed that a lot of
problems that come up, especially in the middle of the night, tend to have similar fixes,
at least temporarily. This resulted in a focus on systems that were better at failing over
and self-healing. Now the issues happen, raise alerts, and then heal themselves well enough that the engineering team can wait until morning to investigate. This wouldn't have
happened if the engineering team just pushed out bits that kept other people up at
night. Now they work to balance these improvements as a part of not just feature
velocity, but engineering improvement velocity.
Summary
Adopting a live site culture has impacted the way Microsoft builds and delivers software.
By making engineering teams a key part of security and operations, the quality of our
code and end-user experience have improved drastically. Being a full participant in
operations has made engineering a key stakeholder, resulting in systems that are
designed for better operations.
Security in DevOps (DevSecOps)
Article • 11/28/2022
Security is a key part of DevOps. But how does a team know if a system is secure? Is it
really possible to deliver a completely secure service?
Unfortunately, the answer is no. DevSecOps is a continuous and ongoing effort that
requires the attention of everyone in both development and IT operations. While the job
is never truly done, the practices that teams employ to prevent and handle breaches can
help produce systems that are as secure and resilient as possible.
"Fundamentally, if somebody wants to get in, they're getting in...accept that. What
we tell clients is: number one, you're in the fight, whether you thought you were or
not. Number two, you almost certainly are penetrated." -- Michael Hayden, Former
Director of NSA and CIA
How real is the threat? Teams often don't appreciate the potential value of the
services and data they're charged with protecting.
Our team is good, right? A security discussion may be perceived as doubt in the
team's ability to build a secure system.
I don't think that's possible. This is a common argument from junior engineers.
Those with experience usually know better.
We've never been breached. But how do you know? How would you know?
Endless debates about value. DevSecOps is a serious commitment that may be
perceived as a distraction from core feature work. While the security investment
should be balanced with other needs, it can't be ignored.
Every team should already have at least some practices in place for preventing breaches.
Writing secure code has become more of a default, and there are many free and
commercial tools to aid in static analysis and other security testing features.
However, many teams lack a strategy that assumes system breaches are inevitable.
Assuming that you've been breached can be hard to admit, especially when having
difficult conversations with management, but that assumption can help you answer
questions about security on your own time. You don't want to figure it all out during a
real security emergency.
First, focus on improving mean time to detection and mean time to recovery. These
metrics indicate how long it takes to detect a breach and how long it takes to recover,
respectively. They can be tracked through ongoing live site testing of security response
plans. When evaluating potential policies, improving these metrics should be an
important consideration.
Practice defense in depth. When a breach happens, attackers can get access to internal
networks and everything inside them. While it would be ideal to stop attackers before it
gets that far, a policy of assuming breaches drives teams to minimize exposure from an
attacker who has already gotten in.
Attack vectors
Consider a scenario where an attacker has gained access to a developer's credentials.
What can they do?
Privilege: Can they access a test environment?
Attack: If a production environment takes a dependency on the test environment, exploit it.
Secret management
All secrets, such as passwords, API keys, certificates, and connection strings, must be stored in a protected vault.
You should use a hierarchy of vaults to eliminate duplication of secrets. Also consider
how and when secrets are accessed. Some are used at deploy-time when building
environment configurations, whereas others are accessed at run-time. Deploy-time
secrets typically require a new deployment in order to pick up new settings, whereas
run-time secrets are accessed when needed and can be updated at any time.
Platforms have secure storage features for managing secrets in CI/CD pipelines and
cloud environments, such as Azure Key Vault and GitHub Actions.
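For example, a GitHub Actions job might receive a deploy-time setting from the platform's secret store while reading a run-time secret from Azure Key Vault on demand; the script, vault, and secret names below are placeholders.
YAML
# Illustrative steps (assumes an earlier azure/login step).
# Deploy-time secret: injected from the platform's secret store when the deployment runs.
- name: Apply deploy-time configuration
  run: ./scripts/configure-environment.sh   # hypothetical script
  env:
    APP_SETTING: ${{ secrets.APP_SETTING }}
# Run-time secret: read from Azure Key Vault when needed, so it can be rotated
# without redeploying.
- name: Read a run-time secret from Key Vault
  run: |
    DB_CONNECTION=$(az keyvault secret show \
      --vault-name kv-example \
      --name DbConnectionString \
      --query value --output tsv)
    # Use the value directly; avoid echoing secrets to the log.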
Helpful tools
Microsoft Defender for Cloud is great for generic infrastructure alerts, such as for
malware, suspicious processes, etc.
Source code analysis tools for static application security testing (SAST).
GitHub advanced security for analysis and monitoring of repos.
mimikatz extracts passwords, keys, pin codes, tickets, and more from the
memory of lsass.exe , the Local Security Authority Subsystem Service on
Windows. It only requires administrative access to the machine, or an account with
the debug privilege enabled.
BloodHound builds a graph of the relationships within an Active Directory environment. The red team can use it to easily identify attack vectors that would otherwise be difficult to find quickly.
War game exercises
A common practice at Microsoft is to engage in war game exercises. These are security
testing events where two teams are tasked with testing the security and policies of a
system.
The red team takes on the role of an attacker. They attempt to model real-world attacks
in order to find gaps in security. If they can exploit any, they also demonstrate the
potential impact of their breaches.
The blue team takes on the role of the DevOps team. They test their ability to detect and
respond to the red team's attacks. This helps to enhance situational awareness and
measure the readiness and effectiveness of the DevSecOps strategy.
Before starting war games, the team should take care of any issues they can find
through a security pass. This is a great exercise to perform before attempting an attack
because it will provide a baseline experience for everyone to compare with after the first
exploit is found later on. Start off by identifying vulnerabilities through a manual code
review and static analysis tools.
Organize teams
Red and blue teams should be organized by specialty. The goal is to build the most
capable teams for each side in order to execute as effectively as possible.
The red team should include some security-minded engineers and developers deeply
familiar with the code. It's also helpful to augment the team with a penetration testing
specialist, if possible. If there are no specialists in-house, many companies provide this
service along with mentoring.
The blue team should be made up of ops-minded engineers who have a deep
understanding of the systems and logging available. They have the best chance of
detecting and addressing suspicious behavior.
"Defenders think in lists. Attackers think in graphs. As long as this is true, attackers
win." -- John Lambert (MSTIC)
Over time, the red team will take much longer to reach objectives. When they do, it will
often require discovery and chaining of multiple vulnerabilities to have a limited impact.
Through the use of real-time monitoring tools, the blue team should start to catch
attempts in real-time.
Guidelines
War games shouldn't be a free-for-all. It's important to recognize that the goal is to
produce a more effective system run by a more effective team.
Code of conduct
Here is a sample code of conduct used by Microsoft:
1. Both the red and blue teams will do no harm. If the potential to cause damage is
significant, it should be documented and addressed.
2. The red team should not compromise more than needed to capture target assets.
3. Common sense rules apply to physical attacks. While the red team is encouraged
to be creative with non-technical attacks, such as social engineering, they shouldn't
print fake badges, harass people, etc.
4. If a social engineering attack is successful, don't disclose the name of the person
who was compromised. The lesson can be shared without alienating or
embarrassing a team member everyone needs to continue to work with.
Rules of engagement
Deliverables
Any security risks or lessons learned should be documented in a backlog of repair items.
Teams should define a service level agreement (SLA) for how quickly security risks will be
addressed. Severe risks should be addressed as soon as possible, whereas minor issues
may have a two-sprint deadline.
A report should be presented to the entire organization with lessons learned and
vulnerabilities found. It's a learning opportunity for everyone, so make the most of it.
War games are an effective way to change DevSecOps culture and keep security
top-of-mind.
Phishing attacks are very effective for attackers and should not be underestimated.
The impact can be contained by limiting production access and requiring two-
factor authentication.
Control of the engineering system leads to control of everything. Be sure to strictly
control access to the build/release agent, queue, pool, and definition.
Practice defense in depth to make it harder for attackers. Every boundary they
have to breach slows them down and offers another opportunity to catch them.
Don't ever cross trust realms. Production should never trust anything in test.
Next steps
Learn more about the security development lifecycle and DevSecOps on Azure .
Enable DevSecOps with Azure and
GitHub
Article • 11/28/2022
DevSecOps, sometimes called Secure DevOps, builds on the principles of DevOps but
puts security at the center of the entire application lifecycle. This concept is called “shift-
left security”: it moves security upstream from a production-only concern to encompass
the early stages of planning and development. Every team and person that works on an
application is required to consider security.
Microsoft and GitHub offer solutions to build confidence in the code that you run in
production. These solutions inspect your code and allow its traceability down to the
work items and insights on the third-party components that are in use.
You can scan code to find, triage, and prioritize fixes for existing problems. Code
scanning also prevents developers from introducing new problems. You can schedule
scans for specific days and times, or trigger scans when a specific event occurs in the
repository, such as a push. You can also track your repository's dependencies and
receive security alerts when GitHub detects vulnerable dependencies.
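For instance, a CodeQL-based code scanning workflow can run on every push and on a weekly schedule; the branch, language, and cron values shown here are examples.
YAML
# Illustrative code scanning workflow using CodeQL.
name: code-scanning
on:
  push:
    branches: [ main ]
  schedule:
    - cron: '0 2 * * 1'   # weekly scan, Mondays at 02:00 UTC
jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      security-events: write   # required to upload results to GitHub
      contents: read
    steps:
      - uses: actions/checkout@v3
      - uses: github/codeql-action/init@v2
        with:
          languages: javascript
      - uses: github/codeql-action/analyze@v2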
Azure Pipelines integrates metadata tracing into your container images, including
commit hashes and issue numbers from Azure Boards, so that you can inspect your
applications with confidence.
The ability to create deployment pipelines with YAML files and store them in source
control helps drive a tighter feedback loop between development and operation teams
who rely on clear, readable documents.
Bridge to Kubernetes allows you to run and debug code on your development
computer, while still connected to your Kubernetes cluster with the rest of your
application or services. You can test your code end-to-end, hit breakpoints on code
running in the cluster, and share a development cluster between team members without
interference.
Access management for cloud resources is a critical function for any organization that
uses the cloud. Azure role-based access control (Azure RBAC) helps you manage who
has access to Azure resources, what they can do with those resources, and what areas
they can access.
You can use the Microsoft identity platform to authenticate with the rest of your DevOps
tools, including native support within Azure DevOps and integrations with GitHub
Enterprise.
Currently, an Azure Kubernetes Service (AKS) cluster (specifically, the Kubernetes cloud
provider) requires an identity to create additional resources like load balancers and
managed disks in Azure. This identity can be either a managed identity or a service
principal. If you use a service principal, you must either provide one or AKS creates one
on your behalf. If you use managed identity, one will be created for you by AKS
automatically. For clusters that use service principals, the service principal must be
renewed eventually to keep the cluster working. Managing service principals adds
complexity, which is why it's easier to use managed identities instead. The same
permission requirements apply for both service principals and managed identities.
Managed identities are essentially a wrapper around service principals, and make their
management simpler.
Learn how to monitor your applications and infrastructure using Azure Application
Insights and Azure Monitor.
Highlights
Introduction to Azure DevOps
Agile at Microsoft
Azure DevOps overview (Video, PPT): Plan smarter, collaborate better, and ship faster with a set of modern developer services.
Plan your work with Azure Boards (Video, PPT): Anyone who works on software projects knows that there are issues to track, manage, and prioritize. Azure Boards has all the features your team needs to successfully manage your work. Visualize projects with Kanban boards, execute in sprints, manage your backlog, and use queries to find work and visualize results. Learn how to get started with Azure Boards.
Manage and store your code in Azure Repos (Video): If you write code, then you need a place to store and manage that code with a reliable version control system like Git. Azure Repos provides a best-in-class Git solution. You get free private and public repos, social code reviews, and more. Learn how to get started with Git in Azure Repos and how your team can use pull requests to work together on code.
Use Azure Pipelines to add continuous builds to GitHub projects (Video): Learn how to take a GitHub repo and add continuous builds using Azure Pipelines. You'll see each step in taking a Node.js GitHub project and adding continuous builds to validate the code quality of each pull request. Azure Pipelines is free for open-source projects.
Build and deploy your code with Azure Pipelines (Video): With Azure Pipelines, you can build and deploy code written in any language, using any platform. In this video, you'll learn why Azure Pipelines is the best tool on the planet for continuous integration and continuous deployment (CI/CD) of your code.
Get started with Azure Artifacts (Video, PPT): Azure Artifacts helps you manage software components by providing an intuitive UI, as well as helpful tools to ensure immutability and performance for the components you create or consume. Learn how to get started by creating a feed for an npm package to use in your Azure Pipeline.
Automated and manual testing with Azure Test Plans (Video): Azure DevOps Test Plan provides all the tools you need to successfully test your applications. Create and run manual test plans, generate automated tests, and collect feedback from users. In this video, you'll see the basic aspects on how to get started with Azure Test Plan, so you can start testing your application today.
Azure DevOps Server (PPT): Share code, track work, and ship software using integrated developer tools, hosted on-premises.
60,000 tests in six minutes: Create a reliable testing pipeline & deploy safely with Azure Pipelines (Video, PPT): Good test coverage is essential for catching issues before a pull request has been merged, but they have to be the right kind of tests and must be reliable. Sam Guckenheimer digs into the testing transformation his team at Microsoft underwent as they started on their DevOps journey. He walks you through the changes they went through and why, and explains the data they found to prove their case for change and what they did to move. Sam also details which things are best covered by unit tests, which you should leave to manual code review in the pull request, and which are best suited to testing in production.