Exchange Server Troubleshooting Companion
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means
without the written permission of the authors.
The example companies, organizations, products, domain names, email addresses, logos, people, places and
events depicted herein are fictitious. No association with any real company, organization, product, domain
name, email address, logo, person, place, or event is intended or should be inferred. The book expresses the
views and opinions of the authors. The information presented in the book is provided without any express,
statutory, or implied warranties. The authors cannot be held liable for any damages caused or alleged to be
caused either directly or indirectly by this book.
Although the two authors are members of Microsoft’s Most Valuable Professional (MVP) program, the content of
this book solely represents their views and opinions about Office 365 and any other technologies mentioned in
the text and is not endorsed in any way by Microsoft Corporation.
Please be respectful of the rights of the authors and do not make copies of this eBook available to
others.
Updates and corrections for this eBook are issued regularly until a new edition is published. This version was
last updated on 21 April 2016. You can find information about the changes included in each update through
posts to the exchangeserverpro.com blog.
Foreword ............................................................................................................................................................. v
Preface ................................................................................................................................................................ vi
Chapter 1: Introduction .....................................................................................................................................1
Know the environment and have a baseline ..............................................................................................1
Supportability and Staying Up to Date .......................................................................................................7
Additional reading ...................................................................................................................................... 10
Chapter 2: Troubleshooting Active Directory .............................................................................................. 11
Using Event Viewer to diagnose Exchange Active Directory Communication Issues ........................ 13
Using Exchange Setup as a troubleshooting tool................................................................................... 15
Using Active Directory Sites and Services ................................................................................................ 17
Using the DCDIAG Tool .............................................................................................................................. 19
Using the REPADMIN Tool......................................................................................................................... 19
Active Directory User Token Bloat ............................................................................................................ 20
User Principal Names and Exchange Authentication ............................................................................. 21
Active Directory Performance Counters and MaxConcurrentAPI ......................................................... 23
Additional reading ...................................................................................................................................... 25
Chapter 3: Troubleshooting Client Access Services .................................................................................... 28
Of front and back ends............................................................................................................................... 28
The basics of Certificates............................................................................................................................ 30
IIS Basics and Exchange.............................................................................................................................. 38
Troubleshooting Exchange Load Balancing ............................................................................................ 53
Validating Exchange Endpoints ................................................................................................................. 57
Additional reading ...................................................................................................................................... 65
Chapter 4: Troubleshooting Transport ......................................................................................................... 68
A Brief History of Transport in Exchange Server ..................................................................................... 68
Understanding Troubleshooting Scenarios for Transport ..................................................................... 69
The Critical Role of DNS ............................................................................................................................. 71
SMTP Connectivity ...................................................................................................................................... 78
The Transport Pipeline................................................................................................................................ 83
Troubleshooting Transport ........................................................................................................................ 89
Even if these projects remain painful, we have become better at migrations since 1996 and interoperability
between email systems is now a given. The Exchange software has improved too and is more automatic and
generally easier to manage. There’s a lot more documentation to read and a mass of knowledge to be
absorbed from blogs, webinars, conferences, and so on. Some of the blogs are extraordinarily helpful; some
are much less so. We’ve all read posts that are badly written, inaccurate, and based on outdated information.
Even with all the additional information available to us, I think the support challenge remains high.
Configurations scale to unheard-of heights using hardware that no one anticipated in 1996. Virtualization
creates more complexity for administrators to master. The spread of mobile devices and more functional
clients creates new demands on infrastructures and the software is much more complex. Just mastering a
single topic such as Autodiscover (especially in a hybrid cloud environment) can take more time than you’d
imagine.
All of which means that we still need help to master Exchange, if that is even possible for a single human
being. And it’s not just Exchange; it’s a complete ecosystem of network, hardware, software, clients, and
people that has to be understood and managed. Books like this that are founded on the experience gained in
the school of hard knocks are especially valuable because they strip away the veneer that often disguises bad
administrative practices to approach the topic in a very practical sense. I rather like that and I think you will
too.
No book is ever complete. Software evolves. Humans learn. People make mistakes and come up with new
solutions and writers suddenly realize that they have missed something important. Don’t treat this book as the
definitive word on troubleshooting Exchange because it can never be. No author understands your
environment or business goals or even the unique brand of “office politics” that exists in your company. The
best way to use the advice given here is to put it into context. Question the guidance and make sure it fits
your needs and you’ll benefit from the exercise.
Above all, remember that Exchange evolves on an ongoing basis. Every update shipped by Microsoft brings its
own unique quirks, so be prepared to continue to learn. It’s the only way to succeed.
Tony Redmond
April 2016
When Paul Cunningham approached me 2 ½ years ago about writing an Exchange troubleshooting book, I
was a bit surprised. When a man with over twenty times as many Twitter followers as you and who runs one of the
most popular Exchange Server websites in the world asks to collaborate with you on a project, it’s a bit of a
shocker. I had published a blog post about Troubleshooting Issues with Client Access Servers and Paul wanted
to expand the concept into a full book’s worth of troubleshooting information. Due to other projects we each
had, it actually took another year for the project to begin. The original idea started off as just a large dump of
useful OneNote data acquired over our combined years of experience, to be used by those looking for a place to
begin when troubleshooting the relevant Exchange components.
However, as writing began we realized the true value in this book would be the message of WHY the product
breaks in the way it does and HOW to go about fixing it. The beauty of an eBook is we can easily add as many
hyperlinks to blogs, tools, and Support articles as we feel necessary. This means we can still provide the value
of a “OneNote knowledge dump” while simultaneously explaining why the product behaves the way it does,
giving the reader valuable knowledge to be used in diagnosing an issue. If you know how a product is
supposed to function, you’re able to accurately identify when it is malfunctioning and whether the advice
you’re receiving (either from the internet or another source) is good or poor advice. What we hope to
accomplish in this Exchange Server Troubleshooting Companion is to give the reader both an understanding
of the product and a tool to be used as a reference when troubleshooting the many complex features of the
Exchange Server product.
Whether you’re an experienced Exchange professional looking to uncover a nugget of product knowledge you
may not have encountered before, or someone entirely new to the product who wishes to become a capable
Exchange troubleshooter, we hope this book proves useful. Here is a list of those we feel may benefit from this
book:
Whatever your goals, we hope you enjoy the book and find it useful to your tasks as an Exchange
Administrator, Support Engineer, or Consultant. If even one outage or call to Microsoft can be avoided as a
result of reading this book, your investment will be repaid instantly.
We’d love to get your feedback on the book, including any topics you feel we should cover. Please send all
feedback to [email protected].
Over 15 years later, that advice has proven itself time and time again.
Working for support organizations and in consulting roles continually threw
me into the unknown, facing previously unseen problems in unfamiliar
customer environments. Applying an effective troubleshooting process and methodology, combined with
product knowledge, got me through those situations and enabled me to solve the customer's problem.
With the Exchange Server Troubleshooting Companion I firmly believe we’ve written an eBook that will help
you to solve problems faster, understand why the problem occurred, and learn how to prevent the same
problems from occurring in the future.
Thanks to my co-author Andrew for bringing his deep technical experience to this book, Tony Redmond for
his editorial assistance, Jeff Guillet for his technical reviews, and to David Wedrat and Chris Brown for checking
my work.
Being an MVP, writing the Exchange Server Pro blog, and writing eBooks like this one is very rewarding, but
also takes a lot of time and energy away from my family and personal life. I am endlessly appreciative of the
ongoing support from my wife Hayley and our children as I spend many late nights and weekends
contributing to the Exchange Server community.
That goal was always at the forefront of my thoughts when contributing to this book. To help others by
passing on lessons learned, tips, and knowledge that was strenuously acquired over a decade of working with
Exchange. I feel we’ve written something which will not only help the reader in times of need but also help
them grow their understanding of the Exchange product itself.
Thanks to my co-author Paul for getting me involved in the project, Tony Redmond for helping me navigate
the world of technical writing (and spending his time translating from Texan to English), and Jeff Guillet for his
technical expertise in reviewing this book. It’s been an honor to work with these professionals who are masters
of their craft.
I’d also like to thank Timothy Heeney, Alessandro Goncalves, and Brian Day at Microsoft for lending their
thoughts to my content and for all the help they’ve given me with Exchange since I’ve known them. I’d also
like to thank my fellow Microsoft Certified Masters Jedidiah Hammond, Ron He, and Mark Henderson for
being excellent Exchange mentors throughout the years. Most importantly I must thank my wife Lindsay for
her support and patience for the many long hours spent working on this book and other IT endeavors.
Follow Andrew @Ashdrewness, check out the Exchange Server community he moderates on Reddit, or read his
blog posts at Exchangemaster.Wordpress.com and Ashdrewness.Wordpress.com.
When things don’t work as planned, how do we react? With surprise or panic or perhaps pessimistic doom
and gloom? The optimists amongst us probably take a calmer approach to first scoping and then resolving
what’s going wrong but even they can be reduced to frustration at times. Working with technology gives us
plenty of possibilities to exhibit different emotions and having worked in a support role for over a decade I’ve
seen nearly all of them. Often it’s a lack of experience that predisposes individuals to react poorly when a
piece of technology breaks, but a lack of relevant knowledge can have the same effect.
You can expect that being armed with the experiences of others combined with relevant technical product
knowledge can equip an individual to handle even the most complex troubleshooting scenarios. One can even
hope that with practice, reactions become instinctual under duress. In other words, when something breaks,
knowing the proper questions to ask and the right data to gather to drive towards the resolution
of a problem simply becomes second nature. In my opinion, the mark of a good troubleshooter is someone
who can navigate a complex break-fix situation without necessarily knowing the technology itself, but rather
trusting the instincts they’ve built over time.
In a way, that’s what we hope to give you with this book. We’re never going to be able to deliver a cheat sheet
to resolve every possible thing that can go wrong with Microsoft Exchange. The product is too big and
complex and the environments in which it is deployed create such a massive matrix of possibilities that it
would be impossible to attempt to create such a document. But what we can do is tap into our years of
experience troubleshooting Microsoft Exchange, describe some of the most common areas that cause
problems, and list the tools that experience has proven to be useful in resolving those problems.
I hope that you find the content of the book not only a useful read but also a tool in and of itself to help you
troubleshoot Exchange issues in different environments. A decade of troubleshooting Exchange and
training others to do the same, including the opportunity to speak on the topic at various conferences, has
made me realize that the more knowledge you have when approaching a broken Exchange server, the better.
Spending years gathering blog posts, amassing a sizeable collection of tips and techniques in OneNote, and
facing all manner of broken messaging systems have hopefully supplied sufficient useful content for readers
to keep coming back to use this text as a reference.
But before we dive into analyzing what might happen to different areas of Exchange, we should first discuss
the topic of troubleshooting itself: how we obtain and analyze data relevant to the problem that has to be
resolved to restore normal operations, how we handle stressful situations, and the steps that can be taken to
avoid similar situations in the future.
Other times we’re working with an unknown environment on a contract/project basis. Maybe to upgrade
systems to newer software or maybe to migrate mailboxes or other data to new systems. Or maybe we act in a
purely advisory capacity for an environment we did not build.
In all cases, gathering intelligence about the target environment is a vital step to take before attempting to
troubleshoot any issue. Much like reading and understanding the assembly instructions for a piece of IKEA
furniture is a must do to ensure successful construction (and to avoid a costly argument with your significant
other), assembling information about the IT systems and their use before you consider what might be going
wrong is a fundamental first step in the troubleshooting process.
This is especially important for performance-related issues. Someone may say “X is slow”. Well, “slow” isn’t a
measurable thing because it’s a very subjective assessment that could be influenced by many factors,
including that person’s experience of how an application works. And it’s certainly not a term that is
consistently defined among users. Slow compared to what? Yesterday? When I used Outlook 2010 as opposed to
Outlook 2013? When my mailbox is on an Exchange 2013 server as opposed to when the mailbox was on an
Exchange 2007 server? When I was connected by Wi-Fi compared to a LAN connection at my desk? We’ll
cover how to handle many of these variables when I talk about the right questions to ask, but for now you can
see that a term like “slow” can mean different things to different people. It is therefore important to establish a
performance baseline, as well as a means to measure it against a currently reported behavior. A colleague
once said “how will you ever know what bad performance is without knowing what your good performance
is?”
The need to have a performance baseline introduces a requirement for monitoring, not necessarily from a
particular vendor or solution, but as an ongoing practice. Some sports coaches use the term “winning is a
habit” to refer to the culmination of all the little things you do along the way, every day, that determines your
success. Those are the things that get you ready for game day. Of course in our case, game day is when the
messaging environment goes down due to an unforeseen issue. Being properly prepared can make or break
your reputation/performance rating/job security.
Every organization will have its own way to monitor performance. It doesn’t really matter what tools you use,
as long as you have an understanding of the historical load and performance of your Exchange servers. I make
sure that I know the basic characteristics of what’s happening to Exchange, including:
Number of mailboxes
Messages Sent/Received daily
Average Message Size
Expected IOPS per Database/Disk/Server
Average Disk Latency
Average Memory consumption
Average CPU Utilization
Average connections per client type at the Client Access Layer (typically the Load Balancer)
You probably have your own ideas about the essential signs to monitor, and we will certainly cover most of
these topics in the Performance chapter.
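As a minimal sketch of how a few of these figures can be captured (assuming you run it from an Exchange Management Shell on the server itself, that the C:\Baseline folder already exists, and that EX01 is a placeholder server name), something along these lines can be run on a schedule to build up a baseline:

# Sample CPU, memory, and disk latency ten times at 30-second intervals and keep the results
$counters = '\Processor(_Total)\% Processor Time',
            '\Memory\Available MBytes',
            '\LogicalDisk(*)\Avg. Disk sec/Read',
            '\LogicalDisk(*)\Avg. Disk sec/Write'
Get-Counter -Counter $counters -SampleInterval 30 -MaxSamples 10 |
    Export-Counter -Path C:\Baseline\ExchangeBaseline.blg

# Rough count of messages received by one server over the last day
(Get-MessageTrackingLog -Server EX01 -EventId RECEIVE -Start (Get-Date).AddDays(-1) -ResultSize Unlimited).Count

Captured regularly, data like this gives you the known-good numbers to compare against when someone next reports that Exchange is "slow".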
Knowing all of this information at the start of a support call would make life much easier. In fact, if the people
running an environment were that well equipped with data, I’d wonder why they hadn’t already resolved the
issue on their own. But more often I find myself asking for
information I can reasonably expect any administrator to have at their fingertips, such as:
The software version running on servers (including the Cumulative Update/Service Pack/Update
Rollup)
The number of servers/users in the environment
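Both pieces of information are easy to gather from the Exchange Management Shell; a quick sketch (the build number shown in AdminDisplayVersion can then be mapped to the corresponding Cumulative Update, Service Pack, or Update Rollup):

# Server names, editions, and build numbers across the organization
Get-ExchangeServer | Format-Table Name, Edition, AdminDisplayVersion -AutoSize

# Approximate user count
(Get-Mailbox -ResultSize Unlimited).Count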
The software revision level matters because it will determine whether the server is currently supported by
Microsoft (more on this shortly when we discuss Supportability) and also because many issues are known and
have been fixed by available updates.
Knowing the number of servers/users in the environment determines both the impact of the problem as well
as the complexity of the environment. Although I feel that Microsoft’s Preferred Architecture yields the most
stable and simple deployment at scale, it can be argued that a single virtual Exchange server with 100
mailboxes is simpler or less complex than a 60k seat global deployment based on physical servers running
with JBOD storage. I haven’t mentioned client type yet but you can assume that environments where all users
run Outlook in Cached Mode or Outlook Web App are much easier to troubleshoot than environments with
ActiveSync, BES, VDI/thin clients (Outlook Online Mode), Outlook for Mac, and the various mobile Apps.
Understanding the environment ties directly into understanding the scope of an issue, which is one of the
more important aspects of troubleshooting anything in life. Will a load balancer outage bring down services
globally or only in a regional datacenter? Is a database failure impacting 20 mailboxes or 200? Does that
database also house the only Public Folder Mailbox? If so then the outage could have a global impact. As you
can see, scope and impact are often directly linked. Therefore, it’s vital upon initial issue analysis to determine
the scope of an issue.
The timeline of events is one of the most critical pieces of information to know when analyzing an issue. Did
the issue happen today? Has it been getting progressively worse over a matter of months? Did it start after a
change was made to a component such as a storage controller? Did it start after changing datacenters? Did it
start after the servers were last rebooted? What was the reason you rebooted (updates or some other reason)?
Did anything else change other than that hardware update you ran? No, nothing else changed? Oh, never
mind, looks like there were 20 pending Windows Updates waiting for a reboot…. Unfortunately, I have been in
situations where fundamental issues like servers waiting to be rebooted to complete the installation of
Windows updates have been present, and more than once sadly. So you can see why having an accurate
(making no assumptions) timeline of events can give you the kind of essential information you need to start
down the right path of troubleshooting.
Finally, having the hardware specifications can give you an idea of what skill set you’ll need to diagnose the
issue. If the problem lies with virtualized Exchange servers running on VMware with attached SAN-based
storage using Cisco switches, you could potentially need assistance from five different vendors: Microsoft,
Cisco, VMware, the server vendor and the SAN vendor. Depending on your role in the company, the
coordination of support requests to all of those vendors and the reconciliation of the answers that come back
might rest on your shoulders (I hope they’re paying you well). Even if it doesn’t, you will at the very least need the
knowledge and cooperation of several individuals.
Having as much background data about the environment at the ready is extremely useful to resolving issues
between intertwined teams. I always love to see customers with detailed network diagrams,
SAN/Switch/Server update levels, performance data (historical and current), and configuration data. Having
this information at hand almost always leads to a faster resolution.
No matter whether you’re the individual responsible for all aspects of the Exchange environment or a member
of a vast team including coworkers as well as vendors, having a detailed and well-documented understanding
of the environment will dramatically increase your chances of a fast resolution and make you look like a rock
star. Far too often I see customers not investing the time to properly document their environments, and they
pay the price for that oversight when an issue arises.
Similarly, another common mistake is to think, “well we have Bob, and Bob manages Exchange and Active
Directory. He knows the environment like the back of his hand.” I call this a knowledge vacuum, because when
Bob isn’t there his coworkers are left clueless. Some individuals like it that way as it gives them a sense of job
security, but it’s the job of the IT Director/CIO to ensure that no one person holds the keys to the kingdom. So
a properly documented environment, along with performance/monitoring baselines will go a very long way
towards keeping you well able to handle any incident that comes along.
Testing
Everyone who has tested something by deploying it into production please raise their hand. It’s OK, we’ve
likely all done it in some form or fashion. In fact, if nobody raised their hand then I probably wouldn’t have a
job and books like this would not be needed. People have a tendency to remember the bad more than the
good when it comes to product quality. This is true for any product.
Many Exchange administrators could tell you about an Exchange update that broke search, or an Apple
update that filled up their drives with transaction logs because calendar items were processed 100k times, or
when a Windows Update broke applications that depended on the .NET Framework. However, they probably
don’t remember all the instances where an update went smoothly, or added stability to the environment (I’ve
seen more Exchange updates do just that rather than break things). Still, having trust in your vendors won’t
mean much when an outage occurs that could have been avoided.
So how do we avoid Exchange disasters? Well we can’t avoid them all but we can tackle the factors that
introduce risk into the environment and change is one of the biggest factors. IT change management is
worthy of its own book, but speaking purely from an Exchange perspective, we could still spend chapters
discussing the perils of haphazard Exchange changes.
Everyone talks about the horrors of Active Directory schema updates, but since Microsoft launched frequent
cumulative updates for Exchange 2013, these updates are now expected every three months. Issues arising
from schema updates are actually extremely rare. I’ve spoken with a Microsoft Active Directory Premier Field
Engineer and a Support Escalation Engineer about the topic. Between both, they had seen maybe half a dozen
catastrophic failures as a result of failed schema updates. To put that figure in context, that’s in almost 25
combined years of experience dealing with the worst of the worst Active Directory support cases. Even so,
people continue to fear schema updates not for their failure rate, but rather because of the ramifications of
such a failure.
It’s much like flying in a plane. Statistics tell us it’s the safest way to travel, yet more people fear flying than
riding in a car, even though driving is statistically the more dangerous means of travel. However, while many
people have survived car crashes, far fewer have survived plane crashes. When dealing with schema updates,
the ramifications of failure are much higher as a complete Active Directory Forest recovery might be required.
The need to avoid the consequence of failure within production environments is why people build test
environments, and it is why Microsoft recommends customers test all updates and changes to Exchange in a
lab. Although the employment status of IT professionals is seldom threatened when lab environments are
broken, the consequences are more serious when flaws emerge in production.
I commonly hear customers say that they don’t have the money, resources, or time to completely duplicate
their production environment in a lab. To that I say, few have the money to create a perfect duplication, but it
doesn’t take a lot of effort and money to back up a Domain Controller and a couple Exchange servers and
restore them into a virtual environment. While Microsoft tests every Exchange build in Office 365 as well as
their on-premises environments, they have zero knowledge of how YOUR environment is configured and what
software is running in it, including custom security policies, group policies, custom access control lists, anti-
virus software, third-party plugins, and that one custom script that has to run or else nobody gets paid on
Friday.
Any customization or deviation from using software in the way Microsoft expects you to (and therefore tests
for) is a potential liability that must be tested. This is especially important with Exchange 2013 onwards
because fewer rollback options exist in comparison to Exchange 2010. In Exchange 2010 a Service Pack was
essentially an entire fresh set of bits, which could not be uninstalled. If you install Exchange 2010 Service Pack
2, there is no way of uninstalling it to revert to Exchange 2010 Service Pack 1. Your only option is to install a
new server with Service Pack 1 and migrate mailboxes to it. Conversely, Update Rollups were pushed down
with Windows Updates and added new features/functionality/fixes. These Update Rollups could be uninstalled
(for example, updating from Service Pack 3 Update Rollup 7 to Update Rollup 8, and back to Update Rollup 7
again) which gave administrators the ability to roll back changes that introduced functionality that they didn’t
desire. For example, a number of Exchange 2010 Update Rollups changed things like cross-site client
redirection or RPC client connectivity. In certain environments, the changes led to undesired behavior. Luckily,
customers had an exit strategy of uninstalling the rollup. Of course, testing the updates first in a lab would’ve
been ideal, as they would then never have had to discover the problem in production.
Along came Exchange 2013 to change the way in which Microsoft services Exchange with the introduction of
Cumulative Updates. Cumulative Updates behave much like Service Packs in that they’re full and complete
installations of a brand-new version of the product. They cannot be uninstalled and are expected to require
schema updates. This means an Exchange administrator’s exit strategy is near non-existent if they discover
problems after an update is deployed.
Because Cumulative Updates cannot be uninstalled, you have to install a new server if you did not like the
introduced behavior. And if that behavior flows from a schema update, you would find yourself performing an
Active Directory Authoritative Restore. The way the new servicing model works highlights the importance of
testing in a lab environment especially if your production environment has certain quirks or customizations
that Microsoft cannot anticipate. For example, you might have a business process that relies on mail-enabled
public folders with customized rules/categories. Or perhaps all 50 developers have access to each other’s mailboxes,
with all of those mailboxes open in Outlook Online mode in VDI because you feel it helps drive collaboration?
Or perhaps the marketing department has a frequent need to send 10,000 messages to customers in 10
minute spans, with customized encoding, using a custom transport agent? Congratulations on being unique
and creating conditions that Microsoft will likely never test. Every new Cumulative Update will be an adventure
for your configuration and you will have to test each update thoroughly to ensure that it works for you.
Ensuring that a new update will not break either Exchange or third-party functionality is vital.
Remember to work closely with your vendors to ensure they have tested their products against the Cumulative
Update you plan to deploy. For some reason, transport agents seem to be commonly affected by changes
introduced in updates. This is possibly because any change to the code within the transport pipeline will
directly affect a transport agent’s logic. So third party agents for anti-spam, anti-virus, archiving, disclaimers,
or routing must be validated against the new code before an Exchange update is introduced into production.
Let’s move on to what you should do when the time comes to pull the trigger and start an update. Here are a
few tips.
Disable anti-virus software for both file and process scanning. I’ve dealt with many customers who had
anti-virus crash the update process and leave their Exchange server in an inconsistent state. In theory, with a
correctly behaving anti-virus process that has all the proper exclusions, you should have nothing to worry
about. However, many people do not create the proper exclusions (please see this page), or the anti-virus
software misbehaves even when the exclusions are in place, so disabling it for the duration of the update is the
safer course.
Know where logs are located. Issues can still happen during the setup process and when they do you need
to know what to do. The first thing I recommend is looking through the setup logs in the ExchangeSetupLogs
folder on the system drive, which typically provide some clues to point you in the right direction. Historically, if your system has crashed/hung
during setup, you’ll find yourself in what I call an in-between “Exchange Purgatory” state. Meaning some
components will either be installed or updated, while the rest are at the original version. In the old days you
would either have to start fresh or delete the installation watermark entries from the registry to force
Exchange to continue with the install. While that still may be required, Exchange setup has become pretty
sophisticated over the years and Exchange 2013 will typically just let you run the program again to a
successful completion.
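As a rough sketch, the main setup log can be scanned for obvious failures from PowerShell (Exchange 2010 and later write their setup logs to the ExchangeSetupLogs folder on the system drive; adjust the search patterns to suit):

# Show the last few error or failure entries from the primary setup log
Select-String -Path "$env:SystemDrive\ExchangeSetupLogs\ExchangeSetup.log" -Pattern 'error', 'failed' |
    Select-Object -Last 20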
Make sure your permissions are correct. Another common culprit that causes Exchange to crash during
updates is missing or inaccurate permissions. Personally, I like to open an elevated command prompt (Right-
Click > Run as Administrator) to run Setup.exe and update Exchange. Usually you’ll get an immediate failure if
permissions are the problem, but getting 80% through an update only to have it fail because you (or the
process you are running) don’t have the necessary permissions is beyond frustrating.
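For reference, a typical unattended upgrade run from an elevated prompt looks something like the following sketch (switches as used by Exchange 2013/2016 setup; run it from the folder where the Cumulative Update was extracted):

# Elevated PowerShell or command prompt, in the extracted CU folder
.\Setup.exe /Mode:Upgrade /IAcceptExchangeServerLicenseTerms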
Put the server in the right state. As part of Managed Availability, Microsoft introduced Server Component
States in Exchange 2013. Managed Availability is composed of a large number of components that aim to
ensure your Exchange environment stays available, functional, and healthy. Specifically, Server Component
States are groupings of the key features of an Exchange server, which are either in an Active or Inactive state.
These states can be altered by various requestors, such as Maintenance, Deployment, or the HealthAPI. When
an Exchange update process begins, it requests all components go into an Inactive state so the server transfers
work to other servers and won’t take any new workload on. Eventually, the services are fully taken offline to
allow the update process to update files. The requestor is typically Maintenance or Deployment. Unfortunately,
when an update fails for any reason, these components are not always restored from an inactive state and
you’ll find yourself having to run Get-ServerComponentState <ServerName> to verify all states are “Active”. If
they are not, you may have to manually enable them either via the Exchange Management Shell or via the
registry. The most important thing to understand here is that the same requester must be used to set a
component state back to Active as the one that originally set it to Inactive (a sketch of the pattern follows this
paragraph). So typically, re-enabling these states and restarting the relevant Exchange services is all that’s
needed to restore functionality after a failed update.
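A minimal sketch of that pattern (EX01 is a placeholder server name and HubTransport is just one example of a component): check the states first, then reactivate using the same requester that originally deactivated the component.

# Verify component states after a failed update
Get-ServerComponentState -Identity EX01 | Format-Table Component, State -AutoSize

# A component deactivated by the Maintenance requester...
Set-ServerComponentState -Identity EX01 -Component HubTransport -State Inactive -Requester Maintenance

# ...must be reactivated by the Maintenance requester as well
Set-ServerComponentState -Identity EX01 -Component HubTransport -State Active -Requester Maintenance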
The support lifetime for a cumulative update is as follows: “A CU will be supported for a period of three (3)
months after the release date of the next CU. For example, if CU1 is released on 3/1 and CU2 is released on 6/1,
CU1 support will end on 9/1.” At first reading, this statement shocked many Exchange administrators. In their
minds, the new model meant that unless they stayed within 3 months of NewVersion-1, they would be unable
to call into Microsoft Support for issues encountered in their production environment. I often say you should
never have a production workload for which you don’t have an escalation path (a statement I feel many IT
Directors would agree with), so you could imagine the state of panic this induced in some.
However, this is not necessarily what this statement meant and in my opinion it could’ve been much better
worded as it caused much confusion, mostly because Microsoft didn’t clearly define what “Supported” means.
First off, I recommend you watch this extremely helpful Ignite 2015 session from Exchange MVP Paul
Robichaux and Microsoft Program Manager Brent Alinger “Servicing Microsoft Exchange Server: Update
Your Knowledge”. In it, they clearly call out what Microsoft really means when they say a product is
“Supported”. Let’s look at it from each angle.
First off, just because something works does not mean it’s supported. Also, just because Microsoft hasn’t
explicitly come out saying something is not supported, does not mean it is supported. As Exchange MVP
Nathan O’Bryan once told me, “There’s no warning signage on a sports car that says not to drive it into a brick
wall going 100MPH, but that doesn’t mean you should do it!” So just because Microsoft hasn’t come out saying
they don’t support Exchange, SharePoint, SQL, and Lync on the same server it doesn’t mean it’s a supported
(or wise) configuration. Sadly, I’ve seen several customers who were used to running an integrated
environment on older Small Business Server configurations decide that similar integration was possible with
newer software and attempted to do this with disastrous results. Next, Microsoft supporting something means
that they have actually tested it and validated that everything works. So if you’re unable to get something to
work that Microsoft officially supports, it means that an environmental, configuration, or extremely oddball
issue exists that is likely to be classified as a bug. But the fact that they support it means Microsoft will help
you drive to a resolution.
Lastly, the Microsoft Product Lifecycle is generally considered the bible of Microsoft support timelines. As a
general rule you’ll find that Microsoft provides ten years support for a product after it’s released, but if a
subsequent Service Pack is released then the RTM code is only fully supported for a year after the Service Pack
release date. This is because RTM is treated as if it were the first service pack, so it’s supported for 1 year after
the next service pack appears. This logic is fairly easy to understand for service packs, but where do
Cumulative Updates and the N-1 Support strategy fit? As it turns out, this is the big difference between
Serviceable and Supportable (Serviceable not actually being an official term by Microsoft). Being within a
product’s “Supported” lifecycle doesn’t necessarily mean you’ll be issued a fix for a bug or undesirable
behavior. For that you need to be within the N-1 window, assuming the fix isn’t already in the latest release
(N).
There are two main ways in which you can engage Microsoft Support: by picking up the phone as a regular
customer and paying a one-time, incident-based fee, or by having a Microsoft Premier contract
where you have purchased a pre-determined number of support hours for use at your discretion. It’s widely
understood that the level of support you receive via Premier Support is higher than normal phone incident-
based support and the length to which the support engineers will go to investigate and research a problem is
greater than exists with “normal” support. In some cases, this is because you have already paid for the hours
that the Microsoft support engineers are using and they are happy to let you spend those hours how you
wish. In my experience, I’ve found that if you’re a Premier customer, you’ll never be told something isn’t
supported and then have the engineer abruptly terminate the call. Microsoft may tell you an update is
required or that the software you have lies outside a product’s support timeline, but they will still attempt to
get the issue to resolution. A regular customer without a Premier contract calling in paying a one-time fee to
get an issue resolved is going to have much less success, mainly since by simply calling in, Microsoft is likely
already losing money. That one-time fee really doesn’t go that far in terms of Microsoft profitability.
As I said, it also depends on the situation itself. If your database will not mount due to it being in a Dirty
Shutdown state, Microsoft isn’t going to drop the call just because you’re running software that is three
Cumulative Updates behind. After all, the level of the software has nothing to do with your database being
dismounted. On the other hand, if you call in with performance issues or problems related to client
connectivity, they’re probably not going to spend much time looking at the issue before asking that you get to
the latest update. Some customers will then look at the public Knowledge Base articles for the updates and
discover that the issues they’re experiencing are not specifically called out. It’s important for customers to
know that Microsoft does not publish every fix included in a cumulative update. Instead, only fixes as a result
of higher profile customer cases are usually publicized. In many situations, Microsoft Support has insider
knowledge (as you would expect) as to which fixes are included in an update. However, at times even they
don’t know about every fix or enhancement that development put into the code. Being at the latest update
will ensure you have all the latest fixes and enhancements the Product Team has to offer, which could very
well resolve your issue.
Knowing how the support experience will unfold after you contact Microsoft support will not only set your
expectations but may ultimately dictate your organization’s update cadence. Although Microsoft won’t give
you the cold shoulder for being a few updates behind, the change management policies operated by some
companies prevent them from deploying an update in a reasonable amount of time just to resolve a support
incident; so being as close to N as possible will be ideal.
Speaking of factors controlling update cadence, you have to consider third party applications. Not all
applications will support every Cumulative Update, so before moving Exchange forward, verify supportability
with your vendors. This applies to Transport Agents, backup software, anti-malware, stubbing/archiving
solutions, and so on. Another caution against staying too far out of date is that the updates provided for third
party software may not necessarily be tested against older updates of Exchange. So updates to a vendor
package could introduce instability in an outdated environment. Even with dated Microsoft software, there’s
certainly no guarantee they’re testing new code against old. For example, on more than one occasion an
Exchange update has broken XP clients (also this example) because Microsoft didn’t perform the same level of
testing against a legacy product as they do against currently supported code. Of course we should be
reasonable in what we expect of Microsoft as they can’t test against every single use case for products, some
of which have never been heard of by the Microsoft testers.
When something goes wrong there are basically three paths forward: things get better, things get worse, or they
remain in their current state. We want to improve our troubleshooting skills so that the path to improvement is
quicker, and if we don’t succeed, we’d at least like to make sure we don’t make things worse. To draw another parallel with
doctors, the phrase “first, do no harm” gives us words many of us should live by.
In college, some friends and I joked about shoot-from-the-hip troubleshooting and came up with the phrase
“troubleblasting.” As in, “Andrew doesn’t troubleshoot, he troubleblasts!” While it’s a pretty campy/lame joke, I
use it all the time with coworkers and trainees because it conveys an important message. Don’t be a
troubleblaster by throwing every fix you can think of at a problem, not documenting changes, or failing to
note what worked and what didn’t. Don’t dive down a deep-dark rabbit hole where you’re desperate because
nothing is working, your boss is yelling at you and before long you’re just throwing any half-baked solution
against the server just to see what sticks. Don’t take suggestions and proposed solutions from the dark corners of
the internet, thinking that just because someone has a website, what they posted must be supported or
correct. After all, anyone can start up a free blog on various platforms and start to share their experience with
the world. No one tests bloggers for their intelligence, so don’t believe everything you see on the internet. At
the very least, take a backup before trying a solution you found in a blog, even one from a Microsoft or other
trusted source. Otherwise, you run the risk of doing something foolish like deleting your transaction logs
because you’ve run out of disk space, enabling circular logging, and then running an ESEUTIL /P against your
database (more on this scenario in the Backup and Disaster Recovery chapter). All because the internet said so
and it was the first suggestion that came up for a search on Bing/Google.
The problem with this swashbuckling approach to tackling an issue is that it usually results in two things: fixes
that aren’t really fixes but more like Band-Aids or making so many changes in a short timeframe that you’re
unsure what actually resolved the issue. Often, things get enabled/disabled/installed/uninstalled that actually
had nothing to do with the problem with the result that instability or some other weakness is introduced into
the environment. A famous example of this is the Windows Scalable Networking Pack.
Because the pack was a common culprit for network/performance issues for a few years (mostly due to poor
code from Microsoft and NIC vendors) it became common practice to blindly disable these features as a first
step. Unfortunately, these steps were often taken in ignorance of the actual issue and because it’s what was
always done. This behavior led to many performance issues in Windows because the disabled features weren’t
able to fulfill the purpose they were designed to. As a Microsoft Escalation Engineer once asked me, “All these
customers that disable ToE, RSS, Offloading, etc. How many of them notice it didn’t resolve their issue so they re-
enabled it afterwards?” This is a perfect example of leaving a system in worse shape than when you started. For
those interested, Exchange really should have Receive Side Scaling enabled if you want the best performance.
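If you want to check, here is a quick sketch for Windows Server 2012 or later (on older builds only the netsh command is available):

# Global TCP setting: look for "Receive-Side Scaling State : enabled"
netsh int tcp show global

# Per-adapter RSS status
Get-NetAdapterRss | Format-Table Name, Enabled -AutoSize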
Take your time when troubleshooting an issue, document each change you make and note whether it had an
effect. In addition, ensure you allow sufficient time for the change to take effect (AD replication, service restart,
application pool recycle, etc.). When you’ve finished, you can work backwards to remove any changes that you
felt did not resolve the issue. In that way you’re not introducing unnecessary changes to the environment
(remember: change = risk).
Lastly, here is probably the most important tip for troubleshooting or debugging any issue: don’t let your
assumptions immediately become your conclusions. Or to put it another way, don’t make your mind up before
actually looking at the facts. People have a way of contorting the facts to suit their narrative when they
already hold strong convictions. Having worked for a hardware vendor for most of my career, I’ve often seen the customer
who is convinced the performance issue is with hardware. Of course, we deal with hardware every day and
understand that components break, so we’re certainly not opposed to the idea of it being the root cause, but
we need to see some proof first. In many instances, a customer will demand replacement hardware in the
certain knowledge that it has to be the cause and will reject any input from us or Microsoft with hostility
because in this customer’s mind it can be only one thing: hardware. Sometimes a broken component certainly
is a problem, but often it’s a configuration issue, or a sizing issue, or a software bug. The point is that an
effective troubleshooter will let the facts lead them to the answer, not emotion, not anecdotal evidence, and
certainly not preexisting prejudice (“X product was the culprit last time, so it must be this time as well”). You
can certainly have a theory, but it can’t be baseless. The biggest problem with anecdotal evidence is that it’s
sometimes right, and it leads people to think that their method is sound. You can’t win at the craps tables
every night, so go with the safe bet: actual evidence.
Additional reading
Testing
Microsoft Anti-Virus Exclusion List
Exchange Setup Logs - Best Practices
How does Exchange 2007 setup know to resume a failed setup?
Exploring Exchange Server Component States
An Active Directory database is organized into Partitions (sometimes referred to as Naming Contexts), each
replicating between Domain Controllers and containing various kinds of data. The partitions are the Schema
partition, Domain Partition, Configuration Partition, and Application Partition. Let’s consider what Exchange
data is contained in each partition.
The Active Directory schema contains the definitions of object classes and attributes used within the Forest.
Out of the box (before Exchange has been installed), Active Directory has absolutely no idea what an Exchange
Server, Mailbox Database, Database Availability Group (DAG), or Mailbox is because the schema does not
define these objects. When the Exchange installer performs a schema update (Setup /PrepareSchema), it adds
the definitions of the various objects used by Exchange so that the objects become known to the directory
and can be manipulated by the management tools. When it is said that Exchange requires a schema update, it
means that the Active Directory schema must be updated to allow existing objects to take on new
functionality or to accommodate completely new objects.
Because the schema is essentially just a big repository of definitions, it’s extremely rare to experience a failure
during the update process. From a troubleshooting perspective, I’ve never found myself spending much time
looking at the schema, except perhaps to verify the current schema version after an update (see this post for
more information). The most common issues with the schema are permissions or reachability issues
encountered while attempting an update. These problems are usually resolved by ensuring that the account
running the update is a member of the Schema Admins and Enterprise Admins groups, and that the server
where setup runs can reach the Schema Master domain controller.
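For the schema version check mentioned above, a minimal sketch is to read the rangeUpper attribute of the ms-Exch-Schema-Version-Pt object and compare it against the value documented for the update you installed:

# Read the Exchange schema version from the schema partition
$schemaNC = ([ADSI]'LDAP://RootDSE').schemaNamingContext
([ADSI]"LDAP://CN=ms-Exch-Schema-Version-Pt,$schemaNC").rangeUpper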
The Domain Partition contains all objects created in a single domain and replicates its data only within that
domain. This is where you’ll find User Objects, Distribution Groups, Computer Objects, and Organizational
Units. When an Exchange client (Outlook, OWA, etc.) performs a query against Address Lists, this is the part of
Active Directory queried by the client. When an additional SMTP address is added to a Recipient, it’s stamped
onto the user object in this partition (using the ProxyAddresses attribute). This Exchange attribute, along with
many others, is populated on the Active Directory objects when they are either mail-enabled or mailbox-
enabled. While these attributes are present on all user objects, they are not populated until the user objects
are used with Exchange in some way.
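As an illustration (jsmith is a hypothetical user; the first command needs the ActiveDirectory module, the second an Exchange Management Shell), the stamped addresses can be inspected directly on the user object or through Exchange:

# Raw proxyAddresses attribute on the Active Directory user object
Import-Module ActiveDirectory
Get-ADUser jsmith -Properties proxyAddresses | Select-Object -ExpandProperty proxyAddresses

# The same data as Exchange presents it
Get-Mailbox jsmith | Select-Object -ExpandProperty EmailAddresses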
A Global Catalog server is a Domain Controller that contains a subset of all domain objects in the Forest.
Because the Global Catalog holds this data, Exchange can perform queries and make changes to objects
regardless of which domain the object belongs to. If a writable Global Catalog server is not available, Exchange
queries will fail and services may not start (more on that shortly). Often in a multi-site environment, when a
change to an Exchange Recipient is made, but not picked up by another Active Directory Site, it’s this partition
that requires updating (I typically use repadmin /syncall /Aed in a small lab environment. In a larger
environment the /Aed switch may saturate WAN bandwidth). Updates to other Global Catalog servers occur
via Active Directory Replication, which is a multi-master replication process designed to propagate changes to
Domain Controllers. When Domain Controllers exist in separate Active Directory Sites, this replication normally
occurs according to a scheduled interval, unless an administrator forces replication using the repadmin
command. However, enabling Change Notification on your Active Directory Site Links can speed up this
process.
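A quick sketch of checking and forcing that replication (DC01 is a placeholder Domain Controller name):

# Summarize replication health for every Domain Controller
repadmin /replsummary

# Push changes from DC01 to its partners across all partitions and sites (use with care over WAN links)
repadmin /syncall DC01 /Aed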
The Configuration Partition contains information on the physical structure, services, and configuration of the
Forest. This is where the objects that constitute the Exchange organization are held, such as Exchange Server
objects, Connectors, Virtual Directories, Mailbox Databases, Public Folders, Database Availability Groups,
Active Directory Sites, and various policies. When you make a change to the Exchange configuration, the
command you execute manipulates the attributes for objects in this container. Often in a multi-site
environment, when a change to Exchange Configuration is made but not yet detected in another Active
Directory Site, it’s this partition that requires updating (I typically use repadmin /syncall /Aed).
The last partition type is an Application Partition. When talking of Exchange, the most important type of
Application Partition is the DNS Application Partition, which is created when you use Active Directory-
Integrated DNS Zones. An unhealthy DNS infrastructure means an unhealthy Active Directory environment
and usually doesn’t bode well for Exchange, especially when the time comes to route messages to other
servers or external domains.
DNS issues are broken up into two types: client-side and server-side. Examples of client-side issues include
configuring an incorrect DNS server address for the NIC of an Exchange server, a mistake that will lead to
failed DNS queries. I often see smaller customers make this mistake by adding public DNS servers on a NIC for
redundancy without realizing that this can cause queuing and service communication issues. Server-side
issues, for example, include the DNS server not answering queries or not having the proper SRV records
Exchange needs to be able to contact a Global Catalog server. Whether it’s answering external mail flow
queries or providing needed service records for Active Directory communication, a properly functioning DNS
infrastructure is critical for a healthy Active Directory and Exchange environment.
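A minimal sketch of verifying the Active Directory SRV records from an Exchange server (contoso.com stands in for your own domain; Resolve-DnsName requires Windows Server 2012 or later, while nslookup works everywhere):

# SRV records used to locate Domain Controllers and Global Catalog servers
Resolve-DnsName -Type SRV _ldap._tcp.dc._msdcs.contoso.com
Resolve-DnsName -Type SRV _gc._tcp.contoso.com

# Equivalent check with nslookup on older operating systems
nslookup -type=SRV _gc._tcp.contoso.com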
Having broadly covered what Exchange data is stored in Active Directory and how it interacts with it, let’s now
cover various failure points and the tools/techniques used to troubleshoot them.
After the initial gathering of SRV data, every 15 minutes, MSExchange ADAccess Event ID 2080 is logged
(Figure 2-2). This event provides important details about the Active Directory servers available to Exchange.
You can find an explanation of the data reported in the event in Microsoft Knowledge Base article
316300. Much of the content below is taken from that Knowledge Base article. More information about the
DSAccess component functionality is available on this page.
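To pull the most recent of these events without opening Event Viewer, a sketch along the following lines should work (assuming the provider name MSExchange ADAccess as it appears in the Application log):

# Most recent topology discovery report (Event 2080)
Get-WinEvent -FilterHashtable @{ LogName='Application'; ProviderName='MSExchange ADAccess'; Id=2080 } -MaxEvents 1 |
    Select-Object -ExpandProperty Message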
In Figure 2-2, we see that two Active Directory servers have been detected in the site. Taking the first (ASH-
EX1.ASH.NET) as an example, we can determine:
Server name: Indicates the name of the domain controller that the rest of the data in the row corresponds to.
(ASH-EX1.ASH.NET)
Roles: The second column shows whether or not the particular server can be used as a configuration domain
controller (column value C), a domain controller (column value D), or a global catalog server (column value G)
for this particular Exchange server. A letter in this column means that the server can be used for the
designated function, and a hyphen (-) means that the server cannot be used for that function. CDG means that
the server can be used for all roles.
Enabled: Either 1 for yes or 0 for no. In this example, we see 1, so we know that the server is enabled.
Reachability: The fourth column shows whether the server is reachable by a Transmission Control Protocol
(TCP) connection. These bit flags are connected by an OR value. 0x1 means the server is reachable as a global
catalog server (port 3268), 0x2 means the server is reachable as a domain controller (port 389), and 0x4 means
the server is reachable as a configuration domain controller (port 389). In other words, if a server is reachable
as a global catalog server and as a domain controller but not as a configuration domain controller, the value is
3. In the example shown above, the value 7 in the fourth column means that the server is reachable as a global
catalog server, as a domain controller, and as a configuration domain controller (0x1 | 0x2 | 0x4 = 0x7).
Synchronized: The fifth column shows whether the "isSynchronized" flag on the rootDSE of the domain
controller is set to TRUE. These values use the same bit flags connected by an OR value as the flags that are
used in the Reachability column. The ideal situation is that the values shown in the fourth and fifth columns
match.
GC capable: The sixth column is a Boolean expression that states whether the domain controller is a global
catalog server. The value (1) indicates that the server is a global catalog.
PDC: The seventh column is a Boolean expression that states whether the domain controller is a primary
domain controller for its domain. The value (0) indicates that this is not true.
Critical Data: The ninth column is a Boolean expression that states whether DSAccess found this Exchange
server in the configuration container of the domain controller listed in Server name column. The value (1)
means that the server was found.
Netlogon Check: The tenth column states whether DSAccess successfully connected to a domain controller’s
Net Logon service. This requires the use of Remote Procedure Call (RPC), and this call may fail for reasons
other than a server that is down. For example, firewalls may block this call. So, if there is a 7 in the tenth
column (as is the case here), it means that the Net Logon service check was successful for each role (domain
controller, configuration domain controller, and global catalog).
OS Version: The eleventh column states whether the operating system of the listed domain controller is
running the minimum supported Operating System. Exchange 2007 only uses domain controllers or global
catalog servers that are running Windows 2003 SP1 or later. Exchange 2010 and 2013 support Windows 2003
SP2 or later, while Exchange 2016 supports Windows 2008 SP2 or later. A Boolean expression of 1 means the
domain controller satisfied the operating system requirements for use by DSAccess.
The value of this data lies in knowing whether Exchange can properly use the Domain Controllers available in
your environment and can therefore communicate with Active Directory. Symptoms that would cause
you to look at this event include:
If you discover Domain Controllers that Exchange cannot communicate with or that have the improper
settings, chances are that the problem lies with a firewall setting, lack of exclusion from Anti-Virus scanning,
some setting overridden by Group Policy, a Deny security entry in Access Control Lists, or other permissions
issues with the server.
As an example, restrictive Group Policies have caused Exchange to be unable to communicate with Domain
Controllers as far back as Exchange 2000 (also see KB896703). In the reported scenarios, customers had improperly
configured Group Policies which blocked the Manage Auditing and Security Log right for the Exchange
Servers group. This is just one example of a permission required by Exchange that is configured during setup
when Setup /PrepareAD is run, but which errant Group Policy settings can overwrite.
The last point is key because many permissions and Access Control Lists are created in the domain that need
to propagate to every Active Directory object used by Exchange. If problems occur (for example, if permissions
inheritance is blocked to an object) then Exchange functionality may be impaired.
Instead of installing Exchange itself, you have the option to prepare Active Directory beforehand by using
command-line setup. The schema can be prepared by using Setup /PrepareSchema and Active Directory can
be prepared by running Setup /PrepareActiveDirectory (instructions for each can be found in the links provided
above).
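For reference, the unattended preparation commands look roughly like the following sketch; the /IAcceptExchangeServerLicenseTerms switch applies to Exchange 2013/2016 setup, and the commands should be run from an elevated prompt using the newest Exchange setup binaries present in the organization:

# Extend the schema only
.\Setup.exe /PrepareSchema /IAcceptExchangeServerLicenseTerms

# Prepare Active Directory (also extends the schema if that step has not been done)
.\Setup.exe /PrepareAD /IAcceptExchangeServerLicenseTerms

# Prepare any additional domains that will contain mail-enabled objects
.\Setup.exe /PrepareAllDomains /IAcceptExchangeServerLicenseTerms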
Many are aware that the /PrepareActiveDirectory option exists and how it is used in deployment scenarios, but
few are aware of its use for troubleshooting. If you’re experiencing any of the symptoms I described in the
“Using Event Viewer to diagnose Exchange Active Directory Communication Issues” section (probably due to
Active Directory permissions issues) then running Setup /PrepareActiveDirectory or its shortened version Setup
/PrepareAD, can be a useful troubleshooting step or even provide the resolution itself.
Because customers associate Setup /PrepareAD with deployment, changes, outage windows, or any number of
other scary words, they might be hesitant about running this command again in production. I’ve never seen or
heard any reports that running Setup /PrepareAD breaks anyone's environment or causes an outage, and I
personally believe that if Setup were to cause an issue, it's a good indication that the environment had serious
faults to begin with. In any case, reconfiguring the needed Access Control Lists and recreating the Universal
Security Groups can often resolve issues caused by restrictive Group Policies or misguided permissions changes.
A common scenario I see occurs when Setup /PrepareAD is run to resolve a suspected Active Directory
permissions issue, is successful, yet the issue returns within minutes or a couple of days. This is almost always a
sign of a Group Policy which is breaking a required Exchange permission after being reapplied when the
machine is rebooted or the policy has refreshed (whichever comes first).
Another common time when running Setup /PrepareAD solves problems is when you need to recreate the
Arbitration mailboxes after these objects have become corrupt or have been deleted for some reason.
It’s also important to note that when running Setup /PrepareAD, you should use the same version of the setup
binaries as the newest version of Exchange that is installed in the organization.
Lastly, it's important to realize that running Setup /PrepareAD to repopulate Active Directory with Exchange
objects might not correct all permissions issues if something blocks the correct permissions from being
applied to objects. For instance, if an object or tree has permission inheritance disabled, the needed Exchange
permissions will not propagate to those objects until this problem is corrected.
Figure 2-4: Forcing replication using Active Directory Sites and Services
My preferred method to force Active Directory replication is the command line Repadmin utility. An Exchange
change applied to Active Directory in one site that is not detected in other sites after 15 minutes or so is the
typical justification to force Active Directory replication. If replication does not happen, it will lead to situations
such as new mailboxes or mailbox databases being invisible to Exchange servers located in other sites,
updates applied to objects such as virtual directories not showing up, or schema changes not being available.
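A few repadmin commands I reach for are sketched below; they assume the AD DS management tools are installed on the machine you run them from:

repadmin /replsummary          # summary of replication health for all Domain Controllers
repadmin /showrepl             # inbound replication partners and last result for this DC
repadmin /syncall /AdeP        # force replication of all partitions, across site links, in push mode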
Figure 2-5: Healthy Domain Controllers with NTDS settings being displayed
Another issue I have often come across is where an administrator decommissions Domain Controllers
incorrectly. Instead of running DCPROMO to demote a Domain Controller before it is removed, they will
simply reformat the operating system. This action leaves remnants of the Domain Controller behind in Active
Directory, usually in the form of DNS SRV records as well as a missing NTDS container in AD Sites and
Services.
At one time, all you could do to correct the problem was to run a Metadata Cleanup using NTDSUTIL (a low-
level AD tool), delete the server object from AD Sites and Services, and manually remove any lingering records
of the server that remain in Active Directory. Starting in Windows Server 2008, however, all you need to do
is delete the server object from AD Sites and Services and then clean out DNS. This method doesn't work in
all cases, though; in those circumstances you will still find yourself running NTDSUTIL.
Another scenario that requires NTDSUTIL is when an improperly decommissioned Domain Controller holds
one of the Flexible Single Master Operation (FSMO) roles. In this case, you also have to seize the FSMO roles
using NTDSUTIL or PowerShell.
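As a sketch of the PowerShell route, the ActiveDirectory module can seize all five roles onto a surviving Domain Controller; the target name below is a placeholder, and seizure should only be used when the original role holder is permanently gone:

# Seize all five FSMO roles onto DC2 (placeholder name); -Force seizes rather than transfers
Import-Module ActiveDirectory
Move-ADDirectoryServerOperationMasterRole -Identity "DC2" `
    -OperationMasterRole SchemaMaster,DomainNamingMaster,PDCEmulator,RIDMaster,InfrastructureMaster -Force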
The common symptoms that show up in Exchange when these issues are present include:
Note: Another common cause of service startup and logon failure is Kerberos time skew. By default,
Active Directory and Kerberos require a time discrepancy of no greater than five minutes between parties
participating in Kerberos authentication. Ensure all devices have a consistent timekeeping source and
are always within five minutes of each other. In some virtualization scenarios, if proper time
synchronization is not performed, Exchange components and Management Tools will encounter
failures.
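A quick way to check for time skew is the built-in w32tm utility; the following sketch uses a placeholder Domain Controller name:

w32tm /monitor                                                      # compare time offsets across the domain's DCs
w32tm /stripchart /computer:dc1.contoso.com /samples:5 /dataonly   # offset between this host and one DC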
Verify required DNS records exist for the Domain Controllers in Active Directory
Verify the Domain Controllers respond to ICMP requests
Verify the Domain Controllers allow LDAP connectivity by binding to the instance on each server
Verify the Domain Controllers are advertising themselves properly (as Global Catalog, as Time Servers,
etc.)
Verify Active Directory replication is healthy
Parse through meaningful Event Logs related to Active Directory health and identify concerns
Verify SYSVOL and NETLOGON shares are present and able to service domain clients
For a full listing of possible tests that DCDIAG can run, see this Microsoft Blog Post.
In its simplest form, you can just run DCDIAG from an elevated command prompt and get useful output.
However, if more detailed output is required, here are just a few parameters that can be used with the command:
If you suspect Active Directory issues are adversely affecting the Exchange environment, DCDIAG is a quick
tool to get a health report of each Domain Controller.
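A few typical invocations are sketched below (DC1 and the output path are placeholders; this is not an exhaustive list of parameters):

dcdiag                                # run the default tests against the local DC
dcdiag /s:DC1 /v                      # verbose output against a specific DC
dcdiag /e /c /f:C:\Temp\dcdiag.txt    # comprehensive tests against every DC, logged to a file
dcdiag /test:DNS /e                   # DNS-focused tests across the enterprise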
repadmin /showrepl — Displays the replication status of the most recent inbound replication neighbors
Note: For additional reading on the topic of Active Directory Replication, especially the differences
between FRS and DFSR replication, I recommend this Microsoft post detailing the pros/cons of each.
When a user logs on interactively or attempts to make a network connection to a computer running Windows,
the logon process authenticates the user’s logon credentials. If authentication is successful, the logon process
returns a Security Identifier (SID) for the user and a list of SIDs for the security groups to which the user
belongs. The Local Security Authority (LSA) on the computer uses this information to create an access token —
in this case, the primary access token — that includes the SIDs returned by the logon process as well as a list
of privileges assigned by local security policy to the user and to the security groups to which the user belongs.
After LSA creates the primary access token, a copy of the access token is attached to every process and thread
that executes on the user’s behalf. Whenever a thread or process interacts with a securable object or tries to
perform a system task that requires privileges, the operating system checks the access token associated with
the thread to determine the level of authorization for the thread. This includes Exchange client and server
processes which authenticate the user account against a resource.
Token Bloat happens when a user’s access token becomes so large that either certain data gets excluded from
the token or certain applications cannot handle authentication using the token provided. One Microsoft
employee described the situation as follows, “Picture a suitcase filled to overflowing. You managed to close it
but some stuff had to get left out.” Token Bloat is most often caused by one of the following:
Users are migrated from one Active Directory domain to another. The Security Identifier history
(SIDHistory) is retained from the previous domain to preserve seamless access to resources for the
user.
Users are added to many security groups. The issue is made exponentially worse when those groups
are nested into other group memberships.
Among the tools available to combat this issue are scripts to detect user token size, formulas to estimate
token size, and improvements in the latest operating systems that handle large tokens better. However, it's
also worth understanding how token bloat can adversely affect Exchange environments.
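As a rough sketch, the estimation formula commonly cited in Microsoft's guidance (TokenSize ≈ 1200 + 40d + 8s, where d counts domain-local groups plus SIDHistory entries and s counts global and universal groups) can be approximated with the ActiveDirectory module. The user name below is a placeholder and the calculation ignores some edge cases:

# Rough token-size estimate for one user (jsmith is a placeholder); requires the ActiveDirectory module
Import-Module ActiveDirectory
$user   = Get-ADUser -Identity jsmith -Properties SIDHistory
$dn     = $user.DistinguishedName
# Transitive (nested) group membership via the LDAP_MATCHING_RULE_IN_CHAIN matching rule
$groups = Get-ADGroup -LDAPFilter "(member:1.2.840.113556.1.4.1941:=$dn)" -Properties GroupScope
$d = @($groups | Where-Object { $_.GroupScope -eq 'DomainLocal' }).Count + @($user.SIDHistory).Count
$s = @($groups | Where-Object { $_.GroupScope -ne 'DomainLocal' }).Count
"Estimated token size: $(1200 + (40 * $d) + (8 * $s)) bytes"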
A common issue seen with proxying is during migrations to Exchange 2013. The same issue can be expected
with Exchange 2016 migrations as it involves the same proxy action. The article “Failures when proxying HTTP
requests from Exchange 2013 to a previous Exchange version” describes a scenario where HTTP traffic must be
sent from the newer Exchange version (2013 in this case) to a legacy version. As previously mentioned, access
tokens can be used by application processes for authentication purposes and should the tokens be too large,
failures can occur.
For one large organization I worked with, after moving their Exchange Outlook Web App (OWA) and Outlook
Anywhere namespace from Exchange 2010 to Exchange 2013, some users experienced failures. Upon further
investigation we determined that only users with large tokens were experiencing the issue. Useful Exchange
logs to help diagnose this issue can be found in the <Exchange Server Install Path>\Logging\HttpProxy\<Http
resource> logs on the Exchange Server 2013 Client Access server. In this case, the resolution was to either
reduce the number of groups the affected users were members of (and therefore reducing their token size) or
to make the changes suggested in Microsoft Knowledge Base article KB298844 on the legacy Exchange
servers. The relevant extract is as follows:
Increase the MaxFieldLength and MaxRequestBytes entries to the following values. This change requires a restart
of the Client Access servers. The recommended value for Exchange 2010 coexistence is 65536.
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\HTTP\Parameters
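A sketch of that change in PowerShell, applied to the legacy Client Access servers as the article describes, might look like this (a restart is still required afterwards):

# Set MaxFieldLength and MaxRequestBytes to 65536 (decimal) under the HTTP parameters key
$path = 'HKLM:\System\CurrentControlSet\Services\HTTP\Parameters'
Set-ItemProperty -Path $path -Name MaxFieldLength  -Value 65536 -Type DWord
Set-ItemProperty -Path $path -Name MaxRequestBytes -Value 65536 -Type DWord
# Restart the server (or at minimum the HTTP service stack) for the change to take effect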
Because Exchange 2016 will also proxy connections to previous versions of Exchange, and since this issue is
caused by Active Directory and not Exchange, I expect to continue to see this issue in the future. Since the
proxying of HTTP connections is the key factor here, the problem affects Exchange connections related to
Outlook Anywhere, Outlook Web App (OWA), Offline Address Book (OAB), Exchange Web Services (EWS), and
ActiveSync.
This is important because when a user connects to Autodiscover using an Outlook or ActiveSync client, they
must provide their Primary SMTP address for Autodiscover to be successful. If this fails, they might be
presented the option to use their Domain\Username format (also known as the pre-Windows 2000 User
Logon Name) but this is not an ideal user experience. In this scenario, because our primary SMTP address is
not a valid UPN, we will not have a seamless logon experience.
This scenario is often encountered, especially inside organizations that deployed Active Directory in the 1999-
2004 period, when less attention was paid to the benefits of having matching UPNs and primary email
addresses. Although the solution is simple, it can also involve some tedious administration effort. The solution
is to first define an additional UPN suffix in Active Directory Domains and Trusts (Figure 2-7).
Figure 2-7: Adding a UPN Suffix to a domain in Active Directory Domains and Trusts
After defining the additional UPN, the Explicit UPN for each user account can be changed via a drop-down
menu as shown in Figure 2-8. In this case, the user’s UPN becomes [email protected] and so matches
the primary SMTP address to resolve the Autodiscover authentication issue.
Figure 2-8: Defining the Explicit UPN for a user account in Active Directory Users and Computers
Of course, this action can be scripted to bulk-process a set of user accounts. Several useful scripts exist
(Reference 1 Reference 2 Reference 3) that can serve as a basis for this procedure. A common favorite among
Active Directory and Exchange professionals is ADModify which offers a GUI and an easy “undo” feature. Note
that making sure that the UPNs match primary SMTP addresses is vital for Office 365 migrations. It is also a
best practice for on-premises Exchange. If I hear reports of authentication pop-ups in Outlook or ActiveSync
clients, my first step is to ensure the user account’s primary SMTP address aligns with either their Explicit or
Implicit UPN as this ensures both a consistent and seamless login experience.
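As a rough sketch of such a bulk update, the Exchange Management Shell can supply the primary SMTP addresses while the ActiveDirectory module writes the UPNs; leave -WhatIf in place until you are happy with the output:

# Align each mailbox user's UPN with their primary SMTP address (sketch; test with -WhatIf first)
Import-Module ActiveDirectory
Get-Mailbox -RecipientTypeDetails UserMailbox -ResultSize Unlimited | ForEach-Object {
    $smtp = $_.PrimarySmtpAddress.ToString()
    if ($_.UserPrincipalName -ne $smtp) {
        Set-ADUser -Identity $_.SamAccountName -UserPrincipalName $smtp -WhatIf
    }
}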
Some of the terms used above may be unfamiliar to most, so let’s define them. A semaphore in programming
logic is an apparatus for controlling access by multiple processes to a common resource in a concurrent
system. Think of this like a microphone in a panel Q&A session: if only two microphones are distributed for
Q&A, then only two individuals can speak at once. On Windows Domain Controllers (from Windows Server
2003 through Windows Server 2008) the default number of semaphores is 2, which means a server can handle
no more than 2 concurrent authentication requests. Each request has a timeout value of 45
seconds, so a server can quickly become overloaded should requests become hung. This behavior is controlled
by the MaxConcurrentAPI setting.
Armed with this information, we begin to understand how to interpret the above counters. MaxConcurrentAPI
is a setting that applies to both the client and server sides of a NetLogon authentication session, which means
that the bottleneck could exist either on the client-side (Exchange Server) or server-side (Domain Controller) of
the connection. The symptom to the end user may be repeated authentication prompts, but the root cause is
probably that not enough NetLogon authentication channels exist to handle the authentication traffic
between the Exchange Server and its reachable Domain Controllers. In most scenarios, it is sensible to increase
the MaxConcurrentAPI setting on both the Exchange Server and Domain Controller operating system. This is
done by adding (or adjusting) the following registry changes on both systems.
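The value in question is MaxConcurrentApi under the Netlogon parameters key; a sketch of setting it to 10 on either system might look like this:

# Raise MaxConcurrentApi to 10 on both the Exchange server and the Domain Controller
$path = 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters'
New-ItemProperty -Path $path -Name MaxConcurrentApi -Value 10 -PropertyType DWord -Force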
After making the change, at a command prompt, run net stop NetLogon, and then run net start NetLogon.
This Microsoft blog provides some information to help you determine what value is appropriate for your
environment. For more detail on how to troubleshoot and diagnose issues related to MaxConcurrentAPI, see
this “Ask Premier Field Engineering” blog post. As described in the post, here are the most common symptoms
of MaxConcurrentAPI issues:
Users may be prompted for authentication even though correct credentials are used.
Slow authentication (may be intermittent or consistent); this may mean authentication slows through
the day, or is slow from the hours of 8AM-9:30AM but fine the rest of the time, or any number of
scenarios here.
Authentication may be sporadic (10 users sitting next to each other may work fine but 3 other users
sitting in the same area may not be able to authenticate; or vice versa)
Microsoft and/or 3rd party applications fail to authenticate users
Authentication fails for all users
Restarting the NetLogon service on the application server or domain controllers may temporarily
resolve the issue
Note: this should not be done as a workaround as you are merely pushing the problem off to another
machine!
Any authentication handled by NetLogon (or Kerberos PAC validation) may experience the same or
similar behavior
Synchronization is not being performed
Exchange components and Management Tools will encounter failures
Fortunately, this problem should be less common as customers deploy Exchange 2013/2016 and Domain
Controllers onto Windows Server 2012 or newer operating systems. This is because the default value of
MaxConcurrentAPI is 10 instead of 2 on Server 2012. The five-fold increase in authentication channels should
alleviate these issues in most environments.
Note: The most common scenario for performance issues in modern operating systems is if several
Exchange Servers in a load balanced array were to fail and the remaining servers had to bear the load
of all authentication requests.
However, each environment is different and requires tuning according to the observed workload. The risk of
setting these values extremely high (150 for example) is unnecessarily high resource utilization on the systems,
so the change should be handled with care.
As one last step to aid in diagnosing NetLogon issues, you can enable NetLogon debug logging. Personally, I
find the performance counters listed in Table 2-1 are sufficient to tell me when there’s an Active Directory
bottleneck but these logs may prove useful in confirming a diagnosis. Another alternative for Exchange servers
is to enable Kerberos authentication for client connections, which scales much better for larger or more AD
authentication intensive environments.
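For completeness, a sketch of toggling NetLogon debug logging with the built-in nltest utility (the log is written to %windir%\debug\netlogon.log):

nltest /dbflag:0x2080ffff    # enable verbose Netlogon debug logging
# ...reproduce the issue and review the log, then turn logging back off
nltest /dbflag:0x0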
Having foundational knowledge of troubleshooting Active Directory will aid us in maintaining a
stable and resilient foundation for Exchange. We now move on to Client Access Services and the means
by which users access Exchange once they're authenticated by Active Directory.
Additional reading
Overview
Verify Exchange Server Schema Version
FSMO Roles
Preparing Active Directory and Schema for Exchange 2013 Release Preview
Sites overview
How Active Directory Replication Topology Works
Enabling Change Notification on your Active Directory Site Links
Understanding Urgent Replication
Are you storing your AD-Integrated DNS Zones in the DNS Application Partitions
Bad NIC Settings Cause Internal Messages to Queue with 451 4.4.0 DNS query failed (nonexistent
domain)
The role of client access services in Exchange has changed over the years. By “client access services” I’m
referring to the server-side components which enable users to connect to Exchange via various clients,
including Outlook, Outlook for Mac, Outlook Web App (recently changed to “Outlook on the Web” but
commonly known as OWA), and Exchange ActiveSync mobile clients. Based on hardware and operating
system limitations over time, the Exchange product team has made several architectural changes to the
product to influence how the client access components should be deployed and supported for each release of
Exchange. To begin, let’s discuss how client access has changed in Exchange Server to aid us in understanding
how to properly troubleshoot the current version.
This need (as well as security concerns) drove the advent of the Front-End/Back-End topology configuration
for Exchange 2000 and Exchange 2003 environments. The configuration enabled customers to dedicate
Exchange servers for the purpose of handling inbound and outbound client connections (as well as SMTP
traffic), so they could scale as needed by adding additional Front-End servers. This ability to split functionality
and scale independently was critical at this time because, as x86 servers, Exchange 2000/2003 servers were
limited to 4 GB of RAM and CPU processing power was limited. The platform limitations mandated a scale-out
approach (in contrast to scale-up by adding more memory/CPU) and the Front-End/Back-End architecture
allowed Exchange Administrators to dedicate hardware solely for client access (and SMTP) traffic, freeing up
Back-End server resources for mailbox workloads. Common client access services issues encountered in
Exchange 2000/2003 were:
At the time when Exchange 2007 was released, the ability and desire to scale-out at the client access layer was
fresh in people’s minds, so many administrators were extremely pleased at the ability to install only the client
access services on a dedicated server.
Improperly sized Client Access Servers can provoke many symptoms, including
Periods of high user activity, such as start of work day or peak season, would result in connection
drops and pop-ups in Outlook clients.
ActiveSync devices are unable to send/receive messages
Client Access Server system hangs or lockups due to lack of RAM/CPU
Unsupported virtualization settings used with the Exchange Client Access Server role, for example
Dynamic Memory, no memory reservation, or a vCPU-to-physical-core ratio greater than 2:1. These
scenarios all lead to serious performance issues.
Note: Many of these issues can still occur in Exchange 2013/2016 environments
Along with the option for dedicated server role installations, Exchange 2007 introduced a requirement for
Exchange Administrators to become well-versed in certificates/PKI (Public Key Infrastructure). Along with the
introduction of Autodiscover, a web-based framework for automatic client configuration, came a requirement
for HTTPS-secured Exchange client connections to enable successful Outlook and ActiveSync profile creation.
In addition, Outlook Public Folder-dependent features such as accessing the Offline Address Book (OAB),
gathering Free/Busy information, and modifying Out of Office settings were moved to web-based services that
also required an HTTPS connection. Although the OAB is now downloaded from its own URL
(HTTPS://Mail.Contoso.com/OAB), the Availability Service (Free/Busy) and Out of Office settings are controlled from
Outlook by accessing Exchange Web Services (EWS) via its own URL (HTTPS://Mail.Contoso.com/EWS). This
EWS URL is also used when Outlook for Mac connects to Exchange as this client uses EWS instead of MAPI-
based communication used by the Windows version of Outlook. It's important to know that EWS relies entirely
on the Autodiscover service for configuration. Even in an Exchange 2007/2010 environment where Outlook
users connect to Exchange using MAPI/RPC (not using Outlook Anywhere), access to these web services
requires a trusted connection to Exchange using HTTPS. Of course, this means that a certificate that the client
trusts must be installed and enabled in Exchange (we’ll discuss what’s required for a client to trust a certificate
later in this chapter).
If I were asked to name the most common Exchange support calls seen in relation to Exchange 2007/2010/2013
client access services, certificates would come high up on the list, perhaps even at #1. Maybe this is due to a
lack of understanding of PKI in the industry, or perhaps due to a lack of certificate management features in the
Exchange product. Whatever the reason, the result has been many avoidable Exchange outages. Examples of
certificate-related issues in Exchange 2007/2010/2013 include:
While the certificate-related tools in the Exchange Admin Center have streamlined certificate management in
Exchange 2013/2016, many of the issues listed above can still occur when the fundamentals of certificate
issuance and usage are not properly understood. It is for this reason we’ll begin this chapter discussing PKI,
certificates, and their relation to Exchange.
When teaching, I often use the example of a government-issued driver’s license as an analogy to certificate
trust. If I enter a government building and am asked to provide identification, what would the requesting party
be looking for in that identification? By presenting my state-issued Texas driver’s license, the requestor has
decided to trust the state of Texas. If they do not trust the state of Texas, the ID would be of no use to them
for validation purposes. To them, a note from my Mother stating my credibility would be just as unreliable, as
it’s not a trusted and unbiased source. The next thing they would look for is whether my license was expired,
as an expired license is not considered valid. Lastly, the name I’m presenting to them (my name) must match
the name listed on the license. Similarly, the picture on the license must be an accurate representation of my
appearance to them at that time. So when describing what is important in regards to certificates, the three
golden rules of trust (as I call them) are the same as when describing a valid form of identification. Do I trust
the issuer? Is it expired? Is the name I’m using listed on the identification?
Naming
Knowing the requirements for trust will help us to understand which names we need to put onto our Exchange
certificate when requesting it. Namespace planning in Exchange 2010, Exchange 2013, and Exchange 2016 is
extremely important to ensure successful traffic flow as well as a good end-user experience in an Exchange
environment. Part of this is determined when you decide which names you want to include in your certificates.
You can technically have a functional solution with only one name on your certificate in a simplistic
environment with limited requirements (which also seem to be the environments where less experienced
customers are unsure of their options). This is usually the case when a customer does not wish to pay the extra
cost for a multi-named (UCC) certificate. For instance:
Split DNS Enabled=Yes (Mail.Contoso.com resolves to the Exchange server both internally and externally)
AutodiscoverServiceInternalUri=HTTPS://Mail.Contoso.com/AutoDiscover/AutoDiscover.xml
In this example, internal Outlook connectivity and AutoConfiguration will function, but external Outlook and
ActiveSync AutoConfiguration will fail because there is no external DNS record for Autodiscover that matches
a name on the installed trusted certificate. Similarly, non-domain-joined Outlook clients connecting internally
will also fail, because there is no matching internal DNS record for Autodiscover either. In short, because you
don't have AutoDiscover.Contoso.com listed on your certificate, the process will not be seamless: you will
either be greeted with certificate warnings or the connection simply will not work (the client AutoDiscover
process is covered further in the Clients chapter).
Technically speaking, you can get external and non-domain joined Outlook clients to work if you create an
AutoDiscover SRV record in DNS for the AutoDiscover service but there’s no workaround for ActiveSync
clients. Mobile users would be required to manually input the ActiveSync server when creating ActiveSync
profiles on their devices. Also, depending on how your device handles certificates, you may or may not be able
to connect at all.
Split DNS Enabled=Yes (AutoDiscover.Contoso.com resolves to the Exchange server both internally and
externally)
Split DNS Enabled=Yes (AutoDiscover.Contoso.com and Mail.Contoso.com resolve to the Exchange server both
internally and externally)
Either
AutodiscoverServiceInternalUri=HTTPS://AutoDiscover.Contoso.com/AutoDiscover/AutoDiscover.xml
Or
AutodiscoverServiceInternalUri=HTTPS://Mail.Contoso.com/AutoDiscover/AutoDiscover.xml
This is the most commonly used configuration. It allows for full AutoConfiguration of clients via the
AutoDiscover service. All necessary records are present internally and externally and the names are on a
trusted certificate.
Split DNS Enabled=Yes (AutoDiscover.Contoso.com and Mail.Contoso.com resolve to the Exchange server both
internally and externally)
Either
AutodiscoverServiceInternalUri=HTTPS://AutoDiscover.Contoso.com/AutoDiscover/AutoDiscover.xml
Or
AutodiscoverServiceInternalUri=HTTPS://Mail.Contoso.com/AutoDiscover/AutoDiscover.xml
This method accomplishes all the same goals as Example C, but adds the flexibility (and added cost of the
wildcard certificate) of publishing any name prefix you wish to access Exchange resources. Email.Contoso.com,
Europe.Contoso.com, or SMTP.Contoso.com could all be used to access Exchange if they were resolvable.
However, aside from the added cost, some security professionals have security concerns regarding wildcard
certificates. In my personal experience, I’ve never encountered an Exchange environment that was
compromised as a result of using wildcard certificates.
I mention these examples not to tell you how to deploy Exchange (by all means, get a multi-name or wildcard
certificate) but instead to explain that in the end, all that matters is that the names you configure in Exchange
must be resolvable via DNS to Exchange and listed on the certificate. You could literally make your
Outlook Anywhere namespace "randomseriesofcharacters.contoso.com" and, as long as it was on your
certificate, the certificate hadn't expired, and the name resolved to Exchange, it would function.
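When I check this, a quick comparison of the configured names against the installed certificates might look like the following sketch (Exchange 2013/2016 cmdlets; the other virtual directories follow the same pattern as OWA):

# Names Exchange is configured to hand out
Get-ClientAccessServer | Format-List Name,AutoDiscoverServiceInternalUri
Get-OwaVirtualDirectory | Format-List Identity,InternalUrl,ExternalUrl
Get-OutlookAnywhere     | Format-List Identity,InternalHostname,ExternalHostname

# Names (and expiry) on the installed certificates
Get-ExchangeCertificate | Format-List Subject,CertificateDomains,NotAfter,Services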
Private Key
Public-Key Cryptography can be a complicated and confusing topic. It involves math, ciphers, and many
complex calculations happening extremely quickly. While many books have been written on this topic, an
effective Exchange troubleshooter only needs to understand a few key concepts. Let’s discuss them at a high
level.
Certificates have both private and public keys that are mathematically connected to each other by an
algorithm. In practice this means that if a public key is used to convert plaintext to ciphertext, only the private
key can be used to convert the ciphertext back to plaintext. The public key cannot be used to decrypt that
data; otherwise, anyone who intercepted the conversation could decipher it using the easily available public
key. So while the public key can be shared freely, the private key must be kept secret from everyone but the
server where the certificate is installed.
What this all means to Exchange is that an installed certificate is only useful to an Exchange server if it includes
the private key. When you export a certificate with the intent of importing it on another Exchange server
(which is a common scenario as there’s no issue with reusing an Exchange certificate), you must include the
private key. When using the Certificates MMC Snap-in (Figure 3-1) to export a certificate, the option to export
the private key must be selected. Due to the security implications of the private key falling into the wrong
hands, you are required to provide a password when exporting the private key.
Follow these steps to open the Certificates MMC Snap-in in order to export a certificate:
Open PowerShell
Type “MMC” <Enter>
File>Add/Remove Snap-in
Select “Certificates” and click Add
Select “Computer Account” and click Next
Click Finish
Click Ok
Certificates for Exchange usage will be in the Personal>Certificates container
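The same store can also be listed from PowerShell; a quick sketch:

# List the Local Computer Personal store, where Exchange certificates live
Get-ChildItem Cert:\LocalMachine\My | Format-List Subject,Thumbprint,NotAfter,HasPrivateKey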
Conversely, when using the Exchange Admin Center to export a certificate (Figure 3-2) there is no choice but
to export the private key and provide a password, as this is the only way the certificate will be useful for an
Exchange server.
Figure 3-2: Exporting a certificate with the private key using the Exchange Admin Center
A common support issue where the private key (or lack thereof) comes into play is when a certificate is
installed without the private key and you’re unable to assign Exchange services to it. Depending on the version
Request an Exchange certificate with the desired names. The Exchange Certificate Wizard will guide
through this process.
Submit the request file to a Certificate Authority
Receive the certificate file from the Certificate Authority
Complete the pending certificate request on the same Exchange Server by associating the certificate
file with the pending request (this action creates the private key)
Assign Exchange services to the newly created certificate
Once this process is complete, a valid Exchange certificate that contains the private key will be installed and
enabled for Exchange server usage.
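A condensed sketch of the same lifecycle in the Exchange Management Shell follows; the names, paths, and thumbprint are placeholders:

# 1. Generate the request on the server that will keep the pending request
$csr = New-ExchangeCertificate -GenerateRequest -FriendlyName "Contoso Mail" `
    -SubjectName "CN=mail.contoso.com" -DomainName mail.contoso.com,autodiscover.contoso.com `
    -PrivateKeyExportable $true
Set-Content -Path C:\Temp\mail_contoso.req -Value $csr

# 2./3. Submit C:\Temp\mail_contoso.req to the CA and save the issued certificate, e.g. C:\Temp\mail_contoso.cer

# 4. Complete the pending request on the SAME server
Import-ExchangeCertificate -FileData ([Byte[]](Get-Content -Path C:\Temp\mail_contoso.cer -Encoding Byte -ReadCount 0))

# 5. Assign services to the new certificate (thumbprint from Get-ExchangeCertificate)
Enable-ExchangeCertificate -Thumbprint <thumbprint> -Services IIS,SMTP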
Types of Certificates
When discussing the origin of a certificate, you can place them in one of three categories:
Self-Signed Certificate
Internal Certificate Authority Certificate
Trusted Third-Party Certificate
A Self-Signed Certificate is a certificate generated by a server for its own use. This means that server is the only
entity that trusts the certificate. A self-signed certificate is typically used when a certificate is required for
encryption but not necessarily authentication. In this situation, the Public/Private Key pairs can be used for SSL
encrypted communications but client systems will not trust the issuer of the certificate. When Exchange is
installed, it creates a self-signed SAN certificate with a Subject Name of <ServerName> and the Subject
Alternative Name of <servername.contoso.com> that is used for encrypted communications between other
Exchange servers and client systems. Although an Outlook or ActiveSync client might not connect because
they do not trust this certificate, other clients such as Internet Explorer may give a warning, but allow the user
to proceed by accepting the warning (Figure 3-3).
It is important to understand that while the self-signed certificate is not trusted by clients and will fail
authentication, the connection is still secured as encryption is still used.
If the self-signed certificate installed by Exchange expires, all you need to do to generate a new self-signed
certificate is to run the New-ExchangeCertificate cmdlet in an EMS session:
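# For example (run on the affected server; with no parameters a new self-signed certificate is created):
New-ExchangeCertificate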
Running the cmdlet with no parameters generates a self-signed certificate with the server’s name as the
SubjectName. This step is typically required when either the default certificate has expired or when someone
has erroneously removed the default certificate as an ill-advised way to “clean-up” the installed certificates on
the server. When this happens, you will see Warnings/Errors in the Application logs from the Transport Service
(Figure 3-4):
Figure 3-4: Common error when a certificate with the server’s name is not present
In this instance, the default Hub Transport receive connector, which is used for Exchange server-to-Exchange
server communication, requires a certificate with the server's name to be present. The default self-signed
certificate satisfies this requirement, but because it has been removed, TLS is not possible for this receive
connector.
Take note of the Internal CA (ASH-ASH-EX1-CA) root certificate listed in the Trusted Root Certification
Authorities container in Figure 3-5.
Finally, a Trusted Third-Party Certificate is issued by a third-party certificate authority which is trusted by most
if not all IT systems. Examples of such third-party Certificate Authorities are DigiCert, GoDaddy, VeriSign,
Entrust and others. Although these certificates are much simpler to deploy (as all clients should trust them
automatically), they can easily cost several hundred dollars per year. The cost is usually determined by
the number of names/domains listed on the certificate as well as whether it is a Wildcard Certificate. Windows
operating systems trust the Root Certification Authorities (CAs) based on the contents of the Trusted Root
Certification Authorities container (Figure 3-5). This container is populated partially upon operating system
install, but is later updated either via Windows Updates or manual installation files (additional reference).
While Root Certificate Authorities are certainly important, Intermediate Certificate Authorities are equally
important. A client must trust not only the Root but also any Intermediate CAs in the certificate's chain. This
“Certificate Chain” is often provided when an issuer distributes a certificate to a customer. It’s recommended
to install this chain not only on your Windows Systems, but when installing a certificate on a load balancer as
well.
While I feel we’ve covered the basics of certificates which will help you be an effective troubleshooter, if you
remember nothing else about certificates, just remember:
If the answer to any of these questions is “No”, that should be the first issue you address. Also, see this great
blog post from co-author Paul Cunningham on the topic of Exchange Certificates. It also references several
additional articles covering the basics of planning certificates and naming for Exchange 2013.
Note: Wildcard certificates can be used with multiple subdomains of a domain. These certificates will
have a Subject of *.domain.com. As an example, https://plus.google.com uses such a certificate. They
Figure 3-6: Expanded IIS Manager on Server 2012 R2 Exchange 2013 multi-role server
Figure 3-6 shows the IIS Manager on my Windows Server 2012 R2 Exchange 2013 multi-role server. We find
the various web sites as well as the Application Pools that correspond to each application such as ActiveSync,
PowerShell, or OWA. Because this server is multi-role (it has both CAS and Mailbox roles installed) you will see
two separate Exchange web sites: the Default Web Site and the Exchange Back End web site.
The two primary services associated with IIS are the IIS Admin Service (inetinfo.exe) and the World Wide Web
Publishing Service (w3wp.exe). To explain it simply, inetinfo.exe corresponds to IIS configuration information
whereas w3wp.exe corresponds to each of the various Application Pools. After changing IIS configuration
information (such as authentication settings, etc.), the IIS Admin Service will typically need to be restarted.
If, on the other hand, a particular application (like OWA or ActiveSync) still isn't picking up a change you've
made, then you may need to recycle that Application Pool or, at worst, restart the World Wide Web Publishing
Service.
However, in many cases it's recommended to simply stop and then start the website, or recycle the application
pool, rather than restarting the services or using iisreset (Reference-1 Reference-2 Reference-3). This is because
it's possible IIS has not yet saved the necessary changes, and those changes could be lost by a forcible service
restart. Stopping and starting the websites, recycling the application pools, or using the "/noforce" switch for
iisreset is preferred. However, sometimes killing a service using Task Manager is all you can do as a last resort
in a troubleshooting scenario.
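A sketch of those gentler options using the WebAdministration module follows; the application pool name is only an example:

# Gentler alternatives to a full iisreset
Import-Module WebAdministration
Restart-WebAppPool -Name "MSExchangeOWAAppPool"    # recycle a single application pool
Stop-Website  -Name "Default Web Site"             # stop, then start, a single site
Start-Website -Name "Default Web Site"
iisreset /noforce                                  # last resort short of a forced reset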
Figure 3-7: Viewing IIS Web Site settings with the Web Administration PowerShell module
Using the series of commands shown in Figure 3-7, I imported the IIS PowerShell Module and queried the
bindings of my two web Sites in IIS. I’ve found that using PowerShell is a very useful way to query this data
fairly quickly. It’s also useful for when you need to send a customer a set of commands they can run and send
the data back to you for analysis. Figure 3-8 shows a few of my preferred information-gathering commands in
action:
The commands shown in Figure 3-8 are executed after navigating to the “Default Web Site” (already done in
Figure 3-7) and expose the various Applications and Virtual directories underneath it. Notice how the
commands work similarly to navigating a folder structure. If I need to go back a level I can simply use "cd ..".
Alternatively, if I wanted to export this to a text file I could repeat the last command but with a Format-List at
the end and redirect it to a text file:
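# For example (a sketch; the output path is a placeholder):
Import-Module WebAdministration
Set-Location 'IIS:\Sites\Default Web Site'
Get-ChildItem                                            # list the applications and virtual directories
Get-ChildItem | Format-List | Out-File C:\Temp\DefaultWebSite.txt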
Note: The Default Web Site has bindings of 80 and 443 for HTTP and HTTPS, respectively, while
Exchange Back End has 81 and 444, respectively. When a client makes a connection to Exchange using
HTTPS it's connecting to the Default Web Site which proxies the connection back to the Exchange Back
End web site. Do not change the bindings on the Exchange Back End website unless you want to break all
HTTP proxy functionality from the CAS to the Mailbox role. Note that the Default Web Site will use the
certificate which has been imported and enabled for the IIS service via the Exchange Management
Tools. Conversely, the Back End website uses the self-signed Microsoft Exchange certificate which is
installed upon Exchange installation.
Alternatively, you could use the Exchange Management Shell for some of these commands, but you might find
that the PowerShell IIS module gives you a bit more flexibility. Viewing these settings in the GUI may seem
easier, but it requires a few more mouse clicks to get the same data. In Figure 3-10 we've expanded
Sites and right-clicked the Default Web Site.
After selecting Edit Bindings, we’re presented with the IP address, port number, and Host Name binding
information on this web site (Figure 3-11). By selecting HTTPS and clicking Edit, you can view the assigned SSL
certificate.
To view Application Pool properties in the IIS Manager, navigate to Application Pools below the IIS Server
object (Figure 3-12). Here you can view Application Pool state, the version of .NET it runs, and its identity.
Now within IIS I right-click Exchange Back End>Edit Bindings and change the HTTPS binding from 444 to 445
(Figure 3-14).
Figure 3-14: Changing the bindings of the Exchange Back End website in IIS Manager
If I now refresh my browser, I’ll be greeted with a blank page (Figure 3-15).
This is because, by design, the Default Web Site in Exchange 2013/2016 uses the traditional web server
bindings for port 80 and 443, while the Exchange Back End website uses ports 81 and 444 for HTTP/HTTPS
connectivity. When the Client Access Server role communicates with the Mailbox Server role for IIS–related
functions, it proxies these connections via HTTPS using port 444 (port 81 for HTTP connections). So the
expected flow for UserA logging into OWA on ServerA (single server environment for this first example) would
be:
How would the traffic flow look if we were connecting to https://ServerA/owa with our browser but our
mailbox (UserB) was on a database that was mounted on ServerB? Let’s have a look:
Figure 3-16: Local connections for port 443 viewed using the NETSTAT command
I now run a different command (Figure 3-17) from the same server but for port 444; this output is a bit busier.
There’s the connection to the local server for the OWA session that I’m logged into (the mailbox I’m logged in
with is on a database that’s mounted locally). However, you’ll also find there’s a connection to 10.180.62.191,
which is one of my other Exchange servers. This is for another instance of OWA I have open for a mailbox
that’s currently mounted on that server. In that case the PID corresponds to an instance of w3wp.exe (World
Wide Web Publishing Service). The other PIDs correspond to background processes like
Microsoft.Exchange.ServiceHost.exe (MSExchange Service Host service), MSExchangeHMWorker.exe
(MSExchange Health Manager Service), and MSExchangeMailboxAssistants.exe (MSExchange Mailbox
Assistants Service). These are all background processes that are constantly running behind the curtains to keep
Exchange up and running (managed availability, synthetic transactions, maintenance tasks, etc.).
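For reference, output of this kind can be produced with something along these lines (the PID in the tasklist example is a placeholder):

# Connections on 443 (client-facing) and 444 (CAS-to-Mailbox proxy), with owning process IDs
netstat -ano -p tcp | findstr ":443"
netstat -ano -p tcp | findstr ":444"
# Map a PID from the output back to its process and service names
tasklist /svc /fi "PID eq 1234"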
It’s possible that administrators accidentally change the bindings or delete them. Unfortunately, their attempts
to repair the web sites typically result in the use of incorrect port numbers (like configuring 443 on the
Exchange Back End site). Alternatively, customers (or their network security admins) may block port 444 traffic
between servers and suddenly find their servers in a state of uselessness.
Note: A common issue in Exchange 2013’s timeline was the Exchange Back End site’s HTTPS bindings
having their certificate mapping removed. This could occur either after an Exchange update failure, a
failure during a new certificate assignment, or the removal of the self-signed certificate which should be
bound to this site. The resolution is to use IIS to re-assign a proper certificate.
Recreating an Exchange Virtual Directory (vDir) is often needed after corruption, misconfiguration, or
unexplained failures are encountered. I've seen it resolve odd display issues in OWA as well as authentication
failures. Recreating the various virtual directories has long been a useful troubleshooting step, but I'll be
honest when I say that it's usually done as a last-ditch step when every other avenue of troubleshooting
hasn't helped. In fact, if recreating the virtual directory doesn't resolve the issue, I'm usually looking at
a /RecoverServer install as the next step (this option is discussed in the Backup/Disaster Recovery chapter).
However, recreating a virtual directory is useful when the components that depend on IIS
(OWA/ECP/ActiveSync/EWS/OAB/PowerShell/AutoDiscover) aren’t working as expected and you’d like to reset
the relevant virtual directory to defaults.
Note: Recreating the Virtual directories will remove any settings or customizations you have
implemented, so I recommend running a “Get-OWAVirtualDirectory | Format-List” or similar command
beforehand to record the existing settings. In fact, if you use the EAC to reset the virtual directories
then you’ll be prompted to save the configuration to a network path.
There are two ways to recreate a virtual directory: EAC (GUI) or EMS (Shell). Let’s look at the EAC method first.
Navigate to EAC>Servers>Virtual Directories, select the virtual directory you wish to reset and then click the
Reset icon (Figure 3-18).
Figure 3-19 shows the prompt you’ll receive to back up the current Virtual directory settings before resetting
it.
After clicking “Reset” the Virtual directory will be removed and then recreated. Afterwards you’ll need to
restart IIS (iisreset /noforce) and reconfigure any customized settings, such as authentication settings.
The steps to perform the same action through PowerShell are straightforward. Figure 3-20 shows the
commands in action:
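In essence, the operation is a remove followed by a create; the following is a sketch rather than the exact contents of Figure 3-20, with ServerA as a placeholder (remember to re-apply any recorded settings afterwards):

# Remove and recreate the OWA virtual directory on the Default Web Site
Remove-OwaVirtualDirectory -Identity "ServerA\owa (Default Web Site)"
New-OwaVirtualDirectory -Server ServerA -WebSiteName "Default Web Site"
# The Ecp/ActiveSync/WebServices/Oab cmdlets follow the same pattern for the other virtual directories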
These commands work when we have an issue with the Default Web Site but I’ve actually encountered
instances where it was required to recreate the OWA Virtual directory on the Exchange Back End site as well.
To do this, run the commands shown in Figure 3-21:
Figure 3-21: Removing and recreating the OWA virtual directory from the Exchange Back End web site
PowerShell
What do you do if you're having issues with the PowerShell virtual directory? You are probably unable to
connect to the problem server to manage it via EMS or EAC (since both require a functional PowerShell virtual
directory), so you will need to load the local Exchange Management PowerShell snap-in and use the
commands shown in Figure 3-22:
Figure 3-22: Recreating the PowerShell virtual directory using the local Exchange PowerShell snapin
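In essence, the approach is along these lines (a sketch rather than the exact contents of Figure 3-22; the snap-in name shown is the Exchange 2013/2016 one and ServerA is a placeholder):

# Load the Exchange snap-in locally, then rebuild the PowerShell virtual directory
Add-PSSnapin Microsoft.Exchange.Management.PowerShell.SnapIn
Remove-PowerShellVirtualDirectory -Identity "ServerA\PowerShell (Default Web Site)"
New-PowerShellVirtualDirectory -Name "PowerShell" -WebSiteName "Default Web Site"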
Since we’re on the topic of PowerShell, on occasion I’ve found myself having to verify all the proper IIS
Modules are added for the PowerShell virtual directory (Figure 3-23).
I recommend comparing the loaded modules here to a known working server (or lab machine). On several
occasions I've found the "kerbauth" module (Figure 3-24) to be missing and have needed to re-add it. I've
encountered missing modules on both Exchange 2010 and 2013, but whichever version you run, the proper
modules must be present for the PowerShell virtual directory to function correctly.
Note: Also make sure that any and all file directory paths have the proper permissions set on them.
Again, it’s helpful to have a known working server to use as a comparison. Also, be sure that all proper
Anti-Virus Exclusions have been configured (an extremely common scenario).
Certificate Binding
Certificates are bound to both the Default Web Site as well as the Exchange Back End site in IIS. If you right-
click on Default Web Site>Edit Bindings>Select HTTPS and click Edit you can see the current certificate bound
to the site. When you run Enable-ExchangeCertificate –Thumbprint <Thumbprint> -Services IIS, this is what is
configured within IIS. Figure 3-25 shows a certificate generated by my Internal Certificate Authority installed
on the Default Web Site.
I often find the incorrect certificate listed here or that some certificates are missing. While the EAC and the
Get-ExchangeCertificate shell command are extremely useful, many customers mistakenly think that the
Exchange tools are the only way to Import/Export certificates. However, the Certificates MMC Snap-in is a very
handy troubleshooting tool.
Open PowerShell
Type “MMC” <Enter>
File>Add/Remove Snap-in
Select “Certificates” and click Add
Select “Computer Account” and click Next
Click Finish
Click Ok
Certificates for Exchange usage will be in the Personal>Certificates container
Figure 3-25: Certificate bindings in IIS on the Default Web Site
Figure 3-26 shows the Personal Certificates store of the Local Computer account. This is where
manually installed certificates are likely to be stored. In short, when you run Import-ExchangeCertificate the
certificate ends up here. Similarly, you can use this console to import and export certificates as well.
Note: Your Personal store will likely look different than mine as my lab server is also a Domain
Controller/Certificate Authority.
Certificate issues have historically revolved around generating the request, but the Certificate Request Wizards
(Figure 3-27) used in Exchange 2010/2013/2016 have made this task much easier.
As previously mentioned in the section covering Certificates, when you generate the certificate request on an
Exchange server, you need to leave that request intact until you receive the new certificate file from your
issuing Certificate Authority. If you fail to do this, your certificate will be missing the private key and be
effectively useless for Exchange usage. I see this frequently when customers request a certificate multiple
times or if they try to use a different server to import the certificate. It’s possible a customer could connect to
a load-balanced name for EAC access and request the certificate on one Exchange server, then upon
completion of the request be taken to a different server. To avoid this, it may be necessary to connect
directly to an Exchange server when requesting a certificate. Once a request has been generated, you'll see the
pending request in the EAC Certificates console along with an option to "Complete" the request (Figure 3-28),
which you execute when you've received the certificate from your CA (this process generates the private key).
Figure 3-28: A pending certificate request in the EAC Certificates management interface
A common problem first encountered in Exchange 2013 was when the default self-signed certificate on the
Exchange Back End website became unbound and caused HTTPS connections to fail. The issue was well
documented by Microsoft (Reference1 Reference2). The issue often came to light after using the EAC
certificate wizard to import or enable a new certificate. However, knowing how to navigate IIS to manually
enable the self-signed certificate or even to initially identify the issue is a useful troubleshooting skill.
For example, a CAT6 network cable is said to operate at the Physical layer (Layer 1) of the OSI model, while a
router operates at the Network layer (Layer 3), and Exchange operates at the Application layer (Layer 7). As
Exchange generates an email message using SMTP (Layer 7) and creates a connection over port 25 (Layer 4) to
another mail server’s IP address (Layer 3) using a DSL connection (Layer 1), it goes down the OSI Model
through each layer. On the receiving mail server, the connection goes up through each layer in a mirror
fashion, starting at Layer 1 and working its way up to Exchange at the Application Layer.
When discussing load balancing, specifically layer 4 load balancing, connections are distributed using only
knowledge of the connection at Layers 1-4. This means that the most sophisticated pieces of information we
could gather at this layer are TCP/UDP port number and IP address. The load balancer has no knowledge of
whether the port 443 traffic is being used for OWA or Outlook Anywhere or Terminal Services Gateway. The
load balancer is also unable to inspect the traffic as that information exists at layer 7. As such, it is impossible
to determine whether an application/service is in a healthy state. For example, all Exchange services could be
stopped on a server, but since the server is still reachable over TCP/IP (PING, etc.) then traffic would still be
sent to it.
Alternatively, because layer 7 load balancing inspects traffic at the protocol level, it is able to make decisions
based on the actual workload (OWA vs EWS vs Outlook Anywhere etc.). It is also said to have “service
awareness”, meaning if a service is unresponsive then the load balancer will not send traffic to that location.
This service awareness is typically achieved by using health checks at layer 7, such as loading an HTML page or
attempting an SMTP synthetic session.
SSL Offloading
Another benefit of layer 7 load balancing is the ability to perform SSL Offloading, which terminates the SSL
session at the load balancer, allows the connection from the load balancer to the server to be unencrypted,
and removes the SSL processing overhead from the Exchange servers. Although this may not provide
significant performance gains given the processing power of modern servers, it may still be a desirable
configuration in some situations. I once received a tip from a Microsoft Premier Field Engineer that an
alternative to SSL offloading would be to have a 2048-bit (or higher) certificate on the load balancer and a
1024-bit certificate on the Exchange servers themselves. While this would technically be SSL Bridging instead
of offloading, you’re still able to reduce the load on the Exchange servers by using a less resource-intensive
certificate. This also has the advantage of maintaining SSL encryption from client to server instead of passing
data unencrypted from the load balancer to the servers.
Session Affinity
Simply put, Session Affinity (aka Sticky Sessions or Persistence) ensures that new client connections for
existing client-to-server sessions are always directed to the same server. For example, if a client browser
session is established for an e-commerce website, where a user has logged into the server, it’s desirable for
that server to handle that session for its duration. If that e-commerce website opens a new browser tab
requesting the user enter their payment information (creating a new connection for the same session), the
user being repeatedly prompted for login is undesirable. This prompt could occur if this new connection were
to be directed to a different server by the load balancer because the new server would not have the user’s
login information or token in its memory. To avoid this, Affinity/Stickiness ensures that for the lifetime of that
session, each new connection is directed to the same server. New sessions can still be directed to any server in the load balanced pool; whichever server receives the session then receives any new connections for that session. There are various
means to achieve affinity, the most common are:
Web-Cookie
HTTP Header
SSL Session
Source IP
Note: This information is used by the load balancer to direct incoming traffic to the desired server in
the load balanced pool.
Session affinity was required due to the Client Access architecture of Exchange 2010. All rendering and
authentication for client sessions (such as OWA) occurred within the Client Access Server role, which then used
RPCs to connect to the mailbox database on the user’s mailbox server (wherever it may be at the time). This
process depended on session affinity for the Client Access Server which initially processed the client request,
otherwise the user could be repeatedly prompted for credentials or have connection failures.
Exchange 2013 introduced architectural changes which no longer require this affinity. All rendering occurs on the Exchange 2013 Mailbox server that currently hosts the active copy of the target mailbox's database, meaning that no matter which Client Access Server the load balancer directs the connection to, the request will always be proxied to the same mailbox
server. Another change was the introduction of a shared hash between Client Access Servers which allows an
authenticated session to be shared between each CAS in the load balanced pool. When a user is
authenticated, an authentication hash is created using the installed SSL certificate. The client then uses this
hash in future requests to the server, at which point any CAS can now service this request without requiring
the user to re-authenticate. Since this hash is generated using the installed SSL certificate on CAS, the only
requirement is that the same certificate be enabled for IIS on all CAS in the load balanced pool.
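Because the authentication hash is derived from the certificate, a quick check when troubleshooting repeated credential prompts behind a load balancer is to confirm that the same certificate is enabled for IIS on every server. A sketch, with placeholder server names:
foreach ($server in "EX2016SRV1","EX2016SRV2") {
    # List only certificates enabled for IIS on each server in the pool
    Get-ExchangeCertificate -Server $server | Where-Object {$_.Services -match "IIS"} |
        Format-Table @{n="Server";e={$server}},Thumbprint,Subject -AutoSize
}
The Thumbprint returned should be identical on every server in the load balanced pool.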
The current recommendation for load balancing Exchange 2016 is to implement layer 7 load balancing
without session affinity. This allows layer 7 service awareness for the load balancer, as well as allowing the
simplicity of no session affinity. However, if Office Web Apps Server (now called Office Online Server) is deployed to enable the viewing/editing of Office files in Outlook Web App, cookie-based session affinity is required. Since OOS uses OAuth for authentication, the typical signs of misconfigured affinity (such as
repeated authentication prompts) may not be displayed. Instead, you should look for signs such as any
slowness while using the OOS features, or changes/edits not being saved while working with Office
documents in OWA. These might be the result of misconfigured affinity for the OOS load balanced
namespace. When troubleshooting OOS, it’s important to know that both the client and the Exchange Server
communicates to OOS using the load balanced URL. This means that if you wish to bypass the load balancer,
you will require a HOSTS file on both the client machine as well as the Exchange Server (more information in
the Load Balancing Troubleshooting Tips module).
Load balancers offer a range of methods for distributing connections across the servers in a pool. The most common are:
Round Robin - This method tells the load balancer to direct requests to the real servers in a round robin order.
Weighted Round Robin - This method allows each server to be assigned a weight to adjust the round robin order, e.g. "Server 1" can receive twice the requests that "Server 2" receives.
Least Connection - This method tells the load balancer to look at the connections going to each server and send the next connection to the server with the fewest connections.
Weighted Least Connection - This method allows each server to be assigned a weight to adjust the least connection order, e.g. "Server 1" can receive twice the connections that "Server 2" receives.
Agent-Based Adaptive Balancing - This is resource-based load balancing: an agent installed on each server monitors the server's resources (e.g. RAM and CPU) and reports usage back to the load balancer as a percentage, which the load balancer uses to distribute connections.
Fixed Weighting - This method is used for redundancy rather than load balancing. All connections go to the server with the highest weight; if that server fails, the server with the next highest weight takes over.
Weighted Response Time - This method looks at the response times of the real servers (based on the response time of the server health check), and whichever real server is responding fastest gets the next request.
Source IP Hash - This method creates a hash value from the source IP address that sent the request, so connections from the same source IP consistently reach the same real server, while different source IPs are distributed across the pool.
The Exchange Product Team recommends the Least Connection method, with the stipulation that you should use a Slow Start or similar option. This "Slow Start" option (different vendors may use different terminology) will prevent a flood of new connections to a server after it's added to the pool.
Note: I recommend researching and testing any additional features your load balancer may provide. The blog post below describes how Nagle's Algorithm, a network congestion prevention feature, adversely affected the performance of Outlook clients.
Poor Outlook performance and Nagle’s algorithm
Lastly and most importantly, Windows Network Load Balancing (WNLB) is incompatible with Windows Failover Clustering, which is required on a
Database Availability Group (DAG) member server. In other words, it is impossible to have a DAG node also be
a member of a WNLB cluster. In Exchange 2010/2013, this restriction required dedicated Client Access Servers
if WNLB was employed. However, because Exchange 2016 now has one consolidated server role, there is no
longer an option to install only a Client Access Server. In light of this architectural change, customers must
either implement a third-party load balancing solution or use Exchange 2016 servers with no active mailboxes
on them (and not members of a DAG) in a WNLB load balanced pool. Since the latter option wastes an Exchange Server license, adds complexity, and offers no technical benefit, I strongly recommend against it.
For example, if the load balanced name is https://Mail.Contoso.com/owa which resolves to 10.0.0.50 (a VIP on
the load balancer), and we would like to bypass the load balancer without configuration changes to Exchange
or the load balancer itself, we perform the following steps:
Navigate to the HOSTS file (C:\Windows\System32\drivers\etc) on the client machine being used to
access OWA/Outlook/etc.
Add an entry for mail.contoso.com which resolves to an individual Exchange Server’s IP address (such
as 10.0.0.10)
Via an elevated command prompt on the client machine, run ipconfig /flushdns to force the client to
remove any cached DNS entry for the mail.contoso.com record.
Connect to the Exchange resource from the client using the usual means (Outlook/OWA/etc.)
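The same steps can be scripted from an elevated PowerShell prompt on the client, which is handy when you need to repeat the test against several servers. This is only a sketch; the IP address and name are the example values used above:
# Point mail.contoso.com at one Exchange server (10.0.0.10), bypassing the load balancer VIP
Add-Content -Path "$env:windir\System32\drivers\etc\hosts" -Value "10.0.0.10`tmail.contoso.com"
# Flush the local DNS cache so the HOSTS entry takes effect immediately
ipconfig /flushdns
Remember to remove the HOSTS entry again once testing is complete.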
Performing these steps will bypass any affinity or networking misconfigurations which may be present on the
load balancer. The symptoms noted while connecting through the load balancer may well be sporadic: connectivity failures that are resolved by refreshing the client, an authentication prompt that only occurs once or twice a day, or performance issues at unpredictable times. These symptoms
can be caused by a singular load balanced server which has either been misconfigured or has encountered a
failure of some kind. In this situation, you can do the following:
Look for connectivity errors in the Exchange logs on each Exchange server (C:\Program
Files\Microsoft\Exchange Server\V15\Logging)
Use connectivity or health checking logs on the load balancer to identify a server being unresponsive
Use a HOSTS file to point a client machine to one server and test the behavior. Repeat this process
until you encounter the Exchange server exhibiting the issue
Note: Ensure that once your testing is complete, you revert any HOSTS file changes you have
implemented. Otherwise, the client system will fail to connect if a single server failure occurs.
Figure 3-29: Expected response when authenticating against the AutoDiscover URL
Figure 3-31: Expected response when authenticating against the ActiveSync URL
Figure 3-33: Expected response when authenticating against the MAPI URL (MAPI over HTTP)
The current URL values can be found on the various Virtual Directories in Exchange itself, or more accurately,
based on the configuration AutoDiscover passes to the client. The most common means of obtaining this
information and/or testing AutoConfiguration are:
The first two tools are most useful. Let’s see how they work.
The Microsoft Remote Connectivity Analyzer (ExRCA) is the most useful Exchange troubleshooting tool
Microsoft has ever released. Although ExRCA had humble beginnings (running on a web server under the desk
of a Microsoft Support Engineer), it has grown to include virtually every Exchange (as well as Lync and Office
365) connectivity test you would need to troubleshoot or validate your solution. The tool can perform the
following tests:
In addition, protocol logging for POP3 and IMAP can be found in the below locations:
%ExchangeInstallPath%Logging\IMAP4\
%ExchangeInstallPath%Logging\POP3\
Figure 3-35 shows the output of the Outlook Anywhere test, with an overview of each phase displayed. The
test queries and validates AutoDiscover, as well as the capabilities for both RPC over HTTP and the newer
MAPI over HTTP. The individual test phases can be expanded to detail each step for further analysis, or the
user can simply trust the green tick mark as the symbol of a successful test. Figure 3-36 shows an expanded
availability synthetic transaction performed by the Exchange Web Services test within ExRCA.
Note: For configuring, logging, and troubleshooting MAPI over HTTP specifically, I recommend the
below posts:
Configure MAPI over HTTP
Outlook Connectivity with MAPI over HTTP
In addition, logging for MAPI over HTTP can be found in the below locations:
%ExchangeInstallPath%Logging\MAPI Address Book Service\
%ExchangeInstallPath%Logging\MAPI Client Access\
%ExchangeInstallPath%Logging\HttpProxy\Mapi\
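When working through a MAPI over HTTP issue it also helps to confirm that the feature is enabled and that the virtual directory URLs being handed out are what you expect. A minimal sketch (the server name is a placeholder):
PS C:\> Get-OrganizationConfig | Format-List MapiHttpEnabled
PS C:\> Get-MapiVirtualDirectory -Server EX2016SRV1 | Format-List InternalUrl,ExternalUrl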
Figure 3-35: Overview of results from a successful Outlook Anywhere test from the Exchange Remote
Connectivity Analyzer
The tool allows the results to be exported either to an XML or HTML file for later viewing or sharing with a support representative. It's important to understand that although you may receive a failure (a red X instead of the green tick mark) at one phase of testing, it does not mean the entire test failed. For example, there are several
phases of AutoDiscover testing that Outlook clients can perform:
https://domain.com/AutoDiscover/AutoDiscover.xml
https://AutoDiscover.domain.com/AutoDiscover/AutoDiscover.xml
http://AutoDiscover.domain.com/AutoDiscover/AutoDiscover.xml
Locally configured XML file
SRV AutoDiscover DNS record
Most customers will use only one method for publishing AutoDiscover, so it's expected to see all AutoDiscover tests fail except the one for the endpoint published for that Exchange organization. Figure 3-37 displays the output of the
AutoDiscover test against a mailbox in my Office 365 tenant. Since Office 365 uses the HTTP redirect method
for AutoDiscover, each previous step failed until the HTTP method was attempted.
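If you need to confirm which publishing method a domain is actually using, a couple of quick DNS queries will show whether the AutoDiscover host record or the SRV record exists. These are only sketches using an example domain:
PS C:\> Resolve-DnsName -Name autodiscover.contoso.com -Type A
PS C:\> Resolve-DnsName -Name _autodiscover._tcp.contoso.com -Type SRV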
Figure 3-36: Expanded results of Availability Service (EWS) test where an appointment was placed onto user’s
calendar
While the test had several failures within some of its individual phases, the overall test did pass, which is what
you should be concerned with when using the ExRCA tool. Personally, I’ve found this tool most useful in the
following scenarios:
This information is extremely useful when you must verify that the values you’ve configured in Exchange are
being provided to the Outlook client. For example, it's possible for Active Directory replication latency or a stale IIS application pool cache to result in clients not receiving the proper values. In addition, Outlook can
potentially take several hours to refresh its AutoDiscover cache. Knowing what information is being served up
to clients is useful in troubleshooting scenarios. To access this tool, perform the following steps:
With Outlook open and a profile configured, CTRL+Right Click the Outlook icon in the taskbar
Select Test E-mail AutoConfiguration (Figure 3-38)
Input the primary SMTP address and password of the account you would like to test and clear the two
Guessmart checkboxes (Figure 3-39)
Click Test and wait for the tests to complete. You can view the progress on the Log tab, which will display the
various AutoDiscover endpoints being attempted (Figure 3-40).
You can choose to look at the configuration information AutoDiscover provided on the XML tab in .XML format,
or use the more user friendly Results tab (Figure 3-41)
Understanding the components and infrastructure that allow clients to connect to their mailbox
empowers us to diagnose client access failures. As you would expect, there’s little use for an Exchange
Server that no client can access. Similarly, a messaging platform that cannot efficiently transport and
manage mail flow is hardly a messaging platform at all. We’ll now move on to Transport Services where
we’ll learn how to diagnose and recover from transport issues.
Additional reading
Of front and back ends
Front-End/Back-End topology
Load balancing the Front-End servers
Namespace and URL redirection configuration
Client Access Server sizing
Public Key Infrastructure
Offline Address Book
Availability Service
Exchange Web Services
Outlook for Mac
In Exchange terminology, “transport” (or the transport system) refers to the group of components responsible
for mail flow between point A and point B. Transport has evolved significantly over the years as Microsoft
released different versions of Exchange, and we’ve also seen changes in related technologies such as anti-
spam protection schemes like the Sender Policy Framework (SPF) that also need to be taken into account
along with the Exchange components.
Exchange Server 2007 introduced an entirely new server role architecture and had two dedicated transport
server roles that could be installed:
Hub Transport – primarily responsible for internal mail flow, but could also be configured to send and
receive email outside of the organization. Hub Transport servers were mandatory in any Active
Directory site that hosted mailbox servers.
Edge Transport – specifically designed to be deployed in a perimeter network to provide secure mail
flow between the organization and the internet. The Edge Transport server was optional, and
organizations could choose to deploy only Hub Transport servers instead, or use a third party product
to perform the same function.
The Edge Transport server role could not co-exist on the same Windows server as other server roles, but the
Hub Transport role could be installed with or without other roles, depending on whether the mailbox server
role was clustered or not.
From a transport perspective, Exchange 2010 followed the same architecture as Exchange 2007. It wasn’t until
the release of Exchange 2013 that we saw another significant shift in transport architecture. In Exchange 2013
the majority of the transport functionality performed by the Hub Transport role is collapsed into the Mailbox
server role. The Exchange 2013 Client Access server role is responsible for front-end transport services, along
with the other Client Access responsibilities that you can read about in Chapter 3. The front-end transport
service authenticates and proxies SMTP connections from clients to the transport services running on the
Mailbox server role. This front-end/back-end architecture works in the same manner regardless of whether the
Exchange 2013 Client Access and Mailbox server roles are installed on separate computers, or when they are
installed together on a multi-role server.
The only other server role that has survived since Exchange 2007 is the Edge Transport server, which is still
designed for deployment in perimeter networks for secure inbound and outbound mail flow. Edge Transport
remains as an optional server role for Exchange organizations.
The first step is to clearly define the scope of the problem. It’s rare to receive a support ticket from a help desk
team or a report from end users that contains 100% of the information you need. There may be some
questions you need to ask, such as:
Basic Elimination
Depending on the answers you get for the questions you ask, you should go through a short process of
elimination to rule out anything that the end user wasn’t able to confidently answer. For example, send them a
test email from your own computer with a delivery receipt enabled, and make sure it is received in their
desktop client as well as any mobile device they might use. With that one simple test you’ve ruled out multiple
possible causes of the problem.
By the way, requesting a delivery receipt when you are testing internal emails is important. For one thing, it means you
can do the test without the other person being available to confirm delivery. It also means that you’ll know
that the email was delivered successfully even if the end user claims it wasn’t (e.g. they have an inbox rule or
some other issue preventing it from appearing in Outlook or on their mobile device).
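If you'd rather send the test message from a script than from your own mailbox, Send-MailMessage can submit it and request the delivery notification for you. A sketch, where the addresses and server name are placeholders:
# Send a test message and request a delivery status notification on successful delivery
Send-MailMessage -From "[email protected]" -To "[email protected]" -Subject "Mail flow test" -Body "Test message - please ignore" -SmtpServer "mail.contoso.com" -DeliveryNotificationOption OnSuccess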
The more possibilities you can rule out quickly, the easier your troubleshooting will be. However, don’t assume
that anything you’ve ruled out in the initial part of the investigation should be completely ignored. At this
stage you’re only trying to identify the best place to start looking. You may need to come back later to things
that you ruled out and investigate those as well.
You should also consider what has changed (perhaps by you or your team) recently that may have contributed
to the problem. Often we can make changes to the environment which take several days to emerge as a user-
impacting problem, so make sure you consider all recent changes, not just those that occurred in the last day
or so.
If you’re dealing with a new customer and you don’t have a diagram like that already then spend a few
minutes at the start of the call finding out what’s involved in their mail flow and sketch yourself a quick
diagram. It might sound a bit basic but it is a worthwhile exercise. My notebooks are full of drawings like that
from previous support cases.
Understanding the scope of the issue and the environment in which it is occurring are just the beginning of
your troubleshooting. To get into more details about specific factors that can influence email delivery let’s
begin by looking at the role of DNS in transport.
It should be well understood by any email administrator that an email address of [email protected] is made
up of a unique prefix of "john" and a domain suffix such as "domain.com". Although in some internal systems
you may see emails from local, non-resolvable addresses such as root@localhost, you’re very unlikely to
encounter those types of email addresses in email that is travelling around the internet, unless the sender
happens to have a very poorly configured server. Let’s simply state that the most basic requirement, as
defined in RFC2821, is that “only resolvable, fully-qualified domain names are permitted when domain names
are used in SMTP.”
I’ll take that one step further and point out that you should only try to use domain names that you own and
control for sending email. As obvious as that may seem, I have encountered many customers in the past who
try to use a “dummy” domain for a test system or for internal email alerting, and run into trouble because that
domain is owned and used by someone else on the internet.
Real World: Microsoft isn’t immune to making this kind of mistake. When they first shipped Exchange
2013 they used a domain name of inboundproxy.com as the destination for some of the Managed
Availability probe messages. Unfortunately, they didn’t own that domain at the time, and customers
saw NDRs queueing on their Exchange servers for that destination. You can read the full story here.
MX Records
When an email is sent to an address of [email protected], successful delivery relies on DNS to tell the
sending server where to send that email. This is determined by looking up the “mail exchanger” (MX) records
for domain1.com in DNS.
You can look up the MX records of a domain name using the nslookup utility from a CMD prompt.
C:\>nslookup
Default Server: dns.iinet.net.au
Address: 203.0.178.191
> exchangeserverpro.com
Server: dns.iinet.net.au
Address: 203.0.178.191
Non-authoritative answer:
exchangeserverpro.com MX preference = 0, mail exchanger = exchangeserverpro-
com.mail.protection.outlook.com
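The same lookup can be performed in PowerShell with Resolve-DnsName, which is easier to use in scripts:
PS C:\> Resolve-DnsName -Name exchangeserverpro.com -Type MX | Format-Table Name,NameExchange,Preference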
Figure 4-1 illustrates how the lookup process by the sending Exchange server proceeds:
The DNS resolution process may be even further complicated when the A records that are listed as the MX
records for a domain exist in a completely different domain. For example, while domain1.com’s MX records
may be A records in the same domain, such as mail.domain1.com, the MX records for domain2.com may be
something like inbound1.emailsecurityservice.com. This is common for domains that are using an externally
hosted email security service. For example, the Australian airline Qantas has MX records that point to the
Websense service.
> qantas.com
Server: dns.iinet.net.au
Address: 203.0.178.191
Non-authoritative answer:
qantas.com MX preference = 20, mail exchanger = cust20986-3.in.mailcontrol.com
qantas.com MX preference = 20, mail exchanger = cust20986-2.in.mailcontrol.com
qantas.com MX preference = 20, mail exchanger = cust20986-1.in.mailcontrol.com
In order to resolve the MX records for the Qantas domain name, the server needs to perform additional DNS
queries to locate the authoritative name servers for mailcontrol.com as well. This is relevant in troubleshooting
scenarios because the delivery of email for one domain name may be dependent on the availability of one or
more other domain names. In the example above, a DNS outage for mailcontrol.com would have an impact on
qantas.com and any other Websense customers as well.
Note: MX records in DNS are not used by your Exchange servers for internal mail flow within the
organization. The Exchange servers use their own routing table for internal delivery, which takes into
account the Active Directory site topology when trying to calculate the best route between two servers.
Reverse DNS
So far we’ve seen that MX records and A records in DNS play a part in determining where email should be
delivered when sending over the internet. There’s another DNS record type that is also important to have in
place – a PTR (or reverse DNS) record. Resolving a host name to an IP address is the job of the A record.
Resolving an IP address to a host name is the job of the PTR record. When a sending server makes an SMTP
connection to a receiving server, all the receiving server can immediately see is the source IP address of the
connection and the host name that the sending server is using for its opening HELO or EHLO command in the
SMTP conversation.
One of the tests that most email servers perform these days is to lookup the source IP address to see if there
is a PTR record for that IP address (Figure 4-2). There is no requirement for the PTR record to match the HELO
hostname that the sending server is using. After all, a business with a single IP address may be running
multiple services behind a reverse proxy, not just their Exchange server, so setting the PTR record to match the
Exchange server's EHLO or HELO hostname isn't necessarily the correct way to go.
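You can check the PTR record for any IP address yourself with nslookup or Resolve-DnsName; the IP address below is just an example placeholder:
C:\>nslookup 203.0.113.25
PS C:\> Resolve-DnsName -Name 203.0.113.25 -Type PTR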
The other test that the receiving server will perform is to check that the host name in the HELO command
resolves to the same IP address that the sending server is connecting from (Figure 4-3). If an A record is found
that matches the sender’s source IP address, then it is likely that the connection is coming from a server that
belongs to the owner of that domain.
However, if the domain does not contain a DNS record matching the source IP address that is making the
SMTP connection, then it is considered more likely that an unauthorized server is attempting to spoof the
domain (Figure 4-4). In other words, it’s likely to be a spammer or malicious sender.
The PTR record for an IP address can be changed by the owner of that IP address, which for most companies is
the provider that is supplying you with your internet connection. The network provider will either give you
access to a portal where you can manage your own PTR records, or will allow you to submit a technical
support request to have a valid PTR record put in DNS for your IP address.
Without a working reverse DNS configuration for Exchange, you’re likely to see SMTP connections rejected by
recipients’ servers, or have the emails sent by your server be scored higher on the scale of likely spam.
Similar to MX records, the PTR records are not important for internal mail flow between the Exchange servers
in your own environment. Exchange will quite happily send and receive email between the other Exchange
servers that it knows and trusts inside your organization whether PTR records exist or not.
SPF records allow a domain owner to specify which mail servers are permitted to send email for that domain
name. When the sending server issues its “MAIL FROM” command in the SMTP conversation, the receiving
server will look up the SPF record in the domain name of the “From” address to see if there is a match for the
source IP address of the SMTP connection (Figure 4-5).
If you read about SPF records on the internet you might find advice from some sites that it is better to have no
SPF record than it is to have an incorrect SPF record. There’s some truth to that, but also some risk. Some mail
hosts will reject mail if there is no SPF record for the domain. Relatively few hosts tend to reject messages on
that basis, but because they are very large mail hosts the impact can be quite noticeable. Ultimately it is best
to have a correctly configured SPF record in DNS for your domain.
An SPF record is simply a TXT record with a certain syntax. The syntax is made up of two parts: mechanisms and modifiers. Modifiers are optional and are not commonly used except in special circumstances. During
management and troubleshooting of transport you'll most often be dealing with SPF records containing only
mechanisms.
The mechanisms for an SPF record define the sets of hosts that can send email from the domain. Mechanisms
can be defined by:
all – matches any host, and is placed at the end of the SPF record as a “catch all” for any senders that
did not match other mechanisms listed ahead of it.
ip4 – matches a single IPv4 address or IPv4 network range.
ip6 – matches a single IPv6 address or IPv6 network range.
a – matches a host name or domain name. The IP addresses that the name resolves to in DNS are
matched against the sender's IP address. This mechanism is useful for matching against a web server
IP address based on the domain name.
Mechanisms are used in combination with a qualifier that tells the server what to do when a match is found. The qualifiers are:
"+" – Pass; the sender is authorized (this is the default when no qualifier is specified).
"-" – Fail; the sender is not authorized and the message can be rejected.
"~" – SoftFail; the sender is probably not authorized, and the message is usually accepted but may be marked as suspicious.
"?" – Neutral; no assertion is made about the sender.
An example of a mechanism paired with a qualifier is "-all" at the end of an SPF record, which means
"Fail/reject email from any sender who did not match an earlier mechanism in the SPF record."
If this all seems very complicated to you, don't worry, it starts out that way for everyone who has to deal with
SPF records. Fortunately, there are many tools available to help you construct and validate your SPF records.
For example, Microsoft provides the Sender ID Framework SPF Record Wizard, which has an awkwardly long
name but is nonetheless very useful.
After entering your domain name (Figure 4-6), the wizard will step you through a series of questions to
determine the most likely SPF record that you will need. In this example I answered the questions as follows:
Domain's inbound servers may send mail (in other words, the servers listed as MX records also handle
outbound email)
An additional domain name whose A record is a valid outbound email server (a common example of
this is an externally hosted website that uses its own SMTP service to send notifications and other
emails)
This domain sends mail only from the IP addresses identified above (in other words, anything else
trying to send email from my domain name should be considered unauthorized)
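A record generated from answers like those would look something like the following, where the additional host name is a placeholder:
v=spf1 mx a:www.contoso.com -all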
Adding that string as a TXT record in the public DNS zone for the domain name helps to prevent unauthorized
email servers from spoofing my domain name. At least, they won't be able to do it when sending to any
receiving server that checks and enforces SPF records. Anyone who is not checking SPF records can still
receive the spoofed email, but may reject it for other reasons such as spam content or malware. Although SPF
is important, you should consider it just one factor among many that different providers will use when filtering
email.
Apart from tools to generate your own SPF record, many email services will provide you with the exact strings
to add to your SPF record. When you add a domain name to Office 365, Microsoft advises you of the SPF
record they suggest, which is appropriate for organizations only sending outbound email using Exchange
Online Protection. This may not be correct for your environment, especially in a hybrid scenario.
Similarly, email marketing services and SMTP hosting services will also have documented solutions to adjust
your SPF record so that you can successfully use their services without your email being rejected.
After you have your SPF record in place you should validate it. And in fact, you should repeat this validation
test any time you suspect an external organization may be rejecting your email because of your SPF record.
MXToolbox has an SPF record validator (Figure 4-8) that takes a domain name and IP address as input and lets
you know what the result will be if that IP address sends email for your domain.
Aside from the result for that specific IP address, the MXToolbox SPF record lookup tool (Figure 4-9) will also
validate the general health of your SPF record for problems such as excessive DNS lookups or syntax
problems.
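You can also confirm exactly what SPF record is published in DNS before reaching for an external tool; the domain below is a placeholder:
PS C:\> Resolve-DnsName -Name contoso.com -Type TXT | Where-Object {$_.Strings -match "spf1"} | Format-List Strings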
Similar to MX and PTR records, the internal mail flow between Exchange servers in your organization does not
depend on SPF records. The Exchange servers in an organization already understand that other Exchange
servers in the same organization are authoritative for your domains.
Real World: Take care when modifying SPF records, because it is easy to inadvertently cause all of your
domain's outbound email to be rejected. If there is any doubt you can use a SoftFail qualifier on the
"all" mechanism (in other words, use "~all" at the end of your SPF record) for a period of time while you
test outbound email against major hosts such as Yahoo and Google. Your SPF records should also be
considered any time there is a planned change to your email routing.
SMTP Connectivity
So far in this chapter we've looked at how a sender's email server locates a recipient's email server when it
needs to send an email message. After the sending server has located the server that it needs to connect to, it
will then attempt to connect via SMTP. A variety of things can go wrong at this stage, which we'll go through
in this section.
When you need to test that a route exists between two Exchange servers you can use the good old, reliable
ping and tracert utilities in Windows. Just keep in mind that many servers will block ICMP (ping) requests, so it
is not a fool-proof test. For example, here I am performing a ping and tracert from an on-premises Exchange
server to an MX record for microsoft.com that resolves to Exchange Online Protection in Office 365.
C:\>ping microsoft-com.mail.protection.outlook.com
Trace complete.
Note: It's useful in real troubleshooting situations to perform the ping and tracert tests from multiple
hosts inside and outside of your network. If you don't have an external host to use for that type of
network testing, such as a VM running in Azure, you can use tools such as Pingdom to run basic tests.
Testing from multiple locations often gives you a better perspective on the problem.
Running ping and tracert tests from your server to a destination gives you part of the picture, but it doesn't
necessarily mean that a working route exists between the other server and yours. Most of the Exchange
servers you encounter in the real world have a private IP address and sit behind a network device that
performs a Network Address Translation (NAT) from a public IP address to that private IP address.
When the network device performing the NAT has multiple public IP addresses assigned to it, there is usually
one IP address defined for outbound connections, also known as source NATing. On a misconfigured network
device, the NAT rule for incoming connections to the Exchange server is missing a matching source NAT rule
for the reply packets. This causes the TCP connection to break, because the source of the connection is seeing
reply traffic that originates from a different IP address than it is sending to (Figure 4-10).
A similar problem can occur due to asymmetric routing, which causes replies from the Exchange server to
traverse a completely different network route. If that route includes a NAT device, then once again the replies
will appear to be coming from a different IP address and will cause the TCP connection to break.
Both of these NAT and routing situations are difficult to diagnose from the perspective of the server itself. You
can run a network packet analyser that shows that the inbound connections are being received, and that the
outbound replies are being sent, but that will not give you the full picture. A network or firewall administrator
will most likely need to become involved, but even then the problem may not be apparent to them, until you
mention the possibility of a source NAT or asymmetric routing issue. Neither of those issues will look like a
firewall is rejecting the traffic, so if that is all the firewall administrator is looking for then they will not see the
problem.
You can also help things along by using an external test to show the public IP address to which an outbound
connection from your Exchange server is NATed. Using a website such as What Is My IP is an easy way, but
perhaps you would prefer not to use a web browser to connect to the internet from your Exchange servers.
Instead, you can use PowerShell to retrieve the information, either from a local PowerShell console on the
server, or from a remote PowerShell session to the server.
PS C:\> $webrequest = Invoke-WebRequest http://checkip.dyndns.org
PS C:\> $webrequest.Content
<html><head><title>Current IP Check</title></head><body>Current IP Address:
203.206.161.219</body></html>
Although the output above is not very neat and tidy, the public IP address is obvious. If you'd prefer a cleaner
output, try the Get-PublicIPAddress.ps1 script. Either way, providing the results to your network or firewall
administrator will be helpful in a troubleshooting situation.
For SMTP connectivity the standard server-to-server port used is TCP 25. When an email server looks up the
MX records for a remote domain, and tries to connect to the IP address contained in the MX record for the
domain via SMTP, it will only try on the standard port. There is no way for you to run your email server on
another port (which would not be helpful anyway).
However, just because servers are expecting to connect on port 25 doesn't mean different ports can't be used
at all. Many client applications and devices that use SMTP can be configured to use any port you like. In fact,
there are two other standard ports that are often used by clients for mail submission:
TCP 587
TCP 465 (this has been deprecated)
The client submission ports are most commonly used by POP and IMAP clients. POP and IMAP are old client
protocols that are not widely used these days, and unless you know that some users need to use these
protocols it is usually fine to leave those ports closed on your firewall.
SMTP connectivity from the internet is not all that most organizations need. Internal SMTP usage is very
common as well, and firewalls come into play for internal scenarios as well. It's important to note here that
placing a firewall between your Exchange servers and the clients and devices that connect to them is fully
supported. The ports for client connectivity are well known and are published by Microsoft.
Real World: Microsoft's support statement for firewalling Exchange servers does not apply to Windows
Firewall. You should leave Windows Firewall enabled on your Exchange servers, and it is supported to do
so. Exchange setup creates the necessary Windows Firewall rules to allow Exchange to operate.
You can test inbound connectivity from the internet to your Exchange server on port 25 by running the
Inbound Mail Test using the Microsoft Remote Connectivity Analyzer. Alternatively, you can use Telnet to
perform a basic SMTP connectivity test. If you see the "220" response and banner from the server you're
connecting to then you know that the firewall port for SMTP is open.
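On machines where the Telnet client isn't installed, Test-NetConnection (available on Windows 8.1/Server 2012 R2 and later) gives a quick yes/no answer for the port, although it won't show you the SMTP banner. A sketch with a placeholder host name:
PS C:\> Test-NetConnection -ComputerName mail.contoso.com -Port 25
A TcpTestSucceeded value of True tells you the port is reachable; use Telnet when you also need to see the 220 banner.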
You don't necessarily have to continue with the SMTP session and issue commands to send a test email. At
this stage if all you want to prove is that SMTP connectivity is possible through the firewall, then you've
achieved that. However, problems further along the SMTP conversation may still occur. To learn how to test a
complete SMTP transaction read on to the next section.
Real World: If you Telnet to a mail server and see a string of asterisks, like ********, instead of the normal
220 response and banner, then the mail server may be located behind a Cisco firewall that is performing
SMTP inspection. This will interfere with mail flow, and should be disabled on the Cisco firewall.
You can inspect and test each of those elements separately, as we've seen in the examples provided. You can
also very efficiently test multiple factors at the same time by using some tools designed to test connectivity.
The first, and one of the best, is the Microsoft Remote Connectivity Analyzer (Figure 4-11). When you run the
Inbound SMTP Test it will send an email via each of the MX records published in DNS for your domain name.
This means that you will have tested DNS, SMTP connectivity, and mail flow all the way to the mailbox via
every inbound route that you have an MX record published for.
Another useful tool is the SMTP diagnostic at MXToolbox (Figure 4-12). This tool will analyse your server for
potential latency or performance issues, as well as validate reverse DNS and check for an open relay. Although
it does not go as far as sending an actual email message, the MXToolbox SMTP diagnostic still provides very
useful information for troubleshooting.
Watch Out for ISPs and Security Products That Block SMTP
One of the frustrating things about SMTP, and email in general, is how heavily it is abused by spammers and
malicious attackers. Mailboxes are constantly bombarded by spam, phishing attempts, malware, and attackers
looking for misconfigured email servers to exploit.
The risk of attack has led to many ISPs and network providers blocking SMTP ports. For residential customers
in particular, outbound SMTP on TCP port 25 is often blocked with the exception of the mail hosts that the ISP
operate. Customers can still send email using TCP port 587 but are generally prevented from propagating
"mass mailing worms" if their computers become infected by malware.
Aside from ISPs, it is also common for “enterprise security” products to block SMTP connectivity. Usually, these
products are installed on all of the computers and servers on a network, and if care is not taken to configure
the software on your Exchange servers correctly you may find that the security product blocks outbound SMTP
connections from Exchange, which will of course stop mail flow from working.
When you're considering a troubleshooting scenario, a basic understanding of the transport pipeline is helpful
for trying to visualize the problem. An email message does not simply pass from mailbox to mailbox; instead it
passes through multiple services, potentially across multiple servers, with specific components and connectors
being responsible for processing or passing the message along.
The categorizer is at the heart of Exchange transport. This component processes each message in the
submission queue to perform recipient resolution, routing, and any content conversion that may be necessary.
Each message can be different from the last, for example an email sent to a distribution list that has members
spread across multiple sites in the organization will have the distribution list expanded, and then be split into
multiple messages to be routed to the different sites where the mailboxes are hosted. Such a message will
look different to a simpler email message sent from one mailbox to another when you dig into logs and
message tracking.
The operation of the transport pipeline will become clearer as you become more familiar with using different
troubleshooting logs and tools to investigate email delivery problems.
Transport Services
There are multiple transport services hosted on Exchange 2013 and Exchange 2016 servers. The key difference
is that in Exchange 2013 the Front End Transport service is hosted on the Client Access server role, whereas in
Exchange 2016 it is hosted by the Mailbox server role because the Client Access role no longer exists.
Front End Transport Service – this service simply acts as a stateless proxy for inbound SMTP connections on TCP ports 25 and 587. There's no queueing, logging, or other handling of the actual message contents by the Front End Transport service.
Transport Service – this service handles all SMTP mail flow within the organization, acting as a go-
between for the Front End Transport service and the Mailbox Transport services on the same or other
Exchange servers. The Transport service is the only service where queueing of messages occurs.
Mailbox Transport service – this service is actually made up of two separate services: the Mailbox Transport Submission service, which is responsible for connecting to mailbox databases to retrieve messages and submit them to the submission queue on the Transport service, and the Mailbox Transport Delivery service, which receives messages from the Transport service and delivers them to the local mailbox databases.
Exchange 2013 and Exchange 2016 Edge Transport servers use a Transport service that is similar to the
Transport service on the Mailbox server role.
The most common cause of a Transport service failing to start is a corrupt transport database. If you suspect
the transport database is at fault, then you can move the database files to a different location (or simply
rename the folder), then start the Transport service again. A new, empty database will be created.
Note: If you move or remove the database then any messages that were still queued for delivery will be
lost. The transport database can sometimes be repaired with the ESEUTIL program in the same way as
mailbox databases, which is discussed in chapter 9. If you move a repaired database back to its original
location and start the Transport service again, then any messages queued in the database will be
retried for delivery.
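A sketch of that procedure, assuming the default queue database location under the Exchange install path:
# Stop the Transport service, move the existing queue database aside, then restart the service
Stop-Service MSExchangeTransport
Rename-Item "${env:ExchangeInstallPath}TransportRoles\data\Queue" "Queue.old"
Start-Service MSExchangeTransport
A new, empty queue database is created in the original location when the service starts.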
Receive Connectors
Email messages are passed between clients, servers, and transport services using connectors. As the names
make quite obvious, receive connectors are used to receive messages, and send connectors are used to send
messages.
Every Exchange 2013 or Exchange 2016 server is configured with a set of default receive connectors that
follow a standard naming convention that includes the server name.
Three default receive connectors are homed on the Front End Transport service:
Default Frontend SERVERNAME – this connector listens on TCP port 25 to accept SMTP connections,
and acts as the entry point for email into the Exchange organization. It is pre-configured to perform
that role, and there are few scenarios in which modifying this connector is appropriate. As a general
rule I recommend you do not disable, modify, or remove it. Think of this receive connector as the
connector of last resort.
Client Frontend SERVERNAME – this connector requires client authentication, accepts secure SMTP
connections (with TLS applied), and listens on port TCP 587.
Outbound Proxy Frontend SERVERNAME – this connector listens on TCP port 717 and is only used
when a send connector is configured to proxy outbound email through the front end servers or
services. This is an optional configuration, and not very common.
Two default receive connectors are homed on the Transport (Hub) service:
Default SERVERNAME – this connector accepts connections from other Mailbox and Edge Transport
servers using TCP port 2525. Clients do not connect directly to this connector, and there are no
reasons for you to ever disable, modify, or remove this connector.
Client Proxy SERVERNAME – this connector accepts connections from Front End Transport services on
the same server, or on other servers, using TCP port 465. Again, clients do not connect directly to this
connector, and you should not disable, modify, or remove it.
Receive connector conflicts are a common source of Exchange Server problems. There are two ways that
receive connector conflicts can occur.
The first is when two receive connectors homed on different services are configured to listen on the same IP
address and port number. This will cause one or both of the services to fail to start, and an event log entry will
be logged to the Application event log reporting the conflict. To avoid this problem, any receive connectors
that you create should be homed on the Front End Transport service, where they can all listen on the same IP
address and port number without conflicting. Unfortunately, the default role selected during the new receive
connector wizard is "Hub", which is the wrong one.
Fortunately, the most recent builds of Exchange 2013, and all builds of Exchange 2016, stop you from creating
a receive connector conflict. However, you may still encounter the problem on older installations of Exchange
2013.
Another problem caused by two receive connectors that have an IP address and port conflict is that the
installation of cumulative updates may fail, because a check for conflicting receive connectors is performed
during the upgrade process. Unfortunately, when this occurs it leaves the server in a non-working state,
because setup fails after it has already removed the existing Exchange installation.
In amongst all of that error information is the clue that you need to find the conflicting connector. In this
example the connector is named "Test" and is configured on the server "EX2013SRV2".
The values that you specified for the Bindings and RemoteIPRanges parameters conflict with the settings
on Receive connector “EX2013SRV2\Test”. Receive connectors assigned to different Transport roles on a
single server must listen on unique local IP address & port bindings.
To fix the conflict you can simply run Set-ReceiveConnector to modify the transport role that the connector is
homed on.
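Using the conflicting connector from the error above as an example, the command would look something like this:
PS C:\> Set-ReceiveConnector "EX2013SRV2\Test" -TransportRole FrontendTransport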
If you only have one Exchange server in the organization, you won't be able to use the Exchange Management
Shell (EMS) to run that command. Instead, you will need to use the ADSIEdit utility to adjust the connector's
configuration. Follow these steps.
The second common problem scenario is when two receive connectors have overlapping remote network IP
ranges (Figure 4-14), which can cause an inbound SMTP connection to be handled by a receive connector that
you were not expecting. A common example is when a custom receive connector has been configured to allow
SMTP relay for specific applications and devices on the network. However, when you attempt to relay
messages to external recipients from those applications or devices, the message is rejected.
The key to understanding overlapping remote network IP ranges, which are normal and not necessarily a
problem, is that the most specific match will win. The "Default Front End" connector on the server has a
remote IP range of "0.0.0.0-255.255.255.255", which effectively means "anything". The "Default Front End"
connector will therefore handle any incoming SMTP connection, unless another connector has a more
specific match in its remote IP ranges for the source IP address of the connection.
That source IP might change if the application or device is connecting from behind a firewall or NAT device, if
the device is DHCP-enabled, or if the device has multiple IP addresses. Care must be taken to define the
correct IP addresses in the remote IP ranges of any custom receive connectors.
Warning: Some administrators add entire subnets or large ranges of IP addresses to their custom
connectors. For example, they might add the entire server VLAN, so that any server in their datacenter
can use Exchange for SMTP relay. This is problematic if that IP range also includes other Exchange
servers, because it may result in Exchange server-to-server communications breaking if those connections are handled by the custom relay connector instead of the default connectors.
When you suspect that the wrong receive connector is handling SMTP connections from a particular source,
there are two techniques you can use to troubleshoot the situation. The first is to modify the SMTP banner of
the receive connectors so that each one is unique. This can be easily applied by running two PowerShell
commands to set the SMTP banner on each connector so that it displays the receive connector name,
demonstrated in Jeff Guillet's blog post.
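As a sketch, assuming a custom relay connector named "Relay" alongside the default front end connector on server EX2013SRV2 (note that the banner text must begin with "220"):
PS C:\> Set-ReceiveConnector "EX2013SRV2\Default Frontend EX2013SRV2" -Banner "220 Default Frontend EX2013SRV2"
PS C:\> Set-ReceiveConnector "EX2013SRV2\Relay" -Banner "220 Relay connector"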
The next time that you Telnet to the Exchange server from the device or server that you're troubleshooting
SMTP relay for, you'll see the new value in the SMTP banner telling you the connector name that has accepted
your Telnet connection.
C:\>telnet ex2013srv2 25
If you see an unexpected receive connector name, you can then revisit your configuration to troubleshoot
further.
The other technique is to enable protocol logging on the receive connector, which is discussed in a later
section of this chapter.
Send Connectors
In contrast to receive connectors, no send connectors are configured by default for an Exchange Server
installation. All send connectors must be manually configured, unless they are automatically configured by an
Edge Subscription, and serve to route email messages to servers outside the Exchange organization.
Send connectors rely on several elements that we've already discussed in this chapter:
Network connectivity
DNS (if MX records are used for delivery instead of a smart host)
Firewalls
In other words, a transport server that is trying to send email must be able to resolve the recipient's domain in
DNS (if MX records are being used), and then connect to the smart host or MX for that domain across the
network and internet through any firewalls.
In addition, send connectors can be enabled for protocol logging to assist you with troubleshooting, which is
discussed in a later section of this chapter.
Send connectors have three attributes that control whether outbound email will be sent using that connector.
The first two attributes are the cost, and the address space.
The cost of the send connector is included in the aggregate cost calculated by Exchange when determining each possible route to a destination. The cost determines the priority order in which the Exchange server will try the available routes. If two routes to the destination address space have the same aggregate cost, then the cost of the send connector itself is used as a tie-breaker. If two send connectors with equal costs exist, then the connector with the alphabetically lower name is chosen.
Note: Aggregate cost refers to the total cost of all site links and connectors on the route to a
destination. This is more relevant for multi-site environments where internal Exchange servers may
need to route email via other internal servers to reach the internet. This includes AD Site links, which
you can view by running Get-ADSiteLink. Exchange will use the ADCost of an AD Site link, unless a
specific ExchangeCost has been configured.
The address space defines the domain names that a send connector can be used to send email to. At least one
send connector with an address space of "*" is necessary for outbound email to the internet to work.
Alternatively, a send connector can have a specific address space configured on it, and will only be used for
email addressed to that namespace. This is commonly used to control mail flow to specific partners, or to
internal systems that use a domain name that is not publicly resolvable.
Address space matching takes precedence over other factors. In the example shown in Figure 4-15, messages
to the contoso.com domain will be handled by Send Connector 2, even though it has a higher cost, because it
matches the address space.
Once a send connector has been chosen for delivering an email message that decision is final. Exchange will
not attempt to choose another send connector, even if the email message is stuck in a queue, unless you
restart the transport service, which will cause all messages in the queue to be re-evaluated for routing.
Troubleshooting transport queues is covered later in this chapter.
A common problem scenario is when two send connectors have been configured with matching namespaces
(such as "*" or "contoso.com") and matching costs, with the expectation that Exchange will load balance email
across both connectors. This is not correct. In the case of two otherwise matching send connectors, the
connector with the alphabetically lower name is always used. If for some reason that connector can't deliver
the email message, for example because the smart host configured on that send connector is unreachable,
then all email messages will be stuck in a queue instead of "failing over" to the other matching send
connector.
Real World: If you need outbound email to load balance or fail over between multiple smart hosts,
then you should configure one send connector with all of the smart hosts added to that connector. Do
not configure one send connector per smart host.
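For example, a single internet send connector pointed at two smart hosts would be configured along these lines (the connector name and host names are placeholders):
PS C:\> Set-SendConnector "Internet" -SmartHosts smtp01.contoso.com,smtp02.contoso.com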
Troubleshooting Transport
Several logs and tools can be used to troubleshoot transport. Keep in mind that, as I described at the start
of this chapter, each transport or email delivery problem can be quite different from others. This means that
the logs or tools that you use will vary in usefulness depending on the scenario.
Protocol Logging
Protocol logs capture the SMTP communications that occur between servers. The information that is written to
the protocol log files looks very similar (Figure 4-16) to what you see when you are using Telnet to make an
SMTP connection.
The information in protocol logs is invaluable in troubleshooting scenarios, because it captures events that
occur during message delivery that may not appear in other logs on the server.
For example, many administrators are used to looking in message tracking logs when they troubleshoot email
delivery. Message tracking logs only record events for messages once they are in the transport pipeline. If a
message is never sent/received because the SMTP connection itself is rejected, the message tracking log will
show no useful troubleshooting information.
There are two parts to the configuration of protocol logging in Exchange and they are basically the same
across all versions from Exchange 2007 onwards. First, there are the per-service protocol log settings on the
server roles that host transport services. The per-service settings are configured automatically with the following
default settings:
Maximum log age of 30 days. This effectively means a retention period of 30 days, with any log files
older than 30 days being automatically deleted from the server's disks.
Maximum log directory size of 250 MB. If the maximum directory size limit is reached before the
maximum log age limit, then the oldest log files will be removed from the server's disks. This means
that it's possible to find fewer log files than you might expect if you were judging by the log age
alone. The maximum directory size is intended to prevent the server's disks from filling up with
protocol log files.
A log path that is located within the Exchange install directory.
Each transport service has its own protocol log settings. This means that:
Exchange 2010 Hub Transport servers have one set of "Hub" protocol log files and settings.
Exchange 2013 Client Access servers have one set of "Frontend" protocol log files and settings.
Exchange 2013 Mailbox servers have one set of "Hub" protocol log files and settings (even though it is
referred to as the "Transport" service).
Exchange 2013 multi-role servers have both "Frontend" and "Hub" protocol log files and settings.
Exchange 2016 Mailbox servers also have both "Frontend" and "Hub" protocol log files and settings.
Any Edge Transport server has one set of "Edge" protocol log files and settings.
To expand even further on that, each transport service has separate "Receive" and "Send" protocol log files
and settings. To demonstrate, here's an example of an Exchange 2013 Frontend Transport service's protocol
log settings.
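The settings below can be retrieved with a command along these lines; the server name EX2013SRV1 is simply an example from this environment, and the wildcard filter is one way to limit the output to the protocol-related properties:
[PS] C:\>Get-FrontendTransportService EX2013SRV1 | Format-List *Protocol*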
ExternalDNSProtocolOption             : Any
InternalDNSProtocolOption             : Any
IntraOrgConnectorProtocolLoggingLevel : Verbose
ReceiveProtocolLogMaxAge              : 30.00:00:00
ReceiveProtocolLogMaxDirectorySize    : 250 MB (262,144,000 bytes)
ReceiveProtocolLogMaxFileSize         : 10 MB (10,485,760 bytes)
ReceiveProtocolLogPath                : C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\FrontEnd\ProtocolLog\SmtpReceive
SendProtocolLogMaxAge                 : 30.00:00:00
SendProtocolLogMaxDirectorySize       : 250 MB (262,144,000 bytes)
SendProtocolLogMaxFileSize            : 10 MB (10,485,760 bytes)
SendProtocolLogPath                   : C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\FrontEnd\ProtocolLog\SmtpSend
Of the available per-service settings that you can modify, the most likely ones that you will adjust are the
maximum age and maximum directory size. If you need to retain a very large quantity of protocol log files, you
may also consider moving the log path to a different volume.
Here's an example of increasing the maximum log age to 60 days, and the maximum directory size to 1 GB.
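A command along the following lines will apply those values; it assumes the Transport service on an Exchange 2013/2016 Mailbox server named EX2016SRV1 (use Set-FrontendTransportService for the Frontend Transport service, or Set-TransportServer on Exchange 2010):
[PS] C:\>Set-TransportService EX2016SRV1 -ReceiveProtocolLogMaxAge 60.00:00:00 -ReceiveProtocolLogMaxDirectorySize 1GB -SendProtocolLogMaxAge 60.00:00:00 -SendProtocolLogMaxDirectorySize 1GB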
Ultimately it will depend on your environment how much protocol log data you need to retain, which will be
the basis of any changes you make to the configuration.
Real World: The servers where protocol log data is the most valuable are those that sit on a critical
path for mail flow in your organization. Examples include the internet-facing servers and any servers
that are targets for internal applications that need SMTP access. On those servers you'll often need to
use protocol log data to prove or disprove a mail flow issue.
After the per-server settings for protocol logging you also need to look at the per-connector settings. Each
send or receive connector in your organization has two possible settings for protocol logging: verbose or none.
Protocol logging is enabled using the Set-SendConnector or Set-ReceiveConnector cmdlets. For example, to
enable verbose logging for all receive connectors on a server you can run the following command.
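A sketch of such a command, assuming a server named EX2016SRV1:
[PS] C:\>Get-ReceiveConnector -Server EX2016SRV1 | Set-ReceiveConnector -ProtocolLoggingLevel Verbose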
When you need to use protocol log data to troubleshoot a mail flow issue, one of the simplest approaches is
to search the log files using PowerShell. Here's an example scenario. In this scenario an email has been sent by
[email protected] to [email protected] within the last 24 hours, and we want to
determine what happened during the SMTP session from Gmail's servers to the Edge Transport server in the
Exchange organization.
On the Edge Transport server, we can open PowerShell and change to the directory containing the receive
protocol log files.
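One way to change to that folder, assuming the Exchange 2013 or later cmdlet names (on Exchange 2010 the equivalent is Get-TransportServer), is to read the configured path from the server settings rather than typing it out:
[PS] C:\>Set-Location (Get-TransportService $env:COMPUTERNAME).ReceiveProtocolLogPath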
From that location we can ingest the contents of the last 24 hours' protocol log files into a variable.
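A minimal sketch of that step:
[PS] C:\>$logs = Get-ChildItem *.log | Where-Object {$_.LastWriteTime -gt (Get-Date).AddHours(-24)} | Get-Content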
Now we can parse the data in $logs for the sender's email address.
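Select-String handles this well; the Gmail address below is simply a stand-in for the actual sender's address:
[PS] C:\>$logs | Select-String -SimpleMatch "sender@gmail.com"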
We get one hit, which provides a key piece of information to locate the rest of the log data for that SMTP
session. Each protocol log line uses the following fields:
date-time,connector-id,session-id,sequence-number,local-endpoint,remote-endpoint,event,data,context
Since every SMTP session has a unique "session-id", we can parse the $logs data for all of the lines that include
the session ID that we're interested in, which in this case is "08D3051E24C97D16".
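For example:
[PS] C:\>$logs | Select-String -SimpleMatch "08D3051E24C97D16"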
When you do that for the first time you'll notice that quite a lot of data is output, and sometimes a PowerShell
window (Figure 4-17) is not the easiest place to read it due to wrapping of lines.
Sending the output to a text file to open in Notepad is sometimes a better choice.
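For example (the C:\Temp path is just an illustration):
[PS] C:\>$logs | Select-String -SimpleMatch "08D3051E24C97D16" | Out-File C:\Temp\session.txt
[PS] C:\>notepad C:\Temp\session.txt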
If the protocol logs show that an SMTP session was successful and that the email message was accepted by
the server, then the next step in troubleshooting would be to look further into the pipeline using message
tracking.
Warning: The IMAP and POP services on Exchange servers also have protocol logging available.
However, unlike Transport protocol logging, IMAP and POP do not have maximum directory size
thresholds. Which means that the IMAP or POP protocol log data will grow with no limits, eventually
consuming all available disk space on the volume. As such, you should not leave IMAP or POP protocol
logging enabled, unless you have implemented a separate process for removing old log files.
Message Tracking
Message tracking is an Exchange feature that records detailed log files of email traffic as messages are
transferred between Exchange servers within the organization, and between different roles, services and
components on individual servers. In other words, message tracking logs record what happens to a message
as it passes through the transport pipeline.
Here is an example of how the default configuration looks when queried using PowerShell.
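The output below can be retrieved with a command along these lines, assuming an Exchange 2013/2016 server named EX2016SRV1 (on Exchange 2010 the equivalent cmdlet is Get-TransportServer):
[PS] C:\>Get-TransportService EX2016SRV1 | Format-List MessageTracking*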
MessageTrackingLogEnabled               : True
MessageTrackingLogMaxAge                : 30.00:00:00
MessageTrackingLogMaxDirectorySize      : 1000 MB (1,048,576,000 bytes)
MessageTrackingLogMaxFileSize           : 10 MB (10,485,760 bytes)
MessageTrackingLogPath                  : C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\MessageTracking
MessageTrackingLogSubjectLoggingEnabled : True
Similar to protocol logs discussed earlier in this chapter, the message tracking logs have both a maximum log
age, and a maximum directory size, which work together to prevent the log files from consuming all available
free disk space on the volume. However, the maximum directory size applies to each different type of message
tracking log file, and there are four types of log files on Exchange 2013 and Exchange 2016. One of the log
types is infrequently used, so Microsoft recommends that you apply a 3x multiplier to the maximum directory
size value when you are calculating the amount of disk space it will actually use. In other words, if the
maximum log directory size is configured for 1 GB, allow for 3 GB of log files to accumulate.
Note: In Exchange 2010 the Hub Transport server is configured with Set-TransportServer and the
Mailbox server is configured with Set-MailboxServer. However, if both roles are installed on the same
Exchange 2010 server, as they are in the examples above, there is only one message tracking
configuration for the entire server, not one for each separate role. Both sets of cmdlets can be used to
get or set the message tracking configuration for the entire Exchange 2010 server. In Exchange 2013
and Exchange 2016 the services previously associated with the Hub Transport server role now reside on
the Mailbox server role. This means that Get-TransportService/Set-TransportService and Get-
MailboxServer/Set-MailboxServer can be used in Exchange 2013 and will achieve the same outcomes.
In most environments the default message tracking configuration is likely to be modified to retain more log
data, to allow historical searches to go back further in time. For example, you might choose to increase the
message tracking log maximum age to 90 days, and increase the maximum directory size to 4 GB. Remember,
that would equate to 12 GB of potential log data. If you need to analyse your servers for consistency in their
message tracking log configurations, you can use the Get-MessageTrackingConfig.ps1 script.
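As an illustration of such a change, assuming an Exchange 2013/2016 server named EX2016SRV1:
[PS] C:\>Set-TransportService EX2016SRV1 -MessageTrackingLogMaxAge 90.00:00:00 -MessageTrackingLogMaxDirectorySize 4GB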
The message tracking log files themselves are simply text files in comma-separated value (CSV) format (Figure
4-18). The log files can be read in applications such as Notepad or Excel if you have a need to view their
contents. The data fields used for the CSV files include the date/time of the event, client details, server details,
event IDs, sender and recipient information, and more. You can read a complete list of the message tracking
log fields on TechNet.
Although the message tracking log files are human-readable, the best way to search message tracking logs is
by running message tracking searches using PowerShell. Message tracking log searches are performed in the
Exchange Management Shell by running the Get-MessageTrackingLog cmdlet. You can run that cmdlet with no
parameters on any Transport or Mailbox server that is enabled for message tracking, and it will return the first
1000 results of the log entries on that server.
You can also search a remote server using the -Server parameter. This is useful when you are running the
search from your own admin workstation or a separate management server, instead of while logged on
directly to an Exchange server. In fact, there is no real reason why you need to run the searches directly on an
Exchange server. Using the management tools remotely is far more convenient, as well as avoiding problems
such as very large searches consuming a lot of memory and impacting server performance.
The Get-MessageTrackingLog cmdlet also accepts input from the pipeline. This is a very convenient way to
perform searches on multiple servers at once. For example, to search all Mailbox servers at once:
[PS] C:\>Get-MailboxServer | Get-MessageTrackingLog
The default output for Get-MessageTrackingLog presents the information in a table with just a few properties
shown. Displaying the output in a list instead, using the Format-List cmdlet, gives you more details to look at.
[PS] C:\>Get-MailboxServer | Get-MessageTrackingLog | Format-List
When you're performing investigative searches of your message tracking logs, particularly across multiple servers,
those queries can take a long time to return the results. If you then find that you need to adjust the query,
for example to be more specific, or to format the results in a different way, you have to wait a long time for
the query to run a second time as well. For this reason, I recommend that you always collect your query results
into a variable, particularly very broad queries that take a long time to run, so that you can pick apart the
collected data without having to re-run the query.
For example, if we want to investigate reports of email problems sending to Alan Reid, we can run one broad
query across all Hub Transport servers and collect the results in a variable I will call $msgs.
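A sketch of such a query, using a hypothetical address for Alan Reid (on Exchange 2013/2016 you would pipe Get-MailboxServer instead):
[PS] C:\>$msgs = Get-TransportServer | Get-MessageTrackingLog -Recipients alan.reid@exchangeserverpro.net -ResultSize Unlimited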
The $msgs variable now contains thousands of message tracking log entries that can be dissected in different
ways without re-running that first, time-consuming query. For example, to find the top 10 senders to Alan
Reid, we can simply pipe the $msgs variable into further PowerShell cmdlets and it will return a result within
seconds instead of potentially several minutes.
[PS] C:\>$msgs | Group-Object -Property Sender | Select-Object name,count | sort count -desc | select -first 10
Name Count
---- -----
[email protected] 110
[email protected] 108
[email protected] 104
[email protected] 102
[email protected] 100
[email protected] 100
[email protected] 96
[email protected] 96
[email protected] 96
[email protected] 96
The information shown in the example above is interesting, but doesn't help us to actually track an email
message through the transport pipeline. To track an email message, you will need to search on message
characteristics such as the sender, recipient, or the time that it was sent. Although you can filter the log data
that you've captured in a variable by piping it to the Where-Object cmdlet (which I'll demonstrate later in this
section), it is generally better to run the most precise query you can, based on the information that you have
about the message you are tracking, and then filter a smaller set of results as needed.
To filter message tracking log searches by time or date ranges you can use the –Start and –End parameters for
Get-MessageTrackingLog.
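For example, with purely illustrative dates and times:
[PS] C:\>Get-MessageTrackingLog -Start "1/11/2016 09:00:00" -End "1/11/2016 17:00:00"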
If you provide a start date for the search but do not provide an end date, the search will run from the start
date to the most recent available log entries.
Although these parameters accept standard date/time formatted values as shown in the example above, it is
often simpler to pair their usage with the Get-Date cmdlet to search on relative times instead.
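For example, to search the last 24 hours:
[PS] C:\>Get-MessageTrackingLog -Start (Get-Date).AddHours(-24)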
You can also search message tracking logs based on the sender and recipients of the email message.
Unfortunately, sender or recipient-based search criteria do not accept wildcards or partial matches. This means
that if you want to search for all email messages sent from Gmail addresses to a particular user, then you
would need to run a query for all messages sent to the user, then filter those results with Where-Object.
[PS] C:\> $msgs = Get-MessageTrackingLog –Start (Get-Date).AddDays(-3) –Recipients [email protected] –Resultsize Unlimited
Notice also in the example above that I've combined a date-based search with a recipient-based search, and
used the –Resultsize parameter to ensure that I receive all results and not just the first 1000. You can combine
multiple search parameters to make your message tracking log searches very precise.
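Continuing the scenario above, the collected results could then be narrowed to Gmail senders with Where-Object; a minimal sketch:
[PS] C:\>$msgs | Where-Object {$_.Sender -like "*@gmail.com"}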
Unlike sender and recipient-based searches, you can search for email messages based on subject lines using
the –MessageSubject parameter and get partial-matches in the results.
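For example, using a purely illustrative subject line:
[PS] C:\>Get-MessageTrackingLog -MessageSubject "Quarterly report" -Start (Get-Date).AddDays(-7)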
Despite all the examples so far you might still be wondering how you can search for all message tracking log
entries for one specific message. That can be achieved by searching with the MessageID parameter. The
challenge is in finding the unique message ID first. Let's look at an example. First, let's retrieve all message
tracking log entries for a recipient in the last 7 days, from all transport servers in the organization.
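A sketch of that query, again using a hypothetical recipient address:
[PS] C:\>$msgs = Get-MailboxServer | Get-MessageTrackingLog -Start (Get-Date).AddDays(-7) -Recipients alan.reid@exchangeserverpro.net -ResultSize Unlimited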
Next, let's look at some information, including the unique message ID, of each entry in those results.
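For example:
[PS] C:\>$msgs | Select-Object Timestamp,EventId,Sender,MessageSubject,MessageId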
The output of that command should give you an idea of which message you want to zero in on, and you'll be
able to see the unique message ID for that message. If the message ID is getting cut off in the output of that
command, pipe it to Format-List to see the full value.
After you've determined the unique message ID you can run a new search using that attribute.
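A sketch of that search, with a made-up message ID standing in for the real value:
[PS] C:\>$msgs2 = Get-MailboxServer | Get-MessageTrackingLog -MessageId "<CAB12345abcdef@mail.gmail.com>" -ResultSize Unlimited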
Now we can look at just those events captured in the $msgs2 variable. You should always sort output by
timestamp to ensure that you're looking at the message tracking log events in the order in which they
occurred.
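For example:
[PS] C:\>$msgs2 | Sort-Object Timestamp | Select-Object Timestamp,ServerHostname,Source,EventId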
If you want to see all of the information for each log entry, pipe it to Format-List.
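For example:
[PS] C:\>$msgs2 | Sort-Object Timestamp | Format-List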
For more information on the event ID and source values for message tracking logs to help you interpret the
results of your searches, you can refer to TechNet.
Transport Queues
Exchange servers that host Transport services queue messages for delivery. This is useful for situations when
the destination server is unavailable for some reason. Rather than immediately drop or reject a message that
can't be delivered, the server will hold the message in its queue and retry at regular intervals.
This means that you will often encounter troubleshooting scenarios where you find messages sitting in
transport queues. This may be due to a variety of problems. Consider the factors involved in email delivery
that were discussed earlier in this chapter, such as network connectivity, DNS, and firewalls. If any of those
elements are experiencing a fault, misconfiguration, or interruption, then you can expect to see queued emails.
The transport queue on an Exchange server can be viewed using the Get-Queue cmdlet.
[PS] C:\>Get-Queue
In the example above you can see that one queue has 10 messages in it. We can take a closer look at that
queue by running Get-Queue and specifying the queue name.
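For example, with a hypothetical queue identity of EX2013EDGE\3:
[PS] C:\>Get-Queue EX2013EDGE\3 | Format-List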
DeliveryType             : SmartHostConnectorDelivery
NextHopDomain            : ex2013srv1.exchangeserverpro.net,ex2010srv1.exchangeserverpro.net,ex2016srv1.exchangeserverpro.net,ex2016srv2.exchangeserverpro.net
TlsDomain                :
NextHopConnector         : 623aa60a-0727-4464-ac24-f5a47ac8488d
Status                   : Retry
MessageCount             : 10
LastError                : 451 4.4.0 DNS query failed. The error was: DNS query failed with error ErrorRetry
RetryCount               : 33
LastRetryTime            : 12/27/2015 10:10:25 PM
NextRetryTime            : 12/27/2015 10:41:26 PM
FirstRetryTime           : 12/27/2015 9:03:04 AM
DeferredMessageCount     : 0
LockedMessageCount       : 0
MessageCountsPerPriority : {0, 0, 0, 0}
The problem is clearly seen; DNS queries for the next hop domains are failing, and the queue has retried 33
times since 12/27/2015 at 9:03:04 am. Could the issue have been caused by a configuration change at around
that time that impacted DNS queries from the server? That would be the logical question to answer. We
already looked at troubleshooting tips for DNS earlier in this chapter, but using nslookup to test the DNS client
on the server is just one part of it.
On an Exchange server there are multiple places that DNS can be configured:
The DNS servers configured on the network interfaces in the operating system.
The external and internal DNS lookup settings in the properties of the Exchange server itself.
The DNS settings that can optionally be configured on each send connector.
When you perform an nslookup to test DNS without explicitly specifying a DNS server IP to test against, you're
only testing the DNS servers configured on the network interfaces for the operating system. In some
environments those DNS servers may be different from the DNS servers that the Exchange Server application
uses. Exchange servers will use all available network adapters for DNS lookups by default, or if necessary you
can specify other DNS servers for Exchange to use in the properties of the server.
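If you do need to point Exchange at specific DNS servers, this can also be done from the shell; a sketch with an illustrative server name and IP addresses:
[PS] C:\>Set-TransportService EX2016SRV1 -ExternalDNSAdapterEnabled $false -ExternalDNSServers 192.168.0.10,192.168.0.11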
In addition to the per-server DNS settings, each send connector can also be optionally configured to use
specific DNS servers. The setting shown in Figure 4-20 forces a send connector to use the external DNS
lookups configuration on the Exchange server, instead of the Windows Server operating system configuration.
In most situations you won't need to modify those settings, however it is important to be aware of their
existence. In some cases, you might try adding specific DNS server IP addresses to the Exchange server to see
if that resolves a suspected DNS issue.
Sometimes you may notice queued messages to a destination that you are unsure about. If the next hop
domain for a queue is one you don't recognize, then you can take a closer look at the messages in that queue
by piping Get-Queue into Get-Message.
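For example, using the same hypothetical queue identity as earlier:
[PS] C:\>Get-Queue EX2013EDGE\3 | Get-Message | Select-Object FromAddress,Subject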
In the example above the email messages look like spam. It is not unusual to see queues for messages (usually
non-delivery reports) to next hop domains that are simply spam domains. If excessive quantities of spam
messages in transport queues are causing you a disk space or server performance problem, you can remove
messages from the queue.
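A sketch of that clean-up, suppressing non-delivery reports for the removed messages:
[PS] C:\>Get-Queue EX2013EDGE\3 | Get-Message | Remove-Message -WithNDR $false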
Confirm
Are you sure you want to perform this action?
Removing the message "EX2013EDGE\3\41991895252993".
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend [?] Help (default is "Y"): a
However, even if you do not manually remove the messages they will eventually expire from the queue when
the retry intervals and maximum age limit for the server have been exceeded. You can see the retry intervals
and maximum age for a Transport service by running the Get-TransportService cmdlet.
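For example, run on the server in question:
[PS] C:\>Get-TransportService | Format-List MessageRetryInterval,MessageExpirationTimeout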
MessageRetryInterval : 00:15:00
MessageExpirationTimeout : 2.00:00:00
If you encounter a problem with an Exchange server that is causing emails to queue, but you notice that
another Exchange server in the same organization is able to send and receive messages without any problems,
then you can consider redirecting the queued messages from the faulty server to the working server. This is
normally performed as part of server maintenance, but can be used in a troubleshooting scenario as well.
There are two steps for the process of redirecting messages. The first step places the server component
"HubTransport" into a state of "Draining" so that it is considered to be an unavailable server for new
messages. The second is to redirect the messages in the queue. Note that when running Redirect-Message you
need to use the fully-qualified domain name of the target server.
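A sketch of both steps, assuming the faulty server is EX2013SRV1 and the working server is EX2016SRV1:
[PS] C:\>Set-ServerComponentState EX2013SRV1 -Component HubTransport -State Draining -Requester Maintenance
[PS] C:\>Redirect-Message -Server EX2013SRV1 -Target EX2016SRV1.exchangeserverpro.net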
Confirm
Are you sure you want to perform this action?
Redirecting messages to "EX2016SRV1.exchangeserverpro.net".
[Y] Yes [A] Yes to All [N] No [L] No to All [?] Help (default is "Y"): y
Remote Domains
Remote domains are used in an Exchange organization to specify settings for mail flow between your
Exchange organization and external domains. Remote domains can control which message formats are used
(e.g. HTML, plain text), as well as whether automatic replies and non-delivery reports are permitted. This may
result in messages being modified, or even rejected, due to remote domain configurations. Keep that in mind
when troubleshooting mail flow issues from your organization to external domains.
Back Pressure
Back pressure describes a resource exhaustion condition for Exchange servers that leads to a server actively
refusing some or all connection or email delivery attempts. The overloaded state is based on a series of
resource utilization metrics:
Free disk space on the drive(s) that store the transport queue database and logs
Uncommitted database transactions in server memory
Memory utilization by the Transport service process
Overall memory utilization for the server
Each of those metrics is individually measured, and each can cause the server to go into a back pressure state.
The server will be in one of three states:
Normal – all is well and the server is performing its role as intended.
Medium – a resource is moderately over-utilized, and the server begins limiting some connection
types.
High – a resource is severely over-utilized. The server ceases to accept any new connections.
The most obvious indication of a back pressure condition is the following response to an SMTP connection
attempt: 452 4.3.1 Insufficient system resources.
The other signs of a back pressure condition are the following entries in the Application event log:
Event ID 15004 – an increase in the utilization level for any resource (e.g. from Normal to Medium)
Event ID 15005 – a decrease in the utilization level for any resource (e.g. from Medium to Normal)
Event ID 15006 – high utilization for disk space (i.e. critically low free disk space)
Event ID 15007 – high utilization for memory
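One quick way to check for these events (the event source for back pressure entries is MSExchangeTransport) is with Get-WinEvent:
[PS] C:\>Get-WinEvent -FilterHashtable @{LogName="Application"; ProviderName="MSExchangeTransport"; Id=15004,15005,15006,15007} -MaxEvents 20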
Back pressure is ultimately a symptom of server performance issues. Server performance troubleshooting is
covered in more detail in chapter 8 of this book.
For ease of management and troubleshooting it is generally recommended to configure all of the message
size limits to the same value. However, if you have a need to set different limits, then the largest limit should
be applied at the organization level, and the smallest limit on individual recipients.
When email messages are rejected due to size limits, the non-delivery report will state that quite clearly. You
can then inspect your message size limits, or consider the limits that an external recipient may have in place,
to determine which one caused the email to be rejected.
Transport rule actions are usually identifiable in message tracking log searches, which was covered earlier in
this chapter.
Block lists
Anti-spam block lists can cause your email to external recipients to be rejected. You can monitor for your
domain being added to block lists by using services such as MXToolbox (Figure 4-21). Or you can run manual
queries if you suspect that your emails may be getting rejected due to being listed on a block list.
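Most public block lists can be queried with a simple DNS lookup of your sending IP address in reverse octet order; for example, to check a hypothetical public IP of 1.2.3.4 against one well-known list:
[PS] C:\>nslookup 4.3.2.1.zen.spamhaus.org
A listed address returns a 127.0.0.x answer, while an unlisted address returns a "non-existent domain" response.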
Not all email servers use a block list, and some don't use publicly available block lists that would show up on a
search such as that shown above. Instead they use private lists, or aggregated lists that combine several block
list sources. In any case, if you are being blocked due to a block list then you will usually receive a non-delivery
report that makes that clear. Your only action when you are included on a block list is to address whatever
reason caused you to be listed, and then apply for de-listing.
Real World: Sending email via a smart host usually means you are sharing that smart host with other
customers, some of which may engage in undesirable email behaviour from time to time. If your smart
host is blocklisted due to the actions of another customer, then you will likely notice your own emails
getting rejected as well. This is one of the perils of using a third party for your email delivery, and you
should always have a backup plan (such as another smart host, or using direct delivery) in case such an
incident occurs.
SMTP Connectivity
Get-PublicIPAddress.ps1 PowerShell Script
Network ports for clients and mail flow in Exchange 2013
You Cannot Send or Receive E-Mail Messages Behind a Cisco PIX Firewall
Configuring Unique Receive Connector SMTP Banners in Exchange Server
Troubleshooting Transport
Protocol Logging in Exchange Server 2013/2016
Message Tracking in Exchange Server 2013/2016
Transport Queues in Exchange Server 2013/2016
Understanding Back Pressure
The Mailbox server role occupies the center of Exchange and refers to the services that host mailbox
databases and provide high availability for mailbox databases in a Database Availability Group (DAG). The
Mailbox server role also performs a number of other Client Access and Transport tasks that are outside of the
scope of this chapter (see Chapters 3 and 4). In addition, the internal workings of mailbox databases such as
how transaction logging works, and the use of ESEUTIL for recovery scenarios, are covered in detail in chapter
9.
The Microsoft Exchange Replication service (MSExchangeRepl) runs as the MSExchangeRepl.exe process,
and is responsible for telling the Information Store process to mount and dismount databases, detect
database failures, and initiate recovery action for any database failures.
The Microsoft Exchange Information Store service (MSExchangeIS) runs the controller process
Microsoft.Exchange.Store.Service.exe, managing each of the worker process lifecycles by performing
the mount and dismount operations as instructed by the Replication service, or by terminating worker
processes if there is a database failover event in a database availability group.
The Microsoft.Exchange.Store.Worker.exe worker processes do not exist as Windows services. Each worker
process is isolated to performing RPC operations for a single database when the database is mounted,
and providing the database cache for that database. If a database is dismounted, the worker process
is terminated. If a worker process fails, the Information Store can start another worker process for that
database.
That relationship between the Replication service, Store service, and the worker processes, exists on both
standalone Mailbox servers and Mailbox servers that are members of a DAG. However, for a standalone
Mailbox server there are limits as to what recovery actions can be performed on a failed database, since there
are no other database copies to failover to.
Figure 5-1: Application event log entries logged by the Replication service
For cases where a service has simply crashed or been stopped inadvertently, a Start-Service command will start
it again.
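For example, for the Information Store service:
[PS] C:\>Start-Service MSExchangeIS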
The most common causes of an Information Store service stopping, or failing to start when you issue a Start-
Service command, are:
Service dependency issues. Another service that the Information Store depends on, such as
MSExchangeADTopology, can't or won't start.
Service corruption issues. A software corruption has occurred, requiring the Exchange server
application to be repaired or reinstalled.
In any case, the Application event log is a good place to start looking for clues as to why the services won't
start. Even when Information Store services do start, the databases themselves may not automatically mount.
Database Mounting
Each mailbox database has an attribute, MountAtStartup, that determines whether the database should
automatically mount when the Store service starts.
The Information Store will also check the last mount state of the database at start up. If the database was in a
dismounted state because an administrator had dismounted it, and the Information Store service is restarted,
for example during an operating system restart, then it will not attempt to mount the database on the next
start up. In the example below, the database DB03 is configured to mount at start up, but was left dismounted.
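One way to see both values, using the database name from this example:
[PS] C:\>Get-MailboxDatabase DB03 -Status | Format-List Name,MountAtStartup,Mounted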
The rationale is that the Exchange administrator had deliberately dismounted the database for a good reason,
and so it should be left dismounted. This decision is logged in the Application event log as an error (Figure 5-
2), so you will be able to accurately determine the root cause of a database not being mounted in those conditions.
Figure 5-2: The Replication service logs its decision to the Application event log
If a database is configured to not mount at start up, or if the Replication service determines that it should not
be mounted, then it can still be manually mounted by running the Mount-Database cmdlet.
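For example:
[PS] C:\>Mount-Database DB03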
For databases that will not mount there are some common root causes:
In all cases, the Application event log is where you should look first for a clear indication of why the database
won't mount.
However, if the transaction log files unexpectedly grow in size before a backup is complete, the log drive may
run out of free disk space, which will cause the database to dismount. Extending the storage volume to allow
the database to be mounted, and then running a backup to truncate the logs, is a common resolution to this
problem scenario. Enabling circular logging is also often considered, although it carries some risks, because all
of the transaction logs will be truncated by the server regardless of whether you've backed up the database.
With circular logging enabled, restoring a database from backup only gets you a point in time recovery from
the time of the backup, as there are no more log files to "roll forward" to bring the restored database further
up to date. As a means of restoring service when you run out of logging disk space, circular logging has its
usefulness. But don't leave it enabled without considering the downsides.
The question is then what caused the unexpected growth in transaction logs? Some causes are quite obvious,
for example, a large batch of mailbox migrations will generate a significant amount of transaction logging on
the destination mailbox database. Other causes are less obvious, such as rogue users or applications.
Historically, there have been some famous incidents in which a buggy version of Apple's iOS operating system
generated excessive transaction logs due to a calendar sync issue. Such problems will occur from time to time,
and are difficult to track down when they do.
Microsoft provides guidance (17 steps in fact) in a useful blog post on the topic. The guidance is written
primarily for Exchange 2007 and 2010, but still generally applies to Exchange 2013 and 2016. Just be sure to
use the compatible version of tools such as Exmon.
Rapid growth of the mailbox database file itself presents a similar issue to the log files. If the database volume
runs out of free disk space, then the database will dismount. Again, extending the volume to allow enough
free disk space to mount the database is the usual solution, and then some remedial action can be taken to
reduce the growth rate of the database file.
Ultimately, rapid database growth is a problem of "too much data in, not enough data out". If users are
receiving new email at a rate that exceeds the deletion of unwanted email, then the database will grow. This is
a fact of life that most Exchange administrators simply need to deal with. Users don't like deleting email, and
even when they do it's common to find users who never empty their deleted items folder.
When investigating database growth, look for things such as:
Unusually large mailbox users, who might be good candidates for archiving, or for moving to a
different database.
Mailboxes with large amounts of deleted items, which you can consider purging. A Group Policy to
automatically empty deleted items when Outlook closes is an effective way to do this, though it is
often unpopular with users who expect to be able to find things in their deleted items folder.
Running a PowerShell script such as Get-MailboxReport.ps1 will show you the mailbox statistics, including
total mailbox size, and deleted item size (Figure 5-3).
Figure 5-3: Analyze mailbox statistics to determine the cause of database growth
Of course, a mailbox size report is merely a snapshot in time, and won't tell you which mailboxes are growing
the fastest. For that type of analysis, you will need to run multiple reports over a period of time and compare
the numbers, or invest in a monitoring system that tracks user mailbox growth trends.
Mailbox databases won't shrink even if you remove data from them, but they will reclaim the unused database
pages as available new mailbox space (which we often refer to as "whitespace" even though that term carries a
different meaning to the Exchange product team), which will be used for any new data written to the database
without immediately increasing the overall EDB file size. However, if you do wish to reduce the EDB file size so
that the used disk space can be recovered, there are two options available to you:
An offline defrag. This is generally not recommended because it requires an outage while the defrag
operation is performed, which can take a long time for very large database files.
Mailbox migrations. By moving all of the mailboxes in the database to a new database, the new
database file will grow only to the size needed to host the data in those mailboxes. None of available
new mailbox space is migrated, so the new file will be smaller (Figure 5-4). The old database can then
be removed to reclaim that disk space.
Content Indexing
Exchange Search provides the capability for end users to quickly search their mailbox contents and locate
items. In addition, Exchange Search enables in-place eDiscovery to occur by allowing authorized users (such as
auditors) to search for content across multiple mailboxes in the organization. The search capability relies on
the content index (also sometimes referred to as the catalog), which is built by the Microsoft Search
Foundation content indexing engine. The Search Foundation is able to handle most of the common file
formats that are found in email attachments these days, including Microsoft Office files, PDFs, HTML, and plain
text. A complete list of file formats can be seen by running the Get-SearchDocumentFormat cmdlet.
For performance and efficiency, indexing of email messages and attachments occurs both in the transport
pipeline, and in mailbox databases. One content index is maintained per mailbox database. The health of the
content index for each mailbox database can be seen by running the Get-MailboxDatabaseCopyStatus cmdlet,
including for single-copy databases that are not hosted by a DAG member.
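For example:
[PS] C:\>Get-MailboxDatabaseCopyStatus * | Select-Object Name,Status,ContentIndexState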
A content index that is not listed as "Healthy" may not actually be unhealthy, because the state of "Failed" is
also used for content indexes of dismounted databases.
Name : DB04\EX2013SRV1
ContentIndexState : Failed
ContentIndexErrorMessage : The database has been dismounted.
Similarly, if the database has indexing disabled, which is common for databases that are hosting a journal
mailbox, then the content index state for the database will be listed as "Disabled".
Common causes of content index problems include:
Conflicts with file-system anti-virus protection. If you run an anti-virus product on your Exchange
servers, ensure you are complying with Microsoft's recommendations for file-level anti-virus
exclusions.
Very high volume of changes within a database. Journal mailboxes can cause this, as can shared
mailboxes that are used by a large number of simultaneous users.
The local copy of the mailbox used by Outlook running in cached mode is included by default in the indexing
that occurs in the Windows operating system on the client, referred to as Windows desktop search. Since this
indexing does not rely on the content index built by
Search Foundation, a cached mode Outlook user can continue to search their mailbox even when there is a
server-side index problem. On the other hand, Outlook on the web (OWA) users are entirely dependent on the
server-side content index to be able to perform searches. So are users performing eDiscovery searches.
Therefore, a good test to determine whether a content index issue may be occurring is to perform searches
using cached mode Outlook and Outlook on the web, and then compare the results.
Real World: The comparison between cached mode Outlook search, and Outlook on the web search, is
still useful even when the server-side content index is reporting as "Healthy", but you suspect an
indexing problem may be impacting your end users.
In addition to end user searches, unhealthy content indexes can cause mailbox migrations to fail if the target
database for the mailbox move has an unhealthy content index. Troubleshooting mailbox migrations is
discussed in more detail in chapter 11. Content index health is also a factor that is taken into consideration
during database failovers in a DAG, which is discussed later in this chapter.
The Test-ExchangeSearch cmdlet can be used to initiate a test of a mailbox database by generating a test
message to a health mailbox on the database, and then running a search to determine whether the new
message can be found, as well as to report how long it took for the new item to be located.
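For example, to test a specific database (the database name here is illustrative):
[PS] C:\>Test-ExchangeSearch -MailboxDatabase DB04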
Rebuilding a content index involves removing the existing content index files, which will trigger Exchange Search to re-index
that database. The re-indexing process can cause a high load on the Exchange server, which may impact
performance for the server. So you should carefully consider the timing of any content index rebuilds, and
how it might impact your end users. The content index files are located in the same path as the database EDB
file, in a sub-folder named with a GUID (Figure 5-5).
Before the corrupt index files can be removed, the Exchange Search services must be stopped. While these
services are stopped, searches in OWA will not be able to be performed by end users, and all of the database
content indexes on the server will be reported as "Failed" by Get-MailboxDatabaseCopyStatus.
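On Exchange 2013/2016 the two services involved are the Microsoft Exchange Search service and the Microsoft Exchange Search Host Controller service; a sketch of stopping them:
[PS] C:\>Stop-Service MSExchangeFastSearch
[PS] C:\>Stop-Service HostControllerService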
Next, delete the GUID-named folder that contains the content index files. If the folder will not delete due to
files in use, then it's likely that either:
After a delay while Exchange Search evaluates the situation, the database will be re-indexed. The content index
will have a state of "Crawling" while this is occurring.
DAG members – the Mailbox servers that are members of the DAG and can host database copies
within the DAG. A DAG can have up to 16 members and a File Share Witness.
Quorum – the process by which the DAG members determine whether a majority of members are
online and available. If quorum is lost, the DAG and all of the databases that it hosts may go offline.
File Share Witness – a non-DAG member that is involved in the quorum voting process to act as a tie-
breaker when the DAG has an even number of members.
Active database copies – the active copy of a database is the copy on one of the DAG members that is
mounted and actively servicing clients. There can be only one active copy of a database at any given
time.
Passive database copies – the passive copies of a database are dismounted and receive updates from
the active database copy through continuous replication. A database can have no passive copies if it is
a single-copy database, or up to 15 passive copies (due to the maximum number of DAG members
being 16).
Continuous replication – the process of replicating changes from the active database copy to the
passive database copies.
Copy queue – the queue of changes on the active database copy that are yet to replicate to a DAG
member hosting a passive copy.
Replay queue – the queue of changes to the active database copy that have replicated to a DAG
member hosting a passive copy, but have not been replayed or committed to the passive database
copy yet.
Lagged database copies – lagged copies are passive database copies that have a delay of up to 14
days configured for either the replay or truncation of transaction logs, in continuous replication.
Managed Availability
Managed Availability is the built-in, intelligent monitoring and recovery system for Exchange 2013 and later.
The goal of Managed Availability is to detect and resolve problems without requiring administrator
intervention, and before they cause a negative user experience. Simply put, Managed Availability is constantly
probing and testing the health and performance of your Exchange servers, and will take corrective action
when it detects a problem. The corrective action could be as simple as recycling an IIS application pool, or
even failing over a database. For more serious problems, Managed Availability can even "bug check" (force a
reboot) of an entire server.
Monitors – these define what data to collect about an Exchange service or feature, what is considered
"Healthy", and what actions should be taken to restore a feature to a healthy state. The data that is
collected by monitors includes direct notifications from components, results from probes, and
performance counters.
Probes – these are how monitors obtain information about the health and performance of Exchange
components so that the user experience for end users can be measured. Probes include synthetic
transactions such as sending an email to the health mailbox on a database, or testing server-to-server
connectivity over different protocols. Some probes are run by services other than the Exchange Health
Manager Service, monitoring themselves and reporting results to Managed Availability.
Responders – these take corrective actions for problems that have been identified by probes and
monitors, such as restarting a service.
There are more than a thousand probes in Managed Availability; 1,192 for an Exchange 2013 CU10 multi-role
server, and 1,474 for an Exchange 2016 RTM Mailbox server. You can expect those numbers to change over
time as new builds of Exchange are released. You can see the full list by running the Get-ServerHealth cmdlet.
A summary of the health of all of the health sets on the server can be retrieved by running the Get-
HealthReport cmdlet.
In a troubleshooting scenario you can use Managed Availability to narrow the focus of your investigation by
querying a server health report for any alerts that are not "Healthy".
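A sketch of such a query, assuming a server named EX2016SRV1 (on some earlier Exchange 2013 builds the parameter is -Server rather than -Identity):
[PS] C:\>Get-HealthReport -Identity EX2016SRV1 | Where-Object {$_.AlertValue -ne "Healthy"}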
Degraded – the monitored item has been unhealthy for less than 60 seconds. After 60 seconds of not
returning to a healthy state, it will be marked as unhealthy.
Disabled – an administrator has disabled a monitor.
Repairing – an administrator has marked a monitor or server as currently being repaired.
Unavailable – the monitor is not responding to the Health service.
In the example above, the MailboxSpace health set is unhealthy. There are 10 monitors associated with that
health set, any one of which could be the cause of the unhealthy state. We can run Get-ServerHealth for that
health set to determine why the health set is not healthy.
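For example:
[PS] C:\>Get-ServerHealth -Identity EX2016SRV1 -HealthSet MailboxSpace | Where-Object {$_.AlertValue -ne "Healthy"}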
Depending on the monitor or health set that is unhealthy, Managed Availability may place a server component
into an inactive state. Inactive server components are effectively removed from the production Exchange
environment. For example, a Hub Transport server component that is marked inactive will not participate in
any mail flow within the organization, because other Exchange servers will not send email messages to it. The
component states for a server can be viewed by running Get-ServerComponentState.
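For example:
[PS] C:\>Get-ServerComponentState EX2016SRV1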
Server component states are set by requesters. There are five available requesters:
HealthAPI – this is used by Managed Availability when it marks an unhealthy component inactive.
When the associated health sets are healthy again, Managed availability will set the component state
back to "Active" again.
Maintenance – this is used by administrators when they are performing maintenance on the server,
such as when preparing a server for monthly Windows updates or a cumulative update installation.
Functional – this is used by Exchange setup, for example during cumulative update installation.
Sidelined, and Deployment – these are generally only used by Microsoft within Exchange Online
(Office 365).
A server component will remain inactive while any requester still has an inactive state applied to it. For example, if
you set the ServerWideOffline component to inactive in preparation for a cumulative update using the
requester "Maintenance", Exchange setup will also then set the ServerWideOffline component inactive with the
requester "Functional". At the end of Exchange setup, the ServerWideOffline component will be set back to
active by the requester "Functional", but the component will remain inactive until the "Maintenance" requester
is also used to set the component active again.
This is a good thing, because we wouldn't want Exchange setup to return the component to an active state
before we've had a chance to do our own post-upgrade testing. However, it does create the potentially
confusing scenario of an administrator trying to set a component back to an active state, and then finding that
it remains inactive. In this situation it is useful to be able to check which requester still has the component set
to an inactive state.
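The component state can first be captured in a variable; a sketch assuming the server EX2016SRV1:
[PS] C:\>$ComponentState = Get-ServerComponentState EX2016SRV1 -Component ServerWideOffline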
[PS] C:\>$ComponentState.RemoteStates
In the example above, the HealthAPI requester (Managed Availability) has marked the component as inactive.
This could be due to a server health issue detected after the server maintenance was performed, and should
be investigated further.
Real World: Exchange setup uses the requester "Functional" to set the ServerWideOffline component
as inactive very early in the setup process. If setup is interrupted, such as an administrator cancelling it
before it has completed, the server component state will remain inactive for the requester "Functional"
until setup is re-run successfully, or until an administrator manually sets the server component state
active with that requester name.
Understanding the basic concepts of Managed Availability is important so that you aren't surprised when it
takes corrective action in your environment. In the very early days of Exchange 2013, there were a few
Managed Availability bugs, causing it to take corrective action for problems that did not really exist. In one
case, Managed Availability restarted Exchange 2013 CU2 DAG members for Exchange environments deployed
in multi-domain Active Directory forests, due to a bug with an Active Directory connectivity probe.
However, today Managed Availability has grown into a robust and reliable engine for Exchange high
availability deployments. In fact, if you are seeing Managed Availability take action in your environment, the
response should not be to try and stop or disable Managed Availability. Rather, you should be using Managed
Availability to gain visibility into the health of your servers, and investigating the recurring problems with the
Exchange server components themselves that are triggering Managed Availability responders. That is, unless
another unfortunate bug rears its head in a future Exchange cumulative update.
Exchange will automatically configure a DAG network for any network interfaces that it detects in the
operating system. DAG network auto-configuration relies on the following configurations on the network
adapters.
The client-facing network interface on each DAG member should be configured with:
A default gateway
At least one DNS server
The "Register this connection's addresses in DNS" option enabled
Any dedicated replication network interface on each DAG member should be configured with:
No default gateway
No DNS servers
The "Register this connection's addresses in DNS" option disabled
Static routes, if the network will span multiple IP subnets
If those conditions are not met, then DAG network auto-configuration often fails. The signs of failed network
auto-configuration are misconfigured subnets appearing in the output of Get-
DatabaseAvailabilityGroupNetwork.
Correcting the misconfigurations, and then enabling and disabling manual network configuration for the DAG
will allow auto-configuration to make another attempt. Assuming your network interfaces are correctly
configured on all DAG members following the guidelines above, auto-configuration should be successful.
Because the DAG attempts to auto-configure a network for every network interface that is available in the
operating system, it can result in DAG networks for network interfaces that you do not want Exchange to use
at all. Examples of this include dedicated backup networks, iSCSI storage networks, and out-of-band
management ports. Unwanted network interfaces can be entirely excluded from DAG networking by running
the Set-DatabaseAvailabilityGroupNetwork cmdlet with the IgnoreNetwork parameter. At the same time, it's
also advisable to rename the DAG network from its auto-assigned name, so that the purposes of the network
and the reason for excluding it from the DAG is clear.
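A sketch of both changes, with hypothetical DAG and network names (manual DAG network configuration must be enabled before DAG networks can be modified):
[PS] C:\>Set-DatabaseAvailabilityGroup DAG1 -ManualDagNetworkConfiguration $true
[PS] C:\>Set-DatabaseAvailabilityGroupNetwork DAG1\DAGNetwork03 -IgnoreNetwork $true -Name "iSCSI (Ignored)"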
During file mode replication, each transaction log file generated by the active database copy is written until it
is closed off when it reaches 1MB in size, and then copied to the DAG members hosting passive database
copies. The DAG members hosting passive database copies then replay the transaction log files into their own
copy of the database file to update it with the latest changes. File mode replication is used for the initial
seeding of a database copy, and also any time a database copy needs to "catch up" with replication.
Block mode replication works in a similar way, except that as each database transaction is written to the log
buffer on the DAG member hosting the active database copy, it is also copied to the log buffer of other DAG
members hosting passive copies. Each DAG member is then able to build its own transaction log file from the
data stored in the log buffer, instead of waiting for an entire 1MB log file to be shipped from the active
database copy. Block mode has the advantage that it reduces the amount of potential data loss if there is a
failure on the active database copy.
While replication occurs, the copy queue and replay queue for a database copy indicate the amount of
transaction log data that has not yet been copied or replayed for a passive database copy. You can view the copy
and replay queues by running Get-MailboxDatabaseCopyStatus.
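For example:
[PS] C:\>Get-MailboxDatabaseCopyStatus * | Select-Object Name,Status,CopyQueueLength,ReplayQueueLength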
High copy queue lengths are an indication that transaction log data for changes occurring on the active
database copy is not being copied to other DAG members fast enough, or not being copied at all. This
could be due to:
Suspended replication by an administrator would be visible as a status of Suspended for the database copy.
Similarly, a status of Seeding indicates that an administrator has initiated a reseed, and Resynchronizing
indicates that the system is assessing the state of the database copy after a server restart or maintenance has occurred.
Real World: You might encounter a scenario in which the copy queue length for a database copy is
9223372036854775766 (9 quintillion). This is due to a self-preservation mechanism built into the DAG
for when the cluster registry replication between DAG members is out of sync by more than 12 minutes. A
full explanation has been written up by Microsoft's Tim McMichael.
By design, lagged database copies are likely to have a high copy or replay queue length. Most lagged copies
are implemented by configuring a lag period for the replay interval, as this allows the transaction log data to
be replicated to the DAG member hosting the lagged copy, and then held there while the lag interval passes,
before it is then replayed into the passive database copy. Lagged copies are not immediately obvious when
you inspect the output of Get-MailboxDatabaseCopyStatus, because the ReplayLagStatus is not included in the
default output.
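To see it, ask for the property explicitly; the database copy identity below is illustrative:
[PS] C:\>Get-MailboxDatabaseCopyStatus DB03\EX2016SRV2 | Format-List Name,Status,ReplayLagStatus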
BCSS (Best Copy and Server Selection) takes into account multiple variables to determine which database copy
should be made active, with the goal of restoring availability of service for end users, while balancing that goal
against the risk of data loss if a database copy that is not fully up to date were to be made active. Included in this decision are:
You can view the configuration for the database copies in a DAG by running the Get-
MailboxDatabaseCopyStatus cmdlet.
Name ActivationSuspended
Activation is often suspended on database copies that have been configured as a lagged copy.
For server-wide automatic activation settings, there are two options. DatabaseCopyAutoActivationPolicy
controls the types of automatic activation that mailbox database copies on the server can perform:
Blocked – prevents the database copies on the server from automatically activating. Administrators
can still choose to manually activate database copies on the server.
IntraSiteOnly – prevents cross-site failovers from occurring by only allowing database copies to
activate in the same Active Directory site. If the Active Directory site spans two physical datacenter
locations, this setting can't prevent a failover to the other datacenter location, because Exchange is
only aware of the Active Directory site boundary and not the physical datacenter boundaries.
Unrestricted – this is the default setting and allows any database copies on the server to be
considered for failover by BCSS, unless the specific database copy is suspended from activation.
You can view the automatic activation policy for Mailbox servers by running the Get-MailboxServer cmdlet.
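For example:
[PS] C:\>Get-MailboxServer | Select-Object Name,DatabaseCopyAutoActivationPolicy,DatabaseCopyActivationDisabledAndMoveNow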
In the output above you can see an additional setting for Mailbox servers named
DatabaseCopyActivationDisabledAndMoveNow. This setting strikes a comfortable middle ground between the
desire to keep active database copies running on specific servers, while not completely blocking a server from
automatic activation if no other healthy servers are available. By configuring this setting to $true, database
copies on the server will only be activated if no other healthy copies exist. Furthermore, if another healthy
database copy becomes available on another server, the DAG will perform a switchover to the other healthy
copy almost immediately.
After excluding any blocked or suspended database copies, a process called Attempt Copy Last Logs (ACLL) is
run to try and copy any missing log files from the server that was hosting the active database copy. ACLL is
useful in situations where the failure is not a full server failure and the log files are still accessible, and ensures
that database copies are as up to date as possible during the BCSS process.
Exchange 2013 and 2016 also evaluate the health of the server components that are monitored by Managed
Availability. Servers with more server components healthy are preferred to those with some unhealthy
components. Furthermore, if Managed Availability has initiated the database failover due to a failed server
component, BCSS must choose a server on which that same server component is healthy when activating
another database copy. These checks prevent databases from failing over to servers that are in a worse state
of health than the server previously hosting the active copy. Next, BCSS considers the AutoDatabaseMountDial
setting on Mailbox servers. This setting configures the threshold for the number of transaction log files that
can be missing for a database copy before it is considered un-mountable. The copy queue length is used to
determine how many log files are missing for a database copy. Copy queue length is discussed earlier in this
chapter.
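For example, a command along these lines displays the setting for each Mailbox server:
[PS] C:\> Get-MailboxServer | Format-Table Name,AutoDatabaseMountDial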
Name AutoDatabaseMountDial
---- ---------------------
EX2013SRV1 GoodAvailability
EX2010SRV1 GoodAvailability
EX2016SRV1 GoodAvailability
EX2016SRV2 GoodAvailability
The rationale for GoodAvailability and BestAvailability is that it is preferable to restore service, and then
mitigate the data loss caused by the missing log files by requesting resubmission of messages from Safety
Net, which retains copies of email messages that have already been delivered to mailboxes for a configurable
period (2 days, by default). You can verify that Safety Net is configured for your organization by running Get-
TransportConfig.
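For example:
[PS] C:\> Get-TransportConfig | Format-Table SafetyNetHoldTime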
SafetyNetHoldTime
-----------------
2.00:00:00
When AutoDatabaseMountDial is set to Lossless on any member of a DAG, database copies that are being
considered by Active Manager as failover candidates will be sorted in ascending order of activation
preference. However, when AutoDatabaseMountDial is set to GoodAvailability or BestAvailability on all
members of a DAG, database copies are sorted by their copy queue length in order of shortest to longest.
Only when two database copies have the same copy queue length will activation preference be used as a tie-
breaker. Once a decision is made, Active Manager attempts to restore service by issuing a mount command to
the most preferred database copy based on all of the BCSS criteria.
The BCSS process is complex and highly intelligent, designed to make the best decision about where
databases should fail over to when a problem occurs. This decision is often contrary to the activation
preferences that the administrator has configured on database copies, leading to confusion and a sense that
the DAG is behaving in an unpredictable way.
Some administrators are then tempted to try and control the DAG failover behaviour by suspending activation
on some database copies, or blocking activation on some DAG members. In doing so, you are only reducing
the resilience of the DAG. Instead, when you are troubleshooting database failover events, look at them in the context of
the BCSS process and try to understand why the DAG made the decision that it did. It may be that you can
improve the chances of the failovers occurring in line with your preferences through better monitoring of the
health of the DAG members and database copies.
Node majority quorum mode is used when the DAG has an odd number of members (Figure 5-8). The file
share witness is not required for the quorum voting process, because the DAG members can determine a
“majority” themselves. For example, if one DAG member fails, 2/3 DAG members are still online (a majority)
and the DAG can remain online. If two DAG members fail, 1/3 DAG members are still online, which may result
in quorum being lost and the DAG going offline.
Figure 5-8: A three member DAG, with the file share witness not included in quorum voting
Node and file share majority is used when the DAG has an even number of members (Figure 5-9). The file
share witness is included in the quorum voting process to ensure that a “majority” can be determined. For
example, in a two-member DAG if one member fails, 1/2 members are still online (not a majority), but you
would expect the DAG to be able to withstand a single node failure. The file share witness is used as the tie-
breaker, meaning 2/3 “votes” are still available, and the DAG can stay online. Similarly, with a four-member
DAG, if two members failed, with the file share witness there are still 3/5 “votes” online, so the DAG can stay
online.
Figure 5-9: A four member DAG, with the file share witness included in quorum voting
All database availability groups are configured with a file share witness, whether it is used for voting or not.
The quorum model is adjusted automatically by the DAG as you add or remove members.
Because a loss of quorum will cause the DAG to go offline, you should always plan maintenance tasks so that a
majority of voting members will remain online during the maintenance. For example, for a three member DAG,
perform updates and reboots on the first server, returning it to full operation before beginning your
maintenance on the next server, and so on.
Note: A common misconception is that the file share witness must be online 100% of the time. This
leads some administrators to try and build a resilient file share witness by using clustered file servers to
host the FSW share. The FSW is not required to be online 100% of the time; it can be unavailable for
short periods of time (for example, monthly Windows Update installation) just as the other DAG
members can be, as long as a majority of quorum voting members remains online. Building a highly
resilient FSW only adds complexity to your DAG.
In some circumstances the DAG can sustain a majority of nodes being offline if there have been multiple
sequential failures. This is because of a feature of Windows Server 2012 (and later) failover clusters called
Dynamic Quorum (DQ). DQ makes it possible, but not guaranteed, for a cluster to survive sequential failures all
the way down to a "last man standing" situation. DQ is enabled by default on Windows Server 2012 and above clusters.
Some environments experience an unexpected loss of quorum when a single DAG member is taken offline,
causing the entire DAG to go offline as well. This can be due to:
A misconfigured DAG that is pointing to an FSW server name or share name that is inaccessible. This
may be due to the FSW server being offline, the share removed, or the permissions on the shared
folder being incorrect. The share must be accessible and allow Full Access by the Exchange Trusted
Subsystem group.
A cluster that has been manually set to the wrong quorum mode by an administrator.
The FSW path and the quorum mode can both be viewed by running Get-ClusterQuorum on a DAG member.
PS C:\> Get-ClusterQuorum | fl
Cluster : EX2016DAG01
QuorumResource : File Share Witness (\\mgmt.exchangeserverpro.net\EX2016DAG01.exchangeserverpro.net)
QuorumType : NodeAndFileShareMajority
Multi-site DAGs can also be configured with two file share witnesses to facilitate the datacenter switchover
process. The file share witness does not automatically fail over to the alternate server, though; switching to the
alternate witness is part of the manual datacenter switchover process. The alternate witness server is not a
mandatory property of the DAG, so it's not unusual to see that no alternate witness server is configured on a DAG. You can use
the output of Get-DatabaseAvailabilityGroup with the –Status switch to confirm the name of the witness server,
the directory, and whether the primary or alternate witness share is in use.
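For example, a command along these lines returns the witness details shown below:
[PS] C:\> Get-DatabaseAvailabilityGroup EX2016DAG01 -Status | Format-List Name,WitnessServer,WitnessDirectory,AlternateWitnessServer,AlternateWitnessDirectory,WitnessShareInUse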
Name : EX2016DAG01
WitnessServer : mgmt.exchangeserverpro.net
WitnessDirectory : C:\DAGFileShareWitnesses\EX2016DAG01.exchangeserverpro.net
AlternateWitnessServer :
AlternateWitnessDirectory :
WitnessShareInUse : Primary
If the cluster management tools have been used to modify the DAG witness settings outside of Exchange, then
the WitnessShareInUse will display a value of "InvalidConfiguration" instead. To correct this, run Set-
DatabaseAvailabilityGroup to reset the file share witness settings for the DAG.
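For example, a command along these lines (using the witness server and directory shown above) re-applies the witness settings:
[PS] C:\> Set-DatabaseAvailabilityGroup EX2016DAG01 -WitnessServer mgmt.exchangeserverpro.net -WitnessDirectory C:\DAGFileShareWitnesses\EX2016DAG01.exchangeserverpro.net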
Multi-Site Behaviour
DAGs that span multiple datacenter locations are susceptible to failures of the WAN connection between the
sites. Depending on the location of the file share witness server, a loss of WAN connectivity could cause either:
Database failovers to the datacenter where the file share witness is located.
The entire DAG going offline.
For example, in a two-datacenter deployment such as in Figure 5-10, if the WAN connection is lost then all
databases that are active in "Datacenter 2" will failover and become active in "Datacenter 1", because that is
where the file share witness server is located and therefore where quorum is able to be achieved.
However, if the entire "Datacenter 1" location goes offline due to a power failure, the databases will not be
able to become active in "Datacenter 2", because the DAG members in that site can't achieve quorum (only
two out of five voting members will be available).
Note: This is normal behavior for multi-site DAGs that do not use a third site for the file share witness.
If automatic site failover is required, then a third datacenter that is independently connected to each of
the other datacenters should be used to host the file share witness. Azure can be used for this, if the
organization doesn't have access to a third datacenter.
In datacenter outage scenarios where the DAG loses quorum, a manual datacenter switchover is required to
restore service. It is recommended that Datacenter Activation Coordination mode (DAC mode) be enabled for
all multi-site DAGs to prevent split brain scenarios from occurring after a datacenter switchover has been
performed.
The DAC mode setting for a DAG can be viewed with the Get-DatabaseAvailabilityGroup cmdlet.
Name DatacenterActivationMode
---- ------------------------
EX2016DAG01 DagOnly
EX2010DAG Off
If you have not enabled DAC mode for a DAG, the datacenter switchover cmdlets will not be available,
requiring the use of failover cluster management tools to bring the secondary site online.
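DAC mode is enabled by setting the DatacenterActivationMode property of the DAG to DagOnly; for example:
[PS] C:\> Set-DatabaseAvailabilityGroup EX2016DAG01 -DatacenterActivationMode DagOnly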
Warning: DAC mode is designed to prevent a split brain condition from occurring if the primary
datacenter comes back online before WAN connectivity is re-established. The DAG members in the
primary site and the FSW will be able to achieve quorum, and the databases will be mounted in the
primary datacenter while they are also already mounted in the secondary datacenter, causing
divergence between the database copies that breaks replication. If you perform a datacenter switchover
without DAC mode enabled, be careful not to bring the primary datacenter servers online without WAN
connectivity being restored first.
In the example above, several content indexes on otherwise healthy passive database copies are in a
suspended state, which usually indicates a failure of some kind. A reseed of the content index can be started
by running Update-MailboxDatabaseCopy with the –CatalogOnly switch.
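For example, to reseed the content index for the copy of DB05 on EX2016SRV1:
[PS] C:\> Update-MailboxDatabaseCopy "DB05\EX2016SRV1" -CatalogOnly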
Confirm
Are you sure you want to perform this action?
Seeding database copy "DB05\EX2016SRV1".
[Y] Yes [A] Yes to All [N] No [L] No to All [?] Help (default is "Y"): y
However, reseeding the content index requires a healthy copy to seed from. When all copies of the content index
are failed, you can instead rebuild the index using the steps demonstrated earlier in this chapter.
The default output is CSV, which can be read in Excel. If you prefer an HTML summary report, you can append
the –GenerateHtmlReport switch to the command. However, opening the CSV in Excel where you can filter and
sort the data will provide you with a lot more visibility than the HTML summary.
Exchange will also log events to several event logs on the server. The Application event log will contain basic
information about events logged by Exchange services. Managed Availability and the high availability components
log more detailed information to the crimson channels.
For database availability groups there are four crimson channels used:
HighAvailability – for DAG events relating to the Replication service and Active Manager
MailboxDatabaseFailureItems – for database copy-related events
ActiveMonitoring – for Managed Availability events relating to probes, monitors and responders
ManagedAvailability – also for Managed Availability events, relating to recovery actions taken by MA
and their results
When you identify an incident has occurred, such as an outage for end users, or a database that was not able
to failover to another DAG member, start by reviewing the Application event log to determine the time of the
incident. Then, delve into the crimson channel to look for relevant events, and any historical trends that lead
up to that event.
Because a DAG uses an underlying failover cluster, the cluster log itself can also be used to troubleshoot some
problem scenarios. You may not find the cluster log useful for every application-level issue that happens in a
DAG, such as a database failover by Managed Availability due to an unhealthy protocol. But if you experience
problems with full server failures or loss of quorum, the cluster log can be useful.
To generate the cluster log run the Get-ClusterLog cmdlet on a DAG member. You can specify a destination
for the log files, as well as a timespan (in minutes) that the logs should be collected for.
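For example, to collect the last 60 minutes of cluster log data into an existing folder on the server (the C:\ClusterLogs path is just an example):
PS C:\> Get-ClusterLog -Destination C:\ClusterLogs -TimeSpan 60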
Additional reading
Information Store
Troubleshooting Rapid Growth in Databases and Transaction Log Files
Exchange Server User Monitor (ExMon) for Exchange 2013 and 2016
Get-MailboxReport.ps1 PowerShell Script
Content Indexes
Running Windows Antivirus Software on Exchange 2016 Servers
Antivirus Software in the Operating System on Exchange 2013 Servers
Exchange Server is installed into organizations to provide email and collaboration services for users. For all the
technical features and complexity of an Exchange deployment, it is that simple function of letting people send
and receive email, book meetings, and manage their daily tasks that Exchange provides. You'll often be called
upon to troubleshoot those functions for the recipients in your organization, whether they are a person (a
mailbox user), a resource (such as a meeting room), or a distribution group.
There are actually quite a lot of possible recipient types in an Exchange organization. Here's the full list:
User mailbox – a mail-enabled user in Active Directory has a mailbox for messages, calendar, contacts,
tasks and so on.
Shared mailbox – a mailbox that is shared by multiple users, or that is not specifically associated with a
person.
Room mailbox – a mailbox used to manage calendar bookings for a meeting room or some other
physical location.
Equipment mailbox – a mailbox used to manage calendar bookings for a piece of equipment such as
a pool car or a loan laptop.
Site mailbox – a mailbox that is associated with a SharePoint site for document storage. Site mailboxes
are not widely used.
Office 365 mailbox – also known as a Remote Mailbox, refers to a user in a Hybrid deployment that
has their mailbox in Exchange Online.
System mailbox – special purpose mailboxes created and managed by the Exchange server itself for
tasks such as eDiscovery and moderated transport.
Linked mailbox – a mailbox in the organization that is associated with a user in a separate, trusted
forest.
Linked user – a user in the local forest that is associated with a mailbox in a separate, trusted forest.
Mail user – a mail-enabled Active Directory user that has a mailbox hosted by an external system.
Mail contact – an email recipient external to the organization who does not have a user account in the
local Active Directory forest.
Mail forest contact – a contact representing a recipient from another forest. These are usually created
by Microsoft Identity Integration Server and are not directly managed with Exchange or Active
Directory tools.
Mail-enabled public folder – a public folder that can appear in the Global Address List and receive
email messages.
Distribution group – used to distribute messages to a group of recipients.
Dynamic distribution group – a distribution group that uses an LDAP query instead of a static
membership list to determine who to distribute messages to.
Mail-enabled security group – an Active Directory security group that can be used to distribute
messages as well as to grant or deny permissions to objects.
That's quite a variety, but can be summarised as mailboxes, users, contacts, and groups. They're collectively
referred to as recipients, and during this chapter I may refer to them in the general sense, such as "mailbox",
or when necessary will use a specific name, such as "room mailbox".
The term "mail-enabled" is also important to understand. A mail-enabled object is an Active Directory object
that has email attributes populated by Exchange. An example of this is contacts. Active Directory can host
contact objects whether Exchange is installed in the forest or not. Contact objects in Active Directory can be
used to store phone numbers, fax numbers, postal addresses, and so on. But it isn't until the contact object is
mail-enabled that a mailbox user can see the contact in the Global Address List and send email messages to
them.
Mailbox Access
The most common recipient you'll be administering is the mailbox; or more specifically, the mailbox user.
When mailbox problems occur, users notice pretty fast, and you'll quickly hear about it. Let's start with a look
at mailbox permissions.
Mailbox Permissions
In addition to the normal permission that a user has to access their own mailbox, there are also permissions
required for the Exchange system itself to access mailboxes, as well as access by other users such as delegates
or shared mailbox scenarios for teams. The permissions for Exchange access are discussed in chapter 7. For access by users, there are three types of permissions to consider:
Mailbox permissions – generally used to grant access to an entire mailbox, such as a shared mailbox
Mailbox folder permissions – used to grant permissions to specific folders, which includes mail item
folders as well as other folders such as contacts, calendars, and tasks.
Active Directory permissions – used to grant specific rights to a mailbox, such as the ability to send as
the mailbox.
In this section we'll look at mailbox and mailbox folder permissions. Calendar permissions and send-as
permissions are covered later in this chapter.
To view the mailbox permissions for a mailbox, we use the Get-MailboxPermission cmdlet.
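For example (assuming a mailbox with the alias Alex.Heyne, which is used in the examples that follow):
[PS] C:\> Get-MailboxPermission -Identity Alex.Heyne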
The output can be quite long for some environments, because it includes permissions set explicitly on that
mailbox as well as all of the permissions that are being inherited from the mailbox database and other parent
objects. To view only the non-inherited permissions, filter the output of Get-MailboxPermission.
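For example:
[PS] C:\> Get-MailboxPermission -Identity Alex.Heyne | Where-Object {$_.IsInherited -eq $false}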
Most mailboxes will only have access by "NT AUTHORITY\SELF", which basically means the Active Directory
user associated with the mailbox. To grant additional users access to the mailbox, use the Add-
MailboxPermission cmdlet.
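For example, a command along these lines produces the result described below (assuming the mailbox aliases are Alex.Heyne and Alan.Reid):
[PS] C:\> Add-MailboxPermission -Identity Alex.Heyne -User Alan.Reid -AccessRights FullAccess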
The example above will grant Alan Reid full access to Alex Heyne's mailbox, including all folders and items.
Alan can read, modify, or delete anything he likes. In this situation, enabling mailbox audit logging would be
advisable, which is covered in chapter 12. In addition to granting Alan access to the mailbox, the default
behaviour is also to auto-map the mailbox in Alan's Outlook profile the next time Autodiscover runs (typically
within a few hours). Auto-mapping is covered later in this chapter.
Note: Mailbox permissions can also be granted to a group, instead of to individual users. The group
must be a universal security group, but does not need to be mail-enabled. When a group is used to
grant access to a mailbox, auto-mapping does not work, so the mailbox will need to be manually added
to the group members' profiles.
The most common mailbox permission you'll be granting is FullAccess. By default, the permission is applied
to all folders and items. However, the default behaviour can be modified by using the –InheritanceType switch,
which has the following options:
All – the permission is applied to the top of the mailbox and all child objects (folders). This is the
default behavior if the –InheritanceType switch is not specified.
Children – the permission is applied to the immediate children only.
Descendants – the permission is applied to the immediate children and their descendants.
None – no inheritance is used, and the permission is applied only to the top of the mailbox.
SelfAndChildren – the permission is applied to the top of the mailbox and its immediate children
(folders), but no descendants.
Real World: There are few if any practical uses for options other than "All". However, if you run Get-
MailboxPermission and see that a user has FullAccess permission to a mailbox, but is unable to access
some of the child folders within that mailbox, it is worth trying to remove the mailbox permission and
re-add it again, in case another administrator has used a different inheritance type.
It's natural to assume that Add-MailboxPermission can be used to grant read-only access to a mailbox, because
one of the AccessRights is ReadPermission. However, all ReadPermission does is grant a user the right to read
the mailbox permissions on the mailbox object, not to actually read any of the data within the mailbox. To
grant read access to a mailbox we need to use mailbox folder permissions instead, which is discussed later in
this chapter.
When troubleshooting mailbox permissions issues, a common problem is that changes to the permissions do not
take effect immediately. Some administrators even go as far as restarting the Information Store service, or the
entire server, when trying to resolve it. A restart tends to result in the permissions changes taking effect,
leading the administrator to assume that there is a fault somewhere in their server that requires a restart for
permissions changes to apply. In reality, the Information Store caches mailbox permissions information, and the
change will take effect on its own after a short delay without any restart being required.
Mailbox Folder Permissions
Mailbox folder permissions are much more granular and customizable than mailbox permissions. You can
configure specific permissions, such as ReadItems, or you can configure roles such as Author that include a
bundle of permissions. The full list of permissions that can be applied as mailbox folder permissions is
available on TechNet.
Using the example of read-only access, the first step is to grant Reviewer permissions to the top of the
mailbox.
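For example, continuing the scenario of Alan Reid being granted access to Alex Heyne's mailbox (the "Alex.Heyne:\" identity refers to the top of the mailbox):
[PS] C:\> Add-MailboxFolderPermission -Identity "Alex.Heyne:\" -User Alan.Reid -AccessRights Reviewer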
RunspaceId : 2cc2f5f2-77a3-42b6-9221-83cf24c494c6
FolderName : Top of Information Store
User : Alan Reid
AccessRights : {Reviewer}
Identity : Alan Reid
IsValid : True
When you're troubleshooting mailbox folder permissions it's important to understand how inheritance works.
Unlike mailbox permissions, there's no inheritance applied by default when you use Add-
MailboxFolderPermission, and there is also no parameter that you can use with the cmdlet to control
inheritance. Instead, the inheritance behaviour is:
Permissions added to a folder are not applied to any of the existing sub-folders of that folder.
Sub-folders created after the permission has been applied inherit the permissions of their parent folder at the time they are created.
If you do need to apply mailbox folder permissions to existing sub-folders of the mailbox, you will need to
apply them to each folder. For example, to grant Reviewer permissions to the Inbox folder, run the following
command.
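For example:
[PS] C:\> Add-MailboxFolderPermission -Identity "Alex.Heyne:\Inbox" -User Alan.Reid -AccessRights Reviewer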
RunspaceId : 2cc2f5f2-77a3-42b6-9221-83cf24c494c6
FolderName : Inbox
User : Alan Reid
AccessRights : {Reviewer}
Identity : Alan Reid
IsValid : True
You can be as granular with mailbox folder permissions as you like, granting Reviewer permissions to the top
of the mailbox, and different permissions for sub-folders.
Note: Adding the same mailbox folder permissions for every existing folder in a mailbox is a tedious
task, but you can speed it up by using the script demonstrated in this blog post.
Mailbox folder permissions can only have one entry per user or group. If there is an existing Reviewer
permission on the folder for a user, and you want to grant Editor permissions instead, use the Set-
MailboxFolderPermission cmdlet to update the entry.
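For example:
[PS] C:\> Set-MailboxFolderPermission -Identity "Alex.Heyne:\Inbox" -User Alan.Reid -AccessRights Editor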
Also, keep in mind that mailbox folder permissions do not use auto-mapping for the mailbox, so any mailbox
access that has been configured using folder permissions will require the mailbox to be manually added to the
user's Outlook profile.
Auto-Mapping
In Exchange 2010 Service Pack 1, Microsoft introduced the capability for Outlook clients to automatically add
mailboxes to the Outlook profile of users who have full access to the mailbox. This greatly simplifies the
process of granting access to shared mailboxes, because administrators do not need to assist the end user
with adding the mailbox to their profile.
Auto-mapping is enabled by default when you grant FullAccess mailbox permissions, which was discussed
earlier in this chapter. Let's look at an example of Alan Reid being granted access to the Payroll shared mailbox.
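For example, a command along these lines grants the permission (assuming the shared mailbox alias is Payroll):
[PS] C:\> Add-MailboxPermission -Identity Payroll -User Alan.Reid -AccessRights FullAccess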
After running the command shown above, the msExchDelegateLinkList attribute of the Payroll mailbox is
updated to include Alan Reid.
Autodiscover returns the mailboxes listed in that attribute as alternative mailboxes, and Outlook then
automatically maps them in the user's profile. This reliance on Autodiscover leads to three important points
when troubleshooting mailbox auto-mapping issues:
The Autodiscover service caches information, so it may not immediately return the alternative mailbox
details. The MSExchangeAutodiscoverAppPool in IIS can be recycled to clear cached information.
Outlook clients poll for Autodiscover information at regular intervals, and also at start up, but they
also cache Autodiscover information locally, which may create a delay between applying mailbox
permissions and auto-mapping occurring.
If Autodiscover is not working in the environment, then auto-mapping will not work at all.
Although auto-mapping is convenient, it can create performance problems for the end user when too many
mailboxes are auto-mapped in an Outlook profile. It's rare that one individual needs a large number of
mailboxes permanently mapped, so you can reduce the number by disabling auto-mapping. It's possible to
disable auto-mapping for existing mailbox permissions by re-running the Add-MailboxPermission cmdlet with
the –AutoMapping parameter set to $false.
If you have multiple users with permissions to the mailbox, and you want to disable auto-mapping for all of
them, you can achieve that with a few steps in PowerShell.
[PS] C:\> $Users = Get-MailboxPermission Payroll | Where {$_.AccessRights -eq "FullAccess" -and
$_.IsInherited -eq $false}
[PS] C:\> $Users | ForEach {Add-MailboxPermission -Identity $_.Identity -User $_.User -AccessRights
FullAccess -AutoMapping:$false}
Sometime after running the commands above, any users that have the mailbox auto-mapped in their Outlook
profile will see that mapping remove itself. They will then need to manually map the mailbox if they want to
have it permanently mapped in their Outlook profile.
Corrupt Mailboxes
Under the hood, mailboxes are complicated data constructs living within a database. And like most data, from
time to time they can suffer from corruption. Mailbox corruption often appears during mailbox migrations, as
the migration process finds issues with old data in the mailbox that the end user never looks at any more.
Migration issues caused by mailbox corruption are covered in chapter 11.
However, end users will notice more obvious signs of mailbox corruption, such as:
Folders showing item counts that are incorrect (e.g. a folder displays 1 unread item when there are in
fact no unread items)
Search folders not displaying the correct results
Mailbox folders missing expected items from some views
If you suspect mailbox corruption, you can run a mailbox repair request in detect mode. Mailbox repair
requests can be run for one or more of the following corruption types:
SearchFolder
AggregateCounts
FolderView
ProvisionedFolder
Mailbox repair requests can be run against a single mailbox, or an entire database. For example, to run a repair
request for Alan Reid in detect only mode, we can use the New-MailboxRepairRequest cmdlet with the –
Mailbox and –DetectOnly parameters.
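For example, a detect-only repair request covering all four corruption types might look like this:
[PS] C:\> New-MailboxRepairRequest -Mailbox Alan.Reid -CorruptionType SearchFolder,AggregateCounts,FolderView,ProvisionedFolder -DetectOnly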
To run a repair request against an entire mailbox database, use the –Database parameter instead.
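For example (using the DB05 database from earlier examples):
[PS] C:\> New-MailboxRepairRequest -Database DB05 -CorruptionType SearchFolder,AggregateCounts,FolderView,ProvisionedFolder -DetectOnly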
Mailbox repair requests are background tasks, which will start and stop depending on the performance of the
Exchange server and its load at the time.
Note: For performance reasons, only one database-level repair request or 100 mailbox-level repair
requests can be active at the same time. Once a repair request has started, it can't be stopped unless
you dismount the mailbox database.
You can monitor the progress of your mailbox repair request by running the Get-MailboxRepairRequest cmdlet.
To view the results, look at the CorruptionsDetected and CorruptionsFixed properties of the repair requests.
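For example (Get-MailboxRepairRequest is available in Exchange 2013 and later):
[PS] C:\> Get-MailboxRepairRequest -Mailbox Alan.Reid | Format-List Tasks,CorruptionsDetected,CorruptionsFixed,Corruptions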
Tasks : {SearchFolder}
CorruptionsDetected : 0
CorruptionsFixed : 0
Corruptions :
Tasks : {AggregateCounts}
CorruptionsDetected : 0
CorruptionsFixed : 0
Corruptions :
Tasks : {FolderView}
CorruptionsDetected : 0
CorruptionsFixed : 0
Corruptions :
Tasks : {ProvisionedFolder}
CorruptionsDetected : 0
CorruptionsFixed : 0
Corruptions :
If you need to repair mailboxes, re-run the mailbox repair request command without the –DetectOnly switch.
Corrupt Delegates
When a user adds another user to their mailbox as a delegate, and chooses to have meeting requests sent to
their delegates, a hidden inbox rule is created in their mailbox to perform the forwarding of those meeting
requests. This can also apply to delegates of room and equipment mailboxes.
The root cause of both issues is a stale, or corrupt, delegate entry still being included in the hidden inbox rule.
There are two solutions you can try to fix this:
Remove all delegates from the mailbox, and then re-add them.
Use MFCMAPI to remove the hidden inbox rule (which also removes all delegates, requiring you to
re-add them).
In either case, it is wise to make a note of the delegates that you want to keep on the mailbox, and their
delegate permissions, before you attempt to fix the problem.
Calendars
Users rely on their calendars to know when and where they need to be, so any calendar issues will quickly
cause them pain. But most calendar issues come down to a handful of root causes, which we'll explore in this
section.
Calendar Permissions
The permissions for calendars can be configured in two ways:
By an administrator configuring mailbox folder permissions. The calendar is just another folder in that
sense.
By the end user configuring their own delegates or folder permissions.
Mailbox folder permissions have already been covered earlier in this chapter.
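The time zone and working hours configured for a mailbox can be viewed with the Get-MailboxCalendarConfiguration cmdlet, which produces output like the example below. For instance, using the Alan.Reid mailbox from earlier examples:
[PS] C:\> Get-MailboxCalendarConfiguration -Identity Alan.Reid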
RunspaceId : 9646d4ad-70ba-48d9-98ae-db322193f06b
WorkDays : Weekdays
WorkingHoursStartTime : 08:00:00
WorkingHoursEndTime : 17:00:00
WorkingHoursTimeZone : W. Australia Standard Time
WeekStartDay : Monday
ShowWeekNumbers : False
FirstWeekOfYear : FirstDay
TimeIncrement : ThirtyMinutes
RemindersEnabled : True
ReminderSoundEnabled : True
DefaultReminderTime : 00:15:00
WeatherEnabled : True
WeatherUnit : Default
WeatherLocations : {}
Similarly, the work hours setting for a mailbox will impact the classification of free/busy time for the calendar,
and in the case of resource mailboxes, will also impact the suggested times and enforcement of bookings only
within work hours.
You can modify the time zone or work hours for a mailbox by running the Set-MailboxCalendarConfiguration
cmdlet.
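For example, a command along these lines changes the time zone and working hours (the values shown are examples only):
[PS] C:\> Set-MailboxCalendarConfiguration -Identity Alan.Reid -WorkingHoursTimeZone "W. Australia Standard Time" -WorkingHoursStartTime 08:00:00 -WorkingHoursEndTime 17:00:00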
A full list of valid time zones can be displayed by running the following command in PowerShell.
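One way to display the valid time zone names is to query the .NET TimeZoneInfo class:
[PS] C:\> [System.TimeZoneInfo]::GetSystemTimeZones() | Format-Table Id,DisplayName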
Free/Busy Information
When a user is trying to book a resource mailbox or another person for a meeting, they rely on free/busy
information displayed in Outlook (Figure 6-4) to let them know when the meeting invitees are available.
Free/busy information relies on Autodiscover for the client to be able to discover where the Exchange Web
Services (EWS) endpoint is, and also relies on EWS being reachable. This means that the EWS namespace (URL)
must be correctly configured, resolvable in DNS, and reachable across the network. And, because the EWS
connection from Outlook is over HTTPS, the SSL certificate on the Exchange server must also be correctly
configured. Namespaces and certificates have been covered in chapter 3, and Outlook connectivity to EWS is
covered in chapter 7.
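For example, the configured EWS URLs can be quickly checked with:
[PS] C:\> Get-WebServicesVirtualDirectory | Format-List Server,InternalUrl,ExternalUrl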
Real World: One of the most obvious signs of EWS not working is when free/busy lookups are failing.
The other is when an Outlook user can't change their Out of Office settings. Both of those features rely
on EWS.
The Calendar Repair Assistant (CRA) is set to run in RepairAndValidate mode by default for Exchange 2013 and
later, which you can view by running Get-MailboxServer.
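For example:
[PS] C:\> Get-MailboxServer | Format-Table Name,CalendarRepairMode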
Name CalendarRepairMode
---- ------------------
EX2013SRV1 RepairAndValidate
EX2010SRV1 ValidateOnly
EX2016SRV1 RepairAndValidate
EX2016SRV2 RepairAndValidate
Individual mailboxes can also be enabled or disabled for calendar repair. By default, mailboxes are enabled.
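For example:
[PS] C:\> Get-Mailbox Alan.Reid | Format-List Name,CalendarRepairDisabled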
Name : Alan.Reid
CalendarRepairDisabled : False
The goal of the CRA is to detect and fix inconsistencies in calendar items so that users have reliable calendar
information to work from. Inconsistencies can be introduced to calendar items by things such as the meeting
organizer dragging the meeting to a different time in their calendar, or cancelling the meeting without
sending an update to invitees. The CRA attempts to detect these inconsistencies and resolve them.
A full list of CRA actions for different inconsistencies is available on TechNet. Any actions taken by the CRA are
logged to the calendar repair log path on the Mailbox server, so that in a troubleshooting situation you can
review the log and determine whether the CRA has taken action on a calendar item.
Name CalendarRepairLogPath
---- ---------------------
EX2013SRV1 C:\Program Files\Microsoft\Exchange Server\V15\Logging\Calendar Repair Assistant
EX2010SRV1 C:\Program Files\Microsoft\Exchange Server\V14\Logging\Calendar Repair Assistant
EX2016SRV1 C:\Program Files\Microsoft\Exchange Server\V15\Logging\Calendar Repair Assistant
EX2016SRV2 C:\Program Files\Microsoft\Exchange Server\V15\Logging\Calendar Repair Assistant
In Exchange 2013 and later the CRA is a workload-based process, meaning it will run automatically behind the
scenes, stopping and starting depending on the server load at the time.
Real World: It's easy to suspect the Calendar Repair Assistant when calendar items go missing
unexpectedly. But a far more common cause of missing calendar items is mistaken deletion by the
mailbox owner or their delegate. Use mailbox audit logging (chapter 12) to investigate further.
In the example above, the Lakeview Room mailbox has been created as the wrong recipient type. To convert it
to a room mailbox, use the Set-Mailbox cmdlet.
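For example:
[PS] C:\> Set-Mailbox "Lakeview Room" -Type Room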
Similarly, you can run the same command to convert a mailbox to Regular, Shared, or Equipment.
Note: When you convert a mailbox to a room, equipment, or shared mailbox, the associated user
account object in Active Directory is disabled. This prevents anyone directly logging on to the account.
However, there are a number of ways that the processing of meeting requests will behave, depending on how
you've configured the booking options for the mailbox. By default, a resource mailbox will not automatically
process meeting requests.
Although the basic calendar processing configuration for a resource mailbox can be managed in the Exchange
admin center, PowerShell offers a greater level of control. To view the calendar processing settings for a
mailbox, use the Get-CalendarProcessing cmdlet.
Identity AutomateProcessing
-------- ------------------
exchangeserverpro.net/Company/Head Office/Users/Lakeview ... AutoUpdate
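In the output above, AutomateProcessing is set to AutoUpdate, which means the Resource Booking Attendant is not processing requests for that mailbox. To enable automatic processing, set AutomateProcessing to AutoAccept; for example:
[PS] C:\> Set-CalendarProcessing "Lakeview Room" -AutomateProcessing AutoAccept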
Users will now be able to book the resource, within the constraints of the booking policies which are discussed
next.
By default, a resource mailbox has several restrictions in place that are enforced by the Resource Booking
Attendant:
You can see these settings, some of which are the result of a combination of multiple settings, by running Get-
CalendarProcessing.
RunspaceId : 78f05d8a-31d0-4efc-9a99-ccb73993e4dc
AutomateProcessing : AutoAccept
AllowConflicts : False
BookingWindowInDays : 180
MaximumDurationInMinutes : 1440
AllowRecurringMeetings : True
EnforceSchedulingHorizon : True
ScheduleOnlyDuringWorkHours : False
ConflictPercentageAllowed : 0
MaximumConflictInstances : 0
ForwardRequestsToDelegates : True
DeleteAttachments : True
DeleteComments : True
RemovePrivateProperty : True
DeleteSubject : True
AddOrganizerToSubject : True
DeleteNonCalendarItems : True
TentativePendingApproval : True
EnableResponseDetails : True
OrganizerInfo : True
ResourceDelegates : {}
RequestOutOfPolicy : {}
AllRequestOutOfPolicy : False
BookInPolicy : {}
AllBookInPolicy : True
RequestInPolicy : {}
AllRequestInPolicy : False
AddAdditionalResponse : False
AdditionalResponse :
RemoveOldMeetingMessages : True
AddNewRequestsTentatively : True
ProcessExternalMeetingMessages : False
RemoveForwardedMeetingNotifications : False
MailboxOwnerId : exchangeserverpro.net/Company/Head Office/Users/Lakeview Room
Identity : exchangeserverpro.net/Company/Head Office/Users/Lakeview Room
IsValid : True
ObjectState : Changed
Some restrictions are easy to apply, such as limiting the duration of meetings in a high demand meeting room
to 1 hour.
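For example:
[PS] C:\> Set-CalendarProcessing "Lakeview Room" -MaximumDurationInMinutes 60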
However, the work hours are configured in the mailbox calendar configuration, which can be viewed using
Get-MailboxCalendarConfiguration.
RunspaceId : 78f05d8a-31d0-4efc-9a99-ccb73993e4dc
WorkDays : Weekdays
WorkingHoursStartTime : 08:00:00
WorkingHoursEndTime : 17:00:00
WorkingHoursTimeZone : Pacific Standard Time
WeekStartDay : Sunday
ShowWeekNumbers : False
FirstWeekOfYear : FirstDay
TimeIncrement : ThirtyMinutes
RemindersEnabled : True
ReminderSoundEnabled : True
DefaultReminderTime : 00:15:00
WeatherEnabled : True
WeatherUnit : Default
WeatherLocations : {}
Identity : exchangeserverpro.net/Company/Head Office/Users/Lakeview Room
IsValid : True
ObjectState : New
In some cases, human management of bookings is necessary. Resource mailboxes can have delegates
assigned to process meeting requests.
AutomateProcessing : AutoAccept
ForwardRequestsToDelegates : True
ResourceDelegates : {}
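A delegate can be added with Set-CalendarProcessing; for example:
[PS] C:\> Set-CalendarProcessing "Lakeview Room" -ResourceDelegates Dawn.Evans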
AutomateProcessing : AutoAccept
ForwardRequestsToDelegates : True
ResourceDelegates : {exchangeserverpro.net/Company/Head Office/Users/Dawn.Evans}
However, you should be aware that adding a delegate is not enough to require manual processing of meeting
requests. You'll also need to set the AllBookInPolicy setting to False, and AllRequestInPolicy to True.
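For example:
[PS] C:\> Set-CalendarProcessing "Lakeview Room" -AllBookInPolicy $false -AllRequestInPolicy $true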
Even with a delegate in place to manage resource bookings, it's sometimes desirable to allow specific
individuals to make their own bookings. This is where the booking policy and request policy settings can be
used to provide the desired outcome. The policy requirements that are assessed by the Resource Booking
Attendant are those mentioned earlier in this section (e.g. meeting duration, whether recurring meetings are
allowed, and so on), and you can read a full list of the available settings and their meaning on the TechNet
page for Set-CalendarProcessing.
Note: The policy settings will have no effect at all if the AutomateProcessing setting is not set to
AutoAccept to enable the Resource Booking Attendant.
As you can see there are multiple settings that need to be considered as part of a whole configuration in order
to determine what will actually happen when a meeting request is sent that includes a resource booking. This
also tends to make troubleshooting complex. Here are a few examples of the behaviour you can expect to see.
Example 1: Fully automated resource booking. All resources are available for anyone to book.
AutomateProcessing: AutoAccept
AllBookInPolicy: True
Example 2: Resource bookings fully managed by a delegate. The delegate ensures that reasonable resource
bookings are made, and handles any conflicts that arise.
AutomateProcessing: AutoAccept
AllBookInPolicy: False
AllRequestInPolicy: True
ResourceDelegates: List of delegates
Example 3: Policy settings apply for most user requests, with some users allowed to make out of policy
bookings.
AutomateProcessing: AutoAccept
AllBookInPolicy: True
AllRequestInPolicy: n/a
AllRequestOutOfPolicy: False
RequestOutOfPolicy: List of users allowed to make out of policy requests
Real World: The more complex your requirements, the more thought needs to be put into how to
control resource bookings. Keeping things simple will make your administrative life easier, and create
fewer surprises for your end users who are trying to make resource bookings. You should also consider
whether your configuration is likely to make things worse when there is high demand for a limited
number of resources.
Requests declined if they are too far in the future. If a user tries to book a meeting room too far into
the future, the Resource Booking Attendant will decline it. This can be controlled by configuring the
BookingWindowInDays setting.
Recurring meeting requests are declined if the occurrences extend too far in the future. This will occur
if a recurring meeting starts before the BookingWindowInDays, but extends beyond that limit. This is
controlled by configuring the EnforceSchedulingHorizon setting. If set to False, the meeting is not
rejected, but any occurrences beyond the horizon will be removed.
Requests declined due to conflicts. By default, no conflicting appointments are accepted. However,
this can become troublesome when a recurring meeting is booked and only a few of the meeting
occurrences conflict with existing bookings. The ConflictPercentageAllowed and MaximumConflictInstances
settings can be used to allow a recurring meeting series to be accepted despite a small number of
conflicting occurrences.
Groups
Exchange distribution groups are used to send email to multiple recipients. The more accurate name is mail-
enabled distribution groups, because a distribution group can exist in Active Directory even when it is not
mail-enabled. But for the purposes of this section, assume that any reference to distribution group implies that
it is mail-enabled.
Groups must be mail-enabled to be visible in Exchange address lists, and to be able to be used to
distribute email.
Hiding a distribution group from the global address list does not prevent email from being sent to the
distribution group if the SMTP address is known by the sender.
Exchange can only use universal groups. Global or domain local groups will not work, but may be
present from a legacy version of Exchange.
Mail-enabled security groups can be used to apply permissions to objects, as well as to send email.
Mail-enabled distribution groups can only be used to send email, and can't be used to apply
permissions to objects.
An empty group will not cause any non-delivery reports. But obviously any email sent to an empty
group will not be received by anybody.
By default, distribution groups do not accept email from unauthenticated senders.
If you create an In-Place Hold for a group, the group membership is expanded at the time the hold is
created, and the group members are included in the In-Place Hold. If the group membership later
changes, the mailboxes included in the In-Place Hold do not automatically update.
This tends to break the group from an Exchange perspective. Exchange needs permission to read the group
membership so that it can determine who to send an email to when the message is addressed to that group. If
Exchange is denied read access to the group membership, for example because permission inheritance has been
disabled on the OU that contains the group, messages sent to the group will not be delivered.
To resolve this issue, enable permission inheritance for the OU so that the Exchange ACLs re-apply. Then,
disable inheritance, but choose to convert the inherited permissions into explicit permissions on the object.
This preserves the Exchange ACLs, and you can then selectively remove the other permissions that you don't
want on the OU.
Note: This applies to nested groups as well. Exchange needs to be able to read all of the parent and
child groups when nesting is being used. If any of the groups are inaccessible to Exchange, it will "fail
closed" by not sending email to the group members, even those it was able to read from the groups it
could access.
Dynamic distribution groups calculate their membership from a recipient filter each time a message is sent to
them, and two common problems can occur:
An incorrect recipient filter causes the wrong recipients to be included or excluded in the group
membership.
The recipient container for the group is incorrect, so that even when the recipient filter is correct the
intended recipients can't be located when calculating membership.
To view the recipient filter for a dynamic distribution group, run the Get-DynamicDistributionGroup cmdlet.
To test a recipient filter, capture the dynamic DG as a variable, and then run Get-Recipient.
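For example, assuming a dynamic distribution group named "All Sales Staff":
[PS] C:\> $DDG = Get-DynamicDistributionGroup "All Sales Staff"
[PS] C:\> Get-Recipient -RecipientPreviewFilter $DDG.RecipientFilter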
The command above will test the recipient filter, but won't validate that the recipient container for the
dynamic DG is correctly specified.
RecipientContainer
------------------
exchangeserverpro.net/Users
In the example above, the dynamic DG will only calculate the users who match the recipient filter and are
located in the exchangeserverpro.net/Users OU or any child OU. If the recipients are located in a different part
of the Active Directory OU structure that doesn't fall under that container, then they will not receive emails
sent to the dynamic DG.
Real World: Setting the recipient container to the domain root is the simplest approach, and should
work fine in smaller environments. However, in very large, complex environments, setting the recipient
container to the domain root may cause a high load on servers as the dynamic DG membership queries
run. In such cases, narrow the scope of the query by setting the recipient container to an OU that more
closely represents where the dynamic DG members are located.
Email Delivery
Across most recipient types the same email delivery issues can occur. There are in fact quite a few different
factors that can prevent an email from reaching a recipient. Troubleshooting transport and mail flow has
already been covered in chapter 4, but in this chapter let's take a look at things from a recipient perspective.
Non-Delivery Reports
A non-delivery report (NDR) is not a cause of email delivery issues, but it does usually provide diagnostic
information that helps with troubleshooting. Always read the NDR when one is available. Many issues reported in
NDRs will be server-related, and you should refer to chapter 4 for more guidance on those.
However, the NDR may reveal an issue that is more likely to be recipient-related. A common example is a user
whose Active Directory account and mailbox have been deleted and then recreated. When other users in the
organization sent email to that recipient, an entry was added to their Outlook auto-complete cache.
Auto-complete cache entries utilize the LegacyExchangeDN attribute, which is always unique. When the new
account and mailbox is created, a new LegacyExchangeDN value is used, so the auto-complete cache entries no
longer match. There are two ways to resolve the problem:
Clear the auto-complete cache entries for users in the organization. This solution does not scale well,
and is inconvenient for end users, so it is not recommended.
Add an X500 address to the new mailbox for the old LegacyExchangeDN value.
The X500 address can be calculated from the NDR by following the steps provided in this Microsoft Support
article.
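Once calculated, the X500 address can be added to the existing email addresses of the mailbox. For example (the X500 value below is only a placeholder; the real value comes from the NDR):
[PS] C:\> Set-Mailbox Alan.Reid -EmailAddresses @{add="X500:/o=First Organization/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=OldUserLegacyDN"}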
Email forwarding can be configured in the mailbox features section of the mailbox properties (Figure 6-9).
When mail forwarding is configured using that option, the altRecipient property of the Active Directory object
is updated, which can be seen using ADSIEdit (Figure 6-10).
When the targetAddress property has been mistakenly set, or not removed after a migration, you can clear it
using ADSIEdit to stop the mail forwarding.
Inbox Rules
Even when an email message has been successfully delivered, the end user may think that it was not received
due to an inbox rule deleting or moving the item. When you are troubleshooting an email delivery case, a
message tracking log search is a good way to determine whether the message was delivered to the mailbox or
not. Message tracking is explained in chapter 4.
After confirming that an email message was successfully delivered, there are three remaining possibilities as to
why it is not visible to the recipient:
Client connectivity issues are preventing them from seeing new email messages. Clients are covered in
more detail in chapter 7.
Someone has moved or deleted the item, usually by accident. This could be either the owner or a
delegate. Mailbox audit logging can help you investigate this possibility, if audit logging is correctly
configured. Mailbox audit logging is explained in chapter 12. If mailbox audit logging is not enabled,
then you can still perform searches of the mailbox within Outlook to locate the missing item.
Junk mail filtering has moved the item to the Junk E-Mail folder.
An inbox rule has moved or deleted the message.
You can check for inbox rules by running the Get-InboxRule cmdlet.
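For example:
[PS] C:\> Get-InboxRule -Mailbox Alan.Reid | Format-List Name,Description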
Name : Report
Description : If the message:
the message includes specific words in the subject 'Report'
Take the following actions:
copy the message to the following folder: 'Reports'
Delivery Restrictions
Recipients can have delivery restrictions applied that control who can send email to that recipient. Delivery
restrictions are often used for very large distribution groups to prevent misuse within the organization. For
example, the "All Staff" distribution group of a company can be configured so that only the executive team
and approved communications officers can send email messages. Aside from preventing misuse this also
prevents "reply all" storms on very large distribution lists.
For distribution groups the delivery management settings can be used to control whether a group can be
used by internal senders only, or by internal and external senders (Figure 6-12). Alternatively, individual users
can be added so that only those approved users can send to the group.
For mailboxes, similar controls are available in the Message Delivery Restrictions settings (Figure 6-13).
Mailboxes also have an option to require that all senders are authenticated. In effect this is the same setting as
the distribution group option to restrict internal vs external senders. Both options configure the same attribute
on the object, which is visible in the Exchange Management Shell.
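For example (using the Alan.Reid mailbox and the All Staff group mentioned in this chapter):
[PS] C:\> Get-Mailbox Alan.Reid | Format-List RequireSenderAuthenticationEnabled
[PS] C:\> Get-DistributionGroup "All Staff" | Format-List RequireSenderAuthenticationEnabled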
RequireSenderAuthenticationEnabled : False
RequireSenderAuthenticationEnabled : True
The default setting for mailboxes is to not require authenticated senders. In other words, a mailbox can receive
email from anybody inside or outside of the organization. Groups have a default setting that requires the
sender be authenticated, which effectively means only internal users can send to the group. Groups created
prior to Exchange Server 2007 did not have this restriction enabled by default, so it's possible that the oldest
groups in your organization are open to external senders.
You can use Set-Mailbox or Set-DistributionGroup to reconfigure the recipient's restrictions to suit your
requirements.
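For example, to restrict the All Staff group so that only members of an Executive Team group can send to it (both group names are examples):
[PS] C:\> Set-DistributionGroup "All Staff" -AcceptMessagesOnlyFromSendersOrMembers "Executive Team"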
Real World: Groups are often used to receive email from external senders, so the correct solution is
not necessarily to disable external senders for all groups. However, if you're troubleshooting misuse of
distribution groups by spammers or external senders, you can consider locking the group down to
internal senders only if it has no external communication requirements. Alternatively, you can change
the SMTP address for the group. However that is not 100% effective, and may impact other legitimate
senders who are using the SMTP address.
Send-As – this effectively allows impersonation of a mailbox, with the recipient seeing the message as
being sent by the mailbox that the real sender sent it as. This is often used with shared mailboxes,
such as a Help Desk team who need to send email from "Help Desk" and not their personal mailboxes.
Send on Behalf – this allows one user to send email on behalf of another, with the recipient seeing the
email as being from "Person A on behalf of Person B". This is often used with delegates.
When send-as and send on behalf are configured for the same person, they'll likely receive a non-delivery
report each time they try to send as the shared mailbox or other user. The reason is that send-as and send on
behalf are incompatible rights, and can't co-exist. A user should be granted either send-as, or send on behalf
permissions, depending on what they need to do.
When troubleshooting email delivery problems when you're sure that either send-as or send on behalf are
configured, make sure you quickly check for the presence of the other option as well.
Storage Quotas
Mailboxes are subject to three levels of storage quota, which can be set on the mailbox database or on individual mailboxes:
Warning quota – the user is warned that their mailbox is nearly full.
Prohibit send quota – the mailbox user is no longer able to send email messages, whether internal or
external. However, they can continue to receive emails.
Prohibit send receive quota – the mailbox user is unable to send or receive emails. This prevents
unrestricted growth, particularly for abandoned or unmonitored mailboxes.
When an email has been rejected due to a mailbox storage quota, that reason will be provided in the non-
delivery report.
Individual mailboxes can have storage quota settings that are different from the database that hosts the
mailbox. Individual mailbox storage quotas will only take effect if the UseDatabaseQuotaDefaults option is
set to False. While that option is True, no amount of modifying a mailbox's individual storage quotas will have
any effect.
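For example, to apply individual quotas to a mailbox (the quota values are examples only):
[PS] C:\> Set-Mailbox Alan.Reid -UseDatabaseQuotaDefaults $false -IssueWarningQuota 4.5GB -ProhibitSendQuota 4.75GB -ProhibitSendReceiveQuota 5GB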
Additional reading
Mailboxes
Unexpected Permissions Appearing on Exchange Server Mailboxes
How to Grant Read Only Access to an Exchange Mailbox
How to Remove Auto-Mapping in Outlook
Deleted Delegates Still Receive Meeting Invites for Other Mailbox Users
Calendars
How to Change and Correct the Time Zone Values and Working Hours for a Conference Room
Groups
Important Information about Group Expansion for In-Place Holds
Pros and Cons of Using Separate Security and Distribution Groups
Remember the Basics When Working with Dynamic Distribution Groups
Email Delivery
IMCEAEX Non-Delivery Report When You Send Email Messages to an Internal User
The Attribute, the Myth, the LegacyExchangeDN
Exchange clients cover the set of applications and devices that connect to Exchange to send and receive
messages or otherwise interact with mailboxes and mailbox items. Exchange supports multiple client
connection protocols. Some are hosted by Client Access services, and some are hosted by Transport services. I
refer to them as services here because the services hosted by different server roles have changed over different
versions of Exchange, so it is often simpler to refer to services instead.
The most common clients you will encounter in the real world are Outlook (desktop), mobile devices or apps,
and web browsers. But of course, there are many other types of clients that you will also see from time to time,
such as POP/IMAP clients, SMTP clients, or custom-developed Exchange Web Services (EWS) clients.
Outlook
Outlook is the primary desktop application used to connect to Exchange mailboxes. The first thing to keep in
mind is that not all versions of Outlook will work with every version of Exchange. Exchange 2013 is supported
for Outlook 2007 SP3 or later, as long as the latest updates are also applied to Outlook. Exchange 2016
dropped Outlook 2007 support entirely and requires clients to run Outlook 2010 SP1 or later, again as long as
the latest updates are also applied to Outlook.
For Mac users there is Outlook for Mac 2014, Outlook for Mac 2011, and Entourage 2008 (EWS), which are
supported for both Exchange Server 2013 and 2016. The Mac clients connect to Exchange using Exchange
Web Services (EWS), which is discussed later in this chapter. Outlook for Mac 2016 is available from Office 365
and can also be used to connect to an on-premises server.
Outlook for Windows connects to Exchange 2013 and Exchange 2016 using one of two protocols:
Outlook Anywhere (RPC over HTTP)
MAPI over HTTP (MAPI/HTTP)
Both protocols connect to Exchange over HTTPS (TCP port 443). In fact, all Outlook connectivity to Exchange
now occurs over HTTPS, with the exception of one Autodiscover lookup method that will try on port 80. This
greatly simplifies the client port requirements for connecting to Exchange, which means simpler configurations
for firewalls that sit on the network between clients and servers.
Autodiscover
Outlook configuration and connectivity relies heavily on Autodiscover. The Autodiscover service running on
Exchange provides information to clients about how to connect to Exchange services. When a domain-joined
Outlook client starts for the first time, an Active Directory lookup is used to determine the email attributes of
the user, as well as locate an Autodiscover Service Connection Point (SCP) to query for more information. Each
Exchange 2013 or earlier Client Access server, or every Exchange 2016 Mailbox server, registers an SCP using
the server's fully qualified domain name. The administrator then needs to reconfigure the SCP for each server
to the appropriate namespace for that site (all servers within a site should have the same SCP URL configured).
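For example, a command along these lines sets the SCP for one server (the mail.exchangeserverpro.net namespace is just an example; in Exchange 2016 the equivalent Set-ClientAccessService cmdlet can also be used):
[PS] C:\> Set-ClientAccessServer EX2016SRV1 -AutoDiscoverServiceInternalUri "https://mail.exchangeserverpro.net/Autodiscover/Autodiscover.xml"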
In the case of domain-joined clients, the Autodiscover SCP URL does not need to match the domain name of
the user's primary SMTP address. If Autodiscover is successful using the SCP the Outlook profile will be
automatically configured (Figure 7-1).
If the Autodiscover SCP does not exist or is invalid in Active Directory, the Autodiscover process continues
using the same sequence as clients that are not domain-joined, which is explained next.
Outlook clients running on non-domain joined computers still use Autodiscover. Instead of looking up the SCP
in Active Directory, they use a sequence of DNS lookups against the domain name of the user's primary email
address.
Although Outlook will stop searching for Autodiscover as soon as it succeeds with one of the options in the
sequence above, it might (depending on the client version and whether a policy disables it) perform multiple
lookups simultaneously. For example, Outlook 2013 will perform an Autodiscover SCP and root domain lookup
at the same time, which can have unintended consequences if the root domain resolves to a server that has a
HTTPS website running on it.
Outlook then authenticates to Autodiscover with the user’s credentials, and posts a query to determine which
Exchange namespace it should connect to for that user’s mailbox. This process of automatic
configuration is dependent on several factors:
The Autodiscover SCP, defined as the AutodiscoverServiceInternalUri of the Client Access server, is able
to be resolved in DNS and is functioning
The Outlook client is able to make a connection on port TCP 443 to the Autodiscover service
Real World: The number one cause of Autodiscover failures I encounter in the field is the Autodiscover
SCP not being changed from the default value. The number two cause is the Autodiscover SCP being
set to a name that doesn't exist on the Exchange server's SSL certificate.
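A quick way to review and correct the SCP values is with the Get-ClientAccessServer and Set-ClientAccessServer cmdlets. This is only a sketch; the server name and Autodiscover namespace shown here are illustrative and should be a name that exists on your SSL certificate.
[PS] C:\>Get-ClientAccessServer | Select Name,AutoDiscoverServiceInternalUri
[PS] C:\>Set-ClientAccessServer -Identity EX2016SRV1 -AutoDiscoverServiceInternalUri "https://autodiscover.exchangeserverpro.net/Autodiscover/Autodiscover.xml"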
Outlook caches Autodiscover information in the user's profile path. You will find the file located in
C:\Users\<username>\AppData\Local\Microsoft\Outlook, with a file name of <GUID> – Autodiscover.xml
(Figure 7-2).
If you ever suspect that the client is using incorrect cached Autodiscover details, you can simply delete that file
and restart Outlook. Autodiscover can be tested from within Outlook by holding the CTRL key on your
keyboard and right-clicking on the Outlook icon in the system tray, then choosing Test E-mail
AutoConfiguration (Figure 7-3).
Enter the email address and password for the account, de-select the two Guessmart options, and then click
Test (Figure 7-4).
The test should complete within a few seconds, and you can review the URLs that have been returned for each
service to make sure they align with the URLs you've configured on your Client Access servers and SSL
certificates. If they don't match your settings immediately, wait a little longer and try again in case there is
simply a delay with Active Directory replication. These URLs can only come from Autodiscover. You can't
override them or assign them manually for an Outlook profile, so it's important that Autodiscover works
correctly.
Depending on your Autodiscover namespaces and DNS entries, Outlook may be redirected to a server name
that it does not trust, and will display a warning to the end user (Figure 7-5). This warning may also appear if
Outlook attempts to connect to the well-known URL of "autodiscover", even if you have a different namespace
in use for Autodiscover in your environment.
Even though the user can click to allow the redirection, and suppress warnings for that server name in future,
this is not a good user experience. Fortunately, the warning can be avoided by adding your Autodiscover
namespace to Outlook’s list of trusted names. The list of redirect servers is stored in the registry in one of two
locations:
HKEY_CURRENT_USER\Software\Policies\Microsoft\Office\xx.0\Outlook\Autodiscover\RedirectServers
HKEY_CURRENT_USER\Software\Microsoft\Office\xx.0\Outlook\Autodiscover\RedirectServers
Figure 7-6: Registry entries that control which servers are trusted for Autodiscover
To deploy the registry key to multiple computers it is easier to use a Group Policy.
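For a single machine, the value can also be added with PowerShell. This is a sketch only; the Office version (16.0 for Outlook 2016) and the Autodiscover namespace are assumptions you should adjust for your environment.
# Add a trusted Autodiscover redirect server for Outlook 2016 (adjust 16.0 for other Outlook versions)
$path = "HKCU:\Software\Microsoft\Office\16.0\Outlook\Autodiscover\RedirectServers"
New-Item -Path $path -Force | Out-Null
# The value name is the server name to trust; the data can be left empty
New-ItemProperty -Path $path -Name "autodiscover.exchangeserverpro.net" -PropertyType String -Value "" -Force | Out-Null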
Outlook Connectivity
After successfully retrieving Autodiscover information and configuring the Outlook profile, Outlook then
connects to Exchange using either Outlook Anywhere or MAPI/HTTP. Again, this is an SSL-encrypted
connection over HTTPS, so SSL certificates are important at this stage as well. If there are any SSL certificate
validity problems, then Outlook will fail to connect (Figure 7-7).
With two available protocols for Outlook connections (Outlook Anywhere, and MAPI/HTTP), you may be
wondering how you can tell which one is being used by an Outlook client. First, you can check whether
MAPI/HTTP is enabled for the Exchange organization.
[PS] C:\>Get-OrganizationConfig | fl MapiHttpEnabled
MapiHttpEnabled : False
From the client-side you can also check the Outlook Connection Status dialog, available by right-clicking the
Outlook icon in the system tray. The Protocol column will show RPC/HTTP for Outlook Anywhere (Figure 7-8).
The MAPI/HTTP configuration for the organization can be enabled by running Set-OrganizationConfig.
Supported clients (Outlook 2013 SP1 or later) will not instantly switch to using MAPI/HTTP. You are likely to
see some delay before the change takes effect, but once it does, the Protocol column of the Outlook Connection Status dialog will display "HTTP" instead of "RPC/HTTP".
That is, unless the client has been disabled for MAPI/HTTP in the registry. This setting is located in
HKEY_CURRENT_USER\Software\Microsoft\Exchange, in a DWORD value named MapiHttpDisabled. If that
DWORD value is present, and is set to 1, then MAPI/HTTP will be disabled for that client.
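As a sketch, enabling MAPI/HTTP for the organization is a single command (remember that supported clients may take some time to pick up the change):
[PS] C:\>Set-OrganizationConfig -MapiHttpEnabled $true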
Outlook on the web
Outlook Web App, renamed Outlook on the web in Exchange 2016, is the browser-based client for Exchange mailboxes. For the sake of simplicity, I'll refer to it as OWA. Web browsers can connect to OWA assuming the following
conditions:
The end user knows the OWA URL to connect to. The OWA URL for on-premises Exchange is not
automatically discovered by web browsers using Autodiscover.
The OWA URL can be resolved in DNS.
The OWA URL can be accessed on HTTPS (TCP port 443).
The HTTPS session can be established.
OWA is enabled for the end-user (it's enabled by default).
OWA Login
Several methods can be used to authenticate a user connection with OWA. The recommended logon format is
User principal name (UPN), and best practice is to match a user’s UPN to their primary SMTP address (Figure
7-9). In effect this makes it seem to the user like they log on with their email address, which is simpler for them
to remember than any other method.
Using one of the forms-based authentication methods for OWA logon means that users see a user-friendly
logon page when they access the OWA URL with a browser. In fact, when the OWA logon format is set to use
the UPN the login page will display "Email address" instead of "User name" (Figure 7-10), to make it even
friendlier for the user to understand what to enter in the form.
Figure 7-10: The OWA login page with the user-friendly "email address" prompt
When multiple servers exist it is recommended to configure the OWA login settings to be the same, because
mismatched OWA logon settings on Exchange servers will cause problems for end users. The users will either
be unable to login, or will have their sessions unexpectedly terminated.
You can review the OWA virtual directory settings using the Get-OWAVirtualDirectory cmdlet.
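A sketch of the kind of output you might see; the server names and logon formats here are illustrative:
[PS] C:\>Get-OWAVirtualDirectory | Select Server,LogonFormat

Server                        LogonFormat
------                        -----------
EX2013SRV1                    FullDomain
EX2016SRV1                    PrincipalName
EX2016SRV2                    UserName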
In the example above you can see that the logon formats are different for each virtual directory. To align them
all with the desired setting, use Set-OWAVirtualDirectory. For example, to configure all of the servers in the
example above to use the UPN logon format, the following command is used.
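A sketch, assuming you want every OWA virtual directory in the organization to use the UPN (PrincipalName) logon format:
[PS] C:\>Get-OWAVirtualDirectory | Set-OWAVirtualDirectory -LogonFormat PrincipalName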
Because the Get-OWAVirtualDirectory and Set-OWAVirtualDirectory cmdlets need to make RPC calls to IIS on
each server for this configuration, you'll notice that the command runs very slowly, particularly in larger
environments.
Note: When you modify the OWA logon format there are some warnings thrown by the PowerShell
cmdlets. On each server IIS must be restarted by running IISreset, and the ECP virtual directory also
needs to be reconfigured to match the OWA logon format by using Set-ECPVirtualDirectory.
OWA Namespaces
Each Exchange server that hosts an OWA virtual directory has two URLs configured: the internal URL, and the
external URL. You can view the namespaces configured on OWA virtual directories by running the Get-
OWAVirtualDirectory cmdlet.
As mentioned earlier, browsers do not use Autodiscover to find the OWA URL. In fact, the user can type
anything into their browser, even an IP address, and as long as it resolves to the Exchange server they will
connect to OWA. But that doesn't mean that URLs should not be configured on the OWA virtual directories.
There are still two ways in which the configured OWA URL comes into play. The first is in the Outlook
The second is during some co-existence scenarios. In particular, when Exchange 2013 and 2007 are in co-
existence, OWA uses a redirect to a legacy URL when Exchange 2007 mailbox users connect to the Exchange
2013 server (Figure 7-12).
Figure 7-12: OWA redirection during Exchange 2007 and 2013 co-existence
Real World: Even though you can theoretically connect to OWA using the wrong URL, or even using
the IP address, if the connection goes through any type of application-aware load balancer or proxy
server that is checking for specific host names, then the connection will fail.
ActiveSync
Exchange ActiveSync (EAS) is the protocol used by mobile devices and applications to access email, calendar,
and other items in Exchange mailboxes. Troubleshooting mobile devices can be a complex task, because there
are so many factors that influence whether a mobile device will be able to successfully connect to Exchange.
For example, these are just some of the common factors that influence mobile connectivity to Exchange:
Mobile networks
DNS
Autodiscover
Firewalls
TCP session timeouts
Reverse proxies and load balancers
SSL certificates
Username format in Active Directory
EAS policies
EAS mailbox settings
Active Directory permissions
Exchange server health
ActiveSync Connectivity
The most obvious requirement for a mobile device is network connectivity, and a route between the device
and the Exchange server. Mobile devices typically connect over the internet, so EAS connectivity is also subject to any firewalls, reverse proxies, and load balancers that sit between the device and the Exchange server.
Before the device can connect to Exchange, it first uses Autodiscover to locate the ActiveSync namespace to
which to connect. To enable the Autodiscover process, a series of DNS records are created in the public DNS
zone for the domain. One or more of those DNS records will be used by the mobile device for Autodiscover.
Different devices and applications can have a different Autodiscover lookup order, but generally speaking the order will be similar to the DNS-based sequence described earlier for Outlook: the autodiscover sub-domain and root domain HTTPS URLs, the HTTP redirect method, and the DNS SRV record.
Real World: A recommended practice is to implement DNS records for all of the Autodiscover lookups
that different devices and applications might perform, for maximum compatibility. The trade-off is that you must remember to update all of the different record types any time there is a change, such as a migration or hybrid scenario. If you forget to update one of the record types, you can
expect mobile devices, as well as other applications such as Skype for Business or Outlook, to fail in
unexpected ways.
If the Autodiscover endpoint is located, and the user's credentials are correct, then Autodiscover will return the
ActiveSync namespace that is configured on the Client Access services in the Active Directory site where the
user's mailbox is hosted. The ActiveSync namespace can be seen by running the Get-
ActiveSyncVirtualDirectory cmdlet.
Server : EX2013SRV1
InternalUrl : https://mail.exchangeserverpro.net/Microsoft-Server-ActiveSync
ExternalUrl : https://mail.exchangeserverpro.net/Microsoft-Server-ActiveSync
Server : EX2016SRV1
InternalUrl : https://mail.exchangeserverpro.net/Microsoft-Server-ActiveSync
ExternalUrl : https://mail.exchangeserverpro.net/Microsoft-Server-ActiveSync
Server : EX2016SRV2
InternalUrl : https://mail.exchangeserverpro.net/Microsoft-Server-ActiveSync
ExternalUrl : https://mail.exchangeserverpro.net/Microsoft-Server-ActiveSync
The mobile device or application then looks up the namespace in DNS, makes a connection over HTTPS to the
namespace, authenticates the user, and then begins to perform folder sync and other operations. All of this
relies on DNS, firewall access, load balancers (if they are deployed in the environment), certificates, and
authentication. The connection will also depend on the Active Directory security on the user object, similar to
what was described earlier in this chapter for OWA access to mailboxes. Active Directory permissions are
discussed in a later section of this chapter.
Note: The PowerShell output above is an example of split DNS. The internal and external URLs are the same name, but that name resolves to different IP addresses for internal and external clients, because an internal DNS zone is hosted for internal clients to use, and an external DNS zone for clients on the internet to use.
Mobile device access to Exchange is governed by the allow, block, and quarantine (ABQ) process. The ABQ process can be represented by the workflow diagram in Figure 7-15.
At any stage during the ABQ workflow a decision can be made by the server to allow, block or quarantine the
device. For example, if a device is not policy compliant, then the block decision is made at that stage of the
process, and no further criteria are checked.
Authentication is the first requirement. If the user can't authenticate with their credentials, then they will not
be able to access their mailbox. Credentials are typically their email address and password, if you've
configured users with User Principal Names (UPNs) that match their primary SMTP address, which is the
recommended practice. If the credentials are incorrect, or the account is disabled or locked, the user can't log
on from the mobile device or application.
Real World: Disabling a user account will not prevent them from connecting if they already have an
established ActiveSync connection, due to caching of information by Internet Information Services (IIS)
on the Exchange server. It may take several hours for the user to be unable to connect, unless IIS is
restarted, which is disruptive for other users also connected to that server. If you need to immediately
block a user from connecting to ActiveSync, then you should disable the protocol on their mailbox.
The ActiveSync protocol is enabled by default on all mailboxes, and can be disabled on a per-mailbox basis.
You can see the current protocol status for a user's mailbox by running the Get-CASMailbox cmdlet.
ActiveSyncEnabled : True
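If you need to block a user immediately, as mentioned in the Real World note above, a sketch of disabling the protocol for a single mailbox (the identity is illustrative):
[PS] C:\>Set-CASMailbox -Identity "Alan Reid" -ActiveSyncEnabled $false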
The mobile device or application must also be policy compliant. Each mailbox user is assigned a mobile device
mailbox policy in Exchange that defines the required security configuration for any devices that the user wants
to use to connect to their mailbox.
The mobile device mailbox policies for the Exchange organization can be seen by running Get-MobileDeviceMailboxPolicy.
[PS] C:\>Get-MobileDeviceMailboxPolicy
The PowerShell output for mobile device mailbox policies is sometimes difficult to interpret, so a clearer view
can be seen by using the Exchange Admin Center to view the properties of each policy (Figure 7-16).
You can also view the mobile device mailbox policy assigned to a user by running Get-CASMailbox.
ActiveSyncMailboxPolicy : Default
ActiveSyncMailboxPolicyIsDefaulted : True
New mailboxes are automatically assigned to the mobile device mailbox policy that is marked as the default
policy.
Note: In the example above the policy name is "Default", and Alan Reid is configured to use the default
policy (which happens to be named "Default"). However, if another policy was set as the default later,
Alan would begin using the new default policy, instead of the policy named "Default". To avoid
confusion with this situation it is best to leave the policy named "Default" as the default, configured
with the policy settings you want to be the defaults for new mailbox users, and then create additional
non-default policies with different names for specific needs.
If the device settings comply with the policy for that mailbox, then the ABQ process next looks at whether the
device ID has been explicitly blocked or allowed for that mailbox user, in that order. The Get-CASMailbox
cmdlet shows us which device IDs fall into either of those categories. In this example, Vik Kirby has a device ID
that is allowed, so it will skip all subsequent ABQ checks and be allowed to connect to the mailbox.
ActiveSyncAllowedDeviceIDs : {F04016EDD8F2DD3BD6A9DA5137583C5A}
ActiveSyncBlockedDeviceIDs : {}
Personal block or allow exemptions of that nature are only created by administrator action. An administrator
can pre-populate the device ID into the block or allow list if the device ID is known beforehand, otherwise they
can do it after the user has tried to connect the device (and perhaps been quarantined, as we'll discuss shortly).
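A sketch of pre-populating a user's allow list with a known device ID (the identity and device ID here are illustrative):
[PS] C:\>Set-CASMailbox -Identity "Vik Kirby" -ActiveSyncAllowedDeviceIDs @{Add="F04016EDD8F2DD3BD6A9DA5137583C5A"}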
If no block or allow exemptions exist, then ABQ next looks at device access rules. Device access rules are
created by administrators to either block, quarantine, or allow devices and applications that meet the specific
criteria of the rule. The available criteria are:
Device type
Device model
Device OS
User agent
Each device access rule can have a single characteristic assigned, which makes them very precise, but also
quite cumbersome to deal with if there are dozens or even hundreds of device characteristics that you need to
build rules for. The device access rules can be seen by running Get-ActiveSyncDeviceAccessRule.
QueryString : NastyOS
Characteristic : DeviceOS
AccessLevel : Block
Name : NastyOS (DeviceOS)
AdminDisplayName :
ExchangeVersion : 0.10 (14.0.100.0)
DistinguishedName : CN=NastyOS (DeviceOS),CN=Mobile Mailbox Settings,CN=Exchange Server
Pro,CN=Microsoft
Exchange,CN=Services,CN=Configuration,DC=exchangeserverpro,DC=net
Identity : NastyOS (DeviceOS)
Guid : 9065415f-0602-42e0-9b36-4c813e1b132e
ObjectCategory : exchangeserverpro.net/Configuration/Schema/ms-Exch-Device-Access-Rule
ObjectClass : {top, msExchDeviceAccessRule}
WhenChanged : 2/5/2016 12:39:51 AM
WhenCreated : 2/5/2016 12:39:51 AM
WhenChangedUTC : 2/4/2016 2:39:51 PM
WhenCreatedUTC : 2/4/2016 2:39:51 PM
OrganizationId :
Id : NastyOS (DeviceOS)
OriginatingServer : S1DC1.exchangeserverpro.net
IsValid : True
ObjectState : Unchanged
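A rule like the one shown above could be created with New-ActiveSyncDeviceAccessRule. This is a sketch; the query string is simply whatever device OS value you want to match:
[PS] C:\>New-ActiveSyncDeviceAccessRule -QueryString "NastyOS" -Characteristic DeviceOS -AccessLevel Block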
When multiple device access rules match a device they are assessed in the following order:
1. Block rules
2. Quarantine rules
3. Allow rules
In effect this means that the most restrictive rule (a block) will override a less restrictive, or non-restrictive rule.
That just makes good security sense.
If no device access rules match the device, then ABQ finishes by applying the default access level for the
organization. There are three levels, as you would expect: allow, block, or quarantine. You can view the default
access level by running the Get-ActiveSyncOrganizationSettings cmdlet.
DefaultAccessLevel : Allow
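If you want a more restrictive default, for example quarantining unknown devices and notifying an administrator, a sketch (the recipient address is illustrative):
[PS] C:\>Set-ActiveSyncOrganizationSettings -DefaultAccessLevel Quarantine -AdminMailRecipients [email protected]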
With so many different stages of the ABQ process that can cause a device to be blocked, quarantined, or
allowed, it may seem to be impossible to troubleshoot. However, there are a few techniques that will make
your job as an Exchange administrator a lot easier.
The first is using the Get-MobileDevice cmdlet to look at the device access state, and the device access state
reason. This information will reveal which devices are blocked, quarantined, or allowed, and why they are in
that state.
DeviceId : Appl87941C1N3NS
DeviceType : iPhone
DeviceModel : iPhone2C1
DeviceAccessState : Blocked
DeviceAccessStateReason : Policy
DeviceId : 704294541
DeviceType : TestActiveSyncConnectivity
DeviceModel : TestActiveSyncConnectivity
DeviceAccessState : Allowed
DeviceAccessStateReason : Individual
DeviceId : ApplDLXH8DELDVGJ
DeviceType : iPad
DeviceModel : iPad3C3
DeviceAccessState : Allowed
DeviceAccessStateReason : Global
In the output above you can see examples of both blocked and allowed devices. One device is blocked because of a policy (it doesn't meet the requirements of the mobile device mailbox policy for that user). Two devices are allowed; one is allowed by a personal exemption, and the other is allowed because of the global (or default) access level for the organization.
The second useful technique is to run the Exchange Remote Connectivity Analyzer (ExRCA), and perform its
ActiveSync test. The ExRCA test will simulate a real mobile device on the internet, giving you a clear view of
the complete, end to end connectivity and ABQ process for mobile devices and applications that are
connecting to your Exchange servers. If you can't see a mobile device associated with a mailbox user in
Exchange at all, then there's a good chance that a connectivity problem is occurring before the ABQ process
can start. The ExRCA test will help you to identify those types of problems.
Real World: It's also useful to keep some spare mobile devices around for testing with different mobile
operating systems and applications.
Exchange Web Services
EWS is used by Microsoft Outlook for calendar free/busy information, Out of Office settings, calendar sharing,
and other features such as MailTips. Mac clients rely entirely on EWS for all communications. When EWS is not
working, those features will stop working as well, which end users are likely to notice as symptoms such as not
seeing the availability of other users they are inviting to a meeting.
EWS uses its own namespace which can be seen by running the Get-WebServicesVirtualDirectory cmdlet.
Server : EX2013SRV1
InternalUrl : https://mail.exchangeserverpro.net/EWS/Exchange.asmx
ExternalUrl : https://mail.exchangeserverpro.net/EWS/Exchange.asmx
Server : EX2010SRV1
InternalUrl : https://mail.exchangeserverpro.net/EWS/Exchange.asmx
ExternalUrl : https://mail.exchangeserverpro.net/EWS/Exchange.asmx
Server : EX2016SRV2
InternalUrl : https://mail.exchangeserverpro.net/EWS/Exchange.asmx
ExternalUrl : https://mail.exchangeserverpro.net/EWS/Exchange.asmx
As long as the EWS namespace is resolvable in DNS, can be reached over the network by the client, and the
server is using a valid and trusted SSL certificate, then not many things can go wrong with EWS. Remember
that EWS settings can only come from Autodiscover, so the EWS namespace can only be located when
Autodiscover is configured correctly. However, it is also possible that applications will be blocked from
accessing EWS.
There are multiple controls in place for allowing or blocking EWS. At the organization level, EWS can be
configured to either enforce a block list (which will block any applications listed in the block list), or enforce an
allow list (which will block any application except those that are listed in the allow list).
Real World: In June 2013, LinkedIn was found to have implemented a feature that invited users to
enter their corporate email credentials on the LinkedIn website. LinkedIn then connected to the
person's mailbox and scraped it for email addresses to suggest them as possible contacts that should
be invited to connect on LinkedIn. This connection used EWS to access Exchange server mailboxes.
The organization configuration can be viewed by running the Get-OrganizationConfig cmdlet. By default,
there is no policy set and nothing is explicitly allowed or blocked, which in effect means that any application
can access EWS.
EwsAllowEntourage :
EwsAllowList :
EwsAllowMacOutlook :
EwsAllowOutlook :
EwsApplicationAccessPolicy :
EwsBlockList :
EwsEnabled :
To block an application from accessing EWS, the EWS application policy must first be set, and then the allow list or block list populated with the user agents of the applications you want to control. Using the example of LinkedIn mentioned above, to block the LinkedIn agent you would run the following commands.
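A sketch of what those commands look like; the user agent string shown here is an assumption and should be confirmed against your own logs before you rely on it:
[PS] C:\>Set-OrganizationConfig -EwsApplicationAccessPolicy EnforceBlockList
[PS] C:\>Set-OrganizationConfig -EwsBlockList @{Add="LinkedInEWS*"}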
EWS can also be enabled or disabled on a per-mailbox basis, and individual mailboxes can have their own
EWS application access policy, block list, or allow list.
EwsEnabled :
EwsAllowOutlook :
EwsAllowMacOutlook :
EwsAllowEntourage :
EwsApplicationAccessPolicy :
EwsAllowList :
EwsBlockList :
The same steps are used to allow or block EWS applications for mailboxes as are used for the organization-
wide configuration, replacing Set-OrganizationConfig with Set-CASMailbox.
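For example, a sketch of disabling EWS entirely for a single mailbox (the identity is illustrative):
[PS] C:\>Set-CASMailbox -Identity "Alan Reid" -EwsEnabled $false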
POP and IMAP
POP and IMAP clients can connect to Exchange mailboxes assuming the following conditions:
The end user knows the POP/IMAP server name to connect to. The POP/IMAP server name is not
automatically discovered using Autodiscover.
The POP/IMAP server name can be resolved in DNS.
The POP/IMAP services are running (they are not enabled by default)
The server can be accessed on the correct network ports (the default ports are 110 and 995 for POP
and secure POP, 143 and 993 for IMAP and secure IMAP).
The POP/IMAP protocol is enabled for the mailbox (they are both enabled by default).
The user is able to securely authenticate with the server.
One of the first tasks for an administrator who needs to make POP or IMAP available is to start those services
on the Exchange server, and enable them for automatic start up for future restarts.
There are two POP services, and two IMAP services, that need to be enabled. For POP the service names are:
Microsoft Exchange POP3 (MSExchangePOP3)
Microsoft Exchange POP3 Backend (MSExchangePOP3BE)
The equivalent IMAP services are Microsoft Exchange IMAP4 (MSExchangeIMAP4) and Microsoft Exchange IMAP4 Backend (MSExchangeIMAP4BE).
For Exchange Server 2013 the front end service (e.g. Microsoft Exchange POP3) runs on the Client Access
server role, while the backend service runs on the Mailbox server role. If both roles are installed on the same
server, then both services exist on the server. For Exchange Server 2016, both services always run on the same
server.
Note: In earlier versions of Exchange only one service existed for each of POP and IMAP, and there
were no backend services for either of them.
The POP or IMAP services need to be enabled on any server that the client will be connecting to for POP or
IMAP, and on any server that can host an active database copy for the mailbox users that will be connecting.
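A sketch of enabling and starting the two POP3 services on a server; the same approach applies to the IMAP4 services:
[PS] C:\>Set-Service MSExchangePOP3 -StartupType Automatic
[PS] C:\>Set-Service MSExchangePOP3BE -StartupType Automatic
[PS] C:\>Start-Service MSExchangePOP3
[PS] C:\>Start-Service MSExchangePOP3BE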
Although their use is less common, when POP or IMAP are used it's important that clients connect securely.
This is because the POP and IMAP protocols both pass all communications, including credentials, in clear text.
POP and IMAP connections are only secure if they occur over an SSL/TLS encrypted connection. Unfortunately,
it is the secure logon process that causes most connectivity issues for POP and IMAP.
A certificate is normally assigned to the POP and IMAP services by running Enable-ExchangeCertificate with POP and IMAP included in the list of services. If the certificate you want to use is a wildcard certificate, then you need to use Set-POPSettings and Set-IMAPSettings instead. For example, to set the POP service to use a fully-qualified domain name of mail.exchangeserverpro.net, we would run the following command.
[PS] C:\>Set-POPSettings -X509CertificateName mail.exchangeserverpro.net
WARNING: Changes to POP3 settings will only take effect after all Microsoft Exchange POP3 services are
restarted on server EX2016SRV1.
The same process is used with the Set-IMAPSettings cmdlet to set the IMAP service to use a wildcard certificate.
[PS] C:\>Set-IMAPSettings -X509CertificateName mail.exchangeserverpro.net
WARNING: Changes to IMAP4 settings will only take effect after all Microsoft Exchange IMAP4 services are restarted on server EX2016SRV1.
[PS] C:\>Restart-Service MSExchangeIMAP4
Warning: If the Exchange server has an SSL certificate that exactly matches the X509certificatename
value that you provide in the examples above, then that certificate will be bound to the POP or IMAP
service instead of the wildcard certificate. The wildcard certificate is only able to be used when no other
certificates contain the fully-qualified domain name that you specify when configuring the POP or IMAP
settings.
With the correct certificate configuration in place, the POP or IMAP clients will be able to securely transmit their logon credentials to the server. Secure login is the default setting for POP and IMAP services.
However, if you do find that you have POP or IMAP clients that can't support secure login for some reason, or you want to disable secure login temporarily to test whether a problem is certificate-related or password-related, then you can set plain text login for POP and IMAP. For example, to set plain text authentication for the POP service you can run the following command.
[PS] C:\>Set-POPSettings -LoginType PlainTextLogin
WARNING: Changes to POP3 settings will only take effect after all Microsoft Exchange POP3 services are
restarted on server EX2016SRV1.
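Protocol logging is another useful troubleshooting aid for POP and IMAP. A sketch of enabling it for the POP3 service (the service must be restarted for the change to take effect):
[PS] C:\>Set-POPSettings -ProtocolLogEnabled $true
[PS] C:\>Restart-Service MSExchangePOP3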
If you have a load balancer in the environment that is performing health checks for the POP or IMAP services,
then you'll see a moderate amount of protocol logging each day due to those probe connections from the
load balancer.
Warning: Protocol logging for POP and IMAP does not have the same retention capabilities as
protocol logging for Transport. The protocol log files will not be automatically removed when they
exceed a certain size or age limit, and will eventually consume all available disk space on the volume.
For this reason, it is recommended to only enable protocol logging for POP or IMAP temporarily while
troubleshooting a problem, and then disable it again when the troubleshooting has been completed.
POP and IMAP protocols are only used to download email from mailboxes. To send email a client needs to use
SMTP, which we'll cover in the next section of this chapter.
SMTP
Simple Mail Transfer Protocol (SMTP) is the protocol used in an Exchange environment to send email
messages between servers, and also between some clients and servers. SMTP is not used by Outlook clients
that are connecting via Outlook Anywhere or MAPI/HTTP, mobile clients connecting via ActiveSync, or other
clients connecting via Exchange Web Services. However, SMTP is used by POP/IMAP clients (which are only
mail-access protocols), as well as by many devices and applications in a typical network environment, to send
messages. It's quite common to use an Exchange server as the SMTP provider for scanners, photocopiers,
UPSs, hardware appliances, as well as numerous applications that need to send email notifications.
SMTP clients connect to the closest matching receive connector on the Exchange server when they are sending an email message. In Chapter 4 we've already looked at how receive connectors work, and how you can troubleshoot them.
For "dumb" devices a simple, unencrypted, unauthenticated SMTP connection is used, and the Exchange
server can accept those connections on its default receive connector, or on a custom receive connector that is
configured to allow SMTP relay from specific IP addresses. However, for many SMTP clients there are often additional considerations when authentication and encryption of the SMTP connection is required. POP and IMAP clients (discussed in the previous section of this chapter) are a good example.
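Authenticated clients normally submit messages on TCP port 587. A sketch of confirming which receive connector is listening on that port; the server and connector names are illustrative:
[PS] C:\>Get-ReceiveConnector -Server EX2016SRV1 | Where {$_.Name -like "Client Frontend*"} | Select Name,Bindings

Name                          Bindings
----                          --------
Client Frontend EX2016SRV1    {[::]:587, 0.0.0.0:587}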
Note: In the example above port 587 is used, which is configured during Exchange setup on the "Client
Frontend" receive connector on the Exchange server. 587 is the SMTP port that should be used by
authenticated clients.
The TLS encrypted connection requires a trusted and valid SSL certificate on the server. Certificates are
covered in chapter 3. With a trusted, valid SSL certificate installed and enabled for SMTP on the Exchange
server, the SMTP client may still receive errors or warnings when trying to establish a TLS encrypted session
with the server. The reason for this is that the receive connector has not been configured with the TLS
certificate name.
The TLS certificate name value is made by combining the certificate's issuer and subject into a single string. This is achieved by using PowerShell. First, determine the thumbprint value for the certificate you want to use. In this example I’m going to use my wildcard certificate, which is already enabled for SMTP.
Capture the certificate as a variable, specifying the thumbprint of the certificate that will be used for the
receive connector.
Now, declare a new variable that combines the issuer and subject values for the certificate.
Finally, set the TlsCertificateName property on the server's receive connector. This step is repeated for as many
servers as you need to configure.
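Putting those steps together, a sketch of the commands; the thumbprint, server, and connector names are illustrative:
[PS] C:\>Get-ExchangeCertificate | Select Thumbprint,Subject,Services
[PS] C:\>$cert = Get-ExchangeCertificate -Thumbprint 9A62B58B722C49B9BBD77A7EE2A0E74B0D2C19E2
[PS] C:\>$tlscertificatename = "<I>$($cert.Issuer)<S>$($cert.Subject)"
[PS] C:\>Set-ReceiveConnector "EX2016SRV1\Client Frontend EX2016SRV1" -TlsCertificateName $tlscertificatename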
SMTP clients are also subject to message rate limiting on the receive connectors for the Exchange server they
are connecting to. You can view the message rate limits for a server by running the Get-ReceiveConnector
cmdlet.
As you can see, most of the connectors have an unlimited message rate limit. However, the receive connector
used by SMTP clients has a message rate limit of 5 (per minute), based on the user. For humans sending email
this is likely to be enough for their needs, and indeed most email clients will handle rate limiting without issue,
and will simply retry any messages that were not able to be sent on the first attempt.
However, some applications that use SMTP may not handle rate limiting very well, and will simply drop the
messages that could not be sent. Ideally this is addressed in the application's code, but if you have a need to
set a higher rate limit you can do so by running Set-ReceiveConnector.
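A sketch of raising the limit on the client connector; the connector name and value are illustrative:
[PS] C:\>Set-ReceiveConnector "EX2016SRV1\Client Frontend EX2016SRV1" -MessageRateLimit 20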
Real World: A high value for the message rate limit is preferable to an "unlimited" rate limit, to protect
the server against a rogue application that may try to send thousands or even millions of messages
through the Exchange server.
Client connections are also subject to Exchange throttling policies. A default throttling policy is created in every organization, with generous limits for most settings, and even many limits that are set to "Unlimited". You can view the throttling policy by running the Get-ThrottlingPolicy cmdlet.
As a general rule, you should not modify the default throttling policy. Some specific software products that
need to integrate with Exchange in a way that would breach the default throttling policy limits will often come
with instructions from the third party vendor as to how to configure a new throttling policy specifically for that
product.
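A sketch of creating a custom policy and associating it with a service account, rather than modifying the default policy; the policy name, account, and limit are illustrative:
[PS] C:\>New-ThrottlingPolicy -Name "BackupAppPolicy" -EwsMaxConcurrency 40
[PS] C:\>Set-ThrottlingPolicyAssociation -Identity "svc-backup" -ThrottlingPolicy "BackupAppPolicy"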
Active Directory Permissions
Users who are members of protected Active Directory groups (such as Domain Admins) have permissions inheritance disabled on their user objects by the AdminSDHolder (SDPROP) process, which can prevent Exchange from accessing the user object and cause OWA or ActiveSync logons to fail. To check for the cause of this issue, open the properties of the user object in Active Directory Users &
Computers, select the Security tab, and then click the Advanced button. Note that if you don't see the
Security tab then you need to first enable Advanced Features in the View menu of the Active Directory Users
& Computers console.
Make sure that permissions inheritance is enabled for the user object, so that the correct ACLs apply for
Exchange server access to the user. All that should be required by you is to click the Enable Inheritance
button (Figure 7-19), and apply the changes.
Figure 7-19: An Active Directory user object with permissions inheritance disabled
The solution to this should be obvious – don't use members of protected groups for day to day access to
mailboxes over OWA and ActiveSync. Doing so risks those credentials being exposed, even if the connections
are occurring over a HTTPS connection. Use a separate administrative account for administrative tasks, and
keep your day to day account as low privilege as possible.
Certificates
As most client connectivity occurs over encrypted connections, SSL certificates play an important role in
allowing a successful connection to occur. Certificates have been discussed in several of the sections of this
chapter, and are explained in more detail in Chapter 3.
Load balancing
In a high availability deployment, load balancers are often used to distribute client traffic across multiple
Exchange servers. Client connectivity is therefore subject to the configuration and availability of the load
balancer itself. In addition, if the load balancer is sending traffic to inactive or unhealthy servers, then the client
experience will suffer.
Real World: Even a minor setting in a load balancer configuration can have a big impact on the client
connection, or the user experience. In one troubleshooting case I worked on it was determined that a
compression setting on the load balancer was causing some of the JavaScript code for OWA to be
corrupted in transit, which broke much of the OWA interface for end users.
Recipient configurations
Even a successfully connected client may experience issues that are not client-related, but rather recipient-
related. Recipients are discussed in more detail in chapter 6.
Performance
The client experience can easily be impacted by the performance of the Exchange server itself. Even if all
network, certificate, authentication and other factors are working perfectly, if the client is connected to an
overloaded Exchange server then the experience will be poor. Server performance is discussed in more detail
in chapter 8.
Additional Reading
Autodiscover
Exchange 2013 Autodiscover Service
New Behavior in Outlook 2013 Causing Certificate Errors in Some Environments
Client Connectivity
Network Ports for Clients and Mail Flow in Exchange 2013
Other
AdminSDHolder, Protected Groups, and SDPROP
Chapter 8: Troubleshooting Performance
Proper performance troubleshooting and analysis skills are the most transferable skillset of any IT support
professional. I’ve personally known support engineers who were subject matter experts in Windows
performance move to an equivalent role with Exchange, SQL, or Azure. All without making a 180-degree
change in career path, but instead taking their core performance and issue analysis skills and simply building
upon them. While this chapter focuses on Exchange-specific performance troubleshooting, let’s take a
moment to discuss the troubleshooting skills that span technologies.
Establishing a baseline
As previously mentioned in Chapter 1, when you have performance issues, the best defense can often be a
good offense. In other words, the preemptive collection of performance data and historical trends can be your
most valuable resource when you eventually meet a challenging performance issue that’s affecting production.
Having the following information will assist in determining how an Exchange server has deviated from the
baseline:
Number of mailboxes
Messages Sent/Received daily
Average Message Size
Expected IOPS per Database/Disk/Server
Average Disk Latency
Average Memory consumption
Average CPU Utilization
Average connections per client type at the Client Access Layer (typically the Load Balancer)
With this information, you’re not gathering data and making assumptions, you’re simply looking for
divergence from the baseline to current state. How can you possibly know what poor performance is without
first knowing what acceptable performance is? The important thing is to avoid personal opinion and bias by
having raw data readily available to compare against. It doesn’t matter if you use the built-in Microsoft
toolsets or a third-party monitoring solution to collect the data. Select a performance monitoring and testing
solution that allows you the best combination of usability, functionality, feature set, and price.
Even if you're not a member of the SAN team or the networking team, if your application (Exchange in our case) is performing poorly as a result of another team's infrastructure, you're still accountable until you can prove otherwise. It is for this reason that you must educate yourself on every aspect of the Exchange deployment, as you cannot always rely on others to show the same level of dedication to the performance of your application.
When troubleshooting Exchange performance, all data must be considered. However, we must not arrive at a
conclusion before first analyzing the data, developing a theory, and testing. A common mistake is making your
own conclusion and manipulating the data to fit your conclusion (because we as humans do not like being
wrong) instead of allowing the data to carry us to the correct conclusion.
If you’re looking for good questions to get you started on the correct path to troubleshooting a performance
issue, consider the following:
Jetstress – Tool for validation of Exchange storage. Uses Exchange ESE to generate IO with the goal of
generating the maximum number of IOPS while staying within acceptable disk latency thresholds. This
information is used to confirm that the storage solution will deliver at least as many IOPS as required
for the solution. Exchange is not required to be installed to run Jetstress.
LoadGen – Exchange user simulation tool to place synthetic workload on an Exchange server with the
goal of stressing not only disk but RAM and CPU as well.
Another important tool to this process is the Exchange Server Role Requirements Calculator which is used to
determine the goals the tests strive to achieve. While the goal of this book is not to teach how to size
Exchange, understanding what components are required to make an Exchange environment run at a
satisfactory level is vital to recognizing an improperly sized environment. We will therefore discuss at a high
level the steps required to properly size and validate an Exchange environment. Afterwards, we’ll discuss how
this data can potentially be used for performance troubleshooting in the future. Some steps have been
omitted for simplicity, such as “Determine Goals”, “List Constraints”, and so on. The steps are as follows:
Gather Inputs
The calculator allows you to place this data into User Mailbox Tiers, assuming not every user in an
environment uses email in the same manner. Figure 8-1 shows a theoretical environment with 90,000 mailboxes,
80,000 of which use email in a different way than the remaining 10,000. This fact led to the decision to
categorize them in their own tier, thus helping in our sizing decisions. Ideally these 90,000 mailboxes will be
evenly distributed amongst the Exchange servers to achieve balance and predictable performance load
distribution.
These are typically the most impactful inputs that will determine both your performance and sizing needs. As a
general rule, Messages Sent/Received per day will have the greatest impact on the amount of performance
you will need out of the system. Typically, changing this value will adjust the IOPS required per server value on
the Role Requirements tab of the calculator. Average Message Size, mailbox size, and Deleted Item Retention, on the other hand, determine the amount of storage capacity needed by the infrastructure. The number of
mailboxes will obviously have an impact on both performance as well as capacity requirements.
These inputs are relevant to performance troubleshooting because changes to these values can drastically
alter the required resources for running Exchange properly. I worked on a very high profile escalation for a
large customer where the hardware was mistakenly to blame for Exchange performance issues. Their original
environment was sized as follows:
70,000 mailboxes
100 Messages Sent/Received Per Day (per mailbox)
Avg. Message Size of 75 KB
By the time of the escalation the environment had grown to roughly 91,000 mailboxes, each sending and receiving around 150 messages per day. After further discussion, it was determined the customer had acquired another company (responsible for the
30% mailbox count growth) and their Exchange Administrator had recently left the company. So the
customer’s inputs had changed and there was not sufficient monitoring in place to detect this growth and
change in usage. The change from 100 Messages Sent/Received per day to 150 actually resulted in a 60%
growth in required IOPS per Server. This value is present on the Role Requirements tab of the Exchange
Calculator, specifically the "Total Database Required IOPS/Server" value (Figure 8-2). Note that only the database required IOPS per server (2,111) are relevant to performance planning, as log IOPS are sequential and not considered burdensome on rotational disk storage.
The hardware was actually over performing rather than underperforming. The ultimate solution was to scale
out the environment by adding additional Exchange Servers. During this process the Exchange Calculator was
again used to properly size the environment.
This is an excellent lesson in how accurate and up-to-date inputs are vital to sizing an Exchange environment,
as well as troubleshooting its performance. I recommend all customers keep the spreadsheets generated by
the Exchange Server Role Requirements Calculator for the life of the Exchange solution and regularly compare
current inputs to the original specifications.
Storage
With any storage solution, I often find myself explaining to customers the level of performance they can
expect to achieve from the hardware in question. Due to the various factors involved, the most accurate
answer would be to say the solution can achieve whatever performance rating that it has actually been
validated for. Meaning, place your workload on the storage solution and see what you can achieve; tuning
performance as necessary. If the goal is to determine how many IOPS a solution can achieve then it greatly
depends on the workload being performed against it. For example, if the individual IO is very small (4 KB) it
would be much easier to achieve more IO per second (IOPS) than if the IO size were large (1 MB). This is a trick
that vendors often use to inflate their achievable IOPS numbers, forgetting to mention that the achievable IO
greatly depends on the application that generates IO against the storage.
So while the best answer (the Consultant answer) is often “it depends” when asked about achievable IOPS,
there are some general statements around disk performance which can be used for guidance. Each disk type
has a given amount of IOPS that it can expect to provide within acceptable latency thresholds. The Exchange
Server Role Sizing Calculator has a very helpful hidden table which provides estimates for achievable IOPS per
disk type (Figure 8-3). In this instance, for Exchange sizing, we care only about Random IO.
As you can see, there’s a significant difference in the expected IOPS of a 7.2K NL SAS disk (~55 IOPS)
compared to a 15K SAS disk (~180 IOPS). You’ll notice SSD drives are not mentioned, as they are not
recommended for Exchange deployments due to cost and small size, but they can achieve IOPS in the many
thousands.
In troubleshooting Exchange storage performance, it’s important to understand the type of disk storage being
used. This will give you a rough estimate of the achievable IOPS of the overall solution. For example, a server
with 12 7.2K NL SAS hard drives should be able to achieve roughly 700 IOPS. It’s certainly not an exact science
due to the various potential size of IO as well as sequential vs random IO, but it’s usually good enough for fair
estimations. For example, if the Performance Monitor \PhysicalDisk\Disk Transfers/Sec counter (which
effectively tracks IOPS) displays over 2,000 IOPS on the same 12 disk system previously mentioned, it would be
a fair assumption that the workload was generating more IOPS than the storage solution can support. So it
would be expected to see disk latency at unacceptably high values on this solution (disk latency over Avg.
20ms is considered poor).
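A quick spot-check of these counters can be done with PowerShell; a sketch, sampling every 5 seconds for a minute:
[PS] C:\>Get-Counter '\PhysicalDisk(*)\Disk Transfers/sec','\PhysicalDisk(*)\Avg. Disk sec/Transfer' -SampleInterval 5 -MaxSamples 12
An Avg. Disk sec/Transfer value consistently above 0.020 (20 ms) indicates latency in the poor range described above.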
Note: It’s preferred to use the PhysicalDisk object, to ensure we look at the numbers closest to the disk subsystem. When BitLocker is in use, for instance, the IOPS reported at the LogicalDisk object level can be inflated by up to 8 times compared to the original estimate.
I often use this chart to set expectations with customers for the number and type of disks they have in the
system. It’s important to understand that scientifically speaking, there is going to be a performance limit of
any storage solution. Knowing that as you move nearer this limit, latency will increase, is critical for
troubleshooting storage performance.
Note: With most storage solutions, placing a faster disk in the same array as slower disks will not
deliver the desired results. An example is adding a 10K disk to an array full of 7.2K disks. In some cases,
it can even introduce latency as the array compensates for the differing disk speeds. Therefore, if you
choose to add faster media, do so for the entire array/virtual disk.
Now you might ask “why not simply purchase faster drives and achieve greater IOPS per spindle?” The answer
is simply cost per gigabyte. The goal of the Exchange Product Team has been to provide users with the
biggest and fastest mailbox for the cheapest price. With reduced IOPS requirements, faster disks are no longer necessary for acceptable performance, and large, cheap 7.2K drives become a viable option.
Note: NL SAS disks are really just SATA disks with SAS logic. This means that the drives typically
perform at SATA speeds. Generally speaking, a large (2-8TB) 7.2K SAS drive is synonymous with NL SAS.
CPU and Memory
Microsoft also publishes processor and memory sizing recommendations for Exchange servers. Many of these recommendations revolve around .NET performance. Microsoft Support has encountered cases
where very dense servers, with many CPU cores, encountered performance issues with .NET garbage
collection. This issue is described in further detail in this TechEd Europe session on Exchange 2013
performance monitoring and tuning. It is for these same reasons that the Exchange Product Team
recommends disabling CPU hyperthreading on physical servers, as giving Exchange the impression that twice
the CPU cores are available than actually are can result in poor .NET performance. As you can probably guess,
Exchange health is dependent upon .NET health. Therefore it’s critical the recommended version of .NET is
deployed on Exchange 2010/2013/2016 servers.
This post from a Microsoft Support Escalation Engineer outlines excellent CPU sizing and performance
troubleshooting tips for Exchange. One tip in the article I strongly recommend for all Exchange deployments is
to change server power plans to “High Performance” mode or an equivalent setting. Many servers have BIOS
power settings which control system and processor power states. These are typically utilized to save power,
encourage green computing, and reduce power consumption costs. However, these solutions are not always
optimal for systems performing work where timing is critical, such as Exchange or SQL. It’s for this reason that
Microsoft recommends disabling these power saving features to ensure optimal processor performance.
Lastly, while I feel it’s important to take Microsoft’s CPU and Memory sizing recommendations into
consideration, they are still only recommendations and are not support statements. They are based on best
practices developed through trial and error gained from running Exchange Online servers inside the Office 365
cloud service. Servers with 24 cores and 96GB of RAM have been battle-tested and proven to be excellent at
running Exchange Server in Microsoft’s datacenters. Therefore, deploying a similar platform is likely to yield a
stable and well-performing Exchange solution.
Storage Validation
Knowing that the underlying storage has been validated to handle the expected workload is an important part
of being able to reliably support the actual workload. It’s why Microsoft Support has long stated that
validating storage with Jetstress is one of the requirements for being “supported”. The job of Jetstress is fairly
simple: place a pre-determined IO load on the disk subsystem and verify the latency of the storage is
acceptable to run Exchange. I highly suggest that you read the Jetstress Field Guide to better familiarize
yourself with how to properly use this tool. In short, the process is as follows:
Use the Exchange Server Role Requirements Calculator to determine the IOPS target for the design.
Configure Jetstress with the planned database layout and the target IOPS.
Run the test, then review the results to confirm that the storage can deliver the target IOPS within acceptable latency thresholds.
I often joke saying many of my Exchange performance cases involve servers where Exchange hasn’t yet been
installed. This is due to customers having issues during the validation (Jetstress) phase of their deployment.
These cases can be broken into two categories:
Unfamiliarity with the Jetstress tool and the proper testing methodology
An actual hardware, configuration, or sizing issue
Reading through the Jetstress Field Guide usually resolves the former issue. People who are unfamiliar with
Jetstress testing will often install Jetstress, crank the thread count to an unreasonably high value, and blame
the hardware when it fails. My answer is that I can theoretically make Jetstress fail on any storage solution. As
for the latter, the resolution is typically one of the following (most of which will be covered later in this chapter): enabling or correcting controller write caching, fixing the array or virtual disk configuration, updating storage firmware and drivers, or revisiting the sizing inputs for the design.
Having the knowledge that a storage solution passed Jetstress means that at one time, it did function
properly. To a troubleshooter, this means something has changed in the environment causing it to not
function as expected. Passing a Jetstress test also gives an indication of the IOPS that can actually be delivered
by the solution. Upon completion, Jetstress generates an .HTML results file giving the Pass/Fail result of the
test as well as the latency values measured on each drive. This is also a useful file to keep as it can help
troubleshooting efforts if problems subsequently develop. It is not unheard of to have a firmware, operating
system, or anti-virus update change the latency within an Exchange configuration. If drastically different results
are observed compared with when the storage was originally validated, this is an indication that something
fundamental has changed, assuming the IO load is the same.
For the correct deployment architecture, JBOD actually makes a lot of sense once you understand the
performance and High Availability improvements made in the product. Exchange 2013/2016 requires ~90%
fewer IOPS than Exchange 2003, which makes large/slow/cheap disks such as 6TB 7.2K NL SAS a viable option for Exchange storage.
Although I’m not here to tout the capability of Exchange High Availability (that topic is addressed in its own
book), I do want to discuss a common misconception around the hardware requirements for an Exchange
JBOD solution (as specified in Microsoft’s Preferred Architecture). There’s no doubt that if you get storage
wrong, it results in poor Exchange performance and probably leads to investigation of the hardware. In every
such case that I have been involved with, the root cause was not the deployed hardware itself, but rather an inappropriate hardware configuration.
The definitive source of information regarding Exchange storage is the TechNet article describing the
Exchange Storage Configuration Options. The article details the various supported configurations, storage
media, and best practices for Exchange storage, such as physical disk types, supported RAID configurations, and caching settings.
Guidance around controller caching settings is found in the “Supported RAID types for the Mailbox server role” section of that article.
This guidance is important when it comes to deciding on the detailed configuration for a JBOD-based
Exchange infrastructure. Unfortunately, many incorrectly assume that if you deploy Exchange on JBOD, a
caching controller is not required to achieve the necessary performance. Perhaps this misconception is due to
the notion that JBOD is simply a disk connected to a server, with no intelligence or sophisticated controller
whatsoever. With Microsoft advertising the performance improvements in Exchange storage, it’s easy to see
how the mistaken impression arises that a caching controller is not required. This is absolutely incorrect.
Although Microsoft technically supports the lack of a caching disk controller in Exchange JBOD-based
configurations, no guarantees are extended that the solution will be able to provide the performance
necessary to run Exchange. This is why running Jetstress is such an important step in an Exchange deployment.
The only reason I feel the lack of a caching controller is tolerated is so that Storage Spaces (which require the
absence of a caching controller) are supported. However, that’s purely my own speculation. So what’s the big
deal, you might ask? If Microsoft does not require a caching controller in a JBOD solution from a supportability standpoint, why spend the money on one? The answer lies in the difference between the on-disk cache and the controller cache, discussed next.
On-Disk Cache
Hard drives have a Disk Buffer, often called a Disk Cache, used to cache writes to disk. Caching occurs when
the operating system issues a write to disk, but before the data is actually written to a platter, the drive
firmware acknowledges to the OS that the write has been committed to disk, thus allowing the OS to continue
working instead of waiting for the data to actually be committed. The potential period of latency between
caching and commitment can be significant on slower rotational media. Caching increases performance, with the known risk that if the system loses power after the OS has received acknowledgement of committal but before the write is actually committed to disk, data loss or corruption will occur. This is why a UPS is required in such a configuration.
Unfortunately, the cache on NL SAS disks is notoriously small (typically 64-128 MB) and unreliable, which
means that they really are not suitable for enterprise workloads. The cache can be easily overwhelmed or
susceptible to data loss, also known as a lost flush. If you use a low-end RAID controller which does not contain a cache, the system can only rely on the on-disk cache for write performance.
Controller Cache
Disk Array Controllers typically have a much larger (512 MB-2 GB) and more robust cache that is capable of
delivering a high write performance. In many situations, the controller cache is the single biggest contributor
to delivering the necessary disk write performance for enterprise-level deployments. In fact, the topic of
controller caching (or lack thereof) is one of the most common call drivers in hardware vendor storage
support.
As an example, let’s consider a scenario that happens far too often. A company builds a server with 96 GB of
RAM, 16 CPU cores, and 10 TB of storage, but skimps on the RAID controller by purchasing one without cache.
A low-end RAID controller may save a few hundred dollars but will turn an otherwise robust system into one
incapable of sustaining a storage-intensive workload. This is because the on-disk cache, which I previously
mentioned is easily overwhelmed, is all that stands between you and degraded storage performance.
On several occasions I’ve dealt with Exchange performance escalations where a low-end controller was used
on the assumption that an Exchange “JBOD” solution did not require one. The issue was often discovered
during Jetstress testing, but on some occasions, the customer had already begun the deployment/migration
because they chose to forgo Jetstress testing.
Even some modern high-end controllers with > 1 GB of cache can encounter a performance problem when
not properly configured. Because of modern solutions like Storage Spaces, which require no cache, some
controllers offer a non-RAID/Pass-through/HBA/JBOD (name varies by vendor) mode. This feature allows
selected disks to be presented to the OS as raw disks, with no RAID functionality whatsoever. In other words,
no cache. Again, because of misconceptions, I’ve encountered customers who used this mode for an Exchange
Server JBOD deployment because they incorrectly assumed it was appropriate. What makes this even more
unfortunate is that, by comparison, not enabling the write cache or purchasing a low-end controller are fairly easy
problems from which to recover. You either reconfigure the cache (does not even require a reboot) or upgrade
the controller (which will import the RAID configuration from the disks), neither of which involves data
destruction. However, if a customer deployed Exchange using the Pass-through option, the drive would have
to be rebuilt/reformatted. This is an issue that you really hope to discover during Jetstress testing and not
after migrating mailboxes.
When creating this array/virtual disk for Exchange JBOD, the following settings should be used:
There are a few things to note regarding this list. The on-disk cache should be disabled to avoid double-
caching (caching at the controller as well as the disk) as well as the possibility of overwhelming the on-disk
cache. If this is left enabled, the risk of a Lost Flush is increased. In addition, each vendor uses different
terminology for caching settings. Dell, for example, does not use a percentage value for configuring the
cache; it's simply Enabled or Disabled.
I performed a quick test on my lab server: a physical system with 64 GB of RAM, 12 CPU cores, and a RAID
controller with 1 GB of cache. My plan was to run Jetstress on the system with Write-Cache enabled, note the
time it took to create the test databases and the achievable IOPS, then repeat the test with Write-Cache
disabled. I expected the testing to be much slower with caching disabled, but even I was surprised by how
drastically different the results were.
Testing Parameters:
Note: The same virtual disk/array was used for both tests; the only change was the caching setting
Note: Tests in my lab with caching disabled would always fail with Autotuning enabled. Therefore, I had
to manually configure the thread count. After several tests (all failing), I configured the test to only 1
thread, which still failed due to log write latency.
Needless to say, I was surprised. The test went from taking 8 minutes to create a 40 GB file to 4 hours! It was
Latency increased by a factor of 30! Now imagine that instead of a Jetstress test, this was a production Exchange server.
Maybe the administrator or consultant thought the initial install was taking longer than expected, maybe the
mailbox moves were much slower than anticipated, and maybe the client experience was so slow it was
unusable. This is the point where the product or the hardware is usually blamed by the users and upper
management. It often takes an escalation to Microsoft and/or the hardware vendor to explain the importance
of a caching controller. The need to spend for success is never more apparent than when it comes to good disk
controllers with solid caching.
Note: Always follow hardware vendor guidance. Also, this guidance was specifically for Exchange direct
attached storage solutions. For SAN or converged solutions, contact your vendor for guidance. Lastly,
always run Jetstress to validate an Exchange storage solution before going into production.
The Interface
PerfMon can be launched via the following common methods:
The default view brings you to a line graph representing data points captured at configured intervals over time
(Figure 8-4).
Live data is displayed by default when you open PerfMon. Counters can be added or removed using the
green plus or red X buttons respectively. When adding each counter, a brief description can be viewed
detailing what each counter captures (Figure 8-5).
Note: A performance object is an entity for which performance data is available. Performance
counters define the type of data that is available for a performance object. An application can
provide information for multiple performance objects. Performance objects can contain either
single instance counters or multiple instance counters. A single instance object returns a single set
of counter values (Reference).
When working with multiple counters, I recommend using the Highlight (Pencil) button to display the currently
selected counter in bold on the graph. Once you have added the counters you wish to view, I recommend
selecting all counters and scaling them (Figure 8-6) so they can all be displayed on the same graph.
Figure 8-6: Selecting counters you wish to scale and fit onto the same graph
While this allows you to view all counters on the same graph, it’s important to understand that due to the
scaling, a value may not necessarily be larger than others just because it appears larger on the graph. For
example, disk latency for the C drive may appear higher on the graph than the E drive. However, if the counter
for C is scaled to 1.0 and the counter for E is scaled to .01, disk latency values for E may actually be much
higher. The scaling allows both values to be displayed in the same simplified view, but the plotted lines do not directly reflect
measured values. Instead, I recommend letting the scaled graph uncover trends and outliers, while using the
actual measured data points to uncover the true values (Figure 8-7).
To analyze previously captured data, use the "View Log Data" button to switch PerfMon from live data to a saved log file (Figure 8-8).
Figure 8-8: The "View Log Data" button in the top-left corner of the PerfMon window
You can then select the .BLG file you wish to view (Figure 8-9). This method is preferred over double-clicking
the .BLG outside of PerfMon. With this method no counters will be loaded initially, whereas double-clicking
the file will open all counters, resulting in slowness and a visually busy graph.
After the .BLG file is open, the desired counters can be added using the green plus button. The graph will
represent the period of time the data capture was running, giving you the ability to zoom to a specific time
period (Figure 8-10). This is especially useful if the capture was for a particularly long time period. In some
cases, a spike can only be seen on the graph when zoomed into a short time window.
Capturing data
Performance data can be captured by creating a Data Collector Set, which can be used to define the following
parameters:
While you can certainly create a user defined Data Collector Set using the Performance Monitor GUI, this isn’t
really practical when adding many counters or requesting someone to gather data for your analysis. However,
you can generate an XML configuration file to be used as a template to make this process easier. I prefer using
the command line for this operation. The example shown below illustrates how to create a data collector set
using logman.exe:
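A representative logman invocation matching this description might look like the following sketch (the counter paths and output folder are illustrative assumptions, not a prescribed list):

logman create counter Perf-Counter-Log -f bincirc -max 2048 -si 00:00:05 -o "C:\PerfLogs\Perf-Counter-Log" -c "\Processor(*)\*" "\Memory\*" "\LogicalDisk(*)\*" "\MSExchangeIS Store(*)\*"
logman start Perf-Counter-Log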
This command creates a Data Collector Set named “Perf-Counter-Log” with a binary circular log that will be a
maximum of 2048 MB in size and have a sample interval of 5 seconds. This means the file will grow to 2 GB
and then continuously roll over until the Data Collector Set is manually stopped. This is useful if you wish to
capture data until an issue occurs, at which point you can stop the capture. The “-c” switch defines which
counters we will gather. As an alternative, a configuration file could also be used.
Another option is to use the PowerShell Get-Counter command to gather both local and remote server data.
This can obviously be scripted to your heart’s desire and is easily scalable using PowerShell Remoting.
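As a minimal sketch (server name and counter paths are placeholders), Get-Counter can sample a remote server and export the results to a .BLG file for later analysis in PerfMon:

# Sample two counters from a remote server every 5 seconds for 5 minutes,
# then write the results to a .BLG file that PerfMon can open
Get-Counter -ComputerName EXCH01 -Counter "\Processor(_Total)\% Processor Time","\LogicalDisk(*)\Avg. Disk sec/Read" -SampleInterval 5 -MaxSamples 60 |
    Export-Counter -Path C:\PerfLogs\EXCH01.blg -FileFormat BLG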
There are also two well-known automated methods of Exchange performance log collection:
PAL is especially useful as it provides templates for various Microsoft applications, including Exchange, as well
as thresholds for each counter collected. It produces an HTML file for easy viewing of collected counters as
well as color-coded indicators for when thresholds are exceeded. While it’s a great starter tool for the novice,
it can sometimes be misleading by presenting a wall of red text. This can happen when spikes occur but are
not necessarily indicative of a systemic issue.
Lastly, one extremely underappreciated feature of PerfMon is the ability to save counters to an HTML file
which can then be used to paste those same counters into a different PerfMon session on a different server.
Note: Many of these values were taken from the sources cited below. Please use them as a reference
for further explanation of the counters.
AutoDiscover
\MSExchangeAutodiscover\Requests/sec See References
RPC
\MSExchange RpcClientAccess\RPC Averaged Latency <250ms
ActiveSync
\MSExchange ActiveSync\Requests/sec See References
POP3
\MSExchangePop3(*)\Average LDAP Latency See References
IMAP4
\MSExchangeImap4(*)\Average LDAP Latency See References
PowerShell
\MSExchangeRemotePowershell\Current Connection Sessions See References
Information Store
\MSExchangeIS Store(*)\RPC Average Latency <50ms
Memory
\Memory\% Committed Bytes in Use <80%
Storage/Mailbox
\MSExchange Active Manager(_total)\Database Mounted Balanced
\MSExchange Database ==> Instances(*)\I/O Database Reads (Attached) Average Latency <20ms
\MSExchange Database ==> Instances(*)\I/O Database Reads (Recovery) Average Latency <200ms
Note: For an excellent insight into Windows disk counters, as well as how to measure disk latency, see
the below two TechNet posts from my friend and former colleague Flavio Muratore:
Windows Performance Monitor Disk Counters Explained
Measuring Disk Latency with Windows Performance Monitor (Perfmon)
CPU
\Processor(_Total)\% Processor Time <75% on Avg.
ASP.NET
ASP.NET\Application Restarts 0
Workload Management
\MSExchange WorkloadManagement Workloads(*)\ActiveTasks See References
The .BLG files generated by these Data Collector Sets are located within the <Install Drive>\Microsoft\Exchange
Server\V15\Logging\Diagnostics\DailyPerformanceLogs directory. These files, along with accompanying
Exchange logs within the Logging directory, can consume a considerable amount of disk space. While there are
means to purge excess logs, having a historical view of an Exchange Server's performance can prove quite
useful. This directory (Figure 8-12) contains 7 days' worth of Performance Monitor logging data.
Identify frequency
Is the issue currently happening? If not, did the performance issue occur only once? If it repeats, is it sporadic
or periodic? The answers determine the most appropriate counters to measure and what data to gather.
Use the Task Manager Processes tab to sort by Memory and CPU utilization. Identify the set of
Exchange processes and their percentage of overall utilization. Compare this to the Performance tab
to determine overall system utilization. Task Manager does not include disk performance information,
so instead use Windows Resource Monitor to measure each process's disk utilization.
For deeper analysis, open PerfMon and the relevant counters. I recommend starting down one of
three paths based on initial analysis:
o Disk
o Memory
o CPU
Once a hypothesis is made that the issue is related to one of the three aforementioned resources,
add the related Exchange counters.
If the issue occurred only once then we’re effectively performing root cause analysis, where there is no
guarantee of success. This is because success is dependent upon data/logs still being present. Days can pass
before an issue is reported, especially if the issue occurred before the weekend or a holiday. It’s important to
set proper expectations when performing root cause analysis, and let the stakeholders know it may be
impossible to determine the cause if sufficient data is not present.
With that in mind, the built-in Exchange Data Collector Sets are extremely useful to view the performance of a
system several days in the past. You should first copy the .BLG file to an alternate location before it is deleted
by background Exchange cleanup processes (after 7 days). You can then open Performance Monitor, open the
file, and add the Disk, Memory, and CPU counters to begin the analysis. I also recommend immediately
exporting both the Application and System event logs, as Windows will purge these files when capacity is
reached.
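A minimal sketch of that preservation step (the paths shown assume a default installation and an arbitrary destination folder):

# Copy the built-in daily performance logs before Exchange's cleanup removes them,
# and export the Application and System event logs before they wrap
$source = "C:\Program Files\Microsoft\Exchange Server\V15\Logging\Diagnostics\DailyPerformanceLogs"
$dest   = "D:\CaseData\$env:COMPUTERNAME"
New-Item -ItemType Directory -Path $dest -Force | Out-Null
Copy-Item -Path (Join-Path $source "*.blg") -Destination $dest
wevtutil epl Application (Join-Path $dest "Application.evtx")
wevtutil epl System (Join-Path $dest "System.evtx")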
The most critical piece of information for root cause analysis is the timeframe of the initial issue. This will
determine if the data you have available to you is of any use. Obviously, if an issue occurs Monday night, logs
from Tuesday to Friday are effectively useless for analysis. However, if you have the logs that cover the period
in question, knowing the precise window when an issue occurred allows you to investigate the event logs and
performance data leading up to the event. Often this data is critical to uncovering the root cause. It is
common to find that a backup or Anti-Virus scan will commence immediately before a performance issue,
allowing you to “follow the bread crumbs” so to speak and uncover the culprit.
If the issue is repeating (either sporadic at random intervals or periodic at predictable intervals), you’ll want as
many data captures as possible. A range of data captures will allow you to confirm that the same behavior is
observed during each incident, as well as providing the evidence to detect a pattern. For example, if a particular
process is utilizing a significant amount of system resources during each capture or an upward trend in usage
is detected, a possible memory leak might be the cause.
A full list of activities (processes) running on the server outside Exchange is also extremely useful when
troubleshooting repeating issues. Depending on the environment’s architecture, the following should be
considered:
The problem with beginning your investigation with the Exchange counters is there are many of them, and
their thresholds are not as well-known as the basic Windows counters are. This is why I recommend a
monitoring solution to track the proper counters and respond accordingly. Whether this is Microsoft’s SCOM,
or a third-party solution, is a matter of preference, cost, and functionality. However, in my experience, these
typically lead you to Disk, CPU, Memory, or Network regardless. They just may streamline the process and
allow you to scale your monitoring and diagnosis more easily.
Seeing these gaps is never a good sign and is usually indicative of a severe system performance issue. The
above gap lasted approximately 20 seconds and was not even visible when first opening the .BLG file. This is
common when you have a capture which is particularly large and covers a long period of time. Only upon
closer inspection (and zooming into a span of <10min) can we see this gap, which means that during this 20
second period, the Operating System was unable to log any performance monitoring data, likely due to it
being woefully short on system resources. In my experience, these gaps can be caused by:
Insufficient CPU resources for the Operating System to log performance data
Insufficient Memory resources for the Operating System to log performance data
Period of high Disk IO causing a resource issue
Network Settings, Network Teaming, Receive Side Scaling (RSS) & Unbalanced CPU Load
Time to revisit recommendations around Windows networking enhancements usually called Microsoft
Scalable Networking Pack
A reminder on real life performance impact of Windows SNP features
Processor 0 increased CPU utilization
Hyper-threading
Hyper-threading is a simultaneous multithreading technology with the goal of improving processing
performance. In simplistic terms, for every physical processor core, hyper-threading allows the appearance of
an additional logical core. However, while the Operating System views two logical processors, only one
physical processing core exists with a singular set of execution resources.
While I was going through VCP training, my instructor used an analogy that's stuck with me. People have
this misconception about hyper-threading that it’s actually giving you twice the cores, when in fact it is not.
Say you’re eating food with one hand. You’re losing the opportunity to put more food in your mouth in the
time that it takes to reach down and grab another bite. Now if you could eat with both hands (enabling HT)
then during that time when one hand is getting more food, the other hand could be placing food in your
mouth. However, you still ultimately have only one mouth to eat with. And most importantly, the faster you’re
eating (moving your hand back and forth) then the less benefit that other hand is giving you. So on a server,
the more utilized the processors are, the less benefit HT will give you. So while it’s not accurate to say that
hyper-threading gives you twice the cores, it can increase performance in some situations.
Historically, the risk of enabling hyper-threading was typically associated with virtualization. Even then, it
wasn’t a technical limitation but more of an operational one. People incorrectly assume they have twice the
processing power than they actually possess, which results in rampant CPU oversubscription and, ultimately,
CPU contention. This should not be an issue in an operationally mature environment which is properly sized. However,
the Exchange Product Team recommends against enabling hyper-threading on physical Exchange servers. The
reasoning is that the processing gains were not worth the costs. By cost, they are referring to memory
allocations per logical processor via .NET.
As a quick reference, terms like Virtual Processor, Logical Processor, and Physical Processor are sometimes
mistakenly used interchangeably. Here are brief descriptions of each, in my own words:
Example: The Intel i7-5960x processor has 8 Physical Processor cores, 16 Logical Processor cores (with HT
enabled), and if the system with this processor were to be made a hypervisor, it could assign up to 16 Virtual
Processor cores per virtual machine (VM).
Understanding these concepts is vital in both sizing as well as when troubleshooting CPU performance. It’s
important to understand how to recognize CPU contention in virtual environments, as well as how
hyper-threading can potentially impact physical deployments.
Note: .NET updates cause Windows to recompile all the Exchange assemblies. This affects CPU
performance for up to 20 minutes after reboot. See this article by Exchange MVP Jeff Guillet for a
description and a script to help speed this up by making the recompile multi-threaded.
Exchange Virtualization
Exchange virtualization is an extremely complicated topic. And not just for technical reasons, but also political,
economic, and preferential reasons. For the sake of time and the conservation of space, I’ll tighten my focus to
matters relevant to troubleshooting virtualized Exchange servers and avoid the discussion on whether I feel
Exchange should or should not be virtualized (although a simpler solution is always easier to troubleshoot,
and virtualization is an added complexity).
My reason for including virtualization in the Performance chapter is simple: numbers. In troubleshooting,
there are purely Exchange issues and purely virtualization issues. A Venn Diagram of Exchange and
Virtualization issues would intersect with many Performance issues. In my opinion, the majority of these issues
could be avoided with proper sizing and adherence to best practices. With that in mind, let’s briefly discuss
Exchange virtualization best practices.
Additional References:
Memory
Hyper-V Dynamic Memory is not supported in production for Exchange Servers. However, it can be used in a
lab environment. The reasoning is simple enough: an operating system running an application
with a transactional database (Exchange) should never have memory taken away from it while the application
is still running. This is especially true for Exchange as the sudden reduction in available memory can result in
significant performance issues, such as specific Exchange processes being forced to reduce their working set, or a
significant flushing to disk of transactions formerly held in database cache. For these reasons, an Exchange
virtual machine must use Static Memory when running on Hyper-V. This means the virtual machine has the
same amount of RAM from the moment it starts until it powers off; in effect, its memory behaves like that of a
physical machine.
If the Exchange virtual machine is being hosted on a VMware hypervisor (ESX), a similar memory management
rule must be enforced (using different terminology). A production Exchange virtual machine must have a 100%
Memory Reservation.
CPU
There are many well-written articles on CPU sizing when virtualizing, most of which are from the VMware
perspective:
CPU Overcommitment and Its Impact on SQL Server Performance on VMware
Note: Much of this module is taken from my blog post entitled CPU Contention and Exchange
Virtual Machines
All of these articles provide excellent advice on the topic. In summary, in vendor-neutral terminology, on a
given host, the total number of assigned processor cores on all of your virtual machines can potentially be
greater than the total number of actual cores on the physical host.
Example:
On a virtual host with two Intel Xeon E5-2640 2.5GHz CPUs, each with 6 cores (hyper-threading disabled for
this example), you have a total of 12 physical cores. If you host three virtual machines, each with 4 virtual
cores, then you would have assigned 12 total cores, or a ratio of 1:1 (virtual cores : physical cores). If you
increase the number of cores per VM from 4 to 8 then your ratio changes to 2:1 and at least some level of CPU
contention is introduced.
Note: You should also assign processor cores to the Hypervisor OS, but for this example I excluded
that measure for simplicity.
Again, in vendor-neutral layman’s terms, when your ratio exceeds 1:1 (2:1, 3:1, 4:1, etc.) the likelihood that a
virtual machine will have to wait for a physical core to become available for its use increases. Check out the
above links for detailed explanations of how this happens.
Is overcommitting CPUs good or bad? It depends. In many cases, overcommitting CPUs is perfectly acceptable
and recommended. In fact, I’d say it’s one of the biggest advantages of virtualization. However,
overcommitting CPUs is dependent on the workload and solutions like VDI are capable of functioning
perfectly happily beyond 12:1 (depending on which vendor you speak to) while still achieving acceptable
performance.
If this guidance is not followed, the following symptoms might occur on virtual Exchange Servers:
The most interesting thing about troubleshooting a system that’s experiencing CPU contention is that when
viewed from within the VM itself, CPU utilization may be minimal while things are painfully slow. Much of this
will depend on the load of the host, workload within your VM, and the sizing of the other VMs. A few very
dense VMs, or many small VMs with only 1-2 vCPUs can have differing behaviors.
The important thing to remember is not to jump to conclusions and start adding cores because you think a
virtual machine needs them. If you’ve been paying attention, you’ll understand that this solution will only
make matters worse as the VM may have to wait even longer for its additional vCPUs to be scheduled against
physical cores that are heavily utilized. Simply put, if your CPU is fully utilized within the VM, then you might
look at increasing vCPUs. If you suspect processor but the vCPU within the VM is not fully utilized, then
investigate possible CPU contention.
It’s important to realize how easily this can happen without proper operational controls. I once encountered a
support case where an Exchange 2010 customer experienced a period of heavy mail flow but the mail queues
were not draining. The root cause was that a junior virtualization admin had moved several dense VMs to the
same host as Exchange, pushing the CPU core ratio from 2:1 to 6:1. Massive CPU contention caused the problem.
Storage
Many of the principles of troubleshooting Exchange storage transfer to troubleshooting Exchange virtualized
storage. The sweet spot for disk latency remains <20ms, vendor best practices for caching settings and update
levels should still be followed, and Jetstress must still be run to validate the storage. The difference lies in additional
layers of complexity, failure points for performance, and storage virtualization technologies being used, such
as Thin Provisioning, Snapshots, Dynamic Disks, and Tiered Storage. Several of these technologies are not
unique to virtualization. For example, you can have physical servers and utilize thin provisioning or tiered
storage. However, as these technologies are typically utilized in virtualization scenarios I’m addressing them
now. Let’s discuss how each technology can “go wrong” or be encountered in an Exchange troubleshooting
scenario.
Thin Provisioning: Using this technology, you could potentially have a scenario where an Exchange volume
exhausts available storage, while Exchange and the Operating System believes free space is still available. This
can initially result in transport issues due to Back Pressure and at worst, an Exchange database dismounting in
a Dirty Shutdown state. This is particularly harmful because Exchange's built-in mechanisms to gracefully
dismount its databases when available disk space nears zero are defeated: in some thin provisioning
scenarios the Operating System could believe there is over 1 TB of free space on a drive, when the backend
storage solution has been completely allocated. The key here is to fix this problem before it ever happens by
closely monitoring your thin provisioned storage to enable additional capacity to be added before an outage
occurs.
Virtual Machine Snapshots: VM snapshots are not supported in production for Exchange virtual machines. This
is primarily due to the many Active Directory tie-ins with Exchange. If an Exchange VM diverges from the
Active Directory configuration database after a snapshot is used to revert the system, the Exchange
organization will be in an inconsistent state. Also, if the reverted Exchange VM is a member of a Database
Availability Group, its database copies will probably need to be reseeded.
Dynamic Disks (Dynamic VHDX): Until Exchange 2013 and Server 2012 R2, Dynamic VHDX files were not
supported for Exchange in production. However, due to write performance improvements, this is now a fully
supported option. While dynamic VHDX is supported, I recommend scheduling any expansion of disks for off-peak hours
to help mitigate the performance impacts.
Tiered Storage: This technology has different names depending on the vendor (Tiered Storage, Data
Progression, FAST, Virtual Storage Tier, Hierarchical Storage Management, etc.). While supported, the automatic
movement of data by storage software is not recommended by the Exchange Product Team for Exchange
storage. The most common time when issues occur is during Jetstress validation. If you choose to use tiered
storage with Exchange, then the recommendation is to use LoadGen instead of Jetstress to validate the
storage configuration. However, if you must run Jetstress, engage your storage vendor as it will likely require
custom configuration for the test to pass. In troubleshooting, if you experience latency when running
Exchange on tiered storage, consider moving it to static/dedicated disks where the data blocks will not be
moved. If this is not an option, consider increasing Tier 1 storage capacity so more Exchange data can be held
within Tier 1.
Exchange diagnostic (event) logging supports the following levels: Lowest, Low, Medium, High, and Expert.
When experiencing an issue with a particular process in Exchange but the Application logs do not provide
sufficient details, the logging level can be modified to hopefully shed additional light on the cause.
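As a minimal sketch (the category shown is an illustrative choice), the level can be inspected and changed with the EventLogLevel cmdlets:

# Check the current diagnostic logging level for a category, raise it while troubleshooting,
# and remember to set it back afterwards
Get-EventLogLevel -Identity "MSExchangeTransport\SmtpReceive"
Set-EventLogLevel -Identity "MSExchangeTransport\SmtpReceive" -Level Expert
Set-EventLogLevel -Identity "MSExchangeTransport\SmtpReceive" -Level Lowest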
Additional Tools
Fiddler: Useful for capturing and analyzing HTTP traffic
Network Monitor: Useful for capturing and analyzing network traffic at the packet level
Windows Storport Tracing: Measures the latency of individual IOs at the lowest point in the Windows IO stack. Can
be used to bypass filter drivers in Windows such as anti-virus or backup software.
Additional reading
Overview
IOPS
Correlation does not imply causation
Exchange Virtualization
Best Practices for Virtualizing and Managing Exchange 2013 (Microsoft)
Exchange 2013 on VMware Best Practices Guide (VMware)
Hyper-V Dynamic Memory Overview
Understanding VMware Memory Resource Management
CPU Overcommitment and Its Impact on SQL Server Performance on VMware
Virtual Machine Performance – CPU Ready
How to successfully Virtualize MS Exchange – Part 1 – CPU Sizing
Hyper-V CPU Scheduling–Part 1
CPU Contention and Exchange Virtual Machines
Thin provisioning
About Virtual Machine Snapshots
Windows Server 2012 R2 Hyper-V: What's New
Tiered storage
Back pressure
Exchange Database Is in a Dirty Shutdown State
Be careful with VM snapshots of Exchange 2010 servers
Exchange Storage for Insiders: It's ESE
Plan it the right way - Exchange Server 2013 sizing scenarios
Few would argue against the view that email is a business critical application for almost every company. For
this reason, understanding how to recover from disaster, both minor and major, while restoring both data and
functionality, is a critical skillset for Exchange Administrators and Support Engineers. Consultants are often
asked to provide guidance on backup, restoration, and disaster recovery procedures too, making these
extremely valuable skillsets for all who work with Exchange.
In this chapter, we first discuss common Exchange backup methodologies and the support considerations they
imply. This will help us understand the scenarios we will work through and the tools needed to ensure that we
can recover from disasters. Next we’ll discuss Exchange database internals, logging, and how truncation occurs
for transaction logs, including why this might or might not happen following a backup. Understanding
transaction logging will also prove useful for later modules covering Recovery Databases and Lagged
Database recovery. We’ll then discuss database corruption and the tools/techniques used for recovery.
Corruption can occur for many reasons, including environmental factors such as storage or power failures,
or possibly even misbehaving software such as Anti-Virus or backup software. In any case, knowing how to
recover from these scenarios is useful for troubleshooting scenarios and for everyday operations of Exchange.
Lastly, we’ll cover how to install servers in disaster recovery scenarios and why you might need to take this
action.
Backup Strategies
Exchange Native Data Protection
Exchange Native Data Protection is a major piece of Exchange core functionality and is highlighted in Microsoft's
Preferred Architecture. It is designed to protect Exchange data without relying upon traditional backups.
Although the idea of not backing up Exchange was shocking for some and was initially met with cynicism, it’s
the way that things work in the largest Exchange deployment in the world (Exchange Online within Office 365).
That being said, not every company is in the unique circumstances of having the knowledge, software, and
hardware resources available to Microsoft to run Office 365.
Exchange Native Data Protection is not a singular product or feature. Rather, it is a framework of
interconnecting Exchange features, most of which can be deployed without any dependency on one of the
other features:
Knowledge of how these features work is required if you are to design or support an environment which uses
Exchange Native Data Protection. The DAG provides redundancy for databases, allowing up to 16 copies of
each database (although 3-5 is more common) on servers placed in one or more datacenters that can, in turn,
be geographically dispersed. If one database copy or one datacenter fails, an up-to-date copy of that
database can be mounted automatically on another server with little to no data loss.
Single Item Recovery (SIR) ensures that items which have been deleted/purged by end users can still be
recovered for as long as the DeletedItemRetention property of a mailbox database allows. For instance, if the value of
DeletedItemRetention is set to 365 days, no matter what action a user takes on a message, it can still be
retrieved from the Recoverable Items folder for 365 days after deletion, even if the user selects and then
purges the item using the Recover Deleted Items feature. SIR therefore avoids the resource-intensive need to
restore an entire database from backup media, just to recover one item. Single Item Recovery also allows
preservation and recovery of mail in its original form, part of what provides Exchange with the ability to claim
that its data is stored in an immutable format.
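As a minimal sketch (mailbox and database names are placeholders), SIR and the retention window can be configured like this:

# Enable Single Item Recovery for a mailbox and keep deleted items for 365 days
Set-Mailbox -Identity "jsmith" -SingleItemRecoveryEnabled $true -RetainDeletedItemsFor 365.00:00:00
# Or set the deleted item retention window at the database level
Set-MailboxDatabase -Identity "DB01" -DeletedItemRetention 365.00:00:00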
Normally, the replication that occurs within a DAG ensures that database copies are kept up-to-date as close
as possible in terms of content to the active database copy. Lagged database copies represent a moving
point-in-time of the data in the database. This is accomplished by establishing an artificial delay on when a
database's transaction logs are replayed into the database's .edb file. The replay queue for a lagged database
copy always contains the transaction logs that comprise the set of transactions that have taken place during
the replay lag period. In effect, the lagged database copy is a snapshot of what the database looked like at the
replay lag period (up to 14 days), thus providing the answer to the question of “how do we recover from
logical corruption events?”
A DAG provides protection from physical database corruption and hardware failure events and is able to repair
physical corruptions while active using techniques such as single page repair (Page Patching). Logical
corruption falls outside the capability of these repair mechanisms. For instance, suppose a virus corrupts data in
mailboxes across the environment and the corruption is replicated to all database copies. In this case, the
normal database copies are all corrupt and cannot be used to recover a good copy. However, a lagged
database copy will not have replayed the transactions because, although waiting in the replay queue, the
transaction logs containing the bad transaction lie outside its replay lag period. The database therefore
remains good and is a viable candidate for recovery, including the ability to replay transaction logs up to the
time when corruption occurred. The process for recovering via a lagged database will be covered later in this
chapter.
Note: The analogy I like to use when discussing Exchange physical vs logical corruption is that of a
damaged book. If a book's pages are torn out and its binding is damaged, I compare that to physical
corruption in a database. Now once the book’s pages and binding have been repaired (with all pages
now in the correct order), it could still have logical corruption if the words on the pages don’t make
sense to the reader, if the letters are smudged, the words are out of order, or it’s now in the wrong
language.
A Recovery Database is similar to a Recovery Storage Group used in older versions of Exchange. Traditionally,
this database is used as a location for restoring Exchange database files from an Exchange-Aware Backup. The
restored database could then be mounted and the data restored as needed. This method was commonly used
when needing to recover mail items that had been deleted or during various disaster recovery scenarios.
Making use of Recovery Databases will be covered later in this chapter.
In-Place Hold is an evolution of Litigation Hold. Although both types of hold allow data to be kept for an
indefinite period, an in-place hold can apply a filter to make the hold more granular so that only important
data is kept. Either hold type can replace the need to retain backup tapes for extended periods for compliance
purposes.
In my opinion, the most important distinction to understand in regard to Exchange backups is the difference
between Exchange-Aware backups (which are supported) and non-Exchange-Aware backups (which are not
supported). To quote the TechNet article on this topic:
“Exchange 2013 supports only Exchange-aware, VSS-based backups. Exchange 2013 includes a plug-in for
Windows Server Backup that enables you to make and restore VSS-based backups of Exchange data. To back up
and restore Exchange 2013, you must use an Exchange-aware application that supports the VSS writer for
Exchange 2013, such as Windows Server Backup (with the VSS plug-in), Microsoft System Center 2012 - Data
Protection Manager, or a third-party Exchange-aware VSS-based application.”
In simple terms, an Exchange-Aware backup program understands Exchange ESE databases and can back up
the data in a consistent manner without data corruption. An Exchange-Aware backup also has the ability to
inform Exchange that it may truncate unneeded log files after a backup successfully completes.
Note: A Full or Incremental backup is required for the Exchange transaction logs to truncate.
These capabilities are not present in backups which are not Exchange-Aware, also commonly known as “Flat-
File” backups. Flat-File backups are great when used with file servers or raw static data, as they simply make a
copy of a file without any concern for what type of file it is or whether it’s open and in active use.
Unfortunately, such a backup is mostly useless (and unsupported) for backing up Exchange databases. This is
because the backup has no understanding of Exchange database operations and will not have a means to
freeze Exchange IO since there is no Exchange VSS Writer.
A common Support case is when a customer is in dire need of a good Exchange backup after a failure event,
only to discover that the backup files are in a corrupt or Dirty Shutdown state, with no easy means of
mounting the database. While Windows Server Backup has an Exchange VSS Writer and is therefore Exchange-
Aware, many third-party backup technologies do not. I've worked several escalations where a customer used a
non-Exchange-Aware program, which resulted in production mailbox database corruption. You should avoid at
all costs any program that attempts to lock Exchange database files or pause IO without understanding Exchange
ESE storage.
Note: Many third-party products have Exchange Backup Agents that require additional fees or
subscriptions. Always consult with your backup vendor to ensure their backup method is supported by
Microsoft.
Apart from asking the vendor a direct question, an easy means to determine whether a backup is Exchange-
Aware is to monitor the Windows Application logs during and after the backup to verify events from sources
MSExchangeIS and ESE. The presence of these events is an indication that the backup software is
communicating with the Exchange Information Store to coordinate the backup and send the command for log
truncation once the backup is complete. A set of articles is listed at the end of this chapter containing
additional references about how to monitor Exchange-Aware backups and verify their completion.
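A minimal sketch of that check (the 12-hour window is an assumption; adjust it to cover the backup job):

# Look for ESE and MSExchangeIS events written to the Application log during the backup window
Get-WinEvent -FilterHashtable @{
    LogName      = 'Application'
    ProviderName = 'ESE','MSExchangeIS'
    StartTime    = (Get-Date).AddHours(-12)
} | Sort-Object TimeCreated | Format-Table TimeCreated, Id, ProviderName -AutoSize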
Another simple command to inspect the last time a backup was performed against an Exchange database is to
use this command:
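Typically this is Get-MailboxDatabase with the -Status switch, which populates the backup timestamp properties, for example:

# The -Status switch populates the backup timestamp properties
Get-MailboxDatabase -Status | Format-Table Name, LastFullBackup, LastIncrementalBackup -AutoSize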
I’ve encountered many customers (usually small businesses with untrained IT staff) who claim to have been
successfully backing up Exchange for years, but by examining the last good backup time for the databases I
can determine that only a flat-file backup has been occurring. This means their backups contain databases in an
inconsistent state that may or may not be mountable for recovery.
This scenario is typically discovered (if not during a recovery scenario) when the customer realizes Exchange
transaction logs are consuming all of their disk space. Exchange transaction logs are 1 MB in size and can
accumulate into the hundreds of thousands if good backups are not taken and the log set is not truncated. However,
to fully understand Exchange transaction logs, we must first have an understanding of Exchange database
internals.
Regardless of which Exchange client is used, after connecting through the various Exchange services and
worker processes, a user ultimately accesses their mailbox via the Information Store process. The Exchange
Information Store (formerly the Store.exe process, but with Exchange 2013 onward,
Microsoft.Exchange.Store.Worker.exe) is where the Exchange Database Cache exists. A separate worker
process runs for every database on a server. The cache holds all currently active Exchange database
transactions in memory, such as new mail, deletions, modifications, etc. This cache can become quite large but
this is by design as keeping transactions in memory is much faster than fetching and updating database pages
from disk. When the cache does read/write to the database (.edb), it does so in 32 KB pages.
Note: One of the biggest contributors to the IOPS reductions between Exchange 2003 and Exchange
2010 was the increase in database page size. Page size was 4 KB in Exchange 2003, 8 KB in 2007, and 32
KB in 2010 onwards. A larger page size translates to fewer requests to disk, as more data can be
read/written per IO, but requires more RAM.
It’s important to understand that clients connect to the cache, not the actual .edb file. No client ever directly
accesses the database (.edb) or any log files (.log). Instead, all connections occur in memory to the database
cache. Of course, if the database (.edb) file becomes locked due to another process (such as anti-virus or
backup programs), the Information Store will eventually be unable to communicate with it and the database
will dismount.
When discussing backups, transaction logs are extremely important, so it's vital to understand the role they
play in database transactions. As transactions occur in the cache, they create a series of transaction records or
log data, which fills a log buffer. Once a log buffer is full (1 MB), it is flushed to disk to create a 1 MB
transaction log file. This process repeats as more transactions occur in cache and the log buffer fills and is
written to disk. The currently active transaction log file is always E0n.log, where n represents the number of the
log stream for that database.
Note: The first database on a server will be E00.log, the second will be E01.log and so on. Once the
current log file is committed to disk, it is renamed to a value such as E0000000001.log, with the
hexadecimal sequence number incrementing for each subsequent log file.
As transaction log files are written to disk, the transactions in cache might still not be committed to the
actual database (.edb) file. This process is referred to as “write-ahead logging” (In fact, technically the
transactions are written to the logs before the user sees the change). This is because writes to the database
are random IO while creating a transaction log is a sequential IO. Since the data being written to the database
could be located anywhere in a very large .edb file, the operation on disk is usually random. On the other
hand, creating a 1 MB transaction log takes a single new sequential write operation to disk. With rotational
media, this distinction becomes important as Seek Time contributes to disk latency. Sequential IO is very low
impact on a disk subsystem while random IO is more burdensome. Writing the transactions sequentially to
logs instead of the database (.edb) file allows the transactions to be captured efficiently (in cache and on disk
in the transaction log) while also reducing random IOPS. The focus on trading random IO for sequential IO by
using memory has contributed to the gradual reduction in the product’s IO requirement since Exchange 2007.
When are the transactions written to the database (.edb) file? After a predetermined number of transaction
logs has been generated, the transactions in cache are flushed to the database (.edb) file. This threshold
is called the Checkpoint Depth Target and is tracked by the Checkpoint File (.chk), which records the oldest
log containing an outstanding uncommitted page. Databases which have no replicas/copies
have a Checkpoint Depth Target of 20 transaction logs whereas databases with passive copies (DAG) have a
target of 100 transaction logs. This fact will become relevant when I discuss log truncation, especially in a DAG.
For a long time, it was recommended to place your Exchange database file (.edb) onto different spindles than
your transaction logs. This is still the recommendation when only one copy of a database exists (non-DAG
protected). In fact, this is not for performance reasons but to assist recovery in the event of a storage failure.
Say you lost your database drive, leaving you only with the transaction logs. Technically, if you still had every
transaction log present since the database was first created, you could use ESEUTIL to generate a new
database and commit every transaction from the logs into it. However, this is not usually the case. People
usually resort to an Exchange-Aware backup, which leads us to why an Exchange-Aware backup truncates log
files. When a Full Exchange-Aware backup is performed against a database, the .edb file is copied to the
backup location, as well as any transaction logs present for the database. With these files, the database can be
restored and the database can be brought into a Clean Shutdown state, meaning all transactions in the logs
have been committed to the .edb file and the database can be mounted. As the backup completes, it sends a
command to the ESE database engine stating that any transaction logs older than a certain point can be
truncated (deleted). These logs are no longer required on the Exchange server because we now possess a copy
of them in the backup set.
Note: Technically, once a transaction log has been written to disk, it is no longer needed. When the
time comes to commit transactions to the database (.edb) file, this action occurs from cache to the .edb
file, not from the transaction logs to the .edb file. However, you should not manually delete transaction
logs as these are vital for recovery purposes.
This brings me to the question: “Why can’t I use Circular Logging to delete my transaction logs automatically
so they don’t fill up my drive?” First we need to understand that Circular Logging is a property on a database.
When circular logging is enabled, the Information Store deletes transaction log files as soon as the transactions
contained within them have been committed from cache to the .edb file. A much smaller set of logs is
maintained as the set never grows over time because logs are continuously deleted.
The question then arises: why not enable circular logging for every database? Having read the
chapter up to this point, the answer is obvious. If your database subsequently becomes unusable due to a
failure, no transaction logs will be available for replay into the database, as all but the most recent logs will
have been automatically deleted by Exchange. Since all transaction logs must be replayed into the database in
their proper sequence, your only choice would be to restore the most recent database backup, and lose all
Exchange transactions that occurred between the time of the backup and the time of failure.
It’s at this point that I should define the differences between traditional (JET) Circular Logging and Continuous
Replication Circular Logging (CRCL). What I described above is JET Circular Logging and is used on standalone
Exchange Servers, or databases on Exchange Servers which do not have replicated copies. CRCL is used when
a database has replicated copies in a DAG and is enabled using the same command. When using traditional
backup methods, neither should be enabled as the backup program’s consistency check will fail due to
missing logs.
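Circular logging (JET on a standalone database, CRCL on a replicated one) is toggled with Set-MailboxDatabase, along these lines (database name is a placeholder):

# Enable circular logging on a database
Set-MailboxDatabase -Identity "DB01" -CircularLoggingEnabled $true
# Disable it again when it is no longer needed
Set-MailboxDatabase -Identity "DB01" -CircularLoggingEnabled $false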
Note: After executing this command, the database must be dismounted and remounted before the
change takes effect. Once enabled, any transaction logs that have been committed to the database will
be automatically removed. So if there were 50k transaction logs present, they would be automatically
deleted.
In my opinion, JET Circular Logging should only ever be enabled in a lab or when you are in dire need of disk
space during a recovery or during transaction-intensive operations such as moving mailboxes. This is because if
you encounter a failure and lose the database, you will lose all transactions that occurred between the last
backup and the failure. CRCL should only be enabled in a DAG environment where Exchange Native Data
Protection is being used, as traditional backups are not performed. In this configuration, since backups are not
performed, no transaction logs will ever be truncated and database drives would ultimately reach capacity.
The logic used to determine when logs will be truncated is as follows (for more information, see this
reference):
For truncation to occur on highly available (non-lagged) mailbox database copies, the answer must be "Yes" to
the following questions:
Simply put, the DAG ensures a transaction log has been inspected by all database copies in the DAG before it
can be deleted. The log file must also be below the Checkpoint Depth Target, which is 100 transaction logs in
DAG-protected databases. Knowing this target is important for understanding expected behavior when
verifying transaction log truncation after a successful Exchange-Aware backup. On a standalone database
there will always be ~20 transaction logs and on a DAG-protected database there will always be ~100
transaction logs, even after a successful Exchange-Aware backup. This is because only logs that have reached
the Checkpoint Depth Target are eligible for truncation. On several occasions I’ve been asked why transaction
logs within a DAG were not truncating after a successful backup because there always seemed to be
transaction logs in the log directory. The short answer is that there will always be ~100 logs in these directories. If you
want to verify successful log truncation following a backup in a DAG, I recommend the following article.
Similarly, in lab environments truncation may not occur because the database is new and has yet to generate
100 transaction logs, so there’s nothing to truncate.
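As a quick supporting check (database name is a placeholder), the copy status cmdlet shows the log generation and queue information that helps when confirming truncation across a DAG:

# Review log generation, copy queue and replay queue for every copy of DB01
Get-MailboxDatabaseCopyStatus -Identity DB01 |
    Format-Table Name, Status, LastLogGenerated, CopyQueueLength, ReplayQueueLength -AutoSize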
Note: While CRCL is enabled on a DAG-protected database, additional copies cannot be added. CRCL
must be disabled, and the database must be dismounted/remounted before additional copies can be
added.
Understanding the inner workings of Exchange databases helps to determine why a backup might not be
successful and the steps needed to remedy it. We’ll now move on to troubleshooting the underlying VSS
components which make Exchange-Aware backups possible.
Streaming backups are no longer used with Exchange 2010 and newer (although some of the same
components are used when a DAG performs a database seed operation). Instead, Exchange relies
upon Exchange VSS Backups (another excellent blog series on Exchange backups). In summary, a VSS backup
consists of the following components:
You can find some articles to help you understand backups and VSS at the end of this chapter.
It’s important to understand the difference between a VSS Snapshot (which is a file system snapshot) and a
Hypervisor Snapshot (which is used in virtualization; aka Checkpoint in Hyper-V). Hypervisor
Troubleshooting VSS extends beyond just Exchange troubleshooting, as many applications rely on VSS for
backups. However, most VSS issues in my experience revolve around the following:
VSS Writers
o Use vssadmin list writers to query any writers in a failed state which may require restarting of
their respective service
VSS Providers
o Use vssadmin list providers to view installed providers.
o Most VSS-based backup applications leverage the built-in VSS provider, but some utilize their
own, which may present atypical behavior. Work with the vendor if issues arise.
VSS ShadowStorage
o Use vssadmin list shadowstorage to view used/available space for snapshot data.
o ShadowStorage serves as temporary storage for capturing copy-on-write blocks during a
snapshot. Backups may fail if there is insufficient ShadowStorage space for a backup to
complete.
o It’s important to keep Windows updated, as past issues have resulted in ShadowStorage data
not being properly purged.
Note: In some environments, free disk space can become a problem if ShadowStorage is consuming
too much space. While ShadowStorage is used for other functions, Exchange should only need ~1 GB
maximum disk space to perform a VSS snapshot. Reference. Also, in some cases there can be inefficient
use of ShadowStorage if a VSS provider vendor uses a large differential block size. Check with your
backup VSS provider vendor if you experience excessive use of shadowstorage.
In situations where the Information Store service will not start, the most common issues involve
communications between Exchange and Active Directory or some interaction involving a file used by Exchange
and anti-virus software. I recommend using Event Viewer to verify Active Directory communications as well as
verifying that all service dependencies are met (such as the Microsoft Exchange Active Directory
Topology service being started). With anti-virus software, verify all exclusions have been properly
configured. It is common to discover that an anti-virus product either locks the database files or blocks
necessary RPC communications. Anti-virus is also a common reason why database files become corrupt as a
result of multiple processes attempting to access the files simultaneously. It is also common to see that an
anti-virus product quarantines database files or logs, a step that might break the sequence of the log stream,
making recovery from backup challenging, and possibly breaking DAG replication.
Dirty Shutdown
The words “Dirty Shutdown” are fairly self-explanatory; a database is down and not in a healthy state. The
words can still cause anxiety for Exchange Administrators who recall spending many hours frantically
attempting to get a database to mount and restore functionality. Over ten years in support, I have probably
worked on well over a hundred Dirty Shutdown support cases. But to be honest, a decade's worth of experience
is not required to troubleshoot a Dirty Shutdown event accurately, especially since the proper course of
actions can easily fit into a decision tree. Figure 9-1 provides a high-level flow chart for navigating common
scenarios where a database will not mount:
First, let’s define what happens in a Dirty Shutdown. When Exchange dismounts a database, the Information
Store ensures that all transactions (dirty database pages) in cache are committed to the database (.edb) file.
When this is allowed to happen, the database is said to be in “Clean Shutdown” and requires no transaction
logs to mount because all transactions have already been replayed into the database. Figure 9-2 demonstrates
how the ESEUTIL utility validates that a database is in Clean Shutdown state.
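As a minimal sketch (the database path is a placeholder), the header dump is produced with ESEUTIL /mh, and its State field reports either Clean Shutdown or Dirty Shutdown:

# Dump the database header and inspect the "State:" line
eseutil /mh "D:\Databases\DB1\DB1.edb"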
However, if something causes the database to terminate abruptly, it will probably be left in a Dirty
Shutdown state. Whether a Dirty Shutdown happens and which transaction logs may be required to recover
the database is dependent entirely upon which transactions were being processed at the time of the failure.
Figure 9-3 displays a database in Dirty Shutdown, requiring log 3 (E0000000003.log) to mount.
Figure 9-3: DB1 in Dirty Shutdown and requiring log E0000000003.log to mount
Note: An easy way to get a database into Dirty Shutdown (for testing purposes, only do this in a lab) is
to use Task Manager to kill the Store.exe (Exchange 2000-2010) or the
Microsoft.Exchange.Store.Service.exe (Exchange 2013-2016) process. Before doing this, I recommend
setting the Microsoft Exchange Information Store Service to “Disabled” (but not stopping it). This will
prevent the service from automatically restarting itself and attempting to mount the database while
you inspect it.
This command can be used to verify the .edb file and log folder paths so you know where all the files are
located for a given database:
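In practice this is Get-MailboxDatabase with the path properties selected, for example (the database name matches the lab example):

# Show where the .edb file and the transaction logs live for DB1
Get-MailboxDatabase -Identity "DB1" | Format-List Name, EdbFilePath, LogFolderPath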
The next step is to verify the required log files are present in the database’s log directory. This can be done
visually via Windows Explorer, but the preferred method is to use the below command:
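Based on the description that follows, the command is ESEUTIL /ml run against the log file prefix, for example (the log folder path is a placeholder):

# Verify the integrity and continuity of the E00 log stream for this database
eseutil /ml "D:\Databases\DB1\Logs\E00"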
In the above command, E00 is the log sequence for this database. The first database on a server will have a
sequence of E00, the second database E01, the third E02, and so on in Hexadecimal. This means the 11th
database on a server will have a sequence of E0A. This command will parse through every transaction log in
the directory from oldest to newest (E0n is always the current log file and is therefore the newest) and verify
that every log in the stream is present and not corrupted. Figure 9-4 displays this action on the same DB1
database which I intentionally placed into a Dirty Shutdown.
Figure 9-4: ESEUTIL /ml run against DB1, which is newly created and therefore only has 4 log files
In the case of DB1, all log files are healthy and accounted for. This means that if I attempt to mount this
database, following a successful mount the database will automatically recover by replaying the transactions in
the available log set. However, for this example I removed log 3 from the directory and repeated the ESEUTIL /ml command.
Figure 9-5: Result of ESEUTIL /ml with a missing file in the log stream
Because we are missing a required log file, the database will not mount, as seen in Figure 9-6.
Figure 9-6: Result of attempting to mount a database with a missing Required Log
Now if I place log 3 back into the directory but instead remove log 2, what do you think will happen?
Remember, only log 3 was required because a transaction contained within it was deemed necessary to mount
the database. So while running ESEUTIL /ml will still list log 2 as missing, the database would mount because
the Checkpoint File (.chk) tells the database to only be concerned with transaction log file 3 or newer.
What would happen if I removed the checkpoint file while log 2 was still missing (remember, the database
only requires log 3)? In this case, the database will fail to mount because without a checkpoint file, all of the
logs present in the directory must be inspected in ascending order and any missing logs will result in a failure
to mount. Now what’s really interesting is that if I also remove log file 1, the database will mount successfully.
This is because when the log stream is reviewed, there are no missing log files. The Information Store identifies
log file 3 as the oldest log, and begins processing onward starting at that log. Whereas before, it detected logs
1 and 3 but knew 2 was missing, so the process failed. Remember, we don’t require every transaction log since
the beginning of the database to be present in the directory (this would be unreasonable, as old log files get
deleted once a successful backup has occurred), but the logs which are there must be contiguous from oldest
to newest.
Understanding what log files might be required is important. Although this example database has only a few
log files, production databases can have tens of thousands, depending on the last successful backup.
Therefore, since many backup programs perform a log stream integrity check (similar to an ESEUTIL /ml) for
the log directory, missing or damaged log files within the directory can cause this check to fail.
Let’s now discuss how a soft recovery using ESEUTIL /r is performed. With Exchange 2010 and later versions, it
is rarely necessary to manually run a soft recovery as a means to get a database to mount after a failure. This is
because Exchange does this process automatically (and does a pretty good job), and if the necessary log files
are present in the directory, you simply need to mount the database for recovery to happen. More commonly, a manual soft recovery is performed when preparing a restored copy of a database (for example, one destined for a Recovery Database) to reach a Clean Shutdown state.
Figure 9-7 displays a soft recovery being performed against DB1 which places it into a Clean Shutdown and
ready to mount state.
Figure 9-7: Performing a soft recovery against DB1 to place it into a Clean Shutdown state
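As a sketch, the soft recovery shown in Figure 9-7 takes the form below, where /l points to the log folder and /d to the database folder (paths are illustrative):
ESEUTIL /r E00 /l "F:\Logs\DB1" /d "D:\Databases\DB1"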
Another useful ESEUTIL command performs a checksum validation against the .edb file, using the following
command syntax:
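ESEUTIL /k <DatabaseFileName>
For example (path illustrative): ESEUTIL /k "D:\Databases\DB1\DB1.edb"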
Think of this as a scan for physical corruption in the database. If this command returns some bad checksums,
a successful soft recovery is unlikely and a restore from backup could be required. Figure 9-8 displays
ESEUTIL /k being performed against DB1. As you can see, no bad checksums were encountered.
The last command we need to cover is the dreaded ESEUTIL /p command (also known as a Hard
Recovery/Repair). Unfortunately, the internet is full of bad advice regarding ESEUTIL /p. The usual scenario is
that someone starts a thread pleading for help with a database that will not mount, and the first response is
usually, “just run an ESEUTIL /p <DatabaseFileName> and the database should mount”. This advice is rarely
accompanied with a warning about data loss or the fact that a /p should NEVER be your first troubleshooting
step when troubleshooting databases which will not mount. In fact, a /p must always be your path of last
resort, as it invariably means that data loss will result.
So what is a /p and why is it so bad? Simply put, a /p does anything it can to remove problems that might
cause a database to fail to mount, even if that means deleting data and rolling back transactions in an effort to
get into a Clean Shutdown state. In the event of bad checksums or other types of corruption, you could say a
/p removes anything it doesn’t recognize as valid, all in an effort to purge corruption. This can potentially have
catastrophic effects to the data in the database as no one can predict what pages will be removed or what
data is held in those pages as it all depends on what was happening with the database at the time of failure or
which part of the database structure was affected. Sometimes it is a single data page, but sometimes the
removal of a single page can remove an entire table from the database.
Based on my experience, ESEUTIL /p will be successful in 9/10 instances and the database will mount with
minimal data loss. However, I have also seen how ESEUTIL purged so many pages that a 200 GB database was
reduced to 60 GB with no indication of what was lost in the 140 GB that was removed. Losing data is the
biggest danger of running ESEUTIL /p, mounting the database, and then carrying on business as usual. You’ll
probably be happy that the database comes back online and not realize that data loss has occurred until users
point out large holes in their mailboxes later, at which point it’s likely too late to recover it easily. So if an
Exchange-Aware backup is available, it should always be used instead of running the quicker/easier ESEUTIL
/p.
However, even if hard recovery is successful and involves minimal data loss, there are still repercussions that
flow from performing a hard repair on a database. Microsoft's New Support Policy for Repaired Exchange Databases requires that all mailboxes be moved off a hard-repaired database as soon as possible.
Assuming all other recovery methods have failed and no backup exists, you need to know how to run a hard
repair. The syntax for this command is:
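ESEUTIL /p <DatabaseFileName>
For example (path illustrative): ESEUTIL /p "D:\Databases\DB1\DB1.edb"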
Before this command is allowed to be run, you’ll receive the warning shown in Figure 9-9.
A hard repair can take anywhere from a few seconds to over 24 hours to run, depending upon database size,
available computer resources, and the amount of corruption detected. This is another reason to keep your
databases under 200 GB for standalone servers, as recovery operations (including restoring a backup) take
considerably longer for larger databases. It is reasonable to expect that ESEUTIL will process data at the rate of
10-15 GB an hour (storage performance is also a factor), but without knowing the nature of the corruption, it
is difficult to predict how long the complete repair will take. Figure 9-10 displays the execution and
completion of an ESEUTIL /p operation.
Once the hard repair is complete, all database files should be removed from the folder except for the .edb file.
In fact, you can purge any previous transaction logs for this database; they are now useless because they cannot
be replayed into a repaired database, which has a different signature to the corrupted
file. Also, it's possible you'll need to enable this database to be overwritten by a restore using this command:
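The parameter involved is AllowFileRestore; a sketch using the example database DB1:
Set-MailboxDatabase DB1 -AllowFileRestore $true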
After the database has been mounted, as previously mentioned, all mailboxes should be moved from it, and a
New-MailboxRepairRequest should be performed against them.
Database Defragmentation
There was a time when performing an Offline Defragmentation on an Exchange database was considered a
maintenance operation that should be scheduled regularly. However, beginning with Exchange 2007, there is
little reason to perform an Offline Defragmentation. Not only does database replication complicate the
process, but also because there is little to be gained. Historically, when mailbox data was purged, it left
whitespace in the database. In other words, after deleting 40 GB of data from a 100 GB database, the database
file (.edb) would still be a 100 GB file. Eventually Exchange would reclaim this space as data is added during normal
user operations, but if available disk space was running low, administrators might perform an offline defrag
against the database to reduce it down to ~60 GB. This was accomplished using the following command on a
dismounted database:
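The command referred to is ESEUTIL /d; a sketch (path illustrative):
ESEUTIL /d "D:\Databases\DB1\DB1.edb"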
An offline defrag was also required after an ESEUTIL /p was performed against a database (Microsoft now
requires the database instead be vacated). The downside of the offline defrag was it required the database to
be offline and could take many hours (5-10 GB/hr depending on the hardware of the time), but it was thought
a better alternative than moving all mailboxes out of the database, as that too was an offline operation (for Exchange 2007 and earlier versions).
Starting with Exchange 2010, as all mailbox moves use an online process, offline defragmentation should be
avoided. Database whitespace should either be allowed to be consumed by new database pages over time, or
mailboxes should be moved to a new database (allowing you to delete the old database file). Since the move
is now an online process, it’s a much better alternative than an Offline Defrag.
Command Examples
General ESEUTIL Log File Replay Rules (Reference):
You cannot replay log files from one database against a different one. The operations held inside a log
file are low-level that collectively form a stream of transactions to be applied against a database. You
will not see anything inside a log file such as "Deliver Message A to Mailbox B." A better example of a
log file operation is "Write this stream of 123 bytes to byte offset 456 on database page 7890."
Imagine that you gave someone instructions for editing a document, and your instructions are "On
page five, paragraph four, in the third sentence, insert the phrase 'to be or not to be' after the second
word." If these instructions were applied to a document other than the one intended, the result would
be random corruption of the document. Likewise, if the wrong log files were played against an
Exchange database, a similar result would occur. Exchange therefore has multiple safeguards to
prevent such corruption.
If you defragment or hard repair (ESEUTIL /p) an Exchange database, transaction logs that previously
were associated with this database can no longer be replayed into it. If you try to replay log files after
a defragmentation or hard repair, Exchange skips the inappropriate transaction logs. Again, consider
the analogy of editing the document. If a paragraph has been moved, edited, or deleted since the
instructions were created, applying the out-of-date instructions would be as destructive as applying
them to an entirely different document.
You cannot replay log files unless all uncommitted log files from the time the database was last
running are available. You must have all log files starting from the checkpoint at the time the
database was backed up. You can then replay log files from this point as long as they follow an
unbroken sequence. If there is a single log file missing in the middle or from the beginning of the
sequence, replay stops there.
You cannot replay log files if the checkpoint file points to the wrong log. Exchange treats a
checkpoint log as if it were the first log available and ignores all older log files. If you restore an older
file copy of the database, the checkpoint will be too far ahead, and Exchange tries to start log file
replay from a log file that is too new. You can solve this problem by removing the checkpoint file; thus
forcing Exchange to scan all available logs. (If you restore an online backup, recovery ignores the
checkpoint file.)
ESEUTIL /r <E0n>
ESEUTIL /r E00
Note: The /a switch can also be used if the last log in the stream is missing, although it is not always successful.
Perform a hard recovery against a database:
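Following the pattern of the examples above (the file name is illustrative):
ESEUTIL /p <DatabaseFileName>
ESEUTIL /p DB1.edb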
Perform a Mailbox Repair Request against a mailbox while checking against all corruption types:
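A sketch, assuming the four corruption types documented for on-premises repair requests:
New-MailboxRepairRequest -Mailbox "John Smith" -CorruptionType SearchFolder,AggregateCounts,ProvisionedFolder,FolderView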
Note: See the Migration chapter and the Mailbox Corruption module for additional details on Repair
Requests and their function.
Recovery Databases
A Recovery Database is not a new concept to Exchange. It is merely the evolution of Recovery Storage Groups
which were used in Exchange 2003-2007. After Microsoft removed Storage Groups from the product in
Exchange 2010, the naming was changed from Recovery Storage Group to Recovery Database. A Recovery
Database (RDB) is a special kind of mailbox database that allows you to mount a restored mailbox database
taken from a backup and extract data from the restored database as part of a recovery operation. Database
files can be restored from backup, placed into a Clean Shutdown state, and mounted in an RDB so its contents
can be extracted and placed into a mailbox of the administrator's choosing. Several characteristics differentiate
an RDB from a regular mailbox database (see reference). To illustrate how an RDB is used, consider the following scenario:
A database named DB15 contains 100 mailboxes, of which one is named “John Smith”
John Smith accidentally deletes a very important item from his mailbox
The deletion goes unnoticed for a month, allowing the default Deleted Item Retention window of 14
days on the mailbox database to expire and the email to be purged
In this example, an Exchange-Aware backup of DB15 from before the deletion can be restored into a directory.
Using ESEUTIL commands, an administrator can verify the database is in a clean shutdown. If not, they can
perform an ESEUTIL /r to soft recover the restored log files into the database, bringing it into a Clean
Shutdown state.
Once the database is in a Clean Shutdown, it can be mounted as a Recovery Database. No users will be able to
access this database, and even though the mailboxes within it also exist on the production database (DB15), it
is allowed to mount in this special state. Once the database is mounted, the mailboxes it contains, along with
their item count, can be viewed using the below command:
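A sketch, assuming the Recovery Database is named RDB01:
Get-MailboxStatistics -Database RDB01 | Format-Table DisplayName,ItemCount,TotalItemSize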
This mailbox data can now be exported into a production mailbox using the New-MailboxRestoreRequest
Cmdlet. This restore can be performed to the same folder the item originally existed in, or sent to a different
folder.
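A sketch of such a restore, assuming a Recovery Database named RDB01 and restoring into a subfolder of John Smith's production mailbox:
New-MailboxRestoreRequest -SourceDatabase RDB01 -SourceStoreMailbox "John Smith" -TargetMailbox "John Smith" -TargetRootFolder "Recovered Items"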
Once the RDB is no longer needed, it can be dismounted, removed from Active Directory, and the restored
database files removed.
The preceding example is a fairly common one, though many third-party backup solutions make this process
easier by providing streamlined item-level recovery. For this reason, the RDB has become a bit of a lost art to
Exchange Administrators. Of course, there are several other use cases where Recovery Databases are used:
This was a common scenario when slow backup solutions and slower hardware meant that it
took hours to recover even small databases, leading to extended periods of unavailability as recovery took
place. It is for this reason that Microsoft still recommends that non-DAG protected databases be no greater
than 200 GB. As the size of files needing to be restored grows, so does the potential for downtime.
Exchange 2010 introduced Database Portability, a feature permitting database files to be mounted on any
server in the Exchange organization. Also, a user’s mailbox location can be configured simply by running this
command:
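The command follows the same pattern used in the Dial Tone steps later in this section; a sketch for a single user (the target database name is illustrative):
Get-Mailbox "John Smith" | Set-Mailbox -Database DB2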
Of course, no data is migrated with this command; instead, "John Smith" will find a new empty mailbox when
he opens Outlook. But this capability is extremely useful when combined with the ability to move database
files and mount them anywhere in the Exchange organization. It's this capability that makes Dial Tone Recovery possible. A typical Dial Tone Recovery scenario plays out as follows:
Database files have been lost or corrupted (for our example, FailedDB)
A restore from backup will result in an extended outage as users are unable to access the contents of
their mailbox and cannot send/receive emails
A desire exists to provide users the ability to send/receive new emails while their mailbox data is
being recovered
A new database is created (in this example “DialToneDB”)
Mailboxes on the failed database are rehomed to DialToneDB using Get-Mailbox -Database
<FailedDB> | Set-Mailbox -Database <DialToneDB>
DialToneDB is mounted so users can send/receive emails during the restore operation
A Recovery Database is created and the FailedDB files are restored from backup to its location
After the FailedDB data is copied to the Recovery Database from backup, but before mounting the
restored database, copy any log files from the current FailedDB to the Recovery Database log folder
so they can be played against the restored database. This will bring the database as close to current as
possible.
Mount the Recovery Database so the logs can be replayed and then dismount it
Dismount DialToneDB (involves outage to users) and swap the database files for DialToneDB and the
Recovery Database. This results in the restored copy of FailedDB being mounted in production, but
missing the past few days of email (which exist in DialToneDB). DialToneDB's data is now mounted in
the Recovery Database.
Use the New-MailboxRestoreRequest Cmdlet to move/merge the contents of the Recovery Database
into FailedDB
At this point recovery is complete and the Recovery Database can be removed
Lagged copies in a DAG hold the replaying of their transaction logs for up to 14 days, allowing the copy to
represent a past point in time for that database. In Exchange Native Data Protection, this feature enables the
backup-less solution to restore data from a past date chosen by the administrator. The following provides an
overview of this process:
Single Item Recovery makes this process seldom required, as you can easily recover individual items without
going through the trouble of mounting a Recovery Database.
Note: When using Lagged Database Copies, the SafetyNetHoldTime value should be equal to the
ReplayLagTime of the lagged copies. This ensures that should the Lagged Copies be activated, the
messages contained within Safety Net will include the entire time window of the Replay Lag Time.
Also, outside of Lagged Copy Recovery scenarios, the cmdlet Add-ResubmitRequest can be used to
request data contained within SafetyNet to be replayed into mailboxes. Remember, the Information
Store performs duplicate checking, so users should not experience duplicated messages.
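A sketch of keeping the two values aligned (the database copy identity and the 7-day value are purely illustrative):
Set-MailboxDatabaseCopy -Identity "DB1\EX02" -ReplayLagTime 7.00:00:00
Set-TransportConfig -SafetyNetHoldTime 7.00:00:00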
For these reasons, Exchange provides the Disaster Recovery installation option. This installation
option follows the steps below (Reference):
Hopefully this chapter provides an insight into the various tools you have at your disposal, as well
as being a useful reference when encountering backup, restore, corruption, and disaster recovery
scenarios out in the wild. This chapter discussed database corruption and how to overcome it. For
information on how to recover from mailbox corruption, see the Mailbox Corruption module in the
Migration chapter later in this book. Though first, we’ll discuss troubleshooting Exchange Hybrid
environments. This is a topic with many moving parts and complex technologies coming together to
provide a hybrid On-Prem and Office 365 experience.
Additional reading
Exchange Native Data Protection
Exchange Preferred Architecture
Database Availability Group
Single Item Recovery
Lagged Database Copies
Recovery Database
In-Place Hold
Database availability groups (DAGs)
Recover deleted messages in a user's mailbox
Activate a lagged mailbox database copy
Restore an Exchange Server 2013 Database to a Recovery Database
In-Place Hold and Litigation Hold
Single Item Recovery Walkthrough
DeletedItemRetention Parameter
Email Immutability
Database Copy Types
ReplayLagTime Parameter
Recovery Storage Group
Exchange-Aware Backup
Working with Recovery Storage Groups
Litigation Hold
Database Defragmentation
Offline Defragmentation
Little reason to perform an Offline Defragmentation
Offline Defrag And DAG Databases, Oh My!
How To Check Database White Space In Exchange
How to Defrag an Exchange 2010 Mailbox Database
Microsoft now requires Hard Repaired (/p) databases be vacated
Moving mailboxes, the Exchange 2010 way (Online Moves)
Recovery Databases
Recovery Database
Recovery Storage Groups
Storage Groups
New-MailboxRestoreRequest
Restoring a Mailbox from an Exchange Server 2013 Recovery Database
Restoring Exchange Server 2016 Mailboxes and Items Using a Recovery Database
Individual products are usually fairly easy to troubleshoot and support. You can attend product training, install
it in a lab, and eventually end up solving enough support tickets to become a subject matter expert. On the
other hand, solutions are complex configurations that bring together several different products to solve a
defined technical and business need. An Exchange Hybrid Configuration is a complex solution that relies on
several different Exchange features and utilizes multiple Microsoft products and services to connect Exchange
on-premises organizations to Office 365 tenants. The work necessary to configure a hybrid deployment has
evolved since Exchange 2010 SP1 and now Microsoft provides a Hybrid Configuration Wizard to automate the
many tedious manual configuration steps which were often poorly implemented. A Hybrid Configuration can
involve the following Exchange Server components:
A Hybrid Configuration might also involve the following Microsoft products, features, and services:
Office 365
Azure Active Directory
Remote PowerShell
Azure AD Connect (aka. Azure AD Sync/DirSync)
Active Directory Federation Services (ADFS)
On-Premises Active Directory Domain Services
Microsoft Office Suite
Information Rights Management
Exchange Online Protection
Needless to say, becoming an effective troubleshooter in all of these areas is an ambitious task. Even for
Microsoft, it is often difficult to troubleshoot and resolve issues that occur in hybrid configurations.
The breadth of technologies included in Office 365 can make the manager of any IT support team wonder
how it is possible to provide satisfactory support for everything, especially when some of the components
operate in the cloud and some on-premises. I call this out because when troubleshooting a product which
involves both on-premises and cloud-hosted components, you must understand and be able to identify when
an issue relates to an on-premises component, a cloud component, or something in-between. Support is provided by different parties depending on where the issue resides, which will typically be one of the following areas:
On-Premises Environment
Office 365 Environment
Network connecting On-Premises and Office 365
Local clients (operating on the company WAN)
External clients connecting to their mailbox
The first step towards resolution is to identify in which environment the problem belongs. Once this is done,
you can use the appropriate tools and techniques to resolve the issue or gather the required logs for
Microsoft to analyze. It would be impossible to cover every possible troubleshooting scenario for every
Microsoft technology involved in an Exchange hybrid configuration, so the purpose of this chapter is to
highlight the tools, logs, and methodologies used to resolve the most common hybrid break-fix scenarios. For
a deeper dive into the various Hybrid components as well as designing, deploying, and managing a Hybrid
Configuration, I recommend this eBook: Office 365 Complete Guide to Managing Hybrid Exchange
Deployments
For information on Hybrid configurations, as well as how to operate Office 365 from the perspective of an
Exchange professional, I recommend this eBook: Office 365 for Exchange Professionals
Additional References:
Note: For a great beginning-to-end overview of a Hybrid configuration, I highly recommend the
Microsoft Virtual Academy course on Exchange Hybrid Deployments.
While the first versions of the Hybrid Configuration Wizard had some issues, over the course of its four-year
history, it has evolved into an extremely robust tool. The largest evolutionary step was the repackaging and
rebranding as the Office 365 Hybrid Configuration Wizard in September 2015. The wizard is now maintained
and downloaded separately instead of being packaged with each Exchange Cumulative Update. This means
any issues with the code can be detected, remedied, tested, and published in a matter of days and means that
the wizard always represents the latest state of knowledge about hybrid connectivity. Here’s a summary of the
major improvements in the new wizard:
Up-To-Date Hybrid Experience - the latest version of the code is checked and downloaded every time
the wizard is run. As changes are made to, or issues are fixed in the HCW, customers will see the
benefits soon thereafter.
Note: If you notice anything out of the ordinary when running the HCW, you are strongly encouraged
to provide feedback using the built-in “Give Feedback” option. This feedback is closely monitored by
the Hybrid Support, Development, and Engineering teams. On more than one occasion, the feedback
received allowed Microsoft to quickly identify and correct customer issues encountered with the HCW.
If you want to opt out of uploading the Hybrid logs, you can do so by setting the registry key below on the
machine where you are running the HCW:
1. Navigate to the following location in the registry, create the path if needed:
Note: While you might wish to keep a local copy of the HCW, it’s possible (and likely) it could be
outdated within a week’s time. Therefore, it’s highly recommended to always download the latest
version when you need to run the HCW, whether for the first time or to make modifications to a Hybrid
Configuration.
When you run the wizard (Figure 10-2), you have the option to learn more about what the wizard does and
which settings it will modify depending on your input.
Figure 10-2: The first screen of the new Hybrid Configuration Wizard
Note: At the time of this writing, Exchange 2010 SP3, Exchange 2013 (CU9 or above), or Exchange 2016
can be used with the new HCW.
The next screen (Figure 10-4) looks for credentials for an on-premises account which is a member of the
Organization Management role group (if the currently logged-on user will not suffice) as well as administrative
credentials for your Office 365 Tenant. The HCW then connects to both the on-premises Exchange
organization and the Office 365 tenant to gather information and validate access rights. If network or internet
proxy issues exist, then you may require additional network configuration to allow the HCW to access the
Office 365 endpoints. It’s highly recommended to view the Office 365 URLs and IP address ranges article to
determine which endpoints are used. Understand that these IPs and URLs may change, so I recommend
subscribing to the RSS feed to receive notifications of changes.
The Federation Trust section of the HCW asks whether Federation should be enabled. Federation is used for
sharing calendar free/busy information with other Exchange organizations, amongst other things. From here
the Hybrid Domains page will list the Accepted Domains found when the HCW queries both the Exchange
Online tenant and the on-premises Exchange organization. My environment only has one domain,
ashdrewness.com. However, if I also had ashdrewness.net and ashdrewness.org listed as Accepted Domains in
my tenant and on-premises environment, all three domains would be displayed. I would be given the option
to designate one as an Autodiscover Domain. Obviously, if a domain that you’d like to use as a shared
namespace in the hybrid configuration is missing, you should verify that domain exists as an Accepted Domain
in both the tenant and on-premises. This requires the tenant domain to have fully completed verification via
TXT record. You can choose to stop the wizard and add the missing domains, or re-run the HCW at a later
time. Going forward, when new domains are to be added to the hybrid configuration, the HCW can be re-run
as needed.
Note: If you are not presented with the Hybrid Domains page then the HCW has only detected one
domain which is present in both the tenant and on-premises as Accepted Domains. Therefore, there is
no need to choose domains from the wizard’s perspective. If this is not desired and multiple domains
are to be added to the hybrid configuration, then verify all domains have been added as Accepted
Domains in the tenant and on-premises.
Once Federation is enabled and domains are selected, the Domain Ownership screen is displayed (Figure 10-
5). The domains to be used in this Hybrid Configuration must first pass validation that they are indeed owned
by the on-premises organization. This is done by requiring a TXT record be created in the domain’s zone with
a unique token string. Creating this record validates the individual running the HCW actually owns (or has
appropriate access to) this domain. It’s recommended to use the “Copy to clipboard” option to ensure only
the required text is copied. Then add a TXT record with the pasted token as its contents. Once this is done, it
may take several minutes (or even hours) for the record to be propagated, at which point the “verify domain
ownership” button can be clicked.
The HCW now asks if I utilize an Edge Transport Server for secure transport (Figure 10-6). By clicking
“Advanced…” the option to enable Centralized Transport is presented. Centralized Transport ensures all
outbound mail from both on-premises and the Office 365 tenant is routed through the on-premises
Exchange servers. By default, outbound mail from mailboxes in the tenant leave via Exchange Online
Protection (EOP) to the intended recipient. Enabling Centralized Transport changes this behavior so that all
outbound emails from Exchange Online are routed back on-premises. This is usually done for compliance
reasons.
The two following screens (Receive Connector Configuration and Send Connector Configuration) allow you to
select which Exchange servers are to be used for hosting Receive and Send Connectors to facilitate SMTP
communications between the on-premises organization and the Office 365 tenant. For obvious reasons, these
servers must be able to communicate to and from Office 365 over port 25. The next (Transport Certificate)
screen asks which certificate will be used for SMTP communications to the Office 365 tenant. This certificate
and private key must be installed on every server that will participate in hybrid mailflow, as well as be assigned
to SMTP services on Exchange. The final transport screen (Organization FQDN) of the HCW asks which FQDN
should be used by EOP when delivering mail to your on-premises organization.
If you have not configured an External URL for the Exchange Web Services (EWS) virtual directory, you may see
the screen illustrated in Figure 10-7.
Figure 10-7: The absence of an EWS External URL causes the HCW to display a warning
Seeing this screen should be a red flag that the current environment hasn’t been properly configured. Any
issues should be remedied before proceeding to complete the hybrid configuration. However, in this case, we
will not provide an EWS URL and will instead proceed to see how the HCW reacts.
Figure 10-8: Pushing the changes to the on-premises organization and the Office 365 tenant
The HCW will now take several minutes to complete the requested configuration changes. The first time it runs
it will enable the MRSProxy on all Exchange Servers in the organization and can take a very long time to
complete in a globally distributed environment. Enabling it beforehand may lower the time to completion for
the wizard. In either case, please be patient. In my case, the HCW encountered a failure (Figure 10-9) because
even though it warned me before, I still have not configured an External URL for my EWS Virtual Directory.
I’ve never been known to be fond of error messages or failure in general, but there are two things about this
notification that I like. First, I’m provided with hyperlinks at the bottom-left of the wizard to open Exchange
PowerShell either for my on-premises organization or my Exchange Online environment in my Office 365
tenant. This can allow me to remedy any mistakes which may have caused the HCW to fail. In my case, I used
the on-premises link to configure an External URL for my EWS Virtual Directory with the intent of resolving this
issue. The other thing I like about this screen is that once I’ve made the necessary changes to correct my
environment, I’m able to simply click “Back” to go to the “Ready for Update” screen once again and click
“Update” once more. No need to restart the HCW, just two simple mouse clicks.
This time, the HCW completed successfully and displayed the wizard completion screen (Figure 10-10).
Figure 10-10: Completion page where Feedback can be provided to the HCW team
Again, I highly suggest you take the time to provide feedback to the Hybrid Configuration Wizard team. Once
the HCW has been run, the Hybrid Configuration may be updated by navigating to the Hybrid feature pane in the Exchange Admin Center and re-running the wizard. Common reasons to re-run the HCW include:
Changing a certificate
Adding a new Exchange transport server
Adding a domain to be used in the Hybrid Configuration
The logs provide a record of each action taken during the wizard’s execution. The following data is contained
within the logs:
If you encounter any issue during the execution of the HCW and the error message displayed in the wizard
GUI is not sufficient, I recommend reviewing the content of the logs to assist in identifying the root cause.
Common issues encountered during the wizard’s execution are:
While work is being continuously performed to make issues encountered during the wizard’s execution easier
to diagnose via easy to understand messages, the HCW logs may still be required to determine exactly what
went wrong. For example, I removed my third-party certificate from the Exchange server’s local certificate
store and re-ran the HCW. I was greeted with this warning message:
“No valid certificate could be found to use for securing hybrid mail transport. The certificate must be installed on
all servers where Send or Receive connectors are hosted. After installing a valid certificate on all respective
servers return to this step and push 'search again'”
If I then check the HCW logs, I can see the last action the wizard performed, which was to execute the Get-
ExchangeCertificate cmdlet and gather all installed certificates on the selected Exchange servers. While the
default self-signed certificates installed with Exchange as well as the Federation certificate are detected in the
logs, a valid third-party certificate with the FQDN names used on the transport connectors was not found. I
could then use this data to correlate the findings on my Exchange server using Get-ExchangeCertificate. In my
scenario, after re-importing the third-party certificate, enabling it for SMTP, and clicking “Search Again” in the
HCW, the certificate was detected and the wizard was allowed to continue.
Once connected, let’s look at the Hybrid Configuration object created on-premises using the Get-
HybridConfiguration cmdlet (Figure 10-11).
This object is a reference to the Hybrid Configuration itself and contains the servers, domains, transport
services, and features used in the Hybrid Configuration. This object will only exist in the on-premises
environment. While this object can be removed using the Remove-HybridConfiguration cmdlet, this will not
actually remove hybrid features such as Organization Relationships, connectors, Remote Domains etc.
The Get-OrganizationRelationship cmdlet displays the properties of the Organization Relationship the on-premises Exchange organization
has with the Office 365 tenant Exchange organization. This information is used for sharing calendar free/busy
information between each organization. Remember, the end goal of a Hybrid Configuration is to have an on-
premises Exchange organization and an Office 365 tenant appear to the users as one entity. Allowing cloud
mailboxes to query free/busy information of on-premises mailboxes (and vice versa) is one feature which
makes that illusion possible. To view the Organization Relationship in the Office 365 tenant, run the same
command in the remote PowerShell session to your tenant (Figure 10-13).
Next we’ll look at the custom Remote Domains configured by the HCW for the “.onmicrosoft.com” and
“mail.onmicrosoft.com” domains using the Get-RemoteDomain cmdlet (Figure 10-14).
Remote Domains are used for controlling message formatting, automatic reply settings, and NDR information.
Since Exchange uses these onmicrosoft.com SMTP domains for routing mail to the Office 365 tenant, these
Remote Domains allow the mail to be treated (and formatted) as if it were being sent internally, further
maintaining the appearance of a single Exchange organization.
This connector is used whenever queries are issued for the ashdrewness.mail.onmicrosoft.com domain, which
is the TargetAddress for Exchange Online mailboxes that have a mail-enabled user account on-premises (this
requires directory synchronization to be enabled; more on this later in the chapter).
Figure 10-16 displays the IntraOrganization Connector which exists in the Office 365 tenant. This connector
would be used whenever queries are issued for the ashdrewness.com domain, which is the TargetAddress for
on-premises mailboxes that have a mail-enabled user account in the Office 365 tenant.
Figure 10-16: IntraOrganization connector that exists in the Office 365 tenant
The IntraOrganization Connectors might appear to serve the same purpose as Organization Relationships, and
in fact they do. The difference is that Exchange 2013/2016 mailboxes will utilize the
IntraOrganization Connectors and OAUTH for authentication while Exchange 2010 mailboxes will utilize
Organization Relationships and the Microsoft Federation Gateway for authentication. Of course, if OAUTH is
not configured or an IntraOrganization Connector for the target domain is missing, Exchange 2013/2016
mailboxes will use the legacy Organization Relationship method. If an Organization Relationship is missing,
Exchange 2010/2013/2016 will all fall back to use an Availability Address Space. Of course, in a functional
Hybrid Configuration, an Organization Relationship should not be missing. An Availability Address Space is a
legacy method for cross-forest free/busy using a service account or trust to mitigate permissions across
forests. The HCW creates an Availability Address Space on-premises for the mail.onmicrosoft.com address
space (Figure 10-17). This is actually a special “InternalProxy” access method, which resolves to a local 2013
SP1 or newer Exchange server to proxy the availability request to Office 365. This is used for legacy Exchange
servers to proxy their queries through the newer versions of Exchange.
It should be noted, however, that once Exchange 2013/2016 detects the presence of an IOC and OAUTH, it will
never fail over to using an Organization Relationship and the Microsoft Federation Gateway should an issue
subsequently arise with either. Similarly, if Exchange 2010 detects the presence of an Organization
Relationship and a failure is encountered, it will not fail over to using an Availability Address Space.
Let’s walk through a few example scenarios for how free/busy requests work between on-premises and Office
365 tenants.
Scenario #1: In the Ashdrewness Company, Exchange 2013 connects to Office 365 in a Hybrid Configuration.
An on-premises Exchange 2013 mailbox (UserA) wishes to query the Free/Busy information of a mailbox in
Office 365 (UserB) which is also in Ashdrewness Company. OAUTH is configured and an IntraOrganization
Connector exists (created by the HCW). The process would be as follows:
Scenario #2: In the Contoso Company, Exchange 2013 and Exchange 2010 connect to Office 365 in a Hybrid
configuration. An on-premises Exchange 2010 mailbox (UserA) wishes to query the Free/Busy information of a
mailbox in Office 365 (UserB) which is also in Contoso Company. OAUTH is configured and an
IntraOrganization Connector exists (manually created by the administrator, as the HCW will only enable
OAUTH in a pure Exchange 2013 CU5 or later environment). The process would be as follows:
UserA@contoso.com uses Outlook to query the free/busy status of UserB@contoso.com via the
Scheduling Assistant
Scenario #3: In the Ashdrewness Company, Exchange 2013 connects to Office 365 in a Hybrid Configuration.
An on-premises Exchange 2013 mailbox (UserA) wishes to query the Free/Busy information of a mailbox in
Office 365 (UserB) which is also in Ashdrewness Company. OAUTH is configured and an IntraOrganization
Connector exists (created by the HCW). The process would be as follows:
Note: The process is similar when an Office 365 mailbox attempts to query free/busy information for an
on-premises mailbox. Because Office 365 mailboxes are on Exchange 2016 (or newer), they will always
use IOC/OAUTH if it has been configured.
As previously mentioned, when OAUTH is not used, Exchange relies on Organization Relationships and the
Microsoft Federation Gateway (MFG). A Federation Trust is established by the on-premises organization with
the MFG (the HCW performs this action). Since each Office 365 tenant already trusts the MFG, the on-premises
organization and the Office 365 tenant now have a mutual trust. The on-premises organization will federate
the company’s domain names while the Office 365 tenant will federate the <domain>.mail.onmicrosoft.com
domains. The Federation Information for these domains contains their Autodiscover endpoints. With these in
place, authentication can now occur between these two federated domains via the issuance of tokens. Each of
these federated domains can be seen below in Figure 10-18.
Note: The Federation is established using a self-signed certificate with a Subject Name of
CN=Federation which is enabled for the Federation service. These actions are performed by the Hybrid
Configuration Wizard. If this certificate expires or is removed, the Federation trust will be broken.
As you can see, understanding how free/busy requests are processed in Hybrid is vital to troubleshooting
free/busy issues from on-premises to Office 365 and vice versa. The following are common issues with
free/busy in a Hybrid configuration and tips for troubleshooting them:
Expired or invalid certificate used for Federation (when not using OAUTH):
If OAUTH was manually configured, verify the spelling and formatting of all URLs and that the
certificate uploaded to Azure Active Directory Access Control Services is not expired.
The following command verifies the Service Principal is still present
o Get-MsolServicePrincipal -ServicePrincipalName 00000002-0000-0ff1-ce00-000000000000 | fl
The following command can verify that the credentials created during manual OAUTH configuration are still
valid
o Get-MsolServicePrincipalCredential -AppPrincipalId 00000002-0000-0ff1-ce00-000000000000
Use Test-OAuthConnectivity to validate connectivity and authentication for OAUTH (a sketch follows below)
Of course, re-running the HCW should correct most issues with OAUTH configuration, assuming
OAUTH was not manually configured due to legacy Exchange Servers being present.
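A sketch of the Test-OAuthConnectivity check from on-premises toward Exchange Online (the mailbox and target URI are illustrative):
Test-OAuthConnectivity -Service EWS -TargetUri https://outlook.office365.com/ews/exchange.asmx -Mailbox "UserA@ashdrewness.com" | Format-List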
Use the Microsoft Remote Connectivity Analyzer to validate AutoDiscover and EWS connectivity for
mailboxes on-premises and in the Office 365 tenant
For EWS Issues, run the Synchronization, Notification, Availability, and Automatic Replies (OOF) test in
the Microsoft Exchange Web Services Connectivity Tests section, and verify that there aren’t any errors.
If errors occur, correct the items that the test identified
For AutoDiscover issues, run the Outlook Test E-mail AutoConfiguration test in the Microsoft Office
Outlook Connectivity Tests section, and verify that there aren’t any errors. If errors occur, correct the
items that the test identified
From the Exchange server itself, verify you can reach the target AutoDiscover and EWS endpoints
using Internet Explorer and not receive a certificate warning
When verifying access to Office 365 EWS endpoints, it’s recommended to use the Microsoft Remote
Connectivity Analyzer to first query AutoDiscover. It will then provide a specific application server to
use for EWS queries
If queries from Office 365 to on-premises fail, verify a valid ExternalURL is configured for the Web
Services Virtual Directory. Also verify this URL is reachable from outside your environment
Please refer to the Client Access Services chapter for further tips on troubleshooting Exchange
endpoints.
Verify an Azure AD Connect (Formerly Azure AD Sync/DirSync) synchronization has been performed
and the mailbox exists in the remote Exchange organization
Often, users are created on-premises and mailbox-enabled, a directory synchronization is
performed, and their mailbox is moved to the Office 365 tenant. If a process other than this was
performed (such as the mailbox being created directly in Office 365) then a free/busy request may fail
Additional References:
Note: Hybrid Mailbox Move troubleshooting, another feature of Hybrid Configurations, will be covered
in the Migration chapter.
Directory Synchronization
The ability to synchronize directory objects between organizations has existed for some time and has been
provided by various Microsoft and third-party tools. The available Microsoft toolsets have seen extensive
change and rebranding since 2010:
The rapid change in product naming and confusion over which tool is used under which circumstances has led
many Exchange professionals to fear directory synchronization. In addition, few administrators may feel
confident in troubleshooting directory synchronization when problems arise. With Directory Synchronization,
most issues arise during deployment, which emphasizes the need to achieve a proper configuration to ensure
a healthy installation. To be honest, once synchronization has been deployed it usually simply works. With the
release of Azure AD Connect, much of the complexity has been removed during deployment and has also
made the more complex migration scenarios (such as migrating from a legacy tool such as DirSync) much
easier.
Microsoft likes to market Azure AD Connect as enabling customers to connect their on-premises Active
Directory to Microsoft’s cloud “with only 4 clicks.” They aren’t exaggerating either, as a simple Exchange
environment can be integrated with Azure Active Directory/Office 365 in 4 clicks and usually less than an hour.
The green field Exchange 2016 environment I’ve used to demonstrate a Hybrid Configuration in this chapter
has still not had a directory synchronization performed against it. That's right, you can run the Hybrid
Configuration Wizard successfully without any form of directory synchronization being performed against it. In
theory, a functional Hybrid Configuration can exist by manually creating/updating directory objects between the on-premises organization and the Office 365 tenant.
Note: At the time of this writing, one limitation of Azure AD Connect is its inability to synchronize Mail-
Enabled Public Folders from on-premises to Office 365. This blog post from Exchange MVP Michael Van
Horenbeeck has more details.
Help determine whether directory synchronization is right for your environment (Figure 10-19)
Perform a check to verify on-premises domains to be synchronized have been added to the tenant
and verified (see Note)
You’re given the option to download and run the idFix tool against your on-premises Active Directory
Download and install Azure AD Connect as well as help verify synchronization has occurred
The purpose of the idFix tool is to identify and remedy issues with on-premises directory objects which would
prevent a successful synchronization, such as duplicate objects or invalid characters.
Using idFix, an Administrator can simply edit the object by accepting the suggested changes provided by the
tool, or manually supply an alternative supported attribute value. I highly suggest reading the idFix user’s
guide included in the download to fully understand both the capabilities of the tool as well as the cautions
that need to be taken in running it.
Figure 10-20: Invalid character (a space) detected in a user’s ProxyAddresses Active Directory attribute
Once all preparations are complete, Azure AD Connect can be downloaded and configured. In this example,
we will use the guide and the Express settings to configure Azure AD Connect (which will enable Password
Sync and use a SQL local DB Instance) as well as selecting the option for an Exchange Hybrid deployment. The
Express Settings are perfectly acceptable for a lab environment and the install should take ~10 seconds as it
will not include any of the management tools SQL Express uses. As expected, four clicks and about ten
minutes later, I have directory synchronization installed and am ready to begin synchronizing my local
directory with Azure Active Directory. If you encounter any issues during installation, Azure AD Connect
installation logs can be found in the directory: C:\Users\<Current User>\AppData\Local\AADConnect
Note: If the Express settings option was chosen, AADConnect will be configured to automatically
upgrade itself.
The following folders are created in their default locations (to change these locations, select Custom
Installation instead of Express):
Any troubleshooting logs should be contained in the above paths. The logs are fairly straightforward to read
yourself but if you need to open a support case with Microsoft they may ask you to gather the logs for
analysis.
The wizard also creates several service accounts and security groups:
AAD_GUID (User)
AADSyncSched_GUID (User)
MSOL_GUID (User)
ADSyncAdmins (Group)
ADSyncBrowse (Group)
ADSyncOperations (Group)
ADSyncPasswordSet (Group)
Ensure these user accounts are not deleted or their group membership modified as otherwise directory
synchronization functionality will be hindered. Since Azure AD Connect with the Express installation options is
a fairly basic install, uninstalling the program and reinstalling it is not a cumbersome task. However, for more
advanced custom installations it may be preferred to attempt to reset permissions on these accounts based on
a lab environment or Microsoft guidance. Also, any configuration customizations should be backed up
beforehand.
Manual Synchronization
By default in AAD Connect 1.1.x, a synchronization should occur every 30 minutes via a scheduled workflow. In
previous versions, it was performed via a Scheduled Task in Windows (Figure 10-21). It can be manually
synchronized but cannot be configured for an interval less than 30 minutes.
Figure 10-21: Scheduled Task for an early version of Azure AD Sync created by the Azure AD Connect installer
When Password Hash Synchronization is enabled, whenever a user changes their password on-premises,
synchronization should be triggered within two minutes. However, in early versions, if you needed to
manually invoke synchronization for troubleshooting purposes, you could right-click the task and select “Run”.
Alternatively, you can run this executable via the command prompt: C:\Program Files\Microsoft Azure AD
Sync\Bin>DirectorySyncClientCmd.exe
In current versions of the tool, the Scheduled Task was replaced by a custom synchronization engine workflow
exposed via PowerShell. The below Cmdlets can be used to manipulate the synchronization engine:
Get-ADSyncScheduler
Set-ADSyncScheduler
Start-ADSyncSyncCycle
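For example, a delta synchronization can be triggered manually from PowerShell on the AAD Connect server (a sketch):
Import-Module ADSync
Get-ADSyncScheduler
Start-ADSyncSyncCycle -PolicyType Delta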
For more information, see Microsoft MVP Jeff Guillet’s post on How to Schedule and Force Sync Updates with
AAD Connect 1.1.x
Synchronization events can be monitored using the program or using a very helpful third-party script written
by Exchange MVPs Mike Crowley and Michael Van Horenbeeck.
If you want to Monitor Azure AD Connect from the Azure Portal, the Azure AD Connect Health option helps
you monitor and gain insight into your on-premises identity infrastructure and the synchronization services
available through Azure AD Connect.
A common issue is newly created on-premises objects not appearing in the Office 365 tenant. The above tools
should allow you to view why an object wasn't synchronized. Of course, the answer may simply be that
synchronization has not been performed since the creation of the object. You’d be surprised how many
support incidents have been created simply due to impatience. With Hybrid configurations, since the source of
authority is the on-premises Active Directory via directory synchronization, objects such as mailboxes must be
created on-premises and allowed to synchronize to the tenant. The two common ways to create a mailbox
which will eventually live in Office 365 are:
To create a remote mailbox, you can either run New-RemoteMailbox which will create a new mail-enabled
user account in local Active Directory with a TargetAddress of <alias>@<tenant>.mail.onmicrosoft.com. Once
directory synchronization has occurred, the user account will show up in the tenant, and (in my experience)
after a few minutes will be mailbox-enabled. Once an Exchange Online license is assigned, it’ll be a fully
functioning mailbox existing in Office 365. Alternatively, if you already have a user account which exists in on-
premises Active Directory, you can instead use the Enable-RemoteMailbox cmdlet.
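As a sketch (the name, UPN, and routing address are illustrative, using the ashdrewness.com environment from this chapter):
New-RemoteMailbox -Name "Jane Doe" -UserPrincipalName jane.doe@ashdrewness.com -Password (Read-Host "Enter password" -AsSecureString)
Enable-RemoteMailbox "Jane Doe" -RemoteRoutingAddress jane.doe@ashdrewness.mail.onmicrosoft.com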
In each of these scenarios, a successful directory synchronization must occur before a mailbox can exist in the
Office 365 tenant.
Additional References:
Extending On-Premises Directories to the Cloud Made Easy with Azure Active Directory Connect
Monitor your on-premises identity infrastructure and synchronization services in the cloud
Deploy Office 365 Directory Synchronization (DirSync) in Microsoft Azure
Azure AD Connect FAQ
Azure AD Connect PowerShell cmdlets
How to troubleshoot password synchronization when using an Azure AD Sync appliance
Note: For a deeper dive into the above topics, as well as Office 365 from the Exchange perspective, I
recommend the eBook Office 365 for Exchange Professionals.
Cloud identity
Synchronized identity (with or without password synchronization)
Synchronized identity with federation
Note: Third party identity systems can be used to integrate with Office 365 in a variety of ways,
whether there is also an on-premises Active Directory or not. In this chapter we'll focus on the native
options provided by Microsoft.
In the Cloud Identity model (Figure 10-23), credentials in Office 365 are completely separate from on-premises
Active Directory credentials. A user logs into their local domain-joined machine with their local Active
Directory credentials but when connecting to their Office 365-hosted mailbox via OWA or Outlook they must
enter a separate set of credentials. These credentials may have the same values but only because the user has
managed them identically, not because automated synchronization is in place.
In the Synchronized identity model (Figure 10-24), with the “PasswordSync” option, Azure AD Connect instructs
the Azure AD Sync components to synchronize a hash of the locally stored password hash in the on-premises
Active Directory to matching objects in Azure Active Directory. This "salted hash" is non-reversible and
therefore secure. This arrangement ensures the user can use their same password configured on-premises to
access cloud-hosted resources. A user logs into their local domain-joined machine with their local Active
Directory credentials, but when connecting to their Office 365-hosted mailbox via OWA or Outlook they will be prompted to enter credentials again; the difference is that the password they enter is the same as their on-premises password.
In the Federated identity model (Figure 10-25), federation is enabled on a per-domain basis for each of the
domains configured in the Office 365 tenant. After ADFS has been configured, the domains used for a Hybrid
Configuration can be converted from “Managed” to “Federated.” This change causes authentication requests
to Office 365 to be redirected to the on-premises ADFS farm where users actually authenticate which then
grants an authentication token, granting access to the requested resources. A user logs into their local
domain-joined machine with their local Active Directory credentials but when connecting to their Office 365-
hosted mailbox via OWA or Outlook they should connect without having to enter their credentials again. This
is considered Single Sign-On (SSO). External or non-domain joined computers may still need to authenticate
to ADFS.
Note: Depending on the client, the ADFS Federated Identity process may not necessarily be SSO. For example, Outlook still requires the password to be presented. However, selecting the option to remember the password often effectively leads to an “SSO” experience. The long-term solution for Outlook will be Modern Authentication.
To view whether a domain is Managed or Federated, you use the Get-MsolDomain cmdlet. The value for “Authentication” will either be “Managed” or “Federated.” As previously stated, the risk of federating a domain and using ADFS is that ADFS becomes a single point of failure. Therefore, it is recommended to also enable Password Sync when configuring Azure AD Connect to provide a backup authentication method. Unfortunately, the failover is not automatic and you must manually switch from single sign-on to password sync. I recommend practicing this process in a lab environment or during an outage window so that you are prepared in the event of an extended ADFS outage. Of course, simply correcting the ADFS issues would be ideal, as the switch to password sync can potentially take up to two hours to complete.
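For example (domain name hypothetical), you can check the authentication type of each domain and, if an extended ADFS outage forces the switch, convert a federated domain to managed using the MSOnline cmdlets:
[PS] C:\> Connect-MsolService
[PS] C:\> Get-MsolDomain | Format-Table Name,Status,Authentication
[PS] C:\> Set-MsolDomainAuthentication -DomainName contoso.com -Authentication Managed
Treat this as a sketch only; validate the full switchover procedure, including the state of password synchronization, in a lab before relying on it during an outage.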
Expired SSL certificate that's assigned to the AD FS proxy server - Suggested Resolution Steps
o Jeff Guillet’s blog post for How to Update Certificates for AD FS 3.0
Incorrect configuration of IIS authentication endpoints - Suggested Resolution Steps
Broken trust between the AD FS proxy server and the AD FS Federation Service - Suggested
Resolution Steps
Incorrect ADFS URL – Suggested Resolution Steps
Request Signing Certificate failing revocation – Suggested Resolution Steps
Problem with individual ADFS/WAP Server - Suggested Resolution Steps
Additional References:
Microsoft Office 365 Deployment Readiness Tool – Determine readiness of environment prior to deploying
Office 365 services
Exchange Deployment Assistant – Provides Microsoft official guidance on configuration and migration based
on environmental inputs provided
The Hybrid Free Busy Troubleshooter – Troubleshoot free/busy issues between on-premises Exchange and
Office 365
Automated Hybrid Troubleshooting Experience – Gathers Hybrid connectivity logs, performs analysis, and
feedback for how best to resolve the issue with the Hybrid Configuration.
Mail Flow Guided Walkthrough for Office 365 – Guided walkthrough for troubleshooting mail flow issues in
Office 365 and Hybrid environments.
Office 365 Support and Recovery Assistant – Identifies and fixes Outlook and Office 365 connectivity issues
Office 365 Client Performance Analyzer - Identify issues that affect network performance between your
company’s client PCs and Office 365
Additional reading
Introduction
Exchange Hybrid Configuration
Hybrid in Exchange 2013
Hybrid Configuration Wizard
Cross-Forest Mailbox Moves
Free/Busy Sharing
MailTips
Online Archiving
Secure Mail via shared SMTP namespace
Shared Address Books (Unified GAL)
Remote Domains
eDiscovery Search and Compliance
Integrated Administration (RBAC, Exchange Management Shell, EAC)
Office 365
Azure Active Directory
Remote PowerShell
Azure AD Connect (aka. Azure AD Sync/DirSync)
Active Directory Federation Services (ADFS)
On-Premises Active Directory Domain Services
Microsoft Office Suite
Information Rights Management
Exchange Online Protection
Office 365 Complete Guide to Managing Hybrid Exchange Deployments
Office 365 for Exchange Professionals
Getting Started with Hybrid Exchange Deployment
Configuring an Exchange 2013 Hybrid Deployment and Migrating to Office 365 (Exchange Online)
Exchange Server Deployment Assistant
Directory Synchronization
Forefront Identity Manager (FIM) 2010 R2
Azure Active Directory Synchronization Tool (DirSync)
Azure Active Directory Sync (AADSync)
Azure Active Directory Connect (Azure AD Connect)
Microsoft Identity Manager (MIM) 2016
Integrating your on-premises identities with Azure Active Directory
Connecting AD and Azure AD: Only 4 clicks with Azure AD Connect
Mail-Enabled Public Folders & Directory-Based Edge Blocking
Enabling Directory Synchronization in Office 365 Portal (Legacy Method)
Set up directory synchronization in Office 365 (New Wizard-driven experience)
Install and run the Office 365 IdFix tool
Prepare directory attributes for synchronization with Office 365 by using the IdFix tool
IDFix Download
Azure AD Connect Download
Install the Azure Active Directory Sync Service
Azure Active Directory Synchronization Services: How to Install, Backup & Restore with full SQL
How to Schedule and Force Sync Updates with AAD Connect 1.1.x
Azure AD Sync Tool HTML Report
Monitor your on-premises identity infrastructure and synchronization services in the cloud
New-RemoteMailbox
Enable-RemoteMailbox
How to do Hard match in Dirsync?
PowerShell to generate ImmutableID
How to use SMTP matching to match on-premises user accounts to Office 365 user accounts for
directory synchronization
Extending On-Premises Directories to the Cloud Made Easy with Azure Active Directory Connect
Deploy Office 365 Directory Synchronization (DirSync) in Microsoft Azure
Azure AD Connect FAQ
Azure AD Connect PowerShell cmdlets
How to troubleshoot password synchronization when using an Azure AD Sync appliance
Migration, by its very definition, involves change. In the field of IT, change can be the catalyst for issues to
emerge, and why proper change management is so vital to an environment’s health. Exchange migrations are
often performed by experienced third-party Consultants, not because Exchange Administrators are necessarily incapable, but for two reasons. First, planning and performing an Exchange migration can be a very time-consuming endeavor, meaning Administrators may not have the necessary time while also performing their regular duties. Second, and more importantly, there are many “gotchas” or hurdles that only an experienced specialist in migrating Exchange data will be able to efficiently navigate, or mistakes they'll know to avoid based on past experience.
This chapter covers the more common issues experienced during an Exchange migration. The topics covered
will include:
Moving Mailboxes
Public Folder Migration and Coexistence
Client Access Coexistence between Exchange versions
Mail flow considerations
Managing Exchange during coexistence
However, let us first list the various types of migrations as well as tips to ensure a successful migration
experience.
Internal migrations
This is the most common type of migration, even if it’s simply moving mailboxes from one database to
another. However, migrating from one version of Exchange to another is what we’re really referring to when
discussing internal migrations. Whether the scenario is migrating from Exchange 2007 to 2013, or 2010 to
2016, or a double-hop migration from 2007 to 2010/2013 to 2016 etc., the hurdles are usually a result of
getting one version of Exchange code to properly interact with another version.
Historically, the struggle for Microsoft has been engineering a newer code base which can interoperate with a
legacy version. The constraints are typically that due to major architecture changes in the product, a new
feature may be impossible or very time-consuming to code into a legacy version of the product. For example,
when migrating from Exchange 2003 to Exchange 2010, there was no ability to proxy OWA traffic from
Exchange 2010 to 2003 OWA. Instead a redirect to the client was necessary, which resulted in a sub-optimal
client experience.
In another example, when Exchange 2013 was introduced into an Exchange 2007/2010 environment, access to
Public Folders or Shared Mailboxes on the legacy servers did not work. Negotiate Authentication for Outlook
Anywhere had to be disabled, as the legacy servers did not support it. Changing the code base of Exchange 2007/2010 to support Negotiate authentication would have been too burdensome a task for Microsoft, therefore a change/workaround during the migration was required to allow proper functionality.
Note: One form of migration troubleshooting not covered in this chapter is migrating from a third-
party (non-Exchange) email system. These often involve the use of third-party utilities/services and
troubleshooting would involve contacting their Support organization.
Cross-Forest
Mergers, acquisitions, divestitures, and temporary business partnerships often involve a cross-forest migration
or at least a period of coexistence across Exchange organizations. Many Consultants specialize in these
scenarios as they often involve many moving parts outside the normal expertise of an Exchange Administrator.
Technologies often involved in such migrations include but are not limited to:
Certificates trusted by both organizations to ensure secure mail flow as well as trusted HTTPS
endpoints for accessing resources such as calendaring Free/Busy information
Directory Synchronization tool to provide synchronization of objects between organizations enabling
querying of Calendaring Free/Busy information, cross-forest AutoDiscover redirection, Remote
Mailbox Moves, Linked Mailboxes, and cross-forest mail flow
Connectors for secure cross-forest mail flow, which are dependent upon mutually trusted certificates
for proper TLS communications
(Optionally) Third-party tools used to simplify the migration experience, provide additional features,
and potentially increase the speed of the migration
Issues surrounding cross-forest migrations typically involve one of the following areas:
These scenarios, and common issues encountered, will be covered later in this chapter.
Note: Cross-Forest mailbox moves typically require the Mailbox Replication Service Proxy (MRSProxy)
to be enabled. The MRSProxy facilitates cross-forest mailbox moves and remote move migrations
between multiple on-premises Exchange organizations or between an on-premises Exchange
organization and Exchange Online. The MRSProxy is a subcomponent of the Exchange Web Services
Virtual Directory and can be enabled via EAC or EMS. For Hybrid migrations, the Hybrid Configuration
Wizard enables the MRSProxy during its execution.
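As a simple illustration (server name hypothetical), MRSProxy can be enabled on the EWS virtual directory from the Exchange Management Shell:
[PS] C:\> Set-WebServicesVirtualDirectory -Identity "EX01\EWS (Default Web Site)" -MRSProxyEnabled $true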
Hybrid
An Exchange Hybrid Configuration, is essentially a specialized cross-forest coexistence state. An on-premises
Exchange organization is made to coexist with Exchange Online in an Office 365 tenant. Secure mail flow,
cross-forest free/busy, and cross-forest mailbox moves are all features shared with a non-hybrid cross-forest
migration. Key differences are:
Hybrid has a dedicated Hybrid Configuration Wizard which automates this complex configuration with
Office 365 and Exchange Online
Hybrid relies on Federation and OAuth for authorization whereas cross-forest scenarios often involve
a forest trust
See the Hybrid chapter for details regarding troubleshooting Hybrid configurations.
Note: When choosing to use Shared Mailboxes in a Hybrid Configuration, do not convert a regular
mailbox which has been synched to Office 365 into a Shared Mailbox. Instead, make the mailbox
Shared while it exists on-premises and then synchronize it to Office 365.
Note: Just as permission inheritance being disabled in on-premises Exchange can cause various issues,
it can also cause Hybrid Mailbox Moves to fail:
A user can't access a mailbox by using Outlook after a remote mailbox move from an on-premises
Exchange Server environment to Office 365
Moving Mailboxes
There are various reasons for moving mailboxes in Exchange environments: some are part of a migration effort, while others are for administrative or troubleshooting purposes. TechNet lists some common reasons why mailboxes need to be moved within an Exchange organization. Having spoken with many Exchange administrators, the most common reasons for moving mailboxes are:
To facilitate system housekeeping. For example, to move all mailboxes from a server to allow the
server software to be upgraded.
To move mailboxes so that they are geographically closer to their owners.
To implement an evenly distributed environment (per the sizing calculator’s recommendations).
As a means to balance mailbox capacity and load.
I’d add resolving mailbox corruption to the list as this is a reason that I often encounter, probably because of
my position in a support organization.
In Microsoft’s eyes, the practice of placing mailboxes in databases based on department or hierarchy within
the company (we all know the joke about keeping everyone who can fire you on the same, closely monitored
and regularly backed up database) should be discouraged. Most Exchange experts would also agree, as it
makes planning availability much more difficult. If a single Exchange server fails, there’s no easy method to
ensure the remaining servers handle a balanced user load upon failover.
In a balanced environment, by contrast, if each mailbox database contains 100 mailboxes, roughly 40 would be high-usage mailboxes while 60 would be low-usage mailboxes. This ensures a balanced performance and storage capacity load across each mailbox database in the environment, resulting in even distribution if a database failover occurs, consistent database backup/restore times, and consistent performance and capacity load on each volume.
Of course, this balance doesn’t come without an administrative cost. Monitoring of the environment to ensure mailboxes are still generating the expected load, and then rebalancing the environment as needed by moving mailboxes to different databases, will be required. The frequency at which this is required depends entirely on the environment’s rate of change. It has been said that Microsoft is constantly moving mailboxes in Office 365 to balance workload across available servers. However, this cannot be taken as normal operating practice because Exchange Online is a massive environment that is very different to any other.
If you’re working to achieve this mailbox distribution balance, it’s important to understand how to move mailboxes without affecting user work, to throttle the moves as needed so performance is not impacted, and to monitor/manage the moves as efficiently as possible.
Note: Although Microsoft promotes the Preferred Architecture for all on-premises Exchange
deployments, not all environments will be deployed in such a manner. Smaller environments will have a
hard time justifying the hardware/licensing expenses that allow for three copies of every database and
site resiliency. Many environments will have one to three Exchange servers and will not have evenly
distributed databases. However, I still recommend the practice as it allows for easier growth planning
and balanced performance. Even in single server environments, having multiple balanced databases will
allow flexibility when timing backup/restore jobs.
Exchange 2010 introduced asynchronous (behind the scenes) mailbox moves in the form of Move Requests.
The new approach delivers some significant advantages, including:
Mailbox moves are asynchronous and are performed by the Microsoft Exchange Mailbox Replication service (MRS). A request to move a mailbox is made to MRS and is processed as system load and resources allow. Any instance of MRS in the Active Directory site can process a move request. Servers/services can also be restarted while the move is in progress and the move request will automatically resume once services become available again
Mailboxes remain online during the asynchronous moves. Users can still access their mailbox data while their mailbox is being moved. Behind the scenes, MRS copies the mailbox to the destination database. Once the copy is complete, the “switch is flipped” (pointers in Active Directory are updated) to activate the mailbox copy in the destination database. The final redirection to the new mailbox can be controlled by the administrator, as move requests can be configured to suspend and await approval after copying is complete.
The items in a mailbox's Recoverable Items folder are moved with the mailbox. Therefore, deleted or
purged items remain with the mailbox when it is moved. This is vital in compliance scenarios where all
mail items must be retained in the mailbox.
As soon as MRS begins to copy data to the target mailbox, the indexing service on that server starts to
index it so that fast searches are available to users immediately after the mailbox move completes.
You can configure throttling for each MRS instance, each mailbox database, or each Mailbox server. Throttling behavior can be controlled by editing the MSExchangeMailboxReplication.exe.config file that exists on every Exchange Mailbox server
If a group of mailboxes, such as all mailboxes on a given database, need to be moved at once, then a Batch
Move Request can be created:
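A command along these lines creates such a batch of move requests (database names assumed; the BatchName and suspend behavior are discussed below):
[PS] C:\> Get-Mailbox -Database DB01 | New-MoveRequest -TargetDatabase DB02 -BatchName "DB01toDB02" -SuspendWhenReadyToComplete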
This should not be confused with a Migration Batch in Exchange 2013/2016. The Migration Batch feature introduced in Exchange 2013 will be discussed later in this chapter. The “BatchName” parameter is only used to assign an identifier to a group of mailboxes to be moved and allows you to query them using the Get-MoveRequest cmdlet while specifying the BatchName parameter:
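For example, to list only the requests belonging to that batch:
[PS] C:\> Get-MoveRequest -BatchName "DB01toDB02"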
If you want to have the move request suspend at 95% completion, the “SuspendWhenReadyToComplete”
parameter can be used. This allows the majority of the mailbox data to be moved in the background while the
mailboxes are still online. Then when ready, an Administrator can complete the moves, which will involve a
brief outage for the mailbox. As there are many reasons to move mailboxes as a means of troubleshooting, I
highly recommend performing mailbox moves in a way which will have as little impact to production as
possible. Therefore, the SuspendWhenReadyToComplete parameter should be used to move the mailbox data,
suspend the move request before completion, and allow you to resume the request using the Resume-
MoveRequest cmdlet. The above example using BatchName “DB01toDB02” will automatically suspend at 95%
completion. To complete the automatically suspended batch of move requests, we would use the below
command:
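A minimal form of that command is:
[PS] C:\> Get-MoveRequest -BatchName "DB01toDB02" | Resume-MoveRequest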
Note: Timing mailbox moves is vital to maintaining availability in an Exchange environment. While the
moves are an online process in Exchange 2007 SP3 and newer, as the move completes there is still a
short window where the users will be unable to access their mailbox. Even if this is only a few seconds,
the possibility of generating help desk calls still exists, especially if the move generates an Outlook
pop-up regarding connectivity loss. It's best to notify users about this potential outage before it
happens to mitigate helpdesk calls.
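A minimal form of the command, assuming the standard pipeline of Get-MoveRequest into Get-MoveRequestStatistics, is:
[PS] C:\> Get-MoveRequest | Get-MoveRequestStatistics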
This is a simple cmdlet which will provide the information contained in Figure 11-1.
Figure 11-1: Simple view of active move requests and useful data
When you simply wish to know which mailboxes are being moved, their size, and % completed, this is the
simplest way to gather that information. However, should you want further details, including a report of any
failures encountered during the move, I recommend the below command:
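A sketch of that command (piping to Format-List exposes the full detail, including the move report):
[PS] C:\> Get-MoveRequest | Get-MoveRequestStatistics -IncludeReport | Format-List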
I’ve found this particularly useful when trying to diagnose a failed move request. The output in Figure 11-2 is simply an example of the command’s output and displays the data captured. Should bad (corrupted) items be encountered, they will be displayed here. It’s not uncommon for a mailbox (in particular, a large mailbox) to have corrupted items which cannot be copied. When creating a move request you can specify the maximum acceptable number of bad items using the BadItemLimit parameter. If this limit is exceeded, the move will fail. You must decide the amount of data loss which is tolerable, but I will add that if an item is corrupted in the mailbox, you’re unlikely to be able to open it with Outlook anyway.
Note: If a Move Request’s BadItemLimit parameter is configured higher than 50, the
AcceptLargeDataLoss parameter must be provided.
When providing the IncludeReport parameter to the Get-MoveRequestStatistics cmdlet, you’re provided with
a detailed log of the actions performed during the move. If a move request fails for any reason, the report will
help in the diagnosis. Common reasons for a move request failing are:
The BadItemLimit parameter can also be modified on an existing Move Request using the Set-MoveRequest cmdlet. Other actions which can be performed against an existing request are suspending it using Suspend-MoveRequest, resuming it using Resume-MoveRequest, and removing it using Remove-MoveRequest. While a Move Request can be manually suspended and resumed using the above aptly named cmdlets, it’s more common to create Move Requests which will suspend before completion, allowing you to manually resume them when ready (Figure 11-3).
When a Move Request has completed it should be removed, as the mailbox cannot be moved again until you do so. When Exchange 2010 was first released, it was a very common support call for customers to be unable to move a mailbox because a previous request had not been removed.
Migration Batches
Migration Batches are not necessarily a new vehicle for moving mailboxes, but more akin to a new paint job
and improved handling. Migration Batches allow superior manageability of the mass movement of mailboxes
both inter and intra-forest. They allow an administrator to manage large amounts of mailbox moves,
submitted manually or via CSV file, with new features such as:
Incremental synchronizations - This feature enables a sync every 24 hours to keep source and
destination mailboxes updated
Automatic retry of moves
Automatic cleanup of Move Requests upon batch removal - Using the –Force parameter of the
Remove-MigrationBatch cmdlet will not remove the move requests within the Migration Batch. The
parameter is typically used to remove corrupted batches.
Move Report generation – Email reports can be sent to a specified address.
Ability to select multiple target databases for even distribution.
Support for Migration Endpoints - Used in IMAP, Cutover, Staged, Hybrid, and Cross-Forest migrations
Not required for intra-org migrations
Below are the various cmdlets used to create, manage, and remove Migration Batches, along with explanations
and use cases for the non-intuitive cmdlets:
Get-MigrationBatch
New-MigrationBatch
Set-MigrationBatch - Modify parameters such as BadItemLimit, LargeItemLimit, etc. of an already
created Migration Batch
Start-MigrationBatch - Begin a Migration Batch which was stopped or not created to automatically
start
Stop-MigrationBatch
Complete-MigrationBatch - Finalize a Migration Batch which has successfully completed initial
synchronization
Remove-MigrationBatch
Get-MigrationUser - View details of a user who is part of a Migration Batch
Remove-MigrationUser - Remove a user from a Migration Batch
Get-MigrationUserStatistics - Similar to Get-MoveRequestStatistics but with less detail about the MRS process itself
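As a sketch of a typical local (intra-organization) batch, assuming a CSV file containing an EmailAddress column and hypothetical names, paths, and databases:
[PS] C:\> New-MigrationBatch -Name "DB01Moves" -Local -CSVData ([System.IO.File]::ReadAllBytes("C:\Temp\Users.csv")) -TargetDatabases DB02 -AutoStart
[PS] C:\> Get-MigrationBatch -Identity "DB01Moves"
The batch can later be finalized with Complete-MigrationBatch, or configured to finalize automatically by adding the AutoComplete parameter.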
Data for Migration Batches is stored within an arbitration mailbox known as the “Migration mailbox” (Migration.8f3e7716-2011-43e4-96b1-aba62d229136), which can be viewed using Get-Mailbox –Arbitration. If this mailbox is removed, inaccessible, or past its storage limit, Migration Batches will fail. If this mailbox does not exist, it needs to be recreated.
Start the Active Directory Users and Computers snap-in. Click Users, and look at the accounts, or perform a search, to verify that an account named "Migration.8f3e7716-2011-43e4-96b1-aba62d229136" does not exist. If this account exists in the Users container, skip straight to the step for running Enable-Mailbox. To recreate the mailbox, first run the following command using the Exchange setup files.
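A sketch of the documented procedure: run Exchange setup with the PrepareAD switch from the installation media, then re-enable the arbitration mailbox from the Exchange Management Shell.
Setup.exe /PrepareAD /IAcceptExchangeServerLicenseTerms
[PS] C:\> Enable-Mailbox -Arbitration -Identity "Migration.8f3e7716-2011-43e4-96b1-aba62d229136"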
Note: It is expected to have excessive transaction log generation on the mailbox database which hosts
the Migration mailbox when using Migration Batches. This is explained by Microsoft in the below
Knowledge Base article:
Large transaction logs are generated when you move mailboxes in Exchange Server 2013 or Exchange
Server 2016 Administration Center
Microsoft recommends two workarounds for this behavior: either enable Circular Logging on the database hosting the Migration Mailbox, or do not use Migration Batches and instead use New-MoveRequest in the Exchange Management Shell to move mailboxes.
Obviously, the mailbox should be visible, but also verify the value for StorageLimitStatus is blank. If the value is “MailboxDisabled” then it has exceeded its storage limits and moves will fail. This can occur when a large number of mailbox moves have been created to this database. The simplest solution is to increase the storage quota for this mailbox.
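For example (quota values arbitrary), you can check the mailbox and raise its quotas like so:
[PS] C:\> Get-Mailbox -Arbitration "Migration.8f3e7716-2011-43e4-96b1-aba62d229136" | Get-MailboxStatistics | Format-List DisplayName,StorageLimitStatus,TotalItemSize
[PS] C:\> Set-Mailbox "Migration.8f3e7716-2011-43e4-96b1-aba62d229136" -Arbitration -IssueWarningQuota 8GB -ProhibitSendQuota 9GB -ProhibitSendReceiveQuota 10GB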
If this system mailbox is missing, options are fairly limited. While other system/arbitration mailboxes can be
easily recreated (see below references), I’m unaware of a supported method to recreate this mailbox. In my
experience, the only successful remediation step is to move all mailboxes from the database and delete it.
Remember, the system mailbox on the destination database is what is used to process move requests. So even
though the system mailbox is missing from DatabaseA, it will not prevent you from moving mailboxes from
DatabaseA to DatabaseB.
Additional References:
Exchange 2010 will keep the previous 2 moves in a mailbox's move history, whereas 2013/2016 will keep move history for the previous 5 moves by default. This value can be increased to as high as 100 by modifying the MaxMoveHistoryLength value in the “Program Files\Microsoft\Exchange Server\V15\Bin\MSExchangeMailboxReplication.exe.config” file.
See Paul Cunningham’s Retrieving the Move History for an Exchange Server 2010/2013 Mailbox blog post for
more details on viewing move history.
The file lists the defaults as well as the minimum and maximum values which can be configured. When the
goal is to increase migration performance, settings such as MaxActive* can be modified to increase migration
throughput. However, take care when modifying the values, as they are set to their default values by Microsoft for a reason. Most settings are in place to prevent mailbox moves from overwhelming a server’s resources. However, if your system is robust (CPU/RAM/Disk), you can certainly tweak the settings to squeeze extra performance from the server. These settings aren’t just for migration performance troubleshooting, of course.
When moving tens of thousands of mailboxes, even a 10% increase in performance can have drastic
improvements at scale. The Microsoft Exchange Mailbox Replication Service must be restarted for any changes
to take effect.
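For example, after editing the config file on a server:
[PS] C:\> Restart-Service MSExchangeMailboxReplication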
Other instances of throttling may be experienced when migrating to Office 365. The Microsoft Migration Team
has provided the following AnalyzeMoveRequestsStats script to aid in identifying the source of performance
issues. The article also details several possible causes of slowness:
This Exchange Online migration performance and best practices article is an excellent resource for diagnosing
Exchange Online migration performance issues. Also, while migrating large amounts of data to Office 365 it
may be required to create a support ticket requesting that Office 365 support ease throttling limits for your
tenant. This can increase migration throughput, but know that Microsoft typically only leaves this exception in
place for 90 days.
Note: While not directly related to mailbox moves, other forms of EWS throttling can impact mailbox
performance, in particular service accounts using EWS to access Exchange. A new Throttling Policy may
need to be created and customized to prevent the service account from being adversely impacted by
the Exchange EWS Throttling.
While working with mailbox moves, you may encounter move stalls which have a “MoveRequestStatus” of RelinquishedWlmStall. This often happens due to Workload Management, which can throttle Exchange actions if system resources such as CPU, Content Indexing, Replication Health, etc. are unhealthy. The Move Report may indicate why the throttling is occurring. However, in certain cases the behavior is unexplained to the Administrator. If the unhealthy system state cannot be corrected or identified, the best approach would be to contact Microsoft Support for assistance. However, there are a few options.
The recommended approach would be to set the “Priority” of the Move request to “Emergency”, which should
bypass the Workload Management settings.
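A sketch of both forms (mailbox name hypothetical; the Priority parameter accepts Emergency on Exchange 2013/2016):
[PS] C:\> New-MoveRequest -Identity "Alan Reid" -TargetDatabase DB02 -Priority Emergency
[PS] C:\> Set-MoveRequest -Identity "Alan Reid" -Priority Emergency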
This should allow the Move request to complete in a timely fashion. If you are configuring this setting on an
existing Move request using Set-MoveRequest, then I recommend restarting the Microsoft Exchange Mailbox
Replication Service afterwards for it to take immediate effect. However, another option (which should only be
used if the above method fails) is to bypass WLM for all Move Requests. This option is not recommended
unless all other avenues are first pursued and it’s recommended to revert the setting once your task is
accomplished. I only mention it here because I’ve had colleagues tell me sometimes the above method is not
successful.
Additional References:
Corruption within a mailbox or mailbox database can be caused by a number of factors, including:
Hardware/Storage failures
Power failures
File System failures or corruption
Faulty ActiveSync clients
Faulty third-party Outlook add-ins
Anti-Virus software (faulty or missing proper exclusions)
Much of this corruption may go undetected during normal operations. Event logging for physical database
corruption can be found in the Application logs, but logging for corruption within individual mailboxes is
somewhat sparse. In fact, most people only discover mailbox corruption when attempting to move a mailbox, either to a new database within the company or across Exchange organizations. They'll find the mailbox moves fail due to bad items being encountered, and because the default BadItemLimit is zero for all move requests, such failures are not at all uncommon. In some cases, after a database corruption event where the mailbox database was left with bad checksums or an ESEUTIL /P (database repair) had to be run against it, the mailboxes within the database are left in a corrupted state (hence Microsoft’s newer support policy stating that a database must be vacated after an ESEUTIL /P has been run against it). Let’s discuss behaviors when corruption is present, as well as how to recover from it.
Mailbox Quarantine
At times, mailboxes may become so corrupted and unstable that they are quarantined by Exchange. A quarantined mailbox means the user will be unable to access the mailbox using any client. A mailbox becomes quarantined if it causes the Information Store processes to hang or crash repeatedly. That mailbox will then be inaccessible by any mail client (such as Outlook, Outlook Web App, ActiveSync, etc.) until a given time period has expired. The default for this “penalty box” time period is six hours and can be customized by modifying the registry. The Mailbox Quarantine feature was introduced during Exchange 2010’s lifetime, but options were limited when it came to detecting and configuring it. You were limited to searching the below Windows Registry key for quarantined entries and then clearing them by deleting the registry entry for a particular mailbox.
HKLM\SYSTEM\CurrentControlSet\Services\MSExchangeIS\<ServerName>\Private-
{dbguid}\QuarantinedMailboxes\{mailbox guid}
Fortunately, with Exchange 2013 came cmdlets to help administer Mailbox Quarantine. These commands were:
Enable-MailboxQuarantine
Disable-MailboxQuarantine
These commands replaced the manual deletion of registry keys as a means to disable mailbox quarantine for
mailboxes. In addition to the above commands, the below command can be used to detect which mailboxes
are currently quarantined.
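A sketch of such a check (database name hypothetical), using the IsQuarantined property exposed by Get-MailboxStatistics:
[PS] C:\> Get-Mailbox -Database DB01 | Get-MailboxStatistics | Where-Object {$_.IsQuarantined -eq $true} | Format-Table DisplayName,IsQuarantined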
Types of Corruption
Sometimes you’ll have mailboxes that frequently get quarantined, requiring they be repaired. Let’s first discuss
the two categories of corruption in an Exchange mailbox or mailbox database - physical corruption and logical
corruption. The analogy I like to use when discussing Exchange physical vs. logical corruption is that of a
damaged book. If a book’s pages are torn out and its binding is damaged, I compare that to physical
corruption in a database. This is where we would need to run ESEUTIL against the database to repair its
physical structure, either by running an ESEUTIL /R (recovery) or the dreaded ESEUTIL /P (repair). For details on
ESEUTIL, please see the Backup and Disaster Recovery chapter. Now, once the book’s pages and binding have
been repaired (with all pages now in the correct order), it could still have logical corruption if the words on the
pages don’t make sense to the reader, if the letters are smudged, the words are out of order, or it’s now in the
wrong language. For this, Exchange had a utility called ISINTEG, which would correct any logical corruption
that resulted in Exchange being unable to properly decipher and process data in the Jet database. The
symptoms of logical corruption were often display or search-related issues in Exchange clients, or in some
cases, messages not displaying at all. I worked with one customer who was unable to sync ActiveSync devices because of a combination of physical and logical corruption in their mailboxes. While ISINTEG was certainly useful for recovering from this, starting with Exchange 2010 SP1, ISINTEG was replaced with the New-MailboxRepairRequest Exchange Management Shell cmdlet.
The most commonly used Corruption Types with New-MailboxRepairRequest are:
AggregateCounts
Searchfolder
Folderview
Provisionedfolder
Additional, less used, Corruption Types exist. If a repair request fails using the common Corruption Types, you can attempt a repair using all Corruption Types with the following command.
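A sketch of such a command (mailbox name hypothetical) using the common types; the remaining supported values can be appended to the same comma-separated list:
[PS] C:\> New-MailboxRepairRequest -Mailbox "Alan Reid" -CorruptionType SearchFolder,AggregateCounts,ProvisionedFolder,FolderView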
In some situations, you can also use the CorruptionType of LockedMoveTarget if you encounter a mailbox
move that has become locked.
In Exchange 2010, you monitor the progress of mailbox repair requests in Event Viewer. However, in Exchange
2013/2016, you instead use the Get-MailboxRepairRequest cmdlet. A common example of the command
would be:
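For example (mailbox name hypothetical):
[PS] C:\> Get-MailboxRepairRequest -Mailbox "Alan Reid" | Format-List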
In certain circumstances, even moving the mailbox may not be successful. I’ve encountered scenarios where
the move request would fail because the mailbox kept being quarantined or some other unidentifiable error
occurred. While our options were limited in Exchange 2010, Exchange 2013/2016 gave us a very useful parameter for the New-MoveRequest cmdlet: the ForceOffline parameter. Starting in Exchange 2007 SP3, mailbox moves were an online process where users could still access their mailbox while it was being moved. The ForceOffline parameter forces the move to be an offline process, similar to Exchange 2007 SP2 and older.
I’ve found much success with using this command to overcome mailbox corruption issues during move
requests. Using this switch will prevent a mailbox from becoming quarantined or crashing the store during the
move, as well as keep the user from accessing an already corrupted mailbox. A common example of the
command in action is as follows:
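For example (names and limits hypothetical):
[PS] C:\> New-MoveRequest -Identity "Alan Reid" -TargetDatabase DB02 -ForceOffline -BadItemLimit 100 -AcceptLargeDataLoss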
If all of these recovery actions fail and the mailbox still will not move, the remaining options are to:
Perform an Offline Defrag against the mailbox database holding the corrupted mailbox
Export the contents of the corrupt mailbox to .PST using New-MailboxExportRequest
Export the contents of the corrupt mailbox to .PST using an Outlook client
In each of the last two options, once the data is exported you should disable the mailbox, create a new
mailbox, and import the data using either the New-MailboxImportRequest or an Outlook client.
Public Folders
Historically, the mention of Public Folders to an Exchange Server Support Engineer induced anxiety and potentially a four-letter expletive. This was because Legacy Public Folders (pre-Exchange 2013) utilized a multi-master replication technology enabling multiple copies of public folder content to be replicated (via SMTP-based content replication) across potentially geographically dispersed Exchange Servers. Not only did this allow for data redundancy, but also for users to have speedy access to their local replicas. Unfortunately, this replication technology was also the catalyst of many Public Folder support cases to Microsoft. Common Legacy Public Folder support issues included (with common causes):
Note: For the last bullet, I highly recommend reading this excellent article series on Recovering Public
Folders After Accidental Deletion.
If troubleshooting Public Folder issues during normal operations were no easy task, troubleshooting Legacy
Public Folder Migrations could test the patience and aptitude of any Exchange Professional. At a high level, the
Legacy Public Folder migration process was as follows:
At this point, mailboxes on all Exchange versions (2003/2007/2010) could still access the Legacy Public Folder
content. This is because after a decade of Exchange development, the underlying architecture of Public Folders
had not changed. The only considerable change was that newer Outlook clients (Outlook 2007/2010 on
Exchange 2007/2010 mailboxes) did not require Public Folders for accessing Free/Busy, Out Of Office, or
Offline Address Book information. Instead they relied on new web services introduced in Exchange 2007, such
as the OAB and EWS Virtual Directories.
Data not being replicated to new Public Folder Databases after Replicas are added
o Replication failure (Check Event Logs and Queue Viewer)/Routing Group Connector missing or
misconfigured (recreate RGC)
Mail-Enabled Public Folders routing issues
o Routing Group Connector missing or misconfigured (recreate)/Receive Connector missing or
misconfigured/No valid route to destination/Deny ACL (Suggest running Setup /PrepareAD
again)/ Verify mail flow between versions (port 25 connectivity, Exchange Server Auth on
Receive Connectors, and Windows Integrated Auth enabled on SMTP Virtual Server)
Permissions issues with Routing Group Connector (2003-to-2007/2010)
o Delete and recreate RGC
Issues with legacy mailboxes querying Free/Busy data for new mailboxes
o Verify Public Folder replication/Verify Referrals are enabled on RGC (Set-
RoutingGroupConnector)
Inability to remove old Public Folder Database due to ghosted Replicas
o Verify all content has been replicated to new Public Folder Database (Get-
PublicFolderStatistics)
I won’t spend as much time discussing troubleshooting Legacy-to-Legacy Public Folder migrations, as they should now be much less frequent. The only scenario where I would still expect to see such a migration would be for those few poor souls still on Exchange 2003, who would need to move to Exchange 2010 (via a legacy-to-legacy migration) before moving to 2016 (via a Legacy-to-Modern Public Folder migration). If a customer were still using Legacy Public Folders on Exchange 2007, I would recommend they move to Modern Public Folders on Exchange 2013 and then move to Exchange 2016.
However, here are a few useful commands which can be used to troubleshoot a Legacy-to-Legacy Public
Folder migration:
Get-PublicFolderStatistics - Gathers item counts for Public Folders (suggest viewing on each replica to ensure consistency). Useful for comparing content replicated between the source and destination servers.
Update-PublicFolderHierarchy – Manually replicates the Public Folder hierarchy. Useful when adding a new Public Folder Database and ensuring it has the hierarchy replicated to it.
Additional References:
The big change came with the removal of multi-master SMTP-based replication of Public Folder content. With
Modern Public Folders there is only ever one instance of a folder and its contents in the entire Exchange
organization. Gone also are Public Folder Databases, as all content is now stored in Public Folder Mailboxes,
which are effectively system mailboxes used for serving Public Folder hierarchy and content to clients. Public
Folder high availability is now achieved through DAG replication. So gone is much of the complexity around
both Public Folder operations and migrations.
However, that’s not to say there aren’t caveats when migrating to and managing Modern Public Folders. The biggest challenge is the fact that there is no coexistence between Legacy and Modern Public Folders. It is not supported (though technically possible) to have Legacy and Modern Public Folders accessed at the same time.
Note: It is technically possible for Exchange 2013/2016 mailboxes to access Modern Public Folders while Exchange 2007/2010 mailboxes in the same Exchange Organization are accessing Legacy Public Folders. This is done by unlocking the Legacy Public Folders once the migration has been completed. While all access and production activity still happens on Modern Public Folders, this is not supported. No changes on the Legacy Public Folders will replicate to Modern Public Folders or vice versa. The only use case I’ve encountered was a customer who prematurely completed the Modern Public Folder migration and allowed it to run in production for several weeks before realizing several key folders were missing. They used the above method and an Exchange 2007 mailbox to extract the contents from the Legacy Public Folders to .PST via Outlook.
Since coexistence between Legacy and Modern Public Folders is not supported, there will come a time when a cutover must be performed. This is done after all mailboxes are migrated to Exchange 2013/2016, since Exchange 2007/2010 mailboxes are unable to access Modern Public Folders. Fortunately, Exchange 2013/2016 mailboxes can access Legacy Public Folders, so general Exchange coexistence (Exchange 2007 and 2013, as an example) is possible for as long as needed. The overview of a Legacy-to-Modern Public Folder migration is as follows:
Implement Exchange 2013/2016 into an existing Exchange environment which contains Legacy Public
Folders which the customer wishes to migrate
o You may consider SharePoint or Shared Mailboxes as an acceptable alternative to Public
Folders, and therefore would not require a migration to Modern Public Folders
Implement required steps allowing Exchange 2013/2016 to access Legacy Public Folders
o On-Premises Legacy Public Folder Coexistence for Exchange 2013 Cumulative Update 7 and
Beyond (A must read if you want successful access to Legacy Public Folders)
Move namespaces to the new version of Exchange
o Outlook Anywhere clients will connect to the new version of Exchange before being proxied
to the legacy version
o Outlook Anywhere configuration must be modified for 2013/2016 mailboxes to access Legacy
Public Folders (Reference – Common issue, a must read)
Migrate all mailboxes to the new version of Exchange
o This phase may last many months depending on the migration project timeline
Migrate Legacy Public Folder content to Modern Public Folders
o Data will be synchronized in the background until the cutover is ready to occur
Decommission Legacy Public Folder databases
o Remove Public Folder Databases
Note: If the prerequisites for accessing Legacy Public Folders are not configured in an Exchange 2013 CU7 or newer environment as stated above, Outlook clients will be unable to open Public Folders.
Additional References:
When Exchange 2013 first released, the mechanism to actually move the content from Legacy to Modern
Public Folders was referred to as a Serial Migration. This migration method is deprecated and no longer
supported by Microsoft. The new preferred method is referred to as the Batch Migration method:
Use batch migration to migrate public folders to Exchange 2013 from previous versions
Use batch migration to migrate public folders to Exchange 2016 from previous versions
The Exchange 2013 and 2016 versions of these articles are virtually the same. Each involves the following steps:
In my experience, most issues experienced during these phases are simply syntax issues with the commands, which can easily be remedied by re-reading the TechNet articles and practicing in a lab. Other issues revolve around the migration failing or being stuck in a particular status. Using the methods previously mentioned in this chapter for viewing Migration Batches is recommended for initial analysis. However, I’ve found that at times restarting the Microsoft Exchange Mailbox Replication Service on the target server, or restarting the Microsoft Exchange Information Store service on the source legacy server, will unlock the batch and allow it to complete. Also, be sure to verify Active Directory replication in a large environment.
I claim no inside knowledge of the Exchange and Outlook team’s product testing. Yet I feel they do not
possess the resources or time to test a new Cumulative Update against every single past Exchange update
version as well as every past Outlook update version, in every possible coexistence scenario. In my above
example, the end user is going to have a much better experience if the Exchange 2007 server were running the
latest Update Rollup (19 as of this writing) and their Outlook 2007 client were at the latest update (February
2015 as of this writing). Of course, ideally they would be running Outlook 2010/2013/2016 instead, all of which have updates as recent as January 2016 (as of this writing).
I’ve been personally told by Exchange Product Team members of several coexistence issues which were
resolved simply by updating Outlook. There are many moving parts to an Outlook client connecting to
Exchange, being proxied to a legacy version, and accessing a legacy resource. There are also new features
constantly being released to the Office suite. All of these result in bugs that may pop up along the way. You can either choose to never update any software product in an Exchange environment (leaving you in an unsupported state where the business might be impacted) or accept that there is a constantly moving window of updates which your clients must all be within to properly operate with each other. In addition to Microsoft’s
guidance, I’ve experienced several issues myself in coexistence which were resolved either by an Outlook or
Exchange update:
Poor performance when being proxied from a newer version of Exchange to a legacy version
Poor performance when accessing a legacy Public Folder or shared/delegated mailbox on a legacy
Exchange Server
Inability to modify items or calendar of a mailbox you’ve been given permissions to
Connection failures or random disconnects in Outlook when accessing a legacy resource
Connectivity failures when using MAPI/HTTP with older (but supported) Outlook updates
Issues accessing an Office 365 hosted Archive Mailbox
It’s important to stress that while my example used Outlook 2007, similar issues can occur with Outlook
2010/2013/2016 on older update versions. Outlook 2013 has experienced over 30 code revisions since its
release, many to resolve coexistence or compatibility issues. In one of my examples above, the scenario I
witnessed with a customer was resolved by updating Outlook 2013 to a version that was only 6 months newer.
I’ve found it’s a good rule of thumb to use the N-2 rule not only for Exchange Server update cadence, but also
for Outlook updates.
Additional References:
Users of Exchange Server 2013 or later or Exchange Online can't open public folders or shared
mailboxes on a legacy Exchange server
o Legacy Exchange versions do not support Anonymous authentication, so the Outlook Anywhere settings must be modified if shared resources or Legacy Public Folders are to be accessed on the legacy Exchange Servers
Outlook Anywhere users prompted for credentials when they try to connect to Exchange Server 2013
or Exchange Server 2016
o A bug in Windows Server 2008 R2 causes proxied sessions to have authentication failures. Therefore, when the newer Exchange version proxies sessions to the legacy Exchange versions, users will be repeatedly prompted for authentication. A Windows Server 2008 R2 update resolves the issue.
Speaking of Shared Mailboxes and delegated mailboxes (mailboxes which another mailbox has been assigned permissions to), care must be taken in environments where a large number of such mailboxes exist. This is a great example of how something being supported does not necessarily ensure a great end-user experience. It is my opinion, and the opinion of many Exchange Consultants, that these mailboxes and the users accessing them should be kept on the same Exchange version. Meaning that if they all currently exist on Exchange 2007 and you’re migrating to Exchange 2013, they should all be moved at the same time to ensure the best possible user experience. If a Secretary or Assistant manages 10 individuals’ mailboxes, then all 11 mailboxes should be moved at the same time. I’ve worked several escalations where an Assistant could not properly manage an Executive’s calendar because the Executive’s mailbox had been moved to the newer Exchange version but their own mailbox was still on the legacy Exchange version. While you could certainly spend hours, days, or even weeks on the phone with Microsoft trying to resolve the issue (since it may be supported), in almost every case it’s much simpler to just move the Assistant to the same server version as the Executive. I would even say moving them to the same Mailbox Database would be ideal, as it may slightly improve performance of the repeated access.
Outlook logon fails after mailbox moves from Exchange 2010 to Exchange 2013 or Exchange 2016
The symptom is that Outlook clients could fail to connect after the mailbox is moved to Exchange 2013/2016.
This is due to the AutoDiscover cache located in the AutoDiscover Application Pool on each Exchange Server
having outdated data which still references the mailbox being located on the legacy Exchange Server. Upon
opening Outlook after the move completes, AutoDiscover will issue an HTTP 302 response, which is a redirect
resulting in a loop. This can be seen in the Log tab of the Test E-mail AutoConfiguration tool within Outlook.
The workaround is to restart the AutoDiscover Application Pool on each Exchange Server where the Outlook
client could potentially retrieve AutoDiscover information from, immediately following the completion of a
mailbox migration to Exchange 2013/2016.
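A rough sketch of such a restart across servers (server names hypothetical; Restart-WebAppPool comes from the WebAdministration module):
[PS] C:\> Invoke-Command -ComputerName EX01,EX02 -ScriptBlock { Import-Module WebAdministration; Restart-WebAppPool MSExchangeAutodiscoverAppPool }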
While it’s uncertain whether this behavior will change in future Exchange Server updates, for the time being, to prevent a negative user experience during a migration you should plan an Active Directory site-wide AutoDiscover Application Pool restart after the completion of any move requests to Exchange 2013/2016. Of course, the effect of this behavior may differ drastically between environments, so I would recommend testing the post-move behavior using test mailboxes early in the migration.
Additional reading
Migration Overview
Negotiate Authentication
Remote Mailbox Moves
Linked Mailboxes
Cross-Forest Mail Flow
Cross-Forest Connectors
Mailbox Replication Service Proxy (MRSProxy)
Exchange Hybrid Configuration
A user can't access a mailbox by using Outlook after a remote mailbox move from an on-premises
Exchange Server environment to Office 365
Do not convert synced mailboxes to shared in a hybrid environment
Cutover Migration
Staged Migration
IMAP Migration
Exchange Remote Move (MRSProxy)
Moving Mailboxes
Preferred Architecture
Server Role Requirements Calculator
Move Requests
Exchange Mailbox Dumpster
New-MoveRequest
Throttling the Mailbox Replication Service
Public Folders
Recovering Public Folders After Accidental Deletion
Get-PublicFolderStatistics
Get-PublicFolder
Update-PublicFolderHierarchy
Update-PublicFolder
Public Folder replication troubleshooter
Public Folder Hierarchy Replication Problems
Managing Exchange Public Folder Permissions
Exchange 2010 FAQ: How Do I Migrate Public Folders to Exchange Server 2010?
Public Folder Replication – Troubleshooting Basics
Troubleshooting the Replication of New Changes
Troubleshooting the Replication of Existing Data
Troubleshooting Replica Deletion and Common Problems
Exchange Server 2007/2010 tips
On-Premises Legacy Public Folder Coexistence for Exchange 2013 Cumulative Update 7 and Beyond
Remove Public Folder Databases
Exchange Server 2010 to 2013 Migration – Moving Public Folders
Legacy Public Folders to Exchange 2013 migration tips
Serial Migration
Use batch migration to migrate public folders to Exchange 2013 from previous versions
Use batch migration to migrate public folders to Exchange 2016 from previous versions
Outlook logon fails after mailbox moves from Exchange 2010 to Exchange 2013 or Exchange 2016
HTTP 302 Response
Exchange Management Shell and Mailbox Anchoring
Mailbox Anchoring affecting new deployments & upgrades
Throughout this book we've discussed many scenarios that are security-related, such as user authentication,
Active Directory permissions, mailbox permissions, and more. This chapter doesn't deal specifically with
troubleshooting of security problems. Rather, it provides you with knowledge of security-related tools in
Exchange that you can use in a wide variety of situations.
Think about this chapter as more about answering the security question, "Who did that?", rather than security
in terms of granting access to things.
For example:
An email message from a customer was never responded to, and the manager of the customer service
team wants to know which person in the team moved or deleted the message from the shared
mailbox.
Information sent to an executive via email has leaked to the press or to a competitor and there is an
investigation to determine which of the executive’s delegates accessed the message.
In these situations, it is assumed that a delegate, or even a team of people, already have full or read-only
access to the mailbox. Based on that assumption the focus is now on which of those people took action with
specific items.
Exchange 2010 and later versions can log access to mailboxes by the owner, delegates, and administrators,
using a feature called mailbox audit logging. When audit logging is enabled for a mailbox, audit log entries
are stored in the Recoverable Items folder of the mailbox, which is not visible to the mailbox user via Outlook
or other client interfaces.
Log entries are written for actions taken by the mailbox owner, delegates, or by administrators, depending on
the audit logging configuration applied to the mailbox. The mailbox audit log entries are then retained for a
configurable period of time, allowing administrators to perform audit log searches to determine who took an
action on a mailbox.
A default mailbox audit logging configuration for an Exchange mailbox looks like this.
AuditEnabled : False
AuditLogAgeLimit : 90.00:00:00
AuditAdmin : {Update, Move, MoveToDeletedItems, SoftDelete, HardDelete, FolderBind, SendAs,
SendOnBehalf, Create}
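This configuration can be viewed for any mailbox with a command along these lines (mailbox name hypothetical); the wildcard also returns the AuditDelegate and AuditOwner action lists discussed below:
[PS] C:\> Get-Mailbox "Alan Reid" | Format-List Audit*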
Mailbox audit logging is disabled. This means that if you do not enable it for your mailbox users, you
will not have access to audit log information when it comes to troubleshooting scenarios.
Audit log entries are retained for 90 days. While this will be adequate for the majority of organizations, you can increase or decrease the retention period to suit your needs. Mailbox audit logs retained for 90 days add between 2-5% to the overall size of the mailbox, based on an analysis I performed of the impact of audit logging in multiple environments.
No owner actions are logged. Obviously owners are taking action on their own mailbox constantly,
and auditing everything they do would cause a massive amount of audit logging to be generated.
However, there are some owner actions such as deletes that can be useful to capture.
Some delegate and administrator actions are logged. For the most part the defaults will be sufficient,
but additional actions such as FolderBind are useful if there is a concern about delegates snooping
around mailbox folders they're not supposed to be looking in.
Note: The AuditAdmin settings refer to access via mechanisms such as eDiscovery searches,
mailbox import/export operations, or tools such as MFCMAPI. If an administrator is granted
permission to a mailbox and accesses it then those actions will be logged according to the
AuditDelegate settings.
The full list of actions that mailbox audit logging can capture are:
Copy
Create
FolderBind
HardDelete
MailboxLogin
MessageBind
Move
MoveToDeletedItems
SendAs
SendOnBehalf
SoftDelete
Update
To add more actions to an existing mailbox audit logging configuration, we use Set-Mailbox again. In this
example, HardDelete, SoftDelete, and MoveToDeletedItems actions are added to the owner auditing of the
mailbox of Alan Reid.
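A sketch of that command, assuming the standard multi-valued property syntax:
[PS] C:\> Set-Mailbox "Alan Reid" -AuditEnabled $true -AuditOwner @{Add="HardDelete","SoftDelete","MoveToDeletedItems"}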
Using the Search-MailboxAuditLog cmdlet we can search the "Help Desk" mailbox for actions taken by
delegates between two dates.
[PS] C:\> Search-MailboxAuditLog -Identity "Help Desk" -LogonTypes Delegate -StartDate 1/14/2014 -EndDate 1/15/2014 -ShowDetails
RunspaceId : d8142847-166a-488a-b668-f7b84c3f3ceb
Operation : SendAs
OperationResult : Succeeded
LogonType : Delegate
ExternalAccess : False
DestFolderId :
DestFolderPathName :
FolderId :
FolderPathName :
ClientInfoString : Client=MSExchangeRPC
ClientIPAddress : 192.168.0.181
ClientMachineName :
ClientProcessName : OUTLOOK.EXE
ClientVersion : 15.0.4551.1004
InternalLogonType : Owner
MailboxOwnerUPN : [email protected]
MailboxOwnerSid : S-1-5-21-2175008225-1847283934-4039955522-1471
DestMailboxOwnerUPN :
DestMailboxOwnerSid :
DestMailboxGuid :
CrossMailboxOperation :
LogonUserDisplayName : Sarah Jones
LogonUserSid : S-1-5-21-2175008225-1847283934-4039955522-1471
SourceItems : {}
SourceFolders : {}
SourceItemIdsList :
SourceItemSubjectsList :
SourceItemFolderPathNamesList :
SourceFolderIdsList :
SourceFolderPathNamesList :
ItemId :
ItemSubject : Wheeee!
DirtyProperties :
OriginatingServer : E15MB1 (15.00.0775.022)
MailboxGuid : a0f10db1-5268-47a5-8f71-d1e65f55c653
MailboxResolvedOwnerName : Help Desk
LastAccessed : 14/01/2014 9:31:07 PM
Identity :
RgAAAAD2fF/dZobvQoWbbV7P6N7eBwD7Y5OF+DDRQZRz1a4+yUyzAABaldDBAAD7Y5OF+DDRQZRz1a4+yUyzAAB
aldDCAAAJ
IsValid : True
ObjectState : New
In the output above, we can see that the user "Sarah Jones" performed a successful SendAs action on the "Help
Desk" mailbox, with the inappropriate subject line of "Wheeee!". Management can now take the appropriate
action to deal with the situation.
Real World: Running frequent mailbox audit log searches can become tedious. When there is a regular
need to review mailbox audit logs, consider using my Get-MailboxAuditLoggingReport.ps1 PowerShell
script to speed up the process. You can even automate regular reports using a scheduled task.
Administrator audit logging captures all changes made by administrators using the Exchange management
tools (PowerShell cmdlets, or the Exchange Admin Center). Only commands that make changes, for example
Remove-Mailbox, are logged by administrator audit logging, whereas commands that do not affect data, such
as Get-Mailbox, are not logged.
Administrator audit logging can be disabled, or its configuration can be modified to limit the cmdlets or
parameters that are audited, or to change the log retention period. For this reason, you should limit the ability
of administrators in your organization to modify the administrator audit log settings. By default, this right is
granted to members of Organization Management and Records Management. I recommend you review the
membership of the Organization Management and Records Management role groups to ensure that only the
most trusted administrators are members of those groups.
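You can review the current administrator audit log configuration with Get-AdminAuditLogConfig, for example:
[PS] C:\>Get-AdminAuditLogConfig | Format-List AdminAuditLogEnabled,AdminAuditLogCmdlets,AdminAuditLogParameters,AdminAuditLogAgeLimit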
Note: Any changes made to the administrator audit log configuration are logged in the administrator
audit logs, regardless of whether admin audit logging is enabled or disabled. So in theory you should
see evidence of any tampering that has occurred.
You can also combine multiple parameters in a single search, for example to find uses of a specific cmdlet by a
specific administrator, as sketched below.
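A sketch of such a search, using the cmdlet name and caller that appear in the output that follows:
[PS] C:\>Search-AdminAuditLog -Cmdlets Add-MailboxPermission -UserIds Bob.Helpdesk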
RunspaceId : f6553abe-9d57-40bc-8e43-dc919bea2b50
ObjectModified : exchange2013demo.com/ExchangeUsers/Alannah.Shaw
CmdletName : Add-MailboxPermission
CmdletParameters : {User, AccessRights, Identity}
ModifiedProperties : {}
Caller : exchange2013demo.com/Users/Bob.Helpdesk
ExternalAccess : False
Succeeded : True
Error :
RunDate : 22/09/2015 4:35:33 PM
OriginatingServer : SYDEX2 (15.00.1076.011)
Identity : AAMkADI1NGQyZjhiLTFkYTAtNDhmYy05OTBiLTU4MGZlODY0MDQ3NgBGAAAAAAAkqZy/nl4
jSa4VBIka73bMBwCEoBRTwPA6QKt9HgzDn/p6AAAAAAEYAACEoBRTwPA6QK
t9HgzDn/p6AACUBtrhAAA=
IsValid : True
ObjectState : New
Another approach for the same scenario would be to look for modifications to the object “Alannah.Shaw” by
using the -ObjectIds parameter. In this example we see exactly the same result, but you can imagine that other
modifications may have been made to the same object and that multiple log entries would appear in many
real world environments.
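A sketch of that search, using the object name seen in the output:
[PS] C:\>Search-AdminAuditLog -ObjectIds Alannah.Shaw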
RunspaceId : f6553abe-9d57-40bc-8e43-dc919bea2b50
ObjectModified : exchange2013demo.com/ExchangeUsers/Alannah.Shaw
CmdletName : Add-MailboxPermission
CmdletParameters : {User, AccessRights, Identity}
ModifiedProperties : {}
Caller : exchange2013demo.com/Users/Bob.Helpdesk
ExternalAccess : False
Succeeded : True
Error :
RunDate : 22/09/2015 4:35:33 PM
OriginatingServer : SYDEX2 (15.00.1076.011)
Identity : AAMkADI1NGQyZjhiLTFkYTAtNDhmYy05OTBiLTU4MGZlODY0MDQ3NgBGAAAAAAAkqZy/nl
4jSa4VBIka73bMBwCEoBRTwPA6QKt9HgzDn/p6AAAAAAEYAACEoBRTwP
A6QKt9HgzDn/p6AACUBtrhAAA=
IsValid : True
ObjectState : New
Searches can be limited to specific date ranges. Here’s how to search for modifications made by
“Bob.Helpdesk” in the last 30 days.
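A sketch of that search, assuming the results are captured in a variable so that they can be examined further in
the examples that follow:
[PS] C:\>$logentries = Search-AdminAuditLog -UserIds Bob.Helpdesk -StartDate (Get-Date).AddDays(-30) -EndDate (Get-Date)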
If Bob has been doing his job, it's likely that a lot of results will be returned by that search, most of which
should be perfectly normal for a person such as Bob on the help desk. To focus the search a little more, we
can look at just the objects that Bob has modified in the last 30 days.
[PS] C:\>$logentries.ObjectModified
SYDEX2
SYDEX2\mapi (Default Web Site)
SYDEX2\OAB (Default Web Site)
SYDEX2\EWS (Default Web Site)
SYDEX2\Microsoft-Server-ActiveSync (Default Web Site)
SYDEX2\ecp (Default Web Site)
SYDEX2\owa (Default Web Site)
Looks like Bob has been messing with virtual directories. Let’s make it even more useful and look at the time
stamp, cmdlet, and objects modified by Bob in the last 30 days.
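A sketch of that command, reusing the $logentries variable and selecting properties seen in the log entries shown
earlier:
[PS] C:\>$logentries | Select-Object RunDate,CmdletName,ObjectModified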
As you can see, administrator audit logging contains a lot of valuable information to help you identify who has
been making changes in your Exchange organization, which could help you in many different troubleshooting
scenarios. You can also see why it is important to limit administrative rights to the minimum that each IT
team member needs to do their job.
Additional reading
Mailbox Audit Logging
Understanding Role Based Access Control
All of us have worked with front-line Support Agents. The good agents are those who ask the right questions,
gather the right data, communicate effectively, and don’t make the situation worse than when the problem
was discovered. Hopefully, by reading this book, you can gain some insight into the skills required to master
each phase that leads to a successful resolution of a problem, and maybe even pick up some skills that help in
situations you cannot resolve. By acquiring knowledge about a problem and some inkling of its root cause, you
will be in a better position to hand the issue over to Microsoft support or whomever takes over the case.
With that thought in mind, we bid you good luck in your troubleshooting efforts. Be patient, be effective, and be
positive. Also, as previously mentioned in the Introduction, don't be a Troubleblaster. Please!
We’d love to get your feedback on the book, including any topics you feel we should cover. Please send all
feedback to [email protected].
Helpful companion material for Exchange and Office 365 can be found at
http://exchangeserverpro.com/ebooks/.