Surviving Microsoft Active Directory Failures With Rubrik
OVERVIEW
Technology Overview
Introduction to AD Replication
Time Synchronization
Prerequisites
DCDiag
RepAdmin
CONCLUSION
VERSION HISTORY
OVERVIEW
This document walks through the process of recovering either a failed individual Active Directory (AD) Domain
Controller (DC) or an entire domain in the context of a Disaster Recovery Plan (DRP) with Rubrik.
Although Microsoft AD Domain Services (DS) is a distributed service that replicates data at the application
level, it is sometimes better to recover a failed DC than to deploy a new Windows Server machine and promote
it to a new DC in the same domain. For instance, in a DRP where the recovery site doesn’t normally host any
running workloads, it’s not possible to rely on native AD replication with a couple of DCs already running on
the DR site. Moreover, in the event of a total failure of the primary site, you would have to seize the Flexible
Single Master Operations (FSMO) roles to get a fully functional directory back up and running. This becomes an
even greater challenge when considering failback or test failover operations.
Throughout this document, users will be provided with steps to properly and safely recover Microsoft AD DCs
with Rubrik.
TECHNOLOGY OVERVIEW
Before describing how to actually recover AD DCs with Rubrik, it’s important to understand some of the key
concepts and technologies that are critical for Active Directory to work.
Introduction to AD Replication
Active Directory Domain Services are Microsoft’s multi-master directory services. As per Microsoft, AD DS
provides “secure, structured, hierarchical data storage for objects in a network such as users, computers,
printers and services. Active Directory Domain Services provide support for locating and working with
these objects.”
This multi-master, distributed architecture allows you to create or modify objects from any writable DC in the
forest or domain. These changes are then replicated to all other DCs using one of the two replication methods
available in Windows Server: the File Replication Service (NTFRS) or Distributed File System Replication (DFS-R).
Understanding which replication technology is utilized in your environment is critical to ensure you will take the
right steps to recover AD DCs in the event of a failure.
Keep in mind that if the functional level of your domain is Windows Server 2003, then NTFRS is used to replicate
AD data. The use of DFS-R for the same purpose was introduced with Windows Server 2008. But having AD
DCs running only Windows Server 2008 or later versions, and a domain functional level of Windows Server
2008 or above, does not guarantee that DFS-R is the replication technology in use in your AD environment.
Specifically, if your AD was created with versions of Windows prior to Windows Server 2008 and then upgraded,
there is a chance that NTFRS is still the underlying replication method. In fact, migrating from NTFRS to DFS-R
for AD replication is not done automatically. It has to be planned carefully and done manually.
To check, run the dfsrmig /getglobalstate command on a DC. If the result says the global state is “Eliminated”,
then DFS-R is utilized for AD replication. If the result is anything different, then it likely means that the
migration process has been started, but is not complete yet. To learn more about how to migrate from NTFRS to
DFS-R, read Microsoft’s SYSVOL Replication Migration Guide.
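As a minimal sketch, the check above can be automated by interpreting the text printed by dfsrmig /getglobalstate. The function name and the sample output string are assumptions made for illustration; verify the exact wording printed by the tool in your environment before relying on it.

```python
import re

# Migration states that dfsrmig can report, in order of progression.
DFSR_STATES = ("Start", "Prepared", "Redirected", "Eliminated")


def sysvol_replication_status(dfsrmig_output: str) -> str:
    """Summarize the replication method from `dfsrmig /getglobalstate` output."""
    match = re.search(r"global state:\s*'?(\w+)'?", dfsrmig_output, re.IGNORECASE)
    if match is None:
        return "unknown"
    state = match.group(1).capitalize()
    if state == "Eliminated":
        return "DFS-R"
    if state in DFSR_STATES:
        # Any other known state means the NTFRS to DFS-R migration is not finished.
        return f"migration not complete ({state})"
    return "unknown"


# Assumed sample of the tool's output, for illustration only.
sample = "Current DFSR global state: 'Eliminated'\nSucceeded."
print(sysvol_replication_status(sample))
```

Returning "unknown" rather than guessing keeps the helper safe to use when the output format differs from the assumed one.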
You can also check in the Active Directory Users and Computers console. Make sure you enable Advanced Features
from the console’s View menu and expand your domain. You will then see the System container.
Expand the System container, then look for either the File Replication Service sub-container and
expand it to display what’s in the Domain System Volume (SYSVOL share) sub-container, or the
DFSR-GlobalSettings sub-container and expand it to display the content of the Topology sub-container.
Time Synchronization
One of the key success factors of AD replication is time synchronization between DCs. To ensure smooth
recovery of your AD DCs, follow these best practices:
• On the DC hosting the PDC Emulator FSMO role in the root domain of your forest, configure the Windows
Time Service to synchronize with an external time source using NTP. Make sure UDP port 123 is not blocked
anywhere, then open a CMD prompt and use the w32tm /config command with the /manualpeerlist parameter
to set that time source.
• On the DC hosting the PDC Emulator FSMO role in each of the child domains of your forest, repeat the
procedure described above, but set the /manualpeerlist parameter with the FQDN of the PDC Emulator
of the root domain.
• On all other DCs in the forest, open a CMD prompt and configure the Windows Time Service to synchronize
from the domain hierarchy.
If there are too many DCs in your environment to do this manually, or if you also need to configure all other
domain members, using Group Policies is easier.
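The best practices above can be sketched as a small helper that builds the w32tm command lines for each kind of DC. This is an illustration under assumptions, not the exact command listing of this guide: the role names, the function name and the choice of flags (/syncfromflags, /reliable, /update, followed by a resync) are ours, and the NTP server names in the usage line are placeholders.

```python
from typing import List, Optional, Sequence


def time_sync_commands(role: str, peers: Optional[Sequence[str]] = None) -> List[str]:
    """Return w32tm command lines for a DC, depending on its (hypothetical) role label.

    role:  "root_pdc", "child_pdc" or "other_dc".
    peers: external NTP servers for the root PDC Emulator, or the FQDN of the
           root PDC Emulator for a child-domain PDC Emulator.
    """
    if role in ("root_pdc", "child_pdc"):
        if not peers:
            raise ValueError("a PDC Emulator needs at least one time source")
        peer_list = " ".join(peers)
        config = (
            f'w32tm /config /manualpeerlist:"{peer_list}" '
            "/syncfromflags:manual /reliable:yes /update"
        )
    elif role == "other_dc":
        # All other DCs follow the domain hierarchy instead of a manual peer list.
        config = "w32tm /config /syncfromflags:domhier /update"
    else:
        raise ValueError(f"unknown role: {role}")
    # Ask the Windows Time Service to rediscover its source and resynchronize.
    return [config, "w32tm /resync /rediscover"]


# Placeholder NTP servers, for illustration only.
for line in time_sync_commands("root_pdc", ["0.pool.ntp.org", "1.pool.ntp.org"]):
    print(line)
```

Generating the commands rather than running them keeps the sketch safe to try on any machine; review the output before pasting it into a CMD prompt on a DC.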
A few w32tm commands, such as w32tm /query /source and w32tm /query /status, help you verify the time source of
any given domain member.
To learn more about time synchronization for Active Directory, read the corresponding article on
Microsoft TechNet.
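To illustrate how the time source of a domain member can be interpreted, here is a short sketch that classifies the single line printed by w32tm /query /source. The function name is hypothetical, and the two "internal clock" strings are the values we assume the Windows Time Service reports when it is not synchronizing with any peer.

```python
# Values assumed to be reported when the machine is not synchronizing with a peer.
UNSYNCED_SOURCES = ("local cmos clock", "free-running system clock")


def describe_time_source(query_output: str) -> str:
    """Classify the output of `w32tm /query /source`."""
    source = query_output.strip()
    if not source:
        return "no source reported"
    if source.lower() in UNSYNCED_SOURCES:
        return "not synchronized (internal clock)"
    # Manually configured NTP peers may carry flags, e.g. "time.example.com,0x8".
    host = source.split(",")[0]
    return f"synchronized from {host}"


print(describe_time_source("dc01.corp.example.com"))
print(describe_time_source("Local CMOS Clock"))
```

On a healthy domain member, the reported source should be a DC of the domain; on the root PDC Emulator, it should be the external NTP server.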
To make the recovery process of AD DCs easier and faster, all DCs that are also DNS servers should have more
than one DNS server configured in their TCP/IP parameters.
• The Rubrik Backup Service (RBS) is installed on each AD DC and is registered with the corresponding
cluster. RBS is a lightweight agent that triggers Microsoft’s VSS framework to ensure the consistency of
VSS-compatible applications.
• For physical AD DCs, a new Windows Host must be created for each of them in the Rubrik cluster.
• Application-consistent backups of AD DCs are taken. This is achieved with the help of RBS with Rubrik
4.2.1 and later versions. If the AD DC is virtual, then the Application Consistency option of the VM in
Rubrik must be set to “Automatic”.
Rubrik has offered support for backing up and restoring VMware vSphere virtual machines since version 1.0, and
introduced support for Microsoft Hyper-V VMs and Nutanix AHV VMs as of version 4.0. When recovering a VM,
Rubrik provides four main options:
• Instant Recovery – The Instant Recovery option intentionally powers off and renames the production
VM while re-creating it from the Rubrik backups. Upon completion, a VM with the exact same name and
parameters as the original will be created, connected to the network and powered on.
• Export – The Export option, like the Instant Recovery option, creates a new VM object from the Rubrik
backups. However, it presents the ability to select the same or different hypervisor host and datastore, or
volume, as targets.
• In-Place Recovery – This recovery model leverages VMware’s Changed Block Tracking to recover only
the blocks that have changed within an existing virtual machine, which makes it the fastest of these
recovery options.
• Live Mount – Rubrik’s Live Mount option provides near-instantaneous recovery of a VM by publishing
the selected point-in-time snapshot of that VM as an NFS or SMBv3 share (dependent on whether the
hypervisor is vSphere, AHV or Hyper-V), to a hypervisor and powering on the VM directly from the Rubrik
cluster. This reduces the Recovery Time Objective (RTO) drastically. Rubrik can also live mount
individual VMware virtual disks and physical volumes, as well as SQL and Oracle databases.
Rubrik introduced support for Windows Full Volume Protection with version 4.2. This allows you to perform
block-based backups of entire Windows volumes, in turn enabling full volume or entire computer recovery to
similar hardware. In order to do Full Volume Protection backups, the Rubrik Backup Service must be installed on
the physical host.
The first step to recovering an entire computer is to download the Windows Recovery Tools from the Rubrik
Support Portal, as well as the Windows Assessment and Deployment Kit, to a machine running Windows Server
2012 R2 or later. This set of tools is used to create a bootable WinPE ISO file. Burn it to a CD/DVD or USB
stick and boot the target server from it.
In parallel, live mount the volumes that you need to restore from the Rubrik user interface.
• Run the RubrikBMR PowerShell script provided with the Windows Recovery Tools
• Follow the steps given by the script to recover each volume. You will find the SMB path to the live-
mounted volumes in the Rubrik UI
• When all volumes have been restored, remove the WinPE media and restart the server
For the details of the physical Windows recovery process, please refer to the Full Volume Protection for
Windows chapter of the Rubrik User Guide.
Although it is very common to host other infrastructure-related roles, such as DNS or DHCP, on the same
servers, it is recommended to avoid deploying any other roles or applications on AD DCs. The logic for this is
twofold. Firstly, because AD is your authentication platform, you should minimize any attack surface that
these other services and applications might add. Secondly, when a server retains only Active Directory
functions, the DC itself becomes more disposable, as the AD data can simply be replicated back from
surviving DCs in a disaster. Conversely, when a DC is also a file server, an antivirus server, a WSUS server or
even a backup software’s master server, it becomes critical to be able to recover this machine when it fails.
Luckily, Microsoft introduced the Volume Shadow Copy Service (VSS) with Windows Server 2003 and Windows
XP, along with the NTDS writer, which is the AD VSS Writer. It is installed when promoting a Windows Server
machine to a Domain Controller with the help of the dcpromo command. This writer is invoked when a
VSS-compatible backup application wants to prepare the operating system and the VSS-enabled applications,
such as Active Directory, for an online backup. The application writers are responsible for freezing the
application and queueing incoming I/Os to allow the VSS service to create an application-consistent volume
shadow copy. VSS makes AD backup and recovery possible and the deploy-and-dcpromo method a thing of the past!
To guarantee the application consistency of AD DC backups with Rubrik, please review the prerequisites.
Once the server itself has been restored, we recommend unplugging it from the network. In the case of a VM,
restore operations such as Export or Live Mount automatically restore VMs disconnected from the network. For
physical servers, a functioning network connection is required during the recovery process to retrieve the data
from the live-mounted volumes on Rubrik via SMB. Once this process is finished, physically unplug the Ethernet
cable from the server when possible, or disable the NIC from Windows.
When only one individual DC fails in an existing domain, and other DCs are running and providing service
normally, it is important to make sure to perform a non-authoritative restore of the failed DC, or more precisely
of its SYSVOL folder. The way to achieve this depends on what AD replication method is utilized in your domain.
To know which one is used, refer to the Introduction to AD Replication section of this document.
Whether the restoration of SYSVOL is authoritative or non-authoritative, the target Windows Server machine
will have to be booted in the Directory Services Recovery Mode (DSRM). Make sure you have the DSRM admin
password with you. You will not be able to log in with any domain user account because the directory services
will be stopped. An easy way to configure Windows to start in DSRM is to use the msconfig tool and select
“Safe boot” > “Active Directory repair”.
Because Rubrik does either image-level or volume-level backups of AD DCs, the recovery process is fairly
simple. It is usually recommended to recover from the latest backup, but if you decide or have to restore from an
older backup, make sure its age does not exceed the Tombstone Lifetime (60 or 180 days by default, depending on
the Windows Server version with which the forest was created).
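The Tombstone Lifetime rule above is simple date arithmetic, sketched below. The function name is hypothetical, and the lifetime is deliberately a required argument: read the actual value configured in your forest rather than assuming a default.

```python
from datetime import datetime, timedelta


def backup_within_tombstone_lifetime(
    backup_time: datetime, now: datetime, tombstone_lifetime_days: int
) -> bool:
    """Return True if the backup's age does not exceed the Tombstone Lifetime."""
    age = now - backup_time
    return age <= timedelta(days=tombstone_lifetime_days)


now = datetime(2023, 9, 13, 12, 0)
recent = datetime(2023, 9, 1, 2, 0)   # 12 days old
stale = datetime(2023, 6, 1, 2, 0)    # more than 100 days old
print(backup_within_tombstone_lifetime(recent, now, 60))
print(backup_within_tombstone_lifetime(stale, now, 60))
```

A backup that fails this check should not be used to restore a DC into a domain where other DCs are still running.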
• Run msconfig again and choose to restart in the normal mode by unselecting the Safe boot option.
• Allow the restored server time to finish starting Windows and to contact its replication partners.
• Change the value of the SYSVOL string to “non-authoritative”. If it does not exist, create it.
• Run msconfig again and choose to restart in the normal mode by unselecting the Safe boot option.
• Allow the restored server time to finish starting Windows and to contact its replication partners.
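As a hedged sketch, the registry setting that marks a SYSVOL restore as authoritative or non-authoritative can be looked up per replication method. The structure and function names are hypothetical. The NTFRS BurFlags values (D2 for non-authoritative, D4 for authoritative) and the DFS-R registry key path are not taken from this guide; they are assumptions based on Microsoft’s documentation and should be verified there before being applied to a DC.

```python
from typing import NamedTuple


class RegistryValue(NamedTuple):
    key: str
    name: str
    kind: str
    data: str


def sysvol_restore_value(method: str, authoritative: bool) -> RegistryValue:
    """Return the registry value to set while booted in DSRM (illustrative only)."""
    normalized = method.upper().replace("-", "")
    if normalized == "NTFRS":
        return RegistryValue(
            key=(r"HKLM\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters"
                 r"\Backup/Restore\Process at Startup"),
            name="BurFlags",
            kind="REG_DWORD",
            data="0xD4" if authoritative else "0xD2",
        )
    if normalized == "DFSR":
        return RegistryValue(
            # Assumed key path; confirm against Microsoft's forest recovery guide.
            key=r"HKLM\SYSTEM\CurrentControlSet\Services\DFSR\Restore",
            name="SYSVOL",
            kind="REG_SZ",
            data="authoritative" if authoritative else "non-authoritative",
        )
    raise ValueError(f"unknown replication method: {method}")


print(sysvol_restore_value("DFS-R", authoritative=False))
```

Keeping both methods behind one function makes it harder to apply the NTFRS procedure to a DFS-R domain, or the reverse.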
The steps to restore an entire AD domain are similar to recovering an individual AD DC, so please read the
Restoring an Individual Active Directory Domain Controller section first.
1. You will have to select carefully which DC to restore first. The choice will depend on your environment,
FSMO roles distribution and other criteria. Usually you should select the DC hosting the PDC Emulator
role. To know which one it is in your environment, refer to the Introduction to AD Replication section of this
document. The first restored DC should also be a DNS server whose DNS client is configured to use itself
(127.0.0.1) as the primary DNS server.
2. Once the first DC server is restored, you will have to perform an authoritative restore of SYSVOL, as
described later in this document.
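3. On the first restored DC, set the Repl Perform Initial Synchronizations registry parameter to 0 so that the
DC does not wait for replication partners that are not available yet:
a. Open the Registry Editor.
b. Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters
c. Create the Repl Perform Initial Synchronizations DWORD value if it does not exist.
d. Set its value to 0.
e. IMPORTANT: Make sure to reset this value to 1 after the forest is recovered completely.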
4. Determine the right order to restore the remaining DCs, according to the FSMO roles and other roles and
applications they host.
5. Restore remaining DCs one by one, performing a non-authoritative restore of SYSVOL for each of them.
• Run msconfig again and choose to restart in the normal mode by unselecting the Safe boot option.
• Allow the restored server time to finish starting Windows and to contact its replication partners.
• Change the value of the SYSVOL string to “authoritative”. If it does not exist, create it.
• Run msconfig again and choose to restart in the normal mode by unselecting the Safe boot option.
• Allow the restored server time to finish starting Windows and to contact its replication partners.
Start by restoring the root domain of your forest, then restore child domains. After the entire forest is restored,
do not forget to reset the Repl Perform Initial Synchronizations parameter to 1 in
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters.
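The ordering guidance for a full domain restore can be sketched as a sort over an inventory of DCs. The data structure and function are hypothetical, and ranking the remaining DCs by the number of FSMO roles they hold is our simplifying assumption: the right order in your environment also depends on the other roles and applications each DC hosts.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class DomainController:
    name: str
    fsmo_roles: Set[str] = field(default_factory=set)
    is_dns_server: bool = False


def restore_order(dcs: List[DomainController]) -> List[str]:
    """Order DCs for restore: PDC Emulator first, DNS servers before others."""
    def rank(dc: DomainController):
        return (
            "PDC Emulator" not in dc.fsmo_roles,  # False sorts first
            not dc.is_dns_server,
            -len(dc.fsmo_roles),                  # assumption: more roles, earlier
            dc.name,
        )
    return [dc.name for dc in sorted(dcs, key=rank)]


inventory = [
    DomainController("dc3"),
    DomainController("dc2", {"RID Master"}, is_dns_server=True),
    DomainController("dc1", {"PDC Emulator"}, is_dns_server=True),
]
print(restore_order(inventory))  # ['dc1', 'dc2', 'dc3']
```

The first name in the list is the DC that gets the authoritative restore of SYSVOL; all the others get a non-authoritative restore.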
After the required AD DCs are restored, verify that time synchronization works as expected, as well as DNS.
There are multiple tools available in Windows, or downloadable for free, to help you confirm restoration is
a success.
One of the most important events to search for is event ID 1109 in the Directory Service log. This event
indicates that a new InvocationID has been assigned to the DC and is logged when an Active Directory database
has been restored successfully. Assigning a new InvocationID to a restored AD DC tells its replication partners
that it has been restored and that it needs to update its data from them.
AD replication should be watched closely as well. If it fails, other errors or failures are most likely to occur,
often leading the restored DC to become unreliable. When recovery is done successfully, you should see the
following event IDs in the Windows Event Viewer, depending on whether the AD replication technology in your
environment is NTFRS or DFS-R:
• In an AD environment that uses DFS-R for replication, you will see the event ID 1104 logged in the DFSR
Windows log. It indicates that DFS-R replication has restarted successfully after a restore operation:
DCDIAG
DCDiag is a diagnostics tool built into Windows to monitor an AD DC’s health. To start with, you can simply open
a command prompt and type dcdiag. A healthy DC should report all tests as passed successfully. However,
running this tool right after a restoration will probably display multiple errors and warnings. This is both because
a recovered DC should be given enough time to finalize replication with its partners, and because dcdiag reports
all recent errors and warnings, even if they have been cleared since they were logged.
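Because dcdiag output can be long right after a restore, a small parser helps to list only the tests that failed. This sketch assumes the usual summary lines of the form "......... NAME passed test TESTNAME" or "......... NAME failed test TESTNAME"; the function name and the sample text are ours, so adjust the pattern if your output differs (for example on a localized Windows installation).

```python
import re
from typing import List, Tuple

# Summary line assumed to look like: "......... DC01 failed test Replications"
RESULT_LINE = re.compile(r"\.{3,}\s+(\S+)\s+(passed|failed)\s+test\s+(\S+)")


def failed_tests(dcdiag_output: str) -> List[Tuple[str, str]]:
    """Return (target, test name) pairs for every test reported as failed."""
    return [
        (match.group(1), match.group(3))
        for match in RESULT_LINE.finditer(dcdiag_output)
        if match.group(2) == "failed"
    ]


# Assumed sample output, for illustration only.
dcdiag_sample = """
         ......................... DC01 passed test Connectivity
         ......................... DC01 failed test Replications
         ......................... corp.example.com passed test LocatorCheck
"""
print(failed_tests(dcdiag_sample))  # [('DC01', 'Replications')]
```

Running the same check again after replication has had time to settle shows whether the list of failed tests is shrinking.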
REPADMIN
A healthy AD replication status should be observed after a successful recovery of AD DCs. To monitor this,
Windows Server provides another command line tool called repadmin. It lets you monitor many aspects of
replication, but you can start with the replication summary. To display it, simply open a command
prompt and type repadmin /replsum.
After the recovered DC has completed an update of its data from its replication partners, the command should
not display any failures or errors.
If you want to see a little more detail about the status of the latest replication of the different partitions
between the replication partners, use repadmin with the parameter /showrepl.
The latest replication attempts of all partitions between the replication partners should all be successful.
Note: do not expect to see replications between each of your AD DCs. The replication topology is
built and adjusted as needed automatically by the Knowledge Consistency Checker (KCC) process.
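For environments with many DCs, the replication status can also be checked by script. The sketch below assumes the output of repadmin /showrepl was produced with the /csv option and saved as text, and that the column names match the assumed sample header; the function name and the sample rows are illustrative only.

```python
import csv
import io
from typing import List, Tuple


def failing_links(showrepl_csv: str) -> List[Tuple[str, str, str]]:
    """Return (source DC, destination DC, partition) for links with failures."""
    reader = csv.DictReader(io.StringIO(showrepl_csv))
    failures = []
    for row in reader:
        # A non-zero failure count means the latest attempts did not succeed.
        if int(row.get("Number of Failures") or 0) > 0:
            failures.append(
                (row["Source DSA"], row["Destination DSA"], row["Naming Context"])
            )
    return failures


# Assumed sample of `repadmin /showrepl /csv` output, for illustration only.
showrepl_sample = (
    "showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,"
    "Source DSA Site,Source DSA,Transport Type,Number of Failures,"
    "Last Failure Time,Last Success Time,Last Failure Status\n"
    'showrepl_INFO,Site1,DC01,"DC=corp,DC=example,DC=com",Site1,DC02,RPC,0,0,'
    "2023-09-13 10:00:00,0\n"
    'showrepl_INFO,Site1,DC01,"CN=Configuration,DC=corp,DC=example,DC=com",'
    "Site1,DC03,RPC,2,2023-09-13 09:00:00,2023-09-12 10:00:00,1722\n"
)
print(failing_links(showrepl_sample))
```

An empty list after the recovered DC has caught up with its partners matches the healthy state described above.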
CONCLUSION
While Active Directory backup and recovery is generally not an easy thing to comprehend, Rubrik provides
a VSS-compatible solution to ensure successful application-consistent backups of Active Directory Domain
Controllers. To accommodate both physical and virtual deployments, Rubrik allows AD DCs to be backed up
either using an image-level approach for VMs, or a volume-level approach for physical machines. Both methods
allow recovery of entire DC servers in the event of a failure of one of them, or to meet disaster recovery needs.
Throughout this document, we covered when and how to perform an authoritative or non-authoritative restore
of SYSVOL after the server itself has been restored, in order to guarantee the consistency of the domain or
the forest.
To learn more about Rubrik and how it can help simplify data protection, from backup to replication to disaster
recovery to archival, visit www.rubrik.com.
VERSION HISTORY
Rubrik is on a mission to secure the world’s data. With Zero Trust Data Security™, we help organizations achieve
business resilience against cyberattacks, malicious insiders, and operational disruptions. Rubrik Security Cloud,
powered by machine learning, secures data across enterprise, cloud, and SaaS applications. We help organizations
uphold data integrity, deliver data availability that withstands adverse conditions, continuously monitor data risks and
threats, and restore businesses with their data when infrastructure is attacked.
For more information please visit www.rubrik.com and follow @rubrikInc on X (formerly Twitter) and Rubrik on LinkedIn.
Rubrik is a registered trademark of Rubrik, Inc. All company names, product names, and other such names in this
document are registered trademarks or trademarks of the relevant company.
Global HQ
3495 Deer Creek Road
Palo Alto, CA 94304
United States
1-844-4RUBRIK
[email protected]
www.rubrik.com
sw-surviving-microsoft-active-directory-failures-with-rubrik / 20230913