Learn the Architecture - RAS Overview
Version 1.0
Non-Confidential Issue 01
Copyright © 2023 Arm Limited (or its affiliates). 107790_0100_01_en
All rights reserved.
Learn the Architecture - RAS Overview Document ID: 107790_0100_01_en
Version 1.0
Release information
Document history
Proprietary Notice
This document is protected by copyright and other related rights and the practice or
implementation of the information contained in this document may be protected by one or more
patents or pending patent applications. No part of this document may be reproduced in any form
by any means without the express prior written permission of Arm. No license, express or implied,
by estoppel or otherwise to any intellectual property rights is granted by this document unless
specifically stated.
Your access to the information in this document is conditional upon your acceptance that you
will not use or permit others to use the information for the purposes of determining whether
implementations infringe any third party patents.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR
ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL,
INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND
REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS
DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that
any use, duplication or disclosure of this document complies fully with any relevant export laws
and regulations to assure that this document or any portion thereof is not exported, directly
or indirectly, in violation of such export laws. Use of the word “partner” in reference to Arm’s
customers is not intended to create or refer to any partnership relationship with any other
company. Arm may make changes to this document at any time and without notice.
This document may be translated into other languages for convenience, and you agree that if there
is any conflict between the English version of this document and any translation, the terms of the
English version of the Agreement shall prevail.
The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks
of Arm Limited (or its affiliates) in the US and/or elsewhere. All rights reserved. Other brands and
names mentioned in this document may be the trademarks of their respective owners. Please
follow Arm’s trademark usage guidelines at https://www.arm.com/company/policies/trademarks.
Copyright © 2023 Arm Limited (or its affiliates). All rights reserved.
(LES-PRE-20349|version 21.0)
Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be
subject to license restrictions in accordance with the terms of the agreement entered into by Arm
and the party that Arm delivered this document to.
Product Status
Feedback
Arm welcomes feedback on this product and its documentation. To provide feedback on the
product, create a ticket on https://support.developer.arm.com
We believe that this document contains no offensive language. To report offensive language in this
document, email terms@arm.com.
Contents
1. Introduction to RAS
6. Related information
1. Introduction to RAS
There are three key attributes of a robust, dependable, computer system: Reliability, Availability,
and Serviceability (RAS).
RAS support is essential for many computing situations, reducing unplanned outages as follows:
• Detecting and correcting transient errors before they cause application or system failure.
System failure might impact the customer’s business and reputation when systems are used
for mission-critical functions. In some cases, reliability might form part of the service’s value
proposition.
• Identifying and replacing failing components. In any system, failures are inevitable. One
approach is to replace components on a regular schedule regardless of whether they are
failing or not, but this is costly. RAS allows failing components to be identified, and the cost of
maintenance can therefore be reduced by only replacing failed parts.
• Predicting failures ahead-of-time to allow replacement during planned maintenance. Scheduled
maintenance is much cheaper than unscheduled call-outs.
The RAS extension is a mandatory extension to the Armv8.2-A architecture, and an optional
extension to the base Armv8.0-A architecture. Armv9-A and Armv8-R inherit this support for RAS.
Armv8.4-A and Armv8.7-A introduce additional architectural features to the RAS extension.
There are two aspects to RAS: the RAS extension to the Processing Element (PE) architecture, and
the RAS system architecture.
The RAS extension provides the following architectural features to help systems improve RAS:
• System registers for accessing optional error records and fault injection controls defined by the
RAS system architecture
• An Error synchronization event and Error Synchronization Barrier instruction, ESB, that software
can use to isolate the effects of errors
• A non-maskable asynchronous error exception, for errors reported to the highest Exception
level (EL3), and changes to the exception model to ease partitioning of error recovery from
other exception handling software
• Additional register fields that allow the Processing Element (PE) to report a PE error state
when an error exception is taken
The RAS system architecture provides a framework for building RAS features in a system:
• A standard format for error records:
◦ Reports the occurrence and severity of errors
◦ Provides useful information to software, such as field replaceable unit (FRU) identification.
• An architecture for reporting different severities of error either synchronously as an in-band
error response, or asynchronously as error interrupts.
• Standard extensions for error counters, error record timestamps, and fault injection.
The RAS extension lets you design systems with a wide range of RAS capability. For example:
• A system with basic error recovery might provide very few RAS hardware features, and simply
reset when an error is detected.
• A mission-critical system which needs to provide high reliability might provide hardware
features such as data consistency checks and component redundancy to maximize reliability
and availability.
The system designer decides the importance of RAS to their situation, and uses the RAS
architecture extension to implement a solution that meets their needs.
2. RAS basic concepts
An error is any deviation from correct behavior. Errors are caused by faults. Faults can be either
internal or external to a system, and either transient or persistent. The following table gives
examples of each of these fault categories:
When a fault is activated, it causes an error. Errors that are not detected are called latent errors.
When an error occurs, the error propagates through the system until it results in a failure. A failure
is the event of deviation from correct service. This can include data corruption, data loss, and
service loss.
The following diagram shows the relationship between faults, errors, and failures:
Fault --(Activation)--> Error --(Propagation)--> Failure
A simple example of error propagation is when a producer passes a corrupt data value to a
consumer.
If the data is corrupted, then an uncorrected error is propagated from the producer to the
consumer.
If a system has no error detection capabilities, all errors that occur are latent errors because they
are never detected. These latent errors are silently propagated through the system until either of
the following happens:
• The error affects the correct behavior of the system, causing a failure. This is a Silent Data
Corruption (SDC) failure.
• The error is masked. An error is masked when the behavior of the system is such that the
error does not affect the correct service of the system. For example, suppose an error results
in an incorrect data value in a register. If that value is overwritten by uncorrupted data
before the corrupted value is used, the error is masked.
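The two outcomes above can be illustrated with a toy model. This is a minimal sketch and not part of the RAS architecture: the Register class, the flip_bit helper, and the run function are hypothetical names, and a single bit flip stands in for a transient fault in a system with no error detection.

```python
class Register:
    """Toy register with no error detection (hypothetical model)."""

    def __init__(self, value: int) -> None:
        self.value = value

    def flip_bit(self, bit: int) -> None:
        # Model a transient fault being activated: one stored bit is inverted.
        self.value ^= 1 << bit


def run(overwrite_before_use: bool) -> str:
    expected = 0x2A
    reg = Register(expected)
    reg.flip_bit(3)  # the error is now latent: nothing detects it

    if overwrite_before_use:
        # Uncorrupted data replaces the corrupted value before any read.
        reg.value = expected

    consumed = reg.value
    return "masked" if consumed == expected else "silent data corruption"


print(run(overwrite_before_use=True))   # masked
print(run(overwrite_before_use=False))  # silent data corruption
```

In both runs the fault is activated, but only the second run lets the error propagate to a consumer of the register value.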
The interface between a producer and consumer might indicate that the data is corrupted, for
example, by including an error signal.
If the error signal indicates the presence of the uncorrected error, a detected error is signaled and
passed to the consumer. Otherwise, the uncorrected error is silently propagated.
The following diagram shows data being passed from a producer to a consumer together with an
error signal.
Figure 2-2: Error propagation from producer to consumer, with error signal
Producer --(Data, Error signal)--> Consumer
The error signal might be a separate signal, or embedded in the data as an error
detection code.
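As an illustration of an error signal embedded in the data, the following sketch attaches an even-parity bit to an 8-bit value. It is only an example under stated assumptions: the Transfer type and the produce and consume functions are hypothetical, and real interfaces typically use stronger codes than a single parity bit.

```python
from typing import NamedTuple


class Transfer(NamedTuple):
    data: int    # 8-bit payload
    parity: int  # even-parity bit computed by the producer


def parity_bit(data: int) -> int:
    # Even parity: the bit that makes the total number of 1s even.
    return bin(data & 0xFF).count("1") % 2


def produce(data: int) -> Transfer:
    return Transfer(data & 0xFF, parity_bit(data))


def consume(transfer: Transfer) -> tuple[int, bool]:
    # Returns (data, error_detected).
    return transfer.data, parity_bit(transfer.data) != transfer.parity


good = produce(0b1011_0010)
one_flip = good._replace(data=good.data ^ 0b0000_0100)   # single-bit corruption
two_flips = good._replace(data=good.data ^ 0b0000_0011)  # double-bit corruption

print(consume(good)[1])       # False: no error signaled
print(consume(one_flip)[1])   # True: detected error
print(consume(two_flips)[1])  # False: silently propagated
```

The last case shows the limit of the chosen code: a double-bit corruption leaves the parity unchanged, so the uncorrected error is silently propagated to the consumer.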
If an error is consumed and updates the state of the component, then that state becomes infected.
If the state is marked as being in error, meaning a subsequent read of the state signals a detected
error, the state is poisoned.
An undetected error is uncontained at the component that failed to detect it. A detected
uncorrected error is uncontained at the component that silently propagates it.
An error that is uncontainable at a component might still be containable at the system level.
Errors are classified into the following types:
• Uncontainable (UC)
• Unrecoverable (UEU)
• Recoverable (UER)
• Restartable (UEO)
• Deferred (DE)
• Corrected (CE)
When an error occurs, it always starts with a producer. If the producer is using error detection
techniques, for example a parity bit on data, then the producer might be able to detect the error. In
this case, the error is detected. The error is classified based on how the error is dealt with by both
the producer and the consumer.
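[Figure: flowchart of error classification between the Producer and the Consumer. Decision nodes: "Can continue?" and "Propagated?". Outcomes shown include Unrecoverable (UEU) and Latent. The Producer passes corrected data, corrupted data with an error signal, or corrupted data to the Consumer.]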
The categorization process starts with the producer. Depending on the nature of the error and the
capabilities of the system, the error can be categorized as follows:
1. Might the producer have silently propagated the error?
If the error might have been silently propagated, the error is uncontainable (UC). This is the
case, for example, if corrupted data has been received by the consumer, or might be consumed
by any other agent, without that consumer knowing that the data is corrupted.
2. Can the producer correct the error?
If the error can be detected, the producer might be able to correct the error, for example by
using a cyclic redundancy check (CRC). In this case, the error is a corrected error (CE). The
corrected data is passed to the consumer, and normal operation continues.
3. Can the producer defer the error?
An error is deferred by hardware if hardware can make forward progress without consuming
the error.
Deferred errors result in corrupted data being passed to the consumer, but with an error signal
to alert the consumer. The consumer can then decide how to deal with the error. For example,
if it does not need to consume the corrupted data, it can leave the error as deferred.
4. Can the producer recover from the error?
If the error prevents the producer from continuing operation, then the error is categorized as a
detected unrecoverable (UEU) error.
5. Does the producer propagate the error to a consumer?
If the detected, uncorrected, and undeferred error is not passed to a consumer, then the error
is a detected latent error.
Otherwise, if the error is passed to the consumer, it is a signaled error. Corrupted data is passed
to the consumer, but with an error signal to alert the consumer.
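The producer-side questions can be read as an ordered decision procedure. The following is a sketch of that reading only: the function name, the parameter names, and the ProducerOutcome enumeration are hypothetical, and each boolean stands for the answer to one of the questions above.

```python
from enum import Enum


class ProducerOutcome(Enum):
    UC = "uncontainable"
    CE = "corrected"
    DE = "deferred"
    UEU = "unrecoverable"
    LATENT = "detected latent"
    SIGNALED = "signaled"


def classify_at_producer(
    *,
    silently_propagated: bool,
    corrected: bool,
    deferred: bool,
    can_continue: bool,
    passed_to_consumer: bool,
) -> ProducerOutcome:
    # The checks run in the same order as the numbered questions.
    if silently_propagated:
        return ProducerOutcome.UC
    if corrected:
        return ProducerOutcome.CE
    if deferred:
        return ProducerOutcome.DE
    if not can_continue:
        return ProducerOutcome.UEU
    if not passed_to_consumer:
        return ProducerOutcome.LATENT
    return ProducerOutcome.SIGNALED


# A detected error that is neither corrected nor deferred, and is passed on
# with an error signal.
print(classify_at_producer(
    silently_propagated=False,
    corrected=False,
    deferred=False,
    can_continue=True,
    passed_to_consumer=True,
))  # ProducerOutcome.SIGNALED
```

Ordering the checks this way means an earlier answer takes priority: an error that might have been silently propagated is uncontainable regardless of the later answers.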
If the consumer receives corrupted data, then the error is further categorized based on how the
consumer deals with the error:
1. Was the error silently propagated?
If the error might have been silently propagated, the error is uncontainable (UC).
2. Can the consumer correct the error?
3. Can the consumer recover from the error?
If the error has corrupted the component’s state and recovery of the component is not
possible, the error is an unrecoverable error (UEU).
4. Does the consumer need the corrupted data to continue?
If the consumer can take action to recover from the error, but requires the corrupted data to
make progress, the error is a recoverable error (UER).
If the consumer does not require the corrupted data to make progress, and therefore can
recover with no action, the error is a restartable error (UEO).
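The consumer-side outcomes can be sketched in the same way. This is an illustrative reading, not a definitive implementation: the names are hypothetical, and the function covers only the uncontainable, unrecoverable, recoverable, and restartable outcomes described in this section.

```python
from enum import Enum


class ConsumerOutcome(Enum):
    UC = "uncontainable"
    UEU = "unrecoverable"
    UER = "recoverable"
    UEO = "restartable"


def classify_at_consumer(
    *,
    silently_propagated: bool,
    state_recoverable: bool,
    needs_corrupted_data: bool,
) -> ConsumerOutcome:
    if silently_propagated:
        return ConsumerOutcome.UC
    if not state_recoverable:
        # The component state is corrupted and cannot be recovered.
        return ConsumerOutcome.UEU
    if needs_corrupted_data:
        # Recovery is possible, but only if software takes action.
        return ConsumerOutcome.UER
    # Progress is possible without the corrupted data.
    return ConsumerOutcome.UEO


print(classify_at_consumer(
    silently_propagated=False,
    state_recoverable=True,
    needs_corrupted_data=False,
))  # ConsumerOutcome.UEO
```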
3. Arm RAS extension and system architecture
There are two aspects to RAS: the RAS extension to the Processing Element (PE) architecture, and
the RAS system architecture.
The RAS extension does not define the level of reliability, availability, and
serviceability in a PE. The RAS framework allows for a wide range of RAS
capabilities, from basic RAS approaches such as “reset-on-error” through to very
high-reliability systems with sophisticated error handling and correction features.
But the specific RAS features that the PE includes are IMPLEMENTATION DEFINED
and must be decided by each individual designer. The RAS extension provides the
framework for the implementation to communicate these choices to software when
an error occurs.
The error can also be deferred by generating a poison value. Data poisoning is a mechanism for
marking data as corrupted. The poison value is stored with the corrupted data to indicate that
an error was detected. Subsequent accesses of the data see the poison value and treat it as a
detected error.
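Data poisoning can be modeled in a few lines. The following is a toy sketch under stated assumptions: the PoisonedMemory class and the PoisonedDataError exception are hypothetical, the poison value is modeled as a per-address flag, and the sketch assumes that overwriting a location with good data clears its poison.

```python
class PoisonedDataError(Exception):
    """Raised when poisoned data is read (models a detected error)."""


class PoisonedMemory:
    def __init__(self) -> None:
        self._data: dict[int, int] = {}
        self._poison: set[int] = set()

    def write(self, addr: int, value: int, poisoned: bool = False) -> None:
        self._data[addr] = value
        if poisoned:
            # Store the poison marker alongside the corrupted data.
            self._poison.add(addr)
        else:
            # Assumption: a write of good data replaces the poisoned value.
            self._poison.discard(addr)

    def read(self, addr: int) -> int:
        if addr in self._poison:
            raise PoisonedDataError(f"poisoned data at {addr:#x}")
        return self._data[addr]


mem = PoisonedMemory()
mem.write(0x100, 7, poisoned=True)  # error deferred: nothing is reported yet
try:
    mem.read(0x100)
except PoisonedDataError as exc:
    print("detected error:", exc)

mem.write(0x100, 9)     # good data overwrites the poisoned location
print(mem.read(0x100))  # 9
```

The error is reported only when the poisoned data is consumed. If the location is overwritten first, no consumer ever sees the error.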
An Error synchronization event ensures that all SError Interrupts that would be reported as
unrecoverable (UEU) are either taken or pended before execution continues.
The ESB (Error Synchronization Barrier) instruction generates an Error synchronization event to
isolate UEU errors. This ensures that interrupts which are generated by instructions before the ESB
are either taken or pended before execution continues beyond the ESB instruction.
Implicit ESB events are inserted at exception handler entry and exit. This enables errors to be
isolated without having to modify software to add explicit ESB instructions.
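The effect of an Error synchronization event can be pictured with a toy model. This sketch is a simplification and not a description of real PE behavior: the ToyPE class and its fields are hypothetical, and a single serror_masked flag stands in for the conditions that decide whether an SError interrupt is taken or pended.

```python
class ToyPE:
    """Toy model of asynchronous error reporting (hypothetical)."""

    def __init__(self, serror_masked: bool) -> None:
        self.serror_masked = serror_masked
        self.in_flight: list[str] = []  # errors detected but not yet reported
        self.taken: list[str] = []      # reported as SError interrupts
        self.pended: list[str] = []     # recorded as pending for later handling

    def async_error(self, source: str) -> None:
        # An earlier instruction caused an error that has not been reported yet.
        self.in_flight.append(source)

    def esb(self) -> None:
        # Error synchronization event: every outstanding error is either
        # taken or pended before execution continues.
        target = self.pended if self.serror_masked else self.taken
        target.extend(self.in_flight)
        self.in_flight.clear()


pe = ToyPE(serror_masked=True)
pe.async_error("load before ESB")
pe.esb()
print(pe.in_flight, pe.pended)  # [] ['load before ESB']
```

After esb() returns, no error caused by earlier instructions remains unreported, so later code cannot be blamed for it. This is the isolation property described above.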
The RAS extension provides a standardized error record format for recording errors. The
information logged in these records includes the following:
• An error code to identify the specific error that was detected.
• Any memory address associated with the error.
• An optional timestamp when the error was detected.
• An optional counter, for counting correctable errors so they can be serviced less frequently.
• Miscellaneous fields used to store application-specific information about errors. These are
IMPLEMENTATION DEFINED.
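The information listed above could be represented in software as follows. This is a sketch for illustration only: the ErrorRecord class and its field names are hypothetical and do not reflect the register layout defined by the architecture.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ErrorRecord:
    error_code: int                  # identifies the specific error
    address: Optional[int] = None    # memory address, if one is associated
    timestamp: Optional[int] = None  # optional: when the error was detected
    ce_count: Optional[int] = None   # optional: count of corrected errors
    # IMPLEMENTATION DEFINED miscellaneous information, keyed by name here.
    misc: dict[str, int] = field(default_factory=dict)

    def summary(self) -> str:
        parts = [f"code={self.error_code:#x}"]
        if self.address is not None:
            parts.append(f"addr={self.address:#x}")
        if self.timestamp is not None:
            parts.append(f"time={self.timestamp}")
        if self.ce_count is not None:
            parts.append(f"ce_count={self.ce_count}")
        return " ".join(parts)


record = ErrorRecord(error_code=0x12, address=0x8000_1000, ce_count=3)
print(record.summary())  # code=0x12 addr=0x80001000 ce_count=3
```

Optional fields are modeled as None when the implementation does not provide them, which mirrors the optional timestamp and counter in the list above.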
4. RAS error reporting flow
If a corrected error (CE) counter is implemented, then a Fault Handling Interrupt (FHI) is generated only when the CE counter overflows.
5. If the error cannot be corrected or deferred, then the node does one or both of the following:
• Returns an in-band error response to the access
• Generates an Error Recovery Interrupt (ERI)
6. When a PE receives an in-band error response, it generates either a Synchronous External Data
Abort (SEA) or an asynchronous SError Interrupt (SEI), depending on the implementation of the
PE.
Software can control some aspects of this flow, for example disabling generation of one or more of
the interrupts.
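The interaction between a corrected error counter and the FHI can be sketched as follows. This is a minimal model under stated assumptions: the CorrectedErrorCounter class is hypothetical, the counter width is a parameter chosen for the example, and the counter is assumed to wrap to zero when it overflows.

```python
class CorrectedErrorCounter:
    """Toy corrected error counter that requests an FHI on overflow."""

    def __init__(self, width_bits: int = 8) -> None:
        self.max_value = (1 << width_bits) - 1
        self.count = 0

    def record_corrected_error(self) -> bool:
        """Count one corrected error. Return True if an FHI is requested."""
        if self.count == self.max_value:
            # Incrementing past the maximum overflows the counter.
            self.count = 0
            return True
        self.count += 1
        return False


counter = CorrectedErrorCounter(width_bits=2)
fhi = [counter.record_corrected_error() for _ in range(8)]
print(fhi)  # [False, False, False, True, False, False, False, True]
```

With a 2-bit counter, software is interrupted once for every four corrected errors instead of once per error, so corrected errors are serviced less frequently.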
5. Example RAS implementations
This section of the guide describes examples of systems with different levels of RAS capability:
• Fundamental
• Advanced
• Safety-critical
The following table describes these different example systems and the types of RAS hardware
features they might implement.
• Advanced: Service Level Agreements (SLAs) with customers might require real-time RAS features to meet the agreed service standards, for example not dropping video streams or keeping a cell tower operational 24/7.
• Safety-critical: Long term reliability important because of high cost to owner.
Risk of failure
• Fundamental: User annoyance. Productivity loss. Reputational loss.
• Advanced: Impact to reputation. Potential liability. Potential high cost due to service availability (for example, search engines) or having to re-run large-scale jobs (for example, weather modelling).
• Safety-critical: Potential liability and reputational loss. High cost to recall affected parts, if needed.
6. Related information
Here are some resources related to material in this guide:
• Reliability, Availability, and Serviceability (RAS) Arm Architecture Reference Manual Supplement