RAS Features
The server supports a variety of Reliability, Availability, and Serviceability (RAS) features. You can configure these features to suit your reliability and performance requirements.
For details about how to configure RAS features, see the Huawei Atlas Server Purley Platform BIOS Parameter Reference.
Module | Feature | Description |
---|---|---|
CPU | Corrected Machine Check Interrupt (CMCI) | Signals corrected errors through an interrupt so that software can handle them promptly. |
Memory | Failed DIMM Isolation | Identifies faulty DIMMs so that they can be isolated and replaced. |
Memory | Memory Thermal Throttling | Automatically throttles memory activity to lower the memory temperature and prevent damage from overheating. |
Memory | Rank Sparing | Reserves some memory ranks as spares to prevent the system from breaking down due to uncorrectable errors. |
Memory | Memory Address Parity Protection | Detects memory command and address errors. |
Memory | Memory Demand and Patrol Scrubbing | Corrects correctable errors upon detection. If these errors are not corrected in a timely manner, uncorrectable errors may occur. |
Memory | Memory Mirroring | Keeps a mirrored copy of memory data to provide high system reliability. |
Memory | Single Device Data Correction (SDDC) | Corrects single-chip multi-bit errors to improve memory reliability. |
Memory | Device Tagging | Degrades and rectifies memory faults to improve memory availability. |
Memory | Data Scrambling | Optimizes data flow distribution to reduce the error probability and improve memory data flow reliability and address error detection. |
PCIe | PCIe Advanced Error Reporting | Provides a PCIe advanced error reporting mechanism to improve server serviceability. |
UPI | Intel UPI Link Level Retry | Provides a retry mechanism to improve the reliability of UPI links. |
UPI | Intel UPI Protocol Protection via CRC | Provides cyclic redundancy check (CRC) protection for UPI data packets to improve system reliability. |
System | Core Disable for FRB (Fault Resilient Boot) | Isolates a faulty CPU core during startup to improve system reliability and availability. |
System | Corrupt Data Containment Mode | Marks the memory storage unit when a data error occurs to limit the impact on the running program and improve system reliability. |
System | Socket Disable for FRB (Fault Resilient Boot) | Isolates a faulty socket during the BIOS startup process to improve system reliability. |
System | Architected Error Records | With features such as eMCA, the BIOS collects error information recorded in hardware registers in compliance with UEFI specifications, notifies the OS through the ACPI Platform Error Interface (APEI), and locates the faulty unit, improving system availability. |
System | Error Injection Support | Implements fault injection to verify RAS features. |
System | Machine Check Architecture (MCA) | Provides a software repair function that rectifies uncorrectable errors, improving system availability. |
System | Enhanced Machine Check Architecture (eMCA): Gen2 | Improves system availability. |
System | OOB Access to MCA Registers | The out-of-band system can access MCA registers through the Platform Environment Control Interface (PECI). When a fatal error occurs in the system, the out-of-band system can collect onsite data to facilitate subsequent fault analysis and locating, improving system serviceability. |
System | BIOS Abstraction Layer for Error Handling | The BIOS processes errors and reports error information to the OS based on specifications, improving system serviceability. |
System | BIOS-based Predictive Failure Analysis (PFA) | Led by the OS. The BIOS provides information about physical memory error units, and the OS tracks, predicts, and handles the errors. |
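
The two UPI features in the table, CRC protection and link level retry, work together: the receiver checks each packet's CRC, and a mismatch triggers a retransmission. The following Python sketch illustrates that general idea only. It is not the UPI implementation: the source does not state the CRC polynomial, packet format, or retry limit, so the use of CRC-32 from `zlib`, the 4-byte trailer, the `max_retries` value, and the helper names `frame`, `check`, `send_with_retry`, and `flaky_link` are all assumptions made for illustration.

```python
import zlib
from typing import Callable, Tuple

CRC_LEN = 4  # assumption: 4-byte CRC-32 trailer, chosen only for this sketch


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as a trailer (illustrative framing)."""
    return payload + zlib.crc32(payload).to_bytes(CRC_LEN, "big")


def check(packet: bytes) -> bool:
    """Return True if the trailing CRC matches the payload."""
    payload, trailer = packet[:-CRC_LEN], packet[-CRC_LEN:]
    return zlib.crc32(payload).to_bytes(CRC_LEN, "big") == trailer


def send_with_retry(
    payload: bytes,
    link: Callable[[bytes], bytes],
    max_retries: int = 3,  # hypothetical limit, not taken from the source
) -> Tuple[bytes, int]:
    """Send a framed packet over `link`; retransmit whenever the CRC check fails.

    Returns the received payload and the number of retries that were needed.
    """
    packet = frame(payload)
    for attempt in range(max_retries + 1):
        received = link(packet)
        if check(received):
            return received[:-CRC_LEN], attempt
    raise RuntimeError("link still failing after the retry limit")


def flaky_link(fail_count: int) -> Callable[[bytes], bytes]:
    """Hypothetical link that flips one bit in the first `fail_count` transfers."""
    remaining = fail_count

    def transmit(packet: bytes) -> bytes:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            corrupted = bytearray(packet)
            corrupted[0] ^= 0x01  # single-bit error, always caught by CRC-32
            return bytes(corrupted)
        return packet

    return transmit


if __name__ == "__main__":
    data, retries = send_with_retry(b"cache line", flaky_link(2))
    print(data, retries)
```

In this sketch a transient error costs only a retransmission, while a persistent error exhausts the retry limit and is raised to the caller.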