Safety Mechanism for Processing Unit
In this post, we will learn about Safety Mechanisms used in Processing Unit. The Safety Mechanisms described in this post are based on ISO 26262-5:2018 Annex D.
D.2.3.1 Self-test by software
Self-test by software is a safety mechanism described in ISO 26262 that uses software to detect faults in the processing unit and its physical storage (e.g. registers) at an early stage. The purpose of this is to improve the overall safety of the vehicle.
Purpose
- Early fault detection: Improves the reliability of the system by detecting faults in the processing unit and sub-components as early as possible using software.
Implementation method
- Software-based fault detection: Software generates data patterns to test physical storage (e.g. data and address registers) and functional units (e.g. instruction decoders).
- Pattern generation and periodic testing: Generates various data patterns and executes self-tests during regular operation of the system or at specific events to detect faults.
How it works
- Data pattern application: Apply software-generated patterns to physical storage and functional units to identify faults.
- Fault analysis: Analyze whether the system is operating as expected, and consider any unexpected behavior as a fault.
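The pattern application and fault analysis steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: `SimulatedRegister`, its `stuck_at_zero_mask` parameter, and the chosen pattern set are hypothetical and not taken from ISO 26262; a real self-test would run on the target hardware, usually in C or assembly.

```python
# Simulated pattern-based self-test of a 32-bit storage element.
# All names here are hypothetical and used for illustration only.

PATTERNS = (0x00000000, 0xFFFFFFFF, 0x55555555, 0xAAAAAAAA)


class SimulatedRegister:
    """Hypothetical 32-bit register; stuck_at_zero_mask models stuck-at-0 bits."""

    def __init__(self, stuck_at_zero_mask=0):
        self._value = 0
        self._stuck = stuck_at_zero_mask

    def write(self, value):
        # Bits covered by the stuck mask can never be set to 1.
        self._value = value & 0xFFFFFFFF & ~self._stuck

    def read(self):
        return self._value


def self_test(register):
    """Write each pattern, read it back, and return the patterns that failed."""
    failures = []
    for pattern in PATTERNS:
        register.write(pattern)
        if register.read() != pattern:
            failures.append(pattern)
    return failures


healthy = self_test(SimulatedRegister())
faulty = self_test(SimulatedRegister(stuck_at_zero_mask=0x00000004))
```

Here `healthy` is an empty list, while `faulty` contains the two patterns that try to set the stuck bit. Any non-empty result would be treated as a fault.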
Fault Handling
- Alert and Notification: Send an alert message to the operator or administrator when a fault is detected.
- Transition to Safe Mode: Switch the system to safe mode to minimize the impact of the fault.
Examples
Example 1: Functional Correctness Test of Processing Unit
- Test Method: Test the functional correctness of the processing unit by applying at least one pattern for each instruction. Coverage is limited if there are instructions that are excluded from the test.
- Test Limits:
- Coverage is limited for specific registers, core timers, and exception handling.
- Coverage may be lacking for sequence dependencies such as pipelines or timing-related fault modes.
- Coverage is limited for soft errors.
Example 2: EDC Coder/Decoder Test
- Test Method: Test the behavior of the EDC logic using intentionally corrupted data.
- Coverage: Coverage varies depending on the number and variety of patterns, and does not provide coverage for soft errors.
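A minimal sketch of this idea, assuming a simple even-parity code as the EDC (real EDC logic is usually a stronger code implemented in hardware). The function names and test words are hypothetical.

```python
def parity_bit(data):
    """Coder: even-parity bit over an 8-bit value."""
    return bin(data & 0xFF).count("1") % 2


def edc_check(data, stored_parity):
    """Decoder: True when data and stored parity agree."""
    return parity_bit(data) == stored_parity


def edc_self_test(checker, test_words=(0x00, 0xA5, 0xFF)):
    """Return True if the checker accepts clean words and flags every
    intentionally injected single-bit corruption."""
    for word in test_words:
        parity = parity_bit(word)
        if not checker(word, parity):
            return False  # clean word wrongly rejected
        for bit in range(8):
            if checker(word ^ (1 << bit), parity):
                return False  # corrupted data not detected
        if checker(word, parity ^ 1):
            return False  # corrupted parity bit not detected
    return True


def broken_checker(data, stored_parity):
    """Faulty checker whose output is stuck at 'no error'."""
    return True


good = edc_self_test(edc_check)
bad = edc_self_test(broken_checker)
```

The test passes for the working checker and fails for the one that never reports an error, which is the kind of latent fault this test is meant to reveal.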
Example 3: Utilization in Automotive Electronic Systems
1. Engine Control Unit (ECU) Test
- Test functional correctness by applying data patterns to each instruction executed by the ECU's processing unit, and switch to a restricted mode or issue an alert when a fault is found.
2. EDC Coder/Decoder Integrity Check
- Check the integrity of the EDC logic using intentionally corrupted data, and log and recover errors when a fault is found.
Limitations and Challenges of Self-test by Software
1. Coverage Limit
- Software testing has limited detection of soft errors, and may not fully cover specific registers and exception handling elements.
2. Increased Complexity
- Increased system complexity makes test pattern design difficult, and consumes a lot of time and resources.
3. Incompleteness of Fault Detection
- Detection of timing-related error modes is limited, and not all faults can be perfectly detected.
D.2.3.2 Self-test supported by hardware (one-channel)
Self-test Supported by Hardware (One-channel) defined in the ISO 26262 standard is a mechanism for early detection of faults by utilizing special hardware in vehicle systems. It aims to provide high coverage by expanding the speed and scope of fault detection and enhance the safety of the system.
Purpose
- Early fault detection: Rapidly detects faults in processing units and other sub-elements by using special hardware.
- Expanding detection speed and scope: Realizes more effective fault management by expanding the speed and scope of fault detection through hardware.
Description
Implementation method
- Special hardware support: Add hardware that supports self-test function to detect faults in sub-elements at the gate level.
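- Hardware built-in self-test (BIST): Integrate the BIST mechanism into the system so that each element performs its own test to detect faults early.
- Random pattern generation: Generate various input patterns using a pseudo-random pattern generator (typically a linear-feedback shift register, LFSR) and test each element of the system. The responses are usually compacted into a signature by a multiple-input signature register (MISR).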
How it works
- Apply input patterns: Apply the generated random input patterns to the system to detect faults.
- Result evaluation: Analyze the difference between the expected output and the actual output to determine whether there is a fault, and take appropriate actions if there is a fault.
- High coverage: Hardware-based testing provides high coverage and is mainly performed during system initialization.
- Multiple-point fault detection: The test is typically used to find latent faults, i.e. faults that could otherwise contribute to a multiple-point failure.
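The pattern application and result evaluation steps can be illustrated with a software model of a logic BIST. This is a sketch under assumptions: the 16-bit LFSR taps, the feedback polynomial `MISR_POLY`, the seed, and the stand-in circuit are all chosen for illustration, and in real hardware the golden signature is fixed at design time rather than computed at run time.

```python
def lfsr16_step(state):
    """One step of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


MISR_POLY = 0x04C11DB7  # feedback polynomial (assumption for this sketch)


def misr_step(signature, response):
    """Shift the 32-bit signature with feedback, then fold in one response."""
    feedback = MISR_POLY if signature & 0x80000000 else 0
    return (((signature << 1) & 0xFFFFFFFF) ^ feedback) ^ response


def run_bist(circuit, seed=0xACE1, patterns=16):
    """Apply pseudo-random patterns and compact the responses into a signature."""
    state, signature = seed, 0
    for _ in range(patterns):
        signature = misr_step(signature, circuit(state))
        state = lfsr16_step(state)
    return signature


def good_circuit(x):
    """Stand-in for the logic under test (binary-to-Gray conversion)."""
    return (x ^ (x >> 1)) & 0xFFFF


def faulty_circuit(x):
    """Same logic with output bit 4 stuck at 0."""
    return good_circuit(x) & ~0x0010


GOLDEN = run_bist(good_circuit)  # reference signature from the fault-free model
bist_pass = run_bist(good_circuit) == GOLDEN
bist_fault_detected = run_bist(faulty_circuit) != GOLDEN
```

Only the final signature is compared with the expected one, so a single comparison summarizes the response to all applied patterns.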
Fault handling
- Generate warning messages: When a fault is detected, generate warning messages to notify the operator.
- Switch to safe mode: If necessary, the system switches to safe mode to minimize the impact of the fault.
Examples
Example 1: Hardware-based testing of EDC coder/decoder
- Special hardware mechanism: Apply logic BIST to the EDC coder/decoder to generate input and check the expected results.
- Use random pattern generator: Provides high coverage due to automated pattern generation.
- Test limitations: Soft errors are difficult to detect.
Example 2: Self-test Supported by Hardware in Automotive Electronic Systems
1. Hardware-based self-test during ECU initialization
- Hardware-supported test: The ECU performs self-test via BIST during initialization to check key functional units and storages.
- Random pattern generation: Tests key gates and storage elements through various input patterns.
- Result processing: Generates warning messages and switches to safe mode when a fault occurs.
2. Hardware-based test of automotive safety systems
- Safety system initialization: Airbag controllers, etc. check all functions with hardware-based self-test during initialization.
- BIST application: Evaluates the functional correctness of each component.
- Fault response: Issues a warning immediately when a fault is found and places the affected safety function in a defined safe state to protect the driver.
Limitations and Challenges
1. Soft Error Limitations
- Difficulty in Detection: Hardware-based testing focuses on physical defects, so it has limited coverage for soft errors caused by external environmental factors or data corruption.
2. Invasiveness of Testing
- Requires Operational Interruption: Hardware-based testing is performed at system initialization or shutdown, and does not run during normal operation.
3. Complex Implementation
- Design Complexity: Hardware-assisted self-testing can require complex designs, which increases development costs and time.
D.2.3.3 Self-test by software cross exchanged between two independent units
The Self-test by Software Cross Exchanged Between Two Independent Units proposed in the ISO 26262 standard is a mechanism for early detection of mutual faults between two or more independent processing units through software. This mechanism aims to increase reliability through mutual verification between the two units.
Purpose
- Early fault detection: Early detection of faults occurring in physical storage and functional units through software exchanged between two or more independent processing units.
Description
Implementation method
- Independent software test: Each processing unit tests the physical storage and functional units by executing additional software functions. The test is performed using a method such as Walking Bit Pattern.
- Result generation and exchange: After the test, each unit generates a result and exchanges it with other processing units for mutual verification. Design an interface for result exchange between processing units, so that fast and accurate result comparison can be made.
How it works
- Apply test patterns: Detect faults by applying various test patterns, such as walking bit patterns, to each unit.
- Result analysis and exchange: Each unit analyzes the results and exchanges them with other units to verify the accuracy of the results.
- Fault detection and response: If a result inconsistency is found, it determines that the unit is defective, issues a warning message, and takes necessary actions.
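- Limited soft error coverage: This method has limited or no coverage for soft errors. A periodic test mainly reveals permanent hardware faults; a transient bit flip that occurs between two test runs is usually overwritten by the test patterns and leaves nothing for the test to find.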
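The test, exchange, and compare sequence above might look as follows in a simplified model. The `Unit` class, its `stuck_mask` fault model, and the 8-bit register width are assumptions made for this sketch; a real system would exchange results over an inter-processor link.

```python
WIDTH = 8


def walking_bit_patterns(width=WIDTH):
    """Walking-1 patterns: 0b00000001, 0b00000010, ..."""
    return [1 << i for i in range(width)]


class Unit:
    """Hypothetical processing unit with one register under test.
    stuck_mask marks bits that are stuck at 0."""

    def __init__(self, name, stuck_mask=0):
        self.name = name
        self.stuck_mask = stuck_mask

    def run_self_test(self):
        """Write each walking-bit pattern and return the values read back."""
        return [p & ~self.stuck_mask for p in walking_bit_patterns()]


def cross_check(own_result, peer_result):
    """Evaluate the own result and the result received from the peer."""
    expected = walking_bit_patterns()
    return {"own_ok": own_result == expected, "peer_ok": peer_result == expected}


unit_a = Unit("A")
unit_b = Unit("B", stuck_mask=0x08)  # injected fault in unit B

result_a = unit_a.run_self_test()
result_b = unit_b.run_self_test()

# Each unit checks the result it received from the other one.
view_of_a = cross_check(result_a, result_b)
view_of_b = cross_check(result_b, result_a)
```

Unit A sees that its peer failed, and unit B sees that its own result is wrong, so the fault is reported even if one unit's own fault reaction is impaired.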
Fault handling
- Issue warning message: If a fault is detected, it immediately issues a warning message to notify the system administrator.
- Switch to safe mode: If necessary, the system switches to safe mode to minimize the impact of the fault.
Example
Example 1: Software cross-test between two ECUs in a vehicle
- Configuration: Two ECUs (electronic control units), A and B, each perform a software self-test.
- Self-test execution: Each ECU tests its own physical storage and functional units, such as data and address registers and instruction decoders, using a walking bit pattern.
- Result exchange and analysis: ECUs A and B exchange their test results with each other. Then, they compare the results and consider any discrepancies as faults.
- Fault handling: If any discrepancies are found, the ECU logs an error and issues a warning to the system administrator. If necessary, the ECU switches to safe mode to ensure the safety of the vehicle.
Example 2: Self-test between multiple processing units in an aircraft control system
- Configuration: There are multiple processing units in an aircraft control system, and each unit performs a self-test independently.
- Self-test execution: Each processing unit tests its own physical storage and functional units using test patterns, such as a walking bit pattern.
- Result Exchange and Analysis: All processing units exchange test results with each other and compare the results to check for inconsistencies.
- Fault Handling: If an inconsistency is found, the unit is considered to be defective, an alert message is immediately sent, and safety protocols are executed if necessary.
Limitations and Challenges
1. Soft Error Limitations
- Difficulty in Detecting Soft Errors: Soft errors are errors caused by environmental factors and are often difficult to detect with software-based testing.
2. Increased Test Complexity
- Complex Pattern Management: Managing and applying various test patterns is complex, which can increase development costs and time.
3. Difficulty in Real-Time Fault Handling
- Delay in Result Comparison: If the result comparison between two processing units is delayed, real-time fault handling can become difficult.
D.2.3.4 Software diversified redundancy (one hardware channel)
Software Diversified Redundancy (One Hardware Channel) is one of the safety mechanisms of the ISO 26262 standard, which is a method to increase the reliability and safety of the system through different software implementations within a single hardware channel. This mechanism focuses on early detection and handling of faults using two independent software paths.
Purpose
- Early fault detection: Ensure safety by detecting faults in the processing unit as early as possible through dynamic comparison of software.
Description
1. Implementation method
- Two software implementations: Detect faults using two different software implementations within a single hardware channel. Each software provides multi-angle diagnosis using different algorithms and codes.
- Hardware resource diversity: Utilizes different RAM and ROM memory areas to expand the diagnostic range and increase the possibility of fault detection.
2. How it works
- Primary Path and Redundant Path:
- Primary Path: This is the path where the main calculations are performed, and errors can cause serious problems.
- Redundant Path: This verifies the calculations of the primary path and takes action when a fault is detected.
- Compare Output Data: Once the calculations of the two software paths are complete, the output data is compared, and a fault message is generated if a difference is detected.
- Resynchronization and hysteresis: Resynchronize the two paths to protect against transient errors, and apply hysteresis and filtering to allow for minor differences.
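One way to picture the comparison with hysteresis is a comparator that tolerates small differences and reports a fault only after several consecutive mismatches. This is a sketch; the class name, the tolerance of 2, and the limit of 3 mismatches are assumptions, not values from the standard.

```python
class OutputComparator:
    """Compares primary and redundant outputs with tolerance and debouncing."""

    def __init__(self, tolerance, max_consecutive_mismatches):
        self.tolerance = tolerance
        self.max_consecutive = max_consecutive_mismatches
        self.mismatches = 0

    def update(self, primary, redundant):
        """Return True when a fault should be reported for this cycle."""
        if abs(primary - redundant) > self.tolerance:
            self.mismatches += 1
        else:
            self.mismatches = 0  # paths agree again: transient is forgiven
        return self.mismatches >= self.max_consecutive


comparator = OutputComparator(tolerance=2, max_consecutive_mismatches=3)
samples = [(100, 101), (100, 110), (100, 100), (100, 110), (100, 111), (100, 112)]
fault_flags = [comparator.update(p, r) for p, r in samples]
```

The single mismatch in the second sample is filtered out, while the three consecutive mismatches at the end raise the fault flag.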
3. Fault Handling
- Issue warning message: If a fault is detected, immediately issue a warning message to notify the system administrator.
- Switch to safe mode: If necessary, the system switches to safe mode to minimize the impact of the fault.
Examples of algorithm diversity
- Example of algorithm: Software diversity can be provided in various ways, such as having one path compute A + B = C while the other verifies C - B = A, or using normal arithmetic on one path and 2’s complement math on the other.
- Example of redundant path: A simple redundant path can be implemented that performs size or rate limit checks on the calculations of the primary path.
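Both ideas can be sketched in a few lines: a diverse check through the inverse operation, and a simple range and rate limit check. The function names and limit values are hypothetical, and the bit flip is injected by hand to show the effect.

```python
def primary_add(a, b):
    """Primary path: computes C = A + B."""
    return a + b


def redundant_check(a, b, c):
    """Diverse redundant path: verifies the result through C - B = A."""
    return c - b == a


def plausibility_check(value, previous, lower, upper, max_step):
    """Simple redundant path: range check plus rate limit check."""
    return lower <= value <= upper and abs(value - previous) <= max_step


c = primary_add(3, 4)
ok = redundant_check(3, 4, c)
corrupted_detected = not redundant_check(3, 4, c ^ 0x10)  # bit flip in C

in_limits = plausibility_check(50, previous=48, lower=0, upper=100, max_step=5)
jump_rejected = not plausibility_check(90, previous=48, lower=0, upper=100, max_step=5)
```

The limit check is cheaper than a full second calculation but only detects results that are implausible, not results that are wrong yet still within limits.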
Notes
- Common Cause Failure (CCF): An additional watchdog processor can be used to strengthen diagnostics against common cause failures that affect both the primary and redundant paths.
- In the absence of software diversity: If the redundant path is an exact copy of the primary path, or the primary path is simply executed twice, the mechanism covers soft errors (transient faults) but not permanent hardware faults or systematic software faults, because both executions would fail in the same way. Comparing the result with an expected output gives an easy and clear pass-fail criterion.
Examples
Example 1: Automotive Engine Control System
- Configuration: In an automotive engine control system, two software paths are implemented, each with a different algorithm, to control engine speed and fuel mixture.
- Primary Path:
- Algorithm: Controls the main functions of the engine using a traditional PID control algorithm.
- Role: Performs the main calculations required for the operation of the engine.
- Redundant Path:
- Algorithm: Uses a state-space model to verify the computational results of the primary path.
- Role: Adjusts the engine speed and fuel mixture if there is a difference compared to the computational results of the primary path.
- Compare Results: If the outputs of the two paths do not match, generate a fault message to notify the system administrator, and switch the engine to safe mode.
Example 2: Aircraft Flight Control System
- Configuration: Implement two software paths in the aircraft flight control system to manage the flight path and altitude control.
- Primary Path:
- Algorithm: Control the flight path of the aircraft using the Kalman filter.
- Role: Performs the main flight control function.
- Redundant Path:
- Algorithm: Use the Newton-Raphson method to verify the accuracy of the flight path and altitude control.
- Role: Compare the calculation results of the primary path, and adjust the flight path if an error is found.
- Compare Results: If the outputs of the two paths do not match, generate a warning message and switch the flight control to safe mode.
Limitations and Challenges
1. Common Cause Failures (CCFs)
- Risk of Common Failures: Common cause failures (CCFs) can affect the primary and redundant paths at the same time. In this case, additional watchdog processors can be used to enhance diagnostics.
2. Soft Error Coverage
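- Difficulty in detecting soft errors: Software diversity can effectively detect hardware errors, but coverage for soft errors may be limited. For example, a soft error that corrupts data shared by both paths (such as a common input value) leads to the same wrong result in both paths and may go undetected.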
3. Increased Complexity
- Complex Design and Implementation: Designs for implementing software diversity can be complex, which can increase development costs and time.
4. Performance Overhead
- Slower Processing Speed: Since the two software paths each execute different algorithms, processing speed may be slow.
D.2.3.5 Reciprocal comparison by software in separate processing units
Reciprocal Comparison by Software in Separate Processing Units is a safety mechanism specified in the ISO 26262 standard that uses software comparison between two independent processing units to increase the reliability and safety of a system. This technique focuses on detecting and responding to early faults by utilizing diversity in both hardware and software.
Objectives
- Early Fault Detection: Detect faults as early as possible through dynamic comparison of software in two processing units.
Description
1. Implementation Methods
- Processor Diversification: Implement independent paths using different processor types and memory areas. This method increases the diversity of the system by utilizing various hardware components.
- Result Exchange Interface: Design an interface for result exchange between two processing units to enable fast and accurate result comparison.
2. How it works
- Data Exchange: Two independent processing units exchange results, intermediate results, and test data to compare them.
- Result Comparison: The two processing units compare data and generate a fault message if a difference is detected. This approach allows for hardware and software diversity by using different processor types, algorithms, code, and compilers.
- Error Prevention Methods: To prevent errors due to differences between processors, factors such as loop jitter, communication delays, and processor initialization are considered.
- Multi-core implementation: It can be implemented using separate cores of a dual-core processor, in which case analysis is included to understand the common cause failure mode of both cores.
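A sketch of the comparison step, assuming each unit tags its result with a cycle counter so that communication delay or loop jitter is not mistaken for a computation fault. The `Message` type, the status strings, and the parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Result exchanged between the two processing units."""
    cycle: int   # control loop cycle in which the value was computed
    value: int   # result or intermediate result


def reciprocal_compare(own, peer, tolerance=0, max_cycle_skew=0):
    """Return 'ok', 'stale' (results from different cycles) or 'mismatch'."""
    if abs(own.cycle - peer.cycle) > max_cycle_skew:
        return "stale"
    if abs(own.value - peer.value) > tolerance:
        return "mismatch"
    return "ok"


agree = reciprocal_compare(Message(10, 500), Message(10, 501), tolerance=2)
delayed = reciprocal_compare(Message(10, 500), Message(8, 500), max_cycle_skew=1)
differ = reciprocal_compare(Message(10, 500), Message(10, 530), tolerance=2)
```

Separating the "stale" case from the "mismatch" case lets the system react differently to a communication problem and to a genuine disagreement between the two units.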
3. Fault Handling
- Issue warning message: If a fault is detected, immediately issue a warning message to notify the system administrator.
- Switch to safe mode: If necessary, the system switches to safe mode to minimize the impact of the fault.
Example
Example 1: Dual processor system of an autonomous vehicle
- Configuration: The two main processors of an autonomous vehicle each process driving data and exchange the results with each other to compare them.
- Processor A:
- Algorithm: Detects obstacles using AI-based image recognition algorithms.
- Role: It is responsible for the vehicle’s driving path and obstacle detection.
- Processor B:
- Algorithm: Calculates distance and object positions using traditional lidar data processing algorithms.
- Role: Verifies the results of Processor A and evaluates the accuracy of the driving path.
- Result comparison and error handling:
- If the results of the two processors do not match, a fault message is generated and the system administrator is notified.
- Safety measures such as path correction and emergency stop of the vehicle are implemented.
Example 2: Aircraft flight control system
- Configuration: There are two independent processing units in the aircraft flight control system, and each unit calculates and verifies the flight path through the data exchanged with each other.
- Processing unit 1:
- Algorithm: Uses a Newton-Raphson based algorithm to optimize the flight path.
- Role: Performs calculation of the basic flight path.
- Processing unit 2:
- Algorithm: Uses a Kalman filter based path prediction algorithm to verify the results of Processing unit 1.
- Role: Evaluates the stability of the flight path and issues a warning when an error occurs.
- Result comparison and response:
- If the flight paths do not match, the system issues a warning message and switches the flight to safe mode to correct the error.
Limitations and Challenges
1. Common Cause Failure (CCF)
- Risk of Common Failure: A common cause failure (CCF) can affect both processors at the same time, which can compromise the safety of the system.
2. Soft Error Coverage
- Difficulty in detecting soft errors: While hardware errors can be effectively detected through software comparison, the coverage for soft errors may be limited.
3. Increased Complexity
- Complex Design and Implementation: The design for comparing results between multiple processing units can be complex, which can increase development cost and time.
4. Performance Overhead
- Slower Processing Speed: Since the two processing units each execute different algorithms, the processing speed may be slow.
D.2.3.6 HW redundancy (e.g. Dual Core Lockstep, asymmetric redundancy, coded processing)
The HW Redundancy mechanism specified in the ISO 26262 standard plays an important role in ensuring reliability by detecting and handling faults early in safety-critical systems. This section describes how to improve the safety of a system by utilizing hardware-based redundancy technologies such as Dual Core Lockstep, asymmetric redundancy, and coded processing.
Aim
- Early Fault Detection: Detect faults in a processing unit as early as possible by comparing internal and external results while the two processing units operate in lockstep.
Description
1. Implementation Method
Dual Core Lockstep
- Configuration: Two symmetric processing units are included in one die, and perform redundant operations in lockstep.
- Operating Principle:
- Synchronous Execution: The two processing units execute the same instructions synchronously and compare the results of each instruction.
- Compare Results: If a mismatch occurs, an error condition is triggered and the error is usually resolved by resetting the system.
- Memory Address Lines and Configuration Registers: The high level of redundancy allows coverage to extend to memory address lines and configuration registers.
- Advantages:
- Easy Implementation: No separate code is required for parallel paths, and implementation is simple.
- Fault Detection: Effectively detects transient errors and ALU-related faults.
- Disadvantages:
- Performance Limitations: The two processing units provide only the performance of a single processing unit.
- Common Cause Failures: It does not fully cover common cause failures (CCFs), and these must be understood and addressed during design (e.g., common clock faults).
- Limitations:
- Undetected Systematic Faults: This approach by itself does not provide coverage for systematic faults.
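The lockstep principle can be modeled in software as two identical register sets that execute the same instruction stream, with a comparator after every step. This is only a simulation under assumptions: the tiny instruction set and the `fault_at_step` injection parameter are hypothetical, and real lockstep comparison is done in hardware.

```python
def execute(instruction, registers):
    """Execute one (op, dst, src_a, src_b) instruction and return the result."""
    op, dst, a, b = instruction
    if op == "add":
        registers[dst] = (registers[a] + registers[b]) & 0xFFFFFFFF
    elif op == "xor":
        registers[dst] = registers[a] ^ registers[b]
    else:
        raise ValueError(f"unknown operation: {op}")
    return registers[dst]


def run_lockstep(program, fault_at_step=None):
    """Run the program on two cores; return the step of the first mismatch,
    or None if all results agree."""
    core_a = [0, 1, 2, 3]
    core_b = [0, 1, 2, 3]
    for step, instruction in enumerate(program):
        result_a = execute(instruction, core_a)
        result_b = execute(instruction, core_b)
        if step == fault_at_step:
            result_b ^= 0x1  # inject a transient bit flip into core B
            core_b[instruction[1]] = result_b
        if result_a != result_b:
            return step  # comparator triggers the error condition
    return None


PROGRAM = [("add", 0, 1, 2), ("xor", 3, 0, 1), ("add", 2, 3, 3)]
no_fault = run_lockstep(PROGRAM)
detected_at = run_lockstep(PROGRAM, fault_at_step=1)
```

Because both cores run the same code, no second software path has to be written, which matches the "easy implementation" advantage listed above.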
Asymmetric Redundancy
- Composition: A dedicated processing unit, designed differently from the main processing unit, is coupled to the main processing unit through an interface so that internal and external results can be compared step by step.
- Operational Principle:
- Dedicated Processing Unit: There is a dedicated processing unit designed differently from the main processing unit, which may be smaller.
- Interface: The interface keeps the design simple and shortens error detection latency, allowing faster detection of faults affecting the processing unit register bank.
- Advantages:
- Provides Diversity: Hardware diversity provides effective coverage for common cause faults and systematic faults.
- Reduces Complexity: No separate code is required for parallel paths, and fault detection is faster.
- Disadvantages:
- Requires Complex Analysis: Detailed analysis may be required to demonstrate diagnostic coverage.
Coded Processing
- Composition: Processing units designed using special error recognition or error correction circuit techniques.
- Operational Principle:
- Error recognition/correction: Designed in a way that ensures high coverage of processor sub-units (e.g. ALU).
- Hardware and software coding: Combines hardware and software coding through approaches such as Vital Coded Processor.
- Advantages:
- High coverage: Ensures high coverage of limited functionality of small processors.
- Suitability: Suitable for processor sub-units such as ALU.
- Disadvantages:
- Detailed analysis required: Detailed analysis is required to demonstrate diagnostic coverage.
- Interface design: Designs interfaces between processors to enable fast and accurate comparison of results.
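As a simplified illustration of coded processing, the sketch below uses an AN code, one of the arithmetic codes on which approaches such as the Vital Coded Processor build (the full scheme adds further signatures and timestamps that are not shown here). The constant `A` and the function names are assumptions for this example.

```python
A = 58659  # code constant (assumption; any odd constant > 1 detects single bit flips)


def encode(x):
    """Encode a value as a multiple of A."""
    return A * x


def is_valid(code):
    """A code word is valid only if it is divisible by A."""
    return code % A == 0


def decode(code):
    """Decode a code word, raising if it has been corrupted."""
    if not is_valid(code):
        raise ValueError("corrupted code word")
    return code // A


def coded_add(code_a, code_b):
    """Addition works directly on code words: A*x + A*y = A*(x + y)."""
    return code_a + code_b


total = coded_add(encode(20), encode(22))
value = decode(total)
flipped = total ^ (1 << 7)          # single bit flip in the coded result
flip_detected = not is_valid(flipped)
```

A single bit flip changes the code word by a power of two, which is never a multiple of an odd constant `A`, so the corrupted word fails the divisibility check.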
2. How it works
- Result comparison: Compares results through redundant hardware paths, and if they do not match, it is considered a fault and appropriate action is taken.
- Error recognition and correction: Uses special error recognition and correction techniques to ensure data integrity.
3. Fault Handling
- Issue warning message: When a fault is detected, a warning message is issued immediately to notify the system administrator.
- Switch to safe mode: If necessary, the system switches to safe mode to minimize the impact of the fault.
Examples
Example 1: Dual-core lockstep in an automotive safety system
- Configuration: Implement dual-core lockstep in the airbag control system of a vehicle to ensure that the airbag is not deployed incorrectly.
- Operation principle:
- The two cores execute the same airbag deployment algorithm and compare the results of each step.
- If inconsistent results are found, the system immediately issues a warning and stops the airbag deployment.
- Advantages:
- Ensures passenger safety through immediate detection and prevention of airbag deployment errors.
Example 2: Asymmetric redundancy in industrial automation systems
- Configuration: A manufacturing robot controller with a main processing unit and a small dedicated processing unit.
- Operation principle:
- The main processing unit controls the movement of the robot, and the dedicated processing unit verifies the results to detect faults.
- The dedicated processing unit is designed to be smaller than the main processing unit, reducing complexity.
- Advantages:
- Increases safety and productivity through rapid error detection and warning in complex systems.
Example 3: Coded processing in communication equipment
- Configuration: Special error correction circuits are integrated into the data processing unit of the network router.
- Operational principle:
- The data processing unit uses an error correction algorithm to recover data lost during transmission.
- Minimizes network errors using advanced error recognition technology.
- Advantages:
- Increases network stability and improves data transmission efficiency.
Limitations and Challenges
1. Common Cause Failure (CCF)
- Risk of Common Fault: Common cause faults can occur even in redundant designs, requiring additional analysis to resolve them.
2. Increased complexity
- Complex design and implementation: Designs for implementing redundancy can be complex, increasing development costs and time.
3. Performance Overhead
- Slow Processing Speed: Processing speed may be slowed down due to redundancy, and optimizations are needed to compensate for this.
4. Analysis Needs
- Detailed Analysis Needs: Detailed analysis may be needed to prove diagnostic coverage, and this may require additional resources.
D.2.3.7 Configuration register test
The Configuration Register Test is a vital technique within the ISO 26262 standard, aimed at ensuring the integrity and reliability of configuration registers in processing units. These registers are crucial for the correct operation of various systems, as they store settings that control the behavior of the hardware. This test method is designed to detect and correct failures due to both hardware and software-related issues, such as stuck values, bit flips, incorrect values, or corruption.
Aim
- Early Defect Detection: Ensures the stability and reliability of the system by detecting defects in the configuration registers early.
Description
1. Read and Compare Settings
- Read Configuration Register Settings: Periodically read the settings of the configuration registers and compare them with the expected settings. These settings are predefined as masks or encoded expected values.
2. Compare with expected value
- Compare operation: Compare the current value of each register with the preset mask or encoded expected value. If a difference is found, a fault is considered to have occurred.
3. Fault correction
- Correct inconsistency: If a mismatch is found, the register is reloaded with the correct value.
- Recheck: After correction, recheck to ensure that the error is properly corrected.
4. Repeated check and reporting
- Handling consecutive errors: If the same error occurs consecutively, the system records the fault status and issues a warning to the system administrator if necessary. This warning can help prevent larger problems by requesting system maintenance or service.
5. Hardware and software faults
- Hardware faults: These include stuck-at faults due to hardware problems or bit flips due to soft errors.
- Software faults: Incorrect values may be stored or registers may be corrupted due to software errors.
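The read, compare, reload, and report cycle described above can be sketched as follows. The register names `CLK_DIV` and `WDT_CTRL`, their expected values and masks, and the limit of three consecutive errors are hypothetical; in this simulation the registers are plain dictionary entries, so the reload always succeeds.

```python
# Expected settings as (value, mask) pairs; names and values are hypothetical.
EXPECTED = {"CLK_DIV": (0x12, 0xFF), "WDT_CTRL": (0x01, 0x01)}


def check_config(registers, error_counts, limit=3):
    """One periodic check. Reloads mismatching registers and returns the
    names of registers whose error is persistent."""
    persistent = []
    for name, (value, mask) in EXPECTED.items():
        if (registers[name] & mask) == (value & mask):
            error_counts[name] = 0
            continue
        # Mismatch: reload the masked bits with the expected value.
        registers[name] = (registers[name] & ~mask) | (value & mask)
        error_counts[name] = error_counts.get(name, 0) + 1
        # Recheck after correction, and count consecutive errors.
        recheck_ok = (registers[name] & mask) == (value & mask)
        if not recheck_ok or error_counts[name] >= limit:
            persistent.append(name)
    return persistent


registers = {"CLK_DIV": 0x12, "WDT_CTRL": 0x81}
counts = {}
clean = check_config(registers, counts)

reports = []
for _ in range(3):
    registers["CLK_DIV"] ^= 0x01  # inject a bit flip before every check
    reports.append(check_config(registers, counts))
```

The first two corruptions are silently corrected; the third consecutive one is reported as persistent, which is where a warning or maintenance request would be raised.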
Example
Example 1: Testing the configuration registers of an automotive engine control unit (ECU)
- Configuration: The ECU of an automobile uses various configuration registers to optimize the performance and efficiency of the engine.
- Operational principle:
- Periodic inspection: The ECU periodically reads all configuration registers and compares them with predefined expected values.
- Check configuration match: If the settings do not match, the registers are reset to the correct values.
- Log errors and issue warnings: If the error persists, the ECU logs the fault and issues warnings to request maintenance if necessary.
- Advantages:
- Optimizes engine performance and prevents performance degradation due to incorrect settings. This can also contribute to improving fuel efficiency and reducing emissions.
Example 2: Testing the configuration registers of a network router
- Configuration: The network router manages network traffic and enhances security through various configuration registers.
- Operational principle:
- Periodic check: The router periodically checks all configuration registers to ensure they match the expected settings.
- Configuration correction: If an error is found, the registers are reloaded with the correct settings.
- Error management: If a persistent error occurs, the administrator is notified to take action to prevent security vulnerabilities.
- Advantages:
- Maintains network stability and minimizes security vulnerabilities due to misconfiguration. This is essential to prevent serious security issues such as data leakage, especially in critical networks.
Limitations and Challenges
1. Limited detection of soft errors
- Limitations of soft error detection: This method is best suited to permanent faults such as stuck-at bits and has limited coverage for soft errors. A bit flip that occurs between two periodic checks stays undetected until the next check, so a register can hold a wrong value for up to one test interval.
2. Performance overhead
- Decreased processing speed: The continuous check of registers may affect the system performance. This may be especially noticeable in real-time systems that require high performance.
3. Increased Complexity
- Complex Configuration and Analysis: The process of setting and analyzing expectations can be complex, and incorrectly set expectations can lead to false alarms.
4. Need for Automation
- Automation Need: Additional hardware and software resources may be required for continuous monitoring and test automation.
D.2.3.8 Stack over/under flow detection
The Stack Overflow/Underflow Detection mechanism is a critical safety feature detailed in the ISO 26262 standard. It is designed to detect abnormal stack behavior, such as excessive expansion (overflow) or reduction (underflow), to ensure the integrity and reliability of software systems, especially in safety-critical environments like automotive and aerospace applications.
Aim
- Early stack overflow/underflow detection: Detect abnormal stack expansion or reduction as early as possible to ensure the safety and reliability of the system.
Description
1. Implementation method
Setting predefined values
- Setting boundary values:
- Set a specific pattern or predefined value at the beginning and end of the stack. For example, use a pattern such as 0xDEADBEEF to clearly define the boundary.
- This boundary value defines the normal range of the stack, which allows detection of abnormal stack behavior.
Periodic Check Implementation
- Boundary Checking:
- The system periodically checks the stack boundary value to check for changes from the expected value.
- If a change in the boundary value is detected, it is considered that a stack overflow or underflow has occurred and an immediate response is taken.
Role of Memory Management Unit (MMU)
- When using MMU:
- If the write outside the stack boundary can be controlled by the memory management unit (MMU), protection is possible without software checking.
- The MMU blocks illegal memory access at the hardware level.
- When MMU is not present:
- Software-based stack checking is essential in systems without hardware-based protection.
- In this case, the stack checking routine periodically checks the boundary value.
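To illustrate what hardware-level blocking means, the following C sketch models the decision an MMU (or MPU) makes on each write: the access is allowed only if it falls entirely inside a writable region. It is a host-side model, not a driver for any real memory management unit. The type `region_t`, the function `write_permitted`, and the addresses in the usage note are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of a simplified memory map covering the half-open range [start, end). */
typedef struct {
    uintptr_t start;
    uintptr_t end;
    bool      writable;
} region_t;

/* Returns true only if the whole write [addr, addr + len) lies in one writable region. */
bool write_permitted(const region_t *regions, size_t count, uintptr_t addr, size_t len)
{
    assert(regions != NULL || count == 0u);
    for (size_t i = 0u; i < count; i++) {
        const region_t *r = &regions[i];
        if (r->writable && addr >= r->start && addr < r->end
            && len <= (size_t)(r->end - addr)) {
            return true;
        }
    }
    return false;  /* real hardware would raise a memory access fault here */
}
```

With a map such as `{ {0x1000, 0x2000, true}, {0x2000, 0x2100, false} }`, the second entry acts as a read-only guard region next to the stack: a write that runs past `0x2000` is rejected without any software boundary check.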
2. How it works
- Compare and fix:
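- If the stack boundary value does not match the expected value, an error is reported immediately and the affected task or system is returned to a defined state (for example, by a restart), because the check itself cannot repair data that has already been overwritten.
- This maintains the integrity of the stack state.
- Error Reporting and Response: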
- Issue a warning message to the system administrator and record it in the system log when a persistent error occurs.
- The system switches to safe mode if necessary to prevent further damage.
3. Fault Handling
- Issue a warning message:
- Issue a warning message immediately when a fault is detected to notify the system administrator.
- Switch to safe mode:
- If necessary, the system switches to safe mode to minimize the impact of the fault.
- This mode is used to protect other functions of the system and reduce potential damage.
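The fault handling steps above (warn, then switch to safe mode) can be sketched as a small state holder in C. This is an assumption-laden illustration: the names `fault_monitor_t`, `fault_monitor_init`, and `fault_monitor_update` are hypothetical, the warning is represented by a counter rather than a real message, and the policy of latching safe mode on the first detected fault is one possible choice, not something the standard prescribes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MODE_NORMAL, MODE_SAFE } system_mode_t;

typedef struct {
    system_mode_t mode;             /* current operating mode */
    uint32_t      warnings_issued;  /* stands in for warning messages and log entries */
} fault_monitor_t;

void fault_monitor_init(fault_monitor_t *m)
{
    assert(m != NULL);
    m->mode = MODE_NORMAL;
    m->warnings_issued = 0u;
}

/* Feed the result of one periodic stack check into the monitor.
 * A damaged guard issues a warning and latches safe mode; later
 * passing checks do not leave safe mode on their own. */
system_mode_t fault_monitor_update(fault_monitor_t *m, bool guards_intact)
{
    assert(m != NULL);
    if (!guards_intact) {
        m->warnings_issued++;
        m->mode = MODE_SAFE;
    }
    return m->mode;
}
```

Latching is used here because a corrupted stack cannot be trusted again until the system has been brought back to a known state.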
Example
Example 1: Stack Protection of an Automotive Engine Control Unit (ECU)
- Configuration:
- The ECU of an automobile uses stack memory for real-time processing and protects it by setting predefined values on the stack boundary.
- Operational Principle:
- Periodic Check: The ECU periodically checks the stack boundary value to detect abnormal changes.
- Overflow Response: If an overflow is detected, the ECU switches the engine control to safe mode to prevent damage to the vehicle.
- Advantages:
- Prevents crashes caused by stack overflow in real-time systems and enhances safety.
- Maintains engine performance and stability to ensure driver safety.
Example 2: Stack monitoring of aircraft flight control system
- Configuration:
- The aircraft flight control system manages multiple threads and monitors the stack boundary of each thread.
- Operation principle:
- Setting boundary values: Sets a predefined value to the stack boundary of each thread and performs periodic checks.
- Underflow response: When an underflow is detected, the system generates a warning message and safely terminates the thread.
- Advantages:
- Maintains stack stability in a multi-threaded environment and ensures flight safety.
- Increases safety by quickly handling errors that may occur during flight.
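The per-thread monitoring in Example 2 might look like the following C sketch. It assumes each thread descriptor holds pointers to the guard words at both ends of that thread's stack; the names `thread_stack_t` and `monitor_thread_stacks` are hypothetical, and "terminating" a thread is reduced to a state change so that the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_PATTERN 0xDEADBEEFu

typedef enum { THREAD_RUNNING, THREAD_TERMINATED } thread_state_t;

typedef struct {
    const volatile uint32_t *limit_guard;  /* guard word on the overflow side */
    const volatile uint32_t *base_guard;   /* guard word on the underflow side */
    thread_state_t           state;
    unsigned                 warnings;     /* stands in for warning messages */
} thread_stack_t;

/* Check every running thread once; returns how many threads were terminated
 * during this pass because one of their guard words was damaged. */
size_t monitor_thread_stacks(thread_stack_t *threads, size_t count)
{
    assert(threads != NULL || count == 0u);
    size_t terminated = 0u;
    for (size_t i = 0u; i < count; i++) {
        thread_stack_t *t = &threads[i];
        if (t->state != THREAD_RUNNING) {
            continue;  /* already stopped, nothing left to protect */
        }
        if (*t->limit_guard != GUARD_PATTERN || *t->base_guard != GUARD_PATTERN) {
            t->warnings++;
            t->state = THREAD_TERMINATED;
            terminated++;
        }
    }
    return terminated;
}
```

Only the thread whose guard is damaged is stopped; the other threads keep running, which matches the isolation goal of the example.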
Limitations and challenges
1. Performance overhead
- Performance degradation due to periodic checks:
- Continuous checks of stack boundaries can affect system performance, and periodic checks can cause delays, especially in real-time systems.
- Optimization should minimize performance degradation.
2. Limited Coverage
- Software Dependency:
- Software-based approach cannot perfectly detect hardware faults. It needs to be used in conjunction with hardware-based protection.
- This may limit comprehensive detection of various faults in the system.
3. Increased Implementation Complexity
- Complex Design and Implementation:
- Continuous checking of various stack boundaries and settings can complicate design and implementation.
- This increases the complexity of the system, which can lead to increased development time and cost.
4. Need for Memory Management
- Limitations on MMU Use:
- This technique is essential for systems without a Memory Management Unit (MMU), but can be avoided for systems with an MMU.
- Software-based protection mechanisms are required in the absence of an MMU.
D.2.3.9 Integrated hardware consistency monitoring
Integrated Hardware Consistency Monitoring is a critical safety mechanism outlined in the ISO 26262 standard. It is designed to detect illegal conditions within processing units, allowing systems to address and mitigate errors effectively. By leveraging hardware exceptions, this approach provides a robust method for handling both systematic failures and certain types of random hardware faults. Here’s a detailed explanation of its implementation, benefits, and challenges.
Aim
- Early Illegal Condition Detection: Detect illegal conditions that may occur in processing units as early as possible to maintain system stability.
Description
1. Implementation Method
Hardware Exception Setting
- Pre-setting exception conditions:
- Prepare to respond to various types of errors by pre-setting various exception conditions in the processor.
- Exception conditions include division by zero, illegal instruction code execution, and unauthorized memory access.
Interrupt Handler Development
- Interrupt Handler:
- Develop an interrupt handler corresponding to each exception condition, and design it to respond immediately when an error occurs.
- The interrupt handler catches the error and safely isolates the system state.
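The pairing of exception conditions with handlers can be modelled in C as a table of function pointers. On real hardware the vector table layout and the trap entry are processor-specific, so this is a host-side simulation only: `exception_dispatch` stands in for the processor raising the exception, and all names (`exception_id_t`, `vector_table`, `exceptions_init`, `exception_register`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    EXC_DIVIDE_BY_ZERO,
    EXC_ILLEGAL_INSTRUCTION,
    EXC_MEMORY_ACCESS,
    EXC_COUNT
} exception_id_t;

typedef void (*exception_handler_t)(exception_id_t id);

static exception_handler_t vector_table[EXC_COUNT];
static unsigned exception_counts[EXC_COUNT];
static int system_isolated;  /* set once a handler has isolated the system state */

/* Default handler: record the event and mark the system as isolated. */
static void default_handler(exception_id_t id)
{
    exception_counts[id]++;
    system_isolated = 1;
}

/* Pre-set a handler for every exception condition. */
void exceptions_init(void)
{
    for (unsigned i = 0u; i < (unsigned)EXC_COUNT; i++) {
        vector_table[i] = default_handler;
        exception_counts[i] = 0u;
    }
    system_isolated = 0;
}

/* Install a specific handler; returns 0 on success, -1 on invalid arguments. */
int exception_register(exception_id_t id, exception_handler_t handler)
{
    if ((unsigned)id >= (unsigned)EXC_COUNT || handler == NULL) {
        return -1;
    }
    vector_table[id] = handler;
    return 0;
}

/* Simulates the hardware raising exception `id` and running its handler. */
void exception_dispatch(exception_id_t id)
{
    assert((unsigned)id < (unsigned)EXC_COUNT);
    vector_table[id](id);
}
```

Installing a default handler for every entry means that no exception condition is left without a defined reaction, even before application-specific handlers are registered.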
2. How it works
1. Hardware exception trigger:
- Exception occurrence: The processor triggers a hardware exception when an illegal condition occurs. These exceptions are generated automatically by the processor’s protection mechanism.
- Typical exceptions: Includes division by zero, illegal instruction code execution, and unauthorized memory access.
2. Interrupt handling:
- Immediate response: When a hardware exception occurs, the system immediately executes an interrupt handling routine to react to the error.
- Error isolation: The interrupt handler catches the error and immediately takes action to stabilize the system state.
3. Fault Detection and Isolation:
- Error Logging: Detected errors are logged in the system log, and isolation and correction procedures are initiated as needed.
- Critical Error Response: If a critical error occurs, the system will switch to safe mode or restart to resolve the issue.
4. System Recovery:
- Recovery Attempt: After the error is handled, the system will attempt to recover and return to normal operation.
- Repeated Error Response: If a repetitive error occurs, additional diagnostic and correction actions are required.
5. Systematic Failure Detection:
- Hardware Monitoring: The mechanism is primarily used to detect systematic failures.
- Random Hardware Fault Detection: It can also detect certain types of random hardware faults.
6. Coding Error Coverage:
- Limited Coverage: Coverage for some coding errors is low, but using the mechanism is still good design practice.
- Design Practices: Implement good design practices for efficient system design and error handling.
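The logging, critical error response, and repeated error steps described above can be sketched as one decision function in C. The log capacity, the repeat threshold, and the names `error_log_t`, `error_log_init`, and `handle_error` are assumptions chosen for illustration; a real system would derive its own thresholds from its safety analysis.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAPACITY 8u  /* hypothetical number of retained log entries */
#define REPEAT_LIMIT 3u  /* hypothetical threshold for a "repeated" error */

typedef enum { ACTION_RECOVER, ACTION_DIAGNOSE, ACTION_SAFE_MODE } recovery_action_t;

typedef struct {
    unsigned code;      /* identifies the exception condition */
    int      critical;  /* non-zero for critical errors */
} error_record_t;

typedef struct {
    error_record_t entries[LOG_CAPACITY];
    unsigned next;   /* slot that the next record is written to */
    unsigned count;  /* number of valid entries, at most LOG_CAPACITY */
} error_log_t;

void error_log_init(error_log_t *log)
{
    assert(log != NULL);
    log->next = 0u;
    log->count = 0u;
}

/* Log the error, then choose a reaction: critical errors go to safe mode,
 * errors seen REPEAT_LIMIT times in the retained log need further diagnosis,
 * and everything else is followed by a normal recovery attempt. */
recovery_action_t handle_error(error_log_t *log, unsigned code, int critical)
{
    assert(log != NULL);
    log->entries[log->next].code = code;
    log->entries[log->next].critical = critical;
    log->next = (log->next + 1u) % LOG_CAPACITY;  /* oldest entry is overwritten when full */
    if (log->count < LOG_CAPACITY) {
        log->count++;
    }
    if (critical) {
        return ACTION_SAFE_MODE;
    }
    unsigned repeats = 0u;
    for (unsigned i = 0u; i < log->count; i++) {
        if (log->entries[i].code == code) {
            repeats++;
        }
    }
    return (repeats >= REPEAT_LIMIT) ? ACTION_DIAGNOSE : ACTION_RECOVER;
}
```

Because the log is a fixed-size ring, repeats are only counted among the most recent entries, which keeps memory use bounded at the cost of forgetting older occurrences.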
3. Fault Handling
- Issue warning messages:
- When a fault is detected, a warning message is issued immediately to notify the system administrator.
- Switch to safe mode:
- When necessary, the system switches to safe mode to minimize the impact of the fault.
- This is used to protect other functions of the system and reduce potential damage.
Example
Example 1: Hardware monitoring in an automotive electronic control unit (ECU)
- Configuration:
- An automotive ECU monitors the operation of various sensors and actuators and detects abnormal conditions using hardware exceptions.
- Operational principle:
- Error detection: The ECU triggers a hardware exception when an error occurs during sensor data processing, such as a divide by zero.
- Interrupt handling: An interrupt handler handles these exceptions to stabilize the system state.
- Advantages:
- Maintains stable vehicle operation through immediate error detection and isolation.
- Real-time monitoring is possible for safe vehicle operation.
Example 2: Hardware consistency monitoring of industrial automation systems
- Configuration:
- In industrial automation systems, various error conditions of the processor are monitored to ensure the safety and efficiency of the system.
- Operation principle:
- Instruction code error: The processor triggers a hardware exception when an incorrect instruction code is executed.
- Safe mode transition: The system immediately isolates the error and, if necessary, switches to safe mode to continue operation.
- Advantages:
- Increases safety in industrial environments and minimizes production interruptions due to errors.
- Ensures stable and efficient operation of equipment.
Limitations and challenges
1. Coding error coverage
- Limited coverage:
- Some coding errors cannot be fully detected as hardware exceptions and may require additional software verification.
- This means that additional testing and verification procedures are required during the design phase.
2. Performance Overhead
- Performance Degradation Due to Interrupt Handling:
- Constant interrupt handling and error checking can impact system performance, which should be mitigated through performance optimization.
- Real-time systems require strategies to minimize performance degradation.
3. Complex Design Requirements
- Handling Various Error Conditions:
- Design and implementation to handle various error conditions can be complex, and thorough planning is required in the early design phase.
- This means that complex designs are required to handle various exception conditions.
4. Hardware Dependency
- Hardware Limitations:
- Some older processors may not support the latest hardware exception mechanisms, and hardware upgrades may be required.
- Designs that consider compatibility with the latest processors are required.
If you are interested in other articles about the ISO 26262 series, please refer to the links below!
[ISO 26262] #1. Part4-6 Technical Safety Concept (TSC)
[ISO 26262] #2. Safety Mechanisms for Electrical and Electronic