Safety Mechanisms for Communication Bus
In this post, we will learn about Safety Mechanisms used in Communication Bus. The Safety Mechanisms described in this post are based on ISO 26262-5:2018 Annex D.
D.2.5.1 One-bit hardware redundancy
The One-bit Hardware Redundancy technique is a fundamental safety measure outlined in ISO 26262 that focuses on detecting bit failures in communication buses. This method enhances the integrity of data transmission by adding a simple redundancy mechanism, namely a parity bit, to the data stream. Here’s a detailed overview of how this technique works, its purpose, and practical applications.
Aim
- Detecting each odd-bit failure: A single parity bit detects every failure that flips an odd number of bits, which amounts to 50% of all possible bit failures in the data stream.
Description
1. Adding a parity bit:
- During data transmission, a parity bit is added to each data packet.
- The parity bit is set so that the total number of 1 bits, including the parity bit, is even (even parity) or odd (odd parity).
2. Performing a parity check:
- The receiving end checks the parity bit of the data packet to determine whether a bit error occurred during transmission.
- If the bit sum does not match the parity bit, an error is detected.
3. Error Detection and Reporting:
- Error Detection: If the parity check result is different from the expected one, the system detects an error.
- Error Reporting: If an error is detected, it immediately generates a warning message and requests retransmission if necessary.
4. Recovery:
- A single parity bit cannot locate or repair the corrupted bit, so recovery relies on discarding the damaged data packet and retransmitting it to maintain integrity.
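The parity generation and check described above can be sketched in a few lines of Python. This is a minimal illustration rather than a hardware implementation; the function names `parity_bit` and `check_parity` are hypothetical and chosen only for this example.

```python
def parity_bit(data: int, even: bool = True) -> int:
    """Return the parity bit for `data` (even parity by default)."""
    ones = bin(data).count("1")
    bit = ones % 2              # 1 if the data holds an odd number of 1 bits
    return bit if even else bit ^ 1


def check_parity(data: int, received_parity: int, even: bool = True) -> bool:
    """Return True if the received parity bit matches the received data."""
    return parity_bit(data, even) == received_parity


byte = 0b1011_0010                  # contains four 1 bits
p = parity_bit(byte)                # even parity, so the parity bit is 0
single_flip = byte ^ 0b0000_0100    # one bit corrupted in transit
double_flip = byte ^ 0b0000_0110    # two bits corrupted in transit

print(check_parity(byte, p))          # True: no error
print(check_parity(single_flip, p))   # False: odd number of flips is detected
print(check_parity(double_flip, p))   # True: even number of flips goes unnoticed
```

The last line shows why the coverage is limited: two flipped bits cancel each other out in the parity sum.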
Example
Example 1: Parity Bit Implementation in UART
- Configuration: Parity bits are used in UART (Universal Asynchronous Receiver-Transmitter) to improve the reliability of data transmission.
- Operating Principle:
- UART adds a parity bit to each data byte and transmits it.
- The receiving end checks the parity bit to determine if an error occurred during transmission.
- If a parity mismatch is detected, the UART flags a parity error; requesting retransmission is left to the software or higher-level protocol.
- Advantages:
- Improves the reliability of UART communication and minimizes data transmission errors.
Example 2: Application-Level Parity on a Vehicle CAN Bus
- Configuration: The CAN (Controller Area Network) protocol itself protects each frame with a 15-bit CRC rather than a parity bit, so parity here is an additional application-level check: the sender reserves one payload bit as a parity bit over a safety-related signal.
- Operation Principle:
- The sender computes the parity bit over the signal and places it in the payload.
- The receiver recomputes the parity over the received signal and compares it with the received parity bit.
- If a mismatch is found, the receiver discards the signal and records the error.
- Advantages:
- Adds a low-cost check on the path between the application software and the CAN controller, which the frame CRC does not cover.
Limitations and Challenges
1. Limited Error Coverage
- Cannot Detect Even Bit Errors: A parity bit detects only an odd number of flipped bits; an even number of flipped bits leaves the parity unchanged and goes undetected.
2. Additional Data Overhead
- Overhead due to Parity Bits: Adding parity bits increases data overhead and may affect transmission speed.
3. Limited Error Correction Capability
- Focus on Error Detection: Parity bits are primarily used for error detection, and additional mechanisms for error correction may be required.
D.2.5.2 Multi-bit hardware redundancy
The Multi-bit Hardware Redundancy technique, as detailed in ISO 26262, is a comprehensive safety measure designed to enhance the reliability of communication systems. By extending the communication bus with additional lines and employing advanced error-detection codes, this method aims to detect and correct a wide range of transmission errors, ensuring the integrity and safety of data communication in both parallel and serial transmission links.
Aim
- Error detection in bus communication and serial transmission links: Ensures the integrity of data by detecting various errors that may occur during communication at an early stage.
Description
- Multi-bit line extension:
- Using additional lines: Adding two or more bits to the communication bus to detect and correct errors that may occur during data transmission.
- Use of block codes:
- Application of block codes: Errors are detected using block codes such as Hamming codes, Reed-Solomon codes, CRC (Cyclic Redundancy Check), and Low-Density Parity Check (LDPC) codes.
- Error correction function: These codes go beyond simple error detection and also provide the ability to automatically correct some errors.
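As an illustration of a block code that both detects and corrects, here is a minimal Hamming(7,4) sketch in Python. The bit layout (parity bits at positions 1, 2 and 4) is the textbook arrangement, and the function names are hypothetical; real bus hardware would implement this in logic rather than software.

```python
def hamming74_encode(d: list) -> list:
    """Encode 4 data bits into the codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword: list) -> tuple:
    """Return (data bits, error position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3   # syndrome equals the 1-based position of a single flipped bit
    if pos:
        c[pos - 1] ^= 1          # correct the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], pos


data = [1, 0, 1, 1]
sent = hamming74_encode(data)
received = list(sent)
received[4] ^= 1                 # single-bit error at position 5

decoded, error_pos = hamming74_decode(received)
print(decoded == data, error_pos)   # True 5
```

A plain Hamming(7,4) code miscorrects double-bit errors, which is why extended variants add an overall parity bit to detect them.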
Example
Example 1: Hamming code application in network communication
- Configuration: Hamming codes are used in network communication to detect and correct data transmission errors.
- Operation principle:
- The transmitter generates and adds a Hamming code to the data packet.
- The receiver uses the Hamming code to check the integrity of the data and correct single-bit errors.
- If a multi-bit error occurs, the receiver reports the error and requests retransmission.
- Advantages:
- It improves the reliability of network data transmission and minimizes data loss.
Example 2: Data Transmission Error Management of Reed-Solomon Code
- Configuration: Enhance the reliability of data transmission using Reed-Solomon code.
- Operation Principle:
- The transmitter adds Reed-Solomon code to the data packet and transmits it.
- The receiver uses Reed-Solomon code to check the integrity of the data and correct multi-bit errors.
- If the error cannot be corrected, the system requests retransmission.
- Advantages:
- Improves error detection and correction capabilities in various data transmission environments.
Limitations and Challenges
1. Increased Complexity
- Complexity of Block Code: The process of generating and managing block code can be complex, which can add additional burden to system design.
2. Performance Overhead
- Additional Data Overhead: The data overhead caused by block code can affect system performance, and optimization is required to minimize it.
3. Limited Error Correction Capability
- Block Code Limitations: Some block codes can only correct certain types of errors, and cannot completely correct all errors.
4. Hardware Requirements
- Hardware Support Required: Hardware that supports block codes is required, and system upgrades may be required.
D.2.5.3 Complete hardware redundancy
Complete Hardware Redundancy is a robust safety measure outlined in ISO 26262 that focuses on detecting failures in communication systems by duplicating the communication bus. This technique leverages dual-channel configurations to compare signals across parallel buses, thereby enhancing the reliability and integrity of data transmission in safety-critical applications.
Aim
- Fault detection during communication: Comparing signals between two buses to detect various errors that may occur during communication at an early stage.
Description
1. Dual bus configuration:
- The same data is transmitted in parallel over two independent buses.
- Each bus uses an independent channel to convey the same data.
2. Signal comparison:
- The signals of the two buses are compared at the receiving end to determine if they match.
- If the signals do not match, an error is detected and immediate action is taken.
3. Error Detection and Response:
- Error Detection: If a signal mismatch occurs between the two buses, it is considered an error.
- Error Reporting and Handling: If an error is detected, a warning message is generated and, if necessary, data retransmission is requested.
4. Automatic Recovery and Adjustment:
- Comparison alone shows that the two buses disagree but not which one is faulty, so recovery typically means retransmitting the data or re-establishing the damaged path.
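The comparison step at the receiving end can be sketched as follows. This is a simplified illustration that assumes both channels deliver complete frames as byte strings; `receive_redundant` is a hypothetical name.

```python
from typing import Optional


def receive_redundant(channel_a: bytes, channel_b: bytes) -> Optional[bytes]:
    """Accept data only if both redundant channels delivered identical bytes."""
    if channel_a != channel_b:
        return None          # mismatch: treat as a detected communication fault
    return channel_a


frame = b"\x10\x27\x00"
print(receive_redundant(frame, frame))              # data accepted
print(receive_redundant(frame, b"\x10\x67\x00"))    # None: channels disagree
```

Note that a common-cause fault that corrupts both channels in the same way would still pass this check, which is why independence of the two buses matters.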
Example
Example 1: Dual Channel FlexRay Implementation
- Configuration: The FlexRay communication system uses dual channels to ensure the integrity of data transmission.
- Operating Principle:
- The FlexRay system transmits the same data over two independent channels.
- The receiving end compares the data from the two channels to determine if they match.
- If a mismatch is detected, the system logs the error and attempts to retransmit the data.
- Advantages:
- Increases the reliability of FlexRay communication and minimizes data transmission errors.
Example 2: Dual Bus Configuration in Automotive Safety Systems
- Configuration: Dual buses are used in automotive safety systems (e.g. ABS) to detect signal integrity.
- Operation Principle:
- The ABS system transmits the same control signals over two independent buses.
- The receiver compares the signals on the two buses to determine if they match.
- If a mismatch occurs, the ABS system immediately issues a warning and stops or adjusts the operation.
- Advantages:
- Increases the reliability of the vehicle safety system and prevents abnormal braking behavior.
Limitations and Challenges
1. Increased Cost
- Hardware Cost: The additional hardware required to implement dual channels can increase the cost.
2. Increased Complexity
- Design Complexity: Implementing a dual channel configuration can increase the complexity of the system design and requires thorough planning in the early stages of the design.
3. Performance Overhead
- Additional Data Processing Requirements: The additional data processing required by dual channels can impact system performance.
4. Limited Coverage
- Limited Coverage for All Error Types: Dual channels are effective for certain types of errors, but they cannot perfectly detect all errors.
D.2.5.4 Inspection using test patterns
Inspection Using Test Patterns is a diagnostic technique outlined in ISO 26262 that focuses on detecting static failures, such as stuck-at faults and cross-talk, in data paths. This approach uses predefined test patterns to evaluate the integrity of data paths, ensuring that systems maintain their functional reliability and safety.
Aim
- Early detection of static faults and crosstalk: Ensures the safety and integrity of the system by detecting stuck-at faults and signal interference (crosstalk) that may occur in the data path as early as possible.
Description
1. Define test patterns:
- Establish test patterns that cover all critical states of the data path.
- For example, check the integrity of the signal using specific bit sequences.
2. Periodic test execution:
- Apply test patterns periodically to the data paths of the system and observe the results.
- Compare the observed results with the predefined expected values and evaluate whether there are any differences.
3. Fault Detection and Reporting:
- Fault Detection: Faults are detected when there is a discrepancy from the expected value.
- Error Reporting: Detected faults are reported immediately and actions are taken if necessary.
4. Automatic Correction and Response:
- Automatically recover data for detected faults or re-establish damaged paths to maintain the integrity of the system.
5. Independence of Test Patterns:
- Importance of Independence: Test coverage depends on the independence between test pattern information, reception, and evaluation.
- Design Excellence: In a good design, test patterns do not adversely affect the functional behavior of the system.
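One common choice of test pattern is a walking-one / walking-zero sequence, in which each line is driven opposite to all the others in turn. The sketch below assumes an 8-bit data path modelled as a Python function; the fault models `stuck_at_1_bit3` and `crosstalk_0_to_1` are hypothetical and exist only to show what the test catches.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def walking_patterns(width: int = WIDTH) -> list:
    """Walking-one patterns followed by walking-zero patterns."""
    ones = [1 << i for i in range(width)]
    zeros = [~p & ((1 << width) - 1) for p in ones]
    return ones + zeros


def run_pattern_test(path, width: int = WIDTH) -> list:
    """Send every pattern through `path`; return the patterns that came back altered."""
    return [p for p in walking_patterns(width) if path(p) != p]


def healthy(word: int) -> int:
    return word & MASK


def stuck_at_1_bit3(word: int) -> int:
    # Hypothetical fault: line 3 always reads 1.
    return (word | 0b0000_1000) & MASK


def crosstalk_0_to_1(word: int) -> int:
    # Hypothetical fault: a 1 on line 0 also pulls line 1 high.
    return (word | ((word & 1) << 1)) & MASK


print(run_pattern_test(healthy))                  # []
print(len(run_pattern_test(stuck_at_1_bit3)))     # number of failing patterns
print([bin(p) for p in run_pattern_test(crosstalk_0_to_1)])
```

The patterns are generated and evaluated separately from the path under test, in line with the independence requirement described above.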
Examples
Example 1: Application of Test Patterns in Network Communications
- Configuration: Network routers periodically use test patterns to check the integrity of the data path.
- Operational Principle:
- The router periodically transmits a specific bit sequence on the network data path.
- The received data pattern is compared to the expected value to evaluate whether the signal is consistent.
- If a mismatch occurs, the router logs the error and issues a warning.
- Advantages:
- Increases the reliability of network data transmission, minimizes signal interference and static faults.
Example 2: Test Pattern Inspection of an Automotive ECU
- Configuration: The electronic control unit (ECU) of an automobile periodically uses test patterns to verify the integrity of the sensor and actuator paths.
- Operational Principle:
- The ECU periodically executes test patterns for sensor inputs and actuator outputs.
- Compares the observed results with the expected values to verify that the signals are within the acceptable range.
- If a mismatch is detected, the ECU issues a warning and adjusts its operation if necessary.
- Advantages:
- Increases the reliability of the vehicle system, and reduces the risk of incorrect sensor or actuator signals.
Limitations and Challenges
1. Limitations of Test Patterns
- Limited Coverage: Depending on the design and implementation of the test pattern, some faults may not be fully detected.
2. Increased design complexity
- Implementation in complex systems: In complex systems, the design and implementation of test patterns can be more complex and require thorough planning in the early design stages.
3. Performance overhead
- Performance degradation due to periodic testing: Periodic testing can impact system performance and should be minimized through optimization.
4. Limitations of data flow independence
- Need for data flow variation: Data flow must change during diagnostic test intervals, and static data may be ineffective for detection.
D.2.5.5 Transmission redundancy
Transmission Redundancy is a technique outlined in ISO 26262 designed to detect transient failures in bus communication by transmitting the same information multiple times. This redundancy ensures that temporary issues do not lead to incorrect data being processed or stored, thereby enhancing the reliability of communication systems.
Aim
- Detection of transient failures in bus communication: Ensures data integrity by detecting transient errors that may occur in bus communication at an early stage.
Description
- Repeatability of information transmission:
- Multiple transmission: Detects transient errors that may occur during transmission by transmitting the same information multiple times in succession.
- Focus on transient failures: This technique is mainly effective for detecting transient failures and is not suitable for detecting permanent failures.
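A receiver-side check for repeated transmissions can be sketched as below. The sketch assumes the sender transmits each message three times and that the receiver collects the copies in a list; the names `receive_repeated` and `stuck_bit` are hypothetical.

```python
from typing import Optional


def receive_repeated(copies: list) -> Optional[bytes]:
    """Accept a message only if every repeated copy is identical."""
    if len(copies) < 2:
        raise ValueError("need at least two copies to compare")
    first = copies[0]
    return first if all(c == first for c in copies[1:]) else None


def stuck_bit(frame: bytes) -> bytes:
    # Hypothetical permanent fault: bit 0 of every byte is stuck at 1.
    return bytes(b | 0x01 for b in frame)


msg = b"\x20\x40"
transient = [msg, bytes([msg[0] ^ 0x08, msg[1]]), msg]   # second copy hit by a glitch
permanent = [stuck_bit(msg)] * 3                         # every copy corrupted the same way

print(receive_repeated([msg, msg, msg]))   # accepted
print(receive_repeated(transient))         # None: transient error detected
print(receive_repeated(permanent))         # accepted although wrong
```

The last case illustrates the limitation: a permanent fault corrupts every copy identically, so the copies still agree with each other.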
Example
Example 1: Application of Transmission Redundancy in Vehicle CAN Bus
- Configuration: Enhance the reliability of data communication by utilizing transmission redundancy in the CAN (Controller Area Network) bus of the vehicle.
- Operation Principle:
- The sender transmits each data message multiple times over the CAN bus so that transient errors during transmission can be detected.
- The receiving side checks whether the same message is received multiple times, and if a mismatch is detected, it is considered an error.
- If a mismatch occurs, the receiver records the error and the data is retransmitted.
- Advantages:
- Enhance the reliability of data communication within the vehicle and minimize data loss due to transient errors.
Example 2: Transmission Redundancy in Industrial Automation Systems
- Configuration: Enhance the reliability of the system by repeatedly transmitting important control commands in industrial automation systems.
- Operation Principle:
- Important control commands are transmitted multiple times, and the receiving side checks whether all commands are consistent.
- If a mismatch is detected, the system issues a warning and requests command retransmission.
- Advantages:
- Increases system reliability and prevents production outages due to temporary communication errors.
Limitations and Challenges
1. Limited error coverage
- Inability to detect permanent faults: Transmission redundancy is mainly effective in detecting temporary faults, and has limitations in detecting permanent faults.
2. Performance overhead
- Additional data transmission requirements: Multiple transmissions of data may affect system performance, and optimization is required to minimize this.
3. Increased design complexity
- Complex system implementation: Implementation of multiple transmissions and comparison mechanisms may increase the complexity of system design.
4. Increased data transmission volume
- Increased transmission bandwidth requirements: Repeated transmissions increase transmission bandwidth, which may burden network resources.
D.2.5.6 Information redundancy
The Information Redundancy technique outlined in ISO 26262 is designed to detect failures in bus communication by adding redundancy to the transmitted data. This redundancy is achieved through the use of checksums or cyclic redundancy checks (CRC), which help verify the integrity of the data and provide coverage against various types of communication errors. The method is especially effective in addressing transient and systematic failures, including data corruption and masquerading attacks.
Aim
- Bus communication fault detection: Ensures the integrity and reliability of data by detecting various errors that may occur during data transmission at an early stage.
Description
- Block-based data transmission:
- Use of checksums and CRC: Detects errors by transmitting a checksum or CRC value calculated together with the data block.
- Receiver-side recalculation: The receiver recalculates the checksum or CRC value of the received data and compares it with the transmitted value.
- CRC design elements:
- Data length: The coverage of the CRC depends on the data length, the number of CRC bits, and the polynomial used.
- Polynomial design: The CRC can be designed to handle common communication error modes in hardware (e.g. burst errors).
- Include message ID: Include the message ID in the checksum/CRC calculation to provide coverage against corruption of that part (masquerading).
- Error detection and coverage:
- Low coverage: Overall low coverage of data transmission failure modes (Hamming distance 2 or less).
- Medium coverage: Overall medium coverage of data transmission failure modes (Hamming distance 3 or greater).
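The sender-side calculation and receiver-side recalculation can be sketched with a bitwise CRC. The example below uses the 15-bit polynomial 0x4599 that this post lists for CAN, with a zero initial value and no final XOR. It is an application-level illustration, not a model of a CAN controller; the helpers `protect` and `verify`, and the choice of a 2-byte message ID, are assumptions made for this example.

```python
def crc_bits(data: bytes, poly: int, width: int) -> int:
    """Bitwise MSB-first CRC with zero initial value and no final XOR."""
    mask = (1 << width) - 1
    crc = 0
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            feedback = ((crc >> (width - 1)) & 1) ^ bit
            crc = (crc << 1) & mask
            if feedback:
                crc ^= poly
    return crc


CAN_POLY, CAN_WIDTH = 0x4599, 15


def protect(message_id: int, payload: bytes) -> int:
    """CRC over the message ID followed by the payload (ID included for masquerade coverage)."""
    return crc_bits(message_id.to_bytes(2, "big") + payload, CAN_POLY, CAN_WIDTH)


def verify(message_id: int, payload: bytes, received_crc: int) -> bool:
    """Receiver side: recalculate the CRC and compare with the transmitted value."""
    return protect(message_id, payload) == received_crc


payload = b"\x01\x02\x03"
crc = protect(0x123, payload)

print(verify(0x123, payload, crc))            # True: data and ID intact
print(verify(0x123, b"\x01\x06\x03", crc))    # False: payload bit flipped
print(verify(0x124, payload, crc))            # False: frame arrived under a different ID
```

Because the ID is part of the protected data, a frame that arrives under the wrong ID fails the check even when its payload is intact.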
Example
Example 1: Embedding message information with CRC value
- Description: Hamming distance 2 when using CRC value embedded in message information, CRC size 5 bits, polynomial 0x12.
- Data length: 2,048 bits or less.
- Application: The transmitter includes the CRC value, and the receiver calculates and compares the CRC value to verify the data.
Example 2: CRC value used in LIN bus
- Description: Hamming distance 4 when using CRC value embedded in message information, CRC size 8 bits, polynomial 0x97.
- Data length: 119 bits or less.
- Application: The transmitter includes the CRC value, and the receiver calculates and compares the CRC value to verify the data.
Example 3: CRC value embedded in message information and ID
- Description: Hamming distance 4 when using message information and message ID embedded CRC value, CRC size 10 bits, polynomial 0x319.
- Data length: 501 bits or less.
- Application: The transmitter includes the CRC value, and the receiver calculates and compares the CRC value to verify the data.
Example 4: CRC value used in CAN
- Description: Hamming distance 5 when using message information and ID embedded CRC value, CRC size 15 bits, polynomial 0x4599.
- Data length: 127 bits or less.
- Application: Burst errors of up to 15 bits can be detected. The transmitter includes the CRC value, and the receiver calculates and compares the CRC value to verify the data.
Example 5: CRC values used in FlexRay’s frame CRC
- Description: Hamming distance 6 when using CRC value embedded in message information, CRC size 24 bits, polynomial 0x5D6DCB.
- Data length: Hamming distance 6 for payloads up to 248 bytes, Hamming distance 4 for payloads longer than 248 bytes.
- Application: The transmitter includes the CRC value, and the receiver calculates and compares the CRC value to verify the data.
Example 6: CRC values used in FlexRay’s header CRC
- Description: Hamming distance 6 when using CRC value embedded in message header and message ID, CRC size 11 bits, polynomial 0x385.
- Data length: 20 bits or less.
- Application: The transmitter includes the CRC value, and the receiver calculates and compares the CRC value to verify the data.
Limitations and Challenges
1. Limited Coverage
- Limited Coverage for All Error Types: Checksums and CRCs are effective for certain types of errors, but they cannot perfectly detect all errors.
2. Increased Design Complexity
- Complex Polynomial Design: The process of designing and implementing CRC polynomials can be complex and can add additional burden to the system design.
3. Performance Overhead
- Computation and Bandwidth Overhead: The checksum or CRC adds bits to every message and must be calculated and compared at both ends, which may cause performance overhead.
4. Limited Message Loss and Repeat Detection
- Message Loss and Repeat: Signature checking can verify the consistency of data and IDs, but it cannot detect message loss or unintentional message repetition.
D.2.5.7 Frame counter
The Frame Counter technique described in ISO 26262 is a method used to detect frame losses in communication systems. A frame consists of a coherent set of data sent from one controller to another, and each frame is uniquely identified by a message ID. By including a counter within each frame, the system can track the transmission of frames and identify any lost or missed frames, ensuring data integrity and reliability.
Aim
- Frame Loss Detection: Improves communication reliability and maintains data integrity by detecting frame loss early.
Description
- Use of Frame Counter:
- Unique Frame Identification: Each unique safety-related frame includes a counter as part of the message.
- Counter Increment: The counter is incremented with each successive frame transmission, and rollover may occur.
- Frame Loss Detection: The receiver can detect frame loss or non-update by checking if the counter is incremented by 1.
- Special Frame Counter Version:
- Includes individual signal counters: Individual signal counters can be included to match the update of safety-related data.
- Multiple safety-related data: If a frame contains multiple safety-related data, a separate counter is provided for each piece of data.
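The receiver-side counter check, including rollover, can be sketched as follows. The 4-bit counter width (modulo 16) and the class name `FrameCounterMonitor` are assumptions for illustration only.

```python
COUNTER_MODULO = 16   # assumed 4-bit counter that rolls over from 15 to 0


class FrameCounterMonitor:
    """Tracks the counter of one message ID and classifies each received frame."""

    def __init__(self, modulo: int = COUNTER_MODULO):
        self.modulo = modulo
        self.last = None

    def check(self, counter: int) -> str:
        if self.last is None:
            status = "ok"                    # first frame: nothing to compare against
        else:
            delta = (counter - self.last) % self.modulo
            if delta == 1:
                status = "ok"
            elif delta == 0:
                status = "not_updated"       # same counter again: data was not refreshed
            else:
                status = "frames_lost"       # one or more frames are missing
        self.last = counter
        return status


monitor = FrameCounterMonitor()
results = [monitor.check(c) for c in [14, 15, 0, 1, 1, 4]]
print(results)
# ['ok', 'ok', 'ok', 'ok', 'not_updated', 'frames_lost']
```

The modulo arithmetic makes the step from 15 to 0 count as a normal increment, so rollover does not raise a false alarm.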
Examples
Example 1: Application of Frame Counters on an Automotive CAN Bus
- Configuration: The integrity of data transmission is verified using frame counters on the CAN (Controller Area Network) bus in an automobile.
- Principle of Operation:
- Each CAN message is transmitted with a frame counter.
- The receiver checks whether the counter for each message is increased by 1 compared to the previous message.
- If a counter mismatch occurs, the receiver detects a frame loss and reports an error.
- Advantages:
- Increases the reliability of data communication within the vehicle and minimizes data corruption due to frame loss.
Example 2: Utilization of Frame Counters in Industrial Automation Systems
- Configuration: Use frame counters to prevent command loss when transmitting important control commands in industrial automation systems.
- Operating Principle:
- Each control command frame is transmitted with a unique counter.
- The receiver checks whether the counter is continuously increasing and issues an alert if a mismatch occurs.
- Advantages:
- Increases the reliability of the system and prevents production outages due to lost commands.
Limitations and Challenges
1. Increased Design Complexity
- Complex Counter Management: Counter management for multiple frames and data can be complex and can add additional burden to the system design.
2. Performance Overhead
- Additional Data Transmission Requirements: The counter occupies payload bits in every frame and must be checked by the receiver on each reception, which can impact system performance.
3. Limited Error Detection Capabilities
- Unintentional Frame Repeat Detection: Unintentional frame repetition can be difficult to detect using counters.
D.2.5.8 Timeout monitoring
Timeout Monitoring is a technique referenced in ISO 26262 that is designed to detect data loss in communication systems between sending and receiving nodes. This technique is particularly effective in identifying continuous loss of communication channels or specific messages by monitoring the time interval between received frames. By implementing timeout monitoring, systems can ensure timely detection of communication failures and maintain data integrity and reliability.
Aim
- Data Loss Detection: Maintains communication reliability and data integrity by early detection of data loss between sending and receiving nodes.
Description
- Message ID Time Monitoring:
- Expected Message Monitoring: The receiving side expects a safety message associated with a specific message ID and monitors the time between the receptions.
- Timeout Detection: If the time between two messages exceeds the configured timeout threshold, it indicates a communication error or message loss.
- Error Detection:
- Continuous Loss of Communication Channel: Detects when messages are continuously lost and no longer received.
- Continuous Loss of Specific Message: Detects when no more frames are received for a specific message ID.
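A per-message-ID timeout monitor can be sketched as below. Timestamps are passed in explicitly to keep the example deterministic; in a real system they would come from a monotonic clock. The class name `TimeoutMonitor`, the message IDs, and the 50 ms threshold are assumptions for illustration.

```python
class TimeoutMonitor:
    """Tracks the last reception time of each expected message ID."""

    def __init__(self, timeout_s: float, expected_ids: list, start: float = 0.0):
        self.timeout_s = timeout_s
        # Start the clock for every expected ID so a message that never arrives is also caught.
        self.last_seen = {mid: start for mid in expected_ids}

    def on_receive(self, message_id: int, now: float) -> None:
        if message_id in self.last_seen:
            self.last_seen[message_id] = now

    def expired(self, now: float) -> list:
        """Return the message IDs whose timeout threshold has been exceeded."""
        return sorted(mid for mid, t in self.last_seen.items()
                      if now - t > self.timeout_s)


monitor = TimeoutMonitor(timeout_s=0.05, expected_ids=[0x100, 0x200])
monitor.on_receive(0x100, now=0.04)    # 0x100 keeps arriving, 0x200 has gone silent

print(monitor.expired(now=0.04))                     # []
print([hex(m) for m in monitor.expired(now=0.06)])   # ['0x200']
```

If every expected ID expires at once, that points to loss of the whole communication channel rather than of one specific message.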
Examples
Example 1: Timeout Monitoring on an Automotive CAN Bus
- Configuration: Use timeout monitoring on an automotive CAN (Controller Area Network) bus to ensure the integrity of data transmission.
- Operational Principle:
- The receiver monitors the reception interval of each CAN message and detects a reception timeout for a specific message ID.
- If the time between messages exceeds the expected interval, the receiver considers it a communication error and issues a warning.
- Advantages:
- Increases the reliability of in-vehicle data communication and minimizes data corruption due to message loss.
Example 2: Timeout Monitoring in Industrial Automation Systems
- Configuration: Monitors the reception interval of critical control commands in industrial automation systems to prevent command loss.
- Operating Principle:
- The receiver records the reception time of each control command and issues an alert if it exceeds the expected interval.
- When a command loss is detected, the system takes immediate action to correct the error.
- Advantages:
- Increases the reliability of the system and prevents production downtime due to command loss.
Limitations and Challenges
1. Increased Design Complexity
- Complex Time Management: Time management between various messages and nodes can be complex and can add additional burden to the system design.
2. Performance Overhead
- Additional Processing Requirements: The receiver must maintain and evaluate a timer for each monitored message ID, which can impact system performance, and optimization is required to minimize this.
3. Limited Error Detection Capability
- Unintentional Message Repeat Detection: Unintentional message repetition can be difficult to detect, although timeouts can be used to detect message loss.
D.2.5.9 Read back of sent message
The Read Back of Sent Message technique is a communication safety mechanism referenced in ISO 26262. This approach is designed to detect failures in bus communication by allowing the transmitter to read back its own sent message from the bus and compare it with the original message. This technique is particularly effective in detecting data corruption and ensuring that the data being sent is consistent with the data received on the bus.
Aim
- Detect bus communication errors: Early detection of errors in bus communication by rereading and comparing the message sent by the sender.
Description
- Read and compare sent message:
- Read message from bus: The sender rereads the message sent on the bus.
- Compare with original message: The read message is compared with the original message to ensure data consistency.
- Detect errors: If an inconsistency is found, a communication error is considered to have been detected.
- Used in CAN:
- CAN Protocol Application: This safety mechanism is also used in the CAN (Controller Area Network) protocol.
- Data and ID Corruption Detection: Provides high coverage for data and message ID corruption.
- Limitations:
- Incomplete Error Coverage: Since it only checks the consistency of data and ID, it does not completely detect other error modes, such as unintended message repetition.
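The read-back check can be sketched with a simulated bus. `SimulatedBus` is a hypothetical stand-in that can flip bits in the first byte of a frame to mimic corruption on the wire; a real CAN controller performs this monitoring in hardware.

```python
class SimulatedBus:
    """Hypothetical shared bus; `corrupt_mask` flips bits in the first byte of each frame."""

    def __init__(self, corrupt_mask: int = 0):
        self.corrupt_mask = corrupt_mask
        self.level = b""

    def write(self, frame: bytes) -> None:
        if not frame:
            self.level = b""
            return
        self.level = bytes([frame[0] ^ self.corrupt_mask]) + frame[1:]

    def read(self) -> bytes:
        return self.level


def send_with_readback(bus: SimulatedBus, frame: bytes) -> bool:
    """Transmit `frame`, read it back from the bus, and compare with the original."""
    bus.write(frame)
    return bus.read() == frame


frame = b"\x12\x34"
print(send_with_readback(SimulatedBus(), frame))                    # True: bus carries what was sent
print(send_with_readback(SimulatedBus(corrupt_mask=0x01), frame))   # False: corruption detected
```

The check confirms only that the bus carried what the sender intended; it says nothing about whether the receiver processed the message or whether it was repeated.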
Example
Example 1: Reading and Comparing Messages on the CAN Bus
- Configuration: The integrity of the communication is verified by rereading the message sent by the sender from the CAN bus and comparing it.
- Operational Principle:
- The sender transmits data, rereads the data from the bus and compares it with the original data.
- If a mismatch is detected, the sender recognizes the data corruption and attempts to retransmit.
- Advantages:
- Increases the reliability of communication by early detection of data corruption in the CAN network.
Example 2: Reading and comparing messages in industrial automation systems
- Configuration: When transmitting important control commands in industrial automation systems, data integrity is verified by reading messages.
- Operational principle:
- After the sender transmits a control command, the command is read back from the bus and compared with the original.
- If the data does not match, the system reports an error and retransmits the command.
- Advantages:
- Increases the reliability of the system and prevents malfunctions due to command corruption.
Limitations and challenges
1. Limited error detection capability
- Unintentional message duplication cannot be detected: Since only data consistency is checked, unintentional message duplication is difficult to detect.
2. Performance overhead
- Additional data processing requirements: The process of the sender rereading and comparing data can affect the system performance, and optimization is required to minimize this.
3. Increased design complexity
- Complex message processing: Implementing message reading and comparison mechanisms can increase the complexity of the system design.
If you are interested in other articles about the ISO 26262 series, please refer to the links below!
[ISO 26262] #1. Part4-6 Technical Safety Concept (TSC)
[ISO 26262] #2. Safety Mechanisms for Electrical and Electronic
[ISO 26262] #3. Safety Mechanism for Processing Unit
[ISO 26262] #4. Safety Mechanisms for IO units and Interfaces