### Assertion-Based Microarchitecture Design for Improved Fault Tolerance

### Vimal K. Reddy, Ahmed S. Al-Zawawi, Eric Rotenberg



Center for Embedded Systems Research (CESR), Department of Electrical & Computer Engineering, North Carolina State University

### **Motivation**

- Technology scaling
  - Smaller, faster transistors
  - Transistors more susceptible to transient faults
- How to build reliable processors using unreliable transistors?

Need efficient fault tolerance solutions

# Redundant Multithreading (RMT)

- Duplicate a program and compare outcomes to detect transient faults
- Positives
  - Simple and straightforward
  - General solution
  - Complete fault coverage
- Negatives
  - High overhead

### RMT using an extra processor



# RMT using simultaneous multithreading



### Alternate solution: Targeted fault checks

- Add regimen of fault checks
- Specific to logic block
  - Arbitrary latches: Robust latch design
  - Arbitrary gates: Self-checking logic
  - FSM: Self-checking FSM designs
  - ALU: Self-checking ALUs, RESO
  - Storage and buses: Parity, ECC
- Positives
  - No overhead of duplicating the program
- Negatives
  - Not general, i.e., many types of checks needed

## Our contribution: Microarchitectural Assertions

- Novel class of fault checks
- Key Idea: Confirm µarch. "truths" within processor pipeline
- Catch-all checks for microarchitecture mechanisms
- Positives
  - Broad coverage (catch-all checks)
  - Very low-overhead solution (no redundant execution)









# Examples of Microarchitectural Assertions

- Register Name Authentication (RNA)
  - Aims to detect faults in renaming unit
  - Asserts consistencies in renaming unit
    - Exploits inherent redundancy in renaming structures
    - Asserts expected physical register states
- Timestamp-based Assertion Checking (TAC)
  - Aims to detect faults in issue unit
  - Asserts sequential order among data dependent instructions
  - Uses timestamps































- RNA previous mapping check
  - Detects faults in renaming structures
    - Rename map table, Architecture map table
    - Branch checkpoint tables
    - Active list (renaming state)
- RNA writeback state check
  - Detects faults in renaming logic and freelist
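The slides do not spell out how the previous mapping check works. The sketch below is one plausible reading, assuming a conventional renamer in which each active list entry records the previous physical mapping of its destination register. Under that assumption, the recorded previous mapping must equal the architecture map table entry when the instruction retires. The class `RenameState` and all of its fields and methods are hypothetical names chosen for illustration.

```python
from collections import deque


class RenameState:
    """Toy renamer; every structure and method name here is hypothetical."""

    def __init__(self, num_logical: int, num_physical: int):
        self.rename_map = list(range(num_logical))   # speculative mappings
        self.arch_map = list(range(num_logical))     # committed mappings
        self.freelist = deque(range(num_logical, num_physical))
        self.active_list = deque()                   # (logical, new_phys, prev_phys)

    def rename_dest(self, logical: int) -> int:
        """Allocate a new physical register and remember the previous mapping."""
        new_phys = self.freelist.popleft()
        prev_phys = self.rename_map[logical]
        self.rename_map[logical] = new_phys
        self.active_list.append((logical, new_phys, prev_phys))
        return new_phys

    def retire(self) -> bool:
        """Retire the oldest instruction; return False if the assertion fails."""
        logical, new_phys, prev_phys = self.active_list[0]
        # Assumed assertion: the two redundant copies of the mapping must agree.
        if self.arch_map[logical] != prev_phys:
            return False
        self.active_list.popleft()
        self.arch_map[logical] = new_phys
        self.freelist.append(prev_phys)
        return True


state = RenameState(num_logical=4, num_physical=8)
state.rename_dest(2)
state.rename_dest(2)
fault_free = [state.retire(), state.retire()]

state.rename_dest(1)
state.arch_map[1] ^= 1          # inject a single bit flip into the committed mapping
faulty = state.retire()
print(fault_free, faulty)
```

In this toy model the check costs one comparison per retired instruction and reuses state the renamer already keeps, which is consistent with the low-overhead goal stated earlier.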

- Pure source renaming faults undetected
  - E.g., fault in source renaming logic
  - Researching solutions similar to RNA
- However, faults causing deadlock are detectable
  - Faulty source name causing cyclic dependency
  - Faulty source name points to unpopped freelist entry
  - Other faults that cause phantom producers
- Use watchdog timer to detect deadlocks
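A watchdog timer for deadlock detection can be sketched as a counter of cycles since the last retirement. This is a minimal illustration, not the authors' design: the class name `WatchdogTimer`, the threshold value, and the retirement trace are all assumptions.

```python
class WatchdogTimer:
    """Counts cycles since the last retirement; fires at a threshold."""

    def __init__(self, threshold: int):
        self.threshold = threshold   # hypothetical value; not given in the slides
        self.idle_cycles = 0

    def tick(self, retired_this_cycle: bool) -> bool:
        """Advance one cycle. Returns True if a deadlock is declared."""
        if retired_this_cycle:
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
        return self.idle_cycles >= self.threshold


wdog = WatchdogTimer(threshold=4)
# One entry per cycle: did any instruction retire in that cycle?
trace = [True, False, False, True, False, False, False, False]
fired_at = next((cycle for cycle, r in enumerate(trace) if wdog.tick(r)), None)
print(fired_at)
```

A real threshold would need to exceed the longest legitimate stall (for example, a long cache miss) so that ordinary stalls are not reported as deadlocks.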

# Timestamp-based Assertion Checking (TAC)

- Confirm data dependent instructions issued sequentially
  - Assign timestamps to instructions at issue
  - At retirement, confirm instruction timestamp greater than producers' timestamps
- TAC check
  - Instruction timestamp ≥ producer's timestamp + latency
- Faults in the checking logic can only cause false alarms
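The TAC check can be sketched in a few lines. The code assumes that "latency" in the inequality is the producer's execution latency and that each instruction keeps references to its producers; the `Instr` class and the example timestamps are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instr:
    name: str
    latency: int                      # execution latency in cycles (assumed known)
    issue_ts: int = -1                # timestamp recorded at issue
    producers: List["Instr"] = field(default_factory=list)


def tac_check(instr: Instr) -> bool:
    """True if the instruction issued no earlier than each producer's
    issue timestamp plus that producer's latency."""
    return all(instr.issue_ts >= p.issue_ts + p.latency for p in instr.producers)


load = Instr("load", latency=3, issue_ts=10)
add_ok = Instr("add", latency=1, issue_ts=13, producers=[load])
# Issued too early, as could happen if a ready bit were set prematurely.
add_bad = Instr("add", latency=1, issue_ts=11, producers=[load])

print(tac_check(add_ok), tac_check(add_bad))
```

The check only reads timestamps and never alters pipeline state. So a fault in the comparison itself can raise a spurious assertion but cannot corrupt execution, which matches the false-alarm remark above.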





- Randomly injected faults in timing simulator
  - 1000 faults per benchmark
  - Faults target issue and rename state
  - Simulation ends 1 million cycles following fault injection
- Observations
  - Fault detected by an assert (Assert) or not (Undet)
  - Fault corrupts architectural state (SDC) or not (Masked)
- Possible outcomes
  - Assert+SDC
  - Undet+SDC
  - Assert+Masked
  - Undet+Masked
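The four outcomes are the cross product of the two observations. A small helper for labelling and tallying injection runs might look like the following; the function name `classify` and the sample runs are made up for illustration and are not results from the study.

```python
from collections import Counter


def classify(detected: bool, corrupted: bool) -> str:
    """Combine the two observations into one of the four outcome labels."""
    detection = "Assert" if detected else "Undet"
    effect = "SDC" if corrupted else "Masked"
    return f"{detection}+{effect}"


# Made-up (detected, corrupted) observations, for illustration only.
runs = [(True, True), (False, False), (True, False), (False, False)]
tally = Counter(classify(d, c) for d, c in runs)
print(tally)
```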

# TAC - Fault Injection Experiments

- Type of faults injected
  - Ready bits prematurely set
  - Dependents of a cache-missing load that were issued speculatively are not reissued


### **RNA - Fault Injection Experiments**

- Faults injected
  - Flip bits of an entry in architecture map table
  - Flip bits of an entry in rename map table
  - Flip bits of an entry in freelist
  - Flip destination register bits at dispatch
- Watchdog timer included to detect deadlocks
- Additional possible outcomes
  - Assert+Wdog : RNA detected a future deadlock
  - Undet+Wdog : Deadlock undetected by RNA (possibly would have been caught by RNA in future)
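The bit-flip faults listed above can be modelled in a simulator roughly as follows. This sketch flips a single random bit in a random entry of a table of register numbers. The slides do not say how many bits are flipped per fault, so the single-bit model, the 7-bit entry width, and the function names are all assumptions.

```python
import random
from typing import List, Tuple


def flip_bit(value: int, bit: int, width: int) -> int:
    """Return value with one bit inverted; bit must lie within the entry width."""
    if not 0 <= bit < width:
        raise ValueError("bit position outside entry width")
    return value ^ (1 << bit)


def inject_fault(table: List[int], width: int, rng: random.Random) -> Tuple[int, int]:
    """Flip one random bit of one random entry in place; return (index, bit)."""
    index = rng.randrange(len(table))
    bit = rng.randrange(width)
    table[index] = flip_bit(table[index], bit, width)
    return index, bit


rng = random.Random(0)                      # fixed seed for repeatable runs
rename_map = [0, 1, 2, 3, 4, 5, 6, 7]       # toy table of physical register numbers
original = list(rename_map)
index, bit = inject_fault(rename_map, width=7, rng=rng)
print(index, bit, original[index], rename_map[index])
```

The same helper could target the architecture map table, the freelist, or a destination register field at dispatch by passing the corresponding list.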


# **RNA - Fault outcome distribution**

- "dest" faults cause deadlocks because of phantom producers
  - A deadlock blocks retirement, so the RNA check cannot complete
- "arch\_map" faults cause most of the undetected SDC
  - Long register live range
  - System trap occurs before the RNA check can complete



# Conclusions

