Fault-Tolerant Architectures

Overview

There is growing concern that transient faults, caused by cosmic rays and other factors, will occur frequently in future high-performance processors as designers push technology to its limits. Existing fault-tolerant techniques are either too costly (system-level replication), too intrusive (gate-level replication), or too specific (e.g., ECC on memory). In 1999, we proposed a microarchitectural approach to fault tolerance (AR-SMT) that achieves broad coverage of transient faults with low performance overhead and few changes to the underlying microarchitecture. We revisit this notion and explore other ways the microarchitecture can improve reliability.

This project is in its early stages. For background, please follow the Slipstream Project link on the main research page.

Publications

Conference and Journal Papers

V. K. Reddy and E. Rotenberg. Coverage of a Microarchitecture-level Fault Check Regimen in a Superscalar Processor. Proceedings of the 38th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN-38, DCCS track), pp. 1-10, June 2008. [pdf]

V. K. Reddy and E. Rotenberg. Inherent Time Redundancy (ITR): Using Program Repetition for Low-Overhead Fault Tolerance. Proceedings of the 37th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN-37, DCCS track), pp. 307-316, June 2007. [pdf]

V. K. Reddy, S. Parthasarathy, and E. Rotenberg. Understanding Prediction-Based Partial Redundant Threading for Low-Overhead, High-Coverage Fault Tolerance. Proceedings of the 12th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XII), pp. 83-94, October 2006. [pdf]

V. K. Reddy, A. S. Al-Zawawi, and E. Rotenberg. Assertion-Based Microarchitecture Design for Improved Fault Tolerance. Proceedings of the 24th IEEE International Conference on Computer Design (ICCD-24), pp. 362-369, October 2006. [pdf]

Student Theses

S. Rajan Vijaya Kumar. RTL Design and Analysis of a Fault Check Regimen for Superscalar Processors. M.S. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, July 2010. [NCSU library: on-line thesis]

V. K. Reddy. Exploiting Microarchitecture Insights for Efficient Fault Tolerance. Ph.D. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, August 2007. [NCSU library: on-line thesis]

S. Parthasarathy. Improving Transient Fault Tolerance of Slipstream Processors. M.S. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, December 2005. [NCSU library: on-line thesis]

Talks

Coverage of a Microarchitecture-level Fault Check Regimen in a Superscalar Processor. Presented at DSN-38 by V. K. Reddy. [pps]

Inherent Time Redundancy (ITR): Using Program Repetition for Low-Overhead Fault Tolerance. Presented at DSN-37 by V. K. Reddy. [pps]

Understanding Prediction-Based Partial Redundant Threading for Low-Overhead, High-Coverage Fault Tolerance. Presented at ASPLOS-XII by V. K. Reddy. [pps]

Assertion-Based Microarchitecture Design for Improved Fault Tolerance. Presented at ICCD-24 by V. K. Reddy. [pps]

Funding

This project is supported by NSF CAREER grant No. CCR-0092832 (CAREER: Cooperative Redundant Threads), and generous funding and equipment donations from Intel.

Any opinions, findings, and conclusions or recommendations expressed in this website and publications herein are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.