Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors

Prabakaran, Bharath Srinivas; Dave, Mihika; Kriebel, Florian; Rehman, Semeen; Shafique, Muhammad

doi:10.1109/ACCESS.2019.2945622

Record link:

https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:3-8519
http://hdl.handle.net/20.500.12708/781

Title:

Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors

Citation:

Prabakaran, B. S., Dave, M., Kriebel, F., Rehman, S., & Shafique, M. (2019). Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors. IEEE Access. https://doi.org/10.1109/ACCESS.2019.2945622

Publisher DOI:

10.1109/ACCESS.2019.2945622

CatalogPlus:

AC15576021

Publication Type:

Article - Original Research Article

Language:

English

Authors:

Prabakaran, Bharath Srinivas
Dave, Mihika
Kriebel, Florian
Rehman, Semeen
Shafique, Muhammad

Organisational Unit:

E191 - Institut für Computer Engineering
E384 - Institut für Computertechnik

Journal:

IEEE Access

ISSN:

2169-3536

Date (published):

2019

Publisher:

IEEE

Peer reviewed:

Yes

Keywords:

Reliability; multi-cores; heterogeneity; fault-tolerance; AVF; hardening; microprocessors; superscalar; resilience; design space exploration; checkpointing; out-of-order; architecture

Abstract:

State-of-the-art reliability techniques and mechanisms deploy full-scale redundancy, such as double or triple modular redundancy (DMR, TMR), on different layers of the computing stack to detect and/or correct transient faults. However, techniques relying on full-scale redundancy incur significant area, performance, and/or power overheads, which might not always be feasible or practical due to system constraints such as deadlines and the available power budget for the full chip (or a processor core). In this work, we propose a novel design methodology to generate and explore the architectural space of heterogeneous reliability modes for out-of-order superscalar multi-core processors. These heterogeneous modes enable varying reliability and power/area trade-offs, from which an optimal configuration can be chosen at run time to meet the reliability requirements of a given system while reducing the corresponding power overheads (or solving the inverse problem, i.e., maximizing the reliability under a given power constraint). Our experimental results show that a Pareto-optimal heterogeneous reliability mode reduces the core vulnerability by 87%, on average, across multiple application workloads, with area and power overheads of 10% and 43%, respectively. To further enhance the design space of heterogeneous reliability modes, we investigate the effectiveness of combining different processor state compression techniques, such as Distributed Multi-threaded Checkpointing (DMTCP), Hash-based Incremental Checkpointing (HBICT), and GNU zip, such that the correct processor state can be recovered once a fault is detected. We reduced the checkpoint sizes by a factor of ~6× using a unique combination of different state compression techniques.
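
As an illustration of the run-time selection described in the abstract, the following minimal Python sketch filters a set of reliability modes down to its Pareto front and then picks the lowest-power mode that meets a vulnerability bound. It is not the authors' methodology: the `Mode` class, the functions `pareto_front` and `pick_mode`, the mode names, and all numbers are hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """A hypothetical reliability mode (names and fields are assumptions)."""
    name: str
    vulnerability: float  # normalized core vulnerability, lower is better
    power: float          # normalized power overhead, lower is better


def pareto_front(modes):
    """Keep only modes that no other mode dominates on both metrics."""
    def dominates(a, b):
        # a dominates b if it is no worse on both metrics and better on one.
        return (a.vulnerability <= b.vulnerability and a.power <= b.power
                and (a.vulnerability < b.vulnerability or a.power < b.power))
    return [m for m in modes if not any(dominates(o, m) for o in modes)]


def pick_mode(modes, max_vulnerability):
    """Return the lowest-power Pareto mode meeting the bound, or None."""
    feasible = [m for m in pareto_front(modes)
                if m.vulnerability <= max_vulnerability]
    return min(feasible, key=lambda m: m.power, default=None)


# Made-up illustrative values, not results from the paper.
MODES = [
    Mode("unprotected", 1.00, 0.00),
    Mode("partial_a", 0.40, 0.20),
    Mode("partial_b", 0.45, 0.30),  # dominated by partial_a
    Mode("partial_c", 0.15, 0.45),
    Mode("full_tmr", 0.02, 2.00),
]

if __name__ == "__main__":
    chosen = pick_mode(MODES, max_vulnerability=0.2)
    print(chosen.name if chosen else "no feasible mode")
```

The inverse problem mentioned in the abstract (maximizing reliability under a power budget) would follow the same pattern, filtering on `power` and minimizing `vulnerability` instead.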

License:

CC BY 4.0