Title: Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors
Language: English
Authors: Prabakaran, Bharath Srinivas; Dave, Mihika; Kriebel, Florian; Rehman, Semeen; Shafique, Muhammad
Category: Research Article
Keywords: Reliability; multi-cores; heterogeneity; fault-tolerance; AVF; hardening; microprocessors; superscalar; resilience; design space exploration; checkpointing; out-of-order; architecture
Issue Date: 2019
Journal: IEEE Access
Abstract: State-of-the-art reliability techniques deploy full-scale redundancy, such as double or triple modular redundancy (DMR, TMR), on different layers of the computing stack to detect and/or correct transient faults. However, techniques that rely on full-scale redundancy incur significant area, performance, and/or power overheads, which might not always be feasible or practical under system constraints such as deadlines and the available power budget for the full chip (or a processor core). In this work, we propose a novel design methodology to generate and explore the architectural space of heterogeneous reliability modes for out-of-order superscalar multi-core processors. These heterogeneous modes enable varying reliability and power/area trade-offs, from which an optimal configuration can be chosen at run time to meet the reliability requirements of a given system while reducing the corresponding power overheads (or, for the inverse problem, to maximize the reliability under a given power constraint). Our experimental results show that a Pareto-optimal heterogeneous reliability mode reduces the core vulnerability by 87%, on average, across multiple application workloads, with area and power overheads of 10% and 43%, respectively. To further enhance the design space of heterogeneous reliability modes, we investigate the effectiveness of combining different processor state compression techniques, namely Distributed Multi-threaded Checkpointing (DMTCP), Hash-based Incremental Checkpointing (HBICT), and GNU zip (gzip), such that the correct processor state can be recovered once a fault is detected. We reduced the checkpoint sizes by a factor of ~6 using a unique combination of different state compression techniques.
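The run-time choice described in the abstract (meet a reliability requirement at minimum power, or maximize reliability under a power budget) can be illustrated with a minimal sketch. This is not the authors' methodology: the `Mode` class, the function names, and every number in `MODES` are hypothetical and invented purely for illustration; none of the values are taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """A hypothetical reliability mode (all fields are illustrative)."""
    name: str
    vulnerability: float  # normalized core vulnerability, lower is better
    power: float          # power relative to an unprotected baseline


def pareto_front(modes):
    """Return the modes that no other mode dominates in (vulnerability, power)."""
    front = []
    for m in modes:
        dominated = any(
            o.vulnerability <= m.vulnerability
            and o.power <= m.power
            and (o.vulnerability < m.vulnerability or o.power < m.power)
            for o in modes
        )
        if not dominated:
            front.append(m)
    return front


def min_power_mode(modes, max_vulnerability):
    """Cheapest Pareto-optimal mode that meets a vulnerability requirement."""
    feasible = [m for m in pareto_front(modes)
                if m.vulnerability <= max_vulnerability]
    return min(feasible, key=lambda m: m.power, default=None)


def min_vulnerability_mode(modes, power_budget):
    """Inverse problem: most reliable Pareto-optimal mode within a power budget."""
    feasible = [m for m in pareto_front(modes) if m.power <= power_budget]
    return min(feasible, key=lambda m: m.vulnerability, default=None)


# Invented example values, NOT results from the paper.
MODES = [
    Mode("unprotected", 1.00, 1.00),
    Mode("partial-A", 0.40, 1.20),
    Mode("partial-B", 0.15, 1.40),
    Mode("partial-C", 0.45, 1.30),  # dominated by partial-A
    Mode("full-TMR", 0.02, 3.10),
]

if __name__ == "__main__":
    print(min_power_mode(MODES, max_vulnerability=0.2))
    print(min_vulnerability_mode(MODES, power_budget=1.25))
```

Filtering to the Pareto front first keeps the run-time lookup small, since dominated modes can never be the best answer to either query. A real system would presumably also account for area and per-workload vulnerability, which this sketch omits.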
DOI: 10.1109/ACCESS.2019.2945622
Library ID: AC15576021
URN: urn:nbn:at:at-ubtuw:3-8519
ISSN: 2169-3536
Organisation: E191 - Institut für Computer Engineering 
Publication Type: Article
Appears in Collections: Article

This item is licensed under a Creative Commons License.