Title: | Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors | Language: | English | Authors: | Prabakaran, Bharath Srinivas; Dave, Mihika; Kriebel, Florian; Rehman, Semeen; Shafique, Muhammad |
Category: | Research Article |
Keywords: | Reliability; multi-cores; heterogeneity; fault-tolerance; AVF; hardening; microprocessors; superscalar; resilience; design space exploration; checkpointing; out-of-order; architecture | Issue Date: | 2019 | Journal: | IEEE Access | Abstract: | State-of-the-art reliability techniques and mechanisms deploy full-scale redundancy, like double or triple modular redundancy (DMR, TMR), on different layers of the computing stack to detect and/or correct transient faults. However, the techniques relying on full-scale redundancy incur significant area, performance, and/or power overheads, which might not always be feasible or practical due to system constraints such as deadlines and the available power budget for the full chip (or a processor core). In this work, we propose a novel design methodology to generate and explore the architectural space of heterogeneous reliability modes for out-of-order superscalar multi-core processors. These heterogeneous modes enable varying reliability and power/area trade-offs, from which an optimal configuration can be chosen at run time to meet the reliability requirements of a given system while reducing the corresponding power overheads (or solving the inverse problem, i.e., maximizing the reliability under a given power constraint). Our experimental results show that a Pareto-optimal heterogeneous reliability mode reduces the core vulnerability by 87%, on average, across multiple application workloads, with area and power overheads of 10% and 43%, respectively. To further enhance the design space of heterogeneous reliability modes, we investigate the effectiveness of combining different processor state compression techniques like Distributed Multi-threaded Checkpointing (DMTCP), Hash-based Incremental Checkpointing (HBICT) and GNU zip, such that the correct processor state can be recovered once a fault is detected. We reduced the checkpoint sizes by a factor of ~6× using a unique combination of different state compression techniques. |
DOI: | 10.1109/ACCESS.2019.2945622 | Library ID: | AC15576021 | URN: | urn:nbn:at:at-ubtuw:3-8519 | ISSN: | 2169-3536 | Organisation: | E191 - Institut für Computer Engineering | Publication Type: | Article |
Appears in Collections: | Article |
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors.pdf | | 2.98 MB | Adobe PDF | |

This item is licensed under a Creative Commons License.