State-of-the-art reliability techniques and mechanisms deploy full-scale redundancy, such as double or triple modular redundancy (DMR, TMR), on different layers of the computing stack to detect and/or correct such transient faults. However, techniques that rely on full-scale redundancy incur significant area, performance, and/or power overheads, which might not always be feasible or practical under system constraints such as deadlines and the available power budget for the full chip (or a processor core). In this work, we propose a novel design methodology to generate and explore the architectural space of heterogeneous reliability modes for out-of-order superscalar multi-core processors. These heterogeneous modes enable varying reliability and power/area trade-offs, from which an optimal configuration can be chosen at run time to meet the reliability requirements of a given system while reducing the corresponding power overheads (or to solve the inverse problem, i.e., maximizing the reliability under a given power constraint). Our experimental results show that a Pareto-optimal heterogeneous reliability mode reduces the core vulnerability by 87%, on average, across multiple application workloads, with area and power overheads of 10% and 43%, respectively. To further enhance the design space of heterogeneous reliability modes, we investigate the effectiveness of combining different processor state compression techniques, namely Distributed Multi-threaded Checkpointing (DMTCP), Hash-based Incremental Checkpointing (HBICT), and GNU zip (gzip), such that the correct processor state can be recovered once a fault is detected. A unique combination of these state compression techniques reduces the checkpoint sizes by a factor of ~6×.
en
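The run-time choice described in the abstract (minimum power for a required reliability, or maximum reliability under a power budget) can be sketched as a selection over a small table of modes. This is a minimal illustration, not the authors' methodology: the `Mode` type, the mode names, and every numeric value below are hypothetical placeholders and are not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Mode:
    """One hypothetical reliability mode of a core."""
    name: str
    vulnerability_reduction: float  # fraction of core vulnerability removed, 0..1
    power_overhead: float           # fractional power overhead vs. unprotected core


# Illustrative placeholder values only (assumptions, not measured data).
MODES = [
    Mode("unprotected", 0.00, 0.00),
    Mode("light", 0.40, 0.15),
    Mode("medium", 0.70, 0.30),
    Mode("heavy", 0.90, 0.60),
    Mode("wasteful", 0.60, 0.50),  # dominated by "medium"
]


def pareto_front(modes: Sequence[Mode]) -> List[Mode]:
    """Keep modes that no other mode beats on both reliability and power."""
    front = []
    for m in modes:
        dominated = any(
            o.vulnerability_reduction >= m.vulnerability_reduction
            and o.power_overhead <= m.power_overhead
            and (
                o.vulnerability_reduction > m.vulnerability_reduction
                or o.power_overhead < m.power_overhead
            )
            for o in modes
        )
        if not dominated:
            front.append(m)
    return front


def min_power_for_reliability(
    modes: Sequence[Mode], required_reduction: float
) -> Optional[Mode]:
    """Cheapest mode (in power) that meets the reliability requirement."""
    feasible = [m for m in modes if m.vulnerability_reduction >= required_reduction]
    return min(feasible, key=lambda m: m.power_overhead) if feasible else None


def max_reliability_under_power(
    modes: Sequence[Mode], power_budget: float
) -> Optional[Mode]:
    """Inverse problem: most reliable mode that fits the power budget."""
    feasible = [m for m in modes if m.power_overhead <= power_budget]
    return max(feasible, key=lambda m: m.vulnerability_reduction) if feasible else None
```

Both selection functions return `None` when no mode satisfies the constraint, so a caller can fall back to a default mode or relax the requirement.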
dc.language
English
-
dc.language.iso
en
-
dc.publisher
IEEE
-
dc.relation.ispartof
IEEE Access
-
dc.rights.uri
http://creativecommons.org/licenses/by/4.0/
-
dc.subject
Reliability
en
dc.subject
multi-cores
en
dc.subject
heterogeneity
en
dc.subject
fault-tolerance
en
dc.subject
AVF
en
dc.subject
hardening
en
dc.subject
microprocessors
en
dc.subject
superscalar
en
dc.subject
resilience
en
dc.subject
design space exploration
en
dc.subject
checkpointing
en
dc.subject
out-of-order
en
dc.subject
architecture
en
dc.title
Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors
en
dc.type
Article
en
dc.type
Artikel
de
dc.rights.license
Creative Commons Namensnennung 4.0 International
de
dc.rights.license
Creative Commons Attribution 4.0 International
en
dc.contributor.affiliation
University of Illinois Urbana-Champaign, United States of America
-
dc.relation.grantno
German Research Foundation (DFG)
-
dc.relation.grantno
Dependable Embedded Systems, SPP 1500
-
dc.rights.holder
The Author(s) 2019
-
dc.type.category
Original Research Article
-
tuw.journal.peerreviewed
true
-
tuw.peerreviewed
true
-
tuw.version
vor
-
dcterms.isPartOf.title
IEEE Access
-
tuw.publication.orgunit
E191 - Institut für Computer Engineering
-
tuw.publication.orgunit
E384 - Institut für Computertechnik
-
tuw.publisher.doi
10.1109/ACCESS.2019.2945622
-
dc.identifier.eissn
2169-3536
-
dc.identifier.libraryid
AC15576021
-
dc.identifier.urn
urn:nbn:at:at-ubtuw:3-8519
-
tuw.author.orcid
0000-0003-0557-2166
-
dc.rights.identifier
CC BY 4.0
de
dc.rights.identifier
CC BY 4.0
en
wb.sci
true
-
item.grantfulltext
open
-
item.fulltext
with Fulltext
-
item.openairecristype
http://purl.org/coar/resource_type/c_2df8fbb1
-
item.openaccessfulltext
Open Access
-
item.cerifentitytype
Publications
-
item.languageiso639-1
en
-
item.openairetype
research article
-
crisitem.author.dept
E191-02 - Forschungsbereich Embedded Computing Systems
-
crisitem.author.dept
University of Illinois Urbana-Champaign
-
crisitem.author.dept
E191-02 - Forschungsbereich Embedded Computing Systems
-
crisitem.author.dept
E384 - Institut für Computertechnik
-
crisitem.author.dept
E191-02 - Forschungsbereich Embedded Computing Systems
-
crisitem.author.parentorg
E191 - Institut für Computer Engineering
-
crisitem.author.parentorg
E191 - Institut für Computer Engineering
-
crisitem.author.parentorg
E350 - Fakultät für Elektrotechnik und Informationstechnik