NEUTRON-INDUCED EFFECTS AND FAULT TOLERANCE IN MODERN PARALLEL AND HETEROGENEOUS ARCHITECTURES

We evaluate the resilience of modern parallel devices for high-performance computing applications, namely Graphics Processing Units (GPUs), and of heterogeneous Systems on Chip, namely Accelerated Processing Units (APUs). The error rate of today's supercomputers can be extremely high. Our previous experiments at ISIS indicate that the error rate of TITAN, a supercomputer composed of 18,000 GPUs, can reach one error every 10 minutes, a value that TITAN personnel confirmed. We were also able to match the experimental data gathered at ISIS with TITAN field data covering more than 1,400 million GB of data and 500 million GPU node hours of operation. Finally, we will study the neutron sensitivity of modern Systems on Chip that embed ARM cores and FPGA programmable logic, and we will exploit the FPGA's programmability to implement a general-purpose mitigation scheme for the ARM processor.
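To put the quoted system-level figure in per-device terms, the following is a rough back-of-the-envelope sketch, not part of the experiment itself. It assumes the upper-bound rate of one error every 10 minutes, and it further assumes (our simplification) that errors are independent and spread evenly across all 18,000 GPUs. The function names are illustrative only.

```python
# Rough conversion of a system-wide error interval into a per-device figure.
# Assumption (ours, not the record's): errors are independent and uniformly
# distributed over all devices, so the per-device interval scales linearly.

GPUS = 18_000                   # GPU count quoted in the abstract
SYSTEM_MINUTES_PER_ERROR = 10   # upper-bound system error interval quoted


def per_device_hours_between_errors(devices, system_minutes_per_error):
    """Mean hours between errors for one device, given the system interval."""
    return devices * system_minutes_per_error / 60


def fit_rate(hours_between_errors):
    """Failures In Time: expected errors per 10^9 device hours."""
    return 1e9 / hours_between_errors


hours = per_device_hours_between_errors(GPUS, SYSTEM_MINUTES_PER_ERROR)
print(f"Per-GPU mean time between errors: {hours:.0f} h ({hours / 24:.0f} days)")
print(f"Equivalent per-GPU rate: {fit_rate(hours):.0f} FIT")
```

Under these assumptions, a machine that sees an error every 10 minutes corresponds to each individual GPU seeing an error roughly once every 3,000 hours, which illustrates why a device that looks reliable in isolation can still dominate the error budget at supercomputer scale.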

Identifier
DOI https://doi.org/10.5286/ISIS.E.61786330
Metadata Access https://icatisis.esc.rl.ac.uk/oaipmh/request?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:icatisis.esc.rl.ac.uk:inv/61786330
Provenance
Creator Dr Luca Sterpone; Dr Paolo Rech; Professor Luigi Carro; Mr Lucas Tambara; Dr Chris Frost; Mr Daniel Oliveira
Publisher ISIS Neutron and Muon Source
Publication Year 2018
Rights CC-BY Attribution 4.0 International; https://creativecommons.org/licenses/by/4.0/
OpenAccess true
Contact isisdata(at)stfc.ac.uk
Representation
Resource Type Dataset
Discipline Construction Engineering and Architecture; Engineering; Engineering Sciences
Temporal Coverage Begin 2015-07-20T08:00:00Z
Temporal Coverage End 2015-07-24T08:00:00Z