SHF:Medium:Collaborative Research:A comprehensive methodology to pursue reproducible accuracy in ensemble scientific simulations on multi- and many-core platforms

  • Becchi, Michela M. (PI)

Project Details

Description

Ensemble simulations of scientific phenomena typically run for weeks or even months on high-performance computing clusters. The already high level of concurrency of these computing environments is expected to significantly increase in the near future, causing simulations to suffer not only from numerical errors due to limited arithmetic precision but also from the non-determinism in the execution associated with multithreading. Ultimately this trend can compromise the simulation results and break the scientific community's trust in ensemble simulations. This project tackles this problem and defines a methodology to enable the reproducible accuracy of large ensemble simulations on exascale platforms that include multi- and many-core processors.

This project moves along two major fronts. First, the investigators identify common sources of accuracy errors and study their accumulation, propagation, and runtime effects in a controlled environment. This phase includes three research activities: (i) generating code motifs that model those computations that may lead to accuracy errors; (ii) providing multiple implementations of these motifs, called code inspectors, targeting different parallel platforms; and (iii) evaluating the accuracy and runtime of these implementations using a variety of datasets and stress conditions. Second, by installing these code inspectors in real scientific code bases, the investigators study their behavior in uncertain environments. This phase includes two research activities: (i) prioritizing code segments based on quantitative impact scores and matching segments to inspector motifs; and (ii) finding the optimal code inspector implementations and patching the code with them so as to optimize the overall result variance. The applications targeted in this project are deterministic chaotic applications including n-body atomic system simulations and astrophysical simulations.

StatusFinished
Effective start/end date1/1/1731/5/20

Funding

  • National Science Foundation: US$370,233.00

ASJC Scopus Subject Areas

  • Astronomy and Astrophysics
  • Computer Networks and Communications
  • Electrical and Electronic Engineering
  • Communication

Fingerprint

Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.