EAGER: Recomputation-Based Checkpointing for Sparse Matrices

  • Solihin, Yan Y. (Principal Investigator)

Project details

Description

High-performance computing (HPC) is essential for maintaining the US international competitive edge and leadership in science, technology, engineering, and mathematics (STEM). Advances in HPC are vital to national interests because they provide infrastructure for scientific discovery that improves the national health, prosperity, welfare, and defense. To solve large-scale scientific problems, HPC relies on an increasing number of nodes and components, which makes it more likely that a long-running computation will be interrupted by a failure before it completes. A critical technique for ensuring that a computation completes is checkpointing. Checkpointing saves snapshots of the computation so that, when a failure occurs, the computation state can be restored from the last snapshot and execution can continue from there rather than restarting from the beginning. The research in this project seeks to advance state-of-the-art checkpointing by making it significantly faster and lowering its cost. This project also plans to contribute to the training of the future workforce by exposing students to the mechanisms and inefficiencies of current checkpointing on non-volatile main memory (NVMM), and to the new in-place checkpointing. The project seeks to increase the participation of minority and under-represented groups and involves undergraduates in research.

Prior approaches to checkpointing rely on taking a snapshot of the system state (system-level checkpointing) or the application state (application-level checkpointing) and saving it to secondary non-volatile storage. With the advent of non-volatile main memory (NVMM), a new approach to checkpointing becomes possible. In contrast to traditional approaches, which store separate snapshots in separate secondary storage, the project uses a new approach in which checkpoints are constructed in place in the NVMM, using the working data structures of the application itself. This requires only minimal additional state beyond what the program already writes to memory, which makes checkpointing significantly faster and cheaper and in turn enables further HPC scaling.
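The in-place idea described above can be illustrated with a minimal Python sketch. It is not the project's implementation: `SimulatedNVMM`, `SimulatedFailure`, and `spmv_in_place` are hypothetical names introduced here, an ordinary Python object stands in for NVMM, and the cache flushes and write-ordering guarantees that real persistent memory would need are ignored. The sketch computes a sparse matrix-vector product row by row, writes results directly into the working vector, and keeps a single progress marker as the only extra checkpoint state. After a simulated failure, the computation resumes from the marker and recomputes only the interrupted row.

```python
class SimulatedNVMM:
    """Hypothetical stand-in for NVMM: its fields survive a simulated crash."""

    def __init__(self, num_rows):
        self.y = [0.0] * num_rows  # working result vector, updated in place
        self.next_row = 0          # the only extra checkpoint state: a progress marker


class SimulatedFailure(Exception):
    """Raised to imitate a node failure in the middle of the computation."""


def spmv_in_place(nvmm, rows, x, fail_at=None):
    """Compute y = A @ x one row at a time, resuming from nvmm.next_row.

    `rows` is a sparse matrix stored as a list of rows, each a list of
    (column, value) pairs. Each row result is a pure function of A and x,
    so recomputing an interrupted row after a failure is safe.
    """
    for i in range(nvmm.next_row, len(rows)):
        if fail_at is not None and i == fail_at:
            raise SimulatedFailure(f"simulated crash before row {i}")
        nvmm.y[i] = sum(value * x[col] for col, value in rows[i])
        nvmm.next_row = i + 1  # commit progress only after the row is stored
    return nvmm.y


# A small 4x3 sparse matrix and a dense vector (illustrative values only).
rows = [[(0, 2.0)], [(1, 3.0), (2, 1.0)], [(0, 1.0), (2, 4.0)], [(1, 5.0)]]
x = [1.0, 2.0, 3.0]

nvmm = SimulatedNVMM(len(rows))
try:
    spmv_in_place(nvmm, rows, x, fail_at=2)  # fails after rows 0 and 1 are done
except SimulatedFailure:
    pass  # no snapshot is reloaded; the finished rows are already in nvmm.y

result = spmv_in_place(nvmm, rows, x)  # resumes at row 2
print(result)  # [2.0, 9.0, 13.0, 10.0]
```

Under these assumptions, no separate snapshot is ever copied out: recovery reads the marker and continues, which is the property the description attributes to constructing checkpoints in place.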

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Status: Finished
Start date/End date: 15/5/18 to 31/3/19

Funding

  • National Science Foundation: USD 298,716.00

ASJC Scopus Subject Areas

  • Mathematics (all)
  • Computer Networks and Communications
  • Electrical and Electronic Engineering
  • Communication
