EAGER: Exploring Automatic Optimization of Multi-tiered HPC Storage Systems via Practical Reinforcement Learning

  • Dai, Dong D. (Principal Investigator)

Project details

Description

Scientific discovery increasingly involves generating and analyzing large amounts of data. These data-intensive scientific applications pose significant challenges to the storage systems of high-performance computing (HPC) clusters, which are heterogeneous and extremely complex. Scientists who need high-speed data access are often frustrated in trying to use these heterogeneous storage options effectively. There is a need to build the long-missing automated HPC I/O (input/output) middleware that transparently helps scientists achieve optimal data access performance without manual effort. Designing such middleware for large-scale, heterogeneous, and shared HPC storage systems is an extremely challenging task. The researchers supported by this grant plan to leverage machine learning techniques to understand I/O requests and the current system status, and to schedule and coordinate I/O requests intelligently and adaptively. The outcomes of this research are expected to work with existing storage components and to minimize the impact on both scientific applications and HPC systems.

This project plans to tackle this grand challenge by exploring practical reinforcement learning (RL)-based methods and building the relevant software infrastructure in an HPC environment. The project has two main focuses: 1) RL-based data placement for high storage utilization, and 2) RL-based I/O coordination for shared storage. Both tasks depend on identifying effective RL methods and integrating them into HPC systems. To achieve this goal, a novel, system-centric RL framework will be developed. Moreover, for each research focus, various RL algorithms, deep neural network designs, and reward-shaping strategies will be proposed, implemented, rigorously benchmarked, and compared with state-of-the-art solutions.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
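To make the idea of RL-based data placement with a shaped reward more concrete, here is a minimal sketch. It is not the project's framework, which the description does not specify. Everything in it is hypothetical and chosen only for illustration: the tier names (`burst_buffer`, `ssd`, `hdd`), the per-access and capacity costs, the "hot/warm/cold" file states, and the reward function. The agent uses a one-step (gamma = 0) tabular Q-learning update with epsilon-greedy exploration, which is equivalent to a contextual bandit. The capacity-cost term plays the role of reward shaping by discouraging the agent from filling scarce fast tiers.

```python
import random
from collections import defaultdict

# Hypothetical storage tiers and per-access latency cost (arbitrary units).
TIERS = ["burst_buffer", "ssd", "hdd"]
ACCESS_COST = {"burst_buffer": 1.0, "ssd": 3.0, "hdd": 10.0}
# Hypothetical cost of occupying capacity on each tier (fast tiers are scarce).
CAPACITY_COST = {"burst_buffer": 8.0, "ssd": 3.0, "hdd": 0.5}
# Hypothetical file "temperature" states mapped to expected future accesses.
EXPECTED_ACCESSES = {"hot": 10, "warm": 2, "cold": 0}
STATES = list(EXPECTED_ACCESSES)


def observed_reward(state, tier, rng, noise=1.0):
    """Shaped reward: negative access cost minus a capacity penalty, plus
    Gaussian noise standing in for measurement variance (an assumption)."""
    cost = EXPECTED_ACCESSES[state] * ACCESS_COST[tier] + CAPACITY_COST[tier]
    return -cost + rng.gauss(0.0, noise)


def greedy(q, state):
    """Tier with the highest estimated value for this state (first on ties)."""
    return max(TIERS, key=lambda tier: q[(state, tier)])


def train(episodes=5000, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)     # Q(s, a) starts at 0, optimistic since costs are positive
    visits = defaultdict(int)  # N(s, a), used for sample-average step sizes
    for _ in range(episodes):
        state = rng.choice(STATES)       # a new file arrives with some temperature
        if rng.random() < epsilon:
            tier = rng.choice(TIERS)     # explore a random tier
        else:
            tier = greedy(q, state)      # exploit the current estimate
        r = observed_reward(state, tier, rng)
        visits[(state, tier)] += 1
        # One-step (gamma = 0) Q-learning update with step size 1 / N(s, a).
        q[(state, tier)] += (r - q[(state, tier)]) / visits[(state, tier)]
    # Return the learned greedy placement policy.
    return {s: greedy(q, s) for s in STATES}


if __name__ == "__main__":
    print(train())
```

With these made-up costs, the lowest-cost tiers are the burst buffer for hot files, SSD for warm files, and HDD for cold files, so the learned policy should converge to that mapping. A real system would replace the fixed state table with observed I/O features and, as the description suggests, possibly a deep neural network instead of a lookup table.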
Status: Not started
Start/end date: 1 July 2024 to 30 June 2025

Funding

  • National Science Foundation: USD 133,980.00

ASJC Scopus Subject Areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Engineering (all)
  • Electrical and Electronic Engineering
  • Communication
