EAGER: Exploring Automatic Optimization of Multi-tiered HPC Storage Systems via Practical Reinforcement Learning

  • Dai, Dong D. (PI)

Project Details

Description

Scientific discovery increasingly involves generating and analyzing large amounts of data. These data-intensive scientific applications pose significant challenges to the storage systems of high-performance computing (HPC) clusters, which are heterogeneous and extremely complex. Scientists who need high-speed data access are often frustrated when trying to use these heterogeneous storage options effectively. There is a need to build the long-missing automated HPC I/O (input/output) middleware that transparently helps scientists achieve optimal data access performance without manual effort. Designing automated HPC I/O middleware for large-scale, heterogeneous, and shared HPC storage systems is an extremely challenging task. The researchers supported by this grant plan to leverage machine learning techniques to understand I/O requests and the current system status, and to schedule and coordinate those requests intelligently and adaptively. The outcomes of this research are expected to work with existing storage components and to minimize the impact on both scientific applications and the HPC systems.

This project plans to tackle this grand challenge by exploring practical reinforcement learning (RL)-based methods and building relevant software infrastructure in an HPC environment. The project has two main focuses: 1) RL-based data placement for high storage utilization, and 2) RL-based I/O coordination for shared storage. Both tasks depend on identifying effective reinforcement learning methods and integrating them effectively into HPC systems. To achieve this goal, a novel, system-centric reinforcement learning framework will be developed. Moreover, in each research focus, various RL algorithms, deep neural network designs, and reward shaping approaches will be proposed, implemented, rigorously benchmarked, and compared with state-of-the-art solutions.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Status: Not started
Effective start/end date: 1/7/24 → 30/6/25

Funding

  • National Science Foundation: US$133,980.00

ASJC Scopus Subject Areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Engineering (all)
  • Electrical and Electronic Engineering
  • Communication
