SHF: Small: Collaborative Research: A Parallel Graph-Based Paradigm for HPC Parallel File System Checkers

  • Dai, Dong D. (PI)

Project Details

Description

Modern high performance computing (HPC) platforms rely on large-scale parallel file systems to serve the data accesses of scientific applications. These parallel file systems often run on expensive hardware and are usually well maintained, but they may still experience failures and enter inconsistent states for various reasons (e.g., hardware faults, software bugs, configuration errors). When the state becomes inconsistent, a checking and repairing program called a checker is the last line of defense for bringing the system back to consistency. Nevertheless, today's checkers are error-prone and time-consuming to run. As the scale and complexity of these systems keep increasing, the situation will likely get worse. This project aims to enable scalable, high performance checking and repairing of widely used parallel file systems through a new parallel graph-based model. If successful, the project will substantially change how parallel file system checkers are used. Such an effort is a fundamental step towards building highly reliable future HPC parallel file systems for scientific discovery. In addition, this project integrates the research activities with education and outreach efforts to train a broadly inclusive and globally competitive science workforce.

The project consists of three thrusts. The first thrust focuses on constructing a general graph-based metadata model to abstract key metadata and consistency rules; the second thrust focuses on efficiently retrieving metadata from real systems and instantiating metadata graphs; the third thrust focuses on building a graph-based consistency checking runtime engine that conducts the checking in parallel to achieve scalable high performance. This work includes constructing a generic graph structure for representing different file system metadata, extracting the consistency rules among metadata items for checking, and defining a set of interfaces to facilitate building the graph model for other file systems. The project will explore compiling all consistency rules into a unified executable called a "blob", which can be run in parallel on all compute nodes, and will optimize the runtime graph engine to accommodate dependencies and achieve high performance.
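The general idea of checking consistency rules over a metadata graph in parallel can be illustrated with a minimal Python sketch. It is not the project's implementation: the `MetadataGraph` class, the two rules, the error labels, and the thread-based partitioning are all hypothetical and chosen only for illustration. Directories and files are nodes, directory entries are forward edges, and each child's recorded parent is a back pointer. The rules check that the two views agree.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class MetadataGraph:
    """Hypothetical in-memory metadata graph (names are illustrative only)."""
    kinds: dict = field(default_factory=dict)      # node id -> "dir" or "file"
    children: dict = field(default_factory=dict)   # dir id -> set of child ids
    parent_of: dict = field(default_factory=dict)  # child id -> recorded parent id

    def add_node(self, node_id, kind):
        self.kinds[node_id] = kind
        self.children.setdefault(node_id, set())

    def add_dirent(self, parent, child):
        self.children[parent].add(child)

    def set_backpointer(self, child, parent):
        self.parent_of[child] = parent


def rule_backpointer_matches(graph, node_id):
    """Each directory entry must be mirrored by the child's back pointer."""
    return [
        ("dangling_dirent", node_id, child)
        for child in sorted(graph.children[node_id])
        if graph.parent_of.get(child) != node_id
    ]


def rule_has_parent_entry(graph, node_id):
    """Each back pointer must be mirrored by an entry in the named parent."""
    parent = graph.parent_of.get(node_id)
    if parent is not None and node_id not in graph.children.get(parent, set()):
        return [("orphan", parent, node_id)]
    return []


# All rules bundled into one list, so every worker runs the same rule set.
RULES = [rule_backpointer_matches, rule_has_parent_entry]


def check_partition(graph, node_ids):
    """Apply every rule to every node in one partition (read-only access)."""
    errors = []
    for node_id in node_ids:
        for rule in RULES:
            errors.extend(rule(graph, node_id))
    return errors


def check_graph(graph, workers=4):
    """Split nodes round-robin into partitions and check them concurrently."""
    nodes = sorted(graph.kinds)
    partitions = [nodes[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda part: check_partition(graph, part), partitions)
        return sorted(err for part in results for err in part)


def build_example():
    """A tiny graph with one consistent file and two injected inconsistencies."""
    g = MetadataGraph()
    for node_id, kind in [("root", "dir"), ("a", "file"), ("b", "file"), ("c", "file")]:
        g.add_node(node_id, kind)
    g.add_dirent("root", "a")
    g.set_backpointer("a", "root")   # consistent: entry and back pointer agree
    g.add_dirent("root", "b")        # entry exists, back pointer missing
    g.set_backpointer("c", "root")   # back pointer exists, entry missing
    return g


if __name__ == "__main__":
    for error in check_graph(build_example()):
        print(error)
```

Because each rule only reads the graph and inspects a node together with its immediate neighbors, partitions can be checked independently. A real checker would also need to retrieve metadata from the file system servers, handle rules with dependencies on one another, and perform repairs, none of which this sketch attempts.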

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Status: Finished
Effective start/end date: 15/7/19 → 30/6/23

Funding

  • National Science Foundation: US$307,682.00

ASJC Scopus Subject Areas

  • Computer Networks and Communications
  • Electrical and Electronic Engineering
  • Communication
