Collaborative Research: Building the Community for the Open Storage Network

  • Ahalt, Stanley S. (PI)
  • Shanley, Lea L. (CoPI)
  • Aikat, Jay J. (CoPI)

Project Details

Description

The scientific community is facing a major challenge in dealing with the increasing amount of open scientific data emerging from research projects at all scales, from large facilities to small research labs. Over the last five years the NSF has funded more than 200 high-speed connections to the Internet2 backbone operating at 10-100 Gbps. The goal of this project is to develop a prototype module for a high-performance distributed storage system that extends the usability of these existing high-speed interconnects. The project is a pilot for a potential national-scale storage infrastructure for open scientific data, which at full scale could serve hundreds of sites and many hundreds of petabytes. Many of the technologies required for such a distributed system already exist; the key challenge in this project is one of social engineering: how can one design a storage node that is simple yet robust, can be easily replicated, is attractive for universities and research projects to adopt, is easy to manage, and can support the varied access patterns of large-scale scientific analyses?

Many universities have several of the necessary pieces for data-intensive science in place: reasonably sized computing clusters, a few PB of storage, and even a high-speed connection. Yet performing data-intensive analyses remains painful and slow: data is never where it is needed, large storage systems often fail despite massive RAID configurations, and moving data from disk to disk at full network speed still requires specialized skills. The project secures broad community buy-in through the Big Data Hubs and brings together a unique combination of skills, facilities, and science challenges to test, evaluate, and deploy different hardware and software combinations that can inform the design of a much larger, national-scale system. The goal is to design and run detailed benchmarks for test science projects requiring different combinations of data transfer, data processing, and massive compute, and to use the results to design and build a low-cost, scalable petascale appliance, comprising inexpensive hardware nodes and a simple software stack, that can be replicated across many universities, supercomputer centers, and large NSF facilities. The proposed system could become an enormous multiplier on existing NSF investments in high-end computing and fast networks, and could accelerate the standardization of data storage across the nation. The public, open data products often described in the Data Management Plans of NSF proposals could find an easy-to-use home, and educational projects could rely on a robust storage infrastructure with a simple API to build a variety of delivery services for the educational community.
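As an illustration of the kind of measurement such benchmarks involve, the sketch below times sequential write and read throughput on a single storage node. This is a minimal, generic example, not the project's actual benchmark suite; the chunk size, test-file size, and the /mnt/osn_scratch mount point are assumptions chosen for illustration.

    import os
    import time

    CHUNK = 64 * 1024 * 1024           # 64 MiB per I/O call (assumed chunk size)
    TOTAL = 4 * 1024 * 1024 * 1024     # 4 GiB test file (assumed size; tune per node)

    def write_throughput(path):
        """Sequentially write TOTAL bytes and return throughput in MB/s."""
        buf = os.urandom(CHUNK)
        start = time.perf_counter()
        with open(path, "wb") as f:
            written = 0
            while written < TOTAL:
                f.write(buf)
                written += CHUNK
            f.flush()
            os.fsync(f.fileno())       # include the cost of flushing to disk
        return TOTAL / (time.perf_counter() - start) / 1e6

    def read_throughput(path):
        """Sequentially read the file back and return throughput in MB/s.

        Note: a production benchmark would drop the page cache or use direct I/O
        so that the read is not served from memory.
        """
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(CHUNK):
                pass
        return TOTAL / (time.perf_counter() - start) / 1e6

    if __name__ == "__main__":
        path = "/mnt/osn_scratch/benchmark.dat"   # hypothetical scratch mount on a storage node
        print(f"write: {write_throughput(path):.0f} MB/s")
        print(f"read:  {read_throughput(path):.0f} MB/s")
        os.remove(path)

A full evaluation along the lines described above would repeat such measurements across different hardware configurations and add network transfer and processing stages, but the basic pattern of timing a fixed volume of data through each stage is the same.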

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Status: Finished
Effective start/end date: 15/6/18 – 31/5/21

Funding

  • National Science Foundation: US$431,786.00

ASJC Scopus Subject Areas

  • Computer Science (all)
