Project Details
Description
Vision: Advance research in distributed intelligence to transform the foundations of the integrated research infrastructure (IRI) in support of resilient DOE scientific workflows.
Objectives: This project seeks to revolutionize adaptive management and partitioning of resources. It starts from the assumption that computational platforms are inherently heterogeneous and unstable at every layer, from hardware through the operating system and runtime systems to middleware. As a result, faults at one level of the system can affect other components locally or cascade across the network. This work will explore how distributed intelligence, specifically swarm intelligence (SI), can provide robust, performant, resilient, and fault-tolerant execution of DOE scientific workflows that span a continuum of resources, from edge devices near sensors and instruments through wide area networks to leadership-class systems. The goal is to design an SI-based resilient IRI that can quickly recover from failures, adapt to changes in the environment, maximize overall resource utilization, and optimize the execution time of workflows submitted by DOE scientists.
Methods: Existing adaptive management and resource partitioning strategies developed for resilient infrastructure are often static, based on rules developed by experts with years of experience, and dependent on centralized control. While significant attention has been paid to online and dynamic resource management using mainstream artificial intelligence (AI) methods, their effectiveness has not yet been demonstrated at scale because they are unable to deal with the unique challenges posed by the complexity and scale of resilient infrastructures. Nature, on the other hand, provides elegant solutions for decentralized resilient systems with self-monitoring and self-healing capabilities. Examples include foraging behavior in ant and bee colonies, flocking behavior in birds, and schooling behavior in fish. SI is a class of AI methods inspired by such intelligent behavior of biological swarms. It studies how large numbers of simple agents can be designed to achieve a desired collective behavior through decentralized, local interactions among the agents and between the agents and the environment. SI methods have not yet made their way into the computational fabric supporting scientific applications, where they could transform the way advanced scientific computing is done. In this context, we will explore the following hypothesis: Robust, scalable, flexible, and resilient scientific workflow executions can be achieved by developing a new class of decentralized fault tolerance and adaptation strategies that are grounded in and inspired by swarm intelligence techniques and advances in machine learning (ML) and high-performance computing (HPC). This new class of methods will establish a new basic computer science research direction for DOE in resilient platforms and will significantly change the way in which scientific workflows are designed, developed, and executed on the DOE computing continuum.
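As a purely illustrative sketch of the kind of decentralized, local interaction described above (not the project's actual method), the toy Python loop below uses ant-colony-style pheromone trails to spread work across compute nodes and to route around a node that fails partway through. The function name `run_swarm`, all parameters, and the deposit and evaporation rules are assumptions made for this example only.

```python
import random


def run_swarm(num_nodes=4, num_agents=50, rounds=40,
              fail_node=2, fail_round=20, evaporation=0.3, seed=0):
    """Toy stigmergy loop (hypothetical): agents pick nodes by pheromone level.

    Agents never talk to a central scheduler. Each one reads the shared
    pheromone trails, picks a node, and leaves a deposit only if the node
    actually completed its task.
    """
    rng = random.Random(seed)
    pheromone = [1.0] * num_nodes  # environment state: one trail per node
    history = []                   # per-round count of agents on each node

    for r in range(rounds):
        # From fail_round onward, one node silently stops completing work.
        failed = {fail_node} if r >= fail_round else set()
        choices = [0] * num_nodes
        deposits = [0.0] * num_nodes

        for _ in range(num_agents):
            # Local decision: sample a node in proportion to its pheromone.
            node = rng.choices(range(num_nodes), weights=pheromone)[0]
            choices[node] += 1
            if node not in failed:
                # Diminishing reward on crowded nodes discourages pile-ups.
                deposits[node] += 1.0 / choices[node]

        # Evaporation lets stale trails (e.g. to a dead node) fade away.
        # A small floor keeps every node reachable so it can be rediscovered.
        pheromone = [max(0.01, (1.0 - evaporation) * p + d)
                     for p, d in zip(pheromone, deposits)]
        history.append(choices)

    return pheromone, history


if __name__ == "__main__":
    trails, counts = run_swarm()
    print("final pheromone per node:", [round(p, 3) for p in trails])
    print("agents per node, last round:", counts[-1])
```

In this sketch no agent is ever told that a node has failed. The failed node simply stops earning deposits, its trail evaporates, and later agents drift toward the healthy nodes, which is the self-healing property the paragraph above attributes to biological swarms.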
Potential Impacts: Outcomes: This research has the potential to transform the current state of the art; re-conceptualize distributed systems design, implementation, and deployment; and provide key results on how DOE science workflows can be optimized for performance and resilience. From a foundational perspective, the project will innovate across the research areas of SI, AI, scientific workflow management, compute and network systems, and resilience. It will make significant contributions to the understanding of how decentralized algorithms can optimize workflow throughput, resource utilization, and resilience to maximize scientific productivity and accelerate scientific discovery. From the application perspective, this research will be grounded in real DOE science workflows, which will be characterized, evaluated in the context of the reimagined DOE IRI, and made available to the community. Benefits: The proposed research has the potential to revolutionize the IRI by advancing the science of workflows spanning the edge-to-HPC computing and networking continuum; it will provide DOE scientists with innovative tools to be more productive and accelerate scientific discovery. The project will advance research in SI for distributed infrastructure by making the agents more powerful and knowledgeable, thereby enhancing their applicability to other complex systems. Advancements in SI that add self-monitoring, self-healing, and greater intelligence can also influence how future DOE user facilities are designed and operated, supporting DOE's goal of making them self-driving laboratories.
Status | Active |
---|---|
Effective start/end date | 1/7/23 → 30/6/28 |
Links | https://pamspublic.science.energy.gov/WebPAMSExternal/Interface/Common/ViewPublicAbstract.aspx?rv=2ea3dc28-783c-410a-8615-2f270f2a939a&rtc=24&PRoleId=10 |
Funding
- Advanced Scientific Computing Research: US$350,000.00
ASJC Scopus Subject Areas
- Artificial Intelligence
- Energy(all)