Checkpointing the Un-checkpointable: the Split-Process Approach for MPI and Formal Verification
Checkpointing is the ability to save the state of a running process to stable storage, and later restart, perhaps on a different computer. Transparent checkpointing (or system-level checkpointing) is the ability to checkpoint a (possibly parallel or distributed) application, without modifying the application binaries. The speaker has led the DMTCP project in this area for 15 years.
This talk presents an efficient, new software architecture: split processes. It is being applied to MPI (“MANA for MPI”) and formal verification. The key insight is that any numerical, network and system libraries (e.g., including MPI and CUDA) reside in process memory that we tag as the “lower half”. Only application-specific code is checkpointed (text, data and stack). On restart, a separate lower-half process is started as a trivial application. A bit of plumbing then re-connects the upper half with this lower-half, and unites them in a single thread of control. Runtime overhead remains near zero, since the lower-half libraries on the destination host have been optimized for that host. Overhead is near zero even for heterogeneous systems with wildly different libraries (e.g., TCP on source host and InfiniBand on destination).
A test of any new theory is how broadly it can be applied. The split-process approach is related in spirit to Microsoft Drawbridge and to the design of WSL. But its internals are different. This difference enables ongoing work in exploring split processes to provide the first native checkpointing support for Microsoft Windows that doesn’t require any administrative privilege.
The new theory is also being applied to formal verification, in a collaboration with Martin Quinson and the SimGrid team. SimGrid provides model checking for Pthread, MPI, and distributed applications, based directly on the original source code (no intermediate model necessary). SimGrid successfully finds race conditions and safety property violations in real-world software, but it has largely been limited to the first few minutes of execution. However, as one scales up, a rare race condition may eventually trigger, resulting in a new crash beyond the reach of SimGrid. To capture this case, applications are started natively (without SimGrid), while periodic checkpointing under DMTCP. After a crash, the application is restarted from the last checkpoint, but using a SimGrid-based lower half. SimGrid then produces a schedule that exhibits the crash, regardless of any race conditions.
The split-process approach represents joint work with Rohan Garg.
Speaker Details
Professor Cooperman works in high-performance computing and scalable applications for computational algebra. He received his B.S. from the University of Michigan in 1974, and his Ph.D. from Brown University in 1978. He then spent six years in basic research at GTE Laboratories. He came to Northeastern University in 1986, and has been a full professor there since 1992. His visiting research positions include a 5-year IDEX Chair of Attractivity at the University of Toulouse/CNRS in France, and sabbaticals at Concordia University, at CERN, and at Inria (France). He is one of the more than 100 co-authors on the foundational Geant4 paper, whose current citation count is at 25,000. The extension of the million-line code of Geant4 to use multi-threading (Geant4-MT) was accomplished in 2014 on the basis of joint work with his PhD student, Xin Dong. Prof. Cooperman currently leads the DMTCP project (Distributed Multi-Threaded CheckPointing) for transparent checkpointing. The project began in 2004, and has benefited from a series of PhD theses. Over 100 refereed publications cite DMTCP as having contributed to their research project.
- Series:
- Microsoft Research Talks
- Date:
- Speakers:
- Gene Cooperman
- Affiliation:
- Northeastern University
Series: Microsoft Research Talks
-
Decoding the Human Brain – A Neurosurgeon’s Experience
Speakers:- Pascal Zinn,
- Ivan Tashev
-
-
-
-
Galea: The Bridge Between Mixed Reality and Neurotechnology
Speakers:- Eva Esteban,
- Conor Russomanno
-
Current and Future Application of BCIs
Speakers:- Christoph Guger
-
Challenges in Evolving a Successful Database Product (SQL Server) to a Cloud Service (SQL Azure)
Speakers:- Hanuma Kodavalla,
- Phil Bernstein
-
Improving text prediction accuracy using neurophysiology
Speakers:- Sophia Mehdizadeh
-
-
DIABLo: a Deep Individual-Agnostic Binaural Localizer
Speakers:- Shoken Kaneko
-
-
Recent Efforts Towards Efficient And Scalable Neural Waveform Coding
Speakers:- Kai Zhen
-
-
Audio-based Toxic Language Detection
Speakers:- Midia Yousefi
-
-
From SqueezeNet to SqueezeBERT: Developing Efficient Deep Neural Networks
Speakers:- Sujeeth Bharadwaj
-
Hope Speech and Help Speech: Surfacing Positivity Amidst Hate
Speakers:- Monojit Choudhury
-
-
-
-
-
'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project
Speakers:- Peter Clark
-
Checkpointing the Un-checkpointable: the Split-Process Approach for MPI and Formal Verification
Speakers:- Gene Cooperman
-
Learning Structured Models for Safe Robot Control
Speakers:- Ashish Kapoor
-