Checkpointing the Un-checkpointable: the Split-Process Approach for MPI and Formal Verification

Checkpointing is the ability to save the state of a running process to stable storage, and later restart, perhaps on a different computer. Transparent checkpointing (or system-level checkpointing) is the ability to checkpoint a (possibly parallel or distributed) application, without modifying the application binaries. The speaker has led the DMTCP project in this area for 15 years.

This talk presents an efficient, new software architecture: split processes. It is being applied to MPI («MANA for MPI») and formal verification. The key insight is that any numerical, network and system libraries (e.g., including MPI and CUDA) reside in process memory that we tag as the «lower half». Only application-specific code is checkpointed (text, data and stack). On restart, a separate lower-half process is started as a trivial application. A bit of plumbing then re-connects the upper half with this lower-half, and unites them in a single thread of control. Runtime overhead remains near zero, since the lower-half libraries on the destination host have been optimized for that host. Overhead is near zero even for heterogeneous systems with wildly different libraries (e.g., TCP on source host and InfiniBand on destination).

A test of any new theory is how broadly it can be applied. The split-process approach is related in spirit to Microsoft Drawbridge and to the design of WSL. But its internals are different. This difference enables ongoing work in exploring split processes to provide the first native checkpointing support for Microsoft Windows that doesn’t require any administrative privilege.

The new theory is also being applied to formal verification, in a collaboration with Martin Quinson and the SimGrid team. SimGrid provides model checking for Pthread, MPI, and distributed applications, based directly on the original source code (no intermediate model necessary). SimGrid successfully finds race conditions and safety property violations in real-world software, but it has largely been limited to the first few minutes of execution. However, as one scales up, a rare race condition may eventually trigger, resulting in a new crash beyond the reach of SimGrid. To capture this case, applications are started natively (without SimGrid), while periodic checkpointing under DMTCP. After a crash, the application is restarted from the last checkpoint, but using a SimGrid-based lower half. SimGrid then produces a schedule that exhibits the crash, regardless of any race conditions.

The split-process approach represents joint work with Rohan Garg.

[Talk Slides]

Speaker Bios

Professor Cooperman works in high-performance computing and scalable applications for computational algebra. He received his B.S. from the University of Michigan in 1974, and his Ph.D. from Brown University in 1978. He then spent six years in basic research at GTE Laboratories. He came to Northeastern University in 1986, and has been a full professor there since 1992. His visiting research positions include a 5-year IDEX Chair of Attractivity at the University of Toulouse/CNRS in France, and sabbaticals at Concordia University, at CERN, and at Inria (France). He is one of the more than 100 co-authors on the foundational Geant4 paper, whose current citation count is at 25,000. The extension of the million-line code of Geant4 to use multi-threading (Geant4-MT) was accomplished in 2014 on the basis of joint work with his PhD student, Xin Dong. Prof. Cooperman currently leads the DMTCP project (Distributed Multi-Threaded CheckPointing) for transparent checkpointing. The project began in 2004, and has benefited from a series of PhD theses. Over 100 refereed publications cite DMTCP as having contributed to their research project.