Summary: Checkpointing Processes

From: fed!m1rcd00@uunet.uu.net
Date: Tue Nov 13 1990 - 11:23:50 CST


Many thanks to the many who have replied. Sorry for the delay, but I
was still getting email on this up until Sunday. There have been many
different suggestions, most of them for the partial solutions I thought
must exist and which I solicited. The closest thing to a complete
solution was a suggestion to rip the swapping code out of the operating
system source and start from there, but lo, I don't have source... :-(

A common suggestion was to look at Condor from the University of
Wisconsin, ftp-able from shorty.cs.wisc.edu; I was able to pick up the
current version (4.0.0) from uunet.uu.net. This is an implementation of
a classic process migration scheme. It looks to be very simple and well
thought out and we may put this up as a general service on our network
even if we don't use it for the problem at hand. It would probably do
exactly what we need as is, just by setting up a CPU pool with only one
CPU, but for two difficulties: it requires linking with a custom C
library, and it may not work if the program writes and reads the same
file, which SAS does rampantly. If the user eventually decides to
rewrite the core computation in a third-generation language, then it
seems that this would be an ideal tool. This suggestion was at least
part of the response from:

 uunet!ctr.columbia.edu!seth (Seth Robertson)
 Mark Verber <uunet!pacific.mps.ohio-state.edu!verber>
 uunet!hao.ucar.edu!sitongia (Leonard Sitongia)
 uunet!Sun.COM!timsmith (Timothy G. Smith - Technical Consultant Sun Baltimore)

Some other suggestions:

Jeff Nieusma <uunet!eclipse.Colorado.EDU!nieusma> suggested getting a faster
machine :-)

uunet!Sun.COM!timsmith (Timothy G. Smith - Technical Consultant Sun Baltimore)
sent a thoughtful note that referred to some other work that had been done
in the area, and talked a little about how one would go about writing an
application well-behaved for checkpointing. He also said that we should
all be beating up on our computer vendors to demand tools and kernel support
for doing "hard core" computing.

uunet!ctr.columbia.edu!seth (Seth Robertson) mentioned that he had written
a checkpointer for Suns that has some I/O restrictions.

uunet!water.ca.gov!rfinch (Ralph Finch) described how they had split up
a calibration program and distributed the pieces using ISIS. If one piece
crashes, the main program just restarts it somewhere else.
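
The restart-on-failure idea can be sketched on a single machine. The
following Python is only an illustration under stated assumptions: it
does not use ISIS, it runs every piece locally rather than "somewhere
else", and the names run_pieces and max_attempts are hypothetical.

```python
import subprocess
import sys


def run_pieces(commands, max_attempts=3):
    """Run each command to completion, restarting any that exits non-zero.

    commands maps a piece name to an argv list. Returns a dict mapping
    each piece name to the number of attempts it needed.
    """
    attempts = {name: 0 for name in commands}
    pending = dict(commands)
    while pending:
        # Start every pending piece at once, then wait for all of them.
        running = {name: subprocess.Popen(cmd) for name, cmd in pending.items()}
        for name, proc in running.items():
            attempts[name] += 1
            if proc.wait() == 0:
                del pending[name]  # finished cleanly; do not restart
        # Anything still pending failed this round; give up after the limit.
        exhausted = [n for n in pending if attempts[n] >= max_attempts]
        if exhausted:
            raise RuntimeError("gave up on: " + ", ".join(sorted(exhausted)))
    return attempts


if __name__ == "__main__":
    # Two trivial pieces that both succeed on the first attempt.
    print(run_pieces({
        "a": [sys.executable, "-c", "pass"],
        "b": [sys.executable, "-c", "pass"],
    }))
```

This only helps if each piece is short enough that rerunning it from
scratch is acceptable, which is the point of splitting the program up.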

uunet!sedist.cray.com!rjt (Randy Thomas) thought that this question wasn't
appropriate for sun-managers.

steve@ssd.kodak.com (Steve Bochinski) thought that one might be able to
build something that was based on the swap code in the OS.

uunet!Sun.COM!halstern (Hal Stern - Consultant) had a number of helpful
suggestions on how to write a program that could do its own checkpointing,
and cited GNU emacs as an example.

uunet!wubios.wustl.edu!phil (J. Philip Miller) suggested that we turn to
the sas-l mailing list for help in breaking up the SAS program.

"Bill Eshleman" <uunet!water.agen.ufl.edu!wde> suggested setting up a
good fast workstation with a UPS and dedicating it to the task, letting
NFS handle server crashes.

Mark Ferraretto <uunet!physics.adelaide.edu.au!mferrare> suggested logging
the progress of the program in a file so that results that have been computed
could be saved, and the program could pick up where it left off.
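
A minimal sketch of the progress-log approach, in Python. Everything
here is an assumption made for illustration: the one-JSON-record-per-line
log format and the names load_log and run_with_progress_log are
hypothetical, and the work is assumed to split into independent items
whose results can be serialized.

```python
import json
import os


def load_log(log_path):
    """Return results already recorded, dropping a torn final line if any."""
    done = {}
    good_bytes = 0
    if os.path.exists(log_path):
        with open(log_path, "rb") as log:
            for raw in log:
                if not raw.endswith(b"\n"):
                    break  # crash happened mid-write; discard this record
                try:
                    record = json.loads(raw.decode("utf-8"))
                except ValueError:
                    break
                done[record["key"]] = record["value"]
                good_bytes += len(raw)
        # Cut the file back to the last complete record before appending.
        with open(log_path, "r+b") as log:
            log.truncate(good_bytes)
    return done


def run_with_progress_log(items, compute, log_path):
    """Compute compute(item) per item, skipping items already in the log."""
    done = load_log(log_path)
    with open(log_path, "a") as log:
        for item in items:
            key = str(item)
            if key in done:
                continue  # computed before the last crash or restart
            value = compute(item)
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the record to disk before moving on
            done[key] = value
    return done
```

Running the same call again after a crash recomputes only the items
that are missing from the log. The fsync after every record is the
conservative choice; batching several records per fsync trades a little
lost work for less I/O.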

uunet!Corp.Sun.COM!kevin (Kevin Sheehan {Consulting Poster Child}) said
that he had solved a similar problem by having his program do all its
scratch work in a mmap'd file; the program can read this file and just
pick up where it left off. He pointed to the Rogue sources as an example
of something that does a very good job of starting where it left off.
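
The mmap'd scratch file idea might look roughly like the Python sketch
below. This is not Kevin's code. The file layout (a loop index and a
running total), the toy computation, and the names open_scratch,
sum_squares and stop_after are all assumptions made for illustration.

```python
import mmap
import os
import struct

# Scratch layout: next index to process (int64), running total (float64).
HEADER = struct.Struct("<qd")


def open_scratch(path):
    """Map the scratch file, creating it zero-filled on first use."""
    new = not os.path.exists(path) or os.path.getsize(path) < HEADER.size
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        if new:
            os.ftruncate(fd, HEADER.size)  # zeros decode as (0, 0.0)
        return mmap.mmap(fd, HEADER.size)
    finally:
        os.close(fd)  # the mapping keeps its own handle on the file


def sum_squares(n, path, stop_after=None):
    """Sum i*i for i in range(n), keeping all working state in the mapping.

    stop_after is a hypothetical knob that simulates an interruption
    after that many steps. Returns (next_index, total).
    """
    scratch = open_scratch(path)
    try:
        i, total = HEADER.unpack_from(scratch, 0)  # resume from saved state
        steps = 0
        while i < n:
            if stop_after is not None and steps >= stop_after:
                break
            total += float(i) * i
            i += 1
            HEADER.pack_into(scratch, 0, i, total)  # state lives in the file
            steps += 1
        scratch.flush()  # force the mapped page out to disk
        return i, total
    finally:
        scratch.close()
```

If the process dies, the next run maps the same file and continues from
the stored index. Two caveats: the index and total are not updated
atomically, and surviving a machine crash (as opposed to a process
crash) would need a flush at each point the state must be durable.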

uunet!niwot.scd.ucar.EDU!era (Ed Arnold) suggested casting about for
a public domain version of Cray's NQS. He said it has public-domain
roots (from NASA/AMES) and has heard that there are other versions around...

Thanks again to all! I will take this information to the user and
discuss with him the next step. If anyone would like full copies of all
the mail I received on this subject, please let me know by sending a
request to rcd@fed.frb.gov. Below is my original message:

-----------------------------------------------------------------------
Greetings.

A user at this site is using SAS on a Solbourne 5/802 (running
Solbourne's version of SunOS 4.0.3) to do some survey analysis
involving millions of observations. Soon, he will be running some
programs that he projects will take 20 to 30 *days* to complete.
Obviously, a system crash or power outage at 25 days would be
disastrous. We are looking into acquiring a UPS for the system,
and I have asked him and others to take a good hard look to see if
there is really *no* way to break the program up.

However, we would like to know if anyone out there knows of any
way to "checkpoint" and "restore" such a process from the
operating system level. It occurred to me that the implementation
of some process migration scheme might contain the required tools,
although it strikes me that the fact that he is using a closed
black box like SAS may hold complications for such a solution,
especially in regard to dealing with open files...

If anyone has any information or opinions that might help us, even
if it's just "fat chance", please send it to rcd@fed.frb.gov. I
would be interested in hearing about partial solutions, such as a
process migration scheme that requires linking with special
libraries... on further analysis, we might discover that the
long-running part of the program might easily be re-coded in
Fortran/NAG, for example.

Thank you. I will summarize for the net.

Bob Drzyzgula
Federal Reserve Board
Washington, D.C.



This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:05:59 CDT