SUMMARY:suspend&resume

From: Mr Rene Occelli (rene@iusti.univ-mrs.fr)
Date: Fri Apr 26 1996 - 03:20:17 CDT


Hi,

First, my question:
One of my users wants to run a very large and long simulation program
(about 60 Meg of memory and perhaps two weeks of run time on a Sparc 20).

I wonder if there is a way to run this job at night, suspend it in the
morning (dumping all of its memory to disk) and restart it the next night
(without losing anything).

That way the station is free for all users during the day.

............

I have received a lot of responses. Many thanks to everyone. There are
many solutions:

a) The most popular (this is the solution I have chosen):
Start the process in the background (with at, ...),
get the pid of the running job, then STOP it with kill:

kill -STOP pid

This suspends the process, and while it is stopped its pages can be swapped out.
In this case the CPU is free for other users.
************************************************
One should take care to have ENOUGH swap space.
************************************************
Then restart the job at night with

kill -CONT pid
 

One can put these commands in the user's crontab:
0 8 * * * kill -STOP pid
0 20 * * * kill -CONT pid

CAUTION: In case of crash or reboot the job is LOST
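
A literal pid in the crontab goes stale as soon as the job is restarted, so
one possible refinement is a small wrapper that reads the pid from a file.
This is only a sketch: the pid file location and the function names are my
own assumptions, not something from the responses. Note that a recorded pid
can be reused by another process after the job exits, so the pid file should
be removed when the job finishes.

```shell
#!/bin/sh
# Hypothetical helper: PIDFILE and the function names are assumptions.
# The job is expected to have been started with something like:
#   ./simulation & echo $! > "$HOME/simulation.pid"
PIDFILE="${PIDFILE:-$HOME/simulation.pid}"

# Print the recorded pid if that process still exists; fail otherwise.
job_pid() {
    [ -r "$PIDFILE" ] || return 1
    pid=$(cat "$PIDFILE")
    kill -0 "$pid" 2>/dev/null || return 1
    echo "$pid"
}

# Stop the job (it keeps its memory image, which may be paged out to swap).
suspend_job() {
    pid=$(job_pid) && kill -STOP "$pid"
}

# Let the job continue exactly where it was stopped.
resume_job() {
    pid=$(job_pid) && kill -CONT "$pid"
}

# Dispatch on the first argument so cron can call the script directly.
case "${1:-}" in
    stop) suspend_job ;;
    cont) resume_job ;;
esac
```

If the script is saved as, say, $HOME/bin/simjob.sh (a hypothetical path),
the crontab entries become:
0 8 * * * $HOME/bin/simjob.sh stop
0 20 * * * $HOME/bin/simjob.sh cont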

b) Start the program, check its pid, and use cron to increase/decrease its
priority for night/day.
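
As a rough sketch of what cron would run for this, the helper below sets the
nice value of a running job with renice. The function name and the nice
values are my own assumptions. Keep in mind that an ordinary user can
normally raise the nice value of his own process (lower its priority) but
usually cannot bring it back down, so the night-time entry may have to go in
root's crontab.

```shell
#!/bin/sh
# Hypothetical helper: set_job_nice is an illustrative name.
# Usage: set_job_nice PID NICE_VALUE
# A higher nice value means a lower scheduling priority.
set_job_nice() {
    pid=$1
    level=$2
    # Refuse to act on a pid that no longer exists.
    kill -0 "$pid" 2>/dev/null || {
        echo "no such process: $pid" >&2
        return 1
    }
    # Traditional renice syntax: absolute nice value, then -p pid.
    renice "$level" -p "$pid" >/dev/null
}
```

For example, "set_job_nice pid 19" at 8:00 pushes the job to the lowest
priority for the day, and "set_job_nice pid 0" at 20:00 (run with enough
privilege) restores the default for the night. Unlike kill -STOP, the job
keeps running during the day whenever the CPU is otherwise idle.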

c) Start the process in a window. Pressing Ctrl-z (in this window) suspends it
(in csh/tcsh) and returns you to the shell prompt. Typing fg in that window
then puts the job back in the foreground. This method can be chosen when the
problem comes up only occasionally.

d) Write the application using mmap() and files. When the job is suspended,
its pages are backed by the files on disk. So even if the system reboots while
the job is suspended, the job can restart from the data in those files.
(Not tested.)
(idea from Kevin.Sheehan@uniq.com.au (Kevin Sheehan {Consulting Poster Child}) )

e) Start the job, then kill it in such a way that it dumps core (kill -SEGV pid),
then reload it in a debugger
dbx /path/executable core
and continue from there.
(Not tested.)
(idea from Gerhard den Hollander <gerhard@jason.nl> )

f) The Condor package.

This is a huge free software package that can move jobs
between machines (!). So Condor does more than just stop the process:
it also saves the process image to disk, so a stopped job no longer uses
any swap space.
It was also suggested to recompile the job with the Condor libraries, which
makes it checkpointable.

Condor can be found at :
http://www.cs.wisc.edu/condor/index.html
http://www.cs.wisc.edu/condor/
psuvax1.cs.psu.edu:/pub/src/Condor
condor-request@cs.wisc.edu

Because it's a big package, I will try it in the future and perhaps send
a Summary about it.

CONCLUSION:
In my case, the solution based on stopping the job or running it at low priority
is sufficient and easy to explain to users.

Thanks to :
Claus Assmann <ca@informatik.uni-kiel.de>
simes@tcp.co.uk
Paul Groves <paul.groves@linkhouse.com>
Daniel Lorenzini <lorenzd@gcm.com>
kozover@bimacs.cs.biu.ac.il (Kozover maxim)
mrs@cadem.mc.xerox.com ("Michael Salehi x22725")
miquel@proton.uab.es (Miquel Cabanas . BBM - UAB)
Gerhard den Hollander <gerhard@jason.nl>
Torsten Metzner <tom@math.uni-paderborn.de>
davem@cp.tybrin.com (Dave McFerren)
heas <heas@nexen.com>
iv08480@issc02.mdc.com (Colin Melville)
bukys@cs.rochester.edu
jgarb@erim.org (Joe Garbarino)
Jay Lessert <jayl@lattice.com>
tjb839@zacatecas.optimum.com (Tim Boemker)
Fedor Gnuchev <qwe@ht.eimb.rssi.ru>
mjohnson@knee.brooks.af.mil (H. Milton Johnson)
Kevin.Sheehan@uniq.com.au (Kevin Sheehan {Consulting Poster Child})
Glenn.Satchell@Uniq.com.au (Glenn Satchell - Uniq Professional Services)
Tony Kay <Tony.Kay@dubai.Sun.COM>
misawa@physics.Berkeley.EDU (Shigeki Misawa)
thomas@wiwi.hu-berlin.de (Thomas Koetter)
oconnor%gecko@aec.aeg.kn.DaimlerBenz.com
Andreas Stuebinger <stuebing@fmi.uni-passau.de>
fredc@hounix.org (Fred Chastang)
hagberg@ece.arizona.edu (D. J. Hagberg)

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ Rene OCCELLI +
+ I.U.S.T.I. C.N.R.S. U.M.R. 139 +
+ Av. Esc. Normandie Niemen +
+ 13397 MARSEILLE Cedex 20 France +
+ Tel: (33)91 28 82 08 +
+ Fax: (33)91 28 82 25 +
+ Email: rene@iusti.univ-mrs.fr +
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


