SUMMARY: batch control

From: Glenn Carver (glenn@atmos-modelling.chemistry.cambridge.ac.uk)
Date: Sun Oct 27 1991 - 13:44:15 CST


Thanks to all of you who took time to reply to my query about batch job control
on workstations. Sorry about the delay in summarizing.

I had a large number of responses, which directed me to several freely available
software packages. Unfortunately I've not had time to examine the capabilities
of each package in detail, so this summary gives only brief details. Still, I
hope it helps those who are facing the same problems I am.

As I half expected, batch control has been a problem for system managers for
some time, and a great deal of effort has been spent on developing a usable
system. However, the level of sophistication varies, so you have to decide
on your requirements before investing time in installing any of these packages.

Here's a summary of what I found out (my original message at the end):

1. SunOS batch, at, cron.
-------------------------
Can be configured for multiple queues per machine. You can specify the number
of jobs per queue, nice value and retry time for jobs. See man page for
queuedefs for more details. Very limited capabilities.
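
To give a feel for what a queue definition carries, here is a minimal sketch
that parses one entry in the System V style layout, q.[njob]j[nice]n[nwait]w.
The layout and the default values used below (100 jobs, nice 2, 60 second
retry) are assumptions, not something taken from the replies, so check them
against the queuedefs man page on your own system. The function name is
hypothetical.

```python
import re

# Assumed defaults; verify against the queuedefs man page on your system.
DEFAULTS = {"njob": 100, "nice": 2, "nwait": 60}

# One entry: queue letter, a dot, then optional <n>j, <n>n and <n>w fields.
_ENTRY = re.compile(r"^([a-z])\.(?:(\d+)j)?(?:(\d+)n)?(?:(\d+)w)?$")


def parse_queuedef(line):
    """Parse one entry such as 'b.2j2n90w' into a dict of queue settings."""
    m = _ENTRY.match(line.strip())
    if m is None:
        raise ValueError("not a queuedefs entry: %r" % line)
    queue, njob, nice, nwait = m.groups()
    return {
        "queue": queue,
        "njob": int(njob) if njob else DEFAULTS["njob"],      # max jobs at once
        "nice": int(nice) if nice else DEFAULTS["nice"],      # nice value
        "nwait": int(nwait) if nwait else DEFAULTS["nwait"],  # retry time (s)
    }
```

For example, parse_queuedef("b.2j2n90w") describes queue b with at most two
jobs, nice value 2 and a 90 second retry time.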

2. Using the print spooler.
---------------------------
Several people pointed out that you can use the print spooler mechanism to
set up and manage distributed batch queues by running scripts instead of
printing. No one sent me details of a working mechanism and I haven't tried
it yet. It might be very useful in combination with some of the
non-distributed software.

3. dsh.
-------
Alan Stebbens <aks%anywhere@edu.ucsb.hub> pointed me in the direction of
'dsh'. 'dsh' implements a distributed shell which finds the least loaded
machine and runs the command on it. dsh is available by anonymous ftp
from hub.ucsb.edu in pub/shells/dsh.tar.Z
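
The idea of "find the least loaded machine and run the command there" can be
sketched as below. This is not dsh's actual algorithm, which I have not
examined; the host names and load figures are made up, the function names are
hypothetical, and using rsh as the transport is an assumption.

```python
import shlex


def least_loaded(loads):
    """Return the host name with the smallest load average."""
    if not loads:
        raise ValueError("no candidate hosts")
    return min(loads, key=loads.get)


def remote_command(host, argv):
    """Build an rsh argument list that runs argv on host."""
    # rsh hands the remote shell a single string, so quote each argument.
    return ["rsh", host, " ".join(shlex.quote(a) for a in argv)]


# Hypothetical load averages, e.g. as collected from rup or uptime.
loads = {"machine1": 1.80, "machine2": 0.25, "machine3": 0.90}
host = least_loaded(loads)
cmd = remote_command(host, ["nice", "./myjob"])
```

The resulting list could be handed to subprocess.call() to start the job.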

4. Batch.
---------
Ken Lalonde <ken@edu.toronto.cs> has written a batch control package. It is
a collection of programs and scripts that allows you to set up various
queues on a machine with characteristics such as the priority of jobs,
job resource limits and so on. It runs a daemon which monitors the load
on the machine and can halt jobs when the load reaches a settable level.
Batch is not networked. Several people recommended this package. It's
available by anonymous ftp from ftp.cs.toronto.edu in pub/batch.tar.Z.
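
Halting jobs on load can be pictured roughly as follows. This is a sketch of
the general technique only, assuming the daemon suspends and resumes jobs with
SIGSTOP and SIGCONT; I have not checked how Batch itself does it, and the
function names are hypothetical.

```python
import os
import signal


def next_action(load, limit, stopped):
    """Decide what to do with a job given the current load average.

    Returns signal.SIGSTOP, signal.SIGCONT, or None for no change.
    """
    if load >= limit and not stopped:
        return signal.SIGSTOP   # machine is busy: suspend the job
    if load < limit and stopped:
        return signal.SIGCONT   # load has dropped: let the job continue
    return None


def check_once(pid, limit, stopped):
    """Run one monitoring step on process pid; return the new stopped state."""
    load = os.getloadavg()[0]   # 1-minute load average
    action = next_action(load, limit, stopped)
    if action is not None:
        os.kill(pid, action)
        return action == signal.SIGSTOP
    return stopped
```

A real daemon would call check_once() for each job every minute or so, and
would probably use separate stop and resume levels so that jobs do not flip
state every time the load hovers around the limit.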

5. QBATCH
---------
Thanks to Milt Ratcliff <milt@pe-nelson.com> for mailing me about QBATCH.
QBATCH was developed by Alan Saunders on Sun workstations. It is
not networked but does provide a comprehensive set of job control
options, more than Batch (4.), although it does not halt jobs when the load
reaches some predetermined level. QBATCH is available from several anonymous
ftp sites. I got it from lth.se in netnews/alt.sources/volume91/jul but it's
also available from cs.dal.ca in pub/bio as qbatch.tar.Z.

6. Condor.
----------
Many replies mentioned the Condor package. Condor was written at the
University of Wisconsin and is quite sophisticated and well documented.
It is fully distributed: machines enter and leave a 'pool' which Condor uses
to run jobs. Jobs are checkpointed, so a job on a machine that leaves the pool
can be moved to, and continued on, a machine that enters it. The snags appear
to be that a replacement version of the libc.a library is required to enable
the checkpointing (programs must be statically linked) and that I/O is not
implemented well for FORTRAN. For more info contact condor-request@cs.wisc.edu.
Condor is available from many ftp sites as Condor_4.0.0.tar.Z. Use 'archie' to
find one (USA: quiche.cs.mcgill.ca; EUROPE: nic.funet.fi; log in as user
archie).
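
For anyone unfamiliar with checkpointing, the sketch below shows the idea at
the application level: save progress regularly so that a run interrupted on
one machine can be picked up elsewhere. Condor does this transparently for the
whole process via its replacement library, which is a different mechanism; the
code here is only an analogy, and the function name, file format and
stop_after parameter are all hypothetical.

```python
import json
import os


def run(total_steps, ckpt_path, stop_after=None):
    """Sum the integers 0..total_steps-1, checkpointing after every step.

    stop_after simulates the machine leaving the pool after that many steps.
    Returns the sum, or None if the run was interrupted.
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(ckpt_path):
        # A checkpoint exists: resume from it rather than starting again.
        with open(ckpt_path) as f:
            state = json.load(f)
    done = 0
    while state["step"] < total_steps:
        if stop_after is not None and done >= stop_after:
            return None
        state["acc"] += state["step"]
        state["step"] += 1
        done += 1
        with open(ckpt_path, "w") as f:
            json.dump(state, f)
    return state["acc"]
```

Calling run(10, path, stop_after=4) and then run(10, path) with the same
checkpoint file completes the sum without repeating the first four steps.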

7. NQS.
-------
The Network Queueing System was developed under contract to NASA. There is a
version (which I assume to be the original) on permac.space.swri.edu in
public/convexug/nqs.tar.Z (and on other anonymous ftp sites). NQS is also
marketed by several companies, in versions improved over the original: COSMIC
(382 East Broad St., Athens GA 30602), supporting SIG, Sun, VAX & Stardent,
and Sterling Software (415 area code, sorry no other details). Cray also have
a version, and sell a version called RQS for remote queueing on Cray machines.
COSMIC are also rumoured to be developing NQS II. For those with money to
spend, this may be the one.

At first glance NQS seems to offer capabilities similar to Condor's, but
this is quite a big package and I haven't had time to go through it all. I
did hear from someone who had successfully installed the permac version
in a multi-architecture environment (including Suns, although it required a
bit of work).

8. MDQS
-------
MDQS was developed at the U.S. Army Ballistic Research Lab. and is available
from ftp.brl.mil in arch/mdqs.tar.Z. MDQS stands for Multi-Device Queueing
System and appears to have been originally developed to handle a large number
of network printer devices (multiple devices per queue, multiple queues per
device) but also includes facilities for batching jobs on machines. This
appears to be a powerful package with a lot of documentation.

9. DNQS
-------
Tom Green <green@edu.fsu.scri.ds17> mailed me about DNQS. This is available
from ftp.fsu.edu in the directory pub/DNQS. This package supports a
multi-architecture environment in a distributed way but doesn't include some
of the fancier features of the above packages. However, it was developed
for a workstation environment rather than for a few high-speed processors (as
NQS was). Documentation is good (not always the case!) and it looks fairly
easy to set up (although I haven't done it yet). It won't halt jobs when
machine load is too high; it relies on nice priority to do that. Known to run
on Sun, VAX, DecStation, SGI & IBM.

------------------------------------------------------------------------
Glenn Carver Email: carver@atm.ch.cam.ac.uk
Atmospheric Chemistry Modelling Group Phone: (44-223) 336521
Chemistry Department Fax : (44-223) 336362
Cambridge University
UK
------------------------------------------------------------------------

Thanks to all who replied:

Mike Raffety <miker@com.sbcoc>
Alan Stebbens <aks%anywhere@edu.ucsb.hub>
Ken Lalonde <ken@edu.toronto.cs>
huittsco@com.pwfl (Scott Huitt 407-796-2969)
erueg@de.gwdg.uni-math.cfgauss (Eckhard Rueggeberg)
Steve Seaney <seaney@edu.wisc.me.robios>
Seth Robertson <seth@edu.columbia.ctr>
sitongia@edu.ucar.hao (Leonard Sitongia)
Loki Jorgenson <loki@ca.mcgill.physics.nazgul>
feldt@edu.uoknor.nhn.phyast (Andy Feldt)
green@edu.fsu.scri.ds17 (Tom Green)
urszula@edu.berkeley.garnet ( Urszula Frydman )
milt@com.pe-nelson
brianc@edu.ucsf.jekyll
peb@com.ueci (Paul Begley)
dan@com.BBN
henry@ca.concordia.davinci
Larry Thorne <larryt@edu.MsState.ERC>
David Fetrow <fetrow@edu.washington.biostat.orac>
"Steven G. Parker" <sgp@edu.uoknor.nhn.phyast>
Jon Diekema <diekema@org.mi.jdbbs>
Ed Arnold <era@edu.ucar.scd.niwot>
pete@uk.ac.ox.physchem (Pete Biggs)
gwolsk%seidc@com.mips (Guntram Wolski)
kevins@com.Sun.Aus (Kevin Sheehan {Consulting Poster Child})

and here's my original message:

To: sun-managers@edu.nwu.eecs
Subject: Batch control
Status: RO

Several users have recently begun to run large programs on machines in our
network. By large, I mean that these programs run for several days and
memory usage is such that we cannot run them on our 8Mb IPCs when OW is
running. I have instructed these users to run at reduced priority and
only on the machines that have enough memory to cope.

They've so far been using the 'batch' command to do this. The problem is that
these users are new to UNIX and often start the programs on the
wrong machine. Also, when they have realised they've done something wrong,
they can't figure out why 'batch' doesn't give a queue entry and I have to
tell them to use 'ps' ...etc etc.

I'm hoping that someone out there can point me in the direction of some
freely available software (no money!), or advise on how best to present
a batch environment to users continually running large programs in the
background.

What I'd like to have is:

1. A command interface the same across all machines on the network. e.g.
   % batch myjob machine1
   would start up the script myjob on machine1, rather than as at present,
   where batch only works for the local machine.

2. User can query state of the batch queues; what's running, what's queued
   and on what machines, again without having to log on to each machine in
   turn ('atq' only tells you what's waiting to run).

3. I need to be able to specify which machines can be used for batch jobs.

4. I need to be able to control priority, start and stop jobs.

5. An easy way for the user to kill jobs that are currently running. 'atrm'
   only works for queued jobs.
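
To make the wish list concrete, a thin wrapper along the following lines would
cover items 1 to 3 in a crude way. It is only a sketch: the host names are
placeholders, the function names are hypothetical, and it assumes rsh access
between the machines and that 'batch' reads the job from standard input.

```python
import shlex

# Hypothetical list of machines with enough memory for batch work (item 3).
ALLOWED = {"machine1", "machine2"}


def submit_command(script, host, allowed=ALLOWED):
    """Return a shell command that queues script with 'batch' on host."""
    if host not in allowed:
        raise ValueError("%s is not a batch machine" % host)
    # The local shell feeds the script to rsh, which passes it to the
    # remote 'batch' on its standard input (item 1).
    return "rsh %s batch < %s" % (shlex.quote(host), shlex.quote(script))


def queue_commands(allowed=ALLOWED):
    """Return one 'atq' command per batch machine, in name order (item 2)."""
    return ["rsh %s atq" % shlex.quote(h) for h in sorted(allowed)]
```

Note that 'atq' still only shows jobs waiting to run, so this does not solve
the problem of seeing or killing jobs that have already started (items 4
and 5).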

This may be asking a lot but these are all problems that I will have to
overcome. I am expecting several more users to begin using the network for
large background jobs. I'm sure someone has had this problem before and I'd
be grateful for any advice/software.

I will summarise.


