SUMMARY: Hanging rsh's

From: Guy Freeman - Systems Analyst (guy@fmlrnd.co.uk)
Date: Tue Jan 04 1994 - 09:50:22 CST


This problem, as I stated in my last posting, was caused by inetd, which
appears to limit each service to 40 connections per minute. This was
explained as follows:

>
> The reason you get this is that inetd tries to be clever. If inetd sees
> too many connections to the one service (40 per minute) it assumes that
> there is a problem like a runaway process trying to connect so it shuts
> down that service. This was never a problem in the old sun3 days
> because the machines just weren't fast enough. With the sparcstations
> it is relatively easy (as you have seen) to break this speed limit.
> The fix has been around for a while, it is patch number 100178-08. I've
> included the README below for you. You need to install this patch on
> all the systems that you rsh to.
>
> regards,
> --
> Glenn Satchell glenn@uniq.com.au
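
The behaviour Glenn describes can be modelled with a short C sketch. This is
not inetd's source: the names (struct service, accept_connection, MAX_CONNS,
WINDOW_SECS) are hypothetical, the figures are simply the "40 per minute"
quoted above, and a fixed counting window is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Assumed figures, taken from the description above, not from inetd itself. */
#define MAX_CONNS   40  /* connections tolerated per window */
#define WINDOW_SECS 60  /* length of the counting window in seconds */

struct service {
    int    count;         /* connections seen in the current window */
    time_t window_start;  /* time at which the current window began */
    bool   disabled;      /* set once the limit has been exceeded */
};

/*
 * Record one incoming connection at time `now`.
 * Returns true if the connection is served, false if the service
 * has been shut down because the limit was exceeded.
 */
bool accept_connection(struct service *s, time_t now)
{
    assert(s != NULL);

    if (s->disabled)
        return false;

    /* Start a fresh window on the first connection or after 60 seconds. */
    if (s->count == 0 || now - s->window_start >= WINDOW_SECS) {
        s->window_start = now;
        s->count = 0;
    }

    /* The 41st connection inside one window trips the limit. */
    if (++s->count > MAX_CONNS) {
        s->disabled = true;
        return false;
    }
    return true;
}
```

The model never re-enables the service; whatever inetd does to recover is
not covered here.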

I received many similar replies from other helpful Sun Managers. Many thanks
to:

abeckett@fmlrnd.co.uk
mharris@jpmorgan.com
tom@yac.llnl.gov
strombrg@hydra.acs.uci.edu
ye@Software.ORG
vasey@issi.com
blakespr@reston.eri.com
mclayt@pwfl.com
gusset@sparc1.ntb.ch
M.Ramchand@car0101.wins.icl.co.uk
davec@cs.ust.hk (Dave Curado)
cmm@hoccson.ho.att.com (45266B-C.MURPHY(HOM617)1000)
miguel@dt.fee.unicamp.br (Miguel A. Rozsas)
symanski@gold.nosc.mil (Jerry Symanski - NRaD Code 761)
Mike Raffety <miker@il.us.swissbank.com>
jml4@cus.cam.ac.uk
glenn@uniq.com.au (Glenn Satchell - Uniq Professional Services)
dal@gcm.com (Dan Lorenzini)

==================================My question Part 2==========================

Dear Sun Managers,

Some weeks ago I posted a problem I was having with rsh commands hanging (see
my original mail below). Sorry for the delay with this follow-up, but this is
the fourth time I've posted this mail to Sun Managers; it never arrived the
other times for some reason.

Anyway, I've received quite a number of very useful replies to the problem. A
few people have come across it before, and come up with good work-arounds but
I'd like to pursue it a little bit further before I resort to these.

I've learnt that the problem can be recreated using the following script:

-------------------------
#!/bin/csh -f

@ count=0

while(1)
        @ count++
        rsh -n herbert /usr/bin/true
        echo $status $count
end
-------------------------

After the script repeats the rsh command around 40-80 times, it HANGS.
However, for some reason this only happens on certain machines (I cannot find
a common link). When it does hang, the target machine shows the following
error:

Dec 17 10:42:52 herbert inetd[237]: shell/tcp server failing (looping), service terminated

I've been told by a very reliable source (abeckett@fmlrnd.co.uk) that this
is definitely caused by inetd (which starts rshd etc. on the target machine).

This means rcmd() and rexec() will fall over with the same problem; I have
tested this and it is true.

I assume that the problem can be avoided by waiting a certain amount of time
before reconnecting to the same machine. Does anyone know how long this is
likely to be, or what has to happen on the target machine before it can be
reconnected to without ever hitting the problem?
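
As a rough illustration of the waiting idea, here is a minimal C sketch that
spaces out connections to one host. It assumes the limit is about 40
connections per minute per service (a figure not confirmed at this point in
the thread); MIN_GAP_SECS and seconds_to_wait are hypothetical names, and the
3-second gap is a conservative guess rather than a measured value.

```c
#include <assert.h>
#include <time.h>

/*
 * Assumed limit: 40 connections per 60 seconds, i.e. one every 1.5 s.
 * time() only has 1-second resolution, so a measured gap of N seconds
 * may really be just over N-1 seconds. Asking for 3 measured seconds
 * therefore guarantees a real gap of more than 2 seconds, which keeps
 * the rate below 30 connections per minute.
 */
#define MIN_GAP_SECS 3

/*
 * Given the time of the previous connection to a host and the current
 * time, return how many seconds to sleep before connecting again.
 */
unsigned seconds_to_wait(time_t last_connect, time_t now)
{
    assert(now >= last_connect);

    time_t elapsed = now - last_connect;
    if (elapsed >= MIN_GAP_SECS)
        return 0;
    return (unsigned)(MIN_GAP_SECS - elapsed);
}
```

A caller would keep one last_connect value per target host and call
sleep(seconds_to_wait(last_connect, time(NULL))) before each rsh, since the
limit applies on each target machine separately. This only slows the loop
down; it does not remove the limit.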

In the meantime, I'm going to look to see if I can find any patches for inetd
that fix the problem.

Any help would be appreciated.

Guy Freeman.

Many, many thanks (so far) to:

abeckett@fmlrnd.co.uk
mharris@jpmorgan.com
tom@yac.llnl.gov
strombrg@hydra.acs.uci.edu
ye@Software.ORG
vasey@issi.com
blakespr@reston.eri.com
mclayt@pwfl.com
gusset@sparc1.ntb.ch

==================================My question Part 1==========================

Dear SunNet Managers,

I have been writing a network backup program in 'C' for the last few months.
After initial completion and testing, I hit a problem which I am, as yet,
unable to fix.

Background
----------

We use an Exabyte 10i stacker with TTi RADD device driver software and Exabyte
8500 tape drive all connected to a Sparc 1+ running SunOS 4.1.3.

The program looks at each machine on the network in priority order to find
the next suitable disk partition to be backed up. A "suitable disk partition"
is one which will fit onto the current tape and whose host is alive and
functional. These two criteria are checked using the following call to
system():

        ....
        sprintf(command,"intr -t 60 rsh -n %s \"df -t 4.2\" >%s 2>&1",hostname,tempfile);
        if((system(command))!=0)
        ....

If the partition is suitable, then it is dumped to tape using the following
call to system():

        ....
        sprintf(command,"rsh -n %s \"rdump 0ufbsd hagar:/dev/nrst8 100 15000 54000 %s\" >%s 2>&1",bkentry->hostname,bkentry->mountpoint,tempfile);
        if((system(command))!=0)
        ....
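
For reference, the two elided fragments above follow the same pattern: format
a shell command, pass it to system(), and test the result. A minimal sketch of
that pattern in present-day C is shown here. The helper names
(build_df_command, run_command) are hypothetical and are not taken from the
backup program, and snprintf() is a later addition to C that should not be
assumed to exist on SunOS 4.1.3.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Format the "df" probe command for one host into buf.
 * Returns 0 on success, -1 if the command would not fit in buf.
 */
int build_df_command(char *buf, size_t len,
                     const char *hostname, const char *tempfile)
{
    assert(buf != NULL && hostname != NULL && tempfile != NULL);

    int n = snprintf(buf, len,
                     "intr -t 60 rsh -n %s \"df -t 4.2\" >%s 2>&1",
                     hostname, tempfile);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Run a shell command via system().
 * Returns the command's exit status, or -1 if it could not be run
 * or did not exit normally (for example, it was killed by a signal).
 */
int run_command(const char *command)
{
    assert(command != NULL);

    int status = system(command);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Note that the hostname and file name are pasted into a shell command line
unquoted, as in the original fragments, so they must not contain shell
metacharacters.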

Problem
-------

When the program begins, everything seems to function correctly, but then after
a seemingly random number of partitions have been backed up (anything from 15
to 55 partitions) the program begins to fail.

What exactly happens is the "rsh" commands (executed from the system function)
simply hang and never connect to the remote machine.

A "ps" command executed on the host which is running the program shows that
the rsh command is executing, but the remote machine (specified in the "rsh"
command) does not show any such process. Note, I can login from another shell
without problems.

The first system call, which uses "intr", times out because of this hang-up,
as do all further system calls which use "rsh".

Please note, I have tried the program without the "intr" part of the first
system call, without the "-n" options on the "rsh" commands, and also by
executing the commands via a script called from the system() function. None
of these fixed the problem.

Help
----

If anybody has any suggestions or can help in any way, I'd be very grateful.

Guy Freeman.

==============================================================================


