Summary Re: rpc.lockd interaction problem

From: Gerhard Hertlein (hertlein@pki-nbg.philips.de)
Date: Thu Apr 23 1992 - 15:41:21 CDT


Sun Managers,

Two weeks ago I asked about the following problem:

-- Begin INCLUDED --

The setup is as follows:

1. Server 'sunsvr' (SunOS 4.1.1, no patches) exports home directories
   (/export/home) and the mailbox directory (/var/spool/mail).
2. Client 'hpclnt' (HP-UX 8.02, an HP 9000/847) mounts sunsvr:/export/home
   and sunsvr:/var/spool/mail.

This setup has been in operation for a couple of months with no unsolvable
problems.

Suddenly users complained about hanging processes on 'hpclnt', especially
ksh and mailx. These processes cannot be killed, not even with kill -9.
A reboot is the only way to get rid of them.

A subsequent analysis showed that the processes (on hpclnt) are blocked
waiting for a lock on files such as /var/spool/mail/$USER or
~/.sh_history on sunsvr.

A network trace with etherfind revealed the following events:
1. hpclnt sends an RPC call (nlockmgr proc 7) to sunsvr to request a lock.
2. sunsvr grants the lock by sending an RPC call (nlockmgr proc 12) to
   hpclnt (UDP port 1035).
3. hpclnt replies with an ICMP error message "Bad port 1035".

rpcinfo -p on hpclnt shows that rpc.lockd is in fact listening on port
1032, not 1035.
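
The comparison between the port seen in the trace and the port registered
with the portmapper can be scripted. The following is only a minimal
sketch: the sample rpcinfo listing is made up for illustration (only the
nlockmgr port 1032 and the observed port 1035 come from the report above),
and the function names are hypothetical. 100021 is the RPC program number
of nlockmgr.

```python
NLOCKMGR_PROG = 100021  # RPC program number of nlockmgr (the lock manager)

# Illustrative stand-in for the output of `rpcinfo -p hpclnt`.
# Only the nlockmgr port (1032) is taken from the report; the rest is assumed.
SAMPLE_RPCINFO = """\
   program vers proto   port  service
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100021    1   udp   1032  nlockmgr
"""


def registered_ports(rpcinfo_text, program, proto="udp"):
    """Return the set of ports registered for `program` over `proto`."""
    ports = set()
    for line in rpcinfo_text.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a registration row.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        prog, _vers, line_proto, port = fields[:4]
        if int(prog) == program and line_proto == proto:
            ports.add(int(port))
    return ports


def is_stale(observed_port, rpcinfo_text):
    """True if traffic goes to a port on which lockd is not registered."""
    return observed_port not in registered_ports(rpcinfo_text, NLOCKMGR_PROG)


if __name__ == "__main__":
    # 1035 is the destination port seen in the etherfind trace.
    print("registered:", sorted(registered_ports(SAMPLE_RPCINFO, NLOCKMGR_PROG)))
    print("port 1035 stale:", is_stale(1035, SAMPLE_RPCINFO))
```

In practice the text would come from running rpcinfo -p against the client
rather than from a hard-coded string.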

...

-- End INCLUDED --

Many thanks for quick replies to:

casper@fwi.uva.nl
mdl@cypress.com
mondics@tartan.com
root@toy.rad.msu.edu
djc@xanadu.acuson.com
derek@ncc.nexus.ca
miker@sbcoc.com
geertj@philica@unido.uucp

The outcome was the following:

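The problem is caused by a general 'feature' of RPC: a caller is not
required to ask the portmapper before every RPC call, and may keep using
the port it looked up earlier. This is reasonable for performance, but it
causes problems when the RPC client outlives its RPC server. That is
exactly what happens when an NFS client is rebooted: the lockd on the NFS
server (the RPC client for the reply in step 2 above) keeps sending to the
port of the old lockd on the NFS client (1035), while the restarted lockd
is registered on a new port (1032).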
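
The effect of such a remembered port can be illustrated with a toy model.
This is a sketch under stated assumptions, not the actual lockd code: the
classes ClientHost and CachingCaller and the method forget_port are
hypothetical names, no real network traffic is involved, and only the port
numbers 1035 and 1032 come from the trace above.

```python
NLOCKMGR_PROG = 100021  # RPC program number of nlockmgr


class ClientHost:
    """Toy model of the NFS client: a portmapper table plus open ports."""

    def __init__(self):
        self.portmap = {}        # program number -> registered port
        self.listening = set()   # ports on which something is listening

    def boot(self, lockd_port):
        """(Re)boot: old registrations vanish, lockd registers a new port."""
        self.portmap = {NLOCKMGR_PROG: lockd_port}
        self.listening = {lockd_port}

    def lookup(self, program):
        """What a portmapper query for `program` would return."""
        return self.portmap[program]

    def deliver(self, port):
        """True if a datagram to `port` reaches a listener, False if the
        host would answer with an ICMP 'bad port' error instead."""
        return port in self.listening


class CachingCaller:
    """Toy model of the server's lockd calling back to the client's lockd.
    It asks the portmapper once and then reuses the answer."""

    def __init__(self, target):
        self.target = target
        self.cached_port = None

    def call(self):
        if self.cached_port is None:
            self.cached_port = self.target.lookup(NLOCKMGR_PROG)
        return self.target.deliver(self.cached_port)

    def forget_port(self):
        """Stands in for restarting the server's lockd."""
        self.cached_port = None


def demo():
    hpclnt = ClientHost()
    hpclnt.boot(lockd_port=1035)
    sunsvr_lockd = CachingCaller(hpclnt)
    before = sunsvr_lockd.call()   # first call looks up 1035, is delivered
    hpclnt.boot(lockd_port=1032)   # client reboot: new lockd, new port
    stale = sunsvr_lockd.call()    # still sent to 1035, nobody listens
    sunsvr_lockd.forget_port()     # drop the remembered port
    after = sunsvr_lockd.call()    # fresh lookup finds 1032
    return before, stale, after


if __name__ == "__main__":
    print(demo())
```

The middle call in demo() corresponds to the hanging lock request: the
reply goes to a port nobody listens on, and nothing in the model makes the
caller look the port up again until its remembered value is discarded.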

In my opinion, the proper solution for rpc.lockd would be for Sun to
extend the lockd protocol/mechanism so that the lockd on a server is
notified when a client reboots (e.g. by mountd, which knows when a client
mounts or remounts a filesystem). After such a notification, the NFS
server's lockd could reinitialize its 'connection' to the NFS client's
lockd.

Several people mentioned patches: 100075-07 (not for this problem) and
100075-08 (rpc.lockd jumbo patch), which needs 100173-07 (NFS jumbo patch).

I decided not to go after the patch, because the problem does not occur
spontaneously and is under control now that I know what can cause the
hanging processes.

Fix: reboot the NFS server, or restart the server's lockd (dangerous?!?)

Gerhard Hertlein
hertlein@pki-nbg.philips.de


