Summary of problems with yp, mailtool, and alms

From: Daniel Quinlan (chs!danq@boulder.colorado.edu)
Date: Tue Mar 05 1991 - 12:35:23 CST


Sorry for the delay in summarizing responses to my problems, but I've
been hoping to get an authoritative fix to at least one of them. No
such luck so far. Here's the scoop:

1) yp problems on clients -- I've found that oftentimes sparcstations have to
    be rebooted in order to get their ypbind to connect to a new ypserv
    after the server has been rebooted.
    
    Many people have experienced problems with ypbind not switching to
    a different ypserv quickly when the first one fails. However,
    there were no suggestions for a clearcut solution to the problem.
    Some pertinent excerpts below:

> From boulder!maestro.mitre.org!lamour Thu Feb 28 08:41:11 1991
> We see a similar problem, but only very rarely.

> From boulder!sacto.west.sun.com!pacacc!steve Thu Feb 28 11:41:12 1991
>
> Wow, only 2 minutes 35 seconds! You seem pretty lucky. NIS usually is
> pretty slow about switching and Sun is full of it if they say it should
> switch much faster (Well, of course it should, but they know it doesn't).
>

> From boulder!utig.ig.utexas.edu!markw Thu Feb 28 11:59:08 1991
>
> os 4.1 server reboot hangs 4.1 nfs clients.
>
> Problem: when my 4.1 4/380 nfs server goes away for a "while", either by
> intent or panic, and it reboots, some 4.1 clients hang forever (>10 hours)
> until an rlogin from anywhere is attempted. This rlogin times out, but
> shortly (3 min?) later, the client is ok, and remains so.
>
> This NFS server is also the only NIS server for the domain (but see below).
>
> Ping works fine during the deadtime, but does not revive the client.
>
> Can't login from console on clients, user sees login request, types username,
> but then hangs. The client never wakes up from this.
>
> fastboot seems not to cause this problem; server is not down "long" enough?
>
> No problem is ever observed with 4.0.3c ss1 clients, only 4.1 clients,
> ( 3/50, slc, ipc, and ss1+ ). All except 3/50's are standalone installs.
>
> All are on the same physical enet and logical ip net (128.83.149.xx).
>
> There are NFS crossmounts from server to some clients, but
> the mount -vat nfs is backgrounded, and all mounts succeed once the
> network is turned on as the server boots.
>
> The NFS server was also the only NIS server for the domain. I did some
> etherfinding, and the problem seems to be that the client (yyy) calls the
> YP server (xxx) "RPC Call ypserv YPPROC_DOMAIN V2" which is then answered by
> xxx "ICMP from xxx to yyy dst unreachable bad port bad packet was UDP from
> yyy.1033 to xxx.748 56 bytes RPC call prog 4 proc 926490675 V1970563431"
>
> My interpretation of this is that the client has cached the previous ypserv
> port, and does not immediately give up on this port, even though the server
> now has renumbered the ypserv port to some new value after the boot.

This sounds to me like the germ of the solution, but it sounds like a problem
to be fixed in ypbind source. I can't do anything about it here --
we don't have source, and I'd rather Sun took care of problems like
this anyway.
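
A minimal sketch of the behaviour one would want from the client, assuming
markw's interpretation is right. This is not ypbind's code: BindingCache,
StaleBinding, lookup_port and send_request are hypothetical names, and the
"bad port" ICMP reply is modelled as an exception. The idea is simply to
drop the cached port on the first failure and ask the portmapper again.

```python
class StaleBinding(Exception):
    """The cached port no longer answers (e.g. ICMP 'bad port' came back)."""


class BindingCache:
    """Caches a server port and re-queries it when the port goes stale."""

    def __init__(self, lookup_port, send_request, max_failures=1):
        self.lookup_port = lookup_port    # hypothetical: asks the portmapper
        self.send_request = send_request  # hypothetical: one request to a port
        self.max_failures = max_failures
        self.port = None                  # nothing cached yet

    def call(self, request):
        failures = 0
        while True:
            if self.port is None:
                self.port = self.lookup_port()
            try:
                return self.send_request(self.port, request)
            except StaleBinding:
                failures += 1
                if failures > self.max_failures:
                    raise
                # Forget the old port so the next pass looks it up again.
                self.port = None
```

With max_failures=1 a single "bad port" reply is enough to trigger a fresh
lookup, instead of retrying the old port until some long timeout expires.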

>
> We (sort of) solved this problem by making a slave yp server which at least
> does not renumber ports. Even if it is diskless or stalls on NFS while the
> server reboots, things seem to work much better. However, there is a new
> problem we encounter if DNS is used with yp on the slave server. We get
> msgs on console and system log like:
> nres_gethostbyaddr: orange.cc.utexas.edu != 128.83.148.31
> nres_gethostbyaddr: begws2.beg.utexas.edu != 128.83.161.6
> nres_gethostbyaddr: hub.ucsb.edu != 128.111.24.40
>
> These *are* the correct ip addresses of these geezers by nslookup.
>
> These do not occur on the master server. It is possible I just set
> something up incorrectly...
>
>
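
One way to check such hosts by hand is to do the reverse lookup and then
confirm that the returned name maps back to the same address. The sketch
below assumes (this is a guess, not something confirmed in the responses)
that the nres_gethostbyaddr messages come from a check of that kind.
forward_confirms and its two lookup parameters are hypothetical names; by
default it uses whatever resolver the local machine is configured with.

```python
import socket


def forward_confirms(addr, reverse_lookup=None, forward_lookup=None):
    """Return (name, ok): ok is True if name maps back to addr."""
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses)
        reverse_lookup = lambda a: socket.gethostbyaddr(a)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, addresses)
        forward_lookup = lambda n: socket.gethostbyname_ex(n)[2]
    name = reverse_lookup(addr)
    addresses = forward_lookup(name)
    return name, addr in addresses
```

Running this on both the master and the slave for one of the complained-
about addresses would show whether the two machines resolve differently.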

2) I've also been having problems with mailtool sometimes leaving a
    Mail process running after it is killed, resulting in high loads
    and system cpu time on the machine from which /usr/spool/mail is
    mounted and a load average of 1 or higher continuously on the sparcstation
    where the mailtool was running. There were two suggestions about this:

> From boulder!silence.princeton.nj.us!jay Wed Feb 27 22:41:04 1991

> On item 2 (mailtools getting "stale fhandle"): does root's
> crontab clean out old files in /tmp?

To answer Jay's question, yes, there is a crontab entry which cleans out
old files in /tmp on the sparcstations:

0 0 * * * find /tmp -xdev -fstype nfs -prune -o -mtime +1 -atime +1 -depth -exec rm -f {} \;

On the other hand, I've wondered if this could be the problem, and done some
experiments where I started mailtool, removed everything from /tmp, and then
opened the mailtool icon -- no problems.

If this is truly the problem, I suppose I can rewrite the crontab entry
to avoid removing the mailtool files. I haven't tried this yet. I wish
I were able to reproduce the problem.
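
A rough sketch of what such a rewrite would have to do: pick old files but
skip anything matching a keep-list. The names of mailtool's temporary files
are not given anywhere above, so the "MT*" pattern in the usage note is
purely hypothetical, as is the function name stale_tmp_files. The age test
is a simplification of find's whole-day rounding, and the -xdev and
-fstype nfs -prune parts of the crontab entry are not modelled.

```python
import fnmatch
import os
import time


def stale_tmp_files(root, keep_patterns, max_age_days=1, now=None):
    """List files under root older than max_age_days, minus kept names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    victims = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip anything on the keep-list, whatever its age.
            if any(fnmatch.fnmatch(name, pat) for pat in keep_patterns):
                continue
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Both mtime and atime must be old, as in the crontab entry.
            if st.st_mtime < cutoff and st.st_atime < cutoff:
                victims.append(path)
    return sorted(victims)
```

For example, stale_tmp_files("/tmp", ["MT*"]) would return the removal
candidates while leaving any file whose name starts with "MT" alone.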

> From boulder!cs.purdue.edu!trinkle Thu Feb 28 08:41:21 1991
>
> We have seen the mailtool/Mail problem, but not frequently. It
> could be that the file the client initially opens with Mail gets
> removed by another mail agent while Mail continues to hold the
> original open. We had this problem with a user that would use Mail to
> check the header lines, then inc his mail (using MH). The initial
> Mail process then has an open file that has been removed on the
> server. The filehandle is therefore a stale filehandle. This is a
> problem with NFS. I suspect something in Mail is not detecting the
> error condition correctly (ESTALE) on some system call (most system
> code does not check return values :-) and just keeps retrying.

This is sort of a variation on the suggestion above, but with the mail
file disappearing on the server rather than the local files in /tmp. I've
tried some experiments where I start mailtool, open it up, close it, and then
delete all the local files in /tmp and the nfs mounted mail file (using
plain mail), but have not been able to reproduce the problem.
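
To illustrate the kind of error handling trinkle suspects is missing, here
is a minimal sketch. Nothing above confirms how Mail actually behaves;
read_mailbox and read_once are hypothetical names. The point is that
ESTALE is permanent for an open file, so it should end the loop at once,
while other errors get a bounded number of retries.

```python
import errno


def read_mailbox(read_once, max_retries=3):
    """Call read_once(), retrying transient errors but never ESTALE."""
    attempts = 0
    while True:
        try:
            return read_once()
        except OSError as exc:
            if exc.errno == errno.ESTALE:
                # The file handle is gone for good; retrying cannot help.
                raise
            attempts += 1
            if attempts > max_retries:
                raise
```

A process written this way exits (or reopens the file) on the first stale
handle instead of spinning and loading both the client and the server.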

3) I have a weird and intermittent problem with thousands of interrupts
per second from an alm board. I got no suggestions about this problem.
I think the ultimate answer there is not to use alm boards and to use terminal
servers instead, but that's not an option at present.

Thanks very much to all who responded.

        danq


