SUMMARY: UNIX rm command problem

From: Steve Lodin (swlodin@kocrsv01.delcoelect.com)
Date: Thu Jan 07 1993 - 00:03:41 CST


Here is a summary of the responses so far, along with some more data. I
still don't have an explanation.

I got responses from:
ncf@tngstar.cray.com (Nicholas Franco)
poffen@sj.ate.slb.com (Russ Poffenberger)
Dieter Muller <dworkin@shiara.rootgroup.com>
rodo@auspex.com (Rod Livingood)
lincoln!brian@netcom.com

I wrote:

> Problem: rm command failure
>
> Systems: Server: Auspex NS5000 Auspex 1.4.2/SunOS 4.1
> Clients: Sun SS2 4.1.3
> Sun SS2 4.1.2
>
> Every now and then, I get the following error:
>
> rm: internal synchronization error: filename
>
> while trying to do a big rm -rf on an NFS-mounted disk. I'll end up doing the
> command again and it will finish. It doesn't seem like anything other than
> an annoyance, but I'm not sure. Others have also reported the problem here.
>
> Here is a command sequence:
>
> [1] kocrsw15: rm -rf news [ trying to remove directory ]
> rm: internal synchronization error: news/trn.2.hp, newsgroups, ng.c
> rm: internal synchronization error: news/xrn6-17.hp/includes/X11/Xaw, AsciiSrc.h, AsciiSrcP.h
> [2] kocrsw15: ls -la news [ ls shows not all gone ]
> total 11
> drwxr-sr-x 4 news 1024 Dec 15 13:59 ./
> drwxrwsr-x 10 sysadmin 4096 Dec 15 13:58 ../
> drwxr-xr-x 2 news 4096 Dec 15 13:58 trn.2.hp/
> drwxr-xr-x 3 news 2048 Dec 15 13:59 xrn6-17.hp/
> [3] kocrsw15: rm -rf news [ remove it again ]
> [4] kocrsw15: ls -la news [ now all gone ]
> news not found
> [5] kocrsw15: df . [ automntd drive from Auspex ]
> Filesystem kbytes used avail capacity Mounted on
> sv04_02:/vp2/a 1861388 1584449 183869 90% /a/sv04_02/vp2/a
>
>
> These two machines are on the same subnet. (Well, with the 6 ethernets in
> the Auspex, most everything is on the same subnet :-) The next-to-latest
> beta version of amd is used to automount.
>
>
> Has anyone else encountered this before? Is it a problem with the server or
> the client? How serious is it? Is there a patch?
>
> Thanks for any help or suggestions.
>
>
> Steve Lodin, Chris Cleary, and Ryland Rusch (sysadmin@kocrsv01.delcoelect.com)
> System Administrators
> Delco Electronics Corp

-----------------------------------------------------------------------------
From: ncf@tngstar.cray.com (Nicholas Franco)

: I have an NS5000 also but have not seen this problem before. Have you put in
: a call to Auspex? They have always been very helpful when I had a problem.
: If you want to give them a call the number is 1-800-328-7739.

We talked with Auspex, but we raised it as part of another call, so it got
lost. They sent some info, which is included later. That info only helped me
narrow down the problem; I still don't know what causes it.
-----------------------------------------------------------------------------
From: poffen@sj.ate.slb.com (Russ Poffenberger)

: Can't say that I have seen this on my Auspex. I don't regularly do large
: rm -r's on it though.

: You might get more input sending it to the Auspex mailing list. Did you
: contact Auspex? Sounds like an Auspex problem, they are usually very helpful.

I forgot the Auspex mailing list. Is it just me or is that very, very slow?
-----------------------------------------------------------------------------
From: Dieter Muller <dworkin@shiara.rootgroup.com>

: What's happening is that a directory is changing out from under rm.
: Most likely, someone else is also running an rm -rf from a different
: system (although you can get it to happen on a single system). The
: complaints are harmless, with the only negative being that the tree
: you're deleting doesn't get completely deleted.

: The work-around that seemed to reduce the conflicts the most for me
: was to always do rm -rf on the server rather than on a client. You've
: got fewer race conditions that way, although they aren't completely
: eliminated (to do that, only one person could use rm(1) or unlink(2)
: at a time on a particular filesystem, which isn't a practical
: solution).

I know for a fact that I am the only person in the hierarchy and that no one
(and nothing) else is doing anything there, since this is an old source tree
and I'm basically the only one with access to the news userid.

So, I discount the race condition idea.
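
For reference, the kind of race Dieter describes can be illustrated on a
single machine. This is only a sketch, assuming Python on a Unix-like system;
remove_entries, its interfere hook, and demo are hypothetical names made up
for the illustration, and this is not how rm itself is written. It snapshots
a directory listing, lets a stand-in for a second rm delete one entry, and
then collects the names whose unlink fails with ENOENT.

```python
import os
import tempfile


def remove_entries(directory, interfere=None):
    """Unlink every entry named in a snapshot of `directory`.

    `interfere` is a hypothetical hook, called once after the snapshot is
    taken, that stands in for a second process changing the directory.
    Returns the names that had vanished by the time we tried to unlink them.
    """
    names = sorted(os.listdir(directory))  # snapshot of the directory
    if interfere is not None:
        interfere(directory)
    vanished = []
    for name in names:
        try:
            os.unlink(os.path.join(directory, name))
        except FileNotFoundError:  # ENOENT: the entry disappeared under us
            vanished.append(name)
    return vanished


def demo():
    with tempfile.TemporaryDirectory() as d:
        for name in ("ng.c", "newsgroups"):
            open(os.path.join(d, name), "w").close()
        # The "other rm" removes ng.c after our snapshot was taken.
        return remove_entries(d, lambda p: os.unlink(os.path.join(p, "ng.c")))


if __name__ == "__main__":
    print(demo())
```

The snapshot-then-unlink structure is what makes the window visible: anything
that changes the directory between the listing and the unlink shows up as an
ENOENT.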
-----------------------------------------------------------------------------
From: rodo@auspex.com (Rod Livingood)

: This is probably the culprit of this and the other problem.

: ----- Begin Included Message -----
 
: (Mohan Srinivasan)
: To: rodo
: Subject: Re: UNIX rm command failure - mail.sun-managers #9280
 
 
: This is (very likely) caused by the NFS client side open-unlink-rename hack.
 
: When rm reports this internal error and spits out the directories
: that it couldn't remove, it would be useful to do an 'ls -a pathname'
: and see if the directory has any .nfsXXX files in it. (This may still
: reveal nothing 'cause the .nfsXXX files may have gone inactive and
: been removed before you do the ls -a.)
 
: If there are any .nfsXXX files in the directories printed, this would be the
: reason why the rm -rf is failing on those directories.
 
: When you remove a file (from an NFS mounted filesystem), and some other
: process is holding a reference to that file (this is determined by
: checking the refcnt on the vnode), then the NFS client code renames the file to
: .nfsXXX (and puts it in the same directory). The .nfsXXX file will be removed
: when the vnode goes inactive.
 
: This may be what is causing rm to fail. Once rm has unlinked all the files
: in a directory, it assumes that the directory is removable, but the rmdir
: will fail if there are .nfsXXX entries in it. This will confuse rm and make
: it spit out the message.
 
: Running trace -o somefile rm -rf , and looking at the trace output, should
: also give more data. If you see a rmdir NFS request getting an ENOTEMPTY
: reply, then you know that this is the problem.
 
: mohan
: ----- End Included Message -----

In the section below called MORE DATA, I do what is suggested above. I
still don't know what is causing this, although I am almost 100% sure that no
other process has open files in that directory structure.
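
The open-unlink-rename behavior Mohan describes can be probed with a small
test. This is a minimal sketch, assuming Python on a Unix-like client;
leftovers_after_unlink, demo, and the file name "held-open" are hypothetical
names chosen for the illustration. It unlinks a file while still holding it
open and then lists the directory. On a local filesystem nothing should be
left; on an NFS mount a .nfsXXX entry may remain until the file is closed,
and that entry is what would make rmdir return ENOTEMPTY.

```python
import os
import tempfile


def leftovers_after_unlink(directory):
    """Unlink a file that is still open, then report what remains.

    Returns (entries still in the directory, whether rmdir would be blocked).
    """
    path = os.path.join(directory, "held-open")
    with open(path, "w") as fh:
        fh.write("data")
        fh.flush()
        os.unlink(path)  # removed while this process still holds a reference
        remaining = sorted(os.listdir(directory))
    # On a local filesystem `remaining` is expected to be empty; on an NFS
    # client it may hold a .nfsXXX entry until the file is closed.
    return remaining, bool(remaining)


def demo():
    # Uses the system temp directory, which is usually a local filesystem.
    with tempfile.TemporaryDirectory() as d:
        return leftovers_after_unlink(d)


if __name__ == "__main__":
    print(demo())
```

To test a particular mount, call leftovers_after_unlink() on an empty scratch
directory created on that filesystem rather than relying on demo().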
-----------------------------------------------------------------------------
From: lincoln!brian@netcom.com
 
: This happens to me when there are two "rm -rf" processes removing the
: same directory tree.
 
As above, I know this isn't happening.
-----------------------------------------------------------------------------

MORE DATA:
As a test, I created a 22MB directory tree (brand new, and no one else knows
about it, so there are no other people or programs accessing it). I did the
remove followed immediately by the list.

[1] kocrsw15: trace -o trace rm -rf /vol/AES_Tools/news
rm: internal synchronization error: /vol/AES_Tools/news/cnews.hp/libstdio, stdiock.stock, stdiock.fast
rm: internal synchronization error: /vol/AES_Tools/news/cnews.hp/misc, newswatch.orig, canonhdr
rm: internal synchronization error: /vol/AES_Tools/news/xrn6-17.hp, xthelper.h, xrn6-17.patch
rm: internal synchronization error: /vol/AES_Tools/news/xrn6-17.sun, copyright.h, cursor.c
[2] kocrsw15: ls -alR /vol/AES_Tools/news
total 10
drwxr-xr-x 5 news 512 Jan 4 12:29 ./
drwxrwsr-x 13 sysadmin 4096 Jan 4 12:26 ../
drwxrwxr-x 4 news 1024 Jan 4 12:27 cnews.hp/
drwxr-xr-x 2 news 2048 Jan 4 12:29 xrn6-17.hp/
drwxr-xr-x 2 news 2048 Jan 4 12:29 xrn6-17.sun/

/vol/AES_Tools/news/cnews.hp:
total 4
drwxrwxr-x 4 news 1024 Jan 4 12:27 ./
drwxr-xr-x 5 news 512 Jan 4 12:29 ../
drwxrwxr-x 2 news 512 Jan 4 12:26 libstdio/
drwxrwxr-x 2 news 1024 Jan 4 12:26 misc/

/vol/AES_Tools/news/cnews.hp/libstdio:
total 8
drwxrwxr-x 2 news 512 Jan 4 12:26 ./
drwxrwxr-x 4 news 1024 Jan 4 12:27 ../
-rwxr-xr-x 1 news 5430 Jan 4 12:18 stdiock.fast*

/vol/AES_Tools/news/cnews.hp/misc:
total 11
drwxrwxr-x 2 news 1024 Jan 4 12:26 ./
drwxrwxr-x 4 news 1024 Jan 4 12:27 ../
-rwxr-xr-x 1 news 8484 Jan 4 12:18 canonhdr*

/vol/AES_Tools/news/xrn6-17.hp:
total 14
drwxr-xr-x 2 news 2048 Jan 4 12:29 ./
drwxr-xr-x 5 news 512 Jan 4 12:29 ../
-rw-r--r-- 1 news 10587 Jan 4 12:23 xrn6-17.patch

/vol/AES_Tools/news/xrn6-17.sun:
total 21
drwxr-xr-x 2 news 2048 Jan 4 12:29 ./
drwxr-xr-x 5 news 512 Jan 4 12:29 ../
-r--r--r-- 1 news 17536 Jan 4 12:23 cursor.c

I was looking for .nfsXXXX files but I didn't find any (which is what I
expected).

For the stdiock.fast file, here is the trace output:

> lstat ("/vol/AES_Tools/news/cnews.hp/mis".., 0xf7fff488) = 0
> unlink ("/vol/AES_Tools/news/cnews.hp/mis"..) = -1 ENOENT (No such file or directory)
> open ("/vol/AES_Tools/news/cnews.hp/mis".., 0, 0) = 3
> fstat (3, 0xf7fff488) = 0
> fcntl (3, 02, 0x1) = 0
> getdents (3, 0x65c0, 8192) = 340
> write (2, "rm: internal synchronization err".., 96) = 96
> close (3) = 0

and then later on the rmdir fails:

> rmdir ("/vol/AES_Tools/news/cnews.hp/mis"..) = -1 ENOTEMPTY (Directory not empty)

Since the paths are so long that trace truncates them, I can't verify which
files the trace lines refer to, but the error messages were easy to find.
Just for grins, I tried the whole process again and this time it gave about
8 sync errors.
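
With several errors per run, it is easier to pull the failed calls out of
the trace file than to read it by eye. The following is a rough sketch,
assuming Python and assuming every trace line has the shape shown above
(call name, arguments in parentheses, then "= -1 ERRNO" on failure);
FAILED_CALL, failed_calls, and SAMPLE are hypothetical names for this
illustration. It counts (call, errno) pairs for calls that returned -1.

```python
import re
from collections import Counter

# Matches e.g.: rmdir ("...") = -1 ENOTEMPTY (Directory not empty)
# An optional leading "> " is tolerated for lines pasted from mail.
FAILED_CALL = re.compile(
    r"^\s*(?:>\s*)?(\w+)\s*\(.*\)\s*=\s*-1\s+(E[A-Z0-9]+)"
)


def failed_calls(lines):
    """Count (syscall, errno name) pairs for calls that returned -1."""
    counts = Counter()
    for line in lines:
        match = FAILED_CALL.match(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Sample lines copied from the trace output quoted above.
SAMPLE = [
    'lstat ("/vol/AES_Tools/news/cnews.hp/mis".., 0xf7fff488) = 0',
    'unlink ("/vol/AES_Tools/news/cnews.hp/mis"..) = -1 ENOENT (No such file or directory)',
    'getdents (3, 0x65c0, 8192) = 340',
    'rmdir ("/vol/AES_Tools/news/cnews.hp/mis"..) = -1 ENOTEMPTY (Directory not empty)',
]

if __name__ == "__main__":
    for (call, err), n in sorted(failed_calls(SAMPLE).items()):
        print(call, err, n)
```

For a real run, pass the open trace file instead of SAMPLE, for example
failed_calls(open("trace")) for the file written by trace -o trace.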

Anyone have any other suggestions?

Again, thanks for all the help.


