SUMMARY: Solstice Backups fail with "abandoned by asavegrp" messages

From: Litwin, Gary (gary.litwin@fsbti.com)
Date: Tue Jan 18 2000 - 07:54:12 CST


All -
As a last desperate act, after nearly 3 weeks of intermittently failing
backups, I renamed the index file (/nsr/index/ctil1/db) for the most
commonly failing client, forcing a new db file to be generated.
If I need to recover any files that are not in the current index db file, I
will have to scan the tapes for them, but very few restores have been needed
for this client, so I may survive for the full 6-week retention period.

I have not seen a recurrence of the failures for the last four nights, so I
am starting to relax. A little.

The index file was HUGE, over 590MB, and though I ran nsrck against it
without any indication of error, it is looking like the index MAY have been
corrupted during my Y2K upgrade process.
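
For anyone who wants to check whether other client indexes have grown to a
similar size, here is a minimal Python sketch. It only walks a directory tree
and reports large files; the /nsr/index root is taken from the path above,
while the function name and the 500 MB threshold are illustrative assumptions,
not anything Solstice Backup provides.

```python
import os


def large_index_files(index_root, threshold_bytes):
    """Return (path, size) pairs for files under index_root whose size is
    at least threshold_bytes, largest first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(index_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= threshold_bytes:
                hits.append((path, size))
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits


if __name__ == "__main__":
    # Assumption: indexes live under /nsr/index; 500 MB is an arbitrary cutoff.
    for path, size in large_index_files("/nsr/index", 500 * 1024 * 1024):
        print(f"{size / (1024 * 1024):8.1f} MB  {path}")
```

A large index is not proof of corruption, so this is only a way to pick
candidates for a closer look.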

Thanks to everyone who responded to my posting, and especially Stuart Whitby
who actually took time to delve through various snaps, log files, etc. Your
soul is saved for sure, Stuart!

Gary Litwin
UNIX Systems Administrator
FlightSafety Boeing Training Intl.
MS: 20-79
(206) 662-8346
gary.litwin@fsbti.com

-----Original Message-----
From: Litwin, Gary [mailto:gary.litwin@fsbti.com]
Sent: Friday, January 07, 2000 8:23 PM
To: 'sun-managers@sunmanagers.ececs.uc.edu'
Subject: UPDATE: Solstice Backups fail with "abandoned by asavegrp" messages

All -

I received several responses suggesting that indexes could possibly be
corrupted and proposing I run nsrck -F against the client indexes, and I
have additionally staggered the start times of the various groups to help
isolate the problem and reduce parallelism.

Prior to the nsrck, I manually ran each backup in sequence, with no overlap,
and they all completed without errors, and the nsrck -F run later did not
indicate any errors or repairs. So I can't verify actual index corruption
was at the root of the error messages.

Last night's backups completed successfully, but I do not yet have a warm
feeling that things are fixed.

I'll try to truss the process that is hanging when I can catch it again, and
post a summary if I can spot anything definite...

Thanks to Stuart Whitby, Gary D. Duncan, Johnny Hall, and Marco Breedeveld
for their suggestions and efforts thus far...

Gary Litwin
UNIX Systems Administrator
FlightSafety Boeing Training Intl.
MS: 20-79
(206) 662-8346
gary.litwin@fsbti.com

-----Original Message-----
From: Litwin, Gary [mailto:gary.litwin@fsbti.com]
Sent: Wednesday, January 05, 2000 9:46 AM
To: 'sun-managers@sunmanagers.ececs.uc.edu'
Subject: Solstice Backups fail with "abandoned by asavegrp" messages

I seem to be having some Solstice Backup problems after bringing my Solaris
2.6 clients and Solaris 2.5.1 backup server up to the Y2K levels indicated
by Sunscan 2.4.

Here is a sample completion notification that shows the errors:

----------------------------------------------------------------------------
Solstice Backup Savegroup: (notice) ctil1_G7 completed, 1 client (ctil1 Failed)
Start time: Tue Jan 4 18:00:02 2000
End time: Tue Jan 4 18:46:15 2000

--- Unsuccessful Save Sets ---

* ctil1:index has been inactive for 33 minutes since Tue Jan 4 18:13:17 2000.
* ctil1:index is being abandoned by asavegrp.

--- Successful Save Sets ---

  ctil1: /usr2 level=incr, 1.6 MB 00:00:21 8 files
  ctil1: /admin level=incr, 0 KB 00:00:41 0 files
----------------------------------------------------------------------------
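
Since the failures are intermittent, it can help to scan saved completion
notifications for the "abandoned" line automatically. The following Python is
a minimal sketch that assumes the notification text looks like the sample
above; the regular expressions and the function name are my own and are not
part of Solstice Backup.

```python
import re

# Header line, e.g. "... (notice) ctil1_G7 completed, 1 client (ctil1 Failed)"
HEADER_RE = re.compile(r"\(notice\)\s+(\S+)\s+completed")
# Failure line, e.g. "* ctil1:index is being abandoned by asavegrp."
ABANDONED_RE = re.compile(r"^\*\s+(\S+?):(\S+) is being abandoned by")


def parse_notification(text):
    """Return (group_name, [(client, save_set), ...]) for abandoned save sets.

    group_name is None if no header line is recognised.
    """
    group = None
    abandoned = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = HEADER_RE.search(line)
        if header and group is None:
            group = header.group(1)
        failure = ABANDONED_RE.match(line)
        if failure:
            abandoned.append((failure.group(1), failure.group(2)))
    return group, abandoned


if __name__ == "__main__":
    sample = """\
Solstice Backup Savegroup: (notice) ctil1_G7 completed, 1 client (ctil1 Failed)
* ctil1:index is being abandoned by asavegrp.
"""
    print(parse_notification(sample))
```

Run against a directory of saved notifications, this would show which groups
and clients are abandoned most often and on which nights.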

I started receiving intermittent errors of this type only AFTER I updated
the server that is doing the backups for all my Solaris clients for Y2K
compliance by adding the 105277-03 patch.

I am seeing this failure about 2 or 3 times each week.

The server is running Solaris 2.5.1 with all the current Y2K patches added,
and Solstice Backup 4.2.6b Turbo/35. (went to 4.2.6b with the addition of
patch 105277-03)

All the clients are running Solaris 2.6 with the current Y2K patches added,
including the 105277-03 patch for the Solstice Backups.

The error seems to start, chronologically, with the ctil1 client index; then,
after the 33-minute inactivity and asavegrp "abandoned" messages, all
the rest of the clients "hang" and time out as well.

An examination of the processes on the backup server shows that the backup
"save" process for the failing client has reverted to ownership by init,
PID 1.
When this stalled process is "killed", the remaining stalled clients start
running to completion again, though they issue a "failed" type of completion
message, and backups that do not start until later proceed without error.
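
Spotting the stalled process by hand is tedious, so here is a minimal Python
sketch that picks out "save" processes whose parent is init (PPID 1) from
process-listing text. It assumes input in the three-column form produced by
something like "ps -e -o pid,ppid,args"; the function name, the sample host
name and the sample PIDs are hypothetical.

```python
def orphaned_save_processes(ps_output, command_name="save"):
    """Return (pid, args) for processes named command_name with PPID 1.

    ps_output is text with three whitespace-separated columns per line:
    PID, PPID and the full command line. Header lines are ignored because
    their first two fields are not numeric.
    """
    orphans = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, ppid, args = fields
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # header or malformed line
        # Compare only the executable's base name, not its arguments.
        executable = args.split()[0].rsplit("/", 1)[-1]
        if int(ppid) == 1 and executable == command_name:
            orphans.append((int(pid), args))
    return orphans


if __name__ == "__main__":
    # Hypothetical listing; real output would come from ps on the server.
    listing = """\
  PID  PPID COMMAND
 2301   412 save -s bkserver /usr2
 2307     1 save -s bkserver /admin
"""
    for pid, args in orphaned_save_processes(listing):
        print(pid, args)
```

Many daemons legitimately run with PPID 1, which is why the sketch filters on
the command name as well; whether to kill what it finds is still a judgment
call.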

I checked the SunSolve web site, and the sun-managers archives, and found
some info on increasing the inactivity timeout, but increasing this to 60
minutes only changes the wait time to 63 minutes before the savegroup is
abandoned.

Has anybody got any suggestions on where to check next? Or info on
conflicting patches that could cause this?

Thanks in advance, and I'll summarize!

Gary Litwin
UNIX Systems Administrator
FlightSafety Boeing Training Intl.
MS: 20-79
(206) 662-8346
gary.litwin@fsbti.com
