SUMMARY - hung socket

From: Christopher M. Murphy (murphyc@synapse.bms.com)
Date: Tue Jan 28 1997 - 14:19:45 CST


**** summary:

I received a couple of me-toos on this one, but no solutions to the
problem. Since my original posting, the product vendor has acknowledged
that there is a problem and is working on a fix for it.

**** thanks to:

Andrew Foote <acf@nabaus.com.au>
Jacques Rall <jacques.rall@za.eds.com>
Marc S. Gibian

**** answers:

> From: gibian@stars1.hanscom.af.mil
>
> I've been away from the office so I don't know if you've sent out a summary yet.
> Anyway, so far as I know, the only recovery path for a hung socket is a reboot.
> Let me add that hung sockets are not all that uncommon when I've run unattended
> ufsdumps over the LAN. This is why I strongly advise against using backup
> products that rely on the OS's underlying tools for the actual tape handling. Some
> argue that they want to be able to restore without first installing the backup
> tool on a crashed system. My position is that you spend so much more time on the
> dump side that the slight overhead during recovery is far outweighed by the
> added reliability during dump.
>
> Hope this helps,
> Marc S. Gibian
> Telos Consulting Services phone: (617) 377-6350
> PRISM/TFS email: gibian@stars1.hanscom.af.mil

> From: Jacques Rall <jacques.rall@za.eds.com>
> What about using pmadm or sacadm? (sorry, don't know any switches)
>
> ----------

> From: ACF
>
> Me too !!
>
> However, I am running proxy backups under AIX between RS/6000s. Like you,
> the only method I've found to "reset" the socket is by killing all
> associated processes.
>
> PDC do need to work on this as it's pretty dirty.
> Pls let me know how you go,
>
> Rgds,
> Midrange Services.

**** original question:

   Sun SPARCstation 20 running Solaris 2.5 with the 2.5 recommended patches installed.

   Problem description:

   This machine is a dedicated backup server that runs the PDC Budtool product.
   This product uses remote-shelled dump/restore to back up the client
   machines. There appears to be a bug that gets "activated" when one of the
   backup clients either hangs or crashes while a dump is being run. The
   backup server keeps the socket connection open to the client that was being
   backed up. This socket will stay open until I manually kill the parent
   process on my backup server that initiated the remote dump.

   I'm working with the backup product vendor on a fix for this problem, but
   was hoping in the meantime to find a way to close this socket without
   killing the parent backup process. When I kill the parent process, none of
   the backups that still remain in the "backup schedule" will get run and the
   summary of the backup schedule will not get generated. I guess my basic
   question is: shouldn't a socket get closed when the destination machine is
   no longer accessible (e.g. no longer ping-able)?
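
A note on the basic question above: TCP only notices a dead peer when it
has data to send and the retransmissions time out, or when a keepalive
probe goes unanswered. A side that is merely waiting to read, with nothing
queued, can stay in ESTABLISHED indefinitely after the client crashes. The
usual remedy is for the program that owns the socket to enable
SO_KEEPALIVE. The sketch below is in Python purely for illustration. It is
not Budtool code; `enable_keepalive` and its default values are
hypothetical, and the per-socket TCP_KEEP* options do not exist on every
platform (on Solaris 2.5, as far as I know, the interval is a system-wide
setting tuned through ndd and defaults to roughly two hours).

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Ask the kernel to probe an idle TCP connection (illustrative helper).

    idle, interval and count are assumed example values, not vendor defaults.
    """
    # Portable part: switch keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional per-socket tuning. These constants are platform-specific,
    # so each one is applied only if this Python build exposes it.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            # The platform defines the constant but refuses the value;
            # fall back to the system-wide keepalive settings.
            pass
    return sock
```

Note that this has to be done by the process that owns the socket, so it
is the kind of change the vendor could make. It does not give an outside
user a way to close a socket that is already hung.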

   Attached is some info that will hopefully clarify my problem description.
   All of the commands have been run from the backup server (of course, since
   the client is inaccessible!):

   backupsvr: lsof | grep client
   goserver 2095 root 11u inet 0xf611fec0 0t5 TCP backupsvr:1020->client.bms.com:shell

   backupsvr: netstat -a | grep client
   backupsvr.1020 client.bms.com.shell 61315 0 8760 0 ESTABLISHED

   backupsvr: ping client 1
   no answer from client.bms.com

   (NOTE: goserver is the "parent" process that controls the backup schedule
   and initiates the remote dump command)

   backupsvr: /usr/ucb/ps auxw | grep "goserver -x"
   root 2095 0.0 4.3 3272 2676 ? S Jan 05 286:19
   /usr/budtool/bin/solaris_sparc/goserver -x0

   backupsvr: truss -aef -p 2095
   2095: psargs: /usr/budtool/bin/solaris_sparc/goserver -x0
   2095: getmsg(12, 0xEFFF87F8, 0xEFFF87EC, 0xEFFF8804) (sleeping...)
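
The netstat line above already hints at why nothing ever times out.
Assuming the usual Solaris column order (local address, remote address,
Swind, Send-Q, Rwind, Recv-Q, state), both queues are 0, so the server has
no unacknowledged data whose retransmission could fail and surface an
error. A small Python sketch of that check follows; the column order is an
assumption, and `parse_netstat_line` and `looks_idle` are hypothetical
helper names.

```python
# Assumed column order of a Solaris "netstat -a" TCP line.
COLUMNS = ("local", "remote", "swind", "send_q", "rwind", "recv_q", "state")

# The line reported in the original question.
SAMPLE = "backupsvr.1020 client.bms.com.shell 61315 0 8760 0 ESTABLISHED"


def parse_netstat_line(line):
    """Split one TCP line into named fields, converting the counters to int."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError("unexpected number of columns: %r" % line)
    row = dict(zip(COLUMNS, fields))
    for key in ("swind", "send_q", "rwind", "recv_q"):
        row[key] = int(row[key])
    return row


def looks_idle(row):
    """True when the connection is established but has nothing queued.

    With nothing waiting to be sent there are no retransmissions, so a
    crashed peer goes unnoticed unless keepalive probes are enabled.
    """
    return (
        row["state"] == "ESTABLISHED"
        and row["send_q"] == 0
        and row["recv_q"] == 0
    )
```

Calling `looks_idle(parse_netstat_line(SAMPLE))` on the line from the
question reports an idle connection, which fits the truss output showing
goserver asleep waiting for input.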

   Any info on how to try and close this socket without killing the "goserver"
   process would be appreciated. Thanks!

   --
   Christopher M. Murphy email: murphy@bms.com
   Bristol Myers Squibb phone: (609) 252-5741
   Scientific Information Systems fax: (609) 252-6163
   Princeton NJ



