Summary: Disk partition is at MAX I/O (reads), why?

From: shaight@ukans.edu
Date: Thu Dec 09 1999 - 12:40:49 CST


Thanks go out to Unixboy@aol.com, john.hilger@ac.com, kevin@joltin.com,
and Mark_Neill@CSX.com for replying with various suggestions about
eliminating disk access hotspots and other ways to abuse DBAs ;).

In this case, the key clue was provided by running this command:

# find /apps -exec fuser -u {} \; |grep -v " : $" >/tmp/apps.files.in.use 2>&1 &

and then perusing the resulting list. I found three surprises in
$ORACLE_HOME/otrace/admin - three binary files that were each opened by
nearly 300 processes.
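
Rather than perusing the list by eye, the per-file process counts could be
tallied with a small filter. This is only a sketch: the function name
count_openers is made up for illustration, and it assumes each line of the
saved list has the form "file: entry entry ..." with one whitespace-separated
entry per process. If the fuser output on a given system is laid out
differently, or if file names contain colons, the parsing would need adjusting.

```shell
#!/bin/sh
# count_openers (hypothetical helper): read fuser-style lines on stdin and
# print "<process count> <file>", busiest file first.
# ASSUMPTION: each input line looks like "file: entry entry ...".
count_openers() {
    awk '
        {
            i = index($0, ":")            # first colon ends the file name
            if (i == 0) next              # skip lines without a file name
            file = substr($0, 1, i - 1)
            sub(/[ \t]+$/, "", file)      # drop blanks before the colon
            rest = substr($0, i + 1)
            n = split(rest, procs, " ")   # one entry per process
            if (n > 0) print n, file      # ignore files nobody has open
        }
    ' | sort -rn
}
```

Something like "count_openers < /tmp/apps.files.in.use | head" would then put
the most heavily shared files at the top of the list.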

$ ls -l *.dat
-rw-rw-rw- 1 oracle dba 39202 Dec 8 15:49 collect.dat
-rw-rw-rw- 1 oracle dba 2017954800 Dec 8 15:49 process.dat
-rw-rw-rw- 1 oracle dba 645276 Dec 8 15:49 regid.dat

That is an awfully large number in front of process.dat; I guess it's
pretty close to the 2G largefiles size. Even if it's not, that's
probably too many bytes to keep in the file system cache, which leaves
300 processes reading straight from disk at least once in a while. So I
shut all the Oracle stuff down once more, renamed these files so Oracle
would lose them, and slowly restarted a few databases. Problem avoided:
all I/O and scanning are back to normal. The .dat files were not
recreated either. I wondered enough about what they are for to ask the
DBAs; they haven't gotten back to me yet.
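
For a rough idea of how close process.dat was to that size, the arithmetic
can be sketched as below. It assumes the limit in question is 2^31 - 1 bytes
(the largest offset a signed 32-bit file offset can hold); the function name
headroom is made up for illustration.

```shell
#!/bin/sh
# Size of process.dat taken from the listing above.
size=2017954800
# ASSUMPTION: the relevant limit is 2^31 - 1 bytes.
limit=2147483647

# headroom SIZE LIMIT (hypothetical helper): print the bytes remaining below
# the limit and the percentage of the limit already used. awk does the math
# in floating point, so it does not depend on the shell's integer width.
headroom() {
    awk -v s="$1" -v l="$2" 'BEGIN {
        printf "%d bytes left, %.1f%% of limit used\n", l - s, 100 * s / l
    }'
}

headroom "$size" "$limit"
```

Under that assumption the file was at roughly 94% of the limit, with about
130 million bytes to spare.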

Thanks again all,

Yesterday I wrote:
> I've got a server:
>
> $ uname -a
> SunOS kusun8 5.6 Generic_105181-15 sun4u sparc SUNW,Ultra-Enterprise
>
> Running Oracle 7.3.4.1 databases (about 30 of them active at any one
> time) for PeopleSoft applications. At the moment the machine is a slug,
> ~30 seconds from the time I type a character until the time it is echoed
> to the screen, whether through a network login or over the serial
> console. All applications are suffering in similar fashion.
>
> I ran an iostat -xcneP and found the following single partition of
> interest:
>
> cpu
>  us sy wt id
>   9 19 72  0
> extended device statistics ---- errors ---
>   r/s  w/s    kr/s  kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
>  20.6  5.1 10876.3  41.3  0.0 12.4    0.0  483.2   0 100   0   0   0   0 c1t0d0s7
>
> Over the last few hours this is consistently reporting between 10000 and
> 12000 kr/s and is pegged at 100% busy (%b). I've also got an outrageous
> scan rate (between 2000 and 20000) reported by vmstat, but the swap
> disks are not active like they were a few weeks ago when there was a
> true memory shortage.
>
> This disk is the home for most of my OS partitions, and this particular
> partition is where the Oracle and PeopleSoft application software is
> installed. Typically there is only a light I/O load. All the database
> files and user-active files are on other filesystems, on other I/O
> channels.
>
> I have completely shut down and rebooted Solaris, and the problem
> returned when the Oracle databases were restarted. I'm currently
> shutting them down one at a time hoping to narrow the scope of what I'm
> looking for. I feel pretty strongly that it is application-oriented,
> but I've never seen anything quite like this before. Has anyone got any
> insight, clues, or tips for debugging this?
>
> Thanks in advance
>
> --
> Steve Haight
> shaight@ukans.edu
> The University of Kansas

--
Steve Haight
shaight@ukans.edu


