SUMMARY : Fast disk IO required.

From: Mathew BM LIM (M.Lim@anu.edu.au)
Date: Fri May 08 1992 - 15:15:36 CDT


Thanks to all those who replied to my posting. As usual, the
response from the group was overwhelmingly useful. Thanks in particular to:

         stern@sunne.east.sun.com (Hal Stern - NE Area Systems Engineer)
         John DiMarco <jdd@db.toronto.edu>
         Nigel Mitchem - System Support <nigel@cs.city.ac.uk>
         paulo@dcc.unicamp.br (Paulo L. de Geus)
         rjq@math.ksu.edu (Rob Quinn)
         lm@slovax.eng.sun.com (Larry McVoy)
         mike@fionn.lbl.gov (Michael Helm)
         kalli!kevin@fourx.aus.sun.com (Kevin Sheehan)
         Mister Damage <julian@syd.dwt.csiro.au>
         Larry McVoy <lm@slovax.eng.sun.com>
         don@mars.dgrc.doc.ca (Donald McLachlan)
         adam%bwnmr4@harvard.harvard.edu (Adam Shostack)
         seeger@thedon.cis.ufl.edu (F. L. Charles Seeger III)
         Mike Raffety <miker@sbcoc.com>
         srm@shasta.gvg.tek.com (Steve Maraglia)
         bb@math.ufl.edu
         "Everett Schell" <schell@molbio.cbs.umn.edu>
         rjq@math.ksu.edu (Rob Quinn)

In general the advice indicated that:
1) Simple SCSI disks will not achieve > 4.5 MBytes/sec due to the
        physical limitations of the disks.
2) Try disk striping - via DiskSuite or a similar product.
        I have no performance figures on this, but our Sun rep
        suggested that performance gains from disk striping were very
        application specific. He quoted gains of -15% to +15% on some
        database applications that he had seen.
3) Disk arrays are bl*?*&dy expensive!!
4) Don't go with IPI, it's expensive and will only gain you a little bit.
5) The sun4c architecture will only sustain 6MB/sec user->kernel,
        *memory* speeds anyway.
6) Use mmap().

****************** My original posting ....
Hi,
        I'm looking for information about fast (faster than standard Sun
SCSI disks, anyway) storage devices for an IPX. Here is a summary of
our situation:

The problem :
        One of our users needs a dedicated machine to run some computational
        chemistry software on. These problems are typically single
        process, 50% CPU intensive, 50% IO intensive and do a lot of
        floating point.

The machine :
        We currently have an IPC which we will upgrade to an IPX and install
        up to the maximum (64MB) amount of memory. This config should
        satisfy the need for CPU grunt. The user then needs between 1GB and 2GB
        of disk space locally to run the computations; here is where
        the problem lies.

The CPU grunt is easily solved, but trying to improve the IO performance
to make the user's computations less IO bound is harder. Possible
solutions I have looked at include :

1) Get a really fast SCSI disk (Wren 9 / Elite 3). This may need
        a 3rd party SCSI controller to exploit the Fast / Wide SCSI
        capabilities of these drives. The spec sheets for these quote
        an external transfer rate of 10MBytes/sec. This is the sort of
        performance we want, but is this speed sustainable?

2) Get a SBus -> VME Bus converter, an IPI controller and a 1 - 2GB
        IPI disk. This is expensive, but indications are that it will
        provide a more sustainable ~10MBytes/sec performance than SCSI will.

3) Get a solid state disk. Again, expensive, and these disks are
        generally too small (up to 400MBytes only) and are volatile.

We are NOT looking for a solution to improve the throughput to a general
access system, as I said, this will be a dedicated machine with, usually,
only the one active process working on one data file at a time. So, no,
getting 2 or more SCSI adaptors and splitting the disk space across them
will not help (I don't think).

**************************

Here are some of the more useful replies :

**************************
answers:
(a) don't go Sbus->VME with IPI. IPI disks aren't any faster
        than SCSI disks today, and SCSI is cheaper. plus you'll
        need a rack to put the IPI stuff in.
(b) if you get a fast or fast+wide disk, you'll need to get a
        SCSI host adaptor to handle that interface as well.
        the sun scsi host adaptor does 5 Mbyte/sec. you can
        do 10 Mbyte/sec with a "fast" disk, and a fast+wide
        disk should go to 20 Mbyte/sec
(c) the speeds are sustainable for sequential transfers (see (a)
        above). but they are for raw disks, not through the
        UNIX file system. when you go through the filesystem,
        your transfer rates drop to 3.5-4 Mbyte/sec, and it
        gets a little worse for large files (due to indirect
        and double indirect blocks)
(d) if the same files are going to be used over and over, use mmap()
        to get them into the user's address space. this lets the
        VM system cache the files, and may eliminate some disk i/o.
(e) if there are several files involved, buy 3 or 4 424 Mbyte disks
        and put them on separate SCSI busses, then set up parallel
        transfers using async i/o (or using mmap(), letting the VM
        system handle the faults for you). it's still one process
        driving it, but it can suck in data from multiple files at
        the same time -- so if you take care of splitting the data
        you can effectively increase your throughput.

**************************

>The CPU grunt is easily solved, but trying to improve the IO performance
>to make the user's computations less IO bound is harder. Possible
>solutions I have looked at include :

>1) Get a really fast SCSI disk (Wren 9 / Elite 3). This may need
> a 3rd party SCSI controller to exploit the Fast / Wide SCSI
> capabilities of these drives. The spec sheets for these quote
> an external transfer rate of 10MBytes/sec. This is the sort of
> performance we want, but is this speed sustainable?

No. The controller certainly can sustain that transfer rate, but the bits
can't be read off the disk fast enough.

Maximum disk transfer rate (from platter to disk buffer) is
(#bytes/track)*(RPM/60), which gives you somewhere around 5MB/sec with
current disks.

Narrow/Fast SCSI (third-party SBUS controllers are available which implement
this) is perfectly capable of close to 10MB/sec sustained transfers from disk,
but a single disk can only do this from buffer to controller, not from platter
to buffer.

If you want better than around 5MB/sec sustained transfer with such a
controller, you need an unusual disk. Here are a couple of options:

- a two-headed disk. Seagate has a version of the Sabre VI and VII which has
  two head assemblies. The drive is clever enough to use both simultaneously
  to essentially double the transfer rate. There's a 911MB version which
  transfers at 6MB/sec (we have these, in IPI), and a 3GB version (Sabre VII)
  which transfers at 9.4MB/sec. Get one with a Narrow/Fast SCSI2 interface,
  if there's one available.
- a disk array. Ciprico makes a RAID controller which uses five ESDI disks
  to emulate a narrow-fast SCSI drive with a transfer rate of 10MB/sec,
  using disk striping. One of the drives is used for error correction, so
  the resulting virtual disk is the size of four of the physical disks.
  It gives you reliability too, since you can replace any one of the
  drives (live!) and it'll automatically reconstruct it.

>2) Get a SBus -> VME Bus converter, an IPI controller and a 1 - 2GB
> IPI disk. This is expensive, but indications are that it will
> provide a more sustainable ~10MBytes/sec performance than SCSI will.

Should work, if the bus converter does a good job. You'll still need a
fast IPI drive, or a disk striping scheme, though.

**************************

 You might look at the mmap() system call. You can cause the output file to
be cached in memory. Also look at the madvise() call. Hal Stern (sp?) wrote a
paper titled "Tuning SunOS 4.1" or "System Tuning under SunOS 4.1" or something
like that, which talks about this sort of thing. Damn I'm full of specifics huh?

**************************

Hi. I have worked on file systems at Sun. I believe that Sun has some
of the best FS technology around, especially for large sequential I/O.
We usually run at the platter (disk) speed.

That's the good news. The bad news is the sun4c architecture. It won't
sustain the I/O rates you want, no way. You can tell this by trying to
move your data in TMPFS. I get about 6MB/sec user->kernel, *memory*
speeds. There are several reasons, I think, that cause this problem:

        sun4c has a write through cache
        sun4c memory lives across the sbus
        sun4c has no bcopy hardware
        sun4c has 4k pages

A system that might sustain the I/O rates you want is a sun4 (470 or
490). A 4/4xx machine has 8K pages, a write back cache, bcopy
hardware, and faster memory (12MB/sec, up to 29MB/sec if it is in the cache).
All in all, a much nicer machine. Unfortunately, the sun4 has crappy
SCSI. So get a third party SCSI board, because the on board
SCSI sucks.

The summary: your problem is not disk speed as much as memory speed. We
can't keep up with that sort of memory speed on anything that has the disk to
match. Sorry.

Oh, yeah, stay away from IPI - they max out at about 2 or 3 MB/sec. I
get 3.5MB/sec off a SCSI elite 1 w/ an IPX doing sequential reads.

Finally, I'm quoting you honest numbers, numbers that taken out of
context would make Sun look bad. So don't quote them out of context.
Please. I'm not exactly Sun's biggest fan but we don't have to hit
them when they are down.

Oh, one last carrot - the sparcstation 3 should have better memory
bandwidth. I can find out how much better if you are still interested.

**************************

> We are NOT looking for a solution to improve the throughput to a general
> access system, as I said, this will be a dedicated machine with, usually,
> only the one active process working on one data file at a time. So, no,
> getting 2 or more SCSI adaptors and splitting the disk space across them
> will not help (I don't think).

I'd disagree - we just played with the striping provided by Online
DiskSuite, and striping a filesystem across two disks/controllers gave
us nearly twice the throughput on sequential read access. We expect the
same improvement on striped non-mirrored file systems as well.

******************************

If the IPX is good enough then go for it. Although I have not used one,
people here who need better floating point processing power look to
the HP Snake machines. If memory serves, they are about

        2 * Sun SS2 integer performance and
        5 * Sun SS2 floating point performance

Anyway some comments on your faster disk IO ideas.

1) I suspect you will need a non-default controller, but I don't know.
2) I think you should be able to find an SBus IPI controller without
   needing an SBus to VME adapter.
3) Re: solid state disks ... most of them are available with battery backup
   that will retain the info for a week or so without AC power.

Have you looked into disk striping solutions? Essentially you put N (2 or
more) disks in parallel and split your data across them. Properly
implemented, this can boost your throughput, since each disk can transfer
X MBytes/sec. This, in combination with SCSI wide/fast or IPI, should do well.

******************************

> We are NOT looking for a solution to improve the throughput to a general
> access system, as I said, this will be a dedicated machine with, usually,
> only the one active process working on one data file at a time. So, no,
> getting 2 or more SCSI adaptors and splitting the disk space across them
> will not help (I don't think).

Actually, it might, with disk striping, where, through a bright
controller, block 1 is written to disk a, block 2 to disk b, block 3 to
disk c, etc. Sun has a product that does this, called Online
DiskSuite, that you might want to look into. We looked at it, but
found another bottleneck in the way our code handled I/O that cleared
out the problem, so we never really examined it closely.

******************************

| 1) Get a really fast SCSI disk (Wren 9 / Elite 3). This may need
| a 3rd party SCSI controller to exploit the Fast / Wide SCSI
| capabilities of these drives. The spec sheets for these quote
| an external transfer rate of 10MBytes/sec. This is the sort of
| performance we want, but is this speed sustainable?

I would suggest waiting for the "SPARCstation 3" announcement, May 19.
Rumor has it that this will have the Fast SCSI-2 capabilities. Attach
an Elite-2 or Elite-3 (or a couple of them) to that. A Presto-serve
S-Bus card may or may not help, but it might be worth evaluating. This
would act mostly as a non-volatile 1 or 2 MB disk write cache.

| IPI disk. This is expensive, but indications are that it will
| provide a more sustainable ~10MBytes/sec performance than SCSI will.

If you find more money than you have down the road, there are IPI drives
with ~25 MB/s transfer rates. These are special drives that read from
9 heads in parallel, and they require special controllers (VME). You
might be able to fit one of these to a 4/630 or 4/670 with the drives
mounted in an external cabinet. These drives tend to be used for
applications requiring "video" bandwidth from the drive. Seagate makes
these drives, perhaps others do as well.

| We are NOT looking for a solution to improve the throughput to a general
| access system, as I said, this will be a dedicated machine with, usually,
| only the one active process working on one data file at a time. So, no,
| getting 2 or more SCSI adaptors and splitting the disk space across them
| will not help (I don't think).

Well, I think Sun sells a "Disk Suite" product that allows a single file
system to span disk partitions. If this were interleaved across different
controllers, there *might* be an increase in i/o bandwidth. You would have
to check with some technical types at Sun.

************************

Two additional possibilities ...

Get a SPARCstation 2, which can have 128 MB of memory, thereby
increasing your file cache in main memory.

Get a SPARCstation 370 or 670, which can have IPI disks attached.

Probably your fast/wide SCSI solution is best, but since that's a
pretty new technology, make sure you buy controller and disk together,
so they interoperate.

****************************

Also consider Sun's "Online DiskSuite" product, specifically the
disk striping feature. The product allows you to spread the I/O
across multiple disks, thus increasing overall throughput.

Here's what I would do -

Buy two SCSI host adapters and two 1GB disks (or four 500MB disks), splitting
the disks across the two SCSI host adapters. Use disk striping to create
one file system that spans all the disk drives.

I haven't personally done this; however, it is a common technique used
by database software to improve I/O. Basically you're just spreading
the I/O across multiple spindles.

****************************

> Get a really fast SCSI disk (Wren 9 / Elite 3). This may need a 3rd
> party SCSI controller to exploit the Fast / Wide SCSI capabilities
> of these drives. The spec sheets for these quote an external
> transfer rate of 10MBytes/sec. This is the sort of performance we
> want, but is this speed sustainable?

No. Check out the spec sheet for the speed at which data can be read
off the platter. 2.5-2.75 Megabytes/sec is typical, 4.75 is the
highest I've seen in SCSI (Fuji 2.0 Gig).

****************************

  Have you heard of Ciprico's Rimfire controllers? We got them two
  years ago, because they were about twice as fast as Sun's at
  that time.

****************************

Mathew Lim, Unix Systems Programmer, ANU Supercomputer Facility,
Australian National University, GPO Box 4, Canberra City, ACT, Australia 2601.
Telephone : +61 06 249 2750 | ACSnet : M.Lim@anu.oz
Fax : +61 06 247 3425 | Internet : M.Lim@anu.edu.au



This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:06:42 CDT