SUMMARY: LARGE email application

From: Tom Lojewski (thl@ProNetC.com)
Date: Sun Feb 16 1997 - 17:40:04 CST


Dear Managers:

This is a summary of the responses to the question I posted WRT a
LARGE e-mail application. Some folks took a lot of time to present
their idea of how this application should be architected. From my own
personal experience with sendmail on both Solaris and BSDI boxes, it
seems that a single box can pump through more emails than one would
imagine.

Our application will do almost no local delivery; it will just do an
alias lookup and forward the email on to its final destination.
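
To make "alias lookup and forward" concrete, here is a minimal Python
sketch. The alias table, addresses and function name are all made up for
illustration; in practice the table would live in a database and the
forwarding would be done by the mail software itself.

```python
# Hypothetical alias table; a real one would be a database, not a dict.
ALIASES = {
    "jdoe": "jdoe@isp.example.net",
    "asmith": "asmith@corp.example.org",
}

def forward_target(recipient, aliases=ALIASES):
    """Return the remote address for a local recipient, or None if unknown."""
    # Only the part before the "@" is used as the alias key.
    local_part = recipient.split("@", 1)[0].lower()
    return aliases.get(local_part)

print(forward_target("JDoe@mail.example.com"))
print(forward_target("nobody@mail.example.com"))
```

An unknown recipient comes back as None here; what to do with such mail
(bounce it, hold it) is a policy decision this sketch leaves open.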

It's almost impossible to guess how many messages an "average" user
gets. But I think the bulk of our users will not be receiving more
than 30 messages per day.
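
As a rough feel for what that means in load, here is a small Python
sketch using the 30,000-60,000 users from the original post and 30
messages per user per day. It assumes traffic is spread evenly over the
day, which real mail is not, so the peak factor of 5 is purely an
illustrative guess, as is the function name.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def messages_per_second(users, msgs_per_user_per_day, peak_factor=1.0):
    """Average inbound message rate, scaled by an assumed peak factor."""
    daily_total = users * msgs_per_user_per_day
    return daily_total * peak_factor / SECONDS_PER_DAY

for users in (30_000, 60_000):
    average = messages_per_second(users, 30)
    peak = messages_per_second(users, 30, peak_factor=5.0)
    print(f"{users} users: {average:.1f} msg/s average, {peak:.1f} msg/s at 5x peak")
```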

I think I'd like to have several smaller machines handle the mail as
opposed to one big one. That way we have built-in redundancy. And
the "proof of concept" phase of our project will be much less
expensive that way.

I hate to say it to this list but I'm wavering between SS5's and
Pentium Pro's running BSDI. I've seen a low end Pentium handle lots
of mail.

The original post:
------------------
> Dear Managers,
>
> I'm trying to forecast hardware costs and requirements for a large
> (30-60,000 user) email application. The number would grow over time
> as the product became more successful. Email would arrive at our
> servers and in most cases be forwarded (via an alias lookup) to
> various remote destinations for final delivery.
>
> I envision several servers, each supporting perhaps 5,000 - 15,000
> users. I *think* the bottleneck will be the number of simultaneous
> TCP port 25 connections that can be handled by each server. Does
> anyone have real-world experience with handling large numbers of
> accounts on Solaris based machines?
>
> 1. Let's say each user gets an average of 15 email's/day - (is
> that realistic?).
>
> 2. How many such accounts could I support on a Sparc 5, 20, Ultra-1?
> (We'd like to start small, do a proof of concept, then grow the
> business as necessary.)
>
> Would *love* to hear from anyone who has experience applicable to the
> above.

Here is the list of responders:

From: "Bert N. Shure" <bert@virtual.com>

tom:

sounds like an interesting problem. i would talk to sun and talk to
netscape.

i think the number of emails per day depends on the type of user. my
wife probably gets five per day. i'm on the sun managers mailing list
and the dachshund digest, and i probably get 100 per day.

i think you'll end up on a sun E4000 with a couple cpu's, and bunch of
memory and a disk array.

----------------------------------------------------------------------

====================================================================

The following post from Rich combined with my own experience with BSDI
and the fact that we need to do a "proof of concept" for this project
before we can get a lot of funding got me thinking about BSDI:

====================================================================

From: Rich Kulawiec <rsk@itw.com>

I've been consulting on the design and deployment of some very large
Internet gateways over the last few years, so I can tell you about
some of the solutions that I've seen.

Due to the number of users you mention, I'm going to assume that this
company is geographically dispersed, and Internet-connected.
I'm also going to assume that implementation cost is probably an issue.

And finally, I'm going to assume that final delivery destinations
could be anywhere: users might retrieve mail from a server with POP,
they might have it forwarded to Lotus Notes or MS:mail or just about
anything else.

If any of my assumptions are wrong, well, you're going to need to
adjust what follows.

All that said, I'd recommend that you use an architecture something
like this:

                Internet
                    |
                    |
        ------------------------
       | external mail server   |  Sparc 5 or 20 running Solaris
        ------------------------
                    |
                    |
        ------------------------
       |   firewall/gateway     |
        ------------------------
                    |
                    |
        ------------------------
       | internal mail server   |  Sparc 5 or 20 running Solaris
        ------------------------
                    |
                    |
        ------------------------
       | group mail server #1   |  Pentiums running BSDI
        ---------------------------
          | group mail server #2   |
           ---------------------------
             | group mail server #3   |
              ------------------------

All systems running Sendmail 8.8.5 (as of Feb '97).

Here's the reasoning behind this.

Internet email usage tends to be a chokepoint, and setting up multiple
gateways is a pain, so it's much easier to have a single high-performance
gateway. The external/internal setup does require an extra machine,
but it's worth it for the extra security (and that's a whole 'nother
discussion, which I'll skip here). Whether to go with something like
a 5 or 20 (or Ultra) depends on the volume of messages that you need
to shovel through the gateway on an hourly/daily basis. It also depends
on how easy it is for you to make a usage-based argument for upgrades;
depending on your political environment, it may be easier for you to
buy big to begin with.

The reason for going with BSDI boxes for group mail servers is simple:
they're cheap. In fact, they're so cheap, that you can afford to have
one sitting around as a hot spare, ready to plug in and replace any
one of them should it go poof. And because they're cheap, you can
scale this part of the solution at low budget impact -- much lower
than a Sparc-based solution. (And you can also buy emergency replacement
parts on a weekend just about wherever you happen to be located.)

Depending on how your users are grouped geographically or politically,
you can assign each group a mail server, then wrangle with the problem
of funneling all of their inbound/outbound mail to that server from
whatever weird local mail system they're using. And *that* can be
an entirely ugly problem. But once you solve it, you're just about
home free: mail from users of group mail server #1 that's headed for
the Internet goes to your main internal mail server and then out;
mail that's destined for other internal users goes to the other
group mail servers as appropriate. By letting the group mail servers
talk to each other, you unburden the main server, which now only carries
traffic related to the Internet.
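
A minimal sketch of that routing rule, in Python for illustration only.
The host names and the domain-to-server table are made up, and in a real
deployment this decision would be made by the mail software's own
configuration rather than by a script.

```python
# Hypothetical host names; none of these come from a real site.
INTERNAL_MAIL_SERVER = "mail-internal.example.com"

# Hypothetical mapping of each group's mail domain to its group server.
GROUP_SERVERS = {
    "east.example.com": "group1.example.com",
    "west.example.com": "group2.example.com",
    "hq.example.com": "group3.example.com",
}

def next_hop(recipient):
    """Return the host a group mail server should hand this message to."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    # Internal recipients go straight to the owning group server;
    # anything else goes to the main internal server and then out.
    return GROUP_SERVERS.get(domain, INTERNAL_MAIL_SERVER)

print(next_hop("alice@west.example.com"))
print(next_hop("bob@somewhere-else.example.org"))
```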

This approach will let you start small, and then grow the number
of distributed servers as demand grows. You can even grow them
unevenly, with higher-powered machines in those departments/areas
where demand is higher -- as long as the software architecture
is consistent, it's really quite easy.

There's another reason for this approach, btw, relating to web access:
the same architecture facilitates hierarchical web caching, using a
master internal web caching machine which is then pointed to by the
web caches on each of the group machines. Whether or not you can
load down the group machines with web activity as well as mail depends
on the relative usage levels of both -- but if you can, it's great,
because it solves two major traffic problems with one box.

To give you a rough idea of performance levels, it's not difficult
for a 200 MHz Pentium with 64M of memory and 2x2G drives to handle
mail and web caching for a couple thousand moderately busy people.
(And yes, if you have 60,000 people, then that means, maybe, 30 machines.
But 30 *cheap* machines, and that's an outside estimate. I think it's
more likely that you could do this with 10-15 boxes and grow 'em as
time goes on.)
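
The same back-of-envelope sizing as a few lines of Python. The per-box
capacities are assumptions: 2,000 users takes "a couple thousand"
literally, and 4,000-6,000 per box is simply what the 10-15 box guess
implies. The function name and the single hot spare are illustrative.

```python
import math

def boxes_needed(users, users_per_box, hot_spares=0):
    """Number of group mail servers for a user count, plus optional spares."""
    return math.ceil(users / users_per_box) + hot_spares

for capacity in (2_000, 4_000, 6_000):
    count = boxes_needed(60_000, capacity)
    with_spare = boxes_needed(60_000, capacity, hot_spares=1)
    print(f"{capacity} users/box: {count} boxes ({with_spare} with one hot spare)")
```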

---Rsk
Rich Kulawiec
rsk@itw.com

-------------------------------------------------------------------------

============================================================
I have yet to contact these folks, but will...
============================================================

From: James.E.Coby.Jr@cdc.com (James Coby)

Hello Tom,
  
   I usually am a reader of this list and seldom respond to marketing
types of queries. However I believe the Electronic Commerce group here
may have some information pertaining to your query.
If you would be interested I can suggest you contact Jim Payne at
Control Data Systems. (James.Payne@cdc.com)
You can also check out the following URL:

http://www.cdc.com

When you get to the home page click on the Rialto Messaging storefront.

--------------------------------------------------------------------------

============================================================
I should have made it more clear that we would do very little local
delivery. So local disk space is not an issue. I think we'll use a
commercial data base to handle the aliases. Thanks Jim, for the
following detailed post:
============================================================

From: Jim Harmon <jim@telecnnct.com.ProNetC.com>

I estimate the average user gets 15 SPAM's a day... aside from real
messages.

Sophisticated or prolific users will receive up to 50+ messages per day,
and subscribers to mailing lists like this will get 10-20 messages per
hour.

JUST from this list I get 30+ messages per day.

E-mail is a very compact service. Your headaches are going to be
managing the aliases list(s).

How large a disk will you be providing for /swap and /var/spool? and how
long do you intend to allow each user to store or archive mail online?

How many drives do you plan to support? Since mail is ALL files, your
real limitation is disk space and disk management.

A VERY dirty estimate of how much space you'll need for /var/spool/mail
is to take a standard, say vt100 80x25 ascii display. For practical
purposes that's 2000 characters per screen max. If the average message
is --say-- 5 screens (that's a high estimate) then you're looking at
10,000 characters per file. That's 10K. (1 ascii char = 8 bits data = 1 byte)

On a 4GB drive, if 20% (guesstimate) is overhead, that leaves 3.2GB of
available space. (I believe you can only allocate 2GB per partition, so
allocate 2GB to /var/spool/mail, and split the rest between /swap and
/var.)

On a 1 disk system with that kind of partition, you can store 200,000
messages of 10K each. That may not sound like much, but it's really
more than 3 times as many-- easily. Ascii data can be compressed by
over 70% with good compression utilities, and the average message is
less than 1000 characters, or 2K.

DOUBLE that and redo the math with a 4K file vs. 10K and you can easily
keep over 500,000 messages online. Again, the headache is keeping track
of them all.
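
For anyone who wants to redo the math with their own numbers, here is a
small Python sketch of the estimate. It works in bytes throughout,
counting one 8-bit ASCII character as one byte, so a five-screen message
is 10,000 bytes. The 2GB partition is the assumption stated above; the
function and constant names are made up.

```python
CHARS_PER_SCREEN = 80 * 25       # one vt100 screen, 2000 characters
PARTITION_BYTES = 2 * 10**9      # the 2GB /var/spool/mail partition assumed above

def messages_that_fit(message_bytes, partition_bytes=PARTITION_BYTES):
    """How many uncompressed messages of a given size fit in the partition."""
    return partition_bytes // message_bytes

five_screens = 5 * CHARS_PER_SCREEN      # the deliberately high per-message estimate
print(messages_that_fit(five_screens))   # five-screen messages
print(messages_that_fit(4_000))          # the 4K-per-message case
```

Neither figure allows for compression, filesystem overhead or
attachments, so treat both as rough upper bounds for text-only mail.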

This does not take into account MIME extensions, graphics files, sound
files, or anything other than text-only mail.

Your next questions, and I'm sure others out there will be better
experienced to help you with them, are: how much overhead do you need to
allocate per user, how can you subdivide /var/spool/mail across
partitions if you need more than 2GB (assuming a 2GB limit) and how much
space should you add to the calculation regarding MIME extensions
attachments (like Microsoft files that take 3-5 times -and more- space
than the text they contain), and how many disks will you need of xGB
size. Lastly, what about backup and redundancy? (Think RAID- if this
will be a commercial venture, you won't be able to keep customers if you
can't guarantee their messages --regardless of how trivial or SPAMmy--
will be there when they want them).

Mail itself, once configured, and the aliases list(s) properly
maintained, shouldn't be a problem for you.

--------------------------------------------------------------------------

======================================================================

Will look into procmail and its capabilities. I've done lots of work
with BIND and Sendmail.

Good idea to keep track of performance from day one.

======================================================================

From: "Karl E. Vogel" <vogelke@c17mis.region2.wpafb.af.mil>

     Make sure you have the most recent version of Solaris (at least 2.5.1).
     You have other bottlenecks besides port 25 which will bite you first.

     1. With that many users, you'll need something to handle batch
         creation of new accounts. The GUI-based crap is dandy if you want to
         avoid training someone or only have a few accounts to add, but you'll
         need something that can handle tons of accounts at one time.

     2. Don't rely on a "flat" passwd file, or your system will be spending
         all day handling logins and figuring out what userid goes with what
         password. Make sure you've got a workable database solution to hold
         your user account information, like DBM. A user lookup should be damn
         near instant.

     3. Get the best versions of sendmail and networking software you can
         find:
                 sendmail-8.8.5
                 bind-4.9.5 (address resolution)
                 procmail-3.11 (robust final mail delivery plus filtering)
                 smartlist or listserv (mailing-list handler)
                 fetchmail (delivery to PC clients)

         Earlier versions have security problems. This includes all of the
         vendor-supplied stuff. All the packages above are free and work like
         a champ under Solaris.

     4. If you've never configured either sendmail or BIND, find someone who
         has. A slow network or slow hostname resolution will kill you. Ask
         your Sun vendor about installing fast Ethernet (100 Mbits/sec), and
         make sure your Internet service provider can keep up with you.

     5. Anticipate what happens when you have lots of simultaneous users or
         lots of mail traffic. With lots of logged-in users, you have lots of
         home-directory disk access; therefore, get something like a Storage
         Array with lots of smaller disks instead of a few huge ones. This way
         you spread I/O requests over a bunch of disk controllers instead of
         beating one controller into the ground.

         With lots of mail traffic, you have lots of mailboxes which would
         ordinarily sit under (say) /var/mail. Unix still rewards small files
         and small directories, so either prepare to store incoming mail under
         each user's home directory (VERY easy with procmail), or do some
         subdividing under /var/mail. For example, all accounts starting with
         "a" reside under the /var/mail/a/ directory, etc. The home directory
         solution is better, because if you have the users spread all over
         separate drives, you get less I/O contention.

     6. Start tracking your system performance from the day it's installed.
         You can't tell if your system is sick unless you have some data
         gathered from when it was well.
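
To illustrate the subdividing scheme from item 5, here is a short Python
sketch. The function name is hypothetical, and keying on the first
letter of the account name is just the example given above; a real site
might well split the spool differently.

```python
import posixpath

def mailbox_path(username, spool="/var/mail"):
    """Place each mailbox in a subdirectory named for its first letter."""
    if not username:
        raise ValueError("username must not be empty")
    # e.g. all accounts starting with "a" live under /var/mail/a/
    return posixpath.join(spool, username[0].lower(), username)

print(mailbox_path("alice"))
print(mailbox_path("vogelke"))
```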

Tom> Let's say each user gets an average of 15 email's/day - (is that
Tom> realistic?).

     No way to say, really. I get anywhere from 20 to 120 messages per day,
     depending on mailing-list traffic.

Tom> How many such accounts could I support on a Sparc 5, 20, Ultra-1?

     You might want to consider something hefty, like a Sparc-1000.

-- 
Karl Vogel                                          vogelke@c17.wpafb.af.mil
ASC/YCOA, Wright-Patterson AFB, OH 45433                        937-255-3688

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ Tom Lojewski (thl@ProNetC.com)    =+= ProNet Consulting Alameda, CA      +
+                                   =+=                                    +
+ phone: (510)864-8633              =+= fax:(510)864-8637                  +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


