>One thing I do that may or may not be typical is that I let Outlook rules
>take care of all the mailing list traffic.  That includes almost no spam and
>so I don't train or classify it (the list admins do a good job).  Therefore,
>I _don't_ include it in my ham corpus.


>This gives me a roughly 1:5 ham/spam corpus, instead of roughly even, but
>that's the mail stream that SpamBayes sees.

That real-world mix is what I'd tend to use for testing, as opposed to
equal-sized training sets.

>At present, my corpus is about 7,500 messages total.  This may not be enough
>to "divide into ten sets", etc.  Or is it?

I think we did our first classifier shootouts with a minimum of 2,000
messages, so you should be fine.  You may not have enough to see some
of the longer-term effects I'm now witnessing (with inflection points
at 120 and 200+ days), but you should be able to get started, at least.
And heck, those inflection points (or the timing thereof) may be
peculiarities of my own data.  It'd be good to see whether yours shows
them too.
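
For what it's worth, here is a minimal sketch of the "divide into ten
sets" step.  It is not the SpamBayes test harness, just an illustration;
the function name split_into_sets and the message ids are made up, and it
assumes "1:5 ham/spam" means one ham for every five spam (so about 1,250
ham and 6,250 spam out of 7,500).  Dealing each class out round-robin
keeps the same ratio in every set:

```python
import random


def split_into_sets(ham, spam, n_sets=10, seed=0):
    """Deal shuffled ham and spam round-robin into n_sets buckets,
    so each bucket keeps roughly the corpus-wide ham/spam ratio."""
    rng = random.Random(seed)
    ham, spam = list(ham), list(spam)
    rng.shuffle(ham)
    rng.shuffle(spam)
    sets = [{"ham": [], "spam": []} for _ in range(n_sets)]
    for i, msg in enumerate(ham):
        sets[i % n_sets]["ham"].append(msg)
    for i, msg in enumerate(spam):
        sets[i % n_sets]["spam"].append(msg)
    return sets


# Hypothetical corpus: 7,500 messages at an assumed 1:5 ham/spam ratio.
ham_ids = [f"ham-{i}" for i in range(1250)]
spam_ids = [f"spam-{i}" for i in range(6250)]
sets = split_into_sets(ham_ids, spam_ids)
print(len(sets[0]["ham"]), len(sets[0]["spam"]))  # 125 625
```

That works out to 750 messages per set, which is why I'd expect a corpus
of this size to be enough to get started.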

- Alex

