[Spambayes] My first results with pop3proxy and smtpproxy

Tim Peters tim.one@comcast.net
Mon Nov 4 19:39:31 2002

> I've trained using the smtpproxy and a few dozen spams that I hadn't
> deleted and hadn't been contaminated by SA before I got involved with
> spambayes (basically SA mistakes).

You don't have to worry about that:  by default, the tokenizer ignores all
header lines SA may have had anything to do with, so it doesn't matter
whether SA has added headers or not.
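
As a rough illustration only (this is not the actual spambayes tokenizer code), header skipping of that kind could look like the sketch below. The prefix list and the function name are assumptions made for the example; the real tokenizer's list of ignored headers may differ.

```python
from email import message_from_string

# Hypothetical prefix list: SpamAssassin adds headers such as X-Spam-Status
# and X-Spam-Level. The real tokenizer's set of ignored headers may differ.
SA_HEADER_PREFIXES = ("x-spam-",)

def header_lines_for_tokenizing(raw_message, skip_prefixes=SA_HEADER_PREFIXES):
    """Return (name, value) header pairs, dropping any SA-looking headers."""
    msg = message_from_string(raw_message)
    return [(name, value) for name, value in msg.items()
            if not name.lower().startswith(skip_prefixes)]

raw = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Spam-Status: Yes, hits=12.3\n"
    "X-Spam-Level: ************\n"
    "\n"
    "body text\n"
)
kept = header_lines_for_tokenizing(raw)
```

With headers like these dropped before tokenizing, a message scores the same whether or not SA touched it first.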

> Even given the small size of the corpus, it is doing an amazingly great
> job classifying inbound mail.


> It even correctly classified one of those "here's another funny
> story" infernal mails that gets forwarded three hundred times, and I
> hadn't trained it on anything like that.

Ya, but that would be ham to somebody else.  Train accordingly <wink>.

> I have to say that a corpus of thousands really isn't turning out to
> be a necessity for spambayes to be useful to me.

Indeed, old tests show that it is, on average, *useful* after training on a
single ham and a single spam:  it gets significantly more right than wrong
after that much.  So long as *none* of your ham looks like advertising or
random chatter, a few hundred of each may be fine for you.
Fraction-of-a-percent error rate improvements are important for high-volume
uses (like python.org, which handles more email in a day than most people get
in a year).

> One other observation... my strong tendency *IS* to train this thing
> only when it makes a mistake.

That's a UI problem.  A good UI would deduce what's ham and spam by watching
what you do to your email, and train on a random sampling of it.  The
Outlook client may be the only one making real progress in that direction so
far.

> Skip et al. have warned beaucoup times about not doing this...

That would be me.

> train on a reasonable smattering of both, even if they're correctly
> classified, and train often.

The things I call ham would shock you <wink -- but I do get several
categories of difficult ham>.

> **BUT** if this is my tendency and I understand the system, then this
> will likely be a real problem when the masses get started using it.
> How to ensure that mistakes-only training isn't the norm?  Beats me.
> But we've either gotta figure out how to make sure that the teeming
> masses don't make this error, or we've gotta figure out how to make
> the system tolerate this error reasonably well.

It can't tolerate it -- it can only learn what it's been taught, and
reliance on hapaxes is both vital over the short term and brittle over the
long term; ongoing training is needed to prevent hapaxes from becoming a
liability over time.  *Most* spam is dead easy to recognize, though, as is
most ham.  The errors occur in atypical cases.
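
To make "hapax" concrete: it is a token that has been seen in only one training message. The sketch below is a minimal illustration, not how spambayes stores its counts; whitespace splitting stands in for the real tokenizer, and the function name and sample messages are made up for the example.

```python
from collections import Counter

def hapaxes(messages):
    """Return the tokens that occur in exactly one training message."""
    doc_counts = Counter()
    for text in messages:
        # Count each token once per message; str.split() is only a
        # stand-in for a real tokenizer.
        doc_counts.update(set(text.lower().split()))
    return {tok for tok, n in doc_counts.items() if n == 1}

# Made-up training messages for illustration.
trained = ["cheap pills now", "cheap watches now", "meeting moved to friday"]
rare = hapaxes(trained)
```

In a small training set almost every token is a hapax, which is why they carry so much weight early on; without further training, the same one-off evidence stays in place and can mislead later.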
