[Spambayes] My first results with pop3proxy and smtpproxy
Mon Nov 4 19:39:31 2002
> I've trained using the smtpproxy and a few dozen spams that I hadn't
> deleted and hadn't been contaminated by SA before I got involved with
> spambayes (basically SA mistakes).
You don't have to worry about that: by default, the tokenizer ignores all
header lines SA may have had anything to do with, so it doesn't matter
whether SA has added headers or not.
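
To make the idea concrete, here is a minimal sketch of dropping SA-touched headers before generating header tokens. It is not the actual spambayes tokenizer: the `IGNORED_PREFIXES` tuple, the `header_tokens` name and the `name:word` token format are all assumptions made up for illustration.

```python
from email import message_from_string

# Assumption: headers added by SpamAssassin start with "X-Spam-".
# This prefix list is illustrative, not the tokenizer's real configuration.
IGNORED_PREFIXES = ("x-spam-",)

def header_tokens(raw_message):
    """Return simple 'header:word' tokens, skipping ignored headers."""
    msg = message_from_string(raw_message)
    tokens = []
    for name, value in msg.items():
        lname = name.lower()
        if lname.startswith(IGNORED_PREFIXES):
            # Nothing from this header line ever reaches the classifier.
            continue
        for word in str(value).split():
            tokens.append(f"{lname}:{word.lower()}")
    return tokens
```

With a filter like this, a message that SA has already tagged and an untagged copy of the same message yield identical header tokens.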
> Even given the small size of the corpus, it is doing an amazingly great
> job classifying inbound mail.
> It even correctly classified one of those "here's another funny
> story" infernal mails that gets forwarded three hundred times, and I
> hadn't trained it on anything like that.
Ya, but that would be ham to somebody else. Train accordingly <wink>.
> I have to say that a corpus of thousands really isn't turning out to
> be a necessity for spambayes to be useful to me.
Indeed, old tests show that it is, on average, *useful* after training on a
single ham and a single spam: it gets significantly more right than wrong
after that much. So long as *none* of your ham looks like advertising or
random chatter, a few hundred of each may be fine for you.
Fraction-of-a-percent error rate improvements are important for high-volume
uses (like python.org, which handles more email in a day than most people get
in a year).
> One other observation... my strong tendency *IS* to train this thing
> only when it makes a mistake.
That's a UI problem. A good UI would deduce what's ham and spam by watching
what you do to your email, and train on a random sampling of it. The
Outlook client may be the only one making real progress in that direction so far.
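
A rough sketch of the "train on a random sampling" part, assuming the UI has already worked out a label for each message. The `sample_for_training` name, the `(message, label)` pair layout and the `fraction` parameter are hypothetical, not part of any existing client.

```python
import random

def sample_for_training(labeled_messages, fraction, rng=None):
    """Pick a random subset of (message, label) pairs to train on.

    The choice ignores whether the classifier got each message right,
    which is the point: it avoids mistakes-only training.
    """
    rng = rng or random.Random()
    k = round(len(labeled_messages) * fraction)
    return rng.sample(labeled_messages, k)
```

Passing a seeded `random.Random` makes the selection repeatable for testing; in normal use the default unseeded generator is fine.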
> Skip et al. have warned beaucoup times about not doing this...
That would be me.
> train on a reasonable smattering of both, even if they're correctly
> classified, and train often.
The things I call ham would shock you <wink -- but I do get several
categories of difficult ham>.
> **BUT** if this is my tendency and I understand the system, then this
> will likely be a real problem when the masses get started using it.
> How to ensure that mistakes-only training isn't the norm? Beats me.
> But we've either gotta figure out how to make sure that the teeming
> masses don't make this error, or we've gotta figure out how to make
> the system tolerate this error reasonably well.
It can't tolerate it -- it can only learn what it's been taught, and
reliance on hapaxes is both vital over the short term and brittle over the
long term; ongoing training is needed to prevent hapaxes from becoming a
liability over time. *Most* spam is dead easy to recognize, though, as is
most ham. The errors occur in atypical cases.
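
For anyone unsure what a hapax is here: a token that has been seen in exactly one trained message. A small sketch for finding them, assuming each message is already a list of tokens (the function names are made up for illustration and the real classifier stores its counts differently):

```python
from collections import Counter

def count_message_tokens(messages):
    """Count, for each token, how many messages it appears in."""
    counts = Counter()
    for tokens in messages:
        # set() so a token repeated within one message counts once.
        counts.update(set(tokens))
    return counts

def hapaxes(counts):
    """Return the tokens seen in exactly one trained message."""
    return {tok for tok, n in counts.items() if n == 1}
```

With little training data most tokens are hapaxes, which is why they carry so much weight early on and why a single stale one can mislead later.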