[Spambayes] Tony Meyer - Training question

Erik Brown kirebrow at yahoo.com
Sun Sep 18 10:23:40 CEST 2005

I forgot to mention they must train on all false negatives and positives as
well.

Erik Brown

-----Original Message-----
From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org] On
Behalf Of Hely Holdings Pty Ltd (Sales Dept.)
Sent: Sunday, September 18, 2005 1:32 AM
To: spambayes at python.org
Subject: [Spambayes] Tony Meyer - Training question

Hi Tony.

Back in August 2004 you kindly critiqued a spam chapter for me
from my security book "The Hacker's Nightmare".

I am gearing up for a new edition of THN and will be expanding
the spam section a fair bit in the process. I deal only with the
Outlook plug-in.

At this time I would like to know if you have changed your
opinion on training since then. Here's what you said in a message
to me on August 10, 2004 after reading my draft chapter.

---------- BEGIN QUOTE ----------

Training is a difficult issue to write about.  The problem is
that not enough is yet known about the best ways to train, and
that the Outlook plug-in really only facilitates a couple of
different methods.  However, it is almost certain that 'train on
everything' is a bad idea, that smaller databases are generally
better than large ones, and that imbalances are bad.

These are not hard rules.  Your training described has a huge
imbalance, and is a pretty large database, and is (at least
initially) train-on-everything, and yet I presume you have had
good results or you wouldn't be writing this. In general, though,
based on both testing and feedback from users, the above is true.

I believe that the best training method to recommend to people
using the plug-in is:

 * Don't do *any* initial training. (Everything will now end up
in the 'unsure' folder.)
 * Train on *everything* that ends up in the 'unsure' folder.  At
first, this will be a lot of mail, but it will rapidly reduce.
 * Train on *all* mistakes (at first, there may be some false
positives/false negatives, but these will even more rapidly
reduce).

Once 10-20 mails of each type have been trained, the system
should be very accurate.

---------- END QUOTE ----------
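
The procedure quoted above can be sketched as a simple selection
rule. This is only an illustration in Python, not SpamBayes code:
the names should_train and select_training, and the
'ham'/'spam'/'unsure' verdict strings, are made up for the example
and do not come from the plug-in.

```python
from collections import Counter


def should_train(verdict, actual):
    """Decide whether to train on one message.

    verdict: what the filter said ('ham', 'spam', or 'unsure').
    actual:  what the user says the message really is ('ham' or 'spam').
    Both names are hypothetical, chosen for this sketch only.
    """
    if verdict == "unsure":
        return True           # train on everything in the unsure folder
    return verdict != actual  # train on false positives and false negatives


def select_training(messages):
    """Count how many messages of each type would be trained."""
    trained = Counter()
    for verdict, actual in messages:
        if should_train(verdict, actual):
            trained[actual] += 1
    return trained


# A made-up stream of (filter verdict, true class) pairs.
stream = [
    ("unsure", "ham"),
    ("unsure", "spam"),
    ("spam", "spam"),  # correctly filed: not trained
    ("ham", "spam"),   # false negative: trained as spam
    ("spam", "ham"),   # false positive: trained as ham
    ("ham", "ham"),    # correctly filed: not trained
]
counts = select_training(stream)
print(counts["ham"], counts["spam"])
```

Correctly classified mail is never added, so the database stays
small, and the per-type counts make it easy to watch for the
imbalance the quote warns about.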

For my target audience I need to make all explanations and
instructions as simple as possible. If I started describing
techniques like Seth Goodman's "Recursive Training Set Selection
For Outlook" I'd have them throwing up out of fear and confusion.

I basically distilled your advice down to "do no pre-training at
all - train only on the UNSURE folder".

While that seems to work fine and has been well received, it was
after all a year and several releases ago.

Where do you stand on training these days, for people who simply
will not or cannot follow a complicated set of instructions?

Best regards,
 - Bill H.

We take security very seriously. All outgoing mail is
certified Virus Free. To boost YOUR security visit
The Hacker's Nightmare: http://HackersNightmare.com.
