Tony Meyer tameyer at ihug.co.nz
Mon Feb 9 20:00:27 EST 2004

> OK, that makes sense.  So, (other than ignoring the problem ;-)
> I could either move the "goalposts", or find some ham that came
> through that mail gateway and do some more training. 

Yes.  The latter is more likely to be successful and would certainly be
easier.  Note, too, (if you haven't heard this already) that SpamBayes works
better with a roughly equal amount of ham & spam trained, so that's good

> I wasn't clear on how the classifier selected its evidence
> (nor how the individual terms are weighted). 

The weighting is fairly complicated - if you want to know the gory details,
check out the classifier.py file in the source distribution.

> Well, I only counted about 80 in the mail header, but,
> uh, I wasn't exactly counting carefully.  Perhaps my training
> corpus was too small to complete cover this piece of spam? 

It probably means that there weren't more than 80 tokens (ignoring ones in
the 0.4-0.6 range) in that message.  Short messages can quite easily have
fewer than 150 tokens, as can longer ones that contain a lot of words that
you haven't trained on (since they'll score 0.5).
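
As a rough illustration of the selection described above, here is a minimal
sketch. It is not the code from classifier.py: the function name, the
constant names and the dictionary of trained scores are all made up for the
example, and the 150 / 0.4-0.6 / 0.5 figures are simply the ones mentioned
in this thread.

```python
# Sketch only: these names are hypothetical, not SpamBayes' actual API.
MAX_CLUES = 150       # at most this many tokens are used as evidence
MIN_STRENGTH = 0.1    # tokens scoring between 0.4 and 0.6 are ignored
UNKNOWN_PROB = 0.5    # score assumed for tokens never seen in training


def select_clues(tokens, trained_probs):
    """Return up to MAX_CLUES (prob, token) pairs, strongest first."""
    clues = []
    for token in set(tokens):  # each distinct token counts once
        prob = trained_probs.get(token, UNKNOWN_PROB)
        strength = abs(prob - 0.5)  # distance from "no opinion"
        if strength >= MIN_STRENGTH:
            clues.append((strength, prob, token))
    clues.sort(reverse=True)  # strongest evidence first
    return [(prob, token) for _, prob, token in clues[:MAX_CLUES]]
```

With this sketch, a message made up mostly of untrained words yields very
few clues, because every unknown token scores 0.5 and is dropped before the
150 limit ever comes into play.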

This can cause problems - for example messages that just contain a URL don't
have a lot of tokens.  Usually the tokens from the headers (and maybe the
URL itself) are enough to make a difference, but not always (one thing that
can be done - SpamBayes has this as an experimental option, partly because
it's a bit controversial - is to get more tokens from whatever's at the end
of the URL).  If you do find that you get a reasonable number of
misclassified messages, and that there are relatively few tokens in them,
then the solution might be to generate more - either by turning on some
options that are off by default, or with some new tokenizing tricks.

