[Spambayes] Client/server model                    akjha

Fri Oct 18 01:59:32 2002

[Neale Pickett]
> Yeah but I don't think anybody's done any tests to see if classifying on
> headers alone still gets good results.

A while back I reported on an experiment that looked only at Subject lines:
no other headers, and nothing in the body.  It did very heavy tokenization
of subject lines (word unigrams, and word bigrams, and folding case, and
preserving case, and splitting on whitespace, and sucking out alphanumeric
runs, and tokenizing runs of pure punctuation).  Using the default
combining, the bottom line was

-> best cutoff for all runs: 0.575
->     with weighted total 10*65 fp + 486 fn = 1136
->     fp rate 0.325%  fn rate 3.47%

That's much worse than we do by taking the body into account too, but in
absolute terms it's not too shabby!
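
For anyone who wants to play with the idea, here is a minimal sketch of that kind of deliberately redundant Subject-line tokenization. It is not the code from the experiment: the function name, the regular expressions, and the token prefixes ("word:", "lower:", and so on) are all made up for illustration, and the bigrams here are built from the case-preserved words, which is an assumption.

```python
import re

# Hypothetical patterns, chosen for illustration only.
ALNUM_RUN = re.compile(r"[A-Za-z0-9]+")      # runs of letters and digits
PUNCT_RUN = re.compile(r"[^A-Za-z0-9\s]+")   # runs of pure punctuation

def subject_tokens(subject):
    """Yield many overlapping tokens for one Subject line."""
    words = subject.split()                  # split on whitespace
    for w in words:
        yield "word:" + w                    # word unigram, case preserved
        yield "lower:" + w.lower()           # word unigram, case folded
    for a, b in zip(words, words[1:]):
        yield "bigram:" + a + " " + b        # adjacent word pairs
    for m in ALNUM_RUN.finditer(subject):
        yield "alnum:" + m.group()           # alphanumeric runs
    for m in PUNCT_RUN.finditer(subject):
        yield "punct:" + m.group()           # punctuation runs
```

Calling list(subject_tokens("Make MONEY fast!!!")) yields the same words several ways at once, which is the point: the classifier gets to decide which views of the Subject line carry evidence.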

Staring at the results caused me to add the least likely part of that
gimmick to our regular tokenizer:  generating tokens for runs of pure
punctuation in Subject lines.  It's obvious in retrospect:  spam often has
over-the-top PUNCTUATION!!! $$$$$$, and the runs that delighted me the most
were long runs of blanks.  Those come from Subject lines that stuff a short
random string at the end of the line to fool dumb filters, separated from
the ***SCREAMING PART*** by a long run of blanks.  I added one to the
Subject line here for illustration <wink>.
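
A rough sketch of the punctuation-run idea follows. It is an illustration under stated assumptions, not necessarily what the regular tokenizer does: the function name, the "subject:" prefix, and the choice to treat punctuation runs and blank runs as separate tokens (ignoring single spaces between words) are all hypothetical.

```python
import re

# Either a run of punctuation (neither word characters nor whitespace),
# or a run of two or more whitespace characters.  Single spaces between
# words are deliberately skipped.  This pattern is an assumption.
RUN = re.compile(r"[^\w\s]+|\s{2,}")

def subject_run_tokens(subject):
    """Yield one token per punctuation run or long blank run."""
    for run in RUN.findall(subject):
        yield "subject:" + run

example = "***SCREAMING PART***      akjha"
tokens = list(subject_run_tokens(example))
```

On that example the tokens are the two "***" runs and the run of six blanks; the random trailing string contributes nothing here, but the blanks in front of it do.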