[Spambayes] Headers and Other Significant Message Parts

Sun Oct 20 22:12:07 2002

I did some more tests using AGMSBayesianSpam v1.58 for BeOS
(http://www.bebits.com/app/3055) to tokenize different parts
of mail messages, to see if headers were useful or if some
parts could be discarded.

Database:
341 training genuine (ham) messages, 406 training spam messages
(or 398 spam when parsing, due to a bug with messages that don't
have body text; this shouldn't influence the results too much).

40 test genuine messages, 40 test spam messages, all more recent
than the training ones.

Spam threshold is 0.56, Gary-combining method, simplistic
word tokenization.
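
For readers who haven't seen it, here is a rough Python sketch of
what "Gary-combining" with a threshold looks like.  It assumes the
geometric-mean form of Gary Robinson's rule; the exact variant and
the tokenizer inside AGMSBayesianSpam aren't spelled out in this
post, so tokenize(), gary_combine() and is_spam() are hypothetical
names, and treating scores strictly above the threshold as spam is
also an assumption.

```python
import math

SPAM_THRESHOLD = 0.56  # the threshold quoted in this post


def tokenize(text):
    """Assumed 'simplistic' tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def gary_combine(probs):
    """Combine per-word spam probabilities (geometric-mean form).

    Returns a score between 0 and 1; 0.5 means no evidence either way.
    """
    if not probs:
        return 0.5
    n = len(probs)
    # Clamp so log() never sees 0, and sum logs to avoid underflow.
    clamped = [min(max(p, 1e-6), 1.0 - 1e-6) for p in probs]
    spamminess = 1.0 - math.exp(sum(math.log(1.0 - p) for p in clamped) / n)
    hamminess = 1.0 - math.exp(sum(math.log(p) for p in clamped) / n)
    s = (spamminess - hamminess) / (spamminess + hamminess)
    return (1.0 + s) / 2.0


def is_spam(probs, threshold=SPAM_THRESHOLD):
    """Assumption: a message is spam when its score exceeds the threshold."""
    return gary_combine(probs) > threshold
```

The per-word probabilities would come from the training database;
how they are estimated is outside this sketch.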

Just headers:
Genuine .181352 to .557881, one false positive (a mailbox full announcement).  2.5% wrong.
Spam .450602 to .750511, 21 false negatives.  52.5% wrong.
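
The "% wrong" figures are just errors divided by the 40 test
messages of that kind (21 of 40 is 52.5%).  A small sketch of the
tally, assuming a score above the threshold means "spam"; tally()
is a hypothetical helper and the scores in the usage note are made
up for illustration, not taken from these runs.

```python
def tally(ham_scores, spam_scores, threshold=0.56):
    """Count misclassified messages and express them as percentages."""
    # Genuine mail scored above the threshold is a false positive.
    false_pos = sum(1 for s in ham_scores if s > threshold)
    # Spam scored at or below the threshold is a false negative.
    false_neg = sum(1 for s in spam_scores if s <= threshold)
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "ham_percent_wrong": 100.0 * false_pos / len(ham_scores),
        "spam_percent_wrong": 100.0 * false_neg / len(spam_scores),
    }
```

For example, tally([0.2, 0.6, 0.3, 0.4], [0.9, 0.5, 0.7, 0.95])
reports one false positive and one false negative, 25% wrong each.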

Whole raw message text (only quoted-printable decoding):
Genuine .163027 to .627022, 3 false positives.  7.5% wrong.
Spam .509355 to .993985, 1 false negative.  2.5% wrong.

Message parsed into parts (parsing decodes base64 and
quoted-printable, and for text converts the character
set to UTF-8), plus headers (includes MIME subheaders too):
Genuine .168857 to .609005, 4 false positives, 10% wrong.
Spam .614564 to .994364, 0 false negatives, 0% wrong.

Message parsed into parts of all kinds, no header data:
Genuine .220161 to .631161, 5 false positives, 12.5% wrong.
Spam .592501 to .994444, 0 false negatives, 0% wrong.

Only text/* parts and headers:
Genuine .162697 to .614136, 4 false positives, 10% wrong.
Spam .614973 to .994362, 0 false negatives, 0% wrong.

Just text/* parts, no headers:
Genuine .221923 to .635487, 6 false positives, 15% wrong.
Spam .594271 to .994441, 0 false negatives, 0% wrong.

Just text/plain parts (including body text) and headers:
Genuine .137869 to .583192, 3 false positives, 7.5% wrong.
Spam .448059 to .994119, 17 false negatives, 42.5% wrong.

Just text/plain parts, no headers.
150 spam and 1 genuine training message had no words.
Genuine .219169 to .696899, 9 false positives, 22.5% wrong.
Spam .660755 to .994116, 0 false negatives,
27 spam had no words (a good sign of spam).
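
To make the variants above concrete, here is a minimal sketch of
the part selection using Python's standard email package rather
than the BeOS code that produced these numbers.  header_text(),
text_parts() and select_text() are hypothetical names; unlike the
"plus headers" runs above, only the top-level headers are kept
here, not the MIME subheaders.

```python
from email import message_from_bytes, policy


def header_text(msg):
    """Top-level headers as 'Name: value' lines."""
    return "\n".join(f"{name}: {value}" for name, value in msg.items())


def text_parts(msg, plain_only=False):
    """Decoded text of every text/* leaf part (or only text/plain)."""
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_maintype() != "text":
            continue  # throw away binary attachments
        if plain_only and part.get_content_subtype() != "plain":
            continue
        # get_content() undoes base64/quoted-printable and the charset;
        # it can raise LookupError for a charset Python doesn't know.
        chunks.append(part.get_content())
    return "\n".join(chunks)


def select_text(raw, headers=True, plain_only=False):
    """Build the text that would be handed to the tokenizer."""
    msg = message_from_bytes(raw, policy=policy.default)
    pieces = []
    if headers:
        pieces.append(header_text(msg))
    pieces.append(text_parts(msg, plain_only))
    return "\n".join(pieces)
```

select_text(raw) corresponds to "text/* parts and headers",
select_text(raw, headers=False) to "text/* parts, no headers", and
plain_only=True gives the two text/plain variants.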

So, the headers are quite useful for identifying spam in general.
Using just the headers gives few false positives, which would make
that method suitable for deleting spam on the server (downloading
only the headers).  But it also gives many false negatives, so it
isn't that useful.  Harmless and half useless :-).

The winners are tokenizing the whole message as raw text, or using
all the text parts (throwing away binary attachments) and including
the headers too.  The advantage of the parts method is that the
database doesn't fill up with junk words from binary attachments.

- Alex
