[Spambayes] RE: Content-Type text/html charset

Fri Jan 2 18:54:34 EST 2004

[Bulgrien, Dennis]
> SpamBayes misses some e-mails with oriental font (charsets "euc-kr",
> "ks_c_5601-1987", "?ISO-8859-1?B?", etc).  The Outlook filter I used before

If you're using the Outlook flavor of SpamBayes, it should respond quickly
to training on these as spam.

> SpamBayes deleted all e-mails with the known oriental font
> (even if one said that my mother died I couldn't read it! ;-0) by
> searching the header for the charset name.

Ours is a "preponderance of evidence" scheme -- there's nothing we consider
as being determinative, on its own, of whether a message is ham or spam.  So
if someone did send you an email about your mother dying, and you've trained
on mother-dying emails as ham <wink>, SpamBayes would probably score a new
mother-dying email as ham even if it had spammish characteristics too.
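
To make "preponderance of evidence" concrete, here is a minimal sketch of
one way to combine per-clue spam probabilities so that no single clue
decides the outcome.  It uses Fisher's chi-squared combining; that this
matches the SpamBayes classifier in detail is an assumption, and the names
chi2q and combined_score are made up for illustration.

```python
import math

def chi2q(x2, v):
    """Probability that a chi-squared variate with v degrees of freedom
    (v must be even) is at least x2."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Guard against roundoff pushing the partial sum slightly above 1.
    return min(total, 1.0)

def combined_score(probs):
    """Combine per-clue spam probabilities (each strictly between 0 and 1)
    into one score in [0, 1]; higher means more spam-like."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0

# Five strongly hammy clues (say, words from trained "mother" emails) and
# one strongly spammy clue:  the hammy clues dominate.
mostly_ham = combined_score([0.01] * 5 + [0.99])
all_spam = combined_score([0.99] * 6)
```

Here mostly_ham ends up well below 0.5 despite the one spammy clue, while
all_spam ends up close to 1.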

> "Show spam clues for current message" after training on it seems
> to imply that certain parts of the header are preprocessed and
> ignored.

A great many are, yes.  There's no "sense" to it:  a great many strategies
were *tried*, and often different strategies were tried for different kinds
of header lines.  We kept the ones that did best in large-scale tests early
in the project.  So the only explanation you'll get for almost any specific
decision here is just "tests said that worked best across all the
alternatives tried".  For example, we preserve case in Subject headers,
but convert to lowercase everywhere else.
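
As a rough illustration of that last rule only, a header tokenizer could
treat Subject differently from everything else along these lines.  The
function name header_tokens and the "name:word" token format are
hypothetical, not taken from the SpamBayes tokenizer.

```python
def header_tokens(name, value):
    """Split a header value on whitespace and tag each word with the
    header name.  Case is preserved for Subject and folded to lowercase
    for every other header."""
    words = value.split()
    if name.lower() != "subject":
        words = [w.lower() for w in words]
    return ["%s:%s" % (name.lower(), w) for w in words]

subject_tokens = header_tokens("Subject", "FREE Offer")
from_tokens = header_tokens("From", "Bob@Example.COM")
```

With this sketch, "FREE" and "free" in a Subject line are distinct clues,
while the same distinction is thrown away in other headers.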

> ...
> If so I think it unfortunately ignores the very valuable charset
> token.  Message Tokens lists 'content-type:text/plain' but not the
> charset value that follows, charset="euc-kr" (see below).  If the
> charset value was a token it would ALWAYS be in spam and wouldn't
> it be a terrific scoring clue?

It would still just be one of many, but you're right:  testing showed that
charset identifiers are in fact good clues.  If you were using any version
of SpamBayes *other* than the Outlook addin, you'd be getting them, too.
Their absence in the Outlook-addin version of SpamBayes is a nasty technical
problem:  Outlook destroys the original MIME structure of incoming email,
but the email parser we use requires valid MIME structure.
The specific problems are described in detail in comments in the
GetEmailPackageObject() method in source file Outlook2000/msgstore.py (you
need to download the project's source code to see that, of course).

In a nutshell, as part of destroying the incoming MIME structure, Outlook
does its own character conversions, and if we left the charset identifiers
intact in the plain-text version of the email we actually score, the email
parser we use would try to do the specified conversions *again*, and either
blow up or deliver nonsense results.  The most expedient way to worm around
this was to strip Content-Transfer-Encoding and Content-Type headers from
the email we synthesize for scoring:  they specify the message's *original*
MIME structure, but the original MIME structure can't be gotten from
Outlook.  As a result, the synthesized message we actually score doesn't
contain the

    Content-Type: text/html; charset="euc-kr"

header line at all, and that outcome is unique to the Outlook addin.
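
The effect can be sketched with Python's standard email package.  This is
not the code in msgstore.py:  strip_mime_headers, charset_tokens and the
"charset:" token prefix are illustrative assumptions, and the sample
message is invented.

```python
from email import message_from_string

def charset_tokens(msg):
    """One token per declared charset.  get_charsets() returns None for
    any part that declares no charset, so those are skipped."""
    return ["charset:" + cs for cs in msg.get_charsets() if cs]

def strip_mime_headers(msg):
    """Remove the headers that describe the original MIME structure.
    Deleting a header that isn't present is a no-op."""
    for name in ("Content-Type", "Content-Transfer-Encoding"):
        del msg[name]
    return msg

raw = (
    'Content-Type: text/html; charset="euc-kr"\n'
    "Subject: hi\n"
    "\n"
    "<html></html>\n"
)

# With the original headers intact, the charset is available as a clue.
with_headers = charset_tokens(message_from_string(raw))

# After stripping, as for the synthesized Outlook message, it is gone.
stripped = strip_mime_headers(message_from_string(raw))
without_headers = charset_tokens(stripped)
```

In the first case the charset clue is produced; in the second the list is
empty, which is the situation the Outlook addin is in.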