[spambayes-dev] RE: [Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Wed Dec 24 02:33:21 EST 2003


[Tony Meyer]
> The export.py script does a reasonable job of putting everything back
> together again [from Outlook].

Thanks, Tony!  I'm mortified to admit I had forgotten where this script
lived.

> Actually, I believe it does the exact same job as when getting a
> message to pass to tokenizer for general use.

In particular, exactly the same as when scoring a message, or training on
one.  The MIME armor (if any) is gone, all non-text/* attachments (at least)
are gone, and if the original headers contained Content-Type or
Content-Transfer-Encoding specs, they're gone too.  If it was
multipart/alternative with text/plain and text/html sections, they're both
slammed into the body, without indication of where one ends and the other
begins.
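
To make that flattening concrete, here is a rough sketch using only the
standard library's email package.  It is not the addin's code (the addin
works from Outlook's message store, not from MIME text), and the function
name flatten_like_export is made up for illustration.  It just mimics the
visible effect described above: drop the MIME headers, drop non-text parts,
and run every text/* part together into one body with no separator.

```python
from email import message_from_string

# Headers the flattened form no longer carries (per the description above).
DROPPED_HEADERS = {"content-type", "content-transfer-encoding"}


def flatten_like_export(raw):
    """Approximate the flattened view of a message: top-level headers
    minus the MIME ones, then all text/* parts concatenated."""
    msg = message_from_string(raw)

    bodies = []
    for part in msg.walk():
        # Containers have no payload of their own; non-text parts are dropped.
        if part.is_multipart() or part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / QP armor
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back to a codec that never fails.
            text = payload.decode("latin-1", errors="replace")
        bodies.append(text)

    headers = [
        "%s: %s" % (name, value)
        for name, value in msg.items()
        if name.lower() not in DROPPED_HEADERS
    ]
    # No marker between parts: text/plain and text/html simply run together.
    return "\n".join(headers) + "\n\n" + "\n".join(bodies)
```

Feed it a multipart/alternative message and the result has both the plain
and HTML text in one body, with no boundary lines and no Content-* headers,
which is roughly what the tokenizer ends up seeing for Outlook mail.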

But that's the way we score Outlook email, and it's darned hard to do
better.  Outlook's message store is a complicated beast, and predates
current email standards; they tacked MIME email on top of a sprawling store
that didn't know anything about MIME, spraying bits and pieces all over the
place.  Pretty cool <wink>.

> So although popping a proxy in between Outlook and the POP3 server to
> catch raw messages would certainly be more pure and correct
> (sb_server can do this, BTW, just set the cache expiry limit *really*
> high and don't bother classifying any messages), for practical
> purposes using the data that Outlook gives is just as useful.

For anyone using spambayes via the Outlook addin, it's *better* to use
export.py than to capture the incoming email bytestream.  SpamBayes can't
reconstruct the original bytestream from Outlook (not out of laziness, it's
simply impossible), so how the classifier would do if it *could* see the
original bytestream is irrelevant to real-life Outlook use.

The exported form is close enough to the original that I doubt the
difference matters much.



