[spambayes-dev] Another incremental training idea...
tim.one at comcast.net
Wed Jan 14 22:13:45 EST 2004
> A generalization might be to score each attachment (or possibly just
> each message/rfc822 type attachment) separately. Then choose an
> algorithm for combining the scores, e.g. outer-only, inner-only,
> combined, etc.
That should simplify things <wink>.
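
For concreteness, the quoted proposal (score each message/rfc822 attachment separately, then pick a combining rule) might look roughly like the sketch below. It is not SpamBayes code: split_parts, combine, score_message, and toy_score are all hypothetical names, the "classifier" is a one-line stand-in, and taking the maximum for the "combined" rule is just one assumed choice among many.

```python
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def split_parts(msg):
    """Return (outer_texts, inner_messages) without descending into wrapped messages."""
    outer_texts, inner = [], []

    def visit(part):
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list.
            inner.append(part.get_payload(0))
        elif part.is_multipart():
            for sub in part.get_payload():
                visit(sub)
        elif part.get_content_maintype() == "text":
            outer_texts.append(part.get_payload())

    visit(msg)
    return outer_texts, inner


def combine(outer_score, inner_scores, how):
    """Combine per-part scores; using max() for 'combined' is an assumption."""
    if how == "outer-only" or not inner_scores:
        return outer_score
    if how == "inner-only":
        return max(inner_scores)
    if how == "combined":
        return max([outer_score] + inner_scores)
    raise ValueError("unknown combining rule: %r" % how)


def toy_score(text):
    # Stand-in for a real classifier; NOT a SpamBayes API.
    return 0.99 if "pills" in text else 0.01


def score_message(msg, how):
    outer_texts, inner = split_parts(msg)
    outer_score = toy_score("\n".join(outer_texts))
    inner_scores = [toy_score("\n".join(split_parts(m)[0])) for m in inner]
    return combine(outer_score, inner_scores, how)


# A Mailman-style wrapper: innocuous outer text, spam attached as message/rfc822.
wrapped = MIMEText("Buy cheap pills now")
outer = MIMEMultipart()
outer.attach(MIMEText("A post is being held for approval."))
outer.attach(MIMEMessage(wrapped))

for rule in ("outer-only", "inner-only", "combined"):
    print(rule, score_message(outer, rule))
```

With this toy scorer the wrapper looks like ham under outer-only and like spam under the other two rules, which is the distinction the proposal is after.
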
Or you could upgrade to Outlook: I don't think we have any real idea which
attachments we do and don't get back from Outlook when we synthesize a
plain-text message for your picky email parser to chew on ("standards" --
what a stupid idea that was <wink>), but I know for a fact that we *don't*
get the body of messages attached to things I get from Mailman in my
capacity as list admin. So I routinely train on Mailman-wrapped spam and
ham, meaning that I've trained on a grand total of about two of them, and
all wrapped msgs from Mailman have scored 0% for me thereafter.
Something to note: my personal classifier is using the experimental bigrams
gimmick, and bigram Mailmanisms like
    List: PSF-Board at python.org
act like strong lexical fingerprints for Mailman-generated administrivia,
never appearing in ham or spam other than the Mailman stuff. This is one
clear way in which bigrams can generate a killer-strong collection of
hapaxes sufficient to nail an entire large class of messages from just one
Of course, that also sets me up for a spectacularly bad false negative
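
A rough illustration of the bigram point above, under loud assumptions: this is a whitespace-splitting toy, not the SpamBayes tokenizer, and the "bi:" prefix, the function name, and the second sample message are invented for the example. It shows how a word pair can be unique to one trained message even when its component words are not.

```python
from collections import Counter


def unigrams_and_bigrams(text):
    """Return whitespace-split words plus adjacent-word pairs (illustrative only)."""
    words = text.split()
    pairs = ["bi:%s %s" % (a, b) for a, b in zip(words, words[1:])]
    return words + pairs


# Two hypothetical training messages: one Mailman header line, one ordinary note.
messages = [
    "List: PSF-Board at python.org",
    "Meet me at the board meeting",
]

# Count how many trained messages each token appears in.
seen_in = Counter()
for text in messages:
    seen_in.update(set(unigrams_and_bigrams(text)))

# Hapaxes here are tokens seen in exactly one trained message.
hapaxes = sorted(tok for tok, n in seen_in.items() if n == 1)
print(hapaxes)
```

The unigram "at" shows up in both messages, while the bigram "bi:PSF-Board at" appears only in the Mailman line, so it can serve as a fingerprint after a single training example.
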