[spambayes-dev] default to mine_received_headers=True, "may be forged"

Skip Montanaro skip at pobox.com
Mon Dec 22 15:29:58 EST 2003


    Tim> A generalization of this gimmick finds several potentially
    Tim> interesting Received comments in my current little training
    Tim> database:

    ...

Interesting scheme.  When I tried it I got swamped by '(qmail NNN ...'
comments, where NNN appears to be a process id.  To keep the scheme in its
current form I suspect we'd have to either filter out such features
explicitly or implement hapax expiration.

    Tim> Note that one of the "may be forged" comments there was split
    Tim> across lines ('(may\n\tbe forged)').

Perhaps we should add

    header = re.sub(r'\s+', ' ', header)

to the "for header ..." loop in any case?  It seems that many other headers
get split that way.  If we're looking for features which include whitespace
we should probably normalize it.
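
To illustrate what that substitution does to a folded header, here is a
small self-contained sketch.  The function name and the sample header are
made up for the example; only the re.sub() call comes from the suggestion
above:

```python
import re

def normalize_header_whitespace(header):
    # Collapse each run of whitespace, including the newline + tab left by
    # header folding, into a single space.
    return re.sub(r'\s+', ' ', header)

# Made-up Received fragment with the comment split across lines.
folded = "from a.example (b.example [10.0.0.1] (may\n\tbe forged))"
normalized = normalize_header_whitespace(folded)
```

After normalization the string contains the literal '(may be forged)', so a
feature that includes the spaces matches whether or not the header was
folded at that point.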

I'm willing to tuck the more general Received sifting into the tokenizer,
controlled by a new experimental option.  Let me know if you want me to take
that step.

Skip


