[spambayes-dev] default to mine_received_headers=True, "may be forged"
Skip Montanaro
skip at pobox.com
Mon Dec 22 15:29:58 EST 2003
Tim> A generalization of this gimmick finds several potentially
Tim> interesting Received comments in my current little training
Tim> database:
...
Interesting scheme. When I tried that I got swamped by '(qmail NNN ...'
stuff, where it appears that NNN is a process id. To retain this in its
current form I suspect we'd have to either specifically eliminate such
features or implement hapax expiration.
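A minimal sketch of the first option (specifically eliminating such features), assuming the Received comments have already been pulled out as plain strings and that the noisy ones all begin with '(qmail ' followed by a number. The names QMAIL_PID_COMMENT and drop_qmail_pid_comments are hypothetical and are not part of the existing tokenizer:

```python
import re

# Hypothetical pattern: a comment that opens with "(qmail <digits>".
# The number looks like a process id, so it differs on nearly every message.
QMAIL_PID_COMMENT = re.compile(r'\(qmail \d+\b')

def drop_qmail_pid_comments(comments):
    """Return the comments that do not start with '(qmail NNN'."""
    return [c for c in comments if not QMAIL_PID_COMMENT.match(c)]

if __name__ == "__main__":
    sample = [
        "(qmail 12345 invoked from network)",
        "(may be forged)",
    ]
    print(drop_qmail_pid_comments(sample))
```

An alternative under the same assumptions would be to replace the number with a fixed placeholder instead of dropping the comment, so the "qmail" information survives as a single feature.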
Tim> Note that one of the "may be forged" comments there was split
Tim> across lines ('(may\n\tbe forged)').
Perhaps we should add
    header = re.sub(r'\s+', ' ', header)
to the "for header ..." loop in any case? It seems that many other headers
get split that way. If we're looking for features which include whitespace
we should probably normalize it.
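A small standalone illustration of what that substitution does to a folded header. The function name normalize_whitespace and the sample header text are made up for the example; only the re.sub call itself comes from the suggestion above:

```python
import re

def normalize_whitespace(header):
    """Collapse each run of whitespace, including the newline and tab
    left by header folding, into a single space."""
    return re.sub(r'\s+', ' ', header)

if __name__ == "__main__":
    # Hypothetical folded Received header fragment.
    folded = "from host.example.invalid (may\n\tbe forged) by mx.example.invalid"
    print(normalize_whitespace(folded))
```

With this in place, '(may\n\tbe forged)' and '(may be forged)' produce the same text, so a feature that spans the fold is not split into two rare variants.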
I'm willing to tuck the more general received sifting into the tokenizer
controlled by a new experimental option. Let me know if you want me to take
that step.
Skip