[spambayes-dev] default to mine_received_headers=True, "may be forged"
Skip Montanaro
skip at pobox.com
Mon Dec 22 16:58:59 EST 2003
Tim> Changing the regexp to use [a-z] instead of \w would weed out all
Tim> that stuff.
I'll give that a try. Thanks.
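
The regexp being discussed is not quoted in this message, so the following is only a sketch of the kind of difference Tim describes. The pattern shape (a parenthesized run of whitespace-separated words) and the sample Received text are assumptions made up for illustration. With \w, parenthesized comments containing digits or mixed-case ids are picked up; with [a-z] only all-lowercase phrases survive.

```python
import re

# Hypothetical patterns: a parenthesized phrase of two or more words.
# These are illustrative only, not the regexp from the tokenizer.
loose_re = re.compile(r'\(\w+(?:\s+\w+)+\)')
strict_re = re.compile(r'\([a-z]+(?:\s+[a-z]+)+\)')

# Made-up Received header text containing one noisy parenthesized comment.
sample = ("from a.example (may be forged) by b.example "
          "(id 1AYabc 000123) (untrusted sender)")

loose_hits = loose_re.findall(sample)    # includes the "(id ...)" comment
strict_hits = strict_re.findall(sample)  # only the lowercase phrases
```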
>> Perhaps we should add
>>
>> header = re.sub(r'\s+', ' ', header)
>>
>> to the "for header ..." loop in any case?
Tim> There are many "for header" loops, and I'm not sure which one(s)
Tim> you're talking about here. If you want to do this somewhere,
Tim> header = ' '.join(header.split())
Tim> is faster.
Okay. I was just referring to the loop over the Received headers in the
section of code we've been messing with.
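
For reference, here is a small sketch comparing the two whitespace-collapsing idioms mentioned above. The function names and the sample header are made up for illustration. One difference worth noting: the split/join form also strips leading and trailing whitespace, while the re.sub form leaves a single space there.

```python
import re

def collapse_with_regex(header):
    # Replace every run of whitespace with a single space.
    return re.sub(r'\s+', ' ', header)

def collapse_with_split(header):
    # split() with no argument splits on runs of whitespace and
    # drops empty leading/trailing fields, so the result is also stripped.
    return ' '.join(header.split())

# Made-up folded Received header value (continuation line starts with a tab).
sample = ("from mail.example.com (may be forged)\n"
          "\tby mx.example.net;  Mon, 22 Dec 2003")

same_on_sample = collapse_with_regex(sample) == collapse_with_split(sample)
```

On a value with no leading or trailing whitespace, such as the sample above, both give the same result.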
>> I'm willing to tuck the more general received sifting into the
>> tokenizer controlled by a new experimental option. Let me know if
>> you want me to take that step.
Tim> No, I don't want another experimental option just for this. It
Tim> seems clear enough already that "may be forged" is potentially
Tim> interesting, and also that "may be forged" isn't the only
Tim> potentially interesting string. We should suck up a bunch of them,
Tim> or none of them. The classifier will learn which are and aren't
Tim> useful, and it sure looks like that will vary depending on user
Tim> (that one of my ISPs is Comcast and one of yours isn't is not a
Tim> good reason to poo-poo the clues Comcast leaves behind <wink>).
Okay, I'll leave "(may be forged)" in and add Comcast's "(untrusted
sender)". I posted a note to comp.mail.misc asking for equivalents to "(may
be forged)" for other MTAs. I'll see if anything interesting turns up which
warrants investigation.
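
A minimal sketch of what looking for these two markers in Received headers might look like. This is not the actual tokenizer code: the function name, the "received:" token prefix, and the sample message (including how the Comcast marker is placed in the header) are all assumptions for illustration.

```python
import email

# The two marker strings mentioned in this thread.
MARKERS = ("may be forged", "untrusted sender")

def received_marker_tokens(msg):
    """Yield one hypothetical token per marker found in each Received header."""
    for header in msg.get_all("received", []):
        # Collapse folded whitespace so a marker split across lines still matches.
        header = ' '.join(header.split()).lower()
        for marker in MARKERS:
            if "(%s)" % marker in header:
                yield "received:" + marker

# Made-up message for illustration only.
raw = (
    "Received: from foo.example.com (bar.example.org [192.0.2.1]\n"
    "\t(may be forged)) by mx.example.net; Mon, 22 Dec 2003 16:58:59 -0500\n"
    "Received: from baz.example.com (untrusted sender)\n"
    "\tby mx2.example.net; Mon, 22 Dec 2003 16:50:00 -0500\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
tokens = list(received_marker_tokens(email.message_from_string(raw)))
```

Adding another MTA's equivalent string would then just mean extending MARKERS, leaving it to the classifier to decide which tokens are useful.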
Skip