[Spambayes] A couple of small tokenizer experiments.
Tim Peters
tim.one@comcast.net
Tue Nov 12 02:06:05 2002
Quickie:
>> In personal email "subjectcharset:unknown" shows up a lot for some
>> reason (but only in spam).
> Hm. Dunno about that - Barry might know under what circumstances
> email package gives 'unknown' as a charset. I can't see how that
> could happen.
Easy <wink>: it's my personal email, and the string UNKNOWN is what
*Outlook* delivers. I think it actually says UNKNOWN as it came in off the
wire!
I get my share of
Subject: =?Big5?B?pc7BecHIpGq/+g==?=
thingies, but I also get monsters like this:
Subject: =?UNKNOWN?Q?=1B$B!z%-%c%s%Z!=3C%s=3CB=3B=5CCf!*!*=1B=28
B1=1B$B%/?==?UNKNOWN?Q?%j%C%/!w=1B=28B15=1B$B1=5F!A=1B=28
B25=1B$B1=5F!z=1B?==?UNKNOWN?Q?=28B?=
That one came in to webmaster@python.org on Friday. Perhaps they've learned
that Greg will reject a msg just for using an unloved charset, but I doubt
it.
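For anyone who wants to poke at this, here's a minimal sketch of how a token like 'subjectcharset:unknown' could be pulled out of a Subject line. It is not the actual Spambayes tokenizer code: the function name subject_charset_tokens and the regexp are made up for illustration, and it only looks at the charset declared in each RFC 2047 encoded-word without decoding the payload.

```python
import re

# Matches the "=?charset?encoding?" prefix of an RFC 2047 encoded-word.
# Hypothetical helper pattern, not taken from the Spambayes source.
_ENCODED_WORD_PREFIX = re.compile(r"=\?([^?\s]+)\?[bBqQ]\?")

def subject_charset_tokens(subject):
    """Return one 'subjectcharset:<name>' token per distinct declared charset."""
    # Lowercase the charset so 'UNKNOWN' and 'unknown' yield the same token.
    charsets = {m.group(1).lower() for m in _ENCODED_WORD_PREFIX.finditer(subject)}
    return sorted("subjectcharset:" + c for c in charsets)

print(subject_charset_tokens("=?Big5?B?pc7BecHIpGq/+g==?="))
# ['subjectcharset:big5']
print(subject_charset_tokens("=?UNKNOWN?Q?=28B?="))
# ['subjectcharset:unknown']
```

Collapsing repeats into a set means a Subject split across several UNKNOWN encoded-words still produces the token only once.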
In fact, I see that 'subjectcharset:unknown' is now the single strongest
spam word in my entire mistake-driven (and tiny) training corpus:
'subjectcharset:unknown' 0.934783