[Spambayes] A couple of small tokenizer experiments.

Tim Peters tim.one@comcast.net
Tue Nov 12 02:06:05 2002


Quickie:


>> In personal email "subjectcharset:unknown" shows up a lot for some
>> reason (but only in spam).

> Hm. Dunno about that - Barry might know under what circumstances
> email package gives 'unknown' as a charset. I can't see how that
> could happen.

Easy <wink>:  it's my personal email, and the string UNKNOWN is what
*Outlook* delivers.  I think it actually says UNKNOWN as it came in off the
wire!

I get my share of

    Subject: =?Big5?B?pc7BecHIpGq/+g==?=

thingies, but I also get monsters like this one:

Subject: =?UNKNOWN?Q?=1B$B!z%-%c%s%Z!=3C%s=3CB=3B=5CCf!*!*=1B=28
        B1=1B$B%/?==?UNKNOWN?Q?%j%C%/!w=1B=28B15=1B$B1=5F!A=1B=28
        B25=1B$B1=5F!z=1B?==?UNKNOWN?Q?=28B?=

That one came in to webmaster@python.org on Friday.  Perhaps they've learned
that Greg will reject a msg just for using an unloved charset, but I doubt
it.
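
For anyone wondering how a header like that turns into a token, here is a minimal sketch. It assumes the charset is read off each RFC 2047 encoded word with the standard library's email.header.decode_header; the helper name subject_charset_tokens is made up for illustration and is not necessarily what the Spambayes tokenizer does.

```python
from email.header import decode_header


def subject_charset_tokens(subject):
    """Yield a 'subjectcharset:<name>' token for each charset-tagged chunk.

    Hypothetical helper for illustration only; not the actual tokenizer code.
    """
    for _chunk, charset in decode_header(subject):
        # decode_header reports charset=None for plain, unencoded text,
        # so ordinary ASCII subjects produce no token at all.
        if charset is not None:
            # Lowercase so 'UNKNOWN' and 'unknown' give the same token.
            yield 'subjectcharset:' + charset.lower()


if __name__ == '__main__':
    # The last encoded word from the header quoted above.
    print(list(subject_charset_tokens('=?UNKNOWN?Q?=28B?=')))
```

The charset name is taken straight from the header text and never looked up in a codec table, so a made-up name like UNKNOWN passes through as-is.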

In fact, I see that 'subjectcharset:unknown' is now the single strongest
spam word in my entire mistake-driven (and tiny) training corpus:

'subjectcharset:unknown'       0.934783



