[Spambayes] Foreign language spam: bug or feature?
Tim Peters
tim.one@comcast.net
Thu Oct 24 03:06:28 2002
There's an interesting bug in the Outlook 2000 client that's absolutely
nailing all the Asian spam I get, along with several other non-Asian
languages. "The bug" is this, in Outlook2000/manager.py's
GetBayesStreamForMessage():
    body += message.Text.encode("ascii", "replace")
Outlook uses Unicode internally. message.Text grabs the message body from
Outlook as a Unicode string. .encode(...) is then plain Python, telling it
to encode the Unicode string as a regular string, using the ascii encoding,
and replacing Unicode characters that can't be represented faithfully in
ascii by "a suitable replacement character". For the ascii encoding, that
almost always turns out to be a question mark character, because there's
almost always nothing in ascii that's truly suitable.
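To see the replacement behavior in isolation, here is a minimal sketch in modern Python 3 (the Outlook client code above is Python 2, where the result is a plain str rather than bytes). The helper name and the sample text are made up for illustration; the Hangul code points are arbitrary and are not taken from any real message.

```python
def to_ascii_with_replacement(text):
    """Encode text as ASCII, substituting '?' for anything ASCII can't represent.

    Hypothetical helper for illustration only; it mirrors the effect of
    message.Text.encode("ascii", "replace") on a Unicode message body.
    """
    # In Python 3, encode() returns bytes, so decode back to str for display.
    return text.encode("ascii", "replace").decode("ascii")


# Arbitrary Hangul code points mixed with ASCII digits and punctuation.
sample = "15\uc6d0 \ud2b9\uac00 e-mail\ud658\uc601!!!"
print(to_ascii_with_replacement(sample))
```

Each character outside ASCII comes back as a single question mark, while digits, Latin letters and punctuation pass through untouched.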
While this may suck from a purity view, it leads to spam-clue listings like
this (from a typical Asian spam):
Spam Score: 1
'*H*' 0
'*S*' 1
'header:Return-Path:1' 0.611133
'header:Message-ID:1' 0.813889
'15????' 0.844828
'24????' 0.844828
'7??????' 0.844828
'&' 0.863317
'header:Mime-Version:1' 0.89556
'header:Reply-To:1' 0.90756
'10????' 0.934783
'??????!!!' 0.934783
'header:Received:2' 0.957828
'??????????)' 0.958716
'??????...' 0.965116
'????????...' 0.965116
'message-id:@cpimssmtpa05.msn.com' 0.969799
'from:email addr:korea.com>' 0.980349
'(????' 0.981928
'??.' 0.985437
'e-mail??????' 0.986322
'????,' 0.99505
'????????,' 0.995258
'??????,' 0.99545
'????????.' 0.997691
'??????????.' 0.99776
'skip:? 20' 0.998034
'????????????' 0.998192
'??????????' 0.998474
'??????' 0.998562
'????' 0.998598
'????????' 0.998672
'skip:? 10' 0.998894
That is, languages having scant intersection with ASCII end up getting
tokenized as collections of mostly question marks, and each instance of
"?"*n ends up earning a high spamprob. The database burden is trivial,
since there just aren't many *possible* strings consisting of nearly pure
question marks, and the "skip" gimmick kicks in when a contiguous string of
question marks gets long.
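A rough sketch of why the token space stays small, assuming a simplified whitespace tokenizer: words up to some maximum length are kept as-is, and longer ones collapse into a "skip" token that records only the first character and the length rounded down to a multiple of 10. The 12-character cutoff and the exact skip format are inferred from the clue listing above, not copied from the real tokenizer, and the function name is hypothetical.

```python
MAX_WORD_LEN = 12  # assumed cutoff; the listing shows a 12-'?' token kept whole


def tokenize_words(text):
    """Yield simplified word tokens for a message body (illustrative only)."""
    for word in text.split():
        n = len(word)
        if n <= MAX_WORD_LEN:
            # Short enough: the word itself is the token.
            yield word
        else:
            # Too long: keep just the first character and a coarse length.
            yield "skip:%c %d" % (word[0], n // 10 * 10)


body = "???? " + "?" * 25 + " " + "?" * 13
print(list(tokenize_words(body)))
```

Under these assumptions a run of 25 question marks and a run of 13 both land in one of a handful of skip buckets, so long runs cost almost nothing in the database no matter how many distinct lengths show up.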
Of course lots of '?'*n thingies in a msg are highly correlated, which in
*my* personal email is helpful: spam or not, anything sent to me in a
language having small intersection with ASCII may as well be spam -- there's
no chance *I* can read it regardless.
If somebody would like to formalize this bug as a tokenizer option, so that
non-Outlook American-English users can enjoy its benefits too, I won't
object. For International Sensitivity reasons, we may have to put it in a
[Dont Ask Dont Tell] .ini section <wink>.