[Spambayes] uncaught exception on unicode parsing errors lets spam through

Frank Stajano fms27 at cam.ac.uk
Sat Sep 20 00:11:02 EDT 2003


This is a relatively recent problem: I now see frequent cases (at least 
one every couple of days) of spam messages that were not classified as spam 
because Spambayes choked on them at the parsing stage and added a header 
such as the one below.

X-Spambayes-Exception: exceptions.UnicodeDecodeError('ascii' codec can't
	decode byte 0x92 in position 7: ordinal not in range(128)) in
	append() at C:\apps\python23\lib\email\Header.py line 272:
	ustr = unicode(s, incodec, errors)

Clearly this parsing stage needs to be more robust; otherwise spammers will 
send out malformed messages just to evade Spambayes's classification.
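
To illustrate what "more robust" could mean here, below is a minimal sketch 
in present-day Python of decoding header bytes without ever raising. It is 
not the Spambayes code or the email package's code: the function name 
decode_header_bytes and the fallback codec order are hypothetical, and the 
sample bytes assume that the "?" in the attached Subject line stands for 
the raw byte 0x92 named in the exception.

```python
def decode_header_bytes(raw, fallbacks=("utf-8", "cp1252")):
    """Decode raw header bytes without ever raising.

    Returns (text, clean). clean is False when the bytes were not plain
    ASCII and a fallback codec had to be used.
    """
    try:
        return raw.decode("ascii"), True
    except UnicodeDecodeError:
        pass
    # Hypothetical guess order, chosen for illustration only.
    for codec in fallbacks:
        try:
            return raw.decode(codec), False
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1"), False


# Assumed reconstruction of the attached Subject: 0x92 sits at position 5.
subject = b"Re: I\x92ll tell you why!"
text, clean = decode_header_bytes(subject)
print(repr(text), clean)
```

The point of returning the clean flag alongside the text is that the caller 
can carry on classifying the message while still knowing that the header 
was malformed.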

I include a sample message as attachment. I have more if needed.

Note that you can't just say that all messages with Unicode errors are spam 
(though experimentally so far the majority of them have been). In my 
academic position I also get plenty of PhD applications from East Asian 
applicants, and I have sometimes seen the Chinese characters in the From 
header triggering the same kind of message. I'm guessing that's because 
they were sent as raw 8-bit text in some other encoding, which the 'ascii' 
codec in the traceback cannot decode.
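
One way to act on the point above is to treat a decoding failure as one 
more clue for the classifier to weigh, rather than as a verdict or a reason 
to skip the message. The sketch below is an assumption-laden illustration 
in present-day Python, not the Spambayes tokenizer: header_tokens and the 
"undecodable-8bit" token name are hypothetical.

```python
def header_tokens(name, raw):
    """Yield classifier tokens for one header; never raises on bad bytes."""
    prefix = name.lower()
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        # Record the failure as a clue instead of aborting classification.
        yield "%s:undecodable-8bit" % prefix
        # Undecodable bytes become U+FFFD so the remaining words survive.
        text = raw.decode("ascii", "replace")
    for word in text.lower().split():
        yield "%s:%s" % (prefix, word)


spam_like = list(header_tokens("Subject", b"Re: I\x92ll tell you why!"))
clean = list(header_tokens("Subject", b"PhD application"))
print(spam_like)
print(clean)
```

With this shape, training decides how much the extra token counts, so a 
legitimate message with 8-bit characters in the From header is not 
condemned by that token alone.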
-------------- next part --------------
Return-path: <alisha_stevensus at mailcity.com>
Envelope-to: fms27 at hermes.cam.ac.uk
Delivery-date: Fri, 19 Sep 2003 19:05:34 +0100
Received: from brown.csi.cam.ac.uk ([131.111.8.14])
	by orange.csi.cam.ac.uk with esmtp (Exim 4.12)
	id 1A0PdO-00011G-00
	for fms27 at hermes.cam.ac.uk; Fri, 19 Sep 2003 19:05:34 +0100
Received: from mta2.cl.cam.ac.uk ([128.232.0.14] helo=whittlesey.cl.cam.ac.uk)
	by brown.csi.cam.ac.uk with esmtp (Exim 4.20)
	id 1A0PdN-00010W-L5
	for fms27 at cam.ac.uk; Fri, 19 Sep 2003 19:05:33 +0100
Received: from rose.csi.cam.ac.uk ([131.111.8.13])
	by whittlesey.cl.cam.ac.uk with esmtp (Exim 3.092 #1)
	id 1A0Pct-0007Qv-00
	for Frank.Stajano at cl.cam.ac.uk; Fri, 19 Sep 2003 19:05:03 +0100
Received: from [128.252.188.12] (helo=yahoo.com)
	by rose.csi.cam.ac.uk with esmtp (Exim 4.20)
	id 1A0Pco-0000Fa-5h
	for fms27 at cl.cam.ac.uk; Fri, 19 Sep 2003 19:04:58 +0100
Message-ID: <1063994698.0969 at mailcity.com>
From: "Alisha Stevens" <alisha_stevensus at mailcity.com>
To: Frank.Stajano at cl.cam.ac.uk
Subject: Re: I?ll tell you why!
Date: Fri, 19 Sep 2003 18:04:58 +0000
MIME-Version: 1.0
X-Mailer: Pegasus Mail for Win32 (v3.12a)
Content-Type: text/html
Content-Transfer-Encoding: 8bit
X-Cam-AntiVirus: No virus found
X-Cam-SpamDetails: scanned, SpamAssassin (score=6.4, HTML_80_90 0.54,
	HTML_FONT_BIG 0.27, HTML_FONT_COLOR_BLUE 0.10,
	HTML_FONT_COLOR_RED 0.10, HTML_MESSAGE 0.10, MIME_HTML_ONLY 0.10,
	OBFUSCATING_COMMENT 2.60, RAZOR2_CHECK 2.06, SEE_FOR_YOURSELF 0.48)
X-Cam-SpamScore: ssssss
X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/
X-Cam-AntiVirus: No virus found
X-Cam-SpamDetails: Not scanned
Status:  O
X-Spambayes-Exception: exceptions.UnicodeDecodeError('ascii' codec can't
	decode byte 0x92 in position 5: ordinal not in range(128)) in
	append() at C:\apps\python23\lib\email\Header.py line 272:
	ustr = unicode(s, incodec, errors)


Wholesale Prescription Medications
Our doctors will write you 
a prescription for free!
Buy Your Prescription Meds Online 
See For Yourself...
Check It Out Here


Stop Receiving the offers 
-------------- next part --------------

   Frank (filologo disneyano) http://www-lce.eng.cam.ac.uk/~fms27/

