[spambayes-bugs] [ spambayes-Bugs-1600821 ] Classifier UnicodeDecodeError on wrong transfer encoding

Fri Dec 5 10:35:59 CET 2008

Bugs item #1600821, was opened at 2006-11-22 00:59
Message generated for change (Comment added) made by gelato
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1600821&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: imapfilter
Group: 1.0.1
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Ivan Vilata i Balaguer (ivilata)
Assigned to: Skip Montanaro (montanaro)
Summary: Classifier UnicodeDecodeError on wrong transfer encoding

Initial Comment:
Running ``sb_imapfilter.py`` 1.0.1 seems to raise the following ``UnicodeDecodeError`` when it comes across a mail with 7-bit content transfer encoding with 8-bit characters in it while classifying::

    Traceback (most recent call last):
      File "/usr/bin/sb_imapfilter.py", line 924, in ?
        run()
      File "/usr/bin/sb_imapfilter.py", line 914, in run
        imap_filter.Filter()
      File "/usr/bin/sb_imapfilter.py", line 785, in Filter
        self.unsure_folder)
      File "/usr/bin/sb_imapfilter.py", line 703, in Filter
        evidence=True)
      File "/usr/lib/python2.4/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
        clues = self._getclues(wordstream)
      File "/usr/lib/python2.4/site-packages/spambayes/classifier.py", line 496, in _getclues
        clues.sort()
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

I'm attaching the mail which caused this.  I know it is not properly formatted, but it is a legitimate mail produced by a popular MUA (Thunderbird 1.5).  Spam is surely formatted even worse.

Someone reported the same problem on the list: http://www.mail-archive.com/spambayes@python.org/msg04543.html

----------------------------------------------------------------------

Comment By: Sergio Gelato (gelato)
Date: 2008-12-05 10:35

Message:
I now have the pleasure of submitting a very simple patch for this issue.
I've just had a chance to test it (on spambayes 1.0.4). The only drawback
is that it bumps the minimum required Python version to 2.4, but hopefully
that's not too much of a problem nowadays.
In a nutshell: only sort on the first component of the tuple, like this:
-            clues.sort()
+            clues.sort(key=lambda x:x[0])

----------------------------------------------------------------------

Comment By: Sergio Gelato (gelato)
Date: 2008-09-15 15:31

Message:
I've had the same problem, with a similar traceback (also using spambayes
1.0.4). I was able to identify the exact word in the input data that
triggered the problem. It turns out, however, that changing the database
even slightly (I trained on a portion of the offending message) makes the
symptoms disappear.

In my case, the offending word was "Enk=E4t" (in a qp-encoded,
charset="iso-8859-1" text/plain subpart of a message/rfc822 subpart of a
multipart/mixed message). There were other similarly encoded words with
non-ASCII data earlier in the message (even in the same
body part), but only this one triggered the problem. (I established this
by truncating the input message after a variable number of lines and noting
which inputs were causing it to fail.) Extracting the message/rfc822 part
and running it alone through sb_filter.py did not trigger the problem.

In inspecting the spambayes source code, I noticed that tokenizer.py
doesn't seem to take into account the MIME charset. I'm not necessarily
saying that it should; in fact, spambayes must be able to cope with
malformed input data. But the result is that the words out of the tokenizer
are not in any well-defined encoding.
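
A small illustration of why the charset matters, using the word quoted above.
This is only a sketch with the standard-library quopri module, written in
Python 3 syntax; it is not SpamBayes code, and the variable names are made up
for the example. It shows that the decoded bytes become text only once a
charset is supplied, and that assuming ASCII fails on the 0xe4 byte.

```python
import quopri

# The quoted-printable word from the message body.
encoded = b"Enk=E4t"

# Undoing the transfer encoding yields raw bytes, not text.
raw = quopri.decodestring(encoded)

# With the declared MIME charset the bytes have a well-defined meaning.
text = raw.decode("iso-8859-1")

# Without it, an ASCII assumption fails on the non-ASCII byte.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(repr(raw), repr(text), ascii_error)
```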

clues is a list of (distance, prob, word, record) tuples. When there is a
tie on prob (and therefore also on distance=abs(0.5-prob)), the sort()
method will need to compare the word strings. This is where an implicit
word.decode('ascii') may take place, especially when one of the operands is
of type 'str' and the other one is of type 'unicode'. Training one more
message will change the probabilities and make the symptoms disappear (or
move somewhere else).
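
The tie-breaking behaviour described above can be reproduced in a few lines.
This is a sketch, not the SpamBayes code: the tuples are made-up values shaped
like (distance, prob, word, record), and the helper names are hypothetical. It
is written for Python 3, where bytes and str play the roles of Python 2's str
and unicode, and where the mixed comparison fails with a TypeError rather than
an implicit ASCII decode. Sorting on the first component only never reaches
the word strings.

```python
# Hypothetical clue tuples: (distance, prob, word, record).
clues = [
    (0.4, 0.9, "Enk\xe4t", None),   # text token
    (0.4, 0.9, b"Enk\xe4t", None),  # byte token with the same distance and prob
    (0.1, 0.6, "hello", None),
]


def sort_whole_tuples(items):
    """Sort complete tuples; return (result, error) so a failure is visible."""
    try:
        return sorted(items), None
    except TypeError as exc:
        # A tie on distance and prob forces a comparison of the two words,
        # which have different types.
        return None, exc


def sort_by_distance(items):
    """Sort on the first tuple component only, so words are never compared."""
    return sorted(items, key=lambda x: x[0])


whole, error = sort_whole_tuples(clues)
by_distance = sort_by_distance(clues)
print(error)
print([clue[0] for clue in by_distance])
```

Because the sort is stable, entries that tie on distance simply keep their
original relative order.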

I'd guess that some of the elements of the wordstream returned by the
tokenizer are of the wrong class. They should either all be of type str or
all of type unicode; probably the former.
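
One way to picture "all of the same type" is a wrapper around the token
stream. The sketch below is an assumption for illustration only:
normalize_tokens is a hypothetical helper, not part of SpamBayes, and the
choice of UTF-8 is arbitrary. It is written for Python 3, so "all str" in the
Python 2 sense becomes "all bytes" here.

```python
def normalize_tokens(wordstream, encoding="utf-8"):
    """Yield every token as bytes, encoding text tokens with `encoding`."""
    for token in wordstream:
        if isinstance(token, str):
            yield token.encode(encoding)
        else:
            # Already a byte string: pass it through untouched.
            yield token


# Mixed input: two text tokens and one raw (iso-8859-1) byte token.
mixed = ["abc", b"Enk\xe4t", "Enk\xe4t"]
tokens = list(normalize_tokens(mixed))

# Tokens of a single type can be compared and sorted without errors.
print(sorted(tokens))
```

Note that this only makes the types uniform. The raw iso-8859-1 bytes and the
UTF-8 encoding of the same word remain two different tokens, which matches the
observation that the tokenizer output is not in any single well-defined
encoding.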

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-10-02 12:55

Message:
Logged In: YES 
user_id=44345
Originator: NO

None of these make the current version of sb_filter.py barf.
I wonder if there's something peculiar about the way the
mail is transmitted via IMAP?  (Just a wild guess.)

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-10-02 02:06

Message:
Logged In: YES 
user_id=44345
Originator: NO

File Added: mailbox

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-10-02 00:46

Message:
Logged In: YES 
user_id=97460
Originator: NO

Three examples sent to skip at pobox.com.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-09-26 04:48

Message:
Logged In: YES 
user_id=44345
Originator: NO

jcea,

Do you have an email message I can work with?  If so, zip it and send it
to me as an attachment (skip at pobox.com).

Thx,

Skip

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-09-26 04:02

Message:
Logged In: YES 
user_id=97460
Originator: NO

My version is 1.0.4 and the traceback is:

"""
Traceback (most recent call last):
  File "/usr/local/lib/python2.5/site-packages/Milter/__init__.py", line 203, in <lambda>
    milter.set_eom_callback(lambda ctx: ctx.getpriv().eom())
  File "antispam.py", line 513, in eom
    prob=hammiedb.score(msg)
  File "/usr/local/lib/python2.5/site-packages/spambayes/hammie.py", line 62, in score
    return self._scoremsg(msg, evidence)
  File "/usr/local/lib/python2.5/site-packages/spambayes/hammie.py", line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File "/usr/local/lib/python2.5/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/usr/local/lib/python2.5/site-packages/spambayes/classifier.py", line 496, in _getclues
    clues.sort()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfa in position 0: ordinal not in range(128)
"""

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-09-17 17:37

Message:
Logged In: YES 
user_id=97460
Originator: NO

My version is 1.0.4.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-09-07 04:11

Message:
Logged In: YES 
user_id=44345
Originator: NO

I ran the submitted email through the current sb_filter.py in Subversion
(probably the same classifier as in 1.1a4).  It worked for me.  While I
don't use the IMAP filter, any of the SpamBayes applications should use the
same classifier code.  I'm not sure this is a problem in the current code. 
What version of SpamBayes are you using?

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-09-05 19:23

Message:
Logged In: YES 
user_id=44345
Originator: NO

Do you have a traceback?  What version of SpamBayes are you using?

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-09-05 16:59

Message:
Logged In: YES 
user_id=97460
Originator: NO

I'm seeing a lot of current spam (more than one message per hour on my
system) crashing spambayes because the messages are marked as "ascii" but
the body is actually 8-bit.

Since my milter spam filter crashes and sendmail disables milter
filtering for 50 seconds because of the failure (my configuration, and I
wouldn't like to touch it), a lot of spam is getting through: about 30-100
spams every time this bug hits.

Please, increase the priority of this bug a bit... It is hitting. Hard.

----------------------------------------------------------------------
