[spambayes-dev] unicode error in sb_dbexpimp.py

Skip Montanaro skip at pobox.com
Tue Mar 9 17:20:01 EST 2004


I decided to try speeding up my train-to-exhaustion runs (they are taking
longer per round and running more rounds now that I'm approaching 1,000
total training messages) by training to a pickle, then using sb_dbexpimp.py
to dump it first to CSV and then to a database file.  I got this error when
importing from tte.csv to tte.db:

    Importing database tte.db using file tte.csv
    Traceback (most recent call last):
      File "/Users/skip/local/bin/sb_dbexpimp.py", line 267, in ?
        runImport(dbFN, useDBM, newDBM, flatFN)
      File "/Users/skip/local/bin/sb_dbexpimp.py", line 199, in runImport
        word = uunquote(word)
      File "/Users/skip/local/bin/sb_dbexpimp.py", line 115, in uunquote
        return unicode(urllib.unquote(s), 'utf-8')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 7: unexpected end of data

It's barfing on this input string:

    %A5x%A5_%A5%AB%ABh%B8q%B0%CF

which urllib.unquote()s to

    \xa5x\xa5_\xa5\xab\xabh\xb8q\xb0\xcf

which is (apparently) invalid utf-8.
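
The check can be reproduced in isolation.  This is only a sketch: it uses
Python 3's urllib.parse.unquote_to_bytes() as a stand-in for the Python 2
urllib.unquote() call in the traceback, and the variable names are made up
for illustration.

```python
from urllib.parse import unquote_to_bytes

# The quoted token from the CSV file, copied from the message above.
quoted = "%A5x%A5_%A5%AB%ABh%B8q%B0%CF"

# Undo the percent-escapes; the result is raw bytes, not text.
raw = unquote_to_bytes(quoted)

# Try the same decode that uunquote() attempts.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError as exc:
    is_valid_utf8 = False
    print("decode failed:", exc.reason, "at byte offset", exc.start)
```

The byte string starts with 0xa5, which cannot begin a UTF-8 sequence, so
the decode is rejected.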

At first glance the uquote() and uunquote() function definitions seemed
okay, but after further reflection I wonder why urllib.(un)?quote() are
being called.
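
For what it's worth, one way to keep the import from dying on such tokens
is to fall back to an 8-bit codec when the UTF-8 decode fails.  The sketch
below is Python 3 and hypothetical: uunquote_tolerant() and its fallback
parameter are not part of sb_dbexpimp.py, and choosing latin-1 is an
assumption (it accepts any byte value, but says nothing about what
encoding the token was really in).

```python
from urllib.parse import unquote_to_bytes


def uunquote_tolerant(s, fallback="latin-1"):
    """Hypothetical variant of uunquote() that never raises on bad UTF-8."""
    raw = unquote_to_bytes(s)  # percent-escapes -> raw bytes
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: map each byte to the code point of the same value
        # so the token survives the import instead of aborting it.
        return raw.decode(fallback)


word = uunquote_tolerant("%A5x%A5_%A5%AB%ABh%B8q%B0%CF")
```

Whether the token that comes back still matches the key the classifier
originally stored is a separate question, so this hides the symptom rather
than explaining why the token was quoted that way in the first place.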

Skip


