[spambayes-dev] unicode error in sb_dbexpimp.py
Skip Montanaro
skip at pobox.com
Tue Mar 9 17:20:01 EST 2004
I decided to try speeding up my train-to-exhaustion runs (each round takes
longer, and there are more rounds, now that I'm approaching 1,000 total
training messages) by training to a pickle, then using sb_dbexpimp.py to
dump first to CSV and then to a database file. I got this error when
importing from tte.csv to tte.db:
Importing database tte.db using file tte.csv
Traceback (most recent call last):
  File "/Users/skip/local/bin/sb_dbexpimp.py", line 267, in ?
    runImport(dbFN, useDBM, newDBM, flatFN)
  File "/Users/skip/local/bin/sb_dbexpimp.py", line 199, in runImport
    word = uunquote(word)
  File "/Users/skip/local/bin/sb_dbexpimp.py", line 115, in uunquote
    return unicode(urllib.unquote(s), 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 7: unexpected end of data
It's barfing on this input string:
    %A5x%A5_%A5%AB%ABh%B8q%B0%CF
which urllib.unquote() turns into
    \xa5x\xa5_\xa5\xab\xabh\xb8q\xb0\xcf
which is (apparently) invalid UTF-8.
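
The claim above can be checked directly. The following is only a sketch, and it assumes a modern Python 3 interpreter, where urllib.parse.unquote_to_bytes() stands in for the Python 2 urllib.unquote() that sb_dbexpimp.py calls. The helper name is_valid_utf8 is made up for illustration and is not part of SpamBayes.

```python
from urllib.parse import unquote_to_bytes

# The token from the CSV file, exactly as quoted in the message.
quoted = "%A5x%A5_%A5%AB%ABh%B8q%B0%CF"

# Undo the %XX escapes, keeping the result as raw bytes
# (Python 3 stand-in for Python 2's urllib.unquote()).
raw = unquote_to_bytes(quoted)


def is_valid_utf8(data):
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(repr(raw))
print(is_valid_utf8(raw))
```

The first byte, 0xa5, is a UTF-8 continuation byte and cannot start a character, so a strict decode rejects the string before it gets any further.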
At first glance the uquote() and uunquote() function definitions seemed
okay, but on further reflection I wonder why urllib.quote() and
urllib.unquote() are being called at all.
Skip