[Spambayes] I am the author of my own undoing.
Webb Scales
scales at zko.dec.com
Mon Mar 22 18:31:47 EST 2004
A suggestion was made, in terms of recovering from my difficulties, to export
my database to a flat file, fix the offending spam count, and reimport it.
(The author of this suggestion is being obscured to preserve his life. ;-)
Well, when you're knee deep in alligators, it's hard to remember that your
original intention was to drain the swamp.... :-)
First, the help for sb_dbexpimp.py is missing a few options (like -d and -n),
but I figured out what to do from the examples.
Next, when I tried to reimport the (unmodified) database from the flat file, I
got the following:
% sb_dbexpimp.py -i -v -d hammie.db -f hammie.db.prev.export.2
Loading state from hammie.db database
hammie.db is a new database
Importing database hammie.db using file hammie.db.prev.export.2
Traceback (most recent call last):
  File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 267, in ?
    runImport(dbFN, useDBM, newDBM, flatFN)
  File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 199, in runImport
    word = uunquote(word)
  File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 115, in uunquote
    return unicode(urllib.unquote(s), 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 14: unexpected code byte
After reacquainting myself with extended regular expressions, I located a line
in the flat file which seemed like the offending one:
% egrep '([^%]|(%..)){14}%AE' hammie.db.prev.export.2
subject%3AValium%AE`0`1`
Evidently, the import code doesn't like 8-bit (non-ASCII) characters.
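To illustrate the first traceback: the token on that line unquotes to a byte string ending in 0xAE, which is not valid UTF-8 on its own. The sketch below reproduces the failure in Python 3 (the script in the traceback runs on Python 2, so `urllib.parse.unquote_to_bytes` stands in here for `urllib.unquote`). Reading the byte as Latin-1 is only an assumption about how the original subject line was encoded.

```python
from urllib.parse import unquote_to_bytes

# The token from the offending line of the flat file.
raw = unquote_to_bytes("subject%3AValium%AE")   # b'subject:Valium\xae'

try:
    raw.decode("utf-8")
    failed_at = None
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte the decoder rejected.
    failed_at = exc.start

# Assumption: the byte was Latin-1, where 0xAE is the registered sign.
as_latin1 = raw.decode("latin-1")

print(failed_at, as_latin1)
```

The reported index lines up with "position 14" in the traceback: `subject:Valium` is 14 bytes long, so the 0xAE byte sits at index 14.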
So, I tried replacing %AE with %2E (i.e., clearing the high bit). But,
importing this file yielded another error:
% sb_dbexpimp.py -i -v -d hammie.db -f hammie.db.prev.export
Loading state from hammie.db database
hammie.db is a new database
Importing database hammie.db using file hammie.db.prev.export
Traceback (most recent call last):
  File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 267, in ?
    runImport(dbFN, useDBM, newDBM, flatFN)
  File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 199, in runImport
    word = uunquote(word)
  File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 115, in uunquote
    return unicode(urllib.unquote(s), 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-10: unexpected end of data
I couldn't find anything obviously wrong at position 9 or 10 of any line
(although I don't really know what to look for). But I did a search for
other 8-bit characters, and there were several:
% grep '%[8-9A-F][0-9A-F]' hammie.db.prev.export
subject%3Af%FCr`0`1`
subject%3A%B2%D5%B8%CB`0`1`
subject%3A%A4%C9%AF%C5`0`1`
subject%3A%B8%A3%BA%FB%AD%D7/%A4%C9%AF%C5/%B2%D5%B8%CB`0`1`
subject%3A%A7%D6%B3`0`1`
subject%3A%A7%D6%B3t%B9q%B8%A3%BA%FB%AD%D7`0`1`
subject%3A%B9`0`1`
subject%3A%FC`0`1`
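The grep above finds every quoted byte with the high bit set. A narrower check is to unquote each token and see whether it decodes as UTF-8. The following is a rough Python 3 sketch, not part of sb_dbexpimp.py; it assumes the token is the first backtick-delimited field on each line, as in the output shown, and `find_undecodable` is a hypothetical helper name.

```python
from urllib.parse import unquote_to_bytes

def find_undecodable(lines):
    """Return (line_number, quoted_token, byte_offset) for every token
    whose unquoted bytes are not valid UTF-8."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # Assumption: the token is everything before the first backtick.
        quoted = line.split("`", 1)[0]
        try:
            unquote_to_bytes(quoted).decode("utf-8")
        except UnicodeDecodeError as exc:
            bad.append((number, quoted, exc.start))
    return bad

sample = [
    "subject%3AValium%AE`0`1`",
    "subject%3Af%FCr`0`1`",
    "subject%3Ahello`3`0`",   # hypothetical well-formed line for contrast
]
problems = find_undecodable(sample)
for number, quoted, offset in problems:
    print(number, quoted, offset)
```

For `subject%3Af%FCr` the rejected byte is at offset 9 of an 11-byte token, which would be consistent with the "position 9-10" in the second traceback, though the exact wording of the error differs between Python versions.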
So, I manually knocked out all of the high bits (i.e., I replaced %8 with %0,
%9 with %1, on up to %F with %7), and then the import worked. I hope I didn't
damage the contents too much in the process....
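On the question of damage: masking the high bit turns each affected byte into an unrelated ASCII character, so the edited tokens no longer correspond to anything in the original subject lines. One possible alternative, sketched here in Python 3, is to re-encode the offending tokens as UTF-8 before importing. Both function names are hypothetical, and the Latin-1 fallback is an assumption: it never fails and keeps distinct tokens distinct, but tokens that were really in some other 8-bit encoding would come out as mojibake.

```python
from urllib.parse import quote, unquote_to_bytes

def clear_high_bits(quoted):
    """Mimic the manual edit: mask every byte to 7 bits, then re-quote."""
    masked = bytes(b & 0x7F for b in unquote_to_bytes(quoted))
    return quote(masked, safe="/")

def requote_as_utf8(quoted, fallback="latin-1"):
    """Re-quote a token so that its unquoted bytes are valid UTF-8."""
    raw = unquote_to_bytes(quoted)
    try:
        text = raw.decode("utf-8")      # already fine, keep as-is
    except UnicodeDecodeError:
        text = raw.decode(fallback)     # assumption: bytes were Latin-1
    return quote(text, safe="/")        # quote() encodes str as UTF-8

print(clear_high_bits("subject%3Af%FCr"))
print(requote_as_utf8("subject%3Af%FCr"))
```

Whether re-encoded tokens would match what the tokenizer generates for future messages is a separate question that this sketch does not answer.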
So, how much of the above is a bug worth reporting? (And, what should
I include in the report?)
Thanks,
Webb
--
------------------------------------------------------------------------
Webb Scales Hewlett-Packard Company
scales at zko.dec.com 110 Spit Brook Rd, ZKO2-3/N30
Voice: 603.884.2196, FAX: 603.884.0120 Nashua, NH 03062-2711
Experience: the exam comes first, you get the lesson afterward.
------------------------------------------------------------------------