[Spambayes] I am the author of my own undoing.

Webb Scales scales at zko.dec.com
Mon Mar 22 18:31:47 EST 2004


A suggestion was made, in terms of recovering from my difficulties, to export
my database to a flat file, fix the offending spam count, and reimport it.
(The author of this suggestion is being obscured to preserve his life.  ;-)

Well, when you're knee-deep in alligators, it's hard to remember that your
original intention was to drain the swamp....  :-)

First, the help for sb_dbexpimp.py is missing a few options (like -d and -n),
but I figured out what to do from the examples.

Next, when I tried to reimport the (unmodified) database from the flat file, I
got the following:

     % sb_dbexpimp.py -i -v -d hammie.db -f hammie.db.prev.export.2
     Loading state from hammie.db database
     hammie.db is a new database
     Importing database hammie.db using file hammie.db.prev.export.2
     Traceback (most recent call last):
       File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 267, in ?
         runImport(dbFN, useDBM, newDBM, flatFN)
       File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 199, in runImport
         word = uunquote(word)
       File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 115, in uunquote
         return unicode(urllib.unquote(s), 'utf-8')
     UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 14: unexpected code byte

After reacquainting myself with extended regular expressions, I located a line
in the flat file which seemed like the offending one:

     % egrep '([^%]|(%..)){14}%AE' hammie.db.prev.export.2
     subject%3AValium%AE`0`1`

Evidently, the import code doesn't like 8-bit (non-ASCII) characters: the
byte 0xAE on its own is not valid UTF-8.
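
The failure can be reproduced outside of sb_dbexpimp.py. What follows is
only a minimal sketch, assuming Python 3 (the traceback above comes from
Python 2, where the call was unicode(urllib.unquote(s), 'utf-8')). The name
uunquote_strict is hypothetical and merely mirrors what the failing line
appears to do:

```python
from urllib.parse import unquote_to_bytes

def uunquote_strict(s):
    """Hypothetical Python 3 analogue of the failing line:
    percent-decode the token, then decode the bytes as UTF-8."""
    return unquote_to_bytes(s).decode("utf-8")

token = "subject%3AValium%AE"        # token taken from the egrep output
raw = unquote_to_bytes(token)        # b'subject:Valium\xae'

failure = None
try:
    uunquote_strict(token)
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte the codec rejected
    failure = (exc.start, raw[exc.start])

# The same bytes decode without error as Latin-1, where 0xAE is the
# registered-trademark sign.
as_latin1 = raw.decode("latin-1")
```

Here the rejected byte is 0xAE at offset 14, matching the traceback, which
suggests the token was stored in a single-byte encoding rather than UTF-8.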

So, I tried replacing %AE with %2E (i.e., clearing the high bit).  But,
importing this file yielded another error:

     % sb_dbexpimp.py -i -v -d hammie.db -f hammie.db.prev.export
     Loading state from hammie.db database
     hammie.db is a new database
     Importing database hammie.db using file hammie.db.prev.export
     Traceback (most recent call last):
       File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 267, in ?
         runImport(dbFN, useDBM, newDBM, flatFN)
       File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 199, in runImport
         word = uunquote(word)
       File "/etc/procmailrcs/scales/usr/local/bin/sb_dbexpimp.py", line 115, in uunquote
         return unicode(urllib.unquote(s), 'utf-8')
     UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-10: unexpected end of data

I couldn't find anything obviously wrong with any lines at position 9 or 10
(although I don't really know what to look for).  But I did search for
other 8-bit characters, and there were several:

     % grep '%[8-9A-F][0-9A-F]' hammie.db.prev.export
     subject%3Af%FCr`0`1`
     subject%3A%B2%D5%B8%CB`0`1`
     subject%3A%A4%C9%AF%C5`0`1`
     subject%3A%B8%A3%BA%FB%AD%D7/%A4%C9%AF%C5/%B2%D5%B8%CB`0`1`
     subject%3A%A7%D6%B3`0`1`
     subject%3A%A7%D6%B3t%B9q%B8%A3%BA%FB%AD%D7`0`1`
     subject%3A%B9`0`1`
     subject%3A%FC`0`1`
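
Rather than guessing with grep, one could ask the UTF-8 codec directly which
tokens it rejects. This is a rough sketch, assuming Python 3 and assuming
that each line of the flat file is a percent-quoted token followed by
backtick-separated counts, as in the lines shown above. The function name
find_undecodable is hypothetical and not part of SpamBayes:

```python
from urllib.parse import unquote_to_bytes

def find_undecodable(lines):
    """Return (line_number, token, byte_offset) for every token whose
    percent-decoded bytes are not valid UTF-8."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        token = line.split("`", 1)[0]    # text before the first backtick
        try:
            unquote_to_bytes(token).decode("utf-8")
        except UnicodeDecodeError as exc:
            bad.append((line_no, token, exc.start))
    return bad

# Sample lines copied from the output above, plus the edited Valium line.
sample = [
    "subject%3AValium%2E`0`1`",
    "subject%3Af%FCr`0`1`",
    "subject%3A%B9`0`1`",
]
bad = find_undecodable(sample)
```

For a real export, the lines would come from the file itself, for example
find_undecodable(open("hammie.db.prev.export")). In the sample, the f%FCr
token is rejected at offset 9, which is at least consistent with the
"position 9-10" reported in the second traceback.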

So, I manually knocked out all of the high bits (i.e., I replaced %8 with %0,
%9 with %1, on up to %F with %7), and then the import worked.  I hope I didn't
damage the contents too much in the process....
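
For comparison, here is a sketch of what clearing the high bits does to a
token, next to a decode that falls back to Latin-1 when UTF-8 fails. It
assumes Python 3; both function names are hypothetical, and Latin-1 is only
a guess at the original charset (several of the subjects above look like
they came from a multi-byte encoding, for which Latin-1 would preserve the
bytes but not the intended characters):

```python
from urllib.parse import unquote_to_bytes

def uunquote_tolerant(s, fallback="latin-1"):
    """Percent-decode, try UTF-8, and fall back to a single-byte
    charset instead of raising. The fallback charset is an assumption."""
    raw = unquote_to_bytes(s)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

def clear_high_bits(s):
    """Equivalent of the manual edit: %8x -> %0x, ..., %Fx -> %7x."""
    raw = unquote_to_bytes(s)
    return bytes(b & 0x7F for b in raw).decode("ascii")

token = "subject%3Af%FCr"
kept = uunquote_tolerant(token)    # keeps the u-umlaut
mangled = clear_high_bits(token)   # 0xFC becomes 0x7C, a vertical bar
```

With the high bit cleared, the token becomes "subject:f|r", which
presumably will never match the token generated from a future message, so
those entries are effectively lost rather than corrupted.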


So, how much of the above is a bug worth reporting?  (And, what should
I include in the report?)


                Thanks,

                    Webb


--
------------------------------------------------------------------------
Webb Scales                                Hewlett-Packard Company
scales at zko.dec.com                         110 Spit Brook Rd, ZKO2-3/N30
Voice: 603.884.2196, FAX: 603.884.0120     Nashua, NH 03062-2711
    Experience: the exam comes first, you get the lesson afterward.
------------------------------------------------------------------------

