[Spambayes] Eliminating duplicates from mbox file

Skip Montanaro skip at pobox.com
Fri Mar 7 17:37:43 EST 2003


While retraining today I flubbed at one point and wound up with a bunch of
duplicates in my training sets.  I wrote the attached script to eliminate
the duplicates.  I have a few questions:

    1. Is this worth checking into the contrib directory?

    2. Why did I have to subclass mailbox.PortableUnixMailbox?  It looks on
       the surface like mailbox.PortableUnixMailbox ought to work as-is (it
       has both __iter__() and next()), but if I use it directly without
       subclassing I get this:

            Traceback (most recent call last):
              File "singular.py", line 32, in ?
                main()
              File "singular.py", line 18, in main
                for msg in mbox:
            TypeError: iteration over non-sequence

       (BTW, I get the same error if I iterate over the mbox file using
       mboxutils.getmbox.)

    3. Is there a better way to emit the unique messages that doesn't
       require me to manually escape leading "From " sequences?
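
For anyone reading this in the archive: the attachment was scrubbed, so what
follows is not the original script. It is a minimal sketch of the same idea,
assuming a current Python 3 where the standard library provides mailbox.mbox
(which postdates this post). The names message_key and dedupe_mbox are
hypothetical, and treating Message-ID as the duplicate key (with a digest of
the whole message as a fallback) is an assumption rather than anything the
original script is known to have done. On question 3, mailbox.mbox.add()
escapes body lines that begin with "From " when it writes a message, so no
manual escaping should be needed with this approach.

```python
import hashlib
import mailbox


def message_key(msg):
    """Return a key identifying a message: its Message-ID, else a digest."""
    msgid = msg.get("Message-ID")
    if msgid:
        return msgid.strip()
    # No Message-ID header: fall back to hashing the serialized message
    # (headers and body, without the mbox "From " separator line).
    return hashlib.sha1(msg.as_bytes()).hexdigest()


def dedupe_mbox(src_path, dst_path):
    """Copy src_path to dst_path, keeping the first copy of each message.

    Returns the number of messages written. If dst_path already exists,
    messages are appended to it.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    seen = set()
    kept = 0
    try:
        for msg in src:
            key = message_key(msg)
            if key in seen:
                continue  # duplicate of a message already written
            seen.add(key)
            dst.add(msg)  # add() handles "From " escaping in the body
            kept += 1
    finally:
        dst.close()  # close() flushes pending writes
        src.close()
    return kept
```

Usage would be something like dedupe_mbox("spam.mbox", "spam-unique.mbox").
Keying on Message-ID is cheap but treats two different messages that share an
ID as duplicates; hashing every message instead would be stricter.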

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/octet-stream
Size: 722 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20030307/b3068bf5/attachment.obj

