[Spambayes] Eliminating duplicates from mbox file

Skip Montanaro skip at pobox.com
Fri Mar 7 17:37:43 EST 2003

While retraining today I flubbed at one point and wound up with a bunch of
duplicates in my training sets.  I wrote the attached script to eliminate
the duplicates.  I have a few questions:

    1. Is this worth checking into the contrib directory?

    2. Why did I have to subclass mailbox.PortableUnixMailbox?  It looks on
       the surface like mailbox.PortableUnixMailbox ought to work as-is (it
       has both __iter__() and next()), but if I use it directly without
       subclassing I get this:

            Traceback (most recent call last):
              File "singular.py", line 32, in ?
              File "singular.py", line 18, in main
                for msg in mbox:
            TypeError: iteration over non-sequence

       (BTW, I get the same error if I iterate over the mbox file using

    3. Is there a better way to emit the unique messages that doesn't
       require me to manually escape leading "From " sequences?
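
For readers without the attachment, here is a minimal sketch of the de-duplication step written against the current Python 3 standard library rather than the mailbox.PortableUnixMailbox class discussed above. It is not the attached script. The function name unique_messages and the rule that two messages are duplicates when their headers and body are byte-identical (ignoring the "From " envelope line) are assumptions made for illustration. The mailbox.mbox class escapes body lines beginning with "From " when it writes a message, which may bear on question 3.

```python
import hashlib
import mailbox


def unique_messages(src_path, dst_path):
    """Copy messages from src_path to dst_path, skipping exact duplicates.

    Assumption: two messages are duplicates when their headers and body
    are byte-identical; the "From " envelope line is not compared.
    Returns a (kept, dropped) tuple of counts.
    """
    seen = set()
    kept = dropped = 0
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)  # created if missing, appended to if present
    try:
        for key in src.iterkeys():
            # Fetch the raw message including its "From " envelope line.
            full = src.get_bytes(key, from_=True)
            # Hash everything after the envelope line.
            _, _, content = full.partition(b"\n")
            digest = hashlib.sha256(content).digest()
            if digest in seen:
                dropped += 1
                continue
            seen.add(digest)
            # add() reuses the leading "From " line as the envelope line and
            # rewrites body lines starting with "From " as ">From ".
            dst.add(full)
            kept += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return kept, dropped
```

A call such as unique_messages("spam.mbox", "spam-unique.mbox") (both filenames hypothetical) would leave the input untouched and write the first copy of each message to the output file. Keying on the Message-ID header instead would also catch copies whose headers differ, at the cost of trusting that header to be present and unique.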


