[Spambayes] Eliminating duplicates from mbox file

Tim Peters tim.one at comcast.net
Fri Mar 7 20:20:27 EST 2003

[Skip Montanaro]
> While retraining today I flubbed at one point and wound up with a bunch of
> duplicates in my training sets.  I wrote the attached script to eliminate
> the duplicates.  I have a few questions:
>     1. Is this worth checking into the contrib directory?

Not for Outlook users <wink>.

>     2. Why did I have to subclass mailbox.PortableUnixMailbox?

You shouldn't have to, and you shouldn't have to check for "msg is None"
either.  Note that some of the earliest scripts in the codebase don't do
either.  For example, from split.py:

    mbox = mailbox.PortableUnixMailbox(infp, mboxutils.get_message)
    for msg in mbox:
        if random.random() < percent:
            outfp = bin1out
        else:
            outfp = bin2out
        astext = str(msg)
        assert astext.endswith('\n')
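
For readers on a current Python, the same "just iterate, no subclass, no None check" pattern can be sketched with the Python 3 standard library, where mailbox.mbox takes the place of PortableUnixMailbox. This is only an illustration, not Skip's script: the helper name iter_unique, the choice of Message-ID as the duplicate key, and the example.com addresses are all assumptions.

```python
import mailbox
import os
import tempfile


def iter_unique(mbox):
    """Yield each message whose Message-ID has not been seen before.

    Keying on Message-ID is an assumption made for this sketch; a real
    script might compare a hash of the whole message instead.
    """
    seen = set()
    for msg in mbox:  # plain iteration yields message objects, never None
        msg_id = msg.get("Message-ID")
        if msg_id is not None and msg_id in seen:
            continue  # duplicate: skip it
        seen.add(msg_id)
        yield msg


# Build a throwaway mbox holding one duplicated message.
path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
box = mailbox.mbox(path)
for mid in ("<a@example.com>", "<b@example.com>", "<a@example.com>"):
    box.add("Message-ID: %s\nSubject: demo\n\nbody\n" % mid)
box.flush()

unique_ids = [m["Message-ID"] for m in iter_unique(box)]
box.close()
```

Messages without a Message-ID header are all kept here, since there is nothing to compare them on.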

> ...
>     3. Is there a better way to emit the unique messages that doesn't
>        require me to manually escape leading "From " sequences?

Looks to me like the email pkg (at least the one in Python CVS) already does
the ">From" bit within msg bodies.  The *leading* "From " isn't supposed to
be escaped -- "From " at the start of a line within a body is supposed to be
escaped precisely so that an unescaped "From " at the start of a line is
recognized as the start of a new msg.
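
A small sketch of that escaping rule, using the email package as it exists in Python 3 (not the 2003 CVS version discussed above). It assumes email.generator.Generator with mangle_from_=True; the sample message and the envelope address are made up for illustration.

```python
import io
from email import message_from_string
from email.generator import Generator

# Hypothetical message whose body has a line starting with "From ".
raw = (
    "Subject: test\n"
    "\n"
    "From here on, this body line starts with the magic word.\n"
    "Second line.\n"
)
msg = message_from_string(raw)

# The envelope line that marks the start of a message in an mbox file.
msg.set_unixfrom("From skip@example.com Fri Mar  7 20:20:27 2003")

buf = io.StringIO()
# mangle_from_=True escapes "From " at the start of body lines as ">From ".
# unixfrom=True emits the envelope line, which is left unescaped.
Generator(buf, mangle_from_=True).flatten(msg, unixfrom=True)
text = buf.getvalue()

# Lines that an mbox reader would treat as message separators.
separator_lines = [ln for ln in text.splitlines() if ln.startswith("From ")]
```

Only the envelope line should remain as a bare "From " line; the body line comes out as ">From here on, ...", so a reader splitting on "From " sees exactly one message.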

