[Spambayes] Eliminating duplicates from mbox file
skip at pobox.com
Fri Mar 7 17:37:43 EST 2003
While retraining today I flubbed at one point and wound up with a bunch of
duplicates in my training sets. I wrote the attached script to eliminate
the duplicates. I have a few questions:
1. Is this worth checking into the contrib directory?
2. Why did I have to subclass mailbox.PortableUnixMailbox? It looks on
the surface like mailbox.PortableUnixMailbox ought to work as-is (it
has both __iter__() and next()), but if I use it directly without
subclassing I get this:
Traceback (most recent call last):
  File "singular.py", line 32, in ?
  File "singular.py", line 18, in main
    for msg in mbox:
TypeError: iteration over non-sequence
(BTW, I get the same error if I iterate over the mbox file using
3. Is there a better way to emit the unique messages that doesn't
require me to manually escape leading "From " sequences?
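
On question 3, here is a rough sketch of one possible approach in current Python 3. It is not the attached script (which the archive did not preserve) and it does not use the mailbox.PortableUnixMailbox API discussed above. It relies on the standard library's mailbox.mbox class, which escapes body lines beginning with "From " when it writes a message, so no manual escaping should be needed. The helper name dedupe_mbox and the choice of a SHA-256 digest over the whole message as the duplicate test are assumptions made for illustration only.

```python
import hashlib
import mailbox


def dedupe_mbox(src_path, dst_path):
    """Copy messages from src_path to dst_path, skipping exact duplicates.

    Hypothetical helper: returns a (kept, dropped) tuple of message counts.
    """
    seen = set()
    kept = dropped = 0
    src = mailbox.mbox(src_path, create=False)  # fail if the source is missing
    dst = mailbox.mbox(dst_path)                # created if it does not exist
    try:
        for msg in src:
            # Fingerprint headers plus body; the mbox "From " separator
            # line is not part of as_bytes(), so it does not affect the match.
            digest = hashlib.sha256(msg.as_bytes()).hexdigest()
            if digest in seen:
                dropped += 1
                continue
            seen.add(digest)
            # mbox.add() writes the separator line and escapes any body
            # line that starts with "From ".
            dst.add(msg)
            kept += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return kept, dropped
```

Calling dedupe_mbox("spam.mbox", "spam-unique.mbox") (both file names are placeholders) would write the unique messages to the second file. Messages that differ only in a header such as Received would still count as distinct under this test; keying on Message-ID instead is another option with different trade-offs.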
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Size: 722 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20030307/b3068bf5/attachment.obj