[spambayes-dev] sb_filter change
Skip Montanaro
skip at pobox.com
Wed Nov 12 09:48:25 EST 2003
I modified sb_filter.py to accept one or more file names on the command
line. Existing behavior should be retained. If a single message is read
from stdin, the output message will have a From_ line only if the input
message did. When processing files from the command line, it uses
mboxutils.getmbox() to decipher their format. In such cases, the output is
always a Unix-style mailbox on stdout.
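The From_ line rule for the stdin case can be sketched with the standard library's email package. This is only an illustration of the rule, not the code in sb_filter.py, and the function name roundtrip is made up for the example.

```python
import email


def roundtrip(text):
    """Parse one message and re-emit it, keeping the From_ line only if present."""
    msg = email.message_from_string(text)
    # get_unixfrom() returns None when the input had no leading "From " line.
    had_from_line = msg.get_unixfrom() is not None
    # Passing unixfrom=True unconditionally would make the generator invent
    # a "From nobody ..." line, so mirror the input instead.
    return msg.as_string(unixfrom=had_from_line)
```

A message that starts with a "From " envelope line comes back with that line intact, and a bare RFC 822 message comes back without one.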
This change probably doesn't have a lot of practical use, but I find it
helpful in one situation. If I want to score a mailbox full of messages to
identify outliers (perhaps mistakes in my classification of a large body of
messages), I used to do this:
    formail -s sb_filter.py < somembox \
      | egrep -i '^(x-spambayes-classification|message-id): '
which incurred sb_filter.py startup for each message. Now I execute
    sb_filter.py somembox \
      | egrep -i '^(x-spambayes-classification|message-id): '
which runs a lot faster.
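If the scored output is saved to a file, the egrep step could also be done in Python, which keeps each Message-ID paired with its classification. The following is a rough sketch using the standard library's mailbox module, not anything in SpamBayes itself; the function name classification_pairs is hypothetical.

```python
import mailbox


def classification_pairs(mbox_path):
    """Yield (message-id, classification) for each message in a scored mbox."""
    # create=False: fail if the file is missing rather than creating an empty one.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            yield (msg.get("Message-ID", ""),
                   msg.get("X-Spambayes-Classification", ""))
    finally:
        box.close()
```

Messages lacking either header yield an empty string in that position, so unscored messages are easy to spot.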
I should be able to figure out how to process my incoming mail that way as
well, then pipe the result into
    formail -s procmail
to do the usual procmail processing.
This usage suggests an enhancement to mboxutils.getmbox(). Currently, it
doesn't recognize Tim-style training databases (e.g. Data/Ham/SetN, where all
files have numeric filenames). mboxutils.DirOfTxtFileMailbox could be
extended to simply accept all plain files as messages and all subdirectories
as nested DirOfTxtFileMailboxes. Would that change break anyone's usage?
(What are .lorien files anyway?)
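To make the proposal concrete, here is a minimal sketch of the traversal I have in mind. It is not the existing DirOfTxtFileMailbox code; iter_dir_messages is a hypothetical stand-alone function that uses only the standard library.

```python
import email
import os


def iter_dir_messages(dirname):
    """Yield a message for every plain file under dirname, recursing into subdirs."""
    # sorted() gives a stable but lexicographic order ("10" sorts before "2").
    for name in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, name)
        if os.path.isdir(path):
            # Treat each subdirectory as a nested mailbox.
            yield from iter_dir_messages(path)
        elif os.path.isfile(path):
            with open(path, "rb") as f:
                yield email.message_from_binary_file(f)
```

Pointed at Data/Ham, this would walk every SetN directory and return each numbered file as one message, with no filename pattern required.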
Skip