[spambayes-dev] RE: [Spambayes] How low can you go?

Seth Goodman nobody at spamcop.net
Fri Dec 26 12:07:52 EST 2003


> [Tim Peters]
> Two days ago I created a new .pst file, with two folders "All ham" and
> "All spam".  Since then I've been copying each message I get into one of
> them.

That's exactly what I've been doing for a while, so that's encouraging.  I
have local ham and spam corpus folders in Outlook.pst that I move (or copy)
_all_ messages into when I am finished with them.  I toss a few unsures, but
most go into one bucket or the other.  Those two folders in Outlook.pst get
autoarchived into SpamCorpus1.pst when messages in them are more than three
days old.  That gives me time to manually track statistics (another PITA).


> [Tim Peters]
> When it comes time to use export.py, I'll have to temporarily fiddle my
> spambayes config to say that "All ham" is my (only) ham folder and "All
> spam" my (only) spam folder (export.py gets its idea of where your ham and
> spam training data are from your Outlook spambayes config file).

Which place in the SpamBayes manager is the one that changes the config that
export.py uses?  There are ham and spam folder specifications in more than
one place:  filtering, training and watched folders at least, and there may
be more.


> [Tim Peters]
> Copying all incoming msgs is a bit of a PITA for me, and if you use
> Outlook rules too (I don't) to sort ham into different folders, may be a
> royal PITA.
> So it goes -- Outlook wasn't designed for running spam-filter experiments
> (then again, no email client was, and that's why we have a "standard"
> test-data directory structure of our own).

Yeah, I use a lot of rules and sub-folders, so I have developed a "recipe"
to make sure I don't screw up the semi-manual sorting (the thought of
learning VB and the insides of Outlook is painful; my hat's off to Mark).
One thing I do that may or may not be typical is that I let Outlook rules
take care of all the mailing list traffic.  That traffic includes almost no
spam, so I don't train on or classify it (the list admins do a good job).
Therefore, I _don't_ include it in my ham corpus.  This gives me a roughly
1:5 ham/spam
corpus, instead of roughly even, but that's the mail stream that SpamBayes
sees.  I _do_ make sure the training sets have equal numbers of messages.
At present, my corpus is about 7,500 messages total.  This may not be enough
to "divide into ten sets", etc.  Or is it?
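
As a rough sanity check on those numbers, here is a minimal Python sketch of
what a ten-way split would look like.  It assumes exactly 7,500 messages at
exactly 1:5 ham/spam (my real counts are only approximate), and the helper
name split_into_sets is made up for illustration; it is not part of SpamBayes
or of the test-data scripts.

```python
import random


def split_into_sets(message_ids, n_sets=10, seed=0):
    """Shuffle the ids, then deal them round-robin into n_sets buckets."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = list(message_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::n_sets] for i in range(n_sets)]


# Assumed counts: 7,500 messages total at a 1:5 ham/spam ratio.
total = 7500
n_ham = total // 6                  # one part ham out of six
n_spam = total - n_ham              # five parts spam

# Integer ids stand in for real messages here.
ham_sets = split_into_sets(range(n_ham))
spam_sets = split_into_sets(range(n_spam))

# If each set must hold equal numbers of ham and spam, ham is the limit.
balanced_per_set = min(len(ham_sets[0]), len(spam_sets[0]))

print(len(ham_sets[0]), len(spam_sets[0]), balanced_per_set)
```

Under those assumptions each set gets 125 ham and 625 spam, so keeping the
sets balanced would leave only 125 of each per set.  Whether that is enough
is the question.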


--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



