[Spambayes] So many spams, so little ham

Skip Montanaro skip at pobox.com
Mon Jun 28 09:44:57 EDT 2004


    Someone> I have trained 253 ham and 1181 spam.

    Tony> Note that, in general, you'll have better results training roughly
    Tony> the same amount of ham and spam.

    Amir> I'm really trying to, but since nowadays I get much more spam than
    Amir> ham (like many people do), I cannot really keep the numbers
    Amir> balanced anymore. Is this really a problem? 

I've solved this problem in my train-to-exhaustion script, contrib/tte.py.
I don't think it will help Outlook users unless they have some way to save
their spam and ham to external mailboxes and retrain from scratch.  The
script ruthlessly trains hams and spams in pairs, skipping any leftovers at
the end.  The -R flag causes it to work its way backward through the
mailboxes.  In general, this means that it trains on new messages in
preference to old messages.  The -c flag causes it to write out new
mailboxes, culling messages which were considered but scored correctly in
each round.  This has the nice effect that you don't need to worry if you
have two of the same sort of ham or spam.  The script will automatically
cull those messages which would have no effect on training.
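
The pairing and culling described above can be sketched roughly as follows.
This is a minimal illustration, not the actual tte.py code.  ToyClassifier,
train_to_exhaustion and all of their parameters are hypothetical names made
up for this example, the word-count scorer merely stands in for a real
classifier, and the 0.2/0.8 cutoffs are arbitrary.  It also assumes the
message lists are ordered oldest-first, so that reversing them favors new
mail.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical word-count scorer; NOT the SpamBayes classifier API."""

    def __init__(self):
        self.ham_words = Counter()
        self.spam_words = Counter()

    def learn(self, text, is_spam):
        target = self.spam_words if is_spam else self.ham_words
        target.update(set(text.split()))

    def score(self, text):
        """Return a spam score in [0, 1]; 0.5 means no evidence either way."""
        words = set(text.split())
        spam = sum(self.spam_words[w] for w in words)
        ham = sum(self.ham_words[w] for w in words)
        return 0.5 if spam + ham == 0 else spam / (spam + ham)


def train_to_exhaustion(hams, spams, reverse=False, max_rounds=10,
                        ham_cutoff=0.2, spam_cutoff=0.8):
    """Train on ham/spam pairs until a full round produces no misses.

    Returns (classifier, kept_hams, kept_spams); the kept lists hold only
    messages that had to be trained on at least once.
    """
    if reverse:  # assumes lists are oldest-first, so this favors new mail
        hams, spams = hams[::-1], spams[::-1]
    clf = ToyClassifier()
    trained_ham, trained_spam = set(), set()  # positions ever trained on
    for _ in range(max_rounds):
        misses = 0
        # zip() stops at the shorter list, so unpaired leftovers are skipped.
        for i, (ham, spam) in enumerate(zip(hams, spams)):
            if clf.score(ham) > ham_cutoff:      # ham not scored as ham
                clf.learn(ham, is_spam=False)
                trained_ham.add(i)
                misses += 1
            if clf.score(spam) < spam_cutoff:    # spam not scored as spam
                clf.learn(spam, is_spam=True)
                trained_spam.add(i)
                misses += 1
        if misses == 0:
            break
    # Cull: keep only messages that were ever trained on.
    kept_hams = [hams[i] for i in sorted(trained_ham)]
    kept_spams = [spams[i] for i in sorted(trained_spam)]
    return clf, kept_hams, kept_spams


hams = ["lunch meeting tomorrow", "lunch meeting tomorrow",
        "project status report"]
spams = ["cheap pills online", "win cash now", "cheap pills online",
         "free cash prize", "cheap watches"]
_, kept_hams, kept_spams = train_to_exhaustion(hams, spams)
print(kept_hams)
print(kept_spams)
```

In this toy run the second copy of the duplicated ham and the second copy
of the duplicated spam are culled, because each already scores correctly by
the time it is considered.  The last two spams are never looked at, since
there is no ham left to pair them with.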

Finally, if the generated database gets bigger than I'd like, I visit the
ham and spam collections in my mail program, sort by date and toss out a few
old messages from each collection.
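
The trimming step is done by hand in a mail program, but if the collections
happen to be stored as mbox files (an assumption; the post does not say),
something like the following sketch could do the same job with Python's
standard mailbox module.  The function name drop_oldest and its parameters
are hypothetical.  Messages without a parseable Date header are left alone.

```python
import mailbox
from email.utils import mktime_tz, parsedate_tz


def drop_oldest(mbox_path, count):
    """Remove the `count` oldest messages (by Date header) from an mbox.

    Returns the number of messages actually removed.
    """
    box = mailbox.mbox(mbox_path)
    box.lock()
    try:
        dated = []
        for key, msg in box.iteritems():
            parsed = parsedate_tz(msg.get("Date", ""))
            if parsed is not None:  # skip messages with no usable date
                dated.append((mktime_tz(parsed), key))
        dated.sort()  # oldest timestamps first
        victims = [key for _, key in dated[:count]]
        for key in victims:
            box.remove(key)
        box.flush()  # write the trimmed mailbox back to disk
    finally:
        box.unlock()
        box.close()
    return len(victims)
```

For example, drop_oldest("newham.old", 5) would discard the five oldest
dated messages from that file.  Work on a copy first, since the removal is
written back in place.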

I've attached the shell script (tte.sh) I use to drive tte.py.  In the
common case I run it like so.  (I suppose I could bury the mv and touch
commands in tte.sh, but I haven't.)

    cd ~/tmp
    # the .cull files are those saved from the previous tte run
    mv newham.old.cull newham.old
    mv newspam.old.cull newspam.old
    # just to guarantee we'll run the main loop
    touch newham
    tte.sh

You won't be able to use tte.sh as-is, but it may be useful as a
jumping-off point.

Skip
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/octet-stream
Size: 2200 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20040628/5e8fd007/attachment.obj

