python script as an emergency mailbox cleaner

Sun Sep 21 18:03:42 EDT 2003

[Phil Weldon]

Phil, top-posting makes it very hard to follow a discussion.  I'll reply
briefly here, but if you want to continue it, please move it to the
spambayes list and interleave your comments with the text of the msg you're
replying to.

> Yes, I tend to discount your advice because it may be that you aren't
> considering the messages generated by Worm.Automat.AHB are a very
> restricted subset of spam, the legitimate 'undeliverable e-mail'
> messages are closely related, and the 'undelivered e-mail' messages
> caused by Worm.Automat.AHB generated e-mail with the target e-mail
> address in the FROM line are also closely related.  The current need
> is a quick way to counter the 'spam' effects of Worm.Automat.AHB, not
> correctly categorizing Nigerian fund transfer and Viagra spam sets.

As I said the first time, the worm spew in my classifier (you surely don't
think you're the only one getting these msgs, right?) was adequately caught
after training on 6 of the beasts.  You'll catch the worm spew too if you
train on 1500 of them, but at the cost of warping the classifier.

> To further explain, the bogus 'undeliverable e-mail' type messages are
> permutating and the database supplying the input to the worm's
> generator is growing.  There are at least two classes of bogus
> 'undeliverable mail';
>
> 1.  e-mail generated by the worm
> 2.  real 'undeliverable e-mail' messages that are the results of the
> worm using your e-mail address as the sender on bogus 'undeliverable
> e-mail' which then generates a legitimate but unwanted and useless
> 'undeliverable e-mail' message.
>
> Now, if you have the time to supply your arguments rather than cv,
> I'll be happy to learn.

spambayes never relied on arguments, it relied on testing.  Indeed, that's
why it works <wink>.

> And, to quote the Inboxer help file,
>
> "The text box in the Create Filters area indicates the number of
> messages that were processed to build the filters. Generally, the
> higher the number, the more accurate the filters will become."

Partly true, and partly misleading due to brevity.  Things aren't *that*
simple in reality (how could they be?).  If, after the "Generally", they
added some weasel-words about training set balance, I'd be happier with it,
but it's still just "generally".

> So far the scoring Inboxer developed on the basis of the ~1500 bad
> and 264 good examples results in no false negatives or false
> positives, including correctly classifying a dozen completely
> legitimate 'undelivered e-mail' messages in a set of ~ 400 new
> messages.

In part that's because I believe Inboxer uses the spambayes

experimental_ham_spam_imbalance_adjustment: True

option.  This option was intended to fight the worst effects of people
taking "more is better" too literally, getting their training data out of
balance.  The result of this option is that once you have a large imbalance
in one direction (and you do:  1500:264 is an out-of-whack ratio), training
on additional messages in the already-over-represented class (spam, for you)
has very little effect.

What you're seeing now is the good effects of that option trying its hardest
to ignore the imbalance you created.  It has bad effects too, which you'll
see later; the spambayes archives have many discussions of this already, so
suffice it to say here that we're removing the code that supports the
option.  The bad effects of unbalanced training will be more severe then,
but easier to recognize and address (the bad effects this option creates are
subtler -- in your case, if you get a significantly different new kind of
worm spew, you'll find it very hard to get it classed as spam, because
training on new spam will have little effect for you from now on).
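
A rough sketch of why that happens, for anyone who wants to see the
arithmetic.  It is only loosely modeled on the way spambayes computes a
per-word spam probability; the function name, the parameter names, and the
default values (0.45 and 0.5) are assumptions made for illustration, not a
copy of the project's code.  The point is that, with the adjustment on, the
evidence count for the over-represented class is scaled down by the
imbalance ratio, so a token seen only in freshly trained spam moves away
from the neutral 0.5 much more slowly.

```python
def word_spam_prob(spamcount, hamcount, nspam, nham,
                   imbalance_adjustment=False,
                   unknown_word_strength=0.45,   # assumed default
                   unknown_word_prob=0.5):       # assumed default
    """Estimate the spam probability of one token (illustrative sketch).

    spamcount/hamcount: number of trained spam/ham messages containing the token.
    nspam/nham: total number of trained spam/ham messages.
    """
    if spamcount == 0 and hamcount == 0:
        # Never-seen token: fall back to the neutral prior.
        return unknown_word_prob

    # Guard against an empty class so the ratios below are defined.
    nspam = max(nspam, 1)
    nham = max(nham, 1)

    # Raw estimate from per-class frequencies.
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    raw = spamratio / (hamratio + spamratio)

    if imbalance_adjustment:
        # Scale down evidence from whichever class is over-represented.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    # Effective amount of evidence, then shrink the raw estimate toward
    # the prior in proportion to how little evidence there is.
    n = hamcount * spam2ham + spamcount * ham2spam
    s = unknown_word_strength
    return (s * unknown_word_prob + n * raw) / (s + n)


if __name__ == "__main__":
    # A token that appears only in k newly trained spam, on top of a
    # 1500 spam / 264 ham training set.
    for k in (1, 2, 5, 10):
        plain = word_spam_prob(k, 0, 1500 + k, 264)
        adjusted = word_spam_prob(k, 0, 1500 + k, 264,
                                  imbalance_adjustment=True)
        print(f"k={k:2d}  without adjustment: {plain:.3f}  with: {adjusted:.3f}")
```

With balanced training data the two variants give identical results; they
diverge only as the spam and ham counts drift apart.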

> The ~1500 bad e-mail messages have a date spread of 18SEP03 through
> 20SEP03 while the 265 good e-mail messages have a date spread of
> 1AUG03 through 20SEP03.  Both sets were sent to my ISP mailbox.

OK by me <wink>.

> I will try dividing the two sets of messages into smaller sets and
> try the results of your suggestion on new e-mails as they collect.
> By the way, my current ratio of Worm.Automat.AHB instigated messages
> to legitimate e-mail (which for my purposes includes traditional
> spam) is far greater than 1500:265; it's more like 1500:50.
>
> And I guess I should download from spambayes

There's no need to download anything from the spambayes project if you're
happy with Inboxer.

> and donate to PSF

That's always appreciated!

> since my daughter is using Python in her physics classes at Carnegie-
> Mellon.  Coincidentally, I just happened to be looking at my loose-leafed
> copy of Feynman's Lectures on Physics with a reference manual in the
> back for FORTRAN IV I had to use for physics classes.

These kids today don't know what pain is, eh <wink>?