[Spambayes] Some details that could be better

Lauri Harpf webmaster at apromotionguide.com
Sat Aug 21 09:07:06 CEST 2004


Hi,

I have just gotten back to using SpamBayes and retraining it after losing my
database. I have noticed a few things that bother me a bit and would like to
point them out. Whether to act on them I leave to the people who are doing
the coding.

First, the bigger issue: on the SpamBayes home page (i.e. localhost:8080/home),
you get the warning "you have much more spam than ham - SpamBayes works best
with approximately even numbers of ham and spam" if you have clearly more
spam than ham.

Does the algorithm rely heavily on an approximately 50/50 ratio? According
to the most recent survey I have seen, about 80-90% of all E-mail traffic on
the Internet is spam. Thus, it is quite difficult to get even numbers of ham
and spam, especially in the situations where SpamBayes is most needed. If only
half of my E-mail were spam, SpamBayes would probably not be such an essential
tool for me. I take it that many others are in the same situation as I am.
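
To put numbers on that, here is a rough sketch. It is plain arithmetic on the
survey figure above and is not taken from the SpamBayes code; the function
name is made up for illustration.

```python
from fractions import Fraction


def spam_to_ham_ratio(spam_share: Fraction) -> Fraction:
    """Spam messages per ham message when spam_share of all mail is spam."""
    return spam_share / (1 - spam_share)


# The 80% and 90% shares come from the survey figure quoted above.
for share in (Fraction(80, 100), Fraction(90, 100)):
    ratio = spam_to_ham_ratio(share)
    print(f"{float(share):.0%} spam -> {ratio} spam per ham")
```

So anyone who trains on everything they receive would end up with roughly
four to nine spam for every ham, which is nowhere near even.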

(Related to this: if you get the warning, does the algorithm work better if
you stop classifying spam that the program already recognizes, in order to
even up the numbers, or is it better to just classify all E-mail?)

The second issue is much less important, but somewhat funny. I'm using
Outlook Express (yes, booooooo!) and have therefore configured SpamBayes
to add "spam" as a recipient whenever it recognizes something as spam.
A simple OE rule then moves all E-mails with that recipient to a junk folder.

Annoyingly, messages from this list contain
"spambayes .locatedat. python.org" in the To: field, so with this configuration
they get moved to the junk folder even though they are correctly recognized
as ham. The OE rules seem to be quite simple and do not appear to allow
conditions like "if contains <this> but does not contain <that> then...".

I have made a special rule for this, but it might be worth considering
an option to customize the notation, i.e. when SpamBayes sees a message
as spam, it adds "Qedko421805AQ", for example, instead of "spam". Of course,
this is a minor issue, and adding configuration complexity over something
like this may well be unreasonable; still, I'd like to point it out.
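
To illustrate the collision, here is a minimal sketch. It assumes the OE rule
behaves like a case-insensitive "To: contains <text>" test, which is only my
guess at how it works; the function name and the example recipient address
are made up.

```python
def rule_matches(to_field: str, needle: str) -> bool:
    """Hypothetical model of a simple 'To: contains <text>' mail rule."""
    return needle.lower() in to_field.lower()


# To: field of a (ham) message from this list, as quoted above.
LIST_TO = "spambayes .locatedat. python.org"

# To: field of a spam after tagging; someone@example.com is a placeholder.
TAGGED_WITH_SPAM = "spam, someone@example.com"
TAGGED_WITH_TOKEN = "Qedko421805AQ, someone@example.com"

# The plain word "spam" catches the tagged spam, but also the list address.
print(rule_matches(TAGGED_WITH_SPAM, "spam"))            # True
print(rule_matches(LIST_TO, "spam"))                     # True (unwanted)

# An unlikely token catches the tagged spam and leaves list mail alone.
print(rule_matches(TAGGED_WITH_TOKEN, "Qedko421805AQ"))  # True
print(rule_matches(LIST_TO, "Qedko421805AQ"))            # False
```

Any string that is unlikely to appear in a real address would do; the point
is only that "spam" is a substring of the list's own name.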

Thanks for the great software to all who have contributed and keep
up the good work!

- Lauri

