[Spambayes] Language distribution

Maik Musall lists at musall.de
Thu Feb 12 03:16:54 EST 2004

Hi there,

I get about 60 spams a day, and installed spambayes a week ago. I'm
very happy with it, as it filters out most of them and hasn't produced
any false positives yet.

However, I discovered a certain weakness. Most of the incoming spam is
in English, while a large portion of my ham is German. So when I get
English ham, it's often scored not near 0 but at about 0.20, while
German spam (which is only now starting to appear) is often not
recognized as such.
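
To illustrate how such a skew can arise, here is a simplified sketch of
a Robinson-style per-token spam estimate. It is an approximation for
illustration, not the actual spambayes code: the function name, the
parameters s=0.45 and x=0.5, and the word counts are all assumptions;
only the 9000/6000 training sizes come from my setup.

```python
def token_spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Estimate how spammy a single token looks (0 = ham, 1 = spam).

    s and x are assumed smoothing parameters: x is the score given to a
    never-seen token, s is how strongly that prior is weighted.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # token never seen in training
    spam_ratio = spam_count / nspam   # fraction of spams containing it
    ham_ratio = ham_count / nham      # fraction of hams containing it
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate towards x when there is little evidence.
    return (s * x + n * raw) / (s + n)


# Hypothetical counts: a common English word is seen mostly in spam
# simply because little of the ham is English, and vice versa for German.
english_word = token_spamprob(8000, 1500, nspam=9000, nham=6000)
german_word = token_spamprob(50, 4000, nspam=9000, nham=6000)
print(english_word, german_word)
```

With counts like these, even harmless English words lean towards spam
and harmless German words lean towards ham, which would push English
ham up and German spam down.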

I also see a similar skew between HTML mails and non-HTML mails.

Spambayes also has some trouble distinguishing real MDA error messages
from the MyDoom stuff with its typical attachments. I'm currently
trying to collect enough of that to build a procmail rule that catches
it by combining the spambayes classification with content length
and attachment configuration.
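
The kind of check such a rule would make can be sketched as follows, in
Python rather than procmail. This is only a rough sketch: the size
limit and the list of attachment extensions are made-up placeholders,
not measured values, and it assumes the message already carries an
X-Spambayes-Classification header of the form "unsure; 0.45".

```python
import email
from email import policy

# Placeholder values for illustration only, not taken from real data.
SUSPECT_EXTENSIONS = (".zip", ".pif", ".scr", ".exe", ".cmd", ".bat")
MAX_WORM_SIZE = 60_000  # bytes; assumed upper bound for worm mails


def looks_like_worm_bounce(raw_bytes):
    """Combine classifier verdict, message size and attachment type."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)

    # Assumed header format: "<verdict>; <score>".
    header = str(msg.get("X-Spambayes-Classification", ""))
    verdict = header.split(";")[0].strip().lower()
    if verdict not in ("unsure", "spam"):
        return False

    # Content length: skip anything larger than the assumed limit.
    if len(raw_bytes) > MAX_WORM_SIZE:
        return False

    # Attachment configuration: any attachment with a suspect extension.
    for part in msg.iter_attachments():
        name = (part.get_filename() or "").lower()
        if name.endswith(SUSPECT_EXTENSIONS):
            return True
    return False
```

A real delivery report without such an attachment, or one that
spambayes already scores as ham, falls through and is left alone.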

I'd like to exchange experiences, especially with that type of problem.

Some information about my spambayes configs:

I trained spambayes on about 9000 spams and 6000 hams, accepting that
the spam was from a few months while the ham was from a few years; there
was no other way to get near the ideal 1:1 ratio. I plan to experiment
with just a few hundred of the newer spams and hams, but I want to run
this configuration for at least a few weeks first to collect more
information. My hammiedb is 23 MBytes now.

I use spambayes with a procmail script that does the following:
1. Sort out mailing lists by List-Id, Sender or From headers.
2. Keep a copy of everything else in an unread reference folder.
3. Collect some spam by certain subject words, or from big at boss.com etc.
4. Filter the rest through spambayes, and put it into the spam or unsure
   folder if marked so.
5. The rest goes into my inbox.
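
The order of the steps above can be sketched in Python as a routing
function. This is not the procmail script itself: the folder names, the
subject words, the sender addresses and the classify() callable are all
hypothetical stand-ins, and sending statically caught spam to the same
"spam" folder is an assumption.

```python
from email.message import EmailMessage

# Hypothetical rule data, for illustration only.
LIST_SENDERS = ("owner-somelist@example.org",)
SUBJECT_WORDS = ("viagra", "mortgage")
BLOCKED_SENDERS = ("spammer@example.com",)


def route(msg, classify):
    """Return the folders a message is delivered to, in order.

    classify is any callable returning "ham", "unsure" or "spam";
    it stands in for the spambayes filter step.
    """
    sender = (str(msg.get("Sender", "")) + " " + str(msg.get("From", ""))).lower()
    subject = str(msg.get("Subject", "")).lower()

    # 1. Mailing lists are sorted out first and never reach the filter.
    if msg.get("List-Id") or any(s in sender for s in LIST_SENDERS):
        return ["lists"]

    # 2. Everything else is copied to the reference folder.
    folders = ["reference"]

    # 3. Static rules catch some spam before the classifier runs.
    if any(w in subject for w in SUBJECT_WORDS) or any(
        s in sender for s in BLOCKED_SENDERS
    ):
        return folders + ["spam"]

    # 4. The classifier decides between spam and unsure ...
    verdict = classify(msg)
    if verdict in ("spam", "unsure"):
        return folders + [verdict]

    # 5. ... and whatever is left goes to the inbox.
    return folders + ["inbox"]


# Small usage example with a dummy classifier that always says "ham".
demo = EmailMessage()
demo["From"] = "friend@example.org"
demo["Subject"] = "Lunch tomorrow?"
demo.set_content("See you at noon.")
print(route(demo, lambda m: "ham"))
```

Because list mail returns early in step 1, it never gets scored at all,
so it plays no part in the language skew described above.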


