Graham's spam filter

Erik Max Francis max at
Thu Aug 22 22:16:00 CEST 2002

Heiko Wundram wrote:

> Well... I explicitly stated that it doesn't scale well for larger units
> of people, but here where I live, we get our mail from the university
> accounts, and get pretty much the same spam (as the mail addresses are
> all of the form 4 letters, 4 digits, they are pretty well known out
> there...).

But the issue is that different people get different kinds of
_non_-spam.  If I subscribe to a lot of mailing lists, I may get a lot
of messages that contain spam-like phrases, even though in my case they
would be completely legitimate.  A statistical filter fine-tuned to my
needs would understand this (and not mark such mail as spam), but a
general one may not (and thus might generate false positives).  The
whole point of the Graham filter is that it needs to know not only what
typical spam looks like (which truly is, as you say, very similar across
most people), but also what typical non-spam _for you_ looks like.  The
former will be very much the same across most people, but the latter
will vary widely.
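
To make that concrete, here is a rough sketch of a per-token spam
probability, loosely following the formula in Graham's essay (doubled
good counts, clamping to 0.01..0.99, 0.4 for unseen tokens).  It leaves
out the essay's minimum-occurrence threshold, and the counts, the word
"mortgage" and all the names are made up for illustration.  The spam
counts are the same for both users; only the non-spam counts differ.

```python
from collections import Counter

def token_spam_prob(token, good, bad, ngood, nbad):
    """Spam probability of one token, loosely following Graham's essay.

    Assumes ngood and nbad (message counts) are both positive.
    """
    g = 2 * good[token]   # legitimate counts are doubled to bias against false positives
    b = bad[token]
    if g + b == 0:
        return 0.4        # default for a token never seen before
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))

# Hypothetical counts: everyone sees the same spam...
shared_bad = Counter({"mortgage": 40})
nbad = 100

# ...but the legitimate mail differs per user.
finance_list_reader = Counter({"mortgage": 30})
other_user = Counter()
ngood = 100

p_reader = token_spam_prob("mortgage", finance_list_reader, shared_bad, ngood, nbad)
p_other = token_spam_prob("mortgage", other_user, shared_bad, ngood, nbad)
print(p_reader, p_other)
```

For the mailing-list reader the word comes out mildly innocent; for the
other user it is pinned at the maximum, even though the spam side of the
data is identical.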

> This would mean separating the training process to two separate
> instances, a global database, and a personal database.

This doesn't sound like the right approach to me.  Instead, you should
perhaps start with a "global" database that is a sample of fairly
typical mail from your clients and typical spam.  It should be used
as an initial "seed" to the system only; once a user starts actively
using the system to filter his mail, it can tailor itself to his
specific needs.  The "global" database is simply a seed, so it never
needs to be updated; it's just there to get the user-specific databases
started.
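
Something like the following is what I have in mind, as a minimal
sketch only.  The class name UserFilterDB, its fields and the sample
tokens are all hypothetical; the point is just that each user gets a
private copy of the seed counts, and training touches only that copy.

```python
from collections import Counter

class UserFilterDB:
    """Per-user token counts, started from a shared seed (hypothetical layout)."""

    def __init__(self, seed_good, seed_bad, seed_ngood, seed_nbad):
        # Counter(...) copies the seed, so later training never alters it.
        self.good = Counter(seed_good)
        self.bad = Counter(seed_bad)
        self.ngood = seed_ngood   # number of legitimate messages seen
        self.nbad = seed_nbad     # number of spam messages seen

    def train(self, tokens, is_spam):
        """Record one message the user has classified."""
        if is_spam:
            self.bad.update(tokens)
            self.nbad += 1
        else:
            self.good.update(tokens)
            self.ngood += 1

# Made-up seed data standing in for the "global" sample.
seed_good = Counter({"meeting": 5})
seed_bad = Counter({"viagra": 9})

alice = UserFilterDB(seed_good, seed_bad, 10, 10)
bob = UserFilterDB(seed_good, seed_bad, 10, 10)

# The same word ends up on different sides for different users.
alice.train(["mortgage", "rates"], is_spam=False)
bob.train(["mortgage", "offer"], is_spam=True)
```

After this, alice and bob have diverged while seed_good and seed_bad
are exactly as they started, so the seed can be handed unchanged to the
next new user.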

 Erik Max Francis / max at /
 __ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
/  \ There is nothing so subject to the inconstancy of fortune as war.
\__/ Miguel de Cervantes
    Church /
 A lambda calculus explorer in Python.
