[Spambayes] progress on POP+VM+ZODB deployment

Derek Simkowiak dereks@itsite.com
Sun Oct 27 22:46:53 2002


> can be -- it's easy).  Then all users get them.  If one user signs up for a
> minister-by-mail scam (a real-life example reported earlier on this list),
> then all users get minister-by-mail scams.  Etc.

	I'm a little slow, so forgive me if this is... repetitive.  But
your argument sounds like something of a showstopper for my intended use
of SpamBayes, and I want to make sure this behaviour is clearly documented
in the archives.

	Consider a group of people who all use the same mail server.
I'm thinking of a university, or customers of one of those $20/month email
services, or a 1000-person company.

	Now consider the sysadmin who wants to use SpamBayes to flag spam
on that mail server, such that users can set up a generic filter rule that
is easily supported by the organization's Help Desk.

	The way I understand it, if any _one_ person in the group of
people likes to get advertisements, porn mails, hotel conference info,
and/or minister-by-mail, and SpamBayes is trained on all incoming mail,
then everybody in the group will have their filtering rendered useless.

	In other words, Bayesian filtering (as popularized by the article
"A Plan for Spam") is only good for individuals, or small groups of
individuals who all like the same kinds of ham.

	I can't help but feel that I'm missing something.  In this
setting, it seems like training on hams is quite destructive to the goal
of flagging Spam.


	What if we pretend that all hams have a probability of exactly 0.5,
that is, any given ham cannot be identified as either spam or non-spam?
In other words, all hams are just random noise.

	Then we train against a huge collection of spam, like Bruce G.'s
stuff.

	Each word in the database gets a "spam likelihood" rating,
depending on what percentage of the time it shows up in the spams.  A word
that shows up in every single spam gets a "1.0", and every word that does
not appear in the spam at all gets a "0.0".  We throw out ueber-common
words like a, and, the, it, just like Google does for its searches, as a
matter of efficiency.

	Then every email is rated word-by-word.  The scores for all the
words are then averaged together.  So an email with many words commonly
found in spam gets a high rating... (?)
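	To make that concrete, here is a rough Python sketch of the kind
of spam-only scoring I have in mind.  It is purely hypothetical -- this is
not how SpamBayes actually scores, and the stop-word list, tokenizer, and
threshold are all made up for illustration:

from collections import Counter

STOPWORDS = {"a", "and", "the", "it"}   # ueber-common words we throw out

def tokenize(text):
    # Crude word splitter; a real tokenizer would do much more.
    return [w for w in text.lower().split() if w not in STOPWORDS]

def train(spam_corpus):
    # Map each word to the fraction of spam messages that contain it.
    doc_counts = Counter()
    for message in spam_corpus:
        doc_counts.update(set(tokenize(message)))
    n = len(spam_corpus)
    return {word: count / n for word, count in doc_counts.items()}

def score(message, spam_likelihood):
    # Average the per-word spam likelihoods; words never seen in any
    # spam count as 0.0, since hams are treated as pure noise.
    words = tokenize(message)
    if not words:
        return 0.0
    return sum(spam_likelihood.get(w, 0.0) for w in words) / len(words)

# Anything scoring above some threshold (say 0.7) would get flagged.
likelihoods = train(["buy cheap pills now", "cheap pills cheap prices"])
print(score("cheap pills for sale", likelihoods))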

	Um, I've overstepped my understanding of the problem, so I'll just
stop there.  But to you algorithm geniuses, I plead for a way to filter
spam that depends only on previously-seen Spam, and that does not depend
on what ham looks like.


Thanks,
Derek