[Spambayes] progress on POP+VM+ZODB deployment
Sun Oct 27 20:55:25 2002
> It seems like you're saying that SpamBayes will not work for an
> enterprise-wide deployment, since different individuals' vocabularies,
> writing styles, and interests vary so wildly.
That's a matter for testing to decide, but it's not a kind of thing I can
make time to test. I doubt that their vocabularies or writing styles matter
(it's the email you get, not the email you write, that's judged), what
matters is what forms of advertising the individuals within the enterprise
want. "Enterprise" is too vague a word to guess anything about that in
general. If "the enterprise" is general tech mailing-list traffic going
thru python.org, then we have strong evidence (from testing) that a single
classifier will work great. If "the enterprise" is an ISP serving 1,000
individuals' private email, I expect a single classifier would have such
high false positive rates as to be unacceptable. If you have one user who
*wants* porn ads, a single classifier has to be trained to accept them (and
can be -- it's easy). Then all users get them. If one user signs up for a
minister-by-mail scam (a real-life example reported earlier on this list),
then all users get minister-by-mail scams. Etc.
> In the false positives you mention above, was the spam cutoff
> being used? (If so, what was it set to?) Or, are those "false positives"
> hams being assigned a spam probability >.50 ?
Different tests were done at different times with different combining
schemes and different corpora. They all had in common that "false positive"
scores were above a realistic middle-ground cutoff.
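For concreteness, here's roughly what "a middle-ground cutoff" means. This is just an illustrative sketch of a Graham-style combining scheme, not the actual Spambayes code (which has used several different schemes); the cutoff value and the per-token probabilities are made up:

```python
from math import prod

def combined_spamprob(token_probs):
    """Graham-style combining: fold per-token spam probabilities
    into a single message score between 0.0 and 1.0."""
    p = prod(token_probs)                     # evidence for spam
    q = prod(1.0 - t for t in token_probs)    # evidence for ham
    return p / (p + q)

SPAM_CUTOFF = 0.90  # hypothetical "middle-ground" cutoff

# Made-up spamprobs for the tokens of one message:
probs = [0.99, 0.95, 0.20, 0.88]
score = combined_spamprob(probs)
# If the message is really ham but score > SPAM_CUTOFF,
# it's counted as a false positive.
is_flagged_spam = score > SPAM_CUTOFF
```

A "false positive" in the tests meant a ham whose combined score landed above a cutoff like that, not merely above 0.5.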
> I am a big fan of enterprise-wide anti-spam measures. In my mind,
> it makes sense to flag messages and have "default" filter rules for every
> workstation. It makes it much easier on the I.T. department. Requiring
> Python on every Windows box would immediately make SpamBayes a no-go in
> many businesses and Universities, simply because of the (expensive) user
> support that would be required. So I am concerned when you present
> evidence that every individual needs to do their own SpamBayes training.
Spam in the sense of "advertising I don't want to see, as opposed to
advertising I do want to see" is a personal judgment. That doesn't preclude
server-based approaches, but would require knowing about (saving info about)
each individual, unless "the enterprise" has a single, fixed policy about
what constitutes advertising nobody in the enterprise should be allowed to
see.
> It is obvious and well-understood that a .db trained from a
> specific individual's body of emails will work better for that individual
> than for some other individual. So what you say above does not surprise
> me. But what does surprise me is the argument that every individual should
> do their own SpamBayes training.
Test it and draw your own conclusions -- nothing is hidden here <wink>.
>> The low-spamprob words specific to *your* ham will depend on the
>> content of your ham in equally quirky ways.
> No doubt; but over a large body of emails from many different
> individuals, I think the "quirkies" would fall by the wayside (because any
> one individual's quirkies would not be very frequent over the given
> collection), and that the Spam-specific "quirkies" (things like
> color=#FF0000) would hence become the strongest identifiers for any given
> message.
In that case the ham quirkies become too weak to let that individual's
favored forms of advertising thru. By the way, if you think #FF0000 is a
killer-strong spam clue, you don't have young relatives sending you HTML
birthday greetings <0.6 wink>.
> (Officially proposing the term "quirkie" to mean a strong spamprob
> word -- either for or against -- that is specific to a particular corpus of
> email.)
Consider it adopted -- I like it!
> I'm guessing that if you did your tests again, but trained against
> all the corpuses before doing the test, your false positive rate would
> drop way down. (Is that not how SpamBayes is supposed to work?)
Training on ham does improve the FP rate. But if I have to train it to
allow the forms of bulk advertising you want to see, then a single
classifier can't block those forms of advertising for anyone else. In the
python.org context, the only community-accepted advertising is highly
specific to Python and Zope, so a single classifier works fine. In the
context of my personal email, the only advertising I want to see is from the
companies I do business with, and I indeed needed to train carefully on
several examples each of marketing email from various *specific* financial
institutions, companies, and special-interest newsletters *I* like to see.
I've even trained it to accept "Joke of the Day" spam, because I often like
the jokes, despite that the rest of those spams are trying to sell me the
usual range of crap from human growth hormone to miracle diets. You don't
want to see that stuff, and that I've trained my classifier to accept
marketing blurbs from Strong isn't going to help you get marketing blurbs
from the companies *you* do business with.
> You see, I do not have access to a large corpus of email from many
> different individuals. All I have is my inbox, which is quite quirky.
So start with that.
> But I want to set up a hammie.py installation for a small
> workgroup, to see what kind of performance I get, and to monitor
> SpamBayes' performance changes over time (as it's trained to the small
> workgroup's incoming messages).
Then start with that.
> If I had a starter .db file that was trained against many emails
> from many different individuals, then I'd be able to get going.
Just start and see what happens. You're simply not going to get a DB from
anyone trained on personal email, because there are too many clues about
individual identities in the database, including things like passwords,
account numbers, and the email addresses of friends and relatives.
> Instead, I'm stuck wondering what process I should go through to try to
> collect a large corpus of email that will have its ham quirkies averaged
> out.
You don't need a large corpus; the system learns quickly; just start.
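"Learns quickly" because training is incremental: every classified message updates per-token ham/spam counts, so the database sharpens from day one. This is not the real Spambayes classifier -- just a toy sketch of the shape of the idea, using Robinson-style smoothing so rare tokens start near the unknown-word probability:

```python
from collections import Counter

class TinyBayes:
    """Toy incremental trainer: per-token ham/spam message counts,
    updated one message at a time."""
    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()
        self.nham = self.nspam = 0

    def train(self, tokens, is_spam):
        # Count each token at most once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def spamprob(self, token, s=0.45, x=0.5):
        # Robinson-style smoothing: with no evidence (n == 0) this
        # returns x, the unknown-word probability; as evidence
        # accumulates it converges on the observed ratio p.
        hamratio = self.ham[token] / self.nham if self.nham else 0.0
        spamratio = self.spam[token] / self.nspam if self.nspam else 0.0
        n = self.ham[token] + self.spam[token]
        total = hamratio + spamratio
        p = spamratio / total if total else x
        return (s * x + n * p) / (s + n)

clf = TinyBayes()
clf.train("cheap meds click now".split(), is_spam=True)
clf.train("meeting agenda attached".split(), is_spam=False)
```

After just two messages, "cheap" already scores spammy, "meeting" scores hammy, and unseen tokens sit at 0.5 -- which is why a small personal corpus is a perfectly good starting point.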
> But I know from reading test results here that many individuals
> have already taken the time and effort to do that. So I am asking for
> someone to share that effort -- kind of like Open Source, except on
> SpamBayes training instead of code writing.
I could give you a classifier trained on comp.lang.python traffic plus Bruce
G's 2002 spam collection. Indeed, I used to make such a thing available on
SourceForge. Few people bothered to try it, and those who did reported poor
results on their personal email, so I got rid of it. I don't believe anyone
tried it in the context of corporate email. I won't believe that you're
going to try it until you report that you've already started and are getting
results.