[Spambayes] progress on POP+VM+ZODB deployment

Tim Peters tim.one@comcast.net
Mon Oct 28 20:51:20 2002


[Derek Simkowiak]
> 	I'm not missing the basic point, I'm disagreeing with it.  (You
> can stop with the lengthy examples of one guy who wants commercial mails
> from some particular company or subject domain -- I get it, really, I do.)

Good!

> 	I may personally consider messages from you to be "spam" (not as
> Unsolicited Bulk Email, but simply as unwanted messages).  But I don't
> think it would be the job of a general-purpose installation-wide spam
> identifier to know that about me, as you seem to suggest.

Then you're willing to settle for very little, and I'm glad you're not
running my installation <wink>.

> 	I would want a tool like SpamBayes to flag emails as being like
> the ones in Bruce's collection.  If I like to get mails similar to those,
> then nowhere am I obligated to filter those flagged messages into my
> "Trash" folder.  If I like to get messages similar to those, but only if
> they come from Company X, then I can set up my filters to do that, too.
>
> 	But for the vast majority of people, just knowing that a
> particular email has Bruce-spam-like content would be enough to want to
> filter it into a lower-priority folder, or even directly into Trash.  At
> least, I see it as the job of the postmaster to provide a flag that could
> be used like that.
>
> 	To summarize: I think it's the job of a spam filter (or "flagger")
> to identify those messages universally accepted as being spam -- whether or
> not any one person likes that kind of mail.  And although for any given
> spam there is _somebody_ on Earth who would want to read it, it would be
> up to them to set up their client-app filter rules to work how they want
> them to -- even if that includes running a local installation of SpamBayes
> to do personalized (high-resolution) filtering.

In that case, try this code and see what happens.  Use all defaults, because
they still favor mixed-source corpora so won't suck out "too many" clues
specific to your machines or your recipients.  Generate a starter database
from your own email, and then teach it from the complaints your friendly
workgroup makes.  Put some elbow grease into this!
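
To make the "starter database, then teach it from complaints" loop concrete,
here is a minimal sketch.  It is a toy token-counting scorer, not the
SpamBayes code or its API:  the ToyFilter class, its learn() and score()
methods, the smoothing constants, and the sample messages are all
hypothetical, chosen only to show how one complaint shifts a score.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a trained database; NOT the SpamBayes API."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam msgs containing it
        self.ham = Counter()    # token -> number of ham msgs containing it
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        # Count each token once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def score(self, text):
        # Average of smoothed per-token spam probabilities, in [0, 1].
        # (An assumption for illustration; real combining schemes differ.)
        probs = []
        for tok in set(text.lower().split()):
            s = (self.spam[tok] + 0.5) / (self.nspam + 1)
            h = (self.ham[tok] + 0.5) / (self.nham + 1)
            probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


# Starter database built from (made-up) local mail.
f = ToyFilter()
for m in ["buy cheap pills now", "free money now"]:
    f.learn(m, is_spam=True)
for m in ["meeting notes attached", "lunch at noon today"]:
    f.learn(m, is_spam=False)

# A user complains about a message the filter had no opinion on.
complaint = "quarterly newsletter from vendor"
before = f.score(complaint)
f.learn(complaint, is_spam=True)
after = f.score(complaint)
```

The complaint starts out neutral because none of its tokens have been seen,
and scores higher once it has been taught as spam.  Teaching a wrongly
flagged message as ham moves scores the other way.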

> ...
> 	I think there are a great many people interested in having all
> spam messages treated like interchangeable cogs.  "Spam" meaning a message
> that would be universally accepted as being a "spam".

I'll leave that argument to you and your users now.

> 	I've seen many people on this list use Bruce's spam for their
> training.

I know of two.

> But undoubtedly there is a message in his collection that would
> be of interest to at least *someone* on this list.  Does that invalidate
> his collection as being a spam training repository?

Of course not, but I've removed messages from his spam corpus that don't fit
an appropriate definition of spam for comp.lang.python purposes.  There are
other messages I'd remove from his spam corpus if training for my personal
purposes.  There are some messages that need to be removed for any purposes,
because they were plainly misclassified.

> 	I would say no, it does not, because his collection is of the type
> "universally accepted as spam".  That is the type of message I would like
> to see flagged at Universities, ISPs, and companies.
>
> 	And to do that, I don't think ham training can be in the picture,
> since somebody's "ham" is another person's "spam", and training on
> people's "ham" can only weaken what is considered "universally accepted as
> spam".

Set up a test and measure results.  I expect it will detect "BruceG spam"
quite reliably, but that it will also call many other msgs spam.  The
variety in spam is, I expect, much larger than you presently imagine, and
BruceG's collection includes msgs like this:

"""
Tim,


 It was great to talk to you today I should have the propsal done by
tommorrow


Take Care,

Susan
"""

In fact, it contains *many* msgs like that.  They are in fact spam, but I
doubt you would claim that this msg would be "universally recognized as
spam".  If you don't want msgs "like that" classified as spam, and won't
train on ham too to give it a fighting chance, then you've got weeks of work
of your own to do to try and remove msgs like that from BruceG's (or anyone
else's) spam collection before training.  Our codebase will help you do
that, BTW:  this kind of spam usually does score as spam, but on the low end
of the spam scale.  It's statistically unusual compared to the bulk of the
spam.
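
The cleanup pass described above can be sketched as follows:  score every
message in the spam collection and look by hand at the ones that land on the
low end.  This is a toy illustration only; the tokenizer, the scoring
formula, the review_candidates() helper, and the sample corpora are all
assumptions, and none of it is the SpamBayes API.

```python
from collections import Counter


def tokenize(text):
    # Crude whitespace tokenizer, for illustration only.
    return set(text.lower().split())


def train(messages):
    # token -> number of messages containing that token
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    return counts


def score(msg, spam_counts, ham_counts, nspam, nham):
    # Toy score in [0, 1]: average of smoothed per-token spam probabilities.
    probs = []
    for tok in tokenize(msg):
        s = (spam_counts[tok] + 0.5) / (nspam + 1)
        h = (ham_counts[tok] + 0.5) / (nham + 1)
        probs.append(s / (s + h))
    return sum(probs) / len(probs) if probs else 0.5


def review_candidates(spam_corpus, ham_corpus, k=1):
    """Return the k lowest-scoring messages in the spam corpus."""
    spam_counts = train(spam_corpus)
    ham_counts = train(ham_corpus)
    scored = [
        (score(m, spam_counts, ham_counts, len(spam_corpus), len(ham_corpus)), m)
        for m in spam_corpus
    ]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


# Made-up corpora: three typical spams plus one conversational message.
spam_corpus = [
    "buy cheap pills now",
    "cheap pills buy now free",
    "free money now buy",
    "it was great to talk to you today",
]
ham_corpus = [
    "great to see you today",
    "talk to you tomorrow about the proposal",
    "it was a good meeting",
]

lowest = review_candidates(spam_corpus, ham_corpus, k=1)
```

With these samples the conversational message comes out lowest, which makes
it the first candidate for a manual look before training.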