[Spambayes] an alternative use of filters

Eric S. Johansson esj at harvee.billerica.ma.us
Fri Dec 20 11:44:57 EST 2002


Tim, thank you for your reply.  I hadn't realized it went to the general
mailing list until I got the mailman notice.  I've signed up now so I can
start my "personal archive", as with so many of the other mailing lists
I'm on. ;-)

Tim Peters wrote:
> Three-way classification is the intended use of the spambayes classifier.  A
> msg gets a score from 0.0 (ham) to 1.0 (spam) and there are two configurable
> cutoffs:  msgs with a score below ham_cutoff are called Ham, above
> spam_cutoff Spam, and any score between those Unsure.

Good to know.  It looks like I'll be experimenting with a variety of filter
engines to see how they work.
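
For anyone else comparing engines, the three-way split described above can
be sketched roughly as follows.  This is an illustration only, not the
spambayes code: the function name classify is made up, the cutoff values in
the usage note are arbitrary examples, and treating a score exactly equal to
a cutoff as Unsure is my assumption.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Map a score in [0.0, 1.0] to 'ham', 'unsure', or 'spam'.

    Hypothetical helper for illustration; not part of spambayes.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if ham_cutoff > spam_cutoff:
        raise ValueError("ham_cutoff must not exceed spam_cutoff")
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    # Assumption: a score sitting exactly on a cutoff counts as unsure.
    return "unsure"
```

With example cutoffs of 0.2 and 0.9, classify(0.5, 0.2, 0.9) lands in the
middle band, which is the mystery meat I care about.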

> While experience varies across test sets and care in training, in my
> experience Unsures are, over time, about half spam and half ham.  A curious
> and semi-encouraging thing is that they're overwhelmingly msgs *I* can't
> judge at a glance either, and sometimes it's so hard to tell I just throw
> the msg away as unintelligible.  I call that "semi-"encouraging because, in
> conjunction with camram, I don't believe I'd want Unsures stopped from
> reaching me.  For example, a common class of Unsures is commercial HTML
> email from companies I do business with; e.g., last week I got an Unsure
> that was an auto-generated order receipt for an online order of a software
> program.  I wanted to get the receipt, but the email was very spammish, full
> of ads and links for follow-on offers, and other marketing collateral.  I
> doubt reply email would be seen by a human, so a postage-due scheme probably
> would have dropped it into the bit bucket on both ends.

I feel like I'm living in a bipolar world when it comes to choosing a
solution for dealing with mystery meat.  Non-geek computer users generally
love the idea of canning mystery meat.  The most common attitude is "if it's
important, somebody will call me on the phone".  When I tell them there is a
jail they can go rummaging through if something important goes astray,
objections, for the most part, fall by the wayside.

On the other hand, dealing with geek computer users (gcu) is a different
set of challenges.  Most non-geek computer users (ngcu) and enterprise
organizations want absolutely no knobs or buttons to confuse the user.  In
contrast, gcu seem to want complete control over behavior.  I must plead
guilty to that as well and have had a few vocal ngcu "educate" me.

Now there is nothing stopping the camram filter from passing mystery meat 
through while at the same time sending out a challenge message.  You see, I 
have a cunning plan.  The act of sending out a challenge message creates 
more information that can be used in separating out ham/mystery meat/spam.
Whether the challenge message is deliverable or not, that information can be
fed back into the classifier for further refinement.

> The outstanding feature of the kind of classifier we're using is that it
> adjusts to an individual's notions of what constitutes ham and spam, so this
> kind of mistake is less frequent here than under other systems (for example,
> the order receipt mentioned above wasn't called spam, because the system
> knew I ordered other software of similar nature in the past; but the email
> *would* have been called spam if most other people had received it).  But
> the error rates are, while very low for individual use, still non-zero, and
> I expect they always will be.

Classifiers for spam filtering will work well for people like us because
we're willing to take the effort to train the system.  It's sort of like
speech recognition.  Only one user in 5 succeeds, and success seems to be a
function of the person's ability to consistently train the recognition
engine.  When I describe training processes to non-geek computer users, the
reaction is not kind.  It's best described as "seems too much like work".
It will probably get easier if there are "delete as spam, delete as not
spam" buttons on the user interface, but I'm not holding my breath.

As for error rates, if you want to have some fun, plot out error rates vs.
volume of traffic.  A small ISP has mail volume on the order of 300,000 to
500,000 messages a week.  A 0.1% error rate is 300 to 500 messages MIA.  How
many will trigger a customer support call?  What happens when your volume
reaches 850,000 messages per day?  It gets really interesting.
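
To make the arithmetic concrete, here is a minimal sketch that multiplies
volume by error rate.  The function name expected_errors is made up for
illustration, and it assumes the error rate stays constant as volume grows,
which is a simplification.

```python
def expected_errors(volume, error_rate):
    """Expected number of misfiled messages, rounded to a whole message."""
    return round(volume * error_rate)


ERROR_RATE = 0.001  # the 0.1% figure used above

# Volumes taken from the discussion above.
for label, volume in [
    ("small ISP, low week", 300_000),
    ("small ISP, high week", 500_000),
    ("larger site, one day", 850_000),
]:
    print(f"{label}: {expected_errors(volume, ERROR_RATE)} messages MIA")
```

At 850,000 messages per day the same rate works out to 850 messages a day,
or 5,950 a week.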

> So, if you try this, I suggest setting ham_cutoff very low (below 0.05), and
> spam_cutoff very high (over 0.95).  The median ham score is essentially 0,
> and the median spam score is essentially 1.0, so, while aggressive, this
> isn't quite as extreme as it may sound at first.  The problem I expect
> remains, though:  solicited commercial email, and especially the first few
> times a user gets one from a given vendor, will end up Unsure, and there may
> not be anyone on the other end to respond to a postage nag.

I should probably run these classifiers on my mail stream and plot message 
scores vs. frequency.  I would like to see if there are any interesting 
patterns we can use.
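
A rough sketch of the kind of tally I have in mind is below.  It only
assumes a list of scores between 0.0 and 1.0 is available from some
classifier; the function name score_histogram, the bin count, and the sample
scores are all made up for illustration.

```python
def score_histogram(scores, bins=10):
    """Count how many scores fall into each of `bins` equal-width buckets."""
    counts = [0] * bins
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be between 0.0 and 1.0")
        # A score of exactly 1.0 goes into the last bucket.
        index = min(int(score * bins), bins - 1)
        counts[index] += 1
    return counts


# Invented sample scores, not real measurements.
sample = [0.0, 0.01, 0.02, 0.5, 0.97, 0.99, 1.0]
for i, count in enumerate(score_histogram(sample)):
    low = i / 10
    print(f"{low:.1f}-{low + 0.1:.1f} {'*' * count}")
```

If the medians really sit near 0 and 1, the two end buckets should dominate
and the middle buckets show how much mystery meat there is.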

As for the not-responding problem, yup.  It's the only way camram gets false
positives.  What I am planning on doing is, with users' permission,
harvesting the addresses of senders that camram 1) was able to successfully
send postage due notices to and 2) got no response from after 24 hours.  I
would then try to get these folks to generate stamps so that they can bypass
the whole filter/classifier problem.  I don't expect much success, but it's
not my mail that's getting trapped.
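
The selection rule could look something like the sketch below.  Nothing
here is camram code: the ChallengeRecord structure, its field names, and the
function stamp_candidates are hypothetical, and it assumes "no response"
means no reply has been recorded at all once 24 hours have passed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ChallengeRecord:
    """Hypothetical log entry for one postage due notice."""
    sender: str
    delivered: bool                      # notice was accepted for delivery
    sent_at: datetime
    responded_at: Optional[datetime] = None


def stamp_candidates(records, now, wait=timedelta(hours=24)):
    """Senders whose notice was delivered but drew no reply within `wait`."""
    found = set()
    for rec in records:
        waited_long_enough = now - rec.sent_at >= wait
        if rec.delivered and rec.responded_at is None and waited_long_enough:
            found.add(rec.sender)
    return sorted(found)
```

Harvesting would only run over records from users who gave permission; that
check is left out of the sketch.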

thanks
---eric

PS how do you handle meatloaf?



