[Spambayes] Proposing to remove 4 combining schemes

Sean True seant@webreply.com
Thu Oct 17 14:25:54 2002


> > [Tim]
> >
> >>>Now that I'm playing with a UI (Sean & Mark's code) as a user, I'm
> >>>growing fonder of the non-chi schemes again.  Rational or not, I
> >>>find that the more uniform range of outcomes in [0.0, 1.0] is
> >>>psychologically reassuring when using a UI that throws the scores
> >>>in your face.
> >>
> >
> > [Rob]
> >
> >>But it is unrealistic. Think about the original problem again: "why
> >>can't software that classifies ham/spam be very easy? Almost all
> >>spams scream in your face that they are". With chi_squared
> >>combining we found a method that agrees with this. Most messages
> >>scream either "Ham" or "Spam", and there is very little left to
> >>doubt.
> >
> >
> > But in real life there are also plenty of messages that mislead or
> > defy the human screener (if only for a second), and if these still
> > have a significant chance of becoming a f.p. or f.n., it would be
> > appropriate if the score reflected that uncertainty.
>
> But it does: between one and two percent of all messages deviate
> significantly from 0.0 and 100.0; those are the ones we as humans take
> more than a split second to judge.
>
> > While you're still deciding on how much value you place on
> > f.p. vs. f.n., the score can be very helpful (as long as it has a
> > middle ground).
>
> Sure, but for Joe User, this "should" be uninteresting.
>
> Rob
>

I hate to try to speak for Joe User (like speaking for the "common man",
always a red flag), but I _am_ just a user of these scoring schemes. I have
several hundred messages (commercial email) tucked away in a folder that
score in the non-chi scheme in the range .4 to .6. That score appears to
reflect my own real uncertainty about the value of Motley Fool newsletters.
No snickering, please. A scheme like chi-squared looks like a very good
choice for black-and-white, upstream discards of offers to increase body
part size.

But I don't want these messages automatically discarded upstream, I want
them labelled so that I can deal with them more efficiently.

When I sort this particular folder by spam score, I get MIT club and
Infoworld newsletters at the beginning (the good end), and the Motley
Fool and Edgar Online at the other end, with a range of spam score from .2
to .6. Just right. If I could color them continuously, it would be easy to
spot the ones I want to read, now. And over time, as I change my definition
of spam, their position in the list looks like it will vary smoothly -- and
appropriately.
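
As a rough illustration of the sort-and-color idea, here is a minimal
Python sketch. It assumes scores in [0.0, 1.0]; the function name
score_to_rgb, the green-to-red ramp, and the per-newsletter scores are
all made up for the example and are not part of the Spambayes code.

```python
def score_to_rgb(score):
    """Map a spam score in [0.0, 1.0] to an (r, g, b) tuple.

    Hypothetical helper: 0.0 is pure green (ham), 1.0 is pure red (spam),
    and everything in between is a linear blend.
    """
    s = min(max(score, 0.0), 1.0)  # clamp out-of-range scores
    red = round(255 * s)
    green = round(255 * (1.0 - s))
    return (red, green, 0)


# Invented scores, chosen only to fall in the .2 to .6 range.
messages = [
    ("Motley Fool", 0.60),
    ("MIT club", 0.20),
    ("Infoworld", 0.25),
    ("Edgar Online", 0.55),
]

# Sort from the good end (low score) to the spammy end (high score).
ranked = sorted(messages, key=lambda m: m[1])

for name, score in ranked:
    r, g, b = score_to_rgb(score)
    print(f"{score:.2f}  #{r:02x}{g:02x}{b:02x}  {name}")
```

A scheme whose scores cluster at the two extremes would render nearly
everything as pure green or pure red here; the shading only carries
information when the scores spread across the middle.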

This may not fit your original mission statement, but mission statements
often don't survive contact with the enemy, err, customer.

-- Sean