[spambayes-dev] I took a big step Tuesday...

Tim Peters tim.one at comcast.net
Mon Aug 4 01:30:24 EDT 2003


[Rob Hooft]

Nice to hear from you, Rob!

> People, this is all very unscientific.

Seat-of-the-pants tuning usually is <wink>.

> We have done lots of research in the earlier days of spambayes, and
> have come to the conclusion that there are no more than two useful
> cut-off points. Our false-positives mostly scored hopelessly close
> to the ideal 1.00000000000000000.

Hmm.  That wasn't true of my data:  the only FP I had scoring 1.00 (rounded)
was the message that consisted almost entirely of a full quote of a Nigerian
scam.  That one was hopeless.  All other FP scored below 1.00 (rounded).

> If you find spam boring and want to delete everything above 0.995
> automatically, there is no scientific basis for not cutting at 0.90
> instead.

There's an obvious basis for not doing that, though:  I've seen FP scoring
above 0.90 in day-to-day use, always a piece of HTML email I actually want,
from an online business spambayes hadn't yet been taught about.  OTOH, I've
never seen an FP in day-to-day use that scored 1.00 (rounded), although
*most* spam scores 1.00 (rounded), and most ham scores 0.00 (rounded).  I
think Skip is seeing the same.
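
To make the cutoff argument concrete, here is a minimal sketch of how a
score might be routed.  The function names, the 0.20 ham cutoff, and the
"review" bucket are assumptions for illustration only; the 0.90 and 0.995
figures are the ones discussed above.  This is not spambayes's own code.

```python
# Hypothetical routing of a spambayes-style score in [0.0, 1.0].
# HAM_CUTOFF is an assumed value; the other two come from the discussion.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90
DELETE_CUTOFF = 0.995


def classify(score):
    """Three-way split: ham, unsure, or spam."""
    if score < HAM_CUTOFF:
        return "ham"
    if score < SPAM_CUTOFF:
        return "unsure"
    return "spam"


def action(score):
    """Delete only near-certain spam; keep the rest of the spam for review."""
    label = classify(score)
    if label == "spam":
        return "delete" if score >= DELETE_CUTOFF else "review"
    return label


# An HTML newsletter from a not-yet-trained sender scoring 0.93 survives
# under the 0.995 rule, but would be lost if deletion started at 0.90.
print(action(0.93), action(0.999))
```

Keeping the delete cutoff separate from the spam cutoff is the whole point
of the second threshold:  wanted mail that scores between 0.90 and 0.995
lands in a folder someone can still look at.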

I didn't do any research using the full set of tokenization gimmicks we have
today, and I didn't do any using the kind of training I've fallen into (a
few hundred "random" at the start, followed by a mix of mistake-based and
unsure-based when I felt like it), and I didn't do any on personal email (I
was doing tech mailing-list tests).  It *appears* to be "impossibly" hard
for a mistake to get nailed at the wrong end of the scale for me because my
database remains small (so individual spamprobs aren't getting near 0.00 or
1.00).  It appears to be hard for Skip because he uses much more training
data (than I use), so his spambayes has a more accurate view of his reality.
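
The small-database effect can be sketched with a Robinson-style adjusted
word probability.  This is an illustration under stated assumptions, not a
copy of the spambayes classifier:  the function name and arguments are
hypothetical, and s=0.45 and x=0.5 are assumed values for the unknown-word
strength and unknown-word probability.

```python
def adjusted_spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Blend the observed spam ratio for a word with a prior guess x.

    spam_count / ham_count: trained messages of each kind containing the word.
    nspam / nham: total trained spam and ham messages (assumed nonzero).
    s: weight given to the prior; x: prior probability for an unseen word.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen word: fall back to the prior
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    p = spam_ratio / (spam_ratio + ham_ratio)
    # With few observations (small n) the prior keeps the result away
    # from 0.00 and 1.00; with many, the observed ratio dominates.
    return (s * x + n * p) / (s + n)


# A word seen only in spam, after little training and after much more.
print(adjusted_spamprob(1, 0, 100, 100))
print(adjusted_spamprob(50, 0, 5000, 5000))
```

Under these assumptions a word seen in a single spam stays well short of
1.00, while the same word seen in fifty spams gets much closer, which is
consistent with a small database producing fewer extreme spamprobs.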

Theory simply hasn't kept up with practice here.  That's what happens when
all the theorists die <wink>.



