[spambayes-dev] Results for DNS lookup in tokenizer

Thu Apr 15 02:21:45 EDT 2004

> Since I now have a nifty set of ten buckets, I'm glad to try 
> out other folks' Fabulously Clever Ideas.

Always appreciated!  If you contribute nothing else to SpamBayes (and I'm
sure you will :) simply testing out other people's ideas and letting
everyone know the results helps a lot - especially since not many people
manage to get time to do this these days.  If you want to do more (it gets
addictive, trust me ;) there are all the current x- options...

> Here's the result:
[...]
> Alas, it seems that there's not much advantage there either. 
> The only classification difference seems to be that the 
> number of unsures went up by two.

I should have looked at your original cmp.py posting more closely (and have
now).  I think that you've hit the "Peters barrier", i.e. your results with
the defaults are so good that it's hard to measure whether any changes are
doing you any good or not.
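
A rough way to see why this is hard to measure: put a confidence interval
around an error count that small.  The sketch below is only an illustration,
not part of the SpamBayes test tools; wilson_interval is a helper written for
this example, and HAM_TOTAL is a made-up corpus size, since the real message
counts aren't given in this thread.

```python
import math

def wilson_interval(errors, total, z=1.96):
    """Approximate 95% Wilson score interval for an error rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical corpus size; substitute the real ham count from your run.
HAM_TOTAL = 2000

# Compare the interval for one fp with the intervals for zero and two.
for fp in (0, 1, 2):
    low, high = wilson_interval(fp, HAM_TOTAL)
    print(f"{fp} fp of {HAM_TOTAL}: {low:.4%} .. {high:.4%}")
```

Under that assumed corpus size, the intervals for zero, one and two fp's
overlap heavily, so a change that moves a single message either way can't be
told apart from noise.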

Your defaults run has only one fp and one fn - to improve on this, the new
Fabulously Clever Idea would need to directly target those two messages
(without losing the rest).  The exception would be if the improvement is
all in the unsures - since cmp.py output doesn't mention them, I can't tell
how many there are in the defaults; maybe this is where the room to improve
is.  (If you still have the rates.py output around, could you post table.py
output for the defaults, dns and bigrams runs?)

If you run "fpfn.py ratespyoutputs.txt" (with the appropriate rates.py
output file) it'll spit out a list of the fp's and fn's (all two of them ;)
for that test.  It'd be worth taking a look at these two messages and seeing
what they are.  It might be that they are basically impossible to get right
- for example, a message from someone you've never had mail from before
quoting a spam with a single line addition - that's very difficult to
classify as ham without getting a lot of fn's, too.

=Tony Meyer