[spambayes-dev] Results for DNS lookup in tokenizer

Matthew Dixon Cowles matt at mondoinfo.com
Wed Apr 14 21:40:40 EDT 2004


[me]
> Hm. Well, that's probably enough evidence. A tiny win for me
> and a small loss for you.

[Tony Meyer]
> I don't know if it's enough, but it's likely that it's all you'll
> be able to solicit here <0.1 wink>.

<0.9 chuckle>

> If you go through your spam folder and look at the clues for
> messages that look like the ones that used to be there, do you see
> these tokens?

I do. For example, I have a nonsense spam ("ostrich rimy cowlick
derange...") that has the subject "Our little secret". And its clues
include:

0.908 url-ip:221.5.250.122/32
0.908 url-ip:221.5.250/24
0.908 url-ip:221.5/16
0.965 url-ip:221/8
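
For readers who have not seen the patch, here is a minimal sketch of how
tokens of this shape could be derived from an address. It is not the
patch itself: the function name url_ip_tokens is hypothetical, the DNS
lookup that turns a URL's hostname into an address is left out, and only
IPv4 is handled.

```python
import ipaddress


def url_ip_tokens(ip):
    """Return url-ip tokens for an IPv4 address, most specific first.

    Hypothetical helper: it only mirrors the token format shown in the
    clues above and is not taken from the SpamBayes tokenizer.
    """
    # Raises ValueError if the string is not a valid IPv4 address.
    ipaddress.IPv4Address(ip)
    octets = ip.split(".")
    tokens = []
    # One token per prefix length: the full address, then /24, /16, /8.
    for n_octets, prefix_len in ((4, 32), (3, 24), (2, 16), (1, 8)):
        prefix = ".".join(octets[:n_octets])
        tokens.append("url-ip:%s/%d" % (prefix, prefix_len))
    return tokens


print(url_ip_tokens("221.5.250.122"))
```

Emitting the shorter prefixes as separate tokens lets the classifier
learn about a whole network block even when the exact host has not been
seen before.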

> It could be that the spammers sending these types of messages took
> a holiday this week <0.5 wink>.

<grin> It may also be that sending nonsense spams is a new tactic
among spammers (born of the success of SpamBayes of course) and
testing against spam even a month old won't show much advantage. I
was certainly motivated to try the url-ip thing because of the
unsures I had seen in the previous week or so.

> In any case, if you're happy running from source, then there's
> nothing stopping you keeping the patch going for your own system -
> it seems unlikely that it'll conflict with any tokenizer changes in
> the near future.

Indeed, I plan to. It doesn't seem to do me any harm. I'm mostly
miffed that the value of my Fabulously Clever Idea isn't borne out by
actual testing. I expect that Tim Peters in particular has enormous
sympathy <wink>.

> I suspect that it's that the spams that this helps to nail are
> already nailed with other techniques.

That seems like the most likely explanation.

> I was reading some past messages today and that reminded me to
> suggest that you try (if you haven't already) the x-use_bigrams
> option.  At least some people have found that it's better at
> nailing short spams (although maybe not quite as good at some of
> the more 'talky' spams).  Testing and developer experience (I'm not
> sure if any users have turned the option on) do indicate that
> it's a win overall.
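
As a rough illustration of what a bigram scheme adds, the sketch below
emits every token plus one extra token for each adjacent pair. It is a
simplified assumption, not the SpamBayes implementation: the function
name with_bigrams and the "bi:" prefix are illustrative, and whatever
the real option does when choosing among overlapping clues is not
modelled.

```python
def with_bigrams(tokens):
    """Yield each token, plus a 'bi:' token for each adjacent pair.

    Illustrative only; the prefix and the pairing rule are assumptions.
    """
    previous = None
    for token in tokens:
        yield token
        if previous is not None:
            # Pair the current token with the one just before it.
            yield "bi:%s %s" % (previous, token)
        previous = token


print(list(with_bigrams(["our", "little", "secret"])))
```

Pairs carry some word-order information, which may be why short
messages with few individual clues are reported to benefit.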

Since I now have a nifty set of ten buckets, I'm glad to try out
other folks' Fabulously Clever Ideas. Here's the result:

normals.txt -> bigramss.txt
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied          
mean fp % went from 0.1 to 0.1 tied          

false negative percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fn went from 1 to 1 tied          
mean fn % went from 0.1 to 0.1 tied          

ham mean                     ham sdev
   0.27    0.28   +3.70%        3.13    2.97   -5.11%
   0.36    0.58  +61.11%        3.86    4.91  +27.20%
   0.68    0.92  +35.29%        7.28    8.16  +12.09%
   0.14    0.24  +71.43%        1.03    1.83  +77.67%
   0.31    0.30   -3.23%        2.53    2.78   +9.88%

ham mean and sdev for all runs
   0.35    0.46  +31.43%        4.13    4.71  +14.04%

spam mean                    spam sdev
  99.89   99.77   -0.12%        1.02    1.61  +57.84%
  99.74   99.89   +0.15%        2.99    1.29  -56.86%
  98.92   99.24   +0.32%        5.15    4.27  -17.09%
  98.37   98.38   +0.01%        9.43    8.39  -11.03%
  98.86   98.82   -0.04%        6.36    6.71   +5.50%

spam mean and sdev for all runs
  99.16   99.22   +0.06%        5.79    5.28   -8.81%

ham/spam mean difference: 98.81 98.76 -0.05
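
The percentage columns and the final difference line can be checked by
hand. The sketch below assumes the third column is the relative change
from the first run to the second and that the last line is spam mean
minus ham mean; both assumptions reproduce the printed "all runs"
figures, but the code is not taken from the comparison script.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# "All runs" figures from the tables above (normal run, bigram run).
ham_before, ham_after = 0.35, 0.46
spam_before, spam_after = 99.16, 99.22

print("ham mean change:  %+.2f%%" % percent_change(ham_before, ham_after))
print("spam mean change: %+.2f%%" % percent_change(spam_before, spam_after))

# Separation between the spam and ham means for each run.
sep_before = round(spam_before - ham_before, 2)
sep_after = round(spam_after - ham_after, 2)
print("mean difference: %.2f %.2f %+.2f"
      % (sep_before, sep_after, sep_after - sep_before))
```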

Alas, it seems that there's not much advantage there either. The only
classification difference seems to be that the number of unsures went
up by two.

Regards,
Matt



