[spambayes-dev] Results for DNS lookup in tokenizer

Tony Meyer tameyer at ihug.co.nz
Tue Apr 13 01:59:34 EDT 2004

Have you tried using the x-slurp_urls option as a solution for this problem?
(I'm not saying it's a better solution, just curious if you have, and if so,
what the results were).

> In case anyone would like to play with it, I'll append my trivial
> patch. It requires pydns from:
> http://sourceforge.net/projects/pydns/

This concerns me a bit.  I'd want to see really dramatic results before
something in the core distribution required non-standard libraries to be
installed.  How complex is the code that the patch is using?  Running
timcv.py was *really* slow, too.  I don't know whether this was because a
lot of messages timed out, or because the DNS lookup was slow, or something
else, but it worries me a bit.  Doing the DNS enquiry interactively was very quick, and
at this time of night our DNS server isn't used much at all, so quite
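
For illustration only: the patch itself is not reproduced here, so the
following is a rough sketch of what a standard-library-only lookup could
look like, not the code under discussion.  The function name, the
"url-ip:" token format and the injectable `resolve` argument are all
assumptions made for the example.  Note that the standard resolver call
takes no per-lookup timeout, so this does nothing for the speed concern.

```python
import socket


def ip_tokens(host, resolve=socket.gethostbyname_ex):
    """Yield hypothetical 'url-ip:' tokens for each address of `host`.

    `resolve` must behave like socket.gethostbyname_ex and return
    (hostname, aliases, ip_list).  It is a parameter so that tests can
    substitute a stub and stay off the network.
    """
    try:
        _name, _aliases, addresses = resolve(host)
    except OSError:  # socket.gaierror and socket.herror are subclasses
        yield "url-ip:lookup error"
        return
    for ip in addresses:
        quads = ip.split(".")
        # One token per prefix length, so hosts in the same network
        # block share some clues even when the full address differs.
        for n in (1, 2, 3, 4):
            yield "url-ip:%s/%d" % (".".join(quads[:n]), 8 * n)
```

With a stub resolver that returns 192.0.2.7, this yields one token each
for the /8, /16, /24 and /32 prefixes; a failed lookup yields a single
error token instead.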

Here are my results using timcv.py -n5 with two corpora.  For each corpus,
the cmp.py results come first, followed by a table.py summary that also
includes a run with just the defaults.

The first one (my wife's mail for the last few months) is a win (-1 fn, -4
unsure).  The second one (my work mail for the last few months) is a loss
(two unsures moved into fn in one run, the rest unchanged).

Note that in both of these the standard x-pick_apart_urls option does
nothing (good or bad) for me.

-> <stat> tested 101 hams & 358 spams against 398 hams & 1427 spams
-> <stat> tested 100 hams & 359 spams against 399 hams & 1426 spams
-> <stat> tested 100 hams & 358 spams against 399 hams & 1427 spams
-> <stat> tested 99 hams & 353 spams against 400 hams & 1432 spams
-> <stat> tested 99 hams & 357 spams against 400 hams & 1428 spams
-> <stat> tested 101 hams & 358 spams against 398 hams & 1427 spams
-> <stat> tested 100 hams & 359 spams against 399 hams & 1426 spams
-> <stat> tested 100 hams & 358 spams against 399 hams & 1427 spams
-> <stat> tested 99 hams & 353 spams against 400 hams & 1432 spams
-> <stat> tested 99 hams & 357 spams against 400 hams & 1428 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    0.279  0.279  tied
    0.279  0.279  tied
    0.559  0.559  tied
    2.266  2.266  tied
    2.521  2.241  won    -11.11%

won   1 times
tied  4 times
lost  0 times

total unique fn went from 21 to 20 won     -4.76%
mean fn % went from 1.18076754281 to 1.12474513385 won     -4.74%
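
For readers unfamiliar with this output, the won/tied/lost verdicts and
percentages above are consistent with a plain relative change between
the two runs, where a lower error rate counts as a win.  A minimal
sketch of that arithmetic follows; the function names are made up for
illustration and this is not cmp.py's own code.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


def verdict(before, after):
    """Label a pair of error rates the way the listing above does."""
    if after == before:
        return "tied"
    tag = "won" if after < before else "lost"
    return "%s %+.2f%%" % (tag, pct_change(before, after))


print(verdict(2.521, 2.241))  # fifth fn row above
print(verdict(21, 20))        # total unique fn
```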

ham mean                     ham sdev
   0.00    0.01 +(was 0)        0.04    0.04   +0.00%
   0.49    0.49   +0.00%        4.91    4.91   +0.00%
   0.02    0.01  -50.00%        0.12    0.11   -8.33%
   0.03    0.02  -33.33%        0.21    0.21   +0.00%
   0.01    0.01   +0.00%        0.08    0.08   +0.00%

ham mean and sdev for all runs
   0.11    0.11   +0.00%        2.21    2.21   +0.00%

spam mean                    spam sdev
  96.02   96.11   +0.09%       13.44   13.60   +1.19%
  97.15   97.31   +0.16%       11.27   11.10   -1.51%
  97.12   97.30   +0.19%       11.86   11.89   +0.25%
  94.93   94.92   -0.01%       17.08   17.53   +2.63%
  94.99   95.08   +0.09%       17.16   17.26   +0.58%

spam mean and sdev for all runs
  96.05   96.15   +0.10%       14.40   14.55   +1.04%

ham/spam mean difference: 95.94 96.04 +0.10

filename:       libbys libby_picks libby_pickms
ham:spam:     499:1785    499:1785    499:1785
fp total:            0           0           0
fp %:             0.00        0.00        0.00
fn total:           21          21          20
fn %:             1.18        1.18        1.12
unsure t:          118         119         114
unsure %:         5.17        5.21        4.99
real cost:      $44.60      $44.80      $42.80
best cost:      $11.80      $11.80      $12.00
h mean:           0.11        0.11        0.11
h sdev:           2.21        2.21        2.21
s mean:          96.04       96.05       96.15
s sdev:          14.40       14.40       14.55
mean diff:       95.93       95.94       96.04
k:                5.78        5.78        5.73
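
The rate and "real cost" rows in the table can be reproduced from the
counts.  The sketch below assumes weights of $10 per fp, $1 per fn and
$0.20 per unsure; those weights are an assumption here, though the
figures in the table are consistent with them.  The function name is
hypothetical and this is not table.py's own code.

```python
def summarize(n_ham, n_spam, fp, fn, unsure,
              fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Recompute the rate and cost rows from raw counts.

    The default cost weights are assumed, not taken from the post.
    """
    total = n_ham + n_spam
    return {
        "fp %": round(100.0 * fp / n_ham, 2),          # fp are hams
        "fn %": round(100.0 * fn / n_spam, 2),         # fn are spams
        "unsure %": round(100.0 * unsure / total, 2),  # over all mail
        "real cost": round(fp * fp_cost + fn * fn_cost
                           + unsure * unsure_cost, 2),
    }


# First column of the table: 499 hams, 1785 spams, 0 fp, 21 fn, 118 unsure.
print(summarize(499, 1785, 0, 21, 118))
```

On these weights a single unsure costs a fifth of a false negative,
which is why trading four unsures for one fewer fn still lowers the
real cost in the third column.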

-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    6.870  6.870  tied
    3.125  3.125  tied
    7.813  9.375  lost   +19.99%
    3.906  3.906  tied
    5.469  5.469  tied

won   0 times
tied  4 times
lost  1 times

total unique fn went from 35 to 37 lost    +5.71%
mean fn % went from 5.43654580153 to 5.74904580153 lost    +5.75%

ham mean                     ham sdev
   0.18    0.18   +0.00%        1.77    1.77   +0.00%
   0.01    0.01   +0.00%        0.17    0.17   +0.00%
   0.01    0.01   +0.00%        0.12    0.12   +0.00%
   0.03    0.01  -66.67%        0.39    0.13  -66.67%
   0.28    0.29   +3.57%        3.37    3.38   +0.30%

ham mean and sdev for all runs
   0.10    0.10   +0.00%        1.72    1.71   -0.58%

spam mean                    spam sdev
  88.89   88.89   +0.00%       25.38   25.48   +0.39%
  90.07   90.39   +0.36%       23.20   22.75   -1.94%
  87.23   87.13   -0.11%       28.96   29.35   +1.35%
  90.79   90.92   +0.14%       23.89   23.80   -0.38%
  90.31   90.67   +0.40%       25.99   25.52   -1.81%

spam mean and sdev for all runs
  89.46   89.60   +0.16%       25.59   25.52   -0.27%

ham/spam mean difference: 89.36 89.50 +0.14

filename:    exchanges exchange_picks
ham:spam:     1391:643    1391:643    1391:643
fp total:            0           0           0
fp %:             0.00        0.00        0.00
fn total:           35          35          37
fn %:             5.44        5.44        5.75
unsure t:           83          82          80
unsure %:         4.08        4.03        3.93
real cost:      $51.60      $51.40      $53.00
best cost:      $33.80      $33.80      $33.00
h mean:           0.10        0.10        0.10
h sdev:           1.72        1.72        1.71
s mean:          89.34       89.46       89.60
s sdev:          25.65       25.59       25.52
mean diff:       89.24       89.36       89.50
k:                3.26        3.27        3.29

=Tony Meyer
