[spambayes-dev] Results for DNS lookup in tokenizer

Tony Meyer tameyer at ihug.co.nz
Tue Apr 13 01:59:34 EDT 2004


Have you tried the x-slurp_urls option as a solution to this problem?
(I'm not saying it's a better solution; I'm just curious whether you have
and, if so, what the results were.)

> In case anyone would like to play with it, I'll append my trivial
> patch. It requires pydns from:
> 
> http://sourceforge.net/projects/pydns/

This concerns me a bit.  I'd want to see really dramatic results before
something in the core distribution required non-standard libraries to be
installed.  How complex is the code that the patch is using?  Running
timcv.py was also *really* slow.  I don't know whether that was because a
lot of messages timed out, because the DNS lookups were slow, or something
else, but it worries me a bit.  Doing a DNS query interactively was very
quick, and at this time of night our DNS server is hardly used at all, so
it should be quite responsive.
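
For comparison, here is a minimal sketch of a lookup that needs nothing
outside the standard library.  It is not the patch under discussion: the
function name dns_tokens, the injectable resolver argument, and the
"url-ip:" token format are all made up for illustration.  Note that
socket.gethostbyname blocks until the system resolver answers, so a slow
DNS server would stall a test run here as well.

```python
import socket
from urllib.parse import urlparse


def dns_tokens(url, resolver=socket.gethostbyname):
    """Yield hypothetical 'url-ip:' tokens for the host named in `url`.

    `resolver` is any callable mapping a hostname to a dotted-quad string;
    it defaults to the blocking standard-library lookup.
    """
    host = urlparse(url).hostname
    if not host:
        # No recognisable host part, so there is nothing to look up.
        return
    try:
        ip = resolver(host)
    except OSError:
        # socket.gaierror is a subclass of OSError; record the failure
        # as a token of its own instead of raising.
        yield "url-ip:lookup-failed"
        return
    octets = ip.split(".")
    # One token per prefix length, from the first octet to the full address.
    for n in range(1, len(octets) + 1):
        yield "url-ip:%s/%d" % (".".join(octets[:n]), 8 * n)
```

Passing a fake resolver (for example, lambda host: "192.0.2.10") lets the
token generation be tried without touching the network.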

Here are my results using timcv.py -n5 with two corpora.  For each corpus,
the cmp.py results come first, followed by table.py output that also
includes a run with just the defaults.

The first one (my wife's mail for the last few months) is a win (-1 fn, -4
unsure).  The second one (my work mail for the last few months) is a loss
(two unsures moved into fn in one run, the rest unchanged).
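
The percentages that cmp.py prints next to "won" and "lost" are consistent
with a plain relative change from the first run to the second.  The sketch
below (relative_change is a made-up helper name, not part of cmp.py)
reproduces the figures for the fn totals of the two corpora, 21 -> 20 and
35 -> 37:

```python
def relative_change(before, after):
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# fn totals for the two corpora, as reported in this message.
print("%+.2f%%" % relative_change(21, 20))   # -4.76%
print("%+.2f%%" % relative_change(35, 37))   # +5.71%
```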

Note that in both of these the standard x-pick_apart_urls option does
nothing (good or bad) for me.

-> <stat> tested 101 hams & 358 spams against 398 hams & 1427 spams
-> <stat> tested 100 hams & 359 spams against 399 hams & 1426 spams
-> <stat> tested 100 hams & 358 spams against 399 hams & 1427 spams
-> <stat> tested 99 hams & 353 spams against 400 hams & 1432 spams
-> <stat> tested 99 hams & 357 spams against 400 hams & 1428 spams
-> <stat> tested 101 hams & 358 spams against 398 hams & 1427 spams
-> <stat> tested 100 hams & 359 spams against 399 hams & 1426 spams
-> <stat> tested 100 hams & 358 spams against 399 hams & 1427 spams
-> <stat> tested 99 hams & 353 spams against 400 hams & 1432 spams
-> <stat> tested 99 hams & 357 spams against 400 hams & 1428 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    0.279  0.279  tied
    0.279  0.279  tied
    0.559  0.559  tied
    2.266  2.266  tied
    2.521  2.241  won    -11.11%

won   1 times
tied  4 times
lost  0 times

total unique fn went from 21 to 20 won     -4.76%
mean fn % went from 1.18076754281 to 1.12474513385 won     -4.74%

ham mean                     ham sdev
   0.00    0.01 +(was 0)        0.04    0.04   +0.00%
   0.49    0.49   +0.00%        4.91    4.91   +0.00%
   0.02    0.01  -50.00%        0.12    0.11   -8.33%
   0.03    0.02  -33.33%        0.21    0.21   +0.00%
   0.01    0.01   +0.00%        0.08    0.08   +0.00%

ham mean and sdev for all runs
   0.11    0.11   +0.00%        2.21    2.21   +0.00%

spam mean                    spam sdev
  96.02   96.11   +0.09%       13.44   13.60   +1.19%
  97.15   97.31   +0.16%       11.27   11.10   -1.51%
  97.12   97.30   +0.19%       11.86   11.89   +0.25%
  94.93   94.92   -0.01%       17.08   17.53   +2.63%
  94.99   95.08   +0.09%       17.16   17.26   +0.58%

spam mean and sdev for all runs
  96.05   96.15   +0.10%       14.40   14.55   +1.04%

ham/spam mean difference: 95.94 96.04 +0.10

filename:       libbys libby_picks libby_pickms
ham:spam:     499:1785    499:1785    499:1785
fp total:            0           0           0
fp %:             0.00        0.00        0.00
fn total:           21          21          20
fn %:             1.18        1.18        1.12
unsure t:          118         119         114
unsure %:         5.17        5.21        4.99
real cost:      $44.60      $44.80      $42.80
best cost:      $11.80      $11.80      $12.00
h mean:           0.11        0.11        0.11
h sdev:           2.21        2.21        2.21
s mean:          96.04       96.05       96.15
s sdev:          14.40       14.40       14.55
mean diff:       95.93       95.94       96.04
k:                5.78        5.78        5.73

-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    6.870  6.870  tied
    3.125  3.125  tied
    7.813  9.375  lost   +19.99%
    3.906  3.906  tied
    5.469  5.469  tied

won   0 times
tied  4 times
lost  1 times

total unique fn went from 35 to 37 lost    +5.71%
mean fn % went from 5.43654580153 to 5.74904580153 lost    +5.75%

ham mean                     ham sdev
   0.18    0.18   +0.00%        1.77    1.77   +0.00%
   0.01    0.01   +0.00%        0.17    0.17   +0.00%
   0.01    0.01   +0.00%        0.12    0.12   +0.00%
   0.03    0.01  -66.67%        0.39    0.13  -66.67%
   0.28    0.29   +3.57%        3.37    3.38   +0.30%

ham mean and sdev for all runs
   0.10    0.10   +0.00%        1.72    1.71   -0.58%

spam mean                    spam sdev
  88.89   88.89   +0.00%       25.38   25.48   +0.39%
  90.07   90.39   +0.36%       23.20   22.75   -1.94%
  87.23   87.13   -0.11%       28.96   29.35   +1.35%
  90.79   90.92   +0.14%       23.89   23.80   -0.38%
  90.31   90.67   +0.40%       25.99   25.52   -1.81%

spam mean and sdev for all runs
  89.46   89.60   +0.16%       25.59   25.52   -0.27%

ham/spam mean difference: 89.36 89.50 +0.14

filename:    exchanges exchange_picks exchange_pickms
ham:spam:     1391:643    1391:643    1391:643
fp total:            0           0           0
fp %:             0.00        0.00        0.00
fn total:           35          35          37
fn %:             5.44        5.44        5.75
unsure t:           83          82          80
unsure %:         4.08        4.03        3.93
real cost:      $51.60      $51.40      $53.00
best cost:      $33.80      $33.80      $33.00
h mean:           0.10        0.10        0.10
h sdev:           1.72        1.72        1.71
s mean:          89.34       89.46       89.60
s sdev:          25.65       25.59       25.52
mean diff:       89.24       89.36       89.50
k:                3.26        3.27        3.29
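
The "real cost" rows in both tables can be reproduced by assuming weights
of $10 per fp, $1 per fn and $0.20 per unsure.  Those weights are an
assumption inferred from the numbers in this message, and real_cost is a
made-up helper name rather than anything from table.py:

```python
def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost of a run; the default weights are assumed, see above."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


# (fp total, fn total, unsure total) copied from the two tables.
runs = {
    "libbys": (0, 21, 118),
    "libby_picks": (0, 21, 119),
    "libby_pickms": (0, 20, 114),
    "exchanges": (0, 35, 83),
    "exchange_picks": (0, 35, 82),
    "exchange_pickms": (0, 37, 80),
}

for name, (fp, fn, unsure) in runs.items():
    print("%-16s $%.2f" % (name, real_cost(fp, fn, unsure)))
```

With no false positives in any run, the cost differences come entirely
from the trade-off between fn and unsure counts.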

=Tony Meyer



