[Spambayes] URL parsing improvement ideas

Thu Aug 28 21:50:51 EDT 2003

> 1) Replace % escapes.

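This is the decodes column.  A definite loss: false positives rise from
1 to 2 and unsures from 531 to 558, against only 3 fewer false negatives.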
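
For anyone who wants to try the same idea, here is a minimal sketch of
replacing % escapes before tokenizing.  It is not the patch that was
tested; decode_url_escapes is a hypothetical helper name, and it uses
the Python 3 spelling urllib.parse.unquote.

```python
from urllib.parse import unquote


def decode_url_escapes(url):
    """Replace %XX escapes in a URL with the characters they encode.

    Hypothetical helper for illustration; not part of tokenizer.py.
    """
    return unquote(url)


if __name__ == "__main__":
    # An obfuscated path becomes readable text before tokenizing.
    print(decode_url_escapes("http://example.com/%62%75%79%20now"))
```

The decoded string would then be passed through the normal URL
tokenizing step in place of the raw one.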

> 2) Find server names for ip addresses.

Still running.  This is very slow.  I'll post the results when they
arrive, but it would have to be amazing for it to be worth waiting this
long ;)
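
A rough sketch of the lookup, assuming a plain reverse DNS query per
address.  The cache is an addition for illustration (each address is
resolved at most once per run); server_name_for_ip and its resolver
parameter are hypothetical names, not the code being tested.

```python
import socket

# Remember results so each address is looked up at most once per run.
_lookup_cache = {}


def server_name_for_ip(ip, resolver=socket.gethostbyaddr, cache=_lookup_cache):
    """Return the host name for an IP address, or None if the lookup fails.

    Hypothetical helper; the resolver is injectable so the function can
    be exercised without touching the network.
    """
    if ip not in cache:
        try:
            # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
            cache[ip] = resolver(ip)[0]
        except OSError:
            # socket.herror and socket.gaierror are OSError subclasses.
            cache[ip] = None
    return cache[ip]
```

Failed lookups are cached too, so an unresolvable address costs one
query rather than one per message.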

> 3) Remove numbers from the end of domain names (experimental).
>    www.buythis123.com => url:buythis

This is the no_num_urls column.  No difference.

> Or add a special token for domains ending with a number.

This is the url_end_nums column.  No effective difference (one fewer
fn, two more unsures).
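
The two variants can be sketched as follows.  This only illustrates the
idea, not the tested tokenizer changes: url_tokens, its flags, and the
"url:ends-with-number" token name are all hypothetical, and splitting
the host on dots is an assumption about how the url: tokens are built.

```python
import re

TRAILING_DIGITS = re.compile(r"\d+$")


def strip_trailing_digits(label):
    """Drop digits from the end of a label; leave all-digit labels alone."""
    stripped = TRAILING_DIGITS.sub("", label)
    return stripped or label


def url_tokens(host, strip_nums=False, flag_nums=False):
    """Return url: tokens for a host name (illustrative only).

    strip_nums: www.buythis123.com yields url:buythis.
    flag_nums:  keep the label, but add a marker token when it ends
                with a number (the marker name is made up here).
    """
    tokens = []
    for label in host.lower().split("."):
        # All-digit labels (e.g. parts of an IP address) are not flagged.
        ends_in_number = label[-1:].isdigit() and not label.isdigit()
        if strip_nums:
            label = strip_trailing_digits(label)
        tokens.append("url:" + label)
        if flag_nums and ends_in_number:
            tokens.append("url:ends-with-number")
    return tokens


if __name__ == "__main__":
    print(url_tokens("www.buythis123.com", strip_nums=True))
    print(url_tokens("www.buythis123.com", flag_nums=True))
```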

---

filename:      standards   no_num_urls  url_end_nums       decodes
ham:spam:     7900:15260    7900:15260    7900:15260    7900:15260
fp total:              1             1             1             2
fp %:               0.01          0.01          0.01          0.03
fn total:            225           225           224           222
fn %:               1.47          1.47          1.47          1.45
unsure t:            531           531           533           558
unsure %:           2.29          2.29          2.30          2.41
real cost:       $341.20       $341.20       $340.60       $353.60
best cost:       $540.60       $540.80       $540.40       $547.40
h mean:             0.50          0.50          0.51          0.60
h sdev:             4.25          4.25          4.27          4.72
s mean:            93.44         93.43         93.46         93.46
s sdev:            20.68         20.69         20.64         20.41
mean diff:         92.94         92.93         92.95         92.86
k:                  3.73          3.73          3.73          3.70

---
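
The real cost row can be reproduced from the fp, fn, and unsure totals.
The figures above are consistent with charging $10 per false positive,
$1 per false negative, and $0.20 per unsure.  Those weights are inferred
from the numbers in the table, not stated in this post, so treat them as
an assumption; real_cost is a hypothetical helper.

```python
# Assumed weights in cents, inferred from the table (not stated in the post).
FP_CENTS = 1000      # $10.00 per false positive
FN_CENTS = 100       # $1.00 per false negative
UNSURE_CENTS = 20    # $0.20 per unsure


def real_cost(fp, fn, unsure):
    """Return the total cost in dollars for one test run."""
    cents = fp * FP_CENTS + fn * FN_CENTS + unsure * UNSURE_CENTS
    return cents / 100


if __name__ == "__main__":
    runs = {
        "standards": (1, 225, 531),
        "no_num_urls": (1, 225, 531),
        "url_end_nums": (1, 224, 533),
        "decodes": (2, 222, 558),
    }
    for name, (fp, fn, unsure) in runs.items():
        print("%-13s $%.2f" % (name, real_cost(fp, fn, unsure)))
```

Under these weights the extra false positive in the decodes run costs
more than its 3 fewer false negatives save, which is why it comes out
worst.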

These are only my results, of course, and not guaranteed to match anyone
else's.  If you (or anyone else) have the testing setup ready and would
like to run any of these tests, I can provide patches or alternative
versions of tokenizer.py; just let me know.

Other ideas?  (Testing is fun ;)

=Tony Meyer