[Spambayes] URL parsing improvement ideas
Meyer, Tony
T.A.Meyer at massey.ac.nz
Thu Aug 28 21:50:51 EDT 2003
> 1) Replace % escapes.
This is the decodes column. A definite loss.
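For anyone who wants to try this variant, here is a minimal sketch of the decoding step, not the actual patch. It assumes a current Python, where the function is urllib.parse.unquote; decode_url_escapes is a hypothetical helper name, not something in tokenizer.py.

```python
from urllib.parse import unquote


def decode_url_escapes(url):
    """Replace %XX escapes with the characters they encode.

    Hypothetical helper for illustration; not part of tokenizer.py.
    """
    return unquote(url)


# An obfuscated path such as /%62%75%79%20now decodes to "/buy now".
print(decode_url_escapes("http://www.example.com/%62%75%79%20now"))
```

The decoded string would then be split into url: tokens in the usual way.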
> 2) Find server names for ip addresses.
Still running. This is very slow. I'll post the results when they
arrive, but it would have to be amazing for it to be worth waiting this
long ;)
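A rough sketch of what the lookup involves, under stated assumptions: it uses the standard socket.gethostbyaddr call, and the cache and the server_name_for_ip name are illustrative additions, not taken from the patch. Each uncached lookup is a blocking DNS round trip, which is presumably where the time goes.

```python
import socket

# Cache results so each address is resolved at most once per run.
_cache = {}


def server_name_for_ip(ip, resolver=socket.gethostbyaddr):
    """Return the host name for an IP address, or None if the lookup fails.

    Hypothetical helper; `resolver` is injectable so the function can be
    exercised without touching the network.
    """
    if ip not in _cache:
        try:
            # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
            _cache[ip] = resolver(ip)[0]
        except OSError:
            # socket.herror and socket.gaierror are both OSError subclasses.
            _cache[ip] = None
    return _cache[ip]
```

Failed lookups are cached too, so an address with no reverse record does not trigger a second slow query.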
> 3) Remove numbers from the end of domain names (experimental).
> www.buythis123.com => url:buythis
This is the no_num_urls column. No difference.
> Or add a special token for domains ending with a number.
This is the url_end_nums column. No effective difference.
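Both number-handling variants can be sketched in one function. This is an illustration under assumptions, not the code that was tested: domain_tokens, its two flags, and the "url:ends-with-number" token name are all hypothetical.

```python
import re

# One or more digits at the very end of a host name label.
_TRAILING_DIGITS = re.compile(r"\d+$")


def domain_tokens(host, strip_numbers=False, flag_numbers=False):
    """Turn a host name into url: tokens, one per dot-separated label.

    strip_numbers: drop trailing digits (buythis123 -> buythis).
    flag_numbers:  emit an extra marker token for such labels.
    Labels made up only of digits (e.g. IP address octets) are left alone.
    """
    tokens = []
    for label in host.lower().split("."):
        stripped = _TRAILING_DIGITS.sub("", label)
        if stripped and stripped != label:
            if flag_numbers:
                tokens.append("url:ends-with-number")  # hypothetical token
            if strip_numbers:
                label = stripped
        tokens.append("url:" + label)
    return tokens


print(domain_tokens("www.buythis123.com", strip_numbers=True))
print(domain_tokens("www.buythis123.com", flag_numbers=True))
```

With strip_numbers the example host yields url:buythis, as in the quoted suggestion; with flag_numbers the original url:buythis123 token is kept and the marker token is added alongside it.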
---
filename:       standards url_end_nums  no_num_urls      decodes
ham:spam:      7900:15260   7900:15260   7900:15260   7900:15260
fp total:               1            1            1            2
fp %:                0.01         0.01         0.01         0.03
fn total:             225          225          224          222
fn %:                1.47         1.47         1.47         1.45
unsure t:             531          531          533          558
unsure %:            2.29         2.29         2.30         2.41
real cost:        $341.20      $341.20      $340.60      $353.60
best cost:        $540.60      $540.80      $540.40      $547.40
h mean:              0.50         0.50         0.51         0.60
h sdev:              4.25         4.25         4.27         4.72
s mean:             93.44        93.43        93.46        93.46
s sdev:             20.68        20.69        20.64        20.41
mean diff:          92.94        92.93        92.95        92.86
k:                   3.73         3.73         3.73         3.70
---
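For reading the real cost row: the figures are consistent with charging $10.00 per false positive, $1.00 per false negative and $0.20 per unsure. Those weights are inferred from the numbers in the table, not stated in this message, so treat the sketch below as a sanity check and not as the test harness's own code; real_cost is a hypothetical name.

```python
def real_cost(fp, fn, unsure, fp_weight=10.00, fn_weight=1.00, unsure_weight=0.20):
    """Weighted cost of a run; the default weights are inferred from the table."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


# (fp total, fn total, unsure t) for each column of the table.
runs = {
    "standards": (1, 225, 531),
    "url_end_nums": (1, 225, 531),
    "no_num_urls": (1, 224, 533),
    "decodes": (2, 222, 558),
}

for name, counts in runs.items():
    print(f"{name:>12}: ${real_cost(*counts):.2f}")
```

Under these weights the single extra false positive in the decodes column costs more than its three fewer false negatives save, which is why that column comes out worst.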
These are only my results, of course, and not guaranteed to match anyone
else's. If you (or anyone else) have a testing setup ready and would like
to run any of these, I can provide patches or alternative versions of
tokenizer.py; just let me know.
Other ideas? (Testing is fun ;)
=Tony Meyer