[Spambayes] A couple of small tokenizer experiments.
Anthony Baxter
anthony@interlink.com.au
Tue Nov 12 07:52:19 2002
>>> Tim Peters wrote
> > First experiment was to make the URL tokenizer look for the string
> > 'mailman' in the URL. If it was found, simply push the clue "url:
> > Mailman URL" onto the clue-pile. This was an attempt to remove the
> Can you try this again replacing "break" with "continue"? I can't believe
> you intended break here -- it means that the first time we see a Mailman URL
> in a msg, we stop looking for embedded URLs period. Spam could easily
> exploit that.
--- tokenizer.py 12 Nov 2002 06:21:38 -0000 1.66
+++ tokenizer.py 12 Nov 2002 07:23:30 -0000
@@ -944,6 +944,11 @@
     new_text.append(text[i : start])
     new_text.append(' ')
+    if guts.find('mailman') != -1:
+        pushclue("url: Mailman URL")
+        i = end
+        continue
+
     pushclue("proto:" + proto)
     # Lose the trailing punctuation for casual embedding, like:
     # The code is at http://mystuff.org/here? Didn't resolve.
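
To see why break versus continue matters here, a minimal sketch follows. It is not the real tokenizer.py loop: URL_RE, url_clues and the stop_at_mailman flag are hypothetical names made up for illustration, and the URL pattern is deliberately simplified. With break, a spam URL that follows a Mailman URL produces no clues at all.

```python
import re

# Hypothetical, simplified URL pattern; the real tokenizer's regex differs.
URL_RE = re.compile(r"(?P<proto>https?|ftp)://(?P<guts>[^\s<>\"']+)",
                    re.IGNORECASE)

def url_clues(text, stop_at_mailman=False):
    """Collect URL clues from text (illustrative only).

    stop_at_mailman=True mimics the 'break' version of the patch;
    False mimics the 'continue' version.
    """
    clues = []
    for m in URL_RE.finditer(text):
        proto = m.group("proto").lower()
        guts = m.group("guts").lower()
        if "mailman" in guts:
            # One clue replaces all the per-URL clues for a Mailman link.
            clues.append("url: Mailman URL")
            if stop_at_mailman:
                break      # stops scanning: later URLs are never seen
            continue       # skips only this URL, keeps scanning
        clues.append("proto:" + proto)
        clues.append("url:" + guts.split("/")[0])
    return clues

msg = ("Unsubscribe: http://lists.example.org/mailman/listinfo/foo "
       "Buy now: http://cheap-pills.example.com/order")
with_break = url_clues(msg, stop_at_mailman=True)
with_continue = url_clues(msg, stop_at_mailman=False)
```

In this sketch with_break holds only the Mailman clue, while with_continue also holds the clues for the second URL.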
filename:    new_fromtocc2   new_mailman2
ham:spam:        4800:1600      4800:1600
fp total:                0              0
fp %:                 0.00           0.00
fn total:                6              5
fn %:                 0.38           0.31
unsure t:               97             95
unsure %:             1.52           1.48
real cost:          $25.40         $24.00
best cost:          $19.20         $18.20
h mean:               0.39           0.42
h sdev:               4.48           4.59
s mean:              98.56          98.68
s sdev:               8.62           8.17
mean diff:           98.17          98.26
k:                    7.49           7.70
before:
-> largest ham & spam cutoffs 0.24 & 0.93
-> fp 0; fn 4; unsure ham 25; unsure spam 51
-> fp rate 0%; fn rate 0.25%; unsure rate 1.19%
after:
-> largest ham & spam cutoffs 0.24 & 0.94
-> fp 0; fn 3; unsure ham 27; unsure spam 49
-> fp rate 0%; fn rate 0.188%; unsure rate 1.19%
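
The rates and costs above can be cross-checked with a little arithmetic. The sketch below assumes cost weights of $10 per fp, $1 per fn and $0.20 per unsure; those weights are an assumption, not stated in this message, but they do reproduce the "real cost" and "best cost" rows. The function name summarize is made up for illustration.

```python
def summarize(fp, fn, unsure_ham, unsure_spam, nham=4800, nspam=1600,
              fp_cost=10.0, fn_cost=1.0, unsure_cost=0.20):
    """Rates (in percent) and total cost for one run.

    The cost weights are assumed values, chosen because they match
    the figures reported in the table.
    """
    unsure = unsure_ham + unsure_spam
    return {
        "fp %": 100.0 * fp / nham,                    # fp out of all ham
        "fn %": 100.0 * fn / nspam,                   # fn out of all spam
        "unsure %": 100.0 * unsure / (nham + nspam),  # unsure out of all msgs
        "cost": fp * fp_cost + fn * fn_cost + unsure * unsure_cost,
    }

# Counts at the best cutoffs, taken from the before/after lines.
before = summarize(fp=0, fn=4, unsure_ham=25, unsure_spam=51)
after = summarize(fp=0, fn=3, unsure_ham=27, unsure_spam=49)

for name, run in (("before", before), ("after", after)):
    print("%s: fn %.3f%%  unsure %.2f%%  cost $%.2f"
          % (name, run["fn %"], run["unsure %"], run["cost"]))
```

Both runs have 76 unsures in total, so the unsure rate is unchanged and the one rescued fn accounts for the whole $1.00 difference in best cost.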
It replaces a chunk of closely correlated ham clues with a single clue,
which has the expected result of pushing both the ham and spam mean
scores up slightly. For me, this rescues one fn at the expense of a
couple of extra unsure hams.
This looks like a YMMV one. For me it's a marginal win.
Anthony