[Spambayes] A couple of small tokenizer experiments.

Tue Nov 12 07:52:19 2002

>>> Tim Peters wrote
> > First experiment was to make the URL tokenizer look for the string
> > 'mailman' in the URL. If it was found, simply push the clue "url:
> > Mailman URL" onto the clue-pile. This was an attempt to remove the
> Can you try this again replacing "break" with "continue"?  I can't believe
> you intended break here -- it means that the first time we see a Mailman URL
> in a msg, we stop looking for embedded URLs period.  Spam could easily
> exploit that.

--- tokenizer.py        12 Nov 2002 06:21:38 -0000      1.66
+++ tokenizer.py        12 Nov 2002 07:23:30 -0000
@@ -944,6 +944,11 @@
         new_text.append(text[i : start])
         new_text.append(' ')
 
+        if guts.find('mailman') != -1:
+            pushclue("url: Mailman URL")
+            i = end
+            continue
+
         pushclue("proto:" + proto)
         # Lose the trailing punctuation for casual embedding, like:
         #     The code is at http://mystuff.org/here?  Didn't resolve.
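
To make the break-vs-continue difference concrete, here is a minimal
standalone sketch. It is not the real tokenizer: URL_RE, url_clues() and
the stop_at_first_mailman flag are made up for illustration, and the way
the URL guts are split is much cruder than what tokenizer.py does.

```python
import re

# Hypothetical, simplified URL pattern; the real tokenizer's regex differs.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)

def url_clues(text, stop_at_first_mailman=False):
    """Return a list of clue strings for the URLs embedded in text."""
    clues = []
    for match in URL_RE.finditer(text):
        proto = match.group(1).lower()
        guts = match.group(2)
        if 'mailman' in guts:
            # One summary clue instead of many correlated per-piece clues.
            clues.append("url: Mailman URL")
            if stop_at_first_mailman:
                break      # "break": every later URL in the msg is ignored
            continue       # "continue": only this URL is skipped
        clues.append("proto:" + proto)
        # Crude stand-in for the real per-piece URL tokenizing.
        for piece in re.split(r"[/?&=.:]+", guts):
            if piece:
                clues.append("url:" + piece)
    return clues

msg = ("Unsubscribe at http://mail.python.org/mailman/listinfo/spambayes "
       "and buy now at http://cheap-pills.example.com/buy")
with_continue = url_clues(msg)
with_break = url_clues(msg, stop_at_first_mailman=True)
```

With "continue" the second URL still produces its clues; with "break" a
spammer only has to put a Mailman-looking URL first to hide the rest.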


filename:  new_fromtocc2  new_mailman2
ham:spam:      4800:1600     4800:1600
fp total:        0       0
fp %:         0.00    0.00
fn total:        6       5
fn %:         0.38    0.31
unsure t:       97      95
unsure %:     1.52    1.48
real cost:  $25.40  $24.00
best cost:  $19.20  $18.20
h mean:       0.39    0.42
h sdev:       4.48    4.59
s mean:      98.56   98.68
s sdev:       8.62    8.17
mean diff:   98.17   98.26
k:            7.49    7.70

before:
-> largest ham & spam cutoffs 0.24 & 0.93
->     fp 0; fn 4; unsure ham 25; unsure spam 51
->     fp rate 0%; fn rate 0.25%; unsure rate 1.19%

after:
-> largest ham & spam cutoffs 0.24 & 0.94
->     fp 0; fn 3; unsure ham 27; unsure spam 49
->     fp rate 0%; fn rate 0.188%; unsure rate 1.19%
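
For anyone checking the arithmetic, the rates above can be reproduced from
the raw counts. This is a sketch that assumes fp is taken over the 4800
hams, fn over the 1600 spams, and unsure over all 6400 msgs, which is what
the printed numbers are consistent with; rates() is just a helper name
made up here.

```python
def rates(fp, fn, unsure_ham, unsure_spam, n_ham=4800, n_spam=1600):
    """Return (fp %, fn %, unsure %) under the assumed denominators."""
    fp_rate = 100.0 * fp / n_ham                 # false positives are hams
    fn_rate = 100.0 * fn / n_spam                # false negatives are spams
    unsure_rate = 100.0 * (unsure_ham + unsure_spam) / (n_ham + n_spam)
    return fp_rate, fn_rate, unsure_rate

before = rates(fp=0, fn=4, unsure_ham=25, unsure_spam=51)
after = rates(fp=0, fn=3, unsure_ham=27, unsure_spam=49)
```

The total unsure count is 76 both before and after, so the unsure rate is
unchanged; only the split between ham and spam moves.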

The change replaces a chunk of closely correlated ham clues with a single
clue, which has the expected result of pushing both ham and spam scores up
slightly. For me, this rescues one fn at the expense of a couple of extra
unsure hams.

This looks like a YMMV one. For me it's a marginal win.

Anthony