[Spambayes] A couple of small tokenizer experiments.

Wed Nov 13 06:44:02 2002

[Anthony Baxter, tokenizing mail-address headers]
> I've added this now. For me, tokenising just the 'from' line
> with the new 'address_headers' option gives (vs the old code):
>
> (all tests with 4 sets of 1200H/400S)
>
> filename:  old_from
>                    new_from
> ham:spam:  4800:1600
>                    4800:1600
> fp total:        1       1
> fp %:         0.02    0.02
> fn total:       12      11
> fn %:         0.75    0.69
> unsure t:       86      88
> unsure %:     1.34    1.38
> real cost:  $39.20  $38.60
> best cost:  $31.80  $32.40
> h mean:       0.36    0.36
> h sdev:       4.04    4.05
> s mean:      98.25   98.25
> s sdev:       8.93    8.99
> mean diff:   97.89   97.89
> k:            7.55    7.51
>
> The old code's best cost was:
> -> achieved at ham & spam cutoffs 0.24 & 0.99
> ->     fp 0; fn 3; unsure ham 26; unsure spam 118
> ->     fp rate 0%; fn rate 0.188%; unsure rate 2.25%
>
> The new code's best cost was:
> -> largest ham & spam cutoffs 0.26 & 0.99
> ->     fp 0; fn 4; unsure ham 24; unsure spam 118
> ->     fp rate 0%; fn rate 0.25%; unsure rate 2.22%
>
> The one additional fn was a spam that was dragged from 0.35 to
> 0.21 because it came from 'update@localhost.net' - the 'update'
> was a strong spam clue.

Well, regardless of the reason, the best cost got worse, and it did on my
c.l.py test too, though also by a trivial amount.  I fiddled the tokenization
of this field until it did better again, so please make sure I didn't screw
you too badly <wink>.
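
For anyone checking the arithmetic: the "real cost" and "best cost" rows
quoted above are consistent with charging $10 per fp, $1 per fn and $0.20 per
unsure.  Those weights are inferred from the quoted numbers, not stated in
this thread, and the function name is made up for illustration.  A minimal
sketch under that assumption:

```python
# Assumed per-message costs in cents, inferred from the quoted tables.
FP_COST_CENTS = 1000      # a false positive
FN_COST_CENTS = 100       # a false negative
UNSURE_COST_CENTS = 20    # a message left in the unsure range


def total_cost(fp, fn, unsure):
    """Return the total cost in dollars for the given error counts."""
    cents = (fp * FP_COST_CENTS
             + fn * FN_COST_CENTS
             + unsure * UNSURE_COST_CENTS)
    return cents / 100.0


# old_from "real cost": fp 1, fn 12, unsure 86
print("$%.2f" % total_cost(1, 12, 86))        # $39.20
# old_from "best cost": fp 0, fn 3, unsure ham 26 + unsure spam 118
print("$%.2f" % total_cost(0, 3, 26 + 118))   # $31.80
```

The same weights reproduce the new code's best cost: 4 fn plus 142 unsures
comes to $32.40.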

Something that helped:  it now generates log-count "no real name" metatokens
too for address headers without real-name parts.

        'from:no real name:2**0' 0.933186

then became one of the 40 most-frequent discriminators in my c.l.py data,
and is a strong spam clue.  The good news is that it raised my
lowest-scoring spam from near 0.20 to over 0.27, so at ham_cutoff=0.20
(which I'm using on the c.l.py test), I have no spam close to being called
ham anymore.  The bad news is that it gave me another FP, but it's one of
those useless msgs I don't care about (a two-word "confirm 12345" msg from a
first-time poster sent to a wrong address, using a free email acct that
inserted advertising at the bottom of the msg -- it's always been on the
edge).
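
A rough sketch of how such a log-count metatoken could be generated follows.
Only the 'from:no real name:2**0' spelling comes from the example above; the
function name and the ':name:' and ':addr:' token spellings are assumptions
made for illustration, and this is not necessarily what the checked-in
tokenizer does.

```python
from email.utils import getaddresses
from math import log2


def address_header_tokens(field, header_values):
    """Yield tokens for one address header, e.g. field='from'.

    header_values is a list of raw header strings.  The ':name:' and
    ':addr:' token spellings are hypothetical.
    """
    noname_count = 0
    for name, addr in getaddresses(header_values):
        if name:
            # One token per word of the real-name part.
            for word in name.lower().split():
                yield "%s:name:%s" % (field, word)
        else:
            noname_count += 1
        if addr:
            # Split user@host so each half can be a clue by itself.
            for part in addr.lower().split("@"):
                yield "%s:addr:%s" % (field, part)
    if noname_count:
        # Log-count bucket: 1 -> 2**0, 2 -> 2**1, 3 or 4 -> 2**2, ...
        yield "%s:no real name:2**%d" % (field, round(log2(noname_count)))


print(list(address_header_tokens("from", ["update@localhost.net"])))
```

Bucketing the count by powers of two keeps the number of distinct metatokens
small, so each one is seen often enough to pick up a meaningful spamprob.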

> Where it gets more interesting is when I also tokenize to and cc:

I would hope so <wink>.

> filename:  new_from
>                    new_fromtocc
> ham:spam:  4800:1600
>                    4800:1600
> fp total:        1       1
> fp %:         0.02    0.02
> fn total:        4       5
> fn %:         0.25    0.31
> unsure t:      121     104
> unsure %:     1.89    1.62
> real cost:  $38.20  $35.80
> best cost:  $32.40  $28.00
> h mean:       0.36    0.31
> h sdev:       4.05    3.80
> s mean:      98.25   98.42
> s sdev:       8.99    8.77
> mean diff:   97.89   98.11
> k:            7.51    7.81
>
>
> We go from:
> -> largest ham & spam cutoffs 0.26 & 0.99
> ->     fp 0; fn 4; unsure ham 24; unsure spam 118
> ->     fp rate 0%; fn rate 0.25%; unsure rate 2.22%
>
> to
> -> largest ham & spam cutoffs 0.22 & 0.99
> ->     fp 0; fn 3; unsure ham 25; unsure spam 100
> ->     fp rate 0%; fn rate 0.188%; unsure rate 1.95%
>
> That's a total of 142->125 unsures. I'll accept that :)

Yup, it's a small win.  I can't use it on my c.l.py test, but should be able
to on the general python.org corpus (plus, of course, my own email).
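
As a cross-check on the tables, the "k" row is consistent with dividing the
mean difference by the sum of the ham and spam standard deviations.  That
formula is inferred from the quoted numbers rather than stated in this
thread, and the function name is made up.  A minimal sketch under that
assumption:

```python
def separation_k(ham_mean, ham_sdev, spam_mean, spam_sdev):
    """Return the gap between the mean scores, in units of the summed sdevs.

    Assumed formula: (spam mean - ham mean) / (ham sdev + spam sdev).
    A larger value means the ham and spam score distributions overlap less.
    """
    return (spam_mean - ham_mean) / (ham_sdev + spam_sdev)


# old_from column: h mean 0.36, h sdev 4.04, s mean 98.25, s sdev 8.93
print("%.2f" % separation_k(0.36, 4.04, 98.25, 8.93))   # 7.55
```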

> Just to make sure, ran with a different seed.

... [and another small win] ...

BTW, you should make sure the seeds aren't close together.  For example,
using seed 123 one time, and 124 the next, will give a lot of msg overlap.

> toemail:python.org and toemail:zope.org both show up in
> my 'best discriminators' list as _very_ strong ham clues
> (not surprising, given the mailing lists I'm on).

Well, that's also going to make the spam that slips thru that much harder to
catch.  Of course, after Greg deploys this system, there won't be any more
spam slipping thru <wink>.

> My old/uncommon email addresses generally show up as very strong
> spam clues (eg prob('toemail:arb') = 0.999356)

Cool!