[spambayes-dev] Results for DNS lookup in tokenizer

Matthew Dixon Cowles matt at mondoinfo.com
Thu Apr 15 14:48:12 EDT 2004

[Tony Meyer]
> Your defaults run only has one fp and one fn - to improve on this,
> the new Fabulously Clever Idea would need to directly target those
> two messages (without losing the rest).  Unless the improvement is
> all in the unsures - since cmp.py output doesn't mention them, I
> can't tell how many there are in the defaults; maybe this is where
> the room to improve is.

There is some room to improve the unsures. With the defaults, I get
27 unsures out of 2000 messages (1000 ham and 1000 spam).

> (If you still have the rates.py output around, could you post a
> table.py for the defaults, dns and bigrams outputs?)

Here you go:

filename:       normal     bigrams         dns
ham:spam:    1000:1000   1000:1000   1000:1000
fp total:            1           1           1
fp %:             0.10        0.10        0.10
fn total:            1           1           1
fn %:             0.10        0.10        0.10
unsure t:           27          29          26
unsure %:         1.35        1.45        1.30
real cost:      $16.40      $16.80      $16.20
best cost:      $10.20      $11.60       $9.60
h mean:           0.35        0.46        0.32
h sdev:           4.13        4.71        3.97
s mean:          99.16       99.22       99.16
s sdev:           5.79        5.28        5.79
mean diff:       98.81       98.76       98.84
k:                9.96        9.89       10.13
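
The "unsure %" and "real cost" rows can be cross-checked from the
counts in the table. The following is a minimal sketch, not the
table.py code itself. It assumes per-message costs of $10.00 for a
false positive, $1.00 for a false negative and $0.20 for an unsure;
those weights are an assumption, chosen because they reproduce the
"real cost" row in all three columns. The function and constant names
are hypothetical.

```python
# Assumed per-message costs in dollars. These values are an assumption;
# they are used here because they reproduce the "real cost" row above.
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20


def real_cost(fp, fn, unsure):
    """Total cost of a run given its fp, fn and unsure counts."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


def unsure_percent(unsure, n_ham, n_spam):
    """Unsures as a percentage of all messages scored (ham plus spam)."""
    return 100.0 * unsure / (n_ham + n_spam)


# Unsure counts taken from the table; every run has 1 fp and 1 fn.
runs = {"normal": 27, "bigrams": 29, "dns": 26}

for name, unsure in runs.items():
    pct = unsure_percent(unsure, 1000, 1000)
    cost = real_cost(1, 1, unsure)
    print(f"{name:8s} unsure %: {pct:.2f}  real cost: ${cost:.2f}")
```

Under these assumptions each unsure adds $0.20, so the three runs
differ only by their unsure counts, and the percentages are taken over
all 2000 messages rather than over 1000.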

> If you run "fpfn.py ratespyoutputs.txt" (with the appropriate
> rates.py output file) it'll spit out a list of the fp's and fn's
> (all two of them ;) for that test.  It'd be worth taking a look at
> these two messages and seeing what they are.  It might be that they
> are basically impossible to get right - for example, a message from
> someone you've never had mail from before quoting a spam with a
> single line addition - that's very difficult to classify as ham
> without getting a lot of fn's, too.

The false positive is one I ran into in real life. It's a
confirmation of an order for a pair of headphones. There are lots of
spammy words in it and I don't think I have much other ham from that
company or on that subject. The false negative is harder to explain.
The subject is "Help your employees avoid heat-related illnesses".
It's not the most traditional sort of spam since it doesn't ask me to
buy anything now. Scoring it against my normal database, it gets
0.789. Judging from the evidence reported, it seems that's because I
live in Minneapolis and talk about the weather a lot <22 winks>.


