[Spambayes] It gets funnier all the time....

Skip Montanaro skip at pobox.com
Thu Feb 13 06:16:49 EST 2003


    TimP> I'd hate to see the code bloat with gimmicks that don't prove
    TimP> themselves via testing

    Skip> People asked about decoding stuff that was encoded but didn't have
    Skip> a Content-Transfer-Encoding header.  I suggested the diff I
    Skip> posted.  That's as far as it's gone at this point.

    TimS> Apparently our test corpora didn't include any mail with this
    TimS> problem.

Au contraire.  Using my untouched-since-December ham/spam collections I ran
a 10-fold cross-validation last night.  The summary results are

    filename:     base       cte
    ham:spam:  2000:2000 2000:2000
    fp total:        9       9
    fp %:         0.45    0.45
    fn total:       17      14
    fn %:         0.85    0.70
    unsure t:       94     100
    unsure %:     2.35    2.50
    real cost: $125.80 $124.00
    best cost:  $76.20  $77.60
    h mean:       1.50    1.56
    h sdev:       9.59    9.80
    s mean:      98.03   98.14
    s sdev:      10.91   10.62
    mean diff:   96.53   96.58
    k:            4.71    4.73

"base" is an empty ini file.  "cte" is 

    [Tokenizer]
    assume_missing_cte: True

so in this case at least the false negatives got slightly better (17 -> 14)
and the unsures a bit worse (94 -> 100), with false positives unchanged at 9.
I suspect this is typical of what we'll see with most changes at this stage
of the game: somewhat inconclusive results.  Whether or not to add it is
going to be a judgement call.
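
For anyone who wants to check the arithmetic behind the table, here is a
minimal sketch that recomputes the percentage rows and the "real cost" row
from the raw counts.  The per-error weights ($10 per false positive, $1 per
false negative, $0.20 per unsure) are an assumption, not something stated in
this message; they are used because they reproduce both cost figures exactly.
The function and variable names are made up for illustration.

```python
# Corpus sizes copied from the "ham:spam" row of the table.
N_HAM = 2000
N_SPAM = 2000

# ASSUMED cost weights (not stated in this message): they happen to
# reproduce the "real cost" row for both runs.
FP_COST = 10.00      # dollars per false positive
FN_COST = 1.00       # dollars per false negative
UNSURE_COST = 0.20   # dollars per unsure


def summarize(fp, fn, unsure):
    """Return (fp %, fn %, unsure %, real cost) for one test run."""
    fp_pct = 100.0 * fp / N_HAM                    # fp are misjudged ham
    fn_pct = 100.0 * fn / N_SPAM                   # fn are misjudged spam
    unsure_pct = 100.0 * unsure / (N_HAM + N_SPAM)  # unsures span both
    cost = fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST
    return fp_pct, fn_pct, unsure_pct, cost


# (fp total, fn total, unsure total) copied from the table.
runs = {"base": (9, 17, 94), "cte": (9, 14, 100)}

for name, counts in runs.items():
    fp_pct, fn_pct, unsure_pct, cost = summarize(*counts)
    print("%-5s fp%%=%.2f fn%%=%.2f unsure%%=%.2f real cost=$%.2f"
          % (name, fp_pct, fn_pct, unsure_pct, cost))
```

Under those weights the $1.80 difference in real cost is three fewer false
negatives ($3.00) offset by six more unsures ($1.20).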

A patch which implements this change is attached for anyone who wants to run
the test.
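
To make the idea concrete for readers without the attachment, here is a
rough sketch of one way to guess that a body is base64 when no
Content-Transfer-Encoding header is present.  This is NOT the attached patch
and not Spambayes code; the function name, the regular expression, and the
min_lines threshold are all hypothetical choices made for illustration.

```python
import base64
import binascii
import re

# Hypothetical heuristic: every non-blank line uses only the base64
# alphabet, optionally followed by up to two "=" padding characters.
_BASE64_LINE = re.compile(r'^[A-Za-z0-9+/]+={0,2}$')


def decode_if_unlabelled_base64(body, min_lines=2):
    """Return the decoded text if body looks like base64, else body unchanged.

    body is the text of a message part that carried no
    Content-Transfer-Encoding header.
    """
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        return body  # too short to guess safely
    if not all(_BASE64_LINE.match(ln) for ln in lines):
        return body  # contains spaces, punctuation, etc.: treat as plain text
    joined = "".join(lines)
    if len(joined) % 4 != 0:
        return body  # valid base64 is always a multiple of 4 characters
    try:
        raw = base64.b64decode(joined, validate=True)
    except (binascii.Error, ValueError):
        return body  # looked plausible but did not decode cleanly
    # latin-1 maps every byte to a character, so this never raises.
    return raw.decode("latin-1")
```

A guess like this can misfire on short plain-text bodies that happen to use
only letters and digits, which is one more reason any such option needs to
prove itself in testing before it is turned on by default.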

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diff
Type: application/octet-stream
Size: 2078 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20030213/2dfdb1d1/sb.obj