[Spambayes] It gets funnier all the time....

Tim Stone - Four Stones Expressions tim at fourstonesExpressions.com
Thu Feb 13 07:48:21 EST 2003


2/13/2003 6:16:49 AM, Skip Montanaro <skip at pobox.com> wrote:

>    TimP> I'd hate to see the code bloat with gimmicks that don't prove
>    TimP> themselves via testing

I think the point here is that we're using the email module, which makes quite 
a few assumptions about the "well-formedness" of the mail.  When spammers 
figure this out, they'll unleash a whole lot of crap that most mailers will 
display, but that is so badly formed that Bayesian filters can't extract enough 
from it to place it in the spam category.  That's all they're after.  They don't 
care whether it's classified as ham or unsure, just NOT spam.  So... it behooves 
us to begin to think like a spammer: How can I break this thing?  They'll be 
looking for all the tricks.  Let's find 'em first.
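
To make the "well-formedness" assumption concrete, here is a minimal sketch of the missing-header case discussed in this thread.  It is not Skip's patch: the helper name decoded_body is hypothetical, it only handles non-multipart messages, and the "try base64 if no Content-Transfer-Encoding header" heuristic is an assumption about one way to approach the problem.

```python
import base64
import binascii
import email


def decoded_body(raw_message: str) -> str:
    """Return the body of a non-multipart message as text.

    Hypothetical helper: if there is no Content-Transfer-Encoding header,
    guess that the body might still be base64 and try to decode it.
    """
    msg = email.message_from_string(raw_message)
    # decode=True honours the Content-Transfer-Encoding header when present;
    # with no header the payload comes back as undecoded bytes.
    payload = msg.get_payload(decode=True)
    if payload is None:  # multipart message: out of scope for this sketch
        return ""
    if msg.get("Content-Transfer-Encoding") is None:
        try:
            # Drop line breaks, then insist on the strict base64 alphabet so
            # that ordinary prose is left alone.
            candidate = base64.b64decode(b"".join(payload.split()), validate=True)
            if candidate:
                payload = candidate
        except binascii.Error:
            pass  # not base64 after all; keep the payload as it was
    return payload.decode("latin-1")


# A base64 body sent without the header that should announce it.
hidden = base64.b64encode(b"Buy cheap pills now").decode("ascii")
sneaky = "From: a@example.com\nSubject: hi\n\n" + hidden + "\n"
plain = "From: a@example.com\nSubject: hi\n\nMeeting moved to 3pm, see you there.\n"

print(decoded_body(sneaky))
print(decoded_body(plain))
```

Without the fallback, a tokenizer would only see the opaque base64 text.  The guess can misfire on short bodies that happen to be valid base64, which is one reason a change like this needs to prove itself via testing.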

>
>    Skip> People asked about decoding stuff that was encoded but didn't have
>    Skip> a Content-Transfer-Encoding header.  I suggested the diff I
>    Skip> posted.  That's as far as it's gone at this point.
>
>    TimS> Apparently our test corpora didn't include any mail with this
>    TimS> problem.
>
>Au contraire.  Using my untouched-since-December ham/spam collections I ran
>a 10-fold cross-validation last night.  The summary results are
>
>    filename:     base     cte
>    ham:spam:  2000:2000 2000:2000
>    fp total:        9       9
>    fp %:         0.45    0.45
>    fn total:       17      14
>    fn %:         0.85    0.70
>    unsure t:       94     100
>    unsure %:     2.35    2.50
>    real cost: $125.80 $124.00
>    best cost:  $76.20  $77.60
>    h mean:       1.50    1.56
>    h sdev:       9.59    9.80
>    s mean:      98.03   98.14
>    s sdev:      10.91   10.62
>    mean diff:   96.53   96.58
>    k:            4.71    4.73
>
>"base" is an empty ini file.  "cte" is 
>
>    [Tokenizer]
>    assume_missing_cte: True
>
>so in this case at least the false negatives got slightly better and the
>unsures a bit worse.  I suspect this is typical of what we'll see with most
>changes at this stage of the game - somewhat inconclusive results.  Whether
>or not to add it is going to be a judgement call.
>
>A patch which implements this change is attached for anyone who wants to run
>the test.
>
>Skip
>
>
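
The percentage and "real cost" rows in the table above can be checked by hand.  The sketch below is a worked check, not the test driver's code; the weights ($10 per false positive, $1 per false negative, $0.20 per unsure) are an assumption, chosen because they reproduce the figures shown.

```python
# Assumed cost weights; they reproduce the "real cost" row in the table.
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20

N_HAM = 2000
N_SPAM = 2000


def real_cost(fp: int, fn: int, unsure: int) -> float:
    """Weighted cost of one run's false positives, false negatives and unsures."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


def percent(count: int, total: int) -> float:
    """Count expressed as a percentage of total."""
    return 100.0 * count / total


# (fp, fn, unsure) counts copied from the table.
runs = {"base": (9, 17, 94), "cte": (9, 14, 100)}

for name, (fp, fn, unsure) in runs.items():
    print(
        f"{name}: fp {percent(fp, N_HAM):.2f}%  "
        f"fn {percent(fn, N_SPAM):.2f}%  "
        f"unsure {percent(unsure, N_HAM + N_SPAM):.2f}%  "
        f"real cost ${real_cost(fp, fn, unsure):.2f}"
    )
```

Under these weights the three fewer false negatives save $3.00 and the six extra unsures cost $1.20, which accounts for the $1.80 difference between the two real-cost figures.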


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org




