[Spambayes] Cunning use of quoted-printable

Wed, 02 Oct 2002 22:34:48 +0100

[Tim]
> Richie, what do you have spam_cutoff set to?  I thought your first message
> implied it was set to 0.56.

It is, yes.

[Tim]
> this should not have *been* a false positive

You're right.  Where 'richie.pickle' is my full ~4000-message database:

>>> import cPickle, pprint, tokenizer, classifier
>>> from Options import options
>>> text = open( "Data/Ham/Set4/1641", "rt" ).read()
>>> bayes = cPickle.load( open( "richie.pickle", "rb" ) )
>>> score, clues = bayes.spamprob( tokenizer.tokenize( text ), True )
>>> print options.spam_cutoff, score
0.56 0.402748505794
>>> pprint.pprint( clues )
[('header:Received:5', 0.13592289441927),
 ('from:email addr:biglobe.ne.jp>', 0.15517241379310345),
 ('from:email name:<rxmx7x5x1', 0.15517241379310345),
 ('from:skip:= 30', 0.15517241379310345),
 ('message-id:@biglobe.ne.jp', 0.15517241379310345),
 ('subject:2022', 0.15517241379310345),
 ('subject:IBskQiMxGyhC', 0.15517241379310345),
 ('charset:us-ascii', 0.26241865802854009),
 ('content-type:text/plain', 0.34572203385342953),
 ('subject:ISO', 0.35151428063116696),
 ('header:Message-Id:1', 0.64496476638361089),
 ('x-mailer:none', 0.67584084707587),
 ('subject:=?', 0.69778644753001717),
 ('subject:?=', 0.7215916912471283),
 ('unsubscribe', 0.93148161126231199)]
>>>

But running in the test environment, which uses the same 4000 messages
(subject to a couple of hundred extras being shuffled around by rebal.py),
I get this:

> python timcv.py -n10 --ham=200 --spam=200 -s1

[snip]
-> <stat> 1 new false positives
    new fp: ['Data/Ham/Set4/1641']
******************************************************************************
Data/Ham/Set4/1641
prob = 0.581295852793
prob('header:Received:5') = 0.141997
prob('charset:us-ascii') = 0.26578
prob('content-type:text/plain') = 0.346687
prob('header:Message-Id:1') = 0.648679
prob('x-mailer:none') = 0.674625
prob('subject:=?') = 0.775229
prob('subject:?=') = 0.908163
prob('unsubscribe') = 0.928485

From RxMx7x5x@biglobe.ne.jp Fri May 02 22:21:22 1997
[snip]

What's going on?  The test environment shows far fewer clues, 8 against the
15 in the interactive session above (and my other false positive prints 67 of
them, so it's not a display issue).
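
To make the difference concrete, here is a rough sketch that diffs the two
clue lists.  The clue names are copied from the two outputs above; the
variable names and the missing_from_test() helper are made up for
illustration and are not part of the Spambayes code.

```python
# Clue names copied from the interactive session output (15 clues).
session_clues = [
    'header:Received:5',
    'from:email addr:biglobe.ne.jp>',
    'from:email name:<rxmx7x5x1',
    'from:skip:= 30',
    'message-id:@biglobe.ne.jp',
    'subject:2022',
    'subject:IBskQiMxGyhC',
    'charset:us-ascii',
    'content-type:text/plain',
    'subject:ISO',
    'header:Message-Id:1',
    'x-mailer:none',
    'subject:=?',
    'subject:?=',
    'unsubscribe',
]

# Clue names copied from the timcv.py false-positive listing (8 clues).
test_clues = [
    'header:Received:5',
    'charset:us-ascii',
    'content-type:text/plain',
    'header:Message-Id:1',
    'x-mailer:none',
    'subject:=?',
    'subject:?=',
    'unsubscribe',
]


def missing_from_test(session, test):
    """Hypothetical helper: clues seen in the session but not in the test run.

    Keeps the order in which the clues appear in the session list.
    """
    seen = set(test)
    return [clue for clue in session if clue not in seen]


missing = missing_from_test(session_clues, test_clues)
for clue in missing:
    print(clue)
```

That lists seven clues that show up only in the interactive session: the
three from: tokens, the message-id: token, and three of the subject: tokens
(2022, IBskQiMxGyhC and ISO).  It only shows which clues differ, not why.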

I have a bayescustomize.ini like this:

[TestDriver]
best_cutoff_fp_weight = 10
nbuckets = 100

which I guess shouldn't have any effect on this at all.

-- 
Richie Hindle
richie@entrian.com