[Spambayes] defaults vs. chi-square

T. Alexander Popiel popiel@wolfskeep.com
Mon, 14 Oct 2002 13:29:59 -0700


In message:  <LNBBLJKPBEHFEDALKOLCCEIKBLAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>> The best info for cv2 (chi-square):
>>
>> """
>> -> best cost $48.00
>> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
>> -> achieved at 3 cutoff pairs
>> -> smallest ham & spam cutoffs 0.03 & 0.89
>> ->     fp 3; fn 6; unsure ham 12; unsure spam 48
>> ->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
>> -> largest ham & spam cutoffs 0.03 & 0.9
>> ->     fp 3; fn 6; unsure ham 12; unsure spam 48
>> ->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
>> """
>
>And this seems a lot easier to live with in a world without time machines:
>the middle ground spans a huge range of scores, yet contains a lot fewer
>msgs than under highly-corpus-tuned cv1.
>
>> The histograms for chi-square look pretty much like all the other
>> histograms reported here (big spikes at the ends for the ham and
>> spam, several spread lightly (and fairly evenly) over the middle
>> ground).
>>
>> I must say that I like chi-square best out of all the ones I've
>> tested, since it has fairly obvious points for the cutoffs (I suspect
>> that .05 and .90 are not too far from optimal for just about everyone),
>> and it does have a useful middle ground.
>
>I agree on all counts.
>
>> (The false positives I get from it are fairly hopeless cases:
>> FDIC informing customers that NextBank died, a contractor's bid
>> containing only an encoded .pdf,
>
>That one surprises me:  assuming we threw the body away unlooked-at (we
>ignore MIME sections that aren't of text/* type), it's hard to get enough
>other clues to force a spam score so high.  If possible, I'd like to see the
>list of clues (the "prob('word') = 0.432" thingies in the main output file,
>assuming you have show_false_positives enabled).

Data/Ham/Set5/2745
prob = 0.685540245196
prob('*H*') = 0.535842
prob('*S*') = 0.906922
prob('content-type:application/pdf') = 0.0918367
prob('filename:fname piece:pdf') = 0.0918367
prob('subject:Electrical') = 0.155172
prob('content-type:text/plain') = 0.389566
prob('header:Received:5') = 0.389918
prob('content-type:multipart/mixed') = 0.737422
prob('content-type:multipart/alternative') = 0.948917
prob('&nbsp;') = 0.959269
prob('content-type:text/html') = 0.986282

That's the whole list of probabilities.  I did fib slightly: in
addition to the bid.pdf, there's a one-space-character message
body represented in both plain text and HTML.  Effectively null,
but the classifier doesn't see it that way.  It's that dual body
(the multipart/alternative, &nbsp; and text/html clues above) that's
killing it.
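
For anyone who wants to reproduce the *H*, *S* and prob lines above
from the clue list, here is a minimal sketch of chi-square combining.
The names chi2q and chi_combine are made up for this example, and it
assumes every clue is strictly between 0 and 1 and skips the
underflow protection a real classifier would need for long clue
lists, so treat it as an illustration rather than the project's
actual code.

```python
import math

def chi2q(x2, v):
    """Return P(chi-square >= x2) for an even number v of degrees of freedom."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-clue spam probabilities into (H, S, score).

    Assumes probs is non-empty and every value is in the open
    interval (0, 1); no underflow handling (illustration only).
    """
    n = len(probs)
    ln_h = sum(math.log(p) for p in probs)        # small p -> evidence of ham
    ln_s = sum(math.log(1.0 - p) for p in probs)  # large p -> evidence of spam
    H = 1.0 - chi2q(-2.0 * ln_h, 2 * n)
    S = 1.0 - chi2q(-2.0 * ln_s, 2 * n)
    return H, S, (S - H + 1.0) / 2.0

# The nine clues listed above for Data/Ham/Set5/2745.
clues_2745 = [0.0918367, 0.0918367, 0.155172, 0.389566, 0.389918,
              0.737422, 0.948917, 0.959269, 0.986282]
h, s, score = chi_combine(clues_2745)
print("H = %.4f  S = %.4f  prob = %.4f" % (h, s, score))
```

The printed values should land close to the 0.5358, 0.9069 and 0.6855
reported above.  The last step, prob = (S - H + 1) / 2, can be checked
by hand from the *H* and *S* lines alone.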

>> info requests wrt getting a new mortgage.  The false negatives are a
>> bunch of particularly chatty spams, and one or two with empty bodies.
>> Again, fairly hopeless.)
>
>Long chatty spam has been pretty reliably scoring near 0.5 for me, which has
>been a real advantage of chi combining.  So again I'd really like to see the
>list of clues.

My error... I was looking at the fn output without paying attention
to the listed probs.  Since the fn output is based on the single
cutoff (set at 0.56), it was picking up some of the chatty stuff that
scores in the middle ground.  The real fns (those scoring below the
0.03 ham cutoff) are pretty short, and generally in odd languages or
binary.
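
To make the single-cutoff vs. cutoff-pair difference concrete, here
is a small sketch.  The function names are invented for this example
and the handling of scores exactly on a cutoff is an assumption; the
cutoffs (0.03, 0.89, 0.56), the per-error costs and the two spam
scores are the ones reported in this message.

```python
def classify_pair(score, ham_cutoff=0.03, spam_cutoff=0.89):
    """Three-way decision; boundary handling here is an assumption."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"

def classify_single(score, cutoff=0.56):
    """Two-way decision with one cutoff, as used for the fn output."""
    return "spam" if score > cutoff else "ham"

def total_cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Cost in dollars using the per-error weights quoted at the top."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

# Two spam scores listed in this message.
spam_scores = {
    "Data/Spam/Set3/32": 0.000317545970781,
    "Data/Spam/Set7/352": 0.0344593026264,
}
for name, score in spam_scores.items():
    print(name, classify_single(score), classify_pair(score))

# 3 fp, 6 fn, 12 unsure ham + 48 unsure spam
print("cost = $%.2f" % total_cost(3, 6, 12 + 48))
```

With one cutoff both spams come out as ham, i.e. as fns.  With the
pair, only the first stays a fn and the second lands in the unsure
range.  The cost line works out to 30 + 6 + 12 = $48.00, matching the
best cost quoted above.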

This one looks like a worm:

Data/Spam/Set3/32
prob = 0.000317545970781
prob('*H*') = 0.999926
prob('*S*') = 0.000560844
prob('skip:b 70') = 0.0412844
prob('skip:a 70') = 0.0505618
prob('skip:d 70') = 0.0505618
prob('skip:e 70') = 0.0505618
prob('email name:debian-java-request') = 0.0547407
prob('email addr:lists.debian.org') = 0.0594895
prob('email name:listmaster') = 0.0599834
prob("control: couldn't decode") = 0.0652174
prob('from:email addr:t-online.de>') = 0.0652174
prob('skip:c 70') = 0.0652174
prob('skip:i 70') = 0.0652174
prob('skip:y 70') = 0.0652174
prob('skip:z 70') = 0.0652174
prob('trouble?') = 0.0753369
prob('skip:" 10') = 0.277389
prob('skip:a 20') = 0.295202
prob('content-type:text/plain') = 0.388944
prob('header:Message-Id:1') = 0.6167
prob('email') = 0.787497
prob('x-mailer:microsoft outlook express 5.50.4133.2400') = 0.791262
prob('message-id:@lists.debian.org') = 0.844828
prob('skip:5 70') = 0.844828

And again:

Data/Spam/Set3/2472
prob = 0.0029549796705
prob('*H*') = 0.999949
prob('*S*') = 0.00585924
prob('header:In-Reply-To:1') = 0.000449595
prob('skip:s 70') = 0.0412844
prob('skip:d 70') = 0.0505618
prob('skip:o 70') = 0.0505618
prob('skip:t 70') = 0.0505618
prob("control: couldn't decode") = 0.0652174
prob('skip:c 70') = 0.0652174
prob('skip:i 70') = 0.0652174
prob('skip:l 70') = 0.0652174
prob('skip:z 70') = 0.0652174
prob('from:email addr:mail.com>') = 0.23545
prob('charset:us-ascii') = 0.317057
prob('skip:n 30') = 0.355072
prob('content-type:text/plain') = 0.388944
prob('header:Message-Id:1') = 0.6167
prob('content-disposition:inline') = 0.661659
prob('content-type:multipart/mixed') = 0.696645
prob('x-mailer:microsoft outlook, build 10.0.2616') = 0.97619


This one actually wasn't too long and chatty, but it seemed
to hit a bunch of good words, and was half in French:

Data/Spam/Set6/2011
prob = 0.00173950022128
prob('*H*') = 0.99774
prob('*S*') = 0.00121919
prob('forum') = 0.0121951
prob('url:be') = 0.0302013
prob('email name:debian-java-request') = 0.0341451
prob('email addr:lists.debian.org') = 0.0441114
prob('email name:listmaster') = 0.044487
prob('trouble?') = 0.0604856
prob('des') = 0.0652174
prob('cross') = 0.117486
prob('avec') = 0.155172
prob('est') = 0.155172
prob('firmwares') = 0.155172
prob('progress,') = 0.155172
prob('toute') = 0.155172
prob('...') = 0.180314
prob('occasionally') = 0.184814
prob('still') = 0.237895
prob('but') = 0.249098
prob('skip:" 10') = 0.278104
prob('site') = 0.295343
prob('already') = 0.301798
prob('charset:us-ascii') = 0.308681
prob('after') = 0.341657
prob('x-mailer:microsoft outlook express 6.00.2600.0000') = 0.347036
prob('content-type:text/plain') = 0.390599
prob('header:Reply-To:1') = 0.60073
prob('from') = 0.604083
prob('subject:.') = 0.605015
prob('available') = 0.637633
prob('header:Mime-Version:1') = 0.646706
prob('email') = 0.785132
prob('please') = 0.83219
prob('subject:skip:W 10') = 0.908163
prob('url:') = 0.936848

I don't know what happened to the other fn < 0.03.  Close, but not
quite, is a Nigerian spam (!!!):

Data/Spam/Set7/352
prob = 0.0344593026264
prob('*H*') = 0.999908
prob('*S*') = 0.0688269
prob('indeed') = 0.00556242
prob('aim') = 0.012894
prob('(my') = 0.0145631
prob('manner') = 0.0180723
prob('wrote') = 0.0211545
prob('reminder') = 0.0238095
prob('nigerian') = 0.0266272
prob('december') = 0.0266272
prob('so.') = 0.0281933
prob('okay') = 0.0302013
prob('although') = 0.0350768
prob('numbered') = 0.0412844
prob('ratio') = 0.0446266
prob('opposed') = 0.0481336
prob('apparently,') = 0.0505618
prob('revert') = 0.0505618
prob('officer') = 0.0505618
prob('subsequently') = 0.0505618
prob('patience') = 0.0505618
prob('however') = 0.0524146
prob('overcome') = 0.0599022
prob('fixed') = 0.0617239
prob('infer') = 0.0652174
prob('presumed') = 0.0652174
prob('filename:fname piece:txt') = 0.0652174
prob('therefore') = 0.0838752
prob('attempts') = 0.0874263
prob('expert,') = 0.0918367
prob('calendar') = 0.0918367
prob('travelling') = 0.0918367
prob('nigeria.') = 0.0918367
prob('apparently') = 0.0929593
prob('forwarding') = 0.106987
prob('saw') = 0.107116
prob('thus') = 0.110275
prob('did') = 0.112618
prob('concern') = 0.114396
prob('especially') = 0.125537
prob('finally,') = 0.126719
prob('shall') = 0.135258
prob('worked') = 0.138554
prob('point') = 0.154593
prob('totaling') = 0.155172
prob('proposition') = 0.155172
prob('6th') = 0.155172
prob('actively') = 0.165428
prob('since') = 0.166612
prob('knows') = 0.169148
prob('which') = 0.172635
prob('necessary') = 0.182854
prob('source') = 0.183395
prob('routine') = 0.189922
prob('driven') = 0.205305
prob('got') = 0.206143
prob('reality') = 0.206601
prob('light') = 0.207284
prob('skip:h 20') = 0.211375
prob('some') = 0.214937
prob('there') = 0.219934
prob('same') = 0.227242
prob('still') = 0.238027
prob('but') = 0.254404
prob('according') = 0.254563
prob('very') = 0.256327
prob('skip:m 10') = 0.258633
prob('stand') = 0.260226
prob('died') = 0.263314
prob('branch') = 0.263314
prob('zero') = 0.26593
prob('number') = 0.267526
prob('them') = 0.274205
prob('large') = 0.27431
prob('his') = 0.276565
prob('transaction') = 0.281659
prob('consultant') = 0.283198
prob('reason') = 0.288324
prob('dead') = 0.288434
prob('trace') = 0.29021
prob('mr.') = 0.292388
prob('part') = 0.294772
prob('when') = 0.297739
prob('ask') = 0.299886
prob('already') = 0.299963
prob('listing') = 0.310964
prob('given') = 0.311411
prob('down') = 0.311983
prob('charset:us-ascii') = 0.312457
prob('being') = 0.312739
prob('federal') = 0.695627
prob('president') = 0.697044
prob('safely') = 0.700267
prob('notification') = 0.700364
prob('information') = 0.703131
prob('skip:r 10') = 0.706302
prob('inform') = 0.707612
prob('brought') = 0.70783
prob('your') = 0.710937
prob('complete') = 0.711206
prob('content-type:application/octet-stream') = 0.718341
prob('country.') = 0.718341
prob('immediately') = 0.727163
prob('further') = 0.728674
prob('obtained') = 0.732221
prob('risk') = 0.747156
prob('content-type:multipart/mixed') = 0.751609
prob('contract') = 0.754669
prob('informed') = 0.75788
prob('business') = 0.761283
prob('internet') = 0.768097
prob('phone') = 0.774467
prob('questions') = 0.795045
prob('money,') = 0.796192
prob('bank') = 0.801151
prob('succeed') = 0.805677
prob('settled') = 0.810078
prob('month') = 0.811997
prob('claim') = 0.812913
prob('confidential') = 0.815186
prob('money.') = 0.8156
prob('our') = 0.820323
prob('please') = 0.828641
prob('months,') = 0.829218
prob('fund') = 0.83557
prob('national') = 0.835796
prob('sent') = 0.837147
prob('blood') = 0.843797
prob('asked,') = 0.844828
prob('treasury') = 0.844828
prob('address') = 0.860353
prob('reply') = 0.864689
prob('achieving') = 0.87037
prob('money') = 0.878353
prob('70%') = 0.880818
prob('million') = 0.885051
prob('corporation') = 0.891198
prob('free') = 0.90477
prob('approval') = 0.904949
prob('x-mailer:microsoft outlook express 5.00.2919.6900 dm') = 0.908163
prob('modalities') = 0.908163
prob('employment') = 0.912574
prob('claim.') = 0.915225
prob('skip:y 10') = 0.922406
prob('deposit') = 0.929253
prob('wish') = 0.930416
prob('credit') = 0.941699
prob('valued') = 0.950726
prob('guaranteed') = 0.956906
prob('honored') = 0.958716
prob('message-id:@ucsu.colorado.edu') = 0.965116
prob('conservative') = 0.983271

All of you folks _talking_ about the Nigerian spams have turned them
into ham for me! ;-)

- Alex