[Spambayes] cancellation disease again?

Anthony Baxter anthony@interlink.com.au
Mon Oct 21 10:34:55 2002


I think I'm seeing what's been referred to as cancellation disease again,
using chi combining. Very long spams (like those interminable MLMs with
the "5 reports") are getting both *H* and *S* scores at or near 1, and a
final score of 0.5.

E.g. the perfectly standard "send money for 5 reports" spam gets:

prob = 0.500000000004
prob('*H*') = 1
prob('*S*') = 1
prob('sent:') = 0.000670741
prob('indeed') = 0.00248756
prob('place.') = 0.0025729
prob('obviously') = 0.00272893
prob('missed') = 0.0033358
prob('persistent') = 0.00378469
prob('replaced') = 0.00455005
prob('something.') = 0.00542823
prob('george') = 0.00617284
prob('think.') = 0.00617284
prob('"no') = 0.0065312
prob('happen.') = 0.00672646
prob('him.') = 0.00672646
prob('basically,') = 0.00819672
prob('key.') = 0.00850662
prob('"why') = 0.00884086
prob('correctly,') = 0.00920245
prob('sorry.') = 0.00920245
prob('"just') = 0.00959488
prob('hopes') = 0.00959488
prob('initially') = 0.00959488
prob('(so') = 0.0104895
prob('it;') = 0.0104895
prob('assumes') = 0.0110024
prob('at.') = 0.0110024
prob('everyone,') = 0.0110024
prob('myself.') = 0.0110024
prob('determining') = 0.0115681
prob('problem).') = 0.0115681
prob('falling') = 0.0121951
prob('received,') = 0.0121951
prob("don't,") = 0.012894
prob('stanford') = 0.012894
prob('struggling') = 0.012894
prob('directions') = 0.0136778
prob('jonathan') = 0.0145631
prob('portland,') = 0.0145631
prob('privately') = 0.0145631
prob('sometime') = 0.0145631
prob('saying,') = 0.0155709
prob('gained') = 0.0180723
prob('sized') = 0.0180723
prob('belief') = 0.0196507
prob('fortunately,') = 0.0196507
prob('goodness') = 0.0196507
prob('encounters') = 0.0215311
prob('scratch.') = 0.0215311
prob('trash') = 0.0215311
prob('build.') = 0.0238095
prob('exactly,') = 0.0238095
prob('invested') = 0.0238095
prob('pressed') = 0.0238095
prob('me;') = 0.0266272
prob('work...') = 0.0266272
prob('financially') = 0.973743
prob('responses.') = 0.974053
prob('money.') = 0.974232
prob('ordering') = 0.974234
prob('wolf') = 0.974514
prob('remember,') = 0.974677
prob('residual') = 0.975263
prob('guidelines') = 0.975736
prob('downline') = 0.97619
prob('investing') = 0.976423
prob('response,') = 0.976574
prob('investment') = 0.976809
prob('goes,') = 0.976946
prob('pencil') = 0.9772
prob('me!') = 0.977468
prob('envelope') = 0.977667
prob('involved.') = 0.97782
prob('recession') = 0.978047
prob('following.') = 0.978188
prob('lately.') = 0.978188
prob('legal.') = 0.978188
prob('receive,') = 0.97834
prob('in!') = 0.978469
prob('devoted') = 0.978815
prob('orders') = 0.979431
prob('wife,') = 0.979852
prob('purchase') = 0.979994
prob('subject:YOUR') = 0.980474
prob('tested,') = 0.980495
prob('plan.') = 0.980747
prob('materials') = 0.981284
prob('friend,') = 0.981371
prob('opportunity.') = 0.981474
prob('$5,000') = 0.981496
prob('income,') = 0.981928
prob('$50,000') = 0.981962
prob('gambling') = 0.982318
prob('$25') = 0.982672
prob('chicago,') = 0.982771
prob('secrets') = 0.982771
prob('resell') = 0.982897
prob('letter,') = 0.983163
prob('#4.') = 0.983271
prob('e-mails') = 0.983483
prob('currency)') = 0.983805
prob('instructed') = 0.984241
prob('live.') = 0.984241
prob('success:') = 0.985
prob('exceedingly') = 0.985702
prob('her,') = 0.985702
prob('reach.') = 0.98603
prob('earn') = 0.986397
prob('e-mailed') = 0.986405
prob('profits.') = 0.986641
prob('e-mail,') = 0.987065
prob('profit!') = 0.987106
prob('subject:Money') = 0.987106
prob('500,000') = 0.987464
prob('invaluable') = 0.987784
prob('independent.') = 0.988086
prob('marketing,') = 0.988432
prob('crammed') = 0.988647
prob('mitchell.') = 0.988647
prob('p.o.') = 0.988647
prob('prohibiting') = 0.988989
prob("'knew'") = 0.988998
prob("so'") = 0.988998
prob('orders,') = 0.989157
prob('profitable') = 0.989427
prob('reports!') = 0.98951
prob('ordered.') = 0.990472
prob('advertise.') = 0.990959
prob('imagined.') = 0.991185
prob('originator') = 0.991185
prob('$500,000') = 0.991603
prob("1,000's") = 0.991603
prob('feet.') = 0.991603
prob('grumbled') = 0.991603
prob('50,000') = 0.991984
prob('concealed') = 0.991984
prob('year!!!') = 0.992846
prob('refinance') = 0.993653
prob('accurately!') = 0.994148
prob('cash,') = 0.994148
prob('relax,') = 0.994297
prob('spouting') = 0.994438
prob('instructed.') = 0.994572
prob('jody') = 0.994572
prob('merciless') = 0.994572
prob('(u.s.') = 0.994822
prob('income') = 0.994933
prob('multilevel') = 0.994938
prob('ordering,') = 0.995156
prob('e-mails.') = 0.995258
prob('money!') = 0.99579
prob('message-id:@yarrina.connect.com.au') = 0.998453
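
For anyone who wants to reproduce the effect, here is a minimal Python
sketch of how strong clues on both sides can cancel out under chi combining.
It assumes the usual chi-squared scheme: H and S each come from a chi-squared
test on the summed logs of the clue probabilities, and the final score is
(S - H + 1) / 2. The names chi2q and chi_combine and the synthetic clue
values are made up for illustration; they are not the clues listed above and
not a copy of the classifier code.

```python
import math


def chi2q(x2, v):
    """Return P(chi-squared >= x2) for v degrees of freedom (v must be even)."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-clue spam probabilities; return (H, S, final score)."""
    n = len(probs)
    # Strong spam clues (p near 1) make ln(1 - p) very negative, driving S up.
    ln_for_s = sum(math.log(1.0 - p) for p in probs)
    # Strong ham clues (p near 0) make ln(p) very negative, driving H up.
    ln_for_h = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_for_s, 2 * n)
    h = 1.0 - chi2q(-2.0 * ln_for_h, 2 * n)
    return h, s, (s - h + 1.0) / 2.0


# Synthetic message (assumption): 50 strong ham clues and 50 strong spam clues.
mixed = [0.01] * 50 + [0.99] * 50
h, s, score = chi_combine(mixed)
print("H=%r S=%r score=%r" % (h, s, score))

# For contrast: the same 50 spam clues with no ham clues at all.
h2, s2, score2 = chi_combine([0.99] * 50)
print("H=%r S=%r score=%r" % (h2, s2, score2))
```

With the mixed list, H and S both come out at essentially 1 and the score
sits at about 0.5, which is the pattern in the message above. With only the
spam clues, H drops to about 0 and the score goes to about 1.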


I'm not sure what the best way to approach this is....

Anthony