[Spambayes] Re: Training oddity/confusion

Mathew Hendry TJLWBECGSGWU at spammotel.com
Fri Jan 14 19:47:50 CET 2005


On Fri, 14 Jan 2005 09:19:37 +1300, "Tony Meyer" <tameyer at ihug.co.nz> wrote:

>> >With 'classic' train to exhaustion, the database is kept exactly 
>> >balanced, I believe.  How well is your system working for you?
>> 
>> Erm, not all that well. :|
>
>:(  I'm trying to get things rearranged a little for 1.1 so that it's easier
>to try out different training regimes (including tte) with the various apps,
>so hopefully that'll help.

Ah, that sounds good. You mean by making it easier to tweak the training
code, or by exposing more options in the user interface?

>> My incoming mail is very unbalanced - 17:1 spam:ham since I 
>> started the training - which can't help, but so far I have 
>> 18% unsure spam and 3% false negatives. No mistakes on ham 
>> though; none scored higher than 0.5%. Given that, I suppose I 
>> could simply mess with the thresholds.
>
>I've read reports of people who have done that (in an extreme way, so that
>the cutoffs are 5% and 10% or something like that).  It seems pretty risky
>to me, though, since a message that contains nothing that has been seen
>before will score 0.5 and that would be the same under that system...

I don't think I'd need to go that far. Most of the unsure spam I get is in
the 70-90% range. The FNs are all 419-type scams - got to give the Nigerians
points for effort, they're laboriously written and different every time.
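
To make the threshold trade-off concrete, here is a minimal sketch of how two cutoffs split a score into ham, unsure and spam. The function name is hypothetical, the defaults of 0.20 and 0.90 are assumed to match the stock ham_cutoff and spam_cutoff options, and the exact boundary handling (< versus <=) is an assumption too.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a combined score in [0, 1] to 'ham', 'unsure' or 'spam'.

    Hypothetical helper; cutoffs and boundary behaviour are assumptions.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


# Spam scoring in the 70-90% range is "unsure" with the assumed defaults...
print(classify(0.80))                     # unsure
# ...and becomes "spam" once the spam cutoff is lowered to 0.70.
print(classify(0.80, spam_cutoff=0.70))   # spam
# With extreme 5%/10% cutoffs, a message with no known tokens (0.5) is spam.
print(classify(0.50, ham_cutoff=0.05, spam_cutoff=0.10))  # spam
```

Lowering only the spam cutoff to about 0.70 still leaves a never-seen message (score 0.5) in the unsure range, which is the risk with the extreme cutoffs.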

BTW, I'm also seeing better results since finally re-enabling my
SpamCopAndAssassin patch and retraining (was running vanilla 1.0.1 before).
The URL blacklist support (http://surbl.org) recently added to SpamAssassin
seems to make for particularly good spam clues; from the most recent spam to
come in:

Combined Score: 100% (0.999917)
Internal ham score (*H*): 2.23853e-005
Internal spam score (*S*): 0.999857

# ham trained on: 187
# spam trained on: 193

29 Significant Tokens
token                               spamprob         #ham  #spam
'out:'                              0.0918367           2      0
'viagra,'                           0.155172            1      0
'to:addr:[munged]'                  0.299867           84     37
'url:index'                         0.301585           16      7
'url:com'                           0.603076           79    124
'private'                           0.611308            3      5
'over'                              0.616477           18     30
'discount'                          0.648476            2      4
'save'                              0.654963            5     10
'proto:http'                        0.656739           87    172
'url:'                              0.694878           22     52
'sell'                              0.719354            1      3
'prescription'                      0.801118            2      9
'header:Received:8'                 0.802243            7     30
'70%'                               0.805954            1      5
'sa_rule:3.0:DRUGS_ERECTILE'        0.84212             4     23
'drugs.'                            0.844828            0      1
'shipping!'                         0.844828            0      1
'subject:discount'                  0.844828            0      1
'x-mailer:microsoft outlook [snip]  0.89925             1     11
'generic'                           0.908163            0      2
'subject:without'                   0.908163            0      2
'sa_rule:3.0:URIBL_SBL'             0.939819            6    100
'required!!'                        0.949438            0      4
'sa_rule:3.0:URIBL_AB_SURBL'        0.965581            2     64
'thanks:'                           0.969799            0      7
'sa_rule:3.0:URIBL_WS_SURBL'        0.973332            3    121
'sa_rule:3.0:URIBL_OB_SURBL'        0.976995            2     97
'sa_rule:3.0:URIBL_SC_SURBL'        0.977448            2     99
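
The numbers in the report above can be reproduced with a short sketch. It assumes the classifier's default unknown-word probability (0.5) and unknown-word strength (0.45), and that the combined score is (S - H + 1) / 2 as in chi-squared combining. The function names are hypothetical, and details of the real classifier (such as clamping ratios) are left out.

```python
def token_spamprob(hamcount, spamcount, nham, nspam,
                   unknown_prob=0.5, unknown_strength=0.45):
    """Estimate a token's spam probability from its training counts.

    hamcount/spamcount: messages of each class containing the token.
    nham/nspam: total messages trained on for each class.
    The two unknown_* defaults are assumed, not taken from the report.
    """
    n = hamcount + spamcount
    if n == 0:
        return unknown_prob  # never-seen token: no evidence either way

    # Use per-class ratios so an unbalanced training set does not bias p.
    hamratio = hamcount / nham if nham else 0.0
    spamratio = spamcount / nspam if nspam else 0.0
    p = spamratio / (hamratio + spamratio)

    # Pull p toward unknown_prob; the pull weakens as n grows.
    return (unknown_strength * unknown_prob + n * p) / (unknown_strength + n)


def combined_score(h, s):
    """Combine the internal ham (*H*) and spam (*S*) scores into one value."""
    return (s - h + 1.0) / 2.0


NHAM, NSPAM = 187, 193  # training counts from the report

print(round(token_spamprob(0, 1, NHAM, NSPAM), 6))    # 'drugs.'
print(round(token_spamprob(84, 37, NHAM, NSPAM), 6))  # 'to:addr:[munged]'
print(round(combined_score(2.23853e-05, 0.999857), 6))
```

This also shows why a token seen only once in spam gets 0.844828 rather than 1.0: a single sighting is not enough to override the 0.5 prior.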

Some spammers have now resorted to removing explicit links from their spam
and asking recipients to cut and paste an address into their browser,
apparently to avoid their URLs automatically being picked up and added to
these blacklists.

-- Mat.



