[Spambayes] Weird Spam

Mon Feb 16 23:08:10 EST 2004

> I've got SpamBayes trained up pretty well (872 good and 478 
> spam in my database).  I just received the email you see 
> attached, which slipped by the filter with a 1% score.

With my current db, it scores 99%. (Together, we're a perfect 100 <wink>.)
It's really the clues that are interesting - I'll attach mine at the end
(note the large number of spam hapaxes, i.e. tokens seen only once in
training, and only in spam, which carry the day).

> The form of the spam seemed interesting, as if it was specifically 
> designed to elude this type of filter.

It seems to be a combination of two avoidance techniques - 'word salad' and
'mini spam'.  The first accounts for the 'random' words, almost all of which
don't exist in my db (the theory is that 'word salad' doesn't help against
statistical content filters because random words are just as likely to hit
tokens in the spam db, or in neither db, as tokens in the ham db).  If the
words were specifically harvested to target me, though (the results of an
"I'm feeling lucky" Google search for "Tony Meyer", for example), that would
be another story.

(You'll see in my clues that I don't talk much about Elena, the camel who
is heir to Sacramento, and her admission of bigotry <wink>.)

"Mini spam" could be a problem if not enough tokens are generated
(ironically, for me, the harm from the mini-spam was countered by the use of
word salad).  One way around this is to tokenize the content that the URL
points to (there is an experimental option in SpamBayes to do this).  I've
got this turned on, but it's only used when needed, and it wasn't needed
here (for me).  Quite often the URL itself, and the headers, are enough to
counter this.
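
Since the URL itself is often a useful clue, here is a rough sketch of how a
URL might be broken into tokens shaped like the 'proto:http', 'url:es' and
'url:gif' clues in my list at the end.  Note that url_tokens is a
hypothetical helper and the example URL is made up; this only approximates
the idea and is not the actual SpamBayes tokenizer.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into 'proto:' and 'url:' tokens (illustrative only)."""
    parts = urlsplit(url)
    tokens = []
    if parts.scheme:
        tokens.append("proto:" + parts.scheme.lower())
    # Break the host and path on anything that is not a letter or digit.
    rest = parts.netloc + parts.path
    for piece in re.split(r"[^A-Za-z0-9]+", rest):
        if piece:
            tokens.append("url:" + piece.lower())
    return tokens


# Made-up URL, purely for illustration.
print(url_tokens("http://www.example.es/images/promo.gif"))
```

Even a message with almost no body text still yields several tokens this
way, which is why the URL and headers alone can be enough.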

=Tony Meyer

Combined Score: 99% (0.993322)
Internal ham score (*H*): 0.00658669
Internal spam score (*S*): 0.99323
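
The three figures above are related by simple arithmetic, assuming the
chi-squared combining scheme that SpamBayes uses by default: the combined
score is (S - H + 1) / 2.  A minimal sketch (combined_score is my own name
for illustration, not a SpamBayes function):

```python
def combined_score(spam_score, ham_score):
    """Fold the internal *S* and *H* scores into one value in [0, 1]."""
    return (spam_score - ham_score + 1.0) / 2.0


# Values copied from the clue report above.
print("%.6f" % combined_score(0.99323, 0.00658669))  # 0.993322
```

When S and H are both near 0.5, or both near 1, the result sits near 0.5,
which is what lands a message in "unsure".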

# ham trained on: 100
# spam trained on: 223

51 Significant Tokens
token                                       spamprob      #ham  #spam
'analyze'                                   0.0918367        2      0
'issue'                                     0.14016         12      4
'angled'                                    0.155172         1      0
'grassland'                                 0.155172         1      0
'idiomatic'                                 0.155172         1      0
'ireland'                                   0.155172         1      0
'linus'                                     0.155172         1      0
'nixon'                                     0.155172         1      0
'rat'                                       0.155172         1      0
'proper'                                    0.167451         3      1
'between'                                   0.292455        10      9
'bi:the message'                            0.328852         2      2
'wonderful'                                 0.328852         2      2
'subject:: '                                0.339228        15     17
'loading'                                   0.344569         1      1
'sugar'                                     0.344569         1      1
'wing'                                      0.344569         1      1
'skip:p 10'                                 0.34958         26     31
'white'                                     0.355755         5      6
'went'                                      0.3659           4      5
'bi:header:Subject:1 proto:http'            0.600175        37    124
'drive'                                     0.721907         2     12
'bi:header:Reply-To:1 header:Message-ID:1'  0.747164        11     73
'admission'                                 0.844828         0      1
'adrenal'                                   0.844828         0      1
'appian'                                    0.844828         0      1
'becloud'                                   0.844828         0      1
'beep'                                      0.844828         0      1
'bigotry'                                   0.844828         0      1
'camel'                                     0.844828         0      1
'elena'                                     0.844828         0      1
'epoch'                                     0.844828         0      1
'flair'                                     0.844828         0      1
'gustavus'                                  0.844828         0      1
'heir'                                      0.844828         0      1
'overdue'                                   0.844828         0      1
'prestige'                                  0.844828         0      1
'sacramento'                                0.844828         0      1
'singleton'                                 0.844828         0      1
'sloth'                                     0.844828         0      1
'subject:need'                              0.844828         0      1
'toefl'                                     0.844828         0      1
'url:es'                                    0.844828         0      1
'bi:try this'                               0.908163         0      2
'reception'                                 0.908163         0      2
'subject:Fwd'                               0.908163         0      2
'bi:url:1 url:gif'                          0.934783         0      3
'walls'                                     0.934783         0      3
'bi:header:From:1 header:MIME-Version:1'    0.958716         0      5
'subject:...'                               0.965116         0      6
'subject:your'                              0.98951          0     21
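
For anyone wondering where figures like 0.844828 come from: the spamprob
column is consistent with the Robinson-style adjustment used by SpamBayes,
assuming the default unknown-word strength (s = 0.45) and unknown-word
probability (x = 0.5), together with the 100 ham / 223 spam training counts
shown above.  A minimal sketch (the function name and layout are mine, not
the SpamBayes source):

```python
def spamprob(ham_count, spam_count, nham, nspam, s=0.45, x=0.5):
    """Adjusted spam probability for one token (illustrative sketch).

    s and x are assumed to be the default unknown-word strength and
    unknown-word probability.
    """
    n = ham_count + spam_count
    if n == 0:
        return x  # never-seen token: fall back to the prior
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    p = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate p towards x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


# Counts copied from the clue table; training totals from the report.
for token, ham, spam in [("camel", 0, 1), ("rat", 1, 0),
                         ("issue", 12, 4), ("subject:your", 0, 21)]:
    print("%-14s %.6f" % (token, spamprob(ham, spam, 100, 223)))
```

A spam hapax only gets to 0.844828 rather than 1.0 because a single sighting
is weak evidence, but twenty of them pulling the same way add up.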

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.