[Spambayes] Weird Spam
Tony Meyer
tameyer at ihug.co.nz
Mon Feb 16 23:08:10 EST 2004
> I've got SpamBayes trained up pretty well (872 good and 478
> spam in my database). I just received the email you see
> attached, which slipped by the filter with a 1% score.
With my current db, it scores 99%. (Together, we're a perfect 100 <wink>).
It's really the clues that are interesting - I'll attach mine at the end
(note the large number of spam hapaxes, which carry the day).
> The form of the spam seemed interesting, as if it was specifically
> designed to elude this type of filter.
It seems to be a combination of two avoidance techniques - 'word salad' and
'mini spam'. The first accounts for the 'random' words, almost all of which
don't exist in my db (the theory is that 'word salad' doesn't help against
statistical content filters, because random words are just as likely to hit
tokens in the spam db, or in neither db, as tokens in the ham db). If the
words were specifically harvested to target me, though (the results of an
"I'm feeling lucky" Google search for "Tony Meyer", for example), that would
be another story. (You'll see in my clues that I don't talk much about Elena,
the camel who is heir to Sacramento, and her admission of bigotry <wink>.)
"Mini spam" could be a problem if not enough tokens are generated
(ironically, for me, the harm from the mini-spam was countered by the use of
word salad). One way around this is to tokenize whatever's at the end of
the URL (there is an experimental option in SpamBayes to do this). I've
got this turned on, but it's only used when needed, and it wasn't needed here
(for me). Quite often the URL itself, and the headers, are enough to counter
this.
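
A minimal sketch of the "tokenize whatever's at the end of the URL" idea, for illustration only. It is not SpamBayes's implementation: the function name `slurp_tokens`, the `slurp:` token prefix, the injected `fetch` callable, and the 0.2/0.9 "unsure" band are all assumptions made for this example.

```python
import re

# Crude URL matcher; good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
WORD_RE = re.compile(r"[a-z0-9$'-]{3,12}")


def slurp_tokens(body, fetch, score, unsure=(0.2, 0.9)):
    """Return extra tokens taken from the pages linked in `body`.

    `fetch` is a hypothetical callable mapping a URL to page text.
    Pages are only fetched when `score` falls in the unsure band,
    i.e. when the message's own tokens were not decisive.
    """
    low, high = unsure
    if not (low <= score <= high):
        return []
    tokens = []
    for url in URL_RE.findall(body):
        try:
            page = fetch(url)
        except OSError:
            continue  # unreachable page: contribute nothing
        for word in WORD_RE.findall(page.lower()):
            tokens.append("slurp:" + word)
    return tokens
```

Passing the fetcher in as an argument keeps the sketch testable without network access; a real deployment would also want timeouts and caching.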
=Tony Meyer
Combined Score: 99% (0.993322)
Internal ham score (*H*): 0.00658669
Internal spam score (*S*): 0.99323
# ham trained on: 100
# spam trained on: 223
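
The three scores above are related by combined = (S - H + 1) / 2, which reproduces 0.993322 from the reported H and S. Below is a minimal sketch of chi-squared combining that derives H, S and the combined score from a list of token probabilities. It is written from the general description of the scheme, not copied from SpamBayes, and the function names are illustrative.

```python
import math


def chi2q(x2, dof):
    """Upper-tail probability of a chi-squared variate (even dof only)."""
    assert dof % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Return (H, S, combined) for a list of token spam probabilities."""
    n = len(probs)
    if n == 0:
        return 0.5, 0.5, 0.5
    # S grows when many probabilities are close to 1 (spammy tokens).
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # H grows when many probabilities are close to 0 (hammy tokens).
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return h, s, (s - h + 1.0) / 2.0
```

Evidence that points both ways pushes H and S up together, so the combined score drifts toward 0.5 rather than toward either extreme.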
51 Significant Tokens
token                                       spamprob    #ham  #spam
'analyze'                                   0.0918367   2     0
'issue'                                     0.14016     12    4
'angled'                                    0.155172    1     0
'grassland'                                 0.155172    1     0
'idiomatic'                                 0.155172    1     0
'ireland'                                   0.155172    1     0
'linus'                                     0.155172    1     0
'nixon'                                     0.155172    1     0
'rat'                                       0.155172    1     0
'proper'                                    0.167451    3     1
'between'                                   0.292455    10    9
'bi:the message'                            0.328852    2     2
'wonderful'                                 0.328852    2     2
'subject:: '                                0.339228    15    17
'loading'                                   0.344569    1     1
'sugar'                                     0.344569    1     1
'wing'                                      0.344569    1     1
'skip:p 10'                                 0.34958     26    31
'white'                                     0.355755    5     6
'went'                                      0.3659      4     5
'bi:header:Subject:1 proto:http'            0.600175    37    124
'drive'                                     0.721907    2     12
'bi:header:Reply-To:1 header:Message-ID:1'  0.747164    11    73
'admission'                                 0.844828    0     1
'adrenal'                                   0.844828    0     1
'appian'                                    0.844828    0     1
'becloud'                                   0.844828    0     1
'beep'                                      0.844828    0     1
'bigotry'                                   0.844828    0     1
'camel'                                     0.844828    0     1
'elena'                                     0.844828    0     1
'epoch'                                     0.844828    0     1
'flair'                                     0.844828    0     1
'gustavus'                                  0.844828    0     1
'heir'                                      0.844828    0     1
'overdue'                                   0.844828    0     1
'prestige'                                  0.844828    0     1
'sacramento'                                0.844828    0     1
'singleton'                                 0.844828    0     1
'sloth'                                     0.844828    0     1
'subject:need'                              0.844828    0     1
'toefl'                                     0.844828    0     1
'url:es'                                    0.844828    0     1
'bi:try this'                               0.908163    0     2
'reception'                                 0.908163    0     2
'subject:Fwd'                               0.908163    0     2
'bi:url:1 url:gif'                          0.934783    0     3
'walls'                                     0.934783    0     3
'bi:header:From:1 header:MIME-Version:1'    0.958716    0     5
'subject:...'                               0.965116    0     6
'subject:your'                              0.98951     0     21
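
The spamprob column can be reproduced from the #ham and #spam counts with a Robinson-style smoothed estimate. The sketch below assumes a prior strength of 0.45 and a prior probability of 0.5; those constants are not stated in this message, but they match the table (for example, a spam hapax comes out at 0.844828 and a ham hapax at 0.155172). The function name is illustrative.

```python
def spamprob(ham_count, spam_count, n_ham, n_spam, s=0.45, x=0.5):
    """Smoothed spam probability for one token.

    n_ham / n_spam are the numbers of messages trained on;
    s (prior strength) and x (prior probability) are assumed values.
    """
    n = ham_count + spam_count
    if n == 0:
        return x  # never-seen token: fall back to the prior
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate toward x; the pull weakens as n grows.
    return (s * x + n * raw) / (s + n)


# Counts from the clue report: 100 ham and 223 spam trained.
print(round(spamprob(0, 1, 100, 223), 6))   # spam hapax, e.g. 'camel'
print(round(spamprob(1, 0, 100, 223), 6))   # ham hapax, e.g. 'linus'
print(round(spamprob(10, 9, 100, 223), 6))  # 'between'
```

This is why a single spam-only sighting is already a strong clue: with so little smoothing, one occurrence moves a token most of the way from 0.5 toward 1.0, and a couple of dozen such hapaxes dominate the combined score.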
---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.