[Spambayes] spam designed to defeat Bayesian filters
papaDoc
papaDoc at videotron.ca
Wed Nov 19 14:27:40 EST 2003
Hi,
This message classifies as spam for me.
Spam probability: *77.55% (0.7754698219)*.
These are the tokens:
*Word* *Probability* *Times in ham* *Times in spam*
*H* 0.445814 - -
*S* 0.996753 - -
nov 0.065217 3 0
vcr 0.065217 3 0
before 0.083754 6 1
skip:1 10 0.083754 6 1
battery 0.091837 2 0
controls 0.091837 2 0
emergency 0.091837 2 0
given 0.091837 2 0
knowing 0.091837 2 0
protection 0.091837 2 0
selection 0.091837 2 0
separately 0.091837 2 0
switches 0.091837 2 0
talking 0.091837 2 0
teaching 0.091837 2 0
telling 0.091837 2 0
window 0.091837 2 0
either 0.097789 5 1
good 0.117544 4 1
control 0.147499 3 1
network 0.147499 3 1
provides 0.147499 3 1
tell 0.147499 3 1
white 0.147499 3 1
based 0.149229 5 2
basic 0.155172 1 0
considered 0.155172 1 0
consists 0.155172 1 0
containing 0.155172 1 0
core 0.155172 1 0
corrected 0.155172 1 0
effects 0.155172 1 0
elements 0.155172 1 0
elevator 0.155172 1 0
empty 0.155172 1 0
enables 0.155172 1 0
encouraged 0.155172 1 0
from: 0.155172 1 0
graph 0.155172 1 0
graphics 0.155172 1 0
ground 0.155172 1 0
keyboard 0.155172 1 0
kind 0.155172 1 0
laboratory 0.155172 1 0
language 0.155172 1 0
necessarily 0.155172 1 0
none 0.155172 1 0
produced 0.155172 1 0
proper 0.155172 1 0
properties 0.155172 1 0
protect 0.155172 1 0
seem 0.155172 1 0
seems 0.155172 1 0
senior 0.155172 1 0
separate 0.155172 1 0
subject: 0.155172 1 0
switching 0.155172 1 0
target 0.155172 1 0
task 0.155172 1 0
telephone 0.155172 1 0
whole 0.155172 1 0
because 0.184556 7 4
got 0.195811 5 3
else 0.198686 2 1
whether 0.198686 2 1
version 0.219899 3 2
going 0.231109 4 3
skip:h 10 0.231109 4 3
content-type:text/plain 0.2363 32 27
balance 0.844828 0 1
ball 0.844828 0 1
bar 0.844828 0 1
basis 0.844828 0 1
become 0.844828 0 1
becomes 0.844828 0 1
consumption 0.844828 0 1
continually 0.844828 0 1
continued 0.844828 0 1
continues 0.844828 0 1
continuing 0.844828 0 1
convention 0.844828 0 1
conventions 0.844828 0 1
convinced 0.844828 0 1
corner 0.844828 0 1
electronic 0.844828 0 1
encourage 0.844828 0 1
ended 0.844828 0 1
government 0.844828 0 1
grant 0.844828 0 1
grants 0.844828 0 1
graphic 0.844828 0 1
greatly 0.844828 0 1
green 0.844828 0 1
gross 0.844828 0 1
groups 0.844828 0 1
growing 0.844828 0 1
lands 0.844828 0 1
message-id:invalid 0.844828 0 1
natural 0.844828 0 1
naturally 0.844828 0 1
necessary 0.844828 0 1
neither 0.844828 0 1
normal 0.844828 0 1
promise 0.844828 0 1
promised 0.844828 0 1
protected 0.844828 0 1
prove 0.844828 0 1
seeing 0.844828 0 1
seek 0.844828 0 1
sees 0.844828 0 1
selected 0.844828 0 1
self 0.844828 0 1
series 0.844828 0 1
skip:( 20 0.844828 0 1
skip:[ 10 0.844828 0 1
skip:b 40 0.844828 0 1
tape 0.844828 0 1
to:none 0.844828 0 1
willing 0.844828 0 1
wins 0.844828 0 1
woman 0.844828 0 1
women 0.844828 0 1
contain 0.908163 0 2
contained 0.908163 0 2
end 0.908163 0 2
grow 0.908163 0 2
known 0.908163 0 2
national 0.908163 0 2
production 0.908163 0 2
project 0.908163 0 2
properly 0.908163 0 2
provide. 0.908163 0 2
provided 0.908163 0 2
publication 0.908163 0 2
sending 0.908163 0 2
serious 0.908163 0 2
women's 0.908163 0 2
wondering 0.908163 0 2
copy 0.934783 0 3
giving 0.934783 0 3
growth 0.934783 0 3
nearly 0.934783 0 3
needs 0.934783 0 3
secure 0.934783 0 3
skip:x 10 0.934783 0 3
taking 0.958716 0 5
wish 0.958716 0 5
without 0.969799 0 7
mail 0.980349 0 11
url:biz 0.99236 0 29
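
For anyone wondering where the numbers in this table come from, here is a minimal sketch in Python. It assumes the classifier uses a Robinson-style adjustment with an unknown-word strength of 0.45 and an unknown-word probability of 0.5 (assumed values, but they reproduce the rows above for tokens seen only in ham or only in spam), and that the final score is (S - H + 1) / 2. The names token_spamprob and combine, and the training totals passed in, are illustrative and not taken from the SpamBayes source.

```python
# Assumed settings; they match the table rows for one-sided tokens.
STRENGTH = 0.45  # weight given to the prior for rarely seen tokens
PRIOR = 0.5      # probability assumed for a token never seen before


def token_spamprob(hamcount, spamcount, nham, nspam):
    """Adjusted spam probability for one token.

    nham and nspam are the numbers of trained ham and spam messages
    (hypothetical values in the examples below).
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)  # raw estimate
    n = hamcount + spamcount                # evidence for this token
    return (STRENGTH * PRIOR + n * p) / (STRENGTH + n)


def combine(s_score, h_score):
    """Fold the *S* and *H* indicators into a single spam score."""
    return (s_score - h_score + 1.0) / 2.0


# A token seen in 1 ham and 0 spam, e.g. 'basic' above.
print(round(token_spamprob(1, 0, 100, 100), 6))
# A token seen in 0 ham and 1 spam, e.g. 'balance' above.
print(round(token_spamprob(0, 1, 100, 100), 6))
# The *S* and *H* rows above give the reported 77.55%.
print(round(combine(0.996753, 0.445814), 4))
```

With so little training data, each token can only move a little away from 0.5, which is why a long list of dictionary words seen once or twice in ham does not outweigh a strong clue such as url:biz here.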
Seth Goodman wrote:
>Attached is an email (along with resulting spam clues) that apparently was
>designed specifically to get past Bayesian filters. I believe this was
>mentioned before as the "white on white" HTML problem. The email has a
>large number of legitimate words, probably randomly picked from a
>dictionary, in a section where the font color is almost white on a white
>background. There is a little snippet of HTML at the end that contains my
>email address. I don't know what it does but I don't like the looks of it.
>The message appears blank, unless you look very closely and then look at the
>HTML source. Not only does this message slip through the classifier as ham,
>but training on this message as spam would probably reduce the effectiveness
>of the classifier.
>
>My questions are:
>
>1) What is this thing? Does it harvest addresses when rendered?
>
>2) Are there any approaches that have been discussed to ignore the "almost
>white" text during parsing?
>
>3) Is anything in the works for this exploit?
>
>--
>Seth Goodman
>
> Humans: please remove ".delete" to reply
>
> Spambots: please disregard the above
>
>
>
> ------------------------------------------------------------------------
>
> Subject:
> movie's
> From:
> "Dion Tiff" <gwenneth401 at hotmail.com>
> Date:
> Wed, 19 Nov 2003 03:12:07 -0600
> To:
> <sethg at goodmanassociates.com>
>
>
> hriie os hy vsicwj k tnu mk k vcr uuhfw wawp ucu neqge oo cqstcpw
>jrsqldm qvkm ncy fim
>
>
>
>
> ------------------------------------------------------------------------
>
> Subject:
> Spam Clues: movie's
>
>
>Spam Score: 0% (0)
>
>
>
>word spamprob #ham #spam
>'*H*' 1 - -
>'*S*' 0 - -
>'provides' 0.00790861 28 0
>'project' 0.00819672 27 0
>'basic' 0.00884086 25 0
>'efforts' 0.0100223 22 0
>'electronics' 0.0100223 22 0
>'bar' 0.0110024 20 0
>'nature' 0.012894 17 0
>'switch' 0.0136778 16 0
>'wind' 0.0136778 16 0
>'neither' 0.0145631 15 0
>'groups' 0.0155709 14 0
>'necessarily' 0.0167286 13 0
>'properly' 0.0167286 13 0
>'bars' 0.0180723 12 0
>'enable' 0.0180723 12 0
>'noise' 0.0180723 12 0
>'continuing' 0.0196507 11 0
>'convinced' 0.0196507 11 0
>'projects' 0.0196507 11 0
>'technical' 0.0209429 53 1
>'correct.' 0.0215311 10 0
>'grateful' 0.0215311 10 0
>'senior' 0.0215311 10 0
>'tells' 0.0215311 10 0
>'window' 0.0215311 10 0
>'conventional' 0.0238095 9 0
>'glass' 0.0238095 9 0
>'go.' 0.0238095 9 0
>'knows' 0.0238095 9 0
>'nation' 0.0238095 9 0
>'series' 0.0238095 9 0
>'encourage' 0.0266272 8 0
>'technique' 0.0266272 8 0
>'wine' 0.0266272 8 0
>'controlling' 0.0302013 7 0
>'core' 0.0302013 7 0
>'emphasis' 0.0302013 7 0
>'grant' 0.0302013 7 0
>'granted' 0.0302013 7 0
>'nervous' 0.0302013 7 0
>'networks' 0.0302013 7 0
>'protected' 0.0302013 7 0
>'selecting' 0.0302013 7 0
>'talked' 0.0302013 7 0
>'tape' 0.0302013 7 0
>'win' 0.0302013 7 0
>'wishes' 0.0302013 7 0
>'conversation' 0.0348837 6 0
>'election' 0.0348837 6 0
>'encouraging' 0.0348837 6 0
>'governor' 0.0348837 6 0
>'grows' 0.0348837 6 0
>'suspected' 0.0348837 6 0
>'winter' 0.0348837 6 0
>'night' 0.0351794 31 1
>'bear' 0.0412844 5 0
>'copied' 0.0412844 5 0
>'corrected' 0.0412844 5 0
>'effectively' 0.0412844 5 0
>'encounter' 0.0412844 5 0
>'encourages' 0.0412844 5 0
>'ending' 0.0412844 5 0
>'kill' 0.0412844 5 0
>'label' 0.0412844 5 0
>'promising' 0.0412844 5 0
>'sends' 0.0412844 5 0
>'talks' 0.0412844 5 0
>'teach' 0.0412844 5 0
>'winning' 0.0412844 5 0
>'network' 0.0474167 41 2
>'whether' 0.0474167 41 2
>'beginning' 0.048731 22 1
>'constraints' 0.0505618 4 0
>'context' 0.0505618 4 0
>'continually' 0.0505618 4 0
>'contract.' 0.0505618 4 0
>'contribution' 0.0505618 4 0
>'convince' 0.0505618 4 0
>'convincing' 0.0505618 4 0
>'cope' 0.0505618 4 0
>'correcting' 0.0505618 4 0
>'elect' 0.0505618 4 0
>'keeps' 0.0505618 4 0
>'keys' 0.0505618 4 0
>'kinds' 0.0505618 4 0
>'knocked' 0.0505618 4 0
>'promised' 0.0505618 4 0
>'selects' 0.0505618 4 0
>'switching' 0.0505618 4 0
>'widespread' 0.0505618 4 0
>'wondered' 0.0505618 4 0
>'graphics' 0.0532931 20 1
>'key' 0.0635763 30 2
>'backs' 0.0652174 3 0
>'backwards' 0.0652174 3 0
>'basically' 0.0652174 3 0
>'consists' 0.0652174 3 0
>'elements' 0.0652174 3 0
>'elsewhere' 0.0652174 3 0
>'encouraged' 0.0652174 3 0
>"government's" 0.0652174 3 0
>"governor's" 0.0652174 3 0
>'grew' 0.0652174 3 0
>'lacks' 0.0652174 3 0
>'ladies' 0.0652174 3 0
>'names.' 0.0652174 3 0
>'nearby' 0.0652174 3 0
>'next.' 0.0652174 3 0
>'noisy' 0.0652174 3 0
>'programmers' 0.0652174 3 0
>'proportion' 0.0652174 3 0
>'protest' 0.0652174 3 0
>'sells' 0.0652174 3 0
>'separately' 0.0652174 3 0
>'serial' 0.0652174 3 0
>'teeth.' 0.0652174 3 0
>'within.' 0.0652174 3 0
>'telling' 0.0655701 16 1
>'group' 0.065624 42 3
>'systems' 0.067776 28 2
>'greatly' 0.0695772 15 1
>'none' 0.0695772 15 1
>'telephone' 0.0695772 15 1
>'sensitive' 0.0741059 14 1
>'separate' 0.0741059 14 1
>'glad' 0.0780931 24 2
>'wide' 0.0780931 24 2
>'basis' 0.0792652 13 1
>'ground' 0.0792652 13 1
>'consider' 0.0797402 34 3
>'copy' 0.0840554 52 5
>'seems' 0.0842721 32 3
>'property' 0.0845266 22 2
>'base' 0.0851967 12 1
>'becomes' 0.0851967 12 1
>'team' 0.087111 50 5
>'suspended' 0.0918367 2 0
>"system's" 0.0918367 2 0
>'table.' 0.0918367 2 0
>'tank' 0.0918367 2 0
>'tea' 0.0918367 2 0
>'teacher' 0.0918367 2 0
>'teaching' 0.0918367 2 0
>'whilst' 0.0918367 2 0
>'whoever' 0.0918367 2 0
>'wider' 0.0918367 2 0
>'wild' 0.0918367 2 0
>"window's" 0.0918367 2 0
>'x-mailer:qualcomm windows eudora version 5.1' 0.925475 2 30
>'url:biz' 0.995187 1 277
>
>
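
On Seth's question 2, one possible approach is to drop text inside font tags whose color is close to white before tokenizing. The following is only a rough sketch under several assumptions: the background is white, colors are given as #rrggbb in a font tag's color attribute, and a channel threshold of 0xE0 counts as "almost white". Named colors, CSS styles, and non-white backgrounds are not handled. The names is_near_white, VisibleText, and visible_text are made up for this example and are not part of SpamBayes.

```python
from html.parser import HTMLParser


def is_near_white(color, threshold=0xE0):
    """True if a #rrggbb color has every channel at or above threshold."""
    color = color.strip().lstrip("#")
    if len(color) != 6:
        return False  # named colors and short forms are not handled here
    try:
        channels = [int(color[i:i + 2], 16) for i in (0, 2, 4)]
    except ValueError:
        return False
    return all(c >= threshold for c in channels)


class VisibleText(HTMLParser):
    """Collect text, skipping anything inside a near-white <font> tag."""

    def __init__(self):
        super().__init__()
        self.hidden = []  # one flag per currently open <font> tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            color = dict(attrs).get("color") or ""
            self.hidden.append(is_near_white(color))

    def handle_endtag(self, tag):
        if tag == "font" and self.hidden:
            self.hidden.pop()

    def handle_data(self, data):
        if not any(self.hidden):
            self.chunks.append(data)


def visible_text(html):
    """Return the text a reader would plausibly see, whitespace-normalized."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = ('<font color="#FFFFFE">provides project basic</font>'
          '<font color="#000000">buy now</font>')
print(visible_text(sample))
```

Whether hiding such text from the tokenizer helps or hurts is an open question; the presence of near-white text could itself be a useful spam clue.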