[spambayes-dev] siickkk and deprrravved stufff totallly grossssse
Glenn Brown
gbrown at alumni.caltech.edu
Mon Dec 22 11:49:23 EST 2003
I fear that, starting in the past week, my mailbox has been seeing a
consistent spam attack on Bayesian filters: spam tokens tweaked by
repeating characters.
If spammers use 0-3 extra repetitions of each letter, a 10-letter spam
token like "investment" can be spelled 4^10 (about a million) different
ways. I don't want to suffer a million spam messages to train my filter for
this one word.
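As a rough check of that figure, here is a minimal sketch. It assumes each letter is chosen independently and may carry 0 to 3 extra copies; the name variant_count is made up for illustration.

```python
def variant_count(word, max_extra=3):
    """Count spellings when every letter may gain 0..max_extra extra copies.

    Assumes no two adjacent letters of `word` are equal, so every
    combination of repeat counts yields a distinct spelling.
    """
    return (max_extra + 1) ** len(word)


print(variant_count("investment"))  # 1048576
```

"investment" has 10 letters and no doubled letters, so the count is exactly 4^10 = 1,048,576.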
A simple solution would be to collapse each run of a repeated character to
a single character before tokens go into the spam database. This produces
163 ambiguities among the 25143 words in the Solaris /usr/dict/words list
of English words, but probably none of these are spam tokens. I've appended
a list of the ambiguous tokens below. For example, "be" represents both
"be" and "bee".
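The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SpamBayes code: collapse_repeats and find_ambiguities are hypothetical names, the comparison is case-sensitive, and the short word list in the usage lines stands in for /usr/dict/words.

```python
import re
from collections import defaultdict

# One character followed by one or more copies of itself.
_REPEAT = re.compile(r"(.)\1+")


def collapse_repeats(token):
    """Replace each run of a repeated character with a single copy."""
    return _REPEAT.sub(r"\1", token)


def find_ambiguities(words):
    """Map each collapsed form to the distinct words sharing it (two or more)."""
    groups = defaultdict(set)
    for word in words:
        groups[collapse_repeats(word)].add(word)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


print(collapse_repeats("innvesstmmennnt"))  # investment
print(find_ambiguities(["be", "bee", "investment", "to", "too"]))
# {'be': ['be', 'bee'], 'to': ['to', 'too']}
```

Running find_ambiguities over a full dictionary file is how a list like the one appended below could be reproduced, though the exact count depends on the word list and on how case is handled.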
I won't be implementing this feature myself, but I would sure like to see
it in my favorite spam filter.
Cheers to all the SpamBayes developers,
--Glenn
Alan
Alison
Barnet
Bela
Burt
De
Diane
Douglas
Eliot
Eliot
Emanuel
Gary
Godwin
Greg
Haley
Herman
Kaufman
Kenan
Liget
Lilian
Marieta
Mathews
Matson
McConel
NW
Nichols
Paterson
Philip
SE
SW
Scot
Shafer
Shepard
Simons
Wals
Whitaker
ad
advise
apointe
as
bare
bat
be
bel
below
bel
below
bet
bib
bit
bled
boby
bogy
bon
both
bred
bus
but
canister
canon
canvas
carton
chery
chose
col
coma
con
con
cop
coral
cot
desert
desicate
devise
devote
discus
divorce
dragon
drol
drop
duly
el
el
escape
fed
fel
fiance
filet
fogy
fury
gable
gal
glom
god
gripe
grove
hel
his
hop
hot
i
i
in
inbred
invite
ken
knel
later
legate
lop
lose
lot
mana
marque
mate
met
milenia
mortgage
mot
ne
non
nose
of
pal
parole
pep
pepy
per
pol
pol
pop
pose
put
red
refuge
retire
rifle
robin
rod
rot
salon
sen
shot
slop
son
sped
step
stop
tapa
ten
the
til
to
todle
tol
tor
tot
very
vi
vi
we
wed
whop
willful