[spambayes-dev] Interesting unsure

Skip Montanaro skip at pobox.com
Wed Jun 25 17:40:23 EDT 2003

    >> It's not clear much can be done, though it might be interesting to
    >> try an option to map Latin-1 accented characters to their unadorned
    >> ASCII counterparts, at least in subjects (strip_subject_accents?).

    Alex> I suspect that would have serious detrimental effects for foreign
    Alex> language users.

My thought was that whether or not to enable the option would be under user
control.  If it's a good spaminator for me why should I suffer because the
effbot's native language includes accented characters?  (No offense
intended toward any Swedes who might be reading this, BTW. ;-)

    >> The problem with trying such an experiment isn't that it might not be
    >> worthwhile, but that if it's a new spammer technique, there won't be
    >> many messages in our existing spam/ham databases which would exercise
    >> the technique.

    Alex> I don't see this as any different from any of the other neologisms
    Alex> that spammers come up with; if they persist in using such words
    Alex> (and you're still training), then the odd words with accents will
    Alex> quickly become strong spam indicators.  No need for us to do
    Alex> anything...  it's already going to be handled properly.

Except note that they weren't accenting every vowel and there were many
other accents to choose from.  The message I received had "makë" and "teën".
There are several other accented characters with "a" or "e" as their base
character.  I would have to receive many messages using this technique to
build up enough such odd words to make a difference.  I think that's the
spammer's basic idea with this - keep it readable but fly below the word
count radar.  Like I said, "subject:love" is very spammy for me, but I'd
never seen "subject:löve" before, so it wasn't used to score the message.
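
For concreteness, here is a rough sketch of what such an accent-stripping
option might look like.  It is not the actual spambayes tokenizer: the
function names, the strip_subject_accents flag, and the simple
whitespace-split "subject:" tokens are all assumptions made for
illustration, and it is written for a current Python 3.  It uses Unicode
decomposition, so characters with no decomposed form (such as "ø") would
pass through unchanged.

```python
import unicodedata


def strip_accents(text):
    """Map accented characters to their unaccented base characters.

    NFKD normalization splits a character like "ë" into "e" plus a
    combining diaeresis; dropping the combining marks leaves "e".
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def subject_tokens(subject, strip_subject_accents=False):
    """Hypothetical subject tokenizer: one lowercased token per word.

    The strip_subject_accents flag is off by default, so users whose
    mail legitimately contains accented words are unaffected.
    """
    if strip_subject_accents:
        subject = strip_accents(subject)
    return ["subject:" + word.lower() for word in subject.split()]


if __name__ == "__main__":
    print(subject_tokens("makë löve"))
    print(subject_tokens("makë löve", strip_subject_accents=True))
```

With the flag enabled, "löve" and "love" would collapse to the same
"subject:love" token, so existing training on the plain spelling would apply
to the accented variants as well.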

The fundamental problem when dealing with new spam techniques is (and will
always be, I think) when to mount a counterattack.  That's certainly the
case here.


