spam classification breaker

Skip Montanaro skip at pobox.com
Thu Feb 5 18:27:25 CET 2004


    >> This article at the BBC reports on what appears to be a genetic
    >> algorithm or random search method for finding words that apparently
    >> fool bayesian classifiers every time.
    >> 
    >> http://news.bbc.co.uk/1/hi/technology/3458457.stm

I noticed immediately that the author of the article used the term "ham" to
refer to mail which was not spam.  Even if SpamBayes dies an ignominious
death in the future at the hands of some ruthless spammers, that will be our
lasting legacy.

Mr. Graham-Cumming could have avoided the overhead of sending himself 10,000
mails by simply selecting words from his archived public presence on the
net: web pages, Usenet posts or archived mailing list posts associated with
his email address.  I suspect his genetic algorithm would have been all but
unnecessary.  (Google for "John Graham-Cumming" for example.)

This doesn't have to be a tedious process either.  In the course of normal
scumbag email harvesting, all the crawler has to do is select a few
non-trivial words from the harvested page and associate them with the email
address(es) on that page.  After seeing the same email address a few times
they would have a decent collection of hammy words for use in the "random
words" block of later spam.
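
To show how little machinery this takes (and so what filter authors are up
against), here is a minimal sketch of the word-collection step.  Everything in
it is an assumption made for illustration: the names collect_words and
recurring_words, the crude address regex, the five-letter cutoff for
"non-trivial", and the tiny stopword list.  None of it comes from the BBC
article or from SpamBayes.

```python
import re
from collections import Counter, defaultdict

# Deliberately crude patterns; real pages would need more careful handling.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WORD_RE = re.compile(r"[A-Za-z]{5,}")  # "non-trivial" = 5+ letters (arbitrary)
STOPWORDS = {"about", "there", "their", "which", "would", "could"}  # hypothetical


def collect_words(pages, seen=None):
    """Associate each address on a page with the longer words on that page.

    Returns a mapping: address -> Counter(word -> number of pages on which
    the word appeared alongside that address).
    """
    seen = defaultdict(Counter) if seen is None else seen
    for text in pages:
        addresses = {a.lower() for a in EMAIL_RE.findall(text)}
        # Blank out the addresses so their parts are not counted as words.
        body = EMAIL_RE.sub(" ", text)
        # Use a set so each word counts at most once per page.
        words = {w.lower() for w in WORD_RE.findall(body)} - STOPWORDS
        for addr in addresses:
            seen[addr].update(words)
    return seen


def recurring_words(seen, address, min_pages=2):
    """Words seen with the address on at least min_pages different pages."""
    counts = seen.get(address.lower(), Counter())
    return sorted(w for w, n in counts.items() if n >= min_pages)
```

Fed two pages that both mention alice@example.com and both talk about Python
decorators, recurring_words(seen, "alice@example.com") would return the words
the pages have in common.  Requiring a word to recur across pages is one
simple way to model "after seeing the same email address a few times".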

Also, contrary to this statement attributed to him in the article:

    And, he said, this would have to be repeated for every person a spammer
    wanted to reach because they would all have a different list of key
    words.

this wouldn't have to be done for all email addresses.  Anything which
increases the likelihood that a spam is opened will be seen as an
improvement for the spammer.  There's obviously no need for them to get a
100% open rate on spam.  If there were, they'd already all be out of
business.

These research types.  They always do things in the hardest way possible...

Skip



