spam classification breaker
mwh at python.net
Thu Feb 5 17:31:30 CET 2004
"Tim Peters" <tim.one at comcast.net> writes:
> [Robin Becker]
> > This article at the BBC reports on what appears to be a genetic
> > algorithm or random search method for finding words that apparently
> > fool Bayesian classifiers every time.
> > http://news.bbc.co.uk/1/hi/technology/3458457.stm
> > The author apparently had to include HTML reporting in the emails to
> > allow his mail client to report back automatically.
> If I'm a spammer trying to get my pitches seen by you, and you're using a
> personal Bayesian classifier, then I need to load my pitches with words that
> are very hammy to you. If I don't have access to your personal training
> data (if I do, I already own your machine ...), then I need to *deduce*
> what's hammy to you. One way to do that, as John Graham-Cumming noted
> here, is for me to send you thousands of messages with different piles of
> words, and note which ones did and didn't get caught by your filter. Then
> I load my sales pitches with words from the ones that your filter didn't
> reject, and avoid words from ones your filter did reject. In order to do
> that, I have to know which messages you did and didn't look at. That's the
> purpose of the HTML "web bug"/"web beacon"s in the thousands of test
I did wonder what the point was of some of the stuff that ends up in
my unsure folder. It's so mashed up that even if I wanted to work out
what the hell they were selling me, I would have a hard time figuring
it out.
> > Of course, if he'd used Python, the whole process of email generation
> > and classification could have been done in a single process and would
> > probably allow easier generation of the magic words.
> I have to deduce *your* magic words, not mine. I have to send email to you,
> and deduce what you did and didn't look at. This is an expensive process
> for the spammer, of course.
Surely there comes a point when just sending me mail selling something
I actually want becomes cheaper...
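
For the curious, the probing Tim describes can be sketched as a toy
simulation. Everything here is an assumption made up for illustration:
make_toy_filter is a hypothetical stand-in for the victim's filter (it
just counts hammy words minus spammy words, which is not how a real
Bayesian classifier combines evidence), the word lists are invented, and
the "did it get through?" check stands in for a web beacon firing. The
point is only that tallying which word piles get through recovers the
victim's hammy words without ever seeing the training data.

```python
import random
from collections import Counter


def make_toy_filter(hammy_words, spammy_words, threshold=0):
    """Hypothetical victim filter: +1 per hammy word, -1 per spammy word.

    A real Bayesian filter combines per-word probabilities instead; this
    only needs to be a black box the attacker cannot inspect.
    """
    def delivered(words):
        score = sum((w in hammy_words) - (w in spammy_words) for w in words)
        return score > threshold
    return delivered


def probe(delivered, vocabulary, n_probes=2000, pile_size=5, seed=0):
    """Send random piles of words and tally how often each word got through.

    `delivered(pile)` stands in for "the web beacon in this probe fired".
    Returns a dict mapping each word to its observed pass rate.
    """
    rng = random.Random(seed)
    seen = Counter()
    passed = Counter()
    for _ in range(n_probes):
        pile = rng.sample(vocabulary, pile_size)
        got_through = delivered(pile)
        for word in pile:
            seen[word] += 1
            passed[word] += got_through  # True counts as 1
    return {word: passed[word] / seen[word] for word in seen}


if __name__ == "__main__":
    # Invented word lists; the attacker only ever sees `vocab`.
    hammy = {"python", "patch", "traceback", "commit"}
    spammy = {"viagra", "mortgage", "winner", "free"}
    vocab = sorted(hammy | spammy) + ["the", "and", "hello", "today"]

    rates = probe(make_toy_filter(hammy, spammy), vocab)
    for word in sorted(rates, key=rates.get, reverse=True):
        print(f"{word:10s} {rates[word]:.2f}")
```

Words with the highest pass rate are the ones to load a pitch with. Note
that the cost Tim mentions shows up directly as n_probes: every probe is
a message that has to be sent and tracked.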
Monte Carlo sampling is no way to understand code.
-- Gordon McMillan, comp.lang.python