spam classification breaker

Michael Hudson mwh at python.net
Thu Feb 5 17:31:30 CET 2004


"Tim Peters" <tim.one at comcast.net> writes:

> [Robin Becker]
> > This article at the BBC reports on what appears to be a genetic
> > algorithm or random search method for finding words that apparently
> > fool Bayesian classifiers every time.
> >
> > http://news.bbc.co.uk/1/hi/technology/3458457.stm
> >
> > The author apparently had to include HTML reporting in the emails to
> > allow his mail client to report back automatically.
> 
> If I'm a spammer trying to get my pitches seen by you, and you're using a
> personal Bayesian classifier, then I need to load my pitches with words that
> are very hammy to you.  If I don't have access to your personal training
> data (if I do, I already own your machine ...), then I need to *deduce*
> what's hammy to you.  One way to do that, as John Graham-Cumming noted
> here, is for me to send you thousands of messages with different piles of
> words, and note which ones did and didn't get caught by your filter.  Then
> I load my sales pitches with words from the ones that your filter didn't
> reject, and avoid words from ones your filter did reject.  In order to do
> that, I have to know which messages you did and didn't look at.  That's the
> purpose of the HTML "web bugs"/"web beacons" in the thousands of test
> messages.
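
For the record, the probing loop described above can be sketched in a
few lines of Python.  This is only a toy under loud assumptions: the
word lists, the was_opened() stand-in for the web beacon, and both
function names are made up for illustration, and the "filter" is a
crude majority rule rather than a real Bayesian classifier.

```python
import random
from collections import Counter

# Hypothetical stand-in for the victim: the spammer never sees these sets,
# only whether a given probe was opened (i.e. its web beacon fired).
HAMMY = {"python", "patch", "lambda", "meeting", "thesis", "compiler"}
SPAMMY = {"viagra", "mortgage", "winner", "refinance", "casino", "enlarge"}
VOCABULARY = sorted(HAMMY | SPAMMY)


def was_opened(words):
    """Toy filter: a probe gets through (and is opened) only if fewer
    than half of its words are spammy to this recipient."""
    spam_count = sum(1 for w in words if w in SPAMMY)
    return spam_count * 2 < len(words)


def estimate_open_rates(vocabulary, beacon, n_probes=2000,
                        words_per_probe=3, seed=0):
    """Send n_probes random word piles and record, per word, the
    fraction of probes containing it that reported back."""
    rng = random.Random(seed)
    sent = Counter()
    opened = Counter()
    for _ in range(n_probes):
        words = rng.sample(vocabulary, words_per_probe)
        hit = beacon(words)
        for w in words:
            sent[w] += 1
            if hit:
                opened[w] += 1
    return {w: opened[w] / sent[w] for w in sent}


def pick_hammy_words(open_rates, n):
    """Return the n words most associated with probes that got through."""
    return sorted(open_rates, key=open_rates.get, reverse=True)[:n]


rates = estimate_open_rates(VOCABULARY, was_opened)
print(pick_hammy_words(rates, 6))
```

Even in this toy the spammer needs a couple of thousand probes to
sort a dozen words for one recipient, which is the "expensive" part.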

I did wonder what the point of some of the stuff that ends up in my
unsure folder was.  It seems so mashed up that even if I wanted to
work out what the hell they were selling me I would have a hard time
figuring it out.

> > Of course if he'd used python the whole process of email generation
> > and classification could have been done in a single process and would
> > probably allow easier generation of the magic words.
> 
> I have to deduce *your* magic words, not mine.  I have to send email to you,
> and deduce what you did and didn't look at.  This is an expensive process
> for the spammer, of course.

Surely there comes a point when just sending me mail selling something
I actually want becomes cheaper...

Cheers,
mwh

-- 
  Monte Carlo sampling is no way to understand code.
                                  -- Gordon McMillan, comp.lang.python


