OT: spam filtering idea

Tim Peters tim.one at comcast.net
Mon Jan 13 20:42:16 EST 2003


[jerf at compy.attbi.com]
> I couldn't think of a reasonable way to predict the results of that,
> because as I think I mentioned in another posting, there are two big
> unknowns: The nature of the people responding to the spams (have you
> ever really thought about it? who the hell is keeping these things
> afloat? In all seriousness, my current theory is that we're talking
> people of reduced intelligence, but I don't *know*.), and how close the
> spam industry may be to economic collapse, such that Bayes-type filters
> (which *are* legitimately better than previous approaches) may be enough
> to tip them over the edge. Without more data about those two things it's
> hard to predict what will happen if spam tones down.

Well, I've noted before (but on the spambayes mailing list) that I expect
widespread adoption of this kind of classifier may actually increase spam,
while *not* toning it down at all.  The thing is that the system has no
predefined notions of "ham" or "spam":  it believes whatever you train it to
believe.  For example, I get a particular class of "Joke of the Day" spam,
which I sometimes enjoy.  My personal classifier is trained to consider that
ham, even though the rest of such a msg hawks everything from human
growth hormone to cheap ink jet cartridges (and it came as a surprise just
how fine are the distinctions the classifier can make).  OTOH, there are
some kinds of email I get from companies I do business with that I'd rather
not be bothered with, and the system calls those spam now.
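
To make the "it believes whatever you train it to believe" point concrete,
here is a minimal sketch.  It is not the spambayes classifier:  the
TinyClassifier class, the whitespace tokenizing, and the add-one smoothing
are all assumptions made up for illustration.  The same two piles of mail
are trained with the labels swapped, and the verdict on a new msg swaps
right along with them.

```python
import math
from collections import Counter


class TinyClassifier:
    """Toy word-presence classifier (hypothetical, for illustration only)."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.nmsgs = {"ham": 0, "spam": 0}

    def train(self, text, label):
        # Count each distinct word once per message.
        self.nmsgs[label] += 1
        self.counts[label].update(set(text.lower().split()))

    def spam_score(self, text):
        # Sum per-word log-odds (add-one smoothing), then squash to 0..1.
        logodds = 0.0
        for word in set(text.lower().split()):
            p_spam = (self.counts["spam"][word] + 1) / (self.nmsgs["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.nmsgs["ham"] + 2)
            logodds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-logodds))


jokes = [
    "joke of the day why did the chicken cross the road",
    "joke of the day cheap ink jet cartridges",
]
vendor = [
    "your account statement is ready",
    "special offer for valued account holders",
]


def build(joke_label, vendor_label):
    # Train one classifier on the same mail, with caller-chosen labels.
    clf = TinyClassifier()
    for text in jokes:
        clf.train(text, joke_label)
    for text in vendor:
        clf.train(text, vendor_label)
    return clf


mine = build("ham", "spam")    # I like the jokes, not the vendor mail
yours = build("spam", "ham")   # you feel the opposite way

msg = "joke of the day plus cheap ink"
print(mine.spam_score(msg), yours.spam_score(msg))
```

The first score lands below 0.5 and the second above it, for the very same
msg.  Nothing in the code knows what "spam" means beyond the labels it was
handed.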

Now supposing I really want porn spam, and the raunchier the better, it's
easy to train a classifier to call such stuff ham.  If this filter
technology reaches enough people that the fraction of a fraction of a
percent of those who really want porn spam get hold of it, they won't miss
porn spam anymore in the blizzard of spams they don't want to see, and
response rates for porn spammers may well go *up*.  But it will be in the
interest of the porn spammers then not to try to disguise the nature of
their msg; to the contrary, it will be in their interest to have it SCREAM
"porn spam".

Substitute get-rich-quick, or penis enlargement, or what have you.

> Somewhat back on the Python topic, once SpamBayes is done I intend to see
> if I can implement what I talked about.

It's long been done enough for geeks to use effectively.  The killers are
integrating with a gazillion quirky mail clients, and making a system so
easy to use that you don't have to learn anything to use it.

> It's just not worth picking up an implementation in another language
> when it'd probably be a small handful of hours' work in Python...

The classifier proper is very simple and brief Python code.  The tokenizer
is hairier, but still not a major piece of work.  The hairiest code by far
is Mark Hammond's wondrous integration code for Outlook 2000.
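
For a feel of why the tokenizer is the hairier half, here is a rough sketch
of the kind of decisions one has to make.  This is not the spambayes
tokenizer:  the tokenize() function, the "subject:" prefix, and the 3-to-12
character length cutoff are assumptions chosen for illustration, and real
mail (MIME parts, HTML, encoded headers) needs far more care than this.

```python
import email


def tokenize(raw_message):
    """Yield crude tokens from a raw RFC 2822 message string (illustrative)."""
    msg = email.message_from_string(raw_message)

    # Tag Subject words so they are counted separately from body words.
    for word in (msg.get("Subject") or "").lower().split():
        yield "subject:" + word

    body = msg.get_payload()
    if not isinstance(body, str):
        return  # multipart message: skipped in this sketch

    # Keep only mid-length body words (assumed cutoffs).
    for word in body.lower().split():
        if 3 <= len(word) <= 12:
            yield word


raw = "Subject: Joke of the Day\n\nA joke about a chicken\n"
tokens = list(tokenize(raw))
print(tokens)
```

Even this toy version has to pick a policy for headers, multipart msgs, and
word lengths, and each such choice changes what the classifier gets to see.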





