pop3 email header classifier?

David Mertz mertz at gnosis.cx
Tue Sep 23 13:53:17 EDT 2003


Robin Becker <robin at jessikat.fsnet.co.uk> wrote previously:
|Is there a python tool that can be made to delete these from my POP3
|mail box rather than let my client reject?
|I know about spam-bayes etc, but these things are over 120k each and it
|seems pretty pointless to download them (as well as taking about an
|hour).

I do exactly this myself.  For my article (about a year ago now) on Spam
filtering, for IBM developerWorks, I developed my own little custom
tool.  I've refined it over time, but it remains kinda hackerish and
un(der)documented.  Still, I'd be happy to share with anyone
interested... especially if anyone wants to make something nice out of
it for distribution.

What I do is a hodgepodge, but the general idea is that I use [poplib]
to download ONLY the headers.  Messages that are convincingly spam on
the basis of their headers alone get deleted without my ever needing to
download the bodies.
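
A minimal sketch of the headers-only idea, not the actual 'spamfilter'
code.  It assumes Python 3 and a server that supports the POP3 TOP
command; the names fetch_headers, purge and run_once are hypothetical,
and looks_like_spam stands for whatever test you want to apply to the
parsed headers.

```python
import poplib
from email.parser import BytesHeaderParser


def fetch_headers(conn):
    """Yield (message_number, parsed_headers) without downloading bodies."""
    count, _mailbox_size = conn.stat()
    parser = BytesHeaderParser()
    for num in range(1, count + 1):
        # TOP <num> 0 asks the server for the headers plus zero body lines.
        _resp, lines, _octets = conn.top(num, 0)
        yield num, parser.parsebytes(b"\n".join(lines))


def purge(conn, looks_like_spam):
    """Mark for deletion every message whose headers satisfy the predicate."""
    doomed = [num for num, hdrs in fetch_headers(conn) if looks_like_spam(hdrs)]
    for num in doomed:
        conn.dele(num)
    return doomed


def run_once(host, user, password, looks_like_spam):
    """Hypothetical driver: log in, purge, and log out again."""
    conn = poplib.POP3_SSL(host)
    conn.user(user)
    conn.pass_(password)
    try:
        return purge(conn, looks_like_spam)
    finally:
        conn.quit()  # the server only commits deletions at QUIT
```

Because purge() only needs stat(), top() and dele(), it can be tried
against a stand-in object before being pointed at a real mailbox.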

As a first line of defense, I have a collection of blacklist and
whitelist patterns (I only use strings and globs, not regexen, though
the latter would be easy to add).  Each pattern is checked against a
specific header field (or against the whole header, if I wish).
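
One way such per-field glob rules could be sketched, under stated
assumptions: the rule format, the example patterns, the "*" marker for
"whole header", and the choice to let the whitelist win are all
hypothetical, not taken from the actual tool.  A plain substring is
written here as a glob of the form *text*.

```python
from email import message_from_string
from fnmatch import fnmatchcase

# Hypothetical rules: (header field, glob).  A field of "*" means the
# pattern is matched against the entire raw header.
WHITELIST = [("From", "*@example.org*")]
BLACKLIST = [("Subject", "*viagra*"), ("*", "*X-Mailer: BulkBlaster*")]


def _matches(raw_header, rules):
    msg = message_from_string(raw_header)
    for field, pattern in rules:
        if field == "*":
            targets = [raw_header]
        else:
            targets = msg.get_all(field, [])
        # Lowercase both sides so the match is case-insensitive everywhere.
        if any(fnmatchcase(str(t).lower(), pattern.lower()) for t in targets):
            return True
    return False


def verdict(raw_header):
    """Return 'ham', 'spam' or 'unknown'; whitelist rules are tried first."""
    if _matches(raw_header, WHITELIST):
        return "ham"
    if _matches(raw_header, BLACKLIST):
        return "spam"
    return "unknown"
```

Anything that comes back 'unknown' would be handed to the next stage.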

But the next line of defense is the usual naive Bayesian style.  The
wrinkle here is that I do not use "words" in the headers for analysis,
but rather trigrams (sequences of three characters).  I believe that for
headers-only, this is more accurate, although I have not rigorously
tested this.  Things like routing IPs and spam mail clients are hard to
pick out by whole words, but trigrams do some magic.
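
To make the trigram idea concrete, here is a minimal sketch of a naive
Bayesian scorer over character trigrams.  It is an illustration, not the
actual 'spamfilter' code: the class name, the add-one smoothing and the
log-odds output are all assumptions.

```python
import math
from collections import Counter


def trigrams(text):
    """All overlapping 3-character sequences in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


class TrigramBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, header, label):
        """Record the trigrams of one header known to be 'spam' or 'ham'."""
        self.counts[label].update(trigrams(header))
        self.docs[label] += 1

    def spam_log_odds(self, header):
        """Positive means the header looks more like spam than ham."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        total = {k: sum(c.values()) for k, c in self.counts.items()}
        # Class prior and per-trigram likelihoods, both add-one smoothed.
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tri in trigrams(header):
            p_spam = (self.counts["spam"][tri] + 1) / (total["spam"] + vocab)
            p_ham = (self.counts["ham"][tri] + 1) / (total["ham"] + vocab)
            score += math.log(p_spam / p_ham)
        return score
```

Fragments of an unfamiliar mailer name or of a numeric relay address
still share trigrams with training examples even when no whole "word"
matches, which is the effect described above.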

The other feature of my 'spamfilter' tool is that it knows nothing at
all about specific mail clients.  It just sits daemon-like, and
periodically deletes stuff it doesn't like.  I check mail from a lot of
different clients, on a lot of different machines; so for me it would be
inconvenient to have the filtering tied to one particular mail
client/machine.  My thing just runs and kills, even when I'm out of
town and checking from internet cafes.
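
The daemon-like behavior can be sketched as a simple polling loop.  This
is an assumption about structure, not the actual tool: run_forever, the
five-minute default interval, and the check_mailbox callable (one pass
of connect, filter, delete, disconnect) are hypothetical names.

```python
import time


def run_forever(check_mailbox, interval=300, sleep=time.sleep, cycles=None):
    """Call check_mailbox() every `interval` seconds, surviving failures.

    `cycles` limits the number of passes (None means loop indefinitely);
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    done = 0
    while cycles is None or done < cycles:
        try:
            check_mailbox()
        except Exception as exc:  # e.g. network down; retry next cycle
            print("check failed:", exc)
        done += 1
        sleep(interval)
    return done
```

Since the loop talks only to the POP3 server, it does not matter which
mail client later collects what is left.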

Yours, David...

--
mertz@  | The specter of free information is haunting the `Net!  All the
gnosis  | powers of IP- and crypto-tyranny have entered into an unholy
.cx     | alliance...ideas have nothing to lose but their chains.  Unite
        | against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------

More information about the Python-list mailing list