OT: spam filtering idea

David Mertz mertz at gnosis.cx
Tue Jan 14 14:50:53 EST 2003


Paul Rubin <phr-n2002b at NOSPAMnightsong.com> wrote previously:
|Does spambayes look at the charset?  I get tons of spam in korean
|characters.  Anything with charset="euc-kr" or "ks_c_5601-1987" etc.
|is just about certainly spam.

I have a custom filter setup on my machine.  It's a bit cobbled together
with duct tape and string, so I'm not exactly advocating it.  I should
probably start using spambayes, but I'd need to write some wrapper for
my particular use model.

What I do (in a Python script) is poll my POP3 mailbox intermittently,
and download the headers only.  If I decide something is definitely spam
based on the headers, I send a delete command, and never need to
download the whole message (e.g. a large virus body) with my regular
mail client.  I like this because the spam-killer script is completely
independent of which mail client I use.
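
A minimal sketch of that polling step, using the standard poplib module.
This is not my actual script: the names purge_spam, poll_once and is_spam
are made up for illustration, and it assumes the server supports the
optional POP3 TOP command and accepts SSL connections.

```python
import poplib
from email.parser import BytesHeaderParser


def purge_spam(conn, is_spam):
    """Mark header-identified spam for deletion; return the message numbers."""
    count, _size = conn.stat()
    deleted = []
    for num in range(1, count + 1):
        # "TOP num 0" fetches the headers plus zero lines of the body.
        _resp, lines, _octets = conn.top(num, 0)
        headers = BytesHeaderParser().parsebytes(b"\r\n".join(lines))
        if is_spam(headers):
            conn.dele(num)
            deleted.append(num)
    return deleted


def poll_once(host, user, password, is_spam):
    """Log in, purge spam based on headers only, and log out (hypothetical helper)."""
    conn = poplib.POP3_SSL(host)
    try:
        conn.user(user)
        conn.pass_(password)
        return purge_spam(conn, is_spam)
    finally:
        # The server only commits the deletions on a clean QUIT.
        conn.quit()
```

Here is_spam is any callable that takes the parsed headers and returns
True or False, so the decision logic stays separate from the POP3
plumbing.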

I analyze the headers twice.  The first pass looks for values that I
entered manually, each tied to a specific header field (e.g. "URGENT
ASSISTANCE" in the Subject:).  I started out with this crude style and
simply never removed it.  The second pass is a pseudo-Bayesian analysis
of the *trigrams* in the headers.  I think trigrams work nicely for
headers, which contain many distinctive substrings but not so many whole
words.  I wrote about this a bit at:

    http://www-106.ibm.com/developerworks/linux/library/l-spamf.html
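
To make the trigram idea concrete, here is a rough sketch.  It is not the
code from the article: the function names, the clamping to 0.01..0.99 and
the "ten most telling trigrams" cutoff are all assumptions, chosen only to
show the general shape of a naive Bayes-style combination.

```python
from collections import Counter


def trigrams(text):
    """Return every overlapping three-character substring of text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples):
    """Count in how many header samples each trigram appears."""
    counts = Counter()
    for text in samples:
        counts.update(set(trigrams(text)))
    return counts


def spam_score(header, spam_counts, n_spam, ham_counts, n_ham, top=10):
    """Combine the most telling trigram probabilities into one score."""
    probs = []
    for tri in set(trigrams(header)):
        s = spam_counts[tri] / n_spam
        h = ham_counts[tri] / n_ham
        if s + h == 0:
            continue  # never seen in training: no evidence either way
        # Clamp so a single trigram can never force the score to 0 or 1.
        probs.append(min(0.99, max(0.01, s / (s + h))))
    # Keep only the trigrams furthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam = ham = 1.0
    for p in probs[:top]:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham) if probs else 0.5
```

A score near 1 suggests spam and a score near 0 suggests good mail; where
to put the "definitely spam" threshold is a judgment call that depends on
how much you hate losing a real message.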

One thing I look for in the first pass is several of those East Asian
charset strings.  The way I figure it, even though I might get perfectly
welcome mail from Korean correspondents, if it is encoded in Korean, I
can't read it anyway.  Of course, some people *do* read Korean (or
Chinese, Japanese, etc.), so this filter clearly wouldn't work for them.
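
A first pass of this kind might look roughly like the following.  The
MANUAL_PATTERNS table and the first_pass name are hypothetical; the
entries are just the examples mentioned in this thread, not my real list.

```python
import re

# Hypothetical table: header field name -> regexes that mark a message as spam.
MANUAL_PATTERNS = {
    "Subject": [re.compile(r"URGENT ASSISTANCE", re.I)],
    "Content-Type": [re.compile(r'charset="?(euc-kr|ks_c_5601-1987)', re.I)],
}


def first_pass(headers):
    """Return True if any hand-entered pattern matches its header field."""
    for field, patterns in MANUAL_PATTERNS.items():
        # Works with a plain dict or an email.message.Message object.
        value = headers.get(field, "")
        if any(p.search(value) for p in patterns):
            return True
    return False
```

Anyone who does read Korean would simply leave the charset entry out of
the table.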

I've noticed, however, that the manual filters are usually redundant.
Almost everything that my hand-coded patterns catch is then also caught
by the trigram-Bayes pass.

Yours, David...

--
mertz@  | The specter of free information is haunting the `Net!  All the
gnosis  | powers of IP- and crypto-tyranny have entered into an unholy
.cx     | alliance...ideas have nothing to lose but their chains.  Unite
        | against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------