OT: spam filtering idea
David Mertz
mertz at gnosis.cx
Tue Jan 14 14:50:53 EST 2003
Paul Rubin <phr-n2002b at NOSPAMnightsong.com> wrote previously:
|Does spambayes look at the charset? I get tons of spam in korean
|characters. Anything with charset="euc-kr" or "ks_c_5601-1987" etc.
|is just about certainly spam.
I have a custom filter setup on my machine. It's a bit cobbled together
with duct tape and string, so I'm not exactly advocating it. I should
probably start using spambayes, but I'd need to write some wrapper for
my particular use model.
What I do (in a Python script) is poll my POP3 mailbox intermittently,
and download the headers only. If I decide something is definitely spam
based on the headers, I send a delete command, and never need to
download the whole message (e.g. a large virus body) with my regular
mail client. I like this because the spam-killer script is completely
independent of which mail client I use.
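A minimal sketch of that headers-only polling, for illustration rather
than a copy of my actual script. It assumes the server supports the
POP3 TOP command and uses the standard poplib module. The function
names, the is_spam callback, and the choice of POP3_SSL are all
hypothetical.

```python
import poplib
from email.parser import BytesParser


def connect(host, user, password):
    """Open a POP3 session (SSL is an assumption; plain POP3 also exists)."""
    conn = poplib.POP3_SSL(host)
    conn.user(user)
    conn.pass_(password)
    return conn


def fetch_headers(conn, msg_num):
    """Return the parsed headers of one message without fetching its body."""
    # TOP <n> 0 asks the server for the headers plus zero lines of body.
    _resp, lines, _octets = conn.top(msg_num, 0)
    return BytesParser().parsebytes(b"\r\n".join(lines), headersonly=True)


def purge_spam(conn, is_spam):
    """Mark for deletion every message whose headers is_spam() rejects."""
    deleted = []
    count, _mailbox_size = conn.stat()
    for num in range(1, count + 1):
        if is_spam(fetch_headers(conn, num)):
            conn.dele(num)  # the server removes it when the session quits
            deleted.append(num)
    return deleted
```

A caller would do something like conn = connect(host, user, password),
then purge_spam(conn, some_rule), then conn.quit() to commit the
deletions. Because only headers cross the wire, the regular mail client
never sees the deleted messages.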
I analyze the headers twice. The first pass looks for some values that
I entered manually for specific header fields (e.g. "URGENT ASSISTANCE"
in the Subject:). I started out with this crude style and never
removed it. But then I make a second pass using a
pseudo-Bayesian analysis of the *trigrams* in the header. I think
trigrams work nicely for headers, which contain distinctive substrings,
but not so many whole words. I wrote about this a bit at:
http://www-106.ibm.com/developerworks/linux/library/l-spamf.html
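To give a rough idea of what trigram scoring can look like, here is a
small sketch. It is not the code from the article: the smoothed
log-odds formula, the function names, and the "positive means spam"
convention are assumptions chosen for illustration.

```python
import math
from collections import Counter


def trigrams(text):
    """All overlapping 3-character substrings of text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(headers):
    """Count trigram occurrences over a list of header strings."""
    counts = Counter()
    for header in headers:
        counts.update(trigrams(header))
    return counts


def spam_score(header, spam_counts, ham_counts):
    """Sum of add-one-smoothed log-odds; above zero leans toward spam."""
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    vocab = len(set(spam_counts) | set(ham_counts)) or 1
    score = 0.0
    for tri in trigrams(header):
        p_spam = (spam_counts[tri] + 1) / (spam_total + vocab)
        p_ham = (ham_counts[tri] + 1) / (ham_total + vocab)
        score += math.log(p_spam / p_ham)
    return score
```

Trained on a pile of known-spam headers and a pile of known-good ones,
spam_score() rewards substrings such as "URG" or "RGE" that turn up
mostly in spam, even when no whole word matches.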
One thing I look for in the first pass is several of those East Asian
charset strings. The way I figure it, even though I might get perfectly
welcome messages from Korean correspondents, if they are encoded in Korean,
I can't read them anyway. Of course, some people *do* read Korean (or
Chinese, Japanese, etc), so this filter clearly wouldn't work for them.
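A charset check of that sort might be sketched as follows. The two
Korean names come from the quoted message; the other entries in the
set, the regular expressions, and the function name are illustrative
assumptions, and anyone who reads these languages would leave the set
empty.

```python
import re

# "euc-kr" and "ks_c_5601-1987" are from the quoted message; the other
# names are illustrative additions, not a vetted list.
BLOCKED_CHARSETS = {"euc-kr", "ks_c_5601-1987", "gb2312", "big5",
                    "iso-2022-jp"}

# charset="..." parameter, as found in a Content-Type header.
CHARSET_RE = re.compile(r'charset\s*=\s*"?([^";\s?]+)', re.IGNORECASE)
# RFC 2047 encoded word, e.g. =?euc-kr?B?...?= in a Subject header.
ENCODED_WORD_RE = re.compile(r'=\?([^?]+)\?[bq]\?', re.IGNORECASE)


def blocked_charset(raw_headers):
    """Return the first blocked charset named in the headers, else None."""
    found = (CHARSET_RE.findall(raw_headers)
             + ENCODED_WORD_RE.findall(raw_headers))
    for name in found:
        if name.lower() in BLOCKED_CHARSETS:
            return name.lower()
    return None
```

Charset names are compared case-insensitively, since mailers are not
consistent about capitalization.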
I've noticed, however, that the manual filters are usually redundant.
Almost everything that my hand-coded patterns catch is then also
caught by the trigram-Bayes pass.
Yours, David...
--
mertz@ | The specter of free information is haunting the `Net! All the
gnosis | powers of IP- and crypto-tyranny have entered into an unholy
.cx | alliance...ideas have nothing to lose but their chains. Unite
| against "intellectual property" and anti-privacy regimes!