[Python-Dev] Getting started with GBayes testing
Skip Montanaro
skip@pobox.com
Thu, 5 Sep 2002 09:57:45 -0500
Brad> My feeling is that the presentation of "the message" is
Brad> independent of the message itself, so if I get a message in Text,
Brad> HTML, RTF only the actual content is important, not the markup
Brad> method. Though I suppose using lots of red and large fonts might
Brad> be an indicator of spam, the text of the message should still
Brad> suffice.
You might be surprised. In Paul Graham's "A New Plan for Spam" he writes:
    I don't know why I avoided trying the statistical approach for so
    long. I think it was because I got addicted to trying to identify
    spam features myself, as if I were playing some kind of
    competitive game with the spammers. (Nonhackers don't often
    realize this, but most hackers are very competitive.) When I did
    try statistical analysis, I found immediately that it was much
    cleverer than I had been. It discovered, of course, that terms
    like "virtumundo" and "teens" were good indicators of spam. But
    it also discovered that "per" and "FL" and "ff0000" are good
    indicators of spam. In fact, "ff0000" (html for bright red) turns
    out to be as good an indicator of spam as any pornographic term.
As Tim has pointed out several times, intuition and hunches about this
stuff often turn out to be incorrect.
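To make the point concrete, here is a rough sketch of the kind of per-token
statistic involved. It loosely follows the ratio described in Graham's essay
but is simplified: it omits his doubling of the good counts and his minimum
occurrence threshold. The tokenizer, the function names, and the four toy
messages are all made up for illustration, and this is not the GBayes code.

```python
import re
from collections import Counter

# Hypothetical tokenizer: keeps letters, digits, $, ' and -, so markup
# residue such as "ff0000" survives as an ordinary token.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def token_spam_scores(spam_msgs, ham_msgs):
    """Return a rough spam probability for every token seen in training."""
    bad = Counter(t for m in spam_msgs for t in tokenize(m))
    good = Counter(t for m in ham_msgs for t in tokenize(m))
    nbad = max(len(spam_msgs), 1)
    ngood = max(len(ham_msgs), 1)
    scores = {}
    for tok in set(bad) | set(good):
        b = min(1.0, bad[tok] / nbad)    # capped spam frequency
        g = min(1.0, good[tok] / ngood)  # capped ham frequency
        # Clamp so no single token is ever treated as absolute proof.
        scores[tok] = max(0.01, min(0.99, b / (b + g)))
    return scores

# Made-up training data, only to show the mechanics.
spam = ['<font color="ff0000">Hot teens!</font>',
        '<font color="ff0000">Act now</font>']
ham = ['The patch is attached, please review',
       'Teens are discussed in the font thread']

scores = token_spam_scores(spam, ham)
for tok in ("ff0000", "teens", "patch"):
    print(tok, scores[tok])
```

With this toy data the markup token "ff0000" scores higher than "teens",
because "teens" also shows up in a legitimate message. That is only possible
if the tokenizer sees the markup in the first place.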
Skip