[Python-Dev] Getting started with GBayes testing

Skip Montanaro skip@pobox.com
Thu, 5 Sep 2002 09:57:45 -0500


    Brad> My feeling is that the presentation of "the message" is
    Brad> independent of the message itself, so if I get a message in Text,
    Brad> HTML, RTF only the actual content is important, not the markup
    Brad> method. Though I suppose using lots of red and large fonts might
    Brad> be an indicator of spam, the text of the message should still
    Brad> suffice.

You might be surprised.  In Paul Graham's "A Plan for Spam" he writes:

    I don't know why I avoided trying the statistical approach for so
    long.  I think it was because I got addicted to trying to identify
    spam features myself, as if I were playing some kind of
    competitive game with the spammers.  (Nonhackers don't often
    realize this, but most hackers are very competitive.)  When I did
    try statistical analysis, I found immediately that it was much
    cleverer than I had been.  It discovered, of course, that terms
    like "virtumundo" and "teens" were good indicators of spam.  But
    it also discovered that "per" and "FL" and "ff0000" are good
    indicators of spam.  In fact, "ff0000" (html for bright red) turns
    out to be as good an indicator of spam as any pornographic term.

As Tim has pointed out several times, intuition and hunches about this
stuff often turn out to be incorrect.
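
To make that concrete, here is a minimal sketch of the kind of per-token
statistic Graham describes.  It is not GBayes code.  The two tiny corpora,
the tokenizer regex, and the function names are all made up for
illustration, and it omits Graham's rule of ignoring tokens seen fewer
than five times.  The point is only that if you tokenize the raw message,
markup and all, a token like "ff0000" gets scored like any other word.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, digits, $, ' and -.
# "#" is not in the class, so color="#ff0000" yields the token "ff0000".
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(text):
    # Tokenize the raw message, markup included.
    return [t.lower() for t in TOKEN_RE.findall(text)]

def token_spam_probs(spam_msgs, ham_msgs):
    """Per-token spam probability, roughly following Graham's essay.

    Ham counts are doubled to bias against false positives, and the
    result is clamped to [0.01, 0.99].  Both corpora must be non-empty.
    """
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    nspam, nham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        b = min(1.0, spam_counts[tok] / nspam)
        g = min(1.0, 2 * ham_counts[tok] / nham)
        probs[tok] = max(0.01, min(0.99, b / (b + g)))
    return probs

# Toy data, invented for this example only.
SPAM = [
    '<font color="#ff0000" size="7">Hot teens! Only $5 per month</font>',
    '<font color="#ff0000">Act now</font> to win',
]
HAM = [
    'The tokenizer patch is attached; comments welcome.',
    'Can we meet per the schedule on Thursday?',
]

probs = token_spam_probs(SPAM, HAM)
for tok in ("ff0000", "per", "patch"):
    print(tok, round(probs[tok], 2))
```

On this toy data "ff0000" lands at the 0.99 ceiling simply because it
shows up only in the spam, which is the sort of clue you would never get
if the markup were stripped before tokenizing.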

Skip