[Spambayes] statistical comparison of environment?

Skip Montanaro skip at pobox.com
Wed Mar 5 20:39:25 EST 2003

    >> 1. time of day (would require some real granularity tweaking)
    >> 2. size of header / size message / header:message ratio
    >> 3. attachment count (MIME count) / MIME count:message size ratio
    >> 4. [space|tab|\n]:[visible char] ratio
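A minimal sketch of how features 2-4 above might be computed, using only the standard library. The function name and the exact ratios are hypothetical, illustrative choices, not part of the spambayes tokenizer:

```python
import email

def metadata_features(raw_message: str) -> dict:
    """Hypothetical helper: compute the structural ratios listed above."""
    # Headers end at the first blank line.
    header, _, body = raw_message.partition("\n\n")
    msg = email.message_from_string(raw_message)

    # MIME count: every part the parser walks over (1 for a flat message).
    mime_parts = sum(1 for _ in msg.walk())

    whitespace = sum(1 for c in raw_message if c in " \t\n")
    visible = sum(1 for c in raw_message if not c.isspace())
    size_kb = max(len(raw_message) / 1024.0, 1e-9)

    return {
        "header_to_body": len(header) / max(len(body), 1),
        "mime_count": mime_parts,
        "mime_per_kb": mime_parts / size_kb,
        "whitespace_to_visible": whitespace / max(visible, 1),
    }
```

Each of these would then be bucketed into discrete tokens before being fed to the classifier, since the classifier works on token counts rather than raw numbers.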

(Just thinking out loud.)

One of the problems we have generating new improvements is that the system is
so good now that improvements of any kind tend to be microscopic, and thus
extremely hard to measure.  Still, the more ways you can get the tool to
tell you "this smells like spam", the harder it will be for spammers to
defeat it.

Accordingly, when considering potential improvements (improved tokenizing
tricks, for example), perhaps what we should be doing is disabling much of
the current capability and then testing a new change against such a
"crippled" system.  Making it more concrete, suppose we split tokenizing
into two groups, "natural" tokens and "synthetic" tokens.  Natural tokens
would be what you get with basic whitespace splitting, nothing more.
Synthetic tokens would be stuff like tokenizing this subject line and
generating derived, tagged tokens from it.

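The natural/synthetic split might look something like this sketch. The two helper names are hypothetical; only the "subject:" tagging style mirrors what the real spambayes tokenizer does:

```python
def natural_tokens(text: str) -> list:
    # Natural tokens: basic whitespace splitting, nothing more.
    return text.split()

def synthetic_tokens(subject: str) -> list:
    # Synthetic tokens: derived, tagged tokens -- here, spambayes-style
    # "subject:" prefixed words (the real tokenizer generates many more).
    return ["subject:" + word for word in subject.split()]
```

Testing a new idea against only natural_tokens would show its effect without the synthetic tokens masking it.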
By reducing the effectiveness of the system for testing, I think we'd have a
better idea how effective a new idea might be.  What I don't know is how to
measure the independence of two different "improvements".  (The more
independent two improvements are, the harder it seems it would be for a
spammer to kill two birds with one stone when trying to defeat spambayes.)
Suppose for the sake of argument that this base system I talk about is 80%
effective at properly distinguishing ham from spam.  Suppose improvement A
takes that to 83% and applied independently to the base system, improvement
B takes that to 85%.  How do you tell how independent A and B are from one
another?

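One rough way to frame it: if A and B were fully independent, each would cut the base error rate by its own factor, so the combined error should be roughly the product of those factors. This is an assumption for the sake of a sanity check, not an established method, and the function name is hypothetical:

```python
def predicted_combined_accuracy(base: float, with_a: float, with_b: float) -> float:
    """Predicted accuracy of base + A + B, *assuming* A and B are
    independent: each improvement multiplies the base error rate by
    its own reduction factor."""
    base_err = 1.0 - base
    factor_a = (1.0 - with_a) / base_err   # A shrinks error by this factor
    factor_b = (1.0 - with_b) / base_err   # B shrinks error by this factor
    return 1.0 - base_err * factor_a * factor_b

# With the numbers above: base 80%, A alone 83%, B alone 85%,
# independence would predict about 87.25% for A and B together.
predicted = predicted_combined_accuracy(0.80, 0.83, 0.85)
```

If the measured accuracy of the combined system falls well short of the prediction, the two improvements are presumably catching many of the same messages, i.e. they are far from independent.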
