[spambayes-dev] testing tweaks

Justin Mason jm at jmason.org
Thu Aug 7 18:12:33 EDT 2003

Tim Peters writes:
> If I ever get time for it, I'd like to pursue a specific mixed
> unigram-bigram scheme worked out with Gary Robinson.  For example, given
> "penis size", that can be viewed as a bigram, or as two unigrams, or as two
> unigrams *and* a bigram.  The last choice isn't so good because it
> systematically creates highly correlated clues, which leads to mistakes that
> don't make sense to a human eye (I'll claim that experienced spambayes users
> are sympathetic to the mistakes it makes -- spambayes judgments are
> "intuitive", in some real sense).  But with enough effort, it's possible to
> "tile" a message with non-overlapping unigrams and bigrams, so that each
> token contributes to exactly one scored entity.  The trick is to do this in
> a way that maximizes the overall strength of the entities that get scored.
> So, for example, and simplifying too much, if the bigram "penis size" has a
> spamprob closer to 0.0 or 1.0 than either of the unigrams "penis" and
> "size", view it as a bigram; but if "penis" has a spamprob closer to 0.0 or
> 1.0 than "penis size", view it as two unigrams instead.
> I only had time to run a few tests on that, and it looked very promising,
> learning faster than our current pure-unigram scheme, and doing at least as
> well on all error measures.  It was (of course) slower to score, and the
> database more than doubled in size.  For my own use, it would have been
> worth it, since my personal databases are still relatively tiny (about 1,000
> training msgs total), and the code runs too fast for me to notice it now.  I
> suspect, but don't know, that this mixed scheme would do significantly
> better on short messages.
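
For concreteness, here is a minimal Python sketch of the tiling idea quoted
above.  It is not the scheme Tim and Gary worked out, only one guess at it:
it assumes "strength" means distance of the spamprob from 0.5, and it
approximates "maximizes the overall strength" greedily, taking the strongest
candidate first and skipping anything that overlaps a token already used.
The names tile(), strength() and the spamprob dict are made up for
illustration.

```python
def strength(prob):
    """Distance from the neutral 0.5 (an assumed definition of strength)."""
    return abs(prob - 0.5)


def tile(tokens, spamprob, unknown=0.5):
    """Cover tokens with non-overlapping unigrams and bigrams.

    spamprob is a hypothetical dict mapping a unigram or a space-joined
    bigram to its spam probability; unseen entries count as `unknown`.
    """
    candidates = []  # (strength, start index, width in tokens)
    for i, tok in enumerate(tokens):
        candidates.append((strength(spamprob.get(tok, unknown)), i, 1))
        if i + 1 < len(tokens):
            bigram = tok + " " + tokens[i + 1]
            candidates.append((strength(spamprob.get(bigram, unknown)), i, 2))

    # Strongest first; ties go to the earlier position, then to the unigram.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    covered = [False] * len(tokens)
    chosen = []
    for _, start, width in candidates:
        span = range(start, start + width)
        if any(covered[j] for j in span):
            continue  # overlaps a token that already belongs to an entity
        for j in span:
            covered[j] = True
        chosen.append((start, " ".join(tokens[start:start + width])))

    chosen.sort()  # restore message order
    return [entity for _, entity in chosen]


# The bigram is the strongest clue, so it is kept as one entity.
print(tile(["penis", "size"],
           {"penis": 0.9, "size": 0.6, "penis size": 0.99}))

# Here "penis" alone is stronger, so the pair is scored as two unigrams.
print(tile(["penis", "size"],
           {"penis": 0.99, "size": 0.5, "penis size": 0.9}))
```

Every unigram is always a candidate, so each token ends up in exactly one
scored entity.  A dynamic-programming pass could maximize total strength
exactly, but that needs a decision on how one bigram should be weighed
against two unigrams, which the greedy version sidesteps.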

That's interesting -- it's like the idea of "decomposing" tokens and using
the strongest of the results.  e.g. for "Free!", decompose that into
"free!", "Free" and "free", and use the strongest result of the 4 lookups
(the original token plus its 3 variants).
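
A rough Python sketch of that decomposition, under the same assumption that
"strongest" means furthest from 0.5.  The helper names and the spamprob dict
are hypothetical, and stripping punctuation and lowercasing are just the two
decompositions suggested by the "Free!" example:

```python
import string


def variants(token):
    """Return the token plus its case-folded and punctuation-stripped forms."""
    stripped = token.strip(string.punctuation)
    seen = []
    for v in (token, token.lower(), stripped, stripped.lower()):
        if v and v not in seen:  # drop empties and duplicates, keep order
            seen.append(v)
    return seen


def strongest_variant(token, spamprob, unknown=0.5):
    """Pick the variant whose spamprob is furthest from the neutral 0.5.

    spamprob is a hypothetical dict of token -> spam probability; variants
    not in the dict count as `unknown`.  Ties go to the earliest variant.
    """
    return max(variants(token),
               key=lambda v: abs(spamprob.get(v, unknown) - 0.5))


print(variants("Free!"))
print(strongest_variant("Free!", {"free": 0.7, "Free": 0.6, "free!": 0.97}))
```

Each token costs up to 4 lookups here instead of 1, and whether the variants
are also stored at training time is what decides how much the db grows.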

I'm interested because I'm pretty sure that kind of compound-word breakup
tweak would also increase the db size, but that doesn't seem to have been
mentioned...

