[spambayes-dev] "Bayesian Dobly"

Seth Goodman sethg at GoodmanAssociates.com
Thu Feb 26 16:22:13 EST 2004

> [Kenny Pitt]
> I've had very similar results.  My suspicion is that if (and it's a big
> *if*) word salad is going to have any negative effect at all, it would
> be a possible long-term reduction in classifier accuracy.  That's
> something that's going to be very difficult to test.  I doubt that
> anyone who rebuilds their training database from scratch on a
> semi-regular basis will ever see any effect from it, though, unless the
> spammers do a much better job of selecting the words they use.

As another data point, my database has no problem identifying the salad
messages as spam.  Anecdotally, I *think* they don't score as high (no
data whatsoever, just a vague impression), but that could easily be
wrong.  Spammers could do a better job of selecting salad words,
perhaps, as most of them turn out to be hapaxes for my database.  This
is a tough nut for them to crack because everyone's hammy vocabulary is
different.  I think that will protect us in the end.  However, if anyone
could distill a subset of words that are hammy for a large number of
people, then we'd have some trouble.  Doing that is not a small project
and might not even be possible, but as their delivery rates decline,
they may try it.
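
To put a number on the "mostly hapaxes" impression, something like the
following rough sketch could be run over a salad message.  It is not
SpamBayes code: the function name and the plain word-count mapping are
assumptions for illustration, and "hapax" is taken here to mean a token
seen exactly once in training.

```python
from collections import Counter

def hapax_fraction(message_words, training_counts):
    """Fraction of a message's words seen exactly once in training.

    training_counts is a hypothetical word -> count mapping, not the
    real SpamBayes database format.
    """
    if not message_words:
        return 0.0
    hapaxes = sum(1 for w in message_words if training_counts.get(w, 0) == 1)
    return hapaxes / len(message_words)

# Toy training data: "a" was seen twice, "b" and "c" once, "d" never.
counts = Counter(["a", "a", "b", "c"])
fraction = hapax_fraction(["a", "b", "c", "d"], counts)
```

A value near 1.0 for salad messages would back up the impression that
the salad words carry almost no evidence either way.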

As far as the "Dolby" approach goes (I think Dolby Labs would cringe at
this, but Hormel got nabbed so why not Dolby?), it's interesting as a
contrast to bigrams.  With bigrams, we look for the "strongest" word
pairs, considering different tilings of the word stream.  The bigram
scheme doesn't care about the individual strengths of adjacent words,
except indirectly when deciding on the best tiling.  The "Dolby"
approach looks for word pairs whose members have the most opposite
classifications and doesn't care about the strength of the word pair as
a token.  This puts a high value on word order, which is information
not considered in bigrams.  My
intuition says it wouldn't help, but like they say, that and a nickel
will get you on the subway (obviously a long time ago in a universe far,
far away).
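
To make the contrast concrete, here is a minimal sketch of the two
rankings as I understand them.  Everything in it is an assumption for
illustration: the function names, the plain dicts of spam probabilities,
and the 0.5 default for unknown tokens are not taken from the SpamBayes
code, and the bigram side ignores the tiling step entirely.

```python
def most_opposed_pairs(words, wordprob, top=3, unknown=0.5):
    """'Dolby'-style ranking: adjacent words whose individual spam
    probabilities differ the most.  The pair itself is never scored."""
    probs = [wordprob.get(w, unknown) for w in words]
    pairs = [
        (abs(probs[i] - probs[i + 1]), words[i], words[i + 1])
        for i in range(len(words) - 1)
    ]
    pairs.sort(key=lambda t: t[0], reverse=True)
    return pairs[:top]

def strongest_bigrams(words, pairprob, top=3, unknown=0.5):
    """Bigram-style ranking: adjacent pairs whose own spam probability
    is furthest from 0.5.  (Simplified: no tiling is considered.)"""
    pairs = [
        (abs(pairprob.get((a, b), unknown) - 0.5), a, b)
        for a, b in zip(words, words[1:])
    ]
    pairs.sort(key=lambda t: t[0], reverse=True)
    return pairs[:top]

# Made-up probabilities, purely for illustration.
wordprob = {"free": 0.90, "viagra": 0.99, "meeting": 0.02, "agenda": 0.05}
pairprob = {("free", "viagra"): 0.999, ("meeting", "agenda"): 0.01}
words = ["free", "viagra", "meeting", "agenda"]

dolby_top = most_opposed_pairs(words, wordprob, top=1)
bigram_top = strongest_bigrams(words, pairprob, top=1)
```

With these made-up numbers the first ranking picks the spammy/hammy
boundary ("viagra", "meeting"), while the second picks ("free",
"viagra"), the pair that is strongest as a token in its own right.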


Seth Goodman
