[spambayes-dev] problems locating messages with bigrams

Tim Peters tim.one at comcast.net
Tue Jan 6 14:15:22 EST 2004


[Skip Montanaro]
> Hmmm...  I thought _enhance_wordstream() was the thing which "tiled"
> the token space.

No, tiling requires knowledge of spamprobs, and _getclues() does the
tiling.  _enhance_wordstream() generates the universe of possible tiles (all
individual tokens and all pairs of adjacent tokens); a subset of those is
selected by _getclues(), based on the tiles' spamprobs, to create a tiling
(a partitioning of the token stream into non-overlapping features).
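
To make the split concrete, here's a minimal sketch of the kind of generator
_enhance_wordstream() is (this is a sketch of the idea, not the actual
SpamBayes source): emit every individual token plus a "bi:" token for each
pair of adjacent tokens, so the whole universe of possible tiles flows out,
overlapping freely.

    def enhance_wordstream(wordstream):
        # Yield each unigram, plus a bigram token combining it with
        # the previous unigram -- every possible tile, overlap allowed.
        last = None
        for token in wordstream:
            yield token
            if last is not None:
                yield "bi:%s %s" % (last, token)
            last = token

    # e.g. list(enhance_wordstream(["free", "cash", "now"])) ->
    # ["free", "cash", "bi:free cash", "now", "bi:cash now"]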

> Why isn't this code in tokenize.py if it doesn't rely on training
> data?

It's a transformation of tokenize's output, so it doesn't really "belong" in
tokenize either.  It's certainly more convenient to do it in the classifier,
and _getclues() (which inarguably belongs in the classifier) requires
intimate knowledge of how the universe of tiles was generated in order to
guarantee non-overlap among the tiles it chooses.
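
For illustration only, here's a hedged sketch of the non-overlap constraint
_getclues() has to enforce (the names pick_tiling and spamprob are invented
for the example, and the real method also worries about clue limits, min/max
spamprobs, etc.): greedily keep the tiles whose spamprobs sit farthest from
neutral 0.5, skipping any tile that would cover a position already claimed.

    def pick_tiling(tiles, spamprob):
        # tiles: (start_index, length, token) triples, where length is 1
        # for a unigram and 2 for a bigram over the original token stream.
        covered = set()
        chosen = []
        # Strongest evidence first: distance of spamprob from neutral 0.5.
        for start, length, token in sorted(
                tiles, key=lambda t: abs(spamprob(t[2]) - 0.5), reverse=True):
            span = range(start, start + length)
            if any(i in covered for i in span):
                continue        # would overlap an already-chosen tile
            covered.update(span)
            chosen.append(token)
        return chosen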



