[spambayes-dev] problems locating messages with bigrams
tim.one at comcast.net
Tue Jan 6 14:15:22 EST 2004
> Hmmm... I thought _enhance_wordstream() was the thing which "tiled"
> the token space.
No, tiling requires knowledge of spamprobs, and _getclues() does the
tiling. _enhance_wordstream() generates the universe of possible tiles (all
individual tokens and all pairs of adjacent tokens); a subset of those is
selected by _getclues(), based on the tiles' spamprobs, to create a tiling
(a partitioning of the token stream into non-overlapping features).
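To make "universe of possible tiles" concrete, here is a rough sketch of the generation step. It is not the classifier's actual code: the function name candidate_tiles, the (start, length, feature) tuples, and the "bi:" prefix for pairs are assumptions chosen for illustration.

```python
def candidate_tiles(tokens):
    """Yield (start, length, feature) for each token and each adjacent pair.

    Hypothetical helper: the tuple layout and the "bi:" prefix are
    illustrative assumptions, not the classifier's real representation.
    """
    prev = None
    for i, tok in enumerate(tokens):
        # Every individual token is a candidate tile of length 1.
        yield (i, 1, tok)
        if prev is not None:
            # Every pair of adjacent tokens is a candidate tile of length 2,
            # starting at the position of the previous token.
            yield (i - 1, 2, "bi:%s %s" % (prev, tok))
        prev = tok


tiles = list(candidate_tiles(["buy", "cheap", "pills"]))
```

For n tokens this yields n single-token tiles plus n-1 pair tiles, and no spamprobs are consulted at this stage.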
> Why isn't this code in tokenize.py if it doesn't rely on training
It's a transformation of tokenize's output, so it doesn't really "belong" in
tokenize either. It's certainly more convenient to do it in the classifier,
and _getclues() (which inarguably belongs in the classifier) requires
intimate knowledge of how the universe of tiles was generated in order to
guarantee non-overlap among the tiles it chooses.
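A rough sketch of why the selection step needs to know how the tiles were built: to reject an overlapping tile it has to know which token positions each candidate covers. The code below is an assumed greedy strategy (strongest candidate first, strength measured as distance of the spamprob from 0.5), not the real _getclues(); the name choose_tiling and the spamprob callable are hypothetical.

```python
def choose_tiling(tokens, spamprob):
    """Pick non-overlapping features that together cover every token.

    Hypothetical sketch: greedy by strength, where strength is
    abs(spamprob(feature) - 0.5). Not the classifier's real algorithm.
    """
    # Rebuild the universe of tiles: each token and each adjacent pair.
    candidates = []
    for i, tok in enumerate(tokens):
        candidates.append((i, 1, tok))
        if i + 1 < len(tokens):
            candidates.append((i, 2, "bi:%s %s" % (tok, tokens[i + 1])))

    # Consider the strongest candidates first.
    candidates.sort(key=lambda t: abs(spamprob(t[2]) - 0.5), reverse=True)

    covered = set()  # token positions already claimed by a chosen tile
    chosen = []
    for start, length, feature in candidates:
        positions = range(start, start + length)
        if any(p in covered for p in positions):
            continue  # would overlap a tile that was already chosen
        covered.update(positions)
        chosen.append((start, feature))

    # Report the chosen features in stream order.
    chosen.sort()
    return [feature for _, feature in chosen]


# Made-up spamprobs, for illustration only.
probs = {"buy": 0.6, "cheap": 0.8, "pills": 0.9,
         "bi:buy cheap": 0.7, "bi:cheap pills": 0.99}
tiling = choose_tiling(["buy", "cheap", "pills"], lambda f: probs.get(f, 0.5))
```

With these invented numbers the pair "bi:cheap pills" is strongest, so it is chosen first; that rules out "cheap", "pills", and "bi:buy cheap", and "buy" covers the remaining position.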