[spambayes-dev] problems locating messages with bigrams
tim.one at comcast.net
Tue Jan 6 13:42:52 EST 2004
> I eventually figured out that the way I generate bigrams:
> for t in Classifier()._enhance_wordstream(tokenize(msg)):
> uses the current training database to decide which tokens should be
I think you're hallucinating here -- _enhance_wordstream() doesn't make any
use of training data. Whenever tokenize() yields a stream of N tokens,
_enhance_wordstream() yields a derived stream of 2*N-1 tokens.
> the leading & trailing unigrams or the bigram of the two. All
> possible bigrams are not generated.
A specific example would clarify what you think you mean by these phrases.
By the definition of bigrams *intended* by the code, only adjacent token
pairs can be pasted together into bigrams. If the 4 incoming tokens are
a b c d, the 2*4-1 = 7 output tokens are

    a  bi:a b  b  bi:b c  c  bi:c d  d
and it doesn't matter whether a, b, c, and/or d have or haven't been trained
on before.
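The interleaving described above can be sketched as a small generator. This is
not the actual spambayes Classifier._enhance_wordstream, just an illustration
of the stated behavior under the assumption that bigram tokens carry a "bi:"
prefix: a stream of N tokens yields 2*N-1 tokens, the N unigrams interleaved
with the N-1 adjacent-pair bigrams, with no reference to any training data.

```python
def enhance_wordstream(tokens):
    """Yield each unigram, plus a 'bi:' bigram for every adjacent pair.

    Illustrative sketch only -- not the spambayes implementation.
    For N input tokens this yields 2*N-1 output tokens.
    """
    prev = None
    for tok in tokens:
        if prev is not None:
            # Only adjacent tokens are pasted together into a bigram.
            yield "bi:%s %s" % (prev, tok)
        yield tok
        prev = tok

print(list(enhance_wordstream(["a", "b", "c", "d"])))
# -> ['a', 'bi:a b', 'b', 'bi:b c', 'c', 'bi:c d', 'd']
```

Note that nothing in the generator consults a database: the output depends
only on the incoming token stream.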