[spambayes-dev] problems locating messages with bigrams
Tim Peters
tim.one at comcast.net
Tue Jan 6 13:42:52 EST 2004
[Skip Montanaro]
> ...
> I eventually figured out that the way I generate bigrams:
>
> for t in Classifier()._enhance_wordstream(tokenize(msg)):
> ...
>
> uses the current training database to decide which tokens should be
> generated,
I think you're hallucinating here -- _enhance_wordstream() doesn't make any
use of training data. Whenever tokenize() yields a stream of N tokens,
_enhance_wordstream() yields a derived stream of 2*N-1 tokens.
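For reference, that expansion can be written as a small generator along
these lines (a minimal standalone sketch of the pairing logic, not the
actual spambayes source, where _enhance_wordstream() is a Classifier
method):

    def enhance_wordstream(wordstream):
        # Yield each unigram, plus a "bi:" token pasting each adjacent
        # pair together.  N input tokens -> 2*N-1 output tokens.
        last = None
        for token in wordstream:
            yield token
            if last is not None:
                yield "bi:%s %s" % (last, token)
            last = token

Note that nothing here looks at training data:  the output depends only
on the token stream itself.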
> the leading & trailing unigrams or the bigram of the two. All
> possible bigrams are not generated.
A specific example would clarify what you think you mean by these phrases.
By the definition of bigrams *intended* by the code, only adjacent token
pairs can be pasted together into bigrams. If the 4 incoming tokens are a b
c d, the 2*4-1 = 7 output tokens are
    a
    b
    bi:a b
    c
    bi:b c
    d
    bi:c d
and it doesn't matter whether a, b, c, and/or d have been trained on
previously.
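Running the sketch above over those four tokens reproduces the listing:

    >>> list(enhance_wordstream(["a", "b", "c", "d"]))
    ['a', 'b', 'bi:a b', 'c', 'bi:b c', 'd', 'bi:c d']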