[spambayes-dev] Very small change for composite word tokenizing.

Meyer, Tony T.A.Meyer at massey.ac.nz
Thu Aug 7 12:21:39 EDT 2003


> Uh oh, just noticed a bug in the original that I didn't catch before 
> hitting Send.  The original code above should be:
>      yield w
> instead of:
>      yield word
> The variation would then be:
>      yield 'compound:' + w
> 
> Did everyone who previously tested this change catch the error?
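For anyone who wants to re-run the test, here is a minimal sketch of what the
corrected generator looks like; the split rule and function name are my own
illustrative assumptions, not the actual tokenizer.py code:

    import re

    def tokenize_compound(word):
        # Hypothetical split rule: break composite words on punctuation.
        parts = re.split(r'[-._]', word)
        if len(parts) > 1:
            for w in parts:
                yield w                    # corrected: yield w, not word
                # variation under test:
                # yield 'compound:' + w

So tokenize_compound("skip-intro.html") would yield "skip", "intro", "html"
(or the 'compound:'-prefixed forms in the variation).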

My original results, and Sean's, were from before this fix; my later
results, and Alex's, were from after it.  (Sean indicated that his retest
after the fix was also a loss, although he was going to try different
bucket sizes.)

Ironically, the incorrect version gave better results for Sean, and
similar results for me.  Unless anyone is going to post more results, I
suspect that this will be thrown in the "nice idea but doesn't produce
the needed results" bin.

(If someone had the time, it would be great to take all the comments
from the list, tokenizer.py, and elsewhere and make a coherent summary
of everything that has been tested and what the results were...)

Anyone up for some more testing?

=Tony Meyer


