> Well, what happens in CRM114 is not that 
> the HTML causes confusion, it does get 
> factored in, but when you have a nearly 
> 1:1 ratio in the hits, it basically doesn't 
> make any difference to the end value.

Bill, how does CRM114 handle a typical <IMG SRC=HTTP tag? Could you
post a sample of the (multi-word) tokens it generates for one?

That would help me understand your approach and decide whether it is
the right one to take with my test tweaks to the Spambayes tokenizer.
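
To make the question concrete, here is a rough sketch of the kind of
multi-word tokens I have in mind. It is only a guess for illustration:
a plain sliding window over adjacent words, not CRM114's actual
algorithm and not current Spambayes code. The function name, the
splitting rule and the example URL are all made up.

```python
import re


def window_tokens(text, size=2):
    """Join each run of `size` adjacent words into one token.

    Assumption: words are runs of letters and digits, lowercased.
    This is a guess for illustration, not how CRM114 really splits.
    """
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]
    # Slide a window of `size` words across the word list.
    return [" ".join(words[i:i + size])
            for i in range(len(words) - size + 1)]


# Hypothetical tag; the URL is invented for the example.
for token in window_tokens("<IMG SRC=HTTP://example.com/a.gif>"):
    print(token)
```

With a window of two this yields pairs such as "img src" and
"src http", so the tag structure ends up inside the tokens rather than
being stripped before tokenizing.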

Thanks for your help,

