[Spambayes] Here's why "generate_long_skips: False" worked...
Neil Schemenauer
nas@python.ca
Mon, 30 Sep 2002 17:42:57 -0700
Tim Peters wrote:
> An easy example is Asian spam, where the lack of whitespace ends up
> generating oodles of skip tokens (and '8bit%' tokens), but there must
> be a more effective way to generate useful tokens for that without
> bloating the database beyond reason.
I tried generating character 2-grams (bigrams) when has_highbit_char was
true. I seem to recall that it worked okay. The bonus is that there could
be at most 2**16 of these tokens in the DB, since each one is a pair of
8-bit characters.
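
Roughly what that looks like, as a minimal sketch and not the actual
tokenizer code. The helper below is a stand-in that only borrows the
has_highbit_char name, and the "2g:" token prefix and the tokenize_word
function are made up for illustration:

```python
def has_highbit_char(data: bytes) -> bool:
    """Stand-in check: True if any byte has its high bit set (>= 0x80)."""
    return any(b >= 0x80 for b in data)


def byte_bigrams(data: bytes):
    """Yield overlapping 2-byte tokens from data.

    Each token is one of at most 256 * 256 == 2**16 distinct values,
    which is what bounds the number of such tokens in the database.
    The "2g:" prefix is a hypothetical tag to keep them apart from
    ordinary word tokens.
    """
    for i in range(len(data) - 1):
        yield "2g:" + data[i:i + 2].hex()


def tokenize_word(word: bytes):
    """Emit bigrams for high-bit text, else the word itself."""
    if has_highbit_char(word):
        yield from byte_bigrams(word)
    else:
        yield word.decode("ascii")
```

A long run of 8-bit text with no whitespace then produces many tokens per
message, but they are all drawn from the same fixed set of pairs, so the
database cannot grow without bound the way it does with skip tokens.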
Neil