[Spambayes] Here's why "generate_long_skips: False" worked...

Neil Schemenauer nas@python.ca
Mon, 30 Sep 2002 17:42:57 -0700


Tim Peters wrote:
> An easy example is Asian spam, where the lack of whitespace ends up
> generating oodles of skip tokens (and '8bit%' tokens), but there must
> be a more effective way to generate useful tokens for that without
> bloating the database beyond reason.

I tried generating character 2-grams when has_highbit_char was true.  I
seem to recall that it worked okay.  The bonus is that, with 8-bit
characters, there would be a limit of 2**16 of these tokens in the DB.
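
A minimal sketch of that idea, not the code that was actually tried.  It
assumes words arrive as 8-bit byte strings.  The regex behind
has_highbit_char, the helper names byte_bigrams and tokenize_word, and
the "2g:" prefix are all illustrative assumptions.

```python
import re

# Assumption: a word "has a high-bit char" if any byte is >= 0x80.
has_highbit_char = re.compile(rb"[\x80-\xff]").search


def byte_bigrams(word: bytes):
    """Yield every overlapping 2-byte slice of word as a token."""
    for i in range(len(word) - 1):
        # The "2g:" prefix is hypothetical; it only keeps these tokens
        # apart from ordinary word tokens.
        yield b"2g:" + word[i:i + 2]


def tokenize_word(word: bytes):
    """Emit 2-grams for high-bit words, the word itself otherwise."""
    if has_highbit_char(word):
        yield from byte_bigrams(word)
    else:
        yield word
```

Each 2-gram is exactly two bytes, so at most 256 * 256 = 2**16 distinct
tokens of this kind can exist, however long the unbroken run of text is.
A one-byte high-bit word yields no tokens at all in this sketch.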

  Neil