[Spambayes] Here's why "generate_long_skips: False" worked...

Tim Peters tim.one@comcast.net
Mon, 30 Sep 2002 22:22:03 -0400


[Neil Schemenauer]
> I tried generating 2 character-grams when has_highbit_char was true.

In addition to, or in lieu of, generating skip tokens?

> I seem to recall that it worked okay.  The bonus would be that there
> would be a limit of 2**16 of these tokens in the DB.

Appreciated.  I used to do character 5-grams in this case, and the database
burden was significant.  Plus results didn't get worse when I stopped doing
n-grams altogether.

Somebody want to try this on their corpus?

1. Current vs doing character 2-grams when has_highbit_char is true
   instead of generating skip tokens.

2. Current vs doing character 2-grams when has_highbit_char is true
   in addition to generating skip tokens.
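
For anyone setting up the comparison, here is a minimal sketch of the three variants. It is not the tokenizer's actual code. The names `skip_token`, `char_2grams`, `long_word_tokens` and the `variant` argument are hypothetical, the `2g:` prefix and the skip-token format are assumptions, and `has_highbit_char` is redefined here as a stand-in that matches any byte >= 0x80. Words are treated as byte strings so that the 2**16 bound on distinct 2-grams holds.

```python
import re

# Assumed stand-in for has_highbit_char: matches any byte >= 0x80.
has_highbit_char = re.compile(rb"[\x80-\xff]").search


def skip_token(word):
    # Illustrative skip token: first byte plus the length rounded down
    # to a multiple of 10.  The exact format is an assumption.
    return b"skip:" + word[:1] + b" " + str(len(word) // 10 * 10).encode("ascii")


def char_2grams(word):
    # Overlapping byte pairs.  Each pair is 2 bytes, so at most
    # 256 * 256 == 2**16 distinct tokens can ever reach the database.
    for i in range(len(word) - 1):
        yield b"2g:" + word[i:i + 2]


def long_word_tokens(word, variant="current"):
    # variant (hypothetical switch for the experiment):
    #   "current" -> skip token only
    #   "instead" -> experiment 1: 2-grams replace the skip token
    #   "both"    -> experiment 2: 2-grams in addition to the skip token
    if variant not in ("current", "instead", "both"):
        raise ValueError(variant)
    highbit = has_highbit_char(word) is not None
    # The skip token is dropped only when 2-grams replace it.
    if variant != "instead" or not highbit:
        yield skip_token(word)
    if variant != "current" and highbit:
        yield from char_2grams(word)
```

Calling `list(long_word_tokens(word, "instead"))` and `list(long_word_tokens(word, "both"))` on the same long word gives the token streams for experiments 1 and 2. A word with no high-bit bytes gets the skip token under every variant.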