[Spambayes] Looking for code to modify spambayes for 5-gram tokenization

Tim Peters tim.peters at gmail.com
Fri Jun 10 23:39:17 CEST 2005


[Richard Coleman]
> I'm looking to modify spambayes to use 5-grams rather than
> split-on-whitespace.  We have a few Asian customers and the default
> spambayes setup has not been very effective for them.  So, we want to
> test with 5-grams and see if we can improve the effectiveness.
>
> I know that n-grams have been tested several times before.  So, if
> anyone has a n-gram tokenizer that they can share, I would appreciate a
> copy.  Otherwise, I'll dive in and write it myself.

I doubt anyone has still-working code for character N-grams.  It was
tested in the very early stages of this project, because (a) it was
clear that split-on-whitespace wouldn't work worth spit for many Asian
languages; and (b) it was a prejudice of mine that character N-grams
would do well on European languages too.  Alas, test results said (b)
wasn't true, and character N-grams create problems of their own: a
large increase in database size, exaggerated cross-token correlations
if overlapping N-grams are used, and results that are hard to
interpret.  There's a largish comment block in tokenizer.py expanding
on those.

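That said, tokenizing for character N-grams is dead easy.  If `text`
is a string containing the message you want to tokenize, and N is the
N-gram length (5 in your case),

A. Overlapping N-grams:

    def overlapping_ngrams(text, N):
        for i in range(len(text)-N+1):
            yield text[i:i+N]

B. Non-overlapping N-grams (a trailing chunk shorter than N is dropped):

    def nonoverlapping_ngrams(text, N):
        for i in range(0, len(text)-N+1, N):
            yield text[i:i+N]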
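
To see what the two variants produce on a concrete input, here is a
small self-contained sketch (Python 3) that folds both loops into one
helper with a `step` argument.  The name `char_ngrams` and the sample
string are made up for illustration and are not taken from SpamBayes.

```python
def char_ngrams(text, n, step=1):
    """Yield character n-grams of `text` (hypothetical helper).

    step=1 gives overlapping n-grams; step=n gives non-overlapping ones.
    A trailing chunk shorter than n is never emitted.
    """
    for i in range(0, len(text) - n + 1, step):
        yield text[i:i + n]


sample = "cheap pills"                            # 11 characters
overlapping = list(char_ngrams(sample, 5))        # 7 tokens
disjoint = list(char_ngrams(sample, 5, step=5))   # 2 tokens; final "s" is dropped
words = sample.split()                            # 2 tokens

print(overlapping)
print(disjoint)
print(words)
```

On this input the overlapping variant yields 7 tokens where
split-on-whitespace yields 2, which gives a feel for the database
growth mentioned above.  Each character also lands in several
overlapping tokens, so one word can show up as a handful of highly
correlated clues.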

For some time we did generate character 5-grams for "long" words
containing "high-bit" characters, but dropped that.  This was mostly
aiming at a cheap way for non-Asian users to recognize Asian spam, but
we found cheaper ways to do that (carried to an extreme by the
`replace_nonascii_chars` option, which is very effective for English
users without Asian ham).
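
A rough sketch of the two ideas in that paragraph follows, in Python 3.
It is not the code SpamBayes used: the function names, the 12-character
cutoff for a "long" word, the test `ord(ch) >= 128` for "high-bit", and
the "?" replacement marker are all assumptions chosen for illustration.

```python
def has_high_bit(word):
    # Assumption: "high-bit" means any character with code point >= 128.
    return any(ord(ch) >= 128 for ch in word)


def tokenize_word(word, n=5, long_len=12):
    """Yield overlapping n-grams for long high-bit words, else the word itself.

    long_len is an assumed cutoff for what counts as a "long" word.
    """
    if len(word) > long_len and has_high_bit(word):
        for i in range(len(word) - n + 1):
            yield word[i:i + n]
    else:
        yield word


def replace_nonascii(text, marker="?"):
    # Cheaper alternative: collapse every non-ASCII character to one marker,
    # so all such text looks alike to the classifier.
    return "".join(ch if ord(ch) < 128 else marker for ch in text)


print(list(tokenize_word("hello")))
print(len(list(tokenize_word("\u4e2d" * 14))))
print(replace_nonascii("caf\u00e9 ok"))
```

The second approach throws away the content of non-ASCII text
entirely, which is why it only suits users who never receive
legitimate mail in those languages.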

I expect that effective spam identification for Asian languages would
require mostly replacing tokenizer.py, and a different database
strategy too.

