[Spambayes] Looking for code to modify spambayes for 5-gram tokenization
Tim Peters
tim.peters at gmail.com
Fri Jun 10 23:39:17 CEST 2005
[Richard Coleman]
> I'm looking to modify spambayes to use 5-grams rather than
> split-on-whitespace. We have a few Asian customers and the default
> spambayes setup has not been very effective for them. So, we want to
> test with 5-grams and see if we can improve the effectiveness.
>
> I know that n-grams have been tested several times before. So, if
> anyone has an n-gram tokenizer that they can share, I would appreciate a
> copy. Otherwise, I'll dive in and write it myself.
I doubt anyone has still-working code for character N-grams. It was
tested in the very early stages of this project, because (a) it was
clear that split-on-whitespace wouldn't work worth spit for many Asian
languages; and, (b) it was a prejudice of mine that character N-grams
would do well on European languages too. Alas, test results said #b
wasn't true, and character N-grams create problems of their own: a
large increase in database size, exaggerated cross-token correlations
when overlapping N-grams are used, and results that are hard to interpret.
There's a largish comment block in tokenizer.py expanding on those.
That said, tokenizing for character N-grams is dead easy. If `text`
is a string containing the message you want to tokenize,
A. Overlapping N-grams:
    for i in xrange(len(text)-N+1):
        yield text[i:i+N]
B. Non-overlapping N-grams:
    for i in xrange(0, len(text)-N+1, N):
        yield text[i:i+N]
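For anyone trying this on a current Python, here is a minimal
self-contained sketch of the two loops above wrapped as generator
functions. The function names and the default n=5 are illustrative
assumptions, not anything from tokenizer.py; `range` stands in for
`xrange`, which no longer exists in Python 3.

```python
def overlapping_ngrams(text, n=5):
    """Yield every length-n substring, sliding one character at a time."""
    # A text shorter than n gives an empty range, so nothing is yielded.
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def nonoverlapping_ngrams(text, n=5):
    """Yield consecutive length-n chunks; a short trailing remainder is dropped."""
    for i in range(0, len(text) - n + 1, n):
        yield text[i:i + n]


if __name__ == "__main__":
    sample = "abcdefghijkl"
    print(list(overlapping_ngrams(sample)))
    print(list(nonoverlapping_ngrams(sample)))
```

Note that the overlapping form produces len(text)-n+1 tokens where the
non-overlapping form produces about len(text)/n, which is one source of
the database growth mentioned earlier.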
For some time we did generate character 5-grams for "long" words
containing "high-bit" characters, but dropped that. This was mostly
aiming at a cheap way for non-Asian users to recognize Asian spam, but
we found cheaper ways to do that (carried to an extreme by the
`replace_nonascii_chars` option, which is very effective for English
users without Asian ham).
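To make the two ideas in that paragraph concrete, here is a rough
sketch. It is not the code from tokenizer.py: the function names, the
12-character "long word" cutoff, and the "?" placeholder are all
assumptions chosen for illustration, and on Python 3 strings "high-bit"
is approximated as any code point of 128 or above.

```python
def has_highbit_char(word):
    """True if any character falls outside the 7-bit ASCII range."""
    return any(ord(ch) >= 128 for ch in word)


def long_word_tokens(word, n=5, long_cutoff=12):
    """Yield 5-grams for long high-bit words, else the word itself.

    long_cutoff is a hypothetical threshold, not a SpamBayes setting.
    """
    if len(word) >= long_cutoff and has_highbit_char(word):
        for i in range(len(word) - n + 1):
            yield word[i:i + n]
    else:
        yield word


def replace_nonascii(text, placeholder="?"):
    """Collapse every non-ASCII character to one placeholder character."""
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)
```

The replacement approach throws away what the non-ASCII characters
actually were and keeps only the fact that they were there, which is why
it suits users who never get legitimate mail in those languages and is
no help to those who do.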
I expect that effective spam identification for Asian languages would
require mostly replacing tokenizer.py, and a different database
strategy too.