[Spambayes] 12 char limit on tokens

Amedee Van Gasse amedee at amedee.be
Tue Feb 27 08:19:51 CET 2007


On ma, 2007-02-26 at 16:07 -0800, Peter Bishop wrote:
> I was looking into why "Schwarzenegger" was not recognized as a token,
> when I discovered that you had determined that it was good to have a
> 12 character limit on tokens.  Is this really better than a 15 char
> limit?

For languages with mostly short words, like English, raising the token
length limit will give only marginally better results (if any).
On the other hand, if a lot of your correspondence is in a language with
long words (like German; Schwarzenegger is a German/Austrian name),
then raising the limit might give better results.

I presume the devs chose a limit of 12 chars based on experience (they
have tested with thousands of messages). I think there must be some
trade-off between the efficiency of the algorithm and the size of the
token database.

There is only one way to find out if a token limit of 15 is better _for_
_you_: try it out.
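
To get a feel for what the limit does before changing anything, here is
a rough sketch of a length-limited tokenizer. It is not the actual
SpamBayes code: the function name tokenize_words, the max_len parameter
and the exact form of the "skip:" placeholder token are assumptions made
for illustration only. It just shows how a 14-character word like
"Schwarzenegger" stops being its own token once it exceeds the limit.

```python
def tokenize_words(text, max_len=12):
    """Yield lowercase word tokens (hypothetical, simplified tokenizer).

    Words of 3..max_len characters are kept as-is. Longer words are
    replaced by a placeholder that records only the first character and
    the length rounded down to a multiple of 10 (an assumed format).
    Words shorter than 3 characters are dropped.
    """
    for word in text.split():
        n = len(word)
        if 3 <= n <= max_len:
            yield word.lower()
        elif n > max_len:
            yield "skip:%s %d" % (word[0].lower(), n // 10 * 10)


message = "Governor Schwarzenegger signed the bill"
for limit in (12, 15):
    print(limit, list(tokenize_words(message, max_len=limit)))
```

With a limit of 12 the name collapses into a generic placeholder shared
by other long words starting with "s"; with 15 it survives as a token of
its own. Whether that extra token actually helps classification is
exactly what you would have to measure on your own mail.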

-- 
Amedee