[Spambayes] 12 char limit on tokens
Amedee Van Gasse
amedee at amedee.be
Tue Feb 27 08:19:51 CET 2007
On Mon, 2007-02-26 at 16:07 -0800, Peter Bishop wrote:
> I was looking into why "Schwarzenegger" was not recognized as a token,
> when I discovered that you had determined that it was good to have a
> 12 character limit on tokens. Is this really better than a 15 char
> limit?
For languages with mostly short words, like English, increasing the
token length will give only marginally better results (if any).
On the other hand, if a lot of your correspondence is in a language with
long words (like German - and Schwarzenegger is a German/Austrian name),
then increasing the token length might give better results.
I presume the devs chose a limit of 12 chars based on experience (they
have tested with thousands of messages). I think there must be a
trade-off between the accuracy of the algorithm and the size of the
token database.
There is only one way to find out if a token limit of 15 is better _for_
_you_: try it out.
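
To get a feel for what the limit does before you change anything, here
is a minimal sketch in Python. It is NOT the real SpamBayes tokenizer:
the function `tokenize`, its `max_len` and `min_len` parameters and the
sample sentence are all made up for illustration, and I am assuming
that over-long words are simply dropped, which may not match what
SpamBayes actually does with them.

```python
import re

def tokenize(text, max_len=12, min_len=3):
    """Yield lowercased words whose length is between min_len and max_len.

    Hypothetical stand-in for a length-limited tokenizer; words outside
    the range are silently skipped (an assumption made for this sketch).
    """
    # [^\W\d_]+ matches runs of letters only, including non-ASCII
    # letters such as German umlauts.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if min_len <= len(word) <= max_len:
            yield word

sample = "Governor Schwarzenegger signed the bill"

# Compare the tokens produced under the two limits being discussed.
for limit in (12, 15):
    print(limit, list(tokenize(sample, max_len=limit)))
```

"Schwarzenegger" has 14 characters, so it only survives with the higher
limit. Running something like this over a sample of your own ham and
spam shows how many extra tokens a limit of 15 would add to the
database.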
--
Amedee