[spambayes-dev] A new and altogether different bsddb breakage

Thu Dec 18 00:38:55 EST 2003

[Tony Meyer]
> I think part of the Japanese/Asian languages patch which I keep
> meaning to look more closely into has these turn into unicode strings
> (how many bits is that?  I know nothing much about unicode; English
> is good enough for me <wink>).
>
> (Just in case someone was about to implement a new spambayes db
> system with only 8-bit tokens).

Overall, I'd encourage them in that vice.  I did all I could to keep
SpamBayes neutral across European "Latin-insert-your-favorite-number"
languages, except for the non-default Anglocentric replace_nonascii_chars
option.  That's why I favored split-on-whitespace as the only msg body
lexing gimmick (of course it helped a lot that s-o-w did best in tests
across all lexing schemes ever tried!); have consistently resisted attempts
to add knowledge about "punctuation" (except in header-line contexts, where
standards constrain the permitted characters); haven't voiced any support
for gimmicks like "map Latin-1 into letters that look more like the ones I'm
used to" (but as the replace_nonascii_chars perpetrator, couldn't oppose
them in good conscience as options either <wink>); and haven't written a u''
literal anywhere in the source.
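
For anyone who hasn't looked at what "split-on-whitespace" amounts to, here
is a minimal sketch.  It is not the real SpamBayes tokenizer, which does
more than this; the function name tokenize_body and the sample text are made
up for illustration.  The point is only that nothing in it knows about
punctuation or about any particular alphabet.

```python
def tokenize_body(text):
    """Yield body tokens by splitting on runs of whitespace, nothing more.

    Hypothetical helper for illustration; not the actual SpamBayes code.
    """
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty strings.
    for word in text.split():
        yield word


sample = "Buy  cheap\tv1agra now!!!\nClick here."
tokens = list(tokenize_body(sample))
print(tokens)
```

Punctuation stays glued to the word it came with ("now!!!", "here."), which
is exactly the "no knowledge about punctuation" stance described above.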

My belief is that Asian languages differ so much in what doing a good job
would require that someone wanting that would be better off forking the
project.  I really don't want to see masses of deeply different algorithms
all slammed into the same codebase, not even if "the cost" were just
massively refactoring SpamBayes to add another two layers of expensive
indirection.  SpamBayes isn't required to be all things to all people.

I haven't studied the patch you're talking about, so maybe it's just a
one-liner <wink>.  Alas, I'm aware of it, and have read the patch comments,
and the panic above is a fair reflection of my first, second, and third
reactions.

As to how many bits are in a Unicode string, you don't want to know.  "It
depends."  Pickles store them in an Anglocentric format (UTF-8) that
happens to consume exactly the same number of bytes as an 8-bit string does
now, provided the string consists of just US ASCII characters.  The memory
burden is much larger, though (Python Unicode string objects are big
beasts).
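
To make the "it depends" concrete, here is a small sketch of how many bytes
UTF-8 needs for a few tokens.  It assumes a current Python 3 interpreter
rather than the Python of this thread, and the sample tokens and the helper
name utf8_len are invented for illustration.  The last check assumes pickle
protocol 2, which writes the UTF-8 bytes of a string into the pickle.

```python
import pickle


def utf8_len(s):
    """Return the number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


ascii_token = "viagra"              # pure US ASCII
latin_token = "caf\u00e9"           # "cafe" with an acute accent on the e
cjk_token = "\u65e5\u672c\u8a9e"    # three CJK characters

for tok in (ascii_token, latin_token, cjk_token):
    # Characters vs. bytes: equal only for pure ASCII.
    print(repr(tok), len(tok), utf8_len(tok))

# Assumption: with protocol 2 the string's UTF-8 bytes appear in the pickle.
pickled = pickle.dumps(cjk_token, protocol=2)
print(cjk_token.encode("utf-8") in pickled)
```

ASCII tokens cost one byte per character, the accented Latin-1 letter costs
two, and each of the CJK characters costs three, so the on-disk size depends
entirely on what the tokens contain.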