[spambayes-dev] RE: [Python-Dev] RE: [Spambayes] Question (or possibly a bug report)

Tim Peters tim.one@comcast.net
Thu, 24 Jul 2003 23:08:34 -0400


[Skip Montanaro]
> Jeez, this locale crap makes Unicode look positively delightful...

Yes, it does!  locale is what you get when someone complains they like to
use ampersands instead of commas to separate thousands, and a committee thinks
"hey! we've got all these great functions already, so why change them?
instead we'll add mounds of hidden global state that affects lots of ancient
functions in radical ways!".  Make sure it's as hostile to threads as
possible, decline to define any standard locale names beyond "C" and the
empty string, and decline to define what anything except the "C" locale name
means, and you're almost there.  The finishing touches come in the function
definitions, like this in strtod():

   In other than the "C" locale, additional locale-specific subject
   sequence forms may be accepted.

What those may be isn't constrained in any way, of course.

locale can be cool in a monolithic, single-threaded, one-platform program,
provided the platform C made up rules you can live with for the locales you
care about.  It's more of an API framework than a solution, and portable
programs really can't use it except via forcing locale back to "C" every
chance they get <wink>.
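
For the record, here is a minimal sketch of the "force it back to C" dance in
Python.  The helper name c_numeric_locale is made up for illustration, and
because setlocale() changes process-wide state, this is no safer in a
threaded or embedded program than anything else locale touches.

```python
import locale
from contextlib import contextmanager


@contextmanager
def c_numeric_locale():
    """Temporarily force LC_NUMERIC to "C" (hypothetical helper, not thread-safe)."""
    # Calling setlocale() with no second argument only queries the current setting.
    saved = locale.setlocale(locale.LC_NUMERIC)
    locale.setlocale(locale.LC_NUMERIC, "C")
    try:
        yield
    finally:
        # Put back whatever the host application had in effect.
        locale.setlocale(locale.LC_NUMERIC, saved)


with c_numeric_locale():
    # Inside the block the decimal point is "." no matter what the host chose.
    value = locale.atof("0.12")
```

Nothing stops another thread, or the embedding application, from switching
the locale again halfway through the block, which is exactly the complaint.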

> The SB Windows triumvirate (Mark, Tim, Tony) seem to have narrowed
> down the problem quite a bit.  Is there some way to worm around it?
> I take it with the unmarshalling problem it's not sufficient to
> specify floating point values without decimal points (e.g., 0.12 ==
> 1e-1+2e-2).

When true division becomes the default, things like

    12/100

should work reliably regardless of locale -- i.e., don't use any float
literals, and you can't get screwed by locale float-literal quirks.  Today,
absurd spellings like

    float(12)/100

can accomplish the same.
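
A small sketch of that spelling, written so that no float literal ever lands
in the compiled code.  The names ratio and threshold are hypothetical, chosen
only for illustration; in a Python where true division is the default, plain
12 / 100 gives the same value.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Build a float from two ints, so no float literal needs to be parsed."""
    # float(int) and division never go through locale-sensitive text parsing.
    return float(numerator) / denominator


# Equivalent to writing the literal 0.12, but spelled with ints only.
threshold = ratio(12, 100)
```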

Changing Python is a better solution.  The rule that an embedded Python
requires that LC_NUMERIC be "C" isn't livable -- embedded Python is a fly
trying to stare down an elephant, in Outlook's case.  I dragged python-dev
into this to illustrate that it's a very real problem in a very popular
kick-ass Python app.  Note that this same problem was discussed in more
abstract terms by others here within the last few weeks, and I hope that
making it more concrete helps get the point across.

The float-literal-in-.pyc problem could be addressed in several ways.
Binary pickles, and the struct module, use a portable binary float format
that isn't subject to locale quirks.  I think marshal should be changed to
use that too, by adding an additional marshal float format (so old marshals
would continue to be readable, but new marshals may not be readable under
older Pythons).  Note that text-mode pickles of floats are vulnerable to
locale nightmares too.
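
To make the binary-format idea concrete, here is a sketch using the struct
module.  It round-trips a float through 8 bytes of IEEE-754 double with no
text representation in between, so there is no decimal point for a locale to
reinterpret.  The function names dump_float and load_float are hypothetical,
and this is not the format marshal actually uses; it only illustrates the
kind of encoding being proposed.

```python
import struct


def dump_float(x: float) -> bytes:
    """Encode x as a little-endian IEEE-754 double (8 bytes, no text involved)."""
    return struct.pack("<d", x)


def load_float(data: bytes) -> float:
    """Decode 8 bytes produced by dump_float back into a float."""
    (x,) = struct.unpack("<d", data)
    return x


encoded = dump_float(0.12)
decoded = load_float(encoded)
```

The "<" in the format string pins the byte order, so the bytes mean the same
thing on every platform, whatever locale happens to be in effect.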

> Is the proposed early specification of a locale in the config file
> sufficient to make things work?

I doubt it, as Outlook can switch locale any time it feels like it.  We
can't control that.  I think we should set a line-tracing hook, and force
locale back to "C" on every callback <wink>.

> A foreign user of the nascent CSV module beat us up a bit during
> development about not supporting different locales (I guess in Brazil
> the default separator is a semicolon, which makes sense if your
> decimal "point" is a comma).  Thank God we ignored him! ;-)

Ya, foreigners are no damn good <wink>.