[Python-Dev] Re: Be Honest about LC_NUMERIC [REPOST]

Mon Sep 1 18:03:29 EDT 2003

[Tim]
> In short, I can't be enthusiastic about the patch because it doesn't
> solve the only relevant locale problem I've actually run into.  I
> understand that it may well solve many I haven't run into.

At this point in your life, Tim, is there any patch you could be truly
enthusiastic about? :-)

I'm asking because I'd like to see the specific problem that started
this thread solved, if necessary using a compromise that means the
solution isn't perfect.  I'm even willing to take a step back from the
status quo, given that the status quo isn't perfect anyway, and that
compromises mean something has to give.

*Maybe* the right solution is that we have to accept a
hard-to-understand overcomplicated piece of code that we don't know
how to maintain (but for which the author asserts that we won't have
to do much maintenance in the foreseeable future).  But *maybe*
there's a simpler solution.

> OTOH, the specific problem I'm acutely worried about would be better
> addressed by changing the way Python marshals float values.

So solve it.  The approach used by binary pickles seems entirely
reasonable.  All we need to do is change the .pyc magic number.
(There's undoubtedly user code in the world that would break because
it requires interoperability between Python versions.  So let the
marshal module grow a way to specify the format.)
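
For illustration only, here is a minimal sketch of the binary approach:
write the float as its 8-byte IEEE-754 double image instead of as decimal
text, so no locale-dependent conversion is involved in either direction.
The names dump_float and load_float are hypothetical and are not part of
marshal or pickle; the big-endian byte order is chosen to match what
binary pickles use.

```python
import struct

def dump_float(x):
    # Hypothetical helper: pack as a big-endian IEEE-754 double (8 bytes).
    # No decimal string is produced, so LC_NUMERIC cannot affect the result.
    return struct.pack(">d", x)

def load_float(data):
    # Hypothetical helper: inverse of dump_float; restores the exact bits.
    return struct.unpack(">d", data)[0]
```

Round-tripping this way is exact for every finite double, which a
repr()/atof() round trip cannot promise on all platforms.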

> [Guido]
> > Maybe at least we can detect platforms for which we know there is a
> > native conversion in the library, and not use the hack on those?
> 
> I rarely find that piles of conditionalized code are more
> comprehensible or reliable; they usually result in mysterious
> x-platform differences, and become messier over time as we stumble
> into more platform library bugs, quirks, and limitations.

Fair enough.  So *if* we decide to use the donated conversion code, we
should start by using it unconditionally.  I predict that at some
point in the future we'll find a platform whose quirks are not handled
by the donated code, and where it's simpler to use a correct native
equivalent than to try to fix the donated code; but I expect that
point to be pretty far in the future, *or* the platform to be pretty
far from the mainstream.

> > ...
> > Here's yet another idea (which probably has flaws as well): instead of
> > substituting the locale's decimal separator, rewrite strings like
> > "3.14" as "314e-2" and rewrite strings like "3.14e5" as "314e3", then
> > pass to strtod(), which assigns the same meaning to such strings in
> > all locales.
> 
> This is a harder transformation than s/./locale_decimal_point/.  It does
> address the thread-safety issue.  Numerically it's flaky, as only a
> perfectly-rounding string->float routine can guarantee to return bit-for-bit
> identical results given equivalent (viewed as infinite precision) decimal
> representations as inputs, and few platform string->float routines do
> perfect rounding.
> 
> > This removes the question of what decimal separator is used by the
> > locale completely, and thus removes the last bit of thread-unsafety
> > from the code.  However, I don't know if underflow can cause the result
> > to be different, e.g. perhaps 1.23eX could be computed but 123e(X-2)
> > could not???  (Sounds pretty unlikely on the face of it since I'd expect
> > any decent conversion algorithm to pretty much break its input down into
> > a string of digits and an exponent, but I've never actually studied
> > such algorithms in detail.)
> 
> Each library is likely to fail in its own unique ways.  Here's a cute one:
> 
> """
> base = 1.2345678901234567
> 
> digits = "12345678901234567"
> 
> for exponent in range(-16, -15000, -1):
>     string = digits + "0" * (-16 - exponent)
>     string += "e%d" % exponent
>     derived = float(string)
>     assert base == derived, (string, derived)
> """
> 
> On Windows, this first fails at exponent -5202, where float(string)
> delivers a result a factor of 10 too large.  I was surprised it did
> that well!  Under Cygwin Python 2.2.3, it consumed > 14 minutes of
> CPU time, but never failed.  I believe they're using a derivative of
> David Gay's excruciatingly complex IEEE-754 perfect-rounding
> string<->float routines (which would explain both why it didn't fail
> and why it consumed enormous CPU time; the code is excruciatingly
> complex because it does perfect rounding quickly for "normal"
> inputs, via a large variety of delicate speed tricks; when those
> tricks don't apply, it has to simulate unbounded-precision
> arithmetic to guarantee perfect rounding).

I fail to see the relevance of the example to my proposed hack, except
as a proof that the world isn't perfect -- but we already know that.
Under my proposal, the number of digits converted would never change,
so any sensitivity of the algorithm used to the number of digits
converted would be irrelevant.  I note that the strtod.c code that's
currently in the Python source tree uses a similar (though opposite)
trick: it converts the number to the form 0.<fraction>E<expt> before
handing it off to atof().  So my proposal still stands.  I'm happy to
entertain a proof that it's flawed but not one where the flawed input
has over 5000 digits *and* depends on a flaw in the platform routines.
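
A rough sketch of the rewriting step, assuming plain decimal input with an
optional sign, fraction and exponent (no infinities, NaNs or hex floats).
The function name strip_decimal_point and the regular expression are made
up for this illustration; a real version would live in C next to the
strtod() call.

```python
import re

# Optional sign, integer digits, optional fraction, optional exponent.
_FLOAT_RE = re.compile(r"^([+-]?)(\d*)(?:\.(\d*))?(?:[eE]([+-]?\d+))?$")

def strip_decimal_point(text):
    # Hypothetical helper: "3.14" -> "314e-2", "3.14e5" -> "314e3".
    m = _FLOAT_RE.match(text.strip())
    if m is None or not (m.group(2) or m.group(3)):
        raise ValueError("not a decimal float literal: %r" % (text,))
    sign, whole, frac, exp = m.groups()
    frac = frac or ""
    # Moving the point past the fraction lowers the exponent by len(frac).
    exponent = int(exp or 0) - len(frac)
    # The digit string is kept exactly as written; only the point is dropped.
    return "%s%s%se%d" % (sign, whole, frac, exponent)
```

The output never contains a decimal separator, so the conversion routine
has no reason to look at the locale, and the number of digits it is asked
to convert is the same as in the original string.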

--Guido van Rossum (home page: http://www.python.org/~guido/)