
[Martin]
Ok. Are you then, overall, in favour of taking the proposed approach?
It solves part of one problem; I'd rather solve all of it, but can't volunteer time to do that.
It is not thread-safe, but only so if somebody calls setlocale in a different thread, and that is known not to be thread-safe - so I could live with that limitation.
There's no way of using C's locale gimmicks that's threadsafe, short of all callers agreeing to follow a beyond-standard-C exclusion protocol -- which is the same as saying "no way" in reality. So that's part of one problem no patch of this ilk *can* solve. It's not that the patch doesn't try hard enough, it's that this approach is inherently inadequate to solve all of this particular problem.
It is just that the patch does not "feel" right, given that there must be "native" locale-inaware parsing of floating point constants somewhere on each platform (at least on those that support C++98).
I haven't found one on Windows (doesn't mean it doesn't exist, does mean it's apparently well hidden if it does exist).
... One of my early concerns (and I still have this concern) is that the contributors here appear to take the position "We have this fine code developed elsewhere, it seems to work, so we copy it. We don't actually have to understand this code". I would feel more comfortable if the code was written from scratch for usage in Python, with just the ideas borrowed from glib. Proper attribution of contributors and licensing are just one aspect; we really need the submitter of the code to fully understand it, and to be capable of reacting to problems quickly.
The patch is certainly more code than is needed to solve the part of the problem it does solve. For example, things like

    typedef char   gchar;
    typedef short  gshort;
    typedef long   glong;
    typedef int    gint;

introduce silly synonyms ("silly" == typing gshort instead of short does nothing except introduce possibilities for confusion); there are many definitions like

    #define g_ascii_isupper(c) \
        ((g_ascii_table[(guchar) (c)] & G_ASCII_UPPER) != 0)

that are never referenced; the code caters to C99's hexadecimal float literals but Python doesn't; and so on. If someone who understood Python internals read my earlier two-sentence description of how the patch works, they could write something that works equally well for Python's purposes with a fraction of the code introduced by the patch.
... The PEP should also point out deficiencies of the approach taken, e.g. the issue of spelling NaN, inf, etc. If it can be determined not to be an issue in real life (i.e. for all interesting platforms), this should be documented as well.
Well, the patch doesn't even pretend to address other issues with portability of float literals. They routinely come up on c.l.py, so of course users bump into them; when someone is motivated enough to file a bug report, I shuffle it off to PEP 42, under the "non-accidental 754 support" heading (which covers many fp issues beyond just literals, of course).

[James Henstridge]
... Your average localised package usually switches to the user's preferred locale on startup, so that it can display strings and messages, and occasionally wants to read/write numbers in a locale independent format (usually when saving/loading files). The most common way of doing this is the setlocale/strtod/setlocale combo, which has thread safety problems and possible reentrancy problems if done wrong.
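The save/switch/parse/restore combo described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `parse_float_in_c_locale` is a hypothetical helper name, and `locale.atof` stands in for C's `strtod` (Python's own `float()` does not consult the locale). The window between the two `setlocale` calls is exactly where another thread can observe, or clobber, the process-wide locale.

```python
import locale

def parse_float_in_c_locale(text):
    """Hypothetical helper: parse `text` with LC_NUMERIC temporarily set to "C".

    Not thread-safe: the locale is process-wide, so other threads see the
    temporary "C" setting, and may change it again before we restore it.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)      # query the current setting
    locale.setlocale(locale.LC_NUMERIC, "C")         # switch to the "C" locale
    try:
        return locale.atof(text)                     # locale-aware parse
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)   # restore, even on error

value = parse_float_in_c_locale("3.14")
```

The `try`/`finally` guards against the reentrancy mistake of leaving the locale switched after a parse error; it does nothing about other threads.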
I became acutely aware of the problems here due to the spambayes project, part of which embeds Python in Outlook 2000/2002. Outlook routinely runs more than a dozen threads, and by observation changes locale "frequently". None of that is documented, Python has no influence over when or why Outlook decides to switch locale, and neither can Python exclude Outlook's other threads when the Outlook thread Python is running in becomes active. Mark Hammond solved our problems there by forcing locale back to "C" every chance he gets; that's an anti-social and probabilistic approach, but appears to be the best spambayes can do today.

Having spambayes grow its own float<->string code doesn't help, because the worst problem spambayes had is that Python's marshal format uses ASCII strings to store float literals in .pyc files, so that Python itself can (and does) load insane float values out of .pyc files if LC_NUMERIC isn't "C" at the time a .pyc file gets imported. The only thing that could truly solve spambayes's problems here is for Python to use a thoroughly thread-safe string->float routine, where "thoroughly" includes not caring whether other threads switch locale in mid-stream.

An irony is that Microsoft's *native* locale gimmicks are thread-safe (each Win32 thread has its own idea of Win32 locale); why Outlook is even mucking with C's thread-braindead notion of locale is a mystery.

In short, I can't be enthusiastic about the patch because it doesn't solve the only relevant locale problem I've actually run into. I understand that it may well solve many I haven't run into. OTOH, the specific problem I'm acutely worried about would be better addressed by changing the way Python marshals float values.

[Guido]
Maybe at least we can detect platforms for which we know there is a native conversion in the library, and not use the hack on those?
I rarely find that piles of conditionalized code are more comprehensible or reliable; they usually result in mysterious x-platform differences, and become messier over time as we stumble into more platform library bugs, quirks, and limitations.
... Here's yet another idea (which probably has flaws as well): instead of substituting the locale's decimal separator, rewrite strings like "3.14" as "314e-2" and rewrite strings like "3.14e5" as "314e3", then pass to strtod(), which assigns the same meaning to such strings in all locales.
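As an illustration of the rewriting just described, here is a minimal Python sketch. The function name `remove_decimal_point` and the accepted grammar (optional sign, decimal digits, optional fraction, optional exponent) are assumptions made for this example; spellings such as inf or NaN are deliberately rejected rather than handled.

```python
import re

# Assumed grammar for a plain decimal float literal (no inf/nan, no hex).
_FLOAT_RE = re.compile(r"([+-]?)(\d*)(?:\.(\d*))?(?:[eE]([+-]?\d+))?")

def remove_decimal_point(literal):
    """Rewrite e.g. "3.14" as "314e-2", so the result contains no '.'."""
    m = _FLOAT_RE.fullmatch(literal)
    if m is None:
        raise ValueError("not a plain decimal literal: %r" % literal)
    sign, int_part, frac_part, exp = m.groups()
    frac_part = frac_part or ""
    if not (int_part or frac_part):
        raise ValueError("no digits in literal: %r" % literal)
    # Each digit moved from the fraction into the integer part lowers the
    # decimal exponent by one.
    exponent = int(exp or 0) - len(frac_part)
    return "%s%s%se%d" % (sign, int_part, frac_part, exponent)

print(remove_decimal_point("3.14"))    # 314e-2
print(remove_decimal_point("3.14e5"))  # 314e3
```

The output string has the same mathematical value as the input; whether a given platform's strtod() converts both spellings to the same double is a separate question.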
This is a harder transformation than s/./locale_decimal_point/. It does address the thread-safety issue. Numerically it's flaky, as only a perfectly-rounding string->float routine can guarantee to return bit-for-bit identical results given equivalent (viewed as infinite precision) decimal representations as inputs, and few platform string->float routines do perfect rounding.
This removes the question of what decimal separator is used by the locale completely, and thus removes the last bit of thread-unsafety from the code. However, I don't know if underflow can cause the result to be different, e.g. perhaps 1.23eX could be computed but 123e(X-2) could not??? (Sounds pretty unlikely on the face of it since I'd expect any decent conversion algorithm to pretty much break its input down into a string of digits and an exponent, but I've never actually studied such algorithms in detail.)
Each library is likely to fail in its own unique ways. Here's a cute one:

"""
base = 1.2345678901234567
digits = "12345678901234567"
for exponent in range(-16, -15000, -1):
    string = digits + "0" * (-16 - exponent)
    string += "e%d" % exponent
    derived = float(string)
    assert base == derived, (string, derived)
"""

On Windows, this first fails at exponent -5202, where float(string) delivers a result a factor of 10 too large. I was surprised it did that well!

Under Cygwin Python 2.2.3, it consumed > 14 minutes of CPU time, but never failed. I believe they're using a derivative of David Gay's excruciatingly complex IEEE-754 perfect-rounding string<->float routines (which would explain both why it didn't fail and why it consumed enormous CPU time; the code is excruciatingly complex because it does perfect rounding quickly for "normal" inputs, via a large variety of delicate speed tricks; when those tricks don't apply, it has to simulate unbounded-precision arithmetic to guarantee perfect rounding).
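To separate "the strings differ in value" from "the library rounds badly", the strings built by the loop above can be compared as exact rationals. This is a minimal sketch: `shifted_literal` is a hypothetical helper that mirrors the loop's string construction, and only a handful of exponents are sampled.

```python
from fractions import Fraction

def shifted_literal(digits, exponent):
    """Build the same string as the loop: digits, padding zeros, exponent."""
    return digits + "0" * (-16 - exponent) + "e%d" % exponent

digits = "12345678901234567"
exact = Fraction("1.2345678901234567")   # exact decimal value, no rounding

for exponent in (-16, -17, -100, -400):
    string = shifted_literal(digits, exponent)
    # Every variant denotes exactly the same rational number ...
    assert Fraction(string) == exact
    # ... so any difference in float(string) comes from the conversion
    # routine's rounding, not from the inputs.
    derived = float(string)
```

On an implementation whose string->float conversion is correctly rounded, `derived` is the same double for every exponent.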