Be Honest about LC_NUMERIC [REPOST]

So, in an attempt to garner comments (now that we have 2.3 off the chopping block) I'm reposting my PEP proposal (with minor updates). Comments would be appreciated, of course (nudges Barry slightly after him getting me to write this on my only free Sunday in months ;)

PEP: XXX
Title: Be Honest about LC_NUMERIC (to the C library)
Version: $Revision: 1.9 $
Last-Modified: $Date: 2002/08/26 16:29:31 $
Author: Christian R. Reis <kiko at async.com.br>
Status: Draft
Type: Standards Track
Content-Type: text/plain <pep-xxxx.html>
Created: 19-July-2003
Post-History:

------------------------------------------------------------------------

Abstract

    Support in Python for the LC_NUMERIC locale category is currently implemented only in Python-space, which causes inconsistent behavior and thread-safety issues for applications that use extension modules and libraries implemented in C. This document proposes a plan for removing this inconsistency by providing and using substitute locale-agnostic functions as necessary.

Introduction

    Python currently provides generic localization services through the locale module, which among other things allows localizing the display and conversion process of numeric types. Locale categories, such as LC_TIME and LC_COLLATE, allow configuring precisely what aspects of the application are to be localized.

    The LC_NUMERIC category specifies formatting for non-monetary numeric information, such as the decimal separator in float and fixed-precision numbers. Localization of the LC_NUMERIC category is currently implemented only in Python-space; the C libraries are unaware of the application's LC_NUMERIC setting. This is done to avoid changing the behavior of certain low-level functions that are used by the Python parser and related code [2].

    However, this presents a problem for extension modules that wrap C libraries; applications that use these extension modules will inconsistently display and convert numeric values.
    James Henstridge, the author of PyGTK [3], has additionally pointed out that the setlocale() function also presents thread-safety issues, since a thread may call the C library setlocale() outside of the GIL, and cause Python to function incorrectly.

Rationale

    The inconsistency between Python and C library localization for LC_NUMERIC is a problem for any localized application using C extensions. The exact nature of the problem will vary depending on the application, but it will most likely occur when parsing or formatting a numeric value.

Example Problem

    The initial problem that motivated this PEP is related to the GtkSpinButton [4] widget in the GTK+ UI toolkit, wrapped by PyGTK. The widget can be set to numeric mode, and when this occurs, characters typed into it are evaluated as a number.

    Because LC_NUMERIC is not set in libc, float values are displayed incorrectly, and it is impossible to enter values using the localized decimal separator (for instance, `,' for the Brazilian locale pt_BR). This small example demonstrates reduced usability for localized applications using this toolkit when coded in Python.

Proposal

    Martin v. Löwis commented on the initial constraints for an acceptable solution to the problem on python-dev:

        - LC_NUMERIC can be set at the C library level without breaking the parser.
        - float() and str() stay locale-unaware.

    The following seems to be the current practice:

        - locale-aware str() and float() [XXX: atof(), currently?] stay in the locale module.
    An analysis of the Python source suggests that the following functions currently depend on LC_NUMERIC being set to the C locale:

        - Python/compile.c:parsenumber()
        - Python/marshal.c:r_object()
        - Objects/complexobject.c:complex_to_buf()
        - Objects/complexobject.c:complex_subtype_from_string()
        - Objects/floatobject.c:PyFloat_FromString()
        - Objects/floatobject.c:format_float()
        - Modules/stropmodule.c:strop_atof()
        - Modules/cPickle.c:load_float()

    [XXX: still need to check if any other occurrences exist]

    The proposed approach is to implement LC_NUMERIC-agnostic functions for converting from (strtod()/atof()) and to (snprintf()) float formats, using these functions where the formatting should not vary according to the user-specified locale.

    This change should also solve the aforementioned thread-safety problems.

Potential Code Contributions

    This problem was initially reported as a problem in the GTK+ libraries [5]; since then it has been correctly diagnosed as an inconsistency in Python's implementation. However, in a fortunate coincidence, the glib library implements a number of LC_NUMERIC-agnostic functions (for an example, see [6]) for reasons similar to those presented in this paper.

    In the same GTK+ problem report, Havoc Pennington has suggested that the glib authors would be willing to contribute this code to the PSF, which would simplify implementation of this PEP considerably.

    [I'm checking if Alex Larsson is willing to sign the PSF contributor agreement [7] to make sure the code is safe to integrate; XXX: what would be necessary to sign here?]

Risks

    There may be cross-platform issues with the provided locale-agnostic functions. This needs to be tested further.

    Martin has pointed out potential copyright problems with the contributed code. I believe we will have no problems in this area as members of the GTK+ and glib teams have said they are fine with relicensing the code.

Code

    An implementation is being developed by Gustavo Carneiro <gjc at inescporto.pt>.
    It is currently attached to Sourceforge.net bug 774665 [8] [XXX: The SF.net tracker is horrible 8(]

References

    [1] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton
        http://www.python.org/peps/pep-0001.html

    [2] Python locale documentation for embedding,
        http://www.python.org/doc/current/lib/embedding-locale.html

    [3] PyGTK homepage,
        http://www.daa.com.au/~james/pygtk/

    [4] GtkSpinButton screenshot (demonstrating problem),
        http://www.async.com.br/~kiko/spin.png

    [5] GNOME bug report,
        http://bugzilla.gnome.org/show_bug.cgi?id=114132

    [6] Code submission of g_ascii_strtod and g_ascii_dtostr (later renamed g_ascii_formatd) by Alex Larsson,
        http://mail.gnome.org/archives/gtk-devel-list/2001-October/msg00114.html

    [7] PSF Contributor Agreement,
        http://www.python.org/psf/psf-contributor-agreement.html

    [8] Python bug report,
        http://www.python.org/sf/774665

Copyright

    This document has been placed in the public domain.

Take care,
--
Christian Reis, Senior Engineer, Async Open Source, Brazil.
http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

On Wed, Aug 13, 2003, Christian Reis wrote:
I'll repeat what I said on 7/24: Looks good to me (can't comment on the tech issues, but it seems clear enough). However, I'd recommend changing the title to something like "Locale-independent float conversions". -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ This is Python. We don't care much about theory, except where it intersects with useful practice. --Aahz

On Thu, Aug 14, 2003 at 12:52:53AM -0400, Aahz wrote:
OT: I actually replied to you back then, but your MX denies mail from me (since our host happens to be in a pretty wide spamhaus block).
I suppose I'm okay with changing the title if it's too obscure to be helpful -- it was only meant to be precise in a light-hearted fashion. Take care, -- Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

On 14/08/2003 10:21 AM, Christian Reis wrote:
In my opinion, it is worth investigating this. One of the things I love about Python is the way it can be used to glue different bits of code together to form useful programs. Python's handling of LC_NUMERIC seems to work against this goal.

According to the POSIX standard, standard library functions like strtod() and printf()/sprintf() work with locale representations of floating point numbers once the setlocale() function has been called. For locale aware libraries, it seems quite sensible for them to use standard library functions to format numbers for display, and to read numbers entered by the user (which may use a comma for the decimal point under some locales).

Now, Python uses the C standard library functions to parse floating point numbers too, but doesn't want these string <-> float conversions to be locale sensitive. The solution currently in place is to declare that LC_NUMERIC must be C for Python programs (or applications embedding Python) or things will go wrong. If a bit of Python code wishes to format a float according to the locale it can call locale.format(), and to read a float from the user it can use locale.atof(). These use locale data read while the locale was set to the correct value temporarily.

Unfortunately, this does not help external libraries that know nothing of Python's requirements about how locales are set up, so they will always represent and read floating point numbers under the C locale (using a full stop as the decimal point). This is the problem Christian ran into with the GtkSpinButton widget in GTK, and I would not be surprised if other people have run into the problem as well.

There are two solutions to this problem that I can see:

1. modify Python so that it doesn't use locale sensitive conversion functions when it wants to convert floats in a locale independent manner.

2. modify every external library that makes use of the standard library strtod()/sprintf(), expecting locale sensitive float conversions, to use some other API.

Clearly (1) is the easier option, as there is a finite (and quite small) amount of code to change. For (2), there is potentially an unlimited amount of code to change.

As Christian said, there is code in glib (not to be confused with glibc: the GNU C library) that could act as a basis for locale independent float conversion functions in Python. The code was written by Alex Larsson (who works at Red Hat, so I suppose they own the copyright), who is willing to license it under Python's terms. You can see the history of the two functions (g_ascii_strtod() and g_ascii_formatd()) here:

http://cvs.gnome.org/bonsai/cvsblame.cgi?file=glib/glib/gstrfuncs.c#328

There are very minor alterations by other people (they look minor enough that the FSF wouldn't require a copyright assignment), but you could always use the versions from the initial checkin (rev 1.77) if that is a problem.

One alternative that some programs use for locale independent conversions is code a bit like this:

    char *oldlocale = setlocale(LC_NUMERIC, "C");
    num = strtod(string, NULL);
    setlocale(LC_NUMERIC, oldlocale);

This particular code snippet has some problems that I think make it unsuitable for Python:

* setlocale() affects the whole process, so this sort of operation could affect the results of strtod, printf, etc for some other thread in the program.

* setlocale() is not reentrant. Oldlocale in the above snippet is a pointer to a static string. If you surround the snippet with another pair of setlocale() calls, you can get unexpected results.

This means that adding setlocale() calls to random Python API calls has the potential to break existing code. Alex's code from glib does not suffer from these problems.

To sum it up, the current status quo in Python w.r.t. locales is causing problems for some of the tasks people want to use Python for. It would be nice to fix this problem.

James.
--
Email: james@daa.com.au
WWW: http://www.daa.com.au/~james/

James Henstridge <james@daa.com.au> writes:
I very much doubt that this statement is true. Are you sure this code supports all the platforms where Python runs? E.g. what about the three (!) different floating point formats on a VAX?
Unfortunately, this is not thread-safe, so it is clearly out of the question.
Certainly. However, incorporating glib code is not a solution. *Calling* glib code (where available) might be a solution. Also, the standard C++ library supports multiple concurrent locale objects, so calling *that* (where available) might be an option. Furthermore, the C++ library is often implemented on top of some C-only library, so calling that library would be better, as it would keep the C++ runtime library out of the Python prerequisites. Regards, Martin

[James Henstridge]
[Martin v. Löwis]
Well, you should look at the patch: it doesn't know anything about internal fp formats -- all conversions are performed in the end by calling the platform C's strtod() or snprintf(). What it does do is:

1. For string to double, preprocess the input string to change it to use current-locale spelling before calling the platform C strtod().

2. For double to string, postprocess the result of the platform C snprintf() to replace current-locale spelling with a "standard" spelling.

So this is much more string-munging code than it is floating-point code. Indeed, there doesn't appear to be a single floating-point operation in the entire patch (apart from calls to platform string<->float functions).

OTOH, despite the claims, it doesn't look threadsafe to me: there's no guarantee, e.g., that the idea of current locale g_ascii_strtod() obtains from localeconv() at its start is still in effect by the time g_ascii_strtod() gets around to calling strtod(). So at best it solves part of one relevant problem here (other relevant problems include that platform C libraries disagree about how to spell infinities, NaNs and signed zeroes; about how many digits to use for an exponent; and about how to round results (for example,

    >>> "%.1f" % 2.25
    '2.3'

on Windows, but most (not all!) flavors of Unix produce the IEEE-754 to-nearest/even rounded '2.2' instead)).

It's easy to write portable, perfectly-rounding string<->double conversion routines without calling any platform functions. The rub is that "fast" goes out the window then, unless you give up at least one of {portable, accurate}.
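
As a rough illustration of the two string-munging steps listed earlier in this message, here is a minimal Python sketch. It is not the patch's code: the function names are hypothetical, and the locale's decimal point is passed in as a plain argument (a real implementation would take it from localeconv()) so that the example behaves the same regardless of the locale it runs under.

```python
def to_locale_spelling(text, decimal_point):
    """Step 1 (string -> double): rewrite a C-locale literal so that a
    locale-aware strtod() would accept it.  Hypothetical helper."""
    if decimal_point == ".":
        return text
    # A float literal contains at most one '.', so replace only the first.
    return text.replace(".", decimal_point, 1)


def to_c_spelling(text, decimal_point):
    """Step 2 (double -> string): rewrite locale-formatted snprintf()
    output back to the "standard" C-locale spelling.  Hypothetical helper."""
    if decimal_point == ".":
        return text
    return text.replace(decimal_point, ".", 1)


# Example with a comma as the decimal point, as in pt_BR.
locale_text = to_locale_spelling("1234.5678", ",")
c_text = to_c_spelling("1,5e+10", ",")
```

Note that nothing here touches floating-point arithmetic, and nothing protects against another thread changing the locale between reading the decimal point and calling the platform conversion function.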

"Tim Peters" <tim.one@comcast.net> writes:
1. For string to double, preprocess the input string to change it to use current-locale spelling before calling the platform C strtod().
I see (I was confused by the presence of a table of bytes). This is much worse, then: How can it possibly know what formats the C library expects in the current locale? What if the C library insists that a thousands-separator is used when the locale has one? etc. Regards, Martin

[Tim, about what the patch does]
[Martin]
I see (I was confused by the presence of a table of bytes).
Right, the table appears to be there just to support locale-independent character classification.
I'm not sure that's a realistic objection. The patch appears to be trying to replace only the decimal point (if any), with localeconv()->decimal_point, and I've certainly not seen a locale that refuses to accept, e.g.,

    1234 <its idea of a decimal point> 5678

meaning the same as 1234.5678 in "C" locale. The draft C99 standard I have handy here says (in its strtod() section):

    In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.

and "additional" implies to me that every locale must accept at least the basic floating-point spellings described before that quoted sentence.

"Tim Peters" <tim.one@comcast.net> writes:
Ok. Are you then, overall, in favour of taking the proposed approach? It is not thread-safe, but only so if somebody calls setlocale in a different thread, and that is known not to be thread-safe - so I could live with that limitation. It is just that the patch does not "feel" right, given that there must be "native" locale-unaware parsing of floating point constants somewhere on each platform (at least on those that support C++98). Regards, Martin

FWIW, I have the same feeling, but the idea of having to support our own version of such code is even more uncomfortable. Maybe at least we can detect platforms for which we know there is a native conversion in the library, and not use the hack on those? --Guido van Rossum (home page: http://www.python.org/~guido/)

A Seg, 2003-09-01 às 09:34, Guido van Rossum escreveu:
In case people haven't noticed, the second version of the patch, submitted following the first patch, already does that. It adds a configure.in check for glibc's strtod_l() and uses that instead of glib code whenever it's available. Only on non-glibc systems is the glib code compiled in. Regards. -- Gustavo João Alves Marques Carneiro <gjc@inescporto.pt> <gustavo@users.sourceforge.net>

This doesn't address the first sentence of mine quoted above: I'm uncomfortable with having this code being part of Python, given that we have no expertise to maintain it long-term (nor should we need such expertise, IMO). Here's yet another idea (which probably has flaws as well): instead of substituting the locale's decimal separator, rewrite strings like "3.14" as "314e-2" and rewrite strings like "3.14e5" as "314e3", then pass to strtod(), which assigns the same meaning to such strings in all locales. This removes the question of what decimal separator is used by the locale completely, and thus removes the last bit of thread-unsafety from the code. However, I don't know if underflow can cause the result to be different, e.g. perhaps 1.23eX could be computed but 123e(X-2) could not??? (Sounds pretty unlikely on the face of it since I'd expect any decent conversion algorithm to pretty much break its input down into a string of digits and an exponent, but I've never actually studied such algorithms in detail.) --Guido van Rossum (home page: http://www.python.org/~guido/)
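
The rewriting idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the regular expression are hypothetical, only plain decimal literals are handled (no infinities, NaNs or hex floats), and no claim is made about how a given platform strtod() rounds the rewritten string.

```python
import re

# Optional sign, digits, optional '.', digits, optional exponent.
_FLOAT_RE = re.compile(r"^([+-]?)(\d*)\.?(\d*)(?:[eE]([+-]?\d+))?$")


def remove_decimal_point(literal):
    """Rewrite e.g. "3.14" as "314e-2" so that no decimal separator is
    left for the locale to disagree about.  Hypothetical helper."""
    match = _FLOAT_RE.match(literal.strip())
    if match is None or not (match.group(2) or match.group(3)):
        raise ValueError("not a decimal float literal: %r" % (literal,))
    sign, whole, fraction, exponent = match.groups()
    # Moving the point right by len(fraction) digits is compensated by
    # lowering the exponent by the same amount; the digits are unchanged.
    new_exponent = int(exponent or 0) - len(fraction)
    return "%s%se%d" % (sign, whole + fraction, new_exponent)


plain = remove_decimal_point("3.14")
with_exponent = remove_decimal_point("3.14e5")
```

The digit string handed to the conversion routine is exactly the one in the input; only the exponent changes.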

[Martin]
Ok. Are you then, overall, in favour of taking the proposed approach?
It solves part of one problem; I'd rather solve all of it, but can't volunteer time to do that.
There's no way of using C's locale gimmicks that's threadsafe, short of all callers agreeing to follow a beyond-standard-C exclusion protocol -- which is the same as saying "no way" in reality. So that's part of one problem no patch of this ilk *can* solve. It's not that the patch doesn't try hard enough, it's that this approach is inherently inadequate to solve all of this particular problem.
I haven't found one on Windows (doesn't mean it doesn't exist, does mean it's apparently well hidden if it does exist).
The patch is certainly more code than is needed to solve the part of the problem it does solve. For example, things like

    typedef char   gchar;
    typedef short  gshort;
    typedef long   glong;
    typedef int    gint;

introduce silly synonyms ("silly" == typing gshort instead of short does nothing except introduce possibilities for confusion); there are many definitions like

    #define g_ascii_isupper(c) \
        ((g_ascii_table[(guchar) (c)] & G_ASCII_UPPER) != 0)

that are never referenced; the code caters to C99's hexadecimal float literals but Python doesn't; and so on. If someone who understood Python internals read my earlier two-sentence description of how the patch works, they could write something that works equally well for Python's purposes with a fraction of the code introduced by the patch.
Well, the patch doesn't even pretend to address other issues with portability of float literals. They routinely come up on c.l.py, so of course users bump into them; when someone is motivated enough to file a bug report, I shuffle it off to PEP 42, under the "non-accidental 754 support" heading (which covers many fp issues beyond just literals, of course). [James Henstridge]
I became acutely aware of the problems here due to the spambayes project, part of which embeds Python in Outlook 2000/2002. Outlook routinely runs more than a dozen threads, and by observation changes locale "frequently". None of that is documented, Python has no influence over when or why Outlook decides to switch locale, and neither can Python exclude Outlook's other threads when the Outlook thread Python is running in becomes active. Mark Hammond solved our problems there by forcing locale back to "C" every chance he gets; that's an anti-social and probabilistic approach, but appears to be the best spambayes can do today.

Having spambayes grow its own float<->string code doesn't help, because the worst problem spambayes had is that Python's marshal format uses ASCII strings to store float literals in .pyc files, so that Python itself can (and does) load insane float values out of .pyc files if LC_NUMERIC isn't "C" at the time a .pyc file gets imported.

The only thing that could truly solve spambayes's problems here is for Python to use a thoroughly thread-safe string->float routine, where "thoroughly" includes not caring whether other threads switch locale in mid-stream. An irony is that Microsoft's *native* locale gimmicks are thread-safe (each Win32 thread has its own idea of Win32 locale); why Outlook is even mucking with C's thread-braindead notion of locale is a mystery.

In short, I can't be enthusiastic about the patch because it doesn't solve the only relevant locale problem I've actually run into. I understand that it may well solve many I haven't run into. OTOH, the specific problem I'm acutely worried about would be better addressed by changing the way Python marshals float values.

[Guido]
Maybe at least we can detect platforms for which we know there is a native conversion in the library, and not use the hack on those?
I rarely find that piles of conditionalized code are more comprehensible or reliable; they usually result in mysterious x-platform differences, and become messier over time as we stumble into more platform library bugs, quirks, and limitations.
This is a harder transformation than s/./locale_decimal_point/. It does address the thread-safety issue. Numerically it's flaky, as only a perfectly-rounding string->float routine can guarantee to return bit-for-bit identical results given equivalent (viewed as infinite precision) decimal representations as inputs, and few platform string->float routines do perfect rounding.
Each library is likely to fail in its own unique ways. Here's a cute one:

"""
base = 1.2345678901234567
digits = "12345678901234567"
for exponent in range(-16, -15000, -1):
    string = digits + "0" * (-16 - exponent)
    string += "e%d" % exponent
    derived = float(string)
    assert base == derived, (string, derived)
"""

On Windows, this first fails at exponent -5202, where float(string) delivers a result a factor of 10 too large. I was surprised it did that well! Under Cygwin Python 2.2.3, it consumed > 14 minutes of CPU time, but never failed. I believe they're using a derivative of David Gay's excruciatingly complex IEEE-754 perfect-rounding string<->float routines (which would explain both why it didn't fail and why it consumed enormous CPU time; the code is excruciatingly complex because it does perfect rounding quickly for "normal" inputs, via a large variety of delicate speed tricks; when those tricks don't apply, it has to simulate unbounded-precision arithmetic to guarantee perfect rounding).

On Mon, Sep 01, 2003 at 02:30:23PM -0400, Tim Peters wrote:
I would certainly concede this point -- Gustavo's patch is a proof-of-concept implementation. I do believe that the glib code is a good starting point for an implementation, and the author has submitted a written agreement, so the next step would be obtaining approval of the general approach and then diving in to clean up and minimize the code as much as possible. The issue is whether a cleaned up patch is the way python-dev would like this to go, or if another, perhaps orthogonal, approach -- such as the conversion Guido has proposed -- would be preferable.
Are these representations (NaN, infinity, etc) LC_NUMERIC-dependent? Or, more generally, locale-dependent? As for the thread-safety issues, it's true, changing locale multiple times in runtime is bound to confuse the conversion code (and consequently the interpreter). One way around this would be a complete reimplementation of the relevant functions, but here's betting that that alternative would be shunned even more intensely. Take care, -- Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

A Seg, 2003-09-01 às 20:44, Christian Reis escreveu:
Actually, this code presentation is intentional, for the following two reasons:

1- I didn't want to accidentally introduce any bug, so I tried to copy-paste the code with as few changes as possible;

2- In the current form, if glib developers find any bug in this code, we can easily merge the changes back into python.

Perhaps I was wrong to have done it this way... Anyway, replacing the g* types is trivial with any decent text editor. Regards. -- Gustavo J. A. M. Carneiro <gjc@inescporto.pt> <gustavo@users.sourceforge.net>

Gustavo J A M Carneiro <gjc@inescporto.pt> writes:
Perhaps I was wrong to have done it this way... Anyway, replacing the g* types is trivial with any decent text editor.
It's more than that. We want you to fully understand the patch, and to spot errors in it even before the glib developers find them. We want you to provide a minimalistic patch, that just implements the required functionality, and nothing else. We want the patch to be maintainable due to it being easy to read and follow, instead of being maintainable due to the fact that it is identical with some code elsewhere in the world. Regards, Martin

It's more than that. We want you to fully understand the patch, and to spot errors in it even before the glib developers find them.
It's unlikely that the glib code would contain errors that Gustavo could spot, before or after cleaning up the patch (no offense to Gustavo meant!). However, a more likely cause of errors would be that the adaptation of the code to a new environment breaks an unspoken assumption made by the code. Only truly understanding the code would reveal such assumptions.
Right. Just say no to "copy-and-paste code reuse". --Guido van Rossum (home page: http://www.python.org/~guido/)

[Christian Reis]
Don't know; C89 didn't say anything at all about them, so existing C practice is all over the map; C99 does say something about them, but whether a C99 compiler supports them is optional (support for them isn't mandatory; if a C implementation does choose to support them, then the spellings for input are standardized, although a locale is allowed to *produce* any spellings whatsoever).

[Tim]
[martin@v.loewis.de]
Where exactly does it (C99) say that the spellings are locale-specific?
I can't find anything in the std supporting the claim. For that matter, I can't find anything in the std supporting the notion that a locale is allowed to insert thousand-separator characters either (can you?). There's lots of stuff allowing a locale to *accept* locale-specific spellings (when parsing strings), in addition to the "C" locale spellings; the other direction (producing strings) appears much less permissive.

"Tim Peters" <tim.one@comcast.net> writes:
No: I'm now convinced that sprintf is *forbidden* to insert the thousands separator. This is why POSIX added the '-flag (%'f); this will produce the thousands-separator. That said: Implementations might choose to ignore the standard in that respect. This issue just supports my thesis that the patch is complicated: If I have to read the C99 standard to find out whether it is correct, it must be complicated. I doubt either the submitters or the original author of the code did that exercise...
Indeed. I'm not sure whether this is intentional, though. Regards, Martin

[Tim]
At this point in your life, Tim, is there any patch you could be truly enthusiastic about? :-) I'm asking because I'd like to see the specific problem that started this thread solved, if necessary using a compromise that means the solution isn't perfect. I'm even willing to take a step back in the status quo, given that the status quo isn't perfect anyway, and that compromises mean something has to give. *Maybe* the right solution is that we have to accept a hard-to-understand overcomplicated piece of code that we don't know how to maintain (but for which the author asserts that we won't have to do much maintenance in the foreseeable future). But *maybe* there's a simpler solution.
OTOH, the specific problem I'm acutely worried about would be better addressed by changing the way Python marshals float values.
So solve it. The approach used by binary pickles seems entirely reasonable. All we need to do is change the .pyc magic number. (There's undoubtedly user code in the world that would break because it requires interoperability between Python versions. So let the marshal module grow a way to specify the format.)
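
For reference, binary pickles store a float as eight bytes holding a big-endian IEEE-754 double, which never passes through strtod() or snprintf() and so cannot be affected by LC_NUMERIC. A minimal sketch of that encoding with the struct module follows; the helper names are hypothetical, and this is not a proposed marshal format.

```python
import struct


def dump_float(value):
    """Encode a float as 8 bytes: IEEE-754 double, big-endian.
    Hypothetical helper; no decimal string is ever produced."""
    return struct.pack(">d", value)


def load_float(data):
    """Decode 8 bytes written by dump_float() back into a float."""
    return struct.unpack(">d", data)[0]


encoded = dump_float(3.14)
decoded = load_float(encoded)
```

On platforms whose native doubles are IEEE-754, the round trip is exact for every finite value.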
Fair enough. So *if* we decide to use the donated conversion code, we should start by using it unconditionally. I predict that at some point in the future we'll find a platform whose quirks are not handled by the donated code, and where it's simpler to use a correct native equivalent than to try to fix the donated code; but I expect that point to be pretty far in the future, *or* the platform to be pretty far from the main stream.
I fail to see the relevance of the example to my proposed hack, except as a proof that the world isn't perfect -- but we already know that. Under my proposal, the number of digits converted would never change, so any sensitivity of the algorithm used to the number of digits converted would be irrelevant. I note that the strtod.c code that's currently in the Python source tree uses a similar (though opposite) trick: it converts the number to the form 0.<fraction>E<expt> before handing it off to atof(). So my proposal still stands. I'm happy to entertain a proof that it's flawed but not one where the flawed input has over 5000 digits *and* depends on a flaw in the platform routines. --Guido van Rossum (home page: http://www.python.org/~guido/)

[Tim]
[Guido]
At this point in your life, Tim, is there any patch you could be truly enthusiastic about? :-)
Yes, but I can't be enthusiastic about a hack, and especially not about a hack that (as I said) doesn't solve the real-life problem spambayes has.
I'm asking because I'd like to see the specific problem that started this thread solved,
At this point, can you state what that specific problem was <wink>?
I'm finding it hard to believe that anyone other than me and the author has actually read the patch! It's easy to understand. It's over-complicated for what Python needs, and would be dead easy to understand if the fluff got chopped. The *fear* of this code expressed in this thread is baffling to me, but I suspect it's due to initial shell-shock from the sheer bulk of the unnecessary code in the patch.
But *maybe* there's a simpler solution.
OTOH, the specific problem I'm acutely worried about would be better addressed by changing the way Python marshals float values.
So solve it.
Sorry, I don't foresee making time to do that.
The approach used by binary pickles seems entirely reasonable.
It's the best binary format we've got. It has problems with 754's special values (as recorded in PEP 42), and loses precision for VAX D format doubles (any double format with greater dynamic range or precision than IEEE-754 double). A decimal string is actually better on all those counts (dynamic range is no problem then; and *some* platforms can preserve IEEE special values via to-string-and-back conversion (Windows cannot)). Decimal strings lose on correctness only because of locale variations; depending on platform, they may also lose on speed, but I don't give much weight to speed here.
Do read the patch. It amounts to

    if decimal_point != '.':
        s/./decimal_point/

in one direction and

    if decimal_point != '.':
        s/decimal_point/./

in the other. It gets its idea of decimal_point from the platform localeconv(), so if that doesn't lie it's hard to get wrong. In the double->string direction, though, the substitution code appears inadequate to me, since it doesn't try to strip out thousand-separation characters, which some locales produce. For example, on Windows,
AFAICT, the patch will leave that output as "123.456". The string->double direction is much easier to be confident about for this reason.
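
To make the concern concrete, here is a small Python sketch of a double->string postprocessing step that also strips a grouping character before restoring the decimal point. The function name is hypothetical and the separators are passed in explicitly; in a real program they would come from localeconv() (its 'decimal_point' and 'thousands_sep' entries). This is one possible way to handle the case, not what the patch does.

```python
def normalize_formatted_float(text, decimal_point, thousands_sep=""):
    """Turn locale-formatted output into C-locale spelling.
    Hypothetical helper, shown for illustration only."""
    # Remove grouping characters first: in some locales the grouping
    # character is '.', which must not survive as a decimal point.
    if thousands_sep:
        text = text.replace(thousands_sep, "")
    if decimal_point != ".":
        text = text.replace(decimal_point, ".", 1)
    return text


# A locale using ',' as decimal point and '.' for grouping.
grouped_only = normalize_formatted_float("123.456", ",", ".")
grouped_and_fraction = normalize_formatted_float("1.234,5", ",", ".")
```

With only the decimal-point substitution, "123.456" would pass through unchanged and later be read back as a number a thousand times too small.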
[long example]
I fail to see the relevance of the example to my proposed hack, except as a proof that the world isn't perfect -- but we already know that.
The point is that only perfect-rounding string->float routines can guarantee to produce identical doubles from mathematically equivalent decimal string representations. Finding counterexamples for non-perfect-rounding libraries is extremely difficult, and/or time-consuming, without studying the source code of a specific library intensely (almost certainly with more intensity than its author gave to writing it!), and I don't have time for that. It's a potential vulnerability. Answering whether it's an actual vulnerability in practice is much more work than I can give to it now.
As hacks go, it's probably OK. I don't think it can fail on glibc-based platforms because I think they do perfect-rounding conversions; the Windows conversion routines aren't perfect-rounding, but we don't have their source code so it's impossible for me to give examples offhand where different results could be delivered, or even to swear that there are (or aren't) such cases. I give it a lot of credit for being truly threadsafe. Note that it doesn't address the other half of the locale conversion problem (double->string), which, as I noted above, is the harder half (due to thousands_sep becoming an additional issue).

"Tim Peters" <tim.one@comcast.net> writes:
The user was writing a gtk application (using pygtk), where gtk, internally, would rely on C-library LC_NUMERIC following the local conventions (I believe gtk would call snprintf to display some message to the user). The user then thought that calling locale.setlocale would be sufficient, but it isn't, and there is no way to fix that short of rewriting gtk.
I'm finding it hard to believe that anyone other than me and the author has actually read the patch! It's easy to understand.
On the surface, yes. However, it seems full of hidden assumptions that are difficult to find out and consider. For example, what if the platform snprintf chooses to output the thousands-separator? I can't see how that is handled in the patch. Regards, Martin

[Tim]
I'm finding it hard to believe that anyone other than me and the author has actually read the patch! It's easy to understand.
[Martin]
I mentioned that one too last night -- it doesn't. OTOH, *are* there locales that insert thousands_sep? I don't know. To get thousands_sep to appear via Python's locale.format(), in all locales I've tried so far it requires passing a true value for the optional "grouping" argument. Like
Now going thru locale.py is far from going thru C, but the same thing happens if I use sprintf() directly from C (no thousands_sep appears, regardless of how I change locale). That's on Win2K. The draft std I have handy here sez: LC_NUMERIC affects the decimal-point character for the formatted input/output functions and the string conversion functions, as well as the nonmonetary formatting information returned by the localeconv function. There's no support there for the notion that "the formatted (etc)" functions *can* be affected by thousands_sep, just that fiddling locale can affect decimal-point and the (passive) values returned by localeconv().
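The distinction drawn above, between the passive values reported by localeconv() and what the formatting functions actually emit, can be sketched from Python. This is only an illustration under stated assumptions: it pins LC_NUMERIC to the portable "C" locale so the result is predictable, and it uses locale.format_string(), the current spelling of the locale.format() call discussed in this thread. Other locales are deliberately not exercised, since their availability varies by platform.

```python
import locale

# Assumption: pin LC_NUMERIC to the portable "C" locale so the values
# below are the same on every platform.
locale.setlocale(locale.LC_NUMERIC, "C")

# localeconv() only reports the conventions; it formats nothing itself.
conv = locale.localeconv()
decimal_point = conv["decimal_point"]
thousands_sep = conv["thousands_sep"]

# Plain formatting: no grouping is applied.
plain = locale.format_string("%.2f", 1234567.89)

# Grouping must be requested explicitly, and even then it only has an
# effect when the locale defines a thousands separator.
grouped = locale.format_string("%.2f", 1234567.89, grouping=True)

print(repr(decimal_point), repr(thousands_sep), plain, grouped)
```

In a locale that does define a separator, only the grouping=True call would be expected to insert it.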

On Tue, Sep 02, 2003 at 11:29:55AM -0400, Tim Peters wrote:
At least with locale.format, if you want grouping, you pass in the third argument (grouping=1) to format(). An example:

    >>> locale.setlocale(locale.LC_NUMERIC, 'da_DK')
    'da_DK'
    >>> locale.format("%.2f", 71630, 1)
    '71.630,00'

Now, from the glibc docs:

    [...] The SUSv2 specifies one further flag character.

    '   For decimal conversion (i, d, u, f, F, g, G) the output is to be grouped with thousands' grouping characters if the locale information indicates any. Note that many versions of gcc cannot parse this option and will issue a warning. SUSv2 does not include %'F.

So unless I'm mistaken, this wouldn't really be an issue in our case if explicit grouping isn't requested inside the python conversion functions.

Take care, -- Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

On Tue, Sep 02, 2003 at 11:29:55AM -0400, Tim Peters wrote:
The Linux/glibc documentation, which cites SUSv2, seems to imply that no locale inserts the thousands separator in formatting operations, except when the ' flag character is included:

    For some numeric conversions a radix character ('decimal point') or thousands' grouping character is used. The actual character used depends on the LC_NUMERIC part of the locale. The POSIX locale uses '.' as radix character, and does not have a grouping character. Thus,

        printf("%'.2f", 1234567.89);

    results in '1234567.89' in the POSIX locale, in '1234567,89' in the nl_NL locale, and in '1.234.567,89' in the da_DK locale.

    [...] The five flag characters above are defined in the C standard. The SUSv2 specifies one further flag character.

    '   For decimal conversion (i, d, u, f, F, g, G) the output is to be grouped with thousands' grouping characters if the locale information indicates any. Note that many versions of gcc cannot parse this option and will issue a warning. SUSv2 does not include %'F.

Jeff

On Mon, Sep 01, 2003 at 02:30:23PM -0400, Tim Peters wrote:
Just to follow up, today I found a thread on opengroup.org that discusses locale-safe APIs in the C library. They don't suggest anything very positive in the way of standardization :-/ http://www.opengroup.org/austin/mailarchives/austin-group-l/msg00763.html Take care, -- Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

On 31/08/2003 9:25 AM, Tim Peters wrote:
This is true. However, in practice we found it fixed a number of thread safety issues in programs. Your average localised package usually switches to the user's preferred locale on startup, so that it can display strings and messages, and occasionally wants to read/write numbers in a locale independent format (usually when saving/loading files). The most common way of doing this is the setlocale/strtod/setlocale combo, which has thread safety problems and possible reentrancy problems if done wrong. The method used by g_ascii_strtod() removes the need to switch locale when parsing the float, which means that an application using it may only need to call setlocale() once on startup and never again. This seems to be the best way to use setlocale w.r.t. thread safety. The existing locale handling in Python shares this property, but makes it difficult for external libraries to format and parse floats in the locale's representation. From what I can see, leaving LC_NUMERIC set to the locale value rather than "C" leads to better interoperability.
It would be great for Python to have consistent float parsing/formatting on every platform in the future. Making sure that every place where Python wants to parse or format a float in a locale independent fashion goes through a single set of functions should make it easier to drop in a new set of routines in the future. However, getting rid of the LC_NUMERIC=C requirement would have real benefits today. James. -- Email: james@daa.com.au WWW: http://www.daa.com.au/~james/

James Henstridge <james@daa.com.au> writes:
I think everybody agrees that allowing non-C LC_NUMERIC settings in the C library is very desirable. My concerns are about the specific approach taken to implement that change. Or, actually, with an entire class of approaches: namely those that involve complex algorithms (i.e. which include a for-statement :) to implement that feature. Regards, Martin

Christian Reis <kiko@async.com.br> writes:
So, in an attempt to garner comments (now that we have 2.3 off the chopping block) I'm reposting my PEP proposal (with minor updates).
I can agree with the declared problem of the PEP, and the rationale for fixing it. Tim also convinced me that the approach taken to solve it is, technically, acceptable. So I only list issues where I disagree.
This change should also solve the aforementioned thread-safety problems.
It does not, and I think the PEP should point out that it doesn't.
One of my early concerns (and I still have this concern) is that the contributors here appear to take the position "We have this fine code developed elsewhere, it seems to work, so we copy it. We don't actually have to understand this code". I would feel more comfortable if the code was written from scratch for usage in Python, with just the ideas borrowed from glib. Proper attribution of contributors and licensing are just one aspect; we really need the submitter of the code to fully understand it, and be capable of reacting to problems quickly. That said, I don't actually require that the code is written from scratch. Instead, a detailed elaboration of how precisely the implementation is approached, in the PEP, would be good. The PEP should also point out deficiencies of the approach taken, e.g. the issue of spelling NaN, inf, etc. If it can be determined not to be an issue in real life (i.e. for all interesting platforms), this should be documented as well. Regards, Martin

I think that anything that might still be needed years after the code is adopted should be in comments in the code, not in a PEP. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Wed, Aug 13, 2003, Christian Reis wrote:
I'll repeat what I said on 7/24: Looks good to me (can't comment on the tech issues, but it seems clear enough). However, I'd recommend changing the title to something like "Locale-independent float conversions". -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ This is Python. We don't care much about theory, except where it intersects with useful practice. --Aahz

On Thu, Aug 14, 2003 at 12:52:53AM -0400, Aahz wrote:
OT: I actually replied to you back then, but your MX denies mail from me (since our host happens to be in a pretty wide spamhaus block).
I suppose I'm okay with changing the title if it's too obscure to be helpful -- it was only meant to be precise in a light-hearted fashion. Take care, -- Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

On 14/08/2003 10:21 AM, Christian Reis wrote:
In my opinion, it is worth investigating this. One of the things I love about Python is the way it can be used to glue different bits of code together to form useful programs. Python's handling of LC_NUMERIC seems to work against this goal.

According to the POSIX standard, standard library functions like strtod() and printf()/sprintf() work with locale representations of floating point numbers once the setlocale() function has been called. For locale aware libraries, it seems quite sensible for them to use standard library functions to format numbers for display, and to read numbers entered by the user (which may use a comma for the decimal point under some locales).

Now, Python uses the C standard library functions to parse floating point numbers too, and doesn't want these string <-> float conversions to be locale sensitive. The solution currently in place is to declare that LC_NUMERIC must be C for Python programs (or applications embedding Python) or things will go wrong. If a bit of Python code wishes to format a float according to the locale it can call locale.format(), and to read a float from the user it can use locale.atof(). These use locale data read while the locale was set to the correct value temporarily.

Unfortunately, this does not help external libraries that know nothing of Python's requirements about how locales are set up, so they end up always representing and reading floating point numbers under the C locale (using a full stop as the decimal point). This is the problem Christian ran into with the GtkSpinButton widget in GTK, and I would not be surprised if other people have run into the problem as well.

There are two solutions to this problem that I can see:

1. modify Python so that it doesn't use locale sensitive conversion functions when it wants to convert floats in a locale independent manner.
2. modify every external library that makes use of the standard library strtod()/sprintf(), expecting locale sensitive float conversions, to use some other API.

Clearly (1) is the easier option, as there is a finite (and quite small) amount of code to change. For (2), there is potentially an unlimited amount of code to change.

As Christian said, there is code in glib (not to be confused with glibc: the GNU C library) that could act as a basis for locale independent float conversion functions in Python. The code was written by Alex Larsson (who works at Red Hat, so I suppose they own the copyright), who is willing to license it under Python's terms. You can see the history of the two functions (g_ascii_strtod() and g_ascii_formatd()) here: http://cvs.gnome.org/bonsai/cvsblame.cgi?file=glib/glib/gstrfuncs.c#328 There are very minor alterations by other people (they look minor enough that the FSF wouldn't require a copyright assignment), but you could always use the versions from the initial checkin (rev 1.77) if that is a problem.

One of the alternatives that some programs use is to do locale independent conversions using code a bit like this:

    char *oldlocale = setlocale(LC_NUMERIC, "C");
    num = strtod(string, NULL);
    setlocale(LC_NUMERIC, oldlocale);

This particular code snippet has some problems that I think make it unsuitable for Python:

* setlocale() affects the whole process, so this sort of operation could affect the results of strtod, printf, etc for some other thread in the program.
* setlocale() is not reentrant. Oldlocale in the above snippet is a pointer to a static string. If you surround the snippet with another pair of setlocale() calls, you can get unexpected results. This means that adding setlocale() calls to random Python API calls has the potential to break existing code.

Alex's code from glib does not suffer from these problems.

To sum it up, the current status-quo in Python w.r.t. locales is causing problems for some tasks people want to use Python for. It would be nice to fix this problem. James. -- Email: james@daa.com.au WWW: http://www.daa.com.au/~james/
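For readers more at home in Python than C, the save/switch/restore pattern criticised above can be sketched with the locale module. The helper name numeric_locale is hypothetical, and the example only switches to the "C" locale because that is the one locale guaranteed to exist. Python copies the string returned by setlocale(), so the static-pointer hazard of the C snippet does not arise here, but the other objection still applies: the switch is process-wide and therefore visible to every other thread for as long as the block runs.

```python
import contextlib
import locale

# Remember the LC_NUMERIC setting in effect before the sketch runs.
before = locale.setlocale(locale.LC_NUMERIC)


@contextlib.contextmanager
def numeric_locale(name):
    """Hypothetical helper: temporarily switch LC_NUMERIC, then restore it.

    NOT thread-safe: setlocale() changes state for the whole process.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)  # query only, no change
    locale.setlocale(locale.LC_NUMERIC, name)
    try:
        yield
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)


with numeric_locale("C"):
    # locale.atof() parses according to the current LC_NUMERIC setting.
    value = locale.atof("1234.5")

after = locale.setlocale(locale.LC_NUMERIC)
print(value, before == after)
```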

James Henstridge <james@daa.com.au> writes:
I very much doubt that this statement is true. Are you sure this code supports all the platforms where Python runs? E.g. what about the three (!) different floating point formats on a VAX?
Unfortunately, this is not thread-safe, so it is clearly out of question.
Certainly. However, incorporating glib code is not a solution. *Calling* glib code (where available) might be a solution. Also, the standard C++ library supports multiple concurrent locale objects, so calling *that* (where available) might be an option. Furthermore, the C++ library is often implemented on top of some C-only library, so calling that library would be better, as it would keep the C++ runtime library out of the Python prerequisites. Regards, Martin

[James Henstridge]
[Martin v. Löwis]
Well, you should look at the patch: it doesn't know anything about internal fp formats -- all conversions are performed in the end by calling the platform C's strtod() or snprintf(). What it does do is: 1. For string to double, preprocess the input string to change it to use current-locale spelling before calling the platform C strtod(). 2. For double to string, postprocess the result of the platform C snprintf() to replace current-locale spelling with a "standard" spelling. So this is much more string-munging code than it is floating-point code. Indeed, there doesn't appear to be a single floating-point operation in the entire patch (apart from calls to platform string<->float functions). OTOH, despite the claims, it doesn't look threadsafe to me: there's no guarantee, e.g., that the idea of current locale g_ascii_strtod() obtains from localeconv() at its start is still in effect by the time g_ascii_strtod() gets around to calling strtod(). So at best it solves part of one relevant problem here (other relevant problems include that platform C libraries disagree about how to spell infinities, NaNs and signed zeroes; about how many digits to use for an exponent; and about how to round results (for example,
    >>> "%.1f" % 2.25
    '2.3'
on Windows, but most (not all!) flavors of Unix produce the IEEE-754 to-nearest/even rounded '2.2' instead)). It's easy to write portable, perfectly-rounding string<->double conversion routines without calling any platform functions. The rub is that "fast" goes out the window then, unless you give up at least one of {portable, accurate}.
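The two string-munging steps described in this message can be sketched in a few lines of Python. This is not the glib code or the submitted patch; the function names are hypothetical, and the decimal point is passed in as an argument rather than read from localeconv(), so the sketch stays independent of whichever locales happen to be installed.

```python
def to_locale_spelling(text, decimal_point):
    """Step 1 (string -> double): rewrite a "C"-spelled float string so a
    locale-sensitive parser such as strtod() would accept it."""
    if decimal_point != ".":
        text = text.replace(".", decimal_point, 1)
    return text


def from_locale_spelling(text, decimal_point):
    """Step 2 (double -> string): rewrite locale-spelled output back to
    the "C" spelling. Only the decimal point is handled."""
    if decimal_point != ".":
        text = text.replace(decimal_point, ".", 1)
    return text


print(to_locale_spelling("1234.5678", ","))
print(from_locale_spelling("1234,5678", ","))
# A grouped string shows the limitation: thousands separators are left
# untouched, so the result is not a valid "C" float literal.
print(from_locale_spelling("1.234.567,89", ","))
```

The last call illustrates why the double-to-string direction is the harder half whenever grouping characters can appear in the output.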

"Tim Peters" <tim.one@comcast.net> writes:
1. For string to double, preprocess the input string to change it to use current-locale spelling before calling the platform C strtod().
I see (I was confused by the presence of a table of bytes). This is much worse, then: How can it possibly know what formats the C library expects in the current locale? What if the C library insists that a thousands-separator is used when the locale has one? etc. Regards, Martin

[Tim, about what the patch does]
[Martin]
I see (I was confused by the presence of a table of bytes).
Right, the table appears to be there just to support locale-independent character classification.
I'm not sure that's a realistic objection. The patch appears to be trying to replace only the decimal point (if any), with localeconv()->decimal_point, and I've certainly not seen a locale that refuses to accept, e.g., 1234 <its idea of a decimal point> 5678 meaning the same as 1234.5678 in "C" locale. The draft C99 standard I have handy here says (in its strtod() section): In other than the "C" locale, additional locale-specific subject sequence forms may be accepted. and "additional" implies to me that every locale must accept at least the basic floating-point spellings described before that quoted sentence.

"Tim Peters" <tim.one@comcast.net> writes:
Ok. Are you then, overall, in favour of taking the proposed approach? It is not thread-safe, but only so if somebody calls setlocale in a different thread, and that is known not to be thread-safe - so I could live with that limitation. It is just that the patch does not "feel" right, given that there must be "native" locale-unaware parsing of floating point constants somewhere on each platform (at least on those that support C++98). Regards, Martin

FWIW, I have the same feeling, but the idea of having to support our own version of such code is even more uncomfortable. Maybe at least we can detect platforms for which we know there is a native conversion in the library, and not use the hack on those? --Guido van Rossum (home page: http://www.python.org/~guido/)

A Seg, 2003-09-01 às 09:34, Guido van Rossum escreveu:
In case people haven't noticed, the second version of the patch, submitted following the first patch, already does that. It adds a configure.in check for glibc's strtod_l() and uses that instead of glib code whenever it's available. Only on non-glibc systems is the glib code compiled in. Regards. -- Gustavo João Alves Marques Carneiro <gjc@inescporto.pt> <gustavo@users.sourceforge.net>

This doesn't address the first sentence of mine quoted above: I'm uncomfortable with having this code being part of Python, given that we have no expertise to maintain it long-term (nor should we need such expertise, IMO). Here's yet another idea (which probably has flaws as well): instead of substituting the locale's decimal separator, rewrite strings like "3.14" as "314e-2" and rewrite strings like "3.14e5" as "314e3", then pass to strtod(), which assigns the same meaning to such strings in all locales. This removes the question of what decimal separator is used by the locale completely, and thus removes the last bit of thread-unsafety from the code. However, I don't know if underflow can cause the result to be different, e.g. perhaps 1.23eX could be computed but 123e(X-2) could not??? (Sounds pretty unlikely on the face of it since I'd expect any decent conversion algorithm to pretty much break its input down into a string of digits and an exponent, but I've never actually studied such algorithms in detail.) --Guido van Rossum (home page: http://www.python.org/~guido/)
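The rewriting idea in this message can be sketched as follows. It is a minimal illustration, not a proposed implementation: the function name and the regular expression are assumptions, only plain decimal literals are handled (no inf, NaN or hexadecimal forms), and the final comparison relies on the float() of the Python running the sketch, which says nothing about how a given platform strtod() would round.

```python
import re

# sign, integer digits, optional point, fraction digits, optional exponent
_FLOAT_RE = re.compile(r"^([+-]?)(\d*)\.?(\d*)(?:[eE]([+-]?\d+))?$")


def strip_decimal_point(text):
    """Hypothetical helper: move the fraction digits into the exponent so
    that the result contains no decimal separator at all."""
    m = _FLOAT_RE.match(text.strip())
    if m is None or not (m.group(2) or m.group(3)):
        raise ValueError("not a plain decimal float literal: %r" % text)
    sign, int_part, frac_part, exp = m.groups()
    # Each fraction digit absorbed into the mantissa lowers the exponent by 1.
    exponent = int(exp or 0) - len(frac_part)
    return "%s%se%d" % (sign, int_part + frac_part, exponent)


print(strip_decimal_point("3.14"))     # 314e-2
print(strip_decimal_point("3.14e5"))   # 314e3
print(strip_decimal_point("-2.5e-3"))  # -25e-4
```

Because the digit string itself is never lengthened or shortened, the number of digits handed to the conversion routine stays the same; only the exponent changes.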

[Martin]
Ok. Are you then, overall, in favour of taking the proposed approach?
It solves part of one problem; I'd rather solve all of it, but can't volunteer time to do that.
There's no way of using C's locale gimmicks that's threadsafe, short of all callers agreeing to follow a beyond-standard-C exclusion protocol -- which is the same as saying "no way" in reality. So that's part of one problem no patch of this ilk *can* solve. It's not that the patch doesn't try hard enough, it's that this approach is inherently inadequate to solve all of this particular problem.
I haven't found one on Windows (doesn't mean it doesn't exist, does mean it's apparently well hidden if it does exist).
The patch is certainly more code than is needed to solve the part of the problem it does solve. For example, things like

    typedef char gchar;
    typedef short gshort;
    typedef long glong;
    typedef int gint;

introduce silly synonyms ("silly" == typing gshort instead of short does nothing except introduce possibilities for confusion); there are many definitions like

    #define g_ascii_isupper(c) \
        ((g_ascii_table[(guchar) (c)] & G_ASCII_UPPER) != 0)

that are never referenced; the code caters to C99's hexadecimal float literals but Python doesn't; and so on. If someone who understood Python internals read my earlier two-sentence description of how the patch works, they could write something that works equally well for Python's purposes with a fraction of the code introduced by the patch.
Well, the patch doesn't even pretend to address other issues with portability of float literals. They routinely come up on c.l.py, so of course users bump into them; when someone is motivated enough to file a bug report, I shuffle it off to PEP 42, under the "non-accidental 754 support" heading (which covers many fp issues beyond just literals, of course).
[James Henstridge]
I became acutely aware of the problems here due to the spambayes project, part of which embeds Python in Outlook 2000/2002. Outlook routinely runs more than a dozen threads, and by observation changes locale "frequently". None of that is documented, Python has no influence over when or why Outlook decides to switch locale, and neither can Python exclude Outlook's other threads when the Outlook thread Python is running in becomes active. Mark Hammond solved our problems there by forcing locale back to "C" every chance he gets; that's an anti-social and probabilistic approach, but appears to be the best spambayes can do today. Having spambayes grow its own float<->string code doesn't help, because the worst problem spambayes had is that Python's marshal format uses ASCII strings to store float literals in .pyc files, so that Python itself can (and does) load insane float values out of .pyc files if LC_NUMERIC isn't "C" at the time a .pyc file gets imported. The only thing that could truly solve spambayes's problems here is for Python to use a thoroughly thread-safe string->float routine, where "thoroughly" includes not caring whether other threads switch locale in mid-stream. An irony is that Microsoft's *native* locale gimmicks are thread-safe (each Win32 thread has its own idea of Win32 locale); why Outlook is even mucking with C's thread-braindead notion of locale is a mystery. In short, I can't be enthusiastic about the patch because it doesn't solve the only relevant locale problem I've actually run into. I understand that it may well solve many I haven't run into. OTOH, the specific problem I'm acutely worried about would be better addressed by changing the way Python marshals float values.
[Guido]
Maybe at least we can detect platforms for which we know there is a native conversion in the library, and not use the hack on those?
I rarely find that piles of conditionalized code are more comprehensible or reliable; they usually result in mysterious x-platform differences, and become messier over time as we stumble into more platform library bugs, quirks, and limitations.
This is a harder transformation than s/./locale_decimal_point/. It does address the thread-safety issue. Numerically it's flaky, as only a perfectly-rounding string->float routine can guarantee to return bit-for-bit identical results given equivalent (viewed as infinite precision) decimal representations as inputs, and few platform string->float routines do perfect rounding.
Each library is likely to fail in its own unique ways. Here's a cute one:

"""
base = 1.2345678901234567
digits = "12345678901234567"
for exponent in range(-16, -15000, -1):
    string = digits + "0" * (-16 - exponent)
    string += "e%d" % exponent
    derived = float(string)
    assert base == derived, (string, derived)
"""

On Windows, this first fails at exponent -5202, where float(string) delivers a result a factor of 10 too large. I was surprised it did that well! Under Cygwin Python 2.2.3, it consumed > 14 minutes of CPU time, but never failed. I believe they're using a derivative of David Gay's excruciatingly complex IEEE-754 perfect-rounding string<->float routines (which would explain both why it didn't fail and why it consumed enormous CPU time; the code is excruciatingly complex because it does perfect rounding quickly for "normal" inputs, via a large variety of delicate speed tricks; when those tricks don't apply, it has to simulate unbounded-precision arithmetic to guarantee perfect rounding).

On Mon, Sep 01, 2003 at 02:30:23PM -0400, Tim Peters wrote:
I would certainly concede this point -- Gustavo's patch is a proof-of-concept implementation. I do believe that the glib code is a good starting point for an implementation, and the author has submitted a written agreement, so the next step would be obtaining approval of the general approach and then diving in to clean up and minimize the code as much as possible. The issue is whether a cleaned up patch is the way python-dev would like this to go, or if another, perhaps orthogonal, approach -- such as the conversion Guido has proposed -- would be preferable.
Are these representations (NaN, infinity, etc) LC_NUMERIC-dependent? Or, more generally, locale-dependent? As for the thread-safety issues, it's true, changing locale multiple times in runtime is bound to confuse the conversion code (and consequently the interpreter). One way around this would be a complete reimplementation of the relevant functions, but here's betting that that alternative would be shunned even more intensely. Take care, -- Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

A Seg, 2003-09-01 às 20:44, Christian Reis escreveu:
Actually, this code presentation is intentional, for the following two reasons: 1- I didn't want to accidentally introduce any bug, so I tried to copy-paste the code with as few changes as possible; 2- In the current form, if glib developers find any bug in this code, we can easily merge the changes back into python. Perhaps I was wrong to have done it this way... Anyway, replacing the g* types is trivial with any decent text editor. Regards. -- Gustavo J. A. M. Carneiro <gjc@inescporto.pt> <gustavo@users.sourceforge.net>

Gustavo J A M Carneiro <gjc@inescporto.pt> writes:
Perhaps I was wrong to have done it this way... Anyway, replacing the g* types is trivial with any decent text editor.
It's more than that. We want you to fully understand the patch, and to spot errors in it even before the glib developers find them. We want you to provide a minimalistic patch, that just implements the required functionality, and nothing else. We want the patch to be maintainable due to it being easy to read and follow, instead of being maintainable due to the fact that it is identical with some code elsewhere in the world. Regards, Martin

It's more than that. We want you to fully understand the patch, and to spot errors in it even before the glib developers find them.
It's unlikely that the glib code would contain errors that Gustavo could spot, before or after cleaning up the patch (no offense to Gustavo meant!). However, a more likely cause of errors would be that the adaptation of the code to a new environment breaks an unspoken assumption made by the code. Only truly understanding the code would reveal such assumptions.
Right. Just say no to "copy-and-paste code reuse". --Guido van Rossum (home page: http://www.python.org/~guido/)

[Christian Reis]
Don't know; C89 didn't say anything at all about them, so existing C practice is all over the map; C99 does say something about them, but whether a C99 compiler supports them is optional (support for them isn't mandatory; if a C implementation does choose to support them, then the spellings for input are standardized, although a locale is allowed to *produce* any spellings whatsoever).

[Tim]
[martin@v.loewis.de]
Where exactly does it (C99) say that the spellings are locale-specific?
I can't find anything in the std supporting the claim. For that matter, I can't find anything in the std supporting the notion that a locale is allowed to insert thousand-separator characters either (can you?). There's lots of stuff allowing a locale to *accept* locale-specific spellings (when parsing strings), in addition to the "C" locale spellings; the other direction (producing strings) appears much less permissive.

"Tim Peters" <tim.one@comcast.net> writes:
No: I'm now convinced that sprintf is *forbidden* to insert the thousands separator. This is why POSIX added the '-flag (%'f); this will produce the thousands-separator. That said: Implementations might choose to ignore the standard in that respect. This issue just supports my thesis that the patch is complicated: If I have to read the C99 standard to find out whether it is correct, it must be complicated. I doubt either the submitters or the original author of the code did that exercise...
Indeed. I'm not sure whether this is intentional, though. Regards, Martin

[Tim]
At this point in your life, Tim, is there any patch you could be truly enthusiastic about? :-) I'm asking because I'd like to see the specific problem that started this thread solved, if necessary using a compromise that means the solution isn't perfect. I'm even willing to take a step back in the status quo, given that the status quo isn't perfect anyway, and that compromises mean something has to give. *Maybe* the right solution is that we have to accept a hard-to-understand overcomplicated piece of code that we don't know how to maintain (but for which the author asserts that we won't have to do much maintenance in the foreseeable future). But *maybe* there's a simpler solution.
OTOH, the specific problem I'm acutely worried about would be better addressed by changing the way Python marshals float values.
So solve it. The approach used by binary pickles seems entirely reasonable. All we need to do is change the .pyc magic number. (There's undoubtedly user code in the world that would break because it requires interoperability between Python versions. So let the marshal module grow a way to specify the format.)
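As an illustration of the binary approach mentioned here: binary pickle protocols store a float as eight bytes in big-endian IEEE-754 double format, which involves no decimal string and therefore no dependence on LC_NUMERIC. The sketch below reproduces that encoding with the struct module. The helper names are hypothetical, and this is not the marshal format itself, only a picture of what a locale-independent float encoding could look like.

```python
import pickle
import struct


def dump_float(value):
    """Encode a float as 8 bytes, big-endian IEEE-754 double."""
    return struct.pack(">d", value)


def load_float(data):
    """Decode 8 bytes written by dump_float()."""
    return struct.unpack(">d", data)[0]


encoded = dump_float(0.1)
print(len(encoded), load_float(encoded) == 0.1)

# The same eight bytes appear inside a binary (protocol 2) pickle.
print(encoded in pickle.dumps(0.1, protocol=2))
```

The round trip is exact for any IEEE-754 double, but the format cannot represent values outside that range or precision.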
Fair enough. So *if* we decide to use the donated conversion code, we should start by using it unconditionally. I predict that at some point in the future we'll find a platform whose quirks are not handled by the donated code, and where it's simpler to use a correct native equivalent than to try to fix the donated code; but I expect that point to be pretty far in the future, *or* the platform to be pretty far from the main stream.
I fail to see the relevance of the example to my proposed hack, except as a proof that the world isn't perfect -- but we already know that. Under my proposal, the number of digits converted would never change, so any sensitivity of the algorithm used to the number of digits converted would be irrelevant. I note that the strtod.c code that's currently in the Python source tree uses a similar (though opposite) trick: it converts the number to the form 0.<fraction>E<expt> before handing it off to atof(). So my proposal still stands. I'm happy to entertain a proof that it's flawed but not one where the flawed input has over 5000 digits *and* depends on a flaw in the platform routines. --Guido van Rossum (home page: http://www.python.org/~guido/)

[Tim]
[Guido]
At this point in your life, Tim, is there any patch you could be truly enthusiastic about? :-)
Yes, but I can't be enthusiastic about a hack, and especially not about a hack that (as I said) doesn't solve the real-life problem spambayes has.
I'm asking because I'd like to see the specific problem that started this thread solved,
At this point, can you state what that specific problem was <wink>?
I'm finding it hard to believe that anyone other than me and the author has actually read the patch! It's easy to understand. It's over-complicated for what Python needs, and would be dead easy to understand if the fluff got chopped. The *fear* of this code expressed in this thread is baffling to me, but I suspect it's due to initial shell-shock from the sheer bulk of the unnecessary code in the patch.
But *maybe* there's a simpler solution.
OTOH, the specific problem I'm acutely worried about would be better addressed by changing the way Python marshals float values.
So solve it.
Sorry, I don't foresee making time to do that.
The approach used by binary pickles seems entirely reasonable.
It's the best binary format we've got. It has problems with 754's special values (as recorded in PEP 42), and loses precision for VAX D format doubles (any double format with greater dynamic range or precision than IEEE-754 double). A decimal string is actually better on all those counts (dynamic range is no problem then; and *some* platforms can preserve IEEE special values via to-string-and-back conversion (Windows cannot)). Decimal strings lose on correctness only because of locale variations; depending on platform, they may also lose on speed, but I don't give much weight to speed here.
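As an illustration of the binary approach discussed above, here is a minimal sketch of storing a float as an 8-byte big-endian IEEE-754 double, which is the layout binary pickles use for floats. The helper names dump_float and load_float are hypothetical and are not part of marshal or pickle; the sketch assumes the platform's C double is an IEEE-754 double.

```python
import struct


def dump_float(x):
    """Pack a float as an 8-byte big-endian IEEE-754 double (hypothetical helper)."""
    return struct.pack(">d", x)


def load_float(data):
    """Unpack an 8-byte big-endian IEEE-754 double back into a float."""
    return struct.unpack(">d", data)[0]


value = 0.1
packed = dump_float(value)

# No decimal string is involved, so LC_NUMERIC cannot affect the round trip.
print(len(packed), load_float(packed) == value)
```

Because the bytes never pass through a string->double conversion, the locale's decimal point plays no role; the trade-offs for special values and wider double formats mentioned above still apply.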
Do read the patch. It amounts to if decimal_point != '.': s/./decimal_point/ in one direction and if decimal_point != '.': s/decimal_point/./ in the other. It gets its idea of decimal_point from the platform localeconv(), so if that doesn't lie it's hard to get wrong. In the double->string direction, though, the substitution code appears inadequate to me, since it doesn't try to strip out thousand-separation characters, which some locales produce. For example, on Windows,
AFAICT, the patch will leave that output as "123.456". The string->double direction is much easier to be confident about for this reason.
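The two substitutions described above can be sketched in Python as follows. This is only an illustration of the idea, not the patch itself (which is C code); the function names to_locale and from_locale are hypothetical, and the decimal point is passed in explicitly, where real code would take it from locale.localeconv()["decimal_point"]. Like the patch as described, the sketch makes no attempt to handle thousands separators.

```python
def to_locale(s, decimal_point):
    """double->string direction: swap '.' for the locale's decimal point."""
    if decimal_point != ".":
        s = s.replace(".", decimal_point)
    return s


def from_locale(s, decimal_point):
    """string->double direction: swap the locale's decimal point for '.'."""
    if decimal_point != ".":
        s = s.replace(decimal_point, ".")
    return s


# Assumed decimal point for a comma-using locale; not read from the C library here.
print(to_locale("123.456", ","))
print(float(from_locale("1,5e3", ",")))
```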
[long example]
I fail to see the relevance of the example to my proposed hack, except as a proof that the world isn't perfect -- but we already know that.
The point is that only perfect-rounding string->float routines can guarantee to produce identical doubles from mathematically equivalent decimal string representations. Finding counterexamples for non-perfect-rounding libraries is extremely difficult, and/or time-consuming, without studying the source code of a specific library intensely (almost certainly with more intensity than its author gave to writing it!), and I don't have time for that. It's a potential vulnerability. Answering whether it's an actual vulnerability in practice is much more work than I can give to it now.
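To make the property concrete: several spellings of the same decimal value should all convert to the same double when the conversion routine rounds perfectly. The sketch below assumes a Python whose float() uses correctly rounded conversion (true of current CPython on common platforms, not of the platform routines discussed in this thread); the list of spellings is chosen purely for illustration.

```python
from fractions import Fraction

# Mathematically equivalent spellings of one tenth.
spellings = ["0.1", "1e-1", "0.0001e3", "100e-3", "0.1000000000000000000000"]

# Exact rational values: confirms the strings really denote the same number.
exact = {Fraction(s) for s in spellings}

# Converted doubles: a perfect-rounding routine yields a single value.
doubles = {float(s) for s in spellings}

print(len(exact), len(doubles))
```

With a non-perfect-rounding library the second set could, in principle, contain more than one element for some inputs, which is the potential vulnerability described above.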
As hacks go, it's probably OK. I don't think it can fail on glibc-based platforms because I think they do perfect-rounding conversions; the Windows conversion routines aren't perfect-rounding, but we don't have their source code so it's impossible for me to give examples offhand where different results could be delivered, or even to swear that there are (or aren't) such cases. I give it a lot of credit for being truly threadsafe. Note that it doesn't address the other half of the locale conversion problem (double->string), which, as I noted above, is the harder half (due to thousands_sep becoming an additional issue).

"Tim Peters" <tim.one@comcast.net> writes:
The user was writing a gtk application (using pygtk), where gtk, internally, would rely on C-library LC_NUMERIC following the local conventions (I believe gtk would call snprintf to display some message to the user). The user then thought that calling locale.setlocale would be sufficient, but it isn't, and there is no way to fix that short of rewriting gtk.
I'm finding it hard to believe that anyone other than me and the author has actually read the patch! It's easy to understand.
On the surface, yes. However, it seems full of hidden assumptions that are difficult to find out and consider. For example, what if the platform snprintf chooses to output the thousands-separator? I can't see how that is handled in the patch. Regards, Martin

[Tim]
I'm finding it hard to believe that anyone other than me and the author has actually read the patch! It's easy to understand.
[Martin]
I mentioned that one too last night -- it doesn't. OTOH, *are* there locales that insert thousands_sep? I don't know. To get thousands_sep to appear via Python's locale.format(), in all locales I've tried so far it requires passing a true value for the optional "grouping" argument. Like
Now going thru locale.py is far from going thru C, but the same thing happens if I use sprintf() directly from C (no thousands_sep appears, regardless of how I change locale). That's on Win2K. The draft std I have handy here sez: LC_NUMERIC affects the decimal-point character for the formatted input/output functions and the string conversion functions, as well as the nonmonetary formatting information returned by the localeconv function. There's no support there for the notion that "the formatted (etc)" functions *can* be affected by thousands_sep, just that fiddling locale can affect decimal-point and the (passive) values returned by localeconv().

On Tue, Sep 02, 2003 at 11:29:55AM -0400, Tim Peters wrote:
At least with locale.format, if you want grouping, you pass in the third argument (grouping=1) to format(). An example:

    >>> locale.setlocale(locale.LC_NUMERIC, 'da_DK')
    'da_DK'
    >>> locale.format("%.2f", 71630, 1)
    '71.630,00'

Now, from the glibc docs:

    [...] The SUSv2 specifies one further flag character.

    '   For decimal conversion (i, d, u, f, F, g, G) the output is to be grouped with thousands' grouping characters if the locale information indicates any. Note that many versions of gcc cannot parse this option and will issue a warning. SUSv2 does not include %'F.

So unless I'm mistaken, this wouldn't really be an issue in our case if explicit grouping isn't requested inside the python conversion functions.

Take care,
--
Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

On Tue, Sep 02, 2003 at 11:29:55AM -0400, Tim Peters wrote:
The Linux/glibc documentation, which cites SUSv2, seems to imply that no locale inserts the thousands separator in formatting operations, except when the ' flag character is included:

    For some numeric conversions a radix character ('decimal point') or thousands' grouping character is used. The actual character used depends on the LC_NUMERIC part of the locale. The POSIX locale uses '.' as radix character, and does not have a grouping character. Thus,

        printf("%'.2f", 1234567.89);

    results in '1234567.89' in the POSIX locale, in '1234567,89' in the nl_NL locale, and in '1.234.567,89' in the da_DK locale.

    [...] The five flag characters above are defined in the C standard. The SUSv2 specifies one further flag character.

    '   For decimal conversion (i, d, u, f, F, g, G) the output is to be grouped with thousands' grouping characters if the locale information indicates any. Note that many versions of gcc cannot parse this option and will issue a warning. SUSv2 does not include %'F.

Jeff
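The same "grouping only on request" behaviour can be checked from Python. The sketch below uses locale.format_string(), the current spelling of the locale.format() call shown earlier in the thread; the helper format_both is hypothetical. It defaults to the "C" locale because that is the only one guaranteed to exist; passing a name such as 'da_DK' assumes that locale is installed and raises locale.Error otherwise.

```python
import locale


def format_both(value, locale_name="C"):
    """Return (ungrouped, grouped) renderings of value under locale_name.

    Hypothetical helper: temporarily switches LC_NUMERIC and restores it.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        plain = locale.format_string("%.2f", value)
        grouped = locale.format_string("%.2f", value, grouping=True)
        return plain, grouped
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)


# The C/POSIX locale has no grouping character, so both forms are identical.
print(format_both(1234567.89))
```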

On Mon, Sep 01, 2003 at 02:30:23PM -0400, Tim Peters wrote:
Just to follow up, today I found a thread on opengroup.org that discusses locale-safe APIs in the C library. They don't suggest anything very positive in the way of standardization :-/ http://www.opengroup.org/austin/mailarchives/austin-group-l/msg00763.html Take care, -- Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

On 31/08/2003 9:25 AM, Tim Peters wrote:
This is true. However, in practice we found it fixed a number of thread safety issues in programs. Your average localised package usually switches to the user's preferred locale on startup, so that it can display strings and messages, and occasionally wants to read/write numbers in a locale independent format (usually when saving/loading files). The most common way of doing this is the setlocale/strtod/setlocale combo, which has thread safety problems and possible reentrancy problems if done wrong. The method used by g_ascii_strtod() removes the need to switch locale when parsing the float, which means that an application using it may only need to call setlocale() once on startup and never again. This seems to be the best way to use setlocale w.r.t. thread safety. The existing locale handling in Python shares this property, but makes it difficult for external libraries to format and parse floats in the locale's representation. From what I can see, leaving LC_NUMERIC set to the locale value rather than "C" leads to better interoperability.
It would be great for Python to have consistent float parsing/formatting on every platform in the future. Making sure that every place where Python wants to parse or format a float in a locale independent fashion goes through a single set of functions should make it easier to drop in a new set of routines in the future. However, getting rid of the LC_NUMERIC=C requirement would have real benefits today. James. -- Email: james@daa.com.au WWW: http://www.daa.com.au/~james/
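For readers unfamiliar with the setlocale/strtod/setlocale combo mentioned above, here is a rough Python analogue of the pattern. The names c_numeric_locale and parse_saved_float are hypothetical, and the lock is an addition for illustration only: it serialises callers of this helper but cannot protect other threads (or C libraries) that format numbers while the locale is temporarily switched, which is exactly the weakness described above.

```python
import locale
import threading
from contextlib import contextmanager

_locale_lock = threading.Lock()  # only guards users of this helper


@contextmanager
def c_numeric_locale():
    """Temporarily force LC_NUMERIC to "C", then restore the old setting."""
    with _locale_lock:
        saved = locale.setlocale(locale.LC_NUMERIC)  # query, no change
        locale.setlocale(locale.LC_NUMERIC, "C")
        try:
            yield
        finally:
            locale.setlocale(locale.LC_NUMERIC, saved)


def parse_saved_float(text):
    """Parse a float written in locale-independent ('C') notation."""
    with c_numeric_locale():
        return locale.atof(text)


print(parse_saved_float("1234.5"))
```

A locale-agnostic parser in the style of g_ascii_strtod() avoids the switch altogether, so the process-wide locale never changes after startup.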

James Henstridge <james@daa.com.au> writes:
I think everybody agrees that allowing non-C LC_NUMERIC settings in the C library is very desirable. My concerns are about the specific approach taken to implement that change. Or, actually, with an entire class of approaches: namely those that involve complex algorithms (i.e. which include a for-statement :) to implement that feature. Regards, Martin

Christian Reis <kiko@async.com.br> writes:
So, in an attempt to garner comments (now that we have 2.3 off the chopping block) I'm reposting my PEP proposal (with minor updates).
I can agree with the declared problem of the PEP, and the rationale for fixing it. Tim also convinced me that the approach taken to solve it is, technically, acceptable. So I only list issues where I disagree.
This change should also solve the aforementioned thread-safety problems.
It does not, and I think the PEP should point out that it doesn't.
One of my early concerns (and I still have this concern) is that the contributors here appear to take the position "We have this fine code developed elsewhere, it seems to work, so we copy it. We don't actually have to understand this code". I would feel more comfortable if the code was written from scratch for usage in Python, with just the ideas borrowed from glib. Proper attribution of contributors and licensing is just one aspect; we really need the submitter of the code to fully understand it, and to be capable of reacting to problems quickly. That said, I don't actually require that the code be written from scratch. Instead, a detailed elaboration of how precisely the implementation is approached, in the PEP, would be good. The PEP should also point out deficiencies of the approach taken, e.g. the issue of spelling NaN, inf, etc. If it can be determined not to be an issue in real life (i.e. for all interesting platforms), this should be documented as well. Regards, Martin

I think that anything that might still be needed years after the code is adopted should be in comments in the code, not in a PEP. --Guido van Rossum (home page: http://www.python.org/~guido/)
participants (9): Aahz, Christian Reis, Guido van Rossum, Gustavo J A M Carneiro, Gustavo J. A. M. Carneiro, James Henstridge, Jeff Epler, martin@v.loewis.de, Tim Peters