RE: [Spambayes] Question (or possibly a bug report)

When I call "marshal.dumps(0.1)" from AsyncDialog (or anywhere in the Outlook code) I get "f\x030.0", which fits with what you have.
So the obvious <wink> answers are:
(Glad you posted this - I was wading through the progress of marshalling (PyOS_snprintf etc) and getting rapidly lost).
1. When LC_NUMERIC is "german", MS C's atof() stops at the first period it sees.
This is the case:
"""
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
    float f;
    setlocale(LC_NUMERIC, "german");
    f = atof("0.1");
    printf("%f\n", f);
}
"""
Gives me with gcc version 3.2 20020927 (prerelease):
0.100000
Gives me with Microsoft C++ Builder (I don't have Visual C++ handy, but I suppose it would be the same):
0,00000
The help file for Builder does say that this is the correct behaviour - it will stop when it finds an unrecognised character - here '.' is unrecognised (because we are in German), so it stops.
Does this then mean that this is a Python bug? Or because Python tells us not to change the c locale and we (Outlook) are, it's our fault/problem?
Presumably what we'll have to do for a solution is just what Mark is doing now - find the correct place to put a call that (re)sets the c locale to English.
=Tony Meyer

[Tony Meyer]
It's the unmarshalling code that's relevant -- that just passes a string to atof().
1. When LC_NUMERIC is "german", MS C's atof() stops at the first period it sees.
It's possible that glibc doesn't recognize "german" as a legitimate locale name (so that the setlocale() call had no effect).
atof does have to stop at the first unrecognized character, but atof is locale-dependent, so which characters are and aren't recognized depends on the locale. After I set locale to "german" on Win2K:
MS tells me that the decimal_point character is ',' and the thousands_sep character is '.':
Python believes that the locale-specified thousands_sep character should be ignored, and that's what locale.atof() does. It may well be a bug in MS's atof() that it doesn't ignore the current thousands_sep character -- I don't have time now to look up the rules in the C standard, and it doesn't matter to spambayes either way (whether we load .001 as 0.0 or as 1.0, it's a disaster).
Does this then mean that this is a Python bug?
That Microsoft's atof() doesn't ignore the thousands_sep character is certainly not Python's bug <wink>.
Or because Python tells us not to change the c locale and we (Outlook) are, it's our fault/problem?
The way we're using Python with Outlook doesn't meet the documented requirements for using Python, so for now everything that goes wrong here is our problem. It would be better if Python didn't use locale-dependent string<->float conversions internally, but that's just not the case (yet).
Python requires that the (true -- from the C library's POV) LC_NUMERIC category be "C" locale. That isn't English (although it looks a lot like it to Germans <wink>), and we don't care about any category other than LC_NUMERIC here.
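The two behaviours described above (stop at the first unrecognized character, versus drop the thousands separator first) can be illustrated with a toy model. This is only a sketch: `simulated_atof` and its parameters are hypothetical names invented here, it is not the CRT's or Python's actual algorithm, and it handles only plain digits and one decimal point. The "german" conventions are passed in as arguments rather than read from the real locale.

```python
def simulated_atof(text, decimal_point=".", thousands_sep="", skip_thousands=False):
    """Toy model of a locale-aware atof(); NOT the real CRT algorithm.

    Accepts ASCII digits and at most one decimal_point character, and
    stops at the first character it does not recognize. With
    skip_thousands=True, thousands_sep characters are dropped first,
    which is roughly the behaviour the thread attributes to locale.atof().
    """
    if skip_thousands and thousands_sep:
        text = text.replace(thousands_sep, "")
    accepted = []
    seen_point = False
    for ch in text:
        if ch in "0123456789":
            accepted.append(ch)
        elif ch == decimal_point and not seen_point:
            accepted.append(".")  # normalise to the spelling float() expects
            seen_point = True
        else:
            break  # unrecognized character: stop parsing here
    digits = "".join(accepted)
    if digits in ("", "."):
        return 0.0  # nothing usable was read
    return float(digits)


# Conventions reported for "german" in the thread: ',' and '.'
GERMAN = {"decimal_point": ",", "thousands_sep": "."}

print(simulated_atof("0.1"))                                  # "C"-style conventions
print(simulated_atof("0.1", **GERMAN))                        # stops at '.'
print(simulated_atof(".001", **GERMAN))                       # stops immediately
print(simulated_atof(".001", skip_thousands=True, **GERMAN))  # '.' dropped first
```

Under the German conventions the stopping parser turns both "0.1" and ".001" into 0.0, while the separator-skipping parser turns ".001" into 1.0, so either rule corrupts a float that was written with a period.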

Jeez, this locale crap makes Unicode look positively delightful... The SB Windows triumvirate (Mark, Tim, Tony) seem to have narrowed down the problem quite a bit. Is there some way to worm around it? I take it with the unmarshalling problem it's not sufficient to specify floating point values without decimal points (e.g., 0.12 == 1e-1+2e-2). Is the proposed early specification of a locale in the config file sufficient to make things work? A foreign user of the nascent CSV module beat us up a bit during development about not supporting different locales (I guess in Brazil the default separator is a semicolon, which makes sense if your decimal "point" is a comma). Thank God we ignored him! ;-) Skip

Jeez, this locale crap makes Unicode look positively delightful...
This seems to be coming to a conclusion. Not a completely satisfactory one, but one nonetheless. Short story for the python-dev crew:
* Some Windows programs are known to run with the CRT locale set to other than "C" - specifically, set to the locale of the user.
* If this happens, the marshal module will load floating point literals incorrectly.
* Thus, once this happens, if a .pyc file is imported, the floating point literals in that .pyc are wrong. Confusion reigns.
The "best" solution to this probably involves removing Python being dependent on the locale - there is even an existing patch for that. To the SpamBayes specifics:
I have a version working for the original bug reporter. While on our machines, we can reproduce the locale being switched at MAPILogon time, my instrumented version also shows that for some people at least, Outlook itself will also change it back some time before delivering UI events to us. Today I hope to produce a less-instrumented version with the fix I intend leaving in, and asking the OP to re-test. We *do* still have the "social" problem of what locale conventions to use for Config files, but that has nothing to do with our tools... Mark.
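As a rough illustration of what "(re)setting the C locale" can look like from Python, here is a minimal sketch using the standard `locale` module. It is not the fix that went into the Outlook addin; the helper name `numeric_locale_forced_to_c` is made up for this example.

```python
import contextlib
import locale


@contextlib.contextmanager
def numeric_locale_forced_to_c():
    """Force LC_NUMERIC to "C" for the duration of a block, then restore it."""
    # Calling setlocale() with only a category queries the current setting.
    saved = locale.setlocale(locale.LC_NUMERIC)
    locale.setlocale(locale.LC_NUMERIC, "C")
    try:
        yield
    finally:
        # Put back whatever the host application had selected.
        locale.setlocale(locale.LC_NUMERIC, saved)


with numeric_locale_forced_to_c():
    conv = locale.localeconv()
    # In the "C" locale the decimal point is '.' and there is no
    # thousands separator.
    print(repr(conv["decimal_point"]), repr(conv["thousands_sep"]))
```

Two caveats from the thread apply to anything of this shape: the C locale is process-wide rather than per-thread, and the host (Outlook here) may switch it again at any time, so a guard like this only protects the code that runs inside it.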

"Mark Hammond" <mhammond@skippinet.com.au> writes:
The "best" solution to this probably involves removing Python being dependent on the locale - there is even an existing patch for that.
While the feature is desirable, I don't like the patch at all. It copies the relevant code of Gnome glib, and I
a) doubt it works on all systems we care about, and
b) is too much code for us to maintain, and
c) introduces yet another license (although the true authors of that code would be willing to relicense it)
It would be better if system functions could be found for a locale-agnostic atof/strtod on all systems. For example, glibc has a strtod_l function, which expects a locale_t in addition to the char*. It would be good if something similar was discovered for VC. Using undocumented or straight Win32 API functions would be fine. Unfortunately, the "true" source of atof (i.e. from conv.obj) is not shipped with MSVC :-(
Regards, Martin

[martin@v.loewis.de]
OTOH, even assuming "C" locale, Python's float<->string story varies across platforms anyway, due to different C libraries treating things like infinities, NaNs, signed zeroes, and the number of digits displayed in an exponent differently. This also has bad consequences, although one-platform programmers usually don't notice them (Windows programmers do more than most, because MS's C library can't read back the strings it produces for NaNs and infinities -- which Python also produces and can't read back in then). So it's not that the patch is too much code to maintain, it's not enough code to do the whole job <0.9 wink>.
Well, a growing pile of funky platform #ifdefs isn't all that attractive either.
It would be good if something similar was discovered for VC. Using undocumented or straight Win32 API functions would be fine.
Only half joking, I expect that anything using the native Win32 API would end up being as big as the glib patch.
Unfortunately, the "true" source of atof (i.e. from conv.obj) is not shipped with MSVC :-(
Would that help us if we could get it? I'm not sure how. I expect the true source is assembler, for easy exploitation of helpful Pentium FPU gimmicks you can't get at reliably from C code. Standard-quality float<->string routines require simulating (by hook or by crook) more precision than the float type has, and access to the Pentium's extended-precision float type can replace pages of convoluted C with a few machine instructions.

"Tim Peters" <tim.one@comcast.net> writes:
I would hope that some inner routine that does the actual construction of the double is locale-independent, and takes certain details as separate arguments. Then, this routine could be used, passing the "C" specific parameters instead of those of the current locale. Regards, Martin

[martin@v.loewis.de]
Unfortunately, the "true" source of atof (i.e. from conv.obj) is not shipped with MSVC :-(
[Tim]
Would that help us if we could get it? I'm not sure how.
[Martin]
OK. I looked and couldn't find anything useful. The Win32 GetNumberFormat() call can be used to format numbers for locale-aware display, but I didn't find anything in the other direction. The info at http://www.microsoft.com/globaldev/getwr/steps/wrg_nmbr.mspx seems to imply Win32 apps have to roll their own numeric parsing, building on the raw info returned by GetLocaleInfo().

On Fri, Jul 25, 2003 at 03:13:46AM -0400, Tim Peters wrote:
My question, now, is whether we would be able to cobble something even more magical into the g_ascii_* functions that makes Python more robust to these changes (over time)? Take care, -- Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

On Fri, Jul 25, 2003 at 07:25:48AM +0200, Martin v. Löwis wrote:
I'm sorry you don't like the patch, but if there's something that can be fixed, we will fix it :-) Well, glib is known to be quite portable, and we would make sure that it does run on the supported platforms before considering checking it in. (I'm betting it does.)
b) is too much code for us to maintain, and
It's not *that* much code, and we can rely on fixes that are produced to glib being easily ported to us -- we get free maintenance of the code if we choose to do so, actually.
c) introduces yet another license (although the true authors of that code would be willing to relicense it)
Which means that c) is a non-issue?
Yes, but if all we were worried about was glibc, then point a) would be a non-issue too. I imagine it's easier to make sure the code we *have* runs on multiple platforms than trying to find and call code that *may* exist on each given platform.
I don't understand this bit. You'd rather use an undocumented API function than an open source, well-tested, properly licensed set of functions? Take care, -- Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

I don't know what the exact requirements of this license are, but I assure you that redistributing code that is not under the PSF license is a pain, even if it's an open source license. If we can get the original authors to contribute the code to the PSF without the requirement to include a license of any kind (beyond the PSF license) in redistributions, either by the PSF or downstream, even if those redistributions are commercial or contain proprietary code in addition to open source code. This is what's possible with the PSF license, and that needs to remain the case. In particular, the GPL is *not* acceptable for this purpose. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Wed, Aug 13, 2003 at 08:34:16PM -0700, Guido van Rossum wrote:
You omit the predicate that follows this if clause, but I'm hoping you meant something positive like `we will gladly accept it' <wink> I'm waiting on Alex's answer on relicensing the code, but he's said on IRC that he'd be willing to do it, so barring any environmental disasters, that should be solved sometime soon. Take care, -- Christian Reis, Senior Engineer, Async Open Source, Brazil. http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

[Mark Hammond]
Well, it depends on the locale, and on the fp literals in question, but it's often the case that damage occurs.
* Thus, once this happens, if a .pyc file is imported, the floating point literals in that .pyc are wrong. Confusion reigns.
Yup -- and it's an excellent to-the-point summary!
The "best" solution to this probably involves removing Python being dependent on the locale - there is even an existing patch for that.
Kinda.
There's potentially another dark side to this story: if MS code is going out of its way to switch locale, it's presumably because some of MS's code wants to change the behavior of CRT routines to work "as expected" for the current user. So if we switch LC_NUMERIC back to "C", we *may* be creating problems for Outlook. I'll never stumble into this, since "C" locale and my normal locale are so similar (and have identical behavior in the LC_NUMERIC category). At least Win32's *native* notions of locale are settable on a per-thread basis; C's notion is a global hammer; it's unclear to me why MS's code is even touching C's notion.
To the extent that Config files use Python syntax, they're least surprising if they stick to Python syntax. The locale pit is deep. For example, Finnish uses a non-breaking space to separate thousands "although fullstop may be used in monetary context". We'll end up with more code to cater to gratuitous locale differences than to identify spam <0.7 wink>.

[Skip Montanaro]
Jeez, this locale crap makes Unicode look positively delightful...
Yes, it does! locale is what you get when someone complains they like to use ampersands instead of commas to separate thousands, and a committee thinks "hey! we've got all these great functions already, so why change them? instead we'll add mounds of hidden global state that affects lots of ancient functions in radical ways!". Make sure it's as hostile to threads as possible, decline to define any standard locale names beyond "C" and the empty string, and decline to define what anything except the "C" locale name means, and you're almost there. The finishing touches come in the function definitions, like this in strtod():
    In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.
What those may be aren't constrained in any way, of course. locale can be cool in a monolithic, single-threaded, one-platform program, provided the platform C made up rules you can live with for the locales you care about. It's more of an API framework than a solution, and portable programs really can't use it except via forcing locale back to "C" every chance they get <wink>.
When true division becomes the default, things like 12/100 should work reliably regardless of locale -- i.e., don't use any float literals, and you can't get screwed by locale float-literal quirks. Today, absurd spellings like float(12)/100 can accomplish the same. Changing Python is a better solution. The rule that an embedded Python requires that LC_NUMERIC be "C" isn't livable -- embedded Python is a fly trying to stare down an elephant, in Outlook's case. I dragged python-dev into this to illustrate that it's a very real problem in a very popular kick-ass Python app. Note that this same problem was discussed in more abstract terms by others here within the last few weeks, and I hope that making it more concrete helps get the point across. The float-literal-in-.pyc problem could be addressed in several ways. Binary pickles, and the struct module, use a portable binary float format that isn't subject to locale quirks. I think marshal should be changed to use that too, by adding an additional marshal float format (so old marshals would continue to be readable, but new marshals may not be readable under older Pythons). Note that text-mode pickles of floats are vulnerable to locale nightmares too.
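To make the binary-format idea concrete, here is a small sketch using the struct module mentioned above. The helper names `dump_float` and `load_float` and the choice of the `"<d"` format (8-byte little-endian IEEE 754 double) are assumptions for illustration, not a description of what marshal does. The last lines show the integer-only spelling of 0.12.

```python
import struct


def dump_float(value):
    """Pack a float as an 8-byte little-endian IEEE 754 double."""
    return struct.pack("<d", value)


def load_float(data):
    """Unpack 8 bytes produced by dump_float() back into a float."""
    return struct.unpack("<d", data)[0]


payload = dump_float(0.1)
print(len(payload))                  # number of bytes written
print(load_float(payload) == 0.1)    # round trip involves no text parsing

# Infinities survive too, since no string form is produced or read back.
print(load_float(dump_float(float("inf"))))

# Building the value from integers avoids a float literal entirely.
print(float(12) / 100 == 0.12)
```

Because no decimal string is ever written or parsed, neither the decimal point nor the thousands separator of the current locale can affect the result.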
Is the proposed early specification of a locale in the config file sufficient to make things work?
I doubt it, as Outlook can switch locale any time it feels like it. We can't control that. I think we should set a line-tracing hook, and force locale back to "C" on every callback <wink>.
Ya, foreigners are no damn good <wink>.

participants (7)
- Christian Reis
- Guido van Rossum
- Mark Hammond
- martin@v.loewis.de
- Meyer, Tony
- Skip Montanaro
- Tim Peters