formatting issues, locale and co

Hi,

While looking at the latest failures of numpy trunk on windows for python 2.5 and 2.6, I got into floating point number formatting issues; I got deeper and deeper, and now I am lost. We have several problems:

- we are not consistent between platforms, nor are we consistent with python
- str(np.float32(a)) is locale dependent, but python's str is not (locale.str is)
- formatting of long double does not work on windows because of the broken long double support in mingw.

1 consistency problem:
----------------------

python -c "a = 1e20; print a" -> 1e+020
python26 -c "a = 1e20; print a" -> 1e+20

In numpy, we use PyOS_snprintf for formatting, but python itself uses PyOS_ascii_formatd, which behaves differently on different versions of python. The above behavior can be reproduced in C:

    #include <Python.h>

    int main()
    {
        double x = 1e20;
        char c[200];

        /* python's locale-independent formatting */
        PyOS_ascii_formatd(c, sizeof(c), "%.12g", x);
        printf("%s\n", c);
        /* formatting done by the C runtime */
        printf("%g\n", x);

        return 0;
    }

On 2.5, this will print:

1e+020
1e+020

But on 2.6, this will print:

1e+20
1e+020

2 locale dependency:
--------------------

Another issue is that our own formatting is locale dependent, whereas python's isn't:

    import numpy as np
    import locale
    locale.setlocale(locale.LC_NUMERIC, 'fr_FR')
    a = 1.2

    print "str(a)", str(a)
    print "locale.str(a)", locale.str(a)
    print "str(np.float32(a))", str(np.float32(a))
    print "locale.str(np.float32(a))", locale.str(np.float32(a))

Returns:

str(a) 1.2
locale.str(a) 1,2
str(np.float32(a)) 1,2
locale.str(np.float32(a)) 1,20000004768

I thought about copying the way python does the formatting in the trunk (where discrepancies between platforms have been fixed), but this is not so easy, because it uses a lot of code from different places, and the code needs to be adapted to float and long double. The other solution would be to do our own formatting, but this does not sound easy: formatting in C is hard. I am not sure what we should do. Does anyone else have any idea?

cheers,

David
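The exponent discrepancy described above (1e+020 versus 1e+20) comes down to how many exponent digits the formatter emits. As a rough illustration only, and not the code that numpy or python uses, the hypothetical helper normalize_exponent below rewrites either spelling into the form with at least two exponent digits. It is written for Python 3, unlike the Python 2 snippets in this thread.

```python
import re

# Matches a trailing exponent such as 'e+020', 'E-5' or 'e20'.
_EXP_RE = re.compile(r'([eE])([+-]?)(\d+)$')


def normalize_exponent(text):
    """Hypothetical helper: give the exponent at least two digits.

    Extra leading zeros are dropped ('1e+020' -> '1e+20') and a missing
    sign is written as '+'. Strings without an exponent are returned as-is.
    """
    def fix(match):
        marker, sign, digits = match.groups()
        # Strip padding zeros, then pad back to a minimum of two digits.
        digits = digits.lstrip('0').rjust(2, '0')
        return marker + (sign or '+') + digits

    return _EXP_RE.sub(fix, text)
```

Applied to the two outputs quoted above, both '1e+020' and '1e+20' come out as '1e+20', which is the kind of post-processing a formatting routine could apply to hide the difference between C runtimes.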

On Sat, Dec 27, 2008 at 10:27 PM, David Cournapeau < david@ar.media.kyoto-u.ac.jp> wrote:
[clip]
I think the first thing to do is make a decision on locale. If we choose to support locales, I don't see much choice but to depend on Python, because it's too much work otherwise, and work not directly related to Numpy at that. If we decide not to support locales, then we can do our own formatting, if we need to, using a fixed choice of locale. There is a list of snprintf implementations here <http://www.ijs.si/software/snprintf/>. Trio <http://daniel.haxx.se/projects/trio/> looks like a mature project and has an MIT license, which I think is compatible with Numpy's.

I'm inclined to just fix the locale and ignore the rest until Python gets things sorted out. But I'm lazy...

Chuck

On Sun, Dec 28, 2008 at 01:38, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Sat, Dec 27, 2008 at 10:27 PM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
[clip]
I think the first thing to do is make a decision on locale. If we choose to support locales, I don't see much choice but to depend on Python, because it's too much work otherwise, and work not directly related to Numpy at that. If we decide not to support locales, then we can do our own formatting, if we need to, using a fixed choice of locale. There is a list of snprintf implementations here. Trio looks like a mature project and has an MIT license, which I think is compatible with Numpy's.
We should not support locales. The string representations of these elements should be Python-parseable.
I'm inclined to just fix the locale and ignore the rest until Python gets things sorted out. But I'm lazy...
What do you think Python doesn't have sorted out?

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
  -- Umberto Eco
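To make the "Python-parseable" requirement concrete: float() always expects '.' as the decimal point, whatever the locale, so a locale-formatted string such as '1,2' cannot be read back. A minimal sketch in Python 3 follows; the helper name is_python_parseable is made up for this illustration.

```python
def is_python_parseable(text):
    """Return True if float() accepts the string, False otherwise."""
    try:
        float(text)
    except ValueError:
        return False
    return True
```

With this check, '1.2' and repr(1.2) pass, while the French-locale output '1,2' from the first message does not.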

Robert Kern wrote:
We should not support locales. The string representations of these elements should be Python-parseable.
It looks like I was wrong in my analysis of the problem: I thought I was using the most recent implementation of the PyOS_* functions in my test code, but the ones in 2.6 are not the same as the ones in the current trunk. So the problem may be easier to fix than I first thought: simply providing our own PyOS_ascii_formatd (and similar functions for float and long double) may be enough, and since we don't care about locale (%Z and %n), the function is simple (and can be pulled out of the python sources).

We would then use PyOS_ascii_format* (locale independent) instead of PyOS_snprintf (locale dependent) in the str/repr implementation of scalar arrays. Does that sound acceptable to you?

cheers,

David
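The idea behind a locale-independent formatter can be sketched at the Python level: format with a locale-aware routine, then force the decimal point back to '.'. The snippet below is only an analogy in Python 3 for what a C function in the spirit of PyOS_ascii_formatd might do; the name ascii_format is an assumption made for this example and is not part of numpy or python.

```python
import locale


def ascii_format(fmt, value):
    """Format `value` so that the decimal point is always '.'.

    locale.format_string honours LC_NUMERIC, so in a French locale it
    would yield '1,2'; the decimal point is replaced afterwards.
    """
    text = locale.format_string(fmt, value)
    decimal_point = locale.localeconv()['decimal_point']
    if decimal_point != '.':
        # Only one decimal point can occur in a single formatted float.
        text = text.replace(decimal_point, '.', 1)
    return text
```

Whatever LC_NUMERIC is set to, ascii_format('%.12g', 1.2) should give a string that float() can read back.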

On Sat, Dec 27, 2008 at 11:40 PM, David Cournapeau < david@ar.media.kyoto-u.ac.jp> wrote:
[clip]
We would then use PyOS_ascii_format* (locale independent) instead of PyOS_snprintf (locale dependent) in the str/repr implementation of scalar arrays. Does that sound acceptable to you?
As long as we rename it ;) Trio might be worth a look anyway, as it has some extensions that might be useful: binary formats, for instance.

Chuck

On Sun, Dec 28, 2008 at 4:12 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
[clip]
I put yesterday's work in the fix_float_format branch:

- it fixes the locale issue
- it fixes the long double issue on windows.
- it also fixes some tests (we were not testing single precision formatting, but double precision twice instead; the single precision test fails on the trunk BTW).
- it handles inf and nan more consistently across platforms (e.g. str(np.log(0)) will be '-inf' on all platforms; on windows, it used to be '-1.#INF'. I was afraid this would break converting the string back to float, but that was already broken before my change, e.g. float('-1.#INF') does not work on windows).
- for now, it breaks on windows python 2.5, because float(1e10) used to be printed as 1e+010 on python 2.5 and is printed as 1e+10 on python 2.6 (to be more consistent with C99). But I could simply force backward compatibility with python 2.5/2.4, since I can control the number of digits in the exponent in the formatting code.

There are still some problems related to double which I am not sure how to solve:

    import numpy as np
    a = 1e10
    print np.float32(a) # -> calls format_float
    print np.float64(a) # -> does not call format_double
    print np.float96(a) # -> calls format_longdouble

I guess the difference with float64 comes from its multiple inheritance (that is, it derives from the builtin float, and the rules for print are different than for the others). Is this behavior the expected one?

cheers,

David
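The consistent handling of inf and nan mentioned in the list above can be illustrated with a small Python 3 sketch. This is not the code in the fix_float_format branch; format_special and its default format are assumptions made for the example. The point is simply to decide the spelling of non-finite values before the platform's C runtime gets a chance to print '1.#INF' or similar.

```python
import math


def format_special(value, fmt='%.12g'):
    """Format a float, spelling non-finite values the same on every platform."""
    if math.isnan(value):
        return 'nan'
    if math.isinf(value):
        return 'inf' if value > 0 else '-inf'
    # Finite values go through the ordinary formatting path.
    return fmt % value
```

Because the special cases never reach the C library, the result for -inf is '-inf' regardless of which runtime python was built against.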

On Sun, Dec 28, 2008 at 9:38 PM, David Cournapeau <cournape@gmail.com> wrote:
[clip]
I put yesterday's work in the fix_float_format branch:

- it fixes the locale issue
- it fixes the long double issue on windows.
- it also fixes some tests (we were not testing single precision formatting, but double precision twice instead; the single precision test fails on the trunk BTW).

Curious, I don't see any test failures here. Were the tests actually being run, or is something else different in your test setup? Or do you mean the fixed-up test fails?

- it handles inf and nan more consistently across platforms (e.g. str(np.log(0)) will be '-inf' on all platforms; on windows, it used to be '-1.#INF'. I was afraid this would break converting the string back to float, but that was already broken before my change, e.g. float('-1.#INF') does not work on windows).
- for now, it breaks on windows python 2.5, because float(1e10) used to be printed as 1e+010 on python 2.5 and is printed as 1e+10 on python 2.6 (to be more consistent with C99). But I could simply force backward compatibility with python 2.5/2.4, since I can control the number of digits in the exponent in the formatting code.

There are still some problems related to double which I am not sure how to solve:

    import numpy as np
    a = 1e10
    print np.float32(a) # -> calls format_float
    print np.float64(a) # -> does not call format_double
    print np.float96(a) # -> calls format_longdouble

I guess the difference with float64 comes from its multiple inheritance (that is, it derives from the builtin float, and the rules for print are different than for the others). Is this behavior the expected one?

Expected, but I would like to see it change because it is kind of frustrating. Fixing it probably involves setting a function pointer in the type definition, but I am not sure about that.

We might also want to do something about integers, as in Python 3.0 they will all be Python long integers. I don't know if that actually breaks anything in numpy, or how Python 3.0 implements integers, but it might be a good idea not to derive from Python integers. How that will affect indexing speed I don't know.

Chuck

Charles R Harris wrote:
I put yesterday's work in the fix_float_format branch:

- it fixes the locale issue
- it fixes the long double issue on windows.
- it also fixes some tests (we were not testing single precision formatting, but double precision twice instead; the single precision test fails on the trunk BTW).

Curious, I don't see any test failures here. Were the tests actually being run, or is something else different in your test setup? Or do you mean the fixed-up test fails?

The latter: if you look at numpy/core/tests/test_print, you will see that the types tested are np.float, np.double and np.longdouble, but at least on linux, np.float == np.double, and I suppose np.float32 is what we want to test here instead.

Expected, but I would like to see it change because it is kind of frustrating. Fixing it probably involves setting a function pointer in the type definition, but I am not sure about that.

Hm, it took me a while to get this, but print np.float32(value) can be controlled through tp_print. Still, it does not work in all cases:

    print np.float32(a) -> calls tp_print
    print '%f' % np.float32(a) -> does not call tp_print (nor tp_str/tp_repr)

I have no idea what is going on there.

We might also want to do something about integers, as in Python 3.0 they will all be Python long integers.

I will only care about floating point numbers for now, since they have problems today in numpy, with currently used python interpreters :)

David
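The reason a test that uses np.float cannot catch single precision formatting bugs is that a double and a single precision value print differently for the same input. A sketch in Python 3 using only the standard library follows; as_float32 is a made-up helper that emulates single precision rounding with struct, and numpy itself is deliberately not used here.

```python
import struct


def as_float32(value):
    """Round a Python float (a C double) to the nearest single precision value."""
    return struct.unpack('f', struct.pack('f', value))[0]
```

With this helper, '%.12g' % 1.2 gives '1.2' while '%.12g' % as_float32(1.2) gives '1.20000004768', the same digits as the locale.str(np.float32(a)) output in the first message. A test that only ever formats doubles never exercises that second path.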

On Sun, Dec 28, 2008 at 10:35 PM, David Cournapeau < david@ar.media.kyoto-u.ac.jp> wrote:
[clip]
Hm, it took me a while to get this, but print np.float32(value) can be controlled through tp_print. Still, it does not work in all cases:

    print np.float32(a) -> calls tp_print
    print '%f' % np.float32(a) -> does not call tp_print (nor tp_str/tp_repr)

I have no idea what is going on there.

I'll bet it's calling a conversion to python float, i.e., double, because of the %f.

    In [1]: '%s' % np.float32(1)
    Out[1]: '1.0'

    In [2]: '%f' % np.float32(1)
    Out[2]: '1.000000'

I don't see any way to work around that without changing the way the python formatting works.

Chuck
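Chuck's guess can be checked at the Python level with a small probe object. The sketch below is written for Python 3 and uses a hypothetical Probe class as a stand-in for a numpy scalar; it records which special method the % operator invokes for each conversion.

```python
class Probe(object):
    """Hypothetical stand-in for a numpy scalar that logs which hook is used."""

    def __init__(self):
        self.calls = []

    def __float__(self):
        self.calls.append('__float__')
        return 1.0

    def __str__(self):
        self.calls.append('__str__')
        return 'from __str__'


def hooks_used(fmt):
    """Format a fresh Probe with `fmt`; return the text and the hooks called."""
    probe = Probe()
    text = fmt % probe
    return text, probe.calls
```

Here hooks_used('%f') reports that only __float__ ran, and hooks_used('%s') reports that only __str__ ran. That matches the explanation above: with %f the object is first converted to a python float, and the object's own str/repr/print hooks never see the request.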

On Mon, Dec 29, 2008 at 4:36 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
[clip]
I'll bet it's calling a conversion to python float, i.e., double, because of the %f.
Yes, I meant that I did not understand the code path in that case. I realize that I don't know how to get the (C) call graph between two code points in python; that would be useful. Where are you dtrace on linux when I need you :)

    In [1]: '%s' % np.float32(1)
    Out[1]: '1.0'

    In [2]: '%f' % np.float32(1)
    Out[2]: '1.000000'

I don't see any way to work around that without changing the way the python formatting works.

Yes, I think you're right. Especially since python itself is not consistent. On python 2.6, windows:

    a = complex('inf')
    print a # -> prints inf
    print '%s' % a # -> prints inf
    print '%f' % a # -> prints 1.#INF

Which suggests that in that case, it goes directly to stdio without much formatting work from python. Maybe it is an oversight? Anyway, I think it would be useful to override the tp_print member (to avoid 'print a' printing 1.#INF).

cheers,

David

On Mon, Dec 29, 2008 at 8:12 PM, David Cournapeau <cournape@gmail.com> wrote:
[clip]
I'll bet it's calling a conversion to python float, i.e., double, because of the %f.
Yes, I meant that I did not understand the code path in that case. I realize that I don't know how to get the (C) call graph between two code points in python, that would be useful. Where are you dtrace on linux when I need you :)
I'm not sure we are quite on the same page here. The float32 object has a "convert to python float" method, (which I don't recall at the moment and I don't have the source to hand). So when %f appears in the format string that method is called and the resulting python float is formatted in the python way. Same with %s, only __str__ is called instead.
    In [1]: '%s' % np.float32(1)
    Out[1]: '1.0'

    In [2]: '%f' % np.float32(1)
    Out[2]: '1.000000'

I don't see any way to work around that without changing the way the python formatting works.

Yes, I think you're right. Especially since python itself is not consistent. On python 2.6, windows:

    a = complex('inf')
    print a # -> prints inf
    print '%s' % a # -> prints inf
    print '%f' % a # -> prints 1.#INF

How does a python inf display on windows?

Which suggests that in that case, it goes directly to stdio without much formatting work from python. Maybe it is an oversight? Anyway, I think it would be useful to override the tp_print member (to avoid 'print a' printing 1.#INF).

Sounds like the sort of thing the python folks would want to clean up, just as you have for numpy.

Chuck

Charles R Harris wrote:
Yes, I meant that I did not understand the code path in that case. I realize that I don't know how to get the (C) call graph between two code points in python, that would be useful. Where are you dtrace on linux when I need you :)
I'm not sure we are quite on the same page here.
Yep, indeed. I think my bogus example did not help :) The right test script uses float('inf'), not complex('inf').

The float32 object has a "convert to python float" method, (which I don't recall at the moment and I don't have the source to hand). So when %f appears in the format string that method is called and the resulting python float is formatted in the python way.

I think that's not the case for '%f', because the 'python' way is to print 'inf', not '1.#INF' (at least on 2.6; on 2.5, it is always '1.#INF' on windows). If you use a pure C program on windows, you will get '1.#INF', etc. instead of 'inf'. repr, str and print all call the C format_float function, which takes care of formatting 'inf' and co the 'python' way. So getting '1.#INF' from python suggests to me that python does not format it in the '%f' case, and I don't know the code path at that point. For '%s', it goes through tp_str; for print a, it goes through tp_print; but for '%f'?

    a = complex('inf')
    print a # -> prints inf
    print '%s' % a # -> prints inf
    print '%f' % a # -> prints 1.#INF

How does a python inf display on windows?

As stated: it depends. 'inf' or '1.#INF', the latter being the same as the formatting done within the MS runtime.

Which suggests that in that case, it goes directly to stdio without much formatting work from python. Maybe it is an oversight? Anyway, I think it would be useful to override the tp_print member (to avoid 'print a' printing 1.#INF).

Sounds like the sort of thing the python folks would want to clean up, just as you have for numpy.

The thing is, since I don't understand what happens in the print '%f' case, I don't know how to clean it up, if it is possible at all. But in any case, it means that with my changes, we are no worse than python itself, and I think we are better than before.

cheers,

David

David Cournapeau wrote:
The thing is, since I don't understand what happens in the print '%f' case, I don't know how to clean it up, if it is possible at all. But in any case, it means that with my changes, we are no worse than python itself, and I think we are better than before.

Just a quick look in SVN, trunk/Objects/stringobject.c, shows that the call path for a "%f" format is string_mod -> PyString_Format -> formatfloat -> PyOS_ascii_formatd.

--
Lenard Lindstrom <len-l@telus.net>

On Wed, Dec 31, 2008 at 3:41 AM, Lenard Lindstrom <len-l@telus.net> wrote:
David Cournapeau wrote:
The thing is, since I don't understand what happens in the print '%f' case, I don't know how to clean it up, if it is possible at all. But in any case, it means that with my changes, we are no worse than python itself, and I think we are better than before.

Just a quick look in SVN, trunk/Objects/stringobject.c, shows that the call path for a "%f" format is string_mod -> PyString_Format -> formatfloat -> PyOS_ascii_formatd.

Thanks, I did not think about looking into stringobject. I now have to understand why it prints differently, as going through format_float should avoid the inconsistencies.

cheers,

David

Mon, 29 Dec 2008 13:38:12 +0900, David Cournapeau wrote: [clip]
I put yesterday's work in the fix_float_format branch:

- it fixes the locale issue
- it fixes the long double issue on windows.
- it also fixes some tests (we were not testing single precision formatting, but double precision twice instead; the single precision test fails on the trunk BTW).
- it handles inf and nan more consistently across platforms (e.g. str(np.log(0)) will be '-inf' on all platforms; on windows, it used to be '-1.#INF'. I was afraid this would break converting the string back to float, but that was already broken before my change, e.g. float('-1.#INF') does not work on windows).

[clip]

I did some work on the fix_float_format branch from the opposite direction, making fromfile and fromstring properly locale-independent. (cf. #884)

Works currently on POSIX systems, but some tests fail on Windows because float('inf') does not work [neither does float('-1.#INF')...]. (cf. #510) A bit more work must be done on NumPyOS_ascii_strtod to make inf/nan work as intended.

Also, roundtrip tests for repr would be nice to add, if they aren't there yet, and possibly for the str <-> fromstring roundtrip, too.

I'll be almost offline for 1.5 weeks starting now, so if you want to finish this, go ahead.

--
Pauli Virtanen
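A roundtrip test of the kind suggested above could look roughly like the following Python 3 sketch. It uses plain python floats rather than numpy scalars, and the helper name repr_roundtrips is invented for this example; a real numpy test would presumably run the same check over np.float32, np.float64 and np.longdouble values.

```python
def repr_roundtrips(value):
    """Return True if float(repr(value)) reproduces `value`."""
    back = float(repr(value))
    if value != value:
        # nan never compares equal to itself, so compare "nan-ness" instead.
        return back != back
    return back == value
```

Running it over a handful of ordinary and non-finite values (0.0, 1.2, 1e20, inf, -inf, nan) covers both the locale problem and the inf/nan spelling problem in one place.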

On Wed, Dec 31, 2008 at 11:28 AM, Pauli Virtanen <pav@iki.fi> wrote:
[clip]
I did some work on the fix_float_format branch from the opposite direction, making fromfile and fromstring properly locale-independent. (cf. #884)
Works currently on POSIX systems, but some tests fail on Windows because float('inf') does not work [neither does float('-1.#INF')...]. (cf. #510) A bit more work must be done on NumPyOS_ascii_strtod to make inf/nan work as intended. Also, roundtrip tests for repr would be nice to add, if they aren't there yet, and possibly for str <-> fromstring roundtrip, too. I'll be almost offline for 1.5 weeks starting now, so if you want to finish this, go ahead.
Thank you for working on this, Pauli. The problem on windows may not be specific to windows: the difference really is whether the formatting is done by python or by the C runtime. It just happens that on Linux and Mac OS X, the strings are the same, but they could be different on other OSes. I have not looked into whether C99 standardizes this or not (the size of the exponent is, but I don't know about nan and inf).

We should also change the pretty printing of arrays, I think, although it is a change and may break things. Since that's how python represents the numbers, I guess we will have to change at some point.

David

Wed, 31 Dec 2008 13:11:02 +0900, David Cournapeau wrote: [clip]
Thank you for working on this, Pauli. The problem on windows may not be specific to windows: the difference really is whether the formatting is done by python or the C runtime. It just happens that on Linux and Mac OS X, the strings are the same - but it could be different on other OS. I have not looked into C99, whether this is standardized or not (the size of exponent is, but I don't know about nan and inf).
C99 appears to specify for *printf (case-insensitive)

    nan, +nan, -nan, nan(WHATEVER), -nan(WHATEVER)
    inf, infinity, +inf, +infinity, -inf, -infinity

The fromfile/fromstring code now in the `fix_float_format` branch recognizes all of these, independently of the platform. There are also some roundtrip tests that check that this plays along as expected with your new float formatting code.

We should also change pretty print of arrays, I think - although it is a change and may break things. Since that's how python represents the numbers, I guess we will have to change at some point.

It'd be nice to make this also behave like the rest of the float formatting. However, the current formatting of "Inf", "-Inf", "NaN" is OK as far as C99 is concerned.

--
Pauli Virtanen
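Recognizing the spellings listed above on every platform can be sketched in a few lines of Python 3. This is only an illustration of the idea, not the NumPyOS_ascii_strtod code in the branch; parse_float and the two regular expressions are assumptions made for the example.

```python
import re

# Case-insensitive spellings from the list above.
_NAN_RE = re.compile(r'^[+-]?nan(\([^)]*\))?$', re.IGNORECASE)
_INF_RE = re.compile(r'^([+-]?)inf(inity)?$', re.IGNORECASE)


def parse_float(text):
    """Parse a float, accepting the inf/nan spellings on any platform."""
    text = text.strip()
    if _NAN_RE.match(text):
        # The sign and the optional (WHATEVER) payload are ignored here.
        return float('nan')
    match = _INF_RE.match(text)
    if match:
        return float('-inf') if match.group(1) == '-' else float('inf')
    # Everything else is left to the ordinary, locale-independent parser.
    return float(text)
```

Handling the special spellings before falling back to the ordinary parser means the result no longer depends on what the platform's own strtod happens to accept.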

On Sat, Dec 27, 2008 at 11:46 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Sun, Dec 28, 2008 at 01:38, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Sat, Dec 27, 2008 at 10:27 PM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Hi,
While looking at the last failures of numpy trunk on windows for python 2.5 and 2.6, I got into floating point number formatting issues; I got deeper and deeper, and now I am lost. We have several problems: - we are not consistent between platforms, nor are we consistent with python - str(np.float32(a)) is locale dependent, but python str method is not (locale.str is) - formatting of long double does not work on windows because of the broken long double support in mingw.
1 consistency problem:
----------------------

    python -c "a = 1e20; print a" -> 1e+020
    python26 -c "a = 1e20; print a" -> 1e+20
In numpy, we use PyOS_snprintf for formatting, but python itself uses PyOS_ascii_formatd - which has different behavior on different versions of python. The above behavior can be simply reproduced in C:

    #include <Python.h>

    int main()
    {
        double x = 1e20;
        char c[200];

        /* Python's own, locale-independent formatter */
        PyOS_ascii_formatd(c, sizeof(c), "%.12g", x);
        printf("%s\n", c);
        /* The C runtime's formatter, for comparison */
        printf("%g\n", x);

        return 0;
    }
On 2.5, this will print:

    1e+020
    1e+020

But on 2.6, this will print:

    1e+20
    1e+020
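One way to get the same string on every platform is to post-process whatever the C runtime produced. The following is only a sketch under that assumption, written for Python 3; the helper name `normalize_exponent` is hypothetical and this is not necessarily how Python 2.6 does it internally.

```python
import re

# Matches the exponent part of a printf-style float, e.g. "e+020" or "E-005".
_EXPONENT = re.compile(r"([eE])([+-]?)(\d+)$")


def normalize_exponent(text, min_digits=2):
    """Strip extra leading zeros from the exponent, keeping at least min_digits."""
    def fix(match):
        letter, sign, digits = match.groups()
        digits = digits.lstrip("0").rjust(min_digits, "0")
        return letter + sign + digits
    return _EXPONENT.sub(fix, text)


print(normalize_exponent("1e+020"))  # the three-digit form seen on 2.5
print(normalize_exponent("1e+20"))   # already in the two-digit form
```

Strings without an exponent pass through unchanged, so the helper can be applied to any formatted float.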
2 locale dependency:
--------------------

Another issue is that our own formatting is locale dependent, whereas python isn't:

    import numpy as np
    import locale
    locale.setlocale(locale.LC_NUMERIC, 'fr_FR')
    a = 1.2

    print "str(a)", str(a)
    print "locale.str(a)", locale.str(a)
    print "str(np.float32(a))", str(np.float32(a))
    print "locale.str(np.float32(a))", locale.str(np.float32(a))

Returns:

    str(a) 1.2
    locale.str(a) 1,2
    str(np.float32(a)) 1,2
    locale.str(np.float32(a)) 1,20000004768
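A sketch of one possible workaround for the locale dependency: format with the C runtime, then replace the locale's decimal separator with '.' so that the result stays Python-parseable. It is written for Python 3, the helper name `to_python_parseable` is hypothetical, and it assumes the only locale-dependent character in the string is the decimal separator.

```python
import locale


def to_python_parseable(text, decimal_point=None):
    """Replace the locale's decimal separator in text with '.'.

    decimal_point defaults to the separator of the current LC_NUMERIC locale.
    """
    if decimal_point is None:
        decimal_point = locale.localeconv()["decimal_point"]
    if decimal_point and decimal_point != ".":
        # A formatted float contains at most one decimal separator.
        text = text.replace(decimal_point, ".", 1)
    return text


# Simulate the fr_FR output from the example without needing that locale.
print(to_python_parseable("1,2", decimal_point=","))
```

Passing the separator explicitly keeps the helper testable on machines where the fr_FR locale is not installed.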
I thought about copying the way python does the formatting in the trunk (where discrepancies between platforms have been fixed), but this is not so easy, because it uses a lot of code from different places - and the code needs to be adapted to float and long double. The other solution would be to do our own formatting, but this does not sound easy: formatting in C is hard. I am not sure about what we should do, if anyone else has any idea ?
I think the first thing to do is make a decision on locale. If we chose to support locales I don't see much choice but to depend Python because it's too much work otherwise, and work not directly related to Numpy at that. If we decide not to support locales then we can do our own formatting if we need to using a fixed choice of locale. There is a list of snprintf implementations here. Trio looks like a mature project and has an MIT license, which I think is a license compatible with Numpy.
We should not support locales. The string representations of these elements should be Python-parseable.
I'm inclined to just fix the locale and ignore the rest until Python gets things sorted out. But I'm lazy...
What do you think Python doesn't have sorted out?
Consistency between versions and platforms. David's note with the ticket points to a Python 3.0 bug on this reported about, oh, two years ago. If we wait long enough this problem will eventually get fixed as old python versions disappear and some sort of decision is made for the 3.x series. Or we could do our own and be consistent with ourselves. There is also the problem of long doubles on the windows platform, which isn't Python specific since Python doesn't use long doubles. As I understand long doubles on windows, mingw32 supports them, VS doesn't, so there is a compiler inconsistency to deal with also. Chuck

Charles R Harris wrote:
On Sat, Dec 27, 2008 at 11:46 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Sun, Dec 28, 2008 at 01:38, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Sat, Dec 27, 2008 at 10:27 PM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote: [clip]
I think the first thing to do is make a decision on locale. If we chose to support locales I don't see much choice but to depend Python because it's too much work otherwise, and work not directly related to Numpy at that. If we decide not to support locales then we can do our own formatting if we need to using a fixed choice of locale. There is a list of snprintf implementations here. Trio looks like a mature project and has an MIT license, which I think is a license compatible with Numpy.
We should not support locales. The string representations of these elements should be Python-parseable.
I'm inclined to just fix the locale and ignore the rest until Python gets things sorted out. But I'm lazy...
What do you think Python doesn't have sorted out?
Consistency between versions and platforms. David's note with the ticket points to a Python 3.0 bug on this reported about, oh, two years ago.
As an example: in python 2.6, they solved some issues like inf/nan by interpreting the strings in python before outputting them, but we do not use their fix. So we have:
python -c "import numpy as np; print np.log(0)" -> -inf (python 2.6) / -1.#INF (2.5, which is the format from the MS runtime).
But:
python -c "import numpy as np; print np.log(0).astype(np.float32)" -> -1.#INF (both 2.6 and 2.5)
Etc... We can't be consistent with ourselves and with python at the same time, I think. I don't know which one is best: numpy being consistent through platforms and python versions, or being consistent with python.
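To illustrate the idea of handling inf/nan before the C runtime's string is used, here is a small sketch written for Python 3. It checks the value itself rather than parsing the runtime's output; the function name `format_float_portable` and its `runtime_text` parameter are hypothetical, and this is not a copy of what Python 2.6 does.

```python
import math


def format_float_portable(value, runtime_text):
    """Return a platform-independent string for value.

    runtime_text is whatever the C runtime produced (for example "-1.#INF"
    from the MS runtime). It is used unchanged for finite values and
    ignored for inf and nan.
    """
    if math.isnan(value):
        return "nan"
    if math.isinf(value):
        return "-inf" if value < 0 else "inf"
    return runtime_text


print(format_float_portable(float("-inf"), "-1.#INF"))
```

Applying the same check to every float type would make float32 and float64 print the same thing, at the cost of no longer matching what Python 2.5 prints on Windows.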
There is also the problem of long doubles on the windows platform, which isn't Python specific since Python doesn't use long doubles. As I understand long doubles on windows, mingw32 supports them, VS doesn't, so there is a compiler inconsistency to deal with also.
To be exact, both mingw and VS support long double sensu stricto: the long double type is available. But sizeof(long double) == sizeof(double) with VS toolchain, and sizeof(long double) is 12 with mingw. The latter is a pain, because mingw uses both MS runtime (printf) and its own function (some math funcs), so we can't easily be consistent (either 8 or 12 bytes long double) with mingw. One solution would be to use the mingwex printf (a printf reimplementation available on recent mingwrt) instead of MSVC runtime - I would hope that this one is fixed wrt long double. This problem is even worse on 64 bits (long doubles are 16 bytes by default there with mingw). cheers, David
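A quick way to see which case a given interpreter falls into is to ask ctypes for the sizes. This is a sketch for Python 3; note that it reports the sizes used by the compiler that built Python itself, which is an assumption to keep in mind when extensions are built with a different compiler (as with mingw against a VS-built Python).

```python
import ctypes

# Sizes as seen by the C compiler that built this Python interpreter.
double_size = ctypes.sizeof(ctypes.c_double)
long_double_size = ctypes.sizeof(ctypes.c_longdouble)


def describe_long_double():
    """Summarize whether long double is distinct from double on this build."""
    if long_double_size == double_size:
        return "long double has the same size as double (%d bytes)" % double_size
    return "long double is %d bytes, double is %d bytes" % (
        long_double_size,
        double_size,
    )


print(describe_long_double())
```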

On Sat, Dec 27, 2008 at 11:55 PM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Charles R Harris wrote:
On Sat, Dec 27, 2008 at 11:46 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Sun, Dec 28, 2008 at 01:38, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Sat, Dec 27, 2008 at 10:27 PM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote: [clip]
I think the first thing to do is make a decision on locale. If we chose to support locales I don't see much choice but to depend Python because it's too much work otherwise, and work not directly related to Numpy at that. If we decide not to support locales then we can do our own formatting if we need to using a fixed choice of locale. There is a list of snprintf implementations here. Trio looks like a mature project and has an MIT license, which I think is a license compatible with Numpy.
We should not support locales. The string representations of these elements should be Python-parseable.
I'm inclined to just fix the locale and ignore the rest until Python gets things sorted out. But I'm lazy...
What do you think Python doesn't have sorted out?
Consistency between versions and platforms. David's note with the ticket points to a Python 3.0 bug on this reported about, oh, two years ago.
As an example: in python 2.6, they solved some issues like inf/nan by interpreting the strings in python before outputting them, but we do not use their fix. So we have:
python -c "import numpy as np; print np.log(0)" -> -inf (python 2.6) / -1.#INF (2.5, which is the format from the MS runtime).
But:
python -c "import numpy as np; print np.log(0).astype(np.float32)" -> -1.#INF (both 2.6 and 2.5)
Etc... We can't be consistent with ourselves and with python at the same time, I think. I don't know which one is best: numpy being consistent through platforms and python versions, or being consistent with python.
There is also the problem of long doubles on the windows platform, which isn't Python specific since Python doesn't use long doubles. As I understand long doubles on windows, mingw32 supports them, VS doesn't, so there is a compiler inconsistency to deal with also.
To be exact, both mingw and VS support long double sensu stricto: the long double type is available. But sizeof(long double) == sizeof(double) with VS toolchain, and sizeof(long double) is 12 with mingw. The latter is a pain, because mingw uses both MS runtime (printf) and its own function (some math funcs), so we can't easily be consistent (either 8 or 12 bytes long double) with mingw. One solution would be to use the mingwex printf (a printf reimplementation available on recent mingwrt) instead of MSVC runtime - I would hope that this one is fixed wrt long double. This problem is even worse on 64 bits (long doubles are 16 bytes by default there with mingw).
I think there are also less visible problems with string to number conversions, so that might be a reason to consider third party software. Python doesn't directly support conversion of complex numbers presented as strings, for instance, although that may have been fixed in 3.0. So extending some third party sscanf might be useful. The question comes of how much time you want to spend on this. I know working on a dissertation is a great excuse to do something else; I spent some weeks writing my own latex dissertation class, for instance. But I don't know if that is recommended practice. Chuck
participants (6)
- Charles R Harris
- David Cournapeau
- David Cournapeau
- Lenard Lindstrom
- Pauli Virtanen
- Robert Kern