In the course of trying to parse ascii times, I ran into a puzzling bug. Sometimes it works as expected:

In [31]: npy.fromstring('23:19:01', dtype=int, sep=':')
Out[31]: array([23, 19, 1])

But sometimes it doesn't:

In [32]: npy.fromstring('23:09:01', dtype=int, sep=':')
Out[32]: array([23, 0])

In [33]: npy.__version__
Out[33]: '1.0.5.dev4742'

In [34]: npy.fromstring('23:09:01', dtype=int, sep=':', count=3)
Out[34]: array([23, 0, 16])

In [35]: npy.fromstring('23 09 01', dtype=int, sep=' ', count=3)
Out[35]: array([23, 0, 9])

In [36]: npy.fromstring('23 09 01', dtype=int, sep=' ')
Out[36]: array([23, 0, 9, 1])

Maybe it is a problem specific to int conversion; examples that fail with int work with float:

In [37]: npy.fromstring('23 09 01', dtype=float, sep=' ')
Out[37]: array([ 23., 9., 1.])

In [38]: npy.fromstring('23:09:01', dtype=float, sep=':')
Out[38]: array([ 23., 9., 1.])

Eric
On Jan 26, 2008 11:30 PM, Eric Firing <efiring@hawaii.edu> wrote:
[clip]
Works here.

In [6]: for i in range(100): fromstring('23:19:01', dtype=int, sep=':')

In [7]: numpy.__version__
Out[7]: '1.0.5.dev4730'

produces no failures. The fact that it fails for you sometimes and not others is very odd. It's like some sort of bizarre race condition or bit flip. What architecture/OS/compiler are you using? Have you tested memory?

Chuck
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Doesn't work here:

In [1]: import numpy as npy

In [2]: npy.fromstring('23:09:01', dtype=int, sep=':')
Out[2]: array([23, 0])

In [3]: npy.__version__
Out[3]: '1.0.5.dev4722'

In [4]: npy.fromstring('23:09:01', dtype=int, sep=':', count=3)
Out[4]: array([ 23, 0, 151904160])

...

Pentium Dual Core, Ubuntu Linux 7.10. I have both gcc-3.4 and gcc-4.1 on my system; I don't know which one setup.py uses. The default is 4.1.3.

Gabriel

On Sun, 2008-01-27 at 01:16 -0700, Charles R Harris wrote:
[clip]
--
Dr. Gabriël J.L. Beckers
Max Planck Institute for Ornithology
Group Neurobiology of Behaviour
Postfach 1564, 82305 Starnberg, Germany
Web: http://www.gbeckers.nl
Phone: +49-8157932273, Fax: +49-8157932285
On Sun, 2008-01-27 at 01:16 -0700, Charles R Harris wrote:
[clip]
Works here.
I think it's that some numbers work, and some don't. Consider:
>>> npy.fromstring('23:06:01', dtype=int, sep=':')
array([23, 6, 1])
>>> npy.fromstring('23:07:01', dtype=int, sep=':')
array([23, 7, 1])
>>> npy.fromstring('23:08:01', dtype=int, sep=':')
array([23, 0])
>>> npy.fromstring('23:09:01', dtype=int, sep=':')
array([23, 0])

and

>>> npy.fromstring('23:010:01', dtype=int, sep=':')
array([23, 8, 1])
>>> npy.fromstring('23:011:01', dtype=int, sep=':')
array([23, 9, 1])

and

>>> npy.fromstring('23:0xff:01', dtype=int, sep=':')
array([ 23, 255, 1])
Smells like some scanf function is interpreting numbers beginning with zero as octal, and also recognizing hexadecimal. This is a bit surprising, and whether this is the desired behavior is questionable.

--
Pauli Virtanen
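The pattern in these outputs matches C-style base detection by prefix. Here is a minimal sketch in plain Python, assuming that convention (a leading `0x` selects hexadecimal, a leading `0` selects octal). The name `c_style_int` is a hypothetical helper for illustration and is not part of NumPy:

```python
def c_style_int(token):
    """Convert a token using C-style prefix rules (illustration only).

    Hypothetical helper, not a NumPy function: a leading "0x" means
    hexadecimal, any other leading "0" means octal, otherwise decimal.
    """
    text = token.strip().lower()
    if text.startswith("0x"):
        return int(text, 16)   # hexadecimal prefix
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)    # a leading zero selects octal
    return int(text, 10)       # plain decimal


for token in ("06", "07", "010", "011", "0xff"):
    print(token, "->", c_style_int(token))

# "08" and "09" contain digits that do not exist in octal,
# so Python's int() raises ValueError for them.
try:
    c_style_int("09")
except ValueError as exc:
    print("09 ->", exc)
```

Python rejects `09` outright, whereas the sessions above show 0 followed by an early stop. That would be consistent with a C routine that consumes the leading `0` and halts at the first digit that is invalid in octal.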
Pauli Virtanen wrote:
[clip]
Smells like some scanf function is interpreting numbers beginning with zero as octal, and recognizing also hexadecimals.
That is it exactly. The code in core/src/arraytypes.inc.src is using scanf, and scanf tries hard to recognize integers specified in different ways. So, what caught me is a feature, not a bug, and I should have recognized it as such right away. The bug was in my expectations, not in the code.
This is a bit surprising, and whether this is the desired behavior is questionable.
From a user's standpoint it would be nice to be able to have numbers with leading zeros interpreted as base 10 instead of octal, since this turns up any time one converts date and time-of-day strings, and can occur in many other contexts also. (Outside of computer science octal is rare, as far as I know.) It looks like supporting this would require quite a bit of change in the code, however. I suspect it would have to go in as a kwarg that would be propagated through several layers of C function calls. Otherwise, if octal conversion support were simply dropped, I suspect someone else's code would break, and equally reasonable expectations would be violated. Eric
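As long as the string parser keeps the octal interpretation, one way to sidestep it is to split the string in Python and convert each field with `int`, which reads base 10 when called with a single argument. A minimal sketch; `parse_decimal_fields` is a hypothetical name, and the resulting list can be passed to `numpy.array` if an array is needed:

```python
def parse_decimal_fields(text, sep=":"):
    """Split text on sep and convert every field as a base-10 integer.

    Leading zeros are harmless here: int("09") is 9.
    """
    return [int(field) for field in text.split(sep)]


print(parse_decimal_fields("23:09:01"))        # [23, 9, 1]
print(parse_decimal_fields("23 09 01", " "))   # [23, 9, 1]
```

This trades speed for predictability, which is usually acceptable for short strings such as dates and times of day.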
On Jan 27, 2008 12:40 PM, Eric Firing <efiring@hawaii.edu> wrote:
Pauli Virtanen wrote:
<snip>
That is it exactly. The code in core/src/arraytypes.inc.src is using scanf, and scanf tries hard to recognize integers specified in different ways. So, what caught me is a feature, not a bug, and I should have recognized it as such right away. The bug was in my expectations, not in the code.
I don't think the problem is scanf, at least not here. The following code snippet works fine for me.

#include <stdio.h>

int main(int argc, char** argv)
{
    int a, b, c;

    /* %d converts decimal digits only, so "09" is read as 9 */
    sscanf(argv[1], "%d :%d :%d", &a, &b, &c);
    printf("%d %d %d\n", a, b, c);
    return 0;
}

$[charris@f8 scratch]$ ./a.out "23:09:01"
23 9 1

Maybe it acts differently for files?

$[charris@f8 scratch]$ echo "23:09:01" > tmp
$[charris@f8 scratch]$ ./a.out
23 9 1

Nope, that works fine also. Numpy is making a type decision somewhere else.

Chuck
On Sun, 2008-01-27 at 13:48 -0700, Charles R Harris wrote: [clip]
I don't think the problem is scanf, at least not here. The following code snippet works fine for me.
Reading the code in arraytypes.inc.src and multiarraymodule.c, it appears that numpy is using strtol(str, &tailptr, 0) for the string to integer conversion. Calling strtol with BASE == 0 enables the automatic base detection from the prefix. However, as you say, scanf does not do this. Numpy appears to use fscanf when reading data from files, so there is a discrepancy here:
>>> from numpy import fromfile, fromstring
>>> f = open('test.dat', 'w')
>>> f.write("20:09:21")
>>> f.close()

>>> fromfile('test.dat', dtype=int, sep=':')
array([20, 9, 21])
>>> fromstring('20:09:21', dtype=int, sep=':')
array([20, 0])

Also, the following result is quite strange, seems like a silent failure:

>>> fromfile('test.dat', dtype=int)
array([809119794, 825375289])

I guess some more testcases should be written...

--
Pauli Virtanen
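The discrepancy described above can be reproduced in plain Python. This is a rough sketch, not NumPy's implementation: `strtol_base0`, `decimal_only`, and `parse_ints` are hypothetical names. It assumes the string path behaves like `strtol` with base 0, the file path behaves like a decimal-only `%d` conversion, and parsing stops when the expected separator is missing:

```python
import string


def strtol_base0(text, pos=0):
    """Rough emulation of C strtol(text, &tail, 0) for non-negative input.

    Returns (value, tail_position). Illustrative only: no sign, whitespace,
    or overflow handling, and "0x" without hex digits is not modelled.
    """
    if text[pos:pos + 2].lower() == "0x":
        base, start = 16, pos + 2
    elif text[pos:pos + 1] == "0":
        base, start = 8, pos            # a leading zero selects octal
    else:
        base, start = 10, pos
    digits = (string.digits + "abcdef")[:base]
    end = start
    while end < len(text) and text[end].lower() in digits:
        end += 1
    if end == start:
        return 0, pos                   # nothing could be converted
    return int(text[start:end], base), end


def decimal_only(text, pos=0):
    """Emulate a %d-style conversion: always base 10."""
    end = pos
    while end < len(text) and text[end] in string.digits:
        end += 1
    if end == pos:
        return 0, pos
    return int(text[pos:end]), end


def parse_ints(text, sep, convert):
    """Read sep-separated integers, stopping at the first surprise."""
    values, pos = [], 0
    while pos < len(text):
        value, end = convert(text, pos)
        if end == pos:                  # no digits where a number was expected
            break
        values.append(value)
        if text[end:end + len(sep)] != sep:
            break                       # separator missing: give up
        pos = end + len(sep)
    return values


print(parse_ints("20:09:21", ":", strtol_base0))   # [20, 0]
print(parse_ints("20:09:21", ":", decimal_only))   # [20, 9, 21]
```

With base 0, the token `09` is taken as octal, the `0` is consumed, and the conversion stops at `9`. The parser then finds `9` where it expected `:` and ends early, which matches `array([20, 0])`. The whitespace-separator case behaves differently in the sessions above and is not modelled here.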
On Jan 27, 2008 3:19 PM, Pauli Virtanen <pav@iki.fi> wrote:
[clip]
I vote for fromstring working like fromfile.
Also, the following result is quite strange, seems like a silent failure:
>>> fromfile('test.dat', dtype=int)
array([809119794, 825375289])
The default is to treat the file as containing binary data. Chuck
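The two large numbers are simply the eight ASCII bytes of the file reinterpreted as integers. A small sketch with the standard `struct` module, assuming the integer dtype on that machine was 32-bit and little-endian:

```python
import struct

raw = b"20:09:21"                      # the eight bytes written to test.dat
# Assumption: two little-endian 32-bit signed integers.
as_ints = struct.unpack("<2i", raw)
print(as_ints)                         # (809119794, 825375289)
```

The first value comes from the bytes `20:0` and the second from `9:21`, so the result is a binary read of text rather than a failed parse.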
Charles R Harris wrote:
[clip]
I vote for fromstring working like fromfile.
I agree. Can we get this change into 1.0.5? I could make a patch if that would help. Although I was wrong in calling this a "major bug", I think it is an inconsistency that should be removed. The fromfile and fromstring docstrings could also state explicitly what the behavior is.

Eric
On Jan 28, 2008 4:09 PM, Eric Firing <efiring@hawaii.edu> wrote:
[clip]
Your best bet is to file a ticket. If you don't, I will, but I'll wait a bit.

Chuck
Charles R Harris wrote:
[clip]
It is ticket #650. Eric
participants (4)
- Charles R Harris
- Eric Firing
- Gabriel J.L. Beckers
- Pauli Virtanen