numpy.array() of mixed integers and strings can truncate data
Is this expected behavior?
>>> np.array([-345,4,2,'ABC'])
array(['-34', '4', '2', 'ABC'], dtype='|S3')
>>> np.version.full_version
'1.6.1'
>>> np.version.git_revision
'68538b74483009c2c2d1644ef00397014f95a696'
Ray Jones
On 01/12/2011 14:52, Thouis (Ray) Jones wrote:
Is this expected behavior?
>>> np.array([-345,4,2,'ABC'])
array(['-34', '4', '2', 'ABC'], dtype='|S3')
With my numpy 1.5.1, I indeed got a different result:
In [1]: np.array([-345,4,2,'ABC'])
Out[1]: array(['-345', '4', '2', 'ABC'], dtype='|S8')
The type casting is a bit different, and may actually better match what you expect, but a cast is still required (i.e. you cannot have a "numpy.array() of mixed integers and strings", because numpy arrays only store *homogeneous* sets of data).
Now one question remains for me: why use a numpy array to store a few strings, and not just a regular Python list?
Best,
Pierre
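Pierre's point about homogeneous storage can be illustrated with a minimal sketch. It assumes the caller is free to choose how the mixed list is stored, and it shows two ways to avoid depending on NumPy's implicit conversion; the variable names are illustrative only.

```python
import numpy as np

data = [-345, 4, 2, 'ABC']

# Option 1: keep the original Python objects. Nothing is converted,
# so nothing can be truncated.
as_objects = np.array(data, dtype=object)

# Option 2: convert every element to text first, so NumPy sizes the
# string dtype from the final text of each element.
as_strings = np.array([str(x) for x in data])

print(as_objects.tolist())
print(as_strings.tolist())
```

The object array keeps the integers as integers, while the second form gives a true string array whose width covers '-345'.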
On Thu, Dec 1, 2011 at 15:47, Pierre Haessig <pierre.haessig@crans.org> wrote:
On 01/12/2011 14:52, Thouis (Ray) Jones wrote:
Is this expected behavior?
>>> np.array([-345,4,2,'ABC'])
array(['-34', '4', '2', 'ABC'], dtype='|S3')
With my numpy 1.5.1, I indeed got a different result:
In [1]: np.array([-345,4,2,'ABC'])
Out[1]: array(['-345', '4', '2', 'ABC'], dtype='|S8')
This is closer to what I would expect.
The type casting is a bit different, and may actually better match what you expect, but a cast is still required (i.e. you cannot have a "numpy.array() of mixed integers and strings", because numpy arrays only store *homogeneous* sets of data).
Of course, but when converting from a non-homogeneous Python list, I would expect it to do something reasonable (or at least not as bad as turning -345 into '-34').
Now one question remains for me: why use a numpy array to store a few strings, and not just a regular Python list?
It was a small test case. The actual data is much larger.
Ray Jones
This is total speculation on my part. My suspicion is that the loading process sees numbers and starts casting in that manner; then it sees the string and realizes that it has to cast everything to a fixed-width string. The width is determined as the width of the longest string. Since -345 was already processed as a number, the length of its string representation is never considered.
Does the same problem occur if -345 comes after "ABC"?
Ben Root
On Thu, Dec 1, 2011 at 16:29, Benjamin Root <ben.root@ou.edu> wrote:
Does the same problem occur if -345 comes after "ABC"?
Yes.
>>> np.array(list(reversed([-345,4,2,'ABC'])))
array(['ABC', '2', '4', '-34'], dtype='|S3')
On Thu, Dec 1, 2011 at 6:52 AM, Thouis (Ray) Jones <thouis@gmail.com> wrote:
Is this expected behavior?
>>> np.array([-345,4,2,'ABC'])
array(['-34', '4', '2', 'ABC'], dtype='|S3')
Given that strings should be the result, this looks like a bug. It's a bit of a corner case that probably slipped through during the recent work on casting. There need to be tests for these sorts of things, so if you find more oddities, post them so we can add them.
Chuck
As it does not depend on whether the string appears before or after the numbers, numerical values appear to always be processed first, before any string transformation, even if you explicitly specify the string format. Consider the following (1.6.1):
>>> np.array((2, 12,0.1+2j))
array([ 2.0+0.j, 12.0+0.j, 0.1+2.j])
>>> np.array((2, 12,0.001+2j))
array([ 2.00000000e+00+0.j, 1.20000000e+01+0.j, 1.00000000e-03+2.j])
>>> np.array((2, 12,0.001+2j), dtype='|S8')
array(['2', '12', '(0.001+2'], dtype='|S8')
Notice that the last value is only truncated because it had first been converted into a "standard" complex representation, so maybe the problem is already in the way Python treats the input.
Cheers,
Derek
On 12/1/2011 9:15 AM, Derek Homeier wrote:
>>> np.array((2, 12,0.001+2j), dtype='|S8')
array(['2', '12', '(0.001+2'], dtype='|S8')
- notice the last value is only truncated because it had first been converted into a "standard" complex representation, so maybe the problem is already in the way Python treats the input.
No, it's truncated because you've specified an 8-character string, and the string representation of the complex value is longer than that. I assume that numpy is using the object's __str__ or __repr__:
In [13]: str(0.001+2j)
Out[13]: '(0.001+2j)'
In [14]: repr(0.001+2j)
Out[14]: '(0.001+2j)'
I think the only bug we've identified here is that numpy is selecting the string size based on the longest string input, rather than also checking how long the string representation of the numeric input is. If there is a long-enough string in there, it works fine:
In [15]: np.array([-345,4,2,'ABC', 'abcde'])
Out[15]: array(['-345', '4', '2', 'ABC', 'abcde'], dtype='|S5')
An open question is what it should do if you specify the length of the string dtype, but one of the values can't fit into that size. At this point, it truncates, but should it raise an error?
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception
Chris.Barker@noaa.gov
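The sizing issue Chris describes can be sketched as follows. This is only an illustration: required_width is a hypothetical helper, not part of NumPy, and the values are converted with str() up front so that the result does not depend on how a particular NumPy version turns numbers into strings.

```python
import numpy as np

def required_width(values):
    """Hypothetical helper: length of the longest str() representation."""
    return max(len(str(v)) for v in values)

values = [2, 12, 0.001 + 2j]
texts = [str(v) for v in values]

# Width that would hold every element, including '(0.001+2j)'.
width = required_width(values)

# An explicit 8-byte string dtype silently cuts the longest element short.
truncated = np.array(texts, dtype='S8')

# Sizing the dtype from the string representations keeps every element whole.
full = np.array(texts, dtype='S%d' % width)

print(width, truncated[2], full[2])
```

Computing the width from the text of every element, not only from the inputs that are already strings, is the check the thread suggests is missing.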
On 1 Dec 2011, at 21:35, Chris Barker wrote:
No, it's truncated because you've specified an 8-character string, and the string representation of the complex value is longer than that. I assume that numpy is using the object's __str__ or __repr__:
In [13]: str(0.001+2j)
Out[13]: '(0.001+2j)'
In [14]: repr(0.001+2j)
Out[14]: '(0.001+2j)'
That's what I meant with the "Python-side" of the issue, but you're right, there is no numerical conversion involved.
I think the only bug we've identified here is that numpy is selecting the string size based on the longest string input, rather than also checking how long the string representation of the numeric input is. If there is a long-enough string in there, it works fine:
In [15]: np.array([-345,4,2,'ABC', 'abcde'])
Out[15]: array(['-345', '4', '2', 'ABC', 'abcde'], dtype='|S5')
An open question is what it should do if you specify the length of the string dtype, but one of the values can't be fit into that size. At this point, it truncates, but should it raise an error?
I would probably raise a warning rather than an error. I think if the user explicitly specifies a string length, they should be aware that the data might be truncated (and might even want this behaviour).
Another "issue" could be that the string representation can look quite different from what was typed in, like
In [95]: np.array(('abcdefg', 12, 0.00001+2j), dtype='|S12')
Out[95]: array(['abcdefg', '12', '(1e-05+2j)'], dtype='|S12')
but then I think one has to accept that _0.00001+2j_ is not a string and thus cannot be guaranteed to be represented in that exact way: it can be understood either as a numerical object or not at all (i.e. one should just type it in as a string, with quotes, if one wants string behaviour).
Cheers,
Derek
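Derek's suggestion of warning instead of raising could look roughly like the sketch below. It is a user-level wrapper, not NumPy behaviour: to_fixed_width is a hypothetical name, and the sketch assumes ASCII-only values so that they fit a byte-string dtype.

```python
import warnings
import numpy as np

def to_fixed_width(values, width):
    """Hypothetical helper: build an 'S<width>' array, warning on truncation."""
    texts = [str(v) for v in values]
    too_long = [t for t in texts if len(t) > width]
    if too_long:
        # Warn but still truncate, since the caller asked for this width.
        warnings.warn(
            "%d value(s) truncated to %d characters" % (len(too_long), width),
            UserWarning,
        )
    return np.array(texts, dtype='S%d' % width)

values = ('abcdefg', 12, 0.00001 + 2j)
wide = to_fixed_width(values, 12)    # everything fits, no warning
narrow = to_fixed_width(values, 8)   # '(1e-05+2j)' is cut, warning issued
print(wide.tolist(), narrow.tolist())
```

A caller who wants truncation can filter the warning; one who does not can turn it into an error with the standard warnings filters.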
On Thu, Dec 1, 2011 at 17:39, Charles R Harris <charlesr.harris@gmail.com> wrote:
Given that strings should be the result, this looks like a bug. It's a bit of a corner case that probably slipped through during the recent work on casting. There need to be tests for these sorts of things, so if you find more oddities, post them so we can add them.
I'm happy to add a patch and tests, but could use some guidance...
It looks like discover_itemsize() in core/src/multiarray/ctors.c should compute the length of the string or unicode representation of the object based on the eventual type, but looking at UNICODE_setitem() and STRING_setitem() in core/src/multiarray/arraytypes.c.src, this is not trivial.
Perhaps the object-to-unicode/string parts of UNICODE_setitem/STRING_setitem can be extracted into separate functions that can be called from *_setitem as well as discover_itemsize. discover_itemsize would also need to know the type it's discovering for (string or unicode or user-defined). Not sure what to do to handle user-defined types (error?).
If that is too complicated, maybe discover_itemsize should return -1 (or warn, but given the danger of truncation, that seems a bit weak) if asked to discover from data that doesn't have a length. This would result in dtype=object when np.array is handed a mixed int/string list.
I wonder, also, if STRING_setitem and UNICODE_setitem shouldn't emit a warning if asked to truncate data?
Ray Jones
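The fallback Ray proposes can be sketched in Python. This is an illustration of the idea only: the real discover_itemsize() is C code in ctors.c with a different signature, and array_without_truncation is a hypothetical wrapper introduced here, not a NumPy function.

```python
import numpy as np

def discover_itemsize(values):
    """Python sketch of the proposal, not NumPy's C implementation.

    Returns the longest length among string items, or -1 as soon as an
    item without a string length (e.g. an int) is seen.
    """
    size = 0
    for v in values:
        if isinstance(v, (str, bytes)):
            size = max(size, len(v))
        else:
            return -1
    return size

def array_without_truncation(values):
    """Hypothetical wrapper: fall back to dtype=object on mixed input."""
    if discover_itemsize(values) < 0:
        return np.array(values, dtype=object)
    return np.array(values)

mixed = array_without_truncation([-345, 4, 2, 'ABC'])
strings = array_without_truncation(['ABC', 'abcde'])
print(mixed.dtype, strings.dtype)
```

With this rule a mixed int/string list ends up as an object array and loses nothing, while a list of strings still gets a fixed-width string dtype.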
On Fri, Dec 2, 2011 at 8:23 AM, Thouis (Ray) Jones <thouis@gmail.com> wrote:
I'm happy to add a patch and tests, but could use some guidance...
It looks like discover_itemsize() in core/src/multiarray/ctors.c should compute the length of the string or unicode representation of the object based on the eventual type, but looking at UNICODE_setitem() and STRING_setitem() in core/src/multiarray/arraytypes.c.src, this is not trivial.
Perhaps the object-to-unicode/string parts of UNICODE_setitem/STRING_setitem can be extracted into separate functions that can be called from *_setitem as well as discover_itemsize. discover_itemsize would also need to know the type it's discovering for (string or unicode or user-defined).
After sleeping on this, I think an object array in this situation would be the better choice and wouldn't result in lost information. This might change the behavior of some functions though, so it would need testing.
Not sure what to do to handle user-defined types (error?).
If that is too complicated, maybe discover_itemsize should return -1 (or warn, but given the danger of truncation, that seems a bit weak) if asked to discover from data that doesn't have a length. This would result in dtype=object when np.array is handed a mixed int/string list.
I wonder, also, if STRING_setitem and UNICODE_setitem shouldn't emit a warning if asked to truncate data?
I think a warning would be useful. But I don't use strings much so input from a user might carry more weight. Chuck
On Fri, Dec 2, 2011 at 18:53, Charles R Harris <charlesr.harris@gmail.com> wrote:
After sleeping on this, I think an object array in this situation would be the better choice and wouldn't result in lost information. This might change the behavior of some functions though, so would need testing.
I tried to come up with a simple patch to achieve this, but I think this is beyond me, particularly since I think something different has to happen for these two cases:
np.array([1234, 'ab'])
np.array([1234]).astype('|S2')
I tried a few things (changing the rules in PyArray_PromoteTypes(), among other places), but I think I'm more likely to break some corner case than to fix this cleanly.
I filed a ticket (#1990) and a pull request to add a test to the 1.6.x maintenance branch, for someone more knowledgeable than me to address. I tried to write the test so that it passes with either dtype=object or dtype=<string of the required length>.
Ray Jones
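The kind of test Ray describes, one that accepts either an object array or a wide-enough string array, might be sketched like this. It is not the test from the pull request: preserves_values is a hypothetical name, and what np.array() returns for a mixed list depends on the NumPy version.

```python
import numpy as np

def preserves_values(values):
    """Hypothetical check: True if np.array(values) lost no information.

    Passes for dtype=object and for a string dtype wide enough to hold
    the text of every element.
    """
    arr = np.array(values)
    if arr.dtype == object:
        return True
    return [str(v) for v in values] == [str(a) for a in arr.tolist()]

# Implicit conversion of a mixed list: truncation here would be a bug.
print(preserves_values([-345, 4, 2, 'ABC']))

# Explicit cast to a 2-byte string: the caller asked for this width,
# so truncation is the requested result rather than a bug.
explicit = np.array([1234]).astype('S2')
print(explicit)
```

The two calls show why a single rule does not fit both cases: the first should never drop characters, while the second is expected to.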
participants (7)
- Benjamin Root
- Charles R Harris
- Chris Barker
- Derek Homeier
- Pierre Haessig
- Thouis (Ray) Jones
- Thouis Jones