numpy 00 character bug?
nrook at wesleyan.edu
Fri Jun 5 18:14:10 CEST 2009
I've recently encountered a bug in NumPy's string arrays, where the 00
ASCII character ('\x00') is not stored properly when put at the end of a
string:
Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> print numpy.version.version
>>> arr = numpy.empty(1, 'S2')
>>> arr[0] = 'ab'
>>> arr[0] = 'c\x00'
It seems that the string array is using the 00 character to pad strings
smaller than the maximum size, and thus is treating any 00 characters at
the end of a string as padding. Obviously, as long as I don't use
smaller strings, there is no information lost here, but I don't want to
have to re-add my 00s each time I ask the array what it is holding.
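
A minimal sketch of the behaviour described above, written for Python 3
and a recent NumPy (an assumption; the session above is Python 2, where
plain str literals play the role of the bytes literals used here). It
suggests that the trailing 00 byte is still present in the array's buffer
and is only dropped when an element is read back:

```python
import numpy as np

# One-element array of fixed-width 2-byte strings, as in the session above.
arr = np.empty(1, dtype='S2')
arr[0] = b'c\x00'

item = arr[0]        # reading an element drops trailing NUL bytes
raw = arr.tobytes()  # the underlying 2-byte buffer is left untouched

print(repr(item))
print(repr(raw))
```

Comparing the two printed values shows where the 00 byte goes missing:
in the element lookup, not in the storage.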
Is this a well-known bug already? I couldn't find it on the NumPy bug
tracker, but I could easily have missed it, or it could have been triaged
and deemed acceptable because there's no better way to deal with
arbitrary-length strings. Is there an easy way to avoid this problem?
Pretty much any performance-intensive part of my program is going to be
dealing with these arrays, so I don't want to just replace them with a
slower dictionary instead.
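
One possible workaround, sketched under the assumption that every element
is meant to be read back at the full fixed width: view the same buffer as
unsigned bytes and slice out one row per element. The helper name
raw_item is hypothetical, not part of NumPy, and the sketch assumes a
1-D contiguous array on Python 3 with a recent NumPy.

```python
import numpy as np

def raw_item(arr, i):
    """Hypothetical helper: return element i at its full fixed width.

    Assumes arr is a 1-D, contiguous array with an 'S<n>' dtype.
    """
    width = arr.dtype.itemsize
    # Reinterpret the same memory as bytes, one row per element (no copy).
    rows = arr.view(np.uint8).reshape(-1, width)
    return rows[i].tobytes()

arr = np.empty(2, dtype='S2')
arr[0] = b'ab'
arr[1] = b'c\x00'

first = raw_item(arr, 0)
second = raw_item(arr, 1)
print(repr(first), repr(second))
```

Note that this always returns the full width, so a stored b'c' and a
stored b'c\x00' remain indistinguishable; that is acceptable only when
all items are known to be exactly the array's item size.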
I can't imagine this issue hasn't come up before; I encountered it by
using NumPy arrays to store Python structs, something I can imagine is
done fairly often. As such, I apologize for bringing it up again!
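
For the struct use case specifically, a hedged sketch of how the problem
can arise and one way around it, assuming each record is a single
little-endian unsigned 16-bit value (the format '<H' and the sample
values are illustrative, not taken from the original program). The value
99 packs to 'c\x00', which is exactly the string from the session above.

```python
import struct
import numpy as np

# Pack two sample values as 2-byte little-endian unsigned shorts.
packed = [struct.pack('<H', v) for v in (25185, 99)]  # b'ab', b'c\x00'
arr = np.array(packed, dtype='S2')

# Element access drops the trailing NUL, so the second item is now too
# short to hand back to struct.unpack('<H', ...).
short_item = arr[1]

# Reinterpreting the same buffer with a numeric dtype sidesteps the issue.
values = arr.view('<u2')
print(values.tolist())
```

If the records have several fields, a NumPy structured dtype matching
the struct format may be a better fit than a string dtype.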