[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Oscar Benjamin oscar.j.benjamin at gmail.com
Wed Jan 22 16:13:32 EST 2014


On Wed, Jan 22, 2014 at 12:07:28PM -0800, Chris Barker wrote:
> On Wed, Jan 22, 2014 at 2:46 AM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com>wrote:
> 
> > BTW, as much as the fixed-width 'S' dtype doesn't really work for str in
> >  Python 3 it's also a poor fit for bytes since it strips trailing nulls:
> >
> > >>> a = np.array(['a\0s\0', 'qwert'], dtype='S')
> > >>> a
> > array([b'a\x00s', b'qwert'],
> >       dtype='|S5')
> > >>> a[0]
> > b'a\x00s'
> 
> 
> WHOOA!  Good catch, Oscar.
> 
> This conversation started with me suggesting that 'S' on py3 should mean
> "ascii string" (or latin-1 string).
> 
> Then it was pointed out that it was already being used for arbitrary bytes,
> and thus could not be changed to mean a string without breaking already
> working code.
> 
> However,  if 'S' is assigning meaning to null bytes, and doing something
> with that, then it is, indeed being treated as an ANSI string (or the old c
> string "type", anyway). And any code that is expecting it to be arbitrary
> bytes is already broken, and in a way that could result in pretty subtle,
> hard to find bugs in the future.
> 
> I think we really need a proper bytes dtype (which could be 'S' with the
> null byte thing removed), and a proper one-byte-per-character string type.

It's not safe to stop removing the null bytes. This is how numpy determines
the length of the strings in a dtype='S' array. The strings are not
"fixed-width" but rather have a maximum width. Anything shorter gets padded
with nulls. This is transparent if you index strings from the array:

>>> a = np.array(b'a string of different length words'.split(), dtype='S')
>>> a
array([b'a', b'string', b'of', b'different', b'length', b'words'], 
      dtype='|S9')
>>> a[0]
b'a'
>>> len(a[0])
1
>>> a.tostring()
b'a\x00\x00\x00\x00\x00\x00\x00\x00string\x00\x00\x00of\x00\x00\x00\x00\x00\x00\x00differentlength\x00\x00\x00words\x00\x00\x00\x00'

If the trailing nulls are not removed then you would get:

>>> a[0]
b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> len(a[0])
9

And I'm sure that someone would get upset about that.
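
To make the padding explicit, here is a minimal sketch (assuming a recent
NumPy where tobytes() is available; the variable names are only for
illustration). It cuts the raw buffer into itemsize-wide slots, compares them
with what indexing returns, and shows that a trailing null in the data itself
cannot be told apart from padding:

```python
import numpy as np

a = np.array(b'a string of different length words'.split(), dtype='S')

# Every element occupies the same number of bytes: the length of the
# longest word ('different', 9 bytes).
itemsize = a.dtype.itemsize

# Cut the raw buffer into one slot per element to see the null padding.
raw = a.tobytes()
slots = [raw[i:i + itemsize] for i in range(0, len(raw), itemsize)]

# Indexing returns each element with its trailing nulls stripped.
stripped = [bytes(x) for x in a]

# A trailing null that was part of the data looks exactly like padding,
# so it is lost on the way out; the embedded null survives.
b = np.array([b'a\x00s\x00', b'qwert'], dtype='S')
lost = bytes(b[0])

print(itemsize)
print(slots[0], stripped[0])
print(lost)
```

The stored length of an element is never recorded anywhere; it is inferred
from where the trailing nulls begin, which is why the stripping cannot simply
be switched off.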

> Though I still don't know the use case for the fixed-length bytes type that
> can't be satisfied with the other numeric types,

Having the null bytes removed and a str (on Py2) object returned is precisely
the use case that distinguishes it from np.uint8. The other difference is the
removal of arithmetic operations.
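
As a rough illustration of that distinction (a sketch only; the reshape
assumes a contiguous 1-D array and the names are illustrative), the same
buffer can be viewed as np.uint8. The padding then becomes visible and
arithmetic is defined, whereas indexing the 'S' array hands back a stripped
bytes object:

```python
import numpy as np

a = np.array(b'a string of different length words'.split(), dtype='S')

# Reinterpret the same memory as one uint8 per byte, one row per element.
u = a.view(np.uint8).reshape(len(a), a.dtype.itemsize)

# As uint8 the null padding is ordinary data...
row0 = u[0].tolist()

# ...and arithmetic works element-wise on the byte values.
shifted = (u[0] + 1).tolist()

# As 'S' the element comes back as a bytes object with padding removed.
elem = bytes(a[0])

print(u.shape)
print(row0)
print(shifted)
print(elem)
```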

Some more oddities:

>>> a[0] = 1
>>> a
array([b'1', b'string', b'of', b'different', b'length', b'words'], 
      dtype='|S9')
>>> a[0] = None
>>> a
array([b'None', b'string', b'of', b'different', b'length', b'words'], 
      dtype='|S9')
>>> a[0] = range(1, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot set an array element with a sequence
>>> a[0] = (x for x in range(2))
>>> a
array([b'<generato', b'string', b'of', b'different', b'length', b'words'], 
      dtype='|S9')


Oscar


