[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Thu Jan 23 05:45:22 EST 2014

On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:
> 
> >
> > It's not safe to stop removing the null bytes. This is how numpy determines
> > the length of the strings in a dtype='S' array. The strings are not
> > "fixed-width" but rather have a maximum width.
> 
> Exactly--but folks have told us on this list that they want (and are)
> using the 'S' style for arbitrary bytes, NOT for text. In which case
> you wouldn't want to remove null bytes. This is more evidence that 'S'
> was designed to handle c-style one-byte-per-char strings, and NOT
> arbitrary bytes, and thus not to map directly to the py2 string type
> (you can store null bytes in a py2 string).

You can store null bytes in a Py2 string but you normally wouldn't if it was
supposed to be text.
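
To make the null-byte point concrete, here is a small sketch of how an 'S'
array treats trailing nulls. It is only an illustration, and the variable
names are mine; it assumes a reasonably recent numpy.

```python
import numpy as np

# An 'S9' element reserves 9 bytes; shorter values are padded with NUL bytes.
a = np.array([b'a', b'string'], dtype='S9')

raw = a.tobytes()        # the underlying buffer, padding included
first_raw = raw[:9]      # b'a' followed by eight NUL bytes
first_item = a[0]        # item access strips the trailing NULs

# A value that genuinely ends in a NUL byte cannot be told apart from padding.
b = np.array([b'ab\x00'], dtype='S9')
lossy_item = b[0]        # the trailing NUL has been dropped

print(first_raw, first_item, lossy_item)
```

The stripped length is how the stored string's length is recovered, which is
why arbitrary binary data that may end in a null byte does not round-trip
through this dtype.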

> 
> Which brings me back to my original proposal: properly map the 'S'
> type to the py3 data model, and maybe add some kind of fixed width
> bytes style if there is a use case for that. I still have no idea what
> the use case might be.
> 

There would definitely be a use case for a fixed-byte-width
bytes-representing-text dtype in record arrays to read from a binary file:

dt = np.dtype([
    ('name', '|b8:utf-8'),  # proposed syntax, not an existing dtype: 8 bytes of utf-8 text
    ('param1', '<i4'),
    ('param2', '<i4'),
    ...
    ])

with open('binaryfile', 'rb') as fin:
    a = np.fromfile(fin, dtype=dt)

You could also use this for ASCII if desired. I don't think it really matters
that utf-8 is a variable-width encoding, as long as a byte string that is too
long for the field raises an error (and is not silently truncated).
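
A minimal sketch of the "raise rather than truncate" rule in pure Python. The
helper names encode_fixed and decode_fixed are hypothetical and are not part
of numpy; they only illustrate the behaviour such a dtype could have.

```python
def encode_fixed(text, width, encoding='utf-8'):
    """Encode text into exactly `width` bytes, padding with NUL bytes.

    Hypothetical helper: raises ValueError instead of truncating.
    """
    data = text.encode(encoding)
    if len(data) > width:
        raise ValueError('%r needs %d bytes but the field holds %d'
                         % (text, len(data), width))
    return data.ljust(width, b'\0')


def decode_fixed(data, encoding='utf-8'):
    """Decode a padded field and drop everything from the first NUL character."""
    return data.decode(encoding).split('\0', 1)[0]


name = '\xd5scar'                 # capital O with tilde: 5 characters, 6 utf-8 bytes
field = encode_fixed(name, 8)     # fits in 8 bytes with two bytes of padding
restored = decode_fixed(field)

try:
    encode_fixed(name + '!!!', 8)  # 8 characters but 9 bytes
    too_long_rejected = False
except ValueError:
    too_long_rejected = True
```

The last call shows why the limit has to be checked in bytes: the string has
8 characters but needs 9 bytes in utf-8, so it does not fit an 8-byte field.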

For non-8-bit encodings there would have to be some way to handle endianness
without a BOM, but otherwise I think that it's always possible to pad with zero
*bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip
null *characters* after decoding, e.g.:

$ cat tmp.py 
import encodings

def test_encoding(s1, enc):
    b = s1.encode(enc).ljust(32, b'\0')
    s2 = b.decode(enc)
    index = s2.find('\0')
    if index != -1:
        s2 = s2[:index]
    assert s1 == s2, enc

encodings_set = set(encodings.aliases.aliases.values())

for N, enc in enumerate(encodings_set):
    try:
        test_encoding('qwe', enc)
    except LookupError:
        pass

print('Tested %d encodings without error' % N)
$ python3 tmp.py 
Tested 88 encodings without error
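
The endianness point can be sketched with the standard codecs. This is only
an illustration using the explicit-endian codec names that Python already
provides; the round_trip helper is hypothetical.

```python
text = 'qwe'

with_bom = text.encode('utf-16')     # 2-byte BOM, then native byte order
little = text.encode('utf-16-le')    # no BOM: b'q\x00w\x00e\x00'
big = text.encode('utf-16-be')       # no BOM: b'\x00q\x00w\x00e'


def round_trip(s, enc, width=32):
    """Pad the encoded bytes with NULs, decode, and cut at the first NUL character."""
    padded = s.encode(enc).ljust(width, b'\0')
    return padded.decode(enc).split('\0', 1)[0]


# With an explicit byte order no BOM is needed and padding is harmless.
results = {enc: round_trip(text, enc)
           for enc in ('utf-16-le', 'utf-16-be', 'utf-32-le', 'utf-32-be')}

# A width that is not a multiple of the code unit size leaves a dangling byte.
try:
    round_trip(text, 'utf-16-le', width=7)
    odd_width_ok = True
except UnicodeDecodeError:
    odd_width_ok = False
```

Padding to a multiple of 4 bytes keeps the buffer a whole number of code
units for 1-, 2- and 4-byte encodings alike, which is what avoids the decode
error in the last case.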

> > If the trailing nulls are not removed then you would get:
> >
> >>>> a[0]
> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
> >>>> len(a[0])
> > 9
> >
> > And I'm sure that someone would get upset about that.
> 
> Only if they are using it for text, which you "should not" do with py3.

But people definitely are using it for text on Python 3. It should be
deprecated in favour of something new but breaking it is just gratuitous.
Numpy doesn't have the option to make a clean break with Python 3 precisely
because it needs to straddle 2.x and 3.x while numpy-based applications are
ported to 3.x.

> > Some more oddities:
> >
> >>>> a[0] = 1
> >>>> a
> > array([b'1', b'string', b'of', b'different', b'length', b'words'],
> >      dtype='|S9')
> >>>> a[0] = None
> >>>> a
> > array([b'None', b'string', b'of', b'different', b'length', b'words'],
> >      dtype='|S9')
> 
> More evidence that this is a text type.....

And the big one:

$ python3
Python 3.2.3 (default, Sep 25 2013, 18:22:43) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings
>>> a
array([b'asd', b'zxc'], 
      dtype='|S3')
>>> a[0] = 'qwer' # Unicode string again
>>> a
array([b'qwe', b'zxc'], 
      dtype='|S3')
>>> a[0] = 'Õscar'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)

The analogous behaviour was very deliberately removed from Python 3:

>>> a[0] == 'qwe'
False
>>> a[0] == b'qwe'
True
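
For comparison, a short sketch of how plain Python 3 (without numpy) keeps
bytes and str apart. It is not taken from the session above; it just shows
the standard behaviour.

```python
# bytes and str never compare equal in Python 3, even for pure ASCII content.
same = (b'qwe' == 'qwe')

# Mixing the two types in an operation raises instead of implicitly encoding.
try:
    b'qwe' + 'r'
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Conversion has to be explicit and has to name the encoding.
explicit = 'qwer'.encode('ascii')

# Constructing bytes from a str without an encoding is also an error.
try:
    bytes('qwer')
    implicit_ok = True
except TypeError:
    implicit_ok = False
```

Assigning a str into an 'S' array, by contrast, silently encodes it as ASCII
and truncates it, as the session above shows.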

Oscar