[Numpy-discussion] using loadtxt to load a text file in to a numpy array

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Jan 23 10:41:30 EST 2014


On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:
> On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
>> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:
>>
>> >
>> > It's not safe to stop removing the null bytes. This is how numpy determines
>> > the length of the strings in a dtype='S' array. The strings are not
>> > "fixed-width" but rather have a maximum width.
>>
>> Exactly--but folks have told us on this list that they want (and are)
>> using the 'S' style for arbitrary bytes, NOT for text. In which case
>> you wouldn't want to remove null bytes. This is more evidence that 'S'
>> was designed to handle c-style one-byte-per-char strings, and NOT
>> arbitrary bytes, and thus not to map directly to the py2 string type
>> (you can store null bytes in a py2 string).
>
> You can store null bytes in a Py2 string but you normally wouldn't if it was
> supposed to be text.
>
>>
>> Which brings me back to my original proposal: properly map the 'S'
>> type to the py3 data model, and maybe add some kind of fixed width
>> bytes style if there is a use case for that. I still have no idea what
>> the use case might be.
>>
>
> There would definitely be a use case for a fixed-byte-width
> bytes-representing-text dtype in record arrays to read from a binary file:
>
> dt = np.dtype([
>     ('name', '|b8:utf-8'),
>     ('param1', '<i4'),
>     ('param2', '<i4')
>     ...
>     ])
>
> with open('binaryfile', 'rb') as fin:
>     a = np.fromfile(fin, dtype=dt)
>
> You could also use this for ASCII if desired. I don't think it really matters
> that utf-8 uses variable width as long as a too long byte string throws an
> error (and does not truncate).
>
> For non 8-bit encodings there would have to be some way to handle endianness
> without a BOM, but otherwise I think that it's always possible to pad with zero
> *bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip
> null *characters* after decoding. i.e.:
>
> $ cat tmp.py
> import encodings
>
> def test_encoding(s1, enc):
>     b = s1.encode(enc).ljust(32, b'\0')
>     s2 = b.decode(enc)
>     index = s2.find('\0')
>     if index != -1:
>         s2 = s2[:index]
>     assert s1 == s2, enc
>
> encodings_set = set(encodings.aliases.aliases.values())
>
> for N, enc in enumerate(encodings_set):
>     try:
>         test_encoding('qwe', enc)
>     except LookupError:
>         pass
>
> print('Tested %d encodings without error' % N)
> $ python3 tmp.py
> Tested 88 encodings without error
>
>> > If the trailing nulls are not removed then you would get:
>> >
>> >>>> a[0]
>> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>> >>>> len(a[0])
>> > 9
>> >
>> > And I'm sure that someone would get upset about that.
>>
>> Only if they are using it for text, which you "should not" do with py3.
>
> But people definitely are using it for text on Python 3. It should be
> deprecated in favour of something new but breaking it is just gratuitous.
> Numpy doesn't have the option to make a clean break with Python 3 precisely
> because it needs to straddle 2.x and 3.x while numpy-based applications are
> ported to 3.x.
>
>> > Some more oddities:
>> >
>> >>>> a[0] = 1
>> >>>> a
>> > array([b'1', b'string', b'of', b'different', b'length', b'words'],
>> >      dtype='|S9')
>> >>>> a[0] = None
>> >>>> a
>> > array([b'None', b'string', b'of', b'different', b'length', b'words'],
>> >      dtype='|S9')
>>
>> More evidence that this is a text type.....
>
> And the big one:
>
> $ python3
> Python 3.2.3 (default, Sep 25 2013, 18:22:43)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import numpy as np
>>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings
>>>> a
> array([b'asd', b'zxc'],
>       dtype='|S3')
>>>> a[0] = 'qwer' # Unicode string again
>>>> a
> array([b'qwe', b'zxc'],
>       dtype='|S3')
>>>> a[0] = 'Õscar'
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)

This looks mostly like casting rules to me, and the casting appears to be
ASCII-based rather than using an arbitrary encoding.

>>> a = np.array(['asd', 'zxc'], dtype='S')
>>> b = a.astype('U')
>>> b[0] = 'Õscar'
>>> a[0] = 'Õscar'
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    a[0] = 'Õscar'
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
position 0: ordinal not in range(128)
>>> b
array(['Õsc', 'zxc'],
      dtype='<U3')
>>> b.astype('S')
Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    b.astype('S')
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
position 0: ordinal not in range(128)
>>> b.view('S4')
array([b'\xd5', b's', b'c', b'z', b'x', b'c'],
      dtype='|S4')
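
A minimal sketch of why the view above yields six single-byte items,
assuming a little-endian '<U3' array (the names raw and padded are
illustrative only). Each 'U' character occupies 4 bytes, so viewing as 'S4'
produces one item per character, and the trailing null bytes of each item
are dropped when it is read back:

```python
import numpy as np

# Two 3-character strings stored as little-endian UCS-4: 2 * 3 * 4 = 24 bytes.
b = np.array(['Õsc', 'zxc'], dtype='<U3')

# Reinterpret the same memory as 4-byte 'S' items, one per character.
raw = b.view('S4')

# The underlying bytes still contain the padding; 'S' only hides trailing nulls.
padded = b.view(np.uint8).reshape(-1, 4)

print(raw.tolist())        # one bytes object per character
print(padded[0].tolist())  # the four stored bytes of the first character
```

On a big-endian layout the non-null byte would come last, so the same view
would show the leading nulls instead.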

>>> a.astype('U').astype('S')
array([b'asd', b'zxc'],
      dtype='|S3')
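
The sessions above can be condensed into a short script. This is only a
sketch: the variable names are illustrative, and using np.char.encode and
np.char.decode to pick an explicit encoding is an alternative that was not
proposed in this thread. The implicit 'U' to 'S' cast accepts ASCII text
only, while an explicit encode step handles other text at the cost of a
wider byte item:

```python
import numpy as np

a = np.array(['asd', 'zxc'], dtype='S')   # bytes array, dtype '|S3'
b = a.astype('U')                          # str array, 3 characters wide

# ASCII-only text survives the round trip through 'U' and back to 'S'.
roundtrip = b.astype('S')

# Non-ASCII text is rejected by the implicit cast to 'S'.
c = np.array(['Õsc', 'zxc'])
try:
    c.astype('S')
    cast_failed = False
except UnicodeEncodeError:
    cast_failed = True

# Encoding explicitly works; 'Õ' takes two bytes in UTF-8, so the items widen.
encoded = np.char.encode(c, 'utf-8')
decoded = np.char.decode(encoded, 'utf-8')

print(roundtrip, cast_failed, encoded.dtype, decoded)
```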

Josef

>
> The analogous behaviour was very deliberately removed from Python 3:
>
>>>> a[0] == 'qwe'
> False
>>>> a[0] == b'qwe'
> True
>
>
> Oscar
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
