[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Thu Jan 23 11:23:09 EST 2014

On Thu, Jan 23, 2014 at 10:41 AM,  <josef.pktd at gmail.com> wrote:
> On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:
>> On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
>>> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:
>>>
>>> >
>>> > It's not safe to stop removing the null bytes. This is how numpy determines
>>> > the length of the strings in a dtype='S' array. The strings are not
>>> > "fixed-width" but rather have a maximum width.
>>>
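
The maximum-width behaviour is easy to check directly. A minimal sketch, assuming a reasonably recent numpy on Python 3 (the array contents are made up for illustration):

```python
import numpy as np

# Two byte strings of different length stored in a 9-byte-wide 'S' field.
a = np.array([b'a', b'different'], dtype='S9')

# The underlying buffer pads every item to the full 9 bytes with nulls ...
raw = a.tobytes()
padded_first = raw[:9]      # b'a' followed by 8 null bytes

# ... but reading an item back strips the trailing nulls again.
first = a[0]
first_len = len(first)

# The flip side: a trailing null that was really part of the data is lost.
c = np.array([b'ab\x00'], dtype='S3')
lost = c[0]
```

So the padding is present in memory, and the length of each item is only recovered by stripping it on access.
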
>>> Exactly--but folks have told us on this list that they want (and are)
>>> using the 'S' style for arbitrary bytes, NOT for text. In which case
>>> you wouldn't want to remove null bytes. This is more evidence that 'S'
>>> was designed to handle c-style one-byte-per-char strings, and NOT
>>> arbitrary bytes, and thus not to map directly to the py2 string type
>>> (you can store null bytes in a py2 string).
>>
>> You can store null bytes in a Py2 string but you normally wouldn't if it was
>> supposed to be text.
>>
>>>
>>> Which brings me back to my original proposal: properly map the 'S'
>>> type to the py3 data model, and maybe add some kind of fixed width
>>> bytes style if there is a use case for that. I still have no idea what
>>> the use case might be.
>>>
>>
>> There would definitely be a use case for a fixed-byte-width
>> bytes-representing-text dtype in record arrays to read from a binary file:
>>
>> dt = np.dtype([
>>     ('name', '|b8:utf-8'),
>>     ('param1', '<i4'),
>>     ('param2', '<i4')
>>     ...
>>     ])
>>
>> with open('binaryfile', 'rb') as fin:
>>     a = np.fromfile(fin, dtype=dt)
>>
>> You could also use this for ASCII if desired. I don't think it really matters
>> that utf-8 uses variable width as long as a too long byte string throws an
>> error (and does not truncate).
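
The "error instead of truncate" rule can be sketched in plain Python. The dtype spelling proposed above is not something numpy provides, and the helper names `encode_fixed` and `decode_fixed` are hypothetical, chosen only for this illustration:

```python
def encode_fixed(text, width, encoding='utf-8'):
    """Encode text into exactly `width` bytes, null-padded on the right."""
    data = text.encode(encoding)
    if len(data) > width:
        # Refuse to truncate: cutting bytes could split a multi-byte character.
        raise ValueError('%r needs %d bytes, field holds %d'
                         % (text, len(data), width))
    return data.ljust(width, b'\0')


def decode_fixed(data, encoding='utf-8'):
    """Decode a null-padded field and strip the padding characters."""
    return data.decode(encoding).rstrip('\0')


field = encode_fixed('Õscar', 8)      # 'Õ' takes two bytes in utf-8
text = decode_fixed(field)

try:
    encode_fixed('Õscarrrr', 8)       # 9 bytes once encoded
    too_long_raised = False
except ValueError:
    too_long_raised = True
```

The width check is done on the encoded bytes, not on the number of characters, which is why a variable-width encoding is not a problem here.
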
>>
>> For non 8-bit encodings there would have to be some way to handle endianness
>> without a BOM, but otherwise I think that it's always possible to pad with zero
>> *bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip
>> null *characters* after decoding. i.e.:
>>
>> $ cat tmp.py
>> import encodings
>>
>> def test_encoding(s1, enc):
>>     b = s1.encode(enc).ljust(32, b'\0')
>>     s2 = b.decode(enc)
>>     index = s2.find('\0')
>>     if index != -1:
>>         s2 = s2[:index]
>>     assert s1 == s2, enc
>>
>> encodings_set = set(encodings.aliases.aliases.values())
>>
>> for N, enc in enumerate(encodings_set):
>>     try:
>>         test_encoding('qwe', enc)
>>     except LookupError:
>>         pass
>>
>> print('Tested %d encodings without error' % N)
>> $ python3 tmp.py
>> Tested 88 encodings without error
>>
>>> > If the trailing nulls are not removed then you would get:
>>> >
>>> >>>> a[0]
>>> > b'a\x00\x00\x00\x00\x00\x00\x00\x00'
>>> >>>> len(a[0])
>>> > 9
>>> >
>>> > And I'm sure that someone would get upset about that.
>>>
>>> Only if they are using it for text, which you "should not" do with py3.
>>
>> But people definitely are using it for text on Python 3. It should be
>> deprecated in favour of something new but breaking it is just gratuitous.
>> Numpy doesn't have the option to make a clean break with Python 3 precisely
>> because it needs to straddle 2.x and 3.x while numpy-based applications are
>> ported to 3.x.
>>
>>> > Some more oddities:
>>> >
>>> >>>> a[0] = 1
>>> >>>> a
>>> > array([b'1', b'string', b'of', b'different', b'length', b'words'],
>>> >      dtype='|S9')
>>> >>>> a[0] = None
>>> >>>> a
>>> > array([b'None', b'string', b'of', b'different', b'length', b'words'],
>>> >      dtype='|S9')
>>>
>>> More evidence that this is a text type.....
>>
>> And the big one:
>>
>> $ python3
>> Python 3.2.3 (default, Sep 25 2013, 18:22:43)
>> [GCC 4.6.3] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> import numpy as np
>>>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings
>>>>> a
>> array([b'asd', b'zxc'],
>>       dtype='|S3')
>>>>> a[0] = 'qwer' # Unicode string again
>>>>> a
>> array([b'qwe', b'zxc'],
>>       dtype='|S3')
>>>>> a[0] = 'Õscar'
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>
> This looks mostly like casting rules to me, and the casting appears to
> be ASCII-based rather than using an arbitrary encoding.
>
>>>> a = np.array(['asd', 'zxc'], dtype='S')
>>>> b = a.astype('U')
>>>> b[0] = 'Õscar'
>>>> a[0] = 'Õscar'
> Traceback (most recent call last):
>   File "<pyshell#17>", line 1, in <module>
>     a[0] = 'Õscar'
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
>>>> b
> array(['Õsc', 'zxc'],
>       dtype='<U3')
>>>> b.astype('S')
> Traceback (most recent call last):
>   File "<pyshell#19>", line 1, in <module>
>     b.astype('S')
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
>>>> b.view('S4')
> array([b'\xd5', b's', b'c', b'z', b'x', b'c'],
>       dtype='|S4')
>
>>>> a.astype('U').astype('S')
> array([b'asd', b'zxc'],
>       dtype='|S3')


Another curious example: getting latin-1 bytes out of the unicode array

>>> b
array(['Õsc', 'zxc'],
      dtype='<U3')
>>> b[0].encode('utf8')
b'\xc3\x95sc'
>>> b[0].encode('latin1')
b'\xd5sc'
>>> b.astype('S')
Traceback (most recent call last):
  File "<pyshell#40>", line 1, in <module>
    b.astype('S')
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
position 0: ordinal not in range(128)
>>> c = b.view('S4').astype('S1').view('S3')
>>> c
array([b'\xd5sc', b'zxc'],
      dtype='|S3')
>>> c[0].decode('latin1')
'Õsc'
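
As far as I can tell, the view chain works because a '<U3' item is stored as three little-endian 4-byte code points, and for code points below 256 the lowest byte is exactly the latin-1 byte. A sketch that spells out the steps and compares them with `np.char.encode`, assuming a contiguous array (the intermediate names are only for illustration):

```python
import numpy as np

# Explicit '<U3' so the byte layout is little-endian on any platform.
b = np.array(['Õsc', 'zxc'], dtype='<U3')

# Reinterpret each 4-byte code point as its own 'S4' item (6 items).
per_char = b.view('S4')

# Casting to 'S1' keeps only the first (lowest) byte of each code point.
low_byte = per_char.astype('S1')

# Regroup three single bytes into one 'S3' item per original string.
c = low_byte.view('S3')

# The explicit route: encode every element with a named codec.
d = np.char.encode(b, 'latin1')
```

For characters outside latin-1 the view trick would silently keep only the lowest byte, while the explicit encode raises, so the explicit version is the safer of the two.
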

--------
The original numpy py3 conversion used latin-1 as the default encoding.
(It's still used in statsmodels; I haven't looked at how this is structured
under the common py2-3 codebase.)

import sys

if sys.version_info[0] >= 3:
    import io
    bytes = bytes
    unicode = str
    asunicode = str
    def asbytes(s):
        if isinstance(s, bytes):
            return s
        return s.encode('latin1')
    def asstr(s):
        if isinstance(s, str):
            return s
        return s.decode('latin1')

--------------
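
One reason latin-1 is a convenient default for helpers like these: it maps every byte value 0..255 to the code point with the same number, so a bytes -> str -> bytes round trip never fails and never loses data. A small check, with the two helpers repeated (Python 3 branch only) so that it runs on its own:

```python
def asbytes(s):
    if isinstance(s, bytes):
        return s
    return s.encode('latin1')


def asstr(s):
    if isinstance(s, str):
        return s
    return s.decode('latin1')


# Every possible byte value survives the round trip through str.
every_byte = bytes(range(256))
round_trip = asbytes(asstr(every_byte))

# ASCII, by contrast, rejects anything above 127.
try:
    every_byte.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

encoded = asbytes('Õsc')
```

The round trip only preserves the bytes; it says nothing about whether the data really was latin-1 text to begin with.
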

Josef

>
> Josef
>
>>
>> The analogous behaviour was very deliberately removed from Python 3:
>>
>>>>> a[0] == 'qwe'
>> False
>>>>> a[0] == b'qwe'
>> True
>>
>>
>> Oscar
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion