[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Wed Jan 15 12:57:51 EST 2014

On Wed, Jan 15, 2014 at 10:27 AM, Chris Barker <chris.barker at noaa.gov> wrote:

> On Wed, Jan 15, 2014 at 4:38 AM, Julian Taylor <
> jtaylor.debian at googlemail.com> wrote:
>
>> >     I try to print my fileContent array after I read it and it looks
>> >     like this:
>> >
>> >     ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
>> >       "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
>> >       "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>>
>
>
>> you have the bytes representation and a duplicate slash in it.
>>
>
> the duplicate slash confuses me, but I'm not running py3 to test, so...
>
>
>> np.loadtxt(file, dtype=bytes).astype(str)
>>
>> for non-ASCII data I guess you should use Python directly, as numpy would
>> also require a Python loop with explicit decoding.
>>
>> Currently handling strings in python3 with numpy is even worse than
>> before, you always have to go over bytes and do explicit decodes to get
>> python strings out of ascii data.
>>
>
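
The explicit decode described above can be sketched as follows. This is only an illustration, assuming the data is already in a bytes array (the sample values are made up) and that the encoding is known to be UTF-8; np.char.decode still decodes element by element under the hood.

```python
import numpy as np

# Made-up byte strings standing in for what
# np.loadtxt(file, dtype=bytes) might return.
raw = np.array([b'C:\\Users\\Documents\\Project\\mytextfile1.txt',
                b'caf\xc3\xa9.txt'])  # UTF-8 bytes for a non-ASCII name

# Decode every element with an explicit encoding; the result is a
# numpy unicode array.
decoded = np.char.decode(raw, 'utf-8')

print(decoded.dtype.kind)
print(decoded[1])
```

For pure ASCII data, the shorter .astype(str) spelling quoted above does the same job.
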
> There is a MASSIVE set of threads on Python-dev about better support for
> ASCII and ASCII+binary data in py3 -- but in the meantime, I think we have
> two issues here that could be addressed:
>
> 1) loadtxt behavior -- it's a really, really common case for data files
> suitable for loadtxt to be ascii, but they also could be another encoding
> -- so loadtxt should have the option to specify the encoding (default to
> ascii? or ascii-compatible?)
>
> The trick here is handling both these cases correctly -- clearly loadtxt
> is broken on py3 now. This example works fine under py2.
>
> It seems to be reading the file as bytes, then passing those bytes off to
> a unicode string (str in py3), without specifying an encoding (which I
> think is how that b'...' junk gets in there).
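
The py3 behavior behind that guess is easy to reproduce in plain Python: calling str() on a bytes object without an encoding returns its repr, including the b prefix, the quotes, and escaped (doubled) backslashes. A small sketch with a shortened, made-up path; whether loadtxt does exactly this internally is an assumption:

```python
# Raw bytes as read from the file: one backslash between path components.
raw = b'C:\\Users\\Documents'

# str() on bytes with no encoding gives the repr, prefix and quotes included.
as_repr = str(raw)
print(as_repr)   # b'C:\\Users\\Documents'

# Decoding with an explicit encoding gives the actual text.
as_text = raw.decode('ascii')
print(as_text)   # C:\Users\Documents
```

Printing an array of such repr strings escapes the backslashes once more, which would account for the four backslashes in the original output.
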
>
> note that: np.loadtxt('pathlist.txt', dtype=unicode) works fine on py2 as
> well:
>
> In [7]: np.loadtxt('pathlist.txt', dtype=unicode)
> Out[7]:
> array([u'C:\\Users\\Documents\\Project\\mytextfile1.txt',
>        u'C:\\Users\\Documents\\Project\\mytextfile2.txt',
>        u'C:\\Users\\Documents\\Project\\mytextfile3.txt'],
>       dtype='<U42')
>
> which is what should happen in py3. So the internal loadtxt code must be
> confusing bytes and unicode objects...
>
> Anyway, this should work, and there should be an obvious way to spell it.
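
For readers on later NumPy releases: loadtxt gained an encoding keyword in NumPy 1.14, which gives one way to spell this. A minimal sketch, assuming such a release; the file is a stand-in for pathlist.txt written to a temporary directory:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the poster's pathlist.txt.
lines = [
    r"C:\Users\Documents\Project\mytextfile1.txt",
    r"C:\Users\Documents\Project\mytextfile2.txt",
    r"C:\Users\Documents\Project\mytextfile3.txt",
]
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "pathlist.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

# Assumes NumPy >= 1.14, where loadtxt accepts an encoding keyword.
paths = np.loadtxt(path, dtype=str, encoding="utf-8")

print(paths.dtype)
print(paths[0])
```
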
>
> 2) numpy string types -- it seems numpy already has both a string type
> and a unicode type -- perhaps some re-naming or better documentation is in
> order:
>    the string type 'S10', for example, should be clearly defined as 1-byte
> per character ascii-compatible.
>
> I'm not sure how many bytes the unicode type has, but it may make sense to
> be able to choose UCS-2 or UCS-4 -- though memory is cheap, I'd probably go
> with UCS-4 and be done with it.
>

There was a discussion of this long ago and UCS-4 was chosen as the numpy
standard. There are just too many complications that arise in supporting
both.
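
The per-character sizes can be checked from the dtype item sizes. A quick sketch, using 'S10' from the example above and 'U10' as its unicode counterpart:

```python
import numpy as np

s = np.dtype('S10')   # byte string, 10 characters
u = np.dtype('U10')   # unicode string, 10 characters

print(s.itemsize)     # 10 -> 1 byte per character
print(u.itemsize)     # 40 -> 4 bytes per character (UCS-4)
```
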

Chuck