[Numpy-discussion] using loadtxt to load a text file in to a numpy array

josef.pktd at gmail.com josef.pktd at gmail.com
Fri Jan 17 10:58:25 EST 2014


On Fri, Jan 17, 2014 at 10:26 AM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:
> On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote:
>> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
>> <oscar.j.benjamin at gmail.com> wrote:
>>
>> > On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
>> > >
>> > > no, the right solution is to add an encoding argument.
>> > > It's a 4 line patch for python2 and a 2 line patch for python3 and the
>> > > issue is solved, I'll file a PR later.
>> >
>> > What is the encoding argument for? Is it to be used to decode, process the
>> > text and then re-encode it for an array with dtype='S'?
>> >
>>
>> it is only used to decode the file into text, nothing more.
>> loadtxt is supposed to load text files, it should never have to deal with
>> bytes ever.
>> But I haven't looked into the function deeply yet, there might be ugly
>> surprises.
>>
>> The output of the array is determined by the dtype argument and not by the
>> encoding argument.
>
> If the dtype is 'S' then the output should be bytes and you therefore
> need to encode the text; there's no such thing as storing text in
> bytes without an encoding.
>
> Strictly speaking the 'U' dtype uses the encoding 'ucs-4' or 'utf-32'
> which just happens to be as simple as expressing the corresponding
> unicode code points as int32 so it's reasonable to think of it as "not
> encoded" in some sense (although endianness becomes an issue in
> utf-32).
>
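
A minimal sketch of the 'S' versus 'U' point above, not taken from the
thread: the array contents are made up for illustration. It shows that a
'U' array stores one 32-bit code point per character in native byte
order, while producing an 'S' array requires choosing an encoding, and
the resulting bytes depend on that choice.

```python
import sys

import numpy as np

# Illustrative data only: '\u00d5scar' is the text 'Õscar'.
text = np.array(['\u00d5scar'], dtype='U5')

# Reinterpret the 'U' buffer as 32-bit integers: one code point per character.
code_points = text.view(np.uint32)
print(code_points)

# The raw buffer is the UTF-32 encoding (no BOM) in the machine's byte order.
native = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'
same_as_utf32 = text.tobytes() == '\u00d5scar'.encode(native)

# Converting to 'S' needs an explicit encoding; the bytes differ per encoding.
as_cp1252 = np.char.encode(text, 'cp1252')
as_utf8 = np.char.encode(text, 'utf-8')
print(as_cp1252, as_utf8)
```

The two 'S' arrays hold different bytes and have different item sizes,
which is why the dtype alone does not say how the text was encoded.
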
> On 17 January 2014 14:11,  <josef.pktd at gmail.com> wrote:
>> Windows seems to use consistent en/decoding throughout (example run in IDLE)
>
> The reason for the Py3k bytes/text overhaul is that there were lots of
> situations where things *seemed* to work until someone happens to use
> a character you didn't try. "Seems to" doesn't cut it! :)
>
>> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
>> 32 bit (Intel)] on win32
>>
>>>>> filenames = numpy.loadtxt('filenames.txt', dtype='S')
>>>>> filenames
>> array([b'weighted_kde.py', b'_proportion.log.py', b'__init__.py',
>>        b'\xd5scar.txt'],
>>       dtype='|S18')
>>>>> fn = open(filenames[-1])
>>>>> fn.read()
>> '1,2,3,hello\n5,6,7,Õscar\n'
>>>>> fn
>> <_io.TextIOWrapper name=b'\xd5scar.txt' mode='r' encoding='cp1252'>
>
> You don't show how you created the file. I think that in your case the
> content of 'filenames.txt' is correctly encoded latin-1.

I had created it with os.listdir but deleted some lines.
Running the full script again, I still get the same correct answer for fn:
------------
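import os

import numpy

# Write the directory listing in text mode, one name per line, so Python
# encodes the names with the platform default encoding.
with open('filenames5.txt', 'w') as fout:
    fout.writelines([f + '\n' for f in os.listdir('.')])

with open('filenames.txt') as fin:
    print(fin.read())

#filenames = numpy.loadtxt('filenames.txt')
filenames = numpy.loadtxt('filenames5.txt', dtype='S')
# The last entry is a bytes object; it is passed to open() unchanged.
fn = open(filenames[-1])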
------------
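
For comparison, a rough sketch of the same round trip with the encoding
spelled out at both ends instead of relying on the platform default.
This is not what the script above does: the file name
'filenames_demo.txt', the temporary directory and the choice of UTF-8
are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical names standing in for an os.listdir() result.
names = ['weighted_kde.py', '\u00d5scar.txt']
path = os.path.join(tempfile.mkdtemp(), 'filenames_demo.txt')

# Write with an explicit encoding rather than the locale default.
with open(path, 'w', encoding='utf-8') as fout:
    fout.writelines(name + '\n' for name in names)

# Read the raw bytes back; splitlines() handles both '\n' and '\r\n'.
with open(path, 'rb') as fin:
    raw_lines = fin.read().splitlines()

as_bytes = np.array(raw_lines, dtype='S')      # still encoded
as_text = np.char.decode(as_bytes, 'utf-8')    # decoded with the same codec
print(as_bytes)
print(as_text)
```

Because the writer and the reader name the same codec, the result does
not depend on which machine or locale runs either half.
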


>
> My guess is that you did the same as me and opened it in text mode and
> wrote the unicode string allowing Python to encode it for you. Judging
> by the encoding on fn above I'd say that it wrote the file with cp1252
> which is mostly compatible with latin-1. Try it with a byte that is
> incompatible between cp1252 and latin-1 e.g.:
>
> In [3]: b'\x80'.decode('cp1252')
> Out[3]: '€'
>
> In [4]: b'\x80'.decode('latin-1')
> Out[4]: '\x80'
>
> In [5]: b'\x80'.decode('cp1252').encode('latin-1')
> ---------------------------------------------------------------------------
> UnicodeEncodeError                        Traceback (most recent call last)
> /users/enojb/<ipython-input-5-cfd8b16d6d9f> in <module>()
> ----> 1 b'\x80'.decode('cp1252').encode('latin-1')
>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in
> position 0: ordinal not in range(256)
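
A small sketch extending the b'\x80' check above to every byte value,
using only the standard codecs. The variable names are made up for
illustration. It collects the bytes that cp1252 and latin-1 decode
differently, and the bytes that Python's cp1252 codec refuses to decode.

```python
differing = []   # bytes that decode to different characters
undefined = []   # bytes Python's cp1252 codec rejects

for value in range(256):
    raw = bytes([value])
    as_latin1 = raw.decode('latin-1')   # latin-1 maps every byte
    try:
        as_cp1252 = raw.decode('cp1252')
    except UnicodeDecodeError:
        undefined.append(value)
        continue
    if as_cp1252 != as_latin1:
        differing.append(value)

print([hex(v) for v in differing])
print([hex(v) for v in undefined])
```

All of the disagreements fall in the range 0x80-0x9f, so a file whose
only non-ASCII byte is 0xd5 looks the same under both encodings.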

I get similar problems when I use a file that someone else has
written; however, I haven't seen many problems if I do everything on
Windows.

The main problems I get, and where I don't know how it's supposed to
work in the best way, are when we get "foreign" data.

Some examples I just played with that are closer to what we use in
statsmodels but don't have any unit tests:

>>> filenames1 = numpy.recfromtxt(open('Õscar.txt',"rb"), delimiter=',')
>>> filenames1
rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xd5scar')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S5')])
>>> filenames1['f3'][-1]
b'\xd5scar'
>>> filenames1['f3'] == 'Õscar'
False
>>> filenames1['f3'] == 'Õscar'.encode('cp1252')
array([False,  True], dtype=bool)
>>> filenames1['f3'] == 'hello'
False
>>> filenames1['f3'] == b'hello'
array([ True, False], dtype=bool)
>>> filenames1['f3'] == b'\xd5scar'
array([False,  True], dtype=bool)
>>> filenames1['f3'] == numpy.array(['Õscar'.encode('utf8')], 'S5')
array([False, False], dtype=bool)
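
A minimal sketch of one way to make such comparisons work against str,
assuming the bytes are known to be cp1252: decode the 'S' column to 'U'
first. The array below is a hand-built stand-in for the 'f3' column, not
the output of recfromtxt.

```python
import numpy as np

# Stand-in for filenames1['f3'] from the session above.
raw = np.array([b'hello', b'\xd5scar'], dtype='S5')

# Decode with the encoding the file was actually written in.
decoded = np.char.decode(raw, 'cp1252')
mask = decoded == '\u00d5scar'          # '\u00d5scar' is 'Õscar'
print(decoded, mask)

# Guessing the wrong encoding fails outright for this byte sequence.
try:
    np.char.decode(raw, 'utf-8')
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
print(wrong_guess_failed)
```

The failure in the last step is the lucky case; a wrong guess between
two single-byte encodings would decode silently to the wrong characters.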

Josef

>
>
> Oscar