[Numpy-discussion] bug in genfromtxt for python 3.2

Wed Mar 30 15:48:18 EDT 2011

Hi,

On Wed, Mar 30, 2011 at 11:32 AM, Pauli Virtanen <pav at iki.fi> wrote:
> On Wed, 30 Mar 2011 10:37:45 -0700, Matthew Brett wrote:
> [clip]
>> imagine I'm working with a non-latin default encoding, and I've opened a
>> file:
>>
>> fobj = open('my_nonlatin.txt', 'rt')
>>
>> in python 3.2.  That might contain numbers and non-latin text.   I can't
>> pass that into 'genfromtxt' because it will give me this error above.  I
>> can pass it in as binary but then I'll get garbled text.
>
> That's the way it also works on Python 2. The text is not garbled -- it's
> just in some binary representation that you can later on decode to
> unicode:
>
>>>> np.array(['asd']).view(np.chararray).decode('utf-8')
> array([u'asd'],
>      dtype='<U3')
>
> Granted, utf-16 and the ilk might be problematic.
>
>> Should those functions also allow unicode-providing files (perhaps with
>> binary as default for speed)?
>
> Nobody has yet asked for this feature as far as I know, so I guess the
> need for it is pretty low.
>
> Personally, I don't think going unicode makes much sense here. First, it
> would be a Py3-only feature. Second, there is a real need for it only
> when dealing with multibyte encodings, which are seldom used these days
> with utf-8 rightfully dominating.

It's not a feature I need, but then, I'm afraid all the languages I've
been taught are latin-1.  Oh, except I learnt a tiny bit of Greek.
But I don't use it for work :)

I suppose the annoyances would be:

1) Probably temporary surprise that genfromtxt(open('my_file.txt',
'rt')) generates this error
2) Having to go back over the returned arrays and decode the strings
from utf-8
3) Wrong results for other encodings (utf-16, say)
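
To make 2) and 3) concrete, here is a rough sketch using plain bytes
rather than genfromtxt itself. The sample strings are made up for
illustration, and the byte-level split only stands in for what a
byte-oriented parser would do. It is not numpy's actual code path.

```python
# Fields as they might come back from a file read in binary mode
# (made-up sample data, utf-8 encoded).
raw_fields = ['αβγ'.encode('utf-8'), b'1.5']

# Annoyance 2: each text field needs an explicit decode afterwards.
decoded = [field.decode('utf-8') for field in raw_fields]
print(decoded)

# Annoyance 3: with an encoding such as utf-16, splitting on the
# single-byte delimiter leaves stray NUL bytes in every field, so the
# fields are no longer the plain b'1' and b'2' a parser would expect.
utf16_line = '1,2'.encode('utf-16-le')
fields = utf16_line.split(b',')
print(fields)
```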

Maybe the best way is a graceful warning on entry to the routine?
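
Something along these lines is what I have in mind, as a sketch only.
The helper name warn_if_text_mode and the message wording are my own
invention, not anything in numpy. It just checks whether the object is
a text-mode file and warns if so.

```python
import io
import warnings


def warn_if_text_mode(fobj):
    """Hypothetical entry check (not part of numpy).

    Warn and return True if fobj is a text-mode file object,
    otherwise return False without warning.
    """
    if isinstance(fobj, io.TextIOBase):
        warnings.warn(
            "file was opened in text mode; open it in binary mode "
            "('rb') and decode the returned strings afterwards",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

A reading routine could call this first and then either carry on or
raise a clearer error than the current one.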

Best,

Matthew