Mailman 3 bug in genfromtxt for python 3.2 - NumPy-Discussion

bug in genfromtxt for python 3.2


josef.pktd＠gmail.com

29 Mar 2011, 11:59 a.m.

numpy/lib/test_io.py only uses StringIO in the test, no actual csv file.

If I give the filename, then I get a TypeError: Can't convert 'bytes' object to str implicitly

from the statsmodels mailing list example:

>>> data = recfromtxt(open('./star98.csv', "U"), delimiter=",", skip_header=1, dtype=float)
Traceback (most recent call last):
  File "", line 1, in <module>
    data = recfromtxt(open('./star98.csv', "U"), delimiter=",", skip_header=1, dtype=float)
  File "C:\Programs\Python32\lib\site-packages\numpy\lib\npyio.py", line 1633, in recfromtxt
    output = genfromtxt(fname, **kwargs)
  File "C:\Programs\Python32\lib\site-packages\numpy\lib\npyio.py", line 1181, in genfromtxt
    first_values = split_line(first_line)
  File "C:\Programs\Python32\lib\site-packages\numpy\lib\_iotools.py", line 206, in _delimited_splitter
    line = line.split(self.comments)[0].strip(asbytes(" \r\n"))
TypeError: Can't convert 'bytes' object to str implicitly

line 1184 in npyio (py32 sourcefile)

    if isinstance(fname, str):
        fhd = np.lib._datasource.open(fname, 'U')

seems to be the culprit for my case

changing to binary solved this problem for me

    fhd = np.lib._datasource.open(fname, 'Ub')

(I still have other errors but don't know yet where they are coming from.)

Almost all problems with porting statsmodels to python 3.2 so far are input related, mainly reading csv files, which are heavily used in the tests. All the "real" code seems to work fine with numpy and scipy (and matplotlib so far) for python 3.2.

Josef
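
The mismatch behind the traceback can be reproduced without NumPy. The following is only a minimal sketch: `split_first_field` is a hypothetical stand-in for the expression on line 206 of `_iotools.py`, and the assumption that the comment marker is the bytes value `b"#"` is mine, not taken from the thread. It shows that a line read in text mode (str) cannot be split on a bytes marker in Python 3, while a line read in binary mode can.

```python
import io


def split_first_field(line, comments=b"#"):
    """Hypothetical stand-in for the failing expression:
    drop anything after the (bytes) comment marker, then strip whitespace."""
    return line.split(comments)[0].strip(b" \r\n")


# A line as it arrives from a text-mode file (str) ...
text_line = io.StringIO("1.0,2.0 # note\n").readline()
# ... and from a binary-mode file (bytes).
bytes_line = io.BytesIO(b"1.0,2.0 # note\n").readline()

try:
    split_first_field(text_line)
    text_mode_failed = False
except TypeError:
    # str.split() refuses a bytes separator in Python 3
    text_mode_failed = True

binary_result = split_first_field(bytes_line)
print(text_mode_failed, binary_result)
```

This is consistent with the observation above that opening the file in binary mode avoids the error, since every operand is then bytes.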


Matthew Brett

30 Mar, 7:09 a.m.

Hi, On Mon, Mar 28, 2011 at 11:29 PM, wrote:

numpy/lib/test_io.py only uses StringIO in the test, no actual csv file

If I give the filename, then I get a TypeError: Can't convert 'bytes' object to str implicitly

from the statsmodels mailing list example

>>> data = recfromtxt(open('./star98.csv', "U"), delimiter=",", skip_header=1, dtype=float)
Traceback (most recent call last):
  File "", line 1, in <module>
    data = recfromtxt(open('./star98.csv', "U"), delimiter=",", skip_header=1, dtype=float)
  File "C:\Programs\Python32\lib\site-packages\numpy\lib\npyio.py", line 1633, in recfromtxt
    output = genfromtxt(fname, **kwargs)
  File "C:\Programs\Python32\lib\site-packages\numpy\lib\npyio.py", line 1181, in genfromtxt
    first_values = split_line(first_line)
  File "C:\Programs\Python32\lib\site-packages\numpy\lib\_iotools.py", line 206, in _delimited_splitter
    line = line.split(self.comments)[0].strip(asbytes(" \r\n"))
TypeError: Can't convert 'bytes' object to str implicitly

Is the right fix for this to open a 'filename' passed to genfromtxt as 'binary' (bytes)? If so, I will submit a pull request with a fix and a test.

Best,

Matthew

Ralf Gommers

10:32 p.m.

On Wed, Mar 30, 2011 at 3:39 AM, Matthew Brett wrote:

Is the right fix for this to open a 'filename' passed to genfromtxt, as 'binary' (bytes)?

If so I will submit a pull request with a fix and a test,

Seems to work and is what was intended I think, see Pauli's changes/notes in commit 0f2e7db0. This is ticket #1607 by the way.

Cheers,
Ralf

Matthew Brett

11:07 p.m.

Hi, On Wed, Mar 30, 2011 at 10:02 AM, Ralf Gommers wrote:

On Wed, Mar 30, 2011 at 3:39 AM, Matthew Brett wrote:

Is the right fix for this to open a 'filename' passed to genfromtxt, as 'binary' (bytes)?

If so I will submit a pull request with a fix and a test,

Seems to work and is what was intended I think, see Pauli's changes/notes in commit 0f2e7db0.

This is ticket #1607 by the way.

Thanks for making a ticket. I've submitted a pull request for the fix and linked to it from the ticket.

The reason I asked whether this was the correct fix was: imagine I'm working with a non-latin default encoding, and I've opened a file:

    fobj = open('my_nonlatin.txt', 'rt')

in python 3.2. That might contain numbers and non-latin text. I can't pass that into 'genfromtxt' because it will give me the error above. I can pass it in as binary, but then I'll get garbled text.

Should those functions also allow unicode-providing files (perhaps with binary as default for speed)?

See you,

Matthew
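
For a file that is already open in text mode, one possible workaround is to re-encode each line to bytes before handing it to a bytes-only reader, and to decode the text fields afterwards. This is only a sketch: `as_byte_lines` is a hypothetical helper, not part of NumPy, and it assumes that UTF-8 is an acceptable intermediate encoding and that the reader accepts an iterable of byte lines.

```python
import io


def as_byte_lines(text_stream, encoding="utf-8"):
    """Hypothetical helper: yield each line of a text-mode stream as bytes."""
    for line in text_stream:
        yield line.encode(encoding)


# io.StringIO stands in for open('my_nonlatin.txt', 'rt')
fobj = io.StringIO("α,1\nβ,2\n")

byte_lines = list(as_byte_lines(fobj))
# A bytes-only parser can now split on a bytes delimiter ...
fields = [line.strip().split(b",") for line in byte_lines]
# ... and the text column can be decoded once parsing is done.
names = [row[0].decode("utf-8") for row in fields]
values = [float(row[1]) for row in fields]
print(names, values)
```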

Ralf Gommers

11:42 p.m.

On Wed, Mar 30, 2011 at 7:37 PM, Matthew Brett wrote:

The reason I asked whether this was the correct fix was:

imagine I'm working with a non-latin default encoding, and I've opened a file:

fobj = open('my_nonlatin.txt', 'rt')

in python 3.2. That might contain numbers and non-latin text. I can't pass that into 'genfromtxt' because it will give me the error above. I can pass it in as binary, but then I'll get garbled text.

I admit the string/bytes thing is still a little confusing to me, but isn't that always going to be a problem (even with python 2.x)? There's no way for genfromtxt to know what the encoding of an arbitrary file is. So your choices are garbled text or an error. Garbled text is better.

It may help to explicitly say in the docstring that this is an ASCII routine (as it does in the source code).

Ralf

Should those functions also allow unicode-providing files (perhaps with binary as default for speed)?

Pauli Virtanen

31 Mar, 12:02 a.m.

On Wed, 30 Mar 2011 10:37:45 -0700, Matthew Brett wrote: [clip]

imagine I'm working with a non-latin default encoding, and I've opened a file:

fobj = open('my_nonlatin.txt', 'rt')

in python 3.2. That might contain numbers and non-latin text. I can't pass that into 'genfromtxt' because it will give me the error above. I can pass it in as binary, but then I'll get garbled text.

That's the way it also works on Python 2. The text is not garbled -- it's just in some binary representation that you can later on decode to unicode:

>>> np.array(['asd']).view(np.chararray).decode('utf-8')
array([u'asd'], dtype='

Granted, utf-16 and the ilk might be problematic.
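
A small stdlib-only sketch of both halves of this point (the sample row and variable names are illustrative assumptions, not from the thread): UTF-8 fields survive a bytes-level split and decode cleanly afterwards, whereas splitting UTF-16 data on a single-byte delimiter leaves a stray NUL byte at the start of the next field.

```python
row = "Ωmega,1.5"

# UTF-8: the delimiter is the single byte b",", so a bytes split is safe
utf8_fields = row.encode("utf-8").split(b",")
decoded = [field.decode("utf-8") for field in utf8_fields]

# UTF-16 (little endian): "," is encoded as b",\x00", so splitting on b","
# cuts a code unit in half and the second field begins with the leftover NUL
utf16_fields = row.encode("utf-16-le").split(b",")
second_is_clean = utf16_fields[1] == "1.5".encode("utf-16-le")
starts_with_nul = utf16_fields[1].startswith(b"\x00")

print(decoded, second_is_clean, starts_with_nul)
```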

Should those functions also allow unicode-providing files (perhaps with binary as default for speed)?

Nobody has yet asked for this feature as far as I know, so I guess the need for it is pretty low.

Personally, I don't think going unicode makes much sense here. First, it would be a Py3-only feature. Second, there is a real need for it only when dealing with multibyte encodings, which are seldom used these days with utf-8 rightfully dominating.

--
Pauli Virtanen

Matthew Brett

1:18 a.m.

Hi, On Wed, Mar 30, 2011 at 11:32 AM, Pauli Virtanen wrote:

Nobody has yet asked for this feature as far as I know, so I guess the need for it is pretty low.

Personally, I don't think going unicode makes much sense here. First, it would be a Py3-only feature. Second, there is a real need for it only when dealing with multibyte encodings, which are seldom used these days with utf-8 rightfully dominating.

It's not a feature I need, but then, I'm afraid all the languages I've been taught are latin-1. Oh, except I learnt a tiny bit of Greek. But I don't use it for work :)

I suppose the annoyances would be:

1) Probably temporary surprise that genfromtxt(open('my_file.txt', 'rt')) generates this error
2) Having to go back over returned arrays decoding stuff for utf-8
3) Wrong results for other encodings

Maybe the best way is a graceful warning on entry to the routine?

Best,

Matthew
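
A warning on entry might look roughly like the following. This is a sketch under stated assumptions only: `warn_if_text_mode` is a hypothetical helper, not an existing NumPy function, and checking for `io.TextIOBase` is just one way a routine could detect a text-mode file.

```python
import io
import warnings


def warn_if_text_mode(fobj):
    """Hypothetical entry check: warn when given a text-mode (str) stream.

    Returns True if a warning was issued, False otherwise.
    """
    if isinstance(fobj, io.TextIOBase):
        warnings.warn(
            "text-mode file passed; this reader expects bytes, "
            "reopen the file with mode 'rb'",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    text_flagged = warn_if_text_mode(io.StringIO("1,2\n"))    # text mode
    bytes_flagged = warn_if_text_mode(io.BytesIO(b"1,2\n"))   # binary mode
n_warnings = len(caught)
print(text_flagged, bytes_flagged, n_warnings)
```

A check like this would address the first annoyance (the surprising TypeError), but not the second and third, which concern decoding after the fact.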
