Python3, genfromtxt and unicode
With bytes fields, genfromtxt(dtype=None) sets the sizes of the fields to the largest number of chars (npyio.py line 1596), but it doesn't do the same for unicode fields, which is a pity. See the example below. I tried to change npyio.py around line 1600 to add that, but it didn't work; from my limited understanding, the problem comes earlier, in the way StringBuilder is defined (?).

Antony Lee

    import io, numpy as np
    s = io.BytesIO()
    s.write(b"abc 1\ndef 2")
    s.seek(0)
    t = np.genfromtxt(s, dtype=None) # (or converters={0: bytes})
    print(t, t.dtype) # -> [(b'a', 1) (b'b', 2)] [('f0', '|S1'), ('f1', '<i8')]
    s.seek(0)
    t = np.genfromtxt(s, dtype=None, converters={0: lambda s: s.decode("utf-8")})
    print(t, t.dtype) # -> [('', 1) ('', 2)] [('f0', '<U0'), ('f1', '<i8')]
On Fri, Apr 27, 2012 at 8:17 PM, Antony Lee <antony.lee@berkeley.edu> wrote:
> [...]
Could you open a ticket for this?

Chuck
Sure, I will. Right now my workaround is to call genfromtxt once with bytes and automatic dtype detection, then modify the resulting dtype, replacing the bytes fields with unicode ones, and pass that new dtype to a second call to genfromtxt. A bit awkward, but it gets the job done.

Antony Lee

2012/5/1 Charles R Harris <charlesr.harris@gmail.com>
Could you open a ticket for this?
Chuck
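The two-pass workaround described in the reply above could look roughly like the sketch below. It is not code from the thread: the helper name genfromtxt_unicode is hypothetical, and it assumes a NumPy recent enough to accept the encoding argument of genfromtxt (which did not exist when this was posted) and input with several columns, so that the result is a structured array. Field widths are taken from the byte lengths found in the first pass, which is always enough for the decoded text but may be wider than needed for multi-byte UTF-8.

```python
import io
import warnings

import numpy as np


def genfromtxt_unicode(data, encoding="utf-8"):
    """Hypothetical helper: read `data` (bytes) twice so that string
    columns come back as sized unicode fields instead of bytes fields."""
    # Pass 1: let genfromtxt auto-detect the column types and sizes.
    # encoding="bytes" asks for bytes fields; some NumPy versions emit a
    # deprecation warning for this, which is silenced here.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        probe = np.genfromtxt(io.BytesIO(data), dtype=None, encoding="bytes")

    # Rebuild the dtype, swapping each bytes field (kind "S") for a
    # unicode field (kind "U") with the same number of characters.
    fields = []
    for name in probe.dtype.names:
        field = probe.dtype[name]
        if field.kind == "S":
            field = np.dtype(f"U{field.itemsize}")
        fields.append((name, field))

    # Pass 2: read again with the explicit, unicode-aware dtype.
    return np.genfromtxt(io.BytesIO(data), dtype=np.dtype(fields),
                         encoding=encoding)


t = genfromtxt_unicode(b"abc 1\ndef 2")
print(t, t.dtype)
```

If the first pass already returns unicode fields (as newer NumPy versions may do by default), the loop leaves them untouched, so the helper should still return sized unicode columns.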
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion