[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Oscar Benjamin oscar.j.benjamin at gmail.com
Fri Jan 17 07:44:16 EST 2014


On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> [clip]
> > - inconvenience in dealing with strings in python 3.
> > 
> > bytes are not strings in python3 which means ascii data is either a byte
> > array which can be inconvenient to deal with or 4 byte unicode which
> > wastes space.

It doesn't waste that much space in practice. People have been happily using
Python 2's 4-byte-per-char unicode strings on wide builds (e.g. on Linux) for
years in all kinds of text-heavy applications.

$ python2
Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(u'a' * 1000)
4052

> > For backward compatibility we *cannot* change S.

Do you mean to say that loadtxt cannot be changed from decoding using the
system default encoding, splitting on newlines and whitespace, and then
encoding the substrings as latin-1?

An obvious improvement would be along the lines of what Chris Barker
suggested: decode as latin-1, do the processing and then reencode as latin-1.
Or just open the file in binary and use the bytes string methods. Either of
these has the advantage that it won't corrupt the binary representation of the
data - assuming ascii-compatible whitespace and newlines (e.g. utf-8 and most
currently used 8-bit encodings).
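
A minimal sketch of the two options, to show that they give back the same
bytes. The sample data and the helper names (split_via_latin1, split_as_bytes)
are made up for illustration; this is not what loadtxt does. Note that the
latin-1 variant has to split on ascii whitespace explicitly, since str.split()
would also split on U+0085 and U+00A0, which can turn up as utf-8 continuation
bytes.

```python
import re

# Hypothetical file contents: utf-8 encoded paths separated by whitespace.
raw = "données/fichier.txt résumé.csv\nплан.txt\n".encode("utf-8")

# Split on ascii whitespace only, to mirror what bytes.split() does.
ASCII_WS = re.compile(r"[ \t\r\n\x0b\x0c]+")


def split_via_latin1(data):
    """Decode as latin-1, split as text, re-encode each token as latin-1."""
    # latin-1 maps every byte 0-255 to the code point with the same value,
    # so decode followed by encode is lossless for any input bytes.
    text = data.decode("latin-1")
    return [tok.encode("latin-1") for tok in ASCII_WS.split(text) if tok]


def split_as_bytes(data):
    """Never decode at all: use the bytes methods directly."""
    return data.split()


tokens_a = split_via_latin1(raw)
tokens_b = split_as_bytes(raw)
print(tokens_a == tokens_b)
print(tokens_a[0].decode("utf-8"))
```

Either way the tokens are byte-for-byte what was in the file, so the caller can
still decode them with whatever codec the file was actually written in.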

In the situations where the current behaviour differs from this, the user
*definitely* has mojibake. Can anyone possibly be relying on that (except in
the sense of having implemented a workaround that would break if it were
fixed)?

> > Maybe we could change
> > the meaning of 'a' but it would be safer to add a new dtype, possibly
> > 'S' can be deprecated in favor of 'B' when we have a specific encoding dtype.
> > 
> > The main issue is probably: is it worth it and who does the work?
> 
> I don't think this is a good idea: the bytes vs. unicode separation in
> Python 3 exists for a good reason. If unicode is not needed, why not just
> use the bytes data type throughout the program?

Or on the other hand, why try to use bytes when you're clearly dealing with
text data?

If you're concerned about memory usage, why not use Python strings? As of
CPython 3.3, strings consisting only of latin-1 characters are stored with 1
byte per char. This is only really sensible for immutable strings with an
opaque memory representation though, so numpy shouldn't try to copy it.
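
A quick way to see this is to compare sys.getsizeof for strings of the same
length but different character ranges. This is a sketch that assumes CPython
3.3 or later; the exact numbers include a fixed per-object overhead that
varies between versions, so I'm only comparing the sizes rather than quoting
them.

```python
import sys

n = 1000
samples = {
    "ascii": "a" * n,            # 1 byte per char
    "latin1": "\xe9" * n,        # e-acute, still 1 byte per char
    "bmp": "\u0444" * n,         # Cyrillic ef, 2 bytes per char
    "astral": "\U0001F600" * n,  # outside the BMP, 4 bytes per char
}

# Total object size in bytes, header included.
sizes = {name: sys.getsizeof(s) for name, s in samples.items()}

for name, size in sizes.items():
    print(name, size)
```

The width is chosen per string from its widest character, so a single
non-latin-1 character makes the whole string use 2 or 4 bytes per char.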

> (Also, assuming that ASCII is in general good for text-format data is
> quite US-centric.)

Indeed. The original use case in this thread was a text file containing file
paths. In most of the world there's a reasonable chance that file paths can
contain non-ascii characters. The current behaviour of decoding using one
codec and encoding with latin-1 would, in many cases, break if the user tried
to e.g. open() a file using a byte-string from the array.
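
To make the failure concrete, here is a simplified model of "decode with one
codec, encode with latin-1". The function decode_then_latin1 and the file
names are hypothetical and only stand in for the behaviour described above;
this is not the loadtxt code itself.

```python
def decode_then_latin1(data, codec):
    """Simplified model: decode with `codec`, then re-encode as latin-1."""
    return data.decode(codec).encode("latin-1")


# A utf-8 encoded file name as it would appear in the text file.
original = "résumé.txt".encode("utf-8")          # b'r\xc3\xa9sum\xc3\xa9.txt'

# Decoding as utf-8 and encoding as latin-1 changes the bytes, so the
# result no longer names the same file on a utf-8 filesystem.
mangled = decode_then_latin1(original, "utf-8")  # b'r\xe9sum\xe9.txt'
changed = mangled != original

# Characters outside latin-1 cannot be encoded at all.
cyrillic = "план.txt".encode("utf-8")
try:
    decode_then_latin1(cyrillic, "utf-8")
    failed = False
except UnicodeEncodeError:
    failed = True

# Using latin-1 for both steps leaves the bytes untouched.
preserved = decode_then_latin1(original, "latin-1") == original

print(changed, failed, preserved)
```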


Oscar


