[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Chris Barker chris.barker at noaa.gov
Thu Jan 16 12:08:38 EST 2014


On Thu, Jan 16, 2014 at 2:43 AM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:

> > My proposal:
> >
> > loadtxt accepts an encoding argument.
> >
> > default is ascii -- that's what it's doing now, anyway, yes?
>
> No, it's loading the file, reading a line, encoding the line with latin-1,
> and then putting the repr of the resulting byte-string as a unicode string
> into a UCS-4 array (dtype='<Ux'). I can't see any good reason for that
> behaviour.


agreed -- really odd. If we're going to assume latin-1, why not put the
decoded unicode string in the array?

But what about parsing numbers? latin-1 decoded to a unicode object, then
parsed? Reasonable enough.
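
To make the difference concrete, here is a plain-Python sketch (it does not
call numpy.loadtxt, and the sample value is made up) of what "store the repr
of the latin-1 byte string" produces, next to the decode-then-parse route:

```python
# Plain-Python illustration only; numpy.loadtxt is not involved here.
line = "3.14"

# Storing the repr of the latin-1 encoded line leaves a stray b'...' wrapper.
as_repr = repr(line.encode("latin-1"))

# Decoding the raw bytes as latin-1 and then parsing is straightforward.
raw = b"3.14"
value = float(raw.decode("latin-1"))

print(as_repr)  # b'3.14'
print(value)    # 3.14
```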

> > If the file is encoded ascii, then a one-byte-per-character dtype is used
> > for text data, unless the user specifies otherwise (do they need to
> > specify anyway?)
> >
> > If the file has another encoding, then the default dtype for text is
> > unicode.
>
> That's a silly idea. There's already the dtype='S' for ascii that will give
> one byte per character.
>

Except that 'S' is being translated to a bytes object, and in py3 bytes is
not really text -- see the other thread.

> However numpy.loadtxt(dtype='S') doesn't actually use ascii IIUC. It loads
> the file as text with the default system encoding,


not such a bad idea in principle, but I think with scientific data files in
particular, the file was just as likely generated on a different system, so
system settings should be avoided. My guess is that a large fraction of
systems have system encodings that are ascii-compatible, so we'll get away
with this most of the time, but explicit is better than implicit, and all
that.
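
A small sketch of the explicit-versus-implicit point, using only the standard
library. The temporary file and its contents are invented for illustration;
the assumption is that open() without an encoding normally falls back to the
locale's preferred encoding, which differs between machines:

```python
import locale
import os
import tempfile

# Bytes as another machine might have written them: latin-1, one non-ascii byte.
payload = "caf\u00e9 1.5\n".encode("latin-1")

with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(payload)
    path = handle.name

try:
    # What open() would normally use when no encoding is given; varies by machine.
    system_default = locale.getpreferredencoding(False)

    # Explicit encoding: the same result on every system.
    with open(path, encoding="latin-1") as stream:
        text = stream.read()
finally:
    os.remove(path)

print(system_default)
print(text == "caf\u00e9 1.5\n")
```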

> encodes the text with latin-1 and stores the resulting bytes into a
> dtype='S' array. I think it should just open the file in binary, read the
> bytes and store them in the dtype='S' array. The current behaviour strikes
> me as a hangover from the Python 2.x 8-bit text model.
>

not sure it's even that -- I suspect it's a broken attempt to match the py3
text model...

> Not sure about other one-byte-per-character encodings (e.g. latin-1). The
> defaults may be moot if loadtxt doesn't have auto-detection of text in
> a file anyway.
>

I'm not suggesting auto-detection, but I am suggesting the ability to
specify an encoding, and in that case, we need a default, and I don't think
it should be the system encoding.

> > This all requires that there be an obvious way for the user to spell the
> > one-byte-per-character dtype -- I think 'S' will do it.
>
> They should use 'S' and not encoding='ascii'.


that is stating implicitly that 'S' is ascii-compatible, but it gets
translated to the py3 bytes type, which the python dev folks REALLY want to
mean "arbitrary bytes", rather than 'ascii text'.

practically, it means you need to decode it to use it as text -- compare
with a string, etc...
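
In plain py3 terms (the file name is just a placeholder; a bytes object
stands in for an element of a dtype='S' array):

```python
name = b"data.txt"   # a bytes object, like the elements of a dtype='S' array on py3

print(name == "data.txt")                  # False: bytes never compare equal to str
print(name.decode("ascii") == "data.txt")  # True once decoded
```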

> If the user provides an encoding
> then it should be used to open the file and decode it to unicode, resulting
> in a dtype='U' array. (Python 3 handles this all for you).


I think it may be an important use case to pull ascii-compatible text out of
a file and put it into a 1-byte-per-character dtype (i.e. 'S'). Folks
don't necessarily want or need 4 bytes per character.

In practice this probably only makes sense if the file is in an
ascii-compatible encoding anyway, but I like the idea of keeping the file
encoding and the dtype independent.
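
A rough sketch of what keeping the two independent could look like, done by
hand with numpy rather than through loadtxt. The sample bytes are made up,
and latin-1 is only an assumed file encoding:

```python
import numpy as np

raw = b"alpha beta gamma"              # pretend this came from an ascii-compatible file
words = raw.decode("latin-1").split()  # the file encoding is applied here, once

# The storage dtype is chosen separately from the file encoding.
as_unicode = np.array(words, dtype="U")                               # 4 bytes per character
as_bytes = np.array([w.encode("latin-1") for w in words], dtype="S")  # 1 byte per character

print(as_unicode.dtype.itemsize, as_bytes.dtype.itemsize)  # 20 5
```

The longest word has five characters, so the 'U' array spends 20 bytes per
element where the 'S' array spends 5.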

> It only seems to work because you're using ascii data.
>

(or latin-1?) well, yes, but that was the OP's example. though it was file
names, so he'd probably ultimately want them as py3 strings...


> which will
> corrupt the binary form of the data if the system encoding is not
> compatible with latin-1 (e.g. ascii and latin-1 will work but utf-8 will
> not).


a good reason not to use the system default encoding!
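
The corruption is easy to reproduce in plain Python, assuming a utf-8 system
encoding on the decode side and latin-1 on the encode side, as described in
the quoted text (the sample characters are arbitrary):

```python
original = "\u00e9".encode("utf-8")   # two bytes in the file: 0xc3 0xa9

# Decode with utf-8 (the "system" encoding), re-encode with latin-1:
round_trip = original.decode("utf-8").encode("latin-1")
print(original == round_trip)         # False: the stored bytes changed
print(len(original), len(round_trip)) # 2 1

# Characters outside latin-1 cannot be re-encoded at all.
try:
    "\u20ac".encode("latin-1")        # euro sign
except UnicodeEncodeError:
    encodable = False
else:
    encodable = True
print(encodable)                      # False
```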

> > NOTE: another option is to use latin-1 all around, rather than ascii --
> > you may get garbage from the higher value bytes, but it won't barf on you.
>
> I guess you're alluding to the idea that reading/writing files as latin-1
> will pretend to seamlessly decode/encode any bytes, preserving binary data
> in any round-trip.


yes, exactly -- a practical common use case is that there are non-ascii
bytes in a data stream, but the use case doesn't care what
they are. If you use ascii, then you get exceptions you don't need to get.
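
A quick check of that property in plain Python: latin-1 maps every byte value
to exactly one code point, so any byte string survives a decode/encode round
trip, while ascii raises on the first byte above 0x7f:

```python
blob = bytes(range(256))   # every possible byte value

# latin-1 round trip is lossless for arbitrary bytes.
print(blob.decode("latin-1").encode("latin-1") == blob)   # True

# ascii refuses anything above 0x7f.
try:
    blob.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)            # False
```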


> This concept is already broken if you intend to do any processing,
> indexing or slicing of the array.


no it's not -- latin-1 is ascii-compatible (as is utf-8), so a lot
of processing will work fine -- splitting on whitespace or whatever, etc.

yes, indexing can go to heck if you have utf-8 or, of course, non-ascii
compatible encoding -- but that's never going to work without specifying an
encoding anyway.
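
Both halves of that can be seen in a small example: utf-8 bytes are read as
if they were latin-1, the ascii parts (whitespace, digits) still behave, but
the non-ascii word no longer has the right length. The sample record is made
up:

```python
# utf-8 bytes for a record whose first field contains U+00EF.
raw = "na\u00efve 1.5 2.5".encode("utf-8")
fields = raw.decode("latin-1").split()

# The ascii-only parts parse as usual.
numbers = [float(f) for f in fields[1:]]
print(numbers)          # [1.5, 2.5]

# The text field is now one character longer than the original 5-character
# word, so indexing and slicing no longer line up with real characters.
print(len(fields[0]))   # 6
```

One caveat, as far as I can tell: a bare str.split() also treats U+0085 and
U+00A0 as whitespace, so utf-8 data containing the bytes 0x85 or 0xa0 could
be split in unexpected places when decoded as latin-1. Splitting on an
explicit " " avoids that.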


> Additionally the current loadtxt behaviour
> fails to achieve this round-trip even for the 'S' dtype even if you don't
> do any processing:
>

right -- I think we agree that it's broken now.

> This is a mess. I don't know about how to handle backwards compatibility but
> the sensible way to handle this in *both* Python 2 and 3 is that dtype='S'
> opens the file in binary, reads byte strings, and stores them in an array
> with dtype='S'. dtype='U' should open the file as text with an encoding
> argument (or system default if not supplied), decode the bytes and create an
> array with dtype='U'.


agreed -- except for the system encoding part....
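
As a rough sketch of that behaviour, with an explicit default instead of the
system encoding: load_strings below is a hypothetical helper, not a numpy
API, and the ascii default and the sample file are assumptions made for
illustration only.

```python
import os
import tempfile

import numpy as np


def load_strings(path, dtype="S", encoding="ascii"):
    """Hypothetical helper (not part of numpy): one string per line."""
    if np.dtype(dtype).kind == "S":
        # Binary mode: the bytes go into the array untouched.
        with open(path, "rb") as stream:
            items = [line.rstrip(b"\r\n") for line in stream]
    else:
        # Text mode with an explicit encoding, never the system default.
        with open(path, encoding=encoding) as stream:
            items = [line.rstrip("\n") for line in stream]
    return np.array(items, dtype=dtype)


# Made-up sample file: two lines, latin-1 encoded, one non-ascii byte.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write("caf\u00e9\ntea\n".encode("latin-1"))
    path = handle.name

try:
    as_bytes = load_strings(path, dtype="S")
    as_text = load_strings(path, dtype="U", encoding="latin-1")
finally:
    os.remove(path)

print(as_bytes.dtype, as_text.dtype)
```

With dtype='S' the non-ascii byte is stored as-is and no encoding is ever
consulted; with dtype='U' the caller has to say what the bytes mean.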


> The only reasonable difference between Python 2 and 3 is which of
> these two behaviours dtype=str should do.


well, str is a py3 string in py3 -- so it should be dtype 'U'. Personally,
I avoid using the native types for dtype arguments anyway, so users should
use:

dtype=np.unicode
or
dtype=np.string0 (or np.string_) -- or????

How do you spell the dtype that 'S' gives you????
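
One way to check the spelling, assuming a NumPy running under Python 3 (this
uses np.bytes_ and np.str_, which as far as I know are the scalar types
behind the 'S' and 'U' codes; the other aliases vary between NumPy versions,
so they are left out here):

```python
import numpy as np

# np.bytes_ and np.str_ as explicit spellings of the 'S' and 'U' dtypes.
print(np.dtype("S") == np.dtype(np.bytes_))   # True
print(np.dtype("U") == np.dtype(np.str_))     # True

# On py3 the builtin str maps to the unicode dtype and bytes to the 'S' dtype.
print(np.dtype(str).kind, np.dtype(bytes).kind)   # U S
```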

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov