[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Aldcroft, Thomas aldcroft at head.cfa.harvard.edu
Fri Jan 17 08:09:00 EST 2014


On Fri, Jan 17, 2014 at 5:59 AM, Pauli Virtanen <pav at iki.fi> wrote:

> Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> [clip]
> > - inconvenience in dealing with strings in python 3.
> >
> > bytes are not strings in python3 which means ascii data is either a byte
> > array which can be inconvenient to deal with or 4 byte unicode which
> > wastes space.
> >
> > A proposal to fix this would be to add a one or two byte dtype with a
> > specific encoding that behaves similar to bytes but converts to string
> > when outputting to python for comparisons etc.
> >
> > For backward compatibility we *cannot* change S. Maybe we could change
> > the meaning of 'a' but it would be safer to add a new dtype, possibly
> > 'S' can be deprecated in favor of 'B' when we have a specific encoding
> > dtype.
> >
> > The main issue is probably: is it worth it and who does the work?
>
> I don't think this is a good idea: the bytes vs. unicode separation in
> Python 3 exists for a good reason. If unicode is not needed, why not just
> use the bytes data type throughout the program?
>

I've been playing around with porting a stack of analysis libraries to
Python 3 and this is a very timely thread and comment.  What I discovered
right away is that all the string data coming from binary HDF5 files show
up (as expected) as 'S' type, but that trying to make everything actually
work in Python 3 without converting to 'U' is a big mess of whack-a-mole.

Yes, it's possible to change my libraries to use bytestring literals
everywhere, but the Python 3 user experience becomes horrible because to
interact with the data all downstream applications need to use bytestring
literals everywhere.  E.g. doing a simple filter like `string_array ==
'foo'` doesn't work, and this will break all existing code when trying to
run in Python 3.  And every time you try to print something it has this
horrible "b" in front.  Ugly, and it just won't work well in the end.
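
To make the complaint concrete, here is a minimal sketch of the behavior
described above. The array contents are made up and merely stand in for
string data read from an HDF5 file; the exact result of the str comparison
(a scalar False with a warning, or an all-False array) depends on the NumPy
version, so the sketch only checks whether anything matched.

```python
import warnings

import numpy as np

# Hypothetical stand-in for 'S'-type data read from a binary file.
string_array = np.array([b'foo', b'bar', b'foo'])

# Under Python 3 a str literal never matches the bytes elements.
# Older NumPy versions emit a warning here, so it is silenced.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    str_match_found = bool(np.any(string_array == 'foo'))

# A bytes literal is needed to get the intended elementwise filter.
bytes_mask = string_array == b'foo'

print(str_match_found)        # whether the str filter matched anything
print(bytes_mask.tolist())    # elementwise result for the bytes literal
print(string_array.tolist())  # the elements display with the "b" prefix
```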

Following the excellent advice at http://nedbatchelder.com/text/unipain.html,
I've come to the conclusion that the only way to support Python 3 is to
bite the bullet and do the "unicode sandwich".  That is to say convert all
external bytestring values to 'U' arrays for internal (and user)
manipulation, and back to 'S' for delivery to files / network etc.  This is
a pain and very inefficient, but at least the Python 3 user experience
is natural and pleasant.  I figure if you are manipulating anything less
than ~1 GB of text data then it won't be a disaster.
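
A rough sketch of the "unicode sandwich" for NumPy arrays, assuming the
external data is pure ASCII. The helper names to_unicode and to_bytes are
hypothetical and not part of NumPy; they only wrap np.char.decode and
np.char.encode.

```python
import numpy as np


def to_unicode(arr, encoding='ascii'):
    """Decode an 'S' array to a 'U' array at the input boundary."""
    return np.char.decode(arr, encoding)


def to_bytes(arr, encoding='ascii'):
    """Encode a 'U' array back to 'S' at the output boundary."""
    return np.char.encode(arr, encoding)


# Hypothetical bytes data as it would arrive from a file or the network.
raw = np.array([b'foo', b'bar', b'foo'])

# Inside the program everything is 'U', so plain str literals work.
text = to_unicode(raw)
mask = text == 'foo'

# Convert back to 'S' only when handing the data to files / network.
out = to_bytes(text[mask])

print(text.dtype.kind, mask.tolist(), out.dtype.kind)
```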

The upshot from this is that I would be very much in favor of solutions
that address the inefficiency issue of using 4 bytes / character in the
common use-case of pure-ASCII strings.  Right now this is the single
biggest issue I see for migrating to Python 3.  Otherwise making the code
Python 2 / 3 compatible wasn't too difficult.
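
The 4 bytes / character cost can be seen directly from the item sizes.
This is only an illustration; the array length and the 10-character width
are arbitrary choices.

```python
import numpy as np

# 1000 fixed-width strings of up to 10 characters each.
s_arr = np.zeros(1000, dtype='S10')   # 1 byte per character
u_arr = s_arr.astype('U10')           # 4 bytes per character

print(s_arr.itemsize, u_arr.itemsize)
print(s_arr.nbytes, u_arr.nbytes)
```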

- Tom


>
> (Also, assuming that ASCII is in general good for text-format data is
> quite US-centric.)
>
> Christopher Barker wrote:
> >
> > How do you spell the dtype that 'S' gives you?
> >
>
> 'S' is bytes.
>
> dtype='S', dtype=bytes, and dtype=np.bytes_ are all equivalent.
>
> --
> Pauli Virtanen
>