[Numpy-discussion] String type again.

Charles R Harris charlesr.harris at gmail.com
Fri Jul 18 17:49:20 EDT 2014


On Fri, Jul 18, 2014 at 3:30 PM, Andrew Collette <andrew.collette at gmail.com>
wrote:

> Hi Chris,
>
> > Actually, I agree about the truncation issue, but it's a question of where
> > to put it -- I'm suggesting that I don't want it at the python<->numpy
> > interface.
>
> Yes, that's a good point.  Of course, by using Latin-1 rather than
> UTF-8 we can't support all Unicode code points (hence the "?"
> replacement possible on read from HDF5).
>
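
The "?" replacement mentioned above can be illustrated with Python's built-in codecs. This is only a sketch of the general behavior, not h5py's actual read path; the sample string is made up for illustration.

```python
# Illustrative only: plain Python codecs, not h5py internals.
text = "naïve Ω"  # 'ï' exists in Latin-1, 'Ω' (U+03A9) does not

# errors="replace" substitutes "?" for anything Latin-1 cannot represent.
latin1 = text.encode("latin-1", errors="replace")
utf8 = text.encode("utf-8")

print(latin1.decode("latin-1"))  # the Ω has been replaced by "?"
print(len(latin1), len(utf8))    # Latin-1 is one byte per character; UTF-8 is not
```

The Latin-1 form keeps a fixed one byte per character, at the cost of losing every code point above U+00FF.
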
> > do vlen strings support full unicode? -- then, yes, that's good.
>
> Yes, they do.  It's somewhat unfortunate to immediately cast to vlen
> though, since people usually have fixed-width datasets to start with
> for efficiency reasons...
>
> > what about reading from fixed-width UTF-8 to 'U' -- that seems like the
> > natural way to go for unicode. Though a bit hard to know how long U needs
> > to be -- but <= the length of the utf-8 array (in characters).
>
> Space concerns ("U" has a 4x space penalty for ASCII-ish data).  Plus,
> for similar reasons to this discussion, creating "U" datasets is
> unsupported at the moment.
>
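
A rough way to see the 4x penalty, assuming only that NumPy's "U" type stores four bytes per character (UCS-4). The sketch below uses Python's UTF-32 codec as a stand-in for that layout; the sample value is hypothetical.

```python
text = "temperature"  # ASCII-only sample value (hypothetical)

as_utf8 = text.encode("utf-8")      # 1 byte per ASCII character
as_ucs4 = text.encode("utf-32-le")  # 4 bytes per character, no byte-order mark

print(len(text), len(as_utf8), len(as_ucs4))
```

For text outside ASCII the ratio shrinks, since UTF-8 then needs two to four bytes per character as well.
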
> > note that I'm also proposing a "bytes" dtype, which might make sense for
> > grabbing utf-8 data from HDF-5. Then either h5py or the user could decode
> > to a unicode type.
>
> Sounds quite like the existing 'S' type.
>
> >> In any case, I can say that the lack of a text 'S' type in NumPy has
> >> been a significant pain point for h5py users on Python 3 over the
> >> years.
> >
> > isn't the current 'S' a pretty good map to HDF5 ASCII?
>
> Yes; in fact, right now all fixed-width strings in h5py (ASCII and
> UTF-8) are read/written as 'S'.  The problem is that on Py3, 'S' is
> treated as bytes, not text, so you can't freely mix it with str.
>
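
The Py3 friction described above shows up with plain Python objects, which is roughly what an element of an 'S' array turns into. A minimal sketch with made-up values:

```python
raw = b"sensor_a"   # stand-in for one element read back as 'S' (bytes)
label = "sensor_a"  # ordinary Python 3 text

print(raw == label)  # bytes and str never compare equal on Python 3

try:
    raw + "_v2"      # mixing bytes and str in an expression
except TypeError as exc:
    mixed_error = type(exc).__name__
else:
    mixed_error = None
print(mixed_error)

# An explicit decode is needed before the value behaves like text.
decoded = raw.decode("ascii")
print(decoded == label)
```
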
> I am about to leave for the weekend... thanks for a great discussion!
> To conclude, it strikes me that in choosing an encoding we get to pick
> at most two of the following:
>
> 1. Support for all Unicode characters
> 2. Fixed number of characters
> 3. Fixed number of storage bytes
>
> At this point, I would vote for UTF-8 in a fixed width buffer (1/3),
> but I imagine as this progresses towards a NEP others will weigh in.
>
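
One possible reading of the "UTF-8 in a fixed width buffer" option (1/3) is sketched below. The helper name `fit_utf8` and the null padding are assumptions made for illustration (the padding mirrors how 'S' pads short values); nothing here is an existing NumPy or h5py API.

```python
def fit_utf8(text: str, width: int) -> bytes:
    """Encode text as UTF-8 into exactly `width` bytes (hypothetical helper)."""
    data = text.encode("utf-8")[:width]
    # A byte-level cut can split a multi-byte character; drop the partial tail.
    data = data.decode("utf-8", errors="ignore").encode("utf-8")
    # Pad with NUL bytes so every value occupies the same storage.
    return data.ljust(width, b"\x00")


def unfit_utf8(buf: bytes) -> str:
    """Strip the padding and decode back to text."""
    return buf.rstrip(b"\x00").decode("utf-8")


for sample in ("abcdef", "ééé", "日本語"):
    buf = fit_utf8(sample, 4)
    print(len(buf), repr(unfit_utf8(buf)))
```

Every buffer is four bytes, but it holds four, two, or one character depending on the text, which is the "fixed number of characters" property being given up.
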

At some point, I'm pretty sure we will want to support UTF-8, as it looks
well on its way to becoming a universal standard.

Chuck