[Numpy-discussion] String type again.

Chris Barker chris.barker at noaa.gov
Thu Jul 17 17:05:26 EDT 2014


On Wed, Jul 16, 2014 at 3:48 AM, Todd <toddrjen at gmail.com> wrote:

> On Jul 16, 2014 11:43 AM, "Chris Barker" <chris.barker at noaa.gov> wrote:
> > So numpy should have dtypes to match these. We're a bit stuck, however,
> because 'S' mapped to the py2 string type, which no longer exists in py3.
> Sorry not running py3 to see what 'S' does now, but I know it's bit broken,
> and may be too late to change it
>
> In py3 a 'S' dtype is converted to a python bytes object.
>
right -- thanks. That's the source of the problems.
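
To make that concrete, here is a minimal sketch of what comes back out of an
'S' array. It assumes Python 3 and a reasonably recent numpy; the array
contents are made up for illustration.

```python
import numpy as np

# Illustrative data only; any fixed-width 'S' array behaves the same way.
arr = np.array([b"abc", b"de"], dtype="S3")

item = arr[0]
# numpy's bytes scalar subclasses the built-in bytes type on py3.
is_bytes = isinstance(item, bytes)
is_text = isinstance(item, str)

# Getting text back requires an explicit decode step.
as_text = item.decode("ascii")
as_list = arr.tolist()  # plain Python bytes objects

print(is_bytes, is_text, as_text, as_list)
```

So anything that used to treat 'S' elements as text on py2 needs a decode
step on py3.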

A bit of a higher-level view of the issues at hand.

Python has three relevant data types:

A unicode type (unicode in py2, str in py3)
A one-byte-per-char string type (py2 string)
A bytes type

The big problem is that py2 only has the unicode and py2string types, and
py3 only has the unicode and bytes types.

numpy has 'S' and 'U' types, which map naturally to the py2string and
unicode types.

But since py3 has no py2string type, we have a problem.

If numpy were to embrace the py3 model, then 'S' should have mapped to
py3's string, aka unicode.

But:

1) then there would be no bytes type, which is a problem, as people do need
to pass collections of bytes around. I've always figured numpy's uint8
should suffice for that, but "strings of bytes" are useful, and it seems to
be awkward, or maybe impossible, to construct such a beast with the usual
dtype machinery.

2) there is a need (or at least a desire) to have a compact,
one-byte-per-character text type in numpy.

Thinking of it in this framework leads me to the conclusion that numpy
should have three types:

1) A unicode type --no change here

2) A bytes type -- almost the current 'S' type
    - A bytes type would map to/from py3 bytes objects (and py2 bytes
objects, which are the same as py2strings)
    - one way it would differ from a py2str is that there would be no
assumption of null-termination (not sure where that is now)

3) A one-byte-per-char text type -- more or less Chuck's current proposal.
   - it would map to/from the py3 string -- it is text after all
   - it would be null-terminated
   - it would have a one-byte per-char encoding: ascii, latin-1 or settable
(TBA)
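
A rough sketch of why the distinction between these types matters, using the
existing 'S' and 'U' dtypes. It assumes a reasonably recent numpy on py3; the
sample values are hypothetical.

```python
import numpy as np

# Storage cost per element for five characters.
s_size = np.dtype("S5").itemsize  # one byte per character
u_size = np.dtype("U5").itemsize  # four bytes per character (UCS-4)

# 'S' treats trailing null bytes as padding when an element is read back.
raw = np.array([b"ab\x00"], dtype="S3")
read_back = raw[0]
buffer_bytes = raw.tobytes()  # the underlying 3-byte buffer

print(s_size, u_size, read_back, buffer_bytes)
```

The buffer still holds the trailing null, but the element read back from the
array does not, which is fine for text and lossy for arbitrary binary data.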

It would be nice if numpy had built-in encoding/decoding between the
unicode type and the bytes type (tricky, since you can't know how many
bytes a given string will encode to without encoding it).
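
The sizing problem can be seen with plain Python, no numpy needed. This is
only a sketch; the sample strings are arbitrary.

```python
# Arbitrary sample strings: pure ascii, one accented character, two CJK
# characters (written as escapes to keep this source ascii-only).
samples = ["abc", "caf\u00e9", "\u65e5\u672c"]

# Pair each string's character count with its utf-8 encoded byte count.
lengths = [(len(s), len(s.encode("utf-8"))) for s in samples]

for s, (n_chars, n_bytes) in zip(samples, lengths):
    print(repr(s), n_chars, "chars ->", n_bytes, "utf-8 bytes")
```

The character count alone does not tell you the encoded size, so a
fixed-width bytes array can only be sized after encoding every element (or
by allocating for the worst case of four bytes per character).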

Which leaves us with the decisions:

* what does 'S' map to?
  - currently it's almost a bytes type, and maps to bytes in py3 -- so
maybe keep that status quo. Except that it really doesn't act like text
anymore, so the 2-to-3 transition is kind of ugly, and the name is misleading.

* what encoding to use for the one-byte-per-char-text-type?
   - I think latin-1 is the way to go -- you could use it like ascii if
you want, but if you need a few other characters they are there. And you
can even store binary data in it, though that's a "bad idea" anyway.
  - ascii would solve common use cases, but I see no reason to restrict
folks to 128 characters -- you can stick to those if you like. If the binary
data needs to get passed to something that really needs to be ascii-only,
it could be checked at that point.
  - perhaps the best option is for client code to be able to choose an
encoding -- but more code, maybe a more confusing interface? worth it?

* Do we have a utf-8 type? I think not -- its variable-length encoding
simply does not fit both unicode and numpy's fixed-length requirement.
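
The encoding trade-offs in this list can be checked with plain Python. A
minimal sketch; `can_decode` is a hypothetical helper written for this
example, not a numpy or stdlib function.

```python
def can_decode(data, encoding):
    """Return True if every byte in `data` decodes under `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Every possible single-byte value, 0 through 255.
all_bytes = bytes(range(256))

# latin-1 maps each byte to exactly one character, so it round-trips.
text = all_bytes.decode("latin-1")
round_trips = text.encode("latin-1") == all_bytes

results = {enc: can_decode(all_bytes, enc)
           for enc in ("latin-1", "ascii", "utf-8")}
print(len(text), round_trips, results)
```

latin-1 accepts any byte and is always one byte per character; ascii
rejects anything above 127, and utf-8 rejects byte sequences that are not
valid multi-byte encodings.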

If all this gets done, we have some transition issues, but I think it would
solve everyone's problems (though maybe not as cleanly as we'd like...).

For instance, if someone needs to map numpy arrays to utf-8 data (e.g.
HDF5), then they can either use the bytes type and let the user decode, or
encode/decode to unicode on i/o.
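
A sketch of the "bytes type, decode on the way out" option, using today's
'S' dtype as a stand-in for the proposed bytes type. No HDF5 library is
involved here, and the field names are invented for the example.

```python
import numpy as np

# Hypothetical text values that need to be stored as utf-8.
names = ["temp", "caf\u00e9", "\u65e5\u672c"]

# Encode on the way in; numpy sizes the 'S' dtype to the longest element.
encoded = [s.encode("utf-8") for s in names]
stored = np.array(encoded)
width = stored.dtype.itemsize  # bytes per element, not characters

# Decode on the way out, element by element.
decoded = [b.decode("utf-8") for b in stored.tolist()]

print(stored.dtype, width, decoded)
```

The array width is in bytes, so the user has to think in encoded lengths
when choosing a dtype by hand.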

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov