Re: [Numpy-discussion] proposal: smaller representation of string arrays

25 Apr 2017

      OK -- onto proposals:

1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a
character-oriented interface. For example, if you do::

    arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or fewer
characters, and arr[1] will return a native Python3 string object.
This is the use-case for "casual" numpy users -- not the folks writing
H5py and the like, or the ones writing Cython bindings to C++ libs.
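
A minimal sketch of what the existing 'U' dtype does for this use-case. The
replacement strings are made up for illustration. One detail that is an
observation about current numpy rather than part of the proposal: indexing
returns a `numpy.str_`, which is a subclass of the Python `str` type.

```python
import numpy as np

arr = np.array(("this", "that",))

# The dtype is a fixed-width unicode type: 4 characters per element,
# stored as 4 bytes per character (UCS-4), so 16 bytes per element.
print(arr.dtype, arr.itemsize)

# Any 4-character unicode string fits, not only ASCII.
arr[0] = "日本語ö"
print(arr[0], len(arr[0]))

# Indexing gives back something that behaves as a Python str.
item = arr[1]
print(type(item), isinstance(item, str))

# Longer strings are silently truncated to the element width.
arr[1] = "longer"
print(arr[1])
```

The truncation at the end is the price of the fixed width: the array can
hold any string of up to 4 characters, but nothing longer.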
I see two options here:

a) The current 'U' dtype -- fully meets the specs, and is already there.

b) Having a pointer-to-a-python string dtype:

   - I take it that's what Pandas does and people seem happy.

   - That would get us variable length strings, and potentially other nifty
     string-processing.

   - It would lose the ability to interact at the binary level with other
     systems -- but do any other systems use UCS-4 anyway?

   - How would it work with pickle and numpy zip storage?

Personally, I'm fine with (a), but (b) seems like it could be a nice
addition. As the 'U' type already exists, the choice to add a python-string
type is really orthogonal to the rest of this discussion.
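
To make the trade-off between (a) and (b) concrete, here is a rough sketch
that uses numpy's existing `object` dtype as a stand-in for a
pointer-to-a-python-string dtype. That substitution is an assumption: the
dtype discussed in (b) does not exist here, and a real one could behave
differently. The sample strings are invented.

```python
import io
import pickle

import numpy as np

words = ["this", "that", "a much longer string"]

# (a) Fixed-width 'U' array: every element is as wide as the longest string.
fixed = np.array(words)
# (b) stand-in: each element is a pointer to an ordinary Python str.
boxed = np.array(words, dtype=object)

print(fixed.dtype, fixed.itemsize)  # width driven by the 20-character string
print(boxed.dtype, boxed.itemsize)  # just the size of a pointer

# Elements of the object array are plain Python strings of any length.
boxed[0] = "strings can grow without changing the dtype"

# pickle round-trips an object array.
restored = pickle.loads(pickle.dumps(boxed))

# np.save works too, but loading requires opting in to pickle.
buf = io.BytesIO()
np.save(buf, boxed, allow_pickle=True)
buf.seek(0)
loaded = np.load(buf, allow_pickle=True)
```

With this stand-in, storage goes through pickle rather than a raw binary
layout, which is the loss of binary-level interoperability mentioned in the
list above.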

Note that I think using utf-8 internally to fit this need is a mistake -- it
does not match well with the Python string model.
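
A small sketch of the mismatch, using only the standard library. The helper
name `widths` and the sample strings are hypothetical. It compares the
character count Python reports with the byte counts under utf-8 and under
UCS-4 (spelled "utf-32-le" in Python's codecs).

```python
def widths(s):
    """Return (characters, utf-8 bytes, UCS-4 bytes) for a string."""
    return (len(s), len(s.encode("utf-8")), len(s.encode("utf-32-le")))


samples = ["abcd", "äbcd", "日本語!", "😀😀😀😀"]
for s in samples:
    n_chars, n_utf8, n_ucs4 = widths(s)
    # Every sample has 4 characters and 16 UCS-4 bytes,
    # but the utf-8 size ranges from 4 to 16 bytes.
    print(repr(s), n_chars, n_utf8, n_ucs4)
```

With UCS-4, "4 characters" always means 16 bytes, so a fixed-width element
maps directly onto a character count. With utf-8, a fixed number of bytes
holds a varying number of characters depending on the text.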

That's it for use-case (1)

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
