OK -- onto proposals:

1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do::

  arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or less characters.

and arr[1] will return a native Python3 string object.

This is the use-case for "casual" numpy users -- not the folks writing H5py and the like, or the ones writing Cython bindings to C++ libs.

I see two options here:

a) The current 'U' dtype -- fully meets the specs, and is already there.

b) Having a pointer-to-a-python string dtype:

    -I take it that's what Pandas does and people seem happy.

    -That would get us variable length strings, and potentially other nifty string-processing.

   - It would lose the ability to interact at the binary level with other systems -- but do any other systems use UCS-4 anyway? 
  
   - how would it work with pickle and numpy zip storage?

Personally, I'm fine with (a), but (b) seems like it could be a nice addition. As the 'U' type already exists, the choice to add a python-string type is really orthogonal to the rest of this discussion.

Note that I think using utf-8 internally to fit his need is a mistake -- it does not match well with the Python string model.

That's it for use-case (1)

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov