I just re-read the "UTF-8 Everywhere" manifesto, and it helped me clarify my thoughts:

1) Most of it is focused on UTF-8 vs. UTF-16. And that is a strong argument -- UTF-16 is the worst of both worlds.

2) It doesn't really address how to deal with the fixed-size string storage that numpy needs.

It does bring up Python's current approach to Unicode:

"""
This lead to software design decisions such as Python’s string O(1) code point access. The truth, however, is that Unicode is inherently more complicated and there is no universal definition of such thing as Unicode character. We see no particular reason to favor Unicode code points over Unicode grapheme clusters, code units or perhaps even words in a language for that. 
"""

My thoughts on that: it's technically correct, but practicality beats purity, and the character concept is pretty darn useful for at least some languages (including the ones commonly used in the computing world).

In any case, whether the top-level API is character-focused doesn't really have a bearing on the internal encoding, which is very much an implementation detail, in Python 3 at least.

And Python has made its decision about that.

So what are the numpy use-cases?

I see essentially two:

1) Use with/from Python -- both creating and working with numpy arrays.

In this case, we want something compatible with Python's string (i.e. with full Unicode support), and I think it should be as transparent as possible. Python's string has made the decision to present a character-oriented API to users (despite what the manifesto says...).

However, there is a challenge here: numpy requires fixed-number-of-bytes dtypes, and full Unicode support with a fixed number of bytes matching a fixed number of characters is only possible with UCS-4 -- hence the current implementation. And this is actually just fine! I know we all want to be efficient with data storage, but really: in the early days of Unicode, when folks thought 16 bits were enough, doubling the memory usage for western-language storage was considered fine -- how long, in computer lifetimes, does it take to double your memory? Now that memory, disk space, and bandwidth are all literally orders of magnitude larger, can't we handle a factor-of-4 increase in "wasted" space?
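To make the UCS-4 point concrete, here's a small sketch of what numpy's existing unicode dtype does today (the array contents are just illustrative):

```python
import numpy as np

# NumPy's current unicode dtype ('U') stores UCS-4: always 4 bytes per
# character, so a fixed character count maps to a fixed byte count.
a = np.array(["spam", "café"], dtype="U4")
assert a.dtype.itemsize == 16   # 4 characters * 4 bytes each
assert a[1] == "café"           # any Unicode code point fits, no encoding errors
```

That factor-of-4 overhead for ASCII-heavy data is exactly the "wasted" space being debated.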

Alternatively, Robert's suggestion of having essentially an object array, where the objects are known to be Python strings, is a pretty nice idea -- it gives the full power of Python strings, and is a perfect one-to-one match with the Python text data model.
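The object-array approach already works today, just without the "known to be strings" guarantee -- a quick sketch:

```python
import numpy as np

# An object array holding Python str objects: full Unicode, variable length,
# a one-to-one match with Python's text model (at the cost of a pointer
# indirection per element, and no fixed-width binary layout).
s = np.array(["short", "a much longer string", "🐍"], dtype=object)
assert all(isinstance(x, str) for x in s)
assert s[2] == "🐍"
```

The proposal would essentially be this, plus the dtype enforcing that every element really is a str.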

But as scientific text data is often one-byte compatible, a one-byte-per-char dtype is a fine idea, too -- and we pretty much have that already with the existing string type. It could simply be enhanced by enforcing the encoding to be latin-9 (or latin-1, if you don't want the Euro symbol). This would give scientists what they expect from strings, in a way that is properly compatible with Python's string type. You'd get encoding errors if you tried to stuff anything else in there, and that's that.
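The proposed one-byte dtype doesn't exist yet, so as a sketch, here is the latin-1 behavior it would rely on -- one byte per character for text in its repertoire, and a hard error for anything outside it:

```python
# Text within the latin-1 repertoire fits in one byte per character.
ok = "résumé".encode("latin-1")
assert len(ok) == len("résumé")   # 6 characters -> 6 bytes

# Anything outside latin-1 raises rather than being stored as garbage --
# the error behavior the proposed dtype would inherit.
try:
    "🐍".encode("latin-1")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```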

Yes, it would have to be a "new" dtype for backwards compatibility.

2) Interchange with other systems: passing the raw binary data back and forth between numpy arrays and other code written in C or Fortran, or binary file formats.

This is a key use-case for numpy -- I think it's the key to numpy's enormous success. But how important is it for text? Certainly any data set I've ever worked with has had gobs of binary numerical data and a small smattering of text. So in that case, if, for instance, h5py had to encode/decode text when transferring between HDF5 files and numpy arrays, I don't think I'd ever see the performance hit. As for code complexity -- it would mean more complex code in interface libs, and less complex code in numpy itself (though numpy could provide utilities to make it easy to write the interface code).
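A sketch of what that boundary encode/decode could look like for an interface library (the field names here are made up; the point is the round trip through numpy's existing `np.char` helpers):

```python
import numpy as np

# Raw bytes as they might sit in a file: fixed-width, known encoding.
raw = np.array([b"temp", b"salinity"], dtype="S8")

# On read: decode into a unicode array that behaves like Python strings.
text = np.char.decode(raw, "ascii")
assert text.dtype.kind == "U"

# On write: encode back to the on-disk byte layout.
back = np.char.encode(text, "ascii")
assert (back == raw).all()
```

The cost of this conversion is per-text-element, which is why it disappears next to the bulk numerical data.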

If we do want to support direct binary interchange with other libs, then we should probably simply go for it, and support any encoding that Python supports -- as long as you are dealing with multiple encodings, why try to decide up front which ones to support?

But how do we expose this to numpy users? I still don't like having non-fixed-width encoding under the hood, but what can you do? Other than that, having the encoding be a selectable part of the dtype works fine -- and in that case the number of bytes should be the "length" specifier. 

This, however, creates a bit of an impedance mismatch with the character-focused approach of the Python string type, and requires the user to understand something about the encoding just to know how many bytes they need -- a utf-8-100 string will hold a different "length" of string than a utf-16-100 string.
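The mismatch is easy to demonstrate with plain Python: the same text needs a different number of bytes in each encoding, so a byte-count "length" means a different character capacity per encoding:

```python
s = "naïve"                              # 5 characters
assert len(s.encode("latin-1")) == 5     # 1 byte per character
assert len(s.encode("utf-8")) == 6       # "ï" takes 2 bytes in UTF-8
assert len(s.encode("utf-16-le")) == 10  # 2 bytes per character here
```

So "how many characters fit in N bytes?" has no encoding-independent answer.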

So -- I think we should address the use-cases separately -- one for "normal" python use and simple interoperability with python strings, and one for interoperability at the binary level. And an easy way to convert between the two.

For Python use -- a pointer to a Python string would be nice.

Then use a native flexible-encoding dtype for everything else.

Thinking out loud -- another option would be to set defaults for the multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility with the python string type -- and make folks make an effort to get anything else.

One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:

A UnicodeEncodeError if it can't be encoded into the defined encoding.

A ValueError if it is too long -- it should not be silently truncated.
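For contrast, here is today's behavior, which the proposal would change -- an assignment that doesn't fit is silently truncated rather than raising:

```python
import numpy as np

# Current numpy fixed-width unicode dtype: a too-long assignment is
# silently chopped to fit, with no error raised.
a = np.zeros(1, dtype="U3")
a[0] = "truncated"
assert a[0] == "tru"   # the proposal: raise ValueError here instead
```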

-CHB

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov