On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker <chris.barker@noaa.gov> wrote:
1) Use with/from Python -- both creating and working with numpy arrays.
In this case, we want something compatible with Python's string (i.e. with full Unicode support), and I think it should be as transparent as possible. Python's string has made the decision to present a character-oriented API to users (despite what the manifesto says...).
Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size. We already have strong precedent for dtypes reflecting the number of bytes used for storage even when Python doesn't: consider numeric types like int64 and float32 compared to the Python equivalents. It's an intrinsic aspect of NumPy that users need to think about how their data is actually stored.
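To make the precedent above concrete, here is a small sketch (not part of the original message) showing that NumPy dtype names already encode storage width. It only uses `np.dtype(...).itemsize`; the variable names are just for illustration.

```python
import numpy as np

# itemsize reports the number of bytes one element occupies in the array buffer.
int64_bytes = np.dtype(np.int64).itemsize      # name says 64 bits
float32_bytes = np.dtype(np.float32).itemsize  # name says 32 bits
s10_bytes = np.dtype("S10").itemsize           # fixed-width bytes string, 10 bytes

print(int64_bytes, float32_bytes, s10_bytes)   # 8 4 10
```

The same convention is what a bytes-based size for a text dtype would extend.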
However, there is a challenge here: numpy requires fixed-number-of-bytes dtypes. And full Unicode support with a fixed number of bytes matching a fixed number of characters is only possible with UCS-4 -- hence the current implementation. And this is actually just fine! I know we all want to be efficient with data storage, but really -- in the early days of Unicode, when folks thought 16 bits were enough, doubling the memory usage for western language storage was considered fine -- how long in computer lifetime does it take to double your memory? But now, when memory, disk space, bandwidth, etc., are all literally orders of magnitude larger, we can't handle a factor of 4 increase in "wasted" space?
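For reference, the factor of 4 mentioned above can be checked directly. This is a rough sketch with made-up sample words, comparing the existing UCS-4 unicode dtype against the one-byte-per-element bytes dtype for ASCII-only data.

```python
import numpy as np

# Hypothetical sample data: ASCII-only labels of at most 11 characters.
words = ["temperature", "pressure", "salinity"]

ucs4 = np.array(words)                                   # unicode dtype, 4 bytes per character
one_byte = np.array([w.encode("ascii") for w in words])  # bytes dtype, 1 byte per character

ratio = ucs4.nbytes // one_byte.nbytes
print(ucs4.dtype.itemsize, one_byte.dtype.itemsize, ratio)  # 44 11 4
```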
Storage cost is always going to be a concern. Arguably, it's even more of a concern today than it used to be, because compute has been improving faster than storage.
But as scientific text data often is 1-byte compatible, a one-byte-per-char dtype is a fine idea, too -- and we pretty much have that already with the existing string type -- that could simply be enhanced by enforcing the encoding to be latin-9 (or latin-1, if you don't want the Euro symbol). This would get us what scientists expect from strings in a way that is properly compatible with Python's string type. You'd get encoding errors if you tried to stuff anything else in there, and that's that.
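A minimal sketch of the behaviour described above, using only Python's built-in codecs (no NumPy involved). `encode_one_byte` is a hypothetical helper, and `iso8859-15` is the codec name Python uses for latin-9.

```python
def encode_one_byte(text, encoding="latin-1"):
    """Encode text with a single-byte codec.

    Raises UnicodeEncodeError if a character has no mapping in the codec.
    """
    return text.encode(encoding)

latin1_ok = encode_one_byte("caf\u00e9")                 # e-acute exists in latin-1
latin9_euro = encode_one_byte("5 \u20ac", "iso8859-15")  # Euro sign exists in latin-9

try:
    encode_one_byte("5 \u20ac", "latin-1")               # Euro sign is not in latin-1
    latin1_euro_failed = False
except UnicodeEncodeError:
    latin1_euro_failed = True

print(latin1_ok, latin9_euro, latin1_euro_failed)
```

In both codecs every encodable character takes exactly one byte, which is what makes a fixed-width one-byte-per-char dtype possible.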
I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
So -- I think we should address the use-cases separately -- one for "normal" python use and simple interoperability with python strings, and one for interoperability at the binary level. And an easy way to convert between the two.
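On the UTF-8 point raised above: a quick sketch of how many bytes the same strings need in UTF-8 versus UCS-4. `byte_widths` is a hypothetical helper, and the sample strings are invented for illustration.

```python
def byte_widths(text):
    """Return (number of characters, UTF-8 bytes, UCS-4 bytes) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    ucs4_bytes = len(text.encode("utf-32-le"))  # 4 bytes per code point, no BOM
    return len(text), utf8_bytes, ucs4_bytes

ascii_widths = byte_widths("temperature")  # pure ASCII
accent_widths = byte_widths("caf\u00e9")   # one accented Latin letter
euro_widths = byte_widths("5 \u20ac")      # contains the Euro sign

print(ascii_widths, accent_widths, euro_widths)
# (11, 11, 44) (4, 5, 16) (3, 5, 12)
```

For ASCII the UTF-8 byte count equals the character count; outside ASCII it does not, so a fixed-size UTF-8 dtype would have to be sized in bytes rather than characters.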
For Python use -- a pointer to a Python string would be nice.
Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information. Then use a native flexible-encoding dtype for everything else.
No opposition here from me. Though again, I think utf-8 alone would also be enough.
Thinking out loud -- another option would be to set defaults for the multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility with the python string type -- and make folks make an effort to get anything else.
The np.unicode_ type is already UCS-4 and the default for dtype=str on Python 3. We probably shouldn't change that, but if we set any default encoding for the new text type, I strongly believe it should be utf-8.
One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:
EncodingError if it can't be encoded into the defined encoding.
ValueError if it is too long -- it should not be silently truncated.
I think we all agree here.
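As an illustration of the error behaviour agreed on above, here is a rough sketch using today's fixed-width bytes dtype. `checked_assign` is a hypothetical helper, not an existing or proposed NumPy API, and it uses Python's built-in UnicodeEncodeError as a stand-in for the "EncodingError" named in the list.

```python
import numpy as np

def checked_assign(arr, index, value, encoding="utf-8"):
    """Encode value and store it in a fixed-width bytes array, refusing to truncate."""
    encoded = value.encode(encoding)  # raises UnicodeEncodeError if not representable
    if len(encoded) > arr.dtype.itemsize:
        raise ValueError(
            f"{len(encoded)} bytes do not fit in a {arr.dtype.itemsize}-byte element"
        )
    arr[index] = encoded

arr = np.zeros(2, dtype="S3")

# Plain assignment: the value is silently cut to the element width.
arr[0] = b"abcdef"
truncated = arr[0]  # b'abc'

# Checked assignment: same input raises instead of truncating.
try:
    checked_assign(arr, 1, "abcdef")
    too_long_raised = False
except ValueError:
    too_long_raised = True
```

Note that the length check is done on encoded bytes, which fits a dtype whose size is specified in bytes.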