> In this case, we want something compatible with Python's string (i.e. full Unicode support), and I think it should be as transparent as possible. Python's string has made the decision to present a character-oriented API to users (despite what the manifesto says...).

Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size.
We already have strong precedent for dtypes reflecting the number of bytes used for storage even when Python doesn't: consider numeric types like int64 and float32 compared to their Python equivalents. It's an intrinsic aspect of NumPy that users need to think about how their data is actually stored.
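To illustrate the precedent (a quick sketch of mine, using nothing beyond current NumPy):

    import numpy as np

    # Numeric dtype names spell out their storage size, unlike
    # Python's arbitrary-precision int and always-64-bit float.
    print(np.array([1, 2, 3], dtype=np.int64).itemsize)     # 8 bytes/element
    print(np.array([1.0, 2.0], dtype=np.float32).itemsize)  # 4 bytes/element

    # The existing fixed-width bytes dtype follows the same convention:
    # 'S5' reserves exactly 5 bytes per element, whatever the content.
    print(np.array([b"abc"], dtype="S5").itemsize)          # 5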
Storage cost is always going to be a concern. Arguably, it's even more of a concern today than it used to be, because compute has been improving faster than storage.
I still don't understand why a latin-1 encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one byte per character for (ASCII) scientific data.
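To make that concrete (my example, plain Python): for ASCII text, UTF-8 costs exactly one byte per character, the same as latin-1; the two only diverge outside ASCII:

    # For ASCII text, UTF-8 uses exactly one byte per character,
    # so it costs no more than latin-1 for typical scientific data.
    label = "temperature_degC"
    assert len(label.encode("utf-8")) == len(label)    # 16 chars -> 16 bytes
    assert len(label.encode("latin-1")) == len(label)  # same size in latin-1

    # Outside ASCII they diverge: latin-1 stays at one byte but covers
    # only 256 code points; UTF-8 grows but covers all of Unicode.
    assert len("é".encode("latin-1")) == 1
    assert len("é".encode("utf-8")) == 2
    assert len("漢".encode("utf-8")) == 3  # latin-1 raises UnicodeEncodeError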
> For Python use -- a pointer to a Python string would be nice.

Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for it.
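For concreteness, here is what object arrays give us today, plus a purely hypothetical spelling of the parametric version (the type= parameter below is my invention, not a real NumPy API):

    import numpy as np

    # Today an object array holds pointers to arbitrary Python objects;
    # nothing records or enforces that every element is a str.
    arr = np.array(["spam", "eggs"], dtype=object)
    arr[0] = 42   # silently accepted: no homogeneity guarantee
    print(arr)    # [42 'eggs']

    # Hypothetical parametric spelling (NOT a real API): a dtype that
    # remembers the homogeneous element type, so downstream consumers
    # (serializers, type checkers) could rely on it:
    # arr = np.empty(2, dtype=np.dtype(object, type=str))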
> Then use a native flexible-encoding dtype for everything else.

No opposition here from me. Though again, I think utf-8 alone would also be enough.
> One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:
>
> - EncodingError if it can't be encoded into the defined encoding.
> - ValueError if it is too long -- it should not be silently truncated.

I think we all agree here.
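A minimal sketch of those semantics (the validate helper and the EncodingError name are stand-ins from this thread, not existing NumPy or Python APIs; Python itself raises UnicodeEncodeError today):

    class EncodingError(ValueError):
        """Hypothetical exception from this thread; Python itself raises
        UnicodeEncodeError for an un-encodable string."""

    def validate(value, encoding, itemsize):
        # Sketch of the proposed assignment check; `encoding` and
        # `itemsize` stand in for parameters of the new string dtype.
        try:
            encoded = value.encode(encoding)
        except UnicodeEncodeError as err:
            raise EncodingError(
                "%r cannot be encoded as %s" % (value, encoding)) from err
        if len(encoded) > itemsize:
            # Refuse outright -- never truncate silently.
            raise ValueError("%r needs %d bytes; dtype holds %d"
                             % (value, len(encoded), itemsize))
        return encoded

    validate("abc", "latin-1", 5)       # ok -> b'abc'
    # validate("漢", "latin-1", 5)      # raises EncodingError
    # validate("abcdef", "latin-1", 5)  # raises ValueError, no truncation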