[Numpy-discussion] proposal: smaller representation of string arrays
shoyer at gmail.com
Fri Apr 21 17:34:27 EDT 2017
On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker <chris.barker at noaa.gov>
> 1) Use with/from Python -- both creating and working with numpy arrays.
> In this case, we want something compatible with Python's string (i.e. full
> Unicode supporting) and I think should be as transparent as possible.
> Python's string has made the decision to present a character oriented API
> to users (despite what the manifesto says...).
Yes, but NumPy doesn't really implement string operations, so fortunately
this is pretty irrelevant to us -- except for our API for specifying dtype
We already have strong precedence for dtypes reflecting number of bytes
used for storage even when Python doesn't: consider numeric types like
int64 and float32 compared to the Python equivalents. It's an intrinsic
aspect of NumPy that users need to think about how their data is actually
> However, there is a challenge here: numpy requires fixed-number-of-bytes
> dtypes. And full unicode support with fixed number of bytes matching fixed
> number of characters is only possible with UCS-4 -- hence the current
> implementation. And this is actually just fine! I know we all want to be
> efficient with data storage, but really -- in the early days of Unicode,
> when folks thought 16 bits were enough, doubling the memory usage for
> western language storage was considered fine -- how long in computer life
> time does it take to double your memory? But now, when memory, disk space,
> bandwidth, etc, are all literally orders of magnitude larger, we can't
> handle a factor of 4 increase in "wasted" space?
Storage cost is always going to be a concern. Arguably, it's even more of a
concern today than it used to be be, because compute has been improving
faster than storage.
> But as scientific text data often is 1-byte compatible, a
> one-byte-per-char dtype is a fine idea, too -- and we pretty much have that
> already with the existing string type -- that could simply be enhanced by
> enforcing the encoding to be latin-9 (or latin-1, if you don't want the
> Euro symbol). This would get us what scientists expect from strings in a
> way that is properly compatible with Python's string type. You'd get
> encoding errors if you tried to stuff anything else in there, and that's
I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
So -- I think we should address the use-cases separately -- one for
> "normal" python use and simple interoperability with python strings, and
> one for interoperability at the binary level. And an easy way to convert
> between the two.
> For Python use -- a pointer to a Python string would be nice.
Yes, absolutely. If we want to be really fancy, we could consider a
parametric object dtype that allows for object arrays of *any* homogeneous
Python type. Even if NumPy itself doesn't do anything with that
information, there are lots of use cases for that information.
Then use a native flexible-encoding dtype for everything else.
No opposition here from me. Though again, I think utf-8 alone would also be
> Thinking out loud -- another option would be to set defaults for the
> multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility
> with the python string type -- and make folks make an effort to get
> anything else.
The np.unicode_ type is already UCS-4 and the default for dtype=str on
Python 3. We probably shouldn't change that, but if we set any default
encoding for the new text type, I strongly believe it should be utf-8.
One more note: if a user tries to assign a value to a numpy string array
> that doesn't fit, they should get an error:
> EncodingError if it can't be encoded into the defined encoding.
> ValueError if it is too long -- it should not be silently truncated.
I think we all agree here.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the NumPy-Discussion