[Numpy-discussion] proposal: smaller representation of string arrays

Nathaniel Smith njs at pobox.com
Wed Apr 26 14:31:00 EDT 2017


On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" <
chris.barker at noaa.gov> wrote:


> UTF-8 does not match the character-oriented Python text model. Plenty
> of people argue that that isn't the "correct" model for Unicode text
> -- maybe so, but it is the model python 3 has chosen. I wrote a much
> longer rant about that earlier.
>
> So I think the easy to access, and particularly defaults, numpy string
> dtypes should match it.


This seems a little vague? The "character-oriented Python text model" just
means that str supports O(1) indexing of characters. But... NumPy doesn't.
If you want to access individual characters inside a string inside an
array, you have to pull out the scalar first, at which point the data is
copied and boxed into a Python object anyway, using whatever representation
the interpreter prefers. So AFAICT it makes literally no difference to the
user whether numpy's internal representation allows for fast character
access.
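
To make that concrete, here is a minimal sketch of what character access
on an array element looks like. It assumes a default fixed-width unicode
array ('<U5' here); the variable names are just for illustration.

```python
import numpy as np

# Fixed-width unicode array: each element takes 4 bytes per character.
arr = np.array(["hello", "world"])
itemsize = arr.dtype.itemsize   # 5 characters * 4 bytes

# Indexing the array returns a boxed scalar (a subclass of Python str),
# not a view into the array's buffer.
item = arr[0]

# Character indexing happens on that Python-level object.
first_char = item[0]

# Overwriting the array element afterwards leaves the scalar unchanged,
# which shows the data was copied when the scalar was pulled out.
arr[0] = "jello"

print(isinstance(item, str), first_char, item, arr[0])
```

The character lookup never touches the array's own storage, so the
encoding used inside the array does not show up in this code path.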

-n