
On Wed, Apr 26, 2017 at 2:31 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" <chris.barker@noaa.gov> wrote:
UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model Python 3 has chosen. I wrote a much longer rant about that earlier.
So I think the easy-to-access numpy string dtypes, and particularly the defaults, should match it.
This seems a little vague? The "character-oriented Python text model" is just that str supports O(1) indexing of characters. But... Numpy doesn't. If you want to access individual characters inside a string inside an array, you have to pull out the scalar first, at which point the data is copied and boxed into a Python object anyway, using whatever representation the interpreter prefers. So AFAICT it makes literally no difference to the user whether numpy's internal representation allows for fast character access.
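A minimal sketch of the scalar path described above, assuming a recent NumPy and its default fixed-width unicode dtype. The names `arr`, `s`, and `ch` are illustrative only. It shows that the extracted scalar is an independent copy, and that character count and UTF-8 byte count can differ:

```python
import numpy as np

arr = np.array(["héllo", "abc"])   # fixed-width unicode dtype (U5 here)

s = arr[0]     # pulls out a NumPy string scalar, which is a str subclass
ch = s[1]      # character indexing happens on the scalar, not on the array

arr[0] = "xxxxx"   # overwrite the array element in place
print(s, ch)       # the scalar is a copy, so it still holds the old text

# One non-ASCII character: 5 characters, but 6 bytes once encoded as UTF-8
print(len(s), len(s.encode("utf-8")))
```

Because the scalar is already a separate Python-level object at this point, the array's internal storage format does not affect how `s[1]` behaves.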
You can create a view on individual characters or bytes, AFAICS:
>>> t = np.array(['abcdefg']*10)
>>> t2 = t.view([('s%d' % i, '<U1') for i in range(7)])
>>> t2['s5']
array(['f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f'], dtype='<U1')

>>> t.view('<U1').reshape(len(t), -1)[:, 2]
array(['c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c'], dtype='<U1')
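A small check that these really are views and not copies; a sketch only, assuming a contiguous array of fixed-width unicode strings (4 bytes per character). The names `chars` and `raw` are made up for illustration, and the native-byte-order `'U1'` is used in place of `'<U1'`:

```python
import numpy as np

t = np.array(['abcdefg'] * 10)              # fixed-width dtype U7

# Reinterpret each 7-character item as 7 one-character items
chars = t.view('U1').reshape(len(t), -1)    # shape (10, 7)
print(np.shares_memory(t, chars))           # same buffer, no copy

# Writing through the character view changes the original strings
chars[:, 2] = 'X'
print(t[0])

# The same buffer seen as raw bytes: 7 characters * 4 bytes per row
raw = t.view(np.uint8).reshape(len(t), -1)  # shape (10, 28)
print(raw.shape)
```

This relies on every character occupying the same number of bytes, which is why it works with the fixed-width UCS-4 storage.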
Josef
-n