On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith <njs@pobox.com> wrote:
UTF-8 does not match the character-oriented Python text model. Plenty
of people argue that that isn't the "correct" model for Unicode text
-- maybe so, but it is the model python 3 has chosen. I wrote a much
longer rant about that earlier.

So I think the easy-to-access, and particularly the default, numpy string
dtypes should match it.

This seems a little vague?

sorry -- that's what I get for trying to be concise...
 
The "character-oriented Python text model" is just that str supports O(1) indexing of characters.

not really -- I think the performance characteristics are an implementation detail (though it did influence the design, I'm sure)

I'm referring to the fact that a python string appears (to the user -- also under the hood, but again, implementation detail) to be a sequence of characters, not a sequence of bytes, not a sequence of glyphs, or graphemes, or anything else. Every Python string has a length, and that length is the number of characters, and if you index you get a string of length 1, and it has one character in it, and that character corresponds to a single code point.

Someone could implement a python string using utf-8 under the hood, and none of that would change (and I think micropython may have done that...)

Sure, you might get two characters when you really expect a single grapheme, but it's at least a consistent oddity. (well, not always, as some graphemes can be represented by either a single code point or two combined -- human language really sucks!)
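To make that concrete, here is a small sketch using only the standard library (the variable names are just mine, for illustration):

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9)
composed = "\u00e9"
# the same grapheme written as "e" plus a combining acute accent (U+0065 U+0301)
decomposed = "e\u0301"

# len() counts code points, not graphemes and not bytes
len_composed = len(composed)          # 1
len_decomposed = len(decomposed)      # 2

# indexing always gives a length-1 str holding one code point
first = decomposed[0]                 # "e"

# the UTF-8 byte count is yet another number
utf8_len = len(composed.encode("utf-8"))   # 2

# NFC normalization folds the pair back into the single code point
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

So the same grapheme can have length 1 or 2, but in both cases the length is the number of code points, which is the consistent oddity I mean.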

The UTF-8 Manifesto (http://utf8everywhere.org/) makes the very good point that a character-oriented interface is not the only one that makes sense, and may not make sense at all. However:

1) Python has chosen that interface

2) It is a good interface (probably the best for computer use) if you need to choose only one

utf8everywhere is mostly arguing for utf-8 over utf16 -- and secondarily for utf-8 everywhere as the best option for working at the C level. That's probably true.

(I also think the utf-8 fans are in a bit of a fantasy world -- this would all be easier, yes, if one encoding was used for everything, all the time, but other than that, utf-8 is not a panacea -- we are still going to have encoding headaches no matter how you slice it)

So where does numpy fit? well, it does operate at the C level, but people work with it from python, so exposing the details of the encoding to the user should be strictly opt-in.

When a numpy user wants to put a string into a numpy array, they should know how long a string they can fit -- with "length" defined how python strings define it. 
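A rough sketch of why the definition of "length" matters, using numpy's existing fixed-width "U" dtype. The 10-byte utf-8 field at the end is purely hypothetical; it only stands in for what a byte-counted dtype would have to deal with:

```python
import numpy as np

s = "na\u00efve caf\u00e9"              # "naïve café"

n_chars = len(s)                        # 10 code points
n_utf8_bytes = len(s.encode("utf-8"))   # 12 bytes: ï and é take two bytes each

# numpy's existing fixed-width unicode dtype counts code points, like str does
fits = np.array([s], dtype="U10")
roundtrips = bool(fits[0] == s)

# a field that is too short truncates on a character boundary
short = np.array([s], dtype="U5")
truncated = str(short[0])               # "naïve"

# a hypothetical 10-byte utf-8 field could not hold these 10 characters
would_fit_in_10_bytes = n_utf8_bytes <= 10
```

With a character-counted dtype the user only has to know `len(s)`; with a byte-counted one they would have to know the encoded size of every string they plan to store.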

Using utf-8 for the default string in numpy would be like using float16 for default float -- not a good idea!

I believe Julian said there would be no default -- you would need to specify, but I think there does need to be one:

np.array(["a string", "another string"]) 

needs to do something.
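For reference, this is what that call does today under Python 3, at least in the numpy versions I am aware of: it picks the fixed-width unicode dtype, sized to the longest string in characters.

```python
import numpy as np

arr = np.array(["a string", "another string"])

kind = arr.dtype.kind            # 'U': fixed-width unicode
itemsize = arr.dtype.itemsize    # 56 bytes = 14 characters * 4 bytes each
n_chars = max(len(s) for s in ["a string", "another string"])   # 14
```

So the current default already matches the Python text model, at the cost of four bytes per character.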

if we make a parameterized dtype that accepts any encoding, then we could do:

np.array(["a string", "another string"], dtype=np.stringtype["utf-8"])

If folks really want that.

I'm afraid that would lead to errors -- people thinking "cool, utf-8 is just like ascii, but with full Unicode support!"

But... Numpy doesn't. If you want to access individual characters inside a string inside an array, you have to pull out the scalar first, at which point the data is copied and boxed into a Python object anyway, using whatever representation the interpreter prefers.
 
So AFAICT it makes literally no difference to the user whether numpy's internal representation allows for fast character access.

agreed - unless someone wants to do a view that makes an N-D array of strings look like a 1-D array of characters... Which seems odd, but there was recently a big debate on the netcdf CF conventions list about that very issue...
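For what it's worth, that kind of view already works with the current fixed-width "U" dtype, because every character occupies the same number of bytes. A minimal sketch (array contents and names are made up for illustration):

```python
import numpy as np

words = np.array([["abc", "def"], ["ghi", "jkl"]])   # shape (2, 2), dtype U3

# reinterpret the same memory as single characters, flattened to 1-D
flat_chars = words.reshape(-1).view("U1")            # shape (12,)

# or keep the original axes and add one trailing axis of characters
per_string = words.view("U1").reshape(words.shape + (-1,))   # shape (2, 2, 3)

no_copy = np.shares_memory(words, flat_chars)
fifth = str(flat_chars[4])                           # "e"
```

Strings shorter than the field width are NUL-padded, so they show up as empty entries in such a view. With a variable-width encoding like utf-8 as the storage, I would expect the equivalent view to give you bytes rather than characters.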

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov