
On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith <njs@pobox.com> wrote:
UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model Python 3 has chosen. I wrote a much longer rant about that earlier.
So I think the easy-to-access, and particularly the default, numpy string dtypes should match it.
This seems a little vague?
sorry -- that's what I get for trying to be concise...
The "character-oriented Python text model" is just that str supports O(1) indexing of characters.
not really -- I think the performance characteristics are an implementation detail (though they did influence the design, I'm sure).

I'm referring to the fact that a Python string appears (to the user -- also under the hood, but again, implementation detail) to be a sequence of characters, not a sequence of bytes, not a sequence of glyphs, or graphemes, or anything else. Every Python string has a length, and that length is the number of characters, and if you index you get a string of length 1, and it has one character in it, and that character corresponds to a single code point.

Someone could implement a Python string using utf-8 under the hood, and none of that would change (and I think micropython may have done that...)

Sure, you might get two characters when you really expect a single grapheme, but it's at least a consistent oddity. (well, not always, as some graphemes can be represented by either a single code point or two combined -- human language really sucks!)

The UTF-8 Manifesto (http://utf8everywhere.org/) makes the very good point that a character-oriented interface is not the only one that makes sense, and may not make sense at all. However:

1) Python has chosen that interface
2) It is a good interface (probably the best for computer use) if you need to choose only one

utf8everywhere is mostly arguing for utf-8 over utf-16 -- and secondarily for utf-8 everywhere as the best option for working at the C level. That's probably true.

(I also think the utf-8 fans are in a bit of a fantasy world -- this would all be easier, yes, if one encoding was used for everything, all the time, but other than that, utf-8 is not a panacea -- we are still going to have encoding headaches no matter how you slice it)

So where does numpy fit? Well, it does operate at the C level, but people work with it from Python, so exposing the details of the encoding to the user should be strictly opt-in.
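A small sketch of the string model described above, using only the standard library. The sample word is just an illustration; it shows that len() counts code points rather than bytes, that indexing yields a length-1 string, and that one grapheme can be spelled as either one code point or two.

```python
import unicodedata

# "naive" with a precomposed U+00EF (LATIN SMALL LETTER I WITH DIAERESIS)
composed = "na\u00efve"

# len() counts code points, not encoded bytes
n_chars = len(composed)                    # 5 code points
n_bytes = len(composed.encode("utf-8"))    # 6 bytes: U+00EF needs two in UTF-8

# Indexing returns a length-1 str holding a single code point
third = composed[2]
code_point = ord(third)

# The same grapheme spelled as base letter + combining diaeresis (U+0308)
decomposed = unicodedata.normalize("NFD", composed)
n_chars_decomposed = len(decomposed)       # one code point more than composed

print(n_chars, n_bytes, len(third), hex(code_point), n_chars_decomposed)
print(composed == decomposed)              # the two spellings compare unequal
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The two spellings render the same but differ in length and compare unequal until normalized, which is the "not always consistent" oddity mentioned above.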
When a numpy user wants to put a string into a numpy array, they should know how long a string they can fit -- with "length" defined how Python strings define it.

Using utf-8 for the default string in numpy would be like using float16 for default float -- not a good idea!

I believe Julian said there would be no default -- you would need to specify, but I think there does need to be one:

np.array(["a string", "another string"])

needs to do something.

If we make a parameterized dtype that accepts any encoding, then we could do:

np.array(["a string", "another string"], dtype=np.stringtype["utf-8"])

If folks really want that.

I'm afraid that that would lead to errors -- cool, utf-8 is just like ascii, but with full Unicode support!

But... Numpy doesn't. If you want to access individual characters inside a string inside an array, you have to pull out the scalar first, at which point the data is copied and boxed into a Python object anyway, using whatever representation the interpreter prefers.
So AFAICT it makes literally no difference to the user whether numpy's internal representation allows for fast character access.
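For reference, a sketch of how numpy's existing fixed-width unicode dtype behaves on the points raised above: capacity counted in code points, and character access going through a scalar. It only uses the current 'U' dtype; the UTF-8 byte counts come from plain str.encode and are shown for comparison, not as a claim about how any proposed dtype would size its fields.

```python
import numpy as np

arr = np.array(["a string", "another string"])

# The default is a fixed-width unicode dtype sized to the longest input,
# stored at 4 bytes per code point
kind = arr.dtype.kind
capacity = arr.dtype.itemsize // 4          # capacity in code points

# Assigning a longer string silently truncates to that capacity
trunc = arr.copy()
trunc[0] = "x" * 20
stored_len = len(trunc[0])

# The same number of characters can need different numbers of UTF-8 bytes,
# so a byte-sized field would not map to one fixed character count
ascii_bytes = len(("a" * capacity).encode("utf-8"))
accented_bytes = len(("\u00e9" * capacity).encode("utf-8"))

# Character access goes through a scalar, which is a str subclass
item = arr[1]
first_char = item[0]

print(arr.dtype, capacity, stored_len, ascii_bytes, accented_bytes)
print(type(item).__name__, isinstance(item, str), first_char)
```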
agreed - unless someone wants to do a view that makes an N-D array of strings look like a 1-D array of characters.... Which seems odd, but there was recently a big debate on the netcdf CF conventions list about that very issue...

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov