[Numpy-discussion] proposal: smaller representation of string arrays
chris.barker at noaa.gov
Wed Apr 26 18:27:10 EDT 2017
On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith <njs at pobox.com> wrote:
>> UTF-8 does not match the character-oriented Python text model. Plenty
>> of people argue that that isn't the "correct" model for Unicode text
>> -- maybe so, but it is the model python 3 has chosen. I wrote a much
>> longer rant about that earlier.
>> So I think the easy to access, and particularly defaults, numpy string
>> dtypes should match it.
> This seems a little vague?
sorry -- that's what I get for trying to be concise...
> The "character-oriented Python text model" is just that str supports O(1)
> indexing of characters.
not really -- I think the performance characteristics are an implementation
detail (though it did influence the design, I'm sure)
I'm referring to the fact that a python string appears (to the user -- also
under the hood, but again, implementation detail) to be a sequence of
characters, not a sequence of bytes, not a sequence of glyphs, or
graphemes, or anything else. Every Python string has a length, and that
length is the number of characters, and if you index you get a string of
length 1, and it has one character in it, and that character corresponds
to a single code point.
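A minimal sketch of that model, using only built-in str operations (the example string is my own choice, not from the thread):

```python
# 'café', with é written as the single code point U+00E9
s = "caf\u00e9"

n_chars = len(s)                    # length counts code points, not bytes
last = s[3]                         # indexing yields a str of length 1
n_bytes = len(s.encode("utf-8"))    # the UTF-8 encoding needs one more byte

print(n_chars, len(last), hex(ord(last)), n_bytes)   # 4 1 0xe9 5
```

The byte count only shows up once you explicitly encode; nothing in the str interface itself exposes it.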
Someone could implement a python string using utf-8 under the hood, and
none of that would change (and I think micropython may have done that...)
Sure, you might get two characters when you really expect a single
grapheme, but it's at least a consistent oddity. (well, not always, as some
graphemes can be represented by either a single code point or two combined
-- human language really sucks!)
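To illustrate the oddity, here is a small sketch with the standard unicodedata module, assuming "é" as the example grapheme (my choice, not from the thread):

```python
import unicodedata

composed = "\u00e9"       # é as one code point
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# Same grapheme on screen, different lengths, and the strings compare unequal.
print(len(composed), len(decomposed), composed == decomposed)   # 1 2 False

# Normalizing to NFC folds the pair into the single code point.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)   # True
```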
The UTF-8 Manifesto (http://utf8everywhere.org/) makes the very good point
that a character-oriented interface is not the only one that makes sense,
and may not make sense at all. However:
1) Python has chosen that interface
2) It is a good interface (probably the best for computer use) if you need
to choose only one
utf8everywhere is mostly arguing for utf-8 over utf-16 -- and secondarily
for utf-8 everywhere as the best option for working at the C level. That's
(I also think the utf-8 fans are in a bit of a fantasy world -- this would
all be easier, yes, if one encoding was used for everything, all the time,
but other than that, utf-8 is not a panacea -- we are still going to have
encoding headaches no matter how you slice it)
So where does numpy fit? well, it does operate at the C level, but people
work with it from python, so exposing the details of the encoding to the
user should be strictly opt-in.
When a numpy user wants to put a string into a numpy array, they should
know how long a string they can fit -- with "length" defined how python
strings define it.
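A rough sketch of why the unit of "length" matters, using numpy's existing fixed-width 'U' dtype and made-up example strings (this shows current behavior, not the proposed dtype):

```python
import numpy as np

# dtype "U3" reserves room for exactly 3 code points per element.
arr = np.array(["abc", "\u00e9\u00e9\u00e9"], dtype="U3")
chars_per_item = arr.dtype.itemsize // 4   # 'U' uses 4 bytes per code point

# Encoded as UTF-8, the same two 3-character strings need different byte counts.
utf8_sizes = [len(s.encode("utf-8")) for s in arr.tolist()]

# Longer input is silently truncated to the declared character count.
truncated = np.array(["abcdef"], dtype="U3")[0]

print(chars_per_item, utf8_sizes, truncated)   # 3 [3, 6] abc
```

With a fixed byte budget and utf-8 storage, how many characters fit would depend on which characters they are.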
Using utf-8 for the default string in numpy would be like using float16 for
default float--not a good idea!
I believe Julian said there would be no default -- you would need to
specify, but I think there does need to be one:
np.array(["a string", "another string"])
needs to do something.
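For reference, this is what that call does in current numpy under Python 3 (a quick check of existing behavior, not of anything proposed here):

```python
import numpy as np

arr = np.array(["a string", "another string"])

# The default is the fixed-width unicode dtype, sized to the longest input
# and measured in code points (4 bytes each).
kind = arr.dtype.kind
max_chars = arr.dtype.itemsize // 4

print(kind, max_chars)   # U 14
```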
if we make a parameterized dtype that accepts any encoding, then we could do:
np.array(["a string", "another string"], dtype=np.stringtype["utf-8"])
If folks really want that.
I'm afraid that that would lead to errors -- "cool, utf-8 is just like
ascii, but with full Unicode support!"
> But... Numpy doesn't. If you want to access individual characters inside a
> string inside an array, you have to pull out the scalar first, at which
> point the data is copied and boxed into a Python object anyway, using
> whatever representation the interpreter prefers.
> So AFAICT it makes literally no difference to the user whether numpy's
> internal representation allows for fast character access.
agreed - unless someone wants to do a view that makes an N-D array of
strings look like a 1-D array of characters.... Which seems odd, but there
was recently a big debate on the netcdf CF conventions list about that very
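A hedged sketch of both access paths with the current 'U' dtype, using a made-up two-element array: pulling out a scalar, and viewing the same buffer as single characters.

```python
import numpy as np

words = np.array(["abc", "de"])     # fixed-width unicode, 3 characters per element

# Path 1: pull out a scalar; it is a str subclass, indexed like any Python string.
item = words[0]
first_char = item[0]

# Path 2: reinterpret the same buffer as one-character elements (no copy).
chars = words.view("U1")            # shape (6,); the padding shows up as ''
grid = chars.reshape(words.shape + (-1,))   # shape (2, 3), one row per string

print(first_char, chars.tolist(), grid.shape)
```

The view only works this simply because every character occupies the same number of bytes; with a variable-width encoding such as utf-8 the element boundaries would not line up.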
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov