[Numpy-discussion] proposal: smaller representation of string arrays

Chris Barker chris.barker at noaa.gov
Wed Apr 26 18:27:10 EDT 2017


On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith <njs at pobox.com> wrote:

>> UTF-8 does not match the character-oriented Python text model. Plenty
>> of people argue that that isn't the "correct" model for Unicode text
>> -- maybe so, but it is the model python 3 has chosen. I wrote a much
>> longer rant about that earlier.
>>
>> So I think the easy to access, and particularly defaults, numpy string
>> dtypes should match it.
>
> This seems a little vague?
>

sorry -- that's what I get for trying to be concise...


> The "character-oriented Python text model" is just that str supports O(1)
> indexing of characters.
>

not really -- I think the performance characteristics are an implementation
detail (though it did influence the design, I'm sure)

I'm referring to the fact that a python string appears (to the user -- also
under the hood, but again, implementation detail)  to be a sequence of
characters, not a sequence of bytes, not a sequence of glyphs, or
graphemes, or anything else. Every Python string has a length, and that
length is the number of characters, and if you index you get a string of
length 1, and it has one character in it, and that character corresponds to
a single code point.
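
A minimal sketch of that model in plain Python 3, assuming nothing beyond the
built-in str type (the sample string and variable names are only for
illustration):

```python
# A str is a sequence of code points, whatever the storage underneath.
s = "na\u00efve \U0001F600"        # "naive" with a diaeresis, plus one emoji

n_chars = len(s)                   # counts code points
n_bytes = len(s.encode("utf-8"))   # counts UTF-8 bytes instead

third = s[2]                       # indexing yields a length-1 str
last = s[-1]                       # the emoji is a single character too

print(n_chars, n_bytes, len(third), hex(ord(third)), hex(ord(last)))
```

The character count and the UTF-8 byte count differ as soon as the text
leaves ASCII, which is the distinction the rest of this message relies on.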

Someone could implement a python string using utf-8 under the hood, and
none of that would change (and I think micropython may have done that...)

Sure, you might get two characters when you really expect a single
grapheme, but it's at least a consistent oddity. (well, not always, as some
graphemes can be represented by either a single code point or two combined
-- human language really sucks!)
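
As a small illustration of that oddity, here is a sketch using the standard
library's unicodedata module; the choice of "e with acute accent" is just an
example:

```python
import unicodedata

composed = "\u00e9"      # one code point: e with acute accent
decomposed = "e\u0301"   # two code points: "e" + combining acute accent

# Same grapheme on screen, different lengths and not equal as strings.
lengths = (len(composed), len(decomposed))
equal_raw = composed == decomposed

# Normalizing to a common form makes them comparable.
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed

print(lengths, equal_raw, equal_nfc, equal_nfd)
```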

The UTF-8 Manifesto (http://utf8everywhere.org/) makes the very good point
that a character-oriented interface is not the only one that makes sense,
and may not make sense at all. However:

1) Python has chosen that interface

2) It is a good interface (probably the best for computer use) if you need
to choose only one

utf8everywhere is mostly arguing for utf-8 over utf16 -- and secondarily
for utf-8 everywhere as the best option for working at the C level. That's
probably true.

(I also think the utf-8 fans are in a bit of a fantasy world -- this would
all be easier, yes, if one encoding was used for everything, all the time,
but other than that, utf-8 is not a panacea -- we are still going to have
encoding headaches no matter how you slice it)

So where does numpy fit? well, it does operate at the C level, but people
work with it from python, so exposing the details of the encoding to the
user should be strictly opt-in.

When a numpy user wants to put a string into a numpy array, they should
know how long a string they can fit -- with "length" defined how python
strings define it.
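
To make the "how long a string fits" point concrete, here is a sketch using
the existing fixed-width "U" and "S" dtypes (not the proposed new dtype). The
"S3" case stands in for a byte-sized field holding UTF-8 data and is only an
assumption about how such a field would behave:

```python
import numpy as np

text = "na\u00efve text"                 # 10 code points

# "U5" has room for 5 code points; longer input is silently truncated.
fixed = np.array([text], dtype="U5")
kept = str(fixed[0])
item_bytes = fixed.dtype.itemsize        # 4 bytes per code point

# "S3" has room for 3 bytes; truncation can split a multi-byte character.
raw = np.array([text.encode("utf-8")], dtype="S3")
cut = bytes(raw[0])
try:
    cut.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False

print(repr(kept), item_bytes, cut, decodes)
```

With a character-sized field the user can reason about capacity with len();
with a byte-sized field the answer depends on which characters are stored.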

Using utf-8 for the default string in numpy would be like using float16 for
default float--not a good idea!

I believe Julian said there would be no default -- you would need to
specify, but I think there does need to be one:

np.array(["a string", "another string"])

needs to do something.
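
For reference, a quick check of what that call does in current NumPy under
Python 3, as far as I know (variable names are illustrative):

```python
import numpy as np

arr = np.array(["a string", "another string"])

kind = arr.dtype.kind              # "U": fixed-width unicode
width = arr.dtype.itemsize // 4    # capacity in code points (UCS-4 storage)

print(arr.dtype, kind, width)
```

The width is picked from the longest input string, measured in characters.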

if we make a parameterized dtype that accepts any encoding, then we could
do:

np.array(["a string", "another string"], dtype=np.stringtype["utf-8"])

If folks really want that.

I'm afraid that would lead to errors, with people thinking: cool, utf-8 is
just like ascii, but with full Unicode support!

> But... Numpy doesn't. If you want to access individual characters inside a
> string inside an array, you have to pull out the scalar first, at which
> point the data is copied and boxed into a Python object anyway, using
> whatever representation the interpreter prefers.
>


> So AFAICT it makes literally no difference to the user whether numpy's
> internal representation allows for fast character access.
>

agreed - unless someone wants to do a view that makes an N-D array of
strings look like a 1-D array of characters.... Which seems odd, but there
was recently a big debate on the netcdf CF conventions list about that very
issue...
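
A sketch of what such a view looks like with today's fixed-width "U" dtype,
where every character occupies the same number of bytes. This is an
illustration of the idea, not something the proposal specifies:

```python
import numpy as np

# A 2-D array of strings, stored as fixed-width "U3" items.
arr = np.array([["abc", "de"], ["f", "ghi"]])

# Flatten, then reinterpret the same buffer as single characters.
chars = arr.ravel().view("U1")

print(arr.dtype.kind, chars.shape)
print(chars.tolist())
```

Shorter strings are NUL-padded, and the padding shows up as empty strings in
the character view. A variable-width encoding such as utf-8 has no fixed
character size, so this kind of reinterpretation would not carry over
directly.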

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov