[Numpy-discussion] proposal: smaller representation of string arrays

Mon Apr 24 13:04:53 EDT 2017

On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer <shoyer at gmail.com> wrote:

>> In this case, we want something compatible with Python's string (i.e. full
>> Unicode supporting) and I think should be as transparent as possible.
>> Python's string has made the decision to present a character oriented API
>> to users (despite what the manifesto says...).
>>
>
> Yes, but NumPy doesn't really implement string operations, so fortunately
> this is pretty irrelevant to us -- except for our API for specifying dtype
> size.
>

Exactly -- the character-orientation of python strings means that people
are used to thinking that strings have a length that is the number of
characters in the string. I think there will be cognitive dissonance if
someone does:

arr[i] = a_string

Which then raises a ValueError, something like:

String too long for a string[12] dtype array.

When len(a_string) <= 12

AND that will only occur if there are non-ascii characters in the string,
and maybe only if there are more than N non-ascii characters, i.e. it is
very likely to be a run-time error that may not have shown up in tests.

So folks need to do something like:

len(a_string.encode('utf-8')) to see if their string will fit. If not, they
need to truncate it, and THAT is non-obvious how to do, too -- you don't
want to truncate the encoded bytes naively, because you could end up with an
invalid bytestring. But you don't know how many characters to truncate,
either.
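
To make that concrete, here is a minimal sketch of the check-and-truncate
step a user would have to write for a byte-sized utf-8 field. The helper
names utf8_len and truncate_utf8 are made up for illustration; nothing like
them exists in numpy.

```python
def utf8_len(s: str) -> int:
    """Number of bytes s occupies once encoded as UTF-8."""
    return len(s.encode("utf-8"))


def truncate_utf8(s: str, max_bytes: int) -> str:
    """Longest prefix of s whose UTF-8 encoding fits in max_bytes."""
    data = s.encode("utf-8")
    if len(data) <= max_bytes:
        return s
    end = max_bytes
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the byte just past
    # the cut is one of those, the cut is mid-character, so back up until
    # the cut sits on the first byte of a character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end].decode("utf-8")


name = "Ångström"                 # 8 characters
print(len(name), utf8_len(name))  # 8 10
print(truncate_utf8(name, 8))     # Ångstr
```

Note this only keeps code points whole; a base letter followed by a
combining accent could still be split, which is one more thing the user
would need to know about.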

> We already have strong precedence for dtypes reflecting number of bytes
> used for storage even when Python doesn't: consider numeric types like
> int64 and float32 compared to the Python equivalents. It's an intrinsic
> aspect of NumPy that users need to think about how their data is actually
> stored.
>

sure, but a float64 is 64 bits forever and always, and the defaults
perfectly match what python is doing under its hood -- even if users don't
think about it. So the default behaviour of numpy matches python's built-in
types.

>> Storage cost is always going to be a concern. Arguably, it's even more of a
>> concern today than it used to be, because compute has been improving
>> faster than storage.
>>
>
sure -- but again, what is the use-case for numpy arrays with a s#$)load of
text in them? Is it common? I don't think so. And as you pointed out, numpy
doesn't do text processing anyway, so cache performance and all that are
not important. So having UCS-4 as the default, but allowing folks to select
a more compact format if they really need it, is a good way to go. Just like
numpy generally defaults to float64 and int64 (or int32, depending on
platform) -- users can select a smaller size if they have a reason to.
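
For a sense of scale, here is the back-of-envelope arithmetic for the two
fixed-width options. The array size and field width are made-up numbers,
not measurements of any real dataset.

```python
# Hypothetical workload: one million labels, 12 characters each.
n_items = 1_000_000
chars_per_item = 12

# UCS-4 stores every character in 4 bytes; a one-byte encoding such as
# latin-1 stores every character in 1 byte.
ucs4_bytes = n_items * chars_per_item * 4
latin1_bytes = n_items * chars_per_item * 1

print(ucs4_bytes // 10**6, "MB as UCS-4")      # 48 MB as UCS-4
print(latin1_bytes // 10**6, "MB as latin-1")  # 12 MB as latin-1

# Sanity check of the per-character widths using Python's own codecs.
assert len("abc".encode("utf-32-le")) == 3 * 4
assert len("abc".encode("latin-1")) == 3 * 1
```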

I guess that's my summary -- just like with numeric values, numpy should
default to Python-like behavior as much as possible for strings, too --
with an option for a knowledgeable user to do something more performant.

> I still don't understand why a latin encoding makes sense as a preferred
> one-byte-per-char dtype. The world, including Python 3, has standardized on
> UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>

utf-8 is NOT a one-byte-per-char encoding. If you want to ensure that your
data are one byte per char, then you could use ASCII, and it would be
binary compatible with utf-8, but I'm not sure what the point of that is in
this context.
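
A quick check with Python's own codecs shows both halves of that: the
utf-8 width varies per character, and pure-ASCII text encodes to identical
bytes either way. The sample characters are just ones I picked.

```python
# One character each, from plain ASCII up to a non-BMP code point.
samples = ["T", "é", "°", "µ", "€", "\U0001F600"]
widths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(list(widths.values()))  # [1, 2, 2, 2, 3, 4]

# Pure-ASCII text yields the same bytes under ASCII and UTF-8.
ascii_text = "temperature_K"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True
```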

latin-1 or latin-9 buys you (over ASCII):

- A bunch of accented characters -- sure it only covers the latin
languages, but does cover those much better.

- A handful of other characters, including scientifically useful ones. (a
few greek characters, the degree symbol, etc...)

- round-tripping of binary data (at least with Python's encoding/decoding)
-- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
same bytes back. You may get garbage, but you won't get a UnicodeDecodeError.
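
The round-trip property in the last point is easy to check with plain
Python, using every possible byte value as the test input:

```python
raw = bytes(range(256))  # every possible byte value, once

# latin-1 maps each byte to the code point with the same number, so
# decoding never fails and re-encoding restores the original bytes.
text = raw.decode("latin-1")
roundtrip_ok = text.encode("latin-1") == raw

# The same bytes are not valid UTF-8, so that decode raises.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(roundtrip_ok, utf8_failed)  # True True
```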

>> For Python use -- a pointer to a Python string would be nice.
>>
>
> Yes, absolutely. If we want to be really fancy, we could consider a
> parametric object dtype that allows for object arrays of *any* homogeneous
> Python type. Even if NumPy itself doesn't do anything with that
> information, there are lots of use cases for that information.
>

hmm -- that's a nifty idea -- though I think strings could/should be special
cased.

> Then use a native flexible-encoding dtype for everything else.
>>
>
> No opposition here from me. Though again, I think utf-8 alone would also
> be enough.
>

maybe so -- the major reason for supporting others is binary data exchange
with other libraries -- but maybe most of them have gone to utf-8 anyway.

>> One more note: if a user tries to assign a value to a numpy string array
>> that doesn't fit, they should get an error:
>>
>
>> EncodingError if it can't be encoded into the defined encoding.
>>
>> ValueError if it is too long -- it should not be silently truncated.
>>
>
> I think we all agree here.
>

I'm actually having second thoughts -- see above -- if the encoding is
utf-8, then truncating is non-trivial -- maybe it would be better for numpy
to do it for you. Or set a flag as to which you want?

The current 'S' dtype truncates silently already:

In [6]: arr

Out[6]:
array(['this', 'that'],
      dtype='|S4')

In [7]: arr[0] = "a longer string"

In [8]: arr

Out[8]:
array(['a lo', 'that'],
      dtype='|S4')

(similarly for the unicode type)

So at least we are used to that.

BTW -- maybe we should keep the pathological use-case in mind: really short
strings. I think we are all thinking in terms of longer strings, maybe a
name field, where you might assign 32 bytes or so -- then someone has an
accented character in their name, and then gets 30 or 31 characters -- no big
deal.

But what if you have a simple label or something with one or two characters?
Then you have 2 bytes to store the name in, and someone tries to put an
"odd" character in there, and you get an empty string. Not good.
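
Here is a small sketch of that failure mode. It assumes a fixed-width utf-8
field that truncates at the byte level and drops any partial character;
store_in_field is a hypothetical stand-in, not how any existing numpy dtype
behaves.

```python
def store_in_field(label: str, field_bytes: int) -> str:
    """Hypothetical: what a byte-truncating utf-8 field would keep."""
    raw = label.encode("utf-8")[:field_bytes]
    # errors="ignore" silently drops a character cut off by the slice.
    return raw.decode("utf-8", errors="ignore")


labels = ["ab", "°C", "µm", "€"]
stored = [store_in_field(lab, 2) for lab in labels]
print(stored)  # ['ab', '°', 'µ', '']
```

The two-character unit labels lose their second character, and the single
"€" (3 bytes in utf-8) disappears entirely.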

Also -- if utf-8 is the default -- what do you get when you create an array
from a python string sequence? Currently with the 'S' and 'U' dtypes, the
dtype size is set to the length of the longest string passed in. Are we going
to pad it a bit? Or stick with the exact number of bytes?
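
To illustrate the sizing question, this sketch compares three candidate
sizes for a utf-8 field inferred from a sequence of strings. The word list
and the policies are only examples I made up, not a proposal for what numpy
should do.

```python
words = ["naïve", "café", "résumé", "plain"]

# Character-based size, which is what 'U' infers today.
max_chars = max(len(w) for w in words)
# Exact byte size needed to hold these particular strings in utf-8.
max_utf8_bytes = max(len(w.encode("utf-8")) for w in words)
# Worst case: every character takes utf-8's maximum of 4 bytes.
padded_bytes = 4 * max_chars

print(max_chars, max_utf8_bytes, padded_bytes)  # 6 8 24

# A later assignment with no more characters than the longest original
# string can still overflow an exact-byte-sized field.
later = "éééééé"  # 6 characters
later_bytes = len(later.encode("utf-8"))
print(later_bytes, later_bytes <= max_utf8_bytes)  # 12 False
```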

It all comes down to this:

Python3 has made a very deliberate (and I think Good) choice to treat text
as a string of characters, where the user does not need to know or care
about encoding issues. Numpy's defaults should do the same thing.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov