On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker <chris.barker@noaa.gov> wrote:
On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
 
In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).

Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size.

Exactly -- the character-orientation of python strings means that people are used to thinking that strings have a length that is the number of characters in the string. I think there will be cognitive dissonance if someone does:

arr[i] = a_string

Which then raises a ValueError, something like:

String too long for a string[12] dtype array.

When len(a_string) <= 12

AND that will only occur if there are non-ASCII characters in the string, and maybe only if there are more than N non-ASCII characters. i.e. it is very likely to be a run-time error that may not have shown up in tests.

So folks need to do something like:

len(a_string.encode('utf-8')) to see if their string will fit. If not, they need to truncate it, and THAT is non-obvious how to do, too -- you don't want to truncate the encoded bytes naively, or you could end up with an invalid bytestring. But you don't know how many characters to truncate, either.
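A minimal sketch of both steps, using only the standard library. The helper names `utf8_fits` and `truncate_utf8` are made up for illustration and are not NumPy API. The truncation backs the cut point up past UTF-8 continuation bytes, so the result always decodes:

```python
def utf8_fits(text, nbytes):
    """Return True if text needs at most nbytes bytes when encoded as UTF-8."""
    return len(text.encode("utf-8")) <= nbytes


def truncate_utf8(text, nbytes):
    """Return the longest prefix of text whose UTF-8 encoding fits in nbytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= nbytes:
        return text
    cut = nbytes
    # Continuation bytes look like 0b10xxxxxx. If the first dropped byte is one,
    # the cut falls inside a character, so back up to that character's start.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut].decode("utf-8")


name = "\u00c5ngstr\u00f6m"  # "Angstrom" with A-ring and o-umlaut: 8 chars, 10 bytes
print(len(name), len(name.encode("utf-8")))  # character count vs. byte count
print(utf8_fits(name, 8))                    # False, although len(name) == 8
print(truncate_utf8(name, 8))                # drops the whole o-umlaut, not half of it
```

This only guarantees valid UTF-8. It can still split a base letter from a following combining mark, so it is not a complete answer to "how many characters to truncate".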
 
We already have strong precedent for dtypes reflecting number of bytes used for storage even when Python doesn't: consider numeric types like int64 and float32 compared to the Python equivalents. It's an intrinsic aspect of NumPy that users need to think about how their data is actually stored.

sure, but a float64 is 64 bits forever and always, and the defaults perfectly match what python is doing under its hood -- even if users don't think about it. So the default behaviour of numpy matches python's built-in types.
 

Storage cost is always going to be a concern. Arguably, it's even more of a concern today than it used to be, because compute has been improving faster than storage.

sure -- but again, what is the use-case for numpy arrays with a s#$)load of text in them? Is it common? I don't think so. And as you pointed out numpy doesn't do text processing anyway, so cache performance and all that are not important. So having UCS-4 as the default, but allowing folks to select a more compact format if they really need it is a good way to go. Just like numpy generally defaults to float64 and int64 (or int32, depending on platform) -- users can select a smaller size if they have a reason to.

I guess that's my summary -- just like with numeric values, numpy should default to Python-like behavior as much as possible for strings, too -- with an option for a knowledgeable user to do something more performant.
 
I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

utf-8 is NOT a one-byte per char encoding. IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context.
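A quick standard-library check of both claims. The sample characters are chosen for illustration only:

```python
samples = {
    "A": "ASCII letter",
    "\u00b0": "degree sign",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign",
    "\U0001f600": "emoji outside the Basic Multilingual Plane",
}

# Number of bytes each single character occupies once encoded as UTF-8.
utf8_widths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_widths[ch]} byte(s)")

# Pure-ASCII text encodes to identical bytes under ASCII and UTF-8.
label = "flux_density"
assert label.encode("ascii") == label.encode("utf-8")
```

Only the ASCII letter takes one byte; the others take two, three, or four.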

latin-1 or latin-9 buys you (over ASCII):

- A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better.

- A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...)

- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.

+1.  The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255.  Getting a decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.
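The round-trip property can be checked with the standard library. This sketch feeds every byte value 0-255 through latin-1 and then tries the same bytes as UTF-8; in Python the failure surfaces as `UnicodeDecodeError`:

```python
raw = bytes(range(256))  # every possible byte value, as might appear in a legacy file

# latin-1 maps byte N to code point N, so decoding never fails and
# re-encoding returns the original bytes.
text = raw.decode("latin-1")
roundtrip_ok = text.encode("latin-1") == raw

# The same bytes are not valid UTF-8, so strict decoding raises.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(roundtrip_ok, utf8_ok)
```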
 

For Python use -- a pointer to a Python string would be nice.

Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information.

hmm -- that's a nifty idea -- though I think strings could/should be special cased.
 
Then use a native flexible-encoding dtype for everything else.

No opposition here from me. Though again, I think utf-8 alone would also be enough.

maybe so -- the major reason for supporting others is binary data exchange with other libraries -- but maybe most of them have gone to utf-8 anyway.

One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:

EncodingError if it can't be encoded into the defined encoding.

ValueError if it is too long -- it should not be silently truncated.
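As a rough sketch of that policy in plain Python: `encode_fixed` is a hypothetical helper, not an existing or proposed NumPy function, and NUL padding is an assumption made for the example.

```python
def encode_fixed(text, nbytes, encoding="utf-8"):
    """Hypothetical helper: encode text into exactly nbytes bytes, NUL padded."""
    # str.encode raises UnicodeEncodeError if the encoding lacks a character.
    encoded = text.encode(encoding)
    if len(encoded) > nbytes:
        # Refuse to truncate silently.
        raise ValueError(
            f"string needs {len(encoded)} bytes in {encoding}, field holds {nbytes}"
        )
    return encoded.ljust(nbytes, b"\x00")


print(encode_fixed("caf\u00e9", 6))          # 5 encoded bytes plus one NUL of padding

for args in [("caf\u00e9", 4), ("\u03b1", 4, "latin-1")]:
    try:
        encode_fixed(*args)
    except UnicodeEncodeError as exc:        # must be caught before ValueError
        print("cannot encode:", exc.reason)
    except ValueError as exc:
        print("too long:", exc)
```

One wrinkle: in Python, `UnicodeEncodeError` is itself a subclass of `ValueError`, so callers who need to tell the two cases apart have to catch the more specific exception first.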

I think we all agree here.

I'm actually having second thoughts -- see above -- if the encoding is utf-8, then truncating is non-trivial -- maybe it would be better for numpy to do it for you. Or set a flag as to which you want?

The current 'S' dtype truncates silently already:

In [6]: arr

Out[6]:
array(['this', 'that'],
      dtype='|S4')

In [7]: arr[0] = "a longer string"

In [8]: arr

Out[8]:
array(['a lo', 'that'],
      dtype='|S4')

(similarly for the unicode type)

So at least we are used to that.

BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings, maybe a name field, where you might assign 32 bytes or so -- then someone has an accented character in their name, and they get 30 or 31 characters -- no big deal.

I wouldn't call it a pathological use case, it doesn't seem so uncommon to have large datasets of short strings.  I personally deal with a database of hundreds of billions of 2 to 5 character ASCII strings.  This has been a significant blocker to Python 3 adoption in my world.
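To put rough numbers on that, here is a back-of-the-envelope calculation. The count of 2e11 strings is an assumption standing in for "hundreds of billions", and each item is sized for the 5-character maximum:

```python
n_strings = 200_000_000_000  # assumed stand-in for "hundreds of billions"
max_chars = 5                # fixed-width field sized for the longest string

bytes_per_item = {
    "1 byte per char (ASCII, latin-1)": max_chars * 1,
    "4 bytes per char (UCS-4)": max_chars * 4,
}

for layout, itemsize in bytes_per_item.items():
    total_tb = n_strings * itemsize / 1e12
    print(f"{layout}: {itemsize} bytes per item, {total_tb:.0f} TB total")
```

Under these assumptions the one-byte layout needs about 1 TB and UCS-4 about 4 TB. UTF-8 would match the one-byte figure here only because the data are pure ASCII.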

BTW, for those new to the list or with a short memory, this topic has been discussed fairly extensively at least 3 times before.  Hopefully the *fourth* time will be the charm!

https://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html
https://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html
https://mail.scipy.org/pipermail/numpy-discussion/2015-February/072311.html

- Tom
 
 

But what if you have a simple label or something with 1 or two characters: Then you have 2 bytes to store the name in, and someone tries to put an "odd" character in there, and you get an empty string. not good.
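The short-field case is easy to reproduce in plain Python. `fit_to_field` is a hypothetical stand-in for "truncate to the field width without producing invalid UTF-8", not anything NumPy does today:

```python
def fit_to_field(text, nbytes):
    """Illustrative only: keep the whole characters that fit in nbytes of UTF-8."""
    clipped = text.encode("utf-8")[:nbytes]
    # errors="ignore" drops the incomplete trailing sequence left by the slice.
    return clipped.decode("utf-8", errors="ignore")


# A 2-byte field: plain ASCII, degree-C, a lone euro sign, micro-m.
for label in ["N2", "\u00b0C", "\u20ac", "\u00b5m"]:
    print(repr(label), "->", repr(fit_to_field(label, 2)))
```

The euro sign needs three bytes, so nothing of it fits and the field ends up empty; the two-character labels with a degree or micro sign each lose their second character.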

Also -- if utf-8 is the default -- what do you get when you create an array from a python string sequence? Currently with the 'S' and 'U' dtypes, the dtype is set to the longest string passed in. Are we going to pad it a bit? stick with the exact number of bytes? 
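To make the question concrete, this sketch compares the two candidate widths for a small, made-up list of labels: the longest string in characters (what 'U' uses today) and the longest in UTF-8 bytes (the "exact number of bytes" option):

```python
labels = ["temp", "\u0394T", "caf\u00e9", "na\u00efvet\u00e9"]

max_chars = max(len(s) for s in labels)
max_utf8_bytes = max(len(s.encode("utf-8")) for s in labels)
print(max_chars, max_utf8_bytes)

# A later value no longer (in characters) than the longest original
# can still overflow a field sized to the exact byte count.
later = "\u00e9" * 7  # 7 characters, 14 bytes in UTF-8
fits = len(later.encode("utf-8")) <= max_utf8_bytes
print(len(later), len(later.encode("utf-8")), fits)
```

With an exact-bytes width, whether a later assignment succeeds depends on which characters it contains, not just how many.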

It all comes down to this:

Python3 has made a very deliberate (and I think Good) choice to treat text as a string of characters, where the user does not need to know or care about encoding issues. Numpy's defaults should do the same thing.

-CHB




--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
