Stephan Hoyer writes:
On Tue, Sep 13, 2016 at 11:05 AM, Lluís Vilanova wrote:
Whenever we repr an array using 'S', we can instead show a unicode in py3. That keeps the binary representation, but will always show the expected result to users, and it's only a handful of lines added to dump_data().
If needed, I could easily add a bytes array to make the alternative explicit (where py3 would repr the contents as b'foo').
This would only leave the less-common paths inconsistent across python versions, which should not be a problem for most examples/doctests:
* A 'U' array will show u'foo' in py2 and 'foo' in py3.
* The new binary array will show 'foo' in py2 and b'foo' in py3 (that could also be patched on the repr code).
* An 'O' array will not be able to do any meaningful repr conversions.
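For reference, a minimal sketch of what Python 3 shows today for the dtypes mentioned above, before any of the proposed changes. The array contents are illustrative only, and the dtype suffix in the printed output depends on platform byte order and NumPy version.

```python
import numpy as np

# 'S' holds bytes on Python 3, 'U' holds unicode text, 'O' holds arbitrary objects.
s_arr = np.array([b"foo"], dtype="S3")
u_arr = np.array(["foo"], dtype="U3")
o_arr = np.array(["foo", b"foo"], dtype=object)

s_repr = repr(s_arr)
u_repr = repr(u_arr)
o_repr = repr(o_arr)

# On Python 3 the 'S' element is shown with a b'' prefix, the 'U' element without one,
# and the object array simply shows the repr of whatever each element is.
print(s_repr)
print(u_repr)
print(o_repr)
```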
A more complex alternative (and actually closer to what I'm proposing) is to modify numpy in py3 to restrict 'S' to using 8-bit code points in a unicode string. It would have the binary compatibility, while being a unicode string in practice.
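The "8-bit code points" idea can be approximated with existing tools, as a rough sketch only: latin-1 maps each byte value 0-255 to the code point with the same number, so decoding an 'S' array that way gives text and encoding it back gives the same bytes. This is an illustration of the round trip, not the proposed dtype change itself; the sample values are made up.

```python
import numpy as np

# Illustrative data: one ASCII value and one value with a non-ASCII byte (0xE9).
raw = np.array([b"foo", b"caf\xe9"], dtype="S4")

# latin-1 decodes every possible byte, one byte to one code point below 256.
text = np.char.decode(raw, "latin-1")

# Encoding with the same codec recovers the original bytes exactly.
back = np.char.encode(text, "latin-1")

print(text.dtype.kind)      # 'U': the decoded array holds unicode strings
print((back == raw).all())  # the byte content survives the round trip
```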
I'm afraid these are both also non-starters at this point. NumPy's string dtype corresponds to bytes on Python 3, and you can use it to store arbitrary binary values. Would it really be an improvement to change the repr, if the scalar value resulting from indexing is still bytes?
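A short sketch of the point about indexing, assuming Python 3 and made-up sample values: an element taken from an 'S' array is a bytes object, and the dtype will hold non-printable bytes as well.

```python
import numpy as np

# Second element is arbitrary binary data, not printable text.
arr = np.array([b"foo", b"\xff\x00\x01"], dtype="S3")

first = arr[0]
binary = arr[1]

# The scalar is numpy.bytes_, which is a subclass of Python's bytes, not str.
print(type(first))
print(isinstance(first, bytes), isinstance(first, str))

# The binary value comes back unchanged (only trailing NUL bytes would be stripped).
print(binary)
```

So changing only the array repr would leave `arr[0]` printing as bytes anyway.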
The sanest approach is probably a new dtype for one-byte strings. We talked about this a few years ago, but nobody has implemented it yet: http://numpy-discussion.scipy.narkive.com/3nqDu3Zk/a-one-byte-string-dtype
From the ref manual, 'S' is a "(byte-)string", which (to me) should never have non-printable characters. That's why I'm advocating "S" to be your proposed one-byte strings, while a new "B" dtype is needed for arbitrary binary arrays. This has the added benefit of making docstrings correct on both py2 and py3.
But I won't keep pushing for this; I understand the backwards-compatibility issues mentioned before. Maybe "S" should just be deprecated, "s" (as the one-byte strings) and "B" added instead, and all docstrings and tests changed to "s". In any case, after reading the whole thread, it's not clear to me what's the consensus on what the solution should be (Chris's summary is the closest thing to that).
Cheers,
Lluis