On Mon, Apr 24, 2017 at 10:51 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
> > - round-tripping of binary data (at least with Python's encoding/decoding)
> > -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
> > same bytes back. You may get garbage, but you won't get an EncodingError.
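That round-trip property is easy to demonstrate in plain Python (a minimal sketch):

```python
# latin-1 maps every one of the 256 byte values to a code point, so any
# byte string decodes without error and re-encodes to the identical bytes.
raw = bytes(range(256))
text = raw.decode("latin-1")          # never raises
assert text.encode("latin-1") == raw  # lossless round-trip
```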
> For a new application, it's a good thing if a text type breaks when you try
> to stuff arbitrary bytes in it
maybe, maybe not -- the application may be new, but the data it works with may not be.
> (see Python 2 vs Python 3 strings).
this is exactly why py3 strings needed to add the "surrogateescape" error
handler: https://www.python.org/dev/peps/pep-0383 -- sometimes text and binary
data are mixed, sometimes encoded text is broken. It is very useful to be able
to pass such data through strings losslessly.

> Certainly, I would argue that nobody should write data in latin-1 unless
> they're doing so for the sake of a legacy application.
or you really want that 1-byte per char efficiency
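For reference, the surrogateescape pass-through mentioned above works like this (a minimal sketch):

```python
# surrogateescape smuggles undecodable bytes through a str as lone
# surrogate code points, so mixed or broken data survives a round-trip.
data = b"mostly ascii, but \xff\xfe is not valid UTF-8"
s = data.decode("utf-8", errors="surrogateescape")
assert s.encode("utf-8", errors="surrogateescape") == data
```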
> I do understand the value in having some "string" data type that could be
> used by default by loaders for legacy file formats/applications (i.e.,
> netCDF3) that support unspecified "one byte strings." Then you're a few
> short calls away from viewing (i.e., array.view('text[my_real_encoding]'),
> if we support arbitrary encodings) or decoding (i.e.,
> np.char.decode(array.view(bytes), 'my_real_encoding')) the data in the
> proper encoding. It's not realistic to expect users to know the true
> encoding for strings from a file before they even look at the data.
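Assuming the legacy data lands in today's fixed-width 'S' dtype, the decode step might look like this sketch ('latin-1' is just a stand-in for whatever encoding the file actually used):

```python
import numpy as np

# Hypothetical legacy one-byte strings, stored in the current 'S' dtype.
raw = np.array([b"caf\xe9", b"na\xefve"], dtype="S8")

# Decode with the encoding you believe the file used -> a unicode ('U') array.
decoded = np.char.decode(raw, "latin-1")
assert decoded[0] == "café"
```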
except that you really should :-(

> On the other hand, if this is the use-case, perhaps we really want an
> encoding closer to "Python 2" string, i.e., "unknown", to let this be
> signaled more explicitly. I would suggest that "text[unknown]" should
> support operations like a string if it can be decoded as ASCII, and
> otherwise error. But unlike "text[ascii]", it will let you store arbitrary
> bytes.
I _think_ that is what using latin-1 (or latin-9) gets you -- if it really is
ascii, then it's perfect. If it really is latin-*, then you get some extra
useful stuff, and if it's corrupted somehow, you still get the ascii text
correct, and the rest won't barf and can be passed on through.

> So far, we have real use cases for at least UTF-8, UTF-32, ASCII and
> "unknown".
hmm -- "unknown" should be bytes, not text. If the user needs to look at it first, then load it as bytes, run chardet or something on it, then cast to the right encoding. The current 'S' dtype truncates silently already:
One advantage of a new (non-default) dtype is that we can change this behavior.
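The silent truncation in question is easy to reproduce with today's 'S' dtype:

```python
import numpy as np

# Assigning an over-long value to a fixed-width 'S' array silently
# drops the bytes that don't fit -- no error, no warning.
a = np.zeros(1, dtype="S3")
a[0] = b"hello"
assert a[0] == b"hel"
```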
yeah -- still on the edge about that, at least with variable-size encodings.
It's hard to know when it's going to happen and it's hard to know what to do
when it does. At least if it truncates silently, numpy can have the code to
do the truncation properly. Maybe an option? And the numpy numeric types
truncate (or overflow) already. Again: if the default string handling matches
expectations from python strings, then the specialized ones can be more
buyer-beware.

Also -- if utf-8 is the default -- what do you get when you create an array
from a python string sequence? Currently with the 'S' and 'U' dtypes, the
dtype is set to the longest string passed in. Are we going to pad it a bit?
stick with the exact number of bytes?
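For concreteness, the current sizing behavior: both 'S' and 'U' pick an itemsize that exactly fits the longest input, with no padding:

```python
import numpy as np

# The dtype width is set to the longest element passed in -- no padding.
u = np.array(["a", "abcd"])
assert u.dtype.kind == "U" and u.dtype.itemsize == 4 * 4  # 4 chars x 4 bytes

s = np.array([b"a", b"abcd"])
assert s.dtype.kind == "S" and s.dtype.itemsize == 4      # 4 bytes
```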
> It might be better to avoid this for now, and force users to be explicit
> about encoding if they use the dtype for encoded text.
yup. And we really should have a bytes type for py3 -- which we do, it's just
called 'S', which is pretty confusing :-)

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov