On Mon, Apr 24, 2017 at 10:51 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
> > - round-tripping of binary data (at least with Python's encoding/decoding)
> > -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
> > same bytes back. You may get garbage, but you won't get an EncodingError.
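That round-trip property is easy to demonstrate in plain Python (a minimal sketch):

```python
# latin-1 maps every one of the 256 byte values to a code point, so any
# byte string decodes without error and re-encodes to the identical bytes.
raw = bytes(range(256))
text = raw.decode("latin-1")          # never raises
assert text.encode("latin-1") == raw  # lossless round-trip
```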
> For a new application, it's a good thing if a text type breaks when you try
> to stuff arbitrary bytes in it
maybe, maybe not -- the application may be new, but the data it works with may not be.
> (see Python 2 vs Python 3 strings).
this is exactly why py3 strings needed to add the "surrogateescape" error
handler: https://www.python.org/dev/peps/pep-0383 -- sometimes text and binary
data are mixed, sometimes encoded text is broken. It is very useful to be able
to pass such data through strings losslessly.

> Certainly, I would argue that nobody should write data in latin-1 unless
> they're doing so for the sake of a legacy application.
or you really want that 1-byte per char efficiency
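For reference, the surrogateescape pass-through mentioned above works like this (a minimal sketch):

```python
# surrogateescape smuggles undecodable bytes through a str as lone
# surrogate code points, so mixed or broken data survives a round-trip.
data = b"mostly ascii, but \xff\xfe is not valid UTF-8"
s = data.decode("utf-8", errors="surrogateescape")
assert s.encode("utf-8", errors="surrogateescape") == data
```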
> I do understand the value in having some "string" data type that could be
> used by default by loaders for legacy file formats/applications (i.e.,
> netCDF3) that support unspecified "one byte strings." Then you're a few
> short calls away from viewing (i.e., array.view('text[my_real_encoding]'),
> if we support arbitrary encodings) or decoding (i.e.,
> np.char.decode(array.view(bytes), 'my_real_encoding')) the data in the
> proper encoding. It's not realistic to expect users to know the true
> encoding for strings from a file before they even look at the data.
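Assuming the legacy data lands in today's fixed-width 'S' dtype, the decode step might look like this sketch ('latin-1' is just a stand-in for whatever encoding the file actually used):

```python
import numpy as np

# Hypothetical legacy one-byte strings, stored in the current 'S' dtype.
raw = np.array([b"caf\xe9", b"na\xefve"], dtype="S8")

# Decode with the encoding you believe the file used -> a unicode ('U') array.
decoded = np.char.decode(raw, "latin-1")
assert decoded[0] == "café"
```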
except that you really should :-(

> On the other hand, if this is the use-case, perhaps we really want an
> encoding closer to "Python 2" string, i.e., "unknown", to let this be
> signaled more explicitly. I would suggest that "text[unknown]" should
> support operations like a string if it can be decoded as ASCII, and
> otherwise error. But unlike "text[ascii]", it will let you store arbitrary
> bytes.
I _think_ that is what using latin-1 (or latin-9) gets you -- if it really is
ascii, then it's perfect. If it really is latin-*, then you get some extra
useful stuff, and if it's corrupted somehow, you still get the ascii text
correct, and the rest won't barf and can be passed on through.

> So far, we have real use cases for at least UTF-8, UTF-32, ASCII and
> "unknown".
hmm -- "unknown" should be bytes, not text. If the user needs to look at it first, then load it as bytes, run chardet or something on it, then cast to the right encoding. The current 'S' dtype truncates silently already:
One advantage of a new (non-default) dtype is that we can change this behavior.
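The silent truncation in question is easy to reproduce with today's 'S' dtype:

```python
import numpy as np

# Assigning an over-long value to a fixed-width 'S' array silently
# drops the bytes that don't fit -- no error, no warning.
a = np.zeros(1, dtype="S3")
a[0] = b"hello"
assert a[0] == b"hel"
```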
yeah -- still on the edge about that, at least with variable-size encodings.
It's hard to know when it's going to happen and it's hard to know what to do
when it does. At least if it truncates silently, numpy can have the code to
do the truncation properly. Maybe an option? And the numpy numeric types
truncate (or overflow) already. Again: if the default string handling matches
expectations from python strings, then the specialized ones can be more
buyer-beware.

Also -- if utf-8 is the default -- what do you get when you create an array
from a python string sequence? Currently with the 'S' and 'U' dtypes, the
dtype is set to the longest string passed in. Are we going to pad it a bit?
stick with the exact number of bytes?
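For concreteness, the current sizing behavior: both 'S' and 'U' pick an itemsize that exactly fits the longest input, with no padding:

```python
import numpy as np

# The dtype width is set to the longest element passed in -- no padding.
u = np.array(["a", "abcd"])
assert u.dtype.kind == "U" and u.dtype.itemsize == 4 * 4  # 4 chars x 4 bytes

s = np.array([b"a", b"abcd"])
assert s.dtype.kind == "S" and s.dtype.itemsize == 4      # 4 bytes
```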
> It might be better to avoid this for now, and force users to be explicit
> about encoding if they use the dtype for encoded text.
yup. And we really should have a bytes type for py3 -- which we do, it's just
called 'S', which is pretty confusing :-)

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov