Chris, you've mashed all of my emails together; some of them are in reply to you, some in reply to others. Unfortunately, this dropped a lot of the context from each of them, and appears to be creating some misunderstandings about what each person is advocating.
On Mon, Apr 24, 2017 at 2:00 PM, Chris Barker <chris.barker@noaa.gov> wrote:
On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern <robert.kern@gmail.com>
wrote:
Solve the HDF5 problem (i.e. fixed-length UTF-8 strings)
I agree -- binary compatibility with utf-8 is a core use case -- though is it so bad to go through python's encoding/decoding machinery to do it? Do numpy arrays HAVE to be storing utf-8 natively?
If the point is to have an array that transparently accepts/yields `unicode/str` scalars while maintaining the in-memory encoding, yes. If that's not the point, then IMO the status quo is fine, and *no* new dtypes should be added, just maybe some utility functions to convert between the bytes-ish arrays and the Unicode-holding arrays (which was one of my proposals). I am mostly happy to live in a world where I read in data as bytes-ish arrays, decode into `object` arrays holding `unicode/str` objects, do my manipulations, then encode the array into a bytes-ish array to give to the C API or file format.
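A minimal sketch of that status-quo workflow, using only the existing `np.char` helpers. The sample data and the choice of UTF-8 are assumptions for illustration; a real file format would dictate its own encoding and width.

```python
import numpy as np

# Bytes-ish array as it might arrive from a file or C API (fixed-width 'S' dtype).
raw = np.array([b"caf\xc3\xa9", b"na\xc3\xafve"], dtype="S6")

# Decode into an object array holding Python str objects.
# UTF-8 is an assumption about the data source here.
text = np.char.decode(raw, "utf-8").astype(object)

# Do the manipulations on ordinary Python strings.
upper = np.array([s.upper() for s in text], dtype=object)

# Encode back into a bytes-ish array to hand to the C API or file format.
out = np.char.encode(upper.astype(str), "utf-8")
```

The in-memory encoding only exists at the two ends (`raw` and `out`); everything in between is plain `str` objects, which is why no new dtype is needed for this style of work.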
or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements.
yeah -- though I've seen projects get stuck in the "sorting out what to do, so nothing gets done" stage before -- I don't want Julian to get too frustrated and end up doing nothing.
So here I'll lay out what I think are the fundamental requirements:
1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do:
arr = np.array(("this", "that",))
you get an array that can store ANY unicode string with 4 or fewer characters
and arr[1] will return a native Python string object.
2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so as not to waste space for "typical european-oriented data".
arr = np.array(("this", "that",), dtype=np.single_byte_string)
(name TBD)
and arr[1] would return a python string.
attempting to put in a string that is not compatible with the encoding would raise an Encoding Error.
I highly recommend that ISO 8859-15 (latin-9), or latin-1, be the encoding in this case.
3) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???)
4) a fixed length bytes dtype -- pretty much what 'S' is now under python three -- settable from a bytes or bytearray object, and returns a bytes object.
- you could use astype() to convert between bytes and a specified encoding with no change in binary representation.
I understand, but not all tedious discussions that have not yet achieved consensus are bikeshedding to be cut short. We couldn't really decide what to do back in the pre-1.0 days, either, so we just did *something*, and that something is now the very situation that Julian has a problem with. We have more experience now, especially with the added wrinkles of Python 3; other projects have advanced and matured their Unicode string array-handling (e.g. pandas and HDF5); now is a great time to have a real discussion about what we *need* before we make decisions about what we should *do*.
2) and 3) could be fully covered by a dtype with a settable encoding that might as well support all python built-in encodings -- though I think an alias to the common cases would be good -- latin, utf-8. If so, the length would have to be specified in bytes.
1) could be covered with the existing 'U' type -- only downside being some wasted space -- or with a pointer to a python string dtype -- which would also waste space, though less for long-ish strings, and maybe give us some better access to the nifty built-in string features.
+1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting a decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.
That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding.
You'll need to specify what NULL-terminating behavior you want for the fixed-length bytes dtype (your #4). np.string_ has NULL-termination. np.void (which could be made to work better with `bytes`) does not. Both have use-cases for text encoding (shakes fist at UTF-16).
Well, yes -- BUT: That strictness in python3 -- "data is either text or bytes, and text in an unknown (or invalid) encoding HAVE to be bytes" bit Python3 in the butt for a long time. Folks that deal in the messy real world of binary data that is kinda-mostly text, but may have a bit of binary data, or be in an unknown encoding, or be corrupted were very, very adamant about how this model DID NOT work for them. Very influential people were seriously critical of python 3. Eventually, py3 added bytes string formatting, surrogateescape, and other features that facilitate working with messy almost text.
Walk me through a problem that you've encountered with such textish data in arrays. I know the problems in Web protocol-land, but they are not really relevant to us. What are *your* problems? Why didn't those ameliorations that were added for the Web world address your problems? I really want to get at specific use cases that interact with numpy, not handwaving at problems other people have had in other contexts.
Practicality beats purity -- if you have one-byte per char data that is mostly european, then latin-1 or latin-9 let you work with it, have it mostly work, and never crash out with an encoding error.
- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me.
latin-1 would be only for the special case of mostly-ascii (or true latin) one-byte-per-char encodings (which is a common use-case in scientific data sets). I think it has only upside over ascii. It would be a fine idea to support any one-byte-per-char encoding, too.
In my experience, it has both upside and downside. Silently creating mojibake is a problem. The process that you described, decoding ANY strings of bytes as latin-1, can create mojibake. The inverse, encoding then decoding, may not, but of course the encoding step there does not accept arbitrary Unicode strings.
As for external data in utf-8 -- yes that should be dealt with properly -- either by truly supporting utf-8 internally, or by properly encoding/decoding when putting it in and moving it out of an array.
utf-8 is a very important encoding -- I just think it's the wrong one for the default interplay with python strings.
Doing in-memory assignments to a fixed-encoding, fixed-width string
dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1!
of course not -- if you are writing to a format that specifies a width and the encoding, you want to use bytes :-) -- or a dtype that is properly encoding-aware. I was not suggesting that latin-1 be used for arbitrary bytes -- that is what bytes are for.
Ah, your message was responding to Stephan who questioned why latin-1 should be the default encoding for the `unicode/str`-aware string dtype. It seemed like you were affirming that latin-1 ought to be that default. It seems like that is not your position, but you are defending the existence of a latin-1 dtype for specific uses.
I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake option.
well, it wouldn't create mojibake - anything that went from a python string to a latin-1 array would be properly encoded in latin-1 -- unless it came from already corrupted data. but when you have corrupted data, your only choices are to:
- raise an error
- alter the data (error-"replace")
- pass the corrupted data on through.
but it could deal with mojibake -- that's the whole point :-)
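For reference, the three choices listed above correspond roughly to the following behaviours of Python's built-in codecs. This is only an illustration on a made-up byte string, not a proposal for how a numpy dtype would expose them.

```python
# Made-up "corrupted" input: high-bit bytes that are not valid UTF-8.
data = b"caf\xe9 \xff"

# 1. Raise an error (the default, errors="strict").
try:
    data.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True

# 2. Alter the data: each undecodable byte becomes U+FFFD.
replaced = data.decode("utf-8", errors="replace")

# 3. Pass the data on through: latin-1 maps every byte 0-255 to a code
#    point, so decoding never fails and re-encoding restores the bytes.
passed = data.decode("latin-1")
roundtrip = passed.encode("latin-1")
```

Option 3 preserves the bytes, but the characters in `passed` are only correct if the data really was latin-1.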
You are right that assigning a `unicode/str` object into my latin-1-dtype array would not create mojibake, but that's not the only way to fill a numpy array. In the context of my email, I was responding to a use case being floated for the latin-1 dtype that was to read existing FITS files that have fields that are text-ish: plain octets according to the file format standard, but in practice mostly ASCII with a few sparse high-bit characters typically from some unspecified iso-8859-* encoding. If that unspecified encoding wasn't latin-1, then I'm getting mojibake when I read the file (unless, happy days, the author of the file was also using latin-1).
I understand that you are proposing a latin-1 dtype in a context with other dtypes and tools that might make that use of the latin-1 dtype obsolete. However, there are others who have been proposing just a latin-1 dtype for this purpose.
Let me make a counter-proposal for your latin-1 dtype (your #2) that might address your, Thomas's, and Julian's use cases:
2) We want a single-byte-per-character, NULL-terminated string dtype that can be used to represent mostly-ASCII textish data that may have some high-bit characters from some 8-bit encoding. It should be able to read arbitrary bytes (that is, up to the NULL-termination) and write them back out as the same bytes if unmodified. This lets us read this text from files where the encoding is unspecified (or is lying about the encoding) into `unicode/str` objects. The encoding is specified as `ascii` but the decoding/encoding is done with the `surrogateescape` option so that high-bit characters are faithfully represented in the `unicode/str` string but are not erroneously reinterpreted as other characters from an arbitrary encoding.
I'd even be happy if Julian or someone wants to go ahead and implement this right now and leave the UTF-8 dtype for a later time.
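To make the difference concrete, here is a small sketch using plain Python codecs on a single field; no such numpy dtype exists yet, and the sample word and the choice of ISO 8859-2 as the "unknown" source encoding are assumptions for illustration.

```python
# Bytes written by someone using ISO 8859-2; the reader is not told this.
raw = "Łódź".encode("iso-8859-2")

# Decoding as latin-1 never fails, but silently yields the wrong
# characters (mojibake) because the guess about the encoding was wrong.
as_latin1 = raw.decode("latin-1")

# ascii + surrogateescape: ASCII bytes decode normally, and each high-bit
# byte becomes a lone surrogate (U+DC80..U+DCFF) instead of being
# reinterpreted as some other character.
escaped = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same options writes the original bytes back out.
back = escaped.encode("ascii", errors="surrogateescape")
```

Both approaches round-trip unmodified data; the difference is that the `surrogateescape` string does not pretend to know what the high-bit characters are.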
As long as this ASCII-surrogateescape dtype is not called np.realstring (it's *really* important to me that the bikeshed not be this color). ;-)
--
Robert Kern