Re: [Numpy-discussion] proposal: smaller representation of string arrays
I just re-read the "UTF-8" manifesto, and it helped me clarify my thoughts:

1) most of it is focused on utf-8 vs utf-16. And that is a strong argument -- utf-16 is the worst of both worlds.

2) it isn't really addressing how to deal with fixed-size string storage as needed by numpy. It does bring up Python's current approach to Unicode:

"""
This lead to software design decisions such as Python's string O(1) code point access. The truth, however, is that Unicode is inherently more complicated and there is no universal definition of such thing as *Unicode character*. We see no particular reason to favor Unicode code points over Unicode grapheme clusters, code units or perhaps even words in a language for that.
"""

My thoughts on that -- it's technically correct, but practicality beats purity, and the character concept is pretty darn useful for at least some (commonly used in the computing world) languages. In any case, whether the top-level API is character focused doesn't really have a bearing on the internal encoding, which is very much an implementation detail in py3 at least. And Python has made its decision about that.

So what are the numpy use-cases? I see essentially two:

1) Use with/from Python -- both creating and working with numpy arrays.

In this case, we want something compatible with Python's string (i.e. full Unicode support) and I think it should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).

However, there is a challenge here: numpy requires fixed-number-of-bytes dtypes. And full unicode support with a fixed number of bytes matching a fixed number of characters is only possible with UCS-4 -- hence the current implementation. And this is actually just fine! I know we all want to be efficient with data storage, but really -- in the early days of Unicode, when folks thought 16 bits were enough, doubling the memory usage for western language storage was considered fine -- how long in computer life time does it take to double your memory? But now, when memory, disk space, bandwidth, etc. are all literally orders of magnitude larger, we can't handle a factor of 4 increase in "wasted" space?

Alternatively, Robert's suggestion of having essentially an object array, where the objects were known to be python strings, is a pretty nice idea -- it gives the full power of python strings, and is a perfect one-to-one match with the python text data model.

But as scientific text data often is 1-byte compatible, a one-byte-per-char dtype is a fine idea, too -- and we pretty much have that already with the existing string type -- that could simply be enhanced by enforcing the encoding to be latin-9 (or latin-1, if you don't want the Euro symbol). This would get us what scientists expect from strings in a way that is properly compatible with Python's string type. You'd get encoding errors if you tried to stuff anything else in there, and that's that. Yes, it would have to be a "new" dtype for backwards compatibility.

2) Interchange with other systems: passing the raw binary data back and forth between numpy arrays and other code, written in C, Fortran, or binary file formats.

This is a key use-case for numpy -- I think the key to its enormous success. But how important is it for text? Certainly any data set I've ever worked with has had gobs of binary numerical data, and a small smattering of text.
So in that case, if, for instance, h5py had to encode/decode text when transferring between HDF files and numpy arrays, I don't think I'd ever see the performance hit. As for code complexity -- it would mean more complex code in interface libs, and less complex code in numpy itself. (though numpy could provide utilities to make it easy to write the interface code)

If we do want to support direct binary interchange with other libs, then we should probably simply go for it, and support any encoding that Python supports -- as long as you are dealing with multiple encodings, why try to decide up front which ones to support?

But how do we expose this to numpy users? I still don't like having non-fixed-width encoding under the hood, but what can you do? Other than that, having the encoding be a selectable part of the dtype works fine -- and in that case the number of bytes should be the "length" specifier. This, however, creates a bit of an impedance mismatch with the "character-focused" approach of the python string type. And it requires the user to understand something about the encoding in order to even know how many bytes they need -- a utf-8-100 string will hold a different "length" of string than a utf-16-100 string.

So -- I think we should address the use-cases separately -- one for "normal" python use and simple interoperability with python strings, and one for interoperability at the binary level. And an easy way to convert between the two.

For Python use -- a pointer to a Python string would be nice. Then use a native flexible-encoding dtype for everything else.

Thinking out loud -- another option would be to set defaults for the multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility with the python string type -- and make folks make an effort to get anything else.

One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error: EncodingError if it can't be encoded into the defined encoding. ValueError if it is too long -- it should not be silently truncated.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov
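(For concreteness, the factor-of-4 storage tradeoff discussed above shows up directly in the itemsize of the existing dtypes -- a quick check in a typical py3 numpy session:)

import numpy as np

u = np.array(['hello'])
u.dtype, u.itemsize    # (dtype('<U5'), 20) -- UCS-4: 4 bytes per character

s = np.array([b'hello'])
s.dtype, s.itemsize    # (dtype('S5'), 5) -- one byte per character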
On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker <chris.barker@noaa.gov> wrote:
1) Use with/from Python -- both creating and working with numpy arrays.
In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).
Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size. We already have a strong precedent for dtypes reflecting the number of bytes used for storage even when Python doesn't: consider numeric types like int64 and float32 compared to the Python equivalents. It's an intrinsic aspect of NumPy that users need to think about how their data is actually stored.
However, there is a challenge here: numpy requires fixed-number-of-bytes dtypes. And full unicode support with fixed number of bytes matching fixed number of characters is only possible with UCS-4 -- hence the current implementation. And this is actually just fine! I know we all want to be efficient with data storage, but really -- in the early days of Unicode, when folks thought 16 bits were enough, doubling the memory usage for western language storage was considered fine -- how long in computer life time does it take to double your memory? But now, when memory, disk space, bandwidth, etc, are all literally orders of magnitude larger, we can't handle a factor of 4 increase in "wasted" space?
Storage cost is always going to be a concern. Arguably, it's even more of a concern today than it used to be, because compute has been improving faster than storage.
But as scientific text data often is 1-byte compatible, a one-byte-per-char dtype is a fine idea, too -- and we pretty much have that already with the existing string type -- that could simply be enhanced by enforcing the encoding to be latin-9 (or latin-1, if you don't want the Euro symbol). This would get us what scientists expect from strings in a way that is properly compatible with Python's string type. You'd get encoding errors if you tried to stuff anything else in there, and that's that.
I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

So -- I think we should address the use-cases separately -- one for "normal" python use and simple interoperability with python strings, and one for interoperability at the binary level. And an easy way to convert between the two.
For Python use -- a pointer to a Python string would be nice.
Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information.

Then use a native flexible-encoding dtype for everything else.

No opposition here from me. Though again, I think utf-8 alone would also be enough.
Thinking out loud -- another option would be to set defaults for the multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility with the python string type -- and make folks make an effort to get anything else.
The np.unicode_ type is already UCS-4 and the default for dtype=str on Python 3. We probably shouldn't change that, but if we set any default encoding for the new text type, I strongly believe it should be utf-8.

One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:
EncodingError if it can't be encoded into the defined encoding.
ValueError if it is too long -- it should not be silently truncated.
I think we all agree here.
On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).
Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size.
Exactly -- the character-orientation of python strings means that people are used to thinking that strings have a length that is the number of characters in the string. I think there will be a cognitive dissonance if someone does:

arr[i] = a_string

Which then raises a ValueError, something like:

String too long for a string[12] dtype array.

When len(a_string) <= 12

AND that will only occur if there are non-ascii characters in the string, and maybe only if there are more than N non-ascii characters. i.e. it is very likely to be a run-time error that may not have shown up in tests.

So folks need to do something like:

len(a_string.encode('utf-8'))

to see if their string will fit. If not, they need to truncate it, and THAT is non-obvious how to do, too -- you don't want to truncate the encoded bytes naively, you could end up with an invalid bytestring. But you don't know how many characters to truncate, either.
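(For what it's worth, safe truncation is doable in a few lines -- a sketch, not anything numpy provides today, exploiting the fact that decoding with errors='ignore' drops a trailing partial sequence:)

def truncate_utf8(s, max_bytes):
    # Truncate s so that its UTF-8 encoding fits in max_bytes,
    # without splitting a multi-byte character.
    encoded = s.encode('utf-8')[:max_bytes]
    # A blind byte slice can cut a multi-byte sequence in half;
    # errors='ignore' silently drops the dangling partial character.
    return encoded.decode('utf-8', errors='ignore')

truncate_utf8('caf\u00e9s', 4)  # -> 'caf': the 2-byte 'é' doesn't fit whole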
We already have a strong precedent for dtypes reflecting the number of bytes used for storage even when Python doesn't: consider numeric types like int64 and float32 compared to the Python equivalents. It's an intrinsic aspect of NumPy that users need to think about how their data is actually stored.
sure, but a float64 is 64 bits forever and always, and the defaults perfectly match what python is doing under its hood -- even if users don't think about it. So the default behaviour of numpy matches python's built-in types.

Storage cost is always going to be a concern. Arguably, it's even more of a concern today than it used to be, because compute has been improving faster than storage.
sure -- but again, what is the use-case for numpy arrays with a s#$)load of text in them? common? I don't think so. And as you pointed out, numpy doesn't do text processing anyway, so cache performance and all that are not important.

So having UCS-4 as the default, but allowing folks to select a more compact format if they really need it, is a good way to go. Just like numpy generally defaults to float64 and int64 (or 32, depending on platform) -- users can select a smaller size if they have a reason to.

I guess that's my summary -- just like with numeric values, numpy should default to Python-like behavior as much as possible for strings, too -- with an option for a knowledgeable user to do something more performant.
I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
utf-8 is NOT a one-byte-per-char encoding. IF you want to assure that your data are one byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context.

latin-1 or latin-9 buys you (over ASCII):

- A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better.

- A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...)

- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
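(That round-trip property is easy to check for yourself -- a three-line demonstration in plain Python:)

raw = bytes(range(256))               # every possible byte value
text = raw.decode('latin-1')          # never raises: latin-1 maps all 256 bytes
assert text.encode('latin-1') == raw  # and re-encodes to the identical bytes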
For Python use -- a pointer to a Python string would be nice.

Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information.
hmm -- that's a nifty idea -- though I think strings could/should be special cased.
Then use a native flexible-encoding dtype for everything else.
No opposition here from me. Though again, I think utf-8 alone would also be enough.
maybe so -- the major reason for supporting others is binary data exchange with other libraries -- but maybe most of them have gone to utf-8 anyway.

One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:
EncodingError if it can't be encoded into the defined encoding.
ValueError if it is too long -- it should not be silently truncated.
I think we all agree here.
I'm actually having second thoughts -- see above -- if the encoding is utf-8, then truncating is non-trivial -- maybe it would be better for numpy to do it for you. Or set a flag as to which you want?

The current 'S' dtype truncates silently already:

In [6]: arr
Out[6]: array(['this', 'that'], dtype='|S4')

In [7]: arr[0] = "a longer string"

In [8]: arr
Out[8]: array(['a lo', 'that'], dtype='|S4')

(similarly for the unicode type)

So at least we are used to that.

BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings, maybe a name field, where you might assign 32 bytes or so -- then someone has an accented character in their name, and they get 30 or 31 characters -- no big deal. But what if you have a simple label or something with 1 or two characters: then you have 2 bytes to store the name in, and someone tries to put an "odd" character in there, and you get an empty string. Not good.

Also -- if utf-8 is the default -- what do you get when you create an array from a python string sequence? Currently with the 'S' and 'U' dtypes, the dtype is set to the longest string passed in. Are we going to pad it a bit? stick with the exact number of bytes?

It all comes down to this: Python3 has made a very deliberate (and I think Good) choice to treat text as a string of characters, where the user does not need to know or care about encoding issues. Numpy's defaults should do the same thing.

-CHB
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker <chris.barker@noaa.gov> wrote:
latin-1 or latin-9 buys you (over ASCII):
...
- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
For a new application, it's a good thing if a text type breaks when you try to stuff arbitrary bytes in it (see Python 2 vs Python 3 strings). Certainly, I would argue that nobody should write data in latin-1 unless they're doing so for the sake of a legacy application.

I do understand the value in having some "string" data type that could be used by default by loaders for legacy file formats/applications (i.e., netCDF3) that support unspecified "one byte strings." Then you're a few short calls away from viewing (i.e., array.view('text[my_real_encoding]'), if we support arbitrary encodings) or decoding (i.e., np.char.decode(array.view(bytes), 'my_real_encoding')) the data in the proper encoding. It's not realistic to expect users to know the true encoding for strings from a file before they even look at the data.

On the other hand, if this is the use-case, perhaps we really want an encoding closer to the "Python 2" string, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes.
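(The decoding half of that is already expressible today -- the 'text[...]' dtype above is hypothetical, but np.char.decode is real; a sketch:)

import numpy as np

raw = np.array([b'caf\xe9', b'ok'], dtype='S4')   # one-byte data, encoding unspecified
# Once you know (or guess) the true encoding, decode to a unicode array:
decoded = np.char.decode(raw, 'latin-1')          # -> array(['café', 'ok'], dtype='<U4')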
Then use a native flexible-encoding dtype for everything else.
No opposition here from me. Though again, I think utf-8 alone would also be enough.
maybe so -- the major reason for supporting others is binary data exchange with other libraries -- but maybe most of them have gone to utf-8 anyway.
Indeed, it would be helpful for this discussion to know what other encodings are actually currently used by scientific applications. So far, we have real use cases for at least UTF-8, UTF-32, ASCII and "unknown".

The current 'S' dtype truncates silently already:

One advantage of a new (non-default) dtype is that we can change this behavior.
Also -- if utf-8 is the default -- what do you get when you create an array from a python string sequence? Currently with the 'S' and 'U' dtypes, the dtype is set to the longest string passed in. Are we going to pad it a bit? stick with the exact number of bytes?
It might be better to avoid this for now, and force users to be explicit about encoding if they use the dtype for encoded text. We can keep bytes/str mapped to the current choices.
On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker <chris.barker@noaa.gov> wrote:
<snip>
I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
utf-8 is NOT a one-byte per char encoding. IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context.
latin-1 or latin-9 buys you (over ASCII):
- A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better.
- A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...)
- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
+1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (an astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5, which use a character data type to store data which can be bytes 0-255. Getting a decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.
<snip>
BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings, maybe a name field, where you might assign 32 bytes or so -- then someone has an accented character in their name, and they get 30 or 31 characters -- no big deal.
I wouldn't call it a pathological use case; it doesn't seem so uncommon to have large datasets of short strings. I personally deal with a database of hundreds of billions of 2 to 5 character ASCII strings. This has been a significant blocker to Python 3 adoption in my world.

BTW, for those new to the list or with a short memory, this topic has been discussed fairly extensively at least 3 times before. Hopefully the *fourth* time will be the charm!

https://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html
https://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html
https://mail.scipy.org/pipermail/numpy-discussion/2015-February/072311.html

- Tom
On Mon, Apr 24, 2017 at 10:51 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
For a new application, it's a good thing if a text type breaks when you try to stuff arbitrary bytes in it
maybe, maybe not -- the application may be new, but the data it works with may not be.
(see Python 2 vs Python 3 strings).
this is exactly why py3 strings needed to add the "surrogateescape" error handler:

https://www.python.org/dev/peps/pep-0383

sometimes text and binary data are mixed, sometimes encoded text is broken. It is very useful to be able to pass such data through strings losslessly.
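(A quick illustration of what PEP 383 buys you -- plain Python, nothing numpy-specific:)

data = b'mostly text \xff\xfe with some junk'
text = data.decode('utf-8', errors='surrogateescape')  # no exception raised
# The undecodable bytes ride along as lone surrogates and round-trip exactly:
assert text.encode('utf-8', errors='surrogateescape') == data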
Certainly, I would argue that nobody should write data in latin-1 unless they're doing so for the sake of a legacy application.
or you really want that 1-byte per char efficiency
I do understand the value in having some "string" data type that could be used by default by loaders for legacy file formats/applications (i.e., netCDF3) that support unspecified "one byte strings." Then you're a few short calls away from viewing (i.e., array.view('text[my_real_encoding]'), if we support arbitrary encodings) or decoding (i.e., np.char.decode(array.view(bytes), 'my_real_encoding')) the data in the proper encoding. It's not realistic to expect users to know the true encoding for strings from a file before they even look at the data.
except that you really should :-(

On the other hand, if this is the use-case, perhaps we really want an encoding closer to the "Python 2" string, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes.
I _think_ that is what using latin-1 (or latin-9) gets you -- if it really is ascii, then it's perfect. If it really is latin-*, then you get some extra useful stuff, and if it's corrupted somehow, you still get the ascii text correct, and the rest won't barf and can be passed on through.

So far, we have real use cases for at least UTF-8, UTF-32, ASCII and "unknown".
hmm -- "unknown" should be bytes, not text. If the user needs to look at it first, then load it as bytes, run chardet or something on it, then cast to the right encoding.
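(Something like this, say -- chardet is a third-party package, and the data here is just for illustration:)

import numpy as np
import chardet  # third-party encoding detector

raw = np.array([b'Gr\xfc\xdfe', b'hello'], dtype='S6')       # bytes of unknown encoding
guess = chardet.detect(b''.join(raw))['encoding'] or 'latin-1'  # fall back if detection fails
decoded = np.char.decode(raw, guess)                          # now a proper unicode array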
The current 'S' dtype truncates silently already:

One advantage of a new (non-default) dtype is that we can change this behavior.
yeah -- still on the edge about that, at least with variable-size encodings. It's hard to know when it's going to happen and it's hard to know what to do when it does. At least if it truncates silently, numpy can have the code to do the truncation properly. Maybe an option? And the numpy numeric types truncate (or overflow) already.

Again: if the default string handling matches expectations from python strings, then the specialized ones can be more buyer-beware.

Also -- if utf-8 is the default -- what do you get when you create an array from a python string sequence? Currently with the 'S' and 'U' dtypes, the dtype is set to the longest string passed in. Are we going to pad it a bit? stick with the exact number of bytes?
It might be better to avoid this for now, and force users to be explicit about encoding if they use the dtype for encoded text.
yup. And we really should have a bytes type for py3 -- which we do, it's just called 'S', which is pretty confusing :-)

-CHB
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings, maybe a name field, where you might assign 32 bytes or so -- then someone has an accented character in their name, and they get 30 or 31 characters -- no big deal.

I wouldn't call it a pathological use case, it doesn't seem so uncommon to have large datasets of short strings.
It's pathological for using a variable-length encoding.
I personally deal with a database of hundreds of billions of 2 to 5 character ASCII strings. This has been a significant blocker to Python 3 adoption in my world.
I agree -- it is a VERY common case for scientific data sets. But a one-byte-per-char encoding would handle it nicely, or UCS-4 if you want Unicode. The wasted space is not that big a deal with short strings...

BTW, for those new to the list or with a short memory, this topic has been discussed fairly extensively at least 3 times before. Hopefully the *fourth* time will be the charm!
yes, let's hope so! The big difference now is that Julian seems to be committed to actually making it happen!

Thanks Julian!

Which brings up a good point -- if you need us to stop the damn bike-shedding so you can get it done -- say so. I have strong opinions, but would still rather see any of the ideas on the table implemented than nothing.

-Chris
On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker <chris.barker@noaa.gov> wrote:
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:

<snip>

I agree -- it is a VERY common case for scientific data sets. But a one-byte-per-char encoding would handle it nicely, or UCS-4 if you want Unicode. The wasted space is not that big a deal with short strings...

Unless you have hundreds of billions of them.

I have strong opinions, but would still rather see any of the ideas on the table implemented than nothing.

FWIW, I prefer nothing to just adding a special case for latin-1. Solve the HDF5 problem (i.e. fixed-length UTF-8 strings) or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements.

-- Robert Kern
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker <chris.barker@noaa.gov> wrote:

- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.

+1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (an astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5, which use a character data type to store data which can be bytes 0-255. Getting a decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.

That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding.

-- Robert Kern
On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker <chris.barker@noaa.gov> wrote:
- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
+1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (an astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5, which use a character data type to store data which can be bytes 0-255. Getting a decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.
That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding.
If you could go back 30 years and get every scientist in the world to do the right thing, then sure. But we are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters.

So I would beg to actually move forward with a pragmatic solution that addresses very real and consequential problems that we face instead of waiting/praying for a perfect solution.

- Tom
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker <chris.barker@noaa.gov> wrote:
On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).
Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size.
Exactly -- the character-orientation of python strings means that people are used to thinking that strings have a length that is the number of characters in the string. I think there will be a cognitive dissonance if someone does:

arr[i] = a_string

Which then raises a ValueError, something like:

String too long for a string[12] dtype array.

When len(a_string) <= 12

We have the freedom to make the error message not suck. :-)

AND that will only occur if there are non-ascii characters in the string, and maybe only if there are more than N non-ascii characters. i.e. it is very likely to be a run-time error that may not have shown up in tests.

So folks need to do something like: len(a_string.encode('utf-8')) to see if their string will fit. If not, they need to truncate it, and THAT is non-obvious how to do, too -- you don't want to truncate the encoded bytes naively, you could end up with an invalid bytestring. But you don't know how many characters to truncate, either.

If this becomes the right strategy for dealing with these problems (and I'm not sure that it is), we can easily make a utility function that does this for people.

This discussion is why I want to be sure that we have our use cases actually mapped out. For this kind of in-memory manipulation, I'd use an object array (a la pandas), then convert to the uniform-width string dtype when I needed to push this out to a C API, HDF5 file, or whatever actually requires a string-dtype array. The required width gets computed from the data after all of the manipulations are done.

Doing in-memory assignments to a fixed-encoding, fixed-width string dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1!

I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

utf-8 is NOT a one-byte per char encoding. IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context.

latin-1 or latin-9 buys you (over ASCII):

- A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better.

- A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...)

- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.

But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me.

-- Robert Kern
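(For concreteness, the object-array-then-encode workflow described above, in today's numpy -- a sketch with made-up data:)

import numpy as np

# In-memory: a flexible object array of real Python strings (a la pandas).
strings = np.array(['Grüße', 'hi', 'längere Zeichenkette'], dtype=object)
strings = np.array([s.upper() for s in strings], dtype=object)  # manipulate freely

# At the boundary (C API, HDF5 file, ...): encode, letting numpy compute
# the fixed width from the data after all manipulation is done.
encoded = np.array([s.encode('utf-8') for s in strings])
encoded.dtype   # e.g. dtype('S21') -- width determined by the encoded data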
On Mon, Apr 24, 2017 at 11:56 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern <robert.kern@gmail.com> wrote:

<snip>

If you could go back 30 years and get every scientist in the world to do the right thing, then sure. But we are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters.

I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly.

Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`?
So I would beg to actually move forward with a pragmatic solution that addresses very real and consequential problems that we face instead of waiting/praying for a perfect solution.
Well, I outlined a solution: work with `bytes` arrays with utilities to convert to/from the Unicode-aware string dtypes (or `object`). A UTF-8-specific dtype and maybe a string-specialized `object` dtype address the very real and consequential problems that I face (namely and respectively, working with HDF5 and in-memory manipulation of string datasets).

I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake option. It should not be *the* Unicode string dtype (i.e. named np.realstring or np.unicode as in the original proposal).

-- Robert Kern
On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern <robert.kern@gmail.com> wrote:
I agree -- it is a VERY common case for scientific data sets. But a one-byte-per-char encoding would handle it nicely, or UCS-4 if you want Unicode. The wasted space is not that big a deal with short strings...
Unless you have hundreds of billions of them.
Which is why a one-byte-per-char encoding is a good idea.

Solve the HDF5 problem (i.e. fixed-length UTF-8 strings)

I agree -- binary compatibility with utf-8 is a core use case -- though is it so bad to go through python's encoding/decoding machinery to do it? Do numpy arrays HAVE to be storing utf-8 natively?
or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements.
yeah -- though I've seen projects get stuck in the sorting-out-what-to-do-so-nothing-gets-done stage before -- I don't want Julian to get too frustrated and end up doing nothing.

So here I'll lay out what I think are the fundamental requirements:

1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do:

arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or less characters, and arr[1] will return a native Python string object.

2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so as not to be wasting space for "typical european-oriented data".

arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD)

and arr[1] would return a python string. Attempting to put in a string not compatible with the encoding would raise an EncodingError. I highly recommend that ISO 8859-15 (latin-9) or latin-1 be the encoding in this case.

3) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???)

4) a fixed length bytes dtype -- pretty much what 'S' is now under python three -- settable from a bytes or bytearray object, and returning a bytes object.

- you could use astype() to convert between bytes and a specified encoding with no change in binary representation.

2) and 3) could be fully covered by a dtype with a settable encoding that might as well support all python built-in encodings -- though I think an alias to the common cases would be good -- latin, utf-8. If so, the length would have to be specified in bytes.

1) could be covered with the existing 'U' type -- only downside being some wasted space -- or with a pointer-to-a-python-string dtype -- which would also waste space, though less for long-ish strings, and maybe give us some better access to the nifty built-in string features.
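(Requirement 1 is essentially what the current 'U' dtype already does -- a quick check in a typical py3 numpy session:)

import numpy as np

arr = np.array(("this", "that"))
arr.dtype      # dtype('<U4'): room for any unicode string of up to 4 characters
type(arr[1])   # numpy.str_, a subclass of the native Python str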
+1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (an astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5, which use a character data type to store data which can be bytes 0-255. Getting a decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.
That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding.
Well, yes -- BUT: That strictness in python3 -- "data is either text or bytes, and text in an unknown (or invalid) encoding HAS to be bytes" -- bit Python3 in the butt for a long time. Folks that deal in the messy real world of binary data that is kinda-mostly text, but may have a bit of binary data, or be in an unknown encoding, or be corrupted, were very, very adamant about how this model DID NOT work for them. Very influential people were seriously critical of python 3. Eventually, py3 added bytes string formatting, surrogateescape, and other features that facilitate working with messy almost-text.

Practicality beats purity -- if you have one-byte-per-char data that is mostly european, then latin-1 or latin-9 lets you work with it, have it mostly work, and never crash out with an encoding error.
- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.

But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me.
latin-1 would be only for the special case of mostly-ascii (or true latin) one-byte-per-char encodings (which is a common use-case in scientific data sets). I think it has only upside over ascii. It would be a fine idea to support any one-byte-per-char encoding, too.

As for external data in utf-8 -- yes, that should be dealt with properly -- either by truly supporting utf-8 internally, or by properly encoding/decoding when putting it in and moving it out of an array. utf-8 is a very important encoding -- I just think it's the wrong one for the default interplay with python strings.

Doing in-memory assignments to a fixed-encoding, fixed-width string dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1!
of course not -- if you are writing to a format that specifies a width and the encoding, you want to use bytes :-) -- or a dtype that is properly encoding-aware. I was not suggesting that latin-1 be used for arbitrary bytes -- that is what bytes are for.
- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array?
of course not -- see above.

I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake option.
well, it wouldn't create mojibake -- anything that went from a python string to a latin-1 array would be properly encoded in latin-1 -- unless it came from already corrupted data. But when you have corrupted data, your only choices are to:

- raise an error
- alter the data (error-"replace")
- pass the corrupted data on through.

But it could deal with mojibake -- that's the whole point :-)
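(The three choices, spelled out on a latin-1 byte caught in the wrong encoding -- plain Python:)

bad = b'caf\xe9'  # latin-1 bytes, invalid as utf-8
try:
    bad.decode('utf-8')                # choice 1: raise an error
except UnicodeDecodeError:
    pass
bad.decode('utf-8', errors='replace')  # choice 2: alter the data -> 'caf\ufffd'
bad.decode('latin-1')                  # choice 3: pass through -> 'café', mojibake risk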
It should not be *the* Unicode string dtype (i.e. named np.realstring or np.unicode as in the original proposal).
God no -- sorry if it looked like I was suggesting that. I only suggest that it might be *the* one-byte-per-char string type.

-CHB
On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.kern@gmail.com> wrote:
I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly.
Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`?
This is very simple and obvious but I will state it for the record. Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat.

In [22]: dat = np.array(['yes', 'no'], dtype='S3')

In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
Out[23]: False

In [24]: dat == b'yes'  # Right answer but not practical
Out[24]: array([ True, False], dtype=bool)

- Tom

[1]: Using h5py or pytables. Same with FITS, although astropy.io.fits does some tricks under the hood to auto-convert to `U` type as needed.
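(One stopgap, at the cost of a temporary copy per comparison, is to decode on the fly rather than converting the whole dataset up front -- a sketch:)

import numpy as np

dat = np.array(['yes', 'no'], dtype='S3')
np.char.decode(dat, 'ascii') == 'yes'   # array([ True, False]) via a temporary 'U' copy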
Chris, you've mashed all of my emails together; some of them are in reply to you, some in reply to others. Unfortunately, this dropped a lot of the context from each of them, and appears to be creating some misunderstandings about what each person is advocating.

On Mon, Apr 24, 2017 at 2:00 PM, Chris Barker <chris.barker@noaa.gov> wrote:
On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern <robert.kern@gmail.com>
wrote:
Solve the HDF5 problem (i.e. fixed-length UTF-8 strings)
I agree-- binary compatibility with utf-8 is a core use case -- though is it so bad to go through python's encoding/decoding machinery to so it? Do numpy arrays HAVE to be storing utf-8 natively?
If the point is to have an array that transparently accepts/yields `unicode/str` scalars while maintaining the in-memory encoding, yes. If that's not the point, then IMO the status quo is fine, and *no* new dtypes should be added, just maybe some utility functions to convert between the bytes-ish arrays and the Unicode-holding arrays (which was one of my proposals). I am mostly happy to live in a world where I read in data as bytes-ish arrays, decode into `object` arrays holding `unicode/str` objects, do my manipulations, then encode the array into a bytes-ish array to give to the C API or file format.
or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements.
yeah -- though I've seen projects get stuck in the sorting out what to do, so nothing gets done stage before -- I don't want Julian to get too frustrated and end up doing nothing.
So here I'll lay out what I think are the fundamental requirements:
1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do:
arr = np.array(("this", "that",))
you get an array that can store ANY unicode string with 4 or less characters
and arr[1] will return a native Python string object.
2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so not be wasting space for "typical european-oriented data".
arr = np.array(("this", "that",), dtype=np.single_byte_string)
(name TBD)
and arr[1] would return a python string.
attempting to put in a not-compatible with the encoding string in would raise an Encoding Error.
I highly recommend that (SO 8859-15 ( latin-9 or latin-1) be the encoding in this case.
3) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???)
4) a fixed length bytes dtype -- pretty much what 'S' is now under python
I understand, but not all tedious discussions that have not yet achieved consensus are bikeshedding to be cut short. We couldn't really decide what to do back in the pre-1.0 days, too, so we just did *something*, and that something is now the very situation that Julian has a problem with. We have more experience now, especially with the added wrinkles of Python 3; other projects have advanced and matured their Unicode string array-handling (e.g. pandas and HDF5); now is a great time to have a real discussion about what we *need* before we make decisions about what we should *do*. three -- settable from a bytes or bytearray object, and returns a bytes object.
- you could use astype() to convert between bytes and a specified encoding with no change in binary representation.
2) and 3) could be fully covered by a dtype with a settable encoding that might as well support all python built-in encodings -- though I think an alias to the common cases would be good -- latin, utf-8. If so, the length would have to be specified in bytes.
1) could be covered with the existing 'U': type - only downside being some wasted space -- or with a pointer to a python string dtype -- which would also waste space, though less for long-ish strings, and maybe give us some better access to the nifty built-in string features.
+1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting an decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.
That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding.
Well, yes -- BUT: That strictness in python3 -- "data is either text or bytes, and text in an unknown (or invalid) encoding HAS to be bytes" -- bit Python3 in the butt for a long time. Folks that deal in the messy real world of binary data that is kinda-mostly text, but may have a bit of binary data, or be in an unknown encoding, or be corrupted, were very, very adamant about how this model DID NOT work for them. Very influential people were seriously critical of python 3. Eventually, py3 added bytes string formatting, surrogateescape, and other features that facilitate working with messy almost-text.

Practicality beats purity -- if you have one-byte-per-char data that is mostly european, then latin-1 or latin-9 let you work with it, have it mostly work, and never crash out with an encoding error. Another plus:

- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
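To make the round-tripping claim concrete -- this is plain standard Python, no numpy involved::

    raw = bytes(range(256))               # every possible byte value
    text = raw.decode('latin-1')          # never raises, though the result may be garbage *as text*
    assert text.encode('latin-1') == raw  # the original bytes come back exactly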
But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me.

You'll need to specify what NULL-terminating behavior you want here. np.string_ has NULL-termination. np.void (which could be made to work better with `bytes`) does not. Both have use-cases for text encoding (shakes fist at UTF-16).

Walk me through a problem that you've encountered with such textish data in arrays. I know the problems in Web protocol-land, but they are not really relevant to us. What are *your* problems? Why didn't those ameliorations that were added for the Web world address your problems? I really want to get at specific use cases that interact with numpy, not handwaving at problems other people have had in other contexts.
latin-1 would be only for the special case of mostly-ascii (or true latin) one-byte-per-char encodings (which is a common use-case in scientific data sets). I think it has only upside over ascii. It would be a fine idea to support any one-byte-per-char encoding, too.

As for external data in utf-8 -- yes, that should be dealt with properly -- either by truly supporting utf-8 internally, or by properly encoding/decoding when putting it in and moving it out of an array. utf-8 is a very important encoding -- I just think it's the wrong one for the default interplay with python strings.

In my experience, it has both upside and downside. Silently creating mojibake is a problem. The process that you described, decoding ANY string of bytes as latin-1, can create mojibake. The inverse, encoding then decoding, may not, but of course the encoding step there does not accept arbitrary Unicode strings.
Doing in-memory assignments to a fixed-encoding, fixed-width string
dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1!
of course not -- if you are writing to a format that specifies a width and the encoding, you want to use bytes :-) -- or a dtype that is properly encoding-aware. I was not suggesting that latin-1 be used for arbitrary bytes -- that is what bytes are for.

Ah, your message was responding to Stephan, who questioned why latin-1 should be the default encoding for the `unicode/str`-aware string dtype. It seemed like you were affirming that latin-1 ought to be that default. It seems like that is not your position, but you are defending the existence of a latin-1 dtype for specific uses. I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake option.
well, it wouldn't create mojibake - anything that went from a python
string to a latin-1 array would be properly encoded in latin-1 -- unless it came from already corrupted data. But when you have corrupted data, your only choices are to:

- raise an error
- alter the data (error="replace")
- pass the corrupted data on through.
but it could deal with mojibake -- that's the whole point :-)
You are right that assigning a `unicode/str` object into my latin-1-dtype array would not create mojibake, but that's not the only way to fill a numpy array. In the context of my email, I was responding to a use case being floated for the latin-1 dtype that was to read existing FITS files that have fields that are text-ish: plain octets according to the file format standard, but in practice mostly ASCII with a few sparse high-bit characters, typically from some unspecified iso-8859-* encoding. If that unspecified encoding wasn't latin-1, then I'm getting mojibake when I read the file (unless, happy days, the author of the file was also using latin-1).

I understand that you are proposing a latin-1 dtype in a context with other dtypes and tools that might make that use of the latin-1 dtype obsolete. However, there are others who have been proposing just a latin-1 dtype for this purpose.

Let me make a counter-proposal for your latin-1 dtype (your #2) that might address your, Thomas's, and Julian's use cases:

2) We want a single-byte-per-character, NULL-terminated string dtype that can be used to represent mostly-ASCII textish data that may have some high-bit characters from some 8-bit encoding. It should be able to read arbitrary bytes (that is, up to the NULL-termination) and write them back out as the same bytes if unmodified. This lets us read this text from files where the encoding is unspecified (or is lying about the encoding) into `unicode/str` objects. The encoding is specified as `ascii` but the decoding/encoding is done with the `surrogateescape` option so that high-bit characters are faithfully represented in the `unicode/str` string but are not erroneously reinterpreted as other characters from an arbitrary encoding.

I'd even be happy if Julian or someone wants to go ahead and implement this right now and leave the UTF-8 dtype for a later time. As long as this ASCII-surrogateescape dtype is not called np.realstring (it's *really* important to me that the bikeshed not be this color). ;-)

-- Robert Kern
On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker <chris.barker@noaa.gov> wrote:
On the other hand, if this is the use-case, perhaps we really want an
encoding closer to "Python 2" string, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes.
I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really is ascii, then it's perfect. If it really is latin-*, then you get some extra useful stuff, and if it's corrupted somehow, you still get the ascii text correct, and the rest won't barf and can be passed on through.
I am totally in agreement with Thomas that "We are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters."

My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data?

I don't think that silently (mis)interpreting non-ASCII characters as latin-1/9 is a good idea, which is why I think it would be a mistake to use 'latin-1' for text data with unknown encoding. I could get behind a data type that compares equal to strings for ASCII only and allows for *storing* other characters, but making blind assumptions about characters 128-255 seems like a recipe for disaster. Imagine text[unknown] as a one-character string type, but it supports .decode() like bytes, and every character in the range 128-255 compares for equality with other characters like NaN -- not even equal to itself.
On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.kern@gmail.com>
wrote:
I am not unfamiliar with this problem. I still work with files that have
fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly.
Can you walk us through the problems that you are having with working
with these columns as arrays of `bytes`?
This is very simple and obvious but I will state for the record.
I appreciate it. What is obvious to you is not obvious to me.

Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat.
In [22]: dat = np.array(['yes', 'no'], dtype='S3')
In [23]: dat == 'yes'    # FAIL (but works just fine in Py2)
Out[23]: False

In [24]: dat == b'yes'   # Right answer but not practical
Out[24]: array([ True, False], dtype=bool)
I'm curious why you think this is not practical. It seems like a very practical solution to me. -- Robert Kern
On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern <robert.kern@gmail.com> wrote:
Let me make a counter-proposal for your latin-1 dtype (your #2) that might address your, Thomas's, and Julian's use cases:
2) We want a single-byte-per-character, NULL-terminated string dtype that can be used to represent mostly-ASCII textish data that may have some high-bit characters from some 8-bit encoding. It should be able to read arbitrary bytes (that is, up to the NULL-termination) and write them back out as the same bytes if unmodified. This lets us read this text from files where the encoding is unspecified (or is lying about the encoding) into `unicode/str` objects. The encoding is specified as `ascii` but the decoding/encoding is done with the `surrogateescape` option so that high-bit characters are faithfully represented in the `unicode/str` string but are not erroneously reinterpreted as other characters from an arbitrary encoding.
I'd even be happy if Julian or someone wants to go ahead and implement this right now and leave the UTF-8 dtype for a later time.
As long as this ASCII-surrogateescape dtype is not called np.realstring (it's *really* important to me that the bikeshed not be this color). ;-)
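For anyone who hasn't used surrogateescape, a small illustration of the decode/encode behaviour the proposal relies on (standard Python; the sample bytes are made up)::

    raw = b'caf\xe9'                                   # one high-bit byte, encoding unknown
    s = raw.decode('ascii', errors='surrogateescape')  # 'caf\udce9' -- byte kept, not misread as 'é'
    assert s.encode('ascii', errors='surrogateescape') == raw  # faithful round-trip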
This sounds quite similar to my text[unknown] proposal, with the advantage that the concept of "surrogateescape" already exists. Surrogate-escape characters compare equal to themselves, which is maybe less than ideal, but it looks like you can put them in real unicode strings, which is nice.
On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker <chris.barker@noaa.gov> wrote:
On the other hand, if this is the use-case, perhaps we really want an
encoding closer to "Python 2" string, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes.
I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really is ascii, then it's perfect. If it really is latin-*, then you get some extra useful stuff, and if it's corrupted somehow, you still get the ascii text correct, and the rest won't barf and can be passed on through.
I am totally in agreement with Thomas that "We are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters."
My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data?

I don't know that we can reasonably make that accounting relevant. Number of such characters per byte of text? Number of files with such characters out of all existing files? What I can say with assurance is that every time I have decided, as a developer, to write code that just hardcodes latin-1 for such cases, I have regretted it. While it's just personal anecdote, I think it's at least measuring the right thing. :-)

-- Robert Kern
On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.kern@gmail.com>
wrote:
I am not unfamiliar with this problem. I still work with files that
have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly.
Can you walk us through the problems that you are having with working
with these columns as arrays of `bytes`?
This is very simple and obvious but I will state for the record.
I appreciate it. What is obvious to you is not obvious to me.
Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat.
In [22]: dat = np.array(['yes', 'no'], dtype='S3')
In [23]: dat == 'yes'    # FAIL (but works just fine in Py2)
Out[23]: False

In [24]: dat == b'yes'   # Right answer but not practical
Out[24]: array([ True, False], dtype=bool)
I'm curious why you think this is not practical. It seems like a very practical solution to me.
In Py3 most character data will be string, not bytes. So every time you want to interact with the bytes array (compare, assign, etc.) you need to explicitly coerce the right-hand-side operand to be a bytes-compatible object. For code that developers write, this might be possible but results in ugly code. But for the general science and engineering communities that use numpy this is completely untenable. The only practical solution so far is to implement a unicode sandwich and convert to the 4-byte `U` type at the interface. That is precisely what we are trying to eliminate. - Tom
-- Robert Kern
On Mon, Apr 24, 2017 at 5:56 PM, Aldcroft, Thomas < aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas <
aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.kern@gmail.com>
wrote:
I am not unfamiliar with this problem. I still work with files that
have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly.
Can you walk us through the problems that you are having with working
with these columns as arrays of `bytes`?
This is very simple and obvious but I will state for the record.
I appreciate it. What is obvious to you is not obvious to me.
Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat.
In [22]: dat = np.array(['yes', 'no'], dtype='S3')
In [23]: dat == 'yes'    # FAIL (but works just fine in Py2)
Out[23]: False

In [24]: dat == b'yes'   # Right answer but not practical
Out[24]: array([ True, False], dtype=bool)

I'm curious why you think this is not practical. It seems like a very practical solution to me.
In Py3 most character data will be string, not bytes. So every time you want to interact with the bytes array (compare, assign, etc.) you need to explicitly coerce the right-hand-side operand to be a bytes-compatible object. For code that developers write, this might be possible but results in ugly code. But for the general science and engineering communities that use numpy this is completely untenable.

Okay, so the problem isn't with (byte-)string literals, but with variables being passed around from other sources. E.g.::

    def func(dat, scalar):
        return dat == scalar

Every one of those functions deepens the abstraction and moves that unicode-by-default scalar farther away from the bytesish array, so it's harder to demand that users of those functions be aware that they need to pass in `bytes` strings. So you need to implement those functions defensively, which complicates them.
The only practical solution so far is to implement a unicode sandwich and convert to the 4-byte `U` type at the interface. That is precisely what we are trying to eliminate.
What do you think about my ASCII-surrogateescape proposal? Do you think that would work with your use cases? In general, I don't think Unicode sandwiches will be eliminated by this or the latin-1 dtype; the sandwich is usually the right thing to do and the surrogateescape the wrong thing. But I'm keenly aware of the problems you get when there just isn't a reliable encoding to use. -- Robert Kern
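In numpy terms, the sandwich Robert mentions amounts to decoding once at the input boundary, working with `U`/str inside, and encoding once on the way out -- np.char.decode/encode already provide the bread. A minimal sketch::

    import numpy as np

    raw = np.array([b'yes', b'no'], dtype='S3')  # bytes at the I/O boundary
    text = np.char.decode(raw, 'ascii')          # dtype '<U3'; comparisons with str now work
    out = np.char.encode(text, 'ascii')          # back to 'S' bytes for writing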
On Apr 21, 2017 2:34 PM, "Stephan Hoyer" <shoyer@gmail.com> wrote:

I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

You may already know this, but probably not everyone reading does: the reason why latin1 often gets special attention in discussions of Unicode encoding is that latin1 is effectively "ucs1". It's the unique one-byte text encoding where byte N represents codepoint U+N. I can't think of any reason why this property is particularly important for numpy's usage, because we always have a conversion step anyway to get data in and out of an array. The potential arguments for latin1 that I can think of are:

- if we have to implement our own en/decoding code for some reason, then it's the most trivial encoding
- if other formats standardize on latin1-with-nul-padding and we want in-memory/mmap compatibility
- if we really want a fixed-width encoding for some reason but don't care which one, then it's in some sense the most obvious choice

I can't think of many reasons why having a fixed-width encoding is particularly important, though... For our current style of string storage, even calculating the length of a string is O(n), and AFAICT the only way to actually take advantage of the theoretical O(1) character indexing is to make a uint8 view. I guess it would be useful if we had a string slicing ufunc... But why would we?

That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically.
From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't.
-n
On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith <njs@pobox.com> wrote:

That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically.

From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't.

Yes, HDF5 does. Or at least, it is supported in addition to the variable-length ones.

https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html

-- Robert Kern
On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith <njs@pobox.com> wrote:
That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically.
From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't.
Yes, HDF5 does. Or at least, it is supported in addition to the variable-length ones.
https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
Doh, I found that page but it was (and is) meaningless to me, so I went by http://docs.h5py.org/en/latest/strings.html, which says the options are fixed-width ascii, variable-length ascii, or variable-length utf-8 ... I guess it's just talking about what h5py currently supports.

But also, is it important whether strings we're loading/saving to an HDF5 file have the same in-memory representation in numpy as they would in the file? I *know* [1] no-one is reading HDF5 files using np.memmap :-). Is it important for some other reason?

Also, further searching suggests that HDF5 actually supports all of nul termination, nul padding, and space padding, and that nul termination is the default? How much does it help to have in-memory compatibility with just one of these options (and not even the default one)? Would we need to add the other options to be really useful for HDF5? (Unlikely to happen within numpy itself, but potentially something that could be done inside h5py or whatever if numpy's user-defined dtype system were a little more useful.)

-n

[1] hope

-- Nathaniel J. Smith -- https://vorpus.org
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <njs@pobox.com> wrote:
But also, is it important whether strings we're loading/saving to an HDF5 file have the same in-memory representation in numpy as they would in the file? I *know* [1] no-one is reading HDF5 files using np.memmap :-).
Of course they do :) https://github.com/jjhelmus/pyfive/blob/98d26aaddd6a7d83cfb189c113e172cc1b60...
Also, further searching suggests that HDF5 actually supports all of nul termination, nul padding, and space padding, and that nul termination is the default? How much does it help to have in-memory compatibility with just one of these options (and not even the default one)? Would we need to add the other options to be really useful for HDF5?
h5py actually ignores this option and only uses null termination. I have not heard any complaints about this (though I have heard complaints about the lack of fixed-length UTF-8). But more generally, you're right. h5py doesn't need a corresponding NumPy dtype for each HDF5 string dtype, though that would certainly be *convenient*. In fact, it already (ab)uses NumPy's dtype metadata with h5py.special_dtype to indicate a homogeneous string type for object arrays. I would guess h5py users have the same needs for efficient string representations (including surrogate-escape options) as other scientific users.
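For reference, that object-array workaround looks roughly like this with h5py's API of the time; the file and dataset names here are invented::

    import h5py
    import numpy as np

    str_dt = h5py.special_dtype(vlen=str)  # an object dtype tagged as variable-length text
    with h5py.File('example.h5', 'w') as f:
        dset = f.create_dataset('labels', shape=(2,), dtype=str_dt)
        dset[:] = np.array(['yes', 'no'], dtype=object)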
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern <robert.kern@gmail.com>
wrote:
On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith <njs@pobox.com> wrote:
That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically.
From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't.
Yes, HDF5 does. Or at least, it is supported in addition to the variable-length ones.
https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
Doh, I found that page but it was (and is) meaningless to me, so I went by http://docs.h5py.org/en/latest/strings.html, which says the options are fixed-width ascii, variable-length ascii, or variable-length utf-8 ... I guess it's just talking about what h5py currently supports.
It's okay, I made exactly the same mistake earlier in the thread. :-)
But also, is it important whether strings we're loading/saving to an HDF5 file have the same in-memory representation in numpy as they would in the file? I *know* [1] no-one is reading HDF5 files using np.memmap :-). Is it important for some other reason?
The lack of such a dtype seems to be the reason why neither h5py nor PyTables supports that kind of HDF5 Dataset. The variable-length Datasets can take up a lot of disk-space because they can't be compressed (even accounting for the wasted padding space). I mean, they probably could have implemented it with objects arrays like h5py does with the variable-length string Datasets, but they didn't. https://github.com/PyTables/PyTables/issues/499 https://github.com/h5py/h5py/issues/624 -- Robert Kern
On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern <robert.kern@gmail.com> wrote:
Chris, you've mashed all of my emails together, some of them are in reply to you, some in reply to others. Unfortunately, this dropped a lot of the context from each of them, and appears to be creating some misunderstandings about what each person is advocating.
Sorry about that -- I was trying to keep an already really long thread from getting even longer.... And I'm not sure it matters who's doing the advocating, but rather *what* is being advocated -- I hope I didn't screw that up too badly.

Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear. So I'll try again -- use-cases only! We'll keep the possible solutions separate.

Do we need to write up a NEP for this? It seems we are going a bit in circles, and we really do want to capture the final decision process.

1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character-oriented interface. i.e. if you do::

    arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or less characters, and arr[1] will return a native Python3 string object. This is the use-case for "casual" numpy users -- not the folks writing h5py and the like, or the ones writing Cython bindings to C++ libs.

2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so as not to be wasting space for "typical european-language-oriented data". Note: this should ALSO be compatible with Python's character-oriented string model, i.e. a Python string with length N will fit into a dtype of size N. ::

    arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD) and arr[1] would return a python string. Attempting to put in a string that is not compatible with the encoding would raise an EncodingError. This is also a use-case primarily for "casual" users -- but ones concerned with the size of the data storage who know they are using european text.

3) dtypes that support storage in particular encodings: Python strings would be encoded appropriately when put into the array. A Python string would be returned when indexing.

a) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???) at the binary level.

b) There be a dtype that could store data in any encoding supported by Python -- to facilitate bytes-level interchange with other systems. If we need more than utf-8, then we might as well have the full set.

4) a fixed-length bytes dtype -- pretty much what 'S' is now under python three -- settable from a bytes or bytearray object (or other memoryview?), and returning a bytes object. You could use astype() to convert between bytes and a specified encoding with no change in binary representation. This could be used to store any binary data, including encoded text or anything else. It should map directly to the Python bytes model -- thus NOT null-terminated.

This is a little different from 'S' behaviour on py3 -- it appears that with 'S', if ALL the trailing bytes are null, then it is truncated, but if there is a null byte in the middle, then it is preserved. I suspect that this is a legacy from Py2's use of "strings" as both text and binary data. But in py3, a "bytes" type should be about bytes, and not text, and thus null-valued bytes are simply another value a byte can hold.

There are multiple ways to address these use cases -- please try to make your comments clear about whether you think the use-case is unimportant, or ill-defined, or if you think a given solution is a poor choice.
To facilitate that, I will put my comments on possible solutions in a separate note, too. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
This is essentially my rant about use-case (2): A compact dtype for mostly-ascii text: On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker <chris.barker@noaa.gov> wrote:
On the other hand, if this is the use-case, perhaps we really want an
encoding closer to "Python 2" string, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes.
I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really is ascii, then it's perfect. If it really is latin-*, then you get some extra useful stuff, and if it's corrupted somehow, you still get the ascii text correct, and the rest won't barf and can be passed on through.
I am totally in agreement with Thomas that "We are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters."
My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data?
I am totally euro-centric, but as I understand it, that is the whole point of the desire for a compact one-byte-per-character encoding. If there is a strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we should support that. But this all started with "mostly ascii". My take on that is:

We don't want to use pure ASCII -- that is the hell that python2's default encoding approach led to -- it is MUCH better to pass garbage through than crash out with an EncodingError -- data are messy, and people are really bad at writing comprehensive tests. So we need something that handles ASCII properly and can pass through arbitrary bytes as well without crashing. Options are:

* ASCII with errors='ignore' or 'replace': I think that is a very bad idea -- it is tossing away information that _may_ have some use elsewhere::

    s = arr[i]
    arr[i] = s

should put the same bytes back into the array.

* ASCII with errors='surrogateescape': This would preserve bytes and not crash out, so it meets the key criteria.

* latin-1: This would do the exactly correct thing for ASCII, preserve the bytes, and not crash out. But it would also allow additional symbols useful to european languages and scientific computing. Seems like a win-win to me.

As for my use-cases:

- Messy data: I have had a lot of data sets with european text in them, mostly ASCII and an occasional non-ASCII accented character or symbol -- most of these come from legacy systems, and have an ugly, arbitrary combination of MacRoman, Win-something-or-other, and who knows what -- i.e. mojibake, though at least mostly ascii. The only way to deal with it "properly" is to examine each string, try to figure out which encoding it is in, hope at least a single string is in one encoding, and then decode/encode it properly. So numpy should support that -- which would be handled by a 'bytes' type, just like in Python itself. But sometimes that isn't practical, and it still doesn't work 100% -- in which case, we can go with latin-1, and there will be some weird, incorrect characters in there, and that is OK -- we fix them later when QA/QC or users notice it -- really just like a typo. But stripping the non-ascii characters out would be a worse solution. As would "replace", as sometimes it IS the correct symbol! (european encodings aren't totally incompatible...). And surrogateescape is worse, too -- any "weird" character is the same to my users, and at least sometimes it will be the right character -- however a surrogateescape gets printed, it will never look right. (and can it even be handled by a non-python system?)

- Filenames: File names are one of the key reasons folks struggled with the python3 data model (particularly on *nix) and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in numpy arrays -- we need to preserve them exactly and display them mostly right. Again, euro-centric, but if you are euro-centric, then latin-1 is a good choice for this. Granted, I should probably simply use a proper unicode type for filenames anyway, but sometimes the data comes in already encoded as latin-something.

In the end I still see no downside to latin-1 over ascii-only -- only an upside.
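A side-by-side of those three options on a concrete byte string (standard Python behaviour; the sample bytes are invented)::

    raw = b'23\xb0C'                               # 0xb0 is the degree sign in latin-1
    raw.decode('ascii', errors='replace')          # '23\ufffdC' -- the byte is destroyed
    raw.decode('ascii', errors='surrogateescape')  # '23\udcb0C' -- byte preserved, prints oddly
    raw.decode('latin-1')                          # '23°C' -- correct, for this data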
I don't think that silently (mis)interpreting non-ASCII characters as latin-1/9 is a good idea, which is why I think it would be a mistake to use 'latin-1' for text data with unknown encoding.
if it's totally unknown, then yes -- but for totally unknown, bytes is the only reasonable option -- then run chardet or something over it. But for "some latin encoding" -- latin-1 is a good choice.

I could get behind a data type that compares equal to strings for ASCII only and allows for *storing* other characters, but making blind assumptions about characters 128-255 seems like a recipe for disaster. Imagine text[unknown] as a one-character string type, but it supports .decode() like bytes, and every character in the range 128-255 compares for equality with other characters like NaN -- not even equal to itself.
would this be ascii with surrogateescape? -- almost, though I think the surrogateescapes would compare equal if they were equal -- which, now that I think about it, is what you want -- why preserve the bytes if they aren't an important part of the data? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Mon, Apr 24, 2017 at 4:23 PM, Robert Kern <robert.kern@gmail.com> wrote:
My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data?
I don't know that we can reasonably make that accounting relevant. Number of such characters per byte of text? Number of files with such characters out of all existing files?
I have a lot of mostly-english text -- usually not latin-1, but usually mostly latin-1 -- where the non-ascii characters are a handful of accented characters (usually from Spanish, some French), plus a few "scientific" characters: the degree symbol, the "micro" symbol. I suspect that this is not an unusual pattern for mostly-english scientific text. If it's non-string binary data, I know it -- and I'd use a bytes type. I have two options -- try to detect the encoding properly, or use _something_ and fix it up later. latin-1 is a great choice for the latter option -- most of the text displays fine, and the wrong stuff is untouched, so I can figure it out.

What I can say with assurance is that every time I have decided, as a developer, to write code that just hardcodes latin-1 for such cases, I have regretted it. While it's just personal anecdote, I think it's at least measuring the right thing. :-)
I've had the opposite experience -- so that's two anecdotes :-) If it were, say, shift-jis, then yes, using latin-1 would be a bad idea. But not really much worse than any other option, other than properly decoding it. In a way, using latin-1 is like the old py2 string -- it can be used as text, even if it has arbitrary non-text garbage in it... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
OK -- onto proposals: 1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do::
arr = np.array(("this", "that",))
you get an array that can store ANY unicode string with 4 or less characters.
and arr[1] will return a native Python3 string object.
This is the use-case for "casual" numpy users -- not the folks writing H5py and the like, or the ones writing Cython bindings to C++ libs.
I see two options here:

a) The current 'U' dtype -- fully meets the specs, and is already there.

b) Having a pointer-to-a-python-string dtype:
- I take it that's what pandas does, and people seem happy.
- That would get us variable-length strings, and potentially other nifty string-processing.
- It would lose the ability to interact at the binary level with other systems -- but do any other systems use UCS-4 anyway?
- How would it work with pickle and numpy zip storage?

Personally, I'm fine with (a), but (b) seems like it could be a nice addition. As the 'U' type already exists, the choice to add a python-string type is really orthogonal to the rest of this discussion. Note that I think using utf-8 internally to fill this need is a mistake -- it does not match well with the Python string model.

That's it for use-case (1).

-CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
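An aside illustrating option (b) above with today's tools -- an object array holding native Python strings, essentially what pandas does under the hood::

    import numpy as np

    arr = np.array(["this", "that"], dtype=object)  # stores pointers to Python str objects
    arr[1]                                          # 'that' -- a native str, full unicode
    arr[1] = "a much longer replacement string"     # no fixed width, no truncation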
2017-04-25 12:34 GMT-04:00 Chris Barker <chris.barker@noaa.gov>:
I am totally euro-centric, but as I understand it, that is the whole point of the desire for a compact one-byte-per character encoding. If there is a strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we should support that. But this all started with "mostly ascii". My take on that is:
But Shift-JIS is not one-byte; it's two-byte (unless you allow only half-width characters and nothing else). :-) In fact legacy CJK encodings are all nominally two-byte (so that the width of a character's internal representation matches that of its visual representation).
- filenames
File names are one of the key reasons folks struggled with the python3 data model (particularly on *nix) and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in numpy arrays -- we need to preserve them exactly and display them mostly right. Again, euro-centric, but if you are euro-centric, then latin-1 is a good choice for this.
This I don't understand. As far as I can tell non-Western-European filenames are not unusual. If filenames are a reason, even if you're euro-centric (think Eastern Europe, say) I don't see how latin1 is a good choice. Lurker here, and I haven't touched numpy in ages. So I might be blurting out nonsense. -- Ambrose Li // http://o.gniw.ca / http://gniw.ca If you saw this on CE-L: You do not need my permission to quote me, only proper attribution. Always cite your sources, even if you have to anonymize and/or cite it as "personal communication".
Now my proposal for the other use cases:

2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so as not to be wasting space for "typical european-language-oriented data". Note: this should ALSO be compatible with Python's character-oriented string model, i.e. a Python String with length N will fit into a dtype of size N.
arr = np.array(("this", "that",), dtype=np.single_byte_string)
(name TBD)
and arr[1] would return a python string.
Attempting to put in a string that is not compatible with the encoding would raise an EncodingError.

This is also a use-case primarily for "casual" users -- but ones concerned with the size of the data storage who know they are using european text.
more detail elsewhere -- but either ascii with surrogateescape or latin-1 are good options here. I prefer latin-1 (I really see no downside), but others disagree... But then we get to:
3) dtypes that support storage in particular encodings:
We need utf-8. We may need others. We may need a 1-byte-per-char compact encoding that isn't close enough to ascii or latin-1 to be useful (say, shift-jis). And I don't think we are going to come to a consensus on what "single" encoding to use for 1-byte-per-char. So really -- going back to Julian's earlier proposal: a dtype with

- an encoding specified
- "size" in bytes

Once defined, numpy would encode/decode to/from python strings "correctly". We might need "null-terminated utf-8" as a special case. That would support all the other use cases, even the one-byte-per-char encoding. I'd like to see a clean alias to a latin-1 encoding, but that's not a big deal.

That leaves a couple decisions:

- error out or truncate if the passed-in string is too long?
- error out or surrogateescape if there are invalid bytes in the data?
- error out or something else if there are characters that can't be encoded in the specified encoding?

And we still need a proper bytes type:

4) a fixed-length bytes dtype -- pretty much what 'S' is now under python
three -- settable from a bytes or bytearray object (or other memoryview?), and returns a bytes object.
You could use astype() to convert between bytes and a specified encoding with no change in binary representation. This could be used to store any binary data, including encoded text or anything else. This should map directly to the Python bytes model -- thus NOT null-terminated.
This is a little different from 'S' behaviour on py3 -- it appears that with 'S', if ALL the trailing bytes are null, then it is truncated, but if there is a null byte in the middle, then it is preserved. I suspect that this is a legacy from Py2's use of "strings" as both text and binary data. But in py3, a "bytes" type should be about bytes, and not text, and thus null-valued bytes are simply another value a byte can hold.
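What the encoding-parameterized dtype sketched above would automate can be done by hand today with the 'S' dtype; this snippet is illustrative, not an existing API::

    import numpy as np

    values = ["this", "na\u00efve", "h\u00e9llo"]
    encoded = [v.encode('utf-8') for v in values]
    width = max(len(b) for b in encoded)          # the "length" must be counted in *bytes*
    arr = np.array(encoded, dtype='S%d' % width)
    arr[1].decode('utf-8')                        # 'naïve' -- decoding is the caller's job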
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI <ambrose.li@gmail.com> wrote:
2017-04-25 12:34 GMT-04:00 Chris Barker <chris.barker@noaa.gov>:
I am totally euro-centric,
But Shift-JIS is not one-byte; it's two-byte (unless you allow only half-width characters and nothing else). :-)
bad example then -- are there other non-euro-centric one-byte-per-char encodings worth worrying about? I have no clue :-)
This I don't understand. As far as I can tell non-Western-European filenames are not unusual. If filenames are a reason, even if you're euro-centric (think Eastern Europe, say) I don't see how latin1 is a good choice.
right -- this is the age of Unicode -- Unicode is the correct choice. But many of us have data in old files that are not proper Unicode -- and that includes filenames. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Tue, Apr 25, 2017 at 9:01 AM, Chris Barker <chris.barker@noaa.gov> wrote:

Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear.
So I'll try again -- use-case only! we'll keep the possible solutions
separate.
Do we need to write up a NEP for this? it seems we are going a bit in
circles, and we really do want to capture the final decision process.
1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character-oriented interface. i.e. if you do:: arr = np.array(("this", "that",)) ... etc.

These aren't use cases but rather requirements. I'm looking for something rather more concrete than that.

* HDF5 supports fixed-length and variable-length string arrays encoded in ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite the documentation claiming that there are more options). In practice, the ASCII strings permit high-bit characters, but the encoding is unspecified. Memory-mapping is rare (but apparently possible). The two major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to support that HDF5 option. Compression is supported for fixed-length string arrays but not variable-length string arrays.

* FITS supports fixed-length string arrays that are NULL-padded. The strings do not have a formal encoding, but in practice, they are typically mostly ASCII characters with the occasional high-bit character from an unspecified encoding. Memory-mapping is a common practice. These arrays can be quite large even if each scalar is reasonably small.

* pandas uses object arrays for flexible in-memory handling of string columns. Lengths are not fixed, and None is used as a marker for missing data. String columns must be written to and read from a variety of formats, including CSV, Excel, and HDF5, some of which are Unicode-aware and work with `unicode/str` objects instead of `bytes`.

* There are a number of sometimes-poorly-documented, often-poorly-adhered-to, aging file format "standards" that include string arrays but do not specify encodings, or such specification is ignored in practice. This can make the usual "Unicode sandwich" at the I/O boundaries difficult to perform.

* In Python 3 environments, `unicode/str` objects are rather more common, and simple operations like equality comparisons no longer work between `bytes` and `unicode/str`, making it difficult to work with numpy string arrays that yield `bytes` scalars.

-- Robert Kern
On Tue, Apr 25, 2017 at 6:05 PM Chris Barker <chris.barker@noaa.gov> wrote:
Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear.
I would make my use-cases more user-specific:

1) User wants an array with numpy indexing tricks that can hold python strings but doesn't care about the underlying representation. -> Solvable with object arrays, or Robert's string-specific object arrays; underlying representation is python objects on the heap. Sadly UCS-4, so zillions are going to be a memory problem.

2) User has to deal with fixed-width binary data from an external program/library and wants to see it as python strings. This may be systematically encoded in a known encoding (e.g. HDF5's fixed-storage-length zero-padded UTF-8 strings, spec-observing FITS' zero-padded ASCII) or ASCII-with-exceptions-and-the-user-is-supposed-to-know (e.g. spec-violating FITS files with zero-padded latin-9, koi8-r, cp1251, or whatever). Length may be signaled by null termination, null padding, or space padding. -> Solvable with a fixed-storage-size encoded-string dtype, as long as it has a parameter for how length is signaled. Python tricks for dealing with wrong or unknown encodings can make bogus data manageable.

3) User has to deal with fixed-width binary data from an external program/library that really is binary bytes. -> Solvable with a dtype that returns fixed-length byte strings.

4) User has a stupendous number (billions) of short strings which are mostly but not entirely ASCII and wants to manipulate them as strings. -> Not sure how to solve this. Maybe an object array with byte strings for storage and encoding information in the dtype, allowing transparent decoding? Or a fixed-storage-size array with a one-byte encoding that can cope with all the characters the user will ever want to use?

5) User has a bunch of mystery-encoding strings(?) and wants to store them in a numpy array. -> If they're python strings already, no further harm is done by treating this as case 1 when in python-land. If they need to be in fixed-width fields for communication with an external program or library, this puts us in case 2, unknown-encoding variety; the user will have to pick an encoding that the external program is likely to be able to cope with; this may be the one that originated the mystery strings in the first place.

6) User has python strings and wants to store them in non-object numpy arrays for some reason but doesn't care about the actual memory layout. -> Solvable with the current setup: fixed-width UCS-4 fields, padded with Unicode NULL. Happily, this comes for free from arbitrary-encoding fixed-storage-size dtypes, though a friendlier interface might be nice. Also allows people to use UCS-2 or ASCII if they know their strings fit.

7) User has data in one binary format and it needs to go into another, with perhaps casual inspection while in python-land. Such data is mostly ASCII but might contain mystery characters; presenting gobbledygook to the user is okay as long as the characters are output intact. -> Reading and writing as a fixed-width one-byte encoding, preferably one resembling the one the data is actually in, should work here. UTF-8 is likely to mangle the data; ASCII-with-surrogateescape might do okay. The key thing here is that both input and output files will have their own ways of specifying string length and their own storage specifiers; the user must know these, and someone has to know and specify what to do with strings that don't fit. Simple truncation will mangle UTF-8 if it is not known to be UTF-8, but there's maybe not much that can be done about that.
I guess my point is that a use case should specify:

* Where does the data come from (i.e. in what format)?
* Are there memory constraints in the storage format?
* How should access look to the user? In particular, what should misencoded data look like?
* Where does the data need to go?

Anne
On Tue, Apr 25, 2017 at 10:04 AM, Chris Barker <chris.barker@noaa.gov> wrote:
On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI <ambrose.li@gmail.com> wrote:
2017-04-25 12:34 GMT-04:00 Chris Barker <chris.barker@noaa.gov>:
I am totally euro-centric,
But Shift-JIS is not one-byte; it's two-byte (unless you allow only half-width characters and nothing else). :-)
bad example then -- are there other non-euro-centric one-byte-per-char encodings worth worrying about? I have no clue :-)

I've run into Windows-1251 in files (seismic and well log data from Russian wells). Treating them as latin-1 does not make for a happy time. Both encodings also technically derive from ASCII in the lower half, but most of the actual language is written with the high-bit characters.

-- Robert Kern
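A concrete instance of that unhappiness (standard Python; the sample word is invented)::

    raw = '\u041f\u0440\u0438\u0432\u0435\u0442'.encode('cp1251')  # "Привет" as Windows-1251 bytes
    raw.decode('latin-1')  # 'Ïðèâåò' -- classic mojibake, every character wrong
    raw.decode('cp1251')   # 'Привет' -- the intended text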
On Tue, Apr 25, 2017 at 7:09 PM Robert Kern <robert.kern@gmail.com> wrote:
* HDF5 supports fixed-length and variable-length string arrays encoded in ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite the documentation claiming that there are more options). In practice, the ASCII strings permit high-bit characters, but the encoding is unspecified. Memory-mapping is rare (but apparently possible). The two major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to support that HDF5 option. Compression is supported for fixed-length string arrays but not variable-length string arrays.
* FITS supports fixed-length string arrays that are NULL-padded. The strings do not have a formal encoding, but in practice, they are typically mostly ASCII characters with the occasional high-bit character from an unspecified encoding. Memory-mapping is a common practice. These arrays can be quite large even if each scalar is reasonably small.
* pandas uses object arrays for flexible in-memory handling of string columns. Lengths are not fixed, and None is used as a marker for missing data. String columns must be written to and read from a variety of formats, including CSV, Excel, and HDF5, some of which are Unicode-aware and work with `unicode/str` objects instead of `bytes`.
* There are a number of sometimes-poorly-documented, often-poorly-adhered-to, aging file format "standards" that include string arrays but do not specify encodings, or such specification is ignored in practice. This can make the usual "Unicode sandwich" at the I/O boundaries difficult to perform.
* In Python 3 environments, `unicode/str` objects are rather more common, and simple operations like equality comparisons no longer work between `bytes` and `unicode/str`, making it difficult to work with numpy string arrays that yield `bytes` scalars.
It seems the greatest challenge is interacting with binary data from other programs and libraries. If we were living entirely in our own data world, Unicode strings in object arrays would generally be pretty satisfactory. So let's try to get what is needed to read and write other people's formats. I'll note that this is numpy, so variable-width fields (e.g. CSV) don't map directly to numpy arrays; we can store them however we want, as conversion is necessary anyway.

Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to be needed? We should make sure we can support all the ways that actually occur.

Anne
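The length-signaling conventions in question, reduced to byte-level operations (an illustrative sketch, not a library API)::

    field = b'ABC\x00\x00\x00'
    field.split(b'\x00', 1)[0]  # null-terminated: everything after the first NUL is ignored -> b'ABC'
    field.rstrip(b'\x00')       # null-padded: only trailing NULs are stripped -> b'ABC'
    b'ABC   '.rstrip(b' ')      # space-padded (as in FITS headers) -> b'ABC'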
On 04/25/2017 01:34 PM, Anne Archibald wrote:
I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere?
Strings in FITS headers are delimited by single quotes. Some keywords (only a handful) are required to have values that are blank-padded (in the FITS file) if the value is less than eight characters. Whether you get trailing blanks when you read the header depends on the FITS reader. I use astropy.io.fits to read/write FITS files, and that interface strips trailing blanks from character strings:

TARGPROP= 'UNKNOWN ' / Proposer's name for the target

fd = fits.open("test.fits")
s = fd[0].header['targprop']
len(s)  # -> 7
Phil
On Tue, Apr 25, 2017 at 6:36 PM Chris Barker <chris.barker@noaa.gov> wrote:
This is essentially my rant about use-case (2):
A compact dtype for mostly-ascii text:
I'm a little confused about exactly what you're trying to do. Do you need your in-memory format for this data to be compatible with anything in particular? If you're not reading or writing files in this format, then it's just a matter of storing a whole bunch of things that are already python strings in memory. Could you use an object array? Or do you have an enormous number, so that you need a more compact, fixed-stride memory layout?

Presumably you're getting byte strings (with no NULLs) from somewhere and need to store them in this memory structure in a way that makes them as usable as possible in spite of their unknown encoding. Presumably the thing to do is just copy them in there as-is and then use .astype to arrange for python to decode them when accessed.

So this is precisely the problem of "how should I decode random byte strings?" that python has been struggling with. My impression is that the community has established that there's no one solution that makes everyone happy, but that most people can cope with some combination of picking a one-byte encoding, ascii-with-surrogateescapes, zapping bogus characters, and giving wrong results. But I think all the standard python alternatives are needed, both in general and for interpreting numpy arrays full of bytes. Clearly your preferred solution is .astype("string[latin-9]"), but just as clearly that's not going to work for everyone.

If your question is "what should numpy's default string dtype be?", well, maybe default to object arrays; anyone who just has a bunch of python strings to store is unlikely to be surprised by this. Someone with more specific needs will choose a more specific - that is, not default - string data type.

Anne
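For reference, the ascii-with-surrogateescapes option mentioned above is already available at the codec level in plain Python, and round-trips arbitrary bytes losslessly:

    raw = b'caf\xe9'                             # high-bit byte, invalid as ASCII
    s = raw.decode('ascii', 'surrogateescape')   # -> 'caf\udce9', no exception
    s.encode('ascii', 'surrogateescape') == raw  # -> True, original bytes recovered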
On Tue, Apr 25, 2017 at 7:52 PM Phil Hodge <hodge@stsci.edu> wrote:
Strings in FITS headers are delimited by single quotes. Whether you get trailing blanks when you read the header depends on the FITS reader. I use astropy.io.fits to read/write FITS files, and that interface strips trailing blanks from character strings.
Actually, for what it's worth, the FITS spec says that in such values trailing spaces are not significant, see page 7: https://fits.gsfc.nasa.gov/standard40/fits_standard40draft1.pdf But they're not really relevant to numpy's situation, because as here you need to do elaborate de-quoting before they can go into a data structure. What I was wondering was whether people have data lying around with fixed-width fields where the strings are space-padded, so that numpy needs to support that. Anne
On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:
Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur.
Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated.

For byte strings, it looks like we need a parameterized type. This is for two uses, display and conversion to (Python) unicode. One could handle the display and conversion using view and astype methods. For instance, we already have

In [1]: a = array([1,2,3], uint8) + 0x30

In [2]: a.view('S1')
Out[2]: array(['1', '2', '3'], dtype='|S1')

In [3]: a.view('S1').astype('U')
Out[3]: array([u'1', u'2', u'3'], dtype='<U1')

Chuck
Chuck: That sounds like something we want to deprecate, for the same reason that python3 no longer allows str(b'123') to do the right thing. Specifically, it seems like astype should always be forbidden to go between unicode and byte arrays - so that would need to be written as:

In [1]: a = array([1,2,3], uint8) + 0x30

In [2]: a.view('S1')
Out[2]: array(['1', '2', '3'], dtype='|S1')

In [3]: a.view('U[ascii]')
Out[3]: array([u'1', u'2', u'3'], dtype='<U[ascii]1')

In [4]: a.view('U[ascii]').astype('U[ucs32]')  # re-encoding is an astype operation
Out[4]: array([u'1', u'2', u'3'], dtype='<U1')  # UCS32 is the current default

In [5]: a.view('U[ascii]').astype('U[ucs32]').view(uint8)
Out[5]: array([0x31, 0, 0, 0, 0x32, 0, 0, 0, 0x33, 0, 0, 0])

I guess for backwards compatibility, .view('U') would always mean view('U[ucs32]'). As an aside - it'd be nice if parameterized dtypes acquired a non-string syntax, like np.unicode_['ucs32'].

Eric
On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.harris@gmail.com> wrote:
Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated.
Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5):

* "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs.

* "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL.

I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space.

-- Robert Kern
On Apr 25, 2017 11:53 AM, "Robert Kern" <robert.kern@gmail.com> wrote:
And to save anyone else having to check, numpy's current NUL-padded dtypes only strip trailing NULs, so they can round-trip strings that contain NULs, just not strings where NUL is the last character. So the set of strings representable by str/bytes is a strict superset of the set of strings representable by numpy U/S dtypes, which in turn is a strict superset of the set of strings representable by a hypothetical NUL-terminated dtype. (Of course this doesn't matter for most practical purposes, because people rarely make strings with embedded NULs.)

-n
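A quick check of the behavior described above, with today's NULL-padded 'S' dtype:

    import numpy as np
    a = np.array([b'ab\x00cd'], dtype='S8')
    a[0]   # -> b'ab\x00cd': the embedded NUL round-trips
    b = np.array([b'abc\x00'], dtype='S8')
    b[0]   # -> b'abc': the trailing NUL is stripped on the way out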
On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern <robert.kern@gmail.com> wrote:
Thanks for the clarification. NULL-padded is what I meant. I'm wondering how much of the desired functionality we could get by simply subclassing ndarray in python. I think we mostly want to be able to view byte strings and convert to unicode if needed. Chuck
On Tue, Apr 25, 2017 at 12:30 PM, Charles R Harris < charlesr.harris@gmail.com> wrote:
Thanks for the clarification. NULL-padded is what I meant.
Okay, however, the biggest use-case we have for UTF-8 arrays (HDF5) is NULL-terminated.
I'm wondering how much of the desired functionality we could get by simply subclassing ndarray in python. I think we mostly want to be able to view byte strings and convert to unicode if needed.
I'm not sure. Some of these fixed-width string arrays are embedded inside structured arrays with other dtypes. -- Robert Kern
On Apr 25, 2017 9:35 AM, "Chris Barker" <chris.barker@noaa.gov> wrote:

- filenames

File names are one of the key reasons folks struggled with the python3 data model (particularly on *nix) and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in numpy arrays -- we need to preserve them exactly and display them mostly right. Again, euro-centric, but if you are euro-centric, then latin-1 is a good choice for this.

Eh... First, on Windows and MacOS, filenames are natively Unicode, so you don't care about preserving the bytes, only the characters. It's only Linux and the other traditional unixes where filenames are natively bytestrings. And then from within Python, if you want to actually work with those filenames you need to either have a bytestring type or else a Unicode type that uses surrogateescape to represent the non-ascii characters. I'm not seeing how latin-1 really helps anything here -- best case you still have to do something like the wsgi "encoding dance" before you could use the filenames. IMO if you have filenames that are arbitrary bytestrings and you need to represent this properly, you should just use bytestrings -- really, they're perfectly friendly :-).

-n
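For reference, Python itself does this filename dance via os.fsdecode/os.fsencode, which wrap the surrogateescape round-trip (the exact results below assume a UTF-8 locale on a unix system):

    import os
    name = os.fsdecode(b'caf\xe9')   # -> 'caf\udce9': the undecodable byte becomes a surrogate
    os.fsencode(name)                # -> b'caf\xe9': the original bytes come back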
On Tue, Apr 25, 2017 at 1:30 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
I'm wondering how much of the desired functionality we could get by simply subclassing ndarray in python. I think we mostly want to be able to view byte strings and convert to unicode if needed.
And I think the really tricky part is sorting and rich comparison. Unfortunately, the comparison function is currently located in the c structure. I suppose we could define a c wrapper function to go in the slot. Chuck
On Apr 25, 2017 10:13 AM, "Anne Archibald" <peridot.faceted@gmail.com> wrote: On Tue, Apr 25, 2017 at 6:05 PM Chris Barker <chris.barker@noaa.gov> wrote:
Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear.
I would make my use-cases more user-specific:

1) User wants an array with numpy indexing tricks that can hold python strings but doesn't care about the underlying representation. -> Solvable with object arrays, or Robert's string-specific object arrays; underlying representation is python objects on the heap. Sadly UCS-4, so zillions are going to be a memory problem.

It's possible to do much better than this when defining a specialized variable-width string dtype. E.g. make the itemsize 8 bytes (like an object array, assuming a 64-bit system), but then for strings that can be encoded in 7 bytes or less of utf8, store them directly in the array; else store a pointer to a raw utf8 string on the heap. (Possibly with a reference count - there are some interesting tradeoffs there. I suspect 1-byte reference counts might be the way to go; if a logical copy would make it overflow, then make an actual copy instead.) Anything involving the heap is going to have some overhead, but we don't need full-fledged Python objects, and once we give up mmap compatibility there's a lot of room to tune.

-n
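A rough pure-Python mock-up of the layout sketched above -- an 8-byte slot whose first byte is a tag, holding up to 7 bytes of UTF-8 inline and spilling longer strings to the heap. The function names and the 1-byte tag scheme here are illustrative assumptions, not a worked-out design:

    def pack(s, heap):
        raw = s.encode('utf-8')
        if len(raw) <= 7:
            # tag byte = inline length (0..7), remainder is the UTF-8 payload
            return bytes([len(raw)]) + raw.ljust(7, b'\x00')
        heap.append(raw)  # stand-in for a pointer (plus refcount) to heap storage
        return bytes([0xFF]) + (len(heap) - 1).to_bytes(7, 'little')

    def unpack(slot, heap):
        if slot[0] == 0xFF:
            return heap[int.from_bytes(slot[1:], 'little')].decode('utf-8')
        return slot[1:1 + slot[0]].decode('utf-8')

    heap = []
    unpack(pack('short', heap), heap)                   # -> 'short', no heap traffic
    unpack(pack('a rather longer string', heap), heap)  # -> round-trips via the heap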
A compact dtype for mostly-ascii text:
I'm a little confused about exactly what you're trying to do.

Actually, *I* am not trying to do anything here -- I'm the one that said computers are so big and fast now that we shouldn't whine about 4 bytes for a character... but this whole conversation started with that request... and I have sympathy -- no one likes to waste memory. After all, numpy supports small numeric dtypes, too.

Do you need your in-memory format for this data to be compatible with anything in particular?

Not for this requirement -- binary interchange is another requirement.

If you're not reading or writing files in this format, then it's just a matter of storing a whole bunch of things that are already python strings in memory. Could you use an object array? Or do you have an enormous number so that you need a more compact, fixed-stride memory layout?

That's the whole point, yes. Object arrays would be a good solution to the full Unicode problem, not the "why am I wasting so much space when all my data are ascii?" problem.

Presumably you're getting byte strings (with unknown encoding).

No -- this is for creating and using mostly ascii string data with python and numpy. Unknown encoding bytes belong in byte arrays -- they are not text. I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that.

Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently -- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters.

If your question is "what should numpy's default string dtype be?", well, maybe default to object arrays;

Or UCS-4. I think object arrays would be more problematic for npz storage and raw "tostring" dumping. (And pickle?) Not sure how important that is. And it would be good to have something that plays well with recarrays.

anyone who just has a bunch of python strings to store is unlikely to be surprised by this. Someone with more specific needs will choose a more specific - that is, not default - string data type.

Exactly.

-CHB
What I was wondering was whether people have data lying around with fixed-width fields where the strings are space-padded, so that numpy needs to support that.

I would say whether to strip space-padded strings should be the reader's problem, not numpy's.

-CHB
On Apr 25, 2017, at 12:38 PM, Nathaniel Smith <njs@pobox.com> wrote:
Eh... First, on Windows and MacOS, filenames are natively Unicode.
Yeah, though once they are stored in a text file -- who the heck knows? That may be simply unsolvable.
And then from within Python, if you want to actually work with those filenames you need to either have a bytestring type or else a Unicode type that uses surrogateescape to represent the non-ascii characters.
IMO if you have filenames that are arbitrary bytestrings and you need to represent this properly, you should just use bytestrings -- really, they're perfectly friendly :-).
I thought the Python file (and Path) APIs all required (Unicode) strings? That was the whole complaint! And no, bytestrings are not perfectly friendly in py3.

This got really complicated and sidetracked, but all I'm suggesting is that if we have a 1-byte-per-char string type with a fixed encoding, that encoding should be Latin-1 rather than ASCII. That's it, really. Having a settable encoding would work fine, too.

-CHB
On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < chris.barker@noaa.gov> wrote:
Presumably you're getting byte strings (with unknown encoding).

No -- this is for creating and using mostly ascii string data with python and numpy.
Unknown encoding bytes belong in byte arrays -- they are not text.
You are welcome to try to convince Thomas of that. That is the status quo for him, but he is finding that difficult to work with.

I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that.

That sloppiness that you mention is precisely the "unknown encoding" problem. Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings as well. If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-)
Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently-- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters.
For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-) There are several use cases being brought forth here. Some involve file reading, some involve file writing, and some involve in-memory manipulation. Whatever change we make is going to impinge somehow on all of the use cases. If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings. -- Robert Kern
On Tue, Apr 25, 2017 at 4:11 PM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
I thought the Python file (and Path) APIs all required (Unicode) strings? That was the whole complaint!
No, the path APIs all accept bytestrings (and ones that return pathnames like listdir return bytestrings if given bytestrings). Or at least they're supposed to. The really urgent need for surrogateescape was things like sys.argv and os.environ where arbitrary bytes might come in (on some systems) but the API is restricted to strs.
And no, bytestrings are not perfectly friendly in py3.
I'm not saying you should use them everywhere or that they remove the need for an ergonomic text dtype, but when you actually want to work with bytes they're pretty good (esp. in modern py3). -n -- Nathaniel J. Smith -- https://vorpus.org
On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern <robert.kern@gmail.com> wrote:
There are several use cases being brought forth here. Some involve file reading, some involve file writing, and some involve in-memory manipulation. Whatever change we make is going to impinge somehow on all of the use cases. If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings.
The maximum length of a UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in-memory problem, but it does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation.

Note that for terminal display we will want something supported by the system, which is another problem altogether. Let me break the problem down into four categories:

1. Storage -- hdf5, .npy, fits, etc.
2. Display -- ?
3. Modification -- editing
4. Parsing -- fits, etc.

There is probably no one solution that is optimal for all of those.

Chuck
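For what it's worth, the factor-of-4 claim is easy to check on ASCII-heavy text in plain Python:

    s = 'RA 10h 21m 00s ' * 1000
    len(s.encode('utf-32'))   # -> 60004: 4 bytes per character, plus a 4-byte BOM
    len(s.encode('utf-8'))    # -> 15000: 1 byte per character for pure ASCII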
On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
quoting Julian:

''' I probably have formulated my goal with the proposal a bit better, I am not very interested in a repetition of which encoding to use debate. In the end what will be done allows any encoding via a dtype with metadata like datetime. This allows any codec (including truncated utf8) to be added easily (if python supports it) and allows sidestepping the debate. My main concern is whether it should be a new dtype or modifying the unicode dtype. Though the backward compatibility argument is strongly in favour of adding a new dtype that makes the np.unicode type redundant. '''

I don't quite understand why this discussion goes in a direction of an either-one-XOR-the-other dtype. The parameterized 1-byte encoding that Julian mentioned initially sounds useful to me. (I'm not sure I will use it much, but I also don't use float16.)

Josef
On Tue, Apr 25, 2017 at 7:11 PM, Chris Barker - NOAA Federal < chris.barker@noaa.gov> wrote:
This got really complicated and sidetracked, but all I'm suggesting is that if we have a 1-byte-per-char string type with a fixed encoding, that encoding should be Latin-1 rather than ASCII.
That's it, really.
Fully agreed.
Having a settable encoding would work fine, too.
Yup. At a simple level, I just want the things that currently work just fine in Py2 to start working in Py3. That includes being able to read / manipulate / compute and write back to legacy binary FITS and HDF5 files that include ASCII-ish text data (not strictly ASCII). Memory mapping such files should be supportable. Swapping type from bytes to a 1-byte char str should be possible without altering data in memory. BTW, I am saying "I want", but this functionality would definitely be welcome in astropy. I wrote a unicode sandwich workaround for the astropy Table class (https://github.com/astropy/astropy/pull/5700) which should be in the next release. It would be way better to have this at a level lower in numpy. - Tom
On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation.
The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters. -- Robert Kern
On Tue, Apr 25, 2017 at 9:21 PM Robert Kern <robert.kern@gmail.com> wrote:
The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters.
It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: http://utf8everywhere.org/#myths I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications. So if we're adding any new string encodings, it needs to be one of them.
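Concretely, byte length and code-point length diverge as soon as any non-ASCII character appears:

    s = 'na\u00efve'           # 'naïve'
    len(s)                     # -> 5 code points
    len(s.encode('utf-8'))     # -> 6 bytes: the 'ï' takes two

so a byte-sized UTF-8 dtype would report and enforce the 6, not the 5.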
On 26.04.2017 03:55, josef.pktd@gmail.com wrote:
I don't quite understand why this discussion goes in a direction of an either one XOR the other dtype.
I thought the parameterized 1-byte encoding that Julian mentioned initially sounds useful to me.
(I'm not sure I will use it much, but I also don't use float16 )
Josef
Indeed, most of this discussion is irrelevant to numpy. Numpy only really deals with the in-memory storage of strings, and in that it is limited to fixed-length strings (in bytes/codepoints). How you get your messy strings into numpy arrays is not very relevant to the discussion of a smaller representation of strings. You couldn't get messy strings into numpy without first sorting them out yourself before, and you won't be able to afterwards. Numpy will offer a set of encodings, the user chooses which one is best for the use case, and if the user screws it up, it is not numpy's problem.

You currently only have a few ways to even construct string arrays:
- array construction and loops
- genfromtxt (which is again just a loop)
- memory mapping, which I seriously doubt anyone actually does for the S and U dtypes

Having a new dtype changes nothing here. You still need to create numpy arrays from python strings, which are well defined and clean. If you put something in that doesn't encode, you get an encoding error. No oddities like surrogate escapes are needed; numpy arrays are not interfaces to operating systems, nor does numpy need to _add_ support for historical oddities beyond what it already has. If you want to represent bytes exactly as they came in, don't use a text dtype (which includes the S dtype; use i1).

Concerning variable-sized strings, this is simply not going to happen. Nobody is going to rewrite numpy to support them, especially not just for something as unimportant as strings. The best you are going to get (or better, already have) is object arrays. It makes no sense to discuss it unless someone comes up with an actual proposal and the willingness to code it.

What is a relevant discussion is whether we really need a more compact but limited representation of text than 4-byte UTF-32 at all. Its use case is for the most part just python3 porting and saving some memory in some ascii-heavy cases, e.g. astronomy. It is not that significant anymore, as porting to python3 has mostly already happened via the ugly byte workaround, and memory saving is probably not as significant in the context of numpy, which is already heavy on memory usage.

My initial approach was to not add a new dtype but to make unicode parametrizable, which would have meant almost no cluttering of numpy's internals and keeping the api more or less consistent. That would make this a relatively simple addition of minor functionality for people who want it. But adding a completely new, partially redundant dtype for this use case may be too large a change to the api. Having two partially redundant string types may confuse users more than our current status quo of a single string type (U).

Discussing whether we want to support truncated utf8 has some merit, as it is a decision whether to give users an even larger gun to shoot themselves in the foot with. But I'd like to focus first on the 1-byte type, to add a symmetric API for python2 and python3. utf8 can always be added later should we deem it a good idea.

cheers,
Julian
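The datetime comparison above refers to how numpy already parameterizes a dtype via metadata, e.g. np.dtype('datetime64[ms]'). An encoding-parameterized text dtype would follow the same pattern; the 'U[...]' spellings below are only illustrative, not an existing or agreed-upon API:

    import numpy as np
    np.dtype('datetime64[ms]')   # existing: the unit lives in the dtype's metadata
    # hypothetical, by analogy:
    # np.dtype('U[latin1]')      # one byte per character, latin-1 encoded
    # np.dtype('U[utf8]')        # truncated utf-8, sized in bytes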
On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer <shoyer@gmail.com> wrote:
It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: http://utf8everywhere.org/#myths
I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications. So if we're adding any new string encodings, it needs to be one of them.
It seems to me that most of the requirements people have expressed in this thread would be satisfied by:

(1) object arrays of strings. (We have these already; whether a strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.)

(2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All python encodings should be permitted. An additional function to truncate encoded data without mangling the encoding would be handy. I think it makes more sense for this to be NULL-padded than NULL-terminated, but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4. This also includes the legacy UCS4 strings as a special case.

(3) a dtype for fixed-length byte strings. This doesn't look very different from an array of dtype u8, but given we have the bytes type, accessing the data this way makes sense.

There seems to be considerable debate about what the "default" string type should be, but since users must specify a length anyway, we might as well force them to specify an encoding and thus dodge the debate about the right default.

The other question - which I realize is how the thread started - is what to do about backward compatibility. I'm not writing the code, so my opinion doesn't matter much, but I think we're stuck maintaining what we have now - ASCII and UCS4 strings - for a while yet. But they can be deprecated, or they can simply be reimplemented as shorthand names for ASCII- or UCS4-encoded strings in the bytes-with-encoding dtype.

Anne
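The "truncate encoded data without mangling the encoding" helper from point (2) is simple to sketch for UTF-8, since a byte-level cut can only damage the tail of the buffer. A minimal pure-Python version, assuming the input is valid UTF-8:

    def truncate_utf8(raw, nbytes):
        out = raw[:nbytes]
        while out:
            try:
                out.decode('utf-8')
                return out
            except UnicodeDecodeError as err:
                out = out[:err.start]   # drop the partial trailing sequence
        return out

    truncate_utf8('na\u00efve'.encode('utf-8'), 3)   # -> b'na', not a broken b'na\xc3'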
On Wed, Apr 26, 2017 at 3:15 AM, Julian Taylor < jtaylor.debian@googlemail.com> wrote:
On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern <robert.kern@gmail.com>
wrote:
On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
Presumably you're getting byte strings (with unknown encoding.
No -- thus is for creating and using mostly ascii string data with python and numpy.
Unknown encoding bytes belong in byte arrays -- they are not text.
You are welcome to try to convince Thomas of that. That is the status
quo
for him, but he is finding that difficult to work with.
I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way to many files like that.
That sloppiness that you mention is precisely the "unknown encoding" problem. Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings as well. If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-)
Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently-- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters.
For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-)
There are several use cases being brought forth here. Some involve file reading, some involve file writing, and some involve in-memory manipulation. Whatever change we make is going to impinge somehow on all of the use cases. If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings.
The maximum length of an UTF-8 character is 4 bytes, so we could use
On 26.04.2017 03:55, josef.pktd@gmail.com wrote: that to
size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation.
Note that for terminal display we will want something supported by the system, which is another problem altogether. Let me break the problem down into four categories
Storage -- hdf5, .npy, fits, etc. Display -- ? Modification -- editing Parsing -- fits, etc.
There is probably no one solution that is optimal for all of those.
Chuck
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
quoting Julian
''' I probably have formulated my goal with the proposal a bit better, I am not very interested in a repetition of which encoding to use debate. In the end what will be done allows any encoding via a dtype with metadata like datetime. This allows any codec (including truncated utf8) to be added easily (if python supports it) and allows sidestepping the debate.
My main concern is whether it should be a new dtype or modifying the unicode dtype. Though the backward compatibility argument is strongly in favour of adding a new dtype that makes the np.unicode type redundant. '''
I don't quite understand why this discussion goes in a direction of an either one XOR the other dtype.
I thought the parameterized 1-byte encoding that Julian mentioned initially sounds useful to me.
(I'm not sure I will use it much, but I also don't use float16 )
Josef
Indeed, Most of this discussion is irrelevant to numpy. Numpy only really deals with the in memory storage of strings. And in that it is limited to fixed length strings (in bytes/codepoints). How you get your messy strings into numpy arrays is not very relevant to the discussion of a smaller representation of strings. You couldn't get messy strings into numpy without first sorting it out yourself before, you won't be able to afterwards. Numpy will offer a set of encodings, the user chooses which one is best for the use case and if the user screws it up, it is not numpy's problem.
You currently only have a few ways to even construct string arrays: - array construction and loops - genfromtxt (which is again just a loop) - memory mapping which I seriously doubt anyone actually does for the S and U dtype
Having a new dtype changes nothing here. You still need to create numpy arrays from python strings which are well defined and clean. If you put something in that doesn't encode you get an encoding error. No oddities like surrogate escapes are needed, numpy arrays are not interfaces to operating systems nor does numpy need to _add_ support for historical oddities beyond what it already has. If you want to represent bytes exactly as they came in don't use a text dtype (which includes the S dtype, use i1).
Concerning variable-sized strings: this is simply not going to happen. Nobody is going to rewrite numpy to support it, especially not just for something as unimportant as strings. The best you are going to get (or rather, already have) is object arrays. It makes no sense to discuss it unless someone comes up with an actual proposal and the willingness to code it.
What is a relevant discussion is whether we really need a more compact but limited representation of text than 4-byte utf32 at all. Its use case is for the most part just python3 porting and saving some memory in some ASCII-heavy cases, e.g. astronomy. It is not that significant anymore, as porting to python3 has mostly already happened via the ugly byte workaround, and memory saving is probably not as significant in the context of numpy, which is already heavy on memory usage.
My initial approach was to not add a new dtype but to make unicode parametrizable, which would have meant almost no cluttering of numpy's internals and keeping the api more or less consistent; that would make this a relatively simple addition of minor functionality for people that want it. But adding a completely new, partially redundant dtype for this use case may be too large a change to the api. Having two partially redundant string types may confuse users more than our current status quo of a single string type (U).
Discussing whether we want to support truncated utf8 has some merit, as it is a decision whether to give users an even larger gun to shoot themselves in the foot with. But I'd like to focus first on the 1-byte type, to add a symmetric API for python2 and python3. utf8 can always be added later, should we deem it a good idea.
I think we can implement viewers for strings as ndarray subclasses. Then one could do `my_string_array.view(latin_1)`, and so on. Essentially that just changes the default encoding of the 'S' array. That could also work for uint8 arrays if needed. Chuck
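A minimal sketch of that viewer idea, assuming NUL-padded latin-1 bytes in an 'S' array (the class name is illustrative, not an existing numpy API):

import numpy as np

class latin_1(np.ndarray):
    # view of an 'S' array whose scalars come out as str, decoded from latin-1
    def __getitem__(self, index):
        item = super(latin_1, self).__getitem__(index)
        if isinstance(item, bytes):
            return item.decode('latin-1')
        return item

s = np.array([b'caf\xe9', b'the'], dtype='S4')
s.view(latin_1)[0]   # 'café' -- same buffer, different presentation

As Eric points out next, though, this breaks down for structured dtypes, which is an argument for putting the encoding in the dtype itself.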
To handle structured data-types containing encoded strings, we'd also need to subclass `np.void`. Things would get messy when a structured dtype contains two strings in different encodings (or more likely, one bytestring and one textstring) - we'd need some way to specify which fields are in which encoding, and using subclasses means that this can't be contained within the dtype information. So I think there's a strong argument for solving this with `dtype`s rather than subclasses. This really doesn't seem hard though. Something like (C-but-as-python):

def ENCSTRING_getitem(ptr, arr):  # the PyArrFuncs slot
    encoded = STRING_getitem(ptr, arr)
    return encoded.decode(arr.dtype.encoding)

def ENCSTRING_setitem(val, ptr, arr):  # the PyArrFuncs slot
    val = val.encode(arr.dtype.encoding)
    # todo: handle "safe" truncation, where safe might mean keep
    # codepoints, keep graphemes, or never allow
    STRING_setitem(val, ptr, arr)

We'd probably need to be careful to do a decode/encode dance when copying from one encoding to another, but we [already have bugs](https://github.com/numpy/numpy/issues/3258) in those cases anyway. Is it reasonable that the user of such an array would want to work with plain `builtin.unicode` objects, rather than some special numpy scalar type? Eric
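For the "safe" truncation todo above, a brute-force sketch that at least never splits a code point (an illustration only -- keeping graphemes intact would need more machinery):

def truncate_utf8(data, max_bytes):
    # cut at the byte limit, then back off until the result decodes cleanly
    cut = data[:max_bytes]
    while cut:
        try:
            cut.decode('utf-8')
            return cut
        except UnicodeDecodeError:
            cut = cut[:-1]
    return cut

truncate_utf8('aé€'.encode('utf-8'), 4)   # b'a\xc3\xa9' -- the 3-byte '€' is dropped whole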
I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that.
That sloppiness that you mention is precisely the "unknown encoding" problem.
Exactly -- but from a practicality-beats-purity perspective, there is a difference between "I have no idea whatsoever" and "I know it is mostly ascii, and European, but there are some extra characters in there". Latin-1 has proven very useful for that case. I suppose in most cases ascii with errors='replace' would be a good choice, but I'd still rather not throw out potentially useful information.
Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings as well. If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-)
Yeah, I've been very unfocused in this discussion -- sorry about that.
Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently -- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters.
For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-)
UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model python 3 has chosen. I wrote a much longer rant about that earlier. So I think the easy-to-access, and particularly default, numpy string dtypes should match it.
It's become clear in this discussion that there is a strong desire to support a numpy dtype that stores text in particular binary formats (i.e. encodings). Rather than choose one or two, we might as well support all encodings supported by python. In that case, we'll have utf-8 for those that know they want that, and we'll have latin-1 for those that incorrectly think they want that :-)
So what remains to be decided is implementation, syntax, and defaults.
Let's keep in mind that most of us on this list, and in this discussion, are the folks that write interface code and the like. But most numpy users are not as tuned in to the internals. So defaults should be set to best support the more "naive" user.
If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings.
If we add every encoding known to man, someone is going to use Latin-1 to read unknown encodings. Indeed, as we've all pointed out, there is no correct encoding with which to read unknown encodings. Frankly, if we have UTF-8 under the hood, I think people are even MORE likely to use it inappropriately -- it's quite scary how many people think UTF-8 == Unicode, and think all you need to do is "use utf-8" and you don't need to change any of the rest of your code. Oh, and once you've done that, you can use your existing ASCII-only tests and think you have a working application :-) -CHB
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
Indeed, most of this discussion is irrelevant to numpy. Numpy only really deals with the in-memory storage of strings, and in that it is limited to fixed-length strings (in bytes/codepoints). How you get your messy strings into numpy arrays is not very relevant to the discussion of a smaller representation of strings. You couldn't get messy strings into numpy without first sorting them out yourself before, and you won't be able to afterwards. Numpy will offer a set of encodings; the user chooses which one is best for the use case, and if the user screws it up, it is not numpy's problem.
You currently only have a few ways to even construct string arrays:
- array construction and loops
- genfromtxt (which is again just a loop)
- memory mapping, which I seriously doubt anyone actually does for the S and U dtype
I fear that you decided that the discussion was irrelevant and thus did not read it rather than reading it to decide that it was not relevant. Because several of us have showed that, yes indeed, we do memory-map string arrays. You can add to this list C APIs, like that of libhdf5, that need to communicate (Unicode) string arrays. Look, I know I can be tedious, but *please* go back and read this discussion. We have concrete use cases outlined. We can give you more details if you need them. We all feel the pain of the rushed, inadequate implementation of the U dtype. But each of our pains is a little bit different; you obviously aren't experiencing the same pains that I am.
Having a new dtype changes nothing here. You still need to create numpy arrays from python strings, which are well defined and clean. If you put something in that doesn't encode, you get an encoding error. No oddities like surrogate escapes are needed; numpy arrays are not interfaces to operating systems, nor does numpy need to _add_ support for historical oddities beyond what it already has. If you want to represent bytes exactly as they came in, don't use a text dtype (which includes the S dtype); use i1.
Thomas Aldcroft has demonstrated the problem with this approach. numpy arrays are often interfaces to files that have tons of historical oddities.
Concerning variable-sized strings: this is simply not going to happen. Nobody is going to rewrite numpy to support it, especially not just for something as unimportant as strings. The best you are going to get (or rather, already have) is object arrays. It makes no sense to discuss it unless someone comes up with an actual proposal and the willingness to code it.
No one has suggested such a thing. At most, we've talked about specializing object arrays.
What is a relevant discussion is whether we really need a more compact but limited representation of text than 4-byte utf32 at all. Its use case is for the most part just python3 porting and saving some memory in some ASCII-heavy cases, e.g. astronomy. It is not that significant anymore, as porting to python3 has mostly already happened via the ugly byte workaround, and memory saving is probably not as significant in the context of numpy, which is already heavy on memory usage.
My initial approach was to not add a new dtype but to make unicode parametrizable, which would have meant almost no cluttering of numpy's internals and keeping the api more or less consistent; that would make this a relatively simple addition of minor functionality for people that want it. But adding a completely new, partially redundant dtype for this use case may be too large a change to the api. Having two partially redundant string types may confuse users more than our current status quo of a single string type (U).
Discussing whether we want to support truncated utf8 has some merit, as it is a decision whether to give users an even larger gun to shoot themselves in the foot with. But I'd like to focus first on the 1-byte type, to add a symmetric API for python2 and python3. utf8 can always be added later, should we deem it a good idea.
What is your current proposal? A string dtype parameterized with the encoding (initially supporting the latin-1 that you desire and maybe adding utf-8 later)? Or a latin-1-specific dtype such that we will have to add a second utf-8 dtype at a later date? If you're not going to support arbitrary encodings right off the bat, I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first as they seem to knock off more use cases straight away. -- Robert Kern
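(For anyone unfamiliar with it, surrogateescape is the stdlib's mechanism for round-tripping not-quite-ascii bytes; a quick illustration:)

>>> raw = b'caf\xe9 123'                         # mostly ascii, one stray byte
>>> s = raw.decode('ascii', 'surrogateescape')   # no error; the bad byte becomes a lone surrogate
>>> s
'caf\udce9 123'
>>> s.encode('ascii', 'surrogateescape') == raw  # lossless round-trip
True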
On 26.04.2017 19:08, Robert Kern wrote:
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <jtaylor.debian@googlemail.com <mailto:jtaylor.debian@googlemail.com>> wrote:
Indeed, most of this discussion is irrelevant to numpy. Numpy only really deals with the in-memory storage of strings, and in that it is limited to fixed-length strings (in bytes/codepoints). How you get your messy strings into numpy arrays is not very relevant to the discussion of a smaller representation of strings. You couldn't get messy strings into numpy without first sorting them out yourself before, and you won't be able to afterwards. Numpy will offer a set of encodings; the user chooses which one is best for the use case, and if the user screws it up, it is not numpy's problem.
You currently only have a few ways to even construct string arrays:
- array construction and loops
- genfromtxt (which is again just a loop)
- memory mapping, which I seriously doubt anyone actually does for the S and U dtype
I fear that you decided that the discussion was irrelevant and thus did not read it rather than reading it to decide that it was not relevant. Because several of us have showed that, yes indeed, we do memory-map string arrays.
You can add to this list C APIs, like that of libhdf5, that need to communicate (Unicode) string arrays.
Look, I know I can be tedious, but *please* go back and read this discussion. We have concrete use cases outlined. We can give you more details if you need them. We all feel the pain of the rushed, inadequate implementation of the U dtype. But each of our pains is a little bit different; you obviously aren't experiencing the same pains that I am.
I have read every mail, and it has been a large waste of time; everything has been said already many times in the last few years. Even if you memory-map string arrays -- of which I have not seen a concrete use case in the mails beyond "would be nice to have", without any backing in actual code, though I may have missed it -- it is in any case still irrelevant. My proposal only _adds_ additional cases that can be mmapped. It does not prevent you from doing what you have been doing before.
Having a new dtype changes nothing here. You still need to create numpy arrays from python strings, which are well defined and clean. If you put something in that doesn't encode, you get an encoding error. No oddities like surrogate escapes are needed; numpy arrays are not interfaces to operating systems, nor does numpy need to _add_ support for historical oddities beyond what it already has. If you want to represent bytes exactly as they came in, don't use a text dtype (which includes the S dtype); use i1.
Thomas Aldcroft has demonstrated the problem with this approach. numpy arrays are often interfaces to files that have tons of historical oddities.
This does not matter for numpy; the text dtype is well defined as bytes with a specific encoding and null padding. If you have a historical oddity that does not fit, do not use the text dtype; use a pure byte array instead.
Concerning variable-sized strings: this is simply not going to happen. Nobody is going to rewrite numpy to support it, especially not just for something as unimportant as strings. The best you are going to get (or rather, already have) is object arrays. It makes no sense to discuss it unless someone comes up with an actual proposal and the willingness to code it.
No one has suggested such a thing. At most, we've talked about specializing object arrays.
What is a relevant discussion is whether we really need a more compact but limited representation of text than 4-byte utf32 at all. Its use case is for the most part just python3 porting and saving some memory in some ASCII-heavy cases, e.g. astronomy. It is not that significant anymore, as porting to python3 has mostly already happened via the ugly byte workaround, and memory saving is probably not as significant in the context of numpy, which is already heavy on memory usage.
My initial approach was to not add a new dtype but to make unicode parametrizable, which would have meant almost no cluttering of numpy's internals and keeping the api more or less consistent; that would make this a relatively simple addition of minor functionality for people that want it. But adding a completely new, partially redundant dtype for this use case may be too large a change to the api. Having two partially redundant string types may confuse users more than our current status quo of a single string type (U).
Discussing whether we want to support truncated utf8 has some merit, as it is a decision whether to give users an even larger gun to shoot themselves in the foot with. But I'd like to focus first on the 1-byte type, to add a symmetric API for python2 and python3. utf8 can always be added later, should we deem it a good idea.
What is your current proposal? A string dtype parameterized with the encoding (initially supporting the latin-1 that you desire and maybe adding utf-8 later)? Or a latin-1-specific dtype such that we will have to add a second utf-8 dtype at a later date?
My proposal is a single new parameterizable dtype. Adding multiple dtypes, one for each encoding, seems unnecessary to me given that numpy already supports parameterizable types. For example, datetime is very similar: it is basically encoded integers, with multiple encodings (= units) supported.
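(The datetime parallel is real, existing numpy; the string spelling below is hypothetical, only to show the shape of the idea:)

>>> np.dtype('datetime64[ms]')   # the unit is a parameter of the dtype
dtype('<M8[ms]')
>>> np.dtype('datetime64[D]')
dtype('<M8[D]')
>>> # an encoding-parameterized text dtype might be spelled analogously,
>>> # e.g. np.dtype('encstr[16,utf-8]') -- hypothetical, not real numpy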
If you're not going to support arbitrary encodings right off the bat, I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first as they seem to knock off more use cases straight away.
Please list the use cases in the context of numpy usage. hdf5 is the most obvious, but how exactly would hdf5 use a utf8 array in the actual implementation? What you save by having utf8 in the numpy array is replacing a decoding and encoding step with a stripping-null-padding step. That doesn't seem very worthwhile compared to all the other overheads involved.
On Wed, Apr 26, 2017 at 3:27 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:
On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer <shoyer@gmail.com> wrote:
On Tue, Apr 25, 2017 at 9:21 PM Robert Kern <robert.kern@gmail.com> wrote:
On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
The maximum length of a UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in-memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation.
The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters.
It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: http://utf8everywhere.org/#myths
I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications. So if we're adding any new string encodings, it needs to be one of them.
It seems to me that most of the requirements people have expressed in this thread would be satisfied by:
(1) object arrays of strings. (We have these already; whether a
strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.)
(2) a dtype for fixed byte-size, specified-encoding, NULL-padded data.
All python encodings should be permitted. An additional function to truncate encoded data without mangling the encoding would be handy. I think it makes more sense for this to be NULL-padded than NULL-terminated but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4. This also includes the legacy UCS4 strings as a special case.
(3) a dtype for fixed-length byte strings. This doesn't look very
different from an array of dtype u8, but given we have the bytes type, accessing the data this way makes sense. The void dtype is already there for this general purpose and mostly works, with a few niggles. On Python 3, it uses 'int8' ndarrays underneath the scalars (fortunately, they do not appear to be mutable views). It also accepts `bytes` strings that are too short (pads with NULs) and too long (truncates). If it worked more transparently and perhaps rigorously with `bytes`, then it would be quite suitable. -- Robert Kern
On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" <chris.barker@noaa.gov> wrote:
UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model python 3 has chosen. I wrote a much longer rant about that earlier. So I think the easy to access, and particularly defaults, numpy string dtypes should match it.
This seems a little vague? The "character-oriented Python text model" is just that str supports O(1) indexing of characters. But... Numpy doesn't. If you want to access individual characters inside a string inside an array, you have to pull out the scalar first, at which point the data is copied and boxed into a Python object anyway, using whatever representation the interpreter prefers. So AFAICT it makes literally no difference to the user whether numpy's internal representation allows for fast character access. -n
On Wed, 2017-04-26 at 19:43 +0200, Julian Taylor wrote:
On 26.04.2017 19:08, Robert Kern wrote:
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <jtaylor.debian@googlemail.com <mailto:jtaylor.debian@googlemail.com>> wrote:
Indeed, most of this discussion is irrelevant to numpy. Numpy only really deals with the in-memory storage of strings, and in that it is limited to fixed-length strings (in bytes/codepoints). How you get your messy strings into numpy arrays is not very relevant to the discussion of a smaller representation of strings. You couldn't get messy strings into numpy without first sorting them out yourself before, and you won't be able to afterwards. Numpy will offer a set of encodings; the user chooses which one is best for the use case, and if the user screws it up, it is not numpy's problem.
You currently only have a few ways to even construct string arrays:
- array construction and loops
- genfromtxt (which is again just a loop)
- memory mapping, which I seriously doubt anyone actually does for the S and U dtype
I fear that you decided that the discussion was irrelevant and thus did not read it rather than reading it to decide that it was not relevant. Because several of us have showed that, yes indeed, we do memory-map string arrays.
You can add to this list C APIs, like that of libhdf5, that need to communicate (Unicode) string arrays.
Look, I know I can be tedious, but *please* go back and read this discussion. We have concrete use cases outlined. We can give you more details if you need them. We all feel the pain of the rushed, inadequate implementation of the U dtype. But each of our pains is a little bit different; you obviously aren't experiencing the same pains that I am.
I have read every mail, and it has been a large waste of time; everything has been said already many times in the last few years. Even if you memory-map string arrays -- of which I have not seen a concrete use case in the mails beyond "would be nice to have", without any backing in actual code, though I may have missed it -- it is in any case still irrelevant. My proposal only _adds_ additional cases that can be mmapped. It does not prevent you from doing what you have been doing before.
Having a new dtype changes nothing here. You still need to create numpy arrays from python strings, which are well defined and clean. If you put something in that doesn't encode, you get an encoding error. No oddities like surrogate escapes are needed; numpy arrays are not interfaces to operating systems, nor does numpy need to _add_ support for historical oddities beyond what it already has. If you want to represent bytes exactly as they came in, don't use a text dtype (which includes the S dtype); use i1.
Thomas Aldcroft has demonstrated the problem with this approach. numpy arrays are often interfaces to files that have tons of historical oddities.
This does not matter for numpy; the text dtype is well defined as bytes with a specific encoding and null padding. If you have a historical oddity that does not fit, do not use the text dtype; use a pure byte array instead.
Concerning variable-sized strings: this is simply not going to happen. Nobody is going to rewrite numpy to support it, especially not just for something as unimportant as strings. The best you are going to get (or rather, already have) is object arrays. It makes no sense to discuss it unless someone comes up with an actual proposal and the willingness to code it.
No one has suggested such a thing. At most, we've talked about specializing object arrays.
What is a relevant discussion is whether we really need a more compact but limited representation of text than 4-byte utf32 at all. Its use case is for the most part just python3 porting and saving some memory in some ASCII-heavy cases, e.g. astronomy. It is not that significant anymore, as porting to python3 has mostly already happened via the ugly byte workaround, and memory saving is probably not as significant in the context of numpy, which is already heavy on memory usage.
My initial approach was to not add a new dtype but to make unicode parametrizable, which would have meant almost no cluttering of numpy's internals and keeping the api more or less consistent; that would make this a relatively simple addition of minor functionality for people that want it. But adding a completely new, partially redundant dtype for this use case may be too large a change to the api. Having two partially redundant string types may confuse users more than our current status quo of a single string type (U).
Discussing whether we want to support truncated utf8 has some merit, as it is a decision whether to give users an even larger gun to shoot themselves in the foot with. But I'd like to focus first on the 1-byte type, to add a symmetric API for python2 and python3. utf8 can always be added later, should we deem it a good idea.
What is your current proposal? A string dtype parameterized with the encoding (initially supporting the latin-1 that you desire and maybe adding utf-8 later)? Or a latin-1-specific dtype such that we will have to add a second utf-8 dtype at a later date?
My proposal is a single new parameterizable dtype. Adding multiple dtypes, one for each encoding, seems unnecessary to me given that numpy already supports parameterizable types. For example, datetime is very similar: it is basically encoded integers, with multiple encodings (= units) supported.
If you're not going to support arbitrary encodings right off the bat, I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first as they seem to knock off more use cases straight away.
Please list the use cases in the context of numpy usage. hdf5 is the most obvious, but how exactly would hdf5 use a utf8 array in the actual implementation?
What you save by having utf8 in the numpy array is replacing a decoding and encoding step with a stripping-null-padding step. That doesn't seem very worthwhile compared to all the other overheads involved.
I remember talking with a colleague about something like that. And basically an annoying thing there was that if you strip the zero bytes in a zero-padded string, some encodings (UTF16) may need one of the zero bytes to work right. (I think she got around it by weird trickery -- inverting the endianness or so, and thus putting the zero bytes first.) Maybe I will ask her if this discussion is interesting to her. Though I think it might have been something like "make everything in hdf5/something similar work" without any actual use case, I don't know. I have not read the whole thread, but I think a fixed-byte, settable-encoding type would make sense. I personally wonder whether storing the length might make sense, even if that removes direct memory mapping; but as you said, you can still memmap the bytes and then probably just cast back and forth. Sorry if there is zero actual input here :) - Sebastian
On Wed, Apr 26, 2017 at 2:31 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" <chris.barker@noaa.gov> wrote:
UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model python 3 has chosen. I wrote a much longer rant about that earlier.
So I think the easy to access, and particularly defaults, numpy string dtypes should match it.
This seems a little vague? The "character-oriented Python text model" is just that str supports O(1) indexing of characters. But... Numpy doesn't. If you want to access individual characters inside a string inside an array, you have to pull out the scalar first, at which point the data is copied and boxed into a Python object anyway, using whatever representation the interpreter prefers. So AFAICT it makes literally no difference to the user whether numpy's internal representation allows for fast character access.
you can create a view on individual characters or bytes, AFAICS
t = np.array(['abcdefg']*10)
t2 = t.view([('s%d' % i, '<U1') for i in range(7)])
t2['s5']
array(['f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f'], dtype='<U1')

t.view('<U1').reshape(len(t), -1)[:, 2]
array(['c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c'], dtype='<U1')
Josef
-n
On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 26.04.2017 19:08, Robert Kern wrote:
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <jtaylor.debian@googlemail.com <mailto:jtaylor.debian@googlemail.com>> wrote:
Indeed, most of this discussion is irrelevant to numpy. Numpy only really deals with the in-memory storage of strings, and in that it is limited to fixed-length strings (in bytes/codepoints). How you get your messy strings into numpy arrays is not very relevant to the discussion of a smaller representation of strings. You couldn't get messy strings into numpy without first sorting them out yourself before, and you won't be able to afterwards. Numpy will offer a set of encodings; the user chooses which one is best for the use case, and if the user screws it up, it is not numpy's problem.
You currently only have a few ways to even construct string arrays:
- array construction and loops
- genfromtxt (which is again just a loop)
- memory mapping, which I seriously doubt anyone actually does for the S and U dtype
I fear that you decided that the discussion was irrelevant and thus did not read it rather than reading it to decide that it was not relevant. Because several of us have showed that, yes indeed, we do memory-map string arrays.
You can add to this list C APIs, like that of libhdf5, that need to communicate (Unicode) string arrays.
Look, I know I can be tedious, but *please* go back and read this discussion. We have concrete use cases outlined. We can give you more details if you need them. We all feel the pain of the rushed, inadequate implementation of the U dtype. But each of our pains is a little bit different; you obviously aren't experiencing the same pains that I am.
I have read every mail, and it has been a large waste of time; everything has been said already many times in the last few years. Even if you memory-map string arrays, of which I have not seen a concrete use case in the mails beyond "would be nice to have" without any backing in actual code -- but I may have missed it.
Yes, we have stated that FITS files with string arrays are currently being read via memory mapping. http://docs.astropy.org/en/stable/io/fits/index.html You were even pointed to a minor HDF5 implementation that memory maps: https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683 I'm afraid that I can't share the actual code of the full variety of proprietary file formats that I've written code for, but I can assure you that I have memory-mapped many string arrays in my time, usually embedded as columns in structured arrays. It is not "nice to have"; it is "have done many times and needs better support".
In any case it is still irrelevant. My proposal only _adds_ additional cases that can be mmapped. It does not prevent you from doing what you have been doing before.
You are the one who keeps worrying about the additional complexity, both in code and mental capacity of our users, of adding new overlapping dtypes and solutions, and you're not wrong about that. I think it behooves us to consider if there are solutions that solve multiple related problems at once instead of adding new dtypes piecemeal to solve individual problems.
Having a new dtype changes nothing here. You still need to create numpy arrays from python strings, which are well defined and clean. If you put something in that doesn't encode, you get an encoding error. No oddities like surrogate escapes are needed; numpy arrays are not interfaces to operating systems, nor does numpy need to _add_ support for historical oddities beyond what it already has. If you want to represent bytes exactly as they came in, don't use a text dtype (which includes the S dtype); use i1.
Thomas Aldcroft has demonstrated the problem with this approach. numpy arrays are often interfaces to files that have tons of historical oddities.
This does not matter for numpy; the text dtype is well defined as bytes with a specific encoding and null padding.
You cannot dismiss something as "not mattering for *numpy*" just because your new, *proposed* text dtype doesn't support it. You seem to have fixed on a course of action and are defining everyone else's use cases as out-of-scope because your course of action doesn't support them. That's backwards. Define the use cases first, determine the requirements, then build a solution that meets those requirements. We skipped those steps before, and that's why we're all feeling the pain.
If you have a historical oddity that does not fit, do not use the text dtype; use a pure byte array instead.
Concerning variable-sized strings: this is simply not going to happen. Nobody is going to rewrite numpy to support it, especially not just for something as unimportant as strings. The best you are going to get (or rather, already have) is object arrays. It makes no sense to discuss it unless someone comes up with an actual proposal and the willingness to code it.
No one has suggested such a thing. At most, we've talked about specializing object arrays.
What is a relevant discussion is whether we really need a more compact but limited representation of text than 4-byte utf32 at all. Its use case is for the most part just python3 porting and saving some memory in some ASCII-heavy cases, e.g. astronomy. It is not that significant anymore, as porting to python3 has mostly already happened via the ugly byte workaround, and memory saving is probably not as significant in the context of numpy, which is already heavy on memory usage.
That's his status quo, and he finds it unworkable. Now, I have proposed a way out of that by supporting ASCII-surrogateescape as a specific encoding. It's not an ISO standard encoding, but the surrogateescape mechanism seems to be what the Python world has settled on for such situations. Would you support that with your parameterized-encoding text dtype?
My initial approach was to not add a new dtype but to make unicode parametrizable, which would have meant almost no cluttering of numpy's internals and keeping the api more or less consistent; that would make this a relatively simple addition of minor functionality for people that want it. But adding a completely new, partially redundant dtype for this use case may be too large a change to the api. Having two partially redundant string types may confuse users more than our current status quo of a single string type (U).
Discussing whether we want to support truncated utf8 has some merit, as it is a decision whether to give users an even larger gun to shoot themselves in the foot with. But I'd like to focus first on the 1-byte type, to add a symmetric API for python2 and python3. utf8 can always be added later, should we deem it a good idea.
What is your current proposal? A string dtype parameterized with the encoding (initially supporting the latin-1 that you desire and maybe adding utf-8 later)? Or a latin-1-specific dtype such that we will have to add a second utf-8 dtype at a later date?
My proposal is a single new parameterizable dtype. Adding multiple dtypes, one for each encoding, seems unnecessary to me given that numpy already supports parameterizable types. For example, datetime is very similar: it is basically encoded integers, with multiple encodings (= units) supported.
Okay great. What encodings are you intending to support? You seem to be pushing against supporting UTF-8.
If you're not going to support arbitrary encodings right off the bat, I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first as they seem to knock off more use cases straight away.
Please list the use cases in the context of numpy usage. hdf5 is the most obvious, but how exactly would hdf5 use a utf8 array in the actual implementation?
File reading: The user requests data from a fixed-width UTF-8 Dataset. E.g. h5py:
a = h5['/some_utf8_array'][:]
h5py looks at the Dataset's shape (with the fixed width defined in bytes) and allocates a numpy UTF-8 array with the dtype being given the same bytewidth as specified by the Dataset. h5py fills in the data quickly in bulk using libhdf5's efficient APIs for such data movement. The user now has a numpy array whose scalars come out/go in as `unicode/str` objects.
File writing: The user needs to create a string Dataset with Unicode characters. A fixed-width UTF-8 Dataset is preferred (in this case) over HDF5 variable-width Datasets because the latter are not compressible, and the strings are all reasonably close in size. The user's in-memory data may or may not be in a UTF-8 array (it might be in an object array of `unicode/str` string objects or a U-dtype array), but h5py can use numpy's conversion machinery to turn it into a numpy UTF-8 array (much like it can accept lists of floats and cast them to a float64 array). It can look at the UTF-8 array's shape and itemsize to create the corresponding Dataset, and then pass the array to libhdf5's efficient APIs for copying arrays of data into a Dataset.
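(For contrast, a rough sketch of the per-item decode dance a binding is stuck with today, assuming 20-byte fixed-width UTF-8 fields in a raw buffer `buf`:)

import numpy as np

buf = 'numpy'.encode('utf-8').ljust(20, b'\x00') * 3    # fake file buffer: 3 NUL-padded fields
raw = np.frombuffer(buf, dtype='S20')                   # zero-copy view of the bytes
text = np.array([b.decode('utf-8') for b in raw], dtype='U20')  # the loop a UTF-8 dtype would remove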
What you save by having utf8 in the numpy array is replacing a decoding and encoding step with a stripping-null-padding step. That doesn't seem very worthwhile compared to all the other overheads involved.
It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently. -- Robert Kern
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
I remember talking with a colleague about something like that. And basically an annoying thing there was that if you strip the zero bytes in a zero-padded string, some encodings (UTF16) may need one of the zero bytes to work right. (I think she got around it by weird trickery -- inverting the endianness or so, and thus putting the zero bytes first.) Maybe I will ask her if this discussion is interesting to her. Though I think it might have been something like "make everything in hdf5/something similar work" without any actual use case, I don't know.
I don't think that will be an issue for an encoding-parameterized dtype. The decoding machinery of that would have access to the full-width buffer for the item, and the encoding knows what its atomic unit is (e.g. 2 bytes for UTF-16). It's only if you have to hack around at a higher level with numpy's S arrays, which return Python byte strings that strip off the trailing NULL bytes, that you have to worry about such things. Getting a Python scalar from the numpy S array loses information in such cases. -- Robert Kern
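(Both halves of that are easy to demonstrate -- the S dtype silently drops trailing NULs on item access, and for UTF-16 a stripped NUL can be half a code unit:)

>>> np.array([b'ab\x00'], dtype='S3')[0]       # trailing NUL gone -- information lost
b'ab'
>>> raw = 'abc'.encode('utf-16-le')            # b'a\x00b\x00c\x00'
>>> raw.rstrip(b'\x00').decode('utf-16-le')    # stripping ate half of 'c'
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x63 in position 4: truncated data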
On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith <njs@pobox.com> wrote:
UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model python 3 has chosen. I wrote a much longer rant about that earlier.
So I think the easy to access, and particularly defaults, numpy string dtypes should match it.
This seems a little vague?
sorry -- that's what I get for trying to be concise...
The "character-oriented Python text model" is just that str supports O(1) indexing of characters.
not really -- I think the performance characteristics are an implementation detail (though it did influence the design, I'm sure). I'm referring to the fact that a python string appears (to the user -- also under the hood, but again, implementation detail) to be a sequence of characters, not a sequence of bytes, not a sequence of glyphs, or graphemes, or anything else. Every Python string has a length, and that length is the number of characters; if you index, you get a string of length 1, and it has one character in it, and that character matches a code point of a single value. Someone could implement a python string using utf-8 under the hood, and none of that would change (and I think micropython may have done that...). Sure, you might get two characters when you really expect a single grapheme, but it's at least a consistent oddity. (Well, not always, as some graphemes can be represented by either a single code point or two combined -- human language really sucks!)
The UTF-8 Manifesto (http://utf8everywhere.org/) makes the very good point that a character-oriented interface is not the only one that makes sense, and may not make sense at all. However:
1) Python has chosen that interface
2) It is a good interface (probably the best for computer use) if you need to choose only one
utf8everywhere is mostly arguing for utf-8 over utf-16 -- and secondarily for utf-8 everywhere as the best option for working at the C level. That's probably true. (I also think the utf-8 fans are in a bit of a fantasy world -- this would all be easier, yes, if one encoding were used for everything, all the time, but other than that, utf-8 is not a panacea -- we are still going to have encoding headaches no matter how you slice it.)
So where does numpy fit? Well, it does operate at the C level, but people work with it from python, so exposing the details of the encoding to the user should be strictly opt-in. When a numpy user wants to put a string into a numpy array, they should know how long a string they can fit -- with "length" defined how python strings define it. Using utf-8 for the default string in numpy would be like using float16 for the default float -- not a good idea!
I believe Julian said there would be no default -- you would need to specify -- but I think there does need to be one: np.array(["a string", "another string"]) needs to do something. If we make a parameterized dtype that accepts any encoding, then we could do: np.array(["a string", "another string"], dtype=np.stringtype["utf-8"]) if folks really want that. I'm afraid that that would lead to errors -- "cool, utf-8 is just like ascii, but with full Unicode support!"
But... Numpy doesn't. If you want to access individual characters inside a string inside an array, you have to pull out the scalar first, at which point the data is copied and boxed into a Python object anyway, using whatever representation the interpreter prefers.
So AFAICT it makes literally no difference to the user whether numpy's internal representation allows for fast character access.
agreed - unless someone wants to do a view that makes a N-D array for strings look like a 1-D array of characters.... Which seems odd, but there was recently a big debate on the netcdf CF conventions list about that very issue... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
I remember talking with a colleague about something like that. And basically an annoying thing there was that if you strip the zero bytes in a zero padded string, some encodings (UTF16) may need one of the zero bytes to work right.
I think it's really clear that you don't want to mess with the bytes in any way without knowing the encoding -- for UTF-16, the code unit is two bytes, so a "null" is two zero bytes in a row. So a generic "null padded" or "null terminated" is dangerous -- it would have to be "null-padded utf-8" or whatever.
Though I think it might have been something like "make everything in hdf5/something similar work"
That would be nice :-), but I suspect HDF-5 is the same as everything else -- there are files in the wild where someone jammed the wrong thing into a text array .... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern <robert.kern@gmail.com> wrote:
The maximum length of a UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases.
Isn't UTF-32 pretty compressible also? Lots of zeros in there.... Here's an example with pure-ascii Lorem Ipsum text:

In [17]: len(text)
Out[17]: 446
In [18]: len(utf8)
Out[18]: 446   # the same -- it's pure ascii
In [20]: len(utf32)
Out[20]: 1788  # four times as big -- of course
In [22]: len(bz2.compress(utf8))
Out[22]: 302   # so from 446 to 302, not that great -- probably better for longer text
# -- but are we compressing whole arrays or individual strings?
In [23]: len(bz2.compress(utf32))
Out[23]: 319   # almost as good as the compressed utf-8

And I'm guessing it would be even closer with more non-ascii characters. OK -- turns out I'm wrong -- here it is with Greek, i.e., not a lot of ascii characters:

In [29]: len(text)
Out[29]: 672
In [30]: utf8 = text.encode("utf-8")
In [31]: len(utf8)
Out[31]: 1180  # not bad, really -- still smaller than utf-16 :-)
In [33]: len(bz2.compress(utf8))
Out[33]: 495   # pretty good then -- better than 50%
In [34]: utf32 = text.encode("utf-32")
In [35]: len(utf32)
Out[35]: 2692
In [36]: len(bz2.compress(utf32))
Out[36]: 515   # still not quite as good as utf-8, but close

So: utf-8 compresses better than utf-32, but only by a little bit -- at least with bz2. But it is a lot smaller uncompressed.
The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters.
It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: http://utf8everywhere.org/#myths
It's really the only way with utf-8 -- which is why it is an impedance mismatch with python strings.
I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications.
fortunately, we don't need to agree to that to agree that:
So if we're adding any new string encodings, it needs to be one of them.
Yup -- the most important one to add. I don't think it is "The Right Way" for all applications -- but it is "The Right Way" for text interchange. And regardless of what any of us think, it is widely used.
(1) object arrays of strings. (We have these already; whether a strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.)
This is the right way to get variable-length strings -- but I'm concerned that it doesn't mesh well with numpy uses like npz files, raw dumping of array data, etc. It should not be the only way to get proper Unicode support, nor the default when you do: array(["this", "that"])
(2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All python encodings should be permitted. An additional function to truncate encoded data without mangling the encoding would be handy.
I think it's necessary -- at least when you pass in a python string...
I think it makes more sense for this to be NULL-padded than NULL-terminated but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4.
Is it, though, if you know it's UCS4? Or even if you know the size of the code unit (I think that's the term)?
This also includes the legacy UCS4 strings as a special case.
What's special about them? I think the only thing should be that they are the default.
(3) a dtype for fixed-length byte strings. This doesn't look very different from an array of dtype u8, but given we have the bytes type, accessing the data this way makes sense.
The void dtype is already there for this general purpose and mostly works, with a few niggles.
I'd never noticed that! And if I had I never would have guessed I could use it that way.
If it worked more transparently and perhaps rigorously with `bytes`, then it would be quite suitable.
Then we should fix a bit of those things -- and call it something like "bytes", please. -CHB
--
Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Wed, Apr 26, 2017 at 3:27 PM, Chris Barker <chris.barker@noaa.gov> wrote:
When a numpy user wants to put a string into a numpy array, they should know how long a string they can fit -- with "length" defined how python strings define it.
Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and myself have already given), but we seem to be talking past each other here. I am still -1 on any new string encoding support unless that includes at least UTF-8, with length indicated by the number of bytes.
On Apr 26, 2017 12:09 PM, "Robert Kern" <robert.kern@gmail.com> wrote:
On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote: [...]
I have read every mail, and it has been a large waste of time; everything has been said already many times in the last few years. Even if you memory-map string arrays, of which I have not seen a concrete use case in the mails beyond "would be nice to have" without any backing in actual code -- but I may have missed it.
Yes, we have stated that FITS files with string arrays are currently being read via memory mapping. http://docs.astropy.org/en/stable/io/fits/index.html You were even pointed to a minor HDF5 implementation that memory maps: https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683 I'm afraid that I can't share the actual code of the full variety of proprietary file formats that I've written code for, but I can assure you that I have memory-mapped many string arrays in my time, usually embedded as columns in structured arrays. It is not "nice to have"; it is "have done many times and needs better support".
Since concrete examples are often helpful in focusing discussions, here's some code for reading a lab-internal EEG file format: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py See in particular _header_dtype with its embedded string fields, and the code in _channel_names_from_header -- both of these really benefit from having a quick and easy way to talk about fixed-width strings of single-byte characters. (The history here of course is that the original tools for reading/writing this format are written in C, and they just read in sizeof(struct header) and cast to the header.)
_get_full_string in that file is also interesting: it's a nasty hack I implemented because in some cases I actually needed *fixed width* strings, not NUL-padded ones, and didn't know a better way to do it. (Yes, there's void, but I have no idea how those work. They're somehow related to buffer objects, whatever those are?) In other cases, though, that file really does want NUL padding.
Of course that file is python 2 and blissfully ignorant of unicode. Thinking about what we'd want if porting to py3:
For the "pull out this fixed width chunk of the file" problem (what _get_full_string does), I definitely don't care about unicode; this isn't text. np.void or an array of np.uint8 aren't actually too terrible, I suspect, but it'd be nice if there were a fixed-width dtype where indexing gave back a native bytes or bytearray object, or something similar like np.bytes_.
For the arrays of single-byte-encoded-NUL-padded text, the fundamental problem is just to convert between a chunk of bytes in that format and something that numpy can handle. One way to do that would be with a dtype that represented ascii-encoded-fixed-width-NUL-padded text, or any ascii-compatible encoding. But honestly I'd be just as happy with np.encode/np.decode ufuncs that converted between the existing S dtype and any kind of text array; the existing U dtype would be fine given that.
The other thing that might be annoying in practice is that when writing py2/py3 polyglot code, I can say "str" to mean "bytes on py2 and unicode on py3", but there's no dtype with similar behavior. Maybe there's no good solution and this just needs a few version-dependent convenience functions stuck in a private utility library, dunno.
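(The sizeof(struct header) cast pattern Nathaniel describes looks roughly like this; the field names and layout here are made up for illustration:)

import numpy as np

# hypothetical header: magic bytes, channel count, NUL-padded channel name
header_dtype = np.dtype([('magic', 'S4'), ('nchans', '<u2'), ('name', 'S32')])

buf = b'EEG1' + b'\x40\x00' + b'frontal'.ljust(32, b'\x00')   # 38 bytes, as if read from the file
header = np.frombuffer(buf, dtype=header_dtype)[0]            # one cast, no parsing code
name = header['name'].decode('ascii')                         # 'frontal' -- the S dtype already dropped the NULs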
What you save by having utf8 in the numpy array is replacing a decoding and encoding step with a stripping-null-padding step. That doesn't seem very worthwhile compared to all the other overheads involved.
It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently.
I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. Is it that they want numpy to enforce the length limit early, to catch errors when the array is modified instead of when they go to write it to the file? Is it that they really want an O(1) way to look at an array and know the maximum number of bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is really annoying and files that need it are rare, so they haven't had the motivation to implement it? My impression is similar to Julian's: you *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few dozen lines of code, which is nothing compared to all the other hoops these libraries are already jumping through, so if this is really the roadblock then I must be missing something. -n
On Wed, Apr 26, 2017 at 4:30 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and myself have already given), but we seem to be talking past each other here.
yeah -- I think it's not clear what the use cases we are talking about are.
I am still -1 on any new string encoding support unless that includes at least UTF-8, with length indicated by the number of bytes.
I've said multiple times that utf-8 support is key to any "exchange binary data" use case (memory mapping?) -- so yes, absolutely.
I _think_ this may be some of the source for the confusion: The name of this thread is "proposal: smaller representation of string arrays". And I got the impression, maybe mistaken, that folks were suggesting that internally encoding strings in numpy as "UTF-8, with length indicated by the number of bytes" was THE solution to the "the 'U' dtype takes up way too much memory, particularly for mostly-ascii data" problem. I do not think it is a good solution to that problem. I think a good solution to that problem is latin-1 encoding. (Bear with me here...)
But a bunch of folks have brought up that while we're messing around with string encoding, let's solve another problem:
* Exchanging unicode text at the binary level with other systems that generally don't use UCS-4.
For THAT -- utf-8 is critical.
But if I understand Julian's proposal -- he wants to create a parameterized text dtype that you can set the encoding on, and then numpy will use the encoding (and python's machinery) to encode / decode when passing to/from python strings. It seems this would support all our desires:
I'd get a latin-1 encoded type for compact representation of mostly-ascii data.
Thomas would get latin-1 for binary interchange with mostly-ascii data.
The HDF-5 folks would get utf-8 for binary interchange (if we can work out the null-padding issue).
Even folks that had weird JAVA or Windows-generated UTF-16 data files could do the binary interchange thing....
I'm now lost as to what the hang-up is. -CHB
PS: null padding is a pain; python strings seem to preserve the zeros, which is odd -- is there a unicode code point at 0x00? But you can use it to strip properly with the unicode sandwich:

In [63]: ut16 = text.encode('utf-16') + b'\x00\x00\x00\x00\x00\x00'
In [64]: ut16.decode('utf-16')
Out[64]: 'some text\x00\x00\x00'
In [65]: ut16.decode('utf-16').strip('\x00')
Out[65]: 'some text'
In [66]: ut16.decode('utf-16').strip('\x00').encode('utf-16')
Out[66]: b'\xff\xfes\x00o\x00m\x00e\x00 \x00t\x00e\x00x\x00t\x00'

-CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Apr 26, 2017 12:09 PM, "Robert Kern" <robert.kern@gmail.com> wrote:
[...] So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently.
[...] I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. [...]
https://github.com/PyTables/PyTables/issues/499
https://github.com/h5py/h5py/issues/379

-- Robert Kern
On Wed, Apr 26, 2017 at 5:02 PM, Chris Barker <chris.barker@noaa.gov> wrote:
[...] It seems this would support all our desires [...] I'm now lost as to what the hang-up is.
The proposal is for only latin-1 and UTF-32 to be supported at first, and the eventual support of UTF-8 will be constrained by specification of the width in terms of characters rather than bytes, which conflicts with the use cases of UTF-8 that have been brought forth. https://mail.python.org/pipermail/numpy-discussion/2017-April/076668.html -- Robert Kern
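(To make the conflict concrete: a width specified in characters has to reserve four bytes per character to be safe, because UTF-8 is variable-width -- plain Python:

>>> len('a' * 10), len(('a' * 10).encode('utf-8'))
(10, 10)
>>> s = '\N{GRINNING FACE}' * 10    # U+1F600 takes 4 bytes in UTF-8
>>> len(s), len(s.encode('utf-8'))
(10, 40)

So a utf-8-10 field means either 40 reserved bytes or a "length" that isn't really a length.)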
On Wed, Apr 26, 2017 at 5:17 PM, Robert Kern <robert.kern@gmail.com> wrote:
The proposal is for only latin-1 and UTF-32 to be supported at first [...]
https://mail.python.org/pipermail/numpy-discussion/2017-April/076668.html
thanks -- I had forgotten (clearly) that it was that limited. But my question now is: if there is an encoding-parameterized string dtype, then is it much more effort to have it support all the encodings in the stdlib? It seems that would solve everyone's issues.

-CHB
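(The encode/decode plumbing, at least, is already generic in the stdlib -- every registered codec comes back from codecs.lookup() with the same interface, so a dtype parameterized on codec name wouldn't need per-encoding conversion code. A sketch, with the numpy side left out:

>>> import codecs
>>> codec = codecs.lookup('utf-16-le')    # any registered codec name works the same way
>>> raw, consumed = codec.encode('some text')
>>> raw
b's\x00o\x00m\x00e\x00 \x00t\x00e\x00x\x00t\x00'
>>> codec.decode(raw)[0]
'some text'

The hard parts -- fixed-width allocation, padding, truncation errors -- would still be on numpy's side.)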
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <njs@pobox.com> wrote:
[...] I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. [...] My impression is similar to Julian's: you *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few dozen lines of code [...]
I actually agree with you. I think it's mostly a matter of convenience that h5py matched up HDF5 dtypes with numpy dtypes:

fixed width ASCII -> np.string_/bytes
variable length ASCII -> object arrays of np.string_/bytes
variable length UTF-8 -> object arrays of unicode

This was tenable in a Python 2 world, but on Python 3 it's broken and there's not an easy fix.

We absolutely could fix h5py by mapping everything to object arrays of Python unicode strings, as has been discussed (https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would be a fine but non-ideal solution, since there is currently no fixed width UTF-8 support.

For fixed width ASCII arrays, this would mean increased convenience for Python 3 users, at the price of decreased convenience for Python 2 users (arrays now contain boxed Python objects), unless we made the h5py behavior dependent on the version of Python. Hence, we're back here, waiting for better dtypes for encoded strings.

So for HDF5, I see good use cases for ASCII-with-surrogateescape (for handling ASCII arrays as strings) and UTF-8 with length equal to the number of bytes.
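(For reference, surrogateescape is what makes the ASCII case lossless even when stray high bytes turn up in nominally-ASCII data -- plain Python behavior:

>>> raw = b'ascii with a stray \xff byte'
>>> s = raw.decode('ascii', errors='surrogateescape')
>>> s
'ascii with a stray \udcff byte'
>>> s.encode('ascii', errors='surrogateescape') == raw
True
)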
2017-04-27 3:34 GMT+02:00 Stephan Hoyer <shoyer@gmail.com>:
[...] So for HDF5, I see good use cases for ASCII-with-surrogateescape (for handling ASCII arrays as strings) and UTF-8 with length equal to the number of bytes.
Well, I'll say upfront that I have not read this discussion in full, but apparently some opinions from developers of HDF5 Python packages would be welcome here, so here I go :)

As a long-time developer of one of the Python HDF5 packages (PyTables), I have always been of the opinion that plain ASCII (for byte strings) and UCS-4 (for Unicode) encoding would be the appropriate dtypes for storing large amounts of data, especially for disk storage (but also for compressed in-memory containers). My rationale is that, although UCS-4 may require way too much space, compression would reduce that to basically the space required by compressed UTF-8 (I won't go into detail, but basically this is possible by using the shuffle filter).

I remember advocating for UCS-4 adoption in the HDF5 library many years ago (2007?), but I had no success and UTF-8 was decided to be the best candidate. So the boat with HDF5 using UTF-8 sailed many years ago, and I don't think there is any going back (not even by adding UCS-4 support to it, although I continue to think that would be a good idea). So I suppose that if HDF5 is found to be an important format for NumPy users (and I think this is the case), a solution for representing Unicode characters by using UTF-8 in NumPy would be desirable (at the risk of making the implementation more complex).

-- Francesc Alted
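(A crude standard-library check on that claim -- zlib instead of HDF5's shuffle+deflate, so it understates the effect, and exact sizes will vary with the data:

import zlib
text = 'the quick brown fox jumps over the lazy dog ' * 1000
u8 = len(zlib.compress(text.encode('utf-8')))
u32 = len(zlib.compress(text.encode('utf-32-le')))
# the padding zeros in UTF-32 compress almost to nothing, so u32 lands
# in the same ballpark as u8 despite the 4x difference in raw size
)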
So while compression+ucs-4 might be OK for out-of-core representation, what about in-core? blosc+ucs-4? I don't think that works for mmap, does it?

On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted <faltet@gmail.com> wrote:
[...] although UCS-4 may require way too much space, compression would reduce that to basically the space that is required by compressed UTF-8 [...]
2017-04-27 13:27 GMT+02:00 Neal Becker <ndbecker2@gmail.com>:
So while compression+ucs-4 might be OK for out-of-core representation, what about in-core? blosc+ucs-4? I don't think that works for mmap, does it?
Correct, the real problem is mmap for an out-of-core, HDF5 representation, I presume. For in-memory, there are several compressed data containers, like:

https://github.com/alimanfoo/zarr (meant mainly for multidimensional data containers)
https://github.com/Blosc/bcolz (meant mainly for tabular data containers)

(there might be others).
-- Francesc Alted
On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted <faltet@gmail.com> wrote:
I remember advocating for UCS-4 adoption in the HDF5 library many years ago (2007?), but I had no success and UTF-8 was decided to be the best candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I don't think there is a go back
This is the key point -- we can argue all we want about the best encoding for fixed-length unicode-supporting strings (I think numpy and HDF have very similar requirements), but that is not our decision to make -- many other systems have chosen utf-8, so it's a really good idea for numpy to be able to deal with that cleanly, easily, and consistently.

I have made many anti-utf-8 points in this thread because, while we need to deal with utf-8 for interplay with other systems, I am very sure that it is not the best format for a default, naive-user-of-numpy unicode-supporting dtype. Nor is it the best encoding for a mostly-ascii compact in-memory format.

So I think numpy needs to support at least:

utf-8
latin-1
UCS-4

And it maybe should support a one-byte encoding suitable for non-european languages, and maybe utf-16 for Java and Windows compatibility, and ....

So that seems to point to "support as many encodings as possible". And python has the machinery to do so -- so why not? (I'm taking Julian's word for it that having a parameterized dtype would not have a major impact on current code.)

If we go with a string dtype parameterized by encoding, then we can pick sensible defaults, and let users use what best fits their use-cases.

As for python2 -- it is on the way out. I think we should keep the 'U' and 'S' dtypes as they are for backward compatibility and move forward with the new one(s) in a way that is optimized for py3. The new dtype would map to the py2 unicode type.

The only catch I see in that is what to do with bytes -- we should have a numpy dtype that matches the bytes model -- fixed-length bytes that map to python bytes objects. (This is almost what the void type is, yes?) But then under py2, would a bytes object (py2 string) map to numpy 'S' or the numpy bytes dtype?

@Francesc -- one more question for you: how important is it for pytables to match the numpy storage to the hdf storage byte for byte? i.e. would it be a killer if encoding / decoding happened every time at the boundary? I'm guessing yes, as this would have been solved long ago if not.

-CHB
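(For what "encoding/decoding at the boundary" could look like with pieces numpy already ships -- np.char.encode/decode exist today; whether the per-transfer cost is acceptable is exactly the question being asked:

import numpy as np
u = np.array(['some text', 'más texto'], dtype='U16')
raw = np.char.encode(u, 'utf-8')     # 'S' (bytes) array, what an HDF5 lib would write
back = np.char.decode(raw, 'utf-8')  # decoded again on the way back in
)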
2017-04-27 18:18 GMT+02:00 Chris Barker <chris.barker@noaa.gov>:
This is the key point -- [...] many other systems have chosen utf-8, so it's a really good idea for numpy to be able to deal with that cleanly, easily, and consistently.
Agreed. But it would also be a good idea to spread the word that simple UCS4 encoding in combination with compression can be a perfectly good system for storing large amounts of unicode data too.
[...] I am very sure that [utf-8] is not the best format for a default, naive-user-of-numpy unicode-supporting dtype. Nor is it the best encoding for a mostly-ascii compact in-memory format.
I resonate a lot with this feeling too :)
[...]

@Francesc -- one more question for you: how important is it for pytables to match the numpy storage to the hdf storage byte for byte? i.e. would it be a killer if encoding / decoding happened every time at the boundary? I'm guessing yes, as this would have been solved long ago if not.
The PyTables team decided some time ago that it was a waste of time and resources to maintain the internal HDF5 interface, and that it would be better to switch to h5py for the low-level I/O communication with HDF5 (btw, we just received a small NumFOCUS grant to continue the ongoing work on this; thanks guys!). This means that PyTables will be basically agnostic about this sort of encoding issue, and that the important package to look at for interfacing NumPy and HDF5 is just h5py.

-- Francesc Alted
participants (16)
- Aldcroft, Thomas
- Ambrose LI
- Anne Archibald
- Charles R Harris
- Chris Barker
- Chris Barker - NOAA Federal
- Eric Wieser
- Francesc Alted
- josef.pktd@gmail.com
- Julian Taylor
- Nathaniel Smith
- Neal Becker
- Phil Hodge
- Robert Kern
- Sebastian Berg
- Stephan Hoyer