As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. OTOH, the unicode type is stored as UCS4, which consumes a lot of space, especially for ASCII strings. This note proposes to adapt the existing 'a' type letter, currently an alias for 'S', as a new fixed-encoding dtype. Python 3.3 introduced two one-byte internal representations for unicode strings, ASCII and latin-1. ASCII has the advantage that it is a subset of UTF-8, whereas latin-1 has a few more symbols. Another possibility is to just make it a UTF-8 encoding, but I think this would involve more overhead, as Python would need to determine the maximum character size. These are just preliminary thoughts, comments are welcome. Chuck
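[A quick sketch (Python 3, current numpy) of the tradeoff described above: 'U' costs four bytes per character, while 'S' is compact but holds bytes rather than text:

    import numpy as np

    print(np.dtype('U10').itemsize)  # 40 -- UCS4: 4 bytes per character
    print(np.dtype('S10').itemsize)  # 10 -- 1 byte per character, but bytes
    print(np.array(['hello'])[0])    # hello    -- str, at 4x the memory
    print(np.array([b'hello'])[0])   # b'hello' -- compact, but not text
]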
On 12 Jul 2014 23:06, "Charles R Harris" <charlesr.harris@gmail.com> wrote:
As previous posts have pointed out, Numpy's `S` type is currently treated
as a byte string, which leads to more complicated code in python3. OTOH, the unicode type is stored as UCS4, which consumes a lot of space, especially for ASCII strings. This note proposes to adapt the existing 'a' type letter, currently an alias for 'S', as a new fixed-encoding dtype. Python 3.3 introduced two one-byte internal representations for unicode strings, ASCII and latin-1. ASCII has the advantage that it is a subset of UTF-8, whereas latin-1 has a few more symbols. Another possibility is to just make it a UTF-8 encoding, but I think this would involve more overhead, as Python would need to determine the maximum character size. These are just preliminary thoughts, comments are welcome. I feel like for most purposes, what we *really* want is a variable length string dtype (i.e., where each element can be a different length). Pandas pays quite some price in overhead to fake this right now. Adding such a thing will cause some problems regarding compatibility (what to do with array(["foo"])) and education, but I think it's worth it in the long run. A variable length string with out-of-band storage also would allow for a lot of py3.3-style storage tricks if we want them. Given that, though, I'm a little dubious about adding a third fixed length string type, since it seems like it might be a temporary patch, yet raises the prospect of having to indefinitely support *5* distinct string types (3 of which will map to py3 str)... OTOH, fixed length nul padded latin1 would be useful for various flat file reading tasks. -n
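[As a rough illustration of the overhead in question (this is the general object-array workaround, not pandas' actual internals):

    import numpy as np

    # Variable-length strings today: an object array of Python str.
    arr = np.array(['a', 'bbbb', 'cc'], dtype=object)
    print([len(s) for s in arr])  # [1, 4, 2] -- each element its own length
    # The price: every element is a separately allocated, refcounted Python
    # object, so each access goes through a pointer dereference.
]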
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
I feel like for most purposes, what we *really* want is a variable length string dtype (I.e., where each element can be a different length.).
I've been toying with the idea of creating an array type for interned strings. In many applications dealing with large arrays of variable size strings, the strings come from a relatively short set of names. Arrays of interned strings can be manipulated very efficiently because in many respects they are just like arrays of integers.
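[A minimal sketch of the interning idea (the names and layout here are illustrative, not a proposed API): store the short set of names once and back the array with integer codes:

    import numpy as np

    names = ['red', 'green', 'blue', 'green', 'red', 'green']
    table = sorted(set(names))                          # the short set of names
    codes = np.array([table.index(n) for n in names])   # integer-backed array
    print(codes)                          # [2 1 0 1 2 1]
    print(codes == table.index('green'))  # equality tests are integer compares
]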
2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndarray@mac.com>:
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
I feel like for most purposes, what we *really* want is a variable length string dtype (I.e., where each element can be a different length.).
I've been toying with the idea of creating an array type for interned strings. In many applications dealing with large arrays of variable size strings, the strings come from a relatively short set of names. Arrays of interned strings can be manipulated very efficiently because in many respects they are just like arrays of integers.
+1 I think this is why pandas is using dtype=object to load string data: in many cases short string values are used to represent categorical variables with a comparatively small cardinality of possible values for a dataset with comparatively numerous records. In that case the dtype=object is not that bad, as it just stores pointers to string objects managed by Python. It's possible to intern the strings manually at load time (I don't know if pandas or python already does it automatically in that case). The integer semantics is good for that case. Having an explicit dtype might be even better. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel
In 0.15.0 pandas will have full-fledged support for categoricals, which in effect allow you to map a smaller number of strings to integers. This is now in pandas master: http://pandas-docs.github.io/pandas-docs-travis/categorical.html Feedback welcome!
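[For reference, this is roughly what the pandas categorical looks like in use (API as of pandas 0.15):

    import pandas as pd

    s = pd.Series(['red', 'green', 'green', 'red'], dtype='category')
    print(s.cat.categories)    # the table of distinct values
    print(s.cat.codes.values)  # the integer backing, e.g. [1 0 0 1]
]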
On Jul 14, 2014, at 1:00 PM, Olivier Grisel <olivier.grisel@ensta.org> wrote:
2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndarray@mac.com>:
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
I feel like for most purposes, what we *really* want is a variable length string dtype (I.e., where each element can be a different length.).
I've been toying with the idea of creating an array type for interned strings. In many applications dealing with large arrays of variable size strings, the strings come from a relatively short set of names. Arrays of interned strings can be manipulated very efficiently because in many respects they are just like arrays of integers.
+1 I think this is why pandas is using dtype=object to load string data: in many cases short string values are used to represent categorical variables with a comparatively small cardinality of possible values for a dataset with comparatively numerous records.
In that case the dtype=object is not that bad, as it just stores pointers to string objects managed by Python. It's possible to intern the strings manually at load time (I don't know if pandas or python already does it automatically in that case). The integer semantics is good for that case. Having an explicit dtype might be even better.
-- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel
On Mon, Jul 14, 2014 at 10:00 AM, Olivier Grisel <olivier.grisel@ensta.org> wrote:
2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndarray@mac.com>:
I've been toying with the idea of creating an array type for interned strings. In many applications dealing with large arrays of variable size strings, the strings come from a relatively short set of names. Arrays of interned strings can be manipulated very efficiently because in many respects they are just like arrays of integers.
+1 I think this is why pandas is using dtype=object to load string data: in many cases short string values are used to represent categorical variables with a comparatively small cardinality of possible values for a dataset with comparatively numerous records.
Pandas has a new "categorical" type (just merged into master) which is pretty similar to interned strings: https://github.com/pydata/pandas/pull/7217 http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html Of course, it would be ideal for numpy itself to natively support categoricals and variables length strings. Best, Stephan
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
On 12 Jul 2014 23:06, "Charles R Harris" <charlesr.harris@gmail.com> wrote:
As previous posts have pointed out, Numpy's `S` type is currently
treated as a byte string, which leads to more complicated code in python3. OTOH, the unicode type is stored as UCS4, which consumes a lot of space, especially for ASCII strings. This note proposes to adapt the existing 'a' type letter, currently an alias for 'S', as a new fixed-encoding dtype. Python 3.3 introduced two one-byte internal representations for unicode strings, ASCII and latin-1. ASCII has the advantage that it is a subset of UTF-8, whereas latin-1 has a few more symbols. Another possibility is to just make it a UTF-8 encoding, but I think this would involve more overhead, as Python would need to determine the maximum character size. These are just preliminary thoughts, comments are welcome.
I feel like for most purposes, what we *really* want is a variable length string dtype (i.e., where each element can be a different length). Pandas pays quite some price in overhead to fake this right now. Adding such a thing will cause some problems regarding compatibility (what to do with array(["foo"])) and education, but I think it's worth it in the long run. A variable length string with out-of-band storage also would allow for a lot of py3.3-style storage tricks if we want them.
Given that, though, I'm a little dubious about adding a third fixed length string type, since it seems like it might be a temporary patch, yet raises the prospect of having to indefinitely support *5* distinct string types (3 of which will map to py3 str)...
OTOH, fixed length nul padded latin1 would be useful for various flat file reading tasks.
As one of the original agitators for this, let me re-iterate that what the astronomical community *really* wants is the original proposal as described by Chris Barker [1] and essentially what Charles said. We have large data archives that have ASCII string data in binary formats like FITS and HDF5. The current readers for those datasets present users with numpy S data types, which in Python 3 cannot be compared to str (unicode) literals. In many cases those datasets are large, and in my case I regularly deal with multi-GB sized bytestring arrays. Converting those to a U dtype is not practical.

This issue is the sole blocker that I personally have in beginning to move our operations code base to be Python 3 compatible, and eventually actually baselining Python 3.

A variable length string would be great, but it feels like a different (and more difficult) problem to me. If, however, this can be the solution to the problem I described, and it can be implemented in a finite time, then I'm all for it! :-)

I hate begging for features with no chance of contributing much to the implementation (lacking the necessary expertise in numpy internals). I would be happy to draft a NEP if that will help the process.

Cheers, Tom

[1]: http://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html
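[A small sketch of the Python 3 pain point described here:

    import numpy as np

    arr = np.array([b'green', b'red'])  # dtype('S5'), what binary readers return
    print(arr == 'green')    # no elementwise match against the str literal on Py3
    print(arr == b'green')   # [ True False] -- only bytes literals work
    print(np.sum(arr == b'green'))  # 1
]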
On Tue, Jul 15, 2014 at 7:40 PM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
OTOH, fixed length nul padded latin1 would be useful for various flat file reading tasks.
As one of the original agitators for this, let me re-iterate that what the astronomical community *really* wants is the original proposal as described by Chris Barker [1] and essentially what Charles said. We have large data archives that have ASCII string data in binary formats like FITS and HDF5. The current readers for those datasets present users with numpy S data types, which in Python 3 cannot be compared to str (unicode) literals. In many cases those datasets are large, and in my case I regularly deal with multi-Gb sized bytestring arrays. Converting those to a U dtype is not practical.
This feedback is *super* useful, thanks. Can you elaborate a bit more on your requirements? I get that:
- You have data that is treated as text, so it is convenient to be able to use Python strings for things like equality tests, np.sum(arr == "green"), etc.
- Your data uses only ASCII characters, and you don't want to spend more than 1 byte of memory per character.
Do you ever have 8-bit characters, and if so, what encoding do you use? Does it matter to you that the memory layout for these 1-byte-per-char strings remain fixed-width nul-padded concatenated strings (e.g., because you are mmap'ing files that have this format)? Or do FITS/HDF5 handle layout details internally, so you don't care so long as the above requirements are met? Does the fixed-width nature of numpy strings cause problems in the above setting? -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
On Thu, Jul 17, 2014 at 11:52 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 15, 2014 at 7:40 PM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
OTOH, fixed length nul padded latin1 would be useful for various flat file reading tasks.
As one of the original agitators for this, let me re-iterate that what the astronomical community *really* wants is the original proposal as described by Chris Barker [1] and essentially what Charles said. We have large data archives that have ASCII string data in binary formats like FITS and HDF5. The current readers for those datasets present users with numpy S data types, which in Python 3 cannot be compared to str (unicode) literals. In many cases those datasets are large, and in my case I regularly deal with multi-GB sized bytestring arrays. Converting those to a U dtype is not practical.
This feedback is *super* useful, thanks. Can you elaborate a bit more on your requirements?
I get that: - You have data that is treated as text, so it is convenient to be able to use Python strings for things like equality tests, np.sum(arr == "green") etc. - Your data uses only ASCII characters, and you don't want to spend more than 1 byte of memory per character.
Do you ever have 8 bit characters, and if so, what encoding do you use?
No.
Does it matter to you that the memory layout for these 1-byte-per-char strings remain fixed-width nul-padded concatenated strings (e.g., because you are mmap'ing files that have this format)? Or do FITS/HDF5 handle layout details internally and you don't care so long as the above requirements are met?
Yes, memory layout matters since mmap'ing files is a key feature in FITS.
Does the fixed-width nature of numpy strings cause problems in the above setting?
No. In particular FITS is ubiquitous as the binary data transport format in astronomy, and it specifies fixed width strings, so fixed width in numpy is a good thing in this case. More generally legacy (or even modern high-performance) Fortran / C will commonly handle string arrays as arrays of fixed width characters. In the majority of cases these codes (that I'm aware of) know nothing about unicode. This all works transparently with Python 2 + Numpy, so the goal is to have that same "it just works" capability in Python 3 with minimal code changes. Thanks, Tom
Hi Chuck,
This note proposes to adapt the currently existing 'a' type letter, currently aliased to 'S', as a new fixed encoding dtype. Python 3.3 introduced two one byte internal representations for unicode strings, ascii and latin1. Ascii has the advantage that it is a subset of UTF-8, whereas latin1 has a few more symbols. Another possibility is to just make it an UTF-8 encoding, but I think this would involve more overhead as Python would need to determine the maximum character size.
For storing data in HDF5 (PyTables or h5py), it would be somewhat cleaner if either ASCII or UTF-8 is used, as these are the only two charsets officially supported by the library. Latin-1 would require a custom read/write converter, which isn't the end of the world but would be tricky to do in a correct way, and likely somewhat slow. We'd also run into truncation issues, since certain latin-1 chars become multibyte sequences in UTF-8. I assume 'a' strings would still be null-padded? Andrew
On Mon, Jul 14, 2014 at 10:39 AM, Andrew Collette <andrew.collette@gmail.com> wrote:
For storing data in HDF5 (PyTables or h5py), it would be somewhat cleaner if either ASCII or UTF-8 are used, as these are the only two charsets officially supported by the library.
good argument for ASCII, but utf-8 is a bad idea, as there is no 1:1 correspondence between length of string in bytes and length in characters -- as numpy needs to pre-allocate a defined number of bytes for a dtype, there is a disconnect between the user and numpy as to how long a string is being stored... This isn't a problem for immutable strings, and less of a problem for HDF, as you can determine how many bytes you need before you write the file (or does HDF support var-length elements?)
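[The mismatch is easy to demonstrate with plain Python:

    s = 'naïve°'                      # 6 characters
    print(len(s))                     # 6
    print(len(s.encode('utf-8')))     # 8 -- 'ï' and '°' take 2 bytes each
    print(len(s.encode('latin-1')))   # 6 -- latin-1 stays 1 byte per char
]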
Latin-1 would require a custom read/write converter, which isn't the end of the world
"custom"? it would be an encoding operation -- which you'd need to go from utf-8 to/from unicode anyway. So you would lose the ability to have a nice 1:1 binary representation map between numpy and HDF... good argument for ASCII, I guess. Or for HDF to use latin-1 ;-) Does HDF enforce ascii-only? what does it do with the > 127 values?
would be tricky to do in a correct way, and likely somewhat slow. We'd also run into truncation issues since certain latin-1 chars become multibyte sequences in UTF-8.
that's the whole issue with UTF-8 -- it needs to be addressed somewhere, and the numpy-HDF interface seems like a smarter place to put it than the numpy-user interface!
I assume 'a' strings would still be null-padded?
yup. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On 15/07/2014 18:18, Chris Barker wrote:
(or does HDF support var-length elements?)
It does: http://www.hdfgroup.org/HDF5/doc/TechNotes/VLTypes.html
On Sat, Jul 12, 2014 at 10:17 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3.
Also, a byte string in py3 is not, in fact, the same as the py2 string type. So we have a problem -- if we want 'S' to mean what it essentially does in py2, what do we map it to in pure-python land? I propose we embrace the py3 model as fully as possible: there is text data, and there is binary data. In py3, that is 'str' and 'bytes'. So numpy should have dtypes to match these. We're a bit stuck, however, because 'S' mapped to the py2 string type, which no longer exists in py3. Sorry, not running py3 to see what 'S' does now, but I know it's a bit broken, and it may be too late to change it. But: it is certainly a common case in the scientific world to have 1-byte-per-character string data, and to care about storage size. So a 1-byte-per-character text data type may be a good idea. As for a bytes type -- do we need it, or are we fine with simply using uint8 arrays? (Or, even the most common case, converting directly to the type that is actually stored in those bytes...)
especially for ascii strings. This note proposes to adapt the currently existing 'a' type letter, currently aliased to 'S', as a new fixed encoding dtype.
+1
Python 3.3 introduced two one-byte internal representations for unicode strings, ASCII and latin-1. ASCII has the advantage that it is a subset of UTF-8, whereas latin-1 has a few more symbols.
+1 for latin-1 -- those extra symbols are handy. Also, at least with Python's stdlib encoding, you can round-trip any binary data through latin-1 -- kind of making it act like a bytes object....
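[The round-trip property referred to here, as a quick check:

    # Every byte value 0-255 maps to a latin-1 code point, so arbitrary
    # binary data survives a decode/encode round trip unchanged.
    data = bytes(range(256))
    assert data.decode('latin-1').encode('latin-1') == data
]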
Another possibility is to just make it a UTF-8 encoding, but I think this would involve more overhead, as Python would need to determine the maximum character size.
yeah -- that is a) overhead, and b) breaks the numpy fixed size dtype model. And it's trickier for numpy arrays, 'cause they are mutable -- python strings can do OK, as they don't need to accommodate potentially changing sizes of strings. On Sat, Jul 12, 2014 at 5:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
I feel like for most purposes, what we *really* want is a variable length string dtype (I.e., where each element can be a different length.).
well, that is fundamentally different from the usual numpy data model -- it would require that the array store pointers and dereference them on use -- is there anywhere else in numpy (other than the object dtype) that does that? And if we did -- would it end up having any advantage over putting strings in an object array? Or, for that matter, using a list of strings instead?
Pandas pays quite some price in overhead to fake this right now. Adding such a thing will cause some problems regarding compatibility (what to do with array(["foo"])) and education, but I think it's worth it in the long run.
i.e., do you use the fixed-length type or the variable-length type? I'm not sure it's too killer to have a default and let the user set a dtype if they want something else. -Chris
On Jul 16, 2014 11:43 AM, "Chris Barker" <chris.barker@noaa.gov> wrote:
So numpy should have dtypes to match these. We're a bit stuck, however, because 'S' mapped to the py2 string type, which no longer exists in py3. Sorry, not running py3 to see what 'S' does now, but I know it's a bit broken, and it may be too late to change it.
In py3 a 'S' dtype is converted to a python bytes object.
On Wed, Jul 16, 2014 at 6:48 AM, Todd <toddrjen@gmail.com> wrote:
On Jul 16, 2014 11:43 AM, "Chris Barker" <chris.barker@noaa.gov> wrote:
So numpy should have dtypes to match these. We're a bit stuck, however, because 'S' mapped to the py2 string type, which no longer exists in py3. Sorry, not running py3 to see what 'S' does now, but I know it's a bit broken, and it may be too late to change it.
In py3 a 'S' dtype is converted to a python bytes object.
As a slightly philosophical aside, at some point during Scipy, Nick Coghlan said that the core Python team had stopped recommending the use of `from __future__ import unicode_literals` for Python 2 / 3 compatible code. I have some experience now with writing 2 / 3 code for astropy and I came to the same conclusion. The point is that `str` is the "natural" text class that is used by default for both 2 and 3. Most scientific Py2 code is written to this model. Following this to the Py3 end, that would imply that the most natural convention for numpy S dtype in Py3 would be that it gets to Python as a utf-8 `str`, as Chuck suggested. I think the variable-length encoding issue is not such a problem if you follow the existing numpy convention of truncating overflowing strings on assignment. Using utf-8 like this would (I think) make most Py2 code that uses HDF5 and FITS ASCII string data just work out of the box on Py3, which would be super. - Tom
On Wed, Jul 16, 2014 at 3:48 AM, Todd <toddrjen@gmail.com> wrote:
On Jul 16, 2014 11:43 AM, "Chris Barker" <chris.barker@noaa.gov> wrote:
So numpy should have dtypes to match these. We're a bit stuck, however, because 'S' mapped to the py2 string type, which no longer exists in py3. Sorry, not running py3 to see what 'S' does now, but I know it's a bit broken, and it may be too late to change it.
In py3 a 'S' dtype is converted to a python bytes object.
right -- thanks. That's the source of the problems.

A bit of a higher-level view of the issues at hand. Python has three relevant data types:
- A unicode type (unicode in py2, str in py3)
- A one-byte-per-char string type (py2 string)
- A bytes type

The big problem is that py2 only has the unicode and py2string types, and py3 only has the unicode and bytes types. numpy has 'S' and 'U' types, which map naturally to the py2string and unicode types. But since py3 has no py2string type, we have a problem. If numpy were to embrace the py3 model, then 'S' should have mapped to py3's string, aka unicode. But:

1) then there would be no bytes type, which is a problem, as people do need to pass collections of bytes around. I've always figured numpy's uint8 should suffice for that, but "strings of bytes" are useful, and it seems to be awkward, or maybe impossible, to construct such a beast with the usual dtype machinery.

2) there is a need (or at least a desire) to have a compact, one-byte-per-character text type in numpy.

Thinking of it in this framework leads me to the conclusion that numpy should have three types:

1) A unicode type -- no change here.

2) A bytes type -- almost the current 'S' type.
- A bytes type would map to/from py3 bytes objects (and py2 bytes objects, which are the same as py2strings).
- One way it would differ from a py2str is that there would be no assumption of null-termination (not sure where that is now).

3) A one-byte-per-char text type -- more or less Chuck's current proposal.
- It would map to/from the py3 string -- it is text after all.
- It would be null-terminated.
- It would have a one-byte-per-char encoding: ascii, latin-1, or settable (TBA).

It would be nice if numpy had built-in encoding/decoding between the unicode type and the bytes type (tricky due to not knowing how many bytes a given string will decode to without decoding it...).

Which leaves us with the decisions:

* What does 'S' map to?
- Currently it's almost a bytes type, and maps to bytes in py3 -- so maybe keep that status quo. Except that it really doesn't act like text anymore, so the 2-to-3 transition is kind of ugly, and the name is misleading.

* What encoding to use for the one-byte-per-char text type?
- I think latin-1 is the way to go -- you could use it like ascii if you want, but if you need a few other characters they are there. And you can even store binary data in it, though that's a "bad idea" anyway.
- ascii would solve common use cases, but I see no reason to restrict folks to 127 characters -- you can use those if you like. If the binary data needs to get passed to something that really needs to be ascii-only, it could be checked at that point.
- Perhaps the best option is for client code to be able to choose an encoding -- but more code, maybe a more confusing interface? Worth it?

* Do we have a utf-8 type? I think not -- it simply does not map to both unicode and numpy's fixed-length requirement.

If all this gets done, we have some transition issues, but I think it would solve everyone's problems (though maybe not as cleanly as we'd like...). For instance, if someone needs to map numpy arrays to utf-8 data (i.e. HDF5), then they can either use the bytes type and let the user decode, or encode/decode to unicode on i/o.

-Chris
On Thu, Jul 17, 2014 at 10:05 PM, Chris Barker <chris.barker@noaa.gov> wrote:
A bit of a higher-level view of the issues at hand.
Python has three relevant data types:
A unicode type (unicode in py2, str in py3) A one-byte-per-char stringtype (py2 string) A bytes type
The big problem is that py2 only has the unicode and py2string types, and py3 only has the unicode and bytes type.
numpy has 'S' and 'U' types: which map naturally to the py2string and unicode types.
but since py3 has no py2string type, we have a problem.
If numpy were to embrace the py3 model, then 'S' should have mapped to py3's string, aka unicode.
But:
1) then there would be no bytes type, which is a problem, as people do need to pass collections of bytes around. I've always figured numpy's uint8 should suffice for that, but "strings of bytes" are useful, and it seems to be awkward, or maybe impossible, to construct such a beast with the usual dtype machinery
2) there is a need (or at least a desire) to have a compact, one-byte-per-character text type in numpy.
Thinking of it in this framework leads me to the conclusion that numpy should have three types:
This sounds pretty reasonable to me.
1) A unicode type --no change here
2) A bytes type -- almost the current 'S' type - A bytes type would map to/from py3 bytes objects (and py2 bytes objects, which are the same as py2strings) - one way it would differ from a py2str is that there would be no assumption of null-termination (not sure where that is now)
AFAICT this is *exactly* the same as the current 'S' type. What differences do you see?
3) A one-byte-per-char text type -- more or less Chuck's current proposal. - it would map to/from the py3 string -- it is text after all - it would be null-terminated
Numpy string types are never null-terminated ATM. They're null-padded, which is slightly different. When storing data in an S5, for instance, strings of length 5 have no nulls appended, strings of length 4 have 1 null appended, strings of length 3 have 2 nulls appended, etc. When reading data out of an S5, all trailing nulls are stripped. So, they may not be null terminated (if the length of the string exactly matches the length of the dtype), and the strings being stored can contain internal nulls ("foo\x00bar" is fine), but they cannot contain trailing nulls ("foo\x00" will come back as just "foo"). Do you actually care about null-termination specifically? Or did you just mean "it should work like the other ones, which I vaguely remember involves nulls"? ;-)
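[A short demonstration of these padding rules with today's numpy:

    import numpy as np

    a = np.zeros(3, dtype='S5')
    a[0] = b'hello'        # exactly 5 bytes: stored with no padding
    a[1] = b'foo'          # 3 bytes: stored as b'foo\x00\x00'
    a[2] = b'foo\x00bar'   # internal null is kept, but truncated to 5 bytes
    print(a.tobytes())     # b'hellofoo\x00\x00foo\x00b'
    print(a[1])            # b'foo' -- trailing nulls stripped on the way out
]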
- it would have a one-byte per-char encoding: ascii, latin-1 or settable (TBA)
Settable is technically very difficult until we redo the dtype machinery to allow parametrized types. -n
On Fri, Jul 18, 2014 at 3:33 AM, Nathaniel Smith <njs@pobox.com> wrote:
2) A bytes type -- almost the current 'S' type - A bytes type would map to/from py3 bytes objects (and py2 bytes objects, which are the same as py2strings) - one way it would differ from a py2str is that there would be no assumption of null-termination (not sure where that is now)
AFAICT this is *exactly* the same as the current 'S' type. What differences do you see?
as you mention it, it is the same on py3, except maybe the handling of null bytes -- you mentioned that you had to do some workarounds for that. A proper bytes type would do nothing special with null bytes.
3) A one-byte-per-char text type -- more or less Chuck's current proposal. - it would map to/from the py3 string -- it is text after all - it would be null-terminated
Numpy string types are never null-terminated ATM. They're null-padded, which is slightly different. When storing data in an S5, for instance, strings of length 5 have no nulls appended, strings of length 4 have 1 null appended, strings of length 3 have 2 nulls appended, etc. When reading data out of an S5, all trailing nulls are stripped.
So, they may not be null terminated (if the length of the string exactly matches the length of the dtype), and the strings being stored can contain internal nulls ("foo\x00bar" is fine), but they cannot contain trailing nulls ("foo\x00" will come back as just "foo").
Do you actually care about null-termination specifically? Or did you just mean "it should work like the other ones, which I vaguely remember involves nulls"? ;-)
That's pretty much what I meant, yes ;-) But the key is that when pushing one of these things to a python string, any thing after a null byte is ignored. Which is why you can't use it for arbitrary bytes.
- it would have a one-byte per-char encoding: ascii, latin-1 or settable
(TBA)
Settable is technically very difficult until we redo the dtype machinery to allow parametrized types.
indeed -- we have that a bit with Datetime -- but that's a whole other kettle of fish. -CHB
On Sa, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. OTOH, the unicode type is stored as UCS4, which consumes a lot of space, especially for ASCII strings. This note proposes to adapt the existing 'a' type letter, currently an alias for 'S', as a new fixed-encoding dtype. Python 3.3 introduced two one-byte internal representations for unicode strings, ASCII and latin-1. ASCII has the advantage that it is a subset of UTF-8, whereas latin-1 has a few more symbols. Another possibility is to just make it a UTF-8 encoding, but I think this would involve more overhead, as Python would need to determine the maximum character size. These are just preliminary thoughts, comments are welcome.
Just wondering, couldn't we have a type which actually has an (arbitrary, python-supported) encoding (and "bytes" might even just be a special case of no encoding)? Basically, storing bytes, and on access do element[i].decode(specified_encoding), and on storing do element[i] = value.encode(specified_encoding). There is always the never-ending small issue of trailing null bytes. If we want to be fully compatible, such a type would have to store the string length explicitly to support trailing null bytes. - Sebastian
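[A minimal pure-Python sketch of this idea (the class name and interface are hypothetical, just to make decode-on-access/encode-on-store concrete):

    import numpy as np

    class EncodedStringArray:
        """Fixed-width byte storage; text in, text out."""
        def __init__(self, n, width, encoding='latin-1'):
            self.data = np.zeros(n, dtype='S%d' % width)
            self.encoding = encoding

        def __getitem__(self, i):
            return self.data[i].decode(self.encoding)   # decode on access

        def __setitem__(self, i, value):
            self.data[i] = value.encode(self.encoding)  # encode on store

    a = EncodedStringArray(2, 10)
    a[0] = 'déjà vu'
    print(a[0])  # 'déjà vu' -- round-trips through the 1-byte storage
]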
On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
On Sa, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. OTOH, the unicode type is stored as UCS4, which consumes a lot of space, especially for ASCII strings. This note proposes to adapt the existing 'a' type letter, currently an alias for 'S', as a new fixed-encoding dtype. Python 3.3 introduced two one-byte internal representations for unicode strings, ASCII and latin-1. ASCII has the advantage that it is a subset of UTF-8, whereas latin-1 has a few more symbols. Another possibility is to just make it a UTF-8 encoding, but I think this would involve more overhead, as Python would need to determine the maximum character size. These are just preliminary thoughts, comments are welcome.
Just wondering, couldn't we have a type which actually has an (arbitrary, python supported) encoding (and "bytes" might even just be a special case of no encoding)? Basically storing bytes and on access do element[i].decode(specified_encoding) and on storing element[i] = value.encode(specified_encoding).
There is always the never ending small issue of trailing null bytes. If we want to be fully compatible, such a type would have to store the string length explicitly to support trailing null bytes.
UTF-8 encoding works with null bytes. That is one of the reasons it is so popular. Chuck
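[For the record, the property relied on here:

    # UTF-8 uses the zero byte only to encode U+0000 itself, so trailing
    # null padding can never collide with the middle of a character.
    for s in ('café', '日本語', 'a'):
        assert b'\x00' not in s.encode('utf-8')
]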
On Tue, Jul 15, 2014 at 9:15 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
On Sa, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. OTOH, the unicode type is stored as UCS4, which consumes a lot of space, especially for ASCII strings. This note proposes to adapt the existing 'a' type letter, currently an alias for 'S', as a new fixed-encoding dtype. Python 3.3 introduced two one-byte internal representations for unicode strings, ASCII and latin-1. ASCII has the advantage that it is a subset of UTF-8, whereas latin-1 has a few more symbols. Another possibility is to just make it a UTF-8 encoding, but I think this would involve more overhead, as Python would need to determine the maximum character size. These are just preliminary thoughts, comments are welcome.
Just wondering, couldn't we have a type which actually has an (arbitrary, python supported) encoding (and "bytes" might even just be a special case of no encoding)? Basically storing bytes and on access do element[i].decode(specified_encoding) and on storing element[i] = value.encode(specified_encoding).
There is always the never ending small issue of trailing null bytes. If we want to be fully compatible, such a type would have to store the string length explicitly to support trailing null bytes.
UTF-8 encoding works with null bytes. That is one of the reasons it is so popular.
Thinking more about it, the easiest thing to do might be to make the S dtype a UTF-8 encoding. Most of the machinery to deal with that is already in place. That change might affect some users though, and we might need to do some work to make it backwards compatible with python 2. Chuck
On Tue, Jul 15, 2014 at 4:29 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Thinking more about it, the easiest thing to do might be to make the S dtype a UTF-8 encoding. Most of the machinery to deal with that is already in place. That change might affect some users though, and we might need to do some work to make it backwards compatible with python 2.
I'd be very concerned about backcompat for existing code that uses e.g. "S128" as a dtype to mean "128 arbitrary bytes". An example is this file format reading code: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L123 The file format says there are 128 bytes there, and their interpretation depends on other fields in the header -- but in one case, for "large montages", there's an encoding where every 3 bytes represents 4 characters using an ad hoc 6-bit character set: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L133 Perhaps this case could be handled better by using a u8 subarray or something (that code also goes to some efforts to work around nul padding), and that particular project hasn't been ported to py3 yet, so technically wouldn't be affected if we changed the meaning of "S" on py3. But it does seem useful to have a "fixed length bytes" dtype even in py3, and if we declare that to be "S" then it avoids breaking any existing code depending on it... -n
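[A sketch of the u8-subarray alternative mentioned above ('u1' is numpy's one-byte unsigned integer; the field name is made up for illustration):

    import numpy as np

    raw = np.dtype(('u1', 128))         # 128 raw bytes per element
    rec = np.dtype([('header', raw)])   # e.g. as a field in a record
    buf = np.zeros(1, dtype=rec)
    buf['header'][0, :4] = (0xDE, 0xAD, 0xBE, 0xEF)
    print(buf['header'].shape)          # (1, 128) -- nothing nul-strips this
]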
On Thu, Jul 17, 2014 at 5:48 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 15, 2014 at 4:29 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Thinking more about it, the easiest thing to do might be to make the S dtype a UTF-8 encoding. Most of the machinery to deal with that is already in place. That change might affect some users though, and we might need to do some work to make it backwards compatible with python 2.
I'd be very concerned about backcompat for existing code that uses e.g. "S128" as a dtype to mean "128 arbitrary bytes". An example is this file format reading code: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L123 The file format says there are 128 bytes there, and their interpretation depends on other fields in the header -- but in one case, for "large montages", there's an encoding where every 3 bytes represents 4 characters using an ad hoc 6-bit character set: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L133
Perhaps this case could be handled better by using a u8 subarray or something (that code also goes to some efforts to work around nul padding), and that particular project hasn't been ported to py3 yet so technically wouldn't be affected if we changed the meaning of "S" on py3. But it does seem useful to have a "fixed length bytes" dtype even in py3, and if we declare that be "S" then it avoids breaking any existing code depending on it...
We break code either way. Either we break applications using S as a string type, because it now becomes bytes in python3; or we break applications treating S as a bytes type, because we change it to string in python3.

Unfortunately we missed the opportunity, when adding python3 support, to fix the exact same bytes/text boundary issue which is the main reason why Python 3 exists in the first place. We should have made porting numpy to Python 3 an intentionally(!) backward-incompatible change, just like Python itself did.

Now we are stuck with deciding which option breaks less. On the one hand, that S is bytes in python3 is somewhat established by now, and lots of workarounds are already in place. On the other hand, I think code that relies on S being bytes is in the minority, and python3 usage is probably still insignificant in this area. Unfortunately, getting actual numbers and not wild guesses on this is probably not easy.
On Fri, Jul 18, 2014 at 11:10 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On Thu, Jul 17, 2014 at 5:48 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 15, 2014 at 4:29 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Thinking more about it, the easiest thing to do might be to make the S dtype a UTF-8 encoding. Most of the machinery to deal with that is already in place. That change might affect some users though, and we might need to do some work to make it backwards compatible with python 2.
I'd be very concerned about backcompat for existing code that uses e.g. "S128" as a dtype to mean "128 arbitrary bytes". An example is this file format reading code: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L123 The file format says there are 128 bytes there, and their interpretation depends on other fields in the header -- but in one case, for "large montages", there's an encoding where every 3 bytes represents 4 characters using an ad hoc 6-bit character set: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L133
Perhaps this case could be handled better by using a u8 subarray or something (that code also goes to some efforts to work around nul padding), and that particular project hasn't been ported to py3 yet so technically wouldn't be affected if we changed the meaning of "S" on py3. But it does seem useful to have a "fixed length bytes" dtype even in py3, and if we declare that be "S" then it avoids breaking any existing code depending on it...
We break code either way. Either we break applications using S as a string type, because it now becomes bytes in python3; or we break applications treating S as a bytes type, because we change it to string in python3.

Unfortunately we missed the opportunity, when adding python3 support, to fix the exact same bytes/text boundary issue which is the main reason why Python 3 exists in the first place. We should have made porting numpy to Python 3 an intentionally(!) backward-incompatible change, just like Python itself did.

Now we are stuck with deciding which option breaks less. On the one hand, that S is bytes in python3 is somewhat established by now, and lots of workarounds are already in place.
Removing workarounds is generally a good thing (!), and often not that hard to do, keyed on the numpy version number, for libraries that need to support multiple numpy versions. It's never ideal to break compatibility, but in this case it would be fixing something that is currently not working in a useful way. - Tom
On the other hand, I think code that relies on S being bytes is in the minority, and python3 usage is probably still insignificant in this area. Unfortunately, getting actual numbers and not wild guesses on this is probably not easy.
On 18.07.2014 18:10, Julian Taylor wrote: [clip]
We break code either way. Either we break applications using S as a string type, because it now becomes bytes in python3; or we break applications treating S as a bytes type, because we change it to string in python3.

Unfortunately we missed the opportunity, when adding python3 support, to fix the exact same bytes/text boundary issue which is the main reason why Python 3 exists in the first place. We should have made porting numpy to Python 3 an intentionally(!) backward-incompatible change, just like Python itself did.

Now we are stuck with deciding which option breaks less. On the one hand, that S is bytes in python3 is somewhat established by now, and lots of workarounds are already in place. On the other hand, I think code that relies on S being bytes is in the minority, and python3 usage is probably still insignificant in this area. Unfortunately, getting actual numbers and not wild guesses on this is probably not easy.
One way to try this out is to change the meaning of 'S' and see how badly e.g. pandas or matplotlib break on py3 as a consequence. Another approach would be to add a new 1-byte unicode as a type code different from 'S'. The automatic ASCII encoding in constructor/assignment on Py3 can be deprecated, which would make 'S' a strict bytes dtype. This also is not perfect, since array(['foo']) on Py2 should for backward compatibility continue returning dtype='S'. Moreover, already existing code does not make use of it. -- Pauli Virtanen
On Fri, Jul 18, 2014 at 9:07 AM, Pauli Virtanen <pav@iki.fi> wrote:
Another approach would be to add a new 1-byte unicode
you can't do unicode in 1-byte -- so what does this mean, exactly?
This also is not perfect, since array(['foo']) on Py2 should for backward compatibility continue returning dtype='S'.
Yup. But we may be OK -- as "bytes" in py2 is the same as string anyway. But what do we do with null bytes when going from 'S' to a py2 string? -CHB
On 18.07.2014 19:33, Chris Barker wrote:
On Fri, Jul 18, 2014 at 9:07 AM, Pauli Virtanen <pav@iki.fi> wrote:
Another approach would be to add a new 1-byte unicode
you can't do unicode in 1-byte -- so what does this mean, exactly?
The first 256 unicode code points, which happen to coincide with latin1.
This also is not perfect, since array(['foo']) on Py2 should for backward compatibility continue returning dtype='S'.
yup. but we may be OK -- as "bytes" in py2 is the same as string anyway. But what do we do with null bytes? when going from 'S' to py2 string?
Changing the null chopping and preserving backward compat would require yet another new dtype. This would then mean that the 'S' dtype would become pretty much deprecated on Py3. Forcing everyone to re-do their Python 3 ports would be somewhat cleaner. However, this train may have left a couple of years ago. -- Pauli Virtanen
On Thu, Jul 17, 2014 at 8:48 AM, Nathaniel Smith <njs@pobox.com> wrote:
I'd be very concerned about backcompat for existing code that uses e.g. "S128" as a dtype to mean "128 arbitrary bytes".
yup -- 'S' matches the py2 string well, which is BOTH text and bytes. That should not change -- at least in py2.
An example is this file format reading code: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L123 The file format says there are 128 bytes there, and their interpretation depends on other fields in the header -- but in one case, for "large montages", there's an encoding where every 3 bytes represents 4 characters using an ad hoc 6-bit character set: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L133
Perhaps this case could be handled better by using a u8 subarray or something (that code also goes to some efforts to work around nul padding),
yes -- that might have been better, though I have not been successful at figuring out how to spell a dtype that works well -- hence my suggestion that we have a bytes type.
and that particular project hasn't been ported to py3 yet so technically wouldn't be affected if we changed the meaning of "S" on py3. But it does seem useful to have a "fixed length bytes" dtype even in py3, and if we declare that be "S" then it avoids breaking any existing code depending on it...
sure, but having 'S' be bytes does break other code that depends on it being a text type. Unfortunately, py2 mingled text and bytes, and numpy mirrored that, so there is no completely backward-compatible way to go forward. But for some guidance -- text is the big issue with the py2 <-> py3 migration, so folks are presumably going to expect things to change with numpy text handling as well. -Chris
On Tue, Jul 15, 2014 at 11:15 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
On Sa, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
As previous posts have pointed out, Numpy's `S` type is currently treated as a byte string, which leads to more complicated code in python3. OTOH, the unicode type is stored as UCS4, which consumes a lot of space, especially for ASCII strings. This note proposes to adapt the existing 'a' type letter, currently an alias for 'S', as a new fixed-encoding dtype. Python 3.3 introduced two one-byte internal representations for unicode strings, ASCII and latin-1. ASCII has the advantage that it is a subset of UTF-8, whereas latin-1 has a few more symbols. Another possibility is to just make it a UTF-8 encoding, but I think this would involve more overhead, as Python would need to determine the maximum character size. These are just preliminary thoughts, comments are welcome.
Just wondering, couldn't we have a type which actually has an (arbitrary, python supported) encoding (and "bytes" might even just be a special case of no encoding)? Basically storing bytes and on access do element[i].decode(specified_encoding) and on storing element[i] = value.encode(specified_encoding).
There is always the never ending small issue of trailing null bytes. If we want to be fully compatible, such a type would have to store the string length explicitly to support trailing null bytes.
UTF-8 encoding works with null bytes. That is one of the reasons it is so popular.
Thinking more about it, the easiest thing to do might be to make the S dtype a UTF-8 encoding. Most of the machinery to deal with that is already in place. That change might affect some users though, and we might need to do some work to make it backwards compatible with python 2.
Chuck
Are you saying that numpy S dtypes would be exported to Py3 as str? This would work in my use case, though it seems it would break things for the (few-ish) people using the numpy S type in Py3, since it would now look like a Python str instead of a bytes object.

One other thought is that one *might* finesse the fixed-width vs. utf-8 variable-length issue by using the exact same rules that currently apply to strings in Py2:
- When setting an array from input like a list of strings (unicode in Py3), make the array wide enough to handle the widest (in bytes) entry.
- When setting an element in an existing array, truncate any characters that don't fit in the existing width.

In the second point, note that the truncation would be by full unicode characters, not bytes (see the sketch below). This could be a point of confusion in some cases, but it's simple to implement and formally consistent with current behavior.

- Tom

p.s. Strangely enough, the mail I quoted from Chuck beginning with "Thinking about it more .." never got to my email and I only happened to have seen it in the archives.
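[A sketch of the character-wise truncation rule described above (the function name is hypothetical; this is not proposed numpy code):

    def store_utf8(value, width):
        # Drop whole characters until the UTF-8 encoding fits the field.
        encoded = value.encode('utf-8')
        while len(encoded) > width:
            value = value[:-1]                # a full character, not a byte
            encoded = value.encode('utf-8')
        return encoded.ljust(width, b'\x00')  # numpy-style null padding

    print(store_utf8('naïve', 5))  # b'na\xc3\xafv' -- 4 chars in 5 bytes
]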
On Tue, Jul 15, 2014 at 4:26 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
Just wondering, couldn't we have a type which actually has an (arbitrary, python supported) encoding (and "bytes" might even just be a special case of no encoding)?
well, then we're back to the core issue here: numpy dtypes need to be a pre-specified length; encoded bytes are an arbitrary length. This leads us to wanting to use only fixed-number-of-bytes-per-character encodings:
- ascii
- latin-1
- UCS-4 (or UTF-32... I get a bit confused about the names)
Maybe UCS-2 (NOT UTF-16) would be worth considering, for a compromise between space and fraction of unicode supported.
Basically, storing bytes, and on access do element[i].decode(specified_encoding), and on storing do element[i] = value.encode(specified_encoding).
this really doesn't seem that different from just using python strings -- is there a point to having a pointer-to-python-string type as a less generalized version of the currently possible python strings in object arrays?
There is always the never-ending small issue of trailing null bytes. If we want to be fully compatible, such a type would have to store the string length explicitly to support trailing null bytes.
are null bytes legal (as something other than a terminator) in some encodings? -Chris
Hi,
good argument for ASCII, but utf-8 is a bad idea, as there is no 1:1 correspondence between length of string in bytes and length in characters -- as numpy needs to pre-allocate a defined number of bytes for a dtype, there is a disconnect between the user and numpy as to how long a string is being stored... This isn't a problem for immutable strings, and less of a problem for HDF, as you can determine how many bytes you need before you write the file (or does HDF support var-length elements?)
There is an HDF5 variable-length type, which we currently read and write as Python str objects (using NumPy's object type). But HDF5 additionally has a fixed-storage-width UTF8 type, so we could map to a NumPy fixed-storage-width type trivially. When determining the HDF5 data type, unfortunately all you have to go on is the NumPy dtype... creating an HDF5 dataset is done separately from writing the data.
"custom"? it would be an encoding operation -- which you'd need to go from utf-8 to/from unicode anyway. So you would lose the ability to have a nice 1:1 binary representation map between numpy and HDF... good argument for ASCII, I guess. Or for HDF to use latin-1 ;-)
"Custom" in this context means a user-created HDF5 data-conversion filter, which is necessary since all data conversion is handled inside the HDF5 library. We've written several for things like the NumPy bool type, etc: https://github.com/h5py/h5py/blob/master/h5py/_conv.pyx As far as generic Unicode goes, we currently don't support the NumPy "U" dtype in h5py for similar reasons; there's no destination type in HDF5 which (1) would preserve the dtype for round-trip write/read operations and (2) doesn't risk truncation. A Latin-1 based 'a' type would have similar problems.
Does HDF enforce ascii-only? what does it do with the > 127 values?
Unfortunately/fortunately the charset is not enforced for either ASCII or UTF-8, although the HDF Group has been thinking about it.
that's the whole issue with UTF-8 -- it needs to be addressed somewhere, and the numpy-HDF interface seems like a smarter place to put it than the numpy-user interface!
I agree fixed-storage-width UTF-8 is likely too complex to use as a native NumPy type. Ideally, NumPy would support variable-length strings, in which case all these headaches would go away. But I imagine that's also somewhat complicated. :) Andrew
But HDF5 additionally has a fixed-storage-width UTF8 type, so we could map to a NumPy fixed-storage-width type trivially.
Sure -- this is why *nix uses utf-8 for filenames -- it can just be a char*. But that just punts the problem to client code. I think a UTF-8 string type does not match the numpy model well, and I don't think we should support it just because it would be easier for the HDF5 wrappers. (To be fair, there are probably other similar systems numpy wants to interface with that could use this...) It seems if you want a 1:1 binary mapping between HDF and numpy for utf strings, then a bytes type in numpy makes more sense. Numpy could/should have encode and decode methods for converting byte arrays to/from Unicode arrays (does it already?).
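[For what it's worth, numpy's np.char namespace does provide vectorized encode/decode between 'S' and 'U' arrays; a quick sketch:

    import numpy as np

    b = np.array([b'caf\xe9', b'ole'])   # latin-1 bytes, dtype 'S4'
    u = np.char.decode(b, 'latin-1')     # unicode array: ['café' 'ole']
    print(u)
    print(np.char.encode(u, 'latin-1'))  # back to the original bytes
]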
"Custom" in this context means a user-created HDF5 data-conversion filter, which is necessary since all data conversion is handled inside the HDF5 library.
As far as generic Unicode goes, we currently don't support the NumPy "U" dtype in h5py for similar reasons; there's no destination type in HDF5 which (1) would preserve the dtype for round-trip write/read operations and (2) doesn't risk truncation.
It sounds to me like HDF5 simply doesn't support Unicode. Calling an array of bytes utf-8 simply pushes the problem on to client libs. As that's where the problem lies, pyHDF may be the place to address it. If we put utf-8 in numpy, we have the truncation problem there instead -- which is exactly what I think we should avoid.
A Latin-1 based 'a' type would have similar problems.
Maybe not -- latin1 is fixed width.
Does HDF enforce ascii-only? What does it do with the > 127 values?
Unfortunately/fortunately the charset is not enforced for either ASCII
So you can dump Latin-1 into and out of the HDF 'ASCII' type -- it's essentially the old char* / py2 string. An ugly situation, but why not use it?
or UTF-8,
So ASCII and utf-8 are really the same thing, with different meta-data...
although the HDF Group has been thinking about it.
I wonder if they would consider going Latin-1 instead of ASCII -- similarly to utf-8, it's backward compatible with ASCII, but gives you a little more. I don't know that there is another one-byte encoding worth using -- it may be my English bias, but it seems Latin-1 gives us ASCII plus some extra stuff handy for science (I use the degree symbol a lot, for instance) with nothing lost.
Ideally, NumPy would support variable-length strings, in which case all these headaches would go away.
Would they? That would push the problem back to PyHDF -- which I'm arguing is where it belongs, but I didn't think you were ;-)
But I imagine that's also somewhat complicated. :)
That's a whole other kettle of fish, yes. -Chris
Hi Chris,
A Latin-1 based 'a' type would have similar problems.
Maybe not -- latin1 is fixed width.
Yes, Latin-1 is fixed width, but the issue is that when writing to a fixed-width UTF8 string in HDF5, it will expand, possibly losing data. What I would like to avoid is a situation where a user writes a 10-byte string from NumPy into a 10-byte space in an HDF5 dataset, and unexpectedly loses the last few characters because of the encoding mismatch. People are used to truncation when e.g. storing a 20-byte string in a 10-byte dataset, but it's surprising when the source and destination are the same size. :) In any case, I certainly agree NumPy shouldn't be limited by the capabilities of HDF5. There are other valuable use cases, including access to the high-bit characters Latin-1 provides. But from a strict compatibility standpoint, ASCII would be beneficial. Andrew
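(A concrete illustration of the same-width surprise, in plain Python -- the 10-character name here is hypothetical:)

name = "Ångström__"                 # 10 characters, all within Latin-1
print(len(name.encode("latin-1")))  # 10 bytes -- fits a 10-byte buffer
print(len(name.encode("utf-8")))    # 12 bytes -- overflows a "same size" field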
On Fri, Jul 18, 2014 at 9:32 AM, Andrew Collette <andrew.collette@gmail.com> wrote:
A Latin-1 based 'a' type would have similar problems.
Maybe not -- latin1 is fixed width.
Yes, Latin-1 is fixed width, but the issue is that when writing to a fixed-width UTF8 string in HDF5, it will expand, possibly losing data.
you shouldn't do that -- I was in no way suggesting that a latin-1 string get pushed to a utf-8 array by default -- that would be a bad idea. utf-8 is a unicode encoding, it should be used for unicode. As for truncation -- that's inherent in using a fixed-width array to store a non-fixed width encoding. What I would like to avoid is a situation where a user writes a
10-byte string from NumPy into a 10-byte space in an HDF5 dataset, and unexpectedly loses the last few characters because of the encoding mismatch.
Again, they shouldn't do that; they should be pushing a 10-character string into something -- and utf-8 is going to (possibly) truncate that. That's an HDF/utf-8 limitation that people are going to have to deal with. I think you're suggesting that numpy follow the HDF model, so that the numpy-HDF transition can be clean and easy. However, I think that utf-8 is an inappropriate model for numpy, and that the mess of bytes to utf-8 is pyHDF's problem, not numpy's. I.e. your issue above -- should users put a 10-character string into a numpy 10-byte utf-8 type and see it truncated? That's what I want to avoid. In any case, I certainly agree NumPy shouldn't be limited by the
capabilities of HDF5. There are other valuable use cases, including access to the high-bit characters Latin-1 provides. But from a strict compatibility standpoint, ASCII would be beneficial.
This is where I wonder about HDF's "ascii" type -- is it really ascii? Or is it that old standby one-byte-per-character-and-if-it's-ascii-we-all-know-what-it-means-but-if-it's-not-we'll-still-pass-it-around type? I.e. the old char*? In which case, you can just push a latin-1 type into and out of your HDF ascii arrays and everything will work just fine. Unless someone stores something other than latin-1 or ascii in it -- but even then, the bytes would still be preserved. This is why I see no downside to latin-1 -- if you don't use the > 127 code points, it's the same thing -- if you do, you get some extra handy characters. The only difference is that a proper ascii type would not let you store anything above 127 at all -- why restrict ourselves? And if you want utf-8 in HDF, then use a unicode array knowing that some truncation could occur, or use a byte array and do the encoding yourself, so the user knows exactly what they are doing. [It would be nice if numpy had a pure-numpy solution to encoding/decoding, though maybe it wouldn't really be any faster than going through python anyway...] -Chris
On Fri, Jul 18, 2014 at 5:54 PM, Chris Barker <chris.barker@noaa.gov> wrote:
This is why I see no downside to latin-1 -- if you don't use the > 127 code points, it's the same thing -- if you do, you get some extra handy characters. The only difference is that a proper ascii type would not let you store anything above 127 at all -- why restrict ourselves?
IMO the extra characters aren't the most compelling argument for latin1 over ascii. Latin1 gives the nice assurance that if some jerk *does* give me an "ascii" file that somewhere has some byte with the 8th bit set, then I can still load the data and fix things by hand. This is trickier if numpy just refuses to touch the data, blowing up with an exception when I try. In general it's easy to create numpy arrays containing arbitrary bitpatterns, so it's nice to have some strategy for what to do with them. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
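(A sketch of why latin-1 is safe for arbitrary bit patterns -- Python 3, stdlib only:)

raw = bytes(range(256))            # every possible byte value
s = raw.decode("latin-1")          # never raises
assert s.encode("latin-1") == raw  # lossless round trip
# by contrast, raw.decode("ascii") would raise UnicodeDecodeError at byte 128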
On Fri, Jul 18, 2014 at 10:59 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Fri, Jul 18, 2014 at 5:54 PM, Chris Barker <chris.barker@noaa.gov> wrote:
This is why I see no downside to latin-1 -- if you don't use the > 127
code
points, it's the same thing -- if you do, you get some extra handy characters. The only difference is that a proper ascii type would not let you store anything above 127 at all -- why restrict ourselves?
IMO the extra characters aren't the most compelling argument for latin1 over ascii. Latin1 gives the nice assurance that if some jerk *does* give me an "ascii" file that somewhere has some byte with the 8th bit set, then I can still load the data and fix things by hand. This is trickier if numpy just refuses to touch the data, blowing up with an exception when I try. In general it's easy to create numpy arrays containing arbitrary bitpatterns, so it's nice to have some strategy for what to do with them.
Just to throw in one more complication, there is no buffer protocol for a fixed encoding type. In Python 3 'c', 's', 'p' are all considered as bytes, in Python 2 as strings. Chuck
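(For reference -- PEP 3118 buffer format strings follow the struct module's syntax, and a quick Python 3 check shows the bytes-only behavior:)

import struct

packed = struct.pack("5s", b"hello")  # 's' accepts bytes on Python 3
try:
    struct.pack("5s", "hello")        # str is rejected...
except struct.error:
    pass                              # ...there is no "text" format code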
On Fri, Jul 18, 2014 at 9:59 AM, Nathaniel Smith <njs@pobox.com> wrote:
IMO the extra characters aren't the most compelling argument for latin1 over ascii. Latin1 gives the nice assurance that if some jerk *does* give me an "ascii" file that somewhere has some byte with the 8th bit set, then I can still load the data and fix things by hand.
Absolutely! py2's frequent barfing on the ascii encoding is really a pain. And if you aren't going to enforce ascii, then better to be clear about what those extra bits mean. -Chris
On Fri, Jul 18, 2014 at 9:59 AM, Nathaniel Smith <njs@pobox.com> wrote:
IMO the extra characters aren't the most compelling argument for latin1 over ascii. Latin1 gives the nice assurance that if some jerk *does* give me an "ascii" file that somewhere has some byte with the 8th bit set, then I can still load the data and fix things by hand.
On Fri, Jul 18, 2014 at 10:39 AM, Charles R Harris < charlesr.harris@gmail.com> wrote:
Just to throw in one more complication, there is no buffer protocol for a fixed encoding type. In Python 3 'c', 's', 'p' are all considered as bytes, in Python 2 as strings.
I suppose another option is to formally call it what has been a de facto non-standard for years: ascii-with-who-knows-what-for-the-higher-codes. I.e. ASCII, but don't barf on decoding (replace?). But you can use latin-1 the same way, so why not? -CHB
Hi Chris,
Again, they shouldn't do that, they should be pushing a 10-character string into something -- and utf-8 is going to (Possible) truncate that. That's HDF/utf-8 limitation that people are going to have to deal with. I think you're suggesting that numpy follow the HDF model, so that the numpy-HDF transition can be clean and easy. However, I think that utf-8 is an inappropriate model for numpy, and that the mess of bytes to utf-8 is pyHDF's problem, not numpy's.
The root of the issue is that HDF5 provides a limited set of fixed-storage-width string types, and a fixed-storage-width NumPy type of the same size using Latin-1 can't map to any of them without losing data. For example, if "a10" is a hypothetical 10-byte-wide NumPy dtype using Latin-1, reading/writing to an "a10" HDF5 dataset backed with 10-byte UTF-8 storage would risk truncation, even if the advertised widths are the same. There is unfortunately nothing we can do in the h5py code base to paper over this... it's a limitation of the format.
This is where I wonder about HDF's "ascii" type -- is it really ascii? Or is it that old standby one-byte-per-character-and-if-it's-ascii-we-all-know-what-it-means-but-if-it's-not-we'll-still-pass-it-around type? i.e the old char* ?
In which case, you can just push a latin-1 type into and out of your HDF ascii arrays and everything will work just fine. Unless someone stores something other than latin-1 or ascii in it -- but even then, the bytes would still be preserved.
The encoding is explicitly ASCII (H5T_ASCII, in HDF5 lingo). Anecdotally, I've heard people store other encodings in it, but (1) I'm not eager to make things worse by mis-labelling data, and (2) the HDF Group has made indications that they may start checking the encoding at conversion time. (1) is particularly important, as a major focus of h5py is compatibility with the rest of the HDF5 ecosystem. Again, I wouldn't argue that these considerations by themselves are enough of a reason for NumPy to use ASCII or UTF-8, certainly. Just that from this particular HDF5 perspective, they provide maximum compatibility and minimize the chances of accidental data loss. Andrew
On Fri, Jul 18, 2014 at 10:29 AM, Andrew Collette <andrew.collette@gmail.com> wrote:
The root of the issue is that HDF5 provides a limited set of fixed-storage-width string types, and a fixed-storage-width NumPy type of the same size using Latin-1 can't map to any of them without losing data. For example, if "a10" is a hypothetical 10-byte-wide NumPy dtype using Latin-1, reading/writing to an "a10" HDF5 dataset backed with 10-byte UTF-8 storage would risk truncation, even if the advertised widths are the same.
I do get this, yes.
There is unfortunately nothing we can do in the h5py code base to paper over this... it's a limitation of the format.
yup. Similar limitations in numpy.
This is where I wonder about HDF's "ascii" type -- is it really ascii? Or is it that old standby one-byte-per-character-and-if-it's-ascii-we-all-know-what-it-means-but-if-it's-not-we'll-still-pass-it-around type? I.e. the old char*?
In which case, you can just push a latin-1 type into and out of your HDF ascii arrays and everything will work just fine. Unless someone stores something other than latin-1 or ascii in it -- but even then, the bytes would still be preserved.
The encoding is explicitly ASCII (H5T_ASCII, in HDF5 lingo). Anecdotally, I've heard people store other encodings in it, but (1) I'm not eager to make things worse by mis-labelling data, and (2) the HDF Group has made indications that they may start checking the encoding at conversion time. (1) is particularly important, as a major focus of h5py is compatibility with the rest of the HDF5 ecosystem.
If it were me, I'd encourage the HDF group to NOT enforce ascii. Just like with the numpy 'S' type, I'm guessing there is a fair bit of code in the wild that [ab]uses the ascii type by throwing other bytes in there. In fact, this is one reason that utf-8 is so popular -- you can still use all that code that simply takes a char* and passes it around (or maybe compares it), without making any assumptions about what it means.

Just that from this particular HDF5 perspective, they provide maximum compatibility and minimize the chances of accidental data loss.
What it would do is push the problem from the HDF5<->numpy interface to the python<->numpy interface. I'm not sure that's a good trade off. -Chris
Hi Chris,
What it would do is push the problem from the HDF5<->numpy interface to the python<->numpy interface.
I'm not sure that's a good trade off.
Maybe I'm being too paranoid about the truncation issue. We already perform truncation when going from e.g. vlen to fixed-width strings in h5py... it's just the truncation behavior for same-width data that throws me. Here's a strawman for how a Latin-1 "a" type might be handled in h5py:

1. Creation from existing "a" data: use vlen strings. Doesn't preserve the dtype, but maybe that's not so important.
2. Writing from "a" data to fixed-width ASCII: copy, and replace bytes > 127 with "?" (or don't).
3. Writing from "a" data to fixed-width UTF-8: transcode and truncate (being careful not to end in the middle of a multibyte character -- a sketch of this follows below).
4. Reading from fixed-width ASCII to "a": straight copy, no inspection.
5. Reading from fixed-width UTF-8 to "a": copy, and replace non-Latin-1 chars with "?".

(The above example uses replacement rather than raising an exception, because an exception in the HDF5 conversion callback will leave the write/read half-completed.)

In any case, I can say that the lack of a text 'S' type in NumPy has been a significant pain point for h5py users on Python 3 over the years. Whatever specific encoding ends up being used, such a type can only improve the situation, and I'm firmly in favor of it. Andrew
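(A possible implementation of step 3's careful truncation -- a sketch only; the helper name utf8_truncate is made up:)

def utf8_truncate(s, nbytes):
    """Encode s as UTF-8 and trim to at most nbytes, without
    ending in the middle of a multibyte character."""
    raw = s.encode("utf-8")[:nbytes]
    # errors="ignore" silently drops a trailing partial sequence
    return raw.decode("utf-8", errors="ignore").encode("utf-8")

# asked for 8 bytes, safely returns 7 rather than half of "ö"
assert utf8_truncate("Ångström", 8) == "Ångstr".encode("utf-8")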
On Fri, Jul 18, 2014 at 12:52 PM, Andrew Collette <andrew.collette@gmail.com> wrote:
What it would do is push the problem from the HDF5<->numpy interface to the python<->numpy interface.
I'm not sure that's a good trade off.
Maybe I'm being too paranoid about the truncation issue.
Actually, I agree about the truncation issue, but it's a question of where to put it -- I'm suggesting that I don't want it at the python<->numpy interface.
Here's a strawman for how a Latin-1 "a" type might be handled in h5py:
1. Creation from existing "a" data: Use vlen strings. Doesn't preserve the dtype, but maybe that's not so important.
do vlen strings support full unicode? -- then, yes, that's good.
2. Writing from "a" data to fixed-width ASCII: Copy, and replace bytes>127 with "?" (or don't)
I'd vote for don't, unless HDF starts enforcing pure ascii. But if it does, then yes, replacement makes more sense than exceptions.

3. Writing from "a" data to fixed-width UTF-8: Transcode and truncate
(being careful not to end in the middle of a multibyte character)
yup -- buyer beware.
4. Reading from fixed-width ASCII to "a": Straight copy, no inspection
yup.
5. Reading from fixed-width UTF-8 to "a": Copy, and replace non-Latin-1 chars with "?"
Sure. What about reading from fixed-width UTF-8 to 'U' -- that seems like the natural way to go for unicode. Though it's a bit hard to know how long U needs to be -- but <= the byte length of the utf-8 buffer (each character takes at least one byte).
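(A quick check of that bound with NumPy's vectorized decode -- the 'S5' data here is hypothetical:)

import numpy as np

raw = np.array([b"caf\xc3\xa9", b"hi"], dtype="S5")  # fixed-width utf-8 bytes
u = np.char.decode(raw, "utf-8")     # NumPy picks the 'U' width automatically
assert u.dtype.itemsize // 4 <= raw.dtype.itemsize   # chars needed <= bytes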
(The above example uses replacement rather than raising an exception, because an exception in the HDF5 conversion callback will leave the write/read half-completed).
and really -- what would you do with an exception on read? give up and throw the file away?

Note that I'm also proposing a "bytes" dtype, which might make sense for grabbing utf-8 data from HDF5. Then either h5py or the user could decode to a unicode type.

In any case, I can say that the lack of a text 'S' type in NumPy has
been a significant pain point for h5py users on Python 3 over the years.
isn't the current 'S' a pretty good map to HDF ascii?

Whatever specific encoding ends up being used, such a type can
only improve the situation, and I'm firmly in favor of it.
agreed. -Chris
Hi Chris,
Actually, I agree about the truncation issue, but it's a question of where to put it -- I'm suggesting that I don't want it at the python<->numpy interface.
Yes, that's a good point. Of course, by using Latin-1 rather than UTF-8 we can't support all Unicode code points (hence the "?" replacement possible on read from HDF5).
do vlen strings support full unicode? -- then, yes, that's good.
Yes, they do. It's somewhat unfortunate to immediately cast to vlen though, since people usually have fixed-width datasets to start with for efficiency reasons...
what about reading from fixed-width UTF-8 to 'U' -- that seems like the natural way to go for unicode. Tough a bit hard to know how long U needs to be -- but <= the length of the utf-8 array (in characters).
Space concerns ("U" has a 4x space penalty for ASCII-ish data). Plus, for similar reasons to this discussion, creating "U" datasets is unsupported at the moment.
note that I'm also proposing a "bytes" dtype, which might make sense for grabbing utf-8 data from HDF-5. Then either h5py or the user could decode to a unicode type.
Sounds quite like the existing 'S' type.
In any case, I can say that the lack of an text 'S' type in NumPy has been a significant pain point for h5py users on Python 3 over the years.
isn't the current 'S' a pretty good map to hdf ascii?
Yes; in fact, right now all fixed-width strings in h5py (ASCII and UTF-8) are read/written as 'S'. The problem is that on Py3, 'S' is treated as bytes, not text, so you can't freely mix it with str. I am about to leave for the weekend... thanks for a great discussion! To conclude, it strikes me that in choosing an encoding we get to pick at most two of the following:

1. Support for all Unicode characters
2. Fixed number of characters
3. Fixed number of storage bytes

At this point, I would vote for UTF-8 in a fixed-width buffer (options 1 and 3), but I imagine as this progresses towards a NEP others will weigh in. Andrew
On Fri, Jul 18, 2014 at 3:30 PM, Andrew Collette <andrew.collette@gmail.com> wrote:
To conclude, it strikes me that in choosing an encoding we get to pick at most two of the following:
1. Support for all Unicode characters
2. Fixed number of characters
3. Fixed number of storage bytes

At this point, I would vote for UTF-8 in a fixed-width buffer (options 1 and 3), but I imagine as this progresses towards a NEP others will weigh in.
At some point I'm pretty sure we will want to support utf-8, as it looks well on its way to becoming a universal standard. Chuck
participants (15)

- Aldcroft, Thomas
- Alexander Belopolsky
- Andrew Collette
- Charles R Harris
- Chris Barker
- Chris Barker - NOAA Federal
- Jeff Reback
- Joseph Martinot-Lagarde
- Julian Taylor
- Nathaniel Smith
- Olivier Grisel
- Pauli Virtanen
- Sebastian Berg
- Stephan Hoyer
- Todd