proposal: smaller representation of string arrays
Hello,

As you probably know, numpy does not deal well with strings in Python 3. The np.string type is actually zero-terminated bytes, not a string. In Python 2 this happened to work out, as Python 2 treats bytes and strings the same way. But in Python 3 this type is pretty hard to work with, as each time you get an item from a numpy bytes array it needs decoding to obtain a string. The only string type available in Python 3 is np.unicode, which uses 4-byte utf-32 encoding and is generally deemed to use too much memory to actually see much use.

What people apparently want is a string type for Python 3 which uses less memory for the common science use case, which rarely needs more than latin1 encoding. As we have been told, we cannot change the np.string type to actually be strings, as existing programs interpret its content as bytes, despite this being very broken due to its null-terminating property (it will ignore all trailing nulls). Also, 8 years of third parties working around numpy's poor Python 3 support decisions probably make the 'return bytes' behaviour impossible to change now.

So we need a new dtype that can represent strings in numpy arrays and is smaller than the existing 4-byte utf-32. To please everyone I think we need to go with a dtype that supports multiple encodings via metadata, similar to how datetime supports multiple units. E.g.: 'U10[latin1]' is 10 characters in latin1 encoding.

Encodings we should support are:
- latin1 (1 byte): it is compatible with ascii and adds extra characters used in the western world.
- utf-32 (4 bytes): can represent every character, equivalent with np.unicode

Encodings we should maybe support:
- utf-16 with explicitly disallowed surrogate pairs (2 bytes): this covers a very large range of possible characters in a reasonably compact representation
- utf-8 (4 bytes): a variable-length encoding with a minimum size of 1 byte, but we would need to assume the worst case of 4 bytes, so it would not save anything compared to utf-32; it may, however, allow third parties to replace an encoding step with trailing-null trimming on serialization.

To actually do this we have two options, both of which break our ABI unless we resort to ugly hacks.

- Add a new dtype, e.g. npy.realstring. By not modifying an existing type, we only break programs using NPY_CHAR; the most notable case of this is f2py. It has the cosmetic disadvantage that it makes the np.unicode dtype obsolete, and is more busywork to implement.
- Modify np.unicode to have encoding metadata. This allows us to reuse all the type boilerplate, so it is more convenient to implement, and by extending an existing type instead of making one obsolete it results in a much nicer API. The big drawback is that it will explicitly break any third party that receives an array with a new encoding and assumes that the buffer of an array of type np.unicode has a character itemsize of 4 bytes. To ease this problem we would need to add APIs to numpy now to get the itemsize and encoding, so third parties can error out cleanly.

The implementation is not that big a deal; I have already created a prototype for adding latin1 metadata to np.unicode which works quite well. It is imo realistic to get this into 1.14, should we be able to make a decision on which way to implement it.

Do you have comments on how to go forward, in particular in regards to new dtype vs. modifying np.unicode?

cheers,
Julian
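[Editor's note: the problems described above are easy to reproduce with current numpy; a quick illustrative sketch, not part of the proposal:]

```python
import numpy as np

# The 'S' dtype is null-terminated bytes, so trailing nulls are lost:
a = np.array([b'ab\x00'], dtype='S3')
print(a[0])  # trailing null silently stripped -> b'ab'

# On Python 3 every element access yields bytes that need decoding:
b = np.array([b'abc'])
s = b[0].decode('ascii')
print(s)  # 'abc'

# The 'U' dtype stores utf-32, i.e. 4 bytes per character:
u = np.array(['abc'])
print(u.dtype, u.itemsize)  # 3 characters take 12 bytes
```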
On Thu, Apr 20, 2017 at 3:17 PM Julian Taylor <jtaylor.debian@googlemail.com> wrote:
To please everyone I think we need to go with a dtype that supports multiple encodings via metadata, similar to how datetime supports multiple units. E.g.: 'U10[latin1]' is 10 characters in latin1 encoding
Encodings we should support are: - latin1 (1 byte): it is compatible with ascii and adds extra characters used in the western world. - utf-32 (4 bytes): can represent every character, equivalent with np.unicode
Encodings we should maybe support: - utf-16 with explicitly disallowing surrogate pairs (2 bytes): this covers a very large range of possible characters in a reasonably compact representation - utf-8 (4 bytes): variable length encoding with a minimum size of 1 byte, but we would need to assume the worst case of 4 bytes so it would not save anything compared to utf-32, but may allow third parties to replace an encoding step with trailing null trimming on serialization.
I should say first that I've never used even non-Unicode string arrays, but is there any reason not to support all Unicode encodings that python does, with the same names and semantics? This would surely be the simplest to understand.

Also, if latin1 is going to be the only practical 8-bit encoding, maybe check with some non-Western users to make sure it's not going to wreck their lives? I'd have selected ASCII as an encoding to treat specially, if any, because Unicode already does that and the consequences are familiar. (I'm used to writing and reading French without accents because it's passed through ASCII, for example.)

Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type. Instead, for encoded Unicode, the string could be truncated so that the encoding fits. Of course this is not completely trivial for variable-length encodings, but it should be doable, and it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit.

All this said, it seems to me that the important use cases for string arrays involve interaction with existing binary formats, so people who have to deal with such data should have the final say. (My own closest approach to this is the FITS format, which is restricted by the standard to ASCII.)

Anne
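[Editor's note: truncating so the encoding fits, as suggested above, can be sketched in pure Python. The helper name is hypothetical; it assumes dropping a split trailing character via errors='ignore' is acceptable:]

```python
def truncate_to_bytes(s: str, max_bytes: int, encoding: str = 'utf-8') -> str:
    """Truncate s so its encoded form fits in max_bytes,
    without emitting a split multi-byte character."""
    encoded = s.encode(encoding)[:max_bytes]
    # errors='ignore' drops the trailing partial character left by the byte cut
    return encoded.decode(encoding, errors='ignore')

# 'é' is two bytes in utf-8, so a 2-byte budget keeps only 'h':
print(truncate_to_bytes('héllo', 2))  # 'h'
print(truncate_to_bytes('héllo', 3))  # 'hé'
```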
Julian -- thanks for taking this on. NumPy's handling of strings on Python 3 certainly needs fixing. On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:
Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type. Instead, for encoded Unicode, the string could be truncated so that the encoding fits. Of course this is not completely trivial for variable-length encodings, but it should be doable, and it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit.
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed size per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.

In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.

The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain.
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed size per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.
Ah, yes -- the nightmare of Unicode! No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units"), and an unknown number of characters.

As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points; combining characters make this even more confusing), then you would need to allocate N*4 bytes to make sure you could hold any string that long. Which would be pretty pointless -- better to use UCS-4.

So Anne's suggestion that numpy truncate as needed would make sense -- you'd specify say N characters, numpy would arbitrarily (or user-specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct). But how much to over-allocate? For English text, with an occasional scientific symbol, only a little. For, say, Japanese text, you'd need a factor of 2 maybe?

Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if:
- you need fixed-length storage
- you care about compactness

In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.
sure -- but it is clear to the user that the dtype can hold "up to this many" characters.
The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain.
I see it the other way around -- the only reason TO support utf-8 is for memory mapping with other systems that use it :-)

On the other hand, if we ARE going to support utf-8 -- maybe use it for all unicode support, rather than messing around with all the multiple encoding options.

I think a 1-byte-per-char latin-* encoded string is a good idea though -- scientific use tends to be latin-only and space constrained.

All that being said, if the truncation code were carefully written, it would mostly "just work"

-CHB

--
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
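[Editor's note: the over-allocation factors discussed above can be checked directly in Python; a rough sketch comparing character counts to utf-8 byte counts for a few scripts:]

```python
# Bytes needed per character vary widely by script under utf-8:
samples = {
    'English':  'hello world',   # pure ascii: 1 byte per character
    'French':   'héllo',         # accented latin: some 2-byte characters
    'Japanese': 'こんにちは',      # kana: 3 bytes per character
}
for name, s in samples.items():
    n_chars = len(s)
    n_bytes = len(s.encode('utf-8'))
    print(f'{name}: {n_chars} chars -> {n_bytes} utf-8 bytes')
```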
if you truncate a utf-8 bytestring, you may get invalid data
Note that in general truncating unicode codepoints is not a safe operation either, as combining characters are a thing. So I don't think this is a good argument against UTF8. Also, is silent truncation a thing that we want to allow to happen anyway? That sounds like something the user ought to be alerted to with an exception.
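[Editor's note: the combining-character point is easy to demonstrate with a small sketch:]

```python
import unicodedata

# 'café' written with a combining acute accent: 5 code points, 4 visible characters
s = 'cafe\u0301'
print(len(s))  # 5 code points

# Truncating to 4 code points silently drops the accent:
print(s[:4])   # 'cafe'

# The precomposed (NFC) form is only 4 code points:
print(len(unicodedata.normalize('NFC', s)))  # 4
```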
if you wanted to specify that a numpy element would be able to hold, say, N characters ... It simply is not the right way to handle text if [...] you need fixed-length storage
It seems to me that counting code points is pretty futile in unicode, due to combining characters. The only two meaningful things to count are:
* Graphemes, as that's what the user sees visually. These can span multiple code points.
* Bytes of encoded data, as that's the space needed to store them.

So I would argue that the approach of fixed-codepoint-length storage is itself a flawed design, and so should not be used as a constraint on numpy. Counting graphemes is hard, so that leaves the only sensible option as a byte count.

I don't foresee variable-length encodings being a problem implementation-wise - they only become one if numpy were to acquire a vectorized substring function that is intended to return a view. I think I'd be in favor of supporting all encodings, and falling back on python to handle encoding/decoding them.

On Thu, 20 Apr 2017 at 18:44 Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed size per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.
Ah, yes -- the nightmare of Unicode!
No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units"), and an unknown number of characters.
As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points, combining characters make this even more confusing) then you would need to allocate N*4 bytes to make sure you could hold any string that long. Which would be pretty pointless -- better to use UCS-4.
So Anne's suggestion that numpy truncates as needed would make sense -- you'd specify say N characters, numpy would arbitrarily (or user specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct).
But how much to over allocate? for english text, with an occasional scientific symbol, only a little. for, say, Japanese text, you'd need a factor 2 maybe?
Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if:
- you need fixed-length storage
- you care about compactness
In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.
sure -- but it is clear to the user that the dtype can hold "up to this many" characters.
The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain.
I see it the other way around -- the only reason TO support utf-8 is for memory mapping with other systems that use it :-)
On the other hand, if we ARE going to support utf-8 -- maybe use it for all unicode support, rather than messing around with all the multiple encoding options.
I think a 1-byte-per char latin-* encoded string is a good idea though -- scientific use tend to be latin only and space constrained.
All that being said, if the truncation code were carefully written, it would mostly "just work"
-CHB
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Thu, Apr 20, 2017 at 10:43 AM, Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed size per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.
Ah, yes -- the nightmare of Unicode!
No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units"), and an unknown number of characters.
Apologies for confusing the terminology! Yes, this would mean a fixed number of bytes and an unknown number of characters.
As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points, combining characters make this even more confusing) then you would need to allocate N*4 bytes to make sure you could hold any string that long. Which would be pretty pointless -- better to use UCS-4.
It's already unsafe to try to insert arbitrary-length strings into a numpy string_ or unicode_ array. When determining the dtype automatically (e.g., with np.array(list_of_strings)), the difference is that numpy would need to check the maximum encoded length instead of the character length (i.e., len(x.encode()) instead of len(x)).

I certainly would not over-allocate. If users want more space, they can explicitly choose an appropriate size. (This is a hazard of not having variable-length dtypes.) If users really want to be able to fit an arbitrary number of unicode characters and aren't concerned about memory usage, they can still use np.unicode_ -- that won't be going away.
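[Editor's note: checking the maximum encoded length instead of the character length, as described above, is a one-liner; a quick sketch:]

```python
strings = ['abc', 'héllo', 'naïve']

# Character-based sizing (what numpy's U dtype does today):
max_chars = max(len(s) for s in strings)                   # 5

# Byte-based sizing for a hypothetical fixed-byte-width utf-8 dtype:
max_bytes = max(len(s.encode('utf-8')) for s in strings)   # 6

print(max_chars, max_bytes)
```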
So Anne's suggestion that numpy truncates as needed would make sense -- you'd specify say N characters, numpy would arbitrarily (or user specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct).
NumPy already does this sort of silent truncation with longer strings inserted into shorter string dtypes. The difference here would indeed be the need to check the number of bytes represented by the string instead of the number of characters. But I don't think this is useful behavior to bring over to a new dtype. We should error instead of silently truncating. This is certainly easier than trying to figure out when we would be splitting a character.
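[Editor's note: the silent truncation referred to above is current numpy behavior; a quick demonstration:]

```python
import numpy as np

# numpy today silently truncates on construction with a fixed-size string dtype:
a = np.array(['hello'], dtype='U3')
print(a[0])   # 'hel' -- no error, no warning

# ...and likewise on assignment into an existing array:
b = np.zeros(1, dtype='U3')
b[0] = 'world'
print(b[0])   # 'wor'
```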
But how much to over allocate? for english text, with an occasional scientific symbol, only a little. for, say, Japanese text, you'd need a factor 2 maybe?
Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if:
- you need fixed-length storage
- you care about compactness
In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.
sure -- but it is clear to the user that the dtype can hold "up to this many" characters.
As Yu Feng points out in this GitHub comment, non-latin language speakers are already aware of the difference between string length and bytes length: https://github.com/numpy/numpy/pull/8942#issuecomment-294409192

Making an API based on code units instead of code points really seems like the saner way to handle unicode strings. I agree with this section of the DyND design docs for its string type, which notes precedent from Julia and Go: https://github.com/libdynd/libdynd/blob/master/devdocs/string-design.md#code...

I think a 1-byte-per-char latin-* encoded string is a good idea though -- scientific use tends to be latin only and space constrained.
I think scientific users tend to be ASCII-only, so UTF-8 would also work transparently :).
On Thu, 20 Apr 2017 10:26:13 -0700 Stephan Hoyer <shoyer@gmail.com> wrote:
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed size per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.
In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.
The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain.
I think you want at least: ascii, utf8, ucs2 (aka utf16 without surrogates), utf32. That is, 3 common fixed width encodings and one variable width encoding. Regards Antoine.
On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:
Is there any reason not to support all Unicode encodings that python does, with the same names and semantics? This would surely be the simplest to understand.
I think it should support all fixed-length encodings, but not the non-fixed length ones -- they just don't fit well into the numpy data model.
Also, if latin1 is going to be the only practical 8-bit encoding, maybe check with some non-Western users to make sure it's not going to wreck their lives? I'd have selected ASCII as an encoding to treat specially, if any, because Unicode already does that and the consequences are familiar. (I'm used to writing and reading French without accents because it's passed through ASCII, for example.)
latin-1 (or latin-9) only makes things better than ASCII -- it buys most of the accented characters for European languages and some symbols that are nice to have (I use the degree symbol a lot...). And it is ASCII compatible -- so there is NO reason to choose ASCII over Latin-*.

Which does no good for non-latin languages -- so we need to hear from the community -- is there a substantial demand for a non-latin one-byte-per-character encoding?
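[Editor's note: the round-trip property of latin-1 mentioned elsewhere in this thread is real -- latin-1 maps every byte value 0-255 to a code point, so arbitrary bytes survive a decode/encode cycle; a quick check:]

```python
# Every possible byte value decodes under latin-1 and encodes back unchanged:
data = bytes(range(256))
roundtrip = data.decode('latin-1').encode('latin-1')
print(roundtrip == data)  # True

# By contrast, ascii rejects anything above 0x7F:
try:
    data.decode('ascii')
except UnicodeDecodeError as e:
    print('ascii cannot decode byte at offset', e.start)
```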
Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type.
we could do that, yes, but an improperly truncated "string" becomes invalid -- just seems like a recipe for bugs that won't be found in testing. Memory is cheap, compression is fast -- we really shouldn't get hung up on this!

Note: if you are storing a LOT of text (in which case I have no idea why you would use numpy anyway), then the memory size might matter, but then semi-arbitrary truncation would probably matter, too. I expect most text storage in numpy arrays is things like names of datasets, ids, etc. -- not massive amounts of text -- so storage space really isn't critical. But having an id or something unexpectedly truncated could be bad.

I think practical experience has shown us that people do not handle "mostly fixed length but once in a while not" text well -- see the nightmare of UTF-16 on Windows. Granted, utf-8 is multi-byte far more often, so errors are far more likely to be found in tests (why would you use utf-8 if all your data are in ascii???). But still -- why invite hard-to-test-for errors?

Final point -- as Julian suggests, one reason to support utf-8 is for interoperability with other systems -- but that makes errors more of an issue -- if it doesn't pass through the numpy truncation machinery, invalid data could easily get put in a numpy array.

-CHB

it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit.
ouch! that perception is the route to way too many errors! It is by no means almost 8-bit, unless your data are almost ascii -- in which case, use latin-1 for pity's sake! This highlights my point though -- if we support UTF-8, people WILL use it, and only test it with mostly-ascii text, and not find the bugs that will crop up later.

All this said, it seems to me that the important use cases for string arrays involve interaction with existing binary formats, so people who have to deal with such data should have the final say. (My own closest approach to this is the FITS format, which is restricted by the standard to ASCII.)
yup -- not sure we'll get much guidance here though -- netCDF does not solve this problem well, either. But if you are pulling, say, a utf-8 encoded string out of a netCDF file -- it's probably better to pull it out as bytes and pass it through the python decoding/encoding machinery than pasting the bytes straight to a numpy array and hoping that the encoding and truncation are correct.

-CHB
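[Editor's note: pulling encoded bytes into a bytes array and decoding explicitly, as suggested above, looks something like the following sketch; the file contents are faked inline for illustration:]

```python
import numpy as np

# Suppose a file hands us utf-8 encoded bytes (here faked inline):
raw = np.array([b'caf\xc3\xa9', b'abc'], dtype='S8')

# Decode explicitly on the way out, through Python's codec machinery:
decoded = [item.decode('utf-8') for item in raw]
print(decoded)  # ['café', 'abc']
```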
I'm no unicode expert, but can't we truncate unicode strings so that only valid characters are included? On Thu, Apr 20, 2017 at 1:32 PM Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:
Is there any reason not to support all Unicode encodings that python does, with the same names and semantics? This would surely be the simplest to understand.
I think it should support all fixed-length encodings, but not the non-fixed length ones -- they just don't fit well into the numpy data model.
Also, if latin1 is going to be the only practical 8-bit encoding, maybe check with some non-Western users to make sure it's not going to wreck their lives? I'd have selected ASCII as an encoding to treat specially, if any, because Unicode already does that and the consequences are familiar. (I'm used to writing and reading French without accents because it's passed through ASCII, for example.)
latin-1 (or latin-9) only makes things better than ASCII -- it buys most of the accented characters for European languages and some symbols that are nice to have (I use the degree symbol a lot...). And it is ASCII compatible -- so there is NO reason to choose ASCII over Latin-*
Which does no good for non-latin languages -- so we need to hear from the community -- is there a substantial demand for a non-latin one-byte per character encoding?
Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type.
we could do that, yes, but an improperly truncated "string" becomes invalid -- just seems like a recipe for bugs that won't be found in testing.
memory is cheap, compressing is fast -- we really shouldn't get hung up on this!
Note: if you are storing a LOT of text (which I have no idea why you would use numpy anyway), then the memory size might matter, but then semi-arbitrary truncation would probably matter, too.
I expect most text storage in numpy arrays is things like names of datasets, ids, etc, etc -- not massive amounts of text -- so storage space really isn't critical. but having an id or something unexpectedly truncated could be bad.
I think practical experience has shown us that people do not handle "mostly fixed length but once in a while not" text well -- see the nightmare of UTF-16 on Windows. Granted, utf-8 is multi-byte far more often, so errors are far more likely to be found in tests (why would you use utf-8 if all your data are in ascii???). But still -- why invite hard-to-test-for errors?
Final point -- as Julian suggests, one reason to support utf-8 is for interoperability with other systems -- but that makes errors more of an issue -- if it doesn't pass through the numpy truncation machinery, invalid data could easily get put in a numpy array.
-CHB
it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit.
ouch! that perception is the route to way too many errors! it is by no means almost 8-bit, unless your data are almost ascii -- in which case, use latin-1 for pity's sake!
This highlights my point though -- if we support UTF-8, people WILL use it, and only test it with mostly-ascii text, and not find the bugs that will crop up later.
All this said, it seems to me that the important use cases for string arrays involve interaction with existing binary formats, so people who have to deal with such data should have the final say. (My own closest approach to this is the FITS format, which is restricted by the standard to ASCII.)
yup -- not sure we'll get much guidance here though -- netCDF does not solve this problem well, either.
But if you are pulling, say, a utf-8 encoded string out of a netcdf file -- it's probably better to pull it out as bytes and pass it through the python decoding/encoding machinery than pasting the bytes straight to a numpy array and hope that the encoding and truncation are correct.
-CHB
On Thu, Apr 20, 2017 at 10:36 AM, Neal Becker <ndbecker2@gmail.com> wrote:
I'm no unicode expert, but can't we truncate unicode strings so that only valid characters are included?
sure -- it's just a bit fiddly -- and you need to make sure that everything gets passed through the proper mechanism. numpy is all about folks using other code to mess with the bytes in a numpy array, so we can't expect that all numpy string arrays will have been created with numpy code. Does python's string have a truncated-encode option? i.e. you don't want to encode to utf-8 and then just chop it off.

-CHB
Thanks so much for reviving this conversation -- we really do need to address this. My thoughts:

What people apparently want is a string type for Python3 which uses less memory for the common science use case which rarely needs more than latin1 encoding.
Yes -- I think there is a real demand for that. https://en.wikipedia.org/wiki/ISO/IEC_8859-15

To please everyone I think we need to go with a dtype that supports multiple encodings via metadata, similar to how datetime supports multiple units. E.g.: 'U10[latin1]' is 10 characters in latin1 encoding
I wonder if we really need that -- as you say, there is real demand for a compact string type, but for many use cases, 1 byte per character is enough. So to keep things really simple, I think a single 1-byte-per-char encoding would meet most people's needs.

What should that encoding be? latin-1 is obvious (and has the very nice property of being able to round-trip arbitrary bytes -- at least with Python's implementation), and scientific data sets tend to use the latin alphabet (with its ascii roots and all). But there is now latin-9: https://en.wikipedia.org/wiki/ISO/IEC_8859-15 Maybe a better option?

Encodings we should support are: - latin1 (1 byte): it is compatible with ascii and adds extra characters used in the western world. - utf-32 (4 bytes): can represent every character, equivalent with np.unicode
IIUC, datetime64 is, well, always 64 bits. So it may be better to have a given dtype always be the same bit width, in which case the utf-32 dtype would be a different dtype. Which also keeps it really simple: we have a latin-* dtype and a full-on unicode dtype -- that's it.

Encodings we should maybe support: - utf-16 with explicitly disallowing surrogate pairs (2 bytes): this covers a very large range of possible characters in a reasonably compact representation
I think UTF-16 is, very simply, the worst of both worlds. If we want a two-byte character set, then it should be UCS-2 -- i.e. explicitly rejecting any code point that takes more than two bytes to represent (or maybe that's what you mean by explicitly disallowing surrogate pairs). In any case, it should certainly give you an encoding error if you try to pass in a unicode character that cannot fit into two bytes. So: is there actually a demand for this? If so, then I think it should be a separate 2-byte string type, with the encoding always the same.
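[Editor's note: explicitly rejecting code points beyond the BMP, as described above, can be sketched in Python. The helper name is hypothetical; utf-16-le would otherwise silently emit surrogate pairs for astral characters:]

```python
def encode_ucs2(s: str) -> bytes:
    """Encode to exactly 2 bytes per character, rejecting anything outside the BMP."""
    for ch in s:
        if ord(ch) > 0xFFFF:
            # utf-16 would need a surrogate pair here -- not representable in UCS-2
            raise ValueError(f'character {ch!r} does not fit in 2 bytes')
    return s.encode('utf-16-le')

print(encode_ucs2('abc'))  # b'a\x00b\x00c\x00'
```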
- utf-8 (4 bytes): variable length encoding with minimum size of 1 bytes, but we would need to assume the worst case of 4 bytes so it would not save anything compared to utf-32 but may allow third parties replace an encoding step with trailing null trimming on serialization.
yeach -- utf-8 is great for interchange and streaming data, but not for internal storage, particularly with numpy's every-item-has-the-same-number-of-bytes requirement. So if someone wants to work with utf-8 they can store it in a byte array, and encode and decode as they pass it to/from python. That's going to have to happen anyway, even if under the hood. And it's risky business -- if you truncate a utf-8 bytestring, you may get invalid data -- it really does not belong in numpy.
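The truncation hazard described above can be demonstrated in plain Python: cutting a UTF-8 byte string at a fixed byte width can split a multi-byte character, leaving bytes that no longer decode.

```python
# Truncating UTF-8 at a byte boundary can split a multi-byte character.
encoded = 'café'.encode('utf-8')   # b'caf\xc3\xa9': 5 bytes for 4 characters
truncated = encoded[:4]            # splits the 2-byte encoding of 'é'
try:
    truncated.decode('utf-8')
except UnicodeDecodeError:
    print('truncated UTF-8 is no longer valid data')
```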
- Add a new dtype, e.g. npy.realstring
I think that's the way to go. Backwards compatibility is really key. Though could we make the existing string dtype an always-latin-1 type without breaking too much? Or maybe deprecate it and get there in the future? It has the cosmetic disadvantage that it makes the np.unicode dtype
obsolete and is more busywork to implement.
I think the np.unicode type should remain as the 4-bytes-per-char encoding. But that only makes sense if you follow my idea that we don't have a variable-number-of-bytes-per-char dtype. So my proposal is:
- Create a new one-byte-per-char dtype that is always latin-9 encoded.
- In python3 it would map to a string (i.e. unicode).
- Keep the 4-byte-per-char unicode string type.
Optionally (if there is really demand):
- Create a new two-byte-per-char dtype that is always UCS-2 encoded.
Is there any way to leverage Python3's nifty string type? I'm thinking not -- at least not for numpy arrays that can play well with C code, etc. All that being said, an encoding-specified string dtype would be nice too -- I just think it's more complex than it needs to be. Numpy is not the tool for text processing... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
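The latin-1 round-trip property mentioned above can be verified in plain Python: latin-1 assigns a code point to every byte value 0-255, so arbitrary bytes survive a decode/encode cycle unchanged.

```python
# latin-1 maps all 256 byte values to code points, so any byte string
# round-trips through decode/encode without loss.
raw = bytes(range(256))
roundtripped = raw.decode('latin-1').encode('latin-1')
assert roundtripped == raw
print('latin-1 round-trips all 256 byte values')
```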
I probably should have formulated my goal with the proposal a bit better: I am not very interested in a repetition of the which-encoding-to-use debate. In the end what will be done allows any encoding via a dtype with metadata, like datetime. This allows any codec (including truncated utf8) to be added easily (if python supports it) and allows sidestepping the debate. My main concern is whether it should be a new dtype or modifying the unicode dtype, though the backward compatibility argument is strongly in favour of adding a new dtype that makes the np.unicode type redundant. On 20.04.2017 15:15, Julian Taylor wrote:
[...]
On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor <jtaylor.debian@googlemail.com> wrote:
[...]
Creating a new dtype to handle encoded unicode, with the encoding specified in the dtype, sounds perfectly reasonable to me. Changing the behaviour of the existing unicode dtype seems like it's going to lead to massive headaches unless exactly nobody uses it. The only downside to a new type is having to find an obvious name that isn't already in use. (And having to actively maintain/deprecate the old one.) Anne
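For reference, the existing parametrized-dtype precedent the proposal points to (datetime64 with unit metadata) behaves like this; the encoding metadata of a hypothetical 'U10[latin1]' would follow the same shape.

```python
# datetime64 is one dtype parametrized by a unit carried in its
# metadata: same kind and storage width, distinct dtypes per unit.
import numpy as np

ms = np.dtype('datetime64[ms]')
sec = np.dtype('datetime64[s]')
assert ms != sec                         # same kind, different unit metadata
assert ms.kind == sec.kind == 'M'
assert ms.itemsize == sec.itemsize == 8  # storage width is unchanged
print(ms, sec)
```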
Perhaps `np.encoded_str[encoding]` as the name for the new type, if we decide a new type is necessary? Am I right in thinking that the general problem here is that it's very easy to discard metadata when working with dtypes, and that by adding metadata to `unicode_`, we risk existing code carelessly dropping it? Is this a problem in both C and python, or just C? If that's the case, can we end up with a compromise where being careless just causes old code to promote to ucs32? On Thu, 20 Apr 2017 at 20:09 Anne Archibald <peridot.faceted@gmail.com> wrote:
[...]
Anne _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On 20.04.2017 20:59, Anne Archibald wrote:
On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor <jtaylor.debian@googlemail.com <mailto:jtaylor.debian@googlemail.com>> wrote:
[...]
Creating a new dtype to handle encoded unicode, with the encoding specified in the dtype, sounds perfectly reasonable to me. Changing the behaviour of the existing unicode dtype seems like it's going to lead to massive headaches unless exactly nobody uses it. The only downside to a new type is having to find an obvious name that isn't already in use. (And having to actively maintain/deprecate the old one.)
Anne
We wouldn't really be changing the behaviour of the unicode dtype; only programs accessing the data buffer directly and trying to decode would need to be changed. I assume this can happen for programs that do serialization + re-encoding of numpy string arrays at the C level (at the python level you would be fine). These programs would be broken, but only when they actually receive a string array that does not have the default utf32 encoding. I really don't like that a fully new dtype means adding more junk and extra code paths to numpy. But it is probably too big a compatibility break to accept just to keep our code clean.
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < jtaylor.debian@googlemail.com> wrote:
Do you have comments on how to go forward, in particular in regards to new dtype vs modify np.unicode?
Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases.

FWIW, if I need to work with in-memory arrays of strings in Python code, I'm going to use dtype=object a la pandas. It has almost no arbitrary constraints, and I can rely on Python's unicode facilities freely. There may be some cases where it's a little less memory-efficient (e.g. representing a column of enumerated single-character values like 'M'/'F'), but that's never prevented me from doing anything (compare to the uniform-length restrictions, which *have* prevented me from doing things).

So what's left? Being able to memory-map files that have string data conveniently laid out according to numpy assumptions (e.g. FITS). Being able to work with C/C++/Fortran APIs that have arrays of strings laid out according to numpy assumptions (e.g. HDF5). I think it would behoove us to canvass the needs of these formats and APIs before making any more assumptions.

For example, to my understanding, FITS files more or less follow numpy assumptions for their string columns (i.e. uniform-length), but enforce 7-bit-clean ASCII and pad with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string.

I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings.

We should look at some of the newer formats and APIs, like Parquet and Arrow, and also consider the cross-language APIs with Julia and R.

If I had to jump ahead and propose new dtypes, I might suggest this:

* For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs.

* Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name.

* Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied.

* Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays, handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.).

-- Robert Kern
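The contrast drawn above between dtype=object and the uniform-length restriction can be shown directly: dtype=object keeps whole Python strings, while the fixed-width unicode dtype silently truncates on assignment.

```python
# dtype=object has no length constraint; fixed-width 'U' dtypes do.
import numpy as np

obj = np.array(['M', 'F'], dtype=object)
obj[1] = 'a much longer value'       # no constraint: stays intact
assert obj[1] == 'a much longer value'

u = np.array(['hello', 'hi'])        # inferred fixed-width dtype, 5 chars
u[1] = 'a much longer value'         # silently truncated to 5 characters
assert u[1] == 'a muc'
print(u.dtype, list(u))
```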
On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern <robert.kern@gmail.com> wrote:
I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings.
HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and variable length versions: https://github.com/PyTables/PyTables/issues/499 https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html "Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters.
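The bytes-versus-characters distinction matters because one UTF-8 character can occupy 1 to 4 bytes, so a fixed byte budget is not a fixed character count:

```python
# Character count and UTF-8 byte count differ for non-ASCII text.
s = 'café'
assert len(s) == 4                   # four characters...
assert len(s.encode('utf-8')) == 5   # ...but five bytes: 'é' takes two
print(len(s), 'characters,', len(s.encode('utf-8')), 'bytes')
```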
On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
[...]
HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and variable length versions: https://github.com/PyTables/PyTables/issues/499 https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
"Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters.
Ah, okay, I was interpolating from a quick perusal of the h5py docs, which of course are also constrained by numpy's current set of dtypes. The NULL-terminated ASCII works well enough with np.string's semantics. -- Robert Kern
I suggest a new data type 'text[encoding]', 'T'. 1. text can be cast to python strings via decoding. 2. Conceptually, casting to python bytes first casts to a string, then calls encode(); the encoding in the metadata is used by default, but a different encoding can be given to override it. I slightly favour 'T16' as a fixed-size text record backed by 16 bytes. This way over-allocation is forcefully delegated to the user, simplifying the numpy array. Yu On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern <robert.kern@gmail.com> wrote:
[...]
I suggest a new data type 'text[encoding]', 'T'.
I like the suggestion very much (it is even in between S and U!). The utf-8 manifesto linked to above convinced me that the number that should follow is the number of bytes, which is nicely consistent with its use in all numerical dtypes. Anyway, more specifically on Julian's question: it seems to me one has little choice but to make a new dtype (and OK if that makes unicode obsolete). I think what exact encodings to support is a separate question. -- Marten
On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern <robert.kern@gmail.com> wrote:
[...]
Ah, okay, I was interpolating from a quick perusal of the h5py docs, which of course are also constrained by numpy's current set of dtypes. The NULL-terminated ASCII works well enough with np.string's semantics.
Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should correspond to a string type, not np.string_ (which is really bytes).
On Thu, Apr 20, 2017 at 12:51 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
[...]
Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should correspond to a string type, not np.string_ (which is really bytes).
"... well enough with np.string's semantics [that h5py actually used it to pass data in and out; whether that array is fit for purpose beyond that, I won't comment]." :-) -- Robert Kern
On Thu, Apr 20, 2017 at 8:55 PM Robert Kern <robert.kern@gmail.com> wrote:
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < jtaylor.debian@googlemail.com> wrote:
Do you have comments on how to go forward, in particular in regards to new dtype vs modify np.unicode?
Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases.
+1
FWIW, if I need to work with in-memory arrays of strings in Python code, I'm going to use dtype=object a la pandas. It has almost no arbitrary constraints, and I can rely on Python's unicode facilities freely. There may be some cases where it's a little less memory-efficient (e.g. representing a column of enumerated single-character values like 'M'/'F'), but that's never prevented me from doing anything (compare to the uniform-length restrictions, which *have* prevented me from doing things).
So what's left? Being able to memory-map to files that have string data conveniently laid out according to numpy assumptions (e.g. FITS). Being able to work with C/C++/Fortran APIs that have arrays of strings laid out according to numpy assumptions (e.g. HDF5). I think it would behoove us to canvass the needs of these formats and APIs before making any more assumptions.
For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string.
Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. [...]
If I had to jump ahead and propose new dtypes, I might suggest this:
* For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs.
* Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name.
* Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied.
How would this differ from a numpy array of bytes with one more dimension?
* Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.).
I think there may also be a niche for fixed-byte-size null-terminated strings of uniform encoding, that do decoding and encoding automatically. The encoding would naturally be attached to the dtype, and they would handle too-long strings by either truncating to a valid encoding or simply raising an exception. As with the current fixed-length strings, they'd mostly be for communication with other code, so the necessity depends on whether such other codes exist at all. Databases, perhaps? Custom hunks of C that don't want to deal with variable-length packing of data? Actually this last seems plausible - if I want to pass a great wodge of data, including Unicode strings, to a C program, writing out a numpy array seems maybe the easiest. Anne
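The "truncating to a valid encoding" option above can be sketched in a few lines. `truncate_utf8` is an illustrative helper, not an existing API: it cuts at the byte budget, then drops any incomplete trailing multi-byte sequence so the result is still valid UTF-8.

```python
# Hypothetical helper: truncate a string to at most nbytes of UTF-8
# without ever splitting a multi-byte character.
def truncate_utf8(s: str, nbytes: int) -> bytes:
    clipped = s.encode('utf-8')[:nbytes]
    # errors='ignore' discards a partial character left at the end
    return clipped.decode('utf-8', errors='ignore').encode('utf-8')

assert truncate_utf8('café', 4) == b'caf'         # a naive cut would split 'é'
assert truncate_utf8('café', 5) == b'caf\xc3\xa9'  # fits whole: unchanged
```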
On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald <peridot.faceted@gmail.com> wrote:

Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped.

Never mind, then. :-)

How would this differ from a numpy array of bytes with one more dimension?

The scalar in the implementation being the scalar in the use case, immutability of the scalar, directly working with b'' strings in and out (and thus working with the Python codecs easily).

I think there may also be a niche for fixed-byte-size null-terminated strings of uniform encoding, that do decoding and encoding automatically. [...] Actually this last seems plausible - if I want to pass a great wodge of data, including Unicode strings, to a C program, writing out a numpy array seems maybe the easiest.

HDF5 seems to support this, but only for ASCII and UTF8, not a large list of encodings.

-- Robert Kern
On 04/20/2017 03:17 PM, Anne Archibald wrote:
Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped.
FITS BINTABLE extensions can have columns containing strings, and in that case the values are NULL terminated, except that if the string fills the field (i.e. there's no room for a NULL), the NULL will not be written. Phil
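The padding rule Phil describes can be sketched as follows (illustrative only, not a FITS library): NULL-terminate and NULL-pad the field, except when the string exactly fills it, in which case no NULL is written.

```python
# Hypothetical helper mirroring the described FITS BINTABLE rule.
def bintable_string_field(s: str, width: int) -> bytes:
    encoded = s.encode('ascii')
    if len(encoded) >= width:
        return encoded[:width]         # field is full: no room for a NULL
    return encoded.ljust(width, b'\x00')  # NULL terminator + NULL padding

assert bintable_string_field('abc', 5) == b'abc\x00\x00'
assert bintable_string_field('abcde', 5) == b'abcde'  # no NULL written
```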
On Thu, Apr 20, 2017 at 1:16 PM, Phil Hodge <hodge@stsci.edu> wrote:
On 04/20/2017 03:17 PM, Anne Archibald wrote:
Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped.

FITS BINTABLE extensions can have columns containing strings, and in that case the values are NULL terminated, except that if the string fills the field (i.e. there's no room for a NULL), the NULL will not be written.

Ah, that's what I was thinking of, thank you. -- Robert Kern
On Thu, Apr 20, 2017 at 12:53 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < jtaylor.debian@googlemail.com> wrote:
Do you have comments on how to go forward, in particular in regards to new dtype vs modify np.unicode?
Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases.
FWIW, if I need to work with in-memory arrays of strings in Python code, I'm going to use dtype=object a la pandas. It has almost no arbitrary constraints, and I can rely on Python's unicode facilities freely. There may be some cases where it's a little less memory-efficient (e.g. representing a column of enumerated single-character values like 'M'/'F'), but that's never prevented me from doing anything (compare to the uniform-length restrictions, which *have* prevented me from doing things).
So what's left? Being able to memory-map to files that have string data conveniently laid out according to numpy assumptions (e.g. FITS). Being able to work with C/C++/Fortran APIs that have arrays of strings laid out according to numpy assumptions (e.g. HDF5). I think it would behoove us to canvass the needs of these formats and APIs before making any more assumptions.
For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string.
I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings.
We should look at some of the newer formats and APIs, like Parquet and Arrow, and also consider the cross-language APIs with Julia and R.
If I had to jump ahead and propose new dtypes, I might suggest this:
* For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs.
* Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name.
* Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied.
* Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.).
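The en-/decoding utility functions suggested in the last point could look roughly like this — a hypothetical sketch built on the existing `np.char.encode`/`np.char.decode` helpers (`to_fixed_bytes` and `from_fixed_bytes` are made-up names, not a proposed API):

```python
import numpy as np

def to_fixed_bytes(strings, encoding="latin-1"):
    # Hypothetical helper: pack str values into a uniform-length
    # bytes ('S') array; numpy NULL-pads the shorter values.
    u = np.asarray(strings, dtype="U")      # fixed-width UCS4 first
    return np.char.encode(u, encoding)

def from_fixed_bytes(packed, encoding="latin-1"):
    # Inverse: decode a uniform-length bytes array back to str.
    return np.char.decode(np.asarray(packed, dtype="S"), encoding)

packed = to_fixed_bytes(["naïve", "café"])
print(packed.dtype.kind)     # 'S' (uniform-length bytes)
roundtrip = from_fixed_bytes(packed)  # elements compare equal to the input
```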
A little history: IIRC, storing null-terminated strings in fixed byte lengths was done in Fortran, where strings were usually stored in integers/integer arrays. If memory mapping of arbitrary types is not important, I'd settle for ascii or latin-1, utf-8 with fixed byte length, and arrays of fixed python object type. Using one-byte encodings and utf-8 avoids needing to deal with endianness.

Chuck
On 20.04.2017 20:53, Robert Kern wrote:
> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
>> Do you have comments on how to go forward, in particular in regards to new dtype vs modify np.unicode?
>
> Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases.
We ended up in this situation because we did not take the opportunity to break compatibility when python3 support was added. We should have made the string dtype an encoded byte type (ascii or latin1) in python3, instead of null-terminated unencoded bytes, which do not make much practical sense.

So the use case is very simple: give users of the string dtype a migration path that does not involve converting to full utf32 unicode. The latin1-encoded bytes dtype would allow that. As we already have the infrastructure, this same dtype can allow more than just latin1 with minimal effort: for the fixed-size encodings python supports, it is literally adding an enum entry, two new switch clauses, a little bit of dtype string parsing, and testcases.

Having some form of variable-length string handling would be nice, but that is another topic altogether. Builtin support for variable-length strings seems like overkill, as the string dtype is not that important and object arrays should work reasonably well for this use case already.
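The memory argument behind the migration path is easy to see in current numpy (a sketch; note that the `'U10[latin1]'` syntax is a proposal only and does not exist in released numpy):

```python
import numpy as np

# The status quo in Python 3: the only true string dtype is UCS4,
# so every character costs 4 bytes.
print(np.dtype("U10").itemsize)   # 40
print(np.dtype("S10").itemsize)   # 10, but items come back as bytes

# The proposed 'U10[latin1]' would combine the 10-byte footprint of
# 'S10' with str semantics on item access.
```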
On Thu, Apr 20, 2017 at 12:27 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
> On 20.04.2017 20:53, Robert Kern wrote:
>> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
>>> Do you have comments on how to go forward, in particular in regards to new dtype vs modify np.unicode?
>>
>> Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases.
>
> We ended up in this situation because we did not take the opportunity to break compatibility when python3 support was added.
Oh, the root cause I'm thinking of long predates Python 3, or even numpy 1.0. There never was an explicitly fleshed-out use case for unicode arrays other than "Python has unicode strings, so we should have a string dtype that supports it". Hence the "we only support UCS4" implementation; it's not like anyone *wants* UCS4 or interoperates with UCS4, but it does represent all possible Unicode strings. The Python 3 transition merely exacerbated the problem by making Unicode strings the primary string type to work with. I don't really want to ameliorate the exacerbation without addressing the root problem, which is worth solving.

I will put this down as a marker use case: Support HDF5's fixed-width UTF-8 arrays.

--
Robert Kern
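For context on why fixed-width UTF-8 is awkward to size: the width is counted in bytes, not characters (a plain-Python sketch, not HDF5-specific code):

```python
# A fixed-width UTF-8 field, as HDF5 defines it, is sized in bytes,
# so it must be provisioned for the worst case of 4 bytes per
# character even when most data is ASCII.
s = "héllo"                        # 5 characters
print(len(s))                      # 5
print(len(s.encode("utf-8")))      # 6: 'é' encodes to two bytes
```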
participants (12)

- Anne Archibald
- Antoine Pitrou
- Charles R Harris
- Chris Barker
- Eric Wieser
- Feng Yu
- Julian Taylor
- Marten van Kerkwijk
- Neal Becker
- Phil Hodge
- Robert Kern
- Stephan Hoyer