A one-byte string dtype?
Folks,

I've been blathering away on the related threads a lot -- sorry if it's too much. It's gotten a bit tangled up, so I thought I'd start a new one to address this one question (i.e. don't bring up genfromtxt here):

Would it be a good thing for numpy to have a one-byte-per-character string type?

We did have that with the 'S' type in py2, but the changes in py3 have made it not quite the right thing. And it appears that enough people use 'S' in py3 to mean 'bytes', so we can't change that now.

The only difference may be that 'S' currently auto-translates to a bytes object, resulting in things like:

np.array(['some text',], dtype='S')[0] == 'some text'

yielding False on Py3. And you can't do all the usual text stuff with the resulting bytes object, either. (And it probably used the default encoding to generate the bytes, so will barf on some inputs, though that may be unavoidable.) So you need to decode the bytes that are given back, and now that I think about it, I have no idea what encoding you'd need to use in the general case.

So the correct solution is (particularly on py3) to use the 'U' (unicode) dtype for text in numpy arrays.

However, the 'U' dtype is 4 bytes per character, and that may be "too big" for some use-cases. And there is a lot of text in scientific data sets that is pure ascii, or at least some 1-byte-per-character encoding.

So, in the spirit of having multiple numeric types that use different amounts of memory and can hold different ranges of values, a one-byte-per-character dtype would be nice:

(Note, this opens the door for a 2-byte-per-character (UCS-2) dtype too; I personally don't think that's worth it, but maybe that's because I'm an English speaker...)

It could use the 's' (lower-case s) type identifier.
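The comparison pitfall described above is easy to reproduce. A minimal demonstration, assuming NumPy on Python 3 (the array contents are arbitrary examples):

```python
import numpy as np

# The 'S' dtype hands back bytes objects on Python 3, so comparing an
# element against a str is always False -- exactly the pitfall above.
a = np.array(['some text'], dtype='S')
item = a[0]                                   # a numpy bytes scalar
print(item)                                   # b'some text'
print(item == 'some text')                    # False: bytes != str on Py3
print(item.decode('latin-1') == 'some text')  # True once decoded
```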
For passing to/from python built-in objects, it would:

* Allow either Python bytes objects or Python unicode objects as input:
  a) bytes objects would be passed through as-is
  b) unicode objects would be encoded as latin-1

[note: I'm not entirely sure that bytes objects should be allowed, but it would provide a nice efficiency in a fairly common case]

* It would create python unicode text objects, decoded as latin-1.

Could we have a way to specify another encoding? I'm not sure how that would fit into the dtype system.

I've explained the latin-1 thing on other threads, but the short version is:

- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you a UnicodeEncodeError regardless of what arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through an encoding/decoding operation.

(It still wouldn't allow you to store arbitrary unicode -- but that's the limitation of one byte per character...)

So:

Bad idea all around: shut up already!

or

Fine idea, but who's going to write the code? not me!

or

We really should do this.

(of course, with the option of amending the above not-very-fleshed-out proposal)

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov
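The latin-1 properties claimed above can be checked in a few lines of plain Python (no numpy needed):

```python
# Every one of the 256 possible byte values decodes under latin-1 without
# error, and the decode/encode round trip preserves the bytes exactly --
# the two properties claimed for the proposed dtype.
data = bytes(range(256))                 # all possible single-byte values
text = data.decode('latin-1')            # never raises UnicodeDecodeError
assert text.encode('latin-1') == data    # lossless round trip
print(len(text))                         # 256: one character per input byte
```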
On Fri, Jan 17, 2014 at 5:30 PM, Chris Barker <chris.barker@noaa.gov> wrote:
Folks,
I've been blathering away on the related threads a lot -- sorry if it's too much. It's gotten a bit tangled up, so I thought I'd start a new one to address this one question (i.e. don't bring up genfromtxt here):
Would it be a good thing for numpy to have a one-byte-per-character string type?
We did have that with the 'S' type in py2, but the changes in py3 have made it not quite the right thing. And it appears that enough people use 'S' in py3 to mean 'bytes', so that we can't change that now.
The only difference may be that 'S' currently auto translates to a bytes object, resulting in things like:
np.array(['some text',], dtype='S')[0] == 'some text'
yielding False on Py3. And you can't do all the usual text stuff with the resulting bytes object, either. (and it probably used the default encoding to generate the bytes, so will barf on some inputs, though that may be unavoidable.) So you need to decode the bytes that are given back, and now that I think about it, I have no idea what encoding you'd need to use in the general case.
So the correct solution is (particularly on py3) to use the 'U' (unicode) dtype for text in numpy arrays.
However, the 'U' dtype is 4 bytes per character, and that may be "too big" for some use-cases. And there is a lot of text in scientific data sets that are pure ascii, or at least some 1-byte-per-character encoding.
So, in the spirit of having multiple numeric types that use different amounts of memory, and can hold different ranges of values, a one-byte-per character dtype would be nice:
(Note, this opens the door for a 2-byte-per-character (UCS-2) dtype too; I personally don't think that's worth it, but maybe that's because I'm an English speaker...)
It could use the 's' (lower-case s) type identifier.
For passing to/from python built-in objects, it would
* Allow either Python bytes objects or Python unicode objects as input:
  a) bytes objects would be passed through as-is
  b) unicode objects would be encoded as latin-1
[note: I'm not entirely sure that bytes objects should be allowed, but it would provide a nice efficiency in a fairly common case]
* It would create python unicode text objects, decoded as latin-1.
Could we have a way to specify another encoding? I'm not sure how that would fit into the dtype system.
I've explained the latin-1 thing on other threads, but the short version is:
- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you a UnicodeEncodeError regardless of what arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through an encoding/decoding operation.
(it still wouldn't allow you to store arbitrary unicode -- but that's the limitation of one-byte per character...)
So:
Bad idea all around: shut up already!
or
Fine idea, but who's going to write the code? not me!
or
We really should do this.
As evident from what I said in the previous thread, YES, this should really be done! One important feature would be changing the dtype from 'S' to 's' without any memory copies, so that conversion would be very cheap. Maybe this would essentially come for free with something like astype('s', copy=False). - Tom
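The hypothetical 's' dtype doesn't exist, but the zero-copy reinterpretation Tom describes is what `.view()` already provides between dtypes of equal total size. A sketch (the array contents are made up for illustration):

```python
import numpy as np

# 'S4' and four uint8 columns describe the same 4 bytes per element, so a
# view reinterprets the buffer with no copy -- the kind of cheap 'S' -> 's'
# conversion suggested above.
a = np.array([b'spam', b'eggs'], dtype='S4')
b = a.view(np.uint8).reshape(2, 4)   # same memory, different dtype
print(np.shares_memory(a, b))        # True: no data was copied
b[0, 0] = ord('S')                   # mutate through the view...
print(a[0])                          # b'Spam': the change shows up in a
```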
(of course, with the option of amending the above not-very-fleshed-out proposal)
-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception
Chris.Barker@noaa.gov
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote:
Folks,
I've been blathering away on the related threads a lot -- sorry if it's too much. It's gotten a bit tangled up, so I thought I'd start a new one to address this one question (i.e. don't bring up genfromtxt here):
Would it be a good thing for numpy to have a one-byte-per-character string type?
If you mean a string type that can only hold latin-1 characters then I think that this is a step backwards. If you mean a dtype that holds bytes in a known, specifiable encoding and automatically decodes them to unicode strings when you call .item() and has a friendly repr() then that may be a good idea. So for example you could have dtype='S:utf-8' which would store strings encoded as utf-8 e.g.:
>>> text = array(['foo', 'bar'], dtype='S:utf-8')
>>> text
array(['foo', 'bar'], dtype='|S3:utf-8')
>>> print(text)
['foo', 'bar']
>>> text[0]
'foo'
>>> text.nbytes
6
We did have that with the 'S' type in py2, but the changes in py3 have made it not quite the right thing. And it appears that enough people use 'S' in py3 to mean 'bytes', so that we can't change that now.
It wasn't really the right thing before either. That's why Python 3 has changed all of this.
The only difference may be that 'S' currently auto translates to a bytes object, resulting in things like:
np.array(['some text',], dtype='S')[0] == 'some text'
yielding False on Py3. And you can't do all the usual text stuff with the resulting bytes object, either. (and it probably used the default encoding to generate the bytes, so will barf on some inputs, though that may be unavoidable.) So you need to decode the bytes that are given back, and now that I think about it, I have no idea what encoding you'd need to use in the general case.
You should let the user specify the encoding or otherwise require them to use the 'U' dtype.
So the correct solution is (particularly on py3) to use the 'U' (unicode) dtype for text in numpy arrays.
Absolutely. Embrace the Python 3 text model. Once you understand the how, what and why of it you'll see that it really is a good thing!
However, the 'U' dtype is 4 bytes per character, and that may be "too big" for some use-cases. And there is a lot of text in scientific data sets that are pure ascii, or at least some 1-byte-per-character encoding.
So, in the spirit of having multiple numeric types that use different amounts of memory, and can hold different ranges of values, a one-byte-per character dtype would be nice:
(Note, this opens the door for a 2-byte-per-character (UCS-2) dtype too; I personally don't think that's worth it, but maybe that's because I'm an English speaker...)
You could just use a 2-byte encoding with the S dtype e.g. dtype='S:utf-16-le'.
It could use the 's' (lower-case s) type identifier.
For passing to/from python built-in objects, it would
* Allow either Python bytes objects or Python unicode objects as input:
  a) bytes objects would be passed through as-is
  b) unicode objects would be encoded as latin-1
[note: I'm not entirely sure that bytes objects should be allowed, but it would provide a nice efficiency in a fairly common case]
I think it would be a bad idea to accept bytes here. There are good reasons that Python 3 creates a barrier between the two worlds of text and bytes. Allowing implicit mixing of bytes and text is a recipe for mojibake. The TypeErrors in Python 3 are used to guard against conceptual errors that lead to data corruption. Attempting to undermine that barrier in numpy would be a backward step.

I apologise if this is misplaced but there seems to be an attitude that scientific programming isn't really affected by the issues that have led to the Python 3 text model. I think that's ridiculous; data corruption is a problem in scientific programming just as it is anywhere else.
* It would create python unicode text objects, decoded as latin-1.
Don't try to bless a particular encoding and stop trying to pretend that it's possible to write a sensible system where end users don't need to worry about and specify the encoding of their data.
Could we have a way to specify another encoding? I'm not sure how that would fit into the dtype system.
If the encoding cannot be specified then the whole idea is misguided.
I've explained the latin-1 thing on other threads, but the short version is:
- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you a UnicodeEncodeError regardless of what arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through an encoding/decoding operation.
So what happens if I do:
>>> with open('myutf-8-file.txt', 'rb') as fin:
...     text = numpy.fromfile(fin, dtype='s')
>>> text[0]  # Decodes as latin-1, leading to mojibake.
I would propose that it's better to be able to do:
>>> with open('myutf-8-file.txt', 'rb') as fin:
...     text = numpy.fromfile(fin, dtype='s:utf-8')
There's really no way to get around the fact that users need to specify the encoding of their text files.
(it still wouldn't allow you to store arbitrary unicode -- but that's the limitation of one-byte per character...)
You could if you use 'utf-8'. It would be one-byte-per-char for text that only contains ascii characters. However it would still support every character that the unicode consortium can dream up.

The only possible advantage here is as a memory optimisation (potentially having a speed impact too although it could equally be a speed regression). Otherwise it just adds needless complexity to numpy and to the code that uses the new dtype as well as limiting its ability to handle unicode.

How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ascii characters then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays is there really a noticeable problem just using the 'U' dtype?

Oscar
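Oscar's one-byte-for-ascii point is easy to verify, along with the catch that utf-8 is variable-width (the example strings are arbitrary):

```python
# utf-8 is one byte per character only for pure-ascii text; anything else
# grows, which is why a fixed-width field holds n *bytes*, not n
# *characters*, of utf-8.
print(len('numpy'.encode('utf-8')))   # 5 bytes for 5 ascii characters
print(len('naïve'.encode('utf-8')))   # 6 bytes for 5 characters
print(len('猫'.encode('utf-8')))      # 3 bytes for one CJK character
```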
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com>wrote:
On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote:
Folks,
I've been blathering away on the related threads a lot -- sorry if it's too much. It's gotten a bit tangled up, so I thought I'd start a new one to address this one question (i.e. don't bring up genfromtxt here):
Would it be a good thing for numpy to have a one-byte-per-character string type?
If you mean a string type that can only hold latin-1 characters then I think that this is a step backwards.
If you mean a dtype that holds bytes in a known, specifiable encoding and automatically decodes them to unicode strings when you call .item() and has a friendly repr() then that may be a good idea.
So for example you could have dtype='S:utf-8' which would store strings encoded as utf-8 e.g.:
>>> text = array(['foo', 'bar'], dtype='S:utf-8')
>>> text
array(['foo', 'bar'], dtype='|S3:utf-8')
>>> print(text)
['foo', 'bar']
>>> text[0]
'foo'
>>> text.nbytes
6
We did have that with the 'S' type in py2, but the changes in py3 have made it not quite the right thing. And it appears that enough people use 'S' in py3 to mean 'bytes', so that we can't change that now.
It wasn't really the right thing before either. That's why Python 3 has changed all of this.
The only difference may be that 'S' currently auto translates to a bytes object, resulting in things like:
np.array(['some text',], dtype='S')[0] == 'some text'
yielding False on Py3. And you can't do all the usual text stuff with the resulting bytes object, either. (and it probably used the default encoding to generate the bytes, so will barf on some inputs, though that may be unavoidable.) So you need to decode the bytes that are given back, and now that I think about it, I have no idea what encoding you'd need to use in the general case.
You should let the user specify the encoding or otherwise require them to use the 'U' dtype.
So the correct solution is (particularly on py3) to use the 'U' (unicode) dtype for text in numpy arrays.
Absolutely. Embrace the Python 3 text model. Once you understand the how, what and why of it you'll see that it really is a good thing!
However, the 'U' dtype is 4 bytes per character, and that may be "too big" for some use-cases. And there is a lot of text in scientific data sets that are pure ascii, or at least some 1-byte-per-character encoding.
So, in the spirit of having multiple numeric types that use different amounts of memory, and can hold different ranges of values, a one-byte-per character dtype would be nice:
(Note, this opens the door for a 2-byte-per-character (UCS-2) dtype too; I personally don't think that's worth it, but maybe that's because I'm an English speaker...)
You could just use a 2-byte encoding with the S dtype e.g. dtype='S:utf-16-le'.
It could use the 's' (lower-case s) type identifier.
For passing to/from python built-in objects, it would
* Allow either Python bytes objects or Python unicode objects as input:
  a) bytes objects would be passed through as-is
  b) unicode objects would be encoded as latin-1
[note: I'm not entirely sure that bytes objects should be allowed, but it would provide a nice efficiency in a fairly common case]
I think it would be a bad idea to accept bytes here. There are good reasons that Python 3 creates a barrier between the two worlds of text and bytes. Allowing implicit mixing of bytes and text is a recipe for mojibake. The TypeErrors in Python 3 are used to guard against conceptual errors that lead to data corruption. Attempting to undermine that barrier in numpy would be a backward step.
I apologise if this is misplaced but there seems to be an attitude that scientific programming isn't really affected by the issues that have led to the Python 3 text model. I think that's ridiculous; data corruption is a problem in scientific programming just as it is anywhere else.
* It would create python unicode text objects, decoded as latin-1.
Don't try to bless a particular encoding and stop trying to pretend that it's possible to write a sensible system where end users don't need to worry about and specify the encoding of their data.
Could we have a way to specify another encoding? I'm not sure how that would fit into the dtype system.
If the encoding cannot be specified then the whole idea is misguided.
I've explained the latin-1 thing on other threads, but the short version is:
- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you a UnicodeEncodeError regardless of what arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through an encoding/decoding operation.
So what happens if I do:
>>> with open('myutf-8-file.txt', 'rb') as fin:
...     text = numpy.fromfile(fin, dtype='s')
>>> text[0]  # Decodes as latin-1, leading to mojibake.
I would propose that it's better to be able to do:
>>> with open('myutf-8-file.txt', 'rb') as fin:
...     text = numpy.fromfile(fin, dtype='s:utf-8')
There's really no way to get around the fact that users need to specify the encoding of their text files.
(it still wouldn't allow you to store arbitrary unicode -- but that's the limitation of one-byte per character...)
You could if you use 'utf-8'. It would be one-byte-per-char for text that only contains ascii characters. However it would still support every character that the unicode consortium can dream up.
The only possible advantage here is as a memory optimisation (potentially having a speed impact too although it could equally be a speed regression). Otherwise it just adds needless complexity to numpy and to the code that uses the new dtype as well as limiting its ability to handle unicode.
How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ascii characters then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays is there really a noticeable problem just using the 'U' dtype?
I use numpy for giga-row arrays of short text strings, so memory and performance issues are real.

As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere. I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess.

From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use python 3. Something like this proposal seems to be the right direction, maybe not pure and perfect but a sensible step to get us there given the reality of scientific computing.

- Tom
Oscar
On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com>wrote:
How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ascii characters then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays is there really a noticeable problem just using the 'U' dtype?
I use numpy for giga-row arrays of short text strings, so memory and performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere. I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess.
From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use python 3. Something like this proposal seems to be the right direction, maybe not pure and perfect but a sensible step to get us there given the reality of scientific computing.
I don't really see how writing b'match' instead of 'match' is that big a deal. And why are you needing to write .decode('ascii') everywhere? If you really do just want to work with bytes in your own known encoding then why not just read and write in binary mode?

I apologise if I'm wrong but I suspect that much of the difficulty in getting the bytes/unicode separation right is down to the fact that a lot of the code you're using (or attempting to support) hasn't yet been ported to a clean text model. When I started using Python 3 it took me quite a few failed attempts at understanding the text model before I got to the point where I understood how it is supposed to be used. The problem was that I had been conflating text and bytes in many places, and that's hard to disentangle. Having fixed most of those problems I now understand why it is such an improvement.

In any case I don't see anything wrong with a more efficient dtype for representing text if the user can specify the encoding. The problem is that numpy arrays expose their underlying memory buffer. Allowing them to interact directly with text strings on the one side and binary files on the other breaches Python 3's very good text model unless the user can specify the encoding that is to be used. Or at least if there is to be a blessed encoding then make it unicode-capable utf-8 instead of legacy ascii/latin-1.

Oscar
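A sketch of the decode-at-the-boundaries approach Oscar suggests, using `np.char` (the byte values here are latin-1 examples chosen for illustration):

```python
import numpy as np

# Keep the array as bytes internally; convert to and from text explicitly,
# with a known encoding, only at the program's edges.
raw = np.array([b'caf\xe9', b'no\xebl'], dtype='S4')   # latin-1 bytes
text = np.char.decode(raw, 'latin-1')    # unicode 'U' array for display etc.
print(text)                              # ['café' 'noël']
back = np.char.encode(text, 'latin-1')   # bytes again for storage
print((back == raw).all())               # True: lossless round trip
```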
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ascii characters then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays is there really a noticeable problem just using the 'U' dtype?
I use numpy for giga-row arrays of short text strings, so memory and performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere. I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess.
From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use python 3. Something like this proposal seems to be the right direction, maybe not pure and perfect but a sensible step to get us there given the reality of scientific computing.
I don't really see how writing b'match' instead of 'match' is that big a deal.
It's a big deal because all your existing python 2 code suddenly breaks on python 3, even after running 2to3. Yes, you can backfix all the python 2 code and use bytestring literals everywhere, but that is very painful and ugly. More importantly it's very fiddly because *sometimes* you'll need to use bytestring literals, and *sometimes* not, depending on the exact dataset you've been handed. That's basically a non-starter.

As you say below, the only solution is a proper separation of bytes/unicode where everything internally is unicode. The problem is that the existing 4-byte unicode in numpy is a big performance / memory hit. It's even trickier because libraries will happily deliver a numpy structured array with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to then convert to 'U' since you need to remake the entire structured array. With a one-byte unicode the goal would be an in-place update of 'S' to 's'.
And why are you needing to write .decode('ascii') everywhere?
print("The first value is {}".format(bytestring_array[0]))
On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'".
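The behaviour Tom describes can be reproduced in plain Python 3, no numpy required (the value is a stand-in for an 'S' array element):

```python
# An 'S' array hands back bytes, and str.format renders bytes with the
# b'...' repr on Python 3 -- hence the .decode('ascii') everywhere.
value = b'string_value'
print("The first value is {}".format(value))
# -> The first value is b'string_value'
print("The first value is {}".format(value.decode('ascii')))
# -> The first value is string_value
```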
If you really do just want to work with bytes in your own known encoding then why not just read and write in binary mode?
I apologise if I'm wrong but I suspect that much of the difficulty in getting the bytes/unicode separation right is down to the fact that a lot of the code you're using (or attempting to support) hasn't yet been ported to a clean text model. When I started using Python 3 it took me quite a few failed attempts at understanding the text model before I got to the point where I understood how it is supposed to be used. The problem was that I had been conflating text and bytes in many places, and that's hard to disentangle. Having fixed most of those problems I now understand why it is such an improvement.
In any case I don't see anything wrong with a more efficient dtype for representing text if the user can specify the encoding. The problem is that numpy arrays expose their underlying memory buffer. Allowing them to interact directly with text strings on the one side and binary files on the other breaches Python 3's very good text model unless the user can specify the encoding that is to be used. Or at least if there is to be a blessed encoding then make it unicode-capable utf-8 instead of legacy ascii/latin-1.
Oscar
On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ascii characters then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays is there really a noticeable problem just using the 'U' dtype?
I use numpy for giga-row arrays of short text strings, so memory and performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere. I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess.
From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use python 3. Something like this proposal seems to be the right direction, maybe not pure and perfect but a sensible step to get us there given the reality of scientific computing.
I don't really see how writing b'match' instead of 'match' is that big a deal.
It's a big deal because all your existing python 2 code suddenly breaks on python 3, even after running 2to3. Yes, you can backfix all the python 2 code and use bytestring literals everywhere, but that is very painful and ugly. More importantly it's very fiddly because *sometimes* you'll need to use bytestring literals, and *sometimes* not, depending on the exact dataset you've been handed. That's basically a non-starter.
As you say below, the only solution is a proper separation of bytes/unicode where everything internally is unicode. The problem is that the existing 4-byte unicode in numpy is a big performance / memory hit. It's even trickier because libraries will happily deliver a numpy structured array with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to then convert to 'U' since you need to remake the entire structured array. With a one-byte unicode the goal would be an in-place update of 'S' to 's'.
And why are you needing to write .decode('ascii') everywhere?
print("The first value is {}".format(bytestring_array[0]))
On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'".
As Nathaniel has mentioned, this is a known problem with Python 3 and the developers are trying to come up with a solution. Python 3.4 solves some existing problems, but this one remains. It's not just numpy here, it's that python itself needs to provide some help. Chuck
On Jan 20, 2014 5:21 PM, "Charles R Harris" <charlesr.harris@gmail.com> wrote:
On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin wrote:
And why are you needing to write .decode('ascii') everywhere?
print("The first value is {}".format(bytestring_array[0]))
On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'".
As Nathaniel has mentioned, this is a known problem with Python 3 and the developers are trying to come up with a solution. Python 3.4 solves some existing problems, but this one remains. It's not just numpy here, it's that python itself needs to provide some help.

If you think that anything in core Python will change so that you can mix text and bytes as above then I think you are very much mistaken. If you're referring to PEP 460/461 then you have misunderstood the purpose of those PEPs. The authors and reviewers will carefully ensure that nothing changes to make the above work the way that it did in 2.x.

Oscar
On Mon, Jan 20, 2014 at 11:40 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas < aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin < oscar.j.benjamin@gmail.com> wrote:
On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
And why are you needing to write .decode('ascii') everywhere?
print("The first value is {}".format(bytestring_array[0]))
On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'".
On Jan 20, 2014 5:21 PM, "Charles R Harris" <charlesr.harris@gmail.com> wrote:
As Nathaniel has mentioned, this is a known problem with Python 3 and the developers are trying to come up with a solution. Python 3.4 solves some existing problems, but this one remains. It's not just numpy here, it's that python itself needs to provide some help.
If you think that anything in core Python will change so that you can mix text and bytes as above then I think you are very much mistaken. If you're referring to PEP 460/461 then you have misunderstood the purpose of those PEPs. The authors and reviewers will carefully ensure that nothing changes to make the above work the way that it did in 2.x.
I think we may want something like PEP 393 <http://www.python.org/dev/peps/pep-0393/>. The S datatype may be the wrong place to look; we might want a modification of U instead so as to transparently get the benefit of python strings.

Chuck
On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris@gmail.com> wrote:
> I think we may want something like PEP 393. The S datatype may be the wrong place to look, we might want a modification of U instead so as to transparently get the benefit of python strings.

The approach taken in PEP 393 (the FSR) makes more sense for str than it does for numpy arrays for two reasons: str is immutable and opaque.

Since str is immutable, the maximum code point in the string can be determined once, when the string is created, before anything else can get a pointer to the string buffer. Since it is opaque, no one can rightly expect it to expose a particular binary format, so it is free to choose one without compromising any expected semantics. If someone can call buffer on an array then the FSR is a semantic change.

If a numpy 'U' array used the FSR and consisted only of ASCII characters then it would have a one-byte-per-char buffer. What then happens if you put a higher code point in? The buffer needs to be resized and the data copied over. But then what happens to any buffer objects or array views? They would be pointing at the old buffer from before the resize. Subsequent modifications to the resized array would not show up in other views and vice versa.

I don't think that this can be done transparently, since users of a numpy array need to know about the binary representation. That's why I suggest a dtype that has an encoding. Only in that way can it consistently have both a binary and a text interface.

Oscar
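The view-aliasing concern above can be seen with today's fixed-width 'U' dtype, where a view and its parent array share one buffer (a minimal illustration of the existing semantics, not a sketch of any proposed dtype):

```python
import numpy as np

# Views of a fixed-width 'U' array alias the parent's buffer.  A PEP-393
# style representation that resized the buffer when a higher code point
# arrived would silently break this aliasing.
a = np.array(['abc', 'def'], dtype='U3')
v = a[:1]                 # a view into the same buffer
a[0] = 'xyz'
print(v[0])               # 'xyz' -- the view sees the write
print(a.dtype.itemsize)   # 12 -- 'U3' is always 4 bytes per character
```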
On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
<snip>
I didn't say we should change the S type, but that we should have something, say 's', that appeared to python as a string. I think if we want transparent string interoperability with python together with a compressed representation, and I think we need both, we are going to have to deal with the difficulties of utf-8. That means raising errors if the string doesn't fit in the allotted size, etc. Mind, this is a workaround for the mass of ascii data that is already out there, not a substitute for 'U'.

Chuck
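The "raising errors if the string doesn't fit" difficulty comes from utf-8 being variable-width: N characters can need more than N bytes. A hypothetical helper (not numpy API) sketches the check a fixed-width utf-8 field would need:

```python
# Sketch only: a fixed-width utf-8 field must reject strings whose
# encoded form exceeds the field, even if the character count fits.
def encode_fixed_utf8(text, nbytes):
    data = text.encode('utf-8')
    if len(data) > nbytes:
        raise ValueError("%r needs %d bytes, field holds only %d"
                         % (text, len(data), nbytes))
    return data.ljust(nbytes, b'\x00')   # null-pad to the fixed width

encode_fixed_utf8('abc', 4)              # fits: b'abc\x00'
# encode_fixed_utf8('\u00e9\u00e9', 3)   # 2 chars but 4 utf-8 bytes -> ValueError
```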
On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
If we're going to be taking that much trouble, I'd suggest going ahead and adding a variable-length string type (where the array itself contains a pointer to a lookaside buffer, maybe with an optimization for stashing short strings directly). The fixed-length requirement is pretty onerous for lots of applications (e.g., pandas always uses dtype="O" for strings -- and that might be a good workaround for some people in this thread for now). The use of a lookaside buffer would also make it practical to resize the buffer when the maximum code point changed, for that matter...

Though, IMO, any new dtype here would need a cleanup of the dtype code first so that it doesn't require yet more massive special cases all over umath.so.

-n
--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
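The dtype="O" workaround mentioned above already gives variable-length strings today, at the cost of per-object overhead and heap fragmentation:

```python
import numpy as np

# An object array stores one pointer per element to an ordinary Python
# str, so string lengths can vary freely -- no fixed-width truncation.
a = np.array(['short', 'a much longer string of text'], dtype=object)
a[0] = a[0] * 100          # grows in place, nothing is clipped
print(a.dtype.itemsize)    # pointer-sized: 8 on a 64-bit build
```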
On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs@pobox.com> wrote:
<snip>
Worth thinking about. As another alternative, what is the minimum we need to make a restricted encoding, say latin-1, appear transparently as a unicode string to python? I know the python folks don't like this much, but I suspect something along that line will eventually be required for the http folks.

Chuck
On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only as you suggest.

<snip>

Chuck
On Mon, Jan 20, 2014 at 04:12:20PM -0700, Charles R Harris wrote:
> The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only as you suggest.
This wouldn't necessarily help for the gigarows of short text strings use case (depending on what "short" means). Also, even if it technically saves memory, you may have a greater overhead from fragmenting your array all over the heap.

On my 64-bit Linux system the size of a Python 3.3 str containing only ASCII characters is 49+N bytes. For the 'U' dtype it's 4N bytes. You get a memory saving over dtype='U' only if the strings are 17 characters or more. To get a 50% saving over dtype='U' you'd need strings of at least 49 characters.

If the Numpy array would manage the buffers itself then that per-string memory overhead would be eliminated in exchange for an 8-byte pointer and at least 1 byte to represent the length of the string (assuming you can somehow use Pascal strings when short enough -- null bytes cannot be used). This gives an overhead of 9 bytes per string (or 5 on 32-bit). In this case you save memory if the strings are more than 3 characters long, and you get at least a 50% saving for strings longer than 9 characters.

Using utf-8 in the buffers eliminates the need to go around checking maximum code points etc., so I would guess that would be simpler to implement (CPython has now had to triple all of its code paths that actually access the string buffer).

Oscar
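The break-even arithmetic above can be checked directly (the 49-byte str overhead is specific to 64-bit CPython with the compact str layout; treat the exact constant as platform-dependent):

```python
import sys
import numpy as np

# Per-element cost for an ASCII string of length N: object-array route
# pays (str overhead + N) per string, the 'U' dtype always pays 4N.
overhead = sys.getsizeof('')    # ~49 bytes for an empty ASCII str on 64-bit
for N in (16, 17, 49):
    obj_cost = overhead + N                   # ignoring the array's pointer
    u_cost = np.dtype('U%d' % N).itemsize     # always 4N
    print(N, obj_cost, u_cost)
# With a 49-byte overhead, N=17 is the first length where the object
# route wins (49+17=66 < 68), matching "17 characters or more".
```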
On 21 Jan 2014 11:13, "Oscar Benjamin" <oscar.j.benjamin@gmail.com> wrote:
<snip>
There are various optimisations possible as well.

For ASCII strings of up to length 8, one could also use tagged pointers to eliminate the lookaside buffer entirely. (Alignment rules mean that pointers to allocated buffers always have the low bits zero; so you can make a rule that if the low bit is set to one, then this means the "pointer" itself should be interpreted as containing the string data; use the spare bits in the other bytes to encode the length.)

In some cases it may also make sense to let identical strings share buffers, though this adds some overhead for reference counting and interning.

-n
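A rough Python model of the tagged-word idea (illustrative only; a real implementation would live in C, and the bit-packing described above can squeeze in up to 8 characters where this simple layout stops at 7):

```python
# Model: a 64-bit word with the low bit set is not a real pointer
# (aligned pointers have it clear), so it can carry up to 7 ASCII bytes
# inline plus a 3-bit length in bits 1-3.
def pack_short(s):
    data = s.encode('ascii')
    assert len(data) <= 7, "only short strings fit inline"
    word = 1 | (len(data) << 1)          # tag bit + length
    for i, b in enumerate(data):
        word |= b << (8 * (i + 1))       # string bytes in bytes 1..7
    return word

def unpack_short(word):
    assert word & 1                      # tagged: inline data, not a pointer
    n = (word >> 1) & 0x7
    return bytes((word >> (8 * (i + 1))) & 0xFF for i in range(n)).decode('ascii')

print(unpack_short(pack_short('CGA')))   # CGA
```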
On Tue, Jan 21, 2014 at 11:41:30AM +0000, Nathaniel Smith wrote:
<snip>
Would this new dtype have an opaque memory representation? What would happen in the following:

>>> a = numpy.array(['CGA', 'GAT'], dtype='s')
>>> memoryview(a)

>>> with open('file', 'wb') as fout:
...     a.tofile(fout)

>>> with open('file', 'rb') as fin:
...     a = numpy.fromfile(fin, dtype='s')

Should there be a different function for creating such an array from reading a text file? Or would you just need to use fromiter:

>>> with open('file', encoding='utf-8') as fin:
...     a = numpy.fromiter(fin, dtype='s')

>>> with open('file', 'w', encoding='utf-8') as fout:
...     fout.writelines(line + '\n' for line in a)
(Note that the above would not be reversible if the strings contain newlines.)

I think it would be less confusing to use dtype='u' rather than dtype='s', in order to signify that it is an optimised form of the 'U' dtype as far as access from Python code is concerned. Calling it 's' only really makes sense if there is a plan to deprecate dtype='S'.

How would it behave in Python 2? Would it return unicode strings there as well?

Oscar
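For comparison, here is what that round-trip looks like today with the 4-bytes-per-char 'U' dtype (the proposed 's' does not exist): elements are fixed-width, so the raw UCS-4 bytes survive tofile/fromfile unchanged.

```python
import os
import tempfile
import numpy as np

# Fixed-width 'U3' elements are 12 raw bytes each, so the file format
# is unambiguous and the round-trip is exact.
a = np.array(['CGA', 'GAT'], dtype='U3')
path = os.path.join(tempfile.mkdtemp(), 'file.bin')
a.tofile(path)                      # writes 2 * 12 raw bytes
b = np.fromfile(path, dtype='U3')
print(b)                            # ['CGA' 'GAT']
```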
On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
> The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only as you suggest.
Is this approach intended to be *in addition to* the latin-1 "s" type originally proposed by Chris, or *instead of* that?

- Tom
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
<snip>
> Is this approach intended to be *in addition to* the latin-1 "s" type originally proposed by Chris, or *instead of* that?
Well, that's open for discussion. The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8). A latin-1 type would be easier to implement and would probably be a better choice for something available in both python 2 and python 3, but unless the python 3 developers come up with something clever I don't see how to make it behave transparently as a string in python 3. OTOH, it's not clear to me how to make utf-8 operate transparently with python 2 strings, especially as the unicode representation choices in python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8 is unlikely to be backported. The problem may be unsolvable in a completely satisfactory way.

Chuck
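The property that makes latin-1 the safe one-byte choice (as in Chris's original proposal) is easy to demonstrate: every byte value maps to a code point, so decoding never raises and the round-trip is lossless.

```python
# latin-1 maps bytes 0-255 one-to-one onto code points U+0000-U+00FF,
# so arbitrary bytes decode without error and re-encode unchanged --
# unlike ascii or utf-8, which reject many byte sequences.
raw = bytes(range(256))
text = raw.decode('latin-1')             # never raises
assert text.encode('latin-1') == raw     # arbitrary bytes preserved
try:
    raw.decode('utf-8')                  # by contrast, this fails
except UnicodeDecodeError:
    print('utf-8 cannot decode arbitrary bytes')
```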
On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
> Well, that's open for discussion. The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8).
<snip>
Since it's open for discussion, I'll put in my vote for implementing the easier latin-1 version in the short term to facilitate Python 2 / 3 interoperability. This would solve my use-case (giga-rows of short fixed length strings), and presumably allow things like memory mapping of large data files (like for FITS files in astropy.io.fits). I don't have a clue how the current 'U' dtype works under the hood, but from my user perspective it seems to work just fine in terms of interacting with Python 3 strings. Is there a technical problem with doing basically the same thing for an 's' dtype, but using latin-1 instead of UCS-4? Thanks, Tom
Chuck
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas < aldcroft@head.cfa.harvard.edu> wrote:
On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris < charlesr.harris@gmail.com> wrote:
On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas < aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris < charlesr.harris@gmail.com> wrote:
On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris < charlesr.harris@gmail.com> wrote:
On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs@pobox.com>wrote:
On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
>
> On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
>>
>> On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris@gmail.com> wrote:
>> >
>> > I think we may want something like PEP 393. The S datatype may be the wrong place to look, we might want a modification of U instead so as to transparently get the benefit of python strings.
>>
>> The approach taken in PEP 393 (the FSR) makes more sense for str than it does for numpy arrays for two reasons: str is immutable and opaque.
>>
>> Since str is immutable the maximum code point in the string can be determined once when the string is created before anything else can get a pointer to the string buffer.
>>
>> Since it is opaque no one can rightly expect it to expose a particular binary format so it is free to choose without compromising any expected semantics.
>>
>> If someone can call buffer on an array then the FSR is a semantic change.
>>
>> If a numpy 'U' array used the FSR and consisted only of ASCII characters then it would have a one byte per char buffer. What then happens if you put a higher code point in? The buffer needs to be resized and the data copied over. But then what happens to any buffer objects or array views? They would be pointing at the old buffer from before the resize. Subsequent modifications to the resized array would not show up in other views and vice versa.
>>
>> I don't think that this can be done transparently since users of a numpy array need to know about the binary representation. That's why I suggest a dtype that has an encoding. Only in that way can it consistently have both a binary and a text interface.
>
> I didn't say we should change the S type, but that we should have something, say 's', that appeared to python as a string. I think if we want transparent string interoperability with python together with a compressed representation, and I think we need both, we are going to have to deal with the difficulties of utf-8. That means raising errors if the string doesn't fit in the allotted size, etc. Mind, this is a workaround for the mass of ascii data that is already out there, not a substitute for 'U'.
If we're going to be taking that much trouble, I'd suggest going ahead and adding a variable-length string type (where the array itself contains a pointer to a lookaside buffer, maybe with an optimization for stashing short strings directly). The fixed-length requirement is pretty onerous for lots of applications (e.g., pandas always uses dtype="O" for strings -- and that might be a good workaround for some people in this thread for now). The use of a lookaside buffer would also make it practical to resize the buffer when the maximum code point changed, for that matter...
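For readers trying this today, the dtype=object workaround mentioned above can be sketched as follows (a minimal illustration of the status quo, not the proposed variable-length dtype):

```python
import numpy as np

# Each element of an object array is an ordinary Python str, stored by
# pointer, so elements can have any length and can grow freely -- this
# is essentially what pandas does for string columns.
names = np.array(['a', 'longer string', 'xyz'], dtype=object)
names[0] = names[0] + ' grew'   # no fixed-width truncation
print(names[0])                 # -> a grew
```

The cost is that every element is a full Python object, so memory use and cache locality are much worse than a fixed-width dtype.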
The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only as you suggest.
Is this approach intended to be in *addition to* the latin-1 "s" type originally proposed by Chris, or *instead of* that?
Well, that's open for discussion. The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8). A latin-1 type would be easier to implement and would probably be a better choice for something available in both python 2 and python 3, but unless the python 3 developers come up with something clever I don't see how to make it behave transparently as a string in python 3. OTOH, it's not clear to me how to make utf-8 operate transparently with python 2 strings, especially as the unicode representation choices in python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8 is unlikely to be backported. The problem may be unsolvable in a completely satisfactory way.
Since it's open for discussion, I'll put in my vote for implementing the easier latin-1 version in the short term to facilitate Python 2 / 3 interoperability. This would solve my use-case (giga-rows of short fixed length strings), and presumably allow things like memory mapping of large data files (like for FITS files in astropy.io.fits).
I don't have a clue how the current 'U' dtype works under the hood, but from my user perspective it seems to work just fine in terms of interacting with Python 3 strings. Is there a technical problem with doing basically the same thing for an 's' dtype, but using latin-1 instead of UCS-4?
I think there is a technical problem. We may be able to masquerade latin-1 as utf-8 for some subset of characters or fool python 3 in some other way. But in any case, I think it needs some research to see what the possibilities are. Chuck
On Tue, 2014-01-21 at 07:48 -0700, Charles R Harris wrote:
I think there is a technical problem. We may be able to masquerade latin-1 as utf-8 for some subset of characters or fool python 3 in some other way. But in any case, I think it needs some research to see what the possibilities are.
I am not quite sure, but shouldn't it even be possible to tag a possible encoding onto the metadata of the string dtype and allow this to be set to all 1-byte-wide encodings that python understands? If the metadata is not None, all entry points to and from the array (Object->string, string->Object conversions) would then de- or encode using the usual python string de- and encode.

Of course it would still be a lot of work, since the string comparisons would need to know about comparing different encodings, dtype equivalence is wrong, and all the conversions need to be carefully checked... Most string tools though probably don't care about encoding as long as it is fixed 1-byte width, though one would have to check that they don't lose the encoding information by creating a new "S" array...

- Sebastian
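A pure-Python sketch of that boundary en/decoding idea (the helper names here are invented for illustration; a real implementation would live inside the dtype machinery):

```python
import numpy as np

def to_bytes_array(strings, nchars, encoding='latin-1'):
    # Object -> string entry point: encode at the boundary.
    return np.array([s.encode(encoding) for s in strings], dtype='S%d' % nchars)

def item(arr, i, encoding='latin-1'):
    # String -> Object exit point: decode at the boundary.
    return arr[i].decode(encoding)

a = to_bytes_array(['Õscar', 'foo'], 5)
assert a.dtype == np.dtype('S5')   # 1 byte per character in memory
assert item(a, 0) == 'Õscar'       # round-trips through latin-1
```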
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Tue, Jan 21, 2014 at 06:55:29AM -0700, Charles R Harris wrote:
Well, that's open for discussion. The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8). A latin-1 type would be easier to implement and would probably be a better choice for something available in both python 2 and python 3, but unless the python 3 developers come up with something clever I don't see how to make it behave transparently as a string in python 3. OTOH, it's not clear to me how to make utf-8 operate transparently with python 2 strings, especially as the unicode representation choices in python 2 are ucs-2 or ucs-4
On Python 2, unicode strings can operate transparently with byte strings: $ python Python 2.7.3 (default, Sep 26 2013, 20:03:06) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array([u'\xd5scar'], dtype='U')
>>> a
array([u'\xd5scar'], dtype='<U5')
>>> a[0]
u'\xd5scar'
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print(a[0])  # Encodes as UTF-8
Õscar
>>> 'My name is %s' % a[0]  # Decodes as ASCII
u'My name is \xd5scar'
>>> print('My name is %s' % a[0])  # Encodes as UTF-8
My name is Õscar
This is no better or worse than the rest of the Py2 text model. So if the new dtype always returns a unicode string under Py2 it should work (as well as the Py2 text model ever does).
and the python 3 work adding utf-16 and utf-8 is unlikely to be backported. The problem may be unsolvable in a completely satisfactory way.
What do you mean by this? PEP 393 uses UCS-1/2/4, not utf-8/16/32, i.e. it always uses a fixed-width encoding.

You can just use the CPython C-API to create the unicode strings. The simplest way is probably to use utf-8 internally and then call PyUnicode_DecodeUTF8 and PyUnicode_EncodeUTF8 at the boundaries. This should work fine on Python 2.x and 3.x. It obviates any need to think about pre-3.3 narrow and wide builds and post-3.3 FSR formats.

Unlike Python's str there isn't much need to be able to efficiently slice or index within the string array element. Indexing into the array to get the string requires creating a new object, so you may as well just decode from utf-8 at that point [it's O(num chars) either way]. There's no need to constrain it to fixed-width encodings like the FSR, in which case utf-8 is clearly the best choice as:

1) It covers the whole unicode spectrum.
2) It uses 1 byte-per-char for ASCII.
3) UTF-8 is a big optimisation target for CPython (so it's fast).

Oscar
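In pure Python, the boundary scheme described above amounts to the following (illustrative only; the real work would happen in C via the CPython API):

```python
import numpy as np

# utf-8 bytes inside the array, str at the boundaries.
vals = ['foo', 'Õscar', '\N{SNOWMAN}']
raw = np.array([v.encode('utf-8') for v in vals], dtype='S12')  # encode in
out = [b.decode('utf-8') for b in raw]                          # decode out
assert out == vals                      # covers the whole unicode spectrum
assert len('foo'.encode('utf-8')) == 3  # 1 byte per char for ASCII
```

Note the fixed-width consequence: 'Õscar' costs 6 bytes here, so the element size must be chosen for the worst-case encoded length, not the character count.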
On Mon, Jan 20, 2014 at 12:12 PM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com>wrote:
How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ascii characters then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays is there really a noticeable problem just using the 'U' dtype?
I use numpy for giga-row arrays of short text strings, so memory and performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere. I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess.
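The filtering pain is easy to reproduce on Python 3 (a small sketch; numpy may also emit a FutureWarning on the mixed bytes/str comparison):

```python
import numpy as np

text_array = np.array([b'match', b'other'])   # 'S' dtype
# The natural spelling text_array == 'match' finds nothing on Python 3;
# you are pushed into bytes literals, or into decoding a copy first:
assert np.any(text_array == b'match')
decoded = text_array.astype('U')              # decode to the 4-byte 'U' dtype
assert np.any(decoded == 'match')
```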
From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use python 3. Something like this proposal seems to be the right direction, maybe not pure and perfect but a sensible step to get us there given the reality of scientific computing.
I don't really see how writing b'match' instead of 'match' is that big a deal.
It's a big deal because all your existing python 2 code suddenly breaks on python 3, even after running 2to3. Yes, you can backfix all the python 2 code and use bytestring literals everywhere, but that is very painful and ugly. More importantly it's very fiddly because *sometimes* you'll need to use bytestring literals, and *sometimes* not, depending on the exact dataset you've been handed. That's basically a non-starter.
As you say below, the only solution is a proper separation of bytes/unicode where everything internally is unicode. The problem is that the existing 4-byte unicode in numpy is a big performance / memory hit. It's even trickier because libraries will happily deliver a numpy structured array with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to then convert to 'U' since you need to remake the entire structured array. With a one-byte unicode the goal would be an in-place update of 'S' to 's'.
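The structured-array remake described above can be sketched for a single 'S' field (field names invented for illustration):

```python
import numpy as np

a = np.array([(1, b'abc'), (2, b'xy')], dtype=[('id', 'i4'), ('name', 'S3')])
# There is no in-place 'S' -> 'U' update; the whole array must be rebuilt,
# and the 'U' copy takes 4x the memory for the string field:
b = np.empty(a.shape, dtype=[('id', 'i4'), ('name', 'U3')])
b['id'] = a['id']
b['name'] = a['name'].astype('U3')   # copies and decodes every row
assert b['name'][0] == 'abc'
```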
And why are you needing to write .decode('ascii') everywhere?
print("The first value is {}".format(bytestring_array[0]))
On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'".
Unfortunately (?) np.set_printoptions and np.set_string_function don't work with numpy scalars AFAICS. If they did then it would be possible to override the string representation. It works for arrays. I didn't find the right key for numpy.bytes_ on python 3.3, so now my interpreter can only print bytes:

    np.set_printoptions(formatter={'all': lambda x: x.decode('ascii', errors='ignore')})

Josef
If you really do just want to work with bytes in your own known encoding then why not just read and write in binary mode?
I apologise if I'm wrong but I suspect that much of the difficulty in getting the bytes/unicode separation right is down to the fact that a lot of the code you're using (or attempting to support) hasn't yet been ported to a clean text model. When I started using Python 3 it took me quite a few failed attempts at understanding the text model before I got to the point where I understood how it is supposed to be used. The problem was that I had been conflating text and bytes in many places, and that's hard to disentangle. Having fixed most of those problems I now understand why it is such an improvement.
In any case I don't see anything wrong with a more efficient dtype for representing text if the user can specify the encoding. The problem is that numpy arrays expose their underlying memory buffer. Allowing them to interact directly with text strings on the one side and binary files on the other breaches Python 3's very good text model unless the user can specify the encoding that is to be used. Or at least if there is to be a blessed encoding then make it unicode-capable utf-8 instead of legacy ascii/latin-1.
Oscar
On Mon, Jan 20, 2014 at 8:00 AM, Aldcroft, Thomas < aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin < oscar.j.benjamin@gmail.com> wrote:
On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote:
Folks,
I've been blathering away on the related threads a lot -- sorry if it's too much. It's gotten a bit tangled up, so I thought I'd start a new one to address this one question (i.e. dont bring up genfromtext here):
Would it be a good thing for numpy to have a one-byte--per-character string type?
If you mean a string type that can only hold latin-1 characters then I think that this is a step backwards.
If you mean a dtype that holds bytes in a known, specifiable encoding and automatically decodes them to unicode strings when you call .item() and has a friendly repr() then that may be a good idea.
So for example you could have dtype='S:utf-8' which would store strings encoded as utf-8 e.g.:
>>> text = array(['foo', 'bar'], dtype='S:utf-8')
>>> text
array(['foo', 'bar'], dtype='|S3:utf-8')
>>> print(text)
['foo', 'bar']
>>> text[0]
'foo'
>>> text.nbytes
6
We did have that with the 'S' type in py2, but the changes in py3 have made it not quite the right thing. And it appears that enough people use 'S' in py3 to mean 'bytes', so that we can't change that now.
It wasn't really the right thing before either. That's why Python 3 has changed all of this.
The only difference may be that 'S' currently auto translates to a bytes object, resulting in things like:
np.array(['some text',], dtype='S')[0] == 'some text'
yielding False on Py3. And you can't do all the usual text stuff with the resulting bytes object, either. (and it probably used the default encoding to generate the bytes, so will barf on some inputs, though that may be unavoidable.) So you need to decode the bytes that are given back, and now that I think about it, I have no idea what encoding you'd need to use in the general case.
You should let the user specify the encoding or otherwise require them to use the 'U' dtype.
So the correct solution is (particularly on py3) to use the 'U' (unicode) dtype for text in numpy arrays.
Absolutely. Embrace the Python 3 text model. Once you understand the how, what and why of it you'll see that it really is a good thing!
However, the 'U' dtype is 4 bytes per character, and that may be "too big" for some use-cases. And there is a lot of text in scientific data sets that are pure ascii, or at least some 1-byte-per-character encoding.
So, in the spirit of having multiple numeric types that use different amounts of memory, and can hold different ranges of values, a one-byte-per character dtype would be nice:
(note, this opens the door for a 2-byte per (UCS-2) dtype too, I personally don't think that's worth it, but maybe that's because I'm an english speaker...)
You could just use a 2-byte encoding with the S dtype e.g. dtype='S:utf-16-le'.
It could use the 's' (lower-case s) type identifier.
For passing to/from python built-in objects, it would
* Allow either Python bytes objects or Python unicode objects as input
  a) bytes objects would be passed through as-is
  b) unicode objects would be encoded as latin-1
[note: I'm not entirely sure that bytes objects should be allowed, but it would provide a nice efficiency in a fairly common case]
I think it would be a bad idea to accept bytes here. There are good reasons that Python 3 creates a barrier between the two worlds of text and bytes. Allowing implicit mixing of bytes and text is a recipe for mojibake. The TypeErrors in Python 3 are used to guard against conceptual errors that lead to data corruption. Attempting to undermine that barrier in numpy would be a backward step.
I apologise if this is misplaced but there seems to be an attitude that scientific programming isn't really affected by the issues that have led to the Python 3 text model. I think that's ridiculous; data corruption is a problem in scientific programming just as it is anywhere else.
* It would create python unicode text objects, decoded as latin-1.
Don't try to bless a particular encoding and stop trying to pretend that it's possible to write a sensible system where end users don't need to worry about and specify the encoding of their data.
Could we have a way to specify another encoding? I'm not sure how that would fit into the dtype system.
If the encoding cannot be specified then the whole idea is misguided.
I've explained the latin-1 thing on other threads, but the short version is:
- It will work perfectly for ascii text
- It will work perfectly for latin-1 text (natch)
- It will never give you a UnicodeEncodeError regardless of what arbitrary bytes you pass in.
- It will preserve those arbitrary bytes through an encoding/decoding operation.
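The last two claims are easy to verify, since latin-1 maps the 256 byte values one-to-one onto the first 256 code points:

```python
raw = bytes(range(256))                # every possible byte value
text = raw.decode('latin-1')           # never raises
assert text.encode('latin-1') == raw   # arbitrary bytes survive the round trip
```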
So what happens if I do:
>>> with open('myutf-8-file.txt', 'rb') as fin:
...     text = numpy.fromfile(fin, dtype='s')
>>> text[0]  # Decodes as latin-1, leading to mojibake.
I would propose that it's better to be able to do:
>>> with open('myutf-8-file.txt', 'rb') as fin:
...     text = numpy.fromfile(fin, dtype='s:utf-8')
There's really no way to get around the fact that users need to specify the encoding of their text files.
(it still wouldn't allow you to store arbitrary unicode -- but that's the limitation of one-byte per character...)
You could if you use 'utf-8'. It would be one-byte-per-char for text that only contains ascii characters. However it would still support every character that the unicode consortium can dream up.
The only possible advantage here is as a memory optimisation (potentially having a speed impact too although it could equally be a speed regression). Otherwise it just adds needless complexity to numpy and to the code that uses the new dtype as well as limiting its ability to handle unicode.
How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ascii characters then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays is there really a noticeable problem just using the 'U' dtype?
I use numpy for giga-row arrays of short text strings, so memory and performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere. I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess.
From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use python 3. Something like this proposal seems to be the right direction, maybe not pure and perfect but a sensible step to get us there given the reality of scientific computing.
I think that is right. Not having an effective way to handle these common scientific data sets will block acceptance of Python 3. But we do need to figure out the best way to add this functionality. Chuck
participants (7)
- Aldcroft, Thomas
- Charles R Harris
- Chris Barker
- josef.pktd@gmail.com
- Nathaniel Smith
- Oscar Benjamin
- Sebastian Berg