Re: [Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]
On 01/07/2014 10:22 AM, MRAB wrote:
On 2014-01-07 17:46, Andrew Barnert wrote:
On Jan 7, 2014, at 7:44, Steven D'Aprano <steve@pearwood.info> wrote:
I was thinking about Ethan's suggestion of introducing a new bytestring class and a lot of these suggestions are what I thought the bytestring class could do.
Suppose we take a pure-ASCII byte-string and decode it:
b'abcd'.decode('ascii-compatible')
That would be:
bytestring(b'abcd')
or even:
bytestring('abcd')
[snip]
Suppose we take a byte-string with a non-ASCII byte:
b'abc\xFF'.decode('ascii-compatible')
That would be:
bytestring(b'abc\xFF')
Bytes outside the ASCII range would be mapped to Unicode low surrogates:
bytestring(b'abc\xFF') == bytestring('abc\uDCFF')
Not sure what you mean here. The resulting bytes should be 'abc\xFF' and of length 4. -- ~Ethan~
On 2014-01-07 18:38, Ethan Furman wrote:
On 01/07/2014 10:22 AM, MRAB wrote:
On 2014-01-07 17:46, Andrew Barnert wrote:
On Jan 7, 2014, at 7:44, Steven D'Aprano <steve@pearwood.info> wrote:
I was thinking about Ethan's suggestion of introducing a new bytestring class and a lot of these suggestions are what I thought the bytestring class could do.
Suppose we take a pure-ASCII byte-string and decode it:
b'abcd'.decode('ascii-compatible')
That would be:
bytestring(b'abcd')
or even:
bytestring('abcd')
[snip]
Suppose we take a byte-string with a non-ASCII byte:
b'abc\xFF'.decode('ascii-compatible')
That would be:
bytestring(b'abc\xFF')
Bytes outside the ASCII range would be mapped to Unicode low surrogates:
bytestring(b'abc\xFF') == bytestring('abc\uDCFF')
Not sure what you mean here. The resulting bytes should be 'abc\xFF' and of length 4.
'abc\xFF' is a Unicode string, but you wouldn't be able to convert it to a bytestring because '\xFF' is a codepoint outside the ASCII range and not a low surrogate.
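For readers following along in a current Python 3: the `bytestring` type and the 'ascii-compatible' codec discussed here are proposals, not real APIs. The low-surrogate mapping MRAB describes does resemble what the existing `surrogateescape` error handler does, so the sketch below uses the standard `ascii` codec with that handler as a stand-in (an assumption, not what the proposal would necessarily do).

```python
# Stand-in for the proposed 'ascii-compatible' decode: the real 'ascii'
# codec combined with the 'surrogateescape' error handler.
raw = b'abc\xff'
text = raw.decode('ascii', errors='surrogateescape')

# The non-ASCII byte 0xFF is smuggled through as the low surrogate U+DCFF.
print(ascii(text))

# Encoding with the same handler restores the original bytes.
round_trip = text.encode('ascii', errors='surrogateescape')

# A genuine U+00FF character is neither ASCII nor a low surrogate,
# so it cannot be converted back to bytes this way.
try:
    'abc\xff'.encode('ascii', errors='surrogateescape')
    latin1_char_rejected = False
except UnicodeEncodeError:
    latin1_char_rejected = True
```

This also illustrates the terminology clash in the thread: `'abc\xff'` (a str containing U+00FF) and `b'abc\xff'` (four bytes) are different objects, and only the second survives the round trip.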
On 01/07/2014 11:32 AM, MRAB wrote:
On 2014-01-07 18:38, Ethan Furman wrote:
On 01/07/2014 10:22 AM, MRAB wrote:
On Jan 7, 2014, at 7:44, Steven D'Aprano <steve@pearwood.info> wrote:
Suppose we take a byte-string with a non-ASCII byte:
b'abc\xFF'.decode('ascii-compatible')
That would be:
bytestring(b'abc\xFF')
Bytes outside the ASCII range would be mapped to Unicode low surrogates:
bytestring(b'abc\xFF') == bytestring('abc\uDCFF')
Not sure what you mean here. The resulting bytes should be 'abc\xFF' and of length 4.
'abc\xFF' is a Unicode string, but you wouldn't be able to convert it to a bytestring because '\xFF' is a codepoint outside the ASCII range and not a low surrogate.
I can see terminology is going to be a pain in this thread. ;)
My vision for a bytestring type (more refined):
- made up of single bytes in the range 0 - 255 (no unicode anywhere)
- indexing returns a bytestring of length 1, not an integer (as bytes does)
- `bytestring(7)` either fails, or returns 'bytestring('\x07')' not 'bytestring(0, 0, 0, 0, 0, 0, 0)'
So my statement above of 'abc\xFF' should not be interpreted as a unicode string... I guess I'll use 'y' as an abbreviation for now: y'abc\xFF'.
-- ~Ethan~
On Tue, Jan 7, 2014 at 9:43 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
My vision for a bytestring type (more refined):
- made up of single bytes in the range 0 - 255 (no unicode anywhere)
- indexing returns a bytestring of length 1, not an integer (as bytes does)
- `bytestring(7)` either fails, or returns 'bytestring('\x07')' not 'bytestring(0, 0, 0, 0, 0, 0, 0)'
It sounds like you are just unhappy with some of the behavior of the bytes object. I agree that these two behaviors are suboptimal, but it is just too late to change them, and it's not enough to add a new type -- not by a long shot. The constructor behavior can be changed using a custom factory function. The indexing behavior, unfortunately, needs to be dealt with by changing b[i] into b[i:i+1] everywhere. -- --Guido van Rossum (python.org/~guido)
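The two workarounds Guido mentions can be sketched roughly as follows. The name `bytestring_from` is hypothetical, and the choice to turn an int into a single byte (rather than raising) is just one of the two options Ethan listed.

```python
def bytestring_from(value):
    """Hypothetical factory function: like bytes(), but an int means one byte."""
    if isinstance(value, int):
        # bytes(7) would produce seven NUL bytes; wrap the int in a list
        # so it becomes a single byte with that value instead.
        return bytes([value])
    return bytes(value)

data = b'abcdef'
as_int = data[3]       # indexing a bytes object yields an int
as_bytes = data[3:4]   # slicing (b[i:i+1]) yields a length-1 bytes object
print(bytes(7), bytestring_from(7), as_int, as_bytes)
```

The slicing form has the minor advantage that it never raises IndexError; an out-of-range slice simply gives `b''`.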
On 01/07/2014 12:49 PM, Guido van Rossum wrote:
On Tue, Jan 7, 2014 at 9:43 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
My vision for a bytestring type (more refined):
- made up of single bytes in the range 0 - 255 (no unicode anywhere)
- indexing returns a bytestring of length 1, not an integer (as bytes does)
- `bytestring(7)` either fails, or returns 'bytestring('\x07')' not 'bytestring(0, 0, 0, 0, 0, 0, 0)'
It sounds like you are just unhappy with some of the behavior of the bytes object. I agree that these two behaviors are suboptimal, but it is just too late to change them, and it's not enough to add a new type -- not by a long shot. The constructor behavior can be changed using a custom factory function. The indexing behavior, unfortunately, needs to be dealt with by changing b[i] into b[i:i+1] everywhere.
Of course I'm unhappy with it: it doesn't behave the way I think it should, and it's not consistent. The reason I started the thread was to hopefully gather others' requirements to have a truly distinct and useful new type. Doesn't seem to have happened, though. :(
Is it too late to change the repr for bytes? I can't think of anywhere else in the stdlib where what you see is not what you get:
--> [0, 1, 2]
[0, 1, 2]
--> [0, 1, 2][1]
1
--> {'this':'that', 'these':'those'}
{'this': 'that', 'these': 'those'}
--> {'this':'that', 'these':'those'}['these']
'those'
--> 'abcdef'
'abcdef'
--> 'abcdef'[3]
'd'
But of course with bytes:
--> b'abcdef'
b'abcdef'
--> b'abcdef'[3]
100
-- ~Ethan~
On Tue, Jan 7, 2014 at 10:58 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/07/2014 12:49 PM, Guido van Rossum wrote:
On Tue, Jan 7, 2014 at 9:43 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
My vision for a bytestring type (more refined):
- made up of single bytes in the range 0 - 255 (no unicode anywhere)
- indexing returns a bytestring of length 1, not an integer (as bytes does)
- `bytestring(7)` either fails, or returns 'bytestring('\x07')' not 'bytestring(0, 0, 0, 0, 0, 0, 0)'
It sounds like you are just unhappy with some of the behavior of the bytes object. I agree that these two behaviors are suboptimal, but it is just too late to change them, and it's not enough to add a new type -- not by a long shot. The constructor behavior can be changed using a custom factory function. The indexing behavior, unfortunately, needs to be dealt with by changing b[i] into b[i:i+1] everywhere.
Of course I'm unhappy with it, it doesn't behave the way I think it should, and it's not consistent.
Consistent with what? (Before you rush in an answer, remember that there are almost always multiple sides to a consistency argument.)
The reason I started the thread was to hopefully gather others' requirements to have a truly distinct and useful new type. Doesn't seem to have happened, though. :(
So now is the time to man up and live with it. It's not going to change.
Is it too late to change the repr for bytes?
Yes.
I can't think of anywhere else in the stdlib where what you see is not what you get:
--> [0, 1, 2]
[0, 1, 2]
--> [0, 1, 2][1]
1
--> {'this':'that', 'these':'those'}
{'this': 'that', 'these': 'those'}
--> {'this':'that', 'these':'those'}['these']
'those'
--> 'abcdef'
'abcdef'
--> 'abcdef'[3]
'd'
But of course with bytes:
--> b'abcdef'
b'abcdef'
--> b'abcdef'[3]
100
I don't see what's wrong with those. Both produce valid expressions that, when entered, compare equal to the object whose repr() was printed. What more would you *want*? -- --Guido van Rossum (python.org/~guido)
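Guido's criterion (the repr is a valid expression that evaluates to an equal object) can be checked directly. This is only a small illustration using the values from Ethan's examples.

```python
samples = [
    [0, 1, 2],
    {'this': 'that', 'these': 'those'},
    'abcdef',
    b'abcdef',
    b'abcdef'[3],   # the int 100
]

# For each object, evaluate its repr() and compare with the original.
round_trips = [eval(repr(obj)) == obj for obj in samples]
for obj, ok in zip(samples, round_trips):
    print(repr(obj), ok)
```

By this measure the bytes repr and the repr of an indexed element both round-trip; Ethan's complaint is about the element type that indexing returns, not about either repr in isolation.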
Of course I'm unhappy with it, it doesn't behave the way I think it should, and it's not consistent.
Consistent with what? (Before you rush in an answer, remember that there are almost always multiple sides to a consistency argument.)
I don't see what's wrong with those. Both produce valid expressions that, when entered, compare equal to the object whose repr() was printed. What more would you *want*?
I find that the definition of str is inconsistent indeed, because the items in a string are strings again, not characters (or code points). I don't think there are too many other examples in Python where the same is true; indexing a list does not give a list but the item that is at that point.
In [4]: type(b'abc')
Out[4]: builtins.bytes
In [5]: type(b'abc'[1])
Out[5]: builtins.int
In [6]: type('abc')
Out[6]: builtins.str
In [7]: type('abc'[1])
Out[7]: builtins.str
There is no byte type in Python, so the closest is int (there is a byte type in numpy); if there were one, indexing a byte array could return that, but I assume the use case would be quite limited. But that there are no "characters" but only strings of length one is a confusing concept. It is as if scalars were the same as arrays of length one. These are different concepts, however. (Though, admittedly, numpy will take arrays of length 1 as scalars at least in some cases as a convenience, though I think it should not, as it prevents users from writing consistent code that will be easy to read later. The same is the case here for Python with strings.)
In [11]: [1,2,3] + [1]
Out[11]: [1, 2, 3, 1]
In [12]: [1,2,3] + [1][0]
TypeError: can only concatenate list (not "int") to list
In [13]: 'abc' + 'd'
Out[13]: 'abcd'
In [14]: 'abc' + 'd'[0]
Out[14]: 'abcd'
So, yes, the interface to strings and arrays is inconsistent. At least in this aspect.
You're off-topic for this sub-thread. Ethan said he wanted to change the repr() of bytes, but didn't specify what change he wanted. The inconsistency in the *interface* is not under discussion any more (I've already said I agree it is unfortunate, but not bad enough to warrant a new type or a backward incompatible change).
On Tue, Jan 7, 2014 at 12:36 PM, Alexander Heger <python@2sn.net> wrote:
Of course I'm unhappy with it, it doesn't behave the way I think it should, and it's not consistent.
Consistent with what? (Before you rush in an answer, remember that there are almost always multiple sides to a consistency argument.)
I don't see what's wrong with those. Both produce valid expressions that, when entered, compare equal to the object whose repr() was printed. What more would you *want*?
I find that the definition of str is inconsistent indeed, because the items in a string are strings again, not characters (or code points). I don't think there are too many other examples in Python where the same is true; indexing a list does not give a list but the item that is at that point.
In [4]: type(b'abc')
Out[4]: builtins.bytes
In [5]: type(b'abc'[1])
Out[5]: builtins.int
In [6]: type('abc')
Out[6]: builtins.str
In [7]: type('abc'[1])
Out[7]: builtins.str
There is no byte type in Python, so the closest is int (there is a byte type in numpy); if there were one, indexing a byte array could return that, but I assume the use case would be quite limited. But that there are no "characters" but only strings of length one is a confusing concept. It is as if scalars were the same as arrays of length one. These are different concepts, however. (Though, admittedly, numpy will take arrays of length 1 as scalars at least in some cases as a convenience, though I think it should not, as it prevents users from writing consistent code that will be easy to read later. The same is the case here for Python with strings.)
In [11]: [1,2,3] + [1]
Out[11]: [1, 2, 3, 1]
In [12]: [1,2,3] + [1][0]
TypeError: can only concatenate list (not "int") to list
In [13]: 'abc' + 'd'
Out[13]: 'abcd'
In [14]: 'abc' + 'd'[0]
Out[14]: 'abcd'
so, yes, the interface to strings and arrays is inconsistent. At least in this aspect. _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (python.org/~guido)
On 01/07/2014 04:36 PM, Alexander Heger wrote:
Of course I'm unhappy with it, it doesn't behave the way I think it should, and it's not consistent.
Consistent with what? (Before you rush in an answer, remember that there are almost always multiple sides to a consistency argument.) I don't see what's wrong with those. Both produce valid expressions that, when entered, compare equal to the object whose repr() was printed. What more would you *want*?
I find that the definition of str is inconsistent indeed, because the items in a string are strings again, not characters (or code points). I don't think there are too many other examples in Python where the same is true; indexing a list does not give a list but the item that is at that point.
If you use slices, then it's more consistent with strings. A slice of a list gives you a list, a slice of a string gives you a string. The idea of sub-components always breaks down at some level. Then it shifts to equivalent translations, rather than smaller units. Like converting strings to bytes, and back again. They aren't sub-components of each other. Where you draw the lines is dependent on how close you look. (Python, bytecode, C code, assembly, bytes, bits, voltages, ...) We can stay at the Python level if we choose the viewpoint that an object is the Python code that creates that object. We have to allow for the execution of that code in our understanding of it. Cheers, Ron
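Ron's observation about slices is easy to confirm for the builtin sequence types. The snippet below is only an illustration; the choice of sample values is arbitrary.

```python
sequences = [[0, 1, 2], (0, 1, 2), 'abc', b'abc']

for seq in sequences:
    # Columns: container type, type of a slice, type of a single item.
    print(type(seq).__name__, type(seq[1:2]).__name__, type(seq[1]).__name__)

# Slicing always preserves the container type...
same_type = [type(seq[1:2]) is type(seq) for seq in sequences]
# ...while indexing does so only for str.
item_is_container_type = [type(seq[1]) is type(seq) for seq in sequences]
```

Seen this way, str is the odd one out for indexing, and bytes behaves like list and tuple.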
On 2014-01-07 19:43, Ethan Furman wrote:
On 01/07/2014 11:32 AM, MRAB wrote:
On 2014-01-07 18:38, Ethan Furman wrote:
On 01/07/2014 10:22 AM, MRAB wrote:
On Jan 7, 2014, at 7:44, Steven D'Aprano <steve@pearwood.info> wrote:
Suppose we take a byte-string with a non-ASCII byte:
b'abc\xFF'.decode('ascii-compatible')
That would be:
bytestring(b'abc\xFF')
Bytes outside the ASCII range would be mapped to Unicode low surrogates:
bytestring(b'abc\xFF') == bytestring('abc\uDCFF')
Not sure what you mean here. The resulting bytes should be 'abc\xFF' and of length 4.
'abc\xFF' is a Unicode string, but you wouldn't be able to convert it to a bytestring because '\xFF' is a codepoint outside the ASCII range and not a low surrogate.
I can see terminology is going to be a pain in this thread. ;)
My vision for a bytestring type (more refined):
- made up of single bytes in the range 0 - 255 (no unicode anywhere)
- indexing returns a bytestring of length 1, not an integer (as bytes does)
- `bytestring(7)` either fails, or returns 'bytestring('\x07')' not 'bytestring(0, 0, 0, 0, 0, 0, 0)'
So my statement above of 'abc\xFF' should not be interpreted as a unicode string... I guess I'll use 'y' as an abbreviation for now: y'abc\xFF'.
No disagreement there. The point about Unicode is about how it could behave if mixed with Unicode strings.
On 1/7/2014 2:43 PM, Ethan Furman wrote:
My vision for a bytestring type (more refined):
- made up of single bytes in the range 0 - 255 (no unicode anywhere)
- indexing returns a bytestring of length 1, not an integer (as bytes does)
- `bytestring(7)` either fails, or returns 'bytestring('\x07')' not 'bytestring(0, 0, 0, 0, 0, 0, 0)'
To me, a major feature of Python is that it a) has more than one basic structure type (versus just strings or symbolic expressions) but b) is conservative in its multiplicity. It is not minimal, but it is minimalistic. It took over a decade for Guido to agree that Python should have separate built-in bool and set classes instead of just using ints as bools and tuples, lists, and dicts as sets, or using imported classes for either. The above describes a minor variation on bytes and seems to me to be a classic case for subclassing, whether in Python for ease or C for speed, in an imported module. The result could be kept private or made public as you wish. Yes, the minor differences would be important to you, the author of the subclass, but that is always the motivation for subclassing. One of the major advances in Python was to make it possible (in 2.2) to subclass the basic builtin structure classes. It seems to me that subclasses that work in multiple versions of Python, such as are already being used, are the appropriate solution to the specialized problems that people have with the Python string builtins. -- Terry Jan Reedy
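As a rough idea of what the subclassing route Terry suggests might look like, here is a minimal sketch covering only the constructor and indexing points from Ethan's list. The class is hypothetical and far from complete: iteration still yields ints, and inherited methods such as `upper()` or `+` return plain bytes.

```python
class bytestring(bytes):
    """Hypothetical bytes subclass; a sketch, not a finished type."""

    def __new__(cls, value=b''):
        if isinstance(value, int):
            # Ethan's list allows either failing or producing one byte;
            # this sketch chooses to fail.
            raise TypeError('bytestring() does not accept a bare integer')
        return super().__new__(cls, value)

    def __getitem__(self, index):
        item = super().__getitem__(index)
        if isinstance(item, int):
            # Integer index: wrap the int back into a length-1 byte string.
            item = bytes([item])
        return type(self)(item)


y = bytestring(b'abc\xff')
print(len(y), y[3], y[1:3])
```

Because it subclasses bytes, an instance can still be passed to any API that expects bytes.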
Terry Reedy writes:
The above describes a minor variation on bytes and seems to me to be a classic case for subclassing, whether in Python for ease or C for speed, in an imported module.
I agree with you, but the discussion on python-dev indicates that the majority of core devs, including Guido IIUC, disagree with us. In fact they want to add many str-like capabilities to bytes (and the related mutable classes bytearray and memoryview).
On 8 Jan 2014 14:08, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Terry Reedy writes:
The above describes a minor variation on bytes and seems to me to be a classic case for subclassing, whether in Python for ease or C for speed, in an imported module.
I agree with you, but the discussion on python-dev indicates that the majority of core devs, including Guido IIUC, disagree with us. In fact they want to add many str-like capabilities to bytes (and the related mutable classes bytearray and memoryview).
That's far from a foregone conclusion. The main problem we've had over the past few years is the inability to get past "just give us back the Python 2 str type" responses from wire protocol developers attempting to migrate that aren't happy with the approach of manipulating data in the text domain and on to actual experiments with a suitable type for wire protocol development that interoperates nicely with the Python 3 text model. Now that your proposal has been better explained, yes, I agree that "asciibytes" and "asciistr" types would be well worth experimenting with. I mention both, since it's far from clear if a str subclass or a bytes subclass (or neither, although that may require bug fixes in CPython) would be more convenient for this use case. The key difference between such a type and a str with surrogate escaped elements or a Python 2 bytestring is that it would attempt to implicitly *encode* any Unicode text it encountered as strict ASCII text. This would allow text and binary processing to share code paths, with limited risk of producing mojibake (particularly since this type wouldn't be a builtin). The type would also share the str behaviour of returning a single element subsequence when indexed rather than an integer. Cheers, Nick.
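To make the "implicitly encode as strict ASCII" behaviour concrete, here is one possible sketch as a bytes subclass. The name `asciibytes` comes from the message above, but everything else (supporting only `+` and indexing, the helper `_coerce`) is an assumption made for illustration; the thread leaves open whether a str or bytes subclass would be the better base.

```python
class asciibytes(bytes):
    """Hypothetical sketch: bytes that accept str operands by strict ASCII encoding."""

    @staticmethod
    def _coerce(other):
        # Strict encoding: non-ASCII text raises UnicodeEncodeError
        # instead of silently producing mojibake.
        if isinstance(other, str):
            return other.encode('ascii')
        return other

    def __add__(self, other):
        return asciibytes(bytes.__add__(self, self._coerce(other)))

    def __radd__(self, other):
        other = self._coerce(other)
        if not isinstance(other, bytes):
            return NotImplemented
        return asciibytes(bytes.__add__(other, self))

    def __getitem__(self, index):
        item = bytes.__getitem__(self, index)
        if isinstance(item, int):
            item = bytes([item])   # str-like: a length-1 element, not an int
        return asciibytes(item)


request = asciibytes(b'GET ') + '/index.html' + b' ' + 'HTTP/1.1'
print(request)
```

A real experiment would need to cover the remaining str-like methods (`join`, `format`, `%`, comparisons, and so on), which is where most of the design questions raised in the thread would show up.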
On 08/01/2014 09:59, Nick Coghlan wrote:
Now that your proposal has been better explained, yes, I agree that "asciibytes" and "asciistr" types would be well worth experimenting with. I mention both, since it's far from clear if a str subclass or a bytes subclass (or neither, although that may require bug fixes in CPython) would be more convenient for this use case.
Could you subclass both to get the best of both worlds? As in class asciixyz(str, bytes):
Cheers, Nick.
-- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On Jan 8, 2014, at 2:18, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
On 08/01/2014 09:59, Nick Coghlan wrote:
Now that your proposal has been better explained, yes, I agree that "asciibytes" and "asciistr" types would be well worth experimenting with. I mention both, since it's far from clear if a str subclass or a bytes subclass (or neither, although that may require bug fixes in CPython) would be more convenient for this use case.
Could you subclass both to get the best of both worlds? As in
class asciixyz(str, bytes):
You can't. (Try it.)
More importantly, how would that work? You'd have the implementation of str (effectively a tagged union of char8/char16/char32 arrays) plus the separate implementation of bytes (effectively a char8 array). Do you leave the first one empty? And then avoid super() and instead explicitly delegate only to the bytes base? That could work (at the relatively minimal cost of an extra empty '' worth of storage) as long as you don't run into any code that tries to use the internal details of the str.
But unfortunately, most builtins and extension module functions _do_ try to use the internal details of the str. In CPython, for example, a function that takes a string usually does so by parsing the argument as, say, a u#, which gives you the character array from a str directly. Even functions that take str objects will usually at some point call string-protocol functions to get at their array.
The simple way around this is to make all such functions effectively call __str__ on any object that isn't a real str. But that would make almost _everything_ usable as a string--f.write(2) would now work. So you'd really need to create a new dunder method (and C API slot) __asstr__ that's only implemented by objects that really want to act like a str, not just have a str representation. Also, I'm not sure all such functions have a reasonable way to refcount the resulting str object properly.
The alternative would be to expose the entire string protocol into Python--including, most importantly, the methods to get at the array directly. I'm not sure how you'd even design the API for those methods in Python. We don't even expose the buffer protocol to Python today.
I didn't go into all this detail to try to prove that the idea is impossible, but rather in hopes that someone would have an answer that makes everything work. Making string-protocol strings more "pluggable" might have other benefits besides the "encodedstr" type. Imagine being able to build an explicitly UTF-16 type to make it faster and easier to deal with Win32 or Java or other such things. (Or could you just use encodedstr('utf-16-le') for that?) Or expose a "rope"-like type for large mutable strings. Or experiment with alternatives to the 3.3-style internal storage, like Stephen's ASCII-compatible byte-smuggling flag, by faking them in Python instead of building them in C. (That would probably be sufficient to find any holes in the specification, even if it wouldn't be very helpful for perf testing.)
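For anyone who wants to take up the "try it" suggestion, a small check like the following shows what happens in CPython. The exact wording of the error message is an implementation detail and may differ between versions, so the sketch only relies on the exception type.

```python
# Attempt to create a class deriving from both str and bytes.
try:
    asciixyz = type('asciixyz', (str, bytes), {})
    combined_ok = True
except TypeError as exc:
    # CPython refuses because the two builtin bases have incompatible
    # instance memory layouts.
    combined_ok = False
    print('cannot combine str and bytes:', exc)
```

Writing `class asciixyz(str, bytes): pass` fails in the same way; the `type(...)` form is used here only so the failure can be caught in one place.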
On 08/01/2014 17:57, Andrew Barnert wrote:
On Jan 8, 2014, at 2:18, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
On 08/01/2014 09:59, Nick Coghlan wrote:
Now that your proposal has been better explained, yes, I agree that "asciibytes" and "asciistr" types would be well worth experimenting with. I mention both, since it's far from clear if a str subclass or a bytes subclass (or neither, although that may require bug fixes in CPython) would be more convenient for this use case.
Could you subclass both to get the best of both worlds? As in
class asciixyz(str, bytes):
You can't. (Try it.) More importantly, how would that work?
I haven't the faintest idea :)
but rather in hopes that someone would have an answer that makes everything work.
The reason I threw this in in the first place. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
participants (10)
- Alexander Heger
- Andrew Barnert
- Ethan Furman
- Guido van Rossum
- Mark Lawrence
- MRAB
- Nick Coghlan
- Ron Adam
- Stephen J. Turnbull
- Terry Reedy