Re: [Python-Dev] bytes / unicode
At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote:
It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need to grow "encoding" keyword arguments that they then pass on to the functions they call or use to convert str arguments to bytes (or vice-versa). But without people trying to port affected libraries and reporting bugs when they find issues, the situation isn't going to improve.
Now, if these bugs are already being reported against 3.1 and just aren't getting fixed, that's a completely different story...
The overall impression, though, is that this isn't really a step forward. Now, bytes are the special case instead of unicode, but that special case isn't actually handled any better by the stdlib - in fact, it's arguably worse. And, the burden of addressing this seems to have been shifted from the people who made the change, to the people who are going to use it. But those people are not necessarily in a position to tell you anything more than, "give me something that works with bytes".

What I can tell you is that before, since string constants in the stdlib were ascii bytes, and transparently promoted to unicode, stdlib behavior was *predictable* in the presence of special cases: you got back either bytes or unicode, but either way, you could idempotently upgrade the result to unicode, or just pass it on. APIs were "str safe, unicode aware". If you passed in bytes, you weren't going to get unicode without a warning, and if you passed in unicode, it'd work and you'd get unicode back.

Now, the APIs are neither safe nor aware -- if you pass bytes in, you get unpredictable results back.

Ironically, it almost *would* have been better if bytes simply didn't work as strings at all, *ever*, but if you could wrap them with a bstr() to *treat* them as text. You could still have restrictions on combining them, as long as it was a restriction on the unicode you mixed with them. That is, if you could combine a bstr and a str if the *str* was restricted to ASCII.

If we had the Python 3 design discussions to do over again, I think I would now have stuck with the position of not letting bytes be string-compatible at all, and instead proposed an explicit bstr() wrapper/adapter to use them as strings, that would (in that case) force coercion in the direction of bytes rather than strings. (And bstr need not have been a builtin - it could have been something you import, to help discourage casual usage.)
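A minimal sketch of how such an adapter might behave (hypothetical -- bstr was never implemented, and the coercion rule shown is just one possible reading of the proposal):

```python
class bstr(bytes):
    """Hypothetical adapter (never implemented): lets bytes be used as
    text, but coerces mixed bstr/str operations toward bytes, and only
    when the str side is restricted to ASCII."""
    def __add__(self, other):
        if isinstance(other, str):
            # Mixing is allowed only if the *str* is pure ASCII;
            # .encode('ascii') raises for anything else.
            return bstr(bytes(self) + other.encode('ascii'))
        return bstr(bytes(self) + other)

s = bstr(b'header: ') + 'value'   # ASCII str: allowed, result stays bytes
try:
    bstr(b'x: ') + 'café'         # non-ASCII str: refused
    coerced = True
except UnicodeEncodeError:
    coerced = False
```

Note the direction of the coercion: the result is always bytes, so a bstr never silently "infects" a computation with unicode.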
Might this approach lead to some people doing things wrong in the case of porting? Sure. But there'd be little reason to use it in new code that didn't have a real need for bytestring manipulation.

It might've been a better balance between practicality and purity, in that it keeps the language pure, while offering a practical way to deal with things in bytes if you really need to. And, bytes wouldn't silently succeed *some* of the time, leading to a trap. An easy inconsistency is worse than a bit of uniform chicken-waving.

Is it too late to make that tradeoff? Probably. Certainly it's not practical to *implement* outside the language core, and removing string methods would fux0r anybody whose currently-ported code relies on bytes objects having string-like methods.
On 21/06/2010 17:46, P.J. Eby wrote:
At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote:
It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need to grow "encoding" keyword arguments that they then pass on to the functions they call or use to convert str arguments to bytes (or vice-versa). But without people trying to port affected libraries and reporting bugs when they find issues, the situation isn't going to improve.
Now, if these bugs are already being reported against 3.1 and just aren't getting fixed, that's a completely different story...
The overall impression, though, is that this isn't really a step forward. Now, bytes are the special case instead of unicode, but that special case isn't actually handled any better by the stdlib - in fact, it's arguably worse. And, the burden of addressing this seems to have been shifted from the people who made the change, to the people who are going to use it. But those people are not necessarily in a position to tell you anything more than, "give me something that works with bytes".
What I can tell you is that before, since string constants in the stdlib were ascii bytes, and transparently promoted to unicode, stdlib behavior was *predictable* in the presence of special cases: you got back either bytes or unicode, but either way, you could idempotently upgrade the result to unicode, or just pass it on. APIs were "str safe, unicode aware". If you passed in bytes, you weren't going to get unicode without a warning, and if you passed in unicode, it'd work and you'd get unicode back.
Now, the APIs are neither safe nor aware -- if you pass bytes in, you get unpredictable results back.
Ironically, it almost *would* have been better if bytes simply didn't work as strings at all, *ever*, but if you could wrap them with a bstr() to *treat* them as text. You could still have restrictions on combining them, as long as it was a restriction on the unicode you mixed with them. That is, if you could combine a bstr and a str if the *str* was restricted to ASCII.
If we had the Python 3 design discussions to do over again, I think I would now have stuck with the position of not letting bytes be string-compatible at all, and instead proposed an explicit bstr() wrapper/adapter to use them as strings, that would (in that case) force coercion in the direction of bytes rather than strings. (And bstr need not have been a builtin - it could have been something you import, to help discourage casual usage.)
Might this approach lead to some people doing things wrong in the case of porting? Sure. But there'd be little reason to use it in new code that didn't have a real need for bytestring manipulation.
It might've been a better balance between practicality and purity, in that it keeps the language pure, while offering a practical way to deal with things in bytes if you really need to. And, bytes wouldn't silently succeed *some* of the time, leading to a trap. An easy inconsistency is worse than a bit of uniform chicken-waving.
Is it too late to make that tradeoff? Probably. Certainly it's not practical to *implement* outside the language core, and removing string methods would fux0r anybody whose currently-ported code relies on bytes objects having string-like methods.
Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks?

Michael
On Mon, Jun 21, 2010 at 9:46 AM, P.J. Eby <pje@telecommunity.com> wrote:
At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote:
It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need to grow "encoding" keyword arguments that they then pass on to the functions they call or use to convert str arguments to bytes (or vice-versa). But without people trying to port affected libraries and reporting bugs when they find issues, the situation isn't going to improve.
Now, if these bugs are already being reported against 3.1 and just aren't getting fixed, that's a completely different story...
The overall impression, though, is that this isn't really a step forward. Now, bytes are the special case instead of unicode, but that special case isn't actually handled any better by the stdlib - in fact, it's arguably worse. And, the burden of addressing this seems to have been shifted from the people who made the change, to the people who are going to use it. But those people are not necessarily in a position to tell you anything more than, "give me something that works with bytes".
What I can tell you is that before, since string constants in the stdlib were ascii bytes, and transparently promoted to unicode, stdlib behavior was *predictable* in the presence of special cases: you got back either bytes or unicode, but either way, you could idempotently upgrade the result to unicode, or just pass it on. APIs were "str safe, unicode aware". If you passed in bytes, you weren't going to get unicode without a warning, and if you passed in unicode, it'd work and you'd get unicode back.
Actually, the big problem with Python 2 is that if you mix str and unicode, things work or crash depending on whether any of the str objects involved contain non-ASCII bytes. If one API decides to upgrade to Unicode, the result, when passed to another API, may well cause a UnicodeError because not all arguments have had the same treatment.
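The failure mode Guido describes can be reproduced by hand even from Python 3: Python 2's implicit str/unicode mixing amounted to a hidden ASCII decode of the byte string, so whether it worked depended entirely on the data involved:

```python
# Python 2's str + unicode mixing effectively did data.decode('ascii')
# behind your back; the same code path works or crashes by data content:
ok = b'hello'.decode('ascii')        # pure ASCII: silently "works"
try:
    b'caf\xc3\xa9'.decode('ascii')   # the UTF-8 bytes for 'café'
    crashed = False
except UnicodeDecodeError:
    crashed = True                   # same operation, different data: boom
```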
Now, the APIs are neither safe nor aware -- if you pass bytes in, you get unpredictable results back.
This seems an overgeneralization of a particular bug. There are APIs that are strictly text-in, text-out. There are others that are bytes-in, bytes-out. Let's call all those *pure*. For some operations it makes sense that the API is *polymorphic*, with which I mean that text-in causes text-out, and bytes-in causes byte-out. All of these are fine.

Perhaps there are more situations where a polymorphic API would be helpful. Such APIs are not always so easy to implement, because they have to be careful with literals or other constants (and even more so mutable state) used internally -- but it can be done, and there are plenty of examples in the stdlib.

The real problem apparently lies in (what I believe is only a few rare) APIs that are text-or-bytes-in and always-text-out (or always-bytes-out). Let's call them *hybrid*. Clearly, mixing hybrid APIs in a stream of pure or polymorphic API calls is a problem, because they turn a pure or polymorphic overall operation into a hybrid one.

There are also text-in, bytes-out or bytes-in, text-out APIs that are intended for encoding/decoding of course, but these are in a totally different class.

Abstractly, it would be good if there were as few as possible hybrid APIs, many pure or polymorphic APIs (which it should be in a particular case is a pragmatic choice), and a limited number of encoding/decoding APIs, which should generally be invoked at the edges of the program (e.g., I/O).
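A sketch of the "careful with literals" point: a polymorphic helper must choose its internal constants to match the input type (hypothetical example, not a stdlib function):

```python
def strip_quotes(s):
    """Polymorphic: text in -> text out, bytes in -> bytes out.
    Internal constants must be chosen to match the argument's type."""
    quote = b'"' if isinstance(s, bytes) else '"'
    if len(s) >= 2 and s.startswith(quote) and s.endswith(quote):
        return s[1:-1]
    return s
```

A hard-coded `'"'` literal here would make the bytes path raise a TypeError, which is exactly the kind of latent bug Guido describes.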
Ironically, it almost *would* have been better if bytes simply didn't work as strings at all, *ever*, but if you could wrap them with a bstr() to *treat* them as text. You could still have restrictions on combining them, as long as it was a restriction on the unicode you mixed with them. That is, if you could combine a bstr and a str if the *str* was restricted to ASCII.
ISTR that we considered something like this and decided to stay away from it. At this point I think that a successful 3rd party bstr implementation would be required before we rush to add one to the stdlib.
If we had the Python 3 design discussions to do over again, I think I would now have stuck with the position of not letting bytes be string-compatible at all,
They aren't, unless you consider the presence of some methods with similar behavior (.lower(), .split() and so on) and the existence of some polymorphic APIs (see above) as "compatibility".
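For instance, those shared methods operate entirely within the bytes domain and never promote to str, and mixing the two types is rejected outright:

```python
# The str-like methods on bytes stay in the bytes domain:
parts = b'A,B'.lower().split(b',')   # still bytes, never promoted to str
try:
    b'a' + 'b'                       # mixing the types is refused
    mixed = True
except TypeError:
    mixed = False
```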
and instead proposed an explicit bstr() wrapper/adapter to use them as strings, that would (in that case) force coercion in the direction of bytes rather than strings. (And bstr need not have been a builtin - it could have been something you import, to help discourage casual usage.)
I'm still unclear on exactly what bstr is supposed to be, but it sounds a bit like one of the rejected proposals for having a single (Unicode-capable) str type that is implemented using different width encodings (Latin-1, UCS-2, UCS-4) underneath.
Might this approach lead to some people doing things wrong in the case of porting? Sure. But there'd be little reason to use it in new code that didn't have a real need for bytestring manipulation.
It might've been a better balance between practicality and purity, in that it keeps the language pure, while offering a practical way to deal with things in bytes if you really need to. And, bytes wouldn't silently succeed *some* of the time, leading to a trap. An easy inconsistency is worse than a bit of uniform chicken-waving.
I still believe that the instances of bytes silently succeeding *some* of the time refer to specific bugs in specific APIs, either intentional because of misguided compatibility desires, or accidental in the haste of trying to convert the entire stdlib to Python 3 in a finite time.
Is it too late to make that tradeoff? Probably. Certainly it's not practical to *implement* outside the language core, and removing string methods would fux0r anybody whose currently-ported code relies on bytes objects having string-like methods.
--Guido van Rossum (python.org/~guido)
At 05:49 PM 6/21/2010 +0100, Michael Foord wrote:
Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks?
__contains__ doesn't have a converse operation, so you can't code a type that works around this (Python 3.1 shown):
>>> from os.path import join
>>> join(b'x', 'y')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python31\lib\ntpath.py", line 161, in join
    if b[:1] in seps:
TypeError: Type str doesn't support the buffer API
>>> join('y', b'x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python31\lib\ntpath.py", line 161, in join
    if b[:1] in seps:
TypeError: 'in <string>' requires string as left operand, not bytes
IOW, only one of these two cases can be worked around by using a bstr (or ebytes) that doesn't have support from the core string type. I'm not sure if the "in" operator is the only case where implementing such a type would fail, but it's the most obvious one. String formatting, of both the % and .format() varieties is another. (__rmod__ doesn't help if your bytes object is one of several data items in a tuple or dict -- the common case for % formatting.)
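The asymmetry is easy to demonstrate: when the str is the container, its `__contains__` rejects the bytes operand, and there is no reflected hook a wrapper type could supply:

```python
seps = '\\/'            # a str container, as in ntpath.join's seps
item = b'x'[:1]         # a length-1 bytes slice, like `b[:1] in seps`
try:
    item in seps        # str.__contains__ receives bytes and refuses
    contained = True
except TypeError:
    contained = False
# No wrapper type on the bytes side can intercept this: the `in`
# operator consults only the container's __contains__, and there is
# no reflected __rcontains__ protocol to fall back on.
```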
On 6/21/2010 1:29 PM, P.J. Eby wrote:
At 05:49 PM 6/21/2010 +0100, Michael Foord wrote:
Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks?
__contains__ doesn't have a converse operation, so you can't code a type that works around this (Python 3.1 shown):
>>> from os.path import join
>>> join(b'x', 'y')
>>> join('y', b'x')
I am really unclear what result you intend for such mixed pairs, for all possible mixed pairs, sensible or not. It would seem to me best to write your own pjoin function that did exactly what you want over the whole input domain.

-- Terry Jan Reedy
On 6/21/2010 1:29 PM, Guido van Rossum wrote:
Actually, the big problem with Python 2 is that if you mix str and unicode, things work or crash depending on whether any of the str objects involved contain non-ASCII bytes.
If one API decides to upgrade to Unicode, the result, when passed to another API, may well cause a UnicodeError because not all arguments have had the same treatment.
Now, the APIs are neither safe nor aware -- if you pass bytes in, you get unpredictable results back.
This seems an overgeneralization of a particular bug. There are APIs that are strictly text-in, text-out. There are others that are bytes-in, bytes-out. Let's call all those *pure*. For some operations it makes sense that the API is *polymorphic*, with which I mean that text-in causes text-out, and bytes-in causes byte-out. All of these are fine.
Perhaps there are more situations where a polymorphic API would be helpful. Such APIs are not always so easy to implement, because they have to be careful with literals or other constants (and even more so mutable state) used internally -- but it can be done, and there are plenty of examples in the stdlib.
The real problem apparently lies in (what I believe is only a few rare) APIs that are text-or-bytes-in and always-text-out (or always-bytes-out). Let's call them *hybrid*. Clearly, mixing hybrid APIs in a stream of pure or polymorphic API calls is a problem, because they turn a pure or polymorphic overall operation into a hybrid one.
There are also text-in, bytes-out or bytes-in, text-out APIs that are intended for encoding/decoding of course, but these are in a totally different class.
Abstractly, it would be good if there were as few as possible hybrid APIs, many pure or polymorphic APIs (which it should be in a particular case is a pragmatic choice), and a limited number of encoding/decoding APIs, which should generally be invoked at the edges of the program (e.g., I/O).
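In code, "at the edges" typically means wrapping the byte stream once and working in pure text from then on -- a sketch using an in-memory stream as a stand-in for a socket or file:

```python
import io

# Treat `raw` as a stand-in for a socket or binary file: bytes exist
# only at the program's edge, and decoding happens exactly once, here.
raw = io.BytesIO('héllo\nworld\n'.encode('utf-8'))
text = io.TextIOWrapper(raw, encoding='utf-8')
first = text.readline()   # from this point on, everything is pure text
```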
Nice summary of part of the 'why' for Python3.
I still believe that the instances of bytes silently succeeding *some* of the time refer to specific bugs in specific APIs, either intentional because of misguided compatibility desires, or accidental in the haste of trying to convert the entire stdlib to Python 3 in a finite time.
I think http://bugs.python.org/issue5468 reports one aspect of haste, missing encoding and errors parameters. But it has not gotten much attention.

-- Terry Jan Reedy
participants (4)
- Guido van Rossum
- Michael Foord
- P.J. Eby
- Terry Reedy