
I see it a little differently (though there is probably a common concept lurking in here). The protocols you mention are intentionally designed to be encoding-neutral as long as the encoding is an ASCII superset. This covers ASCII itself, Latin-1, Latin-N for other values of N, MacRoman, Microsoft's code pages (most of them anyway), UTF-8, presumably at least some of the Japanese encodings, and probably a host of others. But it does not cover UTF-16, EBCDIC, and others. (Encodings that have "shift bytes" that change the meaning of some or all ordinary ASCII characters also aren't covered, unless such an encoding happens to exclude the special characters that the protocol spec cares about.)

The protocol specs typically go out of their way to specify what byte values they use for syntactically significant positions (e.g. ':' in headers, or '/' in URLs), while hand-waving about the meaning of what goes in between, since it is all typically treated as "not of syntactic significance". So you can write a parser that looks at bytes exclusively, scanning for a handful of ASCII punctuation characters (e.g. '<', '>', '/', '&'), and that doesn't know or care whether the stuff in between is encoded in Latin-15, MacRoman or UTF-8 -- it never looks "inside" stretches of characters between the special characters and just copies them. (Sometimes there may be *some* sections that are required to be ASCII, and there the equivalence of a-z and A-Z is well defined.)

But I wouldn't go so far as to claim that interpreting the protocols as text is wrong. After all, we're talking exclusively about protocols that are intentionally designed to be directly "human readable" (albeit as a fall-back option) -- the only tool you need to debug the traffic on the wire or socket is something that knows which subset of ASCII is considered "printable" and renders everything else safely as a hex escape, or even as a special "unknown" character (like Unicode's "?" inside a black diamond).

Depending on the requirements of a specific app (or framework), it may be entirely reasonable to convert everything to Unicode and process the resulting text; in other contexts it makes more sense to keep everything as bytes. It also makes sense to have an interface library for a specific protocol that treats the protocol side as bytes but interacts with the application using text, since that is often how the application programmer wants to treat it anyway.

Of course, some protocols require the application programmer to be aware of bytes as well in *some* cases -- examples are email and HTTP, which can be used to transfer text as well as binary data (e.g. images). There is also the bootstrap problem, where the wire data must be partially parsed in order to find out the encoding to be used to convert it to text. But that doesn't mean it's invalid to think about it as text in many application contexts.

Regarding the proposal of a String ABC, I hope this isn't going to become a backdoor to reintroduce the Python 2 madness of allowing equivalence between text and bytes for *some* strings of bytes and not others.

Finally, I do think that we should not introduce changes to the fundamental behavior of text and bytes while the moratorium is in place. Changes to specific stdlib APIs are fine, however.

--Guido
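A minimal sketch of the byte-level parsing style described above; the function and the sample data are invented for illustration, not taken from any stdlib or from the thread:

    # Byte-level header parser: only the ASCII delimiters (b':' and
    # b'\r\n') are syntactically significant; the bytes in between are
    # copied verbatim, so any ASCII-superset encoding passes through.
    def parse_headers(raw):
        headers = {}
        for line in raw.split(b'\r\n'):
            if not line:
                break  # blank line ends the header block
            name, _, value = line.partition(b':')
            # a-z/A-Z equivalence is applied only to the ASCII header
            # name; the value bytes are never decoded or inspected.
            headers[name.strip().lower()] = value.strip()
        return headers

    raw = b'Host: example.com\r\nX-Note: caf\xc3\xa9\r\n\r\nbody...'
    assert parse_headers(raw)[b'x-note'] == b'caf\xc3\xa9'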
On Thu, Jun 24, 2010 at 12:49 PM, Ian Bicking <ianb@colorstudy.com> wrote:

On Thu, Jun 24, 2010 at 12:38 PM, Bill Janssen <janssen@parc.com> wrote:
Here are a couple of ideas I'm taking away from the bytes/string discussion.
First, it would probably be a good idea to have a String ABC.
Secondly, maybe the string situation in 2.x wasn't as broken as we thought it was. In particular, those who deal with lots of encoded strings seemed to find it handy, and miss it in 3.x. Perhaps strings are more like numbers than we think. We have separate types for int, float, Decimal, etc. But they're all numbers, and they all cross-operate. In 2.x, it seems there were two missing features: no encoding attribute on str, which should have been there and should have been required, and the default encoding being "ASCII" (I can't tell you how many times I've had to fix that issue when a non-ASCII encoded str was passed to some output function).
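To make the numeric analogy concrete, a quick stdlib-only illustration (not from the original post): int and Decimal cross-operate and promote, while float and Decimal deliberately refuse to mix.

    from decimal import Decimal

    print(type(1 + 1.5))             # <class 'float'>: int promotes to float
    print(type(1 + Decimal('1.5')))  # <class 'decimal.Decimal'>
    # But mixing float and Decimal raises TypeError -- cross-operation
    # is defined only where it is safe:
    # 1.5 + Decimal('1.5')  # TypeError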
I've started to form a conceptual notion that I think fits these cases.
We've set up a system where we think of text as natively unicode, with encodings to put that unicode into a byte form. This is certainly appropriate in a lot of cases. But there's a significant class of problems where bytes are the native structure; network protocols, which we've been discussing, are a notable case. That is, b'/' is the most native sense of a path separator in a URL, and b':' is the most native sense of what separates a header name from a header value in HTTP. To disallow unicode URLs or unicode HTTP headers would be rather anti-social, especially because unicode is now the "native" string type in Python 3. (As an aside, for the WSGI spec we've been talking about using "native" strings in some positions like dictionary keys, meaning Python 2 str and Python 3 str, while being more exacting in other areas, such as a response body, which would always be bytes -- see the sketch below.)
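A sketch of that "native string" convention as it looks on Python 3; the variable names are mine, and this mirrors the discussion rather than any finished spec:

    # Header-ish positions use the native str type; the response body is
    # always bytes, never text.
    status = '200 OK'                                   # native str
    headers = [('Content-Type', 'text/plain; charset=utf-8')]
    body = 'caf\u00e9'.encode('utf-8')                  # body: always bytes

    assert isinstance(status, str)
    assert isinstance(body, bytes)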
The HTTP spec and other network protocols seem a little fuzzy on this, because they were written before unicode even existed, and even later activity happened at a point when "unicode" and "text" weren't widely considered the same thing the way they are now. But I think the original intention is revealed in a more modern specification like WebSockets, where they are very explicit that ':' is just shorthand for a particular byte; it is not "text" in our new modern notion of the term.
So with this idea in mind, it makes more sense to me that *specific pieces of text* can be reasonably treated as both bytes and text -- all the string literals in urllib.parse.urlunsplit(), for example.
The semantics I imagine are that special('/') + b'x' == b'/x' (i.e., it does not become special('/x')) and special('/') + 'x' == '/x' (again, it becomes str, not special('/x')). This avoids some of the cases of unicode or str infecting a system as they did in Python 2 (where you might pass in unicode and everything works fine until some non-ASCII input is introduced).
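A minimal sketch of a type with exactly those semantics; the class is hypothetical, and restricting it to ASCII at construction time is my assumption:

    # Hypothetical "special" string: ASCII-only text that cooperates
    # with both bytes and str, always returning the plain type.
    class special:
        def __init__(self, text):
            self._b = text.encode('ascii')     # ASCII-only by construction

        def __add__(self, other):
            if isinstance(other, bytes):
                return self._b + other                   # -> plain bytes
            if isinstance(other, str):
                return self._b.decode('ascii') + other   # -> plain str
            return NotImplemented

    assert special('/') + b'x' == b'/x'   # does not become special('/x')
    assert special('/') + 'x' == '/x'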
The one place where this might be tricky is if you have an encoding that is not ASCII-compatible. But we can't guard against every possibility. It would be entirely wrong to take a string encoded with UTF-16 and start to use b'/' with it, but there are other nonsensical combinations that are already possible, especially with polymorphic functions; we can't guard against all of them. Also, I'm unsure whether something like UTF-16 is in any way compatible with the kind of legacy systems that use bytes. Can you encode your filesystem with UTF-16? I don't think you could encode a cookie with it.
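That hazard is easy to demonstrate with the stdlib codecs:

    # b'/' is a single byte, so splitting works for any ASCII superset...
    print('a/b'.encode('utf-8').split(b'/'))      # [b'a', b'b']
    # ...but in UTF-16-LE, '/' is the byte pair 2F 00, and a byte-level
    # split strands the surrounding NUL bytes:
    print('a/b'.encode('utf-16-le').split(b'/'))  # [b'a\x00', b'\x00b\x00']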
So maybe having a second string type in 3.x that consists of an encoded sequence of bytes plus the encoding, call it "estr", wouldn't have been a bad idea. It would probably have made sense to have estr cooperate with the str type, in the same way that two different kinds of numbers cooperate, "promoting" the result of an operation only when necessary. This would automatically achieve the kind of polymorphic functionality that Guido is suggesting, but without losing the ability to do
    x = e(ASCII)"bar"
    a = ''.join(["foo", x])
(or whatever the syntax for such an encoded string literal would be -- I'm not claiming this is a good one), which I presume would bind "a" to the Unicode string "foobar" -- we'd have to work out what gets promoted to what.
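A sketch of how such an estr might carry its encoding and "promote" (entirely hypothetical; no such type exists, and the method set here is the minimum needed for the example above):

    # Hypothetical estr: an encoded byte sequence plus its encoding,
    # promoting to str only when mixed with str (like int -> float).
    class estr:
        def __init__(self, data, encoding):
            self.data = data
            self.encoding = encoding

        def __add__(self, other):
            if isinstance(other, estr) and other.encoding == self.encoding:
                return estr(self.data + other.data, self.encoding)
            if isinstance(other, str):       # promote by decoding
                return self.data.decode(self.encoding) + other
            return NotImplemented

        def __radd__(self, other):
            if isinstance(other, str):
                return other + self.data.decode(self.encoding)
            return NotImplemented

    x = estr(b'bar', 'ascii')
    assert 'foo' + x == 'foobar'   # promoted to the Unicode string 'foobar'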
I would be entirely happy without a literal syntax. But as Phillip has noted, this can't be implemented *entirely* in a library, as there are some constraints in the current str/bytes implementations. Reading PEP 3003, I'm not clear whether such changes fall under the moratorium; they seem like they would (sadly), but it isn't clearly noted.
I think there's a *different* use case for things like bytes-in-a-utf8-encoding (e.g., to allow XML data to be decoded lazily), but that could be yet another class, and maybe it shouldn't be polymorphically usable as bytes (i.e., treat it as an optimized str representation that is otherwise semantically equivalent). A String ABC would formalize these things.
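A sketch of that lazy-decoding idea (the class name and the decode-once caching policy are mine, purely for illustration):

    # Hypothetical lazily-decoded text: holds UTF-8 bytes and decodes
    # only the first time it is actually used as text.
    class LazyText:
        def __init__(self, raw):
            self._raw = raw
            self._text = None

        def __str__(self):
            if self._text is None:           # decode once, on demand
                self._text = self._raw.decode('utf-8')
            return self._text

    chunk = LazyText(b'<name>caf\xc3\xa9</name>')
    # No decoding has happened yet; it happens here:
    assert str(chunk) == '<name>caf\u00e9</name>'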
-- Ian Bicking | http://blog.ianbicking.org
--
--Guido van Rossum (python.org/~guido)