On Thu, Dec 8, 2011 at 5:50 AM, Barry Warsaw wrote:
This means rather than returning bytes if bytes were given, at the Python layer, unicode is always returned. This I think causes the least disruption in user code. Well, we'll see as I'm now going to be porting some dbus-based applications.
If you look at the way the os APIs are defined, they mostly work on a "bytes in, bytes out" model (e.g. os.listdir(), os.walk()). Where there is no input type to reliably determine the expected output type, then a 'b' variant is added (e.g. os.getcwdb(), os.environb). (And yes, I do occasionally wonder if we should have a builtin "openb" shorthand for "open(name, 'b')" with a signature that omits all the parameters that are only valid for text mode files: "openb(file, mode, buffering, closefd)")

An "always unicode out" model can work, too, but you need to be sure your clients can cope with that. One thing I like about my proposed string.coerce_to_str() API is that you can use it to implement either approach. If you want bytes->bytes, then you call the result coercion function, if you want to always emit unicode, then you skip that step.
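As a small illustration of the "bytes in, bytes out" convention described above, the following sketch uses only calls that exist in the standard os module. The variable names are chosen for this example; os.environb is left out because it is only available on platforms with a bytes-native environment.

```python
import os

# The type of the argument decides the type of the results.
text_names = os.listdir(".")     # str path in, str names out
byte_names = os.listdir(b".")    # bytes path in, bytes names out

# getcwd() takes no argument, so there is nothing to infer the
# output type from; a separate 'b' variant covers the bytes case.
cwd_text = os.getcwd()
cwd_bytes = os.getcwdb()

print(type(cwd_text).__name__, type(cwd_bytes).__name__)
```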
Now, I clearly disagree with Armin on at least one point: there already *is* "one true way" to have unified text processing code in Python 3. That way is the way the Python 3.2 urllib.parse module handles it: as soon as it is handed something that isn't a string, it attempts to decode it using a default assumed encoding (specifically 'ascii', at least for now). It keeps track of whether or not the arguments were decoded from bytes and, if they were, encodes the return value on output [3]. If you're pipelining such interfaces, it's obviously more efficient to just decode once before invoking the pipeline and then (optionally) encode again at the end (just as is the case in Python 2), but you can still make your APIs largely polymorphic with respect to bytes and text without massive internal code duplication.
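A minimal sketch of that decode-on-input, encode-on-output pattern follows. The helpers coerce_args() and strip_fragment() are hypothetical names invented for this example; they are neither the proposed string.coerce_to_str() API nor the private machinery inside urllib.parse. Only urlsplit() is a real library call, shown at the end to confirm that the stdlib behaves the same way.

```python
from urllib.parse import urlsplit


def _identity(value):
    return value


def coerce_args(*args, encoding="ascii"):
    """Hypothetical helper: return the arguments as str, followed by a
    function that converts a str result back to the caller's type."""
    is_bytes = [isinstance(a, (bytes, bytearray)) for a in args]
    if not any(is_bytes):
        # Already text: nothing to decode, nothing to encode later.
        return args + (_identity,)
    if not all(is_bytes):
        raise TypeError("cannot mix str and bytes arguments")
    decoded = tuple(a.decode(encoding) for a in args)

    def encode_result(value):
        return value.encode(encoding)

    return decoded + (encode_result,)


def strip_fragment(url):
    """The text logic is written once, against str only."""
    url, coerce_result = coerce_args(url)
    return coerce_result(url.partition("#")[0])


print(strip_fragment("http://example.com/a#b"))    # str in, str out
print(strip_fragment(b"http://example.com/a#b"))   # bytes in, bytes out

# The real urllib.parse API follows the same rule:
print(urlsplit(b"http://example.com/a#b").netloc)  # b'example.com'
```

Returning the decoded value directly instead of calling coerce_result() would give the "always unicode out" behaviour, so the same helper can serve either model.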
It's certainly an interesting idea, although I can't decide whether this is more implicit or more explicit. I'm not sure it would have helped me with dbus-python since most of the porting work happens at the interface between Python and an existing C API with clearly defined semantics. There's also a large body of existing code that uses the library, so an important goal is to make *their* porting easier. Have you had any experience porting applications which use the new urllib and does it make your life easier?
No, the API updates were in response to user requests (and extended discussions here and on python-dev). However, the use case of getting URL components off the wire, manipulating them and putting them back on the wire, all in an RFC compliant strict ASCII encoding made a lot of sense, which is why we ended up going with this model (that and the os module precedent).
I also wonder if some of these ideas would help with RDM's re-imagining of the email package. email is notoriously quantum in its bytes/str duality - sometimes the same email data needs to act like a str or a bytes at different times. The email-sig design attempts to address this with a layered approach, although I think we're still wondering whether this is going to work in practice. (I'm not sure about the current state of the email package work.)
The email model in 3.2 (which actually makes it somewhat usable) is quite similar to the way urllib.parse works (although I believe there may be a bit more internal code duplication). The two API updates actually co-evolved as part of the same set of discussions, which is how they came to share the bytes->bytes, str->str philosophy. I'm not sure about the status of email6 either, though - has anyone heard from RDM lately?
I've rambled on enough, but I think you and Armin bring up some good points that we really need to address in Python 3.3. I'm fairly convinced that there's little we could have done differently before now, and most of these issues are cropping up because people are doing actual real-world ports now, which is a good thing! I also strongly disagree with Armin that a Python 2.8 could help in any way. Much better to expose these problems now and help make Python 3.3 the best it can be.
Indeed - it's *already* the case that ports that can drop 2.5 support have a much easier time of things, and those that can also drop 2.6 have it easier still (I believe the latter is uncommon though, since most people want their stuff to run on platforms like the RHEL6 system Python installation). Whether or not 2.8 exists won't magically make the need to support 2.5 (or even earlier versions!) go away - the only thing that will make that happen is time, as people upgrade their underlying OS installations.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia