[Python-ideas] Exploring the 'strview' concept further

Thu Dec 8 01:22:32 CET 2011

On Thu, Dec 8, 2011 at 5:50 AM, Barry Warsaw <barry at python.org> wrote:
> This means rather than returning bytes if bytes were given, at the Python
> layer, unicode is always returned.  This I think causes the least disruption
> in user code.  Well, we'll see as I'm now going to be porting some dbus-based
> applications.

If you look at the way the os APIs are defined, they mostly work on a
"bytes in, bytes out" model (e.g. os.listdir(), os.walk()). Where
there is no input type to reliably determine the expected output type,
then a 'b' variant is added (e.g. os.getcwdb(), os.environb). (And
yes, I do occasionally wonder if we should have a builtin "openb"
shorthand for "open(name, mode + 'b')" with a signature that omits all the
parameters that are only valid for text mode files: "openb(file, mode,
buffering, closefd)")
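
As a quick illustration of the "bytes in, bytes out" model, here is a minimal sketch using only os calls that exist today. The temporary directory and the file name "example.txt" are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so the directory listing is not empty.
    with open(os.path.join(tmp, "example.txt"), "w"):
        pass

    # str path in, str names out
    str_names = os.listdir(tmp)
    # bytes path in, bytes names out
    bytes_names = os.listdir(os.fsencode(tmp))

# No input argument to infer the type from, so there is a 'b' variant.
cwd_text = os.getcwd()    # str
cwd_bytes = os.getcwdb()  # bytes
```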

An "always unicode out" model can work, too, but you need to be sure
your clients can cope with that.

One thing I like about my proposed string.coerce_to_str() API is that
you can use it to implement either approach. If you want bytes->bytes,
then you call the result coercion function; if you want to always emit
unicode, then you skip that step.
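
To make that concrete, here is a rough sketch of what such a helper might look like. Nothing named string.coerce_to_str() exists in the standard library; the signature, the (args, result_coercer) return shape and the 'ascii' default below are assumptions for illustration only, loosely modelled on how urllib.parse handles its arguments.

```python
def _noop(obj):
    # Used when the inputs were already str: nothing to undo.
    return obj


def coerce_to_str(*args, encoding="ascii", errors="strict"):
    """Hypothetical helper: return (str_args, coerce_result).

    coerce_result() converts a str result back to the input type.
    """
    if all(isinstance(a, str) for a in args):
        return args, _noop
    if any(isinstance(a, str) for a in args):
        raise TypeError("cannot mix str and non-str arguments")

    def coerce_result(result):
        return result.encode(encoding, errors)

    decoded = tuple(a.decode(encoding, errors) for a in args)
    return decoded, coerce_result


def shout(text):
    # bytes->bytes, str->str: apply the result coercion on the way out.
    (text,), coerce_result = coerce_to_str(text)
    return coerce_result(text.upper())


def shout_always_str(text):
    # "Always unicode out": simply skip the result coercion step.
    (text,), _ = coerce_to_str(text)
    return text.upper()
```

The internal logic (text.upper() here) is written once against str, and the choice between the two output models is a single line at the API boundary.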

>>Now, I clearly disagree with Armin on at least one point: there
>>already *is* "one true way" to have unified text processing code in
>>Python 3. That way is the way the Python 3.2 urllib.parse module
>>handles it: as soon as it is handed something that isn't a string, it
>>attempts to decode it using a default assumed encoding (specifically
>>'ascii', at least for now). It keeps track of whether or not the
>>arguments were decoded from bytes and, if they were, encodes the
>>return value on output [3]. If you're pipelining such interfaces, it's
>>obviously more efficient to just decode once before invoking the
>>pipeline and then (optionally) encode again at the end (just as is
>>the case in Python 2), but you can still make your APIs largely
>>polymorphic with respect to bytes and text without massive internal
>>code duplication.
>
> It's certainly an interesting idea, although I can't decide whether this is
> more implicit or more explicit.  I'm not sure it would have helped me with
> dbus-python since most of the porting work happens at the interface between
> Python and an existing C API with clearly defined semantics.  There's also a
> large body of existing code that uses the library, so an important goal is to
> make *their* porting easier.  Have you had any experience porting applications
> which use the new urllib and does it make your life easier?

No, the API updates were in response to user requests (and extended
discussions here and on python-dev). However, the use case of getting
URL components off the wire, manipulating them and putting them back
on the wire, all in an RFC compliant strict ASCII encoding made a lot
of sense, which is why we ended up going with this model (that and the
os module precedent).
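
For anyone who hasn't tried it, the behaviour looks roughly like this with the urllib.parse functions in Python 3.2 and later (example.com and the paths are just placeholder values):

```python
from urllib.parse import urlsplit, urlunsplit

# str in, str out
text_parts = urlsplit("http://example.com/a/b?x=1")

# bytes in, bytes out (decoded as strict ASCII internally)
bytes_parts = urlsplit(b"http://example.com/a/b?x=1")

# Manipulate a component and put the URL back "on the wire" as bytes.
rebuilt = urlunsplit(bytes_parts._replace(path=b"/c"))

# Non-ASCII bytes are rejected rather than guessed at.
try:
    urlsplit(b"http://example.com/caf\xc3\xa9")
    rejected_non_ascii = False
except UnicodeDecodeError:
    rejected_non_ascii = True
```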

> I also wonder if some of these ideas would help with RDM's re-imagining of the
> email package.  email is notoriously quantum in its bytes/str duality -
> sometimes the same email data needs to act like a str or a bytes at different
> times.  The email-sig design attempts to address this with a layered approach,
> although I think we're still wondering whether this is going to work in
> practice.  (I'm not sure about the current state of the email package work.)

The email model in 3.2 (which actually makes it somewhat usable) is
quite similar to the way urllib.parse works (although I believe
there may be a bit more internal code duplication). The two API
updates actually co-evolved as part of the same set of discussions,
which is how they came to share the bytes->bytes, str->str philosophy.

I'm not sure about the status of email6 either, though - has anyone
heard from RDM lately?

> I've rambled on enough, but I think you and Armin bring up some good points
> that we really need to address in Python 3.3.  I'm fairly convinced that
> there's little we could have done different before now, and most of these
> issues are cropping up because people are doing actual real-world ports now,
> which is a good thing!  I also strongly disagree with Armin that a Python 2.8
> could help in any way.  Much better to expose these problems now and help make
> Python 3.3 the best it can be.

Indeed - it's *already* the case that ports that can drop 2.5 support
have a much easier time of things, and those that can also drop 2.6
have it easier still (I believe the latter is uncommon though, since
most people want their stuff to run on platforms like the RHEL6 system
Python installation).

Whether or not 2.8 exists won't magically make the need to support 2.5
(or even earlier versions!) go away - the only thing that will make
that happen is time, as people upgrade their underlying OS
installations.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia