[Python-ideas] Exploring the 'strview' concept further

Wed Dec 7 20:50:57 CET 2011

On Dec 08, 2011, at 12:53 AM, Nick Coghlan wrote:

>With encouragement from me (and others) Armin Ronacher recently
>attempted to articulate his problems in dealing with the migration to
>Python 3 [1]. They're actually quite similar to the feelings I had
>during my early attempts at restoring the ability of the URL parsing
>APIs to deal directly with ASCII-encoded binary data, rather than
>requiring that the application developer explicitly decode it to text
>first [2].

I've just finished a port of dbus-python to Python 3, submitting the patches
upstream, although they haven't been reviewed yet.  It was an interesting
exercise for many reasons, and I have my own thoughts on the state of porting
which I'll post at another time.  I agree with some of the issues that Armin
brought up, and disagree with others. ;)

At the C level, dbus defines its strings as UTF-8 encoded char*'s.  My first
crack at this (despite upstream's thinking otherwise) was to use bytes to
represent these objects.  That turned out to be completely infeasible for
several reasons.  The biggest problem is that some of the core callback
dispatch code was doing slicing and comparisons of these objects against
literals, or externally registered objects.  So where you might see something
like:

    >>> s = ':1/123'
    >>> s[:1] == ':'
    True

this doesn't work when `s` is a bytes object.  Given the number of places
internally that would have to be fixed, and the cost of imposing a potentially
huge number of changes to clients of the library, I ultimately decided that
upstream's suggestion to model these things as unicodes was right after all.
Once I made that change, the port went relatively easily, as judged by the
amount of time it took to get the test suite completing successfully. ;)
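
A minimal sketch of the failure described above, assuming Python 3 semantics
(the value ':1/123' is taken from the example; nothing here is dbus-python
code):

```python
# Python 3: slicing bytes yields bytes, and bytes never compare equal to str.
s = b':1/123'
prefix = s[:1]

print(prefix == ':')     # False: a bytes object compared against a str literal
print(prefix == b':')    # True: the literal has to be bytes as well
print(s[0])              # 58: indexing bytes gives an int, not a 1-char string

# Decoding once at the boundary lets the original comparison work unchanged.
u = s.decode('utf-8')
print(u[:1] == ':')      # True
```

Every comparison against a str literal would need a b'' prefix, both inside
the library and in client code, which is where the cost comes from.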

At the Python level, many of the interfaces which accept 8-bit strings in
Python 2 accept bytes or strs in Python 3, where the bytes must be UTF-8
encoded.  Internally, I had to do many more type checks for PyUnicodes, and
then encode them to bytes before I could pass them to the dbus C API.  In
almost all cases though, returning data from the dbus C API involved decoding
the char*'s to unicodes, not bytes.

This means rather than returning bytes if bytes were given, at the Python
layer, unicode is always returned.  This I think causes the least disruption
in user code.  Well, we'll see as I'm now going to be porting some dbus-based
applications.
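
The same policy can be sketched in pure Python.  This is only an illustration
of "accept bytes or str, always return str"; the helper names to_wire,
from_wire, and echo_name are hypothetical and are not part of dbus-python,
whose real implementation lives in C:

```python
def to_wire(value):
    """Return UTF-8 bytes for the C layer, accepting either str or bytes."""
    if isinstance(value, str):
        return value.encode('utf-8')
    if isinstance(value, bytes):
        value.decode('utf-8')   # validation only; raises UnicodeDecodeError
        return value
    raise TypeError('expected str or bytes, got %s' % type(value).__name__)


def from_wire(raw):
    """Decode UTF-8 bytes coming back from the C layer; always returns str."""
    return raw.decode('utf-8')


def echo_name(name):
    """Hypothetical round trip through the C layer."""
    return from_wire(to_wire(name))


print(echo_name('org.example.Service'))    # str in, str out
print(echo_name(b'org.example.Service'))   # bytes in, still str out
```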

>Now, I clearly disagree with Armin on at least one point: there
>already *is* "one true way" to have unified text processing code in
>Python 3. That way is the way the Python 3.2 urllib.parse module
>handles it: as soon as it is handed something that isn't a string, it
>attempts to decode it using a default assumed encoding (specifically
>'ascii', at least for now). It keeps track of whether or not the
>arguments were decoded from bytes and, if they were, encodes the
>return value on output [3]. If you're pipelining such interfaces, it's
>obviously more efficiently to just decode once before invoking the
>pipeline and then (optionally) encoding again at the end (just as is
>the case in Python 2), but you can still make your APIs largely
>polymorphic with respect to bytes and text without massive internal
>code duplication.

It's certainly an interesting idea, although I can't decide whether this is
more implicit or more explicit.  I'm not sure it would have helped me with
dbus-python since most of the porting work happens at the interface between
Python and an existing C API with clearly defined semantics.  There's also a
large body of existing code that uses the library, so an important goal is to
make *their* porting easier.  Have you had any experience porting applications
which use the new urllib and does it make your life easier?
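
For reference, the decode-on-input, coerce-on-output pattern Nick describes
might look roughly like the sketch below.  The helpers decode_args and
split_scheme are hypothetical names made up for illustration, not the actual
urllib.parse internals; only the urlparse() call at the end is the real
stdlib API:

```python
from urllib.parse import urlparse


def decode_args(*args, encoding='ascii'):
    """Decode bytes arguments to str; also return a coercer for the result."""
    if all(isinstance(a, str) for a in args):
        return args, (lambda result: result)           # nothing to undo
    if any(isinstance(a, str) for a in args):
        raise TypeError('cannot mix str and bytes arguments')
    decoded = tuple(a.decode(encoding) for a in args)
    return decoded, (lambda result: result.encode(encoding))


def split_scheme(url):
    """Hypothetical helper: return the part of url before the first ':'."""
    (url,), coerce = decode_args(url)
    scheme, _, _ = url.partition(':')   # the body only ever deals with str
    return coerce(scheme)


print(split_scheme('http://example.com/'))     # str in, str out
print(split_scheme(b'http://example.com/'))    # bytes in, bytes out

# The stdlib function is polymorphic in the same way:
print(urlparse(b'http://example.com/a').path)
```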

>So, that's always one of my first suggestions to people struggling
>with Python 3's unicode model: I ask if they have tried putting aside
>any concerns they may have about possible losses of efficiency, and
>just tried the decode-on-input-and-return-an-output-coercion-function,
>coerce-on-output approach. Python used to do this implicitly for you
>at every string operation (minus the 'coerce on output' part), but now
>it is asking that you do it manually, and decide for *yourself* on an
>appropriate encoding, instead of the automatic assumption of ASCII
>text that is present in Python 2 (we'll leave aside the issue of
>platform-specific defaults in various contexts - that's a whole
>different question and one I'm not at all equipped to answer. I don't
>think I've ever even had to work on a system with any locale other
>than en_US or en_GB).

I also wonder if some of these ideas would help with RDM's re-imagining of the
email package.  email is notoriously quantum in its bytes/str duality -
sometimes the same email data needs to act like a str or a bytes at different
times.  The email-sig design attempts to address this with a layered approach,
although I think we're still wondering whether this is going to work in
practice.  (I'm not sure about the current state of the email package work.)

One difference here is that email usually tells you explicitly what the
encoding is.  Assuming the Content-Types don't lie or specify charsets that
are unknown to Python, I think it would be nice to pass them through the APIs
for better idempotency.
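
As a rough sketch of what passing the charset through could mean, using only
the existing email API: the function decode_part is a hypothetical name, and
the sample message is made up for illustration.

```python
import email

# A made-up single-part message that declares its own charset.
raw = (
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'caf=E9\r\n'
)


def decode_part(msg, fallback='ascii'):
    """Return (text, charset), trusting the declared charset when possible."""
    charset = msg.get_content_charset() or fallback
    payload = msg.get_payload(decode=True)    # undoes the transfer encoding
    try:
        return payload.decode(charset), charset
    except LookupError:                       # charset unknown to Python
        return payload.decode(fallback, 'replace'), fallback


msg = email.message_from_bytes(raw)
text, charset = decode_part(msg)
print(charset)
print(text.strip())

# Keeping the charset next to the text is what allows re-encoding later.
roundtrip = text.encode(charset)
```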

>However, that urllib.urlparse code also highlights another one of
>Armin's complaints: like much of the stdlib (and core interpreter!),
>it doesn't ducktype 'str'. Instead, it demands the real thing and
>accepts no substitutes (not even collections.UserString). This kind of
>behaviour is quite endemic - the coupling between the interpreter and
>the details of the string implementation is, in general, even tighter
>than that between the interpreter and the dict implementation used for
>namespaces.

In the dbus port, this is mostly not a problem, since the dbus-python data
types derive from PyBytes or PyUnicode.  There are one or two places that do
_CheckExacts() for various reasons, but I think mostly everything works
properly with _Check() calls.  I know that's not duck-typing though, and I
agree with you about the problem at the interpreter level.
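
A Python-level analogy for the distinction, as a sketch only: check() and
check_exact() are hypothetical stand-ins for what PyUnicode_Check() and
PyUnicode_CheckExact() do in C, and BusName is a made-up subclass.

```python
from collections import UserString


class BusName(str):
    """Hypothetical str subclass standing in for a dbus-python string type."""


def check(obj):
    """Accepts str and its subclasses, like PyUnicode_Check()."""
    return isinstance(obj, str)


def check_exact(obj):
    """Accepts only str itself, like PyUnicode_CheckExact()."""
    return type(obj) is str


name = BusName(':1.42')
wrapped = UserString(':1.42')

print(check(name), check_exact(name))   # subclass passes only the loose check
print(check(wrapped))                   # UserString is not a str subclass
```

Neither check accepts the UserString, which is the duck-typing gap Nick is
pointing at.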

(It turned out to be a bigger PITA with int/long.  An important requirement
was to not change the API for Python 2, so types which derived from PyInts in
Python 2 had to derive from PyLongs in Python 3.  Liberal use of #ifdefs
mostly handles the complexity, but porting clients to Python 3 will definitely
see API differences, e.g. in inheritance structure and such.)

I've rambled on enough, but I think you and Armin bring up some good points
that we really need to address in Python 3.3.  I'm fairly convinced that
there's little we could have done differently before now, and most of these
issues are cropping up because people are doing actual real-world ports now,
which is a good thing!  I also strongly disagree with Armin that a Python 2.8
could help in any way.  Much better to expose these problems now and help make
Python 3.3 the best it can be.

-Barry