[Python-ideas] Exploring the 'strview' concept further

Wed Dec 7 15:53:44 CET 2011

With encouragement from me (and others) Armin Ronacher recently
attempted to articulate his problems in dealing with the migration to
Python 3 [1]. They're actually quite similar to the feelings I had
during my early attempts at restoring the ability of the URL parsing
APIs to deal directly with ASCII-encoded binary data, rather than
requiring that the application developer explicitly decode it to text
first [2].

Now, I clearly disagree with Armin on at least one point: there
already *is* "one true way" to have unified text processing code in
Python 3. That way is the way the Python 3.2 urllib.parse module
handles it: as soon as it is handed something that isn't a string, it
attempts to decode it using a default assumed encoding (specifically
'ascii', at least for now). It keeps track of whether or not the
arguments were decoded from bytes and, if they were, encodes the
return value on output [3]. If you're pipelining such interfaces, it's
obviously more efficient to just decode once before invoking the
pipeline and then (optionally) encode again at the end (just as is
the case in Python 2), but you can still make your APIs largely
polymorphic with respect to bytes and text without massive internal
code duplication.
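
The decode-on-input, coerce-on-output pattern described above can be
sketched roughly as follows. This is a simplified illustration, not the
actual urllib.parse source; the names coerce_args and strip_fragment are
made up for the example, and 'ascii' is assumed as the implicit encoding.

```python
def coerce_args(*args, encoding="ascii", errors="strict"):
    """Return the arguments as str, plus a function to coerce results back.

    Hypothetical helper: str inputs pass through unchanged, anything else
    is assumed to be bytes-like and is decoded with the given encoding.
    """
    str_input = isinstance(args[0], str)
    for arg in args[1:]:
        if isinstance(arg, str) != str_input:
            raise TypeError("cannot mix str and non-str arguments")
    if str_input:
        return args + ((lambda result: result),)
    decoded = tuple(arg.decode(encoding, errors) for arg in args)
    return decoded + ((lambda result: result.encode(encoding, errors)),)


def strip_fragment(url):
    """Drop any '#fragment' suffix; works on both str and bytes input."""
    url, coerce_result = coerce_args(url)
    # All real work happens on text, so there is only one code path.
    return coerce_result(url.partition("#")[0])


print(strip_fragment("http://example.com/a#top"))   # str in, str out
print(strip_fragment(b"http://example.com/a#top"))  # bytes in, bytes out
```

From the caller's side the real module behaves the same way: passing
bytes to urllib.parse.urlsplit() gives back a result whose fields are
bytes, and passing str gives back str.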

So, that's always one of my first suggestions to people struggling
with Python 3's Unicode model: I ask if they have tried putting aside
any concerns they may have about possible losses of efficiency, and
just tried the approach of decoding on input (returning an output
coercion function alongside the decoded values) and coercing on
output. Python used to do this implicitly for you
at every string operation (minus the 'coerce on output' part), but now
it is asking that you do it manually, and decide for *yourself* on an
appropriate encoding, instead of the automatic assumption of ASCII
text that is present in Python 2 (we'll leave aside the issue of
platform-specific defaults in various contexts - that's a whole
different question and one I'm not at all equipped to answer. I don't
think I've ever even had to work on a system with any locale other
than en_US or en_GB).

Often this actually resolves their problem (since they're no longer
fighting the new Unicode model, and instead embracing it), and this is
why PEP 393 is going to be such a big deal when Python 3.3 is released
next year. Protocol developers are *right* to be worried about a
four-fold increase in memory usage (and the flow-on effects on CPU
usage and cache misses) when going from bytes data to the UCS4
internal Unicode format used on most distro-provided Python builds for
Linux. With PEP 393's flexible internal representations, the amount of
memory used will be as little as possible while still allowing
straightforward O(1) lookup of individual code points.
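
One rough way to observe the effect, assuming a CPython build that
implements PEP 393 (3.3 or later), is to compare how much each extra
code point adds to the size of a string. sys.getsizeof() reports
implementation-specific numbers, so treat this as a sketch rather than
a guarantee; the helper name is made up for the example.

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Estimate per-character storage by comparing two string lengths."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


samples = {
    "ASCII": "a",
    "Latin-1": "\xe9",
    "BMP": "\u20ac",
    "astral": "\U0001F40D",
}
for label, ch in samples.items():
    # With PEP 393 the cost should grow with the widest code point present.
    print(label, bytes_per_code_point(ch))
```

On such a build I would expect pure ASCII text to cost about as much as
the equivalent bytes object, with only strings containing astral code
points paying the full four bytes per character.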

However, that urllib.parse code also highlights another one of
Armin's complaints: like much of the stdlib (and core interpreter!),
it doesn't ducktype 'str'. Instead, it demands the real thing and
accepts no substitutes (not even collections.UserString). This kind of
behaviour is quite endemic - the coupling between the interpreter and
the details of the string implementation is, in general, even tighter
than that between the interpreter and the dict implementation used for
namespaces.
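
A small illustration of the problem, using only the standard library.
I'd expect the first urlsplit() call to fail because the object is
neither a real str nor something with a decode() method, but the exact
exception is an implementation detail, so the sketch just reports
whatever is raised.

```python
from collections import UserString
from urllib.parse import urlsplit

url = UserString("http://example.com/path")

# UserString wraps a str, but it is not a str subclass.
print(isinstance(url, str))

try:
    urlsplit(url)
except Exception as exc:
    print("rejected:", type(exc).__name__)

# Converting explicitly to a real str is the current workaround.
print(urlsplit(str(url)).netloc)
```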

With PEP 3118, we introduced the concept of 'memoryview' to make
allowance for the fact that it is often useful to look at the same
chunk of memory in multiple ways, *without* incurring the costs of
making multiple copies. In a discussion back in June [4], I briefly
mentioned the idea of a 'strview' type that would extend those
concepts to providing a str-like view of a region of memory, *without*
necessarily making a copy of the entire thing.
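
For anyone who hasn't played with memoryview, the "no copy" property
looks like this (the request line is just sample data):

```python
data = bytearray(b"GET /index.html HTTP/1.1")
view = memoryview(data)

# Slicing a memoryview gives another view onto the same memory.
path = view[4:15]
before = path.tobytes()

# Mutating the underlying buffer is visible through the existing view.
data[5] = ord("I")
after = path.tobytes()

print(before, after)
```

A strview would aim to offer the same kind of window onto the buffer,
but indexed by code point and yielding str rather than bytes.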

DISCLAIMERS:
1. I don't know yet if this is a good idea. It may in fact be a
terrible idea. I think it is, at least, an idea worth discussing
further.
2. Making this concept work may require actually *classifying* our
codecs to some degree (for attributes like 'ASCII-compatible',
'stateless', 'fixed width', etc). That might be tedious, but doesn't
seem completely infeasible.
3. There are issues with memoryview itself that should be accounted
for if pursuing this idea [5]
4. There is an issue with CPython's operand coercion for sequence
concatenation and repetition that may affect attempts to implement
this idea, although you should be fine so long as you implement the
number methods in addition to the sequence ones (which happens
automatically for classes written in Python) [6]

So, how might a 'strview' object work?

1. The basic construction would be "strview(object, encoding,
errors)". For convenience, actual str objects would just be returned
unmodified (alternatively: a factory function could be provided with
that behaviour)
2. A 'strview' *wouldn't* try to pass itself off as a real string for
all purposes. Instead, it would support a new String ABC (more on that
below).
3. The encode() method would work like a string's normal encode()
method, decoding the original object to a str, then encoding that to
the desired encoding. If the encodings match, then an optimised fast
path of simply calling bytes() on the underlying object would be used.
4. If asked to index, slice or iterate over the underlying string, the
strview would use the incremental decoder for the relevant codec to
build an efficient mapping from code point indices to byte indices and
then return real strings (various strategies for doing this have been
posted to this list in the past). Alternatively, if codecs were
classified to explicitly indicate when they implemented stateless
fixed width encodings, then strview could simply be restricted to only
working with that subset of possible encodings. The latter strategy
might be needed to get around issues with stateful encodings like
ShiftJIS and ITA2 - those are hard (impossible?) to index and
interpret efficiently without fully decoding them and storing the
result.
5. The new type would implement the various binary operators supported
by strings, promoting itself to a real string type whenever needed.
6. The new type would similarly support the full string API, returning
actual string objects rather than any kind of view.
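
To make the list above a little more concrete, here is a minimal
pure-Python sketch of the restricted variant (stateless fixed width
encodings only). Everything in it is hypothetical: the StrView class,
the strview() factory and the hand-written _FIXED_WIDTH table, which
stands in for a real codec classification. It also assumes a
byte-oriented buffer such as bytes or bytearray, and it only covers
indexing, slicing, encode() and concatenation rather than the full
string API.

```python
import codecs

# Assumption: a hand-maintained table standing in for the codec
# classification discussed above (stateless, fixed width encodings only).
_FIXED_WIDTH = {"ascii": 1, "iso8859-1": 1, "utf-32-le": 4, "utf-32-be": 4}


class StrView:
    """Hypothetical str-like view over a bytes-like object."""

    def __init__(self, obj, encoding, errors="strict"):
        self._codec = codecs.lookup(encoding).name  # normalised codec name
        if self._codec not in _FIXED_WIDTH:
            raise ValueError("unsupported encoding: %r" % encoding)
        self._width = _FIXED_WIDTH[self._codec]
        self._errors = errors
        self._view = memoryview(obj)

    def __len__(self):
        return self._view.nbytes // self._width

    def _decode(self, start, stop):
        # Decode only the bytes covering code points start..stop-1.
        chunk = self._view[start * self._width:stop * self._width]
        return str(chunk, self._codec, self._errors)

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(len(self))
            if step == 1:
                return self._decode(start, max(start, stop))
            return "".join(self._decode(i, i + 1)
                           for i in range(start, stop, step))
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("strview index out of range")
        return self._decode(index, index + 1)

    def __str__(self):
        return self._decode(0, len(self))

    def encode(self, encoding="utf-8", errors="strict"):
        if codecs.lookup(encoding).name == self._codec:
            return bytes(self._view)  # fast path: no decoding at all
        return str(self).encode(encoding, errors)

    # Binary operators promote to a real str.
    def __add__(self, other):
        return str(self) + other

    def __radd__(self, other):
        return other + str(self)


def strview(obj, encoding, errors="strict"):
    """Factory: real str objects are returned unmodified."""
    return obj if isinstance(obj, str) else StrView(obj, encoding, errors)


v = strview(b"caf\xe9 au lait", "latin-1")
print(len(v), v[3], v[:4], v + "!")
```

Because the class is written in Python, defining __add__ and __radd__
is enough to make concatenation with real strings work in both
directions. Variable width or stateful encodings would need the index
mapping strategy instead, which this sketch does not attempt.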

What might a String ABC provide?

For a very long time, slice indices had to be real integers - we
didn't allow other "integer like" types. The reason was that floats
implemented __int__, so ducktyping on that method would have allowed
binary floating point numbers in functions where we didn't want to
permit them. The answer, ultimately, was to introduce __index__ (and,
eventually, numbers.Integral) to mark "true" integers, allowing things
like NumPy scalars to be used directly as slice indices without
inheriting from int.
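
As a reminder of how that works in practice, here is a tiny example.
The Offset class is invented for illustration; the point is only that
defining __index__ is enough to be accepted as a slice index, while
floats are still rejected.

```python
import operator


class Offset:
    """Hypothetical integer-like type that does not inherit from int."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


text = "abcdef"
print(text[Offset(2):Offset(4)])
print(operator.index(Offset(3)))

try:
    text[2.0]
except TypeError as exc:
    print("float rejected:", exc)
```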

An explicit String ABC, even if not supported for performance critical
core functionality like identifiers, would allow the implementation of
code like that in urllib.parse to be updated to avoid keying
behaviour on the concrete builtin str type - instead, it would check
against the String ABC, allowing for all the usual explicit type
registration goodies that ABCs support (and that make them much better
for type checking than concrete types).
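
A sketch of what the type check could look like. The String ABC here
is purely hypothetical (no such class exists in the stdlib); only
abc.ABC, its register() method and collections.UserString are real.

```python
from abc import ABC
from collections import UserString


class String(ABC):
    """Hypothetical ABC marking 'true' text types."""


# Explicit registration: no inheritance from str is required.
String.register(str)
String.register(UserString)


def is_text(value):
    # Check against the ABC instead of the concrete builtin str type.
    return isinstance(value, String)


print(is_text("abc"), is_text(UserString("abc")), is_text(b"abc"))
```

A third-party str-like type (or a strview) could register itself the
same way and be accepted anywhere the check is written against the ABC.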

Just as much of the old UserDict functionality is now available on
Mapping and MutableMapping, so much of the existing UserString
functionality could be moved to the hypothetical String ABC.

Hopefully-the-rambling-isn't-too-incoherent'ly-yours,
Nick.

[1] http://lucumr.pocoo.org/2011/12/7/thoughts-on-python3/
[2] http://bugs.python.org/issue9873
[3] http://hg.python.org/cpython/file/default/Lib/urllib/parse.py#l74
[4] http://mail.python.org/pipermail/python-ideas/2011-June/010439.html
[5] http://bugs.python.org/issue10181
[6] http://bugs.python.org/issue11477

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia