Exploring the 'strview' concept further

With encouragement from me (and others), Armin Ronacher recently attempted to articulate his problems in dealing with the migration to Python 3 [1]. They're actually quite similar to the feelings I had during my early attempts at restoring the ability of the URL parsing APIs to deal directly with ASCII-encoded binary data, rather than requiring that the application developer explicitly decode it to text first [2].

Now, I clearly disagree with Armin on at least one point: there already *is* "one true way" to have unified text processing code in Python 3. That way is the way the Python 3.2 urllib.parse module handles it: as soon as it is handed something that isn't a string, it attempts to decode it using a default assumed encoding (specifically 'ascii', at least for now). It keeps track of whether or not the arguments were decoded from bytes and, if they were, encodes the return value on output [3]. If you're pipelining such interfaces, it's obviously more efficient to decode once before invoking the pipeline and then (optionally) encode again at the end (just as is the case in Python 2), but you can still make your APIs largely polymorphic with respect to bytes and text without massive internal code duplication.

So, that's always one of my first suggestions to people struggling with Python 3's unicode model: I ask if they have tried putting aside any concerns they may have about possible losses of efficiency, and just tried the decode-on-input, coerce-on-output approach (decoding the arguments and returning an appropriate output coercion function in a single step). Python used to do this implicitly for you at every string operation (minus the 'coerce on output' part), but now it is asking that you do it manually, and decide for *yourself* on an appropriate encoding, instead of the automatic assumption of ASCII text that is present in Python 2. (We'll leave aside the issue of platform-specific defaults in various contexts - that's a whole different question, and one I'm not at all equipped to answer. I don't think I've ever even had to work on a system with any locale other than en_US or en_GB.)

Often this actually resolves their problem (since they're no longer fighting the new Unicode model, and instead embracing it), and this is why PEP 393 is going to be such a big deal when Python 3.3 is released next year. Protocol developers are *right* to be worried about a four-fold increase in memory usage (and the flow-on effects on CPU usage and cache misses) when going from bytes data to the UCS4 internal Unicode format used on most distro-provided Python builds for Linux. With PEP 393's flexible internal representations, the amount of memory used will be as little as possible while still allowing straightforward O(1) lookup of individual code points.

However, that urllib.urlparse code also highlights another one of Armin's complaints: like much of the stdlib (and the core interpreter!), it doesn't duck-type 'str'. Instead, it demands the real thing and accepts no substitutes (not even collections.UserString). This kind of behaviour is quite endemic - the coupling between the interpreter and the details of the string implementation is, in general, even tighter than that between the interpreter and the dict implementation used for namespaces. With PEP 3118, we introduced the concept of 'memoryview' to make allowance for the fact that it is often useful to look at the same chunk of memory in multiple ways, *without* incurring the costs of making multiple copies.
In a discussion back in June [4], I briefly mentioned the idea of a 'strview' type that would extend those concepts to providing a str-like view of a region of memory, *without* necessarily making a copy of the entire thing.

DISCLAIMERS:

1. I don't know yet if this is a good idea. It may in fact be a terrible idea. I think it is, at least, an idea worth discussing further.
2. Making this concept work may require actually *classifying* our codecs to some degree (for attributes like 'ASCII-compatible', 'stateless', 'fixed width', etc). That might be tedious, but doesn't seem completely infeasible.
3. There are issues with memoryview itself that should be accounted for if pursuing this idea [5].
4. There is an issue with CPython's operand coercion for sequence concatenation and repetition that may affect attempts to implement this idea, although you should be fine so long as you implement the number methods in addition to the sequence ones (which happens automatically for classes written in Python) [6].

So, how might a 'strview' object work?

1. The basic construction would be "strview(object, encoding, errors)". For convenience, actual str objects would just be returned unmodified (alternatively: a factory function could be provided with that behaviour)
2. A 'strview' *wouldn't* try to pass itself off as a real string for all purposes. Instead, it would support a new String ABC (more on that below).
4. The encode() method would work like a string's normal encode() method, decoding the original object to a str, then encoding that to the desired encoding. If the encodings match, then an optimised fast path of simply calling bytes() on the underlying object would be used.
5. If asked to index, slice or iterate over the underlying string, the strview would use the incremental decoder for the relevant codec to build an efficient mapping from code point indices to byte indices and then return real strings (various strategies for doing this have been posted to this list in the past). Alternatively, if codecs were classified to explicitly indicate when they implemented stateless fixed width encodings, then strview could simply be restricted to only working with that subset of possible encodings. The latter strategy might be needed to get around issues with stateful encodings like ShiftJIS and ITA2 - those are hard (impossible?) to index and interpret efficiently without fully decoding them and storing the result.
6. The new type would implement the various binary operators supported by strings, promoting itself to a real string type whenever needed
7. The new type would similarly support the full string API, returning actual string objects rather than any kind of view.

What might a String ABC provide? For a very long time, slice indices had to be real integers - we didn't allow other "integer like" types. The reason was that floats implemented __int__, so ducktyping on that method would have allowed binary floating point numbers in functions where we didn't want to permit them. The answer, ultimately, was to introduce __index__ (and, eventually, numbers.Integral) to mark "true" integers, allowing things like NumPy scalars to be used directly as slice indices without inheriting from int.
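To make that list concrete, here is a minimal sketch of how such a type might hang together for the easy case of a stateless, fixed-width, single-byte encoding. Everything in it is illustrative - the String ABC, the method set, and the restriction to single-byte codecs are assumptions for the sake of the example, not a settled design:

    from abc import ABC

    class String(ABC):
        """Hypothetical marker ABC for 'string like' objects (point 2)."""

    String.register(str)

    class strview(String):
        # Sketch only: assumes a stateless single-byte encoding, so one
        # byte always maps to exactly one code point and indexing is O(1).
        def __init__(self, obj, encoding='ascii', errors='strict'):
            self._view = memoryview(obj)
            self._encoding = encoding
            self._errors = errors

        def __len__(self):
            return len(self._view)

        def __getitem__(self, index):
            if isinstance(index, slice):
                # Slicing the view shares the buffer rather than copying it
                return strview(self._view[index], self._encoding, self._errors)
            if index < 0:
                index += len(self._view)
            return str(self._view[index:index + 1], self._encoding, self._errors)

        def __str__(self):
            # Full promotion to a real str object
            return str(self._view, self._encoding, self._errors)

        def encode(self, encoding, errors='strict'):
            if encoding == self._encoding:
                # Point 4's fast path: no decode/encode round trip needed
                return bytes(self._view)
            return str(self).encode(encoding, errors)

Usage would then look something like:

    data = b'scheme://host/path'
    view = strview(data)
    assert str(view[:6]) == 'scheme'   # decodes six bytes, not the lot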
An explicit String ABC, even if not supported for performance critical core functionality like identifiers, would allow the implementation of code like that in urllib.urlparse to be updated to avoid keying behaviour on the concrete builtin str type - instead, it would check against the String ABC, allowing for all the usual explicit type registration goodies that ABCs support (and that make them much better for type checking than concrete types). Just as much of the old UserDict functionality is now available on Mapping and MutableMapping, so much of the existing UserString functionality could be moved to the hypothetical String ABC.

Hopefully-the-rambling-isn't-too-incoherent'ly-yours,
Nick.

[1] http://lucumr.pocoo.org/2011/12/7/thoughts-on-python3/
[2] http://bugs.python.org/issue9873
[3] http://hg.python.org/cpython/file/default/Lib/urllib/parse.py#l74
[4] http://mail.python.org/pipermail/python-ideas/2011-June/010439.html
[5] http://bugs.python.org/issue10181
[6] http://bugs.python.org/issue11477

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Thu, 8 Dec 2011 00:53:44 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
3. There are issues with memoryview itself that should be accounted for if pursuing this idea [5]
These issues are related to complex buffer types (strided, multi-dimensional, etc.). They wouldn't apply to a hypothetical "linear unicode buffer".
The factory function is a better idea than silent pass-through, IMO.
Be careful: the incremental decoders use a layer of pure Python wrappers. You want to call them on big blocks (at least 4 or 8KB, as TextIOWrapper does) if you don't want to lose a lot of speed. So building a mapping may not be easy. Even bypassing the Python layer would still incur the overhead of repeatedly calling a standalone function, instead of having a tight loop such as the following: http://hg.python.org/cpython/file/e49220f4c31f/Objects/unicodeobject.c#l4228

And of course, your mapping must be space-efficient enough that it's much smaller than the full decoded string. I think that for small strings (< 1024 bytes?), decoding and storing the decoded string are not a big deal. Decoding once is *much* faster than trying to do it piecewise (especially for optimized encodings such as latin-1 or utf-8, and those will be the only ones left in a few years). strview would only be a win for rather large strings. Which makes it useless for URL parsing ;)
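Roughly, the block-at-a-time pattern being referred to looks like this (a sketch of the general technique, not TextIOWrapper's actual implementation):

    import codecs

    def decode_in_blocks(data, encoding='utf-8', block_size=8192):
        # Feed the incremental decoder large blocks to amortize the cost
        # of its pure Python wrapper layer; the decoder buffers any
        # multi-byte sequence that is split across a block boundary.
        decoder = codecs.getincrementaldecoder(encoding)()
        parts = [decoder.decode(data[i:i + block_size])
                 for i in range(0, len(data), block_size)]
        parts.append(decoder.decode(b'', final=True))
        return ''.join(parts)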
From a usability POV this seems undesirable. On the other hand, if complete decoding is required, calling str() is just as cheap.
7. The new type would similarly support the full string API, returning actual string objects rather than any kind of view.
Even for slicing?

On Thu, Dec 8, 2011 at 1:39 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Yeah, that's kind of where I'm going with this. For stateless encodings, views make sense - it's really just a memory view with a particular way of interpreting the individual characters, providing the string API rather than the bytes one. For the multitude of ASCII-compatible single byte codings and the various fixed-width encodings, that could be very useful. With some fiddling, you could support BOM and signature encodings, too (just by offsetting your view a bit and adjusting your interpretation of the individual code points).

But for the fully general case of stateful encodings (including all variable width encodings) it is basically impossible to do O(1) indexing (which is the whole reason the Unicode model is the way it is). Especially once PEP 393 is in place, you rapidly reach a point of diminishing returns where converting the whole shebang to Unicode code points and working directly on the code point array is the right answer (and, if it isn't, you're clearly doing something sufficiently sophisticated that you're going to be OK with rolling your own tools to deal with the problem).

In those terms, I'm actually wondering if it might be appropriate to extract some of the tools I created for the urllib.parse case and publish them via the string module.

1. Provide a string.Text ABC (why *did* we put UserString in collections, anyway?)

2. Provide a "coerce_to_str" helper:

    def coerce_to_str(*args, encoding, errors='strict'):
        # Invokes decode if necessary to create str args
        # and returns the coerced inputs along with
        # an appropriate result coercion function
        #   - a noop for str inputs
        #   - encoding function otherwise
        # False inputs (including None) are all coerced to the empty string
        # ("Text" here is the proposed string.Text ABC from point 1)
        args_are_text = isinstance(args[0], Text)
        if args_are_text:
            def _encode_result(obj):
                return obj
        else:
            def _encode_result(obj):
                return obj.encode(encoding, errors)
        def _decode(obj):
            if not obj:
                return ''
            if isinstance(obj, Text):
                return str(obj)
            return obj.decode(encoding, errors)
        def _decode_args(args):
            return tuple(map(_decode, args))
        for arg in args[1:]:
            # We special-case False values to support the relatively common
            # use of None and the empty string as default arguments
            if arg and args_are_text != isinstance(arg, Text):
                raise TypeError("Cannot mix text and non-text arguments")
        return _decode_args(args) + (_encode_result,)

Note the special-casing of None would be sufficient to support arbitrary defaults in binary/text polymorphic APIs:

    def f(a, b=None):
        a_str, b_str, _coerce_result = coerce_to_str(a, b, encoding='utf-8')
        if b is None:
            b_str = "Default text"

Cheers, Nick.
If we restricted strview to stateless encodings, then slicing could also return views. (There wouldn't be any point in returning a view for iteration or indexing, though - the view object would be bigger than any single-character string. In fact, we could probably figure out a cutoff whereby real strings are returned for sufficiently small slices, too.)

Cheers, Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Wed, Dec 7, 2011 at 6:51 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I really like PEP 393, and it has gotten much better even since the initial proposal, but this one objection has been bugging me the whole time -- I just can't find a good way to explain it. But with the concrete code, I will take a stab now...

I want the ability to use a more efficient string representation when I know one exists -- such as when I could be using a single-byte charset other than Latin-1, or when the underlying data is bytes, but I want to treat it as text temporarily without copying the whole buffer.

PyUnicode_Kind already supports the special case of PyUnicode_WCHAR_KIND (also known as "legacy string, not ready" -- http://hg.python.org/cpython/file/174fbbed8747/Include/unicodeobject.h around line 247). I would like to see another option for "custom subtype", and to accept that strings might stay in this state longer. A custom type need not allow direct access to the buffer as an array, so it would have to provide its own access functions. I accept that using these subtype-specific functions might be slower, but I think the downside for "normal" strings can be limited to an extra case statement in places like PyUnicode_WRITE (at http://hg.python.org/cpython/file/174fbbed8747/Include/unicodeobject.h lines 487-508; currently the default case asserts PyUnicode_4BYTE_KIND).

Looking at Barry's example:
Modelling this as bytes with a unicode view on top, this would work fine (so long as you sliced the view, rather than the original bytes object), but creating that string view wouldn't require copying the buffer. (Of course, the subtype's implementation of PyUnicode_Substring might well copy parts of the buffer.) I would expect bytes in particular to grow an as_string(encoding="Latin-1") method, which could be used to deprecate the various string-related methods. A type for an alternate one-byte encoding could be as simple as using the 1Byte variants to create a string of the same type when large strings are called for, and a translation function when individual characters are requested. -jJ
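For what it's worth, part of the effect described above can be had in pure Python today, even without the hypothetical as_string() method: a memoryview slice shares the underlying buffer, so only the bytes actually sliced out ever get decoded. A sketch:

    data = b"Content-Type: text/plain\r\n" * 100000
    view = memoryview(data)              # no copy of the ~2.6 MB buffer
    header = str(view[:26], 'latin-1')   # decodes just 26 bytes
    assert header == "Content-Type: text/plain\r\n"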

How long is your buffer? Have you timed how long it takes to "copy" (or decode) it?
The unicode implementation is already complicated enough. I think adding one further option will be a tough sell, if it doesn't exhibit major benefits. (note PyUnicode_WCHAR_KIND is deprecated and supposed to be removed some day, perhaps Python 4 :-))
Why deprecate useful functionality? Regards Antoine.

Jim Jewett writes:
For most people all of the time, and for almost all people most of the time, this is a YAGNI, and gets more so every year. As a facility of the language, it is an attractive nuisance for developers, many of whom will undoubtedly go searching for truffles and end up consuming Knuth's root of all error, and it will attract lots of one-off RFEs to deal with specific use cases that break with a minimal implementation.

N.B. Emacs has just given up on a 15-year experiment with such a minimal facility (the execrable "string-as-unibyte" toggle). Use of multiple internal text encodings is really a can of worms, as the Emacs experience demonstrates (they were unable to even write Latin-1 files properly, with repeated regressions of the so-called "\201 bug" that I know of 1995-2008, mostly because of misuse of string-as-unibyte). XEmacs, with a proper character type, eliminated the "\201 bug" *before* its multilingual version stopped crashing in the codecs. But even there, because the internal character type is based on ISO-2022, it sucks, and we consistently have issues with bogus decoding and the like that are hard to get around at the app level, because there's way too much generality at the underlying level that we try to handle "transparently".

That's where you're going; maybe you can do better than XEmacs levels of suckiness :-), but (for a general facility) it won't be easy. Better to do that at the application level, which can decide for itself what safeguards are needed.
or when the underlying data is bytes, but I want to treat it as text temporarily without copying the whole buffer.
That ship has sailed AFAICS. If the "copy-the-whole-buffer" style of polymorphism isn't good enough, you have special knowledge of the data and/or the application, and it's a layering violation to ask Python to manage that data for you because Python's model of text is str. It will result in unexpected UnicodeErrors.
I expect that library code that must be robust against UnicodeError (eg, email) will need to be prepared for gratuitous errors from custom types. Since that's at least a desideratum for all stdlib code, this could be rather more expensive than you suggest.
In some cases I would expect to be common, this would require substantial analysis to determine whether it wasn't a pessimization. I suppose that in many use cases you will be implicitly creating many strings, and the space overhead of the implicit strings may be greater than the size of the single string. In cases where the analysis is simple (eg, parsing an RFC 822 message header out of the middle of a huge mbox file), the analysis that shows that this could be done efficiently with a custom type can easily and efficiently be converted to an implementation based on converting only the bytes needed.

I understand the attraction of such facilities for simplifying user code, but given my somewhat extensive experience with maintaining them, I recommend that Python core Just Say No. It's just too hard to maintain "text invariants" when you might be processing a few million bytes from /dev/urandom. If one (as an application programmer) knows better, and of course she does, then shouldn't she DTRT at the application code level?

On Dec 08, 2011, at 12:53 AM, Nick Coghlan wrote:
I've just finished a port of dbus-python to Python 3, submitting the patches upstream, although they haven't been reviewed yet. It was an interesting exercise for many reasons, and I have my own thoughts on the state of porting which I'll post at another time. I agree with some of the issues that Armin brought up, and disagree with others. ;)

At the C level, dbus defines its strings as UTF-8 encoded char*'s. My first crack at this (despite upstream's thinking otherwise) was to use bytes to represent these objects. That turned out to be completely infeasible for several reasons. The biggest problem is that some of the core callback dispatch code was doing slicing and comparisons of these objects against literals, or externally registered objects. So where you might see something like:

    >>> s = ':1/123'
    >>> s[:1] == ':'
    True

this doesn't work when `s` is a bytes object. Given the number of places internally that would have to be fixed, and the cost of imposing a potentially huge number of changes to clients of the library, I ultimately decided that upstream's suggestion to model these things as unicodes was right after all. Once I made that change, the port went relatively easily, as judged by the amount of time it took to get the test suite completing successfully. ;)

At the Python level, many of the interfaces which accept 8-bit strings in Python 2 accept bytes or strs in Python 3, where the bytes must be utf-8 encoded. Internally, I had to do many more type checks for PyUnicodes, and then encode them to bytes before I could pass them to the dbus C API. In almost all cases though, returning data from the dbus C API involved decoding the char*'s to unicodes, not bytes. This means rather than returning bytes if bytes were given, at the Python layer, unicode is always returned. This I think causes the least disruption in user code. Well, we'll see, as I'm now going to be porting some dbus-based applications.
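Concretely, the mismatch looks like this in Python 3 (bytes only compare equal to bytes, and indexing a bytes object yields integers):

    >>> s = b':1/123'
    >>> s[:1] == b':'   # comparing bytes to bytes still works
    True
    >>> s[:1] == ':'    # but bytes never compare equal to str
    False
    >>> s[0]            # and indexing yields an int, not a length-1 value
    58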
It's certainly an interesting idea, although I can't decide whether this is more implicit or more explicit. I'm not sure it would have helped me with dbus-python since most of the porting work happens at the interface between Python and an existing C API with clearly defined semantics. There's also a large body of existing code that uses the library, so an important goal is to make *their* porting easier. Have you had any experience porting applications which use the new urllib and does it make your life easier?
I also wonder if some of these ideas would help with RDM's re-imagining of the email package. email is notoriously quantum in its bytes/str duality - sometimes the same email data needs to act like a str or a bytes at different times. The email-sig design attempts to address this with a layered approach, although I think we're still wondering whether this is going to work in practice. (I'm not sure about the current state of the email package work.) One difference here is that email usually tells you explicitly what the encoding is. Assuming the Content-Types don't lie or specify charsets that are unknown to Python, I think it would be nice to pass them through the APIs for better idempotency.
In the dbus port, this is mostly not a problem, since the dbus-python data types derive from PyBytes or PyUnicode. There are one or two places that do _CheckExacts() for various reasons, but I think mostly everything works properly with _Check() calls. I know that's not duck-typing though, and I agree with you about the problem at the interpreter level.

(It turned out to be a bigger PITA with int/long. An important requirement was to not change the API for Python 2, so types which derived from PyInts in Python 2 had to derive from PyLongs in Python 3. Liberal use of #ifdefs mostly handles the complexity, but clients porting to Python 3 will definitely see API differences, e.g. in inheritance structure and such.)

I've rambled on enough, but I think you and Armin bring up some good points that we really need to address in Python 3.3. I'm fairly convinced that there's little we could have done differently before now, and most of these issues are cropping up because people are doing actual real-world ports now, which is a good thing! I also strongly disagree with Armin that a Python 2.8 could help in any way. Much better to expose these problems now and help make Python 3.3 the best it can be.

-Barry

On Thu, Dec 8, 2011 at 5:50 AM, Barry Warsaw <barry@python.org> wrote:
If you look at the way the os APIs are defined, they mostly work on a "bytes in, bytes out" model (e.g. os.listdir(), os.walk()). Where there is no input type to reliably determine the expected output type, then a 'b' variant is added (e.g. os.getcwdb(), os.environb). (And yes, I do occasionally wonder if we should have a builtin "openb" shorthand for "open(name, 'b')" with a signature that omits all the parameters that are only valid for text mode files: "openb(file, mode, buffering, closefd)") An "always unicode out" model can work, too, but you need to be sure your clients can cope with that. One thing I like about my proposed string.coerce_to_str() API is that you can use it to implement either approach. If you want bytes->bytes, then you call the result coercion function, if you want to always emit unicode, then you skip that step.
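For what that shorthand might look like - just a thin wrapper around open(), with the name and signature taken from the parenthetical above, not anything that actually exists:

    def openb(file, mode='r', buffering=-1, closefd=True):
        # Binary-only open(): force 'b' mode and omit the parameters
        # (encoding, errors, newline) that only apply to text mode files.
        if 'b' not in mode:
            mode += 'b'
        return open(file, mode, buffering, closefd=closefd)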
No, the API updates were in response to user requests (and extended discussions here and on python-dev). However, the use case of getting URL components off the wire, manipulating them and putting them back on the wire, all in an RFC compliant strict ASCII encoding made a lot of sense, which is why we ended up going with this model (that and the os module precedent).
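That model is visible in the released 3.2 API - the result components follow the input type:

    >>> from urllib.parse import urlsplit
    >>> urlsplit('http://example.com/path?q=1').query
    'q=1'
    >>> urlsplit(b'http://example.com/path?q=1').query
    b'q=1'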
The email model in 3.2 (which actually makes it somewhat usable) is quite similar to the way urllib.urlparse works (although I believe there may be a bit more internal code duplication). The two API updates actually co-evolved as part of the same set of discussions, which is how they came to share the bytes->bytes, str->str philosophy. I'm not sure about the status of email6 either, though - has anyone heard from RDM lately?
Indeed - it's *already* the case that ports that can drop 2.5 support have a much easier time of things, and those that can also drop 2.6 have it easier still (I believe the latter is uncommon though, since most people want their stuff to run on platforms like the RHEL6 system Python installation). Whether or not 2.8 exists won't magically make the need to support 2.5 (or even earlier versions!) go away - the only thing that will make that happen is time, as people upgrade their underlying OS installations.

Cheers, Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Dec 08, 2011, at 10:22 AM, Nick Coghlan wrote:
I think the ability to drop anything less than 2.6 is a huge win, but the ability to drop 2.6 itself is a smaller win. Yes, there are things in 2.7 that help make porting to Python 3 easier still, but at least IME so far, not so much as to be overwhelmingly compelling. Python 2.6 as a minimum *is* very compelling. Cheers, -Barry
