[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Wed Jan 8 18:57:10 CET 2014

On Jan 8, 2014, at 2:18, Mark Lawrence <breamoreboy at yahoo.co.uk> wrote:

> On 08/01/2014 09:59, Nick Coghlan wrote:
>> 
>> Now that your proposal has been better explained, yes, I agree that
>> "asciibytes" and "asciistr" types would be well worth experimenting
>> with. I mention both, since it's far from clear if a str subclass or a
>> bytes subclass (or neither, although that may require bug fixes in
>> CPython) would be more convenient for this use case.
> 
> Could you subclass both to get the best of both worlds?  As in
> 
> class asciixyz(str, bytes):

You can't. (Try it.) More importantly, how would that work?
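For anyone who doesn't have an interpreter handy, here is a minimal sketch of the "try it" experiment. The class name comes from the quoted message; the `error` variable is only there for illustration. The exact wording of the message is a CPython detail and may differ between versions.

```python
# Attempt the dual inheritance suggested above and record what happens.
try:
    class asciixyz(str, bytes):
        pass
except TypeError as exc:
    # CPython refuses because str and bytes each define their own,
    # incompatible C-level instance layout.
    error = exc
else:
    error = None

print(type(error).__name__, error)
```

On CPython this reports a TypeError about an instance layout conflict between the two bases, which is exactly the "two separate implementations" problem.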

You'd have the implementation of str (effectively a tagged union of char8/char16/char32 arrays) plus the separate implementation of bytes (effectively a char8 array). Do you leave the first one empty? And then avoid super() and instead explicitly delegate only to the bytes base?

That could work (at the relatively minimal cost of an extra empty '' worth of storage) as long as you don't run into any code that tries to use the internal details of the str. But unfortunately, most builtins and extension module functions _do_ try to use the internal details of the str. 
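Since inheriting from both bases is not possible, the closest thing I can sketch is a str subclass whose own str storage is left empty and whose methods delegate to a bytes payload. The names `EmptyStrBytes` and `_payload` are made up for this example, and only a few methods are delegated. It shows both halves of the claim above: Python-level calls see the payload, while C code that reads the str internals sees only the empty string.

```python
import io


class EmptyStrBytes(str):
    """Hypothetical type: a str whose own character data is always ''."""

    def __new__(cls, payload):
        self = super().__new__(cls, "")  # leave the str part empty
        self._payload = bytes(payload)   # the real data lives here
        return self

    # Explicit delegation to the bytes payload instead of super().
    def __len__(self):
        return len(self._payload)

    def __getitem__(self, index):
        return self._payload[index]

    def upper(self):
        return type(self)(self._payload.upper())


value = EmptyStrBytes(b"GET /index.html")

# Python-level code goes through the overridden methods.
python_len = len(value)
python_upper = value.upper()._payload

# C-level code reads the str's internal array directly and never
# calls the overrides, so it only ever sees the empty string.
stream = io.StringIO()
stream.write(value)
written = stream.getvalue()
joined = "-".join([value, value])

print(python_len, python_upper, repr(written), repr(joined))
```

Both `io.StringIO.write` and `str.join` accept the object, because it really is a str, and then silently use the empty internal data.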

In CPython, for example, a function that takes a string usually does so by parsing the argument with a format code such as u#, which gives you the character array from a str directly. Even functions that take str objects will usually at some point call string-protocol functions to get at their array.

The simple way around this is to make all such functions effectively call __str__ on any object that isn't a real str. But that would make almost _everything_ usable as a string--f.write(2) would now work. So you'd really need to create a new dunder method (and C API slot) __asstr__ that's only implemented by objects that really want to act like a str, not just have a str representation. Also, I'm not sure all such functions have a reasonable way to refcount the resulting str object properly. 
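To make the `__asstr__` idea concrete, here is a rough pure-Python sketch of what such a hook could look like. Everything in it is hypothetical: `__asstr__` is only the name proposed above, and `as_str`, `AsciiBytes` and `write` are made up for this example. In a real implementation the lookup would live in the C API rather than in a helper function.

```python
import io


def as_str(obj):
    """Return a real str for obj, or raise TypeError (hypothetical helper)."""
    if isinstance(obj, str):
        return obj
    # Look the hook up on the type, the way other dunder methods are found.
    hook = getattr(type(obj), "__asstr__", None)
    if hook is None:
        # Having a __str__ is not enough; the type must opt in explicitly.
        raise TypeError(
            "expected str-like object, got " + type(obj).__name__
        )
    result = hook(obj)
    if not isinstance(result, str):
        raise TypeError("__asstr__ must return a str")
    return result


class AsciiBytes(bytes):
    """Hypothetical bytes subclass that opts in to acting like a str."""

    def __asstr__(self):
        return self.decode("ascii")


def write(stream, text):
    # Stand-in for a builtin that converts its argument before use.
    return stream.write(as_str(text))


out = io.StringIO()
write(out, AsciiBytes(b"hello"))
print(out.getvalue())

try:
    write(out, 2)
except TypeError as exc:
    print("rejected:", exc)
```

The point of the separate hook is the last call: `2` has a perfectly good `__str__`, but it is still rejected because `int` does not define `__asstr__`.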

The alternative would be to expose the entire string protocol into Python--including, most importantly, the methods to get at the array directly. I'm not sure how you'd even design the API for those methods in Python. We don't even expose the buffer protocol to Python today.

I didn't go into all this detail to try to prove that the idea is impossible, but rather in hopes that someone would have an answer that makes everything work. Making string-protocol strings more "pluggable" might have other benefits besides the "encodedstr" type. Imagine being able to build an explicitly UTF-16 type to make it faster and easier to deal with Win32 or Java or other such things. (Or could you just use encodedstr('utf-16-le') for that?) Or expose a "rope"-like type for large mutable strings. Or experiment with alternatives to the 3.3-style internal storage, like Stephen's ASCII-compatible byte-smuggling flag, by faking them in Python instead of building them in C. (That would probably be sufficient to find any holes in the specification, even if it wouldn't be very helpful for perf testing.)