Aside: I just read Victor's PEP 460, and apparently a lot of the
assumptions I'm making are true!

Andrew Barnert writes:

 > From: Geert Jansen
 > > On Mon, Jan 6, 2014 at 11:57 AM, Stephen J. Turnbull wrote:
 > > I'm not missing a new type, but I am missing the format method on
 > > the binary types.

I'm curious about precisely what your use cases are, and just what
formatting they need.
 > Besides Geert's chunked HTTP example, there are tons of internet
 > protocols and file formats (including Python source code!),

Python source code must use an ASCII-compatible encoding to use PEP
263.  No widechars, no EBCDIC.  But yes, I know about ASCII header
formats -- I'm a Mailman developer.

 > that have ASCII headers (that in some way define an encoding for
 > the actual payload).  So things like
 > b'Content-Length: {}'.format(len(payload)) or even
 > b'Content-Type: text/html; charset={}'.format(encoding) are useful.
Useful, sure.  But that much more useful than the alternative?  What's
wrong with

    def itob(n):  # besides efficiency :-)
        return "{0:d}".format(n).encode('ascii')

    b'Content-Length: ' + itob(len(payload))
    b'Content-Type: text/html; charset=' + encoding

for such cases?  Not to forget that for cases with multiple parts to
combine, bytes.join() is way fast -- which matters to most people who
want these operations.  So I just don't see a real need for generic
formatting operations here.  (regex is another matter, but that's
already implemented.)
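To make the alternative above concrete, here is the itob sketch filled
out into a runnable snippet; the payload and encoding values are
stand-ins invented for illustration, not part of any real protocol
exchange:

```python
def itob(n):
    # Format the integer as text, then encode; 'ascii' always
    # suffices for the decimal digits produced by format().
    return "{0:d}".format(n).encode('ascii')

# Stand-in values, purely for illustration:
payload = b'<html></html>'
encoding = b'utf-8'

header = b'Content-Length: ' + itob(len(payload))

# For multiple parts, bytes.join() assembles everything in one pass:
request = b'\r\n'.join([
    b'Content-Type: text/html; charset=' + encoding,
    header,
    b'',
    payload,
])
```

Concatenation with + is fine for a header or two; join() is the idiom
when many fragments are combined, since it allocates the result once.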
 > As for assuming that it's ASCII-compatible, again, there are all
 > kinds of protocols that work with any ASCII-compatible charset but
 > don't work otherwise.
If you *can* assume it's ASCII-compatible bytes, what's wrong with str
in Python 3?  The basic idea is to use

    inbytes.decode('ascii', errors='surrogateescape')

which will DTRT if you try to encode it without the surrogateescape
handler: it raises an exception unless the bytes is pure ASCII.  It's
memory-efficient for pure ASCII, and has all the string facilities we
love.  But of course it would be too painful for sending JPEGs by
chunked HTTP a la Geert.

So ... now that we have the flexible string representation (PEP 393),
let's add a 7-bit representation!  (Don't take that too seriously,
there are interesting more general variants I'm not going to talk
about tonight.)  The 7-bit representation satisfies the following
requirements:

1. It is only produced on input by a new 'ascii-compatible' codec,
   which sets the "7-bit representation" flag in the str object on
   input if it encounters any non-ASCII bytes (if pure ASCII, it
   produces an 8-bit str object).  This will be slower than just
   reading in the bytes in many cases, but I hope not unacceptably so.

2. When sliced, the result needs to be checked for non-ASCII bytes.
   If none, the result is promoted to 8-bit.

3. When combined with a str in 8-bit representation:

   a. If the 8-bit str contains any Latin-1 or C1 characters, both
      strs are promoted to 16-bit, and non-ASCII characters in the
      7-bit string are converted by the surrogateescape handler.

   b. Otherwise they're combined into a 7-bit str.

4. When combined with a str in 16-bit or 32-bit representation, the
   7-bit string is "decoded" to the same representation, as if using
   the 'ascii' codec with the 'surrogateescape' handler.

5. String methods that would raise or produce undefined results if
   used on str containing surrogate-encoded bytes need to be taught to
   do the same on non-ASCII bytes in 7-bit str objects.

6. On output the 'ascii-compatible' codec simply memcpy's 7-bit str
   and pure ASCII 8-bit str, and raises on anything else.  (Sorry, no,
   ISO 8859-1 does *not* get passed through without exception.)

7. On output other codecs raise on a 7-bit str, unless the
   surrogateescape handler is in use.

IOW, it's almost as fast as bytes if you restrict yourself to
ASCII-compatible behavior, and you pay the price if you try to mix it
with "real" Unicode str objects.  Otherwise you can do anything with
it you could do with a str.

I don't think this actually has serious efficiency implications for
Unicode handling, since the relevant compatibility tests need to be
done anyway when combining strs.  All the expensive operations occur
when mixing 7-bit str and "real" non-ASCII Unicode, but we really
don't want to do that if we can avoid it, any more than we want to use
surrogate encoding if we can avoid it.  Efficiency for low-level
protocols could be improved by having the 'ascii-compatible' codec
always produce 7-bit.  I haven't thought carefully about this yet.

For the same reasons, there should be few surprises where people
inadvertently mix 7-bit str with "real" Unicode, since creating 7-bit
str is only done by the 'ascii-compatible' codec.  People who are
doing that will be using ASCII-compatible protocols and should be used
to being careful with non-ASCII bytes.

Finally, none of the natural idioms require a b prefix on their
literals. :-)

N.B. Much of the above assumes that working with Unicode in 8-bit
representation is basically as efficient as working with bytes.  That
is an assumption on my part, I hope it's verified.

Comments?
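For what it's worth, the surrogateescape round-trip behavior the
proposal builds on can already be seen in today's Python; the sample
bytes below are made up for illustration:

```python
# Non-ASCII tail bytes stand in for some binary-ish protocol data:
raw = b'Content-Type: text/html; charset=\xff\xfe'

# The non-ASCII bytes 0xFF and 0xFE become the lone surrogates
# U+DCFF and U+DCFE in the resulting str:
s = raw.decode('ascii', errors='surrogateescape')

# With the handler, encoding is a lossless round trip:
assert s.encode('ascii', errors='surrogateescape') == raw

# Without it, encoding raises unless the str is pure ASCII:
try:
    s.encode('ascii')
except UnicodeEncodeError:
    pass  # raises on non-ASCII, as requirement 7 would have it
```

The proposal is essentially about giving this existing behavior a
compact internal representation, not about changing its semantics.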