Andrew Barnert writes:
a. If the 8-bit str contains any Latin-1 or C1 characters, both strs are promoted to 16-bit, and non-ASCII characters in the 7-bit string are converted by the surrogateescape handler.
This part worries me a bit. The bytes 61 62 63 FF in this new representation actually _mean_ 'abc' followed by a smuggled FF byte.
No, they don't. They mean 'abc' followed by something that cannot be encoded by any codec without the surrogateescape handler. 'ascii-compatible' merely defaults to that handler. I wouldn't actually be too upset if I were told, no, you have to specify the handler explicitly.
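To make the point concrete, here is a minimal sketch. It assumes the stock 'ascii' codec with errors='surrogateescape' as a stand-in, since 'ascii-compatible' is only a proposal and not a codec that exists in Python; the helper name strict_encode_fails is purely illustrative.

```python
# Stand-in for the proposed codec: plain 'ascii' plus the surrogateescape handler.
raw = b'abc\xff'
text = raw.decode('ascii', errors='surrogateescape')

# The undecodable FF byte is smuggled in as the lone surrogate U+DCFF.
assert text == 'abc\udcff'


def strict_encode_fails(s, codec):
    """Return True if encoding s with the default (strict) handler raises."""
    try:
        s.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


# With the default strict handler, none of these codecs accepts the lone surrogate.
results = {codec: strict_encode_fails(text, codec)
           for codec in ('ascii', 'latin-1', 'utf-8')}

# Naming the handler explicitly recovers the original bytes.
restored = text.encode('ascii', errors='surrogateescape')
```

The explicit errors= argument on the last line is what "you have to specify the handler explicitly" would look like in practice.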
6. On output the 'ascii-compatible' codec simply memcpy's 7-bit str and pure ASCII 8-bit str, and raises on anything else.
So if a 7-bit string gets converted to a surrogate-escaped 16-bit string, it can never be written out again?
Of course it can. Use .encode('ascii', errors='surrogateescape')
(b'abc\xff'.decode('ascii-compatible') + '\u1234')[:4].encode('ascii-compatible')
I'd expect to get back my b'abc\xff'. But your rules give me an exception.
Yes. This whole proposal was aimed at wire protocols. It's very bad if something intended to be ready to be squirted into the wire needs (expensive) encoding.
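For comparison, the slicing expression quoted above can be run today with the existing surrogateescape machinery. This is only a sketch: CODEC = 'ascii' is an assumed stand-in for the hypothetical 'ascii-compatible' codec, and it says nothing about the proposed 7-bit/8-bit/16-bit representations, only about what the bytes-in, bytes-out behavior looks like when the handler is allowed on output.

```python
# 'ascii-compatible' is hypothetical; 'ascii' + surrogateescape stands in for it.
CODEC = 'ascii'

decoded = b'abc\xff'.decode(CODEC, errors='surrogateescape')  # 'abc' + U+DCFF
combined = decoded + '\u1234'   # concatenation with a non-Latin-1 character
sliced = combined[:4]           # drops U+1234 again, leaving 'abc' + U+DCFF

# Encoding with the handler turns the escaped surrogate back into the FF byte.
round_tripped = sliced.encode(CODEC, errors='surrogateescape')
```

Under the rules as proposed, the final step would raise instead, because the string has already been promoted to a 16-bit representation.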
I think ascii-compatible has to accept non-8-bit-repr strings (encoding ASCII as ASCII and surrogate escapes as bytes, and raising an exception for everything else). This is necessary because 61 62 63 FF (7-bit) and 0061 0062 0063 DCFF (16-bit) are the same string anyway. But it's especially necessary because the former can be silently converted into the latter (and there's no way to even test whether that's happened).
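The encoding rule suggested here can be sketched in pure Python. The function name encode_ascii_compatible is hypothetical, and the code works on ordinary str objects rather than on the proposed internal representations; it is an illustration of the rule, not an implementation of the proposal.

```python
def encode_ascii_compatible(text: str) -> bytes:
    """Hypothetical encoder: ASCII maps to itself, escaped surrogates
    U+DC80..U+DCFF map back to bytes 0x80..0xFF, anything else raises."""
    out = bytearray()
    for index, ch in enumerate(text):
        code = ord(ch)
        if code < 0x80:
            # Plain ASCII is copied through unchanged.
            out.append(code)
        elif 0xDC80 <= code <= 0xDCFF:
            # A surrogate escape carries the original byte in its low 8 bits.
            out.append(code - 0xDC00)
        else:
            raise UnicodeEncodeError(
                'ascii-compatible', text, index, index + 1,
                'not ASCII and not a surrogate-escaped byte')
    return bytes(out)
```

For strings made only of ASCII and escaped bytes this gives the same result as .encode('ascii', errors='surrogateescape'), so 'abc\udcff' comes back as b'abc\xff' regardless of how the string happens to be stored.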
Well, one way around that would be to require that the latter not exist (convert it to "7-bit" during construction). But I've come to the conclusion that this is all too irregular and confusing. I'm pretty sure that I can come up with a set of rules that are not inherently self-contradictory, but I'm also pretty sure that the resulting type will behave unintuitively for almost everybody. Also, despite my original thought, it's really hard to see how unnecessary encode/decode cycles can be eliminated. So I think I need to go back to the drawing board. I hope I haven't wasted too much of your time; it's been very educational for me.