[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]
Stephen J. Turnbull
stephen at xemacs.org
Wed Jan 8 18:47:41 CET 2014
Andrew Barnert writes:
> > a. If the 8-bit str contains any Latin-1 or C1 characters, both
> > strs are promoted to 16-bit, and non-ASCII characters in the
> > 7-bit string are converted by the surrogateescape handler.
>
> This part worries me a bit. The bytes 61 62 63 FF in this new
> representation actually _mean_ 'abc' followed by a smuggled FF
> byte.
No, they don't. They mean 'abc' followed by something that cannot be
encoded by any codec without the surrogateescape handler.
'ascii-compatible' merely defaults to that handler. I wouldn't
actually be too upset if I were told, no, you have to specify
explicitly.
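
A minimal sketch of that point. The 'ascii-compatible' codec is only a
proposal, so the stock 'ascii' codec with an explicit handler stands in
for it here, and can_encode is a hypothetical helper written for this
illustration:

```python
# Assumption: the stock 'ascii' codec stands in for the proposed
# 'ascii-compatible' codec, which does not exist in CPython.
s = b'abc\xff'.decode('ascii', errors='surrogateescape')
assert s == 'abc\udcff'  # the FF byte is carried as a lone surrogate


def can_encode(text, codec, errors='strict'):
    """Hypothetical helper: report whether text encodes without raising."""
    try:
        text.encode(codec, errors=errors)
        return True
    except UnicodeEncodeError:
        return False


# With the default 'strict' handler the lone surrogate is rejected.
strict_results = {c: can_encode(s, c) for c in ('ascii', 'latin-1', 'utf-8')}

# With the handler named explicitly, the same string encodes.
explicit_result = can_encode(s, 'ascii', errors='surrogateescape')
```

Only the call that names the handler succeeds, which is the sense in
which the string "cannot be encoded" otherwise.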
> > 6. On output the 'ascii-compatible' codec simply memcpy's 7-bit str
> > and pure ASCII 8-bit str, and raises on anything else.
>
> So if a 7-bit string gets converted to a surrogate-escaped 16-bit
> string, it can never be written out again?
Of course it can. Use .encode('ascii', errors='surrogateescape').
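
As a sketch, with the stock 'ascii' codec standing in for the proposed
one and a made-up wire payload as input:

```python
# Assumption: 'wire' is an invented example payload; the stock 'ascii'
# codec is used because 'ascii-compatible' is only a proposal.
wire = b'X-Header: caf\xe9\r\n'

# Decoding smuggles the non-ASCII byte in as a lone surrogate (U+DCE9).
text = wire.decode('ascii', errors='surrogateescape')

# Ordinary str operations still work on the result.
name, _, value = text.partition(': ')

# Writing it out again restores the original bytes exactly.
written = text.encode('ascii', errors='surrogateescape')
```

The cost is that this is a real encode pass over the data, not a memcpy.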
> (b'abc\xff'.decode('ascii-compatible') + '\u1234')[:4].encode('ascii-compatible')
>
> I'd expect to get back my b'abc\xff'. But your rules give me an
> exception.
Yes. This whole proposal was aimed at wire protocols. It's very bad
if something intended to be ready to be squirted into the wire needs
(expensive) encoding.
> I think ascii-compatible has to accept non-8-bit-repr strings (by
> encoding ASCII as ASCII and surrogate escapes as bytes and
> everything else is an exception). This is necessary because 61 62
> 63 FF (7-bit) and 0061 0062 0063 DCFF (16-bit) are the same string
> anyway. But it's especially necessary because the former can be
> silently converted into the latter (and there's no way to even test
> whether that's happened).
Well, one way around that would be to require that the latter not
exist (convert it to "7-bit" during construction).
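
For comparison, the behavior described in the quoted paragraph (ASCII as
ASCII, surrogate escapes as bytes, everything else an exception) is what
the stock 'ascii' codec does today when the handler is given explicitly.
A rough sketch, in which encode_ascii_compatible is a hypothetical name
and not part of the proposal:

```python
def encode_ascii_compatible(text):
    """Hypothetical stand-in: ASCII stays ASCII, escaped bytes come back
    as bytes, and any other character raises UnicodeEncodeError."""
    return text.encode('ascii', errors='surrogateescape')


# The 16-bit spelling and the decoded byte string are the same str.
escaped = '\u0061\u0062\u0063\udcff'
same = escaped == b'abc\xff'.decode('ascii', errors='surrogateescape')

# The quoted example: concatenate a BMP character, then slice it off.
sliced = (b'abc\xff'.decode('ascii', errors='surrogateescape') + '\u1234')[:4]
round_tripped = encode_ascii_compatible(sliced)

# A real non-ASCII character is still rejected.
try:
    encode_ascii_compatible('abc\u1234')
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

This says nothing about the internal representation, only about what the
codec accepts on output.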
But I've come to the conclusion that this is all too irregular and
confusing. I'm pretty sure that I can come up with a set of rules
that are not inherently self-contradictory, but I'm also pretty sure
that the resulting type will behave unintuitively for almost
everybody. Also, despite my original thought, it's really hard to see
how unnecessary encode/decode cycles can be eliminated. So I think I
need to go back to the drawing board.
So I hope I haven't wasted too much of your time; it's been very
educational for me.