[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Wed Jan 8 18:47:41 CET 2014

Andrew Barnert writes:

 > >    a.  If the 8-bit str contains any Latin-1 or C1 characters, both
 > >        strs are promoted to 16-bit, and non-ASCII characters in the
 > >        7-bit string are converted by the surrogateescape handler.
 > 
 > This part worries me a bit. The bytes 61 62 63 FF in this new
 > representation actually _mean_ 'abc' followed by a smuggled FF
 > byte.

No, they don't.  They mean 'abc' followed by something that cannot be
encoded by any codec without the surrogateescape handler.
'ascii-compatible' merely defaults to that handler.  I wouldn't
actually be too upset if I were told, no, you have to specify
explicitly.

 > > 6.  On output the 'ascii-compatible' codec simply memcpy's 7-bit str
 > >    and pure ASCII 8-bit str, and raises on anything else.
 > 
 > So if a 7-bit string gets converted to a surrogate-escaped 16-bit
 > string, it can never be written out again?

Of course it can.  Use .encode('ascii', errors='surrogateescape').
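
A minimal sketch of that round trip, using the 'ascii' codec that exists
today (the proposed 'ascii-compatible' codec is hypothetical, so this only
illustrates the handler, not the proposal itself):

```python
raw = b'abc\xff'

# Decoding with surrogateescape maps the undecodable byte 0xFF
# to the lone surrogate U+DCFF instead of raising.
text = raw.decode('ascii', errors='surrogateescape')

# Encoding with the same handler turns U+DCFF back into byte 0xFF.
round_tripped = text.encode('ascii', errors='surrogateescape')

# Without the handler, the escaped character cannot be encoded.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```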

 > (b'abc\xff'.decode('ascii-compatible') + '\u1234')[:4].encode('ascii-compatible')
 > 
 > I'd expect to get back my b'abc\xff'. But your rules give me an
 > exception.

Yes.  This whole proposal was aimed at wire protocols.  It's very bad
if something intended to be ready to be squirted into the wire needs
(expensive) encoding.
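
For comparison, here is a sketch of the same expression with today's
'ascii' codec and an explicit surrogateescape handler standing in for the
hypothetical 'ascii-compatible' codec.  It pays for a real encode step,
which is exactly the cost the proposal wanted to avoid, but it does
round-trip:

```python
# Decode wire bytes, escaping the non-ASCII byte as U+DCFF.
decoded = b'abc\xff'.decode('ascii', errors='surrogateescape')

# Concatenating a non-Latin-1 character forces a wider representation.
widened = decoded + '\u1234'

# Slice back down to the original four characters.
prefix = widened[:4]

# Explicit (non-memcpy) encode restores the original bytes.
result = prefix.encode('ascii', errors='surrogateescape')
```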

 > I think ascii-compatible has to accept non-8-bit-repr strings (by
 > encoding ASCII as ASCII and surrogate escapes as bytes and
 > everything else is an exception). This is necessary because 61 62
 > 63 FF (7-bit) and 0061 0062 0063 DCFF (16-bit) are the same string
 > anyway. But it's especially necessary because the former can be
 > silently converted into the latter (and there's no way to even test
 > whether that's happened).

Well, one way around that would be to require that the latter not
exist (convert it to "7-bit" during construction).
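
The correspondence between the two spellings can be checked with the
existing surrogateescape handler.  This is only a sketch of the mapping
(byte FF to code point DCFF); it says nothing about how the proposed
"7-bit" representation would be stored:

```python
# The four wire bytes from the example above.
wire = b'\x61\x62\x63\xff'

# Decode, escaping the byte that is not ASCII.
escaped = wire.decode('ascii', errors='surrogateescape')

# List each character's code point as four hex digits.
code_units = [format(ord(ch), '04X') for ch in escaped]
```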

But I've come to the conclusion that this is all too irregular and
confusing.  I'm pretty sure that I can come up with a set of rules
that are not inherently self-contradictory, but I'm also pretty sure
that the resulting type will behave unintuitively for almost
everybody.  Also, despite my original thought, it's really hard to see
how unnecessary encode/decode cycles can be eliminated.  So I think I
need to go back to the drawing board.

So I hope I haven't wasted too much of your time; it's been very
educational for me.