[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Stephen J. Turnbull stephen at xemacs.org
Mon Jan 6 19:37:36 CET 2014


Aside: I just read Victor's PEP 460, and apparently a lot of the
assumptions I'm making are true!

Andrew Barnert writes:
 > From: Geert Jansen <geertj at gmail.com>
 > > On Mon, Jan 6, 2014 at 11:57 AM, Stephen J. Turnbull <stephen at xemacs.org> 
 > > wrote:
 > > 
 > >>   > I'm not missing a new type, but I am missing the format method on 
 > >>   > the binary types.
 > >> 
 > >>  I'm curious about precisely what your use cases are, and just what
 > >>  formatting they need.
 > 
 > Besides Geert's chunked HTTP example, there are tons of internet
 > protocols and file formats (including Python source code!),

Python source code must use an ASCII-compatible encoding to use PEP
263.  No widechars, no EBCDIC.  But yes, I know about ASCII header
formats -- I'm a Mailman developer.
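
The reason is the coding cookie itself: it has to be readable before
the declared codec is known, so the first line or two must make sense
as ASCII-compatible bytes.  For example:

    # -*- coding: latin-1 -*-
    s = "café"   # the rest of the file is decoded as latin-1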

 > that have ASCII headers (that in some way define an encoding for
 > the actual payload). So things like
 > b'Content-Length: {}'.format(len(payload))
 > or even
 > b'Content-Type: text/html; charset={}'.format(encoding)
 > are useful.

Useful, sure.  But is it that much more useful than the alternative?
What's wrong with

    def itob(n):
        # besides efficiency :-)
        return "{0:d}".format(n).encode('ascii')

    b'Content-Length: ' + itob(len(payload))

    b'Content-Type: text/html; charset=' + encoding.encode('ascii')

for such cases?  Not to mention that for cases with multiple parts to
combine, bytes.join() is very fast -- which matters to most people who
want these operations.  So I just don't see a real need for generic
formatting operations here.  (regex is another matter, but that's
already implemented.)
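
To make that concrete, here's a rough sketch of assembling a header
block with bytes.join() and the itob() helper above ('payload' and
'encoding' are hypothetical stand-ins for whatever the protocol code
already has):

    payload = b'<html>...</html>'   # hypothetical body, already bytes
    encoding = 'utf-8'              # hypothetical charset name (a str)

    message = b'\r\n'.join([
        b'Content-Type: text/html; charset=' + encoding.encode('ascii'),
        b'Content-Length: ' + itob(len(payload)),
        b'',                        # blank line terminating the headers
        payload,
    ])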

 > As for assuming that it's ASCII-compatible, again, there are all
 > kinds of protocols that work with any ASCII-compatible charset but
 > don't work otherwise.

If you *can* assume it's ASCII-compatible bytes, what's wrong with str
in Python 3?  The basic idea is to use

    inbytes.decode('ascii', errors='surrogateescape')

which will DTRT if you try to encode it without the surrogateescape
handler: it raises an exception unless the original bytes were pure
ASCII.  It's memory-efficient for pure ASCII, and has all the string
facilities we love.  But of course it would be too painful for
sending JPEGs by
chunked HTTP a la Geert.
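
A quick illustration of that round trip (plain Python 3, nothing new
required):

    raw = b'X-Comment: caf\xe9'                            # one Latin-1 byte
    text = raw.decode('ascii', errors='surrogateescape')   # '\xe9' -> '\udce9'

    text.encode('ascii', errors='surrogateescape')  # == raw, round-trips exactly
    text.encode('ascii')                            # raises UnicodeEncodeError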

So ... now that we have the flexible string representation (PEP 393),
let's add a 7-bit representation!  (Don't take that too seriously;
there are interesting, more general variants that I'm not going to
talk about tonight.)

The 7-bit representation satisfies the following requirements (a
rough pure-Python model of the codec behavior follows the list):

1.  It is only produced on input by a new 'ascii-compatible' codec,
    which sets the "7-bit representation" flag in the str object on
    input if it encounters any non-ASCII bytes (if pure ASCII, it
    produces an 8-bit str object).  This will be slower than just
    reading in the bytes in many cases, but I hope not unacceptably so.

2.  When sliced, the result needs to be checked for non-ASCII bytes.
    If none, the result is promoted to 8-bit.

3.  When combined with a str in 8-bit representation:

    a.  If the 8-bit str contains any Latin-1 or C1 characters, both
        strs are promoted to 16-bit, and non-ASCII characters in the
        7-bit string are converted by the surrogateescape handler.

    b.  Otherwise they're combined into a 7-bit str.

4.  When combined with a str in 16-bit or 32-bit representation, the
    7-bit string is "decoded" to the same representation, as if using
    the 'ascii' codec with the 'surrogateescape' handler.

5.  String methods that would raise or produce undefined results if
    used on str containing surrogate-encoded bytes need to be taught
    to do the same on non-ASCII bytes in 7-bit str objects.

6.  On output the 'ascii-compatible' codec simply memcpy's 7-bit str
    and pure ASCII 8-bit str, and raises on anything else.  (Sorry,
    no, ISO 8859-1 does *not* get passed through without exception.)

7.  On output other codecs raise on a 7-bit str, unless the
    surrogateescape handler is in use.
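
To pin down what I mean in items 1 and 6, here is a rough pure-Python
model of the codec's behavior (the function names and the separate
flag are only for illustration; the real thing would live inside the
str implementation):

    def ascii_compatible_decode(raw):
        # Item 1: non-ASCII bytes come in as lone surrogates; the result
        # would carry the "7-bit representation" flag when any are present.
        text = raw.decode('ascii', errors='surrogateescape')
        seven_bit = any('\udc80' <= ch <= '\udcff' for ch in text)
        return text, seven_bit

    def ascii_compatible_encode(text):
        # Item 6: pure ASCII and surrogate-escaped bytes are copied
        # straight back out; anything else (e.g. Latin-1 text) raises.
        return text.encode('ascii', errors='surrogateescape')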

IOW, it's almost as fast as bytes if you restrict yourself to ASCII-
compatible behavior, and you pay the price if you try to mix it with
"real" Unicode str objects.  Otherwise you can do anything with it you
could do with a str.

I don't think this actually has serious efficiency implications for
Unicode handling, since the relevant compatibility tests need to be
done anyway when combining strs.  All the expensive operations occur
when mixing 7-bit str and "real" non-ASCII Unicode, but we really
don't want to do that if we can avoid it, any more than we want to use
surrogate encoding if we can avoid it.

Efficiency for low-level protocols could be improved by having the
'ascii-compatible' codec always produce 7-bit.  I haven't thought
carefully about this yet.

For the same reasons, there should be few surprises where people
inadvertently mix 7-bit str with "real" Unicode, since 7-bit strs are
created only by the 'ascii-compatible' codec.  People who are doing
that will be using ASCII-compatible protocols and should be used to
being careful with non-ASCII bytes.

Finally, none of the natural idioms require a b prefix on their
literals. :-)
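
E.g., the idiom would read something like this (purely hypothetical --
it assumes the proposed 'ascii-compatible' codec and a 'payload' that
is already bytes):

    header = 'Content-Length: {}\r\n\r\n'.format(len(payload))
    wire = header.encode('ascii-compatible') + payload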

N.B. Much of the above assumes that working with Unicode in 8-bit
representation is basically as efficient as working with bytes.  That
is an assumption on my part; I hope it can be verified.

Comments?


