[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Stephen J. Turnbull stephen at xemacs.org
Tue Jan 7 13:26:20 CET 2014


Is this really a good idea?  PEP 460 proposes rather different
semantics for bytes.format and the bytes % operator from the str
versions.  I think this is going to be both confusing and a continuous
target for "further improvement" until the two implementations
converge.
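For concreteness, here is a sketch of the kind of divergence at issue, using the bytes %-formatting that eventually landed in Python 3.5 (with narrower semantics than str's): str's %s stringifies any operand, while bytes' %s accepts only bytes-like operands.

```python
# str formatting: %s calls str() on any operand.
assert '%s' % 42 == '42'

# bytes formatting (Python 3.5+): %s requires a bytes-like operand...
assert b'%s' % b'raw' == b'raw'

# ...and rejects a str operand outright, to avoid implicit encoding.
try:
    b'%s' % 'text'
    raise AssertionError("should have raised")
except TypeError:
    pass
```

The same format character thus means two different things depending on the type of the format string, which is exactly the sort of gap that invites "further improvement" until the two converge.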

Nick Coghlan writes:

 >I still don't think the 2.x bytestring is inherently evil, it's just
 >the wrong type to use as the core text type because of the problems
 >it has with silently creating mojibake and also with multi-byte
 >codecs and slicing. The current python-ideas thread is close to
 >convincing me even a stripped down version isn't a good idea, though
 >:P

Lack of it is obviously a major pain point for many developers, but
it *is* inherently evil.  It's a structured data type passed around as
an unstructured blob of memory, with no way for one part of the
program to determine what (if anything) another part of the program
thinks it's doing.  It's the Python equivalent to the pointer type
aliasing that gcc likes to whine about.

Given that most wire protocols that benefit from this kind of thing
are based on ASCII-coded commands and parameters, I think there's a
better alternative to either adding 2.x bytestrings as a separate type
or to PEP 460.  This is to add a (minimal) structure we could call
"ASCII-compatible byte array" to the current set of Unicode
representations.  The detailed proposal is on -ideas (where I call it
"7-bit representation", but that name has already caused
misunderstanding).  This representation would treat non-ASCII bytes
the way the current representations treat bytes smuggled in as
surrogates.  It would be produced only by a special "ascii-compatible"
codec (which implies surrogateescape-like behavior).
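A sketch of the idea using today's 'surrogateescape' error handler, which the proposed "ascii-compatible" codec would resemble (the codec name here is the proposal's, not an existing one):

```python
# ASCII-coded command plus stray non-ASCII bytes on the wire.
raw = b'From: carol\xc3\xa9'
s = raw.decode('ascii', 'surrogateescape')

# Non-ASCII bytes are carried as lone surrogates U+DC80..U+DCFF,
# so they are marked as "uninterpreted" rather than silently decoded.
assert s[-2:] == '\udcc3\udca9'

# The round trip back to bytes is lossless: output is effectively
# a copy, with no data invented or dropped.
assert s.encode('ascii', 'surrogateescape') == raw
```

Because an already-decoded string contains no raw non-ASCII bytes, decoding it a second time is a type error rather than silent double-decoding.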

It has the following advantages for bytestring-type processing:

    - double-encoding/decoding is not possible
    - uninterpreted bytes are marked as such -- they can be compared
      for equality, but other character manipulations are no-ops.
    - representation is efficient
    - output via the 'ascii-compatible' codec is just memcpy
    - input via the 'ascii-compatible' codec is reasonably efficient
      (in the posted proposal detection of non-ASCII bytes is
      required, so it cannot be just memcpy)
    - str operations are all available; only on I/O is any additional
      overhead imposed compared to str
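The second point above can be illustrated with the same surrogateescape trick: uninterpreted bytes compare for equality as expected, while character-level manipulations leave them untouched.

```python
a = b'key=\xff'.decode('ascii', 'surrogateescape')
b = b'key=\xff'.decode('ascii', 'surrogateescape')
c = b'key=\xfe'.decode('ascii', 'surrogateescape')

# Equality comparison of uninterpreted bytes works as expected...
assert a == b and a != c

# ...while case mapping is a no-op on them: surrogates have no
# uppercase form, so only the ASCII part changes.
assert a.upper() == 'KEY=\udcff'
```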

There's one other possible advantage that I haven't thought through
yet: compatibility with 2.x literals (e.g.,
"inputstring.find('To:')" instead of "inputbytes.find(b'To:')").
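That compatibility point can be sketched as follows; the variable names and message are illustrative only, assuming the input was decoded with a surrogateescape-style codec as described above.

```python
# Hypothetical wire data: ASCII headers plus one stray byte.
inputbytes = b'To: dev at example.org\r\n\xf0'
inputstring = inputbytes.decode('ascii', 'surrogateescape')

# The 2.x-style str literal works against the decoded form...
assert inputstring.find('To:') == 0
# ...whereas the bytes object requires a bytes literal.
assert inputbytes.find(b'To:') == 0
```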

It probably does impose overhead compared to bytes, especially with
the restricted functionality Victor proposes for .format() on bytes,
but as Victor points out so does any full-featured string-style
processing vs. low-level operations like .join().  I suppose it would
be acceptable, except possibly the extra copying for I/O.

The main disadvantage is additional complexity in the implementation
of the str type.  I don't think it imposes much runtime overhead,
however, since the checks for different representations when operating
on str must be done anyway.  Operations involving "ascii-compatible"
and other representations at the same time should be rare, except for
the combinations of "ascii-compatible" and 8-bit representations --
which just involve copying bytes, as between two 8-bit strings, plus
a bit of logic to set the type correctly.

Steve
