[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Sat Jan 11 19:38:27 CET 2014

tl;dr: At the end I'm volunteering to look at real code that is having
porting problems.

On Sat, 11 Jan 2014 17:33:17 +0100, "M.-A. Lemburg" <mal at egenix.com> wrote:
> asciistr is interesting in that it coerces to bytes instead
> of to Unicode (as is the case in Python 2).
> 
> At the moment it doesn't cover the more common case bytes + str,
> just str + bytes, but let's assume it would, then you'd write
> 
> ...
> headers += asciistr('Length: %i bytes\n' % 123)
> headers += b'\n\n'
> body = b'...'
> socket.send(headers + body)
> ...
> 
> With PEP 460, you could write the above as:
> 
> ...
> headers += b'Length: %i bytes\n' % 123
> headers += b'\n\n'
> body = b'...'
> socket.send(headers + body)
> ...
> 
> IMO, that's more readable.
> 
> Both variants essentially do the same thing: they implicitly
> coerce ASCII text strings to bytes, so conceptually, there's
> little difference.

And if we are explicit:

headers = u'Length: %i bytes\n' % 123
headers += u'\n\n'
body = b'...'
socket.send(headers.encode('ascii') + body)

(I included the 'u' prefix only because we are talking about
shared-codebase python2/python3 code.)

That looks pretty readable to me, and it is explicit about what
parts are text and what parts are binary.

But of course we'd never do exactly that in any but the simplest of
protocols and scripts.

Instead we'd write a library that had one or more objects that modeled
our wire/file protocol.  For the text parts, the API would accept input
as text strings.  For the binary parts, it would accept input as bytes.
Then, when reading or writing the data stream, we perform the appropriate
conversions on the appropriate parts.  Our library does a more complex
analog of 'socket.send(headers.encode('ascii') + body)', one that
understands the various parts and glues them together, encoding the
text parts to the appropriate encoding (often-but-not-always ascii)
as it does so.
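
To make that concrete, here is a minimal sketch of such a model object.
The class name, its methods, and the toy "headers, blank line, body"
wire format are all hypothetical, invented purely for illustration;
a real protocol library would need far more than this.

```python
class Message(object):
    """Hypothetical model of a toy headers-plus-body wire format."""

    def __init__(self, body, header_encoding='ascii'):
        # The binary part of the API accepts bytes only.
        if not isinstance(body, bytes):
            raise TypeError('body must be bytes')
        self.headers = []                  # (name, value) pairs, as text
        self.body = body                   # binary payload, left untouched
        self.header_encoding = header_encoding

    def add_header(self, name, value):
        # The text part of the API accepts text only.
        if isinstance(name, bytes) or isinstance(value, bytes):
            raise TypeError('header names and values must be text')
        self.headers.append((name, value))

    def serialize(self):
        # Build the header block as text, then encode once at the boundary.
        lines = [u'%s: %s\r\n' % (name, value) for name, value in self.headers]
        head = u''.join(lines) + u'\r\n'
        return head.encode(self.header_encoding) + self.body


msg = Message(b'\x00\x01\xff')
msg.add_header(u'Length', u'%i bytes' % 3)
wire = msg.serialize()
```

The point of the design is that the encode call lives in exactly one
place, so callers never mix text and bytes themselves.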

And yes, I have written code that does this in Python3.

What I haven't done is write that code to run in both Python3 and
Python2.  I *think* the only missing thing I would need to back-port
it is the surrogateescape error handler, but I haven't tried it.  And I
could probably conditionalize the code to use latin1 on python2 instead
and get away with it.
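
For reference, a small sketch of the two round-trips being compared,
written for Python 3 only.  The sample header line is made up.  Whether
the latin1 fallback is good enough on python2 is an assumption that
would need testing against real code.

```python
raw = b'Subject: caf\xe9\r\n'      # 0xE9 is not valid ASCII

# surrogateescape: each undecodable byte becomes a lone surrogate
# in the range U+DC80..U+DCFF, and encoding reverses the mapping.
text = raw.decode('ascii', 'surrogateescape')
round_trip = text.encode('ascii', 'surrogateescape')

# latin-1: every byte maps to the code point with the same value,
# so the round-trip is also lossless.
text_l1 = raw.decode('latin-1')
round_trip_l1 = text_l1.encode('latin-1')
```

Both approaches preserve the original bytes.  The difference is that
with latin-1 the escaped byte is indistinguishable from a genuine
U+00E9 in the text, while surrogateescape marks it as "not really text".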

And please note that email is probably the messiest of messy binary
wire protocols.  Not only do you have bytes and text mixed in the same
data stream, with internal markers (in the text parts) that specify
how to interpret the binary, including what encodings each part of that
binary data is in for cases where that matters, you *also* have to deal
with the possibility of there being *invalid* binary data mixed in with
the ostensibly text parts, that you nevertheless are expected to both
preserve and parse around.

When I started adding back binary support to the email package, I was
really annoyed by the lack of certain string features in the bytes
type.  But in the end, it turned out to be really simple to instead
think of the text-with-invalid-bytes parts as *text*-with-invalid-bytes
(surrogateescaped bytes).
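
A minimal sketch of that idea, assuming a made-up header line with two
invalid bytes in the value.  This is only an illustration of the
technique, not how the email package itself is implemented.

```python
raw = b'X-Note: ok \xff\xfe still here\r\n'

# Decode the ostensibly-text part; the invalid bytes survive as surrogates.
line = raw.decode('ascii', 'surrogateescape')

# Ordinary str methods can now parse around the invalid data.
name, _, value = line.partition(':')
value = value.strip()

# The escaped bytes are detectable if the caller cares about them.
has_invalid = any('\udc80' <= ch <= '\udcff' for ch in value)

# Re-encoding with the same handler restores the original bytes.
rebuilt = ('%s: %s\r\n' % (name, value)).encode('ascii', 'surrogateescape')
```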

Now, if I was designing from the ground up I'd store the stuff that
was really binary as bytes in the model object instead of storing it as
surrogateescaped text, but that problem is a consequence of how we got from
there to here (python2-email to python3-email-that-didn't-handle-8bit-data
to python3-email-that-works) rather than a problem with the python3 core
data model.

So it seems like I'm with Nick and Antoine and company here.  The
byte-interpolation proposed by Antoine seems reasonable, but I don't
see the *need* for the other stuff.  I think that programs will
be cleaner if the text parts of the protocol are handled *as text*.

On the other hand, Ethan's point that bytes *does* have text methods
is true.  However, other than the perfectly-sensible-for-bytes split,
strip, and ends/startswith, I don't think I actually use any of them.
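
For illustration, those methods applied to a made-up header line under
Python 3.  Note that the arguments have to be bytes as well.

```python
line = b'  Content-Length: 123\r\n'

stripped = line.strip()                  # leading/trailing ASCII whitespace removed
name, value = stripped.split(b':', 1)    # split once on the first colon
value = value.strip()

is_content_header = stripped.startswith(b'Content-')
ends_with_digits = stripped.endswith(b'123')

# Passing a str separator to a bytes method is an error in Python 3.
try:
    stripped.split(':')
    mixed_ok = True
except TypeError:
    mixed_ok = False
```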

But!  Our goal should be to help people convert to Python3.  So how can
we find out what the specific problems are that real-world programs are
facing, look at the *actual code*, and help that project figure out the
best way to make that code work in both python2 and python3?

That seems like the best way to find out what needs to be added to
python3 or pypi:  help port the actual code of the developers who are
running into problems.

Yes, I'm volunteering to help with this, though of course I can't promise
exactly how much time I'll have available.

--David