[Python-Dev] Trying to focus the whole bytes/str formatting discussion

Mon Jan 13 12:53:41 CET 2014

On 13 January 2014 08:46, Brett Cannon <brett at python.org> wrote:
> I don't know about the rest of you but I feel like the discussion is heading
> off the rails (if it hasn't already jumped the tracks). Let's try to bring
> this back around to something actionable which people can focus their energy
> on as the amount of developer time spent arguing could have led to several
> coded-up solutions.
>
> I see it as a practicality-beats-purity vs.
> explicit-is-better-than-implicit. The PBP group want bytes.format() (just
> assume I include interpolation support if you want that) to work as close to
> a drop-in replacement for current str.format() use in Python 2 to ease
> porting. The argument is that code looks cleaner and the amount of changes
> in Python 2 code being ported to Python 3 is much smaller.
>
> The EIBTI group are willing to support PEP 460 but beyond that don't want to
> have in Python itself anything for bytes.format() which takes in a string
> and spits out bytes. It's bytes in->bytes out and not bytes & str in->bytes
> out as the PBP group is after. The EIBTI group are arguing that letting str
> into bytes.format() and then automatically be converted to strict ASCII
> leads to conflating the text/bytes divide as well as being too magical, e.g.
> what if you actually wanted UTF-16 for your number string instead of ASCII;
> the EIBTI group **wants** to force people to make a decision. They are also
> less concerned with making users update Python 2 code to handle this as it
> already needs to be updated for other Python 3 things anyway.
>
> From where I'm sitting, the EIBTI group and their PEP 460 proposal from
> Antoine (and no longer Victor) are not controversial. Everyone seems to
> agree that PEP 460 **at minimum** is acceptable and should happen for Python
> 3.5. The people with the uphill battle and something to prove are those
> arguing for str in->bytes out support in bytes.format(). The added features
> that the PBP group want are the ones being argued over.
>
> As the onus is on the PBP group to convince the EIBTI group (or Guido), I
> think the PBP group should code up a solution that does what they want and
> put it on PyPI to see what the community thinks. If the PBP group wants to
> convince the EIBTI group that str in->bytes out for bytes.format() is
> critical in getting a key group of users to start using Python 3 then I
> think that needs to be demonstrated through real-world usage by some people.

Note that I am now fine with Guido's more lenient proposal *so long
as* explicitly bytes-only formatb and formatb_map methods are also
included.

That would give us the following situation in 3.5:

Text interpolation: str.__mod__, str.format, str.format_map
ASCII compatible interpolation: bytes.__mod__, bytes.format, bytes.format_map
Arbitrary binary interpolation: bytes.formatb, bytes.formatb_map

Those are all reasonable operations for the language to support
natively, and by providing convenient access to all three, we avoid
the attractive nuisance that would be created by providing *only*
ASCII interpolation without providing strict binary interpolation
(since people would inevitably use the former when they should really
be using the latter, because interpolation is such a convenient
construct), while still addressing the interests of both groups
(people like me and Antoine that like PEP 460 as it stands, as well as
those that favour the ASCII encoding features).

It's only the introduction of ASCII compatible interpolation support
*without* binary interpolation support that I am adamantly opposed to
- that's the kind of attractive nuisance that leads to people
inappropriately using ASCII compatible only APIs and then discovering
that their code breaks when confronted with ASCII incompatible
encodings like UTF-16, ShiftJIS and ISO-2022.
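To make that failure mode concrete, here is a minimal sketch using plain
concatenation rather than any of the proposed formatting methods (the
variable names are purely illustrative). It assumes the surrounding data
is UTF-16-LE and that the number was converted with the usual ASCII
assumption:

```python
# The surrounding data uses UTF-16-LE, which is not ASCII compatible.
prefix = "count: ".encode("utf-16-le")

# An ASCII-only conversion of the number, as an ASCII compatible
# interpolation API would perform implicitly.
number_ascii = str(42).encode("ascii")

# The two ASCII bytes are read back as a single UTF-16 code unit,
# so the result decodes without error but is not the intended text.
broken = (prefix + number_ascii).decode("utf-16-le")

# Encoding the number with the same encoding as the surrounding data
# gives the intended result.
number_utf16 = str(42).encode("utf-16-le")
correct = (prefix + number_utf16).decode("utf-16-le")

print(broken == "count: 42")   # False
print(correct == "count: 42")  # True
```

Note that nothing raises here: the corruption is silent, which is what
makes the ASCII-only path an attractive nuisance when it is the only
convenient option.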

Originally I was opposed to the idea entirely, but then Antoine wrote
the binary only version of PEP 460 and I found it to be a *very*
elegant solution that didn't compromise the Python 3 text model. As
long as this pure API remains available in some form (such as formatb
and formatb_map methods), then I'm OK with the ASCII only version
existing in parallel - at that point, it *is* analogous to all the
other existing bytes methods that assume the use of ASCII compatible
data.

** The caveat **

However, note that there were *two* significant issues that were
raised in the recent broader discussions. PEP 460 only tackles the
more tractable of the two: the fact that Twisted and Mercurial both
consider bytes.__mod__ support a blocker for switching to Python 3.
That's a useful discussion to have, but it's important for people to
realise that the mod-formatting feature is utterly irrelevant to the
concerns Armin Ronacher raised in
http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/ that kicked off
this whole recent spate of interest in the topic.

Obviously, I disagree with his conclusions (and personally wish Python
2 Unicode experts would show a little more humility in trying to
understand the core team's motivations for Python 3 design decisions
rather than assuming that we're clueless idiots that decided to
maintain 4 parallel branches in Subversion for a couple of years just
because we thought it might be fun), but I can certainly understand
his pain.

I'm the one who actually *made* the changes to restore dual
bytes/unicode support in urllib.parse for Python 3 (one of Armin's
favourite examples of the difficulty of writing that kind of code
using the Python 3 text model), and I agree entirely with Armin's
assessment of that code: it isn't pretty, and it wasn't fun to write.
Yes, I got it to work, and yes, it was satisfying when the tests
finally passed, and yes there is now a smaller number of cases where
errors will pass silently, but that's far from the same thing as
finding the process of getting there a pleasant one, or considering
the result an elegant approach to porting hybrid APIs from Python 2
such that bytes in = bytes out and str in = str out. The only
difference between Armin and myself in this respect is that I know the
reasons for the changes to the text model, and I think the increased
difficulty in implementing that particular use case was worth it,
given the pay-off in finally being able to remove the implicit
encoding and decoding operations from the text model (Note that the
unicode input handling in urlparse in Python 2 breaks entirely if you
turn off implicit decoding. You can still get hits from the cache, but
if you have to actually parse anything, it will fail:
http://python-notes.curiousefficiency.org/en/latest/python3/binary_protocols.html#couldn-t-the-implicit-decoding-just-be-disabled-in-python-2).
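For reference, the resulting behaviour of urllib.parse in Python 3 can
be seen with a short sketch like the following (the URLs are just
placeholder values): str in gives str out, bytes in gives bytes out, and
mixing the two is rejected rather than implicitly decoded.

```python
from urllib.parse import urljoin, urlsplit

# str in = str out
text_result = urlsplit("http://example.com/path?q=1")

# bytes in = bytes out
bytes_result = urlsplit(b"http://example.com/path?q=1")

print(text_result.netloc)    # example.com
print(bytes_result.netloc)   # b'example.com'

# Mixing bytes and str arguments raises TypeError instead of
# silently decoding one of them.
try:
    urljoin(b"http://example.com/", "path")
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None

print(type(mixing_error).__name__)  # TypeError
```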

The fact remains, however, that in Python 2 the code you need for that
kind of hybrid API was *easy* to write - you just made all your
internal constants 8-bit strings, and the implicit decoding to Unicode
took care of the case of str inputs. There are still valid use cases
for such hybrid APIs, even in Python 3 (urllib.parse is one of them),
and the reason I helped Benno start the asciicompat project
(https://github.com/jeamland/asciicompat) is because I want to make
that kind of code almost as effortless as it was in Python 2 - all you
should need to do is make your constants asciistr instances rather
than builtin bytes or str objects.
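By way of contrast, here is a rough sketch of what a hybrid API tends
to look like in Python 3 today without such a helper type. The
strip_scheme function is hypothetical (it is not part of urllib.parse
or asciicompat) and only illustrates the explicit decode/process/encode
dance that the Python 2 implicit decoding used to hide:

```python
def strip_scheme(url):
    """Hypothetical hybrid helper: bytes in = bytes out, str in = str out."""
    is_binary = isinstance(url, (bytes, bytearray))

    # Explicitly decode binary input, assuming ASCII compatible data.
    text = url.decode("ascii") if is_binary else url

    # The actual processing is written once, against str constants.
    _, separator, rest = text.partition("://")
    result = rest if separator else text

    # Convert back so the output type matches the input type.
    return result.encode("ascii") if is_binary else result


print(strip_scheme("http://example.com/path"))   # example.com/path
print(strip_scheme(b"http://example.com/path"))  # b'example.com/path'
```

Every such function needs the same coercion boilerplate at its
boundaries, which is the part that constants of a type like asciistr
are meant to make unnecessary.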

My ambition here is not "good enough to get people to stop
complaining", it's "there's no actual reason Python 3 needs to be
worse at this than Python 2, it just doesn't need to be part of the
core builtin types, because we're in a better position to fix
interoperability issues now that we don't have to deal with the close
coupling between str and unicode that existed in Python 2, and the
bytes type will generally play nice
with anything that exposes the PEP 3118 buffer interface".

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia