[Python-Dev] PEP 414 - Unicode Literals for Python 3

Mon Feb 27 22:43:59 CET 2012

Chris McDonough <chrism <at> plope.com> writes:

> It's great to have software that installs easily.  That said, the
> versions of Python that my software supports is (and has to be) my
> choice.

Of course. And if I understand correctly, that's 2.6, 2.7, 3.2 and later
versions. I'll ignore 2.5 and earlier in this specific reply.

> None of them would so much as bat an eyelash if I told them today they
> had to use Python 3.3 (if it existed in a final released form anyway) to
> use my software.  It's just a minor drop in the bucket of inconvenience
> they have to currently withstand.

Their pain (lacklustre library support and transliterating examples from 2.x to
3.x) would be the same under 3.2 and 3.3 (unless for some perverse reason people
only made libraries work under one of 3.2 and 3.3, but not both). Is it really
that hard to transliterate 2.x examples to 3.x in the literal-string dimension?
I can't believe it is, as the target audience is programmers.

> > If the lack of u'' literal is what's holding them back, that's germane to the
> > discussion of the PEP. If it's not, then why propose the PEP?
> 
> Like I said in an earlier email, u'' literal support is by no means the
> only issue for people who want to straddle.  But it *is* an issue, and
> it's incredibly low-hanging fruit with near-zero real-world impact if it
> is reintroduced.

But the implication of the PEP is that lack of u'' support is a major hindrance
to porting, justifying the production of the PEP and this discussion. And it's
not low-hanging fruit with near-zero real-world impact if we're going to
deprecate it at some point (which Guido was talking about) - you're just moving
the pain to a later date, unless we don't ever deprecate.

I feel, like some others, that 'xxx' is natural for text, u'xxx' is inelegant by
comparison, and u('xxx') a little more inelegant still.

However, accepting u'' syntax in 3.3 as per this PEP, while leaving it
optional, permits any mixture of u'xxx' and 'xxx' in code in a 3.x context.
That doesn't seem to me to be an ideal situation, especially if you have
hit-and-run contributors who are not necessarily attuned to project conventions.

> You cast it as "backtracking" to reintroduce the syntax, but things have
> changed from when the decision to omit it was first made.  Its omission
> introduces pain in a world where it's expected that we don't use 2to3 to
> automatically translate code at installation time.

I'm calling it like it is. "reintroduce" in this case means undoing something
already done, so it's appropriate to say "backtracking".

I don't agree that things have changed. If I want to write code that works on
2.x and 3.x without the pain of running 2to3 after every change, and I'm only
interested in supporting >= 2.6 (your situation, IIUC), then I use "from
__future__ import unicode_literals" (that's what it was created for, wasn't
it?) and use 'xxx' where I need text, b'xxx' where I need bytes, and a function
to deliver native strings where they're needed.
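
A minimal sketch of that setup, for illustration only: the helper name
native() and its utf-8 default are assumptions made up for this example, not
part of the stdlib or of six.

```python
from __future__ import unicode_literals

import sys

PY2 = sys.version_info[0] == 2


def native(s, encoding='utf-8'):
    """Return s as the native str type of the running interpreter.

    Hypothetical helper: the name and the utf-8 default are assumptions.
    """
    if PY2:
        # On 2.x the native str is a byte string, so encode text.
        return s if isinstance(s, str) else s.encode(encoding)
    # On 3.x the native str is text, so decode bytes.
    return s.decode(encoding) if isinstance(s, bytes) else s


text = 'caf\xe9'        # text (unicode) on 2.6+ and on 3.x
data = b'caf\xc3\xa9'   # bytes on 2.6+ and on 3.x
name = native('attr')   # str on whichever interpreter is running
```

With the future import in place, the only literal that needs a function call
is the (comparatively rare) native string.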

If I have a 2.x project full of u'' code which I need to bring into this
approach, then I run 2to3, review what it tells me, and make the changes
necessary (as far as literals go, that's adding the unicode_literals import to
all files and converting u'xxx' -> 'xxx'). When I test the result, I will find
numerous failures, some of which point to places where I should have used
native strings (e.g. kwargs keys), which I then fix. Others will be places
where I needed to use bytes (e.g. encoding/decoding/hashing), which I will also
fix. I use six or a similar approach to sort out any other issues which crop
up, e.g. metaclass syntax, execfile, and so on.
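
To make the two kinds of failure concrete, here is a hedged sketch. The
function names sha1_hex() and make_options() are invented for illustration;
only hashlib and the str() builtin are real.

```python
from __future__ import unicode_literals

import hashlib


def sha1_hex(text, encoding='utf-8'):
    """Hash text; hashlib wants bytes, so encode explicitly first."""
    return hashlib.sha1(text.encode(encoding)).hexdigest()


def make_options(**kwargs):
    """Illustrative stand-in for any function taking keyword arguments."""
    return kwargs


# str() of an ASCII-only literal yields the native str type on 2.x and 3.x,
# which keeps ** expansion working on some 2.x releases that reject
# unicode keyword names.
options = make_options(**{str('timeout'): 30})
```

Passing a text literal straight to hashlib.sha1() is the sort of thing the
test run flags; the explicit encode() is the fix in that case.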

After a relatively modest amount of work, I have a codebase that works on 2.x
and 3.x, and all I have to remember is that 'xxx' is Unicode, and if I create a
new module, I need to add the future import (on the assumption that I might add
literal strings later, if not now). After that, it seems to be plain sailing,
and I don't have to switch mental gears re. string literals.

> If you look at a piece of code as something that exists in one of the
> two states "ported" or "not-ported", sure.  But code often needs to be
> changed, and people of varying buy-in levels need to understand and
> change such code.  It's just much easier for them to assume that the
> same syntax works on some versions of Python 2 and Python 3 and be done
> with it rather than need to explain the introduction of a function that
> only exists to paper over a syntax omission.

Well, according to the approach I described above, that one syntax should be
the present 3.x syntax: 'xxx' is text, b'xxx' is bytes, and f('xxx') is a native
string (or whatever name you want instead of f). With the unicode_literals
import, that syntax works on 2.6+ and 3.2+, so ISTM it should work within the
constraints you mentioned for your software.

Regards,

Vinay Sajip