[Python-ideas] Support Unicode code point notation

Steven D'Aprano steve at pearwood.info
Fri Aug 2 06:11:39 CEST 2013


On 02/08/13 09:55, Alexander Belopolsky wrote:

> The original proposal was to allow \U+NNNN escape as a shortcut for
> \U0000NNNN.  This is a clear readability improvement while \N{U+001B}, for
> example,  is not an improvement over \N{ESCAPE}.  However, for more obscure
> control characters, \N{control-NNNN} may be clearer than any currently
> available spelling.  For example, \N{control-001E} is easier to understand
> than \036, \x1e, \u001E, \N{RS} or even the most verbose \N{INFORMATION
> SEPARATOR TWO}.

Despite the vigorous objections to a variable-length escape sequence[1], I still consider that the One Obvious Way to refer to a Unicode code point numerically is U+NNNN with 4-6 hex digits. Add a backslash to turn it into an escape sequence, and we have \U+NNNN. If I'm still around when Python 4000 is under development, I'll propose that syntax as an outright replacement for the legacy escapes \xNN, \NNN (octal), \uNNNN and \U00NNNNNN (for strings, but not bytes, where \xNN is still the OOWTDI). But that's a *long* way away.
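
For comparison, here is a small sketch of how the octal escapes Python already has behave. They take one to three octal digits with no delimiter, so they are variable-length today. The variable names are only for illustration.

```python
# Octal escapes in str literals take one to three octal digits,
# so they are already variable-length and undelimited.
bell = "\7"          # one digit
rs_short = "\36"     # two digits
rs_padded = "\036"   # three digits
greedy = "\0361"     # at most three digits are consumed; the '1' is a separate character

assert bell == "\x07"
assert rs_short == rs_padded == "\x1e"
assert greedy == "\x1e" + "1"
assert len(greedy) == 2
```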

In the meantime, we're constrained by backward compatibility to keep the existing escape formats. There is considerable opposition to another variable-length escape sequence without delimiters, and \N{U+NNNN} seems to me a reasonable compromise, even though it is actually longer than the current \U00NNNNNN escape. I consider this proposal to be about two things: conformity with Unicode notation and clarity. It is not about length.
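
To make the compromise concrete, here is a rough sketch of what \N{U+NNNN} would mean, written as an ordinary function over a raw string. Nothing here exists in Python: expand_code_points and _CODE_POINT are hypothetical names, and a real implementation would live in the compiler's handling of string literals rather than in a regex.

```python
import re

# Hypothetical helper, not part of Python: expands the proposed
# \N{U+NNNN} spelling (4 to 6 hex digits) found in an ordinary string.
_CODE_POINT = re.compile(r"\\N\{U\+([0-9A-Fa-f]{4,6})\}")

def expand_code_points(text):
    """Replace each \\N{U+NNNN} in text with the character it names."""
    def _replace(match):
        value = int(match.group(1), 16)
        if value > 0x10FFFF:
            raise ValueError("code point out of range: U+%X" % value)
        return chr(value)
    return _CODE_POINT.sub(_replace, text)

# Numeric spellings that work today, all naming U+001E.
current = ["\036", "\x1e", "\u001E", "\U0000001E"]
assert len(set(current)) == 1

# The proposed spelling, expanded by the sketch above from raw strings.
assert expand_code_points(r"\N{U+001E}") == "\x1e"
assert expand_code_points(r"snowman: \N{U+2603}") == "snowman: \u2603"
```

The braces supply the delimiter, so the number of hex digits can vary without any ambiguity about where the escape ends.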

If somebody wishes to champion the proposal to support code-point labels, please start a separate thread. The two features are independent.




[1] None of which persuade me -- many languages have variable-length octal escapes, and this is the first time I've ever heard anyone complain about them being harmful.


-- 
Steven

