[Python-Dev] \u and \U escapes in raw unicode string literals

"Martin v. Löwis" martin at v.loewis.de
Sun May 13 18:04:44 CEST 2007


> * without the Unicode escapes, the only way to put non-ASCII
>   code points into a raw Unicode string is via a source code encoding
>   of say UTF-8 or UTF-16, pretty much defeating the original
>   requirement of writing ASCII code only

That's no problem, though - just don't put the Unicode character
into a raw string. Use plain strings if you have a need to include
Unicode characters, and are not willing to leave ASCII.
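To illustrate the difference, here is a minimal sketch. It assumes a Python 3 interpreter, where raw string literals leave \u untouched; under Python 2's ur"..." literals the escape would still be decoded, which is the behaviour under discussion. The variable names are made up for the example.

```python
# Plain string literal: the \u escape is decoded into a single code point,
# while the source file itself stays pure ASCII.
plain = "caf\u00e9"

# Raw string literal (Python 3 semantics assumed): the backslash and
# the characters "u00e9" are kept exactly as written.
raw = r"caf\u00e9"

print(len(plain))  # 4
print(len(raw))    # 9
```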

For Python 3, the default source encoding is UTF-8, so it is
much easier to use non-ASCII characters in the source code.
The original requirement may not be as strong anymore as it
used to be.

> * non-ASCII code points in text are not uncommon, they occur
>   in most European scripts, all Asian scripts,
>   many scientific texts and also in texts meant for the web
>   (just have a look at the HTML entities, or think of Word
>   exports using quotes)

And you are seriously telling me that people who commonly
use non-ASCII code points in their source code are willing
to refer to them by Unicode ordinal number (which, of course,
they all know by heart, from 1 to 65536)?

> * adding Unicode escapes to the re module will break code
>   already using "...\u..." in the regular expressions for
>   other purposes; writing conversion tools that detect this
>   usage is going to be hard

It's unlikely to occur in code today - \u just means the same
as u (so \u1234 matches u1234); if you want a backslash
followed by u in your regular expression, you should write
\\u.
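A small sketch of the doubled-backslash form, for illustration only. Only the \\u spelling is shown, because the single-backslash spelling depends on the interpreter: Python 3.3 and later decode \uXXXX inside the re module itself, unlike the 2.x behaviour described above.

```python
import re

# Pattern with a doubled backslash: matches a literal backslash
# followed by the characters "u1234".
literal_backslash_u = re.compile(r"\\u1234")

# Six characters of text: backslash, u, 1, 2, 3, 4.
text = r"\u1234"

print(literal_backslash_u.fullmatch(text) is not None)  # True
print(literal_backslash_u.fullmatch("u1234") is None)   # True
```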

It would be possible to future-warn about \u in 2.6, catching
these cases. Authors then would either have to remove the
backslash, or duplicate it, depending on what they want to
express.
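As a rough sketch of what such a check might look for, the hypothetical helper below (find_bare_u_escapes is not part of any library) reports where a pattern contains \u preceded by an even number of backslashes, i.e. a \u that is not itself escaped. It only inspects the pattern text and makes no attempt to understand the rest of the regular expression syntax.

```python
import re

# Hypothetical detector: zero or more escaped backslashes (pairs),
# not preceded by a further backslash, followed by a bare "\u".
_BARE_U = re.compile(r"(?<!\\)(?:\\\\)*(\\u)")


def find_bare_u_escapes(pattern_source):
    """Return the offsets of each unescaped backslash-u in pattern_source."""
    return [m.start(1) for m in _BARE_U.finditer(pattern_source)]


print(find_bare_u_escapes(r"\u1234"))   # [0]
print(find_bare_u_escapes(r"\\u1234"))  # []
```

An author who gets a hit would then either drop the backslash or double it, as described above.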

Regards,
Martin


