[Python-Dev] \u and \U escapes in raw unicode string literals

Sun May 13 18:04:44 CEST 2007

> * without the Unicode escapes, the only way to put non-ASCII
>   code points into a raw Unicode string is via a source code encoding
>   of say UTF-8 or UTF-16, pretty much defeating the original
>   requirement of writing ASCII code only

That's no problem, though - just don't put the non-ASCII character
into a raw string. Use plain strings if you need to include
such characters and are not willing to leave ASCII.
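
To illustrate the difference, here is a minimal sketch. It assumes
Python 3 semantics, in which a raw literal leaves \u untouched (in
2.x, ur"..." still interprets it). The variable names are made up
for the example.

```python
# Assumption: Python 3, where raw literals do not interpret \u escapes.
plain = "caf\u00e9"   # plain literal: the escape becomes one character
raw = r"caf\u00e9"    # raw literal: backslash, 'u' and the digits stay as typed

print(len(plain))     # length of the plain string
print(len(raw))       # length of the raw string
print(plain[-1] == chr(0xE9))
```

The source stays pure ASCII in both cases; only the plain literal
ends up containing the non-ASCII character.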

For Python 3, the default source encoding is UTF-8, so it is
much easier to use non-ASCII characters in the source code.
The original requirement may no longer be as strong as it
used to be.

> * non-ASCII code points in text are not uncommon, they occur
>   in most European scripts, all Asian scripts,
>   many scientific texts and also in texts meant for the web
>   (just have a look at the HTML entities, or think of Word
>   exports using quotes)

And you are seriously telling me that people who commonly
use non-ASCII code points in their source code are willing
to refer to them by Unicode ordinal number (which, of course,
they all know by heart, from 1 to 65536)?

> * adding Unicode escapes to the re module will break code
>   already using "...\u..." in the regular expressions for
>   other purposes; writing conversion tools that detect this
>   usage is going to be hard

It's unlikely to occur in code today - \u just means the same
as u (so \u1234 matches u1234); if you want a backslash
followed by u in your regular expression, you should write
\\u.
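
A short sketch of the doubled-backslash spelling follows. It assumes
a current Python 3 re module, which (unlike the 2.x module discussed
here) does interpret \uXXXX in patterns; the names are illustrative
only.

```python
import re

# Six characters: a backslash, 'u', and four digits.
text = r"\u1234"

# Doubling the backslash asks for a literal backslash followed by 'u'.
literal_backslash_u = re.compile(r"\\u1234")
print(literal_backslash_u.search(text) is not None)

# The doubled form does not match the bare letters without the backslash.
print(literal_backslash_u.search("u1234") is None)

# Assumption: Python 3.3 or later, where a single \u1234 in a pattern
# stands for the code point U+1234 rather than the letter 'u'.
code_point = re.compile(r"\u1234")
print(code_point.search(chr(0x1234)) is not None)
```

The \\u spelling means the same thing whether or not the re module
understands Unicode escapes, which is why it is the safe choice.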

It would be possible to future-warn about \u in 2.6, catching
these cases. Authors then would either have to remove the
backslash, or duplicate it, depending on what they want to
express.
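
Such a check could look roughly like the sketch below. The helper
warn_on_backslash_u is hypothetical and not part of the re module;
it only scans the pattern text and does not try to handle every
corner case of the regular expression syntax.

```python
import re
import warnings

# Matches a backslash-u whose backslash is not itself escaped, i.e. an
# odd-length run of backslashes directly before the 'u'.
_SINGLE_BACKSLASH_U = re.compile(r"(?<!\\)(?:\\\\)*\\u")


def warn_on_backslash_u(pattern):
    """Warn if pattern contains an unescaped backslash-u (hypothetical helper).

    Returns True when a warning was issued, False otherwise.
    """
    if _SINGLE_BACKSLASH_U.search(pattern):
        warnings.warn(
            "\\u in a regular expression may change meaning; "
            "write 'u' or '\\\\u' instead",
            FutureWarning,
            stacklevel=2,
        )
        return True
    return False


# Usage: the first pattern triggers the warning, the second does not.
warn_on_backslash_u(r"\u1234")
warn_on_backslash_u(r"\\u1234")
```

An author who sees the warning would then pick one of the two
spellings named in the message.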

Regards,
Martin