[Python-Dev] \u and \U escapes in raw unicode string literals

Sun May 13 22:54:48 CEST 2007

On 2007-05-13 18:04, Martin v. Löwis wrote:
>> * without the Unicode escapes, the only way to put non-ASCII
>>   code points into a raw Unicode string is via a source code encoding
>>   of say UTF-8 or UTF-16, pretty much defeating the original
>>   requirement of writing ASCII code only
> 
> That's no problem, though - just don't put the Unicode character
> into a raw string. Use plain strings if you have a need to include
> Unicode characters, and are not willing to leave ASCII.
> 
> For Python 3, the default source encoding is UTF-8, so it is
> much easier to use non-ASCII characters in the source code.
> The original requirement may not be as strong anymore as it
> used to be.

You can do that today: Just put the "# coding: utf-8" marker
at the top of the file.

However, in some cases, your editor may not be capable of
displaying or letting you enter the Unicode text you have
in mind.

In other cases, there may be a corporate coding standard in
place that prohibits using non-ASCII text in source code,
or fixes the encoding to e.g. Latin-1.

In all those cases, you need some other means of entering the
Unicode code points that cannot appear directly in the source
code, and the easiest way to do this is by using Unicode
escapes.
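
To illustrate, here is a minimal sketch of an ASCII-only source file
that still produces non-ASCII text. It assumes Python 3 string syntax
(in Python 2 the literals would need a u prefix), and the variable
names are made up for the example:

```python
# Every character in this file is ASCII; the non-ASCII code points
# are spelled out as escapes in plain (non-raw) string literals.
e_acute = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE
euro = "\u20ac"            # EURO SIGN
g_clef = "\U0001d11e"      # MUSICAL SYMBOL G CLEF, beyond \uFFFF so it needs \U
by_name = "\N{EURO SIGN}"  # the same EURO SIGN, referenced by its Unicode name

word = "caf" + e_acute
print(ascii(word))         # shows the string using ASCII characters only
```

The \N{...} form avoids having to remember the ordinal, at the cost
of a longer literal.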

>> * non-ASCII code points in text are not uncommon, they occur
>>   in most European scripts, all Asian scripts,
>>   many scientific texts and also in texts meant for the web
>>   (just have a look at the HTML entities, or think of Word
>>   exports using quotes)
> 
> And you are seriously telling me that people who commonly
> use non-ASCII code points in their source code are willing
> to refer to them by Unicode ordinal number (which, of course,
> they all know by heart, from 1 to 65536)?

No, I'm not. I'm saying that non-ASCII code points are in
common use and (together with the above bullet) that there
are situations where you can't put the relevant code point
directly into your source code.

Using Unicode escapes for these will always be a kludge,
but it's still better than not being able to enter the
code points at all.

>> * adding Unicode escapes to the re module will break code
>>   already using "...\u..." in the regular expressions for
>>   other purposes; writing conversion tools that detect this
>>   usage is going to be hard
> 
> It's unlikely to occur in code today - \u just means the same
> as u (so \u1234 matches u1234); if you want a backslash
> followed by u in your regular expression, you should write
> \\u.
> 
> It would be possible to future-warn about \u in 2.6, catching
> these cases. Authors then would either have to remove the
> backslash, or duplicate it, depending on what they want to
> express.

Good idea.

The re module would then have to implement the same escaping
scheme as the raw-unicode-escape code (only an odd number of
backslashes causes the escaping code to trigger).
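
A rough sketch of that odd/even rule, written with a regular
expression rather than taken from the actual codec or re sources.
The function name expand_u_escapes is hypothetical, and the sketch
only handles the four-digit \u form, not \U:

```python
import re

# A run of one or more backslashes, then 'u' and exactly four hex digits.
_ESCAPE = re.compile(r"(\\+)u([0-9a-fA-F]{4})")

def expand_u_escapes(text):
    """Replace \\uXXXX with its code point when the backslash run is odd."""
    def replace(match):
        backslashes, hex_digits = match.groups()
        if len(backslashes) % 2 == 0:
            # Even run: every backslash is paired, so 'u' is a plain letter.
            return match.group(0)
        # Odd run: the last backslash starts the escape; the others stay.
        return backslashes[:-1] + chr(int(hex_digits, 16))
    return _ESCAPE.sub(replace, text)

print(ascii(expand_u_escapes(r"\u00e9")))    # one backslash: escape expands
print(ascii(expand_u_escapes(r"\\u00e9")))   # two backslashes: left alone
```

As in a raw string, the remaining backslashes are kept as they are
and are not collapsed into single ones.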

-- 
Marc-Andre Lemburg
eGenix.com
