[Python-Dev] \u and \U escapes in raw unicode string literals

Fri May 11 09:59:42 CEST 2007

Martin v. Löwis wrote:
>> This is what prompted my question, actually: in Py3k, in the
>> str/unicode unification branch, r"\u1234" changes meaning: before the
>> unification, this was an 8-bit string, where the \u was not special,
>> but now it is a unicode string, where \u *is* special.
> 
> That is true for non-raw strings also: the meaning of "\u1234" also
> changes.
> 
> However, traditionally, there was *no* escaping mechanism in raw strings
> in Python, and I feel that this is a good principle, because it is
> easy to learn (if you leave out the detail that \ can't be the last
> character in a raw string - which should get fixed also, IMO). So I
> think in Py3k, r"\u1234" should continue to be a string with 6
> characters. Otherwise, people will complain that
> os.stat(r"c:\windows\system32\user32.dll") fails. Telling them to write
> os.stat(r"c:\windows\system32\u005Cuser32.dll") will just cause puzzled
> faces.
> 
> Windows path names are one of the two primary applications of raw
> strings (the other being regexes).

I think regular expressions become easier to read if they don't also
contain Python escape sequences, because then you don't have to mentally
parse which ones are part of the regular expression and which ones are
evaluated by Python.  The re module can still evaluate r"\uxxxx", r"\'",
and r'\"' sequences even if Python doesn't.
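As a minimal sketch of that division of labour, assuming a Python 3
interpreter in which raw literals leave \u alone (the behaviour argued for
in this thread), the following checks that the literal keeps every
character and that re does the interpreting.  The variable names are made
up for illustration only.

```python
import re

# Raw literals: Python leaves the backslash sequences untouched.
unicode_pattern = r"\u1234"
quote_pattern = r"\'"
windows_path = r"c:\windows\system32\user32.dll"

assert len(unicode_pattern) == 6   # '\', 'u', '1', '2', '3', '4'
assert len(quote_pattern) == 2     # '\', "'"
assert "\\user32.dll" in windows_path

# The re module interprets the same sequences on its own.
unicode_match = re.fullmatch(unicode_pattern, "\u1234")
quote_match = re.fullmatch(quote_pattern, "'")

assert unicode_match is not None
assert quote_match is not None
```

The non-raw "\u1234" in the fullmatch call is a single character, which is
what the six-character raw pattern ends up matching.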

I experimented with tokenizer.c to see if the trailing '\' could be
special-cased in raw strings.  The minimum change I could come up with was
to have it not respect backslash-quote sequences (for finding the end of a
string) if the quote is the same type as the quote used to delimit the
string.  The following strings in the library needed to be adjusted after
that change.

I don't think this is the best solution, but the list of strings needing
changes might be useful for the discussion.

-    r'(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')
+    r'''(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?''')

-_declstringlit_match = re.compile(r'(\'[^\']*\'|"[^"]*")\s*').match
+_declstringlit_match = re.compile(r'''(\'[^\']*\'|"[^"]*")\s*''').match

-        r'(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))')   # em-dash
+        r'''(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))''')   # em-dash

-                                 r'[\"\']?'           # optional end-of-quote
+                                 r'''[\"\']?'''           # optional end-of-quote

-    _wordchars_re = re.compile(r'[^\\\'\"%s ]*' % string.whitespace)
+    _wordchars_re = re.compile(r'''[^\\\'\"%s ]*''' % string.whitespace)

-HEADER_QUOTED_VALUE_RE = re.compile(r"^\s*=\s*\"([^\"\\]*(?:\\.[^\"\\]*)*)\"")
+HEADER_QUOTED_VALUE_RE = re.compile(r'''^\s*=\s*\"([^\"\\]*(?:\\.[^\"\\]*)*)\"''')

-HEADER_JOIN_ESCAPE_RE = re.compile(r"([\"\\])")
+HEADER_JOIN_ESCAPE_RE = re.compile(r'([\"\\])')

-    quote_re = re.compile(r"([\"\\])")
+    quote_re = re.compile(r'([\"\\])')

-        return re.sub(r'((\\[\\abfnrtv\'"]|\\[0-9]..|\\x..|\\u....)+)',
+        return re.sub(r'''((\\[\\abfnrtv\'"]|\\[0-9]..|\\x..|\\u....)+)''',

-    _OPTION_DIRECTIVE_RE = re.compile(r'#\s*doctest:\s*([^\n\'"]*)$',
+    _OPTION_DIRECTIVE_RE = re.compile(r'''#\s*doctest:\s*([^\n\'"]*)$''',
                                        re.MULTILINE)

-            s = unicode(r'\x00="\'a\\b\x80\xff\u0000\u0001\u1234', 'unicode-escape')
+            s = unicode(r'''\x00="\'a\\b\x80\xff\u0000\u0001\u1234''', d

-    _escape = re.compile(r"[&<>\"\x80-\xff]+") # 1.5.2
+    _escape = re.compile(r'[&<>\"\x80-\xff]+') # 1.5.2

-    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
+    r'''(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?''')
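For reference, here is a small sketch of the unpatched raw-string rules
that make the changes above necessary.  It assumes a stock interpreter
without the tokenizer experiment, and the helper name compiles() is
invented for this example.

```python
def compiles(source):
    """Return True if `source` parses as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True

# The source text r'abc\' : the backslash-quote does not end the string,
# so the literal is unterminated.
trailing_backslash_ok = compiles("r'abc\\'")

# The source text r'abc\\' : parses, and both backslashes are kept.
doubled_backslash_ok = compiles("r'abc\\\\'")

# Under the stock rules the old and new spellings of one of the patterns
# listed above denote the same seven characters.
before = r'[\"\']?'
after = r'''[\"\']?'''

assert not trailing_backslash_ok
assert doubled_backslash_ok
assert before == after
assert len(before) == 7
```

With the experimental change, the "before" spelling would end at the
backslash-quote, which is why those literals were moved to triple quotes
or to the other quote character.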

I also noticed that Python handles the '\' escape character in regular
strings differently than re does.  In regular expressions, a single '\' is
always an escape character.  If the following character is not a special
character, then the two-character combination becomes just that following
character.

     "\'"  --> '
     "\\"  --> \
     "\q"  --> q  ('q' not special so '\q' is 'q')

This isn't how Python does it.

 >>> '\''
"'"
 >>> "\\"
'\\'
 >>> "\q"    ('q' not special, so backslash is not an escape.)
'\q'

So it might be good to have the backslash always be an escape in regular
strings, and never be an escape in raw strings.
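The difference can be checked with a short script.  This is only a sketch:
it evaluates the "\q" literal from source text with warnings silenced,
because newer interpreters warn about unrecognized escapes, and it makes no
claim about what re does with backslash plus a letter, since that depends
on the Python version (recent ones reject it as a bad escape).

```python
import re
import warnings

# Source text of the literal "\q", held in a raw string so that this
# script itself contains no unrecognized escape.
literal_source = r'"\q"'

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # silence the invalid-escape warning
    python_value = eval(literal_source)

# Python keeps the backslash: two characters, '\' and 'q'.
assert python_value == "\\q"
assert len(python_value) == 2

# Recognized escapes collapse to a single character.
assert len("\'") == 1
assert len("\\") == 1

# re treats backslash + punctuation as the literal punctuation character.
assert re.fullmatch(r"\'", "'") is not None
assert re.fullmatch(r"\\", "\\") is not None

# Backslash + ordinary letter in a pattern is version dependent, so the
# outcome is recorded rather than asserted.
try:
    q_result = re.fullmatch(r"\q", "q") is not None
except re.error:
    q_result = "rejected"
```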

Ron