[Python-Dev] \u and \U escapes in raw unicode string literals
Ron Adam
rrr at ronadam.com
Fri May 11 09:59:42 CEST 2007
Martin v. Löwis wrote:
>> This is what prompted my question, actually: in Py3k, in the
>> str/unicode unification branch, r"\u1234" changes meaning: before the
>> unification, this was an 8-bit string, where the \u was not special,
>> but now it is a unicode string, where \u *is* special.
>
> That is true for non-raw strings also: the meaning of "\u1234" also
> changes.
>
> However, traditionally, there was *no* escaping mechanism in raw strings
> in Python, and I feel that this is a good principle, because it is
> easy to learn (if you leave out the detail that \ can't be the last
> character in a raw string - which should get fixed also, IMO). So I
> think in Py3k, "\u1234" should continue to be a string with 6
> characters. Otherwise, people will complain that
> os.stat(r"c:\windows\system32\user32.dll") fails. Telling them to write
> os.stat(r"c:\windows\system32\u005Cuser32.dll") will just cause puzzled
> faces.
>
> Windows path names are one of the two primary applications of raw
> strings (the other being regexes).
I think regular expressions become easier to read if they don't also
contain Python escape sequences, because then you don't have to mentally
parse which ones are part of the regular expression and which ones are
evaluated by Python. The re module can still evaluate r"\uxxxx", r"\'",
and r'\"' sequences even if Python doesn't.
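A small sketch of that division of labour, assuming a current Python 3 in which raw string literals leave \u alone and the re module understands \uXXXX on its own (documented since 3.3). The variable names are only illustrative:

```python
import re

# In a raw literal the backslash sequences reach re untouched.
pattern = r"\u1234"
assert len(pattern) == 6          # backslash, 'u', and four hex digits

# The re module interprets \uXXXX itself, so the six-character pattern
# matches the single character U+1234.
matches_codepoint = re.fullmatch(pattern, "\u1234") is not None

# Escaped quotes are likewise resolved by re, not by the Python tokenizer.
matches_single = re.fullmatch(r"\'", "'") is not None
matches_double = re.fullmatch(r'\"', '"') is not None

print(matches_codepoint, matches_single, matches_double)
```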
I experimented with tokenizer.c to see if the trailing '\' could be
special-cased in raw strings. The minimum change I could come up with was
to have it not respect backslash-quote sequences (for finding the end of a
string) if the quote is the same type as the quote used to define the
string. The following strings in the library needed to be adjusted after
that change. I don't think this is the best solution, but the list of
strings that needed changing might be useful for the discussion.
- r'(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')
+ r'''(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?''')
-_declstringlit_match = re.compile(r'(\'[^\']*\'|"[^"]*")\s*').match
+_declstringlit_match = re.compile(r'''(\'[^\']*\'|"[^"]*")\s*''').match
- r'(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))') # em-dash
+ r'''(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))''') # em-dash
- r'[\"\']?' # optional end-of-quote
+ r'''[\"\']?''' # optional end-of-quote
- _wordchars_re = re.compile(r'[^\\\'\"%s ]*' % string.whitespace)
+ _wordchars_re = re.compile(r'''[^\\\'\"%s ]*''' % string.whitespace)
-HEADER_QUOTED_VALUE_RE = re.compile(r"^\s*=\s*\"([^\"\\]*(?:\\.[^\"\\]*)*)\"")
+HEADER_QUOTED_VALUE_RE = re.compile(r'''^\s*=\s*\"([^\"\\]*(?:\\.[^\"\\]*)*)\"''')
-HEADER_JOIN_ESCAPE_RE = re.compile(r"([\"\\])")
+HEADER_JOIN_ESCAPE_RE = re.compile(r'([\"\\])')
- quote_re = re.compile(r"([\"\\])")
+ quote_re = re.compile(r'([\"\\])')
- return re.sub(r'((\\[\\abfnrtv\'"]|\\[0-9]..|\\x..|\\u....)+)',
+ return re.sub(r'''((\\[\\abfnrtv\'"]|\\[0-9]..|\\x..|\\u....)+)''',
- _OPTION_DIRECTIVE_RE = re.compile(r'#\s*doctest:\s*([^\n\'"]*)$',
+ _OPTION_DIRECTIVE_RE = re.compile(r'''#\s*doctest:\s*([^\n\'"]*)$''',
re.MULTILINE)
- s = unicode(r'\x00="\'a\\b\x80\xff\u0000\u0001\u1234',
'unicode-escape')
+ s = unicode(r'''\x00="\'a\\b\x80\xff\u0000\u0001\u1234''', d
- _escape = re.compile(r"[&<>\"\x80-\xff]+") # 1.5.2
+ _escape = re.compile(r'[&<>\"\x80-\xff]+') # 1.5.2
- r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
+ r'''(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?''')
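For context on why these literals are affected: under the existing tokenizer rule, a backslash in front of the delimiting quote keeps the raw string open, while the backslash itself stays in the value. The following is a minimal sketch of that existing behaviour (not of the experimental patch). The helper is_valid_literal is hypothetical, and compile() is used so the failing literal can be shown without breaking the file:

```python
# Existing rule: backslash-quote does not end a raw string, and both
# characters are kept in the resulting value.
kept = r'\''
assert kept == "\\'" and len(kept) == 2

def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<literal>", "eval")
        return True
    except SyntaxError:
        return False

# A raw string cannot end in a single backslash under the existing rule.
trailing_backslash_ok = is_valid_literal("r'\\'")      # source text: r'\'

# Triple quoting, as in the adjusted library strings, works whether or
# not backslash-quote is special, because a lone quote never ends it.
triple = r'''[\"\']?'''
print(trailing_backslash_ok, triple)
```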
I also noticed that Python handles the '\' escape character in regular
strings differently than re does. In regular expressions, a single '\' is
always an escape character. If the following character is not a special
character, then the two-character combination stands for that second,
non-special character.
"\'" --> '
"\\" --> \
"\q" --> q ('q' not special so '\q' is 'q')
This isn't how Python does it.
>>> '\''
"'"
>>> "\\"
'\\'
>>> "\q" ('q' not special, so the backslash is not an escape.)
'\q'
So it might be good to have the backslash always be an escape in regular
strings, and never be an escape in raw strings.
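The two behaviours can be compared directly. This is a sketch with two caveats. It uses a punctuation character rather than 'q', because re in recent Python releases (3.7 onwards) rejects unknown letter escapes such as \q. It also evaluates the non-raw literal from source text with warnings silenced, because newer Python versions warn about unrecognized escapes in string literals and may eventually reject them.

```python
import ast
import re
import warnings

# re: backslash before a non-special punctuation character yields that
# character, so the two-character pattern matches one '!'.
re_drops_backslash = re.fullmatch(r"\!", "!") is not None

# Python string literal: an unrecognized escape keeps the backslash.
# The literal is evaluated from source text so the warning that newer
# versions emit for it can be silenced locally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    python_value = ast.literal_eval('"\\!"')   # source text: "\!"

print(re_drops_backslash, len(python_value))
```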
Ron