\u and \U escapes in raw unicode string literals

I just discovered that, in all versions of Python as far back as I have access to (2.0), \uXXXX escapes are interpreted inside raw unicode strings. Thus:
Contrast this with:
The \U escape has the same behavior, in versions that support it. Does anyone remember why it is done this way? The reference manual describes this behavior, but doesn't give an explanation:

"""
When an "r" or "R" prefix is used in conjunction with a "u" or "U" prefix, then the \uXXXX and \UXXXXXXXX escape sequences are processed while all other backslashes are left in the string. For example, the string literal ur"\u0062\n" consists of three Unicode characters: `LATIN SMALL LETTER B', `REVERSE SOLIDUS', and `LATIN SMALL LETTER N'. Backslashes can be escaped with a preceding backslash; however, both remain in the string. As a result, \uXXXX escape sequences are only recognized when there are an odd number of backslashes.
"""

-- --Guido van Rossum (home page: http://www.python.org/~guido/)
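The rule quoted from the reference manual can still be observed through the raw-unicode-escape codec, which decodes only \uXXXX and \UXXXXXXXX. What follows is a minimal sketch, assuming Python 3 (where the ur prefix is no longer accepted); a bytes literal stands in for the source text of the ur"..." literal, and the variable names are illustrative only.

```python
# Stand-in for the Python 2 literal ur"\u0062\n": the bytes hold the
# source characters, and the codec applies the raw-unicode rules.
source = br"\u0062\n"
decoded = source.decode("raw_unicode_escape")

# \u0062 is turned into 'b'; the backslash before 'n' is left in place,
# so the result is three characters: 'b', '\', 'n'.
print(list(decoded))
print(len(decoded))
```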

On 2007-05-10 20:53, Paul Moore wrote:
This is per design (see PEP 100) and was done for the reason given by Paul. The motivation for the chosen approach was to make Python's raw Unicode strings compatible with Java's raw Unicode strings: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 10 2007)
:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611

On 5/10/07, M.-A. Lemburg <mal@egenix.com> wrote:
I'm not sure what Java compatibility buys us. It is also far from perfect -- IIUC, in Java if you write \u0022 (that's the " character) it counts as an opening or closing quote, and if you write \u005c (a backslash) it can be used to escape the following character. OTOH, in Python, you can write ur"C:\Program Files\u005c" and voila, a raw string terminating in a backslash. (In Java this would escape the " instead.) However, I understand the other reason (inclusion of non-ASCII characters in raw strings) and I reluctantly agree with it. Reluctantly, because it means I can't create a raw string containing a \ followed by u or U -- I needed one of those today. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
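The trailing-backslash trick described above can be sketched as follows, assuming Python 3 and again using the raw-unicode-escape codec to emulate the Python 2 ur"..." literal. The helper name compiles is hypothetical and exists only for this illustration.

```python
# Python 2's ur"C:\Program Files\u005c", emulated with the codec:
# only the \u005c escape is decoded, giving a trailing backslash.
path = br"C:\Program Files\u005c".decode("raw_unicode_escape")
ends_with_backslash = path.endswith("\\")

def compiles(src):
    """Return True if `src` is accepted by the compiler as an expression."""
    try:
        compile(src, "<sketch>", "eval")
        return True
    except SyntaxError:
        return False

# The source text r"abc\" is rejected: the backslash keeps the closing
# quote from terminating the literal.
trailing_ok = compiles('r"abc\\"')
print(path, ends_with_backslash, trailing_ok)
```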

On 2007-05-11 00:11, Guido van Rossum wrote:
http://mail.python.org/pipermail/python-dev/1999-November/001346.html http://mail.python.org/pipermail/python-dev/1999-November/001392.html and all the other postings in that month related to this.
print ur"\u005cu"
\u
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 11 2007)

However, I understand the other reason (inclusion of non-ASCII characters in raw strings) and I reluctantly agree with it.
I actually disagree with that. It is fairly easy to include non-ASCII characters in a raw Unicode string - just type them in. Or, if that fails, use string concatenation with a non-raw string: r"foo\uhallo" "\u20ac" r"welt" Regards, Martin
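Martin's concatenation suggestion can be tried directly. This is a small sketch assuming Python 3 semantics, where the raw pieces leave \u alone and only the non-raw middle piece has its escape interpreted.

```python
# Adjacent string literals are concatenated at compile time; the two raw
# pieces keep their backslashes, the middle piece becomes the euro sign.
s = r"foo\uhallo" "\u20ac" r"welt"

print(s)
print(len(s))
```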

On 5/10/07, "Martin v. Löwis" <martin@v.loewis.de> wrote:
That violates the convention used in many places that source code should only contain printable ASCII, and all non-ASCII or unprintable characters should be written using \x or \u escapes.
That makes for pretty unreadable source code though. Looking for a third opinion, -- --Guido van Rossum (home page: http://www.python.org/~guido/)

On 5/10/07, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Fair enough. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Martin v. Löwis wrote:
why should you be able to get a non-ASCII character into a raw Unicode string?
The analogous question would be why can't you get a non-Unicode character into a raw Unicode string. That wouldn't make sense, since Unicode strings can't even hold non-Unicode characters (or at least they're not meant to). But it doesn't seem unreasonable to want to put Unicode characters into a raw Unicode string. After all, if it only contains ASCII characters there's no need for it to be a Unicode string in the first place. -- Greg

On 5/10/07, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
This is what prompted my question, actually: in Py3k, in the str/unicode unification branch, r"\u1234" changes meaning: before the unification, this was an 8-bit string, where the \u was not special, but now it is a unicode string, where \u *is* special. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

That is true for non-raw strings also: the meaning of "\u1234" also changes. However, traditionally, there was *no* escaping mechanism in raw strings in Python, and I feel that this is a good principle, because it is easy to learn (if you leave out the detail that \ can't be the last character in a raw string - which should get fixed also, IMO). So I think in Py3k, "\u1234" should continue to be a string with 6 characters. Otherwise, people will complain that os.stat(r"c:\windows\system32\user32.dll") fails. Telling them to write os.stat(r"c:\windows\system32\u005Cuser32.dll") will just cause puzzled faces. Windows path names are one of the two primary applications of raw strings (the other being regexes). Regards, Martin
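For comparison, current Python 3 releases behave the way this message asks for: \u is not processed inside a raw string literal. The snippet below is a quick check under that assumption, not a statement about the branch being discussed in the thread.

```python
raw = r"\u1234"      # six characters: backslash, 'u', '1', '2', '3', '4'
cooked = "\u1234"    # one character: U+1234

# The Windows path from the message keeps its "\u" as two characters.
path = r"c:\windows\system32\user32.dll"

print(len(raw), len(cooked))
print(path)
```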

Martin v. Löwis wrote:
I think regular expressions become easier to read if they don't also contain Python escape characters, because then you don't have to mentally parse which ones are part of the regular expression and which ones are evaluated by Python. The re module can still evaluate r"\uxxxx", r"\'", and r'\"' sequences even if Python doesn't.

I experimented with tokenizer.c to see if the trailing '\' could be special cased in raw strings. The minimum change I could come up with was to have it not respect slash-quote sequences (for finding the end of a string) if the quote is the same type as the quote used to define the string. The following strings in the library needed to be adjusted after that change. I don't think this is the best solution, but the list of strings needing changes might be useful for the discussion.

- r'(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')
+ r'''(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?''')

-_declstringlit_match = re.compile(r'(\'[^\']*\'|"[^"]*")\s*').match
+_declstringlit_match = re.compile(r'''(\'[^\']*\'|"[^"]*")\s*''').match

- r'(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))') # em-dash
+ r'''(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))''') # em-dash

- r'[\"\']?' # optional end-of-quote
+ r'''[\"\']?''' # optional end-of-quote

- _wordchars_re = re.compile(r'[^\\\'\"%s ]*' % string.whitespace)
+ _wordchars_re = re.compile(r'''[^\\\'\"%s ]*''' % string.whitespace)

-HEADER_QUOTED_VALUE_RE = re.compile(r"^\s*=\s*\"([^\"\\]*(?:\\.[^\"\\]*)*)\"")
+HEADER_QUOTED_VALUE_RE = re.compile(r'''^\s*=\s*\"([^\"\\]*(?:\\.[^\"\\]*)*)\"''')

-HEADER_JOIN_ESCAPE_RE = re.compile(r"([\"\\])")
+HEADER_JOIN_ESCAPE_RE = re.compile(r'([\"\\])')

- quote_re = re.compile(r"([\"\\])")
+ quote_re = re.compile(r'([\"\\])')

- return re.sub(r'((\\[\\abfnrtv\'"]|\\[0-9]..|\\x..|\\u....)+)',
+ return re.sub(r'''((\\[\\abfnrtv\'"]|\\[0-9]..|\\x..|\\u....)+)''',

- _OPTION_DIRECTIVE_RE = re.compile(r'#\s*doctest:\s*([^\n\'"]*)$',
+ _OPTION_DIRECTIVE_RE = re.compile(r'''#\s*doctest:\s*([^\n\'"]*)$''',
 re.MULTILINE)

- s = unicode(r'\x00="\'a\\b\x80\xff\u0000\u0001\u1234', 'unicode-escape')
+ s = unicode(r'''\x00="\'a\\b\x80\xff\u0000\u0001\u1234''', d

- _escape = re.compile(r"[&<>\"\x80-\xff]+") # 1.5.2
+ _escape = re.compile(r'[&<>\"\x80-\xff]+') # 1.5.2

- r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
+ r'''(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?''')

I also noticed that Python handles the '\' escape character differently than re does in regular strings. In regular expressions, a single '\' is always an escape character. If the following character is not a special character, then the two character combination becomes the second non-special character.

"\'" --> '
"\\" --> \
"\q" --> q ('q' not special so '\q' is 'q')

This isn't how Python does it. So it might be good to have it always be an escape in regular strings, and never be an escape in raw strings.

Ron

On 2007-05-11 07:52, Martin v. Löwis wrote:
Using double backslashes won't cause that reaction: os.stat("c:\\windows\\system32\\user32.dll") Also note that Windows is smart enough nowadays to parse the good old Unix forward slash: os.stat("c:/windows/system32/user32.dll")
Windows path names are one of the two primary applications of raw strings (the other being regexes).
IMHO the primary use case is regexps and for those you'd definitely want to be able to put Unicode characters into your expressions. BTW, if you use ur"..." for your expressions today (which you should if you parse text), then nothing will change when removing the 'u' prefix in Py3k. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 11 2007)

M.-A. Lemburg schrieb:
Sure. But I want to use raw strings for Windows path names; it's much easier to type.
In my opinion this is a Windows bug and not a feature, especially because there are Windows API functions (the shell functions, IIRC) that do NOT accept forward slashes. Would you say that *nix is dumb because it doesn't parse "\\usr\\include"?
Windows path names are one of the two primary applications of raw strings (the other being regexes).
Thomas

On 2007-05-11 13:05, Thomas Heller wrote:
But think of the price to pay if we disable use of Unicode escapes in raw strings. And all of this just because of the one special case: having a file name that starts with a U and needs to be referenced literally in a Python application together with a path leading up to it. BTW, there's an easy work-around for this special case: os.stat(os.path.join(r"c:\windows\system32", "user32.dll"))
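The os.path.join work-around can be checked on any platform by using ntpath, the Windows flavour of os.path. This is a sketch assuming Python 3; on Windows itself os.path.join gives the same result.

```python
import ntpath  # Windows implementation of os.path, importable everywhere

# Keeping the file name in its own literal means no backslash is ever
# directly followed by the 'u' of "user32.dll" in the source text.
joined = ntpath.join(r"c:\windows\system32", "user32.dll")
print(joined)
```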
Sorry, I wasn't trying to imply that Windows is/was a dumb system. I think it's nice that you can use forward slashes on Windows - makes writing code that works in both worlds (Unix and Windows) a lot easier.
Windows path names are one of the two primary applications of raw strings (the other being regexes).
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 11 2007)

BTW, there's an easy work-around for this special case:
os.stat(os.path.join(r"c:\windows\system32", "user32.dll"))
No matter what the decision is, there are always work-arounds. The question is what language suits the users most. Being able to specify characters by ordinal IMO has much less value than a consistent, concise definition of raw strings has.
But, as Thomas says: you can't. You may be able to do so when using the API directly, however, it fails if you pass the file name in a command line of some tool that takes /foo to mean a command line option "foo". Regards. Martin

I think I'm going to break my own rules and ask Martin to write up a PEP. Given the pragmatics that Windows pathnames *are* a common use case, I'm willing to allow the trailing \ in the string. A regular expression containing a quote could be written using triple quotes, e.g. r"""(["'])[^"']*\1""". (A single " in a regular expression can always be rewritten as ["] AFAIK.) -- --Guido van Rossum (home page: http://www.python.org/~guido/)
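The triple-quoted pattern from this message works without any backslash-quote sequences. Here is a short sketch of how it behaves, assuming Python 3; the sample strings are made up for the illustration.

```python
import re

# An opening quote of either kind, any run of non-quote characters, then
# a backreference requiring the same kind of quote to close.
quoted = re.compile(r"""(["'])[^"']*\1""")

single = quoted.search("x = 'abc' + y")
double = quoted.search('say "hi"')
mismatched = quoted.search("'abc\"")  # opens with ' but closes with "

# A lone double quote written as the character class ["].
in_class = re.fullmatch(r'a["]b', 'a"b')

print(single.group(0), double.group(0), mismatched, bool(in_class))
```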

Using double backslashes won't cause that reaction:
os.stat("c:\\windows\\system32\\user32.dll")
Please refer to the subject. We are talking about raw strings.
It's not a matter of opinion. It's a statistical fact that these are the two cases where people use raw strings most.
For regular expressions, you don't need them as part of the string literal syntax: The re parser itself could support \u, just like it supports \x today.
How do you know? Py3k hasn't been released, yet. Regards, Martin

On 2007-05-12 00:48, Martin v. Löwis wrote:
If you'd leave the context in place, the reason for my suggestion would become evident.
Ah, statistics :-) It always depends on who you ask: a Windows user will obviously have more use for the raw string use-case you gave than a Unix user. At the end of the day, I still believe that the regexp use-case is by far more common than the Windows path name one. FWIW: Zope has 2 uses of raw strings for Windows path names (if I counted correctly) and around 100 for regexps. Python itself has maybe 10-20 Windows path name (and registry name) uses of raw strings (in the msi lib and distutils) vs. around 300 uses for regexps.
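Counts like these could be reproduced mechanically. The following is a rough sketch, assuming Python 3 and its tokenize module; count_raw_strings is a hypothetical helper, and it only counts raw literals without trying to classify them as regexps or path names.

```python
import io
import tokenize

def count_raw_strings(source):
    """Count string literals in `source` whose prefix contains r or R."""
    count = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue  # f-strings are tokenized differently on newer versions
        # The prefix is everything before the first quote character.
        positions = (tok.string.find("'"), tok.string.find('"'))
        quote_at = min(i for i in positions if i >= 0)
        if "r" in tok.string[:quote_at].lower():
            count += 1
    return count

# Three assignments: a raw regexp, a plain string, and a raw path.
sample = 'a = r"\\d+"\nb = "plain"\nc = R\'c:\\temp\'\n'
print(count_raw_strings(sample))
```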
True and perhaps that's the right path to follow. You'd still have the problem of writing Windows path names with embedded Unicode characters, but I guess that's something we can fix another day ;-)
Sorry, I wasn't clear: if the raw-unicode-escape codec continues to work the way it does now, you won't run into trouble in Py3k. [and later:]
I wonder how we managed to survive all these years with the existing consistent and concise definition of the raw-unicode-escape codec ;-) There are two options:
* no one really uses Unicode raw strings nowadays
* none of the existing users has ever stumbled across the "problem case" that triggered all this
Both ways, we're discussing a non-issue.
Strange. I've been doing exactly that for years. Maybe it's just because I stick to common os module APIs. So far, I've never run into any problem with it. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 12 2007)

On Sat, May 12, 2007 at 01:30:52AM +0200, M.-A. Lemburg wrote:
Sure, it's a non-issue for Python 2.x. However, when Python 3 comes along, and all strings are Unicode, there will likely be a lot more users stumbling into the problem case. -- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868

On 2007-05-12 02:42, Andrew McNabb wrote:
In the first case, changing the codec won't affect much code when ported to Py3k. In the second case, a change to the codec is not necessary. Please also consider the following:
* without the Unicode escapes, the only way to put non-ASCII code points into a raw Unicode string is via a source code encoding of say UTF-8 or UTF-16, pretty much defeating the original requirement of writing ASCII code only
* non-ASCII code points in text are not uncommon, they occur in most European scripts, all Asian scripts, many scientific texts and also in texts meant for the web (just have a look at the HTML entities, or think of Word exports using quotes)
* adding Unicode escapes to the re module will break code already using "...\u..." in the regular expressions for other purposes; writing conversion tools that detect this usage is going to be hard
* OTOH, writing conversion tools that simply work on string literals in general is easy
Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 13 2007)

That's no problem, though - just don't put the Unicode character into a raw string. Use plain strings if you have a need to include Unicode characters, and are not willing to leave ASCII. For Python 3, the default source encoding is UTF-8, so it is much easier to use non-ASCII characters in the source code. The original requirement may not be as strong anymore as it used to be.
And you are seriously telling me that people who commonly use non-ASCII code points in their source code are willing to refer to them by Unicode ordinal number (which, of course, they all know by heart, from 1 to 65536)?
It's unlikely to occur in code today - \u just means the same as u (so \u1234 matches u1234); if you want a backslash followed by u in your regular expression, you should write \\u. It would be possible to future-warn about \u in 2.6, catching these cases. Authors then would either have to remove the backslash, or duplicate it, depending on what they want to express. Regards, Martin

On 2007-05-13 18:04, Martin v. Löwis wrote:
You can do that today: Just put the "# coding: utf-8" marker at the top of the file. However, in some cases, your editor may not be capable of displaying or letting you enter the Unicode text you have in mind. In other cases, there may be a corporate coding standard in place that prohibits using non-ASCII text in source code, or fixes the encoding to e.g. Latin-1. In all those cases, it's necessary to be able to enter the Unicode code points which cannot be used in the source code using other means, and the easiest way to do this is by using Unicode escapes.
No, I'm not. I'm saying that non-ASCII code points are in common use and (together with the above bullet) that there are situations where you can't put the relevant code point directly into your source code. Using Unicode escapes for these will always be a kludge, but it's still better than not being able to enter the code points at all.
Good idea. The re module would then have to implement the same escaping scheme as the raw-unicode-escape codec (only an odd number of backslashes causes the escaping code to trigger). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 13 2007)
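The odd-number-of-backslashes scheme mentioned here can be seen in the codec itself. This is a minimal sketch, assuming Python 3 and the standard raw-unicode-escape codec; decode_raw is a hypothetical convenience wrapper.

```python
def decode_raw(src: bytes) -> str:
    """Decode bytes the way a Python 2 ur"..." literal body was read."""
    return src.decode("raw_unicode_escape")

one = decode_raw(br"\u0062")      # one backslash: the escape is decoded
two = decode_raw(br"\\u0062")     # two backslashes: left untouched
three = decode_raw(br"\\\u0062")  # three: two kept, the third escapes

print(one, two, three)
```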

M.-A. Lemburg wrote:
* non-ASCII code points in text are not uncommon, they occur in most European scripts, all Asian scripts,
In an Asian script, almost every character is likely to be non-ascii, which is going to be pretty hard to read as a string of unicode escapes. Maybe what we want is a new kind of string literal in which *everything* is a unicode escape. A sufficiently smart editor could then display it using the appropriate characters, yet it could still be dealt with as ascii- only in a pinch. -- Greg
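Greg's idea of text in which *everything* is an escape can be approximated with ordinary functions rather than a new literal syntax. What follows is only a sketch under that assumption: to_all_escapes and from_all_escapes are hypothetical names, and Python 3 is assumed.

```python
def to_all_escapes(text: str) -> str:
    """Spell every character as a 4-digit or 8-digit Unicode escape."""
    return "".join(
        f"\\u{ord(ch):04x}" if ord(ch) <= 0xFFFF else f"\\U{ord(ch):08x}"
        for ch in text
    )

def from_all_escapes(escaped: str) -> str:
    """Turn the ASCII-only escaped form back into the original text."""
    return escaped.encode("ascii").decode("raw_unicode_escape")

escaped = to_all_escapes("a\u20ac")
print(escaped)
print(from_all_escapes(escaped))
```

An editor that understood this form could show the decoded characters while the file on disk stays pure ASCII, which is the trade-off the message describes.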

On 5/10/07, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Windows path names are one of the two primary applications of raw strings (the other being regexes).
I disagree with this use case; the r"..." notation was not invented for this purpose. I won't compromise the escaping of quotes to accommodate it. Nevertheless I think that \u and \U should lose their special-ness in 3.0. I'd like to hear from anyone who has access to *real code* that uses \u or \U in a raw unicode string. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido <at> python.org> writes:
I'd like to hear from anyone who has access to *real code* that uses \u or \U in a raw unicode string.
Docutils uses it in the docutils.parsers.rst.states module, Body class:
patterns = { 'bullet': ur'[-+*\u2022\u2023\u2043]( +|$)', ...
attribution_pattern = re.compile(ur'(---?(?!-)|\u2014) *(?=[^ \n])')
-- David Goodger <http://python.net/~goodger>

On 5/11/07, David Goodger <goodger@python.org> wrote:
But wouldn't it be just as handy to teach the re module about \u and \U, just as it already knows about \x (and \123 octals)? -- --Guido van Rossum (home page: http://www.python.org/~guido/)
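As far as I know, this is what later Python 3 releases ended up doing: the re module itself understands \uXXXX and \UXXXXXXXX in str patterns (documented as added in 3.3). Under that assumption, the Docutils patterns quoted earlier in the thread work as plain raw strings, as this sketch shows.

```python
import re

# The literals contain a real backslash followed by 'u'; it is the re
# parser, not the Python compiler, that turns them into code points.
bullet = re.compile(r'[-+*\u2022\u2023\u2043]( +|$)')
attribution = re.compile(r'(---?(?!-)|\u2014) *(?=[^ \n])')

m1 = bullet.match("\u2022 item")
m2 = attribution.match("\u2014 Author")
raw_len = len(r"\u2022")  # still six characters in the pattern string

print(bool(m1), bool(m2), raw_len)
```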

On 5/11/07, Guido van Rossum <guido@python.org> wrote:
But wouldn't it be just as handy to teach the re module about \u and \U, just as it already knows about \x (and \123 octals)?
Could be. I'm just providing examples, as requested. I leave the heavy thinking to others ;-) -- David Goodger <http://python.net/~goodger>

Greg Ewing schrieb:
No, that would not be analogous. The string type in Python is not an ASCII string type, but a byte string type. It does not necessarily only hold ASCII characters, but can (and, in hundreds of applications, does) hold arbitrary bytes. There is (in the non-raw form) support for putting arbitrary bytes into a byte string literal. So no, this is not analogous. Regards, Martin

On 2007-05-10 20:53, Paul Moore wrote:
This is per design (see PEP 100) and was done for the reason given by Paul. The motivation for the chosen approach was to make Python's raw Unicode strings compatible to Java's raw Unicode strings: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 10 2007)
:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611

On 5/10/07, M.-A. Lemburg <mal@egenix.com> wrote:
I'm not sure what Java compatibility buys us. It is also far from perfect -- IIUC, in Java if you write \u0022 (that's the " character) it counts as an opening or closing quote, and if you write \u005c (a backslash) it can be used to escape the following character. OTOH, in Python, you can write ur"C:\Program Files\u005c" and voila, a raw string terminating in a backslash. (In Java this would escape the " instead.) However, I understand the other reason (inclusion of non-ASCII characters in raw strings) and I reluctantly agree with it. Reluctantly, because it means I can't create a raw string containing a \ followed by u or U -- I needed one of those today. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

On 2007-05-11 00:11, Guido van Rossum wrote:
http://mail.python.org/pipermail/python-dev/1999-November/001346.html http://mail.python.org/pipermail/python-dev/1999-November/001392.html and all the other postings in that month related to this.
print ur"\u005cu" \u
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 11 2007)
:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611

However, I understand the other reason (inclusion of non-ASCII characters in raw strings) and I reluctantly agree with it.
I actually disagree with that. It is fairly easy to include non-ASCII characters in a raw Unicode string - just type them in. Or, if that fails, use string concatenation with a non-raw string: r"foo\uhallo" "\u20ac" r"welt" Regards, Martin

On 5/10/07, "Martin v. Löwis" <martin@v.loewis.de> wrote:
That violates the convention used in many places that source code should only contain printable ASCII, and all non-ASCII or unprintable characters should be written using \x or \u escapes.
That makes for pretty unreadable source code though. Looking for a third opinion, -- --Guido van Rossum (home page: http://www.python.org/~guido/)

On 5/10/07, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Fair enough. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Martin v. Löwis wrote:
why should you be able to get a non-ASCII character into a raw Unicode string?
The analogous question would be why can't you get a non-Unicode character into a raw Unicode string. That wouldn't make sense, since Unicode strings can't even hold non-Unicode characters (or at least they're not meant to). But it doesn't seem unreasonable to want to put Unicode characters into a raw Unicode string. After all, if it only contains ASCII characters there's no need for it to be a Unicode string in the first place. -- Greg

On 5/10/07, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
This is what prompted my question, actually: in Py3k, in the str/unicode unification branch, r"\u1234" changes meaning: before the unification, this was an 8-bit string, where the \u was not special, but now it is a unicode string, where \u *is* special. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

That is true for non-raw strings also: the meaning of "\u1234" also changes. However, traditionally, there was *no* escaping mechanism in raw strings in Python, and I feel that this is a good principle, because it is easy to learn (if you leave out the detail that \ can't be the last character in a raw string - which should get fixed also, IMO). So I think in Py3k, "\u1234" should continue to be a string with 6 characters. Otherwise, people will complain that os.stat(r"c:\windows\system32\user32.dll") fails. Telling them to write os.stat(r"c:\windows\system32\u005Cuser32.dll") will just cause puzzled faces. Windows path names are one of the two primary applications of raw strings (the other being regexes). Regards, Martin

Martin v. Löwis wrote:
I think regular expressions become easier to read if they don't also contain python escape characters because then you don't have to mentally parse which ones are part of the regular expression and which ones are evaluated by python. The re module can still evaluate r"\uxxxx", r"\'", and r'\"' sequences even if python doesn't. I experimented with tokanize.c to see if the trailing '\' could be special cased in raw strings. The minimum change I could come up with was to have it not respect slash-quote sequences, (for finding the end of a string), if the quote is the same type as the quote used to define the string. The following strings in the library needed to be adjusted after that change. I don't think this is the best solution, but the list of strings needing changed might be useful for the discussion. - r'(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?') + r'''(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?''') -_declstringlit_match = re.compile(r'(\'[^\']*\'|"[^"]*")\s*').match +_declstringlit_match = re.compile(r'''(\'[^\']*\'|"[^"]*")\s*''').match - r'(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))') # em-dash + r'''(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))''') # em-dash - r'[\"\']?' 
# optional end-of-quote + r'''[\"\']?''' # optional end-of-quote - _wordchars_re = re.compile(r'[^\\\'\"%s ]*' % string.whitespace) + _wordchars_re = re.compile(r'''[^\\\'\"%s ]*''' % string.whitespace) -HEADER_QUOTED_VALUE_RE = re.compile(r"^\s*=\s*\"([^\"\\]*(?:\\.[^\"\\]*)*)\"") +HEADER_QUOTED_VALUE_RE = re.compile(r'''^\s*=\s*\"([^\"\\]*(?:\\.[^\"\\]*)*)\"''') -HEADER_JOIN_ESCAPE_RE = re.compile(r"([\"\\])") +HEADER_JOIN_ESCAPE_RE = re.compile(r'([\"\\])') - quote_re = re.compile(r"([\"\\])") + quote_re = re.compile(r'([\"\\])') - return re.sub(r'((\\[\\abfnrtv\'"]|\\[0-9]..|\\x..|\\u....)+)', + return re.sub(r'''((\\[\\abfnrtv\'"]|\\[0-9]..|\\x..|\\u....)+)''', - _OPTION_DIRECTIVE_RE = re.compile(r'#\s*doctest:\s*([^\n\'"]*)$', + _OPTION_DIRECTIVE_RE = re.compile(r'''#\s*doctest:\s*([^\n\'"]*)$''', re.MULTILINE) - s = unicode(r'\x00="\'a\\b\x80\xff\u0000\u0001\u1234', 'unicode-escape') + s = unicode(r'''\x00="\'a\\b\x80\xff\u0000\u0001\u1234''', d - _escape = re.compile(r"[&<>\"\x80-\xff]+") # 1.5.2 + _escape = re.compile(r'[&<>\"\x80-\xff]+') # 1.5.2 - r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?') + r'''(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?''') I also noticed that python handles the '\' escape character differently than re does in regular strings. In regular expressions, a single '\' is always an escape character. If the following character is not a special character, then the two character combination becomes the second non-special character. "\'" --> ' "\\" --> \ "\q" --> q ('q' not special so '\q' is 'q') This isn't how python does it.
So it might be good to have it always be an escape in regular strings, and never be an escape in raw strings. Ron

On 2007-05-11 07:52, Martin v. Löwis wrote:
Using double backslashes won't cause that reaction: os.stat("c:\\windows\\system32\\user32.dll") Also note that Windows is smart enough nowadays to parse the good old Unix forward slash: os.stat("c:/windows/system32/user32.dll")
Windows path names are one of the two primary applications of raw strings (the other being regexes).
IMHO the primary use case are regexps and for those you'd definitely want to be able to put Unicode characters into your expressions. BTW, if you use ur"..." for your expressions today (which you should if you parse text), then nothing will change when removing the 'u' prefix in Py3k. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 11 2007)
:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611

M.-A. Lemburg schrieb:
Sure. But I want to use raw strings for Windows path names; it's much easier to type.
In my opinion this is a windows bug and not a features. Especially because there are Windows api functions (the shell functions, IIRC) that do NOT accept forward slashes. Would you say that *nix is dumb because it doesn't parse "\\usr\\include"?
Windows path names are one of the two primary applications of raw strings (the other being regexes).
Thomas

On 2007-05-11 13:05, Thomas Heller wrote:
But think of the price to pay if we disable use of Unicode escapes in raw strings. And all of this just because of the one special case: having a file name that starts with a U and needs to be referenced literally in a Python application together with a path leading up to it. BTW, there's an easy work-around for this special case: os.stat(os.path.join(r"c:\windows\system32", "user32.dll"))
Sorry, I wasn't trying to imply that Windows is/was a dumb system. I think it's nice that you can use forward slashes on Windows - makes writing code that works in both worlds (Unix and Windows) a lot easier.
Windows path names are one of the two primary applications of raw strings (the other being regexes).
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 11 2007)
:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611

BTW, there's an easy work-around for this special case:
os.stat(os.path.join(r"c:\windows\system32", "user32.dll"))
No matter what the decision is, there are always work-arounds. The question is what language suits the users most. Being able to specify characters by ordinal IMO has much less value than the a consistent, concise definition of raw strings has.
But, as Thomas says: you can't. You may be able to do so when using the API directly, however, it fails if you pass the file name in a command line of some tool that takes /foo to mean a command line option "foo". Regards. Martin

I think I'm going to break my own rules and ask Martin to write up a PEP. Given the pragmatics that Windows pathnames *are* a common use case, I'm willing to let allow the trailing \ in the string. A regular expression containing a quote could be written using triple quotes, e.g. r"""(["'])[^"']*\1""". (A single " in a regular expression can always be rewritten as ["] AFAIK.) -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Using double backslashes won't cause that reaction:
os.stat("c:\\windows\\system32\\user32.dll")
Please refer to the subject. We are talking about raw strings.
It's not a matter of opinion. It's a statistical fact that these are the two cases where people use raw strings most.
For regular expressions, you don't need them as part of the string literal syntax: The re parser itself could support \u, just like it supports \x today.
How do you know? Py3k hasn't been released, yet. Regards, Martin

On 2007-05-12 00:48, Martin v. Löwis wrote:
If you'd leave the context in place, the reason for my suggestion would become evident.
Ah, statistics :-) It always depends on who you ask: a Windows user will obviously have more use for the raw string use-case you gave than a Unix user. At the end of the day, I still believe that the regexp use-case is by far more common than the Windows path name one. FWIW: Zope has 2 uses of raw string for Windows path names (if I counted correctly) and around 100 for regexps. Python itself has maybe 10-20 Windows path name (and registry name) uses of raw string (in the msi lib and distutils) vs. around 300 uses for regexps.
True and perhaps that's the right path to follow. You'd still have the problem of writing Windows path names with embedded Unicode characters, but I guess that's something we can fix another day ;-)
Sorry, I wasn't clear: if the raw-unicode-escape codec continues to work the way it does now, you won't run into trouble in Py3k. [and later:]
I wonder how we managed to survive all these years with the existing consistent and concise definition of the raw-unicode-escape codec ;-) There are two options:
* no one really uses Unicode raw strings nowadays
* none of the existing users has ever stumbled across the "problem case" that triggered all this
Both ways, we're discussing a non-issue.
Strange. I've been doing exactly that for years. Maybe it's just because I stick to common os module APIs. So far, I've never run into any problem with it. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 12 2007)

On Sat, May 12, 2007 at 01:30:52AM +0200, M.-A. Lemburg wrote:
Sure, it's a non-issue for Python 2.x. However, when Python 3 comes along, and all strings are Unicode, there will likely be a lot more users stumbling into the problem case. -- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868

On 2007-05-12 02:42, Andrew McNabb wrote:
In the first case, changing the codec won't affect much code when ported to Py3k. In the second case, a change to the codec is not necessary. Please also consider the following:
* without the Unicode escapes, the only way to put non-ASCII code points into a raw Unicode string is via a source code encoding of say UTF-8 or UTF-16, pretty much defeating the original requirement of writing ASCII code only
* non-ASCII code points in text are not uncommon, they occur in most European scripts, all Asian scripts, many scientific texts and also in texts meant for the web (just have a look at the HTML entities, or think of Word exports using quotes)
* adding Unicode escapes to the re module will break code already using "...\u..." in the regular expressions for other purposes; writing conversion tools that detect this usage is going to be hard
* OTOH, writing conversion tools that simply work on string literals in general is easy
Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 13 2007)

That's no problem, though - just don't put the Unicode character into a raw string. Use plain strings if you have a need to include Unicode characters, and are not willing to leave ASCII. For Python 3, the default source encoding is UTF-8, so it is much easier to use non-ASCII characters in the source code. The original requirement may not be as strong anymore as it used to be.
And you are seriously telling me that people who commonly use non-ASCII code points in their source code are willing to refer to them by Unicode ordinal number (which, of course, they all know by heart, from 1 to 65536)?
It's unlikely to occur in code today - \u just means the same as u (so \u1234 matches u1234); if you want a backslash followed by u in your regular expression, you should write \\u. It would be possible to future-warn about \u in 2.6, catching these cases. Authors then would either have to remove the backslash, or duplicate it, depending on what they want to express. Regards, Martin

On 2007-05-13 18:04, Martin v. Löwis wrote:
You can do that today: Just put the "# coding: utf-8" marker at the top of the file. However, in some cases, your editor may not be capable of displaying or letting you enter the Unicode text you have in mind. In other cases, there may be a corporate coding standard in place that prohibits using non-ASCII text in source code, or fixes the encoding to e.g. Latin-1. In all those cases, it's necessary to be able to enter the Unicode code points which cannot be used in the source code using other means, and the easiest way to do this is by using Unicode escapes.
No, I'm not. I'm saying that non-ASCII code points are in common use and (together with the above bullet) that there are situations where you can't put the relevant code point directly into your source code. Using Unicode escapes for these will always be a kludge, but it's still better than not being able to enter the code points at all.
Good idea. The re module would then have to implement the same escaping scheme as the raw-unicode-escape codec (only an odd number of backslashes causes the escaping code to trigger). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 13 2007)
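The odd-number-of-backslashes rule can be observed through the codec itself. This is a sketch under the assumption of a Python 3 interpreter, where the codec decodes from bytes; the variable names are illustrative only.

```python
# In Python 3 the codec operates on bytes. The b"..." literals below spell out
# each backslash explicitly, so b"\\u0062" is the six bytes \ u 0 0 6 2.

# One backslash before "u" (odd): the escape is decoded to "b";
# the backslash before "n" is left in the string untouched.
one_backslash = b"\\u0062\\n".decode("raw_unicode_escape")

# Two backslashes before "u" (even): nothing is decoded, both backslashes remain.
two_backslashes = b"\\\\u0062".decode("raw_unicode_escape")

print(repr(one_backslash), repr(two_backslashes))
```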

M.-A. Lemburg wrote:
* non-ASCII code points in text are not uncommon, they occur in most European scripts, all Asian scripts,
In an Asian script, almost every character is likely to be non-ascii, which is going to be pretty hard to read as a string of unicode escapes. Maybe what we want is a new kind of string literal in which *everything* is a unicode escape. A sufficiently smart editor could then display it using the appropriate characters, yet it could still be dealt with as ascii-only in a pinch. -- Greg

On 5/10/07, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Windows path names are one of the two primary applications of raw strings (the other being regexes).
I disagree with this use case; the r"..." notation was not invented for this purpose. I won't compromise the escaping of quotes to accommodate it. Nevertheless I think that \u and \U should lose their special-ness in 3.0. I'd like to hear from anyone who has access to *real code* that uses \u or \U in a raw unicode string. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
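For comparison, a short sketch of how raw literals behave in Python 3 as it was eventually released, where \u is no longer special inside a raw string. The variable names are illustrative.

```python
# Python 3: a raw literal keeps the backslash, the "u" and the digits verbatim.
raw = r"\u0062"

# A normal (non-raw) literal still processes the escape into a single character.
cooked = "\u0062"

print(len(raw), len(cooked))
```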

Guido van Rossum <guido <at> python.org> writes:
I'd like to hear from anyone who has access to *real code* that uses \u or \U in a raw unicode string.
Docutils uses it in the docutils.parsers.rst.states module, Body class: patterns = { 'bullet': ur'[-+*\u2022\u2023\u2043]( +|$)', ... attribution_pattern = re.compile(ur'(---?(?!-)|\u2014) *(?=[^ \n])') -- David Goodger <http://python.net/~goodger>

On 5/11/07, David Goodger <goodger@python.org> wrote:
But wouldn't it be just as handy to teach the re module about \u and \U, just as it already knows about \x (and \123 octals)? -- --Guido van Rossum (home page: http://www.python.org/~guido/)
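As a sketch of what that would look like: the `re` module in Python 3.3 and later does interpret \uXXXX itself, so the two Docutils patterns quoted above can be written as plain raw strings. The sample inputs and variable names here are made up for illustration.

```python
import re

# Plain raw strings: the \uXXXX sequences reach the re module untouched,
# and the regex parser (Python 3.3+) turns them into the code points.
bullet = re.compile(r'[-+*\u2022\u2023\u2043]( +|$)')
attribution = re.compile(r'(---?(?!-)|\u2014) *(?=[^ \n])')

# U+2022 BULLET followed by a space starts a bullet item.
bullet_hit = bullet.match("\u2022 first item")

# U+2014 EM DASH followed by text starts an attribution.
attribution_hit = attribution.match("\u2014 Some Author")

# An ordinary letter is not a bullet marker.
bullet_miss = bullet.match("x first item")
```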

On 5/11/07, Guido van Rossum <guido@python.org> wrote:
But wouldn't it be just as handy to teach the re module about \u and \U, just as it already knows about \x (and \123 octals)?
Could be. I'm just providing examples, as requested. I leave the heavy thinking to others ;-) -- David Goodger <http://python.net/~goodger>

Greg Ewing schrieb:
No, that would not be analogous. The string type in Python is not an ASCII string type, but a byte string type. It does not necessarily only hold ASCII characters, but can (and, in hundreds of applications, does) hold arbitrary bytes. There is (in the non-raw form) support of filling arbitrary bytes into a byte string literal. So no, this is not analogous. Regards, Martin
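In Python 3 terms, where the byte string type is `bytes`, the difference between the raw and non-raw forms can be sketched as follows. This is an illustration only; the message above is about the Python 2 `str` type.

```python
# Non-raw bytes literal: each \xNN escape is processed into one arbitrary byte.
cooked = b"\xff\x00"

# Raw bytes literal: the backslashes stay, giving eight ASCII bytes.
raw = rb"\xff\x00"

print(len(cooked), len(raw))
```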
participants (12)
- "Martin v. Löwis"
- Andrew McNabb
- David Goodger
- Georg Brandl
- Greg Ewing
- Guido van Rossum
- Hrvoje Nikšić
- M.-A. Lemburg
- Michael Foord
- Paul Moore
- Ron Adam
- Thomas Heller