[Python-Dev] Raw string syntax inconsistency

Mon Jun 18 03:07:29 CEST 2012

On Sun, Jun 17, 2012 at 4:55 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On Mon, Jun 18, 2012 at 6:41 AM, Guido van Rossum <guido at python.org> wrote:
> > Would it make sense to detect and reject these in 3.3 if the 2.7 syntax
> > is used?
>
> Possibly - I'm trying not to actually *change* any of the internals of
> the string literal processing, though. (If I recall the way we
> implemented the change correctly, by the time we get to processing the
> string contents, we've forgotten which specific prefix was used.)
>
> However, this question did remind me of another detail I wanted to
> check after realising this discrepancy existed: it turns out this
> semantic inconsistency already arises if you use "from __future__
> import unicode_literals" to get supposedly "Python 3 style" string
> literals in 2.x:
>
> Python 2.7.3 (default, May 29 2012, 14:54:22)
> >>> from __future__ import unicode_literals
> >>> print(r"\u03b3")
> γ
> >>> print("\u03b3")
> γ
>
> Python 3.2.1 (default, Jul 11 2011, 18:54:42)
> >>> print(r"\u03b3")
> \u03b3
> >>> print("\u03b3")
> γ
>
> So, perhaps the answer is to leave this as is, and try to make 2to3
> smart enough to detect such escapes and replace them with their
> properly encoded (according to the source code encoding) Unicode
> equivalent?

But the whole point of the reintroduction of u"..." is to support code that
isn't run through 2to3. Frankly, I don't care how it's done, but I'd say
it's important not to silently have different behavior for the same
notation in the two versions. If that means we have to add an extra step to
the compiler to reject r"\u03b3", so be it.
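
To make the discrepancy concrete, here is a minimal sketch run under Python 3.
The use of the raw_unicode_escape codec to approximate the 2.x reading is an
assumption of this illustration, not something proposed in the thread:

```python
# Python 3: a raw literal keeps "\u03b3" as six separate characters.
raw = r"\u03b3\n"
cooked = "\u03b3"

# Approximation (an assumption of this sketch) of how 2.x reads the same
# raw Unicode literal: \u is expanded, every other backslash stays literal.
py2_style = raw.encode("ascii").decode("raw_unicode_escape")

print(len(raw), len(cooked), len(py2_style))  # 8 1 3
```

The same source text therefore yields an eight-character string in 3.x where
2.x would have produced three characters, with no warning either way.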

> After all, that's already the way to include such
> characters in a forward compatible way when using the future import:
>
> Python 2.7.3 (default, May 29 2012, 14:54:22)
> >>> from __future__ import unicode_literals
> >>> print("γ")
> γ
> >>> print(r"γ\n")
> γ\n
>
> Python 3.2.1 (default, Jul 11 2011, 18:54:42)
> >>> print("γ")
> γ
> >>> print(r"γ\n")
> γ\n
>

Hm. I still encounter enough environments that don't know how to display
such characters that I would prefer to have a rock solid \u escape
mechanism. I can think of two ways to support "expanded" unicode characters
in raw strings a la Python 2: (a) let the re module interpret the escapes
(like it does for \r and \n); (b) the user can write r"someblah" "\u03b3"
r"moreblah".
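
A rough sketch of both options, assuming Python 3.3 or later (where the re
module does accept \u escapes in str patterns). The pattern and variable
names are made up for illustration:

```python
import re

# (a) Keep the pattern raw and let the re module expand \u03b3 itself.
pattern_a = re.compile(r"\d+\u03b3")

# (b) Adjacent literals are concatenated at compile time, so only the
#     middle, non-raw piece has its \u escape processed.
pattern_b = re.compile(r"\d+" "\u03b3" r"\s*$")
text_b = r"C:\new" "\u03b3" r"\table"

print(bool(pattern_a.match("42\u03b3")))
print(bool(pattern_b.match("42\u03b3  ")))
print(text_b == "C:\\new\u03b3\\table")
```

Option (b) works for any string, not just regular expressions, and keeps the
source file pure ASCII.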

> So, rather than going ahead with reverting "ur" support as I first
> suggested (since it turns out that's not a *new* problem, but just a
> different way of spelling an *existing* problem), how about I do the
> following:
>
> 1. Add a note to PEP 414 and the Py3k porting guide regarding the
> discrepancy in escaping semantics for raw Unicode strings between 2.x
> and 3.x
> 2. Reject the tracker issue for reverting the ur support (the semantic
> problem already exists, and any solution we come up with for
> __future__.unicode_literals should handle the ur prefix as well)
> 3. Create a new feature request for 2to3 to see if it can
> automatically handle the problem of translating "\u" and "\U" escapes
> into properly encoded Unicode characters
>
> The scope of the problem is really quite small: you have to be using a
> raw Unicode string in 2.x (either via the string prefix, or the future
> import) *and* using a "\u" or "\U" escape within that string.
>

Yeah, but if you do this and it breaks you likely won't notice until way
late in your QA cycle, when it may be tough to track down the origin. I'd
rather make ru"\u03b3" a syntax error if we can't give it the same meaning
as in Python 2.

(I'm not sure what to do about the same bug with __future__. Maybe we
should declare that a bug and "fix" it in a future 2.7 bugfix release?)
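
Until something like that exists in the compiler, affected literals could be
found with an external check. The following is only a heuristic sketch: the
function name and regular expressions are hypothetical, it handles
single-line ur"..." literals only, and it does not cover the __future__
import case:

```python
import re

# Hypothetical heuristic: a 2.x-style ur/UR prefix followed by a
# single-line quoted body.
RAW_UNICODE_LITERAL = re.compile(
    r"\b[uU][rR]"                 # raw Unicode prefix
    r"(\"(?:[^\"\\\n]|\\.)*\""    # double-quoted body
    r"|'(?:[^'\\\n]|\\.)*')"      # single-quoted body
)

# \u or \U preceded by an odd number of backslashes, i.e. an escape that
# 2.x would expand even inside a raw Unicode literal.
UNICODE_ESCAPE = re.compile(r"(?<!\\)(?:\\\\)*\\[uU]")


def find_risky_literals(source):
    """Return (line number, literal) pairs for raw Unicode literals with \\u escapes."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in RAW_UNICODE_LITERAL.finditer(line):
            if UNICODE_ESCAPE.search(match.group(1)):
                hits.append((lineno, match.group(0)))
    return hits


sample = 'a = ur"\\u03b3"\nb = ur"\\\\u03b3"\nc = u"\\u03b3"\nd = ur"plain"\n'
print(find_risky_literals(sample))
```

Only the first line of the sample is reported: the second has an escaped
backslash before the "u", the third is not raw, and the fourth contains no
escape at all.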

> Regards,
> Nick.
>
> >
> > --Guido van Rossum (sent from Android phone)
> >
> > On Jun 17, 2012 1:13 PM, "Nick Coghlan" <ncoghlan at gmail.com> wrote:
> >>
> >> On Mon, Jun 18, 2012 at 3:54 AM, Terry Reedy <tjreedy at udel.edu> wrote:
> >> > The premise of the discussion of adding 'u', and of Guido's acceptance,
> >> > was that "it's about as harmless as they come". I do not remember any
> >> > discussion of 'ur' and what it really means in 2.x, and that supporting
> >> > it meant adding back 2.x's interaction effect. Indeed, Nick's version
> >> > goes on to say "This PEP was originally written by Armin Ronacher, and
> >> > Guido's approval was given based on that version." Armin's original
> >> > version (and subsequent edit) only proposed adding 'u' (and 'U') and
> >> > made no mention of 'ur'. Nick's seemingly innocuous addition of also
> >> > adding 'ur' came after Guido's approval, and as discovered, is not so
> >> > innocuous.
> >>
> >> Right, that matches my recollection as well - we (or at least I) thought
> >> mapping "ur" to the Python 3 "r" prefix was sufficient, but it turns
> >> out doing so means there are some 2.x string literals that will
> >> silently behave differently in 3.x.
> >>
> >> Martin's right that that part of the PEP should definitely be amended
> >> (along with the relevant section in What's New).
> >>
> >> > I do not think he needs to discuss adding and deleting support, but
> >> > merely state that 'ur' support is not added because 'ur' has a special
> >> > meaning that would require changing literal handling. The sentence
> >> > about supporting 'ur' could be negated and moved after the sentence
> >> > about not changing Unicode handling. A possibility:
> >> >
> >> > "Combination of the unicode prefix with the raw string prefix will not
> >> > be supported because in Python 2, the combination 'ur' has a special
> >> > meaning that would require changing the handling of unicode literals"
> >>
> >> In addition to changing the proposal section to only cover "u" and
> >> "U", I'll actually add a new subsection along the lines of the
> >> following:
> >>
> >> Exclusion of Raw Unicode Strings
> >> --------------------------------
> >>
> >> Python 2.x includes a concept of "raw Unicode" strings. These are
> >> partially raw string literals that still support the "\u" and "\U"
> >> escape codes for Unicode character entry, but otherwise treat "\" as a
> >> literal backslash character. As 3.x has no such concept of a partially
> >> raw string literal, explicit raw Unicode literals are still not
> >> supported. Such literals in Python 2 code will need to be converted to
> >> ordinary Unicode literals for forward compatibility with Python 3.
> >>
> >> Cheers,
> >> Nick.
> >>
> >> --
> >> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
>
>
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
>

-- 
--Guido van Rossum (python.org/~guido)