[Python-Dev] Raw string syntax inconsistency

Mon Jun 18 07:59:39 CEST 2012

On 6/17/2012 9:07 PM, Guido van Rossum wrote:
> On Sun, Jun 17, 2012 at 4:55 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

>     So, perhaps the answer is to leave this as is, and try to make 2to3
>     smart enough to detect such escapes and replace them with their
>     properly encoded (according to the source code encoding) Unicode
>     equivalent?
>
>
> But the whole point of the reintroduction of u"..." is to support code
> that isn't run through 2to3.

People writing 2&3 code sometimes use 2to3 once (or a few times) on 
their 2.6/7 version during development to find things they must pay 
attention to. So Nick's idea could be helpful to people who do not want 
to use 2to3 routinely either in development or deployment.

> Frankly, I don't care how it's done, but
> I'd say it's important not to silently have different behavior for the
> same notation in the two versions.

The fundamental problem was giving the 'u' prefix two different meanings 
in 2.x: 'change the storage type from bytes to unicode', and 'change the 
contents by partially cooking the literal even when raw processing is 
requested'*. The only way to silently have the same behavior is to 
re-introduce the second meaning of partial cooking. (But I would rather 
make it unnecessary.) But that would freeze the 'u' prefix, or at least 
'ur' ('un-raw') forever. It would be better to introduce a new, separate 
'p' prefix, to mean partially raw, partially cooked. (But I am opposed to

*I think this non-orthogonal interaction effect was a design mistake and 
that it would have been better to have the re module do all the cooking needed by 
also interpreting \u and \U sequences. I also think we should add this 
now for 3.3 if possible, to make partial cooking at the parsing stage 
unnecessary. Putting the processing in re makes it work for all strings, 
not just those given as literals.
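To illustrate that last point, here is a minimal sketch. It assumes Python 3.3 or later, where re does accept \uXXXX and \UXXXXXXXX escapes in str patterns. The names raw_pattern, built_pattern and text are only illustrative. The second pattern is assembled at run time, so no literal prefix could have cooked it, yet re still treats it the same way.

```python
import re

# A raw literal: in Python 3 the characters \u03b3 reach re unchanged.
raw_pattern = r"\u03b3+"

# The same characters assembled at run time rather than written as a literal.
built_pattern = "\\" + "u" + format(0x03B3, "04x") + "+"

# Three GREEK SMALL LETTER GAMMA characters, written with a cooked escape.
text = "\u03b3\u03b3\u03b3"

# Both patterns are the identical string, and re interprets the \u escape itself.
assert raw_pattern == built_pattern
raw_match = re.match(raw_pattern, text)
built_match = re.match(built_pattern, text)

print(raw_match.group(0) == text, built_match.group(0) == text)
```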

> If that means we have to add an extra
> step to the compiler to reject r"\u03b3", so be it.

I do not get this. Surely you cannot mean to suddenly start rejecting, 
in 3.3, a large set of perfectly legal and sensible 6 and 10 character 
sequences when embedded in literals?
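For concreteness, a small sketch of what those sequences are in Python 3 (the variable names are mine): in a raw literal the backslash, the letter and the hex digits are all kept as ordinary characters, while the non-raw form is cooked down to a single character.

```python
# In Python 3 a raw literal keeps the backslash sequence verbatim.
raw_short = r"\u03b3"       # backslash, 'u', four hex digits
raw_long = r"\U000003b3"    # backslash, 'U', eight hex digits

# Without the r prefix the compiler cooks the escape into one character.
cooked = "\u03b3"

lengths = (len(raw_short), len(raw_long), len(cooked))
print(lengths)  # (6, 10, 1)
```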

> Hm. I still encounter enough environments that don't know how to display
> such characters that I would prefer to have a rock solid \u escape
> mechanism. I can think of two ways to support "expanded" unicode
> characters in raw strings a la Python 2;

> (a) let the re module interpret the escapes (like it does for \r and \n);

As said above, I favor this. The 2.x partial cooking (with 'ur' prefix) 
was primarily a substitute for this.

> (b) the user can write r"someblah" "\u03b3" r"moreblah".

This is somewhat orthogonal to (a). Users can do this whenever they want 
partial processing of backslashes without doubling those they want left 
as is. A generic example is r'someraw' 'somecooked' r'moreraw' 
'morecooked'.
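A minimal sketch of option (b), using a made-up pattern and sample text: adjacent literals are joined at compile time, so the raw pieces keep their backslashes while the piece in the middle is cooked.

```python
import re

# Raw pieces keep their backslashes; the middle piece is cooked to one character.
pattern = r"\d+\s*" "\u03b3" r"\s*\w+"

# The same string spelled out with doubled backslashes, for comparison.
expected = "\\d+\\s*" + "\u03b3" + "\\s*\\w+"
assert pattern == expected

# Hypothetical sample text: digits, a gamma, then a word.
sample = "42 \u03b3 radiation"
match = re.match(pattern, sample)
print(match is not None and match.group(0) == sample)
```

This works in 2.x and 3.x alike as long as the cooked piece carries whatever prefix that version needs, and it does not depend on re understanding \u.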

-- 
Terry Jan Reedy