[Python-ideas] Make non-meaningful backslashes illegal in string literals

Steven D'Aprano steve at pearwood.info
Fri Aug 7 11:41:53 CEST 2015


On Fri, Aug 07, 2015 at 05:15:34PM +1000, Chris Angelico wrote about 
deprecating \C giving a literal backslash C:

[...]
> That said, though: It's now too late to change Python 2, which means
> that this is going to be yet another hurdle when people move
> (potentially large) Windows codebases to Python 3. 

I don't think that changing string literals is an onerous task. The 
hardest part is deciding what fix you're going to apply:

- replace \ in Windows paths with /
- escape your backslashes
- use raw strings
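
For concreteness, here is a minimal sketch of the three fixes side by side. The path is a made-up example, and ntpath is used only so the comparison applies Windows path rules on any platform:

```python
import ntpath

# Hypothetical path, for illustration only.
forward = "c:/users/demo"      # fix 1: forward slashes
escaped = "c:\\users\\demo"    # fix 2: escaped backslashes
raw = r"c:\users\demo"         # fix 3: raw string

# Fixes 2 and 3 produce the identical string.
assert escaped == raw

# ntpath.normpath converts forward slashes to backslashes, so all
# three spellings name the same Windows path.
assert ntpath.normpath(forward) == raw
```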


> or they change the path name and it goes
> haywire (changing from "c:\users\demo" to "c:\users\all users" will be
> a fun one to diagnose) - so IMO it's better to know about it early.

"c:\users" is already broken in Python 3:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in 
position 2-4: truncated \uXXXX escape
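
A small sketch for reproducing that error without breaking the file that contains it. The offending literal is held in a raw string and handed to compile(); the name error_message is mine, purely for illustration:

```python
# Keep the broken literal inside a raw string so this example itself
# compiles; compile() then parses it as Python 3 source.
source = r'"c:\users"'

try:
    compile(source, "<example>", "eval")
except SyntaxError as exc:
    # \u must be followed by four hex digits, and "sers" is not that.
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```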


[...]
> > \P platform-specific newline (e.g. \r\n on Windows, \n on POSIX)
> 
> Hmm. Not sure how useful this would be. Personally, I consider this to
> be a platform-specific encoding, 

Of course it's platform-specific. That's what I said :-)


> on par with expecting b"\xc2\xa1" to
> display "¡", and as such, it should be kept to boundaries.

This has nothing to do with bytes. \r and \n in Unicode strings give 
U+000D and U+000A respectively; \P would likewise be defined in terms of 
code points, not bytes.
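
To illustrate the distinction with the escapes that exist today (this is only a sketch; \P itself is a proposal and is not valid in any Python version):

```python
# In a str literal, \r and \n denote code points, not bytes.
assert ord("\r") == 0x000D
assert ord("\n") == 0x000A

# Bytes only enter the picture when the string is encoded, and the
# result depends on the chosen encoding.
crlf = "\r\n"
assert crlf.encode("utf-8") == b"\r\n"
assert crlf.encode("utf-16-le") == b"\r\x00\n\x00"
```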


> Work with
> "\n" internally, and have input routines convert to that, and output
> routines optionally add "\r" before them all.

That's fine as far as it goes, but sometimes you don't want automatic 
newline conversion. See the "newline" parameter to Python 3's open 
built-in. If I'm writing a file which the user has specified should 
use Windows line endings, I can't rely on Python automatically 
converting to \r\n, because I might not actually be running on Windows. 
So I may disable newline translation on output and specify the end of 
line myself, using the user's choice. One such choice is "whatever 
platform you're on, use the platform default".
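
A rough sketch of that approach using what Python 3 already offers. The helper write_lines and the constant names are hypothetical; newline="" and os.linesep are the real pieces doing the work:

```python
import os
import tempfile

WINDOWS_EOL = "\r\n"
PLATFORM_EOL = os.linesep  # "whatever platform you're on"


def write_lines(path, lines, eol):
    """Write lines with an explicit terminator (hypothetical helper)."""
    # newline="" turns off newline translation on output, so the
    # terminator we pass is written exactly as given.
    with open(path, "w", newline="", encoding="utf-8") as f:
        for line in lines:
            f.write(line + eol)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.txt")
    write_lines(path, ["first", "second"], WINDOWS_EOL)
    # Read back in binary mode to see the bytes actually written.
    with open(path, "rb") as f:
        data = f.read()

assert data == b"first\r\nsecond\r\n"
```

Passing PLATFORM_EOL instead of WINDOWS_EOL gives the platform default, which is the job \P would do inside a literal.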


> > \U+xxxx Unicode code point U+xxxx (with four to six hex digits)
> >
> > It's much nicer to be able to write Unicode code points that (apart from
> > the backslash) look like the standard Unicode notation U+0000 to
> > U+10FFFF, rather than needing to pad to a full eight digits as the
> > \U00xxxxxx syntax requires.
> 
> The problem is the ambiguity. How do you specify that "\U+101010" be a
> two-character string?

Hence Haskell's \& which acts as a separator:

"\U+10101\&0"

Or use implicit concatenation:

"\U+10101" "0"

Also, the C++ style "\U000101010" will continue to work. However, it's 
hard to read: you need to count the digits to see that there are *nine* 
digits and so only the first eight belong to the \U escape.
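
A quick check of the existing eight-digit form, as a sketch. The proposed \U+ and \& escapes are not valid Python, so only today's syntax appears here:

```python
# \U consumes exactly eight hex digits (00010101); the ninth digit
# is an ordinary "0" character.
s = "\U000101010"
assert len(s) == 2
assert s == chr(0x10101) + "0"

# Implicit concatenation makes the boundary visible to the reader.
t = "\U00010101" "0"
assert s == t
```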

[...]
> (Though
> what would ever happen if the Unicode consortium decides to drop
> support for UTF-16 and push for a true 32-bit character set, I don't
> know.)

If that ever happens, it will be one of the signs of the Apocalypse. To 
quote Ghostbusters:

    Fire and brimstone coming down from the skies! Rivers and seas 
    boiling! Forty years of darkness! Earthquakes, volcanoes... The dead 
    rising from the grave! Human sacrifice, dogs and cats living 
    together... and the Unicode Consortium breaking their stability 
    guarantee.



-- 
Steve

