[Python-ideas] Make non-meaningful backslashes illegal in string literals

Wes Turner wes.turner at gmail.com
Fri Aug 7 15:03:44 CEST 2015


On Fri, Aug 7, 2015 at 2:15 AM, Chris Angelico <rosuav at gmail.com> wrote:

> On Fri, Aug 7, 2015 at 3:12 PM, Steven D'Aprano <steve at pearwood.info>
> wrote:
> > On Thu, Aug 06, 2015 at 12:26:14PM -0400, random832 at fastmail.us wrote:
> >> On Wed, Aug 5, 2015, at 14:56, Eric V. Smith wrote:
> >> > Because strings containing \{ are currently valid
> >>
> >> Which raises the question of why.
> >
> > Because \C is currently valid, for all values of C. The idea is that if
> > you typo an escape, say \d for \f, you get an obvious backslash in your
> > string which is easy to spot.
> >
> > Personally, I think that's a mistake. It leads to errors like this:
> >
> > filename = 'C:\some\path\something.txt'
> >
> > silently doing the wrong thing. If we're going to change the way escapes
> > work, it's time to deprecate the misfeature that \C is a literal
> > backslash followed by C. Outside of raw strings, a backslash should
> > *only* be allowed in an escape sequence.
>
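
A rough sketch of the rule proposed above (a backslash is only allowed as part of a defined escape), written as a checker over the text between the quotes. The names find_unrecognized_escapes and RECOGNIZED are made up for illustration. The set is my reading of the escapes that Python 3 str literals define, and only the character right after the backslash is inspected.

```python
import re

# Characters that may follow a backslash in a Python 3 str literal
# (assumption: newline, \\ \' \" \a \b \f \n \r \t \v \x \N \u \U and octal digits).
RECOGNIZED = set("\n\\'\"abfnrtvxNuU01234567")


def find_unrecognized_escapes(literal_body):
    """Return the backslash pairs in literal_body that are not defined escapes.

    literal_body is the raw text between the quotes. Only the first character
    after each backslash is checked; \\x, \\u and \\U digits are not validated.
    """
    found = []
    for match in re.finditer(r"\\(.)", literal_body, flags=re.DOTALL):
        if match.group(1) not in RECOGNIZED:
            found.append(match.group(0))
    return found


# Raw strings are used here so the example text reaches the checker unchanged.
flagged = find_unrecognized_escapes(r"C:\some\path\something.txt")
not_flagged = find_unrecognized_escapes(r"C:\temp\new")
```

The second call returns nothing because \t and \n are real escapes, which is the case where the path silently changes.
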
> I agree; plus, it means there's yet another thing for people to
> complain about when they switch to Unicode strings:
>
> path = "c:\users", "C:\Users" # OK on Py2
> path = u"c:\users", u"C:\Users" # Fails
>

So this doesn't work?

    path = pathlib.Path(u"c:\users")
    # SEC: path concatenation is often in conjunction with user-supplied input
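
A small sketch for checking this on Python 3, assuming the literal is fed to compile() as source text so that the example file itself still parses. The variable names are illustrative only, and PureWindowsPath is used so the result does not depend on the platform running it.

```python
import pathlib

# The literal in question, held in a raw string so this file compiles.
source = r'path = "c:\users"'

error_message = None
try:
    compile(source, "<example>", "exec")
    compiled = True
except SyntaxError as exc:
    # On Python 3 the "\u" starts a Unicode escape that is never completed.
    compiled = False
    error_message = str(exc)

# A raw literal keeps the backslash and parses as a Windows path.
raw_path = pathlib.PureWindowsPath(r"c:\users")
```

So the error comes from the string literal, before pathlib ever sees the value.
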

- [ ] docs for these
- [ ] to/from r'rawstring' (DOC: encode/decode)


>
> Or equivalently, moving to Py3 and having those strings quietly become
> Unicode strings, and now having meaning on the \U and \u escapes.
>
> That said, though: It's now too late to change Python 2, which means
> that this is going to be yet another hurdle when people move
> (potentially large) Windows codebases to Python 3. IMO it's a good
> thing to trip people up immediately, rather than silently doing the
> wrong thing - but it is going to be another thing that people moan
> about when Python 3 starts complaining. First they have to add
> parentheses to print, then it's all those pointless (in their eyes)
> encode/decode calls, and now they have to go through and double all
> their backslashes as well! But the alternative is that some future
> version of Python adds a new escape code, and all their code starts
> silently doing weird stuff - or they change the path name and it goes
> haywire (changing from "c:\users\demo" to "c:\users\all users" will be
> a fun one to diagnose) - so IMO it's better to know about it early.
>
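
A minimal illustration of the "all users" case on Python 3. The first backslash is doubled here only so that "\u" is not read as a Unicode escape. The second is left single, as in the example above.

```python
# "\a" is a defined escape (BEL, 0x07), so the backslash and the "a" disappear.
broken = "c:\\users\all users"

# Two ways to keep every backslash literal.
raw = r"c:\users\all users"
doubled = "c:\\users\\all users"

has_bell = "\x07" in broken
```

The broken string is one character shorter than intended and contains a control character in place of the path separator.
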
> > If we're going to make major changes to the way escapes work, I'd rather
> > add new escapes, not take them away:
> >
> >
> > \e escape \x1B, as supported by gcc and clang;
>
> Please, yes! Also supported by a number of other languages and
> commands (Pike, GNU echo, and some others that I don't recall (but not
> bind9, which has its own peculiarities)).
>
> > the escaping rules from Haskell:
> >
> >
> http://book.realworldhaskell.org/read/characters-strings-and-escaping-rules.html
> >
> > \P platform-specific newline (e.g. \r\n on Windows, \n on POSIX)
>
> Hmm. Not sure how useful this would be. Personally, I consider this to
> be a platform-specific encoding, on par with expecting b"\xc2\xa1" to
> display "¡", and as such, it should be kept to boundaries. Work with
> "\n" internally, and have input routines convert to that, and output
> routines optionally add "\r" before them all.
>
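
A sketch of the "keep it to the boundaries" approach using the newline parameter of io.TextIOWrapper. The in-memory BytesIO objects stand in for real files and are an assumption of this example.

```python
import io

# Output boundary: "\n" written by the program becomes "\r\n" in the bytes.
raw = io.BytesIO()
writer = io.TextIOWrapper(raw, encoding="utf-8", newline="\r\n")
writer.write("first\nsecond\n")
writer.flush()
encoded = raw.getvalue()

# Input boundary: newline=None enables universal newlines, so "\r\n"
# is translated back to "\n" when reading.
reader = io.TextIOWrapper(io.BytesIO(encoded), encoding="utf-8", newline=None)
decoded = reader.read()
```

The program only ever handles "\n", and the platform convention is chosen where the stream is opened.
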
> > \U+xxxx Unicode code point U+xxxx (with four to six hex digits)
> >
> > It's much nicer to be able to write Unicode code points that (apart from
> > the backslash) look like the standard Unicode notation U+0000 to
> > U+10FFFF, rather than needing to pad to a full eight digits as the
> > \U00xxxxxx syntax requires.
>
> The problem is the ambiguity. How do you specify that "\U+101010" be a
> two-character string? "\U000101010" forces it by having exactly eight
> digits, but as soon as you allow variable numbers of digits, you run
> into problems. I suppose you could always pad to six for that:
> "\U+0101010" could know that it doesn't need a seventh digit. (Though
> what would ever happen if the Unicode consortium decides to drop
> support for UTF-16 and push for a true 32-bit character set, I don't
> know.) It is tempting, though - it both removes the need for two
> pointless zeroes, and broadly unifies the syntax for Unicode escapes,
> instead of having a massive boundary from "\u1234" to "\U00012345".
>
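
A sketch of the ambiguity, assuming a greedy reading of the proposed \U+ syntax with four to six hex digits. expand_uplus and UPLUS are hypothetical names. Nothing like this exists in Python, and chr() will raise ValueError for values above 0x10FFFF.

```python
import re

# Hypothetical syntax from this thread: \U+ followed by four to six hex digits.
UPLUS = re.compile(r"\\U\+([0-9A-Fa-f]{4,6})")


def expand_uplus(text):
    """Replace each \\U+xxxx sequence in text with the code point it names."""
    return UPLUS.sub(lambda m: chr(int(m.group(1), 16)), text)


# Greedy matching consumes all six digits: one character, U+101010.
one_char = expand_uplus(r"\U+101010")

# Padding to six digits leaves the trailing "0" as a separate character.
two_chars = expand_uplus(r"\U+0101010")

# The existing eight-digit form of the same two-character string.
existing = "\U000101010"
```
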
> ChrisA
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>