Unrecognized escape sequences in string literals

Steven D'Aprano steven at REMOVE.THIS.cybersource.com.au
Tue Aug 11 03:07:13 EDT 2009


On Mon, 10 Aug 2009 15:17:24 -0700, Douglas Alan wrote:

> From: Steven D'Aprano <ste... at REMOVE.THIS.cybersource.com.au> wrote:
> 
>> On Mon, 10 Aug 2009 00:32:30 -0700, Douglas Alan wrote:
> 
>> > In C++, if I know that the code I'm looking at compiles, then I never
>> > need worry that I've misinterpreted what a string literal means.
> 
>> If you don't know what your string literals are, you don't know what
>> your program does. You can't expect the compiler to save you from
>> semantic errors. Adding escape codes into the string literal doesn't
>> change this basic truth.
> 
> I grow weary of these semantic debates. The bottom line is that C++'s
> strategy here catches bugs early on that Python's approach doesn't. It
> does so at no additional cost.
>
> From a purely practical point of view, why would any language not want
> to adopt a zero-cost approach to catching bugs, even if they are
> relatively rare, as early as possible?

Because the cost isn't zero. Needing to write \\ in a string literal when 
you want \ is a cost, and having to read \\ in source code and mentally 
translate that to \ is also a cost. By all means argue that it's a cost 
that is worth paying, but please stop pretending that it's not a cost.

Having to remember that \n is a "special" escape and \y isn't is also a 
cost, but that's a cost you pay in C++ too, if you want your code to 
compile.


By the way, you've stated repeatedly that \y will compile with a warning 
in g++. So what precisely do you get if you ignore the warning? What do 
other C++ compilers do? Apart from the lack of warning, what actually is 
the difference between Python's behaviour and C++'s behaviour?



> (Other than the reason that adopting it *now* is sadly too late.)
> 
> Furthermore, Python's strategy here is SPECIFICALLY DESIGNED, according
> to the reference manual, to catch bugs. I.e., from the original posting

> on this issue:
> 
>      Unlike Standard C, all unrecognized escape sequences are left in
>      the string unchanged, i.e., the backslash is left in the string.
>      (This behavior is useful when debugging: if an escape sequence is
>      mistyped, the resulting output is more easily recognized as
>      broken.)

You need to work on your reading comprehension. It doesn't say anything 
about the motivation for this behaviour, let alone that it was 
"SPECIFICALLY DESIGNED" to catch bugs. It says it is useful for 
debugging. My shoe is useful for squashing poisonous spiders, but it 
wasn't designed as a poisonous-spider squashing device.
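
For what it's worth, the behaviour the manual describes is easy to see at 
the interactive prompt (a rough sketch, again Python 2.x; \t is a 
recognized escape, \y is not):

>>> print "name\tvalue"    # recognized escape: you get a tab
name    value
>>> print "name\yvalue"    # mistyped escape: the backslash survives, so the typo is visible
name\yvalue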



>> The compiler can't save you from typing 1234 instead of 11234, or 31.45
>> instead of 3.145, or "My darling Ho" instead of "My darling Jo", so why
>> do you expect it to save you from typing "abc\d" instead of "abc\\d"?
> 
> Because in the former cases it can't catch the bug, and in the
> latter case, it can.

I'm not convinced this is a bug that needs catching, but if you think it 
is, then that's a reasonable argument.



>> Perhaps it can catch *some* errors of that type, but only at the cost
>> of extra effort required to defeat the compiler (forcing the programmer
>> to type \\d to prevent the compiler complaining about \d). I don't
>> think the benefit is worth the cost. You and your friend do. Who is to
>> say you're right?
> 
> Well, Bjarne Stroustrup, for one.

Then let him design his own language *wink*


> All of these are value judgments, of course, but I truly doubt that
> anyone would have been bothered if Python from day one had behaved the
> way that C++ does. 

If I'm reading this page correctly, Python does behave as C++ does. Or at 
least as Larch/C++ does:

http://www.cs.ucf.edu/~leavens/larchc++manual/lcpp_47.html




>> In C++, if you see an escape you don't recognize, do you care?
> 
> Yes, of course I do. If I need to know what the program does.

Precisely the same as in Python.


>> Do you go running for the manual? If the answer is No, then why do it
>> in Python?
> 
> The answer is that I do in both cases.

You deleted my next question without answering it:

"And if the answer is Yes, then how is Python worse than C++?"

Seems to me that the answer is "It's not worse than C++, it's the same" 
-- in both cases, you have to memorize the "special" escape sequences, 
and in both cases, if you see an escape you don't recognize, you need to 
look it up.



>> No. \z *is* a legal escape sequence, it just happens to map to \z.
> 
>> If you stop thinking of \z as an illegal escape sequence that Python
>> refuses to raise an error for, the problem goes away. It's a legal
>> escape sequence that maps to backslash + z.
> 
> (1) I already used that argument on my friend, and he wasn't buying it.
> (Personally, I find the argument technically valid, but commonsensically
> invalid. It's a language-lawyer kind of argument, rather than one that
> appeals to any notion of real aesthetics.)

I disagree with your sense of aesthetics. I think that having to write 
\\y when I want \y just to satisfy a bondage-and-discipline compiler is 
ugly. That's not to say that B&D isn't useful on occasion, but in this 
case I believe the benefit is negligible, and so even a tiny cost is not 
worth the pain.

The sweet sweet pain... oh wait, sorry, wrong newsgroup...



> (2) That argument disagrees with the Python reference manual, which
> explicitly states that "unrecognized escape sequences are left in the
> string unchanged", and that the purpose for doing so is because it "is
> useful when debugging".

How does it disagree? \y in the source code mapping to \y in the string 
object *is* the sequence being left unchanged. And observing that this is 
useful for debugging hardly contradicts the fact that it happens.
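
If you want to see exactly what ends up in the string object, a two-line 
check will do (output shown as Python 2.x reprs):

>>> s = "\y"
>>> len(s)       # backslash plus y: two characters, left unchanged
2
>>> s[0], s[1]
('\\', 'y')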



>> > "\x" is not a legal escape sequence. Shouldn't it also get left as
>> > "\\x"?
>>
>> No, because it actually is an illegal escape sequence.
> 
> What makes it "illegal"? As far as I can tell, it's just another
> "unrecognized escape sequence". 

No, it's recognized, because \x is the prefix for a hexadecimal escape 
code. And it's illegal, because it's missing the actual hexadecimal 
digits.
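
You can see the difference at the prompt (Python 2.x; the exact error 
message and its formatting vary between versions, but the gist doesn't):

>>> "\x41"    # \x followed by hex digits is a recognized escape
'A'
>>> "\x"      # \x with no hex digits is rejected outright
ValueError: invalid \x escape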


> JavaScript treats it that way. Are you
> going to be the one to tell all the JavaScript programmers that their
> language can't tell a legal escape sequence from an illegal one?

Well, it is Javascript... 

All joking aside, syntax varies from one language to another. What counts 
as a legal escape sequence in Javascript and what counts as a legal 
escape sequence in Python are different. What makes you think I'm talking 
about Javascript?


>> > Well, I think he's more annoyed that if Python is going to be so
>> > helpful as to put in the missing "\" for you in "foo\zbar", then it
>> > should put in the missing "\" for you in "\". He considers this to be
>> > an inconsistency.
>>
>> (1) There is no missing \ in "foo\zbar".
>>
>> (2) The problem with "\" isn't a missing backslash, but a missing end-
>> quote.
> 
> Says who? All of this really depends on your point of view. The whole
> morass goes away completely if one adopts C++'s approach here.

But the morass only exists in the first place because you have adopted 
C++'s approach instead of Python's approach -- and (possibly) not even a 
standard part of the C++ approach, but a non-standard warning provided by 
one compiler out of many.


Even if you disagree about (1), it's easy enough to prove that (2) is 
correct:

>>> "\"
  File "<stdin>", line 1
    "\"
      ^
SyntaxError: EOL while scanning single-quoted string


This is the exact same error you get here:


>>> "a
  File "<stdin>", line 1
    "a
     ^
SyntaxError: EOL while scanning single-quoted string



>> Python isn't DWIMing here. The rules are simple and straightforward,
>> there's no mind-reading or guessing required.
> 
> It may not be a complex form of DWIMing, but it's still DWIMing a bit.
> Python is figuring that if I typed "\z", then either I must have really
> meant to type "\\z", 

Nope, not in the least. Python NEVER EVER EVER tries to guess what you 
mean.

If you type "xyz", it assumes you want "xyz".

If you type "xyz\n", it assumes you want "xyz\n".

If you type "xyz\\n", it assumes you want "xyz\\n".

If you type "xyz\y", it assumes you want "xyz\y".

If you type "xyz\\y", it assumes you want "xyz\\y".

This is *exactly* like C++, except that in Python the semantics of \y and 
\\y are identical. Python doesn't guess what you mean, it *imposes* a 
meaning on the escape sequence. You just don't like that meaning.
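
Concretely, a quick check at the prompt shows what I mean (nothing 
version-specific here):

>>> "xyz\y" == "xyz\\y"    # non-special escape: the two spellings are the same string
True
>>> "xyz\n" == "xyz\\n"    # special escape: they are not
False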



> or that I want to see the backslash when I'm
> debugging because I made a mistake, or that I'm just too lazy to type
> "\\z".

Oh jeez, if you're going to define DWIM so broadly, then *everything* is 
DWIM. "If I type '1+2', then the C++ compiler figures out that I must 
have wanted to add 1 and 2..."


> I don't know if my friend even knows the term DWIM, other than me
> paraphrasing him, but I certainly understand all about the term. It
> comes from InterLisp. When DWIM was enabled, your program would run
> until it hit an error, and for certain kinds of errors, it would wait a
> few seconds for the user to notice the error message, and if the user
> didn't tell the program to stop, it would try to figure out what the
> user most likely meant, and then continue running using the
> computer-generated "fix".

Right. And Python isn't doing anything even remotely similar to that.



> I.e., more or less like continuing on in the face of what the Python
> Reference manual refers to as an "unrecognized escape sequence".

The wording could be better, I accept. It would be better to talk about 
"special escapes" (e.g. \n) and "any non-special escape" (e.g. \y).




-- 
Steven


