Unrecognized escape sequences in string literals

Douglas Alan darkwater42 at gmail.com
Wed Aug 12 23:21:34 CEST 2009

On Aug 12, 5:32 am, Steven D'Aprano
<ste... at REMOVE.THIS.cybersource.com.au> wrote:

> That problem basically boils down to a deep-seated
> philosophical disagreement over which philosophy a
> language should follow in regard to backslash escapes:
>
> "Anything not explicitly permitted is forbidden"
>
> versus
>
> "Anything not explicitly forbidden is permitted"

No, it doesn't. It boils down to whether a language should:

(1) Try it's best to detect errors as early as possible,
especially when the cost of doing so is low.

(2) Make code as readable as possible, in part by making
code as self-evident as possible by mere inspection and by
reducing the amount of stuff that you have to memorize. Perl
fails miserably in this regard, for instance.

(3) To quote Einstein, make everything as simple as
possible, and no simpler.

(4) Take innately ambiguous things and not force them to be
unambiguous by mere fiat.

Allowing a programmer to program using a completely
arbitrary resolution of "unrecognized escape sequences"
violates all of the above principles.

The fact that the meanings of unrecognized escape sequences
are ambiguous is proved by the fact that every language
seems to treat them somewhat differently, demonstrating that
there is no natural intuitive meaning for them.

Furthermore, allowing programmers to use "unrecognized escape
sequences" without raising an error violates:

(1) Explicit is better than implicit:

Python provides a way to explicitly specify that you want a
backslash. Every programmer should be encouraged to use
Python's explicit mechanism here.

(2) Simple is better than complex:

Python currently has two classes of ambiguously
interpretable escape sequences: "unrecognized ones", and
"illegal" ones. Making a single class (i.e. just illegal
ones) is simpler.

Also, not having to memorize escape sequences that you
rarely have need to use is simpler.

(4) Errors should never pass silently:

Even the Python Reference Manual indicates that unrecognized
escape sequences are a source of bugs. (See more comments on
this below.)

(5) In the face of ambiguity, refuse the temptation to
guess.

Every language, other than C++, is taking a guess at what
the programmer would find to be most useful expansion for
unrecognized escape sequences, and each of the languages is
guessing differently. This temptation should be refused!

You can argue that once it is in the Reference Manual it is
no longer a guess, but that is patently specious, as Perl
proves. For instance, the fact that Perl will quietly convert
an array into a scalar for you, if you assign the array to a
scalar variable is certainly a "guess" of the sort that this
Python koan is referring to. Likewise for an arbitrary
interpretation of unrecognized escape sequences.

(6) There should be one-- and preferably only one --obvious
way to do it.

What is the one obvious way to express "\\y"? It is "\\y" or
"\y"?

Python can easily make one of these ways the "one obvious
way" by making the other one raise an error.

(7) Namespaces are one honking great idea -- let's do more
of those!

Allowing "\y" to self-expand is intruding into the namespace
for special characters that require an escape sequence.

> C++ apparently forbids all escape sequences, with
> unspecified behaviour if you use a forbidden sequence,
> except for a handful of explicitly permitted sequences.
>
> That's not better, it's merely different.

It *is* better, as it catches errors early on at little
cost, and for all the other reasons listed above.

> Actually, that's not true -- that the C++ standard forbids
> a thing, but leaves the consequences of doing that thing
> unspecified, is clearly a Bad Thing.

Indeed. But C++ has backward compatibly issues that make
any that Python has to deal with, pale in comparison. The
recommended behavior for a C++ compiler, however, is to flag
the problem as an error or as a warning.

> So on at least one machine in the world, C++ simply strips
> out backslashes that it doesn't recognize, leaving the
> suffix. Unfortunately, we can't rely on that, because C++
> is underspecified.

No, *fortunately* you can't rely on it, forcing you to go

> Fortunately this is not a problem with
> Python, which does completely specify the behaviour of
> escape sequences so there are no surprises.

It's not a surprise when the C++ compiler issues a warning to
you. If you ignore the warning, then you have no one to
blame but yourself.

> Implicit has an actual meaning. You shouldn't use it as a
> mere term of opprobrium for anything you don't like.

Pardon me, but I'm using "implicit" to mean "implicit", and
nothing more.

Python's behavior here is "implicit" in the very same way
that Perl implicitly converts an array into a scalar for
you. (Though that particular Perl behavior is a far bigger
wart than Python's behavior is here!)

> > Because you've stated that "\y" is a legal escape
> > sequence, while the Python Reference Manual explicitly
> > states that it is an "unrecognized escape sequence", and
> > that such "unrecognized escape sequences" are sources of
> > bugs.
>
> There's that reading comprehension problem again.
>
> Unrecognised != illegal.

This is reasoning that only a lawyer could love.

The right thing for a programming language to do, when
handed something that is syntactically "unrecognized" is to
raise an error.

> It seems to me that the behaviour the Python designers
> were looking to avoid was the case where the coder
> accidentally inserted a backslash in the wrong place, and
> the language stripped the backslash out, e.g.:
>
> Wanted "a\bcd" but accidentally typed "ab\cd" instead, and
> got "abcd".

The moral of the story is that *any* arbitrary
interpretation of unrecognized escape sequences is a
potential source of bugs. In Python, you just end up with a
converse issue, where one might understandably assume that
"foo\bar" has a backslash in it, because "foo\yar" and
*most* other similar strings do. But then it doesn't.

> >> This is *exactly* like C++, except that in Python the
> >> semantics of \y and \\y are identical. Python doesn't
> >> guess what you mean, it *imposes* a meaning on the
> >> escape sequence. You just don't like that meaning.

> > That's because I don't like things that are
> > ill-conceived.

> And yet you like C++... go figure *wink*

Now that's a bold assertion!

I think that "tolerate C++" is more like it. But C++ does
have its moments.

|>ouglas