[Python-ideas] Implicit string literal concatenation considered harmful (options)

Sat May 18 20:16:29 CEST 2013

On 05/17/2013 04:41 PM, rurpy at yahoo.com wrote:
>
> On Friday, May 17, 2013 8:14:39 AM UTC-6, Ron Adam wrote:
>
>     On 05/17/2013 06:41 AM, Steven D'Aprano wrote:
>      > They clearly should be in different threads. Line continuation is
>      > orthogonal to string continuation. You can have string concatenation
>      > on a single line:
>      >
>      > s = "Label:\t" r"Data containing \ backslashes"
>
>     Can you think of, or find an example of two adjacent strings on the same
>     line that can't be written as a single string?
>
>           s = "Label:\t Data containing \ backslashes"
>
>     I'm curious about how much of a problem not having implicit string
>     concatenations really is?
>
>
> "Can't" is an unrealistically high a bar but I posted a real example at
>    http://mail.python.org/pipermail/python-ideas/2013-May/020847.html
> that is *better* written IMO as adjacently-concatenated string literals.

If we didn't have implicit string concatenation, I'd probably write it with 
each part on a separate line to make it easier to read.

     pattern = '[^\uFF1B\u30FB\u3001' \
               + r'+:=.,\/\[\]\t\r\n]+' \
               + '[\#\uFF03]+'

I think in this case the strings are joined at compile time as Guido
suggested in his post.

You could also write it as...

pattern = ('[^\uFF1B\u30FB\u3001' +
            r'+:=.,\/\[\]\t\r\n]+' +
            '[\#\uFF03]+')
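
As a rough check of the compile-time point, here is a small sketch comparing
how the two spellings are parsed.  It assumes CPython 3.8 or later (where
literals show up as ast.Constant), and the short literals are stand-ins for
the pattern above.  The folding of the '+' form is a CPython optimisation,
not something the language guarantees.

```python
import ast

# Adjacent literals: the parser hands back a single constant node.
implicit = ast.parse("'abc' r'\\d+'", mode="eval").body

# Explicit '+': the parse tree holds a binary operation on two constants.
explicit = ast.parse("'abc' + r'\\d+'", mode="eval").body

# CPython folds the '+' while compiling, so the joined string
# ends up among the code object's constants.
folded = compile("'abc' + r'\\d+'", "<demo>", "eval")

print(type(implicit).__name__, repr(implicit.value))
print(type(explicit).__name__)
print(folded.co_consts)
```

So the explicit form costs nothing at run time on CPython, but only because
of an optimisation pass rather than the grammar.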

If implicit string concatenation is removed, it would be nice if there was 
an explicit replacement for it.  There is a strong consensus for doing it, 
but there isn't strong consensus on how to do it.

About line continuations:

Line continuations are a related issue to string concatenations because 
they are used together fairly often.

The line continuation behaviour is a bit quirky, but not in any critical
way.  There has even been a PEP to remove it in Python 3, but it was
rejected for not having enough support.  People do use it, so it would be
better if it was improved rather than removed.

As noted in other messages, the line continuation is copied from C, which I
think originally came from the 'Make' utility.  (I'm not positive on that)
In Make, the \+newline pair is replaced with a space, while C simply deletes
it.  Python also removes the \+newline, and keeps track of whether or not
it's in a string.  Look in tokenizer.c for this.
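
The removal of the \+newline can be seen from Python itself with the
standard tokenize module.  This is only a sketch: it shows the token stream
the module reports, not the C code, and the sample source line is made up
for illustration.

```python
import io
import tokenize

# A backslash immediately followed by a newline, then the rest of the expression.
source = "x = 1 + \\\n    2\n"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
kinds = [tokenize.tok_name[t.type] for t in tokens]

# Each number together with the physical line it was found on.
numbers = [(t.string, t.start[0]) for t in tokens if t.type == tokenize.NUMBER]

print(kinds)
print(numbers)
```

No token is reported for the backslash or for the line break after it.  The
two physical lines come out as one logical line, with a single NEWLINE token
at the end.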

As for the *not too important* quirkiness:

 >>> 'abc' \ 'efg'
   File "<stdin>", line 1
     'abc' \ 'efg'
                 ^
SyntaxError: unexpected character after line continuation character

This error implies that the '\' by itself is a line continuation token
even though it's not followed by a newline.  Otherwise you would get the
same SyntaxError you get when you use any other symbol in an invalid way.

This was probably done because it was easy to do, and/or because a
better error message is more helpful.

Trailing white space results in the same error.  This happens enough to be 
annoying. It is confusing to some people why the compiler can recognise the 
line continuation *character*, but can't figure out that the white space 
after it is not important.
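
The trailing white space case is easy to reproduce without a REPL.  A
minimal sketch, where compiles() is a helper made up for this example and
not part of any library:

```python
def compiles(source):
    """Return (True, None) if source compiles, else (False, error message)."""
    try:
        compile(source, "<demo>", "exec")
    except SyntaxError as exc:
        return False, exc.msg
    return True, None

clean = "x = 1 + \\\n    2\n"       # backslash directly before the newline
trailing = "x = 1 + \\ \n    2\n"   # one space between backslash and newline

clean_ok, clean_msg = compiles(clean)
trailing_ok, trailing_msg = compiles(trailing)

print(clean_ok, clean_msg)
print(trailing_ok, trailing_msg)
```

The only difference between the two sources is a single space, which most
editors won't show.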

 >>> # comment 1\
... comment 2
   File "<stdin>", line 2
     comment 2
             ^
SyntaxError: invalid syntax

This just shows that comments are parsed before line continuations are
considered.  Or to put it another way, the '\' is part of the comment.
That isn't the case in C or Make, where you can continue a comment on the
next line with a line continuation.  Nothing wrong with this, but it shows
that line continuations in Python aren't exact copies of the ones in C.
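
To see that the '\' really ends up inside the comment, the standard tokenize
module can be asked for the tokens.  A small sketch, with the second line
changed to valid code (x = 1, my own stand-in) so that the whole source
tokenizes cleanly:

```python
import io
import tokenize

# A comment ending in a backslash, followed by an ordinary statement.
source = "# comment 1\\\nx = 1\n"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
first = tokens[0]

# Names found after the comment; the next line is tokenized as code.
names = [t.string for t in tokens if t.type == tokenize.NAME]

print(tokenize.tok_name[first.type], repr(first.string))
print(names)
```

The comment token runs to the end of the physical line, backslash included,
and the following line is treated as code rather than as more comment.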

There are perfectly good reasons why the compiler does what it does in each
of these cases.  I think the little things like this together have
contributed to the feeling that line continuations are bad and should be
avoided.

The discussed (and implied) options:

There are a number of options that have been discussed but those haven't 
really been clearly spelled out so the discussion has been kind of out of 
focus.   This seems like an overly detailed list, but the discussion has 
touched on pretty much all of these things.  I think the goal should be to 
find the most cohesive combination for Python 4 and/or just go with B alone.

A.  Do nothing.

B.  Remove implicit concatenation.

   (We could stop here, anything after this can be done later.)

C.  Remove explicit line continuations.  (See options below.)

D.  Add a new explicit string concatenation token.

E.  Reuse the \ as an explicit string concatenation.  (with C)

F.  Make an exception for implicit string concatenations only after a line 
continuation.  (with B)

G.  Make an exception for line continuations if a line ends with an explicit
string concatenation.  (With C and (D or E))

H.  Change line continuation character from \+newline to just \.

I.  Allow implicit line continuations if a line ends with an operator that
expects to be continued, like a comma inside parentheses already does.
(With C)

Option H has some interesting possibilities.  It pretty much is a complete 
replacement for the current escaped newline continuation, so how it works, 
and what constraints it has, would need to be discussed.  It's the option 
that would allow white space and comments after a line continuation character.

Option I is interesting because it's already there inside of parentheses,
and other containers.  I just haven't seen it described as an implicit
line continuation before.
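
For anyone who hasn't thought of it that way, the existing behaviour looks
like this.  A minimal sketch of current Python, not of the proposed Option I.
compiles() is a helper made up for the example.

```python
def compiles(source):
    """Return True if source compiles as a module, False on SyntaxError."""
    try:
        compile(source, "<demo>", "exec")
    except SyntaxError:
        return False
    return True

# A trailing operator inside parentheses: the line break is ignored.
inside_parens = "total = (1 +\n         2)\n"

# The same trailing operator with no brackets open: a syntax error today.
bare_operator = "total = 1 +\n    2\n"

parens_ok = compiles(inside_parens)
bare_ok = compiles(bare_operator)

print(parens_ok, bare_ok)
```

Option I would amount to making the second form behave like the first.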

It is my feeling that we can't change the escaped newline within strings.
That needs to be how it is, and it should be documented as a string feature,
rather than a general line continuation token.

So if line continuations outside of strings are removed, escaped newlines
inside of strings will still work.
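
A short sketch of the string feature in question.  The raw string case is
my own addition for contrast, it hasn't come up in the thread.

```python
# Inside an ordinary string literal, backslash + newline is dropped,
# so the two halves are joined with nothing in between.
joined = "abc\
def"

# Inside a raw string literal, both the backslash and the newline are kept.
raw = r"abc\
def"

print(repr(joined))
print(repr(raw))
```

This is handled as part of the string literal itself, which is why it can
stay as it is whatever happens to line continuations outside of strings.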

There are so many possibilities here, that the only thing I'm sure of right 
now is to go ahead and start the process of removing implicit string 
concatenations (Option B), and then consider everything else as separate 
issues in that context.

Cheers,
    Ron