[Python-ideas] Descouraging the implicit string concatenation
Matt Arcidy
marcidy at gmail.com
Fri Mar 16 11:47:19 EDT 2018
On Thu, Mar 15, 2018 at 8:58 PM, Rob Cliffe via Python-ideas
<python-ideas at python.org> wrote:
>
>
> On 14/03/2018 17:57, Chris Angelico wrote:
>
> On Thu, Mar 15, 2018 at 12:40 AM, Søren Pilgård <fiskomaten at gmail.com>
> wrote:
>
> Of course you can always make an error, even in a single letter.
> But I think there is a big difference between mixing up +-/* and **
> where the operator is in "focus" and the implicit concatenation where
> there is no operator.
> A common problem is that you have something like
> foo(["a",
>      "b",
>      "c"
>      ]).bar()
> but then you remember that there also needs to be a "d" so you just add
> foo(["a",
>      "b",
>      "c"
>      "d"
>      ]).bar()
> Causing an error, not with the "d" expression you are working on but
> due to what you thought was the previous expression but python turns
> it into one.
> The , is seen as a delimiter by the programmer not as part of the
> operation (or the lack of the ,).
>
> You're creating a list. Put a comma at the end of every line; problem
> solved. Your edit would be from this:
>
> foo(["a",
>      "b",
>      "c",
>      ]).bar()
>
> to this:
>
> foo(["a",
>      "b",
>      "c",
>      "d",
>      ]).bar()
>
> and there is no bug. In fact, EVERY time you lay out a list display
> vertically (or dict or set or any equivalent), ALWAYS put commas. That
> way, you can reorder the lines freely (you don't special-case the last
> one), you can append a line without changing the previous one (no
> noise in the diff), etc, etc, etc.
>
>
> My thoughts exactly. I make it a personal rule to ALWAYS add a comma to
> every line, including the last one, in this kind of list (/dict/set etc.).
> Python allows it - take advantage of it! (A perhaps minor-seeming feature
> of the language which actually is a big benefit if you use it.) Preferably
> with all the commas vertically aligned to highlight the structure (I'm a
> great believer BTW in using vertical alignment, even if it means violating
> Python near-taboos such as more than one statement per line). Also I would
> automatically put the first string (as well as the last) on a line by
> itself:
> foo([
> "short string" ,
> "extremely looooooooooooong string" ,
> "some other string" ,
> ])
> Then as Chris says (sorry to keep banging the drum), the lines can trivially
> be reordered, and adding more lines never causes a problem as long as I
> stick to the rule. Which I do automatically because I think my code looks
> prettier that way.
>
> From a purist angle, implicit string concatenation is somewhat inelegant
> (where else in Python can you have two adjacent operands not separated by an
> operator/comma/whatever? We don't use reverse Polish notation). And I
> could live without it. But I have found it useful from time to time:
> constructing SQL queries or error messages or other long strings that I
> needed for some reason, where triple quotes would be more awkward (and I
> find line continuation backslashes ugly, *especially* in mid-string). I
> guess my attitude is: "If you want to read my Python code, 90%+ of the time
> it will be *obvious* that these strings are *meant* to be concatenated. But
> if it isn't, then you need to learn some Python basics first (Sorry!).".
>
> I have never as far as I can remember had a bug caused by string
> concatenation. However, it is possible that I am guilty of selective
> memory.
>
> +1, though, for linters to point out possible errors from possible
> accidental omission of a comma (if they don't already). It never hurts to
> have our code checked.
The linters I know use parsed ast nodes, so if it's not valid grammar,
it won't parse. The _linters_ don't check for cases like f(a b)
because that's not valid grammar and already caught by the parser. I
think that's what you were noting when you said "if they don't
already"?
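
A minimal sketch of that point, using only the standard `ast` module (the helper name `parses` is made up for illustration): two adjacent names are rejected before any linter sees a tree, while two adjacent string literals parse without complaint.

```python
import ast

def parses(source):
    """Return True if `source` is accepted by the Python grammar."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True

print(parses("f(a b)"))       # two adjacent names: rejected by the parser
print(parses('f("a" "b")'))   # two adjacent string literals: accepted
```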
As for the issue in general, this is my understanding of linters:
The code is loaded immediately into the ast parser. There's no
post-hoc way to know why the node is a string literal. Specifically
the nodes for "ab" and "a""b" are identical Str nodes. Reversing
from the Str node is impossible (unless a
flag/attribute/context/whatever gets added), as the information is
destroyed by then.
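
This can be checked with a short sketch (the helper name `dump_expr` is hypothetical). On recent Python versions the node is reported as `Constant` rather than `Str`, but the comparison comes out the same: nothing in the dumped tree records how the literal was spelled.

```python
import ast

def dump_expr(source):
    # ast.dump omits line/column attributes by default, so only the
    # node structure and values are compared.
    return ast.dump(ast.parse(source, mode="eval"))

single = dump_expr('"ab"')
implicit = dump_expr('"a" "b"')

print(single)
print(single == implicit)  # the two spellings yield the same tree
```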
The following might be possible:
1) The line number for a node n1 provides a way to select the code
between nodes n1 and the next node n2, which contains the offending
string. This code needs to be retrieved from somewhere (easily kept
from the beginning if not already)
2) A quick reparse of just that chunk will confirm that it contains
the target node, so the code retrieval for the target node can be
sanity checked (it's not exact retrieval of just the code we want,
it's guaranteed to overshoot as resolution is on line numbers.)
3) ? (note below)
4) A quick check shows the tokenizer can differentiate the styles of
literal. It gives 2 STRING for 'a''b' and 1 for 'ab'. This is a
reliable test if the right code can be found which matches the target
string. Hopefully there are other, better ways, but at least one
exists. (Better in the sense that the linters I know do not include
tokenization currently.)
3) The biggest problem for me (hopefully someone just knows the
answer) is that some other very reliable parsing is required to know
that the literal being _tested_ is the literal we _mean_ to test. I
can use the ast parser in 2) because I'm confirming a different piece of
information.
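
The tokenizer check in 4) can be reproduced roughly as follows, using the standard `tokenize` module. The helper name `count_string_tokens` is an assumption for illustration, not something an existing linter provides.

```python
import io
import tokenize

def count_string_tokens(source):
    """Count STRING tokens in a piece of source code."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type == tokenize.STRING)

print(count_string_tokens("'a''b'"))  # implicit concatenation
print(count_string_tokens("'ab'"))    # a single literal
```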
The problem is that the literal can exist in a very complex section,
and coincidentally with 'a''b' and 'ab' in the same expression. The
ast node won't tell us if we are looking for 0, 1 or 2 cases of either
syntax, we only get 2 Str nodes and 1 line number. I think the parser
chosen would pretty much have to be a rewrite of the ast parser due to
nesting. Or someone needs to root around in the internals of the ast
parser to see if the information can be extracted at the right time.
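
One possible way around the mapping problem, offered only as a sketch, is to skip the ast node entirely and scan the token stream for a STRING token that directly follows another STRING token, since each token carries its own row and column. The names `SKIPPABLE` and `find_implicit_concatenations` are hypothetical, and f-strings on newer Python versions tokenize differently and are not covered here.

```python
import io
import tokenize

# Tokens that may sit between two string literals without separating them.
SKIPPABLE = {tokenize.NL, tokenize.COMMENT}

def find_implicit_concatenations(source):
    """Return (row, col) of each string token that directly follows another."""
    hits = []
    previous = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPABLE:
            continue
        if tok.type == tokenize.STRING and previous == tokenize.STRING:
            hits.append(tok.start)
        previous = tok.type
    return hits

source = '''foo(["a",
     "b",
     "c"
     "d"
     ]).bar()
'''
print(find_implicit_concatenations(source))
```

Because positions come from the tokens themselves, 'a''b' and 'ab' appearing in the same expression are reported separately, without reversing anything from the tree.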
I hope this helps in some way. I don't think it's impossible, but the
above will introduce annoying bits into existing linters for this one
issue. Of course, a working example would certainly make any case a
lot easier.
Given there is no technical reason, I agree there's no reason to
change anything. This just "feels" ugly to me, so I can tilt at
windmills all day on it, but I can see no technical reason.
Always amazes me that strings are so weird. They are literals and
lists, but also multi white space jumping zebras. Are they "multi"
"white space" "jumping zebras"? "multi" "white space jumping"
"zebras"? "multi white" "space jumping" "zebras"? We'll never know!
-Matt
>
> Best wishes
> Rob Cliffe
>