[Python-Dev] Re: What to do about invalid escape sequences

12 Aug 2019

      On 8/11/2019 8:40 PM, Eric V. Smith wrote:
...
On 8/11/2019 4:18 PM, Glenn Linderman wrote:
...
On 8/11/2019 2:50 AM, Steven D'Aprano wrote:
...
On Sat, Aug 10, 2019 at 12:10:55PM -0700, Glenn Linderman wrote:
...
Or invent "really raw" in some spelling, such as rr"c:\directory\"
or e for exact, or x for exact, or <your favorite character
here>"c:\directory\"
And that brings me to the thought that if   \e  wants to become an
escape for escape, that maybe there should be an "extended escape"
prefix... if you want to use more escapes, define ee"string where \\
can only be used as an escape or escaped character, \e means the ASCII
escape character, and \ followed by a character with no escape
definition would be an error."
Please no.
We already have b-strings, r-strings, u-strings, f-strings, br-strings,
rb-strings, fr-strings, rf-strings, each of which comes in four
varieties (single quote, double quote, triple single quote and triple
double quote). Now you're talking about adding rr-strings, v-strings
(Greg suggested that) and ee-strings, presumably some or all of which
will need b*- and *b- or f*- and *f- varieties too.
Don't forget the upper & lower case varieties :)
And all orders!
...
...
...
>>> from tokenize import _all_string_prefixes
>>> _all_string_prefixes()
{'', 'b', 'BR', 'bR', 'B', 'rb', 'F', 'RF', 'rB', 'FR', 'Rf', 'Fr',
'RB', 'f', 'r', 'rf', 'rF', 'R', 'u', 'fR', 'U', 'Br', 'Rb', 'fr', 'br'}
>>> len(_all_string_prefixes())
25
And if you add just 'bv' and 'fv', it's 41:
{'', 'fr', 'Bv', 'BR', 'F', 'rb', 'Fv', 'VB', 'vb', 'vF', 'br', 'FV', 
'vf', 'FR', 'fV', 'bV', 'Br', 'Vb', 'Rb', 'RF', 'bR', 'r', 'R', 'Vf', 
'fv', 'U', 'RB', 'B', 'rB', 'vB', 'Fr', 'rF', 'fR', 'Rf', 'BV', 'VF', 
'bv', 'b', 'u', 'f', 'rf'}
There would be no need for 'uv' (not needed for backward 
compatibility) or 'rv' (can't be both raw and verbatim).
I'm not in any way serious about this. I just want people to realize 
how many wacky combinations there would be. And heaven forbid we ever 
add some combination of 3 characters. If 'rfv' were actually also 
valid, you get to 89:
{'', 'br', 'vb', 'fR', 'F', 'rFV', 'fRv', 'fV', 'rVF', 'Rfv', 'u', 
'vRf', 'fVR', 'rfV', 'Fvr', 'vrf', 'fVr', 'vB', 'Vb', 'Rvf', 'Fv', 
'Fr', 'FVr', 'B', 'rVf', 'FVR', 'vfr', 'VB', 'VrF', 'BR', 'VRf', 
'vfR', 'FR', 'Br', 'RFV', 'Rf', 'fvR', 'f', 'rb', 'VfR', 'VFR', 'fr', 
'vFR', 'VRF', 'frV', 'bR', 'b', 'FrV', 'r', 'R', 'RVF', 'FV', 'rvF', 
'FRV', 'Vrf', 'rvf', 'FRv', 'Frv', 'vF', 'bV', 'VF', 'fv', 'RF', 'RB', 
'rB', 'vRF', 'RFv', 'RVf', 'Rb', 'Vfr', 'vrF', 'rf', 'Bv', 'vf', 'rF', 
'U', 'bv', 'FvR', 'RfV', 'Vf', 'VFr', 'vFr', 'fvr', 'BV', 'rFv', 
'rfv', 'fRV', 'frv', 'RvF'}
If only we could deprecate upper case prefixes!
Eric
Yes. Happily, while there is a combinatorial explosion in spellings and 
casings, there is no cognitive overload: each character has an 
independent effect on the interpretation and use of the string, so once 
you understand the 5 existing types (b r u f and plain) you understand 
them all.
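
For anyone who wants to check the counts quoted above (25, 41 and 89), here is a minimal sketch that rebuilds them from scratch. It does not call the tokenize helper; the function name all_prefixes and the lists of base combinations are my own assumptions, chosen to match the sets Eric pasted.

```python
import itertools

def all_prefixes(combos):
    """Return every ordering and casing of each prefix combination, plus ''."""
    result = {''}
    for combo in combos:
        for perm in itertools.permutations(combo):
            # Each character may independently be lower or upper case.
            choices = [(c, c.upper()) for c in perm]
            for cased in itertools.product(*choices):
                result.add(''.join(cased))
    return result

# Assumed base combinations, matching the sets quoted above.
current = ['b', 'r', 'u', 'f', 'br', 'fr']
with_v = current + ['bv', 'fv']
with_rfv = with_v + ['rfv']

print(len(all_prefixes(current)))   # 25
print(len(all_prefixes(with_v)))    # 41
print(len(all_prefixes(with_rfv)))  # 89
```

A two-character combination contributes 2 orders times 4 casings, and a three-character one contributes 6 orders times 8 casings, which is where the jump to 89 comes from.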

Should we add one or two more, it would be with the understanding 
(hopefully reflected in the documentation also) that v and e would 
effectively be replacements for r and plain, rather than being combined 
with them.

Were I to design a new language with similar string syntax, I think I 
would use plain quotes for verbatim strings only, and have the following 
prefixes, in only a single case:

(no prefix) - verbatim UTF-8 (at this point, I see no reason not to 
require UTF-8 for the encoding of source files)
b - for verbatim bytes
e - allow (only explicitly documented) escapes
f - format strings

Actually, the above could be done as a preprocessor for python, or a 
future import. In other words, what you see is what you get, until you 
add a prefix to add additional processing.  The only combinations that 
seem useful are  eb  and  ef.  I don't know whether constraining the order 
of the prefixes would be helpful; if it is, I have no problem with a 
canonical ordering being prescribed.
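
To make the e prefix concrete, here is a rough sketch of the processing it would add on top of a verbatim string: documented escapes are translated, and anything else is an error instead of being passed through. The function name apply_escapes and the contents of the ESCAPES table are assumptions for illustration only; the real set would be whatever gets documented.

```python
# Hypothetical escape table (an assumption, not a proposal for the exact set).
ESCAPES = {
    'n': '\n',
    't': '\t',
    '\\': '\\',
    '"': '"',
    "'": "'",
    'e': '\x1b',   # ASCII escape, as suggested earlier in the thread
}

def apply_escapes(verbatim):
    """Translate documented escapes; raise ValueError on any other."""
    out = []
    i = 0
    while i < len(verbatim):
        ch = verbatim[i]
        if ch != '\\':
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(verbatim):
            raise ValueError("backslash at end of string")
        nxt = verbatim[i + 1]
        if nxt not in ESCAPES:
            raise ValueError("undefined escape: \\" + nxt)
        out.append(ESCAPES[nxt])
        i += 2
    return ''.join(out)

print(apply_escapes(r"c:\\directory\\"))   # backslashes must be doubled
try:
    apply_escapes(r"c:\directory")          # \d is not defined
except ValueError as exc:
    print("rejected:", exc)
```

Raw strings stand in for the verbatim (no prefix) strings here, since that is the closest thing the current language has.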

As a future import, one could code modules to either the current 
combinatorial explosion with all its gotchas, special cases, and passing 
of undefined escapes; or one could code to the clean limited cases above.

Another thing that seems awkward about the current strings is that {{ 
and }} become "special escapes".  If it were not for the permissive 
usage of \{ and \} in the current plain string processing, \{ and \} 
could have been used to escape the non-format-expression uses of { and 
}, which would be far more consistent with other escapes.  Perhaps the 
future import could regularize that, also.
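
For reference, this is the current doubling behavior I am calling a "special escape", shown for both f-strings and str.format. The variable names are just for illustration.

```python
name = "world"

# In an f-string, {{ and }} produce literal braces; {name} is an expression.
from_fstring = f"{{name}} expands to {name}"
print(from_fstring)

# str.format uses the same doubling convention.
from_format = "{{}} is literal, {} is a field".format(42)
print(from_format)
```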

A future import would have no backward compatibility issues to disrupt a 
simplified, more regular syntax.

Does anyone know of an existing feature that couldn't be expressed in a 
straightforward manner with only the above capabilities?

The only other thing that I have heard about regarding strings is that 
multi-line strings have their first line indented, and other lines not. 
Some have recommended making the first line blank, and just chopping off 
the first \n, others have recommended indenting all lines, and replacing 
"\n" followed by the number of indented spaces by "\n", so the text can 
be aligned in the code like it will be aligned for use. Both techniques 
seem to have their place in aiding code readability. Both techniques 
could be used together, in practice, using one more prefix character for 
triple quotes only:

     longstring = l"""
The traditional first blank line form
could be used as it has."""

If the first character of a long-string is a newline character, then it 
will be removed. If the string wants to have an initial newline 
character, a second one can be provided, which would not be removed.

      longstring = l"""The traditional indented form
                       could be used as it has, also."""

This would be contracted by removing, after each newline within the 
string, up to as many space characters as precede the first character of 
the string on its first line (if the lexer can provide that column). If 
fewer space characters are available after a newline, only the number 
available would be removed. If there are more, they would be retained.

A new form would also be permitted:

     longstring = l"""
         An indented form that isn't pushed as far right as the
         traditional indented form could also be used."""

If the first character of an l-string is a newline and the second 
character is a space character, this form would count the number of 
space characters in the second line, and remove up to that many space 
characters from all lines, as well as removing the initial newline 
character.
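
The three forms above can be emulated today with an ordinary function, which may make the rules easier to evaluate. This is only a sketch of my reading of them: the names layout and _strip_up_to are hypothetical, and first_col stands in for the column of the string's first character, which only the lexer would know.

```python
def _strip_up_to(line, width):
    """Remove at most `width` leading space characters from line."""
    n = 0
    while n < width and n < len(line) and line[n] == ' ':
        n += 1
    return line[n:]

def layout(text, first_col=0):
    """Emulate the proposed l-string rules on an ordinary string.

    first_col is a hypothetical stand-in for lexer-provided information.
    """
    if text.startswith('\n'):
        # First-blank-line forms: drop the initial newline, then use the
        # indentation of the next line (possibly zero) as the strip width.
        body = text[1:]
        width = len(body) - len(body.lstrip(' '))
        return '\n'.join(_strip_up_to(line, width)
                         for line in body.split('\n'))
    # Traditional indented form: the first line is untouched, later lines
    # lose up to first_col spaces.
    first, *rest = text.split('\n')
    return '\n'.join([first] + [_strip_up_to(line, first_col)
                                for line in rest])

print(layout("\n    An indented form\n      with a deeper line\n  and a shallow one"))
print(layout("The traditional indented form\n        could be used.", first_col=8))
```

A second leading newline survives, since only the first is removed, and a line with fewer spaces than the strip width simply loses the ones it has.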

If l-strings were implemented (l for layout), they could be combined 
with f and/or e.

Are there any other string feature workarounds in common use that could 
be codified in a future import scenario?

Glenn