PEP 292-related: why string substitution is not the same operation as data formatting

For a moment, please remove from your mind your experience of C and printf. Meditate with me and picture yourself in a happy world of object-orientation and code readability, where everything cryptic and obscured is banished.

Just to help you do that, I'll avoid the notation chosen by the PEP. Let's use, for the duration of this post, the hypothetical notation suggested by some other reader: "<<name>> is from <<country>>".

Now, this thing we're talking about is replacing parts of the string with other strings. These strings may be the result of running some non-string objects through str(foo) - but we are making no assumptions about these objects, just that str(foo) is somehow meaningful. And, to my knowledge, there are no python objects for which str(foo) doesn't work. So, string substitution is non-intrusive.

Also, if you keep your templates (let's call a string containing substitution markup a template, shall we?) outside your source code, as is the case with i18n, pure substitution doesn't require the people who edit them (for example, translators) to know anything about python *or* even programming.

String substitution only depends on an identifier ('name' or 'country'), with no sick abbreviations like 's' or 'd' or 'f' or 'r' or 'x' that you have to keep a table for. So, string substitution is readable and non-cryptic.

Now, data formatting is another animal entirely. It's a way to request one specific representation of a piece of data. But there is a catch. When you do '%8.3d' % foo you are *expecting* that foo is a floating-point number, and you know you'll get TypeError otherwise. This is, IMO, invasive. In my ideal OO-paradise I would rather have something like foo.format(8, 3) (THIS IS NOT A PEP!).

IMO, if you, as I asked in the first paragraph, pretend you don't know C and printf and python's % operator, and then pretend you're having your first contact with it while already having some experience with python's readability, it's hard not to be shocked. And I bet you'd go to great lengths to avoid using the "feature".

Conclusion: I think string formatting is a cryptic and obscure misfeature inherited from C that should be deprecated in favour of something less invasive and more readable/explicit.

More, I'm completely opposed to "<<name>> is <<age:.0d>> years old" because it's still cryptic and invasive. This should instead read similar to

    "<<name>> is <<age>> years old".sub({'name': x.name, 'age': x.age.format(None, 0)})

Guido, can you please, for our enlightenment, tell us what are the reasons you feel %(foo)s was a mistake?

[]s, |alo
--
It doesn't bother me that people say things like "you'll never get anywhere with this attitude". In a few decades, it will make a good paragraph in my biography. You know, for a laugh.
--
http://www.laranja.org/ mailto:lalo@laranja.org
pgp key: http://www.laranja.org/pessoal/pgp
Python Foundry Guide http://www.sf.net/foundry/python-foundry/
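[Editor's sketch] The post above never defines how its sub() would behave, so the following is only one possible reading of "pure substitution" in the <<name>> notation, written for modern Python. The function name sub, the regular expression, and the sample values are all assumptions made for illustration. Each placeholder is replaced by str() of the matching value, and any formatting is applied to the value beforehand rather than inside the template.

```python
import re

# Hypothetical helper: the thread proposes a .sub() string method but gives
# no implementation. This standalone function is an illustrative stand-in.
_PLACEHOLDER = re.compile(r"<<([A-Za-z_][A-Za-z0-9_]*)>>")


def sub(template, values):
    """Replace each <<name>> in template with str(values[name])."""
    def replace(match):
        # Only str() is assumed to be meaningful for the value.
        return str(values[match.group(1)])
    return _PLACEHOLDER.sub(replace, template)


# Sample data below is made up for the example.
print(sub("<<name>> is from <<country>>", {"name": "Ada", "country": "Peru"}))

# Formatting is requested on the value itself, not in the template.
print(sub("<<name>> is <<age>> years old",
          {"name": "Ada", "age": format(27.4, ".0f")}))
```

A missing name raises KeyError here; whether that is the right policy for translator-edited templates is left open by the thread.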

Lalo> These strings may be the result of running some non-string objects
Lalo> through str(foo) - but, we are making no assumptions about these
Lalo> objects. Just that str(foo) is somehow meaningful. And, to my
Lalo> knowledge, there are no python objects for which str(foo) doesn't
Lalo> work.

Unicode objects can't always be passed to str():

    >>> str(u"abc")
    'abc'
    >>> p = u'Scr\xfcj MacDuhk'
    >>> str(p)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: ASCII encoding error: ordinal not in range(128)

(My default encoding is "ascii".) You need to encode Unicode objects using the appropriate charset, which may not always be the default.

Skip

On Sun, Jun 23, 2002 at 02:28:20PM -0500, Skip Montanaro wrote:
Valid point but completely unrelated to my argument - just s/str/unicode/ where necessary. '%s' already handles this:
    >>> '-%s-' % u'Scr\xfcj MacDuhk'
    u'-Scr\xfcj MacDuhk-'
[]s, |alo

"LM" == Lalo Martins <lalo@laranja.org> writes:
LM> Also, if you keep your templates (let's call a string
LM> containing substitution markup a template, shall we?) outside
LM> your source code, as is the case with i18n, pure substitution
LM> doesn't require the people who edit them (for example,
LM> translators) to know anything about python *or* even
LM> programming.

It isn't always done that way though. See Francois's very good followup describing gettext vs. catgets.

LM> Now, data formatting is another animal entirely. It's a way to
LM> request one specific representation of a piece of data.

I agree!

-Barry

[Lalo Martins]
I guess it depends on your definition of "work". This can fail if foo is an instance of a class with __str__ (or __repr__) having a bug or raising an exception. If foo is your own code you probably want it to fail. If foo is someone else's code you may have no choice but to work around it. :-(

--
Patrick K. O'Brien
Orbtech
"Your source for Python software development."
Web: http://www.orbtech.com/web/pobrien/
Blog: http://www.orbtech.com/blog/pobrien/
Wiki: http://www.orbtech.com/wiki/PatrickOBrien

Guido, can you please, for our enlightenment, tell us what are the reasons you feel %(foo)s was a mistake?
Because of the trailing 's'. It's very easy to leave it out by mistake, and because the definition of printf formats skips over spaces (don't ask me why), the first character of the following word is used as the type indicator. (FWIW, I agree with your other observations -- this was why I support exploring an alternative in PEP 292.) --Guido van Rossum (home page: http://www.python.org/~guido/)
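[Editor's sketch] The failure mode described above can be reproduced directly. The example below uses modern Python, where the % operator still behaves this way; the dictionary keys and values are made up for illustration. When the trailing 's' is forgotten, the following space is read as a flag and the first letter of the next word as the conversion type.

```python
# Correct spelling: the trailing 's' is present.
ok = "%(count)s items" % {"count": 3}

# Trailing 's' forgotten: ' ' is parsed as a flag and the 'i' of "items"
# as the integer conversion, so no error is raised and text goes missing.
silent = "%(count) items" % {"count": 3}

# Same typo with a string value: 'i' now demands a number and raises.
try:
    "%(name) is here" % {"name": "Guido"}
    error = None
except TypeError as exc:
    error = exc

print(repr(ok))              # '3 items'
print(repr(silent))          # ' 3tems'
print(type(error).__name__)  # TypeError
```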

On Friday 12 July 2002 10:47 am, Guido van Rossum wrote:
(FWIW, I agree with your other observations -- this was why I support exploring an alternative in PEP 292.)
The syntax rules of PEP 292 are likely to cause confusion for newbies who have never used sh or perl. They will ask why Python has two syntaxes for doing string substitutions. Why not always spell the substitution string with ${identifier} or %(identifier)? The third rule of PEP 292 in particular looks like a patch to fix a kludge when an unanticipated exception was discovered:

    3. ${identifier} is equivalent to $identifier. It is required for when valid identifier characters follow the placeholder but are not part of the placeholder, e.g. "${noun}ification".
It's easy to leave it out by mistake, but the error is almost always immediately obvious. In the interest of keeping the language as simple as possible, I hope no changes are made.

If a method based .sub() capability is to be added, why not reuse the %(identifier) syntax instead of introducing $ and ${} syntax? The .sub() string method would use the %(identifier) syntax without the 's' to spell the new substitution format. Instead of the proposed:

    '$name was born in ${country}'.sub()

the phrase would be spelled:

    '%(name) was born in %(country)'.sub()

This approach would introduce one new string method with a small variation on the existing '%' substitution syntax.
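[Editor's sketch] For readers who want to try the $name and ${name} rules quoted above: they are available today through string.Template in the standard library, which takes an explicit mapping via substitute() rather than the argument-less .sub() string method discussed in this thread. The sample values are illustrative only.

```python
from string import Template

# Illustrative values; substitute() needs the mapping passed explicitly.
values = {"name": "Guido", "country": "the Netherlands", "noun": "Python"}

# $name and ${country} are equivalent spellings of a placeholder.
print(Template("$name was born in ${country}").substitute(values))

# Rule 3: braces are needed when identifier characters follow directly.
print(Template("${noun}ification").substitute(values))

# Without braces the whole word is taken as the placeholder name.
try:
    Template("$nounification").substitute(values)
    missing = None
except KeyError as exc:
    missing = exc.args[0]
print("unknown placeholder:", missing)
```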

An argument can be made that since this works rather differently than the current % operator, it's better to avoid confusion by using a different character. One can also argue that many Perl and shell programmers are migrating to Python, for whom this would be helpful -- for others, $ or % makes little difference (DOS batch file programmers aren't that common, most Windows users never get to this).

But the exact syntax to use in the template is a relatively trivial detail IMO. Whether to pick `name`, <<name>>, $name, $(name), ${name}, %name, %{name}, or %(name), is a choice we can make later. Ditto about whether to allow full expressions, dotted names only, or simple names only, and whether to allow leaving off the brackets for simple names (or even for dotted names, as in PEP 215). User testing would be good.

User testing has already shown that the current %(name)s notation causes too many mistakes, because of the odd trailing 's'. These errors may be immediately obvious when you run the code, but constructs that are easily mistyped should still be avoided if possible. Also, I believe that the error has actually been puzzling for many people (e.g. sometimes no error is raised but on close inspection a few characters appear to be omitted from the output).

The real issues are IMO:

- Compile-time vs. run-time parsing. I've become convinced that the compiler should do the parsing: this is the only way to make access to variables in nested scopes work, avoids security issues, and makes it easier to diagnose errors (e.g. in PyChecker).

- How to support translation. Here the template must be replaced at run-time, but it is still desirable that the collection of available names is known at compile time (to avoid the security issues).

- Optional formatting specifiers. I agree with Lalo that these should not be part of the interpolation syntax but need to be dealt with at a different level. I think these are only relevant for numeric data. Funny, there's still a (now-deprecated) module fpformat.py that supports arbitrary floating point formatting, and string.zfill() supports a bit of integer formatting.

--Guido van Rossum (home page: http://www.python.org/~guido/)
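[Editor's sketch] One way to read "dealt with at a different level" is to format each number on its own and only then interpolate the finished strings. The sketch below assumes modern Python and uses the built-in format() and str.zfill() with string.Template; none of these spellings are proposed in the thread itself, and the variable names are made up.

```python
from string import Template

amount = 3.14159
count = 7

# Step 1: numeric formatting, done per value and outside the template.
fields = {
    "amount": format(amount, ".2f"),  # two decimal places
    "count": str(count).zfill(3),     # zero-padded to width 3
}

# Step 2: pure substitution; the template carries names only.
line = Template("$count items cost $amount").substitute(fields)
print(line)  # 007 items cost 3.14
```

The template stays free of format codes, at the price of building the mapping by hand before every substitution.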

On Fri, Jul 12, 2002, Guido van Rossum wrote:
I've used "%20s" * 5 frequently enough in the past to do crude tables. That's not a feature I'd like to lose.

--
Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
Project Vote Smart: http://www.vote-smart.org/

[Aahz]
I've used "%20s" * 5 frequently enough in the past to do crude tables. That's not a feature I'd like to lose.
So has Guido -- he'll remember that before it's too late <wink>. Ditto "-" to switch string justification. Prediction: the $(name:optional_format) notation will win in the end.
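[Editor's sketch] The idiom being defended here repeats a fixed-width format to build a row, with "-" flipping the justification. A small illustration with a narrower width and made-up column data:

```python
# Three right-justified columns, each 10 characters wide.
right = "%10s" * 3
# The "-" flag switches to left justification.
left = "%-10s" * 3

rows = [("name", "qty", "price"), ("spam", 2, 1.5), ("eggs", 12, 0.25)]

for row in rows:
    print(right % row)
for row in rows:
    print(left % row)

# Every rendered row has the same total width, which is what makes the table.
widths = {len(right % row) for row in rows} | {len(left % row) for row in rows}
```

This is exactly the kind of width and justification control that pure name substitution does not offer by itself.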

On 12 Jul 2002 at 16:11, Tim Peters wrote: [Aahz]
[Tim]
Good. I use both just enough that I'd really miss them, but not frequently enough to remember exactly what each modifier does with each data type.

-- Gordon
http://www.mcmillan-inc.com/

Tim:
Addendum to my suggestion: The "{...}" plays the role of the "s" in a normal string format, so that you can do %-20{foo} etc.

Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand
greg@cosc.canterbury.ac.nz
A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.

Guido:
How about introducing a new format %{foo} which is defined to be the same as %(foo)s.

Greg Ewing

Maybe too subtle (you'd really have to explain the history to make people understand why there's both %() and %{}), and doesn't solve the compile time / run time issue IMO.

--Guido van Rossum (home page: http://www.python.org/~guido/)

On Fri, Jul 12, 2002 at 10:47:34AM -0400, Guido van Rossum wrote:
In case that wasn't clear, I agree with that - I asked because I wanted this in writing for the record. BTW: IIRC, it skips over spaces because spaces are a valid format modifier (meaning "pad with spaces").

[]s, |alo

participants (10)
- Aahz
- barry@zope.com
- Gordon McMillan
- Greg Ewing
- Guido van Rossum
- Lalo Martins
- Michael McLay
- Patrick K. O'Brien
- Skip Montanaro
- Tim Peters