[Python-3000] PEP - string.format

Sat Apr 22 09:29:16 CEST 2006

Talin wrote:
> No responses? I'm surprised...
> 
> (I assume the PEP editor is on vacation, since I haven't gotten a response
> from that channel either.)

A couple of days isn't really that long for Py3k stuff, and py-dev has been
pretty busy for the last few days :)

[...]
> Or if you just want me to drop it, I can do that as well...just do so
> politely and I will take no offense :) (I'm a pretty hard person to
> offend.)

Keep plugging away - this is important, IMO, and I think we can do better than 
the status quo.

> PEP: xxx Title: Advanced String Formatting Version: $Revision$ 
[...]
>  In any case, string.Template will not be discussed here,
> except to say that the this proposal will most likely have some overlapping
> functionality with that module.

string.Template will either be subsumed by the new mechanism, become an
implementation detail of the new mechanism, or else be specialised even
further towards i18n use cases. As you say, not something we should get hung
up on at this point (although my expectation is that it is the last that will
happen - the needs of i18n templating and general debug printing are quite
different).

> The '%' operator is primarily limited by the fact that it is a binary
> operator, and therefore can take at most two arguments. One of those
> arguments is already dedicated to the format string, leaving all other
> variables to be squeezed into the remaining argument.

As Ian noted, the special treatment of tuples and dictionaries makes this even 
worse (I'm one of those that tends to always use a tuple as the right operand 
rather than having to ensure that I'm not passing a tuple or dictionary).
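
To make the tuple problem concrete, here is a small illustration using the
existing '%' operator (the variable names are just for the example):

```python
value = (1, 2)

# A bare tuple on the right is treated as the argument list, so a single
# %s with a two-element tuple fails.
try:
    "value: %s" % value
except TypeError as exc:
    error = str(exc)

# Always wrapping the operand in a one-element tuple sidesteps the issue.
safe = "value: %s" % (value,)

# Dictionaries are special-cased too: they enable the %(name)s form.
named = "value: %(a)s" % {"a": 1}
```

This is why always writing the right operand as a tuple is the defensive
habit.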

[...]
> The second method, 'fformat', is identical to the first, except that it
> takes an additional first argument that is used to specify a 'custom
> formatter' object, which can override the normal formatting rules for
> specific fields:
> 
> "More on {0}, {1}, and {c}".fformat( formatter, a, b, c=d )
> 
> Note that the formatter is *not* counted when numbering fields, so 'a' is
> still considered argument number zero.

I don't like this. Formatting with a different formatter should be done as a 
method on the formatter object, not as a method on the string.

However, there may be implementation strategies to make this easier, such as 
having str.format mean (sans caching):

   def format(*args, **kwds):
       # Oh for the ability to write self, *args = args!
       self = args[0]
       args = args[1:]
       return string.Formatter(self).format(*args, **kwds)

Then custom formatters would be invoked via:

   MyFormatter("More on {0}, {1}, and {c}").format(a, b, c=d)

And defined via inheritance from string.Formatter. Yes, this is deliberately 
modelled on the way string.Template works :)
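
A minimal sketch of that calling convention, written against the
string.Formatter class found in current Python releases (which takes the
format string as an argument to format() rather than to the constructor).
BoundFormatter and ShoutingFormatter are hypothetical names invented for
this illustration:

```python
import string


class BoundFormatter(string.Formatter):
    """Hypothetical helper: binds the template at construction time."""

    def __init__(self, template):
        self.template = template

    def format(self, *args, **kwds):
        # The stdlib Formatter expects the format string as first argument.
        return super().format(self.template, *args, **kwds)


class ShoutingFormatter(BoundFormatter):
    """Hypothetical custom formatter: upper-cases every formatted field."""

    def format_field(self, value, format_spec):
        return super().format_field(value, format_spec).upper()


result = ShoutingFormatter("More on {0}, {1}, and {c}").format("a", "b", c="d")
```

The custom behaviour lives entirely in the subclass, and the string itself
needs no second formatting method.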

There may be some alternative design decisions to be made in the subclassing 
API, such as supporting methods on the formatter object that process each 
field in isolation:

   def format_field(self, field, fmt_spec, fmt_args, fmt_kwds):
       # field is x.partition(':')[0], where x is the text between the braces
       # fmt_spec is x.partition(':')[2] or None if x.partition(':')[1] is false
       # fmt_args and fmt_kwds are the arguments passed to the
       # Formatter's format() method
       val = self.get_value(field, fmt_args, fmt_kwds)
       return self.format_value(val, fmt_spec)

   def get_value(self, field, fmt_args, fmt_kwds):
       try:
           pos = int(field)
       except ValueError:
           return fmt_kwds[field]
       return fmt_args[pos]

   def format_value(self, value, fmt_spec):
       return value.__format__(fmt_spec)

For example, if a string.Template equivalent had these methods, they might 
look like:

   def get_value(self, field, fmt_args, fmt_kwds):
       return fmt_kwds[field]

   def format_value(self, value, fmt_spec):
       if fmt_spec is not None:
           raise ValueError("field formatting not supported")
       return value

Another example would be a pretty-printer variant that pretty-printed types 
rather than using their normal string representation.
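
To show how those hooks could fit together, here is a self-contained sketch.
It is only an illustration of the idea: SketchFormatter and ReprFormatter
are hypothetical names, the regex handles only simple un-nested fields, and
the builtin format() stands in for the __format__ call:

```python
import re


class SketchFormatter:
    """Hypothetical formatter exposing per-field hooks for subclasses."""

    _field_re = re.compile(r"\{([^{}]*)\}")

    def __init__(self, template):
        self.template = template

    def format(self, *args, **kwds):
        def replace(match):
            field, sep, spec = match.group(1).partition(":")
            # fmt_spec is None when the field has no ':' at all
            return self.format_field(field, spec if sep else None, args, kwds)
        return self._field_re.sub(replace, self.template)

    def format_field(self, field, fmt_spec, fmt_args, fmt_kwds):
        val = self.get_value(field, fmt_args, fmt_kwds)
        return self.format_value(val, fmt_spec)

    def get_value(self, field, fmt_args, fmt_kwds):
        try:
            pos = int(field)
        except ValueError:
            return fmt_kwds[field]
        return fmt_args[pos]

    def format_value(self, value, fmt_spec):
        return format(value, fmt_spec or "")


class ReprFormatter(SketchFormatter):
    """Variant that ignores the spec and shows repr() of each value."""

    def format_value(self, value, fmt_spec):
        return repr(value)


plain = SketchFormatter("{0} has {n:03d} items").format("cart", n=7)
debug = ReprFormatter("{0} has {n} items").format("cart", n=7)
```

A pretty-printing variant would override format_value in the same way as
ReprFormatter does here.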

[...]
> Brace characters ('curly braces') are used to indicate a replacement field
> within the string:

I like this, since it provides OOWTDI, rather than "use this way if it's not 
ambiguous, this other way if it's otherwise unparseable".

[...]
> Braces can be escaped using a backslash:
> 
> "My name is {0} :-\{\}".format( 'Fred' )

So "My name is 0} :-\{\}".format('Fred') would be an error? I like that - it 
means you get an immediate exception if you inadvertently leave out a brace, 
regardless of whether you leave out the left brace or the right brace.
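
For comparison, the str.format method in current Python releases does fail
fast in both cases, although it escapes braces by doubling them rather than
with a backslash. try_format is just a helper for this illustration:

```python
def try_format(template, *args):
    """Return the formatted string, or a description of the ValueError."""
    try:
        return template.format(*args)
    except ValueError as exc:
        return "ValueError: " + str(exc)


missing_left = try_format("My name is 0}", "Fred")
missing_right = try_format("My name is {0", "Fred")
escaped = try_format("My name is {0} :-{{}}", "Fred")
```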

[...]
> The conversion specifier consists of a sequence of zero or more characters,
> each of which can consist of any printable character except for a
> non-escaped '}'.

Why "conversion specifier" instead of "format specifier"?

>  The format() method does not attempt to interpret the
> conversion specifiers in any way; it merely passes all of the characters
> between the first colon ':' and the matching right brace ('}') to the
> various underlying formatters (described later.)

If we had a subclassing API similar to what I suggest above, a custom 
formatter could easily support Ian's pipelining idea by doing:

   def format_value(self, value, fmt_spec):
       if fmt_spec is None:
           val = Formatter.format_value(self, value, fmt_spec)
       else:
           # Feed the output of each stage into the next one
           val = value
           for fmt in fmt_spec.split(':'):
               val = Formatter.format_value(self, val, fmt)
       return val

I don't really think that should be the default, though.
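
As a standalone illustration of the pipelining idea, here is a sketch that
uses the builtin format() for each stage. format_pipeline is a hypothetical
name, and splitting on ':' is an assumption that breaks for specs which
themselves contain a colon (strftime-style "%H:%M", for instance):

```python
def format_pipeline(value, fmt_spec):
    """Apply each ':'-separated spec in turn, feeding results forward."""
    if not fmt_spec:
        return format(value, "")
    result = value
    for stage in fmt_spec.split(":"):
        result = format(result, stage)
    return result


# First round to two decimals, then right-align the text in 8 columns.
padded = format_pipeline(3.14159, ".2f:>8")
```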

> When using the 'fformat' variant, it is possible to omit the field name
> entirely, and simply include the conversion specifiers:
> 
> "My name is {:pad(23)}"

As suggested above, I think this should be invoked as a method on a custom 
formatter class. It would then be up to the get_value method to decide what to 
do when "field" was an empty string. (or else the formatter could just add an 
object with an appropriate __format__ method to the kwd dictionary under '')

[...]
> - The trailing letter is optional - you don't need to say '2.2d', you can
> instead just say '2.2'. If the letter is omitted, then the value will be
> converted into its 'natural' form (that is, the form that it take if str()
> or unicode() were called on it) subject to the field length and precision
> specifiers (if supplied.)

I disagree with this. In %-formatting, these format specifiers do a type
coercion before applying the formatting. The specifiers should be retained and
should continue to result in coercion to int or float or str, with the relevant
TypeErrors when that coercion isn't possible.
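
The coercion behaviour I mean is the one '%' formatting already has, as in
this short illustration:

```python
as_int = "%d" % 3.7      # the float is coerced to an integer first
as_float = "%.1f" % 2    # the int is coerced to a float first
as_str = "%s" % 42       # anything can be coerced to str

# When the coercion is impossible, a TypeError is raised.
try:
    "%d" % "forty-two"
except TypeError as exc:
    coercion_error = str(exc)
```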

> - Variable field width specifiers use a nested version of the {} syntax,
> allowing the width specifier to be either a positional or keyword argument:
> 
> "{0:{1}.{2}d}".format( a, b, c )
> 
> (Note: It might be easier to parse if these used a different type of
> delimiter, such as parens - avoiding the need to create a regex that
> handles the recursive case.)

I like this idea, but it shouldn't be handled directly in the regex. Instead, 
invoke a second pass of the regex directly over the format specifier.
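
One possible reading of the two-pass approach, as a sketch only. The outer
pattern merely tolerates one level of nested braces without interpreting
them; the nested fields are resolved by a second pass over the format
specifier. All names here are hypothetical, and the builtin format() stands
in for the real per-type formatting:

```python
import re

# Outer pass: finds a field, allowing (but not interpreting) nested {...}.
OUTER = re.compile(r"\{((?:[^{}]|\{[^{}]*\})*)\}")
# Second pass: plain fields inside the format specifier.
INNER = re.compile(r"\{([^{}]*)\}")


def lookup(field, args, kwds):
    return args[int(field)] if field.isdigit() else kwds[field]


def two_pass_format(template, *args, **kwds):
    def expand_outer(match):
        field, _, spec = match.group(1).partition(":")
        # Substitute nested fields in the spec before applying it.
        spec = INNER.sub(lambda m: str(lookup(m.group(1), args, kwds)), spec)
        return format(lookup(field, args, kwds), spec)
    return OUTER.sub(expand_outer, template)


result = two_pass_format("{0:{1}.{2}f}", 3.14159, 8, 3)
```

The example uses 'f' rather than 'd' so that the precision is meaningful.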

[...]
> For non-builtin types, the conversion specifiers will be specific to that
> type.  An example is the 'datetime' class, whose conversion specifiers are
> identical to the arguments to the strftime() function:
> 
> "Today is: {0:%x}".format( datetime.now() )

Well, more to the point it's that the format spec gets passed to the object,
and it's up to the object how to deal with it.

This implies the need for an extensible protocol, either via a __format__ 
method, or via an extensible function (such as string.Formatter.format_value).
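
A sketch of what the __format__ flavour of that protocol looks like in
practice, using the hook as it exists in current Python releases. The
Temperature class and its "degC"/"degF" codes are made up for the example:

```python
from datetime import datetime


class Temperature:
    """Hypothetical type that defines its own format codes."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        if spec == "degF":
            return "%.1fF" % (self.celsius * 9 / 5 + 32)
        if spec in ("", "degC"):
            return "%.1fC" % self.celsius
        raise ValueError("unknown format code %r" % spec)


boiling = "{0:degF} is {0:degC}".format(Temperature(100))

# datetime interprets its spec as a strftime() pattern.
stamp = "Posted on {0:%Y-%m-%d}".format(datetime(2006, 4, 22))
```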

[...]
> The 'specifiers' argument will be either a string object or a unicode
> object, depending on the type of the original format string. The __format__
> method should test the type of the specifiers parameter to determine
> whether to return a string or unicode object. It is the responsibility of
> the __format__ method to return an object of the proper type.

Py3k => this problem goes away. This is about text, so the output is always 
unicode.

> The string.format() will format each field using the following steps:
> 
> 1) First, see if there is a custom formatter.  If one has been supplied,
> see if it wishes to override the normal formatting for this field.  If so,
> then use the formatter's format() function to convert the field data.

As above - custom formatters should be separate objects that either inherit 
from the standard formatter, or simply expose an equivalent API.

> 2) Otherwise, see if the value to be formatted has a __format__ method.  If
> it does, then call it.

So an object can override standard parsing like {0:d} to return something 
other than an integer? *shudder*

Being able to add extra formatting codes for different object types sounds 
good, but being able to change the meaning of the standard codes sounds (very) 
bad.
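
To illustrate the concern: when the object's __format__ method is consulted
before the builtin handling (which is the order str.format uses in current
Python releases), nothing stops a type from doing this. Surprise is a
hypothetical class:

```python
class Surprise:
    """Hypothetical type that hijacks the standard 'd' code."""

    def __format__(self, spec):
        return "not a number" if spec == "d" else "Surprise()"


hijacked = "{0:d}".format(Surprise())
normal = "{0:d}".format(42)
```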

If supporting the former also means supporting the latter, then I'd prefer to 
leave this ability to custom formatter objects or explicit method or function 
invocations on arguments in the call to format().

> 3) Otherwise, check the internal formatter within string.format that
> contains knowledge of certain builtin types.

This should happen *before* checking for a custom __format__ method. (If we 
decide to check for a custom __format__ method at all)

> 4) Otherwise, call str() or unicode() as appropriate.

As above, this will always be str() (which will actually be equivalent to 
2.x's unicode())

> (Note: It may be that in a future version of Python, dynamic dispatch will
> be used instead of a magic __format__ method, however that is outside the
> scope of this PEP.)

What's potentially in scope for this PEP, though, is to ensure that there's a 
hook that we could potentially hang dynamic dispatch on if we decide to use 
it, and we decide that __format__ provides worthwhile functionality :)

> Custom Formatters:
> 
> If the fformat function is used, a custom formatter object must be
> supplied.  The only requirement is that it have a format() method with the
> following signature:
> 
> def format( self, value, specifier, builder )

As noted above, I think this is a really bad way to implement custom 
formatters - string.Template provides a better model.

[...]
> This particular use case could be even more useful if it were possible to
> specify attributes directly in the format string:
> 
> print "Error in file {parser.file}, line {parser.line}" \ .format(
> **locals() )

I don't think the additional complexity is worthwhile, given that this can be 
written:

   print "Error in file {0}, line {1}".format(parser.file, parser.line)
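
A runnable comparison of the two spellings, using str.format from current
Python releases (which did end up supporting attribute access in field
names). The parser object here is a stand-in built for the example:

```python
from types import SimpleNamespace

parser = SimpleNamespace(file="input.txt", line=12)

# Explicit positional arguments, as suggested above.
explicit = "Error in file {0}, line {1}".format(parser.file, parser.line)

# Attribute lookup inside the format string.
by_attr = "Error in file {p.file}, line {p.line}".format(p=parser)
```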

[...]
> One other thing that could be done to make the debugging case more
> convenient would be to allow the locals() dict to be omitted entirely.
> Thus, a format function with no arguments would instead use the current
> scope as a dictionary argument:
> 
> print "Error in file {p.file}, line {p.line}".format()

Again, I don't think this is worth the additional complexity.

[...]
> Other, more radical proposals include backquoting (`), or a new string
> prefix character (let's say 'f' for 'format'):
> 
> print f"Error in file {p.file}, line {p.line}"

A method will do the job with far more flexibility - no need to hack the parser.

[...]
> - Shell variable syntax: $name and $(name) (or in some variants, ${name}).
> This is probably the oldest convention out there, and is used by Perl and
> many others. When used without the braces, the length of the variable is
> determined by lexically scanning until an invalid character is found.
[...]
> It is my (Talin's) opinion, however, that in cases where the formatting is
> explicitly invoked, that less care needs to be taken to prevent accidental
> interpolation, in which case a lighter and less unwieldy syntax can be
> used.

I don't follow this reasoning, but my reason for not liking shell syntax is
that it doesn't handle formatting well, and there's a sharp disconnect between
cases where the lexical parsing does the right thing, and those where it
doesn't. A bracketed syntax handles all cases, and using braces for the purpose
has the benefit that braces are rarely included explicitly in output strings.
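
string.Template shows the disconnect nicely. In this small illustration
(the strings are made up), the unbracketed form silently picks up the wrong
name as soon as the variable is followed by a letter:

```python
from string import Template

values = {"item": "apple"}

# Lexical scanning swallows the trailing "s" into the variable name.
try:
    Template("I have two $items").substitute(values)
except KeyError as exc:
    scan_error = exc.args[0]

# The bracketed variant is needed to say what was meant.
bracketed = Template("I have two ${item}s").substitute(values)

# A brace-only syntax has just the one spelling.
braces = "I have two {item}s".format(**values)
```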

> - Printf and its cousins ('%'), including variations that add a field
> index, so that fields can be interpolated out of order.

My objection to these is similar to my objection to shell syntax, except the 
problem is that these handle formatting, but not cases where you want to name 
your interpolated values, or reference them by position. The problem with the 
syntax breaking down when the lexical parsing fails is similar, though - you 
still need to include a bracketed variant.

> - Other bracket-only variations. Various MUDs have used brackets (e.g.
> [name]) to do string interpolation. The Microsoft .Net libraries uses
> braces {}, and a syntax which is very similar to the one in this proposal,
> although the syntax for conversion specifiers is quite different. [2]

Simple braces, with a ':' to separate the format specifier works for me.

> - Backquoting. This method has the benefit of minimal syntactical clutter,
> however it lacks many of the benefits of a function call syntax (such as
> complex expression arguments, custom formatters, etc.)

And there's already a suggestion on the table to use backquotes to denote 
"this is a string that must be a legal Python identifier" :)

> Backwards Compatibility
> 
> Backwards compatibility can be maintained by leaving the existing
> mechanisms in place. The new system does not collide with any of the method
> names of the existing string formatting techniques, so both systems can
> co-exist until it comes time to deprecate the older system.

This reads like a 2.x PEP, not a Py3k PEP :)

What I'd suggest putting in here:

This proposal may seem to result in 3 different ways to do string
formatting, in violation of TOOWTDI. The intent is for the mechanism in this
PEP to become the OOW in Py3k, but there are good reasons to keep the other 2
approaches around.

Retaining string.Template
   string.Template is a formatting variant tailored specifically towards 
simple string substitution. It is designed with a heavy emphasis on the 
templates being written by i18n translators rather than application programmers.
   string.Template should be retained as a simplified formatting variant for 
such problem domains. For simple string substitution, string.Template really 
is a better tool than the more general mechanism in this PEP - the more 
limited feature set makes it easier to grasp for non-programmers. The use of 
shell-style syntax for interpolation variables is appropriate as it is more 
likely to be familiar to translators without Python programming experience and 
the problem domain is one where the limitations of shell-syntax aren't a 
hassle (there is no template-based formatting, and the interpolated text is 
typically separated from the surrounding text by spaces).

Retaining string %-formatting
   Removing string %-formatting would be a backwards compatibility nightmare. 
I doubt there's a Python program on the planet that would continue working if 
it was removed (I know most of mine would break in verbose mode). Even those 
which continued to work would likely break if all commented out debugging 
messages were uncommented.
   python3warn would struggle to find all cases (as it would need to be able 
to figure out when the left operand was a string) and even an instrumented 
build would leave significant room for doubt (as debug messages are often in 
rarely-exercised failure paths, which even decent unit tests might miss).
   Further, all of the formatting logic would need to be relocated to the 
implementation of the string formatting method, which while not that hard, 
would be effort that might be better expended elsewhere.
   OTOH, if string %-formatting is retained, the new format() method can rely 
on it as a low-level implementation detail, and %-formatting can continue to 
exist in that capacity - low-level formatting used for single values that are 
neither dictionaries nor tuples without having to go through the full regex 
based formatting machinery.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org