[Python-3000] PEP - string.format

Fri Apr 21 20:25:39 CEST 2006

Talin wrote:
> Talin <talin <at> acm.org> writes:
> 
> 
>>I decided to take some of the ideas discussed in the string formatting
>>thread, add a few touches of my own, and write up a PEP.
>>
>>http://viridia.org/python/doc/PEP_AdvancedStringFormatting.txt
>>
>>(I've also submitted the PEP via the normal channels.)
> 
> 
> No responses? I'm surprised...

You should have copied the PEP into the email... it was a whole click 
away, thus easier to ignore ;)

     The scope of this PEP will be restricted to proposals of built-in
     string formatting operations (in other words, methods of the
     built-in string type.)  This does not obviate the need for more
     sophisticated string-manipulation modules in the standard library
     such as string.Template.  In any case, string.Template will not be
     discussed here, except to say that this proposal will most
     likely have some overlapping functionality with that module.

s/module/class/

     The '%' operator is primarily limited by the fact that it is a
     binary operator, and therefore can take at most two arguments.
     One of those arguments is already dedicated to the format string,
     leaving all other variables to be squeezed into the remaining
     argument.  The current practice is to use either a dictionary or a
     list as the second argument, but as many people have commented
     [1], this lacks flexibility.  The "all or nothing" approach
     (meaning that one must choose between only positional arguments,
     or only named arguments) is felt to be overly constraining.

A dictionary, *tuple*, or a single object.  That a tuple is special is 
sometimes confusing (in most other places lists can be substituted for 
tuples), and that the single object can be anything but a dictionary or 
tuple can also be confusing.  I've seen nervous people avoid the single 
object form entirely, often relying on the syntactically unappealing 
single-item tuple ('' % (x,)).
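
To make the confusion concrete, here is a small sketch using the existing '%' operator. The variable names are only illustrative.

```python
# The right-hand operand of '%' is treated specially when it is a tuple
# or a dictionary; anything else is taken as a single value.
point = (1, 2)

# A one-item tuple wraps the value so it is always treated as one argument.
safe = "%s" % (point,)

# A list is not unpacked the way a tuple is; it is formatted as one object.
as_list = "%s" % [1, 2]

# A bare tuple is unpacked, so a 2-tuple does not fit a single '%s'.
try:
    "%s" % point
    unpacked_ok = True
except TypeError:
    unpacked_ok = False

# A dictionary is consulted when the format uses %(name)s keys.
named = "%(x)s,%(y)s" % {"x": 1, "y": 2}
```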

     Brace characters ('curly braces') are used to indicate a
     replacement field within the string:

         "My name is {0}".format( 'Fred' )

While I've argued in an earlier thread that $var is more conventional, 
honestly I don't care (except that %(var)s is not very nice).  A couple 
other people also preferred $var, but I don't know if they have 
particularly strong opinions either.

     The result of this is the string:

         "My name is Fred"

     The element within the braces is called a 'field name'.  It can either
     be a number, in which case it indicates a positional argument,
     or a name, in which case it indicates a keyword argument.

     Braces can be escaped using a backslash:

         "My name is {0} :-\{\}".format( 'Fred' )

     Which would produce:

         "My name is Fred :-{}"

Does } have to be escaped?  Or just optionally escaped?  I assume this 
is not a change to string literals, so we're relying on '\{' producing 
the same thing as '\\{' (which of course it does).

     Each field can also specify an optional set of 'conversion
     specifiers'.  Conversion specifiers follow the field name, with a
     colon (':') character separating the two:

         "My name is {0:8}".format( 'Fred' )

     The meaning and syntax of the conversion specifiers depend on the
     type of object that is being formatted; however, many of the
     built-in types will recognize a standard set of conversion
     specifiers.

     The conversion specifier consists of a sequence of zero or more
     characters, each of which can consist of any printable character
     except for a non-escaped '}'.  The format() method does not
     attempt to interpret the conversion specifiers in any way; it
     merely passes all of the characters between the first colon ':'
     and the matching right brace ('}') to the various underlying
     formatters (described later.)

Thus you can't nest formatters, e.g., {0:pad(23):xmlquote}, unless the 
underlying object understands that.  Which is probably unlikely.

Potentially : could be special, but \: would pass the ':' to the
underlying formatter.  Then {x:pad(23):xmlquote} would mean 
format(format(x, 'pad(23)'), 'xmlquote')
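
A rough sketch of that piping idea, for illustration only. The pad and xmlquote formatters, the FORMATTERS table and apply_pipeline are all hypothetical names; nothing like them is defined by the PEP.

```python
import re

def pad(value, width):
    # Hypothetical formatter: left-justify the text to a given width.
    return str(value).ljust(int(width))

def xmlquote(value):
    # Hypothetical formatter: escape the characters XML treats specially.
    text = str(value)
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

FORMATTERS = {"pad": pad, "xmlquote": xmlquote}

def apply_pipeline(value, spec):
    # Split on ':' unless it is preceded by a backslash, then feed the
    # result of each stage into the next one.
    for stage in re.split(r"(?<!\\):", spec):
        stage = stage.replace("\\:", ":")
        match = re.fullmatch(r"(\w+)(?:\((.*)\))?", stage)
        if match is None:
            raise ValueError("cannot parse formatter stage %r" % stage)
        name, arg = match.group(1), match.group(2)
        func = FORMATTERS[name]
        value = func(value) if arg is None else func(value, arg)
    return value

piped = apply_pipeline("a<b", "pad(5):xmlquote")
```

Note that each stage still has to parse its own argument, which is the parsing burden mentioned below.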

Also, I note that {} doesn't naturally nest in this specification, you 
have to quote those as well.  E.g.: {0:\{a:b\}}.  But I don't really see 
why you'd be inclined to use {} in a formatter anyway ([] and () seem 
much more likely).

Also, some parsing will be required in these formatters, e.g., pad(23) 
is not parsed in any way and so it's up to the formatter to handle that 
(and may use different rules than normal Python syntax).

     When using the 'fformat' variant, it is possible to omit the field
     name entirely, and simply include the conversion specifiers:

         "My name is {:pad(23)}"

     This syntax is used to send special instructions to the custom
     formatter object (such as instructing it to insert padding
     characters up to a given column.)  The interpretation of this
     'empty' field is entirely up to the custom formatter; no
     standard interpretation will be defined in this PEP.

     If a custom formatter is not being used, then it is an error to
     omit the field name.

This sounds similar to (?i) in a regex.  I can't think of a good 
use-case, though, since most commands would be localized to a specific 
formatter or to the formatting object constructor.  {:pad(23)} seems 
like a bad example.  {:upper}?  Also, it applies globally (or does it?); 
that is, the formatter can't detect what markers come after the command, 
and which come before.  So {:upper} seems like a bad example.

     Standard Conversion Specifiers:

     For most built-in types, the conversion specifiers will be the
     same or similar to the existing conversion specifiers used with
     the '%' operator.  Thus, instead of '%02.2x', you will say
     '{0:2.2x}'.

     There are a few differences however:

     - The trailing letter is optional - you don't need to say '2.2d',
       you can instead just say '2.2'. If the letter is omitted, then
       the value will be converted into its 'natural' form (that is,
       the form that it would take if str() or unicode() were called on it)
       subject to the field length and precision specifiers (if
       supplied.)

     - Variable field width specifiers use a nested version of the
       {} syntax, allowing the width specifier to be either a
       positional or keyword argument:

         "{0:{1}.{2}d}".format( a, b, c )

       (Note: It might be easier to parse if these used a different
       type of delimiter, such as parens - avoiding the need to
       create a regex that handles the recursive case.)

Ah... that's an interesting way to use nested {}.  I like that.
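
A short sketch of the nested form, assuming the semantics of the str.format() method found in later Python releases (the draft quoted here may differ). It uses 'f' rather than 'd' because those releases reject a precision for integer conversions.

```python
value, width, precision = 3.14159, 8, 2

# The inner {1} and {2} are substituted first, producing the spec "8.2f".
nested = "{0:{1}.{2}f}".format(value, width, precision)

# The same result with the width and precision written out literally.
literal = "{0:8.2f}".format(value)
```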

     A class that wishes to implement a custom interpretation of
     its conversion specifiers can implement a __format__ method:

     class AST:
         def __format__( self, specifiers ):
             ...

     The 'specifiers' argument will be either a string object or a
     unicode object, depending on the type of the original format
     string. The __format__ method should test the type of the
     specifiers parameter to determine whether to return a string or
     unicode object. It is the responsibility of the __format__ method
     to return an object of the proper type.

If nested/piped formatting was allowed (like {0:trun(23):xmlquote}) then 
it would be good if it could return any object, and str/unicode was 
called on that object ultimately.

I don't know if it would be considered an abuse of formatting, but maybe 
a_dict.__format__('x') could return a_dict['x'].  Probably not a good idea.
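
For reference, a minimal sketch of a class with a __format__ method. The Temperature class and its trailing 'F' specifier are invented for illustration, and the sketch assumes the built-in format() and str.format() of later Python releases.

```python
class Temperature:
    # Hypothetical example class; only the __format__ hook comes from the PEP.
    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, specifiers):
        # Interpret a trailing 'F' ourselves and pass the rest of the
        # specifier through to ordinary float formatting.
        if specifiers.endswith("F"):
            fahrenheit = self.celsius * 9 / 5 + 32
            return format(fahrenheit, specifiers[:-1]) + "F"
        return format(self.celsius, specifiers) + "C"

boiling = Temperature(100.0)
as_celsius = "{0:.1f}".format(boiling)
as_fahrenheit = "{0:.1fF}".format(boiling)
```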

     The string.format() will format each field using the following
     steps:

      1) First, see if there is a custom formatter.  If one has been
         supplied, see if it wishes to override the normal formatting
         for this field.  If so, then use the formatter's format()
         function to convert the field data.

      2) Otherwise, see if the value to be formatted has a
         __format__ method.  If it does, then call it.

      3) Otherwise, check the internal formatter within
         string.format that contains knowledge of certain builtin
         types.

If it is a language change, could all those types have __format__ 
methods added?  Is there any way for the object to accept or decline to 
do formatting?

      4) Otherwise, call str() or unicode() as appropriate.

Is there a global repr() formatter, like %r?  Potentially {0:repr} could 
be implemented the same way by convention, including in object.__format__?
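
The four steps could be sketched roughly as follows. This is an illustration, not the PEP's implementation: format_field and BUILTIN_TYPES are hypothetical names, the builder is assumed to be a plain list, and because built-in types in later Python releases already have __format__ methods, steps 2 and 3 are kept apart artificially.

```python
BUILTIN_TYPES = (int, float, str)

def format_field(value, specifier, custom_formatter=None):
    # Step 1: give the custom formatter, if any, the first chance.
    if custom_formatter is not None:
        builder = []
        if custom_formatter.format(value, specifier, builder):
            return "".join(builder)
    # Step 2: use the value's own __format__ method if its class defines one.
    if not isinstance(value, BUILTIN_TYPES):
        method = type(value).__format__
        if method is not object.__format__:
            return method(value, specifier)
    # Step 3: fall back on built-in knowledge of certain types
    # (delegated here to the built-in format() function).
    if isinstance(value, BUILTIN_TYPES):
        return format(value, specifier)
    # Step 4: otherwise use the plain string conversion.
    return str(value)
```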

     Custom Formatters:

     If the fformat function is used, a custom formatter object
     must be supplied.  The only requirement is that it have a
     format() method with the following signature:

         def format( self, value, specifier, builder )

     This function will be called once for each interpolated value.
     The parameter values will be:

     'value' - the value that is to be formatted.

     'specifier' - a string or unicode object containing the
     conversion specifiers from the template string.

     'builder' - contains the partially constructed string, in whatever
     form is most efficient - most likely the builder value will be
     a mutable array or buffer which can be efficiently appended to,
     and which will eventually be converted into an immutable string.

What's the use case for this argument?

     The formatter should examine the type of the object and the
     specifier string, and decide whether or not it wants to handle
     this field. If it decides not to, then it should return False
     to indicate that the default formatting for that field should be
     used; otherwise, it should call builder.append() (or whatever
     is the appropriate method) to concatenate the converted value
     to the end of the string, and return True.

Well, I guess this is the use case, but it feels a bit funny to me.  A 
concrete use case would be appreciated.
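
One possible use, sketched under the assumption that the builder is a plain list of string fragments: padding to a column, as in the {:pad(23)} example, needs to know how much has already been written. ColumnFormatter is a hypothetical name, and passing None as the value of an empty field is also an assumption.

```python
import re

class ColumnFormatter:
    # Hypothetical custom formatter implementing a 'pad(N)' instruction.
    def format(self, value, specifier, builder):
        match = re.fullmatch(r"pad\((\d+)\)", specifier)
        if match is None:
            # Decline, so that the default formatting handles this field.
            return False
        column = int(match.group(1))
        written = sum(len(piece) for piece in builder)
        # Append spaces until the output reaches the requested column.
        builder.append(" " * max(0, column - written))
        return True

builder = ["Name:"]
formatter = ColumnFormatter()
handled = formatter.format(None, "pad(10)", builder)
builder.append("Fred")
declined = formatter.format(42, "d", builder)
line = "".join(builder)
```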

     Optional Feature: locals() support

     This feature is ancillary to the main proposal.  Often when
     debugging, it may be convenient to simply use locals() as
     a dictionary argument:

         print "Error in file {file}, line {line}".format( **locals() )

     This particular use case could be even more useful if it were
     possible to specify attributes directly in the format string:

         print "Error in file {parser.file}, line {parser.line}" \
             .format( **locals() )

     It is probably not desirable to support execution of arbitrary
     expressions within string fields - history has shown far too
     many security holes that leveraged the ability of scripting
     languages to do this.

     A fairly high degree of convenience for relatively small risk can
     be obtained by supporting the getattr (.) and getitem ([])
     operators.  While it is certainly possible that these operators
     can be overloaded in a way that a maliciously written string could
     exploit their behavior in nasty ways, it is fairly rare that those
     operators do anything more than retargeting to another container.
     On the other hand, the ability of a string to execute function
     calls would be quite dangerous by comparison.
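
A small sketch of what the getattr and getitem support looks like in use, assuming the behaviour of str.format() in later Python releases. The Parser class and the report function are hypothetical.

```python
class Parser:
    # Hypothetical object standing in for the 'parser' of the example above.
    def __init__(self, file, line):
        self.file = file
        self.line = line

def report(parser, tokens):
    # '.' performs attribute lookup and '[]' performs item lookup;
    # neither syntax is able to call a function.
    template = "Error in file {parser.file}, line {parser.line}: {tokens[0]}"
    return template.format(**locals())

message = report(Parser("spam.py", 12), ["unexpected EOF"])
```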

It could be a keyword option to enable this.  Though all the keywords 
are kind of taken.  This itself wouldn't be an issue if ** wasn't going 
to be used so often.

And/or the custom formatter could do the lookup, and so a formatter may 
or may not do getattr's.

     One other thing that could be done to make the debugging case
     more convenient would be to allow the locals() dict to be omitted
     entirely.  Thus, a format function with no arguments would instead
     use the current scope as a dictionary argument:

         print "Error in file {p.file}, line {p.line}".format()

     An alternative would be to dedicate a special method name, other
     than 'format' - say, 'interpolate' or 'lformat' - for this
     behavior.

It breaks some conventions to have a method that looks into the parent
frame, but the use cases for this are very strong.  Also, if attribute
access were controlled by a keyword argument, it could potentially be
turned on by default when using the form that pulls from locals().

Unlike a string prefix, you can't tell that the template string itself 
was directly in the source code, so this could encourage some potential 
security holes (though it's not necessarily insecure).

     This would require some stack-frame hacking in order that format
     be able to get access to the scope of the calling function.
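
The stack-frame hacking might look roughly like this. It is a sketch only: lformat is a hypothetical helper name, and sys._getframe() is a CPython implementation detail that other interpreters need not provide.

```python
import sys

def lformat(template):
    # Hypothetical helper: look one frame up the stack and use the
    # caller's local variables as the keyword arguments.
    caller = sys._getframe(1)
    return template.format(**dict(caller.f_locals))

def check(path, line):
    return lformat("Error in file {path}, line {line}")

message = check("spam.py", 12)
```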

     Other, more radical proposals include backquoting (`), or a new
     string prefix character (let's say 'f' for 'format'):

         print f"Error in file {p.file}, line {p.line}"

     This prefix character could of course be combined with any of the
     other existing prefix characters (r, u, etc.)

This does address the security issue.  The 'f' reads better than the '$' 
prefix previously suggested, IMHO.  Syntax highlighting can also be
applied this way.

     (This also has the benefit of allowing Python programmers to quip
     that they can use "print f debugging", just like C programmers.)

     Alternate Syntax

     Naturally, one of the most contentious issues is the syntax
     of the format strings, and in particular the markup conventions
     used to indicate fields.

     Rather than attempting to exhaustively list all of the various
     proposals, I will cover the ones that are most widely used
     already.

     - Shell variable syntax: $name and $(name) (or in some variants,
       ${name}). This is probably the oldest convention out there,
       and is used by Perl and many others. When used without
       the braces, the length of the variable is determined by
       lexically scanning until an invalid character is found.

       This scheme is generally used in cases where interpolation is
       implicit - that is, in environments where any string can
       contain interpolation variables, and no special substitution
       function need be invoked. In such cases, it is important to
       prevent the interpolation behavior from occurring accidentally,
       so the '$' (which is otherwise a relatively uncommonly-used
       character) is used to signal when the behavior should occur.

       It is my (Talin's) opinion, however, that in cases where the
       formatting is explicitly invoked, less care needs to be
       taken to prevent accidental interpolation, in which case a
       lighter and less unwieldy syntax can be used.

I don't think accidental problems with $ are that big a deal.  They 
don't occur that often, and it's pretty obvious to the eye when they 
exist.  "$lengthin" is pretty clearly not right compared to 
"${length}in".  However, nervous shell programmers often use ${} 
everywhere, regardless of need, so this is likely to introduce style 
differences between programmers (some will always use ${}, some will 
remove {}'s whenever possible).

However, it can reasonably be argued that {} is just as readable and
easy to work with as $, and it avoids the need to make any changes as you
reformat the string (possibly introducing or removing ambiguity), or add
formatting.
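
The "$lengthin" case can be reproduced with the existing string.Template class. The last line assumes the brace syntax behaves as in the str.format() of later Python releases.

```python
from string import Template

# With braces the variable name ends exactly where the author intended.
braced = Template("${length}in").substitute(length=5)

# Without braces the scan continues through 'in', so the name looked up
# is 'lengthin', which is missing.
try:
    Template("$lengthin").substitute(length=5)
    missing = None
except KeyError as exc:
    missing = exc.args[0]

# The brace syntax always delimits the name, so no such ambiguity arises.
plain = "{length}in".format(length=5)
```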

     - Printf and its cousins ('%'), including variations that add
       a field index, so that fields can be interpolated out of
       order.

     - Other bracket-only variations. Various MUDs have used brackets
       (e.g. [name]) to do string interpolation. The Microsoft .Net
       libraries use braces {}, and a syntax which is very similar
       to the one in this proposal, although the syntax for conversion
       specifiers is quite different. [2]

Many languages use {}, including PHP and Ruby, and even $ uses it on 
some level.  The details differ, but {} exists nearly everywhere in some 
fashion.

     - Backquoting. This method has the benefit of minimal syntactical
       clutter; however, it lacks many of the benefits of a function
       call syntax (such as complex expression arguments, custom
       formatters, etc.)

It doesn't have any natural nesting, nor any way to immediately see the 
difference between opening and closing an expression.  It also implies a 
relation to shell ``, which evaluates the contents.  I don't see any 
benefit to backquotes.

Personally I'm very uncomfortable with using str.format(**args) for all 
named substitution.  It removes the possibility of non-enumerable 
dictionary-like objects, and requires a dictionary copy whenever an 
actual dictionary is used.
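
Both concerns can be demonstrated with ordinary function calls. LazyLookup and receive are hypothetical names used only for this sketch.

```python
class LazyLookup:
    # Hypothetical dictionary-like object that can answer lookups but
    # cannot list its keys.
    def __getitem__(self, name):
        return name.upper()

def receive(**kwargs):
    return kwargs

original = {"name": "Fred"}
# '**' builds a fresh dictionary for the call instead of passing the original.
copied = receive(**original)

# An object without keys() cannot be expanded with '**' at all.
try:
    receive(**LazyLookup())
    expandable = True
except TypeError:
    expandable = False

# The '%' operator, by contrast, only needs __getitem__.
via_percent = "%(name)s" % LazyLookup()
```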

In the case of positional arguments it is currently an error if you 
don't use all your positional arguments with %.  Would it be an error in 
this case?

Should the custom formatter get any opportunity to finalize the 
formatted string (e.g., "here's the finished string, give me what you 
want to return")?

-- 
Ian Bicking  /  ianb at colorstudy.com  /  http://blog.ianbicking.org