[Python-3000] PEP - string.format
Ian Bicking
ianb at colorstudy.com
Fri Apr 21 20:25:39 CEST 2006
Talin wrote:
> Talin <talin <at> acm.org> writes:
>
>
>>I decided to take some of the ideas discussed in the string formatting
>>thread, add a few touches of my own, and write up a PEP.
>>
>>http://viridia.org/python/doc/PEP_AdvancedStringFormatting.txt
>>
>>(I've also submitted the PEP via the normal channels.)
>
>
> No responses? I'm surprised...
You should have copied the PEP into the email... it was a whole click
away, thus easier to ignore ;)
The scope of this PEP will be restricted to proposals of built-in
string formatting operations (in other words, methods of the
built-in string type.) This does not obviate the need for more
sophisticated string-manipulation modules in the standard library
such as string.Template. In any case, string.Template will not be
discussed here, except to say that this proposal will most
likely have some overlapping functionality with that module.
s/module/class/
The '%' operator is primarily limited by the fact that it is a
binary operator, and therefore can take at most two arguments.
One of those arguments is already dedicated to the format string,
leaving all other variables to be squeezed into the remaining
argument. The current practice is to use either a dictionary or a
list as the second argument, but as many people have commented
[1], this lacks flexibility. The "all or nothing" approach
(meaning that one must choose between only positional arguments,
or only named arguments) is felt to be overly constraining.
A dictionary, *tuple*, or a single object. That a tuple is special is
sometimes confusing (in most other places lists can be substituted for
tuples), and that the single object can be anything but a dictionary or
tuple can also be confusing. I've seen nervous people avoid the single
object form entirely, often relying on the syntactically unappealing
single-item tuple ('' % (x,)).
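For reference, a small illustration of the special-casing described above, as the '%' operator behaves in current CPython. The variable names are made up for the example.

```python
# The right-hand operand of % is interpreted according to its type.
point = (3, 4)

# A single non-tuple object fills the one placeholder directly.
single = "value: %s" % 42

# A tuple is unpacked into positional arguments, so a tuple that should be
# formatted as one value has to be wrapped in a one-item tuple.
wrapped = "value: %s" % (point,)

# A list is not unpacked; it is formatted as a single object.
as_list = "value: %s" % [3, 4]

# Passing the bare two-item tuple to a single placeholder fails.
try:
    "value: %s" % point
    bare_tuple_error = None
except TypeError as exc:
    bare_tuple_error = exc

# A dictionary selects the named form.
named = "value: %(x)s" % {"x": 42}
```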
Brace characters ('curly braces') are used to indicate a
replacement field within the string:
"My name is {0}".format( 'Fred' )
While I've argued in an earlier thread that $var is more conventional,
honestly I don't care (except that %(var)s is not very nice). A couple
other people also preferred $var, but I don't know if they have
particularly strong opinions either.
The result of this is the string:
"My name is Fred"
The element within the braces is called a 'field name', and can either
be a number, in which case it indicates a positional argument,
or a name, in which case it indicates a keyword argument.
Braces can be escaped using a backslash:
"My name is {0} :-\{\}".format( 'Fred' )
Which would produce:
"My name is Fred :-{}"
Does } have to be escaped? Or just optionally escaped? I assume this
is not a change to string literals, so we're relying on '\{' producing
the same thing as '\\{' (which of course it does).
Each field can also specify an optional set of 'conversion
specifiers'. Conversion specifiers follow the field name, with a
colon (':') character separating the two:
"My name is {0:8}".format( 'Fred' )
The meaning and syntax of the conversion specifiers depends on the
type of object that is being formatted, however many of the
built-in types will recognize a standard set of conversion
specifiers.
The conversion specifier consists of a sequence of zero or more
characters, each of which can consist of any printable character
except for a non-escaped '}'. The format() method does not
attempt to interpret the conversion specifiers in any way; it
merely passes all of the characters between the first colon ':'
and the matching right brace ('}') to the various underlying
formatters (described later.)
Thus you can't nest formatters, e.g., {0:pad(23):xmlquote}, unless the
underlying object understands that. Which is probably unlikely.
Potentially : could be special, but \: would pass the ':' to the
underlying formatter. Then {x:pad(23):xmlquote} would mean
format(format(x, 'pad(23)'), 'xmlquote')
Also, I note that {} doesn't naturally nest in this specification, you
have to quote those as well. E.g.: {0:\{a:b\}}. But I don't really see
why you'd be inclined to use {} in a formatter anyway ([] and () seem
much more likely).
Also, some parsing will be required in these formatters, e.g., pad(23)
is not parsed in any way and so it's up to the formatter to handle that
(and may use different rules than normal Python syntax).
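A minimal sketch of the piping idea, assuming ':' separates stages and '\:' passes a literal colon through. The formatter table, the behaviour of "pad" and "xmlquote", and the helper names are all assumptions for illustration; nothing here comes from the PEP.

```python
import html
import re

# Hypothetical formatters. Each one has to parse its own argument text.
def _pad(value, arg):
    return str(value).ljust(int(arg))

def _xmlquote(value, arg):
    return html.escape(str(value))

FORMATTERS = {"pad": _pad, "xmlquote": _xmlquote}

def split_pipeline(spec):
    """Split on ':' unless it is escaped as a backslash followed by ':'."""
    parts = re.split(r"(?<!\\):", spec)
    return [part.replace("\\:", ":") for part in parts]

def apply_pipeline(value, spec):
    """Apply each stage left to right, i.e. format(format(x, a), b)."""
    for stage in split_pipeline(spec):
        match = re.fullmatch(r"(\w+)(?:\((.*)\))?", stage)
        if match is None or match.group(1) not in FORMATTERS:
            raise ValueError("unknown formatter: %r" % stage)
        name, arg = match.groups()
        value = FORMATTERS[name](value, arg)
    return value
```

Note that "pad(23)" is matched with a regex rather than parsed as Python, which is exactly the "different rules than normal Python syntax" concern.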
When using the 'fformat' variant, it is possible to omit the field
name entirely, and simply include the conversion specifiers:
"My name is {:pad(23)}"
This syntax is used to send special instructions to the custom
formatter object (such as instructing it to insert padding
characters up to a given column.) The interpretation of this
'empty' field is entirely up to the custom formatter; no
standard interpretation will be defined in this PEP.
If a custom formatter is not being used, then it is an error to
omit the field name.
This sounds similar to (?i) in a regex. I can't think of a good
use-case, though, since most commands would be localized to a specific
formatter or to the formatting object constructor. {:pad(23)} seems
like a bad example. {:upper}? Also, it applies globally (or does it?);
that is, the formatter can't detect what markers come after the command,
and which come before. So {:upper} seems like a bad example.
Standard Conversion Specifiers:
For most built-in types, the conversion specifiers will be the
same or similar to the existing conversion specifiers used with
the '%' operator. Thus, instead of '%02.2x', you will say
'{0:2.2x}'.
There are a few differences however:
- The trailing letter is optional - you don't need to say '2.2d',
you can instead just say '2.2'. If the letter is omitted, then
the value will be converted into its 'natural' form (that is,
the form that it would take if str() or unicode() were called on it)
subject to the field length and precision specifiers (if
supplied.)
- Variable field width specifiers use a nested version of the
{} syntax, allowing the width specifier to be either a
positional or keyword argument:
"{0:{1}.{2}d}".format( a, b, c )
(Note: It might be easier to parse if these used a different
type of delimiter, such as parens - avoiding the need to
create a regex that handles the recursive case.)
Ah... that's an interesting way to use nested {}. I like that.
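For what it's worth, this is how nested fields behave in str.format as it later shipped in Python. The example uses 'f' rather than 'd' because the shipped implementation rejects a precision on integer presentation types; the variable names are only illustrative.

```python
# Width and precision are taken from other arguments via nested fields.
value, width, precision = 3.14159, 10, 3

by_position = "{0:{1}.{2}f}".format(value, width, precision)
by_keyword = "{v:{w}.{p}f}".format(v=value, w=width, p=precision)
```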
A class that wishes to implement a custom interpretation of
its conversion specifiers can implement a __format__ method:
    class AST:
        def __format__( self, specifiers ):
            ...
The 'specifiers' argument will be either a string object or a
unicode object, depending on the type of the original format
string. The __format__ method should test the type of the
specifiers parameter to determine whether to return a string or
unicode object. It is the responsibility of the __format__ method
to return an object of the proper type.
If nested/piped formatting was allowed (like {0:trun(23):xmlquote}) then
it would be good if it could return any object, and str/unicode was
called on that object ultimately.
I don't know if it would be considered an abuse of formatting, but maybe
a_dict.__format__('x') could return a_dict['x']. Probably not a good idea.
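A minimal sketch of a class with its own conversion specifiers, written against the __format__ hook as it exists in current Python (where there is only one string type, so the str/unicode check is not needed). The Temperature class and its "f" specifier are invented for the example.

```python
class Temperature:
    """Hypothetical value type that interprets its own specifiers."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, specifiers):
        # "f" selects Fahrenheit; anything else, including "", is Celsius.
        if specifiers == "f":
            return "%.1fF" % (self.celsius * 9 / 5 + 32)
        return "%.1fC" % self.celsius

t = Temperature(21.5)
plain = "{0}".format(t)
fahrenheit = "{0:f}".format(t)
```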
The string.format() will format each field using the following
steps:
1) First, see if there is a custom formatter. If one has been
supplied, see if it wishes to override the normal formatting
for this field. If so, then use the formatter's format()
function to convert the field data.
2) Otherwise, see if the value to be formatted has a
__format__ method. If it does, then call it.
3) Otherwise, check the internal formatter within
string.format that contains knowledge of certain builtin
types.
If it is a language change, could all those types have __format__
methods added? Is there any way for the object to accept or decline to
do formatting?
4) Otherwise, call str() or unicode() as appropriate.
Is there a global repr() formatter, like %r? Potentially {0:repr} could
be implemented the same way by convention, including in object.__format__?
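The lookup order above can be sketched roughly as follows. This is not the PEP's implementation: the function name, the "return None to decline" protocol for the custom formatter, and the global handling of a "repr" specifier are assumptions made for the sketch. Steps 2 and 3 collapse into one here because in current Python the built-in types do carry __format__ methods.

```python
def format_field(value, specifier, custom=None):
    """Sketch of the per-field lookup order described in the draft."""
    # 1) A custom formatter gets first refusal. Returning None to decline
    #    is an assumption of this sketch.
    if custom is not None:
        result = custom(value, specifier)
        if result is not None:
            return result
    # The "repr" convention floated in the reply, handled globally here.
    if specifier == "repr":
        return repr(value)
    # 2) and 3) Use __format__ when some class other than object defines it.
    for cls in type(value).__mro__:
        if cls is not object and "__format__" in vars(cls):
            return cls.__format__(value, specifier)
    # 4) Otherwise fall back to str().
    return str(value)
```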
Custom Formatters:
If the fformat function is used, a custom formatter object
must be supplied. The only requirement is that it have a
format() method with the following signature:
def format( self, value, specifier, builder )
This function will be called once for each interpolated value.
The parameter values will be:
'value' - the value that is to be formatted.
'specifier' - a string or unicode object containing the
conversion specifiers from the template string.
'builder' - contains the partially constructed string, in whatever
form is most efficient - most likely the builder value will be
a mutable array or buffer which can be efficiently appended to,
and which will eventually be converted into an immutable string.
What's the use case for this argument?
The formatter should examine the type of the object and the
specifier string, and decide whether or not it wants to handle
this field. If it decides not to, then it should return False
to indicate that the default formatting for that field should be
used; Otherwise, it should call builder.append() (or whatever
is the appropriate method) to concatenate the converted value
to the end of the string, and return True.
Well, I guess this is the use case, but it feels a bit funny to me. A
concrete use case would be appreciated.
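One possible shape for this protocol, as a toy sketch only. The UpperStrings formatter, the fformat driver, and the pre-split template are all hypothetical; a plain list stands in for the 'builder' buffer, and no real template parsing is attempted.

```python
class UpperStrings:
    """Hypothetical custom formatter that only handles str values."""

    def format(self, value, specifier, builder):
        if not isinstance(value, str):
            return False              # decline: default formatting is used
        builder.append(value.upper())
        return True                   # handled: text was appended

def fformat(template_parts, formatter):
    """Toy driver. template_parts mixes literal text with
    (value, specifier) pairs instead of parsing a format string."""
    builder = []                      # stands in for the mutable buffer
    for part in template_parts:
        if isinstance(part, tuple):
            value, specifier = part
            if not formatter.format(value, specifier, builder):
                builder.append(format(value, specifier))
        else:
            builder.append(part)
    return "".join(builder)

result = fformat(["name=", ("fred", ""), " age=", (7, "03d")], UpperStrings())
```

Passing the builder lets the formatter append directly and saves an intermediate string per field, which may be the intended use case.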
Optional Feature: locals() support
This feature is ancillary to the main proposal. Often when
debugging, it may be convenient to simply use locals() as
a dictionary argument:
print "Error in file {file}, line {line}".format( **locals() )
This particular use case could be even more useful if it were
possible to specify attributes directly in the format string:
print "Error in file {parser.file}, line {parser.line}" \
    .format( **locals() )
It is probably not desirable to support execution of arbitrary
expressions within string fields - history has shown far too
many security holes that leveraged the ability of scripting
languages to do this.
A fairly high degree of convenience for relatively small risk can
be obtained by supporting the getattr (.) and getitem ([])
operators. While it is certainly possible that these operators
can be overloaded in a way that a maliciously written string could
exploit their behavior in nasty ways, it is fairly rare that those
operators do anything more than retargeting to another container.
On the other hand, the ability of a string to execute function
calls would be quite dangerous by comparison.
It could be a keyword option to enable this. Though all the keywords
are kind of taken. This itself wouldn't be an issue if ** wasn't going
to be used so often.
And/or the custom formatter could do the lookup, and so a formatter may
or may not do getattr's.
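As an illustration, str.format as it later shipped supports exactly this restricted form: attribute and item lookups work, calls do not. The objects below are made up for the example.

```python
from types import SimpleNamespace

parser = SimpleNamespace(file="grammar.y", line=12)
counts = {"errors": 3}
rows = ["header", "body"]

message = "Error in file {parser.file}, line {parser.line}".format(parser=parser)
lookup = "{counts[errors]} errors, first row {rows[0]}".format(counts=counts, rows=rows)

# Calls are not part of the field-name grammar, so "upper()" is looked up
# as a literal attribute name and the lookup fails.
try:
    "{name.upper()}".format(name="fred")
    call_error = None
except AttributeError as exc:
    call_error = exc
```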
One other thing that could be done to make the debugging case
more convenient would be to allow the locals() dict to be omitted
entirely. Thus, a format function with no arguments would instead
use the current scope as a dictionary argument:
print "Error in file {p.file}, line {p.line}".format()
An alternative would be to dedicate a special method name, other
than 'format' - say, 'interpolate' or 'lformat' - for this
behavior.
It breaks some conventions to have a method that looks into the parent
frame, but the use cases for this are very strong. Also, if attribute
access were controlled by a keyword argument, it could potentially be
turned on by default when using the form that pulls from locals().
Unlike a string prefix, you can't tell that the template string itself
was directly in the source code, so this could encourage some potential
security holes (though it's not necessarily insecure).
This would require some stack-frame hacking in order that format
be able to get access to the scope of the calling function.
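A minimal sketch of the frame access this would involve, assuming CPython (sys._getframe is an implementation detail, not a language guarantee). The name 'lformat' is one of the names suggested above; writing it as a free function rather than a string method is a simplification.

```python
import sys

def lformat(template):
    """Hypothetical 'lformat': interpolate from the caller's local scope."""
    caller = sys._getframe(1)            # frame of whoever called lformat
    return template.format(**caller.f_locals)

def report():
    file, line = "grammar.y", 12
    return lformat("Error in file {file}, line {line}")

message = report()
```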
Other, more radical proposals include backquoting (`), or a new
string prefix character (let's say 'f' for 'format'):
print f"Error in file {p.file}, line {p.line}"
This prefix character could of course be combined with any of the
other existing prefix characters (r, u, etc.)
This does address the security issue. The 'f' reads better than the '$'
prefix previously suggested, IMHO. Syntax highlighting can also be
applied this way.
(This also has the benefit of allowing Python programmers to quip
that they can use "print f debugging", just like C programmers.)
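For comparison, an 'f' prefix along these lines did eventually land in Python 3.6. The sketch below uses that shipped syntax, which goes further than this draft by allowing arbitrary expressions in the braces; the object 'p' is made up for the example.

```python
from types import SimpleNamespace

p = SimpleNamespace(file="grammar.y", line=12)

# Fields are resolved from the enclosing scope at the point of the literal.
message = f"Error in file {p.file}, line {p.line}"

# The prefix combines with 'r': the backslash stays literal.
raw = rf"{p.file}\n"
```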
Alternate Syntax
Naturally, one of the most contentious issues is the syntax
of the format strings, and in particular the markup conventions
used to indicate fields.
Rather than attempting to exhaustively list all of the various
proposals, I will cover the ones that are most widely used
already.
- Shell variable syntax: $name and $(name) (or in some variants,
${name}). This is probably the oldest convention out there,
and is used by Perl and many others. When used without
the braces, the length of the variable is determined by
lexically scanning until an invalid character is found.
This scheme is generally used in cases where interpolation is
implicit - that is, in environments where any string can
contain interpolation variables, and no special substitution
function need be invoked. In such cases, it is important to
prevent the interpolation behavior from occurring accidentally,
so the '$' (which is otherwise a relatively uncommonly-used
character) is used to signal when the behavior should occur.
It is my (Talin's) opinion, however, that in cases where the
formatting is explicitly invoked, that less care needs to be
taken to prevent accidental interpolation, in which case a
lighter and less unwieldy syntax can be used.
I don't think accidental problems with $ are that big a deal. They
don't occur that often, and it's pretty obvious to the eye when they
exist. "$lengthin" is pretty clearly not right compared to
"${length}in". However, nervous shell programmers often use ${}
everywhere, regardless of need, so this is likely to introduce style
differences between programmers (some will always use ${}, some will
remove {}'s whenever possible).
However, it can be reasonably argued that {} is just as readable and
easy to work with as $, and it avoids the need to do any changes as you
reformat the string (possibly introducing or removing ambiguity), or add
formatting.
- Printf and its cousins ('%'), including variations that add
a field index, so that fields can be interpolated out of
order.
- Other bracket-only variations. Various MUDs have used brackets
(e.g. [name]) to do string interpolation. The Microsoft .Net
libraries use braces {}, and a syntax which is very similar
to the one in this proposal, although the syntax for conversion
specifiers is quite different. [2]
Many languages use {}, including PHP and Ruby, and even $ uses it on
some level. The details differ, but {} exists nearly everywhere in some
fashion.
- Backquoting. This method has the benefit of minimal syntactical
clutter, however it lacks many of the benefits of a function
call syntax (such as complex expression arguments, custom
formatters, etc.)
It doesn't have any natural nesting, nor any way to immediately see the
difference between opening and closing an expression. It also implies a
relation to shell ``, which evaluates the contents. I don't see any
benefit to backquotes.
Personally I'm very uncomfortable with using str.format(**args) for all
named substitution. It removes the possibility of non-enumerable
dictionary-like objects, and requires a dictionary copy whenever an
actual dictionary is used.
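To make the concern concrete: a dictionary-like object that can look keys up but cannot enumerate them cannot be passed with **. The Lazy class is hypothetical; str.format_map, used for contrast, was added later (Python 3.2) and takes the mapping directly without copying it.

```python
class Lazy:
    """Dictionary-like object that resolves keys but cannot list them."""

    def __getitem__(self, key):
        return "<%s>" % key

lazy = Lazy()

# format_map only needs __getitem__ and does not copy the mapping.
mapped = "{user} on {host}".format_map(lazy)

# ** unpacking needs keys(), so the same object cannot be splatted.
try:
    "{user} on {host}".format(**lazy)
    splat_error = None
except TypeError as exc:
    splat_error = exc
```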
In the case of positional arguments it is currently an error if you
don't use all your positional arguments with %. Would it be an error in
this case?
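The two behaviours side by side, using str.format as it later shipped for the second half (the draft itself does not say):

```python
# With %, a surplus positional argument is an error.
try:
    "%s" % (1, 2)
    percent_error = None
except TypeError as exc:
    percent_error = exc

# str.format, as shipped, silently ignores the surplus argument.
ignored = "{0}".format(1, 2)
```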
Should the custom formatter get any opportunity to finalize the
formatted string (e.g., "here's the finished string, give me what you
want to return")?
--
Ian Bicking / ianb at colorstudy.com / http://blog.ianbicking.org