[Python-3000] Proposed changes to PEP3101 advanced string formatting -- please discuss and vote!

Tue Mar 13 03:47:21 CET 2007

Eric Smith and I have a reasonable first-cut of a C implementation for
Talin's PEP3101 (it runs as an extension module and has been tested on
Python 2.3, 2.4, and 3.0) along with some test cases.  It's sort of
experimental, in that it mostly implements the PEP, but also
implements a few possible enhancements and changes, to give us (well,
me, mostly) an idea of what works and what doesn't, and ideas about
changes that might be useful.

This list of potential changes to the PEP is in order of (what I
believe to be) most contentious first.  (It will be interesting to
contrast that with the actual votes :).  I apologize for the long
list, but it's quite a comprehensive PEP, and the implementation work
Eric and I have done has probably encompassed the most critical
examination of the PEP to date, so here goes:

Feature:  Alternate syntaxes for escape to markup.

There is naturally a strong desire to resist having more than one way
to do things.  However, in my experience using Python to generate lots
of text (which experience is why I volunteered to work on the PEP
implementation in the first place!), I have found that there are
different application domains which naturally lend themselves to
different approaches for escape to markup.  The issues are:

- In some application domains, the number of {} characters in the
actual character data is dizzying.  This leads to two problems: first,
there are an inordinate number of { and } characters to escape by
doubling; second (and even more troubling), it can be very difficult
to locate the markup inside the text file without an editor (and
user!) that understands fancy regexps.

- In other application domains, the sheer number of {} characters is
not so bad (and markup can be readily noticed), but the use of {{ for
{ is very confusing, because { has a particular technical meaning
(rather than just offsetting some natural language text), and it is
hard for people to mentally parse the underlying document when the
braces are doubled.

To deal with these issues, I propose having a total of 3 markup
syntaxes, with one of them being the default, and with a readable,
defined method for the string to declare it is using one of the other
markup syntaxes.  (If this is accepted, I feel that EIBTI says the
string itself should declare if it is using other than the standard
markup transition sequence.  This also makes it easier for any
automated tools to understand the insides of the strings.)

The default markup syntax is the one proposed in the initial PEP,
where literal { and } characters are denoted by doubling them.  It
works well for many problem domains, and is the best for short strings
(where it would be burdensome to have the string declare the markup
syntax it is using).
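
A minimal sketch of the doubling rule, assuming a Python in which
format() is available as a string method (the template text is made up
for illustration):

```python
# Literal braces are written doubled; single braces mark a field.
template = "struct {name} {{ int {field}; }};"
result = template.format(name="point", field="x")
print(result)
```

Even in this short template, the doubled braces already make the
underlying C text harder to read, which is the second issue above.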

The second method is the well-understood ${} syntax.  The $ is easy to
find in a sea of { characters, and the only special handling required
is that every $ must be doubled.
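
As a point of comparison (this is not the implementation described
here), the standard library's string.Template already follows a
similar convention: ${name} marks a substitution, a literal $ is
written as $$, and braces need no escaping.  A minimal sketch:

```python
from string import Template

# Braces in the character data are left alone; only $ is special.
template = Template("void ${func}() { cost = $$5; }")
result = template.substitute(func="init")
print(result)
```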

The third method is something I came up with a couple of years ago
when I was generating a lot of C code from templates.  It would make
any non-Python programmer blanch, because it relies on significant
whitespace, but it made for very readable technical templates.  With
this method "{foo}" escapes to markup, but when there is whitespace
after the leading "{", e.g. "{ foo}", the brace is not an escape to
markup.  If the whitespace is a space, it is removed from the output,
but if it is '\r', '\n', or '\t', then it is left in the output.  The
result is that, for example, most braces in most C texts do not need
to have spaces inserted after them, because they immediately precede a
newline.
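
The rule for the opening brace can be sketched as a small translator
from this third syntax into the default doubled-brace syntax.  This is
only an illustration, not the C parser: the name to_default_syntax is
hypothetical, and two behaviors are assumptions of the sketch because
the description above does not cover them (a "}" is treated as literal
unless it closes an open field, and a "{" at the very end of the
string is treated as literal).

```python
def to_default_syntax(template):
    """Translate the whitespace-sensitive syntax to doubled braces."""
    out = []
    in_field = False
    i = 0
    while i < len(template):
        ch = template[i]
        nxt = template[i + 1] if i + 1 < len(template) else ""
        if ch == "{" and not in_field:
            if nxt == " ":
                out.append("{{")   # literal brace; the marker space is dropped
                i += 2
                continue
            if nxt in ("", "\r", "\n", "\t"):
                out.append("{{")   # literal brace; the whitespace is kept
            else:
                out.append("{")    # escape to markup
                in_field = True
        elif ch == "}":
            if in_field:
                out.append("}")    # closes the current field
                in_field = False
            else:
                out.append("}}")   # assumption: stray "}" is literal
        else:
            out.append(ch)
        i += 1
    return "".join(out)

c_template = "int {name}() {\n    return { {value} };\n}\n"
result = to_default_syntax(c_template).format(name="f", value=1)
print(result)
```

Only the brace in "{ {value}" needed a space inserted; the brace that
opens the function body is followed by a newline and is left alone.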

The syntaxes are similar enough that they can all be efficiently
parsed by the same loop, so there are no real implementation issues.
The currently contemplated method for declaring a markup syntax is by
using decorator-style markup, e.g. {@syntax1} inside the string,
although I am open to suggestions about better-looking ways to do
this.

Feature:  Automatic search of locals() and globals() for name lookups
if no parameters are given.

This is contentious because it violates EIBTI.  However, it is
extremely convenient.  To me, the reasons for allowing or disallowing
this feature on 'somestring'.format() appear to be exactly the same as
the reasons for allowing or disallowing this feature on
eval('somestring').  Barring a distinction between these cases that I
have not noticed, I think that if we don't want to allow this for
'somestring'.format(), then we should seriously consider removing the
capability in Python 3000 for eval('somestring').
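
To make the comparison with eval() concrete, here is a rough
pure-Python sketch of the lookup, not the extension module's code.
The helper name auto_format is hypothetical, sys._getframe() is a
CPython-specific call, and the search order (locals before globals)
is an assumption.

```python
import sys
from collections import ChainMap

def auto_format(template):
    """Format template using the caller's locals, then globals."""
    caller = sys._getframe(1)
    namespace = ChainMap(caller.f_locals, caller.f_globals)
    return template.format_map(namespace)

def greet():
    user = "Pat"
    return auto_format("Hello, {user}!")

print(greet())
```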

Feature: Ability to pass in a dictionary or tuple of dictionaries of
namespaces to search.

This feature allows, in some cases, for much more dynamic code than
**kwargs.  (You could manually smush multiple dictionaries together to
build kwargs, but that can be ugly, tedious, and slow.)
Implementation-wise, this feature and locals() / globals() go hand in
hand.
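
A minimal sketch of what searching several namespaces could look like
in pure Python.  The function name format_from is hypothetical, and
"the first dictionary containing the name wins" is an assumption about
the search order.

```python
from collections import ChainMap

def format_from(template, namespaces):
    """Format template, searching one dict or a tuple of dicts."""
    if isinstance(namespaces, dict):
        namespaces = (namespaces,)
    # ChainMap searches the dicts in order without copying them.
    return template.format_map(ChainMap(*namespaces))

defaults = {"host": "localhost", "port": 8080}
overrides = {"port": 9090}
result = format_from("{host}:{port}", (overrides, defaults))
print(result)
```

No merged dictionary is built, which is the ugly, tedious, and slow
step that passing everything through keyword arguments would require.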

Feature:  Placement of a dummy record on the traceback stack for
underlying errors.

There are two classes of exceptions which can occur during formatting:
exceptions generated by the formatter code itself, and exceptions
generated by user code (such as a field object's getattr function, or
the field_hook function).

In general, exceptions generated by the formatter code itself are of
the "ValueError" variety -- there is an error in the actual "value" of
the format string.  (This is not strictly true; for example, the
string.format() function might be passed a non-string as its first
parameter, which would result in a TypeError.)

The text associated with these internally generated ValueError
exceptions will indicate the location of the exception inside the
format string, as well as the nature of the exception.

For exceptions generated by user code, a trace record and dummy frame
will be added to the traceback stack to help in determining the
location in the string where the exception occurred.  The inserted
traceback will indicate that the error occurred at:

        File "?format_string?", line X, in column_Y

where X and Y represent the line and character position of where the
exception occurred inside the format string.

THIS IS A HACK!

There is currently no general mechanism for non-Python source code to
be added to a traceback (which might be the subject of another PEP),
but there is some precedent and some examples, for example, in PyExpat
and in Pyrex, of adding non-Python code information to the traceback
stack.  Even though this is a hack, the necessity of debugging format
strings makes this a very useful hack -- the hack means that the
location information of the error within the format string is
available and can be manipulated, just like the location of the error
within any Python module for which source is not available.
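
The effect can be approximated in pure Python, which may help in
judging the idea.  This is a sketch and not the C implementation:
compiling a one-line function under the fake filename puts a frame
with the wanted file, line, and name into the traceback.  The helper
name reraise_with_dummy_frame and the 1-based line numbering are
assumptions.

```python
import traceback

FAKE_FILENAME = "?format_string?"

def reraise_with_dummy_frame(exc, line, column):
    """Re-raise exc from a synthetic frame named after the position."""
    name = "column_%d" % column
    # Pad with newlines so the one-line function sits on the wanted line.
    source = "\n" * (line - 1) + "def %s(): raise exc\n" % name
    namespace = {"exc": exc}
    exec(compile(source, FAKE_FILENAME, "exec"), namespace)
    namespace[name]()

def failing_user_code():
    raise KeyError("boom")

def demo():
    """Return (filename, lineno, name) for each traceback entry."""
    try:
        try:
            failing_user_code()
        except KeyError as exc:
            reraise_with_dummy_frame(exc, line=2, column=7)
    except KeyError as exc:
        return [(f.filename, f.lineno, f.name)
                for f in traceback.extract_tb(exc.__traceback__)]

frames = demo()
print(frames)
```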

Removed feature:  Ability to dump error information into the output string.

The original PEP added this capability for the same reason I am
proposing the traceback hack.  OTOH, with the traceback hack in place,
it would be extremely easy to write a pure-Python wrapper for format
that would trap an exception, parse the traceback, and place the error
message at the appropriate place in the string.  (For a bit more
effort, the exception wrapper could re-call the formatter with the
remainder of the string if you really wanted to see ALL the errors,
but this is actually much more functionality than you get with Python
itself for debugging, so while I see the usefulness, I don't know if
it justifies the effort.)
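
A wrapper along these lines might look like the sketch below.  Note
that it swaps in a different technique from the one just described:
instead of parsing the traceback, it walks the fields with the
standard library's string.Formatter and traps exceptions one field at
a time, so it reports every failing field.  The function name and the
"<ErrorType: message>" output are assumptions, and nested fields
inside format specifiers and automatic numbering are not handled.

```python
import string

def format_with_inline_errors(template, *args, **kwargs):
    """Format template, placing error text where a field fails."""
    formatter = string.Formatter()
    pieces = []
    for literal, field_name, spec, conversion in formatter.parse(template):
        pieces.append(literal)
        if field_name is None:
            continue  # trailing literal text with no field after it
        try:
            value, _ = formatter.get_field(field_name, args, kwargs)
            value = formatter.convert_field(value, conversion)
            pieces.append(formatter.format_field(value, spec or ""))
        except Exception as exc:
            pieces.append("<%s: %s>" % (type(exc).__name__, exc))
    return "".join(pieces)

result = format_with_inline_errors("a {x} b {y} c", x=1)
print(result)
```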

Feature: Addition of functions and "constants" to string module.

The PEP proposes doing everything as string methods, with a "cformat"
method allowing some access to the underlying machinery.  I propose
only having a 'format' method of the string (unicode) type, and a
corresponding 'format' and extended 'flag_format' function in the
string module, along with definitions for the flags for access to
non-default underlying format machinery.

Feature: Ability for "field hook" user code function to only be called
on some fields.

The PEP takes an all-or-nothing approach to the field hook -- it is
either called on every field or no fields.  Furthermore, it is only
available for calling if the extended function ('somestring'.cformat()
in the spec, string.flag_format() in this proposal) is called.  The
proposed change keeps this functionality, but also adds a field type
specifier 'h' which causes the field hook to be called as needed on a
per-field basis.  This latter method can even be used from the default
'somestring'.format() method.
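
The per-field behavior can be sketched with a string.Formatter
subclass.  This is only an illustration of the idea: the class name,
the hook signature (value, remaining specifier), and the choice to
recognize 'h' as the last character of the specifier are assumptions,
not the proposed implementation.

```python
import string

class HookFormatter(string.Formatter):
    """Call field_hook only for fields whose specifier ends in 'h'."""

    def __init__(self, field_hook):
        self.field_hook = field_hook

    def format_field(self, value, format_spec):
        if format_spec.endswith("h"):
            return self.field_hook(value, format_spec[:-1])
        return super().format_field(value, format_spec)

def shout(value, spec):
    # Hypothetical hook: upper-case the value, then apply the rest
    # of the specifier as usual.
    return format(str(value).upper(), spec)

result = HookFormatter(shout).format("{0} {1:h} {2:>5h}", "a", "b", "c")
print(result)
```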

Changed feature: By default, not using all arguments is not an exception

The original PEP says that, by default, the formatting operation
should check that all arguments are used.

In the original % operator, the exception reporting that an extra
argument is present is symmetrical to the exception reporting that not
enough arguments are present.  Both these errors are easy to commit,
because it is hard to count the number of arguments and the number of
% specifiers in your string and make sure they match. In theory, the
new string formatting should make it easier to get the arguments
right, because all arguments in the format string are numbered or even
named, and with the new string formatting, the corresponding error is
that a _specific_ argument is missing, not just "you didn't supply
enough arguments."

Also, it is arguably not Pythonic to require a check that all
arguments to a function are actually used by the execution of the
function (but see interfaces!), and format() is, after all, just
another function.  So it seems that the default should be to not check
that all the arguments are used.  In fact, there are similar reasons
for not using all the arguments here as with any other function.  For
example, for customization, the format method of a string might be
called with a superset of all the information which might be useful to
view.

Because the ability to check that all arguments are used might be
useful (and because I don't want to be accused of leaving this feature
out because of laziness :), this feature remains available if
string.flag_format() is called.
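
To show the two behaviors side by side, here is a sketch built on the
standard library's string.Formatter, whose check_unused_args() hook is
called after formatting.  The class name StrictFormatter is
hypothetical; the proposal itself would expose the check through
string.flag_format() rather than a subclass.

```python
import string

class StrictFormatter(string.Formatter):
    """Raise ValueError when an argument is never referenced."""

    def check_unused_args(self, used_args, args, kwargs):
        unused = [i for i in range(len(args)) if i not in used_args]
        unused += [k for k in kwargs if k not in used_args]
        if unused:
            raise ValueError("unused arguments: %r" % (unused,))

# Default behavior: the extra arguments are silently ignored.
lenient = "{0}".format("used", "extra", debug=True)

# Opt-in checking: the same call is rejected.
try:
    StrictFormatter().format("{0}", "used", "extra", debug=True)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(lenient, strict_error)
```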

Feature:  Ability to insert non-printing comments in format strings

This feature is implemented in a very intuitive way, e.g. " text {#
your comment here} more text" (example shown with the default
transition to markup syntax).  One of the nice benefits of this
feature is the ability to break up long source lines (if you have lots
of long variable names and attribute lookups).
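
The effect can be imitated with a preprocessing step, sketched here
with a regular expression.  This is an approximation and not how the
C parser does it: the name strip_comments is hypothetical, comments
containing braces are not supported, and an odd run of three or more
opening braces before the "#" would be misread.

```python
import re

# "{#" starts a comment unless the brace is the second half of "{{".
_COMMENT = re.compile(r"(?<!\{)\{#[^{}]*\}")

def strip_comments(template):
    """Remove {# ...} comments before normal formatting."""
    return _COMMENT.sub("", template)

template = "text {# your comment here} more text: {value}"
result = strip_comments(template).format(value=1)
print(result)
```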

Feature:  Exception raised if attribute with leading underscore accessed.

The syntax supported by the PEP is deliberately limited in an attempt
to increase security.  This is an additional security measure, which
is on by default, but can be optionally disabled if
string.flag_format() is used instead of 'somestring'.format().
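
A rough sketch of the check, again as a string.Formatter subclass
rather than the C code.  The class name is hypothetical, and the test
is deliberately crude: it rejects any field name containing "._",
which would also reject an index key that happens to contain that
sequence.

```python
import re
import string

class NoPrivateFormatter(string.Formatter):
    """Refuse attribute lookups that start with an underscore."""

    def get_field(self, field_name, args, kwargs):
        if re.search(r"\._", field_name):
            raise ValueError(
                "underscore attribute not allowed in %r" % field_name)
        return super().get_field(field_name, args, kwargs)

allowed = NoPrivateFormatter().format("{0.real}", 3)

try:
    NoPrivateFormatter().format("{0.__class__}", 3)
    blocked = False
except ValueError:
    blocked = True

print(allowed, blocked)
```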

Feature: Support for "center" alignment.

The field specifier uses "<" and ">" for left and right alignment.
This adds "^" for center alignment.
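
For illustration, the three alignments side by side, assuming a
Python whose format() method accepts "^" (the width of 9 is an
arbitrary choice):

```python
left = "{0:<9}".format("mid")
right = "{0:>9}".format("mid")
center = "{0:^9}".format("mid")
print(repr(left), repr(right), repr(center))
```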

Feature: support of earlier versions of Python

(This was contemplated but not mandated in the earlier PEP, but
implementation has proven to be quite easy.)

This should work with both string and unicode objects in 2.6, and
should be available as a separate compilable extension module for
versions back to 2.3 (functions only -- no string method support).

Feature: no global state

The original PEP specified a global error-handling mode, which
controlled a few capabilities.  The current implementation has no
global state -- any non-default options are accessed by using the
string.flag_format() function.

Thanks in advance for any feedback you can provide on these changes.

Regards,
Pat