[Python-Dev] PEP 3101 implementation vs. documentation

Fri Jun 10 23:15:54 CEST 2011

Hello,

I'm writing because discussion in a bug report I submitted
(<http://bugs.python.org/issue12014>) suggested that the part of the
issue that turns on the interpretation of PEP 3101 belongs on
python-dev. In particular, I was told that the PEP, not the
documentation, is authoritative. Since I'm the one who thinks
something is wrong, it seems appropriate for me to be the one to bring
it up.

Basically, the issue concerns which characters can be present in what
the grammar given in the documentation calls an element_index, that
is, the bit between the square brackets in "{0.attr[idx]}". On this
point the current behavior of str.format is at variance with the
documentation
<http://docs.python.org/library/string.html#formatstrings>, is almost
certainly at variance with PEP 3101 in one respect, and is in my
opinion at variance with PEP 3101 in another respect as well.

Both discovering the current behavior and interpreting the
documentation are pretty straightforward; interpreting what the PEP
actually calls for is more vexed. I'll do the first two things first.
TOC for the remainder:

1. What does the current implementation do?
2. What does the documentation say?
3. What does the PEP say? [this part is long, but the PEP is not
clear, and I wanted to be thorough]
4. Who cares?

1. What does the current implementation do?

Suppose you have this dictionary:

d = {"@": 0,
     "!": 1,
     ":": 2,
     "^": 3,
     "}": 4,
     "{": {"}": 5},
    }

Then the following expressions have the following results:

(a) "{0[@]}".format(d)    --> '0'
(b) "{0[!]}".format(d)    --> ValueError: Missing ']' in format string
(c) "{0[:]}".format(d)    --> ValueError: Missing ']' in format string
(d) "{0[^]}".format(d)    --> '3'
(e) "{0[}]}".format(d)    --> ValueError: Missing ']' in format string
(f) "{0[{]}".format(d)    --> ValueError: unmatched '{' in format
(g) "{0[{][}]}".format(d) --> '5'
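
For anyone who wants to repeat the experiment, the cases above can be
run with a small harness along these lines. It is only a sketch:
try_format is a helper name made up for this example, and since the
outcomes of (b), (c), (e) and (f) are exactly what is in dispute (and
may differ between interpreter versions), the script reports whatever
happens rather than assuming a result.

```python
d = {"@": 0, "!": 1, ":": 2, "^": 3, "}": 4, "{": {"}": 5}}

CASES = [
    ("a", "{0[@]}"),
    ("b", "{0[!]}"),
    ("c", "{0[:]}"),
    ("d", "{0[^]}"),
    ("e", "{0[}]}"),
    ("f", "{0[{]}"),
    ("g", "{0[{][}]}"),
]


def try_format(template, *args, **kwargs):
    """Return repr() of the formatted string, or a description of the error."""
    try:
        return repr(template.format(*args, **kwargs))
    except (ValueError, LookupError) as exc:
        return "%s: %s" % (type(exc).__name__, exc)


# Collect the outcome of every case, then print them in the (a)--(g) layout.
results = dict((label, try_format(template, d)) for label, template in CASES)
for label, template in CASES:
    print("(%s) %-12s --> %s" % (label, template, results[label]))
```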

Given (e) and (f), I think (g) should be a little surprising, though
you can probably guess what's going on and it's not hard to see why it
happens if you look at the source: (e) and (f) fail because
MarkupIterator_next (in Objects/stringlib/string_format.h) scans
through the string looking for curly braces, because it treats them as
semantically significant no matter what context they occur in. So,
according to MarkupIterator_next, the *first* right curly brace in (e)
indicates the end of the replacement field, giving "{0[}". In (f), the
second left curly brace indicates (to MarkupIterator_next) the start
of a *new* replacement field, and since there's only one right curly
brace, it complains. In (g), MarkupIterator_next treats the second
left curly brace as starting a new replacement field and the first
right curly brace as closing it. But those braces don't actually
define new replacement fields, as indicated by the fact that the whole
expression treats the element_index fields as just plain old strings.
(So the current implementation is somewhat schizophrenic, acting at
one point as if the braces have to be balanced because '{[]}' is a
replacement field and at another point treating the braces as
insignificant.)

The explanation for (b) and (c) is that parse_field (same source file)
treats ':' and '!'  as indicating the end of the field_name section of
the replacement field, regardless of whether those characters occur
within square brackets or not.
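
The two shortcuts just described can be modelled in a few lines of
Python. This is a deliberately rough model, not a transcription of the
C code: find_field_end, split_field_name and model are hypothetical
names, escapes such as "{{" are ignored, and only a template consisting
of a single replacement field is handled. It does, however, reproduce
the pattern of successes and failures in (a)--(g).

```python
def find_field_end(template, start):
    """Model of the brace scan: starting just after an opening '{', count
    every brace wherever it occurs and stop when the count drops to zero."""
    depth = 1
    for pos in range(start, len(template)):
        if template[pos] == "{":
            depth += 1
        elif template[pos] == "}":
            depth -= 1
            if depth == 0:
                return pos
    raise ValueError("unmatched '{' in format")


def split_field_name(field):
    """Model of the field_name split: cut at the first ':' or '!',
    whether or not it sits inside square brackets."""
    for pos, char in enumerate(field):
        if char in ":!":
            return field[:pos]
    return field


def model(template):
    """Return the field_name this model extracts from a one-field template."""
    end = find_field_end(template, 1)
    name = split_field_name(template[1:end])
    if name.count("[") != name.count("]"):
        raise ValueError("Missing ']' in format string")
    return name


for template in ("{0[@]}", "{0[!]}", "{0[:]}", "{0[^]}",
                 "{0[}]}", "{0[{]}", "{0[{][}]}"):
    try:
        print(template, "->", model(template))
    except ValueError as exc:
        print(template, "->", "ValueError:", exc)
```

In this model (g) gets through only because its braces happen to
balance, which is the point made above about the brace scan.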

So, that's what the current implementation does, in those cases.

2. What does the documentation say?

The documentation gives a grammar for replacement fields:

"""
replacement_field ::=  "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name        ::=  arg_name ("." attribute_name | "[" element_index "]")*
arg_name          ::=  [identifier | integer]
attribute_name    ::=  identifier
element_index     ::=  integer | index_string
index_string      ::=  <any source character except "]"> +
conversion        ::=  "r" | "s"
format_spec       ::=  <described in the next section>
"""

Given this definition of index_string, all of (a)--(g) should be
legal, and the results should be the strings '0', '1', '2', '3', '4',
"{'}': 5}", and '5'. There is no need to exclude ':', '!', '}', or '{'
from the index_string field; allowing them creates no ambiguity,
because the field is delimited by square brackets.
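
One way to check this reading is to transcribe the field_name
production into a regular expression and try the field names from
(a)--(g) against it. The following is a sketch under some assumptions:
the IDENTIFIER pattern only approximates Python identifiers, and
is_documented_field_name is a name invented for the example.

```python
import re

IDENTIFIER = r"[^\W\d]\w*"   # approximation of a Python identifier
INTEGER = r"\d+"
INDEX_STRING = r"[^\]]+"     # <any source character except "]"> +

# arg_name ::= [identifier | integer]
ARG_NAME = "(?:" + IDENTIFIER + "|" + INTEGER + ")?"
# "." attribute_name | "[" element_index "]"
# Every integer also matches index_string, so one pattern covers both
# alternatives of element_index here.
ACCESSOR = r"(?:\." + IDENTIFIER + r"|\[" + INDEX_STRING + r"\])"
FIELD_NAME = re.compile(ARG_NAME + ACCESSOR + "*")


def is_documented_field_name(text):
    """True if text matches the documented field_name production."""
    return FIELD_NAME.fullmatch(text) is not None


for name in ("0[@]", "0[!]", "0[:]", "0[^]", "0[}]", "0[{]", "0[{][}]"):
    print(name, is_documented_field_name(name))
```

Under this transcription all seven field names are accepted, while
something like "0[a]]" is rejected because the index_string cannot
contain a right square bracket.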

Tangent: the definition of attribute_name here also does not describe
the current behavior ("{0.  ;}".format(x) works fine and will call
getattr(x, "  ;")) and the PEP does not require the attribute_name to
be an identifier; in fact it explicitly states that the attribute_name
doesn't need to be a valid Python identifier. attribute_name should
read (to reflect something like actual behavior, anyway) "<any source
character except '[', '.', ':', '!', '{', or '}'> +". The same goes
for arg_name (with the integer alternation). Observe:

>>> x = lambda: None
>>> setattr(x, ']]', 3)
>>> "{].]]}".format(**{"]":x})     # (h)
'3'

One can also presently do this (hence "something like actual behavior"):
>>> setattr(x, 'f}', 4)
>>> "{a{s.f}}".format(**{"a{s":x})
'4'
But not this:
>>> "{a{s.func_name}".format(**{"a{s":x})
as it raises a ValueError, for the same reason that explains (f) above.

3. What does the PEP say?

Well... It's actually hard to tell!

Summary: The PEP does not contain a grammar for replacement fields,
and is surprisingly nonspecific about what can appear where, at least
when talking about the part of the replacement field before the format
specifier. The most reasonable interpretation of the parts of the PEP
that seem to be relevant favors the documentation, rather than the
implementation.

This can be separated into two sub-questions.

A. What does the PEP say about ':' and '!'?

It says two things that pertain to element_index fields.

The first is this:
"""
                   The rules for parsing an item key are very simple.
    If it starts with a digit, then it is treated as a number, otherwise
    it is used as a string.

    Because keys are not quote-delimited, it is not possible to
    specify arbitrary dictionary keys (e.g., the strings "10" or
    ":-]") from within a format string.
"""

So it notes that some things can't be used:

 - Because anything composed entirely of digits is treated as a
number, you can't get a string composed entirely of digits. Clear
enough.

 - What's the explanation for the second example, though? Well, you
can't have a right square bracket in the index_string, so that would
already mean that you can't do this: "{0[:-]]}".format(...) regardless
of whether colons are legal or not. So, although the PEP gives an
example of a string that can't appear in the element_index part of a
replacement field, and that string contains a colon, that string would
have been disallowed for other reasons anyway.

The second is this:

"""
                                  The str.format() function will have
    a minimalist parser which only attempts to figure out when it is
    "done" with an identifier (by finding a '.' or a ']', or '}',
    etc.).
"""

This requires some interpretation. For one thing, the contents of an
element_index aren't identifiers. For another, it's not true that
you're done with an identifier (or whatever) whenever you see *any* of
those; it depends on context. When parsing this: "{0[a.b]}" the parser
should not stop at the '.'; it should keep going until it reaches the
']', and that will give the element_index. And when parsing this:
"{0.f]oo[bar].baz}", it's done with the identifier "f]oo" when it
reaches the '[', not when it reaches the second '.', and not when it
reaches the ']', either (recall (h)). The "minimalist parser" is, I
take it, one that works like this: when parsing an arg_name you're
done when you reach a '[', a ':', a '!', a '.', a '{', or a '}'; the
same rules apply when parsing an attribute_name; when parsing an
element_index you're done when you see a ']'.
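
A parser with these context-dependent stopping rules is short enough
to sketch. The code below is an illustration of the reading just
given, not the PEP's or CPython's parser; parse_field_name and
NAME_TERMINATORS are names made up for the example, and the input is
assumed to be the text that follows the opening '{'.

```python
NAME_TERMINATORS = set("[:!.{}")


def parse_field_name(text):
    """Split text into (arg_name, accessors, end), where end is the index
    at which the field_name stops (at ':', '!', '{', '}' or end of text)."""

    def read_name(pos):
        # An arg_name or attribute_name runs up to the next terminator.
        end = pos
        while end < len(text) and text[end] not in NAME_TERMINATORS:
            end += 1
        return text[pos:end], end

    arg_name, pos = read_name(0)
    accessors = []
    while pos < len(text):
        if text[pos] == ".":
            name, pos = read_name(pos + 1)
            accessors.append(("attr", name))
        elif text[pos] == "[":
            # Inside brackets the only terminator is ']'.
            close = text.find("]", pos + 1)
            if close == -1:
                raise ValueError("Missing ']' in format string")
            accessors.append(("index", text[pos + 1:close]))
            pos = close + 1
        else:
            break  # anything else ends the field_name in this sketch
    return arg_name, accessors, pos


print(parse_field_name("0[a.b]}"))
print(parse_field_name("0.f]oo[bar].baz}"))
print(parse_field_name("0[:]}"))
```

With these rules the '.' in "a.b" and the ':' in "0[:]" stay inside
the element_index, and "f]oo" is read as a single attribute_name.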

Now as regards the curly braces there are some other parts of the PEP
that perhaps should be taken into account (see below), but I can find
no justification for excluding ':' and '!' from the element_index
field; the bit quoted above about having a minimalist parser isn't a
good justification for that, and it's the only part of the entire PEP
that comes close to addressing the question.

B. What does it say about '}' and '{'?

It still doesn't say much explicitly. There's the quotation I just
gave, and then these passages:

"""
    Brace characters ('curly braces') are used to indicate a
    replacement field within the string:

    [...]

    Braces can be escaped by doubling:
"""

Taken by itself, this would suggest that wherever there's an unescaped
brace, there's a replacement field. That would mean that the current
implementation's behavior is correct in (e) and (f) but incorrect in
(g). However, I think this is a bad interpretation; unescaped braces
can indicate the presence of a replacement field without that meaning
that *within* a replacement field braces have that meaning, no matter
where within the replacement field they occur.

Later in the PEP, talking about this example:

        "{0:{1}}".format(a, b)

We have this:

"""
    These 'internal' replacement fields can only occur in the format
    specifier part of the replacement field.  Internal replacement fields
    cannot themselves have format specifiers.  This implies also that
    replacement fields cannot be nested to arbitrary levels.

    Note that the doubled '}' at the end, which would normally be
    escaped, is not escaped in this case.  The reason is because
    the '{{' and '}}' syntax for escapes is only applied when used
    *outside* of a format field.  Within a format field, the brace
    characters always have their normal meaning.
"""

The claim "within a format field, the brace characters always have
their normal meaning" might be taken to mean that within a
*replacement* field, brace characters always indicate the start (or
end) of a replacement field. But the PEP at this point is clearly
talking about the formatting section of a replacement field---the part
that follows the ':', if present. ("Format field" is nowhere defined
in the PEP, but it seems reasonable to take it to mean "the format
specifier of a replacement field".) However, it seems most reasonable
to me to take "normal meaning" to mean "just a brace character".

Note that the present implementation only kinda sorta conforms to the
PEP in this respect:

>>> import datetime
>>> format(datetime.datetime.now(), "{{%Y")
'{{2011'
>>> "{0:{{%{1}}".format(datetime.datetime.now(), 'Y') # (i)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unmatched '{' in format
>>> "{0:{{%{1}}}}".format(datetime.datetime.now(), 'Y') # (j)
'{2011}'

Here the brace characters in (i) and (j) are treated, again in
MarkupIterator_next, as indicating the start of a replacement field.
In (i), this leads the function to throw an exception; since they're
balanced in (j), processing proceeds further, and the doubled braces
aren't treated as indicating the start or end of a replacement
field---because they're escaped.  Given that the format spec part of a
replacement field can contain further replacement fields, this is, I
think, correct behavior, but it's not easy to see how it comports with
the PEP, whose language is not very exact.

The call to the built-in format() bypasses the mechanism that leads to
these results.
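
The difference between the two routes can be seen side by side with a
fixed date. This is a small sketch; the variable names are arbitrary,
and the date is pinned only so that the output does not depend on when
the code is run.

```python
import datetime

moment = datetime.datetime(2011, 6, 10)  # fixed date for a stable result

# format() hands the spec to datetime.__format__ as-is, so the doubled
# brace reaches strftime untouched.
direct = format(moment, "{{%Y")

# str.format processes the spec "{{%{1}}}" before using it: the doubled
# braces are unescaped and {1} is replaced, leaving the spec "{%Y}".
nested = "{0:{{%{1}}}}".format(moment, "Y")

print(direct)
print(nested)
print(nested == format(moment, "{%Y}"))

# The unbalanced variant (i) is rejected before any formatting happens.
try:
    "{0:{{%{1}}".format(moment, "Y")
except ValueError as exc:
    print("ValueError:", exc)
```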

The PEP is very, very nonspecific about the parts of the replacement
field that precede the format specifier. I don't know what kind of
discussion surrounded the drawing up of the grammar that appears in
the documentation, but I think that it, and not the implementation,
should be followed.

The implementation only works the way it does because of parsing
shortcuts: it raises ValueErrors for (b) and (c) because it
generalizes something true of the attribute_name field (encountering a
':' or '!' means one has moved on to the format_spec or conversion
part of the replacement field) to the element_index field. And it
raises an error for (e) and (f), but lets (g) through, for the reasons
already mentioned. It is, in that respect, inconsistent; it treats the
curly brace as having one semantic significance at one point and an
entirely different significance at another point, so that it does the
right thing in the case of (g) entirely by accident. There is, I
think, no way to construe the PEP so that it is reasonable to do what
the present implementation does in all three cases (if "{" indicates
the start of a replacement field in (f), it should do so in (g) as
well); I think it's actually pretty difficult to construe the PEP in
any way that makes what it does in the case of (e) and (f) correct.

4. Who cares?

Well, I do. (Obviously.) I even have a use case: I want to be able to
put arbitrary (or as close to arbitrary as possible) strings in the
element_index field, because I've got some objects that (should!)
enable me to do this:

p.say("I'm warning you, {e.red.underline[don't touch that!]}")

and have this written ("e" for "effects") to p.out:

I'm warning you, \x1b[31m\x1b[4mdon't touch that!\x1b[0m

I have a way around the square bracket problem, but it would be quite
burdensome to have to deal with all of !:{} as well; enough that
I would fall back on something like this:

"I'm warning you, {0}".format(e.red.underline("don't touch that!"))

or some similar interpolation-based strategy, which I think would be a
shame, because of the way it chops up the string.
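
For concreteness, objects of the kind described might look something
like the following. This is a hypothetical sketch, not the actual
code: the Effects class, its CODES table and the use of __getitem__
are all assumptions made for illustration, and the sample text leaves
out the '!' so that the example does not depend on how the interpreter
treats that character inside the brackets.

```python
class Effects(object):
    """Hypothetical sketch: attribute access accumulates ANSI escape
    codes; indexing or calling wraps a string in them."""

    CODES = {"red": "31", "underline": "4"}  # a small sample of SGR codes
    RESET = "\x1b[0m"

    def __init__(self, prefix=""):
        self._prefix = prefix

    def __getattr__(self, name):
        # Only reached for names not found normally, e.g. "red".
        try:
            code = self.CODES[name]
        except KeyError:
            raise AttributeError(name)
        return Effects(self._prefix + "\x1b[" + code + "m")

    def __getitem__(self, text):
        return self._prefix + str(text) + self.RESET

    __call__ = __getitem__  # supports the interpolation-based fallback


e = Effects()
inline = "I'm warning you, {e.red.underline[don't touch that]}".format(e=e)
fallback = "I'm warning you, {0}".format(e.red.underline("don't touch that"))
print(repr(inline))
print(inline == fallback)
```

Both spellings produce the same string; the first keeps the message in
one piece, which is the whole attraction.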

But I also think the present behavior is extremely counterintuitive,
unnecessarily complex, and illogical (even unpythonic!). Isn't it
bizarre that (g) should work, given what (e) and (f) do? Isn't it
strange that (b) and (c) should fail, given that there's no real
reason for them to do so---no ambiguity that has to be avoided? And
something's gotta give; the documentation and the implementation do
not correspond.

Beyond the counterintuitiveness of the present implementation, it is
also, I think, self-inconsistent. (e) and (f) fail because the
interior brace is treated as starting or ending a replacement field,
even though interior replacement fields aren't allowed in that part of
a replacement field. (g) succeeds because the braces are balanced:
they are treated at one point as if they were part of a replacement
field, and at another (correctly) as if they are not. But this makes
the failure of (e) and (f) unaccountable. It would not have been
impossible for the PEP to allow interior replacement fields anywhere,
and not just in the format spec, in which case we might have had this:

(g') "{0[{][}]}".format(range(10), **{'][':4}) --> '4'
or this:
(g'') "{0[{][}]}".format({'4':3}, **{'][':4}) --> '3'
or something with that general flavor.

As far as I can tell, the only way to consistently maintain that (e)
and (f) should fail requires that one take (g') or (g'') to be
correct: either the interior braces signal replacement fields (hence
must be balanced) or they don't (or they're escaped).

Regarding the documentation, it could of course be rendered correct by
changing *it*, rather than the implementation. But wouldn't it be
tremendously weird to have to explain that, in the part of the
replacement field preceding the conversion, you can't employ any curly
braces, unless they're balanced, and you can't employ ':' or '!' at
all, even though they have no semantic significance? So these are out:

{0[{].foo}
{0[}{}]}
{0[a:b]}

But these are in:

{0[{}{}]}
{0[{{].foo.}}}  (k)

((k) does work, if you give it an object with the right structure,
though it probably should not.)

And, moreover, someone would then have to actually change the
documentation, whereas there's a patch already, attached to the bug
report linked way up at the top of this email, that makes (a)--(g) all
work, leaves (i) and (j) as they are, and has the welcome side-effect
of making (k) *not* work (if any code anywhere relies on (k) or things
like it working, I will be very surprised: anyway the fact that (k)
works is, technically, undocumented). It is also quite simple. It
doesn't affect the built-in format(), because the built-in format() is
concerned only with the format-specifier part of the replacement field
and not all the stuff that comes before that, telling str.format
*what* object to format.

Thanks for reading,
-- 
Ben Wolfson
"Human kind has used its intelligence to vary the flavour of drinks,
which may be sweet, aromatic, fermented or spirit-based. ... Family
and social life also offer numerous other occasions to consume drinks
for pleasure." [Larousse, "Drink" entry]