[Python-Dev] PEP 3101 implementation vs. documentation

Thu Jul 7 01:02:38 CEST 2011

FWIW, new patches have been attached to the bug report
(http://bugs.python.org/issue12014), one of which is intended to bring
behavior in line with the documentation, and the other of which is
intended to implement Greg Ewing's proposal to allow only identifiers
(or integers) in the arg_name, attribute_name, and element_index
sections.

On Fri, Jun 10, 2011 at 2:15 PM, Ben Wolfson <wolfson at gmail.com> wrote:
> Hello,
>
> I'm writing because discussion in a bug report I submitted
> (<http://bugs.python.org/issue12014>) has suggested that, insofar as
> at least part of the issue revolves around the interpretation of PEP
> 3101, that aspect belonged on python-dev. In particular, I was told
> that the PEP, not the documentation, is authoritative. Since I'm the
> one who thinks something is wrong, it seems appropriate for me to be
> the one to bring it up.
>
> Basically, the issue is that the current behavior of str.format is at
> variance at least with the documentation
> <http://docs.python.org/library/string.html#formatstrings>, is almost
> certainly at variance with PEP 3101 in one respect, and is in my
> opinion at variance with PEP 3101 in another respect as well, regarding
> what characters can be present in what the grammar given in the
> documentation calls an element_index, that is, the bit between the
> square brackets in "{0.attr[idx]}".
>
> Both discovering the current behavior and interpreting the
> documentation are pretty straightforward; interpreting what the PEP
> actually calls for is more vexed. I'll do the first two things first.
> TOC for the remainder:
>
> 1. What does the current implementation do?
> 2. What does the documentation say?
> 3. What does the PEP say? [this part is long, but the PEP is not
> clear, and I wanted to be thorough]
> 4. Who cares?
>
> 1. What does the current implementation do?
>
> Suppose you have this dictionary:
>
> d = {"@": 0,
>      "!": 1,
>      ":": 2,
>      "^": 3,
>      "}": 4,
>      "{": {"}": 5},
>     }
>
> Then the following expressions have the following results:
>
> (a) "{0[@]}".format(d)    --> '0'
> (b) "{0[!]}".format(d)    --> ValueError: Missing ']' in format string
> (c) "{0[:]}".format(d)    --> ValueError: Missing ']' in format string
> (d) "{0[^]}".format(d)    --> '3'
> (e) "{0[}]}".format(d)    --> ValueError: Missing ']' in format string
> (f) "{0[{]}".format(d)    --> ValueError: unmatched '{' in format
> (g) "{0[{][}]}".format(d) --> '5'
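To reproduce this table on a given interpreter, a minimal sketch follows. The helper name try_format is an assumption made for illustration, not anything in CPython. It reports either the formatted result or the exception, and it prints what it observes instead of assuming the outcomes listed above, since (b), (c), (e), and (f) depend on the implementation being discussed.

```python
d = {"@": 0, "!": 1, ":": 2, "^": 3, "}": 4, "{": {"}": 5}}

CASES = [
    ("a", "{0[@]}"),
    ("b", "{0[!]}"),
    ("c", "{0[:]}"),
    ("d", "{0[^]}"),
    ("e", "{0[}]}"),
    ("f", "{0[{]}"),
    ("g", "{0[{][}]}"),
]


def try_format(template, *args, **kwargs):
    """Return ("ok", result) or ("error", "ExcName: message")."""
    try:
        return ("ok", template.format(*args, **kwargs))
    except (ValueError, LookupError, AttributeError, TypeError) as exc:
        return ("error", "%s: %s" % (type(exc).__name__, exc))


# Collect the outcome of every case, then print one line per case.
results = dict((label, try_format(template, d)) for label, template in CASES)
for label, template in CASES:
    status, detail = results[label]
    print("(%s) %-14r --> %s: %s" % (label, template, status, detail))
```

Only (a) and (d) involve no character that the parser might treat specially, so those two are the ones to rely on across versions.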
>
> Given (e) and (f), I think (g) should be a little surprising, though
> you can probably guess what's going on and it's not hard to see why it
> happens if you look at the source: (e) and (f) fail because
> MarkupIterator_next (in Objects/stringlib/string_format.h) scans
> through the string looking for curly braces, because it treats them as
> semantically significant no matter what context they occur in. So,
> according to MarkupIterator_next, the *first* right curly brace in (e)
> indicates the end of the replacement field, giving "{0[}". In (f), the
> second left curly brace indicates (to MarkupIterator_next) the start
> of a *new* replacement field, and since there's only one right curly
> brace, it complains. In (g), MarkupIterator_next treats the second
> left curly brace as starting a new replacement field and the first
> right curly brace as closing it. However, actually, those braces don't
> define new replacement fields, as indicated by the fact that the whole
> expression treats the element_index fields as just plain old strings.
> (So the current implementation is somewhat schizophrenic, acting at
> one point as if the braces have to be balanced because '{[]}' is a
> replacement field and at another point treating the braces as
> insignificant.)
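The brace handling described above can be modelled in a few lines. This is a simplified sketch of the behavior as described, not a transcription of the C code in MarkupIterator_next; the function name scan_replacement_field is an assumption. It counts braces without regard to square brackets and returns whatever lies between the opening brace and the brace that balances it.

```python
def scan_replacement_field(template):
    """Model of a context-blind brace scan.

    Assumes `template` starts with '{'. Returns the text between that brace
    and the '}' that brings the nesting depth back to zero.
    """
    if not template.startswith("{"):
        raise ValueError("expected a leading '{'")
    depth = 0
    for i, ch in enumerate(template):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return template[1:i]
    # Ran off the end while still inside at least one '{'.
    raise ValueError("unmatched '{' in format")
```

Under this model (e) yields the truncated field "0[" (which then fails for lack of a ']'), (f) never balances, and (g) balances by accident and yields "0[{][}]" intact.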
>
> The explanation for (b) and (c) is that parse_field (same source file)
> treats ':' and '!'  as indicating the end of the field_name section of
> the replacement field, regardless of whether those characters occur
> within square brackets or not.
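The difference between the behavior described for parse_field and a bracket-aware alternative can be sketched as two small functions. Both names are hypothetical, and neither is the actual C implementation; they only illustrate where the field_name is cut off.

```python
def split_naive(field):
    """Cut the field at the first ':' or '!', wherever it occurs."""
    for i, ch in enumerate(field):
        if ch in ":!":
            return field[:i], field[i:]
    return field, ""


def split_bracket_aware(field):
    """Cut at the first ':' or '!' that is not inside [...]."""
    in_brackets = False
    for i, ch in enumerate(field):
        if in_brackets:
            if ch == "]":
                in_brackets = False
        elif ch == "[":
            in_brackets = True
        elif ch in ":!":
            return field[:i], field[i:]
    return field, ""
```

For the body of (c), split_naive("0[:]") gives ("0[", ":]"), which is why the ']' appears to be missing, while split_bracket_aware("0[:]") keeps "0[:]" whole.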
>
> So, that's what the current implementation does, in those cases.
>
> 2. What does the documentation say?
>
> The documentation gives a grammar for replacement fields:
>
> """
> replacement_field ::=  "{" [field_name] ["!" conversion] [":" format_spec] "}"
> field_name        ::=  arg_name ("." attribute_name | "[" element_index "]")*
> arg_name          ::=  [identifier | integer]
> attribute_name    ::=  identifier
> element_index     ::=  integer | index_string
> index_string      ::=  <any source character except "]"> +
> conversion        ::=  "r" | "s"
> format_spec       ::=  <described in the next section>
> """
>
> Given this definition of index_string, all of (a)--(g) should be
> legal, and the results should be the strings '0', '1', '2', '3', '4',
> "{'}': 5}", and '5'. There is no need to exclude ':', '!', '}', or '{'
> from the index_string field; allowing them creates no ambiguity,
> because the field is delimited by square brackets.
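One way to check a field_name against the documented grammar is to transcribe it into a regular expression. This is a sketch under stated assumptions: the identifier pattern is a regex approximation of Python identifiers, and the names below are made up for illustration.

```python
import re

IDENTIFIER = r"[^\W\d]\w*"   # approximation of a Python identifier
INTEGER = r"\d+"

ARG_NAME = r"(?:%s|%s)?" % (IDENTIFIER, INTEGER)   # arg_name is optional
ATTRIBUTE = r"\.%s" % IDENTIFIER                   # "." attribute_name
ELEMENT = r"\[[^\]]+\]"                            # "[" element_index "]"
FIELD_NAME = re.compile(r"%s(?:%s|%s)*" % (ARG_NAME, ATTRIBUTE, ELEMENT))


def is_documented_field_name(text):
    """True if `text` matches the field_name production as documented."""
    return FIELD_NAME.fullmatch(text) is not None
```

The field names in (a) through (g) all match, because index_string excludes only ']'.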
>
> Tangent: the definition of attribute_name here also does not describe
> the current behavior ("{0. ;}".format(x) works fine and will call
> getattr(x, " ;")) and the PEP does not require the attribute_name to
> be an identifier; in fact it explicitly states that the attribute_name
> doesn't need to be a valid Python identifier. attribute_name should
> read (to reflect something like actual behavior, anyway) "<any source
> character except '[', '.', ':', '!', '{', or '}'> +". The same goes
> for arg_name (with the integer alternation). Observe:
>
> >>> x = lambda: None
> >>> setattr(x, ']]', 3)
> >>> "{].]]}".format(**{"]":x})     # (h)
> '3'
>
> One can also presently do this (hence "something like actual behavior"):
> >>> setattr(x, 'f}', 4)
> >>> "{a{s.f}}".format(**{"a{s":x})
> '4'
> But not this:
> >>> "{a{s.func_name}".format(**{"a{s":x})
> as it raises a ValueError, for the same reason as explains (g) above.
>
>
> 3. What does the PEP say?
>
> Well... It's actually hard to tell!
>
> Summary: The PEP does not contain a grammar for replacement fields,
> and is surprisingly nonspecific about what can appear where, at least
> when talking about the part of the replacement field before the format
> specifier. The most reasonable interpretation of the parts of the PEP
> that seem to be relevant favors the documentation, rather than the
> implementation.
>
> This can be separated into two sub-questions.
>
> A. What does the PEP say about ':' and '!'?
>
> It says two things that pertain to element_index fields.
>
> The first is this:
> """
>                   The rules for parsing an item key are very simple.
>    If it starts with a digit, then it is treated as a number, otherwise
>    it is used as a string.
>
>    Because keys are not quote-delimited, it is not possible to
>    specify arbitrary dictionary keys (e.g., the strings "10" or
>    ":-]") from within a format string.
> """
>
> So it notes that some things can't be used:
>
>  - Because anything composed entirely of digits is treated as a
> number, you can't get a string composed entirely of digits. Clear
> enough.
>
>  - What's the explanation for the second example, though? Well, you
> can't have a right square bracket in the index_string, so that would
> already mean that you can't do this: "{0[:-]]}".format(...) regardless
> of whether colons are legal or not. So, although the PEP gives an
> example of a string that can't appear in the element_index part of a
> replacement field, and that string contains a colon, that string would
> have been disallowed for other reasons anyway.
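The first restriction is easy to demonstrate. A short sketch, using two made-up dictionaries: an all-digit element_index is converted to an integer before the lookup, so an integer key is reachable but the string key "10" is not.

```python
by_int = {10: "int key"}
by_str = {"10": "str key"}

# An all-digit element_index is looked up as the integer 10.
int_result = "{0[10]}".format(by_int)

# The string key "10" therefore cannot be reached from a format string.
try:
    "{0[10]}".format(by_str)
    str_lookup_error = None
except KeyError as exc:
    str_lookup_error = exc

# The same conversion is what makes sequence indexing work.
list_result = "{0[1]}".format(["zero", "one"])
```

Here int_result is 'int key', the second lookup raises KeyError for the integer 10, and list_result is 'one'.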
>
> The second is this:
>
> """
>                                  The str.format() function will have
>    a minimalist parser which only attempts to figure out when it is
>    "done" with an identifier (by finding a '.' or a ']', or '}',
>    etc.).
> """
>
> This requires some interpretation. For one thing, the contents of an
> element_index aren't identifiers. For another, it's not true that
> you're done with an identifier (or whatever) whenever you see *any* of
> those; it depends on context. When parsing this: "{0[a.b]}" the parser
> should not stop at the '.'; it should keep going until it reaches the
> ']', and that will give the element_index. And when parsing this:
> "{0.f]oo[bar].baz}", it's done with the identifier "f]oo" when it
> reaches the '[', not when it reaches the second '.', and not when it
> reaches the ']', either (recall (h)). The "minimalist parser" is, I
> take it, one that works like this: when parsing an arg_name you're
> done when you reach a '[', a ':', a '!', a '.', '{', or a '}'; the
> same rules apply when parsing an attribute_name; when parsing an
> element_index you're done when you see a ']'.
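The context-dependent reading of the "minimalist parser" given above can be sketched directly. This is an illustration of that interpretation only, not the CPython code; parse_field_name and its return shape are assumptions.

```python
FIELD_NAME_END = ":!{}"   # outside brackets, these end the whole field_name


def parse_field_name(text):
    """Split text into (arg_name, [(kind, value), ...], rest).

    kind is "attr" for .name and "index" for [name]; rest is whatever
    follows the field_name (conversion, format spec, or nothing).
    """
    def read_name(pos):
        # arg_name and attribute_name stop at '.', '[' or a terminator.
        end = pos
        while end < len(text) and text[end] not in ".[" + FIELD_NAME_END:
            end += 1
        return text[pos:end], end

    arg_name, pos = read_name(0)
    parts = []
    while pos < len(text) and text[pos] not in FIELD_NAME_END:
        if text[pos] == ".":
            name, pos = read_name(pos + 1)
            parts.append(("attr", name))
        elif text[pos] == "[":
            # Inside brackets only ']' is significant.
            close = text.find("]", pos)
            if close == -1:
                raise ValueError("Missing ']' in format string")
            parts.append(("index", text[pos + 1:close]))
            pos = close + 1
        else:
            raise ValueError("only '.' or '[' may follow ']'")
    return arg_name, parts, text[pos:]
```

With these rules "0[a.b]" yields a single index "a.b", and "0.f]oo[bar].baz" yields the attribute "f]oo", the index "bar", and the attribute "baz".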
>
> Now as regards the curly braces there are some other parts of the PEP
> that perhaps should be taken into account (see below), but I can find
> no justification for excluding ':' and '!' from the element_index
> field; the bit quoted above about having a minimalist parser isn't a
> good justification for that, and it's the only part of the entire PEP
> that comes close to addressing the question.
>
> B. What does it say about '}' and '{'?
>
> It still doesn't say much explicitly. There's the quotation I just
> gave, and then these passages:
>
> """
>    Brace characters ('curly braces') are used to indicate a
>    replacement field within the string:
>
>    [...]
>
>    Braces can be escaped by doubling:
> """
>
> Taken by itself, this would suggest that wherever there's an unescaped
> brace, there's a replacement field. That would mean that the current
> implementation's behavior is correct in (e) and (f) but incorrect in
> (g). However, I think this is a bad interpretation; unescaped braces
> can indicate the presence of a replacement field without that meaning
> that *within* a replacement field braces have that meaning, no matter
> where within the replacement field they occur.
>
> Later in the PEP, talking about this example:
>
>        "{0:{1}}".format(a, b)
>
> We have this:
>
> """
>    These 'internal' replacement fields can only occur in the format
>    specifier part of the replacement field.  Internal replacement fields
>    cannot themselves have format specifiers.  This implies also that
>    replacement fields cannot be nested to arbitrary levels.
>
>    Note that the doubled '}' at the end, which would normally be
>    escaped, is not escaped in this case.  The reason is because
>    the '{{' and '}}' syntax for escapes is only applied when used
>    *outside* of a format field.  Within a format field, the brace
>    characters always have their normal meaning.
> """
>
> The claim "within a format field, the brace characters always have
> their normal meaning" might be taken to mean that within a
> *replacement* field, brace characters always indicate the start (or
> end) of a replacement field. But the PEP at this point is clearly
> talking about the formatting section of a replacement field---the part
> that follows the ':', if present. ("Format field" is nowhere defined
> in the PEP, but it seems reasonable to take it to mean "the format
> specifier of a replacement field".) However, it seems most reasonable
> to me to take "normal meaning" to mean "just a brace character".
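For reference, the uncontroversial cases of nesting and escaping look like this. The values are arbitrary; the sketch only shows an internal replacement field inside a format specifier and doubled braces outside any field.

```python
value = 3.14159

# The inner field {1} supplies only the precision.
nested_precision = "{0:.{1}f}".format(value, 2)

# The inner field {1} supplies the whole format specifier.
nested_spec = "{0:{1}}".format(value, ".3f")

# Outside a replacement field, doubled braces are escapes.
escaped = "{{0}}".format(value)
```

These give '3.14', '3.142', and '{0}' respectively.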
>
> Note that the present implementation only kinda sorta conforms to the
> PEP in this respect:
>
>
> >>> import datetime
> >>> format(datetime.datetime.now(), "{{%Y")
> '{{2011'
> >>> "{0:{{%{1}}".format(datetime.datetime.now(), 'Y') # (i)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ValueError: unmatched '{' in format
> >>> "{0:{{%{1}}}}".format(datetime.datetime.now(), 'Y') # (j)
> '{2011}'
> '{2011}'
>
> Here the brace characters in (i) and (j) are treated, again in
> MarkupIterator_next, as indicating the start of a replacement field.
> In (i), this leads the function to throw an exception; since they're
> balanced in (j), processing proceeds further, and the doubled braces
> aren't treated as indicating the start or end of a replacement
> field---because they're escaped.  Given that the format spec part of a
> replacement field can contain further replacement fields, this is, I
> think, correct behavior, but it's not easy to see how it comports with
> the PEP, whose language is not very exact.
>
> The call to the built-in format() bypasses the mechanism that leads to
> these results.
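A small sketch of that difference, using a fixed date (an arbitrary choice) so the output does not depend on when it is run. The built-in format() hands the spec to the object unchanged, while str.format first scans the template for braces.

```python
import datetime

moment = datetime.datetime(2011, 7, 7, 1, 2, 38)  # fixed so the result is stable

# format() passes the spec straight to datetime.__format__, which treats it
# as a strftime pattern, so the braces are ordinary characters here.
direct = format(moment, "{{%Y")

# In a str.format template the leading '{{' sits outside any field, so it is
# an escape for a single '{'.
via_str_format = "{{{0:%Y}".format(moment)
```

Here direct is '{{2011' and via_str_format is '{2011'.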
>
> The PEP is very, very nonspecific about the parts of the replacement
> field that precede the format specifier. I don't know what kind of
> discussion surrounded the drawing up of the grammar that appears in
> the documentation, but I think that it, and not the implementation,
> should be followed.
>
> The implementation only works the way it does because of parsing
> shortcuts: it raises ValueErrors for (b) and (c) because it
> generalizes something true of the attribute_name field (encountering a
> ':' or '!' means one has moved on to the format_spec or conversion
> part of the replacement field) to the element_index field. And it
> raises an error for (e) and (f), but lets (g) through, for the reasons
> already mentioned. It is, in that respect, inconsistent; it treats the
> curly brace as having one semantic significance at one point and an
> entirely different significance at another point, so that it does the
> right thing in the case of (g) entirely by accident. There is, I
> think, no way to construe the PEP so that it is reasonable to do what
> the present implementation does in all three cases (if "{" indicates
> the start of a replacement field in (f), it should do so in (g) as
> well); I think it's actually pretty difficult to construe the PEP in
> any way that makes what it does in the case of (e) and (f) correct.
>
> 4. Who cares?
>
> Well, I do. (Obviously.) I even have a use case: I want to be able to
> put arbitrary (or as close to arbitrary as possible) strings in the
> element_index field, because I've got some objects that (should!)
> enable me to do this:
>
> p.say("I'm warning you, {e.red.underline[don't touch that!]}")
>
> and have this written ("e" for "effects") to p.out:
>
> I'm warning you, \x1b[31m\x1b[4mdon't touch that!\x1b[0m
>
> I have a way around the square bracket problem, but it would be quite
> burdensome to have to deal with all of !:{} as well; enough that
> I would fall back on something like this:
>
> "I'm warning you, {0}".format(e.red.underline("don't touch that!"))
>
> or some similar interpolation-based strategy, which I think would be a
> shame, because of the way it chops up the string.
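A minimal sketch of how an object like "e" might be built, assuming attribute access accumulates terminal codes and indexing wraps the text in them. The class name Effects and its code table are hypothetical, and the example text avoids '!', ':', '{', and '}' so that it does not depend on the behavior under discussion.

```python
class Effects(object):
    """Hypothetical helper: attributes collect SGR codes, indexing applies them."""

    CODES = {"red": 31, "underline": 4}
    RESET = "\x1b[0m"

    def __init__(self, codes=()):
        self._codes = tuple(codes)

    def __getattr__(self, name):
        # Only called for names not found normally, e.g. "red".
        try:
            code = self.CODES[name]
        except KeyError:
            raise AttributeError(name)
        return Effects(self._codes + (code,))

    def __getitem__(self, text):
        # Emit one escape sequence per collected code, then the text, then reset.
        prefix = "".join("\x1b[%dm" % code for code in self._codes)
        return prefix + str(text) + self.RESET


e = Effects()
message = "I'm warning you, {e.red.underline[stop]}".format(e=e)
```

The element_index "stop" reaches __getitem__ as a plain string, which is what makes this style of markup possible in the first place.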
>
> But I also think the present behavior is extremely counterintuitive,
> unnecessarily complex, and illogical (even unpythonic!). Isn't it
> bizarre that (g) should work, given what (e) and (f) do? Isn't it
> strange that (b) and (c) should fail, given that there's no real
> reason for them to do so---no ambiguity that has to be avoided? And
> something's gotta give; the documentation and the implementation do
> not correspond.
>
> Beyond the counterintuitiveness of the present implementation, it is
> also, I think, self-inconsistent. (e) and (f) fail because the
> interior brace is treated as starting or ending a replacement field,
> even though interior replacement fields aren't allowed in that part of
> a replacement field. (g) succeeds because the braces are balanced:
> they are treated at one point as if they were part of a replacement
> field, and at another (correctly) as if they are not. But this makes
> the failure of (e) and (f) unaccountable. It would not have been
> impossible for the PEP to allow interior replacement fields anywhere,
> and not just in the format spec, in which case we might have had this:
>
> (g') "{0[{][}]}".format(range(10), **{'][':4}) --> '4'
> or this:
> (g'') "{0[{][}]}".format({'4':3}, **{'][':4}) --> '3'
> or something with that general flavor.
>
> As far as I can tell, the only way to consistently maintain that (e)
> and (f) should fail requires that one take (g') or (g'') to be
> correct: either the interior braces signal replacement fields (hence
> must be balanced) or they don't (or they're escaped).
>
> Regarding the documentation, it could of course be rendered correct by
> changing *it*, rather than the implementation. But wouldn't it be
> tremendously weird to have to explain that, in the part of the
> replacement field preceding the conversion, you can't employ any curly
> braces, unless they're balanced, and you can't employ ':' or '!' at
> all, even though they have no semantic significance? So these are out:
>
> {0[{].foo}
> {0[}{}]}
> {0[a:b]}
>
> But these are in:
>
> {0[{}{}]}
> {0[{{].foo.}}}  (k)
>
> ((k) does work, if you give it an object with the right structure,
> though it probably should not.)
>
> And, moreover, someone would then have to actually change the
> documentation, whereas there's a patch already, attached to the bug
> report linked way up at the top of this email, that makes (a)--(g) all
> work, leaves (i) and (j) as they are, and has the welcome side-effect
> of making (k) *not* work (if any code anywhere relies on (k) or things
> like it working, I will be very surprised: anyway the fact that (k)
> works is, technically, undocumented). It is also quite simple. It
> doesn't affect the built-in format(), because the built-in format() is
> concerned only with the format-specifier part of the replacement field
> and not all the stuff that comes before that, telling str.format
> *what* object to format.
>
> Thanks for reading,
> --
> Ben Wolfson
> "Human kind has used its intelligence to vary the flavour of drinks,
> which may be sweet, aromatic, fermented or spirit-based. ... Family
> and social life also offer numerous other occasions to consume drinks
> for pleasure." [Larousse, "Drink" entry]
>

-- 
Ben Wolfson