PEP 3101 implementation vs. documentation

Hello,

I'm writing because discussion in a bug report I submitted (<http://bugs.python.org/issue12014>) has suggested that, insofar as at least part of the issue revolves around the interpretation of PEP 3101, that aspect belongs on python-dev. In particular, I was told that the PEP, not the documentation, is authoritative. Since I'm the one who thinks something is wrong, it seems appropriate for me to be the one to bring it up.

Basically, the issue is that the current behavior of str.format is at variance at least with the documentation <http://docs.python.org/library/string.html#formatstrings>, is almost certainly at variance with PEP 3101 in one respect, and is in my opinion at variance with PEP 3101 in another respect as well, regarding what characters can be present in what the grammar given in the documentation calls an element_index, that is, the bit between the square brackets in "{0.attr[idx]}".

Both discovering the current behavior and interpreting the documentation are pretty straightforward; interpreting what the PEP actually calls for is more vexed. I'll do the first two things first.

TOC for the remainder:

1. What does the current implementation do?
2. What does the documentation say?
3. What does the PEP say? [this part is long, but the PEP is not clear, and I wanted to be thorough]
4. Who cares?

1. What does the current implementation do?
Suppose you have this dictionary:

    d = {"@": 0,
         "!": 1,
         ":": 2,
         "^": 3,
         "}": 4,
         "{": {"}": 5},
         }

Then the following expressions have the following results:

    (a) "{0[@]}".format(d)    --> '0'
    (b) "{0[!]}".format(d)    --> ValueError: Missing ']' in format string
    (c) "{0[:]}".format(d)    --> ValueError: Missing ']' in format string
    (d) "{0[^]}".format(d)    --> '3'
    (e) "{0[}]}".format(d)    --> ValueError: Missing ']' in format string
    (f) "{0[{]}".format(d)    --> ValueError: unmatched '{' in format
    (g) "{0[{][}]}".format(d) --> '5'

Given (e) and (f), I think (g) should be a little surprising, though you can probably guess what's going on and it's not hard to see why it happens if you look at the source: (e) and (f) fail because MarkupIterator_next (in Objects/stringlib/string_format.h) scans through the string looking for curly braces, because it treats them as semantically significant no matter what context they occur in. So, according to MarkupIterator_next, the *first* right curly brace in (e) indicates the end of the replacement field, giving "{0[}". In (f), the second left curly brace indicates (to MarkupIterator_next) the start of a *new* replacement field, and since there's only one right curly brace, it complains. In (g), MarkupIterator_next treats the second left curly brace as starting a new replacement field and the first right curly brace as closing it. However, actually, those braces don't define new replacement fields, as indicated by the fact that the whole expression treats the element_index fields as just plain old strings. (So the current implementation is somewhat schizophrenic, acting at one point as if the braces have to be balanced because '{[]}' is a replacement field and at another point treating the braces as insignificant.)

The explanation for (b) and (c) is that parse_field (same source file) treats ':' and '!' as indicating the end of the field_name section of the replacement field, regardless of whether those characters occur within square brackets or not.
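The cases above are easy to re-run on whatever interpreter is at hand. The following is a minimal sketch, not part of the original message: the helper name try_format is made up for illustration, and the outcomes of (b), (c), (e), (f) and (g) are deliberately not hard-coded, since the message describes the 2011 implementation and other versions may behave differently.

```python
# The dictionary and format strings from cases (a)-(g) above.
d = {"@": 0, "!": 1, ":": 2, "^": 3, "}": 4, "{": {"}": 5}}

cases = ["{0[@]}", "{0[!]}", "{0[:]}", "{0[^]}", "{0[}]}", "{0[{]}", "{0[{][}]}"]


def try_format(template, *args):
    """Hypothetical helper: return the formatted string, or a description
    of the exception that str.format raised."""
    try:
        return template.format(*args)
    except Exception as exc:  # ValueError, KeyError, IndexError, ...
        return "{}: {}".format(type(exc).__name__, exc)


results = {template: try_format(template, d) for template in cases}

for label, template in zip("abcdefg", cases):
    print("({}) {!r:14} --> {}".format(label, template, results[template]))
```

Only (a) and (d) contain no character that is special anywhere in a replacement field, so those two are the ones that can be expected to give '0' and '3' regardless of how the parser treats the others.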
So, that's what the current implementation does, in those cases.

2. What does the documentation say?

The documentation gives a grammar for replacement fields:

"""
replacement_field ::= "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name        ::= arg_name ("." attribute_name | "[" element_index "]")*
arg_name          ::= [identifier | integer]
attribute_name    ::= identifier
element_index     ::= integer | index_string
index_string      ::= <any source character except "]"> +
conversion        ::= "r" | "s"
format_spec       ::= <described in the next section>
"""

Given this definition of index_string, all of (a)--(g) should be legal, and the results should be the strings '0', '1', '2', '3', '4', "{'}': 5}", and '5'. There is no need to exclude ':', '!', '}', or '{' from the index_string field; allowing them creates no ambiguity, because the field is delimited by square brackets.

Tangent: the definition of attribute_name here also does not describe the current behavior ("{0. ;}".format(x) works fine and will call getattr(x, " ;")) and the PEP does not require the attribute_name to be an identifier; in fact it explicitly states that the attribute_name doesn't need to be a valid Python identifier. attribute_name should read (to reflect something like actual behavior, anyway) "<any source character except '[', '.', ':', '!', '{', or '}'> +". The same goes for arg_name (with the integer alternation). Observe:
One can also presently do this (hence "something like actual behavior"):
3. What does the PEP say?

Well... It's actually hard to tell!

Summary: The PEP does not contain a grammar for replacement fields, and is surprisingly nonspecific about what can appear where, at least when talking about the part of the replacement field before the format specifier. The most reasonable interpretation of the parts of the PEP that seem to be relevant favors the documentation, rather than the implementation.

This can be separated into two sub-questions.

A. What does the PEP say about ':' and '!'?

It says two things that pertain to element_index fields. The first is this:

"""
The rules for parsing an item key are very simple. If it starts with a digit, then it is treated as a number, otherwise it is used as a string.

Because keys are not quote-delimited, it is not possible to specify arbitrary dictionary keys (e.g., the strings "10" or ":-]") from within a format string.
"""

So it notes that some things can't be used:

- Because anything composed entirely of digits is treated as a number, you can't get a string composed entirely of digits. Clear enough.

- What's the explanation for the second example, though? Well, you can't have a right square bracket in the index_string, so that would already mean that you can't do this:

      "{0[:-]]}".format(...)

  regardless of whether colons are legal or not.

So, although the PEP gives an example of a string that can't appear in the element_index part of a replacement field, and that string contains a colon, that string would have been disallowed for other reasons anyway.

The second is this:

"""
The str.format() function will have a minimalist parser which only attempts to figure out when it is "done" with an identifier (by finding a '.' or a ']', or '}', etc.).
"""

This requires some interpretation. For one thing, the contents of an element_index aren't identifiers. For another, it's not true that you're done with an identifier (or whatever) whenever you see *any* of those; it depends on context.
When parsing this: "{0[a.b]}" the parser should not stop at the '.'; it should keep going until it reaches the ']', and that will give the element_index. And when parsing this: "{0.f]oo[bar].baz}", it's done with the identifier "f]oo" when it reaches the '[', not when it reaches the second '.', and not when it reaches the ']', either (recall (h)).

The "minimalist parser" is, I take it, one that works like this: when parsing an arg_name you're done when you reach a '[', a ':', a '!', a '.', '{', or a '}'; the same rules apply when parsing an attribute_name; when parsing an element_index you're done when you see a ']'.

Now as regards the curly braces there are some other parts of the PEP that perhaps should be taken into account (see below), but I can find no justification for excluding ':' and '!' from the element_index field; the bit quoted above about having a minimalist parser isn't a good justification for that, and it's the only part of the entire PEP that comes close to addressing the question.

B. What does it say about '}' and '{'?

It still doesn't say much explicitly. There's the quotation I just gave, and then these passages:

"""
Brace characters ('curly braces') are used to indicate a replacement field within the string:

[...]

Braces can be escaped by doubling:
"""

Taken by itself, this would suggest that wherever there's an unescaped brace, there's a replacement field. That would mean that the current implementation's behavior is correct in (e) and (f) but incorrect in (g). However, I think this is a bad interpretation; unescaped braces can indicate the presence of a replacement field without that meaning that *within* a replacement field braces have that meaning, no matter where within the replacement field they occur.

Later in the PEP, talking about this example:

    "{0:{1}}".format(a, b)

We have this:

"""
These 'internal' replacement fields can only occur in the format specifier part of the replacement field. Internal replacement fields cannot themselves have format specifiers. This implies also that replacement fields cannot be nested to arbitrary levels.

Note that the doubled '}' at the end, which would normally be escaped, is not escaped in this case. The reason is because the '{{' and '}}' syntax for escapes is only applied when used *outside* of a format field. Within a format field, the brace characters always have their normal meaning.
"""

The claim "within a format field, the brace characters always have their normal meaning" might be taken to mean that within a *replacement* field, brace characters always indicate the start (or end) of a replacement field. But the PEP at this point is clearly talking about the formatting section of a replacement field---the part that follows the ':', if present. ("Format field" is nowhere defined in the PEP, but it seems reasonable to take it to mean "the format specifier of a replacement field".) However, it seems most reasonable to me to take "normal meaning" to mean "just a brace character".

Note that the present implementation only kinda sorta conforms to the PEP in this respect:
Here the brace characters in (i) and (j) are treated, again in MarkupIterator_next, as indicating the start of a replacement field. In (i), this leads the function to throw an exception; since they're balanced in (j), processing proceeds further, and the doubled braces aren't treated as indicating the start or end of a replacement field---because they're escaped. Given that the format spec part of a replacement field can contain further replacement fields, this is, I think, correct behavior, but it's not easy to see how it comports with the PEP, whose language is not very exact. The call to the built-in format() bypasses the mechanism that leads to these results.

The PEP is very, very nonspecific about the parts of the replacement field that precede the format specifier. I don't know what kind of discussion surrounded the drawing up of the grammar that appears in the documentation, but I think that it, and not the implementation, should be followed.

The implementation only works the way it does because of parsing shortcuts: it raises ValueErrors for (b) and (c) because it generalizes something true of the attribute_name field (encountering a ':' or '!' means one has moved on to the format_spec or conversion part of the replacement field) to the element_index field. And it raises an error for (e) and (f), but lets (g) through, for the reasons already mentioned. It is, in that respect, inconsistent; it treats the curly brace as having one semantic significance at one point and an entirely different significance at another point, so that it does the right thing in the case of (g) entirely by accident. There is, I think, no way to construe the PEP so that it is reasonable to do what the present implementation does in all three cases (if "{" indicates the start of a replacement field in (f), it should do so in (g) as well); I think it's actually pretty difficult to construe the PEP in any way that makes what it does in the case of (e) and (f) correct.

4. Who cares?
Well, I do. (Obviously.) I even have a use case: I want to be able to put arbitrary (or as close to arbitrary as possible) strings in the element_index field, because I've got some objects that (should!) enable me to do this:

    p.say("I'm warning you, {e.red.underline[don't touch that!]}")

and have this written ("e" for "effects") to p.out:

    I'm warning you, \x1b[31m\x1b[4mdon't touch that!\x1b[0m

I have a way around the square bracket problem, but it would be quite burdensome to have to deal with all of !:{} as well; enough that I would fall back on something like this:

    "I'm warning you, {0}".format(e.red.underline("don't touch that!"))

or some similar interpolation-based strategy, which I think would be a shame, because of the way it chops up the string.

But I also think the present behavior is extremely counterintuitive, unnecessarily complex, and illogical (even unpythonic!). Isn't it bizarre that (g) should work, given what (e) and (f) do? Isn't it strange that (b) and (c) should fail, given that there's no real reason for them to do so---no ambiguity that has to be avoided? And something's gotta give; the documentation and the implementation do not correspond.

Beyond the counterintuitiveness of the present implementation, it is also, I think, self-inconsistent. (e) and (f) fail because the interior brace is treated as starting or ending a replacement field, even though interior replacement fields aren't allowed in that part of a replacement field. (g) succeeds because the braces are balanced: they are treated at one point as if they were part of a replacement field, and at another (correctly) as if they are not. But this makes the failure of (e) and (f) unaccountable.
It would not have been impossible for the PEP to allow interior replacement fields anywhere, and not just in the format spec, in which case we might have had this:

    (g') "{0[{][}]}".format(range(10), **{'][':4}) --> '3'

or this:

    (g'') "{0[{][}]}".format({'4':3}, **{'][':4}) --> '3'

or something with that general flavor. As far as I can tell, the only way to consistently maintain that (e) and (f) should fail requires that one take (g') or (g'') to be correct: either the interior braces signal replacement fields (hence must be balanced) or they don't (or they're escaped).

Regarding the documentation, it could of course be rendered correct by changing *it*, rather than the implementation. But wouldn't it be tremendously weird to have to explain that, in the part of the replacement field preceding the conversion, you can't employ any curly braces, unless they're balanced, and you can't employ ':' or '!' at all, even though they have no semantic significance? So these are out:

    {0[{].foo}
    {0[}{}]}
    {0[a:b]}

But these are in:

    {0[{}{}]}
    {0[{{].foo.}}}   (k)

((k) does work, if you give it an object with the right structure, though it probably should not.)

And, moreover, someone would then have to actually change the documentation, whereas there's a patch already, attached to the bug report linked way up at the top of this email, that makes (a)--(g) all work, leaves (i) and (j) as they are, and has the welcome side-effect of making (k) *not* work (if any code anywhere relies on (k) or things like it working, I will be very surprised: anyway the fact that (k) works is, technically, undocumented). It is also quite simple. It doesn't affect the built-in format(), because the built-in format() is concerned only with the format-specifier part of the replacement field and not all the stuff that comes before that, telling str.format *what* object to format.
Thanks for reading,

-- 
Ben Wolfson
"Human kind has used its intelligence to vary the flavour of drinks, which may be sweet, aromatic, fermented or spirit-based. ... Family and social life also offer numerous other occasions to consume drinks for pleasure." [Larousse, "Drink" entry]
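For readers who want to picture the use case in section 4, here is a rough sketch of an "effects" object driven from a format string. Everything in it is an assumption made for illustration: the class name Effects, its CODES table and its methods are invented here and are not the objects described in the message. The sample text avoids '!', ':' and braces, since whether those are accepted inside the square brackets is exactly what the thread is about.

```python
class Effects:
    """Hypothetical stand-in for the 'e' object in the message above."""

    # ANSI escape sequences for the two effects used in the example.
    CODES = {"red": "\x1b[31m", "underline": "\x1b[4m"}
    RESET = "\x1b[0m"

    def __init__(self, prefix=""):
        self._prefix = prefix

    def __getattr__(self, name):
        # Only called for names not found normally, i.e. effect names.
        try:
            code = self.CODES[name]
        except KeyError:
            raise AttributeError(name)
        return Effects(self._prefix + code)

    def __getitem__(self, text):
        # str.format hands over the element_index; wrap it in the
        # accumulated escape codes and reset afterwards.
        return self._prefix + str(text) + self.RESET


message = "I'm warning you, {e.red.underline[don't touch that]}".format(e=Effects())
print(repr(message))
```

The attribute lookups accumulate escape codes and the final item lookup wraps the text, so the whole effect stays inline in the format string rather than being split out into a separate call.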

On Sat, Jun 11, 2011 at 7:15 AM, Ben Wolfson <wolfson@gmail.com> wrote:
[snip very thorough analysis]

To summarise (after both the above post and the discussion on the tracker):

The current str.format implementation differs from the documentation in two ways:

1. It ignores the presence of an unclosed index field when processing a replacement field (placing additional restrictions on allowable characters in index strings).
2. Replacement fields that appear in name specifiers are processed by the parser for brace-matching purposes, but not substituted.

More accurate documentation would state that:

1. Numeric name fields start with a digit and are terminated by any non-numeric character.
2. An identifier name field is terminated by any one of:
     '}' (terminates the replacement field, unless preceded by a matching '{' character, in which case it is ignored and included in the string)
     '!' (terminates name field, starts conversion specifier)
     ':' (terminates name field, starts format specifier)
     '.' (terminates current name field, starts new name field for subattribute)
     '[' (terminates name field, starts index field)
3. An index field is terminated by one of:
     '}' (terminates the replacement field, unless preceded by a matching '{' character, in which case it is ignored and included in the string)
     '!' (terminates index field, starts conversion specifier)
     ':' (terminates index field, starts format specifier)
     ']' (terminates index field, subsequent character will determine next field)

This existing behaviour can certainly be documented as such, but is rather unintuitive and (given that '}', '!' and ':' will always error out if appearing in an index field) somewhat silly.

So, the two changes that I believe Ben is proposing would be as follows:

1. When processing a name field, brace-matching is suspended. Between the opening '{' character and the closing '}', '!' or ':' character, additional '{' characters are ignored for matching purposes.
2. When processing an index field, all special processing is suspended until the terminating ']' is reached.

The rules for name fields would then become:

1. Numeric fields start with a digit and are terminated by any non-numeric character.
2. An identifier name field is terminated by any one of:
     '}' (terminates the replacement field)
     '!' (terminates identifier field, starts conversion specifier)
     ':' (terminates identifier field, starts format specifier)
     '.' (terminates identifier field, starts new identifier field for subattribute)
     '[' (terminates identifier field, starts index field)
3. An index field is terminated by ']' (subsequent character will determine next field)

That second set of rules is *far* more in line with the behaviour of the rest of the language than the status quo, so unless the difficulty of making the str.format mini-language parser work that way is truly prohibitive, it certainly seems worthwhile to tidy up the semantics.

The index field behaviour should definitely be fixed, as it poses no backwards compatibility concerns. The brace matching behaviour should probably be left alone, as changing it would potentially break currently valid format strings (e.g. "{a{0}}".format(**{'a{0}':1}) produces '1' now, but would raise an exception if the brace matching rules were changed).

So +1 on making the str.format parser accept anything other than ']' inside an index field and turn the whole thing into an ordinary string, -1 on making any other changes to the brace-matching behaviour.

That would leave us with the following set of rules for name fields:

1. Numeric fields start with a digit and are terminated by any non-numeric character.
2. An identifier name field is terminated by any one of:
     '}' (terminates the replacement field, unless preceded by a matching '{' character, in which case it is ignored and included in the string)
     '!' (terminates identifier field, starts conversion specifier)
     ':' (terminates identifier field, starts format specifier)
     '.' (terminates identifier field, starts new identifier field for subattribute)
     '[' (terminates identifier field, starts index field)
3. An index field is terminated by ']' (subsequent character will determine next field)

Note that brace-escaping currently doesn't work inside name fields, so that should also be fixed:
As far as I can recall, the details of this question didn't come up when PEP 3101 was developed, so the PEP isn't a particularly good source to justify anything in relation to this - it is best to consider the current behaviour to just be the way it happened to be implemented rather than a deliberate design choice.

Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
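The second rule set in the message above (brace matching suspended in name fields, index fields running to the next ']') is small enough to sketch in pure Python. This is only an illustration under stated assumptions: the function names scan_field_name, split_field_name and resolve are invented here, this is not the patch attached to the issue, and automatic numbering, conversions and format specs are left out.

```python
def scan_field_name(text, start=0):
    """Return (field_name, end), where text[end] is the '}', '!' or ':'
    that terminates the name. Inside '[...]' nothing is special."""
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "[":
            close = text.find("]", i + 1)
            if close == -1:
                raise ValueError("Missing ']' in format string")
            i = close + 1  # skip the whole index field
            continue
        if ch in "}!:":
            return text[start:i], i
        i += 1
    raise ValueError("expected '}' before end of string")


def split_field_name(field_name):
    """Split 'arg.attr[index]...' into (arg_name, [(kind, value), ...])."""
    n = len(field_name)
    i = 0
    while i < n and field_name[i] not in ".[":
        i += 1
    arg_name, steps = field_name[:i], []
    while i < n:
        if field_name[i] == ".":
            j = i + 1
            while j < n and field_name[j] not in ".[":
                j += 1
            steps.append(("attr", field_name[i + 1:j]))
            i = j
        elif field_name[i] == "[":
            j = field_name.find("]", i + 1)
            if j == -1:
                raise ValueError("Missing ']' in format string")
            key = field_name[i + 1:j]
            # All-digit keys become integers, everything else stays a string.
            steps.append(("index", int(key) if key.isdecimal() else key))
            i = j + 1
        else:
            raise ValueError("Only '.' or '[' may follow ']'")
    return arg_name, steps


def resolve(field_name, args, kwargs):
    """Look up the object a field name refers to."""
    arg_name, steps = split_field_name(field_name)
    obj = args[int(arg_name)] if arg_name.isdecimal() else kwargs[arg_name]
    for kind, value in steps:
        obj = getattr(obj, value) if kind == "attr" else obj[value]
    return obj


d = {"@": 0, "!": 1, ":": 2, "^": 3, "}": 4, "{": {"}": 5}}
name, end = scan_field_name("0[{][}]}")
print(name, end, resolve(name, (d,), {}))
```

Under these rules every one of the index strings from Ben's examples (a)--(g) resolves to the corresponding dictionary entry, because the only character the index scanner looks for is ']'.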

Nick Coghlan wrote: [snip]
+1
-1 for leaving the brace matching behavior alone, as it's very unintuitive for *the user*. For the implementor it may make sense to count matching braces, but definitely not for the user. I don't believe that "{a{0}}" is a real use case that someone might already use, as it's a hard violation of what the documentation currently says.

I'd rather disallow braces in the replacement field before the format specifier altogether. Or closing braces at the minimum. Furthermore, the double-escaping sounds reasonable in the format specifier, but not elsewhere.

My motivation is that the user should be able to have a quick glance at the format string and see where the replacement fields are. This is probably what the PEP intends to say when disallowing braces inside the replacement field.

In my opinion, it's easy to write the parser in a way that braces are parsed in any imaginable manner. Or maybe not easy, but not harder than any other way of handling braces.
-1. Why do we need braces inside replacement fields at all (except for inner replacements in the format specifier)? I strongly believe that the PEP's use case is the simple one:

    '{foo}'.format(foo=10)

In my opinion, these '{!#%}'.format(**{'!#%': 10}) cases are not real. The current documentation requires field_name to be a valid identifier, and this is a sane requirement. The only problem is that parsing identifiers correctly is very hard, so it can be made simpler by allowing some non-identifiers. But we still don't have to accept braces.

---

On a somewhat different issue, I'm confused about this:
    >>> '{a[1][2]}'.format(a={1:{2:3}})
    '3'
and even more about this:
    >>> '{a[1].foo[2]}'.format(a={1:namedtuple('x', 'foo')({2:3})})
    '3'
Why does this work? It's against the current documentation. The documented syntax only allows zero or one attribute names and zero or one element index, in this order. Is it intentional that we allow arbitrary chains of getattr and __getitem__? If we do, this should be documented, too. Petri
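For what it's worth, the grammar quoted earlier in the thread ends the field_name rule with a trailing "*", which reads as permitting any number of ".attribute_name" and "[element_index]" steps in any order. The two examples can be checked against the equivalent ordinary Python expressions; the namedtuple type name X below is just a label chosen for this sketch.

```python
from collections import namedtuple

# A one-field record type, as in the second example above.
X = namedtuple("X", "foo")

nested_dicts = {1: {2: 3}}
dict_of_records = {1: X({2: 3})}

# Two index steps in a row.
chained_items = "{a[1][2]}".format(a=nested_dicts)

# Index, then attribute, then index again.
mixed_chain = "{a[1].foo[2]}".format(a=dict_of_records)

# The same lookups written as plain Python expressions.
print(chained_items, str(nested_dicts[1][2]))
print(mixed_chain, str(dict_of_records[1].foo[2]))
```

Each step is applied to the result of the previous one, which is what the repetition in the field_name rule describes.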

On 6/11/2011 6:32 AM, Petri Lehtinen wrote:
Nick Coghlan wrote: [snip]
It seems to me that the intent of the pep and the current doc is that field_names should match what one would write in code except that quotes are left off of literal string keys. Which is to say, the brackets [] serve as quote marks. So once '[' is found, the scanner must shift to 'in index' mode and accept everything until a matching ']' is found, ending 'in index' mode.

The arg_name is documented as int or identifier and attribute_name as identifier, period. Anything more than that is an implementation accident which people should not count on in either future versions or alternate implementations.

I can imagine uses for nested replacement fields in the field_name or conversion spec. I.e., '{ {0}:{1}d'.format(2,5,333,444) == ' 333', whereas changing the first arg to 3 would produce ' 444'. If braces are allowed in either of the first two segments (outside of the 'quoted' within braces context), I think it should only be for the purpose of a feature addition that makes them significant.

It strikes me that the underlying problem is that the replacement_field scanner is, apparently, hand-crafted rather than being generated from the corresponding grammar, as is the overall Python lexer-parser. So it has no necessary connection with the grammar.

-- 
Terry Jan Reedy

On Sat, Jun 11, 2011 at 2:16 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Sat, Jun 11, 2011 at 7:15 AM, Ben Wolfson <wolfson@gmail.com> wrote: To summarise (after both the above post and the discussion on the tracker)
Thanks for the summary!
A minor clarification since I mentioned a patch: the patch as it exists implements *these*---Nick's---semantics. That is, it will allow these:

    "{0.{a}}".format(x)
    "{0.{[{].}}".format(x)

But not this, because it keeps current brace-matching in this context:

    "{0.{a}".format(x)

And it treats this:

    "{0.a}}".format(x)

as the markup "{0.a}" followed by the character data "}".

The patch would have to be changed to turn off brace balancing in name fields as well. In either case there would be potential breakage, since this:

    "{0[{}.}}".format(...)

currently works, but would not work anymore, under either set of rules. (The likelihood that this potential breakage would anywhere be actual breakage is pretty slim, though.)
This is a slightly different issue, though, isn't it? As far as I can tell, if the brace-matching rules are kept in place, there would never be any *need* for escaping. You can't have an internal replacement field in this part of the replacement field, so '{' can always safely be assumed to be Just a Brace and not the start of a replacement field, regardless of whether it's doubled, and '}' will either be in an index field (where it can't have the significance of ending the replacement field) or it will be (a) the end of the replacement field or (b) not the end of the replacement field because matched by an earlier '{'. So there would never be any role for escaping to play.

There would be a role for escaping if the rules for name fields are that '}' terminates them, no matching done; then, you could double them to get a '}' in the name field. But, to be honest, that strikes me as introducing a lot of heavy machinery for very little gain; opening and closing braces would have to be escaped to accommodate this one thing. And it's not as if you can escape ']' in index fields---which would be a parallel case. It seems significantly simpler to me to leave the escaping behavior as it is in this part of the replacement field.

-- 
Ben Wolfson

Ben Wolfson wrote:
I'm worried that the rules in this area are getting too complicated for a human to follow. If braces are allowed as plain data between square brackets and/or vice versa, it's going to be a confusing mess to read, and there will always be some doubt in the programmer's mind as to whether they have to be escaped somehow or not.

I'm inclined to think that any such difficult cases should simply be disallowed. If the docs say an identifier is required someplace, the implementation should adhere strictly to that. It's not *that* hard to parse an identifier properly, and IMO any use case that requires putting arbitrary characters into an item selector is abusing the format mechanism and should be redesigned to work some other way.

-- 
Greg

On Sat, Jun 11, 2011 at 4:29 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
There are two cases with the braces: attribute selection and item selection. The docs say that attributes should be identifiers, and that the argument name should be an integer or an identifier, but that the item selector can essentially be an arbitrary string as long as it doesn't contain ']', which indicates its end. The docs as they stand suggest that braces in item selectors should be treated as plain data, since they mention escaping only in the context of how to get a brace into literal text rather than into a replacement field:

"""
Format strings contain “replacement fields” surrounded by curly braces {}. Anything that is not contained in braces is considered literal text, which is copied unchanged to the output. If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}.
"""

Current behavior is to perform escapes in the format spec part of a replacement field, and that, I think, makes sense, since there can be an internal replacement field there. However, it's still pretty simple to tell whether braces need to be escaped or not: a brace in the format spec does need to be escaped, a brace before the format spec doesn't.
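The two uncontroversial brace behaviours referred to above, doubling in literal text and an internal replacement field inside the format spec, look like this in practice. This is a small illustrative sketch; the variable names are arbitrary, and it deliberately does not exercise braces inside index or name fields, whose treatment is the disputed part.

```python
# Doubled braces in literal text come out as single braces.
literal = "{{}} stays literal, {0} is replaced".format("this")

# An internal replacement field supplies the format spec itself.
nested = "{0:{1}}".format(3.14159, ".2f")

# Internal fields may also be named, and mixed with fixed spec text.
padded = "{0:>{width}}".format("x", width=4)

print(literal)
print(nested)
print(repr(padded))
```

In both nested cases the inner field is substituted first, and the resulting text is then used as the format spec for the outer field.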
If by "item selector" you mean (using the names from the grammar in the docs) the element_index, I don't see why this should be the case; dictionaries can contain non-identifier keys, after all.

If you mean the attribute_name (and arg_name) parts, then requiring an identifier (or an integer for arg_name) makes a lot more sense. I assume that Talin had some reason for stating otherwise in the PEP (one of the few things that does get explicitly said about the field_name part), but I'm kind of at a loss for why; you would need to have a custom __getattribute__ to exploit it, and it would be a lot less confusing just to use __getitem__.

-- 
Ben Wolfson

Ben Wolfson wrote:
Of course they can, but that's not the point. The point is that putting arbitrary strings between [...] in a format spec without any form of quoting or requirement for bracket matching leads to something that's too confusing for humans to read. IMO the spec should be designed so that the format string can be parsed using the same lexical analysis rules as Python code. That means anything that is meant to "hang together" as a single unit, such as an item selector, needs to look like a single Python token, e.g. an integer or identifier. I realise this is probably more restrictive than the PEP suggests, but I think it would be better that way all round. -- Greg

On Mon, Jun 13, 2011 at 5:36 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
But there is a requirement for bracket matching: the "[" that opens the element_index is matched by the next "]". Arguably (as Terry Reedy said) this is also a form of quoting, in which the square brackets are the quotation operators. It seems no more confusing to me than allowing arbitrary strings between '"..."'; those quotation marks aren't even oriented. (Admittedly, syntax highlighting helps there.)

Compared to this:

    "{0: ^+#10o}"

a string like this:

    "this is normal text, but {e.underline[this text is is udnerlined {sic}!]}---and we're back to normal now"

is pretty damn readable to this human, nor do I see what about the rule "when you see a [, keep going until you see a ]" is supposed to be insuperably confusing. (Compare---not that it helps my case in regard to readability---grouping in regular expressions, where you don't usually have the aid of special syntax highlighting inside the string; you see a '(', you know that you've encountered a group which continues until the next (unescaped!) ')'. The stuff that comes inside the parentheses might look like line noise---and the whole thing might look like line noise---but *that* rule about the structure of a regexp is pretty straightforward.)
If that's the rationale, why not change the spec so that instead of this: "{0[spam]}" You do this: "{0['spam']}" ? Hangs together; single Python token. Bonus: it would make it possible for this to work: (a) "{0['12']}".format({'12': 4}) whereas currently this: "{0[12]}".format(...) passes the integer 12 to __getitem__, and (a) passes the string "'12'". (Discovery: the "abuse" of the format mechanism I want to perpetrate via element_index can also be perpetrated with a custom __format__ method:
. So any reform to make it impossible to use str.format creatively will have to be fairly radical. I think that my intended abuse is actually a perfectly reasonable use, but it would be disallowed if only integers and identifiers can be in the element_index field.)

-- 
Ben Wolfson
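Two points from the message above can be tried directly: what str.format hands to __getitem__ for "{0[12]}" versus "{0['12']}", and how a custom __format__ method receives arbitrary text through the format spec. The classes ShowKey and Shout are invented for this sketch and are not the code from the original message.

```python
class ShowKey:
    """Hypothetical helper: echo back the key that str.format looked up."""

    def __getitem__(self, key):
        return repr(key)


class Shout:
    """Hypothetical helper: treat the format spec as text to transform."""

    def __format__(self, spec):
        return spec.upper()


int_key = "{0[12]}".format(ShowKey())       # all digits: an integer key
quoted_key = "{0['12']}".format(ShowKey())  # quotes become part of the key
plain_key = "{0[spam]}".format(ShowKey())   # anything else: a string key

# Everything after the ':' is passed to __format__ unchanged.
shouted = "{0:don't touch that}".format(Shout())

print(int_key, quoted_key, plain_key)
print(shouted)
```

The quotes in "{0['12']}" are not interpreted as Python string syntax; they simply become part of the key, which is why the string "12" cannot be reached that way.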

FWIW, new patches have been attached to the bug report (http://bugs.python.org/issue12014), one of which is intended to bring behavior in line with the documentation, and the other of which is intended to implement Greg Ewing's proposal to allow only identifiers (or integers) in the arg_name, attribute_name, and element_index sections. On Fri, Jun 10, 2011 at 2:15 PM, Ben Wolfson <wolfson@gmail.com> wrote:
-- Ben Wolfson "Human kind has used its intelligence to vary the flavour of drinks, which may be sweet, aromatic, fermented or spirit-based. ... Family and social life also offer numerous other occasions to consume drinks for pleasure." [Larousse, "Drink" entry]

On Sat, Jun 11, 2011 at 7:15 AM, Ben Wolfson <wolfson@gmail.com> wrote:
[snip very thorough analysis]

To summarise (after both the above post and the discussion on the tracker):

The current str.format implementation differs from the documentation in two ways:

1. It ignores the presence of an unclosed index field when processing a replacement field (placing additional restrictions on allowable characters in index strings).
2. Replacement fields that appear in name specifiers are processed by the parser for brace-matching purposes, but not substituted.

More accurate documentation would state that:

1. Numeric name fields start with a digit and are terminated by any non-numeric character.
2. An identifier name field is terminated by any one of:
   '}' (terminates the replacement field, unless preceded by a matching '{' character, in which case it is ignored and included in the string)
   '!' (terminates name field, starts conversion specifier)
   ':' (terminates name field, starts format specifier)
   '.' (terminates current name field, starts new name field for subattribute)
   '[' (terminates name field, starts index field)
3. An index field is terminated by one of:
   '}' (terminates the replacement field, unless preceded by a matching '{' character, in which case it is ignored and included in the string)
   '!' (terminates index field, starts conversion specifier)
   ':' (terminates index field, starts format specifier)
   ']' (terminates index field, subsequent character will determine next field)

This existing behaviour can certainly be documented as such, but is rather unintuitive and (given that '}', '!' and ':' will always error out if appearing in an index field) somewhat silly.

So, the two changes that I believe Ben is proposing would be as follows:

1. When processing a name field, brace-matching is suspended. Between the opening '{' character and the closing '}', '!' or ':' character, additional '{' characters are ignored for matching purposes.
2. When processing an index field, all special processing is suspended until the terminating ']' is reached.

The rules for name fields would then become:

1. Numeric fields start with a digit and are terminated by any non-numeric character.
2. An identifier name field is terminated by any one of:
   '}' (terminates the replacement field)
   '!' (terminates identifier field, starts conversion specifier)
   ':' (terminates identifier field, starts format specifier)
   '.' (terminates identifier field, starts new identifier field for subattribute)
   '[' (terminates identifier field, starts index field)
3. An index field is terminated by ']' (subsequent character will determine next field)

That second set of rules is *far* more in line with the behaviour of the rest of the language than the status quo, so unless the difficulty of making the str.format mini-language parser work that way is truly prohibitive, it certainly seems worthwhile to tidy up the semantics.

The index field behaviour should definitely be fixed, as it poses no backwards compatibility concerns. The brace matching behaviour should probably be left alone, as changing it would potentially break currently valid format strings (e.g. "{a{0}}".format(**{'a{0}':1}) produces '1' now, but would raise an exception if the brace matching rules were changed).

So +1 on making the str.format parser accept anything other than ']' inside an index field and turn the whole thing into an ordinary string, -1 on making any other changes to the brace-matching behaviour.

That would leave us with the following set of rules for name fields:

1. Numeric fields start with a digit and are terminated by any non-numeric character.
2. An identifier name field is terminated by any one of:
   '}' (terminates the replacement field, unless preceded by a matching '{' character, in which case it is ignored and included in the string)
   '!' (terminates identifier field, starts conversion specifier)
   ':' (terminates identifier field, starts format specifier)
   '.' (terminates identifier field, starts new identifier field for subattribute)
   '[' (terminates identifier field, starts index field)
3. An index field is terminated by ']' (subsequent character will determine next field)

Note that brace-escaping currently doesn't work inside name fields, so that should also be fixed:
As far as I can recall, the details of this question didn't come up when PEP 3101 was developed, so the PEP isn't a particularly good source to justify anything in relation to this - it is best to consider the current behaviour to just be the way it happened to be implemented rather than a deliberate design choice. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
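For readers following the terminator rules in this message, here is a minimal sketch of how the standard library's string.Formatter.parse splits a well-formed format string into literal text, field name, format specifier and conversion. The names attr and idx are placeholders chosen for illustration, and the sketch deliberately avoids the disputed edge cases (braces, '!' or ':' inside an index field).

```python
from string import Formatter

# parse() yields (literal_text, field_name, format_spec, conversion) tuples.
parsed = list(Formatter().parse("x = {0.attr[idx]!r:>10} done"))

# '!' ends the field name and starts the conversion specifier;
# ':' starts the format specifier; '}' ends the replacement field.
for literal_text, field_name, format_spec, conversion in parsed:
    print(repr(literal_text), field_name, format_spec, conversion)
```

The field name comes back as one unsplit string ('0.attr[idx]'); breaking it into attribute and index lookups is a separate step.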

Nick Coghlan wrote: [snip]
+1
-1 for leaving the brace matching behavior alone, as it's very unintuitive for *the user*. For the implementor it may make sense to count matching braces, but definitely not for the user. I don't believe that "{a{0}}" is a real use case that someone might already use, as it's a hard violation of what the documentation currently says. I'd rather disallow braces in the replacement field before the format specifier altogether. Or closing braces at the minimum. Furthermore, the double-escaping sounds reasonable in the format specifier, but not elsewhere. My motivation is that the user should be able to take a quick glance at the format string and see where the replacement fields are. This is probably what the PEP intends to say when disallowing braces inside the replacement field. In my opinion, it's easy to write the parser in a way that braces are parsed in any imaginable manner. Or maybe not easy, but not harder than any other way of handling braces.
-1. Why do we need braces inside replacement fields at all (except for inner replacements in the format specifier)? I strongly believe that the PEP's use case is the simple one: '{foo}'.format(foo=10) In my opinion, these '{!#%}'.format(**{'!#%': 10}) cases are not real. The current documentation requires field_name to be a valid identifier, and this is a sane requirement. The only problem is that parsing identifiers correctly is very hard, so it can be made simpler by allowing some non-identifiers. But we still don't have to accept braces. --- As a somewhat separate issue, I'm confused about this:
>>> '{a[1][2]}'.format(a={1:{2:3}})
'3'

and even more about this:

>>> '{a[1].foo[2]}'.format(a={1:namedtuple('x', 'foo')({2:3})})
'3'
Why does this work? It's against the current documentation. The documented syntax only allows zero or one attribute names and zero or one element index, in this order. Is it intentional that we allow arbitrary chains of getattr and __getitem__? If we do, this should be documented, too. Petri
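The chained lookup Petri describes can be reproduced with the short sketch below, which compares the formatted result against the equivalent manual item and attribute lookups. The variable names are illustrative only.

```python
from collections import namedtuple

X = namedtuple("x", "foo")
a = {1: X({2: 3})}

# The field name 'a[1].foo[2]' is resolved left to right:
# item lookup, then attribute lookup, then item lookup again.
# Digits-only indexes such as '1' and '2' are converted to integers.
formatted = "{a[1].foo[2]}".format(a=a)
manual = str(getattr(a[1], "foo")[2])

print(formatted, manual)
```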

On 6/11/2011 6:32 AM, Petri Lehtinen wrote:
Nick Coghlan wrote: [snip]
It seems to me that the intent of the PEP and the current doc is that field_names should match what one would write in code, except that quotes are left off of literal string keys. Which is to say, the brackets [] serve as quote marks. So once '[' is found, the scanner must shift to 'in index' mode and accept everything until a matching ']' is found, ending 'in index' mode.

The arg_name is documented as int or identifier and attribute_name as identifier, period. Anything more than that is an implementation accident which people should not count on in either future versions or alternate implementations.

I can imagine uses for nested replacement fields in the field_name or conversion spec. I.e., '{ {0}:{1}d'.format(2,5,333,444) == ' 333', whereas changing the first arg to 3 would produce ' 444'. If braces are allowed in either of the first two segments (outside of the 'quoted' within braces context), I think it should only be for the purpose of a feature addition that makes them significant.

It strikes me that the underlying problem is that the replacement_field scanner is, apparently, hand-crafted rather than being generated from the corresponding grammar, as is the overall Python lexer-parser. So it has no necessary connection with the grammar. -- Terry Jan Reedy
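The 'in index' mode described above could be sketched roughly as follows. The function find_field_name_end is a hypothetical helper, not part of CPython or of any patch on the tracker; it takes the text between the outer braces of a replacement field and reports where the field name ends, treating everything between '[' and the next ']' as plain data.

```python
def find_field_name_end(body):
    """Return the index at which the field name ends in `body`.

    Hypothetical helper: `body` is the text of a replacement field
    without its outer braces. '!' and ':' end the field name only
    when they occur outside square brackets.
    """
    in_index = False
    for pos, ch in enumerate(body):
        if in_index:
            # 'in index' mode: accept everything until the closing ']'.
            if ch == "]":
                in_index = False
        elif ch == "[":
            in_index = True
        elif ch in "!:":
            return pos
    if in_index:
        raise ValueError("Missing ']' in format string")
    return len(body)


body = "0[:]!r"
end = find_field_name_end(body)
print(body[:end], body[end:])
```

Under this rule the ':' inside the brackets stays part of the index, and the '!' that follows the closing ']' starts the conversion specifier.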

On Sat, Jun 11, 2011 at 2:16 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Sat, Jun 11, 2011 at 7:15 AM, Ben Wolfson <wolfson@gmail.com> wrote: To summarise (after both the above post and the discussion on the tracker)
Thanks for the summary!
A minor clarification since I mentioned a patch: the patch as it exists implements *these*---Nick's---semantics. That is, it will allow these: "{0.{a}}".format(x) "{0.{[{].}}".format(x) But not this, because it keeps current brace-matching in this context: "{0.{a}".format(x) And it treats this: "{0.a}}".format(x) as the markup "{0.a}" followed by the character data "}". The patch would have to be changed to turn off brace balancing in name fields as well. In either case there would be potential breakage, since this: "{0[{}.}}".format(...) currently works, but would not work anymore, under either set of rules. (The likelihood that this potential breakage would anywhere be actual breakage is pretty slim, though.)
This is a slightly different issue, though, isn't it? As far as I can tell, if the brace-matching rules are kept in place, there would never be any *need* for escaping. You can't have an internal replacement field in this part of the replacement field, so '{' can always safely be assumed to be Just a Brace and not the start of a replacement field, regardless of whether it's doubled, and '}' will either be in an index field (where it can't have the significance of ending the replacement field) or it will be (a) the end of the replacement field or (b) not the end of the replacement field because matched by an earlier '{'. So there would never be any role for escaping to play. There would be a role for escaping if the rules for name fields are that '}' terminates them, no matching done; then, you could double them to get a '}' in the name field. But, to be honest, that strikes me as introducing a lot of heavy machinery for very little gain; opening and closing braces would have to be escaped to accommodate this one thing. And it's not as if you can escape ']' in index fields---which would be a parallel case. It seems significantly simpler to me to leave the escaping behavior as it is in this part of the replacement field. -- Ben Wolfson "Human kind has used its intelligence to vary the flavour of drinks, which may be sweet, aromatic, fermented or spirit-based. ... Family and social life also offer numerous other occasions to consume drinks for pleasure." [Larousse, "Drink" entry]

Ben Wolfson wrote:
I'm worried that the rules in this area are getting too complicated for a human to follow. If braces are allowed as plain data between square brackets and/or vice versa, it's going to be a confusing mess to read, and there will always be some doubt in the programmer's mind as to whether they have to be escaped somehow or not. I'm inclined to think that any such difficult cases should simply be disallowed. If the docs say an identifier is required someplace, the implementation should adhere strictly to that. It's not *that* hard to parse an identifier properly, and IMO any use case that requires putting arbitrary characters into an item selector is abusing the format mechanism and should be redesigned to work some other way. -- Greg
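A strict check of the kind suggested here might look like the sketch below. The name is_strict_selector is hypothetical, and the rule it encodes (decimal integer or Python identifier) is one reading of the proposal, not an existing str.format behaviour. Note that str.isidentifier() also accepts keywords such as "class".

```python
import re

_DIGITS = re.compile(r"[0-9]+")


def is_strict_selector(text):
    """True if `text` is a decimal integer or a valid Python identifier."""
    return bool(_DIGITS.fullmatch(text)) or text.isidentifier()


for candidate in ["spam", "12", "}", "this text", ""]:
    print(repr(candidate), is_strict_selector(candidate))
```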

On Sat, Jun 11, 2011 at 4:29 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
There are two cases with the braces: attribute selection and item selection. The docs say that attributes should be identifiers, and that the argument name should be an integer or an identifier, but that the item selector can essentially be an arbitrary string as long as it doesn't contain ']', which indicates its end. The docs as they stand suggest that braces in item selectors should be treated as plain data, since they mention escaping only in the context of how to get a brace in literal text rather than in a replacement field:

"""
Format strings contain “replacement fields” surrounded by curly braces {}. Anything that is not contained in braces is considered literal text, which is copied unchanged to the output. If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}.
"""

Current behavior is to perform escapes in the format spec part of a replacement field, and that, I think, makes sense, since there can be an internal replacement field there. However, it's still pretty simple to tell whether braces need to be escaped or not: a brace in the format spec does need to be escaped, a brace before the format spec doesn't.
If by "item selector" you mean (using the names from the grammar in the docs) the element_index, I don't see why this should be the case; dictionaries can contain non-identifier keys, after all. If you mean the attribute_name (and arg_name) parts, then requiring an identifier (or an integer for arg_name) makes a lot more sense. I assume that Talin had some reason for stating otherwise in the PEP (one of the few things that does get explicitly said about the field_name part), but I'm kind of at a loss for why; you would need to have a custom __getattribute__ to exploit it, and it would be a lot less confusing just to use __getitem__. -- Ben Wolfson "Human kind has used its intelligence to vary the flavour of drinks, which may be sweet, aromatic, fermented or spirit-based. ... Family and social life also offer numerous other occasions to consume drinks for pleasure." [Larousse, "Drink" entry]
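The two uncontroversial cases mentioned in this message, doubled braces in literal text and an internal replacement field in the format spec, can be illustrated with a short sketch. It does not exercise braces inside name or index fields, which is the behaviour under dispute.

```python
# Doubled braces in literal text produce single braces in the output.
literal = "{{{0}}}".format(1)

# A replacement field nested inside the format spec is substituted
# before the spec is applied: here the spec becomes '>5'.
nested = "{0:{1}}".format("x", ">5")

print(repr(literal), repr(nested))
```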

Ben Wolfson wrote:
Of course they can, but that's not the point. The point is that putting arbitrary strings between [...] in a format spec without any form of quoting or requirement for bracket matching leads to something that's too confusing for humans to read. IMO the spec should be designed so that the format string can be parsed using the same lexical analysis rules as Python code. That means anything that is meant to "hang together" as a single unit, such as an item selector, needs to look like a single Python token, e.g. an integer or identifier. I realise this is probably more restrictive than the PEP suggests, but I think it would be better that way all round. -- Greg

On Mon, Jun 13, 2011 at 5:36 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
But there is a requirement for bracket matching: the "[" that opens the element_index is matched by the next "]". Arguably (as Terry Reedy said) this is also a form of quoting, in which the square brackets are the quotation operators. It seems no more confusing to me than allowing arbitrary strings in '"..."'; those quotation marks aren't even oriented. (Admittedly, syntax highlighting helps there.) Compared to this: "{0: ^+#10o}", a string like this: "this is normal text, but {e.underline[this text is is udnerlined {sic}!]}---and we're back to normal now" is pretty damn readable to this human, nor do I see what about the rule "when you see a [, keep going until you see a ]" is supposed to be insuperably confusing. (Compare---not that it helps my case in regard to readability---grouping in regular expressions, where you don't usually have the aid of special syntax highlighting inside the string; you see a '(', you know that you've encountered a group which continues until the next (unescaped!) ')'. The stuff that comes inside the parentheses might look like line noise---and the whole thing might look like line noise---but *that* rule about the structure of a regexp is pretty straightforward.)
If that's the rationale, why not change the spec so that instead of this: "{0[spam]}" you do this: "{0['spam']}" ? Hangs together; single Python token. Bonus: it would make it possible for this to work: (a) "{0['12']}".format({'12': 4}) whereas currently this: "{0[12]}".format(...) passes the integer 12 to __getitem__, and (a) passes the string "'12'". (Discovery: the "abuse" of the format mechanism I want to perpetrate via element_index can also be perpetrated with a custom __format__ method:
. So any reform to make it impossible to use str.format creatively will have to be fairly radical. I think that my intended abuse is actually a perfectly reasonable use, but it would be disallowed if only integers and identifiers can be in the element_index field.) -- Ben Wolfson "Human kind has used its intelligence to vary the flavour of drinks, which may be sweet, aromatic, fermented or spirit-based. ... Family and social life also offer numerous other occasions to consume drinks for pleasure." [Larousse, "Drink" entry]
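Two points from this message can be illustrated with a small sketch: how an element_index is turned into a key, and how a custom __format__ method receives arbitrary text. This is not the example from the original message; the class Shout and the dictionary contents are hypothetical and chosen only for illustration.

```python
# Digits-only element_index values are converted to int; anything else
# is passed to __getitem__ as a string, quote characters included.
by_int = "{0[12]}".format({12: "int key", "12": "str key"})
by_quoted = "{0['12']}".format({"'12'": "key includes the quotes"})


class Shout:
    """Hypothetical class that treats the format spec as free-form text."""

    def __format__(self, spec):
        return spec.upper()


# Everything after ':' is handed to __format__ unchanged.
shouted = "{0:hello world}".format(Shout())

print(by_int, by_quoted, shouted, sep=" | ")
```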

FWIW, new patches have been attached to the bug report (http://bugs.python.org/issue12014), one of which is intended to bring behavior in line with the documentation, and the other of which is intended to implement Greg Ewing's proposal to allow only identifiers (or integers) in the arg_name, attribute_name, and element_index sections. On Fri, Jun 10, 2011 at 2:15 PM, Ben Wolfson <wolfson@gmail.com> wrote:
-- Ben Wolfson "Human kind has used its intelligence to vary the flavour of drinks, which may be sweet, aromatic, fermented or spirit-based. ... Family and social life also offer numerous other occasions to consume drinks for pleasure." [Larousse, "Drink" entry]
participants (5): Ben Wolfson, Greg Ewing, Nick Coghlan, Petri Lehtinen, Terry Reedy