[Python-Dev] PEP 498: Literal String Interpolation is ready for pronouncement

Nathaniel Smith njs at pobox.com
Sun Sep 6 01:12:02 CEST 2015


On Sat, Sep 5, 2015 at 1:00 PM, Eric V. Smith <eric at trueblade.com> wrote:
> On 9/5/2015 3:23 PM, Nathaniel Smith wrote:
>> On Sep 5, 2015 11:32 AM, "Eric V. Smith" <eric at trueblade.com> wrote:
>>> Ignore the part about non-doubled '}'. The actual description is:
>>>
>>> To find the end of an expression, it looks for a '!', ':', or '}' that
>>> is not inside a string or (), [], or {}. There's a special case for
>>> '!=' so the bang isn't seen as ending the expression.
>>
>> Sounds like you're reimplementing a lot of the lexer... I guess that's
>> doable, but how confident are you that your definition of "inside a
>> string" matches the original in all corner cases?
>
> Well, this is 35 lines of code (including comments), and it's much
> simpler than a lexer (in the sense of "something that generates
> tokens"). So I don't think I'm reimplementing a lot of the lexer.
>
> However, your point is valid: if I don't do the same thing the lexer
> would do, I could either prematurely find the end of an expression, or
> look too far. In either case, when I call ast.parse() I'll get a syntax
> error, and/or I'll get an error when parsing/lexing the remainder of the
> string.
>
> But it's not like I have to agree with the lexer: no larger error will
> occur if I get it wrong. Everything is confined to a single f-string,
> since I've already used the lexer to find the f-string in its entirety.
> I only need to make sure the users understand how expressions are
> extracted from f-strings.
>
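
For instance (just to check I follow the failure mode -- this isn't your
code, and I'm guessing at the exact ast.parse() call):

    import ast

    # A mis-scanned expression fails loudly either way: stopping too early
    # leaves an unclosed paren, running too far drags in the stray '}'.
    for text in ("np.sin(a + b", "a + b} junk"):
        try:
            ast.parse(text, mode="eval")
        except SyntaxError as exc:
            print(text, "->", exc.msg)

Either way the mistake is confined to that one f-string, as you say.
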
> I did look at using the actual lexer (Parser/tokenizer.c) to do this,
> but it would require a large amount of surgery. I think it's overkill
> for this task.
>
> So far, I've tested it enough to have reasonable confidence that it's
> correct. But the implementation could always be swapped out for an
> improved version. I'm certainly open to that, if we find cases that the
> simple scanner can't deal with.
>
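Just so we're picturing the same thing, I assume the scan is roughly
equivalent to this (a Python sketch of my reading of the rule you
described, not your actual code):

    def find_expr_end(s):
        # Find the first '!', ':' or '}' that is not inside a string
        # literal or nested in (), [] or {}; '!=' stays in the expression.
        depth = 0
        quote = None              # quote char while inside a string literal
        i = 0
        while i < len(s):
            c = s[i]
            if quote is not None:
                if c == '\\':
                    i += 2        # skip the escaped character
                    continue
                if c == quote:
                    quote = None
            elif c in ('"', "'"):
                quote = c
            elif c in '([{':
                depth += 1
            elif c in ')]}':
                if depth == 0 and c == '}':
                    return i      # closing brace of the replacement field
                depth -= 1
            elif depth == 0 and c in '!:':
                if c == '!' and s[i:i+2] == '!=':
                    i += 2        # '!=' is an operator, not a conversion
                    continue
                return i          # start of '!conversion' or ':format_spec'
            i += 1
        raise ValueError("missing '}'")

    # find_expr_end("value!r} tail")    -> 5   (expression is "value")
    # find_expr_end("d['a']:>10} tail") -> 6   (expression is "d['a']")

No triple-quoted strings, no comments, etc. -- it's only meant to pin down
what "inside a string or (), [], or {}" means, which is where I'd expect
the corner cases to show up.
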
>> In any case the abstract language definition part should be phrased in
>> terms of the Python lexer -- the expression ends when you encounter the
>> first } *token* that is not nested inside () [] {} *tokens*, and then
>> you can implement it however makes sense...
>
> I'm not sure that's an improvement on Guido's description when you're
> trying to explain it to a user. But when the time comes to write the
> documentation, we can discuss it then.

I'm not talking about end-user documentation, I'm talking about the
formal specification, like in the Python Language Reference.

I'm pretty sure that just calling the tokenizer will be easier for
Cython or PyPy than implementing a special purpose scanner :-)

>> (This is then the same rule that patsy uses to find the end of python
>> expressions embedded inside patsy formula strings:
>> http://patsy.readthedocs.org)
>
> I don't see where patsy looks for expressions in parts of strings. Let
> me know if I'm missing it.

Patsy parses strings like

   "np.sin(a + b) + c"

using a grammar that supports some basic arithmetic-like infix
operations (+, *, parentheses, etc.), and in which the atoms are
arbitrary Python expressions. So the above string is parsed into a
patsy-AST that looks something like:

  Add(PyExpr("np.sin(a + b)"), PyExpr("c"))

The rule it uses to do this: it runs the Python tokenizer, counts
nesting of () [] {}, and when it sees a valid unnested patsy operator,
that's the end of the embedded expression:

  https://github.com/pydata/patsy/blob/master/patsy/parse_formula.py#L37
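
In sketch form the idea is something like this (a toy version, not
patsy's real parser, which also handles precedence and more operators):

    import io
    import tokenize

    def split_on_top_level_plus(formula):
        # Run the Python tokenizer over the formula, count (), [], {}
        # nesting, and treat a '+' at nesting level zero as a patsy
        # operator rather than part of an embedded Python expression.
        # (Assumes a single-line formula, for simplicity.)
        depth = 0
        pieces = []
        start = 0                 # column where the current atom begins
        for tok in tokenize.generate_tokens(io.StringIO(formula).readline):
            if tok.type != tokenize.OP:
                continue
            if tok.string in ('(', '[', '{'):
                depth += 1
            elif tok.string in (')', ']', '}'):
                depth -= 1
            elif tok.string == '+' and depth == 0:
                pieces.append(formula[start:tok.start[1]].strip())
                start = tok.end[1]
        pieces.append(formula[start:].strip())
        return pieces

    print(split_on_top_level_plus("np.sin(a + b) + c"))
    # -> ['np.sin(a + b)', 'c']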

Not tremendously relevant, but that's why I've thought this through before :-)

-n

-- 
Nathaniel J. Smith -- http://vorpus.org
