Take 2: PEP draft for expression embedding

Oren Tirosh oren-py-l at hishome.net
Sat Dec 15 10:38:18 EST 2001


This new draft should address all the technical issues people here have 
raised in responses to my previous draft.

This form of string formatting is now called 'expression embedding',
not 'string interpolation'.  I hope this change of terminology will
help highlight the fact that this is not a cosmetic change: embedded 
expressions are real compiled expressions, not characters in a format 
string.



PEP: XXX
Title: Expression embedding
Author: oren at hishome.net (Oren Tirosh)
Status: Draft
Type: Standards Track
Created: 15-Dec-2001
Version: 0.2


Abstract

    This document proposes an expression embedding feature for easier
    string formatting. The suggested syntax change is the introduction
    of a new 'e' prefix for strings. Python expressions may be
    embedded within an e-string, surrounded by backquotes.

    Example:

        print e"X=`x`, Y=`calc_y(x)`."

    Unlike the string interpolation in some other languages and the
    proposed syntax in PEP 215 [1], an expression embedded within a string
    is not a sequence of characters in the string - it is a real
    expression. It is syntax-checked at compile time and bytecode is
    generated for it in the compiled module.
    
Specification

    A new character prefix "e" is defined for strings.  This prefix
    precedes the "u" and "r" prefixes, if present. Capital "E" is also
    acceptable. Within an e-string any expressions enclosed in
    backquotes are evaluated, converted to strings using the
    equivalent of the str() function and embedded in-place into the
    e-string. Any valid Python expression may be embedded.  To use a
    literal backquote anywhere within the e-string or embedded
    expression it must be preceded by a backslash.  
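
    As an illustration of these semantics (a sketch only, not part of
    the proposal), the e-string from the Abstract should produce the
    same result as explicit str() conversion and concatenation, or as
    the existing '%s' formatting. calc_y() below is a hypothetical
    stand-in, and the code uses only syntax that exists today, since
    e-strings are not implemented in any interpreter.

```python
# Hypothetical helper standing in for the calc_y() used in the Abstract.
def calc_y(x):
    return x * 2

x = 21

# Under this proposal one would write:  e"X=`x`, Y=`calc_y(x)`."
# Spelled out with explicit str() calls and concatenation:
expanded = "X=" + str(x) + ", Y=" + str(calc_y(x)) + "."

# The same result through the existing '%' operator:
formatted = "X=%s, Y=%s." % (x, calc_y(x))
```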
    
Discussion

    A similar proposal for embedding expressions with backquotes was
    made by Marnix Klooster in a python-list posting [2]. Marnix noted
    that this is the way it is done in Python's ancestor ABC from
    which it inherits many features and design decisions. This
    proposal did not include any method to identify strings containing
    embedded expressions such as the "e" prefix.

    There is no runtime parsing or runtime compilation of expressions.
    This results in more efficient execution compared to any form of
    string formatting which uses eval().  

    Embedded strings should be fully compatible with proposed future 
    extensions such as optional static typing.

    Whenever an out-of-band character is used there is the problem
    of what to do when the programmer wants to literally use that
    character.  The solution here is consistent with the way quotes or
    any other special characters are embedded in a string: using a
    backslash escape. Escaping issues are solved in only one place - the 
    tokenizer. There is no need for one escaping mechanism during 
    compilation with one type of escape character and a different
    form of escaping at runtime for format strings.  There should
    never be a need for multiple escapes of the backquote character
    even when an embedded expression contains strings with embedded
    expressions since the nesting of embedded expressions is at the 
    parser level.

    In some cases it is useful to create a format template in one
    place in the program and perform the actual formatting elsewhere.
    In Python this may be done by passing a format string as an
    argument and performing the actual formatting later using the %
    operator.  This is actually a form of functional programming. The
    format string is treated as a function and the '%' formatting
    operator is equivalent to the apply() built-in function which
    applies a tuple of arguments to a function.  With expression 
    embedding this type of format template may be created using a
    lambda function whose body is a single e-string. Unlike format
    strings, e-string lambdas have the flexibility to pass arguments
    by position or by name, use default values, variable argument
    lists, reference global names, take advantage of lexical closures
    and, of course, contain any expression.
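
    A rough sketch of the template idea follows. Because e-strings do
    not exist yet, the lambda body is written with the '%' operator;
    under this proposal it could presumably be spelled
    e"`name``sep``count`" instead. The names make_label, sep, name and
    count are invented for this illustration.

```python
# Today: a template is a plain format string, applied later with '%'.
row_template = "%s: %s"
row = row_template % ("apples", 3)

# Lambda-based template: arguments can be passed by position or by name,
# can have default values, and the body can use names from an enclosing
# scope (here: sep).
def make_label(sep=": "):
    return lambda name, count=0: "%s%s%s" % (name, sep, count)

label = make_label()
by_position = label("apples", 3)
by_name = label(count=7, name="pears")
with_default = label("plums")
```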
    
Implementation notes

    Most of the logic of expression embedding is in the tokenizer. The
    example above is broken down into the following tokens:

    <e"X=`>    - EMBSTART
      <x>      - NAME
    <`, Y=`>   - EMBCONT
      <calc_y> - NAME
      <(>      - LPAR
      <x>      - NAME
      <)>      - RPAR
    <`.">      - EMBEND

    An EMBSTART token instructs the compiler to start a tuple. An
    EMBCONT is similar to a comma separating items in a tuple and the
    EMBEND token terminates the tuple.

    The code generated by this form of formatting may use a new
    formatting implementation or simply reuse the existing '%'
    operator implementation as a backend. In this case the EMBSTART,
    EMBCONT, and EMBEND tokens are concatenated together, any "%"
    characters in the resulting string are replaced with "%%", the
    now-empty backquotes "``" are replaced with "%s".  Finally, the
    "%" operator is applied to the resulting string and the tuple.
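
    The '%'-backend steps described above might look roughly like the
    following. This is a minimal sketch, not the reference
    implementation: build_format is a hypothetical name, the token
    texts are assumed to have their "e" prefix and quotes already
    stripped, and escaped backquotes are not handled.

```python
def build_format(token_bodies):
    """Turn EMBSTART/EMBCONT/EMBEND bodies into a '%' format string."""
    # 1. Concatenate the literal pieces; each embedded expression leaves
    #    behind an empty pair of backquotes.
    joined = "".join(token_bodies)
    # 2. Protect literal percent signs from the '%' operator.
    escaped = joined.replace("%", "%%")
    # 3. Each now-empty pair of backquotes becomes a %s placeholder.
    return escaped.replace("``", "%s")

# Token bodies for the example  e"X=`x`, Y=`calc_y(x)`."
bodies = ["X=`", "`, Y=`", "`."]
fmt = build_format(bodies)

# 4. Apply '%' to the format string and the tuple of expression values.
result = fmt % (3, 9)
```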

    Correctly generating EMBCONT and EMBEND tokens requires some
    stateful logic in the tokenizer. If the EMBSTART token started
    with a single quote an EMBCONT and EMBEND may not contain any
    unescaped single quotes and EMBEND ends with a single quote.
    Similar rules apply to e-strings with double quotes, triple single
    quotes and triple double quotes.

      State 0: Default state. A backquote is a BACKQUOTE token.

      State 1: A backquote starts an EMBCONT or EMBEND that ends with '

      State 2: A backquote starts an EMBCONT or EMBEND that ends with "

      State 3: A backquote starts an EMBCONT or EMBEND that ends with '''

      State 4: A backquote starts an EMBCONT or EMBEND that ends with """
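
    To make the state idea concrete, here is a simplified sketch of
    how a single e-string literal could be split. It is not the
    reference preprocessor: split_estring and the "EXPR" and "STRING"
    piece kinds are invented for this example, each embedded
    expression is kept as raw text rather than being tokenized, and
    quotes or nested e-strings inside an expression are not handled.

```python
# Maps the opening quote to the tokenizer state it selects.
QUOTE_STATES = {"'": 1, '"': 2, "'''": 3, '"""': 4}

def split_estring(source):
    """Split one e-string literal into (kind, text) pieces."""
    i = 0
    while source[i] not in "'\"":       # skip the e/u/r prefix letters
        i += 1
    # A triple quote takes priority over a single one.
    triple = source[i:i + 3]
    quote = triple if triple in ("'''", '"""') else source[i]
    state = QUOTE_STATES[quote]
    i += len(quote)

    pieces, start, kind, in_expr = [], 0, "EMBSTART", False
    while i < len(source):
        ch = source[i]
        if ch == "\\":                  # backslash escapes the next char
            i += 2
            continue
        if ch == "`":
            if in_expr:                 # closing backquote of an expression
                pieces.append(("EXPR", source[start:i]))
                start = i
            else:                       # backquote that opens an expression
                pieces.append((kind, source[start:i + 1]))
                kind, start = "EMBCONT", i + 1
            in_expr = not in_expr
        elif not in_expr and source.startswith(quote, i):
            # The closing quote is chosen by the state set at EMBSTART.
            end_kind = "STRING" if kind == "EMBSTART" else "EMBEND"
            pieces.append((end_kind, source[start:i + len(quote)]))
            return state, pieces
        i += 1
    raise SyntaxError("unterminated e-string")

state, pieces = split_estring('e"X=`x`, Y=`calc_y(x)`."')
```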
      
Reference implementation

    A reference implementation in the form of a preprocessor for Python
    source files is available at:

        http://www.tothink.com/python/embedpp

    This preprocessor is based on a modified version of Ka-Ping Yee's
    tokenize.py module.

    This implementation does not fully implement backquote escaping.
    
Security

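    Run-time parsing of strings opens many potential security holes.
    Because embedded expressions are compiled together with the rest
    of the module and no string is ever parsed at run time, this form
    of formatting should be secure against this class of attacks.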
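
    The difference can be illustrated with a deliberately naive
    run-time interpolator. runtime_interpolate is a hypothetical
    function written only for this comparison; it does not correspond
    to any existing proposal or library.

```python
import re

def runtime_interpolate(template, namespace):
    """Naive run-time interpolation: eval() the text between backquotes."""
    return re.sub(r"`([^`]*)`",
                  lambda m: str(eval(m.group(1), namespace)),
                  template)

secret = "hunter2"
untrusted = "Hello `secret`"    # text supplied by a user at run time

# Run-time parsing treats the user's text as code and leaks the secret.
leaked = runtime_interpolate(untrusted, {"secret": secret})

# With compile-time embedding, e"`untrusted`" would behave like the line
# below: the user's text is only ever a value and is never parsed.
safe = "%s" % (untrusted,)
```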
    
References

    [1] PEP 215, String Interpolation (Ka-Ping Yee)
        http://www.python.org/peps/pep-0215.html

    [2] 1996/11/07 python-list posting (Marnix Klooster)
        http://groups.google.com/groups?group=comp.lang.python
         &selm=328195a1.1211700%40news.worldonline.nl

Copyright

    This document is in the public domain.



