Take 2: PEP draft for expression embedding
Oren Tirosh
oren-py-l at hishome.net
Sat Dec 15 10:38:18 EST 2001
This new draft should address all the technical issues people here have
raised in responses to my previous draft.
This form of string formatting is now called 'expression embedding',
not 'string interpolation'. I hope this change of terminology will
help highlight the fact that this is not a cosmetic change: embedded
expressions are real compiled expressions, not characters in a format
string.
PEP: XXX
Title: Expression embedding
Author: oren at hishome.net (Oren Tirosh)
Status: Draft
Type: Standards Track
Created: 15-Dec-2001
Version: 0.2
Abstract
This document proposes an expression embedding feature for easier
string formatting. The suggested syntax change is the introduction
of a new 'e' prefix for strings. Python expressions may be
embedded within an e-string, surrounded by backquotes.
Example:
print e"X=`x`, Y=`calc_y(x)`."
Unlike the string interpolation in some other languages and the
proposed syntax in PEP 215 [1], an expression embedded within a string
is not a sequence of characters in the string - it is a real
expression. It is syntax-checked at compile time and bytecode is
generated for it in the compiled module.
Specification
A new character prefix "e" is defined for strings. This prefix
precedes the "u" and "r" prefixes, if present. Capital "E" is also
acceptable. Within an e-string any expressions enclosed in
backquotes are evaluated, converted to strings using the
equivalent of the str() function and embedded in-place into the
e-string. Any valid Python expression may be embedded. To use a
literal backquote anywhere within the e-string or embedded
expression it must be preceded by a backslash.
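The evaluation rule above can be illustrated in present-day Python. This is only a sketch: the e-string itself appears in a comment because no released Python accepts it, and calc_y is a hypothetical function standing in for the one named in the abstract's example.

```python
# Hypothetical helper standing in for the calc_y() of the abstract's example.
def calc_y(x):
    return x * x + 1

x = 3

# Proposed syntax (not valid in any released Python):
#     e"X=`x`, Y=`calc_y(x)`."
# Rough equivalent under the rules above: each embedded expression is
# evaluated, converted with str(), and spliced in place.
result = "X=" + str(x) + ", Y=" + str(calc_y(x)) + "."
print(result)
```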
Discussion
A similar proposal for embedding expressions with backquotes was
made by Marnix Klooster in a python-list posting [2]. Marnix noted
that this is the way it is done in Python's ancestor ABC from
which it inherits many features and design decisions. That
proposal did not include any method to identify strings containing
embedded expressions, such as the "e" prefix.
There is no runtime parsing or runtime compilation of expressions.
This results in more efficient execution compared to any form of
string formatting which uses eval().
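For contrast, here is a minimal sketch of the kind of eval()-based formatting this paragraph has in mind. The helper interpolate_with_eval and its regular expression are assumptions made for illustration; they are not part of this proposal or of any library.

```python
import re

def interpolate_with_eval(template, namespace):
    # Runtime approach: every call re-scans the template and compiles each
    # backquoted expression from source text through eval().
    return re.sub(
        r"`([^`]*)`",
        lambda match: str(eval(match.group(1), namespace)),
        template,
    )

x = 3
at_runtime = interpolate_with_eval("X=`x`, Y=`x * 2`.", {"x": x})

# With expression embedding the compiler would emit ordinary bytecode once,
# roughly as if the programmer had written this expression directly.
direct = "X=" + str(x) + ", Y=" + str(x * 2) + "."

print(at_runtime == direct)
```

The eval() variant also evaluates whatever text appears between backquotes in the template, which is the class of problem the Security section refers to.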
Embedded expressions should be fully compatible with proposed future
extensions such as optional static typing.
Whenever an out-of-band character is used there is the problem
of what to do when the programmer wants to literally use that
character. The solution here is consistent with the way quotes or
any other special characters are embedded in a string: using a
backslash escape. Escaping issues are solved in only one place - the
tokenizer. There is no need for one escaping mechanism during
compilation with one type of escape character and a different
form of escaping at runtime for format strings. There should
never be a need for multiple escapes of the backquote character
even when an embedded expression contains strings with embedded
expressions since the nesting of embedded expressions is at the
parser level.
In some cases it is useful to create a format template in one
place in the program and perform the actual formatting elsewhere.
In Python this may be done by passing a format string as an
argument and performing the actual formatting later using the %
operator. This is actually a form of functional programming. The
format string is treated as a function and the '%' formatting
operator is equivalent to the apply() built-in function which
applies a tuple of arguments to a function. With expression
embedding this type of format template may be created using a
lambda function whose body is a single e-string. Unlike format
strings, e-string lambdas have the flexibility to pass arguments
by position or by name, use default values, variable argument
lists, reference global names, take advantage of lexical closures
and, of course, contain any expression.
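The two styles of template can be compared in a short sketch. The names template and row are made up for this example, and the e-string lambda is shown only in a comment because the syntax is not valid in released Python; the runnable stand-in spells out the same behavior with str() and '+'.

```python
# Classic approach: the template is a format string, applied later with '%'.
template = "%s has %d items"
print(template % ("cart", 3))

# With expression embedding the template would be a lambda whose body is a
# single e-string (proposed syntax, not valid Python):
#     row = lambda name, count=0: e"`name` has `count` items"
# Stand-in with the same behavior:
row = lambda name, count=0: str(name) + " has " + str(count) + " items"

print(row("cart", 3))              # arguments by position
print(row(count=3, name="cart"))   # arguments by name
print(row("cart"))                 # default value for count
```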
Implementation notes
Most of the logic of expression embedding is in the tokenizer. The
example above is broken down into the following tokens:
<e"X=`> - EMBSTART
<x> - NAME
<`, Y=`> - EMBCONT
<calc_y> - NAME
<(> - LPAR
<x> - NAME
<)> - RPAR
<`."> - EMBEND
An EMBSTART token instructs the compiler to start a tuple. An
EMBCONT is similar to a comma separating items in a tuple and the
EMBEND token terminates the tuple.
The code generated by this form of formatting may use a new
formatting implementation or simply reuse the existing '%'
operator implementation as a backend. In the latter case the EMBSTART,
EMBCONT, and EMBEND tokens are concatenated together, any "%"
characters in the resulting string are replaced with "%%", and the
now-empty backquotes "``" are replaced with "%s". Finally, the
"%" operator is applied to the resulting string and the tuple.
Correctly generating EMBCONT and EMBEND tokens requires some
stateful logic in the tokenizer. If the EMBSTART token started
with a single quote an EMBCONT and EMBEND may not contain any
unescaped single quotes and EMBEND ends with a single quote.
Similar rules apply to e-strings with double quotes, triple single
quotes and triple double quotes.
State 0: Default state. A backquote is a BACKQUOTE token.
State 1: backquote starts an EMBCONT or EMBEND that ends with '
State 2: backquote starts an EMBCONT or EMBEND that ends with "
State 3: backquote starts an EMBCONT or EMBEND that ends with '''
State 4: backquote starts an EMBCONT or EMBEND that ends with """
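The tokenizing step and the '%' backend described in these notes can be sketched together. This is a simplified illustration, not the reference preprocessor. The functions split_estring and to_percent_format are hypothetical names. The sketch works on the text between the quotes, so the four quote-style states are left out, and it does not handle backquotes inside string literals nested in an expression. The expression tuple is written by hand where the compiler would generate it.

```python
def split_estring(body):
    """Split e-string text (quotes already removed) at unescaped backquotes.

    Returns (fragments, expressions): the literal pieces and the source
    text of each embedded expression, in order.
    """
    fragments, expressions = [], []
    current = []
    in_expr = False
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            # An escaped backquote becomes a literal backquote; every other
            # escape pair is passed through untouched.
            current.append("`" if nxt == "`" else ch + nxt)
            i += 2
            continue
        if ch == "`":
            (expressions if in_expr else fragments).append("".join(current))
            current = []
            in_expr = not in_expr
        else:
            current.append(ch)
        i += 1
    if in_expr:
        raise SyntaxError("unterminated embedded expression")
    fragments.append("".join(current))
    return fragments, expressions


def to_percent_format(fragments):
    # Double any literal '%' and put one '%s' where each expression stood.
    return "%s".join(piece.replace("%", "%%") for piece in fragments)


# Hypothetical helper standing in for the calc_y() of the abstract's example.
def calc_y(x):
    return x * x + 1

x = 3
fragments, expressions = split_estring("X=`x`, Y=`calc_y(x)`. 100%")
fmt = to_percent_format(fragments)

# A compiler would build this tuple from the embedded expressions.
print(fmt % (x, calc_y(x)))
```

Joining the escaped fragments with "%s" gives the same string as concatenating the tokens and replacing the now-empty backquote pairs, provided the fragments contain no literal backquotes of their own.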
Reference implementation
A reference implementation in the form of a preprocessor for Python
source files is available at:
http://www.tothink.com/python/embedpp
This preprocessor is based on a modified version of Ka-Ping Yee's
tokenize.py module.
This implementation does not fully implement backquote escaping.
Security
Run-time parsing of strings opens many potential security holes.
This form of formatting should be secure against this class of attacks.
References
[1] PEP 215, String Interpolation (Ka-Ping Yee)
http://www.python.org/peps/pep-0215.html
[2] 1996/11/07 python-list posting (Marnix Klooster)
http://groups.google.com/groups?group=comp.lang.python&selm=328195a1.1211700%40news.worldonline.nl
Copyright
This document is in the public domain.