[Python-checkins] r55748 - peps/trunk/pep-3101.txt

Sun Jun 3 20:53:35 CEST 2007

Author: talin
Date: Sun Jun  3 20:53:34 2007
New Revision: 55748

Modified:
   peps/trunk/pep-3101.txt
Log:
A substantial rewrite of PEP3101.



Modified: peps/trunk/pep-3101.txt
==============================================================================

--- peps/trunk/pep-3101.txt	(original)
+++ peps/trunk/pep-3101.txt	Sun Jun  3 20:53:34 2007
@@ -26,10 +26,10 @@
 
     - The string.Template module. [2]
 
-    The scope of this PEP will be restricted to proposals for built-in
+    The primary scope of this PEP concerns proposals for built-in
     string formatting operations (in other words, methods of the
     built-in string type).
-    
+
     The '%' operator is primarily limited by the fact that it is a
     binary operator, and therefore can take at most two arguments.
     One of those arguments is already dedicated to the format string,
@@ -42,8 +42,14 @@
 
     While there is some overlap between this proposal and
     string.Template, it is felt that each serves a distinct need,
-    and that one does not obviate the other.  In any case,
-    string.Template will not be discussed here.
+    and that one does not obviate the other.  This proposal is for
+    a mechanism which, like '%', is efficient for small strings
+    which are only used once, so, for example, compilation of a
+    string into a template is not contemplated in this proposal,
+    although the proposal does take care to define format strings
+    and the API in such a way that an efficient template package
+    could reuse the syntax and even some of the underlying
+    formatting code.
 
 
 Specification
@@ -53,39 +59,43 @@
     - Specification of a new formatting method to be added to the
       built-in string class.
 
+    - Specification of functions and flag values to be added to
+      the string module, so that the underlying formatting engine
+      can be used with additional options.
+
     - Specification of a new syntax for format strings.
 
-    - Specification of a new set of class methods to control the
+    - Specification of a new set of special methods to control the
       formatting and conversion of objects.
 
     - Specification of an API for user-defined formatting classes.
 
     - Specification of how formatting errors are handled.
-    
-    Note on string encodings: Since this PEP is being targeted
-    at Python 3.0, it is assumed that all strings are unicode strings,
+
+    Note on string encodings: When discussing this PEP in the context
+    of Python 3.0, it is assumed that all strings are unicode strings,
     and that the use of the word 'string' in the context of this
     document will generally refer to a Python 3.0 string, which is
     the same as a Python 2.x unicode object.
-    
-    If it should happen that this functionality is backported to
-    the 2.x series, then it will be necessary to handle both regular
-    string as well as unicode objects.  All of the function call
-    interfaces described in this PEP can be used for both strings
-    and unicode objects, and in all cases there is sufficient
-    information to be able to properly deduce the output string
-    type (in other words, there is no need for two separate APIs).
-    In all cases, the type of the template string dominates - that
+
+    In the context of Python 2.x, the use of the word 'string' in this
+    document refers to an object which may either be a regular string
+    or a unicode object.  All of the function call interfaces
+    described in this PEP can be used for both strings and unicode
+    objects, and in all cases there is sufficient information
+    to be able to properly deduce the output string type (in
+    other words, there is no need for two separate APIs).
+    In all cases, the type of the format string dominates - that
     is, the result of the conversion will always result in an object
     that contains the same representation of characters as the
-    input template string.
+    input format string.
 
 
 String Methods
 
-    The build-in string class will gain a new method, 'format',
-    which takes takes an arbitrary number of positional and keyword
-    arguments:
+    The built-in string class (and also the unicode class in 2.6) will
+    gain a new method, 'format', which takes an arbitrary number of
+    positional and keyword arguments:
 
         "The story of {0}, {1}, and {c}".format(a, b, c=d)
 
@@ -98,6 +108,15 @@
 
 Format Strings
 
+    Format strings consist of intermingled character data and markup.
+
+    Character data is data which is transferred unchanged from the
+    format string to the output string; markup is not transferred from
+    the format string directly to the output, but instead is used to
+    define 'replacement fields' that describe to the format engine
+    what should be placed in the output string in the place of the
+    markup.
+
     Brace characters ('curly braces') are used to indicate a
     replacement field within the string:
 
@@ -114,11 +133,11 @@
     Which would produce:
 
         "My name is Fred :-{}"
-        
+
     The element within the braces is called a 'field'.  Fields consist
     of a 'field name', which can either be simple or compound, and an
     optional 'conversion specifier'.
-    
+
 
 Simple and Compound Field Names
 
@@ -126,12 +145,12 @@
     must be valid base-10 integers; if names, they must be valid
     Python identifiers.  A number is used to identify a positional
     argument, while a name is used to identify a keyword argument.
-    
+
     A compound field name is a combination of multiple simple field
     names in an expression:
 
         "My name is {0.name}".format(file('out.txt'))
-        
+
     This example shows the use of the 'getattr' or 'dot' operator
     in a field expression. The dot operator allows an attribute of
     an input value to be specified as the field value.
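The dot operator described above can be illustrated with a minimal sketch. The `Person` class is hypothetical and stands in for the `file` object used in the PEP's example, since the `file` built-in does not exist in Python 3.

```python
class Person:
    """Hypothetical class used only for this illustration."""

    def __init__(self, name):
        self.name = name


# '{0.name}' looks up the 'name' attribute of positional argument 0.
greeting = "My name is {0.name}".format(Person("Fred"))
print(greeting)
```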
@@ -142,22 +161,36 @@
     Python expressions inside of strings. Only two operators are
     supported, the '.' (getattr) operator, and the '[]' (getitem)
     operator.
-    
+
+    A further restriction, intended to limit potential security
+    issues, is that field names or attribute names beginning with an
+    underscore are disallowed. This enforces the common convention
+    that names beginning with an underscore are 'private'.
+
     An example of the 'getitem' syntax:
-    
+
         "My name is {0[name]}".format(dict(name='Fred'))
-    
+
     It should be noted that the use of 'getitem' within a string is
     much more limited than its normal use. In the above example, the
     string 'name' really is the literal string 'name', not a variable
-    named 'name'. The rules for parsing an item key are the same as
-    for parsing a simple name - in other words, if it looks like a
-    number, then its treated as a number, if it looks like an
-    identifier, then it is used as a string.
-    
+    named 'name'. The rules for parsing an item key are very simple.
+    If it starts with a digit, then it is treated as a number, otherwise
+    it is used as a string.
+
     It is not possible to specify arbitrary dictionary keys from
     within a format string.
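The key-parsing rule for 'getitem' can be sketched as follows, using the str.format method as it shipped in Python 3. The variable `name` below is an assumption added only to show that it is never consulted.

```python
name = "some other value"  # a variable called 'name' is never consulted

# 'name' inside the brackets is the literal string key 'name'.
by_key = "My name is {0[name]}".format({"name": "Fred"})

# A key written as digits is treated as an integer index.
by_index = "Second item: {0[1]}".format(["a", "b"])

print(by_key)
print(by_index)
```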
 
+    Implementation note:  The implementation of this proposal is
+    not required to enforce the rule about a name being a valid
+    Python identifier.  Instead, it will rely on the getattr function
+    of the underlying object to throw an exception if the identifier
+    is not legal.  The format function will have a minimalist parser
+    which only attempts to figure out when it is "done" with an
+    identifier (by finding a '.' or a ']', or '}', etc.)  The only
+    exception to this laissez-faire approach is that, by default,
+    strings are not allowed to have leading underscores.
+
 
 Conversion Specifiers
 
@@ -176,9 +209,9 @@
     Conversion specifiers can themselves contain replacement fields.
     For example, a field whose field width is itself a parameter
     could be specified via:
-    
+
         "{0:{1}}".format(a, b, c)
-        
+
     Note that the doubled '}' at the end, which would normally be
     escaped, is not escaped in this case.  The reason is because
     the '{{' and '}}' syntax for escapes is only applied when used
@@ -201,13 +234,13 @@
     differences.  The standard conversion specifiers fall into three
     major categories: string conversions, integer conversions and
     floating point conversions.
-    
+
     The general form of a standard conversion specifier is:
 
         [[fill]align][sign][width][.precision][type]
 
     The brackets ([]) indicate an optional element.
-    
+
     Then the optional align flag can be one of the following:
 
         '<' - Forces the field to be left-aligned within the available
@@ -217,18 +250,20 @@
         '=' - Forces the padding to be placed after the sign (if any)
               but before the digits. This is used for printing fields
               in the form '+000000120'.
-              
+        '^' - Forces the field to be centered within the available
+              space.
+
     Note that unless a minimum field width is defined, the field
     width will always be the same size as the data to fill it, so
     that the alignment option has no meaning in this case.
-              
+
     The optional 'fill' character defines the character to be used to
     pad the field to the minimum width.  The alignment flag must be
     supplied if the character is a number other than 0 (otherwise the
     character would be interpreted as part of the field width
     specifier). A zero fill character without an alignment flag
     implies an alignment type of '='.
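The fill and alignment rules above can be tried out with a short sketch. It runs against the str.format method as it shipped in Python 3; the sample values are arbitrary.

```python
# Left, right, and centered alignment within a minimum field width.
left = "{0:<8}".format("ab")
right = "{0:>8}".format("ab")
centered = "{0:*^9}".format("mid")  # '*' is the fill character

# '=' places the padding after the sign but before the digits.
signed = "{0:0=+10d}".format(120)

# A zero fill without an explicit alignment flag behaves like '='.
zero_filled = "{0:08d}".format(-42)

for text in (left, right, centered, signed, zero_filled):
    print(repr(text))
```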
-    
+
     The 'sign' element can be one of the following:
 
         '+'  - indicates that a sign should be used for both
@@ -249,7 +284,7 @@
     conversion. In a string conversion the field indicates how many
     characters will be used from the field content. The precision is
     ignored for integer conversions.
-    
+
     Finally, the 'type' determines how the data should be presented.
     If the type field is absent, an appropriate type will be assigned
     based on the value to be formatted ('d' for integers and longs,
@@ -307,7 +342,7 @@
         "Today is: {0:a b d H:M:S Y}".format(datetime.now())
 
 
-Controlling Formatting
+Controlling Formatting on a Per-Type Basis
 
     A class that wishes to implement a custom interpretation of its
     conversion specifiers can implement a __format__ method:
@@ -334,107 +369,187 @@
      3) Otherwise, call str() or unicode() as appropriate.
 
 
-User-Defined Formatting Classes
+User-Defined Formatting
 
     There will be times when customizing the formatting of fields
-    on a per-type basis is not enough.  An example might be an
-    accounting application, which displays negative numbers in
-    parentheses rather than using a negative sign.
-    
-    The string formatting system facilitates this kind of application-
-    specific formatting by allowing user code to directly invoke
-    the code that interprets format strings and fields.  User-written
-    code can intercept the normal formatting operations on a per-field
-    basis, substituting their own formatting methods.
-    
-    For example, in the aforementioned accounting application, there
-    could be an application-specific number formatter, which reuses
-    the string.format templating code to do most of the work. The
-    API for such an application-specific formatter is up to the
-    application; here are several possible examples:
-    
-        cell_format("The total is: {0}", total)
-        
-        TemplateString("The total is: {0}").format(total)
-        
-    Creating an application-specific formatter is relatively straight-
-    forward.  The string and unicode classes will have a class method
-    called 'cformat' that does all the actual work of formatting; The
-    built-in format() method is just a wrapper that calls cformat.
-    
-    The type signature for the cFormat function is as follows:
-    
-        cformat(template, format_hook, args, kwargs)
-
-    The parameters to the cformat function are:
-
-        -- The format template string.
-        -- A callable 'format hook', which is called once per field
-        -- A tuple containing the positional arguments
-        -- A dict containing the keyword arguments
-
-    The cformat function will parse all of the fields in the format
-    string, and return a new string (or unicode) with all of the
-    fields replaced with their formatted values.
-
-    The format hook is a callable object supplied by the user, which
-    is invoked once per field, and which can override the normal
-    formatting for that field.  For each field, the cformat function
-    will attempt to call the field format hook with the following
-    arguments:
-
-       format_hook(value, conversion)
-
-    The 'value' field corresponds to the value being formatted, which
-    was retrieved from the arguments using the field name.
-
-    The 'conversion' argument is the conversion spec part of the
-    field, which will be either a string or unicode object, depending
-    on the type of the original format string.
-
-    The field_hook will be called once per field. The field_hook may
-    take one of two actions:
-    
-        1) Return a string or unicode object that is the result
-           of the formatting operation.
-
-        2) Return None, indicating that the field_hook will not
-           process this field and the default formatting should be
-           used.  This decision should be based on the type of the
-           value object, and the contents of the conversion string.
+    on a per-type basis is not enough.  An example might be a
+    spreadsheet application, which displays hash marks '#' when a value
+    is too large to fit in the available space.
+
+    For more powerful and flexible formatting, access to the underlying
+    format engine can be obtained through the 'Formatter' class that
+    lives in the 'string' module.  This class takes additional options
+    which are not accessible via the normal str.format method.
+
+    An application can create its own Formatter instance which has
+    customized behavior, either by setting the properties of the
+    Formatter instance, or by subclassing the Formatter class.
+
+    The PEP does not attempt to exactly specify all methods and
+    properties defined by the Formatter class; instead, those will be
+    defined and documented in the initial implementation. However, this
+    PEP will specify the general requirements for the Formatter class,
+    which are listed below.
+
+
+Formatter Creation and Initialization
+
+    The Formatter class takes a single initialization argument, 'flags':
+
+        Formatter(flags=0)
+
+    The 'flags' argument is used to control certain subtle behavioral
+    differences in formatting that would be cumbersome to change via
+    subclassing. The flags values are defined as static variables
+    in the "Formatter" class:
+
+        Formatter.ALLOW_LEADING_UNDERSCORES
+
+            By default, leading underscores are not allowed in identifier
+            lookups (getattr or getitem).  Setting this flag will allow
+            this.
+
+        Formatter.CHECK_UNUSED_POSITIONAL
+
+            If this flag is set, then any positional arguments which are
+            supplied to the 'format' method but which are not used by
+            the format string will cause an error.
+
+        Formatter.CHECK_UNUSED_NAME
+
+            If this flag is set, then any named arguments which are
+            supplied to the 'format' method but which are not used by
+            the format string will cause an error.
+
+
+Formatter Methods
+
+    The methods of class Formatter are as follows:
+
+        -- format(format_string, *args, **kwargs)
+        -- vformat(format_string, args, kwargs)
+        -- get_positional(args, index)
+        -- get_named(kwds, name)
+        -- format_field(value, conversion)
+
+    'format' is the primary API method. It takes a format template,
+    and an arbitrary set of positional and keyword arguments. 'format'
+    is just a wrapper that calls 'vformat'.
+
+    'vformat' is the function that does the actual work of formatting. It
+    is exposed as a separate function for cases where you want to pass in
+    a predefined dictionary of arguments, rather than unpacking and
+    repacking the dictionary as individual arguments using the '*args' and
+    '**kwds' syntax. 'vformat' does the work of breaking up the format
+    template string into character data and replacement fields. It calls
+    the 'get_positional' and 'get_named' methods as appropriate.
+
+    Note that the checking of unused arguments, and the restriction on
+    leading underscores in attribute names are also done in this function.
+
+    'get_positional' and 'get_named' are used to retrieve a given field
+    value. For compound field names, these functions are only called for
+    the first component of the field name; subsequent components are
+    handled through normal attribute and indexing operations. So for
+    example, the field expression '0.name' would cause 'get_positional' to
+    be called with the list of positional arguments and a numeric index of
+    0, and then the standard 'getattr' function would be called to get the
+    'name' attribute of the result.
+
+    If the index or keyword refers to an item that does not exist, then an
+    IndexError/KeyError will be raised.
+
+    'format_field' actually generates the text for a replacement field.
+    The 'value' argument corresponds to the value being formatted, which
+    was retrieved from the arguments using the field name. The
+    'conversion' argument is the conversion spec part of the field, which
+    will be either a string or unicode object, depending on the type of
+    the original format string.
+
+    Note: The final implementation of the Formatter class may define
+    additional overridable methods and hooks. In particular, it may be
+    that 'vformat' is itself a composition of several additional,
+    overridable methods. (Depending on whether it is convenient to the
+    implementor of Formatter.)
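The division of labour between 'vformat' and the lookup methods can be sketched with the string.Formatter class that eventually shipped in the standard library. Note the assumptions: the shipped class takes no 'flags' argument and exposes a single get_value(key, args, kwargs) hook in place of the 'get_positional' and 'get_named' methods named in this draft. `TracingFormatter` and `Person` are hypothetical names used only here.

```python
import string


class TracingFormatter(string.Formatter):
    """Hypothetical subclass that records every first-component lookup."""

    def __init__(self):
        super().__init__()
        self.lookups = []

    def get_value(self, key, args, kwargs):
        # Called only for the first component of a compound field name.
        self.lookups.append(key)
        return super().get_value(key, args, kwargs)


class Person:
    def __init__(self, name):
        self.name = name


fmt = TracingFormatter()
# vformat takes the argument tuple and dict directly, without unpacking.
result = fmt.vformat("{0.name} is {age}", (Person("Fred"),), {"age": 42})
print(result)
print(fmt.lookups)
```

The recorded keys show that the '.name' part is resolved by ordinary attribute access rather than by the lookup hook.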
+
+
+Customizing Formatters
+
+    This section describes some typical ways that Formatter objects
+    can be customized.
+
+    To support alternative format-string syntax, the 'vformat' method
+    can be overridden to alter the way format strings are parsed.
+
+    One common desire is to support a 'default' namespace, so that
+    you don't need to pass in keyword arguments to the format()
+    method, but can instead use values in a pre-existing namespace.
+    This can easily be done by overriding get_named() as follows:
+
+       class NamespaceFormatter(Formatter):
+          def __init__(self, namespace={}, flags=0):
+              Formatter.__init__(self, flags)
+              self.namespace = namespace
+
+          def get_named(self, kwds, name):
+              try:
+                  # Check explicitly passed arguments first
+                  return kwds[name]
+              except KeyError:
+                  return self.namespace[name]
+
+    One can use this to easily create a formatting function that allows
+    access to global variables, for example:
+
+        fmt = NamespaceFormatter(globals())
+
+        greeting = "hello"
+        print(fmt.format("{greeting}, world!"))
+
+    A similar technique can be used with the locals() dictionary to
+    gain access to local variables.
+
+    It would also be possible to create a 'smart' namespace formatter
+    that could automatically access both locals and globals through
+    snooping of the calling stack. Due to the need for compatibility
+    with the different versions of Python, such a capability will not
+    be included in the standard library; however, it is anticipated
+    that someone will create and publish a recipe for doing this.
+
+    Another type of customization is to change the way that built-in
+    types are formatted by overriding the 'format_field' method. (For
+    non-built-in types, you can simply define a __format__ special
+    method on that type.) So for example, you could override the
+    formatting of numbers to output scientific notation when needed.
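Overriding 'format_field' can be sketched using the spreadsheet example mentioned earlier, where a value too wide for its cell is shown as hash marks. This uses the string.Formatter class as shipped in the standard library, whose format_field(value, format_spec) signature is assumed here; `CellFormatter` and `cell_width` are hypothetical names.

```python
import string


class CellFormatter(string.Formatter):
    """Hypothetical spreadsheet-style formatter: overflow becomes '#' marks."""

    def __init__(self, cell_width):
        super().__init__()
        self.cell_width = cell_width

    def format_field(self, value, format_spec):
        # Let the default machinery produce the text, then check its width.
        text = super().format_field(value, format_spec)
        if len(text) > self.cell_width:
            return "#" * self.cell_width
        return text


fmt = CellFormatter(cell_width=6)
fits = fmt.format("[{0}]", 1234)
overflows = fmt.format("[{0}]", 123456789)
print(fits)
print(overflows)
```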
 
 
 Error handling
 
-    The string formatting system has two error handling modes, which
-    are controlled by the value of a class variable:
-    
-       string.strict_format_errors = True
-       
-    The 'strict_format_errors' flag defaults to False, or 'lenient'
-    mode. Setting it to True enables 'strict' mode. The current mode
-    determines how errors are handled, depending on the type of the
-    error.
-    
-    The types of errors that can occur are:
-    
-    1) Reference to a missing or invalid argument from within a
-    field specifier. In strict mode, this will raise an exception.
-    In lenient mode, this will cause the value of the field to be
-    replaced with the string '?name?', where 'name' will be the
-    type of error (KeyError, IndexError, or AttributeError).
-    
-    So for example:
-    
-        >>> string.strict_format_errors = False
-        >>> print 'Item 2 of argument 0 is: {0[2]}'.format( [0,1] )
-        "Item 2 of argument 0 is: ?IndexError?"
-    
-    2) Unused argument. In strict mode, this will raise an exception.
-    In lenient mode, this will be ignored.
-    
-    3) Exception raised by underlying formatter. These exceptions
-    are always passed through, regardless of the current mode.
+    There are two classes of exceptions which can occur during formatting:
+    exceptions generated by the formatter code itself, and exceptions
+    generated by user code (such as a field object's getattr function, or
+    the format_field method).
+
+    In general, exceptions generated by the formatter code itself are
+    of the "ValueError" variety -- there is an error in the actual "value"
+    of the format string.  (This is not always true; for example, the
+    string.format() function might be passed a non-string as its first
+    parameter, which would result in a TypeError.)
+
+    The text associated with these internally generated ValueError
+    exceptions will indicate the location of the exception inside
+    the format string, as well as the nature of the exception.
+
+    For exceptions generated by user code, a trace record and
+    dummy frame will be added to the traceback stack to help
+    in determining the location in the string where the exception
+    occurred.  The inserted traceback will indicate that the
+    error occurred at:
+
+        File "<format_string>;", line XX, in column_YY
+
+    where XX and YY represent the line and character position
+    information in the string, respectively.
 
 
 Alternate Syntax
@@ -483,9 +598,9 @@
 
     - Other variations include Ruby's #{}, PHP's {$name}, and so
       on.
-      
+
     Some specific aspects of the syntax warrant additional comments:
-    
+
     1) Backslash character for escapes.  The original version of
     this PEP used backslash rather than doubling to escape a bracket.
     This worked because backslashes in Python string literals that
@@ -494,18 +609,66 @@
     of confusion, and led to potential situations of multiple
     recursive escapes, i.e. '\\\\{' to place a literal backslash
     in front of a bracket.
-    
+
     2) The use of the colon character (':') as a separator for
     conversion specifiers.  This was chosen simply because that's
     what .Net uses.
-    
+
+
+Security Considerations
+
+    Historically, string formatting has been a common source of
+    security holes in web-based applications, particularly if the
+    string templating system allows arbitrary expressions to be
+    embedded in format strings.
+
+    The typical scenario is one where the string data being processed
+    is coming from outside the application, perhaps from HTTP headers
+    or fields within a web form. An attacker could substitute their
+    own strings designed to cause havoc.
+
+    The string formatting system outlined in this PEP is by no means
+    'secure', in the sense that no Python library module can, on its
+    own, guarantee security, especially given the open nature of
+    the Python language. Building a secure application requires a
+    secure approach to design.
+
+    What this PEP does attempt to do is make the job of designing a
+    secure application easier, by making it easier for a programmer
+    to reason about the possible consequences of a string formatting
+    operation. It does this by limiting those consequences to a smaller
+    and more easily understood subset.
+
+    For example, because it is possible in Python to override the
+    'getattr' operation of a type, the interpretation of a compound
+    replacement field such as "0.name" could potentially run
+    arbitrary code.
+
+    However, it is *extremely* rare for the mere retrieval of an
+    attribute to have side effects. Other operations which are more
+    likely to have side effects - such as method calls - are disallowed.
+    Thus, a programmer can be reasonably assured that no string
+    formatting operation will cause a state change in the program.
+    This assurance is not only useful in securing an application, but
+    in debugging it as well.
+
+    Similarly, the restriction on field names beginning with
+    underscores is intended to provide comparable assurances about
+    the visibility of private data.
+
+    Of course, programmers would be well-advised to avoid using
+    any external data as format strings, and instead use that data
+    as the format arguments.
+
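The advice about external data can be made concrete with a minimal sketch. It assumes the str.format method as shipped in Python 3, which does not enforce the leading-underscore restriction described in this draft; the `Account` class and its values are hypothetical.

```python
class Account:
    """Hypothetical object holding data that should stay private."""

    def __init__(self, owner, secret):
        self.owner = owner
        self._secret = secret


acct = Account("fred", "hunter2")
untrusted = "{0._secret}"  # imagine this arrived from a web form

# Risky: external data used as the format string can reach attributes.
leaked = untrusted.format(acct)

# Safer: external data used as an argument is inserted verbatim and is
# not itself interpreted as markup.
safe = "You typed: {0}".format(untrusted)

print(leaked)
print(safe)
```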
 
 Sample Implementation
 
-    A rough prototype of the underlying 'cformat' function has been
-    coded in Python, however it needs much refinement before being
-    submitted.
-    
+    An implementation of an earlier version of this PEP was created by
+    Patrick Maupin and Eric V. Smith, and can be found in the pep3101
+    sandbox at:
+
+       http://svn.python.org/view/sandbox/trunk/pep3101/
+
 
 Backwards Compatibility