[Python-checkins] r56535 - peps/trunk/pep-3101.txt

Wed Jul 25 01:36:35 CEST 2007

Author: talin
Date: Wed Jul 25 01:36:34 2007
New Revision: 56535

Modified:
   peps/trunk/pep-3101.txt
Log:
Updated PEP 3101 to incorporate latest feedback, and simplify even further. Also added additional explanation of custom formatting classes.


Modified: peps/trunk/pep-3101.txt
==============================================================================

--- peps/trunk/pep-3101.txt	(original)
+++ peps/trunk/pep-3101.txt	Wed Jul 25 01:36:34 2007
@@ -141,7 +141,7 @@
 
 Simple and Compound Field Names
 
-    Simple field names are either names or numbers. If numbers, they
+    Simple field names are either names or numbers.  If numbers, they
     must be valid base-10 integers; if names, they must be valid
     Python identifiers.  A number is used to identify a positional
     argument, while a name is used to identify a keyword argument.
@@ -152,44 +152,37 @@
         "My name is {0.name}".format(file('out.txt'))
 
     This example shows the use of the 'getattr' or 'dot' operator
-    in a field expression. The dot operator allows an attribute of
+    in a field expression.  The dot operator allows an attribute of
     an input value to be specified as the field value.
 
-    The types of expressions that can be used in a compound name
-    have been deliberately limited in order to prevent potential
-    security exploits resulting from the ability to place arbitrary
-    Python expressions inside of strings. Only two operators are
-    supported, the '.' (getattr) operator, and the '[]' (getitem)
-    operator.
-
-    Another limitation that is defined to limit potential security
-    issues is that field names or attribute names beginning with an
-    underscore are disallowed. This enforces the common convention
-    that names beginning with an underscore are 'private'.
+    Unlike some other programming languages, you cannot embed arbitrary
+    expressions in format strings.  This is by design - the types of
+    expressions that you can use are deliberately limited.  Only two operators
+    are supported: the '.' (getattr) operator, and the '[]' (getitem)
+    operator.  The reason for allowing these operators is that they don't
+    normally have side effects in non-pathological code.
 
     An example of the 'getitem' syntax:
 
         "My name is {0[name]}".format(dict(name='Fred'))
 
-    It should be noted that the use of 'getitem' within a string is
-    much more limited than its normal use. In the above example, the
-    string 'name' really is the literal string 'name', not a variable
-    named 'name'. The rules for parsing an item key are very simple.
+    It should be noted that the use of 'getitem' within a format string
+    is much more limited than its conventional usage.  In the above example,
+    the string 'name' really is the literal string 'name', not a variable
+    named 'name'.  The rules for parsing an item key are very simple.
     If it starts with a digit, then its treated as a number, otherwise
     it is used as a string.
 
     It is not possible to specify arbitrary dictionary keys from
     within a format string.
 
-    Implementation note:  The implementation of this proposal is
+    Implementation note: The implementation of this proposal is
     not required to enforce the rule about a name being a valid
     Python identifier.  Instead, it will rely on the getattr function
     of the underlying object to throw an exception if the identifier
     is not legal.  The format function will have a minimalist parser
     which only attempts to figure out when it is "done" with an
-    identifier (by finding a '.' or a ']', or '}', etc.)  The only
-    exception to this laissez-faire approach is that, by default,
-    strings are not allowed to have leading underscores.
+    identifier (by finding a '.' or a ']', or '}', etc.).
 
 
 Conversion Specifiers
@@ -215,11 +208,11 @@
     Note that the doubled '}' at the end, which would normally be
     escaped, is not escaped in this case.  The reason is because
     the '{{' and '}}' syntax for escapes is only applied when used
-    *outside* of a format field. Within a format field, the brace
+    *outside* of a format field.  Within a format field, the brace
     characters always have their normal meaning.
 
     The syntax for conversion specifiers is open-ended, since a class
-    can override the standard conversion specifiers. In such cases,
+    can override the standard conversion specifiers.  In such cases,
     the format() method merely passes all of the characters between
     the first colon and the matching brace to the relevant underlying
     formatting method.
@@ -248,7 +241,7 @@
         '>' - Forces the field to be right-aligned within the
               available space.
         '=' - Forces the padding to be placed after the sign (if any)
-              but before the digits. This is used for printing fields
+              but before the digits.  This is used for printing fields
               in the form '+000000120'.
         '^' - Forces the field to be centered within the available
               space.
@@ -261,7 +254,7 @@
     pad the field to the minimum width.  The alignment flag must be
     supplied if the character is a number other than 0 (otherwise the
     character would be interpreted as part of the field width
-    specifier). A zero fill character without an alignment flag
+    specifier).  A zero fill character without an alignment flag
     implies an alignment type of '='.
 
     The 'sign' element can be one of the following:
@@ -269,20 +262,20 @@
         '+'  - indicates that a sign should be used for both
                positive as well as negative numbers
         '-'  - indicates that a sign should be used only for negative
-               numbers (this is the default behaviour)
+               numbers (this is the default behavior)
         ' '  - indicates that a leading space should be used on
                positive numbers
         '()' - indicates that negative numbers should be surrounded
                by parentheses
 
-    'width' is a decimal integer defining the minimum field width. If
+    'width' is a decimal integer defining the minimum field width.  If
     not specified, then the field width will be determined by the
     content.
 
     The 'precision' is a decimal number indicating how many digits
     should be displayed after the decimal point in a floating point
-    conversion. In a string conversion the field indicates how many
-    characters will be used from the field content. The precision is
+    conversion.  In a string conversion the field indicates how many
+    characters will be used from the field content.  The precision is
     ignored for integer conversions.
 
     Finally, the 'type' determines how the data should be presented.
@@ -292,11 +285,11 @@
 
     The available string conversion types are:
 
-        's' - String format. Invokes str() on the object.
+        's' - String format.  Invokes str() on the object.
               This is the default conversion specifier type.
-        'r' - Repr format. Invokes repr() on the object.
+        'r' - Repr format.  Invokes repr() on the object.
 
-    There are several integer conversion types. All invoke int() on
+    There are several integer conversion types.  All invoke int() on
     the object before attempting to format it.
 
     The available integer conversion types are:
@@ -311,7 +304,7 @@
         'X' - Hex format. Outputs the number in base 16, using upper-
               case letters for the digits above 9.
 
-    There are several floating point conversion types. All invoke
+    There are several floating point conversion types.  All invoke
     float() on the object before attempting to format it.
 
     The available floating point conversion types are:
@@ -380,97 +373,125 @@
     format engine can be obtained through the 'Formatter' class that
     lives in the 'string' module.  This class takes additional options
     which are not accessible via the normal str.format method.
-
-    An application can create their own Formatter instance which has
-    customized behavior, either by setting the properties of the
-    Formatter instance, or by subclassing the Formatter class.
+    
+    An application can subclass the Formatter class to create their
+    own customized formatting behavior.
 
     The PEP does not attempt to exactly specify all methods and
     properties defined by the Formatter class; Instead, those will be
-    defined and documented in the initial implementation. However, this
+    defined and documented in the initial implementation.  However, this
     PEP will specify the general requirements for the Formatter class,
     which are listed below.
 
-
-Formatter Creation and Initialization
-
-    The Formatter class takes a single initialization argument, 'flags':
-
-        Formatter(flags=0)
-
-    The 'flags' argument is used to control certain subtle behavioral
-    differences in formatting that would be cumbersome to change via
-    subclassing. The flags values are defined as static variables
-    in the "Formatter" class:
-
-        Formatter.ALLOW_LEADING_UNDERSCORES
-
-            By default, leading underscores are not allowed in identifier
-            lookups (getattr or getitem).  Setting this flag will allow
-            this.
-
-        Formatter.CHECK_UNUSED_POSITIONAL
-
-            If this flag is set, the any positional arguments which are
-            supplied to the 'format' method but which are not used by
-            the format string will cause an error.
-
-        Formatter.CHECK_UNUSED_NAME
-
-            If this flag is set, the any named arguments which are
-            supplied to the 'format' method but which are not used by
-            the format string will cause an error.
+    Although string.format() does not directly use the Formatter class
+    to do formatting, both use the same underlying implementation.  The
+    reason that string.format() does not use the Formatter class directly
+    is because "string" is a built-in type, which means that all of its
+    methods must be implemented in C, whereas Formatter is a Python
+    class.  Formatter provides an extensible wrapper around the same
+    C functions as are used by string.format().
 
 
 Formatter Methods
 
-    The methods of class Formatter are as follows:
+    The Formatter class takes no initialization arguments:
+    
+        fmt = Formatter()
+
+    The public API methods of class Formatter are as follows:
 
         -- format(format_string, *args, **kwargs)
         -- vformat(format_string, args, kwargs)
-        -- get_positional(args, index)
-        -- get_named(kwds, name)
-        -- format_field(value, conversion)
-
-    'format' is the primary API method. It takes a format template,
-    and an arbitrary set of positional and keyword argument. 'format'
+        
+    'format' is the primary API method.  It takes a format template,
+    and an arbitrary set of positional and keyword arguments.  'format'
     is just a wrapper that calls 'vformat'.
 
-    'vformat' is the function that does the actual work of formatting. It
+    'vformat' is the function that does the actual work of formatting.  It
     is exposed as a separate function for cases where you want to pass in
     a predefined dictionary of arguments, rather than unpacking and
     repacking the dictionary as individual arguments using the '*args' and
-    '**kwds' syntax. 'vformat' does the work of breaking up the format
-    template string into character data and replacement fields. It calls
-    the 'get_positional' and 'get_index' methods as appropriate.
+    '**kwds' syntax.  'vformat' does the work of breaking up the format
+    template string into character data and replacement fields.  It calls
+    the 'get_positional' and 'get_named' methods as appropriate (described
+    below).
 
-    Note that the checking of unused arguments, and the restriction on
-    leading underscores in attribute names are also done in this function.
+    Formatter defines the following overridable methods:
+        
+        -- get_positional(args, index)
+        -- get_named(kwds, name)
+        -- check_unused_args(used_args, args, kwargs)
+        -- format_field(value, conversion)
 
     'get_positional' and 'get_named' are used to retrieve a given field
-    value. For compound field names, these functions are only called for
+    value.  For compound field names, these functions are only called for
     the first component of the field name; Subsequent components are
-    handled through normal attribute and indexing operations. So for
-    example, the field expression '0.name' would cause 'get_positional' to
-    be called with the list of positional arguments and a numeric index of
-    0, and then the standard 'getattr' function would be called to get the
-    'name' attribute of the result.
+    handled through normal attribute and indexing operations.
+    
+    So for example, the field expression '0.name' would cause
+    'get_positional' to be called with the parameter 'args' set to the
+    list of positional arguments to vformat, and 'index' set to zero;
+    the returned value would then be passed to the standard 'getattr'
+    function to get the 'name' attribute.
 
     If the index or keyword refers to an item that does not exist, then an
     IndexError/KeyError will be raised.
+    
+    'check_unused_args' is used to implement checking for unused arguments
+    if desired.  The arguments to this function are the set of all argument
+    keys that were actually referred to in the format string (integers for
+    positional arguments, and strings for named arguments), and a reference
+    to the args and kwargs that were passed to vformat.  The set of unused
+    args is the difference between the supplied keys and the used keys.
+    'check_unused_args' is assumed to throw an exception if the check fails.
 
     'format_field' actually generates the text for a replacement field.
     The 'value' argument corresponds to the value being formatted, which
-    was retrieved from the arguments using the field name. The
+    was retrieved from the arguments using the field name.  The
     'conversion' argument is the conversion spec part of the field, which
     will be either a string or unicode object, depending on the type of
     the original format string.
-
-    Note: The final implementation of the Formatter class may define
-    additional overridable methods and hooks. In particular, it may be
-    that 'vformat' is itself a composition of several additional,
-    overridable methods. (Depending on whether it is convenient to the
-    implementor of Formatter.)
+    
+    To get a better understanding of how these functions relate to each
+    other, here is pseudocode that explains the general operation of
+    vformat:
+    
+        def vformat(self, format_string, args, kwargs):
+
+          # Output buffer and set of used args
+          buffer = StringIO.StringIO()
+          used_args = set()
+
+          # Tokens are either format fields or literal strings
+          for token in self.parse(format_string):
+            if is_format_field(token):
+              field_spec, _, conversion_spec = token.partition(":")
+
+              # 'first_part' is the part before the first '.' or '['
+              first_part = get_first_part(field_spec)
+              used_args.add(first_part)
+              if is_positional(first_part):
+                value = self.get_positional(args, first_part)
+              else:
+                value = self.get_named(kwargs, first_part)
+
+              # Handle [subfield] or .subfield
+              for comp in components(field_spec):
+                value = resolve_subfield(value, comp)
+
+              # Write out the converted value
+              buffer.write(self.format_field(value, conversion_spec))
+
+            else:
+              buffer.write(token)
+
+          self.check_unused_args(used_args, args, kwargs)
+          return buffer.getvalue()
+          
+    Note that the actual algorithm of the Formatter class may not be the
+    one presented here.  In particular, the final implementation of
+    the Formatter class may define additional overridable methods and
+    hooks.  Also, the final implementation will be written in C.
 
 
 Customizing Formatters
@@ -511,15 +532,15 @@
 
     It would also be possible to create a 'smart' namespace formatter
     that could automatically access both locals and globals through
-    snooping of the calling stack. Due to the need for compatibility
+    snooping of the calling stack.  Due to the need for compatibility
     the different versions of Python, such a capability will not be
     included in the standard library, however it is anticipated that
     someone will create and publish a recipe for doing this.
 
     Another type of customization is to change the way that built-in
-    types are formatted by overriding the 'format_field' method. (For
+    types are formatted by overriding the 'format_field' method.  (For
     non-built-in types, you can simply define a __format__ special
-    method on that type.) So for example, you could override the
+    method on that type.)  So for example, you could override the
     formatting of numbers to output scientific notation when needed.
 
 
@@ -527,8 +548,7 @@
 
     There are two classes of exceptions which can occur during formatting:
     exceptions generated by the formatter code itself, and exceptions
-    generated by user code (such as a field object's getattr function, or
-    the field_hook function).
+    generated by user code (such as a field object's 'getattr' function).
 
     In general, exceptions generated by the formatter code itself are
     of the "ValueError" variety -- there is an error in the actual "value"
@@ -605,7 +625,7 @@
     this PEP used backslash rather than doubling to escape a bracket.
     This worked because backslashes in Python string literals that
     don't conform to a standard backslash sequence such as '\n'
-    are left unmodified. However, this caused a certain amount
+    are left unmodified.  However, this caused a certain amount
     of confusion, and led to potential situations of multiple
     recursive escapes, i.e. '\\\\{' to place a literal backslash
     in front of a bracket.
@@ -615,6 +635,38 @@
     what .Net uses.
 
 
+Alternate Feature Proposals
+
+    Restricting attribute access: An earlier version of the PEP
+    restricted the ability to access attributes beginning with a
+    leading underscore, for example "{0._private}".  However, this
+    is a useful ability to have when debugging, so the feature
+    was dropped.
+    
+    Some developers suggested that the ability to do 'getattr' and
+    'getitem' access should be dropped entirely.  However, this
+    is in conflict with the needs of another set of developers who
+    strongly lobbied for the ability to pass in a large dict as a
+    single argument (without flattening it into individual keyword
+    arguments using the **kwargs syntax) and then have the format
+    string refer to dict entries individually.
+    
+    There have also been suggestions to expand the set of expressions
+    that are allowed in a format string.  However, this was seen
+    to go against the spirit of TOOWTDI, since the same effect can
+    be achieved in most cases by executing the same expression on
+    the parameter before it's passed in to the formatting function.
+    For cases where the format string is being used to do arbitrary
+    formatting in a data-rich environment, it's recommended to use
+    a templating engine specialized for this purpose, such as
+    Genshi [5] or Cheetah [6].
+    
+    Many other features were considered and rejected because they
+    could easily be achieved by subclassing Formatter instead of
+    building the feature into the base implementation.  This includes
+    alternate syntax, comments in format strings, and many others.
+    
+
 Security Considerations
 
     Historically, string formatting has been a common source of
@@ -622,43 +674,21 @@
     string templating system allows arbitrary expressions to be
     embedded in format strings.
 
-    The typical scenario is one where the string data being processed
-    is coming from outside the application, perhaps from HTTP headers
-    or fields within a web form. An attacker could substitute their
-    own strings designed to cause havok.
-
-    The string formatting system outlined in this PEP is by no means
-    'secure', in the sense that no Python library module can, on its
-    own, guarantee security, especially given the open nature of
-    the Python language. Building a secure application requires a
-    secure approach to design.
-
-    What this PEP does attempt to do is make the job of designing a
-    secure application easier, by making it easier for a programmer
-    to reason about the possible consequences of a string formatting
-    operation. It does this by limiting those consequences to a smaller
-    and more easier understood subset.
-
-    For example, because it is possible in Python to override the
-    'getattr' operation of a type, the interpretation of a compound
-    replacement field such as "0.name" could potentially run
-    arbitrary code.
-
-    However, it is *extremely* rare for the mere retrieval of an
-    attribute to have side effects. Other operations which are more
-    likely to have side effects - such as method calls - are disallowed.
-    Thus, a programmer can be reasonably assured that no string
-    formatting operation will cause a state change in the program.
-    This assurance is not only useful in securing an application, but
-    in debugging it as well.
-
-    Similarly, the restriction on field names beginning with
-    underscores is intended to provide similar assurances about the
-    visibility of private data.
-
-    Of course, programmers would be well-advised to avoid using
-    any external data as format strings, and instead use that data
-    as the format arguments instead.
+    The best way to use string formatting in a way that does not
+    create potential security holes is to never use format strings
+    that come from an untrusted source.
+    
+    Barring that, the next best approach is to ensure that string
+    formatting has no side effects.  Because of the open nature of
+    Python, it is impossible to guarantee that any non-trivial
+    operation has this property.  What this PEP does is limit the
+    types of expressions in format strings to those in which visible
+    side effects are both rare and strongly discouraged by the
+    culture of Python developers.  So for example, attribute access
+    is allowed because it would be considered pathological to write
+    code where the mere access of an attribute has visible side
+    effects (whether the code has *invisible* side effects - such
+    as creating a cache entry for faster lookup - is irrelevant.)
 
 
 Sample Implementation
@@ -692,6 +722,12 @@
 
     [4] Composite Formatting - [.Net Framework Developer's Guide]
         http://msdn.microsoft.com/library/en-us/cpguide/html/cpconcompositeformatting.asp?frame=true
+        
+    [5] Genshi templating engine.
+        http://genshi.edgewall.org/
+
+    [6] Cheetah - The Python-Powered Template Engine.
+        http://www.cheetahtemplate.org/
 
 
 Copyright
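
The 'check_unused_args' hook introduced in this revision can be sketched
roughly as follows.  This is only an illustration under stated assumptions:
it uses the string.Formatter class as shipped in current Python releases,
which exposes check_unused_args(used_args, args, kwargs), rather than the
draft API quoted in the diff above.  The StrictFormatter name is
hypothetical.

```python
import string


class StrictFormatter(string.Formatter):
    """Hypothetical subclass that treats unused arguments as an error."""

    def check_unused_args(self, used_args, args, kwargs):
        # Every key the caller supplied: integer indexes for positional
        # arguments, strings for keyword arguments.
        supplied = set(range(len(args))) | set(kwargs)
        # Unused keys are the supplied keys minus the ones actually used.
        unused = supplied - set(used_args)
        if unused:
            raise ValueError("unused arguments: %r" % sorted(unused, key=str))


fmt = StrictFormatter()
print(fmt.format("My name is {0[name]}", dict(name="Fred")))
```

With this subclass, a call such as fmt.format("{0}", "a", extra=1) would
raise ValueError, because the keyword argument 'extra' is never referenced
by the format string.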