[Python-checkins] r46845 - peps/trunk/pep-3101.txt

Sun Jun 11 02:59:07 CEST 2006

Author: talin
Date: Sun Jun 11 02:59:06 2006
New Revision: 46845

Modified:
   peps/trunk/pep-3101.txt
Log:
Lots of changes - added specification for conversions, error handling, complex field specs and general cleanup.



Modified: peps/trunk/pep-3101.txt
==============================================================================

--- peps/trunk/pep-3101.txt	(original)
+++ peps/trunk/pep-3101.txt	Sun Jun 11 02:59:06 2006
@@ -8,7 +8,7 @@
 Content-Type: text/plain
 Created: 16-Apr-2006
 Python-Version: 3.0
-Post-History: 28-Apr-2006
+Post-History: 28-Apr-2006, 6-May-2006, 10-Jun-2006
 
 
 Abstract
@@ -48,7 +48,7 @@
 
 Specification
 
-    The specification will consist of 4 parts:
+    The specification will consist of the following parts:
 
     - Specification of a new formatting method to be added to the
       built-in string class.
@@ -60,6 +60,26 @@
 
     - Specification of an API for user-defined formatting classes.
 
+    - Specification of how formatting errors are handled.
+    
+    Note on string encodings: Since this PEP is being targeted
+    at Python 3.0, it is assumed that all strings are unicode strings,
+    and that the use of the word 'string' in the context of this
+    document will generally refer to a Python 3.0 string, which is
+    the same as a Python 2.x unicode object.
+    
+    If it should happen that this functionality is backported to
+    the 2.x series, then it will be necessary to handle both regular
+    strings and unicode objects.  All of the function call
+    interfaces described in this PEP can be used for both strings
+    and unicode objects, and in all cases there is sufficient
+    information to be able to properly deduce the output string
+    type (in other words, there is no need for two separate APIs).
+    In all cases, the type of the template string dominates - that
+    is, the conversion will always result in an object
+    that contains the same representation of characters as the
+    input template string.
+
 
 String Methods
 
@@ -75,9 +95,6 @@
     identified by its keyword name, so in the above example, 'c' is
     used to refer to the third argument.
 
-    The result of the format call is an object of the same type
-    (string or unicode) as the format string.
-
 
 Format Strings
 
@@ -90,32 +107,59 @@
 
         "My name is Fred"
 
-    Braces can be escaped using a backslash:
+    Braces can be escaped by doubling:
 
-        "My name is {0} :-\{\}".format('Fred')
+        "My name is {0} :-{{}}".format('Fred')
 
     Which would produce:
 
         "My name is Fred :-{}"
-
+        
     The element within the braces is called a 'field'.  Fields consist
     of a 'field name', which can either be simple or compound, and an
     optional 'conversion specifier'.
+    
+
+Simple and Compound Field Names
 
     Simple field names are either names or numbers. If numbers, they
     must be valid base-10 integers; if names, they must be valid
     Python identifiers.  A number is used to identify a positional
     argument, while a name is used to identify a keyword argument.
+    
+    A compound field name is a combination of multiple simple field
+    names in an expression:
 
-    Compound names are a sequence of simple names seperated by
-    periods:
+        "My name is {0.name}".format(file('out.txt'))
+        
+    This example shows the use of the 'getattr' or 'dot' operator
+    in a field expression. The dot operator allows an attribute of
+    an input value to be specified as the field value.
+
+    The types of expressions that can be used in a compound name
+    have been deliberately limited in order to prevent potential
+    security exploits resulting from the ability to place arbitrary
+    Python expressions inside of strings. Only two operators are
+    supported, the '.' (getattr) operator, and the '[]' (getitem)
+    operator.
+    
+    An example of the 'getitem' syntax:
+    
+        "My name is {0[name]}".format(dict(name='Fred'))
+    
+    It should be noted that the use of 'getitem' within a string is
+    much more limited than its normal use. In the above example, the
+    string 'name' really is the literal string 'name', not a variable
+    named 'name'. The rules for parsing an item key are the same as
+    for parsing a simple name - in other words, if it looks like a
+    number, then it's treated as a number; if it looks like an
+    identifier, then it is used as a string.
+    
+    It is not possible to specify arbitrary dictionary keys from
+    within a format string.
 
-        "My name is {0.name} :-\{\}".format(dict(name='Fred'))
 
-    Compound names can be used to access specific dictionary entries,
-    array elements, or object attributes.  In the above example, the
-    '{0.name}' field refers to the dictionary entry 'name' within
-    positional argument 0.
+Conversion Specifiers
 
     Each field can also specify an optional set of 'conversion
     specifiers' which can be used to adjust the format of that field.
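
A minimal sketch of the escaping and compound-name rules in the hunk above, assuming the str.format method of Python 3 behaves like this draft for these cases. The 'person' object built with SimpleNamespace is a hypothetical stand-in; the draft's own attribute example uses a file object.

```python
from types import SimpleNamespace

# Doubled braces outside a field produce literal braces.
escaped = "My name is {0} :-{{}}".format("Fred")

# '.' (getattr): read an attribute of positional argument 0.
person = SimpleNamespace(name="Fred")  # hypothetical stand-in object
by_attr = "My name is {0.name}".format(person)

# '[]' (getitem): the key is the literal string 'name', not a variable.
by_key = "My name is {0[name]}".format(dict(name="Fred"))

# A key that looks like a number is treated as an integer index.
by_index = "Second item: {0[1]}".format(["a", "b"])

print(escaped, by_attr, by_key, by_index, sep="\n")
```

Because item keys are parsed like simple names, a key such as 'first name' (with a space) cannot be reached this way under the draft's rules.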
@@ -129,53 +173,135 @@
     built-in types will recognize a standard set of conversion
     specifiers.
 
-    The conversion specifier consists of a sequence of zero or more
-    characters, each of which can consist of any printable character
-    except for a non-escaped '}'.
-    
-    Conversion specifiers can themselves contain replacement fields;
-    this will be described in a later section.  Except for this
-    replacement, the format() method does not attempt to intepret the
-    conversion specifiers in any way; it merely passes all of the
-    characters between the first colon ':' and the matching right
-    brace ('}') to the various underlying formatters (described
-    later.)
+    Conversion specifiers can themselves contain replacement fields.
+    For example, a field whose field width is itself a parameter
+    could be specified via:
+    
+        "{0:{1}}".format(a, b, c)
+        
+    Note that the doubled '}' at the end, which would normally be
+    escaped, is not escaped in this case.  This is because
+    the '{{' and '}}' syntax for escapes is only applied when used
+    *outside* of a format field. Within a format field, the brace
+    characters always have their normal meaning.
+    
+    The syntax for conversion specifiers is open-ended, since, other
+    than doing field replacements, the format() method does not
+    attempt to interpret them in any way; it merely passes all of the
+    characters between the first colon and the matching brace to
+    the various underlying formatter methods.
 
 
 Standard Conversion Specifiers
 
-    For most built-in types, the conversion specifiers will be the
-    same or similar to the existing conversion specifiers used with
-    the '%' operator.  Thus, instead of '%02.2x", you will say
-    '{0:02.2x}'.
-
-    There are a few differences however:
-
-    - The trailing letter is optional - you don't need to say '2.2d',
-      you can instead just say '2.2'.  If the letter is omitted, a
-      default will be assumed based on the type of the argument.
-      The defaults will be as follows:
-      
-        string or unicode object: 's'
-        integer: 'd'
-        floating-point number: 'f'
-        all other types: 's'
-
-    - Variable field width specifiers use a nested version of the {}
-      syntax, allowing the width specifier to be either a positional
-      or keyword argument:
-
-        "{0:{1}.{2}d}".format(a, b, c)
-
-    - The support for length modifiers (which are ignored by Python
-      anyway) is dropped.
-
-    For non-built-in types, the conversion specifiers will be specific
-    to that type.  An example is the 'datetime' class, whose
-    conversion specifiers are identical to the arguments to the
-    strftime() function:
+    If an object does not define its own conversion specifiers, a
+    standard set of conversion specifiers is used.  These are similar
+    in concept to the conversion specifiers used by the existing '%'
+    operator; however, there are also a number of significant
+    differences.  The standard conversion specifiers fall into three
+    major categories: string conversions, integer conversions and
+    floating point conversions.
+    
+    The general form of a standard conversion specifier is:
+
+        [[fill]align][sign][width][.precision][type]
+
+    The brackets ([]) indicate an optional field.
+    
+    The optional 'align' flag can be one of the following:
+
+        '<' - Forces the field to be left-aligned within the available
+              space (This is the default.)
+        '>' - Forces the field to be right-aligned within the
+              available space.
+        '=' - Forces the padding to be placed immediately
+              after the sign, if any. This is used for printing fields
+              in the form '+000000120'.
+              
+    Note that unless a minimum field width is defined, the field
+    width will always be the same size as the data to fill it, so
+    that the alignment option has no meaning in this case.
+              
+    The optional 'fill' character defines the character to be used to
+    pad the field to the minimum width.  The alignment flag must be
+    supplied if the character is a number other than 0 (otherwise the
+    character would be interpreted as part of the field width
+    specifier). A zero fill character without an alignment flag
+    implies an alignment type of '='.
+    
+    The 'sign' field can be one of the following:
+
+        '+'  - indicates that a sign should be used for both
+               positive as well as negative numbers
+        '-'  - indicates that a sign should be used only for negative
+               numbers (this is the default behaviour)
+        ' '  - indicates that a leading space should be used on
+               positive numbers
+        '()' - indicates that negative numbers should be surrounded
+               by parentheses
+
+    'width' is a decimal integer defining the minimum field width. If
+    not specified, then the field width will be determined by the
+    content.
+
+    The 'precision' field is a decimal number indicating how many
+    digits should be displayed after the decimal point.
+
+    Finally, the 'type' determines how the data should be presented.
+    If the type field is absent, an appropriate type will be assigned
+    based on the value to be formatted ('d' for integers and longs,
+    'g' for floats, and 's' for everything else.)
+
+    The available string conversion types are:
+
+        's' - String format. Invokes str() on the object.
+              This is the default conversion specifier type.
+        'r' - Repr format. Invokes repr() on the object.
+
+    There are several integer conversion types. All invoke int() on
+    the object before attempting to format it.
+
+    The available integer conversion types are:
+
+        'b' - Binary. Outputs the number in base 2.
+        'c' - Character. Converts the integer to the corresponding
+              unicode character before printing.
+        'd' - Decimal Integer. Outputs the number in base 10.
+        'o' - Octal format. Outputs the number in base 8.
+        'x' - Hex format. Outputs the number in base 16, using lower-
+              case letters for the digits above 9.
+        'X' - Hex format. Outputs the number in base 16, using upper-
+              case letters for the digits above 9.
+
+    There are several floating point conversion types. All invoke
+    float() on the object before attempting to format it.
+
+    The available floating point conversion types are:
+
+        'e' - Exponent notation. Prints the number in scientific
+              notation using the letter 'e' to indicate the exponent.
+        'E' - Exponent notation. Same as 'e' except it uses an upper
+              case 'E' as the separator character.
+        'f' - Fixed point. Displays the number as a fixed-point
+              number.
+        'F' - Fixed point. Same as 'f'.
+        'g' - General format. This prints the number as a fixed-point
+              number, unless the number is too large, in which case
+              it switches to 'e' exponent notation.
+        'G' - General format. Same as 'g' except switches to 'E'
+              if the number gets too large.
+        'n' - Number. This is the same as 'g', except that it uses the
+              current locale setting to insert the appropriate
+              number separator characters.
+        '%' - Percentage. Multiplies the number by 100 and displays
+              in fixed ('f') format, followed by a percent sign.
+
+    Objects are able to define their own conversion specifiers to
+    replace the standard ones.  An example is the 'datetime' class,
+    whose conversion specifiers might look something like the
+    arguments to the strftime() function:
 
-        "Today is: {0:%a %b %d %H:%M:%S %Y}".format(datetime.now())
+        "Today is: {0:a b d H:M:S Y}".format(datetime.now())
 
 
 Controlling Formatting
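
A short sketch of the [[fill]align][sign][width][.precision][type] form from the hunk above, assuming Python 3's built-in str.format. That method descends from this PEP but is not identical to this draft: for example, it has no '()' sign option, and it right-aligns numbers by default. Only specifiers that both share are shown, and the variable names are illustrative.

```python
# General form: [[fill]align][sign][width][.precision][type]

right = "{0:>8}".format("abc")        # right-align within 8 columns
filled = "{0:*<8}".format("abc")      # '*' fill character, left-aligned
signed = "{0:0=+10d}".format(120)     # zero padding placed after the sign
binary = "{0:b}".format(10)           # base 2
hexes = "{0:x} {0:X}".format(255)     # base 16, lower- and upper-case digits
char = "{0:c}".format(65)             # integer to the corresponding character
fixed = "{0:.2f}".format(3.14159)     # two digits after the decimal point
percent = "{0:.1%}".format(0.25)      # multiply by 100, append '%'
nested = "{0:{1}}".format("abc", 6)   # field width taken from argument 1

for result in (right, filled, signed, binary, hexes,
               char, fixed, percent, nested):
    print(repr(result))
```

The 'signed' case reproduces the '+000000120' form that the '=' alignment flag is described as supporting.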
@@ -224,19 +350,22 @@
     API for such an application-specific formatter is up to the
     application; here are several possible examples:
     
-        cell_format( "The total is: {0}", total )
+        cell_format("The total is: {0}", total)
         
-        TemplateString( "The total is: {0}" ).format( total )
+        TemplateString("The total is: {0}").format(total)
         
     Creating an application-specific formatter is relatively straight-
     forward.  The string and unicode classes will have a class method
     called 'cformat' that does all the actual work of formatting; The
     built-in format() method is just a wrapper that calls cformat.
+    
+    The type signature for the cformat function is as follows:
+    
+        cformat(template, format_hook, args, kwargs)
 
     The parameters to the cformat function are:
 
-        -- The format string (or unicode; the same function handles
-           both.)
+        -- The format template string.
         -- A callable 'format hook', which is called once per field
         -- A tuple containing the positional arguments
         -- A dict containing the keyword arguments
@@ -251,7 +380,7 @@
     will attempt to call the field format hook with the following
     arguments:
 
-       format_hook(value, conversion, buffer)
+       format_hook(value, conversion)
 
     The 'value' field corresponds to the value being formatted, which
     was retrieved from the arguments using the field name.
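
A rough sketch of the two-argument hook protocol, under stated assumptions: the 'cformat' function below is a hypothetical stand-in built on Python 3's string.Formatter, not the class method this PEP proposes, and the 'money' conversion is invented for illustration. It follows the revised text of this patch, in which the hook returns either the formatted string or None to request default formatting.

```python
import string


class HookFormatter(string.Formatter):
    """Hypothetical helper: consults a format hook once per field."""

    def __init__(self, format_hook):
        super().__init__()
        self.format_hook = format_hook

    def format_field(self, value, format_spec):
        # Give the hook the first chance to format the field.
        result = self.format_hook(value, format_spec)
        if result is None:
            # The hook declined, so fall back to default formatting.
            return super().format_field(value, format_spec)
        return result


def cformat(template, format_hook, args, kwargs):
    """Stand-in with the argument order given for cformat in this PEP."""
    return HookFormatter(format_hook).vformat(template, args, kwargs)


def money_hook(value, conversion):
    # 'money' is an invented, application-specific conversion specifier.
    if conversion == "money" and isinstance(value, (int, float)):
        return "${0:,.2f}".format(value)
    return None  # not handled here; use the default formatting


text = cformat("The total is: {0:money} for {1} items",
               money_hook, (1234.5, 3), {})
print(text)
```

The hook decides from the value's type and the conversion string, as the text recommends, and never sees the partially built output.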
@@ -260,20 +389,49 @@
     field, which will be either a string or unicode object, depending
     on the type of the original format string.
 
-    The 'buffer' argument is a Python array object, either a byte
-    array or unicode character array.  The buffer object will contain
-    the partially constructed string; the field hook is free to modify
-    the contents of this buffer if needed.
-
     The field_hook will be called once per field. The field_hook may
     take one of two actions:
+    
+        1) Return a string or unicode object that is the result
+           of the formatting operation.
 
-        1) Return False, indicating that the field_hook will not
+        2) Return None, indicating that the field_hook will not
            process this field and the default formatting should be
            used.  This decision should be based on the type of the
            value object, and the contents of the conversion string.
 
-        2) Append the formatted field to the buffer, and return True.
+
+Error handling
+
+    The string formatting system has two error handling modes, which
+    are controlled by the value of a class variable:
+    
+       string.strict_format_errors = True
+       
+    The 'strict_format_errors' flag defaults to False, or 'lenient'
+    mode. Setting it to True enables 'strict' mode. The current mode
+    determines how errors are handled, depending on the type of the
+    error.
+    
+    The types of errors that can occur are:
+    
+    1) Reference to a missing or invalid argument from within a
+    field specifier. In strict mode, this will raise an exception.
+    In lenient mode, this will cause the value of the field to be
+    replaced with the string '?name?', where 'name' will be the
+    type of error (KeyError, IndexError, or AttributeError).
+    
+    So for example:
+    
+        >>> string.strict_format_errors = False
+        >>> print 'Item 2 of argument 0 is: {0[2]}'.format([0, 1])
+        Item 2 of argument 0 is: ?IndexError?
+    
+    2) Unused argument. In strict mode, this will raise an exception.
+    In lenient mode, this will be ignored.
+    
+    3) Exception raised by underlying formatter. These exceptions
+    are always passed through, regardless of the current mode.
 
 
 Alternate Syntax
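
A minimal sketch of the lenient and strict modes described in the hunk above. Everything here is an assumption made for illustration: the draft proposes a 'strict_format_errors' class variable on the string type, whereas this sketch uses a hypothetical LenientFormatter subclass of Python 3's string.Formatter with a 'strict' argument, and it picks ValueError for unused arguments because the draft does not name an exception.

```python
import string


class _Placeholder:
    """Formats as fixed text, ignoring any conversion specifier."""

    def __init__(self, text):
        self.text = text

    def __format__(self, spec):
        return self.text


class LenientFormatter(string.Formatter):
    """Hypothetical formatter with the draft's two error modes."""

    def __init__(self, strict=False):
        super().__init__()
        self.strict = strict

    def get_field(self, field_name, args, kwargs):
        try:
            return super().get_field(field_name, args, kwargs)
        except (KeyError, IndexError, AttributeError) as exc:
            if self.strict:
                raise  # error type 1, strict mode: propagate
            # Error type 1, lenient mode: substitute '?ErrorName?'.
            marker = "?{0}?".format(type(exc).__name__)
            return _Placeholder(marker), field_name

    def check_unused_args(self, used_args, args, kwargs):
        if not self.strict:
            return  # error type 2, lenient mode: ignore
        unused = [i for i in range(len(args)) if i not in used_args]
        unused += [k for k in kwargs if k not in used_args]
        if unused:
            # Assumption: the draft does not say which exception to raise.
            raise ValueError("unused arguments: {0!r}".format(unused))


lenient = LenientFormatter()
print(lenient.format("Item 2 of argument 0 is: {0[2]}", [0, 1]))
```

Exceptions raised by an underlying formatter (error type 3) are not caught anywhere in this sketch, so they pass through in both modes.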
@@ -325,22 +483,14 @@
       
     Some specific aspects of the syntax warrant additional comments:
     
-    1) The use of the backslash character for escapes.  A few people
-    suggested doubling the brace characters to indicate a literal
-    brace rather than using backslash as an escape character.  This is
-    also the convention used in the .Net libraries.  Here's how the
-    previously-given example would look with this convention:
-    
-        "My name is {0} :-{{}}".format('Fred')
-    
-    One problem with this syntax is that it conflicts with the use of
-    nested braces to allow parameterization of the conversion
-    specifiers:
-    
-        "{0:{1}.{2}}".format(a, b, c)
-        
-    (There are alternative solutions, but they are too long to go
-    into here.)
+    1) Backslash character for escapes.  The original version of
+    this PEP used backslash rather than doubling to escape a brace.
+    This worked because backslashes in Python string literals that
+    don't conform to a standard backslash sequence such as '\n'
+    are left unmodified. However, this caused a certain amount
+    of confusion, and led to potential situations of multiple
+    recursive escapes, e.g. '\\\\{' to place a literal backslash
+    in front of a brace.
     
     2) The use of the colon character (':') as a separator for
     conversion specifiers.  This was chosen simply because that's