[Python-checkins] r54109 - sandbox/trunk/pep3101/pep_differences.txt

Sat Mar 3 21:28:36 CET 2007

Author: patrick.maupin
Date: Sat Mar  3 21:28:34 2007
New Revision: 54109

Added:
   sandbox/trunk/pep3101/pep_differences.txt
Log:
Added pep_differences.txt to document initial implementation target.
Updated README.txt to move info into pep_differences.
Cleaned up escape-to-markup processing to fix bug and enable
easy alternate syntax testing.
Changed version number in setup.py to reflect the fact we're not at 1.0 yet.


Added: sandbox/trunk/pep3101/pep_differences.txt
==============================================================================

--- (empty file)
+++ sandbox/trunk/pep3101/pep_differences.txt	Sat Mar  3 21:28:34 2007
@@ -0,0 +1,299 @@
+
+This file describes differences between PEP 3101 and the C implementation
+in this directory, and describes the reasoning behind the differences.
+
+PEP 3101 is a well-thought-out, excellent starting point for advanced string
+formatting, but as one might expect, there are a few gaps in it which were
+not noticed until implementation, and there are almost certainly gaps in
+the implementation which will not be noticed until the code is widely used.
+Fortunately, the schedules for both Python 2.6 and Python 3.0 have enough
+slack in them that if we work diligently, we can widely distribute a working
+implementation, not just a theoretical document, well in advance of the code
+freeze dates.  This should allow for a robust discussion about the merits or
+drawbacks of some of the fine points of the PEP and the implementation by
+people who are actually **using** the code.
+
+This nice schedule has made at least one of the implementers bold enough
+to consider the first cut of the implementation "experimental" in the sense
+that, since there is time to correct any problems, the implementation can
+diverge from the PEP (in documented ways!) both for perceived flaws in
+the PEP, and also to add minor enhancements.  The code is being structured
+so that it should be easy to subsequently modify the operation to conform
+to consensus opinion.
+
+
+GOALS:
+
+    Replace %
+
+The primary goal of the advanced string formatting is to replace the %
+operator, though not in a coercive fashion: the goal is to be good enough
+that nobody wants to use the % operator.
+
+
+    Modular design for subfunction reuse
+
+The PEP explicitly disclaims any attempt to replace string.Template,
+concentrating exclusively on the % operator.  While this narrow focus
+is very useful in removing things like conditionals and looping from
+the discussion about the PEP, it ignores the reality that it might
+be useful to REUSE some of the C implementation code (particularly
+the per-field formatting) in templating systems.  So the design of
+the implementation adds the goal of being able to expose some lower-
+level functions.
+
+
+    Efficiency
+
+It is not claimed that the initial implementation is particularly
+efficient, but it is desirable to tweak the specification in such
+a fashion that an efficient implementation IS possible.  Since the
+goal is to replace the % operator, it is particularly important
+that the formatting of small strings is not prohibitively expensive.
+
+
+    Security
+
+Security is a stated goal of the PEP, with an apparent goal of being
+able to accept a string from J. Random User and format it without
+potential adverse consequences.  This may or may not be an achievable
+goal; the PEP certainly has some features that should help with this,
+such as the restricted number of operators, and the implementation has
+some additional features, such as not allowing leading underscores
+on attributes by default, but these may be attempts to solve an
+intractable problem, similar to the original restricted Python
+execution mode.
+
+In any case, security is a goal, and anything reasonable we can do to
+support it should be done.  Unreasonable things to support security
+include things which would be very costly in terms of execution time,
+and things which rely on the by now very much discredited "security
+through obscurity" approach.
+
+
+    Older Python Versions
+
+Some of the implementers have very strong desires to use this formatting
+on older Python versions, and Guido has mentioned that any 3.0 features
+which do not break backward compatibility are potential candidates for
+inclusion in 2.6.
+
+
+    No global state
+
+The PEP states "The string formatting system has two error handling modes,
+which are controlled by the value of a class variable."  As has been
+discussed on the developers' list, this might be problematic, especially in
+large systems where components are being aggregated from multiple sources.
+One component might deliberately throw and catch exceptions in the string
+processing, and disabling this on a global basis might cause this component
+to stop working properly.  If the ability to control this on a global
+basis is desirable, it is easy enough to add in later, but if it is not
+desirable, then deciding that after the fact and changing the code could
+break code which has grown to rely on the feature.
+
+
+FORMATTING METADATA
+
+The basic desired operation of the PEP is to be able to write:
+
+ 'some format control string'.format(param1, param2, keyword1=whatever, ...)
+
+Unfortunately, there needs to be some mechanism to handle out of band
+data for some formatting and error handling options.  This could
+be really costly, if multiple options are looked up in the **keywords
+on every single call on even short strings, so some tweaks on the
+initial implementation are designed to reduce the overhead of looking
+up metadata.  Two techniques are used:
+
+    1) Lazy evaluation where possible.  For example, the code does not
+       need to look up error-handling options until an error occurs.
+
+    2) Metadata embedded in the string where appropriate.  This
+       saves a dictionary lookup on every call.  However, this
+       is only appropriate when (a) the metadata arguably relates
+       to the actual control string and not the function where it
+       is being used; and (b) there are no security implications.
+
+
+DIFFERENCES BETWEEN PEP AND INITIAL IMPLEMENTATION:
+
+    Support for old Python versions
+
+The original PEP is Python 3000 only, which implies a lack of regular
+string support (unicode only).  To make the code compatible with 2.6,
+it has been written to support regular strings as well, and to make
+the code compatible with earlier versions, it has been written to be
+usable as an extension module as well as, or instead of, a string method:
+
+                from pep3101 import format
+                format('control string', parameter1, ...)
+
+
+    format_item function
+
+A large portion of the code in the new advanced formatter is the code
+which formats a single field according to the given format specifier.
+(Thanks, Eric!)  This code is useful on its own, especially for template
+systems or other custom formatting solutions.  The initial implementation
+will have a format_item function which takes a format specifier and a
+single object and returns a formatted result for that object and specifier.
+
+
+    comments
+
+The PEP does not have a mechanism for comments embedded in the format
+strings.  The usefulness of comments inside format strings may be
+debatable, but the feature is easy to implement and easy to understand:
+
+                {#This is a comment}
+
+
+    errors and exceptions
+
+The PEP defines a global flag for "strict" or "lenient" mode.  The
+implementation eschews the use of a global flag (see more information
+in the goals section, above), and splits out the various error
+features discussed by the PEP into different options.  It also adds
+an option.
+
+The first error option is controlled by the optional _leading_underscores
+keyword argument.  If this is present and evaluates non-zero, then leading
+underscores are allowed on identifiers and attributes in the format string.
+The implementation will lazily look for this argument the first time it
+encounters a leading underscore.
+
+The next error option is controlled by metadata embedded in the string.
+If "{!useall}" appears in the string, then a check is made that all
+arguments are converted.  The decision to embed this metadata in the
+string can certainly be changed later; the reasons for doing it this
+way in the initial implementation are as follows:
+
+      1) In the original % operator, the error reporting that an
+         extra argument is present is orthogonal to the error reporting
+         that not enough arguments are present.  Both these errors are
+         easy to commit, because it is hard to count arguments and %s,
+         etc.  In theory, the new string formatting should make it easier
+         to get the arguments right, because all arguments in the format
+         string are numbered or even named.
+
+      2) It is arguably not Pythonic to check that all arguments to
+         a function are actually used by the execution of the function,
+         and format() is, after all, just another function.  So it seems
+         that the default should be to not check that all the arguments
+         are used.  In fact, there are similar reasons for not using
+         all the arguments here as with any other function.  For example,
+         for customization, the format method of a string might be called
+         with a superset of all the information which might be useful to
+         view.
+
+      3) Assuming that the normal case is to not check all arguments,
+         it is much cheaper (especially for small strings) to notice
+         the {! and process the metadata in the strings that want it
+         than it is to look for a keyword argument for every string.
+
+XXX -- need to add info on displaying exceptions in string vs. passing
+them up for looked-up errors.  Also adding or not of string position
+information.
+
+
+    Getattr and getindex rely on underlying object exceptions
+
+For attribute and index lookup, the PEP specifies that digits will be
+treated as numeric values, and non-digits should be valid Python
+identifiers.  The implementation does not rigorously enforce this,
+instead deferring to the object's getattr or getindex to throw an
+exception for an invalid lookup.  The only time this is not true
+is for leading underscores, which are disallowed by default.
+
+
+    User-defined Python format function
+
+The PEP specifies that an additional string method, cformat, can be
+used to call the same formatting machinery, but with a "hook" function
+that can intercept formatting on a per-field basis.
+
+The implementation does not have an additional cformat function/method.
+Instead, user format hooks are accomplished as follows:
+
+        1) A format hook function, with call signature and semantics
+           as described in the PEP, may be passed to format() as the
+           keyword argument _hook.  This argument will be lazily evaluated
+           the first time it is needed.
+
+        2) If "{!hook}" appears in the string, then the hook function
+           will be called on every single format field.
+
+        3) If the last character (the type specifier) in a format field
+           is "h" (for hook) then the hook function will be called for
+           that field, even if "{!hook}" has not been specified.
+
+
+    User-specified dictionary
+
+The call machinery to deal with keyword arguments is quite expensive,
+especially for large numbers of arguments.  For this reason, the
+implementation supports the ability to pass in a dictionary as the
+_dict argument.  The _dict argument will be lazily retrieved the first
+time the template requests a named parameter which was not passed
+in as a keyword argument.
+
+
+    Name mapping
+
+To support the user-specified dictionary, a name mapper will first
+look up names in the passed keyword arguments, then in the passed
+_dict (if any).
+
+
+    Automatic locals/globals lookup
+
+This is likely to be a contentious feature, but it seems quite useful,
+so in it goes for the initial implementation.  For security reasons,
+this happens only if format() is called with no parameters.  Since
+the whole purpose of format() is to apply parameters to a string,
+a call to format() without any parameters would otherwise be a
+silly thing to do.  We can turn this degenerate case into something
+useful by using the caller's locals and globals.  An example from
+Ian Bicking:
+
+            assert x < 3, "x has the value of {x} (should be < 3)".format()
+
+
+    Syntax modes
+
+The PEP correctly notes that the mechanism used to delineate markup
+vs. text is likely to be one of the most controversial features,
+and gives reasons why the chosen mechanism is better than others.
+
+The chosen mechanism is quite readable and reasonable, but different
+problem domains might have differing requirements.  For example,
+C code generated using the current mechanism could get quite ugly
+with a large number of "{" and "}" characters.
+
+The initial implementation supports the notion of different syntax
+modes.  This is bad from the "more than one way to do it" perspective,
+but is not quite so bad if the template itself has to indicate if it
+is not using the default mechanism.  To give reviewers an idea of
+how this could work, the implementation supports 4 different modes:
+
+        "{!syntax0}"   -- the mode as described in the PEP
+        "{!syntax1}"   -- same as mode 0, except close-braces
+                          do not need to be doubled
+        "{!syntax2}"   -- Uses "${" for escape to markup, "$${" for
+                          literal "${"
+        "{!syntax3}"   -- Like syntax0, uses "{" for escape to markup,
+                          except literal "{" is denoted by "{ "
+                          or "{\n" (where the space is removed but
+                          the newline isn't).
+
+
+    Syntax for metadata in strings
+
+There have been several examples in this document of metadata
+embedded inside strings, for "hook", "useall", and "syntax".
+
+The basic metadata syntax is "{!<keyword>}".  However, to allow
+more readable templates, if the "}" of a metadata field is immediately
+followed by "\n" or "\r\n", this whitespace will not appear in
+the formatted output.
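
The metadata rule in the final section of the file above can be sketched
in Python.  This is only an illustration and is not part of the committed
file or of the C implementation: the helper name split_metadata and the
regular expression are assumptions, and the sketch ignores doubled braces
and the alternate syntax modes.

```python
import re

# Hypothetical model of the "{!<keyword>}" rule: a metadata field, plus one
# immediately following "\n" or "\r\n", is removed from the output.
_METADATA = re.compile(r"\{!(\w+)\}(?:\r?\n)?")


def split_metadata(template):
    """Return (keywords, template with metadata fields removed)."""
    keywords = _METADATA.findall(template)
    stripped = _METADATA.sub("", template)
    return keywords, stripped


keywords, body = split_metadata("{!useall}\nHello, {0}!\n")
print(keywords)    # ['useall']
print(repr(body))  # 'Hello, {0}!\n'
```

Only the newline directly after the closing "}" is swallowed; the newline
at the end of the template is ordinary text and is kept.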