[Python-checkins] r55053 - peps/trunk/pep-0000.txt peps/trunk/pep-3127.txt peps/trunk/pep-3128.txt

guido.van.rossum python-checkins at python.org
Tue May 1 18:57:11 CEST 2007


Author: guido.van.rossum
Date: Tue May  1 18:57:09 2007
New Revision: 55053

Added:
   peps/trunk/pep-3127.txt   (contents, props changed)
   peps/trunk/pep-3128.txt   (contents, props changed)
Modified:
   peps/trunk/pep-0000.txt
Log:
PEP 3127: Integer Literal Support and Syntax           Maupin
PEP 3128: BList: A Faster List-like Type               Stutzbach


Modified: peps/trunk/pep-0000.txt
==============================================================================
--- peps/trunk/pep-0000.txt	(original)
+++ peps/trunk/pep-0000.txt	Tue May  1 18:57:09 2007
@@ -126,6 +126,7 @@
  S  3125  Remove Backslash Continuation                Jewett
  S  3126  Remove Implicit String Concatenation         Jewett
  S  3127  Integer Literal Support and Syntax           Maupin
+ S  3128  BList: A Faster List-like Type               Stutzbach
  S  3141  A Type Hierarchy for Numbers                 Yasskin
 
  Finished PEPs (done, implemented in Subversion)
@@ -494,6 +495,7 @@
  S  3125  Remove Backslash Continuation                Jewett
  S  3126  Remove Implicit String Concatenation         Jewett
  S  3127  Integer Literal Support and Syntax           Maupin
+ S  3128  BList: A Faster List-like Type               Stutzbach
  S  3141  A Type Hierarchy for Numbers                 Yasskin
 
 

Added: peps/trunk/pep-3127.txt
==============================================================================
--- (empty file)
+++ peps/trunk/pep-3127.txt	Tue May  1 18:57:09 2007
@@ -0,0 +1,518 @@
+PEP: 3127
+Title: Integer Literal Support and Syntax
+Version: $Revision$
+Last-Modified: $Date$
+Author: Patrick Maupin <pmaupin at gmail.com>
+Discussions-To: Python-3000 at python.org
+Status: Draft
+Type: Standards Track
+Python-Version: 3.0
+Content-Type: text/x-rst
+Created: 14-Mar-2007
+Post-History: 18-Mar-2007
+
+
+Abstract
+========
+
+This PEP proposes changes to the Python core to rationalize
+the treatment of string literal representations of integers
+in different radices (bases).  These changes are targeted at
+Python 3.0, but the backward-compatible parts of the changes
+should be added to Python 2.6, so that all valid 3.0 integer
+literals will also be valid in 2.6.
+
+The proposal is that:
+
+   a) octal literals must now be specified
+      with a leading "0o" or "0O" instead of "0";
+
+   b) binary literals are now supported via a
+      leading "0b" or "0B"; and
+
+   c) provision will be made for binary numbers in
+      string formatting.
+
+
+Motivation
+==========
+
+This PEP was motivated by two different issues:
+
+    - The default octal representation of integers is silently confusing
+      to people unfamiliar with C-like languages.  It is extremely easy
+      to inadvertently create an integer object with the wrong value,
+      because '013' means 'decimal 11', not 'decimal 13', to the Python
+      language itself, which is not the meaning that most humans would
+      assign to this literal.
+
+    - Some Python users have a strong desire for binary support in
+      the language.
+
+
+Specification
+=============
+
+Grammar specification
+---------------------
+
+The grammar will be changed.  For Python 2.6, the changed and
+new token definitions will be::
+
+     integer        ::=     decimalinteger | octinteger | hexinteger |
+                            bininteger | oldoctinteger
+
+     octinteger     ::=     "0" ("o" | "O") octdigit+
+
+     bininteger     ::=     "0" ("b" | "B") bindigit+
+
+     oldoctinteger  ::=     "0" octdigit+
+
+     bindigit       ::=     "0" | "1"
+
+For Python 3.0, "oldoctinteger" will not be supported, and
+an exception will be raised if a literal has a leading "0" and
+a second character which is a digit.
+
+For both versions, this will require changes to PyLong_FromString
+as well as the grammar.
+
+The documentation will have to be changed as well:  grammar.txt,
+as well as the integer literal section of the reference manual.
+
+PEP 306 should be checked for other issues, and that PEP should
+be updated if the procedure described therein is insufficient.
+
+int() specification
+--------------------
+
+int(s, 0) will also match the new grammar definition.
+
+This should happen automatically with the changes to
+PyLong_FromString required for the grammar change.
+
+Also the documentation for int() should be changed to explain
+that int(s) operates identically to int(s, 10), and the word
+"guess" should be removed from the description of int(s, 0).
+
+long() specification
+--------------------
+
+For Python 2.6, the long() implementation and documentation
+should be changed to reflect the new grammar.
+
+Tokenizer exception handling
+----------------------------
+
+If an invalid token contains a leading "0", the exception
+error message should be more informative than the current
+"SyntaxError: invalid token".  It should explain that decimal
+numbers may not have a leading zero, and that octal numbers
+require an "o" after the leading zero.
+
+int() exception handling
+------------------------
+
+The ValueError raised for any call to int() with a string
+should at least explicitly contain the base in the error
+message, e.g.::
+
+    ValueError: invalid literal for base 8 int(): 09
+
+oct() function
+---------------
+
+oct() should be updated to output '0o' in front of
+the octal digits (for 3.0, and 2.6 compatibility mode).
+
+Output formatting
+-----------------
+
+The string (and unicode in 2.6) % operator will have
+'b' format specifier added for binary, and the alternate
+syntax of the 'o' option will need to be updated to
+add '0o' in front, instead of '0'.
+
+PEP 3101 already supports 'b' for binary output.
+
+
+Transition from 2.6 to 3.0
+---------------------------
+
+The 2to3 translator will have to insert 'o' into any
+octal string literal.
+
+The Py3K compatible option to Python 2.6 should cause
+attempts to use oldoctinteger literals to raise an
+exception.
+
+
+Rationale
+=========
+
+Most of the discussion on these issues occurred on the Python-3000
+mailing list starting 14-Mar-2007, prompted by an observation that
+the average human being would be completely mystified upon finding
+that prepending a "0" to a string of digits changes the meaning of
+that digit string entirely.
+
+It was pointed out during this discussion that a similar, but shorter,
+discussion on the subject occurred in January of 2006, prompted by a
+discovery of the same issue.
+
+Background
+----------
+
+For historical reasons, Python's string representation of integers
+in different bases (radices), for string formatting and token
+literals, borrows heavily from C.  [1]_ [2]_ Usage has shown that
+the historical method of specifying an octal number is confusing,
+and also that it would be nice to have additional support for binary
+literals.
+
+Throughout this document, unless otherwise noted, discussions about
+the string representation of integers relate to these features:
+
+    - Literal integer tokens, as used by normal module compilation,
+      by eval(), and by int(token, 0).  (int(token) and int(token, 2-36)
+      are not modified by this proposal.)
+
+           * Under 2.6, long() is treated the same as int()
+
+    - Formatting of integers into strings, either via the % string
+      operator or the new PEP 3101 advanced string formatting method.
+
+It is presumed that:
+
+    - All of these features should have an identical set
+      of supported radices, for consistency.
+
+    - Python source code syntax and int(mystring, 0) should
+      continue to share identical behavior.
+
+
+Removal of old octal syntax
+----------------------------
+
+This PEP proposes that the ability to specify an octal number by
+using a leading zero will be removed from the language in Python 3.0
+(and the Python 3.0 preview mode of 2.6), and that a SyntaxError will
+be raised whenever a leading "0" is immediately followed by another
+digit.
+
+During the present discussion, it was almost universally agreed that::
+
+    eval('010') == 8
+
+should no longer be true, because that is confusing to new users.
+It was also proposed that::
+
+    eval('0010') == 10
+
+should become true, but that is much more contentious, because it is so
+inconsistent with usage in other computer languages that mistakes are
+likely to be made.
+
+Almost all currently popular computer languages, including C/C++,
+Java, Perl, and JavaScript, treat a sequence of digits with a
+leading zero as an octal number.  Proponents of treating these
+numbers as decimal instead have a very valid point -- as discussed
+in `Supported radices`_, below, the entire non-computer world uses
+decimal numbers almost exclusively.  There is ample anecdotal
+evidence that many people are dismayed and confused if they
+are confronted with non-decimal radices.
+
+However, in most situations, most people do not write gratuitous
+zeros in front of their decimal numbers.  The primary exception is
+when an attempt is being made to line up columns of numbers.  But
+since PEP 8 specifically discourages the use of spaces to try to
+align Python code, one would suspect the same argument should apply
+to the use of leading zeros for the same purpose.
+
+Finally, although the email discussion often focused on whether anybody
+actually *uses* octal any more, and whether we should cater to those
+old-timers in any case, that is almost entirely besides the point.
+
+Assume the rare complete newcomer to computing who *does*, either
+occasionally or as a matter of habit, use leading zeros for decimal
+numbers.  Python could either:
+
+    a) silently do the wrong thing with his numbers, as it does now;
+
+    b) immediately disabuse him of the notion that this is viable syntax
+       (and yes, the SyntaxWarning should be more gentle than it
+       currently is, but that is a subject for a different PEP); or
+
+    c) let him continue to think that computers are happy with
+       multi-digit decimal integers which start with "0".
+
+Some people passionately believe that (c) is the correct answer,
+and they would be absolutely right if we could be sure that new
+users will never blossom and grow and start writing AJAX applications.
+
+So while a new Python user may (currently) be mystified at the
+delayed discovery that his numbers don't work properly, we can
+fix it by explaining to him immediately that Python doesn't like
+leading zeros (hopefully with a reasonable message!), or we can
+delegate this teaching experience to the JavaScript interpreter
+in the Internet Explorer browser, and let him try to debug his
+issue there.
+
+Supported radices
+-----------------
+
+This PEP proposes that the supported radices for the Python
+language will be 2, 8, 10, and 16.
+
+Once it is agreed that the old syntax for octal (radix 8) representation
+of integers must be removed from the language, the next obvious
+question is "Do we actually need a way to specify (and display)
+numbers in octal?"
+
+This question is quickly followed by "What radices does the language
+need to support?"  Because computers are so adept at doing what you
+tell them to, a tempting answer in the discussion was "all of them."
+This answer has obviously been given before -- the int() constructor
+will accept an explicit radix with a value between 2 and 36, inclusive,
+with the latter number bearing a suspicious arithmetic similarity to
+the sum of the number of numeric digits and the number of same-case
+letters in the ASCII alphabet.
+
+But the best argument for inclusion will have a use-case to back
+it up, so the idea of supporting all radices was quickly rejected,
+and the only radices left with any real support were decimal,
+hexadecimal, octal, and binary.
+
+Just because a particular radix has a vocal supporter on the
+mailing list does not mean that it really should be in the
+language, so the rest of this section is a treatise on the
+utility of these particular radices, vs. other possible choices.
+
+Humans use other numeric bases constantly.  If I tell you that
+it is 12:30 PM, I have communicated quantitative information
+arguably composed of *three* separate bases (12, 60, and 2),
+only one of which is in the "agreed" list above.  But the
+*communication* of that information used two decimal digits
+each for the base 12 and base 60 information, and, perversely,
+two letters for information which could have fit in a single
+decimal digit.
+
+So, in general, humans communicate "normal" (non-computer)
+numerical information either via names (AM, PM, January, ...)
+or via use of decimal notation.  Obviously, names are
+seldom used for large sets of items, so decimal is used for
+everything else.  There are studies which attempt to explain
+why this is so, typically reaching the expected conclusion
+that the Arabic numeral system is well-suited to human
+cognition. [3]_
+
+There is even support in the history of the design of
+computers to indicate that decimal notation is the correct
+way for computers to communicate with humans.  One of
+the first modern computers, ENIAC [4]_ computed in decimal,
+even though there were already existing computers which
+operated in binary.
+
+Decimal computer operation was important enough
+that many computers, including the ubiquitous PC, have
+instructions designed to operate on "binary coded decimal"
+(BCD) [5]_ , a representation which devotes 4 bits to each
+decimal digit.  These instructions date from a time when the
+most strenuous calculations ever performed on many numbers
+were the calculations actually required to perform textual
+I/O with them.  It is possible to display BCD without having
+to perform a divide/remainder operation on every displayed
+digit, and this was a huge computational win when most
+hardware didn't have fast divide capability.  Another factor
+contributing to the use of BCD is that, with BCD calculations,
+rounding will happen exactly the same way that a human would
+do it, so BCD is still sometimes used in fields like finance,
+despite the computational and storage superiority of binary.
+
+So, if it weren't for the fact that computers themselves
+normally use binary for efficient computation and data
+storage, string representations of integers would probably
+always be in decimal.
+
+Unfortunately, computer hardware doesn't think like humans,
+so programmers and hardware engineers must often resort to
+thinking like the computer, which means that it is important
+for Python to have the ability to communicate binary data
+in a form that is understandable to humans.
+
+The requirement that the binary data notation must be cognitively
+easy for humans to process means that it should contain an integral
+number of binary digits (bits) per symbol, while otherwise
+conforming quite closely to the standard tried-and-true decimal
+notation (position indicates power, larger magnitude on the left,
+not too many symbols in the alphabet, etc.).
+
+The obvious "sweet spot" for this binary data notation is
+thus octal, which packs the largest integral number of bits
+possible into a single symbol chosen from the Arabic numeral
+alphabet.
+
+In fact, some computer architectures, such as the PDP8 and the
+8080/Z80, were defined in terms of octal, in the sense of arranging
+the bitfields of instructions in groups of three, and using
+octal representations to describe the instruction set.
+
+Even today, octal is important because of bit-packed structures
+which consist of 3 bits per field, such as Unix file permission
+masks.
+
+But octal has a drawback when used for larger numbers.  The
+number of bits per symbol, while integral, is not itself
+a power of two.  This limitation (given that the word size
+of most computers these days is a power of two) has resulted
+in hexadecimal, which is more popular than octal despite the
+fact that it requires a 60% larger alphabet than decimal,
+because each symbol contains 4 bits.
+
+Some numbers, such as Unix file permission masks, are easily
+decoded by humans when represented in octal, but difficult to
+decode in hexadecimal, while other numbers are much easier for
+humans to handle in hexadecimal.
+
+Unfortunately, there are also binary numbers used in computers
+which are not very well communicated in either hexadecimal or
+octal. Thankfully, fewer people have to deal with these on a
+regular basis, but on the other hand, this means that several
+people on the discussion list questioned the wisdom of adding
+a straight binary representation to Python.
+
+One example of where these numbers is very useful is in
+reading and writing hardware registers.  Sometimes hardware
+designers will eschew human readability and opt for address
+space efficiency, by packing multiple bit fields into a single
+hardware register at unaligned bit locations, and it is tedious
+and error-prone for a human to reconstruct a 5 bit field which
+consists of the upper 3 bits of one hex digit, and the lower 2
+bits of the next hex digit.
+
+Even if the ability of Python to communicate binary information
+to humans is only useful for a small technical subset of the
+population, it is exactly that population subset which contains
+most, if not all, members of the Python core team, so even straight
+binary, the least useful of these notations, has several enthusiastic
+supporters and few, if any, staunch opponents, among the Python community.
+
+Syntax for supported radices
+-----------------------------
+
+This proposal is to to use a "0o" prefix with either uppercase
+or lowercase "o" for octal, and a "0b" prefix with either
+uppercase or lowercase "b" for binary.
+
+There was strong support for not supporting uppercase, but
+this is a separate subject for a different PEP, as 'j' for
+complex numbers, 'e' for exponent, and 'r' for raw string
+(to name a few) already support uppercase.
+
+The syntax for delimiting the different radices received a lot of
+attention in the discussion on Python-3000.  There are several
+(sometimes conflicting) requirements and "nice-to-haves" for
+this syntax:
+
+    - It should be as compatible with other languages and
+      previous versions of Python as is reasonable, both
+      for the input syntax and for the output (e.g. string
+      % operator) syntax.
+
+    - It should be as obvious to the casual observer as
+      possible.
+
+    - It should be easy to visually distinguish integers
+      formatted in the different bases.
+
+
+Proposed syntaxes included things like arbitrary radix prefixes,
+such as 16r100 (256 in hexadecimal), and radix suffixes, similar
+to the 100h assembler-style suffix.  The debate on whether the
+letter "O" could be used for octal was intense -- an uppercase
+"O" looks suspiciously similar to a zero in some fonts.  Suggestions
+were made to use a "c" (the second letter of "oCtal"), or even
+to use a "t" for "ocTal" and an "n" for "biNary" to go along
+with the "x" for "heXadecimal".
+
+For the string % operator, "o" was already being used to denote
+octal, and "b" was not used for anything, so this works out
+much better than, for example, using "c" (which means "character"
+for the % operator).
+
+At the end of the day, since uppercase "O" can look like a zero
+and uppercase "B" can look like an 8, it was decided that these
+prefixes should be lowercase only, but, like 'r' for raw string,
+that can be a preference or style-guide issue.
+
+Open Issues
+===========
+
+It was suggested in the discussion that lowercase should be used
+for all numeric and string special modifiers, such as 'x' for
+hexadecimal, 'r' for raw strings, 'e' for exponentiation, and
+'j' for complex numbers.  This is an issue for a separate PEP.
+
+This PEP takes no position on uppercase or lowercase for input,
+just noting that, for consistency, if uppercase is not to be
+removed from input parsing for other letters, it should be
+added for octal and binary, and documenting the changes under
+this assumption, as there is not yet a PEP about the case issue.
+
+Output formatting may be a different story -- there is already
+ample precedence for case sensitivity in the output format string,
+and there would need to be a consensus that there is a valid
+use-case for the "alternate form" of the string % operator
+to support uppercase 'B' or 'O' characters for binary or
+octal output.  Currently, PEP3101 does not even support this
+alternate capability, and the hex() function does not allow
+the programmer to specify the case of the 'x' character.
+
+There are still some strong feelings that '0123' should be
+allowed as a literal decimal in Python 3.0.  If this is the
+right thing to do, this can easily be covered in an additional
+PEP.  This proposal only takes the first step of making '0123'
+not be a valid octal number, for reasons covered in the rationale.
+
+Is there (or should there be) an option for the 2to3 translator
+which only makes the 2.6 compatible changes?  Should this be
+run on 2.6 library code before the 2.6 release?
+
+Should a bin() function which matches hex() and oct() be added?
+
+Is hex() really that useful once we have advanced string formatting?
+
+
+References
+==========
+
+.. [1] GNU libc manual printf integer format conversions
+   (http://www.gnu.org/software/libc/manual/html_node/Integer-Conversions.html)
+
+.. [2] Python string formatting operations
+   (http://docs.python.org/lib/typesseq-strings.html)
+
+.. [3] The Representation of Numbers, Jiajie Zhang and Donald A. Norman
+    (http://acad88.sahs.uth.tmc.edu/research/publications/Number-Representation.pdf)
+
+.. [4] ENIAC page at wikipedia
+    (http://en.wikipedia.org/wiki/ENIAC)
+
+.. [5] BCD page at wikipedia
+    (http://en.wikipedia.org/wiki/Binary-coded_decimal)
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End:

Added: peps/trunk/pep-3128.txt
==============================================================================
--- (empty file)
+++ peps/trunk/pep-3128.txt	Tue May  1 18:57:09 2007
@@ -0,0 +1,356 @@
+PEP: 3128
+Title: BList: A Faster List-like Type
+Version: $Revision$
+Last-Modified: $Date$
+Author: Daniel Stutzbach <daniel at stutzbachenterprises.com>
+Discussions-To: Python 3000 List <python-3000 at python.org>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 30-Apr-2007
+Python-Version: 2.6 and/or 3.0
+Post-History: 30-Apr-2007
+
+
+Abstract
+========
+
+The common case for list operations is on small lists.  The current
+array-based list implementation excels at small lists due to the
+strong locality of reference and infrequency of memory allocation
+operations.  However, an array takes O(n) time to insert and delete
+elements, which can become problematic as the list gets large.
+
+This PEP introduces a new data type, the BList, that has array-like
+and tree-like aspects.  It enjoys the same good performance on small
+lists as the existing array-based implementation, but offers superior
+asymptotic performance for most operations.  This PEP proposes
+replacing the makes two mutually exclusive proposals for including the
+BList type in Python:
+
+1. Add it to the collections module, or
+2. Replace the existing list type
+
+
+Motivation
+==========
+
+The BList grew out of the frustration of needing to rewrite intuitive
+algorithms that worked fine for small inputs but took O(n**2) time for
+large inputs due to the underlying O(n) behavior of array-based lists.
+The deque type, introduced in Python 2.4, solved the most common
+problem of needing a fast FIFO queue.  However, the deque type doesn't
+help if we need to repeatedly insert or delete elements from the
+middle of a long list.
+
+A wide variety of data structure provide good asymptotic performance
+for insertions and deletions, but they either have O(n) performance
+for other operations (e.g., linked lists) or have inferior performance
+for small lists (e.g., binary trees and skip lists).
+
+The BList type proposed in this PEP is based on the principles of
+B+Trees, which have array-like and tree-like aspects.  The BList
+offers array-like performance on small lists, while offering O(log n)
+asymptotic performance for all insert and delete operations.
+Additionally, the BList implements copy-on-write under-the-hood, so
+even operations like getslice take O(log n) time.  The table below
+compares the asymptotic performance of the current array-based list
+implementation with the asymptotic performance of the BList.
+
+========= ================                     ====================
+Operation Array-based list                     BList
+========= ================                     ====================
+Copy      O(n)                                 **O(1)**
+Append    **O(1)**                             O(log n)
+Insert    O(n)                                 **O(log n)**
+Get Item  **O(1)**                             O(log n)
+Set Item  **O(1)**                             **O(log n)**
+Del Item  O(n)                                 **O(log n)**
+Iteration O(n)                                 O(n)
+Get Slice O(k)                                 **O(log n)**
+Del Slice O(n)                                 **O(log n)**
+Set Slice O(n+k)                               **O(log k + log n)**
+Extend    O(k)                                 **O(log k + log n)**
+Sort      O(n log n)                           O(n log n)
+Multiply  O(nk)                                **O(log k)**
+========= ================                     ====================
+
+An extensive empirical comparison of Python's array-based list and the
+BList are available at [2]_.
+
+Use Case Trade-offs
+===================
+
+The BList offers superior performance for many, but not all,
+operations.  Choosing the correct data type for a particular use case
+depends on which operations are used.  Choosing the correct data type
+as a built-in depends on balancing the importance of different use
+cases and the magnitude of the performance differences.
+
+For the common uses cases of small lists, the array-based list and the
+BList have similar performance characteristics.
+
+For the slightly less common case of large lists, there are two common
+uses cases where the existing array-based list outperforms the
+existing BList reference implementation.  These are:
+
+1. A large LIFO stack, where there are many .append() and .pop(-1)
+   operations.  Each operation is O(1) for an array-based list, but
+   O(log n) for the BList.
+
+2. A large list that does not change size.  The getitem and setitem
+   calls are O(1) for an array-based list, but O(log n) for the BList.
+
+In performance tests on a 10,000 element list, BLists exhibited a 50%
+and 5% increase in execution time for these two uses cases,
+respectively.
+
+The performance for the LIFO use case could be improved to O(n) time,
+by caching a pointer to the right-most leaf within the root node.  For
+lists that do not change size, the common case of sequential access
+could also be improved to O(n) time via caching in the root node.
+However, the performance of these approaches has not been empirically
+tested.
+
+Many operations exhibit a tremendous speed-up (O(n) to O(log n)) when
+switching from the array-based list to BLists.  In performance tests
+on a 10,000 element list, operations such as getslice, setslice, and
+FIFO-style insert and deletes on a BList take only 1% of the time
+needed on array-based lists.
+
+In light of the large performance speed-ups for many operations, the
+small performance costs for some operations will be worthwhile for
+many (but not all) applications.
+
+Implementation
+==============
+
+The BList is based on the B+Tree data structure.  The BList is a wide,
+bushy tree where each node contains an array of up to 128 pointers to
+its children.  If the node is a leaf, its children are the
+user-visible objects that the user has placed in the list.  If node is
+not a leaf, its children are other BList nodes that are not
+user-visible.  If the list contains only a few elements, they will all
+be a children of single node that is both the root and a leaf.  Since
+a node is little more than array of pointers, small lists operate in
+effectively the same way as an array-based data type and share the
+same good performance characteristics.
+
+The BList maintains a few invariants to ensure good (O(log n))
+asymptotic performance regardless of the sequence of insert and delete
+operations.  The principle invariants are as follows:
+
+1. Each node has at most 128 children.
+2. Each non-root node has at least 64 children.
+3. The root node has at least 2 children, unless the list contains
+   fewer than 2 elements.
+4. The tree is of uniform depth.
+
+If an insert would cause a node to exceed 128 children, the node
+spawns a sibling and transfers half of its children to the sibling.
+The sibling is inserted into the node's parent.  If the node is the
+root node (and thus has no parent), a new parent is created and the
+depth of the tree increases by one.
+
+If a deletion would cause a node to have fewer than 64 children, the
+node moves elements from one of its siblings if possible.  If both of
+its siblings also only have 64 children, then two of the nodes merge
+and the empty one is removed from its parent.  If the root node is
+reduced to only one child, its single child becomes the new root
+(i.e., the depth of the tree is reduced by one).
+
+In addition to tree-like asymptotic performance and array-like
+performance on small-lists, BLists support transparent
+**copy-on-write**.  If a non-root node needs to be copied (as part of
+a getslice, copy, setslice, etc.), the node is shared between multiple
+parents instead of being copied.  If it needs to be modified later, it
+will be copied at that time.  This is completely behind-the-scenes;
+from the user's point of view, the BList works just like a regular
+Python list.
+
+Memory Usage
+============
+
+In the worst case, the leaf nodes of a BList have only 64 children
+each, rather than a full 128, meaning that memory usage is around
+twice that of a best-case array implementation.  Non-leaf nodes use up
+a negligible amount of additional memory, since there are at least 63
+times as many leaf nodes as non-leaf nodes.
+
+The existing array-based list implementation must grow and shrink as
+items are added and removed.  To be efficient, it grows and shrinks
+only when the list has grow or shrunk exponentially.  In the worst
+case, it, too, uses twice as much memory as the best case.
+
+In summary, the BList's memory footprint is not significantly
+different from the existing array-based implementation.
+
+Backwards Compatibility
+=======================
+
+If the BList is added to the collections module, backwards
+compatibility is not an issue.  This section focuses on the option of
+replacing the existing array-based list with the BList.  For users of
+the Python interpreter, a BList has an identical interface to the
+current list-implementation.  For virtually all operations, the
+behavior is identical, aside from execution speed.
+
+For the C API, BList has a different interface than the existing
+list-implementation.  Due to its more complex structure, the BList
+does not lend itself well to poking and prodding by external sources.
+Thankfully, the existing list-implementation defines an API of
+functions and macros for accessing data from list objects.  Google
+Code Search suggests that the majority of third-party modules uses the
+well-defined API rather than relying on the list's structure
+directly.  The table below summarizes the search queries and results:
+
+======================== =================
+Search String            Number of Results
+======================== =================
+PyList_GetItem           2,000
+PySequence_GetItem         800
+PySequence_Fast_GET_ITEM   100
+PyList_GET_ITEM            400
+\[^a\-zA\-Z\_\]ob_item          100
+======================== =================
+
+
+This can be achieved in one of two ways:
+
+1. Redefine the various accessor functions and macros in listobject.h
+   to access a BList instead.  The interface would be unchanged.  The
+   functions can easily be redefined.  The macros need a bit more care
+   and would have to resort to function calls for large lists.
+
+   The macros would need to evaluate their arguments more than once,
+   which could be a problem if the arguments have side effects.  A
+   Google Code Search for "PyList_GET_ITEM\(\[^)\]+\(" found only a
+   handful of cases where this occurs, so the impact appears to be
+   low.
+
+   The few extension modules that use list's undocumented structure
+   directly, instead of using the API, would break.  The core code
+   itself uses the accessor macros fairly consistently and should be
+   easy to port.
+
+2. Deprecate the existing list type, but continue to include it.
+   Extension modules wishing to use the new BList type must do so
+   explicitly.  The BList C interface can be changed to match the
+   existing PyList interface so that a simple search-replace will be
+   sufficient for 99% of module writers.
+
+   Existing modules would continue to compile and work without change,
+   but they would need to make a deliberate (but small) effort to
+   migrate to the BList.
+
+   The downside of this approach is that mixing modules that use
+   BLists and array-based lists might lead to slow down if conversions
+   are frequently necessary.
+
+Reference Implementation
+========================
+
+A reference implementations of the BList is available for CPython at [1]_.
+
+The source package also includes a pure Python implementation,
+originally developed as a prototype for the CPython version.
+Naturally, the pure Python version is rather slow and the asymptotic
+improvements don't win out until the list is quite large.
+
+When compiled with Py_DEBUG, the C implementation checks the
+BList invariants when entering and exiting most functions.
+
+An extensive set of test cases is also included in the source package.
+The test cases include the existing Python sequence and list test
+cases as a subset.  When the interpreter is built with Py_DEBUG, the
+test cases also check for reference leaks.
+
+Porting to Other Python Variants
+--------------------------------
+
+If the BList is added to the collections module, other Python variants
+can support it in one of three ways:
+
+1. Make blist an alias for list.  The asymptotic performance won't be
+   as good, but it'll work.
+2. Use the pure Python reference implementation.  The performance for
+   small lists won't be as good, but it'll work.
+3. Port the reference implementation.
+
+Discussion
+==========
+
+This proposal has been discussed briefly on the Python-3000 mailing
+list [3]_.  Although a number of people favored the proposal, there
+were also some objections.  Below summarizes the pros and cons as
+observed by posters to the thread.
+
+General comments:
+
+- Pro: Will outperform the array-based list in most cases
+- Pro: "I've implemented variants of this ... a few different times"
+- Con: Desirability and performance in actual applications is unproven
+
+Comments on adding BList to the collections module:
+
+- Pro: Matching the list-API reduces the learning curve to near-zero
+- Pro: Useful for intermediate-level users; won't get in the way of beginners
+- Con: Proliferation of data types makes the choices for developers harder.
+
+Comments on replacing the array-based list with the BList:
+
+- Con: Impact on extension modules (addressed in `Backwards
+  Compatibility`_)
+- Con: The use cases where BLists are slower are important
+  (see `Use Case Trade-Offs`_ for how these might be addressed).
+- Con: The array-based list code is simple and easy to maintain
+
+To assess the desirability and performance in actual applications,
+Raymond Hettinger suggested releasing the BList as an extension module
+(now available at [1]_).  If it proves useful, he felt it would be a
+strong candidate for inclusion in 2.6 as part of the collections
+module.  If widely popular, then it could be considered for replacing
+the array-based list, but not otherwise.
+
+Guido van Rossum commented that he opposed the proliferation of data
+types, but favored replacing the array-based list if backwards
+compatibility could be addressed and the BList's performance was
+uniformly better.
+
+On-going Tasks
+==============
+
+- Reduce the memory footprint of small lists
+- Implement TimSort for BLists, so that best-case sorting is O(n)
+  instead of O(log n).
+- Implement __reversed__
+- Cache a pointer in the root to the rightmost leaf, to make LIFO
+  operation O(n) time.
+
+References
+==========
+
+.. [1] Reference Implementations for C and Python:
+   http://www.python.org/pypi/blist/
+
+.. [2] Empirical performance comparison between Python's array-based
+   list and the blist: http://stutzbachenterprises.com/blist/
+
+.. [3] Discussion on python-3000 starting at post:
+   http://mail.python.org/pipermail/python-3000/2007-April/006757.html
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End:


More information about the Python-checkins mailing list