[Python-Dev] Bad interaction of __index__ and sequence repeat
ncoghlan at gmail.com
Sat Jul 29 16:06:53 CEST 2006
Armin Rigo wrote:
> There is an oversight in the design of __index__() that only just
> surfaced :-( It is responsible for the following behavior, on a 32-bit
> machine with >= 2GB of RAM:
> >>> s = 'x' * (2**100) # works!
> >>> len(s)
> 2147483647
> This is because PySequence_Repeat(v, w) works by applying w.__index__ in
> order to call v->sq_repeat. However, __index__ is defined to clip the
> result to fit in a Py_ssize_t. This means that the above problem exists
> with all sequences, not just strings, given enough RAM to create such
> sequences with 2147483647 items.
> For reference, in 2.4 we correctly get an OverflowError.
> Argh! What should be done about it?
I've now got a patch on SF that aims to fix this properly.
The gist of the patch:
1. Redesign the PyNumber_Index C API to serve the actual use cases in the
interpreter core and the standard library.
The PEP 357 abstract C API as written was bypassed by nearly all of the
uses in the core and the standard library. The patch redesigns that API to
reduce code duplication between the various parts of the code base that were
previously calling nb_index directly.
The principal change is to provide an "is_index" output variable that the
various mp_subscript implementations can use to determine whether the
passed-in object was an index, rather than having to repeat the type
check everywhere. The rationale for doing it this way:
a. you may want to try something else (e.g. the mp_subscript
implementations in the standard library try indexing before checking to see if
the object is a slice object)
b. a different error message may be wanted (e.g. the normal indexing-related
TypeError doesn't make sense for sequence repetition)
c. a separate checking function would lead to repeating the check on common
code paths (e.g. if an mp_subscript implementation did the type check first,
and then PyNumber_Check did it again to see whether or not to raise an error)
The output variable solves the problem nicely: either pass in NULL to get
the default behaviour of raising a sequence indexing TypeError, or pass in a
pointer to a C int in order to be told whether or not the typecheck succeeded
without an exception actually being set if it fails (if the typecheck passes,
but the actual call fails, the exception state is set as normal).
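To make the output-variable idea concrete, here is a rough Python model of it. This is only a sketch: number_index, report_type and subscript are names made up for illustration, and the C API in the patch passes a pointer to a C int rather than returning a tuple.

```python
def number_index(obj, report_type=False):
    """Hypothetical model of the proposed API; returns (value, is_index).

    report_type=False plays the role of passing NULL: a failed type check
    raises the default indexing TypeError. report_type=True plays the role
    of passing a pointer: a failed type check is reported, not raised.
    """
    index_method = getattr(type(obj), "__index__", None)
    if index_method is None:
        if report_type:
            return None, False
        raise TypeError("sequence index must be an integer")
    # The type check passed; errors raised by the call itself propagate.
    return index_method(obj), True


def subscript(items, key):
    """Model of an mp_subscript slot: try indexing first, then slices."""
    value, is_index = number_index(key, report_type=True)
    if is_index:
        return items[value]
    if isinstance(key, slice):
        return items[key]
    raise TypeError("indices must be integers or slices")
```

The caller does the type check exactly once and still gets to choose its own fallback and its own error message, which is the point of (a) through (c) above.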
Additionally, PyNumber_Index is redefined to raise an IndexError for values
that cannot be represented as a Py_ssize_t. The choice of IndexError was made
based on the dominant usage in the standard library (IndexError is the correct
error to raise so that an mp_subscript implementation does the right thing).
There are only a few places that need to override the IndexError to replace it
with OverflowError (the length argument to slice.indices, sequence repetition,
the mmap constructor), whereas all of the sequence objects (list, tuple,
string, unicode, array), as well as PyObject_Get/Set/DelItem, need it to raise
IndexError.
Raising IndexError also benefits sequences implemented in Python, which can
write:
    def __getitem__(self, idx):
        if isinstance(idx, slice):
            return self._getslice(idx)  # hypothetical helper for the slice case
        idx = operator.index(idx)  # Will raise IndexError on overflow
A second API function, PyNumber_SliceIndex, is added so that the clipping
semantics are still available where needed, and _PyEval_SliceIndex is modified
to use the new public API. This is exposed to Python code as
With the redesigned C API, the *only* code that calls the nb_index slot
directly is the two functions in abstract.c. Everything else uses one or the
other of those interfaces. Code duplication was significantly reduced as a
result, and it should be much easier for 3rd party C libraries to do what they
need to do (i.e. implementing mp_subscript slots).
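The difference between the two interfaces can be modelled in a few lines of Python. This is a sketch under assumptions: index_strict and index_clipped are illustrative names standing in for PyNumber_Index and PyNumber_SliceIndex, and sys.maxsize is used as the Py_ssize_t limit of the running interpreter.

```python
import operator
import sys

SSIZE_MAX = sys.maxsize        # largest value a Py_ssize_t can hold here
SSIZE_MIN = -sys.maxsize - 1   # smallest value a Py_ssize_t can hold here


def index_strict(obj):
    """Model of the strict conversion: out-of-range values are an error."""
    value = operator.index(obj)  # TypeError if obj has no __index__
    if not SSIZE_MIN <= value <= SSIZE_MAX:
        raise IndexError("index does not fit in Py_ssize_t")
    return value


def index_clipped(obj):
    """Model of the slice conversion: out-of-range values are clamped."""
    value = operator.index(obj)
    return max(SSIZE_MIN, min(SSIZE_MAX, value))
```

Clipping is what slicing wants ("abc"[:2**100] is just "abc"), while plain indexing and repetition need to hear about the overflow.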
2. Redefine nb_index to return a PyObject *
Returning the PyInt/PyLong objects directly from nb_index greatly
simplified the implementation of the nb_index methods for the affected
classes. For classic classes, instance_index could be modified to simply
return the result of calling __index__, as could slot_nb_index for new-style
classes. For the standard library classes, the existing int_int and
long_long methods could be used instead of needing new functions.
This convenience should hold true for extension classes - instead of
needing to implement __index__ separately, they should be able to reuse their
existing __int__ or __long__ implementations.
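The Python-level analogue of that reuse looks something like the following. Offset is a made-up class for illustration, and the sketch assumes the wrapped value is already a true integer, since __index__ should not be used to truncate non-integers.

```python
class Offset:
    """Hypothetical integer-like wrapper that reuses __int__ for __index__."""

    def __init__(self, value):
        self._value = value  # assumed to be an exact integer

    def __int__(self):
        return int(self._value)

    # Returning the integer object directly means the same method serves
    # both conversions; no separate range-limited implementation is needed.
    __index__ = __int__
```

With that in place, "abcdef"[Offset(2)] indexes as usual, and Offset(2**100).__index__() hands back the full long untouched.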
The other benefit is that the logic to downconvert to Py_ssize_t that was
formerly invoked by long's __index__ method is now instead invoked by
PyNumber_Index and PyNumber_SliceIndex. This means that directly calling an
__index__() method allows large long results to be passed through unaffected,
but calling the indexing operator will raise IndexError if the long is outside
the memory address space:
    (2 ** 100).__index__() == (2**100)  # This works
    operator.index(2**100)  # This raises IndexError
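The same split between "pass the long through" and "range-check at the point of use" can be probed from Python without touching the C API. This is a sketch, assuming a CPython where the consuming operation does the range check; it deliberately avoids operator.index, since what that function does with out-of-range values depends on whether this patch's semantics are in effect. error_name is a helper invented for the example.

```python
def error_name(operation):
    """Run a zero-argument callable and return the raised exception's name."""
    try:
        operation()
    except Exception as exc:
        return type(exc).__name__
    return None


huge = 2 ** 100

# Calling the method directly leaves the large value unaffected.
passes_through = huge.__index__() == huge

# Indexing is where IndexError is the right answer.
index_error = error_name(lambda: [1, 2, 3][huge])

# Sequence repetition is one of the places that wants OverflowError instead.
repeat_error = error_name(lambda: [0] * huge)
```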
The patch includes additions to test_index.py to cover these limit cases, as
well as the necessary updates to the C API and operator module documentation.
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia