[Python-3000] Making more effective use of slice objects in Py3k

Nick Coghlan ncoghlan at iinet.net.au
Sat Aug 26 11:12:27 CEST 2006


This idea is inspired by the find/rfind string discussion (particularly a 
couple of comments from Jim and Ron), but I think the applicability may prove 
to be wider than just string methods (e.g. I suspect it may prove useful for 
the bytes() type as well).

Copy-on-slice semantics are by far the easiest semantics to deal with in most 
cases, as they result in the fewest nasty surprises. However, they have one 
obvious drawback: performance can suffer badly when dealing with large 
datasets (copying 10 MB chunks of memory around can take a while!).

There are a couple of existing workarounds for this: buffer() objects, and the 
start/stop arguments to a variety of string methods. Neither of these is 
particular convenient to work with, and buffer() is slated to go away in Py3k.

I think an enriched slicing model that allows sequence views to be expressed 
easily as "this slice of this sequence" would allow this to be dealt with 
cleanly, without requiring every sequence to provide a corresponding "sequence 
view" with non-copying semantics. I think Guido's concern that people will 
reach for string views when they don't need them is also valid (as I believe 
that it is most often inexperience that leads to premature optimization that 
then leads to needless code complexity).

The specific changes I suggest based on the find/rfind discussion are:

   1. make range() (what used to be xrange()) a subclass of slice(), so that 
range objects can be used to index sequences. The only differences between 
range() and slice() would then be that start/stop/step will never be None for 
range instances, and range instances act like an immutable sequence while 
slice instances do not (i.e. range objects would grow an indices() method).

   2. change range() and slice() to accept slice() instances as arguments so 
that range(range(0)) is equivalent to range(0). (range(x) may throw ValueError 
if x.stop is None).

   3. change API's that currently accept start/stop arguments (like string 
methods) to accept a single slice() instance instead (possibly raising 
ValueError if step != 1).

   4. provide an additional string method partition_indices() that returns 3 
range() objects instead of 3 new strings

The new method would have semantics like:

   def partition_indices(self, sep, limits=None):
       if limits is None:
           limits = range(0, len(self))
       else:
           limits = limits.indices(len(self))
       try:
           idxsep = self.index(sep, limits)
       except ValueError:
           return limits, range(0), range(0)
       endsep = idxsep + len(sep)
       return (range(limits.start, idxsep),
               range(idxsep, endsep),
               range(endsep, limits.stop))

With partition() itself being equivalent to:

     def partition(self, sep, subseq=None):
         before, sep, after = self.partition_indices(sep, subseq)
         return self[before], self[sep], self[after]

Finally, an efficient partition based implementation of the example from 
Walter that started the whole discussion about views and the problem with 
excessive copying would look like:

def splitpartition_indices(s):
      rest = range(len(s))
      while 1:
          prefix, lbrace, rest = s.partition_indices("{", rest)
          first, space, rest = s.partition_indices(" ", rest)
          second, rbrace, rest = s.partition_indices("}", rest)
          if prefix:
              yield (None, s[prefix])
          if not (lbrace and space and rbrace):
              break
          yield (s[first], s[second])

(I know the above misses a micro-optimization, in that it calls partition 
again on an empty subsequence, even if space or lbrace are False. I believe 
doing the three partition calls together makes it much easier to read, and 
searching an empty string is pretty quick).

For comparison, here's the normal copying version that has problems scaling to 
large strings:

def splitpartition(s):
      rest = s
      while 1:
          prefix, lbrace, rest = rest.partition_indices("{")
          first, space, rest = rest.partition_indices(" ")
          second, rbrace, rest = rest.partition_indices("}")
          if prefix:
              yield (None, prefix)
          if not (lbrace and space and rbrace):
              break
          yield (first, second)

Should I make a Py3k PEP for this?

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org


More information about the Python-3000 mailing list