[Python-3000] Making more effective use of slice objects in Py3k
Nick Coghlan
ncoghlan at iinet.net.au
Sat Aug 26 11:12:27 CEST 2006
This idea is inspired by the find/rfind string discussion (particularly a
couple of comments from Jim and Ron), but I think the applicability may prove
to be wider than just string methods (e.g. I suspect it may prove useful for
the bytes() type as well).
Copy-on-slice semantics are by far the easiest semantics to deal with in most
cases, as they result in the fewest nasty surprises. However, they have one
obvious drawback: performance can suffer badly when dealing with large
datasets (copying 10 MB chunks of memory around can take a while!).
There are a couple of existing workarounds for this: buffer() objects, and the
start/stop arguments to a variety of string methods. Neither of these is
particular convenient to work with, and buffer() is slated to go away in Py3k.
I think an enriched slicing model that allows sequence views to be expressed
easily as "this slice of this sequence" would allow this to be dealt with
cleanly, without requiring every sequence to provide a corresponding "sequence
view" with non-copying semantics. I think Guido's concern that people will
reach for string views when they don't need them is also valid (as I believe
that it is most often inexperience that leads to premature optimization that
then leads to needless code complexity).
The specific changes I suggest based on the find/rfind discussion are:
1. make range() (what used to be xrange()) a subclass of slice(), so that
range objects can be used to index sequences. The only differences between
range() and slice() would then be that start/stop/step will never be None for
range instances, and range instances act like an immutable sequence while
slice instances do not (i.e. range objects would grow an indices() method).
2. change range() and slice() to accept slice() instances as arguments so
that range(range(0)) is equivalent to range(0). (range(x) may throw ValueError
if x.stop is None).
3. change API's that currently accept start/stop arguments (like string
methods) to accept a single slice() instance instead (possibly raising
ValueError if step != 1).
4. provide an additional string method partition_indices() that returns 3
range() objects instead of 3 new strings
The new method would have semantics like:
def partition_indices(self, sep, limits=None):
if limits is None:
limits = range(0, len(self))
else:
limits = limits.indices(len(self))
try:
idxsep = self.index(sep, limits)
except ValueError:
return limits, range(0), range(0)
endsep = idxsep + len(sep)
return (range(limits.start, idxsep),
range(idxsep, endsep),
range(endsep, limits.stop))
With partition() itself being equivalent to:
def partition(self, sep, subseq=None):
before, sep, after = self.partition_indices(sep, subseq)
return self[before], self[sep], self[after]
Finally, an efficient partition based implementation of the example from
Walter that started the whole discussion about views and the problem with
excessive copying would look like:
def splitpartition_indices(s):
rest = range(len(s))
while 1:
prefix, lbrace, rest = s.partition_indices("{", rest)
first, space, rest = s.partition_indices(" ", rest)
second, rbrace, rest = s.partition_indices("}", rest)
if prefix:
yield (None, s[prefix])
if not (lbrace and space and rbrace):
break
yield (s[first], s[second])
(I know the above misses a micro-optimization, in that it calls partition
again on an empty subsequence, even if space or lbrace are False. I believe
doing the three partition calls together makes it much easier to read, and
searching an empty string is pretty quick).
For comparison, here's the normal copying version that has problems scaling to
large strings:
def splitpartition(s):
rest = s
while 1:
prefix, lbrace, rest = rest.partition_indices("{")
first, space, rest = rest.partition_indices(" ")
second, rbrace, rest = rest.partition_indices("}")
if prefix:
yield (None, prefix)
if not (lbrace and space and rbrace):
break
yield (first, second)
Should I make a Py3k PEP for this?
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
More information about the Python-3000
mailing list