Following Brandon Rhodes very nice PyCon 2014 talk on datatypes, I was struck by increased guilt over a program I am developing. I spoke with him in the hallway about a good approach to improving performance while keeping code nice looking. The basic idea in my program is that I have a large(ish) list of tokens that I generate in parsing a special language, and I frequently take slices from this list--and often enough, slices out of those slices. In some cases, the most straightforward approach in code is a slice-to-end, e.g.: start = find_thing(tokens) construct_in = tokens[start:] Obviously, this winds up doing a lot of copying (of object references, not of actual data, but still). Often the above step is followed by something like: end = find_end(construct_in) construct = construct_in[:end] This second step isn't bad in my case since the actual construct will be dozens of tokens, not thousands or millions, and once I find it I want to keep it around and process it further. I realize, of course, that I could program my 'find_end()' differently so that it took a signature more like 'find_end(tokens, start=start)'. But with recursion and some other things I do, this becomes inelegant. What I'd really like is a "ListView" that acts something like NumPy's non-copying slices. However numpy, of course, only deals with arrays and matrices of uniform numeric types. I want a non-copying "slice" of a list of generic Python objects. Moreover, I believe that such a type is useful enough to be worth including in the collections module generally. As an initial implementation, I created the below. In the module self-tests, the performance increase is about 100x, but the particular ad-hoc benchmark I wrote isn't necessarily well-representative of all use-cases. % python3 ListView.py A bunch of slices from list: 1.29 seconds A bunch of slices from DummyListView: 1.19 seconds A bunch of slices from ListView: 0.01 seconds ------------------------------------------------------------------- ### ListView.py ### import sys from time import time from random import randint, random from collections import Sequence class DummyListView(Sequence): def __init__(self, l): self.list = l def __len__(self): return len(self.list) def __getitem__(self, i): return self.list[i] class ListView(Sequence): def __init__(self, seq, start=0, stop=None): if hasattr(seq, '__getitem__'): self.list = seq else: self.list = list(seq) self.start = start self.stop = len(self.list) if stop is None else stop def __len__(self): return self.stop - self.start def __getitem__(self, i): if isinstance(i, slice): start = self.start if i.start is None else self.start+i.start if i.stop is None: stop = self.stop else: stop = self.start + i.stop return ListView(self.list, start, stop) else: val = self.list[i+self.start] if i < 0: return val elif not self.start <= i+self.start < self.stop: raise IndexError("View of sequence [%d:%d], index %d" % ( self.start, self.stop, i)) return val def __str__(self): return "ListView of %d item list, slice [%d:%d]" % ( len(self.list), self.start, self.stop) def __repr__(self): return "ListView(%s)" % self.list[self.start:self.stop] def to_list(self): return list(self.list[self.start:self.stop]) class Thing(object): def __init__(self, x): self.x = x def __repr__(self): return "Thing(%f)" % self.x if __name__ == '__main__': NUM = 100000 things = [Thing(random()) for _ in range(NUM)] slices = [sorted((randint(0, NUM-1), randint(0, NUM-1))) for _ in range(100)] offset = randint(0, 100) for name, cls in (("list", list), ("DummyListView", DummyListView), ("ListView",ListView)): begin = time() s = "A bunch of slices from %s: " % name print(s.rjust(38), end='') sys.stdout.flush() l = cls(things) for i in range(8): for start, stop in slices: sl = l[start:stop] size = stop-start for i in range(3): subslice1 = sl[:offset] subslice2 = sl[offset:] print("%0.2f seconds" % (time()-begin)) -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Thu, Apr 17, 2014 at 4:49 PM, David Mertz <mertz@gnosis.cx> wrote:
What I'd really like is a "ListView" that acts something like NumPy's non-copying slices. However numpy, of course, only deals with arrays and matrices of uniform numeric types. I want a non-copying "slice" of a list of generic Python objects.
FWIW, numpy arrays do indeed have to be of uniform type, but one of the supported uniform types is "arbitrary Python object". In [1]: a = np.array([1, "hello", None]) In [2]: a Out[2]: array([1, 'hello', None], dtype=object) In [3]: a[1:] Out[3]: array(['hello', None], dtype=object) There are other trade-offs though, e.g., no .append() or .find() methods. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
On 4/17/2014 11:49 AM, David Mertz wrote:
What I'd really like is a "ListView" that acts something like NumPy's non-copying slices.
Consider a more generic SeqView, especially if you do not intend to mutate through the view. I also suggest that you use a range object instead of .start, .stop, (and .step). Range gives you a lot for free, including slicing. Untested: class SeqView: def __init__(self, seq, rng) self.seq = seq self.rng = rng def __getitem__(self, index): if isinstance(index, slice): return Seqview(self.seq, self.rng[index]) else: return self.seq[self.rng[index]] # see below def __iter__(self): seq, rng = self.seq, self.rng for i in rng: yield seq[i] ...
r = range(3, 50, 3) r[4] 15
r[500] Traceback (most recent call last): File "<pyshell#2>", line 1, in <module> r[500] IndexError: range object index out of range # You would want to catch this and change 'range' to SeqView
r[1:3] range(6, 12, 3) r[slice(1,3)] range(6, 12, 3) r[slice(1,3,2)] range(6, 12, 6)
-- Terry Jan Reedy
On Thu, Apr 17, 2014 at 1:12 PM, Terry Reedy <tjreedy@udel.edu> wrote:
Range gives you a lot for free, including slicing.
I have wondered why Python has both slice and range, given the power that range has. Is there anything that slice can do that range can't, or are there some functional differences?
On Thu, Apr 17, 2014 at 9:28 AM, Nathaniel Smith <njs@pobox.com> wrote:
FWIW, numpy arrays do indeed have to be of uniform type, but one of the supported uniform types is "arbitrary Python object".
I did not know that! However, now that I do: % python3 ListView.py A bunch of slices from list: 3.76 seconds A bunch of slices from DummyListView: 3.33 seconds A bunch of slices from ListView: 0.03 seconds A bunch of slices from numpy.array: 0.30 seconds The only things I changed are to add 'import numpy' at the top, and add 'np.array' to the loop of containers to try out (I also bumped up the iterations of the two loops from 8/3 times to 10/5 times just to make things take a little longer and get slightly more accurate timing I hope). On Thu, Apr 17, 2014 at 9:28 AM, Nathaniel Smith <njs@pobox.com> wrote:
What I'd really like is a "ListView" that acts something like NumPy's non-copying slices. However numpy, of course, only deals with arrays and matrices of uniform numeric types. I want a non-copying "slice" of a
On Thu, Apr 17, 2014 at 4:49 PM, David Mertz <mertz@gnosis.cx> wrote: list
of generic Python objects.
FWIW, numpy arrays do indeed have to be of uniform type, but one of the supported uniform types is "arbitrary Python object".
In [1]: a = np.array([1, "hello", None])
In [2]: a Out[2]: array([1, 'hello', None], dtype=object)
In [3]: a[1:] Out[3]: array(['hello', None], dtype=object)
There are other trade-offs though, e.g., no .append() or .find() methods.
-n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
Using range objects might be worthwhile. However, your example code, Terry, won't behave well if you initialize with a generic iterable. You need the check for 'hasattr(seq, "__getitem__")' or something equivalent in the initializer. But if you have that, you need to make a decision of *what* concrete sequence to instantiate the iterable as, with list() being the obvious choice. E.g. I can do: lv = ListView(Thing(random()) for _ in range(100000)) But you can't do: sv = SeqView(Thing(random()) for _ in range(100000)) I think, however, there's no reason at all why I should have my .to_list() method. That's accomplished more naturally within a special method just as: concrete = list(lv[5:10]) On Thu, Apr 17, 2014 at 11:12 AM, Terry Reedy <tjreedy@udel.edu> wrote:
On 4/17/2014 11:49 AM, David Mertz wrote:
What I'd really like is a "ListView" that acts something like NumPy's
non-copying slices.
Consider a more generic SeqView, especially if you do not intend to mutate through the view. I also suggest that you use a range object instead of .start, .stop, (and .step). Range gives you a lot for free, including slicing. Untested:
class SeqView: def __init__(self, seq, rng) self.seq = seq self.rng = rng
def __getitem__(self, index): if isinstance(index, slice): return Seqview(self.seq, self.rng[index]) else: return self.seq[self.rng[index]] # see below def __iter__(self): seq, rng = self.seq, self.rng for i in rng: yield seq[i] ...
r = range(3, 50, 3) r[4] 15
r[500] Traceback (most recent call last): File "<pyshell#2>", line 1, in <module> r[500] IndexError: range object index out of range # You would want to catch this and change 'range' to SeqView
r[1:3] range(6, 12, 3) r[slice(1,3)] range(6, 12, 3) r[slice(1,3,2)] range(6, 12, 6)
-- Terry Jan Reedy
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 4/17/2014 5:22 PM, David Mertz wrote:
Using range objects might be worthwhile. However, your example code, Terry, won't behave well if you initialize with a generic iterable.
That is why I said 'sequence', not 'iterable'. If someone ignores the signature of a class/function, that is his problem. Anyway, my 'example code' was obviously an illustrative but incomplete outline, with missing details like docstrings, type checks, and exception catching left for you to fill in.
You need the check for 'hasattr(seq, "__getitem__")' or something equivalent in the initializer. But if you have that, you need to make a decision of *what* concrete sequence to instantiate the iterable as, with list() being the obvious choice.
Your presented use case is that you already have a long (concrete) sequence and want to take views instead of slices to save memory.
E.g. I can do:
lv = ListView(Thing(random()) for _ in range(100000))
But you can't do:
sv = SeqView(Thing(random()) for _ in range(100000))
This does not make much sense and is outside your use case. -- Terry Jan Reedy
Sure, my example of initializing with an iterable isn't quite the case I need myself, but I was just thinking of what a general-purpose collection should do. And actually, if I had started out using this collection, I might have structured it as initializing with an iterable rather than collecting my objects with list.append(). On Thu, Apr 17, 2014 at 5:21 PM, Terry Reedy <tjreedy@udel.edu> wrote:
On 4/17/2014 5:22 PM, David Mertz wrote:
Using range objects might be worthwhile. However, your example code, Terry, won't behave well if you initialize with a generic iterable.
That is why I said 'sequence', not 'iterable'. If someone ignores the signature of a class/function, that is his problem.
Anyway, my 'example code' was obviously an illustrative but incomplete outline, with missing details like docstrings, type checks, and exception catching left for you to fill in.
You need the check for 'hasattr(seq, "__getitem__")' or something
equivalent in the initializer. But if you have that, you need to make a decision of *what* concrete sequence to instantiate the iterable as, with list() being the obvious choice.
Your presented use case is that you already have a long (concrete) sequence and want to take views instead of slices to save memory.
E.g. I can do:
lv = ListView(Thing(random()) for _ in range(100000))
But you can't do:
sv = SeqView(Thing(random()) for _ in range(100000))
This does not make much sense and is outside your use case.
-- Terry Jan Reedy
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Thu, Apr 17, 2014 at 01:28:19PM -0500, Ryan Hiebert wrote:
On Thu, Apr 17, 2014 at 1:12 PM, Terry Reedy <tjreedy@udel.edu> wrote:
Range gives you a lot for free, including slicing.
I have wondered why Python has both slice and range, given the power that range has. Is there anything that slice can do that range can't, or are there some functional differences?
Because one is screwdriver and the other is a spanner? (Just don't ask me which one is which :-) But seriously, while they have similar signatures, they are semantically very different. range objects are concrete interators that produce numeric values, and all three arguments must be integers: py> range(2, 3, None) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'NoneType' object cannot be interpreted as an integer while slices are an extremely generic, type-agnostic mechanism for pointing to multiple indexes of some sequence: py> slice("spam", 2.5, 17) slice('spam', 2.5, 17) Slices have no meaning of their own, it's up to the sliced object to interpret them. The above slice will raise TypeError when used on a list, but may be accepted happily by some other custom sequence type. As far as built-in objects go, slices with negative indexes, or with end=None, can't be interpreted as indexes until you know which sequence they are applied to: range(1, 6, 2) always yields 1, 3, 5. slice(1, 6, 2) always refers to indexes 1, 3, 5. but range(1, -6, 2) is always empty. slice(1, -6, 2) may be empty, or it could refer to index 1 alone, or to indexes 1, 3, or indexes 1, 3, 5, or 1, 3, 5, 7, or ... In general, you cannot tell what indexes the slice will cover until you know what sequence it has been applied to. Built-in lists and tuples allow slices to extend past the ends of the sequence. That concept is meaningless with range. So despite the fact that slice() and range() have similar signatures, they are used differently, and have very different semantics. -- Steven
participants (5)
-
David Mertz
-
Nathaniel Smith
-
Ryan Hiebert
-
Steven D'Aprano
-
Terry Reedy