More efficient list copying
This half-baked idea is inspired by this thread here:

https://mail.python.org/archives/list/python-ideas@python.org/message/5LGWV3...

which in turn was inspired by this Stackoverflow post:

https://stackoverflow.com/questions/56966429/getting-pairs-of-one-item-and-t...

Problem: today, more and more people are using Python to work with Big Data, or at least Biggish Data. For some folks, it is not unusual to have lists with a gigabyte or more of data. So imagine you have a list with a hundred million items, and you need a copy of the entire list. Easy:

    alist = [...]  # 100 million items
    blist = alist[:]

And that ought to be pretty efficient too: making the slice is basically just copying a bunch of pointers, plus a bit of overhead. A good implementation should have lightning fast list slicing, close to the speed of memcpy in C. Give or take.

But now suppose you want a copy of all the items except the item in position (let's say) 1000. Here's a bad way:

    blist = alist[:]
    del blist[1000]

That's inefficient because the list has to shift almost a hundred million items over, and I think they have to be moved one at a time. Which is slowish, even in C.

Here's another way:

    blist = alist[:1000] + alist[1001:]

That has to make a slice (copy 1000 items), a second slice (copy another many million items), then make a third copy to concatenate them, then garbage collect the two temporary slices. And while you might have enough memory for double the original list (alist and blist) you might not have enough for triple (alist, the two slices, blist).

There are lots of other variants we could come up with, but ultimately we're doing lots of extra work that has to be thrown away, and that costs time or memory or both.

How about if lists had an in-place method for doing a fast copy of pointers from one list to another? We could do this:

    blist = [None]*(len(alist)-1)  # Preallocate.
    blist.copyfrom(alist, index=0, start=0, end=1000)
    blist.copyfrom(alist, index=1000, start=1001)

No temporary slices are needed.

Here's a pure-Python (and therefore slow) implementation:

    def copyfrom(dest, source, index=0, start=0, end=None):
        if end is None:
            end = len(source)
        if len(dest) - index < end - start:
            raise ValueError('destination too small')
        for i in range(start, end):
            dest[index+i-start] = source[i]

In pure Python, copying each item over one by one will save memory but cost a lot of time. Doing it as a method in C should save both memory and time. The implementation could optimize the cases where the source is a list, and fall back on iteration for other sequence types.

Now *personally* I don't work with a lot of Big Data, so this doesn't really scratch any of my personal itches, but I thought I'd throw it out there for others to tell me why it's a dumb idea :-)

-- Steve
On Sat, Oct 2, 2021 at 7:42 AM Steven D'Aprano <steve@pearwood.info> wrote:
This half-baked idea is inspired by this thread here:
https://mail.python.org/archives/list/python-ideas@python.org/message/5LGWV3...
which in turn was inspired by this Stackoverflow post:
https://stackoverflow.com/questions/56966429/getting-pairs-of-one-item-and-t...
Problem: today, more and more people are using Python to work with Big Data, or at least Biggish Data. For some folks, it is not unusual to have lists with a gigabyte or more of data. So imagine you have a list with a hundred million items, and you need a copy of the entire list. Easy:
    alist = [...]  # 100 million items
    blist = alist[:]
And that ought to be pretty efficient too: making the slice is basically just copying a bunch of pointers, plus a bit of overhead. A good implementation should have lightning fast list slicing, close to the speed of memcpy in C. Give or take.
No, it would also have to increment the reference count of each item (since blist owns a reference to each). That's what makes this slow.
But now suppose you want a copy of all the items except the item in position (let's say) 1000. Here's a bad way:
    blist = alist[:]
    del blist[1000]
That's inefficient because the list has to shift almost a hundred million items over, and I think they have to be moved one at a time. Which is slowish, even in C.
But this doesn't need to update reference counts (except of the one deleted item) and the move is done using C memmove(), which perhaps isn't as fast as memcpy() but still unbeatably fast.
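If you want to see how much the del actually adds on a given machine, a rough timeit sketch along these lines will do (absolute numbers will of course vary with the machine and the list size):

    import timeit

    setup = "alist = list(range(10_000_000))"

    copy_only = timeit.timeit("blist = alist[:]", setup=setup, number=10)
    copy_plus_del = timeit.timeit("blist = alist[:]; del blist[1000]",
                                  setup=setup, number=10)

    # Compare the extra cost of shifting ~10M pointers with memmove()
    # against the cost of the refcount-adjusting copy itself.
    print(f"plain copy : {copy_only:.3f}s")
    print(f"copy + del : {copy_plus_del:.3f}s")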
Here's another way:
blist = alist[:1000] + alist[1001:]
That has to make a slice (copy 1000 items), a second slice (copy another many million items), then make a third copy to concatenate them, then garbage collect the two temporary slices.
Yeah, probably the worst way.

And while you might have enough memory for double the original list (alist and blist) you might not have enough for triple (alist, the two slices, blist).
There are lots of other variants we could come up with, but ultimately we're doing lots of extra work that has to be thrown away, and that costs time or memory or both.
That's always the case when you use Python though. You use it because it's convenient, not because it's particularly efficient.
How about if lists had an in-place method for doing a fast copy of pointers from one list to another? We could do this:
    blist = [None]*(len(alist)-1)  # Preallocate.
    blist.copyfrom(alist, index=0, start=0, end=1000)
    blist.copyfrom(alist, index=1000, start=1001)
No temporary slices are needed.
Here's a pure-Python (and therefore slow) implementation:
    def copyfrom(dest, source, index=0, start=0, end=None):
        if end is None:
            end = len(source)
        if len(dest) - index < end - start:
            raise ValueError('destination too small')
        for i in range(start, end):
            dest[index+i-start] = source[i]
In pure Python, copying each item over one by one will save memory but cost a lot of time. Doing it as a method in C should save both memory and time. The implementation could optimize the cases where the source is a list, and fall back on iteration for other sequence types.
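The dispatch might look something like this sketch, still in pure Python (the list fast path here uses slice assignment, which still builds a temporary slice of the source; a real C method could copy the pointers directly and skip that):

    def copyfrom(dest, source, index=0, start=0, end=None):
        # Copy source[start:end] into dest[index:index+n], in place.
        if end is None:
            end = len(source)
        n = end - start
        if len(dest) - index < n:
            raise ValueError('destination too small')
        if isinstance(source, list):
            # Fast path: equal-length slice assignment copies the pointers
            # in C (but still materialises source[start:end] temporarily).
            dest[index:index+n] = source[start:end]
        else:
            # General path: plain indexing works for any sequence type.
            for i in range(start, end):
                dest[index + i - start] = source[i]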
Now *personally* I don't work with a lot of Big Data, so this doesn't really scratch any of my personal itches, but I thought I'd throw it out there for others to tell me why it's a dumb idea :-)
At the very least it might lead to a recommendation based on which operation is implemented most efficiently. Though you should just measure it for various N.

Are you actually observing that people are doing this with regular lists? Don't people working with Big Data usually use Pandas, which is built on NumPy arrays and custom data structures?

--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?)
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On Sat, Oct 2, 2021, 10:58 AM Guido van Rossum <guido@python.org> wrote:
Are you actually observing that people are doing this with regular lists? Don't people working with Big Data usually use Pandas, which is built on NumPy arrays and custom data structures?
Basically, Guido is right. Big data lives in NumPy, or Dask, or semi-big in Pandas, etc.

A partial exception to this is data that flows as JSON, and can therefore have hierarchical structure. For example, a list of 100M dictionaries. That doesn't really fit the NumPy or Pandas tabular model. But when I work with that, I don't think I'd ever want a new collection of "all but a few" copied to a new data structure. Generally I'd expect to process it sequentially, with maybe some condition to `continue` on certain items. This might take the form of an exclusion-index set (process everything but items {7, 1000, 9999}).

I suppose that if Python lists were linked lists, I might instead delete items from the middle. But they're not, and the savings Steven suggests isn't order-of-magnitude or big-O, so I still wouldn't want to remove a few items from big lists even if it were 2x cheaper (in time, memory, etc).
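That pattern is simple enough in practice, something like the following sketch (`records` and `process` are just stand-ins for the real data and the real per-item work):

    records = [{"id": n} for n in range(10_000)]   # stand-in for ~100M dicts
    skip = {7, 1000, 9999}                         # exclusion-index set

    def process(record):
        # placeholder for the real per-item work
        pass

    for i, record in enumerate(records):
        if i in skip:
            continue
        process(record)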
On Sat, Oct 02, 2021 at 07:57:48AM -0700, Guido van Rossum wrote:
No, it would also have to increment the reference count of each item (since blist owns a reference to each). That's what makes this slow.
Ahaha, of course, I forgot about the ref counting.
There are lots of other variants we could come up with, but ultimately we're doing lots of extra work that has to be thrown away, and that costs time or memory or both.
That's always the case when you use Python though. You use it because it's convenient, not because it's particularly efficient.
Sure, but we do try to be no more inefficient than we need to be :-)
Are you actually observing that people are doing this with regular lists?
See the Stackoverflow post I linked to at the start of my post. https://stackoverflow.com/questions/56966429/getting-pairs-of-one-item-and-t... I don't know if a gigabyte of data in a list is Big Data or not, but it's still pretty big. -- Steve
See the Stackoverflow post I linked to at the start of my post.
https://stackoverflow.com/questions/56966429/getting-pairs-of-one-item-and-t...
I’m confused— that seems to be a SO post related to another ongoing thread… <https://stackoverflow.com/questions/56966429/getting-pairs-of-one-item-and-the-rest-over-a-python-list>
I don't know if a gigabyte of data in a list is Big Data or not,
Well, Wes McKinney of Pandas fame once defined "medium data" as too big for memory, but not too big for a typical workstation hard disk (when we still used hard disks). So no, 1GB is not big data. But I'm not sure what to call "big enough to stress typical workstation memory".

But anyway: if you really have that much data, you probably want numpy arrays or something (even array.arrays are far more memory efficient for basic data types).

But sure, if we can eliminate inefficiencies in Python standard data types, then why not?

I've lost track of the exact use case here, but maybe the proposal a while back for sequence views would help? We often end up making a full copy of a slice simply to iterate over it, or … if there was a way to work with a subset of a sequence without making an actual copy, that could help a lot in these cases.

-CHB

--
Christopher Barker, PhD (Chris)

Python Language Consulting
 - Teaching
 - Scientific Software Development
 - Desktop GUI and Web Development
 - wxPython, numpy, scipy, Cython
On Sat, Oct 2, 2021, 11:20 PM Christopher Barker <pythonchb@gmail.com> wrote: [Snip...]
But sure, if we can eliminate inefficiencies in Python standard data types, then why not?
I agree. If we can eliminate inefficiencies in core Python features, that would be great. I don't work with this kind of thing, so I won't make any further judgement.
On Sat, Oct 2, 2021, 10:20 PM Christopher Barker <pythonchb@gmail.com> wrote:
But sure, if we can eliminate inefficiencies in Python standard data types, then why not?
Because the C implementation becomes hard to maintain. All of our linear containers could benefit from non-linear implementations in some scenarios. But these usually have downsides in others (indexing is no longer trivially O(1)) and greatly add to implementation complexity.

For linear containers, see for example Abseil's absl::cord structure, which could be applied to any of those:
https://github.com/abseil/abseil-cpp/blob/master/absl/strings/cord.h

It'd be neat to have something like that. But having such a thing _replace_ the simple list, str, bytes, or bytearray implementations itself presents practical challenges.
I’ve lost track of the exact use case here, but maybe the proposal a while back for sequence views would help?
A memoryview-like for sequences? Yeah, that is what I'd start with as a concept. It'll wind up becoming a sequence-cord though. :P

-gps
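For a rough sense of the concept, a minimal pure-Python sketch of such a view might look like this (the name SeqView is made up, nothing like it exists in the stdlib, and a real design would have to decide what slicing a view and mutating the underlying sequence should do):

    from itertools import chain

    class SeqView:
        """A read-only view of seq[start:stop] that copies nothing."""

        def __init__(self, seq, start=0, stop=None):
            self._seq = seq
            self._start = start
            self._stop = len(seq) if stop is None else stop

        def __len__(self):
            return max(0, self._stop - self._start)

        def __getitem__(self, i):
            if i < 0:
                i += len(self)
            if not 0 <= i < len(self):
                raise IndexError('view index out of range')
            return self._seq[self._start + i]

        def __iter__(self):
            for i in range(self._start, self._stop):
                yield self._seq[i]

    # e.g. iterate over "everything except item 1000" without copying:
    alist = list(range(10_000))
    for x in chain(SeqView(alist, 0, 1000), SeqView(alist, 1001)):
        pass  # work with x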