Copy-on-write list slicing as GSoC project

Hello dear PyPy developers, My name is Nikolay Zinov. I am a sophomore student at Moscow Institute of Physics and Technology. I am very interested in contributing to PyPy as a GSoC project. I found implementing copy-on-write list slicing particularly interesting for me. Below go my ideas. Note, that at some places I see different possible choices so I need feedback. 1. What we want to get is *myslice = mylist[a:b]* only cause data copying if *myslice* or *mylist* are mutated. 2. This can be implemented by creating a special list strategy. When getslice operation is performed, the original list is switched to that strategy and a new list with shared storage is created. Storage layout is a tuple of reference counter and the underlying RPython list. This storage would be shared between several W_ListObject instances. A field containing slice object representing would be added to the W_ListObject. List operations are implemented as follows: non modifying ops perform indices conversion and proxy the call to the underlying strategy; modifying ops cause new list creation with normal strategy. If a slice of a slice is taken we can calculate the resulting slice of the original list. 3. Some drawbacks of this solution. a) Additional field (slice object) added to W_ListObject. Another option would be to make this value a part of the storage. However, this value is unique for the slice while other data are shared. Therefore, it would require an additional level of indirection with the W_ListObject pointing to some header which in its turn points to shared data. b) If the original list is modified it is copied and not the (probably smaller) slice. The solution would be quite complicated with the original list storing references to all its slices. The good thing is that this scenario (create a slice -> modify the original list) is quite rare (or it would be if not for the next problem). c) Copy-on-write is inefficient in a GC'd environment. Abandoned slice can take a while to be freed and till then it will block modifying operations on the original list. I see no good solution for this problem but for keeping reference counter in the slice instance which is probably not a good idea. 4. With regard to the last problem it is interesting to consider omitting reference counter on the shared data and copy always. It would save another level of indirection but have little impact on the performance if the slices are not freed anyway and save another level of indirection. 5. Benchmark should be done to find out the cutoff length on which this strategy gives performance benefit over blind copying. Please give me your feedback on this idea and feasibility of its becoming a GSoC project. Cheers, Nikolay Zinov nzinov@gmail.com

Hi Nikolay, On 14 March 2016 at 19:15, Николай Зинов <nzinov@gmail.com> wrote:
Thanks for the early proposal; you should submit it to google's system very soon. I'm sorry it didn't receive more active feedback from the main mentors. One of the reasons is that this is likely more involved than you describe. In order to efficiently implement copy-on-write list slicing, we would need some special GC support. Otherwise, as you describe, there is the problem that as soon as there exist a slice anywhere, we cannot any more modify a big list without making a copy of the whole list. Moreover, there is also the issue that if 'mylist[1:5]' is kept alive, then the whole 'mylist' is also kept alive, even if it would not be necessary; this can consume some extra memory but more importantly it can delay calling destructors for arbitrarily long periods of time. So, serious work on this topic should start with designing a usable GC interface which fixes these problems; a bit like weakrefs, which are a general GC interface. The problem is that we don't really know what such an interface could look like. A bientôt, Armin.

Hi Armin, Thanks for your feedback. As you mention there needed some more research on this problem, so I think I should not apply for GSoC and rather do some work out of its scope. A special-case GC interface is interesting direction and I am going to take a look at weekrefs. Cheers, Nikolay. чт, 24 мар. 2016 г. в 0:39, Armin Rigo <arigo@tunes.org>:

Hi Nikolay, On 14 March 2016 at 19:15, Николай Зинов <nzinov@gmail.com> wrote:
Thanks for the early proposal; you should submit it to google's system very soon. I'm sorry it didn't receive more active feedback from the main mentors. One of the reasons is that this is likely more involved than you describe. In order to efficiently implement copy-on-write list slicing, we would need some special GC support. Otherwise, as you describe, there is the problem that as soon as there exist a slice anywhere, we cannot any more modify a big list without making a copy of the whole list. Moreover, there is also the issue that if 'mylist[1:5]' is kept alive, then the whole 'mylist' is also kept alive, even if it would not be necessary; this can consume some extra memory but more importantly it can delay calling destructors for arbitrarily long periods of time. So, serious work on this topic should start with designing a usable GC interface which fixes these problems; a bit like weakrefs, which are a general GC interface. The problem is that we don't really know what such an interface could look like. A bientôt, Armin.

Hi Armin, Thanks for your feedback. As you mention there needed some more research on this problem, so I think I should not apply for GSoC and rather do some work out of its scope. A special-case GC interface is interesting direction and I am going to take a look at weekrefs. Cheers, Nikolay. чт, 24 мар. 2016 г. в 0:39, Armin Rigo <arigo@tunes.org>:
participants (2)
-
Armin Rigo
-
Николай Зинов