a memory-efficient variant of itertools.tee
Hi all, The itertools.tee function can hold on to objects "unnecessarily". In particular, if you do iter2 = itertools.tee(iter1, 2)[0] i.e. you "leak" one of the returned iterators, then all returned objects are not collected until also iter2 is collected. I propose a different implementation, namely the one in: https://github.com/stephanh42/streamtee streamtee.tee is a drop-in alternative for itertools.tee but as you can see from the test in the repo, it will not hold on to the generated objects as long. I propose this as an improved implementation of itertools.tee. Thanks, Stephan
On Fri, May 26, 2017 at 2:45 PM, Stephan Houben <stephanh42@gmail.com> wrote:
Hi all,
The itertools.tee function can hold on to objects "unnecessarily". In particular, if you do
iter2 = itertools.tee(iter1, 2)[0]
i.e. you "leak" one of the returned iterators, then all returned objects are not collected until also iter2 is collected.
I propose a different implementation, namely the one in:
https://github.com/stephanh42/streamtee
streamtee.tee is a drop-in alternative for itertools.tee but as you can see from the test in the repo, it will not hold on to the generated objects as long.
I propose this as an improved implementation of itertools.tee.
Thanks,
Stephan
For convenience, the implementation itself is here (33 lines including comments): https://github.com/stephanh42/streamtee/blob/master/streamtee.py I like this. It uses an infinite generator as a thunk. Though I think it belongs on bugs.python.org rather than -ideas, because the interface should be the same and the memory/time use is asymptotically the same. The current tee implementation in C (lines 378 to 852): https://github.com/python/cpython/blob/bf623ae8843dc30b28c574bec8d29fc14be59... What bothers me is, the current implementation looks very similar to what I imagine the C implementation of this looks like. Instead of thunks with a single element, it has thunks with up to LINKCELLS elements. The comment from the commit (https://github.com/python/cpython/commit/ad983e79d6f215235d205245c2599620e33...):
Formerly, underlying queue was implemented in terms of two lists. The new queue is a series of singly-linked fixed length lists.
The new implementation runs much faster, supports multi-way tees, and allows tees of tees without additional memory costs.
The delay in deletion, then, seems to be a feature for efficiency, and can be implemented with `#define LINKCELLS 1`. P.S.: I couldn't find a pure Python implementation of tee in the CPython repo.
Hi Franklin, Thanks. I should have tested with a larger sequence. I suppose delaying deletion by a bounded amount of objects is fine. I was concerned that a potentially unbounded amount of objects was kept. The "reference implementation" in the docs suggested that and my initial testing seemed to confirm it. That's what you get from reading the docs instead of the code. There is even a justification in the code why it is 57 ;-) Stephan 2017-05-27 18:53 GMT+02:00 Franklin? Lee <leewangzhong+python@gmail.com>:
On Fri, May 26, 2017 at 2:45 PM, Stephan Houben <stephanh42@gmail.com> wrote:
Hi all,
The itertools.tee function can hold on to objects "unnecessarily". In particular, if you do
iter2 = itertools.tee(iter1, 2)[0]
i.e. you "leak" one of the returned iterators, then all returned objects are not collected until also iter2 is collected.
I propose a different implementation, namely the one in:
https://github.com/stephanh42/streamtee
streamtee.tee is a drop-in alternative for itertools.tee but as you can see from the test in the repo, it will not hold on to the generated objects as long.
I propose this as an improved implementation of itertools.tee.
Thanks,
Stephan
For convenience, the implementation itself is here (33 lines including comments): https://github.com/stephanh42/streamtee/blob/master/streamtee.py
I like this. It uses an infinite generator as a thunk. Though I think it belongs on bugs.python.org rather than -ideas, because the interface should be the same and the memory/time use is asymptotically the same.
The current tee implementation in C (lines 378 to 852): https://github.com/python/cpython/blob/bf623ae8843dc30b28c574bec8d29fc14be59...
What bothers me is, the current implementation looks very similar to what I imagine the C implementation of this looks like. Instead of thunks with a single element, it has thunks with up to LINKCELLS elements. The comment from the commit (https://github.com/python/cpython/commit/ad983e79d6f215235d205245c2599620e33...):
Formerly, underlying queue was implemented in terms of two lists. The new queue is a series of singly-linked fixed length lists.
The new implementation runs much faster, supports multi-way tees, and allows tees of tees without additional memory costs.
The delay in deletion, then, seems to be a feature for efficiency, and can be implemented with `#define LINKCELLS 1`.
P.S.: I couldn't find a pure Python implementation of tee in the CPython repo.
participants (2)
-
Franklin? Lee
-
Stephan Houben