On Fri, May 26, 2017 at 2:45 PM, Stephan Houben <stephanh42@gmail.com> wrote:
Hi all,
The itertools.tee function can hold on to objects "unnecessarily". In particular, if you do
iter2 = itertools.tee(iter1, 2)[0]
i.e. you "leak" one of the returned iterators, then all returned objects are not collected until also iter2 is collected.
I propose a different implementation, namely the one in:
https://github.com/stephanh42/streamtee
streamtee.tee is a drop-in alternative for itertools.tee but as you can see from the test in the repo, it will not hold on to the generated objects as long.
I propose this as an improved implementation of itertools.tee.
Thanks,
Stephan
For convenience, the implementation itself is here (33 lines including comments): https://github.com/stephanh42/streamtee/blob/master/streamtee.py I like this. It uses an infinite generator as a thunk. Though I think it belongs on bugs.python.org rather than -ideas, because the interface should be the same and the memory/time use is asymptotically the same. The current tee implementation in C (lines 378 to 852): https://github.com/python/cpython/blob/bf623ae8843dc30b28c574bec8d29fc14be59... What bothers me is, the current implementation looks very similar to what I imagine the C implementation of this looks like. Instead of thunks with a single element, it has thunks with up to LINKCELLS elements. The comment from the commit (https://github.com/python/cpython/commit/ad983e79d6f215235d205245c2599620e33...):
Formerly, underlying queue was implemented in terms of two lists. The new queue is a series of singly-linked fixed length lists.
The new implementation runs much faster, supports multi-way tees, and allows tees of tees without additional memory costs.
The delay in deletion, then, seems to be a feature for efficiency, and can be implemented with `#define LINKCELLS 1`. P.S.: I couldn't find a pure Python implementation of tee in the CPython repo.