[Python-ideas] a memory-efficient variant of itertools.tee

Franklin? Lee leewangzhong+python at gmail.com
Sat May 27 12:53:26 EDT 2017


On Fri, May 26, 2017 at 2:45 PM, Stephan Houben <stephanh42 at gmail.com> wrote:
> Hi all,
>
> The itertools.tee function can hold on to objects "unnecessarily".
> In particular, if you do
>
> iter2 = itertools.tee(iter1, 2)[0]
>
> i.e. you "leak" one of the returned iterators, then all returned
> objects are not collected until also iter2 is collected.
>
> I propose a different implementation, namely the one in:
>
> https://github.com/stephanh42/streamtee
>
> streamtee.tee is a drop-in alternative for itertools.tee but
> as you can see from the test in the repo, it will not hold
> on to the generated objects as long.
>
> I propose this as an improved implementation of itertools.tee.
>
> Thanks,
>
> Stephan

For convenience, the implementation itself is here (33 lines including
comments):
https://github.com/stephanh42/streamtee/blob/master/streamtee.py

I like this. It uses an infinite generator as a thunk. Though I think
it belongs on bugs.python.org rather than -ideas, because the
interface should be the same and the memory/time use is asymptotically
the same.

The current tee implementation in C (lines 378 to 852):
https://github.com/python/cpython/blob/bf623ae8843dc30b28c574bec8d29fc14be59d86/Modules/itertoolsmodule.c#L378

What bothers me is, the current implementation looks very similar to
what I imagine the C implementation of this looks like. Instead of
thunks with a single element, it has thunks with up to LINKCELLS
elements. The comment from the commit
(https://github.com/python/cpython/commit/ad983e79d6f215235d205245c2599620e33cf719):

> Formerly, underlying queue was implemented in terms of two lists.  The
> new queue is a series of singly-linked fixed length lists.
>
> The new implementation runs much faster, supports multi-way tees, and
> allows tees of tees without additional memory costs.

The delay in deletion, then, seems to be a feature for efficiency, and
can be implemented with `#define LINKCELLS 1`.

P.S.: I couldn't find a pure Python implementation of tee in the CPython repo.


More information about the Python-ideas mailing list