[Python-ideas] a memory-efficient variant of itertools.tee

Stephan Houben stephanh42 at gmail.com
Mon May 29 14:31:13 EDT 2017


Hi Franklin,

Thanks. I should have tested with a larger sequence.

I suppose delaying deletion by a bounded amount of objects is fine.

I was concerned that a potentially unbounded amount of objects was kept.
The "reference implementation" in the docs suggested that and my initial
testing seemed to confirm it.

That's what you get from reading the docs instead of the code.
There is even a justification in the code why it is 57 ;-)

Stephan

2017-05-27 18:53 GMT+02:00 Franklin? Lee <leewangzhong+python at gmail.com>:
> On Fri, May 26, 2017 at 2:45 PM, Stephan Houben <stephanh42 at gmail.com> wrote:
>> Hi all,
>>
>> The itertools.tee function can hold on to objects "unnecessarily".
>> In particular, if you do
>>
>> iter2 = itertools.tee(iter1, 2)[0]
>>
>> i.e. you "leak" one of the returned iterators, then all returned
>> objects are not collected until also iter2 is collected.
>>
>> I propose a different implementation, namely the one in:
>>
>> https://github.com/stephanh42/streamtee
>>
>> streamtee.tee is a drop-in alternative for itertools.tee but
>> as you can see from the test in the repo, it will not hold
>> on to the generated objects as long.
>>
>> I propose this as an improved implementation of itertools.tee.
>>
>> Thanks,
>>
>> Stephan
>
> For convenience, the implementation itself is here (33 lines including
> comments):
> https://github.com/stephanh42/streamtee/blob/master/streamtee.py
>
> I like this. It uses an infinite generator as a thunk. Though I think
> it belongs on bugs.python.org rather than -ideas, because the
> interface should be the same and the memory/time use is asymptotically
> the same.
>
> The current tee implementation in C (lines 378 to 852):
> https://github.com/python/cpython/blob/bf623ae8843dc30b28c574bec8d29fc14be59d86/Modules/itertoolsmodule.c#L378
>
> What bothers me is, the current implementation looks very similar to
> what I imagine the C implementation of this looks like. Instead of
> thunks with a single element, it has thunks with up to LINKCELLS
> elements. The comment from the commit
> (https://github.com/python/cpython/commit/ad983e79d6f215235d205245c2599620e33cf719):
>
>> Formerly, underlying queue was implemented in terms of two lists.  The
>> new queue is a series of singly-linked fixed length lists.
>>
>> The new implementation runs much faster, supports multi-way tees, and
>> allows tees of tees without additional memory costs.
>
> The delay in deletion, then, seems to be a feature for efficiency, and
> can be implemented with `#define LINKCELLS 1`.
>
> P.S.: I couldn't find a pure Python implementation of tee in the CPython repo.


More information about the Python-ideas mailing list